Title Kernel Descriptors for Visual Recognition
Conference NIPS 2010
Author Liefeng Bo and Xiaofeng Ren and Dieter Fox

This is an old paper from 2010. I have been following Liefeng Bo's work since I started my PhD. Recently a friend asked me some questions about kernel descriptors, and since I have also been thinking about doing some improvement work based on kernel descriptors, I re-read the paper and the code (mainly the paper) and summarize it here.

What is a kernel?


Simply put, a kernel is a fancier way of talking about inner products: the plain inner product itself is a linear kernel, and other common kernels include the polynomial kernel and the Gaussian kernel, as shown below:

  • polynomial kernel: \[ \mathbf{k}(x, z) = (\langle x, z \rangle + 1)^{\gamma}, \quad \gamma \in \mathbb{Z}^{+} \]
  • Gaussian kernel: \[ \mathbf{k}(x, z) = \exp\left(-\dfrac{\|x-z\|^{2}}{2\sigma^{2}}\right), \quad \sigma \in \mathbb{R} \setminus \{0\} \]
  • Laplacian kernel: \[ \mathbf{k}(x, z) = \exp\left(-\dfrac{\|x-z\|}{\sigma}\right), \quad \sigma \in \mathbb{R} \setminus \{0\} \]
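These formulas translate directly into code; below is a quick Lua/Torch sketch of the three kernels above (my own illustration, not code from the paper), where x and z are 1-D torch Tensors:

require 'torch'

-- polynomial kernel: (<x, z> + 1)^gamma
local function polynomial_kernel(x, z, gamma)
   return (x:dot(z) + 1)^gamma
end

-- Gaussian kernel: exp(-||x - z||^2 / (2 * sigma^2))
local function gaussian_kernel(x, z, sigma)
   return math.exp(-torch.norm(x - z)^2 / (2 * sigma^2))
end

-- Laplacian kernel: exp(-||x - z|| / sigma)
local function laplacian_kernel(x, z, sigma)
   return math.exp(-torch.norm(x - z) / sigma)
end

local x, z = torch.Tensor({1, 2}), torch.Tensor({3, -1})
print(polynomial_kernel(x, z, 2), gaussian_kernel(x, z, 1), laplacian_kernel(x, z, 1))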

Let us go back to the very beginning of the simplest classification problem: we have two training samples with labels \( (+1, -1) \), and given a test sample \( x^{t} \), how do we decide its label? The most straightforward idea is to compare the similarity between the test sample \( x^{t} \) and the two training samples \( x^{+}, x^{-} \), and assign the label of the more similar one. This immediately raises the question of how to measure similarity. The most traditional choices are simple distance or similarity metrics, and the related line of research is known as distance (metric) learning. However, such simple metrics are usually only suitable when the data is linearly separable, as illustrated in the figure below. What the kernel trick does is project the low-dimensional features into a higher-dimensional space. In the figure below, the mapping is:

Feature mapping from a low-dimensional space to a high-dimensional space
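The original figure is not reproduced here; a standard example of such a mapping, consistent with the \( \phi: \mathbb{R}^{2} \rightarrow \mathbb{R}^{3} \) mentioned below (and the one I will assume in the examples that follow), is

\[ \phi(x) = \phi(x_{1}, x_{2}) = \left(x_{1}^{2},\ \sqrt{2}\,x_{1}x_{2},\ x_{2}^{2}\right) \]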

After this transformation, a problem that was originally not linearly separable becomes linearly separable, and the similarity measure becomes more faithful. Here the feature mapping \( \phi: \mathbb{R}^{2} \rightarrow \mathbb{R}^{3} \) is given explicitly, but in many cases finding such a mapping is difficult. Let us look again at the inner product (our similarity measure) of the mapped features:
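With the mapping assumed above, the inner product of the mapped features works out to

\[ \langle \phi(x), \phi(z) \rangle = x_{1}^{2}z_{1}^{2} + 2x_{1}x_{2}z_{1}z_{2} + x_{2}^{2}z_{2}^{2} = \left(x_{1}z_{1} + x_{2}z_{2}\right)^{2} = \langle x, z \rangle^{2} \]

that is, the similarity in \( \mathbb{R}^{3} \) can be computed from the plain inner product in \( \mathbb{R}^{2} \).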

We find that the inner product in the high-dimensional space can be obtained by a simple transformation of the inner product in the original space, without ever knowing \( \phi \) explicitly. This transformation of the inner product is the kernel function, and the whole idea is known as the kernel trick. To briefly summarize the main properties of kernels:

  • A mapping \( \phi \) projects the feature space into a higher-dimensional space, also called a Hilbert space;
  • The goal is to make the problem linearly separable in that high-dimensional space, which is exactly the foundation of kernel SVMs;
  • Inner products in the high-dimensional space can be computed by applying the kernel function in the low-dimensional space, which is much cheaper (see the sketch below);
  • A kernel function must be finitely positive semi-definite.
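As a sanity check of the last points, here is a minimal Lua/Torch sketch (my own toy example, assuming the mapping \( \phi(x) = (x_{1}^{2}, \sqrt{2}\,x_{1}x_{2}, x_{2}^{2}) \) from above): the inner product after the explicit mapping equals the degree-2 polynomial kernel \( \langle x, z \rangle^{2} \) evaluated in the original space.

require 'torch'

-- explicit feature map phi: R^2 -> R^3 (the assumed example mapping)
local function phi(x)
   return torch.Tensor({x[1]^2, math.sqrt(2) * x[1] * x[2], x[2]^2})
end

local x = torch.Tensor({1, 2})
local z = torch.Tensor({3, -1})

print(phi(x):dot(phi(z)))   -- inner product in R^3
print(x:dot(z)^2)           -- kernel evaluated in R^2; both print 1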

 

Back to Kernel Descriptor


From the summary above we can see why kernels are attractive for measuring similarity. Now let us return to the traditional histogram-based descriptors (and representations) in computer vision: SIFT, HoG, and BoVW (Bag of Visual Words). Among these three, SIFT and HoG are related to Kernel Descriptors, while BoVW is related to Liefeng Bo's other paper, Efficient Matching Kernel. The two works actually share a very similar idea, which can be summarized as follows: the traditional way of measuring similarity is too coarse because of discretization (binning); a Gaussian kernel can describe similarity much more finely, but it brings the problem of an infinite-dimensional feature space; the infinite-dimensional space is therefore approximated by a finite high-dimensional one, and the similarities between the feature and the basis vectors of that space form the final kernel descriptor.

Let us start from histogram-based description. The original motivation is to describe a set of irregularly, discretely distributed values: divide the value range into a number of bins, count how many values fall into each bin, and represent the whole set of discrete data by the resulting histogram. This already introduces the first source of error, namely the approximation of the data. When two descriptors need to be matched, the usual practice is then to compare the distance between the two normalized histograms. The paper proposes a new way of looking at this old problem from the kernel point of view, taking the gradient (orientation) histogram as an example:
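From what I remember of the paper, the orientation histogram of a patch \( P \) can be written as \( F_{h}(P) = \sum_{z \in P} \tilde{m}(z)\,\delta(z) \), where \( \tilde{m}(z) \) is the normalized gradient magnitude at pixel \( z \) and \( \delta(z) \) is the (hard or soft) binning indicator of the orientation \( \theta(z) \); comparing two patches by the inner product of their histograms then corresponds to the linear match kernel

\[ K(P, Q) = F_{h}(P)^{\top} F_{h}(Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\,\tilde{m}(z')\,\delta(z)^{\top}\delta(z') \]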

The error comes from the binning function \( \delta \) in the formula above. To give a simple example, two gradient directions that fall just on either side of a bin boundary are assigned to two different bins, and the similarity between them vanishes completely, even though simple geometry tells us that the two angles are actually very close and this similarity should be taken into account. The paper therefore introduces a continuous representation of the orientation to describe it more finely:
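As far as I remember, this is the normalized orientation vector

\[ \tilde{\theta}(z) = \left(\sin\theta(z),\ \cos\theta(z)\right) \]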

By transforming the plain angle in this way, we obtain a representation that preserves the closeness of nearby orientations, and the similarity between two such orientation vectors can then be expressed with a Gaussian kernel:
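Roughly, the orientation kernel is a Gaussian kernel over these orientation vectors:

\[ k_{o}\!\left(\tilde{\theta}(z), \tilde{\theta}(z')\right) = \exp\left(-\gamma_{o}\,\big\|\tilde{\theta}(z) - \tilde{\theta}(z')\big\|^{2}\right) \]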

The authors also take the position of each pixel within its neighbourhood (the patch) into account by introducing another kernel:
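If I recall correctly, this is a Gaussian kernel over pixel positions within the patch:

\[ k_{p}(z, z') = \exp\left(-\gamma_{p}\,\|z - z'\|^{2}\right) \]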

The final similarity between two patches, expressed as a kernel, is:
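Putting the pieces together, the gradient match kernel between two patches \( P \) and \( Q \) is (as I remember it from the paper)

\[ K_{\mathrm{grad}}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\,\tilde{m}(z')\,k_{o}\!\left(\tilde{\theta}(z), \tilde{\theta}(z')\right) k_{p}(z, z') \]

where \( \tilde{m}(z) \) is again the normalized gradient magnitude.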

However, a kernel only describes the similarity between two feature points (patches); how to obtain a descriptor for a single patch is another problem. The traditional way is to derive from the kernel function the feature mapping \( \phi \) we discussed earlier. The difficulty here is that, because a Gaussian kernel is used, the corresponding feature space is infinite-dimensional, so we need to project from the infinite-dimensional space onto the basis of a finite-dimensional one. Intuitively, this means taking inner products between the (implicit) feature-space representation and a set of basis vectors, i.e., projecting the infinite-dimensional feature onto the space spanned by a finite set of basis vectors, which completes the conversion from infinite to finite dimensions. To compress the feature dimensionality further, the authors then apply kernel PCA, and this step should be fairly easy to understand.
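Here is a toy Lua/Torch sketch of the finite-dimensional approximation idea (my own illustration, not the released kernel descriptor code): an input is represented by its Gaussian-kernel similarities to a fixed set of basis vectors; the kernel PCA step from the paper is omitted.

require 'torch'

local function gaussian_kernel(x, z, sigma)
   return math.exp(-torch.norm(x - z)^2 / (2 * sigma^2))
end

local D, dim, sigma = 16, 2, 0.5
local basis = torch.rand(D, dim)   -- basis vectors (random here, for illustration only)

-- finite-dimensional feature: kernel evaluations against each basis vector
local function kernel_feature(x)
   local f = torch.Tensor(D)
   for i = 1, D do
      f[i] = gaussian_kernel(x, basis[i], sigma)
   end
   return f
end

print(kernel_feature(torch.rand(dim)))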

Kernel Descriptors give a low-level feature description. To obtain a mid-level or high-level representation, the common choices are BoVW and Spatial Pyramid Matching; in this paper the authors use their own EMK, whose idea is also fairly easy to follow. I may introduce it another time if I get the chance.

When we first designed the framework of the perception module for the Amazon Picking Challenge (APC), we tried to keep using our existing RGB-D object recognition and pose estimation pipeline. However, as soon as we saw the actual items we needed to recognize in the competition, we realized that the traditional keypoint-based methods would not work, since most of the items are not textured enough for stable keypoint detection and matching. We were also concerned about the RGB resolution of the Kinect sensor, and the distortion of the Kinect V2 is so severe that the corners of the shelf are bent, so we gave it up. Speaking of the Kinect V2, at the competition the MIT team did really good work with it: they mounted a pair of Kinect V2 sensors in the formation below, which avoids the distortion problem at the corners. It is a really brilliant idea.

MIT Kinect V2 Configuration

In order to recognize untextured or less-textured objects, there are two different routes: 1) using RGB-D features and descriptors, and 2) using more advanced machine learning techniques. Unfortunately, in my experience there is currently no stable RGB-D keypoint detector or descriptor available for object recognition, so I decided to turn to the machine learning community for help. In the end, two different methods were implemented: 1) kernel descriptors and 2) EBlearn. We dug into the details of both methods, theoretically and practically.

Our recognition pipeline showed very good performance in object detection for both textured and less-textured items, even in very cluttered environments. The remaining problem in this kind of robotic perception, in my opinion, is pose estimation, or state estimation for manipulation. With machine learning methods, once we get the detected ROI, the traditional approach to compute the relative pose is ICP, but it is really not good enough, especially when the ROI is not accurate. Again, since I cannot find a good depth keypoint detector and descriptor, computing the relative pose via feature matching and SVD seems difficult. My future work will cover RGB-D descriptors for less-textured object recognition and pose estimation.

This package uses a traditional pattern for object recognition: extract features, obtain a Bag-of-Words representation, and classify the features using libSVM. The pipeline is shown below, from Marcel Tella Amo.

Object Classification Pipeline using SVM and BoW

 

The features used in this package are SIFT for the grayscale image, FPFH for the point cloud, and HoG for structure information. This combination provides reasonably good classification results for simple objects.

From this package you can learn how to use the SIFT and HoG extractors in OpenCV and the FPFH descriptor in the Point Cloud Library, as well as how to use libSVM through its C API.

Title Monocular SLAM Supported Object Recognition
Conference RSS 2015
Author Sudeep Pillai and John Leonard

Contributions in my opinion

This paper combines visual SLAM with object recognition. At first glance it may look similar to the SLAM++ paper from Andrew Davison's group; however, the problem the authors want to address is different. The SLAM part in this work acts as a pre-processing step to obtain the reconstructed point cloud, which is then partitioned using density-based over-segmentation. From this result, the authors reproject the segmented point cloud to different viewpoints and recognise the items in image space. The paper spends a lot of time explaining the image feature coding strategies, starting from the traditional BoVW, to the more recent VLAD, and to the even more recent FLAIR. In short, FLAIR enables the user to detect the positions of objects in the image. As is well known, sliding-window detection needs to solve a scalability issue; the BoVW representation avoids this issue, but it sacrifices the ability to localise the object in the image. FLAIR seems to be the solution to the problem of how to localise the object in the image. In conclusion, I would categorize this paper as an extension of FLAIR rather than a combination of SLAM with object recognition. I also assume that the scalability the authors highlight in the abstract is inherited from FLAIR. The work shows improvements on the UW RGB-D scene dataset.

Questions in my opinion

Given an RGB sensor, I believe this work is an absolutely great idea. By doing SLAM, the geometric information is taken into consideration, so it is no wonder that it generates better results than traditional methods such as BING. But the questions are: 1) what happens if an RGB-D sensor is used, making the SLAM part seemingly redundant; 2) what happens if the SLAM part makes errors in scene reconstruction. If the work were done with an RGB-D sensor, I would be keen to know: 1) since the RGB-D structure of every frame is available from the sensor, can online detection achieve the same results as this paper, which performs detection after SLAM? 2) how can consecutive frames and detection results help each other? 3) instead of generating bounding-box detections, could we generate more accurate detection results using RGB-D segmentation or superpixel segmentation?

Conclusion

This paper shows that object recognition can be improved by SLAM, which adds information from more viewpoints, and that using FLAIR for detection is a great way to address the scalability issue.

A preliminary understanding of the scalability of FLAIR: FLAIR densely samples (encodes) the image space once, so instead of going through all possible object candidate windows one by one, it only needs to go through the image space.

I recently started learning Lua because of my interest in the popular deep learning package Torch. Before I even jumped into Torch, I had already fallen in love with Lua. First of all, it is very lightweight and fast, as advertised. It contains some very interesting features that I have not met in C/C++ or Python, such as the way functions are defined inside classes. It is even more astonishing that Lua is written in pure C and the code base is very small. I have only read a few lines of the source code, but I have to admit it is the most beautiful code I have ever read, based on my personal experience. Some useful links for learning Lua are below:

Torch and Caffe are the most widely used deep learning toolboxes in the current research field. It seems that Caffe is more popular in US universities while Torch is more popular in Europe, perhaps because of where their developers are from. The reason I selected Torch is the post on Tomasz's blog and what Yann LeCun said:

Torch is for research in deep learning, Caffe is OK for using ConvNets as a "black box" (or a grey box), but not flexible enough for innovative research in deep learning. That's why Facebook and DeepMind both use Torch for almost everything.

Command parser in Lua using Torch

If you are experienced in Boost.Program_options, the following example will be straightforward to you:

cmd = torch.CmdLine()
cmd:text()
cmd:text('Student info')
cmd:text()
cmd:text('Options:')
-- global:
cmd:option('-age', 18, 'age of the student') -- int param
cmd:option('-gender', true, 'is boy')        -- true = boy, false = girl
cmd:option('-name', 'adam', 'name')          -- name, string
cmd:text()
opt = cmd:parse(arg or {})

By typing th -i <filename> -h/--help, the program options will be printed in the console:

Usage: /home/kanzhi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th [options]

Student info

Options:
  -age    age of the student [18]
  -gender is boy [true]
  -name   name [adam]

After parsing the args into opt, the options are saved as a table. You can also simply run the following code to print the parameter settings:

print 'default parameter settings: '
for k, v in pairs(opt) do
   print (k, v)
end
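Individual options can then be accessed as fields of the opt table, for example (using the options defined above):

print(opt.name, opt.age, opt.gender)   -- with the defaults: adam  18  true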