Kernel Descriptor in Depth

Title: Kernel Descriptors for Visual Recognition
Conference: NIPS 2010
Authors: Liefeng Bo, Xiaofeng Ren, and Dieter Fox

What is a kernel?

• polynomial kernel: $\mathbf{k}(x, z) = (\langle x, z\rangle + 1)^{\gamma}, \gamma \in \mathbb{Z}^{+}$
• Gaussian kernel: $\mathbf{k}(x, z) = \exp\left(-\dfrac{\|x-z\|^{2}}{2\sigma^{2}}\right), \sigma \in \mathbb{R}\setminus\{0\}$
• Laplacian kernel: $\mathbf{k}(x, z) = \exp\left(-\dfrac{\|x-z\|}{\sigma}\right), \sigma > 0$

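• A mapping $\phi$ projects the feature space into a high-dimensional space $\mathbf{H}$, also known as a Hilbert space;
• The aim is to make the problem as close to linearly separable as possible inside $\mathbf{H}$; this is the basic idea behind the kernel SVM;
• Inner products in the high-dimensional space can be obtained by applying the kernel function to the low-dimensional inputs, which is much faster to compute;
• The kernel function must be finitely positive semi-definite, i.e. the Gram matrix built on any finite set of points is positive semi-definite.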
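
The last two points can be checked numerically. The sketch below is only an illustration and is not taken from the paper: it assumes 2-D inputs and $\gamma = 2$, for which the polynomial kernel has a small explicit feature map, and all function names (`phi_poly2`, `quadratic_form`, ...) are made up for this example. It compares the kernel value computed in the input space with the inner product computed in the mapped space, then evaluates $c^{T} K c$ for a Gaussian Gram matrix, which should never be negative.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z, gamma=2):
    """Polynomial kernel (<x, z> + 1)^gamma."""
    return (dot(x, z) + 1) ** gamma


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def phi_poly2(x):
    """Explicit feature map of the gamma = 2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


def quadratic_form(points, coeffs, kernel):
    """c^T K c for the Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return sum(ci * cj * kernel(pi, pj)
               for ci, pi in zip(coeffs, points)
               for cj, pj in zip(coeffs, points))


x, z = (1.0, 2.0), (3.0, -1.0)
in_input_space = poly_kernel(x, z)                   # computed on 2-D inputs
in_feature_space = dot(phi_poly2(x), phi_poly2(z))   # computed on 6-D features
print(in_input_space, in_feature_space)              # agree up to rounding

points = [(0.0, 0.0), (1.0, 0.5), (-2.0, 1.0)]
coeffs = [1.0, -2.0, 0.5]
print(quadratic_form(points, coeffs, gaussian_kernel))  # non-negative
```

A single non-negative value does not prove positive semi-definiteness, of course; it only shows what the condition asks for.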

Back to Kernel Descriptor

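The kernel descriptor gives a lower-level feature description. To obtain a middle-level or higher-level description, the common approaches are BoVW and Spatial Pyramid Matching. In this paper, however, the authors use EMK, which they proposed themselves. The idea is fairly easy to follow, and I will introduce it in a later post if I get the chance.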

Amazon Picking Challenge -- Perception Module

When we first designed the framework of the perception module for the Amazon Picking Challenge (APC), we planned to keep using our existing RGB-D object recognition and pose estimation pipeline. However, as soon as we saw the actual items we needed to recognize in the competition, we realized that the traditional keypoint-based method would not work, since most of the items are not textured enough for stable keypoint detection and matching. We were also concerned about the RGB resolution of the Kinect sensor, and the distortion in the Kinect V2 is so severe that the corners of the shelf appear bent, so we gave it up. Speaking of the Kinect V2, the MIT team did really good work with it at the competition: they mounted a pair of Kinect V2 sensors in the formation shown below, which avoided the distortion problem in the corners. It is a brilliant idea.

To recognize untextured or less-textured objects, there are two different approaches: 1) using RGB-D features and descriptors, and 2) using more advanced machine learning techniques. Unfortunately, in my experience there is no stable RGB-D keypoint detector or descriptor available for object recognition at the moment, so I decided to turn to the machine learning community for help. In the end, two methods were implemented: 1) kernel descriptor and 2) EBlearn. We dug into the details of both methods, theoretically and practically.

Our recognition pipeline showed very good object detection performance for both textured and less-textured items, even in very cluttered environments. In my view, the remaining problem in this kind of robotic perception is pose estimation, or state estimation, for manipulation. With machine learning methods, once we have the detected ROI, the traditional way to compute the relative pose is ICP, but it is really not good enough, especially when the ROI is inaccurate. Again, since I cannot find a good depth keypoint detector and descriptor, computing the relative pose via feature matching and SVD seems difficult. My future work will cover RGB-D descriptors for less-textured object recognition and pose estimation.
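
To make the "feature matching and SVD" step more concrete, here is a minimal sketch of least-squares rigid alignment from already matched points. It is deliberately restricted to 2-D, where the optimal rotation has a closed form and no SVD is needed; in 3-D the rotation is usually obtained from the SVD of the cross-covariance matrix of the centred points instead. This is not code from our pipeline, and `rigid_align_2d` is a hypothetical name used only here.

```python
import math


def rigid_align_2d(src, dst):
    """Least-squares rotation angle and translation that map src onto dst.

    src and dst are equal-length lists of matched (x, y) points.
    """
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n

    s_dot = 0.0    # sum of dot products of the centred pairs
    s_cross = 0.0  # sum of 2-D cross products of the centred pairs
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx

    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation moves the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)


# Synthetic matches: rotate by 0.5 rad, then translate by (2, -1).
true_theta, true_t = 0.5, (2.0, -1.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
dst = [(math.cos(true_theta) * x - math.sin(true_theta) * y + true_t[0],
        math.sin(true_theta) * x + math.cos(true_theta) * y + true_t[1])
       for x, y in src]

theta, (tx, ty) = rigid_align_2d(src, dst)
print(theta, tx, ty)
```

With noise-free matches the transform is recovered up to rounding. The hard part described above is getting reliable matches in the first place, which this sketch takes for granted.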

Object Classification using RGB-D features with SVM

This package uses a traditional pattern for object recognition: extract features, build a Bag-of-Words representation, and classify the features using libSVM. The pipeline is shown below, from Marcel Tella Amo.

The features used in this package are SIFT for the grayscale image, FPFH for the point cloud, and HoG for structure information. This combination provides reasonably good classification results for simple objects.

From this package you can learn how to use the SIFT and HoG extractors in OpenCV and the FPFH descriptor in the Point Cloud Library. You can also see how to use libSVM through its C API.
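
The Bag-of-Words step in the middle of that pipeline can be sketched in a few lines. This is not the package's code: it assumes a codebook of visual words already exists (in practice it would come from clustering training descriptors), uses tiny 2-D toy vectors instead of real SIFT or FPFH descriptors, and the function names are hypothetical. The resulting histogram is the fixed-length vector that would then be passed to the SVM.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to descriptor (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, codebook):
    """Normalised histogram of visual word occurrences for one object."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Toy data: two visual words and four local descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (1.0, 1.0)]
hist = bow_histogram(descriptors, codebook)
print(hist)  # three descriptors fall on word 0, one on word 1
```

However many local descriptors an object produces, the histogram always has one entry per visual word, which is what lets a standard SVM consume it.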

Monocular SLAM Supported Object Recognition

Title: Monocular SLAM Supported Object Recognition
Authors: Sudeep Pillai and John Leonard

Contributions in my opinion

This paper combines visual SLAM with object recognition. At first glance it may look similar to the SLAM++ paper from Andrew Davison's group, but the problems the authors want to address are different. The SLAM part of this work acts as a pre-processing step that produces the reconstructed point cloud, which is then partitioned using density-based over-segmentation. From these results, the authors reproject the segmented point cloud into different viewpoints and recognise the items in image space. The paper spends a lot of time explaining the image feature coding strategy, from traditional BoVW to the more recent VLAD and the even more recent FLAIR. In short, FLAIR makes it possible to detect the position of objects in the image. Sliding-window detection is known to have a scalability problem; a BoVW representation avoids it, but at the cost of no longer being able to localise the object in the image. FLAIR seems to be the solution to this localisation problem. In conclusion, I would categorise this paper as an extension of FLAIR rather than a combination of SLAM with object recognition. I also assume that the scalability the authors highlight in the abstract is inherited from FLAIR. This work shows improvements on the UW RGB-D Scene dataset.

Questions in my opinion

Given only an RGB sensor, I believe the work is an absolutely great idea. By doing SLAM, geometric information is taken into consideration, so it is no wonder that it produces better results than traditional methods such as BING. But two questions remain: 1) what happens if an RGB-D sensor is available, in which case the SLAM part seems redundant; and 2) what happens if the SLAM part makes errors in scene reconstruction. I am keen to know what would change if the work were done with an RGB-D sensor: 1) since the RGB-D structure of every frame is available from the sensor, can online detection achieve the same results as this paper, which performs detection after SLAM? 2) how can consecutive frames and their detection results help each other? 3) instead of generating bounding-box detections, can we generate more accurate detections using RGB-D segmentation or superpixel segmentation?

Conclusion

This paper shows that object recognition can be improved by using SLAM, which adds more viewpoint information, and that using FLAIR for detection is a great way to address the scalability issue.

Preliminary understanding of the scalability of FLAIR

FLAIR densely samples the image space and, instead of going through all the possible object candidates, it only goes through the image space.
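
One common way to get this "visit the image once, then query any box cheaply" behaviour is an integral image (summed-area table) built per visual word. The sketch below is not FLAIR itself and is not taken from the paper; it is only my illustration of the idea, under the assumption that every pixel of a toy image has already been assigned a visual word index. The function names are hypothetical.

```python
def build_integral(word_map, num_words):
    """One summed-area table per visual word.

    word_map[y][x] holds the visual word index at pixel (x, y).
    integral[k][y][x] counts word k in rows < y and columns < x.
    """
    height, width = len(word_map), len(word_map[0])
    integral = [[[0] * (width + 1) for _ in range(height + 1)]
                for _ in range(num_words)]
    for y in range(height):
        for x in range(width):
            for k in range(num_words):
                hit = 1 if word_map[y][x] == k else 0
                integral[k][y + 1][x + 1] = (hit
                                             + integral[k][y][x + 1]
                                             + integral[k][y + 1][x]
                                             - integral[k][y][x])
    return integral


def box_histogram(integral, x0, y0, x1, y1):
    """Word counts inside the box [x0, x1) x [y0, y1), four lookups per word."""
    return [table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]
            for table in integral]


# Toy 3x3 "image" labelled with two visual words.
word_map = [[0, 1, 1],
            [0, 0, 1],
            [1, 1, 0]]
integral = build_integral(word_map, num_words=2)
full = box_histogram(integral, 0, 0, 3, 3)   # the whole image
patch = box_histogram(integral, 1, 0, 3, 2)  # top-right 2x2 box
print(full, patch)
```

The image is traversed once to build the tables; after that, the histogram of any candidate box costs the same small number of lookups regardless of the box size, so the number of candidates matters much less.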

Command Parser in Lua with Torch

I recently started learning Lua because I was interested in learning the popular deep learning package Torch. Before I even jumped into Torch, I had already fallen in love with Lua. First of all, it is as lightweight and fast as it claims. It has some very interesting features that I haven't met in C/C++ or Python, such as functions in classes. It is even more 'astonishing' that Lua is written in pure C and that its size is so small. I have only read a few lines of the source code, but I have to admit it is the most beautiful code I have come across. Some useful links for learning Lua are below:

Torch and Caffe are the most widely used deep learning toolboxes in current research. It seems that Caffe is more popular in US universities and Torch is more popular in Europe, maybe because of where the developers are from. I chose Torch because of a post on Tomasz's blog and because Yann LeCun said:

"Torch is for research in deep learning, Caffe is OK for using ConvNets as a "black box" (or a grey box), but not flexible enough for innovative research in deep learning. That's why Facebook and DeepMind both use Torch for almost everything."

Command parser in Lua using Torch

If you have experience with `Boost.Program_options`, the following example will be straightforward:

```lua
cmd = torch.CmdLine()
cmd:text()
cmd:text('Student info')
cmd:text()
cmd:text('Options:')
-- global:
cmd:option('-age', 18, 'age of the student') -- int param
cmd:option('-gender', true, 'is boy') -- true = boy, false = girl
cmd:option('-name', 'adam', 'name') -- name, string
cmd:text()
opt = cmd:parse(arg or {})
```

By typing `th -i -h/--help`, the program options will be printed in the console:

```
Usage: /home/kanzhi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th [options]

Student info

Options:
  -age    age of the student [18]
  -gender is boy [true]
  -name   name [adam]
```

After parsing the arguments into `opt`, the options are stored as a table. You can simply run the following code to print the parameter settings:

```lua
print 'default parameter settings: '
-- opt is a plain Lua table, so pairs() visits every option name and value
for k, v in pairs(opt) do
  print(k, v)
end
```