8th International DAAAM Baltic Conference "INDUSTRIAL ENGINEERING", 19-21 April 2012, Tallinn, Estonia

LOCAL AND GLOBAL DESCRIPTORS FOR PLACE RECOGNITION IN ROBOTICS

Shvarts, D. & Tamre, M.

Abstract: The simultaneous localization of a robot and mapping of its environment (SLAM) is one of the most pressing problems in robotics. Within existing SLAM algorithms, place recognition is required in several cases. For example, in multi-robot SLAM several individual maps are created by different robots; in order to combine them into one global map, common places have to be identified before merging them. In this paper, two methods that have been successfully used for scene recognition between different images are compared, and the advantages and limitations of each method are considered with respect to our tasks.

Key words: local descriptors, global descriptors, GIST, SIFT, SURF.

1. INTRODUCTION

SLAM, standing for Simultaneous Localization and Mapping, aims to locate a mobile robot in its environment and to estimate a map of it from sensory information [1]. A wide array of sensors has been used, but nowadays cameras are the preferred ones. At its core, a SLAM algorithm applies sequential estimation techniques that fit a model to noisy data. In a SLAM framework, the ability to recognize a previously mapped area is useful on several occasions: for correcting the estimation drift when an area is revisited (a problem known as loop closure) [2]; for relocation in an estimated map (the kidnapped robot problem) [3]; or for fusing information between multiple robots that are mapping the same area (multi-robot SLAM) [4]. Place recognition in visual SLAM has usually been addressed by constructing a visual vocabulary of local descriptors [2, 3]. Such a vocabulary can be expensive to build and store when a robot performs an exploratory trajectory and accumulates new images. While global descriptors exist in the computer vision literature, they have never been used in SLAM.
This paper compares several local and global descriptors for the purpose of place recognition in robotics, both in terms of performance and computational cost.

2. LOCAL DESCRIPTORS

In computer vision, local interest points have been used to solve many problems, such as object recognition, image registration, 3D reconstruction, and more. The usual approach is to select some points in the image and perform a local analysis on them. For such methods to work successfully, a sufficient number of keypoints has to be detected. In addition, these points should be distinguishable and stable features that can be accurately localized. Much research on the behavior of several types of feature descriptors and detectors has been done. We compared the results of such investigations to select the appropriate feature descriptor and detector for further work. The best feature detector has to meet the following requirements:
- The extracted keypoints have to be rotation and scale invariant.
- Invariance to luminance transformations, at least partly.
- Invariance to blur and noise.

A comparison of six methods implemented in the OpenCV library was presented in [10], where five quality tests and one performance test were done for each kind of descriptor.

Fig. 1. The result of the rotation test for OpenCV's feature detector algorithms.

Fig. 1 shows that almost all algorithms are partially invariant to rotation except BRIEF, with SIFT presenting the best repeatability. Close to SIFT are the ORB and SURF feature descriptors. Fig. 2 shows the scale-invariance performance of the different algorithms.

Fig. 2. The scale test for OpenCV's feature detector algorithms.

Again, the most stable results were shown by the SURF and SIFT descriptors. Almost all the descriptors have a high degree of invariance to brightness change, as shown in Fig. 3.

Fig. 3. The result of the lighting test for OpenCV's feature detector algorithms.

Based on the material presented in [10], the two descriptors that showed the most stable results are SIFT and SURF. It should also be noticed that these algorithms are the slowest among all those tested (Fig. 4).

Fig. 4. Speed test of OpenCV's implemented algorithms.

The use of these algorithms in real-time applications could be limited due to their computational cost. However, the high quality of the calculated keypoints makes these algorithms irreplaceable for solving many problems in computer vision. A detailed description of these algorithms is not presented here for the sake of brevity; the reader is referred to the original papers [7] and [13] for a deep understanding of both algorithms. We carried out several additional tests with the SIFT and SURF algorithms to determine which of them has the most suitable performance for our particular purpose. The results are presented in Fig. 5 and Table 1.
Fig. 5. a) A test image with keypoints extracted using the SIFT descriptor; b) the same image with keypoints extracted using the SURF descriptor.

                      SURF         SIFT
Image size            480x640x3    480x640x3
Extracted keypoints   1126         1511
Execution time        878.86 ms    1245.8 ms

Table 1. The comparison of the two local descriptors.

Both the SIFT and SURF descriptors showed similar results and could be applied to this problem in a SLAM application, although with certain limitations. The number of extracted keypoints played a major role in choosing a suitable descriptor for further work.

3. GLOBAL DESCRIPTORS

In the previous section we investigated the properties of different local descriptors. There are several descriptors, and we can choose the best one to accomplish a specific task. However, if we use a local descriptor, the representation of the whole image is restricted to the description of the set of points that was successfully extracted from it. In contrast, global descriptors summarize the whole image in a single descriptor, GIST being the most representative [8]. We highlight the major aspects of global descriptors in this section. Research in the field of global descriptors rests on the observation that recognition of a real-world scene can be based on encoding its global configuration, ignoring most of the details and object information [8]. An abstract description of a scene can be obtained from the discrete Fourier transform of the image:

    I(f_x, f_y) = \sum_{x,y} i(x, y)\, e^{-j 2\pi (f_x x + f_y y)},

where i(x, y) is the intensity distribution of the image along the spatial variables x and y, and f_x, f_y are the spatial frequency variables. The complex function I(f_x, f_y) can be decomposed into two terms,

    I(f_x, f_y) = A(f_x, f_y)\, e^{j \Phi(f_x, f_y)},

where A(f_x, f_y) is the amplitude spectrum of the image and \Phi(f_x, f_y) is the phase function of the Fourier transform. The phase function represents the information relative to local properties, while the amplitude spectrum gives unlocalized information about the image structure.
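The amplitude/phase decomposition above can be checked numerically. The following sketch (numpy only, with a synthetic image standing in for a real scene) computes both terms and verifies that they recombine into the original transform:

```python
# Sketch: amplitude and phase decomposition of an image's DFT.
# Only numpy is assumed; the image here is a synthetic stand-in.
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((64, 64))          # intensity distribution i(x, y)

I = np.fft.fft2(img)                # complex spectrum I(f_x, f_y)
A = np.abs(I)                       # amplitude spectrum A(f_x, f_y)
phi = np.angle(I)                   # phase function Phi(f_x, f_y)

# The two terms recombine into the original complex spectrum.
assert np.allclose(I, A * np.exp(1j * phi))

# The energy spectrum used in the next section is the squared amplitude.
E = A ** 2
print(E.shape)                      # one energy value per spatial frequency
```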
The energy spectrum of the Fourier transform, |I(f_x, f_y)|^2, is the distribution of the signal's energy among the different spatial frequencies. The global description of a scene is encoded in this distribution, which provides a high-dimensional representation of the image. In practice it is impossible to operate with such a representation directly, due to the high dimensionality of the energy spectrum. The standard way to reduce the dimensionality of the energy-spectrum matrix is principal component analysis (PCA): the matrix is rearranged into a column vector, and PCA then extracts a subspace spanned by a subset of the Karhunen-Loeve (KL) basis functions. A direct implementation of this method is also impossible in practice, because a reliable calculation of the KL basis functions requires a number of image samples larger than the dimensionality of the spectrum, which is usually not available. [8] suggests instead sampling the energy spectrum as

    v_n = \sum_{f_x, f_y} |I(f_x, f_y)|^2 \, G_n(f_x, f_y),

where {G_n} is a set of Gaussian functions. We have tested the MATLAB code (created by the author of [8]) to examine the properties of the GIST descriptor and the possibility of using it instead of local descriptors for scene matching.

4. EXPERIMENTAL RESULTS

In this section we examine two different descriptors, one local and one global. The aim of the experiment is to demonstrate the ability of the descriptors to match two images of the same scene. The way to solve this problem is well understood: it consists of estimating the homography between pairs of images. First we tested the local descriptor. The algorithm is presented below:

Algorithm: Local descriptor for matching two images.
Input: Two putatively matched images.
1. Extract SIFT features from the first and second image. We use the SIFT descriptor based on the studies in Section 2.
2. Estimate putative correspondences: find the k nearest neighbors for each feature. From the feature-matching step we identify images that have a large number of matches between them. We then consider the m images with the largest number of matched points and use RANSAC to select the inliers that contribute to the homography computation.
3. Fundamental matrix estimation: find geometrically consistent feature matches by using RANSAC to compute the fundamental matrix between pairs of images. We select the image with the largest number of inlier points.
Output: Two matched images.

A better approach is presented in [12], but in our case we performed a simpler experiment. In the next experiment we examine the properties of the GIST descriptor for automatic image matching, using the same set of images as input. During the experiment we calculate the GIST descriptor for each image; the best match is the image with the smallest distance between GIST vectors. The result of the two experiments was clear: both descriptors are invariant to rotation and scaling.
As stated earlier, we did not focus on the execution time of the methods but on their properties. Both methods can be successfully applied to the automatic matching of images, but even without additional performance measurements it is obvious that the GIST descriptor is faster.

Fig. 6. Dynamics of the SIFT descriptor during image matching. The number of matched points increases from 44 in the top image to 158 in the bottom image.

The result of the GIST descriptor is shown in Fig. 7.
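A minimal GIST-like pipeline illustrates why the global descriptor is cheap. The sketch below is a simplification of [8] (which uses oriented Gabor filters): here the energy spectrum is pooled with a coarse grid of Gaussian windows, which is enough to demonstrate matching by vector distance.

```python
# Sketch: a simplified GIST-like global descriptor. The energy spectrum
# of each image is pooled with a grid of Gaussian windows G_n, giving one
# short vector per image; matching picks the smallest Euclidean distance.
import numpy as np

def gistlike(img, grid=4, sigma=8.0):
    # Energy spectrum, DC component shifted to the center.
    E = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = E.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for cy in np.linspace(0, h, grid, endpoint=False) + h / (2 * grid):
        for cx in np.linspace(0, w, grid, endpoint=False) + w / (2 * grid):
            G = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
            feats.append((E * G).sum())      # v_n = sum over E * G_n
    v = np.array(feats)
    return v / (np.linalg.norm(v) + 1e-12)   # normalize the descriptor

def best_match(query, candidates):
    q = gistlike(query)
    d = [np.linalg.norm(q - gistlike(c)) for c in candidates]
    return int(np.argmin(d))                 # smallest distance wins
```

For an image sequence like the one used in the experiment, best_match would return the index of the frame whose GIST vector lies closest to the query's, with one fixed-length vector per image instead of hundreds of keypoints.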
Fig. 7. Image a) is an input image for the GIST descriptor; image b) is the output of the GIST test algorithm.

Finally, we describe the image set. As input for the algorithms we chose the first image from a set of 586 images. The set is an image sequence created from a moving camera; the image resolution is 568x320 pixels.

5. CONCLUSION

In this paper we have tested different image descriptors, several local ones and a global one, in order to foresee their possible use in a robotic application. Among the local descriptors that were evaluated we observe a compromise between speed and performance: SIFT and SURF present the highest invariance to different transformations, but are more expensive to compute than the rest of the local descriptors. Regarding the global descriptor GIST, we have observed good performance for scene recognition, higher compactness, and a possibly lower cost, which indicates good potential for image matching in robotics. As future work, our aim is to perform a detailed comparison between local and global descriptors regarding performance and cost.

6. REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping (SLAM): Part I the essential algorithms," IEEE Robotics and Automation Magazine, vol. 13, no. 2, pp. 99-110, 2006.
[2] D. Galvez-Lopez and J. D. Tardos, "Real-time loop detection with bags of binary words," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2011, pp. 51-58.
[3] B. Williams, G. Klein, and I. Reid, "Real-time SLAM relocalisation," in IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[4] S. Thrun and Y. Liu, "Multi-robot SLAM with sparse extended information filters," Robotics Research, pp. 254-266, 2005.
[5] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A comparison of affine region detectors," International Journal of Computer Vision, vol. 65, no. 1, pp. 43-72, 2005.
[6] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[8] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.
[9] B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. Tardos, "An image-to-map loop closing method for monocular SLAM," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008, pp. 2053-2059.
[10] "Feature descriptor comparison report," http://computer-visiontalks.com/2011/08/feature-descriptorcomparison-report/
[11] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000. ISBN 0521623049.
[12] M. Brown and D. G. Lowe, "Recognising Panoramas."
[13] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.