Global localization from a single feature correspondence

Friedrich Fraundorfer and Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology
{fraunfri,bischof}@icg.tu-graz.ac.at

Abstract: This paper presents a new approach to global localization for mobile robots from a single feature correspondence only. The method is based on a piece-wise planar environment map and uses planar natural landmarks as commonly encountered in man-made environments. The method is especially useful in case of large occlusions. The standard Lu and Hager pose estimation algorithm [6] is extended to improve accuracy and robustness. Localization experiments are performed in a room-size real-world scenario.

1 Introduction

This work focuses on one specific aspect of mobile robot localization: global localization. It is needed when no prior information about the robot's position is available, e.g. when the robot is switched on at an arbitrary position or in the kidnapped robot problem. In addition, global localization may be used to verify and correct a robot position that is estimated incrementally from odometry (mechanical or visual) measurements. Such methods inherently accumulate position errors, and the longer the system runs the further the estimate deviates from the correct position. Such situations can be detected and corrected using global localization.

Recent advances in wide-baseline matching (see [7, 13, 9, 5, 4, 8, 3]) opened the door to solving the data correspondence problem for natural visual landmarks. One successful approach to global localization has been presented by Se et al. [12]. Their system uses a metric map consisting of 3D points as landmarks. The map is composed of various sub-maps which are merged through global alignment. The extracted landmarks are scale-invariant features, DoG-keypoints [5]. The landmark correspondence problem is solved with a rotation invariant descriptor (SIFT [5]) which allows a fast and reliable matching of corresponding landmarks. The robot's pose is then calculated from the 3D-2D landmark correspondences between the landmarks detected in the current view and the corresponding 3D landmarks in the map.

Difficulties with that approach arise when only a small number of feature correspondences can be detected or when the accuracy of the 3D-2D point correspondences is weak. Heavy occlusions (e.g. due to surrounding people) might make it difficult to establish a large enough number of correspondences, and usually only a large number of correspondences assures a good estimate. On the other hand, the accuracy of the point correspondences strongly affects the precision of the pose estimate, and simple SIFT-feature matches (especially in wide-baseline cases) achieve only limited accuracy.

This work therefore focuses on the case in which only one feature correspondence can be established, and on how to solve pose estimation in such a worst-case scenario. We present a method which allows pose estimation from a single planar feature correspondence of a very small image region, with an area as small as 200 to 1000 pixels (see section 3). Additional 3D-2D point correspondences are established within a single planar landmark and the pose is estimated using the Lu and Hager [6] iterative pose estimation algorithm. An extension to the Lu and Hager pose estimation algorithm is presented which improves accuracy and robustness. The method is based on a piece-wise planar world representation whose details are outlined in section 2. Results for localization and map building experiments are given in section 4. Finally, we draw some conclusions in section 5.

2 Map building

The proposed world map is a set of 3D reconstructions of planar interest regions. It is organized in sub-maps, which are linked by similarity transformations. The individual sub-maps can be merged into one coordinate frame to build a single complete world map. The planar map features are represented with their full 6DOF. Each map feature is associated with an image region. An interest region detector (in our case MSER [7]) is used to extract descriptive and distinctive parts of an image as natural landmarks.

The proposed map building algorithm is an off-line batch algorithm. Its input is an image sequence from a single camera. 3D reconstruction is done in a structure from motion approach.
The algorithm first identifies sub-maps in the whole image sequence. Subsequently, a metric reconstruction is built for each sub-map. The last step is linking the sub-maps.

2.1 Sub-map identification

We assume that we have a large set of unorganized images of the area to map (we do not assume consecutive images, thus the images might be acquired by multiple robots). This step partitions the whole set into sub-sets containing images with short-baseline variation only. Each partition will then act as a sub-map. To partition the image set we calculate a global feature vector for each image and define a similarity measure in feature space. To calculate the global image description we re-sample the image to a very low resolution of 64 × 64 pixels. Then we calculate the SIFT descriptor of the whole image, which results in a feature vector of length 128. The feature vectors of images from similar viewpoints with a short baseline form clusters in feature space. The similarity is measured with the Euclidean distance of the vectors. The partitioning of the image set is done by hierarchical clustering [1] in feature space. Every cluster acts as a sub-map, and two images from the cluster are chosen to represent the sub-map. The remaining images are not used for further processing.

2.2 Sub-map reconstruction

Each sub-map is defined by two short-baseline images. The camera positions are estimated with the 5-point stereo algorithm of Nister [10]. Next, scene planes are identified in the images using inter-image homographies [2]. From the resulting point correspondences the planes can be reconstructed in 3D. In the next step, features for wide-baseline matching are extracted from the images, in this case MSER regions. A local affine frame (LAF) [7] is computed for every region and used for normalization. A SIFT-descriptor is calculated for each normalized MSER region. Only planar landmarks should be contained in the map, thus planarity has to be checked for each image region. A landmark located within the detected planar areas is of course planar. Features which are not located on one of the detected planes are discarded. The sub-map is finally represented by the identified planes in 3D and in the image, the camera positions, and the extracted MSER regions, each stored as an image patch together with its SIFT-descriptor. A KD-tree data structure is used to store the SIFT-features.

2.3 Sub-map linking

Two sub-maps can be linked if both contain at least one common planar feature. Corresponding features are identified by feature matching: we search for pairs of SIFT-feature vectors that are close in feature space. This is done by computing the distance matrix of all SIFT-features (using the KD-tree) and sorting the distances in ascending order. This list of tentative matches is then verified with an iterative correlation-based region matcher [2], starting with the match of smallest distance.
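
The ordering of tentative matches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `rank_tentative_matches` is hypothetical, descriptors are plain sequences of floats, and an exhaustive distance computation stands in for the KD-tree search.

```python
import math


def rank_tentative_matches(desc_a, desc_b):
    """Return (i, j, distance) for every descriptor pair, closest pair first.

    desc_a, desc_b: descriptor vectors (e.g. 128-element SIFT vectors) of the
    features in the two sub-maps. Brute force is used here for clarity; a
    KD-tree would only speed up the search, not change the ordering.
    """
    pairs = [
        (math.dist(da, db), i, j)  # Euclidean distance in feature space
        for i, da in enumerate(desc_a)
        for j, db in enumerate(desc_b)
    ]
    pairs.sort()  # ascending distance, ties broken by index
    return [(i, j, d) for d, i, j in pairs]


if __name__ == "__main__":
    # Toy 2-element descriptors, only to show the ordering.
    sub_map_1 = [(0.0, 0.0), (1.0, 1.0)]
    sub_map_2 = [(1.0, 1.0), (0.0, 0.1), (5.0, 5.0)]
    for i, j, d in rank_tentative_matches(sub_map_1, sub_map_2):
        print(i, j, round(d, 3))
```

The resulting list would then be handed, entry by entry, to the region matcher for verification; that step is not part of this sketch.
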
Match verification continues until a link between all sub-maps has been found or all list entries have been processed. The region matching provides a set of point correspondences with sub-pixel accuracy for each matched region.

Two sub-maps to be linked are represented in two different coordinate frames and differ in scale. Linking two sub-maps thus means estimating a rigid transformation (rotation R and translation t) and a scale factor s. First, we calculate a 3D reconstruction of the point correspondences in each sub-map, which is obtained by projecting the image points onto the plane. For the scale factor s, two arbitrary points from the reconstructed point set are selected and the distance between them is measured. The distance between the corresponding points in the other sub-map is calculated as well, and the ratio between these two distances defines the scale factor s. The second sub-map is then scaled with s. Now the rigid transformation between both sub-maps can easily be estimated from the corresponding 3D point sets. The second sub-map is then transformed into the coordinate frame of the first sub-map using R and t, and the second sub-map's planes and features are added to the first sub-map.

3 Localization from a single feature correspondence

3.1 Finding feature correspondence

Similar to the sub-map linking step, MSER regions are extracted from the image of the current view. A LAF is computed for each region and region normalization is performed. A SIFT-descriptor is computed from the normalized image patch. Using the SIFT-descriptor, tentative feature correspondences are searched among the map features. The matches are confirmed by the iterative region matching already mentioned. Matching yields planar regions only (which is a requirement for the subsequent pose estimation) because only planar regions are stored in the map (see section 2). The region matching algorithm establishes point correspondences within the support region of the feature (the area the SIFT-descriptor is computed from). A single landmark correspondence therefore gives rise to a set of point correspondences, which allows pose estimation.

3.2 Pose estimation from a single landmark

Pose estimation is performed using the iterative pose estimation algorithm of Lu and Hager [6]. The algorithm returns the full 3DOF position and 3DOF rotation of the robot. Position and rotation are computed from a set of 3D-2D point correspondences. For each matched landmark the matching algorithm returns 2D point correspondences between the landmark in the current view and the landmark stored in the environment map. As the map contains the 3D parameters of the plane on which the landmark is located, the corresponding 3D coordinates of the point set can be computed by projecting the 2D point matches (from the landmark in the map) onto the corresponding 3D plane. This creates the 3D-2D point correspondences for the pose estimation.

To successfully apply the Lu and Hager pose estimation we have to deal with two issues. First, there exist two possible solutions for the pose. As shown by Schweighofer and Pinz [11], this ambiguity is caused by two local minima of the error function and can result in arbitrary pose jumps. Second, to obtain the best results it is necessary to provide a coarse initial pose (rotation only) for the algorithm.

The first issue is addressed as follows. Most of the time we get more than one corresponding feature. In such a case we verify whether the computed pose is consistent with the other feature correspondences. In detail, we check whether the 2D point correspondences of the other features satisfy the epipolar constraint imposed by the calculated pose. Only for a correct solution will the additional correspondences satisfy the constraint. When only one feature match is present, a wrong solution can be identified if it exhibits an impossible camera configuration, i.e. the reconstruction is not in front of the camera. However, if both solutions provide valid configurations, the correct one cannot always be detected. Such cases can be handled with a higher-level statistical SLAM framework.

Difficulties with the second issue arise because we want to perform global localization. For an accurate solution the pose estimation needs a coarse initial rotation. Without an initial rotation the minimization might stop at a local minimum. In a tracking scenario (which is the usual field of application for such pose estimation algorithms) the initial rotation is given by a previous position. For global localization, however, no previous position is known. In our case the initial rotation is obtained by essential matrix computation (using the 5-point stereo algorithm of Nister [10]). The 5-point algorithm would already return a complete pose estimate, but for our special application (point correspondences located on a plane) the accuracy of the results is not sufficient. In particular, the translation estimate shows a large variance. The estimated rotation, however, works well enough to initialize the Lu and Hager pose estimation. The resulting method is outlined in Algorithm 1.

Algorithm 1 Pose estimation from planar landmarks (sub-sampling method).
  Q ← [] {list to hold possible solutions R, t for pose estimation}
  for all region correspondences do
    project 2D points on plane to create 3D-2D matches
    for i = 1 to n do
      select random subset S of size p from the 3D-2D correspondences
      compute R, t from S using the 5-point algorithm
      3D-2D pose estimation with initial rotation R
      add pose (R, t) to Q if the 3D points are located in front of the camera
    end for
  end for
  for i = 1 to length(Q) do
    calculate mean epipolar distance using R, t from Q(i) on the 2D-2D correspondences over all matching regions
  end for
  return R, t with minimal epipolar distance

4 Experiments

For the experiments a mobile robot was equipped with a single camera and a laser range finder (LRF). The camera has a resolution of 800 × 600 pixels and is equipped with a wide-angle lens. The wide-angle lens has a field of view of 90°, which already causes severe radial distortions at the image borders. The radial distortions are removed by calibration and re-sampling. The robot was steered manually through the test environment office while the camera was continuously capturing images and the LRF was taking readings. The LRF readings were used to create a floor plan and to compute the path of the robot using the software ScanStudio 1). The floor plan is illustrated in Figure 1a). The circles mark the positions of the acquired laser readings, and the robot's path has been interpolated between these positions. Five short-baseline image pairs were selected as sub-maps for the map building algorithm. The result of the map building algorithm (from section 2) is shown in Fig. 1b).

Figure 1: Test environment office: (a) Floor plan created by laser range finder. Red circles mark the positions of laser readings. The robot's path is drawn in black and interpolated between the laser readings. (b) 3D piece-wise planar map (created from the camera images shown in (a)) used in the localization experiments.

In the following we describe two experiments. The first experiment computes the path of the robot from the acquired image data with respect to the computed 3D map, using the algorithm of section 3. The second experiment evaluates the accuracy of the localization algorithm and compares it with standard Lu and Hager pose estimation [6].

4.1 Computing the path of the robot

A part of the robot's path through the test environment is computed using the proposed method for global localization. The corresponding part of the path is marked with the large rectangle in Fig. 1a). For every image acquired within this section of the path the 3D pose of the robot is computed. Fig. 2 shows two views of the environment map augmented with the computed path. The computed poses (camera positions) are depicted as cones which start at the current robot location and point in the direction the robot is facing.

Figure 2: Reconstructed path of the robot: Images show the environment map augmented with robot positions computed for a part of the robot's path. (a) Top view (b) 3D view.

1) ScanStudio is available from http://www.activrobots.com/

4.2 Evaluation of the localization algorithm

Next we investigate the accuracy of the localization algorithm. As a measure of accuracy the epipolar distance between 2D image points and epipolar lines is used. Each pose is computed from a single landmark only. The other detected landmarks are then used to assess the quality of the computed pose by computing the epipolar distance between their 2D point correspondences and the epipolar lines.

The proposed sub-sampling algorithm (Alg. 1) creates n subsets of size p from the point correspondences of a region, so every region generates n solutions, of which the best one is selected. We show in this experiment that in most cases there exists a subset which produces a better solution than computing the pose from all correspondences. Using all correspondences of a landmark for pose estimation is the usual way the Lu and Hager [6] algorithm is applied (LH method). The results of both methods are compared by means of the epipolar distance measure. For our algorithm n was set to 50 and p was set to 10.

Fig. 4 and Table 1 summarize the results. We calculated the poses for 3 different sequences (all part of the robot's path through the office). The table shows the achieved average epipolar distance (of the best solutions), the minimal and maximal distance and the standard deviation, as well as the average area in pixels of the image regions used for the pose estimation. Our proposed algorithm achieves a smaller epipolar error than the LH method in all sequences. The result is even more impressive when looking at individual frames: for frame 15 in sequence 2, for example, the epipolar distance achieved by the LH method was 13.15 pixels while our method came down to 0.75 pixels. This is more than a significant improvement; the solution produced by the LH method is in fact wrong. The differences between both methods are illustrated in Fig. 3, where the positions of a forward motion sequence are computed.
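
The epipolar distance measure and the selection of the best pose hypothesis can be illustrated with a short sketch. It is a sketch under stated assumptions rather than the code used in the experiments: the function names are hypothetical, image points are given in normalized (calibrated) coordinates instead of pixels, a pose (R, t) is assumed to map coordinates of the map view camera into the current camera, and the distance is measured one-sided, from the point in the current view to the epipolar line induced by its match.

```python
import math


def skew(t):
    """Cross-product matrix [t]x of a 3-vector t."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(R, t):
    """E = [t]x R for a relative pose x2 = R x1 + t (assumed convention)."""
    return matmul3(skew(t), R)


def epipolar_distance(E, x1, x2):
    """Distance of x2 from the epipolar line E * x1 (normalized coordinates).

    Undefined (division by zero) if x1 maps exactly onto the epipole.
    """
    p1 = (x1[0], x1[1], 1.0)
    a, b, c = (sum(E[i][k] * p1[k] for k in range(3)) for i in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)


def mean_epipolar_distance(R, t, matches):
    """Average epipolar distance of 2D-2D matches [(x1, x2), ...] for a pose."""
    E = essential_matrix(R, t)
    return sum(epipolar_distance(E, x1, x2) for x1, x2 in matches) / len(matches)


def select_best_pose(candidates, matches):
    """Return the (R, t) hypothesis with minimal mean epipolar distance."""
    return min(candidates,
               key=lambda pose: mean_epipolar_distance(pose[0], pose[1], matches))
```

With n = 50 hypotheses per region, `candidates` would hold up to 50 poses per region and `matches` the 2D-2D correspondences of all matched regions. Multiplying the normalized distance by the focal length in pixels gives an approximate value in pixels.
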
For the forward motion sequence of Fig. 3, all camera positions should be aligned in a row. The result of the sub-sampling method shows only a small deviation from a straight line. The results obtained with the LH method, however, show large deviations; one can hardly see that this should be the path of a pure forward motion.

One column in Table 1 gives the average area in pixels of the image regions from which the poses have been computed. We want to stress the impressive fact that our method achieves pose estimation with an epipolar distance error below 1 pixel with an image region of approximately 400 pixels.

Figure 3: Results for global localization. Images show poses computed for a forward motion sequence using (a) the sub-sampling algorithm and (b) the LH method.

Figure 4: The graph compares the average epipolar distance for our sub-sampling method and the standard Lu and Hager (LH) method [6] for 3 different image sequences (S1, S2, S3) and all sequences together (S1+S2+S3). Our sub-sampling method produces a smaller error than the LH method. [Bar chart; vertical axis: avg. epipolar distance [pixel], 0 to 6; horizontal axis: S1, S2, S3, S1+S2+S3; series: sub-sampling method, LH method.]

Table 1: Epipolar distances for pose estimation using the LH method and for our sub-sampling method. All values are in pixels.

sequence        frames  method        avg. epidist  min. epidist  max. epidist  stddev epidist  avg. patch area
sequence 1      31      LH            1.47          0.98          2.55          0.31            -
sequence 1      31      sub-sampling  1.07          0.87          1.40          0.12            552
sequence 2      21      LH            5.60          0.43          28.33         8.67            -
sequence 2      21      sub-sampling  3.52          0.30          25.22         6.74            201
sequence 3      9       LH            1.53          1.05          1.93          0.29            -
sequence 3      9       sub-sampling  1.09          0.90          1.33          0.14            473
all sequences   61      LH            2.90          0.43          28.33         5.39            -
all sequences   61      sub-sampling  1.92          0.30          25.22         4.06            420

5 Conclusion

We presented a method for visual global localization from a single landmark correspondence only. The prerequisite for the method is a piece-wise planar environment map. Map building is only briefly addressed; the main part focuses on the localization algorithm. The pose of the robot is computed with iterative pose estimation from additional 3D-2D correspondences detected within a single landmark. The pose is estimated with the algorithm of Lu and Hager [6]. However, the analysis showed that a straightforward estimation from all detected correspondences does not produce optimal and robust results. A sub-sampling scheme is introduced to generate multiple hypotheses and select the best solution. The improvements in pose estimation are shown visually and in quantitative assessments. Furthermore, our experiments show that robust pose estimation is still possible from very small landmarks. Even landmarks with an area of about 400 pixels allow robust pose estimation. This allows global localization in extreme situations, such as large occlusions or minimal scene overlap.

References

[1] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2000.
[2] Friedrich Fraundorfer, Konrad Schindler, and Horst Bischof. Piecewise planar scene reconstruction from sparse correspondences. Submitted to Image and Vision Computing.
[3] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In Proc. 8th European Conference on Computer Vision, Prague, Czech Republic, pages I: 228–241, 2004.
[4] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79–116, 1998.
[5] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
[6] C.P. Lu, G.D. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):610–622, June 2000.
[7] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 384–393, 2002.
[8] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, page I: 128 ff., 2002.
[9] Krystian Mikolajczyk and Cordelia Schmid. Indexing based on scale invariant interest points. In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, pages 525–531, 2001.
[10] D. Nister. An efficient solution to the five-point relative pose problem. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pages II: 195–202, 2003.
[11] Gerald Schweighofer and Axel Pinz. Robust pose estimation from a planar target. Technical Report TR-EMT-2005-01, Graz University of Technology, 2005.
[12] Stephen Se, David G. Lowe, and James J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364–375, 2005.
[13] T. Tuytelaars and L. Van Gool. Matching widely separated views based on affine invariant regions. International Journal of Computer Vision, 59(1):61–85, 2004.