Improving Image-Based Localization Through Increasing Correct Feature Correspondences
Guoyu Lu 1, Vincent Ly 1, Haoquan Shen 2, Abhishek Kolagunda 1, and Chandra Kambhamettu 1
1 Video/Image Modeling and Synthesis (VIMS) Lab, Department of Computer and Information Sciences, University of Delaware
2 Zhejiang University, China
Lecture Notes in Computer Science

Abstract. Image-based localization provides contextual information, such as the camera position, from a single query image. Current state-of-the-art methods use a 3D Structure-from-Motion reconstruction to localize the query image, either by 2D-to-3D matching or by 3D-to-2D matching. By adding camera pose estimation, such a system can localize an image more accurately. However, incorrect feature correspondences between the 2D image and the 3D reconstruction remain the main cause of failure in image localization. In this paper, we introduce a feature-embedding step that reduces incorrect feature correspondences, and we perform query expansion to add correspondences whose associated 3D points have a high probability of being seen by the same cameras as a seed set. Using these techniques, registration accuracy is significantly improved. Experiments on several large image datasets show that our method outperforms most state-of-the-art methods.

1 Introduction

Image-based localization is the task of estimating the camera position from a photograph. Given an image captured by a camera, an image-based localization system can compute the camera position and navigate the user. This application has attracted increasing attention in multiple areas, such as robot localization [1] and landmark recognition [2]. Image-based localization is particularly important where the GPS signal is weak, such as around large buildings. Image-based localization was first attempted by searching a database of city images: the image best matching the query is retrieved as the location indicator.
With the development of Structure-from-Motion (SfM) reconstruction techniques, a 3D model can be generated from a set of city building images. State-of-the-art image-based localization methods match a 2D query image against a 3D model built by SfM reconstruction to estimate the camera orientation, which achieves higher localization accuracy. The SIFT feature [3] is used in the image-based localization task as the local feature for determining correspondences. SIFT is invariant to scaling and rotation and partially invariant to illumination change, and it has been successfully applied to object recognition, 3D reconstruction, motion tracking and many other computer vision tasks. However, since SIFT emphasizes large bin values in the distance computation, large descriptor values can produce incorrect correspondences. Moreover, the descriptor space of a 3D SfM reconstruction is much denser than that of a 2D image, which dramatically increases the chance of incorrect correspondences and results in a less stable configuration for camera pose estimation.

In this paper, we add more high-confidence 2D-to-3D correspondences using query expansion. The 3D points of the selected correspondences have the highest probability of being seen by the same cameras as the seed points; these correspondences increase the chance of successfully estimating the camera pose. In addition, we use the Hellinger kernel for computing the distance between descriptors: instead of comparing descriptors in Euclidean space, the Hellinger kernel compares L1-normalized descriptors, which mitigates the impact of extreme bin values. The newly learnt descriptor also allows corresponding descriptors to be more reliably assigned to the same visual word, which we use to search for nearest-neighbor descriptors.

The paper is organized as follows. Section 2 reviews related work on image-based localization and descriptor learning. Section 3 introduces the image-based localization pipeline, and Section 4 our dual-checking step. Section 5 presents the query expansion method. Section 6 discusses the Hellinger kernel in SIFT descriptor similarity computation. Section 7 presents our localization results and analysis. Section 8 concludes the paper.

2 Related Work

Image-based localization matches the query image against an image database or a 3D reconstruction model. Compared to GPS navigation, image-based localization can still be employed around large buildings and provides higher localization accuracy [4].
Originally, image-based localization used a database of building-facade views, associated with a 3D coordinate system, to estimate the pose of the query image [5]. Similarly, [6] searches an image database for the closest image in descriptor space to localize in urban scenes. A vocabulary-tree method is used in [4] to achieve real-time pose estimation. Xiao et al. [7] use a bag-of-words method together with geometric verification to improve object localization accuracy. Irschara et al. [8] propose retrieving the images containing the most descriptors matching the 3D points, and [9] realizes 3D-to-2D matching through mutual visibility information. [10] uses visibility information between points and cameras to choose points for camera pose estimation. Sattler et al. [11] propose directly matching the descriptors extracted from the 2D image against the descriptors of the 3D points to improve localization accuracy.

Local features are widely used in image retrieval. To achieve better retrieval accuracy with local features, [12, 13] learn a lower-dimensional embedding from labeled match and non-match pairs. Philbin et al. [14] classify descriptor pairs into three groups (positive pairs, nearest-neighbor pairs and random negative pairs) and learn a projection matrix by minimizing a margin-based cost function over the three groups. Hashing methods are also used to reduce descriptor quantization error. Kulis et al. [15] introduce a scalable coordinate-descent algorithm that learns hash functions by minimizing the error caused by binary embedding. Yagnik et al. [16] present a feature-embedding method (WTA-Hash) based on partial order statistics, which can be extended to polynomial kernels. [17] proposes LDAHash, which learns a projection matrix resembling classic LDA and quantizes descriptors into Hamming space; the resulting lower-dimensional binary descriptors increase image retrieval accuracy. [18] adds strong spatial constraints to verify the returned images, suppressing false positives in query expansion. [19] uses a linear SVM to discriminatively learn a weight vector for re-querying, achieving a significant improvement over standard query expansion.

3 Image-Based Localization

The goal of image-based localization is to navigate the user based on images captured by a mobile device. The user takes a photo of his or her surroundings and sends it to the image-based localization system. By matching the query image against the system's 3D model and estimating the camera pose, the user receives navigation and location information, as shown in Figure 1.

Fig. 1: 2D-to-3D image matching

Image-based localization systems originally searched an image database for the best candidate, i.e., the image with the most feature correspondences. With the development of Structure-from-Motion reconstruction, 3D models are used in image-based localization, which allows better orientation estimation. Compared to 2D images, the descriptor space of a 3D SfM reconstruction is much denser. Our image-based localization method uses direct 2D-to-3D matching [11]: the basic idea is to find correspondences between the 2D features and the 3D points.
Correspondences are determined by searching for each 2D descriptor's nearest neighbors among all descriptors of the 3D model. To accelerate matching, all 3D descriptors are clustered into 100k visual words using the k-means algorithm [20]. Each descriptor extracted from the 2D image is assigned to one of the visual words, which restricts the search for nearest-neighbor descriptors. The search ends once enough correspondences are found.
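The visual-word quantization described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the descriptors are random stand-ins for 128-D SIFT vectors, and 64 words replace the 100k used in the pipeline.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)

# Stand-ins for SIFT descriptors (128-D); real ones would come from the SfM
# model and from the query image.
descriptors_3d = rng.random((1000, 128)).astype(np.float32)
descriptors_2d = rng.random((50, 128)).astype(np.float32)

# Cluster the 3D descriptors into visual words (100k in the paper; 64 here).
centroids, word_of_3d = kmeans2(descriptors_3d, k=64, minit="++", seed=0)

# Assign each 2D query descriptor to its nearest visual-word centroid.
word_of_2d, _ = vq(descriptors_2d, centroids)

# Candidate matches for a query descriptor are only the 3D descriptors that
# fall in the same visual word, which is what makes the search tractable.
q = 0
candidates = np.flatnonzero(word_of_3d == word_of_2d[q])
print(len(candidates))
```

Restricting nearest-neighbor search to one word trades a small amount of recall for a large constant-factor speedup, which is why the clustering can be done once, offline.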
A kd-tree-based approach, supported by the FLANN library [21], is used to find approximate nearest-neighbor descriptors. Each 3D point is represented by the mean of all descriptors belonging to that point. As with the 3D descriptors, descriptors in the 2D image are assigned to visual words using the same centroids, and 2D-to-3D correspondences are searched for within the same visual word. A 2D-to-3D correspondence is accepted if the two nearest neighbors of the 2D query descriptor pass the SIFT ratio test. If more than one 2D feature matches the same 3D point, the 2D feature with the smallest Euclidean distance is selected. Once 100 correspondences are found, we stop searching; these correspondences are used in the later pose estimation. The threshold of 100 is chosen to balance registration speed against registration accuracy. Only images with at least 12 inliers are registered; a registered image is one correctly matched to the 3D model so that its camera pose is known. Inliers are found by the Random Sample Consensus (RANSAC) algorithm [22], using the 6-point direct linear transformation (6-point DLT) [23] to estimate the camera pose. In L2 space, a correspondence between a 2D feature and a 3D point (2df, 3dp) is accepted if the squared distances to the 2D feature's two nearest-neighbor 3D descriptors satisfy

D(2df, 3dp_1) / D(2df, 3dp_2) < \tau    (1)

D(2df, 3dp) = \sum_{i=1}^{d} (2df_i - 3dp_i)^2    (2)

In Equation 2, 2df and 3dp denote the descriptors belonging to the 2D feature and the 3D point, d is the descriptor dimensionality, and \tau is the ratio-test threshold on squared distances.

4 Dual Checking

We add a reverse-checking step to further filter out unstable correspondences. Each 3D point that passes the ratio test in the previous step is matched back to features in the 2D image.
We record the distance between the mean descriptor of the 3D point and its nearest-neighbor 2D descriptor, as well as the distance to the second-nearest-neighbor 2D descriptor. The correspondence is accepted if the 3D point passes the ratio test:

D(3dp, 2df_1) / D(3dp, 2df_2) < 0.64    (3)

D(3dp, 2df) = \sum_{i=1}^{d} (3dp_i - 2df_i)^2    (4)
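The bidirectional test can be sketched as below. The toy descriptors are 2-D stand-ins for 128-D SIFT vectors; the reverse threshold 0.64 follows Equation 3, while the forward threshold value is not stated in the text, so t_fwd = 0.49 here is an assumption.

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors (Eqs. 2 and 4)."""
    return float(((a - b) ** 2).sum())

def ratio_test(q, others, thresh):
    """SIFT ratio test on squared distances: the nearest neighbor of q among
    `others` must be sufficiently closer than the second nearest."""
    d = sorted(sq_dist(q, o) for o in others)
    return d[0] / d[1] < thresh

def dual_check(feat_2d, all_feats_2d, point_3d, all_points_3d,
               t_fwd=0.49, t_rev=0.64):
    """Accept the (2D feature, 3D point) pair only if the ratio test passes
    in both directions (t_fwd is assumed; t_rev = 0.64 follows Eq. 3)."""
    return (ratio_test(feat_2d, all_points_3d, t_fwd) and
            ratio_test(point_3d, all_feats_2d, t_rev))

points_3d = [np.array([0., 0.]), np.array([10., 10.]), np.array([20., 0.])]
feats_2d = [np.array([0.1, 0.]), np.array([9., 9.]), np.array([30., 30.])]

print(dual_check(feats_2d[0], feats_2d, points_3d[0], points_3d))   # → True
print(dual_check(np.array([5., 5.]), feats_2d, points_3d[0], points_3d))  # → False
```

The second call fails because the query sits equidistant from two 3D points, so the forward ratio is 1.0 — exactly the kind of ambiguous correspondence the test is meant to reject.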
In [11], the authors rank the visual words by size (the number of pairs formed by a 3D point and a 2D descriptor) from small to large to reduce the search cost. In our paper, we instead count the number of 3D points in each visual word and sort the visual words in decreasing order, since experiments show that visual words containing more points are more likely to yield 3D points that pass the SIFT ratio test in both directions. Within each visual word, points are sorted by the number of cameras in which the point is visible: the more cameras see a point, the higher its search priority.

5 Query Expansion

Query expansion is widely used in web search engines to augment search results by adding keywords. We use query expansion in our localization pipeline to augment the list of possible correspondences. Inspired by the point-selection method based on joint camera visibility [10], we treat the 3D points of all currently accepted 2D-to-3D correspondences as base seeds to be expanded. Since multiple points can be seen by the same camera, we choose points that are jointly visible with the base seeds, as in Equation 5:

Prob(P1, P2) = \sum_{P1 \in S} |C(P1) \cap C(P2)| / N    (5)

Here S is the base seed set, P1 is a 3D point in the seed set, and P2 is a candidate 3D point outside the seed set; C(P) denotes the set of cameras that see point P, and N is the total number of cameras. Prob is the probability that the two points are visible in the same camera, computed as the number of cameras that see both P1 and P2 divided by the total number of cameras. We define a ratio thresholding the minimum joint visibility a point must reach before being considered. Among the points passing this threshold, points are ranked by the sum of their physical distances to all points in the base seed set, as in Equation 6.
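The joint-visibility score (Eq. 5) and the seed-distance ranking (Eq. 6) can be sketched on toy data as follows; the visibility sets, point coordinates, and the 0.2 threshold are all made up for illustration.

```python
import numpy as np

# Hypothetical SfM visibility data: cameras_of[p] is the set of camera ids
# observing 3D point p, and xyz[p] is the point's 3D position.
cameras_of = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {3}}
xyz = {0: np.array([0., 0., 0.]), 1: np.array([1., 0., 0.]),
       2: np.array([5., 5., 0.]), 3: np.array([9., 9., 9.])}
n_cameras = 4
seeds = {0}  # 3D points of the currently accepted correspondences

def covis_prob(p2, seeds):
    """Eq. 5: summed fraction of cameras seeing both a seed and candidate p2."""
    return sum(len(cameras_of[p1] & cameras_of[p2]) for p1 in seeds) / n_cameras

def seed_distance(p2, seeds):
    """Eq. 6: summed Euclidean distance from candidate p2 to every seed."""
    return sum(np.linalg.norm(xyz[p1] - xyz[p2]) for p1 in seeds)

# Keep candidates likely to be co-visible with the seeds, then rank them so
# that points far from the seed set come first (they stabilize the pose).
candidates = [p for p in cameras_of if p not in seeds
              and covis_prob(p, seeds) > 0.2]
expansion = sorted(candidates, key=lambda p: seed_distance(p, seeds),
                   reverse=True)
print(expansion)  # → [2, 1]
```

Point 3 is dropped because no camera sees it together with the seed; of the survivors, point 2 outranks point 1 because it lies farther from the seed set.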
The point with the largest summed distance is given the highest priority, as widely separated points aid pose estimation. We select the 100 highest-priority points and search for their nearest-neighbor features in the 2D image as correspondences:

Distance(P1, P2) = \sum_{P1 \in S} \sqrt{(P1_x - P2_x)^2 + (P1_y - P2_y)^2 + (P1_z - P2_z)^2}    (6)

Here the distance is measured in 3D coordinates, and P1_x, P1_y and P1_z denote the coordinates of P1 in the X, Y and Z directions.

6 Feature Learning

Since the SIFT feature is used for the Structure-from-Motion reconstruction of our 3D model, we also use SIFT in our image-based localization task. SIFT is commonly used in image retrieval due to its useful properties: invariance to rotation, scaling and illumination change. However, compared to common image retrieval tasks, image-based localization is more challenging because the descriptors of the 3D model are much denser than those extracted from 2D images. This greater density yields many matches over a smaller region, adding a large number of incorrect correspondences, and more incorrect correspondences lead to poorer camera pose estimation. In [24], the authors note that only a few components of a SIFT descriptor dominate the similarity computation; additionally, sign information is lost with L2-normalized descriptors. For these reasons, SIFT still falls short of providing sufficient correct correspondences for image-based localization. To overcome these limitations, [19] proposed using the Hellinger kernel to compare descriptors. Instead of computing the Euclidean distance as in Equation 7,

D(X, Y) = \|X - Y\|^2 = \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} x_i y_i    (7)

the similarity between two descriptors is calculated as in Equation 8:

H(X, Y) = \sum_{i=1}^{n} \sqrt{x_i y_i}    (8)

X and Y are two descriptor vectors, with components x_i and y_i. SIFT originally uses L2 normalization; with the Hellinger kernel, we instead L1-normalize the SIFT descriptors before comparing two vectors. The Hellinger kernel reduces the influence of large bin values while making small bin values more substantial, which helps reject incorrect feature correspondences.

7 Experimental Results

To evaluate the performance of our proposed method, we conducted experiments using the new learnt descriptors, projected via the Hellinger kernel, on three challenging datasets: Dubrovnik [9], Vienna [8] and the Aachen dataset [25].
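Concretely, the Hellinger-kernel comparison of Section 6 amounts to L1-normalizing each descriptor and taking the element-wise square root (the "RootSIFT" trick of [19]), after which plain Euclidean distance reproduces the kernel. A minimal sketch with toy descriptor values (real inputs would be non-negative 128-D SIFT histograms):

```python
import numpy as np

def hellinger_embed(desc, eps=1e-12):
    """L1-normalize a (non-negative) SIFT descriptor, then take element-wise
    square roots, so that dot products in the new space equal the Hellinger
    kernel of Eq. 8."""
    desc = np.asarray(desc, dtype=np.float64)
    return np.sqrt(desc / (desc.sum() + eps))

x = np.array([100., 4., 1., 0.])   # toy descriptor with one dominant bin
y = np.array([90., 10., 2., 3.])

hx, hy = hellinger_embed(x), hellinger_embed(y)

# The embedded vectors are L2-normalized by construction ...
assert np.isclose((hx ** 2).sum(), 1.0, atol=1e-6)
# ... and their dot product equals the Hellinger kernel (Eq. 8) of the
# L1-normalized inputs, so Euclidean distance on hx, hy compares descriptors
# under the Hellinger kernel.
k = np.sum(np.sqrt((x / x.sum()) * (y / y.sum())))
assert np.isclose(hx @ hy, k)
```

Because the square root compresses large bins and expands small ones, the dominant bin no longer drowns out the rest of the descriptor in the distance computation, which is exactly the effect the text describes.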
Dubrovnik is a large dataset whose 3D model is reconstructed from Flickr photos. Some images are removed from the reconstruction, together with their descriptors and the 3D points seen by only one camera; these removed images serve as the query images. The query images of Vienna have a maximum dimension of 1600 pixels in width and height; the 266 query images for the Vienna dataset are selected from the Panoramio website. The images in the Aachen dataset were collected over a two-year period with different cameras. The query images are free of typical mobile-phone camera shortcomings, such as motion blur and lack of focus. The datasets represent several different scenarios: the Vienna images are taken at uniform intervals of urban scenes; Dubrovnik depicts the large, clustered sets of views typically found on Internet photo-collection websites; and Aachen contains different lighting and weather conditions, as well as occlusions by construction sites. Detailed information is given in Table 1.

Dataset     #3D points   #descriptors   Size (MB)   #query images
Dubrovnik   1,886,884    9,606,317      1,…         …
Vienna      1,123,028    4,854,…        …           266
Aachen      1,540,786    7,281,501      1,…         …

Table 1: The datasets used for evaluation. Size describes the size of the binary .info file containing all descriptors and 3D point information.

In the 2D-to-3D localization pipeline, all 3D descriptors are assigned to visual words. The query descriptors of the 2D image are likewise assigned to the visual word whose center is nearest. After a query descriptor is assigned to a visual word, its correspondence is found via nearest-neighbor search. Using the new descriptors learned with the Hellinger kernel, we re-cluster all descriptors into visual words with k-means and search for correspondences through the newly learned visual words. Results compared with state-of-the-art methods are shown in Table 2.

Method                              Dubrovnik   Vienna   Aachen
P2F [9]                             …           …        …
Voc. tree (all) [9]                 …           …        …
Fast Direct 2D-to-3D [11]           …           …        …
Voc. tree GPU [8]                   …           …        …
2D-to-3D Hellinger kernel (ours)    …           …        …

Table 2: Registered images for our method and for state-of-the-art methods.

From Table 2, the new system outperforms most state-of-the-art methods in localization accuracy. The newly learnt descriptor requires no additional memory, and since learning the descriptors and forming the visual words can be done offline, the new system does not slow down the localization of an image. Figure 2 shows examples of images that are registered by the new localization pipeline.
These images fail to be registered in the original pipeline. As the examples show, images with shadow and even large rotation can be registered with our method, yet fail with the old system. Figure 3 shows examples that fail in the new system: localization fails for images with significant illumination change, and images largely dominated by people also fail to register. In these cases, the salient parts of the image yield no features corresponding to those in the 3D reconstruction model, which causes camera pose estimation to fail.
Fig. 2: Image examples (a–f) registered by the new localization pipeline; none of them can be registered by the original pipeline.

Fig. 3: Image examples (a–f) that fail to be registered by the new localization pipeline; their camera pose estimation is unsuccessful.
8 Conclusion

In this paper, we perform two-way feature matching between 3D points and 2D image features (from 2D to 3D and then from 3D to 2D), which yields reliable correspondences and a seed set. We then add a query expansion step that augments our initial list of correspondences with correspondences whose 3D points have a high probability of being jointly visible with a seed point in an image; these correspondences benefit the camera pose estimation in the final step. We also use the Hellinger kernel to learn new descriptors and apply them to image-based localization, which is much more challenging than common image retrieval problems. Without requiring additional time or memory, our system dramatically improves localization accuracy. We expect that the image registration rate and speed can be further improved by pruning less informative points.

9 Acknowledgement

This work was made possible by an NSF CDI-Type I grant.

References

1. Meier, L., Tanskanen, P., Fraundorfer, F., Pollefeys, M.: Pixhawk: A system for autonomous flight using onboard computer vision. In: ICRA (2011)
2. Chen, D., Baatz, G., Koser, K., Tsai, S., Vedantham, R., Pylvanainen, T., Roimela, K., Chen, X., Bach, J., Pollefeys, M., Girod, B., Grzeszczuk, R.: City-scale landmark identification on mobile devices. In: CVPR (2011)
3. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60 (2004)
4. Steinhoff, U., O., D., Perko, R., Schiele, B., Leonardis, A.: How computer vision can help in outdoor positioning. In: European Conference on Ambient Intelligence, AmI'07 (2007)
5. Robertson, D., Cipolla, R.: An image-based system for urban navigation. In: BMVC (2004)
6. Zhang, W., Kosecka, J.: Image based localization in urban environments. In: 3DPVT (2006)
7. Xiao, J., Chen, J., Yeung, D., Quan, L.: Structuring visual words in 3d for arbitrary-view object localization. In: ECCV (2008)
8. Irschara, A., Zach, C., Frahm, J., Bischof, H.: From structure-from-motion point clouds to fast location recognition. In: CVPR (2009)
9. Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: ECCV (2010)
10. Choudhary, S., Narayanan, P.: Visibility probability structure from sfm datasets and applications. In: ECCV (2012)
11. Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2d-to-3d matching. In: ICCV (2011)
12. Hua, G., Brown, M., Winder, S.: Discriminant embedding for local image descriptors. In: ICCV (2007)
13. Winder, S., Hua, G., Brown, M.: Picking the best daisy. In: CVPR (2009)
14. Philbin, J., Isard, M., Sivic, J., Zisserman, A.: Descriptor learning for efficient retrieval. In: ECCV (2010)
15. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In: NIPS (2009)
16. Yagnik, J., Strelow, D., Ross, D., Lin, R.-S.: The power of comparative reasoning. In: ICCV (2011)
17. Strecha, C., Bronstein, A., Bronstein, M., Fua, P.: LDAHash: Improved matching with smaller descriptors. TPAMI 34 (2012)
18. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: Automatic query expansion with a generative feature model for object retrieval. In: ICCV (2007)
19. Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR (2012)
20. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
21. Muja, M., Lowe, D.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (2009)
22. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (1981)
23. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2004)
24. Jain, M., Benmokhtar, R., Gros, P., Jegou, H.: Hamming embedding similarity-based image classification. In: ICMR (2012)
25. Sattler, T., Weyand, T., Leibe, B., Kobbelt, L.: Image retrieval for image-based localization revisited. In: BMVC (2012)
Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59
More informationLarge scale object/scene recognition
Large scale object/scene recognition Image dataset: > 1 million images query Image search system ranked image list Each image described by approximately 2000 descriptors 2 10 9 descriptors to index! Database
More informationHamming embedding and weak geometric consistency for large scale image search
Hamming embedding and weak geometric consistency for large scale image search Herve Jegou, Matthijs Douze, and Cordelia Schmid INRIA Grenoble, LEAR, LJK firstname.lastname@inria.fr Abstract. This paper
More informationA Scalable Collaborative Online System for City Reconstruction
A Scalable Collaborative Online System for City Reconstruction Ole Untzelmann, Torsten Sattler, Sven Middelberg and Leif Kobbelt RWTH Aachen University ole.untzelmann@rwth-aachen.de, {tsattler, middelberg,
More informationK-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors
K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors Shao-Tzu Huang, Chen-Chien Hsu, Wei-Yen Wang International Science Index, Electrical and Computer Engineering waset.org/publication/0007607
More informationLocal features: detection and description. Local invariant features
Local features: detection and description Local invariant features Detection of interest points Harris corner detection Scale invariant blob detection: LoG Description of local patches SIFT : Histograms
More informationarxiv: v1 [cs.cv] 28 Sep 2018
Extrinsic camera calibration method and its performance evaluation Jacek Komorowski 1 and Przemyslaw Rokita 2 arxiv:1809.11073v1 [cs.cv] 28 Sep 2018 1 Maria Curie Sklodowska University Lublin, Poland jacek.komorowski@gmail.com
More informationAutomatic Image Alignment (feature-based)
Automatic Image Alignment (feature-based) Mike Nese with a lot of slides stolen from Steve Seitz and Rick Szeliski 15-463: Computational Photography Alexei Efros, CMU, Fall 2006 Today s lecture Feature
More informationAn Evaluation of Two Automatic Landmark Building Discovery Algorithms for City Reconstruction
An Evaluation of Two Automatic Landmark Building Discovery Algorithms for City Reconstruction Tobias Weyand, Jan Hosang, and Bastian Leibe UMIC Research Centre, RWTH Aachen University {weyand,hosang,leibe}@umic.rwth-aachen.de
More informationImage Retrieval with a Visual Thesaurus
2010 Digital Image Computing: Techniques and Applications Image Retrieval with a Visual Thesaurus Yanzhi Chen, Anthony Dick and Anton van den Hengel School of Computer Science The University of Adelaide
More informationVisual localization using global visual features and vanishing points
Visual localization using global visual features and vanishing points Olivier Saurer, Friedrich Fraundorfer, and Marc Pollefeys Computer Vision and Geometry Group, ETH Zürich, Switzerland {saurero,fraundorfer,marc.pollefeys}@inf.ethz.ch
More informationPreviously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011
Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition
More informationFROM COARSE TO FINE: QUICKLY AND ACCURATELY OBTAINING INDOOR IMAGE-BASED LOCALIZATION UNDER VARIOUS ILLUMINATIONS. Guoyu Lu
FROM COARSE TO FINE: QUICKLY AND ACCURATELY OBTAINING INDOOR IMAGE-BASED LOCALIZATION UNDER VARIOUS ILLUMINATIONS by Guoyu Lu A dissertation submitted to the Faculty of the University of Delaware in partial
More informationEvaluation of GIST descriptors for web scale image search
Evaluation of GIST descriptors for web scale image search Matthijs Douze Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg and Cordelia Schmid INRIA Grenoble, France July 9, 2009 Evaluation of GIST for
More informationLocal Image Features
Local Image Features Computer Vision Read Szeliski 4.1 James Hays Acknowledgment: Many slides from Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Flashed Face Distortion 2nd Place in the 8th Annual Best
More informationThe SIFT (Scale Invariant Feature
The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical
More informationRobotics Programming Laboratory
Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car
More informationFeature Based Registration - Image Alignment
Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html
More informationLocation Recognition using Prioritized Feature Matching
Location Recognition using Prioritized Feature Matching Yunpeng Li Noah Snavely Daniel P. Huttenlocher Department of Computer Science, Cornell University, Ithaca, NY 14853 {yuli,snavely,dph}@cs.cornell.edu
More informationHandling Urban Location Recognition as a 2D Homothetic Problem
Handling Urban Location Recognition as a 2D Homothetic Problem Georges Baatz 1, Kevin Köser 1, David Chen 2, Radek Grzeszczuk 3, and Marc Pollefeys 1 1 Department of Computer Science, ETH Zurich, Switzerland
More informationHomographies and RANSAC
Homographies and RANSAC Computer vision 6.869 Bill Freeman and Antonio Torralba March 30, 2011 Homographies and RANSAC Homographies RANSAC Building panoramas Phototourism 2 Depth-based ambiguity of position
More informationMultiple-Choice Questionnaire Group C
Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right
More informationMidterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability
Midterm Wed. Local features: detection and description Monday March 7 Prof. UT Austin Covers material up until 3/1 Solutions to practice eam handed out today Bring a 8.5 11 sheet of notes if you want Review
More informationLecture 12 Recognition
Institute of Informatics Institute of Neuroinformatics Lecture 12 Recognition Davide Scaramuzza 1 Lab exercise today replaced by Deep Learning Tutorial Room ETH HG E 1.1 from 13:15 to 15:00 Optional lab
More informationMethods for Representing and Recognizing 3D objects
Methods for Representing and Recognizing 3D objects part 1 Silvio Savarese University of Michigan at Ann Arbor Object: Building, 45º pose, 8-10 meters away Object: Person, back; 1-2 meters away Object:
More informationPhoto Tourism: Exploring Photo Collections in 3D
Photo Tourism: Exploring Photo Collections in 3D SIGGRAPH 2006 Noah Snavely Steven M. Seitz University of Washington Richard Szeliski Microsoft Research 2006 2006 Noah Snavely Noah Snavely Reproduced with
More informationSIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014
SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image
More informationarxiv: v1 [cs.cv] 26 Dec 2013
Finding More Relevance: Propagating Similarity on Markov Random Field for Image Retrieval Peng Lu a, Xujun Peng b, Xinshan Zhu c, Xiaojie Wang a arxiv:1312.7085v1 [cs.cv] 26 Dec 2013 a Beijing University
More informationEnhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention
Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Anand K. Hase, Baisa L. Gunjal Abstract In the real world applications such as landmark search, copy protection, fake image
More informationFuzzy based Multiple Dictionary Bag of Words for Image Classification
Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image
More informationPerception IV: Place Recognition, Line Extraction
Perception IV: Place Recognition, Line Extraction Davide Scaramuzza University of Zurich Margarita Chli, Paul Furgale, Marco Hutter, Roland Siegwart 1 Outline of Today s lecture Place recognition using
More informationViewpoint Invariant Features from Single Images Using 3D Geometry
Viewpoint Invariant Features from Single Images Using 3D Geometry Yanpeng Cao and John McDonald Department of Computer Science National University of Ireland, Maynooth, Ireland {y.cao,johnmcd}@cs.nuim.ie
More informationSUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS
SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract
More informationLocal Image Features
Local Image Features Computer Vision CS 143, Brown Read Szeliski 4.1 James Hays Acknowledgment: Many slides from Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial This section: correspondence and alignment
More informationBeyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba
Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets
More informationDeterminant of homography-matrix-based multiple-object recognition
Determinant of homography-matrix-based multiple-object recognition 1 Nagachetan Bangalore, Madhu Kiran, Anil Suryaprakash Visio Ingenii Limited F2-F3 Maxet House Liverpool Road Luton, LU1 1RS United Kingdom
More informationPart-based and local feature models for generic object recognition
Part-based and local feature models for generic object recognition May 28 th, 2015 Yong Jae Lee UC Davis Announcements PS2 grades up on SmartSite PS2 stats: Mean: 80.15 Standard Dev: 22.77 Vote on piazza
More informationStructuring a Sharded Image Retrieval Database
Structuring a Sharded Image Retrieval Database Eric Liang and Avideh Zakhor Department of Electrical Engineering and Computer Science, University of California, Berkeley {ekhliang, avz}@eecs.berkeley.edu
More informationA Novel Method for Image Retrieving System With The Technique of ROI & SIFT
A Novel Method for Image Retrieving System With The Technique of ROI & SIFT Mrs. Dipti U.Chavan¹, Prof. P.B.Ghewari² P.G. Student, Department of Electronics & Tele. Comm. Engineering, Ashokrao Mane Group
More informationIntroduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale.
Distinctive Image Features from Scale-Invariant Keypoints David G. Lowe presented by, Sudheendra Invariance Intensity Scale Rotation Affine View point Introduction Introduction SIFT (Scale Invariant Feature
More informationVideo Google: A Text Retrieval Approach to Object Matching in Videos
Video Google: A Text Retrieval Approach to Object Matching in Videos Josef Sivic, Frederik Schaffalitzky, Andrew Zisserman Visual Geometry Group University of Oxford The vision Enable video, e.g. a feature
More informationPlace Recognition and Online Learning in Dynamic Scenes with Spatio-Temporal Landmarks
LEARNING IN DYNAMIC SCENES WITH SPATIO-TEMPORAL LANDMARKS 1 Place Recognition and Online Learning in Dynamic Scenes with Spatio-Temporal Landmarks Edward Johns and Guang-Zhong Yang ej09@imperial.ac.uk
More informationLearning a Fine Vocabulary
Learning a Fine Vocabulary Andrej Mikulík, Michal Perdoch, Ondřej Chum, and Jiří Matas CMP, Dept. of Cybernetics, Faculty of EE, Czech Technical University in Prague Abstract. We present a novel similarity
More informationEE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm
EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant
More informationEfficient Representation of Local Geometry for Large Scale Object Retrieval
Efficient Representation of Local Geometry for Large Scale Object Retrieval Michal Perďoch Ondřej Chum and Jiří Matas Center for Machine Perception Czech Technical University in Prague IEEE Computer Society
More informationEFFECTIVE FISHER VECTOR AGGREGATION FOR 3D OBJECT RETRIEVAL
EFFECTIVE FISHER VECTOR AGGREGATION FOR 3D OBJECT RETRIEVAL Jean-Baptiste Boin, André Araujo, Lamberto Ballan, Bernd Girod Department of Electrical Engineering, Stanford University, CA Media Integration
More informationLecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013
Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest
More informationObject Recognition and Augmented Reality
11/02/17 Object Recognition and Augmented Reality Dali, Swans Reflecting Elephants Computational Photography Derek Hoiem, University of Illinois Last class: Image Stitching 1. Detect keypoints 2. Match
More informationEfficient Re-ranking in Vocabulary Tree based Image Retrieval
Efficient Re-ranking in Vocabulary Tree based Image Retrieval Xiaoyu Wang 2, Ming Yang 1, Kai Yu 1 1 NEC Laboratories America, Inc. 2 Dept. of ECE, Univ. of Missouri Cupertino, CA 95014 Columbia, MO 65211
More informationCS 4495 Computer Vision A. Bobick. CS 4495 Computer Vision. Features 2 SIFT descriptor. Aaron Bobick School of Interactive Computing
CS 4495 Computer Vision Features 2 SIFT descriptor Aaron Bobick School of Interactive Computing Administrivia PS 3: Out due Oct 6 th. Features recap: Goal is to find corresponding locations in two images.
More informationRegion Graphs for Organizing Image Collections
Region Graphs for Organizing Image Collections Alexander Ladikos 1, Edmond Boyer 2, Nassir Navab 1, and Slobodan Ilic 1 1 Chair for Computer Aided Medical Procedures, Technische Universität München 2 Perception
More informationCS 4495 Computer Vision Classification 3: Bag of Words. Aaron Bobick School of Interactive Computing
CS 4495 Computer Vision Classification 3: Bag of Words Aaron Bobick School of Interactive Computing Administrivia PS 6 is out. Due Tues Nov 25th, 11:55pm. One more assignment after that Mea culpa This
More informationTRECVid 2013 Experiments at Dublin City University
TRECVid 2013 Experiments at Dublin City University Zhenxing Zhang, Rami Albatal, Cathal Gurrin, and Alan F. Smeaton INSIGHT Centre for Data Analytics Dublin City University Glasnevin, Dublin 9, Ireland
More informationLocal invariant features
Local invariant features Tuesday, Oct 28 Kristen Grauman UT-Austin Today Some more Pset 2 results Pset 2 returned, pick up solutions Pset 3 is posted, due 11/11 Local invariant features Detection of interest
More informationMotion Estimation and Optical Flow Tracking
Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction
More informationGeneralized RANSAC framework for relaxed correspondence problems
Generalized RANSAC framework for relaxed correspondence problems Wei Zhang and Jana Košecká Department of Computer Science George Mason University Fairfax, VA 22030 {wzhang2,kosecka}@cs.gmu.edu Abstract
More informationA Rapid Automatic Image Registration Method Based on Improved SIFT
Available online at www.sciencedirect.com Procedia Environmental Sciences 11 (2011) 85 91 A Rapid Automatic Image Registration Method Based on Improved SIFT Zhu Hongbo, Xu Xuejun, Wang Jing, Chen Xuesong,
More informationOnline Learning of Binary Feature Indexing for Real-time SLAM Relocalization
Online Learning of Binary Feature Indexing for Real-time SLAM Relocalization Youji Feng 1, Yihong Wu 1, Lixin Fan 2 1 Institute of Automation, Chinese Academy of Sciences 2 Nokia Research Center, Tampere
More informationLight-Weight Spatial Distribution Embedding of Adjacent Features for Image Search
Light-Weight Spatial Distribution Embedding of Adjacent Features for Image Search Yan Zhang 1,2, Yao Zhao 1,2, Shikui Wei 3( ), and Zhenfeng Zhu 1,2 1 Institute of Information Science, Beijing Jiaotong
More informationLearning Affine Robust Binary Codes Based on Locality Preserving Hash
Learning Affine Robust Binary Codes Based on Locality Preserving Hash Wei Zhang 1,2, Ke Gao 1, Dongming Zhang 1, and Jintao Li 1 1 Advanced Computing Research Laboratory, Beijing Key Laboratory of Mobile
More informationSpatial Coding for Large Scale Partial-Duplicate Web Image Search
Spatial Coding for Large Scale Partial-Duplicate Web Image Search Wengang Zhou, Yijuan Lu 2, Houqiang Li, Yibing Song, Qi Tian 3 Dept. of EEIS, University of Science and Technology of China, Hefei, P.R.
More informationCompressed local descriptors for fast image and video search in large databases
Compressed local descriptors for fast image and video search in large databases Matthijs Douze2 joint work with Hervé Jégou1, Cordelia Schmid2 and Patrick Pérez3 1: INRIA Rennes, TEXMEX team, France 2:
More informationLocal Features: Detection, Description & Matching
Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British
More information