3D model search and pose estimation from single images using VIP features

Changchang Wu 2, Friedrich Fraundorfer 1, Jan-Michael Frahm 2, Marc Pollefeys 1,2
1 Department of Computer Science, ETH Zurich, Switzerland, {fraundorfer, marc.pollefeys}@inf.ethz.ch
2 Department of Computer Science, UNC Chapel Hill, USA, {ccwu, jmf}@cs.unc.edu

Abstract

This paper describes a method to efficiently search for 3D models in a city-scale database and to compute the camera poses from single query images. The proposed method matches SIFT features (from a single image) to viewpoint invariant patches (VIP) from a 3D model by warping the SIFT features approximately into the orthographic frame of the VIP features. This significantly increases the number of feature correspondences, which results in a reliable and robust pose estimation. We also present a 3D model search tool that uses a visual word based search scheme to efficiently retrieve 3D models from large databases using individual query images. Together, the 3D model search and the pose estimation represent a highly scalable and efficient city-scale localization system. The performance of the 3D model search and pose estimation is demonstrated on urban image data.

1. Introduction

Searching for 3D models is a key feature in city-wide localization and pose estimation from mobile devices. From a single snapshot image, the corresponding 3D model needs to be found, and 3D-2D matches between the model and the image need to be established to estimate the user's pose (see illustration in Fig. 1). The main challenges so far are the correspondence problem (3D-2D) and the scalability of the approach. In this paper we contribute to both of these topics. The first contribution is a 3D-2D matching method that is based on viewpoint invariant patches (VIP) and can deal with severe viewpoint changes. The second contribution is the use of a visual word based recognition scheme for efficient and scalable database retrieval.

Our database consists of small individual 3D models that represent parts of a large scale reconstruction. Each 3D model is textured and is represented in the database by a collection of VIP features. When querying with an input image, the SIFT features of the input image are matched with the VIP features of the database to determine the corresponding 3D model. Finally, 3D-2D matches between the 3D model and the input image are established for pose estimation.

Figure 1. Mobile vision based localization: A single image from a mobile device is used to search for the corresponding 3D model in a city-scale database and thus determine the user's location. SIFT features extracted from the query image are matched to VIP features from the 3D models in the database. (Panels: query image; 3D model from 1.3 M images; matching part of the 3D model.)

Viewpoint invariant patches (VIP) have so far been used for registering 3D models against each other [9]. The main idea is to create ortho-textures for the 3D models and detect local features, e.g. SIFT, on them. For this, planes in the 3D model are detected and a virtual camera is set fronto-parallel to each plane. Features are then extracted from the virtual camera image, from which the perspective transformation of the initial viewpoint change has been removed. In this paper we extend this method to create matches between a 3D model and a single image (3D-2D). In the original method, the features from both models are represented in the canonical (orthographic) form.

In our case, only the features from the 3D model are represented in the canonical form, while the features from the single image are perspectively transformed. While matching will not work for features under large perspective transformations, features that are almost fronto-parallel will match very well with the canonical representation. Under the assumption that the camera of the query image and the 3D plane of the matching features are parallel, we can generate hypotheses for the camera pose of the query image. Using these hypotheses, we can warp parts of the query image so that they match the perspective transform of the canonical features of the 3D model. This allows us to generate many additional matches for robust and reliable pose estimation. For exhaustive search in large databases this method would be too slow; we therefore use the method described by Nistér and Stewénius [5] for an efficient model search. The model search works with quantized SIFT (and VIP) descriptor vectors, so-called visual words.

The paper is structured in the following way. The next section describes relevant related work. Section 3 describes the first contribution of this paper, pose estimation using VIP and SIFT features. Section 4 describes how to search for 3D models in large databases efficiently. Section 5 shows experiments on urban image data, and finally Section 6 draws some conclusions.

2. Related work

Many texture based feature detectors and descriptors have been developed for robust wide-baseline matching. One of the most popular is Lowe's SIFT detector [3]. The SIFT detector defines a feature's scale in scale space and a feature orientation from the gradient histogram in the image plane. Using the orientation, the SIFT detector generates normalized image patches to achieve invariance to 2D similarity transformations. Many feature detectors, including affine covariant features, use the SIFT descriptor to represent patches. SIFT descriptors are also used to encode VIP features; however, the VIP approach works with other feature descriptors, too. Mikolajczyk et al. give a comparison of several local features in [4]. The recently proposed VIP features [9] go beyond affine invariance to robustness to projective transformations. The authors investigated the use of VIP features to align 3D models, but they did not investigate the case of matching VIPs to features from single images.

Most vision based location systems so far have been demonstrated on small databases [6, 8, 11]. Recently, Schindler et al. [7] presented a scheme for city-scale environments. The method uses a visual word based recognition scheme following the approach in [5, 2]. However, Schindler et al. only focused on location recognition; the pose of the user is not computed. Our proposed method combines both scalable location recognition and pose estimation. Pose estimation alone is the focus of the work in [10]. The authors propose a method to accurately compute the camera pose from 3D-2D matches. High accuracy is achieved by extending the set of initial matches with region growing. Their method could be used as a last step in our localization approach to refine the computed pose.

3. Pose from SIFT-VIP matches

Figure 2. VIPs detected on a 3D model.

3.1. Viewpoint-Invariant Patch (VIP) detection

VIPs are features that can be extracted from textured 3D models, which combine images with corresponding depth maps. VIPs are invariant to 3D similarity transformations.
They can be used to robustly and efficiently align 3D models of the same scene computed from videos taken from significantly different viewpoints. In this paper we will mostly consider 3D models obtained from video by SfM, but the method is equally applicable to textured 3D models obtained using LIDAR or other sensors. The robustness to 3D similarities exactly corresponds to the ambiguity of 3D models obtained from images, while the ambiguities of other sensors can often be described by a 3D Euclidean transformation or with even fewer degrees of freedom.

The undistortion is based on local scene planes or on local planar approximations of the scene. Conceptually, for every point on the surface the normal of the local tangent plane is estimated and a texture patch is generated by orthogonal projection onto the plane. Within the local ortho-texture patch it is determined whether the point corresponds to a local extremal response of the Difference-of-Gaussians (DoG) filter in scale space. If it does, the orientation is determined in the tangent plane by the dominant gradient direction, and a SIFT descriptor on the tangent plane is extracted. Using the tangent plane avoids the poor repeatability of interest point detection under projective transformations seen in popular feature detectors [4].
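To make the detection step concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. It assumes a hypothetical render_ortho_texture helper that orthographically projects the model texture onto a detected scene plane and returns the mapping back to 3D; OpenCV's SIFT implementation stands in for the DoG detector and descriptor described above.

import cv2

def detect_on_ortho_texture(model, plane):
    # Render the model texture orthographically onto the scene plane.
    # render_ortho_texture is a hypothetical helper for whatever textured
    # 3D model representation is at hand; it returns an 8-bit image and a
    # mapping from ortho-pixel coordinates back to 3D points on the plane.
    ortho_img, pixel_to_3d = render_ortho_texture(model, plane)

    # DoG extrema, dominant orientation and SIFT descriptor as in [3],
    # but computed on the perspectively undistorted ortho-texture, which
    # avoids the poor repeatability under projective transformations.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(ortho_img, None)

    # Lift each detection back onto the 3D plane it was detected on.
    points_3d = [pixel_to_3d(kp.pt) for kp in keypoints]
    return keypoints, descriptors, points_3d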

Viewpoint-normalized image patches need to be generated to describe VIPs. Viewpoint normalization is similar to the normalization of image patches according to scale and orientation performed in SIFT, and to the normalization according to an ellipsoid in affine covariant feature detectors. The viewpoint normalization can be divided into the following steps:

1. Warp the image texture for each 3D point, conceptually, using an orthographic camera with its optical axis parallel to the local surface normal and passing through the 3D point. This step makes the VIP invariant to the intrinsics and extrinsics of the original camera and generates an ortho-texture patch.

2. Verify the VIP, and find its orientation and size. Keep a 3D point as a VIP feature only when its corresponding pixel in the ortho-texture patch is a stable 2D image feature. Like [3], a DoG filter and local extrema suppression are used. The VIP orientation is found from the dominant gradient direction in the ortho-texture patch. With the virtual camera, the size and orientation of a VIP can be obtained by transforming the scale and orientation of its corresponding image feature to world coordinates.

A VIP is then fully defined as (x, σ, n, d, s), where x is its 3D position, σ is the patch size, n is the surface normal at this location, d is the texture's dominant orientation as a vector in 3D, and s is the SIFT descriptor that describes the viewpoint-normalized patch. Note that a SIFT feature is a SIFT descriptor plus its position, scale and orientation. Fig. 2 shows VIP features detected on a 3D model.

Figure 3. (a) Initial SIFT-VIP matches. Most matches are, as expected, on the fronto-parallel plane (left image is the query image). (b) Camera pose estimated from a SIFT-VIP match (red). (c) Resulting set of matches established with the proposed method. The initial set of 17 matches could be extended to 92 correct matches. The method established many matches on the other plane, too.

3.2. Matching VIP with SIFT

To match SIFT features from a single image with VIP features from a 3D model, the SIFT features extracted from the image need to be fronto-parallel (or close to it) to the VIP features in the model. This might hold only for a fraction of features whose plane happens to be parallel to the image plane of the camera. For all other features we warp the corresponding image areas so that they approximately match the canonical form of the VIP features. The projective warp can be computed with the following steps:

1. Compute the approximate camera position of the query image in the local coordinate frame from at least one fronto-parallel SIFT-VIP match.

2. Determine the image areas that need to be warped by projecting the VIP features of the model into the query image.

3. Compute the warp homography for each image area from the 3D plane of the VIP and the estimated camera pose.

The whole idea is based on the assumption that initial matches between VIP and SIFT features are fronto-parallel (see Fig. 3(a) for example matches). This assumption allows us to compute an estimate of the camera pose of the query image.
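For reference, the VIP tuple (x, σ, n, d, s) defined above and a plain SIFT feature can be held in small containers like the following; the field names are illustrative, not from the paper, and they are reused in the later sketches.

from dataclasses import dataclass
import numpy as np

@dataclass
class VIPFeature:
    # A VIP as defined in Section 3.1: (x, sigma, n, d, s).
    x: np.ndarray        # 3D position of the feature
    sigma: float         # patch size (scale) in world units
    n: np.ndarray        # surface normal of the supporting plane
    d: np.ndarray        # dominant texture orientation as a 3D vector
    s: np.ndarray        # SIFT descriptor of the viewpoint-normalized patch

@dataclass
class SIFTFeature:
    # A SIFT feature: descriptor plus 2D position, scale and orientation.
    pt: np.ndarray       # 2D position in the query image
    scale: float
    angle: float
    descriptor: np.ndarray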

The VIP feature is located on a plane in 3D and is defined by the feature's center point X (in 3D) and the normal vector n of the plane. Our assumption is that the image plane of the SIFT feature is parallel to this plane and that the principal ray of the camera is in the direction of n and connects X and the center of the SIFT feature x. This fixes the camera pose along the normal vector n. The distance d between the camera center and the plane can be computed from the scale ratio of the matched features with the help of the focal length f:

d = f S / s    (1)

The focal length f of the camera can be taken from the EXIF data of the image or from camera calibration. S is the scale of the VIP feature and s is the scale of the matching SIFT feature. The missing rotation r around the principal axis can finally be recovered from the dominant gradient directions of the image patches. Fig. 3(b) shows a camera pose estimated from a SIFT-VIP match.

With the camera P now fully defined, this approximation can be used to compute the necessary warps. For each VIP feature in the 3D model we determine the corresponding image region in the query image by projecting the VIP region (specified by center point and scale) onto the image plane. Next we compute the homography transform H that warps our image region to the canonical form of the VIP feature:

H = R + (1/d) T N^T    (2)

where R and T are the rotation and translation from P to the virtual camera of the VIP feature and N is the normal vector of the VIP plane in the coordinate system of P. Finally, we look for stable 2D image features in the warped image area by applying the SIFT detector.

Clearly, our assumptions are not met exactly, which results in an inaccurate camera pose estimate. SIFT descriptors, which were developed for wide-baseline matching, enable matching within a certain range of viewpoint change, so the camera plane need not be exactly parallel to the VIP feature plane. We therefore do not depend on an exact pose estimate for this step. We account for the uncertainty in the camera pose by enlarging the region to warp. In addition, remaining differences between the VIP and SIFT features can be compensated by the SIFT matching itself. Fig. 3 shows examples of final SIFT-VIP matches. The initial matching between SIFT and VIP features results in 17 matches. From these, a camera pose estimate can be computed, which allows us to warp the SIFT detections in the input image into an approximately fronto-parallel configuration. Matching the rectified SIFT detections with the VIP features yields 92 correct matches.
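Equations (1) and (2) translate into a few lines of code. The sketch below reuses the containers from the Section 3.1 sketch; look_at_rotation, which builds the camera rotation from the plane normal and the matched orientations, is a hypothetical helper whose details the paper does not spell out, and the sign conventions are simplified. It is an illustration under those assumptions, not the authors' implementation.

import numpy as np

def pose_hypothesis(vip, sift, f):
    # Eq. (1): distance between camera center and VIP plane from the
    # scale ratio of the matched features (f from EXIF or calibration).
    d = f * vip.sigma / sift.scale
    # The camera center is assumed to lie on the line through the VIP
    # center along the plane normal, at distance d from the plane.
    center = vip.x + d * vip.n
    # The principal axis looks along -n; the remaining roll about it is
    # fixed by aligning the VIP's dominant direction with the SIFT
    # feature's orientation (hypothetical helper, details omitted).
    R = look_at_rotation(-vip.n, vip.d, sift.angle)
    return R, center, d

def warp_homography(R, T, N, d):
    # Eq. (2): plane-induced homography that warps the query-image region
    # into the canonical (orthographic) frame of the VIP. R, T map from
    # the query camera P to the VIP's virtual camera, N is the plane
    # normal expressed in P's coordinate system, d the plane distance.
    return R + (1.0 / d) * np.outer(T, N)

The warped region would then be resampled (for example with cv2.warpPerspective) and the SIFT detector re-run on it, as described above.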
Algorithm 1: 3D model search and pose estimation
1: Extract SIFT features from the query image
2: Compute the visual word document vector for the query image
3: Compute L2 distances to all document vectors in the 3D model database (inverted file query)
4: Use the 3D model corresponding to the smallest distance as the matching 3D model
5: Match SIFT features from the query image to VIP features from the database 3D model (nearest neighbor matching)
6: Compute camera pose hypotheses from the SIFT-VIP matches
7: Warp the query image according to the camera pose hypotheses and extract fronto-parallel SIFT features
8: Match the fronto-parallel SIFT features to the VIP features
9: Compute the final pose from the SIFT-VIP matches

3.3. Pose estimation

The 3D-2D matches between VIP and SIFT features can now be used to compute the camera pose accurately and thus determine the location of the user within the map. The main benefit for pose estimation is that we can significantly increase the number of feature matches, which results in a reliable and robust pose estimation. An outline of the complete localization method is given in Algorithm 1.

4. Efficient 3D model search in large databases

For the pose estimation described in the previous section, the corresponding 3D model needs to be known. For large databases, necessary for city-wide localization, an exhaustive search through all the 3D models is not possible. Thus a first step prior to pose estimation is to search for the corresponding 3D model. Our database consists of small individual 3D models that represent parts of a large scale vision based 3D reconstruction, created as described in [1]. Each individual 3D model is represented by a set of VIP features extracted from the model texture. These features are used to create a visual word database as described in [5]. This allows for an efficient model search to determine the 3D model needed for pose estimation.

Similar to [5], VIP features are first extracted from the 3D models. Each VIP descriptor is quantized by a hierarchical vocabulary tree. All visual words from one 3D model form a document vector, a v-dimensional vector where v is the number of possible visual words; it is usually extremely sparse. For a model query, the similarity between the query document vector and all document vectors in the database is computed. As similarity score we use the L2 distance between document vectors. The organization of the database as an inverted file and the sparseness of the document vectors allow very efficient scoring. For scoring, the different visual words are weighted based on the inverse document frequency (IDF) measure. The database models are ranked by the L2 distance, and the vector with the lowest distance is reported as the most similar match.
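The scoring just described can be sketched as follows, assuming the vocabulary-tree quantization has already turned every database model (and the query) into a list of visual-word ids; document vectors are IDF-weighted, L2-normalized, and ranked by L2 distance through the inverted file, in the spirit of [5]. This is an illustrative sketch, not the authors' implementation.

import numpy as np
from collections import Counter, defaultdict

def build_inverted_file(model_words):
    # model_words: one list of visual-word ids per 3D model in the database.
    num_models = len(model_words)
    df = Counter(w for words in model_words for w in set(words))
    idf = {w: np.log(num_models / df[w]) for w in df}

    inverted = defaultdict(list)   # visual word -> [(model id, weight), ...]
    for i, words in enumerate(model_words):
        vec = {w: c * idf[w] for w, c in Counter(words).items()}
        norm = np.sqrt(sum(v * v for v in vec.values())) or 1.0
        for w, v in vec.items():
            inverted[w].append((i, v / norm))   # store L2-normalized entries
    return inverted, idf, num_models

def query_models(query_words, inverted, idf, num_models):
    vec = {w: c * idf.get(w, 0.0) for w, c in Counter(query_words).items()}
    norm = np.sqrt(sum(v * v for v in vec.values())) or 1.0
    # For L2-normalized vectors, ||q - m||^2 = 2 - 2 q.m, so it suffices to
    # accumulate dot products over the words shared with each model.
    dots = np.zeros(num_models)
    for w, qv in vec.items():
        for i, mv in inverted.get(w, []):
            dots[i] += (qv / norm) * mv
    distances = 2.0 - 2.0 * dots
    return np.argsort(distances)   # database models, most similar first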

In a next step, initial SIFT-VIP matches are sought to start the pose estimation algorithm. Corresponding features can be determined efficiently by using the quantized visual word description: features with the same visual word are reported as matches, which takes only O(n) time, where n is the number of features.

The visual word description is also very memory efficient. The plain visual word database size is

DB_inv = 4 f I    (3)

where f is the maximum number of visual words per model and I is the number of models in the database. The factor 4 comes from the use of 4-byte integers to hold the model index where a visual word occurred. If we assume an average of 1000 visual words per model, a database containing 1 million models would only need 4 GB of RAM. In addition to the visual words we also need to store the 2D coordinates, scale and rotation of the SIFT features, and additionally the 3D coordinates, plane parameters and virtual camera of the VIP features, which still allows us to store a huge number of models in the database.
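The correspondence step described above, where two features are accepted as a match when they fall into the same visual word, can likewise be sketched in a few lines; quantize is assumed to map a descriptor to its visual-word id, for example by descending the vocabulary tree.

from collections import defaultdict

def visual_word_matches(query_descriptors, model_descriptors, quantize):
    # Bucket the model (VIP) features by their visual word once ...
    by_word = defaultdict(list)
    for j, desc in enumerate(model_descriptors):
        by_word[quantize(desc)].append(j)

    # ... then every query (SIFT) feature only looks up its own word,
    # which keeps the matching roughly linear in the number of features.
    matches = []   # (query index, model index) pairs
    for i, desc in enumerate(query_descriptors):
        for j in by_word.get(quantize(desc), []):
            matches.append((i, j))
    return matches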
5. Experiments

5.1. SIFT-VIP matching results

We conducted an experiment to compare standard SIFT-SIFT matching with our proposed SIFT-VIP matching. Fig. 4(a) shows the established SIFT-SIFT matches. Only 10 matches could be detected, and many of them are actually mismatches. When computing the initial SIFT-VIP matches, the number of correspondences increases to 25, most of them correct (see Fig. 4(b)). The proposed method, however, is able to detect 91 correct SIFT-VIP matches, as shown in Fig. 4(c). This is a significantly higher number of matches, which allows a more accurate pose estimation. Note that the matches are nicely distributed on the two scene planes. Fig. 4(d) shows the resulting pose estimate in red.

Fig. 5 shows the camera position hypotheses from single SIFT-VIP matches in green. Each match generates one hypothesis. The red camera is the correct camera pose. All the camera estimates are set fronto-parallel to the corresponding VIP feature in the 3D model, and therefore the estimates generated from the plane that is not fronto-parallel to the real camera are off. However, it can be seen that many pose hypotheses are very close to the correct solution. Each of them can be used to extend the initial SIFT-VIP matches to a larger set.

Fig. 6 shows an example with three scene planes. The 105 (partially incorrect) SIFT-SIFT matches get extended to 223 correct SIFT-VIP matches on all three scene planes. Fig. 6(b) shows examples of orthographic VIP patches. The images show, from left to right, the extracted SIFT patches from the query image, the warped SIFT patches, and the VIP patches of the 3D model. Ideally, the warped SIFT patches and the VIP patches should be perfectly aligned. However, as the initial SIFT-VIP matches are not exactly fronto-parallel, the camera pose is inaccurate and the patches are not perfectly registered. But the difference is not very large, which means that our simple pose estimation works impressively well.

5.2. 3D model search performance evaluation

In this experiment we show the performance of the 3D model search. The video data used to create the models was acquired with car-mounted cameras while driving through a city. Two cameras were mounted on the roof of the car: one was pointing straight sideways, the other was pointing forward at a 45° angle. The fields of view of the two cameras do not overlap, but as the system moves over time the captured scene parts will overlap. To retrieve ground truth data for the camera motion, the image acquisition was synchronized with a highly accurate GPS-inertial system. Accordingly, we know the location of the camera for each video frame. In this experiment, a 3D model database represented by VIP features is created from the side camera video. The database is then queried with the video frames from the forward camera, which are represented by SIFT features. The database contains 113 3D models, which are queried with 2644 images. The query video frames have a resolution of 1024x768, which resulted in up to 5000 features per frame. The vocabulary tree used was trained on general image data from the web.

The 3D model search results are visualized by plotting lines between frame-to-3D-model matches (see Fig. 7). The identical camera paths of the forward and side camera are shifted by a small amount in the x and y directions to make the matching links visible. We only draw matches below a distance threshold of 10 m so that mismatches are filtered out. The red markers are the query camera positions and the green markers are the 3D model positions in the database. In the figure, the top-10 ranked matches are drawn. Usually one considers the top-n ranked matches as possible hypotheses and verifies the correct one geometrically; in our case this can be done by the pose estimation. Fig. 8 shows some correct example matches.

5.3. 3D model query with cell phone images

We developed an application that allows querying a 3D city-model database (see screenshot in Fig. 9) with arbitrary input images. The database so far contains 851 3D models and the retrieval works in real time. Fig. 9(b) shows a query with a cell phone image. The cell phone query image has a different resolution and was taken months later; nevertheless, we could match it perfectly.

Figure 4. Comparison of standard SIFT-SIFT matching and our proposed SIFT-VIP method. (a) SIFT-SIFT matches: only 10 matches could be found, most of them mismatches. (b) Initial SIFT-VIP matches: 25 matches could be found, most of them correct. (c) Resulting set of matches established with the proposed method: the initial set of 25 matches could be extended to 91 correct matches. (d) The SIFT-VIP matches in 3D showing the estimated camera pose (red).

Figure 5. Camera pose hypotheses from SIFT-VIP matches (green). The ground truth camera pose of the query image is shown in red. Multiple hypotheses are very close to the real camera pose.

6. Conclusion

In this paper we addressed two important topics in visual localization. First, we investigated the case of 3D-2D pose estimation using VIP and SIFT features. We showed that it is possible to match images to 3D models by matching SIFT features to VIP features. We demonstrated that the number of initial SIFT-VIP matches can be increased significantly by warping the query features into the orthographic frame of the VIP features, which increases the reliability and robustness of the pose estimation. Second, we demonstrated a 3D model search scheme that scales efficiently up to city scale. Localization experiments with images from camera phones showed that this approach is suitable for city-wide localization from mobile devices.

Figure 6. (a) SIFT-VIP matches and estimated camera pose for a scene with three planes. (b) Examples of warped SIFT patches and orthographic VIP patches. From left to right: extracted SIFT patch from the query image, warped SIFT patch, VIP patch of the 3D model. The VIP patches are impressively well aligned to the warped SIFT patches, despite the inaccuracies of the camera pose.

Figure 7. 3D model search (plot of camera positions, x and y in meters). Red markers are the query camera positions and green markers are the 3D model positions in the database. Lines show matches below a 10 m distance threshold. Each match should be seen as a match hypothesis that is to be verified by the geometric constraints of the pose estimation.

Figure 8. Matches from the 3D model search. Left: query image from the forward camera. Right: retrieved 3D models.

Figure 9. (a) Screenshots of our 3D model search tool. The query image can be selected from a list on the left; as a result, the corresponding 3D model shows up. (b) Query with an image from a camera phone.

References

[1] A. Akbarzadeh, J. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards urban 3D reconstruction from video. In 3D Data Processing, Visualization and Transmission, pages 1-8, 2006.
[2] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[4] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43-72, 2005.
[5] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, New York City, New York, pages 2161-2168, 2006.
[6] D. Robertson and R. Cipolla. An image-based system for urban navigation. In Proc. 14th British Machine Vision Conference, London, UK, pages 1-10, 2004.
[7] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, pages 1-7, 2007.
[8] H. Shao, T. Svoboda, T. Tuytelaars, and L. Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Conference on Image and Video Retrieval, pages 71-80, 2003.
[9] C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3D model matching with viewpoint invariant patches (VIPs). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[10] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to a 3D model. In Proc. 3DIM, pages 159-166, 2007.
[11] W. Zhang and J. Kosecka. Image based localization in urban environments. In 3D Data Processing, Visualization and Transmission, pages 33-40, 2006.