3D model search and pose estimation from single images using VIP features

Changchang Wu 2, Friedrich Fraundorfer 1, Jan-Michael Frahm 2, Marc Pollefeys 1,2
1 Department of Computer Science, ETH Zurich, Switzerland, {fraundorfer, marc.pollefeys}@inf.ethz.ch
2 Department of Computer Science, UNC Chapel Hill, USA, {ccwu, jmf}@cs.unc.edu

Abstract

This paper describes a method to efficiently search for 3D models in a city-scale database and to compute the camera poses from single query images. The proposed method matches SIFT features (from a single image) to viewpoint invariant patches (VIP) from a 3D model by warping the SIFT features approximately into the orthographic frame of the VIP features. This significantly increases the number of feature correspondences, which results in a reliable and robust pose estimation. We also present a 3D model search tool that uses a visual word based search scheme to efficiently retrieve 3D models from large databases using individual query images. Together, the 3D model search and the pose estimation represent a highly scalable and efficient city-scale localization system. The performance of the 3D model search and pose estimation is demonstrated on urban image data.

1. Introduction

Searching for 3D models is a key feature in city-wide localization and pose estimation from mobile devices. From a single snapshot image, the corresponding 3D model needs to be found, and 3D-2D matches between the model and the image need to be established to estimate the user's pose (see illustration in Fig. 1). The main challenges so far are the correspondence problem (3D-2D) and the scalability of the approach. In this paper we contribute to both of these topics. The first contribution is a 3D-2D matching method that is based on viewpoint invariant patches (VIP) and can deal with severe viewpoint changes. The second contribution is the use of a visual word based recognition scheme for efficient and scalable database retrieval.

Our database consists of small individual 3D models that represent parts of a large scale reconstruction. Each 3D model is textured and is represented in the database by a collection of VIP features. When querying with an input image, the SIFT features of the input image are matched with the VIP features of the database to determine the corresponding 3D model. Finally, 3D-2D matches between the 3D model and the input image are established for pose estimation.

Figure 1. Mobile vision based localization: A single image from a mobile device is used to search for the corresponding 3D model in a city-scale database and thus determine the user's location. SIFT features extracted from the query image are matched to VIP features from the 3D models in the database. (Panels: query image; 3D model from 1.3 M images; matching part of the 3D model.)

Viewpoint invariant patches (VIP) have so far been used for registering 3D models against each other [9]. The main idea is to create ortho-textures for the 3D models and detect local features, e.g. SIFT, on them. For this, planes in the 3D model are detected and a virtual camera is set fronto-parallel to each plane. Features are then extracted from the virtual camera image, from which the perspective transformation of the initial viewpoint change has been removed. In this paper we extend this method to create matches between a 3D model and a single image (3D-2D). In the original method, the features from both models are represented in the canonical (orthographic) form.

In our case, only the features from the 3D model are represented in the canonical form, while the features from the single image are perspectively transformed. While matching will not work for features under large perspective transformations, features that are almost fronto-parallel will match very well with the canonical representation. Under the assumption that the camera of the query image and the 3D plane of the matching features are parallel, we can generate hypotheses for the camera pose of the query image. Using these hypotheses, we can warp parts of the query image so that they match the perspective transform of the canonical features of the 3D model. This allows us to generate many additional matches for robust and reliable pose estimation. For exhaustive search in large databases this method would be too slow; we therefore use the method described by Nistér and Stewénius [5] for an efficient model search. The model search works with quantized SIFT (and VIP) descriptor vectors, so-called visual words.

The paper is structured in the following way. The next section describes relevant related work. Section 3 describes the first contribution of this paper, pose estimation using VIP and SIFT features. Section 4 describes how to search for 3D models in large databases efficiently. Section 5 shows experiments on urban image data, and finally Section 6 draws some conclusions.

2. Related work

Many texture based feature detectors and descriptors have been developed for robust wide-baseline matching. One of the most popular is Lowe's SIFT detector [3]. The SIFT detector defines a feature's scale in scale space and a feature orientation from the gradient histogram in the image plane. Using the orientation, the SIFT detector generates normalized image patches to achieve invariance to 2D similarity transformations. Many feature detectors, including affine covariant features, use the SIFT descriptor to represent patches. SIFT descriptors are also used to encode VIP features; however, the VIP approach works with other feature descriptors, too. Mikolajczyk et al. give a comparison of several local features in [4]. The recently proposed VIP features [9] go beyond affine invariance to robustness to projective transformations. The authors investigated the use of VIP features to align 3D models, but they did not investigate the case of matching VIPs to features from single images.

Most vision based location systems so far have been demonstrated on small databases [6, 8, 11]. Recently, Schindler et al. [7] presented a scheme for city-scale environments. The method uses a visual word based recognition scheme following the approach in [5, 2]. However, Schindler et al. only focused on location recognition; the pose of the user is not computed. Our proposed method combines both scalable location recognition and pose estimation. Pose estimation alone is the focus of the work in [10]. The authors propose a method to accurately compute the camera pose from 3D-2D matches. High accuracy is achieved by extending the set of initial matches with region growing. Their method could be used as a last step in our localization approach to refine the computed pose.

3. Pose from SIFT-VIP matches

Figure 2. VIPs detected on a 3D model.

3.1. Viewpoint-Invariant Patch (VIP) detection

VIPs are features that can be extracted from textured 3D models, which combine images with corresponding depth maps. VIPs are invariant to 3D similarity transformations.
They can be used to robustly and efficiently align 3D models of the same scene computed from videos taken from significantly different viewpoints. In this paper we will mostly consider 3D models obtained from video by SfM, but the method is equally applicable to textured 3D models obtained using LIDAR or other sensors. The robustness to 3D similarities exactly corresponds to the ambiguity of 3D models obtained from images, while the ambiguities of other sensors can often be described by a 3D Euclidean transformation or with even fewer degrees of freedom.

The undistortion is based on local scene planes or on local planar approximations of the scene. Conceptually, for every point on the surface the normal of the local tangent plane is estimated and a texture patch is generated by orthogonal projection onto the plane. Within the local ortho-texture patch it is determined whether the point corresponds to a local extremal response of the Difference-of-Gaussians (DoG) filter in scale space. If it does, the orientation is determined in the tangent plane by the dominant gradient direction, and a SIFT descriptor on the tangent plane is extracted. Using the tangent plane avoids the poor repeatability of interest point detection under projective transformations seen in popular feature detectors [4].
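To make the detection step concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. It assumes a hypothetical render_ortho_texture helper that orthographically projects the model texture onto a detected scene plane and returns the mapping back to 3D; OpenCV's SIFT implementation stands in for the DoG detector and descriptor described above.

import cv2

def detect_on_ortho_texture(model, plane):
    # Render the model texture orthographically onto the scene plane.
    # render_ortho_texture is a hypothetical helper for whatever textured
    # 3D model representation is at hand; it returns an 8-bit image and a
    # mapping from ortho-pixel coordinates back to 3D points on the plane.
    ortho_img, pixel_to_3d = render_ortho_texture(model, plane)

    # DoG extrema, dominant orientation and SIFT descriptor as in [3],
    # but computed on the perspectively undistorted ortho-texture, which
    # avoids the poor repeatability under projective transformations.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(ortho_img, None)

    # Lift each detection back onto the 3D plane it was detected on.
    points_3d = [pixel_to_3d(kp.pt) for kp in keypoints]
    return keypoints, descriptors, points_3d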

Viewpoint-normalized image patches need to be generated to describe VIPs. Viewpoint normalization is similar to the normalization of image patches according to scale and orientation performed in SIFT, and to the normalization according to an ellipsoid in affine covariant feature detectors. The viewpoint normalization can be divided into the following steps:

1. Warp the image texture for each 3D point, conceptually, using an orthographic camera with its optical axis parallel to the local surface normal and passing through the 3D point. This step makes the VIP invariant to the intrinsics and extrinsics of the original camera and generates an ortho-texture patch.

2. Verify the VIP, and find its orientation and size. Keep a 3D point as a VIP feature only when its corresponding pixel in the ortho-texture patch is a stable 2D image feature. Like [3], a DoG filter and local extrema suppression are used. The VIP orientation is found from the dominant gradient direction in the ortho-texture patch. With the virtual camera, the size and orientation of a VIP can be obtained by transforming the scale and orientation of its corresponding image feature to world coordinates.

A VIP is then fully defined as (x, σ, n, d, s), where x is its 3D position, σ is the patch size, n is the surface normal at this location, d is the texture's dominant orientation as a vector in 3D, and s is the SIFT descriptor that describes the viewpoint-normalized patch. Note that a SIFT feature is a SIFT descriptor plus its position, scale and orientation. Fig. 2 shows VIP features detected on a 3D model.

Figure 3. (a) Initial SIFT-VIP matches. Most matches are, as expected, on the fronto-parallel plane (left image is the query image). (b) Camera pose estimated from a SIFT-VIP match (red). (c) Resulting set of matches established with the proposed method. The initial set of 17 matches could be extended to 92 correct matches. The method established many matches on the other plane, too.

3.2. Matching VIP with SIFT

To match SIFT features from a single image with VIP features from a 3D model, the SIFT features extracted from the image need to be fronto-parallel (or close to it) to the VIP features in the model. This might hold only for a fraction of features whose plane happens to be parallel to the image plane of the camera. For all other features we warp the corresponding image areas so that they approximately match the canonical form of the VIP features. The projective warp can be computed with the following steps:

1. Compute the approximate camera position of the query image in the local coordinate frame from at least one fronto-parallel SIFT-VIP match.

2. Determine the image areas that need to be warped by projecting the VIP features of the model into the query image.

3. Compute the warp homography for each image area from the 3D plane of the VIP and the estimated camera pose.

The whole idea is based on the assumption that initial matches between VIP and SIFT features are fronto-parallel (see Fig. 3(a) for example matches). This assumption allows us to compute an estimate of the camera pose of the query image.
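For reference, the VIP tuple (x, σ, n, d, s) defined above and a plain SIFT feature can be held in small containers like the following; the field names are illustrative, not from the paper, and they are reused in the later sketches.

from dataclasses import dataclass
import numpy as np

@dataclass
class VIPFeature:
    # A VIP as defined in Section 3.1: (x, sigma, n, d, s).
    x: np.ndarray        # 3D position of the feature
    sigma: float         # patch size (scale) in world units
    n: np.ndarray        # surface normal of the supporting plane
    d: np.ndarray        # dominant texture orientation as a 3D vector
    s: np.ndarray        # SIFT descriptor of the viewpoint-normalized patch

@dataclass
class SIFTFeature:
    # A SIFT feature: descriptor plus 2D position, scale and orientation.
    pt: np.ndarray       # 2D position in the query image
    scale: float
    angle: float
    descriptor: np.ndarray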

The VIP feature is located on a plane in 3D and is defined by the feature's center point X (in 3D) and the normal vector n of the plane. Our assumption is that the image plane of the SIFT feature is parallel to this plane and that the principal ray of the camera is in the direction of n and connects X and the center of the SIFT feature x. This fixes the camera pose along the normal vector n. The distance d between the camera center and the plane can be computed from the scale ratio of the matched features with the help of the focal length f:

d = f S / s    (1)

The focal length f of the camera can be taken from the EXIF data of the image or from camera calibration. S is the scale of the VIP feature and s is the scale of the matching SIFT feature. The missing rotation r around the principal axis can finally be recovered from the dominant gradient directions of the image patches. Fig. 3(b) shows a camera pose estimated from a SIFT-VIP match.

With the camera P now fully defined, this approximation can be used to compute the necessary warps. For each VIP feature in the 3D model we determine the corresponding image region in the query image by projecting the VIP region (specified by center point and scale) onto the image plane. Next we compute the homography transform H that warps our image region to the canonical form of the VIP feature:

H = R + (1/d) T N^T    (2)

where R and T are the rotation and translation from P to the virtual camera of the VIP feature and N is the normal vector of the VIP plane in the coordinate system of P. Finally, we look for stable 2D image features in the warped image area by applying the SIFT detector.

Clearly, our assumptions are not met exactly, which results in an inaccurate camera pose estimate. SIFT descriptors, which were developed for wide-baseline matching, enable matching within a certain range of viewpoint change, so the camera plane need not be exactly parallel to the VIP feature plane. We therefore do not depend on an exact pose estimate for this step. We account for the uncertainty in the camera pose by enlarging the region to warp. In addition, remaining differences between the VIP and SIFT features can be compensated by the SIFT matching itself. Fig. 3 shows examples of final SIFT-VIP matches. The initial matching between SIFT and VIP features results in 17 matches. From these, a camera pose estimate can be computed, which allows us to warp the SIFT detections in the input image into an approximately fronto-parallel configuration. Matching the rectified SIFT detections with the VIP features yields 92 correct matches.
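Equations (1) and (2) translate into a few lines of code. The sketch below reuses the containers from the Section 3.1 sketch; look_at_rotation, which builds the camera rotation from the plane normal and the matched orientations, is a hypothetical helper whose details the paper does not spell out, and the sign conventions are simplified. It is an illustration under those assumptions, not the authors' implementation.

import numpy as np

def pose_hypothesis(vip, sift, f):
    # Eq. (1): distance between camera center and VIP plane from the
    # scale ratio of the matched features (f from EXIF or calibration).
    d = f * vip.sigma / sift.scale
    # The camera center is assumed to lie on the line through the VIP
    # center along the plane normal, at distance d from the plane.
    center = vip.x + d * vip.n
    # The principal axis looks along -n; the remaining roll about it is
    # fixed by aligning the VIP's dominant direction with the SIFT
    # feature's orientation (hypothetical helper, details omitted).
    R = look_at_rotation(-vip.n, vip.d, sift.angle)
    return R, center, d

def warp_homography(R, T, N, d):
    # Eq. (2): plane-induced homography that warps the query-image region
    # into the canonical (orthographic) frame of the VIP. R, T map from
    # the query camera P to the VIP's virtual camera, N is the plane
    # normal expressed in P's coordinate system, d the plane distance.
    return R + (1.0 / d) * np.outer(T, N)

The warped region would then be resampled (for example with cv2.warpPerspective) and the SIFT detector re-run on it, as described above.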
Algorithm 1: 3D model search and pose estimation
1: Extract SIFT features from the query image
2: Compute the visual word document vector for the query image
3: Compute L2 distances to all document vectors in the 3D model database (inverted file query)
4: Use the 3D model corresponding to the smallest distance as the matching 3D model
5: Match SIFT features from the query image to VIP features from the database 3D model (nearest neighbor matching)
6: Compute camera pose hypotheses from the SIFT-VIP matches
7: Warp the query image according to the camera pose hypotheses and extract fronto-parallel SIFT features
8: Match the fronto-parallel SIFT features to the VIP features
9: Compute the final pose from the SIFT-VIP matches

3.3. Pose estimation

The 3D-2D matches between VIP and SIFT features can now be used to compute the camera pose accurately and thus determine the location of the user within the map. The main benefit for pose estimation is that we can significantly increase the number of feature matches, which results in a reliable and robust pose estimation. An outline of the complete localization method is given in Algorithm 1.

4. Efficient 3D model search in large databases

For the pose estimation described in the previous section, the corresponding 3D model needs to be known. For large databases, necessary for city-wide localization, an exhaustive search through all the 3D models is not possible. Thus a first step prior to pose estimation is to search for the corresponding 3D model. Our database consists of small individual 3D models that represent parts of a large scale vision based 3D reconstruction, created as described in [1]. Each individual 3D model is represented by a set of VIP features extracted from the model texture. These features are used to create a visual word database as described in [5]. This allows for an efficient model search to determine the 3D model needed for pose estimation.

Similar to [5], VIP features are first extracted from the 3D models. Each VIP descriptor is quantized by a hierarchical vocabulary tree. All visual words from one 3D model form a document vector, a v-dimensional vector where v is the number of possible visual words; it is usually extremely sparse. For a model query, the similarity between the query document vector and all document vectors in the database is computed. As similarity score we use the L2 distance between document vectors. The organization of the database as an inverted file and the sparseness of the document vectors allow very efficient scoring. For scoring, the different visual words are weighted based on the inverse document frequency (IDF) measure. The database models are ranked by the L2 distance, and the vector with the lowest distance is reported as the most similar match.
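The scoring just described can be sketched as follows, assuming the vocabulary-tree quantization has already turned every database model (and the query) into a list of visual-word ids; document vectors are IDF-weighted, L2-normalized, and ranked by L2 distance through the inverted file, in the spirit of [5]. This is an illustrative sketch, not the authors' implementation.

import numpy as np
from collections import Counter, defaultdict

def build_inverted_file(model_words):
    # model_words: one list of visual-word ids per 3D model in the database.
    num_models = len(model_words)
    df = Counter(w for words in model_words for w in set(words))
    idf = {w: np.log(num_models / df[w]) for w in df}

    inverted = defaultdict(list)   # visual word -> [(model id, weight), ...]
    for i, words in enumerate(model_words):
        vec = {w: c * idf[w] for w, c in Counter(words).items()}
        norm = np.sqrt(sum(v * v for v in vec.values())) or 1.0
        for w, v in vec.items():
            inverted[w].append((i, v / norm))   # store L2-normalized entries
    return inverted, idf, num_models

def query_models(query_words, inverted, idf, num_models):
    vec = {w: c * idf.get(w, 0.0) for w, c in Counter(query_words).items()}
    norm = np.sqrt(sum(v * v for v in vec.values())) or 1.0
    # For L2-normalized vectors, ||q - m||^2 = 2 - 2 q.m, so it suffices to
    # accumulate dot products over the words shared with each model.
    dots = np.zeros(num_models)
    for w, qv in vec.items():
        for i, mv in inverted.get(w, []):
            dots[i] += (qv / norm) * mv
    distances = 2.0 - 2.0 * dots
    return np.argsort(distances)   # database models, most similar first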

In a next step, initial SIFT-VIP matches are sought to start the pose estimation algorithm. Corresponding features can be determined efficiently by using the quantized visual word description: features with the same visual word are reported as matches, which takes only O(n) time, where n is the number of features.

The visual word description is also very memory efficient. The plain visual word database size is

DB_inv = 4 f I    (3)

where f is the maximum number of visual words per model and I is the number of models in the database. The factor 4 comes from the use of 4-byte integers to hold the model index where a visual word occurred. If we assume an average of 1000 visual words per model, a database containing 1 million models would only need 4 GB of RAM. In addition to the visual words we also need to store the 2D coordinates, scale and rotation of the SIFT features, and additionally the 3D coordinates, plane parameters and virtual camera of the VIP features, which still allows us to store a huge number of models in the database.
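The correspondence step described above, where two features are accepted as a match when they fall into the same visual word, can likewise be sketched in a few lines; quantize is assumed to map a descriptor to its visual-word id, for example by descending the vocabulary tree.

from collections import defaultdict

def visual_word_matches(query_descriptors, model_descriptors, quantize):
    # Bucket the model (VIP) features by their visual word once ...
    by_word = defaultdict(list)
    for j, desc in enumerate(model_descriptors):
        by_word[quantize(desc)].append(j)

    # ... then every query (SIFT) feature only looks up its own word,
    # which keeps the matching roughly linear in the number of features.
    matches = []   # (query index, model index) pairs
    for i, desc in enumerate(query_descriptors):
        for j in by_word.get(quantize(desc), []):
            matches.append((i, j))
    return matches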
5. Experiments

5.1. SIFT-VIP matching results

We conducted an experiment to compare standard SIFT-SIFT matching with our proposed SIFT-VIP matching. Fig. 4(a) shows the established SIFT-SIFT matches. Only 10 matches could be detected, and many of them are actually mismatches. When computing the initial SIFT-VIP matches, the number of correspondences increases to 25, most of them correct (see Fig. 4(b)). The proposed method, however, is able to detect 91 correct SIFT-VIP matches, as shown in Fig. 4(c). This is a significantly higher number of matches, which allows a more accurate pose estimation. Note that the matches are nicely distributed on the two scene planes. Fig. 4(d) shows the resulting pose estimate in red.

Fig. 5 shows the camera position hypotheses from single SIFT-VIP matches in green. Each match generates one hypothesis. The red camera is the correct camera pose. All the camera estimates are set fronto-parallel to the corresponding VIP feature in the 3D model, and therefore the estimates generated from the plane that is not fronto-parallel to the real camera are off. However, it can be seen that many pose hypotheses are very close to the correct solution. Each of them can be used to extend the initial SIFT-VIP matches to a larger set.

Fig. 6 shows an example with three scene planes. The 105 (partially incorrect) SIFT-SIFT matches get extended to 223 correct SIFT-VIP matches on all three scene planes. Fig. 6(b) shows examples of orthographic VIP patches. The images show, from left to right, the extracted SIFT patches from the query image, the warped SIFT patches, and the VIP patches of the 3D model. Ideally, the warped SIFT patches and the VIP patches should be perfectly aligned. However, as the initial SIFT-VIP matches are not exactly fronto-parallel, the camera pose is inaccurate and the patches are not perfectly registered. But the difference is not very large, which means that our simple pose estimation works impressively well.

5.2. 3D model search performance evaluation

In this experiment we show the performance of the 3D model search. The video data used to create the models was acquired with car-mounted cameras while driving through a city. Two cameras were mounted on the roof of the car: one was pointing straight sideways, the other was pointing forward at a 45° angle. The fields of view of the two cameras do not overlap, but as the system moves over time the captured scene parts will overlap. To retrieve ground truth data for the camera motion, the image acquisition was synchronized with a highly accurate GPS-inertial system. Accordingly, we know the location of the camera for each video frame. In this experiment, a 3D model database represented by VIP features is created from the side camera video. The database is then queried with the video frames from the forward camera, which are represented by SIFT features. The database contains 113 3D models, which are queried with 2644 images. The query video frames have a resolution of 1024x768, which resulted in up to 5000 features per frame. The vocabulary tree used was trained on general image data from the web.

The 3D model search results are visualized by plotting lines between frame-to-3D-model matches (see Fig. 7). The identical camera paths of the forward and side camera are shifted by a small amount in the x and y directions to make the matching links visible. We only draw matches below a distance threshold of 10 m so that mismatches are filtered out. The red markers are the query camera positions and the green markers are the 3D model positions in the database. In the figure, the top-10 ranked matches are drawn. Usually one considers the top-n ranked matches as possible hypotheses and verifies the correct one geometrically; in our case this can be done by the pose estimation. Fig. 8 shows some correct example matches.

5.3. 3D model query with cell phone images

We developed an application that allows querying a 3D city-model database (see screenshot in Fig. 9) with arbitrary input images. The database so far contains 851 3D models and the retrieval works in real time. Fig. 9(b) shows a query with a cell phone image. The cell phone query image has a different resolution and was taken months later; nevertheless, we could match it perfectly.

Figure 4. Comparison of standard SIFT-SIFT matching and our proposed SIFT-VIP method. (a) SIFT-SIFT matches: only 10 matches could be found, most of them mismatches. (b) Initial SIFT-VIP matches: 25 matches could be found, most of them correct. (c) Resulting set of matches established with the proposed method: the initial set of 25 matches could be extended to 91 correct matches. (d) The SIFT-VIP matches in 3D showing the estimated camera pose (red).

Figure 5. Camera pose hypotheses from SIFT-VIP matches (green). The ground truth camera pose of the query image is shown in red. Multiple hypotheses are very close to the real camera pose.

6. Conclusion

In this paper we addressed two important topics in visual localization. First, we investigated the case of 3D-2D pose estimation using VIP and SIFT features. We showed that it is possible to match images to 3D models by matching SIFT features to VIP features. We demonstrated that the number of initial SIFT-VIP matches can be increased significantly by warping the query features into the orthographic frame of the VIP features, which increases the reliability and robustness of the pose estimation. Second, we demonstrated a 3D model search scheme that scales efficiently up to city scale. Localization experiments with images from camera phones showed that this approach is suitable for city-wide localization from mobile devices.

Figure 6. (a) SIFT-VIP matches and estimated camera pose for a scene with three planes. (b) Examples of warped SIFT patches and orthographic VIP patches. From left to right: extracted SIFT patch from the query image, warped SIFT patch, VIP patch of the 3D model. The VIP patches are impressively well aligned to the warped SIFT patches, despite the inaccuracies of the camera pose.

Figure 7. 3D model search (plot of camera positions, x and y in meters). Red markers are the query camera positions and green markers are the 3D model positions in the database. Lines show matches below a 10 m distance threshold. Each match should be seen as a match hypothesis that is to be verified by the geometric constraints of the pose estimation.

Figure 8. Matches from the 3D model search. Left: query image from the forward camera. Right: retrieved 3D models.

Figure 9. (a) Screenshots of our 3D model search tool. The query image can be selected from a list on the left; as a result, the corresponding 3D model shows up. (b) Query with an image from a camera phone.

References

[1] A. Akbarzadeh, J. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards urban 3D reconstruction from video. In 3D Data Processing, Visualization and Transmission, pages 1-8, 2006.
[2] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[4] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43-72, 2005.
[5] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, New York City, New York, pages 2161-2168, 2006.
[6] D. Robertson and R. Cipolla. An image-based system for urban navigation. In Proc. 14th British Machine Vision Conference, London, UK, pages 1-10, 2004.
[7] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, pages 1-7, 2007.
[8] H. Shao, T. Svoboda, T. Tuytelaars, and L. Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Conference on Image and Video Retrieval, pages 71-80, 2003.
[9] C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3D model matching with viewpoint invariant patches (VIPs). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[10] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to a 3D model. In Proc. 3DIM, pages 159-166, 2007.
[11] W. Zhang and J. Kosecka. Image based localization in urban environments. In 3D Data Processing, Visualization and Transmission, pages 33-40, 2006.