Hierarchical Building Recognition

Anonymous

Abstract

We propose a novel and efficient method for recognizing buildings and locations dominated by man-made structures. The method exploits two representations of model views that differ in complexity and discriminative capability. At the coarse level, model views are represented by spatially tuned color histograms computed over the dominant orientation structures in the image; this representation enables fast and efficient retrieval of the closest match from the database. At the finer level, matching is accomplished using feature descriptors associated with local image regions. We discuss a probabilistic formulation of the matching as well as means of dealing with multiple matches caused by repetitive structures, and present extensive experiments on two building databases with varying degrees of occlusion, viewpoint change and illumination change.

1 Introduction

In this paper we study the problem of building recognition. As an instance of the recognition problem this domain is interesting because the class of buildings contains many similarities, while at the same time it calls for techniques capable of fine discrimination between different instances of the class. The problem is also of interest for navigation applications, where the goal is to determine the pose of the camera with respect to the scene, given an image of the scene and a database of previously recorded images of the city or urban area. From the application perspective the natural concerns are efficiency and scalability.

1.1 Related work

One of the central issues in recognition is the choice of a suitable representation of the class and its scalability to a large number of exemplars. In the context of object recognition, both global and local image descriptors have been considered. Global image descriptors typically treat the entire image as a point in a high-dimensional space and model the changes in appearance as a function of viewpoint using subspace methods [11]. Given the subspace representation, the pose of the camera is typically obtained by spline interpolation, exploiting the continuity of the mapping between object appearance and continuously changing viewpoint. Alternative global representations proposed in the past include responses to banks of filters [19] and multi-dimensional histograms [7]. In [17] the authors suggested improving the discriminative power of plain color indexing by encoding spatial information, dividing the image into five partially overlapping regions. A multi-channel histogram approach was explored in the context of general object recognition [10]; its disadvantage is that using multiple indices can make the combined index relatively large. Local feature based techniques have recently become very effective for many object recognition problems, since they fare favorably in the presence of large amounts of clutter and changes in viewpoint. Representatives of local feature based techniques include scale invariant features [9, 14] and their associated descriptors, which are invariant to rotation or affine transformations. Several authors, motivated by navigation applications, have also studied the problem of building recognition.
In [12] the authors proposed matching with point features after first aligning the building views to their canonical views in the database, followed by recovery of the relative pose between the views. The authors of [16] matched invariant regions, while in [6] line segments were matched under the assumption of planar motion. Coarse classification of locations was also the motivating application of the work on context-based place recognition in [18].

1.2 Approach overview

We propose to tackle the building recognition problem with a two-stage hierarchical approach. The first stage is an efficient coarse classification scheme based on localized color histograms computed over the dominant orientation structures in the image. A variable number of candidates, with their associated confidence measures, is then passed to the second, feature based matching stage. The main contributions of this paper are the localized color histogram descriptor used in the fast indexing scheme and the simplified probabilistic feature based matching method used in the final recognition stage. We carry out extensive experiments on two building databases with varying degrees of occlusion, viewpoint change and illumination change, and discuss the applicability of our approach to other recognition problems.

2 Spatially tuned color histogram

To exploit the efficiency and compactness of histogram based representations while also gaining the discriminative capability and robustness of local feature descriptors, we propose a representation of buildings that trades these characteristics favorably. The representation is motivated by the observation that man-made structures such as buildings exhibit constrained geometry: parallel and orthogonal lines and planar surfaces. Parallel lines in the world intersect in the image plane at vanishing points, and in urban environments the dominant directions are typically aligned with the three orthogonal axes of the world coordinate frame. Since these orientations typically belong to man-made structures, we propose to compute the color distribution of the pixels whose orientation complies with the main vanishing directions. Discriminative power is gained by weakly encoding spatial information, which is achieved by treating the histograms of the different dominant orientations separately. We outline the process in detail in the following sections.

2.1 Dominant vanishing directions

The detection of dominant vanishing directions in the image, which are due to the presence of dominant man-made structures, is based on our earlier work [1]. The detection of line segments is followed by an efficient approach that simultaneously groups lines into dominant vanishing directions and estimates the vanishing points using the expectation maximization (EM) algorithm. The EM algorithm typically converges in a few iterations thanks to an effective initialization stage based on peaks in the orientation histogram. In our experiments the number of EM iterations is set to 10, but we often observe good convergence after fewer than 5 iterations. For buildings which lack dominant orientations, the vanishing point estimation terminates due to lack of straight line support; in such cases the first recognition stage is bypassed and matching based on local descriptors is carried out directly. We did not encounter this situation in our experiments.

2.2 Pixel membership assignment

The above step provides global direction information from the line segments aligned with the dominant orientations. The EM process typically returns two or three vanishing points, which correspond to principal directions v_i, v_j and v_k in the world coordinate frame.

Figure 1: Three views of the same building. Top: the original images. Middle: pixel memberships assigned by the geometric constraints. Bottom: final pixel memberships; deep blue marks areas classified as background, while red, light blue and yellow mark the three groups of pixels, respectively.

Now assume that there are three principal directions and that the image plane rotation of the camera does not exceed 45°. Then we can identify the three principal directions as left (v_i), right (v_j) and vertical (v_k). Each image pixel can then be classified as either aligned with one of the directions or belonging to an outlier model: a pixel with sufficiently high gradient magnitude is classified as aligned with a direction if the difference between its local edge orientation and the orientation induced by v_i, v_j or v_k is less than a threshold, which we set to 30° in our experiments.
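Below is a minimal numpy sketch of this membership test, assuming finite vanishing points given in pixel coordinates; the function and parameter names, and the gradient magnitude threshold, are illustrative rather than taken from an actual implementation:

```python
import numpy as np

def assign_memberships(gray, vps, mag_thresh=0.1, angle_thresh_deg=30.0):
    """Assign each strong-gradient pixel to the vanishing direction its
    local edge orientation is consistent with (Section 2.2).

    gray: float image in [0, 1]; vps: list of (x, y) vanishing points in
    pixel coordinates. Returns an int map: 0 = outlier/background,
    1..len(vps) = index of the assigned principal direction.
    """
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    # Edges are perpendicular to the image gradient.
    edge_angle = np.arctan2(gy, gx) + np.pi / 2.0

    ys, xs = np.mgrid[0:gray.shape[0], 0:gray.shape[1]]
    labels = np.zeros(gray.shape, dtype=np.int32)
    best = np.full(gray.shape, np.deg2rad(angle_thresh_deg))
    for k, (vx, vy) in enumerate(vps, start=1):
        # Orientation of the line joining each pixel to this vanishing point.
        vp_angle = np.arctan2(vy - ys, vx - xs)
        diff = np.abs(edge_angle - vp_angle) % np.pi
        diff = np.minimum(diff, np.pi - diff)  # orientations live mod pi
        better = (mag > mag_thresh) & (diff < best)
        labels[better] = k
        best[better] = diff[better]
    return labels
```

Each pixel is assigned to the direction with the smallest orientation disagreement, provided that disagreement is below the 30° threshold.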
Under this criterion alone, pixels belonging to background clutter (e.g. trees and grassland) are still selected, as the middle row of Figure 1 shows. Note, however, that those pixels lie in areas where the gradient direction changes frequently, so their neighboring pixels are unlikely to belong to the same group. Most of them can therefore be eliminated by connected component analysis, removing small connected components; building pixels survive this stage because they typically have consistent gradient direction. The final group memberships are shown in the bottom row of Figure 1, where most of the background bushes and trees have been removed. Note also that the memberships of individual pixels remain stable across different views, yielding a representation with high repeatability under viewpoint change.
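The clutter removal step can be sketched with scipy's connected component labeling; the minimum component size below is an illustrative choice, since the text does not fix a value:

```python
import numpy as np
from scipy import ndimage

def remove_small_components(labels, min_size=50):
    """Suppress clutter: within each direction group, drop connected
    components smaller than min_size pixels."""
    cleaned = np.zeros_like(labels)
    for k in range(1, labels.max() + 1):
        comps, n = ndimage.label(labels == k)
        if n == 0:
            continue
        sizes = ndimage.sum(labels == k, comps, index=np.arange(1, n + 1))
        keep_ids = 1 + np.flatnonzero(sizes >= min_size)
        cleaned[np.isin(comps, keep_ids)] = k
    return cleaned
```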

We next demonstrate that augmenting this spatial representation with color information yields a discriminative descriptor for the class.

Figure 2: Three views of another building with more background clutter and viewpoint change. Top: the original images. Bottom: final pixel memberships; deep blue marks areas classified as background, while red, light blue and yellow mark the three groups of pixels, respectively.

2.3 Color histogram formation

Considering only the pixels which belong to dominant directions, we compute a color histogram for each group of pixels. First, the RGB values are mapped to hue. Unlike traditional color indexing, where pixel colors are represented in 3D RGB space or 2D hue-saturation space, we adopt the 1D representation of [3]: (Y, C_b, C_r) is obtained from (R, G, B) by the standard linear RGB-to-YCbCr conversion, and the hue value is

H = \arctan2(C_b, C_r) / \pi, \quad -1 \le H \le 1. \quad (1)

The hue distribution of each group of pixels is quantized into 16 bins. To avoid boundary effects, which cause the histogram to change abruptly when values shift smoothly from one bin to another, we use linear interpolation to weight each histogram bin according to the distance between the value and the bin's center.

2.4 Building retrieval

Given a 16 × 3 = 48-dimensional descriptor representing each model image, building retrieval can proceed by comparing the descriptor of a test image to all models. There is one subtle issue: because of viewpoint change, pixels belonging to the left group may end up in the right group, and vice versa. We currently handle this by combining the histograms of the left and right groups into one, leaving two histograms and a 16 × 2 = 32-dimensional indexing vector per image. We may lose some discriminative power this way, but we already observed considerable improvement over a single global histogram, and a byproduct is a short indexing vector, which benefits both storage and comparison. At this stage the two histograms are normalized individually.

To compare a test image to the models, different distance measures can be used. We tried the L1 distance and the χ² statistic. Although the χ² statistic is not a metric (the triangle inequality does not hold), we chose it because it brings about a 7% increase in recognition rate over the L1 distance. Given the descriptor h_t of a test image and h_p of a model, the χ² distance is defined as

\chi^2(h_t, h_p) = \sum_k \frac{(h_t(k) - h_p(k))^2}{h_t(k) + h_p(k)}, \quad (2)

where the sum runs over the histogram bins (32 in our case). The compact descriptor makes the comparison very fast, which is especially beneficial when dealing with a very large database.

The first recognition stage returns a subset of models to be considered in the second stage. The number of models returned depends on the ambiguity of the top n results, quantified as

Am = \frac{\chi^2_{1st}}{\frac{1}{n-1} \sum_{i=2}^{n} \chi^2_i}, \quad (3)

where \chi^2_i is the i-th smallest distance; we set n = 5 in our experiments. The ambiguity measure is very low when the test image is easy to classify and close to 1 when it is hard to identify. The number of listed results is then N_r = N_m \cdot Am^2, where N_m is the maximum allowed list size, which we set to 20; the actual list is typically much smaller. When each object model has multiple views in the database, the smallest χ² distance among those views is used to compute the list size.
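The following sketch implements the first-stage machinery described above: the soft-binned hue histogram of Eq. (1), the χ² distance of Eq. (2), and the adaptive list size based on Eq. (3). The BT.601 chroma coefficients are an assumption for the RGB-to-YCbCr conversion, and all names are illustrative:

```python
import numpy as np

def hue_histogram(rgb, mask, n_bins=16):
    """16-bin hue histogram with linear (soft) bin assignment over the
    pixels selected by mask (Section 2.3). Hue follows Eq. (1):
    H = arctan2(Cb, Cr) / pi, in [-1, 1]."""
    r, g, b = (rgb[..., c][mask].astype(float) for c in range(3))
    # Chroma from the standard RGB->YCbCr conversion; the BT.601
    # coefficients below are our assumption for this sketch.
    cb = -0.169 * r - 0.331 * g + 0.500 * b
    cr = 0.500 * r - 0.419 * g - 0.081 * b
    h = np.arctan2(cb, cr) / np.pi              # in [-1, 1]
    pos = (h + 1.0) / 2.0 * n_bins - 0.5        # fractional bin position
    lo = np.floor(pos).astype(int)
    w = pos - lo                                # weight of the upper bin
    hist = np.zeros(n_bins)
    np.add.at(hist, lo % n_bins, 1.0 - w)       # split each vote linearly
    np.add.at(hist, (lo + 1) % n_bins, w)       # between the two nearest bins
    total = hist.sum()
    return hist / total if total > 0 else hist

def chi2(h_t, h_p):
    """Chi-square distance of Eq. (2)."""
    denom = h_t + h_p
    return np.sum((h_t - h_p) ** 2 / np.where(denom > 0, denom, 1.0))

def list_size(chi2_dists, n=5, n_max=20):
    """Ambiguity of Eq. (3) and the adaptive list size N_r = N_m * Am^2."""
    d = np.sort(np.asarray(chi2_dists))
    am = d[0] / np.mean(d[1:n])
    return max(1, int(round(n_max * am ** 2)))
```

In this sketch the merged left/right group and the vertical group would each be passed through hue_histogram, giving the 32-dimensional indexing vector.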
We report the recognition rates obtained by this first recognition stage in Section 4.

3 Local feature based refinement

The purpose of the second recognition stage is to further refine the results, enable finer matching and establish correspondences for subsequent recovery of the relative pose between the test and model views. In this stage we exploit SIFT features and their associated descriptors, introduced by [9]. For each model image, the feature descriptors are extracted offline and saved in the database along with the color indexing vectors. After extracting features from a test image, its descriptors are matched to those of the models selected in the first recognition stage.
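A minimal sketch of the offline extraction step, substituting OpenCV's SIFT implementation for Lowe's original [9]:

```python
import cv2

def extract_model_features(image_path):
    """Offline stage: SIFT keypoints and descriptors for one model view,
    stored in the database next to its color indexing vector."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```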

Figure 3: a) Matching using only the nearest neighbor ratio criterion (11 matches); b) using only the cosine measure (15 matches); c) using both measures (24 matches).

Figure 4: The other three models listed in the top 3 by the coarse recognition stage have far fewer matches than the correct model.

Figure 5: Another example where the appearance based technique helps identify the correct model. The correct model has many more matches, although it was listed only as the third candidate by the coarse recognition stage.

In the original matching scheme of [9], a pair of keypoints is considered a match if the distance ratio between the closest and second closest match is below a threshold τ_r. For buildings, which contain many repetitive structures, this criterion rejects many possible matches, because up to k nearest neighbors may lie at very similar distances. One option for tackling this issue would be to cluster in descriptor space to capture the repetition, as suggested in the context of texture analysis [8]; while suitable for recognition, this would not allow us to obtain actual correspondences. We instead add another criterion, which considers two features matched when the cosine of the angle between their descriptors is above a threshold τ_c. For each keypoint in the query image, only the nearest neighbor is kept if multiple candidates pass the threshold. Using both the nearest neighbor ratio and the cosine threshold yields a larger number of matches, as demonstrated in Figure 3. Although the difference in match counts between a single criterion and both criteria is small, the recovered matches are often crucial for models with a small number of features.

Given the k candidate models obtained in the first stage, we can refine the recognition by matching SIFT keypoints. Denoting the number of matches between candidate model image I_j and the test image Q by C(Q, I_j), the most likely model can be determined by a simple voting strategy: the best model is the one with the highest number of successfully matched points, C* = max_j C(Q, I_j). Figures 4 and 5 show the top 4 candidates returned by the first stage and their respective SIFT matches; note that the correct models have many more successful matches than the other candidates.
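A sketch of the combined matching criterion, accepting a pair that passes either the ratio test or the cosine threshold; the values of τ_r and τ_c are illustrative, not the paper's:

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_features(query_desc, model_desc, tau_r=0.8, tau_c=0.95):
    """Accept a query/model descriptor pair if it passes EITHER the
    nearest-neighbor distance ratio test OR the cosine-similarity
    threshold (Section 3); only the nearest neighbor per query keypoint
    is kept. Returns (query_idx, model_idx, similarity) triples."""
    d_euc = cdist(query_desc, model_desc)                  # for the ratio test
    cos = 1.0 - cdist(query_desc, model_desc, 'cosine')    # cosine measure
    matches = []
    for i in range(d_euc.shape[0]):
        nn = np.argsort(d_euc[i])[:2]                      # two closest models
        ratio_ok = d_euc[i, nn[1]] > 0 and d_euc[i, nn[0]] / d_euc[i, nn[1]] < tau_r
        cosine_ok = cos[i, nn[0]] > tau_c
        if ratio_ok or cosine_ok:
            matches.append((i, int(nn[0]), float(cos[i, nn[0]])))
    return matches
```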
3.1 Probabilistic formulation

To better understand the advantages and limitations of local feature based methods in our domain, we carried out an additional set of experiments involving only the feature based second recognition stage. Since this strategy may be useful in the absence of color information, we extended the database with an additional 68 building models captured as grayscale images, with 3-4 views per building and larger viewpoint variation and clutter. Using solely the voting strategy outlined above proved ineffective, due to the large variation in the number of features detected across building models and the repetitive nature of man-made structures. To improve on standard voting, we introduce a simple probabilistic model which takes into account both the number of features in each model and the quality of the attained matches, characterized by the similarity between descriptors. Several probabilistic formulations have been considered in the past in the context of object recognition. The existing models differ in their complexity and in the number of aspects they account for probabilistically (e.g. photometric and geometric aspects) [13, 4]. Our probabilistic model accounts only for the quality of the matches, without explicitly modeling spatial relationships between features, because repetitive patterns often cause features to be matched to different locations.

To quantify match quality we need to reconcile the two matching criteria, the distance ratio test and the cosine threshold. Keypoint pairs accepted via the distance ratio test can have quite low similarity scores, which does not reflect the fact that they are potentially high quality matches. We reconcile this by learning the match quality for such matches in a controlled experiment, observing the fraction of matched pairs that are correct correspondences; more details can be found in [2]. Another possibility would be to use the Mahalanobis distance between descriptors, with the covariance learned in a controlled experiment.

In the probabilistic setting, the recognition problem is to compute the posterior probability of a building model given a query image Q, P(I_j | Q). The probability that building I_j appears in query image Q depends only on the set of matches {m_i} = C(Q, I_j), so it can be written P(I_j | {m_i}). Since spatial relationships between individual matches are not considered, the keypoint indexes are omitted, and in the following we also omit the model index j for clarity. Instead of computing P(I | {m_i}) directly, we consider the probability that Q does not contain model I, P(Ī | {m_i}) = 1 − P(I | {m_i}); as we demonstrate shortly, this avoids the need to assign probabilities to unmatched features. Assuming that all matched pairs are independent, we obtain

P(\bar{I} \mid Q) = P(\bar{I} \mid \{m_i\}) = \frac{P(\{m_i\} \mid \bar{I}) \, P(\bar{I})}{P(\{m_i\})} = \frac{\prod_{i=1}^{n} P(m_i \mid \bar{I}) \, P(\bar{I})}{\prod_{i=1}^{n} P(m_i)} \quad (4)

= \frac{\prod_{i=1}^{n} \big(1 - P(m_i \mid I)\big) \, P(\bar{I})}{\prod_{i=1}^{n} \big(P(m_i \mid I) + P(m_i \mid \bar{I})\big)}, \quad (5)

where n is the number of matched pairs and P(Ī) is the prior, assumed uniform over all models; the confidence measure obtained from the first recognition stage could be used in place of this prior. Denoting

\alpha_i = \frac{1 - P(m_i \mid I)}{P(m_i \mid I) + P(m_i \mid \bar{I})},

we can write

P(I \mid Q) = 1 - \prod_{i=1}^{n} \alpha_i \, P(\bar{I}). \quad (6)

The term α_i naturally depends on the match quality, which is related to the distance between the two descriptors. Instead of evaluating P(I | Q) over all detected features and assigning a small constant probability ε to features that were detected but not matched, we evaluate this term only over matched features. Excluding the contribution of unmatched features avoids situations where they significantly skew the overall likelihood estimate: m unmatched features would contribute a factor (1 − ε)^m and could affect the final likelihood even when a large number of correct matches was detected. α_i is also related to the number of features in each model: it will be larger for models with more features, because the probability that none of their features gets matched is lower. We therefore select approximately the same number of features for each model to eliminate this bias.
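A sketch of this scoring, combining Eq. (6) with the robust weighting function F of Eq. (7) defined below; both the scale parameter s and the prior are placeholder values:

```python
import numpy as np

def alpha_of(d, s=0.25):
    """Eq. (7) mapping from similarity score d to alpha, normalized by
    its maximum F(1) = 2/(pi*s) so values lie in [0, 1]. The fitted
    value of s is not preserved in this transcription; s=0.25 is a
    placeholder."""
    f = 2.0 * s ** 3 / (np.pi * (s ** 2 + (1.0 - d) ** 2) ** 2)
    return f / (2.0 / (np.pi * s))

def model_posterior(similarities, prior_not_model=0.5):
    """Eq. (6): P(I | Q) = 1 - P(not I) * prod_i alpha_i, evaluated only
    over the matched pairs; the prior value is illustrative."""
    alphas = alpha_of(np.asarray(similarities, dtype=float))
    return 1.0 - prior_not_model * np.prod(alphas)
```

With alpha_of replaced by a constant, the score reduces to a monotone function of the match count, recovering plain voting.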
Model feature selection stage. To choose a subset of features so that the number of features is approximately the same in all models, we consider the quality of each feature, measured by its repeatability and distinctiveness. A feature which appears in multiple views of the same model is more repeatable and more likely to appear in a query view; a feature which appears in multiple models is less characteristic than one present only in views of a single model. By studying the local information content of each feature in the model views [5], we estimate the posterior probability P(I_j | m_i) of all model features by directly estimating this conditional density with a Parzen windows approach. The probability is higher when the feature is more repeatable and lower when it is less characteristic. For each model we keep only the features with P(I_j | m_i) above a threshold τ_p, set so that each image contains at most 500 discriminative features. Models originally contain thousands of features, and this procedure removes around 60% of the original feature set.

Learning the parameter α. With the influence of the feature count removed, the relationship between α_i and the similarity score d_i of two descriptors can be represented by a function F : d_i → α_i, which we learn in a supervised setting. Given a test image, a subset of model images (including the correct model) and a fixed similarity threshold τ_s, we obtain a large set of matched pairs between the test image and the model images; denote its cardinality by M. We then identify the set of correct matches among them, with cardinality N. N/M serves as the probability of a true correspondence at this τ_s, and (M − N)/M is the α value we want for this τ_s. Repeating this for a number of different τ_s values yields a set of sample points along the curve relating the similarity score d_i to α_i. The robust function

F(d_i) = \frac{2 s^3}{\pi \big(s^2 + (1 - d_i)^2\big)^2} \quad (7)

approximates the curve shape very well for a suitable value of the scale parameter s. The parameter α_i is obtained by dividing F(d_i) by its maximum, so that values lie in the interval [0, 1]. With the mapping between α and d_i in place, the posterior probability of each model given a set of matches can be computed from Equation 6; if the mapping is set to a constant, the probabilistic formulation has the same effect as voting.

4 Experimental results

To test our approach we downloaded the ZuBuD database [15], which contains 200 buildings of Zürich with 5 views each. Some of the images were taken with the camera rotated by 90°, so we pre-rotated them before the experiments, since a 90° rotation causes ambiguity; as mentioned above, our approach currently tolerates up to 45° rotation about the optical axis.

To demonstrate the benefits of the spatially tuned histogram, we carried out a preliminary experiment comparing it with two alternatives: a) using only the pixels on the detected straight lines to form one color histogram; b) using all the pixels from the three groups to form one color histogram. The first views of the 200 buildings were chosen as models and the second views as test images. The results are summarized in Table 1; the first three columns list the percentage of correctly classified buildings appearing in the respective top-k list, and the last column shows the average list size over all test images.

pixels             1st     top 5   list   average size
line pixels        65.5%   83.5%   88%    5.5
single histogram   69%     89%     92%    5.0
our approach       83%     93%     95%    5.1

Table 1: Recognition performance of the first recognition stage.

As the table shows, with one building view per model we obtain an 83% recognition rate, demonstrating fairly good discriminative power. We can also see the benefit of the variable top-k list: while the fixed top 5 list gives a 93% hit ratio, the adaptive list with average size 5.1 gives 95%.

For the second experiment we used the query images provided with the Zürich database and all 5 views per model. This setting brings a noticeable improvement: of the 115 query images, 90.3% were recognized correctly, and 96.5% have the correct model in the top 5 list. The remaining 4 images are very difficult to recognize. Three of them come from one building and fail because of large lighting and viewpoint changes between the query and the model views; the fourth failure is due to a dramatic viewpoint change that would be difficult even for a human. Figure 8 shows the two misclassified buildings.

Figure 6: Example of a correctly recognized test image. The query image and the top four results are listed from left to right; some images are resized for display.

Figure 7: Incorrect recognitions with the correct models in the list.

We can see that the 32-dimensional spatially tuned color histogram has very good discrimination capability: the first recognition stage alone achieves a better recognition rate than reported in [16] and is comparable to the line matching technique of [6]. The ambiguity measures obtained in this experiment are very small: for 64 of the query images only one candidate is selected, and the maximum list size is 9. Besides its apparent efficiency in comparison, the indexing vector extraction is quite efficient too: though currently implemented in Matlab, processing one image takes only about 2 seconds on a 1.5 GHz notebook computer. If planar motion can be assumed, the process can be faster still, because the vanishing directions are known a priori.

4.1 Local feature based recognition

We performed two separate experiments in this stage.
First, we perform fine recognition on the candidates provided by the coarse recognition stage. As demonstrated in Figures 4 and 5, the correct model's matches outnumber those of the other candidates by a large margin, so the benefits of probabilistic matching do not manifest here. In this experiment all the incorrect coarse recognitions except the 4 failures were corrected, for a final recognition rate of 96% on the ZuBuD query images, improving on the results reported on the same database [6, 16].

The second experiment used the database of 68 buildings with no color information available. The first view of each building served as the model and the other two views as test images; Table 2 summarizes the results.

Figure 8: The two buildings which failed. The query images are on the left, with their corresponding five model views on the right. Top: one query view of the building which caused 3 failures. Bottom: the 4th failure.

test set (method)        1st     top 5
view 2 (voting)          95.5%   98.5%
view 2 (probabilistic)   98.5%   98.5%
view 3 (voting)          88.2%   98.5%
view 3 (probabilistic)   91.2%   98.5%

Table 2: Recognition performance of SIFT feature based matching.

In the experiments with the second database we see the benefit of the probabilistic model: 6 images that receive wrong results under voting are classified correctly by the probabilistic formulation, with probability scores more distinct than the raw match counts. Figure 9a-b shows one such example. The additional example in Figure 9c illustrates the intricacies of feature based matching in the presence of repetitive structures; in this case the building was correctly recognized. We are currently extending this work by incorporating a pose recovery stage using the matches obtained in the second stage.

Figure 9: a) The wrong model, with 25 matches; b) the correct model, with 22 matches: voting would produce the wrong result, yet the probability of the wrong model is only 0.03, much less than the 0.12 obtained by the correct one; c) correct classification using the probabilistic formulation.

5 Conclusion and future work

In this paper we proposed a hierarchical scheme for building recognition. We first introduced a new spatially tuned color histogram representation of buildings, used in the first recognition stage. Our experiments show that it has discriminating capability comparable to local feature based techniques, without the need to find correspondences. When multiple views per model are available, the candidates can be selected with high accuracy and correct classification is often achieved already in the first stage. Due to its compact size this representation is very efficient and scales well to large databases. In the second stage we used local feature based matching to further refine the recognition results, and in this context we proposed a new, simple probabilistic model accounting for feature quality and demonstrated its performance on a larger database. The second recognition stage is quite self-contained and can stand on its own when color information is not available. The bias toward models with more features is resolved by a feature selection process, which also improves the efficiency of the second matching stage and makes the model database more compact. The two stages complement each other well and enable us to successfully recognize buildings despite large variations in viewpoint and the presence of background clutter. We are currently extending the experiments to incorporate a pose recovery stage using the matches obtained in the second stage.

References

[1] Anonymous.
[2] Anonymous.
[3] H. Aoki, B. Schiele, and A. Pentland. Recognizing personal location from video.
[4] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proceedings of ECCV.
[5] G. Fritz and L. Paletta. Object recognition using local information content. In ICPR 2004, pages 15-18.
[6] T. Goedeme and T. Tuytelaars. Fast wide baseline matching for visual navigation. In CVPR 2004, pages 24-29.
[7] H. Aoki, B. Schiele, and A. Pentland. Recognizing places using image sequences. In Conference on Perceptual User Interfaces, San Francisco, November.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Affine-invariant local descriptors and neighborhood statistics for texture recognition. In Proceedings of the IEEE International Conference on Computer Vision.
[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, to appear, 2004.
[10] B. W. Mel. Seemore: Combining color, shape and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9.
[11] S. Nayar, S. Nene, and H. Murase. Subspace methods for robot vision. IEEE Transactions on Robotics and Automation, 6(5).
[12] D. Robertson and R. Cipolla. An image-based system for urban navigation. In BMVC.
[13] C. Schmid. A structured probabilistic model for recognition. In Proceedings of CVPR, Kauai, Hawaii.
[14] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19.
[15] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD: Zurich buildings database for image based recognition. Technical Report No. 260, Swiss Federal Institute of Technology, Switzerland.
[16] H. Shao, T. Svoboda, T. Tuytelaars, and L. Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Image and Video Retrieval, Lecture Notes in Computer Science, pages 71-80, July.
[17] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10:66-73.
[18] A. Torralba, K. Murphy, W. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In International Conference on Computer Vision.
[19] A. Torralba and P. Sinha. Recognizing indoor scenes. MIT AI Memo, 2001.


COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

Global Localization and Relative Positioning Based on Scale-Invariant Keypoints

Global Localization and Relative Positioning Based on Scale-Invariant Keypoints Global Localization and Relative Positioning Based on Scale-Invariant Keypoints Jana Košecká, Fayin Li and Xialong Yang Computer Science Department,George Mason University, Fairfax, VA 22030 Abstract The

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17 Recognition (Part 4) CSE 152 Lecture 17 Announcements Homework 5 is due June 9, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images

More information

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt.

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. Section 10 - Detectors part II Descriptors Mani Golparvar-Fard Department of Civil and Environmental Engineering 3129D, Newmark Civil Engineering

More information

CS 4495 Computer Vision A. Bobick. CS 4495 Computer Vision. Features 2 SIFT descriptor. Aaron Bobick School of Interactive Computing

CS 4495 Computer Vision A. Bobick. CS 4495 Computer Vision. Features 2 SIFT descriptor. Aaron Bobick School of Interactive Computing CS 4495 Computer Vision Features 2 SIFT descriptor Aaron Bobick School of Interactive Computing Administrivia PS 3: Out due Oct 6 th. Features recap: Goal is to find corresponding locations in two images.

More information

TA Section 7 Problem Set 3. SIFT (Lowe 2004) Shape Context (Belongie et al. 2002) Voxel Coloring (Seitz and Dyer 1999)

TA Section 7 Problem Set 3. SIFT (Lowe 2004) Shape Context (Belongie et al. 2002) Voxel Coloring (Seitz and Dyer 1999) TA Section 7 Problem Set 3 SIFT (Lowe 2004) Shape Context (Belongie et al. 2002) Voxel Coloring (Seitz and Dyer 1999) Sam Corbett-Davies TA Section 7 02-13-2014 Distinctive Image Features from Scale-Invariant

More information

Linear combinations of simple classifiers for the PASCAL challenge

Linear combinations of simple classifiers for the PASCAL challenge Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu

More information

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving

More information

CS4495/6495 Introduction to Computer Vision. 8C-L1 Classification: Discriminative models

CS4495/6495 Introduction to Computer Vision. 8C-L1 Classification: Discriminative models CS4495/6495 Introduction to Computer Vision 8C-L1 Classification: Discriminative models Remember: Supervised classification Given a collection of labeled examples, come up with a function that will predict

More information

Fitting: The Hough transform

Fitting: The Hough transform Fitting: The Hough transform Voting schemes Let each feature vote for all the models that are compatible with it Hopefully the noise features will not vote consistently for any single model Missing data

More information

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Features Points Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Finding Corners Edge detectors perform poorly at corners. Corners provide repeatable points for matching, so

More information

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints Last week Multi-Frame Structure from Motion: Multi-View Stereo Unknown camera viewpoints Last week PCA Today Recognition Today Recognition Recognition problems What is it? Object detection Who is it? Recognizing

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Instance-level recognition

Instance-level recognition Instance-level recognition 1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing Matching of descriptors Matching and 3D reconstruction

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribes: Jeremy Pollock and Neil Alldrin LECTURE 14 Robust Feature Matching 14.1. Introduction Last lecture we learned how to find interest points

More information

Specular Reflection Separation using Dark Channel Prior

Specular Reflection Separation using Dark Channel Prior 2013 IEEE Conference on Computer Vision and Pattern Recognition Specular Reflection Separation using Dark Channel Prior Hyeongwoo Kim KAIST hyeongwoo.kim@kaist.ac.kr Hailin Jin Adobe Research hljin@adobe.com

More information

Part-Based Skew Estimation for Mathematical Expressions

Part-Based Skew Estimation for Mathematical Expressions Soma Shiraishi, Yaokai Feng, and Seiichi Uchida shiraishi@human.ait.kyushu-u.ac.jp {fengyk,uchida}@ait.kyushu-u.ac.jp Abstract We propose a novel method for the skew estimation on text images containing

More information

Verification: is that a lamp? What do we mean by recognition? Recognition. Recognition

Verification: is that a lamp? What do we mean by recognition? Recognition. Recognition Recognition Recognition The Margaret Thatcher Illusion, by Peter Thompson The Margaret Thatcher Illusion, by Peter Thompson Readings C. Bishop, Neural Networks for Pattern Recognition, Oxford University

More information