Image Feature Evaluation for Contents-based Image Retrieval

Adam Kuffner and Antonio Robles-Kelly

Department of Theoretical Physics, Australian National University, Canberra, Australia
Vision Science, Technology and Applications (VISTA), NICTA, Canberra, Australia
Department of Information Engineering, Australian National University, Canberra, Australia
u9669@anu.edu.au antonio.robles-kelly@nicta.com.au

National ICT Australia is funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the HCSNet Workshop on the Use of Vision in HCI (VisHCI 2006), Canberra, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 56. R. Goecke, A. Robles-Kelly & T. Caelli, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Abstract

This paper is concerned with feature evaluation for content-based image retrieval. Here we concentrate our attention on the evaluation of image features amongst three alternatives, namely the Harris corner detector, the maximally stable extremal regions (MSER) and the scale invariant feature transform (SIFT). To evaluate these image features in a content-based image retrieval setting, we use the KD-tree algorithm to match the features corresponding to the query image with those recovered from the images in the data set under study. With the matches at hand, we use a nearest neighbour approach to threshold the Euclidean distances between pairs of corresponding features. In this way, the retrieval is such that those features whose pairwise distances are small vote for a retrieval candidate in the data set. This voting scheme allows us to arrange the images in the data set in order of relevance and permits the recovery of measures of performance for each of the three alternatives. In our experiments, we focus on the evaluation of the effects of scaling and rotation on the retrieval performance.

Introduction

Content-based retrieval in image databases is a daunting and often costly task. As in a traditional database, the image must be described in order to be incorporated into the database and form part of the index. The engine of the database uses the index to quickly search the most probable candidates and then, making use of a similarity measure, retrieves the candidate images in order of relevance.

Image retrieval has been a long-standing problem in computer vision and pattern recognition, and early surveys can be traced back to the mid-eighties (Tamura & Yokoya 1984). Nonetheless, one of the first attempts to cast the problem of retrieving images from a database as a task based upon content was that introduced in (Sclaroff & Pentland 1993). Here, Sclaroff and Pentland presented a method in which the user is allowed to provide a search model, such as a sketch or example image, so as to perform a query whose output is a set of thumbnails ordered by relevance. In their model, the concept of relevance implies similarity, which is modeled as a continuous value between zero and unity. This measure is often modeled as a metric based on image features such as colours, corners, edges, etc.

Figure 1: Diagram of our voting score recovery scheme.

Following this trend, automatic image database systems use elementary features in the image to index and retrieve the candidates from the database. For instance, QBIC (Niblack et al. 1993) allows images to be retrieved using shape, colour and texture.
FourEyes (Picard 1995) employs a high-level image feature processing scheme to modify the structure of the database and the retrieval parameters. Photobook (Pentland, Picard & Sclaroff 1994) is a collection of tools to search and organise image datasets. SQUID (Shape Queries Using Image Databases) (Mokhtarian, Abbasi & Kittler 1996) uses a scale-space representation of shape so as to accomplish queries based upon contour similarity.

The main argument levelled against these systems concerns their lack of robustness to rotation and occlusion. Also, they often require a human expert to determine the parameters of the search criteria. As a result, retrieval methods vary greatly from one application to another, and currently available image database systems make use of a hash or index to retrieve the images. Furthermore, the results of the query operation and the performance of retrieval applications rely heavily on an appropriate selection of the image features used to characterise the images under study.

As a result, there has recently been an increasing interest in the evaluation of image features and descriptors computed from interest points on the image. For instance, Carneiro and Jepson (Carneiro & Jepson 2002) have used Receiver Operating Characteristics (ROC) to compare test query descriptors against a library of reference descriptors computed from a separate dataset. Mikolajczyk and Schmid (Mikolajczyk & Schmid 2005) have evaluated a number of local descriptors in the context of matching and recognition. Mikolajczyk et al. (Mikolajczyk, Tuytelaars, Schmid, Zisserman, Matas, Schaffalitzky, Kadir & Gool n.d.) have taken this analysis further and evaluated local descriptors subject to affine transformations. In a series of related developments, Randen and Husoy (Randen & Husoy 1999) and Varma and Zisserman (Varma & Zisserman 2003) have compared different local image descriptors, also known as filters, for texture classification.

Figure 2: Sample views for the objects (Object 1 to Object 20) in the Columbia University COIL-20 database.

Image Feature Evaluation and Contents-based Image Retrieval

In contrast with other studies elsewhere in the literature, here we compare the performance of local descriptors for purposes of contents-based image retrieval. Rather than evaluating the capabilities of the image features to describe the scene subject to affine transformations, we focus on the adequacy of the local descriptors for contents-based image retrieval and compare them using the same evaluation criteria and experimental vehicle. To this end, we have selected three alternatives which have previously shown a good performance in such a context and evaluate the retrieval rate under viewpoint rotation and image scaling.

As mentioned earlier, in this paper we aim to evaluate the adequacy for content-based image retrieval of three feature recovery methods. These are the Harris corner detector (Harris & Stephens 1988), the Maximally Stable Extremal Regions (MSER) (Matas, Chum, Martin & Pajdla 2002) and the Scale Invariant Feature Transform (SIFT) (Lowe 2004). Thus, we have divided the section into two parts. The first of these concerns an overview of the local image descriptors used as alternatives for the recognition process. The second of these introduces the retrieval scheme used for purposes of the evaluation presented in this paper.

Retrieval Process

Having provided an overview of the image features to be evaluated, we now present the image retrieval scheme used throughout the paper for the purposes of evaluating the suitability of the local descriptors above for content-based image retrieval. The diagrammatic representation of our retrieval scheme is shown in Figure 1. Our method recovers, at input, the features for both the images in the database and the query image. With the image features at hand, we recover the correspondences between the features in the query image and those in each of the data images making use of the KD-tree algorithm (Bentley 1975). These correspondences are an equivalence relation which we use to recover a score that depicts the similarity between the query and the data images. This score is based upon the Euclidean distances between pairs of corresponding features.

Being more formal, consider the query image I_Q and the data image I_D whose respective feature sets are Ω_Q = {ω_Q(1), ω_Q(2), ..., ω_Q(|Ω_Q|)} and Ω_D = {ω_D(1), ω_D(2), ..., ω_D(|Ω_D|)}, where ω_Q(i) and ω_D(i) are the i-th feature vectors for the query and the data images. If there is a match between the feature vectors ω_Q(i) and ω_D(j), their squared Euclidean distance r(ω_Q(i), ω_D(j)) can be used to recover the score

    β = |Γ| / max{|Ω_Q|, |Ω_D|}    (1)

where Γ is the set of feature vectors recovered from the data image I_D whose pairwise distances with respect to their matching query-image features are below a given threshold ɛ, i.e. ω_D(j) ∈ Γ ⟺ r(ω_Q(i), ω_D(j)) ≤ ɛ. As a result, if β = 1, the feature vectors recovered from the query image are all sufficiently close to the features in the data image. Furthermore, in this case the number of features for the query and data images must be equal, i.e. |Ω_Q| = |Ω_D|. On the other hand, if β tends to zero, then the number of feature vectors in the query image which are far apart from those features in the data image to which they have been matched is large.

Hence, β is a normalised voting score which can be viewed as a similarity measure between the query and the data image. Furthermore, β captures the similarity between images based upon their features. Thus, the retrieval is based upon the image contents.
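As an aside for the reader, the following is a minimal sketch of the scoring step, using scipy's cKDTree as a stand-in for the KD-tree of Bentley (1975); the function name voting_score, the array layout and the threshold eps are illustrative assumptions rather than the authors' implementation.

    import numpy as np
    from scipy.spatial import cKDTree

    def voting_score(query_feats, data_feats, eps):
        """Normalised voting score (Equation 1) between a query and a data image.

        query_feats: (|Omega_Q|, d) array of query feature vectors.
        data_feats:  (|Omega_D|, d) array of data feature vectors.
        eps:         threshold on the squared Euclidean distance.
        """
        if len(query_feats) == 0 or len(data_feats) == 0:
            return 0.0
        # Match each query feature to its nearest data feature (KD-tree search).
        tree = cKDTree(data_feats)
        dists, _ = tree.query(query_feats, k=1)
        # Gamma: matched data features whose squared distance to their
        # corresponding query feature falls below the threshold eps.
        votes = int(np.sum(dists ** 2 <= eps))
        # Normalise by the larger of the two feature-set sizes.
        return votes / max(len(query_feats), len(data_feats))

Ranking the data images by decreasing voting score then yields the order of relevance used throughout the experiments.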
To construct the feature vectors ω_Q(i) and ω_D(i) we have considered the nature of each of the three alternatives evaluated here. Furthermore, by construction, the feature vectors are a set of parameters that describe the feature under study.

For instance, recall that the Harris corner detector finds specific corners on objects and features within a grey-scale image. It does this by taking an image and convolving it with a Sobel gradient filter to produce gradient maps, which are then used to compute the locally averaged moment matrix. It then combines the eigenvalues of the moment matrix to compute a corner strength, whose maximum values indicate the corner positions. Thus, in our experiments, our feature vector corresponds to the x and y coordinates on the image plane for the detected corners.

In the case of the MSER, the algorithm operates on a grey-scale image by finding the regions that are maximally stable with respect to changes in pixel intensities. With the MSERs at hand, we fit an ellipse to each of the recovered regions making use of the algorithm of Fitzgibbon, Pilu and Fisher (Fitzgibbon, Pilu & Fisher 1999), which fits ellipses to the recovered regions so as to minimise the sum of squared algebraic distances. Thus, for the MSER, our feature vector is a five-dimensional one comprised of the centroid coordinates, orientation and major and minor axis lengths of the ellipse fitted to each of the maximally stable regions.

In contrast with the Harris corner detector and the MSER, where the feature vector is based upon the geometric interpretation of the output of the feature recovery method, in the case of the SIFT we make use of the 128-element descriptor yielded by the method of Lowe (Lowe 2004). The SIFT recovers and characterises points invariant to scaling making use of a four-stage cascading filter approach which commences with a scale-space extrema detection. This first step consists of a difference-of-Gaussians, which is used to identify potential interest points. Keypoint localisation is then used to eliminate previously calculated keypoints that either have low contrast or are poorly localised along an edge. After recovering keypoint orientations, local gradient data is used to construct a descriptor for each of the recovered keypoints.
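For illustration, the three feature-vector constructions just described might be approximated with OpenCV as follows; the detector parameters, thresholds and the input filename view.png are assumptions made for the sketch, not the settings used in the paper.

    # Requires OpenCV >= 4.4 (cv2.SIFT_create) and numpy.
    import cv2
    import numpy as np

    gray = cv2.imread('view.png', cv2.IMREAD_GRAYSCALE)  # hypothetical input view

    # Harris corners: the feature vector is the (x, y) image-plane coordinates.
    strength = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    ys, xs = np.nonzero(strength > 0.01 * strength.max())
    harris_feats = np.stack([xs, ys], axis=1).astype(np.float32)

    # MSER: a five-dimensional vector per region, taken from the fitted
    # ellipse (centroid coordinates, axis lengths and orientation).
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mser_feats = []
    for pts in regions:
        if len(pts) >= 5:  # cv2.fitEllipse requires at least five points
            (cx, cy), (major, minor), angle = cv2.fitEllipse(pts)
            mser_feats.append([cx, cy, major, minor, angle])
    mser_feats = np.asarray(mser_feats, dtype=np.float32)

    # SIFT: the 128-element descriptor of Lowe (2004).
    sift = cv2.SIFT_create()
    _, sift_feats = sift.detectAndCompute(gray, None)

Note that cv2.fitEllipse performs a least-squares ellipse fit in the spirit of (Fitzgibbon, Pilu & Fisher 1999), which is why it is used as the stand-in here.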

Results

As mentioned earlier, our aim here is the evaluation of the three feature descriptors above for purposes of contents-based image retrieval. Thus, in this section, we assess the quality of the retrieval results using, as an experimental vehicle, the Columbia University COIL-20 database and an in-house acquired database of urban scenes. To evaluate the effect of scaling on the performance of the image retrieval operation, we have performed experiments with four image scales ϕ, which correspond to 25%, 50%, 75% and 100% of the image size, i.e. ϕ = {0.25, 0.5, 0.75, 1}. For each of the views, our feature set is comprised of the feature vectors recovered by each of the three alternatives under study.

Figure 3: Retrieval results on the COIL-20 database for the three alternatives of image feature vectors (Harris corner, MSER and SIFT) and four different values of the scaling factor ϕ.

Columbia University COIL-20 Database

The COIL-20 database contains 72 views for 20 objects acquired by rotating the object under study about the vertical axis. The scaling of the views and this rotation account for the affine transformations mentioned earlier. In Figure 2, we show sample views for each of the objects in the database.

For our feature-based image retrieval experiments, we have removed 10 out of the 72 views for each object, i.e. one in every seven views. These views are our query images. The views remaining in the database constitute our data set, i.e. 62 views per object. For each of our query views, we retrieve the four images from the data set which amount to the highest values of the voting score β. Ideally, this scheme should select the four data views indexed immediately before and after the query view. In other words, the correct retrieval results for the query view indexed i are those views indexed i - 2, i - 1, i + 1 and i + 2. This scheme allows us to use the number of correctly recovered views as a measure of the accuracy of the matching algorithm and, hence, lends itself naturally to the performance assessment task in hand.

In Figure 3, we show, for each scale and image feature alternative, i.e. Harris corner, MSER and SIFT descriptors, the mean retrieval rate as a function of object index. We also indicate, using error bars, the standard error for the ten query images per object. In the plots, we have used the indexing provided in Figure 2.
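To make the protocol explicit, a sketch of the accuracy computation for a single query view follows; the helper name retrieval_rate and its argument layout are hypothetical.

    import numpy as np

    def retrieval_rate(query_view, data_views, scores):
        """Fraction of the four highest-scoring retrievals that are correct.

        query_view: index of the query within the object's view sequence.
        data_views: view indices of the data images, aligned with scores.
        scores:     voting scores beta of the query against each data image.
        """
        # Retrieve the four data views with the highest voting scores.
        top4 = [data_views[i] for i in np.argsort(scores)[::-1][:4]]
        # Correct retrievals are the views adjacent to the query view.
        correct = {query_view - 2, query_view - 1, query_view + 1, query_view + 2}
        return len(correct.intersection(top4)) / 4.0

Averaging this quantity over the ten query views of an object gives the mean retrieval rate plotted in Figure 3.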

Figure 4: Sample views for the scenes (Scene 1 to Scene 10) in our database.

Urban-scene Database

Having presented evaluation results on the COIL-20 database, we now turn our attention to a more challenging setting. This is provided by a database of urban scenes. This database contains 120 views for 10 scenes acquired by rotating the camera about its vertical axis from 0 to 66 degrees in steps of 6 degrees, i.e. 12 views per scene. This viewpoint rotation, in conjunction with the scaling operations on the imagery, accounts for the affine transformations of the scene under study. In Figure 4, we show sample views for each of the scenes in the database.

For our feature-based image retrieval experiments, we have followed an approach akin to that employed on the COIL-20 database. At this point, it is worth noting that, since our images are true-colour ones, we have converted them into grey-scale. After performing this conversion as a preprocessing step, we have removed 4 out of the 12 views for each scene. This amounts to one in every three views. We use the excised views as query images, whereas the remaining 80 images in the database constitute our data set.

As done previously, we retrieve the four images from the data set which amount to the highest values of the voting score β for each of the query views. Again, the correct retrieval results for the query images are those immediately before and after the view of reference. This scheme, being consistent with the one used to assess the performance of the image retrieval results on the COIL-20 database, not only permits the direct association of the number of correctly recovered views to the accuracy measures computed from our experiments, but allows a direct comparison between the datasets used in both parts of our quantitative study.

In Figure 5, we repeat the sequence in Figure 3 for our urban-scene database. The plots show the mean and standard deviation for the retrieval rates as a function of scene index and image scale ϕ. In the plots, we have used the indexing provided in Figure 4.

Figure 5: Retrieval results on the urban-scene database for the three alternatives of image feature vectors (Harris corner, MSER and SIFT) and four different values of the scaling factor ϕ.

Discussion

From the plots, we can conclude that the best performance, in terms of mean retrieval rate, is given by the Harris corner detector. This is regardless of the scaling factor ϕ. Despite providing a margin of improvement in terms of performance with respect to the alternatives, the standard error for the Harris corner detector is the largest, as is the variation in mean retrieval rate between scales. It is also worth noting that the SIFT is scale invariant, as claimed in (Lowe 2004). Thus, for the SIFT, although the retrieval rate variation is small between scales, the retrieval rate itself is the lowest of the three alternatives. This may be due to the fact that the SIFT can be affected by viewpoint rotations. Finally, the MSER shows a sensitivity to scale and rotation which delivers a mean retrieval rate and standard error which are half-way between those of the Harris corner detector and the SIFT. This is in accordance with the stability assumption on the regions recovered by the MSER algorithm in (Matas et al. 2002).

Conclusions

In this paper, we have presented an evaluation of three local image descriptors for purposes of contents-based image retrieval. In our experiments, we have accounted for the effects of rotation and scaling transformations on the retrieval rate and the standard error. Although the evaluation presented here is not exhaustive, our experimental setting is quite general and can be easily extended to other descriptors elsewhere in the literature. Furthermore, the KD-tree algorithm used here may be substituted with other relational matching algorithms for purposes of further comparison and evaluation.

References

Bentley, J. (1975), Multidimensional binary search trees used for associative searching, Communications of the ACM 18(9), 509-517.

Carneiro, G. & Jepson, A. (2002), Phase-based local features, in European Conference on Computer Vision, pp. I: 282-296.

Fitzgibbon, A., Pilu, M. & Fisher, R. B. (1999), Direct least square fitting of ellipses, IEEE Trans. on Pattern Analysis and Machine Intelligence 21(5), 476-480.

Harris, C. J. & Stephens, M. (1988), A combined corner and edge detector, in Proc. 4th Alvey Vision Conference, pp. 147-151.

Lowe, D. (2004), Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60(2), 91-110.

Matas, J., Chum, O., Martin, U. & Pajdla, T. (2002), Robust wide baseline stereo from maximally stable extremal regions, in Proceedings of the British Machine Vision Conference, pp. 384-393.

Mikolajczyk, K. & Schmid, C. (2005), A performance evaluation of local descriptors, IEEE Trans. on Pattern Analysis and Machine Intelligence 27(10), 1615-1630.

Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T. & Gool, L. V. (n.d.), A comparison of affine region detectors. Submitted to the International Journal of Computer Vision, downloadable from http://lear.inrialpes.fr/pubs/5/mtszmskg5

Mokhtarian, F., Abbasi, S. & Kittler, J. (1996), Robust and efficient shape indexing through curvature scale space, in Proceedings of the 7th British Machine Vision Conference, Vol. 1, pp. 53-62.

Niblack, W. et al. (1993), The QBIC project: querying images by content using color, texture and shape, in Proc. SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 1908, pp. 173-187.

Pentland, A. P., Picard, R. W. & Sclaroff, S. (1994), Photobook: tools for content-based manipulation of image databases, in Storage and Retrieval for Image and Video Databases, Vol. II, pp. 34-47.

Picard, R. W. (1995), Light-years from Lena: video and image libraries of the future, in International Conference on Image Processing, Vol. 1, pp. 310-313.

Randen, T. & Husoy, J. H. (1999), Filtering for texture classification: a comparative study, IEEE Trans. on Pattern Analysis and Machine Intelligence 21(4), 291-310.

Sclaroff, S. & Pentland, A. (1993), A modal framework for correspondence and description, in Proceedings of the International Conference on Computer Vision, pp. 308-313.

Tamura, H. & Yokoya, N. (1984), Image database systems: a survey, Pattern Recognition 17(1), 29-43.

Varma, M. & Zisserman, A. (2003), Texture classification: are filter banks necessary?, in Conference on Computer Vision and Pattern Recognition.