International Journal of Automation and Computing 7(1), February 2010, 23-30
DOI: 10.1007/s11633-010-0023-9

An Effective Image Retrieval Mechanism Using Family-based Spatial Consistency Filtration with Object Region

Jing Sun, Ying-Jie Xing
School of Mechanical Engineering, Dalian University of Technology, Dalian 116023, PRC

Abstract: How to construct an appropriate spatial consistency measurement is the key to improving image retrieval performance. To address this problem, this paper introduces a novel image retrieval mechanism based on family filtration in the object region. First, we specify an object region by selecting a rectangle in a query image, and the system returns, as the first ranking, a ranked list of images containing the same object, retrieved from a corpus of 100 images. To further improve retrieval performance, we add an efficient spatial consistency stage, named family-based spatial consistency filtration, to re-rank the results returned by the first ranking. We evaluate the performance of the retrieval system through experiments on a dataset selected from the key frames of TREC Video Retrieval Evaluation 2005 (TRECVID2005). The experimental results show that the proposed retrieval mechanism markedly improves retrieval quality. The paper also verifies the stability of the retrieval mechanism by increasing the number of images from 100 to 2000, and realizes generalized retrieval with objects outside the dataset.

Keywords: Content-based image retrieval, object region, family-based spatial consistency filtration, local affine invariant feature, spatial relationship.

1 Introduction

Research in content-based image retrieval (CBIR) is converging towards building an efficient retrieval mechanism, including feature detection, query-region selection, the construction of spatial consistency, etc. The objective of this paper is to retrieve the subset of images that contain a query target in the object region, based on local invariant features, using the proposed retrieval mechanism, which considers the spatial relationships of the feature regions in the matching stage.

The traditional solutions of CBIR consist of the following steps: 1) feature detection by a detector, 2) feature description by some high-dimensional descriptor, 3) clustering of the descriptors, and 4) returning a ranked list of the image set by a ranking function. How to detect and measure the similarities among different objects is a core issue of CBIR. Recent work in this domain can be divided into two categories: one is based on image segmentation, such as Blobworld and SIMPLIcity [1]; the other is the bag of words (BoW) method [2-6], which has become more and more attractive. In the BoW method, feature regions are detected in the images, and each region is described by certain descriptors. Then, the descriptors are clustered into a visual vocabulary, and each region is mapped to its closest cluster; each cluster has a cluster centre. At last, an image is represented as a bag of visual words together with their frequencies of occurrence.

Manuscript received March 3, 2009; revised May 25, 2009. This work was supported by the National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z416), the National Natural Science Foundation of China (No. 60773056), the Beijing New Star Project on Science and Technology (No. 2007B071), and the Natural Science Foundation of Liaoning Province of China (No. 20052184).
The aforementioned methods in fact mimic text retrieval systems using the analogy of visual words, and amount to an initial filtering stage. Although this initial filtering can greatly reduce the number of documents to be considered, it is still computationally expensive. Typically, existing image retrieval methods [2, 7] do not use the spatial structure of the BoW representation in the initial filtering stage. The biggest difference between image retrieval and text retrieval is that the visual words of the former carry more spatial structure than the words of the latter. For example, someone who wants to retrieve a three-word text query, such as ABC, may in general search for documents containing those three words in any order, such as ACB, BAC, or CBA, and at any positions in the document. A visual query, however, since it is selected from a query image, includes visual words in a spatial configuration corresponding to some view of the object in that image; it is therefore reasonable to make use of this spatial information when searching the corpus for different views of the same object.

This paper constructs a retrieval mechanism consisting of several components: the extraction of local affine invariant feature regions, the selection of the query region based on a similarity score, and the family-based spatial consistency measurement.

The rest of the paper is organized as follows. Section 2 describes the detection of local affine invariant features. Section 3 constructs the image retrieval mechanism, including the clustering of the scale invariant feature transform (SIFT) descriptors, the object region selected by a rectangle, and the family-based spatial consistency filtration. Section 4 presents experiments that verify the proposed retrieval mechanism: an image set is set up in Section 4.1; the evaluation standard of the retrieval results is given in Section 4.2; in Section 4.3, four schemes are designed to prove the correctness of the proposed retrieval mechanism through retrieval examples; the stability under an increasing number of images is verified in Section 4.4; generalized retrieval with objects outside the dataset is realized in Section 4.5; and the retrieval time is analyzed in Section 4.6. Section 5 summarizes the contributions of our study and highlights future research directions.
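To make the BoW initial-filtering idea concrete, the following Python sketch (an editorial illustration, not the authors' implementation) turns one image into a visual-word frequency vector. OpenCV's SIFT detector stands in here for the Harris-Affine regions used in this paper, and `vocabulary` is assumed to be a fitted scikit-learn KMeans object (see Section 3.1); the function name bow_histogram is hypothetical.

    import cv2
    import numpy as np

    def bow_histogram(image_path, vocabulary):
        """Represent one image as a bag of visual words (word-frequency vector)."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        sift = cv2.SIFT_create()                     # steps 1-2: detect and describe
        _, descriptors = sift.detectAndCompute(img, None)  # assumes keypoints exist
        words = vocabulary.predict(descriptors)      # step 3: nearest visual word
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)           # normalized word frequencies

    # Step 4 then ranks the dataset by a similarity between such vectors,
    # e.g. the score of (9) in Section 3.2.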

2 Local affine invariant feature detection

CBIR is based neither on text nor on manual annotation, but on the recognition of invariant features in the image [2, 7-10]. Since Baumberg [11] proposed his innovation in image feature detection, many detection algorithms have emerged in recent years. The Harris-Affine detector [12] has attracted great attention because of its good performance. This paper uses this detector to extract low-level local affine invariant features for the succeeding retrieval process.

2.1 Affine normalization theory based on the shape-adapted matrix

The scale-adapted second-moment matrix is often used for describing the gradient distribution in a local image neighborhood; it is also independent of the image resolution [12, 13]:

    \mu(x, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) * \begin{pmatrix} L_x^2(x, \sigma_D) & L_x L_y(x, \sigma_D) \\ L_x L_y(x, \sigma_D) & L_y^2(x, \sigma_D) \end{pmatrix}.   (1)

The scale-adapted Harris measurement consists of the trace and determinant of the second-moment matrix, and the local maxima of the Harris measurement determine the spatial locations of the initial points:

    \mathrm{cornerness} = \det(\mu(x, \sigma_I, \sigma_D)) - \alpha \, \mathrm{tr}^2(\mu(x, \sigma_I, \sigma_D)).   (2)

Affine Gaussian scale space is constructed by a series of images obtained by convolving the image with non-uniform elliptical Gaussian functions:

    L(x; \Sigma) = g(\Sigma) * I(x) = \frac{1}{2\pi\sqrt{\det\Sigma}} \, e^{-\frac{x^{\mathrm{T}}\Sigma^{-1}x}{2}} * I(x).   (3)

The second-moment matrix \mu_L, based on affine Gaussian scale space, is defined by

    \mu_L(\cdot; \Sigma_I, \Sigma_D) = g(\cdot; \Sigma_D) * \left( (\nabla L)(\cdot; \Sigma_I)\,(\nabla L)(\cdot; \Sigma_I)^{\mathrm{T}} \right).   (4)

Some variable substitutions are made for convenience of expression:

    \mu_L(q_L; \Sigma_{I,L}, \Sigma_{D,L}) = M_L, \qquad \mu_R(q_R; \Sigma_{I,R}, \Sigma_{D,R}) = M_R.   (5)

Lindeberg [13] proved that the relationship between two points related by an affine transformation is a pure rotation after the neighborhoods of the two points are affine normalized (see Fig. 1). This is the theoretical basis of the extraction of local affine invariant features.

Fig. 1 Affine normalization based on the shape-adapted matrices. (a)-(b): q'_L = M_L^{-1/2} q_L; (c)-(d): q'_R = M_R^{-1/2} q_R; (b)-(d): q'_L = R q'_R

To modify the shape of the initial point, we turn the locally anisotropic region into an isotropic one with the shape-adapted matrix [12]:

    U = \prod_k \left( \mu^{-\frac{1}{2}} \right)^{(k)} U^{(0)}.   (6)

2.2 Synchronous iteration of location, scale, and shape

For a given initial point X^{(0)}, the concrete iterative procedure is as follows:

Step 1. Initialize the shape-adapted matrix U^{(0)} to the unit matrix E.

Step 2. Normalize the elliptical feature region to a circular feature region by the shape-adapted matrix U^{(k-1)}; the centre of the normalization window W(X_w) = I(X) is located at X_w^{(k-1)} = (U^{(k-1)})^{-1} X^{(k-1)}.

Step 3. Fix the integration scale \sigma_I^{(k)} at X_w^{(k-1)} where the normalized Laplacian of Gaussian (LoG) function reaches its maximum:

    \mathrm{LoG}(X, \sigma_I) = \sigma_I^2 \left| L_{xx}(X, \sigma_I) + L_{yy}(X, \sigma_I) \right|.   (7)

Step 4. Select the differentiation scale \sigma_D^{(k)} = s\,\sigma_I^{(k)} such that, substituting \sigma_D^{(k)}, \sigma_I^{(k)}, and X_w^{(k-1)} into \mu(X_w^{(k-1)}, \sigma_I^{(k)}, \sigma_D^{(k)}), the ratio \lambda_{\min}(\mu)/\lambda_{\max}(\mu) attains its maximum.

Step 5. Compute the spatial extremum X_w^{(k)} of the Harris measurement in the normalized window centred at the point X_w^{(k-1)}.

Step 6. Compute \mu_i^{(k)} = \mu^{-1/2}(X_w^{(k)}, \sigma_I^{(k)}, \sigma_D^{(k)}).

Step 7. Update the shape-adapted matrix U^{(k)} = \mu_i^{(k)} U^{(k-1)} with the \mu_i^{(k)} obtained in Step 6, and then normalize the matrix U^{(k)} so that \lambda_{\max}(U^{(k)}) = 1.

Step 8. Compute X^{(k)} in the image domain by the formula X^{(k)} = X^{(k-1)} + U^{(k-1)} (X_w^{(k)} - X_w^{(k-1)}).

Step 9. Compute the convergence ratio:

    1 - \frac{\lambda_{\min}(\mu)}{\lambda_{\max}(\mu)} < \varepsilon_C.   (8)

If the ratio is below \varepsilon_C = 0.05, i.e., \mu is approximately equal to a rotation matrix, the above algorithm converges; otherwise, go to Step 2.

For an initial point, the above process automatically iterates and converges to an affine invariant feature point by modifying the shape, scale, and spatial location of the initial point. We extracted feature points and their neighborhoods in the skylight from two different views of a car model. Figs. 2(a) and (c) show the iterative processes of the corresponding feature points in the two views, and Figs. 2(b) and (d) show the normalized regions of (a) and (c), respectively. After five iterations, the location, scale, and shape of the invariant point no longer change.
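As an illustration of (1) and (2), here is a minimal NumPy/SciPy sketch (our illustration, not the authors' code) that computes the entries of the scale-adapted second-moment matrix with Gaussian derivatives and the resulting Harris cornerness map; the constant alpha = 0.06 is an assumed typical value, since the paper does not state its choice.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def second_moment_matrix(img, sigma_i, sigma_d):
        # Gaussian derivatives L_x, L_y at the differentiation scale sigma_d
        Lx = gaussian_filter(img, sigma_d, order=(0, 1))
        Ly = gaussian_filter(img, sigma_d, order=(1, 0))
        # products smoothed at the integration scale sigma_i and
        # normalized by sigma_d^2, as in (1)
        s2 = sigma_d ** 2
        mu_xx = s2 * gaussian_filter(Lx * Lx, sigma_i)
        mu_xy = s2 * gaussian_filter(Lx * Ly, sigma_i)
        mu_yy = s2 * gaussian_filter(Ly * Ly, sigma_i)
        return mu_xx, mu_xy, mu_yy

    def harris_cornerness(img, sigma_i, sigma_d, alpha=0.06):
        # cornerness = det(mu) - alpha * tr(mu)^2, as in (2)
        mu_xx, mu_xy, mu_yy = second_moment_matrix(img, sigma_i, sigma_d)
        return (mu_xx * mu_yy - mu_xy ** 2) - alpha * (mu_xx + mu_yy) ** 2

Initial points would then be the local maxima of this map over space and scale.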

Fig. 2 Iterative detection of the affine invariant point region. (a) The iterative processes of the corresponding feature points in one view; (b) The normalized regions of (a); (c) The iterative processes of the corresponding feature points in another view; (d) The normalized regions of (c)

3 Image retrieval mechanism

This paper designs two distinctive strategies to improve the performance of image retrieval: the object region selected by a rectangle and the family-based spatial consistency filtration.

3.1 The clustering of SIFT descriptors

The elliptical feature regions extracted in Section 2.2 are computed at twice the original size of the detected regions in order to make the appearance of the regions more discriminating once described. To obtain a vector representation of the feature regions, we first affine-normalize the elliptical feature regions to image patches of the same size and then describe these patches. The left part of Fig. 3(a) shows all elliptical feature regions in the image, and the right part shows the affine-normalized patches of the elliptical feature regions; the arrows in Fig. 3(a) indicate the correspondence between the feature regions and the patches.

This paper describes the feature regions using the SIFT descriptor proposed by Lowe [14]. Because of its rotation invariance and good performance [15], the SIFT descriptor is used in many fields of computer vision [16, 17]. Fig. 3(b) shows the SIFT descriptors computed on the normalized patches of Fig. 3(a); the length and direction of each black arrow give the magnitude and direction of the dominant gradient of the feature point neighborhood at the given scale.

Fig. 3 Normalization of elliptical feature regions into same-size patches and description of the patches by SIFT. (a) The elliptical feature regions and their normalized patches; (b) The SIFT descriptors computed on the normalized patches of (a)

The SIFT descriptors are quantized by K-means clustering, called vector quantization, and the Mahalanobis distance is used as the distance function. The clustering process of the visual words is presented in Fig. 4. Fig. 5 shows two visual words clustered from SIFT descriptors that come from feature regions detected by the method of Section 2.2 on the dataset of Section 4.1. The clustered patches reflect the properties of SIFT descriptors: the clustering is based on the spatial distribution of image gradient orientation, not just on the intensity across the region.

Fig. 4 The flowchart of the clustering process

Fig. 5 Two visual words clustered in Harris-Affine feature regions
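The quantization step just described can be sketched as follows (a simplified illustration: scikit-learn's KMeans uses the Euclidean distance, whereas the paper uses the Mahalanobis distance; the vocabulary size 487 is the value reported for the 100-image set in Table 1).

    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors, n_words=487):
        """Cluster pooled SIFT descriptors (an N x 128 array); the cluster
        centers are the visual words of the vocabulary."""
        return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

    def quantize(vocabulary, descriptors):
        """Vector quantization: map each descriptor to its closest visual word."""
        return vocabulary.predict(descriptors)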
3.2 The object region by the rectangle

To improve the precision of retrieval results, we propose an efficient retrieval method based on invariant features in the object region. In a query image, the query object belongs to the query region, which is defined as the object region. The subordination among the query image, the object region, and the query object is: query image ⊇ object region ⊇ query object. The flowchart of image retrieval based on object regions is as follows:

Step 1. Pre-processing (off-line)
Step 1.1. Extract the feature regions. Select the image dataset, and then extract the affine covariant feature regions in all images of the dataset.
Step 1.2. Describe the feature regions. Represent the invariant feature regions with SIFT descriptors.
Step 1.3. Cluster the descriptors. Use the K-means method to cluster the descriptors; the clustering centers are the visual words, and all the words form the visual vocabulary.
Step 1.4. Vectorize the images. Compute the term frequency - inverse document frequency (TF-IDF) weights of the visual words, and then construct the vector space for the words of the whole dataset.

Step 2. Retrieval processing (on-line)
Step 2.1. Select the object region. The query region including the object, named the object region, is selected manually with a rectangle in the query image, and then the visual words are computed in the object region.
Step 2.2. First retrieval. According to the similarity score between the object region of the query image and each image in the dataset, the retrieval results are ranked for the first time, based on the frequencies of the visual words in the object region.
Step 2.3. Second retrieval. The first retrieval results are re-ranked using family-based spatial consistency filtration in the object region.

By the standard TF-IDF weighting scheme, which is the vector-space model of information retrieval [18], we count the weighted frequency t_{j,i} of each visual word j in every image i. The query image and each image in the dataset are represented as vectors of visual word occurrences, and we then calculate the similarity score between the query vector and each image vector using the Mahalanobis distance. The query region is selected by a rectangle in the query image, and the words and their weighted frequencies are counted; thus, the subset R_query of visual words in the query region is obtained. We compute the intersection between R_query and R_i (i ∈ [1, N]), the set of visual words in image i, and then accumulate the minimum weighted word frequency over the intersection. The similarity score S_i between image i in the dataset and the query image with the object region is

    S_i = \sum_{j \in R_{\mathrm{query}} \cap R_i} \min(t_{j,\mathrm{query}}, t_{j,i})   (9)

where t_{j,query} is the weighted frequency of word j in the query region, and t_{j,i} is the weighted frequency of word j in image i.

Fig. 6 shows a comparison between retrieval results based on the whole image and those based on the object region. Because we only want to show the influence of the object region on the retrieval results, spatial consistency is not used here. Fig. 6(a) shows the query image, and Fig. 6(b) shows another image having the same background; Fig. 6(c) shows the retrieval results based on the whole image, and Fig. 6(d) shows the retrieval results based on the object region in the rectangle. From Fig. 6(c), we can see that, because retrieval is based on the whole image, although the query image itself is retrieved first, the image having the same background as the query image is also retrieved with a high score, as it shares much content with the query image. Fig. 6(d) shows that the retrieval results using the object region in the rectangle do not contain the image of Fig. 6(b), because Fig. 6(b) does not share content with the object region of Fig. 6(a).

Fig. 6 Comparison of retrieval results between the whole image and the object region. (a) The query image; (b) Another image having the same background as (a); (c) The retrieval results based on the whole image; (d) The retrieval results based on the object region in the rectangle
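A minimal sketch of the weighting and scoring just described (illustrative only: the paper cites the standard scheme of [18] without spelling out the exact tf-idf variant, so the log-idf form below is an assumption).

    import numpy as np

    def tfidf_weights(counts, doc_freq, n_images):
        """counts: visual-word counts for one image or object region;
        doc_freq: number of dataset images containing each word."""
        tf = counts / max(counts.sum(), 1)
        idf = np.log(n_images / np.maximum(doc_freq, 1))
        return tf * idf

    def similarity_score(t_query, t_image):
        """Eq. (9): sum the minimum weighted frequency over the words
        shared by the query region and the image."""
        shared = (t_query > 0) & (t_image > 0)
        return float(np.minimum(t_query[shared], t_image[shared]).sum())

The first retrieval then sorts the dataset images by this score in descending order.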
In other words, a false matching pair is isolated. According to these characteristics of the matching pairs, we propose a filtration method to remove false matches: family-based spatial consistency filtration. The definition and the forming process of a family are introduced below.

Family: a set of matching pairs whose spatial angles are very similar.
Family set: all the families together construct the family set, which contains the matching pairs used for realizing image retrieval.

The forming process of a family is as follows:
Step 1. Sort the spatial angles of the different matching pairs: SpaceAngle{}: {S_min, ..., S_i, ..., S_max}.
Step 2. Give the thresholds of the spatial angle and the family volume:
1) Th_SA: the spatial angle threshold. It describes the relation of spatial angles that belong to the same family.
2) Th_FamilyNUM: the family volume threshold. It is the minimum number of matching pairs that compose a family.
Step 3. Find the families in the set SpaceAngle{}.
Step 4. Record the numbers of correct matching pairs in MatchedFamily[].

Make a family:

    Procedure MakeFamily()
        If S_i - S_(i-1) > Th_SA and S_(j+1) - S_j > Th_SA then
            FamilyNUM <- j - i
            If FamilyNUM > Th_FamilyNUM then
                MatchedFamily[n++] <- FamilyNUM
            End if
        End if

If the difference of the spatial angles between matching pairs is no more than Th_SA, the matchings belong to the same family; if the number of matching pairs in a family is less than Th_FamilyNUM, the family is discarded as a false match. As long as a family is found, the scan index over SpaceAngle{} is incremented and the search continues.
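Read as a scan over the sorted angles, Procedure MakeFamily amounts to the following Python sketch (our interpretation: a family is a maximal run of sorted spatial angles whose adjacent gaps are at most Th_SA; the threshold values themselves are not given in the paper, so th_sa and th_family_num must be supplied).

    def find_families(angles, th_sa, th_family_num):
        """Group the spatial angles of matching pairs into families."""
        s = sorted(angles)                      # Step 1: sort the spatial angles
        families, start = [], 0
        for i in range(1, len(s) + 1):
            # a gap larger than th_sa (or the end of the list) closes a run
            if i == len(s) or s[i] - s[i - 1] > th_sa:
                if i - start > th_family_num:   # discard undersized runs
                    families.append(s[start:i])
                start = i
        return families   # matches whose angles fall in no family are removed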

Fig. 7(a) shows that the original matching results without spatial consistency contain many false matching pairs; Fig. 7(b) shows the matching result based on spatial consistency. The false matching pairs are effectively removed by the family-based spatial consistency filtration.

Fig. 7 The function of spatial consistency on matching. (a) The original matching result without spatial consistency; (b) The matching result based on spatial consistency

Fig. 8 shows the influence of spatial consistency on the retrieval results. Fig. 8(a) shows the query image, and Fig. 8(b) shows the retrieval results without spatial consistency; we can see that among the first ten images, only two images are of the same sort as the query image. Fig. 8(c) shows the retrieval results with spatial consistency on the whole image, and seven of the foremost images belong to the same sort as the query image. This verifies that the spatial consistency filtration proposed by us can significantly improve the retrieval results.

Fig. 8 The influence of spatial consistency on retrieval. (a) The query image; (b) The retrieval results without spatial consistency; (c) The retrieval results with spatial consistency on the whole image

4 Experiments

4.1 The image set

To verify the correctness of the proposed retrieval mechanism, we use 100 images belonging to 10 subjects for the retrieval experiments. There are 10 images in every subject, and across the 10 images the same object appears with changes of rotation, scale, viewpoint, brightness, and partial occlusion. The size of each image is 352 × 240. Using a 3.2 GHz Pentium 4 PC with 2 GB of main memory, the retrieval time for each image is 0.16 s on average.

Fig. 9 shows the number of invariant feature regions in the 100 images using the Harris-Affine detector. There are 11834 regions in the dataset in total, and the number of feature regions detected in each image is approximately 120. The numbers of SIFT descriptors and clustering centers are 12189 and 487, respectively (see Table 1). Table 1 also shows the data for 2000 images, which will be discussed in Section 4.4.

Fig. 9 The number of feature regions in the 100 images by the Harris-Affine detector

Table 1 The clustering parameters of the two image sets

    Images            100      2000
    Feature regions   11834    375867
    Descriptors       12189    381065
    Cluster centers   487      16982

4.2 The evaluation standard of the retrieval results

The paper takes the average retrieval accurate ratio (AAR) as the evaluation standard of the retrieval results. First, the ground truth is given: taking the subject "man" as an example, the subject "man" is the ground truth when any of man 1 through man 10 is the query image. In other words, if we take man 1 as the query image, the correct retrieval result is that the foremost ten images are exactly the images of the subject "man". The dataset has already been classified accordingly. Let G be the ground truth including image I. If the retrieval result outputs T similar images, of which n images belong to the ground truth G, then the retrieval accurate ratio (AR) is defined as

    AR = \frac{n}{T}.   (10)

For an objective evaluation, we further present the AAR of several retrieval results to measure the performance of our method. Retrieving each image of a subject and recording the retrieval results of the first ten images (T = 10), we compute the AAR for every subject:

    AAR = \frac{1}{j} \sum_{i=1}^{j} AR_i   (11)

where j is the number of images that every subject contains (i.e., j = 10). Then, we average the AAR values to obtain a mean AAR (mAAR) score, which is used to evaluate the overall performance of our retrieval mechanism:

    mAAR = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{1}{j} \sum_{i=1}^{j} AR_{k,i} \right)   (12)

where n is the number of subjects. We selected 10 sorts of images, so n = 10.
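The three measures (10)-(12) are straightforward to compute; a small sketch with hypothetical helper names:

    def ar(retrieved, ground_truth):
        """Eq. (10): fraction of the T returned images in the ground truth."""
        return sum(1 for img in retrieved if img in ground_truth) / len(retrieved)

    def aar(per_query_ars):
        """Eq. (11): mean AR over the j query images of one subject."""
        return sum(per_query_ars) / len(per_query_ars)

    def maar(per_subject_aars):
        """Eq. (12): mean AAR over the n subjects."""
        return sum(per_subject_aars) / len(per_subject_aars)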

4.3 The retrieval examples

Four schemes are designed to verify the correctness of the retrieval mechanism:

1) Based on the whole image: feature extraction, description, matching, and image retrieval are all performed on the whole image. In fact, this is the typical BoW scheme, which simulates simple text-based image retrieval using the analogy of visual words.

2) Based on the object region: feature extraction, description, matching, and image retrieval are all performed in the object region, which is selected by a rectangle.

3) Based on the whole image with spatial consistency: on the basis of 1), spatial consistency is used to constrain the retrieval results.

4) Based on the object region with spatial consistency: on the basis of 2), spatial consistency is used to add constraints to the retrieval.

Fig. 10 shows the AAR of the ten sorts of images. It shows the effects of the object region and of the family-based spatial consistency on the AAR, respectively.

Fig. 10 Average retrieval accurate ratio of the ten sorts of images

4.4 Verifying the stability by increasing the number of images

To verify the stability of the proposed retrieval mechanism, we increase the number of images from 100 to 2000, and then compare the results between the 100- and 2000-image datasets in order to stress-test retrieval performance when the volume of the dataset is enlarged. All new images, which are either of the same sorts as the 100-image dataset or of new sorts not in the 100-image set, are chosen from the key frames of TREC Video Retrieval Evaluation 2005 (TRECVID2005). In doing so, we ensure the variety and rationality of the sorts of the 2000 images.

Fig. 11 shows the retrieval results based on the object region with the family-based spatial consistency filtration on the 2000-image dataset. There are 25 images of the sort "Basketball Game" in the 2000-image dataset. Fig. 11(a) shows the query image, and the region in the rectangle is the object region. Figs. 11(b) and (c) show that there are 18 correct images among the first 25 images, so the AR of this retrieval is 18/25 = 72%.

Fig. 11 Retrieval results based on the object region and the family-based spatial consistency filtration. (a) The query image; (b) & (c) The retrieval results

Fig. 12 shows a comparison of retrieval results based on the object region and the family-based spatial consistency filtration between the 2000- and 100-image datasets. With the number of images increasing, the ground truth changes. Fig. 12(a) shows the decrease in AAR from 100 to 2000 images with the object region and spatial consistency, and Fig. 12(b) shows the mAAR of the four retrieval mechanisms under 100 and 2000 images, respectively. From Fig. 12, we can see that when the number of images increases from 100 to 2000, the performance of the retrieval results decays to some extent.

But such results are relatively good considering the increased number of images. Here, we introduce the decrease rate of AAR to evaluate the change of AAR when the capacity of the dataset increases. The values of the decrease rate under the four retrieval mechanisms are the differences between the 100- and 2000-image results; when the capacity of the dataset is increased 20-fold, the decrease rate of mAAR under the four different retrieval mechanisms is 9.2%.

Fig. 12 The comparison of the 100 and the 2000 images. (a) AAR; (b) mAAR

4.5 Generalized retrieval

In practice, images containing the object often lie beyond the image dataset, such as images of product logos or particular buildings from the internet, unlike the objects mentioned above, which appear in some image belonging to the dataset. Therefore, we should check whether the object is in the dataset or not. Retrieval that searches for objects from another dataset is named generalized retrieval. Through generalized retrieval, we further prove the performance of the feature regions and of the retrieval mechanism constructed in this paper.

To evaluate the performance of the retrieval mechanism when it works outside the dataset, we first set up a ground truth consisting of the objects either in the dataset or in an image from the internet. From the 2000-image dataset, we select the sort "NBC" as the ground truth, and we also find one image containing the NBC logo by searching for the word "NBC" with the Baidu Search Engine (see Fig. 13(a)). The ellipses in Fig. 13(a) represent the feature regions; owing to the quality of the image, we only extract some regions of the image. Figs. 13(b) and (c) show the retrieval results that belong to the ground truth. Figs. 13(d), (e), and (f) show the close-ups of the NBC logo in Figs. 13(a), (b), and (c), respectively. The results show the correspondence of the feature regions between the ground truth and the image from the internet.

Fig. 13 Retrieval with the object outside the dataset. (a) One image containing the NBC logo found by searching for the word "NBC" with the Baidu Search Engine; (b) & (c) The retrieval results that belong to the ground truth; (d) The close-up of the NBC logo in (a); (e) The close-up of the NBC logo in (b); (f) The close-up of the NBC logo in (c)

4.6 Analysis of the retrieval time

Finally, we analyze the time consumption of the whole retrieval process. For the 100-image dataset, the whole retrieval process is very quick because the capacity of the dataset is not large. Table 2 shows the time at different stages for the 2000-image dataset. We perform a sampling test to obtain an objective retrieval time: 100 images covering all sorts of the 2000-image dataset are selected at random from the 2000 images, and the average time of the sampling test is taken as the average retrieval time of the 2000-image dataset. Of course, we also use the average time for feature extraction, description, and clustering. For example, the time for feature extraction and feature description is 18 min, so the average time of feature extraction and description for each image is 18 × 60/2000 = 0.54 s.
Table 2 The time consumption of the 2000 images

    Stage                                              Time (s)
    Feature detection and description                  < 0.5
    Clustering                                         < 0.8
    Retrieval mechanism: whole image                   < 0.2
    Retrieval mechanism: object region                 < 0.3
    Retrieval mechanism: whole image, family-based     < 0.7
    Retrieval mechanism: object region, family-based   < 0.6

5 Conclusions

CBIR is one of the most important research topics in image retrieval. Recently, one major development in this area has been the use of the spatial information of the visual words of the query images. Our purpose is to construct an efficient image retrieval mechanism. In this work, we deal with the extraction and description of features, the selection of the object region, and the spatial consistency measurement in a combined way to improve the performance of CBIR. Our main contributions can be outlined as follows:

1) Based on the idea of traditional text retrieval, we first detect the Harris-Affine regions as low-level features,

and then describe the features with SIFT descriptors and cluster them into visual words.

2) According to the standard weighting and similarity rule for measuring invariant features, the object region is selected using a rectangle in the query image; the first ranking of the retrieved images is based on the similarity score between each image in the dataset and the query image.

3) We propose the family-based spatial consistency filtration for the second ranking, by means of the spatial angles between the matching pairs in two images. We also give the definition and the forming process of the family.

4) To demonstrate the correctness of the proposed retrieval mechanism, we design four retrieval schemes. Finally, we verify the stability by increasing the volume of the dataset. We have realized generalized retrieval with objects outside the dataset and verified the correctness of our retrieval mechanism.

The family-based spatial consistency is not suitable for large scaling transformations. This is a problem to be tackled in our future work. We may use spatial consistency based on a searching unit for images under large scaling, and try to incorporate the two configurations of spatial consistency. Moreover, stress-testing retrieval performance while scaling up the dataset will be a future research direction.

Acknowledgement

We thank the Institute of Computing Technology (ICT), Beijing, the Chinese Academy of Sciences for providing the test platform; we are also very grateful for suggestions from and discussions with Dr. Y. D. Zhang and Dr. K. Gao of ICT.

References

[1] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, J. Malik. Blobworld: A system for region-based image indexing and retrieval. In Proceedings of the 3rd International Conference on Visual Information Systems, IEEE Computer Society, Amsterdam, Holland, vol. 2, pp. 509-516, 1999.

[2] J. Sivic, A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE International Conference on Computer Vision, IEEE, Nice, France, vol. 2, pp. 1470-1477, 2003.

[3] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Minneapolis, USA, pp. 1-8, 2007.

[4] K. Gao, S. X. Lin, J. B. Guo, D. M. Zhang, Y. D. Zhang, Y. F. Wu. Object retrieval based on spatially frequent items with informative patches. In Proceedings of the IEEE International Conference on Multimedia and Expo, IEEE, Hannover, Germany, pp. 1305-1308, 2008.

[5] S. Lazebnik, C. Schmid, J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, New York, USA, vol. 2, pp. 2169-2178, 2006.

[6] Q. F. Zheng, W. Q. Wang, W. Gao. Effective and efficient object-based image retrieval using visual phrases. In Proceedings of the 14th Annual ACM International Conference on Multimedia, ACM, Santa Barbara, USA, pp. 77-80, 2006.

[7] D. Nistér, H. Stewénius. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, New York, USA, vol. 2, pp. 2161-2168, 2006.

[8] C. Schmid, R. Mohr. Local grayvalue invariants for image retrieval.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 530-535, 1997.

[9] T. Tuytelaars, L. Van Gool. Content-based image retrieval based on local affinely invariant regions. Lecture Notes in Computer Science, Springer, vol. 1614, pp. 493-500, 1999.

[10] F. Schaffalitzky, A. Zisserman. Multi-view matching for unordered image sets. In Proceedings of the 7th European Conference on Computer Vision, Lecture Notes in Computer Science, Springer, Copenhagen, Denmark, vol. 2350, pp. 414-431, 2002.

[11] A. Baumberg. Reliable feature matching across widely separated views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Hilton Head Island, USA, vol. 1, pp. 774-781, 2000.

[12] K. Mikolajczyk, C. Schmid. Scale and affine invariant interest point detectors. International Journal of Computer Vision, vol. 60, no. 1, pp. 63-86, 2004.

[13] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, vol. 30, no. 2, pp. 79-116, 1998.

[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[15] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, vol. 65, no. 1-2, pp. 43-72, 2005.

[16] K. Yamamoto, R. Oi. Color correction for multi-view video using energy minimization of view networks. International Journal of Automation and Computing, vol. 5, no. 3, pp. 234-245, 2008.

[17] S. F. Liu, C. McMahon, M. Darlington, S. Culley, P. Wild. EDCMS: A content management system for engineering documents. International Journal of Automation and Computing, vol. 4, no. 1, pp. 56-70, 2007.

[18] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval, ACM Press, pp. 24-34, 1999.

Jing Sun received the B. A. degree from Yanshan University, PRC and the M. A. degree from Dalian University of Technology, PRC in 1997 and 2002, respectively. Since 2005, she has been a Ph. D. candidate at Dalian University of Technology. She is currently a lecturer at the School of Mechanical Engineering of Dalian University of Technology. Her research interests include feature extraction, image matching, and object retrieval.
E-mail: sunjing@dl.cn (Corresponding author)

Ying-Jie Xing received the B. A. and M. A. degrees from Harbin Institute of Technology, PRC in 1983 and 1986, respectively. He received the Ph. D. degree from the University of Yamanashi, Japan in 1996. He is currently an associate professor at the School of Mechanical Engineering of Dalian University of Technology, PRC. His research interests include feature extraction, image processing, and pattern recognition.
E-mail: yjxing@dlut.edu.cn