Encoding Spatial Context for Large-Scale Partial-Duplicate Web Image Retrieval


Zhou WG, Li HQ, Lu Y et al. Encoding spatial context for large-scale partial-duplicate Web image retrieval. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 29(5): 837-848, Sept. 2014. DOI 10.1007/s…

Encoding Spatial Context for Large-Scale Partial-Duplicate Web Image Retrieval

Wen-Gang Zhou 1,2, Hou-Qiang Li 1,2, Yijuan Lu 3, Member, ACM, IEEE, and Qi Tian 4, Senior Member, IEEE

1 Chinese Academy of Sciences Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China
2 Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China
3 Department of Computer Science, Texas State University, San Marcos, TX 78666, U.S.A.
4 Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249, U.S.A.

{zhwg, lihq}@ustc.edu.cn; yl12@txstate.edu; qitian@cs.utsa.edu

Received March 15, 2014; revised July 15, 2014.

Abstract   Many recent state-of-the-art image retrieval approaches are based on the Bag-of-Visual-Words model and represent an image as a set of visual words by quantizing local SIFT (scale-invariant feature transform) features. Feature quantization reduces the discriminative power of local features and unavoidably causes many false local matches between images, which degrades the retrieval accuracy. To filter those false matches, geometric context among visual words has been popularly explored for the verification of geometric consistency. However, existing studies with global or local geometric verification are either computationally expensive or achieve limited accuracy. To address this issue, in this paper, we focus on partial-duplicate Web image retrieval and propose a scheme to encode the spatial context for visual matching verification. An efficient affine enhancement scheme is proposed to refine the verification results. Experiments on partial-duplicate Web image search, using a database of one million images, demonstrate the effectiveness and efficiency of the proposed approach. Evaluation on a 10-million image database further reveals the scalability of our approach.

Keywords   large-scale image retrieval, spatial context coding, spatial verification, affine estimation

1 Introduction

Recent years have witnessed the explosive growth of images accessible on the Web. Image retrieval lends itself as a straightforward tool to help users find images of interest in a large image corpus. Traditional image search systems commonly utilize textual information, such as the surrounding text, for indexing. Since textual information may be inconsistent with the visual content, content-based image retrieval (CBIR) is desired and has attracted much attention. In CBIR, a hot topic is partial-duplicate Web image search [1-4]. Given a query image, the target is to find its partial-duplicate versions in a large Web image database. There are many potential applications of such a system, for instance, finding out where an image is derived from and getting more information about it, tracking the appearance of an image online, detecting image copyright violation, and so on.

The partial-duplicate image retrieval task differs in some sense from image-based object retrieval. In image-based object retrieval, the main challenge is image variation due to 3D viewpoint change, illumination change, or object-class variability [1].
Partial-duplicate Web image retrieval differs in that the target images are usually obtained by Web users who edit the original image with changes in color, scale, rotation, partial occlusion, compression rate, etc.

Regular Paper
This work was supported in part to Dr. Wen-Gang Zhou by the Fundamental Research Funds for the Central Universities of China under Grant Nos. WK… and WK…, the Start-Up Funding from the University of Science and Technology of China under Grant No. KY…, the Open Project of the Beijing Multimedia and Intelligent Software Key Laboratory at Beijing University of Technology, and the sponsorship of the Intel ICRI MNC project; in part to Dr. Hou-Qiang Li by the National Natural Science Foundation of China (NSFC) under Grant Nos. …; in part to Dr. Yijuan Lu by the Army Research Office (ARO) of USA under Grant No. W911NF… and the National Science Foundation of USA under Grant No. CRI…; and in part to Dr. Qi Tian by ARO under Grant No. W911NF… and the Faculty Research Award by NEC Laboratories of America, respectively. This work was also supported in part by NSFC under Grant No. …. A preliminary version of the paper was published in the Proceedings of MM 2010.
©2014 Springer Science + Business Media, LLC & Science Press, China

In partial-duplicate Web images, different

parts are often cropped from the original image and pasted into the target image with modifications. The result is a partial-duplicate version of the original image with a different appearance but still sharing some duplicated patches. Some examples are shown in Fig.1.

Fig.1. Some examples of partial-duplicate Web images.

In large-scale image retrieval systems, the state-of-the-art approaches [5-18] leverage scalable textual retrieval techniques for image search. Similar to text words in information retrieval, local features, such as SIFT [19], are quantized to visual words [5,20]. An inverted file structure is then adopted to index images via the contained visual words [5,21]. However, compared with text words in information retrieval, the descriptive power of visual words is very limited, and with an increasing size of the image database to be indexed (e.g., greater than one million images), the discriminative power of visual words decreases sharply. Unlike text words in information retrieval, the 2-D geometric relationship among visual words in the image plane plays a key role in visual content identification. Geometric verification [1-2,10-11,14,19,22-24] has therefore become very popular recently as an important post-processing step to improve the retrieval precision. Due to its expensive computational cost, full geometric verification is usually only applied to some top-ranked candidate images, which may fail to identify relevant results with an initially low rank. Some locally spatial verification approaches have been developed to trade off precision and efficiency, including spatially nearest neighbors [5] and the bundled feature approach [1]. Above all, based on the Bag-of-Visual-Words model [5], image retrieval mainly relies on improving the discrimination of visual words by reducing feature quantization loss and embedding geometric consistency.

In this paper, we propose to address partial-duplicate Web image search with a spatial context coding strategy. Our approach is based on the Bag-of-Visual-Words model. Two local features from two images are considered a match if the two features are quantized to the same visual word. To verify the local matches between two images, we propose a novel scheme to encode the relative spatial context of local features in images into a binary map. Then, through efficient spatial verification based on the spatial maps, the false matches of local features can be removed effectively. The verification results are further refined with an affine enhancement scheme.

This work builds on our previous work [15] on efficient geometric verification for improving image retrieval results. The difference lies in three aspects. Firstly, the image plane division for spatial context encoding is more flexible in this work: unlike [15, 23], we do not require the image plane division to be quadrant-wise for the generation of multiple orthogonal separate coding maps; instead, we generalize the horizontal and vertical maps in [15, 23] into a uniform formulation. Secondly, a new affine enhancement scheme is proposed to address the SIFT drifting issue and thereby improve the geometric verification results. Thirdly, in this work, we demonstrate the scalability of our method on a 10-million-scale image database.

The rest of the paper is organized as follows. In Section 2, we introduce the related work. Then, our spatial context encoding scheme is discussed in Section 3.
In Section 4, we discuss the implementation details of the image search framework. In Section 5, experimental results are provided. Finally, we conclude in Section 6.

2 Related Work

In the past decade, large-scale image retrieval [1,5-12,14-15,25-27] has been significantly boosted by two pioneering studies. The first one is the introduction of local invariant features, such as SIFT [19] and Edge-SIFT [28], for image representation. The second one is the scalable image indexing scheme based on the Bag-of-Visual-Words model [5]. With local features quantized to visual words, image representation can be very compact. Scalability is achieved by the inverted-file index structure, and the number of candidate images is greatly reduced, since we only need to check those images sharing common visual words with the query image.

In image retrieval, the retrieval quality (accuracy) is usually measured by the criterion of average precision, which takes both precision performance and recall performance into consideration. To obtain a high average precision, both precision and recall are required to be good enough; a bias toward either one may lead to limited average precision and poor retrieval quality.

Based on the Bag-of-Visual-Words model in image search, generally, local feature descriptors quantized to the same visual word are considered to match each other [5]. Since the dimensionality of the SIFT descriptor space is as high as 128, the quantization error is not easy to control. As a result, some noisy local matches

are easily incurred, which degrades image retrieval accuracy. To reduce the quantization loss, two different types of approaches have been proposed recently.

The first category focuses on exploring individual features. Some approaches work on reducing the quantization error. One of the representative techniques is soft quantization [12,26,29-30], which quantizes a SIFT descriptor to multiple visual words. Hamming Embedding [10] enriches the visual word with a compact binary signature derived from the original local descriptor, and feature scale and orientation values are used to filter false matches. Some other approaches try to enrich the feature representation of images. For instance, Chum et al. [7,25,31] reissued the highly ranked images from the original query as expanded new queries to boost the recall performance. Jégou et al. [13] proposed a compact image representation scheme by aggregating local descriptors in images; it jointly optimizes the dimension reduction and indexing to preserve the quality of vector comparison. Zhang et al. [32] proposed a novel query-sensitive ranking algorithm that ranks PCA-based binary hash codes to address the neighbor search problem for image search. Above all, since the focus of our work is the post-processing stage, the above quantization methods can be combined with our spatial context coding scheme for better retrieval accuracy.

The approaches in the second category concentrate on utilizing the geometric context among local features to suppress false matches [2]. The geometric verification approaches can be categorized into global verification and local verification. Global verification approaches exploit the geometric relationships of all features in a whole image. Full geometric verification based on RANSAC [33] and its variants [10-11,19,34] is the representative global verification method. Although full geometric verification can greatly improve the retrieval precision, it usually incurs expensive computational cost; therefore, it is usually only applied to a subset of the top-ranked candidate images, resulting in limited retrieval accuracy. In [14], a novel geometry-preserving visual phrase (GVP) is proposed to explore visual word co-occurrences in neighborhood areas, and image search is realized by identifying the shared GVPs between images. In VisualSEEk [35], 2-D strings [36] are adopted to represent images with multiple color regions for comparison. The 2-D string approach represents an image as a symbolic projection along the horizontal and vertical directions [36]; the image retrieval problem is then converted to a problem of 2-D sequence matching. Our approach also makes use of the relative position between features. However, our approach adopts different representations from the 2-D string approach [36] and can flexibly control the strictness of the spatial context encoding. Besides, our approach can flexibly address rotation changes and enjoys much lower computational complexity in verification than the 2-D string approach [36].

On the other hand, some local verification approaches are proposed to check the spatial consistency of features in local areas instead of the entire image plane. In [5], loose spatial consistency from some spatially nearest neighbors is imposed to filter false visual-word matches. Such a loose scheme, although efficient, is sensitive to the image noise incurred by editing. Wu et al.
[1] proposed to weight local features with the comparison results of the corresponding bundled features. A bundled feature is a group of local features in an MSER region [37]. The MSER regions are defined by an extremal property of the intensity function in the region and on its outer boundary, and are detected as stable regions across a threshold range from watershed-based segmentation [37]. Bundled features are compared by the shared visual word amount and the relative ordering of matched visual words, and the matching score of bundled features is used to weight the visual word vote for image similarity. In [8, 38], min-hash is proposed to describe images by independently selecting a set of visual words as global descriptors and to define image similarity as the visual word set overlap. In [9], geometric min-hashing constructs repeatable hash keys with loosely local geometric information for a more discriminative description. The potential concern with min-hashing [8,38] and its variant [9] is that, although high retrieval precision can be achieved, the retrieval recall may be limited unless many more hashing tables are involved, which, however, imposes a heavier memory burden. Compared with global geometric verification, in general cases, local geometric verification is more computationally efficient for real-time retrieval systems, at the sacrifice of some retrieval accuracy.

Since spatial context plays a key role in visual content identification and understanding, it has also been widely explored for object classification and recognition in the computer vision community. Spatial Pyramid Matching (SPM) [39] represents rigid spatial information by quantizing the image space for scene recognition but cannot address transformation changes. In [40], shape context is proposed to match shapes for object recognition: at a reference point, shape context captures the distribution of the remaining points relative to it. In [41], correlograms of visual words are quantized to compact correlations to model object shape for classification. Yuan et al. [42] proposed to discover meaningful

spatially co-occurring patterns of visual words, called visual phrases. Zhang and Chen [43] proposed to identify high-order spatial features by hashing matched local features into an offset space. Visual-phrase-based approaches are promising for limited object classification and recognition, but in large-scale image retrieval with a large codebook of visual words, it is intractable to explore the visual phrases of all potential objects.

In this paper, we propose an efficient full geometric verification method for large-scale partial-duplicate Web image search. We explicitly encode the spatial context among the local features of an image into a coding map. Owing to the invariance of the SIFT feature [19], the coding map captures invariance to translation, rotation, and scaling. Based on the coding maps, we propose an efficient spatial verification scheme to effectively identify and remove spatially inconsistent local matches, which significantly boosts the retrieval accuracy with small computational overhead.

3 Our Method

The spatial relationship among visual words in an image is critical in identifying duplicate image patches. Assume we have obtained the local feature matching results between two images by feature quantization, and that the matching results are polluted by some false (spatially inconsistent) matches. To efficiently identify those false matches, we propose to encode the spatial context for spatial verification. We first introduce the spatial context encoding scheme in Subsection 3.1. After that, we discuss how to perform geometric verification in Subsection 3.2. In Subsection 3.3, we discuss how to recover mis-removed pairs with an enhancement based on affine estimation.

3.1 Spatial Context Encoding

We encode the relative positions between features in the image plane. A binary spatial map S is generated to describe the relative position between each feature pair along the horizontal direction. For instance, given an image with K features {v_i} (i = 1, …, K), its spatial map S is defined as follows:

$$S(i,j)=\begin{cases}0,&\text{if }v_j\text{ is left of }v_i,\\ 1,&\text{otherwise.}\end{cases}\qquad(1)$$

In the spatial map S, row i records the other features' spatial relationships with feature v_i in the image. We can also interpret the map as follows: in row i, feature v_i is selected as the origin, and the image plane is divided into two parts by the vertical line through it; the spatial map then records on which side each other feature is located. Therefore, one bit, either 0 or 1, suffices to encode the relative spatial position of one feature with respect to another.

The spatial context encoded by S in (1) is relatively loose. To describe the spatial relationship of local features more strictly, we generalize our encoding scheme. Before comparing the features' relative positions along the horizontal direction, we rotate the image plane counterclockwise by a series of r predefined angles θ_k = kπ/r (k = 0, …, r−1) and encode the new relative spatial relationship of the local features for each rotation into a new coding map. After that, all coding maps are concatenated into a 3-D array G as a general coding map. For instance, after rotating the image plane counterclockwise by θ_k about the image origin, the new location (x_i^{(k)}, y_i^{(k)}) of feature v_i can be derived from its original location (x_i, y_i):

$$\begin{pmatrix}x_i^{(k)}\\ y_i^{(k)}\end{pmatrix}=\begin{pmatrix}\cos\theta_k & \sin\theta_k\\ -\sin\theta_k & \cos\theta_k\end{pmatrix}\begin{pmatrix}x_i\\ y_i\end{pmatrix}.$$
Then, the generalized spatial map G can be defined as follows:

$$G(i,j,k)=\begin{cases}0,&\text{if }v_j^{(k)}\text{ is left of }v_i^{(k)},\\ 1,&\text{otherwise.}\end{cases}$$

The spatial map described above only captures invariance to translation and scaling changes. Invariance to rotation changes can be flexibly added by considering the SIFT characteristic orientation when constructing the spatial coding map. In other words, instead of checking whether a test feature v_j^{(k)} is on the left-hand side of the vertical line passing through a reference feature point v_i^{(k)}, we can check whether v_j^{(k)} is on the left-hand side of the line passing through v_i^{(k)} whose normal direction is exactly the orientation of the reference feature point. With this modification, the new coding map captures invariance to rotation as well as to translation and scaling changes. Our spatial context encoding scheme is a kind of quantization of the image plane; in this sense, it is related to the local histogram modeling in spatial pyramid matching [39].

From the discussion above, it can be seen that encoding the spatial context is computationally efficient. However, the full spatial maps of all features in an image would cost considerable memory. Fortunately, there is no need to store these maps in our image search scenario. In fact, for each feature, we only need to store the x- and y-coordinates and the characteristic orientation. When checking the feature matching of two images, only the coordinates and the characteristic orientations of the matched features are needed to generate the spatial maps for spatial verification in real time. The details of spatial verification are discussed in the next subsection.
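To make the encoding concrete, here is a minimal sketch in Python with NumPy (our own illustration, not the authors' code; the function name and array layout are ours). It builds the generalized coding map G from an N×2 array of keypoint coordinates, optionally offsetting each rotation by the reference feature's SIFT orientation for the rotation-invariant variant:

```python
import numpy as np

def coding_map(xy, r, orientations=None):
    """Generalized spatial coding map G of shape (N, N, r).

    G[i, j, k] is 0 iff feature j lies to the left of the line through
    feature i whose normal is the x-axis rotated by theta_k = k*pi/r,
    optionally offset by feature i's SIFT orientation (hypothetical
    `orientations` array) for rotation invariance.
    """
    n = len(xy)
    G = np.empty((n, n, r), dtype=np.uint8)
    for i in range(n):
        dx = xy[:, 0] - xy[i, 0]
        dy = xy[:, 1] - xy[i, 1]
        for k in range(r):
            phi = k * np.pi / r
            if orientations is not None:
                phi += orientations[i]   # rotation-invariant variant
            # The sign of the projection onto the rotated x-axis decides
            # "left" (0) versus "right or on the line" (1).
            G[i, :, k] = (np.cos(phi) * dx + np.sin(phi) * dy >= 0)
    return G
```

With r = 1 and no orientations, this reduces to the basic horizontal map S in (1).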

3.2 Local Matching Verification

In image search, after feature quantization, we can obtain the feature matching results between images. Suppose a query image I_q and a matched image I_m are found to share N pairs of matched SIFT features through vector quantization. Then the corresponding sub-spatial-maps of these matched features of both I_q and I_m can be generated, denoted as G_q and G_m. If the query image and the matched image are partial duplicates, it is likely that M out of the N matches correspond to the duplicated patch, and that the spatial configuration of those M matched features is very similar in the two images, up to changes in translation, scale, or rotation. Since our spatial maps can handle those changes, the corresponding spatial maps of the M matched features in the two images, which are sub-maps of G_q and G_m, are identical. Therefore, the spatial verification problem can be formulated as identifying, from the complete spatial maps, a sub-map corresponding to the spatially consistent matches.

In our spatial verification, instead of directly identifying spatially consistent matches, we proceed to determine and remove the spatially inconsistent ones. After all those false matches are discarded, the remaining matches are the expected spatially consistent results. In fact, it is difficult to discover all spatially inconsistent matches once and for all. However, it is possible to find the one match that is most likely to be false; intuitively, such a match incurs the most spatial inconsistency. Therefore, the key lies in how to formulate the spatial inconsistency. In the following, we discuss our spatial verification scheme in detail.

To investigate the spatial inconsistency of matches, we perform a logical exclusive-or (XOR) operation on G_q and G_m:

$$V=G_q\oplus G_m.$$

Ideally, if all N matches are geometrically consistent, V is zero for all entries. If some false matches exist, the entries of these false matches in G_q and G_m may be inconsistent, and those inconsistencies cause the corresponding exclusive-or results in V to be 1. In other words, the relative spatial consistency of feature j with respect to the reference feature i is encoded with an r-bit vector C = (V(i,j,1), …, V(i,j,r)), and if feature i and feature j are geometrically inconsistent in the two images, at least one bit in C will be 1. Therefore, the union (logical OR) operation $\bigvee_{k=1}^{r}V(i,j,k)$ expresses the geometric inconsistency status of the two features along all rotated directions. As a result, we define the inconsistency of feature i with all other features as

$$f(i)=\sum_{j=1}^{N}\bigvee_{k=1}^{r}V(i,j,k).\qquad(2)$$

If there exists i with f(i) > 0, we define i* = arg max_i f(i); then the i*-th pair is most likely to be a false match and should be removed. Consequently, we can iteratively remove such mismatching pairs until the maximal value of f is zero. The computational complexity of the spatial verification is O(r·N²). It should be noted that in the iteration loop, when updating f, we just need to subtract the contributions of the removed inconsistent match, instead of recomputing f by (2).

Fig.2 shows four real examples of the spatial verification results on two relevant and two irrelevant image pairs. All image pairs have many initial local matches after feature quantization. It can be observed that for the relevant partial-duplicate images, our verification scheme effectively identifies and removes the spatially inconsistent matches. On the other hand, for the irrelevant image pairs, only very few, say, one or two, local matches can pass our spatial verification scheme.

Fig.2. Real examples of spatial verification results with spatial context coding on (a) relevant and (b) irrelevant image pairs. Each line denotes a local match across two images, with the line endpoints at the key point locations of the corresponding SIFT features. The red lines denote local matches passing our spatial verification, while the blue lines denote those that fail to pass the verification.
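The verification loop can be transcribed directly from the definitions above. The following Python sketch (our own naming and data layout, assuming the coding maps from the earlier snippet; not the authors' implementation) XORs the two sub-maps, ORs over the r directions, and greedily removes the match with the largest inconsistency until f vanishes:

```python
import numpy as np

def verify_matches(Gq, Gm):
    """Remove spatially inconsistent matches between two images.

    Gq, Gm: (N, N, r) sub-coding-maps of the N matched features in the
    query and the matched image. Returns the indices of the matches
    that survive the verification.
    """
    V = np.logical_xor(Gq, Gm)            # exclusive-or of the two maps
    U = V.any(axis=2).astype(np.int64)    # logical OR over the r directions
    f = U.sum(axis=1)                     # inconsistency f(i) as in (2)
    alive = set(range(len(f)))
    while f.max() > 0:
        worst = int(np.argmax(f))         # match with maximal inconsistency
        # Subtract its contributions instead of recomputing (2).
        f -= U[:, worst]
        f[worst] = 0
        U[worst, :] = 0
        U[:, worst] = 0
        alive.discard(worst)
    return sorted(alive)
```

Each iteration removes one match and only subtracts one column of U, which keeps the whole loop within the stated O(r·N²) bound.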
3.3 Affine Estimation for Enhancement

Due to the unavoidable digital error in the detection of key points, some SIFT features exhibit some

small drifting. With such drifting error, the relative locations of features with nearly the same x- or y-coordinate might be inverted, causing some true matches to fail our spatial verification. This phenomenon is prevalent when two duplicate images share many matched feature pairs in a cluster. Besides, a small affine transformation of the images will exacerbate such error. Moreover, the error is also worsened by a large spatial coding factor r. As illustrated in Fig.3, although the duplicate image pair shares 12 true matches, five of them are declared false and fail to pass the spatial verification. In this subsection, we propose to address this SIFT drifting problem by affine estimation based on the matched pairs passing our spatial verification.

Fig.3. Instance of matching errors due to SIFT drifting. The red lines denote the true matches that pass our spatial verification, while the blue lines denote those that fail. r = 6.

Since many true matches still remain, the matched image will be assigned a comparatively high similarity score (defined as the number of local matches passing spatial verification), and consequently the effect of those false negatives on the retrieval performance is very small. In fact, we can also recover those false negatives by means of the discovered spatially consistent matches. In other words, we can estimate the affine transformation from the target image to the query image. Generally, an affine transformation has six parameters. Let (u, v) and (x, y) be the coordinates of the matched feature points in the matched image and the query image, respectively. Then the affine transformation between (u, v) and (x, y) can be expressed as:

$$\begin{pmatrix}x\\ y\end{pmatrix}=\begin{pmatrix}m_1 & m_2\\ m_3 & m_4\end{pmatrix}\begin{pmatrix}u\\ v\end{pmatrix}+\begin{pmatrix}t_1\\ t_2\end{pmatrix}.$$

The least-squares solution for the parameters p = (m_1, m_2, m_3, m_4, t_1, t_2) can be determined by solving the corresponding normal equations [19]. After estimating the parameters of the affine transformation, we can check the other matched pairs that fail to pass the spatial verification. In detail, assume (x', y') is the feature point in the query image estimated by the transformation model. Then the mapping error of the matched feature point pair is

$$e=\sqrt{(x-x')^2+(y-y')^2}.$$

To check the validity of a matching, we set a threshold T such that, if the mapping error of a matched pair that failed to pass spatial verification is less than T, we consider that pair a true match. Since the duplicated images may undergo scale changes, it is reasonable to weight the threshold with a factor reflecting the scale difference. Consequently, T is defined as T = s·τ, where τ is a weighting constant and s is the average scale ratio over all true positive matches, defined as

$$s=\frac{1}{M}\sum_{i=1}^{M}\frac{scl_i^q}{scl_i^t},$$

where scl_i^q and scl_i^t are the scale values of the i-th truly matched SIFT features from the query image and the matched image, respectively. In our experiment, we set τ = 5. Finally, false negative matching pairs due to SIFT drifting are identified and kept.

It should be noted that, unlike RANSAC, our affine model estimation is performed only once; therefore, it is efficient in implementation. With our affine estimation for enhancement, we can effectively and efficiently recover those false negative matches that fail to pass our spatial verification due to SIFT drifting.
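As a rough illustration of this enhancement (our own Python sketch with hypothetical argument names, not the authors' implementation), the six affine parameters can be fitted by least squares on the verified matches and then used to re-admit rejected pairs with a small mapping error:

```python
import numpy as np

def recover_drifted_matches(uv, xy, passed, scale_q, scale_t, tau=5.0):
    """Affine enhancement: re-admit rejected matches whose mapping error
    under the fitted affine model is below T = s * tau.

    uv, xy: (N, 2) keypoint coordinates in the matched and the query
    image; passed: boolean mask of matches surviving spatial
    verification; scale_q, scale_t: SIFT scales in the two images.
    """
    u, v = uv[passed, 0], uv[passed, 1]
    # Linear system A p = b for p = (m1, m2, m3, m4, t1, t2), where
    # x = m1*u + m2*v + t1 and y = m3*u + m4*v + t2.
    A = np.zeros((2 * u.size, 6))
    A[0::2, 0], A[0::2, 1], A[0::2, 4] = u, v, 1.0
    A[1::2, 2], A[1::2, 3], A[1::2, 5] = u, v, 1.0
    b = xy[passed].reshape(-1)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares fit, done once

    # Threshold weighted by the average scale ratio of the true matches.
    s = np.mean(scale_q[passed] / scale_t[passed])
    T = s * tau

    # Mapping error e of every pair under the estimated transformation.
    est_x = p[0] * uv[:, 0] + p[1] * uv[:, 1] + p[4]
    est_y = p[2] * uv[:, 0] + p[3] * uv[:, 1] + p[5]
    err = np.hypot(xy[:, 0] - est_x, xy[:, 1] - est_y)
    return passed | (err < T)   # keep verified matches, re-admit close ones
```

Unlike RANSAC, the model is fitted exactly once over all verified matches, which is what keeps the enhancement cheap.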
4 Retrieval Framework

In this section, we introduce our retrieval framework for large-scale partial-duplicate Web image search. In our approach, we adopt the DoG SIFT feature [19] for image representation. We use the SIFT descriptor for quantization, as discussed in Subsection 4.1. From feature quantization, the local matches between images can be identified for spatial verification with spatial coding, as discussed in Subsection 4.2. In Subsection 4.3, we discuss the inverted-file index structure for retrieval.

4.1 SIFT Feature Quantization

To build a large-scale image indexing and retrieval system, we need to quantize SIFT descriptors to visual words. Based on the quantization results, images can be readily indexed with an inverted file structure; moreover, local matches between images can be efficiently determined: if two local features from different images are quantized to the same visual word, they are considered a local match across the two images. From this perspective, quantization filters out irrelevant features of database images and focuses on only a very small set of features that are likely to match the query feature. In our feature quantization, the hierarchical visual vocabulary tree approach [6] is adopted as the quantizer. The leaf nodes of the tree are considered as visual words, and each SIFT descriptor is mapped to the scalar ID of the closest visual word.
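A minimal sketch of such a quantizer follows (our own illustrative structure; the actual tree in [6] is trained by hierarchical k-means, and the class layout here is an assumption):

```python
import numpy as np

class VocabTree:
    """Hierarchical quantizer sketch in the spirit of the vocabulary
    tree [6]: greedy descent from the root, comparing the descriptor
    against the child centroids at each level."""

    def __init__(self, centers, children):
        self.centers = centers      # node_id -> (k, 128) child centroids
        self.children = children    # node_id -> list of k child node ids
                                    # (empty list marks a leaf/visual word)

    def quantize(self, desc):
        """Map one 128-D SIFT descriptor to the scalar ID of the closest
        leaf; costs O(k * depth) distance computations instead of a flat
        scan over the whole vocabulary."""
        node = 0                    # root
        while self.children[node]:
            d = np.linalg.norm(self.centers[node] - desc, axis=1)
            node = self.children[node][int(np.argmin(d))]
        return node                 # leaf id serves as the visual word id
```

The greedy descent is what makes a 1 M-word vocabulary practical: with, say, a branch factor of 10 and depth 6, each descriptor needs only about 60 distance computations.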

4.2 Spatial Verification for Retrieval

Once the local matches between images are obtained, we can perform spatial verification with spatial maps to remove false matches and impose the affine enhancement to avoid mis-filtering true matches, as discussed in Section 3. As a result, a number of local matches are regarded as spatially consistent and kept. The similarity scores between images are defined by counting the number of such true matches.

We formulate image retrieval as a voting problem: each visual word in the query image votes on its matched images. Intuitively, the tf-idf weight [5] can be used to distinguish different matched features. In our experiments, we simply count the number of matched features, which yields similar results and incurs less computation. In other words, the similarity between images is defined as the cardinality of the set of matched visual words passing the verification.

4.3 Index Structure

An inverted-file index structure is used for scalable indexing and retrieval. Each visual word has an entry in the index that contains the list of images in which the visual word appears. For each indexed feature, we store its image ID, SIFT scale and orientation values, and its x- and y-coordinates. For each of the four geometric items, we re-scale its range to be between 0 and 255 and represent it with only one byte. In our implementation, we allocate 8 bytes to index a feature: 4 bytes for the image ID and 1 byte for each of the four geometric items.
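For illustration, such an 8-byte posting can be packed as follows (a Python sketch under our own assumptions about the normalization ranges and field order, which the paper does not specify):

```python
import struct

def pack_posting(image_id, x, y, scale, orientation):
    """Pack one indexed feature into 8 bytes: a 4-byte image ID plus one
    byte each for x, y, scale and orientation, each re-scaled to 0..255."""
    def quantize(value, lo, hi):
        # Clamp and map [lo, hi] linearly onto the byte range 0..255.
        return max(0, min(255, int(round(255 * (value - lo) / (hi - lo)))))

    return struct.pack(
        "<IBBBB",
        image_id,
        quantize(x, 0.0, 1.0),                  # x normalized by image width
        quantize(y, 0.0, 1.0),                  # y normalized by image height
        quantize(scale, 0.0, 30.0),             # assumed SIFT scale range
        quantize(orientation, -3.1416, 3.1416), # orientation in radians
    )

posting = pack_posting(42, 0.31, 0.77, 4.2, 1.05)
assert len(posting) == 8
```

At 8 bytes per feature, an inverted file over one million images with a few hundred features each stays within a few gigabytes of memory, which is what makes the single-server experiments below feasible.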

5 Experiments

We conduct experiments on large-scale image databases. In Subsection 5.1, we introduce the experiment setup. Then, in Subsection 5.2, we discuss the evaluation results on the one-million image database. After that, we provide the results on the 10.2-million image database.

5.1 Experiment Setup

We build our basic dataset by randomly crawling one million images from the Web; the basic dataset is used as distracting images. Two ground-truth datasets are used for the evaluation of partial-duplicate Web image retrieval. The first dataset is collected by ourselves: following the Tineye search demo results①, we collected and manually labeled partially duplicated Web images of 23 groups from both Tineye② and Google Image search. Some sample images are shown in Figs. 1, 2, and 4. We denote this dataset as DupData. The second dataset is the Copydays dataset③, which contains images generated from 157 original images with various kinds of editing④, such as cropping, scaling, rotation, JPEG compression, printing and scanning, blurring, and painting. Some sample images of this dataset are shown in Fig.1(b). To evaluate the performance with respect to the database size, three smaller databases (50 K, 200 K, and 500 K) are built by sampling the basic dataset.

Fig.4. Six groups of sample images from the DupData dataset. (a) Titanic. (b) Seven-Eleven logo. (c) Energy Star logo. (d) Uncle Sam. (e) Starbucks logo. (f) Apollo.

① …/searches, Aug. ② Tineye Image Search, …, Aug. ③ …/jegou/data.php/, Aug. ④ …/generator/, Aug.

For the DupData dataset, we randomly select query images from each group of our ground-truth dataset, with the number of query images from each group proportional to the group size; finally, 100 representative query images are obtained. For the Copydays dataset, we select 229 images of the challenging strong category as queries.

Mean average precision (mAP), defined in (3), is used as our evaluation metric:

$$\mathrm{mAP}=\frac{1}{Q}\sum_{i=1}^{Q}\frac{\sum_{k=1}^{n}P_i(k)\,\mathrm{rel}_i(k)}{R_i},\qquad(3)$$

where Q denotes the number of query images, R_i denotes the number of relevant results for the i-th query image, P_i(k) denotes the precision over the top k retrieval results of the i-th query image, rel_i(k) is an indicator function equaling 1 when the k-th retrieval result of the i-th query image is a relevant image and 0 otherwise, and n denotes the total number of returned results. For the DupData and the Copydays datasets, Q is set as 100 and 229, respectively. In the DupData dataset, R_i ranges between 30 and 70 for different categories, while in the Copydays dataset, R_i ranges from 20 to 24. n is fixed in our experiments, since P_i(n) is usually very small for large n and contributes little to the average precision.
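For concreteness, (3) can be computed with a few lines of Python (our own illustrative code; `ranked_lists` and `relevant` are hypothetical containers for the returned rankings and the ground-truth sets):

```python
def mean_average_precision(ranked_lists, relevant):
    """mAP as in (3): ranked_lists[q] is the ordered list of returned
    image IDs for query q; relevant[q] is the set of ground-truth
    relevant IDs, whose size plays the role of R_i."""
    ap_sum = 0.0
    for q, ranking in ranked_lists.items():
        hits, precision_sum = 0, 0.0
        for k, image_id in enumerate(ranking, start=1):
            if image_id in relevant[q]:      # rel_i(k) = 1
                hits += 1
                precision_sum += hits / k    # accumulate P_i(k)
        ap_sum += precision_sum / len(relevant[q])   # divide by R_i
    return ap_sum / len(ranked_lists)                # average over Q
```

For a single query with ranking ['a', 'b', 'c'] and relevant set {'a', 'c'}, it returns (1/1 + 2/3)/2 ≈ 0.83.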
In our approach, the key parameter is the coding map factor r. The selection of r determines how finely the image plane is divided when checking the relative spatial positions of features. We have tested different settings of r on the 1-million image database with the Copydays dataset as ground truth. As shown in Fig.5, when r increases, the mAP performance first grows sharply, then becomes relatively stable, and gradually declines after reaching the peak. This is because a larger r defines a stricter relative spatial relationship and increases the discriminative capability to identify false positive matches; however, when r is too large, say over 6, the SIFT drifting error is magnified. Since that error is suppressed by the affine enhancement discussed in Subsection 3.3, the drop in mAP is gentle. The best performance is achieved when r equals 6, which is selected in the following experiments.

Fig.5. Impact of r on the mAP performance on the Copydays ground-truth dataset mixed with the 1-million image database.

Our approach identifies the spatially inconsistent matches among the initial local feature matches between images. In the experimental study, it is observed that, over all images passing our geometric verification, only 6.1 feature matches on average are left after the spatial verification, out of a considerably larger number of initial matches.

5.2 Evaluation

Comparison Algorithms. We take the Bag-of-Visual-Words method with the visual vocabulary tree (VVT) [6] as the baseline approach. A visual vocabulary of 1 M visual words is adopted, trained with 30 million SIFT features sampled from 300 million features of an independent image database. In fact, we have experimented with different visual codebook sizes and found that the 1 M vocabulary yields the best overall performance for the VVT approach. The same visual vocabulary tree is also used for feature quantization in our approach.

Two other approaches are adopted to enhance the VVT method for comparison. The first one is re-ranking via full geometric verification [34], based on the estimation of an affine transformation with our implementation of [11]. As a post-processing step, the re-ranking is performed on only the top 300 initial results of the VVT approach; we call this method reranking. The second one is the Bundled Feature [1] approach, which assembles local SIFT features in MSER regions and compares the feature bundles by their shared visual word amount and the relative ordering consistency of matched features. We denote this approach as BF.

Accuracy. We evaluate all approaches on the DupData dataset and the Copydays dataset, respectively. The mAP comparison results on the DupData dataset are illustrated in Fig.6. It can be observed that our approach outperforms the other three methods in retrieval accuracy. With the increase of the image database size, the mAP performance of all approaches decreases. When the image database size is very small (50 K), the mAP of our approach is very close to that

of the re-ranking method. On the 1 M dataset, our approach raises the mAP of the VVT to 0.66, a relative improvement of 35%. Since the re-ranking based on full geometric verification is applied only to the top 300 initially returned results, its performance is highly determined by the recall of the initial top 300 results. As can be seen, re-ranking the results of the baseline achieves 0.61, still 0.05 lower than our approach. Besides, our approach also outperforms BF.

Fig.6. Performance comparison of different methods with different database sizes on the DupData ground-truth dataset.

The mAP comparison results on the Copydays dataset are illustrated in Fig.7. The mAP performance of all comparison algorithms on this dataset exhibits a trend similar to that on the DupData dataset. Our approach steadily achieves the best performance among the comparison methods across all image database sizes. The performance of BF is better than the VVT approach but poorer than the other two comparison algorithms. This is partially because, unlike the other three algorithms, the effectiveness of the BF approach is strongly dependent on the detection accuracy and the repeatability of the MSER detection; in partial-duplicate images, the MSER detection can be disturbed by various kinds of editing, which degrades the effectiveness of the BF approach.

Fig.7. Performance comparison of different methods with different database sizes on the Copydays ground-truth dataset.

Efficiency. We perform the experiments on a server with a 2.0 GHz CPU and 16 GB memory. Fig.8 shows the average time cost of all four approaches; the time for SIFT feature extraction is not included. It takes the VVT 0.90 seconds on average to perform one image query, while for our approach the average query time is 1.1 seconds. The Bundled Feature approach is much more time-consuming, taking about 3 seconds on average, more than twice as long as our method. Full geometric re-ranking is the most time-consuming, introducing an additional 4.6 seconds over the VVT, although it verifies only the top 300 candidate images.

Fig.8. Comparison of the average query time cost for different methods on the 1 M database.

Memory Cost. In large-scale image search, the memory cost is mainly spent on the index file, whose size is proportional to the number of indexed features and the memory cost per indexed feature. Since the number of indexed features is the same for all approaches, we only need to compare the memory cost per indexed feature. The comparison results are shown in Table 1. The VVT approach [6] costs 8 bytes to index one feature: 4 bytes to store the image ID and another 4 bytes to store the tf-idf weight. In BF [1], each indexed feature has to store the image ID (4 bytes), the number of bundled cells (1 byte), and a list of bundled-cell information (4 bytes each). The number of bundled cells differs across features; in our implementation, there are about 2.2 bundled cells on average per feature. Therefore, it takes about 14 bytes for BF [1] to index one feature.

Table 1. Memory Cost on Each Indexed Feature for the Four Approaches

Method         Memory Cost per Indexed Feature (Byte)
VVT            8
BF             14
Reranking      8
Our approach   8

Our approach costs the same memory as the reranking approach for indexing each feature: both take 4 bytes to store the image ID and another 4 bytes to store the x- and y-coordinates, the orientation, and the scale information.

5.3 Scalability on a 10-Million Image Database

To demonstrate the scalability of our spatial coding scheme, we move on to a 10-million image database. This database includes about 3 million images from ImageNet [44] and 7 million images crawled from the Web, 10.2 million images in total. Since the 10-million database contains additional partial duplicates of the DupData dataset, for evaluation purposes we identify these partially duplicated images by searching the database with every image from our ground-truth dataset as the query image. Then, we enrich our original DupData dataset with those images, generating a new, enlarged DupData ground-truth dataset. To evaluate the retrieval performance, we mix the enriched DupData dataset and the CopyDays dataset with the 10 million images, respectively, which are used as distractors.

Considering the high memory requirement, we evaluate the VVT and our approach on another server (Dell R710) with a 2.4 GHz CPU and 32 GB memory, which is much more powerful than the server mentioned in Subsection 5.2. Again, 100 images from the 23 categories are used as queries, and the 1 M visual codebook is selected for the VVT method as discussed in Subsection 5.2. In this evaluation, the reranking approach takes the top returned results for geometric verification. Due to the memory limit, the Bundled Feature [1] approach is not compared on this dataset: as discussed in the previous subsection, it costs about 14 bytes to index each feature, and with about 300 SIFT features per image on average in our implementation, indexing 10 million images would take about 42 GB of memory, which exceeds the capacity of our server.

Table 2 compares the average performance in mAP and the time cost of the VVT method, the reranking method, and our approach. On the enriched DupData dataset, our approach achieves a relative 30.1% improvement in mAP over the VVT approach, from 0.36 to 0.47. On the Copydays dataset, our approach pushes the mAP from 0.37 of the VVT to 0.53, a relative 43.2% improvement. Compared with the reranking method, our approach improves the mAP by 0.02 and 0.04 on the two datasets, respectively. In terms of efficiency, our approach is comparable to the baseline, with about 0.5 seconds of additional time cost per query. Compared with the reranking method, our approach is much more efficient, with about 5.9 seconds less time per query.

Table 2. Performance Comparison in mAP and Time Cost on the 10-Million Dataset

Method         mAP (DupData)   mAP (CopyDays)   Query Time (s)
VVT            0.36            0.37             –
Reranking      0.45            0.49             –
Our approach   0.47            0.53             –

6 Conclusions

In this paper, we proposed a scheme to encode the spatial context of local features for efficient geometric verification in large-scale partial-duplicate image search. The generated spatial map encodes the relative spatial locations among the features in an image and captures invariance to translation, rotation, and scaling changes. Based on the spatial context coding maps, spatial verification can be performed efficiently to effectively discover spatially inconsistent feature matches between images.
Further, an enhancement scheme based on affine estimation was presented to refine the geometric verification results. As demonstrated on large-scale partial-duplicate Web image retrieval, spatial coding achieves competitive performance compared with existing algorithms, and its additional computational overhead over the VVT approach is small. Our method aims to identify images sharing some duplicated patches, and the geometric context verification scheme was designed for partial-duplicate Web image retrieval; it may not work well on object images with severe affine distortion from various observation views.

References

[1] Wu Z, Ke Q, Isard M, Sun J. Bundling features for large scale partial-duplicate web image search. In Proc. CVPR, June 2009.
[2] Xie H, Gao K, Zhang Y et al. Efficient feature detection and effective post-verification for large scale near-duplicate image search. IEEE Trans. Multimedia, 2011, 13(6).
[3] Xie L, Tian Q, Zhou W et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb. Computer Vision and Image Understanding, 2014, 124.
[4] Chu L, Jiang S, Wang S, Zhang Y, Huang Q. Robust spatial consistency graph model for partial duplicate image retrieval. IEEE Trans. Multimedia, 2013, 15(8).
[5] Sivic J, Zisserman A. Video Google: A text retrieval approach to object matching in videos. In Proc. the 9th IEEE Int. Conf. Computer Vision, Oct. 2003.
[6] Nister D, Stewenius H. Scalable recognition with a vocabulary tree. In Proc. CVPR, June 2006.
[7] Chum O, Philbin J, Sivic J, Isard M, Zisserman A. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. the 11th IEEE Int. Conf. Computer Vision, Oct. 2007, pp.1-8.
[8] Chum O, Philbin J, Zisserman A. Near duplicate image detection: Min-Hash and tf-idf weighting. In Proc. the 19th BMVC, Sept. 2008.
[9] Chum O, Perdoch M, Matas J. Geometric min-hashing: Finding a (thick) needle in a haystack. In Proc. CVPR, June 2009.
[10] Jegou H, Douze M, Schmid C. Hamming embedding and weak geometric consistency for large scale image search. In Proc. the 10th ECCV, Oct. 2008.
[11] Philbin J, Chum O, Isard M, Sivic J, Zisserman A. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, June 2007, pp.1-8.
[12] Philbin J, Chum O, Isard M et al. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proc. CVPR, June 2008, pp.1-8.
[13] Jégou H, Douze M, Schmid C, Pérez P. Aggregating local descriptors into a compact image representation. In Proc. CVPR, June 2010.
[14] Zhang Y, Jia Z, Chen T. Image retrieval with geometry-preserving visual phrases. In Proc. CVPR, June 2011.
[15] Zhou W, Lu Y, Li H, Song Y, Tian Q. Spatial coding for large scale partial-duplicate Web image search. In Proc. Int. Conf. Multimedia, Oct. 2010.
[16] Zheng L, Wang S, Liu Z, Tian Q. Lp-Norm IDF for large scale image search. In Proc. CVPR, June 2013.
[17] Xie H, Zhang Y, Tan J, Guo L, Li J. Contextual query expansion for image retrieval. IEEE Trans. Multimedia, 2014, 16(4).
[18] Liu Z, Li H, Zhou W, Zhao R, Tian Q. Contextual hashing for large-scale image search. IEEE Trans. Image Processing, 2014, 23(4).
[19] Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2).
[20] Zhou W, Lu Y, Li H, Tian Q. Scalar quantization for large scale image search. In Proc. the 20th ACM Multimedia, Oct. 2012.
[21] Babenko A, Lempitsky V. The inverted multi-index. In Proc. CVPR, June 2012.
[22] Shen X, Lin Z, Brandt J et al. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In Proc. CVPR, June 2012.
[23] Zhou W, Li H, Lu Y, Tian Q. SIFT match verification by geometric coding for large-scale partial-duplicate Web image search. ACM Trans. Multimedia Computing, Communications, and Applications, 2013, 9(1): Article No. 4.
[24] Wang W, Zhang D, Zhang Y, Li J, Gu X. Robust spatial matching for object retrieval and its parallel implementation on GPU. IEEE Trans. Multimedia, 2011, 13(6).
[25] Chum O, Mikulik A, Perdoch M, Matas J. Total recall II: Query expansion revisited. In Proc. CVPR, June 2011.
[26] Zhou W, Li H, Lu Y, Wang M, Tian Q. Visual word expansion and BSIFT verification for large-scale image search. Multimedia Systems, DOI 10.1007/s….
[27] Zhang S, Yang M, Wang X et al. Semantic-aware co-indexing for image retrieval. In Proc. ICCV, 2013.
[28] Zhang S, Tian Q, Lu K et al. Edge-SIFT: Discriminative binary descriptor for scalable partial-duplicate mobile search. IEEE Trans. Image Processing, 2013, 22(7).
[29] Jégou H, Harzallah H, Schmid C. A contextual dissimilarity measure for accurate and efficient image search. In Proc. CVPR, June 2007, pp.1-8.
[30] Zhou W, Yang M, Li H, Wang X, Lin Y, Tian Q. Towards codebook-free: Scalable cascaded hashing for mobile image search. IEEE Trans. Multimedia, 2014, 16(3).
[31] Arandjelovic R, Zisserman A. Three things everyone should know to improve object retrieval. In Proc. CVPR, June 2012.
[32] Zhang X, Zhang L, Shum H Y. QsRank: Query-sensitive hash code ranking for efficient ε-neighbor search. In Proc. CVPR, June 2012.
[33] Fischler M A, Bolles R C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981, 24(6).
[34] Chum O, Matas J. Matching with PROSAC: Progressive sample consensus. In Proc. CVPR, June 2005.
[35] Smith J R, Chang S F. VisualSEEk: A fully automated content-based image query system. In Proc. the 4th ACM Multimedia, Nov. 1996.
[36] Chang S, Shi Q, Yan C. Iconic indexing by 2-D strings. IEEE Trans. Pattern Analysis and Machine Intelligence, 1987, 9(3).
[37] Matas J, Chum O, Urban M, Pajdla T. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 2004, 22(10).
[38] Chum O, Philbin J, Isard M et al. Scalable near identical image and shot detection. In Proc. CIVR, July 2007.
[39] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, June 2006.
[40] Belongie S, Malik J, Puzicha J. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 2002, 24(4).
[41] Savarese S, Winn J, Criminisi A. Discriminative object class models of appearance and shape by correlatons. In Proc. CVPR, June 2006.
[42] Yuan J, Wu Y, Yang M. Discovery of collocation patterns: From visual words to visual phrases. In Proc. CVPR, June 2007, pp.1-8.
[43] Zhang Y, Chen T. Efficient kernels for identifying unbounded-order spatial features. In Proc. CVPR, June 2009.
[44] Deng J, Dong W, Socher R et al. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, June 2009.

Wen-Gang Zhou received his B.E. degree in electronic information engineering from Wuhan University in 2006, and his Ph.D. degree in electronic engineering and information science from the University of Science and Technology of China (USTC), Hefei, in 2011. He was a research intern in the Internet Media Group of Microsoft Research Asia from December 2008 to August …. From September 2011 to September 2013, he worked as a postdoctoral researcher in the Computer Science Department of the University of Texas at San Antonio. He is currently an associate professor in the Department of Electronic Engineering and Information Science, USTC. His research interests include multimedia information retrieval and computer vision.

Hou-Qiang Li received his B.S., M.E., and Ph.D. degrees from the University of Science and Technology of China (USTC), Hefei, in 1992, 1997, and 2000, respectively, all in electronic engineering. He is currently a professor in the Department of Electronic Engineering and Information Science, USTC. His research interests include video coding and communications, image/video analysis, and computer vision. His research has been supported by the National Natural Science Foundation of China (NSFC), the National High Technology Research and Development 863 Program of China, Microsoft, Nokia, and Huawei. He is an associate editor of IEEE Transactions on Circuits and Systems for Video Technology and is on the editorial board of the Journal of Multimedia. He has served on technical/program committees and organizing committees, and as program co-chair and track/session chair, for over 10 international conferences.

Yijuan Lu is an assistant professor in the Department of Computer Science, Texas State University. She received her Ph.D. degree in computer science from the University of Texas at San Antonio in …. During …, she was a summer intern researcher at the FXPAL Lab; the Web Search and Mining Group, Microsoft Research Asia (MSRA); and the National Resource for Biomedical Supercomputing (NRBSC) at the Pittsburgh Supercomputing Center (PSC), Pittsburgh. She was an intern researcher at the Media Technologies Lab, Hewlett-Packard Laboratories (HP), in 2008, and a research fellow of the Multimodal Information Access and Synthesis (MIAS) Center at the University of Illinois at Urbana-Champaign (UIUC). Her current research interests include multimedia information retrieval, computer vision, machine learning, data mining, and bioinformatics. She has published extensively and has served as a reviewer for top conferences and journals. She was a 2007 Best Paper Candidate in the Retrieval Track of the Pacific-Rim Conference on Multimedia (PCM) and the recipient of the 2007 Prestigious HEB Dissertation Fellowship and the 2007 Star of Tomorrow Internship Program of MSRA. She is a member of ACM and IEEE.

Qi Tian received his B.E. degree in electronic engineering from Tsinghua University, Beijing, in 1992, his M.S. degree in electrical and computer engineering from Drexel University in 1996, and his Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana-Champaign, in …. He is currently a professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). Dr. Tian's research interests include multimedia information retrieval and computer vision. He has published over 250 refereed journal and conference papers. His research projects have been funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he has received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He took a one-year faculty leave at Microsoft Research Asia (MSRA). He was a co-author of an ACM ICIMCS 2012 Best Paper, an MMM 2013 Best Paper, a Top 10% Paper Award at MMSP 2011, a Best Student Paper at ICASSP 2006, and a Best Paper at PCM …. He received the 2010 ACM Service Award. Dr. Tian has been serving as program chair, organizing committee member, and TPC member for numerous IEEE and ACM conferences, including ACM Multimedia, SIGIR, ICCV, ICME, etc.
He has been a guest editor of IEEE Transactions on Multimedia, Computer Vision and Image Understanding, Pattern Recognition Letters, EURASIP Journal on Advances in Signal Processing, and Journal of Visual Communication and Image Representation, and is on the editorial boards of IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), IEEE Transactions on Multimedia (TMM), the Journal of Multimedia (JMM), and Machine Vision and Applications (MVA).


More information

Speeding up the Detection of Line Drawings Using a Hash Table

Speeding up the Detection of Line Drawings Using a Hash Table Speeding up the Detection of Line Drawings Using a Hash Table Weihan Sun, Koichi Kise 2 Graduate School of Engineering, Osaka Prefecture University, Japan sunweihan@m.cs.osakafu-u.ac.jp, 2 kise@cs.osakafu-u.ac.jp

More information

Large-scale Image Search and location recognition Geometric Min-Hashing. Jonathan Bidwell

Large-scale Image Search and location recognition Geometric Min-Hashing. Jonathan Bidwell Large-scale Image Search and location recognition Geometric Min-Hashing Jonathan Bidwell Nov 3rd 2009 UNC Chapel Hill Large scale + Location Short story... Finding similar photos See: Microsoft PhotoSynth

More information

Packing and Padding: Coupled Multi-index for Accurate Image Retrieval

Packing and Padding: Coupled Multi-index for Accurate Image Retrieval Packing and Padding: Coupled Multi-index for Accurate Image Retrieval Liang Zheng 1, Shengjin Wang 1, Ziqiong Liu 1, and Qi Tian 2 1 State Key Laboratory of Intelligent Technology and Systems; 1 Tsinghua

More information

Large Scale Image Retrieval

Large Scale Image Retrieval Large Scale Image Retrieval Ondřej Chum and Jiří Matas Center for Machine Perception Czech Technical University in Prague Features Affine invariant features Efficient descriptors Corresponding regions

More information

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES Pin-Syuan Huang, Jing-Yi Tsai, Yu-Fang Wang, and Chun-Yi Tsai Department of Computer Science and Information Engineering, National Taitung University,

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

Efficient Representation of Local Geometry for Large Scale Object Retrieval

Efficient Representation of Local Geometry for Large Scale Object Retrieval Efficient Representation of Local Geometry for Large Scale Object Retrieval Michal Perďoch Ondřej Chum and Jiří Matas Center for Machine Perception Czech Technical University in Prague IEEE Computer Society

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Instance-level recognition II.

Instance-level recognition II. Reconnaissance d objets et vision artificielle 2010 Instance-level recognition II. Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d Informatique, Ecole Normale

More information

Hamming embedding and weak geometric consistency for large scale image search

Hamming embedding and weak geometric consistency for large scale image search Hamming embedding and weak geometric consistency for large scale image search Herve Jegou, Matthijs Douze, and Cordelia Schmid INRIA Grenoble, LEAR, LJK firstname.lastname@inria.fr Abstract. This paper

More information

Specular 3D Object Tracking by View Generative Learning

Specular 3D Object Tracking by View Generative Learning Specular 3D Object Tracking by View Generative Learning Yukiko Shinozuka, Francois de Sorbier and Hideo Saito Keio University 3-14-1 Hiyoshi, Kohoku-ku 223-8522 Yokohama, Japan shinozuka@hvrl.ics.keio.ac.jp

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science. Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Object Recognition in Large Databases Some material for these slides comes from www.cs.utexas.edu/~grauman/courses/spring2011/slides/lecture18_index.pptx

More information

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012)

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012) Three things everyone should know to improve object retrieval Relja Arandjelović and Andrew Zisserman (CVPR 2012) University of Oxford 2 nd April 2012 Large scale object retrieval Find all instances of

More information

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Internet

More information

Compressed local descriptors for fast image and video search in large databases

Compressed local descriptors for fast image and video search in large databases Compressed local descriptors for fast image and video search in large databases Matthijs Douze2 joint work with Hervé Jégou1, Cordelia Schmid2 and Patrick Pérez3 1: INRIA Rennes, TEXMEX team, France 2:

More information

DeepIndex for Accurate and Efficient Image Retrieval

DeepIndex for Accurate and Efficient Image Retrieval DeepIndex for Accurate and Efficient Image Retrieval Yu Liu, Yanming Guo, Song Wu, Michael S. Lew Media Lab, Leiden Institute of Advance Computer Science Outline Motivation Proposed Approach Results Conclusions

More information

Instance Search with Weak Geometric Correlation Consistency

Instance Search with Weak Geometric Correlation Consistency Instance Search with Weak Geometric Correlation Consistency Zhenxing Zhang, Rami Albatal, Cathal Gurrin, and Alan F. Smeaton Insight Centre for Data Analytics School of Computing, Dublin City University

More information

Robot localization method based on visual features and their geometric relationship

Robot localization method based on visual features and their geometric relationship , pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department

More information

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors Shao-Tzu Huang, Chen-Chien Hsu, Wei-Yen Wang International Science Index, Electrical and Computer Engineering waset.org/publication/0007607

More information

Efficient Re-ranking in Vocabulary Tree based Image Retrieval

Efficient Re-ranking in Vocabulary Tree based Image Retrieval Efficient Re-ranking in Vocabulary Tree based Image Retrieval Xiaoyu Wang 2, Ming Yang 1, Kai Yu 1 1 NEC Laboratories America, Inc. 2 Dept. of ECE, Univ. of Missouri Cupertino, CA 95014 Columbia, MO 65211

More information

FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE. Project Plan

FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE. Project Plan FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE Project Plan Structured Object Recognition for Content Based Image Retrieval Supervisors: Dr. Antonio Robles Kelly Dr. Jun

More information

Large-scale visual recognition Efficient matching

Large-scale visual recognition Efficient matching Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!

More information

Supplementary Material for Ensemble Diffusion for Retrieval

Supplementary Material for Ensemble Diffusion for Retrieval Supplementary Material for Ensemble Diffusion for Retrieval Song Bai 1, Zhichao Zhou 1, Jingdong Wang, Xiang Bai 1, Longin Jan Latecki 3, Qi Tian 4 1 Huazhong University of Science and Technology, Microsoft

More information

Image Retrieval with a Visual Thesaurus

Image Retrieval with a Visual Thesaurus 2010 Digital Image Computing: Techniques and Applications Image Retrieval with a Visual Thesaurus Yanzhi Chen, Anthony Dick and Anton van den Hengel School of Computer Science The University of Adelaide

More information

Automatic Ranking of Images on the Web

Automatic Ranking of Images on the Web Automatic Ranking of Images on the Web HangHang Zhang Electrical Engineering Department Stanford University hhzhang@stanford.edu Zixuan Wang Electrical Engineering Department Stanford University zxwang@stanford.edu

More information

Perception IV: Place Recognition, Line Extraction

Perception IV: Place Recognition, Line Extraction Perception IV: Place Recognition, Line Extraction Davide Scaramuzza University of Zurich Margarita Chli, Paul Furgale, Marco Hutter, Roland Siegwart 1 Outline of Today s lecture Place recognition using

More information

Video Google: A Text Retrieval Approach to Object Matching in Videos

Video Google: A Text Retrieval Approach to Object Matching in Videos Video Google: A Text Retrieval Approach to Object Matching in Videos Josef Sivic, Frederik Schaffalitzky, Andrew Zisserman Visual Geometry Group University of Oxford The vision Enable video, e.g. a feature

More information

Towards Visual Words to Words

Towards Visual Words to Words Towards Visual Words to Words Text Detection with a General Bag of Words Representation Rakesh Mehta Dept. of Signal Processing, Tampere Univ. of Technology in Tampere Ondřej Chum, Jiří Matas Centre for

More information

Randomized Visual Phrases for Object Search

Randomized Visual Phrases for Object Search Randomized Visual Phrases for Object Search Yuning Jiang Jingjing Meng Junsong Yuan School of EEE, Nanyang Technological University, Singapore {ynjiang, jingjing.meng, jsyuan}@ntu.edu.sg Abstract Accurate

More information

arxiv: v1 [cs.cv] 11 Feb 2014

arxiv: v1 [cs.cv] 11 Feb 2014 Packing and Padding: Coupled Multi-index for Accurate Image Retrieval Liang Zheng 1, Shengjin Wang 1, Ziqiong Liu 1, and Qi Tian 2 1 Tsinghua University, Beijing, China 2 University of Texas at San Antonio,

More information

Evaluation and comparison of interest points/regions

Evaluation and comparison of interest points/regions Introduction Evaluation and comparison of interest points/regions Quantitative evaluation of interest point/region detectors points / regions at the same relative location and area Repeatability rate :

More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information

By Suren Manvelyan,

By Suren Manvelyan, By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan,

More information

Instance-level recognition

Instance-level recognition Instance-level recognition 1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing Matching of descriptors Matching and 3D reconstruction

More information

arxiv: v1 [cs.cv] 15 Mar 2014

arxiv: v1 [cs.cv] 15 Mar 2014 Geometric VLAD for Large Scale Image Search Zixuan Wang, Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu Dept.of Electrical Engineering, Stanford University, CA 94305 ebay Research Labs,

More information

IMPROVING VLAD: HIERARCHICAL CODING AND A REFINED LOCAL COORDINATE SYSTEM. Christian Eggert, Stefan Romberg, Rainer Lienhart

IMPROVING VLAD: HIERARCHICAL CODING AND A REFINED LOCAL COORDINATE SYSTEM. Christian Eggert, Stefan Romberg, Rainer Lienhart IMPROVING VLAD: HIERARCHICAL CODING AND A REFINED LOCAL COORDINATE SYSTEM Christian Eggert, Stefan Romberg, Rainer Lienhart Multimedia Computing and Computer Vision Lab University of Augsburg ABSTRACT

More information

A Novel Real-Time Feature Matching Scheme

A Novel Real-Time Feature Matching Scheme Sensors & Transducers, Vol. 165, Issue, February 01, pp. 17-11 Sensors & Transducers 01 by IFSA Publishing, S. L. http://www.sensorsportal.com A Novel Real-Time Feature Matching Scheme Ying Liu, * Hongbo

More information

Patch Descriptors. EE/CSE 576 Linda Shapiro

Patch Descriptors. EE/CSE 576 Linda Shapiro Patch Descriptors EE/CSE 576 Linda Shapiro 1 How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Instance-level recognition

Instance-level recognition Instance-level recognition 1) Local invariant features 2) Matching and recognition with local features 3) Efficient visual search 4) Very large scale indexing Matching of descriptors Matching and 3D reconstruction

More information

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets

More information

Nagoya University at TRECVID 2014: the Instance Search Task

Nagoya University at TRECVID 2014: the Instance Search Task Nagoya University at TRECVID 2014: the Instance Search Task Cai-Zhi Zhu 1 Yinqiang Zheng 2 Ichiro Ide 1 Shin ichi Satoh 2 Kazuya Takeda 1 1 Nagoya University, 1 Furo-Cho, Chikusa-ku, Nagoya, Aichi 464-8601,

More information

Visual Recognition and Search April 18, 2008 Joo Hyun Kim

Visual Recognition and Search April 18, 2008 Joo Hyun Kim Visual Recognition and Search April 18, 2008 Joo Hyun Kim Introduction Suppose a stranger in downtown with a tour guide book?? Austin, TX 2 Introduction Look at guide What s this? Found Name of place Where

More information

CS6670: Computer Vision

CS6670: Computer Vision CS6670: Computer Vision Noah Snavely Lecture 16: Bag-of-words models Object Bag of words Announcements Project 3: Eigenfaces due Wednesday, November 11 at 11:59pm solo project Final project presentations:

More information

A Comparison of SIFT, PCA-SIFT and SURF

A Comparison of SIFT, PCA-SIFT and SURF A Comparison of SIFT, PCA-SIFT and SURF Luo Juan Computer Graphics Lab, Chonbuk National University, Jeonju 561-756, South Korea qiuhehappy@hotmail.com Oubong Gwun Computer Graphics Lab, Chonbuk National

More information

Adaptive Binary Quantization for Fast Nearest Neighbor Search

Adaptive Binary Quantization for Fast Nearest Neighbor Search IBM Research Adaptive Binary Quantization for Fast Nearest Neighbor Search Zhujin Li 1, Xianglong Liu 1*, Junjie Wu 1, and Hao Su 2 1 Beihang University, Beijing, China 2 Stanford University, Stanford,

More information

Visual Word based Location Recognition in 3D models using Distance Augmented Weighting

Visual Word based Location Recognition in 3D models using Distance Augmented Weighting Visual Word based Location Recognition in 3D models using Distance Augmented Weighting Friedrich Fraundorfer 1, Changchang Wu 2, 1 Department of Computer Science ETH Zürich, Switzerland {fraundorfer, marc.pollefeys}@inf.ethz.ch

More information

Published in: MM '10: proceedings of the ACM Multimedia 2010 International Conference: October 25-29, 2010, Firenze, Italy

Published in: MM '10: proceedings of the ACM Multimedia 2010 International Conference: October 25-29, 2010, Firenze, Italy UvA-DARE (Digital Academic Repository) Landmark Image Retrieval Using Visual Synonyms Gavves, E.; Snoek, C.G.M. Published in: MM '10: proceedings of the ACM Multimedia 2010 International Conference: October

More information

Partial Copy Detection for Line Drawings by Local Feature Matching

Partial Copy Detection for Line Drawings by Local Feature Matching THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE. E-mail: 599-31 1-1 sunweihan@m.cs.osakafu-u.ac.jp, kise@cs.osakafu-u.ac.jp MSER HOG HOG MSER SIFT Partial

More information

Learning Affine Robust Binary Codes Based on Locality Preserving Hash

Learning Affine Robust Binary Codes Based on Locality Preserving Hash Learning Affine Robust Binary Codes Based on Locality Preserving Hash Wei Zhang 1,2, Ke Gao 1, Dongming Zhang 1, and Jintao Li 1 1 Advanced Computing Research Laboratory, Beijing Key Laboratory of Mobile

More information

Min-Hashing and Geometric min-hashing

Min-Hashing and Geometric min-hashing Min-Hashing and Geometric min-hashing Ondřej Chum, Michal Perdoch, and Jiří Matas Center for Machine Perception Czech Technical University Prague Outline 1. Looking for representation of images that: is

More information

CS 231A Computer Vision (Fall 2011) Problem Set 4

CS 231A Computer Vision (Fall 2011) Problem Set 4 CS 231A Computer Vision (Fall 2011) Problem Set 4 Due: Nov. 30 th, 2011 (9:30am) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable part-based

More information

Local invariant features

Local invariant features Local invariant features Tuesday, Oct 28 Kristen Grauman UT-Austin Today Some more Pset 2 results Pset 2 returned, pick up solutions Pset 3 is posted, due 11/11 Local invariant features Detection of interest

More information

SHOT-BASED OBJECT RETRIEVAL FROM VIDEO WITH COMPRESSED FISHER VECTORS. Luca Bertinetto, Attilio Fiandrotti, Enrico Magli

SHOT-BASED OBJECT RETRIEVAL FROM VIDEO WITH COMPRESSED FISHER VECTORS. Luca Bertinetto, Attilio Fiandrotti, Enrico Magli SHOT-BASED OBJECT RETRIEVAL FROM VIDEO WITH COMPRESSED FISHER VECTORS Luca Bertinetto, Attilio Fiandrotti, Enrico Magli Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino (Italy) ABSTRACT

More information

A Novel Extreme Point Selection Algorithm in SIFT

A Novel Extreme Point Selection Algorithm in SIFT A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes

More information

Tag Based Image Search by Social Re-ranking

Tag Based Image Search by Social Re-ranking Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN: Semi Automatic Annotation Exploitation Similarity of Pics in i Personal Photo Albums P. Subashree Kasi Thangam 1 and R. Rosy Angel 2 1 Assistant Professor, Department of Computer Science Engineering College,

More information

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1 IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1 : Presentation structure : 1. Brief overview of talk 2. What does Object Recognition involve? 3. The Recognition Problem 4. Mathematical background:

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

Mixtures of Gaussians and Advanced Feature Encoding

Mixtures of Gaussians and Advanced Feature Encoding Mixtures of Gaussians and Advanced Feature Encoding Computer Vision Ali Borji UWM Many slides from James Hayes, Derek Hoiem, Florent Perronnin, and Hervé Why do good recognition systems go bad? E.g. Why

More information

Beyond Bags of Features

Beyond Bags of Features : for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015

More information

Patch Descriptors. CSE 455 Linda Shapiro

Patch Descriptors. CSE 455 Linda Shapiro Patch Descriptors CSE 455 Linda Shapiro How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

ImgSeek: Capturing User s Intent For Internet Image Search

ImgSeek: Capturing User s Intent For Internet Image Search ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate

More information

Learning a Fine Vocabulary

Learning a Fine Vocabulary Learning a Fine Vocabulary Andrej Mikulík, Michal Perdoch, Ondřej Chum, and Jiří Matas CMP, Dept. of Cybernetics, Faculty of EE, Czech Technical University in Prague Abstract. We present a novel similarity

More information

Indexing local features and instance recognition May 16 th, 2017

Indexing local features and instance recognition May 16 th, 2017 Indexing local features and instance recognition May 16 th, 2017 Yong Jae Lee UC Davis Announcements PS2 due next Monday 11:59 am 2 Recap: Features and filters Transforming and describing images; textures,

More information

arxiv: v3 [cs.cv] 3 Oct 2012

arxiv: v3 [cs.cv] 3 Oct 2012 Combined Descriptors in Spatial Pyramid Domain for Image Classification Junlin Hu and Ping Guo arxiv:1210.0386v3 [cs.cv] 3 Oct 2012 Image Processing and Pattern Recognition Laboratory Beijing Normal University,

More information

Contour-Based Large Scale Image Retrieval

Contour-Based Large Scale Image Retrieval Contour-Based Large Scale Image Retrieval Rong Zhou, and Liqing Zhang MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems, Department of Computer Science and Engineering, Shanghai

More information

Modern-era mobile phones and tablets

Modern-era mobile phones and tablets Anthony Vetro Mitsubishi Electric Research Labs Mobile Visual Search: Architectures, Technologies, and the Emerging MPEG Standard Bernd Girod and Vijay Chandrasekhar Stanford University Radek Grzeszczuk

More information

Query Adaptive Similarity for Large Scale Object Retrieval

Query Adaptive Similarity for Large Scale Object Retrieval Query Adaptive Similarity for Large Scale Object Retrieval Danfeng Qin Christian Wengert Luc van Gool ETH Zürich, Switzerland {qind,wengert,vangool}@vision.ee.ethz.ch Abstract Many recent object retrieval

More information

An efficient face recognition algorithm based on multi-kernel regularization learning

An efficient face recognition algorithm based on multi-kernel regularization learning Acta Technica 61, No. 4A/2016, 75 84 c 2017 Institute of Thermomechanics CAS, v.v.i. An efficient face recognition algorithm based on multi-kernel regularization learning Bi Rongrong 1 Abstract. A novel

More information

Query-Specific Visual Semantic Spaces for Web Image Re-ranking

Query-Specific Visual Semantic Spaces for Web Image Re-ranking Query-Specific Visual Semantic Spaces for Web Image Re-ranking Xiaogang Wang 1 1 Department of Electronic Engineering The Chinese University of Hong Kong xgwang@ee.cuhk.edu.hk Ke Liu 2 2 Department of

More information

Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis

Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis Sensors & Transducers 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis 1 Xulin LONG, 1,* Qiang CHEN, 2 Xiaoya

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

A Novel Method for Image Retrieving System With The Technique of ROI & SIFT

A Novel Method for Image Retrieving System With The Technique of ROI & SIFT A Novel Method for Image Retrieving System With The Technique of ROI & SIFT Mrs. Dipti U.Chavan¹, Prof. P.B.Ghewari² P.G. Student, Department of Electronics & Tele. Comm. Engineering, Ashokrao Mane Group

More information

Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features

Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features 2011 International Conference on Document Analysis and Recognition Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features Masakazu Iwamura, Takuya Kobayashi, and Koichi

More information

State-of-the-Art: Transformation Invariant Descriptors. Asha S, Sreeraj M

State-of-the-Art: Transformation Invariant Descriptors. Asha S, Sreeraj M International Journal of Scientific & Engineering Research, Volume 4, Issue ş, 2013 1994 State-of-the-Art: Transformation Invariant Descriptors Asha S, Sreeraj M Abstract As the popularity of digital videos

More information

Evaluation of GIST descriptors for web-scale image search

Evaluation of GIST descriptors for web-scale image search Evaluation of GIST descriptors for web-scale image search Matthijs Douze, Hervé Jégou, Sandhawalia Harsimrat, Laurent Amsaleg, Cordelia Schmid To cite this version: Matthijs Douze, Hervé Jégou, Sandhawalia

More information

A Novel Algorithm for Color Image matching using Wavelet-SIFT

A Novel Algorithm for Color Image matching using Wavelet-SIFT International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 A Novel Algorithm for Color Image matching using Wavelet-SIFT Mupuri Prasanth Babu *, P. Ravi Shankar **

More information

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important

More information

Indexing local features and instance recognition May 14 th, 2015

Indexing local features and instance recognition May 14 th, 2015 Indexing local features and instance recognition May 14 th, 2015 Yong Jae Lee UC Davis Announcements PS2 due Saturday 11:59 am 2 We can approximate the Laplacian with a difference of Gaussians; more efficient

More information

arxiv: v1 [cs.cv] 26 Dec 2013

arxiv: v1 [cs.cv] 26 Dec 2013 Finding More Relevance: Propagating Similarity on Markov Random Field for Image Retrieval Peng Lu a, Xujun Peng b, Xinshan Zhu c, Xiaojie Wang a arxiv:1312.7085v1 [cs.cv] 26 Dec 2013 a Beijing University

More information

Computer Vision for HCI. Topics of This Lecture

Computer Vision for HCI. Topics of This Lecture Computer Vision for HCI Interest Points Topics of This Lecture Local Invariant Features Motivation Requirements, Invariances Keypoint Localization Features from Accelerated Segment Test (FAST) Harris Shi-Tomasi

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Lecture 14: Indexing with local features. Thursday, Nov 1 Prof. Kristen Grauman. Outline

Lecture 14: Indexing with local features. Thursday, Nov 1 Prof. Kristen Grauman. Outline Lecture 14: Indexing with local features Thursday, Nov 1 Prof. Kristen Grauman Outline Last time: local invariant features, scale invariant detection Applications, including stereo Indexing with invariant

More information