Partial-Duplicate Image Retrieval via Saliency-guided Visually Matching


Liang Li, Shuqiang Jiang, Zheng-Jun Zha, Zhipeng Wu, Qingming Huang

Key Lab of Intell. Info. Process., Institute of Computing Technology, Chinese Academy of Sciences, China; Dept. of Information and Communication Eng., The University of Tokyo, Japan; School of Computing, National University of Singapore, Singapore; Graduate University of Chinese Academy of Sciences, China

Abstract: This paper proposes a novel partial-duplicate image retrieval scheme based on saliency-guided visual matching, in which the duplicate is localized simultaneously. An image is abstracted by Visually Salient and Rich Regions (VSRR), regions that exhibit both visual saliency and rich visual content. Further, a relative saliency ordering constraint, which captures the robust relative saliency layout of a VSRR, is constructed to refine the retrieval. Collaborating with this constraint, we propose an efficient algorithm that embeds it into the index system to speed up retrieval. Comparison experiments with state-of-the-art methods on five databases show the efficiency and effectiveness of our approach.

Index Terms: Image representation, saliency analysis, relative constraint, partial-duplicate image retrieval

Fig. 1. Example images from a public partial-duplicate image dataset [1].

I. INTRODUCTION

WITH the rapid development of internet and multimedia technology, plenty of partial-duplicate images are generated for picture sharing, information delivery, photo ornamenting, etc. Fig. 1 shows some typical examples of partial-duplicate images. Unlike traditional image retrieval, the duplicate regions among partial-duplicate images are only parts of the whole images, and they undergo various transformations of scale, viewpoint, illumination and resolution. Although this makes the retrieval task more complicated and challenging, partial-duplicate image retrieval is demanded by various real applications (e.g., fake image detection, copy protection, and landmark search), and has attracted increasing research attention recently.

The problem of partial-duplicate image retrieval is close to that of object-based image retrieval. Traditional object-based image retrieval methods usually use the whole image as the query and imitate text-retrieval systems by using the bag-of-visual-words (BOV) model [2]. Typically, no spatial information about visual words is used in the retrieval stage, so the model is not robust to background noise. Researchers have therefore attacked this problem from different directions: 1) aggregating local descriptors into more discriminative descriptions [3], [4]; 2) constructing weak geometric consistency constraints on the global image representation with local features [1], [5]-[7]; 3) retrieving with a user-specified region of interest through user interaction (e.g., Blobworld, SIMPLIcity).

The above technologies have proved their efficiency in traditional image retrieval, but they have unavoidable limitations: 1) the aggregating approaches are limited for partial-duplicate image retrieval, since the duplicate region is usually small and only a few aggregated features may be extracted from it; 2) consistency constraints alone cannot effectively handle partial-duplicate image retrieval, because the images usually contain plenty of background noise; 3) it is impossible to perform such interactive operations on the millions of images in a dataset.
Due to these limitations, current technologies cannot satisfactorily solve partial-duplicate image retrieval, but they provide some useful insights: 1) removing the background noise from the query image and from the image dataset will facilitate retrieval, and it is preferable that this noise elimination be automatic; 2) an appropriate constraint can improve retrieval performance, but it should not reduce retrieval efficiency. Besides these insights, two observations are worth noticing: 1) the purpose of sharing images on the web is to show the objects/regions of interest in them; similarly, when retrieving partial duplicates, one expects the major regions of the returned results to concern one's objects/regions of interest, and in retrieval the similar regions the user cares about are rarely in the nonsalient regions; 2) from the viewpoint of user experience, the similar region in the returned images that interests the user should ideally also have interested the original image sharers.

To solve partial-duplicate image retrieval, we first introduce visual attention analysis [8], [9] to filter the nonsalient regions out of the image, which eliminates some of the background noise. Computationally removing the nonsalient regions is a useful way to preferentially allocate computational resources in subsequent image analysis. Another characteristic of partial-duplicate regions is their rich visual content. Previous technologies cannot guarantee that the generated saliency regions have rich visual content, so we introduce a visual content analysis algorithm to re-filter the saliency regions.

In this paper, we propose a novel partial-duplicate image retrieval scheme based on saliency-guided visual matching, where the localization of the duplicate is obtained simultaneously. Fig. 2 illustrates the flowchart of our scheme. We abstract Visually Salient and Rich Regions (VSRR) as retrieval units, regions that possess both visual saliency and rich visual content. We represent the VSRR based on the BOV model and take advantage of group sparse coding to encode the visual descriptors, which achieves a lower reconstruction error and yields a sparse representation at the region level. Furthermore, a robust relative constraint based on saliency analysis is introduced to refine the retrieval, capturing the relative saliency layout among the interest points of the VSRRs. To accelerate the retrieval process, we propose an efficient algorithm that embeds this constraint into the index system, saving both computation time and storage space. Finally, experiments on five image databases for partial-duplicate image retrieval show the efficiency and effectiveness of our approach.

The contributions of our work are summarized as follows: A novel partial-duplicate image retrieval scheme is proposed based on saliency-guided visual matching, where the VSRR is used as the retrieval unit and the localization of the duplicate is done simultaneously. The VSRR is represented with a sparse code of visual words via a group sparse coding algorithm, which achieves a more exact reconstruction and a sparse representation over the whole VSRR. A relative saliency ordering constraint is proposed to raise the retrieval precision, capturing the relative saliency layout among the key points of the VSRR; to the best of our knowledge, this is the first work to exploit such a constraint. A fast algorithm is proposed to transform this constraint into the index system, which dramatically reduces both computation time and storage space.

The rest of this paper is organized as follows: Section II introduces the generation of VSRRs. Section III describes the feature representation of the VSRR. Section IV elaborates the relative saliency ordering constraint. Section V presents the index structure and retrieval scheme. Section VI discusses the experimental results. Finally, Section VII concludes this paper.
Fig. 2. The flowchart of our proposed scheme for partial-duplicate image retrieval.

II. VISUALLY SALIENT AND RICH REGION GENERATION

In this paper we detect Visually Salient and Rich Regions (VSRR) as the retrieval units for partial-duplicate image retrieval. We define a VSRR as an image region that has both rich visual content and visual saliency. The generation procedure of a VSRR includes four steps: a) perceptive unit construction; b) saliency map generation; c) original VSRR generation; d) ultimate VSRR selection. Finally, the image is decomposed into a set of VSRRs.

A. Perceptive Unit Construction

A perceptive unit is defined as an image patch which corresponds to the center of the receptive field. Here we utilize a reliable graph-based segmentation algorithm [10], which incrementally merges smaller patches with similar appearances and small minimum spanning tree weights.

B. Saliency Map Generation

Image regions that have strong contrast with their surroundings usually attract much human attention. Besides contrast, spatial relationships also play an important role in visual attention: for a given region, high contrast to nearby regions usually contributes more to its saliency than high contrast to far-away regions. Based on the perceptive units, we compute the saliency map by incorporating spatial relationships with region contrast [8]. This is a bottom-up saliency detection strategy and it can separate objects from their surroundings. Specifically, a color histogram is first built for each perceptive unit in the L*a*b* color space. Then, for a perceptive unit r_k, its saliency value is calculated by measuring its color contrast to all other perceptive units in the image; a spatial weight term increases the contribution of closer regions and decreases that of farther regions (see the sketch below; the precise procedure is given in Eqs. (1) and (2)).
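To make steps a) and b) concrete, the following sketch chains a graph-based segmentation into a region-contrast saliency pass. It is a simplified illustration rather than the paper's exact procedure: scikit-image's felzenszwalb stands in for the segmentation of [10], each unit's color histogram is collapsed to its mean L*a*b* color (a one-bin special case of Eq. (2) below), and sigma_s is an arbitrary setting.

```python
# Simplified sketch of Sections II-A/II-B: perceptive units via graph-based
# segmentation, then region-contrast saliency with spatial weighting.
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import felzenszwalb

def region_saliency(image_rgb, sigma_s=0.4):
    """Per-region saliency in [0, 1]; cf. Eq. (1) below."""
    lab = rgb2lab(image_rgb)
    labels = felzenszwalb(image_rgb, scale=100, sigma=0.5, min_size=50)
    n = labels.max() + 1
    h, w = labels.shape
    ys, xs = np.indices((h, w)) / max(h, w)       # normalized pixel coordinates

    mean_lab = np.zeros((n, 3))                   # stand-in for color histograms
    centroid = np.zeros((n, 2))
    weight = np.zeros(n)                          # w(r_i): pixel count of region i
    for r in range(n):
        m = labels == r
        mean_lab[r] = lab[m].mean(axis=0)
        centroid[r] = ys[m].mean(), xs[m].mean()
        weight[r] = m.sum()

    saliency = np.zeros(n)
    for k in range(n):
        d_s = np.linalg.norm(centroid - centroid[k], axis=1)  # D_s: spatial distance
        d_r = np.linalg.norm(mean_lab - mean_lab[k], axis=1)  # D_r: color contrast
        contrib = np.exp(-d_s / sigma_s ** 2) * weight * d_r  # spatially weighted
        contrib[k] = 0.0                                      # exclude r_k itself
        saliency[k] = contrib.sum()
    return labels, saliency / (saliency.max() + 1e-12)
```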

Formally, this procedure is defined as:

S(r_k) = \sum_{r_i \neq r_k} \exp(-D_s(r_k, r_i)/\sigma_s^2) \, w(r_i) \, D_r(r_k, r_i)    (1)

where exp(-D_s(r_k, r_i)/σ_s²) is the spatial weight and D_s(r_k, r_i) is the Euclidean distance between the centroids of perceptive units r_k and r_i. σ_s controls the scale of the spatial weight; a smaller σ_s enlarges the effect of spatial weighting. w(r_i) is the number of pixels in region r_i. D_r(r_k, r_i) is the color distance between r_k and r_i, defined as:

D_r(r_k, r_i) = \sum_{m=1}^{N_k} \sum_{n=1}^{N_i} f(c_{k,m}) f(c_{i,n}) D(c_{k,m}, c_{i,n})    (2)

where D(c_{k,m}, c_{i,n}) is the distance between colors c_{k,m} and c_{i,n}, and f(c_{k,m}) is the probability of the m-th color c_{k,m} among all N_k colors in region r_k.

C. VSRR Generation

After generating the saliency map, we compute the saliency regions by saliency segmentation, and then obtain the original VSRRs by filtering out the regions with inferior saliency. Finally, we select the ultimate VSRRs, which contain abundant visual content. We divide the saliency map into background and initial saliency regions by binarizing it with a fixed threshold. Then we iteratively apply GrabCut [11] to refine the segmentation result: the initial saliency regions automatically initialize GrabCut, which is then iterated until the regions converge. The resulting set of regions is regarded as the original VSRRs. Further, to obtain the ultimate VSRRs with abundant visual content, we re-filter the VSRRs by a visual content analysis algorithm. Firstly, the visual content of a VSRR is represented with the BOV representation of SIFT [12] descriptors. The amount of visual content in the VSRR is measured as:

Score = \sum_{i=1}^{K} \frac{n_i}{N_i}    (3)

where K is the size of the dictionary, and n_i and N_i are the numbers of occurrences of visual word i in this VSRR and in the whole database, respectively. n_i reflects the repeated structure in the VSRR, while 1/N_i captures the informativeness of visual words: words that appear in many different VSRRs are less informative than those that appear rarely. We filter out the VSRRs with Score < ε, where ε is set to 0.01. (This dictionary, obtained by hierarchical k-means clustering, is also used in Section IV. It is different from the dictionary in Section III, which is learnt by the group sparse coding algorithm; the two dictionaries serve different purposes.)

III. FEATURE REPRESENTATION OF VSRR

Having obtained the VSRRs, we turn to designing a discriminative representation for them. The popular image representation in image retrieval is nearest-neighbor vector quantization (VQ) based on the BOV model [2]. However, the discriminative power of the traditional BOV model is limited by quantization error. To address this problem, we improve on BOV by using group sparse coding (GSC) [13]. Let X = [x_1, ..., x_M]^T ∈ R^{M×P} be a set of local descriptors:

\min_{A,D} \; \frac{1}{2} \sum_{m=1}^{M} \left\| x_m - \sum_{j=1}^{|D|} a_j^m d_j \right\|^2 + \gamma \sum_{j=1}^{|D|} \| a_j \|_2, \quad \text{subject to } a_j^m \ge 0, \; \forall j, m    (4)

where D = [d_1, ..., d_K]^T is the dictionary with K visual words and ||·||_2 denotes the l_2-norm of a vector. The reconstruction matrix A = {a_j}_{j=1}^{|D|} consists of nonnegative vectors a_j = (a_j^1, ..., a_j^M), which specify the contribution of d_j to each descriptor. M is the total number of visual descriptors in the VSRR. The first term in Eq. (4) weighs the reconstruction error and the second term weighs the degree of sparsity.
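To make Eq. (4) concrete, here is a minimal sketch of encoding the descriptors of one VSRR against a fixed dictionary with projected proximal-gradient steps. It is an illustrative solver, not the optimizer of [13]: dictionary learning is omitted, the final pooling into a region-level code is one plausible choice, and all names and parameter values are our assumptions.

```python
# Sketch of group sparse coding (Eq. (4)) for a fixed dictionary:
# gradient step on the reconstruction term, row-wise group shrinkage
# for gamma * sum_j ||a_j||_2, and clipping for a_j^m >= 0.
import numpy as np

def group_sparse_code(X, D, gamma=0.1, n_iter=200):
    """X: (M, P) local descriptors of one VSRR; D: (K, P) dictionary."""
    M, K = X.shape[0], D.shape[0]
    A = np.zeros((K, M))                      # a_j^m = A[j, m]
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of gradient
    for _ in range(n_iter):
        R = A.T @ D - X                       # residual, shape (M, P)
        A -= step * (D @ R.T)                 # gradient step on 0.5 * ||R||^2
        A = np.maximum(A, 0.0)                # nonnegativity projection
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * gamma / (norms + 1e-12), 0.0)
        A *= shrink                           # group soft-threshold each row a_j
    return A.sum(axis=1)                      # one pooling choice: mass per word
```

Because the group penalty zeroes entire rows of A at once, all descriptors in the region share a small common set of active visual words, which is exactly the region-level sparsity the inverted-file index of Section V exploits.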
The larger the parameter γ is, the sparser the reconstruction coefficients are. The advantages of this approach over the traditional BOV include: 1) GSC achieves a much lower reconstruction error due to its less restrictive constraint; 2) the GSC representation economizes storage space when encoding image descriptors; 3) the sparsity of GSC lets the representation specialize and capture salient properties of images; 4) GSC yields a compact representation at the region level, which facilitates fast indexing and accurate retrieval.

IV. RELATIVE SALIENCY ORDERING CONSTRAINT

Ignoring geometric relationships limits the discriminative power of the BOV model. To address this problem, we propose a novel relative saliency ordering constraint, which acts as a saliency-relative layout constraint among the interest points in a VSRR. We argue that the relative order of saliency at the interest points is well preserved across VSRRs, since duplicate VSRRs share a common visual pattern and their saliency distributions are similar. The first step of constraint verification is to find the matching pairs between VSRRs. Here we employ visual words with a large dictionary for efficient matching; a large dictionary decreases the matching errors caused by SIFT quantization. Suppose one query VSRR q and one candidate VSRR c have n matching visual words: VSRR(q) = {v_{q1}, ..., v_{qn}} and VSRR(c) = {v_{c1}, ..., v_{cn}}, where v_{qi} and v_{ci} are the i-th matching visual words. S(q) = {α_{q1}, ..., α_{qn}} and S(c) = {α_{c1}, ..., α_{cn}} denote the saliency values at the corresponding visual words in q and c, respectively. We construct a Saliency Relative Matrix (SRM) for each VSRR:

SRM = \begin{pmatrix} - & r_{12} & \cdots & r_{1n} \\ r_{21} & - & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & - \end{pmatrix}, \qquad r_{ij} = \begin{cases} 0, & \alpha_i > \alpha_j \\ 1, & \text{otherwise} \end{cases}    (5)

Fig. 3. (a) A VSRR with visual words (colored points); (b) the saliency map corresponding to (a); (c) the saliency values at the positions of the visual words; (d) the saliency ordinal vector of the visual words in (a); (e) the Saliency Relative Matrix (SRM) of the visual words in (a); (f) the XOR result of two SRMs as in (e).

The BOOL element r_ij of the SRM is defined by comparing the saliency values α_i and α_j of visual words v_i and v_j in one VSRR; each visual word in the VSRR is compared with every other visual word. The SRM is an anti-symmetric BOOL matrix which preserves the relative saliency order among visual words. Fig. 3-(e) illustrates the SRMs of VSRR q and VSRR c. We measure the inconsistency between the query SRM and the candidate SRM by the Hamming distance:

Dis = \| SRM_q \oplus SRM_c \|_0    (6)

where ||·||_0 (the l_0-norm) counts the non-zero elements and ⊕ denotes element-wise XOR. Intuitively, the SRM is insensitive to small inconsistencies in saliency order, and it avoids the problems of directly comparing saliency ordinal vectors. To extend this constraint to large-scale image retrieval, we transform it into a simple addition operation over an offline inverted-file index. This fast algorithm provides the same constraint function but does not need to store or compute the SRMs of the VSRRs in the image dataset, as detailed in Section V-B.

V. INDEX AND RETRIEVAL

In large-scale image retrieval systems, efficient indexing is critical for fast retrieval. In this section, we first introduce the bilayer inverted-file index structure of our system and then detail the retrieval scheme based on it.

A. Index Structure

The retrieval procedure can be divided into two steps: the first searches the candidate VSRRs in the dataset; the second refines the result via the relative saliency ordering constraint. For efficiency, we use a collaborative index structure with a bilayer inverted file. The 1st inverted file stores the VSRR information to support the first step. The 2nd inverted file stores the saliency order of the visual words in each VSRR for the subsequent constraint verification. The output of the first step is the ID of the candidate VSRR and image, which directly indexes into the 2nd inverted file.

The 1st inverted-file index structure. A VSRR k in an image is represented as a vector VSRR_k with one component per visual word in the dictionary D. For each visual word, this structure stores the list of VSRRs in which the visual word occurs, together with its term weight. Fig. 4-(a) illustrates the structure of the 1st inverted file. VW Freq. is the sum of the weights of visual word i over all VSRRs; visual word weight is the group-sparse-coding coefficient of visual word i in one VSRR. Thanks to the GSC algorithm, this vector representation is very sparse; the 1st inverted file exploits this sparseness to index images and enables fast search for candidate VSRRs.

The 2nd inverted-file index structure. Fig. 4-(b) shows its structure. The 2nd inverted file stores the property information of each VSRR: VW count is the number of visual words in the VSRR; VSRR area is the number of pixels in the VSRR; VW_i is the ID of visual word i. These visual words are stored in ascending order of saliency value. Note that the visual word dictionary used in this structure is different from the dictionary D in the 1st inverted file: D is learnt by GSC, whereas this dictionary is obtained by hierarchical k-means clustering. We use a very large dictionary here to quickly find the matching point pairs between the query VSRR and a candidate VSRR.

B. A Fast Algorithm for the Relative Saliency Ordering Constraint

The critical step of the relative saliency ordering constraint is constructing the Saliency Relative Matrices (SRM) of the query VSRR and the candidate VSRR. In fact, through a simple transformation, we only need to fully compute the SRM of the query VSRR once, and then extract different rows and columns to build new SRMs for different candidate VSRRs. Recalling the structure of the SRM (Eq. (5)), row i encodes the relative saliency relationship between visual word i and the other visual words.

Fig. 4. (a) The 1st inverted-file structure. (b) The 2nd inverted-file structure.

If we reorder the rows and columns of an SRM in ascending order of saliency value, then by the definition in Eq. (5) the SRM is transformed into:

SRM^+ = \begin{pmatrix} - & 1 & \cdots & 1 \\ 0 & - & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & - \end{pmatrix}    (7)

The SRM^+ is essentially equivalent to the primal SRM. Moreover, for any SRM, once its rows and columns are sorted in ascending order of saliency, the result is unique: its upper triangle is all ones and it is an anti-symmetric BOOL matrix. Thus, the SRM^+ does not need to be stored at all. Exploiting this observation, we design the 2nd inverted-file index as in Fig. 4-(b): only the IDs of the visual words of each VSRR are stored, in ascending order of saliency value. This saves a huge amount of memory; the corresponding inverted file occupies 1.5 GB in our experiment on the one-million-image dataset. For each candidate VSRR, we first find the matching pairs of visual words with the query VSRR, taking the visual words from the candidate VSRR sequentially as input to search for their matches in the query VSRR. Suppose there are N matching visual words; the inconsistency between the VSRRs is then 2M/N², where M is the number of zero elements in the upper triangular part of the reordered query SRM defined in Eq. (5). With respect to time complexity, this approach is fast and satisfying: 1) we do not need to compute the SRM of the candidate; 2) the inconsistency is calculated directly by addition, avoiding the XOR operation. In sum, this algorithm for the relative saliency ordering constraint saves both computation time and storage space, and is thus suitable for a large-scale index system.

C. Retrieval Scheme

Given a query VSRR q, the first task is to find its candidate VSRRs. The procedure can be interpreted as a voting scheme: 1) the scores of all VSRRs in the database are initialized to 0; 2) for each visual word j in the query, we retrieve the list of VSRRs containing this visual word through the 1st inverted file, and for each VSRR i in the list we increase its score by the weight of this visual word: score(i) += (weight of visual word j) / (VW_j Freq.). After processing all visual words in the query, the final score of VSRR i is the dot product between the vectors of VSRR i and the query q. 3) We normalize the scores to obtain cosine similarities for ranking. After finding the candidate VSRRs, we compute the inconsistency of the SRMs between the query VSRR q and each candidate VSRR c, as detailed in Section V-B. We define a total matching score M(q; c):

M(q; c) = M_v(q; c) + \lambda M_r(q; c)    (8)

where M_v(q; c) is the visual similarity, i.e., the cosine similarity between q and c; M_r(q; c) is the consistency of the relative saliency ordering constraint, equal to 1 minus the inconsistency of SRM(q; c); and λ is a weight parameter. Having obtained the similarity score between two VSRRs, we define the similarity between the query image I_q and a candidate image I_m as:

Sim(I_q, I_m) = \sum_{q_i \in I_q} \frac{2\sqrt{R_{area}(q_i)}}{1 + R_{area}(q_i)} \, M(q_i; m_i)    (9)

where q_i is the i-th VSRR in the query image I_q, m_i is its matching VSRR in I_m, and R_area(q_i) is the area ratio of the VSRR q_i relative to the query image. The fractional term is the weight of the similarity of the i-th matching VSRR pair: from the viewpoint of user experience, users are more satisfied with returned results whose similar region covers a larger portion of the original query image. Guided by this observation, we re-rank the returned images by the image similarity Sim.
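To make the shortcut of Section V-B concrete, the sketch below computes the SRM inconsistency both ways, via explicit matrices (Eqs. (5)-(6)) and via the stored ascending saliency order, and checks that they agree; the final line combines the result with a visual similarity as in Eq. (8). The saliency values, M_v and λ are made-up numbers, distinct saliency values (no ties) are assumed, and the N² normalization follows the reconstruction in the text above.

```python
import numpy as np

def srm(alpha):
    """Saliency Relative Matrix, Eq. (5): r_ij = 0 if alpha_i > alpha_j else 1."""
    a = np.asarray(alpha)
    return (a[:, None] <= a[None, :]).astype(np.uint8)

def inconsistency_explicit(alpha_q, alpha_c):
    """Eq. (6): Hamming distance between the two SRMs, normalized by N^2."""
    x = srm(alpha_q) ^ srm(alpha_c)                # element-wise XOR
    return x.sum() / float(len(alpha_q) ** 2)

def inconsistency_fast(alpha_q, alpha_c):
    """Section V-B: reorder by the candidate's ascending saliency, so the
    candidate's SRM becomes the canonical SRM+ (upper triangle all ones);
    only zeros in the reordered query SRM's upper triangle can mismatch,
    and anti-symmetry mirrors each mismatch once below the diagonal."""
    order = np.argsort(alpha_c)                    # what the 2nd inverted file stores
    q = srm(np.asarray(alpha_q)[order])
    m = np.triu(q == 0, k=1).sum()                 # M: zeros above the diagonal
    return 2.0 * m / float(len(alpha_q) ** 2)

alpha_q, alpha_c = [0.9, 0.2, 0.5, 0.7], [0.8, 0.1, 0.6, 0.3]   # toy saliencies
assert np.isclose(inconsistency_explicit(alpha_q, alpha_c),
                  inconsistency_fast(alpha_q, alpha_c))
M_v, lam = 0.8, 0.4                                # hypothetical values
total = M_v + lam * (1.0 - inconsistency_fast(alpha_q, alpha_c))  # Eq. (8)
```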
VI. EXPERIMENTS

A. Datasets and Evaluation Metric

To evaluate our system, we compare our method with state-of-the-art approaches on five image datasets. (1) PDID: a public internet partial-duplicate image dataset [1], which consists of 10 image collections and 30K distractors; each collection has 200 images. (2) PDIDM: a large-scale partial-duplicate image dataset, consisting of the PDID dataset and one million distractor images from Flickr. (3) UKbench: this dataset consists of 2550 groups of 4 images each. (4) Mobile: this dataset [14] includes images of 300 objects; it indexes 300 images of the digital copies downloaded from the internet, blended with the 10200 images from UKbench as distractors. (5) Caltech256(50): a subset of Caltech256, consisting of 50 classical object categories with 4988 images. In evaluation, for the PDID, PDIDM and Caltech256 datasets we use mean average precision (mAP) as the metric: for each query image we compute its Average Precision (AP), the area under the precision-recall curve, and then take the mean over all query images. For the UKbench dataset, the performance measure is the average number of relevant images among the query's top-4 retrieved images. For the Mobile dataset, the top-10 hit rate is used, as in [14].
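For reference, the AP/mAP computation described above can be written in a few lines; this is the standard formulation (AP approximated from the precision at each hit, averaged over all queries), and the relevance lists in the example are hypothetical placeholders.

```python
# Standard mAP computation for the evaluation protocol of Section VI-A.
import numpy as np

def average_precision(ranked_relevance, n_relevant):
    """ranked_relevance: 1/0 relevance flags of the returned list, best first."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank          # precision at each recall step
    return ap / n_relevant if n_relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_relevance, n_relevant) pairs, one per query."""
    return float(np.mean([average_precision(r, n) for r, n in runs]))

# e.g., two queries over a collection with 3 relevant images each:
print(mean_average_precision([([1, 0, 1, 1], 3), ([0, 1, 1, 0], 3)]))
```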

Fig. 5. Comparison of five approaches (BOV, Soft BOV, bundled feature, bundled+, VLAD, and ours) in mAP for partial-duplicate image retrieval over the ten PDID collections (American Flag, Beijing Olympic, Disney, Google, iPhone, KFC, Monalisa, Rocket, Starbucks, Exit Sign).

B. Performance Evaluation on Technical Improvements and Parameter Setting

1) Comparison with different visual descriptor encoding methods: The representation of the VSRR is a crucial factor for retrieval performance, and we represent the VSRR with a sparse code of visual words. In this subsection, we evaluate its performance on the PDID dataset. Baseline: we use the traditional BOV approach as the baseline, with a dictionary of 5,000 visual words clustered by hierarchical k-means. Comparison: four popular partial-duplicate image retrieval schemes [1], [3], [6], [7] are compared with our approach. (1) We enhance the baseline with soft assignment [7], where the number of nearest neighbors is set to 3 and σ² = 6,250 (referred to as Soft BOV). (2) In [6], SIFT and MSER features are extracted from the images and bundled into groups (bundled feature for short). (3) In [1], the bundled feature is improved by adding an affine-invariant geometric constraint (bundled+ for short). All of these approaches share a common SIFT vocabulary of 5,000 visual words; for [6], the weight parameter of geometric consistency is set to 2, and for [1], the weight parameter of affine consistency is set to 1. (4) VLAD (vector of locally aggregated descriptors) [3], a state-of-the-art descriptor derived from both BOV and the Fisher kernel; its number of cluster centroids k is set to 64, giving a final dimension of 8192. Experimental setting of our approach: each VSRR is encoded with GSC, with a dictionary of 978 visual words learnt by the algorithm; in the relative saliency ordering constraint, we use the 5,000-word hierarchical k-means dictionary; the parameter λ is set to 0.4.

Fig. 5 illustrates the experimental results, leading to four observations. Firstly, our approach clearly improves the mAP compared with the other state-of-the-art methods: the mAP of our approach is 68.82%, while those of BOV, Soft BOV, bundled feature, bundled+ and VLAD are 32.6%, 38.3%, 59.9%, 53.6% and 47.6%, respectively. Secondly, the traditional BOV approach loses its discriminative power on the partial-duplicate image retrieval task, because partial-duplicate images share only a small similar portion. Thirdly, our result on Monalisa is not satisfactory: the faces of many Monalisa images in the ground-truth dataset are detected as VSRRs, and although they are indeed VSRRs, manual Photoshop editing makes these faces very different from one another, which reduces the mAP. Fourthly, as Fig. 5 shows, the performance of bundled feature, bundled+ and our approach is better than that of BOV, Soft BOV and VLAD.

2) Impact of λ: The parameter λ in Eq. (8) determines the weight of the consistency term of the relative saliency ordering constraint. We test the performance with different λ values on the PDID dataset. Experiments show that [0.4, 0.8] is the most effective range for λ; in particular, the mAP reaches its highest value (0.703) at λ = 0.4. The relative saliency ordering constraint plays an important role in improving the retrieval precision. However, relying too much on it will reduce the retrieval recall.

C. Partial-Duplicate Image Retrieval on Different Image Datasets

In this subsection, we validate the efficiency of our approach on the five image datasets: PDID, PDIDM, UKbench, Mobile and Caltech256. Baseline: the traditional BOV approach [2] is used as the baseline, with a dictionary of 0.5M visual words obtained by hierarchical k-means clustering on 50K images. Comparisons: (1) Soft BOV [7], where the number of nearest neighbors is set to 3 and σ² = 6,250; this method uses the same dictionary as the baseline. (2) bundled feature [6], where the weight parameter of geometric consistency is set to 2. (3) bundled+ [1], where the weight parameter of affine consistency is set to 1; this method shares the same dictionary as bundled feature, clustered by hierarchical k-means.

TABLE I. Comparison with the state-of-the-art methods (BOV [2], Soft BOV [7], bundled feature [6], bundled+ [1], VLAD [3], SCSM [15], k-rNN [16], CW [14], and ours) on the five datasets: PDID, PDIDM, UKbench, Mobile, Caltech256(50).

(4) VLAD [3], with k = 64 cluster centroids. (5) SCSM [15], a spatially constrained similarity measure for object retrieval, where the grid size of the voting map is 16×16 and σ² = 2.5 in the Gaussian weight exp(−d/σ²). (6) k-rNN [16], an object retrieval scheme based on an analysis of the k-reciprocal nearest neighbor structure in the image space, with the cutoff parameter set as in [16]. (7) CW [14], a vocabulary-tree-based approach that introduces contextual weighting of local features in both the descriptor and spatial domains, with a vocabulary tree of branch factor K = 8 and depth L = 6. (8) For our approach, we use a dictionary with 10236 visual words for the sparse coding of VSRRs; in the relative saliency ordering constraint, we use the same dictionary as the baseline, and the parameter λ in Eq. (8) is set to 0.4.

Table I shows the comparison with the state-of-the-art methods on the five datasets, most of which use additional techniques such as post-verification and soft assignment. Our approach performs best on PDID, PDIDM and Caltech256, and it is significantly better than the previously best-known work on partial-duplicate image retrieval, bundled feature [6], on four datasets (PDID, UKbench, Mobile and Caltech256; e.g., from 3.5 to 3.58 on UKbench, and up to 0.85 on Mobile). The performance of our method is lower than that of k-rNN on the UKbench dataset, but higher on Caltech256. The reason may be that the variation among similar images in UKbench is slight while that in Caltech256 is very diverse: the k-NN-based method is effective and fast for data with little variation, but its robustness is unsatisfactory under strong variations of scale and viewpoint. On the Mobile dataset, the result of our method is 0.03 below the best-performing method, CW; the query images in this dataset are captured by mobile phones, so their illumination and viewpoint are unfavorable, which adversely affects our scheme.

Fig. 6 details the comparison with four popular image retrieval methods on the PDIDM dataset, leading to three observations. Firstly, our approach significantly improves the mAP: compared with the baseline, it boosts the mAP from 0.26 to 0.523, an improvement of 26.2 percentage points. Secondly, soft assignment of visual words plays an important role in improving the performance (a 10% improvement on average); indeed, sparse coding is itself a form of soft assignment of visual words. Thirdly, the performance of all five methods is unsatisfactory on the Monalisa group: the difference between most Monalisa images in the ground truth lies in Monalisa's face, and the face usually carries rich visual information that strongly influences a global image representation built on local features, so BOV, Soft BOV, VLAD and CW do not work well. On the other hand, most faces in these images are visually salient and rich regions and are detected as VSRRs in our approach, which leads to our method's low mAP on this group.

Fig. 6. Comparison of different methods (BOV, Soft BOV, VLAD, CW, and ours) in mAP on the large-scale PDIDM dataset, per PDID collection.

Besides the obvious improvement in mAP, it is worth pointing out that our approach is also efficient. The average time for retrieving a query image on the large-scale image dataset is 0.695 s (BOV), 0.732 s (Soft BOV), 0.745 s (VLAD), 0.96 s (CW) and 1.20 s (ours); the experiments are executed on an Intel Core i5 2.8 GHz machine with 4 GB RAM. Although our approach is not as fast as the other methods, its computation time is tolerable in practical applications. In short, our approach offers a significant accuracy advantage at a tolerable cost in efficiency.

VII. CONCLUSION

This paper proposes a new perspective for partial-duplicate image retrieval built on the Visually Salient and Rich Region (VSRR), an image region with both visual saliency and rich visual content, as the retrieval unit. A relative saliency ordering constraint is proposed to improve the retrieval performance, and a fast algorithm is further proposed to transform this constraint into the index system for fast retrieval. Comparison experiments on five image datasets validate the effectiveness of our scheme. In the future we plan to study both visual attention analysis and visual content analysis more deeply, and to develop a practical partial-duplicate image retrieval system.

ACKNOWLEDGEMENT

This work was supported in part by the National Basic Research Program of China (973 Program): 2012CB316400, and in part by the National Natural Science Foundation of China: 61025011.

REFERENCES

[1] Z. Wu, Q. Xu, S. Jiang, Q. Huang, P. Cui, and L. Li, "Adding affine invariant geometric constraint for partial-duplicate image retrieval," in IEEE ICPR, 2010.
[2] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in IEEE ICCV, 2003.
[3] H. Jégou, M. Douze, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," in IEEE CVPR, 2010.
[4] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors," in IEEE CVPR, 2010.
[5] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in ACM Multimedia, 2010.
[6] Z. Wu, Q. Ke, M. Isard, and J. Sun, "Bundling features for large scale partial-duplicate web image search," in IEEE CVPR, 2009.
[7] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in IEEE CVPR, 2008.
[8] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in IEEE CVPR, 2011.
[9] H. Wu, Y. Wang, K. Feng, T. Wong, T. Lee, and P. Heng, "Resizing by symmetry-summarization," ACM Trans. Graph., vol. 29, 2010.
[10] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," IJCV, 2004.
[11] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM Trans. Graph., 2004.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.
[13] S. Bengio, F. Pereira, Y. Singer, and D. Strelow, "Group sparse coding," in NIPS, 2009.
[14] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, and T. X. Han, "Contextual weighting for vocabulary tree based image retrieval," in IEEE ICCV, 2011.
[15] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu, "Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking," in IEEE CVPR, 2012.
[16] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool, "Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors," in IEEE CVPR, 2011.


More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

A 2-D Histogram Representation of Images for Pooling

A 2-D Histogram Representation of Images for Pooling A 2-D Histogram Representation of Images for Pooling Xinnan YU and Yu-Jin ZHANG Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China ABSTRACT Designing a suitable image representation

More information

2 Proposed Methodology

2 Proposed Methodology 3rd International Conference on Multimedia Technology(ICMT 2013) Object Detection in Image with Complex Background Dong Li, Yali Li, Fei He, Shengjin Wang 1 State Key Laboratory of Intelligent Technology

More information

Efficient Re-ranking in Vocabulary Tree based Image Retrieval

Efficient Re-ranking in Vocabulary Tree based Image Retrieval Efficient Re-ranking in Vocabulary Tree based Image Retrieval Xiaoyu Wang 2, Ming Yang 1, Kai Yu 1 1 NEC Laboratories America, Inc. 2 Dept. of ECE, Univ. of Missouri Cupertino, CA 95014 Columbia, MO 65211

More information

Similar Fragment Retrieval of Animations by a Bag-of-features Approach

Similar Fragment Retrieval of Animations by a Bag-of-features Approach Similar Fragment Retrieval of Animations by a Bag-of-features Approach Weihan Sun, Koichi Kise Dept. Computer Science and Intelligent Systems Osaka Prefecture University Osaka, Japan sunweihan@m.cs.osakafu-u.ac.jp,

More information

Unsupervised Saliency Estimation based on Robust Hypotheses

Unsupervised Saliency Estimation based on Robust Hypotheses Utah State University DigitalCommons@USU Computer Science Faculty and Staff Publications Computer Science 3-2016 Unsupervised Saliency Estimation based on Robust Hypotheses Fei Xu Utah State University,

More information

Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al.

Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al. Presented by Brandon Smith Computer Vision Fall 2007 Objective Given a query image of an object,

More information

Large Scale 3D Reconstruction by Structure from Motion

Large Scale 3D Reconstruction by Structure from Motion Large Scale 3D Reconstruction by Structure from Motion Devin Guillory Ziang Xie CS 331B 7 October 2013 Overview Rome wasn t built in a day Overview of SfM Building Rome in a Day Building Rome on a Cloudless

More information

Randomized Visual Phrases for Object Search

Randomized Visual Phrases for Object Search Randomized Visual Phrases for Object Search Yuning Jiang Jingjing Meng Junsong Yuan School of EEE, Nanyang Technological University, Singapore {ynjiang, jingjing.meng, jsyuan}@ntu.edu.sg Abstract Accurate

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Adaptive Feature Selection via Boosting-like Sparsity Regularization

Adaptive Feature Selection via Boosting-like Sparsity Regularization Adaptive Feature Selection via Boosting-like Sparsity Regularization Libin Wang, Zhenan Sun, Tieniu Tan Center for Research on Intelligent Perception and Computing NLPR, Beijing, China Email: {lbwang,

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

CRF Based Point Cloud Segmentation Jonathan Nation

CRF Based Point Cloud Segmentation Jonathan Nation CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to

More information

6.819 / 6.869: Advances in Computer Vision

6.819 / 6.869: Advances in Computer Vision 6.819 / 6.869: Advances in Computer Vision Image Retrieval: Retrieval: Information, images, objects, large-scale Website: http://6.869.csail.mit.edu/fa15/ Instructor: Yusuf Aytar Lecture TR 9:30AM 11:00AM

More information

A Rapid Automatic Image Registration Method Based on Improved SIFT

A Rapid Automatic Image Registration Method Based on Improved SIFT Available online at www.sciencedirect.com Procedia Environmental Sciences 11 (2011) 85 91 A Rapid Automatic Image Registration Method Based on Improved SIFT Zhu Hongbo, Xu Xuejun, Wang Jing, Chen Xuesong,

More information

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH Sandhya V. Kawale Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

Learning a Fine Vocabulary

Learning a Fine Vocabulary Learning a Fine Vocabulary Andrej Mikulík, Michal Perdoch, Ondřej Chum, and Jiří Matas CMP, Dept. of Cybernetics, Faculty of EE, Czech Technical University in Prague Abstract. We present a novel similarity

More information

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18. Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:

More information

THIS paper considers the task of accurate image retrieval

THIS paper considers the task of accurate image retrieval JOURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 Coarse2Fine: Two-layer Fusion for Image Retrieval le Dong, Gaipeng Kong, Wenpu Dong, Liang Zheng and Qi Tian arxiv:1607.00719v1 [cs.mm] 4 Jul

More information

Object of interest discovery in video sequences

Object of interest discovery in video sequences Object of interest discovery in video sequences A Design Project Report Presented to Engineering Division of the Graduate School Of Cornell University In Partial Fulfillment of the Requirements for the

More information

Image retrieval based on bag of images

Image retrieval based on bag of images University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong

More information

A Systems View of Large- Scale 3D Reconstruction

A Systems View of Large- Scale 3D Reconstruction Lecture 23: A Systems View of Large- Scale 3D Reconstruction Visual Computing Systems Goals and motivation Construct a detailed 3D model of the world from unstructured photographs (e.g., Flickr, Facebook)

More information

Recognition. Topics that we will try to cover:

Recognition. Topics that we will try to cover: Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) Object classification (we did this one already) Neural Networks Object class detection Hough-voting techniques

More information

Semantic-aware Co-indexing for Image Retrieval

Semantic-aware Co-indexing for Image Retrieval 2013 IEEE International Conference on Computer Vision Semantic-aware Co-indexing for Image Retrieval Shiliang Zhang 2,MingYang 1,XiaoyuWang 1, Yuanqing Lin 1,QiTian 2 1 NEC Laboratories America, Inc. 2

More information

A novel template matching method for human detection

A novel template matching method for human detection University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 A novel template matching method for human detection Duc Thanh Nguyen

More information

Large-scale visual recognition Efficient matching

Large-scale visual recognition Efficient matching Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!

More information

Object Detection Using Segmented Images

Object Detection Using Segmented Images Object Detection Using Segmented Images Naran Bayanbat Stanford University Palo Alto, CA naranb@stanford.edu Jason Chen Stanford University Palo Alto, CA jasonch@stanford.edu Abstract Object detection

More information

EFFECTIVE FISHER VECTOR AGGREGATION FOR 3D OBJECT RETRIEVAL

EFFECTIVE FISHER VECTOR AGGREGATION FOR 3D OBJECT RETRIEVAL EFFECTIVE FISHER VECTOR AGGREGATION FOR 3D OBJECT RETRIEVAL Jean-Baptiste Boin, André Araujo, Lamberto Ballan, Bernd Girod Department of Electrical Engineering, Stanford University, CA Media Integration

More information

THE matching of local visual features plays a critical role

THE matching of local visual features plays a critical role 1748 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 6, JUNE 2015 Randomized Spatial Context for Object Search Yuning Jiang, Jingjing Meng, Member, IEEE, Junsong Yuan, Senior Member, IEEE, and Jiebo

More information

Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization

Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization Appearance-Based Place Recognition Using Whole-Image BRISK for Collaborative MultiRobot Localization Jung H. Oh, Gyuho Eoh, and Beom H. Lee Electrical and Computer Engineering, Seoul National University,

More information