A Kernel Density Based Approach for Large Scale Image Retrieval

A Kerel Desity Based Approach for Large Scale Image Retrieval Wei Tog Departmet of Computer Sciece ad Egieerig Michiga State Uiversity East Lasig, MI, USA togwei@cse.msu.edu Rog Ji Departmet of Computer Sciece ad Egieerig Michiga State Uiversity East Lasig, MI, USA rogji@cse.msu.edu Fegjie Li Departmet of Computer Sciece ad Egieerig Michiga State Uiversity East Lasig, MI, USA lifegji@cse.msu.edu Ail Jai Departmet of Computer Sciece ad Egieerig Michiga State Uiversity East Lasig, MI, USA jai@cse.msu.edu Tiabao Yag Departmet of Computer Sciece ad Egieerig Michiga State Uiversity East Lasig, MI, USA yagtia1@cse.msu.edu ABSTRACT Local image features, such as SIFT descriptors, have bee show to be effective for cotet-based image retrieval CBIR. I order to achieve efficiet image retrieval usig local features, most existig approaches represet a image by a bag-of-words model i which every local feature is quatized ito a visual word. Give the bag-of-words represetatio for images, a text search egie is the used to efficietly fid the matched images for a give query. The mai drawback with these approaches is that the two key steps, i.e., key poit quatizatio ad image matchig, are separated, leadig to sub-optimal performace i image retrieval. I this work, we preset a statistical framework for large-scale image retrieval that uifies key poit quatizatio ad image matchig by itroducig kerel desity fuctio. The key ideas of the proposed framework are a each image is represeted by a kerel desity fuctio from which the observed key poits are sampled, ad b the similarity of a gallery image to a query image is estimated as the likelihood of geeratig the key poits i the query image by the kerel desity fuctio of the gallery image. We preset efficiet algorithms for kerel desity estimatio as well as for effective image matchig. Experimets with large-scale image retrieval cofirm that the proposed method is ot oly more effective but also more efficiet tha the state-of-the-art approaches i idetifyig visually similar images for give queries from large image databases. Categories ad Subject Descriptors H.3.3 [Iformatio Storage ad Retrieval]: Iformatio Search ad Retrieval Retrieval models Permissio to make digital or hard copies of all or part of this work for persoal or classroom use is grated without fee provided that copies are ot made or distributed for profit or commercial advatage ad that copies bear this otice ad the full citatio o the first page. To copy otherwise, to republish, to post o servers or to redistribute to lists, requires prior specific permissio ad/or a fee. ICMR 11, April 17-20, Treto, Italy Copyright c 2011 ACM 978-1-4503-0336-1/11/04...$10.00. Geeral Terms Theory Keywords Cotet-based Image Retrieval, Kerel Desity Estimatio, Keypoits Quatizatio 1. INTRODUCTION Cotet-based image retrieval CBIR is a log stadig challegig problem i computer visio ad multimedia. Recet studies [18, 24, 15, 8, 21, 4] have show that local image features e.g. SIFT descriptor [10], ofte referred to as key poits, are effective for idetifyig images with similar visual cotet. The key idea is to represet each image by a set of iterestig patches extracted from the image. By represetig every image patch with a multidimesioal vector, each image is represeted by a bag of feature vectors, which is referred to as bag-of-features represetatio [1]. Oe of the major challeges faced i image retrieval usig the bag-of-features represetatio is its efficiecy. A aive implemetatio compares a query image to every image i the database, makig it ifeasible for large-scale image retrieval. Motivated by the success i text iformatio retrieval [20], the bag-of-words represetatio has become popular for efficiet large-scale image retrieval. This approach first quatizes image features to a vocabulary of visual words, ad represets each image by the couts of visual words or a histogram. Stadard text retrieval techiques ca the be applied to idetify the images that share similar visual cotet as the query image. The quatizatio is typically achieved by groupig the key poits ito a specified umber of clusters usig a clusterig algorithm. A umber of studies have show promisig performace of the bag-of-words approach for image/object retrieval [18, 24, 15, 8, 22, 21, 4, 25, 3] Despite its success, there are still drawbacks with most of the studies usig the bag-of-words model. For istace, these approaches require clusterig all the key poits ito a large umber of clusters, which is computatioally expesive whe the umber of key poits is very large. Although recet progress o approximate earest eighbor search [9, 2, 8, 23, 14] has made it feasible to group billios of key poits ito millios of clusters, the computatioal

cost of these approaches i key poit quatizatio is still very high, as will be revealed i our empirical study. I this paper, we highlight aother fudametal problem with the bag-of-words model for image retrieval that is usually overlooked by most researchers. I almost all the methods developed for large-scale image retrieval, the step of key poit quatizatio is separated from the step of image matchig that is usually implemeted by a text search egie. I other words, the procedure used to quatize key poits ito visual words is idepedet of the similarity measure used by the text search egie to fid visually similar images. I this paper, we develop a statistical framework that uifies these two steps by the itroductio of kerel desity fuctio. The key idea is to view the bag of features extracted from each image as radom samples from a uderlyig ukow distributio. We estimate, for each image, its uderlyig desity fuctio from the observed bag of features. The similarity of a image I i i the database to a give query image Q is computed by the query likelihood pq I i, i.e., the likelihood of geeratig the observed bag of features i Q give the desity fuctio of I i. Thus, the key poit quatizatio step is essetially related to the estimatio of kerel desity fuctio, ad the image matchig step is essetially related to the estimatio of query likelihood. Hece, the itroductio of kerel desity fuctio allows us to lik the two steps coheretly. We emphasize that although the idea of modelig a bag-of-features by a statistical model has bee studied by may authors e.g., [26, 5, 7, 13, 6, 25], there are two computatioal challeges that make them difficult to scale to image retrieval problems with large databases: How to efficietly compute the desity fuctio for each image? This is particularly importat give the large size of image database ad the large umber of key poits to be processed. How to efficietly idetify the subset of images i the database that are visually similar to a give query? I particular, the retrieval model should explicitly avoid the liear sca of image database, which is a fudametal problem with may existig methods for image similarity measuremets. We address the two challeges by a specially desiged kerel desity fuctio. We preset two efficiet algorithms, oe for kerel desity estimatio ad oe for image search. Besides providig a uified framework for key poit quatizatio ad image matchig, the proposed framework also resolves the two shortcomigs of the bag-of-words model for image retrieval: a by avoidig a explicit clusterig of key poits, the proposed framework is computatioally more efficiet. b by ecodig the observed key poits ito a kerel desity fuctio, the proposed framework allows for partial matchig betwee two similar but differet key poits. We verify both the efficiecy ad efficacy of the proposed framework by a empirical study with three large image databases. Our study shows that the proposed framework reduces the computatioal time for key poit quatizatio by a factor of 8 whe compared to the hierarchical clusterig methods, ad by a factor of 30 whe compared to the flat clusterig methods. For all the experimets, we observe that the proposed framework yields sigificatly higher retrieval accuracy tha the state-of-the-art approaches for image retrieval. The rest of the paper is orgaized as follows: Sectio 2 presets the proposed framework for large-scale image retrieval ad efficiet computatioal algorithms for solvig the related optimizatio problems. Sectio 3 presets our empirical study with large-scale image retrieval. Sectio 4 cocludes this work. 2. KERNEL DENSITY FRAMEWORK FOR IMAGE RETRIEVAL Let G = {I 1,...,I C} be the collectio of C images, ad each image I i be represeted by a set of i key poits {x i 1,...,x i i }, where each key poit x i R d is a d dimesioal vector. Similarly, the query image Q is also represeted by a bag of features, i.e., {q 1,...,q m}, where q i R d. The objective of image retrieval is two folds: 1. efficietly idetify the subset of images R from the gallery G that are likely to share similar visual cotet as that of the query image Q, ad 2. effectively rak the images i R accordig to their visual similarity to query Q. We emphasize the importace of the first goal. By efficietly idetifyig the subset of visually similar images for a give query without goig through every image i the database, we are able to build a retrieval model that scale to databases with millios of images. To facilitate the developmet of a statistical model for image retrieval, we assume that key poits of a image I i are radomly sampled from a ukow distributio px I i. Followig the framework of statistical laguage models for text retrieval [11], we eed to efficietly compute i the desity fuctio px I i for every image I i i gallery G, ad ii the query likelihood pq I i, i.e., the probability of geeratig the key poits i query Q give each image I i. Below we discuss the algorithms for the two problems. 2.1 Kerel Desity Based Framework Give the key poits {x 1,...,x } observed from image I, we eed to efficietly estimate its uderlyig desity fuctio px I. The most straightforward approach is to estimate px I by a simple kerel desity estimatio, i.e., px I = 1 κx, x i 1 where κ, :R d R d R + is the kerel desity fuctio that is ormalized as dzκx, z =1. Give the desity fuctio i 1, the similarity of I to the query image Q is estimated by the logarithm of the query likelihood pq I, i.e., m m 1 log pq I = log pq i I = log κx j, q i Despite its simplicity, the major problem with the desity fuctio i 1 is its high computatioal cost whe applied to image retrieval. This is because usig the desity fuctio i 1, we have to compute the log-likelihood pq I i for every image i G before we ca idetify the subset of images that are visually similar to the query Q, makig it impossible for large scale image retrieval. I order to make efficiet image retrieval, we cosider a alterative approach of estimatig the desity fuctio for image I. We assume that for ay image I i the gallery G, its desity fuctio px I is expressed as a weighted mixture models: px I = N α iκx, c i 2 where c i R d,i =1,...,N is a collectio of N poits ceters that are radomly selected from all the key poits observed i G. The choice of radomly selected ceters, although may seem to be aive at the first glace, is i fact strogly supported by the cosistecy results of kerel desity estimatio [16]. I particular, the kerel desity fuctio costructed by radomly selected ceters is

almost optimal whe the umber of ceters is very large. The umber of ceters N is usually chose to be very large, i order to cover the diverse visual cotet of images. α =α 1,...,α N is a probability distributio used to combie differet kerel fuctios. It is importat to ote that ulike 1, the weights α i 2 are ukow ad eed to be determied for each image. As will be show later, with a appropriate choice of kerel fuctio κ,, the resultig weights α will be sparse with most of the elemets beig zero. This is esured by the fact that i a high dimesioal space, almost ay two radomly selected data poits are far away from each other. It is the sparsity of α that makes it possible to efficietly idetify images that are visually similar to the query without havig to sca the etire image database. 2.2 Efficiet Kerel Desity Estimatio I order to use the desity fuctio i 2, we eed to efficietly estimate the combiatio weights α. By assumig key poits x 1,...,x are radomly sampled from px I, our first attempt is to estimate α by a maximum likelihood estimatio, i.e., α =argmaxli, α = log α jκx i, c j α where ={α [0, 1] C : C αi =1} defies a simplex of probability distributios. It is easy to verify that the problem i 3 is covex ad has a global optimal solutio. Although we ca directly apply the stadard optimizatio approaches to fid the optimal solutio α for 3, it is i geeral computatioally expesive because We have to solve 3 for every image. Eve if the optimizatio algorithm is efficiet ad ca solve the problem withi oe secod, for a database with a millio of images, it will take more tha 277 hours to complete the computatio. The umber of weights α to be determied is very large. To achieve the desired performace of image retrieval, we ofte eed a very large umber of ceters, for example oe millio. As a result, it requires solvig a optimizatio problem with millio variables eve for a sigle optimizatio problem i 3. I order to address the computatioal challege, we choose the followig local kerel fuctio for this study 3 κx, c I x c 2 ρ 4 where Iz is a idicator fuctio that outputs 1 if z is true ad zero otherwise. The parameter ρ>0 is a predefied costat that defies the locality of the kerel fuctio. Its value is determied empirically, as show i the experimets. The propositio below shows the sparsity of the solutio α for 3. PROPOSITION 1. Give the local kerel fuctio defied i 4, for the optimal solutio α to 3, we have α j =0for ceter c j if cj xi 2 >ρ max 1 i Propositio 1 follows directly from the fact that κc j, x i=0,i= 1,..., if max 1 i c j x i 2 > ρ. As implied by Propositio 1, α j will be ozero oly if the ceter c j is withi a distace ρ of some key poits. By settig ρ to a small value, we will oly have a small umber of o-zero α j. We ca quickly idetify the subset of ceters with o-zero α j by coductig a efficiet rage search. I our study, this step reduces the umber of variables from 1 millio to about 1, 000. Although Propositio 1 allows us to reduce the umber of variables dramatically, we still have to fid a way to solve 3 efficietly. To this ed, we resort to the boud optimizatio strategy that leads to a simple iterative algorithm for optimizig 3: we deote by α the curret solutio ad by α the updated solutio for 3. It is straightforward to show that {LI, α LI, α } is bouded as follows LI, α LI, α αjκxi, cj = log α jκxi, cj N α jκx i, c j αj l=1 α j κxi, c log l α j By maximizig the upper boud i 5, we have the followig updatig rule for α α j = 1 Z α jκx i, c j l=1 α l κxi, c l where Z is the ormalizatio factor esurig αj =1. Note that α obtaied by iteratively ruig the updatig equatio i 6 is ideed globally optimal because the optimizatio problem i 3 is covex. I our implemetatio, we iitialize α j =1/N, i = 1,...,N, ad obtai the solutio α by oly ruig the iteratio oce, i.e., α j = 1 κx i, c j l=1 κxi, c l We emphasize that although the solutio i 7 is approximated i oly oe update, it is however the exact optimal solutio whe the key poits {x i} N are far apart from each other, as show by the followig theorem. THEOREM 1. Let the kerel fuctio be 4. Assume that all the key poits x 1,...,x are separated by at least 2ρ. The solutio α i 7 optimizes the problem i 3. PROOF. Whe ay two keypoits x i ad x j are separated by at least 2ρ, we have κx i, c k κx j, c k =0for ay ceter c k. This implies that o key poit could make cotributio to the estimatio of weight α k simultaeously for two differet ceters i 6. As a result, the expressio i 6 could be rewritte as α j = 1 Z = 1 Z = 1 Z α j I x i c j ρ l=1 α l κxi, c l I x i c j ρ I x i c j ρ α j α j κxi, cj As a result, the updatig equatio will give the fixed solutio, which is the global optimal solutio. Regularizatio. Although the sparse solutio resultig from the local kerel is computatioally efficiet, the sparse solutio may lead to a poor estimatio of query-likelihood, as demostrated i statistical laguage model [11]. To address this challege, we itroduce α g = α g 1,...,αg N, a global set of weights used for kerel desity fuctio. α g plays the same role as the backgroud lagauge model i statistical laguage models [11]. We defer the discussio of how to compute α g to the ed of this subsectio. Give the global set of 5 6 7

weights α g, we itroduce KLα g α, the Kullback-Leibler divergece betwee α g ad α, as a regularizer i 3, i.e., α = argmaxli, α λklα g α 8 α where λ>0is itroduced to weight the importace of the regularizer. As idicated i 8, by itroducig the KL divergece as the regularizer, we prefer the solutio α that is similar to α g. Note that 8 is equivalet to the MAP estimatio of α by itroducig a Dirichlet prior Dirα N [αi]β i, where β i = λα g i. Similar to the boud optimizatio strategy used for solvig 3, we have the followig approximate solutio for 8 α j = 1 + λ λα g j + κx i, c j l=1 κxi, c l It is importat to ote that, accordig to 9, the solutio for α is o loger sparse if α g is ot sparse, which could potetially lead to high computatioal cost i image matchig. We will discuss a method i the ext subsectio that explicitly addresses this computatioal challege. The remaiig questio is how to estimate α g, the global set of weights. To this ed, we search for the weight α g that ca explai all the key poits observed i all the images of gallery G, i.e., α g =argmax α g 9 C LI i, α g 10 Although we ca employ the same boud optimizatio strategy to estimate α g, we describe below a simple approach that directly utilizes the solutio α for idividual images to costruct α g.we deote by α i =α i 1,...,α i N the optimal solutio that is obtaied by maximizig the log-likelihood LI i, α i of the key poits observed i image I i. Give α i that maximizes LI i, α i, we have LI i, α g LI i, α i + 1 2 αg α i 2 LI i, α i α g α i 11 Hessia matrix 2 LI i, α is computed as i 2 LI i, α = u k i [u k i ], where u k i R N is a vector defied as N [u k i ] j = κx i k, c j/ α jκx i k, c j. l=1 The lemma below allows us to boud the Hessia matrix 2 LI i, α i ; LEMMA 1. NI 2 LI i, α i. PROOF. To boud the maximum eigevalue 2 LI i, α i, we cosider the quatity γ 2 LI i, α i γ with γ 2 =1. Defie η j t = γ 2 LI i, α i γ = i i [ γjκxi k, c j] 2 [ αjκxi k, cj]2 2 γj κxi k, c j αjκxi k, cj = γ j / γj ad η = η1,...,ηn. Defie γj. We have i γ 2 LI i, α i γ t 2 ηjκxi k, c j αjκxi k, cj Sice α i maximizes LI i, α, we have which implies η α i LI i, α 0, i ηjκxi k, c j αjκxi k, cj 1 Sice t N, we have 2 LI i, α i NI. Usig the result i Lemma 1, the objective fuctio i 10 ca be approximated as C C LI i, α g LI i, α i N 2 C α i α g 2 2 12 The global weights α g maximizig 12 is α g = 1 C C αi which shows that α g ca be computed as a average of {α i } C that are optimized for idividual images. 2.3 Efficiet Image Search Give the kerel desity fuctio px I i for each image i gallery G ad a query Q, the ext questio is how to efficietly idetify the subset of images that are likely to be visually similar to the query Q ad furthermore rak those images i the descedig order of their similarity. Followig the framework of statistical laguage models for text retrieval, we estimate the similarity by the likelihood of geeratig the key poits {q i} m observed i the query Q, i.e., m log pq I i= log α i jκq k, c j 13 where α i =α i 1,...,α i N are the weights for costructig the kerel desity fuctio for image I i. Clearly, a aive implemetatio will require a liear sca of all the images i the database before the subset of similar oes ca be foud. To achieve the efficiet image retrieval, we eed to exploit the sparse structure of α i 9. We defie α i j = 1 i We the write α i j as α i j = i κx i k, c j l=1 κxi k, c l 14 λ i + λ αg j + i i + λ αi j 15 Note that although α i j is o-sparse, α i j is sparse. Our goal is to effectively explore the sparsity of α i j for efficiet image retrieval. Usig the expressio i 15, we have log pq I i expressed as m λ log pq I i = log l=1 i + λ αg l + i i + λ αi l κx j, c l m = log 1+ i l=1 αi l κx j, c l λ N l=1 αg l κx + s Q j, c l where s Q = m λ log i + λ + m log α g l κxj, c l l=1 16 Note that i the secod term of s Q is idepedet of the idividual images for the same query, ad ii log pq I i s Q for ay image I i. Give the above facts, our goal is to efficietly fid the

subset of images whose query log-likelihood is strictly larger tha s Q, i.e., log pq I i >s Q. To this ed, we cosider the followig procedure: Fidig relevat ceters C Q for a give query Q. Give a query image Q with key poits q 1,...,q m, we first idetify the subset of ceters, deoted by C Q, that are withi distace ρ of the key poits i Q, i.e., C Q = {c j : q k Qs. t. q k c j 2 ρ}. Fidig the cadidates of similar images usig the relevat ceters. Give the relevat ceters i C Q, we fid the subset of images that have at least oe o-zero α i j for the ceters i C Q, i.e., R Q = Ii G: α i j > 0 c j C Q 17 Theorem 2 shows that all the images with query log-likelihood larger tha s Q belog to R Q. THEOREM 2. Let S Q deote the set of images with query loglikelihood larger tha s Q, i.e., S Q = {I i G:logpQ I i > s Q}. We have S Q = R Q. It is easy to verify the above theorem. I order to efficietly costruct R Q or S Q for a give query Q, we exploit the techique of ivert idexig [11]: we preprocess the images to obtai a list for each c j, deoted V j, that icludes all the images I i with α i j > 0. Clearly, we have R Q = V j 18 c j C Q 2.4 Compare to the Bag-of-Words Model To better uderstad the proposed kerel desity based framework we compare it to the bag-of-words model for image retrieval. I fact, by viewig each radom ceter c i as a differet visual word, ad each α as a histogram vector, we ca see a direct correspodece betwee the bag-of-words model ad the proposed framework. However, the kerel desity based framework is advatageous i that: First, it uifies key poit quatizatio ad image matchig via the itroductio of kerel desity fuctios. Secod, the bag-of-words model requires clusterig all the keypoits ito a large umber of clusters, while the proposed method oly eeds to radomly select a umber of poits from the date which is much more efficiet. Third, i the bag-of-words model, we eed to map each keypoit to the closest visual words. Sice the computatioal cost of this procedure is liear i the umber of keypoits, it is time cosumig whe the umber of keypoits is very large; The proposed method, however, oly eeds to coduct a rage search for every radomly selected ceters which is i geeral sigificatly smaller tha the umber of key poits, for example, oe millio ceters v.s. o billio keypoits. This computatioal savig makes the proposed method more suitable for large image databases tha the bag-ofwords model. Fourth, i the bag-of-words model, the radius of clusters i.e., the maximum distace betwee the keypoits i a cluster ad its ceter could vary sigificatly from cluster to cluster. As a result, for cluster with large radius, two keypoits ca be mapped to the same visual word eve if they differ sigificatly i visual features, leadig to a icosistet criterio for keypoits quatizatio ad Data set # images # features Size of descriptors 5K 5,062 14,972,956 4.7G 5K+1M 1,002,805 823,297,045 252.7G tattoo 101,745 10,843,145 3.4G Table 1: Statistics of the datasets potetially suboptimal performace i retrieval; O the cotrary, the proposed method uses a rage search for each ceter which esures that oly similar keypoits, which are withi the distace of r to the ceter, will cotribute to the correspodig elemet i the weight α of that ceter. Lastly, a keypoit is igored by the proposed method if its distaces to all the ceters are larger tha the threshold. The uderlyig ratioale is that if a keypoit is far away from all ceters, it is very likely to be a outlier ad therefore should be igored; While i the bag-of-words model, every keypoit must be mapped to a cluster ceter eve if the keypoit is far away from all the cluster ceters. 3. EXPERIMENTS 3.1 Datasets To evaluate the proposed method for large-scale image search, we coduct experimets o two bechmark data sets: 1 Oxford buildig dataset with 5, 000 images 5K[18] ad 2 Oxford buildig dataset plus 1 millio Flickr images 5K+1M. I additio, we also test the proposed algorithm over a tattoo image dataset tattoo with about 100, 000 images. Table 1 shows the details of the three datasets. Oxford buildig dataset 5K. The Oxford buildig dataset cosists of 5, 062 images. Although it is a small data set, we use it for evaluatig the proposed algorithm for image retrieval maily because it is oe of the widely used bechmark datasets. The Harris-Laplacia iterestig poit detector is used to detect key poits for each image, ad each key poit is described by a 128-dimesioal SIFT descriptor. O average, about 3, 000 key poits are detected for each image. Oxford buildig dataset plus 1 millio Flickr images 5K+1M. I this dataset, we first crawled Flickr.com to fid about oe millio images of medium resolutio ad the added them ito the Oxford buildig dataset. The same procedure is applied to extract ad represet key poits from the crawled Flickr images. Tattoo image dataset tattoo. Tattoos have bee commoly used i foresics ad law eforcemet agecies to assist i huma idetificatio. The tattoo image database used i our study cosist of 101, 745 images, amog which 61, 745 are tattoo images ad the remaiig 40, 000 images are radomly selected from the ESP dataset 1. The purpose of addig images from the ESP dataset is to verify the capacity of the developed system i distiguishig tattoo images from the other images. O average, about 100 Harris-Laplacia iterestig poits are detected for each image, ad each key poit is described by a 128-dimesioal SIFT descriptor. 1 http://www.gwap.com/gwap/gamespreview/espgame/

3.2 Implemetatio ad Baselies For the implemetatio of the proposed method, the kerel fuctio 4 is used. The ceters for the kerel are radomly selected from the datasets. We employ the FLANN library 2 to perform the efficiet rage search. For all the experimets, we set ρ =0.6 d, where d is the average distace of ay two key poits i the dataset that was estimated based o 1000 radomly sampled pairs. We set the parameter λ =1 where is the average umber of key poit i a image. Two clusterig based bag-of-words models are used as baselies. They are the hierarchical k-meas HKM implemeted i the FLANN library ad the approximate k-meas AKM [18] i which the exact earest eighbor search is replaced by k-d tree based approximate NN search. For HKM the brachig factor is set to be 10 based o our experiece. For AKM we use the implemetatio supplied by [18] for approximate earest eighbor search. A forest of 8 radomized k-d trees is used i all experimets. We iitialize cluster ceters by radomly selectig a umber of key poits i the dataset. The umber of iteratios for k-meas is set to be 10 because we observed that the cluster ceters of k-meas remais almost uchaged after 10 iteratios. For clusterig based methods, a state-of-the-art text retrieval method, Okapi BM25 [19] is used to compute the similarity betwee a query image ad images i the gallery give their bag-of-words represetatios. The iverted idices for both Okapi BM25 ad the proposed retrieval model are stored i memory to make the retrieval procedure efficiet. Overall, the Okapi BM25 method ad the proposed method are efficiet i fidig matched images for a give query. For example, o tattoo image dataset, both the Okapi BM25 model ad the proposed retrieval model take about 0.1 secod to aswer each query. 3.3 Evaluatio Metrics For the Oxford buildig dataset ad the Oxford buildig plus Flickr dataset, we follow [18] ad evaluate the retrieval performace by Average Precisio AP which is computed as the area uder the precisio-recall curve. I particular, a average precisio score is computed for each of the 5 queries from a ladmark specified i the Oxford buildig dataset, ad these results are averaged to obtai the Mea Average Precisio MAP for each ladmark. For tattoo image dataset, the retrieval accuracy is evaluated based o whether a system could retrieve images that share the tattoo symbol as i the query image. Sice for most query tattoo images, oly oe or two true matches exist i the database, aother evaluatio metric, termed Cumulative Matchig Characteristics CMC score [12], is used i this study. For a give rak positio k, its CMC score is computed as the percetage of queries whose matched images are foud i the first k retrieved images. The CMC score is similar to recall, a commo metric used i Iformatio Retrieval. We use CMC score o the tattoo database because it is the most widely used evaluatio metric i face recogitio ad foresic aalysis. Besides the evaluatio of the retrieval results, we also compare the preprocessig time for key poit quatizatio. For our two baselies, this is roughly equal to the time for clusterig all key poits i the dataset ito a large umber of clusters. For the proposed method, this is equal to the time for computig the weight vector α for all the images. We emphasize that the preprocessig time is importat for a CBIR system whe it comes to a large collectio with millios of images. 2 http://www.cs.ubc.ca/~mariusm/idex.php/flann Proposed HKM AKM 5K 0.61 0.53 0.57 5K+1M 0.45 0.36 0.39 Table 2: MAP results of the three algorithms with oe millio cluster/radom ceters. Figure 1: Examples of two queries ad the retrieved images raked from 1-6. The first three rows are based o the 5K dataset, the ext three rows are based o the 5K+1M dataset. The correctly retrieved results are outlied i gree ad the images from the 1M Flickr collectio are outlied i red. 3.4 Results o Oxford Buildig ad Oxford Buildig + Flickr Datasets Followig the settigs i [18], the MAP results of the three algorithms with oe millio cluster/radom ceters are listed i Table 2. I Figure 1, we show two examples of the queries ad the retrieved images. Note that for the 5K+1M dataset, we follow the experimetal protocol i [18] by oly usig the cluster/radom ceters that are the key poits of the images i the 5K dataset. The results clearly show that the proposed method outperforms the two clusterig based bag-of-words models. The preprocessig times of the three algorithms are show i the first two rows i Table 3. For both the datasets, the proposed method is sigificatly more efficiet tha the two clusterig based methods. We emphasize that for the 5K+1M dataset, we split it ito 82 subsets ad each subset cotais about 10,000,000 key poits. These 82 subsets are processed separately o multiple machies, ad are aggregated later to obtai the fial result of key poit quatizatio. The preprocessig time for 5K+1M dataset is estimated by the average processig time of each of the 82 subsets. Note that for the 5K+1M dataset, the preprocessig time of AKM is sigificatly smaller tha HKM. This is because we use the same cluster ceters that are geerated from the 5K dataset to quatize the key poits i the 5K+1M dataset. Hece, the processig time for the 5K+1M dataset oly ivolves fidig the earest eighbor cluster ceter for each key poit i the 5K+1M dataset. We fid that the implemetatio of k-d tree based approximate earest eigh-

Proposed HKM AKM 5K 1.09h 11.4h 36.8h 5K+1M 95h 685h 262h tattoo 1.02h 8.8h 31.1h Table 3: Preprocessig time of the three methods with oe millio cluster/radom ceters. rak 1 rak 2 rak 3 rak 4 rak 5 rak 6 rak 7 rak 8 rak 9 rak 10 Proposed 0.64020.74970.78990.80110.81450.8168 0.819 0.82240.8246 0.8246 AKM 0.6201 0.75750.78660.79660.80110.80450.80670.81230.8145 0.8156 HKM 0.5899 0.7117 0.743 0.75870.76540.7687 0.771 0.771 0.7754 0.7754 CMC 0.95 0.9 0.85 0.8 0.75 0.7 0.65 λ=0.01 λ=0.1 λ=1 λ=10 λ=100 Table 4: The CMC scores for tattoo image retrieval with oe millio cluster/radom ceters. bor search employed i AKM is roughly 3 times faster tha that of HKM, thereby, leadig to a shorter processig time for AKM tha for HKM for the 5K+1M dataset. 3.5 Results o Tattoo Image Dataset We selected 995 images as queries, ad maually idetified the gallery images that have the same tattoo symbols as the query images. We radomly selected 100 images amog the 995 images for traiig λ ad selectig ρ ad used the remaiig images for test. We first show the retrieval results of both the proposed method ad the baselie methods with the parameters tued to achieve the best performace, ad the show how sesitive the proposed algorithm is to the choice of parameters. Table 4 shows the CMC curve for the first 10 retrieval rakig positios. The last row i Table 3 shows the preprocessig time of the three method based o 1 millio cluster/radom ceters. Agai, we observe i the proposed algorithm outperforms the two clusterig based approaches, ad ii the proposed methods is about 9 times faster tha HKM, ad 30 times faster tha AKM. Figure 2 shows the CMC curves of the proposed method with λ varied from 0.01 to 100, where is the average umber of keypoits i a image. I this experimet, we set the umber of radom ceters to be oe millio, ad ρ to be 0.6 d, where d is the average distace betwee ay two keypoits which is estimated from 1,000 radomly sampled keypoits from the collectio. This result shows the performace of the proposed method is overall ot sesitive to the choice of λ. Figure 3 shows the CMC curves of the proposed method with ρ varied from 0.3 d to 1.1 d. I this experimet, we agai fixed the umber of ceters to be oe millio. From the figure we observe that with the exceptio of the smallest radius ρ i.e., r = 0.3 d, the retrieval system achieves similar performace for differet values of ρ. This idicates that the proposed algorithm is i geeral isesitive to the choice of ρ as log as ρ is large eough compared to the average iter-poits distace betwee keypoits. This result ca be uderstood by the fact that i a high dimesioal space, most data poits are far from each other ad as a result, uless we dramatically chage the radius ρ, we do ot expect the poits withi a distace ρ of the ceters to chage sigificatly. Figure 4 shows the performace of the proposed method with differet umber of radomly selected ceters. The λ ad ρ are selected to maximize the performace for the give umber of ceters. We clearly observe a sigificat icrease i the retrieval accuracy whe the umber of ceters is icreased from 10K to 1M. This is ot surprisig because a large umber of radom ceters usually 0 20 40 60 80 100 Number of retrieved images Figure 2: Results of the proposed method for tattoo image retrieval with differet value of λ base o 1 millio radom ceters with ρ =0.6 d CMC 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 ρ =0.3 d ρ =0.4 d ρ =0.5 d ρ =0.6 d ρ =0.7 d ρ =0.8 d ρ =0.9 d ρ =1.0 d ρ =1.1 d 0.5 0 20 40 60 80 100 Number of retrieved images Figure 3: Results of the proposed method for tattoo image retrieval with differet value of ρ base o 1 millio radom ceters results i a better discrimiatio betwee differet SIFT keypoits ad cosequetly leads to a improvemet i the detectio of similar images. A similar observatio is also foud whe we ru our retrieval system usig the bag-of-words model approach which is cosistet with the observatio i [18]. 4. CONCLUSIONS AND FUTURE WORK We have preseted a statistical modelig approach for large-scale image retrieval. We developed efficiet algorithms for i estimatig the desity fuctio of key poit distributio for each idividual image, ad ii idetifyig the subset of images i the gallery that is visually similar to a give query. Our empirical results o three large-scale image retrieval tasks show that the proposed method is both efficiet ad effective for idetifyig images that are visually similar to the query images. This study is limited to developig a statistical model for a bag of features. Several recet studies e.g. [27, 17] have show that by icorporatig the geometric relatioship amog the key poits, oe ca further improve the accuracy of image retrieval. I our future work, we pla to develop statistical approaches that model both a bag of features ad their geometric relatioship.

CMC 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 10k 50k 100k 500k 1M 0.1 0 20 40 60 80 100 Number of retrieved images Figure 4: Results of the proposed method for tattoo image retrieval with differet umber of ceters. 5. ACKNOWLEDGMENTS This research was supported by US Army Research ARO Award W911NF-08-010403, Office of Naval Research ONR N00014-09-1-0663 ad World Class Uiversity WCU program through the Natioal Research Foudatio of Korea fuded by the Miistry of Educatio, Sciece ad Techology R31-10008. 6. REFERENCES [1] G. Csurka ad et al. Visual categorizatio with bags of keypoits. I Workshop o Statistical Learig i Computer Visio, ECCV, 2004. [2] M. Datar ad et al. Locality-sesitive hashig scheme based o p-stable distributios. I SCG, 2004. [3] H. Jégou, M. Douze, ad C. Schmid. Hammig embeddig ad weak geometric cosistecy for large scale image search. I Europea Coferece o Computer Visio, 2008. [4] Y. Ke ad et al. Efficiet ear-duplicate detectio ad sub-image retrieval. I Proceedig of ACM Multimedia, pages 869 876, 2004. [5] J. Kivie ad et al. Learig multiscale represetatios of atural scees usig dirichlet processes. I ICCV, 2007. [6] R. I. Kodor ad T. Jebara. A kerel betwee sets of vectors. I ICML, 2003. [7] S. Lazebik ad et al. A sparse texture represetatio usig affie-ivariat regios. I CVPR, 2003. [8] V. Lepetit ad et al. Radomized trees for real-time keypoit recogitio. I CVPR, 2005. [9] T. Liu ad et al. A ivestigatio of practical approximate earest eighbor algorithms. I NIPS, 2004. [10] D. Lowe. Distictive image features from scale-ivariat keypoits. I IJCV, 2004. [11] C. D. Maig ad et al. Itroductio to Iformatio Retrieval,. Cambridge Uiversity Press, 2008. [12] H. Moo ad P. J. Phillips. Computatioal ad performace aspects of pca-based face recogitio algorithms. Perceptio, 30:303 321, 2001. [13] P. J. Moreo ad et al. A kullback-leibler divergece based kerel for svm classificatio i multimedia applicatios. I NIPS, 2003. [14] M. Muja ad D. G. Lowe. Fast approximate earest eighbors with automatic algorithm cofiguratio. I Iteratioal Coferece o Computer Visio Theory ad Applicatios, 2009. [15] D. Nister ad H. Steweius. Scalable recogitio with a vocabulary tree. I CVPR, 2006. [16] E. Parze. O estimatio of a probability desity fuctio ad mode. A. Math. Stat, 1962. [17] M. Perdoch ad et al. Efficiet represetatio of local geometry for large scale object retrieval. I CVPR, 2009. [18] J. Philbi ad et al. Object retrieval with large vocabularies ad fast spatial matchig. I CVPR, 2007. [19] S. E. Robertso ad et al. Okapi at trec-7. I Proceedigs of the Seveth Text REtrieval Coferece, 1998. [20] G. Salto ad M. J. McGill. Itroductio to Moder Iformatio Retrieval. McGraw-Hill, Ic., 1986. [21] G. Shakharovich ad et al. Nearest-Neighbor Methods i Learig ad Visio: Theory ad Practice. MIT Press, 2006. [22] C. Silpa-Aa ad R. Hartley. Localizatio usig a imagemap. I Proceedigs of the 2004 Australasia Coferece o Robotics & Automatio, 2004. [23] C. Silpa-Aa ad R. Hartley. Optimised kd-trees for fast image descriptor matchig. I CVPR, 2008. [24] J. Sivic ad A. Zisserma. Video Google: A text retrieval approach to object matchig i videos. I ICCV, 2003. [25] J. C. va Gemert, J. M. Geusebroek, C. J. Veema, ad A. W. M. Smeulders. Kerel codebooks for scee categorizatio. I Europea Coferece o Computer Visio, 2008. [26] J. Wi ad et al. Object categorizatio by leared uiversal visual dictioary. I ICCV, 2005. [27] Z. Wu ad et al. Budlig features for large scale partial-duplicate web image search. I CVPR, 2009.