A Posteriori Multi-Probe Locality Sensitive Hashing

Alexis Joly
INRIA Rocquencourt
Le Chesnay, 78153, France

Olivier Buisson
INA
Bry-sur-Marne, France

ABSTRACT

Efficient high-dimensional similarity search structures are essential for building scalable content-based search systems on feature-rich multimedia data. In the last decade, Locality Sensitive Hashing (LSH) has been proposed as an indexing technique for approximate similarity search. Among the most recent variations of LSH, multi-probe LSH techniques have been proved to overcome the over-linear space cost drawback of common LSH. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by previous work on probabilistic similarity search structures and improves upon recent theoretical work on multi-probe and query-adaptive LSH. Whereas these methods are based on likelihood criteria that a given bucket contains query results, we define a more reliable a posteriori model taking into account some prior knowledge about the queries and the searched objects. This prior knowledge allows a better quality control of the search and a more accurate selection of the most probable buckets. We implemented a nearest neighbors search based on this paradigm and performed experiments on different real visual feature datasets. We show that our a posteriori scheme outperforms other multi-probe LSH methods while offering a better quality control. Comparisons to the basic LSH technique show that our method allows consistent improvements in both space and time efficiency.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Performance, Theory

1. INTRODUCTION AND RELATED WORK

Efficient high-dimensional similarity search structures are essential for building scalable content-based multimedia systems, including multimedia search engines as well as browsing, summarization or content enrichment technologies. Indeed, multimedia contents are typically represented by high-dimensional feature vectors that are frequently processed by algorithms involving nearest neighbors search, e.g. ranking, matching, quantizing, clustering or learning.

Early proposed tree-based indexing methods for Nearest Neighbors (NN) search such as the R-tree [8], SR-tree [13], M-tree [5] or, more recently, the cover-tree [2] return accurate results, but they are not time efficient for data with high (intrinsic) dimensionalities. It has been shown in [21] that when the dimensionality exceeds about 10, existing indexing data structures based on space partitioning are slower than the brute-force, linear-scan approach.

Approximate nearest-neighbor algorithms have been shown to be an interesting way of dramatically improving the search speed, and are often a necessity. The principle is to speed up the search by returning only an approximation of the exact query results, according to an accuracy measure. Some of the first proposed approximate solutions were simply extensions of exact methods to the search of ɛ-NN [22, 4]; an ɛ-NN being an object whose distance to the query is lower than (1+ɛ) times the distance of the true k-th nearest neighbor.
In [22] e.g., Zezula et al. deal with ɛ-NN in an M-tree. The performance gain is around 20 for a recall of 50% compared to exact results. Clustering-based approximate methods have also been proposed to achieve substantial speed-ups over sequential scan [7, 15]. These algorithms partition the data into clusters and rank them at query time according to their similarity to the query vector. Cluster preprocessing is however very time consuming and prevents practical operations such as insertions or deletions. In [9], Houle et al. developed a practical index called SASH for approximate similarity queries in extremely high-dimensional data. SASH is a multi-level structure of random samples connected to some of their neighbors. Queries are processed by first locating approximate neighbors within the sample, and then using the pre-established connections to discover neighbors within the remainder of the data set. Overall, one of the most popular approximate nearest neighbor search algorithms used in multimedia applications is Locality-Sensitive Hashing (LSH) [6]. The basic method uses a family of locality-sensitive hash functions composed of linear projections over randomly selected directions in the feature space. The principle is that nearby objects are hashed into the same hash bucket with a high probability, for at least one of the used hash functions. LSH has been proved to achieve very good time efficiency for high-dimensional features and has been successfully applied in several multimedia applications including visual local features indexing [14], songs intersection [3] or 3D object indexing [18].

Time efficiency improvements of the basic LSH method have been proposed recently. In [1], Andoni and Indyk propose a near-optimal LSH that uses a Leech lattice for the geometric hashing instead of one-dimensional random projections. The idea is that lattices offer better quantization properties for the mean square error dissimilarity measure used in Euclidean spaces. In [10], Jegou et al. also use a lattice instead of random projections, but improve the search time efficiency by performing an on-line selection of the most appropriate hash functions from the whole pool of functions. To achieve high search accuracy, LSH methods however need to use a large number of hash tables, and their main drawback is that they require a very large amount of available memory. To solve this problem, multi-probe LSH methods have been proposed recently [17, 19]. Such methods are built on the well-known LSH technique, but instead of probing only the bucket containing the query in each hash table, they probe multiple buckets that are likely to contain query results. The first multi-probe LSH strategy, denoted entropy-based LSH, was proposed by Panigrahy [19]. The principle was to sample multiple buckets by randomly generating perturbed objects near the query object, resulting in several query objects whose results are merged in the end. The intention of the method was clearly to trade time for space requirements. In [17], Lv et al. propose a more efficient multi-probe LSH method that directly generates perturbed hash buckets instead of perturbed query objects, thanks to an efficient algorithm producing optimal probing sequences of hash buckets that are likely to contain objects similar to the query.

This paper presents a new multi-probe LSH method that generalizes and improves upon these previous techniques. Whereas they are based on a simple likelihood criterion that a given bucket contains query results, we define a more reliable a posteriori probabilistic model taking into account some prior knowledge about the queries and the searched objects. This prior knowledge allows a more accurate selection of the most probable buckets, improving time efficiency and offering a better quality control of the search. Our new multi-probe LSH method is somehow inspired by previous works of the authors on probabilistic similarity search structures [20, 12]. Such methods can also be considered as hashing algorithms, but contrary to LSH techniques they are based on a single multidimensional hash function induced by a space filling curve or an adaptive grid in the original feature space. The principle of the search is then to select the most probable hash buckets according to a probabilistic model of the searched objects, learned on query samples. Such techniques have been proved to achieve very good time efficiency for the search of distorted features in huge datasets and have been successfully applied in scalable content-based copy detection applications [20, 12]. However, these techniques fail to index high-dimensional features whose intrinsic dimensionality exceeds about 30 to 40 dimensions, mainly because they use a single multidimensional grid hash table, whereas LSH methods solve this problem by using several randomly selected hash functions.

The paper is organized as follows: Section 2 recalls the general principles of Locality-Sensitive Hashing methods. Section 3 describes the proposed a posteriori multi-probe Locality Sensitive Hashing method. Section 4 reports experimental results of the proposed method on real datasets.
2. LOCALITY SENSITIVE HASHING

In this section, we recall the general Euclidean Locality-Sensitive Hashing algorithm as described in [6], since we use the same indexing scheme. The basic idea of LSH is to use a set of hash functions that map similar objects into the same hash bucket with a probability higher than non-similar objects. At indexing time, all the feature vectors of the dataset are inserted in L hash tables corresponding to L randomly selected hash functions. At query time, the query vector is also mapped onto the L hash tables and the corresponding L hash buckets are selected as candidates to contain objects similar to the query. A final step is then performed to filter the candidate objects by computing their distance to the query.

More formally, let V be a dataset of N d-dimensional feature vectors in R^d under the l_2 norm. For any point v in R^d, the notation ||v||_2 represents the l_2 norm of the vector v. Now let G = {g : R^d -> Z^k} be a family of hash functions such that:

g(v) = (h_1(v), ..., h_k(v))

where the functions h_i, for i in [1, k], belong to a locality sensitive hashing function family H = {h : R^d -> Z} [6]. We recall that a function family H = {h : R^d -> Z} is called (R, cR, p_1, p_2)-sensitive for l_2 if for any q, v in R^d:

Pr(h(q) = h(v)) >= p_1  when  ||q - v||_2 <= R     (1)
Pr(h(q) = h(v)) <= p_2  when  ||q - v||_2 >= cR    (2)

where c > 1 and p_1 > p_2. Intuitively, this means that nearby objects within distance R have a greater chance of being hashed to the same value than objects that are far away (distance greater than cR). For the l_2 metric, the typically used LSH functions h_i in H are defined as:

h_i(v) = ⌊(a_i · v + b_i) / w⌋     (3)

where a_i in R^d is a random vector with entries chosen independently from a Gaussian distribution, and b_i is a real number chosen uniformly from the range [0, w]. Now, the LSH indexing method works as follows:

1. Choose L hash functions g_1, ..., g_L from G, independently and uniformly at random (each hash function g_j = (h_{j,1}(v), ..., h_{j,k}(v)) is the concatenation of k LSH functions randomly generated from H).

2. Use each of the L hash functions to construct one hash table (resulting in L hash tables).

3. Insert all points v in V in each of the L hash tables by computing the corresponding L hash values.

At query time, the L hash values of a given query vector q are computed in order to generate a set of L candidate hash buckets (one in each hash table). The candidate objects are then filtered by computing their distance to the query according to the query objective (typically the K nearest neighbors or the neighbors in a given range).
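To make the scheme concrete, here is a minimal sketch of an E2LSH-style index in Python with NumPy. All names (LshTable, build_index, query_knn) are illustrative, not the authors' code, and the parameters L, k and w are assumed to be chosen as discussed later in section 3.3.4.

    import numpy as np
    from collections import defaultdict

    class LshTable:
        """One of the L hash tables: g(v) = (h_1(v), ..., h_k(v)) with
        h_i(v) = floor((a_i . v + b_i) / w), a_i ~ N(0, I), b_i ~ U[0, w)."""
        def __init__(self, d, k, w, rng):
            self.a = rng.standard_normal((k, d))  # random projection directions
            self.b = rng.uniform(0.0, w, size=k)  # random offsets in [0, w)
            self.w = w
            self.buckets = defaultdict(list)      # hash key -> list of point ids

        def key(self, v):
            return tuple(np.floor((self.a @ v + self.b) / self.w).astype(int))

        def insert(self, idx, v):
            self.buckets[self.key(v)].append(idx)

    def build_index(data, L, k, w, seed=0):
        rng = np.random.default_rng(seed)
        tables = [LshTable(data.shape[1], k, w, rng) for _ in range(L)]
        for t in tables:
            for i, v in enumerate(data):
                t.insert(i, v)
        return tables

    def query_knn(tables, data, q, K):
        # union of the L buckets containing q, then exact distance filtering
        cand = {i for t in tables for i in t.buckets.get(t.key(q), [])}
        return sorted(cand, key=lambda i: np.linalg.norm(data[i] - q))[:K]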

In multi-probe LSH methods [17, 19], the main difference is that the set of candidate hash buckets is extended to more than one bucket in each hash table, by selecting hash buckets neighboring the query's bucket. The idea is to increase the probability of finding a relevant neighbor in a single hash table and consequently reduce the number L of required hash tables. The different approaches mainly differ in how they select the multiple buckets per hash table, and we will come back to this point in the next section.

3. A POSTERIORI MULTI-PROBE LOCALITY SENSITIVE HASHING

This section describes the proposed method. Sub-section 3.1 introduces our new success probability criterion to find neighbors in a given bucket. Sub-section 3.2 describes the derived Probabilistic Query-directed Probing Sequence algorithm. Sub-section 3.3 focuses on the implementation details for approximate nearest neighbor search. Finally, sub-section 3.4 analyses more precisely the advantages of our method compared to other multi-probe LSH methods.

To make our argumentation easier, let us first reformulate the definition of a hash function g in G as:

g = ⌊g^r⌋  with  g^r(v) = A v + B     (4)

where A is the k x d matrix whose rows are the vectors a_i / w, and B = (b_1/w, ..., b_k/w)^T. Thus, a hash function g^r can be uniquely defined by a set of parameters θ = {A, B}, and we denote by g^r_θ(v) a hash function parametrized by a fixed θ. Secondly, let n(q) be the set of relevant neighbors of a given query q. This relevant set of neighbors depends on the targeted similar objects, e.g. the K nearest neighbors of q, the results of an R-range query around q, or any other relevant vectors.

3.1 Success Probability Estimation

Although all multi-probe LSH approaches visit multiple buckets for each hash table, they are very different in how they probe those buckets. In this section, we introduce the criterion used by our technique to estimate the success probability that a given hash bucket in a given hash table (among L) contains a relevant object v in n(q). We first start by introducing the criterion used by other multi-probe [17, 19] and query-adaptive [10] LSH methods.

Locality sensitive hashing theory is based on the probability distribution of the hash values of two given points q and v, over the random choices of the hash functions, i.e. over random choices of the parameter set θ. In other words, θ is considered as a random variable, whereas v and q are considered as constants. More formally, let g^r_v(θ) denote the hash function g^r for a constant v and a variable θ, and let

δ_{q,v}(θ) = g^r_v(θ) - g^r_q(θ)     (5)

be the difference of the hash values of v and q as a function of θ. Due to the property of p-stable distributions [6], to which the Gaussian distribution used to generate the LSH functions belongs, it is possible to show that the probability distribution of δ_{q,v}(θ) is a normal distribution with k independent Gaussian components of variance σ²_R proportional to R = ||v - q||_2:

p_{δ_{q,v}(θ)} = N(0, σ²_R I_k)     (6)

From this statement, LSH theory derives the probability that two given points q and v collide in the same bucket over the randomly picked hash functions, as done in [6]. In a more general case, it is also possible to derive the probability that q and v belong to adjacent buckets over the randomly picked hash functions. Such a probability can be used to estimate the overall search quality of a step-wise multi-probe LSH approach (as the one mentioned in [17]), which would consist in probing the same bucket neighborhood in all hash tables (e.g. all the neighboring hash buckets which differ in at most c coordinates from the hash bucket of the query). Now, the basic multi-probe LSH method [17] and the query-adaptive LSH approach [10] are based on a likelihood interpretation of the probability distribution p_{δ_{q,v}(θ)}.
The probability density of δ_{q,v} over random values of θ can indeed also be interpreted as the likelihood of δ_θ = g^r_θ(v) - g^r_θ(q) over random choices of q and v, for fixed values of θ (i.e. for a fixed hash table). Formally, let

l_{δ_θ(q,v)}(θ) = p_{δ_{q,v}(θ)}(q, v)     (7)

denote this likelihood. From Equation 6, we can derive:

l_{δ_θ(q,v)}(θ) = K exp( -||g^r_θ(v) - g^r_θ(q)||² / (2σ²_R) )     (8)

As this likelihood mostly depends on ||g^r_θ(v) - g^r_θ(q)||_2, the authors of [17] and [10] suggest using, as success criterion of a given hash bucket u = g_θ(q) + Δ, Δ in Z^k, the distance between the boundaries of this hash bucket and the real query hash g^r_θ(q). At this point, it is important to note that the likelihood l_{δ_θ(q,v)}(θ) does not model the real probability of finding a neighbor v of q in the hash bucket u = g_θ(q) + Δ. Considering this likelihood as a probability density would in fact be a case of prosecutor's fallacy, since the real density depends on the prior distribution of v in n(q).

Our method rather estimates the success probability of a given hash bucket in a given hash table a posteriori, i.e. for an observed hash function g^r_θ parametrized by a known θ. For a given query q, in the absence of evidence, a point v in n(q) is indeed a random variable, to which we associate a prior probability distribution p_{v|q}(x), x in R^d. For a given query q and a given hash function g_θ, our success criterion is then based on the distribution of g^r_θ(v) over random choices of v in n(q), denoted p_{g^r_θ}(q,θ). We will see later, in section 3.3, how we can derive this posterior distribution from prior distributions p_{v|q}. For now, we suppose that this distribution is known.
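As an aside, the likelihood criterion of [17] discussed above can be sketched in a few lines. This is an illustrative reading of their score, not their implementation: for a candidate bucket key u, each component contributes the squared distance between the query projection and the nearest boundary of the probed slot (zero for the query's own slot), and buckets with lower scores are probed first.

    from math import floor

    def likelihood_score(g_real, u):
        """g_real: unquantized query projections g_theta^r(q); u: integer bucket key."""
        score = 0.0
        for x, u_i in zip(g_real, u):
            slot = floor(x)
            if u_i > slot:            # probing a slot to the right of the query's
                score += (u_i - x) ** 2
            elif u_i < slot:          # probing a slot to the left
                score += (x - (u_i + 1)) ** 2
            # u_i == slot: the query's own slot contributes nothing
        return score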

After scalar quantization of the components of g^r_θ, the probability of finding a relevant neighbor in a bucket characterized by its key u = (u_1, ..., u_k) in Z^k is:

P_{g_θ}(q,θ)(u) = Pr( g_θ(x) = u : x in n(q) ) = ∫_{u_1}^{u_1+1} ... ∫_{u_k}^{u_k+1} p_{g^r_θ}(q,θ)(y) dy

Now, the principle of our probabilistic multi-probe LSH method is to visit the most probable hash buckets of a given hash function g_θ according to their posterior probabilities P_{g_θ}(q,θ)(u). More precisely, our algorithm selects the minimal set of hash buckets such that the global probability is higher than a quality control parameter α. Formally, let U be a set of hash keys u in Z^k and let

U_α = { U : Σ_{u in U} P_{g_θ}(q,θ)(u) >= α }

Then we wish to find the set of keys U_min(α) such that:

U_min(α) = argmin_{U in U_α} |U|     (9)

A naive way to construct U_min(α) would be to compute the success probability of all possible keys and sort them, but this is of course practically impossible. A more efficient but approximate way would be to first use an s-step-wise probing algorithm that selects all the hash buckets which differ in at most s coordinates from the hash bucket of the query, and then to sort them according to their posterior probability. This method has the advantage of being generic, but is still not very efficient, since the number of hash bucket probabilities to estimate remains Σ_{n=1}^{s} 2^n (k choose n). If we tolerate an independence hypothesis on the components of P_{g_θ}(q,θ)(u), it is however possible to use a drastically more efficient algorithm, similar to the Query-Directed Probing Sequence algorithm defined in [17], which is described in section 3.2. At this point, we can remark that if we consider the following discrete distribution

P_{g_θ}(q,θ)(u) = 1 if u = g_θ(q), 0 if u ≠ g_θ(q)     (10)

our method is equivalent to the basic LSH method. Also, if the prior p_{v|q} is modeled by an isotropic normal distribution around q, p_{g^r_θ}(q,θ) would also be an isotropic normal distribution and our method would be somehow equivalent to the common likelihood-based multi-probe LSH method.

3.2 Probabilistic Query-directed Probing Sequence algorithm

As mentioned in the previous section, under the independence hypothesis on the components of g^r_θ, it is possible to define an efficient probabilistic probing sequence algorithm. Note that this hypothesis does not hold in general, since g^r_θ(v) is a function of the random variable v in n(q), which generally does not have independent components. However, we will see in the experiments that it seems to be an acceptable hypothesis. From this assumption follows:

p_{g^r_θ}(q,θ)(y) = Π_{i=1}^{k} p_{h^r_i}(q,θ_i)(y_i),   y in R^k     (11)

where θ_i = {a_i, b_i}, and after quantization:

P_{g_θ}(q,θ)(u) = Π_{i=1}^{k} P_{h_i}(q,θ_i)(u_i),   u in Z^k     (12)

Note that in practice, the domain of each u_i is not infinite: u_i belongs to a very short range of integer values bounded by u_min = min_{x in V} h_i(x) and u_max = max_{x in V} h_i(x), where V is the dataset we wish to index. Now, to achieve the objective of Equation 9, our Probabilistic Query-directed Probing algorithm works in a similar way to the one defined in [17], so we refer the reader to it for full details and illustrations. The main differences of our algorithm are that (1) the relevance criterion of a hash bucket is not a sum of squared distances but a product of probabilities, (2) the ending condition is not the number of iterations (i.e. the number of probes) but the estimated success probability over all generated probes, and (3) the probing is not limited to the hash buckets adjacent to the query bucket u_q = g_θ(q). The principle of the algorithm is to generate a list of hash buckets in decreasing order of their probability P_{g_θ}(q,θ)(u) and to stop when the sum of their probabilities is larger than α.
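Under the independence assumption, Equation 12 is straightforward to evaluate: the posterior probability of a bucket factorizes into k one-dimensional Gaussian integrals. The sketch below assumes the per-component conditional moments (mu_i, sigma_i) of h^r_i(v) given the query are available; their estimation is described in section 3.3.

    from math import erf, sqrt

    def gaussian_cdf(x, mu, sigma):
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    def component_prob(u_i, mu_i, sigma_i):
        # P_{h_i}(u_i): mass of N(mu_i, sigma_i^2) on the slot [u_i, u_i + 1)
        return gaussian_cdf(u_i + 1.0, mu_i, sigma_i) - gaussian_cdf(u_i, mu_i, sigma_i)

    def bucket_prob(u, mu, sigma):
        # Equation 12: product over the k components of the hash key
        p = 1.0
        for u_i, m, s in zip(u, mu, sigma):
            p *= component_prob(u_i, m, s)
        return p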
Given the query object q and the hash functions h_i, i = 1, ..., k, corresponding to a single hash table, we first compute P_{h_i}(q,θ_i)(u_i) for i = 1, ..., k and u_i = u_min, ..., u_max. This generates an array of k lists of n probabilities (n = u_max - u_min + 1). Then, the n values of each of the k lists are sorted in decreasing order, and finally the k lists themselves are sorted in decreasing order of their first element. Let p_j[z_j] denote the (z_j + 1)-th element of the j-th list of the sorted array. A hash bucket with key u can now be uniquely represented by a sorted key z = (z_1, ..., z_j, ..., z_k) in Z^k. According to Equation 12, the probability of a hash bucket characterized by its sorted key z remains a product of independent probabilities:

Pr(z) = Π_{j=1}^{k} p_j[z_j]

The problem of generating hash buckets in decreasing order of their probability now reduces to the problem of generating sorted keys z in decreasing order of their probability. To do that, three operations generating child sorted keys from a parent one are defined. The first two (shift and expand) are similar to the ones used in [17], except that our representation is not the same. The last one (extend) is added to explore hash buckets not adjacent to the query bucket:

shift(z): This operation shifts to the right the last non-zero component of z if it is equal to one and if it is not the last component (e.g. shift((1, 2, 1, 0, 0)) = (1, 2, 0, 1, 0)). Otherwise, it returns nothing (e.g. shift((1, 2, 1, 0, 1)) = ∅, shift((0, 2, 0, 0, 0)) = ∅).

expand(z): This operation sets to one the component following the last non-zero component of z, if the latter is not the last component of z (e.g. expand((1, 2, 1, 0, 0)) = (1, 2, 1, 1, 0)). Otherwise, it returns nothing (e.g. expand((1, 2, 1, 0, 1)) = ∅).

extend(z): This operation adds one to the last non-zero component (e.g. extend((1, 2, 1, 0, 0)) = (1, 2, 2, 0, 0)).

The important property of these three operations is that they generate child hash buckets with probabilities lower than the parent one (Pr(shift(z)) < Pr(z), Pr(expand(z)) < Pr(z) and Pr(extend(z)) < Pr(z)). The other important property is that, for any hash bucket characterized by its sorted key z, there is a unique sequence of shift, expand and extend operations which will generate z from the starting bucket z_0 = (0, ..., 0). Now, the algorithm used to generate the hash buckets in decreasing order of their probability and achieve the objective of Equation 9 is the following (a runnable sketch is given after the listing):

    maxheap = ∅; OutputKeyList = ∅;
    z_0 = (0, ..., 0);
    maxheap.Insert(z_0, Pr(z_0));
    l = 1; P_t = 0;
    while P_t < α do
        z_l = maxheap.ExtractMax();
        OutputKeyList.Add(z_l);
        P_t = P_t + Pr(z_l);
        z_shift = shift(z_l);   maxheap.Insert(z_shift, Pr(z_shift));
        z_expand = expand(z_l); maxheap.Insert(z_expand, Pr(z_expand));
        z_extend = extend(z_l); maxheap.Insert(z_extend, Pr(z_extend));
        l = l + 1;
    end
    return OutputKeyList;

A max-heap is used to maintain the collection of candidate hash buckets and output the top node at each iteration. The number of elements in the heap at any point of time is less than two times the number of iterations (i.e. the number of generated probes).
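The algorithm above translates directly into a runnable sketch (Python's heapq is a min-heap, so probabilities are negated). plists[j] stands for the j-th sorted list p_j, each sorted in decreasing order with the lists themselves ordered by decreasing first element, as in the text; mapping sorted keys z back to actual bucket keys u through the sort permutations is omitted for brevity. Thanks to the uniqueness property above, no duplicate keys are ever pushed.

    import heapq
    from math import prod

    def _last_nz(z):
        # index of the last non-zero component, or -1 for z = (0, ..., 0)
        for i in range(len(z) - 1, -1, -1):
            if z[i] != 0:
                return i
        return -1

    def shift(z):
        j = _last_nz(z)
        if j < 0 or z[j] != 1 or j == len(z) - 1:
            return None
        z2 = list(z); z2[j] = 0; z2[j + 1] = 1
        return tuple(z2)

    def expand(z):
        j = _last_nz(z)                 # for z_0 this bootstraps (1, 0, ..., 0)
        if j == len(z) - 1:
            return None
        z2 = list(z); z2[j + 1] = 1
        return tuple(z2)

    def extend(z):
        j = _last_nz(z)
        if j < 0:
            return None
        z2 = list(z); z2[j] += 1
        return tuple(z2)

    def probe_sequence(plists, alpha):
        k = len(plists)
        def pr(z):
            if any(z[j] >= len(plists[j]) for j in range(k)):
                return 0.0              # extended past the last slot
            return prod(plists[j][z[j]] for j in range(k))
        out, total = [], 0.0
        z0 = tuple([0] * k)
        heap = [(-pr(z0), z0)]
        while total < alpha and heap:
            negp, z = heapq.heappop(heap)
            out.append(z); total -= negp
            for child in (shift(z), expand(z), extend(z)):
                if child is not None:
                    heapq.heappush(heap, (-pr(child), child))
        return out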
3.3 Approximate nearest neighbor search implementation

3.3.1 A posteriori probabilities estimation

We implemented a nearest neighbors search technique based on the proposed method. This section describes how we compute the a posteriori probabilities P_{h_i}(q,θ_i)(u_i) required by the Probabilistic Query-directed Probing algorithm (cf. Equation 12). Our estimation is based on a training set of N_s sampled query objects q_s and corresponding retrieved objects v in n(q_s), typically obtained by randomly picking N_s sample objects in the dataset and searching their exact nearest neighbors through an exhaustive scan of the dataset. Now, let us model the prior distribution p_{v|q} by a multivariate normal distribution with conditional mean μ(q) and conditional covariance matrix Σ(q):

p_{v|q} = N(μ(q), Σ(q))

The hash function g^r_θ(v) being a linear application of v (cf. Equation 4), it also produces a Gaussian random variable, with conditional covariance matrix A Σ(q) A^T and conditional mean g^r_θ(μ(q)). The distribution of the independent components h^r_i is then:

p_{h^r_i}(q,θ_i) = N( h^r_i(μ(q)), a_i^T Σ(q) a_i )     (13)

At this point, the conditional functions μ(q) and Σ(q) for any query q could be estimated by a Parzen window over the N_s samples of the training set. However, this process would be too expensive at query time. In our current implementation, we simplified the problem by considering a simpler model where the mean and variance of p_{h^r_i}(q,θ_i) are conditioned only by h^r_i(q) and not by q itself. The main foundation of this assumption is to consider h^r_i(v) independent from the other components h^r_j(q), j ≠ i, which is quite realistic due to the independence of the randomly selected projection vectors a_i. In practice, using the datasets described in section 4, the normalized mutual information between pairs of such variables is very low, around 0.04 on average with a maximum of 0.15 for the most dependent pair of hash functions of the HSV dataset.

For each sample query q_s, we estimate the sample mean μ_s and sample covariance matrix Σ_s over the retrieved neighbors v in n(q_s). For all hash functions h^r_i, we then compute the N_s hash values h^r_i(q_s) and associate with them a sample mean and a sample variance (cf. Equation 13):

μ(h^r_i(q_s)) = h^r_i(μ_s)
σ²(h^r_i(q_s)) = a_i^T Σ_s a_i

The conditional mean μ(h^r_i(q)) and conditional variance σ²(h^r_i(q)) for any q are then interpolated by a Gaussian kernel over the N_s samples:

μ(h^r_i(q)) = Σ_{s=1}^{N_s} K(h^r_i(q), h^r_i(q_s)) h^r_i(μ_s) / Σ_{s=1}^{N_s} K(h^r_i(q), h^r_i(q_s))     (14)

σ²(h^r_i(q)) = Σ_{s=1}^{N_s} K(h^r_i(q), h^r_i(q_s)) σ²(h^r_i(q_s)) / Σ_{s=1}^{N_s} K(h^r_i(q), h^r_i(q_s))     (15)

where K(x, y) is the Gaussian kernel function. At this point, the discrete distribution of the hash values after quantization can be computed as:

P_{h_i}(q,θ_i)(u_i) = ∫_{u_i}^{u_i+1} N( μ(h^r_i(q)), σ²(h^r_i(q)) )(y) dy     (16)

In our experiments, we typically use N_s = 1000 query samples and σ_K = w for the Gaussian kernel parameter.
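The interpolation of Equations 14 and 15 is a standard kernel-weighted (Nadaraya-Watson style) average over the N_s training samples. A minimal sketch, assuming the per-sample statistics have already been computed as above:

    import numpy as np

    def conditional_moments(hq, hq_samples, mu_samples, var_samples, sigma_k):
        """hq: hash value h_i^r(q) of the query.
        hq_samples: the N_s values h_i^r(q_s).
        mu_samples / var_samples: per-sample mean h_i^r(mu_s) and
        variance a_i^T Sigma_s a_i of the neighbours' hash values.
        sigma_k: Gaussian kernel bandwidth (sigma_K = w in the paper)."""
        w = np.exp(-0.5 * ((hq - hq_samples) / sigma_k) ** 2)  # kernel weights
        mu = np.sum(w * mu_samples) / np.sum(w)                # Equation 14
        var = np.sum(w * var_samples) / np.sum(w)              # Equation 15
        return mu, var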

3.3.2 Pre-computation of probabilities in look-up tables

In practice, to speed up the Probabilistic Query-directed Probing algorithm at query time, we pre-compute, at indexing time, the discrete distributions P_{h_i}(q,θ_i)(u_i) for quantized values of h^r_i(q). Let h^z_i(q) in [0, N_z] be the quantized value of h^r_i(q):

h^z_i(q) = ⌊ (h^r_i(q) - u_min) / (u_max + 1 - u_min) · N_z ⌋     (17)

At indexing time, we pre-compute the discrete distributions P_{h_i}(q,θ_i)(u_i) for all possible quantized values h^z_i(q) in {0, ..., N_z} according to Equations 14, 15 and 16. Over all hash functions h_{j,i}, this process generates a set of L · k · N_z look-up tables of size u_max + 1 - u_min. In the experiments, we used N_z = 2500 and the resulting space requirement for the look-up tables did not exceed 5 MB. At query time, the probabilities required by the Probabilistic Query-directed Probing algorithm (cf. Equation 12) are obtained simply by quantizing h^r_i(q) according to Equation 17 and reading the corresponding values in the look-up tables.

3.3.3 Refinement step

At query time, once the most probable hash buckets are selected, the distance to the query is computed for all the objects they contain, and only the objects satisfying the query objective are output (v in n(q)).

3.3.4 Parameter settings

The main parameters of our technique are the common LSH parameters L, k and w (cf. section 2) and the single hash table quality control parameter α (cf. section 3.1). As suggested in [6] and [1], we choose common settings for the LSH parameters: k = ln(N) and w = 4R, where R is the average distance of the searched objects v in n(q) to the query q. The total quality of the search α_T is set by the user. In the basic LSH technique, it is equal to

α_T = 1 - (1 - p_0^k)^L     (18)

where p_0 is the probability that a query q and a neighbor v collide in the same bucket for a single hash function. In our technique, the probability of retrieving a neighbor v in one of the L multidimensional indexes is estimated by α, and thus we get:

α_T = 1 - (1 - α)^L     (19)

To guarantee that the global probability is higher than α_T, we thus choose L as:

L = ⌊ ln(1 - α_T) / ln(1 - α) ⌋ + 1     (20)

The estimated total search time T_t can be expressed as a function of α:

T_t(α) = L · T(α) = ( ⌊ ln(1 - α_T) / ln(1 - α) ⌋ + 1 ) · T(α)     (21)

where T(α) is the search time in one of the L indexes. If it exists, this function has a minimum where its derivative is null, leading to the following equality:

T'(α) / T(α) = - ln(1 - α_T) / [ (1 - α) ln(1 - α) ( ln(1 - α) + ln(1 - α_T) ) ]     (22)

The second term is a strictly decreasing function that tends to infinity when α tends to zero and to zero when α tends to one. The first term is the logarithmic derivative of T(α). It is always higher than zero due to the growth of T(α) with α, and it is usually increasing with α due to the exponential growth of T(α). Thus, T_t(α) usually has a unique minimum at α = α_min that can be determined experimentally by searching a single index with varying values of α, measuring T(α) and minimizing T_t(α) according to Equation 21. Such an estimation is illustrated in Figure 1 for the three datasets described in the experimental section 4 and α_T = 0.95. The minimum of T_t(α) is achieved respectively at α_min = 0.44 for the HSV dataset, α_min = 0.57 for the SIFT dataset and α_min = 0.78 for the DIPOLE dataset. From these values, we can derive the optimal value of L thanks to Equation 20, which gives respectively L = 5, L = 4 and L = 2.

Figure 1: Theoretical global search time T_t(α) vs. the success probability control parameter α for the HSV, SIFT and DIPOLE datasets (global quality control parameter set to α_T = 0.95). Minima are achieved respectively at α_min = 0.44 for the HSV dataset, α_min = 0.57 for the SIFT dataset and α_min = 0.78 for the DIPOLE dataset.
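The parameter-setting procedure of this section reduces to a few lines. A sketch under the stated assumptions (Equations 20 and 21), where measured_times[i] is the single-index search time T(alphas[i]) measured experimentally:

    from math import floor, log

    def tables_needed(alpha, alpha_t):
        # Equation 20: smallest L such that 1 - (1 - alpha)^L >= alpha_t
        return floor(log(1.0 - alpha_t) / log(1.0 - alpha)) + 1

    def best_alpha(alphas, measured_times, alpha_t):
        # Equation 21: minimize T_t(alpha) = L(alpha) * T(alpha) over the grid
        costs = [tables_needed(a, alpha_t) * t for a, t in zip(alphas, measured_times)]
        i = costs.index(min(costs))
        return alphas[i], tables_needed(alphas[i], alpha_t)

For instance, tables_needed(0.57, 0.95) returns 4, matching the value L = 4 derived above for the SIFT dataset.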
3.4 Advantages over other multi-probe LSH schemes

We summarize here the advantages of our probabilistic multi-probe LSH technique compared to other multi-probe LSH techniques:

1. More efficient filtering: taking into account the prior distribution of the searched objects significantly reduces the number of probes required to achieve a given recall. High recall can even be obtained efficiently with a single hash table.

2. Search quality control and parameter estimation: the relevance criterion of a hash bucket being a probability and not a likelihood score, it provides a coarse estimation of the probability of finding relevant objects without tuning. Having such an estimation also allows the required number of hash tables L to be determined automatically, again without tuning.

3. Genericity and query adaptivity: our probabilistic filtering algorithm is fully independent of the query type. It just requires query samples and corresponding sets of relevant objects, not necessarily nearest neighbors. Examples of other relevant objects are distorted features obtained after transformation of a multimedia content, or nearest neighbors of the query in another dataset (e.g. a category-specific dataset or a training dataset). The search can also be easily adapted to different objectives by pre-computing different prior models and corresponding probability look-up tables for the same index structure. A typical application is to achieve class-dependent queries.

4. EXPERIMENTS

4.1 Experimental setup

This section describes the configuration of our experiments, including the evaluation datasets, benchmarks, metrics, and some implementation details.

4.1.1 Evaluation datasets

All the experiments are based on the three following visual feature datasets. The dataset sizes are chosen such that the index data structure of the basic LSH method can fit in main memory.

HSV dataset: A set of common 120-dimensional HSV histograms extracted from a collection of 512,927 images collected from the web in a design-oriented perspective (European project TRENDS).

SIFT dataset: A set of common SIFT local features [16] extracted from a collection of 350 images randomly built from the ImagEval benchmark corpus. The particularity of such features is that the generated vectors are very sparse, with a large number of null components.

DIPOLE dataset: A set of 5,405,324 local features based on oriented dissociated dipoles, extracted around multi-resolution Harris interest points [11]. Contrary to the two previous datasets, such features are not histograms but differential operators with dense distributions. The source image collection includes 5,474 images built from the ImagEval benchmark corpus.

Table 1 summarizes the dimension and size of the three datasets.

dataset   size        dimension
HSV       512,927     120
SIFT      523,…       128
DIPOLE    5,405,324   …

Table 1: Experimental datasets summary

4.1.2 Evaluation benchmark

For each dataset, we randomly picked a set Q of 1000 objects as query objects. Depending on the experiment, the ideal answer of each query (ground truth) is defined either by the K nearest neighbors or by all the objects in a given range R (not including the query itself). We used the Euclidean distance for all datasets. Unless otherwise specified, we use a K-nearest neighbor search with K = 100. The performances are evaluated in two main aspects: search quality and speed. Search quality is measured by recall:

recall = |I(Q) ∩ A(Q)| / |I(Q)|     (23)

where I(Q) is the ideal set of answers over all queries and A(Q) the set of actual answers over all queries. Note that we do not need to consider precision here, since all candidate objects found in checked hash buckets are filtered at query time according to the query parameters (top-K candidates, or all candidates whose distance is below R for range queries). Search speed is measured by averaging the query time over the 1000 queries.

4.1.3 Hardware

The evaluation is done on a PC with one 64-bit 2 GHz CPU, a 1024 KB L2 cache and 6 GB of RAM.

4.2 Experimental results

4.2.1 Success probability criterion comparison

Although all multi-probe LSH approaches visit multiple buckets for each hash table, they are very different in how they probe those buckets. In [17], Lv et al. already showed that their query-directed probing algorithm based on a likelihood relevance criterion required a substantially smaller number of probes than the entropy-based method of [19] or than a simple step-wise probing. We thus only compare their likelihood relevance criterion to our probabilistic filtering algorithm, within our own implementation of the technique. To compare the two approaches in detail, we measure the number of visited hash buckets for varying recall values. As the likelihood-based method does not enable any control of the search quality, we vary directly the number of visited buckets (since it is the main parameter of this technique) and measure the resulting recall. For our method, we vary the value of the search quality parameter α and measure the resulting average number of visited buckets and the resulting recall. We first did this experiment for a single hash function (L = 1).
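The quality measure of Equation 23 above is a simple set overlap; assuming answer sets are represented as Python sets of object ids pooled over all queries:

    def recall(ideal_answers, actual_answers):
        # Equation 23: fraction of the ground-truth answers actually returned
        return len(ideal_answers & actual_answers) / len(ideal_answers)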
The results (number of visited buckets and recall) are then averaged over 10 randomly picked hash functions g_j. Figure 2 plots the results obtained for the HSV dataset. It shows that our method requires a substantially smaller number of probes to achieve the same recall. The ratio increases with the recall value and is about 5 for a typical recall equal to 0.44 (related to the optimal theoretical value α_min = 0.44, cf. section 3.3.4). Similar curves are obtained for the two other datasets: for the SIFT dataset, the gain is about 6 for a recall equal to 0.57 (related to α_min = 0.57); for the DIPOLE dataset, the ratio is about 18 for a recall equal to 0.78 (related to α_min = 0.78). In a second step, we did the same experiment using multiple hash functions, with the values of L derived from the optimization procedure described in section 3.3.4. Table 2 summarizes the results obtained using α_T = 0.95 for our technique, and at similar recall for the likelihood-based method. It shows that our a posteriori success probability criterion requires a substantially smaller number of probes to achieve similar recall. The reduction ratio is equal to 6.17 for the HSV dataset, 2.38 for the SIFT dataset and 8.9 for the DIPOLE dataset. To illustrate why our a posteriori method allows a more accurate selection of the probes, we computed some statistics on the key values of the nearest neighbours of 60,000 sample queries. Figure 3 plots the experimental distribution of key differences between a query and its nearest neighbours, for two different hash functions, on the SIFT dataset. It first shows that the neighbours of a query are not systematically contained in hash buckets adjacent to the query bucket, justifying the extension of our query-directed probing algorithm to non-adjacent buckets. It also illustrates the variability of the neighbours' distribution across different hash functions.

Figure 2: Number of probes required by likelihood multi-probe LSH and a posteriori multi-probe LSH to achieve a given search quality (HSV dataset, L = 1)

dataset   method         L   recall   nb of probes
HSV       a posteriori   …   …        …,813
          likelihood     …   …        …,500
SIFT      a posteriori   …   …        …,689
          likelihood     …   …        …,400
DIPOLE    a posteriori   …   …        …,752
          likelihood     …   …        …,200

Table 2: Search performance comparison of a posteriori probabilistic probing vs. likelihood probing

Figure 3: Hash key difference distribution (between the query hash key and its neighbours' hash keys) for two hash functions on the SIFT dataset

4.2.2 Search quality control

To evaluate the search quality control of our method, we vary the quality control parameter α_T for the three datasets and measure the resulting recall. The number of hash tables L is set to the default settings (L = 5 for HSV, L = 4 for SIFT and L = 2 for DIPOLE). The parameter α of the Probabilistic Query-directed Probing algorithm in each hash table is deduced from α_T through:

α = 1 - (1 - α_T)^(1/L)     (24)

The results are summarized in Table 3. They show that despite the independence hypothesis underlying the a posteriori probability, the quality control is fairly good and might be accurate enough for most applications. Note that the other multi-probe techniques do not allow such control at all, and that the quality control of the basic LSH technique gives similar errors while using range queries.

           Recall
α_T        HSV   SIFT   DIPOLE
…          …     …      …

Table 3: Experimental recall of 100-NN search for different values of the search quality control parameter α_T

4.2.3 Comparison to LSH

We compared our method to the Euclidean LSH method described in [6]. The source code of this method is kindly provided by the authors in the E²LSH package². Since this method is dedicated to range queries, we used this kind of queries. The radius R of the query for each dataset is set to the average distance of the exact 100 nearest neighbors. It was estimated on a set of 1000 queries sampled from the datasets using an exhaustive scan. Note that the provided LSH code includes a script that automatically computes the main parameters of LSH in the first stage of data structure construction, for a given dataset, a given set of queries, a given range R and a given quality control parameter α_T. The parameters are chosen so as to optimize the estimated query time. Since the E²LSH method requires a large amount of memory, the optimal parameters for large datasets might require an amount of memory greater than the available physical memory. Therefore, when choosing the optimal parameters, E²LSH takes into consideration the upper bound on the memory it can use. Results obtained at constant recall on the three datasets are given in Table 4. They show that our method drastically reduces the space requirement (L is 18 to 63 times smaller) while significantly reducing the search time. The time efficiency of LSH on the DIPOLE dataset is very bad, due to its larger size and the memory limitation. The links between space requirement and time efficiency are discussed further through two more detailed experiments.

² andoni/lsh/

dataset   method             L   recall   query time (s)
          exh. scan          …   …        …
HSV       LSH                …   …        …
          a posteriori LSH   …   …        …
          exh. scan          …   …        …
SIFT      LSH                …   …        …
          a posteriori LSH   …   …        …
          exh. scan          …   …        …
DIPOLE    LSH                …   …        …
          a posteriori LSH   …   …        …

Table 4: Search performance comparison between our a posteriori LSH method, basic LSH and exhaustive scan

4.2.4 Space requirement vs time efficiency

Since the main objective of multi-probe LSH methods is to drastically reduce the large space requirements of LSH, it is interesting to compare the time efficiency of both techniques according to the space requirements. To do that, we artificially vary the amount of available memory passed to the LSH optimization script and recompute the LSH parameters and structures for each upper bound on memory. We then re-apply our benchmarking procedure on each derived structure. For our technique, we only consider the single result obtained with the default parameters. The space requirement of LSH is measured by the ratio between the index size and the data size. Comparative time efficiency is measured by the ratio between the LSH query time and the query time of our method. Figure 4 plots this time ratio according to the space requirement of LSH, for the HSV dataset. Note that the space ratio of our technique for this dataset is equal to 0.125, which means that the index is 8 times smaller than the data itself. The results show that our method is always faster than LSH, since the time ratio is always larger than 1. For a reasonable space requirement of 1 (index size equal to data size), our method is about 15 times faster than LSH. Since the curve converges for large space ratios, we can also estimate that our method is about 2 times faster for unlimited memory space.

Figure 4: Search time ratio (LSH / our method) according to LSH space requirement (normalized by dataset size)

4.2.5 Influence of dataset size

To evaluate the influence of dataset size, we vary the size of the DIPOLE dataset, which is the largest one. Each sub-dataset is built by randomly picking objects in the full dataset and is then indexed by both techniques. The quality control parameter of both techniques was set to α_T = 0.95. The script optimizing the LSH parameters was applied to each sub-dataset. The parameters of our method for each dataset were computed according to the procedure described in section 3.3.4. We also applied an exhaustive scan on each sub-dataset to have a baseline linear reference. The results are plotted in Figure 5, in normal and logarithmic coordinates. They show that at constant available memory, the increase of LSH search time over dataset size is supra-linear, whereas that of our method is sub-linear. Note that the number of hash tables used by LSH (optimized by the LSH script according to the available memory) decreases from L = 378 to L = 36 when the dataset size increases from N = 100,000 objects to N = 5,405,324 objects. This is due to the fact that the ideal dimension k of the hash functions increases with the size of the dataset, and since L increases with k, an upper limit on memory imposes an upper limit on k and L.

Figure 5: Search time efficiency comparison when varying the size of the dataset; the bottom graph represents the same curves in logarithmic coordinates

5. CONCLUSION AND FUTURE WORK

In this paper, we presented a new similarity search technique that can be used to build efficient content-based search systems on feature-rich multimedia data. The technique is inspired by previous theoretical works on multi-probe Locality Sensitive Hashing and improves them by taking into account prior knowledge about the searched objects, through an efficient probabilistic query-directed probing algorithm. This technique allows a better quality control of the search and a more accurate selection of the most probable buckets. We showed in the experiments that the number of required probes can be reduced significantly compared to the commonly used likelihood-based success criterion. Comparisons to the basic LSH technique show that our method allows consistent improvements in both space and time efficiency. Furthermore, we think that this technique has a high potential regarding new multimedia systems involving context-aware or personalized retrieval mechanisms. The prior knowledge used by our filtering algorithm can indeed be easily adapted to different contextual or personalized knowledge. The retrieval will therefore automatically be focused on the context-aware or personalized targeted objects, while making the search more efficient.

Future works could address two improvements of the proposed technique. The first one is to model more reliable prior knowledge, e.g. by deriving a more reliable conditional prior distribution p_{v|q} in the original feature space. The second one is to use the prior knowledge not only for the similarity search but also for the index construction. The hash functions of common LSH techniques are indeed randomly selected independently of the dataset and the targeted objects. Consistent improvements could be achieved by generating the hash functions according to data-dependent distributions.

6. ACKNOWLEDGMENTS

This work was funded by the European Commission within the VITALAS project.

7. REFERENCES

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1), 2008.
[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proc. of Int. Conf. on Machine Learning, New York, NY, USA, 2006.
[3] M. Casey and M. Slaney. Song intersection by approximate nearest neighbour search. In Proc. Int. Symp. on Music Information Retrieval, 2006.
[4] P. Ciaccia and M. Patella. PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proc. of Int. Conf. on Data Engineering, 2000.
[5] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. of Int. Conf. on Very Large Data Bases, 1997.
[6] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. of Symposium on Computational Geometry, 2004.
[7] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. El Abbadi. Approximate nearest neighbor searching in multimedia databases. In Proc. of Int. Conf. on Data Engineering, 2001.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. of ACM SIGMOD Conf. on Management of Data, 1984.
[9] M. E. Houle and J. Sakuma. Fast approximate similarity search in extremely high-dimensional data sets. In Proc. of Int. Conf. on Data Engineering, 2005.
[10] H. Jegou, L. Amsaleg, C. Schmid, and P. Gros. Query-adaptive locality sensitive hashing. In Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing. IEEE, to appear.
[11] A. Joly. New local descriptors based on dissociated dipoles. In CIVR '07: Proc. of the 6th ACM Int. Conf. on Image and Video Retrieval, 2007.
[12] A. Joly, O. Buisson, and C. Frélicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Trans. on Multimedia, 9(2), 2007.
[13] N.
Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. of ACM SIGMOD Int. Conf. on Management of Data, 1997.
[14] Y. Ke, R. Sukthankar, and L. Huston. Efficient near-duplicate detection and sub-image retrieval. In Proc. of ACM Int. Conf. on Multimedia, 2004.
[15] C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold. Clustering for approximate similarity search in high-dimensional spaces. IEEE Trans. on Knowledge and Data Engineering, 14(4), 2002.
[16] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. of Int. Conf. on Computer Vision, 1999.
[17] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proc. of Int. Conf. on Very Large Data Bases, 2007.
[18] B. Matei, Y. Shan, H. S. Sawhney, Y. Tan, R. Kumar, D. Huber, and M. Hebert. Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation. IEEE Trans. Pattern Anal. Mach. Intell., 28(7), 2006.
[19] R. Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proc. of Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.
[20] S. Poullot, O. Buisson, and M. Crucianu. Z-grid-based probabilistic retrieval for scaling up content-based copy detection. In CIVR '07: Proc. of the 6th ACM Int. Conf. on Image and Video Retrieval, 2007.
[21] R. Weber, H. J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. of Int. Conf. on Very Large Data Bases, 1998.
[22] P. Zezula, P. Savino, G. Amato, and F. Rabitti. Approximate similarity retrieval with M-trees. Very Large Data Bases Journal, 7(4), 1998.


More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database A Mult-step Strategy for Shape Smlarty Search In Kamon Image Database Paul W.H. Kwan, Kazuo Torach 2, Kesuke Kameyama 2, Junbn Gao 3, Nobuyuk Otsu 4 School of Mathematcs, Statstcs and Computer Scence,

More information

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search Can We Beat the Prefx Flterng? An Adaptve Framework for Smlarty Jon and Search Jannan Wang Guolang L Janhua Feng Department of Computer Scence and Technology, Tsnghua Natonal Laboratory for Informaton

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search?

What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search? IEEE Internatonal Conference on Computer Vson What Is the Most Effcent Way to Select Nearest Neghbor Canddates for Fast Approxmate Nearest Neghbor Search? Masakazu Iwamura, Tomokazu Sato and Koch Kse Graduate

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Image Alignment CSC 767

Image Alignment CSC 767 Image Algnment CSC 767 Image algnment Image from http://graphcs.cs.cmu.edu/courses/15-463/2010_fall/ Image algnment: Applcatons Panorama sttchng Image algnment: Applcatons Recognton of object nstances

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Accounting for the Use of Different Length Scale Factors in x, y and z Directions

Accounting for the Use of Different Length Scale Factors in x, y and z Directions 1 Accountng for the Use of Dfferent Length Scale Factors n x, y and z Drectons Taha Soch (taha.soch@kcl.ac.uk) Imagng Scences & Bomedcal Engneerng, Kng s College London, The Rayne Insttute, St Thomas Hosptal,

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Summarizing Data using Bottom-k Sketches

Summarizing Data using Bottom-k Sketches Summarzng Data usng Bottom-k Sketches Edth Cohen AT&T Labs Research 8 Park Avenue Florham Park, NJ 7932, USA edth@research.att.com Ham Kaplan School of Computer Scence Tel Avv Unversty Tel Avv, Israel

More information

Simplification of 3D Meshes

Simplification of 3D Meshes Smplfcaton of 3D Meshes Addy Ngan /4/00 Outlne Motvaton Taxonomy of smplfcaton methods Hoppe et al, Mesh optmzaton Hoppe, Progressve meshes Smplfcaton of 3D Meshes 1 Motvaton Hgh detaled meshes becomng

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Data-dependent Hashing Based on p-stable Distribution

Data-dependent Hashing Based on p-stable Distribution Data-depent Hashng Based on p-stable Dstrbuton Author Ba, Xao, Yang, Hachuan, Zhou, Jun, Ren, Peng, Cheng, Jan Publshed 24 Journal Ttle IEEE Transactons on Image Processng DOI https://do.org/.9/tip.24.2352458

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity Journal of Sgnal and Informaton Processng, 013, 4, 114-119 do:10.436/jsp.013.43b00 Publshed Onlne August 013 (http://www.scrp.org/journal/jsp) Corner-Based Image Algnment usng Pyramd Structure wth Gradent

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Mult-stable Percepton Necker Cube Spnnng dancer lluson, Nobuuk Kaahara Fttng and Algnment Computer Vson Szelsk 6.1 James Has Acknowledgment: Man sldes from Derek Hoem, Lana Lazebnk, and Grauman&Lebe 2008

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Computer Animation and Visualisation. Lecture 4. Rigging / Skinning

Computer Animation and Visualisation. Lecture 4. Rigging / Skinning Computer Anmaton and Vsualsaton Lecture 4. Rggng / Sknnng Taku Komura Overvew Sknnng / Rggng Background knowledge Lnear Blendng How to decde weghts? Example-based Method Anatomcal models Sknnng Assume

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Learning an Image Manifold for Retrieval

Learning an Image Manifold for Retrieval Learnng an Image Manfold for Retreval Xaofe He*, We-Yng Ma, and Hong-Jang Zhang Mcrosoft Research Asa Bejng, Chna, 100080 {wyma,hjzhang}@mcrosoft.com *Department of Computer Scence, The Unversty of Chcago

More information

All-Pairs Shortest Paths. Approximate All-Pairs shortest paths Approximate distance oracles Spanners and Emulators. Uri Zwick Tel Aviv University

All-Pairs Shortest Paths. Approximate All-Pairs shortest paths Approximate distance oracles Spanners and Emulators. Uri Zwick Tel Aviv University Approxmate All-Pars shortest paths Approxmate dstance oracles Spanners and Emulators Ur Zwck Tel Avv Unversty Summer School on Shortest Paths (PATH05 DIKU, Unversty of Copenhagen All-Pars Shortest Paths

More information

MOTION BLUR ESTIMATION AT CORNERS

MOTION BLUR ESTIMATION AT CORNERS Gacomo Boracch and Vncenzo Caglot Dpartmento d Elettronca e Informazone, Poltecnco d Mlano, Va Ponzo, 34/5-20133 MILANO boracch@elet.polm.t, caglot@elet.polm.t Keywords: Abstract: Pont Spread Functon Parameter

More information

A fast algorithm for color image segmentation

A fast algorithm for color image segmentation Unersty of Wollongong Research Onlne Faculty of Informatcs - Papers (Arche) Faculty of Engneerng and Informaton Scences 006 A fast algorthm for color mage segmentaton L. Dong Unersty of Wollongong, lju@uow.edu.au

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts Selectng Query Term Alteratons for Web Search by Explotng Query Contexts Guhong Cao Stephen Robertson Jan-Yun Ne Dept. of Computer Scence and Operatons Research Mcrosoft Research at Cambrdge Dept. of Computer

More information