Nearest Keyword Set Search in Multi-dimensional Datasets


Vishwakarma Singh
Department of Computer Science, University of California, Santa Barbara, USA

Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara, USA

arXiv: v1 [cs.DB] 12 Sep 2014

Abstract

Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our empirical studies, both on real and synthetic datasets, show that ProMiSH has a speedup of more than four orders of magnitude over state-of-the-art tree-based techniques. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries having up to 9 keywords show that ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size.

I. INTRODUCTION

Objects (e.g., images, chemical compounds, or documents) are often characterized by a collection of relevant features, and are commonly represented as points in a multi-dimensional attribute space. For example, images (chemical compounds) are represented using color (molecule) feature vectors. These objects also very often have descriptive text information associated with them, e.g., images are tagged with locations. In this paper, we consider multi-dimensional datasets where each data point has a set of keywords. The presence of keywords allows for the development of new tools for querying and exploring these multi-dimensional datasets.

In this paper, we study nearest keyword set search (NKS) queries on text-rich multi-dimensional datasets. An NKS query is a set of user provided keywords.
The top-1 result of an NKS query is a set of data points that contains all the query keywords and whose points form the tightest cluster in the multi-dimensional space. Figure 1 illustrates an NKS query. The multi-dimensional points in the dataset are represented by dots. Each point has a unique identifier and is tagged with a set of keywords. For a query Q={a, b, c}, the set of points {7, 8, 9} contains all the query keywords {a, b, c}, and its points are nearer to each other than those of any other set of points containing these query keywords. Therefore, the set {7, 8, 9} is the top-1 result for the query Q.

Fig. 1. An example of an NKS query on a keyword tagged multi-dimensional dataset. Query is Q={a, b, c}. The top-1 result is the set of points {7, 8, 9}.

NKS queries are useful for many applications, e.g., photo-sharing social networks, web search engines, map services, GIS systems [1], subgraph search, and for geo-tagging of objects and regions [2]. Consider a photo-sharing social network like Facebook where photos are tagged with people names and locations. These photos can be embedded in a high-dimensional feature space of texture, color, or shape [3], [4]. Here an NKS query can find a group of similar photos which contains a set of people. NKS searches are also useful when labeled graphs are embedded in a high dimensional space (e.g., through Lipschitz embedding [5]) for ease of processing. In this case, a search for a subgraph that has the needed labels can be answered by an NKS search in the embedded space [6]. NKS queries can also reveal geographic patterns. GIS can characterize a region by a high-dimensional set of attributes, e.g., pressure, humidity, and soil types. Additionally, these regions can also be tagged with information such as diseases. An epidemiologist can use NKS queries to discover a pattern by finding a set of similar regions which contains all the diseases of her interest.

Query Definition: Let $D \subset R^d$ be a d-dimensional dataset having N points. Each point $o \in D$ has a unique identifier (id).
Each point is also tagged with a set of keywords $\sigma(o) = \{v_1, \ldots, v_t\} \subseteq V$, where V is a dictionary of size U of all the unique keywords in D. We use $L_2$ (Euclidean norm) to measure the distance between any two points, i.e., $dist(o_i, o_j) = \|o_i - o_j\|_2$. We measure the nearness of a set of points A by the maximum distance between any two points in A, called the diameter r(A):

$$r(A) = \max_{o_i, o_j \in A} \|o_i - o_j\|_2$$

A relatively small value of r(A) implies that the corresponding objects are more similar to each other. A q-size NKS query $Q = \{v_{Q1}, \ldots, v_{Qq}\}$ has q unique keywords provided by a user. A set $A \subseteq D$ is a possible result, called a candidate, of Q if it contains points for all the query keywords, i.e., $Q \subseteq \bigcup_{o \in A} \sigma(o)$, and no subset of A does so. We allow overlapping candidates. If S is the set of all candidates of Q, then a result of Q is the candidate $A^*$ such that $A^* = \arg\min_{A \in S} r(A)$. A top-k NKS query retrieves the k candidates having the least diameter. If two candidates have equal diameters, then they are further ranked by their cardinality.

TABLE I. A glossary of notations used in the paper.

D : A dataset
V : A dictionary of unique keywords in D
Q : A set of keywords comprising a query
o : A point in D
v : A keyword
N(v) : Number of points in D having keyword v
N : Number of points in D
U : Number of unique keywords in D
q : Number of keywords in a query Q
d : Number of dimensions of a point
t : Average number of keywords per point
k : Number of top results
$w_0$ : Initial bin-width for a hashtable
m : Number of unit random vectors used for projection
L : Number of Hashtable-Inverted Index structures
s : A scale value
r : Diameter of a set of points
z : A d-dimensional unit random vector

We can also measure the nearness of a set of points A by the sum of all its pairwise distances s(A). Here we show with an example that s(A) does not yield a tighter cluster than r(A). Let $A = \{o_1, o_2, o_3, o_4\}$ be a set of points with the following pairwise distances: $\{d(o_1, o_2)=2, d(o_1, o_3)=1, d(o_2, o_3)=4, d(o_1, o_4)=3, d(o_2, o_4)=3, d(o_3, o_4)=8\}$. For the set $A_1 = \{o_1, o_2, o_3\}$ we have $r(A_1)=4$ and $s(A_1)=7$, whereas for the set $A_2 = \{o_1, o_2, o_4\}$ we have $r(A_2)=3$ and $s(A_2)=8$. Here we see that $A_2$ has the smaller diameter whereas $A_1$ has the smaller sum of pairwise distances. In this paper we use the diameter.

A search using a tree-based index was proposed by Zhang et al. [2], [7] to solve NKS queries on multi-dimensional datasets. The performance of this algorithm deteriorates sharply with an increase in the dimension of the dataset as the pruning techniques become ineffective. Our empirical results show that this algorithm may take hours to terminate for a high-dimensional dataset having only a few thousand points. The authors also noted that the tree-based algorithm does not scale with the dimension of the dataset. As discussed previously, NKS queries are useful for applications of varying dimensions.
Therefore, there is a need for an efficient algorithm that scales linearly with the dataset dimension and yields practical query times on large datasets.

We propose ProMiSH (Projection and Multi-Scale Hashing) to efficiently solve NKS queries. We present an exact (ProMiSH-E) and an approximate (ProMiSH-A) version of the algorithm. ProMiSH-E always retrieves the true top-k results, and therefore has 100% accuracy. ProMiSH-A is much more time and space efficient but returns results whose diameters are within a small approximation ratio of the diameters of the true results. Both algorithms scale linearly with the dataset dimension, the dataset size, the query size, and the result size. Thus, ProMiSH possesses all three desired characteristics of a good search algorithm: 1) high quality of results (accuracy), 2) high efficiency, and 3) good scalability.

ProMiSH-E uses a set of hashtables and inverted indices to perform a localized search of the results. ProMiSH-E hashtables are inspired from Locality Sensitive Hashing (LSH) [8], which is a state-of-the-art method for the nearest neighbor search in high-dimensional spaces. The index structure of ProMiSH-E supports accurate search, unlike LSH-based methods that allow only approximate search with probabilistic guarantees. ProMiSH-E creates hashtables at multiple bin-widths, called scales. A search in a hashtable yields subsets of points that contain query results. ProMiSH-E explores each subset using a novel pruning based strategy. An optimal strategy is NP-Hard; therefore, ProMiSH-E uses a greedy approach. ProMiSH-A is an approximate variation of ProMiSH-E to achieve even more space and time efficiency.

We evaluated the performance of ProMiSH on both real and synthetic datasets. We used the state-of-the-art Virtual bR*-Tree [2] as a reference method for comparison. The empirical results reveal that ProMiSH consistently outperforms Virtual bR*-tree on datasets of all dimensions.
The difference in performance of ProMiSH and Virtual bR*-tree grows to more than four orders of magnitude with an increase in the dataset dimension, the dataset size, and the query size. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries of sizes up to 9 show that ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size. Our datasets had as many as 24,874 unique keywords and a data point was tagged with a maximum of 14 keywords. The space cost analysis of the algorithms shows that ProMiSH-A is much more space efficient than both ProMiSH-E and Virtual bR*-tree.

Our main contributions are: (1) a novel multi-scale index for scalable answering of NKS queries, (2) an efficient candidate generation technique from a subset of points, and (3) extensive empirical studies.

The paper is organized as follows. A literature survey is presented in section II. Index structures are described in section III. An exact search algorithm (ProMiSH-E) to find subsets of points containing the results is given in section IV. Section V discusses how answers are generated from the subsets. The approximate algorithm (ProMiSH-A) and an analysis of its approximation ratio are presented in section VI. The complexity of ProMiSH is analyzed in section VII. Empirical results are presented in section VIII. We discuss an extension of ProMiSH to disk in section IX. Finally, we provide conclusions and future work in section X. A glossary of the notations is shown in table I.

II. LITERATURE SURVEY

A variety of queries, semantically different from our NKS queries, have been studied in the literature on text-rich spatial datasets. Location-specific keyword queries on the web and in GIS systems [9], [10], [11], [12] were earlier answered using a combination of R-Tree [13] and inverted index. Felipe et al. [14] developed IR$^2$-Tree to rank objects from spatial datasets based on a combination of their distances to the query locations and the relevance of their text descriptions to the query keywords. Cong et al. [15] integrated R-tree and inverted file to answer a query similar to Felipe et al. [14] using a different ranking function. Martins et al. [16] computed text relevancy and location proximity independently, and then combined the two ranking scores. Cao et al. [17] recently proposed a method to retrieve a group of spatial web objects such that the group's keywords cover the query's keywords and the objects in the group are nearest to the query location and have the lowest inter-object distances. Other keyword-based queries on spatial datasets are aggregate nearest keyword search in spatial databases [18], top-k preferential query [19], finding top-k sites in spatial data based on their influence on feature points [20], and optimal location queries [21], [22].
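As a point of reference for the index-based methods surveyed here, the NKS query defined in section I can be answered by exhaustive enumeration on a very small dataset. The following is a minimal sketch of such a baseline, not a method from the paper: the function names, the `data` layout (point id mapped to coordinates and a keyword set), and the toy points are all assumptions made for illustration. It enumerates one point per query keyword, keeps only minimal candidates, and ranks them by diameter and then by cardinality.

```python
from itertools import product
from math import dist


def diameter(points):
    """r(A): maximum pairwise L2 distance within a set of points."""
    return max((dist(p, q) for p in points for q in points), default=0.0)


def is_minimal(cand, data, query):
    """A candidate is minimal if dropping any one point loses a query keyword."""
    for pid in cand:
        covered = set().union(*(data[p][1] for p in cand - {pid}))
        if set(query) <= covered:
            return False
    return True


def brute_force_nks(data, query, k=1):
    """Top-k candidates by diameter (ties broken by cardinality).

    data  : dict, point id -> (coordinates, set of keywords)  [assumed layout]
    query : list of query keywords
    """
    groups = [[pid for pid, (_, kws) in data.items() if v in kws] for v in query]
    seen, results = set(), []
    for combo in product(*groups):
        cand = frozenset(combo)  # one point may cover several keywords
        if cand in seen or not is_minimal(cand, data, query):
            continue
        seen.add(cand)
        r = diameter([data[pid][0] for pid in cand])
        results.append((r, len(cand), sorted(cand)))
    results.sort()
    return results[:k]


# Hypothetical toy dataset: five 2-d points with keyword tags.
data = {
    1: ((0.0, 0.0), {"a"}),
    2: ((1.0, 0.0), {"b"}),
    3: ((0.0, 1.0), {"c"}),
    4: ((10.0, 10.0), {"a", "b"}),
    5: ((10.0, 11.0), {"c"}),
}
top2 = brute_force_nks(data, ["a", "b", "c"], k=2)
print(top2)  # the tightest candidate here is points 4 and 5
```

The enumeration grows as the product of the group sizes, which is why a baseline like this is only usable on toy inputs.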

Fig. 2. Division of projected values of points on a unit random vector into overlapping bins of equal width w=2r.

Our NKS query is similar to the m-closest keywords query of Zhang et al. [7]. They designed bR*-tree based on R*-tree [23] that also stores bitmaps and minimum bounding rectangles (MBRs) of keywords in every node along with points' MBRs. The candidates are generated by the apriori algorithm [24]. They prune unwanted candidates based on the distances between MBRs of points or keywords and the best found diameter. Their pruning techniques become ineffective with an increase in the dataset dimension as there is a large overlap between MBRs due to the curse of dimensionality. This leads to an exponential number of candidates and large query times. A poor estimation of the starting diameter further worsens the performance of their algorithm. bR*-tree also suffered from a high storage cost; therefore Zhang et al. modified bR*-tree to create Virtual bR*-tree [2] in memory at run time. Virtual bR*-tree is created from a pre-stored R*-Tree which indexes all the points, and an inverted index which stores keyword information and the path from the root node in R*-Tree for each point. Both bR*-tree and Virtual bR*-tree are structurally similar, and use similar candidate generation and pruning techniques. Therefore, Virtual bR*-tree shares the performance weaknesses of bR*-tree.

Tree-based indices, e.g., R-Tree [13] and M-Tree [25], have been researched extensively for an efficient near neighbor search in high-dimensional spaces. These indices fail to scale to dimensions greater than 10 because of the curse of dimensionality [26]. VA-file [26] and iDistance [27] provide better scalability with the dataset dimension. However, the task of designing an efficient method for solving NKS queries by adapting VA-file or iDistance is not obvious.

Random projections [28] with hashing [29], [30], [8], [31], [32] have come to be the state-of-the-art method for an efficient near neighbor search in high-dimensional datasets. Datar et al. [8] used random vectors constructed from p-stable distributions to project points, and then computed hash keys for the points by splitting the line of projected values into disjoint bins. They concatenated hash keys obtained for a point from m random vectors to create a final hash key for the point. All points were indexed into a hashtable using their hash keys. Our index structure is inspired from the same.

Multi-way distance joins of a set of multi-dimensional datasets, each of which is indexed into an R-Tree, have been studied in the literature [33], [34]. As discussed above, a tree-based index fails to scale with the dimension of the dataset. Further, it is not straightforward to adapt these algorithms if every query requires a multi-way distance join only on a subset of the points of each dataset.

III. INDEX FOR EXACT SEARCH

In this section, we describe the index structure of ProMiSH-E. It has two main data structures. The first data structure is a keyword-point inverted index $I_{kp}$ that indexes all the points in the dataset D using their keywords. $I_{kp}$ is shown with a dashed rectangle in figure 3. The second data structure consists of multiple hashtables and their corresponding inverted indices. We call a hashtable H together with its corresponding inverted index $I_{khb}$ a HI structure.

Fig. 3. Index structure and flow of execution of ProMiSH.

We create a hashtable H as follows. We randomly choose m d-dimensional unit vectors. We compute the projection $z.o$ of each point o in D on each unit random vector z.
Next, we split each line of projected values into consecutive overlapping bins of width w as shown in figure 2. Here a bin is equally overlapped by two other bins. We assign each point o a hash key based on the bin in which it lies. Since the line is split into overlapping bins, each point o lies in two bins, and therefore gets two hash keys $\{b_1, b_2\}$ from each unit random vector z. For example, the line of projected values in figure 2 has been split into overlapping bins {x1, x2, x3, y1, y2, y3}. Point o lies in bins x1 and y2, and therefore gets two hash keys corresponding to each of the bins. We compute hash keys using equations 1 and 2:

$$h_1(o) = \left\lfloor \frac{z.o}{w} \right\rfloor \qquad (1)$$

$$h_2(o) = \left\lfloor \frac{z.o - \frac{w}{2}}{w} \right\rfloor + C \qquad (2)$$

where C is a constant to distinguish values of $h_1$ and $h_2$. A value of C can be $(\max(h_1) - \min(h_1) + 2)$.

We get m pairs of hash keys for each data point o using m unit random vectors. We take a cartesian product of these m pairs of hash keys to generate $2^m$ signatures for each point o. A signature $sig(o) = \{b_{j1}, \ldots, b_{jm}\}$ of a point o contains a hash key from each of the m pairs. For example, let $z_1$ and $z_2$ be two unit random vectors for m=2. Let the hash keys of a point o be $\{x_1, y_1\}$ from $z_1$ and $\{x_2, y_2\}$ from $z_2$. ProMiSH creates the four signatures $\{x_1 x_2, x_1 y_2, y_1 x_2, y_1 y_2\}$ for o by a cartesian product. We hash each point o using each of its $2^m$ signatures as a hash key into the hashtable H. A signature sig(o) of a point o is converted into a hashtable bucket identifier (bucket id) using a standard hash function, e.g., $(\sum_i b_{ji} \times pr_i) \bmod \text{hashtable size}$, where $pr_i$ is a random prime number. We store a point just by its id in the hash bucket.

For each hashtable H, we create a corresponding inverted index $I_{khb}$. For each bucket of H, we compute the union of keywords of its points. Then, we index each bucket of the hashtable H against each of the unique keywords it contains in the inverted index $I_{khb}$. We show a HI structure in figure 3 with a dotted rectangle.

We create HI structures for increasing bin-width $w = w_0 2^s$, where $w_0$ is the initial bin-width and $s \in \{0, \ldots, L-1\}$ is the scale. If $p_{Max}$ is the maximum span of projected values of points on any unit random vector, then

$$L = \left\lceil \log_2 \left( \frac{p_{Max}}{w_0} \right) \right\rceil \qquad (3)$$

IV. EXACT SEARCH (PROMISH-E)

Here we describe the algorithm ProMiSH-E to find subsets of points that contain the true query results. First, we introduce lemmas which guarantee that ProMiSH-E always retrieves the true top-k results using the index structure. Then, we describe the steps of ProMiSH-E to find the subsets. The algorithm to find results from these subsets is described in section V.

Lemma 1: Let $R^d$ be a d-dimensional Euclidean space. Let z be a vector uniformly picked from the unit (d-1)-sphere such that $z \in R^d$ and $\|z\|_2 = 1$. For any two points $o_1$ and $o_2$ in $R^d$, we have $\|o_1 - o_2\|_2 \ge \|z.o_1 - z.o_2\|_2$.

Proof: Since a Euclidean space with dot product is an inner product space, we have

$$\|z.o_1 - z.o_2\|_2 = \|z.(o_1 - o_2)\|_2 \le \|z\|_2 \|o_1 - o_2\|_2 = \|o_1 - o_2\|_2 \quad \text{since } \|z\|_2 = 1.$$

The inequality follows from the Cauchy-Schwarz inequality.

Lemma 2: If a set of points $A = \{o_1, \ldots, o_n\}$ in $R^d$ with diameter r is projected onto a d-dimensional unit random vector z, and the line is split into overlapping bins of equal width $w \ge 2r$, then there exists a bin containing all the points of set A.

Proof: From lemma 1 and the definition of diameter, we have $\forall o_i, o_j \in A$, $\|z.o_i - z.o_j\| \le \|o_i - o_j\| \le r$. Therefore, the span of projected values of the points in set A, i.e., $\max(z.o_1, \ldots, z.o_n) - \min(z.o_1, \ldots, z.o_n)$, is $\le r$. Since the line is split into overlapping bins of width 2r, it follows from the construction, as shown in figure 2, that a line segment of width r is fully contained in one of the bins. Hence, all the points in set A will lie in the same bin.

We illustrate here with an example how lemma 2 guarantees retrieval of the true results. For a query Q, let the diameter of its top-1 result be r. We project all the data points in D on a unit random vector and split the projected values into overlapping bins of bin-width 2r.
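The hash keys of equations 1 and 2 and the containment property of lemma 2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's implementation: the helper names, the fixed constant `C = 10**6` (any value large enough to keep the two key ranges apart would do), the use of tuples instead of concatenated keys, and the random test points are all hypothetical. It builds the $2^m$ signatures of each point and confirms that a set of diameter r shares at least one signature when the bin width is slightly above 2r.

```python
import itertools
import math
import random


def unit_random_vector(d, rng):
    """Draw a direction uniformly from the unit (d-1)-sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hash_keys(z, o, w, C):
    """Equations 1 and 2: keys of the two overlapping bins containing z.o."""
    p = sum(zi * oi for zi, oi in zip(z, o))
    h1 = math.floor(p / w)
    h2 = math.floor((p - w / 2) / w) + C
    return h1, h2


def signatures(vectors, o, w, C):
    """Cartesian product of the m key pairs: 2**m signatures per point."""
    pairs = [hash_keys(z, o, w, C) for z in vectors]
    return set(itertools.product(*pairs))


rng = random.Random(0)
d, m, C = 8, 3, 10**6          # C is an assumed offset, see lead-in
A = [[rng.uniform(0.0, 1.0) for _ in range(d)] for _ in range(5)]
r = max(math.dist(p, q) for p in A for q in A)   # diameter of A
w = 2 * r + 1e-9               # w >= 2r, with a small floating-point margin
vectors = [unit_random_vector(d, rng) for _ in range(m)]

# Lemma 2, applied once per vector, implies all points share a signature.
common = set.intersection(*(signatures(vectors, o, w, C) for o in A))
print(len(common) >= 1)
```

A shared signature means all points of A land in the same hashtable bucket, which is what lets a search confined to one bucket find the whole candidate.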
If we then perform a search in each of these bins independently, lemma 2 guarantees that the top-1 result of query Q is found in one of the bins.

A flow of execution of ProMiSH-E is shown in figure 3. A search starts with the HI structure at scale s=0. ProMiSH-E finds buckets of hashtable H, each of which contains all the query keywords, using the inverted index $I_{khb}$. Then, ProMiSH-E explores each selected bucket using an efficient pruning based technique to generate results. ProMiSH-E terminates after exploring the HI structure at the smallest scale s such that the kth result has the diameter $r_k \le w_0 2^{s-1}$.

Algorithm 1 ProMiSH-E
In: Q: query keywords; k: number of top results
In: $w_0$: initial bin-width
1: $PQ \leftarrow [e([\,], +\infty)]$: priority queue of top-k results
2: HC: hashtable to check duplicate candidates
3: BS: bitset to track points having a query keyword
4: for all $o \in \bigcup_{v_Q \in Q} I_{kp}[v_Q]$ do
5:   $BS[o] \leftarrow$ true /* Find points having query keywords */
6: end for
7: for all $s \in \{0, \ldots, L-1\}$ do
8:   Get HI at s
9:   $E[\,] \leftarrow 0$ /* List of hash buckets */
10:   for all $v_Q \in Q$ do
11:     for all $bId \in I_{khb}[v_Q]$ do
12:       $E[bId] \leftarrow E[bId] + 1$
13:     end for
14:   end for
15:   for all $i \in (0, \ldots, SizeOf(E))$ do
16:     if E[i] = SizeOf(Q) then
17:       $F \leftarrow \emptyset$ /* Obtain a subset of points */
18:       for all $o \in H[i]$ do
19:         if BS[o] = true then
20:           $F \leftarrow F \cup o$
21:         end if
22:       end for
23:       if checkDuplicateCand(F, HC) = false then
24:         searchInSubset(F, PQ)
25:       end if
26:     end if
27:   end for
28:   /* Check termination condition */
29:   if $PQ[k].r \le w_0 2^{s-1}$ then
30:     Return PQ
31:   end if
32: end for
33: /* Perform search on D if algorithm has not terminated */
34: for all $o \in D$ do
35:   if BS[o] = true then
36:     $F \leftarrow F \cup o$
37:   end if
38: end for
39: searchInSubset(F, PQ)
40: Return PQ

Algorithm 1 details the steps of ProMiSH-E. It maintains a bitset BS. For each $v_Q \in Q$, ProMiSH-E retrieves the list of points corresponding to $v_Q$ from $I_{kp}$ in step 4. For each point o in the retrieved list, ProMiSH-E marks the bit corresponding to o's identifier in BS as true in step 5. Thus, ProMiSH-E finds all the points in D which are tagged with at least one query keyword. Next, the search continues in the HI structures, beginning at s=0. For any given scale s, ProMiSH-E accesses the HI structure created at the scale in step 8. ProMiSH-E retrieves all the lists of hash bucket ids corresponding to keywords in Q from the inverted index $I_{khb}$ in steps (10-11). An intersection of these lists yields a set of hash buckets each of which contains all the query keywords in steps (12-16). For the example in figure 3, this intersection yields the bucket id 2. For each selected hash bucket, ProMiSH-E retrieves all the points in the bucket from the hashtable H. ProMiSH-E filters these points using the bitset BS to get a subset of points F in steps (17-22). Subset F contains only those points which are tagged with at least one query keyword and is explored further.

Subset F is checked for whether it has been explored earlier using checkDuplicateCand (Algorithm 2) in step 23. Since each point is hashed using $2^m$ signatures, duplicate subsets may be generated. If F has not been explored earlier, then ProMiSH-E performs a search on it using searchInSubset (Algorithm 3) in step 24. Results are inserted into a priority queue PQ of size k. Each entry e([ ], r) of PQ is a tuple containing a set of points and the set's diameter. PQ is initialized with k entries, each of whose set is empty and whose diameter is $+\infty$. Entries of PQ are ordered by their diameters. Entries with equal diameters are further ordered by their set sizes. A new result is inserted into PQ only if its diameter is smaller than the kth smallest diameter in PQ. If ProMiSH-E does not terminate after exploring the HI structure at the scale s, then the search proceeds to HI at the scale (s + 1). ProMiSH-E terminates when the kth smallest diameter $r_k$ in PQ becomes less than or equal to half of the current bin-width $w = w_0 2^s$ in steps (29-31). Since $r_k \le \frac{w_0 2^s}{2}$, lemma 2 guarantees that each true candidate is fully contained in one of the bins of the hashtable, and therefore must have been explored. If ProMiSH-E fails to terminate after exploring HI at all the scale levels $s \in \{0, \ldots, L-1\}$, then it performs a search in the complete dataset D in steps (34-39).

Algorithm 2 checkDuplicateCand
In: F: a subset; HC: hashtable of subsets
1: $F \leftarrow sort(F)$
2: pr1: list of prime numbers; pr2: list of prime numbers;
3: for all $o \in F$ do
4:   $pr_1 \leftarrow$ randomSelect(pr1); $pr_2 \leftarrow$ randomSelect(pr2)
5:   $h_1 \leftarrow h_1 + (o \times pr_1)$; $h_2 \leftarrow h_2 + (o \times pr_2)$
6: end for
7: $h \leftarrow h_1 h_2$;
8: if isEmpty(HC[h]) = false then
9:   if elementWiseMatch(F, HC[h]) = true then
10:     Return true;
11:   end if
12: end if
13: HC[h].add(F);
14: Return false;

Algorithm checkDuplicateCand (Algorithm 2) uses a hashtable HC to check duplicates for a subset F. Points in F are sorted by their identifiers. Two separate standard hash functions are applied to the identifiers of the points in the sorted order to generate two hash values in steps (2-6). Both the hash values are concatenated to get a hash key h for the subset F in step 7. The use of multiple hash functions helps to reduce hash collisions. If HC already has a list of subsets at h, then an element-wise match of F is performed with each subset in the list in steps (8-9). Otherwise, F is stored in HC using key h in step 13.

V. SEARCH IN A SUBSET OF DATA POINTS

We present an algorithm for finding the top-k tightest clusters in a subset of points. A subset is obtained from a hashtable bucket as explained in section IV. Points in the subset are grouped based on the query keywords. Then, all the promising candidates are explored by a multi-way distance join of these groups. The join uses $r_k$, the diameter of the kth result obtained so far by ProMiSH-E, as the distance threshold.

We explain a multi-way distance join with an example.
A multi-way distance join of q groups $\{g_1, \ldots, g_q\}$ finds all the tuples $\{o_{1,i}, \ldots, o_{x,j}, o_{y,k}, \ldots, o_{q,l}\}$ such that $\forall x, y$: $o_{x,j} \in g_x$, $o_{y,k} \in g_y$, and $\|o_{x,j} - o_{y,k}\|_2 \le r_k$. Figure 4(a) shows groups {a, b, c} of points obtained for a query Q={a, b, c} from a subset F. We show an edge between a pair of points of two groups if the distance between the points is at most $r_k$, e.g., an edge between point $o_1$ in group a and point $o_3$ in group b. A multi-way distance join of these groups finds tuples $\{o_1, o_3, o_9\}$ and $\{o_{10}, o_3, o_9\}$. Each tuple obtained by a multi-way join is a promising candidate for a query.

Fig. 4. (a) Pairwise inner joins: a, b, and c are groups of points of a subset F obtained for a query Q={a, b, c}. A point o in a group g is joined to a point o' in another group g' if $\|o - o'\|_2 \le r_k$. The groups in the order {a, c, b} generate the least number of candidates by a multi-way join. (b) A graph representation of the pairwise inner joins. Each group is a node in the graph. The weight of an edge is the number of point pairs obtained by an inner join of the corresponding groups.

A. Group Ordering

A suitable ordering of the groups leads to an efficient candidate exploration by a multi-way distance join. We first perform pairwise inner joins of the groups with distance threshold $r_k$. In an inner join, a pair of points from two groups is joined only if the distance between them is at most $r_k$. Figure 4(a) shows such pairwise inner joins of the groups {a, b, c}. We see from figure 4(a) that a multi-way distance join in the order {a, b, c} explores 2 true candidates $\{\{o_1, o_3, o_9\}, \{o_{10}, o_3, o_9\}\}$ and a false candidate $\{o_1, o_4, o_6\}$. A multi-way distance join in the order {a, c, b} explores the least number of candidates, 2. Therefore, a proper ordering of the groups leads to an effective pruning of false candidates. Optimal ordering of groups for the least number of candidates generation is NP-hard [35].

We propose a greedy approach to find the ordering of groups. We explain the algorithm with a graph in figure 4(b).
Groups {a, b, c} are nodes in the graph. The weight of an edge is the count of point pairs obtained by an inner join of the corresponding groups. The greedy method starts by selecting an edge having the least weight. If there are multiple edges with the same weight, then an edge is selected at random. Let the edge ac, with weight 2, be selected in figure 4(b). This forms the ordered set (a, c). The next edge to be selected is the least weight edge such that at least one of its nodes is not included in the ordered set. Edge cb, with weight 2, is picked next in figure 4(b). Now the ordered set is (a, c, b). This process terminates when all the nodes are included in the set. (a, c, b) gives the ordering of the groups.

Algorithm 3 shows how the groups are ordered. The kth smallest diameter $r_k$ is retrieved from the priority queue PQ in step 1. For a given subset F and a query Q, all the points are grouped using query keywords in steps (2-5). A pairwise inner join of the groups is performed in steps (6-18). An adjacency list AL stores the distance between points which satisfy the distance threshold $r_k$. An adjacency list M stores the count of point pairs obtained for each pair of groups by the inner join. A greedy algorithm finds the order of the groups in steps (19-30). It repeatedly removes an edge with the smallest weight from M till all the groups are included in the ordered set curOrder. Finally, groups are sorted using curOrder in step 30.

B. Nested Loops with Pruning

We perform a multi-way distance join of the groups by nested loops. For example, consider the set of points in figure 4. Each point $o_{a,i}$ of group a is checked against each point $o_{b,j}$ of group b for the distance predicate, i.e., $\|o_{a,i} - o_{b,j}\|_2 \le r_k$. If a pair $(o_{a,i}, o_{b,j})$ satisfies the distance predicate, then it forms a tuple of size 2. Next, this tuple is checked against each point of group c. If a point $o_{c,k}$ satisfies the distance predicate with both the points $o_{a,i}$ and $o_{b,j}$, then a tuple $(o_{a,i}, o_{b,j}, o_{c,k})$ of size 3 is generated.
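The greedy group ordering and the nested-loop join described so far can be sketched together in a few lines. This is a simplified illustration, not Algorithms 3 and 4 themselves: the function names and the toy coordinates are hypothetical, ties between equal-weight edges are broken by a stable sort rather than at random, distances are recomputed instead of being cached in an adjacency list, and the threshold `r_k` stays fixed instead of shrinking as results are found.

```python
from itertools import combinations
from math import dist


def order_groups(groups, r_k):
    """Greedy ordering: repeatedly take the lightest inner-join edge that
    brings in at least one group not yet placed in the order."""
    if len(groups) < 2:
        return list(groups)
    weight = {
        (a, b): sum(1 for p in groups[a] for q in groups[b] if dist(p, q) <= r_k)
        for a, b in combinations(groups, 2)
    }
    edges = sorted(weight, key=weight.get)   # stable: ties keep input order
    order = []
    while len(order) < len(groups):
        for a, b in edges:
            if a not in order or b not in order:
                order.extend(g for g in (a, b) if g not in order)
                edges.remove((a, b))
                break
    return order


def multiway_join(groups, order, r_k):
    """Nested loops with pruning; returns (diameter, tuple of points) pairs."""
    results = []

    def extend(idx, cur, cur_r):
        if idx == len(order):
            results.append((cur_r, tuple(cur)))
            return
        for o in groups[order[idx]]:
            new_r = cur_r
            for p in cur:
                d_op = dist(o, p)
                if d_op > r_k:       # prune: o is too far from a chosen point
                    break
                new_r = max(new_r, d_op)
            else:
                extend(idx + 1, cur + [o], new_r)

    extend(0, [], 0.0)
    return results


# Hypothetical groups of 2-d points for a query {a, b, c}.
groups = {
    "a": [(0.0, 0.0), (0.5, 0.0)],
    "b": [(1.0, 0.0), (1.0, 0.5)],
    "c": [(0.0, 1.0)],
}
order = order_groups(groups, r_k=1.5)
results = multiway_join(groups, order, r_k=1.5)
print(order, len(results))
```

Placing the sparsely connected groups first means most partial tuples are rejected in the outer loops, before the inner loops multiply the work.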
Each intermediate tuple generated by nested loops satisfies the property that the distance between every pair of its points is at most $r_k$. This property effectively prunes false tuples very early in the join process and helps to gain high efficiency. A candidate is found when a tuple of size q is generated. If a candidate having a diameter smaller than the current value of $r_k$ is found, then the priority queue PQ and the value of $r_k$ are updated. The new value of $r_k$ is used as the distance threshold for future iterations of nested loops.

Algorithm 3 searchInSubset
In: F: subset of points; Q: query keywords; q: query size
In: PQ: priority queue of top-k results
1: $r_k \leftarrow PQ[k].r$ /* kth smallest diameter */
2: $SL \leftarrow [(v, [\,])]$: list of lists to store groups per query keyword
3: for all $v \in Q$ do
4:   $SL[v] \leftarrow \{\forall o \in F : o \text{ is tagged with } v\}$ /* form groups */
5: end for
6: /* Pairwise inner joins of the groups */
7: AL: adjacency list to store distances between points
8: $M \leftarrow 0$: adjacency list to store count of pairs between groups
9: for all $(v_i, v_j) \in Q$ such that $i \le q$, $j \le q$, $i < j$ do
10:   for all $o \in SL[v_i]$ do
11:     for all $o' \in SL[v_j]$ do
12:       if $\|o - o'\|_2 \le r_k$ then
13:         $AL[o, o'] \leftarrow \|o - o'\|_2$
14:         $M[v_i, v_j] \leftarrow M[v_i, v_j] + 1$
15:       end if
16:     end for
17:   end for
18: end for
19: /* Order groups by a greedy approach */
20: $curOrder \leftarrow [\,]$
21: while $Q \ne \emptyset$ do
22:   $(v_i, v_j) \leftarrow$ removeSmallestEdge(M)
23:   if $v_i \notin curOrder$ then
24:     curOrder.append($v_i$); $Q \leftarrow Q \setminus v_i$
25:   end if
26:   if $v_j \notin curOrder$ then
27:     curOrder.append($v_j$); $Q \leftarrow Q \setminus v_j$
28:   end if
29: end while
30: sort(SL, curOrder) /* order groups */
31: findCandidates(q, AL, PQ, Idx, SL, curSet, curSetr, $r_k$)

We find results by nested loops as shown in Algorithm 4 (findCandidates). Nested loops are performed recursively. An intermediate tuple curSet is checked against each point of group SL[Idx] in steps (2-23). First, it is determined using AL whether the distance between the last point in curSet and a point o in SL[Idx] is at most $r_k$ in step 3. Then, the point o is checked against each point in curSet for the distance predicate in steps (5-15). The diameter of curSet is updated in steps (9-11).
If a point o satisfies the distance predicate with each point of curSet, then a new tuple newCurSet is formed in step 17 by appending o to curSet. Next, a recursive call is made to findCandidates on the next group SL[Idx + 1] with newCurSet and newCurSetr. A candidate is found if curSet has a point from every group. A result is inserted into PQ after checking for duplicates in steps (26-33). A duplicate check is done by a sequential match with the results in PQ. For a large value of k, a method similar to Algorithm 2 can be used for the duplicate check. If a new result gets inserted into PQ, then the value of $r_k$ is updated in step 18.

Algorithm 4 findCandidates
In: q: query size; SL: list of groups
In: AL: adjacency list of distances between points
In: PQ: priority queue of top-k results
In: Idx: group index in SL
In: curSet: an intermediate tuple
In: curSetr: an intermediate tuple's diameter
1: if $Idx \le q$ then
2:   for all $o \in SL[Idx]$ do
3:     if $AL[curSet[Idx-1], o] \le r_k$ then
4:       $newCurSetr \leftarrow curSetr$
5:       for all $o' \in curSet$ do
6:         $dist \leftarrow AL[o, o']$
7:         if $dist \le r_k$ then
8:           $flag \leftarrow$ true
9:           if newCurSetr < dist then
10:             $newCurSetr \leftarrow dist$
11:           end if
12:         else
13:           $flag \leftarrow$ false; break;
14:         end if
15:       end for
16:       if flag = true then
17:         $newCurSet \leftarrow$ curSet.append(o)
18:         $r_k \leftarrow$ findCandidates(q, AL, PQ, Idx+1, SL, newCurSet, newCurSetr, $r_k$)
19:       else
20:         Continue;
21:       end if
22:     end if
23:   end for
24:   return $r_k$
25: else
26:   if checkDuplicateAnswers(curSet, PQ) = true then
27:     return $r_k$
28:   else
29:     if curSetr < PQ[k].r then
30:       PQ.Insert([curSet, curSetr])
31:       return PQ[k].r
32:     end if
33:   end if
34: end if

VI. APPROXIMATE SEARCH (PROMISH-A)

We present ProMiSH-A, which is more space and time efficient than ProMiSH-E. We also use a statistical model to show that ProMiSH-A retrieves results within a small approximation ratio of the true results with a high probability.

The index structure and the search method of ProMiSH-A are variations of ProMiSH-E; therefore we describe only the differences.
The index structure of ProMiSH-A differs from ProMiSH-E only in the way the line of projected values of points on a unit random vector is split. ProMiSH-A splits the line into non-overlapping bins of equal width, unlike ProMiSH-E which splits the line into overlapping bins. Therefore, each data point o gets one hash key from a unit random vector z in ProMiSH-A. A signature sig(o) is created for each point o by the concatenation of its hash keys obtained from each of the m unit random vectors. Each point is hashed using its signature sig(o) into a hashtable at a given scale.

The search technique of ProMiSH-A differs from ProMiSH-E in the initialization of the priority queue PQ and the termination condition. ProMiSH-A starts with an empty priority queue PQ, unlike ProMiSH-E whose priority queue is initialized with k entries. ProMiSH-A checks for the termination condition after fully exploring a hashtable at a given scale. It terminates if it has k entries in its priority queue PQ. Since each point is hashed only once into a hashtable of ProMiSH-A, it does not perform a subset duplicate check or a result duplicate check.

Fig. 5. Probability mass functions $f_r$ of diameters of candidates of a query of size 3 on 2-dimensional and 16-dimensional real datasets. Panels: (a) d=2, (b) d=16.

TABLE II. PERCENTAGE RATIO OF THE EXPECTED NUMBER OF CANDIDATES $N_p$ TO THE TOTAL NUMBER OF CANDIDATES $N_n$ OF A QUERY.
Dataset Dimension d              2    4    8    16    32
Percentage Ratio ($N_p/N_n$)     0.

Bound on approximation ratio: Define the approximation ratio $\rho \geq 1$ as the ratio of the diameter $r'$ of the result reported by ProMiSH-A to the diameter $r^*$ of the true result, i.e., $\rho = r'/r^*$. Let D be a d-dimensional dataset and $Q = \{v_{Q_1}, \ldots, v_{Q_q}\}$ be an NKS query. Let $f_v$ be the probability mass function of the keywords $v \in V$. Using $f_v$, we get the number of points tagged with a query keyword $v_Q$ as $N(v_Q) = f_v(v_Q) \times N$. Therefore, the total number of candidates for query Q in D is

$N_n = \prod_{i=1}^{q} \bigl( f_v(v_{Q_i}) \times N \bigr)$. (4)

Let $f_r$ be the probability mass function of diameters of candidates of Q. Then, the total number of candidates of query Q having diameter r is given by

$N_r = f_r(r) \times N_n$. (5)

We project all the points in dataset D which contain at least one query keyword $v_Q$ onto a unit random vector z. We split the line of projected values into non-overlapping bins of equal width w. Let $Pr(A \mid r)$ be the conditional probability for random unit vectors that a candidate A of query Q having diameter r is fully contained within a bin. For m independent unit random vectors, the joint probability that a candidate A is contained in a bin in each of the m vectors is $Pr(A \mid r)^m$. The probability that no candidate of diameter r is retrieved by ProMiSH-A from the hashtable, created using m unit random vectors, is $(1 - Pr(A \mid r)^m)^{N_r}$. Let the diameter of the top-1 result of query Q be $r^*$. Then, the probability $P(r')$ of at least one candidate of any diameter r, where $r^* \leq r \leq r'$, being retrieved by ProMiSH-A is given by

$P(r') = 1 - \prod_{r=r^*}^{r'} \bigl( 1 - Pr(A \mid r)^m \bigr)^{N_r}$. (6)

For a given constant $\lambda$, $0 \leq \lambda \leq 1$, we can compute the smallest value of $r'$ using equation 6 such that $\lambda \leq P(r')$. The value $\rho = r'/r^*$ gives an upper bound on the approximation ratio of the results returned by ProMiSH-A with probability $\lambda$.

We empirically computed $\rho$ for queries of size q=3 for different values of $\lambda$ using this model. We used a 32-dimensional real dataset having 1 million points, described in section VIII, for our study. For a set of randomly chosen queries of size 3, we computed the values of $N_r$ and $Pr(A \mid r)^2$.
We used projections on 1 million random vectors and a bin-width of w=100 for computing $Pr(A \mid r)^2$. We obtained approximation ratio bounds of $\rho = 1.4$ and $\rho = 1.5$ for $\lambda = 0.8$ and $\lambda = 0.95$ respectively.

VII. COMPLEXITY ANALYSIS OF PROMISH

We first show using a statistical model that ProMiSH effectively prunes the false candidates. Then, we analyze the time and the space complexity of ProMiSH. Let D be a d-dimensional dataset of size N where each point o is tagged with t keywords. Let U be the number of unique keywords in D. Let $Q = \{v_{Q_1}, \ldots, v_{Q_q}\}$ be an NKS query of size q.

Fig. 6. Values of $Pr(A \mid r)^2$ for varying diameters of candidates of a query of size 3 on 2-dimensional and 16-dimensional real datasets. Panels: (a) d=2, (b) d=16.

Statistical Model: Let the set $A \subseteq D$ with diameter $r^*$ be the top-1 result of query Q. We use t=1 for our model. Let $f_v$ be the probability mass function of the keywords $v \in V$. Let $f_r$ be the probability mass function of diameters of candidates of Q. The total numbers of candidates $N_n$ and $N_r$ of query Q are given by equations 4 and 5 respectively. We select all the points in D which contain at least one query keyword $v_Q$. We project these points on a unit random vector z. We split the line of projected values into overlapping bins of equal width $w = 2r^*$. Let $Pr(A \mid r)$ be the conditional probability for random unit vectors that a candidate A of query Q having diameter r is fully contained within a bin. For m independent unit random vectors, the joint probability that a candidate A is contained in a bin in each of the m vectors is $Pr(A \mid r)^m$. The expected number of candidates explored by ProMiSH in a hashtable, created using m unit random vectors, is

$N_p = \sum_{r} Pr(A \mid r)^m \times N_r$. (7)

We empirically computed the probability mass function $f_r$, the probability $Pr(A \mid r)^m$, and the ratio of $N_p$ to $N_n$. We used real datasets of size N=1 million and varying dimensions for our experiments. These datasets are described in section VIII. We used randomly selected queries of size q=3.
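Equations 6 and 7 are simple to evaluate once $N_r$ (or $f_r$) and $Pr(A \mid r)$ have been tabulated per diameter. The following Python sketch shows one way to do so. The function names and the dictionary-based inputs are assumptions made for illustration, and the sketch assumes the smallest tabulated diameter is $r^*$. The table `p_contain` stands for whichever containment probability applies (non-overlapping bins for equation 6, overlapping bins for equation 7).

```python
def retrieval_probability(n_r, p_contain, m, r_star, r_prime):
    """Equation 6: P(r') = 1 - prod_{r* <= r <= r'} (1 - Pr(A|r)^m)^(N_r)."""
    miss = 1.0
    for r, count in n_r.items():
        if r_star <= r <= r_prime:
            miss *= (1.0 - p_contain[r] ** m) ** count
    return 1.0 - miss


def approximation_bound(n_r, p_contain, m, lam):
    """Smallest r'/r* with P(r') >= lam, or None if lam is never reached."""
    diameters = sorted(n_r)
    r_star = diameters[0]  # assumption: smallest tabulated diameter is r*
    for r_prime in diameters:
        if retrieval_probability(n_r, p_contain, m, r_star, r_prime) >= lam:
            return r_prime / r_star
    return None


def explored_fraction(f_r, p_contain, m):
    """N_p / N_n = sum_r Pr(A|r)^m * f_r(r), combining equations 5 and 7."""
    return sum(p_contain[r] ** m * f_r[r] for r in f_r)
```

The last function uses the fact that substituting $N_r = f_r(r) \times N_n$ into equation 7 gives $N_p / N_n = \sum_r Pr(A \mid r)^m f_r(r)$, so the pruning ratio depends only on the two tabulated distributions.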
We show probability mass functions $f_r$ of diameters of candidates of a query Q on datasets of dimensions d=2 and d=16 in figure 5. We computed diameters of all the candidates of a query Q in the dataset to obtain $f_r$ and $r^*$. The diameters of the candidates were scaled to lie between 0 and 1. We show values of $Pr(A \mid r)^2$ for varying diameters of candidates of a query Q on datasets of dimensions d=2 and d=16 in figure 6. To compute $Pr(A \mid r)$, we randomly chose a candidate A of diameter r. We projected all the points of A on one million unit random vectors. Then, we computed the number of vectors on each of which all the points in A lie in the same bin.

We make the following observations from the above analysis: (a) diameters of the candidates of a query have a heavy-tailed distribution, and (b) the value of $Pr(A \mid r)^m$ decreases exponentially with the diameter of the candidate of a query. The first observation implies that a large number of the candidates have diameters much larger than $r^*$. The second observation implies that the candidates with diameter larger than $r^*$ have a much smaller chance of falling in a bin than A, and thus of being probed by ProMiSH. Therefore, most of the false candidates, i.e., candidates with diameters larger than $r^*$, are effectively pruned out by ProMiSH using its index.

We present the percentage ratio of $N_p$ to $N_n$ in table II for datasets of varying dimensions. Each ratio is computed as an

average of 50 random queries. We observe from table II that ProMiSH prunes more than 99% of the false candidates for datasets of low dimensions, e.g., d=2. For high dimensions, e.g., d=32, more than 50% of the false candidates get pruned.

TABLE III. DESCRIPTION OF REAL DATASETS OF FIVE DIFFERENT SIZES.
Id    Dataset Size (N)    Dictionary Size U    Average t
1 10,000 5, ,000 6, ,000 7, ,000 7, Million 24,

Time complexity: We assume that the data points are uniformly distributed across all the keywords. Therefore, the total number of data points tagged with a keyword v is $N(v) = N \times (t/U)$. Let the index structure of ProMiSH-E be comprised of HI structures at L scale levels, where the value of L is obtained by equation 3. Let $H_s$ be the hashtable at scale s. We assume without any loss of generality that the hashtable $H_s$ is created using m=1 unit random vector. Let pSpan be the span of the projected values of the data points on the unit random vector. We assume that the data points tagged with a keyword v are uniformly distributed on the line of projected values. ProMiSH-E divides the line of projected values into overlapping bins to compute the hash keys of the points using a bin-width of $w = w_0 2^s$. Therefore, the number of data points having a keyword v lying in a bucket b of $H_s$ is $N(v \mid b) = N(v) \times w / pSpan = N(v) / 2^{L-s}$.

We first compute the cost of search in a bucket b of $H_s$. The cost of pairwise inner joins for a query Q of size q for d-dimensional data points is $(N(v \mid b) \times q)^2 \times d/2$. The nested loop enumerates the candidates by looking up the precomputed distances between the points from the adjacency list. Therefore, the worst case cost of the nested loop is $N(v \mid b)^q$. The total cost of search in a bucket b of the hashtable $H_s$ is $T(b \mid s) = \bigl( (N(v \mid b) \times q)^2 \times d/2 \bigr) + N(v \mid b)^q$. The total number of buckets in $H_s$ of ProMiSH-E is $2^{L-s+1}$. Therefore, the cost of search in $H_s$ is $T(H_s) = 2^{L-s+1} \times T(b \mid s)$. ProMiSH-A divides the line of projected values into non-overlapping bins. The total number of buckets in $H_s$ of ProMiSH-A is $2^{L-s}$.
Therefore, the cost of search in $H_s$ is $T(H_s) = 2^{L-s} \times T(b \mid s)$. We present the query times of ProMiSH for NKS queries on multiple real and synthetic datasets in section VIII.

Space complexity: Let the space cost of a point's identifier, a dimension of a point, and a keyword be E bytes individually. The index structure of ProMiSH consists of the keyword-point inverted index $I_{kp}$ and L pairs of a hashtable H and a keyword-bucket inverted index $I_{khb}$. The space cost of $I_{kp}$ is $S(I_{kp}) = (N \times E \times t)$ bytes. For ProMiSH-E, each point is hashed into a hashtable H using $2^m$ signatures, therefore a hashtable takes $S_E(H) = (2^m \times N \times E)$ bytes. For ProMiSH-A, each point is hashed using only one signature, therefore a hashtable takes $S_A(H) = (N \times E)$ bytes. The space cost of the $I_{khb}$ inverted index is $S(I_{khb}) = (U \times M \times \log_2 M / 8)$ bytes, where M is the number of buckets in hashtable H. The total space cost of the index of ProMiSH-E is $S(I_{kp}) + S_E(H) + S(I_{khb})$. The total space cost of the index of ProMiSH-A is $S(I_{kp}) + S_A(H) + S(I_{khb})$. The ratio of index size to dataset size is further analyzed in section VIII-D.

Fig. 7. Average approximation ratio of ProMiSH-A for varying query sizes on 32-dimensional real datasets of various sizes.

VIII. EMPIRICAL EVALUATION

We evaluated the performance of ProMiSH-E and ProMiSH-A on synthetic and real datasets. We used the recently introduced Virtual bR*-tree [2] as a reference method for comparison (see section II for a description). We first introduce the datasets and the metrics used for measuring the performance of the algorithms. Then, we discuss the quality results of the algorithms on real datasets. Next, we describe comparative results of ProMiSH-E, ProMiSH-A, and Virtual bR*-tree on both synthetic and real datasets. We also report scalability results of ProMiSH on both synthetic and real datasets. Finally, we present a comparison of the space usage of all the algorithms.

Datasets: We used both synthetic and real datasets for experiments. Synthetic data was randomly generated. Each component of a d-dimensional synthetic point was chosen uniformly from [0-10,000].
Each synthetic point was randomly tagged with t keywords. A dataset is characterized by its (1) size, N; (2) dimensionality, d; (3) dictionary size, U; and (4) the number of keywords associated with each point, t. We created various synthetic datasets by varying these parameters for our empirical studies.

Our NKS query is useful for finding tight clusters of photos which contain all the keywords provided by a user in a photo-sharing social network, as discussed in section I. Based on this application, we used images having descriptive tags as real datasets. We downloaded images with their textual keywords from Flickr. We transformed each image into grayscale. We created a d-dimensional dataset by extracting a d-dimensional color histogram from each image. Each data point was tagged with the keywords of its corresponding image. We describe real datasets of five different sizes used in our empirical studies in table III. The largest real dataset had 24,874 unique keywords and each point in it was tagged with 11 keywords. A query for a dataset was created by randomly picking a set of keywords from the dictionary of the dataset. A query is parameterized by its size q.

Performance metrics: We used approximation ratio, query time, and space usage as metrics to evaluate the quality of results (accuracy), the efficiency, and the scalability of the search algorithms.

Fig. 8. Query time comparison of algorithms for retrieving top-1 results for queries of size q=5 on synthetic datasets of varying dimensions d. Values of N=100,000, t=1, and U=1,000 were used for each dataset.

Fig. 9. Query time comparison of algorithms for retrieving top-1 results for queries of size q=5 on 25-dimensional synthetic datasets of varying sizes N. Values of t=1 and U=1,000 were used for each dataset.

Fig. 10. Query time comparison of algorithms for retrieving top-1 results for queries of varying sizes q on a 10-dimensional synthetic dataset having 100,000 points. Values of t=1 and U=1,000 were used for the dataset.

We measured the quality of results of an algorithm by its approximation ratio [30], [32]. For $1 \leq i \leq k$, if $r_i$ is the ith diameter in the top-k results retrieved by an algorithm for query Q and $r_i^*$ is the true ith diameter, then the approximation ratio of the algorithm for top-k search is given by $\rho(Q) = \bigl( \sum_{i=1}^{k} r_i / r_i^* \bigr) / k$. The smaller the value of $\rho(Q)$, the better is the quality of the results returned by the algorithm. The least value of $\rho(Q)$ is 1. We report the average approximation ratio (AAR) for the queries of a given size, which is the mean of the approximation ratios of 50 queries.

We validated the time efficiency of the algorithms by measuring their query times. The index structure and the dataset for each method reside in memory. Therefore, the query time measured as the elapsed CPU time between the start and the completion of a query gives a fair comparison between the methods. A query was executed multiple times and the average execution time was taken as its query time. Finally, we report the query time for a query size q as an average of 50 different queries.

The query time of a search algorithm mainly depends on the dataset size N, the dataset dimension d, and the query size q. Therefore, we validated the scalability of the algorithms by computing their query times for varying values of N, d, and q. We verified the space efficiency of an algorithm by computing the ratio of its index memory footprint to the dataset memory footprint.
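The approximation ratio and its average over a query workload can be computed directly from the retrieved and true diameters. The sketch below is illustrative only: the function names are hypothetical, and it assumes every true diameter is strictly positive so that the division is defined.

```python
def approximation_ratio(found, truth):
    """rho(Q) = (sum_i r_i / r_i*) / k for the top-k diameters of one query.

    found: diameters returned by the algorithm, in rank order.
    truth: true diameters in the same rank order (assumed > 0).
    """
    if not found or len(found) != len(truth):
        raise ValueError("need two non-empty lists of equal length k")
    return sum(r / r_true for r, r_true in zip(found, truth)) / len(found)


def average_approximation_ratio(per_query):
    """AAR: mean of rho(Q) over (found, truth) pairs, one pair per query."""
    ratios = [approximation_ratio(found, truth) for found, truth in per_query]
    return sum(ratios) / len(ratios)
```

An exact method returns `found == truth` for every query, so both functions evaluate to 1, the least possible value.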
Implementation of the methods: We implemented all the methods in Java. For Virtual bR*-tree, we fixed the leaf node size to 1,000 entries and other node sizes to 100 entries. Virtual bR*-tree finds only the smallest subset, therefore we used k=1 for ProMiSH for a fair comparison. We used the values of m=2 and L=5 to create the index structure of ProMiSH-E and ProMiSH-A. For a dataset, if pMax is the maximum span of projected values of data points on any unit random vector, then a value of $w_0 = pMax / 2^L$ was used as the initial bin-width. All the experiments were performed on a machine having a Quad-Core Intel Xeon CPU@2.00GHz, 4,096 KB cache, and 98 GB main memory, and running 64-bit Linux version 2.6.

Fig. 11. Query time analysis of ProMiSH algorithms for retrieving top-1 results for queries of varying sizes q on 25-dimensional synthetic datasets of varying sizes N. Values of t=1 and U=200 were used for each dataset.

A. Quality Test

We validated the result quality of ProMiSH-E, ProMiSH-A, and Virtual bR*-tree by their average approximation ratios (AAR). ProMiSH-E and Virtual bR*-tree perform an exact search. Therefore, they always retrieve the true top-k results, and have an AAR of 1. We used the results returned by them as the ground truth. Figure 7 shows AAR computed over top-5 results retrieved by ProMiSH-A for varying query sizes on two 32-dimensional real datasets. We observe from figure 7 that the AAR of ProMiSH-A is always less than 1.5. This low AAR allows ProMiSH-A to return practically useful results with very efficient time and space complexity.

B. Efficiency on Synthetic Datasets

We performed experiments on multiple synthetic datasets to verify the efficiency and the scalability of ProMiSH. We first discuss the comparison of query times of Virtual bR*-tree, ProMiSH-A, and ProMiSH-E for varying dataset dimensions d, dataset sizes N, and query sizes q. We found that ProMiSH performs at least four orders of magnitude better than Virtual bR*-tree. We also show results of the scalability tests of ProMiSH for varying values of N, d, q, and the result size k.
Our scalability results reveal a linear performance of ProMiSH with N, d, q, and k. All the query times are measured in milliseconds (ms) and shown in log scale in all the figures.

The query times of ProMiSH-E, ProMiSH-A, and Virtual bR*-tree for retrieving top-1 results for queries of size 5 on datasets of varying dimensions d are shown in figure 8. We used a dataset of 100,000 points where each point was tagged with t=1 keyword using a dictionary of size U=1,000. For the dataset of dimension 25, ProMiSH-A completed in 1.8 ms and ProMiSH-E took only 4.2 ms. Conversely, results for Virtual bR*-tree could not be obtained since it ran for more than 5 hours. We observed that ProMiSH not only significantly outperforms Virtual bR*-tree on datasets of all dimensions, but the difference in performance also grows to more than five orders of magnitude with an increase in the dataset dimension.

We show the query times of the algorithms on 25-dimensional datasets of varying sizes N for queries of size 5 in

figure 9. Each dataset used a dictionary of size U=1,000 and t=1 keyword per point. Virtual bR*-tree failed to finish for the dataset of size N=100,000 even after 5 hours of execution.

We report the query times of the algorithms for queries of varying sizes q on a 10-dimensional dataset of size N=100,000 in figure 10. Each data point was tagged with t=1 keyword using a dictionary of size U=1,000. For a query of size 5, ProMiSH-A had a query time of 1.7 ms, ProMiSH-E had a query time of 4.2 ms, and Virtual bR*-tree had a query time of 305 seconds. We again observed that ProMiSH outperforms Virtual bR*-tree by more than five orders of magnitude with an increase in the dataset size and the query size.

Fig. 12. Query time analysis of ProMiSH for retrieving top-1 results for queries of varying sizes q on large synthetic datasets of varying dimensions d. Values of N=3 million, t=1, and U=200 were used for each dataset.

Fig. 13. Query time analysis of ProMiSH algorithms for retrieving top-k results for queries of sizes 3 and 6 on a 50-dimensional synthetic dataset of size N=3 million. Values of t=1 and U=200 were used for the dataset.

Fig. 14. Query time comparison of algorithms for retrieving top-1 results for queries of size q=4 on real datasets of varying dimensions d and size N=50,000.

Fig. 16. Query time comparison of algorithms for retrieving top-1 results for queries of size q=4 on 16-dimensional real datasets of varying sizes N.

Fig. 17. Query time analysis of ProMiSH algorithms for retrieving top-1 results for queries of varying sizes q on real datasets of varying dimensions and size N=1 million.

All the above results show that the query time of ProMiSH increases linearly with the dataset size N, the dataset dimension d, and the query size q. In contrast, Virtual bR*-tree fails to scale with q, d, and N. These results confirm that the pruning criteria of Virtual bR*-tree, as discussed in section II, become ineffective with an increase in the dimension of the dataset. This leads to an exponential generation of potential candidates and large query times.
Next, we present scalability results of ProMiSH-E and ProMiSH-A on large synthetic datasets of varying dimensions for large query sizes and varying result sizes. Each dataset used a dictionary of size U=200. A point in each dataset was tagged with t=1 keyword.

Figure 11 shows the query times for queries of varying sizes q on 25-dimensional datasets of varying sizes N. ProMiSH-E had a query time of 29 seconds and ProMiSH-A had a query time of 6 seconds for queries of size 9 on a dataset of 10 million points. We observed that ProMiSH-A is an order of magnitude faster than ProMiSH-E for queries of all sizes. We see from figure 11 that ProMiSH scales linearly with the query size and the dataset size.

Figure 12 shows the query times of ProMiSH for queries of varying sizes on datasets of 3 million points and varying dimensions. ProMiSH-E had a query time of 4.7 seconds and ProMiSH-A had a query time of 0.3 seconds for queries of size q=9 on a 100-dimensional dataset. ProMiSH-A is an order of magnitude faster than ProMiSH-E on datasets of all dimensions. We observed that both algorithms scale linearly with the dimension d of the dataset.

Figure 13 shows the query times for retrieving the top-k results for queries of varying sizes q on a 50-dimensional dataset. It reveals a linear performance of both algorithms for increasing k. ProMiSH-A is an order of magnitude better than ProMiSH-E for any result size k. All these tests show that the query time of ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size.

Fig. 15. Query time comparison of algorithms for retrieving top-1 results for queries of varying sizes q on a 16-dimensional real dataset of size N=70,000.

C. Efficiency on Real Datasets

We evaluated the efficiency and the scalability of ProMiSH on multiple real datasets. We first discuss query time comparisons of alternative algorithms for varying dataset dimensions
