Nearest Keyword Set Search in Multi-dimensional Datasets


Vishwakarma Singh
Department of Computer Science, University of California, Santa Barbara, USA

Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara, USA

arXiv: v1 [cs.DB] 12 Sep 2014

Abstract

Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our empirical studies, both on real and synthetic datasets, show that ProMiSH has a speedup of more than four orders of magnitude over state-of-the-art tree-based techniques. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries having up to 9 keywords show that ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size.

I. INTRODUCTION

Objects (e.g., images, chemical compounds, or documents) are often characterized by a collection of relevant features, and are commonly represented as points in a multi-dimensional attribute space. For example, images (chemical compounds) are represented using color (molecule) feature vectors. These objects also very often have descriptive text information associated with them, e.g., images are tagged with locations. In this paper, we consider multi-dimensional datasets where each data point has a set of keywords. The presence of keywords allows for the development of new tools for querying and exploring these multi-dimensional datasets.

In this paper, we study nearest keyword set search (NKS) queries on text-rich multi-dimensional datasets. An NKS query is a set of user-provided keywords.
The top-1 result of an NKS query is a set of data points which contains all the query keywords and whose points form the tightest cluster in the multi-dimensional space. Figure 1 illustrates an NKS query. The multi-dimensional points in the dataset are represented by dots. Each point has a unique identifier and is tagged with a set of keywords. For a query $Q=\{a, b, c\}$, the set of points $\{7, 8, 9\}$ contains all the query keywords $\{a, b, c\}$, and its points are nearer to each other than those of any other set of points containing these query keywords. Therefore, the set $\{7, 8, 9\}$ is the top-1 result for the query $Q$.

Fig. 1. An example of an NKS query on a keyword-tagged multi-dimensional dataset. The query is $Q=\{a, b, c\}$. The top-1 result is the set of points $\{7, 8, 9\}$.

NKS queries are useful for many applications, e.g., photo-sharing social networks, web search engines, map services, GIS systems [1], subgraph search, and for geo-tagging of objects and regions [2]. Consider a photo-sharing social network like Facebook where photos are tagged with people names and locations. These photos can be embedded in a high-dimensional feature space of texture, color, or shape [3], [4]. Here an NKS query can find a group of similar photos which contains a set of people. NKS searches are also useful when labeled graphs are embedded in a high-dimensional space (e.g., through Lipschitz embedding [5]) for ease of processing. In this case, a search for a subgraph that has the needed labels can be answered by an NKS search in the embedded space [6]. NKS queries can also reveal geographic patterns. GIS can characterize a region by a high-dimensional set of attributes, e.g., pressure, humidity, and soil types. Additionally, these regions can also be tagged with information such as diseases. An epidemiologist can use NKS queries to discover a pattern by finding a set of similar regions which contains all the diseases of her interest.

Query Definition: Let $D \subset R^d$ be a $d$-dimensional dataset having $N$ points. Each point $o \in D$ has a unique identifier (id).
Each point is also tagged with a set of keywords $\sigma(o)=\{v_1, \ldots, v_t\} \subseteq V$, where $V$ is a dictionary of size $U$ of all the unique keywords in $D$. We use $L_2$ (Euclidean norm) to measure the distance between any two points, i.e., $dist(o_i, o_j) = \|o_i - o_j\|_2$. We measure the nearness of a set of points $A$ by the maximum distance between any two points in $A$, called the diameter $r(A)$:

$$r(A) = \max_{o_i, o_j \in A} \|o_i - o_j\|_2$$

A relatively small value of $r(A)$ implies that the corresponding objects are more similar to each other.

A $q$-size NKS query $Q=\{v_{Q_1}, \ldots, v_{Q_q}\}$ has $q$ unique keywords provided by a user. A set $A \subseteq D$ is a possible result, called a candidate, of $Q$ if it contains points for all the query keywords, i.e., $Q \subseteq \bigcup_{o \in A} \sigma(o)$, and no proper subset of $A$ does so. We allow overlapping candidates. If $S$ is the set of all candidates of $Q$, then the result of $Q$ is the candidate $A^*$ such that $A^* = \arg\min_{A \in S} r(A)$. A top-$k$ NKS query retrieves the $k$ candidates having the least diameters. If two candidates have equal diameters, then they are further ranked by their cardinality.

TABLE I. A glossary of notations used in the paper.
D : A dataset
V : A dictionary of unique keywords in D
Q : A set of keywords comprising a query
o : A point in D
v : A keyword
N(v) : Number of points in D having keyword v
N : Number of points in D
U : Number of unique keywords in D
q : Number of keywords in a query Q
d : Number of dimensions of a point
t : Average number of keywords per point
k : Number of top results
w_0 : Initial bin-width for a hashtable
m : Number of unit random vectors used for projection
L : Number of Hashtable-Inverted Index structures
s : A scale value
r : Diameter of a set of points
z : A d-dimensional unit random vector

We can also measure the nearness of a set of points $A$ by the sum of all its pairwise distances $s(A)$. Here we show with an example that $s(A)$ does not yield a tighter cluster than $r(A)$. Let $A=\{o_1, o_2, o_3, o_4\}$ be a set of points with the following pairwise distances: $\{d(o_1, o_2)=2,\ d(o_1, o_3)=1,\ d(o_2, o_3)=4,\ d(o_1, o_4)=3,\ d(o_2, o_4)=3,\ d(o_3, o_4)=8\}$. For the set $A_1=\{o_1, o_2, o_3\}$ we have $r(A_1)=4$ and $s(A_1)=7$, whereas for the set $A_2=\{o_1, o_2, o_4\}$ we have $r(A_2)=3$ and $s(A_2)=8$. Here we see that $A_2$ has the smaller diameter whereas $A_1$ has the smaller sum of pairwise distances, so the two measures rank the sets differently. In this paper we use the diameter.

A search using a tree-based index was proposed by Zhang et al. [2], [7] to solve NKS queries on multi-dimensional datasets. The performance of this algorithm deteriorates sharply with an increase in the dimension of the dataset as the pruning techniques become ineffective. Our empirical results show that this algorithm may take hours to terminate for a high-dimensional dataset having only a few thousand points. The authors also noted that the tree-based algorithm does not scale with the dimension of the dataset. As discussed previously, NKS queries are useful for applications of varying dimensions.
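The two nearness measures from the query definition, the diameter $r(A)$ and the pairwise-distance sum $s(A)$, can be checked on the four-point example of this section with a few lines of Python. This is only an illustrative sketch: the names `pair`, `diameter` and `pair_sum` and the dictionary layout are assumptions made here, not part of ProMiSH.

```python
from itertools import combinations

# Pairwise distances of the four-point example (taken from the text).
d = {
    ("o1", "o2"): 2, ("o1", "o3"): 1, ("o2", "o3"): 4,
    ("o1", "o4"): 3, ("o2", "o4"): 3, ("o3", "o4"): 8,
}

def pair(a, b):
    """Look up the symmetric distance between two point ids."""
    return d[(a, b)] if (a, b) in d else d[(b, a)]

def diameter(A):
    """r(A): the maximum distance between any two points of A."""
    return max(pair(a, b) for a, b in combinations(A, 2))

def pair_sum(A):
    """s(A): the sum of all pairwise distances of A."""
    return sum(pair(a, b) for a, b in combinations(A, 2))

A1 = ["o1", "o2", "o3"]
A2 = ["o1", "o2", "o4"]
print(diameter(A1), pair_sum(A1))
print(diameter(A2), pair_sum(A2))
```

Ranking by `diameter` prefers `A2`, while ranking by `pair_sum` prefers `A1`, which is the disagreement the example is meant to expose.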
Therefore, there is a need for an efficient algorithm that scales linearly with the dataset dimension and yields practical query times on large datasets.

We propose ProMiSH (Projection and Multi-Scale Hashing) to efficiently solve NKS queries. We present an exact (ProMiSH-E) and an approximate (ProMiSH-A) version of the algorithm. ProMiSH-E always retrieves the true top-$k$ results, and therefore has 100% accuracy. ProMiSH-A is much more time and space efficient but returns results whose diameters are within a small approximation ratio of the diameters of the true results. Both algorithms scale linearly with the dataset dimension, the dataset size, the query size, and the result size. Thus, ProMiSH possesses all three desired characteristics of a good search algorithm: 1) high quality of results (accuracy), 2) high efficiency, and 3) good scalability.

ProMiSH-E uses a set of hashtables and inverted indices to perform a localized search of the results. ProMiSH-E hashtables are inspired by Locality Sensitive Hashing (LSH) [8], which is a state-of-the-art method for nearest neighbor search in high-dimensional spaces. The index structure of ProMiSH-E supports accurate search, unlike LSH-based methods that allow only approximate search with probabilistic guarantees. ProMiSH-E creates hashtables at multiple bin-widths, called scales. A search in a hashtable yields subsets of points that contain query results. ProMiSH-E explores each subset using a novel pruning-based strategy. An optimal strategy is NP-hard; therefore, ProMiSH-E uses a greedy approach. ProMiSH-A is an approximate variation of ProMiSH-E to achieve even more space and time efficiency.

We evaluated the performance of ProMiSH on both real and synthetic datasets. We used the state-of-the-art Virtual bR*-Tree [2] as a reference method for comparison. The empirical results reveal that ProMiSH consistently outperforms Virtual bR*-Tree on datasets of all dimensions.
The difference in performance of ProMiSH and Virtual bR*-Tree grows to more than four orders of magnitude with an increase in the dataset dimension, the dataset size, and the query size. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries of sizes up to 9 show that ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size. Our datasets had as many as 24,874 unique keywords, and a data point was tagged with a maximum of 14 keywords. The space cost analysis of the algorithms shows that ProMiSH-A is much more space efficient than both ProMiSH-E and Virtual bR*-Tree.

Our main contributions are: (1) a novel multi-scale index for scalable answering of NKS queries, (2) an efficient candidate generation technique from a subset of points, and (3) extensive empirical studies.

The paper is organized as follows. A literature survey is presented in section II. Index structures are described in section III. An exact search algorithm (ProMiSH-E) to find subsets of points containing the results is given in section IV. Section V discusses how answers are generated from the subsets. The approximate algorithm (ProMiSH-A) and an analysis of its approximation ratio are presented in section VI. The complexity of ProMiSH is analyzed in section VII. Empirical results are presented in section VIII. We discuss the extension of ProMiSH to disk in section IX. Finally, we provide conclusions and future work in section X. A glossary of the notations is shown in table I.

II. LITERATURE SURVEY

A variety of queries, semantically different from our NKS queries, have been studied in the literature on text-rich spatial datasets. Location-specific keyword queries on the web and in GIS systems [9], [10], [11], [12] were earlier answered using a combination of R-Tree [13] and inverted index. Felipe et al. [14] developed IR$^2$-Tree to rank objects from spatial datasets based on a combination of their distances to the query locations and the relevance of their text descriptions to the query keywords. Cong et al. [15] integrated R-Tree and inverted file to answer a query similar to Felipe et al. [14] using a different ranking function. Martins et al. [16] computed text relevancy and location proximity independently, and then combined the two ranking scores. Cao et al. [17] recently proposed a method to retrieve a group of spatial web objects such that the group's keywords cover the query's keywords and the objects in the group are nearest to the query location and have the lowest inter-object distances. Other keyword-based queries on spatial datasets are aggregate nearest keyword search in spatial databases [18], top-$k$ preferential query [19], finding top-$k$ sites in spatial data based on their influence on feature points [20], and optimal location queries [21], [22].

Fig. 2. Division of projected values of points on a unit random vector into overlapping bins of equal width $w=2r$.

Our NKS query is similar to the m-closest keywords query of Zhang et al. [7]. They designed bR*-Tree based on R*-Tree [23] that also stores bitmaps and minimum bounding rectangles (MBRs) of keywords in every node along with points' MBRs. The candidates are generated by the apriori algorithm [24]. They prune unwanted candidates based on the distances between MBRs of points or keywords and the best found diameter. Their pruning techniques become ineffective with an increase in the dataset dimension as there is a large overlap between MBRs due to the curse of dimensionality. This leads to an exponential number of candidates and large query times. A poor estimation of the starting diameter further worsens the performance of their algorithm. bR*-Tree also suffered from a high storage cost; therefore, Zhang et al. modified bR*-Tree to create Virtual bR*-Tree [2] in memory at run time. Virtual bR*-Tree is created from a pre-stored R*-Tree which indexes all the points, and an inverted index which stores keyword information and the path from the root node in R*-Tree for each point. Both bR*-Tree and Virtual bR*-Tree are structurally similar, and use similar candidate generation and pruning techniques. Therefore, Virtual bR*-Tree shares the same performance weaknesses as bR*-Tree.

Tree-based indices, e.g., R-Tree [13] and M-Tree [25], have been researched extensively for an efficient near neighbor search in high-dimensional spaces. These indices fail to scale to dimensions greater than 10 because of the curse of dimensionality [26]. VA-file [26] and iDistance [27] provide better scalability with the dataset dimension. However, the task of designing an efficient method for solving NKS queries by adapting VA-file or iDistance is not obvious.

Random projections [28] with hashing [29], [30], [8], [31], [32] have come to be the state-of-the-art method for an efficient near neighbor search in high-dimensional datasets. Datar et al.
[8] used random vectors constructed from p-stable distributions to project points, and then computed hash keys for the points by splitting the line of projected values into disjoint bins. They concatenated the hash keys obtained for a point from $m$ random vectors to create a final hash key for the point. All points were indexed into a hashtable using their hash keys. Our index structure is inspired by the same.

Multi-way distance joins of a set of multi-dimensional datasets, each of which is indexed into an R-Tree, have been studied in the literature [33], [34]. As discussed above, a tree-based index fails to scale with the dimension of the dataset. Further, it is not straightforward to adapt these algorithms if every query requires a multi-way distance join only on a subset of the points of each dataset.

III. INDEX FOR EXACT SEARCH

In this section, we describe the index structure of ProMiSH-E. It has two main data structures. The first data structure is a keyword-point inverted index $I_{kp}$ that indexes all the points in the dataset $D$ using their keywords. $I_{kp}$ is shown with a dashed rectangle in figure 3. The second data structure consists of multiple hashtables and their corresponding inverted indices. We call a hashtable $H$ together with its corresponding inverted index $I_{khb}$ an HI structure.

Fig. 3. Index structure and flow of execution of ProMiSH.

We create hashtable $H$ as follows. We randomly choose $m$ $d$-dimensional unit vectors. We compute the projection $z \cdot o$ of each point $o$ in $D$ on each unit random vector $z$.
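The projection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the text does not say how the unit vectors are drawn, so normalising a vector of Gaussian samples (a standard way to pick a uniform direction) is an assumption, and the helper names, the seed and the sample points are all hypothetical.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a d-dimensional unit vector by normalising Gaussian samples (assumed method)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]

def project(z, o):
    """Projection z.o of point o on unit vector z, i.e. a dot product."""
    return sum(zi * oi for zi, oi in zip(z, o))

rng = random.Random(7)   # fixed seed, chosen only to make the sketch repeatable
d, m = 4, 3              # hypothetical dimension and number of random vectors
Z = [random_unit_vector(d, rng) for _ in range(m)]

# Hypothetical data points keyed by their ids.
points = {
    1: [0.1, 0.9, 0.3, 0.5],
    2: [0.2, 0.8, 0.4, 0.4],
    3: [0.9, 0.1, 0.7, 0.2],
}

# One list of m projected values per point.
projections = {pid: [project(z, o) for z in Z] for pid, o in points.items()}
```

Each point ends up with $m$ projected values, one per random vector; these are the values that are subsequently divided into bins.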
Next, we split each line of projected values into consecutive overlapping bins of width $w$ as shown in figure 2. Here a bin is equally overlapped by two other bins. We assign each point $o$ a hash key based on the bin in which it lies. Since the line is split into overlapping bins, each point $o$ lies in two bins, and therefore gets two hash keys $\{b_1, b_2\}$ from each unit random vector $z$. For example, the line of projected values $T$ in figure 2 has been split into overlapping bins $\{x1, x2, x3, y1, y2, y3\}$. Point $o$ lies in bins $x1$ and $y2$, and therefore gets two hash keys, one for each of the bins. We compute hash keys using equations 1 and 2:

$$h_1(o) = \left\lfloor \frac{z \cdot o}{w} \right\rfloor \qquad (1)$$

$$h_2(o) = \left\lfloor \frac{z \cdot o - \frac{w}{2}}{w} \right\rfloor + C \qquad (2)$$

where $C$ is a constant to distinguish values of $h_1$ and $h_2$. A value of $C$ can be $(\max(h_1) - \min(h_1) + 2)$.

We get $m$ pairs of hash keys for each data point $o$ using $m$ unit random vectors. We take a cartesian product of these $m$ pairs of hash keys to generate $2^m$ signatures for each point $o$. A signature $sig(o)=\{b_{j_1}, \ldots, b_{j_m}\}$ of a point $o$ contains a hash key from each of the $m$ pairs. For example, let $z_1$ and $z_2$ be two unit random vectors for $m=2$. Let the hash keys of a point $o$ be $\{x_1, y_1\}$ from $z_1$ and $\{x_2, y_2\}$ from $z_2$. ProMiSH creates the four signatures $\{x_1 x_2,\ x_1 y_2,\ y_1 x_2,\ y_1 y_2\}$ for $o$ by a cartesian product. We hash each point $o$ using each of its $2^m$ signatures as a hash key into the hashtable $H$. A signature $sig(o)$ of a point $o$ is converted into a hashtable bucket identifier (bucket id) using a standard hash function, e.g., $\left(\sum_i b_{j_i} \times pr_i\right) \bmod hashtableSize$, where $pr_i$ is a random prime number. We store a point just by its id in the hash bucket.

For each hashtable $H$, we create a corresponding inverted index $I_{khb}$. For each bucket of $H$, we compute the union of keywords of its points. Then, we index each bucket of the hashtable $H$ against each of the unique keywords it contains in the inverted index $I_{khb}$. We show an HI structure in figure 3 with a dotted rectangle. We create HI structures for increasing bin-width $w = w_0 2^s$, where $w_0$ is the initial bin-width and $s \in \{0, \ldots, L-1\}$ is the scale. If $pMax$ is the maximum span of projected values of points on any unit random vector, then

$$L = \left\lceil \log_2\left(\frac{pMax}{w_0}\right) \right\rceil. \qquad (3)$$

IV. EXACT SEARCH (PROMISH-E)

Here we describe the algorithm ProMiSH-E to find subsets of points that contain the true query results. First, we introduce lemmas which guarantee that ProMiSH-E always retrieves the true top-$k$ results using the index structure. Then, we describe the steps of ProMiSH-E to find the subsets. The algorithm to find results from these subsets is described in section V.

Lemma 1: Let $R^d$ be a $d$-dimensional Euclidean space. Let $z$ be a vector uniformly picked from a unit $(d-1)$-sphere such that $z \in R^d$ and $\|z\|_2 = 1$. For any two points $o_1$ and $o_2$ in $R^d$, we have $\|o_1 - o_2\|_2 \ge \|z \cdot o_1 - z \cdot o_2\|_2$.

Proof: Since a Euclidean space with dot product is an inner product space, we have

$$\|z \cdot o_1 - z \cdot o_2\|_2 = \|z \cdot (o_1 - o_2)\|_2 \le \|z\|_2 \, \|o_1 - o_2\|_2 = \|o_1 - o_2\|_2 \quad \text{since } \|z\|_2 = 1.$$

The inequality follows from the Cauchy-Schwarz inequality.

Lemma 2: If a set of points $A = \{o_1, \ldots, o_n\}$ in $R^d$ with diameter $r$ is projected onto a $d$-dimensional unit random vector $z$, and the line is split into overlapping bins of equal width $w \ge 2r$, then there exists a bin containing all the points of set $A$.

Proof: From lemma 1 and the definition of diameter, we have $\forall o_i, o_j \in A$, $\|z \cdot o_i - z \cdot o_j\| \le \|o_i - o_j\| \le r$. Therefore, the span of projected values of the points in set $A$, i.e., $\max(z \cdot o_1, \ldots, z \cdot o_n) - \min(z \cdot o_1, \ldots, z \cdot o_n)$, is $\le r$. Since the line is split into overlapping bins of width $2r$, it follows from the construction, as shown in figure 2, that a line segment of width $r$ is fully contained in one of the bins. Hence, all the points in set $A$ will lie in the same bin.

We illustrate here with an example how lemma 2 guarantees retrieval of the true results. For a query $Q$, let the diameter of its top-1 result be $r$. We project all the data points in $D$ on a unit random vector and split the projected values into overlapping bins of bin-width $2r$.
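Equations 1 and 2, the cartesian-product signatures, and the statement of lemma 2 can be tried out on a few numbers. The following Python sketch is illustrative only: the function names, the projected values and the fixed constant `C = 1000` are assumptions made for the example (the text suggests deriving $C$ from the range of $h_1$).

```python
import math
from itertools import product

def hash_keys(p, w, C):
    """Two bin keys for a projected value p, following Equations 1 and 2."""
    h1 = math.floor(p / w)
    h2 = math.floor((p - w / 2.0) / w) + C
    return h1, h2

def signatures(projections, w, C):
    """Cartesian product of the m key pairs of one point: 2**m signatures."""
    return list(product(*(hash_keys(p, w, C) for p in projections)))

def share_a_bin(values, w, C):
    """True if at least one bin (of either family) holds every projected value."""
    common = set(hash_keys(values[0], w, C))
    for p in values[1:]:
        common &= set(hash_keys(p, w, C))
    return len(common) > 0

# Hypothetical projections of a set A on one vector z. By lemma 1 their span
# is at most the diameter r, so the span itself is used as r here.
values = [3.10, 3.45, 3.90]
r = max(values) - min(values)
w = 2 * r      # bin width w = 2r, as lemma 2 requires
C = 1000       # assumed constant, large enough to keep h2 keys apart from h1 keys

print(share_a_bin(values, w, C))
print(signatures([0.3, 1.7], 1.0, C))
```

In this example the three values fall into two different bins of the first family but into a single bin of the shifted family, which is the situation the overlapping construction is designed to handle.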
Now, if we perform a search in each of the bins independently, then lemma 2 guarantees that the top-1 result of query $Q$ is found in one of the bins.

A flow of execution of ProMiSH-E is shown in figure 3. A search starts with the HI structure at scale $s=0$. ProMiSH-E finds buckets of hashtable $H$, each of which contains all the query keywords, using the inverted index $I_{khb}$. Then, ProMiSH-E explores each selected bucket using an efficient pruning-based technique to generate results. ProMiSH-E terminates after exploring the HI structure at the smallest scale $s$ such that the $k$th result has the diameter $r_k \le w_0 2^{s-1}$.

Algorithm 1 details the steps of ProMiSH-E. It maintains a bitset $BS$. For each $v_Q \in Q$, ProMiSH-E retrieves the list of points corresponding to $v_Q$ from $I_{kp}$ in step 4. For each point $o$ in the retrieved list, ProMiSH-E marks the bit corresponding to $o$'s identifier in $BS$ as true in step 5. Thus, ProMiSH-E finds all the points in $D$ which are tagged with at least one query keyword.

Algorithm 1 ProMiSH-E
In: Q: query keywords; k: number of top results
In: w_0: initial bin-width
 1: PQ ← [e([ ], +∞)]: priority queue of top-k results
 2: HC: hashtable to check duplicate candidates
 3: BS: bitset to track points having a query keyword
 4: for all o ∈ ∪_{v_Q ∈ Q} I_kp[v_Q] do
 5:     BS[o] ← true /* Find points having query keywords */
 6: end for
 7: for all s ∈ {0, ..., L−1} do
 8:     Get HI at s
 9:     E[ ] ← 0 /* List of hash buckets */
10:     for all v_Q ∈ Q do
11:         for all bId ∈ I_khb[v_Q] do
12:             E[bId] ← E[bId] + 1
13:         end for
14:     end for
15:     for all i ∈ (0, ..., SizeOf(E)) do
16:         if E[i] = SizeOf(Q) then
17:             F ← ∅ /* Obtain a subset of points */
18:             for all o ∈ H[i] do
19:                 if BS[o] = true then
20:                     F ← F ∪ o
21:                 end if
22:             end for
23:             if checkDuplicateCand(F, HC) = false then
24:                 searchInSubset(F, PQ)
25:             end if
26:         end if
27:     end for
28:     /* Check termination condition */
29:     if PQ[k].r ≤ w_0 2^(s−1) then
30:         Return PQ
31:     end if
32: end for
33: /* Perform search on D if algorithm has not terminated */
34: for all o ∈ D do
35:     if BS[o] = true then
36:         F ← F ∪ o
37:     end if
38: end for
39: searchInSubset(F, PQ)
40: Return PQ

Next, the search continues in the HI structures, beginning at $s=0$. For any given scale $s$, ProMiSH-E accesses the HI structure created at the scale in step 8. ProMiSH-E retrieves all the lists of hash bucket ids corresponding to keywords in $Q$ from the inverted index $I_{khb}$ in steps (10-11). An intersection of these lists yields a set of hash buckets, each of which contains all the query keywords, in steps (12-16). For the example in figure 3, this intersection yields the bucket id 2. For each selected hash bucket, ProMiSH-E retrieves all the points in the bucket from the hashtable $H$. ProMiSH-E filters these points using bitset $BS$ to get a subset of points $F$ in steps (17-22). Subset $F$ contains only those points which are tagged with at least one query keyword and is explored further. Subset $F$ is checked whether it has been explored earlier or not using checkDuplicateCand (Algorithm 2) in step 23. Since each point is hashed using $2^m$ signatures, duplicate subsets may be generated. If $F$ has not been explored earlier, then ProMiSH-E performs a search on it using searchInSubset (Algorithm 3) in step 24.

Results are inserted into a priority queue $PQ$ of size $k$. Each entry $e([\ ], r)$ of $PQ$ is a tuple containing a set of points and the set's diameter. $PQ$ is initialized with $k$ entries, each of whose set is empty and whose diameter is $+\infty$. Entries of $PQ$ are ordered by their diameters. Entries with equal diameters are further ordered by their set sizes. A new result is inserted into $PQ$ only if its diameter is smaller than the $k$th smallest diameter in $PQ$.

If ProMiSH-E does not terminate after exploring the HI structure at the scale $s$, then the search proceeds to HI at the scale $(s+1)$. ProMiSH-E terminates when the $k$th smallest diameter $r_k$ in $PQ$ becomes less than or equal to half of the current bin-width $w = w_0 2^s$ in steps (29-31). Since $r_k \le \frac{w_0 2^s}{2}$, lemma 2 guarantees that each true candidate is fully contained in one of the bins of the hashtable, and therefore must have been explored. If ProMiSH-E fails to terminate after exploring HI at all the scale levels $s \in \{0, \ldots, L-1\}$, then it performs a search in the complete dataset $D$ in steps (34-39).

Algorithm checkDuplicateCand (Algorithm 2) uses a hashtable $HC$ to check duplicates for a subset $F$. Points in $F$ are sorted by their identifiers. Two separate standard hash functions are applied to the identifiers of the points in the sorted order to generate two hash values in steps (2-6). Both the hash values are concatenated to get a hash key $h$ for the subset $F$ in step 7. The use of multiple hash functions helps to reduce hash collisions. If $HC$ already has a list of subsets at $h$, then an element-wise match of $F$ is performed with each subset in the list in steps (8-9). Otherwise, $F$ is stored in $HC$ using key $h$ in step 13.

Algorithm 2 checkDuplicateCand
In: F: a subset; HC: hashtable of subsets
 1: F ← sort(F)
 2: pr1: list of prime numbers; pr2: list of prime numbers;
 3: for all o ∈ F do
 4:     pr_1 ← randomSelect(pr1); pr_2 ← randomSelect(pr2)
 5:     h_1 ← h_1 + (o × pr_1); h_2 ← h_2 + (o × pr_2)
 6: end for
 7: h ← h_1 h_2; /* concatenation of h_1 and h_2 */
 8: if isEmpty(HC[h]) = false then
 9:     if elementWiseMatch(F, HC[h]) = true then
10:         Return true;
11:     end if
12: end if
13: HC[h].add(F);
14: Return false;

V. SEARCH IN A SUBSET OF DATA POINTS

We present an algorithm for finding the top-$k$ tightest clusters in a subset of points. A subset is obtained from a hashtable bucket as explained in section IV. Points in the subset are grouped based on the query keywords. Then, all the promising candidates are explored by a multi-way distance join of these groups. The join uses $r_k$, the diameter of the $k$th result obtained so far by ProMiSH-E, as the distance threshold. We explain a multi-way distance join with an example.
A multi-way distance join of $q$ groups $\{g_1, \ldots, g_q\}$ finds all the tuples $\{o_{1,i}, \ldots, o_{x,j}, o_{y,k}, \ldots, o_{q,l}\}$ such that $\forall x, y$: $o_{x,j} \in g_x$, $o_{y,k} \in g_y$, and $\|o_{x,j} - o_{y,k}\|_2 \le r_k$. Figure 4(a) shows groups $\{a, b, c\}$ of points obtained for a query $Q=\{a, b, c\}$ from a subset $F$. We show an edge between a pair of points of two groups if the distance between the points is at most $r_k$, e.g., an edge between point $o_1$ in group $a$ and point $o_3$ in group $b$. A multi-way distance join of these groups finds tuples $\{o_1, o_3, o_9\}$ and $\{o_{10}, o_3, o_9\}$. Each tuple obtained by a multi-way join is a promising candidate for a query.

Fig. 4. (a) Pairwise inner joins: $a$, $b$, and $c$ are groups of points of a subset $F$ obtained for a query $Q=\{a, b, c\}$. A point $o$ in a group $g$ is joined to a point $o'$ in another group $g'$ if $\|o - o'\|_2 \le r_k$. The groups in the order $\{a, c, b\}$ generate the least number of candidates by a multi-way join. (b) A graph representation of the pairwise inner joins. Each group is a node in the graph. The weight of an edge is the number of point pairs obtained by an inner join of the corresponding groups.

A. Group Ordering

A suitable ordering of the groups leads to an efficient candidate exploration by a multi-way distance join. We first perform pairwise inner joins of the groups with distance threshold $r_k$. In an inner join, a pair of points from two groups is joined only if the distance between them is at most $r_k$. Figure 4(a) shows such pairwise inner joins of the groups $\{a, b, c\}$. We see from figure 4(a) that a multi-way distance join in the order $\{a, b, c\}$ explores 2 true candidates $\{\{o_1, o_3, o_9\}, \{o_{10}, o_3, o_9\}\}$ and a false candidate $\{o_1, o_4, o_6\}$. A multi-way distance join in the order $\{a, c, b\}$ explores the least number of candidates, 2. Therefore, a proper ordering of the groups leads to an effective pruning of false candidates. Optimal ordering of groups for the least number of candidates generation is NP-hard [35].

We propose a greedy approach to find the ordering of groups. We explain the algorithm with a graph in figure 4(b).
Groups $\{a, b, c\}$ are nodes in the graph. The weight of an edge is the count of point pairs obtained by an inner join of the corresponding groups. The greedy method starts by selecting an edge having the least weight. If there are multiple edges with the same weight, then an edge is selected at random. Let the edge $ac$, with weight 2, be selected in figure 4(b). This forms the ordered set $(a\ c)$. The next edge to be selected is the least weight edge such that at least one of its nodes is not included in the ordered set. Edge $cb$, with weight 2, is picked next in figure 4(b). Now the ordered set is $(a\ c\ b)$. This process terminates when all the nodes are included in the set. $(a\ c\ b)$ gives the ordering of the groups.

Algorithm 3 shows how the groups are ordered. The $k$th smallest diameter $r_k$ is retrieved from the priority queue $PQ$ in step 1. For a given subset $F$ and a query $Q$, all the points are grouped using query keywords in steps (2-5). A pairwise inner join of the groups is performed in steps (6-18). An adjacency list $AL$ stores the distance between points which satisfy the distance threshold $r_k$. An adjacency list $M$ stores the count of point pairs obtained for each pair of groups by the inner join. A greedy algorithm finds the order of the groups in steps (19-30). It repeatedly removes an edge with the smallest weight from $M$ till all the groups are included in the order set curOrder. Finally, groups are sorted using curOrder in step 30.

B. Nested Loops with Pruning

We perform a multi-way distance join of the groups by nested loops. For example, consider the set of points in figure 4. Each point $o_{a,i}$ of group $a$ is checked against each point $o_{b,j}$ of group $b$ for the distance predicate, i.e., $\|o_{a,i} - o_{b,j}\|_2 \le r_k$. If a pair $(o_{a,i}, o_{b,j})$ satisfies the distance predicate, then it forms a tuple of size 2. Next, this tuple is checked against each point of group $c$. If a point $o_{c,k}$ satisfies the distance predicate with both the points $o_{a,i}$ and $o_{b,j}$, then a tuple $(o_{a,i}, o_{b,j}, o_{c,k})$ of size 3 is generated.
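The greedy group ordering and the nested-loop join with the distance predicate can be sketched together in Python. This is a simplified illustration under stated assumptions rather than the paper's Algorithms 3 and 4: the threshold `r_k` is held fixed instead of being tightened as results arrive, ties between equal edge weights are broken by iteration order rather than at random, and the 2-D points, group contents and function names are all hypothetical.

```python
import math
from itertools import combinations

def order_groups(groups, r_k):
    """Greedy order: scan inner-join edges by increasing weight and append
    any group of the edge that is not yet in the order."""
    weight = {}
    for a, b in combinations(groups, 2):
        weight[(a, b)] = sum(
            1 for p in groups[a] for q in groups[b] if math.dist(p, q) <= r_k
        )
    order = []
    for a, b in sorted(weight, key=weight.get):
        for g in (a, b):
            if g not in order:
                order.append(g)
    return order or list(groups)

def multiway_join(groups, order, r_k):
    """Nested loops with pruning: a partial tuple is extended only by points
    within r_k of every point already in the tuple."""
    results = []

    def extend(idx, current):
        if idx == len(order):
            results.append(tuple(current))
            return
        for p in groups[order[idx]]:
            if all(math.dist(p, q) <= r_k for q in current):
                extend(idx + 1, current + [p])

    extend(0, [])
    return results

# Hypothetical groups of 2-D points, one group per query keyword.
groups = {
    "a": [(0.0, 0.0), (10.0, 0.0)],
    "b": [(1.0, 0.0), (0.0, 1.0), (10.0, 1.0)],
    "c": [(0.5, 0.5), (20.0, 20.0)],
}
r_k = 1.5

order = order_groups(groups, r_k)
candidates = multiway_join(groups, order, r_k)
print(order)
print(candidates)
```

On this data the lightest edge joins groups `a` and `c`, so the join starts with the pair of groups that admits the fewest partial tuples, and the far-away points are discarded before the third group is ever visited.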
Each intermediate tuple generated by nested loops satisfies the property that the distance between every pair of its points is at most $r_k$. This property effectively prunes false tuples very early in the join process and helps to gain high efficiency. A candidate is found when a tuple of size $q$ is generated. If a candidate having a diameter smaller than the current value of $r_k$ is found, then the priority queue $PQ$ and the value of $r_k$ are updated. The new value of $r_k$ is used as the distance threshold for future iterations of nested loops.

Algorithm 3 searchInSubset
In: F: subset of points; Q: query keywords; q: query size
In: PQ: priority queue of top-k results
 1: r_k ← PQ[k].r /* kth smallest diameter */
 2: SL ← [(v, [ ])]: list of lists to store groups per query keyword
 3: for all v ∈ Q do
 4:     SL[v] ← {∀o ∈ F : o is tagged with v} /* form groups */
 5: end for
 6: /* Pairwise inner joins of the groups */
 7: AL: adjacency list to store distances between points
 8: M ← 0: adjacency list to store count of pairs between groups
 9: for all (v_i, v_j) ∈ Q such that i ≤ q, j ≤ q, i < j do
10:     for all o ∈ SL[v_i] do
11:         for all o' ∈ SL[v_j] do
12:             if ||o − o'||_2 ≤ r_k then
13:                 AL[o, o'] ← ||o − o'||_2
14:                 M[v_i, v_j] ← M[v_i, v_j] + 1
15:             end if
16:         end for
17:     end for
18: end for
19: /* Order groups by a greedy approach */
20: curOrder ← [ ]
21: while Q ≠ ∅ do
22:     (v_i, v_j) ← removeSmallestEdge(M)
23:     if v_i ∉ curOrder then
24:         curOrder.append(v_i); Q ← Q \ v_i
25:     end if
26:     if v_j ∉ curOrder then
27:         curOrder.append(v_j); Q ← Q \ v_j
28:     end if
29: end while
30: sort(SL, curOrder) /* order groups */
31: findCandidates(q, AL, PQ, Idx, SL, curSet, curSetr, r_k)

We find results by nested loops as shown in Algorithm 4 (findCandidates). Nested loops are performed recursively. An intermediate tuple curSet is checked against each point of group SL[Idx] in steps (2-23). First, it is determined using $AL$ whether the distance between the last point in curSet and a point $o$ in SL[Idx] is at most $r_k$ in step 3. Then, the point $o$ is checked against each point in curSet for the distance predicate in steps (5-15). The diameter of curSet is updated in steps (9-11).
If a point $o$ satisfies the distance predicate with each point of curSet, then a new tuple newCurSet is formed in step 17 by appending $o$ to curSet. Next, a recursive call is made to findCandidates on the next group SL[Idx + 1] with newCurSet and newCurSetr. A candidate is found if curSet has a point from every group. A result is inserted into $PQ$ after checking for duplicates in steps (26-33). A duplicate check is done by a sequential match with the results in $PQ$. For a large value of $k$, a method similar to Algorithm 2 can be used for a duplicate check. If a new result gets inserted into $PQ$, then the value of $r_k$ is updated in step 18.

Algorithm 4 findCandidates
In: q: query size; SL: list of groups
In: AL: adjacency list of distances between points
In: PQ: priority queue of top-k results
In: Idx: group index in SL
In: curSet: an intermediate tuple
In: curSetr: an intermediate tuple's diameter
 1: if Idx ≤ q then
 2:     for all o ∈ SL[Idx] do
 3:         if AL[curSet[Idx−1], o] ≤ r_k then
 4:             newCurSetr ← curSetr
 5:             for all o' ∈ curSet do
 6:                 dist ← AL[o, o']
 7:                 if dist ≤ r_k then
 8:                     flag ← true
 9:                     if newCurSetr < dist then
10:                         newCurSetr ← dist
11:                     end if
12:                 else
13:                     flag ← false; break;
14:                 end if
15:             end for
16:             if flag = true then
17:                 newCurSet ← curSet.append(o)
18:                 r_k ← findCandidates(q, AL, PQ, Idx+1, SL, newCurSet, newCurSetr, r_k)
19:             else
20:                 Continue;
21:             end if
22:         end if
23:     end for
24:     return r_k
25: else
26:     if checkDuplicateAnswers(curSet, PQ) = true then
27:         return r_k
28:     else
29:         if curSetr < PQ[k].r then
30:             PQ.Insert([curSet, curSetr])
31:             return PQ[k].r
32:         end if
33:     end if
34: end if

VI. APPROXIMATE SEARCH (PROMISH-A)

We present ProMiSH-A, which is more space and time efficient than ProMiSH-E. We also use a statistical model to show that ProMiSH-A retrieves results within a small approximation ratio of the true results with high probability.

The index structure and the search method of ProMiSH-A are variations of ProMiSH-E; therefore, we describe only the differences.
The index structure of ProMiSH-A differs from ProMiSH-E only in the way the line of projected values of points on a unit random vector is split. ProMiSH-A splits the line into non-overlapping bins of equal width, unlike ProMiSH-E which splits the line into overlapping bins. Therefore, each data point o gets one hash key from a unit random vector z in ProMiSH-A. A signature sig(o) is created for each point o by the concatenation of its hash keys obtained from each of the m unit random vectors. Each point is hashed using its signature sig(o) into a hashtable at a given scale.

The search technique of ProMiSH-A differs from ProMiSH-E in the initialization of the priority queue PQ and the termination condition. ProMiSH-A starts with an empty priority queue PQ, unlike ProMiSH-E whose priority queue is initialized with k entries. ProMiSH-A checks for the termination condition after fully exploring a hashtable at a given scale. It terminates if it has k entries in its priority queue PQ. Since each point is hashed only once into a hashtable of ProMiSH-A, it does not perform a subset duplicate check or a result duplicate check.

Bound on approximation ratio: Define the approximation ratio $\rho \geq 1$ as the ratio of the diameter $r'$ of the result reported by ProMiSH-A to the diameter $r^*$ of the true result, i.e., $\rho = r'/r^*$. Let D be a d-dimensional dataset and $Q = \{v_{Q_1}, \ldots, v_{Q_q}\}$ be an NKS query. Let $f_v$ be the probability mass function of the

keywords $v \in V$. Using $f_v$, we get the number of points tagged with a query keyword $v_Q$ as $N(v_Q) = f_v(v_Q) \times N$. Therefore, the total number of candidates for query Q in D is

$$N_n = \prod_{i=1}^{q} \big(f_v(v_{Q_i}) \times N\big). \tag{4}$$

Let $f_r$ be the probability mass function of diameters of candidates of Q. Then, the total number of candidates of query Q having diameter r is given by

$$N_r = f_r(r) \times N_n. \tag{5}$$

We project all the points in dataset D, which contain at least one query keyword $v_Q$, onto a unit random vector z. We split the line of projected values into non-overlapping bins of equal width w. Let $\Pr(A \mid r)$ be the conditional probability for random unit vectors that a candidate A of query Q having diameter r is fully contained within a bin. For m independent unit random vectors, the joint probability that a candidate A is contained in a bin in each of the m vectors is $\Pr(A \mid r)^m$. The probability that no candidate of diameter r is retrieved by ProMiSH-A from the hashtable, created using m unit random vectors, is $(1 - \Pr(A \mid r)^m)^{N_r}$. Let the diameter of the top-1 result of query Q be $r^*$. Then, the probability $P(r')$ of at least one candidate of any diameter r, where $r^* \leq r \leq r'$, being retrieved by ProMiSH-A is given by

$$P(r') = 1 - \prod_{r=r^*}^{r'} \big(1 - \Pr(A \mid r)^m\big)^{N_r}. \tag{6}$$

For a given constant $\lambda$, $0 \leq \lambda \leq 1$, we can compute the smallest value of $r'$ using equation 6 such that $\lambda \leq P(r')$. The value $\rho' = r'/r^*$ gives an upper bound on the approximation ratio of the results returned by ProMiSH-A with the probability $\lambda$.

Fig. 5. Probability mass functions $f_r$ of diameters of candidates of a query of size 3 on 2-dimensional and 16-dimensional real datasets. Panels: (a) d=2; (b) d=16.

TABLE II. Percentage ratio of the expected number of candidates $N_p$ to the total number of candidates $N_n$ of a query. Columns: Dataset dimension d; Percentage ratio ($N_p / N_n$).

We empirically computed $\rho'$ for queries of size q=3 for different values of $\lambda$ using this model. We used a 32-dimensional real dataset having 1 million points described in section VIII for our study. For a set of randomly chosen queries of size 3, we computed the values of $N_r$ and $\Pr(A \mid r)^2$.
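The bound of equation 6 can be evaluated numerically along the following lines. This is a hedged sketch, not the authors' code: `signature` illustrates the one-key-per-vector hashing of ProMiSH-A with non-overlapping bins of width w, `containment_probability` is a Monte Carlo estimate of $\Pr(A \mid r)$ for one candidate, and `smallest_r_prime` scans candidate diameters in ascending order, starting at $r^*$, until $P(r') \geq \lambda$. All function names and the list-based inputs are assumptions made for illustration.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a random direction by normalizing a vector of Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def signature(point, vectors, w):
    """One hash key per random vector: the index of the non-overlapping bin."""
    return tuple(
        math.floor(sum(a * b for a, b in zip(point, z)) / w) for z in vectors
    )


def containment_probability(candidate, w, trials, rng):
    """Fraction of random unit vectors on which all points share one bin."""
    d = len(candidate[0])
    hits = 0
    for _ in range(trials):
        z = random_unit_vector(d, rng)
        if len({signature(p, [z], w) for p in candidate}) == 1:
            hits += 1
    return hits / trials


def smallest_r_prime(diameters, pr, counts, m, lam):
    """Smallest r' with P(r') >= lam under equation 6, or None if never reached.

    diameters: ascending candidate diameters, the first one being r*.
    pr[i]: Pr(A | r_i); counts[i]: N_r for diameter r_i.
    """
    miss = 1.0  # running product of (1 - Pr(A|r)^m)^(N_r)
    for r, p, n in zip(diameters, pr, counts):
        miss *= (1.0 - p ** m) ** n
        if 1.0 - miss >= lam:
            return r
    return None


r_prime = smallest_r_prime([1.0, 1.2, 1.5], [0.5, 0.5, 0.5], [1, 1, 1], m=1, lam=0.7)
rho_bound = r_prime / 1.0  # rho' = r' / r*
```

With the toy inputs above, the retrieval probability is 0.5 after the first diameter and 0.75 after the second, so the scan stops at r' = 1.2.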
We used projections on 1 million random vectors and a bin-width of w=100 for computing $\Pr(A \mid r)^2$. We obtained the approximation ratio bounds of $\rho'=1.4$ and $\rho'=1.5$ for $\lambda=0.8$ and $\lambda=0.95$ respectively.

VII. COMPLEXITY ANALYSIS OF PROMISH

We first show using a statistical model that ProMiSH effectively prunes the false candidates. Then, we analyze the time and the space complexity of ProMiSH. Let D be a d-dimensional dataset of size N where each point o is tagged with t keywords. Let U be the number of unique keywords in D. Let $Q = \{v_{Q_1}, \ldots, v_{Q_q}\}$ be an NKS query of size q.

Fig. 6. Values of $\Pr(A \mid r)^2$ for varying diameters of candidates of a query of size 3 on 2-dimensional and 16-dimensional real datasets. Panels: (a) d=2; (b) d=16.

Statistical Model: Let the set $A \subset D$ with diameter $r^*$ be the top-1 result of query Q. We use t=1 for our model. Let $f_v$ be the probability mass function of the keywords $v \in V$. Let $f_r$ be the probability mass function of diameters of candidates of Q. The total numbers of candidates $N_n$ and $N_r$ of query Q are given by equations 4 and 5 respectively. We select all the points in D which contain at least one query keyword $v_Q$. We project these points on a unit random vector z. We split the line of projected values into overlapping bins of equal width $w = 2r^*$. Let $\Pr(A \mid r)$ be the conditional probability for random unit vectors that a candidate A of query Q having diameter r is fully contained within a bin. For m independent unit random vectors, the joint probability that a candidate A is contained in a bin in each of the m vectors is $\Pr(A \mid r)^m$. The expected number of candidates explored by ProMiSH in a hashtable, created using m unit random vectors, is

$$N_p = \sum_{r} \Pr(A \mid r)^m \times N_r. \tag{7}$$

We empirically computed the probability mass function $f_r$, the probability $\Pr(A \mid r)^m$, and the ratio of $N_p$ to $N_n$. We used real datasets of size N=1 million and varying dimensions for our experiments. These datasets are described in section VIII. We used randomly selected queries of size q=3.
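Equations 4, 5, and 7 combine into a short calculation of the expected number of explored candidates. The sketch below assumes the distributions are available as plain Python containers; the function names and the dictionary representation of $f_r$ and $\Pr(A \mid r)$ are illustrative choices, not part of the original method.

```python
import math


def total_candidates(keyword_probs, n_points):
    """Equation 4: product over the query keywords of f_v(v_Qi) * N."""
    total = 1.0
    for f in keyword_probs:
        total *= f * n_points
    return total


def expected_explored(diam_pmf, pr_contained, n_total, m):
    """Equations 5 and 7: N_p = sum over r of Pr(A|r)^m * f_r(r) * N_n."""
    return sum(
        (pr_contained[r] ** m) * f_r * n_total for r, f_r in diam_pmf.items()
    )


# Toy numbers: three keywords, each on 1% of 1,000 points.
n_n = total_candidates([0.01, 0.01, 0.01], 1000)
# Half of the candidates are tight (always share a bin), half never do.
n_p = expected_explored({0.1: 0.5, 0.9: 0.5}, {0.1: 1.0, 0.9: 0.0}, n_n, m=2)
percentage_ratio = 100.0 * n_p / n_n
```

The percentage ratio computed in the last line corresponds to the quantity reported in table II.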
We show probability mass functions $f_r$ of diameters of candidates of a query Q on datasets of dimensions d=2 and d=16 in figure 5. We computed diameters of all the candidates of a query Q in the dataset to obtain $f_r$ and $r^*$. The diameters of the candidates were scaled to lie between 0 and 1. We show values of $\Pr(A \mid r)^2$ for varying diameters of candidates of a query Q on datasets of dimensions d=2 and d=16 in figure 6. To compute $\Pr(A \mid r)$, we randomly chose a candidate A of diameter r. We projected all the points of A on one million unit random vectors. Then, we computed the number of vectors on each of which all the points in A lie in the same bin.

We make the following observations from the above analysis: (a) diameters of the candidates of a query have a heavy-tailed distribution, and (b) the value of $\Pr(A \mid r)^m$ decreases exponentially with the diameter of the candidate of a query. The first observation implies that a large number of the candidates have diameters much larger than $r^*$. The second observation implies that the candidates with diameter larger than $r^*$ have a much smaller chance of falling in a bin than A, and thus being probed by ProMiSH. Therefore, most of the false candidates, i.e., candidates with diameters larger than $r^*$, are effectively pruned out by ProMiSH using its index.

We present the percentage ratio of $N_p$ to $N_n$ in table II for datasets of varying dimensions. Each ratio is computed as an

average of 50 random queries. We observe from table II that ProMiSH prunes more than 99% of the false candidates for datasets of low dimensions, e.g., d=2. For high dimensions, e.g., d=32, more than 50% of the false candidates get pruned.

TABLE III. Description of real datasets of five different sizes.
Id | Dataset Size (N) | Dictionary Size U | Average t
1 10,000 5, ,000 6, ,000 7, ,000 7, Million 24,

Time complexity: We assume that the data points are uniformly distributed across all the keywords. Therefore, the total number of the data points tagged with a keyword v is $N(v) = N \times (t/U)$. Let the index structure of ProMiSH-E be comprised of HI structures at L scale levels where the value of L is obtained by equation 3. Let $H_s$ be the hashtable at scale s. We assume without any loss of generality that the hashtable $H_s$ is created using m=1 unit random vector. Let pSpan be the span of the projected values of the data points on the unit random vector. We assume that the data points tagged with a keyword v are uniformly distributed on the line of projected values. ProMiSH-E divides the line of projected values into overlapping bins to compute the hash keys of the points using a bin-width of $w = w_0 2^s$. Therefore, the number of the data points having a keyword v lying in a bucket b of $H_s$ is $N(v \mid b) = N(v) \times w / pSpan = N(v)/2^{L-s}$.

We first compute the cost of search in a bucket b of $H_s$. The cost of pairwise inner joins for a query Q of size q for d-dimensional data points is $(N(v \mid b) \times q)^2 \times d/2$. The nested loop enumerates the candidates by looking up the precomputed distances between the points from the adjacency list. Therefore, the worst case cost of the nested loop is $N(v \mid b)^q$. The total cost of search in a bucket b of the hashtable $H_s$ is $T(b \mid s) = ((N(v \mid b) \times q)^2 \times d/2) + N(v \mid b)^q$. The total number of buckets in $H_s$ of ProMiSH-E is $2^{L-s+1}$. Therefore, the cost of search in $H_s$ is $T(H_s) = 2^{L-s+1} \times T(b \mid s)$. ProMiSH-A divides the line of projected values into non-overlapping bins. The total number of buckets in $H_s$ of ProMiSH-A is $2^{L-s}$.
Therefore, the cost of search in $H_s$ is $T(H_s) = 2^{L-s} \times T(b \mid s)$. We present the query times of ProMiSH for NKS queries on multiple real and synthetic datasets in section VIII.

Space complexity: Let the space cost of a point's identifier, a dimension of a point, and a keyword be E bytes individually. The index structure of ProMiSH consists of the keyword-point inverted index $I_{kp}$ and L pairs of hashtable H and keyword-bucket inverted index $I_{khb}$. The space cost of $I_{kp}$ is $S(I_{kp}) = (N \times E \times t)$ bytes. For ProMiSH-E, each point is hashed into a hashtable H using $2^m$ signatures, therefore a hashtable takes $S_E(H) = (2^m \times N \times E)$ bytes. For ProMiSH-A, each point is hashed using only one signature, therefore a hashtable takes $S_A(H) = (N \times E)$ bytes. The space cost of the $I_{khb}$ inverted index is $S(I_{khb}) = (U \times M \times \log_2 M / 8)$ bytes, where M is the number of buckets in a hashtable H. The total space cost of the index of ProMiSH-E is $S(I_{kp}) + S_E(H) + S(I_{khb})$. The total space cost of the index of ProMiSH-A is $S(I_{kp}) + S_A(H) + S(I_{khb})$. The ratio of index size to dataset size is further analyzed in section VIII-D.

Fig. 7. Average approximation ratio of ProMiSH-A for varying query sizes on 32-dimensional real datasets of various sizes.

VIII. EMPIRICAL EVALUATION

We evaluated the performance of ProMiSH-E and ProMiSH-A on synthetic and real datasets. We used the recently introduced Virtual bR*-Tree [2] as a reference method for comparison (see section II for a description). We first introduce the datasets and the metrics used for measuring the performance of the algorithms. Then, we discuss the quality results of the algorithms on real datasets. Next, we describe comparative results of ProMiSH-E, ProMiSH-A, and Virtual bR*-Tree on both synthetic and real datasets. We also report scalability results of ProMiSH on both synthetic and real datasets. Finally, we present a comparison of the space usage of all the algorithms.

Datasets: We used both synthetic and real datasets for experiments. Synthetic data was randomly generated. Each component of a d-dimensional synthetic point was chosen uniformly from [0-10,000].
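A synthetic dataset of the kind used in this section, together with a random query, could be generated along the following lines. This is only a sketch under stated assumptions: keywords are represented by integer ids 0..U-1, each point receives t distinct keywords, and the function names and the `seed` parameter are hypothetical.

```python
import random


def make_synthetic_dataset(n, d, u, t, seed=0):
    """Return (points, keywords): n uniform d-dimensional points with t tags each."""
    rng = random.Random(seed)
    points = [tuple(rng.uniform(0, 10_000) for _ in range(d)) for _ in range(n)]
    # Assumption: the t keywords of a point are drawn without replacement.
    keywords = [frozenset(rng.sample(range(u), t)) for _ in range(n)]
    return points, keywords


def make_query(u, q, rng):
    """Pick q distinct keywords at random from a dictionary of size u."""
    return rng.sample(range(u), q)


points, keywords = make_synthetic_dataset(n=100, d=10, u=1000, t=1)
query = make_query(u=1000, q=5, rng=random.Random(1))
```

Fixing the seed keeps a generated dataset reproducible across runs, which helps when query times for several methods are compared on the same data.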
Each synthetic point was randomly tagged with t keywords. A dataset is characterized by its (1) size, N; (2) dimensionality, d; (3) dictionary size, U; and (4) the number of keywords associated with each point, t. We created various synthetic datasets by varying these parameters for our empirical studies.

Our NKS query is useful for finding tight clusters of photos which contain all the keywords provided by a user in a photo-sharing social network as discussed in section I. Based on this application, we used images having descriptive tags as real datasets. We downloaded images with their textual keywords from Flickr. We transformed each image into grayscale. We created a d-dimensional dataset by extracting a d-dimensional color histogram from each image. Each data point was tagged with the keywords of its corresponding image. We describe real datasets of five different sizes used in our empirical studies in table III. The largest real dataset had 24,874 unique keywords and each point in it was tagged with 11 keywords. A query for a dataset was created by randomly picking a set of keywords from the dictionary of the dataset. A query is parameterized by its size q.

Performance metrics: We used approximation ratio, query time, and space usage as metrics to evaluate the quality of results (accuracy), the efficiency, and the scalability of the search algorithms.

Fig. 8. Query time comparison of algorithms for retrieving top-1 results for queries of size q=5 on synthetic datasets of varying dimensions d. Values of N=100,000, t=1, and U=1,000 were used for each dataset.

Fig. 9. Query time comparison of algorithms for retrieving top-1 results for queries of size q=5 on 25-dimensional synthetic datasets of varying sizes N. Values of t=1 and U=1,000 were used for each dataset.

Fig. 10. Query time comparison of algorithms for retrieving top-1 results for queries of varying sizes q on a 10-dimensional synthetic dataset having 100,000 points. Values of t=1 and U=1,000 were used for the dataset.

We measured the quality of results of an algorithm by its approximation ratio [30], [32]. For $1 \leq i \leq k$, if $r_i$ is the ith diameter in top-k results retrieved by an algorithm for a query Q and $r_i^*$ is the true ith diameter, then the approximation ratio of the algorithm for top-k search is given by $\rho(Q) = \left(\sum_{i=1}^{k} r_i / r_i^*\right)/k$. The smaller the value of $\rho(Q)$, the better is the quality of the results returned by the algorithm. The least value of $\rho(Q)$ is 1. We report the average approximation ratio (AAR) for the queries of a given size, which is the mean of the approximation ratios of 50 queries.

We validated the time efficiency of the algorithms by measuring their query times. The index structure and the dataset for each method reside in memory. Therefore, the query time measured as the elapsed CPU time between the start and the completion of a query gives a fair comparison between the methods. A query was executed multiple times and the average execution time was taken as its query time. Finally, we report the query time for a query size q as an average of 50 different queries. The query time of a search algorithm mainly depends on the dataset size N, the dataset dimension d, and the query size q. Therefore, we validated the scalability of the algorithms by computing their query times for varying values of N, d, and q.

We verified the space efficiency of an algorithm by computing the ratio of its index memory footprint to the dataset memory footprint.
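The approximation ratio $\rho(Q)$ and its average over a set of queries (AAR) can be computed as in the sketch below. The function names are illustrative, and the code assumes that every true diameter is strictly positive so that the division is defined.

```python
def approximation_ratio(found, truth):
    """rho(Q): mean of r_i / r_i* over the top-k diameters of one query."""
    if not found or len(found) != len(truth):
        raise ValueError("need two non-empty lists of equal length k")
    return sum(r / r_star for r, r_star in zip(found, truth)) / len(found)


def average_approximation_ratio(per_query):
    """AAR: mean of rho(Q) over (found, truth) pairs, one pair per query."""
    ratios = [approximation_ratio(found, truth) for found, truth in per_query]
    return sum(ratios) / len(ratios)


rho = approximation_ratio([2.0, 3.0], [2.0, 2.0])
aar = average_approximation_ratio([([1.0], [1.0]), ([3.0], [2.0])])
```

An exact method returns the true diameters, so its ratio is 1 for every query and its AAR is 1 as well.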
Implementation of the methods: We implemented all the methods in Java. For Virtual bR*-Tree, we fixed the leaf node size to 1,000 entries and other node sizes to 100 entries. Virtual bR*-Tree finds only the smallest subset, therefore we used k=1 for ProMiSH for a fair comparison. We used the values of m=2 and L=5 to create the index structure of ProMiSH-E and ProMiSH-A. For a dataset, if pMax is the maximum span of projected values of data points on any unit random vector, then a value of $w_0 = pMax / 2^L$ was used as the initial bin-width. All the experiments were performed on a machine having a Quad-Core Intel Xeon CPU@2.00GHz, 4,096 KB cache, and 98 GB main memory and running 64-bit Linux version 2.6.

Fig. 11. Query time analysis of ProMiSH algorithms for retrieving top-1 results for queries of varying sizes q on 25-dimensional synthetic datasets of varying sizes N. Values of t=1 and U=200 were used for each dataset.

A. Quality Test

We validated the result quality of ProMiSH-E, ProMiSH-A, and Virtual bR*-Tree by their average approximation ratios (AAR). ProMiSH-E and Virtual bR*-Tree perform an exact search. Therefore, they always retrieve the true top-k results, and have an AAR of 1. We used the results returned by them as the ground truth. Figure 7 shows AAR computed over top-5 results retrieved by ProMiSH-A for varying query sizes on two 32-dimensional real datasets. We observe from figure 7 that the AAR of ProMiSH-A is always less than 1.5. This low AAR allows ProMiSH-A to return practically useful results with very efficient time and space complexity.

B. Efficiency on Synthetic Datasets

We performed experiments on multiple synthetic datasets to verify the efficiency and the scalability of ProMiSH. We first discuss the comparison of query times of Virtual bR*-Tree, ProMiSH-A, and ProMiSH-E for varying dataset dimensions d, dataset sizes N, and query sizes q. We found that ProMiSH performs at least four orders of magnitude better than Virtual bR*-Tree. We also show results of the scalability tests of ProMiSH for varying values of N, d, q, and the result size k.
Our scalability results reveal a linear performance of ProMiSH with N, d, q, and k. All the query times are measured in milliseconds (ms) and shown in log scale in all the figures.

The query times of ProMiSH-E, ProMiSH-A, and Virtual bR*-Tree for retrieving top-1 results for queries of size 5 on datasets of varying dimensions d are shown in figure 8. We used a dataset of 100,000 points where each point was tagged with t=1 keyword using a dictionary of size U=1,000. For the dataset of dimension 25, ProMiSH-A completed in 1.8 ms and ProMiSH-E took only 4.2 ms. Conversely, results for Virtual bR*-Tree could not be obtained since it ran for more than 5 hours. We observed that ProMiSH not only significantly outperforms Virtual bR*-Tree on datasets of all dimensions but the difference in performance also grows to more than five orders with an increase in the dataset dimension. We show the query times of the algorithms on 25-dimensional datasets of varying sizes N for queries of size 5 in

figure 9. Each dataset used a dictionary of size U=1,000 and t=1 keyword per point. Virtual bR*-Tree failed to finish for the dataset of size N=100,000 even after 5 hours of execution. We report the query times of the algorithms for queries of varying sizes q on a 10-dimensional dataset of size N=100,000 in figure 10. Each data point was tagged with t=1 keyword using a dictionary of size U=1,000. For a query of size 5, ProMiSH-A had a query time of 1.7 ms, ProMiSH-E had a query time of 4.2 ms, and Virtual bR*-Tree had a query time of 305 seconds. We again observed that ProMiSH outperforms Virtual bR*-Tree by more than five orders of magnitude with an increase in the dataset size and the query size.

Fig. 12. Query time analysis of ProMiSH for retrieving top-1 results for queries of varying sizes q on large synthetic datasets of varying dimensions d. Values of N=3 million, t=1, and U=200 were used for each dataset.

Fig. 13. Query time analysis of ProMiSH algorithms for retrieving top-k results for queries of sizes 3 and 6 on a 50-dimensional synthetic dataset of size N=3 million. Values of t=1 and U=200 were used for the dataset.

Fig. 14. Query time comparison of algorithms for retrieving top-1 results for queries of size q=4 on real datasets of varying dimensions d and size N=50,000.

Fig. 16. Query time comparison of algorithms for retrieving top-1 results for queries of size q=4 on 16-dimensional real datasets of varying sizes N.

Fig. 17. Query time analysis of ProMiSH algorithms for retrieving top-1 results for queries of varying sizes q on real datasets of varying dimensions and size N=1 million.

All the above results show that the query time of ProMiSH increases linearly with the dataset size N, the dataset dimension d, and the query size q. In contrast, Virtual bR*-Tree fails to scale with q, d, and N. These results confirm that the pruning criteria of Virtual bR*-Tree, as discussed in section II, become ineffective with an increase in the dimension of the dataset. This leads to an exponential generation of potential candidates and large query times.
Next, we present scalability results of ProMiSH-E and ProMiSH-A on large synthetic datasets of varying dimensions for large query sizes and varying result sizes. Each dataset used a dictionary of size U=200. A point in each dataset was tagged with t=1 keyword. Figure 11 shows the query times for queries of varying sizes q on 25-dimensional datasets of varying sizes N. ProMiSH-E had a query time of 29 seconds and ProMiSH-A had a query time of 6 seconds for queries of size 9 on a dataset of 10 million points. We observed that ProMiSH-A is an order of magnitude faster than ProMiSH-E for queries of all sizes. We see from figure 11 that ProMiSH scales linearly with the query size and the dataset size.

Figure 12 shows the query times of ProMiSH for queries of varying sizes on 3 million size datasets of varying dimensions. ProMiSH-E had a query time of 4.7 seconds and ProMiSH-A had a query time of 0.3 seconds for queries of size q=9 on a 100-dimensional dataset. ProMiSH-A is an order of magnitude faster than ProMiSH-E on datasets of all dimensions. We observed that both algorithms scale linearly with the dimension d of the dataset.

Figure 13 shows the query times for retrieving the top-k results for queries of varying sizes q on a 50-dimensional dataset. It reveals a linear performance of both algorithms for increasing k. ProMiSH-A is an order of magnitude better than ProMiSH-E for any result size k. All these tests show that the query time of ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size.

Fig. 15. Query time comparison of algorithms for retrieving top-1 results for queries of varying sizes q on a 16-dimensional real dataset of size N=70,000.

C. Efficiency on Real Datasets

We evaluated the efficiency and the scalability of ProMiSH on multiple real datasets. We first discuss query time comparisons of alternative algorithms for varying dataset dimensions


More information

Stained Glass Design. Teaching Goals:

Stained Glass Design. Teaching Goals: Stined Glss Design Time required 45-90 minutes Teching Gols: 1. Students pply grphic methods to design vrious shpes on the plne.. Students pply geometric trnsformtions of grphs of functions in order to

More information

50 AMC LECTURES Lecture 2 Analytic Geometry Distance and Lines. can be calculated by the following formula:

50 AMC LECTURES Lecture 2 Analytic Geometry Distance and Lines. can be calculated by the following formula: 5 AMC LECTURES Lecture Anlytic Geometry Distnce nd Lines BASIC KNOWLEDGE. Distnce formul The distnce (d) between two points P ( x, y) nd P ( x, y) cn be clculted by the following formul: d ( x y () x )

More information

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Lecture 10 Evolutionry Computtion: Evolution strtegies nd genetic progrmming Evolution strtegies Genetic progrmming Summry Negnevitsky, Person Eduction, 2011 1 Evolution Strtegies Another pproch to simulting

More information

Chapter 2 Sensitivity Analysis: Differential Calculus of Models

Chapter 2 Sensitivity Analysis: Differential Calculus of Models Chpter 2 Sensitivity Anlysis: Differentil Clculus of Models Abstrct Models in remote sensing nd in science nd engineering, in generl re, essentilly, functions of discrete model input prmeters, nd/or functionls

More information

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence Solving Prolems y Serching CS 486/686: Introduction to Artificil Intelligence 1 Introduction Serch ws one of the first topics studied in AI - Newell nd Simon (1961) Generl Prolem Solver Centrl component

More information

COMP 423 lecture 11 Jan. 28, 2008

COMP 423 lecture 11 Jan. 28, 2008 COMP 423 lecture 11 Jn. 28, 2008 Up to now, we hve looked t how some symols in n lphet occur more frequently thn others nd how we cn sve its y using code such tht the codewords for more frequently occuring

More information

Spatial Cohesion Queries

Spatial Cohesion Queries Sptil Cohesion Queries Dimitris Schridis Technische Universität Wien Vienn, Austri dimitris@ec.tuwien.c.t Antonios Deliginnkis Technicl University of Crete Chni, Greece deli@softnet.tuc.gr ABSTRACT Given

More information

Fall 2018 Midterm 1 October 11, ˆ You may not ask questions about the exam except for language clarifications.

Fall 2018 Midterm 1 October 11, ˆ You may not ask questions about the exam except for language clarifications. 15-112 Fll 2018 Midterm 1 October 11, 2018 Nme: Andrew ID: Recittion Section: ˆ You my not use ny books, notes, extr pper, or electronic devices during this exm. There should be nothing on your desk or

More information

CSCI 446: Artificial Intelligence

CSCI 446: Artificial Intelligence CSCI 446: Artificil Intelligence Serch Instructor: Michele Vn Dyne [These slides were creted by Dn Klein nd Pieter Abbeel for CS188 Intro to AI t UC Berkeley. All CS188 mterils re vilble t http://i.berkeley.edu.]

More information

2 Computing all Intersections of a Set of Segments Line Segment Intersection

2 Computing all Intersections of a Set of Segments Line Segment Intersection 15-451/651: Design & Anlysis of Algorithms Novemer 14, 2016 Lecture #21 Sweep-Line nd Segment Intersection lst chnged: Novemer 8, 2017 1 Preliminries The sweep-line prdigm is very powerful lgorithmic design

More information

Efficient Regular Expression Grouping Algorithm Based on Label Propagation Xi Chena, Shuqiao Chenb and Ming Maoc

Efficient Regular Expression Grouping Algorithm Based on Label Propagation Xi Chena, Shuqiao Chenb and Ming Maoc 4th Ntionl Conference on Electricl, Electronics nd Computer Engineering (NCEECE 2015) Efficient Regulr Expression Grouping Algorithm Bsed on Lbel Propgtion Xi Chen, Shuqio Chenb nd Ming Moc Ntionl Digitl

More information

Overview. Network characteristics. Network architecture. Data dissemination. Network characteristics (cont d) Mobile computing and databases

Overview. Network characteristics. Network architecture. Data dissemination. Network characteristics (cont d) Mobile computing and databases Overview Mobile computing nd dtbses Generl issues in mobile dt mngement Dt dissemintion Dt consistency Loction dependent queries Interfces Detils of brodcst disks thlis klfigopoulos Network rchitecture

More information

SkyDiver: A Framework for Skyline Diversification

SkyDiver: A Framework for Skyline Diversification SkyDiver: A Frmework for Skyline Diversifiction George Vlkns Dept. of Informtics nd Telecommunictions University of Athens Athens, Greece gvlk@di.uo.gr Apostolos N. Ppdopoulos Dept. of Informtics Aristotle

More information

2014 Haskell January Test Regular Expressions and Finite Automata

2014 Haskell January Test Regular Expressions and Finite Automata 0 Hskell Jnury Test Regulr Expressions nd Finite Automt This test comprises four prts nd the mximum mrk is 5. Prts I, II nd III re worth 3 of the 5 mrks vilble. The 0 Hskell Progrmming Prize will be wrded

More information

A Comparison of the Discretization Approach for CST and Discretization Approach for VDM

A Comparison of the Discretization Approach for CST and Discretization Approach for VDM Interntionl Journl of Innovtive Reserch in Advnced Engineering (IJIRAE) Volume1 Issue1 (Mrch 2014) A Comprison of the Discretiztion Approch for CST nd Discretiztion Approch for VDM Omr A. A. Shib Fculty

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 2. Ben Raphael January 26, hhp://cs.brown.edu/courses/csci1950 z/ Outline

CSCI1950 Z Computa4onal Methods for Biology Lecture 2. Ben Raphael January 26, hhp://cs.brown.edu/courses/csci1950 z/ Outline CSCI1950 Z Comput4onl Methods for Biology Lecture 2 Ben Rphel Jnury 26, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Outline Review of trees. Coun4ng fetures. Chrcter bsed phylogeny Mximum prsimony Mximum

More information

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties, Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Transparent neutral-element elimination in MPI reduction operations

Transparent neutral-element elimination in MPI reduction operations Trnsprent neutrl-element elimintion in MPI reduction opertions Jesper Lrsson Träff Deprtment of Scientific Computing University of Vienn Disclimer Exploiting repetition nd sprsity in input for reducing

More information

Today. CS 188: Artificial Intelligence Fall Recap: Search. Example: Pancake Problem. Example: Pancake Problem. General Tree Search.

Today. CS 188: Artificial Intelligence Fall Recap: Search. Example: Pancake Problem. Example: Pancake Problem. General Tree Search. CS 88: Artificil Intelligence Fll 00 Lecture : A* Serch 9//00 A* Serch rph Serch Tody Heuristic Design Dn Klein UC Berkeley Multiple slides from Sturt Russell or Andrew Moore Recp: Serch Exmple: Pncke

More information

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards A Tutology Checker loosely relted to Stålmrck s Algorithm y Mrtin Richrds mr@cl.cm.c.uk http://www.cl.cm.c.uk/users/mr/ University Computer Lortory New Museum Site Pemroke Street Cmridge, CB2 3QG Mrtin

More information

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. Example: Pancake Problem. Example: Pancake Problem

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. Example: Pancake Problem. Example: Pancake Problem Announcements Project : erch It s live! Due 9/. trt erly nd sk questions. It s longer thn most! Need prtner? Come up fter clss or try Pizz ections: cn go to ny, ut hve priority in your own C 88: Artificil

More information

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis CS143 Hndout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexicl Anlysis In this first written ssignment, you'll get the chnce to ply round with the vrious constructions tht come up when doing lexicl

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Computing offsets of freeform curves using quadratic trigonometric splines

Computing offsets of freeform curves using quadratic trigonometric splines Computing offsets of freeform curves using qudrtic trigonometric splines JIULONG GU, JAE-DEUK YUN, YOONG-HO JUNG*, TAE-GYEONG KIM,JEONG-WOON LEE, BONG-JUN KIM School of Mechnicl Engineering Pusn Ntionl

More information

CS481: Bioinformatics Algorithms

CS481: Bioinformatics Algorithms CS481: Bioinformtics Algorithms Cn Alkn EA509 clkn@cs.ilkent.edu.tr http://www.cs.ilkent.edu.tr/~clkn/teching/cs481/ EXACT STRING MATCHING Fingerprint ide Assume: We cn compute fingerprint f(p) of P in

More information

Looking up objects in Pastry

Looking up objects in Pastry Review: Pstry routing tbles 0 1 2 3 4 7 8 9 b c d e f 0 1 2 3 4 7 8 9 b c d e f 0 1 2 3 4 7 8 9 b c d e f 0 2 3 4 7 8 9 b c d e f Row0 Row 1 Row 2 Row 3 Routing tble of node with ID i =1fc s - For ech

More information

CHAPTER III IMAGE DEWARPING (CALIBRATION) PROCEDURE

CHAPTER III IMAGE DEWARPING (CALIBRATION) PROCEDURE CHAPTER III IMAGE DEWARPING (CALIBRATION) PROCEDURE 3.1 Scheimpflug Configurtion nd Perspective Distortion Scheimpflug criterion were found out to be the best lyout configurtion for Stereoscopic PIV, becuse

More information

Dynamic Programming. Andreas Klappenecker. [partially based on slides by Prof. Welch] Monday, September 24, 2012

Dynamic Programming. Andreas Klappenecker. [partially based on slides by Prof. Welch] Monday, September 24, 2012 Dynmic Progrmming Andres Klppenecker [prtilly bsed on slides by Prof. Welch] 1 Dynmic Progrmming Optiml substructure An optiml solution to the problem contins within it optiml solutions to subproblems.

More information

In the last lecture, we discussed how valid tokens may be specified by regular expressions.

In the last lecture, we discussed how valid tokens may be specified by regular expressions. LECTURE 5 Scnning SYNTAX ANALYSIS We know from our previous lectures tht the process of verifying the syntx of the progrm is performed in two stges: Scnning: Identifying nd verifying tokens in progrm.

More information

COMBINATORIAL PATTERN MATCHING

COMBINATORIAL PATTERN MATCHING COMBINATORIAL PATTERN MATCHING Genomic Repets Exmple of repets: ATGGTCTAGGTCCTAGTGGTC Motivtion to find them: Genomic rerrngements re often ssocited with repets Trce evolutionry secrets Mny tumors re chrcterized

More information

MATH 2530: WORKSHEET 7. x 2 y dz dy dx =

MATH 2530: WORKSHEET 7. x 2 y dz dy dx = MATH 253: WORKSHT 7 () Wrm-up: () Review: polr coordintes, integrls involving polr coordintes, triple Riemnn sums, triple integrls, the pplictions of triple integrls (especilly to volume), nd cylindricl

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Section 10.4 Hyperbolas

Section 10.4 Hyperbolas 66 Section 10.4 Hyperbols Objective : Definition of hyperbol & hyperbols centered t (0, 0). The third type of conic we will study is the hyperbol. It is defined in the sme mnner tht we defined the prbol

More information

Pointwise convergence need not behave well with respect to standard properties such as continuity.

Pointwise convergence need not behave well with respect to standard properties such as continuity. Chpter 3 Uniform Convergence Lecture 9 Sequences of functions re of gret importnce in mny res of pure nd pplied mthemtics, nd their properties cn often be studied in the context of metric spces, s in Exmples

More information

INTRODUCTION TO SIMPLICIAL COMPLEXES

INTRODUCTION TO SIMPLICIAL COMPLEXES INTRODUCTION TO SIMPLICIAL COMPLEXES CASEY KELLEHER AND ALESSANDRA PANTANO 0.1. Introduction. In this ctivity set we re going to introduce notion from Algebric Topology clled simplicil homology. The min

More information

Graph Exploration: Taking the User into the Loop

Graph Exploration: Taking the User into the Loop Grph Explortion: Tking the User into the Loop Dvide Mottin, Anj Jentzsch, Emmnuel Müller Hsso Plttner Institute, Potsdm, Germny 2016/10/24 CIKM2016, Indinpolis, US Who we re Dvide Mottin grph mining, novel

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

ISG: Itemset based Subgraph Mining

ISG: Itemset based Subgraph Mining ISG: Itemset bsed Subgrph Mining by Lini Thoms, Stynryn R Vlluri, Kmlkr Krlplem Report No: IIIT/TR/2009/179 Centre for Dt Engineering Interntionl Institute of Informtion Technology Hyderbd - 500 032, INDIA

More information

Frequent Closed Itemset Mining Using Prefix Graphs

Frequent Closed Itemset Mining Using Prefix Graphs Frequent Closed Itemset Mining Using Prefix Grphs H. D. K. Moonesinghe, Smh Fodeh, Png-Ning Tn Deprtment of Computer Science & Engineering Michign Stte University Est Lnsing, MI 48824 (moonesin, fodehsm,

More information

Expected Worst-case Performance of Hash Files

Expected Worst-case Performance of Hash Files Expected Worst-cse Performnce of Hsh Files Per-Ake Lrson Deprtment of Informtion Processing, Abo Akdemi, Fnriksgtn, SF-00 ABO 0, Finlnd The following problem is studied: consider hshfilend the longest

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

AI Adjacent Fields. This slide deck courtesy of Dan Klein at UC Berkeley

AI Adjacent Fields. This slide deck courtesy of Dan Klein at UC Berkeley AI Adjcent Fields Philosophy: Logic, methods of resoning Mind s physicl system Foundtions of lerning, lnguge, rtionlity Mthemtics Forml representtion nd proof Algorithms, computtion, (un)decidility, (in)trctility

More information

Statistical classification of spatial relationships among mathematical symbols

Statistical classification of spatial relationships among mathematical symbols 2009 10th Interntionl Conference on Document Anlysis nd Recognition Sttisticl clssifiction of sptil reltionships mong mthemticl symbols Wl Aly, Seiichi Uchid Deprtment of Intelligent Systems, Kyushu University

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

Representation of Numbers. Number Representation. Representation of Numbers. 32-bit Unsigned Integers 3/24/2014. Fixed point Integer Representation

Representation of Numbers. Number Representation. Representation of Numbers. 32-bit Unsigned Integers 3/24/2014. Fixed point Integer Representation Representtion of Numbers Number Representtion Computer represent ll numbers, other thn integers nd some frctions with imprecision. Numbers re stored in some pproximtion which cn be represented by fixed

More information

Fall 2018 Midterm 2 November 15, 2018

Fall 2018 Midterm 2 November 15, 2018 Nme: 15-112 Fll 2018 Midterm 2 November 15, 2018 Andrew ID: Recittion Section: ˆ You my not use ny books, notes, extr pper, or electronic devices during this exm. There should be nothing on your desk or

More information

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs. Lecture 5 Wlks, Trils, Pths nd Connectedness Reding: Some of the mteril in this lecture comes from Section 1.2 of Dieter Jungnickel (2008), Grphs, Networks nd Algorithms, 3rd edition, which is ville online

More information

MA1008. Calculus and Linear Algebra for Engineers. Course Notes for Section B. Stephen Wills. Department of Mathematics. University College Cork

MA1008. Calculus and Linear Algebra for Engineers. Course Notes for Section B. Stephen Wills. Department of Mathematics. University College Cork MA1008 Clculus nd Liner Algebr for Engineers Course Notes for Section B Stephen Wills Deprtment of Mthemtics University College Cork s.wills@ucc.ie http://euclid.ucc.ie/pges/stff/wills/teching/m1008/ma1008.html

More information

Math 35 Review Sheet, Spring 2014

Math 35 Review Sheet, Spring 2014 Mth 35 Review heet, pring 2014 For the finl exm, do ny 12 of the 15 questions in 3 hours. They re worth 8 points ech, mking 96, with 4 more points for netness! Put ll your work nd nswers in the provided

More information

Self-Organizing Hierarchical Routing for Scalable Ad Hoc Networking

Self-Organizing Hierarchical Routing for Scalable Ad Hoc Networking 1 Self-Orgnizing Hierrchicl Routing for Sclble Ad Hoc Networking Shu Du Ahmed Khn Sntshil PlChudhuri Ansley Post Amit Kumr Sh Peter Druschel Dvid B. Johnson Rudolf Riedi Rice University Abstrct As devices

More information

SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs

SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs SAPPER: Sugrph Indexing nd Approximte Mtching in Lrge Grphs Shijie Zhng, Jiong Yng, Wei Jin EECS Dept., Cse Western Reserve University, {shijie.zhng, jiong.yng, wei.jin}@cse.edu ABSTRACT With the emergence

More information

Replicating Web Applications On-Demand

Replicating Web Applications On-Demand Replicting Web Applictions On-Demnd Swminthn Sivsubrmnin Guillume Pierre Mrten vn Steen Dept. of Computer Science, Vrije Universiteit, Amsterdm {swmi,gpierre,steen}@cs.vu.nl Abstrct Mny Web-bsed commercil

More information

Presentation Martin Randers

Presentation Martin Randers Presenttion Mrtin Rnders Outline Introduction Algorithms Implementtion nd experiments Memory consumption Summry Introduction Introduction Evolution of species cn e modelled in trees Trees consist of nodes

More information

Engineer To Engineer Note

Engineer To Engineer Note Engineer To Engineer Note EE-169 Technicl Notes on using Anlog Devices' DSP components nd development tools Contct our technicl support by phone: (800) ANALOG-D or e-mil: dsp.support@nlog.com Or visit

More information

12-B FRACTIONS AND DECIMALS

12-B FRACTIONS AND DECIMALS -B Frctions nd Decimls. () If ll four integers were negtive, their product would be positive, nd so could not equl one of them. If ll four integers were positive, their product would be much greter thn

More information

On String Matching in Chunked Texts

On String Matching in Chunked Texts On String Mtching in Chunked Texts Hnnu Peltol nd Jorm Trhio {hpeltol, trhio}@cs.hut.fi Deprtment of Computer Science nd Engineering Helsinki University of Technology P.O. Box 5400, FI-02015 HUT, Finlnd

More information

A REINFORCEMENT LEARNING APPROACH TO SCHEDULING DUAL-ARMED CLUSTER TOOLS WITH TIME VARIATIONS

A REINFORCEMENT LEARNING APPROACH TO SCHEDULING DUAL-ARMED CLUSTER TOOLS WITH TIME VARIATIONS A REINFORCEMENT LEARNING APPROACH TO SCHEDULING DUAL-ARMED CLUSTER TOOLS WITH TIME VARIATIONS Ji-Eun Roh (), Te-Eog Lee (b) (),(b) Deprtment of Industril nd Systems Engineering, Kore Advnced Institute

More information

Scalable Distributed Data Structures: A Survey Λ

Scalable Distributed Data Structures: A Survey Λ Sclble Distributed Dt Structures: A Survey Λ ADRIANO DI PASQUALE University of L Aquil, Itly ENRICO NARDELLI University of L Aquil nd Istituto di Anlisi dei Sistemi ed Informtic, Itly Abstrct This pper

More information

Preserving Constraints for Aggregation Relationship Type Update in XML Document

Preserving Constraints for Aggregation Relationship Type Update in XML Document Preserving Constrints for Aggregtion Reltionship Type Updte in XML Document Eric Prdede 1, J. Wenny Rhyu 1, nd Dvid Tnir 2 1 Deprtment of Computer Science nd Computer Engineering, L Trobe University, Bundoor

More information

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism Efficient K-NN Serch in Polyphonic Music Dtses Using Lower Bounding Mechnism Ning-Hn Liu Deprtment of Computer Science Ntionl Tsing Hu University Hsinchu,Tiwn 300, R.O.C 886-3-575679 nhliou@yhoo.com.tw

More information