Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search


Jiannan Wang, Guoliang Li, Jianhua Feng
Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
wjn8@mails.thu.edu.cn; liguoliang@tsinghua.edu.cn; fengjh@tsinghua.edu.cn

ABSTRACT
As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework. They select a prefix of each object and prune object pairs whose prefixes have no overlap. We observe that the prefix length has a significant effect on performance: different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective indexes. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems - Textual Databases; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Search Process

General Terms: Algorithms, Experimentation, Performance

Keywords: Prefix Filtering, Similarity Search, Similarity Join, Adaptive Framework, Cost Model

1. INTRODUCTION
As two important operations in data cleaning, similarity join and similarity search have attracted significant attention from the database community recently. Given two collections of objects, similarity join returns all similar object pairs. Similarity join has many real applications in data cleaning and in near-duplicate object detection and elimination. For example, an insurance company has two sets of customer records from two data sources.
An insurance clerk wants to eliminate the duplicates from the two sets. As the two customer records may have different representations, the clerk needs to use similarity join to correlate the two sets. Similarity search, given a collection of objects and a query object, finds all objects similar to the query object. Similarity search also has many applications in information retrieval and natural language processing. For example, as many queries issued to a search engine contain typos, search engines can use similarity search to suggest relevant queries.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '12, May 20-24, 2012, Scottsdale, Arizona, USA. Copyright ACM.

To quantify similarity between objects, many similarity functions have been proposed, such as jaccard similarity, cosine similarity, dice similarity, overlap similarity, and edit similarity. A similarity function takes two objects as input and returns their similarity. If the similarity is not smaller than a given threshold, the objects are taken to be similar.

Existing methods to support similarity join employ a filter-and-verification framework [4]. The basic idea is to first use an efficient filter to prune those object pairs that cannot be similar and then verify the surviving object pairs by computing their real similarity. In the filter step, prefix filtering is a dominant technique and many existing methods employ a prefix-filtering-based framework [,4]. Prefix filtering first transforms each object to a set of elements (see Section 2.2.1). Then it sorts the elements of each object based on a global ordering, and selects a prefix set for each object based on a given similarity threshold (see Section 2.2.2).
It can be proved that if two objects are similar, the prefix sets of the two objects must overlap. Finally, prefix filtering utilizes an inverted index to prune those object pairs whose prefix sets have no overlap (see Section 2.2.3).

We observe that prefix lengths have a large effect on performance. Different prefix lengths lead to significantly different performance, and prefix filtering nearly always gets the worst performance (see Section 3). Intuitively, longer prefixes have larger pruning power but involve more filtering time. On the contrary, shorter prefixes achieve higher filtering performance but lead to longer verification time. This calls for a method to adaptively select an appropriate prefix length for each object.

To this end, we propose an adaptive framework to address this problem. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective index structures. We develop effective pruning techniques to improve the performance. We also extend our method to support similarity search. Moreover, our method can support all of the above similarity functions. To summarize, we make the following contributions.

- We propose an adaptive framework to support both similarity join and similarity search.
- We develop a cost model to judiciously select an appropriate prefix for each object.
- We extend our method to support similarity search and develop effective pruning techniques.
- We have implemented our method. Experimental results on real data sets show that our framework beats prefix filtering and achieves high performance for both similarity join and similarity search.

Figure 1: Two collections of objects.
  R
  r1 = {vldb, sigmod, icde, , jagadish}
  r2 = {jagadish, koudas, vldb, edbt, icde}
  r3 = {koudas, divesh, jagadish, edbt, icde}
  r4 = {vldb, icde, koudas, jagadish, divesh}
  r5 = {, divesh, edbt, vldb, sigmod}
  S
  s1 = {nick, koudas, , vldb, sigmod}
  s2 = {nick, vldb, icde, sigmod, edbt}
  s3 = {koudas, divesh, sigmod, icde, edbt}
  s4 = {icde, sigmod, , jagadish, divesh}
  s5 = {, vldb, edbt, icde, jagadish}

The rest of this paper is organized as follows. We first give the problem formulation and introduce prefix filtering in Section 2, and then analyze the prefix-filtering-based framework theoretically and experimentally in Section 3. Our adaptive framework is proposed in Section 4. We extend our framework to support similarity search in Section 5. Experimental studies are conducted in Section 6. We review related work in Section 7 and conclude the paper in Section 8.

2. PRELIMINARIES

2.1 Problem Formulation
A similarity function is used to quantify the similarity of two objects. Given two objects $r$ and $s$, a similarity function, denoted by $sim(r,s)$, returns a value that represents their similarity. The larger the value, the more similar the two objects. Generally, users specify a similarity threshold $\theta$, and two objects are similar if their similarity is not smaller than the threshold, i.e. $sim(r,s) \ge \theta$. In this paper, we focus on two types of objects, sets and strings, which are widely used in many real applications. If the objects are sets, we consider the following similarity functions to quantify their similarity.

Definition 1. Let $r$ and $s$ be two sets.
- Overlap similarity: $sim_o(r,s) = |r \cap s|$.
- Dice similarity: $sim_d(r,s) = \frac{2|r \cap s|}{|r| + |s|}$.
- Cosine similarity: $sim_c(r,s) = \frac{|r \cap s|}{\sqrt{|r| \cdot |s|}}$.
- Jaccard similarity: $sim_j(r,s) = \frac{|r \cap s|}{|r| + |s| - |r \cap s|}$.
Here $|r|$ ($|s|$) denotes the size of set $r$ ($s$).

For example, consider $r$ = {sigmod, icde, vldb} and $s$ = {sigmod, icde}. We have $|r \cap s| = 2$, $|r| = 3$, and $|s| = 2$. Their overlap similarity is $sim_o(r,s) = 2$, their dice similarity is $sim_d(r,s) = \frac{4}{5}$, their cosine similarity is $sim_c(r,s) = \frac{2}{\sqrt{6}}$, and their jaccard similarity is $sim_j(r,s) = \frac{2}{3}$. If the objects are strings, we use edit distance to quantify their similarity.

Figure 2: Two collections of objects in Figure 1 after sorting elements in each object based on a global ordering.
  A global ordering
  e1 = jagadish, e2 = koudas, e3 = nick, e4 = divesh, e5 = , e6 = vldb, e7 = edbt, e8 = icde, e9 = sigmod
  R
  r1 = {e1, e5, e6, e8, e9}
  r2 = {e1, e2, e6, e7, e8}
  r3 = {e1, e2, e4, e7, e8}
  r4 = {e1, e2, e4, e6, e8}
  r5 = {e4, e5, e6, e7, e9}
  S
  s1 = {e2, e3, e5, e6, e9}
  s2 = {e3, e6, e7, e8, e9}
  s3 = {e2, e4, e7, e8, e9}
  s4 = {e1, e4, e5, e8, e9}
  s5 = {e1, e5, e6, e7, e8}

Definition 2. Let $r$ and $s$ be two strings. The edit distance $ed(r,s)$ between $r$ and $s$ is defined as the minimum number of single-character edit operations (insertions, deletions and substitutions) needed to transform $r$ to $s$. The edit similarity is defined as $es(r,s) = 1 - \frac{ed(r,s)}{\max(|r|,|s|)}$.

For example, $ed(\text{sigmod}, \text{sagmd}) = 2$ and $es(\text{sigmod}, \text{sagmd}) = 1 - \frac{2}{6} = \frac{2}{3}$. Note that the edit distance is a distance function. Different from a similarity function, the smaller the value $ed(r,s)$, the more similar the two objects. Therefore, given an edit-distance threshold $\theta$, two objects are similar if and only if their edit distance is not larger than $\theta$, i.e. $ed(r,s) \le \theta$. Next we define the SimJoin and SimSearch queries.

Definition 3 (SimJoin query). Given two collections of objects $R$ and $S$, a similarity function $sim$, and a specified similarity threshold $\theta$, a SimJoin query returns all object pairs $\langle r,s \rangle \in R \times S$ such that $sim(r,s) \ge \theta$, i.e. $\{\langle r,s \rangle \mid \langle r,s \rangle \in R \times S, sim(r,s) \ge \theta\}$.

For example, given the two collections of objects $R$ and $S$ in Figure 1, jaccard similarity $sim_j$ and $\theta = \frac{2}{3}$, the SimJoin query returns the object pairs $\{\langle r_1,s_4 \rangle, \langle r_1,s_5 \rangle, \langle r_2,s_5 \rangle, \langle r_3,s_3 \rangle\}$ since their jaccard similarity is not smaller than $\frac{2}{3}$, e.g. $sim_j(r_1,s_4) = \frac{4}{6} = \frac{2}{3}$. For the other object pairs, the jaccard similarity is smaller than $\frac{2}{3}$, e.g. $sim_j(r_2,s_1) = \frac{1}{4} < \frac{2}{3}$.

Definition 4 (SimSearch query).
Given a collection of objects $S$, a similarity function $sim$, a query object $r$, and a similarity threshold $\theta$, a SimSearch query returns all objects $s \in S$ such that $sim(r,s) \ge \theta$, i.e. $\{s \mid s \in S, sim(r,s) \ge \theta\}$.

For example, consider the collection of objects $S$ in Figure 1. Suppose $r$ = {nick, koudas, divesh, vldb, }, the similarity function is jaccard similarity $sim_j$, and $\theta = \frac{2}{3}$. The SimSearch query returns one object $\{s_1\}$ since $sim_j(r,s_1) = \frac{2}{3}$, and for any other object $s \in S$, the jaccard similarity between $r$ and $s$ is smaller than $\frac{2}{3}$, e.g. $sim_j(r,s_2) = \frac{1}{4} < \frac{2}{3}$.

2.2 Prefix-Filtering Framework
A brute-force method to answer a SimJoin query is to first compute the similarity of each object pair and then return the pairs whose similarity is not smaller than $\theta$. The time complexity of this method is $O(cost_v \cdot |R| \cdot |S|)$, where $cost_v$ is the average cost of computing the similarity of an object pair. If there are a large number of objects in $R$ and $S$, the method becomes quite expensive. In this section, we introduce the state-of-the-art framework, namely prefix filtering [4], which can address this problem efficiently. Its basic idea is to first use an efficient filter to prune those object pairs that cannot be similar and then verify the surviving object pairs by computing their real similarity. Since the number of surviving object pairs is much smaller than $|R| \cdot |S|$, often by several orders of magnitude, the algorithms based on this framework outperform the brute-force method significantly.

2.2.1 Mapping Object to Set
The prefix-filtering framework first maps objects to sets. Then we can transform various similarity functions to the overlap similarity function on the sets. That is, given a similarity function $sim$, a threshold $\theta$, and two objects $r, s$, if $sim(r,s) \ge \theta$, then the overlap similarity of the sets must be no smaller than a threshold $t$. Next we discuss how to map objects to sets and how to compute the threshold $t$.

First, consider the set similarity functions in Definition 1. We can simply map each object to itself, and the overlap threshold $t$ can be deduced as follows.
- If $sim_o(r,s) \ge \theta$, then $|r \cap s| \ge \theta$, thus $t = \theta$.
- If $sim_d(r,s) \ge \theta$, then $|r \cap s| \ge \frac{\theta}{2-\theta}|r|$, thus $t = \lceil \frac{\theta}{2-\theta}|r| \rceil$.
- If $sim_c(r,s) \ge \theta$, then $|r \cap s| \ge \theta^2 |r|$, thus $t = \lceil \theta^2 |r| \rceil$.
- If $sim_j(r,s) \ge \theta$, then $|r \cap s| \ge \theta |r|$, thus $t = \lceil \theta |r| \rceil$.

Second, for the edit distance and the edit similarity in Definition 2, we map each object to its q-gram set. The q-gram set of a string $r$, denoted by $Q_q(r)$, consists of all the substrings of $r$ with length $q$. For example, $Q_2(\text{sigmod})$ = {si, ig, gm, mo, od}. Using q-gram sets, we can deduce the overlap threshold $t$ as follows.
- If $ed(r,s) \le \theta$, then $|Q_q(r) \cap Q_q(s)| \ge |r| + 1 - (\theta + 1) \cdot q$, thus $t = |r| + 1 - (\theta + 1) \cdot q$.
- If $es(r,s) \ge \theta$, then $|Q_q(r) \cap Q_q(s)| \ge |r| + 1 - (\frac{1-\theta}{\theta}|r| + 1) \cdot q$, thus $t = \lceil |r| + 1 - (\frac{1-\theta}{\theta}|r| + 1) \cdot q \rceil$.

Obviously the object pairs whose mapped sets share fewer than $t$ common elements can be pruned. For example, consider the two collections of objects in Figure 1. Suppose the jaccard-similarity threshold is $\theta = 0.8$. For the object $r_1$, the overlap threshold is $t = \lceil 0.8 \times 5 \rceil = 4$. Three object pairs $\langle r_1,s_1 \rangle$, $\langle r_1,s_2 \rangle$, and $\langle r_1,s_3 \rangle$ can be pruned since $|r_1 \cap s_1| = 3 < 4$, $|r_1 \cap s_2| = 3 < 4$, and $|r_1 \cap s_3| = 2 < 4$. Note that these mappings may result in duplicated elements in a mapped set. To avoid multi-set intersection, we append an ordinal number to each element to distinguish duplicated elements [4].

2.2.2 Prefix Filtering
Existing methods utilize a prefix-filtering technique to filter the object pairs that share fewer than $t$ common elements. First, prefix filtering fixes a global ordering on the elements of all the objects. Then it sorts the elements of each object based on the global ordering. Let $Prefix(r)$ be the prefix set of $r$ that consists of the first $|r| - t + 1$ elements. It can be proved that if $|r \cap s| \ge t$, the two prefix sets must overlap, i.e. $Prefix(r) \cap Prefix(s) \ne \phi$. Therefore, prefix filtering can filter the object pairs whose prefix sets have no overlap [4].

For example, the table on the left of Figure 2 shows a global ordering on the elements of all the objects in Figure 1. We use $e_i$ to denote the element in the $i$-th position of the global ordering. Consider $r_1$ = {vldb, sigmod, icde, , jagadish} in Figure 1. The corresponding positions of the elements in the global ordering are $e_6$ = vldb, $e_9$ = sigmod, $e_8$ = icde, $e_5$ = , $e_1$ = jagadish. After sorting the elements according to the global ordering, we obtain $r_1 = \{e_1, e_5, e_6, e_8, e_9\}$. Similarly, we can obtain the other sorted objects as shown on the right of Figure 2. Suppose $t = 4$. Then $Prefix(r_1) = \{e_1, e_5\}$ and $Prefix(s_1) = \{e_2, e_3\}$. We can filter the pair $\langle r_1,s_1 \rangle$ based on prefix filtering since $Prefix(r_1) \cap Prefix(s_1) = \phi$.

2.2.3 Inverted Index
Note that we do not need to enumerate each object pair $\langle r,s \rangle \in R \times S$ to verify whether $Prefix(r) \cap Prefix(s) = \phi$ holds. Instead we use an inverted index to efficiently find the object pairs $\langle r,s \rangle \in R \times S$ such that $Prefix(r) \cap Prefix(s) \ne \phi$. An inverted index maps an element to a list of objects that contain the element. Such a list of objects is called an inverted list. We first build an inverted index on the prefix sets of the objects in one collection, e.g. $S$, and then enumerate the objects in the other collection $R$. For each $r \in R$, to obtain the objects $s \in S$ such that $Prefix(r) \cap Prefix(s) \ne \phi$, we only need to merge the inverted lists of the elements in $Prefix(r)$.

For example, suppose $t = 4$. The table on the top of Figure 3(a) shows the collection of prefix sets $\{Prefix(s) \mid s \in S\}$. Below it is the corresponding inverted index. Consider $Prefix(r_1) = \{e_1, e_5\}$.
We merge the inverted lists $e_1 \to \{s_4, s_5\}$ and $e_5 \to \{s_5\}$ to obtain the objects $s_4$ and $s_5$, whose prefix sets overlap with $Prefix(r_1)$.

3. FIXED-LENGTH PREFIX SCHEME
Many similarity-join algorithms [,4,9,,5 7] have been developed based on the prefix-filtering framework. They neglect the fact that prefix lengths have a significant effect on performance. In this section, we analyze the prefix-filtering framework in depth, both theoretically and experimentally. We conclude that the prefix-filtering framework is not effective enough and can be improved to achieve higher performance.

For ease of presentation, we first introduce some notation. Suppose the elements of each object are sorted based on a global ordering. Let $P_l$ denote the $l$-prefix scheme. $P_l(s)$ is defined as the $l$-prefix set of $s$, consisting of the first $|s| - t + l$ elements of $s$ ($1 \le l \le t$). Let $P_l(S) = \{P_l(s) \mid s \in S\}$ denote the collection of $l$-prefix sets of $S$. Let $I_l^S$ denote the inverted index built on $P_l(S)$, and $I_l^S(e)$ denote the inverted list of element $e$, which consists of the objects in $S$ whose $l$-prefix sets contain $e$. For simplicity, if the context is clear, $I_l^S$ and $I_l^S(e)$ are abbreviated as $I_l$ and $I_l(e)$ respectively. Figure 3 shows the four inverted indexes $I_l$ built on $P_l(S)$ for $1 \le l \le 4$.

Recall from Section 2.2.2 that $Prefix(r)$ consists of the first $|r| - t + 1$ elements of $r$, so the prefix-filtering framework essentially utilizes the 1-prefix scheme (i.e. $P_1$) for filtering object pairs. Next we study filter conditions using other prefix schemes. Consider two objects $r$ and $s$. Suppose $|r \cap s| \ge t$. For the $t$-prefix scheme, since $r = P_t(r)$ and $s = P_t(s)$, we have $|P_t(r) \cap P_t(s)| \ge t$. For the $(t-1)$-prefix scheme, $P_{t-1}(r)$ and $P_{t-1}(s)$ are respectively obtained by removing the last elements from $P_t(r)$ and $P_t(s)$. This removes at most one common element: if the two removed elements differ, the one that comes later in the global ordering cannot occur in the other set. Hence $|P_{t-1}(r) \cap P_{t-1}(s)| \ge t - 1$. Iteratively, for the $l$-prefix scheme, we have $|P_l(r) \cap P_l(s)| \ge l$. We can prune the object pairs $\langle r,s \rangle$ with $|P_l(r) \cap P_l(s)| < l$. The correctness is formalized in Lemma 1.

Lemma 1. For any object pair $\langle r,s \rangle \in R \times S$, if $|P_l(r) \cap P_l(s)| < l$, then $|r \cap s| < t$.
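The $l$-prefix sets and the filter condition of Lemma 1 can be illustrated with a short sketch on the running example of Figure 2. This is a minimal illustration rather than the authors' implementation: the function names (`l_prefix`, `survives`, `lemma1_holds`, `similar_pairs`) are hypothetical, and elements are encoded by their position in the global ordering (1 for $e_1$, ..., 9 for $e_9$).

```python
from itertools import product

# Running example of Figure 2: each object is a list of element indices
# (1 stands for e1, ..., 9 for e9), already sorted by the global ordering.
R = {"r1": [1, 5, 6, 8, 9], "r2": [1, 2, 6, 7, 8], "r3": [1, 2, 4, 7, 8],
     "r4": [1, 2, 4, 6, 8], "r5": [4, 5, 6, 7, 9]}
S = {"s1": [2, 3, 5, 6, 9], "s2": [3, 6, 7, 8, 9], "s3": [2, 4, 7, 8, 9],
     "s4": [1, 4, 5, 8, 9], "s5": [1, 5, 6, 7, 8]}


def l_prefix(obj, t, l):
    """Return the l-prefix set: the first |obj| - t + l elements of obj."""
    return set(obj[:max(0, len(obj) - t + l)])


def survives(r, s, t, l):
    """Lemma 1 filter: keep <r, s> only if the l-prefix sets share >= l elements."""
    return len(l_prefix(r, t, l) & l_prefix(s, t, l)) >= l


def lemma1_holds(R, S, t):
    """Brute-force check that no pair with overlap >= t is pruned for any l in 1..t."""
    for (r, s), l in product(product(R.values(), S.values()), range(1, t + 1)):
        if len(set(r) & set(s)) >= t and not survives(r, s, t, l):
            return False
    return True


def similar_pairs(R, S, t):
    """Ground truth by exhaustive comparison (the brute-force method of Section 2.2)."""
    return sorted((rn, sn) for rn, r in R.items() for sn, s in S.items()
                  if len(set(r) & set(s)) >= t)
```

With $t = 4$, `survives(R["r1"], S["s1"], 4, 1)` is false, which corresponds to pruning $\langle r_1,s_1 \rangle$ under the 1-prefix scheme, while `lemma1_holds(R, S, 4)` confirms on this small data set that no pair with overlap at least 4 is pruned by any prefix length.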

Figure 3: Inverted indexes built on $P_1(S)$, $P_2(S)$, $P_3(S)$, $P_4(S)$ ($t = 4$). Panels: (a) 1-prefix scheme, (b) 2-prefix scheme, (c) 3-prefix scheme, (d) 4-prefix scheme. The prefix sets indexed in each panel are:
        P_1(s)      P_2(s)          P_3(s)              P_4(s)
  s1    {e2,e3}     {e2,e3,e5}      {e2,e3,e5,e6}       {e2,e3,e5,e6,e9}
  s2    {e3,e6}     {e3,e6,e7}      {e3,e6,e7,e8}       {e3,e6,e7,e8,e9}
  s3    {e2,e4}     {e2,e4,e7}      {e2,e4,e7,e8}       {e2,e4,e7,e8,e9}
  s4    {e1,e4}     {e1,e4,e5}      {e1,e4,e5,e8}       {e1,e4,e5,e8,e9}
  s5    {e1,e5}     {e1,e5,e6}      {e1,e5,e6,e7}       {e1,e5,e6,e7,e8}

Next we develop a framework, called FixPrefixScheme, which can use any fixed-length prefix scheme to prune object pairs based on Lemma 1. For simplicity, suppose we use the overlap similarity. Initially, FixPrefixScheme sorts the elements in each object of $R$ and $S$ based on the global element ordering. Then the framework builds an inverted index $I_l$ on $P_l(S)$ and utilizes the index to filter the pairs $\langle r,s \rangle$ such that $|P_l(r) \cap P_l(s)| < l$. To achieve this goal, for each $r \in R$, it considers the elements $e$ in $P_l(r)$ and retrieves their corresponding inverted lists $I_l(e)$. For any object $s \in I_l(e)$, its $l$-prefix set $P_l(s)$ must contain element $e$. As $e \in P_l(r)$, $P_l(r)$ and $P_l(s)$ share the common element $e$. Since there is no duplicated element in each object (Section 2.2.1), $|P_l(r) \cap P_l(s)|$ is exactly the number of inverted lists $I_l(e)$ for $e \in P_l(r)$ that contain the object $s$. We scan the inverted lists one by one and use a hash map $H[s]$ to maintain the number of inverted lists that contain the object $s$. If $H[s] \ge l$ holds, we take $s$ as a candidate of $r$. After scanning all inverted lists, we verify the candidates by computing the real similarity.

Example 1. Consider the two collections of objects $R$ and $S$ in Figure 2. Given an overlap threshold $t = 4$ and the 2-prefix scheme $P_2$, we show how FixPrefixScheme utilizes $P_2$ to find the pairs $\langle r,s \rangle \in R \times S$ such that $|r \cap s| \ge 4$. First, we build an inverted index $I_2$ on $P_2(S)$ (see Figure 3(b)). Then we enumerate each $r \in R$ and find its similar objects in $S$.
Consider $r_1 = \{e_1, e_5, e_6, e_8, e_9\} \in R$. To obtain the similar objects of $r_1$, we consider its 2-prefix set $P_2(r_1) = \{e_1, e_5, e_6\}$, which consists of the first $|r_1| - t + l = 3$ elements of $r_1$. We retrieve from $I_2$ the inverted lists corresponding to the elements in $P_2(r_1)$: $I_2(e_1) = \{s_4, s_5\}$, $I_2(e_5) = \{s_1, s_4, s_5\}$, $I_2(e_6) = \{s_2, s_5\}$. Since $s_4$ appears in $I_2(e_1)$ and $I_2(e_5)$, we have $H[s_4] = 2$. As $H[s_4] \ge l = 2$ holds, $s_4$ is a candidate of $r_1$. Similarly, we can compute $H[s_5] = 3$, $H[s_1] = 1$, and $H[s_2] = 1$, thus $s_5$ is also a candidate. Next we verify the candidates by computing $|r_1 \cap s_4|$ and $|r_1 \cap s_5|$ and comparing them with the threshold $t = 4$. As $|r_1 \cap s_4| \ge 4$ and $|r_1 \cap s_5| \ge 4$, $s_4$ and $s_5$ are similar objects of $r_1$.

Obviously the prefix-filtering framework ($l = 1$) is a special case of the FixPrefixScheme framework. Next we show, theoretically and experimentally, that the prefix-filtering framework does not always achieve good performance.

Theoretical Analysis. We analyze the time cost of the FixPrefixScheme framework using different prefix schemes. The framework mainly includes the following two steps. We ignore the cost of sorting the elements in each object and of building an inverted index, since the former remains the same for any prefix scheme and the latter is much smaller than the other steps.

Filter. For each object $r \in R$, FixPrefixScheme needs to scan the inverted list of each element $e \in P_l(r)$, so the total filter cost is $\sum_{r \in R} \sum_{e \in P_l(r)} |I_l(e)|$.

Verification. Let $C_l(r)$ denote the candidate set of $r$, which consists of the objects that appear in at least $l$ inverted lists of the elements in $P_l(r)$, and let $cost_v(r)$ denote the average cost of verifying a candidate of $r$. For all objects $r \in R$, the total verification cost is $\sum_{r \in R} cost_v(r) \cdot |C_l(r)|$.

By adding the two costs, we obtain the total cost of FixPrefixScheme using the $l$-prefix scheme, i.e.

$$\Theta_l = \Big( \sum_{r \in R} \sum_{e \in P_l(r)} |I_l(e)| \Big) + \Big( \sum_{r \in R} cost_v(r) \cdot |C_l(r)| \Big). \quad (1)$$

Obviously, $\Theta_1$ is the cost of prefix filtering. For the cost of longer prefix schemes, i.e. $\Theta_l$ ($l > 1$), the filter cost increases since both $|P_l(r)|$ and $|I_l(e)|$ increase, while the verification cost decreases since $P_l$ has a more powerful filter condition than $P_1$, which leads to fewer candidates (as proved in Lemma 2).
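The filter and verification steps and the cost of Equation 1 can be sketched as follows for the running example. This is a hedged sketch under stated assumptions, not the authors' code: the names `prefix`, `build_index`, `candidates` and `fix_prefix_scheme` are hypothetical, every scanned list entry is charged one unit, and each verification is charged $|r| + |s|$ units (a merge-based overlap computation).

```python
from collections import defaultdict

# Figure 2 data, elements encoded by their position in the global ordering.
R = {"r1": [1, 5, 6, 8, 9], "r2": [1, 2, 6, 7, 8], "r3": [1, 2, 4, 7, 8],
     "r4": [1, 2, 4, 6, 8], "r5": [4, 5, 6, 7, 9]}
S = {"s1": [2, 3, 5, 6, 9], "s2": [3, 6, 7, 8, 9], "s3": [2, 4, 7, 8, 9],
     "s4": [1, 4, 5, 8, 9], "s5": [1, 5, 6, 7, 8]}


def prefix(obj, t, l):
    """First |obj| - t + l elements of a sorted object."""
    return obj[:max(0, len(obj) - t + l)]


def build_index(S, t, l):
    """Inverted index I_l: element -> names of objects whose l-prefix contains it."""
    index = defaultdict(list)
    for name, obj in S.items():
        for e in prefix(obj, t, l):
            index[e].append(name)
    return index


def candidates(r, index, t, l):
    """Objects that occur in at least l inverted lists of the elements of P_l(r)."""
    counts = defaultdict(int)              # the hash map H[s]
    for e in prefix(r, t, l):
        for sn in index.get(e, []):
            counts[sn] += 1
    return sorted(sn for sn, c in counts.items() if c >= l)


def fix_prefix_scheme(R, S, t, l):
    """Return (result pairs, total cost Theta_l in the unit-cost model)."""
    index = build_index(S, t, l)
    results, cost = [], 0
    for rn, r in R.items():
        # Filter cost: total length of the scanned inverted lists.
        cost += sum(len(index.get(e, [])) for e in prefix(r, t, l))
        for sn in candidates(r, index, t, l):
            cost += len(r) + len(S[sn])    # assumed verification cost |r| + |s|
            if len(set(r) & set(S[sn])) >= t:
                results.append((rn, sn))
    return sorted(results), cost
```

Calling `candidates(R["r1"], build_index(S, 4, 2), 4, 2)` reproduces the candidate set of Example 1, and `fix_prefix_scheme(R, S, 4, l)` returns the same result pairs for every `l` while the cost changes with the prefix length.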
Consequently, a longer prefix scheme may cost less overall, i.e. $\Theta_l$ ($l > 1$) may be smaller than $\Theta_1$, because the drop in verification cost can outweigh the rise in filter cost.

Lemma 2. For any $r \in R$, $C_1(r) \supseteq C_2(r) \supseteq \cdots \supseteq C_t(r)$.

(Footnote to the cost analysis: we suppose all operations have the same unit cost for ease of presentation.)

Experimental Analysis. We also conduct an experiment on the DBLP-Set data set (the data set description is in Section 6) to compare the running time of FixPrefixScheme using different prefix schemes. Figure 4 reports the results. The x-axis denotes the overlap threshold, which is varied upward from 8. We can see that the 1-prefix scheme (prefix filtering) performs the worst among all prefix schemes. For example, when the overlap threshold is $t = 8$, FixPrefixScheme with the 1-prefix scheme consumed 88s while FixPrefixScheme with the other prefix schemes took less time. Another observation is that the prefix scheme has a great effect on the performance of FixPrefixScheme. For instance, at one of the thresholds the running time of FixPrefixScheme varies from 7s to 456s depending on the prefix scheme.

[Figure 4: Running time of FixPrefixScheme using different prefix schemes on DBLP-Set data set. x-axis: overlap threshold; y-axis: time; one curve per prefix scheme, from the 1-prefix scheme to the 6-prefix scheme.]

From the experiments and the theoretical analysis, we conclude that a fixed prefix scheme may not always achieve the highest performance. To achieve the highest performance, we need to dynamically select the prefix length. More importantly, we do not need to fix the prefix length for all objects. Instead we can select different prefix lengths for different objects. To this end, we propose an adaptive framework to judiciously select variable-length prefix schemes for different objects in Section 4.

4. ADAPTIVE FRAMEWORK FOR SimJoin
In this section, we first present a variable-length prefix scheme in Section 4.1. Then in Section 4.2, we propose an adaptive framework to select appropriate prefixes, and describe two challenges that arise in our framework. Finally, we present effective methods in Sections 4.3 and 4.4 to address these two challenges respectively.

4.1 Variable-Length Prefix Scheme
Instead of fixing the same prefix scheme for all objects, we adaptively select a variable-length prefix scheme for each object $r \in R$. We call this method AdaptPrefixScheme. Suppose we use the $l_r$-prefix scheme for object $r$. The total cost of AdaptPrefixScheme is

$$\Theta = \sum_{r \in R} \Theta_{l_r}(r) = \sum_{r \in R} \big( F_{l_r}(r) + V_{l_r}(r) \big), \quad (2)$$

where $F_{l_r}(r)$ is the filter cost

$$F_{l_r}(r) = \sum_{e \in P_{l_r}(r)} |I_{l_r}(e)|, \quad (3)$$

and $V_{l_r}(r)$ is the verification cost

$$V_{l_r}(r) = cost_v(r) \cdot |C_{l_r}(r)|. \quad (4)$$

As FixPrefixScheme is a special case of AdaptPrefixScheme, AdaptPrefixScheme with well-chosen prefixes performs at least as well as FixPrefixScheme. In this paper we study how to select the best prefix scheme for each object in order to achieve the highest performance. We use the following example to illustrate our basic idea.

Example 2. Consider the example in Figure 2. Given overlap similarity and the threshold 4, for each $r \in R$, we utilize each prefix scheme in turn to find the objects $s \in S$ such that $|r \cap s| \ge 4$, and compute the corresponding cost. Consider the object $r_4 = \{e_1, e_2, e_4, e_6, e_8\} \in R$. If we use the 1-prefix scheme, then $P_1(r_4) = \{e_1, e_2\}$. We retrieve $I_1(e_1) = \{s_4, s_5\}$ and $I_1(e_2) = \{s_1, s_3\}$ from the inverted index in Figure 3(a). We obtain $F_1(r_4) = |I_1(e_1)| + |I_1(e_2)| = 4$. As $s_1$, $s_3$, $s_4$, $s_5$ each appear in at least one inverted list, the candidate set is $C_1(r_4) = \{s_1, s_3, s_4, s_5\}$. Since verifying $|r \cap s| \ge 4$ costs $|r| + |s|$, we have $cost_v(r_4) = |r_4| + |s| = 10$, thus $V_1(r_4) = cost_v(r_4) \cdot |C_1(r_4)| = 40$.
The total cost of using the 1-prefix scheme is $\Theta_1(r_4) = F_1(r_4) + V_1(r_4) = 44$. If we use the 2-prefix scheme for $r_4$, then $P_2(r_4) = \{e_1, e_2, e_4\}$. We retrieve $I_2(e_1) = \{s_4, s_5\}$, $I_2(e_2) = \{s_1, s_3\}$ and $I_2(e_4) = \{s_3, s_4\}$ from the inverted index in Figure 3(b). We have $F_2(r_4) = |I_2(e_1)| + |I_2(e_2)| + |I_2(e_4)| = 6$. As $s_3$ and $s_4$ appear in at least two inverted lists, the candidate set is $C_2(r_4) = \{s_3, s_4\}$, thus $V_2(r_4) = cost_v(r_4) \cdot |C_2(r_4)| = 20$. The cost of using the 2-prefix scheme is $\Theta_2(r_4) = F_2(r_4) + V_2(r_4) = 26$. Similarly, $\Theta_3(r_4) = 9$ and $\Theta_4(r_4) = 13$. As $\Theta_3(r_4)$ is the minimum, the 3-prefix scheme is optimal for $r_4$.

Table 1 shows $\Theta_l(r)$ for all objects $r \in R$. We can see that different objects have different optimal prefix schemes. For example, it is optimal for $r_1$ to select the 1-prefix scheme, while for $r_2$ the 2-prefix scheme leads to the minimum cost. If all objects are required to select the same scheme, the minimum total cost is $\Theta_3 = \sum_{r \in R} \Theta_3(r) = 100$. But if we can select an optimal prefix scheme for each object, the minimum cost will be $\Theta_1(r_1) + \Theta_2(r_2) + \Theta_3(r_3) + \Theta_3(r_4) + \Theta_4(r_5) = 82$.

Table 1: $\Theta_l(r)$ for all objects $r \in R$.
                 l = 1    l = 2    l = 3    l = 4
  Θ_l(r1)          23       27       31       36
  Θ_l(r2)          44       16       20       24
  Θ_l(r3)          44       26       19       23
  Θ_l(r4)          44       26        9       13
  Θ_l(r5)          33       27       21       15
  Θ_l             188      122      100      111

4.2 Overview of Our Framework
We present an overview of our AdaptPrefixScheme framework. Figure 5 gives the pseudo-code. The framework first builds an index on $S$ (Line 2). Then it enumerates the objects in $R$ (Line 3). For each $r \in R$, the framework automatically selects an appropriate prefix scheme $P_l$ for $r$ rather than using a fixed one (Line 4). Next it uses the selected prefix to filter the objects in $S$ and obtains a candidate set of the surviving objects (Line 5). Finally, the framework verifies the candidates and returns the similar object pairs (Line 6).

  Algorithm 1: AdaptPrefixScheme (R, S, t)
  Input:  R, S : two collections of objects
          t : an overlap threshold
  Output: O : all pairs of objects <r, s> such that |r ∩ s| ≥ t
  1 begin
  2   Build an index that can support variable-length prefix schemes on S;
  3   for each r ∈ R s.t. |r| ≥ t do
  4     Select a prefix scheme P_l for r;
  5     Utilize P_l to filter objects and get candidates;
  6     Verify the candidates and add results to O;
  7 end
Figure 5: AdaptPrefixScheme framework.

In our framework, selecting variable-length prefix schemes for objects raises two challenges. The first is how to use the selected prefix scheme to do filtering, and the second is how to select the prefix scheme for an object.

We first consider the first challenge. Consider two objects $r_i$ and $r_j$. Suppose $r_i$ selects the $l_{r_i}$-prefix scheme and $r_j$ selects the $l_{r_j}$-prefix scheme. Then $r_i$ needs to use the inverted index $I_{l_{r_i}}$ to do filtering while $r_j$ needs to use the inverted index $I_{l_{r_j}}$. To address this issue, a naive method is to build inverted indexes for all prefix schemes, i.e. $I_1, I_2, \ldots, I_t$. Obviously, this method is expensive in terms of indexing time and space. In Section 4.3, we study how to build effective indexes to support effective filtering for variable-length prefix schemes.

Next we consider the second challenge. Given an object $r$, a straightforward method enumerates each possible prefix scheme $P_l(r)$ ($l \in [1,t]$), then estimates the value of $\Theta_l(r)$, denoted by $\hat{\Theta}_l(r)$, and finally selects $P_{l_o}(r)$ such that $\hat{\Theta}_{l_o}(r)$ is minimum, i.e. $l_o = \arg\min_{l \in [1,t]} \hat{\Theta}_l(r)$. However, this method neglects the estimation cost. Let $E_l(r)$ denote the cost of estimating $\Theta_l(r)$. The total estimation cost to select the optimal prefix scheme is $\sum_{l \in [1,t]} E_l(r)$. If the estimation is expensive, it will be rather time-consuming to estimate the cost for all prefix schemes.

In addition, to estimate $\Theta_l(r)$, we need to estimate the candidate-set size. That is, given a group of inverted lists, we need to estimate the number of elements that appear in at least $l$ inverted lists. The VSOL estimator [7], which was proposed to estimate the selectivity of approximate string queries, can be applied to address this problem. The technique computes min-wise signatures for each inverted list and utilizes these signatures to estimate the number of elements. However, the cost of computing signatures is very high. For SimJoin queries, this cost should be added to the similarity-join cost. Therefore, it is necessary to develop an estimation approach that avoids such an expensive signature-computation step. To address these issues, we propose an efficient method in Section 4.4.

4.3 Delta Inverted Indexes
In this section, we propose delta inverted indexes to support effective filtering using variable-length prefix schemes. Recall the inverted index $I_l$. Given an element $e$, the inverted list $I_l(e)$ keeps the objects whose $l$-prefix set contains $e$. Similarly, $I_{l+1}(e)$ keeps the objects whose $(l+1)$-prefix set contains $e$. Obviously $I_l(e) \subseteq I_{l+1}(e)$. To save space, we only keep the objects in which $I_l(e)$ and $I_{l+1}(e)$ differ. Let $\Delta I_1(e) = I_1(e)$ and let $\Delta I_{l+1}(e)$ ($1 \le l \le t-1$) denote the delta inverted list of $e$ between $I_l(e)$ and $I_{l+1}(e)$, that is $\Delta I_{l+1}(e) = I_{l+1}(e) - I_l(e)$. Thus we build the delta inverted indexes $\Delta I_1, \ldots, \Delta I_t$ to replace $I_1, \ldots, I_t$.

Next we discuss how to build the delta inverted indexes. Initially, the delta inverted indexes are empty. Then for each object $s \in S$, we visit its elements based on the global element ordering.
If the element $e$ is in the 1-prefix set of $s$, we insert $s$ into $\Delta I_1(e)$; otherwise, we insert $s$ into $\Delta I_l(e)$ such that the $l$-prefix set contains $e$ but the $(l-1)$-prefix set does not. Since each element of each object in $S$ is added to at most one delta inverted index, the space complexity is $O(\sum_{s \in S} |s|)$. As the time complexity of inserting an element into a list is $O(1)$, the time complexity is $O(\sum_{s \in S} |s|)$.

Example 3. Consider the collection $S$ in Figure 2 and suppose $t = 4$. To build delta inverted indexes on $S$, we first initialize four empty inverted indexes, i.e. $\Delta I_1$, $\Delta I_2$, $\Delta I_3$ and $\Delta I_4$. Then we insert $s_1$ through $s_5$ into the indexes. Suppose $s_1$ through $s_4$ have been inserted. Figure 6 shows the process of inserting $s_5 = \{e_1, e_5, e_6, e_7, e_8\}$. Since the 1-prefix set of $s_5$ is $\{e_1, e_5\}$, we insert $s_5$ into $\Delta I_1(e_1)$ and $\Delta I_1(e_5)$ respectively. Since $e_6$ is in the 2-prefix set but not in the 1-prefix set, we insert $s_5$ into $\Delta I_2(e_6)$. Similarly, we insert $s_5$ into $\Delta I_3(e_7)$ and into $\Delta I_4(e_8)$.

[Figure 6: Delta inverted indexes built on the collection $S$ in Figure 2 ($t = 4$), showing the insertion of $s_5$ into $\Delta I_1$, $\Delta I_2$, $\Delta I_3$ and $\Delta I_4$.]

Next we discuss how to use delta inverted indexes to do filtering. Suppose we want to find the candidates of an object $r$ w.r.t. the $l$-prefix scheme. If we use inverted indexes, we need to merge the inverted lists $I_l(e)$ for $e \in P_l(r)$ and find the objects that appear in at least $l$ lists. In terms of delta inverted indexes, since $I_l(e)$ is divided into $\Delta I_1(e), \ldots, \Delta I_l(e)$, we need to merge the delta inverted lists $\Delta I_i(e)$ for $1 \le i \le l$ and $e \in P_l(r)$. If we use a hash-based method to merge lists, the method using the delta inverted indexes has the same time complexity as that using the inverted index $I_l$, i.e. $\sum_{e \in P_l(r)} \sum_{i \in [1,l]} |\Delta I_i(e)| = \sum_{e \in P_l(r)} |I_l(e)|$.

4.4 Adaptively Selecting Prefix Scheme
To select an optimal prefix for an object, the brute-force method, which estimates the cost of every possible prefix length and selects the best one, is very expensive, as discussed in Section 4.2. To address this issue, we propose a cost-based method to select an appropriate prefix for an object.
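The construction of the delta inverted indexes in Section 4.3 can be sketched as follows on the collection $S$ of Figure 2. This is an illustrative sketch, not the authors' implementation: `build_delta_indexes` and `inverted_list` are hypothetical names, elements are encoded by their position in the global ordering, and objects with fewer than $t$ elements are skipped on the assumption that they can never reach overlap $t$.

```python
from collections import defaultdict

# Collection S of Figure 2, elements encoded by position in the global ordering.
S = {"s1": [2, 3, 5, 6, 9], "s2": [3, 6, 7, 8, 9], "s3": [2, 4, 7, 8, 9],
     "s4": [1, 4, 5, 8, 9], "s5": [1, 5, 6, 7, 8]}


def build_delta_indexes(S, t):
    """delta[l][e] holds the objects whose l-prefix contains e
    but whose (l-1)-prefix does not (delta[1] is the plain index I_1)."""
    delta = {l: defaultdict(list) for l in range(1, t + 1)}
    for name, obj in S.items():
        if len(obj) < t:
            continue                         # assumption: too short to be similar
        first = len(obj) - t + 1             # size of the 1-prefix set
        for pos, e in enumerate(obj):        # pos is 0-based, obj is sorted
            # Elements of the 1-prefix go to level 1; each later element
            # enters the prefix exactly one level after its predecessor.
            level = 1 if pos < first else pos - first + 2
            delta[level][e].append(name)
    return delta


def inverted_list(delta, e, l):
    """Recover I_l(e) by concatenating the delta lists of levels 1..l."""
    merged = []
    for i in range(1, l + 1):
        merged.extend(delta[i].get(e, []))
    return merged


delta = build_delta_indexes(S, 4)
```

Each object contributes every one of its elements to exactly one level, so the index holds $\sum_{s \in S} |s| = 25$ entries for this example, and `inverted_list(delta, 5, 2)` returns the same objects as $I_2(e_5)$ in Example 1 (possibly in a different order).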
We have an observation that with the increase of the prefix length, the overall cost (the sum of the filter cost and the verification cost) usually first increases and then decreases. For example, in Figure 4, when the overlap threshold is 8, the running time of FixPrefixScheme first increases and then decreases as the prefix length grows to 6. This is because with the increase of the prefix length, the filtering time increases and the verification time decreases. Thus there is a tradeoff between the filtering cost and the verification cost.

Based on this observation, we compare the $l$-prefix scheme with the $(l+1)$-prefix scheme for $l$ from 1 to $t$. If the $(l+1)$-prefix scheme is not better than the $l$-prefix scheme, we stop the algorithm and select the $l$-prefix scheme as $r$'s prefix scheme; otherwise, we continue to compare the $(l+1)$-prefix scheme and the $(l+2)$-prefix scheme.

To decide which one is better between the $l$-prefix and the $(l+1)$-prefix schemes, we compute the total cost of selecting each of them as $r$'s prefix scheme. If the $l$-prefix scheme is selected, we need to estimate $\Theta_i(r)$ for each $i \in [1, l+1]$, thus the total cost will be $\Theta_l(r) + \sum_{i \in [1,l+1]} E_i(r)$. Similarly, if the $(l+1)$-prefix scheme is selected, the total cost will be $\Theta_{l+1}(r) + \sum_{i \in [1,l+2]} E_i(r)$. Obviously, if $\Theta_l(r) < \Theta_{l+1}(r) + E_{l+2}(r)$, the $l$-prefix scheme is better as it takes less cost; otherwise, the $(l+1)$-prefix scheme is better. We can see that if the algorithm finally selects the $l_e$-prefix scheme as $r$'s prefix scheme, it only estimates $\Theta_i(r)$ for each $i \in [1, l_e+1]$ rather than for all possible prefix schemes (i.e., $i \in [1, t]$). Next, we discuss how to effectively estimate $\Theta_l(r)$ and give the corresponding estimation cost.

The cost $\Theta_l(r)$ consists of the filter cost and the verification cost. Following the filter-cost definition given earlier, we can easily get the filter cost by adding up the lengths of the inverted lists of the elements in $r$'s $l$-prefix set, i.e., $F_l(r) = \sum_{e \in P_l(r)} |I_l(e)|$. As we use the delta inverted indexes, we need to add up the lengths of the delta inverted lists, i.e., $F_l(r) = \sum_{e \in P_l(r)} \sum_{1 \le i \le l} |\Delta I_i(e)|$. For ease of presentation, we use $\Phi_l(r)$ to denote the set of delta inverted lists to be merged for the $l$-prefix scheme in the filter step of $r$, i.e., $\Phi_l(r) = \{\Delta I_i(e) \mid e \in P_l(r), 1 \le i \le l\}$. So the filter cost for the $l$-prefix scheme can be equivalently denoted by $F_l(r) = \sum_{\Delta I(e) \in \Phi_l(r)} |\Delta I(e)|$.

Note that we do not need to compute the filter cost for the $l$-prefix scheme from scratch, since we have already obtained the filter cost for the $(l-1)$-prefix scheme, and in the filter step the set of delta inverted lists to be merged for the $l$-prefix scheme is a superset of the set of those to be merged for the $(l-1)$-prefix scheme. Let $\Delta\Phi_l(r)$ denote the set of additional delta inverted lists to be merged for the $l$-prefix scheme compared to the $(l-1)$-prefix scheme, i.e., $\Delta\Phi_l(r) = \Phi_l(r) - \Phi_{l-1}(r)$. Then we have $F_l(r) = F_{l-1}(r) + \sum_{\Delta I(e) \in \Delta\Phi_l(r)} |\Delta I(e)|$. Therefore, we can obtain $F_l(r)$ by only computing $\sum_{\Delta I(e) \in \Delta\Phi_l(r)} |\Delta I(e)|$ with $|\Delta\Phi_l(r)|$ cost.

In order to get the verification cost w.r.t. an object $r$, we need to estimate the average cost to verify a candidate and the candidate-set size, i.e., $cost_v(r)$ and $|C_l(r)|$. To estimate $cost_v(r)$, consider a candidate $s$ and overlap similarity. Since the elements in each object have been sorted based on the global ordering, we can use a merge-join algorithm to compute $|r \cap s|$, thus the cost of verifying a candidate is $|r| + |s|$, which for a given $r$ depends only on the length of the candidate. So we compute the cost corresponding to every possible length of a candidate and use the average of these costs as the estimator of $cost_v(r)$. Based on this idea, we obtain

$cost_v(r) = \frac{\sum_{|s| = s_l}^{s_u} (|r| + |s|)}{s_u - s_l + 1} = |r| + \frac{s_u + s_l}{2}$

for overlap similarity, where $s_u$ and $s_l$ are respectively the upper bound and the lower bound of $|s|$. Using a similar idea, we can obtain $cost_v(r)$ for other similarity functions, as shown in the following table.

Table: The estimation of the average cost of verifying a candidate $s$ w.r.t. an object $r$ for different similarity functions ($\theta' = \theta$ for edit distance; for edit similarity, $\theta' = \frac{(1-\theta)|r|}{\theta}$).

SimFunc | $s_l$ | $s_u$ | Verify$(r,s)$ | $cost_v(r)$
$sim_o(r,s)$ | $\theta$ | $\max_{s \in S} |s|$ | $|r| + |s|$ | $|r| + \frac{s_u + s_l}{2}$
$sim_d(r,s)$ | $\frac{\theta}{2-\theta}|r|$ | $\frac{2-\theta}{\theta}|r|$ | $|r| + |s|$ | $|r| + \frac{s_u + s_l}{2}$
$sim_c(r,s)$ | $\theta^2 |r|$ | $\frac{|r|}{\theta^2}$ | $|r| + |s|$ | $|r| + \frac{s_u + s_l}{2}$
$sim_j(r,s)$ | $\theta |r|$ | $\frac{|r|}{\theta}$ | $|r| + |s|$ | $|r| + \frac{s_u + s_l}{2}$
$ed(r,s)$ | $|r| - \theta$ | $|r| + \theta$ | $(\theta'+1) \cdot \min(|r|,|s|)$ | $(\theta'+1) \cdot \frac{\sum_{|s|=s_l}^{s_u} \min(|r|,|s|)}{s_u - s_l + 1}$
$es(r,s)$ | $\theta |r|$ | $\frac{|r|}{\theta}$ | $(\theta'+1) \cdot \min(|r|,|s|)$ | $(\theta'+1) \cdot \frac{\sum_{|s|=s_l}^{s_u} \min(|r|,|s|)}{s_u - s_l + 1}$

Next we discuss how to estimate the candidate-set size, $|C_l(r)|$.
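The stopping rule of Section 4.4 and the average verification cost for overlap similarity can be sketched as follows. This is an illustration under assumptions rather than the paper's code: `theta` and `est_cost` are hypothetical callables standing in for the estimated total cost $\Theta_l(r)$ and the estimation cost $E_l(r)$, and the estimation cost beyond $l = t$ is taken to be zero because no longer scheme exists to estimate.

```python
def avg_verify_cost_overlap(r_len, s_l, s_u):
    """Average merge-join cost |r| + |s| over candidate lengths s_l..s_u."""
    lengths = range(s_l, s_u + 1)
    return sum(r_len + n for n in lengths) / len(lengths)


def select_prefix_scheme(t, theta, est_cost):
    """Greedy selection: stop at the first l whose successor is not better.

    theta(l):    estimated filter + verification cost of the l-prefix scheme
    est_cost(l): cost of producing the estimate for the l-prefix scheme
    (both are hypothetical callables supplied by the caller).
    """
    l = 1
    while l < t:
        # Continuing past l+1 would require estimating the (l+2)-prefix scheme.
        extra = est_cost(l + 2) if l + 2 <= t else 0
        if theta(l) < theta(l + 1) + extra:
            break  # the (l+1)-prefix scheme is not better: keep l
        l += 1
    return l


# Toy cost curve (hypothetical numbers).
costs = {1: 10.0, 2: 6.0, 3: 5.0, 4: 9.0}
chosen = select_prefix_scheme(4, costs.__getitem__, lambda l: 0.5)
print(chosen)
print(avg_verify_cost_overlap(5, 4, 5))
```

A real implementation would cache each estimate instead of calling `theta` twice per step; the sketch keeps the comparison explicit for readability.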
We first estimate the candidate-set size w.r.t. the 1-prefix scheme, $|C_1(r)|$ (Section 4.4.1), then estimate the candidate-set size w.r.t. the 2-prefix scheme, $|C_2(r)|$ (Section 4.4.2). Finally we extend our method to estimate the candidate-set size w.r.t. the $l$-prefix scheme, $|C_l(r)|$ ($l > 2$) (Section 4.4.3).

4.4.1 Estimating candidate-set size w.r.t. 1-prefix scheme

We estimate the candidate-set size w.r.t. the 1-prefix scheme, $|C_1(r)|$, to decide which one between the 1-prefix scheme and the 2-prefix scheme is better. If the 1-prefix scheme is better, it will be selected as $r$'s prefix scheme. If we use the 1-prefix scheme, we need to merge the lists in $\Phi_1(r)$, that is, insert the objects of each list in $\Phi_1(r)$ into a hash map and find the objects that appear in at least one list. If the 2-prefix scheme is better, the selected prefix scheme must be longer than the 1-prefix scheme. Suppose the $l_e$-prefix scheme is selected as $r$'s prefix scheme ($l_e \ge 2$). If we use the $l_e$-prefix scheme, we need to merge the lists in $\Phi_{l_e}(r)$, that is, insert the objects of each list in $\Phi_{l_e}(r)$ into a hash map and find the objects that appear in at least $l_e$ lists. Since $\Phi_1(r) \subseteq \Phi_{l_e}(r)$, when comparing the 1-prefix scheme and the 2-prefix scheme, no matter which prefix scheme is better, the lists in $\Phi_1(r)$ must be merged. Therefore, we can merge the lists in $\Phi_1(r)$ to get the real value of $|C_1(r)|$ before comparing the 1-prefix scheme and the 2-prefix scheme. Example 4 illustrates the method to compute $|C_1(r)|$.

Example 4. Consider the query object $r$ of the running example, whose elements after the first are $e_5, e_6, e_8, e_9$. To estimate $|C_1(r)|$, we first get $\Phi_1(r)$, which consists of the $I_1$ lists of the first two elements of $r$ (the second being $e_5$). As shown in Figure 6, the first of these lists contains two objects and $I_1(e_5)$ contains one object. Based on our analysis above, no matter which prefix scheme is selected, we need to merge these two lists; merging them yields a candidate set $C_1(r)$ of two objects. Thus we get the real value $|C_1(r)| = 2$.

Footnote to the table of verification costs: the $cost_v(r)$ entry for edit distance and edit similarity is deduced from $(\theta'+1) \cdot \frac{\sum_{|s|=s_l}^{s_u} \min(|r|,|s|)}{s_u - s_l + 1}$.

4.4.2 Estimating candidate-set size w.r.t. 2-prefix scheme

In this section, we focus on estimating the candidate-set size w.r.t. the 2-prefix scheme, $|C_2(r)|$, which is the number of objects that appear in at least two lists in $\Phi_2(r)$.
As $|C_1(r)|$ has been computed (Section 4.4.1), we can utilize $C_1(r)$ to estimate $|C_2(r)|$. Since $C_2(r) \subseteq C_1(r)$ (a containment established by a lemma of this paper), we only need to check for each $s \in C_1(r)$ whether $s \in C_2(r)$ holds. We divide $C_1(r)$ into two disjoint sets, $C_1^{=}(r)$ and $C_1^{>}(r)$, where $C_1^{=}(r)$ denotes the set of objects that appear in only one list in $\Phi_1(r)$, and $C_1^{>}(r)$ denotes the set of objects that appear in more than one list in $\Phi_1(r)$. For each object $s \in C_1^{>}(r)$, since $\Phi_1(r) \subseteq \Phi_2(r)$, $s$ must appear in at least two lists in $\Phi_2(r)$, thus $s \in C_2(r)$. For each object $s \in C_1^{=}(r)$, if $s$ appears in the lists in $\Delta\Phi_2(r)$, $s$ must appear in at least two lists in $\Phi_2(r) = \Phi_1(r) \cup \Delta\Phi_2(r)$, i.e., $s \in C_2(r)$; otherwise, $s \notin C_2(r)$. Therefore, as shown in Equation 5, $|C_2(r)|$ can be computed based on $C_1(r)$:

$|C_2(r)| = |C_1^{>}(r)| + \Big| C_1^{=}(r) \cap \bigcup_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e) \Big|. \quad (5)$

For instance, consider the two-object candidate set $C_1(r)$ obtained in Example 4. We show how to compute $|C_2(r)|$ based on $C_1(r)$. $C_1^{>}(r)$ contains one object, since that object appears in more than one list in $\Phi_1(r)$, i.e., in both the $I_1$ list of the first element of $r$ and $I_1(e_5)$. The objects in $C_1^{>}(r)$ must belong to $C_2(r)$. $C_1^{=}(r)$ contains the other object, since it appears in only one list in $\Phi_1(r)$. For the objects in $C_1^{=}(r)$, we need to check whether they appear in the lists in $\Delta\Phi_2(r)$, which consists of $I_1(e_6)$ and the $\Delta I_2$ lists of the first element of $r$, of $e_5$ and of $e_6$. As the object in $C_1^{=}(r)$ belongs to $\Delta I_2(e_5)$, we have $\big| C_1^{=}(r) \cap \bigcup_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e) \big| = 1$. Based on Equation 5, we obtain $|C_2(r)| = 1 + 1 = 2$.

In order to use Equation 5 to estimate $|C_2(r)|$, there are two issues that need to be addressed: (1) how to efficiently compute $|C_1^{>}(r)|$; (2) how to efficiently and effectively estimate $\big| C_1^{=}(r) \cap \bigcup_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e) \big|$. The first one can be easily addressed. Recall the algorithm for computing $|C_1(r)|$ in Section 4.4.1. During the process of merging the lists in $\Phi_1(r)$, we maintain a hash map $H$ with $H[s]$ storing the number of processed lists that contain object $s$. Initially, $|C_1^{>}(r)| = 0$. Whenever $H[s]$ reaches 2, we set $|C_1^{>}(r)| = |C_1^{>}(r)| + 1$. After processing all lists in $\Phi_1(r)$, we return $|C_1^{>}(r)|$.
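The hash-map merge and the exact form of Equation 5 can be sketched as follows. This is a minimal sketch under assumptions: the names `merge_lists` and `exact_c2_size` and the toy object ids are hypothetical, and each list is assumed to contain an object at most once.

```python
from collections import Counter


def merge_lists(lists):
    """Merge the lists of Phi_1(r) into a hash map H.

    Returns (H, greater) where H[s] counts the lists containing s and
    greater is |C_1^>(r)|, the number of objects seen in more than one list.
    """
    H = Counter()
    greater = 0
    for lst in lists:
        for s in lst:
            H[s] += 1
            if H[s] == 2:  # s just moved from C_1^= to C_1^>
                greater += 1
    return H, greater


def exact_c2_size(H, greater, delta_lists):
    """Equation 5: |C_2| = |C_1^>| + |C_1^= intersected with the delta lists|."""
    equal = {s for s, c in H.items() if c == 1}
    seen_in_delta = set().union(*delta_lists)
    return greater + len(equal & seen_in_delta)


# Toy lists (hypothetical ids): two lists in Phi_1(r), four in Delta-Phi_2(r).
phi1 = [["s1", "s2"], ["s2"]]
delta_phi2 = [["s4"], [], ["s1", "s5"], ["s3"]]
H, greater = merge_lists(phi1)
print(len(H), greater, exact_c2_size(H, greater, delta_phi2))
```

The exact computation scans every delta list in full, which is the cost the sampling-based estimate of this section is meant to avoid.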

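Next we study the second problem. We have an interesting observation that no two lists in $\Delta\Phi_2(r)$ overlap. Thus the union $\bigcup_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e)$ is actually equal to the multiset union $\biguplus_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e)$. The following lemma proves the correctness of this observation.

Lemma. Given a collection $S$ and delta inverted indexes $I_1, \Delta I_2, \cdots, \Delta I_{l+1}$ built on $S$, for any $r \in R$, we have

$\bigcup_{\Delta I(e) \in \Delta\Phi_{l+1}(r)} \Delta I(e) = \biguplus_{\Delta I(e) \in \Delta\Phi_{l+1}(r)} \Delta I(e).$

Based on this lemma, we only need to estimate

$\Big| C_1^{=}(r) \cap \biguplus_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e) \Big|. \quad (6)$

If the context is clear, $\biguplus_{\Delta I(e) \in \Delta\Phi_2(r)} \Delta I(e)$ is abbreviated as $\biguplus \Delta I(e)$ for ease of notation. Given an object $s \in \biguplus \Delta I(e)$, the conditional probability that $s \in C_1^{=}(r)$ holds is

$P\big(s \in C_1^{=}(r) \mid s \in \biguplus \Delta I(e)\big) = \frac{\big| C_1^{=}(r) \cap \biguplus \Delta I(e) \big|}{\big| \biguplus \Delta I(e) \big|}. \quad (7)$

To estimate the conditional probability, consider $K$ sampled objects, $(s_1, s_2, \cdots, s_K)$, which are randomly selected with replacement from $\biguplus \Delta I(e)$. For any $s_i$ ($i \in [1, K]$), the probability that $s_i \in C_1^{=}(r)$ holds is equal to the conditional probability in Equation 7, thus an unbiased estimator of the conditional probability is

$\hat{P}\big(s \in C_1^{=}(r) \mid s \in \biguplus \Delta I(e)\big) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i), \quad (8)$

where $\mathbb{1}_{C_1^{=}(r)}(s_i) = 1$ if $s_i \in C_1^{=}(r)$ holds, and 0 otherwise. Note that for a random object $s_i$, it is very efficient to check whether $\mathbb{1}_{C_1^{=}(r)}(s_i) = 1$ holds. This is because when computing $|C_1(r)|$, we maintain a hash map $H$ for the objects in $C_1(r)$, and $\mathbb{1}_{C_1^{=}(r)}(s_i) = 1$ (i.e., $s_i \in C_1^{=}(r)$) if and only if $H[s_i] = 1$. Based on Equations 7 and 8, an unbiased estimator of $\big| C_1^{=}(r) \cap \biguplus \Delta I(e) \big|$ is

$\frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i) \cdot \Big| \biguplus \Delta I(e) \Big|. \quad (9)$

Therefore, based on Equations 5 and 9, an unbiased estimator of $|C_2(r)|$ is

$|\hat{C}_2(r)| = |C_1^{>}(r)| + \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i) \cdot \Big| \biguplus \Delta I(e) \Big|. \quad (10)$

Next we show how to compute this equation efficiently. We first compute $\big| \biguplus \Delta I(e) \big|$ by adding up the length of each $\Delta I(e) \in \Delta\Phi_2(r)$. Then we select $K$ random objects from $\biguplus \Delta I(e)$. To achieve this goal, consider a virtual list of objects obtained by concatenating all delta lists $\Delta I(e) \in \Delta\Phi_2(r)$. Given a random position in the virtual list, we can return the corresponding object with $O(|\Delta\Phi_2(r)|)$ cost. The cost can be improved to $O(\log |\Delta\Phi_2(r)|)$ by binary search, but this requires an extra $O(|\Delta\Phi_2(r)|)$ initialization cost.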
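The sampling procedure behind Equation 10 can be sketched as follows, using cumulative list lengths plus binary search to map a random position of the virtual list to an object. This is a sketch under assumptions: the function name `estimate_c2_size`, its parameters, and the toy ids are hypothetical, and `H` and `greater` are the hash map and $|C_1^{>}(r)|$ obtained while merging $\Phi_1(r)$.

```python
import bisect
import random
from itertools import accumulate


def estimate_c2_size(H, greater, delta_lists, K, rng=random):
    """Sampling-based estimate of |C_2(r)| in the spirit of Equation 10.

    H:           dict mapping object id -> number of Phi_1(r) lists containing it
    greater:     |C_1^>(r)|
    delta_lists: the delta inverted lists in Delta-Phi_2(r)
    K:           number of objects sampled with replacement
    """
    lists = [lst for lst in delta_lists if lst]
    ends = list(accumulate(len(lst) for lst in lists))  # O(|Delta-Phi_2|) setup
    total = ends[-1] if ends else 0  # size of the multiset union
    if total == 0:
        return float(greater)
    hits = 0
    for _ in range(K):
        pos = rng.randrange(total)            # random position in the virtual list
        k = bisect.bisect_right(ends, pos)    # O(log |Delta-Phi_2|) lookup
        start = ends[k - 1] if k else 0
        s = lists[k][pos - start]
        if H.get(s, 0) == 1:                  # indicator: s is in C_1^=(r)
            hits += 1
    return greater + hits / K * total


# Toy inputs (hypothetical ids).
H = {"s1": 1, "s2": 2}
rng = random.Random(0)
all_hits = estimate_c2_size(H, 1, [["s1"]], K=5, rng=rng)
no_hits = estimate_c2_size(H, 1, [["s9"], ["s8", "s7"]], K=5, rng=rng)
mixed = estimate_c2_size(H, 1, [["s4"], ["s1", "s5"], ["s3"]], K=50, rng=rng)
print(all_hits, no_hits)
```

The estimate always lies between $|C_1^{>}(r)|$ and $|C_1^{>}(r)|$ plus the total length of the delta lists, and it tightens as `K` grows.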
After getting $K$ random objects $(s_1, \cdots, s_K)$, we compute the number of random objects such that $H[s_i] = 1$ holds, i.e., $\sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i)$. Finally, we can obtain $|\hat{C}_2(r)|$ based on Equation 10. Example 5 illustrates how to estimate $|C_2(r)|$.

Figure 7: An illustration of estimating $|C_2(r)|$. Panel (a) covers $C_1(r)$ and $\Phi_1(r)$; panel (b) covers $C_2(r)$ and $\Delta\Phi_2(r)$.

Example 5. Consider $r$ in Example 4. To estimate $|C_2(r)|$, we merge the lists in $\Phi_1(r)$, then we obtain $C_1(r)$, $|C_1^{>}(r)|$ and $H$ as shown in Figure 7(a). To estimate $|C_2(r)|$ based on Equation 10, we also need to compute $\big| \biguplus \Delta I(e) \big|$ and $\frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i)$. Figure 7(b) illustrates this process. Since $\Delta\Phi_2(r) = \{\Delta I_2(e_5), I_1(e_6), \Delta I_2(e_6)\}$, we have $\big| \biguplus \Delta I(e) \big| = |\Delta I_2(e_5)| + |I_1(e_6)| + |\Delta I_2(e_6)| = 4$. Suppose $K = 3$ objects, $s_5$ and two other objects, are randomly selected with replacement from $\biguplus \Delta I(e)$. For the object $s_5$, as $H[s_5] \ne 1$, we have $s_5 \notin C_1^{=}(r)$, thus its indicator is 0. For the second object, $H$ returns 1, so it belongs to $C_1^{=}(r)$ and its indicator is 1. For the third object, $H$ does not return 1, so it does not belong to $C_1^{=}(r)$ and its indicator is 0. Therefore, $\frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_1^{=}(r)}(s_i) = \frac{0+1+0}{3} = \frac{1}{3}$. Based on Equation 10, we obtain $|\hat{C}_2(r)| = 1 + \frac{1}{3} \times 4 = \frac{7}{3}$.

4.4.3 Estimating candidate-set size w.r.t. l-prefix scheme (l > 2)

The estimation method for the 2-prefix scheme uses $H$ and $C_1^{=}(r)$ to estimate $|C_2(r)|$, where $H$ is obtained by merging the lists in $\Phi_1(r)$. We extend it to support the $l$-prefix scheme, $|C_l(r)|$ ($l > 2$). Next we show that the corresponding $H$ and $C_{l-1}^{=}(r)$ can also be computed before the estimation of $|C_l(r)|$ ($l > 2$). We use $|C_3(r)|$ as an example to introduce our idea. $|C_3(r)|$ needs to be estimated only when the 2-prefix scheme is better than the 1-prefix scheme. In this case, the 1-prefix scheme will not be selected as $r$'s prefix scheme. We estimate $|C_3(r)|$ in order to decide whether the 2-prefix scheme or the 3-prefix scheme is better.
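Using an analysis similar to that of Section 4.4.1, we can merge the lists in $\Phi_2(r)$ in advance, and obtain $H$ and $C_2^{=}(r)$ before the estimation of $|C_3(r)|$. Since the lists in $\Phi_1(r)$ have been merged when computing $|C_1(r)|$, we only need to merge the lists in $\Delta\Phi_2(r)$. Similarly, we can also deduce that the corresponding $H$ and $C_{l-1}^{=}(r)$ can be computed before the estimation of $|C_l(r)|$ ($l > 2$). Thus an unbiased estimator of $|C_l(r)|$ is

$|\hat{C}_l(r)| = |C_{l-1}^{>}(r)| + \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}_{C_{l-1}^{=}(r)}(s_i) \cdot \Big| \biguplus_{\Delta I(e) \in \Delta\Phi_l(r)} \Delta I(e) \Big|, \quad (11)$

where $(s_1, s_2, \cdots, s_K)$ are $K$ sampled objects randomly selected with replacement from $\biguplus_{\Delta I(e) \in \Delta\Phi_l(r)} \Delta I(e)$, and $\mathbb{1}_{C_{l-1}^{=}(r)}(s_i) = 1$ if $s_i \in C_{l-1}^{=}(r)$ holds and 0 otherwise. Here $C_{l-1}^{=}(r)$ denotes the set of objects that appear in exactly $l-1$ lists in $\Phi_{l-1}(r)$, and $C_{l-1}^{>}(r)$ the set of objects that appear in more than $l-1$ of them. Our estimation algorithm obtains an unbiased estimator of $|C_l(r)|$, and the estimator becomes more accurate as the number of sampled objects increases. The correctness is proved in the following theorem.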

Theorem. Let $0 < \delta < 1$ and $\epsilon > 0$, and let the number of sampled objects $K$ be sufficiently large, with a lower bound that depends on $\epsilon$, $\delta$ and $C$. Then we have (1) $E(|\hat{C}_l(r)|) = |C_l(r)|$; (2) $P\left( \frac{\big| |C_l(r)| - |\hat{C}_l(r)| \big|}{|C_l(r)|} \le \epsilon \right) \ge 1 - \delta$, where $E(\cdot)$ denotes the expected value and $C = |C_l(r)| - |C_{l-1}^{>}(r)|$.

Next we analyze the cost of estimating $\Theta_l(r)$, i.e., $E_l(r)$. $\Theta_l(r)$ consists of the filter cost and the verification cost. The filter cost can be estimated with $|\Delta\Phi_l(r)|$ cost, as shown at the beginning of Section 4.4. To estimate the verification cost, we need to estimate the candidate-set size w.r.t. the $l$-prefix scheme. Recall our estimation algorithm: selecting $K$ random objects needs $|\Delta\Phi_l(r)| + K \log |\Delta\Phi_l(r)|$ cost, and checking $H[s_i] = l - 1$ for all random objects needs $K$ cost. Therefore, the total cost of estimating $\Theta_l(r)$ is $E_l(r) = 2|\Delta\Phi_l(r)| + K \log |\Delta\Phi_l(r)| + K$. Based on the definition of $\Phi_l(r)$, we have $|\Phi_l(r)| = |P_l(r)| \cdot l = (|r| - t + l) \cdot l$. Thus $|\Delta\Phi_l(r)| = |\Phi_l(r)| - |\Phi_{l-1}(r)| = |r| - t + 2l - 1$, which is quite small and increases linearly with $l$.

5. ADAPTIVE FRAMEWORK FOR SimSearch

In this section, we study how to extend our adaptive framework to support a SimSearch query. Recall that for SimJoin, given $R$ and $S$, our framework first builds delta inverted indexes on $S$ based on a specified similarity threshold, and then utilizes the index to find similar objects for each $r \in R$ w.r.t. the same specified similarity threshold. Different from a SimJoin query, before answering a SimSearch query we have no idea which threshold will be specified, so the index built on $S$ should be able to deal with any threshold. A straightforward method is to build delta inverted indexes for all possible thresholds. However, the number of possible thresholds may be large, e.g., there are $\max_{s \in S} |s|$ possible thresholds for a SimSearch query w.r.t. overlap similarity, so this method would incur a huge index size. In the following, we design an index structure that has the same size as the inverted index built on $S$ but can support a SimSearch query with any threshold.

We have an observation that objects with the same length have the same number of elements in their $l$-prefix sets (i.e., $|s| - t + l$). In this way we can group the objects in $S$ according to their lengths.
Let $S_{|s|}$ denote the group of objects with length $|s|$. The maximal threshold of a SimSearch query for $S_{|s|}$ is $|s|$. Instead of building delta inverted indexes on $S_{|s|}$ for each threshold in $[1, |s|]$, we build delta inverted indexes only for the maximal threshold $|s|$, denoted by $I_1^{|s|}, \Delta I_2^{|s|}, \cdots, \Delta I_{|s|}^{|s|}$. We can easily see that the total index size is the same as that of the inverted index built on $S$, i.e., $O(\sum_{s \in S} |s|)$. For example, consider the collection $S$ of the running example. We show its index structure in Figure 8. Since all objects in $S$ have the same length, there is only one group, i.e., $S_5$. For this group, we use the same method as SimJoin to build delta inverted indexes for threshold 5, i.e., $I_1^5, \cdots, \Delta I_5^5$.

Figure 8: A SimSearch index structure built on the collection $S$.

Consider a query object $r$, a threshold $\theta$ and a deduced overlap threshold $t$ (deduced as described earlier in the paper). To use our adaptive framework to find candidates $s \in S$ such that $|r \cap s| \ge t$, we can use the above index structure to generate an inverted list $I_l(e)$, which consists of the objects whose $l$-prefix set contains $e$. Since $\Delta I_i^{|s|}(e)$ consists of the objects with length $|s|$ whose $i$-th element is $e$, and the $l$-prefix set contains $|s| - t + l$ elements, the objects with length $|s|$ whose $l$-prefix set contains $e$ can be represented by $\bigcup_{i=1}^{|s|-t+l} \Delta I_i^{|s|}(e)$. Notice that in the table of verification costs we have deduced the upper bound ($s_u$) and the lower bound ($s_l$) of the length of $r$'s candidates. Therefore, the inverted list $I_l(e)$ can be generated by $\bigcup_{|s|=s_l}^{s_u} \bigcup_{i=1}^{|s|-t+l} \Delta I_i^{|s|}(e)$.

For example, consider the index structure in Figure 8. Given $r = \{e_5, e_6, e_7, e_8, e_9\}$, $\theta = 4$ and a deduced overlap threshold $t = \theta = 4$, we compute $s_l = \theta = 4$ and $s_u = \max_{s \in S} |s| = 5$. Suppose we want to generate $I_2(e_5)$. Based on our index structure, we have $\bigcup_{|s|=4}^{5} \bigcup_{i=1}^{|s|-4+2} \Delta I_i^{|s|}(e_5) = \Delta I_2^5(e_5) \cup \Delta I_3^5(e_5)$.

Position-aware Pruning. As discussed above, we can merge some delta inverted lists in our index structure to generate an inverted list $I_l(e)$. Next we propose a technique to prune delta inverted lists in order to further improve the performance.
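The length-grouped index and the generation of $I_l(e)$ from it can be sketched as follows. This is a minimal sketch under assumptions: the names `build_simsearch_index` and `inverted_list`, the nested-dictionary layout, and the toy objects are hypothetical, and objects are assumed to be element lists sorted by the global ordering.

```python
from collections import defaultdict


def build_simsearch_index(objects):
    """index[n][i][e] = ids of length-n objects whose i-th element (1-based) is e."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, elems in objects.items():
        for i, e in enumerate(elems, start=1):
            index[len(elems)][i][e].append(sid)
    return index


def inverted_list(index, e, t, l, s_l, s_u):
    """Generate I_l(e): objects of length s_l..s_u whose l-prefix set contains e."""
    out = []
    for n in range(s_l, s_u + 1):
        by_pos = index.get(n)
        if by_pos is None:
            continue  # no object of this length
        for i in range(1, n - t + l + 1):  # positions inside the l-prefix
            out.extend(by_pos.get(i, {}).get(e, ()))
    return out


# Toy collection (hypothetical, not the collection from the paper's figures).
objects = {
    "a": ["e1", "e5", "e6", "e7", "e8"],
    "b": ["e2", "e5", "e6", "e8", "e9"],
    "c": ["e5", "e6", "e7"],
}
index = build_simsearch_index(objects)
print(sorted(inverted_list(index, "e5", t=4, l=2, s_l=4, s_u=5)))
print(inverted_list(index, "e6", t=4, l=1, s_l=4, s_u=5))
```

Because every element of every object is stored exactly once, the index has the same size as a plain inverted index, while the threshold `t` only enters at query time through the position range.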
Consider a delta inverted list $\Delta I_i^{|s|}(e)$. For any object $s \in \Delta I_i^{|s|}(e)$, we have $s[i] = e$. Let $e$ be the $j$-th element of a query object $r$, i.e., $r[j] = e$. We show the first pruning condition on the left part of Figure 9. Since $s[i] = r[j]$ and the elements in $s$ and $r$ are sorted based on the same global ordering, the elements before $s[i]$ share at most $j - 1$ common elements with those before $r[j]$, and the elements after $s[i]$ share at most $|s| - i$ common elements with those after $r[j]$, so the overlap between $s$ and $r$ is at most $j + (|s| - i)$. If $j + (|s| - i) < t$ holds, then the overlap between $s$ and $r$ must be smaller than $t$, thus we can prune $\Delta I_i^{|s|}(e)$. Similarly, we obtain another pruning condition as shown on the right part of Figure 9. That is, if $i + (|r| - j) < t$ holds, we can prune $\Delta I_i^{|s|}(e)$. Therefore, for $\bigcup_{|s|=s_l}^{s_u} \bigcup_{i=1}^{|s|-t+l} \Delta I_i^{|s|}(e)$, we prune $\Delta I_i^{|s|}(e)$ if $i > j + |s| - t$ or $i < j - |r| + t$. Recall the above example. We prune $\Delta I_3^5(e_5)$ as $i > j + |s| - t$ (i.e., $3 > 1 + 5 - 4$).

Figure 9: An illustration of position-aware pruning. Left: $j + (|s| - i) < t$. Right: $i + (|r| - j) < t$. Prune $\Delta I_i^{|s|}(e)$ if $i > j + |s| - t$ or $i < j - |r| + t$.

6. EXPERIMENT

We have implemented our techniques to support SimSearch and SimJoin queries, and compared them with the following state-of-the-art methods. ppjoin and ppjoin+ [7] are prefix-filtering based algorithms that can answer SimJoin queries for Jaccard and Cosine similarities. They both utilize position filtering to optimize their algorithms; ppjoin+ also employs suffix filtering to further prune candidates. Ed-Join [5] is a prefix-filtering based algorithm that can handle SimJoin queries for edit distance. Trie-Join [] is a trie-based algorithm that can support SimJoin queries for edit distance. ChunkGram [9] is a prefix-filtering based algorithm that can answer SimJoin and SimSearch queries for edit distance. Flamingo is a data cleaning package that includes the DivideSkip [] algorithm to answer SimSearch queries for Jaccard similarity, Cosine similarity, and edit distance. We downloaded these algorithms from their respective websites. Although there are some other methods, such as Part-Enum [], B^ed-Tree [9] and All-Pairs [], prior work [9,7] has shown that they cannot outperform the above selected algorithms.

We used four real data sets to evaluate our methods. 1) DBLP-String was obtained from the DBLP Bibliography. Each string is a concatenation of author names and the title of a publication. 2) QueryLog-String is a collection of query strings that were randomly chosen from the AOL Query Log. 3) DBLP-Set was derived from DBLP-String by splitting each string into a token set based on non-alphanumeric characters. 4) ENRON-Set was obtained from the Enron email collection. We split the email title and body into a token set based on non-alphanumeric characters. We assume the elements in each data set have no weight, which is the same as in much prior work [5,,4,9,5,7,9]. The dataset statistics table shows more details about the data sets.

Table: Dataset statistics (columns: data set, size, average length, maximum length, minimum length) for QueryLog-String, DBLP-String, DBLP-Set and ENRON-Set.

All the algorithms were implemented in C++ and compiled using GCC (a 4.x release) with an -O optimization flag. We used inverse document frequency (IDF) to sort the elements. All the experiments were run on an Ubuntu machine with an Intel Core 2 Quad processor and 4 GB memory.

6.1 Variable-Length Prefix Scheme

In this section, we compare the variable-length prefix scheme with the fixed-length prefix scheme by computing their total cost in the filter and verification steps w.r.t. overlap similarity. For the variable-length prefix scheme, we specified for each object the prefix scheme with the minimum cost.
For the fixed-length prefix scheme, we specified the same prefix scheme for all objects. Figure 10 reports the results on the DBLP-Set and ENRON-Set data sets. On the X axis, * refers to the variable-length prefix scheme, and an integer refers to the fixed-length prefix scheme with that integer as the specified prefix scheme. We see that the variable-length prefix scheme always took less cost than the fixed-length prefix scheme. For example, on the DBLP-Set data set, when the threshold is $t = 5$, even if the best prefix scheme was specified for the fixed-length prefix scheme, its cost was still larger than that of the variable-length prefix scheme. The reason is that different objects may have different optimal prefix schemes. Therefore, we need to study how to adaptively select a prefix scheme for each object instead of using a fixed one.

Figure 10: Comparison of variable-length prefix scheme and fixed-length prefix scheme on (a) DBLP-Set and (b) ENRON-Set (cost versus prefix scheme, one series per overlap threshold $t$).

6.2 Adaptive Selection of Prefix Schemes

In this section, we evaluate the quality of our adaptive selection method. If our method could not estimate the cost effectively, it would select a bad prefix scheme. We computed the cost of performing a SimJoin query using our method and that of the optimal method, which used the prefix scheme with the minimal cost, and reported the ratio of the cost of our method to that of the optimal method while varying the percentage of sampled objects. Figure 11 shows the result. We can see that as the percentage of sampled objects increased, the cost ratio became smaller. On the DBLP-Set data set, once the percentage of sampled objects was large enough, the cost ratio stayed close to 1, that is, our method needed only slightly more cost than the optimal method. On the ENRON-Set data set, we found that the optimal prefix scheme was typically longer (see Figure 10(b)), thus our method needed to perform cost estimation more times for an object.
However, even for such a data set, once the percentage of sampled objects was large enough, the extra cost of our method over the optimal method remained bounded. These results indicate that our estimation method was very effective. Note that increasing the percentage of sampled objects makes the estimation process more expensive, and we sampled a fixed percentage of objects for our estimation method in the following experiments.

Figure 11: Evaluating effectiveness of adaptive selection of prefix schemes on (a) DBLP-Set and (b) ENRON-Set (cost ratio versus percentage of sampled objects, one series per overlap threshold $t$).

Figure 12: Evaluating efficiency of adaptive selection of prefix schemes on (a) DBLP-Set and (b) ENRON-Set (time in seconds versus overlap threshold, for AdaptPrefixScheme and for the selection of prefix schemes).

Next we evaluate the efficiency of our adaptive selection method. We varied the overlap thresholds, and computed the running time of AdaptPrefixScheme. AdaptPrefixScheme needed to include the selection time of prefix

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Efficient Semantically Equal Join on Strings in Practice

Efficient Semantically Equal Join on Strings in Practice Thammasat Int. J. Sc. Tech., Vol. 4, No., Aprl-June 009 Effcent Semantcally Equal Jon on Strngs n Practce Juggapong Natwcha Computer Engneerng Department, Faculty of Engneerng Chang Ma Unversty, Chang

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration
