High-dimensional Proximity Joins. Kyuseok Shim Ramakrishnan Srikant Rakesh Agrawal. IBM Almaden Research Center. 650 Harry Road, San Jose, CA 95120


Abstract

Many emerging data mining applications require a proximity (similarity) join between points in a high-dimensional domain. We present a new algorithm that utilizes a new data structure, called the ε-kd tree, for fast spatial proximity joins on high-dimensional points. This data structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the proposed data structure scales to high-dimensional data. We analyze the cost of the join for the ε-kd tree and the R-tree family, and show that the ε-kd tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life datasets, shows that proximity join using the ε-kd tree is typically 2 to 40 times faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε-kd tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional proximity joins, but do not match the performance of the ε-kd tree.

1 Introduction

Many emerging data mining applications require efficient processing of proximity (similarity) joins on high-dimensional points. Examples include applications in time-series databases [AFS93, ALSS95], multimedia databases [Jag94, NBE+93, NC91], medical databases [ACF+93, TBS90], and scientific databases [Vas93]. Some typical queries in these applications include: (1) discover all stocks with similar price movements; (2) find all pairs of similar images; (3) retrieve music scores similar to a target music score. These queries are often a prelude to clustering the objects.
For example, given all pairs of similar images, the images can be clustered into groups such that the images in each group are similar.

To motivate the need for multidimensional indices in such applications, consider the problem of finding all pairs of similar time-sequences. The technique in [ALSS95] solves this problem by breaking each time-sequence into a set of contiguous subsequences, and finding all subsequences similar to each other. If two sequences have "enough" similar subsequences, they are considered similar. To find similar subsequences, each subsequence is mapped to a point in a multi-dimensional space. Typically, the dimensionality of this space is quite high. The problem of finding similar subsequences is now reduced to the problem of finding points that are close to the given point in the multi-dimensional space. A pair of points are considered "close" if they are within ε distance of each other with some distance metric (such as the L_2 or L_∞ norms) that involves all dimensions, where ε is specified by the user. A multi-dimensional index structure (the R+ tree) was used for finding all pairs of close points.

Footnote: Currently at Bell Laboratories, Murray Hill, NJ.

This approach holds for other domains, such as image data. In this case, the image is broken into a grid of sub-images, key attributes of each sub-image are mapped to a point in a multi-dimensional space, and all pairs of similar sub-images are found. If "enough" sub-images of two images match, a more complex matching algorithm is applied to the images.

A closely related problem is to find all objects similar to a given object. This translates to finding all points close to a query point. Even if there is no direct mapping from an object to a point in a multi-dimensional space, this paradigm can still be used if a distance function between objects is available. An algorithm is presented in [FL95] for generating a mapping from an object to a multi-dimensional point, given a set of objects and a distance function.

Current spatial access methods (see [Sam89, Gut84] for an overview) have mainly concentrated on storing map information, which is a 2-dimensional or 3-dimensional space. While they work well with low dimensional data points, the time and space for these indices grow rapidly with dimensionality. Moreover, while CPU cost is high for proximity joins, existing indices have been designed with the reduction of I/O cost as their primary goal. We discuss these points further later in the paper, after reviewing current multidimensional indices.

To overcome the shortcomings of current indices for high-dimensional proximity joins, we propose a structure called the ε-kd tree. This is a main-memory data structure optimized for performing proximity joins. The ε-kd tree also has a very small build time.
This lets the ε-kd tree use the proximity distance limit ε as a parameter in building the tree. Empirical evaluation shows that the build plus join time for the ε-kd tree is typically 2 to 40 times less than the join time for the R+ tree [SRF87] (footnote 1), with the performance gap increasing with the number of dimensions. A pure main-memory data structure would not be very useful, since the data in many applications will not fit in memory. We present a join algorithm that can handle large amounts of data while still using the ε-kd tree.

Footnote 1: Our experiments indicated that the R+ tree was better than the R tree [Gut84] or the R* tree [BKSS90] for high-dimensional proximity joins.

Problem Definition. We will consider two versions of the spatial proximity join problem:

Self-join: Given a set of N high-dimensional points and a distance metric, find all pairs of points that are within ε distance of each other.

Non-self-join: Given two sets S_1 and S_2 of high-dimensional points and a distance metric, find pairs of points, one each from S_1 and S_2, that are within ε distance of each other.

The distance metric for two n-dimensional points X and Y that we consider is

$$L_p = \left( \sum_{i=1}^{n} |X_i - Y_i|^p \right)^{1/p}, \qquad 1 \le p \le \infty.$$

$L_2$ is the familiar Euclidean distance, $L_1$ the Manhattan distance, and $L_\infty$ corresponds to the maximum distance in any dimension.

Properties. The problem of high-dimensional proximity joins with some distance metric and ε parameter has the following properties:

- The feature vector chosen for similarity comparison is high-dimensional.
- Every dimension of the feature vector is mapped into a numeric value whose bounds are known.
- The distance function is computed considering every dimension of the feature vector.
- The proximity distance limit ε is not large, since indices are not effective when the selectivity of the proximity join is large (i.e. when every point matches with most other points).

Paper Organization. In Section 2, we give an overview of existing spatial indices, and describe their shortcomings when used for high-dimensional proximity joins. Section 3 describes the ε-kd tree and the algorithm for proximity joins. In Section 4, we analyze the performance of the ε-kd tree and the R+ tree for proximity joins. We give a performance evaluation in Section 5. Section 6 discusses biased R+ trees. We conclude in Section 7.

2 Current Multidimensional Index Structures

We first discuss the R-tree family of indices, which are the most popular multi-dimensional indices, and describe how to use them for proximity joins. We also give a brief overview of other indices. We then discuss problems of the current index structures for high-dimensional proximity joins.

2.1 The R-tree family

R-tree [Gut84] is a balanced tree in which each node represents a rectangular region. Each internal node in an R-tree stores a minimum bounding rectangle (MBR) for each of its children. The MBR covers the space of the points in the child node. The MBRs of siblings can overlap.
The decision whether to traverse a subtree in an internal node depends on whether its MBR overlaps with the space covered by the query. When a node becomes full, it is split. The total area of the two MBRs resulting from the split is minimized while splitting a node. Figure 1 shows an example of an R-tree. This tree consists of 4 leaf nodes and 3 internal nodes. The MBRs are N1, N2, L1, L2, L3 and L4. The root node has two children whose MBRs are N1 and N2.

[Figure 1: Example of an R-tree. (a) Space covered by bounding rectangles; (b) R-tree.]

R* tree [BKSS90] added two major enhancements to the R-tree. First, rather than just considering the area, the node splitting heuristic in the R* tree also minimizes the perimeter and overlap of the bounding regions. Second, the R* tree introduced the notion of forced reinsert to make the shape of the tree less dependent on the order of insertion. When a node becomes full, it is not split immediately, but a portion of the node is reinserted from the top level. With these two enhancements, the R* tree generally outperforms the R-tree.

R+ tree [SRF87] imposes the constraint that no two bounding regions of a non-leaf node overlap. Thus, except for the boundary surfaces, there will be only one path to every leaf region, which can reduce search and join costs.

X-tree [BKK96] avoids splits that could result in a high degree of overlap of bounding regions for the R*-tree. Their experiments show that the overlap of bounding regions increases significantly for high dimensional data, resulting in performance deterioration in the R*-tree. Instead of allowing splits that produce a high degree of overlap, the nodes in the X-tree are extended to more than the usual block size, resulting in so-called super-nodes. Experiments show that the X-tree improves the performance of point queries and nearest-neighbor queries compared to the R*-tree and the TV-tree (described below). No comparison with the R+-tree is given in [BKK96] for point data. However, since the R+-tree does not have any overlap, and the gains for the X-tree are obtained by avoiding overlap, one would not expect the X-tree to be better than the R+-tree for point data.

Proximity Join. The join algorithm using the R-tree considers each leaf node, extends its MBR with ε-distance, and finds all leaf nodes whose MBR intersects with this extended MBR.
The algorithm then performs a nested-loop join or sort-merge join for the points in those leaf nodes, the join condition being that the distance between the points is at most ε. (For the sort-merge join, the points are first sorted on one of the dimensions.) To reduce redundant comparisons between points when joining two leaf nodes, we could first screen points. The boundary of each leaf node is extended by ε, and only points that lie within the intersection of the two extended regions need be joined. Figure 2 shows an example, where the rectangles with solid lines represent the MBRs of two leaf nodes and the dotted lines illustrate the extended boundaries. The shaded area contains screened points.

[Figure 2: Screening points for join test. (a) Overlapping MBRs (R tree and R* tree); (b) Non-overlapping MBRs (R+ tree).]

2.2 Other Index Structures

kdb tree [Rob81] is similar to the R+ tree. The main difference is that the bounding rectangles cover the entire space, unlike the MBRs of the R+ tree.

hB-tree [LS09] is similar to the kdb tree except that the bounding rectangles of the children of an internal node are organized as a K-D tree [Ben75] rather than as a list of MBRs. (The K-D-tree is a binary tree for multi-dimensional points. In each level of the K-D-tree, only one dimension, chosen cyclically, is used to decide the subtree for traversal.) Further, the bounding regions may have rectangular holes in them. This reduces the cost of splitting a node compared to the kdb tree.

TV-tree [LJF94] uses a variable number of dimensions for indexing. The TV-tree has a design parameter α ("active dimension") which is typically a small integer (1 or 2). For any node, only α dimensions are used to represent bounding regions and to split nodes. For the nodes close to the root, the first α dimensions are used to define bounding rectangles. As the tree grows, some nodes may consist of points that all have the same value on their first, say, k dimensions. Since the first k dimensions can no longer distinguish the points in those nodes, the next α dimensions (after the k dimensions) are used to store bounding regions and for splitting. This reduces the storage and traversal cost for internal nodes.

Grid-file [NHS84] partitions the k-dimensional space as a grid; multiple grid buckets may be placed in a single disk page. A directory structure keeps track of the mapping from grid buckets to disk pages. A grid bucket must fit within a leaf page. If a bucket overflows, the grid is split on one of the dimensions.

2.3 Problems with Current Indices

The index structures described above suffer from the following problems when performing proximity joins with high-dimensional points:

Number of Neighboring Leaf Nodes. The splitting algorithms in the R-tree variants utilize every dimension equally for splitting in order to minimize the volume of hyper-rectangles. This causes the number of neighboring leaf nodes within ε-distance of a given leaf node to increase dramatically with the number of dimensions. To see why this happens, assume that an R-tree has partitioned the space so that there is no "dead region" between bounding rectangles. Then, with a uniform distribution in a 3-dimensional space, we may get 8 leaf nodes as shown in Figure 3. Notice that every leaf node is within ε-distance of every other leaf node. In an n-dimensional space, there may be $O(2^n)$ leaf nodes within ε-distance of every leaf node. The problem is somewhat mitigated because of the use of MBRs. However, the number of neighbors within ε-distance still increases dramatically with the number of dimensions.

[Figure 3: Number of neighboring leaf nodes.]

This problem also holds for other multi-dimensional structures, except perhaps the TV-tree. However, the TV-tree suffers from a different problem: it will only use the first k dimensions for splitting, and does not consider any of the others (unless many points have the same value in the first k dimensions). With enough data points, this leads to the same problem as for the R-tree, though for the opposite reason. Since the TV-tree uses only the first k dimensions for splitting, each leaf node will have many neighboring leaf nodes within ε-distance.

Note that the problem affects both the CPU and I/O cost. The CPU cost is affected because of the traversal time as well as the time to screen all the neighboring pages. The I/O cost is affected because we have to access all the neighboring pages.

Storage Utilization. The kdb tree and R-tree family, including the X-tree, represent the bounding regions of each node by rectangles. The bounding rectangles are represented by the "min" and "max" points of the hyper-rectangle. Thus, the space needed to store the representation of bounding rectangles increases linearly with the number of dimensions.
This is not a problem for the hB-tree (which does not store MBRs), the TV-tree (which only uses a few dimensions at a time), or the grid file.

Traversal Cost. When traversing an R-tree or kdb tree, we have to examine the bounding regions of children in the node to determine whether to traverse the subtree. This step requires checking the ranges of every dimension in the representation of bounding rectangles. Thus, the CPU cost of examining bounding rectangles increases proportionally to the number of dimensions of the data points. This problem is mitigated for the hB-tree or the TV-tree. This is not a problem for the grid-file.

Build Time. The set of objects participating in a spatial join may often be pruned by selection predicates [LR94] (e.g. find all similar international funds). In those cases, it may be faster to perform the non-spatial selection predicate first (select international funds) and then perform the spatial join on the result. Thus it is sometimes necessary to build a spatial index on-the-fly. Current indices are designed to be built once; the cost of building them can be more than the cost of the join [PD96].

Skewed Data. Handling skewed data is a problem for the grid-file. In a k-dimensional space, a single data page overflow may result in a (k-1)-dimensional slice being added to the grid-file directory. If the grid-file had n buckets before the split, and the splitting dimension had m partitions, n/m new cells are added to the grid after the split. Thus, the size of the directory structure can grow rapidly for skewed high-dimensional points.

Summary. Each index has good and bad features for proximity joins of high-dimensional points. It would be difficult to design a general-purpose multi-dimensional index which does not have any of the shortcomings listed above. However, by designing a special-purpose data structure, we can attack these problems. We now describe a new data structure, the ε-kd tree, which is a special-purpose data structure for this purpose. This data structure can be considered analogous to hash tables built for equality joins.

3 The ε-kd tree

We introduce the ε-kd tree in Section 3.1 and then discuss its design rationale in Section 3.2.

3.1 ε-kd tree definition

We first define the ε-kd tree (footnote 2). We then describe how to perform proximity joins using the ε-kd tree, first for the case where the data fits in memory, and then for the case where it does not.

ε-kd tree. We assume, without loss of generality, that the co-ordinates of the points in each dimension lie between 0 and +1. Pointers to the data points are stored in leaf nodes. We start with a single leaf node. Whenever the number of points in a leaf node exceeds a threshold, the leaf node is split, and converted to an interior node.
If the leaf node was at level i, the ith dimension is used for splitting the node. The node is split into ⌊1/ε⌋ parts, such that the width of each new leaf node in the ith dimension is either ε or slightly greater than ε. In the rest of this section, we assume without loss of generality that ε is an exact divisor of 1. An example of an ε-kd tree for a two-dimensional space, with ε = 0.25, is shown in Figure 4.

Footnote 2: It is really a trie, but we call it a tree since it is conceptually similar to the kdb tree.
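The splitting rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `fanout_for`, `child_index`, `split_if_full` and `insert` are hypothetical, levels are numbered from 0, points are plain tuples rather than pointers, and a leaf that has used every dimension is simply allowed to grow.

```python
class Node:
    """A leaf holds points; after a split it holds only child pointers."""
    def __init__(self, level):
        self.level = level        # depth; also the dimension used if this node splits
        self.points = []          # data points (tuples) while the node is a leaf
        self.children = None      # list of `fanout` children once the node is split


def fanout_for(eps):
    # floor(1/eps) stripes, each of width >= eps. The tiny offset guards
    # against 1/eps landing just below an integer in floating point.
    return int(1.0 / eps + 1e-9)


def child_index(value, fanout):
    # Coordinates lie in [0, 1]; the value 1.0 is placed in the last stripe.
    return min(int(value * fanout), fanout - 1)


def split_if_full(node, fanout, threshold, dims):
    if len(node.points) <= threshold or node.level >= dims:
        return                    # small enough, or no dimension left to split on
    node.children = [Node(node.level + 1) for _ in range(fanout)]
    for p in node.points:
        node.children[child_index(p[node.level], fanout)].points.append(p)
    node.points = []
    for child in node.children:   # a crowded stripe may need to split again
        split_if_full(child, fanout, threshold, dims)


def insert(root, point, eps, threshold):
    fanout = fanout_for(eps)
    node = root
    while node.children is not None:
        node = node.children[child_index(point[node.level], fanout)]
    node.points.append(point)
    split_if_full(node, fanout, threshold, len(point))


# Two-dimensional example with eps = 0.25, so every split creates 4 stripes.
root = Node(0)
for p in [(0.1, 0.9), (0.3, 0.2), (0.6, 0.5), (1.0, 0.0)]:
    insert(root, p, eps=0.25, threshold=3)
print([len(c.points) for c in root.children])  # prints [1, 1, 1, 1]
```

Note that an internal node stores nothing but its child array: the stripe a point belongs to is computed from the coordinate itself, so no bounding rectangle is needed.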

[Figure 4: ε-kd tree. A root node whose children are leaves.]

Proximity Join using the ε-kd tree. Let x be an internal node in the ε-kd tree. We use x[i] to denote the ith child of x. Let f be the fanout of the tree. Note that f = 1/ε. Figure 5 describes the join algorithm. The algorithm initially calls self-join(root), for the self-join version, or join(root1, root2), for the non-self-join version. The procedures leaf-join(x, y) and leaf-self-join(x) perform a sort-merge join on leaf nodes.

    procedure join(x, y)
    begin
        if leaf-node(x) and leaf-node(y) then
            leaf-join(x, y);
        else if leaf-node(x) then begin
            for i = 1 to f do
                join(x, y[i]);
        end
        else if leaf-node(y) then begin
            for i = 1 to f do
                join(x[i], y);
        end
        else begin
            for i = 1 to f - 1 do begin
                join(x[i], y[i]);
                join(x[i], y[i+1]);
                join(x[i+1], y[i]);
            end
            join(x[f], y[f]);
        end
    end

    procedure self-join(x)
    begin
        if leaf-node(x) then
            leaf-self-join(x);
        else begin
            for i = 1 to f - 1 do begin
                self-join(x[i]);
                join(x[i], x[i+1]);
            end
            self-join(x[f]);
        end
    end

Figure 5: Join algorithm

For high-dimensional data, the ε-kd tree will rarely use all the dimensions for splitting. (For instance, with [...] dimensions and an ε of 0.1, there would have to be more than [...] points before all dimensions are used.) Thus we can usually use one of the free unsplit dimensions as a common "sort dimension". The points in every leaf node are kept sorted on this dimension, rather than being sorted repeatedly during the join. When joining two leaf nodes, the algorithm does a sort-merge using this dimension. Note that we can use the same join code for all the L_p distance metrics, with only the final test between a pair of points being metric-dependent.

Memory Management. The value of ε is often given at run-time. Since the value of ε is a parameter for building the data structure, it may not be possible to build a disk-based version of the ε-kd tree in advance. Instead, we sort the multi-dimensional points on the first splitting dimension and keep them as an external file.

We first describe the join algorithm, assuming that main memory can hold all points within a 2ε distance on the first dimension, and then generalize it. The join algorithm first reads points whose values in the sorted dimension lie between 0 and 2ε, builds the ε-kd tree for those points in main memory, and performs the proximity join in memory. The algorithm then deallocates the space used for the points whose values in the sorted dimension are between 0 and ε, reads points whose values are between 2ε and 3ε, builds the ε-kd tree for these points, and performs the join procedure again. This procedure is continued until all the points have been processed. Note that we only read each point from the disk once. This procedure is illustrated in Figure 6. The shaded portion illustrates the data present in main memory in each step.

[Figure 6: Memory Management: Example 1. (a) Step 1; (b) Step 2; (c) Step 3; (d) Step 4.]

This procedure works because the build time for the ε-kd tree is extremely small. It can be generalized to the case where a 2ε chunk of the data does not fit in memory. The basic idea is to partition the data into ε² chunks using an additional dimension. Then, the join procedure (i.e. read points into memory, build the ε-kd tree, perform the join, and so on) is instead repeated for each 4ε² chunk of the data using the additional dimension. This process is shown in Figure 7. The first square shows the data partitioned into a grid, with the four ε² chunks in the top left corner (11, 12, 21 and 22) shaded. These 4 chunks are read into memory, and the join is performed between these chunks, and for each chunk. Chunks 11 and 21 are then dropped, chunks 13 and 23 are read into memory, and the joins between 12, 22, 13 and 23 are performed. This process continues till all joins are complete.
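The recursive join procedures of Figure 5 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the helper names (`lp_distance`, `build`, `leaf_join`, `join`, `self_join`) are hypothetical, the tree is bulk-built in memory, children are indexed from 0, and a plain nested loop stands in for the sort-merge step on leaf nodes. Only the final distance test depends on the chosen L_p metric.

```python
from itertools import combinations


def lp_distance(a, b, p=2.0):
    """L_p distance; p = float('inf') gives the maximum-coordinate metric."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


class Node:
    def __init__(self, level):
        self.level, self.points, self.children = level, [], None


def build(points, eps, threshold):
    """Bulk-build a tree: level i splits dimension i into floor(1/eps) stripes."""
    fanout, dims = int(1.0 / eps + 1e-9), len(points[0])

    def split(node):
        if len(node.points) <= threshold or node.level >= dims:
            return
        node.children = [Node(node.level + 1) for _ in range(fanout)]
        for pt in node.points:
            idx = min(int(pt[node.level] * fanout), fanout - 1)
            node.children[idx].points.append(pt)
        node.points = []
        for child in node.children:
            split(child)

    root = Node(0)
    root.points = list(points)
    split(root)
    return root


def leaf_join(xs, ys, eps, p, out):
    # Nested loop in place of sort-merge; this test is the only metric-dependent step.
    out.extend((a, b) for a in xs for b in ys if lp_distance(a, b, p) <= eps)


def join(x, y, eps, p, out):
    if x.children is None and y.children is None:
        leaf_join(x.points, y.points, eps, p, out)
    elif x.children is None:
        for child in y.children:
            join(x, child, eps, p, out)
    elif y.children is None:
        for child in x.children:
            join(child, y, eps, p, out)
    else:
        f = len(x.children)
        for i in range(f - 1):    # only the same stripe and adjacent stripes
            join(x.children[i], y.children[i], eps, p, out)
            join(x.children[i], y.children[i + 1], eps, p, out)
            join(x.children[i + 1], y.children[i], eps, p, out)
        join(x.children[f - 1], y.children[f - 1], eps, p, out)


def self_join(x, eps, p, out):
    if x.children is None:
        out.extend((a, b) for a, b in combinations(x.points, 2)
                   if lp_distance(a, b, p) <= eps)
    else:
        f = len(x.children)
        for i in range(f - 1):
            self_join(x.children[i], eps, p, out)
            join(x.children[i], x.children[i + 1], eps, p, out)
        self_join(x.children[f - 1], eps, p, out)
```

A call such as `self_join(build(points, 0.25, threshold=8), 0.25, 2.0, pairs)` appends every qualifying pair to the list `pairs` exactly once. Skipping non-adjacent stripes is safe because two points within ε under any L_p metric differ by at most ε in each coordinate, and every stripe is at least ε wide.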

[Figure 7: Memory Management: Example 2.]

3.2 Design Rationale

Two distinguishing features of the ε-kd tree are:

- Biased Splitting: We do not treat all dimensions equally for choosing split points. Instead, we preferentially split one dimension multiple times.
- ε-Sized Splitting: When we split a node, we split the node in ε-sized chunks.

We discuss below how these features help the ε-kd tree solve the problems with current indices outlined in Section 2.

Number of Neighboring Leaf Nodes. Recall that with current indices, the number of neighboring leaf pages may increase exponentially with the number of dimensions. The ε-kd tree solves this problem because of the biased splitting. When the length of the bounding rectangle of each leaf node in the split dimension is at least ε, at most two neighboring leaf nodes need to be considered for the join test. However, as the length of the bounding rectangle in the split dimension becomes less than ε, the number of neighbor leaf nodes for the join test increases. Hence when a leaf node becomes full, we split the node into several children, each of size ε in the split dimension. We split the node into several children at once, rather than gradually, in order to reduce the build time and the number of children in each neighbor.
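The effect of stripe width on the number of neighbors can be checked with a short calculation. The sketch below is an illustration, not part of the paper: the function names are hypothetical, it assumes every split dimension is cut into equal-width stripes, and it returns upper bounds (stripes at the edge of the space have fewer neighbors).

```python
import math


def neighbour_stripes(width, eps):
    """Stripes on both sides of a given stripe that may hold a point within eps of it."""
    # A stripe j positions away is reachable only if (j - 1) * width < eps,
    # which gives j <= ceil(eps / width) on each side.
    return 2 * math.ceil(eps / width)


def max_neighbour_leaves(width, eps, split_dims):
    """Upper bound on neighbouring leaves when `split_dims` dimensions are split."""
    per_dim = neighbour_stripes(width, eps) + 1   # include the stripe itself
    return per_dim ** split_dims - 1              # exclude the leaf itself


# eps fixed at 0.25; stripes at least eps wide keep two neighbours per dimension,
# while narrower stripes need progressively more.
for width in (0.5, 0.25, 0.125, 0.0625):
    print(width, neighbour_stripes(width, 0.25), max_neighbour_leaves(width, 0.25, 3))
```

With stripes at least ε wide, the bound is 3^d - 1 neighbors for d split dimensions, so it is governed by the number of dimensions actually split rather than by the dimensionality of the data.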

[Figure 8: Global and Local Ordering of Split Dimensions. (a) Choosing split dimension globally; (b) Choosing split dimension locally. Axes D0, D1, D2; X marks the leaf node discussed below.]

We have two alternatives for choosing the next split dimension: global ordering and local ordering. Global ordering uses the same split dimension for all the nodes in the same level, while local ordering chooses the split dimension based on the distribution of points in each node. Examples of these two cases are shown in Figure 8, for a 3-dimensional space. For both orderings, the dimension D0 is used for splitting in the root node (i.e. level 0). For global ordering, only D1 is used for splitting in level 1. However, for local ordering, D1 and D2 are chosen alternately for neighboring nodes in level 1. Consider the leaf node labeled X. With global ordering, it has 5 neighbor leaf nodes (shaded in the figure). The number of neighbors increases to 9 for local ordering. Notice that the space covered by the neighbors for global ordering is a proper subset of that covered by the neighbors for local ordering. The difference in the space covered by the two orderings increases as ε decreases. Hence we choose global ordering for split dimensions, rather than local ordering.

When the number of points is so large that the ε-kd tree is forced to split every dimension, the number of neighbors will be comparable to other indices. However, up to that limit, the number of neighbors depends on the number of points (and their distribution) and ε, and is independent of the number of dimensions.

The order in which dimensions are chosen for splitting can significantly affect the space utilization and join cost if correlations exist between some of the dimensions. This problem can be solved by statistically analyzing a sample of the data, and choosing for the next split the dimension that has the least correlation with the dimensions already used for splitting.

Space Requirements. For each internal node, we simply need an array of pointers to its children. We do not need to store minimum bounding rectangles because they can be computed.
Hence the space required depends only on the number of points (and their distribution), and is independent of the number of dimensions.

Traversal Cost. Since we split nodes in ε-sized chunks, traversal cost is extremely small. The join procedure never has to check bounding rectangles of nodes to decide whether or not they may contain points within ε distance.

Build Time. The build time is small because we do not have complex splitting algorithms, or splits that propagate upwards.

Skewed Data. Since splitting a node does not affect other nodes, the ε-kd tree will handle skewed data reasonably.

4 Analysis

In this section, we analyze the number of join tests for the ε-kd tree, and the number of join and screen tests for the R+ tree. The goal is to understand the behavior of the indices as the number of dimensions and the number of points vary.

For simplicity, we assume that the MBRs of the R+ tree cover the whole space. (That is, we really use a K-D-B tree as a proxy for the R+ tree in this analysis. For datasets with a lot of points, the gap between the MBRs will be small; hence this is a reasonable approximation.) In this analysis, we only consider a uniform distribution of points. We also assume that the R+ tree will always split a rectangle on a dimension which has not been used for splitting (if such a dimension is available; footnote 3), and that the bounding rectangle is split at its midpoint. Further, we assume that the set of free dimensions is the same for all leaf pages. This issue is similar to the issue of global vs. local splitting for the ε-kd tree; hence this assumption is favorable to the R+ tree.

We first consider the case where both the R+ tree and the ε-kd tree have unsplit dimensions left, and then extrapolate to the cases where either the R+ tree, or both indices, have no unsplit dimensions. Table 1 summarizes the notation we will use in this section.

Table 1: Notation

    Total number of points                      T
    Range of points                             0 to 1

                                                R+ tree    ε-kd tree
    Average number of points per leaf node      N_r        N_e
    Number of dimensions used for splitting     D_r        D_e

4.1 R+ tree: analysis

Recall that in the R+ tree, the join test is performed for points in each leaf node, as well as between pairs of leaf nodes that have an overlap when their MBRs are extended by ε. A naive approach to performing the join test would be to use the nested-loop join.
In other words, for each point in the left leaf node, we examine all points in the right leaf node. However, as shown earlier in Figure 2, the algorithm can first screen the points in each leaf node to see if they fit within the extended MBR of the other leaf node.

Footnote 3: The traditional splitting heuristic, which tries to minimize both the volume and perimeter of the hyper-rectangles, will result in the R+ tree usually splitting the MBR on a dimension which has not been used for splitting.

[Figure 9: Effect of Screening. (a) 1-dimensional boundary; (b) 0-dimensional boundary.]

Let $D_r$ be the number of split dimensions used by the R+ tree. Then the R+ tree will have $2^{D_r}$ pages, with $T/2^{D_r}$ points in each. Let $L_m$ be the maximum number of points in a leaf page. Since $T/2^{D_r} \le L_m$, $D_r = \lceil \log_2 (T/L_m) \rceil$. Note that every leaf page falls within the extended MBR of every other leaf node, since they all meet at the mid-point of the space. Thus for each leaf page, the points in every other leaf page have to be screened. Since there are $T/N_r$ pages, and T points have to be screened for each page, the total number of screen tests is given by

$$\#\,\text{Screen Tests} \approx T^2 / N_r \qquad (1)$$

Next, we look at the number of join tests. If two leaf nodes have a $(D_r - k)$-dimensional common boundary (among the dimensions used for splitting), the expected number of points in each node left after screening is $N_r (2\epsilon)^k$, where $N_r$ is the average number of points in a leaf node. (For each dimension which is not a boundary, $2\epsilon$ of the remaining points are left after screening, since the size of the page in each dimension is 1/2.) This is illustrated in Figure 9. For the two leaf nodes with a 1-dimensional (0-dimensional) common boundary, there are $2\epsilon$ ($4\epsilon^2$) of the points left after screening on each node.

The number of join tests for two leaf pages with a $(D_r - k)$-dimensional common boundary would be $(N_r (2\epsilon)^k)^2$, using a nested-loop join. If we first sort the points on one of the dimensions, and use a sort-merge join, the cost drops to

$$\left( N_r (2\epsilon)^k \right)^2 \cdot 2\epsilon = N_r^2 \, (2\epsilon)^{2k+1} \qquad (2)$$

We do not count the time for the sort, since we can sort all the points on one of the unsplit dimensions at the time the tree is built. Hence, if we know the number of neighbor leaf nodes that have a common $(D_r - k)$-dimensional boundary, we can estimate the number of join tests.
We can represent the position of each leaf page in $D_r$-dimensional space as a $D_r$-dimensional boolean array, where "True" in the k-th dimension would correspond to the node being above the mid-point of the space in the k-th dimension. If two leaf pages had the same value for k dimensions, they would have a k-dimensional common boundary. Since there are $\binom{D_r}{k}$ ways of choosing k dimensions that are the same, each leaf page has $\binom{D_r}{k}$ neighbors that have a k-dimensional common boundary.

We can now compute the total number of join tests, for all $(T/N_r)$ leaf pages, by plugging in (2), using $\binom{D_r}{D_r - i} = \binom{D_r}{i}$:

$$\#\,\text{Join Tests} = (T/N_r) \sum_{i=1}^{D_r} \binom{D_r}{i} N_r^2 \, (2\epsilon)^{2i+1} = T N_r \sum_{i=1}^{D_r} \binom{D_r}{i} (2\epsilon)^{2i+1}$$

Since ε is typically quite small, $\epsilon^2$ would be considerably smaller than $1/D_r$. Hence we can ignore terms with $\epsilon^5$ or higher powers of ε, resulting in

$$\approx T N_r D_r \cdot 8\epsilon^3 \qquad (3)$$

The above formula does not include the cost of a self-join on the points in each leaf page. Since this requires $N_r^2 \cdot 2\epsilon$ comparisons per leaf page, and there are $(T/N_r)$ pages, the total number of join tests for these self-joins is

$$T N_r \cdot 2\epsilon \qquad (4)$$

Combining (3) and (4), we get

$$\#\,\text{Join Tests} \approx T N_r \cdot 2\epsilon \, (1 + 4 D_r \epsilon^2) \qquad (5)$$

Since typically $N_r \epsilon \ll T/N_r$ (the number of points per page times ε is far less than the number of leaf pages), and $4 D_r \epsilon^2 < 1$, we get

$$T N_r \cdot 2\epsilon \, (1 + 4 D_r \epsilon^2) < 4\, T N_r\, \epsilon < T \cdot T/N_r$$

That is, the number of join tests (Equation 5) is less than the number of screen tests (Equation 1).

One might argue that skipping the screen test would reduce the cost of the proximity join. Assume that we do not screen and perform the sort-merge join only. Then, the number of join tests between a pair of leaf nodes becomes $N_r \cdot N_r \cdot 2\epsilon$. The cost of screening points for a pair of leaf pages is $2 N_r$. Since $N_r \epsilon > 1$, the join cost without screening is more expensive than the cost of screening.

4.2 ε-kd tree: analysis

Let the ε-kd tree have a depth of $D_e$. Each leaf has $N_e \approx T \epsilon^{D_e}$ points on average. Recall that there is no screen cost for the ε-kd tree. If we use a sort-merge join, the join cost between a pair of leaf nodes is $N_e \cdot N_e \cdot 2\epsilon$.

We do not count the time for the sort since that can be done at the time the tree is built (that is, once per leaf node, rather than once per pair of leaf nodes within ε distance).

A leaf node in the ε-kd tree has at most $3^{D_e} - 1$ neighbors within ε distance. Now, $3^{D_e} - 1 \approx (T/N_e)\,(3\epsilon)^{D_e}$, since $(T/N_e)$ is the number of leaf nodes, and $(3\epsilon)^{D_e}$ the fraction of leaf nodes within ε distance. Thus the total number of join tests is

$$(T/N_e) \left[ (T/N_e)\,(3\epsilon)^{D_e} \right] N_e^2 \cdot 2\epsilon = T^2 \cdot 2\epsilon \, (3\epsilon)^{D_e}$$

We can simplify the formula by multiplying by a fudge factor of 1.5 (at the cost of penalizing the ε-kd tree a little). Thus we get

$$\#\,\text{Join Tests} \approx T^2 \, (3\epsilon)^{D_e + 1} \qquad (6)$$

4.3 Comparison of costs

Since there are no screen costs for the ε-kd tree, Equation 6 shows the dominant factor behind the performance of the ε-kd tree. We can now compare this with the dominant factor for the R+ tree, the number of screen tests given in Equation 1. From the two equations, the ε-kd tree will be faster than the R+ tree when

$$(3\epsilon)^{D_e + 1} < 1/N_r \qquad (7)$$

Plugging some typical values, $N_r = 50$ and $\epsilon = 0.05$, into Equation 7, the number of tests for the ε-kd tree is considerably smaller even for $D_e = 2$. Note that the performance of the R+ tree cannot be improved by increasing the size of leaf pages (thus decreasing $1/N_r$), since the join cost would become dominant as the size increased.

Equation 7 ignores traversal cost, which is much lower for the ε-kd tree than the R+ tree since the ε-kd tree does not have to check for intersection of MBRs. Further, Equation 7 ignores the number of join tests for the R+ tree completely and just considers screen costs. On the other hand, since the R+ tree does not partition the space, but uses MBRs, the number of screens for the R+ tree is likely to be somewhat less than suggested by the above formulae.

In Section 5.5, we give empirical results for the number of join tests for the ε-kd tree, and the number of join and screen tests for the R+ tree, while varying several parameters. These results correlate well with the above analysis.
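The cost estimates in Equations 1 and 5 to 7 are easy to evaluate numerically. The sketch below is a worked illustration only: the function names are hypothetical, and the values T = 100,000 and D_r = 10 are assumptions chosen for the example (the text supplies only N_r = 50, ε = 0.05 and D_e = 2).

```python
from math import comb


def rplus_screen_tests(T, N_r):
    """Equation 1: each of the T/N_r pages screens all T points."""
    return T * T / N_r


def rplus_join_tests_full(T, N_r, D_r, eps):
    """Untruncated sum over neighbour classes plus the per-page self-joins (Equation 4)."""
    cross = sum(comb(D_r, i) * (2 * eps) ** (2 * i + 1) for i in range(1, D_r + 1))
    return T * N_r * (cross + 2 * eps)


def rplus_join_tests(T, N_r, D_r, eps):
    """Equation 5: the same quantity with eps^5 and higher powers dropped."""
    return T * N_r * 2 * eps * (1 + 4 * D_r * eps ** 2)


def ekd_join_tests(T, D_e, eps):
    """Equation 6, including the fudge factor of 1.5."""
    return T * T * (3 * eps) ** (D_e + 1)


def ekd_is_cheaper(N_r, D_e, eps):
    """Equation 7: compares the dominant terms of the two indices."""
    return (3 * eps) ** (D_e + 1) < 1 / N_r


# T and D_r are assumed values for illustration; the rest are the typical values quoted above.
T, N_r, D_r, D_e, eps = 100_000, 50, 10, 2, 0.05
print("R+ screen tests      :", rplus_screen_tests(T, N_r))
print("R+ join tests (Eq. 5):", rplus_join_tests(T, N_r, D_r, eps))
print("R+ join tests (full) :", rplus_join_tests_full(T, N_r, D_r, eps))
print("e-kd join tests      :", ekd_join_tests(T, D_e, eps))
print("e-kd cheaper (Eq. 7) :", ekd_is_cheaper(N_r, D_e, eps))
```

For these values the truncated Equation 5 stays within one percent of the full sum, and $(3\epsilon)^{D_e+1} = 0.15^3 \approx 0.0034$ is well below $1/N_r = 0.02$, which is the comparison Equation 7 expresses.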
4.4 Other Cases

We now consider the case where both indices have no unsplit dimensions, followed by the case where only the R+ tree has no unsplit dimensions. (Since the R+ tree does unbiased splitting, it will always run out of dimensions before the ε-kd tree.)

If the ε-kd tree has no unsplit dimensions left, we expect the number of join tests to be similar for the ε-kd tree and the R+ tree, since both of them will partition the space in a similar manner. For high-dimensional data, we expect this case to be very rare. For example, with an ε of 0.1, and dimensional data, there have to be more than 11 points before this occurs.

If only the R+ tree has no unsplit dimensions left, the performance gap will narrow compared to the case where both trees have unsplit dimensions. As the R+ tree fills the space, the screen cost becomes less dominant compared to the join cost. However, the sum of the costs is still higher than for the ε-kd tree, since the R+ tree still operates in a higher dimensional space than the ε-kd tree. Another way to look at this case is to consider it as lying between the case where both trees have unsplit dimensions and the case where both trees have no unsplit dimensions. Since there is a gradual transition, the performance gap narrows until the number of join plus screen tests becomes comparable when both trees have no unsplit dimensions left.

5 Performance Evaluation

We empirically compared the performance of the ε-kd tree with both the R+ tree and a sort-merge algorithm. The experiments were performed on an IBM RS/ workstation with a CPU clock rate of 66 MHz, 128 MB of main memory, and running AIX. Data was stored on a local disk, with measured throughput of about 1.5 MB/sec.

We first describe the algorithms compared in Section 5.1, and the datasets used in the experiments in Section 5.2. Next, we show the performance of the algorithms on synthetic and real-life datasets in Sections 5.3 and 5.4 respectively. Finally, we explain the observed performance by looking at the number of join tests and screen counts in Section 5.5.

5.1 Algorithms

ε-kd tree. We implemented the ε-kd tree algorithm described in Section 3.1. A leaf node was converted to an internal node (i.e., split) if its memory usage exceeded 4096 bytes. However, if there were no dimensions left for splitting, the leaf node was allowed to exceed this limit.
The execution times for the ε-kd tree include the I/O cost of reading an external sorted file containing the data points, as well as the cost of building the data structure. Since the external file can be generated once and reused for different values of ε, the execution times do not include the time to sort the external file.

R+ tree. Our experiments indicated that the R+ tree was faster than the R tree for proximity joins on a set of high-dimensional points. (Recall that the difference between the R+ tree and the R* tree is that the R+ tree does not allow overlap between minimum bounding rectangles. Hence it reduces the number of overlapping leaf nodes to be considered for the spatial proximity join, resulting in faster execution time.) We therefore used the R+ tree for our experiments. We used a page size of

bytes. In our experiments, we ensured that the R+ tree always fit in memory and that a built R+ tree was available in memory before the join execution began. Thus, the execution time for the R+ tree does not include any build time; it only includes CPU time for the main-memory join. (Although this gives the R+ tree an unfair advantage, we err on the conservative side.)

2-level Sort-Merge. Consider a simple sort-merge algorithm, which reads the data from a file sorted on one of the dimensions and performs the join test on all pairs of points whose values in the sort dimension are closer than ε. We implemented a more sophisticated version of this algorithm, which reads a 2ε chunk of the sorted data into memory, further sorts this data in memory on a second dimension, and then performs the join test on pairs of points whose values in the second sort dimension are closer than ε. The algorithm then drops the first chunk from memory and reads the next chunk, and so on. The execution times reported for this algorithm also do not include the external sort time.

Table 2 summarizes the costs included in the execution times for each algorithm.

                          ε-kd tree    R+ tree    Sort-Merge
Join Cost                 Yes          Yes        Yes
Build Cost                Yes          No         -
Sort Cost (first dim.)    No           -          No

Table 2: Costs included in the execution times.

5.2 Data Sets and Performance Metrics

Synthetic Datasets. We generated two types of synthetic datasets: uniform and Gaussian. The values in each dimension were randomly generated in the range -1.0 to 1.0 with either a uniform or a Gaussian distribution. For the Gaussian distribution, the mean and the standard deviation were 0 and 0.25 respectively. Table 3 shows the parameters for the datasets, along with their default values and the range of values for which we conducted experiments.

Parameter               Default Value    Range of Values
Number of Points        ,000             ,000 to 1 million
Number of Dimensions                     4 to 28
ε (join distance)                        to 0.2
Range of Points         -1 to +1         -same-
Distance Metric         L2-norm          L1, L2, L∞ norms

Table 3: Synthetic Data Parameters

Distance Functions. We used L1, L2 and L∞ as distance functions in our experiments. The extended bounding rectangles obtained by extending MBRs by ε differ slightly in the R+ tree depending on the distance function. The accompanying figure shows the extended bounding regions for the L1, L2 and L∞ norms.

[Figure: Bounding Regions extended by ε. Panels: (a) L1, (b) L2, (c) L∞.]

[Figure 11: Performance on Synthetic Data: ε Value. Panels: Uniform Distribution, Gaussian Distribution. Axes: Epsilon vs. Execution Time (sec.).]

The rectangles with solid lines represent the MBR of a leaf node, and the dashed lines the extended bounding regions. This difference in the regions covered by the extended regions may result in a slightly different number of intersecting leaf nodes for a given leaf node. However, in the R-tree family of spatial indices, the selection query is usually represented by rectangles to reduce the cost of traversing the index. Thus, the extended bounding rectangles used to traverse the index for both L1 and L2 become the same as that for L∞.

5.3 Results on Synthetic Data

ε value. Figure 11 shows the results of varying ε from 0.01 to 0.2, for both uniform and Gaussian data distributions. L2 is used as the distance metric. We did not explore the behavior of the algorithms for ε greater than 0.2 since the join result becomes too large to be meaningful. Note that the execution times are shown on a log scale.

The ε-kd tree algorithm is typically around 2 to 20 times faster than the other algorithms. For low values of ε (0.01), the 2-level sort-merge algorithm is quite effective. In fact, the sort-merge algorithm and the ε-kd algorithm do almost the same actions, since the ε-kd tree will only have around 2 levels (excluding the root). For the Gaussian distribution, the performance gap between the ε-kd tree and the R+ tree narrows for high values of ε because the join result is very large.
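To make the baseline concrete, here is a minimal sketch of the simple (one-level) sort-merge join described in Section 5.1, together with the three distance functions used in the experiments. It is not the authors' implementation: the function names, the in-memory list representation, the choice of dimension 0 as the sort dimension, and the use of "less than or equal to ε" as the join condition are all assumptions made for illustration.

```python
import math


def distance(p, q, norm="L2"):
    """Distance between two equal-length points under the L1, L2 or Linf norm."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if norm == "L1":
        return sum(diffs)
    if norm == "L2":
        return math.sqrt(sum(d * d for d in diffs))
    if norm == "Linf":
        return max(diffs)
    raise ValueError(f"unknown norm: {norm}")


def sort_merge_join(points, eps, norm="L2"):
    """Return sorted index pairs (i, j), i < j, of points within eps of each other.

    Points are ordered on dimension 0. A pair is join-tested only while the
    gap in that dimension is at most eps; once the gap exceeds eps, every
    later point in the sorted order is also too far away, so the scan stops.
    """
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    result = []
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if points[j][0] - points[i][0] > eps:
                break
            if distance(points[i], points[j], norm) <= eps:
                result.append((min(i, j), max(i, j)))
    return sorted(result)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.05, 0.05), (0.5, 0.5), (0.05, 0.2)]
    print(sort_merge_join(sample, eps=0.1, norm="L2"))
```

The early `break` is what the sort buys: for any of the three norms, a gap larger than ε in a single dimension already rules the pair out, so only points inside an ε-wide window of the sort dimension are ever compared.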

[Figure 12: Performance: Distance Metrics (Gaussian Distribution). Panels: L1-norm, L∞-norm. Axes: Epsilon vs. Execution Time (sec.).]

Distance Metric. Figure 12 shows the results of varying ε for the L1 and L∞ norms for the Gaussian distribution. The results for the same datasets for the L2 norm were shown in Figure 11. The relative performance of the algorithms is almost identical for the three distance metrics. Although not shown, we obtained similar results for the uniform distribution, and in our other experiments as well. Hence we only show the results for the L2-norm in the remaining experiments.

Number of Dimensions. Figure 13 shows the results of increasing the number of dimensions from 4 to 28. Again, the execution times are shown using a log scale. The ε-kd algorithm is around 5 to 19 times faster than the sort-merge algorithm. For 8 dimensions or higher, it is around 3 to 47 times faster than the R+ tree, with the performance gap increasing with the number of dimensions. For 4 dimensions, it is only slightly faster, since there are enough points for the ε-kd tree to be filled in all dimensions.

For the R+ tree, increasing the number of dimensions increases the overhead of traversing the index, as well as the number of neighboring leaf nodes and the cost of screening them. Hence the time increases dramatically when going from 4 to 28 dimensions.⁴ Even the sort-merge algorithm performs better than the R+ tree at higher dimensions. In contrast, the execution time for the ε-kd tree remains roughly constant as the number of dimensions increases.

Number of Points. To see the scale-up of the ε-kd tree, we varied the number of points from ,000 to 1,000,000. The results are shown in Figure 14. For the R+ tree, we do not show results for 1,000,000 points because the tree no longer fit in main memory. None of the algorithms has linear scale-up, but the sort-merge algorithm has somewhat worse scale-up than the other two algorithms.
For the Gaussian distribution, the performance advantage of the ε-kd tree compared to the R+ tree remains fairly constant (as a percentage). For the uniform distribution, the relative performance advantage of the ε-kd tree varies since the average depth of the ε-kd tree does not increase gradually as the number of points increases. Rather, it jumps suddenly, from around 3 to around 4, etc. These transitions occur between 20,000 and 50,000 points, and between 500,000 and 750,000 points.

⁴ The dip in the R+ tree execution time when going from 4 to 8 dimensions for the Gaussian distribution is because of the decrease in join result size. This effect is also noticeable for the ε-kd tree, for both distributions.

[Figure 13: Performance on Synthetic Data: Number of Dimensions. Panels: Uniform Distribution, Gaussian Distribution. Axes: Dimension vs. Execution Time (sec.).]

[Figure 14: Performance on Synthetic Data: Number of Points. Panels: Uniform Distribution, Gaussian Distribution. Axes: Number of Points ('000s) vs. Execution Time (sec.).]

Non-self-joins. Figure 15 shows the execution times for a proximity join between two different datasets (generated with different random seeds). The size of one of the datasets was fixed at ,000 points, and the size of the other dataset was varied from ,000 points down to 5,000 points. For experiments where the second dataset had ,000 points or fewer, each experiment was run 5 times with different random seeds for the second dataset and the results averaged. With both datasets at ,000 points, the performance gap between the R+ tree and the ε-kd tree is similar to that on a self-join with 200,000 points. As the size of the second dataset decreases, the performance gap also decreases. The reason is that the time to build the index is included for the ε-kd tree, but not for the R+ tree.

[Figure 15: Non-self-joins. Panels: Uniform Distribution, Gaussian Distribution. Axes: Ratio of Size of Two Data Sets vs. Execution Time.]

[Figure 16: Performance on Mutual Fund Data. Axes: Epsilon vs. Execution Time, and Dimension vs. Execution Time.]

5.4 Experiment with a Real-life Data Set

We experimented with the following real-life dataset.

Similar Time Sequences. Consider the problem of finding similar time sequences. The algorithm proposed in [ALSS95] first finds similar "atomic" subsequences, and then stitches together the atomic subsequence matches to get similar subsequences or similar sequences. Each sequence is broken into atomic subsequences by using a sliding window of size w. The atomic subsequences are then mapped to points in a w-dimensional space. The problem of finding similar atomic subsequences now corresponds to the problem of finding pairs of w-dimensional points within ε distance of each other, using the L∞ norm. (See [ALSS95] for the rationale behind this approach.)

The time sequences in our experiment were the daily closing prices of 795 U.S. mutual funds, from Jan 4, 1993 to March 3, There were around 400,000 points for the experiment (since each sequence is broken using a sliding window). The data was obtained from the MIT AI Laboratories' Experimental Stock Market Data Server. We varied the window size (i.e., dimension) from 8 to 16 and ε from 0.05 to 0.2. Figure 16 shows the resulting execution times for the three algorithms. The results are quite similar to those obtained on the synthetic dataset, with the ε-kd tree outperforming the other two algorithms.

5.5 Number of join and screen tests

Figure 17 shows the number of join tests and screen counts for the 3 algorithms. In general, the R+ tree has fewer join tests than the ε-kd tree, but considerably more screen tests. Notice that the relative curves for the join tests for the ε-kd tree and sort-merge, and the screen tests for the R+ tree, are very similar to the execution times shown in Section 5.4.

The relative number of join and screen tests for the R+ tree, and the join tests for the ε-kd tree, are also as predicted by the analysis in Section 4. In particular, consider the graphs showing the numbers of tests while varying the number of dimensions. At higher dimensions, the R+ tree has a lot more screen tests than join tests. The gap decreases at lower dimensions as the R+ tree "fills out" the space. Further, the number of join tests for the ε-kd tree is independent of the number of dimensions.

As the number of points increases, the number of join tests for the ε-kd tree increases smoothly for the Gaussian data, since the average height of the tree increases smoothly. For uniform data, the number of join tests actually decreases when going from 25,000 to 50,000 points and from 500,000 to 750,000 points. The reason is that the average depth of the ε-kd tree jumps from around 3 (excluding the root) to around 4 and from 4 to 5, decreasing the number of join tests.

5.6 Summary

The ε-kd tree was typically 2 to 47 times faster than the R+ tree on self-joins, with the performance gap increasing with the number of dimensions. It was typically 2 to 20 times faster than the sort-merge. The 2-level sort-merge was usually slower than the R+ tree. But for high dimensions (> 15) or low values of ε (0.01), it was faster than the R+ tree. For non-self-joins, the results were similar when the datasets being joined were not of very different sizes.
For datasets with different sizes (e.g. 1: ratio), the ε-kd tree was still faster than the R+ tree. But the performance gap narrowed since we include the build time for the ε-kd tree, but not for the R+ tree. The distance metric did not significantly affect the results: the relative performance of the algorithms was almost identical for the L1, L2 and L∞ norms.

6 Biased R-trees

We showed that the ε-kd tree is a fast data structure for high-dimensional proximity joins. Since the R-tree family is a very popular index structure, and many commercial systems have implemented

[Figure 17: Join Test and Screen Counts. Panels vary Epsilon (Uniform Distribution, Gaussian Distribution), Number of Points ('000s), and Dimension. Series: 2-level Sort-merge: # of Join Tests; ε-kd tree: # of Join Tests; R+ tree: # of Screen Tests; R+ tree: # of Join Tests. Counts are plotted on a log scale.]

R trees, we decided to explore the possibility of incorporating some of the ideas used in the ε-kd tree into the R tree family. Although we look at the R+ tree in this section, we expect the results to also apply to other members of the R tree family.

So far, we built the ε-kd tree on the fly, as required. However, this is not an option for the R+ tree since its build time is much higher. We first examine the performance of the ε-kd tree when built with a different ε value than the ε value used for the proximity join, to see if it still maintains its performance advantage.

6.1 Using different build and join ε values for the ε-kd tree

We looked at the result of building the ε-kd tree with one ε value, say the build ε, and doing the join with a different ε, say the join ε. Figure 18 shows the execution time for a fixed join ε of 0.1, for different build ε values. The horizontal line shows the time for the R+ tree for the same ε join. As expected, the best performance for the ε-kd tree occurs when the build ε is the same as the join ε. However, its performance is better than that for the R+ tree throughout a wide range of build ε values.

[Figure 18: Effect of using different build ε values. Panels: Uniform Distribution, Gaussian Distribution. Axes: Build Epsilon vs. Execution Time.]

There are several effects which impact the shape of the graph.

Build ε > 0.1. As the build ε increases, the number of join tests increases, since the volume of the space in adjacent buckets increases. This results in a gradual increase in the overall execution time, for both distributions.

Build ε < 0.1. For build ε from 0.1 to around 0.025, the dominant factor is again the number of join tests. Assume the depth of each leaf node is $k$. When both build ε and join ε are 0.1, each leaf node has $3^k - 1$ neighbors within ε distance. However, when the build ε value is a bit smaller than 0.1 (e.g. ), there are $5^k - 1$ neighbors within ε distance for each leaf node. Since the number of points in each bucket is almost the same whether the build ε is 0.1 or 0.099, the total number of join tests shoots up when the build ε is decreased slightly from 0.1. As the build ε decreases further, the number of points in each leaf node decreases, while the number of neighbors stays the same. Hence the number of join tests, and the total execution time, decrease from build ε of through When the build ε decreases from 0.05 to 0.049, the number of neighbors jumps again, from $5^k - 1$ to $7^k - 1$, and the cycle continues. For the uniform distribution, the average height of the tree decreases abruptly by around 1 level when the build ε is around Hence the number of join tests increases. For the Gaussian distribution, there are no abrupt changes in the level of the tree. Hence this pattern is clearly visible from 0.1 to around

Build ε ≪ 0.1. For build ε below 0.02, the dominant factor is the traversal cost. For the uniform distribution, the number of leaf nodes increases threefold as the build ε drops from 0.02 to More importantly, the number of neighbors to look at increases from around 3 to Though the number of join tests does not change dramatically, the overhead of making function calls while traversing the tree increases the overall time. For the uniform distribution, this effect is mitigated by the fact that the average height of the tree comes down as the build ε decreases from 0.02 to Hence the overall time actually comes down before increasing again.

6.2 Biased Splitting for the R+ tree

Since the ε-kd tree retains its performance advantage even if it is built with a different ε than the join ε, we explore using ε as a parameter for building the R+ tree. We apply a biased splitting heuristic to the R+ tree: the same dimension is selected repeatedly for splitting as long as the length of this dimension in the MBR is at least 2ε. Traditional heuristics are used to decide the split point in the split dimension. If the length of the current split dimension is less than 2ε, the split heuristic moves to the next dimension cyclically. We call this modified R+ tree the biased R+ tree.
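The dimension-selection rule of the biased heuristic can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the MBR is represented as two lists of per-dimension lower and upper bounds, the function name `next_split_dimension` is hypothetical, and continuing around the cycle until some dimension qualifies (returning `None` when none does) is an assumption, since the text only says the heuristic moves to the next dimension. The choice of split point within the chosen dimension, which the text leaves to traditional heuristics, is not shown.

```python
def next_split_dimension(mbr_low, mbr_high, current_dim, eps):
    """Choose the split dimension for a node under the biased heuristic.

    Keep splitting on current_dim while the MBR extent in that dimension is
    at least 2*eps. Otherwise move cyclically to the next dimension whose
    extent is at least 2*eps. Returns None if no dimension qualifies
    (an assumed convention, not taken from the paper).
    """
    num_dims = len(mbr_low)
    for step in range(num_dims):
        dim = (current_dim + step) % num_dims
        if mbr_high[dim] - mbr_low[dim] >= 2 * eps:
            return dim
    return None


if __name__ == "__main__":
    eps = 0.1
    # Dimension 0 is still wide enough, so it is chosen again.
    print(next_split_dimension([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], 0, eps))
    # Dimension 0 has shrunk below 2*eps, so the heuristic moves on.
    print(next_split_dimension([0.0, 0.0, 0.0], [0.15, 1.0, 1.0], 0, eps))
```

Repeatedly choosing the same dimension is what keeps the number of split dimensions, and hence the number of neighboring leaf pages, small compared with unbiased splitting.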
We compared two versions of the R+ tree, one with a traditional splitting algorithm (that is, unbiased splitting) and the other with the biased splitting algorithm. Figure 19 shows the average number of leaf nodes within ε distance for the R+ tree and the biased R+ tree, as the number of dimensions increases. Biased splitting results in around 5 to 25 times fewer intersecting pages, with the ratio increasing with the number of dimensions.

Figure 20 shows the execution times for the biased R+ tree. As expected, the performance is mid-way between the performance of the R+ tree and the ε-kd tree. There are two main reasons for the biased R+ tree remaining slower than the ε-kd tree. First, the traversal cost is much higher for the biased R+ tree since the extended regions for each leaf page still have to be checked against the MBRs in internal nodes. Second, the biased splitting heuristic stops splitting at 2ε, compared to ε for the ε-kd tree. We obtained similar results when varying ε and the number of points.

Since the biased R+ tree only uses a few dimensions for splitting at each level, it need not store
