The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Yufei Tao
Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong
taoyf@cs.cityu.edu.hk

Christos Faloutsos
Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
christos@cs.cmu.edu

Dimitris Papadias
Department of Computer Science, HKUST, Clear Water Bay, Hong Kong
dimitris@cs.ust.hk

ABSTRACT
Existing estimation approaches for multi-dimensional databases often rely on the assumption that data distribution in a small region is uniform, which seldom holds in practice. Moreover, their applicability is limited to specific estimation tasks under a certain distance metric. This paper develops the Power-method, a comprehensive technique applicable to a wide range of query optimization problems under various metrics. The Power-method eliminates the local uniformity assumption and is accurate even in scenarios where existing approaches completely fail. Furthermore, it performs estimation by evaluating only one simple formula with minimal computational overhead. Extensive experiments confirm that the Power-method outperforms previous techniques in terms of accuracy and applicability to various optimization scenarios.

1. INTRODUCTION
The most common query types in multi-dimensional (e.g., spatial, multimedia, time-series) databases (DB) can be classified into three categories:

Given a query point q and a radius r, a range search (RS) retrieves all points $o \in DB$ such that $dist(o,q) \le r$, where $dist(o,q)$ denotes the distance between o and q.

A k nearest neighbor (kNN) query returns the k closest data points $o_1, o_2, ..., o_k$ to a query q, or formally: given any point $o \in DB - \{o_1, o_2, ..., o_k\}$, $dist(o_i,q) \le dist(o,q)$ for all $1 \le i \le k$.

A regional distance (self-) join (RDJ) retrieves all pairs of objects (in some constrained region) that are close to each other. Formally, it specifies a query point q, a constraint radius r, a distance threshold t, and returns $(o_1,o_2) \in DB \times DB$ such that $o_1 \ne o_2$, $dist(o_1,q) \le r$, $dist(o_2,q) \le r$, and $dist(o_1,o_2) \le t$.
The special case where $r = \infty$ corresponds to a global distance join (GDJ).

The result of any query depends on the underlying distance metric. The most widely used metric is the $L_x$ norm. Formally, given two points q and o whose coordinates on the i-th dimension ($1 \le i \le m$) are $q_i$ and $o_i$ respectively, their $L_x$ distance is:

$$L_x(q,o) = \left[ \sum_{i=1}^{m} |q_i - o_i|^x \right]^{1/x} \quad (\text{for } x \ne \infty)$$

$$L_\infty(q,o) = \max_{i=1}^{m} |q_i - o_i| \quad (\text{for } x = \infty)$$

For instance, (i) if $L_\infty$ is assumed, RS corresponds to a window query centered at q with extent 2r on each dimension whereas, (ii) if $L_2$ is assumed, the query region corresponds to a circle centered at q with radius r.

1.1 Motivation
Efficient optimization of multi-dimensional queries requires accurate estimation of the following values [29, 7, 4]:

Query selectivity, where the selectivity of a RS (RDJ) query is the ratio between the number of retrieved points (pairs) and the dataset cardinality (size of the cartesian product $DB \times DB$). The concept of selectivity does not apply to kNN queries, where exactly k points are returned.

Query cost, in terms of the number of disk accesses using a multi-dimensional index.

Existing estimation methods fall into two categories. Local methods produce a tailored estimate depending on the query's concrete location, typically using a histogram [1, 17], which partitions the data space into (disjoint or overlapping) buckets. As explained later, histograms have several limitations. First, they assume that the data distribution in each bucket is uniform (i.e., local uniformity assumption), which is rarely true for real datasets. Second, estimation for queries under any metric other than $L_\infty$ may require expensive evaluation time. Third, their applicability to kNN queries is questionable.

Global methods [15, 7] provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations. Such techniques avoid the problems of histograms, but have a serious drawback: since queries at various locations may have different characteristics, their respective selectivity/cost can differ considerably from the average value.
In summary, there is currently no comprehensive approach able to perform all estimation tasks (i.e., selectivity/cost prediction in any distance metric) effectively. Such an approach is highly desirable in practical systems, where it is important to have a single efficient method that applies to all cases, instead of multiple methods that are effective for certain sub-problems but inefficient/inapplicable to others.
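To make the query types and the $L_x$ metric of Section 1 concrete, the following brute-force sketch may help. It is only an illustration of the definitions, not part of the proposed method: the function names are hypothetical, points are assumed to be tuples of numbers, and the join reports each unordered pair once (an assumption; the definition above is stated over $DB \times DB$).

```python
import math
from itertools import combinations


def lx_distance(q, o, x=2):
    """L_x distance between points q and o; x=math.inf gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(q, o)]
    if math.isinf(x):
        return max(diffs)
    return sum(d ** x for d in diffs) ** (1.0 / x)


def range_search(db, q, r, x=2):
    """RS: all points within L_x distance r of q (linear scan)."""
    return [o for o in db if lx_distance(q, o, x) <= r]


def knn(db, q, k, x=2):
    """kNN: the k points closest to q under L_x (sort-based, for clarity)."""
    return sorted(db, key=lambda o: lx_distance(q, o, x))[:k]


def regional_distance_join(db, q, r, t, x=2):
    """RDJ: pairs inside the constrained region that are within distance t.

    Assumption of this sketch: each unordered pair is reported once.
    """
    region = range_search(db, q, r, x)
    return [(a, b) for a, b in combinations(region, 2)
            if lx_distance(a, b, x) <= t]
```

Under $L_\infty$ the range search above behaves like a window query with extent 2r per dimension, while under $L_2$ it selects the points inside a circle of radius r, matching the two cases mentioned in Section 1.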

1.2 Contributions
This paper develops the Power-method, a novel estimation technique which combines the advantages of both local and global methods, but avoids their deficiencies. The proposed method possesses the following attractive features:

It eliminates the local uniformity assumption and, therefore, is accurate even in scenarios where existing techniques fail.

It is the first comprehensive technique applicable to selectivity and cost estimation for all the query types mentioned earlier.

It supports all $L_x$ metrics with small space requirements.

It performs any query estimation by evaluating only one simple formula; hence, its computational overhead is minimal.

Extensive experimentation shows that the Power-method achieves accurate estimation (with average relative error below 2%) in circumstances where traditional methods completely fail.

The rest of the paper is organized as follows. Section 2 reviews existing query estimation techniques. Section 3 introduces the novel concept of local power law, proves the related theorems, and illustrates its implementation in practice. Section 4 experimentally evaluates its performance, and Section 5 concludes the paper with directions for future work.

2. RELATED WORK
Most approaches for multi-dimensional selectivity estimation are based on histograms [22, 26, 1, 17, 19, 8], which construct a small number of rectangular buckets, and store for each bucket b the number of points $N_b$ in its extents (see footnote 1). The density $D_b$ of bucket b is defined as $D_b = N_b / a_b$, where $a_b$ is the area of b. To estimate the selectivity of a RS (range search) query q, a histogram first computes, for each bucket b, the intersection area $a_{int}$ (with q); assuming that the data in each bucket are uniform, the number of qualifying objects inside this bucket is then estimated as $D_b \cdot a_{int}$. The final selectivity is obtained by summing the estimate from each intersecting bucket. Theodoridis et al.
[33] suggest that, under the local uniformity assumption, RDJ (regional distance join) selectivity estimation can be reduced to GDJ (global distance join) on uniform data, for which several cost models are available [2, 3, 33]. The idea is to apply these uniform models inside the query constrained region Q, based on the density of the bucket that covers Q. If Q intersects multiple buckets, their average density is used. The cost estimation of RS [33, 29] and RDJ [33] follows the same reasoning.

Application of histograms has been limited mainly to RS queries under the $L_\infty$ metric, where the intersection area $a_{int}$ (between the bucket and the query) is a (hyper) rectangle (e.g., query $q_1$ in Figure 1a). For queries in the other metrics, however, the area computation is usually expensive. In Figure 1a, for instance, the $L_2$ RS query $q_2$ intersects bucket b in a rather irregular shape whose area $a_{int}$ is difficult to compute. To tackle this problem, Berchtold et al. [5] suggest the Monte Carlo method, which generates a set of uniform points in the bucket, counts the number of them in the query region, and estimates $a_{int}$ as $a_b$ multiplied by the fraction of generated points that fall in the query region, where $a_b$ is the bucket's area. A large number of random points (which increases with the dimensionality) is necessary to obtain accurate estimation. Repeating this process for every (partially) intersecting bucket may lead to significant overhead.

Footnote 1: In case of overlapping buckets, a point in the intersection region of multiple buckets is assigned to a unique bucket.

[Figure 1. Deficiencies of histograms: (a) irregular intersection area, (b) difficulty for kNN]

We are not aware of any local estimation method for kNN query cost. The main difficulty of applying histograms lies in the prediction of $d_k$ (the distance from the query point q to its k-th NN), which is the first step of the cost analysis. The value of $d_k$ should be such that the vicinity circle centered at q with radius $d_k$ is expected to cover k points (assuming uniformity in buckets).
Finding such a circle requires non-trivial repetitive tuning (increasing or decreasing the radius, depending on how many points fall in the current trial circle). This is especially complicated if the circle intersects multiple buckets (the circle in Figure 1b intersects 3 buckets) and produces irregular intersection regions. To avoid this, [5, 4] apply their (uniform) kNN cost model to non-uniform distributions anyway, and show that (surprisingly) sufficient accuracy may be achieved. In Section 3.3 we reveal the conditions that must be satisfied for this conjecture to hold in practice.

Other local multi-dimensional estimation techniques include sampling [25, 27, 12, 34, 6, 13, 18], kernel estimation [10], singular value decomposition [26], compressed histograms [24, 2, 23, 35], sketches [32], maximal independence [14], Euler formula [19, 3, 21], etc. These methods, however, target specific cases (mostly RS selectivity under the $L_\infty$ norm), and their extensions to other problems are unclear.

Finally, among the global estimation methods, [7, 16] analyze selectivity estimation of RS and GDJ using datasets' fractal dimensions, but their analysis does not address RDJ. Regarding average query costs, [15] studies RS queries, while kNN retrieval is discussed in [28, 4].

3. THE LOCAL POWER LAW
Section 3.1 explains the problems associated with the local uniformity assumption in histograms. Then, Section 3.2 describes the local power law (LPLaw) that overcomes these problems, and Section 3.3 solves all the estimation problems using the LPLaw. Finally, Section 3.4 elaborates the implementation of LPLaw in practice. Our analysis focuses on biased queries (i.e., the query distribution follows that of data), due to their practical importance [7, 4, 28, 34], while unbiased queries are discussed in Section 3.4.

3.1 Density trap
Histograms are based on the hypothesis that the data distribution in sufficiently small regions of space is piece-wise uniform. Thus, they compute and store the density of such regions.
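The histogram estimate reviewed in Section 2, which rests on this hypothesis, can be sketched as follows for an $L_2$ range search in 2D. The sketch assumes axis-aligned rectangular buckets given as `((xmin, ymin, xmax, ymax), count)` pairs and approximates each intersection area with the Monte Carlo procedure attributed to [5]. The function names, the bucket layout, and the fixed random seed are illustrative assumptions, not details taken from the cited work.

```python
import random


def monte_carlo_intersection_area(rect, q, r, samples=1000, rng=None):
    """Approximate the area of rect that lies inside the circle (q, r)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed: an assumption, for repeatability
    xmin, ymin, xmax, ymax = rect
    bucket_area = (xmax - xmin) * (ymax - ymin)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        if (x - q[0]) ** 2 + (y - q[1]) ** 2 <= r * r:
            hits += 1
    # bucket area times the fraction of generated points inside the query
    return bucket_area * hits / samples


def histogram_rs_estimate(buckets, q, r, samples=1000):
    """Estimated number of qualifying points: sum of D_b * a_int per bucket."""
    rng = random.Random(0)
    total = 0.0
    for rect, count in buckets:
        xmin, ymin, xmax, ymax = rect
        density = count / ((xmax - xmin) * (ymax - ymin))  # D_b = N_b / a_b
        a_int = monte_carlo_intersection_area(rect, q, r, samples, rng)
        total += density * a_int  # relies on uniformity inside the bucket
    return total
```

Dividing the returned count by the dataset cardinality gives the selectivity estimate. Every bucket, including those that do not intersect the query, is sampled here for simplicity; a practical version would first skip disjoint buckets and use the exact area for fully covered ones.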
A main contribution of this work is to show that the apparently reasonable hypothesis of local uniformity is wrong for the vast majority of real data, as well as the vast majority of perfect, Euclidean-geometry datasets. Consider a set of N=1 2D points along the major diagonal of a unit square and a query point q (Figure 2). What is the density in the vicinity of q, e.g., a square region centered at q? The answer is surprising: undefined!

[Figure 2. Density trap (query point q on the diagonal)]

The density changes dramatically with the radius r of the vicinity, diverging to infinity when r goes to zero (we call this phenomenon the density trap)! This is so counter-intuitive that we give an arithmetic example. If the radius is r=.5, the density is 1. If the radius shrinks to r=.5, then the density is 1! Conversely, the density goes to zero with growing radius! In fact it is easy to show that the density D(r) for radius r is given by: D(r)=N/(4r). Notice that the paradox is generated not by a mathematical oddity like the Hilbert curve or the Sierpinski triangle, but by a line, a perfectly behaving Euclidean object! Moreover, many real objects behave like lines or collections of lines (highway networks, rivers). Thus, the problem is real, and cannot be downplayed as a mathematical oddity; it does appear in practice! The next question then is: if the local density around a point q is not well defined, what is the invariant that could help us describe the neighborhood of q? The following section provides the answer.

3.2 Assumptions, definitions, and properties
The resolution to the density trap comes from the concept of intrinsic dimensionality. The data points around the vicinity of point q in Figure 2 form a linear manifold; asking for their density is as unusual as asking for the area of a line. The invariant we hinted at before is encapsulated in the concept of Local Power Law (LPLaw):

Assumption 3.1 (Local power law): Let p be a data point in a dataset. For a certain range of r, the number $nb_p(r)$ of points with distances no more than r from p can be represented as:

$$nb_p(r) = c_p \cdot r^{n_p}$$

where $c_p$ and $n_p$ are constants termed the local constant and exponent, respectively. For convenience we refer to $c_p$ and $n_p$ collectively as the coefficients of the LPLaw for point p.

Each LPLaw models the distribution around a particular data point (e.g., q in Figure 2 with $c_q = 2N$ and $n_q = 1$). In general, however, the LPLaws of various points may differ in two aspects. First, two points can have different local exponents, implying that the data correlation characteristics in their respective vicinities are different. This is obvious in Figure 3a, which mixes four distributions with different intrinsic dimensions. The local exponent of each point is determined by the intrinsic dimension of the region it lies in. Second, two points may have similar local exponents but different constants, implying that the data densities in their respective vicinities are different. Figure 3b illustrates the 2D independent Zipf distribution, where all points have local exponents 2 (revealing the independence between the two dimensions), while those in dense areas have higher constants.

[Figure 3. Non-fractal examples: (a) mixture dataset, (b) 2D Zipf]

Given a point p in a m-dimensional dataset, we define the $L_\infty$ $\rho$-neighborhood of p as the m-dimensional box centered at p with length $2\rho$ on each axis. Then, the LPLaw coefficients of p can be measured as follows.

Lemma 3.1 (LPLaw under $L_\infty$): Given a data point p such that the data distribution in its $L_\infty$ $\rho$-neighborhood is self-similar with intrinsic dimension $d_p$, then the LPLaw of p under the $L_\infty$ metric is:

$$nb_p(r) = \frac{N_p}{\rho^{d_p}} \, r^{d_p}$$

where $N_p$ denotes the number of points in the $L_\infty$ $\rho$-neighborhood of p.

Note that $\rho$ can be any value in the range of r where the LPLaw holds. Similarly, the LPLaw of other distance metrics $L_x$ ($x \ne \infty$) can be measured in the $L_x$ $\rho$-neighborhood of a point p (i.e., a $L_x$ m-dimensional sphere centered at p with radius $\rho$), while the following lemma provides a faster way to derive a $L_x$ LPLaw from its $L_\infty$ counterpart.
Lemma 3.2 (LPLaw under $L_x$): Given a m-dimensional data point p with $L_\infty$ LPLaw $nb_p(r) = c_p \cdot r^{n_p}$, its $L_x$ LPLaw is:

$$nb_{xp}(r) = c_p \left( \frac{VolSphere_x(1)}{VolSphere_\infty(1)} \right)^{n_p/m} r^{n_p}$$

where $nb_{xp}(r)$ is the number of points within $L_x$ distance r from p, and $VolSphere_x(1)$ is the volume of a m-dimensional $L_x$ sphere with radius 1.

It is clear that the LPLaws of a point p under different distance metrics have the same local exponent $n_p$, and their local constants can be derived from each other using $n_p$. The intuitive explanation is that the choice of metric only affects how many points fall in the r-neighborhood of a point (related to the neighborhood volume), but does not influence the way data are correlated (which is a data characteristic captured by the exponent).

The LPLaw is satisfied in many real datasets. We illustrate this using two real distributions (Figure 4): (i) the SC dataset, which contains 36k points representing the coast line of Scandinavia, and (ii) the CA dataset, which contains 62k points representing locations in California. Figure 5a plots $nb_p(r)$ (i.e., the number of points within distance r to a point p in the $L_\infty$ metric) as a function of r (in log-log scale) for the two points $p_1$, $p_2$ in Figure 4a. Similarly, Figure 5b illustrates the same information for the two points $p_3$, $p_4$ in Figure 4b. It is clear that $nb_p(r)$ approximates a power law in all cases, and has various exponents (i.e., slopes of the fitting lines in Figure 5) for different points.

[Figure 4. Real distributions: (a) SC dataset, with points $p_1$, $p_2$; (b) CA dataset, with points $p_3$, $p_4$]

[Figure 5. LPLaw in real data ($nb_p(r)$ vs. r, log-log scale): (a) SC dataset, (b) CA dataset]

Figure 6a (6b) demonstrates the local exponent (constant) distribution for SC. The values are measured using $L_\infty$ .5-neighborhoods (i.e., a square with length .1 on each axis). Figures 6c, 6d illustrate the corresponding distributions for CA. Constants and exponents differ substantially in the data space (suggesting that a global law for the whole dataset would introduce inaccuracy), but are highly location-dependent (confirming the intuition behind LPLaw).

[Figure 6. LPLaw coefficient distributions: (a) exponent map (SC), (b) constant map (SC), (c) exponent map (CA), (d) constant map (CA)]

3.3 Estimation using the LPLaw
In the sequel we apply LPLaw to solve all the estimation problems defined in Section 1, namely, the selectivity of RS, RDJ, and the query cost of RS, kNN, and RDJ.

Theorem 3.1 (RS selectivity): Given a RS query point q with radius r, the selectivity of q is:

$$Sel(q) = \frac{c_q}{N} \, r^{n_q}$$

where N is the cardinality of the dataset, and $c_q$, $n_q$ are the local constant (under the corresponding distance metric) and exponent at location q, respectively.

Proof: Straightforward, since the number $nb_q(r)$ of points retrieved by q satisfies the LPLaw $nb_q(r) = c_q \cdot r^{n_q}$. Hence the selectivity equals $c_q \cdot r^{n_q} / N$.

Theorem 3.2 (RDJ selectivity): Let q be a RDJ query with (i) constraint radius r, and (ii) distance threshold t. The selectivity of q is:

$$Sel(q) = \frac{c_q^2 \, (t \cdot r)^{n_q}}{2N^2}$$

where N is the cardinality of the dataset, and $c_q$, $n_q$ are the local constant (under the corresponding distance metric) and exponent at location q, respectively.

Proof: (Sketch) Let $N_q$ be the number of points in the constrained region (i.e., within distance r to q). By the LPLaw (Assumption 3.1), $N_q = c_q \cdot r^{n_q}$. Applying the GDJ selectivity formula proposed in [7] inside the constrained region (after appropriate normalization), we obtain that the total number of qualifying pairs equals $N_q^2 (t/r)^{n_q} / 2$. Substituting $N_q$ and dividing by the size $N^2$ of the cartesian product, Theorem 3.2 follows.

For query cost estimation we consider the R*-tree [11] due to its popularity and simplicity. In particular, we measure the query cost in terms of the number of R-tree leaf accesses since (i) this number dominates the total cost, and (ii) in practice non-leaf nodes are usually found in the buffer directly. Unlike the selectivity analysis, the derivation for the query cost results in different formulae for various distance metrics. In the sequel we provide the theorems for $L_\infty$, as the results for the other metrics can be derived in a similar manner. Our discussion is based on the following lemma:

Lemma 3.3 (R-tree node extent): Let l be the length of a leaf MBR on each dimension; then $l = 2 (f/c)^{1/n}$, where c, n are the $L_\infty$ local constant and exponent at the centroid of the MBR respectively, and f is the average node fanout (i.e., number of entries in a node).

Theorem 3.3 (RS query cost): Given a RS query point q with radius r, the cost of q is:

$$Cost(q) = \frac{c_q}{f} \left[ \left( \frac{f}{c_q} \right)^{1/n_q} + r \right]^{n_q}$$

where $c_q$, $n_q$ are the local constant (under metric $L_\infty$) and exponent at q, and f is the average node fanout.

Proof: (Sketch) This problem can be reduced to the average query cost of a hypothetical point dataset with cardinality $c_q / 2^{n_q}$ and intrinsic dimension $n_q$. As shown in [4], the $L_\infty$ RS query cost for such a dataset equals $[c_q / (2^{n_q} f)] (l + 2r)^{n_q}$, where l is the length of a leaf MBR on each axis (see Lemma 3.3). The theorem results from this equation after substituting $l = 2 (f/c_q)^{1/n_q}$ and simplifying.

Theorem 3.4 (kNN query cost): The cost of a kNN query q is:

$$Cost(q) = \frac{1}{f} \left( k^{1/n_q} + f^{1/n_q} \right)^{n_q}$$

where $n_q$ is the local exponent at q and f the average node fanout.

Proof: Let $d_k$ be the distance between the query point q and its k-th NN. Then, by Assumption 3.1, we have $k = c_q (d_k)^{n_q}$, where $c_q$ is the local constant at location q; hence $d_k = (k/c_q)^{1/n_q}$. As proven in [5], the cost of a kNN query equals that of a RS query with radius $d_k$. As a result, the kNN cost can be represented as in Theorem 3.3, setting $r = d_k$. The formula in the theorem results from the necessary simplification (which removes $c_q$).

An important observation from the above theorem is that the cost of a kNN query is not affected by the local constant, but depends only on the local exponent. Thus, the conjecture of [5, 4], that the kNN cost is the same for all datasets (i.e., independently of the data density in the vicinity of q), only holds for datasets with the same local exponent (i.e., 2 if uniform models are applied).

Theorem 3.5 (RDJ query cost): Let q be a RDJ with (i) constraint radius r, and (ii) distance threshold t. The cost of q is:

Cost(q)= 1/ n 2n 1/ n n 1/ n 2 ( f / c ) t c + f 2 c f 2 + + + f c f c

where $c_q$ and $n_q$ are the local constant (under metric $L_\infty$) and exponent at q, respectively.

Proof: (Sketch) By Theorem 3.3, the number of nodes intersecting the constrained region is $(c_q/f) [ (f/c_q)^{1/n_q} + r ]^{n_q}$. The centroid distribution of these nodes' MBRs also has intrinsic dimension $n_q$. The query cost equals two times the number of centroid pairs within distance l+t (where l is the extent of a leaf) plus the number of intersecting nodes. Following the GDJ selectivity analysis [7], the number of pairs can be derived as =½[(l+t)/] n. After substituting the corresponding variables, the theorem follows.
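Since every estimate in this section is a closed-form expression of the local coefficients, the formulas are cheap to evaluate. The sketch below shows one possible reading of Lemma 3.2, Lemma 3.3 and Theorems 3.1 to 3.4, following the proof sketches (for example, the RDJ selectivity is computed as the pair count $N_q^2 (t/r)^{n_q} / 2$ divided by $N^2$). Theorem 3.5 is left out. All function names are hypothetical, and `unit_sphere_volume` relies on the standard closed form for the volume of a unit $L_x$ ball, which the text does not spell out.

```python
import math


def unit_sphere_volume(m, x):
    """Volume of the m-dimensional unit ball under the L_x norm."""
    if math.isinf(x):
        return 2.0 ** m  # a box with side length 2
    return (2.0 * math.gamma(1.0 / x + 1.0)) ** m / math.gamma(m / x + 1.0)


def lx_constant(c_inf, n, m, x):
    """Lemma 3.2: L_x local constant from the L-infinity constant c_inf."""
    ratio = unit_sphere_volume(m, x) / unit_sphere_volume(m, math.inf)
    return c_inf * ratio ** (n / m)


def rs_selectivity(c, n, r, N):
    """Theorem 3.1: c * r^n / N."""
    return c * r ** n / N


def rdj_selectivity(c, n, r, t, N):
    """Theorem 3.2, computed step by step as in the proof sketch."""
    n_q = c * r ** n                      # points in the constrained region
    pairs = n_q ** 2 * (t / r) ** n / 2   # qualifying pairs
    return pairs / N ** 2


def leaf_extent(c, n, f):
    """Lemma 3.3: side length of a leaf MBR."""
    return 2.0 * (f / c) ** (1.0 / n)


def rs_cost(c, n, r, f):
    """Theorem 3.3: expected leaf accesses of an L-infinity range search."""
    return (c / f) * ((f / c) ** (1.0 / n) + r) ** n


def knn_cost(n, k, f):
    """Theorem 3.4: expected leaf accesses of a kNN query (no constant)."""
    return (k ** (1.0 / n) + f ** (1.0 / n)) ** n / f
```

As a quick consistency check, `knn_cost(n, k, f)` should agree with `rs_cost(c, n, (k / c) ** (1 / n), f)` for any positive `c`, which mirrors the observation that the kNN cost does not depend on the local constant.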
3.4 Implementation of the Power-method
The last section showed that all the estimation tasks can be performed by evaluating a single equation based on the LPLaw at the query location. Motivated by the fact that points close in space usually have similar local constants and exponents (see Figure 6), we can pre-compute the LPLaw for a set of representative points, and perform the estimation using the LPLaw of a point close to q. Based on this idea, we select a set A of anchor points from the database (using any sampling technique [25, 27, 34]) and, for each anchor point, compute the LPLaw coefficients in its $L_\infty$ $\rho$-neighborhood (using Lemma 3.1). Given a biased query q, the estimation algorithm first finds the point p in A nearest to q, and then obtains the estimates using the LPLaw of p.

It is worth mentioning that this method differs considerably from a sampling method in the following ways. First, it can perform all the estimation tasks, while sampling has been limited to selectivity estimation. Second, even for selectivity prediction, it evaluates the LPLaw of a single anchor point, while sampling examines each point against the query conditions. Third, it is efficient independently of the data distribution, while it is a well-known problem that sampling is inaccurate for skewed data [12]. Fourth, it achieves satisfactory accuracy using a very small number (1 in our experiments) of anchors, while sampling requires a much larger fraction of the dataset [12].

The Power-method is optimized for biased queries. On the other hand, query optimization for unbiased queries is less important because RS and RDJ queries in the non-populated regions usually return only a small number of objects, and the query optimizer should employ an index structure (rather than sequential scan). Detecting whether a query is biased is easy: we check if the query region covers any anchor point. The optimizer performs query estimation only if the answer is positive, using the LPLaw associated with the patch (anchor point) closest to q.
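The anchor-based procedure described in this section might be sketched as follows under the $L_\infty$ metric. Several choices here are assumptions made for illustration and are not stated in the text: anchors are drawn by simple random sampling, the coefficients are fitted by least squares in log-log space over a few radii up to `rho` (the text only says they are computed in the neighborhood using Lemma 3.1), and all function names are hypothetical.

```python
import math
import random


def linf(p, q):
    """L-infinity distance between two points."""
    return max(abs(a - b) for a, b in zip(p, q))


def fit_lplaw(data, p, rho, steps=8):
    """Fit nb_p(r) = c * r**n over radii in (0, rho]; returns (c, n).

    Assumption: a log-log least-squares fit stands in for the measurement
    of Lemma 3.1. p is expected to be a data point, so counts are >= 1.
    """
    xs, ys = [], []
    for i in range(1, steps + 1):
        r = rho * i / steps
        count = sum(1 for o in data if linf(o, p) <= r)
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx                        # slope = local exponent
    c = math.exp(mean_y - n * mean_x)    # intercept = log of local constant
    return c, n


def build_anchors(data, num_anchors, rho, seed=0):
    """Sample anchor points and pre-compute (point, c, n) for each."""
    rng = random.Random(seed)
    return [(a, *fit_lplaw(data, a, rho))
            for a in rng.sample(data, num_anchors)]


def estimate_rs_selectivity(anchors, q, r, N):
    """Theorem 3.1 evaluated at the anchor nearest to q.

    Returns None when the query region covers no anchor (unbiased query).
    """
    p, c, n = min(anchors, key=lambda a: linf(a[0], q))
    if linf(p, q) > r:
        return None
    return c * r ** n / N
```

On points spread evenly along the diagonal of the unit square, `fit_lplaw` should return an exponent close to 1, in line with the example of Figure 2. The same nearest-anchor lookup can feed any of the other formulas of Section 3.3.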
As shown in the experiments, using this policy, our implementation maintains satisfactory performance even for unbiased queries. On the other hand, a kNN query that is far away from data is most likely to be meaningless [9], because in this case the distance from q to its k-th NN is similar to that between q and its (k+k')-th NN, where k' is a large constant, so that the k NNs returned are not necessarily better than the next k' NNs. As suggested in [9], such queries should be avoided, or the users should at least be warned about the significance of the result. Detecting meaningless queries can be achieved by computing the distance between the query and its nearest anchor point. If the distance is larger than a certain threshold, then we judge the query to be meaningless.

4. EXPERIMENTS
This section compares experimentally the accuracy and the computational overhead of the Power-method with minskew [1], a benchmark histogram in the literature [34, 17, 8], and the global method that provides an average estimate using the fractal dimension [15, 7, 4]. For our experiments we use a PIII 1GHz CPU and four real datasets: SC/UK that contain 36k/4k points representing the coast lines of Scandinavia/the United Kingdom, and CA/LB that include 62k/53k points corresponding to locations in California/Long Beach county [31] (Figure 4 shows the visualization of SC and CA).

Our implementation of the Power-method (denoted as power in the sequel) randomly selects a set of anchor points from each dataset. To save space we partition the space into 64×64 cells, snap each anchor point to its closest cell corner, and represent its coordinates as those of the corner (i.e., using 12 bits). The LPLaw coefficients of each anchor point are computed using $L_\infty$ .5-neighborhoods (see Lemma 3.1). The total number of anchors is 1 so that the total memory consumption is 1K bytes.

For minskew, the bucket construction algorithm first obtains, for each cell in the 64×64 grid, the number (i.e., frequency) of data points covered. Then, the cells are grouped into 14 buckets (see footnote 2), so that the size of minskew is also 1K, using a greedy algorithm that aims at minimizing a certain function. The bucket structure is rather sensitive to the dataset and the optimization function deployed; in many cases the algorithm terminates without producing enough buckets (the problem is also mentioned in [34]). Similar tuning problems exist for most multi-dimensional histograms such as genhist [17]. To alleviate the problem, we tested multiple optimization functions. The best overall performance was obtained by minimizing $\sum_{i=1}^{b} (v_i / n_i)$, where $n_i$ is the average frequency of cells in bucket $b_i$ and $v_i$ the variance of $n_i$, and we follow this approach in our implementation.

A query workload consists of 5 queries with the same parameters (unless specifically stated, the query distribution is biased). The workload estimation error is defined as $\sum_i |act_i - est_i| / \sum_i act_i$ (same as [1]), where $act_i$ ($est_i$) is the actual (estimated) selectivity/cost of the i-th query in the workload. We report the results for $L_\infty$ and $L_2$ distance metrics due to their particular importance in practice. For minskew and metric $L_2$, if the query region (of RS/RDJ) partially intersects a bucket, the intersection area is computed using the Monte Carlo method with 1 random points, which (based on our experiments using different numbers) leads to a reasonable tradeoff between estimation time and accuracy. We evaluate all estimation tasks discussed earlier, starting with selectivity prediction.

4.1 Selectivity estimation
Figure 7a shows the error of $L_\infty$ RS selectivity estimation as a function of the query radius r (ranging from .1 to .1) using the SC dataset (see footnote 3). Power significantly outperforms the previous techniques. Note that power has the minimum error at r=.5, because the LPLaw is obtained using .5-neighborhoods. As expected, the global power law yields considerable error due to the selectivity variation of individual queries. Minskew is even less accurate for small queries due to the violation of the local uniformity assumption.
Its precision gradually improves for larger queries (which is consistent with previous studies [1]), because a large query covers more buckets completely, from which precise partial estimation can be obtained. Figure 7b shows the results for the LB dataset under the $L_2$ metric, confirming similar observations.

[Figure 7. Error vs. query radius r (RS selectivity), estimation error (%) for global, minskew and power: (a) SC ($L_\infty$), (b) LB ($L_2$)]

Footnote 2: The number of buckets (14) is larger than the number of anchor points, since each bucket stores a single value (frequency) instead of two (local coefficients).

Footnote 3: Due to the space constraint, in each set of experiments we show the results of two datasets in $L_\infty$ and $L_2$ metrics respectively.

To evaluate the estimation error of regional distance join selectivity, we fix the query constraint radius r to .5, and vary the distance threshold t from .1 to .1. Figure 8 shows the results for various datasets and metrics. There is no prediction by the global power law because, as mentioned in Section 2, the application of this method to RDJ is unclear (existing analysis [16] focuses only on global distance joins). Power is almost an order of magnitude more accurate than minskew. Since selectivity estimation is meaningless for kNN queries, in the sequel we proceed with cost (i.e., number of disk accesses) estimation.

[Figure 8. Error vs. distance threshold t (RDJ selectivity), estimation error (%) for minskew and power: (a) UK ($L_2$), (b) CA ($L_\infty$)]

4.2 Query cost estimation
This section examines the accuracy of predicting the query costs of RS, kNN and RDJ queries. Figure 9 plots the error rate as a function of the query radius r for RS queries. The relative performance of alternative methods (and the corresponding explanation) is similar to that in Figure 7.

[Figure 9. Error vs. query radius r (RS cost), estimation error (%) for global, minskew and power: (a) SC ($L_\infty$), (b) LB ($L_2$)]

As mentioned in Section 2, there is no previous work on kNN cost estimation using histograms. Thus, we replace minskew with the cost model proposed in [5], which assumes local uniformity around the query's location. Figure 10 compares this model with power and global, varying k from 1 to 1.

[Figure 10. Error vs. k (kNN cost), estimation error (%) for global, local uniformity and power: (a) UK ($L_2$), (b) CA ($L_\infty$)]

The local uniformity assumption leads to substantial error, confirming the necessity of capturing local data correlation. Global has reasonable performance because, according to Theorem 3.4, the cost of kNN queries is only affected by the local exponent at the query point. As a result, compared to other estimation problems, the variation of local constants does not introduce additional error. The fact that the accuracy of power improves with k can be explained as follows. For small k, the distance from the query to the k-th NN is very short, and falls out of the range where the LPLaw holds. As this distance increases with k, it is better modeled by the LPLaw, leading to more accurate prediction.

Figure 11 compares the cost prediction of alternative methods for regional distance joins (query radius r=.5), where the distance threshold t ranges from .1 to .1. The behavior is analogous to Figure 8, and the superiority of power is obvious (the global power law is again inapplicable).

[Figure 11. Error vs. distance threshold t (RDJ cost, r=.5), estimation error (%) for minskew and power: (a) SC ($L_\infty$), (b) LB ($L_2$)]

4.3 Estimation for unbiased queries
To study the efficiency of power for unbiased queries, we repeat the above experiments using workloads where query locations are uniformly distributed in the data space. Figure 12 illustrates the results for RS selectivity prediction. Compared with biased workloads (Figure 7), the accuracy of all methods, except global, improves, as many queries return empty results, in which case their estimation is trivial. The error of global increases because the average selectivity does not capture unbiased queries at all. Since the relative performance of all the methods is the same in the other experiments with uniform workloads, we omit them to avoid redundancy.

[Figure 12. Results for unbiased queries (range search), estimation error (%) for global, minskew and power: (a) UK ($L_2$), (b) CA ($L_\infty$)]

4.4 Pre-processing and estimation overhead
This section evaluates the computational overhead of the various methods.
The global power law is omitted as its overhead is negligible. According to Table 1, minskew is slightly faster to construct than power. However, as discussed in Section 2, histograms (including minskew) may incur significant overhead in general $L_x$ metrics ($x \ne \infty$).

Table 1. Construction time (seconds)
Dataset   Power   Minskew
SC        18      15
UK        19      15
CA        38      32
LB        32      29

In order to verify this, Table 2 compares the estimation time for $L_2$ queries (for each method given the same radius, the estimation time for $L_2$ RS/RDJ selectivity/cost estimation is similar). The overhead of power is constant because only one point is used to obtain the local coefficients, independently of the number of intersecting buckets. On the other hand, minskew is significantly more expensive and its overhead increases with the query radius, as more buckets partially intersect the query (for which the Monte Carlo method must be performed).

Table 2. Estimation time for $L_2$ RS/RDJ queries (msec)
Query radius   Power   Minskew
.1             .5      3
.3             .5      5
.5             .5      9
.7             .5      15
.9             .5      25

5. CONCLUSION
The uniformity assumption is extremely easy to believe and analyze. Although it took many years before it was discredited, it is still lingering, hiding inside the local uniformity assumption, which is the basis underneath most multidimensional histograms. In this paper, we not only spot the density trap problem of the local uniformity assumption, but also show how to resolve it, using a novel perspective, which leads to the concept of Local Power Law. The advantages of LPLaw are:

It accurately models real datasets, and leads to single-formula estimation methods for selectivities of all the popular query types, for all the $L_x$ distance metrics (Manhattan, Euclidean, etc.).

It also leads to single-formula estimation methods for query I/O costs, something that no other published method can achieve.

We also propose a simple implementation of the LPLaw that is fast to initialize and run. Extensive experiments on several real datasets confirm the effectiveness, efficiency and flexibility of our techniques.
Query optimization and data mining are related, both looking for methods to concisely describe a dataset. The parameters of LPLaw do exactly that: they use the local coefficients to describe the vicinity of a point. Thus, these coefficients are suitable for data mining and pattern recognition tasks. For example, in Figure 6a, data in the north-west part of Scandinavia have higher local exponents which, in retrospect, correspond to the Norwegian fjords (northwest Norway has LPLaw exponents in the range 1.3-1.5). This is actually a rule that can be used to detect outliers and extrapolate hidden/corrupted values: suppose that the very north part of Norway is not available; what can we say about it? Clearly, we can speculate that the points we are missing will have high

LPLaw exponents. We believe that the above examples are just the beginning. Exploiting LPLaws for discovering patterns in real datasets (rules, outliers, clusters) is a very promising area of research.

ACKNOWLEDGEMENTS

Yufei Tao and Dimitris Papadias were supported by grants HKUST 6197/3E and HKUST 681/2E from Hong Kong RGC. Christos Faloutsos was supported by the National Science Foundation under Grants No. IIS-9988876, IIS-83148, IIS-11389, IIS-2917, IIS-25224, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-91-1, and by the Defense Advanced Research Projects Agency under Contract No. N661--1-8936.

REFERENCES

[1] Acharya, S., Poosala, V., Ramaswamy, S. Selectivity Estimation in Spatial Databases. SIGMOD, 1999.
[2] Aref, W., Samet, H. A Cost Model for Query Optimization Using R-Trees. ACM GIS, 1994.
[3] An, N., Yang, Z., Sivasubramaniam, A. Selectivity Estimation for Spatial Joins. ICDE, 2001.
[4] Bohm, C. A Cost Model for Query Processing in High Dimensional Data Spaces. TODS, 25(2): 129-178, 2000.
[5] Berchtold, S., Bohm, C., Keim, D.A., Kriegel, H. A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space. PODS, 1997.
[6] Babcock, B., Chaudhuri, S., Das, G. Dynamic Sample Selection for Approximate Query Processing. SIGMOD, 2003.
[7] Belussi, A., Faloutsos, C. Estimating the Selectivity of Spatial Queries Using the Correlation's Fractal Dimension. VLDB, 1995.
[8] Bruno, N., Gravano, L., Chaudhuri, S. STHoles: A Workload Aware Multidimensional Histogram. SIGMOD, 2001.
[9] Beyer, K., Goldstein, J., Ramakrishnan, R. When Is "Nearest Neighbor" Meaningful? ICDT, 1999.
[10] Blohsfeld, B., Korus, D., Seeger, B. A Comparison of Selectivity Estimators for Range Queries on Metric Attributes. SIGMOD, 1999.
[11] Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD, 1990.
[12] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V. Overcoming Limitations of Sampling for Aggregation Queries. ICDE, 2001.
[13] Chaudhuri, S., Das, G., Narasayya, V. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD, 2001.
[14] Deshpande, A., Garofalakis, M., Rastogi, R. Independence Is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD, 2001.
[15] Faloutsos, C., Kamel, I. Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. PODS, 1994.
[16] Faloutsos, C., Seeger, B., Traina, A., Traina, C. Spatial Join Selectivity Using Power Laws. SIGMOD, 2000.
[17] Gunopulos, D., Kollios, G., Tsotras, V., Domeniconi, C. Approximate Multi-Dimensional Aggregate Range Queries over Real Attributes. SIGMOD, 2000.
[18] Jermaine, C. Making Sampling Robust with APA. VLDB, 2003.
[19] Jin, J., An, N., Sivasubramaniam, A. Analyzing Range Queries on Spatial Data. ICDE, 2000.
[20] Lee, J., Kim, D., Chung, C. Multidimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD, 1999.
[21] Lin, X., Liu, Q., Yuan, Y., Zhou, X. Multiscale Histograms: Summarizing Topological Relations in Large Spatial Datasets. VLDB, 2003.
[22] Muralikrishna, M., DeWitt, D. Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. SIGMOD, 1988.
[23] Matias, Y., Vitter, J., Wang, M. Dynamic Maintenance of Wavelet-Based Histograms. VLDB, 2000.
[24] Matias, Y., Vitter, J., Wang, M. Wavelet-Based Histograms for Selectivity Estimation. SIGMOD, 1998.
[25] Olken, F., Rotem, D. Random Sampling from Database Files: A Survey. SSDBM, 1990.
[26] Poosala, V., Ioannidis, Y. Selectivity Estimation without the Attribute Value Independence Assumption. VLDB, 1997.
[27] Palmer, C., Faloutsos, C. Density Biased Sampling: An Improved Method for Data Mining and Clustering. SIGMOD, 2000.
[28] Pagel, B., Korn, F., Faloutsos, C. Deflating the Dimensionality Curse Using Multiple Fractal Dimensions. ICDE, 2000.
[29] Pagel, B., Six, H., Toben, H., Widmayer, P. Towards an Analysis of Range Query Performance in Spatial Data Structures. PODS, 1993.
[30] Sun, C., Agrawal, D., El Abbadi, A. Exploring Spatial Datasets with Histograms. ICDE, 2002.
[31] http://www.census.gov/geo/www/tiger/
[32] Thaper, N., Guha, S., Indyk, P., Koudas, N. Dynamic Multidimensional Histograms. SIGMOD, 2002.
[33] Theodoridis, Y., Stefanakis, E., Sellis, T. Efficient Cost Models for Spatial Queries Using R-trees. TKDE, 12(1): 19-32, 2000.
[34] Wu, Y., Agrawal, D., El Abbadi, A. Applying the Golden Rule of Sampling for Query Estimation. SIGMOD, 2001.
[35] Wang, M., Vitter, J., Lim, L., Padmanabhan, S. Wavelet-Based Cost Estimation for Spatial Queries. SSTD, 2001.