STING: A Statistical Information Grid Approach to Spatial Data Mining

Wei Wang, Jiong Yang, and Richard Muntz
Department of Computer Science
University of California, Los Angeles
{weiwang, jyang, muntz}@cs.ucla.edu

February 20, 1997

Abstract

Spatial data mining, i.e., discovery of interesting characteristics and patterns that may implicitly exist in spatial databases, is a challenging task due to the huge amounts of spatial data and to the new conceptual nature of the problems, which must account for spatial distance. Clustering and region oriented queries are common problems in this domain. Several approaches have been presented in recent years, all of which require at least one scan of all individual objects (points). Consequently, the computational complexity is at least linearly proportional to the number of objects to answer each query. In this paper, we propose a hierarchical statistical information grid based approach for spatial data mining to reduce the cost further. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries and clustering problems can be answered without recourse to the individual objects. In theory, and confirmed by empirical studies, this approach outperforms the best previous method by at least an order of magnitude, especially when the data set is very large.

1 Introduction

In general, spatial data mining, or knowledge discovery in spatial databases, is the extraction of implicit knowledge, spatial relations, and discovery of interesting characteristics and patterns that are not explicitly represented in the databases. These techniques can play an important role in understanding spatial data and in capturing intrinsic relationships between spatial and non-spatial data. Moreover, such discovered relationships can be used to present data in a concise manner and to reorganize spatial databases to accommodate data semantics and achieve high performance. Spatial data mining has wide applications in many fields, including GIS systems, image database exploration, medical imaging, etc. [Che97, Fay96a, Fay96b, Kop96a, Kop96b]

The amount of spatial data obtained from satellite, medical imagery, and other sources has been growing tremendously in recent years. A crucial challenge in spatial data mining is the efficiency of spatial data mining algorithms, due to the often huge amount of spatial data and the complexity of spatial data types and spatial access methods. In this paper, we introduce a new statistical information grid-based method (STING) to efficiently process many common region oriented queries on a set of points. Region oriented queries are defined more precisely later; informally, they ask for the selection of regions satisfying certain conditions on density, total area, etc.

This paper is organized as follows. We first discuss related work in Section 2. We propose our statistical information grid hierarchical structure and discuss the query types it can support in Sections 3 and 4, respectively. The general algorithm as well as a detailed example of processing a query are given in Section 5. We analyze the complexity of our algorithm in Section 6. In Section 7, we analyze the quality of STING's result and propose a sufficient condition under which STING is guaranteed to return the correct result. The limiting behavior of STING is in Section 8 and, in Section 9, we analyze the performance of our method. Finally, we offer our conclusions in Section 10.

2 Related Work

Many studies have been conducted in spatial data mining, such as generalization-based knowledge discovery [Kno96, Lu93], clustering-based methods [Est96, Ng94, Zha96], and so on. Those most relevant to our work are discussed briefly in this section, and we emphasize what we believe are limitations which are addressed by our approach.

2.1 Generalization-based Approach

[Lu93] proposed two generalization-based algorithms: the spatial-data-dominant and the non-spatial-data-dominant algorithms. Both of these require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically. (However, such a hierarchy may not exist, or the hierarchy given by the experts may not be entirely appropriate in some cases.) The quality of mined characteristics is highly dependent on the structure of the hierarchy. Moreover, the computational complexity is O(N log N), where N is the number of spatial objects. Given the above disadvantages, there have been efforts to find algorithms that do not require a generalization hierarchy, that is, to find algorithms that can discover characteristics directly from data. This is the motivation for applying clustering analysis in spatial data mining, which is used to identify regions occupied by points satisfying specified conditions.

2.2 Clustering-based Approach

2.2.1 CLARANS

[Ng94] presents a spatial data mining algorithm based on a clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) on spatial data. This is the first paper that introduces clustering techniques into spatial data mining problems, and it represents a significant improvement on large data sets over traditional clustering methods. However, the computational complexity of CLARANS is still high. In [Ng94] it is claimed that CLARANS is linearly proportional to the number of points, but actually the algorithm is inherently at least quadratic. The reason is that CLARANS applies a random-search-based method to find an optimal clustering. The time taken to calculate the cost differential between the current clustering and one of its neighbors (in which only one cluster medoid is different) is linear, and the number of neighbors that need to be examined for the current clustering is controlled by a parameter called maxneighbor, which is defined as max(250, 1.25%K(N - K)), where K is the number of clusters. This means that the time consumed at each step of searching is Θ(KN²). It is very difficult to estimate how many steps need to be taken to reach the local optimum, but we can certainly say that the computational complexity of CLARANS is Ω(KN²). This observation is consistent with the results of our experiments and those mentioned in [Est96], which show that the performance of CLARANS is close to quadratic in the number of points. Moreover, the quality of the results cannot be guaranteed when N is large, since randomized search is used in the algorithm. In addition, CLARANS assumes that all objects are stored in main memory. This clearly limits the size of the database to which CLARANS can be applied.

2.2.2 BIRCH

Another clustering algorithm for large data sets, called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), is introduced in [Zha96]. The authors employ the concepts of Clustering Feature and CF tree. A clustering feature is summarizing information about a cluster. A CF tree is a balanced tree used to store the clustering features. This algorithm makes full use of the available memory and requires a single scan of the data set. This is done by combining close clusters together and rebuilding the CF tree. This guarantees that the computation complexity of BIRCH is linearly proportional to the number of objects. We believe BIRCH still has one other drawback: this algorithm may not work well when clusters are not spherical, because it uses the concept of radius or diameter to control the boundary of a cluster. (We could not verify this since we do not have the BIRCH source code.)

2.2.3 DBSCAN

Recently, [Est96] proposed a density-based clustering algorithm (DBSCAN) for large spatial databases. Two parameters, Eps and MinPts, are used in the algorithm to control the density of normal clusters. DBSCAN is able to separate noise from clusters of points, where noise consists of points in low density regions. DBSCAN makes use of an R*-tree to achieve good performance. The authors illustrate that DBSCAN can be used to detect clusters of any shape and can outperform CLARANS by a large margin (up to several orders of magnitude). However, the complexity of DBSCAN is O(N log N). Moreover, DBSCAN requires a human participant to determine the global parameter Eps. (The parameter MinPts is fixed to 4 in their algorithm to reduce the computational complexity.) Before determining Eps, DBSCAN has to calculate the distance between a point and its kth (k = 4) nearest neighbor for all points. Then it sorts all points according to the previously calculated distances and plots the sorted k-dist graph. This is a time consuming process. Furthermore, a user has to examine the graph and find the first valley of the graph. The corresponding distance is chosen as the value of Eps, and the resulting clustering quality is highly dependent on the Eps parameter. When the point set to be clustered is the response set of objects satisfying some qualification, then the determination of Eps must be done each time and the cost of DBSCAN will be higher. (In [Est96], the cost quoted did not include this overhead.)

Moreover, all algorithms described above have the common drawback that they are all query-dependent approaches. That is, the structures used in these approaches are dependent on the specific query. They are built once for each query and are generally of no use to answer further queries. Therefore, these approaches need to scan the data sets at least once for each query, which causes the computational complexities of all the above approaches to be at least O(N), where N is the number of objects.

In this paper, we propose a statistical information grid-based approach called STING (STatistical INformation Grid) to spatial data mining. The spatial area is divided into rectangular cells. We have several different levels of such rectangular cells corresponding to different resolutions, and these cells form a hierarchical structure. Each cell at a high level is partitioned to form a number of cells of the next lower level. Statistical information of each cell is calculated and stored beforehand and is used to answer queries. The advantages of this approach are:

- It is a query-independent approach, since the statistical information exists independently of queries. It is a summary representation of the data in each grid cell, which can be used to facilitate answering a large class of queries.
- The computational complexity is O(K), where K is the number of grid cells at the lowest level. Usually, K << N, where N is the number of objects.
- Query processing algorithms using this structure are trivial to parallelize.
- When data is updated, we do not need to recompute all information in the cell hierarchy. Instead, we can do an incremental update.

3 Grid Cell Hierarchy

3.1 Hierarchical Structure

We divide the spatial area into rectangular cells (e.g., using latitude and longitude) and employ a hierarchical structure. Let the root of the hierarchy be at level 1; its children are at level 2, etc. A cell in level i corresponds to the union of the areas of its children at level i + 1. In this paper each cell (except the leaves) has 4 children and each child corresponds to one quadrant of the parent cell. The root cell at level 1 corresponds to the whole spatial area (which we assume is rectangular for simplicity). The size of the leaf level cells is dependent on the density of objects. As a rule of thumb, we choose a size such that the average number of objects in each cell is in the range from several dozens to several thousands. In addition, a desirable number of layers can be obtained by changing the number of cells that form a higher level cell. In this paper, we will use 4 as the default value unless otherwise specified. In this paper, we assume our space is two-dimensional, although it is very easy to generalize this hierarchical structure to higher dimensional models. In two dimensions, the hierarchical structure is illustrated in Figure 1. Some strategies can be applied when constructing the hierarchical structure to ensure K << N; these are beyond the scope of this paper.
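The hierarchy just described is essentially a fixed-depth quadtree over the spatial area. The following is a minimal sketch of such a structure in Python (ours, not the authors' implementation); the class and field names are illustrative assumptions.

```python
# Sketch of the STING cell hierarchy: every non-leaf cell has exactly 4 children
# (one per quadrant) and the root (level 1) covers the whole rectangular area.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cell:
    x0: float
    y0: float
    x1: float
    y1: float
    level: int                                    # root is level 1
    children: List["Cell"] = field(default_factory=list)
    stats: Optional[dict] = None                  # n, m, s, min, max, dist (filled at load time)

def build_hierarchy(x0, y0, x1, y1, levels, level=1):
    """Recursively split the area into quadrants until `levels` layers exist."""
    cell = Cell(x0, y0, x1, y1, level)
    if level < levels:
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        cell.children = [
            build_hierarchy(x0, y0, mx, my, levels, level + 1),   # SW quadrant
            build_hierarchy(mx, y0, x1, my, levels, level + 1),   # SE quadrant
            build_hierarchy(x0, my, mx, y1, levels, level + 1),   # NW quadrant
            build_hierarchy(mx, my, x1, y1, levels, level + 1),   # NE quadrant
        ]
    return cell

# Example: a 7-layer hierarchy over a unit square has 4**6 = 4096 bottom-level cells.
root = build_hierarchy(0.0, 0.0, 1.0, 1.0, levels=7)
```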

Figure 1. Hierarchical Structure. (The 1st level (top level) could have only one cell; a cell at the (i-1)th level corresponds to 4 cells at the ith level.)

For each cell, we have attribute-dependent and attribute-independent parameters. The attribute-independent parameter is:

n: number of objects (points) in this cell

As for the attribute-dependent parameters, we assume that for each object, its attributes have numerical values. (We will address the categorical case in future research.) For each numerical attribute, we have the following five parameters for each cell:

m: mean of all values in this cell
s: standard deviation of all values of the attribute in this cell
min: the minimum value of the attribute in this cell
max: the maximum value of the attribute in this cell
distribution: the type of distribution that the attribute value in this cell follows

The parameter distribution is of enumeration type. Potential distribution types are: normal, uniform, exponential, and so on. The value NONE is assigned if the distribution type is unknown. The distribution type will determine a kernel calculation in the generic algorithm, as will be discussed in detail shortly.

3.2 Parameter Generation

We generate the hierarchy of cells with their associated parameters when the data is loaded into the database. Parameters n, m, s, min, and max of bottom level cells are calculated directly from the data. The value of distribution can either be assigned by the user, if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ²-test. Parameters of higher level cells can easily be calculated from the parameters of their lower level cells. Let n, m, s, min, max, dist be the parameters of the current cell and n_i, m_i, s_i, min_i, max_i, and dist_i be the parameters of the corresponding lower level cells, respectively. Then n, m, s, min, and max can be calculated as follows:

n = Σ_i n_i
m = (Σ_i m_i n_i) / n
s = sqrt( (Σ_i (s_i² + m_i²) n_i) / n - m² )
min = min_i (min_i)
max = max_i (max_i)
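As a concrete illustration of the formulas above, the following Python sketch (ours, not the paper's code) combines child-cell parameters into the parent-cell parameters; the dictionary layout is an assumption, and the dist rule is treated separately by the rules that follow.

```python
import math

def merge_child_parameters(children):
    """Compute the parent-cell n, m, s, min, max from its children, following the
    formulas above.  Each child is a dict with keys 'n', 'm', 's', 'min', 'max'."""
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    s = math.sqrt(sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n - m ** 2)
    return {
        "n": n,
        "m": m,
        "s": s,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# Using the child-cell values of the example in Table 1 below:
children = [
    {"n": 100, "m": 20.1, "s": 2.3, "min": 4.5, "max": 36},
    {"n": 50,  "m": 19.7, "s": 2.2, "min": 5.5, "max": 34},
    {"n": 60,  "m": 21.0, "s": 2.4, "min": 3.8, "max": 37},
    {"n": 10,  "m": 20.5, "s": 2.1, "min": 7.0, "max": 40},
]
parent = merge_child_parameters(children)   # n = 220, m ~ 20.27, s ~ 2.37, min = 3.8, max = 40
```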

The determination of dist for a parent cell is a bit more complicated. First, we set dist to the distribution type followed by the most points in this cell. This can be done by examining dist_i and n_i. Then, we estimate the number of points, say confl, that conflict with the distribution determined by dist, m, and s according to the following rules:

1. If dist_i ≠ dist, m_i ≈ m, and s_i ≈ s, then confl is increased by an amount of n_i;
2. If dist_i ≠ dist, but either m_i ≈ m or s_i ≈ s is not satisfied, then confl is set to n (this enforces that dist will be set to NONE later);
3. If dist_i = dist, m_i ≈ m, and s_i ≈ s, then confl is increased by 0;
4. If dist_i = dist, but either m_i ≈ m or s_i ≈ s is not satisfied, then confl is set to n.

Finally, if confl/n is greater than a threshold t (this threshold is a small constant, say 0.05, which is set before the hierarchical structure is built), then we set dist to NONE; otherwise, we keep the original type.

For example, suppose the parameters of the lower level cells are as follows:

                 1         2         3         4
n              100        50        60        10
m             20.1      19.7      21.0      20.5
s              2.3       2.2       2.4       2.1
min            4.5       5.5       3.8       7
max             36        34        37        40
dist        NORMAL    NORMAL    NORMAL      NONE

Table 1: Parameters of Children Cells

Then the parameters of the current cell will be

n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL.

The distribution type is still NORMAL based on the following: since there are 210 points whose distribution type is NORMAL, dist is first set to NORMAL. After examining dist_i, m_i, and s_i of each lower level cell, we find that confl = 10. So, dist is kept as NORMAL (confl/n = 0.045 < 0.05).
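The distribution-type rule can be sketched the same way. The paper does not specify how the approximate equalities m_i ≈ m and s_i ≈ s are tested, so the relative tolerance used below (0.2, loose enough to reproduce the Table 1 example) is purely an assumption of this sketch.

```python
def merge_distribution(children, parent_m, parent_s, t=0.05, tol=0.2):
    """Sketch of the dist rule above.  `children` is a list of dicts with keys
    'n', 'm', 's', 'dist'.  `tol` is an assumed tolerance for the approximate tests."""
    def close(x, y):
        return abs(x - y) <= tol * max(abs(x), abs(y), 1e-12)

    n = sum(c["n"] for c in children)
    # dist is first set to the type followed by the most points among the children.
    votes = {}
    for c in children:
        votes[c["dist"]] = votes.get(c["dist"], 0) + c["n"]
    dist = max(votes, key=votes.get)

    confl = 0
    for c in children:
        params_ok = close(c["m"], parent_m) and close(c["s"], parent_s)
        if c["dist"] != dist:
            confl = confl + c["n"] if params_ok else n   # rules 1 and 2
        elif not params_ok:
            confl = n                                    # rule 4 (rule 3 adds nothing)
    return "NONE" if confl / n > t else dist

# With the Table 1 children (210 NORMAL points, 10 NONE points): confl/n = 10/220 < 0.05,
# so the parent keeps dist = NORMAL.
```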

We only need to go through the data set once in order to calculate the parameters associated with the grid cells at the bottom level, so the overall compilation time is linearly proportional to the number of objects with a small constant factor. (And this only has to be done once, not once for each query.) With this structure in place, the response time for a query is much faster since it is O(K) instead of O(N). We will analyze performance in more detail in later sections.

4 Query Types

If the statistical information stored in the STING hierarchical structure is not sufficient to answer a query, then we have recourse to the underlying database. Therefore, we can support any query that can be expressed by the SQL-like language described later in this section. However, the statistical information in the STING structure can answer many commonly asked queries very efficiently, and we often do not need to access the full database. Even when the statistical information is not enough to answer a query, we can still narrow the set of possible choices.

STING can be used to facilitate several kinds of spatial queries. The most commonly asked query is the region query, which is to select regions that satisfy certain conditions (Ex1). Another type of query selects regions and returns some function of the region, e.g., the range of some attributes within the region (Ex2). We extend SQL so that it can be used to describe such queries. The formal definition is in the Appendix. The following are several query examples.

Ex1. Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence.

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
  AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
  AND AREA (100, ∞)
  AND WITH CONFIDENCE 0.9

Ex2. Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California.

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
  AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
  AND AREA (100, ∞)
  AND LOCATION California

5 Algorithm

With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries. For each query, we begin by examining cells on a high level layer. Note that it is not necessary to start with the root; we may begin from an intermediate layer (but we do not pursue this minor variation further due to lack of space).

Starting with the root, we calculate the likelihood that this cell is relevant to the query at some confidence level using the parameters of this cell (exactly how this is computed is described later). This likelihood can be defined as the proportion of objects in this cell that satisfy the query conditions. (If the distribution type is NONE, we estimate the likelihood using some distribution-free techniques instead.) After we obtain the confidence interval, we label this cell as relevant or not relevant at the specified confidence level. When we finish examining the current layer, we proceed to the next lower level of cells and repeat the same process. The only difference is that instead of going through all cells, we only look at those cells that are children of the relevant cells of the previous layer. This procedure continues until we finish examining the lowest level layer (bottom layer). In most cases, these relevant cells and their associated statistical information are enough to give a satisfactory result to the query. Then, we find all the regions formed by relevant cells and return them. However, in rare cases (people may want a very accurate result for special purposes, e.g., military applications), this information is not enough to answer the query. Then, we need to retrieve the data that fall into the relevant cells from the database and do some further processing.

After we have labeled all cells as relevant or not relevant, we can easily find all regions that satisfy the specified density by a breadth-first search. For each relevant cell, we examine cells within a certain distance (how to choose this distance is discussed below) from the center of the current cell to see if the average density within this small area is greater than the density specified. If so, this area is marked and all relevant cells we just examined are put into a queue. Each time we take one cell from the queue and repeat the same procedure, except that only those relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region.

The distance we use above is calculated from the specified density and the granularity of the bottom level cell. The distance is d = max(l, √(f/(πc))), where l, c, and f are the side length of a bottom layer cell, the specified density, and a small constant number set by STING (it does not vary from one query to another), respectively. Usually, l is the dominant term in max(l, √(f/(πc))). As a result, this distance can only reach the neighboring cells. In this case, we just need to examine neighboring cells and find regions that are formed by connected cells. Only when the granularity is very small could this distance cover a number of cells. In this case, we need to examine every cell within this distance instead of only the neighboring cells.

For example, if the objects in our database are houses and price is one of the attributes, then one kind of query could be "Find those regions with area at least A where the number of houses per unit area is at least c and at least β% of the houses have price between a and b with (1 - α) confidence", where a < b. Here, a could be -∞ and b could be +∞. This query can be written as

SELECT REGION
FROM house-map
WHERE DENSITY IN [c, ∞)
  AND price RANGE [a, b] WITH PERCENT [β%, 1]
  AND AREA [A, ∞)
  AND WITH CONFIDENCE 1 - α
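To make the control flow above concrete, here is a minimal sketch (ours, not the authors' code) of the top-down labeling pass and the breadth-first region formation, for the common case in which the examining distance d only reaches neighboring cells; the average-density test around each relevant cell is folded into the neighbors predicate for brevity, and all names are illustrative.

```python
from collections import deque

def label_relevant(cell, is_relevant):
    """Top-down pass: label a cell, then descend only into children of relevant cells.
    `is_relevant(stats)` stands for the confidence-interval test described below."""
    cell.relevant = is_relevant(cell.stats)
    if cell.relevant:
        for child in cell.children:
            label_relevant(child, is_relevant)

def form_regions(bottom_cells, neighbors):
    """Group relevant bottom-level cells into regions with a breadth-first search.
    `neighbors(cell)` returns the cells within the examining distance d
    (normally just the adjacent cells, since l usually dominates d)."""
    regions, seen = [], set()
    for start in bottom_cells:
        if not getattr(start, "relevant", False) or id(start) in seen:
            continue
        region, queue = [], deque([start])
        seen.add(id(start))
        while queue:
            cell = queue.popleft()
            region.append(cell)
            for nb in neighbors(cell):
                if getattr(nb, "relevant", False) and id(nb) not in seen:
                    seen.add(id(nb))
                    queue.append(nb)
        regions.append(region)
    return regions
```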

We begin from the top layer that has only one cell and stop at the bottom level. Assume that the price in each bottom layer cell is approximately normally distributed. (For other distribution types the idea is essentially the same, except that we use a different distribution function and lookup table.) Note that price in a higher level cell could have distribution type NONE. For each cell, if the distribution type is normal, we first calculate the proportion of houses whose price is within the range [a, b]. The probability that a price is between a and b is

p = P(a ≤ price ≤ b)
  = P((a - m)/s ≤ (price - m)/s ≤ (b - m)/s)
  = P((a - m)/s ≤ Z ≤ (b - m)/s)
  = Φ((b - m)/s) - Φ((a - m)/s)

where m and s are the mean and standard deviation of all prices in this cell, respectively. Since we assume all prices are independent given the mean and variance, the number of houses with price between a and b has a binomial distribution with parameters n and p, where n is the number of houses. Now we consider the following cases according to n, np, and n(1 - p).

1. When n ≤ 30, we can use the binomial distribution directly to calculate the confidence interval of the number of houses whose price falls into [a, b], and divide it by n to get the confidence interval for the proportion.
2. When n > 30, np ≥ 5, and n(1 - p) ≥ 5, the proportion of prices that fall in [a, b] approximately has a normal distribution N(p, p(1 - p)/n). Then the 100(1 - α)% confidence interval of the proportion is p ± z_{α/2} √(p(1 - p)/n) = [p1, p2].
3. When n > 30 but np < 5, the Poisson distribution with parameter λ = np is approximately equal to the binomial distribution with parameters n and p. Therefore, we can use the Poisson distribution instead.
4. When n > 30 but n(1 - p) < 5, we can calculate the proportion of houses (X) whose price is not in [a, b] using the Poisson distribution with parameter λ = n(1 - p), and 1 - X is the proportion of houses whose price is in [a, b].

For a cell, if the distribution type is NONE, we can estimate the proportion range [p1, p2] of prices that fall in [a, b] by some distribution-free techniques, such as Chebyshev's inequality [Dev91]:

1. If m ∉ [a, b], then [p1, p2] = [0, min(max(s²/(a - m)², s²/(b - m)²), 1)];
2. If m = a or m = b, then [p1, p2] = [0, 1];
3. If m ∈ (a, b), then [p1, p2] = [max(min(1 - s²/(a - m)², 1 - s²/(b - m)²), 0), 1].
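The proportion estimate for a single cell can be sketched as follows (ours, not the paper's code): the normal case with the large-sample confidence interval of case 2, and the Chebyshev-style bounds used when the distribution type is NONE. The fixed z value and the handling of the remaining cases are assumptions of the sketch.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def proportion_interval(n, m, s, a, b, dist, z=1.96):
    """Return [p1, p2] for the fraction of attribute values in [a, b] in a cell with
    n objects, mean m, and standard deviation s.  Only case 2 above (normal, large n)
    and the distribution-free case are sketched; the binomial/Poisson cases are omitted."""
    if dist == "NORMAL":
        p = phi((b - m) / s) - phi((a - m) / s)
        if n > 30 and n * p >= 5 and n * (1 - p) >= 5:
            half = z * math.sqrt(p * (1 - p) / n)       # 95% interval for z = 1.96
            return [max(p - half, 0.0), min(p + half, 1.0)]
        return [0.0, 1.0]                               # small-sample cases not sketched
    # Unknown distribution (NONE): Chebyshev-style bounds as above.
    if a < m < b:
        lo = max(min(1 - s**2 / (a - m)**2, 1 - s**2 / (b - m)**2), 0.0)
        return [lo, 1.0]
    if m == a or m == b:
        return [0.0, 1.0]
    hi = min(max(s**2 / (a - m)**2, s**2 / (b - m)**2), 1.0)
    return [0.0, hi]
```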

Once we have the confidence interval or the estimated range [p1, p2], we can label this cell as relevant or not relevant. Let S be the area of a cell at the bottom layer. If p2·n < S·c·β%, we label this cell as not relevant; otherwise, we label it as relevant. Each time we finish examining a layer, we go down one level and only examine those cells that form the relevant cells at the higher layer. After we have labeled the cells at the bottom layer, we scan those relevant cells and return those regions formed by at least A/S adjacent relevant cells. This can be done in O(K) time. The above algorithm is summarized in Figure 2.

Statistical Information Grid-based Algorithm:
1. Determine a layer to begin with.
2. For each cell of this layer, we calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.

Figure 2. STING Algorithm

6 Analysis of the STING Algorithm

In the above algorithm, Step 1 takes constant time. Steps 2 and 3 require a constant time for each cell to calculate the confidence interval or estimated proportion range and also a constant time to label the cell as relevant or not relevant. This means that we need constant time to process each cell in Steps 2 and 3. The total time is less than or equal to the total number of cells in our hierarchical structure. Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer. We obtain the factor 1.33 because the number of cells of a layer is always one-fourth of the number of cells of the layer one level lower. So the overall computation complexity on the grid hierarchy structure is O(K). Usually, the number of cells that need to be examined is much smaller, especially when many cells at high layers are not relevant. In Step 8, the time it takes to form the regions is linearly proportional to the number of cells. The reason is that for a given cell, the number of cells that need to be examined is constant, because both the specified density and the granularity can be regarded as constants during the execution of a query, and in turn the distance is also a constant since it is determined by the specified density. Since we assume each cell at the bottom layer usually has several dozens to several thousands of objects, K << N. So, the total complexity is still O(K). Usually, we do not need to do Step 7 and the overall computational complexity is O(K).

In the extreme case that we need to go to Step 7, we still do not need to retrieve all data from the database. Therefore, the time required in this step is still less than linear. So, this algorithm greatly outperforms other approaches.

7 Quality of STING

STING makes use of statistical information to approximate the expected result of a query. Therefore, it could be imprecise since data points can be arbitrarily located. However, under one of the following two conditions, STING can guarantee the accuracy of its result. Let A and c be the minimum area and density specified by the query, respectively. Let R and l be a region satisfying the conditions specified by the query and the side length of a bottom level cell, respectively.

Definition 1. Let F be a region. The width of F is defined as the side length of the maximum square that can fit in F.

1. Let W be the width of R. If W² - 4(⌈W/l⌉ + 1)·l² ≥ A, then R must be returned by STING. The reason is that the square with side length W entirely covers more than W²/l² - 4(⌈W/l⌉ + 1) bottom level cells. Since all these cells will be detected, STING is able to return R.

Definition 2. Let S1 and S2 be two squares. The distance between S1 and S2 is defined as the maximum distance between vertices of S1 and S2.

2. If at least A/l² squares with side length 2l can fit in R and there exists a tree on those squares such that the distance between a parent square and its child is within √(f/(πc)), where f is the small constant set by the system, then R must be returned by STING. The reason is that each of those squares entirely covers at least one bottom level cell. Therefore, STING is able to discover R.

The above are sufficient conditions for STING to return accurate results. However, in most other cases, STING is also able to return correct answers with high confidence. The worst case scenario for STING would be a cluster of points located right at the corners of four cells in the center of the map. We use the following strategy to solve this problem.

1. We make the size of the bottom level cell near zero, such that each bottom level cell contains at most one data point if no two points are collocated. We only instantiate a cell if there is at least one data point in it.
2. We intelligently construct the hierarchical structure such that the number of instantiated cells in a higher layer is at most half of that one level lower.
3. We only keep a certain number of top levels on-line and the remaining layers are kept off-line. If an off-line layer is needed, we can dynamically load it in.

However, users rarely require such precision. Pursuit of this extension is beyond the scope of this paper and will be dealt with in future work.

8 Limiting Behavior of STING is Equivalent to DBSCAN

The regions returned by STING are an approximation of the result of DBSCAN. As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN. In order to compare to DBSCAN, we only use the number of points here, since DBSCAN can only cluster points according to their spatial location (i.e., we do not consider conditions on other attributes). DBSCAN has two parameters: Eps and MinPts. (Usually, MinPts is fixed to k.) In our case, STING has only one parameter: the density c. We set c = (MinPts + 1)/(π·Eps²) = (k + 1)/(π·Eps²) in order to approximate the result of DBSCAN. The reason is that the density of any area inside the clusters detected by DBSCAN is at least (MinPts + 1)/(π·Eps²), since for each core point there are at least MinPts points (excluding itself) within distance Eps. In STING, for each cell, if n < S·c, then we label it as not relevant; otherwise, we label it as relevant, where n and S are the number of points in this cell and the area of a bottom layer cell, respectively. When we form the regions from relevant cells, the examining distance is set to d = max(l, √((k + 1)/(πc))). When the granularity is very small, √((k + 1)/(πc)) becomes the dominant term. As the granularity approaches zero, the area of each cell at the bottom layer goes to zero. So, if there is at least one point in a cell, this cell will be labeled as relevant. Now what we need to do is form the region to be returned according to distance d and density c. We can see that d = √((k + 1)/(πc)) = √((k + 1)·π·Eps²/(π·(k + 1))) = Eps. For each relevant cell, we examine the area around it (within distance d) to see if the density is greater than c. This is equivalent to checking whether the number of points (including itself) within this area is greater than c·π·d² = k + 1. As a result, the result of STING approaches that of DBSCAN when the granularity approaches zero.
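The correspondence above can be summarized in one line (a compact restatement of the same derivation, not new material):

```latex
c = \frac{\mathit{MinPts}+1}{\pi\,\mathit{Eps}^2} = \frac{k+1}{\pi\,\mathit{Eps}^2}
\quad\Longrightarrow\quad
d = \sqrt{\frac{k+1}{\pi c}}
  = \sqrt{\frac{(k+1)\,\pi\,\mathit{Eps}^2}{\pi\,(k+1)}}
  = \mathit{Eps},
\qquad
c\,\pi d^2 = k+1 .
```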

9 Performance

We run several tests to evaluate the performance of STING. The following tests are run on a SPARC 10 machine with the Solaris 2.4 operating system (192 MB memory).

9.1 Performance Comparison of Two Distributions

To obtain performance metrics of STING, we implemented the house-price example discussed in Section 5. Ex1 is the query that we posed. We generated two data sets, both of which have 100,000 data points (houses). The hierarchical structure has seven layers in this test. First, we generate a data set (DS1) such that the price is normally distributed in each cell (with similar means). The hierarchical structure generation time is 9.8 seconds. (Generation needs to be done once for each data set. All the queries for the same data set can use the same structure. Therefore, we do not need to generate it for each query.) It takes STING 0.20 second to answer the query given that the STING structure exists. The expected result and the result returned by STING are in Figures 3a and 3b, respectively.

Figure 3a. Expected result of DS1        Figure 3b. STING's result of DS1

From Figures 3a and 3b, we can see that STING's result is very close to the expected one. In the second data set (DS2), the prices in each bottom layer cell follow a normal distribution (with different means) but they do not follow any known distribution at higher levels. The hierarchical structure generation time is 9.7 seconds. It takes STING 0.2 second to answer the query. The expected result and the result returned by STING are in Figures 4a and 4b, respectively.

Figure 4a. Expected result of DS2        Figure 4b. STING's result of DS2

Once again, we can see that STING's result is very close to the expected one.

9.2 Benchmark Result

Currently, clustering based approaches are an important category of spatial data mining problems. Three extant systems are CLARANS [Ng94], BIRCH [Zha96], and DBSCAN [Est96]. We compare the performance of these three with STING. In the following tests, we only compare the time for clustering. However, if the clustering data is the result of some query, then all other algorithms (other than STING) have at least three phases:

1. Find query response.
2. Build auxiliary structure.
3. Do clustering.

The reported numbers for the other methods do not include the computation of Phase 1, but STING only takes one step to answer the whole query. Therefore, STING actually compares better than the measurements presented here indicate.

We use the benchmark chosen by Ester M. et al. in [Est96], namely SEQUOIA 2000 [Sto93], to compare the performance of STING and other approaches. We successfully ran CLARANS and STING with data sizes between 1252 and 12512. STING has generation time and query time. The generation time is the time consumed to generate the hierarchical structure, and the query time is the time used to answer a specific query. In this test, the STING hierarchy structure has six layers. Due to unavailability of the DBSCAN source code, we are unable to run this algorithm. We discovered that CLARANS is approximately 15 times faster in our configuration than in the configuration specified in [Est96] for all data sizes. We estimate that DBSCAN also runs roughly 15 times faster and show the estimated running time of DBSCAN in the following table as a function of point set cardinality. All times are in units of seconds.

Number of Points       1256    2503    3910    5213    6256   12512
CLARANS                  49     200     457     785    1238    5538
DBSCAN (projected)      0.2     0.4     0.7     1.0     1.2    2.86
STING (query)          0.10    0.11    0.11    0.12    0.12    0.14
STING (generation)     1.25    1.32    1.40    1.48    1.55    1.62

Table 2: Performance tests for CLARANS, DBSCAN, and STING

Furthermore, BIRCH outperforms CLARANS by about 20 to 30 times [Zha96]. So STING will also outperform BIRCH by a very large margin. We plot the query response time for DBSCAN and STING in Figure 5 because DBSCAN is the fastest one among all existing algorithms.

Figure 5. Performance Comparison between STING and DBSCAN (query response time in seconds versus number of points)

10 Conclusion

In this paper, we present a statistical information grid-based approach to spatial data mining. It has much lower computational cost than other approaches. The I/O cost is low since we can usually keep the STING data structure in memory. Both of these will speed up the processing of spatial data queries tremendously. In addition, it offers us an opportunity for parallelism (STING is trivially parallelizable). All these advantages benefit from the hierarchical structure of grid cells and the statistical information associated with them.

References

[Che97] M. S. Chen, J. Han, and P. S. Yu. Data mining: an overview from a database perspective. To appear in IEEE Transactions on Knowledge and Data Engineering, 1997.

[Dev91] J. L. Devore. Probability and Statistics for Engineering and the Sciences, 3rd edition. Brooks/Cole Publishing Company, Pacific Grove, California, 1991.

[Est95] M. Ester, H. P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. Proc. 4th Int. Symp. on Large Spatial Databases (SSD'95), pp. 67-82, Portland, Maine, August 1995.

[Est96] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, Portland, OR, USA, August 1996.

[Fay96a] U. Fayyad, G. P.-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, Vol. 17, No. 3, pp. 37-54, Fall 1996.

[Fay96b] U. Fayyad, G. P.-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA, 1996.

[Fot94] S. Fotheringham and P. Rogerson. Spatial Analysis and GIS. Taylor and Francis, 1994.

[Kno96] E. M. Knorr and R. Ng. Extraction of spatial proximity patterns by concept generalization. Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD-96), pp. 347-350, Portland, OR, USA, August 1996.

[Kop96a] K. Koperski, J. Adhikary, and J. Han. Spatial data mining: progress and challenges. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, June 1996.

[Kop96b] K. Koperski and J. Han. Data mining methods for the analysis of large geographic databases. Proc. 10th Annual Conf. on GIS, Vancouver, Canada, March 1996.

[Lu93] W. Lu, J. Han, and B. C. Ooi. Discovery of general knowledge in large spatial databases. Proc. Far East Workshop on Geographic Information Systems, pp. 75-89, Singapore, June 1993.

[Ng94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. Proc. 1994 Int. Conf. Very Large Data Bases, pp. 144-155, Santiago, Chile, September 1994.

[Sam90] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.

[Sto93] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The SEQUOIA 2000 storage benchmark. Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pp. 2-11, Washington, DC, 1993.

[Zha96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pp. 103-114, Montreal, Canada, June 1996.

Appendix

The following is the specification of our extended SQL in BNF notation.

<query> ::= <region-query> | <object-query> | <func-query>
<region-query> ::= SELECT REGION FROM <from-clause> WHERE <region-conds>
<object-query> ::= SELECT object FROM <from-clause> WHERE <object-conds>
<attr-query> ::= SELECT <attr-funcs> FROM <from-clause> WHERE <attr-conds>
<from-clause> ::= <relations> | <classes>
<relations> ::= relation-name | relation-name, <relations>
<classes> ::= class-name | class-name, <classes>
<region-conds> ::= <region-cond> | <region-cond> AND <region-conds>
<region-cond> ::= <density> | <func> | <area> | <location> | <confidence>
<object-conds> ::= <object-cond> | <object-cond> AND <object-conds>
<object-cond> ::= <obj-func> | <location>
<attr-funcs> ::= <attr-func> | <attr-func>, <attr-funcs>
<attr-func> ::= attr-name | <stat-func>(attr-name)
<stat-func> ::= MAX | MIN | RANGE | AVERAGE | SUM | COUNT | ...
<func-conds> ::= <region-conds> | <object-conds>
<density> ::= DENSITY IN <left-paren>number, number<right-paren>
<func> ::= <obj-func> [WITH PERCENT <left-paren>percentage, percentage<right-paren>]
<obj-func> ::= <attr-func> RANGE <left-paren>number, number<right-paren>
<area> ::= AREA <left-paren>number, number<right-paren>
<location> ::= LOCATION <namelist> | LOCATION <polygonlist>
<confidence> ::= WITH CONFIDENCE percentage
<namelist> ::= name | name; <namelist>
<polygonlist> ::= <polygon> | <polygon>; <polygonlist>
<polygon> ::= <points>
<points> ::= <point> | <point>, <points>
<point> ::= (coordinate, coordinate)
<left-paren> ::= [ | (
<right-paren> ::= ] | )