SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS

J.H. Guan, F.B. Zhu, F.L. Ban
School of Computer, Spatial Information & Digital Engineering Center, Wuhan University, Wuhan, 430079, China, jhguan@wtusm.edu.cn

TS, WG II/6

KEY WORDS: GIS, Analysis, Data Mining, Visualization, Algorithms, Dynamic, Multi-resolution, Spatial

ABSTRACT:

Clustering can be applied to many fields, including data mining, statistical data analysis, pattern recognition and image processing. In the past decade, many efficient and effective clustering algorithms have been proposed; well-known algorithms contributed by the database community include CLARANS, BIRCH, DBSCAN, CURE, STING, CLIQUE and WaveCluster. All of these algorithms address the problem of handling huge amounts of data in large-scale databases. In this paper, we propose a scalable and visualization-oriented clustering algorithm for exploratory spatial analysis (CAESA). The context of our research is 2D spatial data analysis, but the method can be extended to higher-dimensional spaces. Here, "scalable" means that our algorithm can run focus-changing clustering efficiently, and "visualization-oriented" indicates that the algorithm adapts to the visualization situation, that is, it chooses an appropriate clustering granularity automatically according to the current visualization resolution. Experimental results show that our algorithm is effective and efficient.

1. INTRODUCTION

Clustering, the task of grouping the data of a database into meaningful subclasses so that intra-subclass differences are minimized and inter-subclass differences are maximized, is one of the most widely studied problems in the data mining field. Clustering techniques have many application areas, such as spatial analysis, pattern recognition, image processing and other business applications, to name a few. In the past decade, many efficient and effective clustering algorithms have been proposed; well-known algorithms contributed by the database community include CLARANS, BIRCH, DBSCAN, CURE, STING, CLIQUE and WaveCluster. All of these algorithms address the clustering problem of handling huge amounts of data in large-scale databases. However, current clustering algorithms are designed to cluster a given dataset in a "once and for all" fashion. We refer to this kind of clustering as global clustering. In reality, taking exploratory data analysis as an example, the user may first want to see a global view of the processed dataset, then her/his interest may shift to a smaller part of the dataset, and so on. This process implies a series of consecutive clustering operations: first on the whole dataset, then on a smaller part of it, and so on. Certainly, the process can also run in the opposite direction, that is, the user's focus scope shifts from a smaller area to a larger one. We refer to this kind of clustering operation as focus-changing clustering. A naive implementation of focus-changing clustering is to cluster the focused data from scratch each time, in the fashion of global clustering. Obviously, such an approach is time-consuming and inefficient. A better solution is to design a clustering algorithm that carries out focus-changing clustering in an integrated framework. On the other hand, although visualization has been recognized as an effective tool for exploratory data analysis and some visual clustering approaches have been reported in the literature, this research has focused on proposing new methods of visualizing clustering results so that users can view the internal structure of the processed dataset more concretely and directly.
However, little attention has been paid to the impact of visualization on the clustering process itself. To make this idea clear, consider 2-dimensional spatial data clustering. Clustering can be seen as a process of data generalization on the basis of a certain underlying data granularity. The lowest clustering granularity of a dataset is the individual data objects in the dataset. If we divide the data space into rectangular cells of similar size, then a larger clustering granularity is the set of data objects enclosed in a rectangular cell. More precisely, we define the clustering granularity as the size of the divided rectangular cell in the horizontal or vertical direction, and the relative clustering granularity as the ratio of the clustering granularity to the extent of the focused dataset in the same direction. Visualization of clustering results is also based on the clustering granularity. However, the visualization effect depends on the resolution of the display device. For comparison, we define the relative visualization resolution as the inverse of the size (measured in pixels) of the visualization window for the clustering results, taken in the same direction as the clustering granularity. Obviously, a reasonable choice is a relative clustering granularity that is close to, but not lower than, the relative visualization resolution of the visualization window. Otherwise, the clustering results cannot be visualized completely, which means much of the clustering work has no visible effect in the visualization.
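To make the relationship between these two quantities concrete, the following Python sketch picks the smallest cell size whose relative clustering granularity is not lower than the relative visualization resolution, as the rule above requires. The function and parameter names (choose_granularity, focus_extent, window_pixels, min_granularity) are illustrative assumptions, not terms from the paper.

import math

def choose_granularity(focus_extent, window_pixels, min_granularity):
    """Pick a clustering granularity for one direction (x or y).

    focus_extent    -- extent of the focused dataset in data units
    window_pixels   -- size of the visualization window in pixels
    min_granularity -- bottom-level cell size in data units

    The relative visualization resolution is 1 / window_pixels and the
    relative clustering granularity is cell_size / focus_extent.  We return
    the smallest multiple of min_granularity whose relative granularity is
    not lower than the relative resolution.
    """
    target_cell = focus_extent / window_pixels              # cell size at which the two ratios are equal
    n = max(1, math.ceil(target_cell / min_granularity))    # granularity must be a multiple of the minimum
    return n * min_granularity

# Example: a 10000-unit-wide focus area drawn in an 800-pixel window,
# with a bottom-level cell size of 5 units, calls for 15-unit cells.
print(choose_granularity(10000.0, 800, 5.0))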

In this paper, we propose CAESA, a scalable and visualization-oriented clustering algorithm for exploratory spatial analysis, which combines ideas from DBSCAN and STING. We built a prototype to demonstrate the feasibility of the proposed algorithm. Experimental results indicate that our algorithm is effective and efficient: CAESA needs fewer region queries than DBSCAN and is more flexible than STING.

2. RELATED WORK

In recent years, a number of clustering algorithms for large databases or data warehouses have been proposed. Basically, there are four types of clustering algorithms: partitioning algorithms, hierarchical algorithms, density-based algorithms and grid-based algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters, where k is an input parameter. Hierarchical algorithms create a hierarchical decomposition of D represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object; in such a hierarchy, each node of the tree represents a cluster of D. The dendrogram can be created either from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. Ester et al. (1996) developed the clustering algorithm DBSCAN, which is based on a density-based notion of clusters and is designed to discover clusters of arbitrary shape. The key idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. DBSCAN can effectively handle noise points (outliers). The CLIQUE clustering algorithm (Agrawal et al., 1998) identifies dense clusters in subspaces of maximum dimensionality. It partitions the data space into cells and approximates the density of the data points by counting the number of points in each cell. Clusters are unions of connected high-density cells within a subspace. CLIQUE generates cluster descriptions in the form of DNF expressions. Sheikholeslami et al. (1998) exploited the multi-resolution property of wavelets and proposed the WaveCluster method, which partitions the data space into cells and applies the wavelet transform to them. WaveCluster can detect arbitrary-shape clusters at different degrees of detail. Wang et al. (1997, 1999) proposed a statistical information grid-based method (STING, STING+) for spatial data mining. It divides the spatial area into rectangular cells using a hierarchical structure and stores the statistical parameters of all numerical attributes of the objects within the cells. STING uses multiple resolutions to perform cluster analysis, and the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the processing cost increases substantially; if the bottom level of the grid structure is too coarse, the quality of the cluster analysis may suffer. Moreover, STING does not consider the spatial relationship between the children and their neighboring cells when constructing a parent cell. As a result, the shapes of the resulting clusters are isothetic, that is, all cluster boundaries are either horizontal or vertical and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters despite the fast processing time of the technique. However, all the algorithms described above share a common drawback: they are designed to cluster a given dataset in a "once and for all" fashion.
Although methods have been proposed for visualizing clustering results so that users can view the internal structure of the processed dataset more concretely and directly, little attention has been paid to the impact of visualization on the clustering process itself. Different from the algorithms mentioned above, the CAESA algorithm proposed in this paper is a scalable and visualization-oriented clustering algorithm for exploratory spatial analysis. Here, "scalable" means that our algorithm can run focus-changing clustering effectively and efficiently, and "visualization-oriented" indicates that the algorithm adapts to the visualization situation, that is, it chooses an appropriate relative clustering granularity automatically according to the current relative visualization resolution.

3. SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING ALGORITHM

3.1 CAESA Overview

Like STING, CAESA is a grid-based multi-resolution approach in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a higher level is partitioned into a number of cells at the next lower level. Statistical information associated with the attributes of each grid cell (such as the mean, standard deviation and distribution) is calculated and stored beforehand and is used to answer the user's queries.

Figure 1: Hierarchical structure (levels 1, 2 and 3)

Figure 1 shows the hierarchical structure used for CAESA clustering. Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. For each cell, there are attribute-dependent and attribute-independent parameters. The attribute-independent parameters are:

n - number of objects (points) in this cell
cellno - number of this cell, assigned when the tree structure is built
layerno - layer number of this cell

The attribute-dependent parameters are:

m - mean of all values in this cell, where a value is the distance between two objects in this cell
s - standard deviation of all values of the attribute in this cell
d - density of all values in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute values in this cell follow (such as normal, uniform or exponential; NONE is assigned if the distribution type is unknown)
cellloc - the location of this cell; it records the coordinate information associated with the objects in the data set
relevance - coefficient expressing how relevant this cell is to a given query

In our algorithm, the parameters of a higher-level cell can easily be calculated from the parameters of its lower-level cells. Let n_i, m_i, s_i, d_i, min_i, max_i, cellloc_i, layerno_i and dist_i be the parameters of the corresponding lower-level cells at level l. The parameters of the current (parent) cell at level l-1, namely n_{l-1}, m_{l-1}, s_{l-1}, d_{l-1}, min_{l-1}, max_{l-1}, cellloc_{l-1}, layerno_{l-1} and dist_{l-1}, can be calculated as follows (a small sketch reproducing this computation is given at the end of this subsection):

n_{l-1} = \sum_i n_i                                              (1)
d_{l-1} = \sum_i d_i n_i / n_{l-1}                                (2)
m_{l-1} = \sum_i m_i n_i / n_{l-1}                                (3)
s_{l-1} = \sqrt{ \sum_i (s_i^2 + m_i^2) n_i / n_{l-1} - m_{l-1}^2 }   (4)
min_{l-1} = \min_i ( min_i )                                      (5)
max_{l-1} = \max_i ( max_i )                                      (6)
layerno_{l-1} = layerno_l - 1                                     (7)
cellloc_{l-1} = \min_i ( cellloc_i )                              (8)

A new cellno is assigned to the parent cell when the tree structure is built. The calculation of dist_{l-1} is more complicated; here we adopt the method designed by Wang et al. (1997) for STING.

Cell        1         2         3         4
n           80        10        90        60
m           18.3      17.8      18.1      18.5
d           0.8       0.5       0.7       0.2
s           1.8       1.6       2.1       1.9
min         2.3       5.8       1.2       4.2
max         27.4      54.3      62.8      48.5
layerno     5         5         5         5
cellno      85        86        87        88
cellloc     (80,60)   (90,60)   (90,60)   (90,60)
dist        NORMAL    NONE      NORMAL    NORMAL

Table 1: Parameters of the lower-level cells

According to the formulas presented above, we can easily calculate the parameters of the current cell from the values in Table 1:
n_{l-1} = 240, m_{l-1} = 18.254, s_{l-1} = 1.943, d_{l-1} = 0.6, min_{l-1} = 1.2, max_{l-1} = 62.8, cellloc_{l-1} = (80,60), layerno_{l-1} = 4, dist_{l-1} = NORMAL.

The advantages of this approach are:
1. It is a query-independent approach, since the statistical information stored in each cell summarizes the data in the cell independently of any query.
2. The computational complexity is O(k), where k is the number of cells at the lowest level. Usually k << N, where N is the number of objects.
3. The cell structure facilitates parallel processing and incremental updating.
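As promised above, here is a minimal Python sketch of the parent-cell computation, under the assumption that each cell is represented as a plain dictionary with the fields listed in Section 3.1; the function name aggregate_parent and the dictionary layout are illustrative, not taken from the paper. Running it on the four cells of Table 1 reproduces the values computed above (n = 240, m = 18.254, s = 1.943, d = 0.6, and so on).

import math

def aggregate_parent(children):
    """Compute the parameters of a parent cell from its child cells
    according to equations (1)-(8); the distribution type is omitted
    (the paper adopts STING's method for it)."""
    n = sum(c["n"] for c in children)                                  # (1)
    d = sum(c["d"] * c["n"] for c in children) / n                     # (2)
    m = sum(c["m"] * c["n"] for c in children) / n                     # (3)
    s = math.sqrt(sum((c["s"] ** 2 + c["m"] ** 2) * c["n"]
                      for c in children) / n - m ** 2)                 # (4)
    return {
        "n": n, "d": d, "m": m, "s": s,
        "min": min(c["min"] for c in children),                        # (5)
        "max": max(c["max"] for c in children),                        # (6)
        "layerno": children[0]["layerno"] - 1,                         # (7)
        "cellloc": min(c["cellloc"] for c in children),                # (8)
    }

# The four lower-level cells of Table 1.
cells = [
    {"n": 80, "m": 18.3, "d": 0.8, "s": 1.8, "min": 2.3, "max": 27.4,
     "layerno": 5, "cellloc": (80, 60)},
    {"n": 10, "m": 17.8, "d": 0.5, "s": 1.6, "min": 5.8, "max": 54.3,
     "layerno": 5, "cellloc": (90, 60)},
    {"n": 90, "m": 18.1, "d": 0.7, "s": 2.1, "min": 1.2, "max": 62.8,
     "layerno": 5, "cellloc": (90, 60)},
    {"n": 60, "m": 18.5, "d": 0.2, "s": 1.9, "min": 4.2, "max": 48.5,
     "layerno": 5, "cellloc": (90, 60)},
]
print(aggregate_parent(cells))   # n=240, m~18.254, s~1.943, d=0.6, layerno=4, ...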

3.2 Focus-changing Clustering

Focus-changing clustering means that CAESA can provide the user with the corresponding information when the part of the dataset she/he is interested in changes. Taking exploratory data analysis as an example, the user may first want to see a global view of the processed dataset, and then her/his interest may turn to a smaller part of the dataset to see more detail, and so on. To fulfil focus-changing clustering, a simple method is to cluster the focused data from scratch each time, in the fashion of current clustering algorithms. Obviously, such an approach is time-consuming and inefficient. A better solution is to design a clustering algorithm that carries out focus-changing clustering in an integrated framework; clustering time and I/O cost are thereby reduced, and clustering flexibility is enhanced. For example, suppose the user wants to select the maximal regions that have at least 100 houses per unit area, in which at least 70% of the house prices are above $200,000, and whose total area is at least 100 units, with 90% confidence. We can describe this query in an SQL-like form:

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (200000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9;

After getting this information, perhaps she/he wants to see more detailed information, e.g., the maximal sub-regions that have at least 50 houses per unit area, in which at least 85% of the house prices are above $350,000, and whose total area is at least 80 units, with 85% confidence. With the following SQL-like query we can get the information we need from the original dataset; note that we need not scan the whole original dataset again:

SELECT SUB-REGION
FROM house-map
WHERE DENSITY IN (50, ∞)
AND price RANGE (350000, ∞) WITH PERCENT (0.85, 1)
AND AREA (80, ∞)
AND WITH CONFIDENCE 0.85;

To complete this task, we first recalculate the clustering granularity according to the distribution of data objects in the associated dataset, then divide the data space into rectangular cells whose size equals the minimum clustering granularity, and finally count the data-object density for each cell and store the density value with the cell.

3.3 Visualization of the Clustering Process

Although visualization has been recognized as an effective tool for exploratory data analysis and some visual clustering approaches have been reported in the literature, this research has focused on proposing new methods of visualizing clustering results so that users can view the internal structure of the processed dataset more concretely and directly. However, little attention has been paid to the impact of visualization on the clustering process itself. To make this idea clear, consider 2-dimensional spatial data clustering. Clustering can be seen as a process of data generalization on the basis of a certain underlying data granularity. The lowest clustering granularity of a dataset is the individual data objects in the dataset. If we divide the data space into rectangular cells of similar size, then a larger clustering granularity is the set of data objects enclosed in a rectangular cell. More precisely, we define the clustering granularity as the size of the divided rectangular cell in the horizontal or vertical direction, and the relative clustering granularity as the ratio of the clustering granularity to the extent of the focused dataset in the same direction. Visualization of clustering results is also based on the clustering granularity. However, the visualization effect depends on the resolution of the display device. For comparison, we define the relative visualization resolution as the inverse of the size (measured in pixels) of the visualization window for the clustering results, taken in the same direction as the clustering granularity. Obviously, a reasonable choice is a relative clustering granularity that is close to, but not lower than, the relative visualization resolution of the visualization window. Otherwise, the clustering results cannot be visualized completely, which means much of the clustering work has no visible effect in the visualization.

Noise consists of random disturbances that reduce the clarity of clusters. In our algorithm, we can easily find noise and remove it by locating the cells with very low density and eliminating the data points in them; a small sketch of this filtering step is given at the end of this subsection. This reduces the influence of noise on both effectiveness and running time. Unlike noise, outliers are not evenly distributed: outliers are data points that lie away from the clusters and have a much smaller scale than the clusters, so they are not merged into any cluster. When the algorithm finishes, the subclusters that have a rather small scale are the outliers.

Figure 2: Expected result
Figure 3: CAESA's result
Figure 4: CAESA's result after the user's focus changed
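As a rough illustration of the noise-removal step described above, the following Python sketch drops every bottom-level cell whose density falls below a fraction of the average density and discards the points assigned to it. The threshold factor, the cell dictionary layout and the function name filter_noise_cells are assumptions made for the example, not definitions from the paper.

def filter_noise_cells(cells, noise_factor=0.1):
    """Return the cells kept for clustering, discarding low-density cells.

    cells        -- list of dicts, each with a 'density' value and the
                    list of 'points' that fall inside the cell
    noise_factor -- a cell is treated as noise if its density is below
                    noise_factor * average density (illustrative choice)
    """
    if not cells:
        return []
    avg_density = sum(c["density"] for c in cells) / len(cells)
    threshold = noise_factor * avg_density
    kept = [c for c in cells if c["density"] >= threshold]
    dropped = sum(len(c["points"]) for c in cells if c["density"] < threshold)
    print(f"removed {dropped} noise points from low-density cells")
    return kept

# Example: the sparse second cell is dropped as noise.
cells = [
    {"density": 0.8, "points": [(1, 2), (1, 3)]},
    {"density": 0.01, "points": [(9, 9)]},
]
print(len(filter_noise_cells(cells)))   # 1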

4. ALGORITHM

In this section we introduce the CAESA algorithm. First, we set the minimum clustering granularity according to the distribution of data objects in the whole dataset of interest; then we divide the data space into rectangular cells whose size equals the minimum clustering granularity; after that, we count the data-object density for each cell and store the density value with the cell. Here we adopt a cluster definition similar to that of grid-based clustering algorithms. CAESA completes a given query in two steps: it calculates the statistical information and stores it in the quad-tree, and then it takes the user's query and outputs the result. First we create a tree structure to keep the information about the dataset; two parameters must be supplied to set up the tree. One parameter is the dataset to be placed in the hierarchical structure; the other is the number of desired cells at the lowest level. Building the quad-tree takes two steps: calculate the statistical information of the bottom-layer cells directly from the given dataset, then calculate the statistical information of the higher-layer cells from the lower-layer cells. We only need to go through the dataset once in order to calculate the parameters associated with the grid cells at the bottom level, and the remaining computation time is nearly proportional to the number of cells. That is, the query processing complexity is O(k) rather than O(N), where k is the number of cells. We analyze the performance in more detail in later sections. The process of creating the tree structure is as follows:

Input:
  D  // data to be placed in the hierarchical structure
  k  // number of desired cells at the lowest level
Output:
  T  // tree

CAESA BUILD Algorithm
// Create the empty tree from top to bottom
T = root node with data values initialized;  // initially only the root node
i = 1;
repeat
  for each node in level i do
    create 4 children nodes with initial values;
  end for
  i = i + 1;
until 4^i = k;
// Populate the tree from the bottom up
for each item in D do
  determine the leaf node j associated with the position of the item;
  update the values of j based on the attribute values of the item;
end for
i = log4(k);
repeat
  i = i - 1;
  for each node j in level i do
    update the values of j based on the attribute values of its 4 children;
  end for
until i = 1;
End of CAESA BUILD Algorithm

Figure 5: The CAESA BUILD algorithm (a small Python sketch of this construction is given at the end of this section)

Once the user changes her/his focus scope, the clustering granularity is adjusted based on the criterion that the relative clustering granularity should be close to, but not lower than, the relative visualization resolution of the visualization window. With the new clustering granularity, the focused data space is re-divided, and a grid-based clustering algorithm is then used to cluster the focused data. Note that the new clustering granularity must be n times the minimum clustering granularity, where n is a positive natural number. Thanks to the information stored with each cell (such as the coordinate location, the relevance coefficient, etc.), we can easily use the former result to answer the new query. The process of query answering is as follows:

Input:
  T  // tree
  Q  // query
Output:
  R  // regions of relevant cells

CAESA QUERY Algorithm
QUEUE q;
i = 1;
repeat
  for each cell j in level i do
    if cell j is relevant to Q then
      mark this cell as such;
      q.add(j);  // add cell j to the queue so that its neighbors can be examined
    end if
  end for
  i = i + 1;  // go on with the next level
until all layers in the tree have been visited;
for each cell c that is a neighbor of a cell in q and is not itself in q do
  if density(c) > M_DENSITY then
    identify neighboring cells of relevant cells to create regions of cells;
  end if
end for
D1 = R.data();
calculate the new clustering granularity k1;
call the CAESA BUILD algorithm with parameters D1, k1;
End of CAESA QUERY Algorithm

Figure 6: The CAESA QUERY algorithm

To handle very large databases, the algorithm works on units instead of the original data objects. To obtain these units, partitioning is first done to allocate data points into cells, and statistical information is obtained for each cell. Then we test whether a cell qualifies as a unit. The definition of a unit is as follows: the density of a unit is greater than a certain predefined threshold. Therefore, we determine whether a cell is a unit by its density: if the density of a cell is M (M >= 1) times the average density of all data points, we regard the cell as a unit. There may be cells of low density that are in fact part of a final cluster but are not included in any unit; we treat such data points as separate sub-clusters. This method greatly reduces the time complexity, as will be shown in the experiments.
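The following Python sketch mirrors the structure of the CAESA BUILD procedure for 2-D points: it fills a bottom-level grid in a single pass over the data and then aggregates each group of 2x2 cells into a parent cell, level by level. Only the object count n is aggregated here to keep the example short, and the function name build_tree and the grid layout are illustrative assumptions rather than the paper's implementation.

import math

def build_tree(points, k, extent):
    """Build a list of grid levels, from the bottom level (k cells) upward.

    points -- list of (x, y) tuples
    k      -- number of cells at the lowest level (a power of 4)
    extent -- (width, height) of the data space, with origin at (0, 0)
    """
    side = int(math.sqrt(k))                     # cells per side at the bottom level
    cell_w, cell_h = extent[0] / side, extent[1] / side

    # One pass over the data to fill the bottom-level counts (cf. CAESA BUILD).
    counts = [[0] * side for _ in range(side)]
    for x, y in points:
        col = min(int(x / cell_w), side - 1)
        row = min(int(y / cell_h), side - 1)
        counts[row][col] += 1

    levels = [counts]                            # levels[0] is the bottom level
    while side > 1:                              # aggregate 2x2 blocks upward
        side //= 2
        child = levels[-1]
        parent = [[0] * side for _ in range(side)]
        for r in range(side):
            for c in range(side):
                parent[r][c] = (child[2 * r][2 * c] + child[2 * r][2 * c + 1] +
                                child[2 * r + 1][2 * c] + child[2 * r + 1][2 * c + 1])
        levels.append(parent)
    return levels                                # levels[-1] is the single root cell

# Example: 6 points in a 100 x 100 space with 16 bottom-level cells.
pts = [(5, 5), (6, 7), (95, 95), (96, 92), (50, 50), (51, 49)]
tree = build_tree(pts, 16, (100.0, 100.0))
print(tree[0])      # bottom-level counts
print(tree[-1])     # root level: [[6]]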
The clusters dentfed n ths algorthm are denoted by the representatve ponts. Usually, user may want to know the detaled nformaton about the clusters and the data ponts they nclude. For each data pont, we need not test whch cluster t belongs to. Instead, we need only fnd the cluster a cell belongng to, and all data ponts n the cell belong to such cluster. Ths process can also mprove the speed of the whole clusterng process. 5. PERFORMANCE EVALUATION We conduct several experments to evaluate the performance of the proposed algorthm. The followng experments are run on a PC machne wth Wn2000 operatng system (256M memory). In CAESA, we only need to go through the data set once n order to calculate the parameters assocated wth the grd cells at the bottom level, the overhead of CAESA s lnearly proportonal to the number of objects wth a small constant factor.

5. PERFORMANCE EVALUATION

We conducted several experiments to evaluate the performance of the proposed algorithm. The experiments were run on a PC with the Windows 2000 operating system and 256 MB of memory. In CAESA, we only need to go through the data set once in order to calculate the parameters associated with the grid cells at the bottom level, so the overhead of CAESA is linearly proportional to the number of objects with a small constant factor.

Figure 7: Overhead comparison between STING and CAESA when the user changes her/his focus (time in seconds versus the number of cells at the bottom layer, 0 to 8000)

To measure the performance of CAESA, we implemented the house-price example discussed in Section 3.2. We generated 2400 data points (houses), and the hierarchical structure has 6 layers in this test. Figure 2 shows the expected result, Figure 3 shows the result of our algorithm, and Figure 4 shows the result after the user changes her/his focus following the previous query. The experimental results show that our algorithm is valid and performs well.

6. CONCLUSION

In this paper, we proposed a scalable and visualization-oriented clustering algorithm for exploratory spatial analysis. It achieves high performance at low computational cost: it runs focus-changing clustering efficiently and adapts to the visualization situation, that is, it chooses an appropriate relative clustering granularity automatically according to the current relative visualization resolution. We built a prototype to demonstrate the practicability of the proposed algorithm. Experimental results show that our algorithm is effective and efficient.

REFERENCES

Agrawal, R., Gehrke, J., Gunopulos, D. et al., 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD, pp. 94-105.

Ester, M., Kriegel, H.-P., Sander, J. et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of 2nd Int. Conf. on Knowledge Discovery and Data Mining, pp. 226-231.

Guha, S., Rastogi, R. and Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD, pp. 73-84.

Halkidi, M., Batistakis, Y. and Vazirgiannis, M., 2001. Clustering algorithms and validity measures. In Proc. of 13th Intl. Conf. on Scientific and Statistical Database Management, pp. 3-22.

Han, J. and Kamber, M., 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.

Hand, D., Mannila, H. and Smyth, P., 2001. Principles of Data Mining. MIT Press.

Hinneburg, A., Keim, D. and Wawryniuk, M., 2003. Using projections to visually cluster high-dimensional data. IEEE Computational Science and Engineering, 5(2), pp. 14-25.

Ho, T.K., 2002. Exploratory analysis of point proximity in subspaces. In Proc. of 16th Intl. Conf. on Pattern Recognition, Vol. 2, pp. 196-199.

Hu, B. and Marek-Sadowska, M., 2004. Fine granularity clustering-based placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(4), pp. 527-536.

Matsushita, M. and Kato, T., 2001. Interactive visualization method for exploratory data analysis. In Proc. of 15th Intl. Conf. on Information Visualization, pp. 671-676.

Memarsadeghi, N. and O'Leary, D.P., 2003. Classified information: the data clustering problem. IEEE Computational Science and Engineering, 5(5), pp. 54-60.

Ng, K.K.E. and Fu, A.W.-C., 2002. Efficient algorithm for projected clustering. In Proc. of ICDE, p. 273.

Sheikholeslami, G., Chatterjee, S. and Zhang, A., 1998. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In Proc. of 24th VLDB, pp. 428-439.

Wang, L., Cheung, W.W., Mamoulis, N. et al., 2004. An efficient and scalable algorithm for clustering XML documents by structure. IEEE Transactions on Knowledge and Data Engineering, 16(1), pp. 82-96.

Wang, W., Yang, J. and Muntz, R., 1997. STING: a statistical information grid approach to spatial data mining. In Proc. of VLDB, pp. 186-195.

Wang, W., Yang, J. and Muntz, R., 1999. STING+: an approach to active spatial data mining. In Proc. of ICDE, pp. 116-125.

ACKNOWLEDGEMENTS

This work is supported by the Hi-Tech Research and Development Program of China under grant No. 2002AA135340, the Open Researches Fund Program of SKLSE under grant No. SKL(4)003, an IBM Research Award, and the Open Researches Fund Program of LIESMARS under grant No. SKL(01)0303.