A Deflected Grid-based Algorithm for Clustering Analysis

NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO
Department of Computer Science and Information Engineering
Tamkang University
151 Ying-chuan Road, Tamsui, Taipei County
TAIWAN, R.O.C.
nancylin@mail.tku.edu.tw, taftdc@mail.tku.edu.tw, 8909034@s90.tku.edu.tw, chenh@mail.su.edu.tw, 88990@s89.tku.edu.tw

Abstract: - The grid-based clustering algorithm, which partitions the data space into a finite number of cells to form a grid structure and then performs all clustering operations on this grid structure, is an efficient clustering algorithm, but its effect is seriously influenced by the size of the cells. To cluster efficiently and, at the same time, reduce the influence of the cell size, a new grid-based clustering algorithm, called DGD, is proposed in this paper. The main idea of the DGD algorithm is to deflect the original grid structure in each dimension of the data space after the clusters generated from this original structure have been obtained. The deflected grid structure can be considered a dynamic adjustment of the size of the original cells, and thus the clusters generated from this deflected grid structure can be used to revise the originally obtained clusters. The experimental results verify that the effect of the DGD algorithm is indeed less influenced by the size of the cells than that of other grid-based algorithms.

Key-Words: - Data Mining, Clustering Algorithm, Grid-based Clustering, Significant Cell, Grid Structure

1 Introduction
Clustering analysis, which groups data points into clusters, has recently become an important task in data mining. Unlike classification, which analyzes labeled data, clustering analysis deals with data points without consulting previously known labels. In general, data points are grouped only on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Clusters are thus formed so that data points within a cluster are highly similar to each other but very dissimilar to the data points in other clusters.
Up to now, many clustering algorithms have been proposed [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], and generally, the so-called grid-based algorithms are the most computationally efficient ones. The main procedure of a grid-based clustering algorithm is to partition the data space into a finite number of cells to form a grid structure, then find the significant cells whose densities exceed a predefined threshold, and finally group nearby significant cells into clusters. Clearly, the grid-based algorithm performs all clustering operations on the generated grid structure; therefore, its time complexity depends only on the number of cells in each dimension of the data space. That is, if the number of cells in each dimension can be kept small, the time complexity of the grid-based algorithm will be low. Some famous grid-based clustering algorithms are STING [11], WaveCluster [12], and CLIQUE [13].

As mentioned above, the grid-based clustering algorithm is efficient, but its effect is seriously influenced by the size of the cells (or the value of the predefined threshold). If the cells are small, many cells must be connected to form one cluster, and there are more connections between cells. When cells are connected, the number of data points in a cell is the major factor that decides whether cells are connected or disconnected, so the more cells there are, the stronger this effect becomes. And within the same data space, the more cells there are, the smaller each cell is.

To cluster data points efficiently and to reduce the influence of the cell size at the same time, a new grid-based clustering algorithm, called DGD, is proposed here. The main idea of the DGD algorithm is to deflect the original grid structure in each dimension of the data space after the clusters generated from the original grid structure have been obtained. The deflected grid structure is then used to find the new significant cells. Next, the nearby significant cells are again grouped to form new clusters.
Finally, these newly generated clusters are used to revise the originally generated clusters.

The rest of the paper is organized as follows: In Section 2, some famous grid-based clustering algorithms are introduced. In Section 3, the proposed clustering algorithm, the DGD algorithm, is presented. In Section 4, experiments and discussions are presented. The conclusions are given in Section 5.

ISSN: 1109-2750, Issue 3, Volume 7, March 2008

2 Grid-based Clustering Algorithm
In this section, two popular grid-based clustering algorithms, STING [11] and CLIQUE [13], are introduced.

STING (Statistical Information Grid-based algorithm) (Wang et al., 1997) exploits the clustering properties of index structures. It employs a hierarchical grid structure and uses longitude and latitude to divide the data space into rectangular cells. STING first selects a layer to begin with. Each cell of this layer is labeled as relevant if its confidence interval of probability is higher than the threshold. The algorithm then goes down the hierarchy one level at a time, checking whether the cells are relevant, until the bottom level is reached. It returns the regions that meet the requirement of the query and finally retrieves the data that fall into the relevant cells.

CLIQUE (Clustering In QUEst) (Agrawal et al., 1998) is a density- and grid-based approach for high-dimensional data sets that provides automatic subspace clustering of high-dimensional data. It consists of the following steps: First, it uses a bottom-up algorithm that exploits the monotonicity of the clustering criterion with respect to dimensionality to find dense units in different subspaces. Second, it uses a depth-first search algorithm to find all clusters, so that dense units in the same connected component of the graph are in the same cluster. Finally, it generates a minimal description of each cluster.

In fact, the effects of these two algorithms are seriously influenced by the size of the predefined grids and the threshold of the significant cells.
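The basic grid-based procedure described above (partition the space, keep the significant cells, connect neighbouring significant cells) can be sketched as follows. This is only a minimal illustration, not the implementation of STING or CLIQUE: the names `cell_of` and `grid_clusters` are hypothetical, the density of a cell is assumed to be its point count, every coordinate is assumed to lie in `[lo, hi]`, and "nearby" is assumed to include diagonal neighbours.

```python
from collections import defaultdict
from itertools import product


def cell_of(point, k, lo, hi):
    """Index of the cell containing `point` when each dimension has k equal parts."""
    width = (hi - lo) / k
    # Clamp so that a coordinate equal to `hi` falls into the last cell.
    return tuple(min(int((x - lo) / width), k - 1) for x in point)


def grid_clusters(points, k, threshold, lo=0.0, hi=1.0):
    """Return clusters (lists of points) found on a single k^n grid."""
    # Partition: assign every point to its cell.
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, k, lo, hi)].append(p)

    # Significant cells: density (assumed: point count) exceeds the threshold.
    significant = {c for c, pts in cells.items() if len(pts) > threshold}

    # Connect neighbouring significant cells with a flood fill.
    clusters, seen = [], set()
    for start in significant:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            c = stack.pop()
            members.extend(cells[c])
            for delta in product((-1, 0, 1), repeat=len(c)):
                nb = tuple(a + b for a, b in zip(c, delta))
                if nb in significant and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(members)
    return clusters
```

Running `grid_clusters` on the same data with different values of `k` or `threshold` is a quick way to observe the sensitivity to cell size and threshold noted above.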
To reduce the influence of the size of the predefined grids and of the threshold of the significant cells, we propose in this paper a new grid-based clustering algorithm, called the Deflected Grid-based (DGD) algorithm.

3 A Deflected Grid-based Algorithm
After the grid structure is built, the deflected grid-based algorithm (DGD) deflects the cell margins by half a cell width in each dimension to obtain a new grid structure, and then combines the two sets of clusters into the final result. The procedure of DGD is shown in the following steps.

Step 1: Generate a grid structure. By dividing each dimension into $k$ equal parts, the $n$-dimensional data space is partitioned into $k^n$ non-overlapping cells, which form the grid structure.

Step 2: Identify significant cells. Next, the density of each cell is calculated to find the significant cells whose densities exceed a predefined threshold.

Step 3: Generate the set of clusters. Then the nearby significant cells that are connected to each other are grouped into clusters. The set of these clusters is denoted as $S_1$.

Step 4: Deflect the grid structure. The original grid structure is next deflected by distance $d$ in each dimension of the data space.

Step 5: Generate the set of new clusters. Steps 2 and 3 are applied again to the deflected grid structure to generate the set of new clusters, denoted as $S_2$.

Step 6: Revise original clusters. The clusters generated from the deflected grid structure are used to revise the originally obtained clusters in the following steps.

Step 6a: For each $C_{1i} \in S_1$, find each overlapped cluster $C_{2j}$ and generate the rule $C_{1i} \rightarrow C_{2j}$, where $C_{1i} \cap C_{2j} \neq \varphi$, $C_{2j} \in S_2$. The rule $C_{1i} \rightarrow C_{2j}$ means that cluster $C_{1i}$ overlaps cluster $C_{2j}$. Similarly, for each $C_{2j} \in S_2$, find each overlapped cluster $C_{1i}$ and generate the rule $C_{2j} \rightarrow C_{1i}$, where $C_{2j} \cap C_{1i} \neq \varphi$.

Step 6b: The set of all the rules generated in step 6a is denoted as $R_0$. Next, each cluster $C_{1i} \in S_1$ is revised by using the cluster revised function CR(). The function CR() is shown in Fig. 1.

Step 7: Generate the clustering result. After all clusters of $S_1$ have been revised, $S_1$ is the set of final clusters.

    for each $C_{1i} \in S_1$
        Let $X := C_{1i}$;
        Repeat
            $oldX := X$;
            For each $Y \rightarrow Z$ in $R_0$ Do
                If $Y \subseteq X$ then
                    $X := X \cup Z$;
                    If $Z \in S_1$ then $S_1 := S_1 - \{Z\}$;
                Endif
        Until ($oldX = X$);
        $C_{1i} := X$;
    End

Fig. 1 The CR algorithm

3.1 Example
Here, a two-dimensional example with 600 points, shown in Fig. 2, which is easily divided into two clusters, is used to walk through the algorithm.

Fig. 2 Two-dimensional example

Step 1: Generate a grid structure. By dividing each dimension into 20 equal parts, the two-dimensional data space in this example is partitioned into $20^2$ non-overlapping cells, which form the grid structure shown in Fig. 3.

Fig. 3 The grid structure of $20^2$ cells

Step 2: Identify significant cells. Next, the density of each cell, shown in Fig. 4, is calculated to find the significant cells whose densities exceed a predefined threshold. Here the threshold is 4.

Fig. 4 The density of each cell

Step 3: Generate the set of clusters. Then the nearby significant cells that are connected to each other are grouped into 5 clusters. The set of these clusters is denoted as $S_1 = \{C_{11}, C_{12}, \ldots, C_{15}\}$, shown in Fig. 5.

Step 4: Deflect the grid structure. The original grid structure is deflected by distance $d$ in each dimension of the data space. In this example, $d$ is equal to half the side length of a cell. After the deflection, the new grid structure is partitioned into $21^2$ cells, shown in Fig. 6.

Fig. 5 Result of the first clustering

Fig. 6 The new grid structure with $21^2$ cells

Step 5: Generate the set of new clusters. The cell density of the new grid structure is shown in Fig. 7. The significant cells are again those whose densities exceed the predefined threshold, 4. The nearby significant cells that are connected to each other are grouped into 4 clusters. The set of these clusters is denoted as $S_2 = \{C_{21}, C_{22}, \ldots, C_{24}\}$, shown in Fig. 8.

Fig. 7 The cell density of the new grid structure

Fig. 8 Result of the second clustering

Step 6: Revise original clusters. The clusters generated from the deflected grid structure are used to revise the originally obtained clusters as in steps 6a and 6b. $R_0$ is composed of the rules $C_{1i} \rightarrow C_{2j}$, shown in Table 1, and $C_{2j} \rightarrow C_{1i}$, shown in Table 2.

Step 7: Generate the clustering result. After all clusters of $S_1$ have been revised by using the cluster revised function CR(), the revised $S_1$ is as shown in Table 3, and the final clustering result is shown in Fig. 9.
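The seven steps walked through above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the authors' implementation: the names `grid_clusters`, `dgd`, `width`, and `shift` are hypothetical, the density of a cell is taken to be its point count, clusters are represented as sets of point indices so that overlap is a set intersection, and the deflection distance is fixed at half a cell side as in the example. As in Step 7, only the revised clusters of the original grid are returned.

```python
from collections import defaultdict
from itertools import product


def grid_clusters(points, width, threshold, shift=0.0):
    """Steps 1-3 on a grid of cell side `width`; returns frozensets of point indices.

    Adding `shift` to every coordinate is equivalent to moving the grid lines
    by `shift` in the opposite direction.
    """
    cells = defaultdict(set)
    for i, p in enumerate(points):
        cells[tuple(int((x + shift) // width) for x in p)].add(i)
    significant = {c for c, ids in cells.items() if len(ids) > threshold}

    clusters, seen = [], set()
    for start in significant:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            c = stack.pop()
            members |= cells[c]
            for delta in product((-1, 0, 1), repeat=len(c)):
                nb = tuple(a + b for a, b in zip(c, delta))
                if nb in significant and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(frozenset(members))
    return clusters


def dgd(points, width, threshold):
    s1 = grid_clusters(points, width, threshold)                   # steps 1-3
    s2 = grid_clusters(points, width, threshold, shift=width / 2)  # steps 4-5

    # Step 6a: one rule Y -> Z per overlapping pair, in both directions.
    rules = [(a, b) for a in s1 for b in s2 if a & b]
    rules += [(b, a) for a, b in rules]

    # Step 6b: close each remaining original cluster under the rules (CR).
    remaining, final = list(s1), []
    while remaining:
        closure = set(remaining.pop(0))
        changed = True
        while changed:
            changed = False
            for y, z in rules:
                if y <= closure:
                    if z in remaining:
                        remaining.remove(z)  # z is absorbed into this cluster
                    if not z <= closure:
                        closure |= z
                        changed = True
        final.append(closure)
    return final  # step 7: the revised original clusters
```

In a one-dimensional toy case such as `[(0.8,), (0.9,), (1.1,), (2.1,), (2.2,)]` with `width=1.0` and `threshold=1`, the original grid separates the two outer pairs, while the deflected grid yields one significant run of cells that links them, so `dgd` merges them into a single cluster.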

Table 1 Rules $C_{1i} \rightarrow C_{2j}$ of $R_0$

Table 2 Rules $C_{2j} \rightarrow C_{1i}$ of $R_0$

Table 3 The set of final clusters

Fig. 9 The final clustering result

4 Experiment and Discussions
Here, we experiment with seven different data sets. Their features are shown in Table 4.

Table 4 Experimental data features

Fig. 10 Experiment 1

Fig. 11 Experiment 2

Fig. 12 Experiment 3

Fig. 13 Experiment 4

Fig. 14 Experiment 5

Fig. 15 Experiment 6

Fig. 16 Experiment 7

4.1 Experiment
Figure 17 shows the correct rates of DGD and SDG, where the correct clustering result of SDG is obtained by using either the original or the new grid structure in the experiment. The correct rates of DGD are all higher than those of SDG. In the experiments, the correct rates are compared over 100 random sets of parameters (density threshold, number of dividing parts in each dimension) from (6, ) to (55, 3).

Fig. 17 Correct rates of DGD and SDG

Table 5 is the correct rate comparison sheet of experiment 1, using 100 random sets of parameters. The correct rate of DGD is 47%, which is higher than that of SDG, whose correct rate is only 2%. Here, the rate at which both are correct with the same set of parameters is also only 2%. So the result of SDG is part of the clustering result of DGD in experiment 1: no set of parameters gives a wrong result with DGD but a correct result with SDG.

Table 5 The correct rate comparison sheet of experiment 1

In Tables 6, 7, 8, and 9, it is possible to find sets of parameters for which the result is wrong with DGD but correct with SDG. Though these values are low, the experimental results differ from those of experiment 1 in Table 5. So the results of SDG are not always part of the clustering results of DGD.

Table 6 The correct rate comparison sheet of experiment 2

Table 7 The correct rate comparison sheet of experiment 3

Because the correct rate of DGD is always higher than that of SDG, using DGD improves the correct rate compared with other grid-based algorithms. In other words, the experimental results verify that the effect of the DGD algorithm is less influenced by the size of the cells than that of other grid-based algorithms.
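The comparison above rests on three tallies per experiment: how often DGD is correct, how often SDG is correct, and how often both are correct for the same set of parameters. A minimal sketch of that bookkeeping is given below. The function name `correct_rates` and the input format (one pair of booleans per parameter set) are assumptions made for illustration, and the sample outcomes are invented, not taken from the paper's tables.

```python
def correct_rates(outcomes):
    """Tally correct rates from (dgd_correct, sdg_correct) pairs, one per parameter set."""
    n = len(outcomes)
    return {
        "dgd": sum(d for d, _ in outcomes) / n,
        "sdg": sum(s for _, s in outcomes) / n,
        "both": sum(d and s for d, s in outcomes) / n,
        # Parameter sets where SDG is correct but DGD is not.
        "sdg_only": sum(s and not d for d, s in outcomes) / n,
    }


# Hypothetical outcomes for ten parameter sets (illustrative only).
sample = [(True, True)] * 2 + [(True, False)] * 3 + [(False, False)] * 5
rates = correct_rates(sample)
print(rates)
```

When `sdg_only` is zero, the `both` rate equals the `sdg` rate, which is the situation described for experiment 1: every parameter set that works for SDG also works for DGD. A non-zero `sdg_only` corresponds to the situation described for Tables 6 to 9.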

Table 8 The correct rate comparison sheet of experiment 4

Table 9 The correct rate comparison sheet of experiment 5

Table 10 The correct rate comparison sheet of experiment 6

Table 11 The correct rate comparison sheet of experiment 7

4.2 Discussion
In the DGD algorithm, for each data point $\alpha$, only those points that are in the same cell as $\alpha$ are considered. The density of each cell is calculated first. When the total number of data points is $n$ and each of the $d$ dimensions is divided into $m$ intervals, there are $m^d$ cells in the original grid and $(m+1)^d$ cells in the deflected grid. The time for checking the density of all cells is $k_0 [m^d + (m+1)^d]$. If $p$ $(= 3^d - 1)$ is the number of nearby cells of one cell, the time for comparing the connected significant cells to generate the two original clustering results is at most $k_1 p [m^d + (m+1)^d]$. The time of the cluster revised function CR() is $k_2 r$, where $r$ is the number of rules $C_{1i} \rightarrow C_{2j}$ and $C_{2j} \rightarrow C_{1i}$ in $R_0$, with $r \ll m^d \ll n$. Finally, the time for checking the cluster number of all data points is $k_3 n$. So the total time complexity is $O(m^d) + O(n)$.

5 Conclusion and Future Work
In this paper, a new grid-based clustering algorithm, called the Deflected Grid-based (DGD) algorithm, is proposed. It works over obviously wider ranges of cell size and density threshold, and the experimental results verify that the effect of the DGD algorithm is less influenced by the size of the cells than that of other grid-based algorithms. At the same time, the DGD algorithm still inherits the advantage of low time complexity.

There are many interesting research problems related to the DGD algorithm. One is to find a non-parametric algorithm with at least the same efficiency as the DGD algorithm. Another is to use parallel algorithms to reduce the computational cost.

References:
[1] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[2] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[3] C. C. Aggarwal and P. S. Yu. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal, 14:211-221, 2005.
[4] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. of 2nd Int. Conf. on KDD, 1996, pages 226-231.
[5] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Knowledge Discovery and Data Mining, 1998, pages 58-65.
[6] M. Ankerst et al. OPTICS: Ordering Points to Identify the Clustering Structure. In Proc. ACM SIGMOD Int. Conf. on MOD, 1999, pages 49-60.
[7] A. H. Pilevar and M. Sukumar. GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases. Pattern Recognition Letters 26 (2005), 999-1010.
[8] Y. C. Zhao and J. Song. GDILC: A Grid-based Density-Isoline Clustering Algorithm. In Proc. Internat. Conf. on Info-net, Vol. 3, pp. 140-145, 2001.
[9] E. W. M. Ma and T. W. S. Chow. A new shifting grid clustering algorithm. Pattern Recognition 37 (3), 2004, 503-514.
[10] P. Alevizos, B. Boutsinas, D. Tasoulis, and M. N. Vrahatis. Improving the K-windows clustering algorithm. In Proc. 14th IEEE Internat. Conf. on Tools with Artificial Intell., pp. 239-245, 2002.
[11] W. Wang, J. Yang, and R. R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proc. of 23rd Int. Conf. on VLDB, 1997, pages 186-195.
[12] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. In VLDB Journal: Very Large Data Bases, 2000, pages 289-304.
[13] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic sub-space clustering of high dimensional data for data mining applications. In Proc. of ACM SIGMOD Int. Conf. MOD, 1998, pages 94-105.