Analyzing Popular Clustering Algorithms from Different Viewpoints


1000-9825/2002/13(08)1382-13  2002 Journal of Software Vol.13, No.8

Analyzing Popular Clustering Algorithms from Different Viewpoints

QIAN Wei-ning, ZHOU Ao-ying

(Department of Computer Science, Fudan University, Shanghai 200433, China)
(Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, China)
E-mail: {wnqian,ayzhou}@fudan.edu.cn
http://www.cs.fudan.edu.cn/ch/third_web/webdb/webdb_english.htm
Received September 3, 2001; accepted February 25, 2002

Abstract: Clustering is widely studied in the data mining community. It is used to partition a data set into clusters so that intra-cluster data are similar and inter-cluster data are dissimilar. Different clustering methods use different similarity definitions and techniques. Several popular clustering algorithms are analyzed from three different viewpoints: (1) clustering criteria, (2) cluster representation, and (3) algorithm framework. Furthermore, some newly built algorithms that mix or generalize other algorithms are introduced. Since the analysis proceeds from several viewpoints, it can cover and distinguish most of the existing algorithms. It is the basis for research on self-tuning algorithms and clustering benchmarks.

Key words: data mining; clustering; algorithm

Supported by the National Grand Fundamental Research 973 Program of China under Grant No.G1998030414; the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038

QIAN Wei-ning was born in 1976. He is a Ph.D. candidate at the Department of Computer Science, Fudan University. His research interests are clustering, data mining, and Web data management. ZHOU Ao-ying was born in 1965. He is a professor and doctoral supervisor at the Department of Computer Science, Fudan University. His current research interests include Web data management, data mining, and object management over peer-to-peer networks.

Clustering is an important data-mining technique used to find data segmentation and pattern information. It is widely used in applications such as financial data classification, spatial data processing, satellite photo analysis, and medical image auto-detection. The problem of clustering is to partition the data set into segments (called clusters) so that intra-cluster data are similar and inter-cluster data are dissimilar. It can be formalized as follows:

Definition 1. Given a data set V = {v_1, v_2, ..., v_n}, in which the v_i (i = 1, 2, ..., n) are called data points, the process of partitioning V into {C_1, C_2, ..., C_k}, where C_i ⊆ V (i = 1, 2, ..., k) and ∪_{i=1}^{k} C_i = V, based on the similarity between data points, is called clustering; the C_i (i = 1, 2, ..., k) are called clusters.

The definition does not specify the similarity between data points; in fact, different methods use different criteria. Clustering is also known as an unsupervised learning process, since there is no prior knowledge about the data set. Therefore, clustering analysis usually acts as preprocessing for other KDD operations, and the quality of the clustering result is important for the whole KDD process. As with other data mining operations, high performance and scalability are two further requirements besides accuracy. Thus, a good clustering algorithm should satisfy the following

requirements: independence from in-advance knowledge; only easy-to-set parameters; accuracy; speed; and good scalability.

Much research work has been done on building clustering algorithms. Each uses novel techniques to improve its ability to handle data sets with certain characteristics. However, different algorithms use different criteria, as mentioned above. Since there is no benchmark for clustering methods, it is difficult to compare these algorithms with a common measurement. Nevertheless, a detailed comparison is necessary, because: (1) the advantages and disadvantages of existing algorithms should be analyzed, so that improvements can be developed; (2) the user should be able to choose the right algorithm for a certain data set, so that the optimal result and performance can be obtained; (3) a detailed comparison is the basis for building a clustering benchmark.

In this paper, we analyze several existing popular algorithms from different aspects. This differs from other survey work [1~3] in that we compare these algorithms universally from different viewpoints, while others try to generalize some methods into a certain framework, as in Refs.[1,2], which can only cover limited algorithms, or just introduce clustering algorithms one by one as a tutorial [3], so that no comparison among the algorithms is given. Since different algorithms use different criteria and techniques, those surveys can only cover some of the algorithms. Furthermore, some algorithms cannot be distinguished, because they use the same technique and therefore fall into the same category of the chosen framework.

The rest of this paper is organized as follows: Sections 1 to 3 analyze the clustering algorithms from three different viewpoints, namely clustering criteria, cluster representation, and algorithm framework. Section 4 introduces some methods that are mixtures or generalizations of other algorithms. Section 5 introduces research focused on auto-detection of clusters. Finally, Section 6 gives concluding remarks. It should be noted that, from each single viewpoint, some algorithms are still missing although we try to classify as many as we can, and some algorithms may fall into the same category. However, when these algorithms are observed from all the viewpoints together, they can be distinguished. This is the motivation of our work.

1 Criteria

The basis of clustering analysis is the definition of similarity. Usually, the definition of similarity contains two parts: (1) the similarity between data points; (2) the similarity between sets of data points. Not all clustering methods need both of them; some algorithms only use one. The clustering criteria can be classified into three categories: distance-based, density-based, and linkage-based. Distance-based and density-based clustering is usually applied to data in Euclidean space, while linkage-based clustering can be applied to data in an arbitrary metric space.

1.1 Distance-Based Clustering

The basic idea of distance-based clustering is that a cluster consists of data points close to each other. The distance between two data points is easy to define in Euclidean space; the widely used definitions include the Euclidean distance and the Manhattan distance. However, there are several choices for the similarity definition between two sets of data points:

Similarity_rep(C_i, C_j) = distance(rep_i, rep_j)    (1)
Similarity_avg(C_i, C_j) = (1/(n_i·n_j)) · Σ_{v∈C_i, v'∈C_j} distance(v, v')    (2)
Similarity_max(C_i, C_j) = max{distance(v, v') | v∈C_i, v'∈C_j}    (3)
Similarity_min(C_i, C_j) = min{distance(v, v') | v∈C_i, v'∈C_j}    (4)

In (1), rep_i and rep_j are representatives of C_i and C_j, respectively; in (2), n_i and n_j are the sizes of C_i and C_j. The representative of a data set is usually the mean, as in k-means [4], and single-representative methods usually employ Definition (1). It is obvious that the complexity of (2), (3), and (4) is O(|C_i|·|C_j|), which is inefficient for large data sets. Although they are more global definitions, they are usually not directly applied as similarity definitions for sub-clusters or clusters. The only exception is BIRCH [5], in which the CF-vector and CF-tree are employed to accelerate the computation. Some trade-off approaches are taken instead, as will be discussed in Section 2.1, where a detailed analysis of single-representative methods is also given.
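To make the four definitions concrete, the following sketch computes them for two point sets in Euclidean space. This is a minimal illustration, not code from the paper; the function names and the choice of plain tuples for points are ours.

```python
import math
from itertools import product

def euclidean(u, v):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Manhattan (city-block) distance, the other widely used choice.
    return sum(abs(a - b) for a, b in zip(u, v))

def sim_rep(ci, cj, distance=euclidean):
    # Definition (1): distance between cluster representatives (here: means).
    mean = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return distance(mean(ci), mean(cj))

def sim_avg(ci, cj, distance=euclidean):
    # Definition (2): average pairwise distance, O(|Ci|*|Cj|).
    return sum(distance(u, v) for u, v in product(ci, cj)) / (len(ci) * len(cj))

def sim_max(ci, cj, distance=euclidean):
    # Definition (3): maximum pairwise distance.
    return max(distance(u, v) for u, v in product(ci, cj))

def sim_min(ci, cj, distance=euclidean):
    # Definition (4): minimum pairwise distance.
    return min(distance(u, v) for u, v in product(ci, cj))

if __name__ == "__main__":
    c1 = [(0.0, 0.0), (1.0, 0.0)]
    c2 = [(4.0, 3.0), (5.0, 3.0)]
    print(sim_rep(c1, c2), sim_avg(c1, c2), sim_max(c1, c2), sim_min(c1, c2))
```

Note that Definition (1) costs one distance evaluation once the means are known, while (2), (3), and (4) enumerate all |C_i|·|C_j| pairs, which is exactly the inefficiency noted above.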

The advantage of distance-based clustering is that distance is easy to compute and to understand. Distance-based clustering algorithms usually need the parameter k, the number of final clusters the user wants, or the minimum distance that distinguishes two clusters. However, their disadvantage is equally distinct: they are noise-sensitive. Although some of them introduce techniques against noise, these techniques cause other serious problems. CURE [6] uses a representative-shrinking technique to reduce the impact of noise. However, this invites the problem that it fails to identify clusters of hollow shapes, as the result of our experiment shows in Fig.1. This shortcoming counteracts the advantage of multi-representatives, namely that the algorithm can identify arbitrarily shaped clusters.

Fig.1 Hollow-shaped cluster identified by CURE

BIRCH, the first clustering algorithm that considers noise, introduces a new parameter T, which is essentially a parameter related to density. Furthermore, it is hard for the user to understand this parameter unless the page storage capacity of the CF-tree is known (page_size/entry_size/T approximates the density in a page). In addition, it may cause the loss of small clusters and long-shaped clusters. For lack of space, the detailed discussion is omitted here.

1.2 Density-Based Clustering

In contrast to distance-based methods, density-based clustering holds that clusters are dense areas. The similarity definition of data points is therefore based on whether they belong to connected dense regions: the data points belonging to the same connected dense region belong to the same cluster. Based on how density is computed, density-based clustering can be further classified into nearest-neighbor (called NN in the rest of this paper) methods and cell-based methods. The difference between them is that the former defines density based on the data set, while the latter defines it based on the data space. No matter which kind a density-based clustering algorithm belongs to, it always needs a minimum-density threshold parameter, which is the key to defining a dense region.

1.2.1 NN methods

NN methods only treat points that have more than k neighbors in the hyper-sphere of radius ε around them as data points in clusters. Since the neighbors of each point must be counted, index structures that support region queries, such as the R*-tree or the X-tree, are always employed. Because of the curse of dimensionality [7], these methods do not scale well with dimensionality. Furthermore, NN methods result in frequent I/O when the data sets are very large.
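The NN density test just described can be sketched as follows. This naive O(n²) version replaces the R*-tree/X-tree region queries with a linear scan, and the parameter names eps and min_neighbors are ours.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def core_points(points, eps, min_neighbors):
    """Return the points that have more than `min_neighbors` neighbors
    within a hyper-sphere of radius `eps` (the NN density test).
    A naive O(n^2) scan; real systems use an R*-tree or X-tree instead."""
    cores = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and euclidean(p, q) <= eps)
        if neighbors > min_neighbors:
            cores.append(p)
    return cores

if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
    print(core_points(data, eps=0.2, min_neighbors=2))  # the point (5, 5) is excluded as noise
```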

However, for most multi-dimensional data sets, these methods are efficient. In short, the shortcoming of this kind of method is the shortcoming of the index structures it is based on. Traditional NN methods, such as DBSCAN and its descendants [8~10], need the parameters of a density threshold and ε. Recently, OPTICS [11], whose basic idea is the same as DBSCAN's, has focused on the automatic identification of cluster structures. Since the novel techniques in OPTICS do not belong to the topic of this subsection, we discuss them in Section 5.

1.2.2 Cell-Based methods

Cell-based methods count density information based on units. STING [12], WaveCluster [13], DBCLASD [14], CLIQUE [15], and OptiGrid [16] all fall into this category. Cell-based methods have the shortcoming that cells are only approximations of dense areas. Some methods introduce techniques to solve this problem, as will be introduced in Section 2.3.

All density-based clustering methods meet problems when data sets contain clusters or sub-clusters whose granularity is smaller than the granularity of the units used for computing density. A well-known example is dumbbell-shaped clusters, as shown in our experimental result in Fig.2. On the other hand, density-based clustering methods can easily remove noise if the parameters are properly set; that is to say, they are robust to noise.

Fig.2 Dumbbell-shaped clusters identified by a density-based algorithm (DBSCAN)

1.3 Linkage-Based Clustering

Unlike distance-based or density-based clustering, linkage-based clustering can be applied to arbitrary metric spaces. Furthermore, since in high-dimensional space the distance and density information is not sufficient for clustering, linkage-based clustering is often employed there. Algorithms of this kind include ROCK [17], CHAMELEON [18], ARHP [19,20], STIRR [21], CACTUS [22], etc.

Linkage-based methods are based on a graph or hyper-graph model. They usually map the data set into a graph or hyper-graph, then cluster the data points based on the edge or hyper-edge information, so that highly connected data points are assigned to the same cluster. The difference between the graph model and the hyper-graph model is that the former reflects the similarity of pairs of nodes, while the latter usually reflects co-occurrence information. ROCK and CHAMELEON use the graph model, while ARHP, PDDP, STIRR, and CACTUS use the hyper-graph model. Although the developers of CACTUS did not state that it is a hyper-graph-model-based algorithm, it belongs to that kind. The quality of a linkage-based clustering result depends on the definition of the link or hyper-edge. Since it is impossible to handle a complete graph, the graph/hyper-graph model always eliminates the edges/hyper-edges whose weight is low, so that the graph/hyper-graph is sparse. However, while this gains efficiency, it may reduce accuracy.

The algorithms in this category use different frameworks. ROCK and CHAMELEON are hierarchical clustering methods, ARHP is a divisive method, and STIRR uses a dynamical system model. Furthermore, since the co-occurrence problem is similar to the association rule mining problem, ARHP and CACTUS both borrow the Apriori algorithm [23] to find the clusters. Another algorithm employing an Apriori-like algorithm is CLIQUE; however, there the monotonicity lemma is used to find high-dimensional clusters based on clusters found in subspaces. CLIQUE is not a linkage-based clustering method, which is the difference between it and the other algorithms discussed in this subsection. The detailed discussion of algorithm frameworks is given in Section 3. And since CHAMELEON uses both link and distance information, it is discussed separately in Section 4.1.
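The flavor of the graph model can be illustrated with a small sketch: keep only edges whose weight exceeds a threshold (so the graph stays sparse), then read off highly connected groups, here simplified to connected components. This is our own minimal illustration, not the specific link or hyper-edge definition of ROCK, CHAMELEON, or CACTUS.

```python
from collections import defaultdict

def sparse_graph(items, weight, threshold):
    """Keep only edges whose weight exceeds `threshold`,
    so that low-weight edges are eliminated and the graph stays sparse."""
    graph = defaultdict(set)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if weight(items[i], items[j]) > threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def connected_components(n, graph):
    # Each component approximates one cluster of highly connected points.
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(graph[v] - seen)
        clusters.append(comp)
    return clusters

if __name__ == "__main__":
    # Categorical records compared by Jaccard similarity (a co-occurrence flavor).
    records = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
    jaccard = lambda s, t: len(s & t) / len(s | t)
    g = sparse_graph(records, jaccard, threshold=0.4)
    print(connected_components(len(records), g))  # [[0, 1], [2, 3]]
```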

2 Cluster Representation

The purpose of clustering is to identify the data clusters, which are summaries of similar data. Each algorithm has to represent the clusters and sub-clusters in some form. Although labeling each data point with a cluster identity is a straightforward idea, most methods do not take this approach. This may be because: (1) the summary, which should be easily understandable, is more than a set of (data-point, cluster-id) pairs; (2) it is time- and space-expensive to label all the data points during the clustering process; (3) some methods employ accurate, compact cluster representatives, which make the time-consuming labeling process unnecessary. We classify the cluster representation techniques into four kinds, discussed in the following:

2.1 Representative points

Most distance-based clustering methods use some points to represent clusters. These points are called representative points. The representatives may be data points, or other points that do not exist in the database, such as means of sets of data points. The representation techniques falling into this category can be further classified into three classes:

2.1.1 Single representative

The simplest approach is to use one point as the representative of each cluster. Each data point is assigned to the cluster whose representative is the closest. The representative point may be the mean of the cluster, as in k-means [4] methods, or the data point in the database closest to the center, as in k-medoids methods. Other algorithms of this kind include BIRCH [5], CLARA [24], and CLARANS [25]. The different effects of k-means and k-medoids methods on the clustering result are introduced in detail in Ref.[25]; since this is not related to the motivation of this paper, we do not survey it here. The shortcomings of the single-representative approach are obvious: (1) only spherical clusters can be identified; and (2) a large cluster with a small cluster beside it will be split, while some data points of the large cluster will be assigned to the small cluster. These two conditions are shown in Fig.3 (the right part of this figure is borrowed from Ref.[6], Fig.1(b)). Therefore, this approach fails when processing data sets with arbitrarily shaped clusters or clusters of very different scales.

Fig.3 Non-spherical clusters and clusters with different scales identified by single-representative methods

2.1.2 All data points

Using all the data points of a cluster to represent it is another straightforward approach. However, it is time-expensive, since: (1) the data sets are usually so large that the label information cannot fit in memory, which leads to frequent disk access; and (2) computing intra- and inter-cluster information requires access to all data points. Furthermore, the label information is hard to understand. Therefore, no popular algorithm takes this approach.

2.1.3 Multi-Representatives

The multi-representatives approach, introduced in CURE, is the trade-off between the single-point and all-points methods. The first representative is the data point farthest from the mean of the cluster. Then, the data point whose distance to the nearest existing representative is the largest is chosen each time, until the number of representatives is large enough.
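The selection loop can be sketched as below. This is a minimal illustration of the farthest-point rule only; CURE additionally shrinks the chosen representatives toward the mean, which is omitted here, and the function names are ours.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def choose_representatives(cluster, count):
    """Pick `count` well-scattered representatives for one cluster:
    start with the point farthest from the mean, then repeatedly add
    the point whose distance to its nearest chosen representative
    is largest (the selection rule described above)."""
    mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    reps = [max(cluster, key=lambda p: euclidean(p, mean))]
    while len(reps) < min(count, len(cluster)):
        candidate = max(
            (p for p in cluster if p not in reps),
            key=lambda p: min(euclidean(p, r) for r in reps),
        )
        reps.append(candidate)
    return reps

if __name__ == "__main__":
    cluster = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1.5)]
    print(choose_representatives(cluster, 3))
```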
In Ref.[6], the experiments show that for most data sets, 10 representatives lead to a satisfying result.

In the long version of Ref.[26], the authors who developed CURE also mention that for complex data sets more representatives are needed. However, before clustering, the complexity of the clusters is unknown; furthermore, the relationship between the complexity of clusters and the number of representatives is not clear. This forces the user to choose a large number of representatives. Since the time complexity of CURE is O(n² log n), where n is the initial number of data points, a large number of representatives in the initial sub-clusters affects the efficiency, as shown in our experimental result, Fig.4. (Sub-clusters exist because a simple partitioning technique is used in CURE [6]; the time complexity in terms of the number of representatives c is O(c·log c) if the number of initial sub-clusters is fixed.) Furthermore, together with CURE's technique for handling outliers (the shrinking of representatives), it fails to identify clusters of hollow shape, as already discussed in Section 1.1 and shown in Fig.1. Nevertheless, CURE outperforms the single-point and all-points approaches when both effectiveness and efficiency are considered.

Fig.4 Performance of CURE (time in seconds) vs. number of representatives in a cluster (0 to 180)

2.2 Dense area

Some density-based clustering algorithms use dense areas to denote clusters and sub-clusters. DBSCAN [8], its descendants [9,10], and OPTICS [11] belong to this category. The dense-area representation is similar to the all-data-points method, except that only core points are used. Core points are those data points that have more neighbors within a certain region than a threshold. Therefore, only core points are used to expand a sub-cluster, and expansion stops when it can proceed from no further core point. Dense areas can describe arbitrarily shaped clusters, apart from the dumbbell-shaped clusters discussed above. However, the cost of computing core points is high, so special index structures are needed. In the algorithms of the DBSCAN series and in OPTICS, the R*-tree is used to support region queries. Since these methods need to scan the whole database, and each point may cause a region query, they always result in frequent I/O when applied to large databases, as shown by the experiments given in Section 4.2.

2.3 Cells

Some grid-based methods use cells to summarize the clusters, e.g., STING [12], WaveCluster [13], CLIQUE [15], DBCLASD [14], and OptiGrid [16]. Unlike dense areas, which are condensations of dense data points, cells are partitions of the data space. Therefore, a cell is an approximation of the data points falling into it. This makes the algorithms taking this approach inaccurate under some conditions. In Ref.[12], the authors argue that under a sufficient condition STING can ensure that the result is accurate; however, this conclusion holds only when the characteristics of the queries are known a priori. WaveCluster exploits the multi-resolution property of the wavelet transform to identify clusters at different resolutions, which ensures that the highest-resolution clusters are accurate.

The advantages of using cells to represent clusters are straightforward. First, the number of cells is much smaller than the size of the database; therefore, the data to be processed is limited, which gives these approaches high scalability. Second, the cost of computing the properties of cells is low compared to finding dense areas, which needs the support of complex data structures; this is because cells are data-independent, while dense areas depend on the data distribution. Finally, like dense areas, cells can reflect the data distribution of a local area, although only approximately.

Since the number of neighboring relationships explodes as the dimensionality increases, algorithms exploiting the neighboring information of cells are usually inefficient for high-dimensional data. The only exception is CLIQUE. Different from other cell-representation methods, CLIQUE finds dense units (cells) from low-dimensional subspaces up to high-dimensional subspaces; therefore, it scales well with dimensionality. Although OptiGrid is a cell-based clustering method, it does not use neighboring information, so it is also efficient for high-dimensional data sets.

2.4 Probability

Some methods use probability to denote the degree to which a data point belongs to a cluster. EM [27,28] and AutoClass [29] belong to this category. The problem of assigning a data point to more than one cluster is also known as fuzzy clustering or soft clustering. In most cases, the performance of soft clustering is unsatisfying. Reference [2] provides a detailed survey of fuzzy clustering; for lack of space, we do not elaborate here.

3 Algorithm Framework

In the above two sections, we discussed clustering criteria and cluster representation, the two most important factors for clustering effectiveness. In this section, the algorithm framework is discussed. The algorithm framework determines the time complexity of an algorithm and the parameters it needs; furthermore, it also affects the preprocessing techniques. These are the focuses of the following three subsections.

3.1 Optimization methods

Optimization methods usually try to optimize a certain measure. Traditional optimization methods are also known as partitioning methods. The most famous ones include k-means [4] (including its variants k-modes [30] and k-prototypes [30]) and k-medoids (including PAM [24], CLARA [24], CLARANS [25], etc.). Some newly built algorithms, including STIRR [21], also fall into this category.

K-means methods try to minimize a dissimilarity criterion (typically the squared-error criterion). K-means algorithms are usually linear in the size of the data set. However, they are usually sensitive to outliers and often terminate at a local optimum, so the quality of the result is not satisfying. Furthermore, they are usually designed as memory-resident algorithms, which limits their scalability.
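A minimal k-means sketch follows: it alternates nearest-mean assignment with mean recomputation, which greedily reduces the squared-error criterion and can stop at a local optimum, as noted above. The naive first-k initialization is our simplification.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, iterations=100):
    """Plain k-means: assign each point to its nearest mean,
    then recompute the means; repeat until assignments stabilize.
    Greedy descent on the squared error, hence only a local optimum."""
    means = [points[i] for i in range(k)]  # naive initialization
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged (possibly to a local optimum)
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster empties out
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return means, assignment

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
    means, labels = kmeans(data, k=2)
    print(means, labels)
```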

Other than k-means, k-medoids methods use actual data points to represent clusters. Since noise and outliers influence the medoids less, these methods are more robust than k-means. However, the cost of k-medoids algorithms is also high. PAM, CLARA, and CLARANS are the three most famous k-medoids algorithms. PAM is the first k-medoids method. CLARA and CLARANS both use a sampling technique: CLARA uses fixed samples, while CLARANS does not; furthermore, CLARANS exploits randomized search. Therefore, CLARANS is more scalable than PAM and CLARA.

Other than k-means or k-medoids, some newly built optimization algorithms do not use representatives at all, such as STIRR. STIRR is designed to handle categorical data, for which means or medoids are difficult to define. It maps the data set into a hyper-graph and then employs dynamical-system techniques to find basins, which are fixed points of the system. It can therefore be viewed as a process of finding an optimum of the system configuration.

3.2 Agglomerate methods

Agglomerate algorithms treat data points or partitions of the data set as sub-clusters in the beginning. Then they merge the sub-clusters iteratively until the final clusters are obtained. BIRCH [5], CURE [6], ISAAC [31], ROCK [17], STING [12], and CHAMELEON [18] all fall into this category.

Agglomerate methods have the shortcoming that their time complexity is at least O(n²). Therefore, several techniques are employed to accelerate the processing. Since the number of merge operations depends on the number of initial objects, preprocessing techniques are used to reduce the number of objects to be processed; sampling and partitioning are the two most widely used ones. The developers of CURE proved that a small sample can guarantee the quality of clustering, while CURE, STING, and CHAMELEON all use partitioning before merging the sub-clusters. Another technique used to accelerate the processing is indexing: nearly all agglomerate algorithms exploit a special index structure. BIRCH uses the CF-tree, CURE uses a k-d-tree and a heap, ROCK uses a two-level heap, STING uses a quad-tree-like index, and CHAMELEON uses a k-d-tree and a heap-based priority queue.

Agglomerate methods usually need a parameter known as the stop condition, which determines when the merge operations should stop. This parameter may be k, the number of final clusters, or a threshold denoting the minimum value of the merging measurement.

3.3 Divisive methods

Divisive methods are hierarchical methods, as agglomerate methods are. Divisive methods begin with one large cluster containing all the data points, and then partition the clusters recursively based on dissimilarity, until some stop condition is reached. ARHP [19,20], PDDP [20], and OptiGrid [16] fall into this category. ARHP uses the hyper-graph model: the whole data set is first mapped to a hyper-graph using association rule discovery techniques; then the sub-graphs whose fitness is larger than a threshold are partitioned out; finally, the vertices are assigned to the clusters to which they are most highly connected. Other than ARHP, which uses fitness to partition the clusters, PDDP and OptiGrid use a hyper-plane to split a cluster in each iteration. Like agglomerate methods, divisive methods also need a stop-condition parameter: either the number of final clusters k, or a threshold for partitioning, such as a fitness threshold.
The advantage of divisive methods is that, for the graph/hyper-graph model, mature research work such as HMETIS [32] can be employed. In fact, even CHAMELEON [18], an agglomerate method, has a divisive step as the preprocessing that produces the initial sub-clusters. Since it is only preprocessing, the parameter is easy to set.
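The agglomerate framework of Section 3.2 reduces to the following skeleton: start from singleton sub-clusters and repeatedly merge the closest pair until the stop condition (here, k final clusters) is reached. This quadratic, index-free sketch is ours; it uses the minimum pairwise distance, Definition (4), as the merging measurement, where real algorithms add the sampling, partitioning, and index structures described above.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_distance(ci, cj):
    # Minimum pairwise distance, Definition (4) from Section 1.1.
    return min(euclidean(u, v) for u in ci for v in cj)

def agglomerate(points, k):
    """Start from singleton sub-clusters and iteratively merge the
    closest pair until only k clusters remain (the stop condition)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge j into i
    return clusters

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    print(agglomerate(data, k=3))  # [[(0,0),(0,1)], [(5,5),(5,6)], [(10,0)]]
```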

4 Mixed or Generalized Clustering Approaches

As analyzed above, algorithms using a single criterion may fail on certain kinds of data sets. Some recent research therefore focuses on combining or generalizing different criteria. In this section, three algorithms of this kind are introduced and analyzed.

4.1 CHAMELEON: a distance + connectivity method

CHAMELEON [18] is an algorithm combining several existing clustering techniques. From the clustering-criteria viewpoint, it combines a distance measurement (relative closeness) with a linkage measurement (relative inter-connectivity). Furthermore, it generalizes the classic distance measurement in that it uses relative criteria, which were first introduced in linkage-based clustering [19]. From the algorithm-framework viewpoint, it uses a divisive method as a partitioning step to generate the initial sub-clusters, while the main phase of the algorithm employs the agglomerate framework. From the cluster-representation viewpoint, it is an all-points method, though the points here may be the initial sub-clusters.

The advantages and shortcomings of CHAMELEON can be derived easily from this multi-viewpoint analysis. It is strong at identifying arbitrarily shaped clusters and highly intra-connected clusters, since relative distance and relative connectivity are used. However, it needs two parameters as the thresholds of relative distance and relative connectivity, respectively, and the divisive partitioning step needs yet another parameter; this is the shortcoming of combining so many techniques. Furthermore, the framework dictates that an index structure supporting region queries (e.g., a k-d-tree) and a heap must be used. Although the time complexity is analyzed theoretically, no scaling-up technique or experiment is provided in the paper.

4.2 Hybrid: a distance + density method

The Hybrid algorithm is a clustering method combining the distance and density criteria [33]. From the criteria viewpoint, it uses both distance and density information. From the cluster-representation viewpoint, it uses the multi-representative technique; although cells are employed to enable scaling up, they are not used to represent the clusters, so the cluster representation can be more accurate. From the framework viewpoint, it is an agglomerate algorithm.

As discussed before, the advantages and shortcomings are straightforward after the analysis. It can identify arbitrarily shaped clusters and is insensitive to noise and outliers, since both distance and density information are used. However, this introduces three parameters: one for distance computation and two for density computation. Furthermore, the framework dictates the use of a k-d-tree and a heap structure. Different from CHAMELEON, it is designed for handling very large databases: the cell-based indexing not only reduces the data to be processed, but also accelerates the labeling process. As shown in our experiments (Fig.5), it outperforms the two popular clustering algorithms DBSCAN and CURE, since the R*-tree incurs high overhead when processing large data sets, while CURE fails when the data set scales beyond main memory. A detailed description of the experiments can be found in Ref.[33].

4.3 DENCLUE: a generalized density method

DENCLUE [34] is a density-based clustering method that tries to generalize several other clustering algorithms. It can be viewed as a kind of survey of density-based clustering algorithms, since it can cover almost all density-based algorithms by using different influence functions and density functions. The developers of DENCLUE also state that it can generalize hierarchical algorithms and partitioning algorithms (named traditional optimization algorithms in this paper).
However, it can only model the framework of those algorithms; it cannot cover the algorithms that use representatives, no matter which functions or parameters are set.
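The generalization rests on writing the density at a location as a sum of influence functions over all data points; dense-area clusters are then regions around the local maxima of this function. The sketch below uses a Gaussian influence function, one of the choices discussed for DENCLUE; the code and its names are our illustration.

```python
import math

def gaussian_influence(x, y, sigma):
    # Influence of data point y on location x.
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """DENCLUE-style density function: the sum of the influence
    functions of all data points at location x. Swapping the
    influence function changes which classic algorithm is emulated."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

if __name__ == "__main__":
    data = [(0, 0), (0.2, 0), (0, 0.2), (5, 5)]
    print(density((0.1, 0.1), data, sigma=0.5))  # high: inside a dense area
    print(density((3.0, 3.0), data, sigma=0.5))  # low: between clusters
```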

Fig.5 Scaling-up experiments of CURE, DBSCAN, and the Hybrid algorithm: time in seconds vs. data size, up to 1,200,000 points

Since DENCLUE is in fact a density-based method, it needs parameters for calculating density, and it is robust to noise. Furthermore, its cell-based technique means that a tree-based index can be exploited, so it can handle very large data sets. It also employs a filtering technique to reduce the complexity of handling high-dimensional data; however, this introduces yet another parameter.

5 Automatic and Visualization Approaches

Since clustering is an unsupervised learning process, setting appropriate parameters is a problem for many algorithms. The above analysis shows that most clustering algorithms need some parameters; although these may be straightforward in some cases, they are difficult to set in many environments. Furthermore, current cluster representation techniques are easily understood only when the data is in a low-dimensional space. Therefore, some algorithms have been built for automatic clustering. Meanwhile, other efforts have been made to visualize the process of clustering, so that the user can set the parameters easily and the result is more understandable.

OPTICS [11] is an algorithm designed to discover cluster structure. It is essentially a density-based clustering algorithm, as DBSCAN is. The difference between OPTICS and other density-based methods is that it uses reachability plots to visualize the process of clustering. Furthermore, it introduces an automatic technique to detect the steep points, so that clusters can be discovered. By using different parameters, it can discover clusters at different density levels; a cluster structure is thus an organization of clusters of different densities.

In Ref.[35], the authors introduce an algorithm to build a multi-granularity cluster-tree. They argue that an accurate multi-granularity cluster-tree should be vertically distinguishable, horizontally distinguishable, and complete, which ensures that each node in the cluster-tree denotes a cluster at a certain granularity, while any cluster at any granularity has a corresponding node in the cluster-tree. The construction of the multi-granularity cluster-tree employs distance-based clustering in an agglomerate framework, which is the main difference between the multi-granularity cluster-tree and the cluster structure of Ref.[11]. Consequently, OPTICS treats clusters of different densities as clusters at different levels, and may treat clusters of different scales as clusters at the same level, while the multi-granularity cluster-tree treats them in the opposite way, as shown in Fig.6. The difference exists because the motivation of the multi-granularity cluster-tree is to provide a cluster management facility that eases the understanding of the clustering result, while OPTICS is designed for automatic or interactive clustering.
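The quantity OPTICS plots is the reachability distance. A naive sketch of the standard definitions follows (without the R*-tree support used by the real algorithm; the helper names are ours): the core distance of o is the distance to its min_pts-th nearest neighbor, and the reachability distance of p with respect to o is the larger of o's core distance and dist(o, p).

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def core_distance(o, points, eps, min_pts):
    """Distance from o to its min_pts-th nearest neighbor,
    or None if o is not a core point within radius eps."""
    dists = sorted(euclidean(o, q) for q in points if q is not o)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(p, o, points, eps, min_pts):
    # max(core-distance(o), distance(o, p)); None if o is not a core point.
    core = core_distance(o, points, eps, min_pts)
    if core is None:
        return None
    return max(core, euclidean(o, p))

if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1), (2, 2)]
    print(reachability_distance((0, 0.1), (0, 0), data, eps=1.0, min_pts=2))
```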

Fig.6

Some researchers in computer graphics have also developed algorithms to visualize the clustering process, such as H-BLOB [36]. The basic ideas are similar: (1) visualize the clustering process, so that the construction of clusters can be seen by the user; (2) clusters may exist at different levels when different parameters are used, whatever criterion is used.

6 Conclusions

In this paper, we analyze the existing popular clustering algorithms both theoretically and experimentally from three different viewpoints: clustering criteria, cluster representation, and algorithm framework, so that most algorithms can be covered and distinguished. This work can be the basis of: (1) analysis of the advantages and disadvantages of clustering algorithms; (2) clustering algorithm selection by data mining users; (3) automatic selection of clustering algorithms for different data sets; (4) development of self-tuning clustering algorithms; (5) construction of a clustering benchmark. The analysis shows that most current algorithms have shortcomings, while being effective or efficient for data sets with certain special characteristics. Furthermore, three algorithms that generalize or mix other algorithms are introduced and analyzed from the three viewpoints of this paper. Finally, some automatic/visualization clustering algorithms are introduced; they are the attempts of researchers to push the unsupervised learning process toward a more understandable and automatic stage.

Acknowledgement We would like to thank Dr. Wen Jin of Simon Fraser University for his suggestions on the outline and draft of this paper. We also would like to thank Dr. Joerg Sander for providing the source code of DBSCAN, and Ms. Haile Qian for helping us implement the CURE and Hybrid algorithms.

References:
[1] Fasulo, D. An analysis of recent work on clustering algorithms. Technical Report, Department of Computer Science and Engineering, University of Washington, 1999. http://www.cs.washington.edu.
[2] Baraldi, A., Blonda, P. A survey of fuzzy clustering algorithms for pattern recognition. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 1999,29:786~801.
[3] Keim, D.A., Hinneburg, A. Clustering techniques for large data sets: from the past to the future. Tutorial Notes for ACM SIGKDD 1999 International Conference on Knowledge Discovery and Data Mining. San Diego, CA: ACM, 1999. 141~181.
[4] McQueen, J. Some methods for classification and analysis of multivariate observations. In: LeCam, L., Neyman, J., eds. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967. 281~297.
[5] Zhang, T., Ramakrishnan, R., Livny, M. BIRCH: an efficient data clustering method for very large databases. In: Jagadish, H.V., Mumick, I.S., eds. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. Quebec: ACM Press, 1996. 103~114.
[6] Guha, S., Rastogi, R., Shim, K. CURE: an efficient clustering algorithm for large databases. In: Haas, L.M., Tiwary, A., eds. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. Seattle: ACM Press, 1998. 73~84.

[7] Beyer, K.S., Goldstein, J., Ramakrishnan, R., et al. When is nearest neighbor meaningful? In: Beeri, C., Buneman, P., eds. Proceedings of the 7th International Conference on Database Theory (ICDT'99). LNCS 1540, Jerusalem, Israel: Springer, 1999. 217~235.
[8] Ester, M., Kriegel, H.-P., Sander, J., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, 1996. 226~231.
[9] Ester, M., Kriegel, H.-P., Sander, J., et al. Incremental clustering for mining in a data warehousing environment. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, 1998. 323~333.
[10] Sander, J., Ester, M., Kriegel, H.-P., et al. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998,2(2):169~194.
[11] Ankerst, M., Breunig, M.M., Kriegel, H.-P., et al. OPTICS: ordering points to identify the clustering structure. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S., eds. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. Philadelphia: ACM Press, 1999. 49~60.
[12] Wang, W., Yang, J., Muntz, R. STING: a statistical information grid approach to spatial data mining. In: Jarke, M., Carey, M.J., Dittrich, K.R., et al., eds. Proceedings of the 23rd International Conference on Very Large Data Bases. Athens: Morgan Kaufmann, 1997. 186~195.
[13] Sheikholeslami, G., Chatterjee, S., Zhang, A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, 1998. 428~438.
[14] Xu, X., Ester, M., Kriegel, H.-P., et al. A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th International Conference on Data Engineering. Orlando: IEEE Computer Society Press, 1998. 324~331.
[15] Agrawal, R., Gehrke, J., Gunopulos, D., et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Haas, L.M., Tiwary, A., eds. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. Seattle: ACM Press, 1998. 94~105.
[16] Hinneburg, A., Keim, D.A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Atkinson, M.P., Orlowska, M.E., Valduriez, P., et al., eds. Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh: Morgan Kaufmann, 1999. 506~517.
[17] Guha, S., Rastogi, R., Shim, K. ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th International Conference on Data Engineering. Sydney: IEEE Computer Society Press, 1999. 512~521.
[18] Karypis, G., Han, E.H., Kumar, V. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 1999,32(8):68~75.
[19] Han, E.H., Karypis, G., Kumar, V., et al. Hypergraph based clustering in high-dimensional data sets: a summary of results. Data Engineering Bulletin, 1998,21(1):15~22.
[20] Boley, D., Gini, M., Gross, R., et al. Partitioning-based clustering for web document categorization. Decision Support Systems Journal, 1999,27(3):329~341.
[21] Gibson, D., Kleinberg, J.M., Raghavan, P. Clustering categorical data: an approach based on dynamical systems. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases.
New York: Morgan Kaufmann, 1998. 311~322.
[22] Ganti, V., Gehrke, J., Ramakrishnan, R. CACTUS: clustering categorical data using summaries. In: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining. San Diego: ACM Press, 1999. 73~83.
[23] Agrawal, R., Srikant, R. Fast algorithms for mining association rules. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94). Santiago: Morgan Kaufmann, 1994. 487~499.
[24] Kaufman, L., Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[25] Ng, R.T., Han, J. Efficient and effective clustering methods for spatial data mining. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94). Santiago: Morgan Kaufmann, 1994. 144~155.
[26] Guha, S., Rastogi, R., Shim, K. CURE: an efficient clustering algorithm for large databases. Information Systems Journal, 1998,26(1):35~58.
[27] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 1977,39(1):1~38.

[28] Lauritzen, S.L. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 1995,19:191~201.
[29] Cheeseman, P., Stutz, J. Bayesian classification (AutoClass): theory and results. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., et al., eds. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. 153~180.
[30] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998,2:283~304.
[31] Talavera, L., Bejar, J. Efficient construction of comprehensible hierarchical clusterings. In: Zytkow, J.M., Quafafou, M., eds. Principles of Data Mining and Knowledge Discovery, Proceedings of the 2nd European Symposium (PKDD'98). LNCS 1510, Nantes: Springer-Verlag, 1998. 93~101.
[32] Karypis, G., Aggarwal, R., Kumar, V., et al. Multilevel hypergraph partitioning: application in VLSI domain. In: Proceedings of the 34th Conference on Design Automation. Anaheim, CA: ACM Press, 1997. 526~529.
[33] Zhou, A., Qian, W., Qian, H., et al. A hybrid approach to clustering in very large databases. In: Cheung, D., Williams, G.J., Li, Q., eds. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining. LNCS 2035, Hong Kong: Springer-Verlag, 2001. 519~524.
[34] Hinneburg, A., Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In: Agrawal, R., Stolorz, P.E., Piatetsky-Shapiro, G., eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98). New York: AAAI Press, 1998. 58~65.
[35] Zhou, A., Qian, W., Qian, H., et al. SACT: automatic cluster-tree construction for very large spatial databases. Technical Report, Computer Science Department, Fudan University, 2001. http://www.cs.fudan.edu.cn/ch/third_web/webdb/wnqian_english.htm.
[36] Sprenger, T.C., Brunella, R., Gross, M.H. H-BLOB: a hierarchical visual clustering method using implicit surfaces. Technical Report No.341, Computer Science Department, ETH Zürich, 2000. ftp://ftp.inf.ethz.ch/pub/publications/tech-reports/3xx/341.pdf.