On the Two-level Hybrid Clustering Algorithm

Eng Yeow Cheu, Chee Keong Kwoh, Zonglin Zhou
Bioinformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798
ezlzhou@ntu.edu.sg

ABSTRACT

In this paper, we design hybrid clustering algorithms which involve two levels of clustering. At each of the levels, users can select the k-means, hierarchical or SOM clustering techniques. Unlike existing cluster analysis techniques, the hybrid clustering approach developed here represents the original data set by a smaller set of prototype vectors (cluster means) formed at the first level, which allows a clustering algorithm to divide the prototypes into groups efficiently at the second level. Since the clustering at the first level provides data abstraction, it reduces the number of samples passed to the second-level clustering. This reduction in the number of samples, and hence in computational cost, is especially important when hierarchical clustering is used in the second stage. The prototypes clustered at the first level are local averages of the data and are therefore less sensitive to random variations than the original data. The two-level hybrid clustering algorithms are evaluated empirically on four data sets.

1. INTRODUCTION

Over the years, extensive research has been carried out on determining optimal cluster analyses. Techniques for clustering have developed very rapidly, spurred mostly by the availability of computers to carry out the formidable calculations involved. These research efforts have resulted in a number of well-known algorithms, and variants are continuously being developed, each addressing specific shortcomings of its ancestors. In this paper, three general methods are selected, namely: (1) k-means, an iterative partitioning method; (2) agglomerative hierarchical clustering, a method that builds a hierarchical clustering tree from the bottom up; and (3) the Self-Organizing Map (SOM), a prominent unsupervised neural network model mapping high-dimensional data onto a two-dimensional plane. Our hybrid clustering techniques are designed on the basis of these three. An analysis of the differences in performance between the three general methods and our hybrid clustering algorithms is also given.

2. CLUSTERING ALGORITHMS

Many different algorithms are available today. Two of the algorithms that we investigate fall into two general categories: hierarchical and nonhierarchical. The third is an unsupervised clustering method, the SOM, used to find clusters in the input data and to identify an unknown data vector with one of the clusters [1].

2.1. HIERARCHICAL CLUSTERING PROCEDURE

There are basically two types of hierarchical clustering procedures: agglomerative and divisive. In agglomerative hierarchical methods, each observation starts out as its own cluster. In subsequent steps, the two closest clusters are combined into a new aggregate cluster, reducing the number of clusters by one at each step. Two groups of individuals formed at an earlier stage may join together in a new cluster. Eventually, all individuals are fused into one large cluster. In divisive methods, an initial single group of objects is divided into two subgroups such that the objects in one subgroup are far from the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects (each object forms a cluster). In both hierarchical methods, a hierarchy with a tree-like structure is constructed, usually represented as a dendrogram or tree graph. The dendrogram illustrates the mergers or divisions made at successive levels.
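
As a minimal illustration (ours, not the paper's), the snippet below builds such a dendrogram for a small synthetic data set using SciPy's standard hierarchical clustering routines; the toy data and parameter choices are assumptions made for the example.

    # Minimal sketch: bottom-up merging and the resulting dendrogram.
    # SciPy's linkage/dendrogram are real APIs; the data is illustrative.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    # Twenty 2-D "objects": two loose groups of ten points each.
    X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                   rng.normal(3.0, 0.5, (10, 2))])

    Z = linkage(X, method="average")  # one merge per row, bottom-up
    dendrogram(Z)                     # tree graph of the successive mergers
    plt.show()
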
In particular, Wishart [6] contends that the top-down decision tree approach carries an inherently greater risk of misclassification than the bottom-up approach, because it splits inefficiently on a single variable. Each classification generated in a decision tree is univariate by definition, and this limits the range of possible segments available for consideration. By comparison, the agglomerative approach is multivariate and exploratory, and allows more feasible segments to be investigated in terms of the actual distribution of the scatter. Hence, this project concentrates mainly on agglomerative hierarchical algorithms (divisive methods act almost as agglomerative methods in reverse). The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (a code sketch of these steps appears after the linkage definitions below):

1. Start with N clusters, each containing a single entity, and an N x N symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.

2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters U and V be $d_{UV}$.

3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by (a) deleting the rows and columns corresponding to clusters U and V, and (b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.

4. Repeat Steps 2 and 3 a total of N - 1 times. (All objects will be in a single cluster after the algorithm terminates.) Record the identity of the clusters that are merged and the levels (distances or similarities) at which the mergers take place.

2.2. VARIATIONS OF THE HIERARCHICAL ALGORITHM

This section describes the variants of the agglomerative hierarchical clustering algorithm implemented here: single, complete and average linkage, and Ward's method (ESS).

2.2.1. LINKAGE METHODS

The inputs to a linkage algorithm can be distances or similarities between pairs of objects. Single, complete and average linkage are the three linkage-based hierarchical clustering algorithms implemented. Table 1 shows the between-cluster distance definition for each of the methods; in this case, a dissimilarity coefficient is employed. The selection of the distance criterion or similarity coefficient depends on the application.

Table 1: Between-cluster distances $d(Q_k, Q_l)$, where $x_i \in Q_k$, $x_j \in Q_l$, and $N_k$ is the number of samples in cluster $Q_k$.

  Single:   $d_s = \min_{i,j} \, d(x_i, x_j)$
  Complete: $d_c = \max_{i,j} \, d(x_i, x_j)$
  Average:  $d_a = \frac{1}{N_k N_l} \sum_{x_i \in Q_k} \sum_{x_j \in Q_l} d(x_i, x_j)$

Single Linkage: Groups are formed from the individual entities by merging nearest neighbours, where the term nearest neighbour connotes the smallest distance or largest similarity.

Complete Linkage: The distance (similarity) between clusters is determined by the distance (similarity) between the two elements, one from each cluster, that are most distant (or least similar).

Average Linkage: Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of each pair belongs to each cluster.
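
The sketch below implements the four steps above directly, with the between-cluster distance supplied by one of the Table 1 rules. It is a naive illustrative implementation (quadratic distance matrix, cubic merge search), not the authors' code.

    # From-scratch agglomerative clustering following Steps 1-4, with the
    # Table 1 linkage rules as the between-cluster distance. Illustrative.
    import numpy as np

    def pairwise(X):
        # Euclidean distance matrix between all pairs of objects.
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    LINKAGES = {
        "single":   lambda d: d.min(),    # smallest pairwise distance
        "complete": lambda d: d.max(),    # largest pairwise distance
        "average":  lambda d: d.mean(),   # mean over all N_k * N_l pairs
    }

    def agglomerate(X, linkage="single"):
        rule = LINKAGES[linkage]
        D = pairwise(X)
        clusters = [[i] for i in range(len(X))]  # Step 1: N singleton clusters
        merges = []
        while len(clusters) > 1:
            # Step 2: find the nearest pair of clusters (U, V).
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = rule(D[np.ix_(clusters[a], clusters[b])])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            # Step 4's bookkeeping: record identity of merged clusters + level.
            merges.append((clusters[a], clusters[b], d))
            # Step 3: replace U and V with the merged cluster (UV).
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges  # the loop above ran exactly N - 1 times
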
2.2.2. WARD'S METHOD (EUCLIDEAN SUM OF SQUARES)

In Ward's method, the distance between two clusters is the sum of squares between the two clusters, summed over all variables. At each stage of the clustering procedure, the within-cluster sum of squares is minimized over all partitions obtainable by combining two clusters from the previous stage. The Euclidean Sum of Squares (ESS) for a cluster $k$ is given by:

$E_k = \sum_i c_i \sum_j w_j (x_{ijk} - \mu_{jk})^2$

where $x_{ijk}$ is the value of variable $j$ in case $i$ within cluster $k$, $c_i$ is an optional differential weight for case $i$, $w_j$ is an optional differential weight for variable $j$, and $\mu_{jk}$ is the mean of variable $j$ for cluster $k$. The total ESS over all clusters is $E = \sum_k E_k$, and the increase in the Euclidean Sum of Squares at the union of two clusters $p$ and $q$ is:

$I_{pq} = E_{p \cup q} - E_p - E_q$

Ward considers hierarchical clustering procedures based on minimizing the loss of information from joining two groups. The method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion. At each step, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS are joined.
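
As a small illustration, the following computes $E_k$ and the merge increase $I_{pq}$ for the unweighted case ($c_i = w_j = 1$); it is a sketch under that assumption, not the authors' implementation.

    # ESS of one cluster and the increase I_pq for a candidate merge.
    # Unweighted case (c_i = w_j = 1) is assumed for brevity.
    import numpy as np

    def ess(X):
        # Sum over all variables of squared deviations from the cluster mean.
        return float(((X - X.mean(axis=0)) ** 2).sum())

    def ess_increase(Xp, Xq):
        # Ward joins the pair of clusters whose union yields the smallest increase.
        return ess(np.vstack([Xp, Xq])) - ess(Xp) - ess(Xq)
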
2.3. NONHIERARCHICAL CLUSTERING PROCEDURE

Nonhierarchical procedures do not involve the tree-like construction process. Instead, these methods assign objects into clusters once the number of clusters to be formed is specified. The number of clusters may either be specified in advance or determined as part of the clustering procedure. Nonhierarchical methods start either from (1) an initial partition of items into groups or (2) an initial set of seed points, which will form the nuclei of the clusters. Nonhierarchical clustering procedures are frequently referred to as k-means clustering. MacQueen [5] suggests the term k-means for describing an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest form, the process is composed of three steps (a code sketch follows the list):

1. Partition the items into k initial clusters (or specify k initial centroids (seed points)).

2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

3. Repeat Step 2 until no more reassignments take place.

Because a matrix of distances (similarities) does not have to be determined, and the basic data do not have to be stored during the computer run, nonhierarchical methods can be applied to larger data sets than hierarchical techniques can.
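
A sketch of these steps in batch form (all items are reassigned, then all centroids recomputed) is given below; MacQueen's original procedure instead updates the two affected centroids after every single reassignment. The helper is illustrative and assumes no cluster becomes empty.

    # Batch k-means sketch: assign to nearest centroid, recompute centroids,
    # stop when no reassignment occurs. Assumes no cluster becomes empty.
    import numpy as np

    def k_means(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # seed points
        labels = np.full(len(X), -1)
        while True:
            # Step 2: Euclidean distance of every item to every centroid.
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = d.argmin(axis=1)
            if np.array_equal(new_labels, labels):  # Step 3: stable, stop
                return labels, centroids
            labels = new_labels
            centroids = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
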
2.4. SELF-ORGANIZING MAP (SOM)

The Self-Organizing Map (SOM) is an unsupervised neural network mapping high-dimensional input data onto a usually two-dimensional output space while preserving the relations between the data items. The cluster structure within the data, as well as the inter-cluster similarity, is visible from the resulting topology-preserving mapping [3, 4]. The SOM consists of units (neurons) arranged as a two-dimensional rectangular or hexagonal grid. During the training process, vectors from the data set are presented to the map in random order. The unit most similar to a chosen vector is selected as the winner and adapted to match the vector even better. Units in the neighborhood of the winner are then slightly adapted as well. The trained SOM provides a mapping of the data space onto a two-dimensional plane in such a way that similar data points are located close to each other.
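
The training process just described can be sketched as follows; the rectangular grid, Gaussian neighborhood, and exponential decay schedule are our illustrative choices rather than settings taken from the paper.

    # Minimal SOM training sketch: pick a vector, find the winning unit,
    # adapt the winner and its grid neighborhood toward the vector.
    import numpy as np

    def train_som(X, rows=10, cols=10, epochs=20, lr=0.5, radius=3.0, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(rows, cols, X.shape[1]))  # unit weight vectors
        grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                     indexing="ij")).astype(float)
        for t in range(epochs):
            decay = np.exp(-t / epochs)                # shrink lr and radius
            for x in rng.permutation(X):               # random presentation order
                # Winner: the unit whose weight vector is most similar to x.
                dist = np.linalg.norm(W - x, axis=2)
                wi, wj = np.unravel_index(dist.argmin(), dist.shape)
                # Units near the winner on the 2-D grid adapt too, more
                # weakly the farther they sit from the winner.
                g = np.linalg.norm(grid - np.array([wi, wj], float), axis=2)
                h = np.exp(-(g ** 2) / (2 * (radius * decay) ** 2))
                W += (lr * decay) * h[..., None] * (x - W)
        return W
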
3. THE EMPIRICAL STUDY

The software for all the clustering algorithms evaluated in this paper is available at [2].

Data set 1 is artificially generated to see how the algorithms perform when there are two well-separated but non-homogeneous clusters. The hybrid approach on data set 1 is performed using Ward's hierarchical clustering and single hierarchical clustering. During the first stage of the hybrid approach, Ward's method is used to find ten smaller clusters on the standardized data set 1. As can be seen from Figure 1, ten small clusters are found; no small cluster is formed with elements in both elongated clusters of data set 1. During the second-stage single hierarchical clustering, cluster analysis is performed on the ten cluster means, which are treated as new input vectors to the second stage. This hybrid approach exploits the properties of both Ward's method and single hierarchical clustering: Ward's method tends to find relatively equal-sized, hyper-spherical clusters, whereas single linkage tends to form long, elongated clusters. In this test, by combining the features of both clustering methods, the two elongated clusters of data set 1 are found (Figure 2).

Figure 1: Result after the 1st-stage Ward's hierarchical clustering on data set 1.

Figure 2: Result after the 2nd-stage single hierarchical clustering on data set 1.
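
The two-stage run just described can be sketched with SciPy as follows (Ward's method down to ten prototypes, then single linkage on the cluster means); the SciPy calls are real, but the helper itself is ours, written to mirror the description above, and it assumes no prototype cluster comes out empty.

    # Two-level hybrid sketch: Ward's method -> ten prototypes (cluster
    # means), then single linkage on the means. Illustrative only.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def two_level(X, n_prototypes=10, n_final=2):
        X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize
        # Stage 1: Ward's method -> small, roughly hyper-spherical clusters.
        lab1 = fcluster(linkage(X, method="ward"),
                        n_prototypes, criterion="maxclust")
        # The cluster means become the input vectors of the second stage.
        means = np.vstack([X[lab1 == c].mean(axis=0)
                           for c in range(1, n_prototypes + 1)])
        # Stage 2: single linkage on the means recovers the elongated shapes.
        lab2 = fcluster(linkage(means, method="single"),
                        n_final, criterion="maxclust")
        return lab2[lab1 - 1]                              # final label per sample
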
Data set 2 contains three classes of 50 instances each, where each class refers to a type of iris plant. Each instance has four continuous attributes. One class is linearly separable from the other two; the latter are not linearly separable from each other. Table 2 summarizes the results achieved by each of the clustering techniques carried out in this experimental setup, including two two-level hybrid clustering algorithms.

Table 2: Results of the clustering techniques on the raw data set 2

  Clustering method              Percentage of samples
  K-means                        89.3%
  Single                         68%
  Complete                       96%
  Average                        74%
  Ward's method                  89.3%
  Hybrid, 2nd stage complete     92.6%
  Hybrid                         82%

Data set 3 contains two classes of 690 samples. In this data set there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. In Table 3, the results achieved by clustering this data set directly with the complete and average hierarchical techniques are not as good as the results achieved using the hybrid approach. In this experimental setup, the hybrid approach combining SOM with complete hierarchical clustering achieves a better result than complete clustering on data set 3. A better result is likewise achieved by combining SOM with average hierarchical clustering than by direct average hierarchical clustering on data set 3.

Table 3: Results of the clustering techniques on data set 3

  Clustering method              Percentage of samples
  K-means                        84%
  Single                         55%
  Complete                       55%
  Average                        55%
  Ward's method                  79%
  Hybrid, 2nd stage single       55%
  Hybrid, 2nd stage complete     55%
  Hybrid, 2nd stage complete     80%
  Hybrid, 2nd stage average      76%
  Hybrid                         84%

Data set 4 contains two classes of samples, where one class is the group of patients diagnosed positively for diabetes. Each sample has eight continuous attributes. In this experimental setup, the results in Table 4 achieved by all the clustering techniques are about the same. There is a slight improvement using the hybrid approach that combines k-means clustering with complete hierarchical clustering, compared to the result achieved using complete hierarchical clustering alone on data set 4.

Table 4: Results of each of the clustering techniques on data set 4

  Clustering method              Percentage of samples
  K-means                        70%
  Single                         65%
  Complete                       67%
  Average                        65%
  Ward's method                  66%
  Hybrid, 2nd stage single       65%
  Hybrid, 2nd stage complete     70%
  Hybrid, 2nd stage complete     63%
  Hybrid, 2nd stage average      65%
  Hybrid                         65%
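
The tables report one percentage per method. The paper does not spell out how this score is computed; a common choice, assumed in the sketch below, maps each cluster to its majority true class and counts the matching samples.

    # Assumed scoring rule (not stated in the paper): each cluster counts
    # as its majority true class. Expects integer class labels 0..C-1.
    import numpy as np

    def percent_matched(true_labels, cluster_labels):
        true_labels = np.asarray(true_labels, dtype=int)
        cluster_labels = np.asarray(cluster_labels)
        hits = 0
        for c in np.unique(cluster_labels):
            members = true_labels[cluster_labels == c]
            hits += np.bincount(members).max()  # majority class in this cluster
        return 100.0 * hits / len(true_labels)
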

4. CONCLUSIONS

We compared, on the four data sets, the performance of the two-level hybrid clustering algorithms against the other clustering algorithms: k-means, SOM, single, complete, average, and Ward's hierarchical clustering. The two-level hybrid clustering algorithms hit the highest percentage of samples on all the data sets as compared to each of the other clustering algorithms alone. In particular, on data set 1, the hybrid approach using Ward's method in the first stage and single hierarchical clustering in the second stage is able to find the two well-separated, non-homogeneous clusters of the data set, whereas the other clustering methods, apart from single clustering, are not able to find the clusters for this type of data set.

REFERENCES

[1] M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis. Beverly Hills: Sage Publications, 1984.

[2] Clustan Clustering Software. Available: www.clustan.com, 2003.

[3] T. Kohonen, "Self-organizing maps: Optimization approaches," in Proceedings of the International Conference on Artificial Neural Networks, Finland, pp. 981-990, 1991.

[4] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, The Self-Organizing Map Program Package, Laboratory of Computer and Information Science, Helsinki University of Technology, 1995.

[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium, Vol. I, pp. 281-297, 1967.

[6] D. Wishart, "Efficient hierarchical cluster analysis for data mining and knowledge discovery," presented at Interface 1998, Minneapolis, USA, 1998.