Available online www.jocpr.com

Journal of Chemical and Pharmaceutical Research, 2013, 5(12):745-749

Research Article    ISSN: 0975-7384    CODEN(USA): JCPRC5

K-means algorithm in the optimal initial centroids based on dissimilarity

Wang Shunye, Cui Yeqi, Ji Zuotao and Liu Xiyuan

Department of Computer Science and Technology, Langfang Teachers University, China

ABSTRACT

The k-means clustering algorithm is one of the most popular clustering algorithms and has been applied in many fields. A major problem of the basic k-means clustering algorithm is that the clustering result heavily depends on the initial centroids, which are chosen at random. At the same time, the algorithm is not suitable for sparse spatial datasets when space distance is used as the similarity measurement. In this paper, an improved k-means clustering algorithm in the optimal initial centroids based on dissimilarity is proposed. It adopts the dissimilarity to reflect the degree of correlation between data objects, and then uses a Huffman tree to find the initial centroids. Many experiments confirm that the proposed algorithm is an efficient algorithm with better clustering accuracy at the same main time complexity.

Key words: k-means, initial centroids, Huffman tree, dissimilarity

INTRODUCTION

These days many datasets are produced by a variety of scientific disciplines and real life, and data generation, collection and analysis are becoming the main activity in research. The data are gathered whenever, wherever and however possible, and should yield value in different fields. Data mining is the process of finding useful information in large data warehouses. Many data mining techniques are used to discover important patterns in datasets and to predict future behaviour. Cluster analysis is the most important unsupervised-learning method. Its main purpose is to find a structure in a collection of unlabelled data. In general, clustering involves partitioning a given dataset into some groups of data whose members are similar in some way.
Clustering analysis has been widely used in data recovery, text and web mining, pattern recognition, image segmentation and software reverse engineering [1]. K-means clustering is a popular clustering algorithm. It partitions a dataset into k groups in the vicinity of its initialization, such that similar data objects are grouped in the same cluster while dissimilar data objects are in different clusters. However, the k-means clustering algorithm also has some limitations. (1) k, the number of clusters, is a user parameter; choosing it needs much professional knowledge, and a good clustering with a smaller k can have a lower SSE (Sum of the Squared Error) than a poor clustering with a higher k [2]. (2) The algorithm heavily depends on the initial conditions and is sensitive to the order of input; it often converges to a local rather than a global optimum. (3) It has many problems with outliers; how to detect and mine them is also important. (4) Clustering may produce new problems on datasets with high-dimensional and sparse characteristics, for example the curse of dimensionality. (5) It may also produce empty clusters. (6) It has problems when clusters are of differing sizes, densities or non-globular shapes. Recently, many improved k-means clustering algorithms have been proposed to solve the initial centroids problem. General solutions include using multiple runs, clustering a sample first, and bisecting k-means, which is not as susceptible to initialization issues [2]. J. C. Bezdek raised fuzzy c-means, in which an object belongs to all clusters with a weight, and the sum of the weights is 1 [3]. Redmond [4] proposed a method in which initial centroids are selected
through combining the density of the data distribution and the kd-tree. Han Lingbo [5] improved the initial centroids through the density of data and the average distance. Tong Xuejiao [6] constructed k clusters, then decided whether or not each data object belongs to a cluster depending on a threshold. Zhang Jing [7] presented a method to improve the initial centroids through the individual silhouette coefficient. High quality clustering is to obtain high intra-cluster similarity and low inter-cluster similarity. How the similarity is measured influences the results of the clustering. Many similarity measurements are chosen to meet different applications or data types. Most algorithms adopt a traditional similarity based on spatial distance to describe the relationship between data objects. This includes the Euclidean, Manhattan, Minkowski [8] and Chebychev distances, especially the Euclidean. They are good in low-dimensional data space but fail in dealing with high-dimensional datasets. With the characteristics of sparseness and the empty space phenomenon, the traditional methods are greatly degraded in high-dimensional space, and the results become unstable [9]. And many papers use similarity to measure the relationship between data objects [10]. In this paper, an improved k-means clustering algorithm based on dissimilarity to optimize the initial centroids is proposed. It draws lessons from the Huffman tree in Wu Xiaorong [11]. Dissimilarity is adopted instead of the space distance method. And it uses the dimension contribution rate to reflect the importance of different attributes to the clustering results. So it can also be used in dimension reduction in order to improve efficiency. The IRIS, Wine and Balance-scale datasets in UCI [12] are chosen for training. Experiments show that the proposed algorithm is good in accuracy rate, especially in high-dimensional space.

K-MEANS CLUSTERING ALGORITHM

The k-means clustering algorithm is one of the top ten data mining algorithms [13]. A description of the basic algorithm follows. The dataset D = {x1, x2, ..., xm} is assumed.
The first k data objects are chosen at random as the initial centroids. The k is a user parameter, the number of clusters desired. Each data object is then assigned to the nearest initial centroid. The idea is to choose random cluster centroids, one for each cluster. The centroid of each cluster is then updated as the mean of the objects assigned to that cluster. Then the assignment is repeated and the centroids are updated until no data object changes, which means no object moves from one cluster to another or, equivalently, each centroid remains the same compared with the previous iteration.

Algorithm 1: the basic k-means clustering algorithm
Choose k objects as the initial centroids at random
Repeat
  Assign each object to the nearest cluster center
  Recompute the cluster center of each cluster
Until the convergence criterion is met

The time complexity of the basic k-means clustering algorithm is O(k*l*m*d), where k represents the number of clusters, l is the number of iterations needed to meet the convergence criterion, m is the size of the dataset, and d is the number of attributes. So the numbers k, l, m and d all influence the efficiency of the algorithm.

IMPROVED K-MEANS CLUSTERING ALGORITHM

Formal definition
In order to explain the algorithm proposed in this paper, the relevant definitions are introduced as follows.

Definition 1: The dataset D is defined as D = {x1, x2, ..., xm}; its size is m, and each object has many attributes, whose number is d.

Definition 2: Attribute dissimilarity ad. For a dataset D, with xi in D, xj in D, and n representing any attribute, the attribute dissimilarity of xi and xj on the attribute n is:

    ad_n(i, j) = |x_in - x_jn| / (x_n,max - x_n,min)    (1)

Here x_in is the value of xi in the attribute n, x_jn is the value of xj in the attribute n, x_n,max is the maximal value of the dataset in the attribute n, and x_n,min is the minimal value of the dataset in the attribute n.
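A minimal sketch of formula (1) may make the per-attribute normalization concrete. The function name and the tiny dataset below are illustrative assumptions, not from the paper:

```python
def attribute_dissimilarity(data, i, j, n):
    """ad_n(i, j): |x_in - x_jn| scaled by the range of attribute n (formula 1)."""
    column = [row[n] for row in data]
    x_max, x_min = max(column), min(column)
    if x_max == x_min:  # constant attribute: objects cannot differ on it
        return 0.0
    return abs(data[i][n] - data[j][n]) / (x_max - x_min)

# Tiny example dataset: 3 objects, 2 attributes.
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(attribute_dissimilarity(data, 0, 1, 0))  # |1-2| / (3-1) = 0.5
```

Dividing by the attribute range keeps every ad_n in [0, 1], which is what lets attributes with very different value ranges be combined later.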
Because dimensional heterogeneity exists in huge datasets, the value range of each attribute is quite different. Data preprocessing, which converts raw data into suitable information, is very important. It normalizes the dataset in order to avoid the influence of the different dimensions on the data.

Definition 3: Object dissimilarity od. The object dissimilarity of xi and xj in the dataset D is:

    od(i, j) = sum_{n=1..d} w_n * ad_n(i, j)    (2)

Here w_n is the dimension contribution rate, weighting the different influence of each attribute in the clustering procedure. It ranges from 0 to 1, can be obtained from different expressions, and can also be obtained from experts in practical applications. The w_n in [9] is adopted in this paper.

Definition 4: Dissimilarity matrix dm (m*m). The dissimilarity matrix of the given dataset D is:

    dm = | od(1,1)                            |
         | od(2,1)  od(2,2)                   |
         | ...      ...      ...              |
         | od(m,1)  od(m,2)  ...    od(m,m)   |    (3)

The optimal initial centroids based on dissimilarity
Formula (1) is used to calculate the dissimilarity between each pair of data objects on each attribute. Formula (2) is used to calculate the dissimilarity between each pair of data objects over all attributes. The value of od(i,j) reflects the degree of correlation between xi and xj: the smaller the value, the closer the objects, and the greater the possibility of partitioning them into the same cluster. And formula (3) is used to create the dissimilarity matrix dm. It is a symmetric matrix. A Huffman tree is a tree with the shortest weighted path length. The Huffman tree is used to calculate the initial centroids. The dissimilarity defined above is adopted to measure the difference between data objects, and the dissimilarity matrix is used to store all the values. The smallest value in the initial dissimilarity matrix is selected, which means that the two objects are most likely to be in the same cluster.
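Under one reading of formulas (2)-(3) and the Huffman-style merging, the centroid search can be sketched as below. All names, the equal weights w_n, and the toy dataset are illustrative assumptions; the sketch stops merging once k objects remain, which gives the same k values as cutting k-1 nodes from the root of the fully built Huffman tree:

```python
def object_dissimilarity(a, b, w):
    """od(i, j) = sum over attributes n of w_n * ad_n(i, j) (formula 2).

    Assumes the data are already range-normalized, so |a_n - b_n| plays
    the role of ad_n(i, j)."""
    return sum(wn * abs(an - bn) for wn, an, bn in zip(w, a, b))

def initial_centroids(data, k, w):
    """Huffman-style merging: repeatedly average the closest pair until k remain."""
    objects = [list(row) for row in data]
    while len(objects) > k:
        # Smallest entry of the (symmetric) dissimilarity matrix dm (formula 3).
        i, j = min(((i, j) for i in range(len(objects)) for j in range(i)),
                   key=lambda p: object_dissimilarity(objects[p[0]], objects[p[1]], w))
        # The paper merges to the average of the two objects, not the sum.
        merged = [(x + y) / 2 for x, y in zip(objects[i], objects[j])]
        objects = [o for idx, o in enumerate(objects) if idx not in (i, j)] + [merged]
    return objects

data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
w = [0.5, 0.5]  # assumed equal dimension contribution rates
print(initial_centroids(data, 2, w))  # roughly [[0.05, 0.0], [0.95, 1.0]]
```

Each merge shrinks the dissimilarity matrix from dm(m, m) to dm(m-1, m-1), exactly as the text describes; the two well-separated pairs above collapse into two averaged centroids.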
We compute the average value of the two objects, not the sum, as a new object, delete the two objects from the dataset, recompute od(i,j) to get a new dissimilarity matrix dm(m-1, m-1), and repeat the procedure until one object remains, following the Huffman algorithm. According to the Huffman tree and the value of k, k-1 nodes are found from the root towards the leaf nodes. When they are deleted, k sub-trees are left. The values of the k sub-trees are the initial centroids, which are used in the basic k-means clustering algorithm.

The description of the improved algorithm
The improved algorithm uses the initial centroids which come from the Huffman tree. It is based on the dissimilarity to describe the degree of correlation. The rest of the procedure is the same as in the basic k-means clustering algorithm. The improved algorithm is described as follows.

Algorithm 2: the improved k-means clustering algorithm
Input the dataset D with m objects, each object with d attributes, and k, the number of clusters
Calculate the ad and od values, and get the dissimilarity matrix dm
Construct the Huffman tree according to the dissimilarity matrix dm
Delete k-1 nodes from the Huffman tree, leaving k sub-trees; take the k sub-tree node values as the initial centroids
Repeat
  Assign each object to the nearest cluster center
  Recompute the cluster center of each cluster
Until the convergence criterion is met

Algorithm 2 shows the procedure of the improved k-means clustering algorithm. The time complexity is affected by the size of the dataset (m), the number of attributes (d), the number of iterations (l) and the number of clusters (k). The time complexity of computing the dissimilarity is O(m*d), identical with that of the distance-based method in [14]. The time complexity of constructing the Huffman tree is O(m*logm). The time complexity of clustering is O(m*k*l*d). The total complexity of the improved algorithm is O(m*d + m*logm + m*k*l*d). Although this algorithm spends more time on the Huffman algorithm, the value of logm is very small, and the algorithm's time
consumption mainly depends on the basic k-means clustering algorithm. The main time complexity is O(m*k*l*d). The data size, the number of iterations and the number of attributes are the main factors in clustering. The Huffman tree also diminishes the number of iterations, so it drops the time consumption. At the same time the clustering result is stable and depends less on the initial centroids.

RESULTS AND DISCUSSION

In order to evaluate the improved k-means clustering algorithm, the standard datasets IRIS, Wine and Balance-scale were chosen from the UCI machine learning repository. They all have 3 clusters, and the number of data objects in each cluster is shown in Table 1.

Table 1: The number of data objects in each cluster

Cluster          IRIS   Wine   Balance-scale
first cluster     50     59     49
second cluster    50     71    288
third cluster     50     48    288
sum              150    178    625

Table 2 describes the accuracy rate, as defined in [14], of the improved algorithm in this paper. It shows that the accuracy rate is always above that of the distance-based algorithm in [14], especially on the big dataset and the high-dimensional dataset.

Table 2: Accuracy rate of the improved algorithm

Names           clusters   IRIS     Wine    Balance-scale
Dataset         first       50       64      56
                second      47       58     259
                third       53       56     310
right           first       50       51      45
                second      44       56     218
                third       46       39     251
wrong           first        0       13      11
                second       3        2      41
                third        7       17      59
accuracy_rate             93.33%   82.02%   82.24%

Table 3 describes the final cluster centers, as defined in [14], on the IRIS dataset for each algorithm (Center1 denotes the standard, Center2 the basic k-means, Center3 the clustering algorithm using a Huffman tree based on distance, and Center4 the improved algorithm in this paper). As is seen from Table 3, the improved algorithm in this paper is closest to the standard cluster centers. It is better than the algorithm using a Huffman tree based on distance. The Wine and Balance-scale datasets also give the same results. This means the dissimilarity including the dimension contribution rate is more suitable for big, high-dimensional datasets.
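The accuracy_rate row in Table 2 is simply the "right" counts summed and divided by the dataset sizes from Table 1; a quick arithmetic check (a sketch, with the figures copied from the tables):

```python
# "right" counts per cluster from Table 2; dataset sizes from Table 1.
right = {"IRIS": [50, 44, 46], "Wine": [51, 56, 39], "Balance-scale": [45, 218, 251]}
total = {"IRIS": 150, "Wine": 178, "Balance-scale": 625}

for name in right:
    rate = sum(right[name]) / total[name]
    print(f"{name}: {rate:.2%}")  # 93.33%, 82.02%, 82.24%
```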
Table 3: The final cluster centers on IRIS

Clusters        Center1                   Center2                   Center3                   Center4
first cluster   5.006 3.418 1.464 0.244   5.830 3.511 1.476 0.250   5.006 1.464 3.418 0.224   5.006 3.418 1.464 0.244
second cluster  5.936 2.770 4.260 1.326   5.756 2.716 4.026 1.118   5.901 2.748 4.394 1.434   5.905 2.779 4.278 1.354
third cluster   6.588 2.970 5.552 2.026   6.315 2.895 5.125 1.803   6.850 3.073 5.742 2.071   6.628 2.927 5.635 2.051

CONCLUSION

The k-means clustering algorithm is the second in the top ten data mining algorithms. But the algorithm has encountered many limitations. This paper presents an improved k-means clustering algorithm in the optimal initial centroids based on dissimilarity. It adopts the dissimilarity to reflect the degree of correlation between data objects, then uses a Huffman tree to find the initial centroids. So it resolves the problem that the clustering results are sensitive to the initial centroids in the basic k-means clustering algorithm. It consumes less time than the basic k-means with the same values of m, k and d, because the Huffman algorithm diminishes the number of iterations. Many experiments show that the improved algorithm has a better accuracy rate and better clustering results. However, this new algorithm based on dissimilarity still has problems for further research. We consider the dimension contribution rate to weight the different influence of each attribute in clustering. The next steps of the research are how to define the
dimension contribution rate in different fields and datasets, and how to improve the algorithm's efficiency by reducing the number of attributes d through principal component analysis based on the dimension contribution rate.

Acknowledgement
This work was supported in part by the Natural Science Foundation of Langfang Teachers University in 2013 (LSZY201306).

REFERENCES

[1] Elham Karoussi, Data mining, the k clustering problem, University of Agder, 2012.
[2] Tan, Steinbach, Kumar, The k-means cluster, http://www.cs.uvm.edu/~xwu/kdd/slides/Kmeans-ICDM06.pdf, 2006.
[3] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Cornell University, Ithaca, NY, 1973.
[4] Redmond S J, Heneghan C, Pattern Recognition Letters, 2007, 28(8):965-973.
[5] Han Lingbo, Wang Qiang, Jiang Zhengfeng, Computer Engineering and Applications, 2010, 46(17):150-152.
[6] Fu Desheng, Zhou Chen, Journal of Computer Applications, 2011, 31(2):432-434.
[7] Zhang Jing, Duan Fu, Computer Engineering and Design, May 2013(5):1691-1694.
[8] B. Shanmugapriya, M. Punithavalli, International Journal of Computer Applications, April 2012(8):26-32.
[9] Wang Xiaoyang, Zhang Hongyuan, Shen Liangzhong, Chi Wanle, Computer Technology and Development, May 2013(23):30-33.
[10] Huang Maida, Chen Qimai, Microcomputer Information, 2009(27):187-188, 198.
[11] Wu Xiaorong, Research on problems related to the initial center selection in k-means clustering algorithm, Hunan University, May 2008.
[12] UCI machine learning repository, http://archive.ics.uci.edu/ml/.
[13] Tan P N, Steinbach M, Kumar V, Introduction to Data Mining, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2010.
[14] Wang Shunye, An Improved K-means Clustering Algorithm Based on Dissimilarity, Proceedings of the 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), December 20-22, 2013, China: 2629-2633.