An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters

Akhtar Sabzi
Department of Information Technology, Qom University, Qom, Iran
asabzii@gmail.com

Yaghoub Farjami
Department of Information Technology, Qom University, Qom, Iran
farjami@qom.ac.ir

Morteza ZiHayat
Department of Computer Science, York University, Toronto, Canada
zihayatm@cse.yorku.ca

Abstract
The k-medoids algorithm is one of the most prominent partitioning clustering techniques in data mining and knowledge discovery applications. However, the algorithm faces two major challenges: the number of clusters must be fixed in advance as an input, and the initial values of the cluster centers affect the quality of the clusters. In this paper an improved version of the fuzzy k-medoids algorithm is proposed. Applying the entropy concept as a complementary factor in the optimization problem of fuzzy k-medoids yields more accurate centers. The same factor also allows the number of clusters to be determined effectively. The results show that the proposed method outperforms fuzzy k-medoids in terms of the accuracy of the obtained centers.

Keywords: Partitioning clustering; Fuzzy k-medoids; Entropy; Optimization

I. INTRODUCTION
Clustering is an unsupervised technique for dividing data into clusters. Each cluster is formed from similar objects, so objects within one cluster closely resemble each other while objects in different clusters differ significantly. The concept of fuzziness in data clustering was introduced in [1]. In fuzzy clustering, each data point is assigned partially to every cluster. This partial assignment is represented by a real number between 0 and 1 that gives the degree of membership of each object in each cluster. Although there are various studies on clustering and fuzzy clustering [2] [3] [4] [5] [6], some issues remain open.
Fuzzy k-medoids clustering, as a partitioning clustering algorithm, struggles with two fundamental issues. First, the number of clusters must be determined in advance, because the algorithm takes it as an input for dividing the data into clusters; in real-world data sets, however, the number of clusters is unknown. Second, the initial center points are chosen randomly. This randomization produces different clusters in each run, so algorithms of this kind are very sensitive to the initial points. A common way to reduce the effect of these issues is to repeat the run several times and select the best result as the output.

Partitioning algorithms are subdivided into k-medoids and k-means methods [6]. A new method that solves the mentioned issues for k-means was developed in [3]. In addition to these problems, k-means suffers from sensitivity to noisy data [7], because each cluster center is calculated as the mean of all objects in the cluster. In contrast, k-medoids chooses as the center the object that best represents the cluster, so it is much less affected by noise. In this paper a novel k-medoids algorithm is introduced that addresses these problems of partitioning clustering methods.

The rest of the paper is organized as follows. Section II gives an overview of existing partitioning clustering algorithms. Section III proposes our method, and Section IV reports the results obtained on three artificial data sets. The conclusions are given in Section V.

II. RELATED WORKS
Partitioning clustering algorithms play an important role in machine learning and data mining, and there have been various studies on them. In k-medoids algorithms, in contrast to k-means, a particular instance is selected as the center of each cluster. The earliest and most prominent k-medoids algorithm was introduced by Kaufman and Rousseeuw under the name PAM [7].
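The difference between a mean-based center and a medoid under noise can be illustrated with a minimal sketch. The data points and the helper names `mean_center` and `medoid_center` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def mean_center(points):
    """k-means style center: the coordinate-wise mean of the points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def medoid_center(points):
    """k-medoids style center: the data point with the smallest
    total Euclidean distance to all points in the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# A tight cluster around (1, 1) plus one noisy point far away (illustrative data).
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
noisy = cluster + [(10.0, 10.0)]

mean_noisy = mean_center(noisy)      # pulled towards the noisy point
medoid_noisy = medoid_center(noisy)  # remains one of the original cluster points

print("mean with noise:  ", mean_noisy)
print("medoid with noise:", medoid_noisy)
```

In this example the single noisy point drags the mean well away from the cluster, whereas the medoid stays on an actual cluster member.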
CLARA is a modified version of PAM that is suitable for large databases [7]. When clusters overlap, fuzzy clustering is preferred. Fuzzy c-means clustering has long been popular, and Krishnapuram was the first to present fuzzy k-medoids [9]. For an overview of fuzzy clustering, see [10]. Newer versions of fuzzy clustering that try to remedy earlier problems are [3] [5] [11]. A very comprehensive study of fuzzy k-means type algorithms was carried out in [3]. That study solved the two problems of k-means type algorithms, namely the predetermined number of clusters and the sensitivity to the initial cluster centers; however, as mentioned before, k-means is sensitive to noise and does not work well in all cases. Among fuzzy k-medoids type algorithms, FCMdd [9] and FCTMdd [9] are the two earliest, both introduced by Krishnapuram. FCMdd is not robust, whereas FCTMdd is a robust version of FCMdd based on the Least Trimmed Squares idea.

978-1-4577-2152-6/11/$26.00 © 2011 IEEE

PFC [11] is a recent version of fuzzy k-medoids introduced by Mei and Chen. In PFC, each cluster is represented by more than one object, with the help of object weights. However, it still suffers from the issues raised in the introduction. An overview of the improvements in partitioning clustering is presented in Table I.

Table I. Review of recent improvements on k-means and k-medoids

Algorithm                        Fuzzy  Description                                            Authors                 Year
K-means
k-means [7]                      No     Center is the mean of the instances                    MacQueen                1967
FCM [1]                          Yes    Fuzzy c-means                                          Bezdek                  1984
Agglomerative fuzzy k-means [3]  Yes    Selects the number of clusters                         Ng, Cheung and Li       2008
SAHN                             No     Sequential agglomerative hierarchical non-overlapping
K-medoids
PAM [7]                          No     Partitioning around medoids                            Kaufman and Rousseeuw   1990
CLARA [7]                        No     Clustering large applications                          Kaufman and Rousseeuw   1990
CLARANS [7]                      No     CLARA based upon randomized search                     Ng and Han              1994
FCMdd [9]                        Yes    Fuzzy k-medoids                                        Krishnapuram            1999
FCTMdd [9]                       Yes    Robust fuzzy k-medoids                                 Krishnapuram            1999
PFC [11]                         Yes    Multiple medoids                                       Mei and Chen            2010

III. THE PROPOSED APPROACH
In this section, to address the challenges mentioned above, we propose a new fuzzy k-medoids algorithm based on instance entropy. The proposed method, referred to hereafter as Improved Fuzzy K-Medoids, consists of the following phases.

A. Prerequisites
Fuzzy clustering algorithms comprise two main stages: first, finding an appropriate function that determines the membership degree of each instance in each cluster; second, obtaining a method that calculates the cluster centers. Typically, the following objective function is employed to compute the membership degrees:

$$P(U, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ij} D_{ij} \qquad (1)$$

where $u_{ij}$ represents the degree of membership of the $i$th object $x_i$ in the $j$th cluster $z_j$, $Z$ contains the cluster centers, and $D_{ij}$ is a dissimilarity measure between the $j$th cluster center and the $i$th object. To improve the efficiency of the fuzzy clustering algorithm, this paper adds the sum of the object entropies to the objective function as a complementary factor.
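To make the role of the entropy factor concrete, the following minimal sketch evaluates a dissimilarity sum and an entropy term for a small membership matrix. It assumes the entropy-regularized form used for agglomerative fuzzy k-means in [3], that is, the dissimilarity sum plus a coefficient times the sum of u log u; the coefficient name `lam`, the function names, and the numeric values are all hypothetical.

```python
import math

def dissimilarity_sum(U, D):
    """Sum over objects i and clusters j of u_ij * D_ij."""
    return sum(u * d for u_row, d_row in zip(U, D) for u, d in zip(u_row, d_row))

def entropy_term(U):
    """Sum of u_ij * log(u_ij); never positive, and most negative when
    the memberships of an object are spread evenly over the clusters."""
    return sum(u * math.log(u) for row in U for u in row if u > 0)

def objective(U, D, lam):
    """Assumed entropy-regularized objective: dissimilarity + lam * entropy term."""
    return dissimilarity_sum(U, D) + lam * entropy_term(U)

# Two objects, two clusters (illustrative values); each row of U sums to 1.
U = [[0.9, 0.1],
     [0.5, 0.5]]
D = [[1.0, 4.0],
     [2.0, 2.0]]

print(dissimilarity_sum(U, D))
print(entropy_term(U))
print(objective(U, D, lam=1.0))
```

Under this assumed form, a larger coefficient rewards memberships that are spread across clusters, while a coefficient of zero reduces the objective to the plain dissimilarity sum.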
Thus formula (1) plus the sum of the object entropies forms the modified objective function, with membership degrees $u_{ij}$ and dissimilarities $D_{ij}$:

$$P(U, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ij} D_{ij} + \lambda \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \log u_{ij} \qquad (2)$$

subject to

$$\sum_{j=1}^{k} u_{ij} = 1, \quad u_{ij} \in (0, 1], \quad 1 \le i \le n \qquad (3)$$

where $\lambda$ is the coefficient of the entropy term. Euclidean distance is used as the dissimilarity criterion:

$$D_{ij} = \lVert x_i - z_j \rVert$$

Partial optimization over U and Z is a common method for optimizing P. In this method, U is first fixed and the reduced P is minimized with respect to Z; then Z is fixed and the reduced P is minimized with respect to U. Consequently, U is obtained as follows:

$$u_{ij} = \frac{\exp(-D_{ij}/\lambda)}{\sum_{l=1}^{k} \exp(-D_{il}/\lambda)}$$

As is evident, the value of U depends on the coefficient $\lambda$. Empirical results show that the appropriate value of $\lambda$ depends on the type of the data objects: data with small values call for a small $\lambda$, and data with large values call for a large $\lambda$. Moreover, it was demonstrated in [8] that the value should lie in a certain interval. If it is too large, the number of clusters found converges to 1; if it is too small, more clusters are found than actually exist.

The second stage of fuzzy clustering, finding the cluster centers, is performed in k-medoids type algorithms as follows [9]:

For j = 1 to k
    $q = \arg\min_{1 \le l \le n} \sum_{i=1}^{n} u_{ij} \lVert x_i - x_l \rVert$
    $z_j = x_q$
End for

The fuzzy k-medoid algorithm based on these modifications is presented in Fig. 1.

Fuzzy k-medoid algorithm:
Input: coefficient, initial value of Z
While (1)
    1. Compute the value of U by (3)
       Determine the value of P(U, Z) by (1)
       Set P = P(U, Z)
       If P_previous = P then END
    2. Compute the value of Z by (4)
       Determine the value of P(U, Z) by (1)
       If P_previous = P then END
End while
Output: the values of U and Z

Figure 1 - Fuzzy k-medoids algorithm

B. Improved Fuzzy k-medoids
The proposed algorithm is inspired by agglomerative algorithms. An agglomerative clustering starts with every object as its own cluster and repeatedly applies a merging method until an accurate grouping of the objects is established [3]. Consequently, the presented algorithm starts with a large number of clusters as an input parameter, and the cluster centers Z are optimized in a loop. The fuzzy k-medoid algorithm introduced above is used to compute Z. In each cycle of the loop, U and Z are computed by the fuzzy clustering algorithm, and then the closest pair of clusters is determined and merged. This procedure continues until the number of clusters reaches one (see Fig. 2). The validation index proposed in [12] is used to determine which set of Z values is the best one. The improved fuzzy k-medoid algorithm is presented in Fig. 2. The MergeDBMSDC algorithm introduced by Khan [13] is used for merging the clusters.

2011 11th International Conference on Hybrid Intelligent Systems (HIS)

IV. EXPERIMENTAL RESULTS
To evaluate the proposed approach, three experiments were carried out, and all results support the effectiveness of the proposed method. All data used in the three experiments were generated synthetically under various conditions, in order to test the algorithm under different conditions.

A. Experiment 1
This experiment aims to demonstrate the ability of the algorithm to obtain the right number of clusters. In the first data set, 4500 points are produced by a combination of three bivariate Gaussian densities given by (6), where Gaussian[X, Y] is a Gaussian normal distribution with mean X and covariance matrix Y. The synthetic data set with 10 initial cluster centers is shown in Fig. 3(a). Fig. 3 shows the stages of reaching the correct number of clusters. According to Fig. 3, the centers obtained by the improved fuzzy k-medoids algorithm are clearly more accurate.
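A data set of the kind used in Experiment 1 can be sketched as follows. The component means are the real centers reported in Table II; everything else is an assumption made for illustration, since the mixture parameters are not reproduced here: equal component sizes (1500 points each), an isotropic standard deviation `sigma = 0.2`, and initial centers drawn at random from the data. The function name `sample_mixture` is hypothetical.

```python
import random

def sample_mixture(means, sigma, n_per_component, seed=0):
    """Draw points from isotropic bivariate Gaussians centered at `means`.
    Equal component sizes and a shared sigma are assumptions of this sketch."""
    rng = random.Random(seed)
    points = []
    for mx, my in means:
        for _ in range(n_per_component):
            points.append((rng.gauss(mx, sigma), rng.gauss(my, sigma)))
    return points, rng

# Real centers as listed in Table II.
true_centers = [(1.0, 1.0), (1.0, 2.5), (2.5, 2.5)]

# 3 components x 1500 points = 4500 points, as in Experiment 1.
data, rng = sample_mixture(true_centers, sigma=0.2, n_per_component=1500)

# Start from 10 initial centers (medoids must be data points).
initial_medoids = rng.sample(data, 10)

print(len(data), len(initial_medoids))
```

Drawing the initial centers from the data itself keeps them valid medoids, which is the only property the sketch relies on.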
Table II shows the positions of the true cluster centers, the output of plain fuzzy k-medoids, and the result of the improved fuzzy k-medoids algorithm.

Table II. Comparison between the real centers and the results of fuzzy k-medoids and improved fuzzy k-medoids

Real        Fuzzy k-medoids    Improved fuzzy k-medoids
(1,1)       (0.9854,0.9257)    (1.0256,0.9859)
(1,2.5)     (1.0288,2.3964)    (1.0635,2.4825)
(2.5,2.5)   (2.4908,2.4513)    (2.5121,2.4795)

B. Experiment 2
This experiment shows that, as the number of clusters increases, the algorithm still works well and obtains better centers than plain fuzzy k-medoids. In this experiment, 5000 points in 7 clusters are constructed using a mixture of three normal distributions. Table III presents the centers obtained by fuzzy k-medoids and by improved fuzzy k-medoids. Moreover, Fig. 4 depicts the results of Experiment 2, where the improvement is evident.

Table III. Comparison between the real centers and the results of fuzzy k-medoids and improved fuzzy k-medoids

Real centers   Fuzzy k-medoids        Improved fuzzy k-medoids
(10,5)         (1.1852,8.1558)        (9.6646,3.8116)
(40,50)        (37.9243,43.1686)      (31.4578,52.3654)
(50,175)       (29.9187,156.9329)     (49.2359,175.9832)
(60,80)        (62.8865,76.4544)      (61.0091,82.1413)
(90,35)        (81.7268,47.4920)      (89.1618,34.5074)
(150,79)       (120.3511,63.2428)     (120.9105,78.209)
(100,120)      (114.4468,106.8432)    (99.7343,119.6416)

Improved fuzzy k-medoid algorithm:
Input: initial number of clusters K*, chosen to be large; coefficient; initial value of Z, selected randomly; t = 2.
While (k != 1)
    1. Fuzzy k-medoid algorithm
    2. Determine K_merge; used MergeDBMSDC
    3. k = K* - K_share
    4. Save the U and Z for this k
    5. t = t + 1
End while
Output: the values of U and Z with the least validation index

Figure 2 - Improved fuzzy k-medoid algorithm
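Step 1 of Figure 2, the fuzzy k-medoid iteration of Figure 1, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the membership update assumes the entropy-regularized rule in which each membership is proportional to exp(-D/lam), as in [3]; the loop stops when the medoids no longer change rather than when P stops changing; and the names `update_memberships`, `update_medoids`, `fuzzy_k_medoids`, and `lam` are hypothetical. The merging step (MergeDBMSDC) and the validation index are not sketched.

```python
import math

def update_memberships(points, medoids, lam):
    """Assumed rule: u_ij proportional to exp(-D_ij / lam), normalized over clusters."""
    U = []
    for x in points:
        d = [math.dist(x, z) for z in medoids]
        d_min = min(d)  # shift for numerical stability; it cancels in the ratio
        w = [math.exp(-(dj - d_min) / lam) for dj in d]
        total = sum(w)
        U.append([wj / total for wj in w])
    return U

def update_medoids(points, U, k):
    """For each cluster j, pick the data point that minimizes the
    membership-weighted sum of distances to all points."""
    medoids = []
    for j in range(k):
        best = min(
            points,
            key=lambda c: sum(U[i][j] * math.dist(x, c) for i, x in enumerate(points)),
        )
        medoids.append(best)
    return medoids

def fuzzy_k_medoids(points, medoids, lam, max_iter=50):
    """Alternate the two updates until the medoids stop changing."""
    for _ in range(max_iter):
        U = update_memberships(points, medoids, lam)
        new_medoids = update_medoids(points, U, len(medoids))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    # Recompute U so that it matches the returned medoids.
    return update_memberships(points, medoids, lam), medoids

# Two well-separated groups (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
U, medoids = fuzzy_k_medoids(points, [(0.0, 1.0), (10.0, 11.0)], lam=1.0)
print(medoids)
```

The medoid search here compares every point against every other point for each cluster, which is adequate for a sketch but quadratic in the number of objects.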

Figure 3 - Three steps of the obtained centers during Experiment 1. Red points show the results of fuzzy k-medoids and black points show the results of improved fuzzy k-medoids. (a) Stage 1: start with 10 initial input centers. (b) Stage 3: centers obtained after 3 cycles. (c) Final stage: the right number of clusters is obtained.

Figure 4 - Results of Experiment 2. Red points show the results of fuzzy k-medoids and black points show the results of improved fuzzy k-medoids. (a) First stage. (b) Final stage.

C. Experiment 3
In this experiment the data points include some noisy points. To create the noise, a mixture of four bivariate Gaussian densities is employed as follows:

V. CONCLUSION
Many studies have built on partitioning clustering, which is practical and useful. In this paper we proposed a new version of the fuzzy k-medoids algorithm, named Improved Fuzzy K-Medoids, which addresses two weaknesses of partitioning algorithms: the predetermined number of clusters and the sensitivity to noise. The empirical numerical results indicate that the method is successful. In comparison with fuzzy c-means, improved fuzzy k-medoids gives successful results, as shown in Fig. 5. The resulting cluster center positions of the two algorithms are shown in Table IV.

Table IV. Comparison between the real centers and the results of fuzzy k-means and improved fuzzy k-medoids

Real        Fuzzy k-means      Improved fuzzy k-medoids
(1,1)       (0.9739,0.9995)    (0.9976,1.0141)
(1,2.5)     (1.1010,2.9046)    (1.0809,2.7814)
(2.5,2.5)   (2.4599,2.4954)    (2.4771,2.4853)

Figure 5 - Comparison between improved fuzzy k-medoids and FCM. Red points represent the FCM results and black points show the improved fuzzy k-medoids results.
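The comparison in Table IV can be quantified by the Euclidean distance between each obtained center and the corresponding real center. The sketch below uses the values of Table IV as listed; it assumes that the last row of values belongs to the improved fuzzy k-medoids method, and the function name `center_errors` is hypothetical.

```python
import math

# Values as listed in Table IV.
real = [(1.0, 1.0), (1.0, 2.5), (2.5, 2.5)]
fuzzy_k_means = [(0.9739, 0.9995), (1.1010, 2.9046), (2.4599, 2.4954)]
improved = [(0.9976, 1.0141), (1.0809, 2.7814), (2.4771, 2.4853)]

def center_errors(estimates, truth):
    """Euclidean distance between each estimated center and its real center."""
    return [math.dist(e, t) for e, t in zip(estimates, truth)]

errors_fkm = center_errors(fuzzy_k_means, real)
errors_improved = center_errors(improved, real)

for t, a, b in zip(real, errors_fkm, errors_improved):
    print(t, "fuzzy k-means:", round(a, 4), "improved:", round(b, 4))
```

For these values the error of the improved method is smaller than that of fuzzy k-means for each of the three centers.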

REFERENCES
[1] J. C. Bezdek and R. Ehrlich, "FCM: The fuzzy c-means clustering algorithm," in Computers & Geosciences, 1984.
[2] A. P. Reynolds, G. Richards and V. J. Rayward-Smith, "The Application of K-medoids and PAM to Clustering of Rules," in Intelligent Data and Automated Learning, 2004.
[3] M. Li, M. K. Ng and Y. Cheung, "Agglomerative Fuzzy K-means Clustering Algorithm with Selection of Number of Clusters," in IEEE Transactions on Knowledge and Data Engineering, 2008.
[4] A. Keller, "Fuzzy clustering with outliers," in Fuzzy Information Processing Society, 2000.
[5] H. Shah and J. Undercoffer, "Fuzzy clustering for intrusion detection," in Fuzzy Systems, 2003.
[6] W. Li, "Modified K-Means Clustering Algorithm," in Congress on Image and Signal Processing, 2008.
[7] P. Berkhin, "Survey of clustering data mining techniques," 2002.
[8] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[9] R. Krishnapuram, A. Joshi and L. Yi, "A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering," in Fuzzy Systems, 1999.
[10] A. Baraldi and P. Blonda, "A survey of fuzzy clustering algorithms for pattern recognition," in Systems, Man, and Cybernetics, 1999.
[11] J. Mei and L. Chen, "Fuzzy clustering with weighted medoids for relational data," in Pattern Recognition, 2010.
[12] H. Sun, S. Wang and Q. Jiang, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters," in Pattern Recognition, 2004, vol. 37, pp. 2027-2037.
[13] S. S. Khan and A. Ahmad, "Cluster center initialization algorithm for K-means clustering," in Pattern Recognition Letters, 2004.
[14] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell: Kluwer Academic Publishers, 1981.