An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters


Akhtar Sabzi, Department of Information Technology, Qom University, Qom, Iran, asabzii@gmail.com
Yaghoub Farjami, Department of Information Technology, Qom University, Qom, Iran, farjami@qom.ac.ir
Morteza ZiHayat, Department of Computer Science, York University, Toronto, Canada, zihayatm@cse.yorku.ca

Abstract - The k-medoids algorithm is one of the most prominent partitioning clustering techniques in data mining and knowledge-discovery applications. However, it faces two major challenges: the number of clusters must be supplied as an input, and the initial values of the cluster centers strongly affect the quality of the resulting clusters. In this paper an improved version of the fuzzy k-medoids algorithm is proposed. Applying the entropy concept as a complementary factor in the fuzzy k-medoids optimization problem yields more accurate centers, and the same factor allows the number of clusters to be determined effectively. The results show that the proposed method outperforms fuzzy k-medoids in terms of the accuracy of the obtained centers.

Keywords - partitioning clustering; fuzzy k-medoids; entropy; optimization

I. INTRODUCTION

Clustering is an unsupervised technique developed to divide data into clusters, each formed from similar objects: objects within one cluster are highly similar, while objects in different clusters differ significantly. The concept of fuzziness in data clustering was introduced by [1]. In fuzzy clustering, each data point is assigned partially to every cluster; this partial assignment is represented by a number between 0 and 1 that expresses the degree of membership of each object in each cluster. Although there are various studies on clustering and fuzzy clustering [2] [3] [4] [5] [6], some issues remain open.
Fuzzy k-medoids clustering, as a partitioning clustering algorithm, struggles with two fundamental issues. First, the number of clusters must be determined in advance: the algorithm takes it as an input when dividing the data into clusters, but in real-world data sets the number of clusters is unknown. Second, the initial values of the center points are chosen randomly; this randomization produces different clusters in each run, so these algorithms are very sensitive to the initial points. To reduce the effect of these issues, the algorithm is commonly run repeatedly and the best result is selected as the output.

Partitioning algorithms are subdivided into k-medoids and k-means methods [6]. A method developed in [3] solves the mentioned issues for k-means. Beyond these problems, however, k-means also suffers from sensitivity to noisy data [7], because each cluster center is computed as the mean of all objects in the cluster. In contrast, k-medoids selects as the center the object that best represents the cluster, so the algorithm is far less affected by noise. In this paper a novel k-medoids algorithm is introduced that addresses these problems of partitioning clustering methods.

The rest of the paper is organized as follows. Section II gives an overview of existing partitioning clustering algorithms. Section III proposes our method, and Section IV reports promising results obtained on three artificial data sets. Conclusions are given in Section V.

II. RELATED WORK

Partitioning clustering algorithms play an important role in machine learning and data mining, and there have been various studies on them. In k-medoids algorithms, in contrast to k-means, a particular instance is selected as the center of each cluster. The earliest and most prominent k-medoids method, PAM, was introduced by Kaufman et al. [7].
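The robustness argument above can be illustrated with a minimal sketch (not taken from the paper): the medoid is an actual data object minimizing the total dissimilarity to all other objects, so a single extreme outlier cannot drag it far, whereas the mean moves substantially.

```python
import numpy as np

def medoid(points):
    """Return the point with the smallest sum of distances to all others."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

# Three genuine cluster members plus one noisy outlier.
cluster = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [100.0, 100.0]])
mean_center = cluster.mean(axis=0)   # dragged toward the outlier
medoid_center = medoid(cluster)      # stays on a genuine cluster member
```

Here `mean_center` lands far from the dense group, while `medoid_center` remains one of the three close points.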
CLARA is a modified version of PAM suitable for large databases [7]. When clusters overlap, fuzzy clustering is preferred, and fuzzy c-means has long been popular. Krishnapuram et al. were the first to present fuzzy k-medoids [8]. For an overview of fuzzy clustering, see [9]. Newer versions of fuzzy clustering that try to resolve the earlier problems include [3] [5] [10]. Among fuzzy k-means-type algorithms, a very comprehensive study was done in [3]: it solves the two problems of k-means-type algorithms, the predetermined cluster number and the sensitivity to the initial cluster values, but, as mentioned before, k-means is sensitive to noise and does not work impeccably in all cases. Among fuzzy k-medoids-type algorithms, FCMdd and FCTMdd [8] are two early algorithms introduced by Krishnapuram et al.; FCMdd is not robust, while FCTMdd is a robust version of FCMdd based on the Least Trimmed Squares idea.

978-1-4577-2152-6/11/$26.00 (c) 2011 IEEE

Table I. Review of recent improvements on k-means and k-medoids

    Algorithm                          Description                                             Authors                  Year
    k-means:
      c-means [7]                      center is the mean of the instances                     MacQueen                 1967
      FCM [1]                          fuzzy c-means                                           Bezdek                   1984
      agglomerative fuzzy K-means [3]  selects the number of clusters                          Ng, Cheung and Li        2008
      SAHN                             sequential agglomerative hierarchical non-overlapping
    k-medoids:
      PAM [7]                          partitioning around medoids                             Kaufman and Rousseeuw    1990
      CLARA [7]                        clustering large applications                           Kaufman and Rousseeuw    1990
      CLARANS [7]                      CLARA based upon randomized search                      Ng and Han               1994
      FCMdd [2]                        fuzzy k-medoids                                         Krishnapuram             1999
      FCTMdd [2]                       robust fuzzy k-medoids                                  Krishnapuram             1999
      PFC [11]                         multiple medoids                                        Mei and Chen             2010

PFC [10] is a recent version of fuzzy k-medoids introduced by Mei and Chen. In PFC, each cluster is represented by more than one object with the help of weighted objects, but it still suffers from the issues raised in the introduction. An overview of the improvements to partitioning clustering is presented in Table I.

III. THE PROPOSED APPROACH

In this section, to address the mentioned challenges, we propose a new fuzzy k-medoids algorithm based on instance entropy. The proposed method, referred to as the improved fuzzy k-medoids hereafter, consists of the following phases.

A. Prerequisites

Fuzzy clustering algorithms comprise two chief stages: first, finding an appropriate function to compute each instance's degree of membership in each cluster; second, obtaining a method that calculates the cluster centers. Typically the following objective function is employed:

    P(Z, X) = sum_{j=1}^{k} sum_{i=1}^{n} u_ij * d(z_j, x_i)        (1)

where u_ij represents the degree of membership of the ith object x_i in the jth cluster z_j, Z contains the cluster centers, and d(z_j, x_i) is a dissimilarity measure between the jth cluster center and the ith object. In order to improve the efficiency of the fuzzy clustering algorithm, the sum of the objects' entropy is considered in this paper as a complementary factor in the objective function.
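A minimal sketch of the idea behind the entropy factor, under the common assumption that the entropy term enters with a weighting coefficient (written `gamma` here; the symbol is illustrative, not taken verbatim from the paper): penalizing the objective with the membership entropy turns hard nearest-center assignments into soft, softmax-style memberships.

```python
import numpy as np

def membership(X, Z, gamma=1.0):
    """u[i, j]: softmax-style degree of membership of object x_i in cluster z_j."""
    d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2) ** 2  # squared Euclidean
    w = np.exp(-d / gamma)
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
Z = np.array([[0.0, 0.0], [5.0, 5.0]])
U = membership(X, Z)
```

A small `gamma` makes the memberships nearly crisp; a large `gamma` spreads each object's membership evenly over all clusters, which matches the behavior of the coefficient discussed below.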
Thus formula (1) plus the sum of the objects' entropy forms the manipulated objective function:

    P(Z, X) = sum_{j=1}^{k} sum_{i=1}^{n} u_ij * d(z_j, x_i) + gamma * sum_{j=1}^{k} sum_{i=1}^{n} u_ij * ln(u_ij)        (2)

    s.t.  sum_{j=1}^{k} u_ij = 1,  u_ij in (0, 1],  1 <= i <= n        (3)

where gamma is the coefficient weighting the entropy term. The Euclidean distance is applied as the dissimilarity criterion:

    d(z_j, x_i) = ||x_i - z_j||        (4)

Partial optimization over U and Z is the commonplace method employed to optimize P: first U is fixed and the reduced P is minimized with respect to Z; then Z is fixed and the reduced P is minimized with respect to U. Minimizing (2) subject to (3) with Z fixed, U is obtained as follows:

    u_ij = exp(-d(z_j, x_i) / gamma) / sum_{l=1}^{k} exp(-d(z_l, x_i) / gamma)        (5)

As is obvious, the value of U relies on the coefficient gamma. The empirical results show that the appropriate gamma depends on the type of the data objects: data objects with small values call for a small gamma, and data objects with large values call for a large gamma. Moreover, [8] demonstrated that gamma should lie within a certain interval: if it is too large, the number of discovered clusters converges to 1, and if it is too small, more clusters are uncovered than actually exist.

The second stage of fuzzy clustering, finding the cluster centers, is performed in k-medoids-type algorithms as follows [8]:

    for j = 1 to k:
        q = argmin_{1 <= p <= n} sum_{i=1}^{n} u_ij * d(x_p, x_i)
        z_j = x_q

The fuzzy k-medoids algorithm based on these modifications is presented in Fig. 1:

    Fuzzy k-medoids algorithm:
    Input: coefficient gamma, initial value of Z
    While (1)
        1. Compute the value of U by (5)
           Determine the value of P(U, Z) by (2)
           If P_previous = P then END
           Set P_previous = P
        2. Compute the value of Z by the medoid update above
           Determine the value of P(U, Z) by (2)
           If P_previous = P then END
    End while
    Output: the values of U and Z

    Figure 1. Fuzzy k-medoids algorithm

B. Improved Fuzzy K-Medoids

The proposed algorithm is inspired by agglomerative algorithms. Agglomerative clustering starts with every object as a separate cluster and applies a merging method to establish the accurately grouped set of objects [3]. Conversely, the presented algorithm starts with a large number of clusters as the input parameter, and the values of Z (the cluster centers) are optimized during a loop. The fuzzy k-medoids algorithm introduced above is employed to compute the Z values. In each cycle of the loop, the values of U and Z are computed with the fuzzy clustering algorithm, and then the closest pair of clusters is determined and merged. This procedure continues until the number of clusters reaches one (see Fig. 2). The validation index proposed by [12] is used to determine which set of Z values is the right one. The improved fuzzy k-medoids algorithm is presented in Fig. 2. For merging the clusters, the MergeDBMSDC algorithm introduced by Khan [12] is used.

IV. EXPERIMENTAL RESULTS

To evaluate the proposed approach, three experiments were carried out, and all results support the effectiveness of the proposed method. All data used in the three experiments were obtained synthetically, built under various conditions to confirm that the algorithm works in each of them.

A. Experiment 1

This experiment demonstrates the ability of the algorithm to obtain the right number of clusters. In the first data set, 4500 object points were produced from a mixture of three bivariate Gaussian densities given by (6), where Gaussian[X, Y] denotes a Gaussian normal distribution with mean X and covariance matrix Y. The synthetic data set with 10 initial cluster centers is shown in Fig. 3a, and Fig. 3 shows the stages of reaching the accurate number of clusters. According to Fig. 3, the centers obtained using the improved fuzzy k-medoids are clearly more accurate.

2011 11th International Conference on Hybrid Intelligent Systems (HIS)
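A data set of this shape can be generated with a short numpy sketch. The three component means below are the true centers reported for experiment 1; the covariance and mixing proportions are assumptions for illustration, since the exact parameters of formula (6) are not reproduced here.

```python
import numpy as np

def gaussian_mixture(means, covs, sizes, seed=0):
    """Draw sizes[i] points from each bivariate Gaussian component."""
    rng = np.random.default_rng(seed)
    return np.concatenate([rng.multivariate_normal(m, c, size=n)
                           for m, c, n in zip(means, covs, sizes)])

means = [(1.0, 1.0), (1.0, 2.5), (2.5, 2.5)]  # true centers of experiment 1
covs = [0.05 * np.eye(2)] * 3                 # assumed covariance, not from the paper
data = gaussian_mixture(means, covs, sizes=[1500, 1500, 1500])  # 4500 points total
```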
Table II compares the positions of the true cluster centers with the output of simple fuzzy k-medoids and of the improved fuzzy k-medoids.

Table II. Comparison between real centers and the results of fuzzy k-medoids and the improved fuzzy k-medoids (experiment 1)

    Real centers    Fuzzy k-medoids       Improved fuzzy k-medoids
    (1, 1)          (0.9854, 0.9257)      (1.0256, 0.9859)
    (1, 2.5)        (1.0288, 2.3964)      (1.0635, 2.4825)
    (2.5, 2.5)      (2.4908, 2.4513)      (2.5121, 2.4795)

    Improved fuzzy k-medoids algorithm:
    Input: initial number of clusters K*, chosen large; coefficient gamma; initial value of Z, selected randomly; t = 2
    While (k != 1)
        1. Run the fuzzy k-medoids algorithm
        2. Determine K_merge using MergeDBMSDC
        3. k = K* - K_merge
        4. Save U and Z for this k
        5. t = t + 1
    End while
    Output: the U and Z with the least validation index

    Figure 2. Improved fuzzy k-medoids algorithm

B. Experiment 2

This experiment shows that when the number of clusters increases, the algorithm still works well and obtains better centers than simple fuzzy k-medoids. In this experiment, 5000 points in 7 clusters were constructed using a mixture of normal distributions. Table III presents the centers obtained using fuzzy k-medoids and the improved fuzzy k-medoids, and Fig. 4 depicts the results of experiment 2, in which the better result is evident.

Table III. Comparison between real centers and the results of fuzzy k-medoids and the improved fuzzy k-medoids (experiment 2)

    Real centers    Fuzzy k-medoids         Improved fuzzy k-medoids
    (10, 5)         (1.1852, 8.1558)        (9.6646, 3.8116)
    (40, 50)        (37.9243, 43.1686)      (31.4578, 52.3654)
    (50, 175)       (29.9187, 156.9329)     (49.2359, 175.9832)
    (60, 80)        (62.8865, 76.4544)      (61.0091, 82.1413)
    (90, 35)        (81.7268, 47.4920)      (89.1618, 34.5074)
    (150, 79)       (120.3511, 63.2428)     (120.9105, 78.209)
    (100, 120)      (114.4468, 106.8432)    (99.7343, 119.6416)
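The control flow of the improved algorithm described above can be sketched end to end: start from many randomly chosen medoids, alternate the membership and medoid updates until the objective stalls, then merge the closest pair of medoids and repeat down to one cluster, keeping the best configuration. This is an illustration only: the paper merges via MergeDBMSDC and scores each k with a validation index from [12], for which a plain Xie-Beni-style ratio (compactness over medoid separation) stands in here, and the names `gamma`, `k0`, and `pairwise_sq` are assumptions.

```python
import numpy as np

def pairwise_sq(A, B):
    """Squared Euclidean distances between the rows of A and the rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2) ** 2

def membership(X, Z, gamma):
    """Entropy-regularized (softmax) membership of each object in each cluster."""
    w = np.exp(-pairwise_sq(X, Z) / gamma)
    return w / w.sum(axis=1, keepdims=True)

def update_medoids(X, U):
    """New medoid of cluster j: the object minimizing the U-weighted
    total dissimilarity to all objects."""
    d_all = pairwise_sq(X, X)
    return np.array([X[(U[:, j] @ d_all).argmin()] for j in range(U.shape[1])])

def improved_fuzzy_kmedoids(X, k0=8, gamma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=k0, replace=False)]  # many initial medoids
    best_score, best_Z = np.inf, Z
    while True:
        prev = np.inf
        while True:  # alternate the two update steps until P stops improving
            U = membership(X, Z, gamma)
            Z = update_medoids(X, U)
            P = (membership(X, Z, gamma) * pairwise_sq(X, Z)).sum()
            if prev - P < 1e-9:
                break
            prev = P
        if len(Z) == 1:
            break
        dz = pairwise_sq(Z, Z)
        np.fill_diagonal(dz, np.inf)
        sep = dz.min()  # separation: squared distance of the closest medoid pair
        score = P / (len(X) * sep) if sep > 0 else np.inf  # Xie-Beni-style index
        if score < best_score:
            best_score, best_Z = score, Z.copy()
        _, b = np.unravel_index(dz.argmin(), dz.shape)  # merge closest pair
        Z = np.delete(Z, b, axis=0)
    return best_Z

# Usage on three well-separated synthetic blobs:
rng = np.random.default_rng(1)
blobs = np.concatenate([c + 0.3 * rng.standard_normal((30, 2))
                        for c in ([0.0, 0.0], [5.0, 5.0], [10.0, 0.0])])
centers = improved_fuzzy_kmedoids(blobs, k0=8, gamma=2.0, seed=0)
```

Every returned center is an actual data object, reflecting the k-medoids property that centers are selected from, not computed over, the data.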

Figure 3. Three stages of obtained centers during experiment 1; red points show the result of fuzzy k-medoids and black points show the result of the improved fuzzy k-medoids. (a) Stage 1: start with 10 initial input centers. (b) Stage 3: obtained centers after 3 cycles. (c) Final stage: the right number of clusters is obtained.

Figure 4. Result of experiment 2; red points show the result of fuzzy k-medoids and black points show the result of the improved fuzzy k-medoids. (a) First stage. (b) Final stage.

C. Experiment 3

In this experiment the data points include some noisy points. To create the noise, a mixture of four bivariate Gaussian densities is employed.

V. CONCLUSION

Many studies have been conducted on the foundations of partitioning clustering, which is practical and useful. In this paper we proposed a new version of the fuzzy k-medoids algorithm that covers the two vulnerable issues of partitioning algorithms: the predetermined cluster number and the sensitivity to noise. The empirical numerical results show that the proposed method succeeds. In comparison to fuzzy c-means, the improved fuzzy k-medoids gives better results, as described in Fig. 5; the output cluster center positions of the two algorithms are shown in Table IV.

Table IV. Comparison between real centers and the results of fuzzy c-means and the improved fuzzy k-medoids (experiment 3)

    Real centers    Fuzzy c-means        Improved fuzzy k-medoids
    (1, 1)          (0.9739, 0.9995)     (0.9976, 1.0141)
    (1, 2.5)        (1.1010, 2.9046)     (1.0809, 2.7814)
    (2.5, 2.5)      (2.4599, 2.4954)     (2.4771, 2.4853)

Figure 5. Comparison between the improved fuzzy k-medoids and FCM; red points represent the FCM results and black points show the improved fuzzy k-medoids results.

REFERENCES
[1] J.C. Bezdek and R. Ehrlich, "FCM: The fuzzy c-means clustering algorithm," Computers & Geosciences, 1984.
[2] A.P. Reynolds, G. Richards and V.J. Rayward-Smith, "The Application of K-medoids and PAM to the Clustering of Rules," in Intelligent Data Engineering and Automated Learning, 2004.
[3] M.K. Ng, Y. Cheung and M. Li, "Agglomerative Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters," IEEE Transactions on Knowledge and Data Engineering, 2008.
[4] A. Keller, "Fuzzy clustering with outliers," in Fuzzy Information Processing Society, 2000.
[5] H. Shah and J. Undercoffer, "Fuzzy clustering for intrusion detection," in Fuzzy Systems, 2003.
[6] W. Li, "Modified K-Means Clustering Algorithm," in Congress on Image and Signal Processing, 2008.
[7] P. Berkhin, "Survey of clustering data mining techniques," 2002.
[8] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[9] R. Krishnapuram, A. Joshi and L. Yi, "A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering," in Fuzzy Systems, 1999.
[10] A. Baraldi and P. Blonda, "A survey of fuzzy clustering algorithms for pattern recognition," IEEE Transactions on Systems, Man, and Cybernetics, 1999.
[11] J. Mei and L. Chen, "Fuzzy clustering with weighted medoids for relational data," Pattern Recognition, 2010.
[12] S. Wang, Q. Jiang and H. Sun, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters," Pattern Recognition, vol. 37, pp. 2027-2037, 2004.
[13] S.S. Khan and A. Ahmad, "Cluster center initialization algorithm for K-means clustering," Pattern Recognition Letters, 2004.
[14] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, 1981.