Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2
1 P.G. Scholar, Department of Computer Engineering, ARMIET, Mumbai University, India
2 Principal, S.S.J.C.O.E., Mumbai University, India

ABSTRACT

Nowadays, clustering is a central objective of research in several fields such as machine learning and pattern recognition. Clustering plays an outstanding role in information retrieval, text summarization, marketing, bioinformatics, medicine and many other areas. Clustering is the process of grouping or dividing data into meaningful groups, called clusters, which are formed on the basis of the similarity and dissimilarity of the objects they contain. Clustering algorithms are used to cluster the data objects and are generally categorized as hard or soft clustering. Some clustering algorithms, such as K-means, Fuzzy C-means (FCM), hierarchical clustering and mixture of Gaussians, are widely used. This paper focuses on these clustering algorithms together with their advantages and disadvantages.

Keywords: K-means, Fuzzy C-means, Hierarchical, Mixture of Gaussian

[1] INTRODUCTION

Clustering or cluster analysis is the process of grouping a set of objects into groups that are meaningful, useful or both. However, the groups are not predefined. Clustering can be used in many application domains such as marketing, medicine, bioinformatics, economics and anthropology. Clustering is sometimes referred to as unsupervised learning, since an unsupervised learner finds some kind of structure in the data without labelled examples. A clustering is a set of clusters that together contain all objects in the data set. Clustering can be distinguished as hard clustering and soft clustering: in hard clustering each object belongs to exactly one cluster, while in soft clustering each object belongs to each cluster to a certain degree. Objects are grouped in such a way that objects in the same group are more similar to each other than to objects in other groups.
Clustering algorithms are classified as exclusive (K-means), overlapping (Fuzzy C-means), hierarchical, and probabilistic (mixture of Gaussians). The most widely used clustering algorithms are the following:
K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians

[2] CLUSTERING ALGORITHMS

K-means Algorithm

K-means is an unsupervised learning algorithm that solves the well-known clustering problem. K-means clustering classifies a given data set into a certain number of clusters. The main idea is to define k centers, one for each cluster; the algorithm groups objects into k groups based on their attributes. K-means uses the squared Euclidean distance to allocate objects to clusters. The quality of the clustering is determined by the following squared error function:

J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} ( ||x_j - v_i|| )^2

where ||x_j - v_i|| is the Euclidean distance between data point x_j and cluster center v_i, c_i is the number of data points in the i-th cluster, and c is the number of cluster centers.

Algorithmic steps for k-means clustering

Here X = {x_1, x_2, x_3, ..., x_n} is the set of data points and V = {v_1, v_2, ..., v_c} is the set of centers.
1) Select c cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center whose distance from that data point is the minimum over all cluster centers.
4) Recalculate each new cluster center using:

v_i = (1/c_i) \sum_{j=1}^{c_i} x_j

5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise go to step 3).

Advantages:
1) Easy to understand.
2) Gives best results when the data sets are distinct or well separated from each other.

Disadvantages:
1) Requires a priori specification of the number of cluster centers.
2) Unable to handle noisy data and outliers.
3) Provides only a local optimum of the squared error function.
4) Euclidean distance measures can unequally weight underlying factors.

Fuzzy C-Means Algorithm

The Fuzzy C-means algorithm was introduced by J. C. Bezdek [1]. Fuzzy C-means is an unsupervised clustering algorithm which assigns a membership value to each data point for each cluster center. The membership value is assigned on the basis of the distance between the data point and the cluster center: the degree of membership of each data item to each cluster is calculated, and these membership values decide the cluster to which the data item is supposed to belong. The sum of the memberships of each data item over all clusters should be equal to one. The membership degree and the cluster centers are given by:

µ_ij = 1 / \sum_{k=1}^{c} (d_ij / d_ik)^{2/(m-1)}

v_j = ( \sum_{i=1}^{n} (µ_ij)^m x_i ) / ( \sum_{i=1}^{n} (µ_ij)^m )

where m is the fuzziness index, m ∈ [1, ∞), c is the number of cluster centers, µ_ij is the membership of the i-th data point to the j-th cluster center, and d_ij is the Euclidean distance between the i-th data point and the j-th cluster center.

Algorithmic steps for Fuzzy C-means clustering

Here X = {x_1, x_2, x_3, ..., x_n} is the set of data points and V = {v_1, v_2, v_3, ..., v_c} is the set of centers.
1) Select c cluster centers.
2) Calculate the fuzzy membership µ_ij using the formula above.
3) Calculate the fuzzy centers v_j using the formula above.
4) Repeat steps 2) and 3) until the minimum value of J is achieved or ||U^(k+1) - U^(k)|| < β, where k is the iteration step, β is the termination criterion in [0, 1], U = (µ_ij)_{n×c} is the fuzzy membership matrix, and J is the objective function.

Advantages:
1) Gives better results than the K-means algorithm.
2) Gives best results for overlapped data sets.

Disadvantages:
1) Requires a priori specification of the number of clusters.
2) Euclidean distance measures can unequally weight underlying factors.

Hierarchical Clustering Algorithm

A hierarchical clustering algorithm (HCA) creates a set of clusters by recursively partitioning the instances. The clusters group the data into a tree structure known as a dendrogram. The dendrogram is used to show how a hierarchical clustering technique groups the clusters into different sets. The root of the dendrogram is a single cluster in which all elements are grouped together, and the leaves of the dendrogram are single-element clusters.

Figure 1: Dendrogram
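A dendrogram such as Figure 1 arises by successively merging clusters, starting from one leaf cluster per point. The following is a minimal Python sketch of that merge loop; the toy data points and the choice of single-linkage distance are assumptions for illustration, not part of the original paper.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two points given as tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    # Single linkage: distance between the closest pair of points
    # drawn from the two clusters.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points):
    # Start with every point in its own cluster (the dendrogram leaves).
    clusters = [[p] for p in points]
    merges = []
    # Repeatedly merge the two closest clusters until one cluster
    # remains (the dendrogram root), recording each merge.
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
history = agglomerative(points)
# The two tight pairs merge first; the final merge joins both groups.
for left, right, dist in history:
    print(len(left), len(right), round(dist, 2))
```

Each entry of `history` corresponds to one internal node of the dendrogram, with the merge distance giving the height at which the two subtrees join.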
Hierarchical clustering algorithms are divided into two types:
i) Agglomerative algorithms [merging]: The clustering process starts with the unclustered items and merges clusters until all items belong to one cluster. Pairwise similarity measures are used to determine which clusters to merge.
ii) Divisive algorithms [splitting]: These algorithms initially place all the items in one cluster, and clusters are repeatedly split into smaller clusters. If the elements of a cluster are not sufficiently close to each other, the cluster is split up.

Algorithmic steps for HCA:
1) Start with each instance in its own cluster.
2) Until there is only one cluster: among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
3) Replace c_i and c_j with a single cluster c_i ∪ c_j.

Advantages:
1) Ease of handling any form of similarity or distance.
2) Hierarchical clustering algorithms are more versatile.

Disadvantages:
1) The algorithm can never undo what was done previously.
2) No objective function is directly minimized.

Mixture of Gaussians

In model-based clustering, certain models of clustering are used and the fit between the data and the model is optimized. Gaussian (continuous) or Poisson (discrete) distributions can serve as the components of a mixture of distributions that models the entire data set. The Expectation-Maximization (EM) algorithm is used to find the parameters of the mixture of Gaussians. EM for a Gaussian mixture is an iterative procedure that starts from some initial estimate of the parameters θ and then iteratively updates θ until convergence is detected. Each iteration consists of an E-step and an M-step.
E-step: Estimates the missing values using the current estimate of θ. Initially this can be done by finding a weighted average of the observed data.
M-step: Finds the new estimates of the parameters θ that maximize the expected likelihood, using the estimates of the missing data from the E-step.

Advantages:
1) Among the fastest algorithms for learning mixture models.
Disadvantages:
1) The algorithm always uses all the components it has access to, so it needs complex held-out data criteria to decide how many components to use in the absence of external cues.

[3] LITERATURE SURVEY

T. Kanungo and D. M. Mount present a simple and efficient implementation of Lloyd's k-means clustering algorithm, called the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. [2]
Rui Xu and Donald Wunsch II present a survey of different clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets. [3]
A. Baraldi and P. Blonda review the issues related to clustering approaches and their relationships to the different methods. [4]
M.-S. Yang gives a summary of fuzzy set theory as applied in cluster analysis. The paper mostly focuses on fuzzy clustering based on fuzzy relations, objective functions, and the fuzzy generalized k-nearest neighbor rule. [5]
Brendan J. Frey and Delbert Dueck view clustering as learning a set of cluster centers such that the sum of squared errors between data points and their nearest centers is small. The exemplars are the centers selected from the actual data points. [6]
Jianbo Shi and Jitendra Malik developed an algorithm based on viewing perceptual grouping as a process that extracts global impressions of a scene or image; this grouping provides a hierarchical description of the scene. In their paper, graph segmentation is done using the normalized cut criterion. The normalized cut is an unbiased measure of disassociation between subgroups of a graph. [7]
P. Corsini, B. Lazzerini, and F. Marcelloni present a new fuzzy clustering algorithm, known as the any-relation clustering algorithm, which partitions a data set by minimizing the Euclidean distance between the objects in a cluster and the prototype of the cluster.
The proposed algorithm is based on fuzzy relations between objects; it is more stable and scalable and has a faster convergence speed. [8]
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, and N. Emami Chukanlo discuss the clustering process, some hierarchical clustering algorithms and their attributes, and the advantages and disadvantages of hierarchical clustering algorithms, and compare the algorithms with each other. [9]
A. K. Jain, M. N. Murty, and P. J. Flynn examined the various steps in clustering and discussed fuzzy, neural, evolutionary, and knowledge-based approaches to clustering. The paper also described the applications of clustering. [10]

[4] CONCLUSION

Clustering or cluster analysis is the process of grouping a set of objects into groups that are meaningful, useful or both. Clustering can be hard clustering or soft clustering. Objects are grouped on the basis of the similarities and dissimilarities of the objects in the group. There are various clustering algorithms which are used for clustering data objects.
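To make the comparison concrete, the k-means steps described in the clustering algorithms section can be sketched in a few lines of Python. This is a minimal illustrative sketch only: the toy 2-D data, the choice of k = 2, and the hand-picked initial centers are assumptions, not part of the original paper.

```python
import math

def kmeans(points, centers, iters=20):
    # Alternate between assigning each point to its nearest center
    # and recomputing each center as the mean of its assigned points,
    # until no assignment changes (or the iteration limit is reached).
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                # New center = mean of the cluster's points, coordinate-wise.
                new_centers.append(tuple(sum(xs) / len(xs) for xs in zip(*members)))
            else:
                new_centers.append(c)  # keep the old center if a cluster empties
        if new_centers == centers:  # no reassignment changed anything: stop
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # one center settles near each of the two groups
```

Note that, as listed among the disadvantages, the result depends on the initial centers and only a local optimum of the squared error function is reached.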
The K-means clustering algorithm is an unsupervised learning algorithm that solves the well-known clustering problem. This algorithm is easy to understand but requires a priori specification of the number of cluster centers. The Fuzzy C-means algorithm (FCM) assigns membership values to each data point on the basis of the distance between the data point and each cluster center. This algorithm performs better than the K-means algorithm but also requires a priori specification of the number of cluster centers. A hierarchical clustering algorithm (HCA) creates a set of clusters grouped into a tree structure called a dendrogram. Hierarchical clustering algorithms are divided into two types, agglomerative and divisive; they are more versatile, but no objective function is directly minimized. The Expectation-Maximization (EM) algorithm is used to find the parameters of a mixture of Gaussians. The algorithm alternates between an Expectation step (E-step) and a Maximization step (M-step) and is among the fastest algorithms for learning mixture models.

REFERENCES

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
[2] T. Kanungo and D. M. Mount, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2002.
[3] Rui Xu and Donald Wunsch II, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005.
[4] A. Baraldi and P. Blonda, A Survey of Fuzzy Clustering Algorithms for Pattern Recognition - Part I and II, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 29, no. 6, pp. 778-801, Dec. 1999.
[5] M.-S. Yang, A Survey of Fuzzy Clustering, Mathematical and Computer Modelling, vol. 18, no. 11, pp. 1-16, 1993.
[6] B. J. Frey and D. Dueck, Clustering by Passing Messages between Data Points, Science, vol. 315, pp. 972-976, 2007.
[7] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[8] P. Corsini, B. Lazzerini, and F. Marcelloni, A New Fuzzy Relational Clustering Algorithm Based on the Fuzzy C-Means Algorithm, Soft Computing, vol. 9, pp. 439-447, 2005.
[9] M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, and N. Emami Chukanlo, A Survey of Hierarchical Clustering Algorithms, The Journal of Mathematics and Computer Science, vol. 5, no. 3, pp. 229-240, 2012.
[10] A. K. Jain, M. N. Murty, and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[11] C. F. J. Wu, On the Convergence Properties of the EM Algorithm, The Annals of Statistics, vol. 11, no. 1, pp. 95-103, 1983.
[12] M. Jordan and R. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, vol. 6, pp. 181-214, 1994.