A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING

Susan Tony Thomas
PG Student, Pillai Institute of Information Technology, Engineering, Media Studies & Research, New Panvel-410206

Ujwal Harode
Asst. Prof., Electronics Department, Pillai Institute of Information Technology, Engineering, Media Studies & Research, New Panvel-401206

ABSTRACT
Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. This paper reviews two clustering techniques: k-means and hierarchical clustering. The strengths and weaknesses of both techniques, along with their methodology and process, are listed. The performance and various other attributes of the two techniques are presented and compared.

Keywords
Data clustering, K-means, Hierarchical clustering, Agglomerative clustering, Divisive clustering

INTRODUCTION
Clustering is a data mining technique that groups similar data into one cluster and dissimilar data into different clusters. Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. Moreover, most of the data collected in many problems has inherent properties that lend themselves to natural groupings. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Clustering algorithms are used extensively not only to organize and categorize data, but are also useful for data compression and model construction.

Clustering methods can be broadly classified as hierarchical clustering and partitional (non-hierarchical) clustering. A hierarchical clustering is a nested sequence of partitions. This method is further classified into the agglomerative (bottom-up) approach and the divisive (top-down) approach. Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions would be required. Because this is not feasible, greedy heuristics are used in the form of iterative optimization; that is, a relocation method iteratively relocates points between the k clusters. One of the partitioning methods is k-means clustering.

1. DATA CLUSTERING TECHNIQUES
In this section a detailed discussion of both techniques is presented. The algorithms are described in the following subsections.

Figure 1. Classification of clustering methods (hierarchical: agglomerative (bottom-up) and divisive (top-down), with single-link, complete-link, average-link and minimum-diameter variants; partitional: single pass, relocation, nearest neighbour).

1.1 K-MEANS CLUSTERING
K-means clustering [16] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycentres of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more.

1.1.1 Algorithm Properties
- There are always K clusters.
- There is always at least one item in each cluster.
- The clusters are non-hierarchical and they do not overlap.
- Every member of a cluster is closer to its cluster than to any other cluster (closeness here need not be measured to the cluster centre).

1.1.2 Algorithm Process
The data set is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.
1) Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of cluster centres.
2) Randomly select c cluster centres.
3) Calculate the distance between each data point and each cluster centre.
4) Assign each data point to the cluster centre whose distance from it is the minimum over all cluster centres.
5) Recalculate each new cluster centre using v_i = (1/c_i) * Σ_{j=1..c_i} x_j, where c_i is the number of data points in the i-th cluster.
6) Recalculate the distance between each data point and the newly obtained cluster centres.
7) If no data point was reassigned then stop; otherwise repeat from step 4.
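For readers who want to experiment, the process above maps almost line for line onto code. The following is a minimal sketch in Python with NumPy; the function name, the random initialisation and the iteration cap are illustrative choices, not taken from the paper.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means following steps 1-7 above (Lloyd's algorithm)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k of the data points as the initial cluster centres.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 3-4: distance from every point to every centre; the nearest centre wins.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 7: stop when no point was reassigned.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: each centre becomes the mean (barycentre) of its cluster.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels

# Example: two well-separated blobs should come back as two tight clusters.
# centers, labels = k_means(np.vstack([np.random.randn(50, 2),
#                                      np.random.randn(50, 2) + 5.0]), k=2)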

1.1.2.1 Strengths of K-Means
- Simple: easy to understand and to implement.
- Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.

1.1.2.2 Weaknesses of K-Means
- The algorithm is only applicable when the mean is defined. For categorical data, k-modes is used, in which the centroid is represented by the most frequent values.
- The user needs to specify k.
- The algorithm is sensitive to outliers. Outliers are data points that are very far away from the other data points; they could be errors in the data recording or special data points with very different values.

To deal with outliers, one method is to remove, during the clustering process, data points that are much further away from the centroids than the other data points. To be safe, such possible outliers may be monitored over a few iterations before deciding to remove them.

1.2 HIERARCHICAL CLUSTERING
A hierarchical clustering is a nested sequence of partitions. This method works with both bottom-up and top-down approaches, and on that basis is subdivided into agglomerative and divisive clustering. The agglomerative hierarchical technique follows the bottom-up approach, whereas the divisive technique follows the top-down approach. Hierarchical clustering uses a metric, which measures the distance between two tuples, and a linkage criterion, which specifies the dissimilarity between sets as a function of the pairwise distances of the observations in those sets. The linkage criterion can be of three types: single linkage, complete linkage and average linkage. [3] Methods for measuring the association between clusters are called linkage methods and are presented in Table 1. [5]

Table 1: Linkage methods for measuring the association d12 between clusters 1 and 2

Single linkage    d12 = min_{i,j} d(xi, yj)            The distance between the closest members of the two clusters.
Complete linkage  d12 = max_{i,j} d(xi, yj)            The distance between the farthest-apart members of the two clusters.
Average linkage   d12 = (1/kl) Σ_i Σ_j d(xi, yj)       The average of the distances over all pairs of members.

Notation: x1, x2, ..., xk are the observations from cluster 1; y1, y2, ..., yl are the observations from cluster 2; d(x, y) is the distance between a subject with observation vector x and a subject with observation vector y.
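The three linkage rules in Table 1 translate directly into code. Below is a small illustrative sketch (Python/NumPy assumed; the function names are our own), where each cluster is given as an array whose rows are observation vectors.

import numpy as np

def pairwise_distances(cluster_1, cluster_2):
    """All Euclidean distances d(x_i, y_j) between members of the two clusters."""
    diff = cluster_1[:, None, :] - cluster_2[None, :, :]
    return np.linalg.norm(diff, axis=2)

def single_linkage(c1, c2):
    return pairwise_distances(c1, c2).min()    # distance of the closest pair

def complete_linkage(c1, c2):
    return pairwise_distances(c1, c2).max()    # distance of the farthest pair

def average_linkage(c1, c2):
    return pairwise_distances(c1, c2).mean()   # average over all k*l pairs

# Example with two tiny clusters of 2-D observations:
# a = np.array([[0.0, 0.0], [1.0, 0.0]])
# b = np.array([[4.0, 0.0], [5.0, 0.0]])
# single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b)  # 3.0, 5.0, 4.0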

1.2.1 General Steps of Hierarchical Clustering
Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering (as defined by S. C. Johnson in 1967) is as follows:
1. Start by assigning each item to its own cluster, so that with N items we have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that we now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into the desired number K of clusters.

1.2.1.1 Strengths of Hierarchical Clustering
- Conceptually simple.
- Theoretical properties are well understood.
- When clusters are merged or split, the decision is permanent, so the number of different alternatives that need to be examined is reduced.

1.2.1.2 Weaknesses of Hierarchical Clustering
- When clusters are merged or split, the decision is permanent and cannot be undone.
- Divisive methods can be computationally hard.
- The methods are not (necessarily) scalable to large datasets.
- The number of clusters k is not required in advance, but a termination condition is needed.
- The final model in both the agglomerative and the divisive approach is of no direct use by itself.

1.2.2 Similarity Measure
The basic objective of cluster analysis is to discover natural groupings of the items (or variables). To measure the association between objects, a quantitative scale is developed. These scales are referred to as similarity measures and are mainly statistical measures that indicate the distances between the objects.

1.2.2.1 Similarity Measures for Numeric Data
An important step in any clustering is to select a distance measure, which determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance measure and far apart according to another. For clustering numeric fields there are many well-known measures such as the Euclidean distance, the Minkowski distance and the Manhattan (city-block) distance, but all the distance measures discussed yield the same result for the 1-norm distance. The Euclidean distance is the most commonly chosen type of distance; it is simply the geometric distance in the multidimensional space. [6][12][13] The Euclidean distance between points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) in Euclidean n-space is calculated as d(P, Q) = sqrt( Σ_{i=1..n} (p_i - q_i)^2 ), where p_i and q_i are the i-th coordinates of P and Q respectively.
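To make the distance measure and the merge loop of section 1.2.1 concrete, the sketch below computes pairwise Euclidean distances and then lets SciPy perform the repeated merging of the closest clusters; the sample data, the single-linkage choice and the cut at K = 2 clusters are arbitrary examples, not prescriptions from the paper.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.1],
              [8.0, 8.0], [8.2, 7.9]])        # five toy observations

# Euclidean distance d(P, Q) = sqrt(sum_i (p_i - q_i)^2) for every pair of points.
condensed = pdist(X, metric='euclidean')
dist_matrix = squareform(condensed)           # the N x N matrix of step 1

# Steps 2-4: repeatedly merge the closest pair of clusters (single linkage here).
Z = linkage(condensed, method='single')

# Cut the resulting hierarchy into K = 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                                 # e.g. [1 1 1 2 2]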

2. HIERARCHICAL CLUSTERING ALGORITHMS
Hierarchical clustering (also called hierarchical cluster analysis, or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative clustering and divisive clustering.

2.1 AGGLOMERATIVE HIERARCHICAL CLUSTERING
This algorithm produces a sequence of clusterings with a decreasing number of clusters at each step. The clustering produced at each step results from the previous one by merging two clusters into one. The clusters to merge are chosen by computing the distance between each pair of clusters. For n samples, agglomerative algorithms [1] begin with n clusters, each containing a single sample or point. The two closest (most similar) clusters are then merged, and this continues until the number of clusters becomes one or reaches a value specified by the user. [4][7][14]
1. Start with n clusters, where each single sample forms one cluster.
2. Find the most similar clusters Ci and Cj and merge them into one cluster.
3. Repeat step 2 until the number of clusters becomes one or reaches the value specified by the user.
The distances between each pair of clusters are computed in order to choose the two clusters that are the best candidates to merge. There are several ways to calculate the distance between clusters Ci and Cj (see the linkage methods in Table 1).

Figure 2. Agglomerative hierarchical clustering.

2.2 DIVISIVE HIERARCHICAL CLUSTERING
This algorithm acts in the opposite direction: it produces a sequence of clusterings with an increasing number of clusters at each step. The clustering produced at each step results from the previous one by splitting a single cluster into two. Divisive algorithms begin with just one cluster that contains all the sample data. The single cluster is then split into two or more clusters that have higher dissimilarity between them, until the number of clusters equals the number of samples or reaches a value specified by the user. The following algorithm is one kind of divisive algorithm, using the splinter-party method.
1. Start with one cluster that contains all samples.
2. Calculate the diameter of each cluster, where the diameter is the maximal distance between samples in the cluster. Choose the cluster C with the maximal diameter of all clusters to split.
3. Find the most dissimilar sample x in cluster C. Let x depart from the original cluster C to form a new independent cluster N (so cluster C no longer includes sample x). Assign all remaining members of cluster C to Mc.
4. Calculate the similarity from each member of Mc to clusters C and N, and move each member of Mc to whichever of C or N it is most similar to. Update the members of C and N.
5. Repeat step 4 until the members of C and N do not change.
6. Repeat steps 2-5 until the number of clusters equals the number of samples or reaches the value specified by the user. [15]

Figure 3. Divisive hierarchical clustering.
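The splinter-party procedure of section 2.2 is not available as a standard library routine, so the sketch below renders steps 1-6 directly. Note the assumptions: the paper does not fix how the "most dissimilar sample" or the "similarity to a cluster" is measured, so average Euclidean distance is used here as a stand-in, and the reassignment passes are bounded for safety.

import numpy as np

def cluster_diameter(X, members):
    """Maximal pairwise Euclidean distance within one cluster (step 2)."""
    if len(members) < 2:
        return 0.0
    pts = X[members]
    return max(np.linalg.norm(a - b)
               for i, a in enumerate(pts) for b in pts[i + 1:])

def divisive_splinter(X, num_clusters):
    """Rough rendering of the splinter-party steps 1-6 of section 2.2."""
    X = np.asarray(X, dtype=float)
    clusters = [list(range(len(X)))]           # step 1: one cluster holding every sample
    while len(clusters) < num_clusters:
        # Step 2: choose the cluster C with the largest diameter to split.
        idx = max(range(len(clusters)), key=lambda i: cluster_diameter(X, clusters[i]))
        C = clusters.pop(idx)
        # Step 3: the sample with the largest average distance to the rest
        # (our stand-in for "most dissimilar") leaves C and founds cluster N.
        x = max(C, key=lambda i: np.mean([np.linalg.norm(X[i] - X[j])
                                          for j in C if j != i]))
        C.remove(x)
        N = [x]
        # Steps 4-5: move members to whichever of C or N they are closer to
        # on average, and repeat until membership stabilises (bounded passes).
        for _ in range(100):
            changed = False
            for i in list(C):
                if len(C) == 1:
                    break
                d_C = np.mean([np.linalg.norm(X[i] - X[j]) for j in C if j != i])
                d_N = np.mean([np.linalg.norm(X[i] - X[j]) for j in N])
                if d_N < d_C:
                    C.remove(i)
                    N.append(i)
                    changed = True
            if not changed:
                break
        clusters += [C, N]                     # step 6: continue with the new partition
    return clusters                            # lists of sample indices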

3. COMPARISONS BETWEEN K-MEANS AND HIERARCHICAL CLUSTERING

Definition: K-means generates a specified number of disjoint, flat (non-hierarchical) clusters; hierarchical clustering constructs a hierarchy of clusters, not just a single partition of the objects.
Criteria: K-means is well suited for generating globular clusters; hierarchical clustering uses a distance matrix as its criterion, and a termination condition (for example, a desired number of clusters) can be used.
Performance: the performance of the K-means algorithm is better than that of hierarchical clustering.
Categorical data: K-means can be used on categorical data once the data is first converted into numeric form by assigning ranks; hierarchical clustering was adopted for categorical data, but due to its complexity a new approach is needed for assigning a rank value to each categorical attribute.
Sensitivity to noise: K-means is very sensitive to noise in the dataset; hierarchical clustering is less sensitive to noise in the dataset.
Number of clusters: in K-means there are always k clusters; in hierarchical clustering the number of clusters k is not required as an input.
Cluster quality: the K-means algorithm shows lower cluster quality; the hierarchical algorithm shows higher cluster quality.
Dataset size: the K-means algorithm is good for large datasets; the hierarchical algorithm is good for small datasets.

4. CONCLUSIONS
The K-means algorithm has the advantage of clustering large datasets, and its performance increases as the number of clusters increases. However, these conditions apply only when its use is limited to numeric values. Hierarchical clustering was adopted for categorical data, but due to its complexity a new approach is needed for assigning a rank value to each categorical attribute; alternatively, K-means can be used, with the categorical data first converted into numeric form by assigning ranks. The performance of the K-means algorithm is better than that of the hierarchical clustering algorithm. The performance of the K-means algorithm increases as the RMSE decreases, and the RMSE decreases as the number of clusters increases. All the algorithms show some ambiguity when noisy data is clustered. The quality of all the algorithms becomes very good when huge datasets are used. K-means is very sensitive to noise in the dataset; this noise makes it difficult for the algorithm to group data into suitable clusters and affects the result of the algorithm. When using a huge dataset, the K-means algorithm is faster than the other clustering algorithms and also produces quality clusters, but the K-means clustering algorithm is the more sensitive of the two to noisy data.
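For readers who wish to reproduce this kind of comparison on their own data, both algorithms are available in scikit-learn; the snippet below is a hypothetical usage sketch (the random dataset, the choice of k = 3 and average linkage are placeholders, not settings used in this study).

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(200, 4)                      # placeholder dataset

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

# One simple way to compare the two partitions obtained on the same data:
print("agreement (adjusted Rand index):", adjusted_rand_score(km_labels, hc_labels))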

5. FUTURE SCOPE
As future work, a comparison between these algorithms (or other algorithms) may be carried out using parameters other than those considered in this paper. One important factor is normalization: comparing the results of the algorithms on normalized and non-normalized data will give different results, and normalization will of course affect the performance of the algorithms and the quality of the results.

REFERENCES
[1] Sung Young Jung and Taek-Soo Kim, An Agglomerative Hierarchical Clustering Using Partial Maximum Array and Incremental Similarity Computation Method, Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 265-272, November 29 - December 2, 2001.
[2] R. J. Gil-Garcia, J. M. Badia-Contelles and A. Pons-Porrata, A General Framework for Agglomerative Hierarchical Clustering Algorithms, 18th International Conference on Pattern Recognition (ICPR 2006), Volume 2, 2006, pp. 569-572.
[3] K. P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining: Theory and Practice, Eastern Economy Edition, Prentice Hall of India Pvt. Ltd, New Delhi, 2006.
[4] Measuring Association d12 Between Clusters 1 and 2, http://www.stat.psu.edu/online/courses/stat505/18_cluster/05_cluster_between.html
[5] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Low Price Edition, Pearson Education, Delhi, 2003.
[6] Euclidean Distance, http://people.revoledu.com/kardi/tutorial/similarity/EuclideanDistance.html
[7] Cluster analysis, http://en.wikipedia.org/wiki/Cluster_analysis
[8] Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein_Distance
[9] Similarity metrics, http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#hamming
[10] Levenshtein distance, http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#Levenshtein
[11] Tsunami victim list, http://www.ems.narenthorn.thaigov.net/tsunami_e/tsunamilist.php
[12] Euclidean distance, http://en.wikipedia.org/wiki/Euclidean_distance#One-dimensional_distance
[13] Distance, http://en.wikipedia.org/wiki/Distance#Mathematics
[14] Hierarchical clustering algorithms, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[15] Hui-Chuan Lin (2009), Survey and Implementation of Clustering Algorithms, unpublished master's thesis, Hsinchu, Taiwan, Republic of China.
[16] Jinxin Gao and David B. Hitchcock, James-Stein Shrinkage to Improve K-means Cluster Analysis, University of South Carolina, Department of Statistics, November 30, 2009.