Comparative Study of Clustering Algorithms using R

Debayan Das (1) and D. Peter Augustine (2)
(1) M.Sc. Computer Science Student, Christ University, Bangalore, India
(2) Associate Professor, Department of Computer Science, Christ University, Bangalore, India

Abstract: Data is everywhere around us, and with the rapid improvement of technology, collecting it is becoming easier. Data is a key resource in every field, and data mining has therefore gained immense popularity. This paper looks at clustering, one of the main data mining techniques. A comparative study of K-means, Fuzzy C-means and Hierarchical clustering is performed using R. Two data sets are used: the Iris data set and the breast cancer data set, both taken from the UCI repository. The comparison is based on each algorithm's ability to cluster the data into the appropriate class.

Keywords: Clustering, K-means, Fuzzy C-means, Hierarchical clustering

I. INTRODUCTION

Data mining is a technique used to discover hidden patterns and gain insights from large data sets. Effective data mining depends on both successful data collection and warehousing techniques. Knowledge Discovery from Data (KDD) is widely treated as a synonym for data mining, whereas some regard data mining as a pivotal step in knowledge discovery. [2] Several techniques are used for data mining, including association, classification, clustering and regression. This paper focuses on clustering techniques.

Need of the Study

The paper shows how three clustering techniques, K-means, Fuzzy C-means and Hierarchical clustering, perform against each other on two classified data sets. R is a powerful tool for this comparative analysis, and its built-in functions make it easy to visualize the clustering.
The paper will aid those who are interested in analysing clustering techniques using R, and it sheds light on the R features that can be used for this purpose.

Clustering

Clustering involves grouping data such that elements within a group are more similar to one another than to those in other groups. The groups are known as clusters. The greater the resemblance within a group and the greater the difference between groups, the better the clustering. Clustering is an unsupervised learning problem, which involves finding hidden structure in unlabelled data. The aim of clustering is therefore to determine the intrinsic grouping in a set of unlabelled data.

Figure 1: Different Clustering Techniques

Clustering Techniques

A. K-means

K-means clustering is a partitioning technique that groups the given data set into k clusters, where k is supplied in advance. It was initially suggested by MacQueen. Its simplicity has made it a widely popular clustering algorithm. It is also very efficient in terms of time complexity. It is an exclusive clustering technique, i.e. each observation belongs to exactly one cluster.

There are two segments to the algorithm. The first segment involves the arbitrary selection of the k cluster centres. The distance from each observation to every cluster centre is then computed with a distance function to determine the centre it is closest to. All the observations that are closest to a particular cluster centre form a cluster. The first segment ends when every observation has been allocated to a cluster. In the second segment, each cluster centre is recomputed as the mean of the observations assigned to it, and the observations are reallocated to the nearest centre. This iterative process is repeated until the criterion function no longer decreases. The positioning of the initial centroids is crucial, as the algorithm yields different results when their locations change. [3] It is therefore wise to place the initial centroids as far away from each other as possible.

Advantages
- Computationally fast
- Simple to understand and implement

Disadvantages
- It is sensitive to outliers
- Determining the value of k is difficult

B. Fuzzy C-means

In non-fuzzy (hard) clustering, data is divided into distinct clusters, whereas in fuzzy clustering a data point can belong to multiple clusters. Advances in fuzzy theory led to the Fuzzy C-Means (FCM) clustering algorithm, which is based on Ruspini's fuzzy clustering theory. [5] Like K-means, Fuzzy C-means requires c initial cluster seeds, and the working of the algorithm is very similar to that of K-means. A membership is assigned to each data point for every cluster, based on its distance from the cluster centre. The closer a data point is to a cluster centre, the higher its membership of that cluster.
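The distance-based membership described above can be illustrated with a short base R sketch. It uses the commonly stated fuzzy c-means membership formula, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)). The paper does not say which distance or fuzzifier was used, so the Euclidean distance, the fuzzifier m = 2, the helper name fcm_membership and the choice of three Iris rows as fixed centres are all assumptions made for illustration only.

```r
# Sketch of the usual fuzzy c-means membership formula (base R only).
# Assumptions: Euclidean distance, fuzzifier m = 2, illustrative fixed centres.
fcm_membership <- function(x, centres, m = 2) {
  # d[i, j] = Euclidean distance from observation i to centre j
  d <- sapply(seq_len(nrow(centres)), function(j) {
    sqrt(rowSums(sweep(x, 2, centres[j, ])^2))
  })
  d <- pmax(d, 1e-12)      # guard against division by zero at a centre
  w <- d^(-2 / (m - 1))    # closer centres receive larger weights
  w / rowSums(w)           # normalise so that each row sums to 1
}

x <- as.matrix(iris[, 1:4])
centres <- x[c(1, 51, 101), ]   # one observation per species, as stand-in centres
u <- fcm_membership(x, centres)

# Memberships of a few observations in the three clusters
print(round(u[c(1, 2, 75, 120), ], 3))
```

Each row of u holds the memberships of one observation and sums to one, so an observation lying between two centres has its membership shared between them instead of being forced into a single cluster.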
Advantages
- Fuzzy C-means gives good results for data sets that contain overlapping clusters
- Unlike in K-means, a data point can belong to more than one cluster

Disadvantages
- Long computational time
- Outliers, for which a low or even zero membership would be expected, are still given membership of the clusters

C. Hierarchical Clustering

Hierarchical clustering is another method of cluster analysis, which seeks to build a hierarchy of clusters. The data objects are decomposed in a hierarchical manner, producing a dendrogram. The dendrogram is a tree structure that breaks the data into successively smaller subsets. There are two ways of forming the dendrogram: the bottom-up and the top-down approach. The two corresponding types of hierarchical clustering are:
1. Agglomerative
2. Divisive

Agglomerative
- Begins with each data point as its own cluster
- The closest clusters are merged recursively
- The algorithm terminates when k clusters have been formed

Divisive
- Initially consists of one big cluster
- The bigger cluster is then divided into smaller clusters recursively
- The algorithm terminates when k clusters have been formed

Advantages
- No prior information about the number of clusters is needed
- It is applicable with any valid distance measure

Disadvantages
- Not efficient in terms of execution time
- Does not work well with large data sets

II. R PROGRAMMING

R is chosen for conducting the comparative study. It is very popular for data analysis. R has many clustering packages, which can be downloaded from CRAN (Comprehensive R Archive Network). The packages are easy to use, as using them just involves filling in the parameters.

Data Set

Two data sets are taken from the UCI repository. The first is the Iris data set, which has 150 observations and no missing values. The second is the breast cancer data set obtained from the University of Wisconsin Hospitals, which has 699 observations, some with missing values.

Table I: Iris data set class distribution
Class             Distribution
Iris Setosa       50 (33.3%)
Iris Versicolor   50 (33.3%)
Iris Virginica    50 (33.3%)

Table II: Breast Cancer data set class distribution
Class       Distribution
Benign      458 (65.5%)
Malignant   241 (34.4%)

Both data sets are already classified, with the last column indicating the class. In order to cluster the data, the last column of each data set must therefore be omitted. The breast cancer data set also contains missing values. Missing values can affect the quality of the clustering, and there are several approaches to dealing with them:
- The complete row can be ignored if the missing value is in the column containing the class label
- The attribute's mean value can be used to fill the missing values
- Efficient prediction techniques can be used to estimate the missing values

For the purpose of this paper, the missing values are substituted with the mean value of the attribute.

III. IMPLEMENTATION

Iris Data Set

A. K-means

Table III: Results of K-means on Iris data
Iris Setosa   Iris Versicolor   Iris Virginica

Figure 2: Clusters formed by K-means from two attributes of the Iris data, with each cluster shown in a different colour

B. Fuzzy C-means

Table IV: Results of Fuzzy C-means on Iris data
Iris Setosa   Iris Versicolor   Iris Virginica
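A cross-tabulation of the kind summarised in Table III can be produced with the kmeans function from R's stats package. The sketch below is not the authors' script: the seed, nstart = 25 and the reading of the reported 20 iterations as iter.max = 20 are assumptions, and the resulting counts may differ from those in the paper.

```r
# K-means on the built-in iris data, compared against the known species.
# Assumptions: k = 3, arbitrary seed, nstart = 25, iter.max = 20.
set.seed(42)
features <- iris[, -5]   # omit the class label held in the last column

km <- kmeans(features, centers = 3, nstart = 25, iter.max = 20)

# Rows are the clusters found, columns are the true classes
confusion <- table(Cluster = km$cluster, Species = iris$Species)
print(confusion)
print(km$size)           # number of observations in each cluster
```

A scatter plot of two attributes coloured by cluster, in the spirit of Figure 2, could then be drawn with plot(features[, 3:4], col = km$cluster).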

Figure 3: Relevance (pertinence) of each observation to the clusters formed from the Iris data, with each cluster indicated by a particular colour

C. Hierarchical Clustering

Table V: Results of Hierarchical clustering on Iris data with linkage type single
Iris Setosa   Iris Versicolor   Iris Virginica

Table VI: Results of Hierarchical clustering on Iris data with linkage type average
Iris Setosa   Iris Versicolor   Iris Virginica

Table VII: Results of Hierarchical clustering on Iris data with linkage type complete
Iris Setosa   Iris Versicolor   Iris Virginica

Figure 4: Dendrogram formed from Iris data

Breast Cancer Data

A. K-means

Table VIII: Results of K-means on Breast Cancer data
Benign   Malignant
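Before the breast cancer data can be clustered, its missing values have to be filled and its class column dropped. The mean substitution used in this paper could be sketched as follows. The small data frame, its column names and the helper impute_mean are hypothetical and stand in for the UCI file, which is not bundled with R.

```r
# Hypothetical toy data standing in for the breast cancer attributes
toy <- data.frame(
  clump_thickness = c(5, 3, NA, 8, 1),
  bare_nuclei     = c(1, NA, 10, 4, NA),
  class           = c("benign", "benign", "malignant", "malignant", "benign")
)

# Replace each missing value with the mean of the observed values in its column
impute_mean <- function(column) {
  column[is.na(column)] <- mean(column, na.rm = TRUE)
  column
}

features <- toy[, -ncol(toy)]                 # omit the class label (last column)
features[] <- lapply(features, impute_mean)   # impute column by column
print(features)
```

The imputed features data frame contains no missing values and could be passed to a clustering function such as kmeans with centers = 2 for the benign and malignant groups.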

Figure 5: Clusters formed by K-means from two attributes of the Breast Cancer data, with each cluster shown in a different colour

B. Fuzzy C-means

Table IX: Results of Fuzzy C-means on Breast Cancer data
Benign   Malignant

Figure 6: Relevance of each observation to the clusters formed from the Breast Cancer data, with each cluster indicated by a particular colour

C. Hierarchical Clustering

Table X: Results of Hierarchical clustering on Breast Cancer data with linkage type single
Benign   Malignant

Table XI: Results of Hierarchical clustering on Breast Cancer data with linkage type average
Benign   Malignant

Table XII: Results of Hierarchical clustering on Breast Cancer data with linkage type complete
Benign   Malignant
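The comparison of single, average and complete linkage reported in the hierarchical clustering tables can be sketched with hclust and cutree from R's stats package. The example runs on the built-in iris data, because the breast cancer file is not bundled with R. The Euclidean distance and the cut at k = 3 are assumptions, not settings confirmed by the paper.

```r
# Agglomerative clustering of iris under three linkage types.
# Assumptions: Euclidean distance, tree cut into k = 3 clusters.
features <- iris[, -5]      # omit the class label
d <- dist(features)         # pairwise Euclidean distances (the default)

linkages <- c("single", "average", "complete")
results <- lapply(linkages, function(method) {
  tree <- hclust(d, method = method)   # build the dendrogram bottom-up
  clusters <- cutree(tree, k = 3)      # cut it into three clusters
  table(Cluster = clusters, Species = iris$Species)
})
names(results) <- linkages
print(results)
```

Calling plot(hclust(d, method = "average")) would draw the corresponding dendrogram. For the breast cancer data, the same code would apply with the imputed attributes in place of features and k = 2.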

Figure 7: Dendrogram formed from Breast Cancer data

Table XIII: Cluster Distribution Results on Iris Data Set
Method                           No. of Iterations   Cluster Distribution
k-means                          20                  62 (41.33%)   50 (33.33%)   38 (25.33%)
Fuzzy c-means                    20                  60 (40%)      40 (26.66%)   50 (33.33%)
Hierarchical (average linkage)                       50 (33.33%)   64 (42.66%)   36 (24%)

Table XIV: Cluster Distribution Results on Breast Cancer Data Set
Method                           No. of Iterations   Cluster Distribution
k-means                                              (33.33%)       466 (66.66%)
Fuzzy c-means                                        (32.90%)       469 (67.09%)
Hierarchical (average linkage)                       480 (68.66%)   219 (31.33%)

Limitations of the Study

The paper does not discuss the clustering techniques in detail, so prior knowledge of the three techniques would be useful to the reader. The study also does not cover time performance, which could be another factor in the comparative analysis of clustering techniques.

Directions for Future Research

Future research would involve exploring the many other clustering techniques, as well as other data mining techniques in R such as classification and association. Execution time will also be taken into consideration when that analysis is carried out.

Sources of Funding of the Study

The study was not aided financially by any external source and was completely self-funded by the two authors.

IV. CONCLUSION

From the results obtained, it can be observed that none of the three clustering techniques was able to cluster the data accurately. Overlapping of clusters was noticed in the results of all three techniques. The K-means and Fuzzy C-means results were similar in nature, i.e. for both data sets the cluster distributions were very similar. In hierarchical clustering, average linkage gave the results closest to the actual distribution. R allows easy implementation of the clustering algorithms through built-in packages. Visualization of the clusters is also possible in R through numerous functions.

V. REFERENCES

[1] Yuni Xia and Bowei Xi, "Conceptual Clustering Categorical Data with Uncertainty," Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Second Edition (2006).
[3] Velmurugan, T., and T. Santhanam. "Computational complexity between K-means and K-medoids clustering algorithms for normal and uniform distributions of data points." Journal of Computer Science 6.3 (2010): 363.
[4] Sharma, Ritu, M. Afshar Alam, and Anita Rani. "K-means clustering in spatial data mining using Weka interface." International Conference on Advances in Communication and Computing Technologies (ICACACT). Vol
[5] Ghosh, Soumi, and Sanjay Kumar Dubey. "Comparative analysis of k-means and fuzzy c-means algorithms." (IJACSA) International Journal of Advanced Computer Science and Applications 4.4 (2013).
[6] Mingoti, Sueli A., and Joab O. Lima. "Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms." European Journal of Operational Research (2006):
[7] Miller, H., and J. Han. "Spatial clustering methods in data mining: a survey." Geographic Data Mining and Knowledge Discovery, Taylor and Francis (2001).
