IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 5, No. 2, April 2011
ISSN 0973-4406

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

1 Bhaskar Adepu and 2 Kiran Kumar Bejjanki
Department of MCA, Kakatiya Institute of Technology & Science, Warangal, Andhra Pradesh, INDIA, 506015
E-Mail: 1 bhaskar_adepu@yahoo.com; 2 kiran_b_kumar@yahoo.com

Abstract: Clustering analysis has been an emerging research issue in Data Mining owing to its wide variety of applications. In recent years it has also become an essential tool for gene expression analysis. Many clustering algorithms have been proposed, but each has its own advantages and disadvantages, and none works well in all real situations. Minimum Spanning Tree (MST) based clustering algorithms are widely used because of their ability to detect clusters with irregular boundaries. In this paper we present a clustering algorithm inspired by the MST, including a new method for constructing the MST that reduces the time complexity compared with traditional construction methods.

Key words: Clustering, Minimum Spanning Tree, Partitioning, Dissimilarity Matrix.

1. INTRODUCTION

Clustering is the process of grouping data objects into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Usually the common properties are evaluated quantitatively by some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance. Clustering plays an important role in many fields, including Pattern Recognition, Image Processing, Biological Data Analysis, Micro Aggregation, Mobile Communication, Medicine and Economics. It is used to explore the hidden structure of modern large databases, and many algorithms have been proposed in the literature.
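The two optimality measures just mentioned can be made concrete with a short sketch (in Python, chosen here only for illustration; the helper names are our own, not part of the paper):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_intra_cluster_distance(cluster):
    """Average pairwise distance among the points of one cluster
    (small values indicate a compact cluster)."""
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return sum(euclidean(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def min_inter_cluster_distance(c1, c2):
    """Smallest distance between any point of c1 and any point of c2
    (large values indicate well-separated clusters)."""
    return min(euclidean(p, q) for p in c1 for q in c2)

# Two toy clusters: a good clustering keeps intra-cluster distances
# small and inter-cluster distances large.
c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
c2 = [(10.0, 10.0), (11.0, 10.0)]
print(mean_intra_cluster_distance(c1))      # ~1.14, points close together
print(min_inter_cluster_distance(c1, c2))   # ~13.45, clusters far apart
```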
Because of the huge variety of problems and data distributions, different techniques, such as hierarchical, partitioning, density-based and model-based algorithms, have been developed, and no single technique is completely satisfactory for all cases. With recent advances in microarray technology there has been tremendous growth of microarray data, and identifying co-regulated genes so as to organize them into meaningful groups has become an important research topic in bioinformatics. Clustering analysis has therefore become an essential and valuable tool in microarray or gene expression data analysis [1].

Manuscript received May 25, 2010; revised December 15, 2010.

Given a set of N data points, a minimum spanning tree is a spanning tree that connects all the data points, either by a direct edge or by a path, and has minimum total weight, where the total weight is the sum of the weights of all the edges of the tree. In MST-based clustering algorithms, the weight of each edge is usually the Euclidean distance between the two points it connects. MST-based clustering algorithms allow us to overcome many of the problems faced by classical clustering algorithms, and owing to their ability to detect clusters with irregular boundaries they have been widely used in practice. Zahn [2] first proposed MST-based clustering; such algorithms have since been studied extensively in biological data analysis [3], image processing, pattern recognition [4] and outlier detection [5], [6]. An MST-based clustering algorithm [2] usually consists of three steps: (1) a minimum spanning tree is constructed (typically in quadratic time) using either Prim's algorithm or Kruskal's algorithm; (2) the inconsistent edges are removed to obtain a set of connected components (clusters); and (3) step (2) is repeated until some terminating condition is satisfied.
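The three-step scheme above can be sketched as follows. This is a minimal illustration, assuming Prim's algorithm on the complete Euclidean graph and the simplest inconsistency rule (remove the k-1 heaviest edges to obtain k clusters); the function names are ours:

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; O(N^2) time.
    Returns a list of MST edges (i, j, weight)."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    edges = []
    # best[v] = (weight, u): cheapest known edge linking v to the tree
    best = {v: (dist(0, v), 0) for v in range(1, n)}
    while len(in_tree) < n:
        v = min(best, key=lambda x: best[x][0])
        w, u = best.pop(v)
        in_tree.add(v)
        edges.append((u, v, w))
        for x in best:              # relax edges from the new tree vertex
            d = dist(v, x)
            if d < best[x][0]:
                best[x] = (d, v)
    return edges

def mst_clusters(points, k):
    """Step (2): drop the k-1 heaviest MST edges; the connected
    components of the surviving edges are the clusters."""
    keep = sorted(prim_mst(points), key=lambda e: e[2])[:len(points) - k]
    parent = list(range(len(points)))   # union-find over surviving edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j, _ in keep:
        parent[find(i)] = find(j)
    groups = {}
    for p in range(len(points)):
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11)]
print(mst_clusters(pts, 2))   # two groups: {0, 1, 2} and {3, 4}
```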
In this paper we propose a new method for constructing the MST that is based on a partitioning technique. Our algorithm requires no prior knowledge of parameters such as the number of clusters or the dimensionality of the dataset. The rest of the paper is organized as follows. In Section 2 we introduce the necessary concepts of the MST and review existing work on MST-based clustering algorithms. Section 3 presents the MST construction method and the proposed algorithm. Finally, conclusions are drawn in Section 4.

2. RELATED WORK

2.1. MST-based Clustering Algorithms

After the MST has been constructed, the next step is to define an edge inconsistency measure so as to partition the tree into clusters. The simplest inconsistency measure is the removal of the longest edge candidates from the MST, so that k clusters are formed by removing the (k-1) inconsistent edges; the number of clusters k is given as an input parameter in many algorithms. Defining the inconsistent edges and developing the terminating condition are the two major issues that have to be addressed in any MST-based clustering algorithm. In Zahn's original work [2], the inconsistent edges are those whose weights are significantly larger than the average weight of the nearby edges in the tree; the performance of this algorithm is affected by the size of the neighborhood considered. A five-group clustering is shown in Figure 1.

Figure 1: MST Representation of Five Group Clustering

There also exist spanning-tree-based clustering algorithms that maximize or minimize the degrees of the vertices [7], which is computationally expensive. Grygorash et al.
[9] proposed two MST-based clustering algorithms, the Hierarchical Euclidean-distance-based MST clustering algorithm (HEMST) and the Maximum Standard Deviation Reduction clustering algorithm (MSDR). As stated in [3], MST-based clustering does not depend on the detailed geometric structure of a cluster and can therefore overcome many of the problems faced by other clustering algorithms. Other graphical structures, such as the Relative Neighborhood Graph (RNG), the Gabriel Graph (GG) and the Delaunay Triangulation (DT), have also been used for cluster analysis; these graphs are related by MST ⊆ RNG ⊆ GG ⊆ DT [10]. In a density-oriented approach, Chowdhury and Murthy's MST-based clustering technique [11] assumes that the boundary between any two clusters must lie in a valley region, i.e., a region where the density of data points is lowest compared with the neighboring regions, and its inconsistency measure is based on finding such valley regions. Laszlo and Mukherjee present an MST-based clustering algorithm [12] that places a constraint on the minimum cluster size rather than on the number of clusters; the algorithm was developed for the micro-aggregation problem, where the number of clusters can be deduced from the constraints of the problem itself. Vathy-Fogarassy et al. suggest three new cutting criteria for MST-based clustering [4]; their goal is to decrease the number of heuristically defined parameters of existing algorithms and thus the influence of the user on the clustering results. Recently, Wang et al. [8] proposed a divide-and-conquer approach that facilitates efficient MST-based clustering by using the idea of the reverse-delete algorithm.

3. PROPOSED METHOD

Our algorithm mainly consists of the following steps:

1. Representation of the n-dimensional data points in the form of a Dissimilarity Matrix (object-by-object structure).
2.
Construction of a Spanning Tree (ST) using this Dissimilarity Matrix (DM).
3. Construction of the MST from the ST.
4. Generation of clusters using the MST.

3.1. Dissimilarity Matrix Representation

In most clustering algorithms the data points are represented either as a Data Matrix or as a Dissimilarity Matrix. In our method
we represent the data points in the form of a Dissimilarity Matrix, which contains the distance values between the data points stored as a lower or upper triangular matrix. The distance measure we use is the Euclidean distance:

d(i, j) = √((x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_in − x_jn)²)    (1)

where i and j are n-dimensional data points. Consider the sample data about students shown in Table 1.

Table 1: Sample Data

StudentID   Age   Marks
1           18    73.0
2           18    79.0
3           23    70.0
4           20    55.0
5           22    85.0
6           19    91.0
7           20    17.0
8           21    53.0
9           19    82.0
10          47    75.0

The DM for this sample data, computed using Eq. (1), is shown in Table 2.

Table 2: Dissimilarity Matrix (upper triangular)

      1     2      3      4      5      6      7      8      9      10
1     0     6.0    5.83   18.11  12.65  18.03  56.04  20.22  9.06   29.07
2           0      10.3   24.08  7.21   12.04  62.03  26.17  3.16   29.27
3                  0      15.3   15.03  21.38  53.08  17.12  12.65  24.52
4                         0      30.07  36.01  38.0   2.24   27.02  33.6
5                                0      6.71   68.03  32.02  4.24   26.93
6                                       0      74.01  38.05  9.0    32.25
7                                              0      36.01  65.01  63.98
8                                                     0      29.07  34.06
9                                                            0      28.86
10                                                                  0

3.2. Construction of Spanning Tree

(i) Randomly choose one edge and add it to the ST.
(ii) Repeat steps (iii) and (iv) until the number of edges in the ST equals N-1, where N is the number of data points.
(iii) Select an edge e such that exactly one end point of e is in the ST and dist(e) ≠ 0.
(iv) Add edge e to the ST.

The sample spanning tree for the above data, obtained by this procedure starting from the randomly selected edge {1, 2}, is shown in Table 3 and Figure 2.

Table 3: Spanning Tree

Edge      Distance/Weight
{1, 2}    6.0
{2, 3}    10.3
{1, 4}    18.11
{4, 5}    30.07
{1, 6}    18.03
{6, 7}    74.01
{1, 8}    20.22
{1, 9}    9.06
{5, 10}   26.93

Figure 2: Spanning Tree

3.3. Construction of MST - Proposed Algorithm

The basic idea of our proposed algorithm is as follows:

Repeat
1. Select the longest-distance edge e from the ST.
2. Remove edge e from the ST; the vertices of the ST are then partitioned into two sets P1 and P2.
3. Find an edge E such that the following conditions are satisfied:
(i) dist(E) < dist(e).
(ii) One of the end points of E is in one partition and the other end point is in the other partition.
4. If edge E is found, add edge E to the ST.
5. Otherwise (edge E not found), add edge e to the MST.
Until (number of edges in the MST = N-1);

For example, in the above ST the longest edge is e = {6, 7}, with distance 74.01. Removing this edge from the ST partitions the vertices (data points) into two sets P1 = {1, 2, 3, 4, 5, 6, 8, 9, 10} and P2 = {7}. From the DM we can find many edges satisfying the above two conditions: {1-7, 2-7, 3-7, 4-7, 5-7, 8-7, 9-7, 10-7}. We select the minimum-distance edge from this set and add it to the ST. The final MST generated by this process is depicted in Figure 3.

Figure 3: Minimum Spanning Tree

3.4. Generating Clusters using the MST

(i) Calculate the Mean (M) and Standard Deviation (SD) of the edge weights in the MST.
(ii) Calculate the Threshold λ = M + SD.
(iii) For each edge e ∈ MST: if the weight w(e) ≥ λ, remove e from the MST.

This gives us disjoint subtrees {T1, T2, T3}; each subtree Ti is a cluster. For the above MST, Mean = 11.55667, Standard Deviation = 11.60832 and Threshold = 23.16499. The clusters formed are Cluster1 = {1, 2, 3, 4, 5, 6, 8, 9}, Cluster2 = {7} and Cluster3 = {10}.

4. CONCLUSIONS

In this paper we have presented a new approach for the construction of the minimum spanning tree that takes less time than classical minimum spanning tree algorithms. Unlike algorithms such as k-means, our algorithm does not require any prior parameter values such as the number of clusters or initial cluster seeds. We have carried out experiments on synthetic data sets, namely Students and Employees data, and the experimental results demonstrate that the proposed algorithm performs better than k-means.
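The full procedure of Section 3 can be sketched end-to-end as follows. This is an illustrative Python implementation under our own naming, not the authors' code; for reproducibility it starts from the sample spanning tree of Table 3 rather than a random one, and it uses the sample standard deviation, which reproduces the threshold 23.165 quoted above:

```python
import math

def dissimilarity_matrix(points):
    """Pairwise Euclidean distances of Eq. (1), stored as a full matrix."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def refine_to_mst(dm, st):
    """Section 3.3: repeatedly detach the longest spanning-tree edge e; if a
    strictly cheaper edge E reconnects the two partitions, swap it in,
    otherwise e is minimal for its cut and moves to the MST."""
    n = len(dm)
    st, mst = list(st), []
    while st:
        st.sort(key=lambda e: dm[e[0]][e[1]])
        u, v = st.pop()                      # longest remaining edge e
        adj = {i: [] for i in range(n)}      # partition via remaining edges
        for a, b in st + mst:
            adj[a].append(b)
            adj[b].append(a)
        part, stack = {u}, [u]
        while stack:
            for b in adj[stack.pop()]:
                if b not in part:
                    part.add(b)
                    stack.append(b)
        # cheapest edge E crossing the cut (P1, P2)
        a, b = min(((a, b) for a in part for b in range(n) if b not in part),
                   key=lambda e: dm[e[0]][e[1]])
        if dm[a][b] < dm[u][v]:
            st.append((a, b))                # dist(E) < dist(e): keep refining
        else:
            mst.append((u, v))               # no cheaper edge: e is final
    return mst

def cut_clusters(dm, mst):
    """Section 3.4: drop edges with weight >= mean + standard deviation;
    connected components of the kept edges are the clusters."""
    w = [dm[a][b] for a, b in mst]
    mean = sum(w) / len(w)
    sd = math.sqrt(sum((x - mean) ** 2 for x in w) / (len(w) - 1))
    parent = list(range(len(dm)))            # union-find over kept edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in mst:
        if dm[a][b] < mean + sd:
            parent[find(a)] = find(b)
    groups = {}
    for v in range(len(dm)):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Students of Table 1 and the sample spanning tree of Table 3 (0-indexed)
students = [(18, 73.0), (18, 79.0), (23, 70.0), (20, 55.0), (22, 85.0),
            (19, 91.0), (20, 17.0), (21, 53.0), (19, 82.0), (47, 75.0)]
st = [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5), (5, 6), (0, 7), (0, 8), (4, 9)]
dm = dissimilarity_matrix(students)
mst = refine_to_mst(dm, st)
print(cut_clusters(dm, mst))   # [[0, 1, 2, 3, 4, 5, 7, 8], [6], [9]]
```

The printed components are the paper's Cluster1 = {1, ..., 6, 8, 9}, Cluster2 = {7} and Cluster3 = {10} in 0-indexed form, and the resulting tree weight (≈ 104.01) matches the mean of 11.55667 over nine edges reported above.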
REFERENCES

[1] Daxin Jiang, Chun Tang and Aidong Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, 16, 2004, 1370-1385.
[2] C. T. Zahn, Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters, IEEE Trans. Computers, 20(1), 1971, 68-86.
[3] Y. Xu, V. Olman and D. Xu, Clustering Gene Expression Data using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees, Bioinformatics, 18(4), 2002, 536-545.
[4] A. Vathy-Fogarassy, A. Kiss and J. Abonyi, Hybrid Minimal Spanning Tree and Mixture of Gaussians based Clustering Algorithm, Foundations of Information and Knowledge Systems, Springer, 2006, 313-330.
[5] J. Lin, D. Ye, C. Chen and M. Gao, Minimum Spanning Tree-Based Spatial Outlier Mining and Its Applications, Lecture Notes in Computer Science, Springer-Verlag, Vol. 5009, 2008, 508-515.
[6] M. F. Jiang, S. S. Tseng and C. M. Su, Two-Phase Clustering Process for Outliers Detection, Pattern Recognition Letters, 22, 2001, 691-700.
[7] N. Paivinen, Clustering with a Minimum Spanning Tree of Scale-free-like Structure, Pattern Recognition Letters, 26(7), 2005, 921-930.
[8] Xiaochun Wang, Xiali Wang and D. Mitchell Wilkes, A Divide-and-Conquer Approach for Minimum Spanning Tree-based Clustering, IEEE Transactions on Knowledge and Data Engineering, 21, 2009.
[9] O. Grygorash, Y. Zhou and Z. Jorgensen, Minimum Spanning Tree-based Clustering Algorithms, Proc.
IEEE Int'l Conf. Tools with Artificial Intelligence, 2006, pp. 73-81.
[10] A. K. Jain, Algorithms for Clustering Data, Englewood Cliffs, New Jersey: Prentice Hall, 1988.
[11] N. Chowdhury and C. A. Murthy, Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier, Pattern Recognition, 30(11), 1997, 1919-1929.
[12] M. Laszlo and S. Mukherjee, Minimum Spanning Tree Partitioning Algorithm for Microaggregation, IEEE Trans. on Knowledge and Data Engineering, 17(7), 2005, 902-911.