A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm


IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 5, No. 2, April 2011
CSES International 2011, ISSN 0973-4406

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

Bhaskar Adepu and Kiran Kumar Bejjanki
Department of MCA, Kakatiya Institute of Technology & Science, Warangal, Andhra Pradesh, INDIA, 506015
E-Mail: bhaskar_adepu@yahoo.com; kiran_b_kumar@yahoo.com

Abstract: Clustering analysis has become an important research issue in data mining because of its wide range of applications, and in recent years it has become an essential tool for gene expression analysis. Many clustering algorithms have been proposed, but each has its own advantages and disadvantages, and none works well in every situation. Minimum Spanning Tree (MST) based clustering algorithms are widely used because of their ability to detect clusters with irregular boundaries. In this paper we present a clustering algorithm inspired by the MST, in which we propose a new method for constructing the MST that reduces the time taken compared with traditional construction methods.

Key words: Clustering, Minimum Spanning Tree, Partitioning, Dissimilarity Matrix.

1. INTRODUCTION

Clustering is the process of grouping data objects into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Usually, the common properties are evaluated quantitatively by some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance. Clustering plays an important role in many fields, including pattern recognition, image processing, biological data analysis, microaggregation, mobile communication, medicine and economics. Clustering is used to explore the hidden structure of modern large databases, and many algorithms have been proposed in the literature.
Because of the huge variety of problems and data distributions, different techniques, such as hierarchical, partitioning, density-based and model-based algorithms, have been developed, and no technique is completely satisfactory in all cases. With recent advances in microarray technology there has been tremendous growth in microarray data, and identifying co-regulated genes so as to organize them into meaningful groups has become an important research problem in bioinformatics. Clustering analysis has therefore become an essential and valuable tool in microarray (gene expression) data analysis [1].

(Manuscript received May 25, 2010; revised December 15, 2010.)

Given a set of N data points, a minimum spanning tree is a spanning tree that connects all the data points, either by a direct edge or by a path, and has minimum total weight, where the total weight is the sum of the weights of all the edges of the spanning tree. In MST-based clustering algorithms, the weight of each edge is usually computed as the Euclidean distance between the two points it connects.

MST-based clustering algorithms allow us to overcome many of the problems faced by classical clustering algorithms, and because of their ability to detect clusters with irregular boundaries they are widely used in practice. Zahn [2] first proposed MST-based clustering; such algorithms have since been studied extensively in biological data analysis [3], image processing, pattern recognition [4] and outlier detection [5], [6]. An MST-based clustering algorithm [2] usually consists of three steps:

(1) A minimum spanning tree is constructed (typically in quadratic time) using either Prim's algorithm or Kruskal's algorithm.
(2) The inconsistent edges are removed to obtain a set of connected components (clusters).
(3) Step (2) is repeated until some terminating condition is satisfied.
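The three steps above can be sketched in Python. This is only an illustrative sketch, not the authors' code: the point set, the choice of Prim's algorithm, a fixed k, and the simple "remove the k-1 longest edges" inconsistency rule are all assumptions made for the example.

```python
import math

def prim_mst(pts):
    """Prim's algorithm in O(N^2): returns the MST as (weight, i, j) edges."""
    best = {j: (math.dist(pts[0], pts[j]), 0) for j in range(1, len(pts))}
    edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])   # cheapest vertex to attach
        w, i = best.pop(j)
        edges.append((w, i, j))
        for v in best:                            # relax candidate distances via j
            d = math.dist(pts[j], pts[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return edges

def mst_clusters(pts, k):
    """Step (1): build the MST; step (2): drop the k-1 longest ("inconsistent")
    edges; the connected components of the remaining forest are the clusters."""
    edges = sorted(prim_mst(pts))
    kept = edges[:len(edges) - (k - 1)]
    parent = list(range(len(pts)))                # union-find over the vertices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(len(pts)):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# two visually separated groups; the single long "bridge" edge is inconsistent
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
clusters = mst_clusters(points, k=2)
```

With k = 2 the single long edge joining the two groups is removed, and the clusters recovered are {0, 1, 2} and {3, 4, 5}.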

In this paper we propose a new method for constructing the MST that is based on a partitioning technique. Our algorithm requires no prior knowledge of parameters such as the number of clusters or the dimensionality of the dataset.

The rest of the paper is organized as follows. In Section 2 we introduce the necessary concepts of the MST and review existing work on MST-based clustering algorithms. We present the MST construction method and the proposed algorithm in Section 3. Finally, conclusions are drawn in Section 4.

2. RELATED WORK

2.1. MST-based Clustering Algorithms

Once the MST has been constructed, the next step is to define an edge-inconsistency measure with which to partition the tree into clusters. The simplest inconsistency measure is the removal of the longest edge candidates from the MST, so that k clusters are formed by removing the (k-1) most inconsistent edges; the number of clusters k is given as an input parameter in many algorithms. The definition of the inconsistent edges and the development of the terminating condition are the two major issues that must be addressed in any MST-based clustering algorithm. In Zahn's original work [2], the inconsistent edges are those whose weights are significantly larger than the average weight of the nearby edges in the tree; the performance of this algorithm is affected by the size of the neighborhood considered. A five-group clustering is shown in Figure 1.

Figure 1: MST Representation of Five Group Clustering

There are other spanning-tree-based clustering algorithms that maximize or minimize the degrees of the vertices [7], but these are computationally expensive. Grygorash et al.
[9] proposed two MST-based clustering algorithms: the Hierarchical Euclidean-distance-based MST clustering algorithm (HEMST) and the Maximum Standard Deviation Reduction clustering algorithm (MSDR). As stated in [3], an MST-based clustering algorithm does not depend on the detailed geometric structure of a cluster, so it can overcome many of the problems faced by other clustering algorithms. Other graph structures, such as the Relative Neighborhood Graph (RNG), the Gabriel Graph (GG) and the Delaunay Triangulation (DT), have also been used for cluster analysis; the relationship among these graphs is MST ⊆ RNG ⊆ GG ⊆ DT [10]. In a density-oriented approach, Chowdhury and Murthy's MST-based clustering technique [11] assumes that the boundary between any two clusters must belong to a valley region, i.e., a region where the density of data points is lowest compared with the neighboring regions; the inconsistency measure is based on finding such valley regions. Laszlo and Mukherjee present an MST-based clustering algorithm [12] that puts a constraint on the minimum cluster size rather than on the number of clusters. This algorithm was developed for the microaggregation problem, where the number of clusters can be deduced from the constraints of the problem itself. Vathy-Fogarassy et al. suggest three new cutting criteria for MST-based clustering [4]; their goal is to decrease the number of heuristically defined parameters of existing algorithms, and thereby the influence of the user on the clustering results. Recently, Wang et al. [8] proposed a divide-and-conquer method to facilitate efficient MST-based clustering using the idea of the reverse-delete algorithm.

3. PROPOSED METHOD

Our algorithm consists of the following main steps:

1. Representation of the n-dimensional data points in the form of a Dissimilarity Matrix (object-by-object structure).
2. Construction of a Spanning Tree (ST) using this Dissimilarity Matrix (DM).
3. Construction of the MST from the ST.
4. Generation of clusters using the MST.

3.1. Dissimilarity Matrix Representation

In most clustering algorithms the data points are represented either as a data matrix or as a dissimilarity matrix. In our method

we represent the data points in the form of a dissimilarity matrix, which contains the distance values between the data points stored as a lower or upper triangular matrix. The distance measure we use is the Euclidean distance:

    d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)    (1)

where i and j are n-dimensional data points. Consider the sample data about students shown in Table 1.

Table 1: Sample Data

StudentID  Age  Marks
1          18   73.0
2          18   79.0
3          23   70.0
4          20   55.0
5          22   85.0
6          19   91.0
7          20   17.0
8          21   53.0
9          19   82.0
10         47   75.0

The DM for this sample data, computed with Eq. (1), is shown in Table 2.

Table 2: Dissimilarity Matrix

     1   2     3      4      5      6      7      8      9      10
1    0   6.0   5.83   18.11  12.65  18.03  56.04  20.22  9.06   29.07
2        0     10.3   24.08  7.21   12.04  62.03  26.17  3.16   29.27
3              0      15.3   15.03  21.38  53.08  17.12  12.65  24.52
4                     0      30.07  36.01  38.0   2.24   27.02  33.6
5                            0      6.71   68.03  32.02  4.24   26.93
6                                   0      74.01  38.05  9.0    32.25
7                                          0      36.01  65.01  63.98
8                                                 0      29.07  34.06
9                                                        0      28.86
10                                                              0

3.2. Construction of the Spanning Tree

(i) Randomly choose one edge and add it to the ST.
(ii) Repeat steps (iii) and (iv) until the number of edges in the ST equals N-1, where N is the number of data points.
(iii) Select an edge e such that exactly one end point of e is in the ST and dist(e) ≠ 0.
(iv) Add edge e to the ST.

The sample spanning tree obtained for the above data by this procedure, starting from the randomly selected edge {1, 2}, is shown in Table 3 and Figure 2.

Table 3: Spanning Tree

Edge      Distance/Weight
{1, 2}    6.0
{2, 3}    10.3
{1, 4}    18.11
{4, 5}    30.07
{1, 6}    18.03
{6, 7}    74.01
{1, 8}    20.22
{1, 9}    9.06
{5, 10}   26.93

Figure 2: Spanning Tree

3.3. Construction of the MST - Proposed Algorithm

The basic idea of our proposed algorithm is as follows:

Repeat
  1. Select the longest edge e in the ST.
  2. Remove e from the ST, so that the vertices of the ST are partitioned into two sets P1 and P2.
  3. Find an edge E such that the following conditions are satisfied:
     (i) dist(E) < dist(e), and
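The refinement loop of Section 3.3 can be sketched as follows on the Students data. This is a hedged reconstruction, not the authors' code: it starts from a deterministic chain spanning tree rather than a random one, and it keeps re-examining edges until no swap improves the tree (by the cut-optimality property of spanning trees, the result is then the MST).

```python
import math

# Students sample data from Table 1: StudentID -> (Age, Marks)
data = {1: (18, 73.0), 2: (18, 79.0), 3: (23, 70.0), 4: (20, 55.0),
        5: (22, 85.0), 6: (19, 91.0), 7: (20, 17.0), 8: (21, 53.0),
        9: (19, 82.0), 10: (47, 75.0)}

def dist(i, j):
    """Euclidean distance of Eq. (1)."""
    return math.hypot(data[i][0] - data[j][0], data[i][1] - data[j][1])

def split(edges, removed):
    """Vertex sets P1, P2 of the two components left after deleting `removed`."""
    adj = {v: [] for v in data}
    for u, v in edges:
        if (u, v) != removed:
            adj[u].append(v)
            adj[v].append(u)
    seen, stack = set(), [removed[0]]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x])
    return seen, set(data) - seen

def refine_to_mst(edges):
    """Repeatedly take the longest edge e; if a strictly shorter edge E
    reconnects P1 and P2, swap it in, otherwise keep e (steps 1-5)."""
    edges = list(edges)
    while True:
        for e in sorted(edges, key=lambda x: -dist(*x)):   # longest first
            p1, p2 = split(edges, e)
            E = min(((u, v) for u in p1 for v in p2), key=lambda x: dist(*x))
            if dist(*E) < dist(*e) - 1e-9:
                edges.remove(e)
                edges.append(E)
                break                       # tree changed: rescan from scratch
        else:
            return edges                    # no edge improvable: this is the MST

nodes = sorted(data)
chain_st = list(zip(nodes, nodes[1:]))      # an arbitrary initial spanning tree
mst = refine_to_mst(chain_st)
```

On this data the long edge {6, 7} (weight 74.01) ends up replaced by the much shorter {7, 8} (36.01), and the final tree weight is about 104.01, matching the MST of Figure 3.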

     (ii) one end point of E is in one partition and the other end point is in the other partition.
  4. If such an edge E is found, add E to the ST.
  5. Otherwise, add e to the MST.
Until (number of edges in the MST = N-1)

For example, in the above ST the longest edge is e = {6, 7}, whose distance is 74.01. Removing this edge from the ST partitions the vertices (data points) into the two sets P1 = {1, 2, 3, 4, 5, 6, 8, 9, 10} and P2 = {7}. In the DM we can find many edges satisfying the above two conditions: {1-7, 2-7, 3-7, 4-7, 5-7, 8-7, 9-7, 10-7}. We select the minimum-distance edge from this set and add it to the MST. The final MST generated by this process is depicted in Figure 3.

Figure 3: Minimum Spanning Tree

3.4. Generating Clusters using the MST

(i) Calculate the mean (M) and standard deviation (SD) of the edge weights in the MST.
(ii) Calculate the threshold λ = M + SD.
(iii) For each edge e in the MST, if its weight w_e ≥ λ, remove e from the MST.

This gives us disjoint subtrees {T1, T2, T3}, each of which is a cluster. For the above MST, Mean = 11.55667, Standard Deviation = 11.60832 and Threshold = 23.16499. The clusters formed are: Cluster 1: {1, 2, 3, 4, 5, 6, 8, 9}, Cluster 2: {7}, Cluster 3: {10}.

4. CONCLUSIONS

In this paper we have presented a new approach to the construction of the minimum spanning tree that takes less time than the classical minimum spanning tree algorithms. Unlike algorithms such as k-means, our algorithm does not require any prior parameter values, such as the number of clusters or initial cluster seeds. We have carried out experiments on some synthetic datasets, namely the Students and Employees data, and the experimental results demonstrate that the proposed algorithm performs better than k-means.
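The threshold rule of Section 3.4 is straightforward to reproduce. In the sketch below (my own illustration, with the MST edge weights of the worked example hard-coded from Table 2), λ is computed with the sample standard deviation, which matches the paper's value of 23.16499:

```python
import statistics

# MST edges of the Students example (Figure 3) with their weights from Table 2
mst_edges = {(1, 2): 6.0, (1, 3): 5.83, (2, 9): 3.16, (5, 9): 4.24,
             (5, 6): 6.71, (3, 4): 15.3, (4, 8): 2.24, (3, 10): 24.52,
             (7, 8): 36.01}

weights = list(mst_edges.values())
lam = statistics.mean(weights) + statistics.stdev(weights)   # threshold = M + SD

# drop every edge whose weight is >= lambda, keeping the rest as a forest
adj = {v: set() for v in range(1, 11)}
for (u, v), w in mst_edges.items():
    if w < lam:
        adj[u].add(v)
        adj[v].add(u)

# the connected components of the forest are the clusters
clusters, seen = [], set()
for start in adj:
    if start not in seen:
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x])
        clusters.append(comp)
        seen |= comp
```

This yields λ ≈ 23.165; the edges {3, 10} and {7, 8} are removed, giving the paper's three clusters {1, 2, 3, 4, 5, 6, 8, 9}, {7} and {10}.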
REFERENCES

[1] Daxin Jiang, Chun Tang and Aidong Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, 16, 2004, 1370-1385.
[2] C. T. Zahn, Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters, IEEE Trans. Computers, 20(1), 1971, 68-86.
[3] Y. Xu, V. Olman and D. Xu, Clustering Gene Expression Data using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees, Bioinformatics, 18(4), 2002, 536-545.
[4] A. Vathy-Fogarassy, A. Kiss and J. Abonyi, Hybrid Minimal Spanning Tree and Mixture of Gaussians based Clustering Algorithm, Foundations of Information and Knowledge Systems, Springer, 2006, 313-330.
[5] J. Lin, D. Ye, C. Chen and M. Gao, Minimum Spanning Tree-Based Spatial Outlier Mining and Its Applications, Lecture Notes in Computer Science, Vol. 5009, Springer-Verlag, 2008, 508-515.
[6] M. F. Jiang, S. S. Tseng and C. M. Su, Two-Phase Clustering Process for Outliers Detection, Pattern Recognition Letters, 22, 2001, 691-700.
[7] N. Paivinen, Clustering with a Minimum Spanning Tree of Scale-free-like Structure, Pattern Recognition Letters, 26(7), 2005, 921-930.
[8] Xiaochun Wang, Xiali Wang and D. Mitchell Wilkes, A Divide-and-Conquer Approach for Minimum Spanning Tree-based Clustering, IEEE Transactions on Knowledge and Data Engineering, 21, 2009.
[9] O. Grygorash, Y. Zhou and Z. Jorgensen, Minimum Spanning Tree-based Clustering Algorithms, Proc.

IEEE Int'l Conf. Tools with Artificial Intelligence, 2006, 73-81.
[10] A. K. Jain, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[11] N. Chowdhury and C. A. Murthy, Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier, Pattern Recognition, 30(11), 1997, 1919-1929.
[12] M. Laszlo and S. Mukherjee, Minimum Spanning Tree Partitioning Algorithm for Microaggregation, IEEE Trans. on Knowledge and Data Engineering, 17(7), 2005, 902-911.