CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH


4.1 INTRODUCTION

Genes can belong to any genetic network and are also coordinated by many regulatory mechanisms, so a gene can belong to more than one cluster at a time. This motivates the use of fuzzy clustering methods, which can assign a single object to several clusters. Fuzzy c-means clustering (FCM) is one of the most common methods used in the field of microarray data analysis. The degree of membership in the fuzzy clusters depends on the closeness of the data object to the cluster centers. FCM partitions the given data set into subsets so that specific clustering criteria are optimized (Babuska 2009). The FCM algorithm has some limitations, such as the arbitrary choice of initial center values, the identification of only spherical-shaped clusters, slow convergence and the local-minimum problem (Bezdek 1974a; Bezdek 1974b). Due to these factors, an optimal solution cannot be guaranteed. In the proposed approach, the input parameters of the FCM algorithm, namely the number of clusters and the initial cluster centroids, are determined using a density based approach.

4.2 RELATED WORK

Zou et al (2009) proposed a grid- and density-based initialization method for the fuzzy c-means algorithm that determines the cluster number and the initial cluster centroids in order to improve the clustering result and reliably reduce the clustering time. The method suits spherical clusters well; however, it depends on the initial partitioning parameters and the density threshold value.

Samarjit & Hemanta (2014) proposed a modified approach for determining the initial centroids in order to overcome the random-initialization problem. The final clusters are obtained by applying these initial centroids in the fuzzy c-means clustering algorithm. The method is evaluated with the Partition Coefficient and Clustering Entropy validity indices to demonstrate its efficiency.

Thanh & Tom (2011) proposed a novel initialization method that overcomes the local-optimum problem, which depends strongly on the initial parameter settings. The method uses fuzzy subtractive clustering, which operates on a fuzzy partition of the data instead of on the data themselves. It does not require specification of the mountain peaks and mountain radii, which makes it more efficient for large data sets. The data-likelihood estimator based on fuzzy partitions, used for model selection, works better than existing methods.

Liang et al (2011) proposed a novel initialization method for categorical data using a k-modes-type algorithm. It determines efficient initial cluster centers and provides a criterion for finding candidates for the number of clusters. Since it achieves linear time complexity, it can be applied to large data sets.

A new evolutionary algorithm has been proposed to address the sensitivity of fuzzy c-means (FCM) and hard c-means (HCM) to the initial cluster centers (Anima et al 2012). In the first stage, the proposed Teaching Learning Based Optimization (TLBO) approach explores the search space of the given data set to find near-optimal cluster centers, which are evaluated using the reformulated c-means objective function. In the second stage, the best cluster centers found are used as the initial cluster centers for the c-means algorithm. The TLBO algorithm searches both globally and locally to find appropriate cluster centers.

Artificial neural networks (ANN) have been employed to determine the number of clusters (Alp et al 2011). The proposed feed-forward artificial neural network method is compared with cluster validity indices such as PC, CE and XB.

4.3 DENSITY BASED CLUSTERING ALGORITHM

DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al 1996) is a density-based clustering algorithm that defines clusters as sets of density-connected points. The method requires two input parameters: the minimum number of objects (MinPts) and the neighborhood radius (Eps). It uses the following definitions:

Definition 1 (Eps-neighborhood of a point): The neighborhood within a radius Eps of a given object p is the Eps-neighborhood of the object, defined as N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }.

Core object and border point: If the Eps-neighborhood of a point contains at least a minimum number of points (MinPts), the point is a core object; a point on the border of a cluster is a border point.

Definition 2 (Directly density-reachable): A point p is directly density-reachable from a point q wrt. Eps and MinPts if

i) p ∈ N_Eps(q), and
ii) |N_Eps(q)| ≥ MinPts.

Definition 3 (Density-reachable): A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Definition 4 (Density-connected): A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

Definition 5 (Cluster): Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
i) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C.
ii) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts.

Definition 6 (Noise): Let C_1, …, C_k be the clusters of the database D wrt. the parameters Eps_i and MinPts_i, i = 1, …, k. The noise is the set of points in D not belonging to any cluster C_i, i.e. noise = { p ∈ D | ∀ i: p ∉ C_i }.

Algorithm 4.1: DBSCAN
1. Select an arbitrary point p.
2. Retrieve all points density-reachable from p wrt. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and the method selects the next point of the database.
5. The process continues until all points have been visited.

4.4 METHODOLOGY

Cluster analysis places similar objects in the same group. The proposed preliminary steps of the FCM algorithm are presented along with the
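Algorithm 4.1 can be sketched in Python as follows. This is a minimal illustrative implementation, not the thesis's code: the function names, the brute-force neighborhood search and the NumPy representation of the data are our assumptions.

```python
import numpy as np

def region_query(data, p, eps):
    # Eps-neighborhood of point p (Definition 1): all indices within radius eps
    dists = np.linalg.norm(data - data[p], axis=1)
    return np.where(dists <= eps)[0]

def dbscan(data, eps, min_pts):
    # Label each point with a cluster id; -1 marks noise (Definition 6)
    n = len(data)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(region_query(data, p, eps))
        if len(neighbors) < min_pts:
            continue                     # border point or noise: try next point
        labels[p] = cluster_id           # p is a core point: grow a new cluster
        i = 0
        while i < len(neighbors):
            q = neighbors[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(data, q, eps)
                if len(q_neighbors) >= min_pts:      # q is also a core point:
                    neighbors.extend(q_neighbors)    # its neighbors are density-reachable
            if labels[q] == -1:
                labels[q] = cluster_id
            i += 1
        cluster_id += 1
    return labels
```

The expansion loop realizes density-reachability (Definition 3): only core points extend the neighbor list, while border points are labeled but not expanded, so a cluster is exactly the set of points density-connected through its core objects.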

standard fuzzy c-means clustering method. The proposed approach consists of four steps:
1. Data preprocessing (missing-value handling).
2. A method for initializing the number of clusters.
3. A method for initializing the membership matrix.
4. The improved fuzzy c-means clustering method.

Figure 4.1 shows the outline of the proposed approach. The gene expression data set is preprocessed to handle missing values. The denser regions are determined using the density based clustering algorithm DBSCAN. The core points generated are used to initialize the number of clusters c and the membership matrix, which form the initial steps of the FCM clustering algorithm. The results of the proposed method are compared with the standard algorithm to show the performance of the methods.

Figure 4.1 Framework of the proposed model (gene expression data set → preprocessing of data → density approach → determine optimal no. of clusters → initializing cluster membership matrix → improved fuzzy c-means → compare the clustering results)

4.4.1 Data Preprocessing

As described in Chapter 3, the bagging k-NN imputation method estimates the missing values of gene microarray data sets and suits unstable learning algorithms.

4.4.2 Method for Initializing the Number of Clusters

Let D = {x_1, x_2, …, x_n} be the set of data objects, and let CP = {x_cp1, x_cp2, …, x_cpm} be the core points obtained from the density based clustering algorithm (Section 4.3). Each core point in CP has the potential to be a cluster center. The density of each core point is measured using the following equation:

cc_i = Σ_{j=1}^{m} e^( −‖x_cpi − x_cpj‖² / Eps² ),  i = 1, 2, …, c    (4.1)

where Eps is the neighborhood radius and m is the total number of core points. Thus, the potential associated with each core point depends on its distance to all other core points, and the core points with the largest density determine the initial number of clusters.

4.4.3 Method for Initializing the Membership Matrix

The identified potential cluster centers are used to obtain the cluster centroids. The normalized centers are given by

centers_i = w · max_a log(cp_ai),  i = 1, 2, …, c    (4.2)

where Σ_{i=1}^{c} centers_i = 1 and w is the adjustment weight factor determined according to the priority of the membership value. Equation (4.2) gives the closeness of the objects assigned to the cluster centroids.
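The density potential of Equation (4.1) can be computed for all core points at once. The sketch below is illustrative (the function name is ours), assuming the core points are the rows of a NumPy array:

```python
import numpy as np

def core_point_potentials(core_points, eps):
    # cc_i = sum_j exp(-||x_cpi - x_cpj||^2 / eps^2)  for each core point i (Eq. 4.1)
    cp = np.asarray(core_points, dtype=float)
    # pairwise squared distances between all core points via broadcasting
    diff = cp[:, None, :] - cp[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / eps ** 2).sum(axis=1)
```

A core point surrounded by many nearby core points accumulates a large potential, while an isolated one contributes only its own e^0 = 1 term, so the peaks of this potential indicate the candidate cluster centers.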

4.4.4 Improved Fuzzy C-Means Method

Input: microarray data set D, number of clusters c
Output: cluster centers, membership values, objective value

Algorithm 4.2: Improved FCM
1. Apply DBSCAN as the first step to initialize the prototypes; identify the core points x_cpi.
2. Fix the values for m and Eps.
3. Compute the number of clusters c using Equation (4.1).
4. Compute the initial membership matrix using Equation (4.2).
5. Compute the Euclidean distances d_ij, i = 1, 2, …, c; j = 1, 2, …, n, using Equation (3.3).
6. Update the membership function µ_ij, i = 1, 2, …, c; j = 1, 2, …, n, using Equation (4.3):

µ_ij = 1 / Σ_{k=1}^{c} (d_ij / d_kj)^(2/(m−1))    (4.3)

7. Update the cluster centers from the current memberships.
8. If not converged, go to step 5.

4.5 EXPERIMENTAL RESULTS

This section compares the results of the improved fuzzy clustering algorithm with the existing algorithm applied to microarray data sets in order to evaluate the performance of the proposed initialization method. The algorithms are compared on the time taken to find the results, the objective values, the number of clusters, the number of iterations and the clustering accuracy.

The description of the number of samples and genes for the Yeast, Colon cancer, Leukemia and Splice microarray data sets is given in Section 3.4. Fuzzy partitioning performs an iterative optimization based on minimization of the objective function, with updates of the memberships µ_ij and the cluster centers c_i. Table 4.1 shows the number of iterations taken to obtain the desired number of clusters by minimizing the objective function of the FCM algorithm. FCM takes a larger number of iterations to converge to the termination value.

Table 4.1 Memberships of final iteration of FCM

Data set       Objective value   No. of clusters   No. of iterations
Yeast          5.0960            4                 68
Colon cancer   8.3100            3                 39
Leukemia       5.9973            3                 24
Splice         0.8478            4                 75

Table 4.2 Memberships of final iteration of Improved FCM

Data set       Objective value   No. of clusters   No. of iterations
Yeast          6.8526            3                 67
Colon cancer   5.4209            4                 28
Leukemia       3.5984            2                 15
Splice         1.0790            3                 72

Table 4.2 shows the memberships of the final iteration of the Improved FCM algorithm. The standard FCM algorithm took 39 iterations to cluster the colon cancer data set into three partitions, whereas the proposed method took 28 iterations to converge to four partitions. The observations on the other data sets likewise show that the proposed method gives better results with regard to the objective function value and the number of iterations. Table 4.3 compares the running times of the FCM and Improved FCM algorithms; the observations show that the proposed method takes less time and gives better accuracy.

Table 4.3 Comparison of running time and clustering accuracy

Data set       Method         Running time (s)   Clustering accuracy (%)
Yeast          FCM            11.8801            79.21
               Improved FCM    4.4840            79.30
Colon cancer   FCM             0.7650            89.41
               Improved FCM    0.6410            92.37
Leukemia       FCM            10.4540            78.59
               Improved FCM    7.9690            81.01
Splice         FCM             6.7030            76.55
               Improved FCM    6.0630            77.81

The four clusters obtained for the yeast data set by the FCM clustering algorithm are shown in Figure 4.2. The data reallocated into three clusters by the proposed improved FCM clustering algorithm are shown in Figure 4.3, which depicts the appropriate number of clusters. Applying the FCM clustering algorithm to the colon cancer data set produces the three clusters shown in Figure 4.4, whereas the proposed method produces the four clusters shown in Figure 4.5.

Figure 4.2 Four clusters of the yeast data set by FCM

Figure 4.3 Three clusters of the yeast data set by Improved FCM

Figure 4.4 Three clusters of the colon cancer data set by FCM

Figure 4.5 Four clusters of the colon cancer data set by Improved FCM

Figure 4.6 Three clusters of the leukemia data set by FCM

Figure 4.7 Two clusters of the leukemia data set by Improved FCM

Figures 4.6 and 4.7 show the partitioned clusters of the leukemia data set produced by the FCM and Improved FCM clustering algorithms, respectively. The FCM clustering algorithm partitions the data into three clusters, whereas the Improved FCM algorithm partitions it into two clusters.

4.6 CONCLUSION

In this chapter, an enhanced approach for initializing the membership matrix and the cluster number of the fuzzy c-means clustering algorithm was presented. The usual random assignment of the initial parameters of the FCM algorithm is replaced in this approach. The experimental results show that the improved FCM algorithm enhances the clustering accuracy and reduces the running time and the number of iterations needed to complete the experiments. The objective function value of the improved FCM algorithm is lower than that of the existing FCM algorithm, and the initial cluster centers determined with the DBSCAN method reduce the number of iterations needed to form the resulting partitions.