CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH


4.1 INTRODUCTION

Genes can belong to any genetic network and are also coordinated by many regulatory mechanisms, so a gene can belong to more than one cluster at a time. This motivates the use of fuzzy clustering methods, which can assign a single object to several clusters. Fuzzy c-means clustering (FCM) is one of the most common methods used in the field of microarray data analysis. The degree of membership in the fuzzy clusters depends on the closeness of the data object to the cluster centers. FCM partitions the given data set into subsets so that specific clustering criteria are optimized (Babuska 2009). The FCM algorithm has some limitations, such as the arbitrary choice of initial center values, the identification of only spherical-shaped clusters, slow convergence and the local-minimum problem (Bezdek 1974a; Bezdek 1974b). Due to these factors, an optimal solution cannot be guaranteed. In the proposed approach, the input parameters of the FCM algorithm, namely the number of clusters and the initial cluster centroids, are determined using a density based approach.

4.2 RELATED WORK

Zou et al (2009) proposed a grid- and density-based initialization method for the fuzzy c-means algorithm that determines the cluster number and the initial cluster centroids in order to improve the clustering result and reliably reduce the clustering time. The method suits spherical clusters well; however, it depends on the initial partitioning parameters and the density threshold value.

Samarjit & Hemanta (2014) proposed a modified approach for determining the initial centroids in order to overcome the random-initialization problem. The final clusters are obtained by applying these initial centroids in the fuzzy c-means clustering algorithm. The method is evaluated with the Partition Coefficient and Clustering Entropy validity indices to demonstrate its efficiency.

Thanh & Tom (2011) proposed a novel initialization method that overcomes the local-optimum problem, which depends strongly on the initial parameter settings. The method uses fuzzy subtractive clustering, which operates on a fuzzy partition of the data instead of on the data themselves. It does not require specification of the mountain peaks and mountain radii, which makes it more efficient for large data sets. The data-likelihood estimator based on fuzzy partitions, used for model selection, works better than existing methods.

Liang et al (2011) proposed a novel initialization method for categorical data using a k-modes-type algorithm. It determines efficient initial cluster centers and provides a criterion for finding candidates for the number of clusters. Since it achieves linear time complexity, it can be applied to large data sets.

A new evolutionary algorithm has been proposed to address the sensitivity of fuzzy c-means (FCM) and hard c-means (HCM) to the initial cluster centers (Anima et al 2012). In the first stage, the proposed Teaching Learning Based Optimization (TLBO) approach explores the search space of the given data set to find near-optimal cluster centers, which are evaluated using the reformulated c-means objective function. In the second stage, the best cluster centers found are used as the initial cluster centers for the c-means algorithm. The TLBO algorithm searches both globally and locally to find appropriate cluster centers.

Artificial neural networks (ANN) have been employed to determine the number of clusters (Alp et al 2011). The proposed feed-forward artificial neural network method is compared with cluster validity indices such as PC, CE and XB.

4.3 DENSITY BASED CLUSTERING ALGORITHM

DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al 1996) is a density-based clustering algorithm that defines clusters as sets of density-connected points. The method requires two input parameters: the minimum number of objects (MinPts) and the neighborhood radius (Eps). It uses the following definitions:

Definition 1 (Eps-neighborhood of a point): The neighborhood within a radius Eps of a given object p is the Eps-neighborhood of the object, defined as N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }.

Core object and border point: If the Eps-neighborhood of a point contains at least a minimum number of points (MinPts), the point is a core object; a point on the border of a cluster is a border point.

Definition 2 (Directly density-reachable): A point p is directly density-reachable from a point q wrt. Eps and MinPts if

i) p ∈ N_Eps(q), and
ii) |N_Eps(q)| ≥ MinPts.

Definition 3 (Density-reachable): A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Definition 4 (Density-connected): A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

Definition 5 (Cluster): Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
i) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C.
ii) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts.

Definition 6 (Noise): Let C_1, …, C_k be the clusters of the database D wrt. the parameters Eps_i and MinPts_i, i = 1, …, k. The noise is the set of points in D not belonging to any cluster C_i, i.e. noise = { p ∈ D | ∀ i: p ∉ C_i }.

Algorithm 4.1: DBSCAN
1. Select an arbitrary point p.
2. Retrieve all points density-reachable from p wrt. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and the method selects the next point of the database.
5. The process continues until all points have been visited.

4.4 METHODOLOGY

Cluster analysis places similar objects in the same group. The proposed preliminary steps of the FCM algorithm are presented along with the
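Algorithm 4.1 can be sketched in Python as follows. This is a minimal illustrative implementation, not the thesis's code: the function names, the brute-force neighborhood search and the NumPy representation of the data are our assumptions.

```python
import numpy as np

def region_query(data, p, eps):
    # Eps-neighborhood of point p (Definition 1): all indices within radius eps
    dists = np.linalg.norm(data - data[p], axis=1)
    return np.where(dists <= eps)[0]

def dbscan(data, eps, min_pts):
    # Label each point with a cluster id; -1 marks noise (Definition 6)
    n = len(data)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(region_query(data, p, eps))
        if len(neighbors) < min_pts:
            continue                     # border point or noise: try next point
        labels[p] = cluster_id           # p is a core point: grow a new cluster
        i = 0
        while i < len(neighbors):
            q = neighbors[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(data, q, eps)
                if len(q_neighbors) >= min_pts:      # q is also a core point:
                    neighbors.extend(q_neighbors)    # its neighbors are density-reachable
            if labels[q] == -1:
                labels[q] = cluster_id
            i += 1
        cluster_id += 1
    return labels
```

The expansion loop realizes density-reachability (Definition 3): only core points extend the neighbor list, while border points are labeled but not expanded, so a cluster is exactly the set of points density-connected through its core objects.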

standard fuzzy c-means clustering method. The proposed approach consists of four steps:
1. Data preprocessing (missing-value handling).
2. A method for initializing the number of clusters.
3. A method for initializing the membership matrix.
4. The improved fuzzy c-means clustering method.

Figure 4.1 shows the outline of the proposed approach. The gene expression data set is preprocessed to handle missing values. The denser regions are determined using the density based clustering algorithm DBSCAN. The core points generated are used to initialize the number of clusters c and the membership matrix, which form the initial steps of the FCM clustering algorithm. The results of the proposed method are compared with the standard algorithm to show the performance of the methods.

Figure 4.1 Framework of the proposed model (gene expression data set → preprocessing of data → density approach → determine optimal no. of clusters → initializing cluster membership matrix → improved fuzzy c-means → compare the clustering results)

4.4.1 Data Preprocessing

As described in Chapter 3, the bagging k-NN imputation method estimates the missing values of gene microarray data sets and suits unstable learning algorithms.

4.4.2 Method for Initializing the Number of Clusters

Let D = {x_1, x_2, …, x_n} be the set of data objects, and let CP = {x_cp1, x_cp2, …, x_cpm} be the core points obtained from the density based clustering algorithm (Section 4.3). Each core point in CP has the potential to be a cluster center. The density of each core point is measured using the following equation:

cc_i = Σ_{j=1}^{m} e^( −‖x_cpi − x_cpj‖² / Eps² ),  i = 1, 2, …, c    (4.1)

where Eps is the neighborhood radius and m is the total number of core points. Thus, the potential associated with each core point depends on its distance to all other core points, and the core points with the largest density determine the initial number of clusters.

4.4.3 Method for Initializing the Membership Matrix

The identified potential cluster centers are used to obtain the cluster centroids. The normalized centers are given by

centers_i = w · max_a log(cp_ai),  i = 1, 2, …, c    (4.2)

where Σ_{i=1}^{c} centers_i = 1 and w is the adjustment weight factor determined according to the priority of the membership value. Equation (4.2) gives the closeness of the objects assigned to the cluster centroids.
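The density potential of Equation (4.1) can be computed for all core points at once. The sketch below is illustrative (the function name is ours), assuming the core points are the rows of a NumPy array:

```python
import numpy as np

def core_point_potentials(core_points, eps):
    # cc_i = sum_j exp(-||x_cpi - x_cpj||^2 / eps^2)  for each core point i (Eq. 4.1)
    cp = np.asarray(core_points, dtype=float)
    # pairwise squared distances between all core points via broadcasting
    diff = cp[:, None, :] - cp[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / eps ** 2).sum(axis=1)
```

A core point surrounded by many nearby core points accumulates a large potential, while an isolated one contributes only its own e^0 = 1 term, so the peaks of this potential indicate the candidate cluster centers.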

4.4.4 Improved Fuzzy C-Means Method

Input: microarray data set D, number of clusters c
Output: cluster centers, membership values, objective value

Algorithm 4.2: Improved FCM
1. Apply DBSCAN as the first step to initialize the prototypes; identify the core points x_cpi.
2. Fix the values for m and Eps.
3. Compute the number of clusters c using Equation (4.1).
4. Compute the initial membership matrix using Equation (4.2).
5. Compute the Euclidean distances d_ij, i = 1, 2, …, c; j = 1, 2, …, n, using Equation (3.3).
6. Update the membership function µ_ij, i = 1, 2, …, c; j = 1, 2, …, n, using Equation (4.3):

µ_ij = 1 / Σ_{k=1}^{c} (d_ij / d_kj)^(2/(m−1))    (4.3)

7. Update the cluster centers from the current memberships.
8. If not converged, go to step 5.

4.5 EXPERIMENTAL RESULTS

This section compares the results of the improved fuzzy clustering algorithm with the existing algorithm applied to microarray data sets in order to evaluate the performance of the proposed initialization method. The algorithms are compared on the time taken to find the results, the objective values, the number of clusters, the number of iterations and the clustering accuracy.

The description of the number of samples and genes for the Yeast, Colon cancer, Leukemia and Splice microarray data sets is given in Section 3.4. Fuzzy partitioning performs an iterative optimization based on minimization of the objective function, with updates of the memberships µ_ij and the cluster centers c_i. Table 4.1 shows the number of iterations taken to obtain the desired number of clusters by minimizing the objective function of the FCM algorithm. FCM takes a larger number of iterations to converge to the termination value.

Table 4.1 Memberships of final iteration of FCM

Data set       Objective value   No. of clusters   No. of iterations
Yeast          5.0960            4                 68
Colon cancer   8.3100            3                 39
Leukemia       5.9973            3                 24
Splice         0.8478            4                 75

Table 4.2 Memberships of final iteration of Improved FCM

Data set       Objective value   No. of clusters   No. of iterations
Yeast          6.8526            3                 67
Colon cancer   5.4209            4                 28
Leukemia       3.5984            2                 15
Splice         1.0790            3                 72

Table 4.2 shows the memberships of the final iteration of the Improved FCM algorithm. The standard FCM algorithm took 39 iterations to cluster the colon cancer data set into three partitions, whereas the proposed method took 28 iterations to converge to four partitions. The observations on the other data sets likewise show that the proposed method gives better results with regard to the objective function value and the number of iterations. Table 4.3 compares the running times of the FCM and Improved FCM algorithms; the observations show that the proposed method takes less time and gives better accuracy.

Table 4.3 Comparison of running time and clustering accuracy

Data set       Method         Running time (s)   Clustering accuracy (%)
Yeast          FCM            11.8801            79.21
               Improved FCM    4.4840            79.30
Colon cancer   FCM             0.7650            89.41
               Improved FCM    0.6410            92.37
Leukemia       FCM            10.4540            78.59
               Improved FCM    7.9690            81.01
Splice         FCM             6.7030            76.55
               Improved FCM    6.0630            77.81

The four clusters obtained for the yeast data set by the FCM clustering algorithm are shown in Figure 4.2. The data reallocated into three clusters by the proposed improved FCM clustering algorithm are shown in Figure 4.3, which depicts the appropriate number of clusters. Applying the FCM clustering algorithm to the colon cancer data set produces the three clusters shown in Figure 4.4, whereas the proposed method produces the four clusters shown in Figure 4.5.

Figure 4.2 Four clusters of the yeast data set by FCM

Figure 4.3 Three clusters of the yeast data set by Improved FCM

Figure 4.4 Three clusters of the colon cancer data set by FCM

Figure 4.5 Four clusters of the colon cancer data set by Improved FCM

Figure 4.6 Three clusters of the leukemia data set by FCM

Figure 4.7 Two clusters of the leukemia data set by Improved FCM

Figures 4.6 and 4.7 show the partitioned clusters of the leukemia data set produced by the FCM and Improved FCM clustering algorithms, respectively. The FCM clustering algorithm partitions the data into three clusters, whereas the Improved FCM algorithm partitions it into two clusters.

4.6 CONCLUSION

In this chapter, an enhanced approach for initializing the membership matrix and the cluster number of the fuzzy c-means clustering algorithm was presented. The usual random assignment of the initial parameters of the FCM algorithm is replaced in this approach. The experimental results show that the improved FCM algorithm enhances the clustering accuracy and reduces the running time and the number of iterations needed to complete the experiments. The objective function value of the improved FCM algorithm is lower than that of the existing FCM algorithm, and the initial cluster centers determined with the DBSCAN method reduce the number of iterations needed to form the resulting partitions.