CHAPTER 5 CLUSTER VALIDATION TECHNIQUES

Sara Alexis Small

5.1 INTRODUCTION

Predicting the correct number of clusters is a fundamental problem in unsupervised classification. Many clustering techniques require the number of clusters to be specified in advance. To overcome this problem, a new Hybrid Clustering Technique was introduced in Chapter 4. To assess the results of a clustering technique, cluster validity indices have been proposed (Davies and Bouldin 1979, Halkidi et al 2001) that identify the partitioning that best fits the underlying data. Several validity indices for measuring the goodness of clustering results are discussed in section 2.3; of these, only two, namely a) the Silhouette index and b) the Davies-Bouldin index, are used here to verify the results of the new Hybrid Clustering Technique.

5.2 SILHOUETTE INDEX

For a given cluster X_j (j = 1, ..., c), this method assigns to each sample of X_j a quality measure s(i) (i = 1, ..., m), known as the Silhouette width. The Silhouette width is a confidence indicator on the membership of the ith sample in cluster X_j. The Silhouette width for the ith sample in cluster X_j is defined as (Jens Jakel and Martin Nollenberg 2004, Dinakaran and Suresh 2009, Bolshakova and Azuaje 2003).

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}    (5.1)

where a(i) is the average distance between the ith sample and all of the samples included in X_j, and b(i) is the minimum average distance between the ith sample and all of the samples clustered in X_k (k = 1, ..., c; k ≠ j). It follows from this formula that s(i) takes a value between -1 and 1. Thus, for a given cluster X_j, it is possible to calculate a cluster silhouette S_j, which characterizes the heterogeneity and isolation properties of that cluster. It is calculated as the average of the silhouette widths of all samples in X_j. Moreover, for any partition U, a global silhouette value or silhouette index, GS_u, can be used as an effective validity index:

GS_u = \frac{1}{c} \sum_{j=1}^{c} S_j    (5.2)

where S_j is the silhouette value of cluster X_j, computed from the silhouette widths s(i) (i = 1, ..., m). The partition with the maximum silhouette index value is taken as the optimal partition.

5.3 DAVIES-BOULDIN INDEX

The Davies-Bouldin index identifies sets of clusters that are compact and well separated; the lowest value indicates a good clustering structure. For any partition U ↔ X: X_1 ∪ ... ∪ X_i ∪ ... ∪ X_c, where X_i represents the ith cluster of the partition, the Davies-Bouldin (DB) index is defined as (Ferenc Kovacs et al 2006, Jens Jakel and Martin Nollenberg 2004, Dinakaran and Suresh 2009)

DB(U) = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)} \right\}    (5.3)

where δ(X_i, X_j) defines the inter-cluster distance between clusters X_i and X_j; Δ(X_i) and Δ(X_j) represent the intra-cluster distances of clusters X_i and X_j; and c is the number of clusters of partition U. Small index values correspond to good clusters, that is, clusters that are compact and whose centers are far away from each other. Therefore, the cluster configuration that minimizes the Davies-Bouldin index is taken as the optimum number of clusters.

5.4 RESULTS OF CLUSTER VALIDITY INDICES

The basic problem in clustering is deciding the optimal number of clusters that fits a dataset, as discussed in Chapter 2. Here, the new Hybrid Clustering Technique has been applied to four microarray gene expression datasets, and the results it produces are validated and interpreted using validity measures. The validation indicates that the results are optimum and can be used for further analysis. In this research, two important and suitable validation measures, Silhouette and Davies-Bouldin, are used to calculate validity indices on human serum data, yeast data and cancer data. The results for the human serum, yeast and cancer data are presented in sections 5.4.1, 5.4.2 and 5.4.3 respectively. The number of clusters on the X axis that gives the highest Silhouette index value is considered the optimum number of clusters, whereas the Davies-Bouldin index value should be low for a good clustering result.

5.4.1 Human Serum Data

In this section two cluster validity methods, the Silhouette validity index and the Davies-Bouldin index, are used to validate the clustering result for the human serum microarray data. These indices measure how compact and well separated the clusters are. As discussed in section 5.3.1, clusters are compact and well separated if the silhouette values are large. The characteristics of the human serum dataset are described in Chapter 4.
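Equations (5.1) to (5.3) can be computed directly from a set of labelled samples. The following is a minimal pure-Python sketch under stated assumptions, not the implementation used in this work. It assumes Euclidean distance, averages a(i) over the other members of the sample's own cluster (with s(i) = 0 for a single-member cluster), takes S_j as the mean silhouette width of cluster X_j, and uses the mean distance to the centroid as the intra-cluster distance and the distance between centroids as the inter-cluster distance. The text does not fix these last two choices, and all function names are illustrative.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_widths(points, labels):
    """Silhouette width s(i) of equation (5.1) for every sample (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # assumed convention: a single-member cluster gets s(i) = 0
            widths.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def global_silhouette(points, labels):
    """GS_u of equation (5.2): mean over clusters of the cluster silhouettes S_j."""
    per_cluster = {}
    for s, lab in zip(silhouette_widths(points, labels), labels):
        per_cluster.setdefault(lab, []).append(s)
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_means) / len(cluster_means)


def davies_bouldin(points, labels):
    """DB(U) of equation (5.3) with centroid-based distances (an assumption).

    Assumes at least two clusters whose centroids do not coincide.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {
        lab: [sum(col) / len(pts) for col in zip(*pts)]
        for lab, pts in clusters.items()
    }
    # Intra-cluster distance: mean distance of the members to their centroid
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

For four toy points (0, 0), (0, 1), (10, 0), (10, 1) labelled [0, 0, 1, 1], this sketch gives a global silhouette of about 0.90 and a Davies-Bouldin value of 0.1. Relabelling the same points as [0, 1, 0, 1] makes the silhouette negative and raises the Davies-Bouldin value to 10, which matches the reading of both indices given above.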

Silhouette index values for the Hybrid Clustering Technique on human serum data are calculated, and a graph of the silhouette values against the number of clusters from k=2 to k=10 is shown in Figure 5.1. The silhouette value highlighted in the figure is the maximum, and the corresponding k value is 3. The point at which the curve bends sharply, where the maximum silhouette index value occurs, is taken as optimal.

Figure 5.1 Silhouette index values of human serum data

The Hybrid Clustering Technique divides the human serum data into k=2 to k=10 clusters, and the silhouette index is calculated for each partition. The entries in the table represent the average silhouette value for each cluster. To select the optimal number of clusters for a dataset, choose the partition where the global silhouette value is a maximum. The highlighted maximum value in the table is obtained using the Hybrid Clustering Technique, so the number of clusters k=3 is considered optimum.
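The selection rule described above, together with its Davies-Bouldin counterpart from section 5.3, amounts to taking the maximum or minimum of the index over the candidate values of k. A minimal sketch follows. The helper name `select_optimal_k` and the index values in the usage lines are invented for illustration and are not results from this work.

```python
def select_optimal_k(index_by_k, higher_is_better):
    """Return the number of clusters k whose validity index is best.

    index_by_k: dict mapping each candidate k to its validity index value.
    higher_is_better: True for the silhouette index (maximize),
                      False for the Davies-Bouldin index (minimize).
    """
    if not index_by_k:
        raise ValueError("no candidate k values supplied")
    pick = max if higher_is_better else min
    # Candidates are scanned in increasing k, so ties resolve to the smallest k.
    return pick(sorted(index_by_k), key=index_by_k.get)


# Hypothetical index values, for illustration only
silhouette_by_k = {2: 0.31, 3: 0.44, 4: 0.52, 5: 0.40}
davies_bouldin_by_k = {2: 1.30, 3: 1.05, 4: 0.82, 5: 0.97}

print(select_optimal_k(silhouette_by_k, higher_is_better=True))       # prints 4
print(select_optimal_k(davies_bouldin_by_k, higher_is_better=False))  # prints 4
```

When the two indices point to the same k, as in this invented example, the agreement is the kind of evidence the validation relies on. When they disagree, the choice needs further judgment.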

Figure 5.2 Davies-Bouldin index values of human serum data

Davies-Bouldin index values for the Hybrid Clustering Technique on human serum data are given in Figure 5.2. Each row lists a number of clusters against the average Davies-Bouldin value for that clustering. The highlighted value is the minimum index value, which suggests that the corresponding number of clusters, k=3, is optimum. The figure shows the index value against the number of clusters ranging from k=2 to k=10. The value produced by the Davies-Bouldin index using the Euclidean distance measure is lowest at k=3, which appears in the figure as a deep knee at that point. This number of clusters is the best fit for the human serum data.

5.4.2 Yeast Data

The quality assessment of clustering results is an important part of cluster analysis. In this section, the optimum number of clusters on yeast data is estimated using two cluster validity methods, the Silhouette validity index and the Davies-Bouldin index. These methods validate the clustering results produced by the Hybrid Clustering Technique by checking how compact and how well separated the clusters are. As discussed in section 5.3.1, clusters are compact and well separated if the silhouette values are large. The characteristics of the yeast dataset are described in Chapter 4.

Silhouette index values are calculated for clusterings ranging from k=2 to k=10 and plotted in Figure 5.3. The silhouette value is largest at k=2, so this value of k is optimum. Since the deepest knee indicates the greatest jump in these values between smaller and larger numbers of clusters, the point where the maximum silhouette index value occurs is taken as the optimum.

Figure 5.3 Silhouette index values of yeast data

Davies-Bouldin index values are calculated for the Hybrid Clustering Technique on yeast data for clusterings ranging from k=2 to k=10. A graph of the index values against the number of clusters k is shown in Figure 5.4. The X axis represents the number of clusters k and the Y axis the index values produced by the clustering technique. The index reaches its minimum at k=2, where the curve shows a knee jumping from a larger to a smaller value. Since a smaller Davies-Bouldin index value always indicates a better clustering, as discussed in section 5.3.2, the optimum number of clusters k on yeast data is two.

Figure 5.4 Davies-Bouldin index values of yeast data

5.4.3 Lung Cancer Data

In this section the optimum number of clusters on lung cancer data is estimated using two cluster validity methods, the Silhouette index and the Davies-Bouldin index. As discussed in section 5.3.1, when the silhouette index is used as the validity measure, a larger index value indicates a better clustering. Figure 5.5 shows the silhouette index values for k=2 to k=10 on the cancer data. The highlighted maximum silhouette index value in the figure gives the corresponding optimum number of clusters, k=2.

Figure 5.5 Silhouette index values of Lung Cancer data

Davies-Bouldin index values are calculated on lung cancer data for clusterings ranging from k=2 to k=10 and are shown in graphical form in Figure 5.6. The index value is smallest at the deepest knee, which marks the greatest jump in the curve, and this value is considered optimum. The corresponding row, k=2, is highlighted in Figure 5.6 to indicate the optimum number of clusters.

Figure 5.6 Davies-Bouldin index values of Lung Cancer data

The results produced by the Hybrid Clustering Technique on four different microarray gene expression datasets, and their validation by the validity indices, are summarized in Table 5.1. The results of the new Hybrid Clustering Technique are considered optimum because the criteria of both validity indices are satisfied.

Table 5.1 Validation results on microarray gene expression datasets

Dataset Name | Number of Optimum Clusters (K) | Silhouette Index Value | Davies-Bouldin Index Value
Human serum | | |
Yeast | | |
Cancer | | |
Blood Cancer | | |

Table 5.2 provides the outcome of the thesis at different stages, from pre-processing to post-processing. It shows that the new Hybrid Clustering Technique produces optimum results on four different microarray gene expression datasets.

Table 5.2 Outcome of thesis work in different stages

Techniques | Human Serum | Yeast | Lung Cancer | Blood Cancer
Outlier Detection (attribute values) | | | |
Dimensionality Reduction (no. of dimensions) | | | |
Hybrid Clustering (no. of clusters) | | | |
Cluster Validation (optimum no. of clusters) | | | |

5.5 CONCLUSION

Different types of clustering techniques produce different results on the same dataset; hence validating the results of clustering techniques is essential. Two validity measures, namely a) the Silhouette index and b) the Davies-Bouldin index, are used in this research work to evaluate the results of the new clustering technique. As the technique satisfies the criteria of both measures, its results can be regarded as optimal.
