CHAPTER 7 A GRID CLUSTERING ALGORITHM

7.1 Introduction

Grid-based methods are widely preferred over the algorithms discussed in the previous chapters because of their rapid clustering results. In this chapter, a nonparametric grid-based clustering algorithm is presented using the concept of boundary grids and the local outlier factor [31].

7.2 Algorithm Based on Local Outlier Factor

Let D = (x_1, x_2, ..., x_n) be the given set of n data points embedded in an m-dimensional space. Each x_i is composed of m attributes, i.e., x_i = (x_{i1}, x_{i2}, ..., x_{im}). Initially, a single grid is used to represent all n given points. The extrema of this grid are taken as the minimum and maximum attribute values in each dimension [129]. Let Min(i) = min{x_{1i}, x_{2i}, ..., x_{ni}} and Max(i) = max{x_{1i}, x_{2i}, ..., x_{ni}}. Then the initial grid G(n; m) is represented as

$$G(n; m) = [\mathrm{Min}(1), \mathrm{Max}(1)] \times [\mathrm{Min}(2), \mathrm{Max}(2)] \times \cdots \times [\mathrm{Min}(m), \mathrm{Max}(m)].$$

Initially, the number of clusters k is one. We partition the grid G(n; m) into two equal-volume grids G_1(n_1; m) and G_2(n_2; m) along a dimension selected uniformly at random from the m dimensions, and the data points of G(n; m) are distributed between these two grids. We define the grids (cells) that contain at least one of the given points as non-empty grids; all other grids remain empty. The empty grids that share a common vertex with a non-empty grid are called boundary grids. It can easily be seen that each cluster is surrounded by boundary grids. After each round of partitioning, it is necessary to check for the presence of new clusters. Here, a cluster is defined as the collection of points in non-empty grids that are connected through common vertices in the grid structure.
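As an illustration of the grid construction and one partitioning round, the following is a minimal Python sketch under our own representation (each cell is a (low, high, points) triple; the function names are hypothetical, not the author's implementation).

```python
import numpy as np

def initial_grid(X):
    """Initial grid G(n; m): the box [Min(i), Max(i)] in every dimension i."""
    return X.min(axis=0), X.max(axis=0)

def bisect_cells(cells, dim):
    """One partitioning round: split every cell into two equal-volume halves
    along dimension `dim` and redistribute its points to the halves."""
    new_cells = []
    for lo, hi, pts in cells:
        mid = (lo[dim] + hi[dim]) / 2.0
        left_hi, right_lo = hi.copy(), lo.copy()
        left_hi[dim], right_lo[dim] = mid, mid
        in_left = pts[:, dim] <= mid
        new_cells.append((lo, left_hi, pts[in_left]))
        new_cells.append((right_lo, hi, pts[~in_left]))
    return new_cells

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
lo, hi = initial_grid(X)
cells = [(lo, hi, X)]                                           # the single initial grid
cells = bisect_cells(cells, dim=int(rng.integers(X.shape[1])))  # uniformly chosen dimension
```

Cells whose `pts` array is empty correspond to the empty grids of the text; the rest are the non-empty grids.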

In the next round of partitioning, the two grids G_1(n_1; m) and G_2(n_2; m) are partitioned into four equal-volume grids G_{1,1}(n_3; m), G_{1,2}(n_4; m), G_{2,1}(n_5; m) and G_{2,2}(n_6; m) along another chosen dimension, and the points are redistributed into the new grids. In this way, every grid of the previous iteration is bisected, and the partitioning process continues until a termination condition is met.

The boundary grids are used to detect the optimal grid structure. As discussed above, a cluster is the collection of points of directly connected non-empty grids, and each cluster is surrounded by boundary grids. We also define the volume of a cluster as the sum of the volumes of all non-empty grids of that cluster. We have observed that the volume of a cluster strictly decreases as the partitioning process continues, while the number of surrounding empty grids (boundary grids) increases significantly, as depicted in Figures 7.1(a-e). If the required clusters have already formed and the partitioning continues, some clusters split into multiple sub-clusters, as shown in Figure 7.1(e). Some of these sub-clusters are surrounded by far fewer boundary grids than the clusters of the previous iteration; for instance, five such clusters with very few boundary grids are shown in Figure 7.1(e). This case indicates that the clusters of the previous iteration are the desired ones and that the corresponding grid size is the optimal one among all iterations.
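The grouping of non-empty grids through common vertices, and the counting of a cluster's boundary grids, can be sketched as follows; this is a minimal Python illustration assuming each non-empty cell is identified by its integer index tuple in the current grid, with helper names of our own. In this lattice representation, two cells share a common vertex exactly when their indices differ by at most 1 in every coordinate.

```python
from itertools import product

def neighbors(cell):
    """All cells sharing a common vertex with `cell` (indices differing
    by at most 1 in every coordinate)."""
    offsets = (d for d in product((-1, 0, 1), repeat=len(cell)) if any(d))
    return [tuple(c + o for c, o in zip(cell, off)) for off in offsets]

def connected_clusters(occupied):
    """Group non-empty cells into clusters of vertex-connected cells."""
    occupied, clusters, seen = set(occupied), [], set()
    for start in occupied:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            cell = stack.pop()
            if cell in seen:
                continue
            seen.add(cell)
            component.add(cell)
            stack.extend(nb for nb in neighbors(cell) if nb in occupied)
        clusters.append(component)
    return clusters

def boundary_grid_count(cluster, occupied):
    """Number of empty cells sharing a common vertex with the cluster."""
    ring = {nb for cell in cluster for nb in neighbors(cell)}
    return len(ring - set(occupied))
```

A sharp drop in `boundary_grid_count` for some cluster relative to the previous iteration (the 20% threshold in Step 6 of OPT-GRID below) is the signal that the grid size of the previous iteration was the optimal one.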

Figure 7.1: Clusters produced by grouping the points in the grids with common vertices: (a) one cluster; (b) five clusters; (c) five clusters; (d) more than five clusters; (e) a cluster partitioned into several sub-clusters with fewer boundary grids.

7.2.1 Problem of Outliers

The above idea is sensitive to outlier points in the given data set. Since outliers lie far from the clusters, they form separate clusters of their own during the partitioning process: the outlier region becomes separated from the other clusters as the number of grids increases (i.e., as the grid size decreases). By our definition of a cluster, the outlier points are also surrounded by boundary grids, and their number is significantly smaller than the number of boundary grids of the other clusters, which falsely signals the termination stage described above. The clusters may therefore not be fully formed, as the partitioning process is left incomplete under the effect of outliers. Hence, we need a measure for identifying outliers so that the partitioning process can continue smoothly. For this purpose, we use the local outlier factor (LOF) proposed by Breunig et al. [31] to compute the degree to which a point is an outlier.

First, the local neighborhood of a point x ∈ S with respect to the minimum-points threshold mp is defined as

$$N(x, mp) = \{\, y \in S : d(x, y) \le d(x, x_{mp}) \,\},$$

where x_{mp} is the mp-th nearest neighbor of x. Thus N(x, mp) contains at least mp points. The density of a point x ∈ S is defined as

$$\mathrm{density}(x, mp) = \frac{|N(x, mp)|}{\sum_{y \in N(x, mp)} d(x, y)} \qquad (7.1)$$

If the distances between x and its neighboring points are small, then the density of x is high. The average relative density (ard) of x is then calculated as

$$\mathrm{ard}(x, mp) = \frac{\mathrm{density}(x, mp)}{\sum_{y \in N(x, mp)} \mathrm{density}(y, mp) \,/\, |N(x, mp)|} \qquad (7.2)$$

Now, the local outlier factor (LOF) of x is defined as the inverse of the average relative density of x, i.e.,

$$\mathrm{LOF}(x, mp) = \frac{1}{\mathrm{ard}(x, mp)} \qquad (7.3)$$

If a point belongs to a cluster, its LOF value is close to one, because the density of that point and the densities of its neighboring points are roughly equal. If any cluster has a significantly smaller number of boundary grids in some iteration of the partitioning, we compute the LOF value for all points of that cluster using the above definition. If any of these points is an outlier, the partitioning process continues; otherwise, the process terminates. The pseudo code of this algorithm is provided below.

Algorithm OPT-GRID (S)
Input: A set S of n data points.
Output: A set of clusters C_1, C_2, ..., C_k.
Functions and variables used:
  Grid(n; m): A function to find the initial grid structure of the given set S of n points with dimension m.
  Partition(G): A function to partition all the grids into two equal-volume new grids.
  Connected(): A function to find the clusters from the grid structure using the common vertices in the grid structure.
  LOF(x): A function to compute the local outlier factor of x.
  EG_j: Empty grids.
  NE_j: Non-empty grids.
  BG_j: Boundary grids.

Step 1: Call Grid(n; m);
Step 2: p ← 1;
Step 3: Call Partition(G) to find the equal-volume grids G_1 and G_2;
Step 4: Call Connected() to produce the clusters (say, C_1, C_2, ..., C_l for some l) by grouping points in the grids that are connected through common vertices in the grid structure;
Step 5: Find the number of boundary grids BG_j of each of the clusters C_1, C_2, ..., C_l;
Step 6: If (p > 1) then
            If the number of boundary grids of some cluster C of t points is less than 20% of the boundary grids of any cluster in the (p-1)-th iteration then
                Go to Step 7;
            Else
                { p ← p + 1; Go to Step 3 to partition all the grids; }
        Else
            { p ← p + 1; Go to Step 3 to partition all the grids; }
Step 7: Call LOF(x_q) for all x_q ∈ C, 1 ≤ q ≤ t;
Step 8: If LOF(x_q) ≈ 1 for all x_q ∈ C then
            Go to Step 9;
        Else /* sign of outliers */
            { p ← p + 1; Go to Step 3 to continue the partitioning process for all the grids; }
Step 9: Output the clusters C_1, C_2, ..., C_k with respect to the grid size of the (p-1)-th iteration;
Step 10: Exit();

Function Grid(n; m)
{
    Min(y) = min{x_{1y}, x_{2y}, ..., x_{ny}}, 1 ≤ y ≤ m;
    Max(y) = max{x_{1y}, x_{2y}, ..., x_{ny}}, 1 ≤ y ≤ m;
    G(n; m) = [Min(1), Max(1)] × [Min(2), Max(2)] × ... × [Min(m), Max(m)],
    i.e., G(n; m) = ∏_{1 ≤ y ≤ m} [Min(y), Max(y)];
}

Function Connected()
{
    Step 1: l ← 1;
    Step 2: Start with any random grid G;
    Step 3: If G is non-visited, then mark it as visited and add all the points of G to C_l,
            i.e., C_l ← C_l ∪ {points in the grid G};
            Else Go to Step 2;
    Step 4: Find the non-visited grids G_r (for some r) that share a common vertex with G;
    Step 5: Add the points of all G_r to the cluster C_l and mark each G_r as visited,
            i.e., C_l ← C_l ∪ {points in the grids G_r};
    Step 6: Repeat Steps 4 and 5 for all the grids G_r until no new grid is identified;
    Step 7: If Σ_j |C_j| ≠ n then
                { l ← l + 1; Go to Step 2 to restart the process; }
            Else
                Return (C_1, C_2, ..., C_j);
}
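To make the LOF computation of equations (7.1)-(7.3) concrete, here is a minimal Python sketch of our own (in OPT-GRID it would be applied only to the points of a suspect cluster; ties in the mp-th neighbor distance are ignored, so N(x, mp) contains exactly mp points).

```python
import numpy as np

def lof(X, mp):
    """Local outlier factor (equations 7.1-7.3) for every row of X."""
    # Pairwise distances, with each point excluded from its own neighborhood.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nbrs = np.argsort(dists, axis=1)[:, :mp]          # mp nearest neighbors
    nbr_d = np.take_along_axis(dists, nbrs, axis=1)
    density = mp / nbr_d.sum(axis=1)                  # (7.1)
    ard = density / density[nbrs].mean(axis=1)        # (7.2)
    return 1.0 / ard                                  # (7.3)

# A point inside a cluster gets LOF close to 1; an isolated point gets LOF >> 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
print(lof(X, mp=5)[-1])
```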

Time Complexity: The proposed algorithm constructs the grid structure of the given data points a finite number of times (say p) with respect to the uniformly chosen dimensions. This task requires O(pn) time, since the partitioning process of one iteration takes linear time. The local outlier factor (LOF) is computed first for the points (if they exist) of the cluster that represents the outliers, and for the non-outlier points only in the last iteration of the algorithm. Hence, the LOF is computed a very small number of times (say h) compared to the number of data points n, which requires O(hn) time. It can easily be seen that both p and h are very small compared to n. Therefore, the overall time complexity of the proposed algorithm may be regarded as linear.

7.2.2 Experimental Analysis

We performed extensive experiments with the proposed algorithm on many synthetic and biological data sets. In order to demonstrate the efficiency of the proposed method, we compared the experimental results with the existing clustering techniques K-means [26], IGDCA [145] and GGCA [129]. We used the normalized information gain (NIG) to evaluate the quality of the clusters quantitatively. The NIG is defined as follows.

Normalized Information Gain: The normalized information gain [91] is a measure of the quality of clusters based on the information gain [114]. This measure has been shown to be very effective for evaluating data classification, and it is extended here to evaluate clusters of supervised (labeled) data. NIG is expressed in terms of the total and weighted entropies, defined as follows. Suppose there are L classes and each of the given n data points belongs to exactly one of the L classes. Let c_l denote the number of data points of class l, for l = 1, 2, ..., L, so that Σ_{l=1}^{L} c_l = n. Then the total entropy EN_Total, which is simply the average information per point in the data set, is defined as

$$EN_{\mathrm{Total}} = -\sum_{l=1}^{L} \frac{c_l}{n} \log_2 \frac{c_l}{n} \qquad (7.4)$$

Assume that the total number of clusters is K and that the k-th cluster contains n_k points, of which c_l^k belong to class l, for l = 1, 2, ..., L. Then the k-th cluster entropy EN_k is given by

$$EN_k = -\sum_{l=1}^{L} \frac{c_l^k}{n_k} \log_2 \frac{c_l^k}{n_k} \qquad (7.5)$$

Clearly, if a cluster has points of only one class, then EN_k is zero. Based on the cluster entropies, the weighted entropy wEN, which is the average information per point over the clusters, is calculated as

$$wEN = \sum_{k=1}^{K} \frac{n_k}{n}\, EN_k \qquad (7.6)$$

If all the clusters are homogeneous, i.e., each cluster contains points of only one class, then wEN is zero. The normalized information gain (NIG) is calculated from the above quantities as

$$NIG = \frac{EN_{\mathrm{Total}} - wEN}{EN_{\mathrm{Total}}} \qquad (7.7)$$

It is important to note that NIG is zero if no information is obtained, and one if the total information is retrieved by the clustering technique. Therefore, NIG should be close to 1 for good-quality clusters.

The proposed method, K-means, IGDCA and GGCA were first applied to eight two-dimensional synthetic data sets of various shapes and densities. The comparison results in Table 7.1 below depict the efficiency of the proposed method over the existing ones: the NIG values of the proposed method are larger than those of K-means, IGDCA and GGCA. Hence, by the definition of NIG, the proposed method outperforms the existing methods.
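A minimal Python sketch of the NIG computation of equations (7.4)-(7.7), assuming the class labels and cluster assignments are given as integer arrays (the function names are ours):

```python
import numpy as np

def entropy(counts):
    """-sum p*log2(p) over the non-zero entries of a count vector."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def nig(labels, assignments):
    """Normalized information gain (7.7) of a clustering against class labels."""
    labels, assignments = np.asarray(labels), np.asarray(assignments)
    n = len(labels)
    classes = np.unique(labels)
    class_counts = np.array([(labels == c).sum() for c in classes])
    en_total = entropy(class_counts)                        # (7.4)
    wen = 0.0
    for k in np.unique(assignments):
        members = labels[assignments == k]
        counts_k = np.array([(members == c).sum() for c in classes])
        wen += len(members) / n * entropy(counts_k)         # (7.5), (7.6)
    return (en_total - wen) / en_total                      # (7.7)

# Homogeneous clusters retrieve all the information: NIG = 1.
print(nig([0, 0, 1, 1], [5, 5, 9, 9]))   # -> 1.0
```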

Table 7.1: Results of synthetic data using normalized information gain (NIG)
[Table: data size, cluster number, and the NIG values of K-means, IGDCA, GGCA and the proposed method for the eight synthetic data sets DS1-DS8.]

Next, we tested the proposed and existing methods on eight biological data sets, namely iris, wine, statlog heart, breast tissue, Pima Indians diabetes, cloud, blood transfusion and yeast, taken from the UCI machine learning repository [32]. It can easily be observed from the comparison results in Table 7.2 that the proposed algorithm consistently produces better results than the algorithms K-means, IGDCA and GGCA in terms of the normalized information gain (NIG).

Table 7.2: Results of biological data using normalized information gain (NIG)
[Table: number of attributes, data size, cluster number, and the NIG values of K-means, IGDCA, GGCA and the proposed method for the data sets Iris, Wine, Statlog Heart, Breast Tissue, Pima Indians Diabetes, Cloud, Blood Transfusion and Yeast.]

7.3 Conclusion

In this chapter, a new approach has been proposed to find the optimal grid size using the cluster boundaries. The proposed method is non-parametric and runs with linear computational complexity. It is made insensitive to outlier points by exploiting the local outlier factor (LOF). The proposed scheme has been evaluated on various synthetic and biological data sets using the normalized information gain (NIG), and it has been shown to produce better results than K-means, IGDCA and GGCA.

CHAPTER 8 SUMMARY AND CONCLUSIONS

In this thesis, we have presented various clustering algorithms under the four basic models of clustering, namely hierarchical, partitional, density-based and grid-based. We have addressed some of the problems associated with the existing algorithms and tried to resolve them through new or improved algorithms. We have shown experimental results of all the proposed algorithms on several benchmark synthetic and biological data sets, and compared the results with the existing algorithms to demonstrate the superior performance of the proposed ones.

The introductory part of the thesis is presented in the opening Chapter 1. It comprises the scope of the thesis, an introduction to the four major clustering models, the resources used, and the organization of the thesis. Chapter 2 is composed of an extensive review of the existing clustering algorithms of the above four major clustering models; the various clustering techniques of these models are discussed along with their merits and demerits.

Our contribution begins with Chapter 3, in which two novel MST-based clustering algorithms have been presented. The algorithm MST-CV has been designed using the coefficient of variation, and it also deals with outliers. The algorithm MST-DVI produces the optimal clusters using the DVI. The proposed algorithms outperformed several existing methods (K-means, SFMST, SC, MSDR, IMST, SAM and MinClue) on various synthetic and biological data. Both of the proposed algorithms have quadratic time complexity.

Chapter 4 presented three new clustering algorithms using the Voronoi diagram. The algorithm GCVD produces the desired clusters by exploiting the Voronoi edges. The algorithm Voronoi-Cluster yields the optimal clusters by using a function defined with the help of Voronoi circles.

Similarly, the algorithm Build-Voronoi generates efficient clusters by reconstructing the Voronoi diagram. The proposed algorithms have been shown to be more efficient than K-means, FCM, CTVN, VDAC and SC, and their time complexity has been shown to be O(n log n).

Two algorithms to enhance K-means clustering have been presented in Chapter 5. The algorithm KD-JDF is effective against the random-initialization and automation problems of the K-means algorithm: the initialization problem is dealt with using the kd-tree, whereas the number of clusters is automated by the JDF. The algorithm KD-Cluster improves the K-means clustering algorithm to be insensitive to outliers. The proposed algorithms outperformed the classical and improved K-means algorithms, and the time complexity of the proposed kd-tree based algorithms is O(n log n).

A prototype-based approach has been designed in Chapter 6 to speed up the DBSCAN algorithm. The prototypes in this algorithm are produced using the squared-error clustering technique. The proposed algorithm produced clusters at less computational cost than the existing techniques I-DBSCAN, DBCAMM, VDBSCAN and KFWDBSCAN.

In Chapter 7, we have proposed a grid clustering algorithm that computes the grid size using the novel concept of boundary grids. In this method, we used the local outlier factor (LOF) to deal with the problem of outliers. The proposed technique is non-parametric, runs in linear time, and has been shown to produce better results than K-means, IGDCA and GGCA.

Although the proposed algorithms are efficient in clustering various complex data, they have a few remaining challenges, and our future efforts will be directed towards the problems described as follows. We shall attempt to find a solution to the localization problem of the proposed MST-based clustering algorithm. The proposed Voronoi diagram based algorithm has the problem of requiring the value of K as input in advance; we shall enhance the algorithm in connection with this issue.

The performance of the kd-tree based K-means clustering algorithms depends completely on the proper formation of the leaf bucket sizes; therefore, we shall try to find the best possible size for the leaf buckets. The notion of the proposed MDBSCAN clustering algorithm is directed by the neighborhood parameter ε; we shall make an effort to automate the choice of a proper value of ε. Finally, if the given data has clusters of diverse densities, the proposed concept of boundary grids in grid-based clustering may not provide the optimal grid size; we shall try to enhance the grid-based clustering algorithm in our future endeavors to produce the clusters of varied densities.
