Cluster Ensemble Algorithm using the Binary k-means and Spectral Clustering
Journal of Computational Information Systems 10: 12 (2014)

Cluster Ensemble Algorithm using the Binary k-means and Spectral Clustering

Ye TIAN 1, Peng YANG 2

1 Key Laboratory of Photonic and Electronic Bandgap Materials, Ministry of Education, School of Physics and Electronic Engineering, Harbin Normal University, Harbin, China
2 College of Information and Communication Engineering, Harbin Engineering University, Harbin, China

Abstract

Cluster ensemble has been shown to be an effective way of improving the accuracy and stability of single clustering algorithms. It consists of generating a set of partition results from the same data set and combining them into a final one. In this paper, we develop a novel cluster ensemble method named Cluster Ensemble algorithm using the Binary k-means and Spectral Clustering (CEBKSC). By combining the binary k-means algorithm with spectral clustering, the proposed method has low computational complexity and is therefore well suited to large text data sets. It first uses the binary k-means algorithm to create a set of partition results, and then integrates these results using spectral clustering. In addition, we introduce a matrix transformation technique to lower the computational cost of the spectral clustering step. Experiments show that the proposed method achieves better clustering quality and runs faster than several other cluster ensemble methods.

Keywords: Cluster Ensemble; Binary k-means; Spectral Clustering; Matrix Transformation

1 Introduction

Clustering analysis, an unsupervised pattern recognition problem, can be viewed as the process of grouping unlabeled data objects into k groups (we denote by k the number of desired classes) under clustering criteria such that the intracluster dissimilarity is minimized while the intercluster dissimilarity is maximized [1].
It is an essential technique in research areas that involve analyzing multivariate data, such as pattern classification, data mining, taxonomy, text retrieval and image segmentation [2]. Over the past half century, a large variety of clustering algorithms has been proposed. Traditional clustering algorithms such as k-means and its variants impose a convex spherical sample space on the data sets; when the sample space is not convex, these algorithms tend to get trapped in local optima.

Project supported by the Science and Technology Research Projects of Heilongjiang Education Department (No ). Corresponding author. Email address: yangpeng@163.com (Peng YANG). Copyright 2014 Binary Information Press. DOI: /jcis10617. June 15, 2014.
Recently, many studies have shown that cluster ensemble methods can provide consistent, robust, novel and stable solutions [3, 4]. In a cluster ensemble, the design of the consensus function plays a significant role: it computes a new partition that integrates all the clustering results obtained in the generation step, and it directly affects the clustering quality of the ensemble. In this paper, we use spectral clustering to combine all the partition results obtained in the generation step.

Spectral clustering [5-7], which exploits the pairwise similarities of data objects, has been shown to be more effective than traditional clustering algorithms at finding clusters. Because of this advantage, spectral clustering is now widely used in areas such as computer vision and information retrieval [8-10]. However, when the number of data objects (denoted by n) is large, spectral clustering encounters a quadratic resource bottleneck in computing the pairwise similarities among the n data objects [11]. Furthermore, it is sensitive to the scaling parameter used when constructing the similarity matrix. To lower the computational complexity of the eigenvalue decomposition (EVD) of the similarity matrix, we adopt a matrix transformation technique that equivalently transforms the EVD of the graph Laplacian matrix into that of a much smaller matrix, and we use the cosine similarity, which requires no scaling parameter, instead of measures such as the Gaussian kernel to compute the pairwise similarities of data objects.

The rest of the paper is organized as follows: Section 2 surveys the contributions upon which this paper builds.
Section 3 gives the detailed steps of the Cluster Ensemble algorithm using the Binary k-means and Spectral Clustering (CEBKSC). Section 4 presents the main results. Section 5 gives conclusions and future work.

2 Related Works

2.1 Cluster ensemble

Given a set of data objects, the cluster ensemble method consists of two principal steps:
- Generation: producing a set of partition results of these objects.
- Ensemble (also called integration or combination): combining these results into a final one.

Generation

Generation is the first step of a clustering ensemble method, in which a set of partition results is generated. In general, there are no constraints on how these results should be generated: different clustering algorithms, or the same algorithm with different initialization parameters, can be used. However, it is advisable to use clustering algorithms with linear computational complexity to generate the partition results. Therefore, the k-means method using the binary thought is applied in this paper.
Ensemble

The ensemble step is critical in any clustering ensemble algorithm; indeed, the great challenge in clustering ensemble is the design of an appropriate ensemble method. In this step the final consensus partition, which is the output of the clustering ensemble algorithm, is computed. However, the consensus among a set of clustering results is not obtained in the same way in all cases. There are two main ensemble approaches: points co-association and median partition. The basic idea of the first approach is to avoid the label correspondence problem by using a coincidence matrix over all pairs of data objects. The matrices of the clustering results are used to construct a new matrix (the co-association matrix), and a final result is obtained either by applying an agglomerative clustering algorithm such as single-link or complete-link [4], or by using a graph partitioning algorithm such as METIS [12], as in the Cluster-based Similarity Partitioning Algorithm (CSPA) proposed in [3]. In the second approach, the consensus partition is obtained by solving an optimization problem that finds the median partition of the cluster ensemble, i.e., the partition maximizing the similarity to all partitions in the ensemble.

2.2 Binary k-means

When the k-means method is applied to data with the number of classes k = 2, it is fast and stable. It is therefore natural to expect that we can also obtain stable partition results when clustering a data set with more than two classes using the k-means method, if we adopt the following binary thought.

Fig. 1: The diagram of binary thought

The binary thought can be described as follows: data objects are first partitioned into two clusters, each cluster is then partitioned into two, and so on. Fig. 1 depicts the diagram of the binary thought, and Algorithm 1 shows the k-means method using it.
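As a concrete illustration, the binary thought might be sketched in pure Python as follows. This is our own sketch, not the authors' code: in particular, the rule used to merge surplus leaves back to k clusters (merge the smallest leaf into the cluster with the nearest centroid) is an assumption, since the paper only states that the leaves are merged.

```python
import random

def two_means(points, iters=20, seed=0):
    """Plain k-means with k = 2 on a list of feature vectors."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, 2)
    groups = (points, [])
    for _ in range(iters):
        new = ([], [])
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
            new[0 if d0 <= d1 else 1].append(p)
        if not new[0] or not new[1]:
            break                      # degenerate split: keep last assignment
        groups = new
        centers = [[sum(c) / len(g) for c in zip(*g)] for g in groups]
    return [g for g in groups if g]

def centroid(g):
    return [sum(c) / len(g) for c in zip(*g)]

def binary_kmeans(points, k):
    """Split every cluster in two for R rounds, then merge leaves
    until exactly k clusters remain (merge rule is our assumption)."""
    R = k.bit_length()                 # equals int(log2 k) + 1 of Algorithm 1
    clusters = [list(points)]
    for _ in range(R):                 # grow up to 2**R leaves
        clusters = [half for c in clusters
                    for half in (two_means(c) if len(c) > 1 else [c])]
    while len(clusters) > k:           # merge smallest leaf into its nearest
        clusters.sort(key=len)
        small, rest = clusters[0], clusters[1:]
        cs = centroid(small)
        j = min(range(len(rest)), key=lambda i: sum(
            (a - b) ** 2 for a, b in zip(cs, centroid(rest[i]))))
        rest[j] = rest[j] + small
        clusters = rest
    return clusters

pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
out = binary_kmeans(pts, 2)            # two well-separated blobs
```

Each round only ever runs 2-means on small clusters, which is what keeps the generation step cheap compared with running full k-means directly.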
Algorithm 1 Binary k-means algorithm (BKA)
Input: Data objects {x_1, x_2, ..., x_n}, number of desired classes k.
Step 1: Compute the number of iterations R = int(log2 k) + 1.
Step 2: for r = 1 to R do
            Compute and renew the number of clusters M = 2^(r-1).
            Compute and renew the size of the clusters n_d = n / 2^(r-1).
            for m = 1 to M do
                Call kmeans(n_d, 2) to partition the cluster into two.
            end for
        end for
Step 3: Compute the number of leaves M = 2^R.
Step 4: Merge the M leaves into k clusters.
Output: The cluster membership for each data object.

3 Cluster Ensemble using Binary k-means and Spectral Clustering

Given a text data set X = {x_1, x_2, ..., x_n}, let P = {p_1, p_2, ..., p_r} represent a set of partition results of X. We generate a hypergraph H = {h_1, h_2, ..., h_r} of P with n vertices and t = rk (t << n) hyperedges, using the hypergraph construction proposed in [3]. Because the computational complexity of the EVD of the similarity matrix S is O(n^3), we adopt a matrix transformation technique to lower it. The procedure is as follows. The eigensystem of the similarity matrix S takes the form

    S x = λ x.    (1)

Substituting S = H H^T, Eq. (1) becomes

    H H^T x = λ x.    (2)

Without loss of generality, suppose X ∈ R^{n×m} (n >> m) and c = rank(X). Compute the singular value decomposition (SVD) of X, X = U Σ V^T, with U^T U = V^T V = I_m and Σ = diag(σ_1, σ_2, ..., σ_m), where I_m is the identity matrix and the singular values in Σ are in descending order. Since the EVD of X X^T is X X^T = U Σ^2 U^T and the EVD of X^T X is X^T X = V Σ^2 V^T, the left singular vectors U can be obtained from the EVD of X^T X (see Theorem 3.1). Thus, the main computational cost of computing the left singular vectors U is only O(m^3).

Theorem 3.1. Assume X ∈ R^{n×m} (n >> m) and c = rank(X).
If there exists a matrix V = {v_1, v_2, ..., v_c} consisting of the linearly independent eigenvectors of X^T X such that

    V^T (X^T X) V = diag(σ_1^2, σ_2^2, ..., σ_c^2, 0, ..., 0),

where σ_i is the i-th nonzero singular value of X corresponding to the right singular vector v_i and the left singular vector u_i, then the relationship between the two singular vectors takes the form [13]:

    u_i = X v_i / σ_i.    (3)
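The matrix transformation can be demonstrated numerically. The sketch below (our own illustration using numpy, not the authors' code) recovers the top-k left singular vectors of a tall matrix H from the EVD of the small matrix H^T H via Eq. (3), and checks them against a direct SVD:

```python
import numpy as np

def left_singular_vectors(H, k):
    """Top-k left singular vectors of H (n x t, t << n) obtained from
    the EVD of the small t x t matrix H^T H, then u_i = H v_i / sigma_i
    (Eq. (3)); assumes k <= rank(H)."""
    w, V = np.linalg.eigh(H.T @ H)     # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]    # keep the k largest
    sigma = np.sqrt(np.clip(w[order], 0.0, None))
    U = (H @ V[:, order]) / sigma      # column-wise division by sigma_i
    return U, sigma

rng = np.random.default_rng(0)
H = rng.random((500, 8))               # n = 500 vertices, t = 8 hyperedges
U, s = left_singular_vectors(H, 3)
Ud, sd, _ = np.linalg.svd(H, full_matrices=False)
```

Only an 8 x 8 eigenproblem is solved instead of a 500 x 500 one, which is the O(m^3) versus O(n^3) saving the text describes; the recovered vectors agree with the direct SVD up to sign.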
Theorem 3.2. Assume X = U Σ V^T ∈ R^{n×m} (n >> m) and k < c = rank(X). Let X_k = Σ_{i=1}^{k} u_i σ_i v_i^T = U_k Σ_k V_k^T be the best rank-k approximation to X, with U_k = [u_1, u_2, ..., u_k], V_k = [v_1, v_2, ..., v_k] and Σ_k = diag(σ_1, σ_2, ..., σ_k), where the singular values in Σ are in descending order. Then [13]:

    min_{rank(Y) = k} ||X - Y||_F^2 = ||X - X_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_c^2.    (4)

Theorem 3.2, which is the theoretical basis for techniques such as image enhancement and data reduction, shows that we can use the first k columns of the eigenvectors U to perform clustering. Algorithm 2 summarizes the above process.

Algorithm 2 Cluster Ensemble algorithm using the Binary k-means and Spectral Clustering (CEBKSC)
Input: n × m text-term coincidence matrix X, number of desired classes k.
Step 1: Call BKA to cluster the n texts into k groups; run BKA r times to generate the partition results P.
Step 2: Construct the hypergraph H and compute the similarity matrix S = H H^T.
Step 3: Compute the first k eigenvectors v_1, v_2, ..., v_k of the matrix H^T H.
Step 4: Compute the eigenvectors u_i = H v_i / σ_i.
Step 5: Let Z ∈ R^{n×k} be the matrix consisting of the vectors {u_1, u_2, ..., u_k}.
Step 6: Use the k-means algorithm to cluster the n rows of Z into k groups.
Output: Cluster membership for each text object.

4 Experiment and Results Analysis

We design an experiment to investigate the performance of our proposed algorithm, comparing five different clustering algorithms:

1) Cluster-based Similarity Partitioning Algorithm (CSPA). It uses METIS to obtain the consensus partition from a similarity (co-association) matrix.

2) HyperGraph Partitioning Algorithm (HGPA). In this algorithm, HMETIS is applied to partition a hypergraph.

3) Meta-CLustering Algorithm (MCLA). In this algorithm, METIS is used to partition a similarity matrix between clusters.
4) Hybrid Bipartite Graph Formulation (HBGF). We apply spectral clustering to partition the bipartite graph.

5) Cluster Ensemble based on the Binary k-means and Spectral Clustering (CEBKSC). We use both binary k-means and spectral clustering to solve the text cluster ensemble problem.

All of the above algorithms use the MATLAB built-in k-means function with 10 replications and a maximum of 100 iterations.
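The hypergraph construction of [3] used in Step 2 of Algorithm 2 amounts to stacking one-hot cluster-indicator blocks, one block per base partition. A small numpy sketch (the function name is ours):

```python
import numpy as np

def hypergraph(partitions, k):
    """n x (r*k) indicator matrix: one column (hyperedge) per cluster of
    each base partition; S = H @ H.T then counts co-associations."""
    n = len(partitions[0])
    blocks = []
    for labels in partitions:
        B = np.zeros((n, k))
        B[np.arange(n), labels] = 1.0  # object i belongs to cluster labels[i]
        blocks.append(B)
    return np.hstack(blocks)

# r = 2 base partitions of n = 4 objects into k = 2 clusters each
P = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])]
H = hypergraph(P, 2)                   # shape (4, 4): t = r*k = 4 hyperedges
S = H @ H.T
```

Entry S[i, j] counts in how many base partitions objects i and j fall in the same cluster, which is exactly the co-association similarity that the spectral step then embeds.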
The description of the experimental data sets

Our experiment uses six data sets:

1) tr31 and tr41. They are derived from the TREC-6 and TREC-7 collections. The real categories of the two data sets correspond to the queries of the particular categories.

2) re0 and re1. They are selected from the Reuters text categorization test collection Distribution 1.0. We divide the labels into two subsets, and for each subset we select the texts with a single label.

3) reviews and hitech. They are derived from the San Jose Mercury newspaper articles distributed as part of the TREC collection (TIPSTER Vol. 3). reviews contains texts about food, movies, music, radio, and restaurants; hitech contains texts about computers, electronics, health, medical, research, and technology. No two texts share the same DESCRIPT tag, which may contain multiple categories.

Table 1: The description of experimental data sets

Data sets | Instances | Features | Classes
tr31 | | |
tr41 | | |
re0 | | |
re1 | | |
reviews | | |
hitech | | |

The verification of the effectiveness of our method

We measure quality via the Normalized Mutual Information (NMI), an information-theoretic measure quantifying the match between the category labels and the cluster labels, and the Average Normalized Mutual Information (ANMI), which measures the average NMI between a set of r labelings and the final labeling.

Note: The highest scores in Tables 2 and 3, and the shortest times in Table 4, are marked in bold.

Table 2: NMI comparisons of five cluster ensemble methods

Data sets | CSPA | HGPA | MCLA | HBGF | CEBKSC
tr31 | | | | |
tr41 | | | | |
re0 | | | | |
re1 | | | | |
reviews | | | | |
hitech | | | | |
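NMI with the geometric-mean normalization of [3] can be computed from scratch as sketched below (our own illustration); ANMI is then simply the mean of the NMI between each of the r base labelings and the final labeling.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """NMI(A, B) = I(A; B) / sqrt(H(A) * H(B)), computed from counts."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # mutual information from the joint and marginal cluster counts
    mi = sum(nij / n * math.log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    # a single-cluster labeling has zero entropy; define NMI as 0 then
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # same partition up to relabeling
```

Because NMI depends only on the co-occurrence counts of labels, it is invariant to permuting cluster names, which is what makes it suitable for comparing a clustering with the ground-truth categories.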
Table 3: ANMI comparisons of five cluster ensemble methods

Data sets | CSPA | HGPA | MCLA | HBGF | CEBKSC
tr31 | | | | |
tr41 | | | | |
re0 | | | | |
re1 | | | | |
reviews | | | | |
hitech | | | | |

Tables 2 and 3 present the comparison results; each result is an average over 10 runs. The results show that:

1) The clustering quality of CSPA is better than that of HGPA and MCLA on all data sets. CSPA calls the efficient graph partitioning method METIS and is stable.

2) On all of the experimental data sets, the clustering quality of the two spectral cluster ensemble methods, HBGF and CEBKSC, is better than that of the three graph- and hypergraph-based methods, CSPA, HGPA and MCLA.

3) CEBKSC slightly outperforms HBGF: it obtains higher NMI values than HBGF on all data sets, and the highest ANMI values on all data sets except re0.

Table 4: Runtime comparisons of five cluster ensemble methods at the ensemble step

Data sets | CSPA | HGPA | MCLA | HBGF | CEBKSC
tr31 | | | | |
tr41 | | | | |
re0 | | | | |
re1 | | | | |
reviews | | | | |
hitech | | | | |

From Table 4, we make the following observations:

1) CSPA is the slowest of the cluster ensemble methods, followed by HBGF. CSPA has a computational and storage complexity of O(m k n^2), quadratic in the number of texts, and HBGF calls a time-consuming spectral clustering to partition a bipartite graph.

2) MCLA is slightly slower than CEBKSC and HGPA; it has a computational complexity of O(m^2 k^2 n).

3) CEBKSC and HGPA are the fastest methods. CEBKSC requires significantly reduced computation thanks to the matrix transformation technique, and the computational complexity of HGPA is only O(m k n).

5 Conclusions and Future Work

In this paper, we develop a cluster ensemble method using the binary k-means and spectral clustering. The proposed algorithm takes advantage of both the binary k-means method and the spectral clustering method while their shortcomings are avoided.
On one hand, the usage of the binary k-means method permits the formation of partitions that differ from each other. On the other hand, applying the spectral clustering method to the partition results rather than directly to the texts yields superior clustering performance. Moreover, a matrix transformation technique is adopted to address the computational and memory problems of spectral clustering. In future work, we will research techniques to avoid the bottlenecks of our method, including accelerating the binary k-means method, and we will investigate the possibility of SAR image segmentation using our method.

References

[1] Bai Xue, Luo Si-wei, Yin Hui, Ni Wei-yuan, Multi-Feature Similarity Measures Under Information-Based Clustering Framework for Image Segmentation, Journal of Computational Information Systems, 2012, 8 (15).
[2] Vega-Pons S, Ruiz-Shulcloper J, A Survey of Clustering Ensemble Algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 2011, 25 (3).
[3] Strehl A, Ghosh J, Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitionings, In Proc. Conference on Artificial Intelligence (AAAI 2002), Edmonton, AAAI/MIT Press, 2002.
[4] Fred A L, Jain A K, Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27 (6).
[5] Meila M, Shi J, Learning Segmentation by Random Walks, Proc. Conf. Neural Information Processing Systems, 2000.
[6] Shi J, Malik J, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22 (8).
[7] Fowlkes C, Belongie S, Chung F, Malik J, Spectral Grouping Using the Nyström Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26 (2).
[8] Dhillon I S, Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001.
[9] Xu Wei, Liu Xin, Gong Yi-hong, Document Clustering Based on Non-negative Matrix Factorization, Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2003.
[10] Yu S X, Shi Jian-bo, Multiclass Spectral Clustering, Proceedings of the Ninth IEEE International Conference on Computer Vision, IEEE, 2003.
[11] Liu Rong, Zhang Hao, Segmentation of 3D Meshes Through Spectral Clustering, Proceedings of the 12th Pacific Conference on Computer Graphics and Applications, IEEE, 2004.
[12] Karypis G, Kumar V, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, 1998, 20 (1).
[13] Berry M W, Large-Scale Sparse Singular Value Computations, International Journal of Supercomputer Applications, 1992, 6 (1).
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-6) Online Cross-Modal Hashing for Web Image Retrieval Liang ie Department of Mathematics Wuhan University of Technology, China
More informationGlobal Fuzzy C-Means with Kernels
Global Fuzzy C-Means with Kernels Gyeongyong Heo Hun Choi Jihong Kim Department of Electronic Engineering Dong-eui University Busan, Korea Abstract Fuzzy c-means (FCM) is a simple but powerful clustering
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationSTUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES
25-29 JATIT. All rights reserved. STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES DR.S.V.KASMIR RAJA, 2 A.SHAIK ABDUL KHADIR, 3 DR.S.S.RIAZ AHAMED. Dean (Research),
More informationNon-negative Matrix Factorization for Multimodal Image Retrieval
Non-negative Matrix Factorization for Multimodal Image Retrieval Fabio A. González PhD Bioingenium Research Group Computer Systems and Industrial Engineering Department Universidad Nacional de Colombia
More informationInvestigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text Amir Hossein Jadidinejad Mitra Mohtarami Hadi Amiri Computer Engineering Department, Islamic Azad University,
More informationRobust Kernel Methods in Clustering and Dimensionality Reduction Problems
Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Jian Guo, Debadyuti Roy, Jing Wang University of Michigan, Department of Statistics Introduction In this report we propose robust
More informationBipartite Edge Prediction via Transductive Learning over Product Graphs
Bipartite Edge Prediction via Transductive Learning over Product Graphs Hanxiao Liu, Yiming Yang School of Computer Science, Carnegie Mellon University July 8, 2015 ICML 2015 Bipartite Edge Prediction
More information[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationAccumulation. Instituto Superior Técnico / Instituto de Telecomunicações. Av. Rovisco Pais, Lisboa, Portugal.
Combining Multiple Clusterings Using Evidence Accumulation Ana L.N. Fred and Anil K. Jain + Instituto Superior Técnico / Instituto de Telecomunicações Av. Rovisco Pais, 149-1 Lisboa, Portugal email: afred@lx.it.pt
More informationHIGH RESOLUTION REMOTE SENSING IMAGE SEGMENTATION BASED ON GRAPH THEORY AND FRACTAL NET EVOLUTION APPROACH
HIGH RESOLUTION REMOTE SENSING IMAGE SEGMENTATION BASED ON GRAPH THEORY AND FRACTAL NET EVOLUTION APPROACH Yi Yang, Haitao Li, Yanshun Han, Haiyan Gu Key Laboratory of Geo-informatics of State Bureau of
More informationClustering via Random Walk Hitting Time on Directed Graphs
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (8) Clustering via Random Walk Hitting Time on Directed Graphs Mo Chen Jianzhuang Liu Xiaoou Tang, Dept. of Information Engineering
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:
Semi Automatic Annotation Exploitation Similarity of Pics in i Personal Photo Albums P. Subashree Kasi Thangam 1 and R. Rosy Angel 2 1 Assistant Professor, Department of Computer Science Engineering College,
More informationNikolaos Tsapanos, Anastasios Tefas, Nikolaos Nikolaidis and Ioannis Pitas. Aristotle University of Thessaloniki
KERNEL MATRIX TRIMMING FOR IMPROVED KERNEL K-MEANS CLUSTERING Nikolaos Tsapanos, Anastasios Tefas, Nikolaos Nikolaidis and Ioannis Pitas Aristotle University of Thessaloniki ABSTRACT The Kernel k-means
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationI How does the formulation (5) serve the purpose of the composite parameterization
Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)
More informationAn Approach for Reduction of Rain Streaks from a Single Image
An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute
More informationActive Sampling for Constrained Clustering
Paper: Active Sampling for Constrained Clustering Masayuki Okabe and Seiji Yamada Information and Media Center, Toyohashi University of Technology 1-1 Tempaku, Toyohashi, Aichi 441-8580, Japan E-mail:
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationA Patent Retrieval Method Using a Hierarchy of Clusters at TUT
A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan
More informationApproximate Nearest Centroid Embedding for Kernel k-means
Approximate Nearest Centroid Embedding for Kernel k-means Ahmed Elgohary, Ahmed K. Farahat, Mohamed S. Kamel, and Fakhri Karray University of Waterloo, Waterloo, Canada N2L 3G1 {aelgohary,afarahat,mkamel,karray}@uwaterloo.ca
More informationA REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH
A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH Sandhya V. Kawale Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh
More informationA Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data
Journal of Computational Information Systems 11: 6 (2015) 2139 2146 Available at http://www.jofcis.com A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data
More informationRobust Lossless Image Watermarking in Integer Wavelet Domain using SVD
Robust Lossless Image Watermarking in Integer Domain using SVD 1 A. Kala 1 PG scholar, Department of CSE, Sri Venkateswara College of Engineering, Chennai 1 akala@svce.ac.in 2 K. haiyalnayaki 2 Associate
More informationPrincipal Coordinate Clustering
Principal Coordinate Clustering Ali Sekmen, Akram Aldroubi, Ahmet Bugra Koku, Keaton Hamm Department of Computer Science, Tennessee State University Department of Mathematics, Vanderbilt University Department
More informationMinoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
More informationClustering Ensembles Based on Normalized Edges
Clustering Ensembles Based on Normalized Edges Yan Li 1,JianYu 2, Pengwei Hao 1,3, and Zhulin Li 1 1 Center for Information Science, Peking University, Beijing, 100871, China {yanli, lizhulin}@cis.pku.edu.cn
More informationClustering Documents in Large Text Corpora
Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationAlgebraic Techniques for Analysis of Large Discrete-Valued Datasets
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Mehmet Koyutürk 1,AnanthGrama 1, and Naren Ramakrishnan 2 1 Dept. of Computer Sciences, Purdue University W. Lafayette, IN, 47907, USA
More informationRobust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma
Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Presented by Hu Han Jan. 30 2014 For CSE 902 by Prof. Anil K. Jain: Selected
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationGraph and Hypergraph Partitioning for Parallel Computing
Graph and Hypergraph Partitioning for Parallel Computing Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology June 29, 2016 Graph and hypergraph partitioning References:
More informationCLASSIFICATION FOR SCALING METHODS IN DATA MINING
CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 874-7563, ekyper@mail.uri.edu Lutz Hamel, Department
More informationEfficient FM Algorithm for VLSI Circuit Partitioning
Efficient FM Algorithm for VLSI Circuit Partitioning M.RAJESH #1, R.MANIKANDAN #2 #1 School Of Comuting, Sastra University, Thanjavur-613401. #2 Senior Assistant Professer, School Of Comuting, Sastra University,
More informationGeneralized trace ratio optimization and applications
Generalized trace ratio optimization and applications Mohammed Bellalij, Saïd Hanafi, Rita Macedo and Raca Todosijevic University of Valenciennes, France PGMO Days, 2-4 October 2013 ENSTA ParisTech PGMO
More informationObservational Learning with Modular Networks
Observational Learning with Modular Networks Hyunjung Shin, Hyoungjoo Lee and Sungzoon Cho {hjshin72, impatton, zoon}@snu.ac.kr Department of Industrial Engineering, Seoul National University, San56-1,
More informationCOMBINING MULTIPLE PARTITIONS CREATED WITH A GRAPH-BASED CONSTRUCTION FOR DATA CLUSTERING
Author manuscript, published in "IEEE International Workshop on Machine Learning for Signal Processing, Grenoble : France (29)" COMBINING MULTIPLE PARTITIONS CREATED WITH A GRAPH-BASED CONSTRUCTION FOR
More informationText Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering
Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationAN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING
AN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING Irina Bernst, Patrick Bouillon, Jörg Frochte *, Christof Kaufmann Dept. of Electrical Engineering
More informationGlobally and Locally Consistent Unsupervised Projection
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Globally and Locally Consistent Unsupervised Projection Hua Wang, Feiping Nie, Heng Huang Department of Electrical Engineering
More informationText clustering based on a divide and merge strategy
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and
More informationLarge-Scale Face Manifold Learning
Large-Scale Face Manifold Learning Sanjiv Kumar Google Research New York, NY * Joint work with A. Talwalkar, H. Rowley and M. Mohri 1 Face Manifold Learning 50 x 50 pixel faces R 2500 50 x 50 pixel random
More information