Weighted-Object Ensemble Clustering
2013 IEEE 13th International Conference on Data Mining

Weighted-Object Ensemble Clustering

Yazhou Ren, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Guoji Zhang, School of Sciences, South China University of Technology, Guangzhou 510640, China
Carlotta Domeniconi, Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
Guoxian Yu, College of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract—Ensemble clustering, also known as consensus clustering, aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings. Boosting algorithms developed for classification have also led to the idea of considering weighted objects during the clustering process. However, not much effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, in this paper we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated to objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

Keywords—Ensemble clustering, consensus clustering, graph partition, weighted objects.

I.
INTRODUCTION

Clustering is a key step in many data mining tasks. It is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, because each algorithm has its own bias due to the optimization of different criteria; furthermore, there is no ground truth against which to validate the result. Recently, clustering ensembles have emerged as a technique for overcoming these problems [1], [2]. A clustering ensemble technique is characterized by two components: the mechanism used to generate diverse partitions, and the consensus function used to combine the input partitions into a final clustering. A clustering ensemble consists of different clusterings, obtained from multiple applications of a single algorithm with different initializations, or on various bootstrap samples of the available data, or from the application of different algorithms to the same data set. Clustering ensembles offer a solution to challenges inherent to clustering, arising from its ill-posed nature: they can provide more robust and stable solutions by exploiting the consensus across multiple clustering results, while averaging out spurious structures that arise due to the various biases to which each participating algorithm is tuned, or to the variance induced by different data samples. Researchers have investigated how to combine clustering ensembles with subspace clusterings, in an effort to address both the ill-posed nature of clustering and the curse of dimensionality in high-dimensional spaces [3], [4]. A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster. In a subspace clustering ensemble, the consensus function makes use of both the clusters and the weight vectors provided by the base subspace clusterings.
Work has also been done to evaluate the relevance of each base clustering (and assign weights accordingly), in an effort to improve the final consensus clustering [5]. To the best of our knowledge, however, no previous work has investigated how to use weights associated to objects within the clustering ensemble framework. Several approaches have studied the benefits of weighting the objects in iterative clustering methods combined with boosting techniques [6], [7], [8]. Empirical observations suggest that large weights should be assigned to the objects that are hard to cluster. As a result, centroid-based clustering methods move the centers towards the regions containing the objects whose cluster membership is hard to determine. In boosting, the weights bias the data distribution, thus making the region around the difficult points denser. With this interpretation of the weights, shifting the centers of the clusters corresponds to moving them towards the modes of the distribution modified by the weights. Nock and Nielsen [9] formulated clustering as a constrained minimization using a Bregman divergence, and recommended that the hard objects be given large weights. They analyzed the benefits of using boosting techniques in clustering and introduced several weighted versions of classical clustering algorithms. Inspired by this work, we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as
weights associated to objects. Different from boosting [10], the weights are not subject to iterative changes. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings. The main contributions of this paper are as follows: i) to the best of our knowledge, this is the first time weights associated to objects are used within clustering ensembles; in WOEC, a one-shot method is proposed to assign weights to objects, and three weighted-object techniques are then presented to compute the consolidated clustering; ii) a Weighted-Object k-means algorithm (WOKmeans) is proposed, whose superiority to k-means demonstrates the effectiveness of the one-shot weight assignment; iii) extensive experiments are conducted to analyze the performance of WOEC against existing consensus clustering algorithms, on several data sets and using three popular clustering evaluation measures.

The rest of this paper is organized as follows. In Section II, we review related work on ensemble clustering. In Section III, we introduce the WOEC methodology. Section IV gives the experimental settings, and Section V analyzes the experimental results. Conclusions and future work are provided in Section VI.

II. RELATED WORK

Ensemble techniques were first developed for supervised settings. Empirical results have shown that they are capable of improving the generalization performance of the base classifiers [11]. This has inspired researchers to further investigate ensemble techniques for unsupervised problems, namely clustering.
Fred and Jain [12], [13] captured the information provided by the base clusterings in a co-association matrix, where each entry is the frequency with which two objects are clustered together. The co-association matrix was used as a similarity matrix to compute the final clustering. Strehl and Ghosh [2] proposed three graph-based methods, namely CSPA, HGPA, and MCLA. CSPA constructs a co-association matrix, which is again viewed as a pairwise similarity matrix. A graph is then constructed, where each object corresponds to a node, and each entry of the co-association matrix gives the weight of the edge between two objects. The METIS [14] package is used to partition the objects into k clusters. HGPA combines multiple clusterings by solving a hypergraph partitioning problem: in the hypergraph, each hyperedge represents a cluster, and it connects all the objects that belong to the corresponding cluster. The package HMETIS [15] is utilized to generate the final clustering. MCLA considers each cluster as a node in a meta-graph, and sets the pairwise similarity between two clusters to the ratio between the number of shared objects and the total number of objects in the two clusters. In the meta-graph, each cluster is represented by a hyperedge. MCLA groups and collapses related hyperedges, and it assigns an object to the collapsed hyperedge in which it participates most strongly. MCLA also uses METIS [14] to partition the meta-graph. Fern and Brodley [16] introduced HBGF to integrate base clusterings. HBGF constructs a bipartite graph that models both objects and clusters simultaneously as vertices. In the bipartite graph, an object is connected to several clusters, whereas there is no edge between pairs of objects or pairs of clusters. HBGF applies two graph partitioning algorithms, spectral graph partitioning [17], [18] and METIS, to obtain the final partitions.
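Several of the methods above start from the co-association matrix. As a concrete illustration (our own sketch, not code from any of the cited papers), the matrix can be built from base clusterings represented as integer label arrays:

```python
import numpy as np

def co_association(base_labelings):
    """Fraction of base clusterings that put each pair of objects
    in the same cluster (the CSPA-style co-association matrix)."""
    base_labelings = [np.asarray(l) for l in base_labelings]
    n = len(base_labelings[0])
    A = np.zeros((n, n))
    for labels in base_labelings:
        # labels[:, None] == labels[None, :] marks pairs sharing a cluster
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / len(base_labelings)

# two base clusterings of four objects: they agree on the pair (2, 3)
# and disagree on where object 1 belongs
A = co_association([[0, 0, 1, 1], [0, 1, 1, 1]])
```

The resulting matrix can be fed to any similarity-based partitioner; CSPA, for instance, runs METIS on the graph it induces.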
To handle missing values, Wang, Shan, and Banerjee proposed Bayesian cluster ensembles (BCE) [19], which treats all base clustering results as feature vectors and learns a Bayesian mixed-membership model from this feature representation. Domeniconi and Al-Razgan [3], [4] combined the clustering ensemble framework with subspace clustering. A subspace clustering is a collection of weighted clusters, where each cluster has a weight vector representing the relevance of the features for that cluster; the input to the consensus function is a collection of subspace clusterings. The subspace clustering method used is Locally Adaptive Clustering (LAC) [20], and the resulting consensus techniques are WSPA and WBPA. Li and Ding [21] specified different weights for different clusterings and proposed an approach called Weighted Consensus Clustering (WCC): the optimal weights are sought iteratively through optimization, and are then used to compute the consensus result within the framework of nonnegative matrix factorization (NMF) [5]. More recently, a method called ensemble clustering by matrix completion (ECMC) was proposed [22]. ECMC uses reliable pairs of objects to construct a partially observed co-association matrix, and exploits a matrix completion algorithm to fill in the missing entries of the co-association matrix; a pair of objects is reliable if the two objects are almost always, or almost never, clustered together. ECMC then applies spectral clustering to the completed matrix to obtain the final clustering. However, ECMC has two disadvantages: (i) the notion of reliability between objects is hard to define, and (ii) the matrix completion process may result in information loss. None of the aforementioned methods assigns weights to objects. In contrast, our proposed WOEC algorithms use the information provided by the base clusterings to define object weights that reflect how difficult it is to cluster each object. The weights are then embedded within the consensus function.
Specifically, we propose three different WOEC approaches: (i) Weighted-Object Meta Clustering (WOMC); (ii) Weighted-Object Similarity Partitioning (WOSPA); and (iii) Weighted-Object Hybrid Bipartite Graph Partitioning (WOHBGP).

III. WEIGHTED-OBJECT ENSEMBLE CLUSTERING

This section introduces the ensemble clustering problem and presents the details of our weighted-object ensemble clustering.

A. Ensemble Clustering Problem Formulation

An ensemble clustering approach consists of two steps: first, multiple base clusterings are generated; second, the base clusterings are consolidated into the final clustering result. We assume here that the base clusterings have already been generated, and we only discuss hard clustering; however, the proposed methods can be extended to soft clustering problems as well.
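Throughout, a hard base clustering can be stored as an integer label array over the n objects; the following minimal sketch fixes this representation (our choice of encoding, not one prescribed by the paper):

```python
import numpy as np

# A hard clustering of n objects assigns every object exactly one
# cluster id, so a length-n integer array encodes disjoint clusters
# whose union is the whole data set. An ensemble is a list of R such
# arrays; a consensus function maps that list to one final array.
def is_hard_clustering(labels, n):
    labels = np.asarray(labels)
    return labels.shape == (n,) and np.issubdtype(labels.dtype, np.integer)

ensemble = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
ok = all(is_hard_clustering(c, 4) for c in ensemble)
```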
Let X = {x_1, x_2, ..., x_n} denote the data set and X = [x_1^T; x_2^T; ...; x_n^T] ∈ R^{n×d} its matrix representation, where n is the number of objects and d is the dimensionality of each object x_i = (x_{i1}, x_{i2}, ..., x_{id})^T. Let us consider a set of R clustering solutions C = {C^1, C^2, ..., C^R}, where each clustering component C^r = {C_1^r, C_2^r, ..., C_{k_r}^r}, r = 1, 2, ..., R, partitions the data set X into k_r disjoint clusters, i.e., C_i^r ∩ C_j^r = ∅ (i ≠ j; i, j = 1, 2, ..., k_r) and ∪_{k=1}^{k_r} C_k^r = X. The ensemble clustering problem consists in defining a consensus function Γ that maps the set of base clusterings C into a single consolidated clustering C*.

B. One-shot Weight Assignment

Boosting [10], [11] is a supervised technique which seeks to create a strong learner from a set of weak learners. AdaBoost (short for Adaptive Boosting) [23] is the most commonly used boosting algorithm. It iteratively generates a distribution over the data, which is then used to train the next weak classifier. In each run, objects that are hard to classify gain weight, while easy-to-classify objects lose weight, so that the newly trained classifier focuses on the points with larger weights. At termination, AdaBoost combines the weak classifiers into a strong one. The effectiveness of boosting has been demonstrated both theoretically and experimentally. Boosting algorithms iteratively assign weights to objects using label information, which is typically unavailable in clustering applications. This makes it difficult to apply boosting techniques to clustering, and explains why little research has been performed on weighted-object ensemble clustering methods. This paper aims at reducing this gap. Our goal is to enable difficult-to-cluster points to play a bigger role when producing the final consolidated clustering result.
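The one-shot weighting scheme developed next (Eqs. (1)-(4)) can be sketched as follows; this is our own illustration of the scheme, with e standing for the smoothing term of Eq. (4):

```python
import numpy as np

def object_weights(base_labelings, e=0.01):
    """One-shot object weights from the ensemble's co-association
    matrix: w_i is large when the base clusterings disagree on how
    to cluster x_i (a sketch of Eqs. (1)-(4); e is the smoothing term)."""
    base_labelings = [np.asarray(l) for l in base_labelings]
    n = len(base_labelings[0])
    A = np.zeros((n, n))
    for labels in base_labelings:
        A += (labels[:, None] == labels[None, :]).astype(float)
    A /= len(base_labelings)             # Eq. (1): A_ij in [0, 1], A_ii = 1
    confusion = A * (1.0 - A)            # Eq. (2): peaks at A_ij = 0.5
    w = 4.0 / n * confusion.sum(axis=1)  # Eq. (3): w_i in [0, 1]
    return (w + e) / (1.0 + e)           # Eq. (4): smoothed, w_i in (0, 1]

# object 1 is the ambiguous one: the three base clusterings disagree on it
w = object_weights([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]])
```

Here object 1, whose membership flips across the base clusterings, receives the largest weight, while the consistently grouped objects receive small ones.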
Following this objective, we propose a one-shot weight assignment to objects using the results of all base clusterings, and then embed the weights into the subsequent consensus clustering process. Similar to boosting, points that are hard to cluster receive larger weights, while easy-to-cluster points are given smaller weights. The difference is that boosting is an iterative process, while our weight assignment is performed in one shot. The details are given below.

Let A be the n × n co-association matrix built from the clustering solutions C:

A_ij = V_ij / R    (1)

where V_ij is the number of times objects x_i and x_j co-occur in the same cluster, and R = |C| is the ensemble size. Obviously, A_ij ∈ [0, 1]. We set A_ii = 1, i = 1, 2, ..., n. A_ij ≈ 1 means that x_i and x_j are often placed in the same cluster; A_ij ≈ 0 means that x_i and x_j are often placed in different clusters. In both cases, the base clusterings show a high level of agreement. When A_ij ≈ 0.5, roughly half of the clusterings group x_i and x_j together, and the other half place them in different clusters. This scenario is the most uncertain one, since the clustering components show no agreement on how to cluster points x_i and x_j. We can capture this trend by mapping the A_ij values through a quadratic function y(x) = x(1 − x), x ∈ [0, 1]. This function achieves its peak at x = 0.5, and its minima at x = 0 and x = 1, as illustrated in Fig. 1. (Other functions satisfying these properties can be used as well.) Hence, we can measure the level of uncertainty in clustering two points x_i and x_j as follows:

confusion(x_i, x_j) = A_ij (1 − A_ij)    (2)

Fig. 1. Quadratic function y(x) = x(1 − x).

The confusion index reaches its maximum of 0.25 when A_ij = 0.5, and its minimum of 0 when A_ij = 0 or A_ij = 1. We use this confusion measure to define the weight associated with each object as follows:

w_i = (4/n) Σ_{j=1}^{n} confusion(x_i, x_j)    (3)

The normalization factor 4/n guarantees that w_i ∈ [0, 1]. To avoid zero-valued weights, which can lead to instability, we add a smoothing term:

w_i ← (w_i + e) / (1 + e)    (4)

where e is a small positive number (e = 0.01 in our experiments). As a result, w_i ∈ (0, 1]. A large w_i value means that confusion(x_i, x_j) is large for many x_j, and hence measures how hard it is to cluster x_i. The assignment of large weights to points that are hard to cluster is consistent with the way boosting techniques operate [10], [11].

C. Algorithms

In this section we introduce three different Weighted-Object Ensemble Clustering (WOEC) algorithms. They all make use of the weights associated with objects (computed as explained above) to determine the consensus clustering. The overall process is shown in Fig. 2. As illustrated in the figure, the consensus functions of WOSPA and WOHBGP make use of the feature vectors, while WOMC does not.

1) Weighted-Object Meta Clustering Algorithm (WOMC): The first approach we introduce is a weighted version of the Meta Clustering Algorithm (MCLA) introduced in [2]. We call the resulting algorithm Weighted-Object Meta Clustering (WOMC). The key step of meta clustering algorithms is the clustering of clusters. When measuring the similarity between two clusters, MCLA treats all the points present in both clusters equally. In contrast, the proposed WOMC technique distinguishes the contributions of hard-to-cluster and easy-to-cluster points. Specifically, a point in the intersection of two clusters contributes to the clusters' similarity in a way that is proportional to its weight. The details of the approach follow. We make the components of each C^r in C explicit. This gives C = {C_1^1, ..., C_{k_1}^1, C_1^2, ..., C_{k_2}^2, ..., C_1^R, ..., C_{k_R}^R}, where each
Fig. 2. The process map of Weighted-Object Ensemble Clustering (WOEC).

component now corresponds to a cluster. The WOMC algorithm groups the clusters in C. To this end, it proceeds as follows. It constructs an undirected meta-graph G = (V, E) with |V| = n_c = Σ_{r=1}^{R} k_r vertices, each representing a cluster in C. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij defined as follows:

S_ij = (Σ_{x_k ∈ C_i ∩ C_j} w_k) / (Σ_{x_k ∈ C_i ∪ C_j} w_k)    (5)

Clearly, S_ij ∈ [0, 1]. If S_ij = 0, there is no edge between C_i and C_j in G; S_ij = 1 indicates that C_i and C_j contain the same points. WOMC partitions the meta-graph G into k (the number of clusters in the final clustering result) meta-clusters, each representing a group of clusters, by using the similarity measure S_ij and by applying the well-known graph partitioning algorithm METIS [14]. Similarly to MCLA, WOMC assigns each point to the meta-cluster in which it participates most strongly. Specifically, a data point may occur in different meta-clusters, and it is associated to a given meta-cluster according to the fraction of that meta-cluster's constituent clusters it belongs to. A point is eventually assigned to the meta-cluster with the highest such fraction for that point. Ties are broken randomly. There might be a situation in which no object is assigned to a meta-cluster; thus, the final combined clustering C* may contain fewer than k clusters.

The similarity measure defined in Eq. (5) computes the similarity of two clusters based not only on the fraction of shared points, but also on the values of the weights associated to the points. In contrast, the binary Jaccard measure |C_i ∩ C_j| / |C_i ∪ C_j| computes the similarity as the ratio between the size of the intersection and the size of the union of C_i and C_j. Suppose C_i and C_j have fixed points assigned to them; their Jaccard similarity is then also fixed. Let us see how changing the weights assigned to the points affects the value of S_ij. Consider D = (C_i ∪ C_j) \ (C_i ∩ C_j), i.e., the set of points not in the intersection. Let the weights assigned to the points in D be fixed. Then, increasing (decreasing) the weights assigned to the points in C_i ∩ C_j causes an increase (decrease) of S_ij. As such, the more hard-to-cluster points C_i and C_j share, the more similar they are. Let us now fix the weights assigned to the points in C_i ∩ C_j. Increasing (decreasing) the weights assigned to the points in D causes a decrease (increase) of S_ij. As a consequence, the more hard-to-cluster points C_i and C_j do not share, the smaller their similarity. Algorithm 1 describes the WOMC algorithm.

Algorithm 1 Weighted-Object Meta Clustering Algorithm (WOMC)
Input: Set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the meta-graph G = (V, E), where the weights associated to the edges are computed using Eq. (5).
3: MetaClusters = METIS(G, k). // Partition the meta-graph into k meta-clusters.
4: C* = Assign(MetaClusters). // Assign each object to the meta-cluster in which it participates most strongly.
5: return C*.

2) Weighted-Object Similarity Partitioning Algorithm (WOSPA): Previous work on clustering [6], [7], [8] suggested assigning large weights to objects that are hard to cluster, and iteratively moving cluster centers towards areas that contain objects with large weights. Inspired by this work, we proceed similarly while leveraging the information provided by the ensemble to estimate the weights. The motivation stems from boosting. The difference is that clustering methods combined with a boosting technique assign weights to objects and move cluster centers iteratively, while we move the centers only once for each base clustering, by making use of the original data and the objects' weight information.
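The weighted cluster similarity of Eq. (5) can be sketched as follows (our code, with clusters as sets of object indices and w holding the one-shot weights). It also illustrates the property discussed above: two clusters sharing heavy (hard-to-cluster) points score higher than two sharing only light (easy) points, even at equal Jaccard similarity:

```python
import numpy as np

def weighted_cluster_similarity(Ci, Cj, w):
    """Eq. (5): weight mass of the intersection over weight mass of
    the union (a weighted Jaccard index between two clusters)."""
    union = Ci | Cj
    if not union:
        return 0.0
    inter = Ci & Cj
    return sum(w[k] for k in inter) / sum(w[k] for k in union)

w = np.array([0.9, 0.9, 0.1, 0.1])  # objects 0, 1 hard; objects 2, 3 easy

# both pairs below have plain Jaccard similarity 2/4 = 0.5 ...
S_hard = weighted_cluster_similarity({0, 1, 2}, {0, 1, 3}, w)  # share hard points
S_easy = weighted_cluster_similarity({0, 2, 3}, {1, 2, 3}, w)  # share easy points
```

... yet S_hard comes out far larger than S_easy, exactly the asymmetry WOMC relies on when clustering the clusters.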
Clustering with a boosting technique aims to generate a better single clustering result, while we seek a better consolidated consensus solution. To this end, for each cluster in the base clustering C^r (r = 1, 2, ..., R), we compute the corresponding weighted center as follows:

m_l^r = (Σ_{x_i ∈ C_l^r} w_i x_i) / (Σ_{x_i ∈ C_l^r} w_i)    (6)

where l = 1, ..., k_r. The basic idea behind this step is to move the center of each cluster towards the region containing the objects that are hard to cluster. We call this the shifting-center technique. The experiments presented in Section V demonstrate the rationale for, and the improvement achieved by, this transformation. The similarity between a point x_i and a weighted center m_l^r is calculated using the following exponential function:

d(x_i, m_l^r) = exp{−||x_i − m_l^r||² / t}    (7)

The probability of cluster C_l^r, given x_i, can then be defined as:

P(C_l^r | x_i) = d(x_i, m_l^r) / Σ_{l'=1}^{k_r} d(x_i, m_{l'}^r)    (8)

The smaller ||x_i − m_l^r||² is, the larger P(C_l^r | x_i) will be. We can now define the vector P_i^r of posterior probabilities associated with x_i:

P_i^r = (P(C_1^r | x_i), P(C_2^r | x_i), ..., P(C_{k_r}^r | x_i))    (9)

where Σ_{l=1}^{k_r} P(C_l^r | x_i) = 1. P_i^r provides a new representation of x_i in a space of relative coordinates with respect to the cluster centroids, where each dimension corresponds to one cluster. This new representation embeds information from both the original input data and the clustering ensemble. We use the cosine similarity to measure the similarity S_ij^r between x_i and x_j with respect to the base clustering C^r:

S_ij^r = P_i^r (P_j^r)^T / (||P_i^r|| ||P_j^r||)    (10)

where T denotes the transpose of a vector. There are R similarity matrices S^r (r = 1, ..., R), each corresponding to a base clustering. We combine these matrices into one final similarity matrix S:

S = (1/R) Σ_{r=1}^{R} S^r    (11)

S_ij represents the average similarity between x_i and x_j (through the vectors P_i and P_j), across the R contributing clusterings. Hence we can construct an undirected graph G = (V, E), where |V| = n and each vertex represents an object in X. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij. A graph partitioning algorithm (METIS in our experiments) can be applied to G to compute the partitioning of the n vertices that minimizes the edge weight-cut. This gives the consensus clustering we seek. The WOSPA algorithm is illustrated in Algorithm 2.

Algorithm 2 Weighted-Object Similarity Partitioning Algorithm (WOSPA)
Input: Data set X; set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the corresponding graph G = (V, E), with weights computed through Eqs. (6)-(11).
3: C* = METIS(G, k). // Partition the graph into k parts, each representing a final cluster.
4: return C*.

3) Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP): WOSPA measures pairwise similarities which are solely instance-based, thus ignoring cluster similarities. We now introduce our third technique, WOHBGP, which attempts to capture both kinds of similarities. This approach reduces the clustering consensus problem to a bipartite graph partitioning problem, which partitions cluster vertices and instance vertices simultaneously; thus, it also accounts for similarities between clusters. Like WOSPA, WOHBGP makes use of the shifting-center technique and computes probabilities as in Eq. (8). WOHBGP constructs an undirected hybrid bipartite graph G = (V, E), where V contains n_c + n vertices (n_c = Σ_{r=1}^{R} k_r). The first n_c vertices represent all the clusters in C, and the last n vertices represent the objects. The edge E_ij connecting vertices V_i and V_j is assigned the weight value S_ij defined as follows:

S_ij = S_ji = 0 if vertices i and j are both clusters or both objects;
S_ij = S_ji = P(C_j | x_i) otherwise    (12)

where P(C_j | x_i) is the probability of cluster C_j given x_i, computed as in Eq. (8). The matrix S of elements S_ij can be written as

S = [ 0    B^T
      B    0  ]

where the n × n_c matrix B is defined as follows:

B = [ P_1^1  P_1^2  ...  P_1^R
      P_2^1  P_2^2  ...  P_2^R
      ...
      P_n^1  P_n^2  ...  P_n^R ]    (13)

where P_i^r is the row vector defined in Eq. (9). A graph partitioning algorithm (METIS in our experiments) is then used to partition this graph into k hybrid parts (each containing objects and clusters), so that the edge weight-cut is minimized. The partition of the objects provides the final clustering result. The algorithm is given in Algorithm 3.

Algorithm 3 Weighted-Object Hybrid Bipartite Graph Partitioning Algorithm (WOHBGP)
Input: Data set X; set of base clusterings C; graph partitioning algorithm METIS; number of clusters k.
Output: The final consensus clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Construct the hybrid bipartite graph G = (V, E) as described in this subsection.
3: C_hybrid = METIS(G, k). // Partition the graph into k parts.
4: return the partition of the objects in C_hybrid as C*.

D. Weighted-Object k-means

To verify the effect of moving the center of each cluster according to Eq. (6), we introduce here a modified version of
Algorithm 4 Weighted-Object k-means (WOKmeans)
Input: Data set X; set of base clusterings C; number of clusters k.
Output: The final clustering C*.
1: Assign weights to objects using the one-shot weight assignment.
2: Randomly generate k weighted cluster centers c_1, ..., c_k, each corresponding to a cluster C_k = ∅ (k = 1, ..., k).
3: repeat
4:   // Assign each object to its closest weighted center:
     ∀i, C_k = C_k ∪ {x_i}, where k = arg min_k' ||x_i − c_k'||.
5:   // Generate the new weighted centers:
     c_k = (Σ_{x_i ∈ C_k} w_i x_i) / (Σ_{x_i ∈ C_k} w_i), k = 1, ..., k    (14)
6: until no weighted center changes, or the maximum number of iterations is reached
7: return C* = {C_1, C_2, ..., C_k}.

k-means that makes use of such weighted means. The resulting technique is called Weighted-Object k-means (WOKmeans), and it is illustrated in Algorithm 4. WOKmeans computes the weighted centers after assigning objects to clusters in each iteration, as shown in Eq. (14). The main idea is to iteratively move cluster centers towards regions with larger weights. Compared to the weighted version of k-means in [9], WOKmeans computes all object weights in advance, using the co-association matrix provided by the base clusterings, while [9] assigns weights to objects iteratively.

IV. EMPIRICAL EVALUATION

A. Data Sets

We conducted experiments on one simulated data set and eight UCI data sets to compare the performance of the WOEC methods against existing state-of-the-art techniques. A description of the data sets used is given in Table I. The simulated data is discussed in Section V-A. The Satimage data set originally contained 6435 objects; we randomly selected 4200 images, equally distributed among the six classes, for our experiments. The Spambase data set had 4601 objects; we randomly sampled 500 objects (250 for each class). The other six data sets were used in their original form. For each data set, each feature is preprocessed to have zero mean and unit variance.

TABLE I. DATA SETS USED IN THE EXPERIMENTS (#Objects, #Dimensions, and #Classes for: Simulated data, AustralianCredit, BalanceScale, BreastTissue, Glass, HayesRoth, Iris, Satimage, and Spambase)

B. Evaluation Measures

Various metrics exist to evaluate clustering results [24], [25]. Since the labels of the data are known, we use the Rand Index (RI) [26], the Adjusted Rand Index (ARI) [26], and the Normalized Mutual Information (NMI) [2] as validity indices. Let C* = {C_1*, C_2*, ..., C_k*} be the consensus clustering result (k is the number of clusters in the final consensus clustering), and let C̄ = {C̄_1, C̄_2, ..., C̄_k̄} be the ground-truth clustering (k̄ is the number of classes).

1) RI measures the similarity between C* and C̄ as follows:

RI(C*, C̄) = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)    (15)

where a is the number of pairs of points that are in the same cluster in both C* and C̄; b is the number of pairs of points that are in different clusters in both C* and C̄; c is the number of pairs of points that are in the same cluster in C* and in different clusters in C̄; and d is the number of pairs of points that are in different clusters in C* and in the same cluster in C̄. RI ranges from 0 to 1.

2) ARI defines the similarity between C* and C̄ as follows:

ARI(C*, C̄) = [Σ_{i,j} C(n_ij, 2) − S·S̄/C(n, 2)] / [½(S + S̄) − S·S̄/C(n, 2)]    (16)

where S = Σ_i C(n_i, 2), S̄ = Σ_j C(n̄_j, 2), and n_i and n̄_j denote the number of objects in clusters C_i* and C̄_j (i = 1, ..., k; j = 1, ..., k̄), respectively, while n_ij = |C_i* ∩ C̄_j|. The ARI values range from −1 to +1.

3) NMI [2] treats the clustering labels of C* and C̄ as two random variables. The mutual information of these two random variables is evaluated, and then normalized to the [0, 1] range.

A larger value of RI (ARI or NMI) means that C* and C̄ are more similar.

C. Base Clusterings

To generate diverse clusterings we used k-means, applying random sampling of the data and feature subset selection.
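This generation scheme can be sketched as follows; the tiny farthest-point-initialized Lloyd iteration below is our stand-in for any off-the-shelf k-means, and the 0.7 subsampling fraction is used for illustration:

```python
import numpy as np

def kmeans(X, k, iters=20, rng=None):
    """Tiny Lloyd-style k-means with farthest-point initialization;
    a stand-in for any off-the-shelf k-means implementation."""
    rng = np.random.default_rng(rng)
    centers = X[[rng.integers(len(X))]]              # first center: random point
    while len(centers) < k:                          # farthest-point init
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                         # nearest-center assignment
        centers = np.vstack([X[labels == j].mean(0) if (labels == j).any()
                             else centers[j] for j in range(k)])
    return labels, centers

def sampled_base_clustering(X, k, frac=0.7, rng=None):
    """One base clustering: k-means on a random `frac` object sample,
    then every object is labeled by its nearest resulting center."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), int(frac * len(X)), replace=False)
    _, centers = kmeans(X[idx], k, rng=int(rng.integers(1 << 30)))
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

# two well-separated Gaussian blobs: the recovered base clustering
# separates them regardless of which objects land in the subsample
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = sampled_base_clustering(X, k=2, rng=1)
```

The feature-subsampled half of the ensemble works the same way, except that columns rather than rows of X are sampled before clustering.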
Specifically, for the first half of the base clusterings, in each run we randomly selected 70% of the objects and performed k-means on this subset of the data. The remaining objects were assigned to the closest cluster, based on the Euclidean distances between each object and the cluster centers. For the second half of the base clusterings, in each run we selected 70% of the features and ran k-means in the corresponding subspace. When using k-means, we randomly generated the initial centers in each independent run.

V. RESULTS AND ANALYSIS

For a fair comparison, we used the METIS package [14] for all the ensemble clustering methods that need to partition a graph. The number of clusters k in the consensus clustering is set equal to the number of classes, for all data sets and all
TABLE II. RESULTS ON SIMULATED DATA (RI, ARI, and NMI for the min, max, and mean of the base clusterings, and for CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the WOEC methods)

TABLE III. COMPARISON BETWEEN k-MEANS AND WOKMEANS ON SIMULATED DATA (mean ± standard deviation of RI, ARI, and NMI)

methods. In our experiments, the value of t (see Eq. (7)) is set, for each base clustering, to the average of the squared Euclidean distances between the data and the weighted means. All the experimental results are averages over 10 independent runs. We also performed a paired t-test to assess the statistical significance of the results; the significance level was always set to 0.05. We compare the proposed WOEC methods against eight state-of-the-art ensemble clustering algorithms, namely CSPA [2], MCLA [2], HBGF [16], BCE [19], WCC [21], ECMC [22], WSPA [3], [4], and WBPA [3], [4]. (HGPA [2] was not included in our experiments since it typically performs worse than CSPA and MCLA.)

A. Simulated Data

To illustrate the effectiveness of the three proposed WOEC algorithms and of WOKmeans, we first generated a simulated data set. It consists of d = 2 input features and 2 classes, each containing 150 points. Both classes are normally distributed according to multivariate Gaussian distributions, as shown in Fig. 3. The mean vector and the covariance matrix of class 1 are (5, 10) and [10 0; 0 1], respectively. For class 2, the mean vector is (20, 10) and the covariance matrix is again [10 0; 0 1]. We set the ensemble size R = 40. RI, ARI, and NMI are used to assess the performance of all algorithms. For each of the 40 base clusterings, the data sampling technique was used: we randomly selected 70% of the points and performed k-means on this subset of the data. The remaining points were assigned to the closest cluster, based on the Euclidean distance between the points and the cluster centers. Each feature was also standardized to zero mean and unit variance.
For each independent run, we record the minimum (min), maximum (max), and mean value (of RI, ARI, and NMI) over all the base clusterings. The corresponding averages over the independent runs are also reported. The results are shown in Table II; significantly better values are given in boldface. Two of the WOEC methods achieve the best performance for all the evaluation indices (RI, ARI, and NMI). This is because the hard-to-cluster points (i.e., the points near the cluster boundaries) make the largest contribution to the center-shifting procedure, which shifts the cluster centers towards the region containing hard-to-cluster points. As such, the influence of the pseudo-outliers⁴ is reduced, thereby leading to more robust ensemble clustering results. BCE performs the worst. WOMC achieves performance comparable to the other methods, but loses to the other two WOEC variants, showing that the technique of shifting centers works better on this data set.

³In real applications, classes may be multimodal, and thus k should be larger than the number of classes. In other cases, there may be fewer clusters than classes. These scenarios are not considered in this paper; we simply set k equal to the number of classes.

Fig. 3. The simulated data: class 1 (mean vector (5, 1)) and class 2 (mean vector (2, 1)), with the mean vectors and the pseudo-outliers marked.

To further demonstrate the effectiveness of shifting centers, we conducted experiments on the simulated data to compare the performance of k-means and WOKmeans. The experimental setup is the same as before. Table III shows the results. Clearly, WOKmeans achieves a considerable improvement over k-means, and its standard deviations are also smaller. k-means is negatively affected by the pseudo-outliers. Although WOKmeans improves upon k-means, its performance is considerably worse than that of the ensemble clustering methods, demonstrating the effectiveness of ensemble techniques. The points that receive the largest weights are highlighted with squares in Fig.
3, showing that between-cluster points tend to receive large weights.

⁴As shown in Fig. 3, the circled points on the left are far from the mean of class 1, and the density around them is much lower than around the other points of class 1. These points can bias the computation of the mean vector. They are generated from the same distribution as the other points of class 1, and should be grouped in the same cluster, but behave like outliers for the purpose of this discussion. For this reason we call them pseudo-outliers. Similarly, the circled points on the right are the pseudo-outliers of class 2.

From Tables II and III, we can see that a better RI value generally corresponds to better ARI and NMI values. Therefore, due to space limitations, we only report ARI values for the rest of the experiments.

B. Evaluation of the WOEC Techniques

We conducted experiments to evaluate the effectiveness of the proposed WOEC methods on eight UCI data sets. Base clusterings were generated as explained in Section IV-C. We experimented with three ensemble sizes R: 20, 40, and 60. The results in terms of ARI are shown in Table IV. Each reported ARI value is the average over independent runs. In each row, the values in boldface are the ARI values that
are statistically significantly better than the remaining ones. We also ran k-means for baseline comparison; its average ARI is given in Table IV in parentheses, following the name of each data set. For each independent run, we record the minimum, maximum, and mean ARI over all the base clusterings; the corresponding averages are also reported in the table. We can see that the mean ARI of the base clusterings is very close to the baseline for each data set.

TABLE IV. COMPARISON AGAINST EXISTING METHODS (ARI VALUES). For each data set and ensemble size R, the columns report the min, max, and mean ARI of the base clusterings, followed by CSPA, MCLA, HBGF, BCE, WCC, ECMC, and the three WOEC methods. The average k-means baseline ARI appears in parentheses after each data set name: AustralianCredit (.4189), BalanceScale (.1336), BreastTissue (.2952), Glass (.168), HayesRoth (.61), Iris (.5784), Satimage (.699), Spambase (.1412).

From Table IV we can see that at least one of the WOEC methods always achieves the best performance on each data set, except for R = 20 and R = 40 on BreastTissue. More specifically, WOMC gives the best performance on AustralianCredit, BalanceScale, and Satimage; the other two WOEC variants give the best performance on Glass and HayesRoth, and one of them significantly outperforms all other methods on Spambase. On the Iris data, the latter two achieve comparable performance and both outperform all other methods. As expected, the ensemble clustering methods achieve superior performance with respect to the baseline on most of the problems. In particular, on the Spambase data set the ensemble techniques provide a considerable improvement; the poor performance of the baseline in this case may be due to the large dimensionality of the data. There are exceptions to this general trend. For instance, most of the ensemble clustering methods are inferior to the baseline on BalanceScale, Glass, and HayesRoth. Nevertheless, at least one of the WOEC methods performs well on these data sets too, i.e., WOMC on BalanceScale and the other WOEC variants on Glass and HayesRoth.
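The ARI used throughout is Hubert and Arabie's adjusted Rand index [26]. As a minimal reference implementation (label encodings are arbitrary, so permuted labelings of the same partition score identically):

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    # Adjusted Rand Index: pair-counting agreement between two partitions,
    # corrected for chance agreement (Hubert and Arabie, 1985)
    n = len(labels_a)
    pair_sum = lambda counts: sum(comb(c, 2) for c in counts)
    index = pair_sum(Counter(zip(labels_a, labels_b)).values())  # agreeing pairs
    sum_a = pair_sum(Counter(labels_a).values())
    sum_b = pair_sum(Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expectation under random labeling
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

A relabeled copy of a partition scores exactly 1.0 (e.g. ari([0, 0, 1, 1], [1, 1, 0, 0]) returns 1.0), while unrelated partitions score near zero or below.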
An interesting result is that the performance of the two WOEC methods other than WOMC is close to the max index on HayesRoth and Spambase, respectively, and that WOMC achieves ARI values better than the max index on Satimage. As pointed out earlier, our WOMC technique is a variation of MCLA. The two methods have comparable performance on AustralianCredit, Glass, HayesRoth, and Spambase, while WOMC outperforms MCLA on the other four data sets. This shows the benefit of considering object weights when measuring the similarity of two meta clusters in a meta clustering algorithm.

Compared against the most recent ECMC technique, the WOEC methods achieve much better results in general. The main reason is that ECMC assigns a value of one to the entries of the co-association matrix that are larger than a threshold d1, and a value of zero to the entries that are smaller than a threshold d0. This leads to information loss, in particular the loss of information regarding the hard-to-cluster objects. In addition, the setting of thresholds like d0 and d1 is always problematic; in our experiments we set d0 = 0.2 and d1 = 0.8, as suggested in [22]. ECMC also needs to choose a suitable value for a parameter C that minimizes 1ᵀM1, where M is a pairwise similarity matrix (details are in [22]), which is time consuming. In contrast, WOMC does not need any parameter selection, while the other two WOEC methods are robust to the setting of their parameter t, as the analysis in Section V-E shows.

C. Comparison against WSPA and WBPA

To compare against WSPA and WBPA, we generated the base clusterings using the LAC algorithm, as described in [3], [4]. LAC assigns weights to features, locally at each cluster. Both WSPA and WBPA require such weights, while our WOEC techniques have wider applicability. The weight vectors computed by the LAC algorithm are discarded when LAC is combined with the WOEC approaches. Results are shown in Table V. As before, statistically significant results are in boldface.
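The information-loss argument against ECMC above hinges on the co-association matrix. A small sketch (our own illustration, with d0 and d1 set as in [22]) shows how thresholding discards exactly the graded evidence about hard-to-cluster objects:

```python
def co_association(base_clusterings):
    # M[i][j]: fraction of base clusterings that place objects i and j
    # in the same cluster (the evidence-accumulation matrix)
    n = len(base_clusterings[0])
    r = len(base_clusterings)
    return [[sum(c[i] == c[j] for c in base_clusterings) / r
             for j in range(n)] for i in range(n)]

def ecmc_style_threshold(m, d0=0.2, d1=0.8):
    # quantize: confident pairs become 0 or 1; mid-range entries, which
    # carry the information about hard-to-cluster objects, are dropped
    # (None marks them here purely for illustration)
    return [[1 if v >= d1 else 0 if v <= d0 else None for v in row]
            for row in m]

# three toy base clusterings of four objects
bases = [[0, 0, 1, 1],
         [0, 0, 0, 1],
         [0, 1, 1, 1]]
m = co_association(bases)    # m[0][1] == 2/3: an ambiguous pair
t = ecmc_style_threshold(m)  # t[0][1] is None: graded evidence discarded
```

WOEC instead keeps the full graded matrix and turns this ambiguity itself into object weights.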
We can see that the WOEC methods always achieve the best performance, while WSPA and WBPA are comparable to the best only on the BreastTissue and Glass data sets.

D. Evaluation of WOKmeans

In Section V-A we discussed the superiority of WOKmeans over k-means on the simulated data. To further investigate the behavior of WOKmeans, in this section we add comparisons on three data sets: BreastTissue, Glass, and Iris. The reported results are averages over independent runs. In each run, the maximum number of iterations of both methods is set to the same fixed value, and the number of clusters is set to the number of classes given by the ground-truth labels. The ensemble sizes are again 20, 40, and 60. Note that only WOKmeans is affected by the ensemble size. Table VI gives
Fig. 4. Sensitivity analysis of parameter t on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

Fig. 5. Sensitivity analysis of ensemble size R on (a) AustralianCredit, (b) HayesRoth, and (c) Spambase.

TABLE V. COMPARISON AGAINST WSPA AND WBPA (ARI). For each data set (AustralianCredit, BalanceScale, BreastTissue, Glass, HayesRoth, Iris, Satimage, and Spambase) and each ensemble size R, the columns report WSPA, WBPA, and the three WOEC methods.

the results. We observe that WOKmeans significantly outperforms k-means on the three data sets in all cases. On the Iris data, WOKmeans not only achieves better performance, but it is also more stable, with much smaller standard deviations.

TABLE VI. EXPERIMENTAL RESULTS OF k-MEANS AND WOKMEANS (ARI). Mean ± standard deviation for BreastTissue, Glass, and Iris at each ensemble size R.

These results again empirically demonstrate the effect of Eq. (6) on the resulting clusters. As suggested by previous analyses [6], [7], [8], moving the centers towards the hard-to-cluster data can improve the clustering of those data.

E. Sensitivity Analysis of Parameter t

Here we test how sensitive the two WOEC methods that use the parameter t in Eq. (7) are to its value. We set the ensemble size R = 40 and vary t from 0.5 to 100, using the AustralianCredit, HayesRoth, and Spambase data sets. Fig. 4 shows the results. Both methods tend to be stable when t > 5 on HayesRoth and Spambase, while the performance is almost the same for all values of t on AustralianCredit. In fact, the probability of a cluster
C_k^r given an object x_i, computed as in Eq. (8), is essentially determined by the Euclidean distances between x_i and all the cluster centers in clustering C^r. Different values of t do not change the ordering of the probability values, which explains why both methods are robust to the value of t. Nevertheless, we should avoid setting t to values that are too small or too large: in these two cases, in Eq. (7), the exponential term involving d(x_i, m_k^r) approaches 0 and 1, respectively, for all k, and this affects the consensus clustering process. Therefore, the average of the squared Euclidean distances between the data and the weighted means is a reasonable default value of t for each base clustering.

F. Sensitivity Analysis of Ensemble Size R

In this section we investigate the sensitivity of the three WOEC methods to the ensemble size R. We vary R from 5 to 100, using the same three data sets: AustralianCredit, HayesRoth, and Spambase. Fig. 5 gives the results. In general, the performance of each WOEC method improves as the ensemble size increases. The WOEC methods tend to be stable when R is larger than 10, 60, and 30 on AustralianCredit, HayesRoth, and Spambase, respectively. On AustralianCredit, all three WOEC methods achieve good performance already at R = 5. The two WOEC methods other than WOMC also achieve stable performance on Spambase with a small ensemble size. Each method gives the highest ARI on a different data set.

VI. CONCLUSIONS AND FUTURE WORK

We have introduced an approach called Weighted-Object Ensemble Clustering (WOEC). Three different consensus techniques have been proposed to leverage weighted objects. The experimental results indicate that each of the three techniques leads in performance on different data sets, suggesting that the three methods are most effective under different conditions. Achieving a better understanding of such conditions is an interesting question we will investigate.
Like most existing ensemble methods, the WOEC approaches are quadratic in the number of points and linear in the dimensionality. Applying WOEC methods effectively and efficiently to large-scale, high-dimensional data is another challenging direction for future work.

VII. ACKNOWLEDGMENTS

This work is partially supported by grants from the Natural Science Foundation of China, the Fundamental Research Funds for the Central Universities of China (Project Nos. XDJK21B2, XDJK213C123), the Doctoral Fund of Southwest University (Nos. SWU1163 and SWU11334), and the China Scholarship Council (CSC).

REFERENCES

[1] J. Ghosh and A. Acharya, "Cluster ensembles," WIREs Data Mining and Knowledge Discovery, vol. 1, no. 4, 2011.
[2] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.
[3] M. Al-Razgan and C. Domeniconi, "Weighted clustering ensembles," in Proceedings of the 6th SIAM International Conference on Data Mining, 2006.
[4] C. Domeniconi and M. Al-Razgan, "Weighted cluster ensembles: Methods and analysis," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 4, pp. 17:1-40, 2009.
[5] T. Li, C. Ding, and M. I. Jordan, "Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization," in Proceedings of the 7th IEEE International Conference on Data Mining, 2007.
[6] G. Hamerly and C. Elkan, "Alternatives to the k-means algorithm that find better clusterings," in Proceedings of the 11th International Conference on Information and Knowledge Management, 2002.
[7] A. Topchy, B. Minaei-Bidgoli, A. K. Jain, and W. F. Punch, "Adaptive clustering ensembles," in Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[8] B. Zhang, M. Hsu, and U. Dayal, "K-harmonic means: a spatial clustering algorithm with boosting," in Temporal, Spatial, and Spatio-Temporal Data Mining, 2001.
[9] R. Nock and F. Nielsen, "On weighting clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, 2006.
[10] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[11] Z. Zhou, Ensemble Methods: Foundations and Algorithms. Chapman & Hall, 2012.
[12] A. Fred and A. K. Jain, "Data clustering using evidence accumulation," in Proceedings of the 16th International Conference on Pattern Recognition, 2002.
[13] A. Fred and A. K. Jain, "Evidence accumulation clustering based on the k-means algorithm," in Proceedings of the Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition, 2002.
[14] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359-392, 1998.
[15] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proceedings of the Design Automation Conference, 1997.
[16] X. Z. Fern and C. E. Brodley, "Solving cluster ensemble problems by bipartite graph partitioning," in Proceedings of the 21st International Conference on Machine Learning, 2004.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, 2001.
[18] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[19] H. Wang, H. Shan, and A. Banerjee, "Bayesian cluster ensembles," in Proceedings of the 9th SIAM International Conference on Data Mining, 2009.
[20] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, "Subspace clustering of high dimensional data," in Proceedings of the SIAM International Conference on Data Mining, 2004.
[21] T. Li and C. Ding, "Weighted consensus clustering," in Proceedings of the 8th SIAM International Conference on Data Mining, 2008.
[22] J. Yi, T. Yang, R. Jin, A. K. Jain, and M. Mahdavi, "Robust ensemble clustering by matrix completion," in Proceedings of the 12th IEEE International Conference on Data Mining, 2012.
[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[24] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recognition, vol. 46, no. 1, pp. 243-256, 2013.
[25] C. Legány, S. Juhász, and A. Babos, "Cluster validity measurement techniques," in Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2006.
[26] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationInternational Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani
LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models
More informationUnsupervised Learning
Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationA Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2
A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open
More informationIdentifying Layout Classes for Mathematical Symbols Using Layout Context
Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationJournal of Statistical Software
JSS Journal of Statistical Software August 2010, Volume 36, Issue 9. http://www.jstatsoft.org/ LinkCluE: A MATLAB Package for Link-Based Cluster Ensembles Natthakan Iam-on Aberystwyth University Simon
More informationA Memetic Heuristic for the Co-clustering Problem
A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu
More informationAutomatic Group-Outlier Detection
Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationSupplementary Figure 1. Decoding results broken down for different ROIs
Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas
More informationExplore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan
Explore Co-clustering on Job Applications Qingyun Wan SUNet ID:qywan 1 Introduction In the job marketplace, the supply side represents the job postings posted by job posters and the demand side presents
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationImage Segmentation for Image Object Extraction
Image Segmentation for Image Object Extraction Rohit Kamble, Keshav Kaul # Computer Department, Vishwakarma Institute of Information Technology, Pune kamble.rohit@hotmail.com, kaul.keshav@gmail.com ABSTRACT
More informationLogical Templates for Feature Extraction in Fingerprint Images
Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:
More informationModel-based Clustering With Probabilistic Constraints
To appear in SIAM data mining Model-based Clustering With Probabilistic Constraints Martin H. C. Law Alexander Topchy Anil K. Jain Abstract The problem of clustering with constraints is receiving increasing
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationOn Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers
On Error Correlation and Accuracy of Nearest Neighbor Ensemble Classifiers Carlotta Domeniconi and Bojun Yan Information and Software Engineering Department George Mason University carlotta@ise.gmu.edu
More informationAn Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification
An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University
More informationLearning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li
Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationIncluding the Size of Regions in Image Segmentation by Region Based Graph
International Journal of Emerging Engineering Research and Technology Volume 3, Issue 4, April 2015, PP 81-85 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Including the Size of Regions in Image Segmentation
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationFast Fuzzy Clustering of Infrared Images. 2. brfcm
Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.
More informationLecture 10: Semantic Segmentation and Clustering
Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305
More informationCost-sensitive Boosting for Concept Drift
Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems
More informationCHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM
96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays
More informationMotivation. Technical Background
Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering
More informationCOMP 465: Data Mining Still More on Clustering
3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following
More informationColour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation
ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology
More informationGraph-based High Level Motion Segmentation using Normalized Cuts
Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies
More informationAN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION
AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More information