A Survey on Consensus Clustering Techniques

Anup K. Chalamalla, School of Computer Science, University of Waterloo

Abstract
Consensus clustering is an important elaboration of the classical clustering problem, in which multiple clusterings of a dataset are consolidated into a single clustering. The different clusterings may be obtained from different runs of the same algorithm or from different algorithms. Formally, given r clusterings of a dataset, λ_1, ..., λ_r, the objective is to produce a single clustering λ̂ that agrees as much as possible with the r input clusterings. In this survey, we describe and classify various approaches to the problem of consensus clustering. We discuss different formulations of the problem, consensus functions and efficient algorithms to compute them, and specific applications addressed in the literature.

I. INTRODUCTION
Clustering, an important task in data analysis with applications in data mining, image analysis, bioinformatics, and pattern recognition, is the assignment of a set of objects into groups (called clusters) so that objects in the same cluster are similar, while objects in different clusters are dissimilar. This task assumes that there is some well-defined distance measure, which determines how the similarity of two objects is calculated. There is also a quality measure that captures intra-cluster similarity and inter-cluster dissimilarity. The primary goal of a clustering algorithm is to optimize this quality measure. There are many approaches to improving the quality of clustering, of which consensus clustering is an important one. Consensus clustering combines multiple clusterings of a dataset into a single clustering that is better in some sense than the input clusterings. Consensus clustering is known by many different names, such as cluster ensembles, cluster aggregation, and clustering combination, in different areas of research: machine learning [1], [2], pattern recognition [3], bioinformatics [5], and data mining [6]. We next discuss the motivation and application areas of consensus clustering.

A. Motivation and Applications
Several types of clustering techniques exist in the literature, such as iterative refinement approaches, e.g., SOM and K-Means [4], hierarchical clustering [7], and subspace clustering [8], which have been effective to some extent in several applications. However, each has shortcomings, such as high time complexity for large numbers of dimensions and data objects, fuzziness in the distance measure, the number of clusters not being known a priori, sensitivity to initial settings (e.g., K-Means), getting stuck in local optima, and the lack of robust techniques to validate clustering results. Consensus clustering tries to address many of these shortcomings by using a consensus function to combine multiple clusterings. We discuss some of the major application areas here.

Improve Quality and Robustness. Iterative refinement algorithms such as K-Means and the EM algorithm are sensitive to the choice of the initial seed clusters. Hence, running K-Means with different seeds may yield very different clusterings of the same data objects. It has been observed that multiple weak clusterings can be combined into a stronger one by computing a consensus among the clusterings produced by multiple runs of an algorithm, e.g., K-Means seeded with different initial centers [9], [10].
Similarly, clusterings generated by different algorithms, such as density-based, K-Means, fuzzy c-means, and graph-partitioning-based methods, can be aggregated to obtain gains in clustering quality.

Distributed and Privacy-Preserving Clustering. Applications nowadays require processing of massive data, and hence data is often distributed, e.g., a large customer database partitioned vertically and stored in different geographic locations (column-distributed). Different clusterings of the same data are generated on different sets of attributes, and there is a need to combine them to obtain a clustering that agrees with all of them. Consensus clustering can also be employed in privacy-preserving scenarios where the distributed computing parties can share only certain amounts of higher-level information, such as cluster labels or a limited number of observed features of each object. For example, in gene function prediction, separate gene clusterings can be obtained from diverse sources such as gene sequence comparisons and combinations of DNA microarray data from many independent experiments. Each clustering hence shares only specific aspects of the data, and the goal is to integrate them to obtain a unified clustering.

Identifying the correct number of clusters. Automatic identification of the appropriate number of clusters is an important research problem [11], [12]. Previous approaches impose a hard constraint on the quality or the distance measure in order to determine the number of clusters; for example, in agglomerative algorithms one can impose a bound on the distance beyond which no pair of clusters will be merged. Some of the approaches we discuss provide ways to automatically select the number of clusters. The clusterings input to the consensus function can have different numbers of clusters, and the consensus function itself can determine yet another number of clusters based on the agreement between the objects in the input clusterings. For example, if many input clusterings place two objects in the same cluster, then a good consensus function will not split these two objects.

Handling Missing Information. Data comes in many different forms: it may include categorical attributes, attributes with incomparable values, constantly changing values, or missing attribute values. There are also legacy clusterings that were often provided by human experts, where cluster labels are available for old data while no results are available for new data. These situations can lead to missing or incorrect cluster labels for objects in certain clusterings. Consensus clustering provides a framework to account for missing labels and missing values in data objects.

B. Challenges
We summarize the key challenges raised by the problem of consensus clustering as follows: (i) to explore the space of possible consensus clusterings efficiently in order to determine the best consensus clustering; and (ii) to model the similarities in input clusterings and design effective consensus functions accordingly.

The remainder of this survey is organized as follows. In Section 2, we classify and discuss various formulations of consensus clustering approaches, along with the consensus functions and methods to compute them. In Section 3 we compare the different approaches, analyzing their complexity, accuracy, strengths, and drawbacks. Section 4 briefly discusses the generation of input clusterings, Section 5 discusses open problems and potential application areas, and we conclude in Section 6.

II. APPROACHES TO CONSENSUS CLUSTERING
In this section we classify the various formulations of consensus clustering and consensus functions. First, we adopt a notation that more or less captures the ideas of all the formulations. Let χ = {x_1, x_2, ..., x_n} denote a set of n objects. A partitioning of these n objects into k clusters can be represented as a set of k sets of objects {C_j | j = 1, ..., k} or as a label vector λ_q ∈ N^n. For each x_i, we use C_q(x_i) to denote the label of the cluster to which object x_i belongs, i.e., C_q(x_i) = j if and only if x_i ∈ C_{qj}. A clusterer Φ_q is a clustering algorithm that generates the label vector λ_q given χ. Let k_q be the number of clusters in λ_q. A set of r labelings Λ = {λ_q | q = 1, ..., r} is combined into a single labeling λ̂ using a consensus function Γ. The general architecture of consensus clustering is shown in Fig. 1 (Consensus Clustering).

As an example, consider four input clusterings: λ_1 = (1, 1, 1, 2, 2, 3, 3), λ_2 = (2, 2, 2, 3, 3, 1, 1), λ_3 = (1, 1, 2, 2, 3, 3, 3), and λ_4 = (1, 2, ?, 1, 2, ?, ?). An inspection suggests that a reasonable consensus clustering is (2, 2, 2, 3, 3, 1, 1). Here, λ_4 has missing labels. Each clustering follows a different labeling scheme; for example, λ_1 and λ_2 describe the same partition with different labelings. For a labeling with k distinct clusters there are k! equivalent representations as integer label vectors. Hence, a common assumption is that a labeling follows two rules: (i) C(x_1) = 1; (ii) for all i = 1, ..., n-1: C(x_{i+1}) ≤ max_{j=1,...,i} C(x_j) + 1. Any labeling λ_q can thus be transformed into an equivalent labeling under a uniform scheme for all clusterings.

A. Graph-Partitioning Approaches
Problem of Graph Partitioning. Given a weighted graph G, the goal is to partition it into k disjoint clusters of vertices. Unless the graph has k or more connected components, any k-way partition will cut some of the graph edges. The sum of the weights of these cut edges is defined as the cut of a partition P:

Cut(P, W) = \sum_{(i,j):\, i, j \text{ not in the same cluster}} W(i, j).

The goal of a graph-partitioning algorithm is to minimize the cut of the k-way partition.
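To make the cut definition concrete, here is a small sketch (our own illustration, not code from any of the surveyed papers; the function name partition_cut is hypothetical) that computes Cut(P, W) for a toy weighted graph:

```python
# Minimal sketch of the cut of a k-way partition, assuming W is a symmetric
# numpy array of edge weights and labels[i] gives the cluster of vertex i.
import numpy as np

def partition_cut(W: np.ndarray, labels: np.ndarray) -> float:
    """Sum of weights of edges whose endpoints lie in different clusters."""
    n = len(labels)
    cut = 0.0
    for i in range(n):
        for j in range(i + 1, n):        # each undirected edge counted once
            if labels[i] != labels[j]:
                cut += W[i, j]
    return cut

# Example: a 4-vertex graph split into two clusters {0, 1} and {2, 3}.
W = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
print(partition_cut(W, np.array([0, 0, 1, 1])))  # -> 2.0 (crossing edges 0-2 and 1-3)
```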
Two types of graph-partitioning techniques for consensus clustering are discussed in the literature: 1) instance-based graph partitioning (IBGF), in which a similarity measure is induced based on the number of clusterings that cluster pairs of objects together, and 2) cluster-correspondence-based graph partitioning (CBGF), in which a measure is induced based on the similarity between clusters of two different clusterings. In this section we discuss approaches that adopt one of these techniques, as well as a hybrid approach.

1) Objective Functions: The consensus functions discussed later try to optimize an objective function. The objective functions capture the similarity between the input clusterings at the instance level or at the cluster level.

Mutual Information. The mutual information metric is proposed with the cluster ensemble framework by Strehl et al. in [1]. Mutual information, a measure of the statistical information shared between two distributions, is used to quantify the similarity between two clusterings [13]. Let X and Y be the random variables described by the cluster labelings λ_a and λ_b, with k_a and k_b clusters respectively. Let I(X, Y) denote the mutual information between X and Y, and H(X) the entropy of X. The normalized mutual information is NMI(X, Y) = I(X, Y) / \sqrt{H(X) H(Y)}. Let n_h^{(a)} be the number of objects in cluster C_h according to λ_a, and n_l^{(b)} the number of objects in cluster C_l according to λ_b. Let n_{h,l} denote the number of objects that are in C_h according to λ_a as well as in C_l according to λ_b. Then, after substituting for I and H, the normalized mutual information estimate Φ^{(NMI)} is given by:

Φ^{(NMI)}(λ_a, λ_b) = \frac{\sum_{h=1}^{k_a} \sum_{l=1}^{k_b} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(a)} n_l^{(b)}}\right)}{\sqrt{\left(\sum_{h=1}^{k_a} n_h^{(a)} \log\frac{n_h^{(a)}}{n}\right)\left(\sum_{l=1}^{k_b} n_l^{(b)} \log\frac{n_l^{(b)}}{n}\right)}}    (1)
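The estimate in Eq. (1) can be computed directly from the contingency counts n_{h,l}. The sketch below is our own illustration (the function name nmi is hypothetical), assuming integer label vectors starting at 1 with no missing entries:

```python
# Minimal sketch of the NMI estimate of Eq. (1) between two label vectors.
import numpy as np

def nmi(lab_a: np.ndarray, lab_b: np.ndarray) -> float:
    n = len(lab_a)
    ka, kb = lab_a.max(), lab_b.max()
    # contingency table: counts[h, l] = |{i : lab_a[i] = h+1 and lab_b[i] = l+1}|
    counts = np.zeros((ka, kb))
    for a, b in zip(lab_a, lab_b):
        counts[a - 1, b - 1] += 1
    na = counts.sum(axis=1)     # cluster sizes in lab_a
    nb = counts.sum(axis=0)     # cluster sizes in lab_b
    nz = counts > 0
    numer = (counts[nz] * np.log(n * counts[nz] / np.outer(na, nb)[nz])).sum()
    denom = np.sqrt((na * np.log(na / n)).sum() * (nb * np.log(nb / n)).sum())
    return numer / denom

lab1 = np.array([1, 1, 1, 2, 2, 3, 3])
lab2 = np.array([2, 2, 2, 3, 3, 1, 1])
print(nmi(lab1, lab2))   # identical partitions up to relabeling -> 1.0
```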

Based on this pairwise measure of mutual information, the optimal combined clustering λ̂^{(k-opt)} is the one that has maximal mutual information with all individual labelings λ_q in Λ, given k, the number of clusters in the consensus clustering. In other words,

λ̂^{(k-opt)} = \arg\max_{λ̂} \sum_{q=1}^{r} Φ^{(NMI)}(λ̂, λ_q),    (2)

where λ̂ ranges over all possible k-partitions. The authors show that this optimization problem is hard, and a naive exhaustive search is infeasible. Even greedy approaches have drawbacks, such as a strong dependence on the initial settings and convergence to poor local optima. The authors propose three techniques based on graph partitioning, CSPA, HGPA, and MCLA, which we discuss later.

Disagreement Metric. The disagreement metric is defined in the cluster aggregation framework of [14]. As in the cluster ensemble framework, cluster aggregation defines a distance measure between two clusterings. Let d_{x_i, x_j}(λ_1, λ_2) denote a boolean function whose value is 0 if λ_1 and λ_2 both put x_i and x_j in the same cluster, and 1 otherwise. This function measures the disagreement between two clusterings on a pair of data objects. The distance between two clusterings is defined as in Eq. (3). The problem of clustering aggregation is then, given a set of clusterings Λ, to compute a new clustering λ̂ that minimizes the total number of disagreements with Λ, given by the sum \sum_{q=1}^{r} d_χ(λ_q, λ̂). The distance function of Eq. (3) satisfies a number of properties, including the triangle inequality.

d_χ(λ_1, λ_2) = \sum_{(x_i, x_j) ∈ χ × χ} d_{x_i, x_j}(λ_1, λ_2)    (3)

2) Graph Partitioning Techniques:

Cluster-based Similarity Partitioning Algorithm (CSPA). An n × n boolean similarity matrix over the n objects is built for each clustering λ_q, where an entry of 1 indicates that two objects are in the same cluster and 0 indicates otherwise. A cumulative similarity matrix S is obtained from the r boolean matrices, where each entry is the fraction of clusterings in which two objects are clustered together. A similarity graph whose edge weights correspond to the entries of S is induced from this matrix, and METIS [15] is used to partition the graph and obtain a consensus clustering of the objects.

HyperGraph-Partitioning Algorithm (HGPA). The set of input clusterings Λ is transformed into a hypergraph, in which the vertices are the objects to be clustered and a hyperedge connects the set of objects belonging to the same cluster. The problem of consensus clustering is then reduced to finding a minimal cut of the hypergraph. Standard hypergraph-partitioning algorithms (e.g., HMETIS), with a constraint to keep partition sizes comparable, are used to obtain the consensus clustering.

Meta-Clustering Algorithm (MCLA). In this approach, several hyperedges, each representing a cluster, are grouped together and collapsed into a single hyperedge. If the number of hyperedges in the hypergraph is \sum_{j=1}^{r} k_j, then k collapsed hyperedges are generated, using NMI as the similarity measure between clusters to decide which hyperedges to combine.

Hybrid Bipartite Graph Formulation (HBGF). HBGF, proposed by Fern et al. [16], takes a hybrid approach by combining the ideas of instance-based graph partitioning (IBGF) and cluster-based graph partitioning (CBGF).
In this approach, the authors formulate the cluster ensemble problem as partitioning a weighted bipartite graph, where the two sets of vertices correspond to 1) V_C, the set of all clusters in all input clusterings, and 2) V_I, the set of n data objects. If vertices i and j are both clusters or both objects, then W(i, j) = 0; if object i belongs to cluster j, then W(i, j) = W(j, i) = 1, and 0 otherwise. To illustrate the benefit of a hybrid approach, consider two pairs of instances (A, B) and (C, D), and assume that A and B are never clustered together in the ensemble, and the same is true for the pair (C, D). However, A and B are each frequently clustered together with the same group of instances, i.e., A and B are frequently assigned to two different clusters that are similar to each other, whereas this is not true for C and D. Intuitively, we consider A and B to be more similar to one another than C and D. IBGF, however, fails to differentiate these two cases and assigns both similarities to be zero, because it ignores the information about the similarity of clusters while computing the similarity of instances. Similarly, CBGF has its own drawbacks. The hybrid approach integrates the similarity between instances and the similarity between clusters simultaneously. It uses two graph-partitioning techniques, spectral graph partitioning [17] and METIS, to compute the consensus clustering. There are other graph-partitioning approaches and objective functions that are slight variations of the ones discussed above; we leave their details to a possible extended version of this survey.

B. Probabilistic Approaches
In this section we discuss probabilistic approaches to consensus clustering. As opposed to the graph-partitioning approaches, in probabilistic approaches the objective functions are tightly coupled with the consensus functions that optimize them.

1) Bayesian Cluster Ensembles: Bayesian Cluster Ensembles (BCE), proposed in [18], takes a Bayesian approach to consensus clustering. It treats the input clustering results for each object as a feature vector with discrete feature values and learns a mixed-membership model from this representation. Figures 2(a) and 2(b) show B, the matrix representation of the cluster assignments of objects by the different input clusterings. The distance-based approaches process the clusterings column-wise (Fig. 2(a)), whereas BCE processes them row-wise (Fig. 2(b)). The consensus clustering problem then becomes finding a clustering λ̂ of the objects {x_1, ..., x_n} whose feature vectors are the rows of B.
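As a concrete illustration (ours, not taken from [18]), the matrix B can be assembled from the example label vectors of Section II, with missing labels kept as gaps:

```python
# Minimal sketch of the base clustering matrix B (objects x input clusterings),
# using the example labelings from Section II; a missing label is stored as NaN.
import numpy as np

labelings = [
    [1, 1, 1, 2, 2, 3, 3],              # lambda_1
    [2, 2, 2, 3, 3, 1, 1],              # lambda_2
    [1, 1, 2, 2, 3, 3, 3],              # lambda_3
    [1, 2, None, 1, 2, None, None],     # lambda_4 (missing labels)
]

n, r = len(labelings[0]), len(labelings)
B = np.full((n, r), np.nan)
for j, lam in enumerate(labelings):
    for i, label in enumerate(lam):
        if label is not None:
            B[i, j] = label

print(B)    # row i is the discrete feature vector x_i = (x_i1, ..., x_ir)
```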

Fig. 2. Matrix Representation: (a) column-wise view of B used by distance-based approaches; (b) row-wise view used by BCE.

BCE is defined as a mixture model that generates the matrix B. Assuming that there are K consensus clusters, each object x_i's cluster ids are drawn from a finite mixture model θ_i over the K clusters, and θ_i is sampled from a Dirichlet distribution with parameter α. Further, the latent variable corresponding to each consensus cluster id h follows a discrete distribution β_{hj}, one for each input clustering λ_j, over its cluster ids {1, ..., k_j}. Hence, if an object x_i belongs to consensus cluster h for λ_j, its cluster id x_{ij} = s ∈ [1, k_j] is determined by the distribution β_{hj}(s) = p(x_{ij} = s | h), where β_{hj}(s) ≥ 0 and \sum_{s=1}^{k_j} β_{hj}(s) = 1. Let z_{ij} be the latent variable denoting that object x_i belongs to consensus cluster h for λ_j. Given the model parameters α and β = {β_{hj}}, h ∈ [1, K], j ∈ [1, r], the joint probability distribution over the latent variables θ_i, z_i and the observed values {x_{ij}}, i ∈ [1, n], j ∈ [1, r], is given by:

p(x_i, θ_i, z_i | α, β) = p(θ_i | α) \prod_{j=1, ∃x_{ij}}^{r} p(z_{ij} = h | θ_i) p(x_{ij} | β_{hj}),    (4)

where ∃x_{ij} denotes that the j-th input clustering provides a result for x_i (there may be no label for x_i in some clusterings). Given the observed matrix B, the goal is to estimate the mixed membership θ_i, i ∈ [1, n], of each object over the consensus clusters. The model parameters α and β are unknown, so they must be estimated such that the likelihood of observing B is maximized. Typically, the EM algorithm can be used, alternating between calculating the posterior over the latent variables, p(θ_i, z_i | x_i, α, β), and updating the parameters until convergence. However, computing the posterior in closed form turns out to be intractable; hence, the authors employ two known techniques, variational inference and Gibbs sampling, to approximate the posterior distribution. Following standard variational inference [19], the posterior distribution is approximated by a family of variational distributions to compute a lower bound L on the log-likelihood log p(x_i | α, β). The variational distributions are obtained by introducing new variational parameters φ and γ and choosing an approximating distribution for the x_i. A variational EM algorithm is used to maximize the lower bound: the algorithm starts with some initialization of the parameters α, β and, in the E-step, finds the variational parameters that maximize L; the M-step uses the computed variational parameters to maximize L over α, β and obtain new estimates. These two steps repeat until convergence. The paper also proposes specialized EM algorithms for row-distributed and column-distributed cluster ensembles, for which we refer the readers to the original paper, as well as Gibbs sampling as an alternative way to compute the posterior distribution, assuming a Dirichlet prior over β.

2) Mixture Model for Consensus Clustering: As in BCE, the mixture model for consensus clustering of [20] views the cluster labels of an object according to the different input clusterings as a set of new features associated with the object. Let x_{ij} = λ_j(x_i) be the cluster label assigned by the j-th clustering to data object x_i; then x_i follows a finite parametric mixture model (Eq. (5)) with components corresponding to the K consensus clusters. The data {x_i} are generated by first drawing a component according to the probability mass function α_m, and then sampling a point from the distribution p_m(x | θ_m).
Given the data x = {x_i}_{i=1}^{n}, in which each x_i is assumed to be independent and identically distributed, the mixture density and the log-likelihood over the parameters Θ = {α_1, ..., α_K, θ_1, ..., θ_K} are given below; the goal is to find the parameters that maximize the likelihood.

p(x_i | Θ) = \sum_{m=1}^{K} α_m p_m(x_i | θ_m)    (5)

log L(Θ | x) = \log \prod_{i=1}^{n} p(x_i | Θ)    (6)

             = \sum_{i=1}^{n} \log \sum_{m=1}^{K} α_m p_m(x_i | θ_m)    (7)

As in BCE, the maximum-likelihood problem cannot be solved in closed form when all the parameters are unknown. Hence, the EM algorithm is applied, after making a conditional independence assumption that simplifies p_m(x_i | θ_m) to \prod_{j=1}^{r} p_m^j(x_{ij} | θ_m^j), where each p_m^j(x | θ_m^j) is a multinomial over k_j outcomes and k_j is the number of clusters in λ_j. With each x_i, a hidden variable z_i = {z_{i1}, ..., z_{iK}} is introduced, such that z_{im} = 1 if x_i belongs to the m-th component and z_{im} = 0 otherwise. The EM algorithm starts with an initial guess for the parameters in Θ; the E-step computes the expected values of the hidden variables E[z_{im}], and the M-step maximizes the likelihood by computing new estimates of the parameters. The convergence criterion is the improvement in likelihood between two M-steps, and the consensus clustering solution is obtained from the expected values E[z_{im}]: once convergence is achieved, an object x_i is assigned to the component with the largest value in z_i.
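The sketch below is our own illustration of this EM procedure on a label matrix B, not the implementation of [20]; it models each input clustering with one multinomial per consensus component, and all names are hypothetical:

```python
# Minimal EM sketch for the multinomial mixture of labelings of Section II.B.2.
# B is an (n x r) integer matrix of cluster ids (1..k_j per column), no missing values.
import numpy as np

def em_consensus(B: np.ndarray, K: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    n, r = B.shape
    kj = B.max(axis=0)                        # clusters per input clustering
    alpha = np.full(K, 1.0 / K)               # mixing weights
    # theta[m][j][s] = P(label s+1 in clustering j | consensus component m)
    theta = [[rng.dirichlet(np.ones(kj[j])) for j in range(r)] for m in range(K)]
    for _ in range(n_iter):
        # E-step: responsibilities E[z_im] proportional to alpha_m * prod_j theta[m][j][B_ij]
        logp = np.zeros((n, K))
        for m in range(K):
            for j in range(r):
                logp[:, m] += np.log(theta[m][j][B[:, j] - 1] + 1e-12)
        logp += np.log(alpha + 1e-12)
        logp -= logp.max(axis=1, keepdims=True)
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha and the multinomial parameters
        alpha = z.mean(axis=0)
        for m in range(K):
            for j in range(r):
                counts = np.array([z[B[:, j] == s + 1, m].sum() for s in range(kj[j])])
                theta[m][j] = counts / max(counts.sum(), 1e-12)
    return z.argmax(axis=1)                   # hard consensus assignment per object

B = np.array([[1, 2, 1], [1, 2, 1], [1, 2, 2], [2, 3, 2],
              [2, 3, 3], [3, 1, 3], [3, 1, 3]])
print(em_consensus(B, K=3))                   # consensus component of each object (0..K-1)
```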

Next, we discuss the non-parametric Bayesian cluster ensemble approach.

3) Non-parametric Bayesian Cluster Ensembles: Non-parametric Bayesian Cluster Ensembles (NBCE) [21] is similar in spirit to BCE, except that it uses a Dirichlet process mixture model to generate the data. We again have the clustering matrix B, where the row vector x_i = {x_{ij} | j ∈ [1, r]} is the feature-vector representation of the i-th data object. The x_i are generated using a Dirichlet process mixture model with concentration parameter α_0 and base measure G_0, via a truncated stick-breaking (TSB) construction. The TSB construction stops at level K. Let an infinite sequence of random variables be defined as v_k ~ Beta(1, α_0), and let π = {π_k | k = 1, 2, ...}, where π_k = v_k \prod_{j=1}^{k-1} (1 - v_j), be the mixing proportions of the infinite number of components. The TSB truncates after K iterations by setting v_K = 1, which automatically makes π_k = 0 for k > K. Let the probability of λ_j generating cluster id x_{ij} = k_j for x_i be θ_{ijk_j}, where \sum_{k_j=1}^{K_j} θ_{ijk_j} = 1, and let x_i = {x_{ij} = k_j | j ∈ [1, r]}, θ_{ij} = {θ_{ijk_j} | k_j ∈ [1, K_j]}, and θ_i = {θ_{ij} | j ∈ [1, r]}. Then x_i is generated with probability \prod_{j=1}^{r} θ_{ijk_j}. Since the truncation level is K, there are K distinct θ_i, denoted θ*_k, k ∈ {1, ..., K}, each sampled from G_0. Hence, in addition to π_k, an indicator variable z_i is associated with each object x_i to indicate which θ*_k is assigned to x_i. A consensus cluster is defined as a cluster of objects associated with the same θ*_k. Further, the algorithm assumes a Dirichlet prior π ~ Dir(α_0/K, ..., α_0/K). The goal is to compute the components of the distribution P(X, Z, π, θ | α_0, G_0), where X = {x_i | i ∈ [1, n]} and Z = {z_i | i ∈ [1, n]}. The approach discussed in the paper is to apply Gibbs sampling after marginalizing out π and θ; the paper also proposes variational inference techniques similar to those of BCE.

C. Relabeling and Voting Approaches
The voting approach is the third kind of known approach to the consensus clustering problem. It first solves the label correspondence problem discussed at the beginning of this survey, and it assumes that all the input clusterings, as well as the target consensus clustering, have the same number of clusters. The idea is to choose a reference clustering among the given input clusterings and, for each other clustering, to permute the cluster labels so as to obtain the best agreement with the reference clustering. For a clustering with k labels there are k! equivalent labelings; the Hungarian algorithm yields an O(k^3) solution to the cluster relabeling problem (a small sketch of this step appears at the end of Section III-A). After relabeling, a voting procedure determines the consensus cluster id of each object [3], [22].

III. COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES
In this section we compare the various techniques by their computational complexity and by the accuracy of the consensus clusterings they generate.

A. Complexity Analysis
The complexity of the graph-partitioning techniques of Strehl et al. [1] (Section 2.1) depends on the complexity of the partitioners used, such as (H)METIS. The worst-case complexity of CSPA is O(n^2 Kr), that of HGPA is O(nKr), and that of MCLA is O(nK^2 r^2). The complexity of the HBGF partitioning technique is O(nK). The three earlier graph-partitioning techniques are based on either instance-based (IBGF) or cluster-based (CBGF) graph partitioning; HBGF leverages both ideas and hence achieves a better running time than the other approaches. As for the probabilistic approaches, the methods used are either variational Bayesian inference or sampling techniques; variational inference techniques are approximation techniques used to sidestep intractable integrals in Bayesian inference, and their efficiency is well known in the literature. The complexity of the relabeling step in the voting approaches is O(K^3).
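To illustrate the relabeling step just mentioned, here is a small sketch (our own, using SciPy's Hungarian-algorithm routine linear_sum_assignment rather than code from [3] or [22]):

```python
# Minimal sketch of relabeling a clustering against a reference clustering with
# the Hungarian algorithm, assuming both labelings use labels 1..k over the same n objects.
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(reference: np.ndarray, other: np.ndarray, k: int) -> np.ndarray:
    # agreement[a, b] = number of objects labeled a+1 in `other` and b+1 in `reference`
    agreement = np.zeros((k, k), dtype=int)
    for o, ref in zip(other, reference):
        agreement[o - 1, ref - 1] += 1
    # maximize total agreement = minimize the negated agreement matrix
    row_ind, col_ind = linear_sum_assignment(-agreement)
    mapping = {a + 1: b + 1 for a, b in zip(row_ind, col_ind)}
    return np.array([mapping[o] for o in other])

ref   = np.array([1, 1, 1, 2, 2, 3, 3])   # lambda_1 from Section II
other = np.array([2, 2, 2, 3, 3, 1, 1])   # lambda_2: same partition, different labels
print(relabel(ref, other, k=3))           # -> [1 1 1 2 2 3 3]
```

Once every input clustering has been relabeled against the reference, a simple per-object majority vote over the relabeled label vectors yields the consensus cluster id.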
B. Accuracy Analysis
1) Graph Partitioning Techniques: To compare CSPA, HGPA, and MCLA, a random number generator is used to produce r noisy labelings of a dataset, and the labelings are fed to each of the techniques. The resulting consensus labelings are evaluated by their average NMI with all the input labelings, Φ^{(ANMI)}(Λ, λ̂), and by their NMI with the original labeling of the dataset. It is observed that as the noise increases, the NMI measure for λ̂ decreases, and HGPA performs the worst among the three algorithms. All three algorithms fall into either the IBGF or the CBGF category. HBGF avoids the pitfalls of both IBGF and CBGF by considering the similarity of instances and the similarity of clusters simultaneously. The HBGF evaluation uses NMI with respect to the true cluster labels to compare algorithms based on the HBGF, CBGF, and IBGF formulations. The cluster ensembles are generated by random subsampling from the datasets, clustering each sample, and assigning the objects not in the sample to one of the clusters based on the Euclidean distance to the cluster centers. The maximum NMI value is compared for the three algorithms over 5 datasets, and HBGF performs comparably to or significantly better than IBGF and CBGF on all of them.

2) Probabilistic Approaches: BCE is evaluated over 10 datasets from the UCI machine learning repository. Micro-precision is used to evaluate the accuracy of a consensus clustering with respect to the true labels. Micro-precision is defined as MP = \frac{1}{n} \sum_{h=1}^{K} a_h, where K is the number of clusters, n is the number of objects, and a_h denotes the number of objects in consensus cluster h that are correctly assigned to the corresponding class; the corresponding class for consensus cluster h is the true class with the largest overlap with the cluster. MP satisfies 0 ≤ MP ≤ 1, with 1 indicating the best possible consensus clustering. The input clusterings are generated by running k-means 2000 times over a dataset of n objects; these runs are then divided into 100 subsets of 20 input clusterings each, yielding 100 base clustering matrices of size n × 20. The maximum and average MPs are computed from the BCE results. It is observed that the consensus clustering generated by BCE always outperforms the input clusterings in both maximum and average MP, and BCE also outperforms CSPA and the mixture model in 80% of the results.
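A sketch of the micro-precision computation just defined (our own code; micro_precision is a hypothetical name), assuming integer label vectors for the consensus clusters and the true classes:

```python
# Minimal sketch of micro-precision: for each consensus cluster, count the overlap
# with its best-matching true class, then divide the total by n.
import numpy as np

def micro_precision(consensus: np.ndarray, truth: np.ndarray) -> float:
    n = len(consensus)
    total = 0
    for h in np.unique(consensus):
        members = truth[consensus == h]      # true classes inside consensus cluster h
        total += np.bincount(members).max()  # a_h: overlap with the best-matching class
    return total / n

consensus = np.array([1, 1, 1, 2, 2, 3, 3])
truth     = np.array([1, 1, 2, 2, 2, 3, 3])
print(micro_precision(consensus, truth))     # -> 6/7 ~ 0.857
```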

For the mixture model, the results are compared against the CSPA, HGPA, and MCLA graph-partitioning algorithms over five datasets, using the mean error rate of the consensus clustering as the comparison measure. It is observed that the mixture model performs better than CSPA and HGPA for most inputs, but MCLA performs better as the number of input clusterings increases. NBCE is evaluated using F-1 and perplexity measures on test datasets whose true cluster labels are known. NBCE has a better F-1 measure than CSPA, HGPA, and MCLA, and a better perplexity measure than BCE, which in turn is better than the mixture model.

C. Strengths and Drawbacks
We summarize the strengths and drawbacks of the different kinds of approaches to consensus clustering in Table I.

IV. GENERATING INPUT CLUSTERINGS
An important problem related to consensus clustering is generating diverse input clusterings for empirical studies. One common approach is to generate multiple clusterings using the K-means algorithm with different initializations; other approaches include random sub-sampling and random projection [18].

V. DISCUSSION AND OPEN PROBLEMS
Consensus clustering is an active area of research, and a number of open problems remain to be addressed. We discuss some of them in this section.

Fluctuations in Input Clusterings. We have seen that the accuracy of many of the algorithms discussed goes down as the noise in the input clusterings increases. A few input clusterings may adversely affect the accuracy of the consensus clustering: for example, 80% of the input clusterings may agree with a particular consensus clustering while the remaining 20% lead to a vastly different one. The accuracy of the consensus clustering also depends heavily on the number of input clusterings, which can be small or very large. It is very difficult to automatically identify the input clusterings that adversely affect the final clustering; however, it is worth investigating consensus functions that can minimize the effect of such clusterings. Further, a framework could be developed that outputs a number of consensus clusterings ranked by some scoring function and lets the user choose among them. It is also worth investigating a hierarchical approach to consensus clustering: for example, instead of using the entire set of input clusterings at once, using subsets of input clusterings to produce different consensus clusterings and then computing a consensus over those. Further, as the accuracy of the Dirichlet process approach proves better than that of the mixture model and the parametric approach, it is worth investigating the application of Pitman-Yor processes to consensus clustering. The approaches discussed in this survey assume that all input clusterings are equally important; it is possible that certain clusterings matter more than others, and it is worth investigating how such a bias can be modeled.

Applications in Databases and Data Mining. Consensus clustering has many applications in databases and web mining. Problems in bioinformatics, such as clustering gene expression data, have been discussed in the literature [23]. An important application of consensus clustering is outlier detection.
Though traditional clustering techniques can be used for outlier detection, the quality and robustness of outlier detection improve with consensus clustering: multiple runs of the same or different algorithms can be used to generate multiple clusterings of the data, from which a consensus about an object can be formed to determine whether it is an outlier. In web mining, clustering search results is an important task. Web search engines such as Google let users browse documents similar to those retrieved as search results, and consensus clustering can improve the accuracy of finding similar documents. Determining the number of clusters remains a difficult problem. Though the non-parametric Bayesian cluster ensemble approach [21] provides a way to automatically determine the number of consensus clusters, it appears to depend on the number of levels up to which the stick-breaking construction is applied. It is worth investigating how traditional techniques [11], [12] can be combined with consensus functions to automatically determine the number of consensus clusters.

VI. CONCLUSION AND FUTURE WORK
Consensus clustering is an important elaboration of the classical clustering problem and has emerged as an important approach to improving the quality of clustering results. Several approaches have been proposed independently to address it; the common idea is to use a consensus function to compute a clustering that is a better fit than the input clusterings. In this survey we discussed and classified the major approaches to consensus clustering, along with its motivation and application areas. The three major kinds of approaches in the literature are graph-partitioning, probabilistic, and voting approaches. We provided formulations for the different kinds of approaches, discussed their consensus functions, analyzed their complexity and accuracy, and compared their strengths and drawbacks. The probabilistic approaches are by far the most versatile, since they address most variations of the consensus clustering problem, such as handling missing values and row-distributed and column-distributed clustering. We intend to publish an extended version of this survey discussing the algorithms, strengths and drawbacks, and application areas of each approach in greater detail.

REFERENCES
[1] A. Strehl and J. Ghosh: Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research (JMLR) 3 (2002).

TABLE I: COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES

Graph Partitioning Techniques [1], [16]
Strengths: Use an objective function to control the partition size. Handle missing cluster labels in the input clusterings. Handle column-distributed cluster ensembles automatically. [1] uses a supra-consensus function to select the best consensus function for the data. HBGF further improves accuracy by combining the similarity between instances as well as between clusters. Since the techniques use scalable algorithms such as METIS and spectral graph partitioning, they scale with the number of input clusterings and the number of clusters per clustering. The computational complexity is reasonably good, and the clusters obtained are stable and robust.
Drawbacks: There is no automatic way of detecting the number of consensus clusters; K is manually determined. It is observed that, for CSPA and HGPA, the accuracy of the resulting clustering goes down as the noise in the labelings increases. Not a very effective approach for row-distributed cluster ensembles.

Probabilistic Approaches [18], [20], [21]
Strengths: In most cases, the accuracy is better than that of the graph-partitioning algorithms. Can handle missing values in the clusterings. Can handle both row-distributed and column-distributed cluster ensembles; BCE [18] also proposes row-distributed and column-distributed EM algorithms. The number of consensus clusters is automatically determined from the observations. The techniques are scalable.
Drawbacks: The accuracy of the mixture model is not better than that of the graph-partitioning algorithms when the number of input clusterings is large. The mixture models suffer from overfitting, which can be overcome using Bayesian approaches. Sampling requires time, as sometimes a large number of input clusterings are needed to obtain reasonable accuracy.

Voting Approaches [3], [22]
Strengths: Solve the label correspondence problem.
Drawbacks: Many of the approaches make simplifying assumptions instead of actually solving it. High computational cost. Perform poorly in the presence of touching clusters (clusters without clear boundaries).

[2] X. Z. Fern and C. E. Brodley: Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach, ICML 2003.
[3] A. L. N. Fred and A. K. Jain: Data Clustering Using Evidence Accumulation, ICPR 2002.
[4] J. A. Hartigan: Clustering Algorithms, John Wiley.
[5] V. Filkov and S. Skiena: Integrating Microarray Data by Consensus Clustering, ICTAI 2003.
[6] A. P. Topchy, M. H. C. Law, A. K. Jain, and A. L. N. Fred: Analysis of Consensus Partition in Cluster Ensemble, ICDM 2004.
[7] T. Hastie, R. Tibshirani, and J. Friedman: Hierarchical Clustering, in The Elements of Statistical Learning, 2nd ed., Springer, New York.
[8] K. Kailing, H.-P. Kriegel, and P. Kroger: Density-Connected Subspace Clustering for High-Dimensional Data, SDM 2004.
[9] P. S. Bradley and U. M. Fayyad: Refining Initial Points for K-Means Clustering, ICML 1998.
[10] A. P. Topchy, A. K. Jain, and W. F. Punch: Combining Multiple Weak Clusterings, ICDM 2003.
[11] P. Smyth: Model Selection for Probabilistic Clustering using Cross-Validated Likelihood, Statistics and Computing 10(1).
[12] G. Hamerly and C. Elkan: Learning the k in k-means, NIPS 2003.
[13] T. M. Cover and J. A. Thomas: Elements of Information Theory, Wiley.
[14] A. Gionis, H. Mannila, and P. Tsaparas: Clustering Aggregation, ICDE 2005.
[15] G. Karypis and V. Kumar: Multilevel k-way Partitioning Scheme for Irregular Graphs, J. Parallel Distrib. Comput. (JPDC) 48(1), 1998.
[16] X. Z. Fern and C. E. Brodley: Solving Cluster Ensemble Problems by Bipartite Graph Partitioning, ICML 2004.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss: On Spectral Clustering: Analysis and an Algorithm, NIPS 2001.
[18] H. Wang, H. Shan, and A. Banerjee: Bayesian Cluster Ensembles, SDM 2009.
[19] T. Jaakkola and M. I. Jordan: Variational Probabilistic Inference and the QMR-DT Network, J. Artif. Intell. Res. (JAIR) 10, 1999.
[20] A. P. Topchy, A. K. Jain, and W. F. Punch: A Mixture Model for Clustering Ensembles, SDM 2004.
[21] P. Wang, C. Domeniconi, and K. B. Laskey: Nonparametric Bayesian Clustering Ensembles, ECML/PKDD 2010.
[22] S. Dudoit and J. Fridlyand: Bagging to Improve the Accuracy of a Clustering Procedure, Bioinformatics 19(9), 2003.
[23] S. Monti, P. Tamayo, J. Mesirov, and T. Golub: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.


More information

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski 10. Clustering Introduction to Bioinformatics 30.9.2008 Jarkko Salojärvi Based on lecture slides by Samuel Kaski Definition of a cluster Typically either 1. A group of mutually similar samples, or 2. A

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

A Deterministic Global Optimization Method for Variational Inference

A Deterministic Global Optimization Method for Variational Inference A Deterministic Global Optimization Method for Variational Inference Hachem Saddiki Mathematics and Statistics University of Massachusetts, Amherst saddiki@math.umass.edu Andrew C. Trapp Operations and

More information

Hierarchical Mixture Models for Nested Data Structures

Hierarchical Mixture Models for Nested Data Structures Hierarchical Mixture Models for Nested Data Structures Jeroen K. Vermunt 1 and Jay Magidson 2 1 Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, Netherlands

More information

Distance based Clustering for Categorical Data

Distance based Clustering for Categorical Data Distance based Clustering for Categorical Data Extended Abstract Dino Ienco and Rosa Meo Dipartimento di Informatica, Università di Torino Italy e-mail: {ienco, meo}@di.unito.it Abstract. Learning distances

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Multiobjective Data Clustering

Multiobjective Data Clustering To appear in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Multiobjective Data Clustering Martin H. C. Law Alexander P. Topchy Anil K. Jain Department of Computer Science

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Stephen Scott.

Stephen Scott. 1 / 33 sscott@cse.unl.edu 2 / 33 Start with a set of sequences In each column, residues are homolgous Residues occupy similar positions in 3D structure Residues diverge from a common ancestral residue

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

A Soft Clustering Algorithm Based on k-median

A Soft Clustering Algorithm Based on k-median 1 A Soft Clustering Algorithm Based on k-median Ertu grul Kartal Tabak Computer Engineering Dept. Bilkent University Ankara, Turkey 06550 Email: tabak@cs.bilkent.edu.tr Abstract The k-median problem is

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Bayesian model ensembling using meta-trained recurrent neural networks

Bayesian model ensembling using meta-trained recurrent neural networks Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Enhancing Single-Objective Projective Clustering Ensembles

Enhancing Single-Objective Projective Clustering Ensembles Enhancing Single-Objective Projective Clustering Ensembles Francesco Gullo DEIS Dept. University of Calabria 87036 Rende CS), Italy fgullo@deis.unical.it Carlotta Domeniconi Department of Computer Science

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,

More information

Randomized Algorithms for Fast Bayesian Hierarchical Clustering

Randomized Algorithms for Fast Bayesian Hierarchical Clustering Randomized Algorithms for Fast Bayesian Hierarchical Clustering Katherine A. Heller and Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College ondon, ondon, WC1N 3AR, UK {heller,zoubin}@gatsby.ucl.ac.uk

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

A Local Learning Approach for Clustering

A Local Learning Approach for Clustering A Local Learning Approach for Clustering Mingrui Wu, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics 72076 Tübingen, Germany {mingrui.wu, bernhard.schoelkopf}@tuebingen.mpg.de Abstract

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information