A Survey on Consensus Clustering Techniques

Anup K. Chalamalla, School of Computer Science, University of Waterloo

Abstract
Consensus clustering is an important elaboration of the classical clustering problem, in which multiple clusterings of a dataset are consolidated into a single clustering. The different clusterings may be obtained from different runs of the same algorithm or from different algorithms. Formally, given r clusterings of a dataset, λ_1, ..., λ_r, the objective is to produce a single clustering λ̂ that agrees as much as possible with the r input clusterings. In this survey, we describe and classify various approaches to the problem of consensus clustering. We discuss different formulations of the problem, consensus functions and efficient algorithms to compute them, and specific applications addressed in the literature.

I. INTRODUCTION
Clustering, an important task in data analysis with applications in data mining, image analysis, bioinformatics, and pattern recognition, is the assignment of a set of objects into groups (called clusters) so that objects in the same cluster are similar, while objects in different clusters are dissimilar. This task assumes that there is some well-defined distance measure, which determines how the similarity of two objects is calculated. There is also a quality measure that captures intra-cluster similarity and inter-cluster dissimilarity. The primary goal of a clustering algorithm is to optimize this quality measure. There are many approaches to improving the quality of clustering, of which consensus clustering is an important one. Consensus clustering combines multiple clusterings of a dataset into a single clustering that is better in some sense than the input clusterings. Consensus clustering is known by many different names, such as cluster ensembles, cluster aggregation, and clustering combination, in different areas of research: machine learning [1], [2], pattern recognition [3], bioinformatics [5], and data mining [6]. We next discuss the motivation and application areas of consensus clustering.

A. Motivation and Applications
Several types of clustering techniques exist in the literature, such as iterative refinement approaches, e.g., SOM and K-Means [4], hierarchical clustering [7], and subspace clustering [8], which have been effective to some extent in several applications. However, each has shortcomings, such as high time complexity for large numbers of dimensions and data objects, fuzziness in the distance measure, the number of clusters not being known a priori, sensitivity to initial settings (e.g., K-Means), getting stuck in local optima, and the lack of robust techniques to validate clustering results. Consensus clustering tries to address many of these shortcomings by using a consensus function to combine multiple clusterings. We discuss some of the major application areas here.

Improve Quality and Robustness. Iterative refinement algorithms such as K-Means and the EM algorithm are sensitive to the choice of the initial seed clusters. Hence, running K-Means with different seeds may yield very different clusterings of the same data objects. It has been observed that multiple weak clusterings can be combined into a stronger one by computing a consensus among the clusterings produced by multiple runs of an algorithm, e.g., K-Means seeded with different initial centers [9], [10].
Similarly, clusterings generated by different algorithms, such as density-based, K-Means, fuzzy c-means, and graph-partitioning-based methods, can be aggregated to obtain gains in clustering quality.

Distributed and Privacy-Preserving Clustering. Applications nowadays require processing of massive data, and hence data is often distributed, e.g., a large customer database partitioned vertically and stored in different geographic locations (column-distributed). Different clusterings of the same data are generated on different sets of attributes, and there is a need to combine them to obtain a clustering that agrees with all of them. Consensus clustering can also be employed in privacy-preserving scenarios where the distributed computing parties can share only certain amounts of higher-level information, such as cluster labels or a limited number of observed features of each object. For example, in gene function prediction, separate gene clusterings can be obtained from diverse sources such as gene sequence comparisons and combinations of DNA microarray data from many independent experiments. Each clustering hence shares only specific aspects of the data, and the goal is to integrate them to obtain a unified clustering.

Identifying the correct number of clusters. Automatic identification of the appropriate number of clusters is an important research problem [11], [12]. Previous approaches impose a hard constraint on the quality or the distance measure in order to determine the number of clusters; for example, in agglomerative algorithms one can impose a bound on the distance beyond which no pair of clusters will be merged. Some of the approaches we discuss provide ways to automatically select the number of clusters. The clusterings input to the consensus function can have different numbers of clusters, and the consensus function itself can determine yet another number of clusters based on the agreement between the objects in the input clusterings. For example, if many input clusterings place two objects in the same cluster, then a good consensus function will not split these two objects.

Handling Missing Information. Data comes in many different forms: it may include categorical attributes, attributes with incomparable values, constantly changing values, or missing attribute values. There are also legacy clusterings that were often provided by human experts, where cluster labels are available for old data while no results are available for new data. These situations can lead to missing or incorrect cluster labels for objects in certain clusterings. Consensus clustering provides a framework to account for missing labels and missing values in data objects.

B. Challenges
We summarize the key challenges raised by the problem of consensus clustering as follows: (i) to explore the space of possible consensus clusterings efficiently in order to determine the best consensus clustering; and (ii) to model the similarities in input clusterings and design effective consensus functions accordingly.

The remainder of this survey is organized as follows. In Section 2, we classify and discuss various formulations of consensus clustering approaches, along with the consensus functions and methods to compute them. In Section 3 we compare the different approaches, analyzing their complexity, accuracy, strengths, and drawbacks. Section 4 briefly discusses the generation of input clusterings, Section 5 discusses open problems and potential application areas, and we conclude in Section 6.

II. APPROACHES TO CONSENSUS CLUSTERING
In this section we classify the various formulations of consensus clustering and consensus functions. First, we adopt a notation that more or less captures the ideas of all the formulations. Let χ = {x_1, x_2, ..., x_n} denote a set of n objects. A partitioning of these n objects into k clusters can be represented as a set of k sets of objects {C_j | j = 1, ..., k} or as a label vector λ_q ∈ N^n. For each x_i, we use C_q(x_i) to denote the label of the cluster to which object x_i belongs, i.e., C_q(x_i) = j if and only if x_i ∈ C_{qj}. A clusterer Φ_q is a clustering algorithm that generates the label vector λ_q given χ. Let k_q be the number of clusters in λ_q. A set of r labelings Λ = {λ_q | q = 1, ..., r} is combined into a single labeling λ̂ using a consensus function Γ. The general architecture of consensus clustering is shown in Fig. 1 (Consensus Clustering).

As an example, consider four input clusterings: λ_1 = (1, 1, 1, 2, 2, 3, 3), λ_2 = (2, 2, 2, 3, 3, 1, 1), λ_3 = (1, 1, 2, 2, 3, 3, 3), and λ_4 = (1, 2, ?, 1, 2, ?, ?). An inspection suggests that a reasonable consensus clustering is (2, 2, 2, 3, 3, 1, 1). Here, λ_4 has missing labels. Each clustering follows a different labeling scheme; for example, λ_1 and λ_2 describe the same partition with different labelings. For a labeling with k distinct clusters there are k! equivalent representations as integer label vectors. Hence, a common assumption is that a labeling follows two rules: (i) C(x_1) = 1; (ii) for all i = 1, ..., n-1: C(x_{i+1}) ≤ max_{j=1,...,i} C(x_j) + 1. Any labeling λ_q can thus be transformed into an equivalent labeling under a uniform scheme for all clusterings.

A. Graph-Partitioning Approaches
Problem of Graph Partitioning. Given a weighted graph G, the goal is to partition it into k disjoint clusters of vertices. Unless the graph has k or more connected components, any k-way partition will cut some of the graph edges. The sum of the weights of these cut edges is defined as the cut of a partition P:

Cut(P, W) = \sum_{(i,j):\, i, j \text{ not in the same cluster}} W(i, j).

The goal of a graph-partitioning algorithm is to minimize the cut of the k-way partition.
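To make the cut definition concrete, here is a small sketch (our own illustration, not code from any of the surveyed papers; the function name partition_cut is hypothetical) that computes Cut(P, W) for a toy weighted graph:

```python
# Minimal sketch of the cut of a k-way partition, assuming W is a symmetric
# numpy array of edge weights and labels[i] gives the cluster of vertex i.
import numpy as np

def partition_cut(W: np.ndarray, labels: np.ndarray) -> float:
    """Sum of weights of edges whose endpoints lie in different clusters."""
    n = len(labels)
    cut = 0.0
    for i in range(n):
        for j in range(i + 1, n):        # each undirected edge counted once
            if labels[i] != labels[j]:
                cut += W[i, j]
    return cut

# Example: a 4-vertex graph split into two clusters {0, 1} and {2, 3}.
W = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
print(partition_cut(W, np.array([0, 0, 1, 1])))  # -> 2.0 (crossing edges 0-2 and 1-3)
```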
Two types of graph-partitioning techniques for consensus clustering are discussed in the literature: 1) instance-based graph partitioning (IBGF), in which a similarity measure is induced based on the number of clusterings that cluster pairs of objects together, and 2) cluster-correspondence-based graph partitioning (CBGF), in which a measure is induced based on the similarity between clusters of two different clusterings. In this section we discuss approaches that adopt one of these techniques, as well as a hybrid approach.

1) Objective Functions: The consensus functions discussed later try to optimize an objective function. The objective functions capture the similarity between the input clusterings at the instance level or at the cluster level.

Mutual Information. The mutual information metric is proposed with the cluster ensemble framework by Strehl et al. in [1]. Mutual information, a measure of the statistical information shared between two distributions, is used to quantify the similarity between two clusterings [13]. Let X and Y be the random variables described by the cluster labelings λ_a and λ_b, with k_a and k_b clusters respectively. Let I(X, Y) denote the mutual information between X and Y, and H(X) the entropy of X. The normalized mutual information is NMI(X, Y) = I(X, Y) / \sqrt{H(X) H(Y)}. Let n_h^{(a)} be the number of objects in cluster C_h according to λ_a, and n_l^{(b)} the number of objects in cluster C_l according to λ_b. Let n_{h,l} denote the number of objects that are in C_h according to λ_a as well as in C_l according to λ_b. Then, after substituting for I and H, the normalized mutual information estimate Φ^{(NMI)} is given by:

Φ^{(NMI)}(λ_a, λ_b) = \frac{\sum_{h=1}^{k_a} \sum_{l=1}^{k_b} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(a)} n_l^{(b)}}\right)}{\sqrt{\left(\sum_{h=1}^{k_a} n_h^{(a)} \log\frac{n_h^{(a)}}{n}\right)\left(\sum_{l=1}^{k_b} n_l^{(b)} \log\frac{n_l^{(b)}}{n}\right)}}    (1)
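The estimate in Eq. (1) can be computed directly from the contingency counts n_{h,l}. The sketch below is our own illustration (the function name nmi is hypothetical), assuming integer label vectors starting at 1 with no missing entries:

```python
# Minimal sketch of the NMI estimate of Eq. (1) between two label vectors.
import numpy as np

def nmi(lab_a: np.ndarray, lab_b: np.ndarray) -> float:
    n = len(lab_a)
    ka, kb = lab_a.max(), lab_b.max()
    # contingency table: counts[h, l] = |{i : lab_a[i] = h+1 and lab_b[i] = l+1}|
    counts = np.zeros((ka, kb))
    for a, b in zip(lab_a, lab_b):
        counts[a - 1, b - 1] += 1
    na = counts.sum(axis=1)     # cluster sizes in lab_a
    nb = counts.sum(axis=0)     # cluster sizes in lab_b
    nz = counts > 0
    numer = (counts[nz] * np.log(n * counts[nz] / np.outer(na, nb)[nz])).sum()
    denom = np.sqrt((na * np.log(na / n)).sum() * (nb * np.log(nb / n)).sum())
    return numer / denom

lab1 = np.array([1, 1, 1, 2, 2, 3, 3])
lab2 = np.array([2, 2, 2, 3, 3, 1, 1])
print(nmi(lab1, lab2))   # identical partitions up to relabeling -> 1.0
```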

Based on this pairwise measure of mutual information, the optimal combined clustering λ̂^{(k-opt)} is the one that has maximal mutual information with all individual labelings λ_q in Λ, given k, the number of clusters in the consensus clustering. In other words,

λ̂^{(k-opt)} = \arg\max_{λ̂} \sum_{q=1}^{r} Φ^{(NMI)}(λ̂, λ_q),    (2)

where λ̂ ranges over all possible k-partitions. The authors show that this optimization problem is hard, and a naive exhaustive search is infeasible. Even greedy approaches have drawbacks, such as a strong dependence on the initial settings and convergence to poor local optima. The authors propose three techniques based on graph partitioning, CSPA, HGPA, and MCLA, which we discuss later.

Disagreement Metric. The disagreement metric is defined in the cluster aggregation framework of [14]. As in the cluster ensemble framework, cluster aggregation defines a distance measure between two clusterings. Let d_{x_i, x_j}(λ_1, λ_2) denote a boolean function whose value is 0 if λ_1 and λ_2 both put x_i and x_j in the same cluster, and 1 otherwise. This function measures the disagreement between two clusterings on a pair of data objects. The distance between two clusterings is defined as in Eq. (3). The problem of clustering aggregation is then, given a set of clusterings Λ, to compute a new clustering λ̂ that minimizes the total number of disagreements with Λ, given by the sum \sum_{q=1}^{r} d_χ(λ_q, λ̂). The distance function of Eq. (3) satisfies a number of properties, including the triangle inequality.

d_χ(λ_1, λ_2) = \sum_{(x_i, x_j) ∈ χ × χ} d_{x_i, x_j}(λ_1, λ_2)    (3)

2) Graph Partitioning Techniques:

Cluster-based Similarity Partitioning Algorithm (CSPA). An n × n boolean similarity matrix over the n objects is built for each clustering λ_q, where an entry of 1 indicates that two objects are in the same cluster and 0 indicates otherwise. A cumulative similarity matrix S is obtained from the r boolean matrices, where each entry is the fraction of clusterings in which two objects are clustered together. A similarity graph whose edge weights correspond to the entries of S is induced from this matrix, and METIS [15] is used to partition the graph and obtain a consensus clustering of the objects.

HyperGraph-Partitioning Algorithm (HGPA). The set of input clusterings Λ is transformed into a hypergraph, in which the vertices are the objects to be clustered and a hyperedge connects the set of objects belonging to the same cluster. The problem of consensus clustering is then reduced to finding a minimal cut of the hypergraph. Standard hypergraph-partitioning algorithms (e.g., HMETIS), with a constraint to keep partition sizes comparable, are used to obtain the consensus clustering.

Meta-Clustering Algorithm (MCLA). In this approach, several hyperedges, each representing a cluster, are grouped together and collapsed into a single hyperedge. If the number of hyperedges in the hypergraph is \sum_{j=1}^{r} k_j, then k collapsed hyperedges are generated, using NMI as the similarity measure between clusters to decide which hyperedges to combine.

Hybrid Bipartite Graph Formulation (HBGF). HBGF, proposed by Fern et al. [16], takes a hybrid approach by combining the ideas of instance-based graph partitioning (IBGF) and cluster-based graph partitioning (CBGF).
In this approach, the authors formulate the cluster ensemble problem as partitioning a weighted bipartite graph, where the two sets of vertices correspond to 1) V_C, the set of all clusters in all input clusterings, and 2) V_I, the set of n data objects. If vertices i and j are both clusters or both objects, then W(i, j) = 0; if object i belongs to cluster j, then W(i, j) = W(j, i) = 1, and 0 otherwise. To illustrate the benefit of a hybrid approach, consider two pairs of instances (A, B) and (C, D), and assume that A and B are never clustered together in the ensemble, and the same is true for the pair (C, D). However, A and B are each frequently clustered together with the same group of instances, i.e., A and B are frequently assigned to two different clusters that are similar to each other, whereas this is not true for C and D. Intuitively, we consider A and B to be more similar to one another than C and D. IBGF, however, fails to differentiate these two cases and assigns both similarities to be zero, because it ignores the information about the similarity of clusters while computing the similarity of instances. Similarly, CBGF has its own drawbacks. The hybrid approach integrates the similarity between instances and the similarity between clusters simultaneously. It uses two graph-partitioning techniques, spectral graph partitioning [17] and METIS, to compute the consensus clustering. There are other graph-partitioning approaches and objective functions that are slight variations of the ones discussed above; we leave their details to a possible extended version of this survey.

B. Probabilistic Approaches
In this section we discuss probabilistic approaches to consensus clustering. As opposed to the graph-partitioning approaches, in probabilistic approaches the objective functions are tightly coupled with the consensus functions that optimize them.

1) Bayesian Cluster Ensembles: Bayesian Cluster Ensembles (BCE), proposed in [18], takes a Bayesian approach to consensus clustering. It treats the input clustering results for each object as a feature vector with discrete feature values and learns a mixed-membership model from this representation. Figures 2(a) and 2(b) show B, the matrix representation of the cluster assignments of objects by the different input clusterings. The distance-based approaches process the clusterings column-wise (Fig. 2(a)), whereas BCE processes them row-wise (Fig. 2(b)). The consensus clustering problem then becomes finding a clustering λ̂ of the objects {x_1, ..., x_n} whose feature vectors are the rows of B.
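As a concrete illustration (ours, not taken from [18]), the matrix B can be assembled from the example label vectors of Section II, with missing labels kept as gaps:

```python
# Minimal sketch of the base clustering matrix B (objects x input clusterings),
# using the example labelings from Section II; a missing label is stored as NaN.
import numpy as np

labelings = [
    [1, 1, 1, 2, 2, 3, 3],              # lambda_1
    [2, 2, 2, 3, 3, 1, 1],              # lambda_2
    [1, 1, 2, 2, 3, 3, 3],              # lambda_3
    [1, 2, None, 1, 2, None, None],     # lambda_4 (missing labels)
]

n, r = len(labelings[0]), len(labelings)
B = np.full((n, r), np.nan)
for j, lam in enumerate(labelings):
    for i, label in enumerate(lam):
        if label is not None:
            B[i, j] = label

print(B)    # row i is the discrete feature vector x_i = (x_i1, ..., x_ir)
```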

Fig. 2. Matrix Representation: (a) column-wise view of B used by distance-based approaches; (b) row-wise view used by BCE.

BCE is defined as a mixture model that generates the matrix B. Assuming that there are K consensus clusters, each object x_i's cluster ids are drawn from a finite mixture model θ_i over the K clusters, and θ_i is sampled from a Dirichlet distribution with parameter α. Further, the latent variable corresponding to each consensus cluster id h follows a discrete distribution β_{hj}, one for each input clustering λ_j, over its cluster ids {1, ..., k_j}. Hence, if an object x_i belongs to consensus cluster h for λ_j, its cluster id x_{ij} = s ∈ [1, k_j] is determined by the distribution β_{hj}(s) = p(x_{ij} = s | h), where β_{hj}(s) ≥ 0 and \sum_{s=1}^{k_j} β_{hj}(s) = 1. Let z_{ij} be the latent variable denoting that object x_i belongs to consensus cluster h for λ_j. Given the model parameters α and β = {β_{hj}}, h ∈ [1, K], j ∈ [1, r], the joint probability distribution over the latent variables θ_i, z_i and the observed values {x_{ij}}, i ∈ [1, n], j ∈ [1, r], is given by:

p(x_i, θ_i, z_i | α, β) = p(θ_i | α) \prod_{j=1, ∃x_{ij}}^{r} p(z_{ij} = h | θ_i) p(x_{ij} | β_{hj}),    (4)

where ∃x_{ij} denotes that the j-th input clustering provides a result for x_i (there may be no label for x_i in some clusterings). Given the observed matrix B, the goal is to estimate the mixed membership θ_i, i ∈ [1, n], of each object over the consensus clusters. The model parameters α and β are unknown, so they must be estimated such that the likelihood of observing B is maximized. Typically, the EM algorithm can be used, alternating between calculating the posterior over the latent variables, p(θ_i, z_i | x_i, α, β), and updating the parameters until convergence. However, computing the posterior in closed form turns out to be intractable; hence, the authors employ two known techniques, variational inference and Gibbs sampling, to approximate the posterior distribution. Following standard variational inference [19], the posterior distribution is approximated by a family of variational distributions to compute a lower bound L on the log-likelihood log p(x_i | α, β). The variational distributions are obtained by introducing new variational parameters φ and γ and choosing an approximating distribution for the x_i. A variational EM algorithm is used to maximize the lower bound: the algorithm starts with some initialization of the parameters α, β and, in the E-step, finds the variational parameters that maximize L; the M-step uses the computed variational parameters to maximize L over α, β and obtain new estimates. These two steps repeat until convergence. The paper also proposes specialized EM algorithms for row-distributed and column-distributed cluster ensembles, for which we refer the readers to the original paper, as well as Gibbs sampling as an alternative way to compute the posterior distribution, assuming a Dirichlet prior over β.

2) Mixture Model for Consensus Clustering: As in BCE, the mixture model for consensus clustering of [20] views the cluster labels of an object according to the different input clusterings as a set of new features associated with the object. Let x_{ij} = λ_j(x_i) be the cluster label assigned by the j-th clustering to data object x_i; then x_i follows a finite parametric mixture model (Eq. (5)) with components corresponding to the K consensus clusters. The data {x_i} are generated by first drawing a component according to the probability mass function α_m, and then sampling a point from the distribution p_m(x | θ_m).
Given the data x = {x_i}_{i=1}^{n}, in which each x_i is assumed to be independent and identically distributed, the mixture density and the log-likelihood over the parameters Θ = {α_1, ..., α_K, θ_1, ..., θ_K} are given below; the goal is to find the parameters that maximize the likelihood.

p(x_i | Θ) = \sum_{m=1}^{K} α_m p_m(x_i | θ_m)    (5)

log L(Θ | x) = \log \prod_{i=1}^{n} p(x_i | Θ)    (6)

             = \sum_{i=1}^{n} \log \sum_{m=1}^{K} α_m p_m(x_i | θ_m)    (7)

As in BCE, the maximum-likelihood problem cannot be solved in closed form when all the parameters are unknown. Hence, the EM algorithm is applied, after making a conditional independence assumption that simplifies p_m(x_i | θ_m) to \prod_{j=1}^{r} p_m^j(x_{ij} | θ_m^j), where each p_m^j(x | θ_m^j) is a multinomial over k_j outcomes and k_j is the number of clusters in λ_j. With each x_i, a hidden variable z_i = {z_{i1}, ..., z_{iK}} is introduced, such that z_{im} = 1 if x_i belongs to the m-th component and z_{im} = 0 otherwise. The EM algorithm starts with an initial guess for the parameters in Θ; the E-step computes the expected values of the hidden variables E[z_{im}], and the M-step maximizes the likelihood by computing new estimates of the parameters. The convergence criterion is the improvement in likelihood between two M-steps, and the consensus clustering solution is obtained from the expected values E[z_{im}]: once convergence is achieved, an object x_i is assigned to the component with the largest value in z_i.
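The sketch below is our own illustration of this EM procedure on a label matrix B, not the implementation of [20]; it models each input clustering with one multinomial per consensus component, and all names are hypothetical:

```python
# Minimal EM sketch for the multinomial mixture of labelings of Section II.B.2.
# B is an (n x r) integer matrix of cluster ids (1..k_j per column), no missing values.
import numpy as np

def em_consensus(B: np.ndarray, K: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    n, r = B.shape
    kj = B.max(axis=0)                        # clusters per input clustering
    alpha = np.full(K, 1.0 / K)               # mixing weights
    # theta[m][j][s] = P(label s+1 in clustering j | consensus component m)
    theta = [[rng.dirichlet(np.ones(kj[j])) for j in range(r)] for m in range(K)]
    for _ in range(n_iter):
        # E-step: responsibilities E[z_im] proportional to alpha_m * prod_j theta[m][j][B_ij]
        logp = np.zeros((n, K))
        for m in range(K):
            for j in range(r):
                logp[:, m] += np.log(theta[m][j][B[:, j] - 1] + 1e-12)
        logp += np.log(alpha + 1e-12)
        logp -= logp.max(axis=1, keepdims=True)
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha and the multinomial parameters
        alpha = z.mean(axis=0)
        for m in range(K):
            for j in range(r):
                counts = np.array([z[B[:, j] == s + 1, m].sum() for s in range(kj[j])])
                theta[m][j] = counts / max(counts.sum(), 1e-12)
    return z.argmax(axis=1)                   # hard consensus assignment per object

B = np.array([[1, 2, 1], [1, 2, 1], [1, 2, 2], [2, 3, 2],
              [2, 3, 3], [3, 1, 3], [3, 1, 3]])
print(em_consensus(B, K=3))                   # consensus component of each object (0..K-1)
```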

Next, we discuss the non-parametric Bayesian cluster ensemble approach.

3) Non-parametric Bayesian Cluster Ensembles: Non-parametric Bayesian Cluster Ensembles (NBCE) [21] is similar in spirit to BCE, except that it uses a Dirichlet process mixture model to generate the data. We again have the clustering matrix B, where the row vector x_i = {x_{ij} | j ∈ [1, r]} is the feature-vector representation of the i-th data object. The x_i are generated using a Dirichlet process mixture model with concentration parameter α_0 and base measure G_0, via a truncated stick-breaking (TSB) construction. The TSB construction stops at level K. Let an infinite sequence of random variables be defined as v_k ~ Beta(1, α_0), and let π = {π_k | k = 1, 2, ...}, where π_k = v_k \prod_{j=1}^{k-1} (1 - v_j), be the mixing proportions of the infinite number of components. The TSB truncates after K iterations by setting v_K = 1, which automatically makes π_k = 0 for k > K. Let the probability of λ_j generating cluster id x_{ij} = k_j for x_i be θ_{ijk_j}, where \sum_{k_j=1}^{K_j} θ_{ijk_j} = 1, and let x_i = {x_{ij} = k_j | j ∈ [1, r]}, θ_{ij} = {θ_{ijk_j} | k_j ∈ [1, K_j]}, and θ_i = {θ_{ij} | j ∈ [1, r]}. Then x_i is generated with probability \prod_{j=1}^{r} θ_{ijk_j}. Since the truncation level is K, there are K distinct θ_i, denoted θ*_k, k ∈ {1, ..., K}, each sampled from G_0. Hence, in addition to π_k, an indicator variable z_i is associated with each object x_i to indicate which θ*_k is assigned to x_i. A consensus cluster is defined as a cluster of objects associated with the same θ*_k. Further, the algorithm assumes a Dirichlet prior π ~ Dir(α_0/K, ..., α_0/K). The goal is to compute the components of the distribution P(X, Z, π, θ | α_0, G_0), where X = {x_i | i ∈ [1, n]} and Z = {z_i | i ∈ [1, n]}. The approach discussed in the paper is to apply Gibbs sampling after marginalizing out π and θ; the paper also proposes variational inference techniques similar to those of BCE.

C. Relabeling and Voting Approaches
The voting approach is the third kind of known approach to the consensus clustering problem. It first solves the label correspondence problem discussed at the beginning of this survey, and it assumes that all the input clusterings, as well as the target consensus clustering, have the same number of clusters. The idea is to choose a reference clustering among the given input clusterings and, for each other clustering, to permute the cluster labels so as to obtain the best agreement with the reference clustering. For a clustering with k labels there are k! equivalent labelings; the Hungarian algorithm yields an O(k^3) solution to the cluster relabeling problem (a small sketch of this step appears at the end of Section III-A). After relabeling, a voting procedure determines the consensus cluster id of each object [3], [22].

III. COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES
In this section we compare the various techniques by their computational complexity and by the accuracy of the consensus clusterings they generate.

A. Complexity Analysis
The complexity of the graph-partitioning techniques of Strehl et al. [1] (Section 2.1) depends on the complexity of the partitioners used, such as (H)METIS. The worst-case complexity of CSPA is O(n^2 Kr), that of HGPA is O(nKr), and that of MCLA is O(nK^2 r^2). The complexity of the HBGF partitioning technique is O(nK). The three earlier graph-partitioning techniques are based on either instance-based (IBGF) or cluster-based (CBGF) graph partitioning; HBGF leverages both ideas and hence achieves a better running time than the other approaches. As for the probabilistic approaches, the methods used are either variational Bayesian inference or sampling techniques; variational inference techniques are approximation techniques used to sidestep intractable integrals in Bayesian inference, and their efficiency is well known in the literature. The complexity of the relabeling step in the voting approaches is O(K^3).
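To illustrate the relabeling step just mentioned, here is a small sketch (our own, using SciPy's Hungarian-algorithm routine linear_sum_assignment rather than code from [3] or [22]):

```python
# Minimal sketch of relabeling a clustering against a reference clustering with
# the Hungarian algorithm, assuming both labelings use labels 1..k over the same n objects.
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(reference: np.ndarray, other: np.ndarray, k: int) -> np.ndarray:
    # agreement[a, b] = number of objects labeled a+1 in `other` and b+1 in `reference`
    agreement = np.zeros((k, k), dtype=int)
    for o, ref in zip(other, reference):
        agreement[o - 1, ref - 1] += 1
    # maximize total agreement = minimize the negated agreement matrix
    row_ind, col_ind = linear_sum_assignment(-agreement)
    mapping = {a + 1: b + 1 for a, b in zip(row_ind, col_ind)}
    return np.array([mapping[o] for o in other])

ref   = np.array([1, 1, 1, 2, 2, 3, 3])   # lambda_1 from Section II
other = np.array([2, 2, 2, 3, 3, 1, 1])   # lambda_2: same partition, different labels
print(relabel(ref, other, k=3))           # -> [1 1 1 2 2 3 3]
```

Once every input clustering has been relabeled against the reference, a simple per-object majority vote over the relabeled label vectors yields the consensus cluster id.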
B. Accuracy Analysis
1) Graph Partitioning Techniques: To compare CSPA, HGPA, and MCLA, a random number generator is used to produce r noisy labelings of a dataset, and the labelings are fed to each of the techniques. The resulting consensus labelings are evaluated by their average NMI with all the input labelings, Φ^{(ANMI)}(Λ, λ̂), and by their NMI with the original labeling of the dataset. It is observed that as the noise increases, the NMI measure for λ̂ decreases, and HGPA performs the worst among the three algorithms. All three algorithms fall into either the IBGF or the CBGF category. HBGF avoids the pitfalls of both IBGF and CBGF by considering the similarity of instances and the similarity of clusters simultaneously. The HBGF evaluation uses NMI with respect to the true cluster labels to compare algorithms based on the HBGF, CBGF, and IBGF formulations. The cluster ensembles are generated by random subsampling from the datasets, clustering each sample, and assigning the objects not in the sample to one of the clusters based on the Euclidean distance to the cluster centers. The maximum NMI value is compared for the three algorithms over 5 datasets, and HBGF performs comparably to or significantly better than IBGF and CBGF on all of them.

2) Probabilistic Approaches: BCE is evaluated over 10 datasets from the UCI machine learning repository. Micro-precision is used to evaluate the accuracy of a consensus clustering with respect to the true labels. Micro-precision is defined as MP = \frac{1}{n} \sum_{h=1}^{K} a_h, where K is the number of clusters, n is the number of objects, and a_h denotes the number of objects in consensus cluster h that are correctly assigned to the corresponding class; the corresponding class for consensus cluster h is the true class with the largest overlap with the cluster. MP satisfies 0 ≤ MP ≤ 1, with 1 indicating the best possible consensus clustering. The input clusterings are generated by running k-means 2000 times over a dataset of n objects; these runs are then divided into 100 subsets of 20 input clusterings each, yielding 100 base clustering matrices of size n × 20. The maximum and average MPs are computed from the BCE results. It is observed that the consensus clustering generated by BCE always outperforms the input clusterings in both maximum and average MP, and BCE also outperforms CSPA and the mixture model in 80% of the results.
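A sketch of the micro-precision computation just defined (our own code; micro_precision is a hypothetical name), assuming integer label vectors for the consensus clusters and the true classes:

```python
# Minimal sketch of micro-precision: for each consensus cluster, count the overlap
# with its best-matching true class, then divide the total by n.
import numpy as np

def micro_precision(consensus: np.ndarray, truth: np.ndarray) -> float:
    n = len(consensus)
    total = 0
    for h in np.unique(consensus):
        members = truth[consensus == h]      # true classes inside consensus cluster h
        total += np.bincount(members).max()  # a_h: overlap with the best-matching class
    return total / n

consensus = np.array([1, 1, 1, 2, 2, 3, 3])
truth     = np.array([1, 1, 2, 2, 2, 3, 3])
print(micro_precision(consensus, truth))     # -> 6/7 ~ 0.857
```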

For the mixture model, the results are compared against the CSPA, HGPA, and MCLA graph-partitioning algorithms over five datasets, using the mean error rate of the consensus clustering as the comparison measure. It is observed that the mixture model performs better than CSPA and HGPA for most inputs, but MCLA performs better as the number of input clusterings increases. NBCE is evaluated using F-1 and perplexity measures on test datasets whose true cluster labels are known. NBCE has a better F-1 measure than CSPA, HGPA, and MCLA, and a better perplexity measure than BCE, which in turn is better than the mixture model.

C. Strengths and Drawbacks
We summarize the strengths and drawbacks of the different kinds of approaches to consensus clustering in Table I.

IV. GENERATING INPUT CLUSTERINGS
An important problem related to consensus clustering is generating diverse input clusterings for empirical studies. One common approach is to generate multiple clusterings using the K-means algorithm with different initializations; other approaches include random sub-sampling and random projection [18].

V. DISCUSSION AND OPEN PROBLEMS
Consensus clustering is an active area of research, and a number of open problems remain to be addressed. We discuss some of them in this section.

Fluctuations in Input Clusterings. We have seen that the accuracy of many of the algorithms discussed goes down as the noise in the input clusterings increases. A few input clusterings may adversely affect the accuracy of the consensus clustering: for example, 80% of the input clusterings may agree with a particular consensus clustering while the remaining 20% lead to a vastly different one. The accuracy of the consensus clustering also depends heavily on the number of input clusterings, which can be small or very large. It is very difficult to automatically identify the input clusterings that adversely affect the final clustering; however, it is worth investigating consensus functions that can minimize the effect of such clusterings. Further, a framework could be developed that outputs a number of consensus clusterings ranked by some scoring function and lets the user choose among them. It is also worth investigating a hierarchical approach to consensus clustering: for example, instead of using the entire set of input clusterings at once, using subsets of input clusterings to produce different consensus clusterings and then computing a consensus over those. Further, as the accuracy of the Dirichlet process approach proves better than that of the mixture model and the parametric approach, it is worth investigating the application of Pitman-Yor processes to consensus clustering. The approaches discussed in this survey assume that all input clusterings are equally important; it is possible that certain clusterings matter more than others, and it is worth investigating how such a bias can be modeled.

Applications in Databases and Data Mining. Consensus clustering has many applications in databases and web mining. Problems in bioinformatics, such as clustering gene expression data, have been discussed in the literature [23]. An important application of consensus clustering is outlier detection.
Though traditional clustering techniques can be used for outlier detection, the quality and robustness of outlier detection improve with consensus clustering: multiple runs of the same or different algorithms can be used to generate multiple clusterings of the data, from which a consensus about an object can be formed to determine whether it is an outlier. In web mining, clustering search results is an important task. Web search engines such as Google let users browse documents similar to those retrieved as search results, and consensus clustering can improve the accuracy of finding similar documents. Determining the number of clusters remains a difficult problem. Though the non-parametric Bayesian cluster ensemble approach [21] provides a way to automatically determine the number of consensus clusters, it appears to depend on the number of levels up to which the stick-breaking construction is applied. It is worth investigating how traditional techniques [11], [12] can be combined with consensus functions to automatically determine the number of consensus clusters.

VI. CONCLUSION AND FUTURE WORK
Consensus clustering is an important elaboration of the classical clustering problem and has emerged as an important approach to improving the quality of clustering results. Several approaches have been proposed independently to address it; the common idea is to use a consensus function to compute a clustering that is a better fit than the input clusterings. In this survey we discussed and classified the major approaches to consensus clustering, along with its motivation and application areas. The three major kinds of approaches in the literature are graph-partitioning, probabilistic, and voting approaches. We provided formulations for the different kinds of approaches, discussed their consensus functions, analyzed their complexity and accuracy, and compared their strengths and drawbacks. The probabilistic approaches are by far the most versatile, since they address most variations of the consensus clustering problem, such as handling missing values and row-distributed and column-distributed clustering. We intend to publish an extended version of this survey discussing the algorithms, strengths and drawbacks, and application areas of each approach in greater detail.

REFERENCES
[1] A. Strehl and J. Ghosh: Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research (JMLR) 3 (2002).

TABLE I: COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES

Graph Partitioning Techniques [1], [16]
Strengths: Use an objective function to control the partition size. Handle missing cluster labels in the input clusterings. Handle column-distributed cluster ensembles automatically. [1] uses a supra-consensus function to select the best consensus function for the data. HBGF further improves accuracy by combining the similarity between instances as well as between clusters. Since the techniques use scalable algorithms such as METIS and spectral graph partitioning, they scale with the number of input clusterings and the number of clusters per clustering. The computational complexity is reasonably good, and the clusters obtained are stable and robust.
Drawbacks: There is no automatic way of detecting the number of consensus clusters; K is manually determined. It is observed that, for CSPA and HGPA, the accuracy of the resulting clustering goes down as the noise in the labelings increases. Not a very effective approach for row-distributed cluster ensembles.

Probabilistic Approaches [18], [20], [21]
Strengths: In most cases, the accuracy is better than that of the graph-partitioning algorithms. Can handle missing values in the clusterings. Can handle both row-distributed and column-distributed cluster ensembles; BCE [18] also proposes row-distributed and column-distributed EM algorithms. The number of consensus clusters is automatically determined from the observations. The techniques are scalable.
Drawbacks: The accuracy of the mixture model is not better than that of the graph-partitioning algorithms when the number of input clusterings is large. The mixture models suffer from overfitting, which can be overcome using Bayesian approaches. Sampling requires time, as sometimes a large number of input clusterings are needed to obtain reasonable accuracy.

Voting Approaches [3], [22]
Strengths: Solve the label correspondence problem.
Drawbacks: Many of the approaches make simplifying assumptions instead of actually solving it. High computational cost. Perform poorly in the presence of touching clusters (clusters without clear boundaries).

[2] X. Z. Fern and C. E. Brodley: Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach, ICML 2003.
[3] A. L. N. Fred and A. K. Jain: Data Clustering Using Evidence Accumulation, ICPR 2002.
[4] J. A. Hartigan: Clustering Algorithms, John Wiley.
[5] V. Filkov and S. Skiena: Integrating Microarray Data by Consensus Clustering, ICTAI 2003.
[6] A. P. Topchy, M. H. C. Law, A. K. Jain, and A. L. N. Fred: Analysis of Consensus Partition in Cluster Ensemble, ICDM 2004.
[7] T. Hastie, R. Tibshirani, and J. Friedman: Hierarchical Clustering, in The Elements of Statistical Learning, 2nd ed., Springer, New York.
[8] K. Kailing, H.-P. Kriegel, and P. Kroger: Density-Connected Subspace Clustering for High-Dimensional Data, SDM 2004.
[9] P. S. Bradley and U. M. Fayyad: Refining Initial Points for K-Means Clustering, ICML 1998.
[10] A. P. Topchy, A. K. Jain, and W. F. Punch: Combining Multiple Weak Clusterings, ICDM 2003.
[11] P. Smyth: Model Selection for Probabilistic Clustering using Cross-Validated Likelihood, Statistics and Computing 10(1).
[12] G. Hamerly and C. Elkan: Learning the k in k-means, NIPS 2003.
[13] T. M. Cover and J. A. Thomas: Elements of Information Theory, Wiley.
[14] A. Gionis, H. Mannila, and P. Tsaparas: Clustering Aggregation, ICDE 2005.
[15] G. Karypis and V. Kumar: Multilevel k-way Partitioning Scheme for Irregular Graphs, J. Parallel Distrib. Comput. (JPDC) 48(1), 1998.
[16] X. Z. Fern and C. E. Brodley: Solving Cluster Ensemble Problems by Bipartite Graph Partitioning, ICML 2004.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss: On Spectral Clustering: Analysis and an Algorithm, NIPS 2001.
[18] H. Wang, H. Shan, and A. Banerjee: Bayesian Cluster Ensembles, SDM 2009.
[19] T. Jaakkola and M. I. Jordan: Variational Probabilistic Inference and the QMR-DT Network, J. Artif. Intell. Res. (JAIR) 10, 1999.
[20] A. P. Topchy, A. K. Jain, and W. F. Punch: A Mixture Model for Clustering Ensembles, SDM 2004.
[21] P. Wang, C. Domeniconi, and K. B. Laskey: Nonparametric Bayesian Clustering Ensembles, ECML/PKDD 2010.
[22] S. Dudoit and J. Fridlyand: Bagging to Improve the Accuracy of a Clustering Procedure, Bioinformatics 19(9), 2003.
[23] S. Monti, P. Tamayo, J. Mesirov, and T. Golub: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.


More information

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski 10. Clustering Introduction to Bioinformatics 30.9.2008 Jarkko Salojärvi Based on lecture slides by Samuel Kaski Definition of a cluster Typically either 1. A group of mutually similar samples, or 2. A

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

A Deterministic Global Optimization Method for Variational Inference

A Deterministic Global Optimization Method for Variational Inference A Deterministic Global Optimization Method for Variational Inference Hachem Saddiki Mathematics and Statistics University of Massachusetts, Amherst saddiki@math.umass.edu Andrew C. Trapp Operations and

More information

Hierarchical Mixture Models for Nested Data Structures

Hierarchical Mixture Models for Nested Data Structures Hierarchical Mixture Models for Nested Data Structures Jeroen K. Vermunt 1 and Jay Magidson 2 1 Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, Netherlands

More information

Distance based Clustering for Categorical Data

Distance based Clustering for Categorical Data Distance based Clustering for Categorical Data Extended Abstract Dino Ienco and Rosa Meo Dipartimento di Informatica, Università di Torino Italy e-mail: {ienco, meo}@di.unito.it Abstract. Learning distances

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Multiobjective Data Clustering

Multiobjective Data Clustering To appear in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Multiobjective Data Clustering Martin H. C. Law Alexander P. Topchy Anil K. Jain Department of Computer Science

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Stephen Scott.

Stephen Scott. 1 / 33 sscott@cse.unl.edu 2 / 33 Start with a set of sequences In each column, residues are homolgous Residues occupy similar positions in 3D structure Residues diverge from a common ancestral residue

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

A Soft Clustering Algorithm Based on k-median

A Soft Clustering Algorithm Based on k-median 1 A Soft Clustering Algorithm Based on k-median Ertu grul Kartal Tabak Computer Engineering Dept. Bilkent University Ankara, Turkey 06550 Email: tabak@cs.bilkent.edu.tr Abstract The k-median problem is

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Bayesian model ensembling using meta-trained recurrent neural networks

Bayesian model ensembling using meta-trained recurrent neural networks Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Enhancing Single-Objective Projective Clustering Ensembles

Enhancing Single-Objective Projective Clustering Ensembles Enhancing Single-Objective Projective Clustering Ensembles Francesco Gullo DEIS Dept. University of Calabria 87036 Rende CS), Italy fgullo@deis.unical.it Carlotta Domeniconi Department of Computer Science

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,

More information

Randomized Algorithms for Fast Bayesian Hierarchical Clustering

Randomized Algorithms for Fast Bayesian Hierarchical Clustering Randomized Algorithms for Fast Bayesian Hierarchical Clustering Katherine A. Heller and Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College ondon, ondon, WC1N 3AR, UK {heller,zoubin}@gatsby.ucl.ac.uk

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

A Local Learning Approach for Clustering

A Local Learning Approach for Clustering A Local Learning Approach for Clustering Mingrui Wu, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics 72076 Tübingen, Germany {mingrui.wu, bernhard.schoelkopf}@tuebingen.mpg.de Abstract

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information