Alternative Clusterings: Current Progress and Open Challenges

1 Alternative Clusterings: Current Progress and Open Challenges
James Bailey
Department of Computer Science and Software Engineering
The University of Melbourne, Australia

2 Introduction
- Cluster analysis: group similar objects into clusters
- No single solution => equally important, different views or hypotheses regarding the data
- [Figure: Cluster by pose or individual?]

3 Motivations
- Multiple explanations of the data
  - user doesn't initially know what they want, needs options
  - different viewpoints of users
  - may be aiming to verify that multiple explanations do not exist (hypothesis verification, or for benchmarking clustering algorithms)
- Contrast with consensus clustering
- Every clustering should be accompanied by at least one alternative clustering!?

4 Alternative Clustering: Is it new?
- From one perspective, alternative clustering is not so new
- Generation of clusterings often goes like:
  - Generate and assess a clustering with 2 clusters
  - Generate and assess a clustering with 3 clusters
  - ...
  - Generate and assess a clustering with k clusters
- We now have k-1 alternative clusterings, but some of them may be very similar

5 Alternative Clustering Algorithms
- Growing number of approaches: ADFT, CAMI, COALA, Condens, Convolutional EM, Decorrelated k-means, MAXIMUS, Meta clustering, Multiview orthogonal clustering, NACI, Non-redundant clustering, ...
- Papers have appeared at KDD10, ICML10, SDM10, KDD09, SDM09, ICDM08, ICDM07, ICDM06, KDD05, ICDM04, ..., DMKD, KAIS, ...

6 How do these approaches differ?
- Task formulation:
  - Number of alternatives to generate
  - Sequential or simultaneous generation
- Mathematical basis:
  - Linear algebra
  - Information theory
  - Other objective functions

7 Sequential Alternative Clustering Generation
- Task: Given input clusterings {C1, ..., Cn}, generate an alternative clustering C, such that C is of high quality and C is different from {C1, ..., Cn}
- Important special case: n=1
- [Diagram: existing clusterings C1, C2, ..., Cn -> generate -> alternative C]

8 Simultaneous Alternative Clustering Generation
- Task: Simultaneously generate n clusterings {C1, ..., Cn}, such that each Ci is of high quality and the clusterings in each pair (Ci, Cj) are different from one another
- Important special case: n=2
- [Diagram: generate -> alternatives C1, C2, ..., Cn]

9 Sequential vs. Simultaneous
- Sequential (greedy)
  - Semi-supervised
  - For i=2 to n {generate the optimal alternative clustering with respect to the previous i-1 clusterings}
  - Locally optimal at each step
- Simultaneous (non-greedy)
  - Unsupervised
  - In parallel, generate optimal set of n clusterings
  - Globally optimal clustering collection, but might miss some strong clusterings which would be generated by a sequential technique
  - More difficult optimisation problem
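
The greedy sequential loop can be illustrated with a small brute-force sketch. Everything specific here is an assumption made for illustration and does not come from the slides: quality is measured as negative within-cluster sum of squares, similarity to earlier clusterings as the Rand index, the trade-off weight `lam` is a hypothetical parameter, and candidates are restricted to two-cluster splits of a four-point toy dataset. Real methods optimise rather than enumerate.

```python
from itertools import combinations, product


def sse(points, labels):
    """Within-cluster sum of squared distances to the cluster mean (lower is better)."""
    total = 0.0
    for k in set(labels):
        members = [p for p, l in zip(points, labels) if l == k]
        mean = [sum(c) / len(members) for c in zip(*members)]
        total += sum((a - m) ** 2 for p in members for a, m in zip(p, mean))
    return total


def rand_index(a, b):
    """Fraction of object pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def two_cluster_candidates(n):
    """Every split of n objects into two non-empty clusters (object 0 fixed in cluster 0)."""
    for rest in product((0, 1), repeat=n - 1):
        if any(rest):
            yield (0,) + rest


def next_alternative(points, previous, lam):
    """Pick the candidate with the best quality/similarity trade-off (assumed objective)."""
    def score(labels):
        similarity = max(rand_index(labels, prev) for prev in previous)
        return -sse(points, labels) - lam * similarity
    return max(two_cluster_candidates(len(points)), key=score)


def sequential_alternatives(points, given, n_alternatives, lam):
    """Greedy loop: each new clustering is chosen against all clusterings found so far."""
    previous = [tuple(given)]
    for _ in range(n_alternatives):
        previous.append(next_alternative(points, previous, lam))
    return previous[1:]


# Toy data: corners of a 6 x 5 rectangle; the given clustering splits left from right.
points = [(0, 0), (0, 5), (6, 0), (6, 5)]
left_right = (0, 0, 1, 1)
alternatives = sequential_alternatives(points, left_right, n_alternatives=1, lam=40.0)
print(alternatives)
```

On this toy data the single alternative returned is the top/bottom split `(0, 1, 0, 1)`. The result depends on `lam`: with a small weight the given clustering itself scores best, which is one reason the slides note that such methods may require parameter choices.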

10 Style of Algorithm
- Projection based
  - Project the data into an orthogonal subspace and then re-cluster
  - Appealing linear algebra formulation
  - Relatively efficient
  - Orthogonality may be too strict
- More complex objective function
  - Generate the alternative clustering, trading off dissimilarity and quality in the objective function
  - More flexible
  - May require parameter choices
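
A minimal sketch of the projection-based style, under simplifying assumptions that are not from the slides: exactly two clusters with distinct centroids, a tiny deterministic 2-means as the base clusterer, and "orthogonal subspace" taken to mean the subspace orthogonal to the line joining the two existing centroids. The helper names (`two_means`, `project_out`) are hypothetical, and published methods differ in how they choose the subspace.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iterations=20):
    """Tiny deterministic 2-means: seeds are the first point and the point farthest from it."""
    centres = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [0 if sq_dist(p, centres[0]) <= sq_dist(p, centres[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centres[k] = centroid(members)
    return labels


def project_out(points, labels):
    """Remove each point's component along the line joining the two cluster centroids.

    Assumes the two centroids are distinct, so the direction has non-zero length.
    """
    c0 = centroid([p for p, l in zip(points, labels) if l == 0])
    c1 = centroid([p for p, l in zip(points, labels) if l == 1])
    direction = [a - b for a, b in zip(c1, c0)]
    norm_sq = sum(d * d for d in direction)
    projected = []
    for p in points:
        coeff = sum(a * d for a, d in zip(p, direction)) / norm_sq
        projected.append(tuple(a - coeff * d for a, d in zip(p, direction)))
    return projected


# Toy data: corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
existing = two_means(points)                 # splits left from right
projected = project_out(points, existing)    # horizontal separation removed
alternative = two_means(projected)           # re-cluster what remains
print(existing, alternative)
```

Because the projection discards the separating direction entirely, the alternative can only use structure orthogonal to the first clustering, which is the sense in which orthogonality may be too strict.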

11 Simple Example
- Most existing techniques seem to work well (a canonical example)

12 Circle of Gaussians
- Techniques which trade off dissimilarity and quality are more likely to produce the second clustering
- Orthogonal projection doesn't work so well here

13 Other issues
- Evaluation: measuring quality/dissimilarity of alternatives
- Clustering setting:
  - Desired shape of clusters: spherical versus elongated, linear versus non-linear separation
  - low versus high dimensionality data
  - continuous versus discrete features
  - soft versus hard clusters
  - EM versus K-means versus hierarchical versus constraint based
- Number of clusters desired in each clustering

14 Alternative Clustering Evaluation
- Measuring dissimilarity:
  - Mathematical measures: Rand index, Jaccard index, normalised mutual information
- Measuring quality:
  - Internal validation measures: Dunn index, Davies-Bouldin index, silhouette width
  - External validation: synthetic examples
- Combine dissimilarity and quality into a single number, or present separately?
- Are these numbers useful?
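
The three measures named for dissimilarity can be sketched in plain Python as follows. Each is a similarity score, so a dissimilarity would be one minus the value. Two choices here are assumptions rather than anything stated in the slides: the Jaccard index is the pair-counting variant, and mutual information is normalised by the geometric mean of the two entropies, which is only one of several conventions in use.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def pair_counts(a, b):
    """Count object pairs by whether each clustering places them in the same cluster."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(a, b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(a, b):
    """Pairs together in both, over pairs together in at least one clustering."""
    both, only_a, only_b, _ = pair_counts(a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalised_mutual_information(a, b):
    """Mutual information divided by the geometric mean of the entropies (assumed convention)."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # At least one clustering has a single cluster; treat two trivial clusterings as identical.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


# Two clusterings of four objects that cut the data along different axes.
by_x = [0, 0, 1, 1]
by_y = [0, 1, 0, 1]
print(rand_index(by_x, by_y), jaccard_index(by_x, by_y),
      normalised_mutual_information(by_x, by_y))
```

For this pair the Jaccard index and the normalised mutual information are both zero while the Rand index is 1/3, because the Rand index also credits pairs that both clusterings separate. That difference is one concrete reason the choice of measure affects which alternative looks most dissimilar.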

15 Where are we?
- Good existing algorithms for generation of one or two alternatives
  - Sequential generation
  - Simultaneous generation
- Not yet deployed on very large datasets
- Validated using assorted benchmark datasets and internal metrics

16 Open Issues
- What's the killer application?
  - Deployment of alternative clusterings
  - Need convincing use cases where consensus clustering is limited
- Objective function and performance measures
- How many alternatives is enough?
- How many clusters should be in an alternative clustering? The same number as the original clustering?

17 Open Issues cont.
- How to find alternative subspace clusters (rather than clusterings)?
- Visualisation of alternative clusterings
- More focused alternatives: "Give me another clustering which is similar in these respects and different in these other respects to the previous clustering"

18 Moving Forward
- Central repository of code and canonical examples (synthetic and real)
- Make alternative clustering algorithms accessible
- Identify cases in the literature of missing alternative clusterings

19 Bibliography
- E. Bae, J. Bailey and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge Discovery.
- D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of ICML 10.
- X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings. Proc. of KDD.
- X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of SDM.
- Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc. of KDD.
- P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Proc. of SDM.
- I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM.
- Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc. of ICDM.
- E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. Proc. of ICDM.
- R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. In ICDM Conference.
- D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD.
- D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.

Generating a Diverse Set of High-Quality Clusterings Jeff M. Phillips, Parasaran Raman, and Suresh Venkatasubramanian School of Computing, University of Utah {jeffp,praman,suresh}@cs.utah.edu Abstract.