Theoretical Foundations of Clustering. Margareta Ackerman


1 Theoretical Foundations of Clustering Margareta Ackerman

2 The Theory-Practice Gap Clustering is one of the most widely used tools for exploratory data analysis: identifying target markets, constructing phylogenetic trees, facility allocation for city planning, personalization...

3 The Theory-Practice Gap "While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood" (Wright). "There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model" (Kleinberg, 2002).

4 Inherent obstacles: Clustering is ill-defined Clustering aims to organize data into groups of similar items, but beyond that there is very little consensus on the definition of clustering.

5 Clustering algorithms: A few classical examples How can we partition data into k groups?

6 Clustering algorithms: A few classical examples How can we partition data into k groups? Use Kruskal's algorithm for MST (single-linkage).

7 Clustering algorithms: A few classical examples How can we partition data into k groups? Use Kruskal's algorithm for MST (single-linkage). Find the minimum cut (motivates spectral clustering methods).

8 Clustering algorithms: A few classical examples How can we partition data into k groups? Use Kruskal's algorithm for MST (single-linkage). Find the minimum cut (motivates spectral clustering methods). Find k centers that minimize the average distance to a center (k-median, k-means, ...). Many more...
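To make the Kruskal connection concrete, here is a minimal sketch of single-linkage clustering into k groups (my own illustration, not from the slides; plain Python with only the standard library, and the function name single_linkage is hypothetical): sort all pairwise distances, merge components in Kruskal's order, and stop once k components remain, which is equivalent to deleting the k-1 heaviest edges of the minimum spanning tree.

from itertools import combinations

def single_linkage(points, k, dist):
    """Single linkage as Kruskal's MST algorithm: process edges in
    increasing order of length and merge components, stopping once
    exactly k components remain."""
    parent = list(range(len(points)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:                # cheapest edges first
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                  # edge joins two different components
            parent[ri] = rj
            components -= 1
    clusters = {}                     # group point indices by component root
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# toy example: two well-separated groups on the real line
pts = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(single_linkage(pts, 2, lambda a, b: abs(a - b)))   # [[0, 1, 2], [3, 4, 5]]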

9 Inherent obstacles: Clustering is inherently ambiguous There are many clustering algorithms with different (often implicit) objective functions. Different algorithms have radically different input-output behavior. There may be multiple reasonable clusterings. There is usually no ground truth.

10 Different input-output behavior of clustering algorithms

11 Different input-output behavior of clustering algorithms

12 Progress despite these obstacles: Overview
Axioms of clustering quality measures (Ackerman & Ben-David, 2008)
Study and compare notions of clusterability (Ackerman & Ben-David, 2009)
Characterizing linkage-based algorithms (Ackerman, Ben-David, and Loker, 2010)
Framework for clustering algorithm selection (Ackerman, Ben-David, and Loker, 2010)
Characterizing hierarchical linkage-based algorithms (Ackerman & Ben-David, 2011)
Properties of phylogenetic algorithms (Ackerman, Brown, and Loker, 2012)
Properties in the weighted clustering setting (Ackerman, Ben-David, Branzei, and Loker, 2012)
Clustering oligarchies (Ackerman, Ben-David, Loker, and Sabato, 2013)
Perturbation robust clustering (Ackerman & Schulman, 2013)
Online clustering (Ackerman & Dasgupta, 2014)


14 Outline Axiomatic treatment of clustering Clustering algorithm selection Characterizing Linkage-Based clustering


16 Formal setup For a finite domain set X, a distance function d defines the distance between pairs of domain points. A clustering function maps Input: a distance function d over X, Output: a partition (clustering) of X.

17 Kleinberg's axioms
Scale Invariance: f(c·d) = f(d) for all d and all strictly positive c.
Consistency: If d' equals d, except for shrinking distances within clusters of f(d) or stretching between-cluster distances, then f(d') = f(d).
Richness: For any clustering C of X, there exists a distance function d over X so that f(d) = C.

18 Theorem [Kleinberg, 02]: These axioms are inconsistent. Namely, no function can satisfy these three axioms.

19 Theorem [Kleinberg, 02]: These axioms are inconsistent. Namely, no function can satisfy these three axioms. Why are axioms that seem to capture our intuition about clustering inconsistent??

20 Theorem [Kleinberg, 02]: These axioms are inconsistent. Namely, no function can satisfy these three axioms. Why are axioms that seem to capture our intuition about clustering inconsistent?? Our answer: The formalization of these axioms is stronger than the intuition they intend to capture. We express that same intuition in an alternative framework, and achieve consistency.

21 Clustering quality measures How good is this clustering? Clustering-quality measures quantify the quality of clusterings.

22 Defining clustering quality measures A clustering-quality measure is a function m(dataset, clustering) ∈ R satisfying some properties that make this function a meaningful clustering quality measure. What properties should it satisfy?

23 Rephrasing Kleinberg's axioms for clustering quality measures
Scale Invariance: m(C, d) = m(C, λ·d) for all C, d, and all strictly positive λ.
Richness: For any clustering C of X, there exists a distance function d over X so that C = argmax_C' m(C', d).

24 Consistency: If d' equals d, except for shrinking distances within clusters of C or stretching between-cluster distances, then m(C, d) ≤ m(C, d').

25 Major gain: consistency of the new axioms. Theorem [Ackerman & Ben-David, NIPS 08]: Consistency, scale invariance, and richness for clustering quality measures form a consistent set of requirements. Dunn's index ('73): (min over x, y in different clusters of d(x, y)) / (max over x, y in the same cluster of d(x, y)), i.e., the minimum between-cluster distance divided by the maximum within-cluster distance. This clustering quality measure satisfies consistency, scale-invariance, and richness.
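To make the formula concrete, here is a small sketch that computes Dunn's index for a clustering given a pairwise distance function (my own illustration, not from the talk; the helper name dunn_index is hypothetical). Larger values indicate clusters that are well separated relative to their diameters.

from itertools import combinations

def dunn_index(clusters, dist):
    """Dunn's index: minimum distance between points in different clusters
    divided by the maximum distance between points in the same cluster."""
    between = min(dist(x, y)
                  for a, b in combinations(clusters, 2)
                  for x in a for y in b)
    within = max(dist(x, y)
                 for c in clusters
                 for x, y in combinations(c, 2))
    return between / within

d = lambda a, b: abs(a - b)
tight = [[0.0, 0.1, 0.2], [5.0, 5.1, 5.2]]   # compact, well separated
mixed = [[0.0, 2.5, 5.2], [0.1, 2.6, 5.0]]   # overlapping groups
print(dunn_index(tight, d))   # high score
print(dunn_index(mixed, d))   # low score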

26 Additional measures satisfying our axioms C-index (Dalrymple-Alford, 1970), Gamma (Baker & Hubert, 1975), Adjusted ratio of clustering (Roenker et al., 1971), D-index (Dalrymple-Alford, 1970), Modified ratio of repetition (Bower, Lesgold, and Tieman, 1969), Variations of Dunn's index (Bezdek and Pal, 1998), Strict separation (Balcan, Blum, and Vempala, 2008). And many more...

27 Why is the quality measure formulation more faithful to intuition? In the earlier setting of clustering functions, consistent changes to the underlying distance should not create any new contenders for the best clustering of the data: a clustering function that satisfies Kleinberg's Consistency cannot output a new clustering C' after a consistent change.

28 Why is the quality measure formulation more faithful to intuition? In the setting of clustering-quality measures, consistency requires only that the quality of clustering C not get worse. A different clustering can have better quality than the original.

29 Outline Axiomatic treatment of clustering Clustering algorithm selection Characterizing Linkage-Based clustering

30 Clustering algorithm selection There is a wide variety of clustering algorithms, which can produce very different clusterings. How should a user decide which algorithm to use for a given application?

31 Clustering algorithm selection Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc. There is inadequate emphasis on input-output behavior.

32 Our framework for algorithm selection We propose a framework that lets a user utilize prior knowledge to select an algorithm: identify properties that distinguish between the input-output behaviors of different clustering paradigms. The properties should be: 1) intuitive and user-friendly, and 2) useful for distinguishing clustering algorithms. Ex. Kleinberg's axioms, order invariance, etc.
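One practical reading of this framework is that each property is a predicate on an algorithm's input-output behavior, so it can at least be spot-checked empirically before committing to an algorithm. The sketch below is my own illustration (it assumes numpy and scikit-learn are installed, and the helper names as_partition and check_scale_invariance are hypothetical): it tests scale invariance on one dataset by comparing the partitions produced for the original data and for the data with all distances multiplied by a constant, treating partitions as equal up to relabeling. Passing such a check on a single dataset is evidence, not a proof, that the property holds.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def as_partition(labels):
    """Canonical form of a labeling: a set of sets of point indices,
    so that comparisons ignore how cluster labels are permuted."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return frozenset(frozenset(g) for g in groups.values())

def check_scale_invariance(cluster_fn, X, scale=3.0):
    """Empirical spot-check: does multiplying all distances by a constant
    (here, scaling the points) leave the output partition unchanged?"""
    return as_partition(cluster_fn(X)) == as_partition(cluster_fn(scale * X))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])

single = lambda data: AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(data)
print(check_scale_invariance(single, X))   # expected True: single linkage is scale invariant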

33 Property-based classification for fixed k (Ackerman, Ben-David, and Loker, NIPS 2010) [Table: which of the ten properties (locality, outer consistency, inner consistency, consistency, refinement preserving, order invariance, richness, outer richness, representation independence, and scale invariance) are satisfied by single linkage, average linkage, complete linkage, k-means, k-medoids, min-sum, ratio-cut, and normalized cut.]

34 Kleinberg's axioms for fixed k [Same property table as slide 33.] Kleinberg's axioms are consistent when k is given.

35 Single-linkage satisfies everything: single linkage satisfies all ten properties in the table. Recall: single linkage is Kruskal's algorithm for Minimum Spanning Tree. It's not a good clustering algorithm in practice!

36 Classification in Weighted Setting (Ackerman, Ben-David, Branzei, and Loker, AAAI 2012) Weight robust: ignores element duplicates. Weight sensitive: the output can always be changed by duplicating some of the data. Weight considering: element duplication affects the output on some data sets, but not others.

37 Classification in Weighted Setting (Ackerman, Ben-David, Branzei, and Loker, AAAI 2012)
Weight robust: ignores element duplicates. Weight sensitive: the output can always be changed by duplicating some of the data. Weight considering: element duplication affects the output on some data sets, but not others.
                   | Partitional                           | Hierarchical
Weight Robust      | Min Diameter, k-center                | Single Linkage, Complete Linkage
Weight Sensitive   | k-means, k-medoids, k-median, min-sum | Ward's Method, Bisecting k-means
Weight Considering | Ratio Cut                             | Average Linkage

38 Using property-based classification to choose an algorithm Enables users to identify a suitable algorithm without the overhead of executing many algorithms. This framework helps understand the behavior of existing and new algorithms. The long-term goal is to construct a property-based classification for many useful clustering algorithms.

39 Outline Axiomatic treatment of clustering Clustering algorithm selection Characterizing linkage-based clustering

40 Characterizing Linkage-Based Clustering We characterize a popular family of clustering algorithms, called linkage-based. We show that 1) all linkage-based algorithms satisfy two natural properties, and 2) no algorithm outside that family satisfies these properties.

41 Formal setting: Dendrograms and clusterings. C_i is a cluster in a dendrogram D if there exists a node in the dendrogram so that C_i is the set of its leaf descendants.

42 Formal setting: Dendrograms and clusterings. C = {C_1, ..., C_k} is a clustering in a dendrogram D if C_i is a cluster in D for all 1 ≤ i ≤ k, and the clusters are disjoint.

43 Formal setting: Hierarchical clustering algorithm. A hierarchical clustering algorithm A maps Input: a data set X with a distance function d, Output: a dendrogram of X.

44 Linkage-based algorithms Create a leaf node for every element of X.

45 Linkage-based algorithms Create a leaf node for every element of X. Repeat the following until a single tree remains: Consider the clusters represented by the remaining root nodes.

46 Linkage-based algorithms Create a leaf node for every element of X. Repeat the following until a single tree remains: Consider the clusters represented by the remaining root nodes. Merge the closest pair of clusters by assigning them a common parent node.

47 Linkage-Based Algorithms Create a leaf node for every element of X. Repeat the following until a single tree remains: Consider the clusters represented by the remaining root nodes. Merge the closest pair of clusters by assigning them a common parent node. Closest according to what?

48 Examples of linkage-based algorithms The choice of linkage function distinguishes between different linkage-based algorithms. Examples of common linkage functions: Single-linkage: min between-cluster distance. Average-linkage: average between-cluster distance. Complete-linkage: max between-cluster distance.
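The generic procedure from the previous slides can be written down directly, with the linkage function as a parameter. The sketch below is my own illustration in plain Python (the function name linkage_based is hypothetical): it maintains the current root clusters, repeatedly merges the pair that is closest under the chosen linkage function, and records the merge order, which is a flat description of the dendrogram. Swapping single, average, or complete switches between the three algorithms named above.

from itertools import combinations

# three common linkage functions: each maps two clusters and a distance function to a number
single   = lambda a, b, d: min(d(x, y) for x in a for y in b)
average  = lambda a, b, d: sum(d(x, y) for x in a for y in b) / (len(a) * len(b))
complete = lambda a, b, d: max(d(x, y) for x in a for y in b)

def linkage_based(points, d, linkage):
    """Generic linkage-based algorithm: start with singleton clusters and
    repeatedly merge the closest pair under `linkage` until one tree remains.
    Returns the list of merges, in order (the dendrogram, flattened)."""
    clusters = [(p,) for p in points]          # leaf nodes
    merges = []
    while len(clusters) > 1:
        # pick the pair of current root clusters with the smallest linkage value
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], d))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]     # give the pair a common parent
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

pts = [0.0, 1.0, 2.0, 10.0, 11.0]
d = lambda x, y: abs(x - y)
for left, right in linkage_based(pts, d, average):
    print(left, "+", right)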

49 Characterizing Linkage-Based Clustering: Partitional Setting [Same property table as slide 33.]

50 Characterizing Linkage-Based Clustering (Ackerman, Ben-David, and Loker, COLT 2010) [Table rows for single, average, and complete linkage from slide 33.] The 2010 characterization applies in the partitional setting, by using the k-stopping criterion. This characterization distinguished linkage-based algorithms from other partitional techniques.

51 Characterizing Linkage-Based Clustering in the Hierarchical Setting (Ackerman & Ben-David, IJCAI 11) We propose two intuitive properties that uniquely identify hierarchical linkage-based clustering algorithms. We show that common hierarchical algorithms, including bisecting k-means, cannot be simulated by any linkage-based algorithm.

52 Locality: D = A(X, d), D' = A(X', d), where X' is a cluster of D. If we select a cluster from a dendrogram and run the algorithm on the data in this cluster, we obtain a result that is consistent with the original dendrogram.

53 Outer consistency: C is a clustering in A(X, d), and (X, d') is an outer-consistent change of (X, d) with respect to C. If A is outer-consistent, then A(X, d') will include the clustering C.

54 Theorem [Ackerman & Ben-David, IJCAI 11]: A hierarchical clustering algorithm is Linkage-Based if and only if it is Local and Outer-Consistent.

55 Easy direction of proof: Every linkage-based hierarchical clustering algorithm is Local and Outer-Consistent. The proof is quite straightforward.

56 Interesting direction of proof: If A is Local and Outer-Consistent, then A is linkage-based. To prove this direction we first need to formalize linkage-based clustering, by formally defining what a linkage function is.

57 What do we expect from a linkage function? A linkage function ℓ : {(X_1, X_2, d) : d is a distance function over X_1 ∪ X_2} → R+ satisfies the following. Monotonicity: if we increase distances that go between X_1 and X_2, then ℓ(X_1, X_2, d) does not decrease. Representation independence: ℓ does not change if we re-label the data.

58 Proof Sketch. Recall the direction: If A satisfies Outer-Consistency and Locality, then it is linkage-based. Goal: define a linkage function ℓ so that the linkage-based clustering based on ℓ outputs A(X, d) (for every X and d).

59 Proof Sketch. Define an operator <_A: (X, Y, d_1) <_A (Z, W, d_2) if, when we run A on (X ∪ Y ∪ Z ∪ W, d), where d extends d_1 and d_2, X and Y are merged before Z and W in A's output. Prove that <_A can be extended to a partial ordering. Use the ordering to define ℓ.

60 Sketch of proof (continued): Show that <_A is a partial ordering. We show that <_A is cycle-free. Lemma: Given a hierarchical algorithm A that is Local and Outer-Consistent, there exists no finite sequence so that (X_1, Y_1, d_1) <_A ... <_A (X_n, Y_n, d_n) <_A (X_1, Y_1, d_1).

61 Proof Sketch (continued): By the above lemma, the transitive closure of <_A is a partial ordering. This implies that there exists an order-preserving function ℓ that maps pairs of data sets to R+. It can be shown that ℓ satisfies the properties of a linkage function.

62 Future Directions Identify properties that are significant for specific clustering applications (some previous work in this direction by Ackerman, Brown, and Loker (ICCABS, 2012)). Analyze clustering algorithms in alternative settings, such as categorical data, fuzzy clustering, and using a noise bucket. Online clustering. Axiomatize clustering functions.
