K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.

Size: px

Start display at page:

Download "K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining."

Annabella Horn
5 years ago
Views:

1 K-means clustering Based in part on slides from textbook, slides of Susan Holmes December 2, / 1

2 K-means Outline K-means, K-medoids Choosing the number of clusters: Gap test, silhouette plot. Mixture modelling, EM algorithm. 2 / 1

3 K-means Figure : Simulated data in the plane, clustered into three classes (represented by red, blue and green) by the K-means clustering algorithm. From ESL. 3 / 1

4 K-means Algorithm (Euclidean) 1 For each data point, the closest cluster center (in Euclidean distance) is identified; 2 Each cluster center is replaced by the coordinatewise average of all data points that are closest to it. 3 Steps 1. and 2. are alternated until convergence. Algorithm converges to a local minimum of the within-cluster sum of squares. Typically one uses multiple runs from random starting guesses, and chooses the solution with lowest within cluster sum of squares. 4 / 1

5 K-means Non-Euclidean 1 We can replace the Euclidean distance squared with some other dissimilarity measure d, this changes the assignment rule to minimizing d.. is identified; 2 Each cluster center is replaced by the point that minimizes the sum of all pairwise d s. 3 Steps 1. and 2. are alternated until convergence. Algorithm converges to a local minimum of the within-cluster sum of d s. 5 / 1

6 K-means 14.3 Cluster Analysis Initial Centroids Initial Partition Iteration Number 2 Iteration Number 20 6 / 1

7 K-means Figure : Decrease in W (C), the within cluster sum of squares. 7 / 1

8 K-means Importance of Choosing Initial Centroids Figure : Another example of the iterations of K-means Tan,Steinbach, Kumar Introduction to 4/18/ / 1

9 K-means Two different K-means Clusterings Original Points Optimal Clustering Sub-optimal Clustering Tan,Steinbach, Kumar Introduction to 4/18/ / 1

10 The Iris data (K-means) 10 / 1

11 K-means Issues to consider Non-quantitative features, e.g. categorical variables, are typically coded by dummy variables, and then treated as quantitative. How many centroids k do we use? As k increases, both training and test error decrease! By test error, we mean the within-cluster sum of squares for data held-out when fitting the clusters... Possible to get empty clusters / 1

12 K-means Choosing K Ideally, the within cluster sum of squares flattens out quickly and we might choose the value of K at this elbow. We might also compare the observed within cluster sum of squares to a null model, like uniform on a box containing the data. This is the basis of the gap statistic. 12 / 1

13 K-means Unsupervised Learning log WK Gap Number of Clusters Number of Clusters FIGURE (Left panel): observed (green) and expected (blue) values of log W K for the simulated data of Figure Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W K.TheGapestimateK is the smallest K producing a gap within one standard deviation of the gap at K +1; 13 / 1 Figure : Blue curve is the W K for uniform, green curve is for data.

14 K-means Unsupervised Learning log WK Gap Number of Clusters Number of Clusters FIGURE (Left panel): observed (green) and expected (blue) values of log W K for the simulated data of Figure Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W K.TheGapestimateK is the smallest K producing a gap within one standard deviation of the gap at K +1; here K =2. 14 / 1 Figure : Largest gap is at 2, and the formal rule also takes into account the variability of estimating the gap.

15 K-medoid Algorithm Same as K-means, except that centroid is estimated not by the average, but by the observation having minimum pairwise distance with the other cluster members. Advantage: centroid is one of the observations useful, eg when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations. In R, the function pam implements this using Euclidean distance (not distance squared). 15 / 1

16 K-medoid Example: Country Dissimilarities This example comes from a study in which political science students were asked to provide pairwise dissimilarity measures for 12 countries. BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG BRA 5.58 CHI CUB EGY FRA IND ISR USA USS YUG ZAI / 1

17 K-medoid Figure : Left panel: dissimilarities reordered and blocked according to 3-medoid clustering. Heat map is coded from most similar (dark red) to least similar (bright red). Right panel: two-dimensional multidimensional scaling plot, with 3-medoid clusters indicated by different colors. 17 / 1

18 The Iris data: K-medoid (PAM) 18 / 1

19 K-medoid Silhouette For each case 1 i n, and set of cases C and dissimilarity d define d(i, C) = 1 #C d(i, j). j C Each case 1 i n is assigned to a cluster C l(i). The silhouette width is defined for each case as silhouette(i) = min j l(i) d(i, C j ) d(i, C l(i) ) max( d(i, C l(i) ), min j l(i) d(i, C j )). High values of silhouette indicate good clusterings. In R this is computable for pam objects. 19 / 1

20 The Iris data: silhouette plot for K-medoid 20 / 1

21 The Iris data: average silhouette width 21 / 1

22 Mixture modelling A soft clustering algorithm Imagine we actually had labels Y for the cases, then this would be a classification problem. For this classification problem, we might consider using a Gaussian discriminant model like LDA or QDA. We would then have to estimate (µ j, Σ j ) within each cluster. This would be easy... The next model is based on this realization / 1

23 Mixture modelling EM algorithm The abbreviation: E=expectation, M=maximization. A special case of an majorization-minimization algorithm and widely used throughout statistics. Particularly useful for situations in which there might be some hidden data that would make the problem easy / 1

24 Mixture modelling EM algorithm In this mixture model framework, we assume that the data were drawn from the same model as in QDA (or LDA). Y Multinomial(1, π) X Y = l N(µ l, Σ l ) (choose a label) Only, we have lost our labels and only observe X n p. The goal is still the same, to estimate π, (µ l, Σ l ) 1 l k. 24 / 1

25 Mixture modelling EM algorithm The algorithm keeps track of (µ l, Σ l ) 1 l k It also trackes guesses at Y in the form of Γ n k. Alternates between guessing Y and estimating π, (µ l, Σ l ) 1 l k. 25 / 1

26 Mixture modelling EM algorithm Initialize Γ, µ, Σ, π. Repeat For 1 t T, Estimate Γ These are called the responsibilities Estimate µ l, 1 k γ (t+1) il = µ (t+1) l = π (t) l K l=1 π(t) l (t) (t) φ µ l, Σ l (X i ) (t) (t) φ µ l, Σ l n i=1 γ(t+1) il X i n i=1 γ(t+1) il (X i ) This is just weighted average with weights γ (t+1) l. 26 / 1

27 Mixture modelling EM algorithm Estimate Σ l, 1 k Estimate π l Σ (t+1) l = n i=1 γ(t+1) il (X i µ (t+1) l n i=1 γ(t+1) il )(X i µ (t+1) l ) T This is just a weighted estimate of the covariance matrix with weights γ (t+1) l. π (t+1) l = 1 n n i=1 γ (t+1) il 27 / 1

28 Mixture modelling EM algorithm The quantities Γ are not really parameters, they are estimates of the random labels Y which were unobserved. If we had observed Y then the rows of Γ would be all zero except one entry, which would be 1. In this case, estimation of π l, µ l, Σ l is just as it would have been in QDA... The EM simply replaces the unobserved Y with a guess / 1

29 The Iris data: Gaussian mixture modelling 29 / 1

30 The Iris data (K-means) 30 / 1

31 The Iris data: silhouette plot for K-medoid 31 / 1

32 32 / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group