Clustering COMS 4771
1. Clustering
Unsupervised classification / clustering

Unsupervised classification
Input: $x_1, \dots, x_n \in \mathbb{R}^d$; target cardinality $k \in \mathbb{N}$.
Output: function $f \colon \mathbb{R}^d \to \{1, \dots, k\} =: [k]$.
Typical semantics: hidden subpopulation structure.

Clustering
Input: $x_1, \dots, x_n \in \mathbb{R}^d$; target cardinality $k \in \mathbb{N}$.
Output: partitioning of $x_1, \dots, x_n$ into $k$ groups.
Often done via unsupervised classification; "clustering" is often used synonymously with unsupervised classification.
Sometimes we also have a representative $c_j \in \mathbb{R}^d$ for each $j \in [k]$ (e.g., the average of the $x_i$ in the $j$th group): quantization.
Uses of clustering: feature representations

One-hot / dummy variable encoding of $f(x)$:
$\varphi(x) = (0, \dots, 0, 1, 0, \dots, 0)^{\mathsf{T}}$, with the single $1$ in position $f(x)$.
(Often used together with other features.)
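A minimal sketch of this encoding, assuming zero-indexed cluster labels (the slides use $[k] = \{1, \dots, k\}$); the function name and the use of NumPy are my own choices:

```python
import numpy as np

def one_hot(f_x: int, k: int) -> np.ndarray:
    """Encode a cluster label f(x) in {0, ..., k-1} as a length-k indicator vector."""
    phi = np.zeros(k)
    phi[f_x] = 1.0
    return phi

# Example: k = 5 clusters, point assigned to cluster 2 (zero-indexed).
print(one_hot(2, k=5))  # [0. 0. 1. 0. 0.]
```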
Uses of clustering: feature representations

Histogram representation
Cut up each $x_i \in \mathbb{R}^d$ into different parts $x_{i,1}, \dots, x_{i,m} \in \mathbb{R}^p$ (e.g., small patches of an image).
Cluster all the parts $x_{i,j}$: get $k$ representatives $c_1, \dots, c_k \in \mathbb{R}^p$.
Represent $x_i$ by a histogram over $\{1, \dots, k\}$ based on the assignments of $x_i$'s parts to the representatives.
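A sketch of the histogram step, assuming the parts of one example and the $k$ representatives are already available as arrays (the names parts and reps are illustrative):

```python
import numpy as np

def histogram_representation(parts: np.ndarray, reps: np.ndarray) -> np.ndarray:
    """parts: (m, p) array of one example's parts; reps: (k, p) representatives.
    Returns a length-k (normalized) histogram of nearest-representative assignments."""
    # Squared distances from each part to each representative: shape (m, k).
    d2 = ((parts[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)  # nearest representative per part
    return np.bincount(assignments, minlength=len(reps)) / len(parts)
```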
Uses of clustering: compression

Quantization
Replace each $x_i$ with its representative: $x_i \mapsto c_{f(x_i)}$.
Example: quantization at the image patch level.
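A sketch of the quantization step under the same conventions (the names quantize, X, and C are illustrative):

```python
import numpy as np

def quantize(X: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Replace each row of X (n, d) by its nearest representative from C (k, d)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return C[d2.argmin(axis=1)]
```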
2. k-means
k-means optimization problem

Input: $x_1, \dots, x_n \in \mathbb{R}^d$; target cardinality $k \in \mathbb{N}$.
Output: $k$ means $c_1, \dots, c_k \in \mathbb{R}^d$.
Objective: choose $c_1, \dots, c_k \in \mathbb{R}^d$ to minimize
$\sum_{i=1}^{n} \min_{j \in [k]} \|x_i - c_j\|_2^2$.
Natural assignment function: $f(x) := \arg\min_{j \in [k]} \|x - c_j\|_2^2$.
NP-hard, even if $k = 2$ or $d = 2$.
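A direct transcription of the objective, useful below for comparing solutions (the name kmeans_cost is my own):

```python
import numpy as np

def kmeans_cost(X: np.ndarray, C: np.ndarray) -> float:
    """k-means objective: sum over points of squared distance to the nearest mean.
    X: (n, d) data; C: (k, d) candidate means."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return d2.min(axis=1).sum()
```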
The easy cases

k-means clustering for k = 1
Problem: pick $c \in \mathbb{R}^d$ to minimize $\sum_{i=1}^{n} \|x_i - c\|_2^2$.
Solution: by the bias/variance decomposition, the best choice is $c = \frac{1}{n} \sum_{i=1}^{n} x_i$.

k-means clustering for d = 1
Dynamic programming in time $O(n^2 k)$; see the sketch below.

Lloyd's algorithm is a popular heuristic for general $k$ and $d$.
(There are many other algorithms for k-means clustering.)
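A minimal sketch of the $O(n^2 k)$ dynamic program for $d = 1$; the slides only state the running time, so the prefix-sum trick and all names here are my own choices:

```python
import numpy as np

def kmeans_1d(x: np.ndarray, k: int) -> float:
    """Exact 1-D k-means by dynamic programming, O(n^2 k) time.
    Returns the optimal k-means cost for the points in x."""
    x = np.sort(x)  # an optimal clustering uses contiguous runs of sorted points
    n = len(x)
    # Prefix sums let us evaluate the 1-cluster cost of any segment in O(1):
    # cost = sum of x^2 minus (sum of x)^2 / count, over the segment.
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def seg_cost(i: int, j: int) -> float:
        """Cost of putting x[i..j] (inclusive, 0-indexed) into one cluster."""
        cnt = j - i + 1
        tot = s1[j + 1] - s1[i]
        return (s2[j + 1] - s2[i]) - tot * tot / cnt

    # D[m][j] = optimal cost of clustering x[0..j] into m clusters.
    D = np.full((k + 1, n), np.inf)
    for j in range(n):
        D[1][j] = seg_cost(0, j)
    for m in range(2, k + 1):
        for j in range(m - 1, n):
            # Last cluster is x[i..j]; the rest is an (m-1)-clustering of x[0..i-1].
            D[m][j] = min(D[m - 1][i - 1] + seg_cost(i, j) for i in range(m - 1, j + 1))
    return D[k][n - 1]
```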
Lloyd's algorithm

Start with some initial means $c_1, \dots, c_k$.
Repeat:
"Partition" $x_1, \dots, x_n$ into $k$ clusters $C_1, \dots, C_k$ based on distance to the current means: assign $x_i$ to cluster $\arg\min_{j \in [k]} \|x_i - c_j\|_2^2$, for each $i \in [n]$, breaking ties according to any fixed rule.
"Update means": $c_j := \mathrm{mean}(C_j)$, for each $j \in [k]$.
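A minimal sketch of Lloyd's algorithm; the empty-cluster handling and the convergence check are my own choices, as the slides leave both unspecified:

```python
import numpy as np

def lloyd(X: np.ndarray, C0: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """Lloyd's algorithm: alternate between assigning points to the nearest
    mean and recomputing each mean. X: (n, d) data; C0: (k, d) initial means."""
    C = C0.copy()
    for _ in range(n_iter):
        # Assignment step: nearest current mean (argmin breaks ties by a fixed rule).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Update step: each mean becomes the average of its cluster
        # (an empty cluster keeps its old mean; the slides do not specify this case).
        newC = np.array([X[z == j].mean(axis=0) if np.any(z == j) else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):  # means stopped moving, so assignments won't change
            break
        C = newC
    return C
```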
Sample run of Lloyd's algorithm

[Figure: panels (a) through (i) showing a sample run on 2-D data.]
(a) Arbitrary initialization of $c_1$ and $c_2$.
(b) Iteration 1: optimize assignments.
(c) Iteration 1: optimize means $c_j$.
(d) Iteration 2: optimize assignments.
(e) Iteration 2: optimize means $c_j$.
(f) Iteration 3: optimize assignments.
(g) Iteration 3: optimize means $c_j$.
(h) Iteration 4: optimize assignments.
(i) Iteration 4: optimize means $c_j$.
Initializing Lloyd's algorithm

Basic idea: choose initial centers to have good coverage of the data points.

Farthest-first traversal
For $j = 1, \dots, k$: pick $c_j \in \mathbb{R}^d$ from among $x_1, \dots, x_n$, farthest from the previously chosen $c_1, \dots, c_{j-1}$. ($c_1$ is chosen arbitrarily.)
But this can be thrown off by outliers...

$D^2$ sampling (a.k.a. "k-means++")
For $j = 1, \dots, k$: randomly pick $c_j \in \mathbb{R}^d$ from among $x_1, \dots, x_n$ according to the distribution
$P(x_i) \propto \min_{l = 1, \dots, j-1} \|x_i - c_l\|_2^2$.
(Uniform distribution when $j = 1$.)
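A sketch of $D^2$ sampling; it assumes the data points are distinct enough that the distances do not all vanish, and the names are my own:

```python
import numpy as np

def dsquared_init(X: np.ndarray, k: int, rng=None) -> np.ndarray:
    """D^2 sampling (k-means++): each new center is a data point drawn with
    probability proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]  # c_1: uniform over the data points
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        p = d2 / d2.sum()  # P(x_i) proportional to min_l ||x_i - c_l||^2
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)
```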
Choosing k

Usually by hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task).
Heuristic: find a large gap between the $(k-1)$-means cost and the $k$-means cost.
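A usage example of the gap heuristic, reusing the kmeans_cost, dsquared_init, and lloyd sketches above and assuming X holds the $(n, d)$ data:

```python
# Cost of a (heuristic) k-means solution for each k, then the drop at each step.
costs = {k: kmeans_cost(X, lloyd(X, dsquared_init(X, k))) for k in range(1, 11)}
gaps = {k: costs[k - 1] - costs[k] for k in range(2, 11)}
print(max(gaps, key=gaps.get))  # the k right after the largest drop in cost
```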
3. Hierarchical clustering
Clustering at multiple scales

k = 2 or k = 3?
Hierarchical clustering: encode clusterings for all values of $k$ in a tree.
Caveat: not always possible.
Example: phylogenetic tree

[Figure: a phylogenetic tree.]
Agglomerative clustering

Start with every point $x_i$ in its own cluster.
Repeatedly merge the closest pair of clusters, until only one cluster remains (see the sketch below).

Single-linkage: $\mathrm{dist}(C, C') := \min_{x \in C,\, x' \in C'} \|x - x'\|_2$.
Complete-linkage: $\mathrm{dist}(C, C') := \max_{x \in C,\, x' \in C'} \|x - x'\|_2$.
Many variants. E.g., Ward's average linkage:
$\mathrm{dist}(C, C') := \frac{|C| \cdot |C'|}{|C| + |C'|} \|\mathrm{mean}(C) - \mathrm{mean}(C')\|_2^2$.
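A naive sketch of agglomerative clustering with single or complete linkage (all names are my own choices; practical implementations maintain a distance matrix incrementally instead of recomputing it at every merge):

```python
import numpy as np

def agglomerative(X: np.ndarray, linkage: str = "single"):
    """Naive agglomerative clustering. Returns the sequence of merges
    (each a pair of index lists), from n singleton clusters down to one."""
    clusters = [[i] for i in range(len(X))]
    merges = []

    def dist(A, B):
        # Pairwise distances between the points of clusters A and B.
        d = np.linalg.norm(X[A][:, None, :] - X[B][None, :, :], axis=2)
        return d.min() if linkage == "single" else d.max()  # "complete" otherwise

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [clusters[a] + clusters[b]]
    return merges
```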
Key takeaways

Uses of clustering: unsupervised classification ("hidden subpopulations"), quantization, ...
k-means clustering: a popular objective for clustering and quantization.
Lloyd's algorithm: alternating optimization; needs good initialization.
Hierarchical clustering: clustering at multiple levels of granularity.