Stat 321: Transposable Data
Clustering
Art B. Owen (Stanford Statistics)
Clustering

Given n objects with d attributes, place them (the objects) into groups.
- A form of unsupervised learning: unsupervised because there is no response.
- Has a long history, at least as old as taxonomy.
- Raises all the vexing issues of an exploratory method. The curse of exploration.
- We'll look at it as a precursor to bi-clustering of objects and attributes.
Clustering

Given n points in R^d:
- Do they clump together into k clusters?
- If so, how do we find the clusters, the boundaries, and the cluster identities?

Outcomes
- In the best case, a clustering can reveal the presence of a new categorical variable, e.g. types of diabetes.
- Other times there are no clusters, just a smear.
- Or we find clusters but not their meanings.

Key idea
Items within a cluster are more similar (less distant) to each other than items from different clusters.
k-means

Algorithm
1. Pick k points z_1, ..., z_k in R^d, then repeat:
   - For i = 1, ..., n put g(i) = \arg\min_{1 \le j \le k} \|x_i - z_j\|^2
   - For j = 1, ..., k put z_j = avg{ x_i : g(i) = j }

Issues
- Handle averaging over an empty set.
- Pick a stopping rule (it must converge, or at worst cycle).
- The answer depends on the starting points.
- Hard to pick k (at least 30 methods proposed by 1985).

Properties
- Usually rapid convergence.
- An iteration can be done in O(nkd) time. No n^2 or d^2 or k^2. (The k log(k) to sort is negligible.)
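A minimal sketch in R using the built-in kmeans (whose default algorithm is Hartigan-Wong); the simulated data and the choice k = 3 are illustrative, not from the slides.

    set.seed(1)
    x <- matrix(rnorm(200 * 2), ncol = 2)      # illustrative: n = 200 points in R^2
    km <- kmeans(x, centers = 3, nstart = 25)  # restarts hedge against bad starting points
    table(km$cluster)                          # cluster sizes
    km$centers                                 # the centers z_j
    km$tot.withinss                            # the criterion SS (next slide)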
k-means ctd

There's a criterion
Each step minimizes

    SS = \sum_{i=1}^n \|x_i - z_{g(i)}\|^2

over its free variables. Sadly, it is not convex.
- Hartigan & Wong (1979) get solutions such that no g(i) change reduces SS.
- Maybe penalizing \sum_{i < i'} \|Z_i - Z_{i'}\|_2 will work (Y. She, F. Bach).

x-means (Pelleg and Moore)
- Efficient lookups to get k into the tens of thousands.
- Picks k via AIC or BIC along the way.
- Defining the true k is problematic (galaxies in clusters in super-clusters).
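The criterion SS, evaluated directly in R for the kmeans fit from the earlier sketch; it should match km$tot.withinss.

    SS <- sum(rowSums((x - km$centers[km$cluster, ])^2))  # sum_i ||x_i - z_{g(i)}||^2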
Variations

Change the distance
- L_2 to L_1 to L_p.
- Changes the mean to the median to an arg min.

Change z_k: medoids
- Require z_k = x_{i(k)} for some i(k).
- Avoids using a z with 2.3 kids, 30% pregnant, 10% male.

PAM: partitioning around medoids
- Minimize \sum_i \min_j D(x_i, z_j).
- Slow. Uses only the D_{ij} values.
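A usage sketch with pam from the cluster package (not named on the slide, but the standard R implementation of PAM); x is the illustrative data from the k-means sketch.

    library(cluster)
    pm <- pam(x, k = 3)    # k-medoids: every center is one of the data points
    pm$medoids             # the medoids x_{i(k)}
    pm$clustering          # the assignments g(i)
    # pam also runs from dissimilarities alone: pam(dist(x, "manhattan"), k = 3)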
What is a cluster?

Defining issues
- The scale of the variables matters.
- The subset of variables matters too. Do whales go with penguins or with elephants? Why does my breakfast cereal cluster with pure sugar?
- The number of density bumps need not match the number of mixture components.

Scaling data

    z_{ij} = (x_{ij} - m_j) / s_j

- m, s are the mean & stdev (so z_j has mean 0, variance 1), or
- the min & range (so 0 <= z_{ij} <= 1), or
- the median & MAD (for robustness).
NB: the transformation is defined column-wise [vs row-wise or simultaneous].
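The three standardizations sketched in R, column-wise throughout as the slide notes; x is the illustrative data from before.

    z1 <- scale(x)                                                # (x - mean) / sd
    z2 <- apply(x, 2, function(v) (v - min(v)) / diff(range(v)))  # into [0, 1]
    z3 <- scale(x, center = apply(x, 2, median),
                   scale  = apply(x, 2, mad))                     # robust: median & MAD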
Weighting

Weight the variables via w_j >= 0:

    d_{ii'} = \sum_{j=1}^d w_j |x_{ij} - x_{i'j}| = \sum_{j=1}^d |\tilde x_{ij} - \tilde x_{i'j}|,  with  \tilde x_{ij} = x_{ij} w_j

Weighting and scaling are equivalent (for L_p distances).

Automatic scaling is not always sensible. Variables in the same units should sometimes get the same scaling even if they have different variances.
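A quick check of the equivalence in R for the L_1 case; the weights w are arbitrary illustrative values.

    w  <- c(2, 0.5)                                        # one weight per column
    d1 <- dist(sweep(x, 2, w, `*`), method = "manhattan")  # distance on rescaled columns
    # each term |w_j x_{ij} - w_j x_{i'j}| equals w_j |x_{ij} - x_{i'j}|,
    # so d1 is exactly the weighted L1 distance on the original columns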
Distances

n(n-1)/2 interpoint distances d_{ii'} = dist(x_i, x_{i'}). Usually:
1. d_{ii'} >= 0
2. d_{ii} = 0
3. d_{ii'} = d_{i'i}

- A metric distance has d_{ij} <= d_{ik} + d_{jk} (triangle inequality).
- A Euclidean distance has d_{ij} = \|z_i - z_j\| for some points z_i = z(x_i).
- An ultrametric distance has d_{ij} <= max(d_{ik}, d_{jk}). (More later.)

Other distances
- Canberra distance: \sum_{j=1}^d |x_{ij} - x_{i'j}| / (|x_{ij}| + |x_{i'j}|)  (with 0/0 = 0).
- Angular or cosine distance (dogs & cats, wolves & tigers).
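The common choices via R's dist; the cosine distance is computed by hand, since dist does not provide it.

    d_euc <- dist(x)                        # Euclidean
    d_man <- dist(x, method = "manhattan")  # L1
    d_can <- dist(x, method = "canberra")   # Canberra
    xn    <- x / sqrt(rowSums(x^2))         # unit-length rows
    d_cos <- as.dist(1 - tcrossprod(xn))    # 1 - cosine similarity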
Similarities

The opposite of a distance: large d means small S. E.g. S_{ii'} = 1 - d_{ii'}, or 1/d_{ii'}, or d_{ii'} = S_{ii} - S_{ii'}.

Correlation-type similarities

    S_{ii'} = \frac{\sum_{j=1}^d x_{ij} x_{i'j}}{\sqrt{\sum_j x_{ij}^2} \sqrt{\sum_j x_{i'j}^2}},  or

    S_{ii'} = \frac{\sum_{j=1}^d (x_{ij} - \bar x_j)(x_{i'j} - \bar x_j)}{\sqrt{\sum_j (x_{ij} - \bar x_j)^2} \sqrt{\sum_j (x_{i'j} - \bar x_j)^2}},  or other centerings.
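In R, row-by-row correlations give this similarity directly; converting via d = 1 - S is one of the conversions above.

    S      <- cor(t(x))        # S_{ii'} = correlation of rows i and i'
    d_corr <- as.dist(1 - S)   # a distance via d = 1 - S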
Similarities

Some equalities are more equal than others:
1. i and i' are both Nobel laureates (unusually strong similarity).
2. i and i' are both over 21 years old (mild similarity).
3. i and i' are both not Nobel laureates (barely similar at all).

We can handle 1 vs 2 by weighting the variables. But 1 vs 3 is trickier (same variable).
Binary similarity measures

For p binary features, cross-tabulate the values for i and i' in a 2x2 table of counts:

              i' = 1   i' = 0
    i = 1       a        b
    i = 0       c        d

We want to count a more than d.

Generic measure (alpha > 0, delta >= 0):

    S_{ii'} = \frac{\alpha a + \delta d}{\alpha a + b + c + \delta d}

(\alpha, \delta) and (\alpha', \delta') give the same ranking if \alpha \delta' = \alpha' \delta.

Specific measures:

    (a + d) / (a + b + c + d)          simple matching
    a / (a + b + c)                    Jaccard-Tanimoto (taken as 1 when a + b + c = 0)
    a / (a + b + c + d)                Russel-Rao
    2(a + d) / (2(a + d) + b + c)      Sokal-Sneath
    a / (a + 2(b + c))                 Sokal-Sneath II

Janowitz recommends Jaccard or Russel-Rao, converting via d = 1 - S.
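A small R helper computing a few of these measures from two 0/1 vectors; the function name and example vectors are mine.

    bin_sim <- function(u, v) {
      a <- sum(u & v); b <- sum(u & !v); cc <- sum(!u & v); d <- sum(!u & !v)
      c(matching   = (a + d) / (a + b + cc + d),
        jaccard    = if (a + b + cc == 0) 1 else a / (a + b + cc),
        russel_rao = a / (a + b + cc + d))
    }
    bin_sim(c(1, 1, 0, 0, 1), c(1, 0, 0, 0, 1))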
Agglomerative clustering

General
- Start with n clusters of one element each.
- Repeat:
  1. Find the closest two clusters.
  2. Merge them into a new cluster.
- Until only one cluster remains.

We need point-to-cluster and cluster-to-cluster distances.
Flavors of agglomerative clustering

Single linkage: gets "chaining"; friends of friends.

    d(C_1, C_2) = \min_{i \in C_1} \min_{j \in C_2} d_{ij}

Complete linkage: gets dense, nearly spherical clusters.

    d(C_1, C_2) = \max_{i \in C_1} \max_{j \in C_2} d_{ij}

Average linkage: a compromise.

    d(C_1, C_2) = \frac{1}{|C_1| |C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d_{ij}
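The three linkages in R via hclust, assuming a dissimilarity object d (for instance d <- dist(x)); cutree then extracts a flat clustering at any level.

    d <- dist(x)
    hc_single   <- hclust(d, method = "single")
    hc_complete <- hclust(d, method = "complete")
    hc_average  <- hclust(d, method = "average")
    plot(hc_complete)                     # dendrogram
    groups <- cutree(hc_complete, k = 4)  # cut the tree into 4 clusters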
Example: bird data

    > dim(voeg)
    [1] 395  34
    > voeg[1:5, 1:5]
      Heron Mallard Sparrowhawk Buzzard Kestrel
    1     0       0           0       0       1
    2     1       0           0       0       0
    3     0       0           1       0       0
    4     0       0           1       0       0
    5     1       1           0       0       0

A 1 means that place i has bird j. Place names are not present.
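The slides do not show the clustering call behind the dendrograms that follow; one natural reconstruction for presence/absence data uses R's asymmetric binary distance (which is 1 - Jaccard).

    d_bird <- dist(voeg, method = "binary")     # 1 - Jaccard for 0/1 data
    plot(hclust(d_bird, method = "complete"))   # cf. the dendrograms below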
[Figure: complete linkage dendrogram]
[Figure: single linkage dendrogram]
[Figure: average linkage dendrogram]
[Figure: complete vs average linkage dendrograms]
[Figures: heatmaps]
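R's heatmap reorders rows and columns by hierarchical clustering and draws marginal dendrograms; a sketch for the bird data, with distance and linkage choices that are mine rather than the slides'.

    heatmap(as.matrix(voeg), scale = "none",
            distfun   = function(m)  dist(m, method = "binary"),
            hclustfun = function(dd) hclust(dd, method = "average"))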
Dendrogram details

Ordering
- n points means n - 1 splits, hence 2^{n-1} orderings.
- R puts the tightest cluster on the left.
- heatmap lets you control the ordering somewhat.

Ultrametric
- H_{ij} = the height at which the clusters containing i and j merge.
- It's an ultrametric: for any i, j, k the top two of H_{ij}, H_{ik}, H_{jk} are equal.
- Need H_{ij} <= max(H_{ik}, H_{jk}).
- Non-ultrametric measures yield dendrograms with reversals.
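In R the merge heights H_{ij} are the cophenetic distances; the cophenetic correlation (a standard diagnostic, not from the slides) measures how faithfully the tree represents d.

    hc <- hclust(d_bird, method = "average")
    H  <- cophenetic(hc)   # the ultrametric H_{ij}
    cor(d_bird, H)         # cophenetic correlation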
More hierarchical

Centroid merging

    d(C_1, C_2) = \|\bar X_{C_1} - \bar X_{C_2}\|

Requires the original points, not just the distances.

Ward's

    d(C_1, C_2) = \frac{2 |C_1| |C_2|}{|C_1| + |C_2|} \|\bar X_{C_1} - \bar X_{C_2}\|^2

Via the within-cluster SS (after - before). The goal is to minimize

    \sum_j \sum_{i \in C_j} \|X_i - \bar X_{C_j}\|^2
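A brief sketch of both in R; I am assuming the usual conventions that "ward.D2" expects plain (unsquared) Euclidean distances while "centroid" is meant for squared Euclidean distances.

    hc_ward <- hclust(dist(x), method = "ward.D2")     # Ward's method
    hc_cent <- hclust(dist(x)^2, method = "centroid")  # centroid merging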
Costs of hierarchical splits

- It takes O(n^2 d) to get all pairwise distances.
- We have to take O(n) merge steps.
- A naive implementation would be O(n^3 d).

Lance-Williams family of methods
The distance from the merged cluster i \cup j to another cluster k is

    \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij} + \gamma |d_{ki} - d_{kj}|

e.g. \alpha_i = \alpha_j = 1/2, \beta = 0, \gamma = -1/2 for single linkage. The \alpha_i can depend on n_i.
- The updates lead to O(n^2 d) total cost.
- Really clever data structures (F. Murtagh) avoid even n^2 log(n).
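The single-linkage case of the update, as a one-line R check that those coefficients recover min(d_{ki}, d_{kj}); the function name is mine.

    lw_update <- function(d_ki, d_kj, d_ij,
                          a_i = 1/2, a_j = 1/2, beta = 0, gamma = -1/2) {
      a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)
    }
    lw_update(3, 5, 4)   # 0.5*3 + 0.5*5 - 0.5*|3 - 5| = 3 = min(3, 5)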
Divisive clustering

Recipe
- Start with one cluster of n objects.
- Repeat:
  1. Select one cluster.
  2. Split it into two.
- Until there are n clusters of size 1.

Choices
- Which cluster to split, e.g. the one with the largest diameter:

    \arg\max_j \max_{i, i' \in C_j} \|X_i - X_{i'}\|

- How to split it, e.g. remove the farthest point; each X_i then goes with the nearer of the far point and \bar X_{C_j \setminus far}.
- Apparently there is no good speedup.
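The standard R implementation of divisive hierarchical clustering is diana in the cluster package (DIvisive ANAlysis, in the spirit of the recipe above); a usage sketch on the bird distances.

    library(cluster)
    dv <- diana(d_bird)   # divisive clustering from dissimilarities alone
    plot(dv)              # banner and dendrogram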
Optimization based clustering

E.g. Ward's.

Recipe
- Measure the quality of a cluster: a scale estimate, like the diameter, RMS width, or median width.
- Combine these into the quality of the clustering: typically the sum (the max is also plausible).
- (Attempt to) optimize it.

Isolation measures

    Split(C) = \min_{i \in C, j \notin C} d_{ij}
    Cut(C)   = \sum_{i \in C, j \notin C} d_{ij}
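The two isolation measures evaluated in R for one cluster of the bird data; the cluster choice (cut the average-linkage tree at k = 4, take cluster 1) is illustrative.

    D <- as.matrix(d_bird)
    C <- cutree(hclust(d_bird, method = "average"), k = 4) == 1
    c(split = min(D[C, !C]), cut = sum(D[C, !C]))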