Hierarchical clustering
Based in part on slides from the textbook and on slides of Susan Holmes.
December 2, 2012
Description
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: a clustering and its dendrogram. Tan, Steinbach, Kumar, 4/18/2004]
Strengths
Do not have to assume any particular number of clusters: each horizontal cut of the tree yields a clustering. The tree may correspond to a meaningful taxonomy (e.g., the animal kingdom, phylogeny reconstruction, ...). Needs only a similarity or distance matrix for implementation.
Agglomerative
Start with the points as individual clusters. At each step, merge the closest pair of clusters until only one cluster (or some fixed number k of clusters) remains.
Divisive
Start with one, all-inclusive cluster. At each step, split a cluster until each cluster contains a single point (or there are k clusters).
Agglomerative Clustering Algorithm
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. While there is more than one cluster:
   3.1 Merge the two closest clusters.
   3.2 Update the proximity matrix.
The major difference between the various schemes is how the proximity of two clusters is computed.
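The steps above can be sketched in plain Python. This is a minimal illustration with our own helper names, not an efficient implementation; it uses single linkage (MIN) as the cluster-proximity rule, and swapping min for max or an average gives the other schemes discussed below.

```python
# Naive agglomerative clustering on a precomputed distance matrix.

def agglomerate(dist, k=1):
    """dist[i][j]: distance between points i and j; stop at k clusters."""
    clusters = [[i] for i in range(len(dist))]   # step 2: one cluster per point
    merges = []
    while len(clusters) > k:                     # step 3: repeat until k remain
        # step 3.1: find the closest pair of clusters (single linkage)
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist[x][y] for x in clusters[ij[0]] for y in clusters[ij[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        # step 3.2: the proximity update happens implicitly, by pooling members
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

With k=1 the merge list records the full sequence of merges, i.e., the dendrogram.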
[Figure: starting point for agglomerative clustering; each point is its own cluster.]
Intermediate Situation
After some merging steps, we have some clusters, C1 through C5, and a proximity matrix over them. [Figure: intermediate point, with 5 clusters.]
We will merge C2 and C5.
After Merging
The question is: how do we update the proximity matrix? The rows and columns for C2 and C5 are replaced by a single row and column for C2 ∪ C5, whose entries depend on the chosen definition of inter-cluster proximity. [Figure: proximity matrix after merging C2 and C5.]
How to Define Inter-Cluster Similarity?
We need a notion of similarity between clusters. Common choices:
MIN: single linkage uses the minimum distance between points in the two clusters.
MAX: complete linkage uses the maximum distance between points in the two clusters.
Group average: uses the average distance between points in the two clusters.
Centroid: uses the distance between the centroids of the clusters (presumes one can compute centroids...).
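The proximity rules above differ only in how they aggregate pairwise distances. A small sketch (helper names are ours; d is any pairwise distance function, A and B are clusters given as lists of points):

```python
def min_link(d, A, B):      # MIN / single linkage
    return min(d(x, y) for x in A for y in B)

def max_link(d, A, B):      # MAX / complete linkage
    return max(d(x, y) for x in A for y in B)

def avg_link(d, A, B):      # group average
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

def centroid_link(A, B):    # centroid: needs coordinates, not just a distance matrix
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return sum((u - v) ** 2 for u, v in zip(ca, cb)) ** 0.5
```

Note that the first three need only a distance function, while the centroid rule needs coordinates; this is the caveat on the centroid slide.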
Single linkage (MIN)
Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph. [Figure: proximity matrix and dendrogram of single linkage.]
Distance matrix for nested clusterings 1 0.24 0.22 0.37 0.34 0.23 2 0.15 0.20 0.14 0.25 3 0.15 0.28 0.11 4 0.29 0.22 5 0.39 6 18 / 1
Hierarchical Clustering: MIN
[Figure: nested cluster representation and dendrogram of single linkage.]
Single linkage
Can handle irregularly shaped regions fairly naturally. Sensitive to noise and outliers, in the form of chaining.
The Iris data (single linkage)
Complete linkage (MAX)
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters. [Figure: proximity matrix and dendrogram of complete linkage.]
Hierarchical Clustering: MAX
[Figure: nested cluster representation and dendrogram of complete linkage.]
Complete linkage
Less sensitive to noise and outliers than single linkage. Regions are generally compact, but may violate closeness: points may be much closer to points in a neighbouring cluster than to points in their own cluster. This manifests itself as a tendency to break large clusters. Clusters are biased to be globular.
The Iris data (complete linkage)
Hierarchical Clustering: Group Average
[Figure: nested cluster representation and dendrogram of group average linkage.]
Average linkage
Given two elements of the partition, C_r and C_s, we might consider

d_GA(C_r, C_s) = (1 / (|C_r| · |C_s|)) Σ_{x ∈ C_r, y ∈ C_s} d(x, y)

A compromise between single and complete linkage: it shares the globular clusters of complete linkage, but is less sensitive to noise and outliers than single linkage.
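A quick sanity check of the "compromise" claim: since d_GA is an average of the pairwise distances, it always lies between the MIN and MAX distances. A small sketch (toy clusters of our own choosing):

```python
import math

def d_ga(Cr, Cs, d):
    # group-average distance: mean of all |Cr|*|Cs| pairwise distances
    return sum(d(x, y) for x in Cr for y in Cs) / (len(Cr) * len(Cs))

Cr = [(0.0, 0.0), (1.0, 0.0)]
Cs = [(4.0, 0.0), (4.0, 3.0)]
pair = [math.dist(x, y) for x in Cr for y in Cs]
assert min(pair) <= d_ga(Cr, Cs, math.dist) <= max(pair)
```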
The Iris data (average linkage)
Ward's linkage
Similarity of two clusters is based on the increase in squared error when the two clusters are merged. Similar to average linkage if the dissimilarity between points is the squared distance; hence, it shares many properties of average linkage. A hierarchical analogue of K-means, sometimes used to initialize K-means.
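The "increase in squared error" can be computed directly, or via the standard closed form delta = |C_r||C_s| / (|C_r| + |C_s|) · ||m_r − m_s||², where m_r and m_s are the cluster centroids. A worked sketch (helper names are ours):

```python
import math

def sse(C):
    # within-cluster sum of squared distances to the centroid
    m = [sum(xs) / len(C) for xs in zip(*C)]
    return sum(math.dist(p, m) ** 2 for p in C)

def ward_delta(Cr, Cs):
    # increase in total squared error caused by merging Cr and Cs
    return sse(Cr + Cs) - sse(Cr) - sse(Cs)

def ward_delta_closed(Cr, Cs):
    # equivalent closed form in terms of centroids
    mr = [sum(xs) / len(Cr) for xs in zip(*Cr)]
    ms = [sum(xs) / len(Cs) for xs in zip(*Cs)]
    n_r, n_s = len(Cr), len(Cs)
    return n_r * n_s / (n_r + n_s) * math.dist(mr, ms) ** 2
```

Ward's scheme merges, at each step, the pair of clusters with the smallest delta, which is what makes it a hierarchical analogue of the K-means objective.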
The Iris data (Ward's linkage)
NCI data (complete linkage)
NCI data (single linkage)
NCI data (average linkage)
NCI data (Ward's linkage)
Computational issues
O(n²) space, since the method stores the proximity matrix. O(n³) time in many cases: there are n merge steps, and at each step a matrix with O(n²) entries must be updated and/or searched.
Statistical issues
Once a decision is made to combine two clusters, it cannot be undone. No objective function is directly minimized. Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and convex shapes; breaking large clusters.