CS7267 MACHINE LEARNING: HIERARCHICAL CLUSTERING. Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (slides courtesy of Vipin Kumar). Mingon Kang, Ph.D., Computer Science, Kennesaw State University
What is Cluster Analysis? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
What is not Cluster Analysis? Supervised classification: the class label information is already available. Simple segmentation: e.g., dividing students into registration groups alphabetically by last name. Results of a query: the groupings are the result of an external specification.
The Notion of a Cluster Can Be Ambiguous
Types of Clustering. A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters. Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
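The distinction is easy to see in code. A minimal sketch, not from the slides (NumPy, scikit-learn, and SciPy are assumed library choices for illustration): a partitional method returns one flat assignment, while a hierarchical method returns a merge tree.

```python
# Sketch (assumed libraries): partitional clustering gives each point
# exactly one cluster label; hierarchical clustering gives a merge tree.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.1, 4.9], [9.0, 1.0], [8.8, 1.2]])

flat = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
tree = linkage(X, method="single")

print(flat)   # one flat label per point; each point in exactly one cluster
print(tree)   # each row: (cluster a, cluster b, merge distance, new size)
```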
Partitional Clustering
Hierarchical Clustering
Types of Clusters: center-based clusters and contiguous clusters
Types of Clusters: Center-based. A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster.
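As a small illustration (NumPy-based, not from the slides), the two notions of a center can be computed directly:

```python
# Sketch of the two common notions of a cluster "center" described above.
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# Centroid: the coordinate-wise mean; not necessarily an actual data point.
centroid = points.mean(axis=0)                          # -> [1.5, 1.5]

# Medoid: the actual point with the smallest total distance to the rest.
pairwise = np.linalg.norm(points[:, None] - points[None, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]          # -> [1.0, 0.0]

print(centroid, medoid)
```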
Types of Clusters: Contiguity-based. Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Similarity and Dissimilarity. Similarity: a numerical measure of how alike two data objects are; higher when the objects are more alike; often falls in the range [0,1]. Dissimilarity: a numerical measure of how different two data objects are; lower when the objects are more alike; the minimum dissimilarity is often 0, while the upper limit varies. Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes. Here p and q are the attribute values for two data objects.
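The attribute-type table on this slide did not survive extraction; the sketch below uses the usual textbook conventions for each attribute type, so treat the exact formulas as standard choices rather than the slide's own:

```python
# Common single-attribute proximity conventions (standard textbook
# choices; the slide's original table was lost, so these are assumptions).

def nominal(p, q):
    """Nominal: all that matters is match vs. mismatch."""
    d = 0.0 if p == q else 1.0
    return 1.0 - d, d            # (similarity, dissimilarity)

def ordinal(p, q, n_levels):
    """Ordinal: map ranks 0..n_levels-1 onto [0, 1] before differencing."""
    d = abs(p - q) / (n_levels - 1)
    return 1.0 - d, d

def interval(p, q):
    """Interval/ratio: dissimilarity is unbounded; one common similarity."""
    d = abs(p - q)
    return 1.0 / (1.0 + d), d

print(nominal("red", "blue"))      # (0.0, 1.0)
print(ordinal(1, 3, n_levels=5))   # (0.5, 0.5)
print(interval(2.0, 7.0))          # (~0.167, 5.0)
```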
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits. [Figure: nested clusters of six points and the corresponding dendrogram, with merge heights from 0 to 0.2.]
Strengths of Hierarchical Clustering. We do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level, as sketched below. The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction).
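A minimal sketch of this strength, assuming SciPy's hierarchy module: the tree is built once and can then be cut for any desired number of clusters.

```python
# Sketch of "cut the dendrogram at the proper level" (assumed SciPy API).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.RandomState(0).rand(8, 2)
Z = linkage(X, method="average")           # build the merge tree once

for k in (2, 3, 4):                        # no re-clustering needed per k
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)

# dendrogram(Z) draws the tree itself when matplotlib is available.
```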
Hierarchical Clustering. There are two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Agglomerative Clustering Algorithm. The more popular hierarchical clustering technique. The basic algorithm is straightforward: 1. Compute the proximity matrix. 2. Let each data point be a cluster. 3. Repeat: 4. Merge the two closest clusters. 5. Update the proximity matrix. 6. Until only a single cluster remains. The key operation is the computation of the proximity of two clusters; the different approaches to defining the distance between clusters distinguish the different algorithms.
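A from-scratch sketch of the basic algorithm above, using single link (MIN) as the inter-cluster proximity; it is written for clarity, not efficiency:

```python
# Naive agglomerative clustering (single link / MIN proximity).
import numpy as np

def agglomerative(X, k):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # 1. proximity matrix
    clusters = [[i] for i in range(len(X))]              # 2. one cluster per point
    while len(clusters) > k:                             # 3. repeat until k remain
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: proximity of the two closest members
                d = min(D[a, b] for a in clusters[i] for b in clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)                   # 4-5. merge closest pair
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)
print(agglomerative(X, k=2))   # [[0, 1], [2, 3, 4]]
```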
Starting Situation. Start with clusters of individual points and a proximity matrix. [Figure: individual points p1, p2, ..., p12 and their point-to-point proximity matrix.]
Intermediate Situation. After some merging steps, we have some clusters. [Figure: clusters C1-C5 over the points, with the cluster-level proximity matrix.]
Intermediate Situation. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1-C5, with C2 and C5 highlighted for merging.]
After Merging. The question is: how do we update the proximity matrix? That is, how do we measure proximity (distance, similarity) between two clusters? [Figure: the merged cluster C2 ∪ C5; its row and column in the proximity matrix are marked with question marks.]
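A sketch of the update step (the distance values are toy numbers, illustrative only): once C2 and C5 merge, the new row of the matrix can be filled directly from the two old rows, and the choice of rule here is exactly what distinguishes the linkage methods introduced next.

```python
# Filling the merged cluster's row from the two old rows.
import numpy as np

D = np.array([                       # toy cluster-level distance matrix, C1..C5
    [0.0, 0.4, 0.9, 0.6, 0.5],
    [0.4, 0.0, 0.7, 0.8, 0.1],
    [0.9, 0.7, 0.0, 0.3, 0.6],
    [0.6, 0.8, 0.3, 0.0, 0.7],
    [0.5, 0.1, 0.6, 0.7, 0.0],
])

i, j = 1, 4                          # merge clusters C2 and C5 (0-indexed)
rest = [r for r in range(len(D)) if r not in (i, j)]

print(np.minimum(D[i, rest], D[j, rest]))    # MIN  (single link)
print(np.maximum(D[i, rest], D[j, rest]))    # MAX  (complete link)
print((D[i, rest] + D[j, rest]) / 2)         # group average (equal-size case)
```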
How to Define Inter-Cluster Similarity. MIN (single linkage); MAX (complete linkage); group average; distance between centroids; or other methods driven by an objective function (Ward's method uses squared error). Each is examined below, and a code sketch follows.
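In most libraries the linkage definition is just a parameter. A minimal sketch with scikit-learn's AgglomerativeClustering (the library choice is an assumption for illustration; it offers Ward's method but not centroid linkage):

```python
# The inter-cluster similarity definition is the `linkage` argument.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(10, 2)

for method in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    print(method, labels)
```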
Cluster Similarity: MIN or Single Link. The similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is therefore determined by one pair of points, i.e., by one link in the proximity graph.

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1-5.]
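Checking this against the similarity matrix above: single-link similarity between two clusters is the maximum cross-cluster entry (the closest pair). The cluster choice below is an arbitrary example, not one from the slides.

```python
# Single link on the similarity matrix above.
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def single_link(A, B):
    return max(S[a][b] for a in A for b in B)

# e.g. clusters {I1, I2} vs {I3, I4, I5} (0-indexed):
print(single_link([0, 1], [2, 3, 4]))   # 0.70, via the (I2, I3) pair
```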
Hierarchical Clustering: MIN. [Figure: nested clusters (left) and dendrogram (right) for single link; the dendrogram leaves appear in the order 3, 6, 2, 5, 4, 1, with merge heights up to about 0.2.]
Similarity vs. Distance.

Similarity matrix:
     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Distance matrix:
     I1    I2    I3    I4    I5
I1   0.00  0.20  1.80  0.70  1.60
I2   0.20  0.00  0.60  0.80  1.00
I3   1.80  0.60  0.00  1.20  1.40
I4   0.70  0.80  1.20  0.00  0.40
I5   1.60  1.00  1.40  0.40  0.00
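The two matrices above appear to be related by dist = 2 × (1 − sim); this is an observation about these particular tables, not a general rule.

```python
# Reproducing the distance matrix from the similarity matrix
# (the 2 * (1 - S) relation is inferred from the tables themselves).
import numpy as np

S = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])

D = 2 * (1 - S)
print(D)   # matches the distance matrix shown above
```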
Strength of MIN: can handle non-elliptical shapes. [Figure: original points and the two clusters found by single link.]
Limitations of MIN: sensitive to noise and outliers. [Figure: original points and the two clusters found by single link.]
Cluster Similarity: MAX or Complete Linkage. The similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is therefore determined by all pairs of points in the two clusters (using the same similarity matrix shown earlier).
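The same check for complete link: cluster similarity is the minimum cross-cluster entry (the most distant pair), again on an arbitrary example pair of clusters.

```python
# Complete link on the same similarity matrix.
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def complete_link(A, B):
    return min(S[a][b] for a in A for b in B)

# clusters {I1, I2} vs {I3, I4, I5} (0-indexed):
print(complete_link([0, 1], [2, 3, 4]))   # 0.10, via the (I1, I3) pair
```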
Hierarchical Clustering: MAX. [Figure: nested clusters (left) and dendrogram (right) for complete link; the dendrogram leaves appear in the order 3, 6, 4, 1, 2, 5, with merge heights up to about 0.4.]
Strength of MAX: less susceptible to noise and outliers. [Figure: original points and the two clusters found by complete link.]
Limitations of MAX: tends to break large clusters; biased towards globular clusters. [Figure: original points and the two clusters found by complete link.]
Cluster Similarity: Group Average. The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

$$\operatorname{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j} \operatorname{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$$

Average connectivity must be used (rather than total proximity) for scalability, since total proximity favors large clusters. [Table: the similarity matrix shown earlier, annotated with example group-average values 0.4, 0.625, and 0.35.]
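The stray values 0.4, 0.625, and 0.35 in the extracted slide are consistent with example group-average computations on this matrix; the sketch below reproduces them, though which cluster pairs the slide actually annotated is an inference from the numbers.

```python
# Group average on the same similarity matrix: average over ALL
# cross-cluster pairs. The cluster pairs below are inferred, not certain.
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def group_average(A, B):
    return sum(S[a][b] for a in A for b in B) / (len(A) * len(B))

print(group_average([0, 1], [2]))     # 0.4   = avg(0.10, 0.70)
print(group_average([0, 1], [3]))     # 0.625 = avg(0.65, 0.60)
print(group_average([2], [3, 4]))     # 0.35  = avg(0.40, 0.30)
```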
Hierarchical Clustering: Group Average. [Figure: nested clusters (left) and dendrogram (right) for group average; the dendrogram leaves appear in the order 3, 6, 4, 2, 5, 1, with merge heights up to about 0.25.]
Hierarchical Clustering: Group Average. A compromise between single and complete link. Strengths: less susceptible to noise and outliers. Limitations: biased towards globular clusters.
Hierarchical Clustering: Comparison. [Figure: the same six points clustered with MIN, MAX, and group average linkage, side by side.]
Hierarchical Clustering: Time and Space Requirements. O(N^2) space, since it uses the proximity matrix (N is the number of points). O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size O(N^2), must be updated and searched. The complexity can be reduced to O(N^2 log N) time for some approaches.