Cluster Analysis. Agnieszka Nowak-Brzezinska
Outline of lecture: What is cluster analysis? Clustering algorithms. Measures of cluster validity.
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Intra-cluster distances are minimized Inter-cluster distances are maximized
Applications of Cluster Analysis Understanding Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations Summarization Reduce the size of large data sets Clustering precipitation in Australia
Notion of a Cluster can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters
Types of Clusterings. A clustering is a set of clusters. There is an important distinction between hierarchical and partitional sets of clusters. Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional Clustering Original Points A Partitional Clustering
Hierarchical Clustering p1 p2 p3 p4 p1 p2 p3 p4 Traditional Hierarchical Clustering Traditional Dendrogram
Clustering Algorithms K-means Hierarchical clustering Graph based clustering
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple
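The basic algorithm above can be sketched in pure Python. This is a minimal illustration, not a production implementation; `dist` and `kmeans` are hypothetical helper names, and the K initial centroids are simply sampled from the data.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(col) / len(c) for col in zip(*c)))
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        if new_centroids == centroids:  # assignments stable: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs: K-means should recover them with k = 2.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

On well-separated data like this the algorithm converges in a few iterations regardless of which two points are sampled as initial centroids; the next slides show why badly placed initial centroids can still matter on harder data.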
Illustration: six iterations of K-means on a sample 2-D data set; the centroids move at each iteration until the assignments stabilize (scatter plots for iterations 1-6 omitted).
Two different K-means Clusterings: starting from the same original points, K-means can reach an optimal clustering or a sub-optimal clustering, depending on the choice of initial centroids (scatter plots omitted).
Solutions to Initial Centroids Problem Multiple runs Helps, but probability is not on your side Sample and use hierarchical clustering to determine initial centroids Select more than k initial centroids and then select among these initial centroids Select most widely separated Bisecting K-means Not as susceptible to initialization issues
Evaluating K-means Clusters. The most common measure is the Sum of Squared Error (SSE). For each point, the error is its distance to the nearest cluster centroid; to get SSE, we square these errors and sum them:

SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
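The SSE formula above translates directly into code. A pure-Python sketch, with `sse` as a hypothetical helper name and the centroid m_i taken as the cluster mean:

```python
def sse(clusters):
    """Sum of Squared Error: clusters is a list of lists of points (tuples)."""
    total = 0.0
    for c in clusters:
        # m_i: the mean of the cluster, computed coordinate-wise.
        m = tuple(sum(col) / len(c) for col in zip(*c))
        # Add the squared Euclidean distance of every point to its centroid.
        total += sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in c)
    return total

# Example: centroid of the first cluster is (1, 0), each point is at
# squared distance 1 from it; the singleton cluster contributes 0.
two_clusters = [[(0, 0), (2, 0)], [(5, 5)]]
```

Splitting the first cluster into singletons drives the SSE to 0, which illustrates why increasing K trivially reduces SSE.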
Limitations of K-means K-means has problems when clusters are of differing Sizes Densities Non-globular shapes K-means has problems when the data contains outliers. The number of clusters (K) is difficult to determine.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits (example: a nested clustering of points 1-6 and the corresponding dendrogram, figure omitted).
Strengths of Hierarchical Clustering. Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level. The clusters may correspond to meaningful taxonomies, with examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, etc.).
Hierarchical Clustering. Two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm. The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
Starting Situation. Start with clusters of individual points and a proximity matrix whose rows and columns are the points p1, p2, ..., p12 (figure omitted).
Intermediate Situation. After some merging steps, we have some clusters C1-C5, and the proximity matrix is now indexed by clusters rather than points (figure omitted).
Intermediate Situation. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix (figure omitted).
After Merging. The question is: how do we update the proximity matrix? The row and column for the merged cluster C2 U C5 are initially unknown (figure omitted).
How to Define Inter-Cluster Similarity? Given the proximity matrix over points p1...p5, the similarity between two clusters can be defined as: MIN, MAX, Group Average, or Distance Between Centroids.
Hierarchical Clustering: Group Average Compromise between Single and Complete Link Strengths Less susceptible to noise and outliers Limitations Biased towards globular clusters
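The four inter-cluster proximity definitions listed earlier (MIN, MAX, group average, centroid distance) can be stated compactly in code. A pure-Python sketch over clusters given as lists of points, with Euclidean point distance; all function names here are illustrative:

```python
import math

def d(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """MIN: distance of the closest pair of points across the two clusters."""
    return min(d(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """MAX: distance of the farthest pair of points across the two clusters."""
    return max(d(p, q) for p in c1 for q in c2)

def group_average(c1, c2):
    """Mean of all pairwise distances between the two clusters."""
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_dist(c1, c2):
    """Distance between the cluster means."""
    m1 = tuple(sum(col) / len(c1) for col in zip(*c1))
    m2 = tuple(sum(col) / len(c2) for col in zip(*c2))
    return d(m1, m2)

c1 = [(0, 0), (0, 2)]
c2 = [(3, 0), (5, 0)]
```

For these two clusters, MIN picks the pair (0,0)-(3,0) while MAX picks (0,2)-(5,0); group average sits between them, which is why it behaves as a compromise between single and complete link.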
Hierarchical Clustering: Time and Space Requirements. O(N²) space, since it uses the proximity matrix, where N is the number of points. O(N³) time in many cases: there are N steps, and at each step the proximity matrix, of size N², must be updated and searched. Complexity can be reduced to O(N² log N) time for some approaches.
Hierarchical Clustering: Problems and Limitations Once a decision is made to combine two clusters, it cannot be undone No objective function is directly minimized Different schemes have problems with one or more of the following: Sensitivity to noise and outliers (MIN) Difficulty handling different sized clusters and non-convex shapes (Group average, MAX) Breaking large clusters (MAX)
Types of data in clustering analysis: interval-scaled attributes; binary attributes; nominal, ordinal, and ratio attributes; attributes of mixed types.
Interval-scaled attributes Continuous measurements of a roughly linear scale E.g. weight, height, temperature, etc. Standardize data in preprocessing so that all attributes have equal weight Exceptions: height may be a more important attribute associated with basketball players
Similarity and Dissimilarity Between Objects. Distances are normally used to measure the similarity or dissimilarity between two data objects (objects = records). Minkowski distance:

d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects and q is a positive integer. If q = 1, d is the Manhattan distance:

d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.) If q = 2, d is the Euclidean distance:

d(i, j) = sqrt(|x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|²)

Properties: d(i, j) ≥ 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j). Can also use weighted distance or other dissimilarity measures.
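The Minkowski family above fits in a few lines of Python. A sketch with illustrative names; `q = 1` gives the Manhattan distance and `q = 2` the Euclidean distance:

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional tuples for integer q >= 1."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

def manhattan(i, j):
    return minkowski(i, j, 1)

def euclidean(i, j):
    return minkowski(i, j, 2)
```

The metric properties from the slide can be spot-checked on concrete points, e.g. the triangle inequality d(i, j) ≤ d(i, k) + d(k, j).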
Binary Attributes. A contingency table for binary data (object i in rows, object j in columns):

              Object j
               1     0    sum
Object i  1    a     b    a+b
          0    c     d    c+d
        sum   a+c   b+d    p

Simple matching coefficient (if the binary attribute is symmetric): d(i, j) = (b + c) / (a + b + c + d). Jaccard coefficient (if the binary attribute is asymmetric): d(i, j) = (b + c) / (a + b + c).
Example: Dissimilarity between Binary Attributes.

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Gender is a symmetric attribute; the remaining attributes are asymmetric. Let the values Y and P be set to 1 and the value N to 0. Using the Jaccard coefficient on the asymmetric attributes:

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
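The a, b, c, d counts and both coefficients can be checked mechanically. A sketch with an illustrative `binary_dissim` helper; the tuples encode the six asymmetric attributes from the example (Y/P mapped to 1, N to 0, gender excluded):

```python
def binary_dissim(x, y, symmetric=True):
    """Dissimilarity of two equal-length 0/1 tuples via the contingency counts."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)  # simple matching coefficient
    return (b + c) / (a + b + c)          # Jaccard coefficient

# Fever, Cough, Test-1, Test-2, Test-3, Test-4:
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
```

Running it reproduces the three Jaccard dissimilarities computed on the slide.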
Nominal Attributes. A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green. Method 1: simple matching. With m the number of attributes that are the same for both records and p the total number of attributes:

d(i, j) = (p - m) / p

Method 2: rewrite the database and create a new binary attribute for each of the M states. For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
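Both methods are a one-liner each. A sketch with illustrative names (`nominal_dissim` for method 1, `one_hot` for method 2):

```python
def nominal_dissim(i, j):
    """Method 1: simple matching, d = (p - m) / p."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)  # matching attributes
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per state, 1 for the observed state."""
    return [1 if s == value else 0 for s in states]
```

For example, two records that agree on one of two nominal attributes have dissimilarity (2 - 1) / 2 = 0.5.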
Ordinal Attributes. An ordinal attribute can be discrete or continuous; order is important (e.g., rank). Can be treated like interval-scaled: replace x_if by its rank r_if ∈ {1, ..., M_f}, map the range of each attribute onto [0, 1] by replacing the i-th object in the f-th attribute by

z_if = (r_if - 1) / (M_f - 1)

and compute the dissimilarity using methods for interval-scaled attributes.
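The rank normalization above is easy to check numerically. A sketch with a hypothetical `normalize_rank` helper, using a three-state attribute (low = 1, medium = 2, high = 3) as the example:

```python
def normalize_rank(r, M):
    """Map rank r in {1, ..., M} onto [0, 1] via z = (r - 1) / (M - 1)."""
    return (r - 1) / (M - 1)

# A three-state ordinal attribute maps its ranks onto 0.0, 0.5, 1.0:
levels = [normalize_rank(r, 3) for r in (1, 2, 3)]
```

After this mapping, the normalized values can be fed directly into the interval-scaled distances defined earlier.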
Euclidean distance
Other measures: spherical, Manhattan
Different measures
The distance between two points. Calculate the distance between A(2, 3) and B(7, 8):

D(A, B) = sqrt((7 - 2)² + (8 - 3)²) = sqrt(25 + 25) = sqrt(50) = 7.07

(plot of A and B omitted)
Three points A(2, 3), B(7, 8) and C(5, 1). Calculate:

D(A, B) = sqrt((7 - 2)² + (8 - 3)²) = sqrt(25 + 25) = sqrt(50) = 7.07
D(A, C) = sqrt((5 - 2)² + (1 - 3)²) = sqrt(9 + 4) = sqrt(13) = 3.61
D(B, C) = sqrt((5 - 7)² + (1 - 8)²) = sqrt(4 + 49) = sqrt(53) = 7.28

(plot of A, B, C omitted)
Many attributes:

   V1   V2   V3   V4   V5
A  0.7  0.8  0.4  0.5  0.2
B  0.6  0.8  0.5  0.4  0.2
C  0.8  0.9  0.7  0.8  0.9

D(A, B) = sqrt((0.7 - 0.6)² + (0.8 - 0.8)² + (0.4 - 0.5)² + (0.5 - 0.4)² + (0.2 - 0.2)²) = sqrt(0.01 + 0.01 + 0.01) = sqrt(0.03) = 0.17
D(A, C) = sqrt(0.01 + 0.01 + 0.09 + 0.09 + 0.49) = sqrt(0.69) = 0.83
D(B, C) = sqrt(0.04 + 0.01 + 0.04 + 0.16 + 0.49) = sqrt(0.74) = 0.86
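The worked examples above can be verified directly from the coordinates with a small Euclidean helper (a sketch; `euclid` is an illustrative name):

```python
import math

def euclid(p, q):
    """Euclidean distance between two tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The 2-D points:
A2, B2, C2 = (2, 3), (7, 8), (5, 1)

# The 5-attribute records:
A5 = (0.7, 0.8, 0.4, 0.5, 0.2)
B5 = (0.6, 0.8, 0.5, 0.4, 0.2)
C5 = (0.8, 0.9, 0.7, 0.8, 0.9)
```

The same function handles 2 or 5 attributes unchanged, since the Euclidean formula only depends on the coordinate-wise differences.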
Hierarchical Clustering. Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition. Agglomerative clustering (AGNES) proceeds bottom-up, merging a, b, c, d, e step by step (steps 0-4) into one cluster; divisive clustering (DIANA) performs the same steps top-down (diagram omitted).
AGNES-Explored Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this: Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
AGNES. Compute distances (similarities) between the new cluster and each of the old clusters. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.
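Johnson's procedure above can be written as a compact single-link (MIN) sketch in pure Python; clusters are frozensets of point indices, and `agnes_single_link` is an illustrative name, not a library function:

```python
import math

def agnes_single_link(points):
    """Agglomerative clustering with single-link distance.
    Returns the merge sequence as (cluster_a, cluster_b, distance) triples."""
    def d(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Start: each item is its own cluster.
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = min(d(points[a], points[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        merges.append((clusters[i], clusters[j], dij))
        # Merge the pair and drop the two old clusters.
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Three collinear points: (0,0) and (1,0) merge first, then join (5,0).
merge_seq = agnes_single_link([(0, 0), (1, 0), (5, 0)])
```

Swapping the inner `min` for `max` or a mean gives complete-link or average-link; this O(N³)-ish scan is exactly the naive complexity discussed earlier.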
Data:

      VAR 1  VAR 2
P_1   1      3
P_2   1      8
P_3   5      3
P_4   1      1
P_5   2      8
P_6   5      2
P_7   2      3
P_8   4      8
P_9   7      2
P_10  5      8
Distance matrix:

       P_1   P_2   P_3   P_4   P_5   P_6   P_7   P_8   P_9   P_10
P_1    0     5.00  4.00  2.00  5.10  4.12  1.00  5.83  6.08  6.40
P_2    5.00  0     6.40  7.00  1.00  7.21  5.10  3.00  8.49  4.00
P_3    4.00  6.40  0     4.47  5.83  1.00  3.00  5.10  2.24  5.00
P_4    2.00  7.00  4.47  0     7.07  4.12  2.24  7.62  6.08  8.06
P_5    5.10  1.00  5.83  7.07  0     6.71  5.00  2.00  7.81  3.00
P_6    4.12  7.21  1.00  4.12  6.71  0     3.16  6.08  2.00  6.00
P_7    1.00  5.10  3.00  2.24  5.00  3.16  0     5.39  5.10  5.83
P_8    5.83  3.00  5.10  7.62  2.00  6.08  5.39  0     6.71  1.00
P_9    6.08  8.49  2.24  6.08  7.81  2.00  5.10  6.71  0     6.32
P_10   6.40  4.00  5.00  8.06  3.00  6.00  5.83  1.00  6.32  0
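The matrix above can be recomputed from the data set in a few lines (a pure-Python sketch; `data` and `d` are illustrative names, with keys following P_1 ... P_10):

```python
import math

data = {1: (1, 3), 2: (1, 8), 3: (5, 3), 4: (1, 1), 5: (2, 8),
        6: (5, 2), 7: (2, 3), 8: (4, 8), 9: (7, 2), 10: (5, 8)}

def d(i, j):
    """Euclidean distance between points P_i and P_j."""
    (x1, y1), (x2, y2) = data[i], data[j]
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# All pairs at the minimal distance 1 -- the candidates for the first merges:
closest = sorted((i, j) for i in data for j in data
                 if i < j and abs(d(i, j) - 1.0) < 1e-9)
```

The minimal entry 1 appears for the pairs (P_1, P_7), (P_2, P_5), (P_3, P_6) and (P_8, P_10), which is exactly the sequence of first merges in the steps that follow.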
Step 1. Find the minimal distance in the matrix: it is 1, attained (among others) by the pair (P_1, P_7), so P_1 and P_7 are merged into the cluster P_17. Under single linkage, the distance from P_17 to any other cluster is the minimum of the distances from P_1 and from P_7.
Step 2. Updated matrix:

       P_17  P_2   P_3   P_4   P_5   P_6   P_8   P_9   P_10
P_17   0
P_2    5     0
P_3    3     6.4   0
P_4    2     7     4.47  0
P_5    5     1     5.83  7.07  0
P_6    3.16  7.21  1     4.12  6.71  0
P_8    5.39  3     5.1   7.62  2     6.08  0
P_9    5.1   8.49  2.24  6.08  7.81  2     6.71  0
P_10   5.83  4     5     8.06  3     6     1     6.32  0

Find the minimal distance: it is 1, between P_2 and P_5, which are merged into P_25.
Step 3. Updated matrix:

       P_17  P_25  P_3   P_4   P_6   P_8   P_9   P_10
P_17   0
P_25   5     0
P_3    3     5.83  0
P_4    2     7     4.47  0
P_6    3.16  6.71  1     4.12  0
P_8    5.39  2     5.1   7.62  6.08  0
P_9    5.1   7.81  2.24  6.08  2     6.71  0
P_10   5.83  3     5     8.06  6     1     6.32  0

Find the minimal distance: it is 1, between P_3 and P_6, which are merged into P_36.
Step 4. Updated matrix:

       P_17  P_25  P_36  P_4   P_8   P_9   P_10
P_17   0
P_25   5     0
P_36   3     5.83  0
P_4    2     7     4.12  0
P_8    5.39  2     5.1   7.62  0
P_9    5.1   7.81  2     6.08  6.71  0
P_10   5.83  3     5     8.06  1     6.32  0

Find the minimal distance: it is 1, between P_8 and P_10, which are merged into P_810.
Step 5. Updated matrix:

       P_17  P_25  P_36  P_4   P_810  P_9
P_17   0
P_25   5     0
P_36   3     5.83  0
P_4    2     7     4.12  0
P_810  5.39  2     5     7.62  0
P_9    5.1   7.81  2     6.08  6.71   0

Find the minimal distance: it is 2, attained (among others) by P_17 and P_4, which are merged into P_174.
Step 6. Updated matrix:

        P_174  P_25  P_36  P_810  P_9
P_174   0
P_25    5      0
P_36    3      5.83  0
P_810   5.39   2     5     0
P_9     5.1    7.81  2     6.71   0

Find the minimal distance: it is 2, attained (among others) by P_25 and P_810, which are merged into P_25810.
Step 7. Updated matrix:

          P_174  P_25810  P_36  P_9
P_174     0
P_25810   5      0
P_36      3      5        0
P_9       5.1    6.71     2     0

Find the minimal distance: it is 2, between P_36 and P_9, which are merged into P_369.
Step 8. Updated matrix:

          P_174  P_25810  P_369
P_174     0
P_25810   5      0
P_369     3      5        0

Find the minimal distance: it is 3, between P_174 and P_369, which are merged into P_174369.
Step 9. Updated matrix:

           P_174369  P_25810
P_174369   0
P_25810    5         0

The minimal (and only) distance is 5: merging P_174369 with P_25810 leaves only ONE GROUP, P_17436925810. The algorithm took 9 steps in total.
Dendrogram
Single Linkage. Complete Linkage. Average Linkage.
The number of clusters