Clustering algorithms 6CCS3WSN-7CCSMWAL
1 Clustering algorithms 6CCS3WSN-7CCSMWAL
2 Contents Introduction: types of clustering; Hierarchical clustering; Spatial clustering (k-means etc.); Community detection (next week)
3 What are we trying to cluster and why? What is the data? Vector (terms in documents) Graph based (who follows whom on Twitter) What do we want? Group together the similar items Separate the items which are clearly different
9 How are we trying to cluster and why? We consider unsupervised techniques Various heuristics are possible Vector data: possible approaches? Agglomerative or divisive Agglomerative. Hierarchical clustering: group all data into a tree based on distance between data points Divisive. Centroid: split the data into a fixed number of regions based on distance to the regional centers Graph based data: possible approaches? Separate the graph into subgraphs based on communities
16 Theory and Practice The discussion proceeds by example. It is best to try the techniques out for yourself in R.
17 Example: Major cities of the UK. Clustering data vectors: how to present data in a meaningful way. How could we cluster these cities? If we choose geographic position (latitude, longitude) as our data, how would you think they divide up? The file UKCITYDATA.txt contains columns North and West for the cities London, Bristol, Leeds, Sheffield, Bradford, Manchester, Liverpool, Birmingham, Glasgow, Edinburgh, Cardiff, Belfast and Newcastle.
18 We look at two methods: hierarchical clustering and k-means clustering. Both are based on distance between data points, but they analyze and present the data in different ways.
19 Here is a picture of how things might be clustered
20 k-means clustering [scatter plot of the cities on the North/West axes, coloured by cluster]. We made an arbitrary decision to choose 3 clusters.
21 Hierarchical clustering [cluster dendrograms of the cities: d, hclust(*, "ward.D")]. In the second figure we made an arbitrary decision to choose 3 clusters. How does it compare with the k-means result?
22 R for this
require(graphics)
cdata = read.csv("ukcitydata.txt", header=TRUE, row.names=1)
cities <- as.matrix(cdata)
# run hierarchical clustering using Ward's method
d = dist(cities)
groups <- hclust(d, method="ward.D")
# plot dendrogram; use hang to ensure that labels fall below the tree
plot(groups, hang=-1)
# cut into 3 subtrees (draw rectangles on the plot)
rect.hclust(groups, 3)
# k-means clustering
colnames(cities) <- c("North", "West")
cl <- kmeans(cities, 3)  # make 3 clusters
plot(cities, col = cl$cluster, xlim=c(50,58))  # plot clusters
points(cl$centers, col = 1:3, pch = 8, cex = 2)  # insert cluster centers
text(cities, row.names(cities), cex=0.6, pos=4, col="blue")  # label cities
23 Details (type cl in R to get the k-means details): K-means clustering with 3 clusters of sizes 4, 3, 6. The output lists the cluster means (North, West), the clustering vector for London, Bristol, Leeds, Sheffield, Bradford, Manchester, Liverpool, Birmingham, Glasgow, Edinburgh, Cardiff, Belfast, Newcastle, and the within-cluster sum of squares by cluster (between_SS / total_SS = 72.9%).
24 Agglomerative Hierarchical Clustering (HAC) Need a measure of distance between data points. Merge the two nearest clusters until there is a single cluster. The results are presented as a dendrogram showing the hierarchy. Prune the dendrogram to give the required number of clusters. Distance: e.g. Euclidean distance d(a, b) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} = \|a - b\| (1). The notation \|a - b\| is standard for Euclidean distance; a, b are vectors a = (a_1, a_2, ..., a_n), b = (b_1, b_2, ..., b_n).
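For concreteness, equation (1) can be computed directly. The lecture's own code uses R's dist(); this is a small Python sketch for illustration only.

```python
import math

def euclidean(a, b):
    # d(a, b) = sqrt(sum_i (a_i - b_i)^2) = ||a - b||
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Example: distance between two 2-D points
print(euclidean((0, 0), (3, 4)))  # → 5.0
```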
25 Agglomerative Hierarchical Clustering: Detail (1) Assign each data point to its own (single member) cluster (2) Repeat steps 3 and 4 until you have a single cluster containing all data points (3) Find the pair of clusters that are closest to each other and merge them, reducing the number of clusters by one (4) Compute distances between the new cluster and each of the old clusters
26 Distance between two clusters There are many methods. Three common ones are:
Complete-linkage. For each pair of clusters A, B (or clusters and data points) calculate d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}. Merge the two clusters for which d(A, B) is smallest.
Single-linkage. For each pair of clusters A, B (or clusters and data points) calculate d(A, B) = min{d(x, y) : x ∈ A, y ∈ B}. Merge the two clusters for which d(A, B) is smallest.
Ward's method (Ward's minimum variance method). Merge the two clusters which lead to the smallest increase in total within-cluster variance. Intuitively the method tries to put together the two clusters whose means are closest.
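The three merge criteria can be sketched as follows. This is illustrative Python with function names of my own (not from the slides); Ward's criterion is expressed here as the increase in within-cluster sum of squares, for 1-D data.

```python
def d_complete(A, B, d):
    # complete-linkage: largest pairwise distance between the clusters
    return max(d(x, y) for x in A for y in B)

def d_single(A, B, d):
    # single-linkage: smallest pairwise distance between the clusters
    return min(d(x, y) for x in A for y in B)

def ward_increase(A, B):
    # increase in total within-cluster sum of squares if A and B merge
    def ss(C):
        mu = sum(C) / len(C)
        return sum((x - mu) ** 2 for x in C)
    return ss(A + B) - ss(A) - ss(B)

dist = lambda x, y: abs(x - y)
print(d_complete([1, 2], [4], dist))  # → 3
print(d_single([1, 2], [4], dist))    # → 2
print(ward_increase([1, 2], [4]))     # ≈ 4.17
```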
30 Merge clusters Figure from Page 351, Chapter 17 of Introduction to IR book
31 Both Ward's method and complete-linkage gave the same three groups in the dendrogram for the UK cities, but single-linkage gave a different answer. However, the dendrograms of complete-linkage and Ward's look different. Ward's method is considered to give a nice flat clustering. [Cluster dendrograms of the cities: d, hclust(*, "complete") and d, hclust(*, "single")]
32 Cophenetic distance The y-axis of the dendrogram (Height). The cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster. [Cluster dendrogram of the cities: d, hclust(*, "complete")]
33 Example We cluster the numbers 1, 2, 4, 8. If we ask for 3 clusters, hopefully they will be {1, 2}, {4}, {8}. We use complete-linkage to merge clusters.
Max distance matrix:
      C1  C2  C4  C8
C1     0
C2     1   0
C4     3   2   0
C8     7   6   4   0
The clusters with the smallest max distance are C1, C2. Merge these into C12.
34 Max distance matrix. Distance from C12 to C4: max(d(1, 4), d(2, 4)) = d(1, 4) = 3
      C12  C4  C8
C12     0
C4      3   0
C8      7   4   0
The clusters with the smallest max distance are C12, C4. Merge these into C124.
Max distance matrix:
       C124  C8
C124      0
C8        7   0
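The worked example can be checked with a small complete-linkage HAC loop. This is a pure-Python sketch for 1-D data mirroring the tables above, not the slides' R code.

```python
def hac_complete(points):
    """Merge clusters by smallest complete-linkage distance; record merge heights."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: largest pairwise distance between clusters
                d = max(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
        heights.append(d)
    return heights

print(hac_complete([1, 2, 4, 8]))  # → [1, 3, 7], matching the merge tables
```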
35 A plot of the dendrogram. The clusters C1, C2 → C12; C12, C4 → C124; C124, C8 → C1248 were merged at complete-linkage intercluster distances 1, 3, 7. This is recorded on the Height axis. [Cluster dendrogram: d, hclust(*, "complete")] Exercise. Data 1, 2, 5, 9, 11.
36 k-means clustering Colour quantization: reduce the number of colours used. Figures from Wikipedia.
37 k-means clustering The number k of clusters is an input to the algorithm, which then generates k centers and assigns each data point to the nearest center. The aim is to find some good clusters, but that is not always easy. How do we define what we mean by good? We want to partition the data points into k sets (the clusters) in such a way that we minimize the squared distance to the centers of the clusters. The center (or centroid µ) of a cluster is the average of the point positions.
40 k-means clustering If there are m points x_1, ..., x_m then µ = (1/m) \sum_{i=1}^m x_i. Typically the x_i are vectors, in which case µ is calculated component-wise.
41 This is a wish list. In practice some starting centers are given; if not, we generate some random ones. In either case the answer may not be exactly what we want. Assuming we do not have any starting centers: (1) Assign the data points (randomly) into k groups (2) Compute the centroid of each group (3) For each data point, compute the distance to each centroid and assign the data point to the nearest centroid (4) If the clusters are unchanged then STOP, else go to step 2. If we use random starting centers, the final answer may vary.
42 Example Divide 1, 2, 4, 5, 8, 9 into 3 clusters with starting centers 2, 6, 10.
data                      1 2 4 5 8 9
assign to nearest center  (1 2 4) (5) (8 9)   [4 is tied between centers 2 and 6; 8 is tied between 6 and 10]
new centroids             7/3, 5, 17/2
assign to nearest center  (1 2) (4 5) (8 9)
new centroids             3/2, 9/2, 17/2
assign to nearest center  (1 2) (4 5) (8 9)
No change: STOP
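The iteration above can be replayed with a short Lloyd-style k-means. This is a 1-D Python sketch (the slides use R's kmeans); it assumes no cluster ever becomes empty, and the hypothetical `tie_high` flag breaks distance ties toward the later center.

```python
def kmeans_1d(points, centers, tie_high=False):
    """Lloyd's algorithm on 1-D data from given starting centers."""
    centers = list(centers)
    while True:
        clusters = [[] for _ in centers]
        for p in points:
            dists = [abs(p - c) for c in centers]
            best = min(dists)
            idxs = [i for i, d in enumerate(dists) if d == best]
            clusters[idxs[-1] if tie_high else idxs[0]].append(p)
        new = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
        if new == centers:
            return clusters
        centers = new

print(kmeans_1d([1, 2, 4, 5, 8, 9], [2, 6, 10]))  # → [[1, 2], [4, 5], [8, 9]]
```

Starting from centers (2, 6, 10), the intermediate clusterings depend on how the ties for 4 and 8 are broken, but either tie-breaking rule converges to the same final clusters (1, 2), (4, 5), (8, 9).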
43 Ex1. What would have happened if we had broken the distance ties for 4 and 8 the other way in the first round? Ex2. Where do you think the cluster centers should be for the following set of points, for k = 2, 3? (1, 1), (1.5, 1.5), (2, 2), (2, 3), (3, 2), (3, 3) Check your answers by using them as the initial centers for the k-means algorithm.
44 Partitioning Around Medoids (PAM) Algorithm This is like k-means but the centers have to be part of the data set. The algorithm tries to find a k-partition of the n data points which minimizes the dissimilarity F = \sum_{i=1}^n \sum_{j=1}^n d(i, j) z_{i,j}, where z_{i,j} = 1 if i, j are in the same cluster and zero otherwise. The minimization is carried out subject to the constraint that all k clusters are non-empty. This is obviously harder to do, but makes more sense than k-means. Example: Divide 1, 2, 4, 5, 8, 9 into 3 clusters around medoids. Ans: (1, 2), (4, 5), (8, 9); either point in each cluster can act as a medoid.
45 Cities: Partitioning Around Medoids require(cluster) meds=pam(cities,3) clusplot(meds,labels=2)
46 Within cluster sum of squares (WCSS) The main limitation of the k-means method is that the solution found by the algorithm is often a local rather than a global minimum: the algorithm can't improve things, but the answer is not the best possible. It is important to run the algorithm a number of times with different start centers and choose the result with the minimum WCSS; keep running the algorithm until there is no significant improvement in WCSS. This is the reason for using random starting centers. For a given set of clusters S = (S_1, ..., S_k) with centers (µ_1, ..., µ_k), the within cluster sum of squares (WCSS) is defined as WCSS = \sum_{i=1}^k \sum_{x \in S_i} \|x - µ_i\|^2. Here \|z\|^2 = \sum_i z_i^2 is the squared Euclidean distance of z = (z_1, ..., z_n).
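The WCSS definition, for 1-D clusters, reads directly as code (a small Python sketch mirroring the formula; not from the slides):

```python
def wcss(clusters):
    """Within-cluster sum of squares for 1-D clusters (centroid = mean)."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)                     # cluster centroid
        total += sum((x - mu) ** 2 for x in c)   # squared distances to it
    return total

print(wcss([[1, 2], [6]]))  # → 0.5
```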
47 k-means: More detail
> clus = kmeans(c(1,2,6), 2)
> clus
K-means clustering with 2 clusters of sizes 2, 1
Cluster means: ...
Clustering vector: ... (for data points 1, 2, 6 respectively)
Within cluster sum of squares by cluster: ...
(between_SS / total_SS = 96.4 %)
> clus$totss
[1] 14
> clus$betweenss
[1] 13.5
48 Total sum of squares (TSS) If a cluster has m points x_1, ..., x_m, its centroid is µ = (1/m) \sum_{i=1}^m x_i. For a given set of clusters S = (S_1, ..., S_k) the within cluster sum of squares (WCSS) is defined as WCSS = \sum_{i=1}^k \sum_{x \in S_i} \|x - µ_i\|^2, where \|z\|^2 is the squared Euclidean distance as before. The mean of all n data points is M = (1/n) \sum_{i=1}^n x_i, and TSS = \sum_{i=1}^n \|x_i - M\|^2.
49 Example: Explanation Divide 1, 2, 6 into 2 clusters. Ans: (1, 2) and (6). Means: (1 + 2)/2 = 1.5 and 6. Overall mean M = (1 + 2 + 6)/3 = 3. WCSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (6 - 6)^2 = 0.5. TSS = (1 - 3)^2 + (2 - 3)^2 + (6 - 3)^2 = 14. BCSS = TSS - WCSS = 13.5. BCSS/TSS = 13.5/14 = 96.4%. This is a good fit because the BCSS (between cluster sum of squares) explains 96.4% of the data variation, while the WCSS (within cluster sum of squares) is only 3.6% of the data variation: the data points are close to their cluster centers.
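The arithmetic above can be replayed directly (a Python sketch of the same computation):

```python
data = [1, 2, 6]
clusters = [[1, 2], [6]]

M = sum(data) / len(data)                # overall mean = 3
tss = sum((x - M) ** 2 for x in data)    # total sum of squares = 14

def ss(c):
    mu = sum(c) / len(c)
    return sum((x - mu) ** 2 for x in c)

wcss = sum(ss(c) for c in clusters)      # within-cluster SS = 0.5
bcss = tss - wcss                        # between-cluster SS = 13.5
print(round(100 * bcss / tss, 1))        # → 96.4
```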
57 FBook example: Social Network Clustering Analysis This analysis uses a dataset representing a random sample of U.S. high school students who had profiles on a well-known social network from 2006 to 2009. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person's profile. The aim is to cluster the document corpus (FB pages) according to text content.
58 R program
require(cluster)
# raw.githubusercontent.com/brenden17/sklearnlab/master/facebook/snsdata.csv
teens <- read.csv("snsdata.csv")  # download from above and put in the working directory
apply(teens[5:40], 2, sum)
interests <- teens[5:40]  # throw out columns 1-4 of the data: gradyear, gender, age, friends (on FBook)
interests_z <- as.data.frame(lapply(interests, scale))
teen_clusters <- kmeans(interests_z, 5)
teen_clusters$size
# The cluster characterization can be obtained with pie charts:
pie(colSums(interests[teen_clusters$cluster==1,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==2,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==3,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==4,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==5,]), cex=0.5)
59 The output
> apply(teens[5:40], 2, sum)
basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
... 1813
> teen_clusters$size
[1] ...
60 The five clusters are presented as pie charts. It's impossible to represent 36 dimensions (basketball, ..., drugs) on a page otherwise. The final answers are not fully reproducible (random start clusters are used). The largest 5 segments, in order within each group:
Group 5 (5523 points): music, shopping, dance, god, hair
Group 4 (22258 points): music, god, dance, hair, band
Group 3 (1039 points): hair, sex, music, kissed, die
Group 2 (594 points): baseball, football, basketball, music, rock
Group 1 (586 points): sexy, music, hair, dance, cute
61 The plot of cluster 5 [pie chart over the 36 interest terms]
62 The plots of cluster 4 [pie chart over the 36 interest terms]
63 The plots of cluster 3 [pie chart over the 36 interest terms]
64 The plots of cluster 2 [pie chart over the 36 interest terms]
65 The plots of cluster 1 [pie chart over the 36 interest terms]
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationClustering: K-means and Kernel K-means
Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationUnsupervised Learning Hierarchical Methods
Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationComputing with large data sets
Computing with large data sets Richard Bonneau, spring 2009 Lecture 8(week 5): clustering 1 clustering Clustering: a diverse methods for discovering groupings in unlabeled data Because these methods don
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More information21 The Singular Value Decomposition; Clustering
The Singular Value Decomposition; Clustering 125 21 The Singular Value Decomposition; Clustering The Singular Value Decomposition (SVD) [and its Application to PCA] Problems: Computing X > X takes (nd
More informationLecture 4 Hierarchical clustering
CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up
More information2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.
Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar
More informationHierarchical and Ensemble Clustering
Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example
More informationCluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole
Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationUNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering
UNSUPERVISED LEARNING IN R Introduction to hierarchical clustering Hierarchical clustering Number of clusters is not known ahead of time Two kinds: bottom-up and top-down, this course bottom-up Hierarchical
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 16
CS434a/541a: Pattern Recognition Prof. Olga Veksler Lecture 16 Today Continue Clustering Last Time Flat Clustring Today Hierarchical Clustering Divisive Agglomerative Applications of Clustering Hierarchical
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationChapter VIII.3: Hierarchical Clustering
Chapter VIII.3: Hierarchical Clustering 1. Basic idea 1.1. Dendrograms 1.2. Agglomerative and divisive 2. Cluster distances 2.1. Single link 2.2. Complete link 2.3. Group average and Mean distance 2.4.
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationHierarchical clustering
Aprendizagem Automática Hierarchical clustering Ludwig Krippahl Hierarchical clustering Summary Hierarchical Clustering Agglomerative Clustering Divisive Clustering Clustering Features 1 Aprendizagem Automática
More informationHierarchical clustering
Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized
More informationClustering Algorithms. Margareta Ackerman
Clustering Algorithms Margareta Ackerman A sea of algorithms As we discussed last class, there are MANY clustering algorithms, and new ones are proposed all the time. They are very different from each
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationData Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation
More informationTypes of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters
Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive
More informationClustering Lecture 3: Hierarchical Methods
Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationHierarchical Clustering 4/5/17
Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationSTATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral
More informationWhat is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology
Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationData Mining Algorithms
for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationCSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction
CSE 255 Lecture 5 Data Mining and Predictive Analytics Dimensionality Reduction Course outline Week 4: I ll cover homework 1, and get started on Recommender Systems Week 5: I ll cover homework 2 (at the
More informationMATH5745 Multivariate Methods Lecture 13
MATH5745 Multivariate Methods Lecture 13 April 24, 2018 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 1 / 33 Cluster analysis. Example: Fisher iris data Fisher (1936) 1 iris data consists of
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationClust Clus e t ring 2 Nov
Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Hierarchical Clustering Produces a set
More information