2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.

Rosa O’Connor’
6 years ago

Cluster Analysis

The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that the cases within a particular group (cluster) are relatively similar to each other while those in different groups are very dissimilar. It is an exploratory technique used when there is no a priori knowledge of the form of any underlying group structure, except perhaps that in some instances the number of groups may be known.

There are many different methods of cluster analysis available, but they can be broadly categorized as either hierarchical or non-hierarchical procedures. In a hierarchical procedure, each cluster obtained at any stage is a merger or split of clusters obtained at other stages. There are therefore two extremes: n clusters with one case per cluster, or a single cluster containing all n cases. Algorithms start at one of the extremes and successively work towards the other. In between, the strength of clustering increases or decreases monotonically as you move up or down from one level to another. In contrast, with non-hierarchical techniques, new clusters are obtained by both lumping together and splitting up old clusters.

Most hierarchical procedures start with a dissimilarity matrix for the data (see the relevant notes in the Multivariate Statistics module on measuring the dissimilarity (or distance) between cases) and successively fuse clusters together according to the dissimilarity between groups. If we let $d_{rs}$ denote the dissimilarity between the r-th and s-th cases and $D_{rs}$ denote the dissimilarity between the r-th and s-th groups, then we can use the following algorithm:

1. Define each case as a group, i.e. $D_{rs} = d_{rs}$.
2. Find the smallest element of the dissimilarity matrix. If this is $D_{lm}$ then fuse groups l and m.
3. Calculate the dissimilarity between the new group and each of the existing groups, and then replace the l-th and m-th rows and columns of the dissimilarity matrix by the single row/column of new values.
4. Repeat steps 2 and 3 until only one group remains.

It is the different ways of defining the dissimilarity between groups that give rise to the variety of hierarchical techniques. The following are some of the more commonly used ones:

(i) The nearest neighbour (or single-linkage) method. $D_{rs}$ is the smallest of the $n_r n_s$ dissimilarities between each case in r and each case in s, where $n_r$ and $n_s$ are the numbers of cases in groups r and s.
(ii) The furthest neighbour (or complete-linkage) method. $D_{rs}$ is the largest of the $n_r n_s$ dissimilarities.
(iii) The group average method. $D_{rs}$ is the average of the $n_r n_s$ dissimilarities.
(iv) The centroid method. $D_{rs}$ is the squared Euclidean distance between the centroids of groups r and s, i.e. if $m_r$ and $m_s$ denote the centroids for groups r and s, respectively, then $D_{rs} = (m_r - m_s)(m_r - m_s)^T$.
(v) The minimum variance method. $D_{rs}$ is the between-group sum of squares for groups r and s.

Methods (ii), (iii) and (iv) generally lead to spherical clusters, while (i) often leads to elongated clusters. Note that different methods can lead to different partitions of the sample.

A hierarchical approach can be represented by a dendrogram, a special type of tree structure which depicts the nested nature of the clusters. The dendrogram can then be cut at any chosen level to see which individuals fall into which groups, either at a chosen dissimilarity value or to obtain a certain number of groups.

Non-hierarchical methods are designed to group cases into K clusters, where the value of K may be either specified in advance or determined as part of the procedure. Such methods start from either an initial partition of all the cases into K groups or an initial set of K seed points. A commonly used non-hierarchical procedure is the K-means method, which is implemented using the following algorithm:

1. Partition the sample of cases into K initial clusters.
2. Work through all the cases in the sample, assigning each in turn to the cluster whose mean it is closest to. After each assignment, recalculate the mean for the cluster receiving the new case and for the cluster losing the case.
3. Repeat step 2 until no further reassignments take place.
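The K-means steps above can be tried out with a minimal sketch in R. The data here are synthetic and purely illustrative (an assumption, not part of these notes), and the call uses the `kmeans` function from the stats package with `algorithm = "MacQueen"`, the variant that updates a cluster mean whenever a case is moved and so appears closest to the description above. Note that `kmeans` starts from K seed points (randomly chosen cases) rather than from an initial partition.

```r
# Illustrative sketch only: two well-separated synthetic groups (assumed data).
set.seed(1)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),  # 10 cases near (0, 0)
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)   # 10 cases near (5, 5)
)

# K = 2. nstart = 10 repeats the procedure from 10 random sets of seed
# points and keeps the solution with the smallest within-cluster sum of squares.
fit <- kmeans(x, centers = 2, algorithm = "MacQueen", nstart = 10)

fit$cluster   # cluster label for each case
fit$centers   # final cluster means
fit$size      # number of cases in each cluster
```

Because the result depends on the starting points, it is usually worth running from several starts, which is what `nstart` does here.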

In practice you should try running the algorithm for different values of K. A data-based method to choose K might be to maximize the between-cluster variability relative to the within-cluster variability, i.e. maximize $(B + W)/W$ (cf. multivariate analysis of variance), where

$$B = \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T$$

$$W = \sum_{k=1}^{K} \sum_{j=1}^{n_k} (x_{kj} - \bar{x}_k)(x_{kj} - \bar{x}_k)^T$$

Here $x_{kj}$ is the j-th case in cluster k, $n_k$ is the number of cases in cluster k, $\bar{x}_k$ is the mean of cluster k and $\bar{x}$ is the overall mean.

Many general texts on multivariate statistics contain a chapter or two on cluster analysis. Some useful references to more specialized books on the subject, which are also in the JRUL, are as follows:

Gordon, Andrew D. (1981) Classification: methods for the exploratory analysis of multivariate data.
Everitt, Brian S. (1993) Cluster analysis.
Hartigan, J. A. (1979) Clustering Algorithms.
Anderberg, Michael R. (1973) Cluster analysis for applications.
Jardine, N. and Sibson, R. (1971) Mathematical taxonomy.

Single Linkage Example

Consider the following distance matrix D between n = 5 cases:

Case    1    2    3    4    5
1       0    9    3    6   11
2       9    0    7    5   10
3       3    7    0    9    2
4       6    5    9    0    8
5      11   10    2    8    0

Step 1: Since $\min_{r,s}(d_{rs}) = d_{53} = 2$, join cases 5 and 3 into a cluster.

Step 2: Compute the new distance matrix.

$d_{(35)1} = \min(d_{31}, d_{51}) = \min(3, 11) = 3$
$d_{(35)2} = \min(d_{32}, d_{52}) = \min(7, 10) = 7$
$d_{(35)4} = \min(d_{34}, d_{54}) = \min(9, 8) = 8$

Cluster   (3,5)    1    2    4
(3,5)       0
1           3      0
2           7      9    0
4           8      6    5    0

Step 3: The smallest distance is $d_{(35)1} = 3$, so we join case 1 to cluster (3, 5).

Step 4: Compute the new distance matrix.

Cluster     (1,3,5)    2    4
(1,3,5)        0
2              7       0
4              6       5    0

Step 5: The smallest distance is $d_{24} = 5$, so we join cases 2 and 4 into the cluster (2, 4).

Step 6: Compute the new distance matrix.

Cluster     (1,3,5)   (2,4)
(1,3,5)        0
(2,4)          6        0

Step 7: Join clusters (1, 3, 5) and (2, 4) to form a single cluster (1, 2, 3, 4, 5).

To do average linkage and complete linkage on these data we simply replace the measure of distance between clusters accordingly. The analyses can be done in R as follows:

> d<-array(c(0, 9, 3, 6, 11, 9, 0, 7, 5, 10, 3, 7, 0, 9, 2, 6, 5, 9, 0, 8, 11, 10, 2, 8, 0), dim=c(5, 5))
> d<-as.dist(d, diag=TRUE)
> d
> d_sl<-hclust(d, method="single")
> d_al<-hclust(d, method="average")
> d_cl<-hclust(d, method="complete")
> par(mfrow=c(1, 3))
> plot(d_sl, hang=-1)
> plot(d_al, hang=-1)
> plot(d_cl, hang=-1)

The dendrograms based on each of the three methods are shown in Figure 1. Note that step 1, where cases 3 and 5 are joined into a cluster, is always the same.

[Figure 1 (three panels, each titled "Cluster Dendrogram" with vertical axis "Height"): hclust (*, "single"), hclust (*, "average"), hclust (*, "complete")]

Figure 1: Dendrograms based on clustering the data in distance matrix D
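As a check on the hand calculation, the single-linkage steps can be traced with a short sketch in R. The function `single_linkage` is a hypothetical helper written only for this illustration (it is not part of R or of these notes); it follows the four-step fusion algorithm directly, recomputing each group dissimilarity as the smallest case-to-case distance.

```r
# Distance matrix D from the worked example (cases 1 to 5).
D <- matrix(c( 0,  9, 3, 6, 11,
               9,  0, 7, 5, 10,
               3,  7, 0, 9,  2,
               6,  5, 9, 0,  8,
              11, 10, 2, 8,  0), nrow = 5, byrow = TRUE)

# Hypothetical helper for illustration: prints each fusion and returns the
# dissimilarity at which it happens.
single_linkage <- function(D) {
  groups  <- as.list(seq_len(nrow(D)))      # step 1: each case is a group
  heights <- numeric(0)
  while (length(groups) > 1) {
    best <- c(1, 2)
    best_d <- Inf
    # step 2: find the pair of groups with the smallest dissimilarity
    for (l in seq_len(length(groups) - 1)) {
      for (m in (l + 1):length(groups)) {
        d_lm <- min(D[groups[[l]], groups[[m]]])  # single linkage
        if (d_lm < best_d) {
          best_d <- d_lm
          best <- c(l, m)
        }
      }
    }
    # step 3: fuse the two groups and record the fusion level
    merged <- sort(c(groups[[best[1]]], groups[[best[2]]]))
    cat("join", paste(merged, collapse = ","), "at", best_d, "\n")
    groups  <- c(groups[-best], list(merged))
    heights <- c(heights, best_d)
  }                                          # step 4: repeat until one group
  heights
}

h <- single_linkage(D)
h   # fusion levels: 2 3 5 6
```

The fusion levels agree with the worked example and with `hclust(as.dist(D), method = "single")$height`. Replacing `min` by `max` or `mean` in the marked line would give complete or group average linkage, respectively.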

Clustering the Exam Marks Data

The following code was used in R to do a cluster analysis of the exam marks data using the average linkage method.

> d_marks<-dist(marks, method="euclidean")
> d_marks_al<-hclust(d_marks, method="average")
> plot(d_marks_al, hang=-1)
> rect.hclust(d_marks_al, 4)
> # cut the tree into four clusters and tabulate the cluster sizes
> hc.class<-cutree(d_marks_al, k=4)
> hc.class
> table(hc.class)
> hc<-as.vector(hc.class)
> hc
> # append the cluster labels as an extra column, then extract each cluster's rows
> marks_c<-cbind(marks, hc)
> clust1<-marks_c[hc==1, ]
> clust2<-marks_c[hc==2, ]
> clust3<-marks_c[hc==3, ]
> clust4<-marks_c[hc==4, ]
> # cluster means for each of the five variables
> clmean1<-tapply(marks[,1], hc, mean)
> clmean2<-tapply(marks[,2], hc, mean)
> clmean3<-tapply(marks[,3], hc, mean)
> clmean4<-tapply(marks[,4], hc, mean)
> clmean5<-tapply(marks[,5], hc, mean)
> clmeans<-c(clmean1, clmean2, clmean3, clmean4, clmean5)
> # rows are clusters, columns are variables
> clmeans<-array(clmeans, dim=c(4, 5))
> clmeans
> dist_clmeans<-dist(clmeans)
> dist_clmeans

The dendrogram in Figure 2 shows the steps involved in forming clusters and the split of the data into four clusters.

[Figure 2 (titled "Cluster Dendrogram", vertical axis "Height"): d_marks, hclust (*, "average")]

Figure 2: Dendrogram for the Exam Marks data based on using Average Linkage
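The cluster-mean calculation above can also be written more compactly. The sketch below is only an illustration: since the exam marks data are not reproduced here, it uses the first four columns of R's built-in `iris` data as a stand-in (an assumption), and it replaces the repeated `tapply` calls with a single call to `aggregate`.

```r
# Stand-in data (assumption): the exam marks are not available here, so the
# four numeric columns of the built-in iris data are used for illustration.
x <- iris[, 1:4]

hc_fit <- hclust(dist(x, method = "euclidean"), method = "average")
cl <- cutree(hc_fit, k = 4)        # cluster label (1 to 4) for each case

table(cl)                          # cluster sizes

# Mean of every variable within each cluster: one row per cluster,
# first column holds the cluster label.
cl_means <- aggregate(x, by = list(cluster = cl), FUN = mean)
cl_means

# Dissimilarities between the cluster mean vectors.
dist(cl_means[, -1])
```

With the real data, `x` would simply be replaced by `marks`; the resulting table has clusters in rows and variables in columns, as in `clmeans` above.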
