Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a specific set of variables. In order to accomplish this objective, the most common starting point is computing a matrix, called dissimilarity matrix, which contains information about the dissimilarity of the observed units. According to the nature of the observed variables (quantitative, qualitative, binary or mixed type variables), we can define and use different measures of dissimilarity. 1.1 Dissimilarity measures Let s consider the n p data matrix, where n is the number of observed units and p is the number of observed variables; let s indicate with i and j two generic objects and with x ik the value of the k-th variable for the i-th unit. 1.1.1 Quantitative variables If the observed variables are all quantitative, each unit can be identified with a point in the p-dimensional space and the dissimilarity between two objects i and j can be measured through: a) Euclidean distance: d ij = p (x ik x jk ) 2 k=1 1
b) City-Block or Manhattan distance: d ij = p x ik x jk k=1 Both of these two distances are specific cases, respectively for l = 2 and l = 1, of a generic distance family known as c) Minkowski metric: { p } 1/l d ij = x ik x jk l k=1 the larger l the greater is the weight given to high distances x ik x jk. For all of the three distances the properties of a distance function hold: 1. d ij 0 2. d ii = 0 3. d ij = d ji 4. d ij d it + d tj (triangle inequality) 1.1.2 Binary variables When dealing with binary or categorical variables it is not possible to define true distance measures, but one can use some dissimilarity measures which: are still close to zero, when i and j are similar to each other increase when differences between two units are larger are symmetric are equal to zero when considering the dissimilarity of an object with itself Nevertheless, these dissimilarity measures generally do not satisfy the triangle inequality. Consider x ik that can assume only two values, coded as 0 and 1. In order to calculate the dissimilarity between two units, relevant information is represented in the following 2 2 table: 2
i j 1 0 1 a b 0 c d where a is the number of variables that take value 1 for both objects i and j, b is the number of variables that assume modality 1 in object i and modality 0 in object j, c is the number of times the opposite is verified and d is the number of variables that assume value 0 for both objects i and j. Furthermore, a + b + c + d = p, with p equal to the number of observed variables. Some of the best known dissimilarity measures are: a) Simple dissimilarity coefficient d ij = b + c a + b + c + d b) Jaccard coefficient d ij = b + c a + b + c Jaccard s coefficient is recommended when binary variables indicate presence/absence of a specific characteristic. In this case agreement 1/1 can have a different meaning than agreement 0/0; indeed, absence of a certain characteristic in both objects i and j does not contribute to improve the similarity between them. Hence, it can be more useful to divide by a + b + c rather than by a + b + c + d. 1.1.3 Categorical variables When observed variables are categorical with more than two levels, a simple measure of the degree of dissimilarity between two units is given by a) d ij = 1 c ij where c ij is the ratio of the number of variables that assume the same values in both units i and j and the total number of observed variables p. 3
1.1.4 Mixed type variables When the observed variables have different nature a dissimilarity measure that is able to deal with their specific characteristics is: a) Gower index d ij = 1 s ij, with s ij = p k=1 δ ijks ijk p k=1 δ ijk x ik x jk where, for quantitative variables, s ijhk = 1 range of variable x k For qualitative or binary variables s ijk = 1 if i and j agree with respect to variable k and equal to 0 if they disagree. Indicator δ ijk is equal to 1 if both quantities x ik and x jk are recorded and equal to 0 in the opposite situation; furthermore it is equal to 0 for asymmetric binary variables when agreement 0/0 is observed. 2 Agglomerative Hierarchical Clustering Methods Hierarchical clustering methods are criteria aimed to create nested partitions that allow investigations with respect to different within clusters homogeneity levels. Those relationships are often represented by dendrograms; by cutting the graph at a certain level of dissimilarity we obtain a partition of objects into different groups. Each partition in k groups is optimal in a step-wise sense. It is the best partition in k group one can obtain, given the partition in k+1 group defined at the previous step, for agglomerative methods, or the best partition given the partition in k 1 groups yielded at the previous step, for divisive methods. Notice that for agglomerative hierarchical clustering methods, when two clusters are merged, they will remain so until the end of the hierarchy; a mistake committed in the initial phase of classification can never be corrected. This is both the strength and the weakness of this kind of methods; the bigger is the number of units the higher is the probability of misclassifying a unit. 4
If the aim is to find a number of clusters much lower than the number of units (k n), divisive methods have a lower risk to commit a mistake, since the number of steps needed to reach partition in k groups is smaller. On the other hand, divisive methods have a high computational complexity and this is the reason why agglomerative approaches are preferred. Starting from the dissimilarity matrix, agglomerative hierarchical clustering methods initially consider each unit as a separate cluster, and they subsequently merge in one cluster those units that have the lowest dissimilarity. Then, they evaluate dissimilarities between this new cluster and the others, merging again the two groups that have the lowest value in the dissimilarity matrix. Different clustering methods in this class refer to the different ways the dissimilarity between clusters is defined and evaluated. 2.1 Single Linkage (Nearest Neighbour) method Distance between two clusters C and G is defined as the dissimilarity between the two closest points d CG = min(d ij ), i C; j G i.e. it is the shortest distance between any element i of C and any element j of G. Clusters obtained with the Single Linkage method tend to have a long and narrow shape and to be internally poorly homogenous since only two similar observations from two different clusters are needed in order to merge the two clusters (chaining effect). For this reason, this method is not able to identify overlapping clusters, but it is good when clusters look well separated, because in every cluster yielded by the procedure any unit is more similar to the units belonging to the same cluster than to units of different groups. 2.2 Complete Linkage (Furthest Neighbour) method Distance between two clusters C and G is defined as the dissimilarity between the two furthest-away points d CG = max(d ij ), i C; j G i.e. it is the longest distance between any element i of C and any element j of G. In other words, d CG is the diameter of the smallest sphere that can 5
wrap the cluster obtained by merging C and G. Clusters yielded by this procedure tend to be spherical and compact; nothing can be said about their separation. 2.3 Average Linkage Distance between two clusters C and G is defined as the average of all the dissimilarities between pairs of points one from each cluster d CG = 1 d ij n C n G i C,j G where n C is the number of elements in cluster C, and n G is the number of elements in cluster G. 2.4 Centroid Linkage Distance between two clusters C and G is defined as the Euclidean distance between the two cluster centres d CG = x C x G where x C is the mean vector of the observed variables in cluster C and x G is the mean vector of the observed variables in cluster G. This method is suitable for quantitative variables only. A problem that may be encountered with this approach is that dissimilarity of a union is lower than the one measured at the previous merge (because of the centroid shift) and this makes the results less interpretable. 2.5 Ward method Distance between two clusters C and G is defined as the deviance between the two clusters. Be x ikg the value of variable k observed on unit i in cluster g and be x kg the average of variable k in cluster g; then E g = n p g (x ikg x kg ) 2 k=1 i=1 E g is the deviance within gth cluster and E = g E g the deviance within the whole partition. As usual the classification process starts by considering 6
each unit as a cluster. In this case the deviance within is 0. Each merge leads to an increase of the deviance within ; the algorithm is looking for the merge that will imply the smallest increase of the deviance within the whole partition; i.e. the groups that will be merged are those having the largest deviance between. Like the centroid linkage, this method can only be applied to quantitative data, but differently from that, it does not have problems with the order of dissimilarities in the different levels of the hierarchy. Both methods identify spherical clusters. 3 Partitioning Clustering Methods Hierarchical methods described in the previous section generate n different nested classifications that go from n one-element clusters to a cluster which contains all the n observations. Partitive methods give rise to a single partition with K groups, where K is previously specified or is determined by the clustering method itself. In the partitioning approach, the number of clusters (say K) is decided a priori and data are split into clusters in such a way the within-cluster dissimilarity is minimized. The function that is to be minimized is K g=1 i:x i C g (x i x Cg ) 2 The problem seems simple, but actually the numbers involved make impossible to consider all possible partition of n individuals into K groups. That is why several algorithms, designed to search for the minimum values of the clustering criterion, have then been developed. 3.1 K-means Algorithm A very popular partitioning method is the K-means algorithm. Classification begins by choosing the first K observations of the data matrix as centroids and carries on by assigning each of the remaining n K units to the closest centroid. This first partition in K groups is then improved by assuming the centroids of the K groups as new aggregation points and by associating each unit to the closest centroid. 7
The resulting classification strongly depends on the order by which units are assigned to different groups; nevertheless it has been shown that this effect is poorly relevant when clusters are clearly separated. 4 How to choose the number of clusters So far we have not discussed how to choose the appropriate number of clusters. There is plenty of methods in the literature designed to solve the problem but up to now there is no unique solution. Here a few methods are presented. 4.1 The Average Silhouette Width (ASW) For a partition of n units into K clusters C 1,..., C K, suppose object i has been assigned to cluster C h. We indicate with a(i) the average dissimilarity of i to all other objects of cluster C h : a(i, h) = 1 C h 1 j C h d(i, j) This expression makes sense only when C h contains other objects other than i. Let s consider now any cluster C l different from C h and define the average dissimilarity of i to all objects of C l d(i, C l ) = 1 d(i, j) C l j C l After computing d(i, C l ) for all clusters C l different from C h, we select the smallest of those: b(i) = min i/ C l d(i, C l ) The cluster for which this minimum is obtained is call neighbour of object i; this is like the second-best choice for clustering object i. The silhouette s(i) is obtained by combining a(i) and b(i) as follows: 1 a(i), if a(i) < b(i) b(i) s(i) = 0, if a(i) = b(i) b(i) 1, if a(i) > b(i). a(i) 8
This can be rearranged in one formula s(i) = b(i) a(i) max[a(i), b(i)] And it can be easily seen that 1 s(i) 1 for each object i. When s(i) is at its largest (that is, close to 1), this implies that the within dissimilarity a(i) is much smaller than the smallest between dissimilarity b(i). Therefore, we can say that i is well classified: the second best choice is not nearly as close at the actual choice. When s(i) is about zero, then a(i) and b(i) are approximately equal and so it is not clear whether i should have been assigned to C h or to C l, it lies equally far away from both. The worst situation takes place when s(i) is close to 1, when a(i) is actually much larger than b(i), and hence i lies on average closer to C l than to C h ; therefore it would have seemed better to assign object i to its neighbour. The silhouette s(i) hence measures how well unit i has been classified. By computing the average of the s(i), calculated for all the observations i = 1,..., n, we obtain the so called average silhouette width (ASW): s(i) = 1 n n s(i, K) i=1 If K is not fixed and needs to be estimated, the ASW estimate K ASW is obtained by maximizing the equation. Its expression leads to a clustering that emphasises the separation between the clusters and their neighbouring clusters. 4.2 Pearson Gamma If K is not fixed and needs to be estimated, the Pearson Gamma (PG) estimator K Γ maximises the Pearson correlation ρ(d, m) between the vector d of pairwise dissimilarities and the binary vector m that is 0 for every pair of observations in the same cluster and 1 for every pair of observations in different clusters. PG emphasises a good approximation of the dissimilarity structure by the clustering in the sense that when observations are in different clusters they should strongly be correlated with large dissimilarity. 9
Remark When the number of clusters is known (hence there is no need to estimate it) one can use these indexes as a measure of clustering quality. 10