Forestry 531 -- Applied Multivariate Statistics

Cluster Analysis

Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in class]

Unlike Factor Analysis, we are concerned with grouping rather than with what the causes (factors) of the groups are; therefore, the objective and approach differ from Factor Analysis. Unlike multi-dimensional scaling (MDS; see Manly for this tool), we are not interested in reducing the dimensions so that we can produce a map; rather, we wish to group the data. Unlike Multivariate Discriminant Analysis (MDA), the groups to which the entities belong are unknown.

Procedure: Formal (mathematical techniques) vs. informal (based on inspection and judgement). For few dimensions (2 or 3), informal methods may be appropriate; in this case, we could use plots of the variables to aid in the clustering. For many dimensions, maps based on MDS may be used to reduce the number of dimensions and an informal approach used. Alternatively, a formal method may be more appropriate.

Problems:
(1) How is similarity (dissimilarity) measured?
(2) If formal procedures are to be used, which one (of many)?

Two Objectives:
(1) Similarity within the groups.
(2) Separation between the groups.

The method to be selected will depend on the objective and the type of clusters in the data (Figure 8-2, Jackson).

Figure 8-2a: Clear separation of clusters.
Figure 8-2b: Separation OK, but some points in cluster 1 are actually closer to points in cluster 2 than to other points in cluster 1.
Figure 8-2c: Well separated; nonhomogeneous.
Figure 8-2d: One cluster using either objective.
Figure 8-2e: Two groups using the similarity-within objective; not clear using separation-between.
Figure 8-2f: Noise between the two groups -- difficult to meet either objective.
Figure 8-2g: Similar to 8-2f.
Measures of Similarity/Distance: [see Manly, p. 129]

Distance when grouping variables: separation in n-dimensional space based on the n observations of each variable.
Distance when grouping observations: separation in p-dimensional space based on the p measures for each observation.

Commonly used similarity/distance measures for ratio, interval, or ordinal scale variables: [examples in class]

Given a vector x = [x1 x2 x3 ...] and a vector y = [y1 y2 y3 ...], the distance between points x and y can be described by many different methods. For ordinal variables, ranks are assigned and it is assumed that there is equal distance among ranks.

1. Euclidean distance (usually the default in packages):
   Distance(x, y) = SQRT( sum (xi - yi)^2 )
   where the sum is over all the dimensions; the xi are the elements of vector x and the yi are the elements of vector y. This is the commonly used distance between two vectors, as already covered.
   Example: Observation 1 = [2 4 6]; Observation 2 = [1 2 7].
   Distance(x, y) = SQRT( (2-1)^2 + (4-2)^2 + (6-7)^2 ) = SQRT(6) ≈ 2.449.

2. Squared Euclidean distance:
   Distance(x, y) = sum (xi - yi)^2

3. City-block or Manhattan distance:
   Distance(x, y) = sum |xi - yi|
   NOTE: large differences are weighted less heavily than for Euclidean or squared Euclidean distances.

4. Chebychev distance metric:
   Distance(x, y) = maximum |xi - yi|
   For the example above, Distance(x, y) = max(1, 2, 1) = 2.
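The four distance measures above can be computed directly. A minimal Python sketch (the function names are my own, for illustration, not from any particular package), using the example observations:

```python
import math

def euclidean(x, y):
    # Distance(x, y) = SQRT( sum (xi - yi)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    # Same as above without the square root
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def city_block(x, y):
    # Manhattan distance: sum of absolute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    # Maximum absolute difference over all dimensions
    return max(abs(xi - yi) for xi, yi in zip(x, y))

obs1 = [2, 4, 6]
obs2 = [1, 2, 7]
print(euclidean(obs1, obs2))          # sqrt(6), about 2.449
print(squared_euclidean(obs1, obs2))  # 6
print(city_block(obs1, obs2))         # 4
print(chebychev(obs1, obs2))          # 2
```

Note that squaring (Euclidean, squared Euclidean) inflates the contribution of the largest coordinate difference, which is why the city-block distance weights large differences less heavily.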
Similarity measure for ratio scale only: cosine.
   Similarity(x, y) = sum (xi yi) / SQRT( sum (xi^2) sum (yi^2) )

Commonly used similarity/distance measures for discrete nominal variables (can also be used for ordinal variables):

1. Matching coefficient: the fraction of all variables with similar values is a measure of similarity between the observations.
   Example (ordinal): Observation 1 = [3 1 2 5]; Observation 2 = [3 1 1 2]
   (1 = strongly agree; 2 = agree; 3 = neither agree nor disagree; 4 = disagree; 5 = strongly disagree)
   Similarity: 2/4 are similar.
   Example (nominal):
            sphere  sweet  sour  crunchy  purple
   Apple    Yes     Yes    Yes   Yes      No
   Banana   No      Yes    No    No       No
   (i.e., converted to 0 = no; 1 = yes.)
   Similarity: 2/5 match.

2. Jaccard (similarity) and DJaccard (distance or dissimilarity) coefficients: like the matching coefficient, but negative (no-no) matches are not included. Using the same nominal example, the purple characteristic is excluded since neither fruit has it, leaving 4 variables:
   Similarity: 1/4 match (sweet only).
   Dissimilarity: 3/4 don't match.

OTHERS: SAS lists many similarity/dissimilarity measures in the documentation for PROC DISTANCE.
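These coefficients can likewise be sketched in Python (function names are illustrative; the Jaccard function uses the standard definition, in which joint absences are excluded). The fruit example uses the 0/1 coding above:

```python
import math

def cosine_similarity(x, y):
    # Similarity(x, y) = sum(xi*yi) / sqrt(sum(xi^2) * sum(yi^2))
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
    return num / den

def matching_coefficient(x, y):
    # Fraction of variables with the same value (0-0 matches count)
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

def jaccard(x, y):
    # 0/1 data: fraction of matches among the variables where at
    # least one observation has a 1 (joint absences excluded)
    kept = [(xi, yi) for xi, yi in zip(x, y) if xi == 1 or yi == 1]
    return sum(xi == yi for xi, yi in kept) / len(kept)

#          sphere sweet sour crunchy purple
apple  = [1, 1, 1, 1, 0]
banana = [0, 1, 0, 0, 0]
print(matching_coefficient([3, 1, 2, 5], [3, 1, 1, 2]))  # 2/4 = 0.5
print(matching_coefficient(apple, banana))               # 2/5 = 0.4
print(jaccard(apple, banana))                            # Jaccard similarity
print(1 - jaccard(apple, banana))                        # DJaccard dissimilarity
```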
Agglomerative or Hierarchical Procedures:

The general idea is to cluster points together until one large cluster results. Once a point is in a cluster, it cannot be removed.

1. Nearest-neighbour algorithm (also called single linkage). Steps are:
   a. Join the closest points. Treat these as an entity.
   b. Join the next two closest entities (points and the first group of two points). The distance between the entity from (a) and any of the remaining points is defined as the least distance between the points in the entity and the remaining points.
   c. Continue to join the closest entities, point by point, until all points are in one group.
   See Figure 8-3 and Figure 8-8a, Jackson. A large jump in the tree diagram/dendrogram shows that the last two groups are not naturally joined.
   Uses: for clearly separated natural groups (separation-between objective; see Figure 8-13, Jackson); badly affected by noise between groups; poor at finding homogeneous groups.

2. Farthest-neighbour algorithm (also called complete linkage). As for the nearest-neighbour algorithm, except that the distance between an entity and the remaining points is defined as the maximum of the distances between the individual points of the entity and the remaining points.
   Uses: highlights lack of similarity within clusters, so suitable for the similarity-within objective; suitable for compact groups with similar variances within each group.

3. Minimum squared-error method. The notion of the centroid of the groups is used. Link entity by entity based on the criterion of reaching the smallest squared distance between points and centroids of entities. (See Table 8-10, Jackson.)
   Uses: extremely reluctant to include outliers (a large squared error results), so good for the similarity-within objective, and little affected by noise; not appropriate for clusters such as Figure 8-13 (Jackson).

Many other hierarchical methods are available. SPSS and SAS have many options for methods.
A description (with references) for many procedures is given in the SAS documentation for PROC CLUSTER.
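The nearest-neighbour (single linkage) steps above can be sketched in plain Python. This is a naive, O(n^3) illustration, not how SAS or SPSS implement it; at each step the two closest entities are merged, with entity-to-entity distance taken as the least distance between their member points:

```python
import math

def single_linkage(points):
    """Merge clusters pairwise until one remains; return the merge
    distances in order. A large jump in these distances suggests the
    natural number of clusters was reached just before that merge."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    clusters = [[p] for p in points]  # start: every point is its own cluster
    merge_distances = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two entities
        del clusters[j]
        merge_distances.append(d)
    return merge_distances

# Two well-separated groups: the final merge distance jumps sharply,
# signalling that two natural clusters existed before the last merge.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(single_linkage(pts))
```

For this toy data the first four merges happen at distance 1, while the last merge (joining the two natural groups) happens at a distance of more than 13, exactly the kind of jump the dendrogram reveals.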
Nonhierarchical Procedures:

Nonhierarchical procedures may use hill-and-valley methods to assign points to groups (e.g., Quick Cluster in SPSS), or may start with all the data in one group and proceed to break the data down (divisive methods).

1. Hill-and-valley methods. The idea is to use the concept of density of the points (hill) tapering to fewer points (valley). The measure of density is the closeness of a point to the other points in its vicinity. One could specify a particular near point, such as the 5th nearest point: if the region is very dense, this distance will be small; if sparse, this distance will be large. With the unimodal procedure the steps are:
   a. Define the nuclei of the clusters (specify how many there are) based on the density of the points.
   b. Classify the remaining points to a cluster based on density (specify the distance to which nearest neighbour).
   Uses: presence of noise or clusters close together; must be well-defined natural clusters. Suited for the similarity-within objective, especially in the presence of noise. The number of clusters at the end is determined by the number of nuclei defined at the beginning. (See Figure 8-14, Jackson.)

SAS Procedures: PROC CLUSTER

In PROC CLUSTER, all observations start out as individual clusters, and clustering ends when all observations are in one cluster. First, based on the distance between points, one cluster is formed by joining the two closest points. The distances between this cluster and all other points are then calculated. Next, either the two closest points are joined, OR the newly formed cluster is joined to one of the remaining points, whichever gives the smallest distance. The distances between this cluster and all other points (or any other clusters) are calculated. The process is repeated until all points are joined together.

Interpretation: If a very large distance is needed to join points (or clusters) together, then the number of natural clusters was reached before this last clustering.
A tree diagram showing the clustering and the distances needed to join two clusters is often very useful in determining the number of clusters.
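The hill-and-valley density measure described above (distance to the k-th nearest point: small in dense regions, large in sparse ones) can be sketched as follows. k = 2 is used here only so the toy data stay small; the handout's example uses the 5th nearest point:

```python
import math

def kth_nearest_distance(points, k):
    """For each point, the distance to its k-th nearest neighbour.
    Small values mark dense regions (hills); large values mark
    sparse regions (valleys)."""
    out = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        out.append(dists[k - 1])
    return out

# A dense clump near the origin plus one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
dens = kth_nearest_distance(pts, k=2)
print(dens)  # the last value is much larger: (8, 8) sits in a valley
```

Cluster nuclei would be defined among the points with the smallest values, and the isolated point would be classified (or left as noise) afterwards.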
Variations:

1. The distances used in SAS (if numeric input is used) are either Euclidean or squared Euclidean; no other distances may be specified. However, a distance matrix, already calculated, can be input instead of the raw data. That way, any distance measure can be used in the clustering.

2. The distance from a cluster to a point (or another cluster) can be: average linkage; centroid method; complete linkage; single linkage.

3. Other clustering methods include: density linkage; flexible-beta; maximum likelihood; McQuitty's similarity analysis; median method; two-stage density linkage; Ward's minimum variance method. [See the SAS documentation online for PROC CLUSTER for descriptions of other methods.]

PROC FASTCLUS

This procedure can be used for fast clustering of very large data sets. FASTCLUS performs a disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative variables. The observations are divided such that every observation belongs to exactly one group. The clustering is based on minimizing the squared distances from the cluster means. The user can specify the maximum number of clusters allowed. FASTCLUS starts with all observations in one group and divides these into the number of clusters specified.

SPSS Procedures:

Two procedures are available in SPSS. Cluster is for small data sets; a hierarchical method is used. Several similarity/distance measures can be used (squared Euclidean, Euclidean, cosine, Chebychev, city-block, power) if data are numeric; proximity data can also be input. Methods include BAVERAGE, WAVERAGE, single linkage, complete linkage, centroid clustering, median, and Ward's method. Tree diagrams can be output. Quick Cluster is for larger data sets; a hill-and-valley method is used.

References:

Jackson, B.B. 1983. Multivariate data analysis: An introduction. Richard D. Irwin, Inc., Homewood, Illinois.

Manly, Bryan. 2005. Multivariate statistics: A primer. 3rd edition.
Chapman & Hall/CRC Press, New York, chapter 9.
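Appendix: FASTCLUS's criterion (minimizing squared distances from the cluster means) is essentially the k-means criterion. A minimal Lloyd's-algorithm sketch in Python, not SAS; the simple "first k points" seeding is for illustration only and is not how FASTCLUS chooses its seeds:

```python
import math

def kmeans(points, k, iters=20):
    """Assign each point to the nearest mean, recompute the means,
    and repeat until the means stop changing (or iters is reached)."""
    means = list(points[:k])  # naive seeding, for illustration only
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            groups[i].append(p)
        new_means = []
        for g, m in zip(groups, means):
            if g:
                new_means.append(tuple(sum(c) / len(g) for c in zip(*g)))
            else:
                new_means.append(m)  # keep an empty cluster's old mean
        if new_means == means:
            break
        means = new_means
    return means, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, groups = kmeans(pts, k=2)
print(sorted(means))  # one mean per natural clump
```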