Applied Multivariate Analysis

Lauren Ford

1 Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017

2 Cluster Analysis

3 Outline: 1 Cluster analysis (Background, Distance data)

4 Background Example 1: Consider the following data on 20 beer brands, with the variables Calories, Sodium, Alcohol, and Cost.
Brands: Budweiser, Schlitz, Lowenbrau, Kronenbourg, Heineken, Old Milwaukee, Augsberger, Strohs Bohemian Style, Miller Lite, Budweiser Light, Coors, Coors Light, Michelob Light, Becks, Kirin, Pabst Extra Light, Hamms, Heilemans Old Style, Olympia Gold Light, Schlitz Light

5 Background A potentially interesting question is whether some beers are more alike than others, i.e., whether the beers fall into natural groups.

6 Background Before clustering, check descriptive statistics and plots

7 Background It turns out that Lowenbrau is an outlier, in particular in the relation of alcohol to the other variables.

8 Background No more obvious outliers.

9 Background Problem: Group sample units into homogeneous sub-groups on the basis of a given data set.
Synonyms: clustering, morphometrics, pattern recognition, classification, taxonomy

10 Background The main difference from discriminant analysis is that in cluster analysis the number of groups is not known in advance.
Data:
1. Observations on variables $x_{i1}, \dots, x_{ip}$, $i = 1, \dots, n$.
2. An $n \times n$ distance matrix, which describes how far apart from each other the observations are.

12 Distance data Dissimilarity: Let $x_i = (x_{i1}, \dots, x_{ip})$ denote the observations on the variables for sample unit $i$. The dissimilarity between sample units $i$ and $j$ is measured by a suitable distance measure
$d_{ij} = d(x_i, x_j)$, (1)
that has the following properties:
1. $d(x, y) \ge 0$
2. $d(x, y) = 0 \iff x = y$
3. $d(x, y) = d(y, x)$
4. $d(x, y) \le d(x, z) + d(y, z)$

13 Distance data Example 2:
Euclidean distance: $d(x, y) = \left[ \sum_{j=1}^{p} (x_j - y_j)^2 \right]^{1/2}$. (2)
Block distance: $d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$. (3)
Mahalanobis distance: $d(x_i, x_j) = \left[ (x_i - x_j)' \Sigma^{-1} (x_i - x_j) \right]^{1/2}$, (4)
where $\Sigma = \mathrm{Cov}(x_i) = \mathrm{Cov}(x_j)$.
Unlike the Euclidean and block distances, the Mahalanobis distance is independent of the scales of the variables.
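The three distances of Example 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up here, and the Mahalanobis version is restricted to $p = 2$ so that the inverse of $\Sigma$ can be written out by hand without a matrix library.

```python
import math

def euclidean(x, y):
    """Euclidean distance, equation (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def block(x, y):
    """Block (city-block) distance, equation (3)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance, equation (4), for p = 2 only (assumption).

    cov is a 2x2 covariance matrix given as [[s11, s12], [s21, s22]].
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    d0, d1 = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)' Sigma^{-1} (x - y), using the closed-form 2x2 inverse.
    q = (s22 * d0 * d0 - (s12 + s21) * d0 * d1 + s11 * d1 * d1) / det
    return math.sqrt(q)

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))                                 # 5.0
print(block(x, y))                                     # 7.0
print(mahalanobis_2d(x, y, [[1.0, 0.0], [0.0, 1.0]]))  # 5.0 (identity covariance)
```

With the identity matrix as covariance the Mahalanobis distance reduces to the Euclidean one. Rescaling a variable changes both the data and $\Sigma$, and the two effects cancel in (4).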

14 Distance data Scaling makes a difference!
Example 3: Euclidean distances between observations on Weight (g) and Length, computed with Length measured in cm and in mm.
Thus, with Euclidean and block distances, unit differences in the variables should have equal practical importance. Linear or non-linear transformations may be needed. E.g., variables in different currencies should be converted to the same currency. Similarly, if ratios are more meaningful than differences, take logarithms. Outliers should also be removed.
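The effect of the measurement unit can be illustrated with a small sketch. The three observations below are hypothetical (they are not the values of Example 3); they are chosen so that the nearest neighbour of observation 1 changes when Length is switched from cm to mm.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical observations: (weight in g, length in cm)
obs_cm = [(10.0, 1.0), (13.0, 1.2), (10.0, 3.0)]
# Same observations with length converted to mm
obs_mm = [(w, 10.0 * length) for w, length in obs_cm]

def nearest_to_first(points):
    """Return the 1-based label of the observation closest to observation 1."""
    dists = {i + 1: euclidean(points[0], p) for i, p in enumerate(points) if i > 0}
    return min(dists, key=dists.get)

print(nearest_to_first(obs_cm))  # 3: in cm, the weight difference dominates
print(nearest_to_first(obs_mm))  # 2: in mm, the length difference dominates
```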

15 Distance data General solutions:
(1) Standardization:
$z_{ij} = x_{ij} / s_j$, (5)
where $s_j$ is the standard deviation of variable $x_j$. Shortcoming: standardization may weaken the clustering information carried by the variables (see footnote 1).
(2) Mahalanobis distance.
Footnote 1: Milligan, Glenn W., and Martha C. Cooper, 1987, A study of standardization of variables in cluster analysis. Journal of Classification 5,
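A minimal sketch of the standardization in (5), assuming the sample standard deviation (divisor $n - 1$) is meant; the data and the function name `standardize` are hypothetical. After dividing by $s_j$, a change of unit in any variable no longer affects the values.

```python
import statistics

def standardize(rows):
    """Divide every variable by its sample standard deviation, as in (5)."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [tuple(v / s for v, s in zip(row, sds)) for row in rows]

# Hypothetical observations: (weight in g, length in cm)
rows = [(10.0, 1.0), (13.0, 1.2), (10.0, 3.0)]
z = standardize(rows)

# Every standardized variable has standard deviation 1.
print([statistics.stdev(col) for col in zip(*z)])
```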

16 Distance data Similarity: Association measures $c_{ij} = c(x_i, x_j)$ with the properties
1. $0 \le c_{ij} \le 1$
2. $c_{ii} = 1$
3. if $c_{ij} = 1$ then $x_i = x_j$
4. $c_{ij} = c_{ji}$

17 Distance data Example 4: Observations $x_i$ = 0 or 1 (dichotomous).

                     Sample unit m
                     0      1      Total
Sample unit k   0    a      b      a+b
                1    c      d      c+d
Total                a+c    b+d    a+b+c+d

Jaccard: $c_{km} = \frac{a}{a + b + c}$ (6)
Czekanowski: $c_{km} = \frac{2a}{2a + b + c}$ (7)
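Equations (6) and (7) can be evaluated directly from the cell counts of the table. The sketch below takes the counts a, b, c as given (which cell is labelled a follows the table above); the function names and the example counts are assumptions made for illustration.

```python
from fractions import Fraction

def jaccard(a, b, c):
    """Jaccard coefficient, equation (6), from the table counts a, b, c."""
    return Fraction(a, a + b + c)

def czekanowski(a, b, c):
    """Czekanowski coefficient, equation (7): the count a gets double weight."""
    return Fraction(2 * a, 2 * a + b + c)

# Hypothetical counts
print(jaccard(2, 1, 1))      # 1/2
print(czekanowski(2, 1, 1))  # 2/3
```

Neither coefficient uses the count d, and the Czekanowski coefficient is never smaller than the Jaccard coefficient for the same table.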

19 Note: Similarity → Dissimilarity:
$d_{ij} = 1 - c_{ij}$ (8)
Distance matrix: $D = (d_{ij})$.
Agglomerative clustering: bottom up (start from single units and merge).
Divisive clustering: top down (start from one cluster and split).

20 Strategies (examples):
Single linkage (nearest neighbor): The cases with the smallest distance (the most similar) are combined first; the distance between two groups is the distance between their nearest points.
Complete linkage (furthest neighbor): The distance between two groups is the distance between their furthest points.
Average linkage: The average distance between pairs of observations, one in each cluster.
Centroid method: Distances between group means (group centroids).
Minimum variance: Ward's minimum-variance method minimizes the within-cluster sum of squares over clusters.
k-means clustering: Given k clusters, minimize the within-cluster sum of squares. An initial partition must be found somehow (e.g., randomly, or by some other clustering method).
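The linkage rules differ only in how the distance between two clusters is computed. The following sketch writes them out for clusters given as lists of points, assuming Euclidean distance between points; the function names and the two example clusters are hypothetical. For Ward's method it uses the standard result that merging clusters A and B increases the within-cluster sum of squares by $\frac{n_A n_B}{n_A + n_B} \lVert \bar{x}_A - \bar{x}_B \rVert^2$.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def single_link(A, B):
    """Distance between the nearest pair of points, one from each cluster."""
    return min(euclidean(a, b) for a in A for b in B)

def complete_link(A, B):
    """Distance between the furthest pair of points, one from each cluster."""
    return max(euclidean(a, b) for a in A for b in B)

def average_link(A, B):
    """Average distance over all pairs, one point from each cluster."""
    return sum(euclidean(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B):
    """Distance between the cluster means."""
    return euclidean(centroid(A), centroid(B))

def ward_increase(A, B):
    """Increase in within-cluster sum of squares if A and B are merged."""
    cA, cB = centroid(A), centroid(B)
    sq = sum((a - b) ** 2 for a, b in zip(cA, cB))
    return len(A) * len(B) / (len(A) + len(B)) * sq

# Hypothetical clusters
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
for rule in (single_link, complete_link, average_link, centroid_link, ward_increase):
    print(rule.__name__, rule(A, B))
```

For any pair of clusters the single-linkage distance is at most the average-linkage distance, which in turn is at most the complete-linkage distance.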

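21 Example 5: Single linkage (nearest neighbor). In the initial distance matrix $D$ the nearest units are 1 and 3. Joining them yields a new distance matrix $D^{(1)}$ for the clusters {1, 3}, 2, 4, 5. Next, join 2 and 4 to yield $D^{(2)}$ for the clusters {1, 3}, {2, 4}, 5. Next, join 5 to {2, 4}, which yields $D^{(3)}$ for the clusters {1, 3}, {2, 4, 5}.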
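The merging steps of Example 5 can be reproduced with a short single-linkage routine. The distance matrix below is hypothetical: it is not the matrix from the example, only one chosen so that the merge order is the same (1 with 3, then 2 with 4, then 5 with {2, 4}). The function name `single_linkage` is likewise an assumption.

```python
def single_linkage(D):
    """Agglomerative single-linkage clustering on a symmetric distance matrix.

    Returns a list of merges (cluster_a, cluster_b, merge_distance),
    with units labelled 1..n.
    """
    clusters = [frozenset([i + 1]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between any two members.
                dist = min(D[i - 1][j - 1] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical distance matrix for five units
D = [
    [0, 9, 1, 6, 10],
    [9, 0, 7, 2, 8],
    [1, 7, 0, 5, 11],
    [6, 2, 5, 0, 3],
    [10, 8, 11, 3, 0],
]
merges = single_linkage(D)
for left, right, height in merges:
    print(left, right, height)
```

The merge distances (here 1, 2, 3, 5) are the heights at which the branches join in the dendrogram.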

22 The resulting dendrogram is

23 Number of clusters:
A large jump in the dendrogram is a sign of the correct number of clusters.
A large jump in another criterion can be used as well: maximum pseudo F, minimum pseudo t.
Maximum cubic clustering criterion (CCC).
In k-means clustering, plot the within-cluster sums of squares against the number of clusters and look for an elbow, similar to the scree plot in PCA.
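The elbow idea for k-means can be sketched as follows. This is a plain Lloyd-type iteration with hand-picked starting centers, not the procedure of any particular package; the data, the starting centers, and the function names are hypothetical.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def kmeans(points, centers, max_iter=100):
    """Return (clusters, within-cluster sum of squares) from given starting centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(sqdist(p, centers[j]) for j, c in enumerate(clusters) for p in c)
    return clusters, wss

# Hypothetical data with two well-separated groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
starts = {
    1: [points[0]],
    2: [points[0], points[3]],
    3: [points[0], points[3], points[5]],
}
wss = {k: kmeans(points, init)[1] for k, init in starts.items()}
for k in sorted(wss):
    print(k, round(wss[k], 3))
```

In this toy data the sum of squares falls sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, so the elbow is at two clusters.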

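24 Example 6: Beer brands.
data beer;
  input brand $21. calories sodium alcohol cost;
  datalines;
Budweiser
Schlitz
;
/* b is the subset after removing Lowenbrau; this data step is an assumed construction, as the original does not show it */
data b;
  set beer;
  if brand ne 'Lowenbrau';
run;
/* single linkage method; simple = print simple statistics, std = standardize the variables */
proc cluster data = b method = single simple std;
  id brand;
  var calories--cost;
run;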

25 Single linkage

29 Single linkage No obvious break; the biggest jump occurs after joining Becks, Kirin, Heineken, and Kronenbourg to the rest. Pseudo t drops at 4 clusters and is lowest at 7, 9, and 10. Pseudo F does not seem to work. Thus, on this basis the beers cluster into American and European brands. Within the American brands, the distinction seems to be between the light beers and the others (Michelob differs from this picture).

30 Complete linkage
proc cluster data = b method = complete;
  var calories sodium alcohol;
  id brand;
run;

31 Complete linkage Complete linkage divides the beers essentially into two lite brands, the European brands, and the other American brands.

32 Centroid
/* centroid linkage */
proc cluster data = b method = centroid;
  var calories sodium alcohol cost;
  id brand;
run;

33 Centroid The result is similar to complete linkage.
