Multivariate Analysis

1 Multivariate Analysis: Cluster Analysis
Prof. Dr. Anselmo E de Oliveira
anselmo.quimica.ufg.br
anselmo.disciplinas@gmail.com

2 Unsupervised Learning: Cluster Analysis
- Natural grouping
- Patterns in the data are not known before the analysis is conducted
- The number of clusters
- Populations
- Interpretation

3 System Samples Measurements Similarities Distances Clusters

4 Shape and size of clusters may be very different: spherical, ellipsoidal, linear, crescent, ring, spiral

5 The most important methods are
- Partitioning methods: each object is assigned to exactly one group
- Hierarchical methods: tree-like dendrogram; optimal number of clusters
- Fuzzy clustering methods: each object is assigned by a membership coefficient to each of the found clusters
- Model-based clustering: the different clusters are supposed to follow a certain model, such as a multivariate normal distribution with a certain mean and variance

6 The outcome of any cluster analysis procedure is an assignment of the objects to the clusters
- Usually we cannot expect a unique solution for cluster analysis; the result depends on
  - the distance measure
  - the cluster algorithm
  - the chosen parameters
- Applying unsupervised methods is often a recommendable first step in data evaluation, to gain insight into the data structure (to detect clusters or outliers)

7 Distance and Similarity Measures
- Euclidean distance: the most widely used
- Manhattan distance: less dominated by far outlying objects, since it is based on absolute rather than squared distances
- Minkowski distance: allows adjusting the power of the distances along the coordinates
- None of these distance measures is scale invariant
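The three distance measures and the lack of scale invariance can be illustrated with base R's `dist()`. This is only a sketch on made-up toy values; the object names and the factor of 1000 are assumptions chosen for illustration.

```r
# Three toy objects measured on two variables (illustrative values only).
X <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 1))

d_euclid    <- as.matrix(dist(X, method = "euclidean"))
d_manhattan <- as.matrix(dist(X, method = "manhattan"))
d_minkowski <- as.matrix(dist(X, method = "minkowski", p = 3))

d_euclid["a", "b"]      # sqrt(3^2 + 4^2)
d_manhattan["a", "b"]   # |3| + |4|
d_minkowski["a", "b"]   # (3^3 + 4^3)^(1/3)

# Not scale invariant: expressing the second variable in other units
# (here multiplied by 1000) changes the distances.
X_rescaled <- X
X_rescaled[, 2] <- X_rescaled[, 2] * 1000
d_rescaled <- as.matrix(dist(X_rescaled))

# Autoscaling (mean 0, standard deviation 1 per variable) removes the
# dependence on the units.
d_auto_original <- as.numeric(dist(scale(X)))
d_auto_rescaled <- as.numeric(dist(scale(X_rescaled)))
```

After autoscaling, both versions of the data give the same distances, which is why scaling is usually decided before any clustering is run.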

8 Most of the standard clustering algorithms can be used directly for clustering the variables
- Pearson correlation distance (for two variables $x_j$ and $x_k$): $d_{corr}(x_j, x_k) = 1 - |r_{jk}|$
- Binary vectors (chemical structures): each vector (values 0 or 1) indicates absence or presence of a certain substructure
- Tanimoto index (Jaccard similarity coefficient): $t_{AB} = \frac{\sum_j \mathrm{AND}(x_{Aj}, x_{Bj})}{\sum_j \mathrm{OR}(x_{Aj}, x_{Bj})}$
- The AND sum counts the variables with a 1 in both vectors, the OR sum those with a 1 in at least one of the vectors
- $d_{Tani} = 1 - t_{AB}$
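A small worked example of the Tanimoto index may help. The two binary vectors below are hypothetical fingerprints invented for illustration, and `tanimoto()` is a helper name introduced here, not a function from any package.

```r
# Two hypothetical binary fingerprints (1 = substructure present).
x_A <- c(1, 1, 0, 1, 0, 0, 1, 0)
x_B <- c(1, 0, 0, 1, 1, 0, 1, 0)

# Tanimoto index: variables with a 1 in both vectors divided by
# variables with a 1 in at least one vector.
# (Undefined when both vectors contain only zeros.)
tanimoto <- function(a, b) {
  both   <- sum(a == 1 & b == 1)   # AND count
  either <- sum(a == 1 | b == 1)   # OR count
  both / either
}

t_AB   <- tanimoto(x_A, x_B)   # 3 shared / 5 present in either
d_tani <- 1 - t_AB

# Cross-check: base R's "binary" distance is the Jaccard distance.
d_check <- as.numeric(dist(rbind(x_A, x_B), method = "binary"))
```

Because `dist(..., method = "binary")` already returns this distance, a matrix of fingerprints can be passed to it directly when a full distance matrix is needed.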

9 Partitioning Methods
- Clustered objects: $x_1^{(1)}, \ldots, x_{n_1}^{(1)}, x_1^{(2)}, \ldots, x_{n_2}^{(2)}, \ldots, x_1^{(k)}, \ldots, x_{n_k}^{(k)}$
- The numbers of objects assigned to the clusters sum to n: $n_1 + \ldots + n_k = n$
- Often, the algorithms are adapted to the type of geometry of the data

10 k-means
- The most widely known algorithm for partitioning
- Data mining algorithm
- It uses pairwise distances between the objects
- Requires the desired number of clusters, k, as input
- Centroids (means) represent the center of each cluster: $c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}$
- The objective (function) of k-means is to minimize the total within-cluster sum of squares: $\sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2 \to \min$
- $\lVert x_i^{(j)} - c_j \rVert^2$ is the squared Euclidean distance between an object $x_i^{(j)}$ of cluster j and the cluster centroid $c_j$

11 The most widely used algorithm for k-means works as follows
1. Select a number k of desired clusters and initialize k cluster centroids $c_j$, for example by randomly selecting k different objects
2. Assign each object to the cluster with the closest centroid, i.e., compute for each object the distance $\lVert x_i - c_j \rVert$ for j = 1, ..., k and assign $x_i$ to the cluster with the minimum distance to the cluster centroid
3. Recompute the cluster centroids
4. Repeat steps 2 and 3 until the centroids become stable
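The four steps above can be sketched in base R as follows. This is a minimal teaching version, not the implementation behind `kmeans()`; the function name `simple_kmeans`, the `max_iter` argument, and the synthetic data are assumptions introduced for illustration. Stopping when the assignments no longer change is used as the test for stable centroids.

```r
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: initialise the centroids with k randomly chosen, distinct objects
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance of every object to every centroid
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, centroids[j, ])^2)
    })
    new_cluster <- max.col(-d2, ties.method = "first")  # closest centroid
    # Step 4: unchanged assignments mean the centroids are stable
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: recompute each centroid as the mean of its members
    for (j in seq_len(k)) {
      members <- X[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  # Total within-cluster sum of squares (the k-means objective)
  wss <- sum((X - centroids[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centroids, tot_withinss = wss)
}

# Usage on two well separated synthetic groups (illustrative data).
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(X, k = 2)
reference <- kmeans(X, centers = 2, nstart = 10)
```

On data this well separated the sketch and `kmeans()` reach the same objective value; on harder data the result depends on the random start, which is why several starts are normally compared.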

14 The k-means algorithm tends to find spherical clusters
k-means_spherical
    library(cluster)
    library(chemometrics)
    data("glass")
    data("glass.grp")
    # k-means with 2, 3 and 4 clusters
    k2 <- kmeans(glass, 2)
    k3 <- kmeans(glass, 3)
    k4 <- kmeans(glass, 4)
    # PCA scores of the autoscaled data, used only for plotting
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    par(mfrow = c(2, 2))
    plot(u_pca[, 1], u_pca[, 2], col = glass.grp)   # original groups
    plot(u_pca[, 1], u_pca[, 2], col = k2$cluster)
    plot(u_pca[, 1], u_pca[, 2], col = k3$cluster)
    plot(u_pca[, 1], u_pca[, 2], col = k4$cluster)

15 elliptical clusters

16 Silhouette
- It is proposed for partitioning techniques
- Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation
- The silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters
- The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration
- The average silhouette width provides an evaluation of clustering validity, and might be used to select an appropriate number of clusters

19 Honey data, k = 4
- Rape (ra): 8-10
- Honeydew (hd):
- Floral (of): 4-7
- Acacia (ac):
- samples and 11 parameters

20 honeydata.mat: X, 27x11
- To get an idea of how well separated the resulting clusters are, you can make a silhouette plot using the cluster indices output from k-means
- The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters
- This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster
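To make the range from +1 to -1 concrete, the silhouette value can be computed by hand. The sketch below assumes the standard definition, which the slides do not spell out: for object i, a is the mean distance to the other objects of its own cluster, b is the smallest mean distance to the objects of any other cluster, and s = (b - a) / max(a, b). The function name `silhouette_widths` and the four toy points are introduced here for illustration, and at least two clusters are assumed.

```r
silhouette_widths <- function(X, cluster) {
  D <- as.matrix(dist(X))
  labels <- sort(unique(cluster))
  sapply(seq_len(nrow(D)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)   # usual convention for singleton clusters
    # a: mean distance to the other members of the own cluster (D[i, i] is 0)
    a <- sum(D[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(labels, cluster[i]),
                    function(l) mean(D[i, cluster == l])))
    (b - a) / max(a, b)
  })
}

# Four points on a line: two tight pairs, far apart.
X <- matrix(c(0, 1, 10, 11), ncol = 1)

s_good <- silhouette_widths(X, c(1, 1, 2, 2))  # sensible assignment
s_bad  <- silhouette_widths(X, c(1, 2, 2, 2))  # second point misassigned
```

With the sensible assignment every value is close to +1; moving the second point into the distant cluster gives it a strongly negative value, the signature of a probable misassignment.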

21 Autoscaling, k = 4
    idx4 = kmeans(x, 4);
    [silh4, h] = silhouette(x, idx4);
[Figure: silhouette plot; axes: silhouette value, cluster]
- Large silhouette values, greater than 0.6, indicate that the cluster is somewhat separated from neighboring clusters
- Points with low silhouette values, and points with negative values, indicate that the cluster is not well separated

22 k = 3?
    idx3 = kmeans(x, 3);
    [silh3, h] = silhouette(x, idx3);
[Figure: silhouette plot; axes: silhouette value, cluster]

23 k = 5?
    idx5 = kmeans(x, 5);
    [silh5, h] = silhouette(x, idx5);
[Figure: silhouette plot; axes: silhouette value, cluster]

24 A more quantitative way to compare the solutions is to look at the average silhouette values
- Values of k from 2 to 9 were tested
- The best silhouette was obtained when k = 2
    mean(silh2)
    ans =
- Without some knowledge of how many clusters are really in the data, it is a good idea to experiment with a range of values for k
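The same comparison over a range of k can be sketched in R with `silhouette()` from the cluster package. The data here are synthetic (three well separated groups invented for illustration), so the outcome says nothing about the honey data; `nstart = 20` is an assumption made to reduce the dependence on the random start.

```r
library(cluster)   # silhouette()

set.seed(42)
# Three well separated synthetic groups (illustrative data only).
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
           cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 5, sd = 0.5)))
D <- dist(X)

# Average silhouette width for k = 2, ..., 9
k_values <- 2:9
avg_width <- sapply(k_values, function(k) {
  fit <- kmeans(X, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, D)[, "sil_width"])
})
names(avg_width) <- k_values

best_k <- k_values[which.max(avg_width)]
```

Plotting `avg_width` against `k_values` gives a quick overview; the k with the largest average width is a candidate, not a proof, of the number of clusters.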

25 1H NMR spectra
- Tested k = 2 to 9
- Best value: k = 3
    mean(silh3) =
[Figure: silhouette plot; axes: silhouette value, cluster]

26 k-means_silhouette
    library(cluster)
    library(fpc)
    data("ruspini")
    disse <- daisy(ruspini)   # dissimilarity matrix
    de2 <- disse^2            # squared dissimilarities
    k3 <- kmeans(ruspini, 3)
    sk3 <- silhouette(k3$cluster, de2)
    summary(sk3)              # the higher the mean, the better
    par(mfrow = c(1, 1))
    plot(sk3, col = c("red", "green", "blue"))
    k4 <- kmeans(ruspini, 4)
    sk4 <- silhouette(k4$cluster, de2)
    summary(sk4)              # lower mean than k = 3
    plot(sk4, col = c("red", "green", "blue", "purple"))
    k5 <- kmeans(ruspini, 5)
    sk5 <- silhouette(k5$cluster, de2)
    summary(sk5)              # lower mean than k = 3
    plot(sk5, col = c("red", "green", "blue", "purple", "yellow"))
    k6 <- kmeans(ruspini, 6)
    sk6 <- silhouette(k6$cluster, de2)
    summary(sk6)              # lower mean than k = 3
    k7 <- kmeans(ruspini, 7)
    sk7 <- silhouette(k7$cluster, de2)
    summary(sk7)              # lower mean than k = 3
    # cluster plots for k = 3, 4 and 5
    plotcluster(ruspini, k3$cluster)
    clusplot(ruspini, k3$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
    plotcluster(ruspini, k4$cluster)
    clusplot(ruspini, k4$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
    plotcluster(ruspini, k5$cluster)
    clusplot(ruspini, k5$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

33 Hierarchical Clustering Methods
- Agglomerative methods
  - n clusters in the first level of the hierarchy
  - The two closest clusters are merged in the next level
  - And so on...
- Divisive methods
  - One cluster in the first level
  - In the next level, this cluster is split into two smaller clusters
  - And so on...

34 Splitting or merging
- Similarity or distance between the clusters
- CA searches for objects which are close together in the variable space
- The general distance is given by $d_{ij} = \left[ \sum_{k=1}^{n} |x_{ik} - x_{jk}|^N \right]^{1/N}$, where n is the number of variables
- For N = 2, this is the familiar n-space Euclidean distance
- Higher values of N give more weight to larger coordinate differences

35 The distance between two clusters j and l can be determined by various methods
Complete Linkage (furthest neighbour): $\max_{i,i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$
- It uses the distance to the farthest point
- It usually results in homogeneous clusters in the early stages of the agglomeration, but the resulting clusters will be small

36 Single Linkage: $\min_{i,i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$
- It judges the nearness of a point to a cluster on the basis of the distance to the closest point in the cluster
- It is known for the chaining effect, because even quite homogeneous clusters can be linked just by chance as soon as two objects are very similar

37 Average Linkage: $\mathrm{average}_{i,i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$

38 Centroid Method: $\lVert c_j - c_l \rVert$
- The distance of a point to the centre of gravity of the points in a cluster is used

39 Ward's Method: $\lVert c_j - c_l \rVert \cdot \sqrt{\frac{2 n_j n_l}{n_j + n_l}}$
- A correction to the centroid method, so that the merge distances increase during the clustering procedure
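The five between-cluster distances can be compared on a tiny example. The two clusters below are invented for illustration. The Ward value is computed here as the centroid distance multiplied by sqrt(2 n_j n_l / (n_j + n_l)); that reading of the slide's formula is an assumption.

```r
# Two small hypothetical clusters in two dimensions.
A <- rbind(c(0, 0), c(0, 2))   # cluster j
B <- rbind(c(4, 0), c(4, 2))   # cluster l
n_A <- nrow(A)
n_B <- nrow(B)

# All pairwise Euclidean distances between objects of A and objects of B
cross <- as.matrix(dist(rbind(A, B)))[seq_len(n_A), n_A + seq_len(n_B)]

# Cluster centroids and their distance
c_A <- colMeans(A)
c_B <- colMeans(B)
centroid_dist <- sqrt(sum((c_A - c_B)^2))

linkage <- c(
  complete = max(cross),    # farthest pair
  single   = min(cross),    # closest pair
  average  = mean(cross),   # mean over all pairs
  centroid = centroid_dist,
  # Ward: centroid distance with a size-dependent correction (assumed form)
  ward     = centroid_dist * sqrt(2 * n_A * n_B / (n_A + n_B))
)
```

Printing `linkage` shows that single is the smallest and complete the largest of the pairwise criteria, with average in between, while the centroid and Ward values depend only on the cluster means and sizes.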

40 Agglomerative clustering algorithm
1. Define each object as a separate cluster and compute all pairwise distances
2. Merge those clusters (= objects) with the smallest distance into a new cluster
3. Compute the distances between all clusters using a chosen method
4. Merge those clusters with the smallest distance from step 3
5. Proceed with steps 3 and 4 until only one cluster remains
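The steps above can be written out naively in base R. This is a deliberately simple sketch, far slower than `hclust()`; the function name `agglomerate` and its `linkage` argument are assumptions introduced here. Passing `max`, `min` or `mean` as the linkage gives complete, single or average linkage.

```r
agglomerate <- function(X, linkage = max) {
  D <- as.matrix(dist(X))                  # step 1: all pairwise distances
  clusters <- as.list(seq_len(nrow(D)))    # step 1: every object is a cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(1, 2)
    best_d <- Inf
    # step 3: distance between every pair of clusters under the linkage
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        d_ab <- linkage(D[clusters[[a]], clusters[[b]]])
        if (d_ab < best_d) {
          best_d <- d_ab
          best <- c(a, b)
        }
      }
    }
    # steps 2 and 4: merge the two closest clusters
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best_d)          # distance at which the merge occurs
  }
  heights                                  # step 5: stop at a single cluster
}

# Usage on a small random data set, compared with hclust()
set.seed(7)
X <- matrix(rnorm(16), ncol = 2)
h_complete <- agglomerate(X, linkage = max)
h_single   <- agglomerate(X, linkage = min)
ref_complete <- hclust(dist(X), method = "complete")$height
ref_single   <- hclust(dist(X), method = "single")$height
```

The returned merge distances are the heights at which branches join in the dendrogram, which is why they can be checked against the `height` component of `hclust()`.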

41 The results of hierarchical clustering are usually displayed by a dendrogram
- Points are grouped together into clusters based on their nearness or similarity
- The nearness of points in n-space reflects the similarity of their properties
- Similarity values, $S_{ij}$, are calculated as $S_{ij} = 1 - \frac{d_{ij}}{d_{ij}^{\max}}$, where $d_{ij}^{\max}$ is the largest pairwise distance
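As a brief illustration, the similarity values can be obtained from a distance matrix in one line. The three points are made up, and taking the maximum over all pairwise distances is an assumption about what the maximum in the formula refers to.

```r
# Three collinear toy points: distances are 5, 5 and 10.
X <- rbind(c(0, 0), c(3, 4), c(6, 8))
D <- as.matrix(dist(X))

# Similarity: 1 for identical objects, 0 for the most distant pair
S <- 1 - D / max(D)
```

Here the two end points get similarity 0, and the middle point has similarity 0.5 to each of them.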

42 [Figure: dendrogram; axes: observation, distance/similarity]

43 Complete and Single Linkage
    library(cluster)
    library(chemometrics)
    data("glass")
    # complete linkage
    cl <- hclust(dist(glass), method = "complete")
    plot(cl)
    # single linkage
    sl <- hclust(dist(glass), method = "single")
    plot(sl)

51 Fuzzy Clustering
- Partitioning methods assign each object to exactly one cluster
- Fuzzy clustering, in contrast, allows a fuzzy assignment: an observation is not assigned exclusively to one cluster but, in part, to all clusters
- Membership coefficients $u_{ij}$ of each observation $x_i$ to each cluster lie in [0, 1]: 0 means no assignment, 1 means full assignment

52 Fuzzy clustering membership Group 1 or Group 2? Group 3?

53 Fuzzy C-Means algorithm
- Objective function: $\sum_{j=1}^{k} \sum_{i=1}^{n} u_{ij}^2 \lVert x_i - c_j \rVert^2 \to \min$
- Similar to k-means, the number of clusters k has to be provided as an input
- Centroids: $c_j = \frac{\sum_i u_{ij}^2 x_i}{\sum_i u_{ij}^2}$
- When using only memberships of 0 and 1, this algorithm reduces to k-means
- In each iteration step the membership coefficients are updated by $u_{ij} = \left[ \sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^2 \right]^{-1}$ and the cluster centroids are recalculated
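The membership and centroid updates can be sketched in base R as below. This is a minimal illustration with memberships squared as in the objective function, not the algorithm behind `cmeans()` in e1071; the function name `fuzzy_cmeans`, the `init`, `max_iter` and `tol` arguments, and the synthetic data are assumptions introduced here.

```r
fuzzy_cmeans <- function(X, k, init = sample(nrow(X), k),
                         max_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  centroids <- X[init, , drop = FALSE]   # start from k chosen objects
  U <- NULL
  for (iter in seq_len(max_iter)) {
    # Squared distances ||x_i - c_j||^2 as an n x k matrix
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)  # guard against division by zero
    # Membership update: u_ij = (1 / d2_ij) / sum_l (1 / d2_il)
    U <- (1 / d2) / rowSums(1 / d2)
    # Centroid update: means weighted by the squared memberships
    new_centroids <- (t(U^2) %*% X) / colSums(U^2)
    shift <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (shift < tol) break               # centroids are stable
  }
  list(membership = U, centers = centroids)
}

# Usage on two well separated synthetic groups; one starting object is
# taken from each group to keep the example deterministic.
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fc <- fuzzy_cmeans(X, k = 2, init = c(1, 21))
hard <- max.col(fc$membership)   # cluster with the largest membership
```

Each row of `fc$membership` sums to 1; taking the largest coefficient per row turns the fuzzy result into a hard partition when one is needed for plotting.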

54 Fuzzy clustering
    library(chemometrics)
    library(e1071)
    data("glass")
    data("glass.grp")
    # fuzzy c-means with 2, 3 and 4 clusters
    fc2 <- cmeans(glass, 2)
    fc3 <- cmeans(glass, 3)
    fc4 <- cmeans(glass, 4)
    # PCA scores of the autoscaled data, used for plotting
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    par(mfrow = c(2, 2))
    plot(u_pca[, 1], u_pca[, 2], col = glass.grp, main = "original classes")
    plot(u_pca[, 1], u_pca[, 2], col = fc2$cluster, main = "fuzzy clustering with 2 clusters")
    plot(u_pca[, 1], u_pca[, 2], col = fc3$cluster, main = "fuzzy clustering with 3 clusters")
    plot(u_pca[, 1], u_pca[, 2], col = fc4$cluster, main = "fuzzy clustering with 4 clusters")
    # clustering the PCA scores instead of the original variables
    par(mfrow = c(1, 2))
    fcpca4 <- cmeans(u_pca, 4)
    plot(u_pca[, 1], u_pca[, 2], col = fcpca4$cluster, main = "fuzzy clustering with 4 clusters using pca scores")
    plot(u_pca[, 1], u_pca[, 2], col = fc4$cluster, main = "fuzzy clustering with 4 clusters")

57 Model-Based Clustering
- Model-based clustering assumes a statistical model of the clusters
- The simplest approach: multivariate normal distributions with different means but covariance matrices of the same form, $\sigma^2 I$
  - The same spherical cluster shape and size for all clusters
- A more complicated situation: $\sigma_j^2 I$, for j = 1, ..., k
  - Still spherical clusters, but different cluster sizes
- In a third type of cluster model, the covariance matrix is not diagonal (as it is in the previous two model types)
  - Elliptically symmetric clusters
- The most general form is clusters with different covariance matrices $\Sigma_j$

58 [Figure: three cluster models]
- Spherical, equal volume: diagonal covariance (same $\sigma^2$ for all clusters)
- Spherical, unequal volume: diagonal covariance of different size ($\sigma_j^2$)
- Ellipsoidal, general form: different covariance matrix for each cluster ($\Sigma_j$)

59 Model-Based Clustering
    library(chemometrics)
    library(mclust)
    data("glass")
    # PCA scores of the autoscaled data
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    # model-based clustering on the first two principal components
    mbc_pc12 <- Mclust(u_pca[, 1:2])
    plot(mbc_pc12)

60 Cluster Validity and Clustering Tendency Measures
- Crucial point: the number of clusters, k
- Quality criterion and optimal number
- Homogeneity (within the clusters): $W_j = \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2$
- Heterogeneity (between the clusters): $B_{jl} = \lVert c_j - c_l \rVert^2$
- Cluster validity: $V_k = \frac{\sum_{j=1}^{k} W_j}{\sum_{j<l}^{k} B_{jl}}$
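The validity measure can be computed directly from a data matrix and a vector of cluster labels. The sketch below uses four invented points; `cluster_validity` is a helper name introduced here and is not the `clvalidity()` function of the chemometrics package.

```r
cluster_validity <- function(X, cluster) {
  X <- as.matrix(X)
  labels <- sort(unique(cluster))
  # Centroid of each cluster, one row per cluster
  centroids <- do.call(rbind, lapply(labels, function(j) {
    colMeans(X[cluster == j, , drop = FALSE])
  }))
  # Homogeneity W_j: squared distances of the members to their centroid
  W <- sapply(seq_along(labels), function(j) {
    members <- X[cluster == labels[j], , drop = FALSE]
    sum(sweep(members, 2, centroids[j, ])^2)
  })
  # Heterogeneity B_jl: squared distances between centroids, for j < l
  B <- dist(centroids)^2
  sum(W) / sum(B)
}

# Four points at the corners of a 4 x 2 rectangle.
X <- rbind(c(0, 0), c(0, 2), c(4, 0), c(4, 2))
V_good <- cluster_validity(X, c(1, 1, 2, 2))  # groups the close pairs
V_bad  <- cluster_validity(X, c(1, 2, 1, 2))  # groups the distant pairs
```

Smaller values indicate compact, well separated clusters: the sensible partition gives 0.25, the poor one gives 4.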

61 Cluster validity in R
    library(chemometrics)
    clvalidity(glass)
    $validity
[Output: table with columns kmeans, fuzzy, mclust]

62 Hopkins Statistic
- For points i = 1, ..., n:
  - $d_{W,i}$: distance of a data object to the nearest neighboring object
  - $d_{U,i}$: distance of an arbitrary (artificial) point in the data space to the nearest neighboring object
- $H = \frac{\sum_i d_{U,i}}{\sum_i d_{U,i} + \sum_i d_{W,i}}$
- Null hypothesis: the dataset is uniformly distributed (i.e., no meaningful clusters)
- Alternative hypothesis: the dataset is not uniformly distributed (i.e., contains meaningful clusters)
- With H as defined above, uniformly distributed data give H near 0.5 and clustered data give H close to 1
- When the statistic is reported with $d_W$ in the numerator instead (that is, as 1 - H), a value close to zero lets us reject the null hypothesis and conclude that the dataset is significantly clusterable
    library(clustertend)
    hopkins(glass, n = 100)
    $H
    [1]
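The statistic can also be sketched in base R to see how it behaves. This version follows the formula with $d_U$ in the numerator and plain (unpowered) nearest-neighbour distances; it is not the `hopkins()` function of clustertend. The function name `hopkins_sketch`, the sample size `m`, the bounding-box sampling of artificial points, and the synthetic data are all assumptions.

```r
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Distance from a point q to its nearest neighbour among the rows of Y
  nn_dist <- function(q, Y) sqrt(min(rowSums(sweep(Y, 2, q)^2)))
  # d_W: m sampled data objects, nearest neighbour among the other objects
  idx <- sample(n, m)
  d_W <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
  # d_U: m artificial points drawn uniformly within the range of each variable
  lower <- apply(X, 2, min)
  upper <- apply(X, 2, max)
  U <- matrix(runif(m * p, rep(lower, each = m), rep(upper, each = m)), nrow = m)
  d_U <- apply(U, 1, nn_dist, Y = X)
  sum(d_U) / (sum(d_U) + sum(d_W))
}

set.seed(123)
# Two tight synthetic groups versus uniformly scattered points
clustered <- rbind(matrix(rnorm(400, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(400, mean = 5, sd = 0.3), ncol = 2))
uniform <- matrix(runif(600), ncol = 2)

H_clustered <- hopkins_sketch(clustered, m = 30)
H_uniform   <- hopkins_sketch(uniform, m = 30)
```

Under this convention the clustered data give a value well above 0.5 and the uniform data a value around 0.5; a function that reports the complementary ratio would show the clustered data near 0 instead.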
