Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)




2-5 Exercise 9B - solution


7 Clustering: hierarchical clustering (part 2)

8 Looking for interpretable clusters: graph of the fusion level values
- The fusion level values of a dendrogram are the dissimilarity values at which a fusion between two branches of the dendrogram occurs
- They may help in defining cutting levels

9 Looking for interpretable clusters: plot of fusion values

10 Looking for interpretable clusters
- The graph of fusion level values shows a clear jump after each fusion between 2 and 6 groups.
- Let's cut our dendrogram at the corresponding distance.
- Do the groups make sense? Do you obtain groups containing a substantial number of sites?
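The fusion-level graph described above can be sketched in base R as follows. This is only an illustration: the toy matrix `x`, the Euclidean distance and the choice `k = 3` are assumptions and do not come from the course data.

```r
# Hypothetical toy data: 30 "sites" x 4 variables forming three loose groups
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d  <- dist(x)                         # Euclidean distance, for illustration only
hc <- hclust(d, method = "average")   # UPGMA

# hc$height holds the n - 1 fusion levels in increasing order.
# Cutting just below height[i] leaves n - i + 1 groups.
n <- nrow(x)
fusion <- data.frame(height = hc$height, groups = n:2)

plot(fusion$height, fusion$groups, type = "S",
     xlab = "h (node height)", ylab = "k (number of groups)",
     main = "Fusion levels - UPGMA")
text(fusion$height, fusion$groups, labels = fusion$groups,
     col = "red", cex = 0.8)

# Cut the dendrogram at an assumed number of groups and check group sizes
grp <- cutree(hc, k = 3)
table(grp)
```

A long horizontal step in the plot marks a large jump in dissimilarity between two successive fusions, which is a candidate cutting level.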

11 Looking for interpretable clusters
Let's do the same for the other clustering algorithms.

12 Looking for interpretable clusters
- Best support comes from the cophenetic correlation and Shepard-like diagrams.
- The graphs look different: there is no single 'truth' among these solutions. Each may provide insight into the data.
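As a rough sketch of how the cophenetic correlation can be compared across linkage methods, assuming toy data and a hypothetical set of four methods (neither is taken from the slides):

```r
# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d <- dist(x)

# Cophenetic correlation: Pearson r between the original distances
# and the cophenetic distances read from each dendrogram
methods <- c("single", "complete", "average", "ward.D2")
coph_cor <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})
round(coph_cor, 3)

# Shepard-like diagram for one method: original vs cophenetic distances
hc_avg <- hclust(d, method = "average")
plot(as.vector(d), as.vector(cophenetic(hc_avg)),
     xlab = "Original distance", ylab = "Cophenetic distance",
     main = "UPGMA")
abline(0, 1)   # points close to this line mean little distortion
```

The method with the highest correlation distorts the original distances least, which does not by itself make its groups the most interpretable.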

13 Looking for interpretable clusters: cutree and contingency tables
- If two classifications had produced the same group contents, their contingency table would show only one non-zero frequency value in each row and each column.
- This was never the case here.
- Change k to explore.
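The comparison of two classifications with `cutree()` and a contingency table might look like the sketch below. The toy data, the two linkage methods and `k = 4` are assumptions chosen for illustration.

```r
# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d <- dist(x)

hc_avg  <- hclust(d, method = "average")   # UPGMA
hc_ward <- hclust(d, method = "ward.D2")   # Ward

k <- 4                                     # change k to explore
g_avg  <- cutree(hc_avg,  k = k)
g_ward <- cutree(hc_ward, k = k)

# Contingency table of the two partitions
tab <- table(UPGMA = g_avg, Ward = g_ward)
tab

# Identical group contents <=> one non-zero cell per row and per column
same_partition <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
same_partition
```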


14 Looking for interpretable clusters: graph of silhouette widths
- The silhouette width is a measure of the degree of membership of an object to its cluster, based on the average distance between this object and all objects of the cluster to which it belongs, compared to the same measure for the next closest cluster.
- Silhouette widths range from -1 to 1 and can be averaged over all objects of a partition.
- The greater the value, the better the object is clustered.
- Negative values mean that the corresponding objects have probably been placed in the wrong cluster (intra-group variation > inter-group variation).
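A minimal sketch of the silhouette criterion, assuming the `cluster` package (normally shipped with R) and toy data; the range `2:10` for k is an arbitrary choice:

```r
library(cluster)   # provides silhouette()

# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d  <- dist(x)
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of groups
ks  <- 2:10
asw <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
k_best <- ks[which.max(asw)]

plot(ks, asw, type = "h", xlab = "k (number of groups)",
     ylab = "Average silhouette width")

# Silhouette plot for the retained partition
sil_best <- silhouette(cutree(hc, k = k_best), d)
plot(sil_best, main = "Silhouette plot")
```

Objects with a negative width in the final plot are the ones that may sit in the wrong group.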


17 Looking for interpretable clusters: the elbow method
- This method looks at the percentage of variance explained (computed from the sums of squares, SS) as a function of the number of clusters.
- One should choose a number of clusters such that adding another cluster does not explain the data much better.
- At some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
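The elbow criterion can be sketched with `kmeans()` from base R. The toy data, the range of k and the use of k-means (rather than a dendrogram cut) are assumptions made for this example.

```r
# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))

# Total sum of squares around the overall centroid
tss <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Within-group sum of squares for each k, then percentage of variance explained
ks  <- 2:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
explained <- 100 * (1 - wss / tss)

plot(ks, explained, type = "b",
     xlab = "k (number of clusters)", ylab = "Variance explained (%)")
```

The elbow is read off the plot by eye: it is the k after which the curve flattens.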


19 Looking for interpretable clusters: Mantel test
- Compares the original distance matrix to binary matrices computed from the dendrogram cut at various levels, each binary matrix representing a partition.
- Chooses the level where the matrix (Mantel) correlation between the two is the highest.
- The Mantel correlation is meant here in its simplest sense, i.e. the equivalent of a Pearson r correlation between the values in the two distance matrices.
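A base-R sketch of this comparison, under the assumption that a partition is coded as a binary dissimilarity (0 = same group, 1 = different groups); the toy data and the range of k are hypothetical:

```r
# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d  <- dist(x)
hc <- hclust(d, method = "average")

# Binary dissimilarity for a partition: 0 within a group, 1 between groups
partition_dist <- function(groups) {
  as.dist(outer(groups, groups, "!=") * 1)
}

# Pearson correlation between the original distances and each binary matrix
ks <- 2:10
mantel_r <- sapply(ks, function(k) {
  cor(as.vector(d), as.vector(partition_dist(cutree(hc, k = k))))
})
k_best <- ks[which.max(mantel_r)]

plot(ks, mantel_r, type = "h", xlab = "k (number of groups)",
     ylab = "Pearson correlation")
```

Only the correlation statistic is computed here; no permutation test is run.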


21 Looking for interpretable clusters
The Mantel statistic points to 5 groups, whereas the silhouette plot and the elbow method point to 3 groups.

22 Exercise 11A
- Using the Tikus dataset and a UPGMA tree computed on the Bray-Curtis matrix, identify the optimal number of clusters using the silhouette width criterion.
- Make a plot with the optimal number of clusters.
- Cut the tree, and plot the silhouette width graph for the given number of clusters.

23-25 Exercise 11A - solution


26 Non-hierarchical clustering
The most well-known and commonly used partitioning algorithms include:
- K-means clustering (MacQueen, 1967), in which each cluster is represented by the center, or mean, of the data points belonging to the cluster.
- K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster.
If the variables in the data table are not dimensionally homogeneous, they must be standardized prior to partitioning.

27 k-means partitioning
Not clear? Let's watch this!

29 Non-hierarchical clustering
- Creates a partition, without hierarchy.
- Determines a partition of the objects into k groups, or clusters, such that the objects within each cluster are more similar to one another than to objects in the other clusters.
- The user determines the number of groups, k.
- Requires an initial configuration (often random), which is optimized in an iterative process.
- If the initial configuration is random, the analysis is run a large number of times with different initial configurations to find the best solution.

30 k-means partitioning
- Identifies high-density regions in the data.
- To do so, the method iteratively minimizes an objective function called the total error sum of squares ($E_k^2$, TESS or SSE), which is the sum of the within-group sums of squares.
- It is basically the sum, over the k groups, of the sums of squared distances among the objects in the group, each divided by the number of objects in the group.
- If the number of groups is determined in advance, the recommended function is kmeans() of the stats package. Argument nstart repeats the analysis a given number of times using different initial configurations and keeps the best solution.
- Linear method, i.e. not appropriate for raw species abundances. Use normalized data!
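The definition of the total error sum of squares given above can be checked numerically against what `kmeans()` reports. This is a sketch on hypothetical toy data; `k = 3` and `nstart = 100` are arbitrary choices.

```r
# Hypothetical toy data (not the course dataset; real species data
# would be normalized first, as the slide recommends)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))

# k-means with many random starts; the best solution is kept
km <- kmeans(x, centers = 3, nstart = 100)
table(km$cluster)

# TESS by hand: for each group, the sum of squared distances among
# its objects divided by the number of objects, summed over groups
tess_manual <- sum(sapply(split(as.data.frame(x), km$cluster), function(g) {
  sum(dist(g)^2) / nrow(g)
}))

# Same quantity as reported by kmeans()
all.equal(tess_manual, km$tot.withinss)
```

The equality holds because the sum of squared distances to a group centroid equals the sum of squared pairwise distances within the group divided by the group size.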


32 k-means partitioning
2 groups only: k = 2

33 Best partition
Function cascadeKM() in the vegan package: a wrapper for the kmeans() function
- Creates several partitions forming a cascade from small (argument inf.gr) to large values of k (argument sup.gr)
- This cascade proposes the 'best solution' for partitioning
- The "calinski" and "ssi" indicators are used to choose the best partition

34 Best partition
The "calinski" and "ssi" indicators are used to choose the best partition.
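To illustrate what the "calinski" indicator measures, the sketch below computes the Calinski-Harabasz criterion by hand from `kmeans()` output for a cascade of k values. It deliberately avoids cascadeKM() so that it runs with base R only; the toy data and the range of k are assumptions.

```r
# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
n <- nrow(x)

# Calinski-Harabasz criterion for a k-means partition:
# (between-group SS / (k - 1)) / (within-group SS / (n - k))
calinski <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# Cascade of partitions from a small to a large number of groups
ks <- 2:8
ch <- sapply(ks, function(k) calinski(kmeans(x, centers = k, nstart = 50), n))
k_best <- ks[which.max(ch)]

plot(ks, ch, type = "b", xlab = "k (number of groups)",
     ylab = "Calinski-Harabasz criterion")
```

The partition with the highest criterion value is the one this indicator favours.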

35 Partitioning around medoids (PAM)
- The goal is to find k representative objects that minimize the sum of the dissimilarities of the observations to their closest representative object (k-means, by contrast, minimizes the sum of squared Euclidean distances within the groups).
- Allows the choice of an optimal number of groups using the silhouette criterion (package fpc).
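A short sketch of PAM with the silhouette criterion, using `pam()` from the `cluster` package rather than the fpc package mentioned above; the toy data and the range of k are assumptions.

```r
library(cluster)   # provides pam()

# Hypothetical toy data (not the course dataset)
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4),
           matrix(rnorm(40, mean = 8), ncol = 4))
d <- dist(x)   # pam() accepts a dissimilarity object directly

# Average silhouette width of the PAM partition for each k
ks  <- 2:8
asw <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
k_best <- ks[which.max(asw)]

# Final partition: medoids are actual objects of the data set
pam_best <- pam(d, k = k_best)
pam_best$id.med            # row indices of the medoids
table(pam_best$clustering) # group sizes
```

Because PAM works from a dissimilarity matrix, it is not restricted to Euclidean distance.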


36 In brief
fviz_nbclust(spe.norm, hcut, method = "silhouette", hc_method = "average")  # applies a Euclidean dissimilarity by default if diss is not specified (= chord distance on normalized data)
fviz_nbclust(spe.norm, kmeans, method = "silhouette")
fviz_nbclust(spe.norm, pam, method = "silhouette")
Results: UPGMA k = 3; kmeans k = 3; pamk k = 6 (the average silhouette width increases, but not by much)

37-38 Exercise 11b

39 Exercise 10B
- Using silhouette widths and k-means (plus cascade): can you determine the optimal number of clusters?
- Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters: how many points are wrongly classified?
- Plot the results (scatter plot + new groups).
