Stats fest Multivariate analysis. Multivariate analyses. Aims. Multivariate analyses. Objects. Variables

Size: px

Start display at page:

Download "Stats fest Multivariate analysis. Multivariate analyses. Aims. Multivariate analyses. Objects. Variables"

Justin Wiggins
6 years ago
Views:

1 Stats fest 7 Multivariate analysis murray.logan@sci.monash.edu.au Multivariate analyses ims Data reduction Reduce large numbers of variables into a smaller number that adequately summarize the patterns Reveal patterns in the data that cannot be found using isolated variables Characterize things based on a large number of variables Classify sites Taxonomy Multivariate analyses 3 Objects Things we wish to compare Sampling or experimental units E.g. sites, quadrats Variables Characteristics measured from each object Usually continuous variables counts of many different species (species abundances) Size of body parts (taxonomy)

2 Multivariate analyses R-mode analyses Combine variables based on correlations E.g. Principal components analysis (PC) Q-mode analyses Combine variables based on object dissimilarity E.g. Multidimensional scaling (MDS) nalysis of similarity (NOSIM) utocorrelation Cluster analysis PC 5 ims Data reduction Reveal patterns in the data that cannot be found using isolated variables Data Many predictor variables measured from the same sampling units PC 6 If there are variables that are correlated to oneanother Combine them together xis rotation First component Surface of best fit Explains most variation Next component Perpendicular Explains next most Next.

3 PC 7 Difficult to visualize when more than 3 variables Eigenanalysis Matrix algebra used to do axes rotation in multidimensional space Start with p original variables End with p new completely uncorrelated variables (principal components) PC 8 Eigenanalysis Calculate correlation matrix between all p variables Calculate new principal components (PC) Eigenvalues (latent roots) mount or original variation explained by each new principal component dds up to the number of original variables Component loadings contribution of each original variable to each of the new PC Factor scores PC 9 How many components to keep Eigenvalue > rule The sum of the eigenvalues is always equal to the number of original variables ny PC > must be explaining more than its share of the variation Retain ny PC < not explaining much Do not retain 3

4 PC How many components to keep Scree test Obvious elbow PC Ordination plot PC ssumptions ecause it is based on correlations ssumes linearity

5 Q-mode analyses 3 Distance (dissimilarity) measures Measure of the degree of difference between each pair of objects based on a set of variables How different sites are with respect to species composition How different organisms are with respect to a suit of morphological and/or genetic characteristics Smaller dissimilarities represent higher degree of similarity Dissimilarity Euclidean distance Geometric distance between objects (j and k) in multidimensional space when two objects identical No maximum value Joint absences considered similar Dissimilarity 5 ray-curtis dissimilarity when two objects are identical Reaches a maximum of when two objects have no variables in common Joint absences ignored 5

6 Dissimilarity Distance matrix Site Sp Sp Sp C D E Distance matrix C D E C D E Euclidean distances C D E ray-curtis distances C D E Dissimilarity which is best? 7 Species abundance data Zeros common Max value when quadrats have no species in common ray-curtis preferred > library(vegan) > *.bc <- vegdist(variables, bray ) Measurement/morphological data Zeros rare Euclidean distance OK > library(vegan) > *.euc <- vegdist(variables, euc ) Dissimilarity 8 Other distances Genetic distances from gene frequencies Nei s distance Edward s (ngular) distance Coancestrality coefficient (Reynolds ) distance Classical Euclidean (Rogers ) distance bsolute genetics (Provesti's) distance 6

7 Standardizations 9 im To allow all variables to have an equal influence on patterns voids overweighting by highly abundant species llows rare species to contribute Different environmental variables measured on different scales Scale each variable Divide all observations by max for that variable Scale to a mean of and sd of Standardizations Scale to maximums Raw data Standardized (max) data Site C D E Sp Sp 9 Sp Site C D E Sp..... Sp Sp > library(vegan) > *.stnd <- decostand(*[,:], max ) Standardizations Scale to mean of, sd of Raw data Standardized (max) data Site C D E Sp Sp 9 Sp Site C D E Sp Sp Sp > *.stnd <- scale(*[,:]) 7

8 Multidimensional scaling (MDS) ims Graphical representation of dissimilarity between objects in as few dimensions (axes) as possible xes are new variables MDS 3. Setup data Objects (sites) in rows Variables (species) in columns Site Sp Sp Sp3 Sp 5 Sp5 MDS. Calculate dissimilarity (ray-curtis) > library(vegan) > *.bc <- vegdist(variables, bray ) Site Site Site 3 Site Site 5 Site 6 Site Site Site Site..9.8 Site 5..5 Site 6. 8

9 MDS 5 3. Decide on the number of dimensions for ordination Suspected number of major underlying ecological gradients Minimize new axes (variables) but still retain information Usually between and four dimensions > library(mss) > *.mds <- isomds(*.dist, k=) MDS 6. rrange objects on ordination plot Starting configuration Usually random MDS 7 5. Compare distances on ordination plot with dissimilarity distances Strength of the relationship between ordination distances and dissimilarity distances Kruskal s stress (equivalent to-r ) 9

10 MDS 8 6. Move objects on ordination iteratively Each move improves the match between dissimilarities and ordination distances Lower stress value MDS 9 7. Final configuration When further moving of objects no longer improves the match Stress low as possible MDS 3 Non-metric MDS Rank-based regression Similar to rank based correlation etter for ecological data How low should stress be? >. (%) basically random <.5 (5%) is good match <. (%) is ideal Ordination configuration is close to actual dissimilarities small number of new variables explain most of the patterns contained in all the original variables

11 Hypothesis testing? 3 Is there are difference between the habitats with respect to species composition? Can we use the new axes scores in NOV? NO! (why?) UT Site Sp Sp Sp3 Sp Sp5 5 5 Habitat Habitat nalysis of Similarities (NOSIM) 3 im To compare groups based on similarities of objects Uses dissimilarity matrices Data H : Categorical variable Multiple continuous response variables Dissimilarity matrix verage rank dissimilarities between objects within groups = verage rank dissimilarities between objects between groups No difference in species composition between groups nalysis of Similarities (NOSIM) 33 Habitat Site Sp Sp Sp3 Sp 5 Sp5 Site Site Site 3 Site Site 5 Site 6 Site Site Site Site..9.8 Site 5..5 Site 6. R = r b r w / number of sites

12 nalysis of Similarities (NOSIM) 3 Dissimilarities not normally distributed ased on ranks Dissimilarities not independent Uses randomization procedures to construct a probability distribution Generates own test statistic (called R) Cluster analysis 35 ims Combines similar objects together into clusters which are displayed as a dendogram Cluster analysis 36 Single linkage (Nearest neighbour) method. Calculate dissimilarity matrix. First cluster is formed between two objects with smallest dissimilarity C D E

13 Cluster analysis Next cluster is between the next most similar pairs and so on Length of linkage reflects dissimilarity C D E Cluster analysis Next cluster is between the next most similar pairs and so on C D E Cluster analysis Next cluster is between the next most similar pairs and so on. Procedure continues until all objects are linked in clusters C D E 3

14 Cluster analysis Other linkage methods verage linkage Unweighted Pair-Group Method of rithmetic veraging (UPGM) verage neighbour Complete linkage (Furthest neighbour) Distance between clusters determined by most dissimilar objects in their groups Clustering How well do the cluster groups match the dissimilarity patterns cophenetic correlation.83 Dissimilarity distances C D E C D E C D E C D E Cluster distances C D E Minimum spanning trees Mapped over ordination plots. Find smallest dissimilarity. Join these objects with a line 3. Find the next lowest dissimilarity and join objects. Repeat until all points joined 5. Short lines represent within clusters, long lines between clusters

15 utocorrelation- Mantal test 3 im Investigate association between two distance matrices H : No association between two matrices E.g. no correlation between species abundances and environmental characteristics. Calculate correlation (r) between matrices. Since dissimilarities not independent or normal, use randomization test to reshuffle one of the matrices repeatedly utocorrelation- Mantal test Calculate probability 5

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups