Multivariate Methods

Multivariate Methods: Cluster Analysis
http://www.isrec.isb-sib.ch/~darlene/embnet/

Classification
Historically, objects have been classified into groups, e.g.:
- the periodic table of the elements (chemistry)
- taxonomy (zoology, botany)

Why classify?
- organizational convenience, convenient summary
- prediction
- explanation

Note: these aims do not necessarily lead to the same classification; e.g. classifying objects in a hardware store by SIZE vs. by TYPE/USE.

Classification, cont.
Classification divides objects into groups based on a set of measured values. Unlike a theory, a classification is neither true nor false, and should be judged largely on the usefulness of its results (Everitt). However, a classification (clustering) may be useful for suggesting a theory, which could then be tested.

The classification task: assign objects to classes (groups) on the basis of measurements made on the objects.
- Supervised: the classes are predefined; use a (training or learning) set of labeled objects to form a classifier for classifying future observations (discriminant analysis)
- Unsupervised: the classes are unknown; discover them from the data (cluster analysis)

Cluster analysis
Cluster analysis addresses the problem: given n objects, each described by p variables (or features), derive a useful division of the objects into a number of classes.
- Often we want a partition of the objects, but fuzzy clustering is also possible
- Can also be taken from an exploratory perspective (unsupervised learning)
- Most clustering methods are not statistical
- There are difficulties in defining what a "cluster" is

Clustering gene expression data
- Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes
- Can cluster samples (columns), e.g. to identify tumors based on expression profiles
- Can cluster both rows and columns at the same time

Clustering gene expression data, cont.
- Leads to readily interpretable figures
- Can be helpful for identifying patterns in time or space
- Useful (essential?) when seeking new subclasses of samples
- Can be used for exploratory, quality-assessment purposes

Visualizing gene expression data
- Dendrogram (tree diagram)
- Heat diagram, available as the R function heatmap() (see also http://rana.lbl.gov/eisensoftware.htm)
- Need to reduce the number of genes first for figures to be legible/interpretable (at most a few hundred genes, not a whole array)
- A visual representation of a given clustering (e.g. a dendrogram) is not unique
- Beware the influence of the representation on the apparent structure (e.g. the color scheme)

Cluster visualization
[Figure: Eisen, Michael B. et al. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868. Copyright 1998 by the National Academy of Sciences]
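
As a minimal sketch (on simulated data, since none is specified here), a heat diagram with row and column dendrograms can be produced in base R like this:

# Hypothetical expression matrix: 100 genes x 10 samples
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100,
            dimnames = list(paste0("gene", 1:100), paste0("sample", 1:10)))
# heatmap() clusters both rows and columns and draws the dendrograms
heatmap(x)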

Similarity
- A similarity s_ij indicates the strength of the relationship between two objects i and j
- Usually 0 ≤ s_ij ≤ 1
- Correlation-based similarity ranges from -1 to 1
- The use of correlation-based similarity is quite common in gene expression studies, but is in general contentious...

Problems using correlation
[Figure: expression profiles of 3 objects across 5 variables]

A more extreme example
[Figure: expression profiles of 3 objects across 5 variables]

Dissimilarity and distance
- Associated with a similarity measure s_ij bounded by 0 and 1 is a dissimilarity d_ij = 1 - s_ij
- Distance measures have the metric property (triangle inequality: d_ij + d_ik ≥ d_jk)
- Many examples: Euclidean ("as the crow flies"), Manhattan ("city block"), etc.
- The choice of distance measure has a large effect on performance
- The behavior of a distance measure is related to the scale of measurement
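
These dissimilarities are all one-liners in R; a minimal sketch on a hypothetical data matrix:

# 10 objects (rows) measured on 5 variables (columns)
set.seed(1)
x <- matrix(rnorm(50), nrow = 10)
d.euc <- dist(x, method = "euclidean")  # "as the crow flies"
d.man <- dist(x, method = "manhattan")  # "city block"
d.cor <- as.dist(1 - cor(t(x)))         # correlation-based dissimilarity, d = 1 - s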

Distance example
[Figure: Euclidean vs. Manhattan distance between the same two points]

What distance should I use?
- This is like asking "What tool should I buy?": it depends on what kind of similarity you are interested in finding
- With Euclidean distance, variables with larger values will tend to dominate; this is not useful if the large values are simply a result of using smaller units (e.g. grams vs. kilos)
- You can get around this (if desired) by scaling or standardizing the variables
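
A short sketch of the units problem and its fix (simulated data; scale() standardizes each column):

set.seed(1)
x <- matrix(rnorm(50), nrow = 10)
x[, 1] <- x[, 1] * 1000     # first variable recorded in much smaller units
d.raw <- dist(x)            # dominated by the first variable
d.std <- dist(scale(x))     # each variable centered and scaled to SD 1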

Partitioning methods
- Partition the objects into a prespecified number of groups K
- Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize the within-cluster sums of squares)
- Examples: k-means, self-organizing maps (SOM), partitioning around medoids (PAM; more robust, though computationally more expensive, than k-means), model-based clustering

Hierarchical clustering
- Produces a dendrogram (tree diagram)
- Avoids prespecifying the number of clusters K
- The tree can be built in two distinct ways:
  - bottom-up: agglomerative clustering
  - top-down: divisive clustering
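
A minimal sketch of two partitioning methods in R, on simulated data with two obvious groups (pam() is in the CRAN package cluster):

library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2)                     # k-means with K = 2
pm <- pam(x, k = 2)                              # partitioning around medoids
table(kmeans = km$cluster, pam = pm$clustering)  # compare the two partitions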

Agglomerative methods
- Start with n mRNA sample (or G gene) clusters
- At each step, merge the two closest clusters, using a measure of between-cluster dissimilarity
- Examples of between-cluster dissimilarities:
  - average linkage (Unweighted Pair Group Method with Arithmetic Mean, UPGMA): average of the pairwise dissimilarities
  - single linkage (nearest neighbor, NN): minimum of the pairwise dissimilarities
  - complete linkage (furthest neighbor, FN): maximum of the pairwise dissimilarities

Between-cluster distances
[Figure: illustration of the three between-cluster distances]
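
In R, all three linkage rules are options of hclust(); a minimal sketch on simulated data:

set.seed(1)
x <- matrix(rnorm(40), nrow = 10)
d <- dist(x)
hc.ave <- hclust(d, method = "average")   # UPGMA
hc.sgl <- hclust(d, method = "single")    # nearest neighbor
hc.cpl <- hclust(d, method = "complete")  # furthest neighbor
plot(hc.ave)                              # dendrogram
cutree(hc.ave, k = 2)                     # cut the tree into K = 2 clusters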

Divisive methods
- Start with only one cluster
- At each step, split a cluster into two parts
- Advantage: obtains the main structure of the data (i.e. focuses on the upper levels of the dendrogram)
- Disadvantage: computational difficulties when considering all possible divisions into two groups

Partitioning vs. hierarchical
- Partitioning
  - advantage: provides clusters that (approximately) satisfy some optimality criterion
  - disadvantages: need an initial K; long computation time
- Hierarchical
  - advantage: fast computation (agglomerative)
  - disadvantages: rigid; cannot correct later for erroneous decisions made earlier
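
Divisive clustering is available in R as diana() in the cluster package; a minimal sketch on simulated data:

library(cluster)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)
dv <- diana(x)                 # DIvisive ANAlysis clustering
plot(dv)                       # banner plot and dendrogram
cutree(as.hclust(dv), k = 2)   # cut into K = 2 clusters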

R: clustering
A number of R packages contain functions to carry out clustering, including:
- stats: hclust()
- cluster (Kaufman and Rousseeuw)
- cclust
- mclust
- e1071

Generic clustering tasks
- Estimating the number of clusters
- Assigning each object to a cluster
- Assessing the strength/confidence of cluster assignments for individual objects
- Assessing cluster homogeneity
- (Interpreting the resulting clusters)

Estimating how many clusters
- Many suggestions for how to decide this!
- Indices based on homogeneity and/or separation (within- and between-cluster sums of squares)
- Milligan and Cooper (Psychometrika 50:159-179, 1985) studied the performance of 30 such methods in a large simulation
- The R package fpc (Christian Hennig) has a function cluster.stats() which computes many of these

Additional methods
- Model-based criteria (AIC, BIC, MDL) when using model-based clustering
- GAP, GAP-PC (Tibshirani et al.)
- Average silhouette width (Kaufman and Rousseeuw)
- Mean silhouette split (Pollard and van der Laan)
- clest (Dudoit and Fridlyand)
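
As one concrete illustration, a minimal sketch of choosing K by the average silhouette width, using pam() from the cluster package on simulated data:

library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
# Average silhouette width for K = 2, ..., 6; larger is better
avg.sil <- sapply(2:6, function(k) pam(x, k = k)$silinfo$avg.width)
names(avg.sil) <- 2:6
which.max(avg.sil)   # suggested number of clusters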

Example: Bittner et al.
It has been proposed (by many) that a cancer taxonomy can be identified from gene expression experiments.
- 31 melanomas (from a variety of tissues/cell lines)
- 7 controls
- 8150 cDNAs
- 6971 unique genes
- 3613 genes strongly detected

How many clusters are present?
[Figures]

Average linkage, melanoma only
[Figure: dendrogram, cut at 1 - ρ = 0.54; legend: unclustered / cluster]

Issues in clustering
- Pre-processing (image analysis and normalization)
- Which genes (variables) are used
- Which samples are used
- Which distance measure is used
- Which algorithm is applied
- How to decide the number of clusters K

Issues in clustering (recap of the list above)

Filtering genes
- All genes (i.e. don't filter any)
- At least k (or a proportion p) of the samples must have expression values larger than some specified amount A
- Genes showing sufficient variation:
  - a gap of size A in the central portion of the data
  - an interquartile range of at least B
  - a large SD, CV, ...
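
A minimal sketch of one such variation filter in R (hypothetical expression matrix; keeping the 300 genes with the largest SD, as on the next slide):

set.seed(1)
expr <- matrix(rnorm(5000 * 20), nrow = 5000)   # genes x samples (simulated)
gene.sd <- apply(expr, 1, sd)                   # per-gene standard deviation
top <- order(gene.sd, decreasing = TRUE)[1:300]
expr.filt <- expr[top, ]                        # 300 most variable genes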

Average linkage, top 300 genes by SD
[Figure: dendrogram]

Issues in clustering (recap of the list above)

Average linkage, melanoma only
[Figure: dendrogram; legend: unclustered / cluster]

Average linkage, melanoma & controls
[Figure: dendrogram; legend: unclustered / cluster / control]

Issues in clustering (recap of the list above)

Complete linkage (FN)
[Figure: dendrogram]

Complete linkage (FN)
[Figure: dendrogram]

Single linkage (NN)
[Figure: dendrogram]

Ward's method (information loss)
[Figure: dendrogram]

Issues in clustering (recap of the list above)

Divisive clustering, melanoma only
[Figure: dendrogram]

Divisive clustering, melanoma & controls
[Figure: dendrogram]

Issues in clustering (recap of the list above)

How many clusters K?
Applying several methods yielded estimates ranging from K = 2 (largest cluster: 27 members) to K = 8 (largest cluster: 19 members).

Average linkage, melanoma only
[Figure: dendrograms cut to give K = 2 and K = 8; legend: unclustered / cluster]

Summary
- Buyer beware: the results of a cluster analysis should be treated with GREAT CAUTION and ATTENTION TO SPECIFICS, because many things can vary in a cluster analysis
- If covariates/group labels are known, then clustering is usually inefficient

Multivariate Methods: Principal Components Analysis
http://www.isrec.isb-sib.ch/~darlene/embnet/
[Figure: food.pc — plot of food.pc$x[, 1] vs. food.pc$x[, 2] with labeled points, and a scree plot of the variances]

Locating a point in the plane
- We can describe the location of a point in the plane by saying how much we move in the horizontal (X) direction, then how much we move in the vertical (Y) direction
- As an example, think of describing how to get to some particular place from where you are (for example, how to get from the train station to the BioZentrum)
- One way to do this is to say how far you go NORTH, then how far you go EAST

Directions: North = 1st?
- There is no rule that says we must first say how far to go NORTH; for example, we could instead say first how far to go SOUTH (which can be thought of as negative NORTH)
- We could even say first how far to go NORTH-EAST, then how far to go...

Alternate directions
[Figure]

A small data set
Head length (in mm) for each of the first two adult sons in 50 families.
[Figure: scatterplot of head length of 2nd son vs. head length of 1st son]

Variance-covariance matrix
- Consider a data set consisting of p variables measured on n cases
- How the variables change together is summarized by the variance-covariance matrix (or by the correlation matrix)
- For our simple example:

> cov(head)
         [,1]     [,2]
[1,] 96.95061 54.48939
[2,] 54.48939 49.57918
> cor(head)
       [,1]   [,2]
[1,] 1.0000 0.7859
[2,] 0.7859 1.0000

Principal Component Analysis (PCA)
- One aim of principal component analysis (PCA) is to reduce the dimensionality from the original p variables
- This has the effect of simplifying a dataset, by reducing multidimensional data to a lower dimension (i.e. a smaller number of variables)
- The idea is to explain the variance-covariance structure through linear combinations (the principal components) of the original variables

Principal Component Analysis (PCA), cont.
- Another aim is to interpret the first few principal components in terms of the original variables, to give greater insight into the data structure
- Each PC accounts for a certain amount of the variation in the data
- The 1st PC is the linear combination that accounts for ("explains") the most variation
- Subsequent PCs account for as much as possible of the remaining variation, while being uncorrelated with the earlier PCs

Aubergine...

What is a linear combination?
- Say we have 2 variables, height and length
- Create new variables from these by summing multiples of the original values
- Examples:
  V = 12*height + 4*length
  W = π*height - 4.2*length
  X = -3*height + 0.75*length

What is a linear transformation?
- The main example of a linear transformation is matrix multiplication
- Say we have a matrix A of dimension p x p
- We can multiply a vector x by A to form a new vector Ax
- For most vectors x, applying A to x changes both the length and the direction of the original vector

Example linear transformation
[Figure: a vector and its image under A]

Not all vectors are the same
- Usually, applying A to x changes both the length and the direction of the vector
- But for some special vectors x, the result is just an expansion or contraction, with no change of direction (except possibly a flip)
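
A tiny R illustration of this, with a hypothetical 2 x 2 matrix A (v below happens to be one of its special vectors):

A <- matrix(c(2, 1, 1, 2), nrow = 2)
x <- c(1, 0)
A %*% x     # (2, 1): both length and direction change
v <- c(1, 1)
A %*% v     # (3, 3) = 3 * v: pure expansion, direction unchanged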

Another linear transformation
[Figure]

A little (more) linear algebra
- In these cases, the new vector is just a (scalar) multiple λ of the original vector
- The values λ satisfying Ax = λx are called the eigenvalues (also characteristic values or latent roots) of the matrix A
- The corresponding (nonzero) vectors x are called eigenvectors (or characteristic vectors, latent vectors)
- It is usually convenient to scale eigenvectors to have length 1 ("unit norm")

What does this have to do with PCA?
- Consider the variance-covariance matrix A of your dataset
- The eigenvectors of A provide sets of coefficients defining p linear functions of the original variables; these functions are the PCs
- If A has eigenvalues λ_1, λ_2, ..., λ_p, then the PCs have variances λ_1, λ_2, ..., λ_p and zero covariances (i.e. they are uncorrelated, and hence give independent information)

Cautions
- PCA is sometimes used as a method for simplifying data, because PCs associated with smaller eigenvalues have smaller variances and might therefore be ignored; this assumption requires caution
- When variables are on different scales, it is customary to use the correlation matrix (rather than the covariance matrix)
- The two formulations give different results: the eigenvalues of the two matrices are not related in a simple way
- The theory is not simple for correlation-based PCA
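
A minimal sketch of this connection in R, checking that eigen() applied to the covariance matrix reproduces what prcomp() computes (simulated two-variable data; the eigenvectors may differ by a sign flip):

set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
ev <- eigen(cov(x))
ev$values            # variances of the PCs
ev$vectors           # coefficients (loadings) of the linear combinations
prcomp(x)$sdev^2     # the same variances, via prcomp()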

R: PCA (I)
> head.pc <- prcomp(head)
> head.pc
Standard deviations:
[1] 11.518663  3.721586

Rotation:
            PC1        PC2
[1,] -0.8362568 -0.5483381
[2,] -0.5483381  0.8362568

Principal axes
[Figure: scatterplot of the head-length data (140-200 mm on both axes) with the two principal axes drawn through it]

How many PCs?
- Retain the number required to explain some percentage of the total variation (e.g. 90%)
- Retain components whose eigenvalues exceed the average eigenvalue (which is 1 if the correlation matrix is used)
- Look for an "elbow" in the scree plot (the scree plot shows the proportion of variance, or just the variance, explained by each component)
- Or some compromise between these

R: PCA (II)
> summary(head.pc)
Importance of components:
                          PC1    PC2
Standard deviation     11.519 3.7216
Proportion of Variance  0.905 0.0945
Cumulative Proportion   0.905 1.0000
> screeplot(head.pc, type = "lines")
> head.pc$sdev^2
[1] 132.6796  13.8502
> head.pc$sdev^2 / sum(head.pc$sdev^2)
[1] 0.90547862 0.09452138

R: scree plots
[Figure: scree plots of the variances for head.pc (2 components) and food.pc (5 components)]

(BREAK)