High-throughput Data Analysis 2: Cluster Analysis
Overview: Why clustering? Hierarchical clustering. K-means clustering. Issues with these two. Other methods. Quality of clustering results.
Introduction: WHY DO CLUSTERING?
Why clustering? Group genes based on common features, e.g. a common expression pattern, a common phenotype, or a common time course. Find the common function of genes in a cluster (a pathway?). Assign function to new genes (guilt by association). Find intrinsic structure in the data; hypothesis-free? Does not require (does not use) additional information: an advantage and a disadvantage!
Basic Idea [Figure: genes plotted by expression in condition 1 vs. condition 2, falling into Cluster A and Cluster B] How do you formalize this?
What to Cluster: genes (expression, e.g. across tissues, time, individuals, ...; phenotype); samples (expression, phenotype); individuals (expression, phenotype, genotype); species.
Visualizing Samples [Figures: sample projections before and after batch correction]
Bi-Clustering [Figure: expression heatmap clustered simultaneously along genes and samples]
Supervised vs. Unsupervised. Supervised learning: learn to predict an outcome based on examples (e.g. regression); needs examples (training data). Unsupervised learning: find intrinsic structure in data (e.g. clustering); does not need additional data/information.
Clustering: Ingredients. In real life n >> 2 dimensions. Group genes that are close in n-dimensional space. Requires a measure of distance between genes (objects), e.g. Euclidean distance (more later). Find clusters of genes that are close to each other, but far from the rest.
HIERARCHICAL CLUSTERING
Hierarchical Clustering [Figure: genes in the 2D expression space joined step by step into a hierarchy; different cut heights give, e.g., 2 or 3 clusters]
Joining Rules: how do we measure the distance between two clusters?
Average Linkage (mean): the average of all pairwise distances between members of the two clusters.
Single Linkage (min): the distance between the closest pair of points, one from each cluster.
Complete Linkage (max): the distance between the farthest pair of points, one from each cluster.
[Figures: the three rules illustrated in the 2D expression space; a sketch comparing them follows below]
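A minimal sketch of these joining rules in practice, assuming SciPy's scipy.cluster.hierarchy (a tool choice not named in the slides); the two-cluster toy data is hypothetical:

```python
# Hierarchical clustering under different joining (linkage) rules;
# the toy matrix stands in for genes x conditions expression data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),   # cluster A
               rng.normal(5, 1, (10, 2))])  # cluster B

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    # "Cut" the tree into 2 clusters (one answer to "where do you cut?")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```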
Limitations of Hierarchical Clustering: linking cannot be reversed (updated)! Once two objects are merged they stay in the same subtree, so an early bad merge can never be corrected.
How many clusters? Where do you cut?
DISTANCE MEASURES
Distance Measures: Euclidean distance (L2 norm), Manhattan distance (L1 norm), Pearson correlation, other correlation measures, mutual information, and many others, ranging from true distances to general dissimilarity measures.
Euclidean distance: distance in (Euclidean) space; small if absolute values are similar. $d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$
Pearson Correlation: known from linear regression; similar if patterns are similar. $p(x,y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2}\,\sqrt{\sum_j (y_j - \bar{y})^2}}$, converted to a dissimilarity via $d(x,y) = \frac{1 - p(x,y)}{2}$
Euclidean vs. Pearson [Figure: expression level over time; profiles with the same shape but different absolute level are close under Pearson but far apart under Euclidean distance]
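A minimal numeric sketch of this contrast, using the formulas above on two hypothetical time-course profiles:

```python
# Euclidean distance vs. Pearson-based dissimilarity d = (1 - p)/2
# for two profiles with identical shape but shifted expression level.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
y = x + 10.0  # same pattern, higher absolute expression

euclidean = np.sqrt(np.sum((x - y) ** 2))
pearson = np.corrcoef(x, y)[0, 1]
d_pearson = (1 - pearson) / 2

print(euclidean)   # large: absolute values differ
print(d_pearson)   # ~0: the patterns are identical
```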
Weighted Distances: assign weights to the parameters, $d(x,y) = \sum_j w_j\, d(x_j, y_j)$. Not all parameters may be equally important. Setting all $w_j$ the same may not give all parameters equal influence!! Influence is determined by variance. [Figure: time-course example]
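A minimal sketch of this idea, assuming a squared-Euclidean form of the weighted sum and inverse-variance weights (both choices of mine, not prescribed by the slides):

```python
# A weighted distance where w_j = 1/var_j gives each condition
# comparable influence; otherwise a high-variance condition dominates.
import numpy as np

def weighted_dist(x, y, w):
    # d(x, y) = sum_j w_j * (x_j - y_j)^2  (squared-Euclidean flavor)
    return np.sum(w * (x - y) ** 2)

X = np.random.default_rng(1).normal(size=(100, 3))
X[:, 2] *= 50.0                  # condition 3 has huge variance
w = 1.0 / X.var(axis=0)          # down-weight high-variance conditions

x, y = X[0], X[1]
print(weighted_dist(x, y, np.ones(3)))  # dominated by condition 3
print(weighted_dist(x, y, w))           # balanced influence
```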
Example: Time Course. Determine the weights by a supervised approach. [Figure: time courses of AhR-independent vs. AhR-controlled genes]
Tibshirani et al.: "Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm."
K-MEANS
K-Means: fix the number of clusters a priori (k). Optimally split the data into k clusters: minimize the variance within clusters (i.e. minimize Euclidean distance).
K-Means algorithm [Figures: the steps in the 2D expression space]:
1. randomly assign genes to k clusters
2. compute centroids
3. re-assign genes to the closest centroid
4. repeat until stable
Repeat many times with random initial conditions. (A sketch follows below.)
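A minimal sketch of k-means with random restarts, assuming scikit-learn (where n_init is the number of random initial conditions and the best run by within-cluster variance is kept); the three-blob toy data is hypothetical:

```python
# K-means with multiple random restarts; the run with the lowest
# within-cluster sum of squares (inertia) is returned.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal((0, 6), 1, (50, 2))])

km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.labels_[:10])   # cluster assignment per gene
print(km.inertia_)       # within-cluster sum of squares
```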
K-Means: the choice of k matters. [Figures: the same data split with too few clusters vs. too many clusters]
How to determine k? Try different k; maximize between-cluster variance versus within-cluster variance. The within-cluster point scatter $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$ will always decrease as K increases.
Gap Statistic: compare W(C) on the real data with W(C) on random reference data; choose the K where the gap between the two curves is largest. [Figure: W(C) vs. K for real and for random data]
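A minimal sketch of the gap idea, using KMeans inertia as W(C) and a single uniform reference data set for brevity (Tibshirani et al. average over many reference sets; scikit-learn and the toy data are assumptions):

```python
# Gap-statistic sketch: within-cluster scatter on real vs. random data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
# Uniform reference data over the same bounding box as X
ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

for k in range(1, 7):
    w_real = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    w_ref = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ref).inertia_
    gap = np.log(w_ref) - np.log(w_real)
    print(k, round(gap, 2))  # the gap peaks near the "true" K (here 2)
```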
Figure of Merit (FOM). Cross-validation: hide some data, train the method (i.e. identify clusters), test on the hidden data. Hide the data of condition (parameter) e; after clustering, quantify similarity based on e: $FOM(e) = \sqrt{\frac{1}{n} \sum_{k=1}^{K} \sum_{C(i)=k} \left( x_i(e) - \bar{x}_k(e) \right)^2}$
Figure of Merit (FOM). FOM will also always decrease with increasing K; thus we need to normalize ("adjusted FOM"). For random data, FOM decreases with $\sqrt{\frac{n-K}{n}}$; thus $FOM_{adj}(e) = FOM(e) \Big/ \sqrt{\frac{n-K}{n}}$
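A minimal sketch of the adjusted FOM under these formulas, using scikit-learn's KMeans as the clustering step (any clustering method works; KMeans and the toy data are my assumptions):

```python
# Adjusted Figure of Merit: cluster on all conditions except the
# held-out condition e, then measure cluster tightness on e alone.
import numpy as np
from sklearn.cluster import KMeans

def adjusted_fom(X, e, K):
    n = X.shape[0]
    train = np.delete(X, e, axis=1)  # hide condition e
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=0).fit_predict(train)
    sq = 0.0
    for k in range(K):
        held_out = X[labels == k, e]
        sq += np.sum((held_out - held_out.mean()) ** 2)
    fom = np.sqrt(sq / n)
    return fom / np.sqrt((n - K) / n)  # adjusted FOM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(4, 1, (40, 5))])
for K in (2, 3, 5):
    print(K, round(adjusted_fom(X, e=0, K=K), 3))
```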
Figure of Merit (FOM) [Figure: aggregate FOM vs. K; best k?]
Getting K. Gap statistic: minimize within-cluster variance, compared against random data. FOM: cross-validation; cluster on the training conditions, minimize the variance in the held-out condition, normalized against the random expectation.
Hierarchical vs. K-Means.
Hierarchical: variable number of clusters (does not directly imply a number of clusters); cluster assignment fixed; any distance measure; clusters are deterministic.
K-Means: fixed number of clusters; cluster assignment dynamic (adaptive); only Euclidean distance (but variations exist); clusters are non-deterministic.
OTHER METHODS
K-Medoids: like K-Means, but for arbitrary distance measures. The center is defined by the most central object (the medoid). Tests each object in each cluster, so it is computationally very expensive. [Figure: K-Means centroid vs. K-Medoids medoid]
Fuzzy C-Means: assign objects to many (or all) clusters with different certainty; membership values lie between 0 and 1. Considers uncertainty in the data & clustering. Allows for multi-cluster membership (e.g. a gene participating in several pathways). (A sketch follows below.)
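A minimal from-scratch sketch of the standard fuzzy c-means updates (the fuzzifier m=2 and the toy data are assumptions; libraries such as scikit-fuzzy also implement this):

```python
# Fuzzy c-means: soft memberships in [0, 1] instead of hard labels.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # rows sum to 1
    for _ in range(iters):
        um = u ** m
        # Centers: membership-weighted means of the data
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                           axis=2) + 1e-12
        # Membership update: u_ik proportional to d_ik^(-2/(m-1))
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
u, centers = fuzzy_c_means(X, c=2)
print(u[:3].round(2))   # soft memberships, each row sums to 1
```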
Principal Component Analysis (PCA): dimension reduction. Project the data onto a smaller number of dimensions; each new dimension is a linear combination of the original dimensions. Reduce high-dimensional data to a smaller number of relevant components (e.g. so it can be visualized in 3D).
Principal Component Analysis (PCA) [Figure: data in the 2D expression space with the principal axes overlaid]
Principal Component Analysis (PCA) is very useful for: visualizing high-dimensional data; removing redundancy/dependency in the data (note ICA); clustering; detecting (and removing) batch effects or other confounding effects in the data. (A sketch follows below.)
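A minimal sketch of PCA for visualization, assuming scikit-learn and a hypothetical genes-by-conditions matrix:

```python
# Project high-dimensional expression data onto its first two
# principal components, each a linear combination of the conditions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # 200 genes, 20 conditions
X[:100] += 3.0                        # built-in group structure

pca = PCA(n_components=2)
Z = pca.fit_transform(X)              # 200 x 2 projection
print(pca.explained_variance_ratio_)  # variance captured per component
print(Z[:3])                          # coordinates ready for plotting
```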
Model-based Clustering (mixture modeling): make assumptions about the data distribution, then fit a model (a set of distributions) to the data, e.g. a Gaussian mixture model. [Figure: two overlapping distributions, Cluster 1 and Cluster 2]
Model-based Clustering (mixture modeling): make assumptions about the data distribution. Works well if the model (i.e. the distribution) is known. May give higher power (additional information is used). May give spurious results if the assumptions are incorrect. (A sketch follows below.)
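A minimal sketch of model-based clustering with a Gaussian mixture model, assuming scikit-learn (which fits the mixture by EM); the toy data is hypothetical:

```python
# Gaussian mixture model: each point gets a posterior probability
# of belonging to each component, not just a hard label.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 2, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment
probs = gmm.predict_proba(X)   # soft (posterior) assignment
print(gmm.bic(X))              # BIC can guide the choice of k
```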
ASSESSING CLUSTER QUALITY
What matters? Good statistical separation. Stability of the results. Agreement with external (independent) data. Biological plausibility.
Davies-Bouldin Index: minimize within- versus between-cluster variance: $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S(Q_i) + S(Q_j)}{S(Q_i, Q_j)}$ where $S(Q_i)$ is the average distance of the members of cluster $Q_i$ to its center and $S(Q_i, Q_j)$ is the distance between the centers of clusters $i$ and $j$.
Silhouette: a(i) = average distance of object i to all other members of its cluster; b(i) = average distance to the members of the nearest neighboring cluster; $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. The average s(i) should be close to 1. (A sketch follows below.)
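A minimal sketch computing both quality indices with scikit-learn (an assumed tool; the two-blob toy data is hypothetical):

```python
# Silhouette (higher, close to 1, is better) and Davies-Bouldin
# (lower is better) for k-means clusterings with different k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), davies_bouldin_score(X, labels))
```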
Using external data, e.g. for expression data: GO enrichment; enrichment of known transcription factor target genes; enrichment of regulatory sequence motifs.
Biological plausibility is very problem-dependent. What is known about the process? Are genes known to be related grouped into one cluster (and vice versa)? When clustering samples: are similar conditions grouped together? Do similar cell types/tissues cluster together?
Further Reading: The Elements of Statistical Learning, Hastie et al., http://www-stat.stanford.edu/~tibs/elemstatlearn/ ; http://machaon.karanagai.com/validation_algorithms.html ; http://en.wikipedia.org/wiki/Cluster_analysis