High-throughput Data Analysis 2: Cluster Analysis
Overview: Why clustering? Hierarchical clustering. K-means clustering. Issues with these two. Other methods. Quality of clustering results.
Introduction: WHY DO CLUSTERING?
Why clustering? Group genes based on common features, e.g. a common expression pattern, a common phenotype, or a common time course. Find the common function of genes in a cluster (a pathway?). Assign function to new genes (guilt by association). Find intrinsic structure in the data; hypothesis-free? Does not require (does not use) additional information: an advantage and a disadvantage!
Basic Idea [Figure: genes plotted by expression in condition 1 vs. condition 2, falling into Cluster A and Cluster B] How do you formalize this?
What to Cluster: genes (expression, e.g. across tissues, time, individuals, ...; phenotype); samples (expression, phenotype); individuals (expression, phenotype, genotype); species.
Visualizing Samples [Figures: sample projections before and after batch correction]
Bi-Clustering [Figure: expression heatmap clustered simultaneously along genes and samples]
Supervised vs. Unsupervised. Supervised learning: learn to predict an outcome based on examples (e.g. regression); needs examples (training data). Unsupervised learning: find intrinsic structure in data (e.g. clustering); does not need additional data/information.
Clustering: Ingredients. In real life n >> 2 dimensions. Group genes that are close in n-dimensional space. Requires a measure of distance between genes (objects), e.g. Euclidean distance (more later). Find clusters of genes that are close to each other, but far from the rest.
HIERARCHICAL CLUSTERING
Hierarchical Clustering [Figure: genes in the 2D expression space joined step by step into a hierarchy; different cut heights give, e.g., 2 or 3 clusters]
Joining Rules: how do we measure the distance between two clusters?
Average Linkage (mean): the average of all pairwise distances between members of the two clusters.
Single Linkage (min): the distance between the closest pair of points, one from each cluster.
Complete Linkage (max): the distance between the farthest pair of points, one from each cluster.
[Figures: the three rules illustrated in the 2D expression space; a sketch comparing them follows below]
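A minimal sketch of these joining rules in practice, assuming SciPy's scipy.cluster.hierarchy (a tool choice not named in the slides); the two-cluster toy data is hypothetical:

```python
# Hierarchical clustering under different joining (linkage) rules;
# the toy matrix stands in for genes x conditions expression data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),   # cluster A
               rng.normal(5, 1, (10, 2))])  # cluster B

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    # "Cut" the tree into 2 clusters (one answer to "where do you cut?")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```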
Limitations of Hierarchical Clustering: linking cannot be reversed (updated)! Once two objects are merged they stay in the same subtree, so an early bad merge can never be corrected.
How many clusters? Where do you cut?
DISTANCE MEASURES
Distance Measures: Euclidean distance (L2 norm), Manhattan distance (L1 norm), Pearson correlation, other correlation measures, mutual information, and many others, ranging from true distances to general dissimilarity measures.
Euclidean distance: distance in (Euclidean) space; small if absolute values are similar. $d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$
Pearson Correlation: known from linear regression; similar if patterns are similar. $p(x,y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2}\,\sqrt{\sum_j (y_j - \bar{y})^2}}$, converted to a dissimilarity via $d(x,y) = \frac{1 - p(x,y)}{2}$
Euclidean vs. Pearson [Figure: expression level over time; profiles with the same shape but different absolute level are close under Pearson but far apart under Euclidean distance]
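A minimal numeric sketch of this contrast, using the formulas above on two hypothetical time-course profiles:

```python
# Euclidean distance vs. Pearson-based dissimilarity d = (1 - p)/2
# for two profiles with identical shape but shifted expression level.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
y = x + 10.0  # same pattern, higher absolute expression

euclidean = np.sqrt(np.sum((x - y) ** 2))
pearson = np.corrcoef(x, y)[0, 1]
d_pearson = (1 - pearson) / 2

print(euclidean)   # large: absolute values differ
print(d_pearson)   # ~0: the patterns are identical
```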
Weighted Distances: assign weights to the parameters, $d(x,y) = \sum_j w_j\, d(x_j, y_j)$. Not all parameters may be equally important. Setting all $w_j$ the same may not give all parameters equal influence!! Influence is determined by variance. [Figure: time-course example]
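A minimal sketch of this idea, assuming a squared-Euclidean form of the weighted sum and inverse-variance weights (both choices of mine, not prescribed by the slides):

```python
# A weighted distance where w_j = 1/var_j gives each condition
# comparable influence; otherwise a high-variance condition dominates.
import numpy as np

def weighted_dist(x, y, w):
    # d(x, y) = sum_j w_j * (x_j - y_j)^2  (squared-Euclidean flavor)
    return np.sum(w * (x - y) ** 2)

X = np.random.default_rng(1).normal(size=(100, 3))
X[:, 2] *= 50.0                  # condition 3 has huge variance
w = 1.0 / X.var(axis=0)          # down-weight high-variance conditions

x, y = X[0], X[1]
print(weighted_dist(x, y, np.ones(3)))  # dominated by condition 3
print(weighted_dist(x, y, w))           # balanced influence
```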
Example: Time Course. Determine the weights by a supervised approach. [Figure: time courses of AhR-independent vs. AhR-controlled genes]
Tibshirani et al.: "Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm."
K-MEANS
K-Means: fix the number of clusters a priori (k). Optimally split the data into k clusters: minimize the variance within clusters (i.e. minimize Euclidean distance).
K-Means algorithm [Figures: the steps in the 2D expression space]:
1. randomly assign genes to k clusters
2. compute centroids
3. re-assign genes to the closest centroid
4. repeat until stable
Repeat many times with random initial conditions. (A sketch follows below.)
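A minimal sketch of k-means with random restarts, assuming scikit-learn (where n_init is the number of random initial conditions and the best run by within-cluster variance is kept); the three-blob toy data is hypothetical:

```python
# K-means with multiple random restarts; the run with the lowest
# within-cluster sum of squares (inertia) is returned.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal((0, 6), 1, (50, 2))])

km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.labels_[:10])   # cluster assignment per gene
print(km.inertia_)       # within-cluster sum of squares
```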
K-Means: the choice of k matters. [Figures: the same data split with too few clusters vs. too many clusters]
How to determine k? Try different k; maximize between-cluster variance versus within-cluster variance. The within-cluster point scatter $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$ will always decrease as K increases.
Gap Statistic: compare W(C) on the real data with W(C) on random reference data; choose the K where the gap between the two curves is largest. [Figure: W(C) vs. K for real and for random data]
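A minimal sketch of the gap idea, using KMeans inertia as W(C) and a single uniform reference data set for brevity (Tibshirani et al. average over many reference sets; scikit-learn and the toy data are assumptions):

```python
# Gap-statistic sketch: within-cluster scatter on real vs. random data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
# Uniform reference data over the same bounding box as X
ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

for k in range(1, 7):
    w_real = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    w_ref = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ref).inertia_
    gap = np.log(w_ref) - np.log(w_real)
    print(k, round(gap, 2))  # the gap peaks near the "true" K (here 2)
```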
Figure of Merit (FOM). Cross-validation: hide some data, train the method (i.e. identify clusters), test on the hidden data. Hide the data of condition (parameter) e; after clustering, quantify similarity based on e: $FOM(e) = \sqrt{\frac{1}{n} \sum_{k=1}^{K} \sum_{C(i)=k} \left( x_i(e) - \bar{x}_k(e) \right)^2}$
Figure of Merit (FOM). FOM will also always decrease with increasing K; thus we need to normalize ("adjusted FOM"). For random data, FOM decreases with $\sqrt{\frac{n-K}{n}}$; thus $FOM_{adj}(e) = FOM(e) \Big/ \sqrt{\frac{n-K}{n}}$
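A minimal sketch of the adjusted FOM under these formulas, using scikit-learn's KMeans as the clustering step (any clustering method works; KMeans and the toy data are my assumptions):

```python
# Adjusted Figure of Merit: cluster on all conditions except the
# held-out condition e, then measure cluster tightness on e alone.
import numpy as np
from sklearn.cluster import KMeans

def adjusted_fom(X, e, K):
    n = X.shape[0]
    train = np.delete(X, e, axis=1)  # hide condition e
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=0).fit_predict(train)
    sq = 0.0
    for k in range(K):
        held_out = X[labels == k, e]
        sq += np.sum((held_out - held_out.mean()) ** 2)
    fom = np.sqrt(sq / n)
    return fom / np.sqrt((n - K) / n)  # adjusted FOM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(4, 1, (40, 5))])
for K in (2, 3, 5):
    print(K, round(adjusted_fom(X, e=0, K=K), 3))
```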
Figure of Merit (FOM) [Figure: aggregate FOM vs. K; best k?]
Getting K. Gap statistic: minimize within-cluster variance, compared against random data. FOM: cross-validation; cluster on the training conditions, minimize the variance in the held-out condition, normalized against the random expectation.
Hierarchical vs. K-Means.
Hierarchical: variable number of clusters (does not directly imply a number of clusters); cluster assignment fixed; any distance measure; clusters are deterministic.
K-Means: fixed number of clusters; cluster assignment dynamic (adaptive); only Euclidean distance (but variations exist); clusters are non-deterministic.
OTHER METHODS
K-Medoids: like K-Means, but for arbitrary distance measures. The center is defined by the most central object (the medoid). Tests each object in each cluster, so it is computationally very expensive. [Figure: K-Means centroid vs. K-Medoids medoid]
Fuzzy C-Means: assign objects to many (or all) clusters with different certainty; membership values lie between 0 and 1. Considers uncertainty in the data & clustering. Allows for multi-cluster membership (e.g. a gene participating in several pathways). (A sketch follows below.)
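A minimal from-scratch sketch of the standard fuzzy c-means updates (the fuzzifier m=2 and the toy data are assumptions; libraries such as scikit-fuzzy also implement this):

```python
# Fuzzy c-means: soft memberships in [0, 1] instead of hard labels.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # rows sum to 1
    for _ in range(iters):
        um = u ** m
        # Centers: membership-weighted means of the data
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                           axis=2) + 1e-12
        # Membership update: u_ik proportional to d_ik^(-2/(m-1))
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
u, centers = fuzzy_c_means(X, c=2)
print(u[:3].round(2))   # soft memberships, each row sums to 1
```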
Principal Component Analysis (PCA): dimension reduction. Project the data onto a smaller number of dimensions; each new dimension is a linear combination of the original dimensions. Reduce high-dimensional data to a smaller number of relevant components (e.g. so it can be visualized in 3D).
Principal Component Analysis (PCA) [Figure: data in the 2D expression space with the principal axes overlaid]
Principal Component Analysis (PCA) is very useful for: visualizing high-dimensional data; removing redundancy/dependency in the data (note ICA); clustering; detecting (and removing) batch effects or other confounding effects in the data. (A sketch follows below.)
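A minimal sketch of PCA for visualization, assuming scikit-learn and a hypothetical genes-by-conditions matrix:

```python
# Project high-dimensional expression data onto its first two
# principal components, each a linear combination of the conditions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # 200 genes, 20 conditions
X[:100] += 3.0                        # built-in group structure

pca = PCA(n_components=2)
Z = pca.fit_transform(X)              # 200 x 2 projection
print(pca.explained_variance_ratio_)  # variance captured per component
print(Z[:3])                          # coordinates ready for plotting
```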
Model-based Clustering (mixture modeling): make assumptions about the data distribution, then fit a model (a set of distributions) to the data, e.g. a Gaussian mixture model. [Figure: two overlapping distributions, Cluster 1 and Cluster 2]
Model-based Clustering (mixture modeling): make assumptions about the data distribution. Works well if the model (i.e. the distribution) is known. May give higher power (additional information is used). May give spurious results if the assumptions are incorrect. (A sketch follows below.)
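A minimal sketch of model-based clustering with a Gaussian mixture model, assuming scikit-learn (which fits the mixture by EM); the toy data is hypothetical:

```python
# Gaussian mixture model: each point gets a posterior probability
# of belonging to each component, not just a hard label.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 2, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment
probs = gmm.predict_proba(X)   # soft (posterior) assignment
print(gmm.bic(X))              # BIC can guide the choice of k
```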
ASSESSING CLUSTER QUALITY
What matters? Good statistical separation. Stability of the results. Agreement with external (independent) data. Biological plausibility.
Davies-Bouldin Index: minimize within- versus between-cluster variance: $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S(Q_i) + S(Q_j)}{S(Q_i, Q_j)}$ where $S(Q_i)$ is the average distance of the members of cluster $Q_i$ to its center and $S(Q_i, Q_j)$ is the distance between the centers of clusters $i$ and $j$.
Silhouette: a(i) = average distance of object i to all other members of its cluster; b(i) = average distance to the members of the nearest neighboring cluster; $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. The average s(i) should be close to 1. (A sketch follows below.)
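A minimal sketch computing both quality indices with scikit-learn (an assumed tool; the two-blob toy data is hypothetical):

```python
# Silhouette (higher, close to 1, is better) and Davies-Bouldin
# (lower is better) for k-means clusterings with different k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), davies_bouldin_score(X, labels))
```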
Using external data, e.g. for expression data: GO enrichment; enrichment of known transcription factor target genes; enrichment of regulatory sequence motifs.
Biological plausibility is very problem-dependent. What is known about the process? Are genes known to be related grouped into one cluster (and vice versa)? When clustering samples: are similar conditions grouped together? Do similar cell types/tissues cluster together?
Further Reading: The Elements of Statistical Learning, Hastie et al., http://www-stat.stanford.edu/~tibs/elemstatlearn/ ; http://machaon.karanagai.com/validation_algorithms.html ; http://en.wikipedia.org/wiki/Cluster_analysis