High throughput Data Analysis 2. Cluster Analysis

Sherman Walters
5 years ago

1 High throughput Data Analysis 2 Cluster Analysis

2 Overview: Why clustering?; Hierarchical clustering; K means clustering; Issues with above two; Other methods; Quality of clustering results

3 Introduction WHY DO CLUSTERING?

4 Why clustering? Group genes based on common features, e.g. common expression pattern, common phenotype, common time course. Find common function of genes in cluster (pathway?). Assign function to new genes (guilt by association). Find intrinsic structure in data; hypothesis-free? Does not require (does not use) additional information: advantage and disadvantage!

5 Basic Idea expression condition 2 Cluster A Cluster B expression condition 1 How do you formalize this?

6 What to Cluster
Genes: expression (e.g. across tissues, time, individuals, ...), phenotype
Samples: expression, phenotype
Individuals: expression, phenotype, genotype
Species

7 Visualizing Samples Before batch correction: After batch correction:

8 Bi Clustering samples genes

9 Supervised vs. Unsupervised
Supervised learning: learn to predict outcome based on examples, e.g. regression; needs examples (training data).
Unsupervised learning: find intrinsic structure in data, e.g. clustering; does not need additional data/information.

10 Clustering: Ingredients In real life n >> 2 dimensions Group genes that are close in n dimensional space Requires measure of distance between genes (objects) e.g. Euclidean distance (more later) Find clusters of genes that are close to each other, but far from the rest

11 HIERARCHICAL CLUSTERING

12 Hierarchical Clustering expression condition 2 expression condition 1 23 clusters

13 Joining Rules expression condition 1 expression condition 2

14 Joining Rules expression condition 2 Average Linkage (mean) expression condition 1

15 Joining Rules expression condition 2 Single Linkage (min) expression condition 1

16 Joining Rules expression condition 2 Complete Linkage (max) expression condition 1

17 Joining Rules

18 Limitations of Hierarchical Clustering Linking cannot be reversed (updated)!

19 How many clusters? Where do you cut?
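
The joining rules and the cut can be illustrated with a minimal sketch of agglomerative clustering in plain Python. The function names, the example points, and the choice to stop merging at a fixed number of clusters (equivalent to cutting the tree at that level) are assumptions made for illustration; they are not taken from the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(points, c1, c2, linkage):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # min over all pairs
        return min(pairwise)
    if linkage == "complete":    # max over all pairs
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, n_clusters, linkage="average"):
    """Repeatedly join the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points, clusters[a], clusters[b], linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # a join is never undone
        del clusters[b]
    return [sorted(c) for c in clusters]

# hypothetical expression values under two conditions
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
for rule in ("single", "complete", "average"):
    print(rule, agglomerate(points, n_clusters=2, linkage=rule))
```

On well-separated data like this all three rules agree; they tend to differ on elongated or noisy clusters.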

20 DISTANCE MEASURES

21 Distance Measures Euclidean distance (L2 norm) Pearson correlation Mutual information Manhattan distance (L1 norm) Other correlation measures many others dissimilarity distance

22 Euclidean distance Distance in (Euclidean) space. Small if absolute values similar. $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

23 Pearson Correlation Known from linear regression. Similar if patterns similar. $p(x, y) = \frac{\sum_j (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i)}{\sqrt{\sum_j (x_{ij} - \bar{x}_i)^2 \sum_j (y_{ij} - \bar{y}_i)^2}}$, $d(x, y) = 1 - (p(x, y))^2$

24 Euclidean vs. Pearson time expression level
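
The contrast between the two measures can be made concrete with a small sketch. The three expression profiles below are invented for illustration, and the correlation-based dissimilarity is taken as 1 minus the squared Pearson correlation.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    """Correlation-based dissimilarity d = 1 - p^2."""
    return 1.0 - pearson(x, y) ** 2

# hypothetical time courses (four time points each)
gene_a = [1.0, 2.0, 3.0, 2.0]       # low expression, peak at the third time point
gene_b = [10.0, 20.0, 30.0, 20.0]   # same pattern, ten times higher level
gene_c = [2.0, 2.0, 2.0, 3.0]       # level similar to gene_a, different pattern

print("a vs b:", euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print("a vs c:", euclidean(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

Euclidean distance puts gene_a next to gene_c (similar absolute values), while the correlation-based distance puts it next to gene_b (similar pattern). Because the correlation is squared, perfectly anti-correlated profiles also get distance 0 under this definition.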

25 Weighted Distances Assign weights to parameters: $d(x, y) = \sum_j w_j \, d_j(x_i, y_i)$

26 Weighted Distances Assign weights to parameters: $d(x, y) = \sum_j w_j \, d_j(x_i, y_i)$. Not all parameters may be equally important. Setting all $w_j$ the same may not give all parameters equal influence! Influence is determined by variance. [Figure: axis label Time]
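
The point that equal weights do not mean equal influence can be checked numerically. This is a sketch under two assumptions not stated on the slide: the per-parameter distance d_j is the squared difference, and the data values are invented. It compares equal weights with weights set to the inverse of each parameter's variance.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# rows = genes, columns = parameters (hypothetical values on very different scales)
data = [
    [0.1, 100.0],
    [0.9, 110.0],
    [0.2, 300.0],
    [0.8, 290.0],
]

def contributions(weights):
    """Share of the summed pairwise distance that each parameter contributes."""
    totals = [0.0] * len(weights)
    for i in range(len(data)):
        for k in range(i + 1, len(data)):
            for j, w in enumerate(weights):
                totals[j] += w * (data[i][j] - data[k][j]) ** 2
    s = sum(totals)
    return [t / s for t in totals]

equal_w = [1.0, 1.0]
inv_var_w = [1.0 / variance(col) for col in zip(*data)]

print("equal weights:           ", contributions(equal_w))
print("inverse-variance weights:", contributions(inv_var_w))
```

With equal weights the high-variance parameter accounts for almost all of the distance; inverse-variance weights give both parameters the same share. Whether equal influence is desirable is a separate question that depends on which parameters matter for the problem.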

27 Example: Time Course Determine weights by supervised approach AhR independent AhR controlled

28 Tibshirani et al.: "Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm."

29 K MEANS

30 K Means Fix number of clusters a priori (k). Optimally split data into k clusters: minimize variance within clusters (minimizes Euclidean distance)

31 K Means expression condition 1 expression condition 2

32 K Means expression condition 2 1. randomly assign genes to k clusters 2. compute centroids 3. reassign genes to closest centroid expression condition 1

33 K Means expression condition 2 1. randomly assign genes to k clusters 2. compute centroids 3. reassign genes to closest centroid 4. repeat until stable. Repeat many times with random initial conditions. expression condition 1
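
The four steps above can be sketched in plain Python as follows. The function names, the handling of empty clusters (re-seeding with a random point), the number of restarts, and the example data are assumptions made for this illustration.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(members):
    return tuple(sum(col) / len(members) for col in zip(*members))

def k_means(points, k, rng, max_iter=100):
    # 1. randomly assign genes to k clusters
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # 2. compute centroids (assumption: an empty cluster is re-seeded with a random point)
        centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            centers.append(centroid(members) if members else rng.choice(points))
        # 3. reassign each gene to its closest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # 4. repeat until stable
        if new_labels == labels:
            break
        labels = new_labels
    within = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, within

def best_of(points, k, restarts=20, seed=0):
    """Repeat with random initial conditions; keep the run with the lowest within-cluster scatter."""
    rng = random.Random(seed)
    return min((k_means(points, k, rng) for _ in range(restarts)), key=lambda r: r[2])

# hypothetical expression values under two conditions
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centers, within = best_of(points, k=2)
print(labels, within)
```

Which cluster gets label 0 depends on the random start, so only the grouping, not the label values, should be compared between runs.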

34 K Means expression condition 2 too few clusters expression condition 1

35 K Means expression condition 2 too many clusters expression condition 1

36 How to determine k? Try different k Maximize between cluster variance versus within cluster variance Within cluster point scatter

37 How to determine k? Try different k. Maximize between cluster variance versus within cluster variance. Within cluster point scatter $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$ will always decrease as K increases
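
A direct translation of the within-cluster point scatter into code, as a sketch: d is taken to be Euclidean distance, and the points and the three hand-picked partitions are invented for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def within_scatter(points, labels):
    """W(C): half the sum of d(x_i, x_i') over all ordered pairs sharing a cluster."""
    total = 0.0
    for xi, li in zip(points, labels):
        for xj, lj in zip(points, labels):
            if li == lj:
                total += euclidean(xi, xj)
    return 0.5 * total

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
one_cluster = [0, 0, 0, 0, 0, 0]
two_clusters = [0, 0, 0, 1, 1, 1]
six_clusters = [0, 1, 2, 3, 4, 5]

for labels in (one_cluster, two_clusters, six_clusters):
    print(len(set(labels)), within_scatter(points, labels))
```

W(C) reaches zero when every object is its own cluster, which is why W(C) alone cannot be used to pick K.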

38 Gap Statistic [Figure: W(C) plotted against K for real and random data]

39 Figure of Merit (FOM) Cross validation: hide some data, train method (i.e. identify clusters), test on hidden data. Hide data of condition (parameter) e. After clustering, quantify similarity based on e: $\mathrm{FOM}(e) = \sqrt{\frac{1}{n} \sum_{k=1}^{K} \sum_{C(i)=k} (x_{ik}(e) - \bar{x}_k(e))^2}$

40 Figure of Merit (FOM) FOM will also always decrease with increasing K. Thus, need to normalize ("adjusted FOM"). For random data, FOM decreases with $\sqrt{(n - K)/n}$. Thus, $\mathrm{FOM}_{adj}(e) = \mathrm{FOM}(e) \, / \sqrt{(n - K)/n}$
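
A sketch of the FOM computation, assuming the root-mean-square form and the normalization by sqrt((n - K)/n). The data matrix is invented, and the two label vectors stand in for the output of some clustering run on the non-hidden conditions; they are hard-coded here rather than computed.

```python
import math

def fom(data, labels, e):
    """Root-mean-square deviation of the hidden condition e from its cluster means."""
    n = len(data)
    total = 0.0
    for k in set(labels):
        hidden = [row[e] for row, l in zip(data, labels) if l == k]
        mean_k = sum(hidden) / len(hidden)
        total += sum((v - mean_k) ** 2 for v in hidden)
    return math.sqrt(total / n)

def adjusted_fom(data, labels, e):
    """FOM divided by sqrt((n - K) / n); undefined when K equals n."""
    n, K = len(data), len(set(labels))
    return fom(data, labels, e) / math.sqrt((n - K) / n)

# rows = genes, columns = conditions; the last condition (index 2) is hidden
data = [
    [1.0, 1.0, 1.1],
    [1.2, 0.9, 0.9],
    [0.8, 1.1, 1.0],
    [5.0, 5.0, 5.2],
    [5.1, 4.8, 4.9],
    [4.9, 5.2, 5.0],
]
good_labels = [0, 0, 0, 1, 1, 1]   # hypothetical result of clustering on conditions 0 and 1
poor_labels = [0, 1, 0, 1, 0, 1]   # hypothetical poor clustering for comparison

print(fom(data, good_labels, 2), adjusted_fom(data, good_labels, 2))
print(fom(data, poor_labels, 2), adjusted_fom(data, poor_labels, 2))
```

A clustering that captures real structure predicts the hidden condition well, so its FOM is lower. In practice each condition would be hidden in turn and the values aggregated.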

41 Figure of Merit (FOM) best k?

42 Getting K Gap Statistic: minimize within cluster variance, compare against random FOM: cross validation; minimize variance in training data, compare against random

43 Hierarchical vs. K Means
Hierarchical: variable number of clusters; does not directly imply number of clusters; cluster assignment fixed; any distance measure; clusters are deterministic
K Means: fixed number of clusters; cluster assignment dynamic (adaptive); only Euclidean distance (but variations exist); clusters are nondeterministic

44 OTHER METHODS

45 K Medoids Like K Means, but for arbitrary distance measures. Center defined by most central object. Test each object in each cluster: computationally very expensive. [Figure panels: K Means, K Medoids]
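
A minimal sketch of the idea, using Manhattan distance to show that any distance function can be plugged in. The alternating assign/update scheme, the choice of the first k objects as starting medoids, and the example data are assumptions for illustration; the sketch also assumes all objects are distinct so that no cluster becomes empty.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def k_medoids(points, k, dist, max_iter=100):
    """Alternate between assigning objects and re-picking each cluster's medoid."""
    medoids = list(range(k))   # assumption: start from the first k objects
    for _ in range(max_iter):
        # assign every object to its closest medoid
        labels = [min(range(k), key=lambda c: dist(p, points[medoids[c]])) for p in points]
        # test each object in each cluster as the candidate center
        new_medoids = []
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            def cost(m, members=members):
                return sum(dist(points[m], points[i]) for i in members)
            new_medoids.append(min(members, key=cost))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, medoids = k_medoids(points, k=2, dist=manhattan)
print(labels, [points[m] for m in medoids])
```

The inner cost evaluation compares every candidate with every other member of its cluster, which is where the quadratic cost per cluster comes from.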

46 Fuzzy C Means Assign objects to many (or all) clusters with different certainty Membership values between 0 and 1 Considers uncertainty in data & clustering Allows for multi cluster membership (e.g. participating in several pathways)
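
Membership values between 0 and 1 can be sketched with the standard fuzzy c-means update rules. The fuzzifier m = 2, the starting centers, the fixed iteration count, and the data (two groups plus one point in between) are assumptions chosen for illustration.

```python
import math

def memberships(points, centers, m):
    """u[i][k] = 1 / sum_j (d_ik / d_ij)^(2/(m-1)); each row sums to 1."""
    rows = []
    for p in points:
        d = [max(math.dist(p, c), 1e-12) for c in centers]   # guard against zero distance
        rows.append([1.0 / sum((dk / dj) ** (2.0 / (m - 1.0)) for dj in d) for dk in d])
    return rows

def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    dims = range(len(points[0]))
    for _ in range(n_iter):
        u = memberships(points, centers, m)
        # each center is the mean of all points, weighted by membership^m
        centers = [
            tuple(sum(row[k] ** m * p[j] for row, p in zip(u, points)) /
                  sum(row[k] ** m for row in u) for j in dims)
            for k in range(len(centers))
        ]
    return memberships(points, centers, m), centers

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (3.0, 3.0)]                       # last point lies between the two groups
u, centers = fuzzy_c_means(points, centers=[(0.0, 0.0), (6.0, 6.0)])
for p, row in zip(points, u):
    print(p, [round(v, 3) for v in row])
```

Objects deep inside a group get a membership close to 1 for that group, while the in-between object is shared roughly equally, which is the behaviour a hard assignment cannot express.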

47 Principal Component Analysis (PCA) Dimension reduction. Project data on smaller number of dimensions. Each dimension is linear combination of original dimensions. Reduce high dimensional data to smaller number of relevant components (e.g. can be visualized in 3D)

48 Principal Component Analysis (PCA) expression condition 2 expression condition 1

49 Principal Component Analysis (PCA) Very useful for: visualizing high dimensional data; removing redundancy/dependency in data (note ICA); clustering; detecting (and removing) batch effects or other confounding effects in data
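
The projection onto a linear combination of the original dimensions can be sketched without external libraries. This example finds only the first principal component, using power iteration on the covariance matrix; the method choice, the all-ones starting vector, and the data are assumptions for illustration (a real analysis would normally use a linear algebra library).

```python
import math

def first_principal_component(data, n_iter=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d   # assumption: not orthogonal to the leading component
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # scores: coordinates of each centered sample along the component
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# hypothetical data: condition 2 is roughly twice condition 1
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

Here one component captures nearly all of the variation, so the two original dimensions can be replaced by a single score per sample with little loss.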

50 Model based Clustering Mixture modeling Make assumptions of data distribution Fit model (set of distributions) to data Cluster 1 Cluster 2 e.g. Gaussian mixture model

51 Model based Clustering Mixture modeling Make assumptions of data distribution Works well if model (i.e. distribution) known May give higher power (additional information used) May give spurious results if assumptions incorrect
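
Fitting a set of distributions to data can be sketched with a one-dimensional mixture of two Gaussians estimated by expectation-maximization (EM). EM is a common fitting method for such models but is not named on the slides; the starting means, the variance floor, and the data are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_gaussians(values, mu, n_iter=100):
    """EM for a 1-D mixture of two Gaussians; mu holds the two starting means."""
    mu = list(mu)
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and standard deviations
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(values)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # floor keeps the density finite
    return weight, mu, sigma, resp

values = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
weight, mu, sigma, resp = fit_two_gaussians(values, mu=[0.0, 6.0])
clusters = [0 if r[0] >= r[1] else 1 for r in resp]   # hard assignment from responsibilities
print(weight, mu, sigma)
print(clusters)
```

The responsibilities are soft assignments; taking the most probable component per value turns the fitted model into a clustering. If the data were not close to Gaussian, the same procedure would still return two components, which is the risk of spurious results noted above.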

52 ASSESSING CLUSTER QUALITY

53 What matters? Good statistical separation Stability of results Agreement with external (independent) data Biological plausibility

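54 Davies Bouldin Index Minimize within versus between variance: $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \frac{S(Q_i) + S(Q_j)}{S(Q_i, Q_j)}$, where $S(Q)$ = average distance to center and $S(Q_i, Q_j)$ = distance between the centers of clusters $Q_i$ and $Q_j$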
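
A sketch of the Davies-Bouldin index with Euclidean distance, taking the cluster center as the mean of its members and the between-cluster term as the distance between centers. The points and the two labelings are invented; the sketch assumes at least two clusters with distinct centers.

```python
import math

def davies_bouldin(points, labels):
    clusters = sorted(set(labels))
    centers, spread = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # S(Q): average distance of the members to their cluster center
        spread[c] = sum(math.dist(p, centers[c]) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # worst-case ratio of within-cluster spread to between-center distance
        total += max((spread[i] + spread[j]) / math.dist(centers[i], centers[j])
                     for j in clusters if j != i)
    return total / len(clusters)

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
good_labels = [0, 0, 0, 1, 1, 1]
poor_labels = [0, 1, 0, 1, 0, 1]
print(davies_bouldin(points, good_labels))
print(davies_bouldin(points, poor_labels))
```

Lower values indicate compact, well-separated clusters.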

55 Silhouette a(i) = average distance to all other cluster members; b(i) = average distance to members of neighboring cluster; $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. Average s(i) should be close to 1.
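
A sketch of the average silhouette, using s(i) = (b(i) - a(i)) / max(a(i), b(i)) with Euclidean distance. The neighboring cluster is taken to be the other cluster with the smallest average distance to object i, and objects in singleton clusters are given s(i) = 0; both are common conventions assumed here. The data are invented, and labels must be a list containing at least two clusters.

```python
import math

def silhouette(points, labels):
    """Mean of s(i) over all objects."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [math.dist(p, q)
               for j, (q, m) in enumerate(zip(points, labels)) if m == l and j != i]
        if not own:               # singleton cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)   # average distance to the other members of i's cluster
        b = min(                  # average distance to the closest other cluster
            sum(math.dist(p, q) for q, m in zip(points, labels) if m == other)
            / labels.count(other)
            for other in set(labels) if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
good_labels = [0, 0, 0, 1, 1, 1]
poor_labels = [0, 1, 0, 1, 0, 1]
print(silhouette(points, good_labels))
print(silhouette(points, poor_labels))
```

Values near 1 mean objects are much closer to their own cluster than to the next one; values near 0 or below suggest overlapping or wrongly assigned objects.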

56 Using external data E.g. for expression data: GO enrichment; enrichment of known transcription factor target genes; enrichment of regulatory sequence motifs.

57 Biological plausibility Very problem dependent What is known about the process? Are genes known to be related grouped in one cluster? (and vice versa) When clustering samples: are conditions that are similar grouped together? Are similar cell types/tissues clustering?

58 Further Reading The Elements of Statistical Learning, Hastie et al. stat.stanford.edu/~tibs/elemstatlearn/

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other