Tight Clustering: a method for extracting stable and tight patterns in expression profiles


George C. Tseng, Dept. of Biostatistics & Human Genetics, University of Pittsburgh

Statistical issues in microarray analysis
Experimental design; image analysis; normalization; identifying differentially expressed genes; data visualization; clustering; regulatory networks; classification.

Data matrix and heatmap (data visualization)
Data: $X = \{x_{ij}\}_{n \times d}$, an n (genes) × d (samples) matrix.
[Table: excerpt of the data matrix. Columns: row.names, chromosome, sample1 to sample5, followed by time-point columns. Rows are probe sets (96669_at, 16378_at, 98569_at, 93794_at, 95124_i_at, 99674_at, ...); some entries are NA.]

Why clustering
Cluster genes: a similar expression pattern implies co-regulation. Although many sophisticated methods exist for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in array analysis.
Subsequent analysis:
  Identify novel genes participating in known cellular processes.
  Enrichment of particular Gene Ontology (GO) terms in clusters.
  Motif finding in clusters.
Cluster samples: identify potential sub-classes of disease.

Clustering in microarray: an example
Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275.
[...] genes monitored. Reference sample is pooled from all samples.
66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods.
Filter genes without significant pattern (11 genes) and standardize each gene to have mean 0 and stdev 1.
[Figure: K-means clustering of the Drosophila life-cycle data for three choices of k, one of them k = 15.]

Main challenges for clustering in microarray
Challenge 1: lots of scattered genes, i.e. genes not belonging to any tight cluster of biological function.
K-means clustering looks informative. A closer look, however, finds a lot of noise in each cluster.
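The standardization step above (each gene scaled to mean 0 and standard deviation 1) can be sketched as follows. This is a minimal illustration, not the code used for the slides: the function name `standardize_gene` and the toy matrix are hypothetical, and the choice of the population standard deviation is an assumption.

```python
import statistics

def standardize_gene(profile):
    """Scale one gene's expression profile to mean 0 and standard deviation 1."""
    mean = statistics.fmean(profile)
    sd = statistics.pstdev(profile)  # population SD (assumption; the sample SD is equally plausible)
    if sd == 0:
        raise ValueError("flat profile: filter such genes before standardizing")
    return [(value - mean) / sd for value in profile]

# Hypothetical toy data: 3 genes (rows) measured at 4 time points (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 3.0, 5.0, 3.0],
]
standardized = [standardize_gene(row) for row in expression]
```

After standardization the first two toy genes have the same profile, since they differ only in scale; this is why the step matters when clustering on the shape of the expression pattern.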

Main challenges for clustering in microarray
Challenge 2: microarray is an exploratory tool to guide further biological experiments.
Hypothesis driven: hypothesis => experimental data.
Data driven: high-throughput experiment => data mining => hypothesis => further validation experiment.
It is important to provide the most informative clusters instead of lots of loose clusters (reduce false positives).

Current methods
Dimension reduction and data visualization:
  Principal Component Analysis (PCA) (Alter 2000)
  Multi-Dimensional Scaling (MDS)
Clustering methods:
  Hierarchical clustering (Eisen 1998)
  K-means (Hartigan 1975)
  K-medoids
  Self-Organizing Map (SOM) (Tamayo 1999)
  CLICK (Ron Shamir 2001)
  Model-based approach (Fraley and Raftery 1998)

Model-based approach
Fraley and Raftery (1998) applied a Gaussian mixture model:
(1) EM algorithm to maximize the classification likelihood.
(2) Bayesian Information Criterion (BIC) for determining k and the complexity of the covariance matrix.
Advantages:
  A sound probabilistic model for inference: model selection and estimation.
  Can easily be extended to model scattered genes.
Problems:
  Local minimum.
  Model selection is usually inapplicable in array data; BIC is approximate.

K-means clustering
Procedures:
Step 1: estimate the number of clusters, k.
Step 2: minimize the within-cluster dispersion to the cluster centers,
  $W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{C}_j \rVert^2$,
  where $\bar{C}_j$ is the center of cluster $C_j$.
Note:
1. Points should be in Euclidean space.
2. Optimization is performed by iterative relocation algorithms. Local minimum inevitable.
3. k has to be correctly estimated.

K-means clustering
K-means is a special case of the model-based approach.
Problems:
  Local minimum.
  Does not allow scattered genes.
  Estimation of the number of clusters k.

Estimating the number of clusters k
Milligan & Cooper (1985) compared 30 published rules.
1. Calinski & Harabasz (1974): choose k to maximize
  $CH(k) = \dfrac{B(k)/(k-1)}{W(k)/(n-k)}$,
  where $B(k)$ is the between-cluster sum of squares.
2. Hartigan (1975): $H(k) = \left( \dfrac{W(k)}{W(k+1)} - 1 \right)(n-k-1)$. Stop when $H(k) < 10$.
3. Tibshirani, Walther & Hastie (2000): choose k to maximize
  $Gap_n(k) = E_n^{*}(\log(W(k))) - \log(W(k))$.
4. Tibshirani et al. (2001), Dudoit & Fridlyand (2002): prediction-based resampling approach.

Hierarchical clustering
Iteratively agglomerate nearest nodes to form a bottom-up tree.
Single linkage: shortest distance between points in the two nodes.
Complete linkage: largest distance between points in the two nodes.
Note: clusters can be obtained by cutting the hierarchical tree.
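As a worked illustration of the within-cluster dispersion W(k) and the Calinski & Harabasz index, here is a small sketch in plain Python. The function names and the four toy points are hypothetical; B(k) is computed as T - W(k), where T is the total dispersion around the grand mean, and the sketch assumes 1 < k < n and W(k) > 0.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def within_dispersion(points, labels):
    """W(k): sum of squared distances from each point to its own cluster center."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return sum(
        squared_distance(p, centroid(members))
        for members in clusters.values()
        for p in members
    )

def calinski_harabasz(points, labels):
    """CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)], with B(k) = T - W(k)."""
    n, k = len(points), len(set(labels))
    total = within_dispersion(points, [0] * n)  # T: dispersion around the grand mean
    w = within_dispersion(points, labels)
    return ((total - w) / (k - 1)) / (w / (n - k))

# Hypothetical toy data: two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
good = [0, 0, 1, 1]  # groups the nearby points together
bad = [0, 1, 0, 1]   # splits each nearby pair
```

On this toy example the natural grouping gives W = 1 and CH = 32, while the poor grouping gives W = 16 and CH = 0.125, so maximizing CH prefers the tighter partition.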

Example of hierarchical clustering
[Figure: hierarchical clustering, Eisen et al. 1998.]

Other methods
Current methods that aim to find tight clusters:
1. CLICK: graph-theoretical techniques to find tight kernels. Several heuristic procedures are then used to expand the kernels into a full clustering.
2. Committee algorithm: a similar idea, finding tight committees and then expanding them into a full clustering.

Traditional:
  Estimate the number of clusters, k (except for hierarchical clustering).
  Perform clustering by assigning all genes into clusters.
Tight Clustering:
  Directly identify informative, tight and stable clusters of reasonable size, say, 20~60 genes.
  Need not estimate k!
  Need not assign all genes into clusters.

Tight Clustering
[Diagram: the whole data set is sub-sampled (subsample 1, subsample 2, ...); each sub-sample gives a clustering judgement on the whole data.]
Original data X -> sub-sample X' -> K-means -> cluster centers C(X', k) -> co-membership matrix D[C(X', k), X].

$X = \{x_{ij}\}_{n \times d}$: data to be clustered.
$X' = \{x'_{ij}\}_{n/2 \times d}$: random sub-sample.
$C(X', k) = (C_1, C_2, \ldots, C_k)$: the cluster centers obtained from clustering X' into k clusters.
$D[C(X', k), X]$: an n × n matrix denoting co-membership relations of X classified by C(X', k) (Tibshirani 2001):
  $D[C(X', k), X]_{ij} = 1$ if i and j are in the same cluster, and $= 0$ otherwise.
$s(V_i, V_j) = \dfrac{|V_i \cap V_j|}{|V_i \cup V_j|}$: a measure of similarity of two sets of genes.
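The two definitions above, the co-membership matrix and the set similarity s, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, points are classified to the nearest center in squared Euclidean distance, the centers are hand-picked stand-ins for centers that K-means would return on a sub-sample, and the sets passed to `similarity` are assumed non-empty.

```python
def nearest_center(point, centers):
    """Index of the center closest to `point` in squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )

def comembership(centers, data):
    """D[C(X', k), X]: entry (i, j) is 1 if points i and j share a nearest center, else 0."""
    labels = [nearest_center(point, centers) for point in data]
    return [[1 if li == lj else 0 for lj in labels] for li in labels]

def similarity(v_i, v_j):
    """s(V_i, V_j) = |V_i intersect V_j| / |V_i union V_j| for two sets of gene indices."""
    v_i, v_j = set(v_i), set(v_j)
    return len(v_i & v_j) / len(v_i | v_j)

# Hypothetical toy data: 4 points and 2 centers standing in for K-means output.
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centers = [[0.0, 0.0], [5.0, 5.0]]
D = comembership(centers, data)
```

Here the first two points share a center and so do the last two, so D is block diagonal; `similarity({1, 2, 3}, {2, 3, 4})` evaluates to 0.5 because two of the four distinct genes are shared.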

Algorithm 1 (when fixing k):
1. Fix k. Draw random sub-samples $X^{(1)}, \ldots, X^{(B)}$. Define the average co-membership matrix to be
  $\bar{D} = \mathrm{mean}\left( D[C(X^{(1)}, k), X], \ldots, D[C(X^{(B)}, k), X] \right)$.
  Note:
  a. $\bar{D}_{ij} = 1 \Leftrightarrow$ i and j are always clustered together in each sub-sampling judgment.
  b. $\bar{D}_{ij} = 0 \Leftrightarrow$ i and j are never clustered together in each sub-sampling judgment.
  c. $\bar{D}_{ii} = 1 \;\; \forall i$.
2. Search for a large set of points $V = \{v_1, \ldots, v_m\} \subset \{1, \ldots, n\}$ such that $\bar{D}_{v_i v_j} \ge 1 - \alpha \;\; \forall i, j$, with α close to 0. Sets with this property are candidates for tight clusters. Order sets with this property by their size to obtain $V_{k,1}, V_{k,2}, \ldots$

Tight Clustering Algorithm:
1. Start with a suitable $k_0$. Search over consecutive k's and choose the top 3 clusters for each k:
  $\{V_{k_0,1}, V_{k_0,2}, V_{k_0,3}\}, \{V_{(k_0+1),1}, V_{(k_0+1),2}, V_{(k_0+1),3}\}, \ldots$
  [Figure: the top three candidate clusters for k, k+1, k+2 and k+3, with similarity values between candidates at consecutive k.]
2. Stop when $s(V_{k',l}, V_{(k'+1),m}) \ge \beta$ and $s(V_{(k'+1),m}, V_{(k'+2),n}) \ge \beta$ for some $k' \ge k_0$ and $l, m, n \in \{1, 2, 3\}$, with β close to 1. Select $V_{(k'+1),m}$ to be the tightest cluster.
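A minimal sketch of Algorithm 1, assuming the B co-membership matrices have already been computed from the sub-samples. The function names are hypothetical, and the greedy search below is only one simple heuristic for finding sets whose pairwise average co-membership is at least 1 - α; the slides do not specify the search procedure, so it should not be read as the author's method.

```python
def average_comembership(matrices):
    """Element-wise mean of B co-membership matrices (each n x n with 0/1 entries)."""
    b, n = len(matrices), len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / b for j in range(n)] for i in range(n)]

def tight_candidates(d_bar, alpha):
    """Greedy search for sets V with d_bar[u][v] >= 1 - alpha for all u, v in V.

    Heuristic (assumption): grow a set from every seed point, then return
    the distinct sets ordered by decreasing size.
    """
    n = len(d_bar)
    threshold = 1 - alpha
    found = set()
    for seed in range(n):
        members = [seed]
        for cand in range(n):
            if cand != seed and all(d_bar[cand][m] >= threshold for m in members):
                members.append(cand)
        found.add(frozenset(members))
    return sorted((sorted(s) for s in found), key=lambda s: (-len(s), s))

# Hypothetical judgments from B = 2 sub-samples on n = 4 points.
d1 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]  # labels 0, 0, 1, 1
d2 = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]  # labels 0, 0, 0, 1
d_bar = average_comembership([d1, d2])
strict = tight_candidates(d_bar, alpha=0.0)
loose = tight_candidates(d_bar, alpha=0.5)
```

With α = 0 only points 0 and 1, which are clustered together in both judgments, form a candidate of size two. Relaxing α to 0.5 admits point 2 as well, which shows how α controls the tightness of the candidates.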

Tight Clustering Algorithm (cont'd):
3. Identify the tightest cluster and remove it from the whole data.
4. Decrease $k_0$ by 1. Repeat steps 1 to 3 to identify the next tight cluster.
Remark: α, β and $k_0$ determine the tightness and size of the resulting clusters.

Simulation
A simple simulation in 2-D: 14 normally distributed clusters (50 points each) plus 175 sporadic points. Stdev = 0.1, 0.2, ...
Tight clustering on simulated data with α = 0 and β = 0.7, for several choices of B and $k_0$.
[Table: clusters identified and points remaining, compared with the truth, for each choice of $k_0$; values not recoverable.]
[Figure: tight clustering result on the simulated data for $k_0$ = 25, α = 0, β = 0.7.]
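The stopping rule of the Tight Clustering Algorithm can be sketched as follows, assuming the top candidate clusters for each k are already available. The function names and the toy candidates are hypothetical, and the rule implemented is the one read from the slides: a candidate at k' + 1 is selected when it has similarity at least β with some candidate at k' and with some candidate at k' + 2.

```python
def jaccard(a, b):
    """s(V_i, V_j): size of the intersection over size of the union (non-empty sets assumed)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def first_stable_cluster(candidates_by_k, beta):
    """Return (k' + 1, cluster) for the first candidate that is stable across
    three consecutive values of k, or None if no candidate qualifies.

    candidates_by_k maps k to the list of top candidate clusters found for that k.
    """
    for k in sorted(candidates_by_k):
        following = candidates_by_k.get(k + 1, [])
        after = candidates_by_k.get(k + 2, [])
        for previous in candidates_by_k[k]:
            for middle in following:
                if jaccard(previous, middle) < beta:
                    continue
                if any(jaccard(middle, later) >= beta for later in after):
                    return k + 1, sorted(middle)
    return None

# Hypothetical top candidates (gene indices) for three consecutive values of k.
candidates_by_k = {
    10: [[0, 1, 2, 3, 9], [4, 5, 6]],
    11: [[0, 1, 2, 3], [4, 5, 6, 7, 8]],
    12: [[0, 1, 2, 3], [4, 5]],
}
selected = first_stable_cluster(candidates_by_k, beta=0.7)
```

In the full procedure, the genes of the selected cluster would then be removed from the data and the search repeated with the starting value decreased by 1, as in steps 3 and 4.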

Example 1: Data from the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275
Tight Clustering with α = 0.1, β = 0.6, $k_0$ = 15: 11 clusters and 661 remaining scattered genes.
[Figure: tight clusters identified in the Drosophila life-cycle data.]
[Figure: K-means clustering of the same data for three choices of k, one of them k = 15.]
K-means clustering looks informative. A closer look, however, finds a lot of noise in each cluster.
Comparison of a corresponding cluster from K-means and Tight Clustering: 22 common genes; the Tight Clustering cluster has a total of 28 genes and the K-means cluster a total of 18 genes.

Example 2: Mouse embryonic experiment
Oligonucleotide array (U74Av2 mouse array from Affymetrix) containing probe sets for about 10,000 mouse genes.
126 samples in total. Half of them are from different stages of mouse embryonic development. The remaining half is a diverse collection of samples from various tissues, including several types of adult stem cells.
[Table: mean squared distance; values not recoverable.]

Example 2: Mouse embryonic experiment
[Figure: comparison of various K-means results and tight clustering.]

Example 3: Simulated data
A. Simulated gene expression of 15 clusters and 5 scattered genes.
B. Randomly permuted from A.
Methods compared:
a. K-means
b. K-medoids
c. SOM
d. CLICK
e. Model-based clustering
f. Tight clustering
The adjusted Rand index is a measure of the similarity of two clustering results. We compare the clustering results from each method to the underlying truth.

Ongoing developments
Theoretical foundation for the re-sampling approach.
Multi-resolution tight clustering.
Extend the idea to bi-clustering.
Incorporating multiple tight clustering results.
Other general and fundamental problems in clustering.
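For readers unfamiliar with the adjusted Rand index used in Example 3, here is a small sketch of its standard pair-counting form (Hubert and Arabie). The function name is hypothetical, at least two items are assumed, and how scattered genes are labeled before the comparison is not specified in the slides, so that choice is left to the caller.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items (n >= 2 assumed)."""
    n = len(labels_a)
    # Pairs of items placed together in both labelings, in labeling A, and in labeling B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:  # degenerate case, e.g. both labelings form a single cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

truth = [0, 0, 1, 1, 2, 2]
relabeled = [5, 5, 7, 7, 9, 9]  # same partition with different cluster names
same_partition = adjusted_rand_index(truth, relabeled)
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 when the two partitions agree regardless of how clusters are named, is close to 0 for agreement at chance level, and can be negative, as in the crossed example where no pair is grouped together by both labelings.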

tightclust: a software for Tight Clustering
http://www.pitt.edu/~ctseng/tightclust.

Acknowledgement
Harvard: Wing H. Wong (Department of Statistics)
Inputs from: Chen Li (Department of Biostatistics), Rung Kim, Richard Zhong
