Evaluation and comparison of gene clustering methods in microarray analysis


1 Evaluation and comparison of gene clustering methods in microarray analysis. Anbupalam Thalamuthu (1), Indranil Mukhopadhyay (1), Xiaojing Zheng (1), George C. Tseng (1,2). (1) Department of Human Genetics, (2) Department of Biostatistics, University of Pittsburgh, USA

2 Acknowledgements Anbupalam Thalamuthu and Indranil Mukhopadhyay thank Dr. Daniel E. Weeks, Professor, Department of Human Genetics and Department of Biostatistics, University of Pittsburgh, the Fogarty/NIH grant 5D43TW "India-US Research Training Program in Genetics", and the University of Pittsburgh, USA.

3 Overview Gene clustering analysis is useful for discovering groups of genes that are co-regulated or associated with a disease or with certain conditions. Several methods, such as hierarchical clustering, K-means, PAM, SOM, model based clustering and tight clustering, are used for clustering genes. No comprehensive comparative study has been performed to compare the effectiveness of these methods. Six clustering methods are compared using simulated data sets with different levels of perturbation and two real data sets. A weighted Rand index is proposed for measuring the similarity of two clustering results with possible scattered genes.

4 Outline: Problem of gene clustering; Simulated data under different perturbation models; Rand index for comparison of clustering methods; Results of the simulation study; Real data sets; Comparison of clustering methods for real data sets; Results for the real data sets

5 Problem of gene clustering The microarray data matrix X (n x m, normalized and preprocessed) contains n genes on the rows and m samples on the columns. The task is to group the n genes into K clusters. The number of genes is very large compared to the number of samples. Several methods are available, among them K-means, PAM, SOM, model based clustering and tight clustering. How do these methods behave when noise points are included and the data are perturbed?

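6 Simulated Data Sets: Basic data generation Fifteen clusters $C = (C_1, C_2, \ldots, C_{15})$ with dimension $d = 50$ are simulated. Cluster sizes are $4 n_c$ with $n_c \sim \mathrm{Poisson}(\lambda)$. The 50 samples are split into four groups of sizes $m_1, m_2, m_3, m_4$ with $\sum_i m_i = 50$ and $m_i > 2$; the $m_i$ are uniformly distributed. For sample group $i = 1, 2, 3, 4$ and sample $j = 1, 2, \ldots, m_1, m_1 + 1, \ldots, 50$ belonging to group $i$:
$$\log \mu_i^{(c)} \sim N(\mu, \sigma_c^2), \qquad \log T_j^{(c)} \sim N(\log \mu_i^{(c)}, \sigma_S^2),$$
$$x_{lj} \sim N(\log T_j^{(c)}, \sigma^2), \qquad l = 1, 2, \ldots, n_c, \quad j = 1, 2, \ldots, 50.$$
Parameters: $\mu = 6$, $\sigma_c = 1$, $\sigma = 0.1$, $\sigma_S = 0.1$ and $\lambda = 10$.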
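The generation scheme on this slide can be sketched in NumPy as below. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the value 1 is assigned to the group-mean SD and 0.1 to the other two SDs, each cluster holds 4 n_c genes, and the group sizes are drawn by rejection sampling of three cut points.

```python
import numpy as np

def simulate_clusters(n_clusters=15, d=50, mu=6.0, sigma_c=1.0,
                      sigma_s=0.1, sigma=0.1, lam=10, seed=0):
    """Simulate log-scale expression for n_clusters gene clusters (sketch)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for c in range(n_clusters):
        # Cluster size 4 * n_c, n_c ~ Poisson(lam); a zero draw is bumped
        # to 1 so that no cluster is empty (assumption, not in the slides).
        n_genes = 4 * max(int(rng.poisson(lam)), 1)
        # Split the d samples into four groups with every m_i > 2 by
        # drawing three distinct cut points and rejecting short groups.
        while True:
            cuts = np.sort(rng.choice(np.arange(1, d), size=3, replace=False))
            m = np.diff(np.concatenate(([0], cuts, [d])))
            if m.min() > 2:
                break
        log_mu = rng.normal(mu, sigma_c, size=4)              # group means
        group_of_sample = np.repeat(np.arange(4), m)          # length d
        log_t = rng.normal(log_mu[group_of_sample], sigma_s)  # sample template
        x = rng.normal(log_t, sigma, size=(n_genes, d))       # gene profiles
        blocks.append(x)
        labels.extend([c] * n_genes)
    return np.vstack(blocks), np.array(labels)

X, y = simulate_clusters(seed=1)
print(X.shape, len(np.unique(y)))
```

The returned label vector plays the role of the known truth against which a clustering result can later be scored.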

7 Perturbation models Type I: Add 0%, 10%, 20%, 60%, 100% and 200% of randomly simulated scattered genes to the original data set. Type II: To each log-transformed gene expression value $x_{ij}$, a random error simulated from a normal distribution with mean zero and SD 0.05, 0.1, 0.2, 0.4, 0.8 or 1.2 is added. Type III: A combination of Type I and Type II perturbation. 25 data sets, each replicated 100 times, constitute a total of 2500 simulated data sets for our analysis.
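A minimal sketch of the two basic perturbations follows. The slides do not say how scattered genes are simulated, so drawing them uniformly over the observed expression range is purely an assumption made for illustration, and both function names are hypothetical.

```python
import numpy as np

def add_scattered_genes(x, fraction, rng):
    """Type I: append fraction * n scattered genes to the n original genes."""
    n, d = x.shape
    n_scatter = int(round(fraction * n))
    # Assumption: scattered profiles are uniform over the observed range.
    scattered = rng.uniform(x.min(), x.max(), size=(n_scatter, d))
    is_scattered = np.concatenate(
        [np.zeros(n, dtype=bool), np.ones(n_scatter, dtype=bool)]
    )
    return np.vstack([x, scattered]), is_scattered

def add_measurement_noise(x, sd, rng):
    """Type II: add N(0, sd^2) noise to every log-expression value."""
    return x + rng.normal(0.0, sd, size=x.shape)

rng = np.random.default_rng(0)
x = rng.normal(6.0, 1.0, size=(40, 50))
x_type1, flags = add_scattered_genes(x, 0.6, rng)   # 60% scattered genes
x_type3 = add_measurement_noise(x_type1, 0.2, rng)  # Type I then Type II
print(x_type3.shape, int(flags.sum()))
```

Applying the second function to the output of the first gives a Type III perturbation.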

8 Representation of Perturbation Models [Figure: panels labelled 0, 5%, 10%, 20%, 60%, 100% and 200%]

9 Rand Index (Rand, 1971) A measure of concordance between two clustering schemes. Rand index: R(Y, Y') = (number of concordant pairs) / (total number of pairs); in the slide's example, (2 + 7)/15 = 0.6. Clustering methods can be evaluated by R(Y, Y_truth) if Y_truth is known.
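The pair-counting definition can be checked with a few lines of Python. The two label vectors below are hypothetical; they were chosen so that 2 pairs are together in both clusterings and 7 pairs are apart in both, which reproduces the (2 + 7)/15 count.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    concordant = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return concordant / len(pairs)

y = [1, 1, 1, 2, 2, 2]        # clustering Y of six objects
y_prime = [1, 1, 2, 2, 2, 3]  # clustering Y'
print(rand_index(y, y_prime))  # 0.6
```

A pair counts as concordant when both clusterings put its two objects together or both keep them apart.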

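10 Rand Index Contingency table of the clustering result U (rows) against the original partition V (columns); the last row and the last column hold the noise sets S_1 and S_2:
            v_1         v_2         ...   v_{C-1}         v_noise     sum
u_1         n_{11}      n_{12}      ...   n_{1,C-1}       n_{1C}      n_{10}
u_2         n_{21}      n_{22}      ...   n_{2,C-1}       n_{2C}      n_{20}
...         ...         ...         ...   ...             ...         ...
u_{R-1}     n_{R-1,1}   n_{R-1,2}   ...   n_{R-1,C-1}     n_{R-1,C}   n_{R-1,0}
u_noise     n_{R1}      n_{R2}      ...   n_{R,C-1}       n_{RC}      S_1
sum         n_{01}      n_{02}      ...   n_{0,C-1}       S_2         n
Consider all points in the clustered as well as the original data set: R_N = Rand (with noise). Drop noise points from the clustered as well as the original data set: R_0 = Rand (no noise).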

11 Rand Index Let S_1 = set of noise points in the clustered set {U} and S_2 = set of noise points in the original set {V}. With $a^* = a(a-1)/2$,
$$\mathrm{Rand} = 1 - \frac{\sum_{i=1}^{R} n_{i0}^* - 2 \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}^* + \sum_{j=1}^{C} n_{0j}^*}{n^*}.$$
The weighted Rand index is
$$\mathrm{Rand}_w = \lambda R_N + (1 - \lambda) R_0, \qquad \lambda = \frac{n(S_1 \cup S_2)}{\text{total observations}}.$$
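A small sketch of the weighted index, built on the contingency-table form of the Rand index. It assumes that noise points carry a reserved label (here the hypothetical value -1), that R_0 is computed after dropping every point that is noise in either labelling, and that the weight is the proportion of points in the union of the two noise sets.

```python
from collections import Counter

NOISE = -1  # hypothetical label marking scattered (unclustered) genes

def pairs(a):
    """a* = a(a - 1)/2, the number of pairs among a objects."""
    return a * (a - 1) // 2

def rand_from_table(u, v):
    """Rand index computed from contingency counts n_ij, n_i0 and n_0j."""
    n = len(u)
    if n < 2:
        return 1.0
    n_ij, n_i0, n_0j = Counter(zip(u, v)), Counter(u), Counter(v)
    disagree = (sum(pairs(a) for a in n_i0.values())
                - 2 * sum(pairs(a) for a in n_ij.values())
                + sum(pairs(a) for a in n_0j.values()))
    return 1 - disagree / pairs(n)

def weighted_rand(u, v, noise=NOISE):
    """lambda * R_N + (1 - lambda) * R_0 with lambda = |S1 u S2| / n."""
    n = len(u)
    r_n = rand_from_table(u, v)  # noise treated as one more cluster
    keep = [k for k in range(n) if u[k] != noise and v[k] != noise]
    r_0 = rand_from_table([u[k] for k in keep], [v[k] for k in keep])
    lam = (n - len(keep)) / n    # assumed: share of points in S1 or S2
    return lam * r_n + (1 - lam) * r_0

truth = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
found = [0, 0, 1, 1, 1, NOISE, NOISE, 0]
print(weighted_rand(found, truth))
```

When neither labelling contains noise, the weight is zero and the index reduces to the ordinary Rand index.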

12 Figure 1: Weighted Rand Index for hierarchical clustering with single (blue), complete (red) and average (black) linkage on simulated data.

13 Figure 2: Weighted Rand index of K-means with 1 (red), 100 (blue) and 1000 (black) random initial values on simulated data.

14 Figure 3: Weighted Rand index of PAM with 1 (red), 100 (blue) and 1000 (black) random initial values on simulated data.

15 Observations Hierarchical clustering is very sensitive to the presence of noise points in the data set and is strongly affected by perturbation of the data. K-means and PAM can fall into local minima; therefore, in practical applications these algorithms, if used, should be run with many initial cluster centers. 100 random initial cluster centers seem to be adequate.
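The advice on multiple starts can be illustrated with a plain NumPy version of Lloyd's K-means that keeps the run with the smallest within-cluster sum of squares. This is a sketch rather than the implementation used in the study; the function names and the choice of random data points as initial centers are assumptions.

```python
import numpy as np

def kmeans_once(x, k, rng, n_iter=100):
    """One K-means run from k randomly chosen data points as centers."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    wss = dist[np.arange(len(x)), labels].sum()  # within-cluster sum of squares
    return labels, wss

def kmeans_restarts(x, k, n_starts=100, seed=0):
    """Run K-means n_starts times and return the run with the lowest WSS."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(x, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0, 3, 6)])
labels, wss = kmeans_restarts(x, k=3, n_starts=100)
print(round(float(wss), 3))
```

Because the best of many runs is kept, the objective can only improve or stay the same as the number of starts grows.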

16 Figure 4: Weighted Rand index for SOM (violet), hierarchical (brown), K-means (black), PAM (green), model based clustering (red) and tight clustering (blue) based on simulated data sets.

17 Observations Without scattered genes and perturbation in the data sets, all the methods except SOM perform well. Type I perturbation reveals that hierarchical clustering, K-means and PAM are strongly affected by the presence of noise points in the data sets. This is not so for tight clustering and model based clustering. Tight clustering and model based clustering performed equally well up to a perturbation SD of 0.2~0.4, even when 200% scattered genes exist. SOM performs worst. Model based clustering needs correct specification of the number of clusters. Its BIC criterion for selecting the model with the correct number of clusters fails even in the data sets with no perturbation and no scattered genes.

18 Real Data I: Yeast cell cycle The yeast cell cycle data (Spellman et al. 1998) contain 6179 genes. Genes with more than 20% missing values and genes with SD at log-2 scale less than 0.4 were dropped; the remaining genes were retained for analysis. Missing values for the retained genes were imputed by the KNN algorithm. 104 genes that are cell cycle regulated in yeast have been identified by traditional methods (Paul T. Spellman, 7www.stanford.edu/cellcycle/data/rawdata/KnownGenes.doc); of these, 87 were found in our preprocessed data set.
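The two filtering rules can be sketched as follows, assuming a genes-by-samples array of log-2 values with NaN marking missing entries. The function name is hypothetical, the use of the sample standard deviation (ddof = 1) is an assumption, and the KNN imputation step is not shown.

```python
import numpy as np

def filter_genes(x, max_missing=0.2, min_sd=0.4):
    """Keep genes with at most max_missing missing values and SD >= min_sd."""
    missing_frac = np.isnan(x).mean(axis=1)
    enough_data = missing_frac <= max_missing
    sd = np.zeros(x.shape[0])
    # SD over the observed values only, computed for genes that pass
    # the missing-value rule; the rest keep sd = 0 and are dropped anyway.
    sd[enough_data] = np.nanstd(x[enough_data], axis=1, ddof=1)
    keep = enough_data & (sd >= min_sd)
    return x[keep], keep

x = np.array([
    [0.0, 1.0, 2.0, 3.0, 4.0],        # kept: complete and variable
    [1.0, 1.0, 1.0, 1.0, 1.0],        # dropped: SD below 0.4
    [0.0, np.nan, np.nan, 3.0, 4.0],  # dropped: 40% missing
    [0.0, 2.0, np.nan, 1.0, 3.0],     # kept: 20% missing, variable
])
filtered, keep = filter_genes(x)
print(keep)
```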

19 Annotation of clusters: G genes in the genome (G = 87); functional category F (six functional categories). If, in a cluster of size C, h genes are found to be in a functional category F with m genes, then the p-value (i.e. the probability of observing h or more annotated genes in the cluster) is calculated as (Tavazoie et al. 1999):
$$P[X \ge h] = \sum_{i=h}^{\min(C, m)} \frac{\binom{m}{i} \binom{G-m}{C-i}}{\binom{G}{C}}.$$
A cluster is annotated as category F if the p-value is less than a threshold level $\delta$.
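The hypergeometric tail probability can be evaluated exactly with integer binomial coefficients. The sketch below uses only the Python standard library; the function name and the example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(h, cluster_size, m, g):
    """P[X >= h] for X ~ Hypergeometric(g genes, m in category, cluster_size drawn)."""
    upper = min(cluster_size, m)
    total = comb(g, cluster_size)
    tail = sum(comb(m, i) * comb(g - m, cluster_size - i)
               for i in range(h, upper + 1))
    return tail / total

# Hypothetical example: a cluster of 10 genes, 5 of which fall in a
# category that contains 20 of the G = 87 genes.
p = enrichment_p_value(h=5, cluster_size=10, m=20, g=87)
print(p)
```

A cluster would then be annotated with the category whenever the returned value falls below the chosen threshold.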

20 Prediction Accuracy Prediction accuracy($\delta$) = (number of verifiable predictions) / (total number of predictions), where the number of verifiable predictions is the total number of annotated genes and the total number of predictions is the sum of the sizes of all clusters in which annotated genes were found. Since the number of clusters in the real data sets was not known exactly, clusterings with the number of clusters ranging over 5~30 were run and the pooled prediction accuracy was calculated. Clustering methods can be compared using the prediction accuracy plots.
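As a rough illustration of the ratio, the sketch below takes, for every cluster annotated at a given threshold, its size and the number of annotated genes it contains. The input layout and function name are assumptions; pooling over several clusterings amounts to concatenating their lists before the call.

```python
def prediction_accuracy(annotated_clusters):
    """Verifiable predictions divided by total predictions.

    annotated_clusters: iterable of (cluster_size, n_annotated_genes)
    pairs, one per cluster that passed the p-value threshold.
    """
    clusters = list(annotated_clusters)
    verifiable = sum(h for _, h in clusters)
    total = sum(size for size, _ in clusters)
    return verifiable / total if total else float("nan")

# Hypothetical counts pooled from two clusterings of the same data.
run_k5 = [(10, 4), (20, 6)]
run_k6 = [(8, 5)]
print(prediction_accuracy(run_k5 + run_k6))
```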

21 Figure 5: Prediction accuracy for yeast cell cycle data

22 Real Data Set II: Yeast environmental change data 45 yeast samples were analyzed under various changes in the extracellular environment (Causton et al. 2001). Of the genes in the data set, 1744 were retained after preprocessing. We extracted 119 genes that are annotated to the category "response to stress" in the SGD Gene Ontology Slim Mapper.

23 Figure 6: Prediction accuracy for yeast environmental change

24 Conclusion Tight clustering and model based clustering consistently perform well, both for the simulated data and for the real data sets. Between these two methods, tight clustering may be preferred over model based clustering, as it allows some of the scattered genes in the data sets to be grouped into a noise category. Although SOM gives a good visualization of the clustering solution, its performance is poor compared with the other methods. The authors recommend tight clustering or model based clustering for practical applications.

25 References
1. Kaufman, L & Rousseeuw, PJ (1990): Finding groups in data: An introduction to cluster analysis; Wiley, NY
2. Struyf, A, Hubert, M & Rousseeuw, PJ (1997): Integrating robust clustering techniques in S-PLUS; Computational Statistics and Data Analysis, 26
3. Fraley, C & Raftery, AE (2002): Model based clustering, discriminant analysis and density estimation; JASA, 97
4. Tseng, GC (2005): Tight Clustering: A resampling-based approach for identifying stable and tight patterns in data; Biometrics, 61
5. Kohonen, T (1990): The self-organizing map; Proceedings of the IEEE, 78(9)
6. Rand, WM (1971): Objective criteria for the evaluation of clustering methods; J Am Stat Assoc, 66
