Comparisons and validation of statistical clustering techniques for microarray gene expression data

Susmita Datta and Somnath Datta
Presented by: Jenni Dietrich
Assisted by: Jeffrey Kidd and Kristin Wheeler
Mentor: Dr. Takis Benos
26 June 2003

Outline
- Brief microarray overview
- Purpose of the paper
- Discuss clustering algorithms
- Experiment and results
- Conclusions

Microarrays
- Allow for monitoring of gene expression at the transcript level
- Slide with single-stranded DNA molecules (probes) attached at fixed positions
- Exploit the complementary binding of single-stranded DNA sequences
- Result in a large data set containing the expression levels of thousands of genes
- Microarray experiments are often used to track changes in gene expression
  - Over time
  - In the presence of various agents

Microarray Slide
- Gene expression profiles characterize the dynamic functioning of each gene in the genome
- Expression data can be represented as a matrix where the rows are genes and the columns are samples
- The values in the cells of the matrix represent the expression levels

Gene Expression Matrix (columns: experiments)
* Numerical values are encoded by color

Experiment Design
- Decide on probes and genes
- Type of microarray
- Data normalization
- Data analysis
  - Identify differentially expressed genes
  - Cluster genes based on expression patterns
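The gene expression matrix described above can be sketched in a few lines of Python. This is only an illustration: the gene names, sample labels, values, and the helper names `gene_profile` and `sample_values` are all made up and do not come from the paper.

```python
# Toy expression matrix: rows are genes, columns are samples.
# Gene names, sample labels, and values are invented for illustration.
genes = ["geneA", "geneB", "geneC"]
samples = ["t0", "t1", "t2", "t3"]
matrix = [
    [0.1, 0.8, 1.5, 2.0],  # geneA
    [1.9, 1.2, 0.4, 0.1],  # geneB
    [0.2, 0.9, 1.4, 2.1],  # geneC
]


def gene_profile(name):
    """Return the expression profile (one row) of a gene."""
    return matrix[genes.index(name)]


def sample_values(label):
    """Return the expression levels of all genes in one sample (one column)."""
    col = samples.index(label)
    return [row[col] for row in matrix]


print(gene_profile("geneB"))
print(sample_values("t1"))
```

Clustering genes means grouping the rows of this matrix, so each row (profile) is the object being clustered.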

Clustering
- Goal of microarray data analysis
  - Identify changing levels of gene expression
  - Correlate the changes to identify sets of genes with similar profiles
- Clustering groups objects into subsets
- Clustering algorithms can be used to group genes that have similar expression patterns

Purpose of the Paper
- Currently, there are no clear guidelines for choosing a clustering algorithm to group genes based on their expression profiles
- This paper evaluated the performance of six different algorithms using a microarray data set on sporulation of budding yeast

Clustering Algorithms
- Hierarchical clustering with correlation (UPGMA, the most commonly used algorithm)
- K-means clustering
- Diana
- Fanny
- Model-based clustering
- Hierarchical clustering with partial least squares
- Algorithms differ in the measure of similarity used when grouping the objects and in the grouping technique
- Some require prior knowledge of the suspected number of clusters
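As an illustration of a similarity measure between expression profiles, the sketch below computes a correlation-based dissimilarity. It assumes the common convention of one minus the Pearson correlation; the slides do not state the exact form used in the paper, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed dissimilarity: 1 - r, so perfectly correlated profiles are at 0."""
    return 1.0 - pearson(x, y)


# Two profiles with the same shape but different scale are close,
# while a mirrored profile is as far away as possible.
up = [1.0, 2.0, 3.0]
up_scaled = [2.0, 4.0, 6.0]
down = [3.0, 2.0, 1.0]
print(correlation_distance(up, up_scaled))
print(correlation_distance(up, down))
```

With this measure, genes are grouped by the shape of their profile rather than by absolute expression level, which is one reason the choice of measure changes the resulting clusters.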

Hierarchical clustering
- Produces a hierarchy of clusters rather than a predefined number of clusters (agglomerative approach)
- Initially, each observation is in its own cluster
- Subsequently, the two closest clusters are combined into a single cluster
- Clusters are compared using the average method
  - The distance between two clusters is the average of the distances between the points in one cluster and those in the other cluster

K-means Clustering
- Uses advance knowledge of the number of clusters to be formed (k clusters)
- Initially, all objects are randomly assigned to one of the k clusters
- Objects are moved between clusters in an attempt to minimize the distance between each object and its cluster

Diana
- Divisive clustering method
- All objects start in one cluster, which is then broken into smaller groups
- Genes with larger dissimilarity are put in different clusters
- Uses the standard Euclidean distance measure

Fanny
- Uses fuzzy logic and produces a probability vector for each observation
- A hard cluster is formed by assigning each observation to the group with the highest probability
- Uses the Manhattan distance measure
  - d = |x - u| + |y - v|, where (x, y) and (u, v) are two points
- Needs a predefined number of clusters (k)
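The agglomerative procedure with the average method, and the two distance measures mentioned above, can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names are hypothetical, the points are toy data, and stopping at `k` clusters is an assumption made only to return a flat grouping (the full method produces a whole hierarchy).

```python
from itertools import combinations
from math import sqrt


def euclidean(p, q):
    """Standard Euclidean distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(c1, c2, dist):
    """Average of all pairwise distances between members of two clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))


def agglomerate(points, k, dist=euclidean):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each observation starts alone
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], dist),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, 2))
print(manhattan((1, 2), (4, 6)))
```

Passing `dist=manhattan` to `agglomerate` swaps the distance measure without changing the grouping technique, which mirrors the two ways the algorithms on these slides differ.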

Model-based Clustering
- Treats the data as a mixture distribution
  - Often based on a Gaussian distribution
- Describes each cluster using a probabilistic model
- No predefined number of clusters

Experiment
- Ran each of the six clustering algorithms on the yeast sporulation data set
- Used three validation measures to compare the results
  - Average proportion of non-overlap measure
  - Average distance between means measure
  - Average distance measure

Results
Average proportion of non-overlap and average distance between means measures
- Based on the average proportion of non-overlap and average distance between means measures:
  - Model-based clustering appears to be the worst
  - Hierarchical clustering with correlation and Fanny seem to be the best
- Based on the average distance measure:
  - Hierarchical clustering performed worst
- Overall, Diana performed consistently well on all three measures
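The slides do not define the validation measures. As a rough sketch of the average proportion of non-overlap, the code below assumes a leave-one-time-point-out definition: the genes are clustered once on the full data and once for each data set with a single time point removed, and for every gene the measure records the fraction of its full-data cluster that is no longer in its cluster afterwards. The function name, the input layout, and the toy labels are assumptions for illustration.

```python
def proportion_non_overlap(full_labels, reduced_labelings):
    """Average proportion of non-overlap (assumed definition).

    full_labels: cluster label of each gene, clustering on the full data.
    reduced_labelings: one list of labels per deleted time point.
    Smaller values mean the clustering is more stable.
    """
    n_genes = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for g in range(n_genes):
            # Genes sharing a cluster with g in each clustering
            full_cluster = {h for h in range(n_genes) if full_labels[h] == full_labels[g]}
            reduced_cluster = {h for h in range(n_genes) if reduced[h] == reduced[g]}
            overlap = len(full_cluster & reduced_cluster) / len(full_cluster)
            total += 1.0 - overlap
    return total / (n_genes * len(reduced_labelings))


full = [0, 0, 1, 1]
print(proportion_non_overlap(full, [[0, 0, 1, 1]]))  # identical clustering
print(proportion_non_overlap(full, [[0, 1, 1, 1]]))  # one gene moved
```

Because clusters are compared through their member sets, the measure does not depend on how the clusters happen to be numbered in each run.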

Average distance between means measure

Comparison of model profiles
- The smaller the distance from the model profile, the closer the results of that algorithm are to the results of the model

Average Temporal Profiles (x-axis: time in hours, 0 to 12)
Group category   Number of genes
I                52
II               62
III              47
IV               95
V                158
VI               61
VII              5
Source: Chu et al. (1998) Science 282: 699-705
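The comparison with model profiles can be sketched as follows: average the expression profiles of the genes in a cluster, then measure how far that average is from a model profile. The use of Euclidean distance here is an assumption (the slides do not say which distance the paper uses), and the function names and all numbers are invented for illustration.

```python
from math import sqrt


def average_profile(profiles):
    """Mean expression at each time point over the genes in one cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def profile_distance(p, q):
    """Euclidean distance between two temporal profiles (assumed measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy cluster of three genes measured at four time points
cluster = [
    [0.0, 1.0, 2.0, 3.0],
    [0.2, 1.2, 2.2, 3.2],
    [0.4, 0.8, 1.8, 2.8],
]
model = [0.0, 1.0, 2.0, 3.0]  # hypothetical model profile

mean_profile = average_profile(cluster)
print(mean_profile)
print(profile_distance(mean_profile, model))
```

An algorithm whose clusters have average profiles close to the model profiles would, under this reading, agree more closely with the hand-picked groups.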

Guidelines for choosing a clustering algorithm
1. Plot the first two principal components to determine which method gives the most separation between groups
2. Check the consistency of the method with the temporal observations
3. Compare the average group temporal profiles with the model profiles produced from a known, hand-picked set of genes (training set)

Other aspects
- It may be important to inspect the computational stability as well as the computational time of an algorithm before making a choice

In conclusion
- The clustering algorithm directly affects the interpretation and analysis of the data
- Therefore, careful consideration of the algorithms is necessary before a choice is made
- For this data set, Diana performed consistently well when compared with the model profiles and across the three validation measures

References
- Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466
- Brazma, A. and Vilo, J. (2000) Gene expression data analysis. FEBS Letters 480: 17-24
- Quackenbush, J. (2001) Computational analysis of microarray data. Nature Reviews Genetics 2: 418-427
- Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning, pp. 453-480
- http://www.austinlinks.com/fuzzy/overview.html
- http://www.maths.lth.se/help/r/.r/library/cluster/html/fanny.html