Supervised Clustering of Yeast Gene Expression Data

In the DeRisi paper, five expression profile clusters were cited, each containing a small number (7-8) of genes. In the following examples we apply supervised clustering techniques to these cluster prototypes in order to classify the remaining genes in the dataset. Classifiers were first trained on the genes in the original clusters and then applied to the remaining genes to assign each of them to a cluster.

In the first example, a Kohonen self-organizing feature map was used to arrange the original clusters in a two-dimensional layout. The unclassified genes were then mapped onto this layout. New clusters were defined by selecting a region of the map for each new cluster, thus classifying the genes within that region.

In the second example, a decision tree was trained on the original clusters. An extra cluster (None) was added to represent those genes that do not sufficiently match any of the original DeRisi cluster expression profiles. The remaining genes were filtered to remove those without a significant change in expression level and were then classified by the decision tree.

In the third example, a Naive-Bayes classifier was generated from the original clusters.

[Figure: DeRisi Cluster Expression Profiles. Fold change plotted against time (8.5 to 20.5) for Centroid B (n=7), Centroid C (n=7), Centroid D (n=7), Centroid E (n=7), and Centroid F (n=8).]
The original DeRisi clusters are represented by a graph of the cluster centroids.

A parallel coordinates visualization displaying gene expression levels for each DeRisi cluster.

A Kohonen self-organizing feature map computes a new pair of axes and locates the genes according to its idea of similarity.
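The mapping described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool used to produce the figures: the function names `train_som` and `best_matching_unit`, the grid size, the linear decay schedules, and the Gaussian neighborhood are all illustrative choices. Each gene's expression profile is assigned to the grid node whose weight vector is closest, and that node's (row, column) position serves as the gene's location on the new pair of axes.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(profiles, rows=3, cols=3, epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small self-organizing map (hypothetical parameters).

    Returns weights[r][c], a list of floats per grid node.
    """
    rng = random.Random(seed)
    # Initialize every node with a copy of a randomly chosen profile.
    weights = [[list(rng.choice(profiles)) for _ in range(cols)]
               for _ in range(rows)]
    total = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = list(range(len(profiles)))
        rng.shuffle(order)
        for i in order:
            x = profiles[i]
            frac = step / total
            lr = lr0 * (1.0 - frac)                # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3   # neighborhood radius shrinks
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winner on the grid are pulled toward x.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(len(w)):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

After training on the prototype genes, calling `best_matching_unit(weights, profile)` for each remaining gene gives its map position; a region of grid positions can then be selected to define a cluster.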

A Kohonen self-organizing feature map displaying user defined clusters.

[Figure: Kohonen Map Cluster Expression Profiles. Fold change plotted against time (8.5 to 20.5) for Centroid none (n=5730), Centroid newb (n=17), Centroid newf (n=210), Centroid newd (n=143), Centroid newc (n=26), and Centroid newe (n=27).]
The Kohonen self-organizing feature map clusters presented by a plot of the cluster centroids.

A parallel coordinates visualization showing the new Kohonen map clusters as compared to the original DeRisi clusters.

A visualization of a decision tree that was created from the original DeRisi clusters (plus an extra None cluster). This part of the subtree shows clusters E and F being split from cluster None at time 18.5, and clusters E and F being split apart at time 14.5.
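The kind of split shown in this subtree (a threshold on the expression level at a single time point) can be illustrated with a short sketch. The source does not state which splitting criterion the decision tree used; Gini impurity is assumed here for illustration only, and the names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(profiles, labels):
    """Find the single (time point, threshold) split with the lowest
    weighted Gini impurity. Returns (attribute index, threshold, score)."""
    best = None
    n = len(labels)
    for j in range(len(profiles[0])):
        values = sorted(set(row[j] for row in profiles))
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(profiles, labels) if row[j] <= t]
            right = [y for row, y in zip(profiles, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best
```

A full tree would apply `best_split` recursively to each side of the split until the nodes are pure or too small to divide further.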

[Figure: Decision Tree Cluster Expression Profiles. Fold change plotted against time (8.5 to 20.5) for Centroid none (n=197), Centroid C (n=21), Centroid E (n=71), Centroid B (n=47), Centroid D (n=72), and Centroid F (n=347).]
The decision tree clusters presented by a plot of the cluster centroids.

Visualization of the Naive-Bayes classifier created from the original DeRisi clusters. The attributes are listed in order of importance (with respect to the cluster designation). The fact that the squares for time 18.5 are mostly one color indicates time 18.5 is a very good predictor for the cluster class.

This visualization of the Naive-Bayes classifier shows the probability distribution for cluster D. Cluster D can be classified perfectly from attribute T18.5 alone.
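As a rough sketch of how a Naive-Bayes classifier assigns a gene to a cluster, the example below models each time point independently within each cluster. The source does not say how the attribute distributions were represented; a Gaussian per attribute is assumed here purely for illustration, and `fit_gaussian_nb`, `predict`, and `var_floor` are hypothetical names.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(profiles, labels, var_floor=1e-3):
    """Estimate a log prior plus a per-attribute mean and variance for
    each cluster. var_floor guards against zero variance in tiny clusters."""
    grouped = defaultdict(list)
    for x, y in zip(profiles, labels):
        grouped[y].append(x)
    model = {}
    n_total = len(profiles)
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((v - m) ** 2 for v in col) / n, var_floor)
                     for col, m in zip(columns, means)]
        model[y] = (math.log(n / n_total), means, variances)
    return model

def predict(model, x):
    """Return the cluster with the highest log posterior for profile x."""
    def log_posterior(params):
        log_prior, means, variances = params
        total = log_prior
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda y: log_posterior(model[y]))
```

An attribute such as T18.5 is a strong predictor when its per-cluster distributions barely overlap, so its term dominates the log posterior.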

[Figure: five panels, each plotting fold change against time points T0 through T160: Cluster G2/M (n=195), Cluster G1 (n=300), Cluster S (n=71), Cluster S/G2 (n=121), and Cluster M/G1 (n=113).]
Expression levels of the five yeast cell cycle peak phases, as designated from the Spellman dataset. The average of each cluster is plotted for all time periods (T0-T160), along with the standard deviation values for each peak phase.

A visualization of the five Spellman peak phase clusters displayed as a sequence of sixteen histograms for each cell cycle.

A Radviz visualization of the yeast cell cycle data, clustered using the time and colored by Spellman's peak phase classification. This visualization technique employs the physical concept of spring forces to position the multi-dimensional data.
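The spring-force idea can be made concrete with a short sketch of the usual Radviz layout. Each dimension (here, a time point) gets an anchor on the unit circle, and each gene is placed where springs pulling toward the anchors balance, with spring stiffness equal to the gene's normalized value in that dimension. The function name `radviz_positions` and the min-max normalization are assumptions made for illustration.

```python
import math

def radviz_positions(rows):
    """Compute 2-D Radviz coordinates for rows of multi-dimensional data."""
    n_dims = len(rows[0])
    # One anchor per dimension, evenly spaced on the unit circle.
    anchors = [(math.cos(2 * math.pi * j / n_dims),
                math.sin(2 * math.pi * j / n_dims)) for j in range(n_dims)]
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    points = []
    for row in rows:
        # Min-max normalize each value to [0, 1]; this is the spring stiffness.
        w = [(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
        total = sum(w)
        if total == 0:
            points.append((0.0, 0.0))  # no pull from any anchor
            continue
        # Equilibrium position: stiffness-weighted average of the anchors.
        x = sum(wj * a[0] for wj, a in zip(w, anchors)) / total
        y = sum(wj * a[1] for wj, a in zip(w, anchors)) / total
        points.append((x, y))
    return points
```

A gene expressed only at one time point lands on that time point's anchor, while a gene with equal normalized values at all time points lands at the center.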

A dendrogram visualization displaying a user-selected cluster generated by a standard hierarchical clustering method.

A K-means clustering of the Spellman data. The visualization features a relative neighborhood graph (minimum set of lines that connect the centroids) and the outliers for all five K-means clusters.
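For reference, a relative neighborhood graph over a set of centroids can be computed as sketched below. This uses the standard definition (two points are connected unless some third point is closer to both of them than they are to each other) and is not necessarily how the visualization tool builds its graph. The function names are illustrative.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_neighborhood_graph(points):
    """Return edges (i, j) of the relative neighborhood graph.

    An edge is dropped when a third point k is closer to both i and j
    than i and j are to each other.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = euclidean(points[i], points[j])
        blocked = any(
            max(euclidean(points[i], points[k]),
                euclidean(points[j], points[k])) < d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges
```

For three collinear centroids at (0, 0), (1, 0), and (2, 0), only the two short edges are kept, because the middle centroid blocks the long one.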

A plot displaying the results of a Kohonen self-organizing map generated from the Spellman data, with the classification from Cho overlaid.

The statistical results of a self-organizing feature map trained on the Spellman data. Blue lines display the cluster centroids and the red lines show the standard deviations.

Comparison of a K-means clustering technique that generates 30 clusters with the five expression patterns designated by Spellman. While some of the 30 clusters represent subsets of a Spellman class (such as the yellow lines), other clusters contain genes that fall into two or more Spellman classes.

A comparison of two clustering techniques using a jittered scatterplot of the Spellman data. Five clusters from one technique (along the Y-axis) are compared with 12 clusters from another technique (along the X-axis). If the X-axis clusters were a pure superset of the Y-axis clusters then there would only be one clump per vertical line. In this case only the 12th cluster on the X-axis is pure while the 1st is nearly so.
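The same comparison can be made numerically by cross-tabulating the two cluster assignments. In the sketch below, each clump in the scatterplot corresponds to a non-zero cell of the table, and an X-axis cluster is pure when all of its genes fall in a single Y-axis cluster. The function names and the toy labels in the usage note are hypothetical.

```python
from collections import Counter

def cross_tabulate(labels_x, labels_y):
    """Count genes for every (X cluster, Y cluster) pair."""
    return Counter(zip(labels_x, labels_y))

def pure_clusters(labels_x, labels_y):
    """Return the X clusters whose genes all fall in one Y cluster."""
    seen = {}
    for x, y in zip(labels_x, labels_y):
        seen.setdefault(x, set()).add(y)
    return sorted(x for x, ys in seen.items() if len(ys) == 1)
```

For example, with `labels_x = [1, 1, 2, 2, 3]` and `labels_y = ["a", "a", "a", "b", "b"]`, clusters 1 and 3 are pure, while cluster 2 is split across two Y clusters.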

A circle segment visualization comparing the results of different classification techniques. The true class is represented in color, while the predicted class is represented with a grayscale. If the change in grayscale value matches the change in color, then there is a strong correlation between the true and predicted class. In this example the "cl03" correlates well with the true class feature, the peak.

Comparing Clustering Techniques

All method columns and the maximum column give % correct.

Rank  Clustering Technique  Data      Number of Clusters  Method 1  Method 2  Method 3  Method 4  Maximum
1     Kohonen 3             Norm      30                  72.6      69.1      65.7      67.8      72.6
2     Kohonen 1             Norm      30                  72.3      69.5      65.2      67.7      72.3
3     Kohonen 2             Norm      30                  71.8      66.4      62.3      65.2      71.8
4     C K-means 1           Norm      30                  71.1      66.4      59.7      65.1      71.1
5     SOM 4                 Original  25                  70.1      61.9      59.9      63.2      70.1
6     SOM 12                Original  27                  69.3      64.0      60.1      63.0      69.3
7     Kohonen 2             Original  19                  68.5      64.3      58.6      62.7      68.5
8     C K-means 1           Original  30                  67.2      63.6      55.0      61.9      67.2
9     Kohonen 1             Original  19                  67.1      59.8      53.6      58.8      67.1
10    Kohonen 3             Original  18                  66.8      65.5      56.4      63.9      66.8
11    C K-means 2           Norm      5                   66.8      61.1      56.4      58.6      66.8
12    SOM 7                 Norm      12                  62.5      57.8      49.6      52.8      62.5
13    M K-means 1           Original  5                   59.7      51.8      48.4      54.7      59.7
14    Dendrogram 2          Original  6                   58.8      54.5      46.8      47.5      58.8
15    K-means 2             Original  5                   55.8      50.0      47.8      54.5      55.8
16    SOM 7                 Original  5                   54.8      51.8      42.8      55.1      55.1
17    Dendrogram 1          Original  6                   45.6      43.1      32.7      33.4      45.6
18    SOM 12                Norm      30                  44.2      38.5      31.0      36.0      44.2
19    M K-means 2           Original  30                  43.7      36.6      29.3      35.9      43.7
20    M K-means 3           Original  17                  39.5      30.8      23.5      30.2      39.5
21    random                Original  6                   37.5      16.3      20.0      22.9      37.5

The results of several clustering techniques were compared to the five Spellman classifications (G2/M, G1, S, S/G2, and M/G1). For a given technique, each generated cluster was considered to be a subset of one of the Spellman classes. The class chosen for each cluster was based on the majority of Spellman classes for the genes in that cluster. After each cluster was categorized, the resulting accuracies were calculated. The total percent correct and the average accuracy for each class were calculated and are presented in the method columns.
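The majority-vote scoring described above can be sketched as follows. Each cluster is labeled with the most common Spellman class among its genes, and two figures are then computed: the total percent correct and the mean of the per-class accuracies. The source does not define the four methods in the table, so this sketch does not claim to reproduce any particular column. The function name is hypothetical, and treating "average accuracy for each class" as the fraction of each class's genes that were labeled correctly is an assumption.

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(cluster_ids, true_classes):
    """Score a clustering against known classes by majority vote.

    Returns (total % correct, mean per-class % correct, cluster -> class map).
    """
    members = defaultdict(list)
    for c, t in zip(cluster_ids, true_classes):
        members[c].append(t)
    # Label each cluster with the most common true class among its genes.
    cluster_label = {c: Counter(ts).most_common(1)[0][0]
                     for c, ts in members.items()}
    predicted = [cluster_label[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_classes))
    total_pct = 100.0 * correct / len(true_classes)
    # Assumed definition: per-class accuracy is the share of a class's
    # genes that received that class as their cluster label.
    per_class = []
    for cls in set(true_classes):
        idx = [i for i, t in enumerate(true_classes) if t == cls]
        per_class.append(100.0 * sum(predicted[i] == cls for i in idx) / len(idx))
    mean_class_pct = sum(per_class) / len(per_class)
    return total_pct, mean_class_pct, cluster_label
```

The per-class average penalizes a technique that scores well only on the largest class, which the total percent correct alone would hide.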

Unsupervised Clustering of Yeast Gene Expression Data

In the Cho paper, 416 genes were visually identified as cell cycle regulated. In the Spellman paper, the Cho data was combined with the results from other experiments, and 800 genes were identified algorithmically as cell cycle regulated. In the following examples, we apply various unsupervised clustering techniques to a subset of the Cho dataset (the 800 genes that were identified in the Spellman paper). The first row (images 1-3) consists of visualizations of the original data (gene expression levels during two cell cycles). The second row (images 4-6) visually presents the results of several clustering algorithms. The third row (images 7-9) displays the statistical properties of each cluster generated by various algorithms. The fourth row (images 10-12) provides visual comparisons between selected clustering algorithms.

References

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998 Dec; 9(12): 3273-97. http://genome-www.stanford.edu/pdf/spellman_pt_mol_biol_cell_1998.pdf

Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2: 65-73, 1998. http://depts.washington.edu/genetics/courses/genet551-aut01/1217paper.pdf

DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997 Oct 24; 278(5338): 680-6. http://genome-www.stanford.edu/pdf/derisi_jl_science_1997.pdf http://cmgm.stanford.edu/pbrown/explore/