Supervised Clustering of Yeast Gene Expression Data

In the DeRisi paper, five expression profile clusters were cited, each containing a small number (7-8) of genes. In the following examples we apply supervised clustering techniques to these cluster prototypes to classify the remaining genes in the dataset. Classifiers were first trained on the genes in the original clusters and then applied to the remaining genes to assign each of them to a cluster.

In the first example, a Kohonen self-organizing feature map was used to arrange the original clusters in a two-dimensional layout. The unclassified genes were then mapped onto this layout, producing a clustering of the genes. New clusters were defined by selecting a region of the map for each one, thereby classifying the genes within that region.

In the second example, a decision tree was trained on the original clusters. An extra cluster was added to represent genes that do not sufficiently match the original DeRisi cluster expression profiles. The remaining genes were filtered to remove those without a significant change in expression level and were then classified by the decision tree.

In the third example, a Naive-Bayes classifier was generated from the original clusters.
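The train-on-prototypes, classify-the-rest workflow described above can be sketched in a few lines. This is only an illustration: it swaps in a simple nearest-centroid rule in place of the Kohonen map, decision tree and Naive-Bayes classifiers actually used, and the names `classify_remaining`, `min_change` and `max_dist` and their default values are assumptions, not taken from the source.

```python
from math import sqrt

def centroid(profiles):
    """Mean profile of a list of equal-length expression vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_remaining(seed_clusters, genes, min_change=2.0, max_dist=3.0):
    """Assign unclassified genes to the nearest seed-cluster centroid.

    seed_clusters: {label: [profile, ...]} (the small original clusters)
    genes:         {gene_name: profile}    (the genes still to classify)
    min_change, max_dist: hypothetical thresholds chosen for illustration.
    """
    centroids = {label: centroid(p) for label, p in seed_clusters.items()}
    assigned = {}
    for name, profile in genes.items():
        # Filter step: drop genes with no sizeable change at any time point.
        if max(abs(v) for v in profile) < min_change:
            continue
        label, d = min(
            ((lab, euclidean(profile, c)) for lab, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        # Genes far from every prototype go to an extra "none" cluster.
        assigned[name] = label if d <= max_dist else "none"
    return assigned
```

Genes that are filtered out do not appear in the returned dictionary, while genes that changed but match no prototype are labelled "none".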

Figure: DeRisi Cluster Expression Profiles. Axes: Time (8.5 to 20.5) versus Fold Change (-10 to 15). Series: Centroid B (n=7), Centroid C (n=7), Centroid D (n=7), Centroid E (n=7), Centroid F (n=8). The original DeRisi clusters are represented by a graph of the cluster centroids.

A parallel coordinates visualization displaying gene expression levels for each DeRisi cluster.

A Kohonen self-organizing feature map computes a new pair of axes and places the genes on them so that genes it judges to be similar lie close together.
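As a rough illustration of how such a map places profiles on a two-dimensional grid, here is a minimal self-organizing map sketch. The grid size, learning-rate schedule, neighbourhood schedule and the function names `train_som` and `best_matching_unit` are all assumptions made for this example; the source does not say which settings or software were used.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit whose weights are closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )

def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per map unit, initialised from randomly chosen samples.
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                     # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # radius shrinks toward 0.5
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull the unit toward x, more strongly the nearer it is to the BMU.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, `best_matching_unit(weights, profile)` gives the map coordinates of any gene, including genes that were not part of the training set.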

A Kohonen self-organizing feature map displaying user defined clusters.

Figure: Kohonen Map Cluster Expression Profiles. Axes: Time (8.5 to 20.5) versus Fold Change (-10 to 15). Series: Centroid none (n=5730), Centroid newb (n=17), Centroid newf (n=210), Centroid newd (n=143), Centroid newc (n=26), Centroid newe (n=27). The Kohonen self-organizing feature map clusters are presented as a plot of the cluster centroids.

A parallel coordinates visualization showing the new Kohonen map clusters as compared to the original DeRisi clusters.

A visualization of a decision tree that was created from the original DeRisi clusters (plus an extra None cluster). This part of the subtree shows clusters E and F being split from cluster None at time 18.5, and clusters E and F being split apart at time 14.5.
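A split such as the one at time 18.5 comes from choosing, for one attribute, the threshold that best separates the classes. The sketch below does this for a single time point using Gini impurity. The impurity measure and the function names are assumptions for illustration; the source does not state which splitting criterion its decision tree used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Best threshold for the rule 'value <= threshold' on one attribute.

    Returns (threshold, weighted_gini); threshold is None if no split exists.
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree would apply `best_split` to every time point, pick the attribute with the lowest score, and recurse on the two resulting subsets.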

Figure: Decision Tree Cluster Expression Profiles. Axes: Time (8.5 to 20.5) versus Fold Change (-10 to 15). Series: Centroid none (n=197), Centroid C (n=21), Centroid E (n=71), Centroid B (n=47), Centroid D (n=72), Centroid F (n=347). The decision tree clusters are presented as a plot of the cluster centroids.

Visualization of the Naive-Bayes classifier created from the original DeRisi clusters. The attributes are listed in order of importance (with respect to the cluster designation). The fact that the squares for time 18.5 are mostly one color indicates time 18.5 is a very good predictor for the cluster class.

This visualization of the Naive-Bayes classifier shows the probability distribution for cluster D. Cluster D can be classified perfectly from attribute T18.5 alone.
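To make the idea of per-attribute class probabilities concrete, here is a minimal Naive-Bayes sketch. It assumes each time point follows a Gaussian distribution within a cluster, which is one common choice; the tool behind the visualizations above may instead have binned the values. The function names and the `var_floor` parameter are assumptions introduced for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(profiles, labels, var_floor=1e-3):
    """Estimate a log prior plus per-attribute mean and variance for each class."""
    by_class = defaultdict(list)
    for profile, label in zip(profiles, labels):
        by_class[label].append(profile)
    model = {}
    n_total = len(profiles)
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so tiny clusters do not produce zero variance.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / n_total), means, variances)
    return model

def predict(model, profile):
    """Return the class with the highest log posterior for one profile."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
            for x, m, var in zip(profile, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)
```

An attribute whose class-conditional distributions barely overlap, as described for time 18.5, dominates the sum in `log_posterior` and can decide the class almost on its own.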

Figure: Expression levels of the five yeast cell cycle peak phases, as designated from the Spellman dataset. Panels: Cluster G2/M (n=195), Cluster G1 (n=300), Cluster S (n=71), Cluster S/G2 (n=121), Cluster M/G1 (n=113). Axes: Time Points (T0 to T160 in steps of 10) versus Fold Change (-1.5 to 1.5). The average of each cluster is plotted for all time periods (T0-T160), along with the standard deviation values for each peak phase.

A visualization of the five Spellman peak phase clusters displayed as a sequence of sixteen histograms for each cell cycle.

A Radviz visualization of the yeast cell cycle data, clustered using the time and colored by Spellman's peak phase classification. This visualization technique employs the physical concept of spring forces to position the multi-dimensional data.
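The spring analogy has a simple closed form: each dimension is an anchor on the unit circle, each value acts as a spring constant pulling the point toward its anchor, and the point rests at the weighted average of the anchors. The sketch below assumes the values have already been scaled to the range [0, 1]; the function name `radviz_point` is hypothetical.

```python
import math

def radviz_point(values):
    """Map one record (values scaled to [0, 1]) to its 2-D Radviz position.

    Anchor j sits at angle 2*pi*j/n on the unit circle; the record is placed
    at the equilibrium of springs whose stiffness equals the scaled values.
    """
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: leave the point at the centre
    x = sum(v * math.cos(2.0 * math.pi * j / n) for j, v in enumerate(values)) / total
    y = sum(v * math.sin(2.0 * math.pi * j / n) for j, v in enumerate(values)) / total
    return (x, y)
```

A gene expressed at only one time point lands on that time point's anchor, while a gene with equal values everywhere lands at the centre of the circle.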

A dendrogram visualization displaying a user-selected cluster generated by a standard hierarchical clustering method.

A K-means clustering of the Spellman data. The visualization features a relative neighborhood graph (minimum set of lines that connect the centroids) and the outliers for all five K-means clusters.
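For readers unfamiliar with the relative neighborhood graph, a minimal sketch follows. It uses the standard definition: two centroids are joined unless some third centroid is closer to both of them than they are to each other. For Euclidean distances this graph contains the minimum spanning tree, so it connects all centroids with few lines. The function name is an assumption, and `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist

def relative_neighborhood_graph(points):
    """Return the edges (i, j), i < j, of the relative neighborhood graph."""
    edges = []
    for a, b in combinations(range(len(points)), 2):
        d_ab = dist(points[a], points[b])
        # The edge is blocked if a third point is closer to both endpoints
        # than the endpoints are to each other.
        blocked = any(
            max(dist(points[a], points[c]), dist(points[b], points[c])) < d_ab
            for c in range(len(points))
            if c not in (a, b)
        )
        if not blocked:
            edges.append((a, b))
    return edges
```

This brute-force version checks every triple of points, which is adequate for a handful of cluster centroids.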

A plot displaying the results of a Kohonen self-organizing map generated from the Spellman data, with the classification from Cho overlaid.

The statistical results of a self-organizing feature map trained on the Spellman data. Blue lines display the cluster centroids and the red lines show the standard deviations.

Comparison of a K-means clustering technique that generates 30 clusters with the five expression patterns designated by Spellman. While some of the 30 clusters represent subsets of a Spellman class (such as the yellow lines), other clusters have genes that fall into two or more Spellman Classes.

A comparison of two clustering techniques using a jittered scatterplot of the Spellman data. Five clusters from one technique (along the Y-axis) are compared with 12 clusters from another technique (along the X-axis). If the X-axis clusters were a pure superset of the Y-axis clusters then there would only be one clump per vertical line. In this case only the 12th cluster on the X-axis is pure while the 1st is nearly so.
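The same comparison can be made numerically with a cross-tabulation of the two cluster assignments: an X-axis cluster is pure when all of its genes fall into a single Y-axis cluster. The sketch below is an illustration only, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def crosstab(x_labels, y_labels):
    """Count genes for every (x cluster, y cluster) combination."""
    table = defaultdict(Counter)
    for x, y in zip(x_labels, y_labels):
        table[x][y] += 1
    return table

def pure_clusters(x_labels, y_labels):
    """X clusters whose members all belong to one Y cluster, in sorted order."""
    table = crosstab(x_labels, y_labels)
    return sorted(x for x, counts in table.items() if len(counts) == 1)
```

Each row of the table corresponds to one vertical line in the scatterplot, and each non-zero entry in that row corresponds to one clump.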

A circle segment visualization comparing the results of different classification techniques. The true class is represented in color, while the predicted class is represented with a grayscale. If the change in grayscale value matches the change in color, then there is a strong correlation between the true and predicted class. In this example the "cl03" correlates well with the true class feature, the peak.

Comparing Clustering Techniques

Rank  Clustering    Data      Number of  %correct  %correct  %correct  %correct  %correct
      Technique               Clusters   method-1  method-2  method-3  method-4  maximum
1     Kohonen 3     Norm      30         72.6      69.1      65.7      67.8      72.6
2     Kohonen 1     Norm      30         72.3      69.5      65.2      67.7      72.3
3     Kohonen 2     Norm      30         71.8      66.4      62.3      65.2      71.8
4     C K-means 1   Norm      30         71.1      66.4      59.7      65.1      71.1
5     SOM 4         Original  25         70.1      61.9      59.9      63.2      70.1
6     SOM 12        Original  27         69.3      64.0      60.1      63.0      69.3
7     Kohonen 2     Original  19         68.5      64.3      58.6      62.7      68.5
8     C K-means 1   Original  30         67.2      63.6      55.0      61.9      67.2
9     Kohonen 1     Original  19         67.1      59.8      53.6      58.8      67.1
10    Kohonen 3     Original  18         66.8      65.5      56.4      63.9      66.8
11    C K-means 2   Norm      5          66.8      61.1      56.4      58.6      66.8
12    SOM 7         Norm      12         62.5      57.8      49.6      52.8      62.5
13    M K-means 1   Original  5          59.7      51.8      48.4      54.7      59.7
14    Dendrogram 2  Original  6          58.8      54.5      46.8      47.5      58.8
15    K-means 2     Original  5          55.8      50.0      47.8      54.5      55.8
16    SOM 7         Original  5          54.8      51.8      42.8      55.1      55.1
17    Dendrogram 1  Original  6          45.6      43.1      32.7      33.4      45.6
18    SOM 12        Norm      30         44.2      38.5      31.0      36.0      44.2
19    M K-means 2   Original  30         43.7      36.6      29.3      35.9      43.7
20    M K-means 3   Original  17         39.5      30.8      23.5      30.2      39.5
21    random        Original  6          37.5      16.3      20.0      22.9      37.5

The results of several clustering techniques were compared to the five Spellman classifications (G2/M, G1, S, S/G2, and M/G1). For a given technique, each generated cluster was considered to be a subset of one of the Spellman classes. The class chosen for each cluster was the majority Spellman class among the genes in that cluster. After each cluster was categorized, the resulting accuracies were calculated. The total percent correct and the average accuracy for each class were calculated and are presented in the method columns.
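The majority-vote scoring described above can be sketched as follows. The source does not define how methods 1 to 4 differ, so this example computes only the two quantities it names, the total percent correct and the per-class accuracy averaged over classes; the function name and the return format are assumptions.

```python
from collections import Counter, defaultdict

def majority_vote_scores(cluster_ids, true_classes):
    """Score a clustering against known classes by majority vote.

    Returns (total_percent_correct, average_per_class_percent_correct).
    """
    members = defaultdict(list)
    for cid, cls in zip(cluster_ids, true_classes):
        members[cid].append(cls)
    # Label each cluster with the most common true class among its genes.
    cluster_label = {
        cid: Counter(classes).most_common(1)[0][0]
        for cid, classes in members.items()
    }
    totals = Counter(true_classes)
    correct = Counter()
    for cid, cls in zip(cluster_ids, true_classes):
        if cluster_label[cid] == cls:
            correct[cls] += 1
    total_pct = 100.0 * sum(correct.values()) / len(true_classes)
    per_class_avg = 100.0 * sum(correct[c] / totals[c] for c in totals) / len(totals)
    return total_pct, per_class_avg
```

Note that a technique allowed more clusters tends to score higher under this scheme, since small clusters are more easily dominated by a single class; this is worth keeping in mind when comparing rows with 30 clusters against rows with 5.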

Unsupervised Clustering of Yeast Gene Expression Data

In the Cho paper, 416 genes were visually identified as cell cycle regulated. In the Spellman paper, the Cho data were combined with the results of other experiments, and 800 genes were identified algorithmically as cell cycle regulated. In the following examples, we apply various unsupervised clustering techniques to a subset of the Cho dataset (the 800 genes identified in the Spellman paper).

The first row (images 1-3) consists of visualizations of the original data (gene expression levels during two cell cycles). The second row (images 4-6) visually presents the results of several clustering algorithms. The third row (images 7-9) displays the statistical properties of each cluster generated by various algorithms. The fourth row (images 10-12) provides visual comparisons between selected clustering algorithms.

References

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998 Dec; 9(12): 3273-97. http://genome-www.stanford.edu/pdf/spellman_pt_mol_biol_cell_1998.pdf

Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998; 2: 65-73. http://depts.washington.edu/genetics/courses/genet551-aut01/1217paper.pdf

DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997 Oct 24; 278(5338): 680-6. http://genome-www.stanford.edu/pdf/derisi_jl_science_1997.pdf http://cmgm.stanford.edu/pbrown/explore/