Visualizing Gene Clusters using Neighborhood Graphs in R

Similar documents
Supervised Clustering of Yeast Gene Expression Data

Introduction to Mfuzz package and its graphical user interface

CompClustTk Manual & Tutorial

Exploratory data analysis for microarrays

Graphs,EDA and Computational Biology. Robert Gentleman

Lab: Using Graphs for Comparing Transcriptome and Interactome Data

Double Self-Organizing Maps to Cluster Gene Expression Data

Neighborhood Graphs, Stripes and Shadow Plots for Cluster Visualization

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

Clustering. Lecture 6, 1/24/03 ECS289A

Package Mfuzz. R topics documented: March 26, Version Date Title Soft clustering of time series gene expression data

Lab: Using Graphs for Comparing Transcriptome and Interactome Data

Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering

Clustering. Chapter 10 in Introduction to statistical learning

Advanced visualization techniques for Self-Organizing Maps with graph-based methods

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Kapitel 4: Clustering

Package Mfuzz. R topics documented: July 4, Version Date

CHAPTER 5 CLUSTER VALIDATION TECHNIQUES

R-Packages for Robust Asymptotic Statistics

Machine Learning in Biology

CompClustTk Manual & Tutorial

Evaluation and comparison of gene clustering methods in microarray analysis

Week 7 Picturing Network. Vahe and Bethany

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Distances, Clustering! Rafael Irizarry!

Supplementary text S6 Comparison studies on simulated data

Unsupervised learning

ViTraM: VIsualization of TRAnscriptional Modules

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Radmacher, M, McShante, L, Simon, R (2002) A paradigm for Class Prediction Using Expression Profiles, J Computational Biol 9:

Cluster Analysis. Ying Shen, SSE, Tongji University

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

ECS 234: Data Analysis: Clustering ECS 234

Chapter 6 Continued: Partitioning Methods

Data mining with Support Vector Machine

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

The Generalized Topological Overlap Matrix in Biological Network Analysis

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Graph projection techniques for Self-Organizing Maps

What to come. There will be a few more topics we will cover on supervised learning

High throughput Data Analysis 2. Cluster Analysis

Cluster Analysis for Microarray Data

Understanding Clustering Supervising the unsupervised

5.6 Self-organizing maps (SOM) [Book, Sect. 10.3]

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

2. Background. 2.1 Clustering

CLUSTERING IN BIOINFORMATICS

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

ViTraM: VIsualization of TRAnscriptional Modules

Codelink Legacy: the old Codelink class

Exploratory data analysis for microarrays

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Exploratory data analysis for microarrays

Automatic Group-Outlier Detection

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

Supervised vs. Unsupervised Learning

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

A Topography-Preserving Latent Variable Model with Learning Metrics

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Keywords: ANN; network topology; bathymetric model; representability.

Clustering Jacques van Helden

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

EECS730: Introduction to Bioinformatics

MSA220 - Statistical Learning for Big Data

CSE 5243 INTRO. TO DATA MINING

CLUSTERING GENE EXPRESSION DATA USING AN EFFECTIVE DISSIMILARITY MEASURE 1

Forestry Applied Multivariate Statistics. Cluster Analysis

Microarray data analysis

A Toolbox for K-Centroids Cluster Analysis

Clustering of imbalanced high-dimensional media data

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

Redefining and Enhancing K-means Algorithm

Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments

Unsupervised Learning : Clustering

K-Means Clustering With Initial Centroids Based On Difference Operator

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

Summer School in Statistics for Astronomers & Physicists June 15-17, Cluster Analysis

Network Traffic Measurements and Analysis

Improving Cluster Method Quality by Validity Indices

Chapter DM:II. II. Cluster Analysis

K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.

Unsupervised Learning and Clustering

K-means and Hierarchical Clustering

Study and Implementation of CHAMELEON algorithm for Gene Clustering

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Figure (5) Kohonen Self-Organized Map

Gene Clustering & Classification

Seismic regionalization based on an artificial neural network

Reconstructing Boolean Networks from Noisy Gene Expression Data

Classification by Support Vector Machines

The role of Fisher information in primary data space for neighbourhood mapping

An Experiment in Visual Clustering Using Star Glyph Displays

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

Transcription:

Theresa Scharl & Friedrich Leisch Visualizing Gene Clusters using Neighborhood Graphs in R Technical Report Number 16, 2008 Department of Statistics University of Munich http://www.stat.uni-muenchen.de

Visualizing Gene Clusters using Neighborhood Graphs in R Theresa Scharl 1 and Friedrich Leisch 2 1 Department of Statistics and Probability Theory, Vienna University of Technology Wiedner Haupstraße 8-10/1071, 1040 Vienna, Austria, theresa.scharl@ci.tuwien.ac.at 2 Department of Statistics, University of Munich Ludwigstraße 33, D-80539 München, Germany, friedrich.leisch@stat.uni-muenchen.de Abstract. The visualization of cluster solutions in gene expression data analysis gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results. Neighborhood graphs allow for visual assessment of relationships between adjacent clusters. The number of clusters in gene expression data is for biological reasons rather large. As a linear projection of the data into 2 dimensions does not scale well in the number of clusters there is a need for new visualization techniques using non linear arrangement of the clusters. The new visualization tool is implemented in the open source statistical computing environment R. It is demonstrated on microarray data from yeast. Keywords: Cluster analysis, graphs, microarray data, R 1 Introduction Gene expression microarray experiments yield large and complex multivariate datasets that consist of several thousands of genes at multiple states. A typical question during the analysis is to find groups in the data. Cluster analysis is commonly used to reduce the complexity of the data from multidimensional space to a single nominal variable, the cluster membership. In the analysis of microarray data clustering is used as vector quantization and the set of genes is divided into artificial subsets. Genetic interactions are so complex that the definition of gene clusters is not clear. Additionally microarray data are very noisy and co expressed genes can end up in different clusters. As no clear density clusters exist in the data the relationship between clusters is very important. Clusters of co expressed genes can help to discover potentially co regulated genes or association to conditions under investigation. Usually cluster analysis provides a good initial investigation of microarray data before actually focusing on functional subgroups of interest. In the literature numerous cluster algorithms for clustering gene expression data have been proposed. Besides traditional methods like hierarchical clustering, K means, partitioning

2 Scharl, T. and Leisch, F. around medoids (PAM, K medoids) or self organizing maps there are several algorithms dealing with time course gene expression data (e.g., Heyer et al., 1999, De Smet et al., 2002, Ben-Dor et al., 1999). The display of cluster solutions particularly for a large number of clusters is very important in exploratory data analysis. Visualization methods give practitioners an understanding of the relationships between segments of a partition and make it easier to interpret the cluster results. Neighborhood graphs (Leisch, 2006) can be used for visual assessment of the cluster structure of centroid based cluster solutions. A linear projection of the data into 2 dimensions using for example linear discriminant analysis (LDA) does not scale well in the number of clusters. Gene expression data are high dimensional data and are usually separated into a lot of clusters (e.g., over 25 clusters in Heyer et al., 1999). As the vast amount of information cannot be shown in the plane there is a need for new visualization techniques. Using non linear arrangement of the clusters the cluster structure can be displayed up to a very large number of clusters. In this work the layout algorithms implemented in the open source graph visualization software Graphviz are used for non linear arrangement of the clusters. The new visualization tool is currently available at the homepage of the first author (http://www.ci.tuwien.ac.at/ scharl/software/) and will be released as an R package (R Development Core Team, 2007, http://www.r-project.org) soon. The functionality is demonstrated on a publicly available data set from yeast. 2 Methods 2.1 Cluster Algorithms In this work we focus on centroid based cluster algorithms like K means and PAM or others where clusters can be represented by centroids (e.g., QT Clust, Heyer et al., 1999). For a given data set X N = {x 1,..., x N } the distance between points x and y is given by d(x, y), e.g., the Euclidean or absolute distance. C K = {c 1,..., c N } is a set of centroids and the centroid closest to x is denoted by c(x) = argmin d(x, c). c C K The set of all points where c k is the closest centroid is given by A k = {x n c(x n ) = c k }. Minimizing the average distance between each data point and its closest centroid D(X n, C K ) = 1 N d(x n, c(x n )) min N C K n=1 is the task of most cluster algorithms.

Visualizing Gene Clusters using Neighborhood Graphs in R 3 2.2 Neighborhood Graphs Neighborhood graphs (Leisch, 2006) use the idea of topology representing networks (TRNs, Martinetz and Schulten, 1994) to count the number of data points a pair of centroids is closest and second closest. In TRNs the counts are used as weights for the edges of the graph. Silhouette plots (Rousseeuw, 1987) are diagnostic plots revealing the goodness of a partition. The distance from each point to the points in its own cluster is compared to the distance to points in the second closest cluster. The larger the silhouette values the better a cluster is separated from the other clusters. But silhouette plots do not show the proximity of clusters. They only give an indicator how wellseparated single points are from other clusters. Neighborhood graphs combine these two approaches and use the mean relative distances as edge weights in order to measure how separated pairs of clusters are. Hence they display the distance between clusters. In the graph each node corresponds to a cluster centroid and two nodes are connected by an edge if there exists at least one point that has these two as closest and second closest centroid. As described above the centroid closest to x is denoted by c(x) and the second closest centroid to x is denoted by c(x) = argmin d(x, c). c C K\{c(x)} Now the set of all points where c i is the closest centroid and c j is second closest is given by For each observation x we define A ij = {x n c(x n ) = c i, c(x n ) = c j }. s(x) = 2d(x, c(x)) d(x, c(x)) + d(x, c(x)). s(x) is small if x is close to its cluster centroid and close to 1 if it is almost equidistant between the two cluster centroids. The average s value of all points where cluster i is closest and cluster j is second closest can be used as a proximity measure between clusters and as edge weight in the graph. { Ai 1 x A s ij = s(x), A ij ij 0, A ij = A i is used in the denominator instead of A ij to make sure that a small set A ij consisting only of badly clustered points with large shadow values does not induce large cluster similarity. 3 Data In this work a publicly available dataset from yeast was investigated, the seventeen time point mitotic cell cycle data (Cho et al., 1998) available at

4 Scharl, T. and Leisch, F. http://genome-www.stanford.edu. The dataset was preprocessed adapting the instructions given by Heyer et al. (1999). The outlier time points 10 and 11 were removed from the original 17 variables. The gene vectors were standardized to have median 0 and MAD 1. Finally genes that were either expressed at very low levels or did not vary significantly over the time points were removed. This procedure yields gene expression data on N = 2832 genes (observations) for T = 15 time points (variables). The data was clustered using the K means algorithm. In this example 15 clusters were selected. 4 Software and Implementation All cluster algorithms and visualization methods used are implemented in the statistical computing environment R. R package flexclust (Leisch, 2006) is a flexible toolbox to investigate the influence of distance measures and cluster algorithms. It contains extensible implementations of the K centroids and QT Clust algorithm and offers the possibility to try out a variety of distance or similarity measures as cluster algorithms are treated separately from distance measures. New distance measures and centroid computations can easily be incorporated into cluster procedures. The default plotting method for cluster solutions in flexclust is the neighborhood graph. The visualization of partitioning cluster solutions is commonly accomplished by linear projection of the data into 2 dimensions using for example LDA. In Figure 1 the best possible separation using LDA is shown. The relationships between the centroids of 15 clusters can hardly be displayed in the plane. Therefore a new visualization tool is presented using nonlinear arrangement of the nodes. Infrastructure for creating, manipulating, and visualizing graphs is provided in Bioconductor (Gentleman et al., 2005, http://www.bioconductor.org) package graph. The package contains functionality for data structure, classes and methods to manipulate graphs and enables efficient representations of very large graphs. An interface to the open source graph visualization software Graphviz (http://www.graphviz.org/) is provided in Bioconductor package Rgraphviz which returns the layout information for a graph object, x- and y coordinates of the graph s nodes as well as the parameterization of the trajectories of the edges. Several layout algorithms can be chosen. dot: hierarchical layout algorithm for directed graphs neato and fdp: layout algorithms for large undirected graphs twopi: radial layout circo: circular layout Additionally global and local properties (e.g., labels, shape, color,... ) can be assigned to both nodes and edges. Using non linear arrangement of clusters clearly improves the visualization of the cluster solution (Figure 2). In the new visualization method related

Visualizing Gene Clusters using Neighborhood Graphs in R 5 8 0 5 10 15 4 10 14 9 511 12 72 15 13 6 1 3 8 6 4 2 0 2 4 Fig. 1. Projection of neighborhood graph of a K means cluster solution of the yeast data into 2 dimensions. clusters are not forced to lie next to each other. For example cluster 11 located at the bottom end of the graph is related to cluster 1 located at the top end of the graph. Additionally the graph is simplified by only drawing edges between nodes if the similarity of a cluster to another cluster is at least 10%. In Figure 2 the cluster structure of the cluster solution can easily be investigated. Cluster 15 is very different from the remaining clusters as no edge is drawn to cluster 15 and the similarities to connected clusters are very small. This indicates that the genes in cluster 15 are very different from the remaining genes. As cluster 4 is similar to clusters 10 and 14 which are also strongly connected the 3 clusters seem to be highly related. 4.1 Edge Methods The neighborhood graph is a directed graph as the similarity of cluster 1 to cluster 2 is different from the similarity of cluster 2 to cluster 1. Besides plotting the original directed graph there are several possibilities how to plot edges taking into account for instance the mean, minimum or maximum of the similarities between two clusters.

6 Scharl, T. and Leisch, F. k8 >= 0.4 0.3 0.2 0.1 0 k15 k3 k1 k13 k6 k12 k2 k7 k5 k4 k11 k9 k14 k10 Fig. 2. Neighborhood graph of a K means cluster solution for the yeast data using the dot layout algorithm. 4.2 Node Methods In the simple visualization of a neighborhood graph one single kind of node symbol is used for all nodes. By just looking at the graph no information about the different clusters is revealed. There are several possibilities how to include additional information in the representation of nodes. The most simple method is to use color coding, e.g., to color nodes by size or tightness of the corresponding clusters. Another possibility is to use different shapes or symbols for nodes representing clusters with specific properties. In Figure 3 the cluster solution is shown with tight clusters highlighted, i.e., clusters with small average distance of the genes to the cluster centroid. In this example the tightest clusters are numbers 15 and 8 indicating genes very different from the rest of the genes (colored darkgrey) and numbers 13 and 4 (lightgrey). 4.3 Other Features The new visualization tool offers various possibilities for the analysis of microarray data which cannot be shown here due to space constrains. The neighborhood graph is implemented in an interactive way and gene clusters can be

Visualizing Gene Clusters using Neighborhood Graphs in R 7 k8 >= 0.4 0.3 0.2 0.1 0 k15 k3 k1 k13 k6 k12 k2 k7 k5 k4 k11 k9 k14 k10 Fig. 3. The neighborhood graph with tight clusters highlighted. investigated by clicking on the nodes. Plots of the expression profiles of the corresponding genes pop up as well as tables giving further information about the genes. Additional information about the gene clusters can be included in the neighborhood graph. External information from differential expression analysis or functional grouping can easily be included in the node representation, e.g., the accumulation of gene ontology (GO) categories in certain gene clusters. 5 Summary Cluster analysis is commonly used in the analysis of gene expression data to find groups of co expressed genes. But the definition of gene clusters is not very clear as genetic interactions are extremely complex. For this reason the relationship between clusters is very important as co expressed genes can end up in different clusters. The neighborhood graph is a useful tool to visualize the underlying cluster structure. The visualization of partitioning cluster solutions is commonly accomplished by linear projection of the data into 2 dimensions. However, this is not recommended for high dimensional data like microarray data and a large number of clusters. In our new visual-

8 Scharl, T. and Leisch, F. ization method layout algorithms for non linear arrangement of the clusters are used to display the relationships between clusters. Our interactive software tools for the analysis of gene expression data is very helpful not only for statisticians but also for practitioners. Acknowledgement This work was supported by the Austrian K ind /K net Center of Biopharmaceutical Technology (ACBT). References BEN-DOR, A., SHAMIR, R. and YAKHINI, Z. (1999): Clustering gene expression patterns. Journal of Computational Biology, 6 (3 4), 281 297. BICKEL, D.R. (2003): Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically. Bioinformatics, 19 (7), 818 824. CAREY, V.J., GENTLEMAN, R., HUBER, W. and GENTRY, J. (2005): Bioconductor Software for Graphs. In: R. Gentleman, V.J. Carey, W. Huber, R.A. Irizarry and S. Dudoit (Eds.): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York. CHO, R.J., CAMPBELLl, M.J., WINZELER E.A., STEINMETZ, L., CONWAY, A., WODICKA, L., WOLFSBERG, T.G., GABRIELIAN, A.E., LANDSMAN, D., LOCKHART, D.J. and DAVIS, R.W. (1998): A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2/1, 65-73. DE SMET, F., MATHYS, J., MARCHAL, K., THIJS, G., DE MOOR, B. and MOREAU, Y. (2002): Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18 (5), 735 746. GENTLEMAN, R., CAREY, V.J., HUBER, W., IRIZARRY, R.A. and DUDOIT, S. (Eds.) (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York. HEYER, L.J., KRUGLYAK, S. and YOOSEPH, S. (1999): Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research, 9, 1106-1115. LEISCH, F. (2006): A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis 51 (2), 526 544. MARTINETZ, T. and SCHULTEN, K. (1994): Topology representing networks. Neural Networks, 7 (3), 507 522. R DEVELOPMENT CORE TEAM (2007): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org. ROUSSEEUW, P.J. (1987): Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53 65.