Grid-Layout Visualization Method in the Microarray Data Analysis Interactive Graphics Toolkit

Li Xiao, Oleg Shats, and Simon Sherman*
Nebraska Informatics Center for the Life Sciences
Eppley Institute for Research in Cancer and Allied Diseases
University of Nebraska Medical Center, Omaha, NE 68198-6805
Email: {lxiao, oshats, ssherm}@unmc.edu

Abstract

DNA microarray technology can measure, in a single experiment, the expression levels of thousands of genes in different tissues or cells under different conditions. A new grid-layout method for visualizing the results of hierarchical cluster analysis of DNA microarray data is proposed and incorporated into the Microarray Interactive Graphics Toolkit (MIGT). The grid-layout consists of a set of regular, two-dimensional grid units. Each unit represents a cluster or a group of gene clusters. The units are connected to adjacent ones by the neighborhood relation of the clusters in a hierarchical tree, so that nodes lying near each other in the tree are mapped onto nearby grid-layout units. The number of units may vary from a few dozen to several thousand, depending on the number of nodes in the hierarchical tree. Colors are assigned to the units as RGB values according to the coordinates of the units, the inter-distances (the distances between clusters in the hierarchical tree), and the intra-distances (the distances between genes within one cluster). The closer the inter-distances, the more similar the colors of the units; the smaller the intra-distances, the warmer the color of the unit.

1. Introduction

DNA microarrays exploit the preferential binding of complementary, single-stranded molecular fragments of DNA that are attached at fixed locations (spots) on glass slides. There may be up to tens of thousands of spots on a slide, each representing a single gene.
The microarray technology offers an opportunity to simultaneously screen the expression patterns of a large number of distinct genes.

* Corresponding author

An aim of microarray analysis is to identify genes that are differentially expressed in the target cells as compared to the reference cells. The difference between the expression profiles of cell samples can be characterized quantitatively. This allows researchers to track the effect of interventions or natural processes on gene expression levels, as well as to identify the functions of genes and the biochemical pathways in which they participate.

Microarray data can be analyzed with cluster analysis methods [1-4] and with methods for identifying genes that are differentially expressed in the target cells as compared to the reference cells [5]. The cluster approach holds much promise for determining groups of genes with a similar function: a measure of correlation between gene expression profiles can be used to cluster together genes with similar expression. Although both academic [7] and commercial [8] software for cluster analysis is already available, there is a great need for proper visualization of the results. Because the microarray data sets involved in clustering are very large, the dendrogram, which is the usual way of presenting the result of hierarchical clustering, does not work well. Therefore, in this paper we present a novel grid-layout method for visualizing the results of the hierarchical clustering of gene expression data obtained in DNA array experiments.

The proposed method presents the subgroups of genes in a hierarchical cluster tree as colored rectangular or hexagonal units, which become parts of a grid-layout map. In the hierarchical tree, the subgroups are divided by a cut-off value defined by the user.
The distances between these subgroups are represented by the colors assigned to the corresponding units, in such a way that neighboring subgroups have similar colors, expressed as RGB values. Moreover, the RGB value assigned to a unit also depends on the scale of the distances between the genes within the subgroup in that unit. The grid-layout method thus presents both the distances between the subgroups in a hierarchical tree and the distances within subgroups. Therefore, this method can simplify the visual presentation of hierarchical clustering of large microarray data sets.

0-7695-1874-5/03 $17.00 (C) 2003 IEEE

The proposed grid-layout method is incorporated into the Microarray Interactive Graphics Toolkit (MIGT), which is under construction in our laboratory. The MIGT also provides a set of tools for preliminary analysis of microarray data, such as normalization, scatter plotting, and hierarchical clustering.

Section 2 describes the grid-layout method proposed for visualizing the results of hierarchical clustering of microarray data. Section 3 describes the MIGT software package, including the grid-layout method and its prerequisite, hierarchical clustering for DNA microarray analysis. Section 4 provides conclusions and directions for future work.

2. Grid-layout Visualization Method for the Hierarchical Clustering of Microarray Data

In microarray data analysis, hierarchical trees (dendrograms) are usually used to visualize and analyze the results of hierarchical clustering [7, 8]. When the original data contain more than 30 nodes, however, the corresponding dendrograms may look crowded, and visual analysis of the clustering results becomes very difficult. For example, in TreeView [7], a software package developed at Stanford University, enlarged dendrograms can be viewed in a window with a vertical scroll bar. The user can observe the genes within one cluster or in neighboring clusters, but the inter-distances between all clusters are not well exhibited because of the large size of the original data.

In this paper, we propose a method for analyzing the results of hierarchical clustering, which we call the grid-layout method.
An analogous method was originally used to visualize the results of self-organizing maps [10, 11]; we propose to use it to visualize the results of the hierarchical clustering of microarray data. The grid-layout consists of a regular, two-dimensional grid of units. Each unit is a cluster or a group of clusters. The units are connected to adjacent ones by the neighborhood relation of the clusters in a hierarchical tree. Nodes lying near each other in the hierarchical tree are mapped onto nearby grid-layout units. The number of units can vary from a few dozen to several thousand, depending on the number of nodes in the hierarchical tree. Colors are assigned to the units as RGB values. These assignments depend on the coordinates of the units, the inter-distances between clusters, and the intra-distances between genes within one cluster. The closer the inter-distances, the more similar the colors of the units; the smaller the intra-distances, the warmer the color of the unit. The grid-layout method consists of the three steps described below.

Step 1. Determining the number of the grid-layout units.

There are two ways to cut a tree: (i) by giving the height at which the hierarchical tree will be cut, and (ii) by giving the maximum number, nc, of branches to be kept in the hierarchical tree, where nc is less than or equal to the number of grid-layout units, nu. In this paper, we use the latter. When the user predefines the number nc of clusters to be exhibited in the grid-layout, the number of units, nu, which has to be greater than or equal to nc, is found as follows. The width of the grid-layout, gw, is estimated from nc by rounding toward minus infinity: $gw = \lfloor \sqrt{nc} \rfloor$. The height of the grid-layout, gl, is estimated from the ratio nc/gw by rounding toward plus infinity: $gl = \lceil nc / gw \rceil$. The number of units is then $nu = gw \times gl$, and the number of remaining empty grids is $reg = nu - nc$.
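As a minimal sketch of the Step 1 sizing rule, the following Python function computes gw, gl, nu, and reg from nc. The function name grid_dimensions is hypothetical and is not part of the MIGT; the square root in the width estimate is an assumption inferred from the paper's own worked values (nc = 18 gives gw = 4).

```python
import math


def grid_dimensions(nc):
    """Return (gw, gl, nu, reg) for nc clusters, following Step 1.

    Assumption: gw = floor(sqrt(nc)) and gl = ceil(nc / gw).
    """
    if nc < 1:
        raise ValueError("nc must be a positive integer")
    gw = math.isqrt(nc)      # width: rounded toward minus infinity
    gl = -(-nc // gw)        # height: integer ceiling of nc / gw
    nu = gw * gl             # total number of grid-layout units
    reg = nu - nc            # remaining empty grids
    return gw, gl, nu, reg


print(grid_dimensions(18))
```

Because gl is rounded up, nu is never smaller than nc, so every cluster receives a unit and at most gw - 1 units stay empty.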
For example, if the number of clusters, nc, is 18, then the width of the grid-layout, gw, is 4, the height, gl, is 5, and the number of units, nu, is 20. The number of remaining empty grids, reg, is 2.

Step 2. Calculating coordinates of the units.

The user can select the shape of the lattice, rectangle (R) or hexagon (H), and can also set the size of a single rectangle or hexagon. The coordinate matrix of the grid-layout is calculated by multiplying the length of each R, or the width of each H, by the numbers of units along the height and the width of the grid-layout obtained in Step 1. The hierarchical dendrogram (tree) is split into the given number, nc, of subclusters, taken from left to right on the dendrogram. The ordered subclusters are placed on the grid-layout one by one, starting from the upper-left corner of the coordinate matrix and proceeding from top to bottom; placement moves to the next column once the current column is filled. The remaining empty grids, reg, are distributed over the grid-layout according to where the inter-distances between neighboring subclusters are larger.

Step 3. Assigning colors to the grid-layout units.

Colors are assigned to the units in the RGB (Red, Green, Blue) format, in which each color is represented as a mixture of three components: red, green, and blue. Colors can also be encoded in the
HSB (Hue, Saturation, Brightness) format. As the hue varies from 0 to 1, the resulting color passes through red, yellow, green, cyan, blue, and magenta, and back to red. Saturation also varies from 0 to 1: at 0 the colors are unsaturated, that is, simply shades of gray; at 1 the colors are fully saturated and contain no white component. The brightness increases as its value varies from 0 to 1.

To assign colors to neighboring grid-layout units, two types of information are taken into account. First, the distances between clusters (inter-cluster distances) and the distances between genes within one cluster (intra-cluster distances) are separately scaled and assigned to two index vectors, which we call the inter-vector and intra-vector indexes of the subclusters. According to the scale of the inter-vector values, the hue values of the corresponding subclusters are arranged on a scale from 0 to nearly 1 (from red through yellow, green, cyan, and blue to magenta). In this way, neighboring subclusters have similar colors (hues) when there are many subclusters. Further, according to the scale of the intra-vector values, the saturation values of the corresponding subclusters are assigned on a scale from 0.3 to 1, following the rule that similar intra-distances receive similar saturation values and that the larger the intra-distance, the larger the saturation value. The saturation range of 0.3 to 1 is chosen to avoid the overrepresentation of grayish, unsaturated colors. The brightness of all units is set to a fixed value above 0.5 to avoid the overrepresentation of dark colors. After the HSB values of a subcluster are obtained, the Java function HSBtoRGB is used to convert them to the RGB value of the color with the indicated hue, saturation, and brightness.

3. Microarray Interactive Graphics Toolkit

The MIGT implements commonly used microarray data analysis methods, such as normalization, scatter plotting, and hierarchical clustering. The MIGT consists of seven tabbed panes: Load&Edit, Scatter Plot, Normalization, Cluster, Visualization, Web, and About. The Load&Edit pane allows users to load and edit microarray data. The Scatter Plot pane provides a 3D scatter plot that allows interactive manipulation of gene expression data. The Normalization pane provides several tools for normalizing microarray data. The Cluster pane allows users to perform hierarchical cluster analysis of gene expression data. The Visualization pane allows users to observe the results of hierarchical clustering by two methods, dendrogram and grid-layout. The Web pane allows users to connect the application to databases on the Web and to extract the corresponding biological information on the genes under observation. The About pane provides help information on the current version of the MIGT. In this paper, we describe the Visualization and Cluster panes in detail.

3.1. Cluster Pane

Hierarchical clustering is the most widely used method for the analysis of patterns of gene expression [1-4, 9]. It produces a representation of the gene expression data in the shape of a binary tree, in which the most similar patterns are clustered in a hierarchy. There are two major types of hierarchical clustering algorithms: divisive algorithms, which recursively partition the data set until singleton sets are reached, and agglomerative algorithms, which begin with singleton sets and merge them until the whole data set is reached. The agglomerative methods, which are far more common, are incorporated in our MIGT package.
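The agglomerative idea of starting from singleton sets and merging until one set remains can be illustrated with a short Python sketch. It is not the MIGT implementation: the Euclidean metric, the single-linkage cost function, and all function names are assumptions chosen only for illustration, and the brute-force search is suitable for small examples only.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_link_cost(s, t, profiles):
    """Assumed cost function: smallest distance between members of s and t."""
    return min(euclidean(profiles[i], profiles[j]) for i in s for j in t)


def agglomerate(profiles):
    """Merge the cheapest pair of sets until one set remains.

    Returns the merge history as (members of s, members of t, cost) tuples.
    """
    sets = [frozenset([i]) for i in range(len(profiles))]  # singleton sets
    history = []
    while len(sets) > 1:
        s, t = min(combinations(sets, 2),
                   key=lambda pair: single_link_cost(pair[0], pair[1], profiles))
        cost = single_link_cost(s, t, profiles)
        # Remove the two merged sets and add their union.
        sets = [u for u in sets if u not in (s, t)] + [s | t]
        history.append((sorted(s), sorted(t), cost))
    return history


# Hypothetical expression profiles of four genes over two samples.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
history = agglomerate(profiles)
for step in history:
    print(step)
```

A data set of m elements always yields m - 1 merges, which is why the merge history of a hierarchical clustering has one row fewer than the number of genes.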
In agglomerative methods, the distances between pairs of elements of a data set are calculated using a chosen distance metric and placed into a list of distances, $d_1, d_2, \ldots, d_n$. Then a cost function is used to find the pair $\{d_i, d_j\}$ in the list that is cheapest to merge. Finally, the corresponding sets $x_i$ and $x_j$ are removed from the data set and replaced with their union, $x_i \cup x_j$. This process is repeated until only one set remains. Different distance metrics and cost functions may produce different clusters. The main metrics and cost functions used in agglomerative clustering algorithms are described in detail by Everitt and Dunn [15].

The Cluster pane provides a hierarchical clustering procedure that consists of the following three phases.

Distance Phase: The task in this phase is to find the similarity or dissimilarity between every pair of genes in the microarray data set. The microarray data X is treated as M genes measured over N samples. The distance, which measures the similarity or dissimilarity, is calculated with the chosen distance metric. The distance vector produced in this phase contains the distances between all pairs of genes. Since there are M*(M-1)/2 pairs of genes in X, the size of the distance vector is M*(M-1)/2.

Grouping Phase: In this phase, pairs of genes that are in close proximity are linked together, so that the genes are grouped into a binary, hierarchical cluster tree. The distance vector generated in the Distance Phase is used to determine the proximity of genes to each other. As genes are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical
tree is formed. The cluster information is returned in a vector with m-1 rows and 3 columns, where m is the number of genes. The first two columns contain the indices of the genes or clusters that are linked in pairs to form the binary tree, and the third column contains the distance between the linked pair.

Division Phase: In this phase, the hierarchical tree built in the Grouping Phase is divided into clusters according to a threshold value predefined by the user. The resulting clusters are stored in a vector with two columns: the first column contains the gene names, and the second column contains the ID of the group to which the gene in the same row belongs.

After all three phases are completed, the relationships between subgroups in the hierarchical cluster can be viewed in the Visualization pane. In the MIGT, the data structures of the hierarchical clustering algorithms are compatible with MATLAB [13]. In each phase of clustering, the intermediate data can be opened and inspected directly. These data can be transferred into MATLAB for a comprehensive statistical analysis, and data in MATLAB can be transferred into the MIGT for interactive visualization.

3.2. Visualization Pane

In the Visualization pane, two methods, dendrogram and grid-layout, are provided for observing the clustering results.

The dendrogram method plots the hierarchical tree information obtained from the Cluster pane as a graph. Its input parameter is the vector generated in the Grouping Phase of the Cluster pane, which has three columns and m-1 rows, where m is the number of genes in the microarray data set. The pair of genes or clusters in each row is the pair with the lowest merge cost at that step of the hierarchical clustering. The order of the indices in the vector corresponds to both the order and the content of the leaf nodes of the dendrogram.
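The data structures of the Grouping and Division Phases can be sketched in Python as follows. This is an illustrative sketch, not the MIGT code. It assumes zero-based indices in which leaves are numbered 0 to m-1 and the cluster formed in row i of the merge vector receives index m+i; it also assumes that only merges with a distance below the threshold are kept. The gene names, merge values, and the function name cut_tree are hypothetical.

```python
def cut_tree(merges, m, threshold):
    """Return a group ID for each of the m genes.

    merges: m-1 rows of (index_a, index_b, distance), as produced by a
    grouping step. Only merges with distance < threshold are applied.
    """
    parent = list(range(2 * m - 1))  # leaves 0..m-1, clusters m..2m-2

    def find(i):
        # Follow parent links to the current top-level cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for row, (a, b, dist) in enumerate(merges):
        if dist < threshold:
            new_cluster = m + row
            parent[find(a)] = new_cluster
            parent[find(b)] = new_cluster

    group_ids = {}
    return [group_ids.setdefault(find(leaf), len(group_ids)) for leaf in range(m)]


# Hypothetical example with m = 4 genes.
names = ["geneA", "geneB", "geneC", "geneD"]
m = len(names)
n_pairs = m * (m - 1) // 2                       # size of the distance vector
merges = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 6.4)]  # m-1 rows, 3 columns
groups = cut_tree(merges, m, threshold=3.0)
division = list(zip(names, groups))              # gene name, group ID
print(n_pairs, division)
```

Raising the threshold keeps more merges and therefore yields fewer, larger groups; lowering it splits the tree into more subgroups.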
In the dendrogram, the numbers along the horizontal axis represent the indices of the genes in the microarray data set. The links between genes are represented as upside-down U-shaped lines, and the height of each U indicates the distance between the linked genes. Beside the dendrogram plot, the Visualization pane contains two lists: a Gene list, which shows the names of the genes used for hierarchical clustering, and a Highlights list, which shows information on the highlighted genes. The user drags the mouse to draw a rectangle around the part of the dendrogram of interest, and the leaf nodes within the rectangle are highlighted when the mouse button is released. The gene information of the highlighted leaf nodes is then shown in the Highlights list.

The grid-layout method plots the hierarchical tree information on a grid-layout consisting of a two-dimensional grid of units. Its input parameters are the vector generated by the Grouping Phase and the vector generated by the Division Phase in the Cluster pane. Two shapes of grid, rectangle and hexagon, are available. Two kinds of grid-layout are provided: one shows the labels of genes on the grids, and the other shows different colors on the grids. As described in Section 2, the subgroups of genes split by the Division Phase are placed on the grids one by one, based on their neighborhood in the hierarchical clustering. This neighborhood can also be seen from the order of the labels along the horizontal axis of the dendrogram. Because the gene names are long, only the labels of genes can be shown on the grids. However, showing the labels of all genes belonging to one grid would clutter the visualization. Therefore, only one gene label is shown in each grid: the label of the gene that occurs first in the subgroup.
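The two kinds of grid-layout, labeled and colored, can be sketched together in Python. This is a hedged illustration rather than the MIGT code. All names are hypothetical; the linear scaling, the brightness of 0.6, and the maximum hue of 0.9 are assumed values; Python's colorsys.hsv_to_rgb stands in for the Java HSBtoRGB function. Saturation grows with the intra-distance index, following the wording of Step 3 in Section 2, and the distribution of empty grids is omitted.

```python
import colorsys


def place_subgroups(subgroups, gl):
    """Map ordered subgroups to (column, row) units, filling each column
    from top to bottom before moving to the next column."""
    return {(k // gl, k % gl): genes for k, genes in enumerate(subgroups)}


def scale(values, lo, hi):
    """Linearly rescale values into [lo, hi]; constant input maps to lo."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def unit_colors(inter_index, intra_index, brightness=0.6, max_hue=0.9):
    """Return one (R, G, B) tuple per subgroup.

    Hue follows the scaled inter-vector index; saturation follows the
    scaled intra-vector index in [0.3, 1]; brightness is fixed.
    """
    hues = scale(inter_index, 0.0, max_hue)
    sats = scale(intra_index, 0.3, 1.0)
    colors = []
    for h, s in zip(hues, sats):
        rgb = colorsys.hsv_to_rgb(h, s, brightness)
        colors.append(tuple(int(c * 255 + 0.5) for c in rgb))
    return colors


# Hypothetical subgroups, already ordered from left to right on the dendrogram.
subgroups = [["g345", "g2145"], ["g17"], ["g8", "g99", "g3"], ["g42"], ["g5", "g6"]]
layout = place_subgroups(subgroups, gl=3)
labels = {unit: genes[0] for unit, genes in layout.items()}  # first gene only
colors = unit_colors([0.0, 1.0, 2.5, 4.0, 5.0], [0.2, 0.0, 0.9, 0.1, 0.5])
print(labels)
print(colors)
```

Keeping placement and coloring as separate functions mirrors the two display modes: the labeled layout needs only the ordered subgroups, while the colored layout additionally needs the two index vectors.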
The inter-distances and intra-distances are obtained from the vector generated by the Grouping Phase. The inter-distance of two subgroups of genes in a hierarchical tree is the distance of the top, newly formed binary cluster created by grouping the genes of the first subgroup, minus the distance of the top, newly formed binary cluster created by grouping the genes of the second subgroup. The intra-distance of a subgroup is calculated as the average of the distances between the genes within that subgroup, as recorded in the Grouping Phase vector. The colors (RGB values) of the grids are assigned as described in Step 3 of the grid-layout method in Section 2. The genes in a grid of interest can be shown in the Highlights list by clicking on that grid.

The performance of the hierarchical cluster visualization in the MIGT was tested on the Lymphochip microarray data set [6], which has 42 samples of 2041 genes preferentially expressed in lymphoid cells. After the hierarchical clustering of the data, we set up a grid-layout of 20*20 units for the cluster visualization. The grid-layout before assigning color is shown in Figure 1, and the corresponding grid-layout after assigning color is shown in Figure 2. Each label in the grids represents a group of genes. For example, there are nine genes in the first grid (upper-left corner). Because of the size of the grid, only one gene label, '345' (clone name 1336563), is shown. Behind gene label '345' are the genes '2145' (Clone=1367804), '413' (Clone=1338105), '799' (Clone=1355116), '211' (Clone=1319728), '959' (Clone=1367840), '53' (Clone=17), '1411' (Clone=128088), and '25' (Clone=65). The HSB value of the first grid is (0.6, 0.6, 0.5), and the RGB value of this grid, which has an olive-green color, is (127, 127, 49). We found that the colors of the grids at the
upper-right of the layout are red, and the intra-distances of these subgroups are smaller. The colors of the grids at the center of the layout are gray, and the intra-distances of these subgroups are relatively bigger.

Figure 1. The grid-layout before assigning color.

Figure 2. The grid-layout after assigning color.

4. Conclusion and future work

The abilities of visualization and interaction are important for the user to observe and mine microarray data. In this paper, we introduce the grid-layout visualization method for microarray data clustering, as well as the MIGT software designed for microarray data analysis. In the grid-layout method, grids are used to visualize the results of hierarchical clustering. In the MIGT, the grid-layout method is combined with an interactive observation method. The user can focus on gene information through interactive observation, while the grid-layout visualization allows the user to present the results of such observations in a simple way. In addition to the grid-layout visualization method, the basic procedures for microarray data analysis, such as normalization, scatter plotting, and hierarchical clustering, are incorporated in the MIGT.

In future work, more complicated interactive operations on the grid-layout, visualization, and low-dimensional focus should be considered. For instance, the cluster information in one or several grids selected by the user could be expanded into a new grid-layout showing the cluster information of the grids of interest. This capability will also be provided in the new version of the MIGT.

Acknowledgements

This work was partially supported by the NSF EPSCoR grant (#0091900). This publication was made possible by NIH Grant Number RR15635 from the COBRE Program of the National Center for Research Resources.

References

[1] D.A. Lashkari, J.L. DeRisi, J.H. McCusker, A.F. Namath, C. Gentile, S.Y. Hwang, P.O. Brown, and R.W.
Davis, Yeast microarrays for genome wide parallel genetic and gene expression analysis, PNAS, National Academy of Sciences, 1997, 94: 13057-13062.
[2] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns, PNAS, National Academy of Sciences, 1998, 95: 14863-14868.
[3] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS, National Academy of Sciences, 1999, 96: 2907-2912.
[4] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien, Large-scale clustering of cDNA-fingerprinting data, Genome Research, Cold Spring Harbor Laboratory Press, 1999, 9: 1093-1105.
[5] M.K. Kerr, M. Martin, and G.A. Churchill, Analysis of variance for gene expression microarray data, Journal of Computational Biology, M.A. Liebert, Inc., 2000, 6: 819-837.
[6] A.A. Alizadeh, M. Eisen, R.E. Davis, C. Ma, I. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson Jr., L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and
L.M. Staudt, Distinct Types of Diffuse Large B-Cell Lymphoma Identified By Gene Expression Profiling, Nature, Nature Macmillan Publishers, 2000, 403: 503-511.
[7] Cluster and TreeView, Stanford University, http://rana.stanford.edu/software.
[8] GeneSight, BioDiscovery Inc., http://www.biodiscovery.com.
[9] P.O. Brown and D. Botstein, Exploring the new world of the genome with DNA microarrays, Nature Genetics, Nature Macmillan Publishers, 1999, 21 (Suppl.): 33-37.
[10] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, Self organization of a massive document collection, IEEE Transactions on Neural Networks, IEEE Neural Networks Society, 2000, 11: 574-585.
[11] T. Kohonen, Self-Organizing Maps, Third Extended Edition, Springer Series in Information Sciences, Vol. 30, Springer, Berlin, Heidelberg, New York, 2001, ISBN 3-540-67921-9, ISSN 0720-678X.
[12] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, One-stop shop for microarray data: is a universal, public DNA-microarray database a realistic goal?, Nature, Nature Macmillan Publishers, 2000, 403: 699-700.
[13] MATLAB (Version 6), www.mathworks.com.
[14] MAExplorer, http://www.lecb.ncifcrf.gov/maexplorer.
[15] B.S. Everitt and G. Dunn, Applied Multivariate Data Analysis, Oxford University Press, New York, 1992.