Linear and Non-linear Dimensionality Reduction Applied to Gene Expression Data of Cancer Tissue Samples

Franck Olivier Ndjakou Njeunje
Applied Mathematics, Statistics, and Scientific Computation
University of Maryland - College Park
fndjakou@math.umd.edu

Advisers:
Wojtek Czaja
John J. Benedetto
Norbert Wiener Center for Harmonic Analysis
Department of Mathematics
University of Maryland - College Park

October 18, 2014

Abstract

In computational biology and medicine, gene expression data are a very useful and important piece of the puzzle, as they are one of the main sources from which gene function and various disease mechanisms are derived. Unfortunately, the analysis and visualization of gene expression data are not easy tasks, because of the high dimensionality of the data generated by high-density microarrays. In this project, I will be interested in two methods developed to carry out the task of dimensionality reduction on the data at hand, so that the data become better suited for further analysis. In particular, I will look at implementations of Laplacian Eigenmaps and Principal Component Analysis as pre-processing dimensionality reduction methods, and see how they compare when their output is fed to similarity learning algorithms such as clustering.

Notation

x         a gene, i.e., a row of dimension M in the matrix X, unless otherwise stated
X         gene expression matrix of dimension N × M
X̃         standardized version of the matrix X
M and N   dimensions of the matrix X
y         reduced-dimension data of dimension m
x̄_i       mean of the vector x_i
σ_ii      variance of the vector x_i
C         covariance matrix
Λ         diagonal matrix containing the eigenvalues λ_i of the covariance matrix
U         matrix containing the eigenvectors u_i of the covariance matrix
W         weight matrix
L         Laplacian matrix
D         diagonal (degree) matrix
u_i, f_i  eigenvectors
λ_i       eigenvalues
m_j       means, for 1 ≤ j ≤ k, where k is the number of means
S_i^(t)   sets or clusters

1. Background and Motivation

1.1. Gene Expression Data

Gene expression data are information that numerically represent the expression levels of a set of genes under various environmental factors. These environmental factors could be of natural cause, such as the effect of cancer or any other disease on a set of genes, or they could be reactions to drugs or medicines taken to fight said diseases. The data are usually given in matrix form; let us call this matrix X. To obtain the gene expression matrix X, a high-density microarray is used to numerically determine the level of expression of a set of genes over multiple samples or observations. The matrix X is of dimension (N × M), where the number of genes is given by the variable N and the number of samples is given by the variable M.

Due to the usefulness of gene expression data, a wide range of algorithms have been developed to study the biological network provided by high-density microarrays. The main ones are classification and clustering techniques. It has been shown that classification of gene expression data can help distinguish between various cancer classes, while clustering techniques can help separate tumor tissue from healthy normal tissue. Unfortunately, the number of observations or samples, M, is in general very high, which makes it difficult to visualize the results of the similarity learning analysis. Therefore, in order to determine the structure of these data in the hope of extracting more information from them, whether to classify genes of the same kind based on their expression or to visually separate healthy tissue from unhealthy tissue, a dimensionality reduction algorithm is necessary as a pre-processing step.

1.2. Dimension Reduction

By taking a closer look at the data in figure 1, we can notice that within each expression array x, across the multiple samples, a lot of redundancy can be found. This gives us an opportunity to pre-process the data in order to retain only the most pertinent information. The methods used in this part of the analysis are known as dimensionality reduction techniques, and this is where I will be focusing throughout this year-long project. Given an array x of dimension M, the goal is to reduce this array to an m-dimensional array y such that m is very small compared to M, while retaining the most important information about the array across all the samples.

There are two classes of dimensionality reduction techniques: linear (LDR) and non-linear (NDR). The linear techniques assume a linear relationship between the data and perform quite well under these circumstances. The problem is that most data arising from gene expression do not entirely maintain a linear relationship, and to remedy this, non-linear methods have been developed. Their advantage is that non-linear methods aim to keep the intrinsic, or natural, geometric structure between the variables or data points. After this step is completed, a similarity learning analysis known as clustering is applied to the data in order to acquire more information about the gene mechanisms.

Figure 1: Dimensionality reduction illustration on a single gene expression across M samples.

1.3. Clustering

After a dimensionality reduction analysis has been performed on the data, a clustering analysis will follow and allow us to get a visual sense of the data at hand. The goal of clustering is to group elements of a set into separate subgroups, called clusters, in such a way that elements in the same cluster are, in one way or another, more similar to each other than to elements in other clusters. In practice, different clustering methods perform differently according to the nature of the data they are applied to.

2. Approach

For this project I will be interested in Principal Component Analysis, also known as PCA, as my linear dimensionality reduction method; it is the most common linear dimensionality reduction method used in the analysis of gene expression data. I will also look at Laplacian Eigenmaps, abbreviated LE, as my non-linear dimensionality reduction method. To perform similarity learning, I will be interested in how the outputs of the dimensionality reduction algorithms listed above compare when using hierarchical clustering and K-means clustering. The subsections below give a better understanding of how these methods operate mathematically.

2.1. Principal Component Analysis [1]

PCA is a linear dimension reduction algorithm: a statistical technique for handling multivariate data that makes use of the Euclidean distance to estimate a lower-dimensional representation of the data. While this method sometimes fails at preserving the intrinsic structure of the data (when the data have a non-linear structure), it does a good job of preserving most of the variance in the data.

The algorithm for this method can be viewed as three steps:

Step 1: Given the initial matrix X representing the set of data, construct the standardized matrix X̃ by making sure that each sample column has zero mean and unit variance:

    X̃ = (x̃_1, x̃_2, ..., x̃_M)                                                      (1)
      = ((x_1 - x̄_1)/√σ_11, (x_2 - x̄_2)/√σ_22, ..., (x_M - x̄_M)/√σ_MM).           (2)

Here, x̄_1, x̄_2, ..., x̄_M and σ_11, σ_22, ..., σ_MM are respectively the mean values and the variances of the corresponding column vectors.

Step 2: Compute the covariance matrix of X̃, then perform a spectral decomposition to get the eigenvalues and their corresponding eigenvectors:

    C = X̃^T X̃ = U Λ U^T.                                                           (3)

Here Λ = diag(λ_1, λ_2, ..., λ_M) with λ_1 ≥ λ_2 ≥ ... ≥ λ_M, and U = (u_1, u_2, ..., u_M); λ_i and u_i are respectively the i-th eigenvalue and corresponding eigenvector of the covariance matrix C.

Step 3: Given that we would like the target lower-dimensional space to be of dimension m, the i-th principal component can be computed as X̃u_i, and the reduced-dimensional (N × m) subspace is X̃U_m [1].

Notice from Step 3 that each principal component making up the reduced-dimensional subspace is just a linear combination of the raw variables.
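To make the three steps concrete, the following is a minimal MATLAB sketch of the procedure above. The function name, variable names, and the normalization of the covariance matrix by N - 1 are my own choices and not taken from the proposal; a full implementation would also need to handle zero-variance columns and very large matrices.

```matlab
% Minimal PCA sketch following Steps 1-3 above (illustrative only).
% X is assumed to be an N-by-M matrix (rows = genes, columns = samples);
% m is the target dimension, with m much smaller than M.
function Y = pca_sketch(X, m)
    N = size(X, 1);

    % Step 1: standardize each sample column to zero mean and unit variance.
    Xs = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));

    % Step 2: covariance matrix and its spectral decomposition.
    C = (Xs' * Xs) / (N - 1);                 % M-by-M covariance matrix
    [U, Lambda] = eig(C);                     % eigenvectors and eigenvalues
    [~, order] = sort(diag(Lambda), 'descend');
    U = U(:, order);                          % sort by decreasing eigenvalue

    % Step 3: project onto the first m principal directions.
    Y = Xs * U(:, 1:m);                       % N-by-m reduced representation
end
```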

2.2. Laplacian Eigenmaps [2]

This method has its advantages since it is a non-linear approach to dimension reduction. It aims to preserve the intrinsic, or natural, geometric structure of the manifold when passing from the high dimension to the lower dimension. This method can also be summarized in three steps:

Step 1: Given a set of N points or nodes x_1, x_2, ..., x_N in a high-dimensional space R^M, construct a weighted graph with N nodes. Constructing the graph is as simple as putting an edge between nodes that are close enough to each other. In doing this, one might consider the ε-neighborhood technique, where two nodes are connected if their squared Euclidean distance is less than ε, and not connected otherwise. This might sometimes lead to graphs with several connected components, or even to disconnected graphs. An alternative is the k-nearest-neighbor technique, where each node is connected to its k nearest neighbors. Both techniques yield a symmetric relationship.

Step 2: Choose the weights for the edges and construct the weight matrix W. This could be as simple as putting a 1 between two connected nodes and a 0 otherwise (if the nodes are not connected). One could also choose, as the weight, a function of the Euclidean distance between two connected nodes, and 0 otherwise.

Step 3: For each connected sub-graph, solve the following generalized eigenvector problem,

    Lf = λDf,                                                                      (4)

where D_ii = Σ_j W_ji is the diagonal (degree) matrix and L = D - W is the Laplacian matrix. Let f_0, f_1, ..., f_(N-1) be the solutions of (4) with corresponding eigenvalues λ_0, λ_1, ..., λ_(N-1), such that Lf_i = λ_i Df_i for i going from 0 to N-1 and 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_(N-1). Then the m-dimensional Euclidean space embedding is given by

    x_i → y_i = (f_1(i), ..., f_m(i)).                                             (5)
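A minimal MATLAB sketch of these three steps is given below, under the assumption of a k-nearest-neighbor graph with simple 0/1 weights and a single connected component. The function and variable names are placeholders, and the dense distance matrix and full generalized eigendecomposition are only practical for moderate N.

```matlab
% Minimal Laplacian Eigenmaps sketch following Steps 1-3 above (illustrative).
% X is an N-by-M data matrix (one point per row), k the number of neighbors,
% m the target dimension. Assumes distinct points and a connected graph.
function Y = le_sketch(X, k, m)
    N = size(X, 1);

    % Step 1: squared Euclidean distances and k-nearest-neighbor adjacency.
    sq = sum(X.^2, 2);
    D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');   % N-by-N squared distances
    W = zeros(N);
    for i = 1:N
        [~, idx] = sort(D2(i, :), 'ascend');
        W(i, idx(2:k+1)) = 1;                     % skip idx(1), the point itself
    end

    % Step 2: symmetrize to obtain simple 0/1 edge weights.
    W = max(W, W');

    % Step 3: degree matrix, Laplacian, and generalized problem Lf = lambda*D*f.
    D = diag(sum(W, 2));                          % degree matrix (positive if no isolated nodes)
    L = D - W;                                    % graph Laplacian
    [F, Lam] = eig(L, D);                         % real spectrum for symmetric L, positive definite D
    [~, order] = sort(diag(Lam), 'ascend');
    F = F(:, order);

    Y = F(:, 2:m+1);                              % drop the trivial f_0, keep f_1 ... f_m
end
```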

2.3. Hierarchical Clustering

Next I will consider a couple of clustering methods, starting with hierarchical clustering (HC). This is a connectivity-based algorithm: the idea is that nodes that are closer to each other are more related than those that are farther apart. There are two ways of implementing HC. One can take a bottom-up approach (of order O(n^3)), where each data point starts in its own cluster and, as we move on, pairs of clusters are merged together; see figure 2. Otherwise, one can take a top-down approach (of order O(2^n), mostly due to the search algorithm), where we start with one big cluster and splits are performed recursively as we move further down the hierarchy.

In order to proceed we then need to decide on a metric, a way to measure the distance between two observations, and a linkage criterion, a function of the pairwise distances between observations in the sets that outputs the degree of similarity between sets (this function lets us know whether or not two sets should be merged). Here are some commonly used metrics and linkage criteria.

Examples of metrics:

    Euclidean distance:    ||a - b||_2 = √( Σ_i (a_i - b_i)^2 )                    (6)

    Manhattan distance:    ||a - b||_1 = Σ_i |a_i - b_i|                           (7)

Examples of linkage criteria:

    Maximum or CLINK (complete-linkage clustering):   max{ d(a, b) : a ∈ A, b ∈ B }                  (8)

    Minimum or SLINK (single-linkage clustering):     min{ d(a, b) : a ∈ A, b ∈ B }                  (9)

    Mean or average-linkage clustering:               (1 / (|A| |B|)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)    (10)

Figure 2: Bottom-up hierarchical clustering illustration.
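For reference, the bottom-up approach can be run with MATLAB's built-in routines, which is what the project intends to use for the clustering comparison; this sketch assumes the Statistics Toolbox is available, and the reduced data matrix Y, the complete-linkage choice, and the cut into 4 clusters are placeholders rather than choices made in the proposal.

```matlab
% Bottom-up hierarchical clustering of a reduced data set Y (N-by-m),
% e.g. the output of PCA or Laplacian Eigenmaps. Requires the Statistics Toolbox.
dists  = pdist(Y, 'euclidean');          % pairwise distances, as in eq. (6)
tree   = linkage(dists, 'complete');     % CLINK, eq. (8); 'single' or 'average' also work
labels = cluster(tree, 'maxclust', 4);   % cut the hierarchy into 4 clusters (placeholder)
dendrogram(tree);                        % visualize the merge hierarchy
```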

2.4. K-means Clustering

The idea here is to randomly select an initial set of K means; these could be random vectors selected either within or outside your data set. This selection is followed by an assignment step, in which each individual data point is assigned to the nearest mean according to a well-defined metric (the squared Euclidean distance). After this step is done, the mean of each of the clusters formed is updated to the mean of the data in that cluster. The two previous steps are repeated until no new assignment is made, that is, until the clusters remain the same before and after an assignment step. This method is NP-hard (non-deterministic polynomial-time hard) and can be summarized as follows:

Initialize a set of k means m_1^(1), m_2^(1), ..., m_k^(1).

Assignment step: assign each observation x_p to exactly one set S_i containing the nearest mean to x_p,

    S_i^(t) = { x_p : ||x_p - m_i^(t)||^2 ≤ ||x_p - m_j^(t)||^2 for all j, 1 ≤ j ≤ k }.    (11)

Update step: update the mean within each cluster,

    m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j.                                     (12)

Repeat the two previous steps.

Stop when no new assignments are made.

See figure 3 for an illustration of these steps.

Figure 3: K-means clustering illustration.
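The loop below is a bare-bones MATLAB sketch of the assignment and update steps in equations (11) and (12). The function name and the random initialization are my own choices, empty clusters are not handled, and MATLAB's built-in kmeans function, which the project will actually use for comparison, provides a more robust implementation of the same idea.

```matlab
% Bare-bones k-means iteration mirroring equations (11) and (12) (illustrative).
% Y is an N-by-m data matrix, k the number of clusters.
function [labels, means] = kmeans_sketch(Y, k)
    N = size(Y, 1);
    means = Y(randperm(N, k), :);             % initialize the k means from the data
    labels = zeros(N, 1);
    while true
        % Assignment step, eq. (11): nearest mean in squared Euclidean distance.
        d2 = bsxfun(@plus, sum(Y.^2, 2), sum(means.^2, 2)') - 2 * (Y * means');
        [~, newLabels] = min(d2, [], 2);
        if isequal(newLabels, labels)         % stop: no assignment changed
            break
        end
        labels = newLabels;
        % Update step, eq. (12): each mean becomes the average of its cluster.
        for j = 1:k
            means(j, :) = mean(Y(labels == j, :), 1);
        end
    end
end
```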

3. The Data

The NCI-60 data I will be working with consist of microarray expression levels for close to 22,000 genes across 60 different cancer cell lines. I plan on working with the traditional gene expressions across these 60 cancer cell lines, without the presence of drugs. Although there will be no drugs in the samples, the presence of the cancer stimulus is still enough to make this analysis meaningful and interesting. These data are available for download through the CellMiner database on the NCI website.

4. Implementation and Validation Methods

4.1. Software and Hardware

The two dimension reduction algorithms described above will be implemented using Matlab as the mathematical tool. This decision is due to the superior ability of Matlab to deal with matrix operations. Another reason is the wide range of toolboxes available to bring this project to completion in a timely manner; the toolboxes will provide us with test data and prior implementations of PCA and LE for validation and benchmarking, respectively. I will be using my personal laptop with 8 GB of memory to run simulations on smaller data sets, and the Norbert Wiener Center lab machines, with 128 GB of memory, for larger data sets if needed. Clustering algorithms (K-means and hierarchical) built into Matlab will be used as comparison tools for the implemented dimension reduction algorithms.

4.2. Validation Methods

We will take advantage of the DRtoolbox [4], which contains implementations of the Principal Component Analysis and Laplacian Eigenmaps methods described above. The DRtoolbox also contains a number of well-understood data sets in 3-dimensional space, with corresponding representations in 2-dimensional space, for testing and validating the dimensionality reduction methods implemented for this project. Some examples of those data sets, courtesy of the DRtoolbox [4], include the following:

The Swiss Roll dataset, shown in figure 4:

    F : (x, y) → (x cos(x), y, x sin(x))                                           (13)

Figure 4: 3-dimensional presentation of the Swiss Roll data.

The Twin Peaks dataset, shown in figure 5:

    f(x, y) = x^4 + 2x^2 + 4y^2 + 8x                                               (14)

Figure 5: 3-dimensional presentation of the Twin Peaks data.
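As a sketch of how this validation might proceed, the fragment below builds a Swiss Roll point cloud from equation (13) and compares the 2-dimensional embeddings produced by the two sketches given earlier; the sample size and parameter ranges are common choices, not values taken from the proposal, and the DRtoolbox's own generators and reference implementations would be used for the actual benchmarking.

```matlab
% Sketch of the validation idea: embed a synthetic Swiss Roll (eq. 13) with both
% methods and compare the 2-D results visually. Parameter ranges are assumed.
n = 1000;
t = 3 * pi / 2 * (1 + 2 * rand(n, 1));     % angle parameter along the roll
h = 20 * rand(n, 1);                       % height of the roll
swiss = [t .* cos(t), h, t .* sin(t)];     % F : (x, y) -> (x cos x, y, x sin x)

Ypca = pca_sketch(swiss, 2);               % linear embedding (sketch from Sec. 2.1)
Yle  = le_sketch(swiss, 12, 2);            % non-linear embedding (sketch from Sec. 2.2)

subplot(1, 2, 1); scatter(Ypca(:, 1), Ypca(:, 2), 10, t, 'filled'); title('PCA');
subplot(1, 2, 2); scatter(Yle(:, 1), Yle(:, 2), 10, t, 'filled'); title('Laplacian Eigenmaps');
```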

5. Results

At the end of this project we expect to see better overall performance from the Laplacian Eigenmaps method than from Principal Component Analysis. This means that the clusters obtained from the output of LE should be visually more significant than those coming from PCA. In addition, the clustering algorithms should produce more consistent results from the output of LE than from that of PCA.

6. Timeline

Throughout the year I intend to follow the timeline below to completion.

October - November: Implementation of the PCA algorithm. Resolve issues that come up (storage and memory). Testing and validation.

December: Mid-year presentation.

January: First-semester progress report.

February - April: Implementation of the LE algorithm. Testing and validation.

April - May: Implementation of a clustering algorithm (if time permits).

May: Final report.

7. Deliverables

The following materials are expected to be delivered by the end of the academic year:

Weekly Report

Self Introduction
Project Proposal
First-Semester Progress Report
Mid-year Status Report
Final Report
Code for the Principal Component Analysis implementation
Code for the Laplacian Eigenmaps implementation
NCI-60 data set

References

[1] Jinlong Shi, Zhigang Luo. Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Computers in Biology and Medicine 40 (2010) 723-732.

[2] Mikhail Belkin, Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373-1396 (2003).

[3] Vinodh N. Rajapakse (2013). Data Representation for Learning and Information Fusion in Bioinformatics. Digital Repository at the University of Maryland, University of Maryland (College Park, Md.).

[4] Laurens van der Maaten, Delft University of Technology. Matlab Toolbox for Dimensionality Reduction (v0.8.1b), March 21, 2013.