Linear and Non-linear Dimensionality Reduction Applied to Gene Expression Data of Cancer Tissue Samples

Franck Olivier Ndjakou Njeunje
Applied Mathematics, Statistics, and Scientific Computation
University of Maryland - College Park
fndjakou@math.umd.edu

Advisers:
Wojtek Czaja
John J. Benedetto
Norbert Wiener Center for Harmonic Analysis
Department of Mathematics
University of Maryland - College Park

October 18, 2014

Abstract

In computational biology and medicine, gene expression data are a very useful and important piece of the puzzle, as they are one of the main sources from which gene function and various disease mechanisms are derived. Unfortunately, the analysis and visualization of gene expression data are not easy tasks, because of the high dimensionality of the data generated by high-density microarrays. In this project, I will be interested in two methods developed to carry out the task of dimensionality reduction on the data at hand, so that the data become better suited for further analysis. In particular, I will look at implementations of Laplacian Eigenmaps and Principal Component Analysis as pre-processing dimensionality reduction methods, and see how they compare when their output is fed to similarity learning algorithms such as clustering.

Notation

x         a gene, i.e., a row of dimension M in the matrix X, unless otherwise stated
X         gene expression matrix of dimension N × M
X̃         standardized version of the matrix X
M and N   dimensions of the matrix X
y         reduced-dimension data of dimension m
x̄_i       mean of the vector x_i
σ_ii      variance of the vector x_i
C         covariance matrix
Λ         diagonal matrix containing the eigenvalues λ_i of the covariance matrix
U         matrix containing the eigenvectors u_i of the covariance matrix
W         weight matrix
L         Laplacian matrix
D         diagonal (degree) matrix
u_i, f_i  eigenvectors
λ_i       eigenvalues
m_j       means, for 1 ≤ j ≤ k, where k is the number of means
S_i^(t)   sets or clusters

1. Background and Motivation

1.1. Gene Expression Data

Gene expression data are information that numerically represent the expression levels of a set of genes under various environmental factors. These environmental factors could be of natural cause, such as the effect of cancer or any other disease on a set of genes, or they could be reactions to drugs or medicines taken to fight said diseases. The data are usually given in matrix form; let us call this matrix X. To obtain the gene expression matrix X, a high-density microarray is used to numerically determine the level of expression of a set of genes over multiple samples or observations. The matrix X is of dimension (N × M), where the number of genes is given by the variable N and the number of samples is given by the variable M.

Due to the usefulness of gene expression data, a wide range of algorithms have been developed to study the biological network provided by high-density microarrays. The main ones are classification and clustering techniques. It has been shown that classification of gene expression data can help distinguish between various cancer classes, while clustering techniques can help separate tumor tissue from healthy normal tissue. Unfortunately, the number of observations or samples, M, is in general very high, which makes it difficult to visualize the results of the similarity learning analysis. Therefore, in order to determine the structure of these data in the hope of extracting more information from them, whether to classify genes of the same kind based on their expression or to visually separate healthy tissue from unhealthy tissue, a dimensionality reduction algorithm is necessary as a pre-processing step.

1.2. Dimension Reduction

By taking a closer look at the data in figure 1, we can notice that within each expression array x, across the multiple samples, a lot of redundancy can be found. This gives us an opportunity to pre-process the data in order to retain only the most pertinent information. The methods used in this part of the analysis are known as dimensionality reduction techniques, and this is where I will be focusing throughout this year-long project. Given an array x of dimension M, the goal is to reduce this array to an m-dimensional array y such that m is very small compared to M, while retaining the most important information about the array across all the samples.

There are two classes of dimensionality reduction techniques: linear (LDR) and non-linear (NDR). The linear techniques assume a linear relationship between the data and perform quite well under these circumstances. The problem is that most data arising from gene expression do not entirely maintain a linear relationship, and to remedy this, non-linear methods have been developed. Their advantage is that non-linear methods aim to keep the intrinsic, or natural, geometric structure between the variables or data points. After this step is completed, a similarity learning analysis known as clustering is applied to the data in order to acquire more information about the gene mechanisms.

Figure 1: Dimensionality reduction illustration on a single gene expression across M samples.

1.3. Clustering

After a dimensionality reduction analysis has been performed on the data, a clustering analysis will follow and allow us to get a visual sense of the data at hand. The goal of clustering is to group elements of a set into separate subgroups, called clusters, in such a way that elements in the same cluster are, in one way or another, more similar to each other than to elements in other clusters. In practice, different clustering methods perform differently according to the nature of the data they are applied to.

2. Approach

For this project I will be interested in Principal Component Analysis, also known as PCA, as my linear dimensionality reduction method; it is the most common linear dimensionality reduction method used in the analysis of gene expression data. I will also look at Laplacian Eigenmaps, abbreviated LE, as my non-linear dimensionality reduction method. To perform similarity learning, I will be interested in how the outputs of the dimensionality reduction algorithms listed above compare when using hierarchical clustering and K-means clustering. The subsections below give a better understanding of how these methods operate mathematically.

2.1. Principal Component Analysis [1]

PCA is a linear dimension reduction algorithm: a statistical technique for handling multivariate data that makes use of the Euclidean distance to estimate a lower-dimensional representation of the data. While this method sometimes fails at preserving the intrinsic structure of the data (when the data have a non-linear structure), it does a good job of preserving most of the variance in the data.

The algorithm for this method can be viewed as three steps:

Step 1: Given the initial matrix X representing the set of data, construct the standardized matrix X̃ by making sure that each sample column has zero mean and unit variance:

    X̃ = (x̃_1, x̃_2, ..., x̃_M)                                                      (1)
      = ((x_1 - x̄_1)/√σ_11, (x_2 - x̄_2)/√σ_22, ..., (x_M - x̄_M)/√σ_MM).           (2)

Here, x̄_1, x̄_2, ..., x̄_M and σ_11, σ_22, ..., σ_MM are respectively the mean values and the variances of the corresponding column vectors.

Step 2: Compute the covariance matrix of X̃, then perform a spectral decomposition to get the eigenvalues and their corresponding eigenvectors:

    C = X̃^T X̃ = U Λ U^T.                                                           (3)

Here Λ = diag(λ_1, λ_2, ..., λ_M) with λ_1 ≥ λ_2 ≥ ... ≥ λ_M, and U = (u_1, u_2, ..., u_M); λ_i and u_i are respectively the i-th eigenvalue and corresponding eigenvector of the covariance matrix C.

Step 3: Given that we would like the target lower-dimensional space to be of dimension m, the i-th principal component can be computed as X̃u_i, and the reduced-dimensional (N × m) subspace is X̃U_m [1].

Notice from Step 3 that each principal component making up the reduced-dimensional subspace is just a linear combination of the raw variables.
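To make the three steps concrete, the following is a minimal MATLAB sketch of the procedure above. The function name, variable names, and the normalization of the covariance matrix by N - 1 are my own choices and not taken from the proposal; a full implementation would also need to handle zero-variance columns and very large matrices.

```matlab
% Minimal PCA sketch following Steps 1-3 above (illustrative only).
% X is assumed to be an N-by-M matrix (rows = genes, columns = samples);
% m is the target dimension, with m much smaller than M.
function Y = pca_sketch(X, m)
    N = size(X, 1);

    % Step 1: standardize each sample column to zero mean and unit variance.
    Xs = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));

    % Step 2: covariance matrix and its spectral decomposition.
    C = (Xs' * Xs) / (N - 1);                 % M-by-M covariance matrix
    [U, Lambda] = eig(C);                     % eigenvectors and eigenvalues
    [~, order] = sort(diag(Lambda), 'descend');
    U = U(:, order);                          % sort by decreasing eigenvalue

    % Step 3: project onto the first m principal directions.
    Y = Xs * U(:, 1:m);                       % N-by-m reduced representation
end
```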

2.2. Laplacian Eigenmaps [2]

This method has its advantages since it is a non-linear approach to dimension reduction. It aims to preserve the intrinsic, or natural, geometric structure of the manifold when passing from the high dimension to the lower dimension. This method can also be summarized in three steps:

Step 1: Given a set of N points or nodes x_1, x_2, ..., x_N in a high-dimensional space R^M, construct a weighted graph with N nodes. Constructing the graph is as simple as putting an edge between nodes that are close enough to each other. In doing this, one might consider the ε-neighborhood technique, where two nodes are connected if their squared Euclidean distance is less than ε, and not connected otherwise. This might sometimes lead to graphs with several connected components, or even to disconnected graphs. An alternative is the k-nearest-neighbor technique, where each node is connected to its k nearest neighbors. Both techniques yield a symmetric relationship.

Step 2: Choose the weights for the edges and construct the weight matrix W. This could be as simple as putting a 1 between two connected nodes and a 0 otherwise (if the nodes are not connected). One could also choose, as the weight, a function of the Euclidean distance between two connected nodes, and 0 otherwise.

Step 3: For each connected sub-graph, solve the following generalized eigenvector problem,

    Lf = λDf,                                                                      (4)

where D_ii = Σ_j W_ji is the diagonal (degree) matrix and L = D - W is the Laplacian matrix. Let f_0, f_1, ..., f_(N-1) be the solutions of (4) with corresponding eigenvalues λ_0, λ_1, ..., λ_(N-1), such that Lf_i = λ_i Df_i for i going from 0 to N-1 and 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_(N-1). Then the m-dimensional Euclidean space embedding is given by

    x_i → y_i = (f_1(i), ..., f_m(i)).                                             (5)
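A minimal MATLAB sketch of these three steps is given below, under the assumption of a k-nearest-neighbor graph with simple 0/1 weights and a single connected component. The function and variable names are placeholders, and the dense distance matrix and full generalized eigendecomposition are only practical for moderate N.

```matlab
% Minimal Laplacian Eigenmaps sketch following Steps 1-3 above (illustrative).
% X is an N-by-M data matrix (one point per row), k the number of neighbors,
% m the target dimension. Assumes distinct points and a connected graph.
function Y = le_sketch(X, k, m)
    N = size(X, 1);

    % Step 1: squared Euclidean distances and k-nearest-neighbor adjacency.
    sq = sum(X.^2, 2);
    D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');   % N-by-N squared distances
    W = zeros(N);
    for i = 1:N
        [~, idx] = sort(D2(i, :), 'ascend');
        W(i, idx(2:k+1)) = 1;                     % skip idx(1), the point itself
    end

    % Step 2: symmetrize to obtain simple 0/1 edge weights.
    W = max(W, W');

    % Step 3: degree matrix, Laplacian, and generalized problem Lf = lambda*D*f.
    D = diag(sum(W, 2));                          % degree matrix (positive if no isolated nodes)
    L = D - W;                                    % graph Laplacian
    [F, Lam] = eig(L, D);                         % real spectrum for symmetric L, positive definite D
    [~, order] = sort(diag(Lam), 'ascend');
    F = F(:, order);

    Y = F(:, 2:m+1);                              % drop the trivial f_0, keep f_1 ... f_m
end
```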

2.3. Hierarchical Clustering

Next I will consider a couple of clustering methods, starting with hierarchical clustering (HC). This is a connectivity-based algorithm: the idea is that nodes that are closer to each other are more related than those that are farther apart. There are two ways of implementing HC. One can take a bottom-up approach (of order O(n^3)), where each data point starts in its own cluster and, as we move on, pairs of clusters are merged together; see figure 2. Otherwise, one can take a top-down approach (of order O(2^n), mostly due to the search algorithm), where we start with one big cluster and splits are performed recursively as we move further down the hierarchy.

In order to proceed we then need to decide on a metric, a way to measure the distance between two observations, and a linkage criterion, a function of the pairwise distances between observations in the sets that outputs the degree of similarity between sets (this function lets us know whether or not two sets should be merged). Here are some commonly used metrics and linkage criteria.

Examples of metrics:

    Euclidean distance:    ||a - b||_2 = √( Σ_i (a_i - b_i)^2 )                    (6)

    Manhattan distance:    ||a - b||_1 = Σ_i |a_i - b_i|                           (7)

Examples of linkage criteria:

    Maximum or CLINK (complete-linkage clustering):   max{ d(a, b) : a ∈ A, b ∈ B }                  (8)

    Minimum or SLINK (single-linkage clustering):     min{ d(a, b) : a ∈ A, b ∈ B }                  (9)

    Mean or average-linkage clustering:               (1 / (|A| |B|)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)    (10)

Figure 2: Bottom-up hierarchical clustering illustration.
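For reference, the bottom-up approach can be run with MATLAB's built-in routines, which is what the project intends to use for the clustering comparison; this sketch assumes the Statistics Toolbox is available, and the reduced data matrix Y, the complete-linkage choice, and the cut into 4 clusters are placeholders rather than choices made in the proposal.

```matlab
% Bottom-up hierarchical clustering of a reduced data set Y (N-by-m),
% e.g. the output of PCA or Laplacian Eigenmaps. Requires the Statistics Toolbox.
dists  = pdist(Y, 'euclidean');          % pairwise distances, as in eq. (6)
tree   = linkage(dists, 'complete');     % CLINK, eq. (8); 'single' or 'average' also work
labels = cluster(tree, 'maxclust', 4);   % cut the hierarchy into 4 clusters (placeholder)
dendrogram(tree);                        % visualize the merge hierarchy
```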

2.4. K-means Clustering

The idea here is to randomly select an initial set of K means; these could be random vectors selected either within or outside your data set. This selection is followed by an assignment step, in which each individual data point is assigned to the nearest mean according to a well-defined metric (the squared Euclidean distance). After this step is done, the mean of each of the clusters formed is updated to the mean of the data in that cluster. The two previous steps are repeated until no new assignment is made, that is, until the clusters remain the same before and after an assignment step. This method is NP-hard (non-deterministic polynomial-time hard) and can be summarized as follows:

Initialize a set of k means m_1^(1), m_2^(1), ..., m_k^(1).

Assignment step: assign each observation x_p to exactly one set S_i containing the nearest mean to x_p,

    S_i^(t) = { x_p : ||x_p - m_i^(t)||^2 ≤ ||x_p - m_j^(t)||^2 for all j, 1 ≤ j ≤ k }.    (11)

Update step: update the mean within each cluster,

    m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j.                                     (12)

Repeat the two previous steps.

Stop when no new assignments are made.

See figure 3 for an illustration of these steps.

Figure 3: K-means clustering illustration.
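The loop below is a bare-bones MATLAB sketch of the assignment and update steps in equations (11) and (12). The function name and the random initialization are my own choices, empty clusters are not handled, and MATLAB's built-in kmeans function, which the project will actually use for comparison, provides a more robust implementation of the same idea.

```matlab
% Bare-bones k-means iteration mirroring equations (11) and (12) (illustrative).
% Y is an N-by-m data matrix, k the number of clusters.
function [labels, means] = kmeans_sketch(Y, k)
    N = size(Y, 1);
    means = Y(randperm(N, k), :);             % initialize the k means from the data
    labels = zeros(N, 1);
    while true
        % Assignment step, eq. (11): nearest mean in squared Euclidean distance.
        d2 = bsxfun(@plus, sum(Y.^2, 2), sum(means.^2, 2)') - 2 * (Y * means');
        [~, newLabels] = min(d2, [], 2);
        if isequal(newLabels, labels)         % stop: no assignment changed
            break
        end
        labels = newLabels;
        % Update step, eq. (12): each mean becomes the average of its cluster.
        for j = 1:k
            means(j, :) = mean(Y(labels == j, :), 1);
        end
    end
end
```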

3. The Data

The NCI-60 data I will be working with consist of microarray expression levels for close to 22,000 genes across 60 different cancer cell lines. I plan on working with the traditional gene expressions across these 60 cancer cell lines, without the presence of drugs. Although there will be no drugs in the samples, the presence of the cancer stimulus is still enough to make this analysis meaningful and interesting. These data are available for download through the CellMiner database on the NCI website.

4. Implementation and Validation Methods

4.1. Software and Hardware

The two dimension reduction algorithms described above will be implemented using Matlab as the mathematical tool. This decision is due to the superior ability of Matlab to deal with matrix operations. Another reason is the wide range of toolboxes available to bring this project to completion in a timely manner; the toolboxes will provide us with test data and prior implementations of PCA and LE for validation and benchmarking, respectively. I will be using my personal laptop with 8 GB of memory to run simulations on smaller data sets, and the Norbert Wiener Center lab machines, with 128 GB of memory, for larger data sets if needed. Clustering algorithms (K-means and hierarchical) built into Matlab will be used as comparison tools for the implemented dimension reduction algorithms.

4.2. Validation Methods

We will take advantage of the DRtoolbox [4], which contains implementations of the Principal Component Analysis and Laplacian Eigenmaps methods described above. The DRtoolbox also contains a number of well-understood data sets in 3-dimensional space, with corresponding representations in 2-dimensional space, for testing and validating the dimensionality reduction methods implemented for this project. Some examples of those data sets, courtesy of the DRtoolbox [4], include the following:

The Swiss Roll dataset, shown in figure 4:

    F : (x, y) → (x cos(x), y, x sin(x))                                           (13)

Figure 4: 3-dimensional presentation of the Swiss Roll data.

The Twin Peaks dataset, shown in figure 5:

    f(x, y) = x^4 + 2x^2 + 4y^2 + 8x                                               (14)

Figure 5: 3-dimensional presentation of the Twin Peaks data.
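As a sketch of how this validation might proceed, the fragment below builds a Swiss Roll point cloud from equation (13) and compares the 2-dimensional embeddings produced by the two sketches given earlier; the sample size and parameter ranges are common choices, not values taken from the proposal, and the DRtoolbox's own generators and reference implementations would be used for the actual benchmarking.

```matlab
% Sketch of the validation idea: embed a synthetic Swiss Roll (eq. 13) with both
% methods and compare the 2-D results visually. Parameter ranges are assumed.
n = 1000;
t = 3 * pi / 2 * (1 + 2 * rand(n, 1));     % angle parameter along the roll
h = 20 * rand(n, 1);                       % height of the roll
swiss = [t .* cos(t), h, t .* sin(t)];     % F : (x, y) -> (x cos x, y, x sin x)

Ypca = pca_sketch(swiss, 2);               % linear embedding (sketch from Sec. 2.1)
Yle  = le_sketch(swiss, 12, 2);            % non-linear embedding (sketch from Sec. 2.2)

subplot(1, 2, 1); scatter(Ypca(:, 1), Ypca(:, 2), 10, t, 'filled'); title('PCA');
subplot(1, 2, 2); scatter(Yle(:, 1), Yle(:, 2), 10, t, 'filled'); title('Laplacian Eigenmaps');
```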

5. Results

At the end of this project we expect to see better overall performance from the Laplacian Eigenmaps method than from Principal Component Analysis. This means that the clusters obtained from the output of LE should be visually more significant than those coming from PCA. In addition, the clustering algorithms should produce more consistent results from the output of LE than from that of PCA.

6. Timeline

Throughout the year I intend to follow the timeline below to completion.

October - November: Implementation of the PCA algorithm. Resolve issues that come up (storage and memory). Testing and validation.

December: Mid-year presentation.

January: First-semester progress report.

February - April: Implementation of the LE algorithm. Testing and validation.

April - May: Implementation of a clustering algorithm (if time permits).

May: Final report.

7. Deliverables

The following materials are expected to be delivered by the end of the academic year:

Weekly Report

Self Introduction
Project Proposal
First-Semester Progress Report
Mid-year Status Report
Final Report
Code for the Principal Component Analysis implementation
Code for the Laplacian Eigenmaps implementation
NCI-60 data set

References

[1] Jinlong Shi, Zhigang Luo. Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Computers in Biology and Medicine 40 (2010) 723-732.

[2] Mikhail Belkin, Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373-1396 (2003).

[3] Vinodh N. Rajapakse (2013). Data Representation for Learning and Information Fusion in Bioinformatics. Digital Repository at the University of Maryland, University of Maryland (College Park, Md.).

[4] Laurens van der Maaten, Delft University of Technology. Matlab Toolbox for Dimensionality Reduction (v0.8.1b), March 21, 2013.