Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Ryan Atallah, John Ryan, David Aeschlimann

December 14, 2013

Abstract

In this project, we study the problem of classifying genes into biologically meaningful clusters by applying machine learning techniques to gene expression data. Our data consist of gene expression measurements from two types of mouse immunological cells (T-cells and granulocytes), together with data summarizing known biological pathways and genetic associations in mice. We use this information about previously studied genetic associations to improve the clustering assignments that would be made from the gene expression data alone.

1 Introduction

The applications of gene expression data are numerous, as information about the expression of individual genes can inform both academic understanding of genetic diseases and clinical treatment of patients. One interesting use of gene expression data is to cluster genes together in a biologically meaningful way. That is, given gene expression data, we wish to group genes into clusters that correspond to biological pathways in which the grouped genes code for proteins that work together at a functional level. This problem has been studied before, as in [1]. One may apply a k-means algorithm to a matrix of gene expression data to cluster the genes. When attempting to cluster genes, however, a great deal is often already known about genetic associations in biological pathways. The challenge we address in this paper is how to leverage this prior knowledge to improve the clustering predictions that would be made from the gene expression data alone.

2 Methods

We studied gene expression data extracted from two different immunological cell types in 39 different mice. mRNA levels for 25,134 genes were measured using microarrays. This information was stored in a 25,134 x 39 matrix $D = (d_{ij})$, where $d_{ij} \in [0, 15]$ is a measure of expression level (a higher value means a higher expression level). Further, we accumulated data summarizing 13,476 known biological pathways, stored in a 25,134 x 13,476 matrix $P = (p_{ij})$ defined by

$$p_{ij} = \begin{cases} n & \text{if gene } i \text{ is associated with pathway } j \text{ with strength } n \in \mathbb{N},\ n > 0, \\ 0 & \text{otherwise.} \end{cases}$$

We ran the k-means clustering algorithm on the matrix D for various values of k and used the elbow method to decide on an optimal k.
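For concreteness, the following is a minimal sketch of how the two matrices could be assembled; the file names, input layout, and use of NumPy/SciPy here are illustrative assumptions, not a description of our actual preprocessing.

```python
import numpy as np
from scipy.sparse import csr_matrix

N_GENES, N_MICE, N_PATHWAYS = 25134, 39, 13476

# Expression matrix D: one row per gene, one column per mouse,
# entries are expression levels in [0, 15].
D = np.loadtxt("expression_levels.csv", delimiter=",")        # hypothetical input file
assert D.shape == (N_GENES, N_MICE)

# Pathway matrix P: p_ij = n > 0 if gene i is associated with pathway j
# with strength n, and 0 otherwise.  Stored sparse, since most
# gene/pathway pairs are unassociated.
# Hypothetical input rows: gene_index, pathway_index, strength.
assoc = np.loadtxt("pathway_associations.csv", delimiter=",")
rows = assoc[:, 0].astype(int)
cols = assoc[:, 1].astype(int)
strength = assoc[:, 2]
P = csr_matrix((strength, (rows, cols)), shape=(N_GENES, N_PATHWAYS))
```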

2.1 Finding the optimal k

We theorized that there exists a correct clustering of the genes under consideration, and therefore that our chosen k should be the same for both the original data and the data augmented with the previously determined pathways. Our approach was to pick a value of k that worked well for both data sets, then compare the clusters produced by running k-means on each set to gauge the utility of augmenting gene expression data with pathway priors.

We picked an optimal k using the elbow method, which can be traced to speculation by Thorndike (1953) in [2] and describes a good choice of k for k-means. The method can be described as follows: for fixed $k \in \mathbb{Z}^+$, let $G$ denote the set of all training examples and $C_k$ the set of $k$ centroids determined by the k-means algorithm. For $x \in G$, define

$$\mu_x = \arg\min_{\mu \in C_k} \|x - \mu\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^n$. Then define the error of the k-means algorithm to be

$$\varepsilon_k = \sum_{x \in G} \|x - \mu_x\|^2.$$

Plot $\varepsilon_k$ as a function of $k$ and note where there are elbows in the plot, i.e., points that mark a sharp change in the slope of $\varepsilon_k$. An elbow point is a good choice of k.

We also performed k-means on the 25,134 x 13,515 matrix M formed by concatenating D and P. Applying the elbow method to the results of running k-means on both M and D, we picked an optimal k for clustering.

Figure 1: Example of the elbow method.

Then, we compared the results of k-means on M with the results of k-means on D by measuring the similarity between the clusters determined by the two runs.
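A short sketch of the elbow computation, assuming scikit-learn's KMeans for the clustering step; its inertia_ attribute is exactly the error $\varepsilon_k$ defined above.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_values, seed=0):
    """Return the k-means error eps_k = sum over x of ||x - mu_x||^2 for each k."""
    errors = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        errors.append(km.inertia_)   # inertia_ is exactly eps_k as defined above
    return np.array(errors)

# Usage (D and P as in the previous sketch; M is their column-wise concatenation):
#   M = np.hstack([D, P.toarray()])            # 25,134 x 13,515
#   eps_D = elbow_curve(D, range(10, 31))
#   eps_M = elbow_curve(M, range(10, 31))
# A sharp change of slope (an elbow) in both curves marks a good choice of k.
```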

2.2 Choosing the Optimal Weight

Our strategy for incorporating the prior data is to append the genetic pathways matrix P, multiplied by a scalar λ, to the gene expression data matrix D; the scalar λ controls how much weight we place on the information about known pathways. Each genetic pathway is thereby treated as an additional feature of a gene data point. The result is a matrix in $\mathbb{R}^{m \times n}$, where m = 25,134, n = a + b, a = 13,476, and b = 39. We constructed a finite set $S \subset \mathbb{R}$ of candidate values for λ, ran k-means on the concatenated matrices for each choice of $\lambda \in S$, and compared the results to find the optimal λ. To pick the reference value $\bar{\lambda}$ around which S is built, we scaled P so that it carries approximately equal weight relative to D. This can be computed by normalizing D to lie between 0 and 1 and scaling P as follows:

$$\bar{\lambda} = \frac{b}{a\left(\max_{p \in P} p\right)}$$

Then, we selected our set S of λ values to test by picking an even number of values greater than and smaller than $\bar{\lambda}$, such that the values increment evenly, starting from zero. Thus, for the set $S = \{\lambda_1, \lambda_2, \ldots, \lambda_r\}$, each candidate is

$$\lambda_i = \frac{b\left(\tfrac{i}{r}\right)}{2a\left(\max_{p \in P} p\right)}.$$

Then, to test each selection of λ, we used a two-fold cross-validation method. In a typical supervised learning problem, cross-validation tests a trained model's classifications on a held-out subset of the original data. For our unsupervised model we instead make a judgement about quality by using distance as the metric of comparison, rather than label classification: the choice of λ whose clusters fit randomly held-out test data most tightly is taken to be the optimal one. We conduct this analysis as follows:

1. Divide the matrices D and P into training and test (validation) matrices by randomly splitting the genes into two groups $D_t$, $D_v$, $P_t$, and $P_v$, such that $D_t$ and $P_t$ contain 70% of the rows of D and P, and $D_v$ and $P_v$ contain the remaining 30%.

2. For each value $\lambda_i \in S$, compute the horizontal concatenations $X^i_t = [D_t \;\; \lambda_i P_t]$ and $X^i_v = [D_v \;\; \lambda_i P_v]$.

3. Run the k-means clustering algorithm on $X^i_t$ with the optimal k from Section 2.1 to produce the set $C^i$ of centroid locations.

4. Evaluate the clustering model on the remaining 30% validation data. For each row $x_j$ of $X^i_v$, assign a cluster by finding the closest centroid $c_j$ in $C^i$:

$$c_j = \arg\min_{c \in C^i} \|x_j - c\|.$$

The error value $\varepsilon_i$ is computed by summing the distances between each point $x_j$ and its corresponding cluster centroid $c_j$, weighted by cluster size:

$$\varepsilon_i = \sum_{j=1}^{m_v} \frac{1}{d_j} \|x_j - c_j\|,$$

where $m_v$ is the number of validation points and $d_j$ is the number of points assigned to the cluster containing $x_j$.

5. Pick the $\lambda_i$ with the lowest validation error $\varepsilon_i$.

To smooth out inconsistencies and make our conclusion more generalizable, we ran this algorithm multiple times and compared the averages of the $\varepsilon_i$ values. This reduces the effects of the random initial assignment of cluster centroids in the k-means algorithm and of local optima associated with the particular training split $X^i_t$.
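A sketch of this validation loop, again assuming scikit-learn's KMeans; the helper below and its argument names are illustrative, and k_opt stands for the k chosen in Section 2.1.

```python
import numpy as np
from sklearn.cluster import KMeans

def validation_error(D, P, lam, k_opt, train_frac=0.7, seed=0):
    """Train clusters on 70% of the genes with weight lam on the pathway
    features, then score how tightly the held-out 30% fit those clusters."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D.shape[0])
    split = int(train_frac * D.shape[0])
    train, val = perm[:split], perm[split:]

    P_dense = P.toarray() if hasattr(P, "toarray") else P
    X_t = np.hstack([D[train], lam * P_dense[train]])   # X_t^i = [D_t  lam*P_t]
    X_v = np.hstack([D[val],   lam * P_dense[val]])     # X_v^i = [D_v  lam*P_v]

    km = KMeans(n_clusters=k_opt, n_init=10, random_state=seed).fit(X_t)
    labels = km.predict(X_v)                       # closest centroid c_j for each x_j
    dists = np.linalg.norm(X_v - km.cluster_centers_[labels], axis=1)
    counts = np.bincount(labels, minlength=k_opt)  # d_j: size of x_j's cluster
    return np.sum(dists / counts[labels])          # eps_i = sum_j (1/d_j)||x_j - c_j||

# Averaging over several random splits, then taking the argmin over the
# candidate set S (here the hypothetical list `lambdas`):
#   errors = [np.mean([validation_error(D, P, lam, k_opt=24, seed=s)
#                      for s in range(10)]) for lam in lambdas]
#   best_lam = lambdas[int(np.argmin(errors))]
```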

3 Data

Because it is extremely computationally expensive to run the k-means algorithm on our entire data set, with thousands of features and training examples, we chose to run k-means only on the first 1,000 genes as a proof of concept of our method.

We first ran the k-means algorithm on the gene expression data D for varying values of k and plotted the error $\varepsilon_k$ to locate the elbow points that mark the optimal choice of k. We assumed that the optimal choice of k would be close to the value $k^* = \sqrt{n/2}$ suggested by Mardia et al. in [3], and chose values of k deviating from $k^*$ by about $\pm 10$; thus, our choices of k ranged from 10 to 30. This produced the graph shown in Figure 2a. We repeated the same procedure with the gene expression data augmented with pathway priors, M. The plot of the k-means error for each value of k on M is shown in Figure 2b.

Figure 2: The elbow method for determining an optimal k. (a) Error of k-means with varying k on the expression data D. (b) Error of k-means with varying k on the augmented data M.

From this analysis, we determined that the optimal choice of k was approximately 24, since in both Figures 2a and 2b there is a sharp change in the slope of the k-means error curve around k = 24. We then compared the results of k-means on M with the results of k-means on D, using k = 24 for both runs, by measuring the similarity between the clusters they produced, and found that k = 24 gave a reasonable error.

We then ran our two-fold cross-validation algorithm to find the optimal choice of λ for the concatenation of D and P. We first trained a cluster model on 70% of our original data set, then tested the resulting clusters on the remaining 30% to produce an error value for each $\lambda_i$, as described in Section 2.2. After running our validation algorithm ten times and taking the average of each trial's $\varepsilon_i$ values, we identified the optimal error $\varepsilon^* = \varepsilon_5$, as shown in Figure 3. Thus, we selected the optimal scalar

$$\lambda^* = \lambda_5 = \frac{5\left(\tfrac{b}{r}\right)}{2a\left(\max_{p \in P} p\right)} \approx 0.0012.$$

Figure 3: Validation error of each $\lambda_i$, averaged over 10 trials.

4 Further Research

The analysis we have done thus far has been limited by the computational load of running our algorithms on large data sets. Our next steps are to run our analysis on all 25,134 genes over a wider range of k values to obtain a more accurate clustering and a better choice of k. Once we have this choice of k, we would evaluate it with the following method (a sketch of this matching procedure follows the list):

1. Enumerate all possible pairings between the clusters trained on the data and the clusters trained on the data augmented with priors.

2. Sort these pairings by the number of points shared by the paired clusters.

3. Accept the pairings in sorted order; if a candidate pairing involves a cluster that has already been matched, reject it and move on.

4. Let c be the total number of points shared across the accepted cluster pairs. Subtract this value from the total number of points, k, and divide by the total number of points:

$$\varepsilon = \frac{k - c}{k}.$$
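One way to read this matching procedure as code is sketched below, under the assumption that the two runs are represented by their cluster label vectors over the same genes; the function name and the use of NumPy are illustrative.

```python
import numpy as np

def cluster_agreement_error(labels_a, labels_b, k):
    """Greedily match clusters from two k-means runs on the same points one-to-one,
    then return eps = (total points - points shared by matched pairs) / total points."""
    n = len(labels_a)
    # overlap[i, j] = number of points in cluster i of run A and cluster j of run B
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1

    # Sort all (i, j) pairs by shared points and accept greedily, skipping any
    # pair whose clusters are already matched.
    pairs = sorted(((overlap[i, j], i, j) for i in range(k) for j in range(k)),
                   reverse=True)
    used_a, used_b, shared = set(), set(), 0
    for count, i, j in pairs:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        shared += count

    return (n - shared) / n

# Usage: labels from k-means on D and on the augmented matrix M (same gene order).
#   err = cluster_agreement_error(km_D.labels_, km_M.labels_, k=24)
```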

Furthermore, we would like to extend our methodology to explore more strategies for incorporating the prior data into our gene expression data for clustering analysis. Thus far, we have only tried concatenating the pathways data onto our gene expression data with a single scalar weight; we would like to explore other strategies for weighting the pathways data.

5 References

[1] D. B. Fogel, "An introduction to simulated evolutionary optimization," IEEE Trans. Neural Networks, vol. 5, no. 1, pp. 3-14, 1994.

[2] R. L. Thorndike, "Who belongs in the family?," Psychometrika, vol. 18, no. 4, pp. 267-276, December 1953.

[3] K. Mardia et al., Multivariate Analysis. Academic Press, 1979.