Fast Calculation of Pairwise Mutual Information for Gene Regulatory Network Reconstruction


Peng Qiu, Andrew J. Gentles and Sylvia K. Plevritis

Department of Radiology, Stanford University, Stanford, CA

Abstract

We present a new software implementation to more efficiently compute the mutual information for all pairs of genes from gene expression microarrays. Computation of the mutual information is a necessary first step in various information theoretic approaches for reconstructing gene regulatory networks from microarray data. When the mutual information is estimated by kernel methods, computing the pairwise mutual information is quite time-consuming. Our implementation significantly reduces the computation time. For an example data set of 336 samples consisting of normal and malignant B-cells, with 9563 genes measured per sample, the currently available software for ARACNE requires 142 hours to compute the mutual information for all gene pairs, whereas our algorithm requires 1.6 hours. The increased efficiency of our algorithm improves the feasibility of applying mutual information based approaches to reconstructing large regulatory networks.

Keywords: mutual information, gene regulatory network, microarray.

1 Introduction

Information theoretic approaches are increasingly being used for reconstructing regulatory networks from gene expression microarray data. Examples include the Chow-Liu tree [1], the relevance network (RelNet) [2], the context likelihood of relatedness (CLR) algorithm [3], ARACNE [4] and the maximum relevance minimum redundancy network (MRNet) [5]. These approaches and others start by computing the pairwise mutual information (MI) for all possible pairs of genes, resulting in an MI matrix. The MI matrix is then manipulated to identify regulatory relationships. The Chow-Liu tree is a maximum spanning tree constructed from the MI matrix, where edges are weighted by the MI values between connected nodes. In relevance networks, an edge exists between a pair of genes if their MI exceeds a given threshold. The CLR algorithm transforms the MI matrix into scores that take into account the empirical distribution of the MI values, and then applies a threshold. ARACNE applies the data processing inequality (DPI) to each connected gene triple, removing potential false positive edges from the relevance network. In MRNet, the maximum relevance minimum redundancy (MRMR) criterion [6] is applied, where the maximum relevance criterion tends to assign edges to gene pairs that share large MI, while the minimum redundancy criterion controls false positives. The MRMR criterion is essentially a pairwise approximation of the conditional MI. In [7], the conditional MI is used explicitly for inference: if two genes share large MI but are conditionally independent given a third gene, no edge is placed between them.

These approaches have been applied successfully to simulated data and real microarray data, identifying interesting regulatory targets and pathways. The most time-consuming step in these approaches is computing the MI
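The DPI pruning step used by ARACNE can be sketched as follows. This is a minimal illustration in Python/NumPy, not ARACNE's actual implementation; the function name `dpi_prune`, the use of a precomputed MI matrix `mi`, and the tolerance parameter `eps` are assumptions made for the example.

```python
import numpy as np

def dpi_prune(mi, eps=0.0):
    """Apply the data processing inequality to an N x N symmetric MI matrix.

    For each gene triple, the weakest of the three edges is treated as an
    indirect interaction and removed (with an optional tolerance eps).
    Returns a copy of mi with pruned entries set to zero.
    """
    n = mi.shape[0]
    keep = mi > 0  # start from the relevance network (all positive-MI edges)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                # remove edge (i, j) if it is the weakest edge in the triple
                if keep[i, j] and mi[i, j] < min(mi[i, k], mi[j, k]) * (1 - eps):
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, mi, 0.0)
```

For a chain X-Y-Z in which X and Z interact only through Y, the indirect edge X-Z carries the smallest MI of the triple and is removed, while the two direct edges survive.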

matrix, because all possible pairs of genes need to be examined. When the expression values of genes are treated as continuous random variables and the MI is estimated by kernel methods, computing the pairwise MI can be computationally expensive. For example, on a recently analyzed B-cell data set consisting of 336 samples and 9563 genes per sample [4], ARACNE takes 142 hours to compute the MI for all gene pairs. We propose a faster way to calculate the pairwise MI, where the order of the calculations is rearranged so that repeated operations are significantly reduced. For the previously mentioned B-cell data set, the run time of our algorithm is 1.6 hours. To explain our algorithm, in Section 2 we first briefly review kernel based estimation of MI and then present the pseudo-code of our approach. Section 3 illustrates the decrease in computation time, and is followed by the Conclusion.

2 Method

2.1 Kernel Estimation of Mutual Information

The MI between two continuous random variables X (a gene) and Y (another gene) is defined as follows:

I(X; Y) = \int\!\!\int f(x, y) \log\left( \frac{f(x, y)}{f(x) f(y)} \right) dx\, dy    (1)

where f(x, y) is the joint probability density of the two random variables, and f(x) and f(y) are the marginal densities. Given M data points drawn from the joint probability distribution, (x_j, y_j), j = 1, ..., M, the joint and marginal densities can be estimated by the Gaussian kernel estimator [8]:

f(x, y) = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{2\pi h^2} \, e^{-\frac{1}{2h^2}\left((x - x_j)^2 + (y - y_j)^2\right)}    (2)
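To make the joint kernel estimate of Equation (2) concrete, here is a small Python/NumPy sketch (the paper's released code is Matlab; the function name `kernel_joint`, the evaluation grid and the bandwidth value below are illustrative assumptions):

```python
import numpy as np

def kernel_joint(x, y, xg, yg, h=0.3):
    """Gaussian kernel estimate of the joint density f(x, y) at a point
    (xg, yg): the average of 2-D Gaussian bumps of width h centered on
    the M observed sample pairs (x_j, y_j)."""
    d2 = (xg - x) ** 2 + (yg - y) ** 2        # squared distances to all samples
    return np.mean(np.exp(-d2 / (2 * h**2))) / (2 * np.pi * h**2)
```

The estimate is a proper density: it is nonnegative everywhere and integrates to approximately one over a sufficiently wide grid.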

f(x) = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{\sqrt{2\pi h^2}} \, e^{-\frac{1}{2h^2}(x - x_j)^2}    (3)

and f(y) takes a similar form. h is a tuning parameter that controls the width of the kernels. Since the MI is an integral with respect to the joint probability density, it can be written as an expectation with respect to the random variables X and Y, and thus can be approximated by the sample average:

I(X; Y) \approx \frac{1}{M} \sum_{i=1}^{M} \log \frac{M \sum_j e^{-\frac{1}{2h^2}\left((x_i - x_j)^2 + (y_i - y_j)^2\right)}}{\sum_j e^{-\frac{1}{2h^2}(x_i - x_j)^2} \; \sum_j e^{-\frac{1}{2h^2}(y_i - y_j)^2}}    (4)

2.2 Existing Method for Pairwise Mutual Information

When the pairwise MI among N random variables (N genes) is considered, Equation (4) can be calculated inside two nested for-loops, both of which run from 1 to N. The software implementations of ARACNE [4] and MRNet [5] both adopt this structure, as did our initial implementation. However, we found this structure to be inefficient when coupled with the kernel density estimator, because of repeated calculation of kernel distances. When estimating the MI between two random variables g_1 and g_2, Equation (4) evaluates the kernel distances between all possible pairs of data points for each of the two random variables separately, e^{-\frac{1}{2h^2}(g_{1i} - g_{1j})^2} and e^{-\frac{1}{2h^2}(g_{2i} - g_{2j})^2}, for all i, j = 1, 2, ..., M. When estimating the MI between g_1 and a third variable g_3, the kernel distances e^{-\frac{1}{2h^2}(g_{1i} - g_{1j})^2} are evaluated again. Therefore, when Equation (4) is computed inside two nested for-loops, the kernel distances among the data points of one random variable are evaluated N - 1 times. For the large N typical of gene expression microarray data, these repeated calculations severely decrease the computational efficiency.

Avoiding the repeated calculations can significantly improve the computational efficiency of ARACNE. (Note: the implementation of MRNet in the Bioconductor package does not suffer from this problem, because it does not use kernel density estimation.) A straightforward way to eliminate the repeats would be to precompute the kernel distances for each random variable (gene). However, such a method is not feasible, because a large amount of memory would be needed to store the kernel distances, a matrix of size N x M^2. In the following section, we present an alternative fast algorithm for calculating the pairwise MI, where repeats are significantly decreased at the cost of a moderate increase in memory use.

2.3 Fast Calculation of Pairwise Mutual Information

In practice, when we encounter computational structures such as nested for-loops, double summations and double integrals, switching the order of the procedures may sometimes yield a significant gain. For example, in [9], switching the order of a double summation significantly increased the computational efficiency of the intersection kernel support vector machine applied to computer vision. The essential idea of our fast algorithm is to switch the order of the nested for-loops, so that the repeated operations can be pushed out of some of them. The detailed pseudo-code is as follows, where the input data is stored in the variable data, with N rows (genes) and M columns (samples).

1:  for i = 1 : M do
2:    % kernel distance between the i-th data point and all other points j, for each gene k
3:    SumMargin = zeros(N, 1)
4:    for j = 1 : M do
5:      for k = 1 : N do
6:        Dist(k, j) = KernelDistance(data(k, j), data(k, i));
7:        SumMargin(k) = SumMargin(k) + Dist(k, j);
8:      end for
9:    end for
10:   % compute the i-th entry of the first sum in Eqn (4) for all pairs of genes k, l
11:   for k = 1 : N - 1 do
12:     for l = k + 1 : N do
13:       SumJoint = 0;
14:       for j = 1 : M do
15:         SumJoint = SumJoint + Dist(k, j) * Dist(l, j);
16:       end for
17:       MI(k, l) = MI(k, l) + log(SumJoint / (SumMargin(k) * SumMargin(l)));
18:     end for
19:   end for
20: end for

The above algorithm uses two strategies to improve the computational efficiency. The first is to pre-compute the marginal sums for each gene, which avoids repeated calculation of the marginal probabilities inside the double for-loop in lines 11-12. This strategy is also used in ARACNE. The second strategy is to make the first summation in Equation (4) the outermost for-loop, instead of the double for-loop running over all pairs of genes. With this strategy, the kernel distance between any two particular data points is evaluated only twice, significantly fewer than the N times it is evaluated in ARACNE. This strategy is not currently used in ARACNE and is the key difference between ARACNE and our proposed algorithm.

The computation time of the proposed algorithm is of complexity O(M^2 N^2), the same as ARACNE [4, 10]. However, since the proposed algorithm performs far fewer repeated calculations, its computation time is significantly lower than that of ARACNE. Simulation results are shown in Section 3. The gain in computational efficiency comes at the price of a small amount of additional memory, needed to store the variables Dist and

SumMargin. The additional memory use is approximately the same size as the input data matrix. In Matlab, some of the for-loops in the above pseudo-code can be written in more compact form using matrix multiplications. The Matlab code of our algorithm is available at http://icbp.stanford.edu/software/fastpairmi/.

3 Results

We compare the computation time for obtaining the pairwise MI using ARACNE's Matlab and Java implementations, downloaded from [4], and our fast algorithm written in Matlab. The tests are performed on a personal computer (PC) with a 2.66 GHz Intel Core 2 processor. In Figure 1(a), we fix the number of samples at M = 200, vary the number of genes N = 100, 200, ..., 1000, and compare the computation time of ARACNE and the proposed fast algorithm on a log-log scale. In Figure 1(b), we fix the number of genes at N = 1000 and vary the number of samples M = 50, 100, ..., 500. In Figures 1(a) and (b), the curves for ARACNE and our fast algorithm share the same slope, meaning that the computation time of both methods grows at the same rate as the data size increases. Although the computational complexity is the same, the significant reduction of repeated operations allows the proposed fast algorithm to achieve a 15- to 89-fold speedup over ARACNE, with greater gains for larger data sizes. For a more realistic example, we examine the B-cell gene expression data analyzed in [4], which contains 9563 genes across 336 samples. The Java implementation of ARACNE takes 142 hours to compute the MI for all gene pairs, whereas our fast algorithm takes 1.6 hours.
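The restructuring of Section 2.3 can be sketched in Python/NumPy and checked against a direct per-pair evaluation of Equation (4). The released code is Matlab; this translation, the function names, and the bandwidth value are illustrative assumptions, and the inner loops are vectorized with matrix operations in the spirit of the Matlab note above.

```python
import numpy as np

def naive_pairwise_mi(data, h=0.3):
    """Direct evaluation of Eq. (4) inside two nested gene loops; the kernel
    distances for a gene are recomputed for every pair it participates in
    (the structure identified as inefficient in Section 2.2)."""
    n, m = data.shape
    mi = np.zeros((n, n))
    for k in range(n):
        for l in range(k + 1, n):
            kx = np.exp(-(data[k, :, None] - data[k, None, :]) ** 2 / (2 * h**2))
            ky = np.exp(-(data[l, :, None] - data[l, None, :]) ** 2 / (2 * h**2))
            mi[k, l] = np.mean(np.log(m * (kx * ky).sum(1) / (kx.sum(1) * ky.sum(1))))
    return mi

def fast_pairwise_mi(data, h=0.3):
    """Restructured computation: the sample index i of Eq. (4) becomes the
    outermost loop, so each kernel distance is evaluated once per i and
    shared across all gene pairs."""
    n, m = data.shape
    mi = np.zeros((n, n))
    for i in range(m):
        # Dist(k, j): kernel distance between samples i and j for every gene k
        dist = np.exp(-(data - data[:, [i]]) ** 2 / (2 * h**2))   # N x M
        sum_margin = dist.sum(axis=1)                             # marginal sums
        sum_joint = dist @ dist.T                                 # joint sums, all pairs at once
        mi += np.log(m * sum_joint / np.outer(sum_margin, sum_margin))
    return np.triu(mi / m, k=1)
```

Both functions compute the same estimator term by term, so their outputs agree to machine precision; only the order of the loops, and hence the amount of repeated kernel-distance work, differs.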

4 Conclusion

When analyzing large gene expression microarray data sets with information theoretic approaches, fast methods are needed. We demonstrated a fast method for calculating the pairwise mutual information based on kernel estimation. Our method achieves significant gains in speed compared to existing implementations, with a modest increase in memory use. Moreover, our method is not limited to the analysis of gene expression data, and can be applied generally to speed up any algorithm that requires computing the mutual information matrix.

Acknowledgement

We gratefully acknowledge funding from the NCI Integrative Cancer Biology Program (ICBP), grant U56 CA112973.

Conflict of Interest

None declared.

References

[1] C. Chow, C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. on Information Theory, 14(3) (1968), 462-467.

[2] A.J. Butte, I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, Pacific Symposium on Biocomputing, 4 (2000), 418-429.

[3] J.J. Faith, B. Hayete, J.T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J.J. Collins, T.S. Gardner, Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles, PLoS Biology, 5(1) (2007), e8.

[4] A.A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, A. Califano, ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, BMC Bioinformatics, 7(Suppl 1) (2006), S7.

[5] P.E. Meyer, K. Kontos, F. Lafitte, G. Bontempi, Information-theoretic inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology, 2007(1) (2007), 9 pages.

[6] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational Biology, 3(2) (2005), 185-205.

[7] W. Zhao, E. Serpedin, E.R. Dougherty, Inferring connectivity of genetic regulatory networks using information-theoretic criteria, IEEE/ACM Trans. on Computational Biology and Bioinformatics, 5(2) (2008), 262-274.

[8] J. Beirlant, E. Dudewicz, L. Gyorfi, E. van der Meulen, Nonparametric entropy estimation: An overview, International Journal of Mathematical and Statistical Sciences, 6(1) (1997), 17-39.

[9] S. Maji, A.C. Berg, J. Malik, Classification using intersection kernel support vector machines is efficient, IEEE Computer Vision and Pattern Recognition (CVPR), (2008).

[10] A.A. Margolin, K. Wang, W.K. Lim, M. Kustagi, I. Nemenman, A. Califano, Reverse engineering cellular networks, Nature Protocols, 1(2) (2006), 662-671.

[Figure 1: two log-log plots of computation time (sec) for ARACNE Matlab, ARACNE Java, and our fast implementation; (a) computation time vs. number of genes N, (b) computation time vs. number of samples M.]

Figure 1: Comparison of computation time between ARACNE and our fast algorithm. In (a), the number of samples is fixed at 200 and the number of genes varies from 100 to 1000. In (b), the number of samples varies from 50 to 500, and the number of genes is fixed at 1000.