miRsig: a consensus-based network inference methodology to identify pan-cancer miRNA-miRNA interaction signatures


SUPPLEMENTARY FILE - S1

miRsig: a consensus-based network inference methodology to identify pan-cancer miRNA-miRNA interaction signatures

Joseph J. Nalluri 1,*, Debmalya Barh 2, Vasco Azevedo 3 and Preetam Ghosh 1

1. Department of Computer Science, School of Engineering, Virginia Commonwealth University, Richmond, Virginia, USA
2. Center for Genomics and Applied Gene Technology, Institute of Integrative Omics and Applied Biotechnology, Purba Medinipur, West Bengal, India
3. Laboratório de Genética Celular e Molecular, Departamento de Biologia Geral, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais, Pampulha, Belo Horizonte, Minas Gerais, Brazil

*Corresponding author: nallurijj@mymail.vcu.edu

Details of the Network Inference Algorithms used in this work

1. CLR

The context likelihood of relatedness (CLR) 1 algorithm belongs to the class of relevance networks. Relevance-network algorithms use mutual information (MI)-based Z-scores between a regulator-target pair to identify potential interactions: if the MI score computed over the expression dataset is above a certain threshold, an interaction between the pair is considered highly likely. CLR has previously proven effective in learning novel transcriptional interactions in E. coli. 1 CLR uses MI to gauge the similarity between the expression profiles of two entities, in our case disease-specific miRNAs. The MI is calculated as:

I(X; Y) = Σ_{i,j} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]

where X and Y are random variables, i.e. two miRNAs in this case, and p(x_i) denotes the probability of X = x_i. The MI value between miRNA_i and miRNA_j is computed and then assessed for its likelihood of occurrence by comparing the score against a background null model, namely the distribution of MI values. The null model incorporates two sets of MI values: MI_i, the set of all of miRNA_i's MI values, and MI_j, the set of all MI values of miRNA_j. These two sets, MI_i and MI_j, are treated as independent variables in the computation of the joint distribution of the null model, i.e. the background MI. The MI score of the pair (miRNA_i, miRNA_j) is thus compared with the two Z-scores resulting from MI_i and MI_j. A further in-depth explanation of the algorithm can be found in 1. However, CLR relies predominantly on the MI matrix for its scores and cannot ascertain causality between regulators, which is more often inferred from the regulatory kinetics exhibited in time-series data. 2 The CLR implementation available in the minet R/Bioconductor package 3 was used in this work.
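To illustrate the relevance-network idea behind CLR, the sketch below estimates pairwise MI by histogram binning and turns each pairwise score into a background-corrected Z-score. This is a minimal NumPy sketch of the scoring scheme, not the minet/CLR implementation used in this work; the binning estimator, the bin count, and the way the two Z-scores are combined are simplifying assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate I(X;Y) for two expression vectors via histogram binning
    (an illustrative estimator; minet offers several alternatives)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint distribution p(x_i, y_j)
    px = pxy.sum(axis=1, keepdims=True)  # marginal p(x_i)
    py = pxy.sum(axis=0, keepdims=True)  # marginal p(y_j)
    nz = pxy > 0                         # skip empty bins to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def clr_scores(expr, bins=10):
    """CLR-style scoring: compare each pairwise MI value against the
    background MI distributions of both variables involved."""
    n = expr.shape[0]                    # rows = miRNAs, columns = samples
    mim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mim[i, j] = mim[j, i] = mutual_information(expr[i], expr[j], bins)
    mu, sd = mim.mean(axis=1), mim.std(axis=1)
    sd[sd == 0] = 1.0
    # Z-score of each MI value within its row's background distribution,
    # clipped at zero as in the original CLR formulation.
    z = np.maximum((mim - mu[:, None]) / sd[:, None], 0.0)
    return np.sqrt(z ** 2 + z.T ** 2)    # combine the two Z-scores per pair

# Toy data: miRNAs 0 and 1 co-vary strongly, miRNA 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
expr = np.vstack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])
scores = clr_scores(expr)
```

On this toy input the correlated pair (0, 1) receives a markedly higher score than the unrelated pairs, which is the behavior the Z-score background correction is designed to produce.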

2. GENIE3

The Gene Network Inference with Ensemble of Trees (GENIE3) 4 algorithm was the top-performing algorithm in the DREAM4 In Silico Network Challenge 5 for inferring gene regulatory networks. Unlike the mutual-information-based CLR, GENIE3 infers associations through feature selection: a regression problem is formulated for each individual miRNA. In our case, 4343 regression problems were solved, one per disease-specific miRNA. In each regression analysis, a disease-specific miRNA's expression pattern is predicted from every other miRNA's expression pattern using Random Forests 6, a tree-based ensemble model. A regulatory link between two disease-specific miRNAs is predicted based on the significance of the candidate miRNA's expression pattern in reducing the output variance of the target miRNA's prediction. In this way, the algorithm ranks all interactions between the miRNAs by their aggregated scores from the regression analyses. GENIE3 is adaptable to other categories of expression data involving interactions. The GENIE3 implementation hosted on GenePattern 7 was executed with standard run conditions: Random Forests as the tree-based method and 500 trees grown per ensemble.

3. Basic Correlation

The Basic Correlation method ranks every disease-specific miRNA-miRNA pair according to the correlation between them. This algorithm uses Pearson's and Spearman's coefficients to calculate the correlation score.

Pearson's coefficient:

ρ_{X,Y} = corr(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

where X and Y are random variables, with μ_X and μ_Y being their expected values and σ_X and σ_Y their standard deviations.

The Spearman correlation coefficient:

ρ = 1 − (6 Σ_i d_i²) / (n(n² − 1))

where d_i is the difference between the ranks of the corresponding values X_i and Y_i, and n is the number of points in the dataset. When running the Basic Correlation algorithm hosted on GenePattern 7, the ranks of the disease-specific miRNA-miRNA pairs were derived using both Pearson's and Spearman's correlation coefficients. Note that this approach therefore produces two result files: one with Pearson's correlation coefficient and the other with Spearman's.

4. MRNETB

MRNETB 8 is a mutual-information-based network inference algorithm that improves upon its predecessor, MRNET. MRNET performs network inference using the Maximum Relevance Minimum Redundancy (MRMR) method. For every target variable X_j, candidate predictor variables X_i are scored by the difference between their mutual information with X_j (relevance) and their mutual information with the already-selected predictor set S_j (redundancy). The method follows forward selection: the variable X_i with the highest MI score with the target variable X_j is chosen first. The general idea behind the algorithm is thus to identify, for every variable, a subset of predictors with maximum pairwise relevance to the target and maximum pairwise independence among its members. Whereas the previous methods recursively select a subset of variables and compute mutual information scores by forward selection, MRNETB performs backward elimination with a sequential search. A further in-depth explanation can be found in 8. This algorithm was run from the Bioconductor package minet 3 with the Spearman entropy estimator; the number of bins used for discretization was N, where N is the number of samples, i.e. 267 in our expression dataset.

5. Distance Correlation

This algorithm, described in 9, uses a novel measure of dependence called distance correlation (DC) 10 to derive non-linear dependencies from the gene expression dataset. This metric takes a different approach compared to some of the

previous MI-based methods. MI-based methods rely on density estimation of the expression patterns, which can be challenging for multivariate data; moreover, continuous data must be discretized before MI-based methods can be applied. The details of the algorithm can be found in 9 and are beyond the scope of this work.

References

1. Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):e8. doi:10.1371/journal.pbio.0050008.
2. Madar A, Greenfield A, Vanden-Eijnden E, Bonneau R. DREAM3: network inference using dynamic context likelihood of relatedness and the inferelator. PLoS One. 2010;5(3):e9803.
3. Meyer PE, Lafitte F, Bontempi G. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008;9(1):461.
4. Irrthum A, Wehenkel L, Geurts P, et al. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010;5(9):e12776.
5. Dream Challenges.
6. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
7. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. GenePattern 2.0. Nat Genet. 2006;38(5):500-501.
8. Meyer P, Marbach D, Roy S, Kellis M. Information-Theoretic Inference of Gene Networks Using Backward Elimination. In: BIOCOMP; 2010:700-705.
9. Guo X, Zhang Y, Hu W, Tan H, Wang X. Inferring Nonlinear Gene Regulatory Networks from Gene Expression Data Based on Distance Correlation. PLoS One. 2014;9(2):e87446.
10. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769-2794.