Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles

Supporting Information to Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles

Ali Shojaie 1,#, Alexandra Jauhiainen 2,#, Michael Kallitsis 3,#, George Michailidis 3
1 Department of Biostatistics, University of Washington, Seattle, WA, USA
2 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden
3 Department of Statistics, University of Michigan, Ann Arbor, MI, USA
# These authors contributed equally to this work.
E-mail: gmichail@umich.edu

Properties of the RIPE Algorithm

Lemma 1. Suppose that the binary influence matrix P obtained from perturbation screens correctly reflects the influence of genes in a regulatory network on each other; i.e., all the genes that are influenced by single knocked-out/down genes are correctly included in P. Then,
(i) P does not contain sufficient information for estimation of the structure of the underlying regulatory network.
(ii) P contains sufficient information for estimation of the causal orderings of the nodes of the underlying regulatory network.

Proof. We establish (i) through a counterexample. Consider the two simple regulatory networks depicted in Figure S1. These two networks yield the same perturbation data; thus, such data cannot provide sufficient information for reconstructing the network structure.

To show (ii), first assume that the underlying network G is a DAG. In this case, there exists at least one causal ordering of its nodes. (Note: all topological orderings of a DAG are valid causal orderings.) Moreover, by the assumed accuracy of the perturbation data, P also forms a DAG. In addition, due to the transitivity of the influence matrix, the ordering of the nodes in the influence matrix is the same as that in G. Therefore, a search algorithm for orderings can find a valid causal ordering.

Now, suppose G is not a DAG. The assumed accuracy of the perturbation data implies that the influence graph represents all the influences of nodes on each other, which, in the presence of feedback mechanisms, results in cycles in the influence graph. Denote by D the DAG of the strongly connected components obtained from P. First, note that by construction, and by transitivity of the influence graph, the causal ordering of the nodes in D is a valid partial order of the nodes in P. On the other hand, the orderings among the nodes within each strong component (each node in D) are equivalent, and an exhaustive search algorithm (like the backtracking algorithm implemented in RIPE) finds all these equivalent orderings. This implies that the set of orderings obtained by assembling the (equivalent) orderings of nodes in all components, together with the partial ordering of nodes in D, gives the list of all orderings consistent with the influence graph. This completes the proof.

Figure S1: Simple graphs with 3 nodes.

Remark: From the counterexample depicted in Figure S1, it can be seen that including information on the signs of the perturbation effects does not help distinguish the two networks. Further, because of the transitivity of the influences, the result of the lemma continues to hold in the case where all direct influences amongst the nodes are accurately included in P (i.e., without any false positives) and no indirect influences are included. Since direct interactions often show the most significant knockout effects, this lends support to more conservative choices of the p-value cutoff.

The following notation is used in the next lemma: $f(n) = O(g(n))$ means that there is a positive constant $M$ such that for all sufficiently large values of $n$, $f(n) \le M g(n)$, whereas $f(n) = o(g(n))$ means that as $n \to \infty$, $f(n)/g(n) < \epsilon$ for any $\epsilon > 0$.

Lemma 2 (Consistency of Estimates). Assume the data obtained from perturbation screens satisfy the assumptions of Lemma 1, and let $s$ be the total number of true edges in the graph.
Suppose that for some $a > 0$, $p = p(n) = O(n^a)$ and $|\mathrm{pa}_i| = O(n^b)$, where $s n^{2b-1} \log n = o(1)$ as $n \to \infty$. Moreover, suppose that there exists $\nu > 0$ such that for all $n \in \mathbb{N}$ and all $i \in V$, $\mathrm{Var}(X_i \mid X_{1:i-1}) \ge \nu$, and that there exist $\delta > 0$ and some $\xi > b$ such that for every $i \in V$ and every $j \in \mathrm{pa}_i$, $|\pi_{ij}| \ge \delta n^{-(1-\xi)/2}$, where $\pi_{ij}$ is the partial correlation between $X_i$ and $X_j$ after removing the effect of the remaining variables. Consider the RIPE estimate obtained with an adaptive lasso penalty in the second stage of the algorithm, where the initial weights come from another lasso-penalized RIPE step with a tuning parameter $\lambda^0$ satisfying $\lambda^0 = O(\sqrt{\log p / n})$. Then, if the second-stage parameter $\lambda$ is chosen so that $\lambda \asymp d n^{-(1-\zeta)/2}$ for some $b < \zeta < \xi$ and $d > 0$, we have that with probability converging to 1, the RIPE estimate satisfies the following:
(i) the edges of the regulatory network are correctly estimated.
(ii) the signs of the regulatory effects (activation/inhibition) are correctly estimated.

Proof. Lemma 1 shows that, under the assumptions of this lemma, causal orderings are correctly determined. Then, if the true regulatory network is a DAG, the result follows directly from Theorem 3 of [1]. On the other hand, when the true network contains cycles, each ordering from the first step of RIPE is a valid ordering. Therefore, for each of the estimated DAGs, claims (i) and (ii) hold by Theorem 3 of [1]. Now, taking $q = 1$ and $\tau = \varepsilon$ for some small $\varepsilon > 0$ implies that all the estimated edges are included in the graph estimated after model averaging, which in turn implies that (i) and (ii) are satisfied for this network.

In words, Lemma 2 states that if the perturbation data are accurate, and if adequate samples of steady state data are available (compared to the number of genes in the network and the number of regulators of each gene), then the RIPE algorithm with appropriate choices of tuning parameters can correctly estimate both the edges of the network and the signs of the effects (i.e., inductive versus inhibitory effects). Although this result is asymptotic in nature, it nevertheless provides insight into the required number of samples, as long as the available sample size for the steady state data satisfies the requirements of the lemma related to the total number of nodes $p$, the number of regulators of each gene $|\mathrm{pa}_i|$, and the total number of edges in the network $s$.

Figure S2: Small cyclic subnetwork example (panels: True Graph, Estimated Graph). The true network (left) includes a number of cycles, and the estimate from the RIPE algorithm correctly identifies some of these cycles (right).

The proof mostly builds on proofs establishing consistency of adaptive lasso estimates for estimation problems of directed acyclic graphs with known ordering of nodes [1].
The main difference in the case of general regulatory networks is the presence of cycles in the network, where the graph averaging step of the algorithm, coupled with the assumption that the data from perturbation screens are noiseless, gives the desired result. To see this argument from a different point of view, note that, as n increases, DAGs from different orderings corresponding to cycles in the true network have very similar likelihood scores, which results in the inclusion of these cycles in the set of final estimates. For instance, consider a bi-directed edge $X_1 \leftrightarrow X_2$. Then the DAGs from both orderings have the same likelihoods, and hence the model resulting from averaging the two estimated DAGs will include both edges if an appropriate threshold is used. This is illustrated in Figure S2, where subnetworks corresponding to a small cyclic subgraph from DREAM4-net are shown. It is worth noting that the above lemma can be further extended to allow for the presence of noise in the perturbation screens; however, this is a rather involved extension and thus beyond the scope of this paper.

Additional Numerical Experiments

Figure S3: Numerical study on choices of τ and q. Values of the Precision (P) for different combinations of τ (threshold for including an edge in the consensus graph) and q (proportion of highest values of the log-likelihood function used in constructing the consensus graph).

References

1. A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519-538, 2010.
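The roles of τ and q in the model-averaging step can be illustrated with a small sketch. It assumes that the consensus graph keeps the top-q proportion of estimated DAGs by log-likelihood and includes an edge when its frequency among the retained DAGs is at least τ; the function name `consensus_graph` and the toy inputs are hypothetical and are not taken from the RIPE implementation.

```python
def consensus_graph(dags, loglik, q=1.0, tau=0.5):
    """Hypothetical sketch of edge averaging over DAGs from different orderings.

    dags   : list of sets of directed edges (i, j), one set per ordering
    loglik : list of log-likelihood values, one per DAG
    q      : proportion of highest-likelihood DAGs to retain
    tau    : minimum edge frequency for inclusion in the consensus graph
    """
    if not dags or len(dags) != len(loglik):
        raise ValueError("need one log-likelihood per DAG")
    # Retain the top-q proportion of DAGs (always at least one).
    n_keep = max(1, round(q * len(dags)))
    ranked = sorted(range(len(dags)), key=lambda k: loglik[k], reverse=True)
    kept = [dags[k] for k in ranked[:n_keep]]
    # Count how often each edge appears among the retained DAGs.
    counts = {}
    for edges in kept:
        for e in edges:
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c / len(kept) >= tau}


# Toy version of the bi-directed edge example: each ordering of the two
# nodes yields one direction, and both DAGs have the same likelihood.
dags = [{(1, 2)}, {(2, 1)}]
loglik = [-10.0, -10.0]
both = consensus_graph(dags, loglik, q=1.0, tau=0.5)     # both directions kept
strict = consensus_graph(dags, loglik, q=1.0, tau=0.75)  # threshold too high
```

With q = 1 and a small τ every estimated edge survives the averaging, which is the choice used in the proof of Lemma 2; a larger τ drops edges that appear in only some orderings.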

Figure S4: Numerical study on choices of τ and q. Values of the Recall (R) for different combinations of τ (threshold for including an edge in the consensus graph) and q (proportion of highest values of the log-likelihood function used in constructing the consensus graph).

Figure S5: Numerical study on choices of τ and q. Values of the F measure for different combinations of τ (threshold for including an edge in the consensus graph) and q (proportion of highest values of the log-likelihood function used in constructing the consensus graph).
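For reference, the three performance measures reported in Figures S3 to S5 can be computed from edge sets as sketched below. The sketch assumes that the F measure is the usual harmonic mean of precision and recall, which the text does not state explicitly; the edge sets are made up for illustration.

```python
def precision_recall_f(true_edges, est_edges):
    """Precision, recall and F measure for directed edge sets.

    Assumes F is the harmonic mean of precision and recall (F1).
    """
    true_edges, est_edges = set(true_edges), set(est_edges)
    tp = len(true_edges & est_edges)  # correctly recovered directed edges
    precision = tp / len(est_edges) if est_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative edge sets: the estimate recovers two of four true edges
# and reverses the direction of a third.
true_edges = {(1, 2), (2, 3), (3, 1), (3, 4)}
est_edges = {(1, 2), (2, 3), (4, 3)}
p, r, f = precision_recall_f(true_edges, est_edges)
```

Edges are compared as ordered pairs here, so a reversed edge counts as both a false positive and a false negative.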

Figure S6: Influence graph characteristics versus p-value for knockdown experiments in DREAM4 (panels A, B, C; horizontal axis: P value; vertical axes: number of edges, size of largest connected component). Number of edges (open triangles) in the influence graph and the size of the largest connected component (dots) versus cut-off p-value for differential expression. The data are based on five replications of the knockdown and wildtype experiments for (A) -node network, (B) -node network 3, and (C) -node network 5 in the DREAM4 challenge. P-values chosen for analysis were .3 for (A) and (C), and .27 for (B).
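The two quantities plotted against the cut-off can be computed as in the following sketch. It assumes that an influence edge i → j is declared when the differential-expression p-value for gene j under perturbation of gene i falls below the cut-off, and that "connected component" refers to weak connectivity (edge directions ignored); both are assumptions, and the p-values below are invented.

```python
def influence_graph_stats(pvals, n_genes, cutoff):
    """Return (number of edges, size of largest weakly connected component).

    pvals maps (i, j) to the p-value for gene j being affected when
    gene i is perturbed; this input layout is hypothetical.
    """
    edges = [(i, j) for (i, j), pv in pvals.items() if i != j and pv < cutoff]
    parent = list(range(n_genes))  # union-find forest over genes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    sizes = {}
    for v in range(n_genes):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return len(edges), max(sizes.values())


# Invented p-values for a 5-gene toy example.
pvals = {(0, 1): 0.001, (1, 2): 0.02, (2, 0): 0.2, (3, 4): 0.04, (0, 3): 0.6}
stats = {c: influence_graph_stats(pvals, 5, c) for c in (0.01, 0.05, 0.5)}
```

Sweeping the cut-off in this way shows how quickly the influence graph fills in and merges into one large component as the threshold is relaxed.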

Figure S7: Synthetic regulatory network.

Figure S8: Illustration of influence matrices for synthetic networks. P_0: ground truth, P_1: 5% of directions reversed, P_2: % new effects added, P_3: 5% directions reversed and % new effects added. A black dot in position (i, j) (i.e. in row i and column j) represents that gene i influences gene j.
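An influence matrix of this kind can be viewed as the reachability (transitive closure) matrix of the regulatory network, which also makes part (i) of Lemma 1 concrete. The sketch below uses two hypothetical 3-node networks, a chain and the same chain with an added shortcut edge; they are chosen for illustration and are not claimed to be the networks drawn in Figure S1.

```python
def influence_matrix(n, edges):
    """Binary matrix with P[i][j] = 1 if a directed path leads from i to j."""
    P = [[0] * n for _ in range(n)]
    for i, j in edges:
        P[i][j] = 1
    # Warshall's algorithm for the transitive closure.
    for k in range(n):
        for i in range(n):
            if P[i][k]:
                for j in range(n):
                    if P[k][j]:
                        P[i][j] = 1
    return P


# Two different networks on nodes 0, 1, 2 (hypothetical examples).
chain = [(0, 1), (1, 2)]
chain_with_shortcut = [(0, 1), (1, 2), (0, 2)]
P_a = influence_matrix(3, chain)
P_b = influence_matrix(3, chain_with_shortcut)
```

Both networks give the same matrix because knocking out node 0 affects node 2 in either case, so the influence matrix alone cannot tell whether the direct edge 0 → 2 exists.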

Figure S9: Effect of increasing the number of orderings used in RIPE (horizontal axis: no of orders). The values of F, P and R are displayed for increasing numbers of orderings in the synthetic DAG with p = nodes under .5% false positive edges.
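The orderings in question are those consistent with the influence graph, as described in the proof of Lemma 1. A minimal sketch of how such orderings might be enumerated is given below: nodes that influence each other are grouped into strong components, and a backtracking search lists the topological orderings of the resulting component DAG. It assumes a transitively closed binary influence matrix (the diagonal is ignored), and the function names and the `limit` parameter are hypothetical rather than taken from the RIPE code.

```python
def strong_components(P):
    """Group nodes that influence each other (P assumed transitively closed)."""
    n = len(P)
    comp_of, comps = {}, []
    for i in range(n):
        if i in comp_of:
            continue
        members = [j for j in range(n) if j == i or (P[i][j] and P[j][i])]
        for j in members:
            comp_of[j] = len(comps)
        comps.append(members)
    return comps, comp_of


def component_orderings(P, limit=1000):
    """Backtracking enumeration of orderings of the component DAG (at most `limit`)."""
    comps, comp_of = strong_components(P)
    m, n = len(comps), len(P)
    # preds[c]: components that must precede component c.
    preds = [set() for _ in range(m)]
    for i in range(n):
        for j in range(n):
            if P[i][j] and comp_of[i] != comp_of[j]:
                preds[comp_of[j]].add(comp_of[i])
    results = []

    def backtrack(order, placed):
        if len(results) >= limit:
            return
        if len(order) == m:
            results.append([comps[c] for c in order])
            return
        for c in range(m):
            if c not in placed and preds[c] <= placed:
                backtrack(order + [c], placed | {c})

    backtrack([], frozenset())
    return results


# Toy influence matrix: genes 0 and 1 form a feedback loop and both
# influence genes 2 and 3, which do not influence each other.
P = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
orderings = component_orderings(P)
```

Each result lists the components in a valid order; nodes inside one component can be arranged in any order, which is where the equivalent orderings mentioned in the proof arise. The number of orderings can grow quickly, hence the cap on how many are returned.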

Figure S10: Influence graph characteristics versus p-value for the yeast regulatory network (horizontal axis: P value cut off; vertical axes: number of edges, size of largest connected component). Number of edges (open triangles) in the influence graph and the size of the largest connected component (dots) versus cut-off p-value for differential expression in the transcription factor knockout experiments in yeast. The graph is based on expression data from 588 two-color microarrays for 269 transcription factors. The p-value cut-off was chosen to be .2.

Figure S11: Performance of RIPE in estimating the layered yeast regulatory network (panel title: RIPE Performance in yeast regulatory network (65 genes), TP =34, E =4 (p value<.); horizontal axis: no of TP). The distribution of the number of true positives, in comparison to the number of true positives for the RIPE estimator, for random graphs with the same number of edges and similar 2-layer structure (p = 65 genes and k = 269 transcription factors). No random network with equal or larger true positives was observed (p-value <.).
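The comparison against random graphs amounts to an empirical p-value, which can be sketched as follows. The sketch assumes that "similar 2-layer structure" means edges drawn uniformly at random from transcription factors to genes, with the same number of edges as the estimate; the sampling scheme, function name and toy numbers are assumptions made for illustration.

```python
import random


def empirical_p_value(true_edges, n_tf, n_genes, n_est_edges,
                      observed_tp, n_random=1000, seed=0):
    """Fraction of random 2-layer graphs with at least `observed_tp` true positives.

    Random graphs have `n_est_edges` edges drawn without replacement from
    all (transcription factor, gene) pairs; this null model is an assumption.
    """
    rng = random.Random(seed)
    all_pairs = [(t, g) for t in range(n_tf) for g in range(n_genes)]
    true_edges = set(true_edges)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_pairs, n_est_edges)
        tp = sum(1 for e in sample if e in true_edges)
        if tp >= observed_tp:
            hits += 1
    return hits / n_random


# Toy setting: 3 transcription factors, 4 genes, 4 true edges.
true_edges = {(0, 0), (0, 1), (1, 2), (2, 3)}
p_impossible = empirical_p_value(true_edges, 3, 4, n_est_edges=4, observed_tp=5)
p_trivial = empirical_p_value(true_edges, 3, 4, n_est_edges=4, observed_tp=0)
```

When no random graph reaches the observed count, as reported in the caption, the returned value is 0 and the p-value can only be bounded above by roughly 1/`n_random`.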