Simulation of molecular regulatory networks with graphical models Inma Tur 1 inma.tur@upf.edu Alberto Roverato 2 alberto.roverato@unibo.it Robert Castelo 1 robert.castelo@upf.edu 1 Universitat Pompeu Fabra, Barcelona, Spain 2 Università di Bologna, Bologna, Italy user! 2013 - UCLM, Albacete, Spain - July 10-12 2013 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 1 / 27
Motivation - Genomics data High-throughput genomics technologies produce high-dimensional (p n) multivariate data sets of continuous and discrete random variables. Gene expression data Network of molecular regulatory interactions Genetical genomics data Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 2 / 27
Motivation - Graphical Markov Models (GMM) Gaussian GMMs Homogeneous Mixed GMMs X V N p (µ, Σ) 1 X V N p (µ(i), Σ(i)) with Σ(i) Σ I1 2 3 Y1 Y2 > library(qpgraph) > set.seed(12345) > gmm <- ruggmm(dregulargraphparam()) > round(solve(gmm$sigma), digits=1) 1 2 3 4 1 9.5-3.4-7.2 0.0 2-3.4 5.9 0.0-2.3 3-7.2 0.0 8.2 0.9 4 0.0-2.3 0.9 2.3 4 Y3 > library(qpgraph) > set.seed(12345) > gmm <- rhmgmm(dregularmarkedgraphparam()) > round(solve(gmm$sigma), digits=1) Y1 Y2 Y3 Y1 11.0 0.0-7.2 Y2 0.0 1.2-1.6 Y3-7.2-1.6 8.2 > gmm$mean() Y1 Y2 Y3 1 0.4720734 0.9669291 0.7242007 2 1.4720734 1.9669291 1.7934027 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 3 / 27
Motivation - Graphical Markov Model (GMM) Testing GMM estimation procedures on simulated data is a fundamental step to verify properties such as correctness or asymptotic behavior. The R/Bioconductor package qpgraph implements algorithms to simulate Gaussian GMMs, homogeneous mixed GMMs and data from them. Simulating Gaussian GMMs requires simulating covariance matrices whose inverse matches a zero-pattern defined by missing edges in a graph. Simulating homogeneous mixed GMMs requires simulating conditional covariance matrices whose inverse matches a zero-pattern defined by a graph, and conditional mean vectors satisfying additive effects. The rest of this talk is based on the vignette entitled Simulating molecular regulatory networks using qpgraph from the qpgraph package. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 4 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 5 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 6 / 27
Simulation of graphs for GMMs Consider undirected labeled graphs G = (V, E) where V = {1,..., p} is the vertex set indexing a vector of random variables (r.v. s) X V = {X 1,..., X p } that belong to some probability distribution P. E (V V ) is the edge set, such that (i, j ) E X i X j X V \{X i, X j }, holds in P (pairwise Markov property w.r.t G). Let {A, B, S} V, S separates A from B in G (A G B S) if every path between A and B intersects S. G should be such that A G B S X A X B X S, holds in P (global Markov property w.r.t G). In the context of GMMs we want to simulate graphs in which we can control separation and sparseness. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 7 / 27
Simulation of graphs for GMMs qpgraph simulates undirected graphs according to 1 the type of graph pure single type of vertices marked two subsets of vertices (associated to discrete and continuous random variables) 2 the model to simulate the random graph Erdös-Rényi edges occur with equal probability, or graphs are chosen uniformly at random with given number of vertices & edges. d-regular vertices have a constant degree d (Harary, 1969). It follows that graph density is a linear function of d: D = d/(p 1). This is implemented in the function rgraphbam(n, param,...) which returns objects of the class graphbam defined in the graph package. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 8 / 27
Simulation of graphs for GMMs Input parameters in rgraphbam() are defined by S4 classes graphparam p labels markedgraphparam pi Ilabels py Ylabels ergraphparam m [# edges] prob [Pr(edge)] dregulargraphparam d [vertex degree] exclude ermarkedgraphparam dregularmarkedgraphparam Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 9 / 27
Simulation of graphs for GMMs d-regular graphs are simulated with the Steger and Wormald (1999) algorithm > library(qpgraph) > set.seed(1234) > g <- rgraphbam(dregularmarkedgraphparam(pi=2, py=10, d=3)) > plot(g, lwd=3) I2 Y1 I1 Y2 Y4 Y3 Y10 Y6 Y5 Y7 Y8 Y9 plot() is overloaded to use the functionality from the Rgraphviz package. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 10 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 11 / 27
Simulation of undirected Gaussian GMMs Undirected Gaussian GMMs are multivariate models on continuous r.v. s X V = {X 1,..., X p } determined by an undirected graph G = (V, E) with V = {1,..., p} and E V V such that X V N p (µ, Σ) where {Σ 1 } ij = 0 for i j and (i, j ) G Therefore, to simulate an undirected Gaussian GMM we need to build a matrix Σ such that 1 Σ is positive definite (Σ S + ), 2 the off-diagonal cells of the scaled Σ corresponding to the present edges in G match a given marginal correlation ρ, 3 the zero pattern of Σ 1 matches the missing edges in G. This is not straightforward since setting directly off-diagonal cells to zero in some initial Γ S + will not typically lead to a positive definite matrix. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 12 / 27
Simulation of undirected Gaussian GMMs Let Γ G be an incomplete matrix with elements {γ ij } for i = j or (i, j ) G. 1 γ 11 γ 12 γ 13 2 3 Γ G = γ 21 γ 22 γ 24 γ 31 γ 33 γ 34 γ 42 γ 43 γ 44 4 Γ is a positive completion of Γ G if Γ S + and {Γ 1 } ij =0 for i j, (i, j ) G. Draw Γ G from a Wishart distributionw p (Λ, p); Λ= R, =diag({ 1/p} p ) and R = {R ij } p p where R ij = 1 for i = j and R ij = ρ for i j. It is required that Λ S + and this happens if and only if 1/(p 1) < ρ < 1. Finally, to obtain Σ Γ from Γ G, qpgraph uses the regression algorithm by Hastie, Tibshirani and Friedman (2009, pg. 634) as matrix completion algorithm. See functions qprndwishart(), qphtf() and qpg2sigma() for further details. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 13 / 27
Simulation of undirected Gaussian GMMs ruggmm() simulates undirected Gaussian GMMs taking as input either a graphparam or a graphbam object. It returns an S4 object of class UGgmm. > set.seed(12345) > gmm <- ruggmm(n=1, g=dregulargraphparam(p=4, d=2), rho=0.75) > class(gmm) [1] "UGgmm" attr(,"package") [1] "qpgraph" > names(gmm) ## it behaves pretty much like a 'list' object [1] "X" "p" "g" "mean" "sigma" > round(solve(gmm$sigma), digits=1) 1 2 3 4 1 20.9-6.6-13.6 0.0 2-6.6 10.6 0.0-4.3 3-13.6 0.0 12.7 0.9 4 0.0-4.3 0.9 4.3 > gmm2 <- ruggmm(n=1, g=gmm$g) ## fix 'g' simulate covariance only > round(solve(gmm2$sigma), digits=1) 1 2 3 4 1 7.2-1.3-3.2 0.0 2-1.3 1.2 0.0-0.5 3-3.2 0.0 3.4-0.5 4 0.0-0.5-0.5 1.5 1 2 3 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 14 / 27 4
Simulation of undirected Gaussian GMMs rmvnorm() from the mvtnorm package is overloaded in qpgraph to simulate multivariate normal observations from the undirected GMM: > set.seed(123) > X <- rmvnorm(n=10000, gmm) > head(x, n=3) 1 2 3 4 [1,] 0.4212627 0.28868036 0.8864354 0.22768491 [2,] 0.7696843 0.98995123 0.8572206-0.09639081 [3,] 0.1257251 0.04640156 0.5221513 0.23059947 > round(solve(cov(x)), digits=1) 1 2 3 4 1 21.2-6.7-13.7 0.0 2-6.7 10.6 0.0-4.3 3-13.7 0.0 12.7 0.9 4 0.0-4.3 0.9 4.2 > mean(cor(x)[upper.tri(cor(x)) & as(gmm$g, "matrix")]) [1] 0.8765055 As the number of present edges grow, their mean marginal correlation approaches the given one. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 15 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 16 / 27
Simulation of Homogeneous Mixed GMMs Simplifying assumptions (in the context of genetical genomics data): 1 Discrete genotypes affect gene expression and not the other way around. 2 Joint distribution of X is a conditional Gaussian distribution X V N py (µ(i), Σ(i)) with i I. 3 Genotype alleles affect only mean expression levels of genes and not the correlations between them, i.e., Σ(i) Σ is constant throughout i I. 4 Currently, discrete r.v. s are simulated as being marginally independent between them. 5 Currently, every continuous r.v. cannot depend on more than one discrete r.v., where µ(i) = Σ h(i) and h(i) = {h 1 (i),..., h p (i)} contain the mixed linear parameters that define the additive effects. Y1 I1 Y3 Y2 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 17 / 27
Simulation of Homogeneous Mixed GMMs rhmgmm() simulates homogeneous mixed GMMs taking as input either a markedgraphparam or graphbam object. It returns an S4 object of class HMgmm. Here we simulate one involving 2 50 joint discrete levels: > set.seed(12345) > gmm <- rhmgmm(n=1, dregularmarkedgraphparam(pi=50, py=100, d=2), a=2) > names(gmm) [1] "X" "I" "Y" "p" "pi" "py" "g" "mean" "sigma" [10] "a" "eta2" > print(object.size(gmm), units="kb") 90.1 Kb rcmvnorm() simulates data from a homogeneous mixed GMM using rmvnorm() for the continuous observations (mvtnorm package): > set.seed(123) > X <- rcmvnorm(n=1000, gmm) > Ymix <- gmm$y[which(rowsums(as(gmm$g, "matrix")[gmm$y, gmm$i]) == 1)] > Imix <- gmm$i[apply(as(gmm$g, "matrix")[ymix, gmm$i] == 1, 1, which)] > smean <- sapply(1:length(ymix), function(i, Y, I) tapply(y[, i], I[, i], mean), + X[, Ymix], X[, Imix]) > mean(abs(smean[1, ] - smean[2, ])) ## sample additive effects approach the given one [1] 2.028512 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 18 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 19 / 27
Simulation of eqtl models of experimental crosses S4 class eqtlcross: a genetic map (map class in qtl package) a homogeneous mixed GMM (HMgmm) reqtlcross simulates n eqtlcross models: > reqtlcross(n=2, network=eqtlcrossparam()) [[1]] eqtl backcross model with 20 markers, 20 genes, 20 eqtl and 20 gene-gene expression associations. 0 20 40 Genetic map [[2]] eqtl backcross model with 20 markers, 20 genes, 20 eqtl and 20 gene-gene expression associations. Location (cm) 60 80 100 120 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Chromosome Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 20 / 27
Simulation of eqtl models of experimental crosses eqtl model with cis-associations only > data(map10, package="qtl") ## load example genetic map from the 'qtl' package > (eqtl <- reqtlcross(eqtlcrossparam(map=map10, genes=100))) eqtl backcross model with 187 markers, 100 genes, 100 eqtl and 100 gene-gene expression associations. > eqtl$model Homogeneous mixed graphical Markov model with 100 discrete and 100 continuous r.v., and 200 edges. > plot(eqtl, main="eqtl model with cis-associations only", cex.lab=1.5, cex.main=1.5) eqtl model with cis associations only Gene location X 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 1 2 3 4 5 6 8 9 11 13 15 17 19 eqtl location Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 21 / 27
Simulation of eqtl models of experimental crosses eqtl model with cis and trans-associations > set.seed(123) > (eqtl <- reqtlcross(eqtlcrossparam(map=map10, genes=100, cis=0.5, trans=c(5,5)), a=5)) eqtl backcross model with 187 markers, 100 genes, 60 eqtl and 100 gene-gene expression associations. > head(ciseqtl(eqtl), n=3) chrom location QTL gene a 1 1 0.00000 QTL1 g1 5 2 1 78.15385 QTL2 g5 5 3 1 97.69231 QTL3 g6 5 > head(transeqtl(eqtl), n=10) chrom location QTL gene a 1 6 65.62500 QTL17 g21 5 2 6 65.62500 QTL17 g40 5 3 6 65.62500 QTL17 g41 5 4 6 65.62500 QTL17 g59 5 5 6 65.62500 QTL17 g97 5 6 7 63.42857 QTL19 g10 5 7 7 63.42857 QTL19 g22 5 8 7 63.42857 QTL19 g34 5 9 7 63.42857 QTL19 g42 5 10 7 63.42857 QTL19 g79 5 > plot(eqtl, cex.lab=1.5, cex.main=1.5, + main="eqtl model with trans-eqtl") Gene location X 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 eqtl model with trans eqtl 1 2 3 4 5 6 8 9 11 13 15 17 19 eqtl location Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 22 / 27
Simulation of eqtl models of experimental crosses sim.cross() from package qtl is overloaded to simulate genotypes from experimental crosses and place eqtl using an input homogeneous mixed GMM: > set.seed(123) > cross <- sim.cross(map10, eqtl) > eqtlgene <- "g5" ## take gene 'g5' with a cis-eqtl > connectedgenes <- graph::edges(eqtl$model$g)[[eqtlgene]] > (connectedgenes <- connectedgenes[connectedgenes %in% eqtl$model$y]) [1] "g15" "g22" > out.mr <- qtl::scanone(cross, method="mr", pheno.col=c(eqtlgene, connectedgenes)) 15 g5 QTL2 LOD scores 10 5 g15 g22 g5 0 QTL2 g22 0 20 40 60 80 100 120 Map position (cm) g15 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 23 / 27
Outline 1 Simulation of graphs for GMMs 2 Simulation of undirected Gaussian Graphical Markov Models 3 Simulation of Homogeneous Mixed Graphical Markov Models 4 Simulation of eqtl models of experimental crosses 5 Conclusions Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 24 / 27
Conclusions The R/Bioconductor package qpgraph enables simulating Gaussian GMMs, homogeneous mixed GMMs and data from them. The implemented algorithms can simulate GMMs with given marginal correlations and additive effects. This provides flexibility when simulating data to verify correctness or asymptotic behavior of GMM estimation algorithms. The use of S4 classes hides algorithmic complexity to the non-expert user and provides a seamless integration with other packages such as graph, Rgraphviz, mvtnorm and qtl. By overloading the function sim.cross() from qtl, qpgraph allows one to easily simulate genetical genomics data from experimental crosses. Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 25 / 27
Session information > tolatex(sessioninfo()) R version 3.0.0 (2013-04-03), x86_64-unknown-linux-gnu Locale: LC_CTYPE=en_US.UTF8, LC_NUMERIC=C, LC_TIME=en_US.UTF8, LC_COLLATE=en_US.UTF8, LC_MONETARY=en_US.UTF8, LC_MESSAGES=en_US.UTF8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF8, LC_IDENTIFICATION=C Base packages: base, datasets, graphics, grdevices, methods, stats, utils Other packages: colorout 1.0-0, qpgraph 1.16.2, setwidth 1.0-3, vimcom 0.9-8 Loaded via a namespace (and not attached): annotate 1.38.0, AnnotationDbi 1.22.6, Biobase 2.20.1, BiocGenerics 0.6.0, DBI 0.2-7, genefilter 1.42.0, GGBase 3.22.0, graph 1.38.2, grid 3.0.0, IRanges 1.18.1, lattice 0.20-15, Matrix 1.0-12, mvtnorm 0.9-9995, parallel 3.0.0, qtl 1.27-10, Rgraphviz 2.4.1, RSQLite 0.11.4, snpstats 1.10.0, splines 3.0.0, stats4 3.0.0, survival 2.37-4, tools 3.0.0, XML 3.98-1.1, xtable 1.7-1 Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 26 / 27
Acknowledgements and advertisements Funding: FPI predoctoral fellowship [BES-2009-024901] to I. Tur Spanish MINECO project grant [TIN2011-22826] to R. Castelo The qpgraph package is available at http://www.bioconductor.org. Follow news and bugfixes about qpgraph in Twitter at @robertclab. Thanks for listening! Inma Tur, Alberto Roverato, Robert Castelo Simulation of molecular regulatory networks with graphical models 27 / 27