Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks


Jing Yu 1,2, V. Anne Smith 1, Paul P. Wang 2, Alexander J. Hartemink 3, Erich D. Jarvis 1

1 Duke University Medical Center, Department of Neurobiology, Box 3209, Durham, NC
2 Duke University, Department of Electrical Engineering, Box 90291, Durham, NC
3 Duke University, Department of Computer Science, Box 90129, Durham, NC

Recent advances in high-throughput molecular biology have motivated, in the field of bioinformatics, the use of network inference algorithms to predict causal models of molecular networks from correlational data. However, it is extremely difficult to evaluate the effectiveness of these algorithms because we possess neither the knowledge of the correct biological networks nor the ability to experimentally validate the hundreds of predicted gene interactions within a reasonable amount of time. Here, we apply a new approach developed by Smith et al. (2002) that tests the ability of network inference algorithms to accurately and efficiently recover network structures based on gene expression data taken from a simulated biological pathway in which the structure is known a priori. We simulated a genetic regulatory network and used the resultant sampled data to test variations in the design of a Bayesian network inference algorithm, as well as variations in total quantity of available data, length of sampling interval, method of data discretization, and presence of interpolated data between observed data points. We also advanced the inference algorithm by developing a heuristic influence score that infers the strength and sign of regulation (up or down) between genes. In these experiments, we found that our inference algorithm worked best when presented with data discretized into three categories, when using a greedy search algorithm with random restarts, and when evaluating networks using the BDe scoring metric. Under these conditions, the algorithm was both accurate and efficient in recovering the simulated molecular network when the sampled data sets were large.
Under more biologically reasonable small amounts of sampled data, the algorithm worked best only when interpolated data were included, but had difficulty recovering relationships describing genes with more than one regulatory influence. These results suggest that network inference algorithms and sampling methods must be carefully designed and tested before they can be used to recover biological genetic pathways, especially in the context of highly limited quantities of data.

INTRODUCTION

The advent of novel technologies for collecting high-throughput data in molecular biology has led to the concurrent development of bioinformatics tools for analyzing these data. Computer scientists and bioinformaticians soon realized that common inference algorithms used in other fields can be applied to these large amounts of biological data, such as those from microarrays, to statistically predict causal molecular pathways. However, these potentially powerful algorithms are limited by our inability to evaluate their accuracy: we do not know the true biological network against which to compare them, and experimenters cannot physically perform in reasonable time the multiple gene knockouts or other types of interventions required to systematically test the predicted networks. As part of an ongoing project dedicated to integrating the songbird brain (Jarvis et al. 2002), Smith et al. (2002) developed a novel approach for evaluating the accuracy and efficiency of network inference algorithms in a reasonable amount of time. This approach requires the creation of a biologically reasonable simulation on a computer in which the experimenter makes and knows all the rules. As the simulation runs, data are sampled from it as one would sample data from a real biological system. The sampled data are then passed to an inference algorithm to evaluate the algorithm's ability to recover the simulated system. The inference algorithm can then be modified and made more robust to recover a network that closely matches the simulated system.
After confident recovery of the system from limited simulated data is achieved, the algorithm can be applied to real data. The recovered system can then be used to guide further biological experimentation for verifying the predicted regulatory relationships. In our first use of this approach (Jarvis et al. 2002; Smith et al. 2002), we incorporated multiple levels of biological organization, from the molecular to the behavioral. Here, we attempt to look more closely at a single level of biological organization, the molecular level. We developed a simulator, which we named GeneSimulator, that models genetic regulatory networks and generates correlational data similar to that collected from high-density gene microarrays. We then evaluated various Bayesian network (BN) inference algorithm designs for their ability to recover the underlying genetic regulatory network. We chose to use a BN algorithm because, compared with other common algorithms (Somogyi and Sniegoski 1996; D'haeseleer et al. 1999; Weaver et al. 1999), BNs have the ability to simultaneously model non-linear combinatorial relationships, robustly handle noisy data sets, and guard against over-fitting. BNs cannot handle networks with cyclic structures, such as regulatory feedback loops; however, dynamic Bayesian networks (DBNs) can handle cyclic structures (Friedman et al. 1999; Murphy and Mian 1999). We used DBNs, and when so configured, they are also capable of coping with hidden variables that are not observed in the data, such as protein levels or protein interactions that affect the measured gene expression data. In the DBN inference algorithm we developed here, we tested different scoring metrics and heuristic search methods, as well as different aspects of data collection and discretization, in order to determine the best configuration for recovering the simulated system. Our analysis provides insight on how to more efficiently use BN inference algorithms for discovering genetic networks from correlational data.

METHODS AND LOGIC

GeneSimulator

GeneSimulator is programmed in Matlab (MathWorks, Inc.). It models genetic regulatory pathways of arbitrary structure (topology) and produces values of gene expression levels at discrete time points. Updates to values at each time step are governed by a simple stochastic process:

$$Y_{t+1} = f(Y_t) = A(Y_t - T) + \varepsilon$$

where $Y_t$ is a vector representing the expression levels of all genes at time $t$, with expression levels ranging from 0 to 100 (arbitrary units). $A$ is a matrix that represents the relationships of gene interactions in the underlying regulatory pathway. For every entry of $A$, the magnitude of the entry represents the strength of the regulation that a regulator gene exerts upon a target gene; the sign indicates the type of regulation, with positive values indicating up-regulation and negative values indicating down-regulation. $T$ is a vector of threshold regulating values for each gene: a regulatory gene exerts an influence on its target gene only to the extent that it differs from its threshold value. In this study all gene thresholds have been set to half of the maximum value; i.e., every entry of $T$ is exactly 50. If the regulator gene is present at a level above its threshold value, then its regulatory effect on its target genes occurs as specified in $A$. Conversely, if the regulator gene is present at a level below its threshold value, then its effect is in the opposite direction of that specified in $A$, to return the gene to its basal level. The $\varepsilon$ term is white noise, drawn uniformly at random from the interval $-10$ to $10$. This term is added for stochasticity and is meant to capture all sources of noise, especially inherent biological noise. If a gene has no regulator (the corresponding entries in $A$ are all zero), then it will move in a random walk, with steps taken according to the values of $\varepsilon$. As the simulation runs, the data are sampled at pre-specified intervals, as one would do in an actual biological experiment, and the samples are exported to a text file.
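The update and sampling scheme described above can be sketched in Python as follows. This is a minimal illustration and not the authors' Matlab code. The function names `simulate` and `sample` are hypothetical. Three assumptions go beyond the text: the regulatory term and the noise are added to the current expression level (an additive reading chosen because an unregulated gene is said to perform a random walk), levels are clipped to the stated range of 0 to 100, and every gene starts at its threshold.

```python
import numpy as np

def simulate(A, n_steps, threshold=50.0, noise=10.0, seed=0):
    """Simulate expression levels; row t of the result is the vector Y_t."""
    rng = np.random.default_rng(seed)
    n_genes = A.shape[0]
    T = np.full(n_genes, threshold)      # every threshold is 50, as in the text
    Y = np.empty((n_steps + 1, n_genes))
    Y[0] = threshold                     # ASSUMPTION: start at the threshold
    for t in range(n_steps):
        eps = rng.uniform(-noise, noise, size=n_genes)  # uniform white noise
        # ASSUMPTION: additive update, so a gene with an all-zero row in A
        # moves in a random walk driven by eps alone.
        step = A @ (Y[t] - T) + eps
        # ASSUMPTION: clip to keep levels inside the stated 0..100 range.
        Y[t + 1] = np.clip(Y[t] + step, 0.0, 100.0)
    return Y

def sample(Y, interval):
    """Keep every `interval`-th time point: Y_0, Y_interval, Y_2*interval, ..."""
    return Y[::interval]

# A[y, x] is the effect of regulator x on target y (sign gives the direction).
A = np.zeros((3, 3))
A[1, 0] = 0.1    # gene 0 up-regulates gene 1
A[2, 1] = -0.1   # gene 1 down-regulates gene 2
Y = simulate(A, n_steps=100)
samples = sample(Y, interval=5)
```

Here `samples` holds 21 rows (time points 0, 5, ..., 100), which plays the role of the exported sample file.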
For example, if we collect data every five time points, then we define the sampling interval to be 5, and the sampled output is the series of expression level vectors $(Y_0, Y_5, Y_{10}, \ldots)$, analogous to data gathered in a microarray time course experiment.

Data Processing and Collection Methods

Discretization: Before being passed to our DBN inference algorithms, the data we collect need to be discretized. Discrete data allow us to model complex non-linear interactions between genes without resorting to computationally prohibitive calculations over continuous distributions. In this study, we discretized the sampled expression levels generated by GeneSimulator from continuous values into various numbers of categories to determine whether finer or coarser discretization improves recovery accuracy. We also evaluated two general types of discretization strategies: hard and soft. Hard discretization employs firm boundaries between categories, requiring a given expression level to belong to only a single category. Soft discretization employs fuzzy boundaries between categories, allowing a given expression level to belong to two or more categories, with a different percentage in each. Other data processing and collection methods are described in the results.

Bayesian Network Inference Algorithms

Our DBN inference algorithms are written in C++ and are designed to search for high-scoring graphical models (networks) that describe probabilistic relationships between variables. The score that is computed for a graph generated from the collected and discretized data is a measure of how successfully the graph explains the relationships in the data and also how simply it does so. Graphs are penalized for over-complexity or over-generality, so there is a resultant bias towards simpler graphs. This guards against over-fitting the model to the data. Every node in the BN graph represents a single variable, here one gene. Every directed edge (a line with an arrow) between two nodes represents a conditional statistical dependence of the child node on the parent node.
In the context of a DBN for recovering a genetic regulatory network, each edge indicates a regulatory relationship in which the parent gene regulates the child gene at a later time.

Basic Theory of BN: A static BN (Friedman et al. 2000) is an acyclic directed graph that encodes a joint probability distribution over $\chi$, where $\chi = \{X_1, \ldots, X_n\}$ is a set of discrete random variables $X_i$. The BN for $\chi$ is specified as a pair $\langle G, \Theta \rangle$. The variable $G$ represents a directed graph whose vertices correspond to the random variables $X_1, \ldots, X_n$. In this graphical representation, each variable $X_i$ is independent of its non-descendants given its parents in $G$. The variable $\Theta$ represents a set of parameters that collectively quantify the probability distributions associated with the variables in the graph. Each parameter of $\Theta$ is specified by $\theta_{x_i \mid pa(X_i)} = P(x_i \mid pa(X_i))$, for each possible value $x_i$ of $X_i$ and each possible value $pa(X_i)$ of $Pa(X_i)$. $Pa(X_i)$ denotes the set of parents of $X_i$ in $G$ and $pa(X_i)$ denotes a particular instantiation of the parents. Thus, a BN specifies a unique joint probability distribution over $\chi$ given by:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$

These notions extend quite naturally to DBNs, which we exploit here in the context of time series data (for more details, see Murphy and Mian 1999).

The problem of discovering a BN from a collection of observed data can be stated as follows. Given a data set $D = \{Y_1, Y_2, Y_3, \ldots, Y_n\}$ of observed instances of $\chi$, find the most probable graph $G$ for explaining the data contained in $D$. One common approach to this problem is to introduce a scoring metric that evaluates how well each graph $G$ explains the data in $D$. In the presence of such a scoring metric, the problem of discovering a BN reduces to the problem of searching for a graph that yields a high score, given the observed data in $D$. To search for the highest-scoring graph, a search method is needed.
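As a small illustration of how a BN factorizes the joint distribution into one conditional term per variable, the following Python sketch multiplies the conditional probability of each variable given its parents. The two-variable network, its probability values, and the names `joint_probability`, `parents`, and `cpts` are all hypothetical and chosen only for the example.

```python
def joint_probability(assignment, parents, cpts):
    """Product over variables of P(value | parent values) for one assignment."""
    p = 1.0
    for var, value in assignment.items():
        # Look up the instantiation of this variable's parents.
        pa_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var][pa_values][value]
    return p

# Hypothetical network: X1 has no parents, X2 has the single parent X1.
parents = {"X1": (), "X2": ("X1",)}
cpts = {
    "X1": {(): {"L": 0.5, "H": 0.5}},
    "X2": {("L",): {"L": 0.8, "H": 0.2},
           ("H",): {"L": 0.25, "H": 0.75}},
}

p_hh = joint_probability({"X1": "H", "X2": "H"}, parents, cpts)  # 0.5 * 0.75
```

Summing `joint_probability` over all four assignments gives 1, as expected for a properly normalized joint distribution.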
Bayesian Scoring Metrics: The Bayesian scoring metric can be generally described (Heckerman 1996) as:

$$\mathrm{Score}(G : D) = \log P(G \mid D) = \log P(D \mid G) + \log P(G) - \log P(D)$$

which states that the score of the graph $G$ given the data in $D$ is the log of the probability of $G$ given $D$. In this study, we investigated two types of scoring metrics that employ different assumptions: the BDe (Bayesian Dirichlet equivalent) and the BIC (Bayesian Information Criterion) scoring metrics. Both scoring metrics have an inherent penalty for over-complexity to guard against over-fitting of data. The BDe score captures the full Bayesian posterior probability $P(G \mid D)$. In this setting, the prior over graphs needs to be specified (we use a uniform prior) and the prior over parameters is Dirichlet, i.e., a prior over the multinomial distributions that describe the conditional probability of each variable in the network. The BIC score, instead of capturing the full posterior probability $P(G \mid D)$, is an asymptotic approximation to it. This approximation is based on a penalized maximum likelihood estimate. With large amounts of data, the BIC is a good approximation to the full posterior (BDe) score and is faster to compute; however, it is known to over-penalize with small amounts of data.

Scoring metrics also involve the generation of a conditional probability table (CPT) for each node. The tables store the occurrences of all combinations of parent-child values extracted from the discretized data (Heckerman 1996). The occurrence values in the tables are called sufficient statistics. For a given graph, a CPT is made for each node. The occurrence values in the tables are then used to calculate the score for each node. Scores for all nodes are then summed to generate the score for the entire graph.

Search Methods

To identify BN structures with high scores, search methods are employed that look for the highest-scoring graph among a set of graphs using different heuristics. Heuristic search methods are needed because identifying the highest-scoring network using scoring metrics is NP-complete. Heuristic searches are iterative and thus can be run indefinitely and stopped at any time to reveal the highest-scoring graph visited thus far. The longer the search, the more likely it is to find a higher-scoring graph. A suitable cutoff for running time is found empirically, where longer running times do not result in significant changes to the highest-scoring graph found. In our study, we tested three heuristic search methods: 1) greedy search with random restarts (Heckerman 1996), 2) simulated annealing (Heckerman 1996), and 3) a genetic algorithm (developed in this study).
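The per-node scoring described above (count parent-child combinations, score each node, sum over nodes) can be sketched for the BIC metric as follows. This is an illustrative sketch under stated assumptions, not the authors' C++ implementation: it uses the standard textbook form of BIC (maximized log-likelihood minus half the log of the sample count per free parameter), it pairs parent values at one time point with child values at the next time point in the manner of a first-order DBN, and the function names `bic_node_score` and `bic_graph_score` are hypothetical.

```python
import math
from collections import Counter

def bic_node_score(data, child, parents, n_categories):
    """BIC score of `child` at time t+1 given `parents` at time t.

    `data` is a sequence of rows, one per sampled time point, each row
    holding the discretized category of every gene.
    """
    counts = Counter()                       # sufficient statistics
    for prev, curr in zip(data[:-1], data[1:]):
        pa_config = tuple(prev[p] for p in parents)
        counts[(pa_config, curr[child])] += 1
    parent_totals = Counter()
    for (pa_config, _), n in counts.items():
        parent_totals[pa_config] += n
    # Maximized log-likelihood from the observed parent-child counts.
    log_lik = sum(n * math.log(n / parent_totals[pa])
                  for (pa, _), n in counts.items())
    n_samples = len(data) - 1
    # Free parameters: one multinomial per parent configuration.
    n_params = (n_categories ** len(parents)) * (n_categories - 1)
    return log_lik - 0.5 * math.log(n_samples) * n_params

def bic_graph_score(data, parent_sets, n_categories):
    """Sum of node scores; `parent_sets` maps each node to a tuple of parents."""
    return sum(bic_node_score(data, child, pa, n_categories)
               for child, pa in parent_sets.items())

# Toy data: gene 1 copies the category that gene 0 had one step earlier.
g0 = [t % 3 for t in range(31)]
g1 = [0] + g0[:-1]
data = list(zip(g0, g1))
with_parent = bic_node_score(data, child=1, parents=(0,), n_categories=3)
without_parent = bic_node_score(data, child=1, parents=(), n_categories=3)
```

On this toy data the score with gene 0 as parent is higher than the score with no parent, because the likelihood gain outweighs the complexity penalty. The penalty term grows with the number of parent configurations, which is one way to see why BIC can drop true edges when data are scarce.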
As in Heckerman (1996), for each type of search we used $E$ to denote the set of eligible changes to a graph and $\Delta(e)$ to denote the change in score of a graph resulting from the modification $e \in E$, for every eligible change $e$. In addition, we created a hash table (a lookup table) to store the calculated scores for each node with a certain parent set. We found that this improved the performance of the search significantly, saving computational time by avoiding the recalculation of sufficient statistics in the CPTs when revisiting a previous set of parents for a given node.

Greedy search with random restarts initializes itself by choosing a random graph, then evaluates the change in score $\Delta(e)$ associated with every possible change $e \in E$, and finally selects the change for which $\Delta(e)$ is maximized, provided the maximal $\Delta(e)$ is positive. It proceeds in this fashion until all $\Delta(e)$ are negative and no score improvement can be made. To escape this local maximum, the algorithm then restarts from another random graph, and the entire process is repeated until the total number of iterations is reached.

Simulated annealing also initializes itself by choosing a random graph, but is given an initial temperature $T_0$, a search parameter. An eligible change $e \in E$ is selected at random and the probability expression $p = \exp(\Delta(e) / T_0)$ is evaluated. If $p > 1$ (which occurs whenever $\Delta(e)$ is positive), then the change $e$ is made; otherwise, the change $e$ is only made with probability $p$. The procedure begins at a very high temperature so that almost every eligible change in the graph can be made. As the search progresses, the temperature gradually decreases until a very low temperature is reached, where very little change is made in the graph. The search then performs similarly to the local searches of the random greedy method.

A genetic algorithm (GA) (Goldberg 1989) is a search method using three operators to explore a space of solutions or, in our case, a set of graphs.
The three operators are: reproduction, which promotes the best graphs to the next generation; mutation, which explores new graphs by introducing variation in the population to avoid local optima; and crossover, which selects a swapping point in the parents and exchanges information between them to generate two new graphs, thereby increasing the average quality of a population. We are not aware of a GA being applied to BN searches, and thus explain the operations we implemented in more detail. In our genetic algorithm, we created a mutation operation that makes a local change in any possible edge in the graph structure. We created a crossover operation that swaps parts of two graphs. Graph structures are specified as the set of parents for every node, where graph $i$ is denoted as $\{Pa_i(X_1), Pa_i(X_2), \ldots, Pa_i(X_n)\}$ and graph $j$ as $\{Pa_j(X_1), Pa_j(X_2), \ldots, Pa_j(X_n)\}$. To cross over, a randomly chosen variable $X_k$ becomes the swap point, leading to two new structures, graph $i' = \{Pa_i(X_1), \ldots, Pa_i(X_k), Pa_j(X_{k+1}), \ldots, Pa_j(X_n)\}$ and graph $j' = \{Pa_j(X_1), \ldots, Pa_j(X_k), Pa_i(X_{k+1}), \ldots, Pa_i(X_n)\}$. For each GA iteration, either a mutation or a crossover operation is chosen at random, and the newly created graphs are reproduced in the next generation if they have higher scores than the current graphs in the stored population. As it is possible for crossovers to create bi-directional edges, we check for and eliminate such graphs.

Influence Score: Many BN inference algorithms applied to molecular biology are useful for predicting which genes regulate which others, but often do not predict the magnitude or even the sign of regulation (an exception is Hartemink et al. 2001). Here, using a different approach, we took advantage of the logic of the CPTs. The occurrence numbers in CPTs suggest relationships between nodes. We generated CPT-like tables, which we call influence tables, for each pair of connected nodes in the highest-scoring graph.
The values in the influence tables are occurrences from the discretized data according to the highest-scoring graph. An example of an influence table, when there is only one parent node per child node, is shown in Table 1. Here, three-category discretization was used: low (L), medium (M), and high (H). The occurrence values in the table are described as combinations; for example, HL is the number of times in the data that the parent is high and the child is low.

                  Parent
               L     M     H
Child    L     LL    ML    HL
         M     LM    MM    HM
         H     LH    MH    HH

Table 1: An example influence score table. The table is a modified version of a CPT without prior knowledge information. Both the parent and child states are discretized as L, M, or H. The values in the cells represent the occurrence of each parent-child combination, where the first descriptor (L, M, or H) is the parent and the second (L, M, or H) is the child. For clarity, this table contains the simple case where there is only one parent per child. In practice, the table has more dimensions to account for multiple parents.

When, for a given node, the occurrences of LL and HH are greater than the occurrences of LH and HL, the parent gene can be said to be an activator of the child gene; in the opposite case, the parent is a repressor. This holds even without considering the MX values. Thus, to calculate an influence score from these occurrence values, we used the following formulas:

$$LL\_HL = \frac{LL - HL}{LL + HL} \quad \text{and} \quad HH\_LH = \frac{HH - LH}{HH + LH}$$

$$LL\_LH = \frac{LL - LH}{LL + LH} \quad \text{and} \quad HH\_HL = \frac{HH - HL}{HH + HL}$$

The signs (+ or −) of the numerators determine the sign of the gene regulation. When LL − HL and HH − LH are greater than 0, and LL − LH and HH − HL are greater than 0, the parent is considered an activator and gets a + sign. When these values are all less than 0, the parent receives a − sign. For mixed combinations, in which one value of a pair is greater than 0 and the other is less than 0, the score becomes less positive or less negative, or just 0 (no sign can be determined). The denominator normalizes the numerator so that each value lies between −1 and 1. With four such values, the final sign and magnitude of the influence are calculated with the following rules. First, set the influence score to 0. If both LL_HL and HH_LH are greater than 0 (LL and HH dominate), add the magnitude (LL_HL × HH_LH)/2 to the influence score. If both LL_HL and HH_LH are less than 0 (HL and LH dominate), subtract (LL_HL × HH_LH)/2 from the influence score. Otherwise, leave the influence score unchanged. Perform the same operations for LL_LH and HH_HL. Finally, sum the values from the two sets of operations to obtain the final influence score. These two symmetric steps are used to average the effects from different directions.
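The rules above, for the single-parent case, can be written compactly as the following Python sketch. The function name `influence_score` and the dictionary layout (two-letter keys, parent state first, as in Table 1) are hypothetical, and returning 0 for a contrast whose denominator is 0 is an assumption that the text does not address.

```python
def influence_score(counts):
    """Influence of one parent on one child from LL, HH, LH, HL occurrences."""

    def contrast(a, b):
        # Normalized difference in [-1, 1], e.g. (LL - HL) / (LL + HL).
        total = counts[a] + counts[b]
        # ASSUMPTION: treat an empty pair of cells as carrying no evidence.
        return (counts[a] - counts[b]) / total if total else 0.0

    def half(u, v):
        # Contribute +u*v/2 if both are positive, -u*v/2 if both are
        # negative, and nothing when the signs disagree or either is 0.
        if u > 0 and v > 0:
            return (u * v) / 2
        if u < 0 and v < 0:
            return -(u * v) / 2
        return 0.0

    ll_hl, hh_lh = contrast("LL", "HL"), contrast("HH", "LH")
    ll_lh, hh_hl = contrast("LL", "LH"), contrast("HH", "HL")
    return half(ll_hl, hh_lh) + half(ll_lh, hh_hl)

activator = influence_score({"LL": 40, "HH": 40, "LH": 10, "HL": 10})
repressor = influence_score({"LL": 10, "HH": 10, "LH": 40, "HL": 40})
```

With LL = HH = 40 and LH = HL = 10, each of the four contrasts is 0.6, each half contributes 0.18, and the score is 0.36; swapping the counts gives −0.36. Each half is bounded by 0.5 in magnitude, so the sum stays within −1 and 1.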
If several parents are present, the same calculation is done with the other parents fixed in each possible configuration; the average of these influence scores is then used. Regardless of the number of levels of discretization, only the lowest and highest categories are included in the calculation. In the above manner, the influence score is mapped to a range between −1 and 1. Stated in terms of gene expression, positive numbers reflect activating relationships of a parent on a child gene, while negative numbers reflect repressing relationships. The magnitude of the influence score is suggestive of the gene regulation strength: the more positive the score, the stronger the up-regulation; the more negative the score, the stronger the down-regulation. When the influence score is close to 0, it is difficult to infer the type of regulation (up or down).

Figure 1: Regulatory network and gene expression time sequence plot.

RESULTS

We first present results generated by the simulator, then examine different configurations for the BN inference algorithm using the same simulated data set. We then examine different sampling and data processing methods using different data sets.

GeneSimulator

Using GeneSimulator, we simulated a genetic regulatory system defined in a matrix $A$ of relationships with the structure shown in the left half of Figure 1. Twelve of twenty genes (genes 0 to 11) were connected in a regulatory pathway. In addition, a feedback connection, from gene 7 to gene 0, was included. The other eight genes (genes 12 to 19) were not connected to other genes; i.e., they were independent of all other genes. The absolute regulation strength for each connection was set to be the same: 0.1. Consequently, in the matrix $A$, for every up-regulation of gene $y$ by gene $x$, $A(y,x) = 0.1$; if the relationship is one of down-regulation, then $A(y,x) = -0.1$. If there is no connection between genes $x$ and $y$, the corresponding entry is $A(y,x) = 0$.
To show how GeneSimulator works, we ran it for 500 time points and plotted expression levels for four of the regulated genes in the right half of Figure 1. The results are consistent with the relationships specified by the original structure: when gene 4 increases, gene 5 decreases, consistent with gene 4 down-regulating gene 5; gene 6 engages in a random walk, consistent with it having no regulator; when gene 5 is high and gene 6 is low, gene 9 is high, consistent with gene 9 being both up-regulated by gene 5 and down-regulated by gene 6. Moreover, the changes in expression occur over a series of time steps. Although our time steps are unitless, if each is considered to be one minute, they match well the time scale noted for gene expression in actual biological systems (Jarvis and Nottebohm 1997). This underlying regulatory structure was used for all of the experiments, except those in the network complexity section. As such, throughout the paper Figure 1 (left) is to be compared with all other figures.

Bayesian Network Inference Algorithms

Initially we sampled data from GeneSimulator output on a small scale (20 or fewer sampled time points), matching what would be done experimentally using microarrays or other approaches. However, we found that such small sample sizes, without further data processing as described in the Data Collection Methods section below, were not sufficient for recovering the simulated structure. In this section, to more effectively evaluate the BN and the different configurations we made, we ran GeneSimulator with the system of Figure 1 (left) for 10,000 time points and sampled at an interval of every 5 time points, yielding a total of 2000 sampled time points.

Bayesian Scoring Metrics: To compare the BDe and

BIC scoring metrics, we chose to use a 3-category hard discretization of the data and a greedy search method with 1000 random restarts. Compared with the original simulated structure in Figure 1 (left), the BN inference algorithm under these conditions had remarkably good recovery using either scoring metric (Figure 2). All genes (nodes) in the connected networks and most interactions (edges) were recovered. For the BDe scoring metric, the top graph (highest-scoring graph) had exactly the same regulatory structure as the simulated system. For the BIC scoring metric, the top graph had two edges missing. We conclude that the BDe score works better than the BIC score in recovering the underlying simulated genetic regulatory pathway given this quantity of sampled data. The missing edges under the BIC score are consistent with its known over-penalization of model complexity.

Figure 2: Comparison of graphs found using the BDe (left) and BIC (right) scoring methods. Shown are the top graphs found. Red: correct edges found by one method and not the other. The arrows in these graphs only specify direction, not positive or negative interactions.

Influence Score: We found that our heuristic influence score function worked, and it gave reasonable results when compared with the type of regulation (up or down) in the simulated system (Figure 3). The signs of the interactions (+ and −) were all correctly identified (compare Figure 1 with Figure 3). The absolute magnitudes of the influence scores in the recovered networks ( ) and the regulatory strengths specified in the simulated system (0.1), although in the same range, were not directly comparable. Furthermore, in the recovered network, when a node had more than one parent, the edges from those parents had lower influence score magnitudes than if the node had just one parent. This is the case because the influence of one parent on its child is obscured by the influences from the other parent, and the resulting difference is an artifact of how the influence score is calculated. Despite these weaknesses, the influence scores still reflected the gene relationships relatively well. For genes with one versus two parents, the influence scores of the recovered edges are similar within each case (~0.6 for single parents; ~0.3 for two parents, Figure 3), as in the simulated network. We conclude that the influence score is capable of recovering the sign of the interaction, but its relative magnitude depends upon the number of parents.

Search Methods: To compare heuristic search methods, we used the BDe score with 3-category hard discretization. Because it is difficult to set a fair stopping criterion for different search methods, we let each of them run long enough that they did not make any improvement for many iterations (determined empirically). We compared three search methods: greedy search with random restarts, simulated annealing, and a genetic algorithm. All three yielded the same top graphs with exact matches to the simulated system, as in the left of Figure 2. However, we found that with this data set greedy search took the least time (minutes) to find the correct graph with the highest score; simulated annealing took longer (tens of minutes); and the genetic algorithm took the greatest amount of time (~hours). We conclude that the three search methods worked equally well in terms of the top graph found, but the greedy search is best as it can find the top graph in the least amount of time.

Data Processing and Collection Methods

For this section, we used the BDe scoring metric and the greedy search method with 1000 random restarts.

Discretization: Using the same data set as above, we compared the performance of the BN inference algorithm with hard discretization into different numbers of categories: 2, 3, and 4. The top graph found for each is shown in Figure 4. All of these discretization approaches allowed recovery of graphs relatively similar to the simulated system. Only the 3-category discretization found exactly the same regulatory graph as the simulated system.
In the case of the 2-category discretization, extraneous edges were found. These likely emerged because too much of the information contained in the data was lost in the overly coarse discretization. On the other hand, the 4-category discretization led to some difficulty recovering edges whose children had multiple parents, and instead it found incorrect edges from their grandparents. We believe this occurs because increasingly fine levels of discretization spread out the data entered in the CPTs, such that individual occurrence values are weakened. Strengthening these values would, intuitively, require more data.

Figure 3: Graph with influence scores. Numbers beside the edges are influence scores from the top BDe-generated graph of Figure 2.

Figure 4: Comparison of number of discretization categories, with hard boundaries. Black: correct edges in common between all three graphs. Red: correct edges found in only one or two of the graphs. Blue: incorrect edges. Numbers beside edges are influence scores.

We next compared hard with soft discretization using 3 categories (Figure 5). Compared with hard discretization, soft discretization missed edges to genes with 2 parents; only one of the parents could be found. We conclude that with our simulated data the 3-category hard discretization works best in allowing recovery of the simulated genetic pathway. All subsequent analysis below uses 3-category hard discretization.

Figure 5: Comparison of hard (left) and soft (right) discretization. Red: correct edges found in the top graph by one method and not the other. Numbers beside edges are influence scores.

Sampling Amount, Intervals, and Coverage: Since sampling amount, intervals, and coverage are dependent on each other, it is not possible to test their effects on BN recovery independently. However, their effects can be tested through multiple comparisons. First, we sampled different data amounts at the same interval, 5, which effectively changes time coverage across the simulated output (Figure 6A). Second, we sampled different data amounts with the same overall coverage, 10,000 time points, which effectively changes the sampling interval (Figure 6B). Third, we sampled at different intervals with the same data amount, 500 data points, which effectively changes the overall coverage (Figure 7). In each case, two out of three variables change at the same time. Thus, when the recovery result is different between graphs of the different comparisons (Figures 6A, 6B, and 7), the responsible variable is one or both of the changing variables.
To decide with a degree of confidence which one it is, we attribute the common differences between graphs of two different changing situations (for example, the common edges in the first graphs of Figures 6A and 6B compared with their differences in the second graphs of 6A and 6B) to the effect of the common variable that changes in both. As expected, the more data collected, the better the recovery of the simulated system (compare within both Figures 6A and 6B). Increased coverage does not appear to significantly improve recovery (compare within both Figures 6A and 7). This counterintuitive result implies that the coverage across the simulation is already well represented in our sampling ranges. Interestingly, when changing the sampling interval, there appears to be an optimum (an interval of ~5) that led to the best recovery (compare within both Figures 7 and 6B). At an interval of 1 (taking every time point, but with a small coverage time of 500), fewer correct edges were found. When the sampling interval was increased to 5, more correct edges were found. But as the sampling interval increased further (to 10 and 20), incorrect edges appeared (Figures 7 and 6B), even though in one case there is more coverage (Figure 7). The explanation for these differences in Figure 7 is that the small sampling interval yields small coverage, giving little information about any two points spread distantly in time, while the large sampling interval, although it yields large coverage, loses a large amount of information between any two points. In all types of comparisons, it was difficult to recover both edges of genes with multiple parents (genes 0 and 1 to 2; genes 5 and 6 to 9; and genes 8 and 9 to 10). The variable that had the strongest effect on the recovery of edges from both parents was the amount of data sampled (compare Figures 6A, 6B, and 7). The errors found with large sampling intervals were not completely erroneous, but were all incorrect edges from grandparents (Figures 6B and 7). There is also an interval effect on our influence score, such that the score increased as sampling intervals increased, until the scores were reduced by the appearance of incorrect multiple parents at very high sampling intervals (Figures 6A and 7). The initial increase in the influence score with interval makes sense because the magnitude of the difference in expression of a gene between two time points increases as the interval increases. We conclude that, with our data set, sampling amount and interval have the largest effect on BN recovery. There is an optimum interval, which will depend upon the underlying timing of the gene regulatory pathway under study.

Figure 6: Comparison of changes in data amount and coverage with the same interval (A) or changes in data amount and interval with the same coverage time (B). From left to right, the number of data points increases. Top graphs are shown. Numbers beside edges are influence scores. Red: correct edges found in only one or two graphs. Blue: incorrect edges.

Figure 7: Comparison of sampling intervals. Labeling is the same as described in the Figure 6 legend. Numbers beside edges are influence scores.

Biological Sampling and Interpolation of Data Points: In an actual biological gene expression experiment, the amount of data collected is much smaller than that used in the above tests. To mimic the maximum amount of data that could reasonably be collected in a microarray experiment, we limited the total number of data points to that from 100 cDNA microarray slides. With this pre-determined total number, we investigated the best possible way to allocate the animal samples. For example, should we sample 100 time points with one animal at each time point, or 25 time points with four animals at each time point? For this amount of data, we found that the recovery results changed with different data sets from repeated experiments, unlike the bigger data sets above, which yielded quite stable results from repeated experiments. To better understand this variation, we collected 10 different data sets for each of these tests, passed each one separately through the BN algorithms, and obtained the average results (Figure 8A).
The edges that appear frequently across the 10 data sets are of greater confidence. Here we used a sampling interval of 5 because this worked best in our previous tests. With this amount of data, only a partial understanding of the regulatory system was recovered, and many originally independent genes were incorrectly included in the resultant graph (Figure 8A). As the number of animals increased, there was no important change until 10 animals per time point at 10 time points. The number of edges found then decreased dramatically, for both correct and incorrect edges. The number of genes included in the graph also decreased, but in this case mostly for those that were not part of the original regulatory system. Another interesting result is that most of the incorrect edges revealed (both solid and dashed lines) had influence scores of 0 (Figure 8A, scores not shown for dashed lines), and this can be used to eliminate such edges.

Figure 8: More biologically reasonable sampling with different strategies (A) and performance of interpolating data (B). Shown are the average top graph results from 10 data sets each. Black dashed edge: found only once in 10 recovery results. Black solid edge: found more than once but fewer than 5 times in 10 recovery results. Red solid edge: found 5 or more times in 10 recovery results. Numbers beside edges are occurrence (left of slash) and average influence score (right of slash); these appear only beside edges found more than once.

Figure 9: Data interpolation for smaller amounts of data: 50 (middle) and 25 (right) data points compared with 50 uninterpolated data points (left). Labeling is the same as described in the Figure 8 legend.

We tested whether the recovery performance would be improved by interpolating data points. We kept the sampling method the same, except for the sampling interval, which we increased to 20 to leave time between sampled points for interpolation. This increased the coverage but kept the amount of sampled data the same. We linearly interpolated five data points between every two sampled data points, resulting in a six-fold increase in the amount of available data, though only about one sixth of it corresponds to actual sampled data. The corresponding results are shown in Figure 8B. Regardless of the sampling approach, we found that recovery results were much improved compared with the results from the raw data points without interpolation. The genes in the connected network were correctly identified and no independent genes were implicated. Most of the correct edges were found with high occurrence and all with the correct signs of the influence score.
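The interpolation step described above can be sketched as follows. This is a minimal illustration using NumPy, assuming the sampled data are stored as a time-by-gene array; the function name `interpolate_between` and the array layout are assumptions, not a description of the code used in the study.

```python
import numpy as np


def interpolate_between(samples, n_between=5):
    """Linearly insert n_between points between consecutive rows.

    samples: array of shape (time_points, genes), one row per sampled time.
    Returns an array with (T - 1) * (n_between + 1) + 1 rows.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    if n < 2:
        return samples.copy()
    old_t = np.arange(n)
    # n_between new positions inside every unit gap, endpoints kept.
    new_t = np.linspace(0, n - 1, (n - 1) * (n_between + 1) + 1)
    columns = [np.interp(new_t, old_t, samples[:, g])
               for g in range(samples.shape[1])]
    return np.column_stack(columns)


# Two genes sampled at two time points; five points are added in between.
tiny = np.array([[0.0, 10.0], [6.0, 4.0]])
dense = interpolate_between(tiny)
```

With 100 sampled time points this yields 595 rows, of which 100 are actual samples. The interpolated rows add no new measurements; they only smooth the transitions that the discretized time series presents to the inference algorithm.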
The principal missing connections still involved nodes with multiple parents. Some of the incorrect edges were actually found from a grandparent, such as from gene 7 to 2, gene 4 to 9, and gene 3 to 10. Though technically incorrect, they can supply useful information. We conclude that when using biologically reasonable expression data sampling, BN inference algorithms appear able to partially recover the nodes of the underlying genetic pathway and their single-parent interactions, especially when the data are interpolated. Our BN inference algorithm appears to be less effective at recovering multiple parents per node within the constraints of biologically reasonable amounts of data.

To investigate this more closely, we tested the data interpolation performance with an even smaller data amount (Figure 9). Using 50 data points without interpolation, the genetic system was poorly recovered (Figure 9, left). However, when using 50 data points with interpolation (Figure 9, middle), the recovery was much better, even better than that from 100 data points without interpolation (Figure 8A). From 25 data points without interpolation, the recovery result was too messy to be shown, but with data interpolation (Figure 9, right), the recovery is comparable with that from 100 data points without interpolation. All the results presented in this section were computed using the BDe score function; when we tried using the BIC score function, no edges were found, again confirming that the BIC over-penalizes complexity and that this excessive penalty is most apparent in the limit of small amounts of data.

Non-uniformly Sampled Data: Many biological experiments sample gene expression data in a non-uniform manner. We tested the effect of non-uniform sampling on BN recovery. Here the interval between successive samples is a number chosen at random between 15 and 25. With a small amount of data, we found that non-uniform interval data sets can recover part of the simulated system only when data interpolation was used (Figure 10).
The recovery results are comparable with those from uniformly sampled data sets.

Figure 10: Recovery results from a non-uniformly sampled data set. Labeling is the same as described in the Figure 8 legend. The intervals between successive samples were chosen at random between 15 and 25.
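A non-uniform sampling schedule of this kind could be generated and interpolated roughly as below. This sketch assumes integer gaps drawn uniformly from 15 to 25 inclusive and five interpolated points per gap; the text does not specify either detail, and both function names are hypothetical.

```python
import numpy as np


def nonuniform_sample_times(n_points, low=15, high=25, seed=0):
    """Sample times whose successive gaps are random integers in [low, high]."""
    rng = np.random.default_rng(seed)
    gaps = rng.integers(low, high + 1, size=n_points - 1)  # high is inclusive
    return np.concatenate(([0], np.cumsum(gaps)))


def interpolate_nonuniform(times, values, n_between=5):
    """Linearly insert n_between points inside every gap of a single series."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Each segment holds the left endpoint plus n_between interior points.
    segments = [np.linspace(t0, t1, n_between + 1, endpoint=False)
                for t0, t1 in zip(times[:-1], times[1:])]
    new_times = np.concatenate(segments + [times[-1:]])
    return new_times, np.interp(new_times, times, values)


times = nonuniform_sample_times(50)
expression = np.sin(times / 40.0)  # stand-in for one gene's sampled expression
dense_times, dense_values = interpolate_nonuniform(times, expression)
```

Because the interior points are spaced evenly within each gap, the interpolated series is still unevenly spaced overall; whether to resample onto a uniform grid instead is a design choice the text leaves open.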

Complexity of the Network: The pathway that we simulated has perhaps less complexity than is likely to be found in biological systems. To test some additional complexity, we used GeneSimulator to generate data for a pathway that includes one node with three parents instead of two (gene 2), as well as two independent genetic pathways (Figure 11, left). Using our standard conditions of small-data sampling, we were able to recover most of both independent pathways (not shown). However, to recover the three parents of one child, we found that we needed to collect at least 5000 data points (Figure 12, right), and the recovered edges had very low influence scores (red). We conclude that BN inference algorithms are able to recover independent genetic pathways inherent in one data set, but that as the number of parents per node grows, the amount of expression data needed to recover the entire network will grow dramatically, perhaps to an extent that the collection of this data becomes impractical.

Figure 12: A more complex network. The underlying regulatory network (left) and the recovered one augmented with influence scores (right).

DISCUSSION

In this study, we tested the ability of BN inference algorithms, and of our modifications to them, to recover genetic pathways from simulated gene expression data. We found that the structure of the underlying network and the methods used for data processing and sampling had a critical impact on the ability to recover accurate pathways. This conclusion is important to consider when using our algorithm on actual gene expression data, such as that collected using high-density microarrays. We found that the best BN inference algorithm for recovering the simulated genetic pathways was a greedy search method with random restarts, employing the BDe scoring metric and given data discretized using 3-category hard discretization. In addition, our influence scores were very useful in determining up or down regulation and the relative magnitude of regulation.
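As an illustration of what a 3-category hard discretization involves, the sketch below assigns each expression value to exactly one of three bins. It assumes the bin boundaries are placed at the tertiles of each gene's values; this section does not state how the boundaries were chosen, so the quantile rule and the function name are assumptions.

```python
import numpy as np


def hard_discretize(values, n_categories=3):
    """Map continuous values to integer categories 0 .. n_categories - 1.

    Boundaries are the empirical quantiles (an assumed rule), and each value
    falls into exactly one category, which is what makes the boundaries hard.
    """
    values = np.asarray(values, dtype=float)
    quantile_levels = np.linspace(0, 1, n_categories + 1)[1:-1]
    cuts = np.quantile(values, quantile_levels)
    return np.digitize(values, cuts)


levels = hard_discretize(np.arange(9))
```

For the nine evenly spaced values in the example, the three lowest fall into category 0, the middle three into category 1, and the highest three into category 2.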
We also found that there was an optimal sampling interval and an efficient amount of data that allowed our BN inference algorithm to recover the most accurate pathway. We found that the BIC score does not work as well as the BDe score because the BIC score yields fewer edges, or no edges at all, with biologically reasonable sampling. This happens because BIC applies a strong penalty for more complex structures.

One of the key improvements we made to our inference algorithm for use with gene expression data is the generation of influence scores. These scores can distinguish positive from negative interactions and select against low-scoring incorrect interactions. This improves the accuracy of the network and increases the amount of information recovered.

Three-category discretization was optimal for the simulated data set we examined. This may seem counterintuitive, as discretization leads to information loss from the data: the coarser the discretization, the greater the information loss. While this is true, it does not follow that finer discretization results in better BN recovery. With more discretization levels, the CPT values are spread more thinly and more BN parameters must be estimated; as a result, more data points are needed to confidently predict edges between nodes. In addition, discretization with hard boundaries outperformed discretization with soft boundaries. It is possible that, because soft discretization also spreads out the distribution of the data, some of the BN relationships are not significant enough under soft discretization to warrant inclusion of an edge.

Based on the results of this study, we believe the major limitation in discovering regulatory networks from gene expression experiments is collecting a sufficient quantity of data to effectively recover all the interactions in the genetic network, especially those associated with genes with multiple parents. Data collection in the context of complex organisms is limited by the physical constraints of the experimenter and the number of animals that can be sacrificed for any one given experiment.
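The parameter growth behind both the discretization result and the BIC penalty can be made concrete with a small calculation. The sketch assumes that a node and all of its parents share the same number of discretization levels, and it uses the standard BIC complexity term of half the number of free parameters times the log of the sample size; the function names are illustrative.

```python
import math


def free_parameters(n_levels, n_parents):
    """Free CPT parameters for one node.

    Each of the n_levels ** n_parents parent configurations needs
    n_levels - 1 independent probabilities.
    """
    return (n_levels - 1) * n_levels ** n_parents


def bic_penalty(n_params, n_samples):
    """Complexity term subtracted from the log-likelihood in the BIC score."""
    return 0.5 * n_params * math.log(n_samples)


two_parents_3_levels = free_parameters(3, 2)    # 18
three_parents_3_levels = free_parameters(3, 3)  # 54
two_parents_5_levels = free_parameters(5, 2)    # 100
penalty_100_samples = bic_penalty(two_parents_3_levels, 100)
```

Going from two to three parents triples the number of parameters for a single node, and moving from three to five levels with two parents raises it more than five-fold, so the same number of observations is divided among many more CPT entries.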
Although we found here that our BN inference algorithm will discover genetic regulatory pathways from biologically reasonable amounts of data, the generated pathways can be misleading. Fortunately, interpolation of data from these smaller sample sizes, with good coverage of the simulated gene expression pathway, can at least yield a significant number of correct genes and their interactions. Moreover, many of the incorrect interactions found are from grandparents, which at least places genes only one gene removed from their true regulator.

However, our BN inference algorithm had difficulty identifying more than one parent for genes with multiple parents. This is a serious problem for genetic pathway recovery, as combinatorial regulatory control is a basic property of genetic pathways. The solution to this problem lies in not attempting to recover complex pathways from limited amounts of expression data alone. Other types of data can be brought to bear and, when used in conjunction with expression data, can significantly enhance the ability to accurately recover regulatory network structures (Hartemink et al. 2002).

Finally, we suspect that the performance of our BN inference algorithm depends in part on how the networks are simulated. Though we included noise in generating our simulated data, the data are clearly not a perfect simulation of true experimental data; some meaningful biological information is always lost in mathematical modeling. Future efforts need to focus on using more real data to improve how well the output of the simulator matches real-world data. Effort is also needed on simulating the interaction of gene products within the same cell versus across cells. As such, the recovery results produced by our BN inference algorithm are not intended to serve as a substitute for gene intervention experiments, but rather as a guide for optimally designing easier-to-perform correlational experiments for use with inference algorithms, and for approximating the accurate genetic pathway, leading to easier-to-test intervention experiments.

ACKNOWLEDGMENTS

We thank Kurt Grandis and Derek Scott of Duke University for assistance in the beginning stage of this project. This research is supported by a Whitehall Foundation grant to EDJ.

REFERENCES

D'haeseleer, P., Wen, X., Fuhrman, S. and Somogyi, R., Linear modeling of mRNA expression levels during CNS development and injury. (1999) Proceedings of the Pacific Symposium on Biocomputing. 4.
Friedman, N., Murphy, K. and Russell, S., Learning the Structure of Dynamic Probabilistic Networks. (1998) Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence.
Friedman, N., Nachman, I. and Pe'er, D., Learning Bayesian Network Structure from Massive Datasets. (1999) Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
Friedman, N., Linial, M., Nachman, I. and Pe'er, D., Using Bayesian Networks to Analyze Expression Data. (2000) Journal of Computational Biology. 7.
Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning. (1989) Addison-Wesley, Reading, MA.
Hartemink, A. J., Gifford, D., Jaakkola, T. and Young, R., Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks. (2001) Pacific Symposium on Biocomputing.
Hartemink, A. J., Gifford, D., Jaakkola, T. and Young, R., Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models. (2002) Pacific Symposium on Biocomputing.
Heckerman, D., A Tutorial on Learning with Bayesian Networks. (1996) Technical Report MSR-TR-95-06, Microsoft Research, March 1995 (revised November 1996).
Jarvis, E. D. and Nottebohm, F., Motor-driven gene expression. (1997) Proceedings of the National Academy of Sciences of the United States of America.
Jarvis, E. D. et al., Integrate Songbird Brain. (2002) Journal of Comparative Physiology A (in press).
Murphy, K. and Mian, S., Modeling Gene Expression Data Using Dynamic Bayesian Networks. (1999) Technical Report, University of California, Berkeley.
Smith, V. A., Jarvis, E. D. and Hartemink, A. J., Evaluating Functional Network Inference Using Simulations of Complex Biological Systems. (2002) Accepted by the 10th International Conference on Intelligent Systems for Molecular Biology.
Smith, V. A., Jarvis, E. D. and Hartemink, A. J., Influence of Network Topology and Data Collection on Functional Network Inference. (2002) Pacific Symposium on Biocomputing (in press).
Somogyi, R. and Sniegoski, C. A., Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation. (1996) Complexity. 1.
Weaver, D. C., Workman, C. T. and Stormo, G. D., Modeling regulatory networks with weight matrices. (1999) Proceedings of the Pacific Symposium on Biocomputing. 4.
