Predicting Transcription Factor Binding Sites with an Ensemble of Hidden Markov Models

Vol. 3, No. 1, Fall, 2016, pp. 1-10
ISSN 2158-835X (print), 2158-8368 (online), All Rights Reserved

Yinglei Song (1) and Albert Y. Chi (2)

(1) School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu 212003, China. Email: yingleisong@gmail.com
(2) Department of Mathematics and Computer Science, University of Maryland Eastern Shore, Princess Anne, MD 21853, USA. Email: albertchisquare@gmail.com

Abstract

Transcription Factor Binding Sites (TFBS) are important for a number of biological processes such as gene expression and regulation. One fundamental problem in bioinformatics is to develop software tools that can identify TFBSs accurately and rapidly. In practice, an exhaustive search of all possible combinations of subsequences is too time consuming to be applied. A large number of heuristic or approximation algorithms and machine learning based approaches have been developed for this problem. However, none of them has achieved satisfactory prediction accuracy. In this paper, we develop a novel approach that can efficiently explore the space of all possible locations of TFBSs in a set of sequences with high accuracy. The exploration is carried out with an ensemble of a few Hidden Markov Models (HMMs). The ensemble is initially constructed through local alignments of two sequences in the set; each HMM in the ensemble is then progressively aligned to the other sequences in the set. The parameters of the HMMs in the ensemble are updated based on the alignment results. Our experimental results show that this approach can achieve higher accuracy than existing state-of-the-art approaches with satisfactory efficiency.

Keywords: Hidden Markov Model (HMM); Motif finding; Transcription factor binding site; ensemble approach

I. INTRODUCTION

Transcription Factor Binding Sites (TFBS) are subsequences found in the upstream region of genes in DNA genomes.
A transcription factor, which is a specialized protein molecule, may bind to the nucleotides in these subsequences and thus affect the relevant biological processes. Research in molecular biology has revealed that transcription factor binding sites are important for many biological processes, including gene expression and regulation. Thus, accurate identification of TFBSs is important for understanding the biological mechanisms of gene expression and regulation. Experimental methods are available for the task [6, 7]. However, most of them are time consuming and expensive. Moreover, as the amount of newly sequenced data grows explosively, the low throughput of experimental methods has become an important bottleneck for rapid processing of these data. Computational methods have thus become an important alternative for the rapid identification of TFBSs.

Since TFBSs for the same transcription factor have similar sequence content in homologous sequences, the most often used computational approaches make the prediction by analyzing a set of homologous sequences and identifying subsequences that are similar in content. The locations of a TFBS may vary in different homologous sequences. To determine the location of a TFBS in each sequence, we need to evaluate all possible starting locations among all sequences to find the optimal solution. The total number of combinations of subsequences that need to be examined is exponential in the number of sequences, and exhaustively enumerating all of them is impractical when the number or the lengths of the sequences are large. To avoid exhaustive search, a large number of heuristics have been developed to reduce the size of the search space, such as Gibbs sampling based approaches (AlignACE [15], BioProspector [12], and Gibbs Motif Sampler [11]), expectation maximization based models [1, 2], greedy approaches such as Consensus [8], and genetic algorithm based approaches such as FMGA [10] and MDGA [4].

Of these approaches, Gibbs sampling is a stochastic approach. It randomly selects a candidate motif of a fixed length from each sequence. It then picks a sequence, uses each substring of the same length in that sequence to replace the pre-selected motif for the sequence, and computes the probability of each resulting set. A substring is then randomly selected according to the distribution of these probabilities to replace the pre-selected subsequence, which yields a new set of subsequences. The procedure is repeated until the maximum number of iterations has been reached or a satisfactory set of locally optimal subsequences has been found [11, 12, 15]. Consensus uses a greedy algorithm to align functionally related sequences and has been applied to identify the binding sites for the E. coli CRP protein [8]. MEME+ [2] uses the Expectation Maximization technique to fit a two-component mixture model, and the model is then used to find TFBSs. MEME+ achieves higher accuracy than its earlier version MEME [1]. However, its prediction accuracy is still not satisfactory.
Genetic algorithms (GAs) simulate the Darwinian evolutionary process to search for an optimal solution to an optimization problem. Approaches based on GAs start with an initial population of a certain size. They then go through a series of selection, crossover and mutation processes to converge toward an optimum. The selection, crossover and mutation operations are applied to the individuals in the population to generate the next generation. These operations are based on certain methods and probabilities. This evolutionary procedure continues until the maximum allowed number of generations has been generated or the difference between the values of the objective function associated with two consecutive generations is less than a pre-set small threshold. Genetic algorithms have been successfully used to solve the TFBS prediction problem, for example in FMGA [10] and MDGA [4]. FMGA was reported to have better performance than Gibbs Motif Sampler [11] in terms of both prediction accuracy and computational efficiency. MDGA [4] is another program that uses genetic algorithms to predict TFBSs in homologous sequences. During the evolutionary process, MDGA uses information content to evaluate each individual in the population. MDGA is able to achieve higher prediction accuracy than Gibbs sampling based approaches while using less computation time.

So far, most of the existing approaches use heuristic methods to reduce the size of the search space. However, the heuristics employed by these approaches may also adversely affect the prediction accuracy. For example, GA based prediction tools cannot guarantee that the prediction results are the same for different runs of the program. A well defined strategy that can efficiently explore the search space and generate deterministic and highly accurate prediction results is thus necessary to further improve the performance of prediction tools.

Recent work has shown that an ensemble of HMMs can be effectively used to improve the accuracy of the alignment of multiple protein sequences [17]. In this paper, we develop a new approach that can predict the locations of TFBSs with an ensemble of Hidden Markov Models (HMMs). The approach uses an ensemble of profile HMMs to generate a list of positions that are likely to be the starting positions of the TFBSs. As the first step, we construct the ensemble from the local alignment of two sequences. The ensemble consists of HMMs that represent the local alignments with the most significant alignment scores. We then align each profile HMM in the ensemble to each sequence in the dataset; the parameters of the HMMs are also changed to incorporate the new information obtained by aligning the new sequence to the HMMs. This procedure is repeated until all sequences in the dataset have been processed. The number of HMMs in the ensemble is a parameter that can be adjusted based on the needs of users. We have implemented this approach in a software tool, EHMM, and our experimental results show that the prediction accuracy of EHMM is higher than or comparable with that of the existing tools.

II. ALGORITHMS AND METHODS

The method selects the two sequences that have the lowest similarity to initialize the ensemble. The similarity between each pair of sequences in the set is computed by globally aligning the two sequences. A local alignment of the selected sequences is then computed. The alignment results are then used to construct an ensemble that consists of k HMMs, where k is a positive integer. The algorithm selects the local alignments with the k largest alignment scores, and each such local alignment is used to construct an HMM. An ensemble of k HMMs can thus be constructed based on the local alignments with the k most significant alignment scores. We then progressively use the HMMs to scan through each remaining sequence in the set.
Each sequence segment in a sequence is aligned to each HMM in the ensemble, and the alignments with the k most significant scores are selected to update the parameters of the HMM. This process will create up to $k^2$ HMMs, but only the alignments that have the k most significant alignment scores are selected to create a new ensemble of k HMMs. We repeat this procedure until all sequences in the set have been processed, and the HMMs remaining in the ensemble provide the candidate TFBS motifs. Figure 1 (a) and (b) provide an illustration of the process. Figure 2 shows the final stage of the approach, where the binding sites can be determined from the HMMs in the ensemble. The following sections provide a detailed description of the steps of the algorithm.
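The pruning step described above, in which up to $k^2$ candidates are reduced back to k, can be sketched as follows. This is only an illustrative sketch, not the EHMM implementation: the names `prune_to_k` and `toy_scan` are hypothetical, each "model" is a plain string rather than a profile HMM, the score is a simple count of matching positions, and the re-estimation of HMM parameters after each step is omitted.

```python
import heapq

def toy_scan(model, sequence):
    """Hypothetical stand-in for aligning one model to every segment.

    Returns a list of (score, start) pairs; the score is the number of
    positions at which the segment matches the model string.
    """
    w = len(model)
    return [
        (float(sum(a == b for a, b in zip(model, sequence[s:s + w]))), s)
        for s in range(len(sequence) - w + 1)
    ]

def prune_to_k(ensemble, sequence, k, scan=toy_scan):
    """One progressive step: keep the k best of up to k*k candidates.

    Each model contributes its k best-scoring segments, and the pooled
    candidates are then cut back to the k best overall. Each result is
    a (score, model index, start position) tuple.
    """
    candidates = []
    for idx, model in enumerate(ensemble):
        for score, start in heapq.nlargest(k, scan(model, sequence)):
            candidates.append((score, idx, start))
    return heapq.nlargest(k, candidates)

best = prune_to_k(["ACG", "TTT"], "ACGTTTAC", k=2)
```

In the method itself, each surviving candidate would be used to update the parameters of its HMM before the next sequence is processed; ties here are broken by tuple ordering, which is an arbitrary choice of this sketch.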

Figure 1. (a) An ensemble is constructed from local alignments. (b) The ensemble is updated progressively.

Figure 2. Finally, the binding sites can be inferred from the HMMs in the ensemble.

A. Ensemble Initialization

The algorithm selects the two sequences with the lowest similarity value from the set and uses the Smith-Waterman local alignment algorithm [16] to obtain local alignments with significant scores. The alignment can be performed in quadratic computation time. To construct an ensemble of k HMMs, a dynamic programming table needs to be maintained to store the alignment scores. Given two sequences s and t and a score matrix M that evaluates the fitness of matching two nucleotides in an alignment, the recursion relation for the dynamic programming is as follows.

$$S[i][j] = \max\{0,\; S[i-1][j] + M[s_i][-],\; S[i][j-1] + M[-][t_j],\; S[i-1][j-1] + M[s_i][t_j]\} \tag{1}$$

where $S$ is the two-dimensional dynamic programming table, $s_i$ and $t_j$ are the $i$-th and $j$-th nucleotides in $s$ and $t$ respectively, and $-$ denotes a gap. After the dynamic programming table is completely determined, the algorithm selects the alignments with the k largest alignment scores in table $S$. A trace-back table can be maintained during the dynamic programming process. Based on the trace-back table, a trace-back procedure can be employed to identify the subsequences in the alignments that correspond to the k largest alignment scores.

An ensemble of k profile HMMs can then be constructed from the k alignments. An alignment can be considered as a set of columns, and each column contains a set of nucleotides and gaps that are aligned together in the alignment. A profile HMM contains two states, namely $D_i$ and $M_i$, for column $i$ in the corresponding alignment. The deletion state $D_i$ does not emit any nucleotide and is used to represent the gaps in column $i$; the matching state $M_i$ emits a nucleotide and is used to describe the probabilities for each nucleotide to appear in column $i$. The probabilities of emission and transition for each state can be computed from each alignment as well. Figure 3 illustrates the process that converts a multiple alignment of subsequences into the corresponding profile HMM. The parameters of a profile HMM can be computed as follows.

Figure 3. A multiple alignment of subsequences can be converted into a profile HMM.

$$ep(M_i, a) = \frac{C_i^a}{\sum_{b \in N} C_i^b} \tag{2}$$

$$et(M_i, M_{i+1}) = \frac{\sum_{b \in N,\, c \in N} P(i, b, i+1, c)}{\sum_{b \in N,\, c \in N} P(i, b, i+1, c) + \sum_{b \in N} P(i, b, i+1, -)} \tag{3}$$

$$et(M_i, D_{i+1}) = 1 - et(M_i, M_{i+1}) \tag{4}$$

$$et(D_i, M_{i+1}) = \frac{\sum_{b \in N} P(i, -, i+1, b)}{\sum_{b \in N} P(i, -, i+1, b) + P(i, -, i+1, -)} \tag{5}$$

$$et(D_i, D_{i+1}) = 1 - et(D_i, M_{i+1}) \tag{6}$$

where $N$ is the set of all types of nucleotides; $C_i^a$ is the number of times that nucleotide $a$ appears in column $i$; $ep(M_i, a)$ is the emission probability for state $M_i$ to emit nucleotide $a$; $et(M_i, M_{i+1})$ is the probability for the transition from $M_i$ to $M_{i+1}$ to occur; $P(i, b, i+1, c)$ is the number of times that nucleotide $b$ appears in column $i$ and nucleotide $c$ appears in column $i+1$; $P(i, b, i+1, -)$ is the number of times that nucleotide $b$ appears in column $i$ and a gap ($-$) appears in column $i+1$; $et(D_i, M_{i+1})$ is the probability for the transition from $D_i$ to $M_{i+1}$ to occur; $P(i, -, i+1, b)$ is the number of times that a gap appears in column $i$ and nucleotide $b$ appears in column $i+1$; and $P(i, -, i+1, -)$ is the number of times that gaps appear in both columns $i$ and $i+1$. More details of the algorithm can be found in [5].

B. Updating Ensemble

The remaining sequences in the set are processed based on the profile HMMs in the ensemble. A sequence in the set that has not been processed is scanned by each profile HMM, and the subsequences that have the k most significant alignment scores are selected. The algorithm uses a window of a certain size to slide through the sequence. The size of the window is set to 1.5 times the average length of all subsequences in the alignments used to construct the ensemble. The window moves by 1 bp each time, and each subsequence in the window is aligned to each HMM in the ensemble. The alignment can be computed with a dynamic programming algorithm. The recursion relation for the dynamic programming is as follows.

$$S[M_s, i, j] = \max\{\, et(M_s, D_{s+1})\, ep(M_s, t_i)\, S[D_{s+1}, i+1, j],\; et(M_s, M_{s+1})\, ep(M_s, t_i)\, S[M_{s+1}, i+1, j] \,\} \tag{7}$$

$$S[D_s, i, j] = \max\{\, et(D_s, D_{s+1})\, S[D_{s+1}, i, j],\; et(D_s, M_{s+1})\, S[M_{s+1}, i, j] \,\} \tag{8}$$

where $0 \le i \le j \le W$ are integers that indicate the location of the part of subsequence $t$ included in the window; $S[D_s, i, j]$ and $S[M_s, i, j]$ are the dynamic programming table cells that store the maximum probability for states $D_s$ and $M_s$ to generate the subsequence $t[i \ldots j]$; and $t_i$ is the nucleotide at position $i$ in $t$. More details of the algorithm can be found in [5]. The algorithm then selects the k subsequences with the largest alignment scores.
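The sliding window used for scanning can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `window_size` and `windows` are hypothetical, and rounding the window size up to the next integer is a choice of this sketch, since the text only states the factor of 1.5.

```python
import math

def window_size(alignment_lengths, factor=1.5):
    """Window size: `factor` times the average aligned subsequence length.

    Rounding up with math.ceil is an assumption of this sketch.
    """
    average = sum(alignment_lengths) / len(alignment_lengths)
    return math.ceil(factor * average)

def windows(sequence, size):
    """Yield (start, subsequence) pairs, moving the window by 1 bp each time."""
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]

size = window_size([4, 4])
segments = list(windows("ACGTACGT", size))
```

Each yielded subsequence would then be aligned to every HMM in the ensemble with the dynamic programming recursion given above.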
We thus obtain a total of $k^2$ candidates for updating the HMMs in the ensemble. We pick the k subsequences that correspond to the largest k alignment scores from these $k^2$ candidates. The parameters of each profile HMM are then updated based on these additional k subsequences. Specifically, each additional subsequence changes the counts that appear in equations (2), (3), (4), (5), and (6), so the parameters of each HMM in the ensemble need to be reevaluated. The process is applied progressively to the remaining sequences in the set until every sequence in the set has been processed. The locations of the sequence segments that are used to construct each HMM in the ensemble are then determined by searching the sequences in the data set, and the algorithm outputs these locations as those of the binding sites.

C. Computation Time

We assume that the set contains m sequences, each sequence contains n nucleotides, and the binding site contains l nucleotides. The construction of the initial ensemble needs $O(m^2 n^2 + k n^2)$ time since the dynamic programming of Smith-Waterman local alignment needs $O(n^2)$ time. The computation time needed to scan through a sequence with a single HMM is $O(l^2 n)$. The total amount of computation needed by the approach is thus $O(t(k m l^2 n + m^2 n^2 + k n^2))$, where t is the number of iterations the algorithm needs to execute. Since the memory space needed by the alignments can be reused, the space complexity of the algorithm is $O(n^2)$.

III. EXPERIMENTAL RESULTS

We implemented this approach and developed a software tool, EHMM. We tested its accuracy on a biological dataset for the cyclic-AMP receptor protein (CRP). This dataset consists of 18 sequences, each of which consists of 105 bps [13]. Twenty-three binding sites have been determined by using the DNA footprinting method, with a motif width of 22 [12]. Table 1 compares the prediction accuracy of EHMM with three other computational methods: Gibbs Sampler [11], BioProspector [12], and MDGA [4]. The value of the parameter is set to k = 10 in all the tests. It can be seen from the table that EHMM achieves accuracy comparable with the other tools on homologous sequences that contain a single binding site motif. However, its prediction accuracy on those that contain multiple binding site motifs is significantly higher. For most of such sequences, EHMM can accurately identify the locations of both motifs. This is beyond the capability of all three other methods. In particular, EHMM obtains excellent prediction results on sequence 17, where all three other methods fail to identify either of the two motifs. It is not surprising that our method is capable of identifying the locations of multiple binding sites, since it uses an ensemble of HMMs to explore the alignment space of all subsequences, which significantly improves the sampling ability and the probability of accurately identifying the locations of TFBSs.
| Seq | FP    | GS | E   | BP | E   | GA | E   | EHMM  | E     |
|-----|-------|----|-----|----|-----|----|-----|-------|-------|
| 1   | 17,61 | 59 | -2  | 63 | 2   | 62 | 1   | 16,60 | -1,-1 |
| 2   | 17,55 | 53 | -2  | 57 | 2   | 56 | 1   | 18,54 | 1,-1  |
| 3   | 76    | 74 | -2  | 78 | 2   | 77 | 1   | 78    | 2     |
| 4   | 63    | 59 | -4  | 65 | 2   | 64 | 1   | 63    | 0     |
| 5   | 50    | 11 | -39 | 52 | 2   | 51 | 1   | 51    | 1     |
| 6   | 7,60  | 5  | -2  | 9  | 2   | 8  | 1   | 6,59  | -1,-1 |
| 7   | 42    | 40 | -2  | 26 | -16 | 43 | 1   | 43    | 1     |
| 8   | 39    | 37 | -2  | 41 | 2   | 40 | 1   | 40    | 1     |
| 9   | 9,80  | 7  | -2  | 11 | 2   | 10 | 1   | 8,81  | -1,1  |
| 10  | 14    | 12 | -2  | 16 | 2   | 15 | 1   | 13    | -2    |
| 11  | 61    | 59 | -2  | 63 | 2   | 62 | 1   | 60    | -1    |
| 12  | 41    | 47 | 6   | 43 | 2   | 42 | 1   | 40    | -1    |
| 13  | 48    | 46 | -2  | 50 | 2   | 49 | 1   | 48    | 0     |
| 14  | 71    | 69 | -2  | 73 | 2   | 72 | 1   | 71    | 0     |
| 15  | 17    | 15 | -2  | 19 | 2   | 18 | 1   | 16    | -1    |
| 16  | 53    | 49 | -4  | 55 | 2   | 54 | 1   | 52    | -1    |
| 17  | 1,84  | 25 | 24  | 68 | -16 | 56 | -26 | 2,80  | 1,-4  |
| 18  | 78    | 74 | -4  | 80 | 2   | 77 | 1   | 76    | -2    |

Table 1. The prediction accuracy of EHMM, GS, BP, and GA. A single sequence may contain multiple binding site motifs. Seq denotes sequences; the FP column lists the starting positions of the binding sites measured with footprinting experiments. The GS, BP, and GA columns list the starting positions predicted by Gibbs Sampler, BioProspector, and MDGA, respectively. The E columns show the deviation of the predicted starting positions from those obtained with footprinting experiments.

In addition to the data set CRP, we also use EHMM (k = 10) and other tools to predict the binding sites for a few transcription factors including BATF [13], EGR1 [9], FOXO1 [3], and HSF1 [14]. The prediction accuracy of a software tool is evaluated by computing its prediction accuracy on each single sequence in the set and taking the average of the prediction accuracy over all sequences in the set. The prediction accuracy on a single sequence is defined to be the percentage of the binding site that is correctly predicted. In other words, if we use $B$ to denote the binding site and $P$ to denote the predicted binding site, the accuracy of the prediction can be computed with

$$A = \frac{|P \cap B|}{|B|} \tag{9}$$

where $P \cap B$ denotes the intersection of $P$ and $B$. For a set $D$ of homologous sequences, the prediction accuracy of an approach on $D$ is computed with

$$A_D = \frac{\sum_{s \in D} A_s}{|D|} \tag{10}$$

where $s$ is a sequence in $D$ and $A_s$ is the prediction accuracy of the approach on $s$. Figure 4 shows and compares the prediction accuracy of EHMM, Gibbs Sampler, BioProspector, and MDGA on the four data sets. It can be seen from the figure that EHMM achieves significantly higher prediction accuracy on data sets BATF, FOXO1, and HSF1 and achieves accuracy that is comparable with other tools on data set FOXO1.
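The accuracy measures in equations (9) and (10) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and it assumes that both the true and the predicted binding site are contiguous intervals of the same width identified by their starting positions.

```python
def site_accuracy(predicted_start, true_start, width):
    """Equation (9): fraction of the true site B covered by the prediction P."""
    predicted = set(range(predicted_start, predicted_start + width))
    true_site = set(range(true_start, true_start + width))
    return len(predicted & true_site) / len(true_site)

def dataset_accuracy(start_pairs, width):
    """Equation (10): mean per-sequence accuracy over the data set.

    `start_pairs` holds one (predicted_start, true_start) pair per sequence.
    """
    scores = [site_accuracy(p, b, width) for p, b in start_pairs]
    return sum(scores) / len(scores)

# A prediction that is off by 2 positions for a site of width 22
# still covers 20 of the 22 true positions.
off_by_two = site_accuracy(63, 61, 22)
average = dataset_accuracy([(61, 61), (63, 61)], 22)
```

Under this interval assumption, a small deviation in the predicted starting position lowers the accuracy only slightly, while a prediction that does not overlap the true site at all scores zero.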

Figure 4. Prediction accuracy of EHMM, GS (Gibbs Sampler), BP (BioProspector), and GA (MDGA) on data sets BATF, EGR1, FOXO1, and HSF1.

Figure 5. Prediction accuracy of EHMM on data sets BATF, EGR1, FOXO1, and HSF1 when k is 6, 8, 10, and 12, respectively.

The size of the ensemble is a parameter that can be changed by the user to balance the prediction accuracy and the computation time needed for prediction. Figure 5 shows the prediction accuracy on data sets BATF, EGR1, FOXO1, and HSF1 when the value of the parameter k is 6, 8, 10, and 12. It can be seen from the figure that the prediction accuracy improves as the size of the ensemble increases and becomes steady when the value of the parameter reaches 10. The testing results thus suggest that a parameter value of 10 is sufficient to achieve satisfactory prediction accuracy in practice.

IV. CONCLUSIONS

In this paper, we developed a new approach that can accurately and efficiently identify the binding site motifs in a set of homologous DNA sequences. Our approach starts with a pair of sequences in the set and uses the local alignment results of the two sequences to construct an initial ensemble. It then progressively processes the remaining sequences in the set and updates the parameters of the HMMs in the ensemble until every sequence in the set has been processed. Experimental results show that, on the data sets we tested, this approach achieves higher or comparable accuracy on sequences with a single binding site, while its accuracy on sequences with multiple binding sites is significantly higher than that of other tools.

ACKNOWLEDGMENT

Y. Song's work is supported by the Startup Funding for New Faculty at Jiangsu University of Science and Technology.

REFERENCES

[1] T.L. Bailey and C. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Technical Report CS93-302, Department of Computer Science, University of California, San Diego, August 1993.

[2] T.L. Bailey and C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, 1994.
[3] M.M. Brent, R. Anand, and R. Marmorstein, Structural basis for DNA recognition by FoxO1 and its regulation by posttranslational modification, Structure, 16: 1407-1416, 2008.
[4] D. Che, Y. Song, and K. Rasheed, MDGA: Motif Discovery Using A Genetic Algorithm, Proceedings of the Genetic and Evolutionary Computation Conference 2005, pp. 447-452.
[5] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
[6] D.J. Galas and A. Schmitz, DNAase footprinting: a simple method for the detection of protein-DNA binding specificity, Nucleic Acids Research, 5, 9, pp. 3157-3170, 1978.
[7] M.M. Garner and A. Revzin, A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system, Nucleic Acids Research, 9, 13, pp. 3047-3060, 1981.
[8] G.Z. Hertz and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15, 7, pp. 563-577, 1999.
[9] T.C. Hu, et al., Snail associates with EGR-1 and SP-1 to upregulate transcriptional activation of p15INK4b, The FEBS Journal, 277: 1202-1218, 2010.
[10] F.F.M. Liu, J.J.P. Tsai, R.M. Chen, S.N. Chen, and S.H. Shih, FMGA: finding motifs by genetic algorithm, IEEE Fourth Symposium on Bioinformatics and Bioengineering, pp. 459-466, 2004.
[11] J.S. Liu, A.F. Neuwald, and C.E. Lawrence, Bayesian models for multiple local sequence alignment and Gibbs sampling strategies, J. Am. Stat. Assoc., 90, 432, pp. 1156-1170, 1995.
[12] X. Liu, D.L. Brutlag, and J.S. Liu, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pacific Symposium on Biocomputing, 6, pp. 127-138, 2001.
[13] M. Quigley et al., Transcriptional analysis of HIV-specific CD8+ T cells shows that PD-1 inhibits T cell function by upregulating BATF, Nature Medicine, 16, 1147-1151, 2010.
[14] K.T. Rigbolt, et al., System-wide temporal characterization of the proteome and phosphoproteome of human embryonic stem cell differentiation, Science Signaling, 4: RS3-RS3, 2011.
[15] F.R. Roth, J.D. Hughes, P.E. Estep, and G.M. Church, Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation, Nature Biotechnology, 16, 10, pp. 939-945, 1998.
[16] T.F. Smith and M.S. Waterman, Identification of Common Molecular Subsequences, Journal of Molecular Biology, 147: 195-197, 1981.
[17] J. Song, C. Liu, Y. Song, J. Qu, and G. Hura, Alignment of multiple proteins with an ensemble of Hidden Markov Models, International Journal of Bioinformatics and Data Mining, 4(1): 60-71, 2010.
[18] G.D. Stormo, Computer methods for analyzing sequence recognition of nucleic acids, Annu. Rev. BioChem, 17, pp. 241-263, 1988.
[19] G.D. Stormo and G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments, Proc. of Nat. Acad. Sci., 86, 4, pp. 1183-1187, 1989.