Predicting Transcription Factor Binding Sites with an Ensemble of Hidden Markov Models

Vol. 3, No. 1, Fall, 2016, pp. 1-10
ISSN 2158-835X (print), 2158-8368 (online), All Rights Reserved

Yinglei Song 1 and Albert Y. Chi 2

1 School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu 212003, China. Email: yingleisong@gmail.com
2 Department of Mathematics and Computer Science, University of Maryland Eastern Shore, Princess Anne, MD 21853, USA. Email: albertchisquare@gmail.com

Abstract

Transcription Factor Binding Sites (TFBSs) are important for a number of biological processes such as gene expression and regulation. One fundamental problem in bioinformatics is to develop software tools that can identify TFBSs accurately and rapidly. In practice, exhaustive search of all possible combinations of subsequences is time consuming and thus cannot be applied. A large number of heuristic or approximation algorithms and machine learning based approaches have been developed for this problem. However, none of them has achieved satisfactory prediction accuracy. In this paper, we develop a novel approach that can efficiently explore the space of all possible locations of TFBSs in a set of sequences with high accuracy. The exploration is carried out with an ensemble of a few Hidden Markov Models (HMMs). The ensemble is initially constructed through local alignments of two sequences in the set; each HMM in the ensemble is then progressively aligned to the other sequences in the set. The parameters of the HMMs in the ensemble are updated based on the alignment results. Our experimental results show that this approach can achieve higher accuracy than existing state-of-the-art approaches, with satisfactory efficiency.

Keywords: Hidden Markov Model (HMM); Motif finding; Transcription factor binding site; ensemble approach

I. INTRODUCTION

Transcription Factor Binding Sites (TFBSs) are subsequences found in the upstream region of genes in DNA genomes.
A transcription factor, which is a specialized protein molecule, may bind to the nucleotides in the subsequences and thus may affect some relevant biological processes. Research in molecular biology has revealed that transcription factor binding sites are important for many biological processes, including gene expression and regulation. Thus, an accurate identification of TFBSs is important for understanding the biological mechanism of gene expression and regulation. Experimental methods have been available for the task [6, 7]. However, most of them are time consuming and expensive. Moreover, as the amount of newly sequenced data grows explosively, the low throughput of experimental methods has become an important bottleneck for rapid processing of these data. Computational methods have thus become an important alternative approach to rapid identification of TFBSs.

Since TFBSs for the same transcription factor have similar sequence content in homologous sequences, the most often used computational approaches make the prediction by analyzing a set of homologous sequences and identifying subsequences that are similar in content. The locations of a TFBS may vary in different homologous sequences. To determine the location of a TFBS in each sequence, we need to evaluate all possible starting locations among all sequences to find the optimal solution. The total number of combinations of subsequences that need to be examined is exponential, and exhaustively enumerating all of them is impractical when the number or the lengths of the sequences are large.

To avoid exhaustive search, a large number of heuristics have been developed to reduce the size of the search space, such as the Gibbs sampling based approaches AlignACE [15], BioProspector [12], and Gibbs Motif Sampler [11], expectation maximization based models [1, 2], greedy approaches such as Consensus [8], and genetic algorithm based approaches such as FMGA [10] and MDGA [4].

Among these approaches and software tools, Gibbs sampling is a stochastic approach. It randomly selects a candidate motif of a fixed length from each sequence. It then picks a sequence, uses each substring of the same length in that sequence to replace the corresponding pre-selected motif, and computes the probability of each replacement. The approach randomly selects a substring based on the distribution of these probabilities to replace the pre-selected subsequence and thus obtains a new set of subsequences through random sampling. The procedure is repeated until the maximum number of iterations has been reached or a satisfactory set of locally optimal subsequences has been found [11, 12, 15].

Consensus uses a greedy algorithm to align functionally related sequences and applies the algorithm to identify the binding sites for the E. coli CRP protein [8]. MEME+ [2] uses the Expectation Maximization technique to fit a two-component mixture model, and the model is then used to find TFBSs. MEME+ achieves higher accuracy than its earlier version MEME [1]. However, the prediction accuracy is still not satisfactory.
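The Gibbs sampling procedure described earlier in this section can be illustrated with a minimal Python sketch. It is a simplified illustration rather than the implementation of any of the cited tools: it uses no background model, it cycles through the sequences in a fixed order instead of picking one at random, and the function names, the pseudocount, and the iteration count are assumptions made for the example.

```python
import random

def gibbs_step(seqs, starts, w, held_out, rng, pseudo=1.0):
    """Resample the motif start in seqs[held_out] given the other motifs."""
    # Position-specific nucleotide counts from every motif except the held-out one.
    counts = [{a: pseudo for a in "ACGT"} for _ in range(w)]
    for idx, (seq, st) in enumerate(zip(seqs, starts)):
        if idx == held_out:
            continue
        for pos in range(w):
            counts[pos][seq[st + pos]] += 1
    total = len(seqs) - 1 + 4 * pseudo  # column sum of the counts
    # Weight of each candidate start: product of per-position frequencies.
    seq = seqs[held_out]
    weights = []
    for st in range(len(seq) - w + 1):
        p = 1.0
        for pos in range(w):
            p *= counts[pos][seq[st + pos]] / total
        weights.append(p)
    # Draw the new start position in proportion to the weights.
    new_starts = list(starts)
    new_starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return new_starts

def gibbs_sample(seqs, w, iterations=200, seed=0):
    """Run a fixed number of resampling steps from random initial starts."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for it in range(iterations):
        starts = gibbs_step(seqs, starts, w, it % len(seqs), rng)
    return starts
```

Because each step draws a position at random, two runs with different seeds may return different motif locations, which is the non-determinism discussed for stochastic methods in this section.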
Genetic algorithms (GAs) simulate the Darwinian evolutionary process to find a locally optimal solution for an optimization problem. Approaches based on GAs start with an initial population of a certain size. They then go through a series of selection, crossover and mutation processes to converge toward the global optimum. The selection, crossover and mutation operations are applied to the individuals in the population to generate the next generation. These operations are based on certain methods and probabilities. This evolutionary procedure continues until the maximum allowed number of generations has been generated or the difference between the values of the objective function associated with two consecutive generations is less than a pre-set small threshold. Genetic algorithms have been successfully used to solve the TFBS prediction problem, for example in FMGA [10] and MDGA [4]. FMGA was reported to have better performance than Gibbs Motif Sampler [11] in terms of both prediction accuracy and computational efficiency. MDGA [4] is another program that uses genetic algorithms to predict TFBSs in homologous sequences. During the evolutionary process, MDGA uses information content to evaluate each individual in the population. MDGA is able to achieve higher prediction accuracy than Gibbs sampling based approaches while using less computation time.

So far, most of the existing approaches use heuristic methods to reduce the size of the search space. However, the heuristics employed by these approaches may also adversely affect the prediction accuracy. For example, GA based prediction tools cannot guarantee that the prediction results are the same for different runs of the program. A well defined strategy that can efficiently explore the search space and generate deterministic and highly accurate prediction results is thus necessary to further improve the performance of prediction tools.

Recent work has shown that an ensemble of HMMs can be effectively used to improve the accuracy of the alignment of multiple protein sequences [17]. In this paper, we develop a new approach that can predict the locations of TFBSs with an ensemble of Hidden Markov Models (HMMs). The approach uses an ensemble of profile HMMs to generate a list of positions that are likely to be the starting positions of the TFBSs. As the first step, we construct the ensemble from the local alignment of two sequences. The ensemble consists of HMMs that represent the local alignments with the most significant alignment scores. We then align each profile HMM in the ensemble to each sequence in the dataset; the parameters of the HMMs are also changed to incorporate the new information obtained by aligning the new sequence to the HMMs. This procedure is repeated until all sequences in the dataset have been processed. The number of HMMs in the ensemble is a parameter that can be adjusted based on the needs of users. We have implemented this approach in a software tool, EHMM, and our experimental results show that the prediction accuracy of EHMM is higher than or comparable with that of the existing tools.

II. ALGORITHMS AND METHODS

The method selects the two sequences that have the lowest similarity to initialize the ensemble. The similarity between each pair of sequences in the set is computed by globally aligning the two sequences. A local alignment of the selected sequences is then computed. The alignment results are then used to construct an ensemble that consists of k HMMs, where k is a positive integer. The algorithm selects the local alignments with the k largest alignment scores, and each of these local alignments is used to construct an HMM. An ensemble of k HMMs can thus be constructed based on the local alignments with the k most significant alignment scores. We then progressively use the HMMs to scan through each remaining sequence in the set.
Each sequence segment in a sequence is aligned to each HMM in the ensemble, and the alignments with the k most significant scores are selected to update the parameters of the HMM. This process creates up to $k^2$ HMMs, but only the alignments that have the k most significant alignment scores are selected to create a new ensemble of k HMMs. We repeat this procedure until all sequences in the set have been processed; the HMMs that remain in the ensemble provide the candidate TFBS motifs. Figure 1 (a) and (b) illustrate the process. Figure 2 shows the final stage of the approach, where the binding sites are determined from the HMMs in the ensemble. The following sections provide a detailed description of the steps of the algorithm.
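The initialization step outlined above (a local alignment of two sequences followed by selection of the k best-scoring alignments) can be sketched in Python using the standard Smith-Waterman recurrence. This is a minimal sketch under stated assumptions: the match, mismatch, and gap scores are illustrative values that do not come from the paper, the function names are hypothetical, and the sketch simply takes the k highest-scoring table cells without addressing how overlapping alignments should be handled, which the text does not specify.

```python
from heapq import nlargest

MATCH, MISMATCH, GAP = 2, -1, -2  # illustrative scores, not taken from the paper

def score(a, b):
    """Substitution score for aligning nucleotides a and b."""
    return MATCH if a == b else MISMATCH

def local_alignment_table(s, t):
    """Fill the Smith-Waterman table; S[i][j] is the best local score ending at (i, j)."""
    S = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            S[i][j] = max(0,
                          S[i - 1][j] + GAP,
                          S[i][j - 1] + GAP,
                          S[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    return S

def top_k_cells(S, k):
    """Return the k highest-scoring cells as (score, i, j), best first."""
    cells = ((S[i][j], i, j) for i in range(1, len(S)) for j in range(1, len(S[0])))
    return nlargest(k, cells)

def trace_back(S, s, t, i, j):
    """Recover the aligned subsequences that end at cell (i, j)."""
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + GAP:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```

For example, calling `top_k_cells(local_alignment_table(s, t), k)` and then `trace_back` on each returned cell yields k aligned subsequence pairs, each of which could seed one profile HMM of the ensemble.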

Figure 1. (a) An ensemble is constructed from local alignments. (b) The ensemble is updated progressively.

Figure 2. Finally, the binding sites can be inferred from the HMMs in the ensemble.

A. Ensemble Initialization

The algorithm selects the two sequences that have the lowest similarity value in the set and uses the Smith-Waterman local alignment algorithm [16] to obtain local alignments with significant scores. The alignment can be performed in quadratic computation time. To construct an ensemble of k HMMs, a dynamic programming table needs to be maintained to store the alignment scores. Given two sequences s and t and a score matrix M that evaluates the fitness value of matching two nucleotides together in an alignment, the recursion relation for the dynamic programming is as follows.

$$S[i][j] = \max\{\,0,\; S[i-1][j] + M[s_i][-],\; S[i][j-1] + M[-][t_j],\; S[i-1][j-1] + M[s_i][t_j]\,\} \tag{1}$$

where S is the two-dimensional dynamic programming table, $s_i$ and $t_j$ are the $i$th and $j$th nucleotides in s and t, respectively, and $-$ denotes a gap. After the dynamic programming table is completely determined, the algorithm selects the alignments with the k largest alignment scores in table S. A trace-back table can be maintained during the dynamic programming process. Based on the trace-back table, a trace-back procedure can be employed to identify the subsequences in the alignments that correspond to the k largest alignment scores. An ensemble of k profile HMMs can then be constructed from the k alignments.

An alignment can be considered as a set of columns, where each column contains a set of nucleotides and gaps that are aligned together in the alignment. A profile HMM contains two states, namely $D_i$ and $M_i$, for column $i$ in the corresponding alignment. The deletion state $D_i$ does not emit any nucleotide and is used to represent the gaps in column $i$; the matching state $M_i$ emits a nucleotide and is used to describe the probabilities for each nucleotide to appear in column $i$. The probabilities of emission and transition for each state can be computed from each alignment as well. Figure 3 illustrates the process that converts a multiple alignment of subsequences into the corresponding profile HMM. The parameters of a profile HMM can be computed as follows.

Figure 3. A multiple alignment of subsequences can be converted into a profile HMM.

$$\mathrm{ep}(M_i, a) = \frac{C_a^i}{\sum_{b \in N} C_b^i} \tag{2}$$

$$\mathrm{et}(M_i, M_{i+1}) = \frac{\sum_{b \in N,\, c \in N} P(i, b, i+1, c)}{\sum_{b \in N,\, c \in N} P(i, b, i+1, c) + \sum_{b \in N} P(i, b, i+1, -)} \tag{3}$$

$$\mathrm{et}(M_i, D_{i+1}) = 1 - \mathrm{et}(M_i, M_{i+1}) \tag{4}$$

$$\mathrm{et}(D_i, M_{i+1}) = \frac{\sum_{b \in N} P(i, -, i+1, b)}{\sum_{b \in N} P(i, -, i+1, b) + P(i, -, i+1, -)} \tag{5}$$

$$\mathrm{et}(D_i, D_{i+1}) = 1 - \mathrm{et}(D_i, M_{i+1}) \tag{6}$$

where $N$ is the set of all types of nucleotides; $C_a^i$ is the number of times that nucleotide $a$ appears in column $i$; $\mathrm{ep}(M_i, a)$ is the emission probability for state $M_i$ to emit nucleotide $a$; $\mathrm{et}(M_i, M_{i+1})$ is the probability for the transition from $M_i$ to $M_{i+1}$ to occur; $P(i, b, i+1, c)$ is the number of times that nucleotide $b$ appears in column $i$ and nucleotide $c$ appears in column $i+1$; $P(i, b, i+1, -)$ is the number of times that nucleotide $b$ appears in column $i$ and a gap appears in column $i+1$; $\mathrm{et}(D_i, M_{i+1})$ is the probability for the transition from $D_i$ to $M_{i+1}$ to occur; $P(i, -, i+1, b)$ is the number of times that a gap appears in column $i$ and nucleotide $b$ appears in column $i+1$; and $P(i, -, i+1, -)$ is the number of times that gaps appear in both columns $i$ and $i+1$. More details of the algorithm can be found in [5].

B. Updating Ensemble

The remaining sequences in the set are processed based on the profile HMMs in the ensemble. A sequence in the set that has not been processed is scanned by each profile HMM, and the subsequences that have the k most significant alignment scores are selected. The algorithm uses a window of a certain size to slide through the sequence. The size of the window is set to 1.5 times the average length of all subsequences in the alignments used to construct the ensemble. The window moves by 1 bp each time, and each subsequence in the window is aligned to each HMM in the ensemble. The alignment can be computed with a dynamic programming algorithm. The recursion relation for the dynamic programming is as follows.

$$S[M_s, i, j] = \mathrm{ep}(M_s, t_i)\,\max\{\,\mathrm{et}(M_s, D_{s+1})\, S[D_{s+1}, i+1, j],\; \mathrm{et}(M_s, M_{s+1})\, S[M_{s+1}, i+1, j]\,\} \tag{7}$$

$$S[D_s, i, j] = \max\{\,\mathrm{et}(D_s, D_{s+1})\, S[D_{s+1}, i, j],\; \mathrm{et}(D_s, M_{s+1})\, S[M_{s+1}, i, j]\,\} \tag{8}$$

where $0 \le i \le j \le W$ are integers that indicate the location of the subsequence $t[i \ldots j]$ included in the window; $S[D_s, i, j]$ and $S[M_s, i, j]$ are the dynamic programming table cells that store the maximum probability for states $D_s$ and $M_s$ to generate the subsequence $t[i \ldots j]$; and $t_i$ is the nucleotide at position $i$ in t. More details of the algorithm can be found in [5].

The algorithm then selects the k subsequences with the largest alignment scores.
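The parameter estimates in equations (2) to (6) and the scan based on recurrences (7) and (8) can be illustrated with a short Python sketch. Several choices here are assumptions that the text does not state: a pseudocount is added to every count so that no probability is zero, the alignment may start in either $M$ or $D$ of the first column and must end in the last column, scores are kept in log space, and, as a simplification of the 1.5x window described above, fixed-width windows no longer than the profile are scored. All function names are hypothetical.

```python
import math

NUCS = "ACGT"

def estimate_profile(alignment, pseudo=1.0):
    """Estimate emission and transition probabilities from aligned rows ('-' = gap)."""
    L = len(alignment[0])
    ep = []  # ep[i][a]: probability that match state M_i emits nucleotide a
    for i in range(L):
        counts = {a: pseudo for a in NUCS}
        for row in alignment:
            if row[i] != "-":
                counts[row[i]] += 1
        total = sum(counts.values())
        ep.append({a: c / total for a, c in counts.items()})
    et = []  # et[i][(x, y)]: transition from state x in column i to state y in column i+1
    for i in range(L - 1):
        mm = md = dm = dd = pseudo
        for row in alignment:
            cur_gap, nxt_gap = row[i] == "-", row[i + 1] == "-"
            if not cur_gap and not nxt_gap:
                mm += 1
            elif not cur_gap:
                md += 1
            elif not nxt_gap:
                dm += 1
            else:
                dd += 1
        p_mm, p_dm = mm / (mm + md), dm / (dm + dd)
        et.append({("M", "M"): p_mm, ("M", "D"): 1 - p_mm,
                   ("D", "M"): p_dm, ("D", "D"): 1 - p_dm})
    return ep, et

def viterbi_log(ep, et, t):
    """Best log-probability that the profile generates exactly the string t."""
    L, n, NEG = len(ep), len(t), float("-inf")
    VM = [[NEG] * (n + 1) for _ in range(L)]  # VM[s][i]: M_s generates t[i:]
    VD = [[NEG] * (n + 1) for _ in range(L)]  # VD[s][i]: D_s generates t[i:]
    for s in range(L - 1, -1, -1):
        for i in range(n, -1, -1):
            if s == L - 1:
                if i == n - 1:  # last match state emits the final nucleotide
                    VM[s][i] = math.log(ep[s][t[i]])
                if i == n:      # last delete state emits nothing
                    VD[s][i] = 0.0
                continue
            if i < n:
                VM[s][i] = math.log(ep[s][t[i]]) + max(
                    math.log(et[s][("M", "D")]) + VD[s + 1][i + 1],
                    math.log(et[s][("M", "M")]) + VM[s + 1][i + 1])
            VD[s][i] = max(
                math.log(et[s][("D", "D")]) + VD[s + 1][i],
                math.log(et[s][("D", "M")]) + VM[s + 1][i])
    return max(VM[0][0], VD[0][0])

def scan(ep, et, seq, width, k):
    """Slide a window by 1 bp and return the k best (log-score, start) pairs."""
    scored = [(viterbi_log(ep, et, seq[p:p + width]), p)
              for p in range(len(seq) - width + 1)]
    return sorted(scored, reverse=True)[:k]
```

With `pseudo=0` the estimates reduce to the raw count ratios of equations (2) to (6), but zero probabilities would then make the logarithms undefined, which is why the sketch defaults to a positive pseudocount.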
We thus obtain in total $k^2$ candidates for updating the HMMs in the ensemble. We pick the k subsequences that correspond to the largest k alignment scores from these $k^2$ candidates. The parameters of each profile HMM are then updated based on these additional k subsequences. Specifically, each additional subsequence changes the counts that appear in equations (2), (3), (4), (5), and (6), so the parameters of each HMM in the ensemble need to be reevaluated. The process is applied progressively to the remaining sequences in the set until every sequence in the set has been processed. The locations of the sequence segments that are used to construct each HMM in the ensemble are then determined by searching the sequences in the data set, and the algorithm outputs these locations as those of the binding sites.

C. Computation Time

We assume that the set contains m sequences, each sequence contains n nucleotides, and the binding site contains l nucleotides. The construction of the initial ensemble needs $O(m^2 n^2 + k n^2)$ time since the dynamic programming of Smith-Waterman local alignment needs $O(n^2)$ time. The computation time needed to scan through a sequence with a single HMM is $O(l^2 n)$. The total amount of computation needed by the approach is thus $O(t(k m l^2 n + m^2 n^2 + k n^2))$, where t is the number of iterations the algorithm needs to execute. Since the memory space needed by the alignments can be reused, the space complexity of the algorithm is $O(n^2)$.

III. EXPERIMENTAL RESULTS

We implemented this approach and developed a software tool, EHMM. We tested its accuracy on a biological dataset for the cyclic-AMP receptor protein (CRP). This dataset consists of 18 sequences, each of which consists of 105 bps [13]. Twenty-three binding sites have been determined by using the DNA footprinting method, with a motif width of 22 [12]. Table 1 compares the prediction accuracy of EHMM with three other computational methods: Gibbs Sampler [11], BioProspector [12], and MDGA [4]. The parameter is set to k = 10 in all the tests. It can be seen from the table that EHMM achieves accuracy comparable with that of the other tools on homologous sequences that contain a single binding site motif. However, its prediction accuracy on those that contain multiple binding site motifs is significantly higher. For most of such sequences, EHMM can accurately identify the locations of both motifs. This is beyond the capability of all three other methods. In particular, EHMM obtains excellent prediction results on sequence 17, where all three other methods fail to identify either of the two motifs. It is not surprising that our method is capable of identifying the locations of multiple binding sites, since it uses an ensemble of HMMs to explore the alignment space of all subsequences, which significantly improves the sampling ability and the probability of accurately identifying the locations of TFBSs.
Seq   FP      GS   E     BP   E     GA   E     EHMM    E
1     17,61   59   -2    63   2     62   1     16,60   -1,-1
2     17,55   53   -2    57   2     56   1     18,54   1,-1
3     76      74   -2    78   2     77   1     78      2
4     63      59   -4    65   2     64   1     63      0
5     50      11   -39   52   2     51   1     51      1
6     7,60    5    -2    9    2     8    1     6,59    -1,-1
7     42      40   -2    26   -16   43   1     43      1
8     39      37   -2    41   2     40   1     40      1
9     9,80    7    -2    11   2     10   1     8,81    -1,1
10    14      12   -2    16   2     15   1     13      -2
11    61      59   -2    63   2     62   1     60      -1
12    41      47   6     43   2     42   1     40      -1
13    48      46   -2    50   2     49   1     48      0
14    71      69   -2    73   2     72   1     71      0
15    17      15   -2    19   2     18   1     16      -1
16    53      49   -4    55   2     54   1     52      -1
17    1,84    25   24    68   -16   56   -26   2,80    1,-4
18    78      74   -4    80   2     77   1     76      -2

Table 1. The prediction accuracy of EHMM, GS, BP, and GA. A single sequence may contain multiple binding site motifs. Seq denotes the sequence number; the FP column lists the starting positions of the binding sites measured with footprinting experiments. The GS, BP, and GA columns list the starting positions predicted by Gibbs Sampler, BioProspector, and MDGA, respectively. Each E column shows the deviation of the predicted starting positions in the preceding column from those obtained with footprinting experiments.

In addition to the CRP data set, we also use EHMM (k = 10) and the other tools to predict the binding sites for a few transcription factors, including BATF [13], EGR1 [9], FOXO1 [3], and HSF1 [14]. The prediction accuracy of a software tool is evaluated by computing its prediction accuracy on each single sequence in the set and taking the average of the prediction accuracy over all sequences in the set. The prediction accuracy on a single sequence is defined to be the percentage of the binding site that is correctly predicted. In other words, if we use B to denote the binding site and P to denote the predicted binding site, the accuracy of the prediction can be computed with

$$A = \frac{|P \cap B|}{|B|} \tag{9}$$

where $P \cap B$ denotes the intersection of P and B. For a set D of homologous sequences, the prediction accuracy of an approach on D is computed with

$$A_D = \frac{\sum_{s \in D} A_s}{|D|} \tag{10}$$

where s is a sequence in D and $A_s$ is the prediction accuracy of the approach on s. Figure 4 shows and compares the prediction accuracy of EHMM, Gibbs Sampler, BioProspector, and MDGA on the four data sets. It can be seen from the figure that EHMM achieves significantly higher prediction accuracy on data sets BATF, FOXO1, and HSF1 and achieves accuracy that is comparable with other tools on data set FOXO1.
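The accuracy measure defined in equations (9) and (10) can be sketched in a few lines of Python. The sketch assumes that each site is described by its starting position and a common fixed width and that each sequence contributes one true site and one predicted site; the function names are hypothetical.

```python
def site_accuracy(true_start, pred_start, width):
    """Equation (9): fraction of the true site B covered by the predicted site P."""
    B = set(range(true_start, true_start + width))
    P = set(range(pred_start, pred_start + width))
    return len(P & B) / len(B)

def dataset_accuracy(pairs, width):
    """Equation (10): mean per-sequence accuracy over (true_start, pred_start) pairs."""
    return sum(site_accuracy(b, p, width) for b, p in pairs) / len(pairs)
```

For a motif width of 22, a prediction that is off by one position covers 21 of the 22 positions of the true site, so its accuracy is 21/22.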

Figure 4. Prediction accuracy of EHMM, GS (Gibbs Sampler), BP (BioProspector), and GA (MDGA) on data sets BATF, EGR1, FOXO1, and HSF1.

Figure 5. Prediction accuracy of EHMM on data sets BATF, EGR1, FOXO1, and HSF1 when k is 6, 8, 10, and 12, respectively.

The size of the ensemble is a parameter that can be changed by the user to balance the prediction accuracy and the computation time needed for prediction. Figure 5 shows the prediction accuracy on data sets BATF, EGR1, FOXO1, and HSF1 when the value of the parameter k is 6, 8, 10, and 12. It can be seen from the figure that the prediction accuracy improves when the size of the ensemble increases and becomes steady when the value of the parameter reaches 10. The testing results thus show that a parameter value of 10 is sufficient to achieve satisfactory prediction accuracy in practice.

IV. CONCLUSIONS

In this paper, we developed a new approach that can accurately and efficiently identify the binding site motifs in a set of homologous DNA sequences. Our approach starts with a pair of sequences in the set and uses the local alignment results of the two sequences to construct an initial ensemble. It then progressively processes the remaining sequences in the set and updates the parameters of the HMMs in the ensemble until every sequence in the set has been processed. Experimental results show that, on the data sets we tested, this approach achieves higher or comparable accuracy on sequences with a single binding site, while its accuracy on sequences with multiple binding sites is significantly higher than that of other tools.

ACKNOWLEDGMENT

Y. Song's work is supported by the Startup Funding for New Faculty at Jiangsu University of Science and Technology.

REFERENCES

[1] T.L. Bailey and C. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Technical Report CS93-302, Department of Computer Science, University of California, San Diego, August 1993.

[2] T.L. Bailey and C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, 1994.
[3] M.M. Brent, R. Anand, and R. Marmorstein, Structural basis for DNA recognition by FoxO1 and its regulation by posttranslational modification, Structure, 16: 1407-1416, 2008.
[4] D. Che, Y. Song, and K. Rasheed, MDGA: Motif Discovery Using A Genetic Algorithm, Proceedings of the Genetic and Evolutionary Computation Conference 2005, pp. 447-452.
[5] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
[6] D.J. Galas and A. Schmitz, DNA footprinting: a simple method for the detection of protein-DNA binding specificity, Nucleic Acids Research, 5, 9, pp. 3157-3170, 1978.
[7] M.M. Garner and A. Revzin, A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory systems, Nucleic Acids Research, 9, 13, pp. 3047-3060, 1981.
[8] G.Z. Hertz and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15, 7, pp. 563-577, 1999.
[9] T.C. Hu, et al., Snail associates with EGR-1 and SP-1 to upregulate transcriptional activation of p15INK4b, The FEBS Journal, 277: 1202-1218, 2010.
[10] F.F.M. Liu, J.J.P. Tsai, R.M. Chen, S.N. Chen, and S.H. Shih, FMGA: finding motifs by genetic algorithm, IEEE Fourth Symposium on Bioinformatics and Bioengineering, pp. 459-466, 2004.
[11] J.S. Liu, A.F. Neuwald, and C.E. Lawrence, Bayesian models for multiple local sequence alignment and Gibbs sampling strategies, J. Am. Stat. Assoc., 90, 432, pp. 1156-1170, 1995.
[12] X. Liu, D.L. Brutlag, and J.S. Liu, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pacific Symposium on Biocomputing, 6, pp. 127-138, 2001.
[13] M. Quigley et al., Transcriptional analysis of HIV-specific CD8+ T cells shows that PD-1 inhibits T cell function by upregulating BATF, Nature Medicine, 16, 1147-1151, 2010.
[14] K.T. Rigbolt, et al., System-wide temporal characterization of the proteome and phosphoproteome of human embryonic stem cell differentiation, Science Signaling, 4: RS3-RS3, 2011.
[15] F.R. Roth, J.D. Hughes, P.E. Estep, and G.M. Church, Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation, Nature Biotechnology, 16, 10, pp. 939-945, 1998.
[16] T.F. Smith and M.S. Waterman, Identification of Common Molecular Subsequences, Journal of Molecular Biology, 147: 195-197, 1981.
[17] J. Song, C. Liu, Y. Song, J. Qu, and G. Hura, Alignment of multiple proteins with an ensemble of Hidden Markov Models, International Journal of Bioinformatics and Data Mining, 4(1): 60-71, 2010.
[18] G.D. Stormo, Computer methods for analyzing sequence recognition of nucleic acids, Annu. Rev. BioChem, 17, pp. 241-263, 1988.
[19] G.D. Stormo and G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments, Proc. of Nat. Acad. Sci., 86, 4, pp. 1183-1187, 1989.