HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms


HMMConverter: A tool-box for hidden Markov models with two novel, memory-efficient parameter training algorithms

by

TIN YIN LAM
B.Sc., The Chinese University of Hong Kong, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
October, 2008

© TIN YIN LAM 2008

Abstract

Hidden Markov models (HMMs) are powerful statistical tools for biological sequence analysis. Many recently developed Bioinformatics applications employ variants of HMMs to analyze diverse types of biological data. It is typically fairly easy to design the states and the topological structure of an HMM. However, it can be difficult to estimate parameter values which yield a good prediction performance. As many HMM-based applications employ similar algorithms for generating predictions, it is also time-consuming and error-prone to have to re-implement these algorithms whenever a new HMM-based application is to be designed. This thesis addresses these challenges by introducing a tool-box, called HMMConverter, which only requires an XML input file to define an HMM and to use it for sequence decoding and parameter training. The package not only allows for rapid proto-typing of HMM-based applications, but also incorporates several algorithms for sequence decoding and parameter training, including two new, linear memory algorithms for parameter training. Using this software package, even users without programming knowledge can quickly set up sophisticated HMMs and pair-HMMs and use them with efficient algorithms for parameter training and sequence analyses. We use HMMConverter to construct a new comparative gene prediction program, called Annotaid, which can predict pairs of orthologous genes by integrating prior information about each input sequence probabilistically into the gene prediction process and into parameter training. Annotaid can thus be readily adapted to predict orthologous gene pairs in newly sequenced genomes.

Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgements

1 HMMs in Bioinformatics applications
  1.1 Introduction
  1.2 Introduction to hidden Markov models
      Theoretical background
      Variants of HMMs for Bioinformatics applications
  1.3 Challenges in designing and implementing HMMs
  1.4 Discussion and conclusion

2 Decoding and parameter training algorithms for HMMs
  2.1 Introduction and motivation
  2.2 Decoding algorithms
      2.2.1 The Viterbi algorithm
      2.2.2 The Hirschberg algorithm
  2.3 Parameter training
      2.3.1 The Baum-Welch algorithm
      2.3.2 The Viterbi training algorithm
  2.4 Discussion and conclusion

3 HMMConverter
    Motivation
    Main features
      Input and output files
      Special features
    Sequence decoding algorithms
    Training algorithms
      Training using sampled state paths
      Sampling a state path from the posterior distribution
      A new, linear memory algorithm for parameter training using sampled state paths
      A new linear memory algorithm for Viterbi training
      Pseudo-counts and pseudo-probabilities
      Training free parameters
    Availability
    Discussion and conclusion

4 Specifying an HMM or pair-HMM with HMMConverter
    Introduction
    The XML-input file
      The <model> component of the XML file
      The <sequence analysis> component of the XML file
    Other text input files
      The input sequence file
      The free emission parameter file
      The free transition parameter file
      The tube file
      The prior information file
    The output files of HMMConverter
    Examples

5 Implementation of HMMConverter
    Introduction
    Converting XML into C++ code
      The model parameters class
      Other classes
    Decoding and training algorithms
      The Viterbi and Hirschberg algorithms
      The parameter training algorithms

6 Annotaid
  6.1 Introduction and Motivation
  6.2 Main features of Annotaid
      The pair-HMM of Annotaid
      Banding and sequence decoding algorithms
      Integrating prior information along the input sequences
      Parameter training in Annotaid
  6.3 Availability
  6.4 Discussion and Future Work

Bibliography

List of Figures

1.1 A Markov model for the CpG-Island process
1.2 An HMM for the CpG-Island
1.3 A 5-state simple pair-HMM
1.4 A simple pair-HMM for predicting related protein coding regions in two input DNA sequences
2.1 A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences
2.2 Illustration of the Viterbi algorithm
2.3 Illustration of the Hirschberg algorithm (1)
2.4 Illustration of the Hirschberg algorithm (2)
3.1 The XML-input file for defining an HMM in HMMConverter
3.2 Using Blast to generate a tube for HMMConverter
3.3 Defining a tube for HMMConverter explicitly
3.4 A state which carries labels from two different annotation label sets
3.5 Illustration of the special emission feature in HMMConverter
3.6 Using the Hirschberg algorithm inside a tube
3.7 Illustration of the forward algorithm
3.8 Sampling a state path from the posterior distribution
Illustration of the novel, linear memory algorithm for parameter training using sampled state paths
A part of the pair-HMM of Doublescan
Defining a state of an HMM or pair-HMM in HMMConverter
A part of the pair-HMM of Doublescan
Specifying a tube explicitly in a flat text file
A simple 5-state pair-HMM
Combining the Hirschberg algorithm with a tube
Special emissions in Projector and Annotaid
Combining the parameter training with a tube

Acknowledgements

I would like to thank my supervisor, Prof. Irmtraud Meyer, for her advice, support and encouragement during the course of this project. A special thank you goes to Prof. Anne Condon for agreeing to be the second reader of my thesis. I thank Nick Wiebe, Rodrigo Goya, Chris Thachuk and all other members of the BETA lab, who have made my life on campus enjoyable as they are always helpful and nice.

Chapter 1: HMMs in Bioinformatics applications

1.1 Introduction

Hidden Markov models (HMMs) are flexible and powerful statistical tools for many different kinds of sequence analysis. HMMs were first introduced in a series of papers by Leonard E. Baum in the 1960s, and one of their early applications was speech recognition in the mid-1970s. Today, HMM-based methods are widely used in many different fields such as speech recognition and speech synthesis [42, 45], natural language processing [32], hand-writing recognition [27] and Bioinformatics [11]. Among these applications, HMMs are especially popular for sequence analysis in Bioinformatics. In Bioinformatics, HMMs are used for sequence annotation such as gene prediction [8, 23], sequence analysis such as transcription binding site recognition [17] and peptide identification [22], sequence alignment such as protein alignment [21, 24], and other applications such as protein-protein interaction prediction [41]. Many variations of HMMs have been developed over the past two decades for different Bioinformatics applications. In addition, new algorithms for sequence decoding and parameter training have been introduced. Although HMMs are a powerful and popular concept, several practical challenges often have to be overcome in practice.

It is not easy to design an HMM, especially to assign values to its parameters. The time and memory requirements for analyzing long input sequences can be prohibitive. Finally, it can be very time-consuming and error-prone to implement an HMM and the corresponding prediction and parameter training algorithms. This thesis presents an HMM generation tool called HMMConverter, which only requires an XML file and simple text input files for defining an HMM, using it for predictions and training its parameters. Users of HMMConverter are not required to have programming skills. The tool provides the most commonly used Viterbi algorithm [47] and the more memory-efficient Hirschberg algorithm [16] for sequence decoding. For parameter training, HMMConverter offers not only a new, linear memory algorithm for Baum-Welch training, but also a novel linear memory variation of the Viterbi training algorithm as well as a new and efficient algorithm for training with sampled state paths. Furthermore, HMMConverter supports many special features for constructing various types of HMMs, such as higher order HMMs, pair-HMMs and HMMs which take external information on the input sequences into account for parameter training and generating predictions. In this chapter, we give some theoretical background on HMMs and describe variants of HMMs and their applications in Bioinformatics. We also discuss the challenges of designing and implementing an HMM. Existing sequence decoding and parameter training algorithms for HMMs are introduced in chapter 2. Chapter 3 describes the main features of HMMConverter, including a detailed explanation of the novel parameter training algorithms. The specification of an HMM within HMMConverter and the input and output formats of the tool are described in chapter 4, and some implementation details of HMMConverter are provided in chapter 5. Finally, we use this tool to implement a novel comparative gene prediction pair-HMM called Annotaid, which can take various types of prior information on each of the two input sequences into account in order to give more accurate predictions. The features and applications of this new model are described in chapter 6.

1.2 Introduction to hidden Markov models

Theoretical background

A hidden Markov model (HMM) is a variation of a Markov model in which the state sequence of the Markov chain is hidden from the observer. We first explain what a Markov chain is and then use a simple biological example to introduce HMMs.

Markov model

A Markov chain is a random process with the Markov property. In brief, a random process is a process for which the next state is determined in a probabilistic way. The Markov property implies that, given the current state in the random process, the future state is independent of the past states. Let X_i be a random variable for all positive integers i, representing the state of the process at time i. The sequence X_1, X_2, ..., X_i, ... is a Markov chain if:

P(X_i = x | X_{i-1} = x_{i-1}, ..., X_1 = x_1) = P(X_i = x | X_{i-1} = x_{i-1})

A time-homogeneous Markov chain is one for which the next state of the process does not depend on the time, but only on the current state:

P(X_i = x | X_{i-1} = y) = P(X_{i-1} = x | X_{i-2} = y)

One can also define higher order Markov chains. A Markov chain of n-th order is a Markov chain for which the next state depends on the current state and the previous n-1 states. Note that when we say Markov chain, we refer to the simplest case, i.e. a 1st-order Markov chain. A Markov chain can be represented by a finite state machine [11].

CpG-Island

We now use the CpG-Island example [4] from the book Biological Sequence Analysis [11] to introduce hidden Markov models. In a genomic DNA sequence, the dinucleotide CG (i.e. a neighboring C and G on the same strand) appears more frequently around the transcription start sites of genes than in other regions of the genome. We call a region which has an increased frequency of CG pairs a CpG-Island. CpG-Islands are usually a few hundred to a few thousand bases long. Suppose we want to identify CpG-Islands in a long genomic sequence. We can do this by defining a Markov chain. Each state of the model in figure 1.1 represents a DNA nucleotide either inside a CpG-Island or outside a CpG-Island. The Markov property holds for this model, as the probability of going from a given state to the next state only depends on the current state, not on the previous states. Figure 1.1 shows the Markov model of the CpG-Island process, but this model is lacking something, as we cannot distinguish whether an observed nucleotide is from a CpG-Island (+) or not (-). We can modify this model a little, see figure 1.2. In this new model, each observed nucleotide can be read by two states, one state with label + representing a nucleotide from a CpG-Island and one state with label - representing a nucleotide outside a CpG-Island. This is a hidden Markov model for the CpG-Island process, where the hidden feature is the underlying + or - state associated with the nucleotides in the observed sequence. The transition probability from the C+ state to the G+ state should be larger than the transition probability from the C- state to the G- state. The start state and the end state in the two models are both silent states which do not read any character from the input sequence. The model always starts in the start state, before reading any character from the input sequence.

Figure 1.1: A Markov model for the CpG-Island process. This is a Markov model for the CpG-Island process, but it cannot solve the CpG-Island annotation problem.

Figure 1.2: An HMM for the CpG-Island. This figure shows an HMM which can model CpG-Islands. The states of the model are all connected to each other, i.e. they form a complete graph (K_8) in which each state also has a transition back to itself.

On the other hand, every state path ends at the end state. In the above example, the start state is connected to all the other states (except the end state), and all the states except the start state are connected to the end state. Now we are ready to give a more formal definition of an HMM. An HMM is unambiguously defined by the following components:

- A set of N states S = {0, ..., N-1}, where 0 is the start state and N-1 is the end state.
- An alphabet A which contains the symbols of observations of the HMM (e.g. {A, C, G, T} is the DNA alphabet for the CpG-Island HMM in the above example).
- A set of transition probabilities T = {t_{i,j} | i, j ∈ S}, where t_{i,j} ∈ [0, 1] is the transition probability of going from state i in the model to state j, and Σ_{j ∈ S} t_{i,j} = 1 for all states i in S.
- A set of emission probabilities E = {e_i(γ) | i ∈ S, γ ∈ A}, where e_i(γ) ∈ [0, 1] is the emission probability of state i for symbol γ, and Σ_{γ ∈ A} e_i(γ) = 1 for all states i in S.
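To make the four components concrete, here is a minimal illustrative sketch of how they could be written down in Python. It is not part of the thesis and not HMMConverter's XML format; the two emitting states collapse the eight nucleotide-specific states of figure 1.2 into a single "+" and a single "-" state just to keep the example short, and all probability values are made-up placeholders rather than trained parameters.

```python
# Illustrative representation of an HMM's four components
# (states S, alphabet A, transition probabilities T, emission probabilities E).
# All numbers are made-up placeholders, not trained values.

states = ["start", "+", "-", "end"]     # "+" = inside a CpG-Island, "-" = outside
alphabet = ["A", "C", "G", "T"]         # DNA alphabet

# T: transition probabilities t[i][j]; each row must sum to 1
transitions = {
    "start": {"+": 0.5, "-": 0.5},
    "+":     {"+": 0.85, "-": 0.14, "end": 0.01},
    "-":     {"+": 0.05, "-": 0.94, "end": 0.01},
}

# E: emission probabilities e[i][symbol]; the silent start/end states read nothing
emissions = {
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

# Check that every probability distribution is normalised
for s, row in list(transitions.items()) + list(emissions.items()):
    assert abs(sum(row.values()) - 1.0) < 1e-9, f"distribution of {s} does not sum to 1"
```

The full model of figure 1.2 would simply use one entry per (label, nucleotide) state instead of two aggregate states.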

An HMM as defined above corresponds to a probabilistic model. It can be used in a number of ways to generate predictions. In the following, let M denote an HMM, X = (x_1, ..., x_L) an input sequence of length L where x_i is the i-th letter of the sequence, and π = (π_0, ..., π_l) a state path of length l+1 in the model M where π_k is the state at the k-th position in the state path; in particular, π_0 is always the start state and π_l is always the end state. We can use M in a number of ways to generate predictions:

Sequence decoding: For a given input sequence X and model M, we can for example calculate the state path with the highest overall probability in the model, i.e. the state path π which maximizes P(π | X, M). In that case, model M analyzes the sequence X and we will, in all of the following, say that the states of model M read the symbols of observations of input sequence X.

Parameter training: For a given set of training sequences X and a given model M, we can train the set of free parameters θ of M, for example by using training algorithms which maximize P(X | θ, M).

Generating sequences of observations: As any HMM defines a probability distribution over sequences of observations, we can use any given HMM M to generate sequences of observations. In that case, we say that the states of model M emit symbols of observations. In this thesis, this way of using an HMM is not employed. We will, in all of the following, therefore refer to a state as reading one or more symbols from an input sequence.

The CpG-Island finding problem can be solved with a decoding algorithm, which computes the most probable state path π* for an input DNA sequence X and model M. Before we can use an HMM for decoding, we first have to determine good values for its parameters, which may require parameter training. Algorithms for decoding and parameter training are discussed in more detail in this chapter and in chapter 2. In the following section, we first describe variants of HMMs that have been developed recently for Bioinformatics applications.
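Before moving on to these variants, the "generating" view of an HMM is easy to picture with a short sketch of my own (the thesis itself never uses this mode): starting in the start state, we repeatedly sample a transition and, for every non-silent state, one emitted symbol, until the end state is reached. It reuses the hypothetical `transitions` and `emissions` dictionaries from the previous sketch.

```python
import random

def sample_observation_sequence(transitions, emissions, max_steps=10_000):
    """Sample one state path and its observation sequence from an HMM.

    Assumes the layout of the previous sketch: silent 'start' and 'end'
    states, and every other state emitting exactly one symbol per visit.
    """
    state, path, observations = "start", ["start"], []
    for _ in range(max_steps):
        # Sample the next state according to the transition distribution
        nxt = random.choices(list(transitions[state].keys()),
                             weights=list(transitions[state].values()))[0]
        path.append(nxt)
        if nxt == "end":
            break
        # Sample one symbol from the emission distribution of the new state
        symbol = random.choices(list(emissions[nxt].keys()),
                                weights=list(emissions[nxt].values()))[0]
        observations.append(symbol)
        state = nxt
    return path, "".join(observations)
```

Run a few times on the CpG-Island sketch, this produces short DNA strings whose C/G content reflects how long the sampled path stays in the "+" state.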

Variants of HMMs for Bioinformatics applications

Pair-HMMs and n-HMMs

A pair-HMM is a simple extension of an HMM that reads two rather than only one sequence at a time. One of the main applications of pair-HMMs is to generate sequence alignments, so pair-HMMs are particularly prevalent in Bioinformatics [39]. For example, a pair-HMM called WABA [21] can be used to identify pairs of protein coding exons in two orthologous input DNA sequences. This pair-HMM generates a global alignment between the two input sequences, unlike TBlastX, which generates local alignments [1]. Pair-HMMs are also used for comparative gene prediction [26, 34, 36]. These pair-HMMs take a pair of homologous DNA sequences as input and predict pairs of homologous gene structures. Comparative gene prediction methods using pair-HMMs have been shown [26, 36] to perform better than non-comparative approaches [8]. Figure 1.3 shows a simple 5-state pair-HMM. In an n-HMM, each state can read letters from up to n input sequences. n-HMMs can be used for multiple sequence alignment, but because of their high memory and time requirements, they are typically used only for short sequences, such as protein sequences.

Generalized HMMs

Generalized HMMs [8, 48] allow states to read more than one letter at a time from an input sequence. In Bioinformatics applications, a state of an HMM often corresponds to a biological entity, e.g. a codon consisting of 3 nucleotides; such a state would read 3 letters from the input sequence at a time. Figure 1.4 shows a simple generalized pair-HMM for finding related protein-coding exons in two input DNA sequences. In particular, any HMM used for gene prediction should be a generalized HMM.
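A convenient way to think about pair-HMM and generalized HMM states is by how many letters each state reads from each input sequence per step. The sketch below lists such read increments for the states of figures 1.3 and 1.4 in my own notation (it is not HMMConverter's format); the Match Codon and Emit Codon names appear in the text and figure captions, whereas the non-coding state names are only plausible guesses.

```python
# Read increments (dx, dy): letters read from input sequences X and Y per visit.
# Illustrative only; silent states read nothing.

# Figure 1.3: simple 5-state pair-HMM
simple_pair_hmm = {
    "start":  (0, 0),
    "Match":  (1, 1),   # one letter from X and one from Y, aligned to each other
    "Emit_X": (1, 0),   # one letter from X only (gap in Y)
    "Emit_Y": (0, 1),   # one letter from Y only (gap in X)
    "end":    (0, 0),
}

# Figure 1.4: generalized pair-HMM whose coding states read whole codons
generalized_pair_hmm = {
    "start":            (0, 0),
    "Match_Codon":      (3, 3),   # one codon from each sequence
    "Emit_Codon_X":     (3, 0),
    "Emit_Codon_Y":     (0, 3),
    "Match_NonCoding":  (1, 1),   # hypothetical names for the non-coding states
    "Emit_NonCoding_X": (1, 0),
    "Emit_NonCoding_Y": (0, 1),
    "end":              (0, 0),
}
```

The per-sequence read counts are the quantities written Δx(s) in the generalized Viterbi recursion of section 2.2.1.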

Figure 1.3: A 5-state simple pair-HMM. The Match state reads one character from each of the two input sequences and aligns them according to the emission probability of the combination of the two characters read by the state. The Emit X state reads one character from input sequence X at a time, and the Emit Y state does the same, but reads one character from input sequence Y.

Figure 1.4: A simple pair-HMM for predicting related protein coding regions in two input DNA sequences. There are 3 states to model the protein coding part and 3 states to model the non-coding part of the two input sequences. The states which model the protein coding part read one codon at a time, e.g. the Match Codon state reads 3 nucleotides from each input sequence at a time.

Explicit state duration HMMs

Explicit state duration HMMs extend the concept of HMMs by keeping track of the state's duration, i.e. the time that the state path has stayed in the same state is also remembered in the decoding process. In this type of HMM, a set of parameters or functions for modeling the state duration is needed in addition to the set of transition and emission probabilities, and the probability of the next state is conditional on both the current state and the state duration. Genscan [8] and SNAP [25] are two ab initio gene prediction models which are based on an explicit state duration HMM.

Evolutionary HMMs

For many types of biological sequence analysis, it is important to know the evolutionary correlations between the sequences. [24] describe a method that uses evolutionary information to derive the transition probabilities of the insertion and deletion states of sequence alignment pair-HMMs. Evogene [40] is an evolutionary HMM (EHMM) which incorporates the evolutionary relationship between the input sequences into the sequence decoding process. Evogene consists of an HMM and a set of evolutionary models. In the model, each state of the HMM consists of a set of alphabets over fixed alignment columns and a state-specific evolutionary model, and the emission probabilities of each state are calculated using the corresponding evolutionary model for each column in the input alignment. The Felsenstein algorithm [13] is used to calculate the likelihood of an alignment column given an evolutionary input tree and an evolutionary model.

HMMs which take external evidence into account

Because of significant improvements in biological sequencing technology, large amounts of genome-wide data such as cDNAs, ESTs and protein sequences [3, 18, 37, 38] are currently being generated. It is desirable to incorporate these types of additional evidence into computational sequence analysis methods to obtain better performance. In recent years, many HMMs that can incorporate prior information on the input sequences into the prediction process have been developed. They are especially popular in gene prediction because of the readily available data. Twinscan [26] is one of the first HMMs that incorporates external information into the gene prediction process. It extends the classical gene prediction HMM Genscan [8] by incorporating matches between the input DNA sequence and several homologous genomic sequences into the predictions. Twinscan first generates a conserved sequence from the matches between homologous DNA sequences and the DNA sequence of interest. This conserved sequence is then used to bias the emission probabilities of the underlying Genscan HMM. Doublescan [34] is a comparative gene prediction pair-HMM which performs ab initio gene prediction and sequence alignment simultaneously by taking two un-annotated and unaligned homologous genomic sequences as input. It incorporates splice site information along each input sequence using the predictions of Stratasplice [29], which generates log-odds scores and posterior probabilities for potential splice sites. These scores are then integrated into the prediction process and greatly improve the prediction accuracy of splice sites. Projector [36] extends Doublescan by using the known genes in one of the two input sequences to predict the gene structures in the other, un-annotated DNA input sequence. It uses so-called special emissions, which integrate the information on the annotated sequence into the prediction process by biasing the emission probabilities. In its published version, Projector is only capable of incorporating prior probabilities with discrete values (i.e. either 0 or 1).

More recently, several gene prediction HMMs have been developed for integrating various types of prior information into the prediction process [7, 15, 19, 44]. These programs (they all employ HMMs but not pair-HMMs) all allow taking prior information with different levels of confidence into account. For example, in Jigsaw [19], a feature vector is used to store prior information on different features from various sources as probabilities. Simple multiplication rules are then used to incorporate the vector into the nominal emission probabilities. ExonHunter [7] calls a piece of external evidence an advisor; different advisors are combined into a superadvisor by quadratic programming (i.e. minimizing or maximizing a quadratic function of several variables subject to linear constraints on these variables), which is then incorporated into the prediction process. All of these HMMs are capable of integrating various kinds of prior information with different confidence scores into the prediction process, but they all have some limitations. For example, they cannot deal with contradicting pieces of information, and one state of the model can only incorporate one type of prior information.

1.3 Challenges in designing and implementing HMMs

As discussed previously, HMMs are a widely used and powerful concept in many applications. However, it is often not easy to design and implement an HMM. Various algorithms exist for generating predictions with the model and for training the model's parameters. The main challenges are:

- Long input sequences lead to large memory and time requirements for predictions and parameter training.
- It is difficult to set up the transition and emission parameters of the model such that they yield a good performance.

- It is time-consuming and tricky to implement the model and to test several HMMs against each other ("proto-typing").

Dealing with long input sequences

In Bioinformatics, the length of input sequences for HMMs varies between applications. For example, the average gene size (before splicing) of eukaryotic genes varies from around 1 kb (yeast) to 50 kb (human) or longer. Before discussing the sequence decoding and parameter training algorithms in detail, we first specify the time and memory requirements of these algorithms to give readers a better impression of the challenges. For a pair-HMM with N states, the Viterbi algorithm [47], which is the most widely used algorithm for sequence decoding, requires O(N^2 L^2) time and O(N L^2) memory for analyzing a pair of sequences of length L each. For a gene prediction pair-HMM consisting of 10 states, it would thus require at least 10^8 time units and between 10^7 and 10^9 units of memory to analyze typical input sequences which contain only a single gene. These constraints are even more serious for parameter training, as most training algorithms are iterative processes which have to consider many input sequences. These challenges for parameter estimation of HMMs are described below. New, efficient algorithms are needed to make decoding and parameter training practical even for long input sequences.

Parameter training

The set of transition probabilities and emission probabilities constitutes the parameters of an HMM. There are additional parameters for explicit state duration HMMs and evolutionary HMMs. As the performance of an HMM critically depends on the choice of the parameter values, it is important to find good parameter values. As the states of an HMM closely reflect the biological problem that is being addressed, it is typically fairly easy to design the states of the model.

However, it can be difficult to manually derive good parameter values. There are two common ways to define the parameters of an HMM: (1) Users can manually choose the parameters. This is time-consuming and requires a very good understanding of the data; it also may not result in values that optimize the performance. (2) Another strategy is to use an automatic parameter training algorithm. This strategy works best if we have a large and diverse set of annotated training sequences. Being able to use automatic parameter training for a comparative gene prediction pair-HMM means that it can be readily trained to analyze different pairs of genomes. However, as we explain now, parameter training is not an easy task. There are two common types of parameter training strategies for HMMs: maximum likelihood (ML) methods and expectation maximization (EM) methods [10]. The corresponding training algorithms are usually iterative processes whose convergence depends on the set of initial values and the amount of training data. Depending on the training algorithm, the outcome of the training may depend on the initial parameter values and may not necessarily maximize the resulting prediction performance. Pseudo-counts for the parameters may be needed to avoid over-fitting, especially if the training data is sparse or biased. We have discussed the importance of parameter training for setting up an HMM. Because of the difficulties of implementing training algorithms that can be used efficiently on realistic data sets, parameter training algorithms are not implemented in most HMM applications (i.e. the parameters are set up manually for one particular data set). An exception are some HMMs for gene prediction [25, 30, 46]. This is because (1) the set of parameters is very important for the gene prediction performance, (2) parameters need to be adapted to predict genes in different organisms, and (3) due to the great improvement in biological sequencing techniques, there are large enough sets of known genes to allow parameter training. For gene prediction methods, the main challenge is to implement efficient parameter training algorithms. The existing parameter training algorithms are described in chapter 2.

Proto-typing

Designing an HMM for a new Bioinformatics application is fairly easy given a good understanding of the underlying biological problem, whereas the implementation of several possible HMMs can be a tedious task. Proto-typing several HMMs and implementing the sequence decoding and parameter training algorithms is time-consuming and error-prone. Although there are many variations of HMMs, they all use the same prediction and training algorithms. It is particularly inefficient if the same algorithms have to be implemented repeatedly whenever a new HMM application is created. If there were a software tool for generating an HMM with the corresponding algorithms, then users would be able to focus on designing the HMM, i.e. the states of the model, which is the only task that requires human expertise and insight into the biological problem.

1.4 Discussion and conclusion

HMMs are a very powerful statistical concept which allows many different kinds of sequence analysis applications. In this chapter, we have introduced the theoretical background of HMMs and some popular variants of HMMs, and we have also described the popularity of these models in Bioinformatics. As we explained, it is typically not easy to design and implement an HMM in practice. We have discussed several challenges, such as long input sequences which lead to large memory and time requirements, difficulties with setting up the parameters of the model, and the time-consuming and error-prone task of implementing efficient algorithms. Even though many different HMMs have been developed for different applications, they use similar prediction and parameter training algorithms. It is therefore not efficient if users spend most of their time implementing and re-implementing these algorithms for different HMMs, when they could be focusing on the design and parameterization of the HMM.

We here propose an HMM-generating tool, HMMConverter, which takes an XML input file defining an HMM, its states and parameters, and the algorithms to be used for generating predictions and training the parameters. Using this tool, a user is not required to have computer programming knowledge to set up an HMM and to use it for data analyses and parameter training. The tool provides several widely used sequence decoding and parameter training algorithms, including two novel parameter training algorithms. Furthermore, HMMConverter supports many special features, such as incorporating external information on each input sequence with confidence scores, and several heuristic algorithms for greatly reducing the run time and memory requirements while keeping the performance essentially unchanged. These features are especially useful for Bioinformatics applications. In the following chapter, we introduce several widely used existing sequence decoding and parameter training algorithms. The HMM-generating tool HMMConverter is described in chapter 3, including details on all special features and novel algorithms. Chapter 4 explains how to specify an HMM using HMMConverter and chapter 5 describes some implementation details of HMMConverter. Chapter 6 describes a novel pair-HMM for comparative gene prediction called Annotaid, which is constructed using the HMMConverter framework.

Chapter 2: Decoding and parameter training algorithms for HMMs

2.1 Introduction and motivation

In biological sequence analysis and annotation, we are interested in finding good annotations for raw biological sequences. This is the sequence decoding problem discussed in chapter 1. Every HMM assigns an overall probability to each state path which is equal to the product of the encountered transition and emission probabilities, and every state path corresponds to an annotation of the input sequence. For an HMM with well-chosen parameters, high-quality annotations should correspond to high-probability state paths, so the task of sequence decoding is to find a state path with high probability. The Viterbi algorithm [47] is one of the most widely used sequence decoding algorithms; it calculates the most probable state path for a given input sequence and HMM. In many Bioinformatics applications, it is often the case that several state paths yield the same annotation. For example, in a simple pair-HMM which generates global alignments of protein coding regions in two input DNA sequences (see figure 2.1), the states Emit Codon X and Match Codon both assign the annotation label Codon to the letters they read from sequence X. In this pair-HMM, several different state paths that correspond to different global alignments between the two sequences could yield the same gene annotation for the two input sequences.

From this example, we can see that it is always possible to translate a state path in the model into an annotation of the input sequence.

Figure 2.1: A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences. This simple generalized pair-HMM has been presented in chapter 1. In particular, the Match Codon state, the Emit Codon X state and the Emit Codon Y state all assign the annotation label Codon to the letters they read from the input sequences. The three other states (except the start and the end state) assign the annotation label non-coding.

2.2 Decoding algorithms

In this section, we describe two existing sequence decoding algorithms for HMMs: the most commonly used Viterbi algorithm [47] and the Hirschberg algorithm [16]. The Hirschberg algorithm can be regarded as a more memory-efficient version of the Viterbi algorithm.

For a given HMM and an input sequence, both algorithms derive the state path with the highest overall probability.

2.2.1 The Viterbi algorithm

The Viterbi algorithm is one of the most widely used sequence decoding algorithms for HMMs. It first calculates the maximum probability that a state path in a given model M can have for a given input sequence, and then retrieves the corresponding optimal state path by a back-tracking process. In this section, we introduce the algorithm for an HMM, i.e. a single-tape HMM where only the start and end states are silent and all other states read one letter from the input sequence at a time. The algorithm consists of two parts:

Part 1: Calculate the optimum probability of any state path using dynamic programming.
Part 2: Retrieve the corresponding optimal state path by a back-tracking process.

We first introduce some notation before explaining the algorithm with pseudo-code; some of this notation has already been introduced in chapter 1.

- X = (x_1, ..., x_L) denotes an input sequence of length L, where x_i is the i-th letter of the sequence.
- S = {0, ..., N-1} denotes the set of N states in the HMM, where 0 is the start state and N-1 is the end state.
- T = {t_{i,j} | i, j ∈ S}, where t_{i,j} is the transition probability from state i to state j.

- E = {e_i(γ) | i ∈ S, γ ∈ A}, where e_i(γ) is the emission probability of state i for reading symbol γ from alphabet A.
- v denotes the two-dimensional Viterbi matrix, where v_s(i) is the probability of the most probable path that ends in state s and that has read the input sequence up to and including position i.
- ptr denotes the two-dimensional pointer matrix, where ptr_i(s) is the previous state from which the maximum probability at state s and sequence position i was derived.
- π = (π_0, ..., π_{L+1}) represents a state path in the model for a given input sequence of length L, where π_i is the i-th state of the state path. In particular, π_0 is the start state and π_{L+1} is the end state.
- π* denotes the optimal state path, i.e. the state path with the highest overall probability on sequence X: π* = argmax_π P(π | X, M). We also call this state path the Viterbi path.

In order to calculate the optimum probability among all state paths for a given HMM and a given input sequence, the Viterbi algorithm uses the following dynamic programming approach.

Part 1:

Initialization (before reading the first letter from the input sequence):

v_s(0) = 1 if s = start, and v_s(0) = 0 if s ≠ start.

Recursion: for i = 1, ..., L and for each state s' ∈ {1, ..., N-2}:

v_{s'}(i) = e_{s'}(x_i) · max_{s ∈ S} { v_s(i-1) · t_{s,s'} }        (2.1)

ptr_i(s') = argmax_{s ∈ S} { v_s(i-1) · t_{s,s'} }                   (2.2)

Termination (at the end of the sequence and in the end state):

v_{N-1}(L) = max_{s ∈ S} { v_s(L) · t_{s,end} }

ptr_L(N-1) = argmax_{s ∈ S} { v_s(L) · t_{s,end} }

From the definition of v, we know that the probability of the optimal state path is P(X, π*) = v_{N-1}(L). The corresponding optimal state path can be obtained by a back-tracking procedure.

Part 2: Back-tracking starts at the end of the sequence in the end state and finishes at the start of the sequence in the start state. It is defined by the recursion:

π*_{i-1} = ptr_i(π*_i)

Figure 2.2 illustrates the first part of the Viterbi algorithm. The above algorithm can easily be generalized to pair-HMMs by using a three-dimensional matrix and looping over both the X and Y dimensions in the recursion. For a generalized HMM, where states can read more than one character at a time from the input sequence, the entries of the Viterbi matrix are calculated by:

v_{s'}(i) = e_{s'}(i) · max_{s ∈ S} { v_s(i - Δx(s')) · t_{s,s'} },

where Δx(s') is the number of characters read by state s' and e_{s'}(i) denotes the emission probability of state s' for the Δx(s') characters ending at position i.
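The recursion above translates almost line by line into code. The following is a small illustrative Python implementation, not the thesis's C++ code, of the Viterbi algorithm for a single-tape HMM with silent start and end states, written against the dictionary representation of the earlier sketches. It multiplies raw probabilities, so for long sequences a real implementation would work in log space.

```python
def viterbi(sequence, states, transitions, emissions):
    """Return (probability, state path) of the most probable state path.

    Assumes silent 'start' and 'end' states and that every other state
    reads exactly one symbol at a time, as in the earlier sketches.
    """
    emitting = [s for s in states if s not in ("start", "end")]
    L = len(sequence)
    # v[i][s]: probability of the best path reading x_1..x_i and ending in state s
    v = [{s: 0.0 for s in emitting} for _ in range(L + 1)]
    ptr = [dict() for _ in range(L + 1)]

    # Initialization: position 1 is reached directly from the start state
    for s in emitting:
        v[1][s] = transitions["start"].get(s, 0.0) * emissions[s][sequence[0]]
        ptr[1][s] = "start"

    # Recursion (equations 2.1 and 2.2)
    for i in range(2, L + 1):
        for s2 in emitting:
            best_prev, best_p = None, 0.0
            for s1 in emitting:
                p = v[i - 1][s1] * transitions[s1].get(s2, 0.0)
                if p > best_p:
                    best_prev, best_p = s1, p
            v[i][s2] = best_p * emissions[s2][sequence[i - 1]]
            ptr[i][s2] = best_prev

    # Termination: take the best transition into the end state
    best_last, best_p = None, 0.0
    for s in emitting:
        p = v[L][s] * transitions[s].get("end", 0.0)
        if p > best_p:
            best_last, best_p = s, p

    # Part 2: back-tracking from the end of the sequence to its start
    path = [best_last]
    for i in range(L, 1, -1):
        path.append(ptr[i][path[-1]])
    return best_p, ["start"] + path[::-1] + ["end"]
```

Called on the CpG-Island sketch, for example as `viterbi("CGCGAATT", states, transitions, emissions)`, it returns the most probable +/- labelling of the eight nucleotides, which is exactly the CpG-Island annotation problem of chapter 1.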

Figure 2.2: Illustration of the Viterbi algorithm. The left part shows how the Viterbi algorithm calculates the Viterbi matrix from left to right in the recursion. The algorithm first loops along the input sequence X and then loops over the states in the model. It terminates in the top right corner of the matrix, at the end of the input sequence and in the end state. The right part demonstrates the calculation of entry v_{s'}(i), where the matrix element for state s' and sequence position i is derived from state s at sequence position i-1, i.e. v_{s'}(i) = e_{s'}(x_i) · v_s(i-1) · t_{s,s'} = e_{s'}(x_i) · max_{s'' ∈ S} { v_{s''}(i-1) · t_{s'',s'} }.

Let T_max be the maximum number of transitions into any state in the model. The Viterbi algorithm takes O(N · T_max · L^n) operations and O(N · L^n) memory for an n-HMM which reads n input sequences at a time. These time and memory requirements impose serious constraints when employing this algorithm with n-HMMs and long input sequences.

2.2.2 The Hirschberg algorithm

From equation 2.1, we can see that the Viterbi recursion can be continued if we only keep the Viterbi elements for the previous sequence position in memory. This means that only O(N · L^{n-1}) memory is required for an n-HMM with N states and n input sequences of identical length L if we only want to calculate the probability of the Viterbi path.

However, in order to obtain the Viterbi path itself, O(N · L^n) memory is required for storing the back-tracking pointers (see equation 2.2) used in the back-tracking process. The memory needed for the back-tracking pointers thus dominates the memory requirements of the Viterbi algorithm. The Hirschberg algorithm is a more memory-efficient version of the Viterbi algorithm. It requires O(N · L^{n-1}) memory instead of O(N · L^n) for an n-HMM. Furthermore, the time requirement of the Hirschberg algorithm is of the same order as for the Viterbi algorithm. It uses the same recurrences as the Viterbi algorithm, but searches the Viterbi matrix in a smarter way. We now describe the Hirschberg algorithm for a pair-HMM. Instead of filling the Viterbi matrix in a single direction along one of the two input sequences, the Hirschberg algorithm traverses the matrix from both ends using two Hirschberg strips of O(NL) memory each (see figure 2.3). We call them the forward strip and the backward strip, respectively. In order to perform the backward traversal, a mirror model of the original HMM has to be set up by reversing the transitions and exchanging the end state and start state with respect to the original HMM. This mirror model does not necessarily define an HMM, because the sum of the transition probabilities from a state may no longer be 1. The forward strip calculates the 3-dimensional matrix in the same direction as the Viterbi algorithm, while the backward strip uses the mirror model to move in the opposite direction, i.e. it starts at the end state and at the end of one sequence. The matrix element (i, j, s) in the backward strip stores the probability of the optimal state path that has read the input sequences from their ends up to and including position i of sequence X and position j of sequence Y and that finishes in state s. Figure 2.3 shows how the two Hirschberg strips move in opposite directions and meet at the middle position of one of the two input sequences (sequence X in this example).

Figure 2.3: Illustration of the Hirschberg algorithm (1). This figure shows the projection of the 3-dimensional Viterbi matrix onto the X-Y plane. The two Hirschberg strips traverse the matrix from both ends of sequence X and meet at the middle of sequence X.

For a generalized HMM, the two strips have a width of max_{s ∈ S} Δx(s) + 1, so they will overlap instead of merely touching each other. The additional columns of each strip are used to store the values calculated for preceding sequence positions, which are needed in order to continue the calculation. Suppose the two strips meet at position i of sequence X, where the forward strip stores the Viterbi matrix elements for position i of sequence X and the backward strip stores the corresponding probabilities for position i+1 of sequence X. At that position in X, the probability of the optimal state path, P(X, Y, π*), can be calculated by finding the transition from a state in the forward strip to a state in the backward strip such that the product of this transition probability and the two strip entries is maximal. Using this criterion, we can identify the Viterbi matrix element (i, j, s) through which the optimal state path π* goes.
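The key property each strip exploits — that equation 2.1 only ever looks one position back — can be shown with a short sketch of my own (not taken from the thesis). It computes the probability of the Viterbi path for a single-tape HMM while keeping only the previous position's column in memory, which is the O(N · L^{n-1}) observation made at the start of this section; recovering the path itself is what requires either the full pointer matrix or the Hirschberg strategy.

```python
def viterbi_probability_linear_memory(sequence, states, transitions, emissions):
    """Probability of the best state path, using O(N) instead of O(N*L) memory.

    Same assumptions as the earlier sketches: silent 'start'/'end' states and
    one symbol read per emitting state.  No back-tracking pointers are kept,
    so only the probability (not the path) can be returned.
    """
    emitting = [s for s in states if s not in ("start", "end")]

    # Column for position 1, reached directly from the start state
    prev = {s: transitions["start"].get(s, 0.0) * emissions[s][sequence[0]]
            for s in emitting}

    # Each new column only needs the previous one (equation 2.1)
    for symbol in sequence[1:]:
        curr = {}
        for s2 in emitting:
            best = max(prev[s1] * transitions[s1].get(s2, 0.0) for s1 in emitting)
            curr[s2] = best * emissions[s2][symbol]
        prev = curr

    # Termination: best transition into the end state
    return max(prev[s] * transitions[s].get("end", 0.0) for s in emitting)
```

Running one such pass forward from the start of a sequence and a second pass backward with the mirror model gives exactly the two strips described above.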

In summary, when the two Hirschberg strips meet, the optimum probability and one coordinate on the optimal state path can be determined. The process then continues only on the left sub-matrix and the right sub-matrix, which together correspond to only half of the search space of the previous iteration (see figure 2.4). In the above example, position i of sequence X defines the right boundary of the left sub-matrix, and the coordinate (i, j, s) defines its end position; this coordinate also defines the start position of the right sub-matrix.

Figure 2.4: Illustration of the Hirschberg algorithm (2). The coordinate (i, j, s) on the optimal state path was determined in the first iteration of the Hirschberg algorithm (see figure 2.3). The matrix is then split into left and right sub-matrices and the next round of iterations is started for each of the two sub-matrices. This recursive procedure continues until the complete state path is determined.

Figure 2.4 shows two smaller Hirschberg strips searching each of the two sub-matrices. The recursive process continues until the optimal state path is completely determined. The figure also shows that in each iteration, the total search space for the Hirschberg strips is reduced by a factor of two. So the number of operations performed by the Hirschberg algorithm is given by t + t/2 + t/4 + ... ≤ 2t, where t denotes the number of operations required for searching the entire matrix, i.e. the number of operations in the Viterbi algorithm.
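For completeness, the bound quoted above is just a geometric series; a one-line derivation (my own, not spelled out in the thesis) is:

```latex
t + \frac{t}{2} + \frac{t}{4} + \dots \;=\; \sum_{k=0}^{\infty} \frac{t}{2^{k}}
\;=\; t \cdot \frac{1}{1-\tfrac{1}{2}} \;=\; 2t .
```

So the Hirschberg algorithm performs at most roughly twice as many operations as a single Viterbi pass over the full matrix.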

2.3 Parameter training

In order to calculate the most probable state path for a given input sequence or sequences, a set of parameters must be provided. The parameters are the set of transition probabilities and emission probabilities. HMMs with the same topology but different parameter values can be used for different applications. For example, the average length of an exonic stretch in human genes differs from that in mouse genes, so the transition probabilities from the exon state to the intergenic state in the HMM should not be the same. The values of these parameters determine the prediction results and the performance of the HMM. In general, the set of parameters of an HMM should be adjusted to the application, so parameter estimation is crucial. However, it is often difficult to obtain good parameter estimates, because the parameter training process is expensive to run and training data is usually inadequate. Parameter training is essential in order to set up HMMs for novel biological data sets and to optimize their prediction performance. First of all, a set of training sequences is required for the training process. If the state paths for all the training sequences are known, then parameter estimation becomes a simple counting process where the performance is optimized via a maximum likelihood approach. Suppose there are M training sequences. Let:

- X = {X_1, ..., X_M} denote the set of M training sequences.
- θ denote the set of parameters of the HMM. This set refers to the set of transition and emission probabilities, i.e. θ = {T, E}.

- r_{s,s'} denote the pseudo-count for the number of times the transition s → s' is used.
- r_s(γ) denote the pseudo-count for the number of times state s reads symbol γ.

We can then derive the parameters that maximize the likelihood P(X | θ) by setting:

t_{s,s'} = T_{s,s'} / Σ_{s''} T_{s,s''},        (2.3)

where T_{s,s'} = number of transitions from s to s' in all the state paths, including any pseudo-counts r_{s,s'}. Similarly,

e_s(γ) = E_s(γ) / Σ_{γ'} E_s(γ'),        (2.4)

where E_s(γ) = number of times state s reads letter γ in all the state paths, including any pseudo-counts r_s(γ).

Often, the state paths that correspond to a known annotation are not known or are too numerous. In that case, parameter estimation becomes more difficult. In this situation, we can employ two frequently used parameter training algorithms: the Baum-Welch algorithm [2] and Viterbi training [11].

2.3.1 The Baum-Welch algorithm

The Baum-Welch algorithm is a special case of an expectation maximization (EM) algorithm, which is a general method for probabilistic parameter estimation. The Baum-Welch algorithm iteratively calculates new estimates for all parameters, and the overall log likelihood of the HMM can be shown to converge to a local maximum. We introduce a few more notations,


HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation 009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,

More information

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Gene regulation DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Especially but not only during developmental stage And cells respond to

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Divide and Conquer Algorithms. Problem Set #3 is graded Problem Set #4 due on Thursday

Divide and Conquer Algorithms. Problem Set #3 is graded Problem Set #4 due on Thursday Divide and Conquer Algorithms Problem Set #3 is graded Problem Set #4 due on Thursday 1 The Essence of Divide and Conquer Divide problem into sub-problems Conquer by solving sub-problems recursively. If

More information

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Hidden Markov Models Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Sequential Data Time-series: Stock market, weather, speech, video Ordered: Text, genes Sequential

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

ε-machine Estimation and Forecasting

ε-machine Estimation and Forecasting ε-machine Estimation and Forecasting Comparative Study of Inference Methods D. Shemetov 1 1 Department of Mathematics University of California, Davis Natural Computation, 2014 Outline 1 Motivation ε-machines

More information

BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling. Colin Dewey (adapted from slides by Mark Craven)

BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling. Colin Dewey (adapted from slides by Mark Craven) BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling Colin Dewey (adapted from slides by Mark Craven) 2007.04.12 1 Modeling RNA with Stochastic Context Free Grammars consider

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

Multiple Sequence Alignment Gene Finding, Conserved Elements

Multiple Sequence Alignment Gene Finding, Conserved Elements Multiple Sequence Alignment Gene Finding, Conserved Elements Definition Given N sequences x 1, x 2,, x N : Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of

More information

Using Hidden Markov Models for Multiple Sequence Alignments Lab #3 Chem 389 Kelly M. Thayer

Using Hidden Markov Models for Multiple Sequence Alignments Lab #3 Chem 389 Kelly M. Thayer Página 1 de 10 Using Hidden Markov Models for Multiple Sequence Alignments Lab #3 Chem 389 Kelly M. Thayer Resources: Bioinformatics, David Mount Ch. 4 Multiple Sequence Alignments http://www.netid.com/index.html

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition

More information

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System

ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System ModelStructureSelection&TrainingAlgorithmsfor an HMMGesture Recognition System Nianjun Liu, Brian C. Lovell, Peter J. Kootsookos, and Richard I.A. Davis Intelligent Real-Time Imaging and Sensing (IRIS)

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Eval: A Gene Set Comparison System

Eval: A Gene Set Comparison System Masters Project Report Eval: A Gene Set Comparison System Evan Keibler evan@cse.wustl.edu Table of Contents Table of Contents... - 2 - Chapter 1: Introduction... - 5-1.1 Gene Structure... - 5-1.2 Gene

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

One report (in pdf format) addressing each of following questions.

One report (in pdf format) addressing each of following questions. MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW1: Sequence alignment and Evolution Due: 24:00 EST, Feb 15, 2016 by autolab Your goals in this assignment are to 1. Complete a genome assembler

More information

3.4 Multiple sequence alignment

3.4 Multiple sequence alignment 3.4 Multiple sequence alignment Why produce a multiple sequence alignment? Using more than two sequences results in a more convincing alignment by revealing conserved regions in ALL of the sequences Aligned

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction

Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction Stefan Müller, Gerhard Rigoll, Andreas Kosmala and Denis Mazurenok Department of Computer Science, Faculty of

More information

BLAST - Basic Local Alignment Search Tool

BLAST - Basic Local Alignment Search Tool Lecture for ic Bioinformatics (DD2450) April 11, 2013 Searching 1. Input: Query Sequence 2. Database of sequences 3. Subject Sequence(s) 4. Output: High Segment Pairs (HSPs) Sequence Similarity Measures:

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Modeling time series with hidden Markov models

Modeling time series with hidden Markov models Modeling time series with hidden Markov models Advanced Machine learning 2017 Nadia Figueroa, Jose Medina and Aude Billard Time series data Barometric pressure Temperature Data Humidity Time What s going

More information

New String Kernels for Biosequence Data

New String Kernels for Biosequence Data Workshop on Kernel Methods in Bioinformatics New String Kernels for Biosequence Data Christina Leslie Department of Computer Science Columbia University Biological Sequence Classification Problems Protein

More information

Effect of Initial HMM Choices in Multiple Sequence Training for Gesture Recognition

Effect of Initial HMM Choices in Multiple Sequence Training for Gesture Recognition Effect of Initial HMM Choices in Multiple Sequence Training for Gesture Recognition Nianjun Liu, Richard I.A. Davis, Brian C. Lovell and Peter J. Kootsookos Intelligent Real-Time Imaging and Sensing (IRIS)

More information

Genome Browsers Guide

Genome Browsers Guide Genome Browsers Guide Take a Class This guide supports the Galter Library class called Genome Browsers. See our Classes schedule for the next available offering. If this class is not on our upcoming schedule,

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Motivation: Shortcomings of Hidden Markov Model. Ko, Youngjoong. Solution: Maximum Entropy Markov Model (MEMM)

Motivation: Shortcomings of Hidden Markov Model. Ko, Youngjoong. Solution: Maximum Entropy Markov Model (MEMM) Motivation: Shortcomings of Hidden Markov Model Maximum Entropy Markov Models and Conditional Random Fields Ko, Youngjoong Dept. of Computer Engineering, Dong-A University Intelligent System Laboratory,

More information

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology ICB Fall 2008 G4120: Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2008 Oliver Jovanovic, All Rights Reserved. The Digital Language of Computers

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008

Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))

More information

Essential Skills for Bioinformatics: Unix/Linux

Essential Skills for Bioinformatics: Unix/Linux Essential Skills for Bioinformatics: Unix/Linux SHELL SCRIPTING Overview Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose

More information

Richard Feynman, Lectures on Computation

Richard Feynman, Lectures on Computation Chapter 8 Sorting and Sequencing If you keep proving stuff that others have done, getting confidence, increasing the complexities of your solutions for the fun of it then one day you ll turn around and

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Tutorial 1: Exploring the UCSC Genome Browser

Tutorial 1: Exploring the UCSC Genome Browser Last updated: May 12, 2011 Tutorial 1: Exploring the UCSC Genome Browser Open the homepage of the UCSC Genome Browser at: http://genome.ucsc.edu/ In the blue bar at the top, click on the Genomes link.

More information

Alignment of Pairs of Sequences

Alignment of Pairs of Sequences Bi03a_1 Unit 03a: Alignment of Pairs of Sequences Partners for alignment Bi03a_2 Protein 1 Protein 2 =amino-acid sequences (20 letter alphabeth + gap) LGPSSKQTGKGS-SRIWDN LN-ITKSAGKGAIMRLGDA -------TGKG--------

More information

Bayesian Classification Using Probabilistic Graphical Models

Bayesian Classification Using Probabilistic Graphical Models San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2014 Bayesian Classification Using Probabilistic Graphical Models Mehal Patel San Jose State University

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001

Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher Raj Dabre 11305R001 Shallow Parsing Swapnil Chaudhari 11305R011 Ankur Aher - 113059006 Raj Dabre 11305R001 Purpose of the Seminar To emphasize on the need for Shallow Parsing. To impart basic information about techniques

More information

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates CSCI 599 Class Presenta/on Zach Levine Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates April 26 th, 2012 Topics Covered in this Presenta2on A (Brief) Review of HMMs HMM Parameter Learning Expecta2on-

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Graphical Models & HMMs

Graphical Models & HMMs Graphical Models & HMMs Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Graphical Models

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms: BLAST and FASTA Database search: gene finding and annotations

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

Hidden Markov Models. Mark Voorhies 4/2/2012

Hidden Markov Models. Mark Voorhies 4/2/2012 4/2/2012 Searching with PSI-BLAST 0 th order Markov Model 1 st order Markov Model 1 st order Markov Model 1 st order Markov Model What are Markov Models good for? Background sequence composition Spam Hidden

More information

Outline. Advanced Digital Image Processing and Others. Importance of Segmentation (Cont.) Importance of Segmentation

Outline. Advanced Digital Image Processing and Others. Importance of Segmentation (Cont.) Importance of Segmentation Advanced Digital Image Processing and Others Xiaojun Qi -- REU Site Program in CVIP (7 Summer) Outline Segmentation Strategies and Data Structures Algorithms Overview K-Means Algorithm Hidden Markov Model

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame 1 When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from

More information

CS313 Exercise 4 Cover Page Fall 2017

CS313 Exercise 4 Cover Page Fall 2017 CS313 Exercise 4 Cover Page Fall 2017 Due by the start of class on Thursday, October 12, 2017. Name(s): In the TIME column, please estimate the time you spent on the parts of this exercise. Please try

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 24 Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras Lecture - 24 So in today s class, we will look at quadrilateral elements; and we will

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Exercise 2: Browser-Based Annotation and RNA-Seq Data Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler July 24, 2018 This exercise continues your introduction to practical issues in comparative annotation. You ll be annotating genomic sequence

More information

ChromHMM: automating chromatin-state discovery and characterization

ChromHMM: automating chromatin-state discovery and characterization Nature Methods ChromHMM: automating chromatin-state discovery and characterization Jason Ernst & Manolis Kellis Supplementary Figure 1 Supplementary Figure 2 Supplementary Figure 3 Supplementary Figure

More information

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Previous Lecture Audio Retrieval - Query by Humming

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

MEMMs (Log-Linear Tagging Models)

MEMMs (Log-Linear Tagging Models) Chapter 8 MEMMs (Log-Linear Tagging Models) 8.1 Introduction In this chapter we return to the problem of tagging. We previously described hidden Markov models (HMMs) for tagging problems. This chapter

More information

Multimedia Databases. 9 Video Retrieval. 9.1 Hidden Markov Model. 9.1 Hidden Markov Model. 9.1 Evaluation. 9.1 HMM Example 12/18/2009

Multimedia Databases. 9 Video Retrieval. 9.1 Hidden Markov Model. 9.1 Hidden Markov Model. 9.1 Evaluation. 9.1 HMM Example 12/18/2009 9 Video Retrieval Multimedia Databases 9 Video Retrieval 9.1 Hidden Markov Models (continued from last lecture) 9.2 Introduction into Video Retrieval Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme

More information