Lecture 5: Markov models


Master's course Bioinformatics Data Analysis and Tools. Lecture 5: Markov models. Centre for Integrative Bioinformatics.

Problem in biology: data and patterns are often not clear cut. When we want to make a method to recognise a pattern (e.g. a sequence motif), we have to learn from the data (e.g. maybe there are other differences between sequences that have the pattern and those that do not). This leads to data mining and machine learning.

Contents: a widely used machine learning approach: Markov models. Markov chain models (1st order, higher order and inhomogeneous models; parameter estimation; classification); interpolated Markov models (and back-off models); hidden Markov models (forward, backward and Baum-Welch algorithms; model topologies; applications to gene finding and protein family modeling).

Markov Chain Models. A Markov chain model is defined by: a set of states (some states emit symbols; other states, e.g. the begin state, are silent), and a set of transitions with associated probabilities (the transitions emanating from a given state define a distribution over the possible next states).

Markov Chain Models. Given some sequence x of length L, we can ask how probable the sequence is given our model. For any probabilistic model of sequences, we can write this probability as a product of conditional probabilities. The key property of a (1st order) Markov chain is that the probability of each x_i depends only on x_{i-1}, i.e. Pr(x_i | x_{i-1}, ..., x_1) = Pr(x_i | x_{i-1}).

Markov Chain Models. Example: Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g).

Markov Chain Models. We can also have an end state, allowing the model to represent: sequences of different lengths, and preferences for sequences ending with particular symbols.

Markov Chain Models. The transition parameters can be denoted by a_{x_{i-1} x_i}, where a_{x_{i-1} x_i} = Pr(x_i | x_{i-1}). Similarly, we can denote the probability of a sequence x as Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}, where a_{B x_1} represents the transition from the begin state.
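To make the notation concrete, here is a minimal Python sketch that scores a sequence under a first-order Markov chain; the begin and transition probabilities are made up for illustration and are not taken from the lecture.

    # First-order Markov chain over DNA: score a sequence.
    # All probabilities below are illustrative placeholders.
    begin = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}   # a_{B x_1}
    trans = {                                              # a_{x_{i-1} x_i}
        "a": {"a": 0.30, "c": 0.20, "g": 0.30, "t": 0.20},
        "c": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
        "g": {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20},
        "t": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    }

    def sequence_probability(x):
        # Pr(x) = a_{B x_1} * product over i >= 2 of a_{x_{i-1} x_i}
        p = begin[x[0]]
        for prev, cur in zip(x, x[1:]):
            p *= trans[prev][cur]
        return p

    print(sequence_probability("cggt"))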

Example Application: CpG islands. CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G, but the regions upstream of genes are richer in CG dinucleotides than elsewhere; CpG islands are therefore useful evidence for finding genes. We could predict CpG islands with Markov chains: one to represent CpG islands and one to represent the rest of the genome. The example includes using maximum likelihood and Bayesian statistics on the data and feeding the result into a Markov model.

Estimating the Model Parameters. Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model? One approach is maximum likelihood estimation: given a set of data D, set the parameters θ to maximize Pr(D | θ), i.e. make the data D look likely under the model.

Maximum Likelihood Estimation. Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t), and we're given the sequences: accgcgctta, gcttagtgac, tagccgttac. Then the maximum likelihood estimates are: Pr(a) = 6/30 = 0.2, Pr(c) = 9/30 = 0.3, Pr(g) = 7/30 = 0.233, Pr(t) = 8/30 = 0.267.
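A short Python sketch of this counting, reproducing the estimates above:

    from collections import Counter

    # Maximum likelihood estimates for single-nucleotide probabilities:
    # count each base and normalise by the total number of bases.
    sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    for base in "acgt":
        print(base, counts[base] / total)   # a 0.2, c 0.3, g ~0.233, t ~0.267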

These data are derived from genome sequences

Higher Order Markov Chains. An nth-order Markov chain over some alphabet is equivalent to a first-order Markov chain over the alphabet of n-tuples. Example: a 2nd-order Markov model for DNA can be treated as a 1st-order Markov model over the alphabet AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT (i.e. all possible dinucleotides).

A Fifth Order Markov Chain

Inhomogeneous Markov Chains. In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence. In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence. Consider modeling codons in protein-coding regions.

Inhomogeneous Markov Chains

A Fifth Order Inhomogeneous Markov Chain

Selecting the Order of a Markov Chain Model. Higher order models remember more history, and additional history can have predictive value. Example: predict the next word in the sentence fragment "... finish" (up, it, first, last, ...?); now predict it given more history: "Fast guys finish ...".

Selecting the Order of a Markov Chain Model. However, the number of parameters we need to estimate grows exponentially with the order: for modeling DNA we need on the order of 4^(n+1) parameters for an nth-order model (n is around 5 in typical gene finders). The higher the order, the less reliable we can expect our parameter estimates to be: estimating the parameters of a 2nd-order homogeneous Markov chain from the complete genome of E. coli, we would see each word more than 72,000 times on average; estimating the parameters of an 8th-order chain, we would see each word only about 5 times on average.

Interpolated Markov Models The IMM idea: manage this trade-off by interpolating among models of various orders Simple linear interpolation:
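The interpolation formula itself is not reproduced in this transcript; the standard form of simple linear interpolation, written here for a 2nd-order DNA model, is
Pr(x_i | x_{i-2}, x_{i-1}) = λ_2 P_2(x_i | x_{i-2}, x_{i-1}) + λ_1 P_1(x_i | x_{i-1}) + λ_0 P_0(x_i),
where the λ_k are fixed weights that sum to 1 and P_k is the maximum likelihood estimate from the kth-order counts.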

Interpolated Markov Models. We can make the weights depend on the history: for a given order, we may have significantly more data to estimate some words than others. General linear interpolation:
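The general form is also missing from the transcript; it lets each weight depend on the history at its own order, e.g. for a 2nd-order model
Pr(x_i | x_{i-2}, x_{i-1}) = λ_2(x_{i-2}, x_{i-1}) P_2(x_i | x_{i-2}, x_{i-1}) + λ_1(x_{i-1}) P_1(x_i | x_{i-1}) + λ_0 P_0(x_i),
with the weights for any given history still summing to 1.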

Gene Finding: Search by Content. Encoding a protein affects the statistical properties of a DNA sequence: some amino acids are used more frequently than others (Leu more popular than Trp); there are different numbers of codons for different amino acids (Leu has 6, Trp has 1); and for a given amino acid, usually one codon is used more frequently than others. This is termed codon preference. Codon preferences vary by species.

Codon Preference in E. coli

AA    codon   /1000
--------------------
Gly   GGG      1.89
Gly   GGA      0.44
Gly   GGU     52.99
Gly   GGC     34.55
Glu   GAG     15.68
Glu   GAA     57.20
Asp   GAU     21.63
Asp   GAC     43.26

Search by Content. A common way to search by content: build Markov models of coding and noncoding regions, and apply the models to ORFs (Open Reading Frames) or fixed-sized windows of sequence. GeneMark [Borodovsky et al.] is a popular system for identifying genes in bacterial genomes; it uses 5th-order inhomogeneous Markov chain models.

The GLIMMER System (Salzberg et al., 1998). A system for identifying genes in bacterial genomes. It uses 8th-order, inhomogeneous, interpolated Markov chain models.

IMMs in GLIMMER. How does GLIMMER determine the λ values? First, let us express the IMM probability calculation recursively:
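The recursion is not reproduced in the transcript; in the GLIMMER paper (Salzberg et al., 1998) it has roughly the form
IMM_n(S, x) = λ_n(S_{x-1}) P_n(S, x) + [1 - λ_n(S_{x-1})] IMM_{n-1}(S, x),
i.e. the nth-order maximum likelihood estimate P_n is mixed with the already-interpolated (n-1)th-order estimate, with a weight λ_n between 0 and 1 reflecting how much we trust the length-n context (set from the observed counts, as described on the next slides).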

IMMs in GLIMMER. If we haven't seen the context x_{i-n} ... x_{i-1} more than 400 times, then compare the counts for the following: use a statistical test (χ²) to get a value d indicating our confidence that the distributions represented by the two sets of counts are different.

IMMs in GLIMMER. The χ² score when comparing the nth-order with the (n-1)th-order Markov model (preceding slide).

The GLIMMER method: 8th-order IMM vs. 5th-order Markov model. Trained on 1168 genes (ORFs, really); tested on 1717 annotated (more or less known) genes.

Plot sensitivity over 1-specificity

Hidden Markov models (HMMs) Given say a T in our input sequence, which state emitted it?

Hidden Markov models (HMMs). Hidden state: we will distinguish between the observed parts of a problem and the hidden parts. In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence. In the model above (preceding slide), there are multiple states that could account for each part of the observed sequence; this is the hidden part of the problem: states are decoupled from sequence symbols.

HMM-based homology searching. An HMM for ungapped alignment has transition probabilities and emission probabilities; gapped HMMs also have insertion and deletion states (next slide).

Profile HMM: m = match state, i = insert state, d = delete state; go from left to right. The i and m states output amino acids; d states are silent. [Diagram: delete states d1-d4, insert states i0-i4, match states m0-m5, plus Start and End states.] A model for alignment with insertions and deletions.

HMM-based homology searching. The most widely used HMM-based profile searching tools currently are SAM-T99 (Karplus et al., 1998) and HMMER2 (Eddy, 1998). They have a formal probabilistic basis and a consistent theory behind gap and insertion scores. HMMs are good for profile searches, bad for alignment (due to the parametrisation of the models). HMMs are slow.

Homology-derived Secondary Structure of Proteins (HSSP), Sander & Schneider, 1991. It's all about trying to push the "don't know" region down.

The Parameters of an HMM
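The parameter definitions on this slide are not reproduced in the transcript; in the usual notation, which the later slides rely on, an HMM is specified by transition probabilities a_kl = Pr(π_i = l | π_{i-1} = k) and emission probabilities e_k(b) = Pr(x_i = b | π_i = k), where π_i denotes the (hidden) state at position i and x_i the observed symbol.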

HMM for Eukaryotic Gene Finding Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

A Simple HMM

Three Important Questions How likely is a given sequence? the Forward algorithm What is the most probable path for generating a given sequence? the Viterbi algorithm How can we learn the HMM parameters given a set of sequences? the Forward-Backward (Baum-Welch) algorithm

Three basic problems of HMMs. Once we have an HMM, there are three problems of interest. (1) The Evaluation Problem: given an HMM and a sequence of observations, what is the probability that the observations are generated by the model? (2) The Decoding Problem: given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations? (3) The Learning Problem: given a model and a sequence of observations, how should we adjust the model parameters in order to maximize the probability of the observations given the model? The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.

Three Important Questions How likely is a given sequence? Forward algorithm What is the most probable path for generating a given sequence? How can we learn the HMM parameters given a set of sequences?

How Likely is a Given Sequence? The probability that the path is taken and the sequence is generated: (assuming begin/end are the only silent states on path)
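The expression itself is not reproduced in the transcript; for a path π = π_1 ... π_L it is the standard joint probability
Pr(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}},
where a_{0 π_1} is the transition out of the begin state and a_{π_L π_{L+1}} is taken to be the transition into the end state.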

How Likely is a Given Sequence?

How Likely is a Given Sequence? The probability over all paths is Pr(x) = Σ_π Pr(x, π), but the number of paths can be exponential in the length of the sequence... the Forward algorithm enables us to compute this efficiently.

How Likely is a Given Sequence: The Forward Algorithm. Define f_k(i) to be the probability of being in state k having observed the first i characters of x. We want to compute f_N(L), the probability of being in the end state having observed all of x. We can define this recursively.

How Likely is a Given Sequence: The Forward Algorithm

The forward algorithm.
Initialisation: f_0(0) = 1 (start), f_k(0) = 0 (other silent states k); this is the probability that we're in the start state and have observed 0 characters from the sequence.
Recursion: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl (emitting states); f_l(i) = Σ_k f_k(i) a_kl (silent states).
Termination: Pr(x) = Pr(x_1 ... x_L) = f_N(L) = Σ_k f_k(L) a_kN; this is the probability that we are in the end state and have observed the entire sequence.

Forward algorithm example
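The worked example on this slide is a figure; as a stand-in, here is a minimal Python sketch of the forward recursion for a small two-state HMM. The states, transition probabilities and emission probabilities below are made up for illustration and are not the lecture's example.

    # Forward algorithm for a toy HMM with a silent begin and end state
    # and two emitting states ("plus" and "minus"). Probabilities are
    # illustrative placeholders only.
    states = ["plus", "minus"]
    a = {  # transition probabilities a_kl
        "begin": {"plus": 0.5, "minus": 0.5},
        "plus":  {"plus": 0.6, "minus": 0.3, "end": 0.1},
        "minus": {"plus": 0.3, "minus": 0.6, "end": 0.1},
    }
    e = {  # emission probabilities e_k(b)
        "plus":  {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "minus": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    }

    def forward(x):
        # Returns Pr(x) = f_N(L), summing over all state paths.
        # Initialisation: leave the begin state and emit the first symbol.
        f = {k: a["begin"][k] * e[k][x[0]] for k in states}
        # Recursion over the remaining symbols.
        for symbol in x[1:]:
            f = {l: e[l][symbol] * sum(f[k] * a[k][l] for k in states)
                 for l in states}
        # Termination: transition into the end state.
        return sum(f[k] * a[k]["end"] for k in states)

    print(forward("CGCG"))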

Three Important Questions How likely is a given sequence? What is the most probable path for generating a given sequence? Viterbi algorithm How can we learn the HMM parameters given a set of sequences?

Finding the Most Probable Path: The Viterbi Algorithm. Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k. We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state. This can be defined recursively, and we can use dynamic programming (DP) to find v_N(L) efficiently.

Finding the Most Probable Path: The Viterbi Algorithm.
Initialisation: v_0(0) = 1 (start), v_k(0) = 0 (non-silent states).
Recursion for emitting states (i = 1 ... L): v_l(i) = e_l(x_i) max_k [v_k(i-1) a_kl], keeping a pointer to the maximizing k for traceback.
Recursion for silent states: v_l(i) = max_k [v_k(i) a_kl].

Finding the Most Probable Path: The Viterbi Algorithm
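As with the forward example, here is a minimal Python sketch of the Viterbi recursion with traceback; it reuses the made-up states, a and e tables from the forward sketch above and is illustrative only.

    def viterbi(x):
        # Returns (probability of the best path, best state path) for the toy HMM.
        # Initialisation: best way to emit the first symbol from each state.
        v = {k: a["begin"][k] * e[k][x[0]] for k in states}
        back = [{k: "begin" for k in states}]
        # Recursion: take the max over predecessors instead of the forward sum.
        for symbol in x[1:]:
            new_v, ptr = {}, {}
            for l in states:
                best_k = max(states, key=lambda k: v[k] * a[k][l])
                new_v[l] = e[l][symbol] * v[best_k] * a[best_k][l]
                ptr[l] = best_k
            v = new_v
            back.append(ptr)
        # Termination: include the transition into the end state.
        last = max(states, key=lambda k: v[k] * a[k]["end"])
        prob = v[last] * a[last]["end"]
        # Traceback to recover the most probable state path.
        path = [last]
        for ptr in reversed(back[1:]):
            path.append(ptr[path[-1]])
        path.reverse()
        return prob, path

    print(viterbi("CGCG"))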

Three Important Questions How likely is a given sequence? (clustering) What is the most probable path for generating a given sequence? (alignment) How can we learn the HMM parameters given a set of sequences? The Baum-Welch Algorithm

The Learning Problem. Generally, the learning problem is how to adjust the HMM parameters so that the given set of observations (called the training set) is represented by the model in the best way for the intended application. The "quantity" we wish to optimize during the learning process can therefore differ from application to application: there may be several optimization criteria for learning, out of which a suitable one is selected depending on the application. There are two main optimization criteria found in the literature: Maximum Likelihood (ML) and Maximum Mutual Information (MMI).

The Learning Task. Given: a model and a set of sequences (the training set). Do: find the most likely parameters to explain the training sequences. The goal is to find a model that generalizes well to sequences we haven't seen before.

Learning Parameters. If we know the state path for each training sequence, learning the model parameters is simple: there is no hidden state during training, so we count how often each parameter is used and normalize/smooth to get probabilities; the process is just like it was for Markov chain models. If we don't know the path for each training sequence, how can we determine the counts? Key insight: estimate the counts by considering every path, weighted by its probability.

Learning Parameters: The Baum-Welch Algorithm. An EM (expectation maximization) approach, a forward-backward algorithm. Algorithm sketch: initialize the parameters of the model, then iterate until convergence: calculate the expected number of times each transition or emission is used, and adjust the parameters to maximize the likelihood of these expected values. An important feature of Baum-Welch is that it always converges.


The Expectation step. First, we need to know the probability of the ith symbol being produced by state k, given sequence x: Pr(π_i = k | x). Given this, we can compute our expected counts for state transitions and character emissions.
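The slides compute this with the forward and backward values; the standard identity, not reproduced in the transcript, is
Pr(π_i = k | x) = f_k(i) b_k(i) / Pr(x),
where f_k(i) is the forward value defined earlier, b_k(i) = Pr(x_{i+1} ... x_L | π_i = k) is the corresponding backward value, and Pr(x) = f_N(L).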


The Backward Algorithm
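The backward recursion itself is not in the transcript; mirroring the forward algorithm above, the standard form is:
Initialisation: b_k(L) = a_kN for all states k (the transition into the end state).
Recursion (i = L-1, ..., 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1).
Termination: Pr(x) = Σ_l a_{0l} e_l(x_1) b_l(1).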


The Maximization step

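The maximization-step formulas are not reproduced in the transcript; with A_kl denoting the expected number of k-to-l transitions and E_k(b) the expected number of emissions of symbol b from state k (both computed in the expectation step), the standard re-estimates are
a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b').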

The Baum-Welch Algorithm. Initialize the parameters of the model; iterate until convergence: calculate the expected number of times each transition or emission is used, and adjust the parameters to maximize the likelihood of these expected values. This algorithm will converge to a local maximum (in the likelihood of the data given the model), usually in a fairly small number of iterations.