Hidden Markov Models in the context of genetic analysis
1 Hidden Markov Models in the context of genetic analysis. Vincent Plagnol, UCL Genetics Institute, November 22, 2012.
2 Outline
1 Introduction
2 Two basic problems: Forward/backward; Baum-Welch algorithm; Viterbi algorithm
3 When the parameters are unknown
4 Two applications: Gene prediction; CNV detection from SNP arrays
5 Two extensions to the basic HMM: Stochastic EM; Semi-Markov models
4 The problem. Many applications of statistics can be seen as categorisation tasks: we try to fit complex patterns into discrete boxes in order to apprehend them better. Clustering approaches are typical of this: inference of an individual's ancestry as a mix of populations X and Y, or separation between high-risk and low-risk disease groups... Hidden Markov Models try to achieve exactly this purpose in a different context.
5 Basic framework
6 An example: gene discovery from DNA sequence
7 An example: gene discovery from DNA sequence. We will start with this simplest example. We assume that the hidden chain X has two states: gene, or intergenic. To be complete there should be a third state, gene on the reverse strand. For now we assume that the emission probabilities $P(Y_i \mid X_i)$ are independent conditionally on the hidden chain X. This may not be good enough for most applications, but it is a place to start.
8 Notations. $(Y_i)_{i=1}^n$ represents the sequence of observed data points. The $Y_i$ can be discrete or continuous, but we will assume discrete for now. $(X_i)_{i=1}^n$ is the sequence of hidden states: for all $i$, $X_i \in \{1, \dots, S\}$, so we have $S$ discrete hidden states. We also assume that we know the distribution $P(Y \mid X)$, but this set of parameters may also be unknown.
9 Basic description of Markov chains (1). A discrete stochastic process X is Markovian if $P(X_1^n \mid X_i) = P(X_1^{i-1} \mid X_i)\,P(X_{i+1}^n \mid X_i)$. Essentially, the future and the past are independent conditionally on the present: the process is memory-less. One can easily make a continuous version of this. If the Markov model has $S$ states, then the process can be described using an $S \times S$ transition matrix. The diagonal values $p_{ii}$ describe the probability of staying in state $i$.
10 Basic description of Markov chains (2). The probability of spending exactly $k$ units of time in state $i$ is: $P(X \text{ spends } k \text{ units in } i) = p_{ii}^k\,(1 - p_{ii})$. This is the definition of a geometric variable; in continuous time it would be an exponential distribution. The definition of the present can also be modified: $X_i$ may for example depend on the previous $k$ states instead of only the last one. This increases the size of the parameter space but makes the model richer.
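To make the geometric-stay property concrete, here is a minimal R sketch; the transition matrix and all numerical values are illustrative choices, not taken from the lecture. It simulates a two-state chain and checks that the average run length in state 1 is close to $1/(1 - p_{11})$.

```r
# Minimal sketch: simulate a two-state Markov chain and check that run
# lengths in state 1 behave geometrically (all values illustrative).
set.seed(1)
P <- matrix(c(0.95, 0.05,
              0.10, 0.90), nrow = 2, byrow = TRUE)  # transition matrix

simulate_chain <- function(P, n, start = 1) {
  x <- integer(n)
  x[1] <- start
  for (i in 2:n) x[i] <- sample(nrow(P), 1, prob = P[x[i - 1], ])
  x
}

x <- simulate_chain(P, 1e5)
runs <- rle(x)
mean(runs$lengths[runs$values == 1])  # mean run length, close to 1/(1 - 0.95) = 20
```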
11 Basics for hidden Markov chains. The hidden Markov chain framework adds one layer (denoted Y) to the Markovian process described previously. The conditional distribution $P(Y_j \mid X_j = s)$ may be unknown, completely specified or partially specified. Typically the number of hidden states $S$ is relatively small (no more than a few hundred states). But $n$ may be very large, i.e. X and Y may be very long sequences (think DNA sequences).
12 Slightly more general version. Without complicating anything, we can most of the time assume that $P(Y_j \mid X_j)$ also varies with $j$. Y could also itself be a Markov chain. Non-Markovian stays can, to some extent, be mimicked by using a sequence of hidden states: first part of the gene, middle of the gene, end of the gene.
13 The set of parameters Θ. 1. $(P_{st})$ is the transition matrix for the hidden states. 2. $Q_{sk} = P(Y = k \mid X = s)$ is the emission probability distribution for the observed chain Y given X. 3. Lastly, we need a vector μ of initial probabilities for the hidden chain X.
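As a concrete illustration, a parameter set Θ for the gene/intergenic toy example could be written as follows in R; the numerical values are invented for illustration only, not taken from the lecture.

```r
# Hypothetical parameter set Theta for the gene/intergenic toy example.
states <- c("intergenic", "gene")
P <- matrix(c(0.999, 0.001,                    # transition matrix (P_st)
              0.002, 0.998),
            nrow = 2, byrow = TRUE, dimnames = list(states, states))
Q <- matrix(c(0.25, 0.25, 0.25, 0.25,          # emission probabilities (Q_sk)
              0.15, 0.35, 0.35, 0.15),
            nrow = 2, byrow = TRUE,
            dimnames = list(states, c("A", "C", "G", "T")))
mu <- c(intergenic = 0.9, gene = 0.1)          # initial distribution of X_1
```

A real gene finder would of course estimate these quantities from annotated sequence rather than fix them by hand.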
14 Two related problems. 1. At a given point $i$ in the sequence, what is the most likely hidden state $X_i$? 2. What is the most likely hidden sequence $(X_i)_{i=1}^n$? The first question relates to marginal probabilities and the second to the joint likelihood.
17 What we can compute at this stage. At this stage our tools are limited. Given a sequence $x = (x_1, \dots, x_n)$ we can compute $P(X = x, Y = y) = P(X = x)\,P(Y = y \mid X = x)$. This is the full joint likelihood for $(X, Y)$.
18 Why problem 1 is difficult.
$P(X_i = x_i \mid Y) = \frac{P(X_i = x_i, Y)}{P(Y)} = \frac{P(X_i = x_i, Y)}{\sum_{s=1}^{S} P(X_i = s, Y)}$
So the problem amounts to estimating $P(X_i = s, Y)$. A direct computation would sum over all possible hidden sequences: $P(X_i = s, Y) = \sum_{x : x_i = s} P(X = x, Y)$. With $S$ hidden states we need to sum over $S^n$ terms, which is not practical. We need to be smarter.
19 We need to use the Markovian assumption.
$P(X_i = s, Y) = P(X_i = s)\,P(Y \mid X_i = s)$
$= P(X_i = s) \sum_x P(Y, X = x \mid X_i = s)$
$= P(X_i = s) \Big( \sum_{x_1^{i-1}} P(Y_1^i, X_1^{i-1} = x_1^{i-1} \mid X_i = s) \Big) \Big( \sum_{x_{i+1}^n} P(Y_{i+1}^n, X_{i+1}^n = x_{i+1}^n \mid X_i = s) \Big)$
$= P(X_i = s)\,P(Y_1^i \mid X_i = s)\,P(Y_{i+1}^n \mid X_i = s)$
$= P(Y_1^i, X_i = s)\,P(Y_{i+1}^n \mid X_i = s)$
$= \alpha_s(i)\,\beta_s(i)$
20 A new computation. We have shown that:
$P(X_i = s \mid Y) = \frac{\alpha_s(i)\,\beta_s(i)}{\sum_{t=1}^{S} \alpha_t(i)\,\beta_t(i)}$
where $\alpha_s(i) = P(Y_1^i, X_i = s)$ and $\beta_s(i) = P(Y_{i+1}^n \mid X_i = s)$. And it is actually possible to compute, recursively, the quantities $\alpha_s(i)$ and $\beta_s(i)$.
21 Two recursive computations. The (forward) recursion for α is:
$\alpha_s(i+1) = P(Y_{i+1} \mid X_{i+1} = s) \sum_{t=1}^{S} \alpha_t(i)\,P_{ts}$
The (backward) recursion for β is:
$\beta_s(i-1) = \sum_t P_{st}\,\beta_t(i)\,P(Y_i \mid X_i = t)$
22 Proof for the first recursion.
$\alpha_s(i+1) = P(Y_1^{i+1}, X_{i+1} = s)$
$= \sum_t P(Y_1^{i+1}, X_{i+1} = s \mid X_i = t)\,P(X_i = t)$
$= \sum_t P(Y_1^{i+1} \mid X_{i+1} = s, X_i = t)\,P(X_{i+1} = s \mid X_i = t)\,P(X_i = t)$
$= P(Y_{i+1} \mid X_{i+1} = s) \sum_t P_{ts}\,P(Y_1^i \mid X_i = t, X_{i+1} = s)\,P(X_i = t)$
$= P(Y_{i+1} \mid X_{i+1} = s) \sum_t P_{ts}\,P(Y_1^i, X_i = t)$
$= P(Y_{i+1} \mid X_{i+1} = s) \sum_t P_{ts}\,\alpha_t(i)$
A similar proof is used for the backward recursion.
23 Computational considerations. The algorithm requires storing $n \times S$ floats. In terms of computation time, the requirements are in $O(S^2 n)$. Linearity in $n$ is the key feature because it enables the analysis of very long DNA sequences. Note that probabilities rapidly become vanishingly small: everything needs to be done on the log scale (be careful when implementing it). Various R packages are available for hidden Markov chains (google it!).
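Putting the two recursions and the log-scale advice together, here is a sketch of the forward/backward computation in R. It assumes the toy parameters P, Q, mu defined earlier and an observation vector y coded as integers 1..K; a minimal illustration, not an optimised implementation.

```r
# Sketch of the forward/backward recursions on the log scale.
logsumexp <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }

forward_backward <- function(y, P, Q, mu) {
  n <- length(y); S <- nrow(P)
  la <- lb <- matrix(0, S, n)              # log alpha_s(i), log beta_s(i)
  la[, 1] <- log(mu) + log(Q[, y[1]])      # initialisation of alpha
  for (i in 2:n)                           # forward recursion
    for (s in 1:S)
      la[s, i] <- log(Q[s, y[i]]) + logsumexp(la[, i - 1] + log(P[, s]))
  lb[, n] <- 0                             # beta_s(n) = 1
  for (i in (n - 1):1)                     # backward recursion
    for (s in 1:S)
      lb[s, i] <- logsumexp(log(P[s, ]) + log(Q[, y[i + 1]]) + lb[, i + 1])
  post <- apply(la + lb, 2, function(v) exp(v - logsumexp(v)))  # P(X_i = s | Y)
  list(log_alpha = la, log_beta = lb, posterior = post)
}

y  <- sample(4, 500, replace = TRUE)       # fake observed sequence for testing
fb <- forward_backward(y, P, Q, mu)
```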
25 Problem 2: finding the most likely hidden sequence $\hat{X}$. A different problem consists of finding the most likely hidden sequence $\hat{X}$. Indeed, the most likely $X_i$ under the marginal distribution may be quite different from $\hat{X}_i$. An algorithm exists to achieve this maximisation: the Viterbi algorithm.
26 The Viterbi algorithm. Define $V_s(i) = \max_{x_1^{i-1}} P(Y_1^i, X_1^{i-1} = x_1^{i-1}, X_i = s)$. Similarly to the previous problem, a forward recursion can be defined for $V_s(i+1)$ as a function of the $V_t(i)$. Following this forward computation, a reverse parsing of the Markov chain can identify the most likely sequence.
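Here is a minimal R sketch of the Viterbi recursion in the same log-scale setting, reusing the toy parameters assumed above: back-pointers are stored during the forward pass and the most likely path is recovered by reverse parsing.

```r
# Sketch of the Viterbi algorithm: forward recursion on V_s(i), then
# reverse parsing of the back-pointers to recover the most likely path.
viterbi <- function(y, P, Q, mu) {
  n <- length(y); S <- nrow(P)
  V   <- matrix(-Inf, S, n)          # V_s(i) on the log scale
  ptr <- matrix(0L, S, n)            # back-pointers
  V[, 1] <- log(mu) + log(Q[, y[1]])
  for (i in 2:n)
    for (s in 1:S) {
      cand <- V[, i - 1] + log(P[, s])
      ptr[s, i] <- which.max(cand)
      V[s, i]   <- log(Q[s, y[i]]) + max(cand)
    }
  path <- integer(n)
  path[n] <- which.max(V[, n])
  for (i in (n - 1):1) path[i] <- ptr[path[i + 1], i + 1]
  path
}
```

Running viterbi() and forward_backward() on the same data lets one compare the joint-likelihood path with the pointwise marginal maximisers, which illustrates the difference between the two problems.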
27 An exercise. Here is a table that shows the probability of the data for three states (one state per row, six points in the chain); the matrix gives the log-likelihood of the data given the position in the chain and the hidden state (which can be 1, 2 or 3). [The table of log-likelihood values was not preserved in the transcription.] Assume that remaining in the same state costs no log-likelihood, but transitioning from one state to another costs one unit of log-likelihood. The initial distribution over the three states is uniform. Compute $V_s(i) = \max_{x_1^{i-1}} P(Y_1^i, X_1^{i-1} = x_1^{i-1}, X_i = s)$ and estimate the most likely Viterbi path.
28 A few words about Andrew Viterbi. Andrew James Viterbi (born in Bergamo in 1935) is an Italian-American electrical engineer and businessman. In addition to his academic work he co-founded Qualcomm. Viterbi made a very large donation to the University of Southern California, which named its engineering school the Viterbi School of Engineering.
29 Computational considerations. Requirements are the same as before: the algorithm requires storing $n \times S$ floats, and computation time is in $O(S^2 n)$. Linearity in $n$ is the key feature because it enables the analysis of very long DNA sequences. Easy to code (in C or R, see example and R libraries).
31 Unknown parameters case. Often we do not know the distribution $P(Y \mid X)$. We may also not know the transition probabilities of the hidden Markov chain X. If the parameters Θ are not known, how can we estimate them?
32 What if we knew X? If we knew X, the problem would become straightforward. For example, the maximum-likelihood estimate would be: $\hat{P}(Y = k \mid X = s) = \frac{\sum_i 1_{Y_i = k, X_i = s}}{\sum_i 1_{X_i = s}}$. More sophisticated (but still straightforward) versions of this could be used if Y were an $n$-th order Markov chain.
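With both chains observed, these counting estimates are one-liners in R; a sketch assuming integer vectors x (hidden states) and y (observations) of equal length.

```r
# Maximum-likelihood estimates from a fully observed (x, y) pair:
# simple normalised counts.
Q_hat <- prop.table(table(state = x, symbol = y), margin = 1)
P_hat <- prop.table(table(from = head(x, -1), to = tail(x, -1)), margin = 1)
```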
33 A typical missing data problem. In this missing-data context, a widely used algorithm is the Expectation-Maximisation (EM) algorithm. The EM algorithm is set up to find the parameters that maximise the likelihood of the observed data Y in the presence of missing data X. At each step the likelihood is guaranteed to increase. However, the algorithm can easily get stuck in a local maximum of the likelihood surface.
34 The basic idea of the EM. The EM is a general iterative algorithm with multiple applications. It first computes the expected value of the log-likelihood given the current parameters (essentially imputing the hidden chain X): $Q(\theta, \theta_n) = E_{X \mid Y, \theta_n}\left[\log L(X, Y, \theta)\right]$. It then maximises the quantity $Q(\theta, \theta_n)$ as a function of θ: $\theta_{n+1} = \operatorname{argmax}_\theta Q(\theta, \theta_n)$.
35 EM in the context of HMMs. The parameter updates are:
$\hat{P}_{st} = \frac{\sum_i P(X_i = s, X_{i+1} = t \mid Y)}{\sum_i P(X_i = s \mid Y)} \qquad \hat{Q}_{sk} = \frac{\sum_i 1_{Y_i = k}\,P(X_i = s \mid Y)}{\sum_i P(X_i = s \mid Y)}$
The updated probabilities can be estimated using the sequences $\alpha_s$, $\beta_s$ computed previously. This special case of the EM for HMMs is called the Baum-Welch algorithm.
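A sketch of one Baum-Welch update built on the forward_backward() function above. For readability it converts back to plain probabilities, so it assumes a sequence short enough that underflow is not an issue; a production implementation would stay on the log scale throughout.

```r
# One Baum-Welch iteration: re-estimate P, Q and mu from the posteriors.
baum_welch_step <- function(y, P, Q, mu) {
  fb <- forward_backward(y, P, Q, mu)
  a <- exp(fb$log_alpha); b <- exp(fb$log_beta); g <- fb$posterior
  n <- length(y); S <- nrow(P)
  num <- matrix(0, S, S)             # sum_i P(X_i = s, X_{i+1} = t | Y)
  for (i in 1:(n - 1)) {
    xi  <- outer(a[, i], Q[, y[i + 1]] * b[, i + 1]) * P
    num <- num + xi / sum(xi)        # sum(xi) = P(Y), so xi/sum(xi) is the posterior
  }
  P_new <- num / rowSums(g[, 1:(n - 1), drop = FALSE])
  Q_new <- sapply(1:ncol(Q), function(k)
             rowSums(g[, y == k, drop = FALSE])) / rowSums(g)
  list(P = P_new, Q = Q_new, mu = g[, 1])
}
```

Iterating baum_welch_step() until the parameters stabilise gives the full algorithm; in exact arithmetic each iteration cannot decrease the likelihood.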
38 Gene prediction. [Figure from Zhang, Nat Rev Genetics, 2002.]
39 Some drawbacks of this approach. The number of hidden states can be very large. Modelling codons takes three states, plus probably three states for the first codon and three for the last. So about nine states just for the exons, and probably nine more on the reverse strand. Some alternatives exist (using semi-Markov models).
41 Copy number variant detection from SNP arrays. [Figure: signal intensities for Allele 1 and Allele 2.]
42 Copy number variant detection from SNP arrays. [Figure from Wang et al, Genome Research 2007.]
45 Stochastic EM (SEM). The EM/Baum-Welch algorithm essentially uses the conditional distribution of X given Y. Another way to compute this expectation is to use a Monte Carlo approach: simulate X given Y and take an average. This is a trade-off: we of course lose the certainty that the likelihood is increasing (as provided by the EM), but the added randomness may avoid the pitfall of the estimator getting stuck in a local maximum (a major issue with the EM).
46 Stochastic EM (SEM). A simulation of X conditionally on Y would use the following decomposition:
$P(X_1^N \mid Y_1^N) = P(X_1 \mid Y_1^N)\,P(X_2 \mid Y_1^N, X_1) \cdots P(X_N \mid Y_1^N, X_1^{N-1})$
This relies on being able to compute the marginal probabilities, but this is what Baum-Welch does. Once the α, β have been computed, the simulation is linear in time and multiple sequences can be simulated rapidly.
47 How to simulate in practice. The simulation uses the equality:
$P(X_{i+1} = t \mid Y, X_i = s) = \frac{P_{st}\,P(Y_{i+1} \mid X_{i+1} = t)\,P(Y_{i+2}^n \mid X_{i+1} = t)}{P(Y_{i+1}^n \mid X_i = s)} = \frac{P_{st}\,P(Y_{i+1} \mid X_{i+1} = t)\,\beta_t(i+1)}{\beta_s(i)}$
Note that this is a forward-backward algorithm as well, but the forward step is built into the simulation step, unlike in the traditional Baum-Welch.
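A sketch of this sampling step in R, reusing forward_backward() from above; the first state is drawn from $P(X_1 = s \mid Y) \propto \mu_s\,Q_{s,y_1}\,\beta_s(1)$, which is consistent with the definitions of α and β used here.

```r
# Simulate one hidden path X conditionally on Y (the stochastic EM step).
sample_path <- function(y, P, Q, mu) {
  lb <- forward_backward(y, P, Q, mu)$log_beta
  n <- length(y); x <- integer(n)
  w <- log(mu) + log(Q[, y[1]]) + lb[, 1]          # log P(X_1 = s | Y) + const
  x[1] <- sample(nrow(P), 1, prob = exp(w - max(w)))
  for (i in 1:(n - 1)) {                           # forward sampling step
    w <- log(P[x[i], ]) + log(Q[, y[i + 1]]) + lb[, i + 1]
    x[i + 1] <- sample(nrow(P), 1, prob = exp(w - max(w)))
  }
  x                                                # sample() renormalises prob
}
```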
48 Estimation issues. Using a single simulated run of the hidden chain X is necessarily less efficient than relying on the expected probability. The number of data points must be very large to make the estimation precise. One could potentially take an average over multiple simulated runs; with a sufficient number of simulations one actually gets very close to the EM. Like most practical estimation procedures, one has to find the right combination of tools, and there is no single answer.
50 Semi-Markov models (HSMM). In the context of gene prediction, using three states per codon is not satisfying; we would like something that takes groups of 3 bp into account jointly. Semi-Markov models do exactly this. When entering a state $s$, a random variable $T_s$ is drawn for the duration of the stay in state $s$. The emission probability for Y can then be defined over the entire duration of the stay. So codons are naturally defined as groups of 3 bp instead of dealing with multiple hidden states.
51 Backward recursion for SEM applied to semi-Markov hidden chains. We are interested in computing, for $n \in [1, N-1]$ and $i \in [1, S]$, the quantities:
$\beta_i(n) = P(Y_{n+1}^N \mid Y_1^n, X_n = i), \qquad \beta_i(N) = 1$
$\beta_i(n) = \sum_j \sum_{l < N-n} P_{ij}\,P(T_j = l)\,P(Y_{n+1}^{n+l} \mid X_{n+1}^{n+l} = j)\,\beta_j(n+l)$
Note the complexity is now in $N\,S^2\,\max(l)$, as opposed to $N\,S^2$ before.
52 Forward simulations for SEM. One can simulate a new hidden sequence recursively with the formula:
$P(X_{n+1}^{n+l} = j \mid Y_1^N, X_n = i) = \frac{P_{ij}\,P(T_j = l)\,P(Y_{n+1}^{n+l} \mid X_{n+1}^{n+l} = j)\,\beta_j(n+l)}{\beta_i(n)}$
This is very much analogous to the basic HMM situation, with the extra complication generated by the variable state length.
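A compact sketch of the semi-Markov backward recursion in R. The duration matrix dur, with dur[j, l] = P(T_j = l) truncated at some Lmax, is a hypothetical ingredient introduced for illustration; emissions are taken independent within a segment, plain probabilities are used for readability, and end-of-sequence effects are handled crudely.

```r
# Backward recursion for a hidden semi-Markov chain (illustrative only;
# a real implementation would work on the log scale).
hsmm_backward <- function(y, P, Q, dur) {
  n <- length(y); S <- nrow(P); Lmax <- ncol(dur)
  beta <- matrix(0, S, n)
  beta[, n] <- 1
  for (m in (n - 1):1)
    for (i in 1:S) {
      acc <- 0
      for (j in 1:S)
        for (l in 1:min(Lmax, n - m)) {
          emit <- prod(Q[j, y[(m + 1):(m + l)]])   # P(Y_{m+1}^{m+l} | X = j)
          acc  <- acc + P[i, j] * dur[j, l] * emit * beta[j, m + l]
        }
      beta[i, m] <- acc
    }
  beta
}
```

Each position costs $S^2 \max(l)$ work, which matches the complexity remark above.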
53 Estimation for semi-Markov models. It is possible to run a Viterbi algorithm using the same recursion derived for the Markovian case. It is also possible to use an SEM algorithm to simulate the hidden sequence X and use it to estimate the parameters of the model. A full EM is also possible, but I never implemented it. The computational requirements may become challenging, but it all depends on the application.