CSCI 599 Class Presentation. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Estimates

1 CSCI 599 Class Presentation. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Estimates. April 26th, 2012

2 Topics Covered in this Presentation
- A (Brief) Review of HMMs
- HMM Parameter Learning
- Expectation-Maximization (EM) Algorithms: Baum-Welch Algorithm, Viterbi Learning Algorithm
- Markov Chain Monte Carlo (MCMC) Algorithms: Metropolis-Hastings Algorithm, Gibbs Sampling Algorithm
- Comparisons of EM and MCMC models
- Work in Progress
- Summary

3 Question
Given an HMM (a sequence of noisy observations which come from true hidden states), how do we infer the parameters of this model such that we maximize either the likelihood of the observations (Maximum Likelihood) or the likelihood of the most probable (posterior) sequence (Maximum a Posteriori)?

4 Review of Hidden Markov Models (HMMs)
A simple binary HMM can be characterized by a prior over the hidden states, a transition matrix, and an observation matrix.
[Figure: chain of hidden states $S_1, S_2, S_3, S_4, \dots, S_n$, each emitting an observable $O_1, O_2, O_3, O_4, \dots, O_n$, annotated with the prior, the transition matrix, and the observation matrix.]
For convenience, we can write the complete parameter set as: $\lambda = (a_{ij}, b_{ij}, \pi_i)$

5 Review of Hidden Markov Models (HMMs)
Given $\lambda$, one can calculate the probability $P(O \mid \lambda)$ of some observed sequence $O = O_1 O_2 O_3 \dots O_N$.
If we want to maximize the likelihood of the observed sequence $O$ (i.e. maximum likelihood), we need $\lambda$ to satisfy:
$\lambda_{ML} = \arg\max_{\lambda} p_{\lambda}(O)$
Or, if we want instead to maximize the likelihood of the most probable sequence (i.e. maximum a posteriori), then $\lambda$ must satisfy:
$\lambda_{MAP} = \arg\max_{\lambda} \max_{S} p_{\lambda}(S, O)$
One can also estimate the hidden states $\{S\}$ from $P(\{S\} \mid O, \lambda)$, which can be deduced (for instance) from the Viterbi algorithm.
The question again is, how do we optimize $\lambda$? Are some methods better than others?
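As an illustration of how the probability of an observed sequence given the parameters can be evaluated, here is a minimal sketch of the standard forward recursion for a discrete HMM. The matrices `A`, `B`, the prior `pi`, and the observation sequence `obs` are made-up values for a two-state, two-symbol model, not numbers from the presentation. The recursion is left unscaled, so it is only suitable for short sequences.

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Return P(O | lambda) for a discrete HMM using the forward recursion."""
    # alpha_1(i) = pi_i * b_i(O_1)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
        alpha = (alpha @ A) * B[:, o]
    # P(O | lambda) = sum_i alpha_N(i)
    return float(alpha.sum())

# Hypothetical binary HMM (all values are assumptions for illustration).
A = np.array([[0.9, 0.1],    # a_ij = P(S_{t+1} = j | S_t = i)
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],    # B[i, k] = P(O_t = k | S_t = i)
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])    # prior over the first hidden state
obs = [0, 0, 1]

likelihood = forward_likelihood(A, B, pi, obs)
print(likelihood)
```

The recursion costs on the order of N·K² operations for N observations and K states, whereas summing over every hidden path directly would cost on the order of Kᴺ.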

6 Review of Hidden Markov Models (HMMs)
Also, can the models we use to deduce $\lambda$ accommodate observations which follow continuous distributions?
[Figure: continuous distributions for the observations $O_1$, $O_2$, $O_3$.]

7 HMM Parameter Learning
Methods of obtaining $\lambda$ can be broken up into two broad categories:
METHOD: Frequently sample the observations of an HMM until some confidence interval is obtained for $\lambda$ (e.g. $a_{ij}$ = 95% of the true $a_{ij}$). Deterministic. Assumes that with enough sampling, a correct $\lambda$ will eventually be found.
EXAMPLES: Expectation Maximization (EM)
- Baum-Welch Algorithm
- Viterbi-Training Algorithm
METHOD: Use Bayes' Rule on prior probabilities to continuously update the probability of the current state, i.e. the posterior. Furthermore, the use of random numbers can allow for quick convergence when estimating $\lambda$.
EXAMPLES: Markov Chain Monte Carlo (MCMC)
- Metropolis-Hastings
- Gibbs Sampling

8 HMM Parameter Learning (EM)
EM is not per se a tool for frequentist (ML) inference, but a framework that can also be used for computing Bayesian (MAP) estimates. Thus it is considered a method for computing point estimates, or single-valued results.
Recall for instance that EM approaches use a forward-backward algorithm. That is, they traverse the Markov chain forward and backward until $\lambda$ converges to within some confidence interval.
[Figure: hidden states $S_1, S_2, \dots, S_t, \dots, S_n$ with observations $O_1, O_2, \dots, O_t, \dots, O_n$.]

10 HMM Parameter Learning (EM)
The Baum-Welch algorithm was the first to implement this method. It calculates the forward and backward probabilities from a set of initial conditions, including $\beta_N(i) = 1$, and a recursion relationship, where
$a_{ij}$ = transition matrix = $P(S_{t+1} = j \mid S_t = i)$,
$b_i(O_t)$ = probability of observing $O_t$ when $S_t = i$.
Make some initial guess for $a_{ij}$ and $b_i$. Sample often. Obtain a convergent $\lambda = (a_{ij}, b_i, \pi)$ where $\pi_i = \gamma_1(i)$.
Drawbacks: potentially slow convergence (each iteration represents a full chain sweep); local rather than global maximization of the data likelihood; and it is unclear how big $N$ must be to get some desired precision quickly.
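To make the procedure concrete, here is a minimal sketch of one Baum-Welch update for a discrete HMM, written in the usual textbook form rather than taken from the slide. The function name `baum_welch_step`, the starting matrices, and the random observation sequence are all assumptions for illustration. The recursions are unscaled, so this version only suits short sequences.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM (Baum-Welch) update for a discrete HMM.

    Returns the re-estimated (A, B, pi) and the likelihood P(O | lambda)
    under the parameters that were passed in.
    """
    obs = np.asarray(obs)
    N, K = len(obs), A.shape[0]
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))                       # beta_N(i) = 1
    alpha[0] = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, N):                        # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(N - 2, -1, -1):               # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood            # gamma_t(i) = P(S_t = i | O)
    # xi[t, i, j] = P(S_t = i, S_{t+1} = j | O)
    back = (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi = alpha[:-1, :, None] * A[None] * back / likelihood
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, gamma[0], likelihood    # pi_i = gamma_1(i)

# Hypothetical data and initial guess (assumptions for illustration).
rng = np.random.default_rng(0)
obs = rng.integers(0, 2, size=50)
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])

likelihoods = []
for _ in range(20):
    A, B, pi, L = baum_welch_step(A, B, pi, obs)
    likelihoods.append(L)
```

Each pass through the loop is one full forward and backward sweep of the chain, and the recorded likelihood never decreases from one iteration to the next. The value it settles on depends on the initial guess, which is the local-maximum drawback noted above.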

11 HMM Parameter Learning (Bayesian)
[Figure: hidden state S and observable O.]

13 Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) algorithms are a class of Bayesian inference algorithms for sampling probability distributions by constructing a Markov chain (hidden states) that has some desired distribution as its equilibrium distribution.
For example, suppose we sample observations $\{O\} \in \chi^N$ and we're given some standard distribution $P(S_t)$ for each state (e.g. $N(\mu = 0, \sigma^2 = 0.2)$). The total mixed system $P(S)$ is very complex when $N$ is large because there are many ways to construct it.
Markov Chain Monte Carlo methods can observe $\{O\}$ and iteratively construct a Markov chain $\{S\}$ such that its equilibrium distribution is $P(S)$. Eventually the Markov chain $\{S\}$ will converge with distribution $P(S)$, in which case $\lambda = (a_{ij}, b_i, \pi)$ can be extracted.
[Figure: sample distribution of $\{O\}$, $P(x)$, for the chain $S_1, S_2, S_3, S_4, \dots, S_n$ with observations $O_1, O_2, O_3, O_4, \dots, O_n$ and $\{S\} \in (-2, 0, 2)$.]

14 Markov Chain Monte Carlo
Many separate MCMC random-walk algorithms exist, such as:
- Metropolis-Hastings Algorithm
- Gibbs Sampling Algorithm
- Multiple-Try Metropolis Algorithm
All of these methods try to sample the entire state space in order to reproduce the equilibrium distribution.
For example: Metropolis-Hastings for a simple Markov chain. Suppose we want to construct a Markov chain with a probability distribution $P(S)$.
1.) Pick an arbitrary probability density $Q(S' \mid S_t)$ which suggests a new state $S'$ from $S_t$. For the acceptance ratio in step (3) to be valid, $Q$ must be symmetric, i.e. $Q(S' \mid S_t) = Q(S_t \mid S')$, such as a Gaussian.
2.) For each state $S_t$, propose for the next state a value $S'$ which is generated from $Q(S' \mid S_t)$.
3.) Calculate an acceptance ratio $a = P(S') / P(S_t)$.
4.) If $a \geq 1$, accept $S'$ as $S_{t+1}$. Else, accept $S'$ as $S_{t+1}$ if $\mathrm{rand}(0,1) \leq a$. Otherwise reject $S'$ and set $S_{t+1} = S_t$.
5.) With $S_{t+1}$ updated, move forward and repeat steps (2-5).
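The steps above can be sketched as a random-walk Metropolis sampler. This is a minimal illustration, assuming a symmetric Gaussian proposal and a hypothetical one-dimensional target (a standard normal, specified only up to a constant); the function name `metropolis` and its parameters are not from the presentation. It works with log probabilities for numerical stability and, as in the standard formulation, a rejected proposal leaves the chain at its current state.

```python
import numpy as np

def metropolis(log_p, s0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalised log density log_p."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_samples)
    s = s0
    for t in range(n_samples):
        # Step 2: propose S' from a symmetric Gaussian Q(S' | S_t).
        s_new = s + step * rng.normal()
        # Step 3: acceptance ratio a = P(S') / P(S_t), computed in log space.
        log_a = log_p(s_new) - log_p(s)
        # Step 4: accept with probability min(1, a); otherwise keep S_t.
        if np.log(rng.uniform()) <= log_a:
            s = s_new
        chain[t] = s
    return chain

# Hypothetical target: standard normal density, up to a constant.
chain = metropolis(lambda s: -0.5 * s**2, s0=0.0, n_samples=20000)
burned = chain[2000:]          # discard an initial burn-in segment
print(burned.mean(), burned.std())
```

Only the ratio of target densities is needed, so the normalising constant of $P(S)$ never has to be computed. The proposal width `step` trades acceptance rate against how far each move travels.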

15 Markov Chain Monte Carlo
For Bayesian inference in an HMM, the Gibbs Sampling Algorithm, a special case of Metropolis-Hastings for multivariate distributions, is commonly used to extract $\lambda = (a_{ij}, b_i, \pi)$.
This procedure also allows us to extract credible intervals for multiple parameters of $\lambda$, rather than the single (local) point estimate obtained from EM.
For a simple HMM, studies have compared EM models to MCMC.
[Figure: hidden chain $X_1, X_2, X_3, X_4, \dots, X_n$ with observations $Y_1, Y_2, Y_3, Y_4, \dots, Y_n$.]
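As a rough illustration of how Gibbs sampling can yield credible intervals for HMM parameters, here is a minimal sketch for a discrete HMM. It is not the algorithm from the study discussed in the presentation: it assumes flat Dirichlet priors on every row of the transition and observation matrices, and it alternates between sampling the hidden path (forward filtering, backward sampling) and sampling the parameters given that path. The function name `gibbs_hmm` and the placeholder data are assumptions.

```python
import numpy as np

def gibbs_hmm(obs, K, M, n_iter=100, seed=0):
    """Gibbs sampler for a discrete HMM with flat Dirichlet priors (a sketch)."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs)
    N = len(obs)
    A = rng.dirichlet(np.ones(K), size=K)        # initial transition rows
    B = rng.dirichlet(np.ones(M), size=K)        # initial observation rows
    pi = np.full(K, 1.0 / K)
    samples = []
    for _ in range(n_iter):
        # 1) Sample the hidden path S | O, lambda:
        #    forward filtering with per-step normalisation...
        alpha = np.zeros((N, K))
        alpha[0] = pi * B[:, obs[0]]
        alpha[0] /= alpha[0].sum()
        for t in range(1, N):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            alpha[t] /= alpha[t].sum()
        #    ...followed by backward sampling.
        S = np.zeros(N, dtype=int)
        S[-1] = rng.choice(K, p=alpha[-1])
        for t in range(N - 2, -1, -1):
            w = alpha[t] * A[:, S[t + 1]]
            S[t] = rng.choice(K, p=w / w.sum())
        # 2) Sample lambda | S, O from Dirichlet posteriors (prior counts of 1).
        trans = np.zeros((K, K))
        np.add.at(trans, (S[:-1], S[1:]), 1)     # transition counts
        emit = np.zeros((K, M))
        np.add.at(emit, (S, obs), 1)             # emission counts
        A = np.array([rng.dirichlet(1 + trans[i]) for i in range(K)])
        B = np.array([rng.dirichlet(1 + emit[i]) for i in range(K)])
        pi = rng.dirichlet(1 + np.bincount(S[:1], minlength=K))
        samples.append((A.copy(), B.copy(), pi.copy()))
    return samples

# Placeholder observations (an assumption, not data from the presentation).
rng = np.random.default_rng(1)
obs = rng.integers(0, 2, size=200)
samples = gibbs_hmm(obs, K=2, M=2, n_iter=100)

# 95% credible interval for a_00 from the post-burn-in draws.
a00 = np.array([A[0, 0] for A, _, _ in samples[50:]])
interval = np.percentile(a00, [2.5, 97.5])
print(interval)
```

Because the hidden-state labels are interchangeable, the draws can swap labels between sweeps, so intervals computed this way are only meaningful once the labels are pinned down, for instance by ordering the states.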

17 Markov Chain Monte Carlo
The Gibbs Sampling Algorithm used in this paper is as follows:
[Figure: algorithm listing, including the backward probabilities.]

19 Comparisons between EM and MCMC
For a Markov chain of n = 1000 and three different values of $\sigma^2$, the authors found:
- MCMC is faster. To reach 95% confidence of the true mean values (and thus $a_{ij}$), EM took 2177 sec to complete whereas MCMC took 237 sec.
- MCMC can be noisy when the variance is high.
- EM may not always obtain the global solution.
[Figure panels: EM, $\sigma^2$ = 0.5, 1, 2.5; MCMC, $\sigma^2$ = 0.5; MCMC, $\sigma^2$ = 1; MCMC, $\sigma^2$ = 1.5. Axis label: Means.]

20 Work in Progress
- Implementing MCMC by hand in MATLAB (nontrivial).
- Comparing EM versus MCMC as a function of chain length.
- What about chains which don't adequately sample the state space? Does this affect the speed of obtaining $\lambda$ with MCMC?
- At what observational variances does the random noise in MCMC overtake quick convergence to an optimal parameter set?
- Are there data uncertainty thresholds which impede progress using MCMC algorithms?

21 Summary
- Markov Chain Monte Carlo (MCMC) algorithms are useful Bayesian inference tools in Hidden Markov Models (HMMs), and can be used to quickly extract an HMM parameter set.
- MCMC algorithms can be quicker and less computationally complex than EM algorithms; however, their implementation and setup can also be much more complex.
- MCMC convergence appears strongly dependent on the amount of data uncertainty. EM shows some dependence on data uncertainty, but continuous sampling always improves inference.
- When MCMC does converge on an optimized parameter set $\lambda$, it is guaranteed to have a globally maximized likelihood. EM/MLE techniques, however, may only find parameter sets which are locally maximized.

22 Thank you for your Attention
