Expectation Maximization
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, April 10th, 2006
Announcements
Reminder: Project milestone due Wednesday, beginning of class
Coordinate descent algorithms
Want: $\min_a \min_b F(a,b)$
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded below) to a, often good, local optimum, as we saw in the applet (play with it!). See the sketch below.
K-means is a coordinate descent algorithm!
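To make the recipe concrete, here is a minimal coordinate-descent sketch (not from the lecture; the objective F is a made-up quadratic, and scipy's minimize_scalar stands in for the per-coordinate minimizers):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F(a, b):
    # Any objective bounded below works; this one is a toy quadratic.
    return (a - 2 * b) ** 2 + (b - 1) ** 2

a, b = 0.0, 0.0
for _ in range(50):
    a = minimize_scalar(lambda a_: F(a_, b)).x  # fix b, minimize over a
    b = minimize_scalar(lambda b_: F(a, b_)).x  # fix a, minimize over b
print(a, b, F(a, b))  # approaches the optimum at a=2, b=1
```

Each sweep can only decrease F, and since F is bounded below, the iterates converge to a point where neither coordinate can improve alone, i.e. a (possibly local) optimum.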
Expectation Maximization
Back to Unsupervised Learning of GMMs: a simple case
Remember: We have unlabeled data x_1, x_2, …, x_m. We know there are k classes. We know the priors P(y=1), P(y=2), …, P(y=k). We don't know μ_1, μ_2, …, μ_k.
We can write the likelihood of the data:
$$P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^m p(x_j \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^m \sum_{i=1}^k p(x_j \mid \mu_i)\, P(y=i) \propto \prod_{j=1}^m \sum_{i=1}^k \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
EM for simple case of GMMs: The E-step
If we know μ_1, …, μ_k, we can easily compute the probability that point x_j belongs to class y=i:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
EM for simple case of GMMs: The M-step
If we know the probability that point x_j belongs to class y=i, the MLE for μ_i is a weighted average: imagine k copies of each x_j, each with weight P(y=i|x_j):
$$\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}$$
E.M. for GMMs
E-step: Compute expected classes of all datapoints for each class:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
Just evaluate a Gaussian at x_j.
M-step: Compute max. likelihood μ given our data's class membership distributions:
$$\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}$$
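A minimal sketch of these two updates in Python, assuming the slides' simple case (the priors and a shared spherical σ are known; only the means are learned). The function name and the initialization at random data points are my own, not from the course:

```python
import numpy as np

def em_simple_gmm(X, priors, sigma, n_iters=100, seed=0):
    """EM for the simple case: known priors and shared spherical sigma;
    only the means mu_1..mu_k are learned."""
    priors = np.asarray(priors, dtype=float)
    m, d = X.shape
    k = len(priors)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]  # init means at data points
    for _ in range(n_iters):
        # E-step: P(y=i|x_j) proportional to exp(-||x_j-mu_i||^2/(2 sigma^2)) P(y=i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (m, k)
        logw = -sq / (2 * sigma ** 2) + np.log(priors)
        logw -= logw.max(axis=1, keepdims=True)                # numerical stability
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)                      # responsibilities
        # M-step: each mean is the responsibility-weighted average of the points
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu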
E.M. for General GMMs
Iterate. On the t-th iteration let our estimates be
λ_t = { μ_1^{(t)}, μ_2^{(t)} … μ_k^{(t)}, Σ_1^{(t)}, Σ_2^{(t)} … Σ_k^{(t)}, p_1^{(t)}, p_2^{(t)} … p_k^{(t)} }
(p_i^{(t)} is shorthand for the estimate of P(y=i) on the t-th iteration.)
E-step: Compute expected classes of all datapoints for each class:
$$P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p\big(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)}\big)$$
Just evaluate a Gaussian at x_j.
M-step: Compute max. likelihood parameters given our data's class membership distributions:
$$\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)}$$
$$\Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\,\big[x_j - \mu_i^{(t+1)}\big]\big[x_j - \mu_i^{(t+1)}\big]^T}{\sum_j P(y=i \mid x_j, \lambda_t)}$$
$$p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m}, \qquad m = \#\text{records}$$
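For the general case, the same loop also re-estimates covariances and mixing weights. A sketch under the same caveats (function name, the random initialization, and the small ridge term 1e-6·I added for numerical stability are my additions, not the course's reference code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a general GMM: learns means, covariances, and mixing weights."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: P(y=i|x_j, lambda_t) proportional to p_i * N(x_j; mu_i, Sigma_i)
        w = np.column_stack([
            p[i] * multivariate_normal.pdf(X, mu[i], Sigma[i]) for i in range(k)
        ])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted MLE for each component
        Ni = w.sum(axis=0)                      # effective count per class
        mu = (w.T @ X) / Ni[:, None]
        for i in range(k):
            Xc = X - mu[i]
            Sigma[i] = (w[:, i, None] * Xc).T @ Xc / Ni[i] + 1e-6 * np.eye(d)
        p = Ni / m                              # p_i^{(t+1)} = sum_j w_ji / m
    return mu, Sigma, p
```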
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Three classes of assay (each learned with its own mixture model)
Resulting Bayes Classifier
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous
The general learning problem with missing data
Marginal likelihood: x is observed, z is missing:
$$\ell(\theta; \mathcal{D}) = \log \prod_{j=1}^m P(x_j \mid \theta) = \sum_{j=1}^m \log P(x_j \mid \theta) = \sum_{j=1}^m \log \sum_z P(x_j, z \mid \theta)$$
E-step
x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: Q(z|x_j) for each x_j. This is the probability computed during the classification step, and it corresponds to the classification step in K-means.
Jensen's inequality
Theorem: $\log \sum_z P(z)\, f(z) \ge \sum_z P(z) \log f(z)$
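A quick numeric sanity check of the inequality (the distribution and function values below are arbitrary):

```python
import numpy as np

# Jensen for the concave log: log(sum_z P(z) f(z)) >= sum_z P(z) log f(z).
P = np.array([0.2, 0.5, 0.3])   # any distribution over z
f = np.array([1.0, 4.0, 0.5])   # any positive function of z
lhs = np.log(np.sum(P * f))
rhs = np.sum(P * np.log(f))
assert lhs >= rhs               # equality holds iff f is constant
print(lhs, rhs)
```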
Applying Jensen's inequality
Use $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$ with $P(z) = Q(z|x_j)$ and $f(z) = P(x_j, z \mid \theta)/Q(z|x_j)$:
$$\ell(\theta) = \sum_j \log \sum_z Q(z|x_j)\, \frac{P(x_j, z \mid \theta)}{Q(z|x_j)} \ \ge\ \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z|x_j)}$$
The M-step maximizes the lower bound on weighted data
Lower bound from Jensen's:
$$\ell(\theta) \ge \sum_j \sum_z Q^{(t+1)}(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q^{(t+1)}(z|x_j)}$$
This corresponds to a weighted dataset:
⟨x_1, z=1⟩ with weight Q^{(t+1)}(z=1|x_1)
⟨x_1, z=2⟩ with weight Q^{(t+1)}(z=2|x_1)
⟨x_1, z=3⟩ with weight Q^{(t+1)}(z=3|x_1)
⟨x_2, z=1⟩ with weight Q^{(t+1)}(z=1|x_2)
⟨x_2, z=2⟩ with weight Q^{(t+1)}(z=2|x_2)
⟨x_2, z=3⟩ with weight Q^{(t+1)}(z=3|x_2)
The M-step
Maximization step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_j \sum_z Q^{(t+1)}(z|x_j) \log P(x_j, z \mid \theta)$$
Use expected counts instead of counts: if learning requires Count(x,z), use $E_{Q^{(t+1)}}[\text{Count}(x,z)]$. A small sketch follows.
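A sketch of what "expected counts" means for a discrete x (the setup is hypothetical, not from the slides): each datapoint contributes its posterior Q(z|x_j) to the count table instead of a hard count of 1.

```python
import numpy as np

def expected_counts(xs, Q, n_symbols):
    """xs[j] is the symbol observed for point j; Q[j, i] = Q(z=i | x_j)."""
    m, k = Q.shape
    counts = np.zeros((n_symbols, k))
    for j, x in enumerate(xs):
        counts[x] += Q[j]   # soft count: Q(z|x_j) instead of 1
    return counts           # E_Q[Count(x, z)]

xs = [0, 1, 1, 2]
Q = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]])
print(expected_counts(xs, Q, n_symbols=3))
```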
Convergence of EM
Define the potential function F(θ,Q):
$$F(\theta, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z|x_j)}$$
EM corresponds to coordinate ascent on F. Thus, it maximizes a lower bound on the marginal log likelihood.
M-step is easy
Using the potential function:
$$\max_\theta F(\theta, Q^{(t+1)}) = \max_\theta \sum_j \sum_z Q^{(t+1)}(z|x_j) \log P(x_j, z \mid \theta)$$
(the entropy term $-\sum_j \sum_z Q^{(t+1)}(z|x_j) \log Q^{(t+1)}(z|x_j)$ does not depend on θ), which is exactly the weighted-data maximization of the M-step.
E-step also doesn't decrease the potential function (part 1)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta^{(t)})}{Q(z|x_j)} = \sum_j \sum_z Q(z|x_j) \log \frac{P(z \mid x_j, \theta^{(t)})\, P(x_j \mid \theta^{(t)})}{Q(z|x_j)}$$
KL-divergence
$$KL\big(Q \,\|\, P\big) = \sum_z Q(z) \log \frac{Q(z)}{P(z)}$$
Measures a (non-symmetric) "distance" between distributions. KL is zero if and only if Q=P.
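A tiny illustration with arbitrary distributions:

```python
import numpy as np

def kl(Q, P):
    # KL(Q || P) = sum_z Q(z) log(Q(z)/P(z)); assumes strictly positive entries
    return np.sum(Q * np.log(Q / P))

Q = np.array([0.4, 0.6])
P = np.array([0.5, 0.5])
print(kl(Q, P))   # > 0
print(kl(P, P))   # = 0, since the distributions are identical
```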
E-step also doesn't decrease the potential function (part 2)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(z \mid x_j, \theta^{(t)})\, P(x_j \mid \theta^{(t)})}{Q(z|x_j)} = \sum_j \log P(x_j \mid \theta^{(t)}) - \sum_j KL\big(Q(z|x_j) \,\|\, P(z \mid x_j, \theta^{(t)})\big)$$
E-step also doesn't decrease the potential function (part 3)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \ell(\theta^{(t)}) - \sum_j KL\big(Q(z|x_j) \,\|\, P(z \mid x_j, \theta^{(t)})\big)$$
Maximizing F(θ^{(t)},Q) over Q amounts to minimizing the KL term, so set Q to the posterior probability:
$$Q^{(t+1)}(z|x_j) = P(z \mid x_j, \theta^{(t)})$$
Note that $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)})$.
EM is coordinate ascent
M-step: Fix Q, maximize F over θ (a lower bound on ℓ(θ)):
$$\theta^{(t+1)} \leftarrow \arg\max_\theta F(\theta, Q^{(t+1)})$$
E-step: Fix θ, maximize F over Q:
$$Q^{(t+1)}(z|x_j) \leftarrow P(z \mid x_j, \theta^{(t)})$$
Realigns F with the likelihood: $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)})$.
What you should know
K-means for clustering: the algorithm, and that it converges because it's coordinate ascent.
EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember: E.M. can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.
The general case for EM.
Acknowledgements
K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/
K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html
Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html
EM for HMMs, a.k.a. The Baum-Welch Algorithm
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, April 10th, 2006
Learning HMMs from fully observable data is easy
[HMM diagram: hidden states X_1 … X_5, each over the alphabet {a, …, z}, with observations O_1 … O_5]
Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t)
Learning HMMs from fully observable data is easy
[Same HMM diagram: hidden states X_1 … X_5 over {a, …, z}, observations O_1 … O_5]
Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t)
What if O is observed, but X is hidden?
Log likelihood for HMMs with hidden X
Marginal likelihood: O is observed, X is missing. For simplicity of notation, we'll assume the training data consists of only one sequence:
$$\ell(\theta; o) = \log P(o \mid \theta) = \log \sum_x P(o, x \mid \theta)$$
If there were m sequences:
$$\ell(\theta; \mathcal{D}) = \sum_{j=1}^m \log \sum_x P(o_j, x \mid \theta)$$
E-step
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. This corresponds to inference: use the forwards-backwards algorithm!
The M-step
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
Maximization step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_x Q^{(t+1)}(x \mid o) \log P(x, o \mid \theta)$$
Use expected counts instead of counts: if learning requires Count(x,o), use $E_{Q^{(t+1)}}[\text{Count}(x,o)]$.
Starting state probability P(X_1)
Using expected counts (for our single training sequence, the expected count of starting in a is just the posterior at position 1):
$$P(X_1 = a) = \theta_{X_1 = a} = Q(X_1 = a \mid o)$$
Transition probability P(X_{t+1} | X_t)
Using expected counts:
$$P(X_{t+1} = a \mid X_t = b) = \theta_{X_{t+1}=a \mid X_t=b} = \frac{\sum_{t=1}^{n-1} Q(X_{t+1}=a, X_t=b \mid o)}{\sum_{t=1}^{n-1} Q(X_t=b \mid o)}$$
Observation probability P(O_t | X_t)
Using expected counts:
$$P(O_t = a \mid X_t = b) = \theta_{O_t=a \mid X_t=b} = \frac{\sum_{t=1}^{n} Q(X_t=b \mid o)\, \mathbb{1}[o_t = a]}{\sum_{t=1}^{n} Q(X_t=b \mid o)}$$
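The three updates above, collected into one sketch (single training sequence; the array names gamma, xi, obs are my own, with gamma and xi assumed to come from the E-step on the next slides):

```python
import numpy as np

def baum_welch_m_step(gamma, xi, obs, n_obs_symbols):
    """M-step from expected counts.
    gamma[t, a] = Q(x_t = a | o)              -- from forwards-backwards
    xi[t, b, a] = Q(x_t = b, x_{t+1} = a | o) -- pairwise posteriors
    obs[t]      = index of the observed symbol o_t
    """
    n, S = gamma.shape
    pi = gamma[0]                                         # P(X_1 = a)
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # P(X_{t+1}=a | X_t=b)
    B = np.zeros((S, n_obs_symbols))
    for t in range(n):
        B[:, obs[t]] += gamma[t]                          # soft emission counts
    B /= gamma.sum(axis=0)[:, None]                       # P(O_t=a | X_t=b)
    return pi, A, B
```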
E-step revisited
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. Must compute:
Q(x_t=a | o): marginal probability of each position
Q(x_{t+1}=a, x_t=b | o): joint distribution between pairs of positions
The forwards-backwards algorithm
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
Forwards pass. Initialization: $\alpha_1(x_1) = P(x_1)\, P(o_1 \mid x_1)$. For i = 2 to n, generate a forwards factor by eliminating X_{i-1}:
$$\alpha_i(x_i) = P(o_i \mid x_i) \sum_{x_{i-1}} \alpha_{i-1}(x_{i-1})\, P(x_i \mid x_{i-1})$$
Backwards pass. Initialization: $\beta_n(x_n) = 1$. For i = n-1 to 1, generate a backwards factor by eliminating X_{i+1}:
$$\beta_i(x_i) = \sum_{x_{i+1}} P(x_{i+1} \mid x_i)\, P(o_{i+1} \mid x_{i+1})\, \beta_{i+1}(x_{i+1})$$
∀i, the probability is: $Q(x_i \mid o) \propto \alpha_i(x_i)\, \beta_i(x_i)$
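A sketch of the two passes for the per-position marginals (the matrix names pi, A, B are my own; for long sequences the factors should be rescaled to avoid underflow, which this sketch omits):

```python
import numpy as np

def forwards_backwards(pi, A, B, obs):
    """pi[a] = P(X_1=a), A[b, a] = P(X_{i+1}=a | X_i=b), B[a, v] = P(O_i=v | X_i=a)."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))
    alpha[0] = pi * B[:, obs[0]]                 # forwards initialization
    for i in range(1, n):                        # eliminate X_{i-1}
        alpha[i] = B[:, obs[i]] * (alpha[i - 1] @ A)
    for i in range(n - 2, -1, -1):               # eliminate X_{i+1}
        beta[i] = A @ (B[:, obs[i + 1]] * beta[i + 1])
    marginals = alpha * beta
    marginals /= marginals.sum(axis=1, keepdims=True)  # Q(x_i = a | o), all i
    return alpha, beta, marginals
```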
E-step revisited
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. Must compute:
Q(x_t=a | o): marginal probability of each position. Just forwards-backwards!
Q(x_{t+1}=a, x_t=b | o): joint distribution between pairs of positions. Homework!