Expectation Maximization
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, April 10th, 2006
Announcements
Reminder: Project milestone due Wednesday, beginning of class
Coordinate descent algorithms
Want: $\min_a \min_b F(a,b)$
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded below) to a, often good, local optimum, as we saw in the applet (play with it!). See the sketch below.
K-means is a coordinate descent algorithm!
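To make the recipe concrete, here is a minimal coordinate-descent sketch (not from the lecture; the objective F is a made-up quadratic, and scipy's minimize_scalar stands in for the per-coordinate minimizers):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F(a, b):
    # Any objective bounded below works; this one is a toy quadratic.
    return (a - 2 * b) ** 2 + (b - 1) ** 2

a, b = 0.0, 0.0
for _ in range(50):
    a = minimize_scalar(lambda a_: F(a_, b)).x  # fix b, minimize over a
    b = minimize_scalar(lambda b_: F(a, b_)).x  # fix a, minimize over b
print(a, b, F(a, b))  # approaches the optimum at a=2, b=1
```

Each sweep can only decrease F, and since F is bounded below, the iterates converge to a point where neither coordinate can improve alone, i.e. a (possibly local) optimum.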
Expectation Maximization
Back to Unsupervised Learning of GMMs: a simple case
Remember: We have unlabeled data x_1, x_2, …, x_m. We know there are k classes. We know the priors P(y=1), P(y=2), …, P(y=k). We don't know μ_1, μ_2, …, μ_k.
We can write the likelihood of the data:
$$P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^m p(x_j \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^m \sum_{i=1}^k p(x_j \mid \mu_i)\, P(y=i) \propto \prod_{j=1}^m \sum_{i=1}^k \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
EM for simple case of GMMs: The E-step
If we know μ_1, …, μ_k, we can easily compute the probability that point x_j belongs to class y=i:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
EM for simple case of GMMs: The M-step
If we know the probability that point x_j belongs to class y=i, the MLE for μ_i is a weighted average: imagine k copies of each x_j, each with weight P(y=i|x_j):
$$\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}$$
E.M. for GMMs
E-step: Compute expected classes of all datapoints for each class:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
Just evaluate a Gaussian at x_j.
M-step: Compute max. likelihood μ given our data's class membership distributions:
$$\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}$$
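A minimal sketch of these two updates in Python, assuming the slides' simple case (the priors and a shared spherical σ are known; only the means are learned). The function name and the initialization at random data points are my own, not from the course:

```python
import numpy as np

def em_simple_gmm(X, priors, sigma, n_iters=100, seed=0):
    """EM for the simple case: known priors and shared spherical sigma;
    only the means mu_1..mu_k are learned."""
    priors = np.asarray(priors, dtype=float)
    m, d = X.shape
    k = len(priors)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]  # init means at data points
    for _ in range(n_iters):
        # E-step: P(y=i|x_j) proportional to exp(-||x_j-mu_i||^2/(2 sigma^2)) P(y=i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (m, k)
        logw = -sq / (2 * sigma ** 2) + np.log(priors)
        logw -= logw.max(axis=1, keepdims=True)                # numerical stability
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)                      # responsibilities
        # M-step: each mean is the responsibility-weighted average of the points
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu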
E.M. for General GMMs
Iterate. On the t-th iteration let our estimates be
λ_t = { μ_1^{(t)}, μ_2^{(t)} … μ_k^{(t)}, Σ_1^{(t)}, Σ_2^{(t)} … Σ_k^{(t)}, p_1^{(t)}, p_2^{(t)} … p_k^{(t)} }
(p_i^{(t)} is shorthand for the estimate of P(y=i) on the t-th iteration.)
E-step: Compute expected classes of all datapoints for each class:
$$P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p\big(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)}\big)$$
Just evaluate a Gaussian at x_j.
M-step: Compute max. likelihood parameters given our data's class membership distributions:
$$\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)}$$
$$\Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\,\big[x_j - \mu_i^{(t+1)}\big]\big[x_j - \mu_i^{(t+1)}\big]^T}{\sum_j P(y=i \mid x_j, \lambda_t)}$$
$$p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m}, \qquad m = \#\text{records}$$
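For the general case, the same loop also re-estimates covariances and mixing weights. A sketch under the same caveats (function name, the random initialization, and the small ridge term 1e-6·I added for numerical stability are my additions, not the course's reference code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a general GMM: learns means, covariances, and mixing weights."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: P(y=i|x_j, lambda_t) proportional to p_i * N(x_j; mu_i, Sigma_i)
        w = np.column_stack([
            p[i] * multivariate_normal.pdf(X, mu[i], Sigma[i]) for i in range(k)
        ])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted MLE for each component
        Ni = w.sum(axis=0)                      # effective count per class
        mu = (w.T @ X) / Ni[:, None]
        for i in range(k):
            Xc = X - mu[i]
            Sigma[i] = (w[:, i, None] * Xc).T @ Xc / Ni[i] + 1e-6 * np.eye(d)
        p = Ni / m                              # p_i^{(t+1)} = sum_j w_ji / m
    return mu, Sigma, p
```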
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Three classes of assay (each learned with its own mixture model)
Resulting Bayes Classifier
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous
The general learning problem with missing data
Marginal likelihood: x is observed, z is missing:
$$\ell(\theta; \mathcal{D}) = \log \prod_{j=1}^m P(x_j \mid \theta) = \sum_{j=1}^m \log P(x_j \mid \theta) = \sum_{j=1}^m \log \sum_z P(x_j, z \mid \theta)$$
E-step
x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: Q(z|x_j) for each x_j. This is the probability computed during the classification step, and it corresponds to the classification step in K-means.
Jensen's inequality
Theorem: $\log \sum_z P(z)\, f(z) \ge \sum_z P(z) \log f(z)$
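A quick numeric sanity check of the inequality (the distribution and function values below are arbitrary):

```python
import numpy as np

# Jensen for the concave log: log(sum_z P(z) f(z)) >= sum_z P(z) log f(z).
P = np.array([0.2, 0.5, 0.3])   # any distribution over z
f = np.array([1.0, 4.0, 0.5])   # any positive function of z
lhs = np.log(np.sum(P * f))
rhs = np.sum(P * np.log(f))
assert lhs >= rhs               # equality holds iff f is constant
print(lhs, rhs)
```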
Applying Jensen's inequality
Use $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$ with $P(z) = Q(z|x_j)$ and $f(z) = P(x_j, z \mid \theta)/Q(z|x_j)$:
$$\ell(\theta) = \sum_j \log \sum_z Q(z|x_j)\, \frac{P(x_j, z \mid \theta)}{Q(z|x_j)} \ \ge\ \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z|x_j)}$$
The M-step maximizes the lower bound on weighted data
Lower bound from Jensen's:
$$\ell(\theta) \ge \sum_j \sum_z Q^{(t+1)}(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q^{(t+1)}(z|x_j)}$$
This corresponds to a weighted dataset:
⟨x_1, z=1⟩ with weight Q^{(t+1)}(z=1|x_1)
⟨x_1, z=2⟩ with weight Q^{(t+1)}(z=2|x_1)
⟨x_1, z=3⟩ with weight Q^{(t+1)}(z=3|x_1)
⟨x_2, z=1⟩ with weight Q^{(t+1)}(z=1|x_2)
⟨x_2, z=2⟩ with weight Q^{(t+1)}(z=2|x_2)
⟨x_2, z=3⟩ with weight Q^{(t+1)}(z=3|x_2)
The M-step
Maximization step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_j \sum_z Q^{(t+1)}(z|x_j) \log P(x_j, z \mid \theta)$$
Use expected counts instead of counts: if learning requires Count(x,z), use $E_{Q^{(t+1)}}[\text{Count}(x,z)]$. A small sketch follows.
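A sketch of what "expected counts" means for a discrete x (the setup is hypothetical, not from the slides): each datapoint contributes its posterior Q(z|x_j) to the count table instead of a hard count of 1.

```python
import numpy as np

def expected_counts(xs, Q, n_symbols):
    """xs[j] is the symbol observed for point j; Q[j, i] = Q(z=i | x_j)."""
    m, k = Q.shape
    counts = np.zeros((n_symbols, k))
    for j, x in enumerate(xs):
        counts[x] += Q[j]   # soft count: Q(z|x_j) instead of 1
    return counts           # E_Q[Count(x, z)]

xs = [0, 1, 1, 2]
Q = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]])
print(expected_counts(xs, Q, n_symbols=3))
```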
Convergence of EM
Define the potential function F(θ,Q):
$$F(\theta, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z|x_j)}$$
EM corresponds to coordinate ascent on F. Thus, it maximizes a lower bound on the marginal log likelihood.
M-step is easy
Using the potential function:
$$\max_\theta F(\theta, Q^{(t+1)}) = \max_\theta \sum_j \sum_z Q^{(t+1)}(z|x_j) \log P(x_j, z \mid \theta)$$
(the entropy term $-\sum_j \sum_z Q^{(t+1)}(z|x_j) \log Q^{(t+1)}(z|x_j)$ does not depend on θ), which is exactly the weighted-data maximization of the M-step.
E-step also doesn't decrease the potential function (part 1)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(x_j, z \mid \theta^{(t)})}{Q(z|x_j)} = \sum_j \sum_z Q(z|x_j) \log \frac{P(z \mid x_j, \theta^{(t)})\, P(x_j \mid \theta^{(t)})}{Q(z|x_j)}$$
KL-divergence
$$KL\big(Q \,\|\, P\big) = \sum_z Q(z) \log \frac{Q(z)}{P(z)}$$
Measures a (non-symmetric) "distance" between distributions. KL is zero if and only if Q=P.
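A tiny illustration with arbitrary distributions:

```python
import numpy as np

def kl(Q, P):
    # KL(Q || P) = sum_z Q(z) log(Q(z)/P(z)); assumes strictly positive entries
    return np.sum(Q * np.log(Q / P))

Q = np.array([0.4, 0.6])
P = np.array([0.5, 0.5])
print(kl(Q, P))   # > 0
print(kl(P, P))   # = 0, since the distributions are identical
```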
E-step also doesn't decrease the potential function (part 2)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \sum_j \sum_z Q(z|x_j) \log \frac{P(z \mid x_j, \theta^{(t)})\, P(x_j \mid \theta^{(t)})}{Q(z|x_j)} = \sum_j \log P(x_j \mid \theta^{(t)}) - \sum_j KL\big(Q(z|x_j) \,\|\, P(z \mid x_j, \theta^{(t)})\big)$$
E-step also doesn't decrease the potential function (part 3)
Fixing θ to θ^{(t)}:
$$F(\theta^{(t)}, Q) = \ell(\theta^{(t)}) - \sum_j KL\big(Q(z|x_j) \,\|\, P(z \mid x_j, \theta^{(t)})\big)$$
Maximizing F(θ^{(t)},Q) over Q amounts to minimizing the KL term, so set Q to the posterior probability:
$$Q^{(t+1)}(z|x_j) = P(z \mid x_j, \theta^{(t)})$$
Note that $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)})$.
EM is coordinate ascent
M-step: Fix Q, maximize F over θ (a lower bound on ℓ(θ)):
$$\theta^{(t+1)} \leftarrow \arg\max_\theta F(\theta, Q^{(t+1)})$$
E-step: Fix θ, maximize F over Q:
$$Q^{(t+1)}(z|x_j) \leftarrow P(z \mid x_j, \theta^{(t)})$$
Realigns F with the likelihood: $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)})$.
What you should know
K-means for clustering: the algorithm, and that it converges because it's coordinate ascent.
EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember: E.M. can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.
The general case for EM.
Acknowledgements
K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/
K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html
Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html
EM for HMMs, a.k.a. The Baum-Welch Algorithm
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, April 10th, 2006
Learning HMMs from fully observable data is easy
[HMM diagram: hidden states X_1 … X_5, each over the alphabet {a, …, z}, with observations O_1 … O_5]
Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t)
Learning HMMs from fully observable data is easy
[Same HMM diagram: hidden states X_1 … X_5 over {a, …, z}, observations O_1 … O_5]
Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t)
What if O is observed, but X is hidden?
Log likelihood for HMMs with hidden X
Marginal likelihood: O is observed, X is missing. For simplicity of notation, we'll assume the training data consists of only one sequence:
$$\ell(\theta; o) = \log P(o \mid \theta) = \log \sum_x P(o, x \mid \theta)$$
If there were m sequences:
$$\ell(\theta; \mathcal{D}) = \sum_{j=1}^m \log \sum_x P(o_j, x \mid \theta)$$
E-step
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. This corresponds to inference: use the forwards-backwards algorithm!
The M-step
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
Maximization step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_x Q^{(t+1)}(x \mid o) \log P(x, o \mid \theta)$$
Use expected counts instead of counts: if learning requires Count(x,o), use $E_{Q^{(t+1)}}[\text{Count}(x,o)]$.
Starting state probability P(X_1)
Using expected counts (for our single training sequence, the expected count of starting in a is just the posterior at position 1):
$$P(X_1 = a) = \theta_{X_1 = a} = Q(X_1 = a \mid o)$$
Transition probability P(X_{t+1} | X_t)
Using expected counts:
$$P(X_{t+1} = a \mid X_t = b) = \theta_{X_{t+1}=a \mid X_t=b} = \frac{\sum_{t=1}^{n-1} Q(X_{t+1}=a, X_t=b \mid o)}{\sum_{t=1}^{n-1} Q(X_t=b \mid o)}$$
Observation probability P(O_t | X_t)
Using expected counts:
$$P(O_t = a \mid X_t = b) = \theta_{O_t=a \mid X_t=b} = \frac{\sum_{t=1}^{n} Q(X_t=b \mid o)\, \mathbb{1}[o_t = a]}{\sum_{t=1}^{n} Q(X_t=b \mid o)}$$
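The three updates above, collected into one sketch (single training sequence; the array names gamma, xi, obs are my own, with gamma and xi assumed to come from the E-step on the next slides):

```python
import numpy as np

def baum_welch_m_step(gamma, xi, obs, n_obs_symbols):
    """M-step from expected counts.
    gamma[t, a] = Q(x_t = a | o)              -- from forwards-backwards
    xi[t, b, a] = Q(x_t = b, x_{t+1} = a | o) -- pairwise posteriors
    obs[t]      = index of the observed symbol o_t
    """
    n, S = gamma.shape
    pi = gamma[0]                                         # P(X_1 = a)
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # P(X_{t+1}=a | X_t=b)
    B = np.zeros((S, n_obs_symbols))
    for t in range(n):
        B[:, obs[t]] += gamma[t]                          # soft emission counts
    B /= gamma.sum(axis=0)[:, None]                       # P(O_t=a | X_t=b)
    return pi, A, B
```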
E-step revisited
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. Must compute:
Q(x_t=a | o): marginal probability of each position
Q(x_{t+1}=a, x_t=b | o): joint distribution between pairs of positions
The forwards-backwards algorithm
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
Forwards pass. Initialization: $\alpha_1(x_1) = P(x_1)\, P(o_1 \mid x_1)$. For i = 2 to n, generate a forwards factor by eliminating X_{i-1}:
$$\alpha_i(x_i) = P(o_i \mid x_i) \sum_{x_{i-1}} \alpha_{i-1}(x_{i-1})\, P(x_i \mid x_{i-1})$$
Backwards pass. Initialization: $\beta_n(x_n) = 1$. For i = n-1 to 1, generate a backwards factor by eliminating X_{i+1}:
$$\beta_i(x_i) = \sum_{x_{i+1}} P(x_{i+1} \mid x_i)\, P(o_{i+1} \mid x_{i+1})\, \beta_{i+1}(x_{i+1})$$
∀i, the probability is: $Q(x_i \mid o) \propto \alpha_i(x_i)\, \beta_i(x_i)$
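A sketch of the two passes for the per-position marginals (the matrix names pi, A, B are my own; for long sequences the factors should be rescaled to avoid underflow, which this sketch omits):

```python
import numpy as np

def forwards_backwards(pi, A, B, obs):
    """pi[a] = P(X_1=a), A[b, a] = P(X_{i+1}=a | X_i=b), B[a, v] = P(O_i=v | X_i=a)."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))
    alpha[0] = pi * B[:, obs[0]]                 # forwards initialization
    for i in range(1, n):                        # eliminate X_{i-1}
        alpha[i] = B[:, obs[i]] * (alpha[i - 1] @ A)
    for i in range(n - 2, -1, -1):               # eliminate X_{i+1}
        beta[i] = A @ (B[:, obs[i + 1]] * beta[i + 1])
    marginals = alpha * beta
    marginals /= marginals.sum(axis=1, keepdims=True)  # Q(x_i = a | o), all i
    return alpha, beta, marginals
```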
E-step revisited
[HMM diagram: hidden states X_1 … X_5, observations O_1 … O_5]
The E-step computes the probability of the hidden vars x given o. Must compute:
Q(x_t=a | o): marginal probability of each position. Just forwards-backwards!
Q(x_{t+1}=a, x_t=b | o): joint distribution between pairs of positions. Homework!