Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University


1 Expectation Maximization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University April 10th

2 Announcements. Reminder: Project milestone due Wednesday, beginning of class.

3 Coordinate descent algorithms. Want: min_a min_b F(a,b). Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges!!! (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). K-means is a coordinate descent algorithm!
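To make the alternating minimization concrete, here is a tiny self-contained sketch; the convex quadratic objective F(a,b) = a^2 + b^2 + ab - a - b, the starting point, and the stopping rule are our own illustration, not from the lecture. Each coordinate update is the closed-form one-dimensional minimizer:

```python
import numpy as np

# Toy objective (ours, not from the slides): convex and bounded below,
# so coordinate descent converges to its global minimum here.
def F(a, b):
    return a**2 + b**2 + a*b - a - b

a, b = 5.0, -3.0  # arbitrary starting point
for it in range(100):
    a_new = (1.0 - b) / 2.0      # argmin_a F(a,b): solve dF/da = 2a + b - 1 = 0
    b_new = (1.0 - a_new) / 2.0  # argmin_b F(a,b): solve dF/db = 2b + a - 1 = 0
    if max(abs(a_new - a), abs(b_new - b)) < 1e-10:  # converged
        break
    a, b = a_new, b_new

print(a, b, F(a, b))  # lands at a = b = 1/3, the minimizer of this F
```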

4 Expectation Maximization

5 Back to unsupervised learning of GMMs: a simple case. Remember: we have unlabeled data x_1, x_2, ..., x_m; we know there are k classes; we know the priors P(y=1), P(y=2), ..., P(y=k); we don't know μ_1, μ_2, ..., μ_k. We can write the likelihood of the data:

P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k)
= \prod_{j=1}^m p(x_j \mid \mu_1 \ldots \mu_k)
= \prod_{j=1}^m \sum_{i=1}^k p(x_j \mid \mu_i) P(y=i)
\propto \prod_{j=1}^m \sum_{i=1}^k \exp\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)

6 EM for the simple case of GMMs: the E-step. If we know μ_1, ..., μ_k, we can easily compute the probability that point x_j belongs to class y = i:

P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)

7 EM for the simple case of GMMs: the M-step. If we know the probability that point x_j belongs to class y = i, the MLE for μ_i is a weighted average: imagine k copies of each x_j, each with weight P(y=i | x_j):

\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}

8 E.M. for GMMs. E-step: compute the expected classes of all datapoints for each class (just evaluate a Gaussian at x_j):

P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)

M-step: compute the maximum-likelihood μ given our data's class membership distributions:

\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}
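A minimal numpy sketch of these two alternating steps for the simple case (known priors P(y=i), a shared known spherical σ, learning the means only); the synthetic data, initialization, and stopping rule are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from k = 2 classes (for illustration only).
k, sigma = 2, 1.0
prior = np.array([0.5, 0.5])                     # P(y = i), assumed known
x = np.concatenate([rng.normal(-3, sigma, 100),  # true mu_1 = -3
                    rng.normal(+3, sigma, 100)]) # true mu_2 = +3

mu = rng.normal(0, 1, k)  # random initial means
for t in range(100):
    # E-step: P(y=i | x_j) ∝ exp(-||x_j - mu_i||^2 / (2 sigma^2)) P(y=i)
    logits = -0.5 * (x[:, None] - mu[None, :])**2 / sigma**2 + np.log(prior)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)      # m x k responsibilities

    # M-step: weighted average of the points, one mean per class
    mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    if np.max(np.abs(mu_new - mu)) < 1e-8:
        break
    mu = mu_new

print(mu)  # should land near (-3, +3), up to label swap
```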

9 E.M. for general GMMs. Iterate. On the t-th iteration let our estimates be

\lambda_t = \{ \mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_k^{(t)},\ \Sigma_1^{(t)}, \Sigma_2^{(t)}, \ldots, \Sigma_k^{(t)},\ p_1^{(t)}, p_2^{(t)}, \ldots, p_k^{(t)} \}

where p_i^{(t)} is shorthand for the estimate of P(y=i) on the t-th iteration.

E-step: compute the expected classes of all datapoints for each class (just evaluate a Gaussian at x_j):

P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})

M-step: compute the maximum-likelihood parameters given our data's class membership distributions:

\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)}

\Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, [x_j - \mu_i^{(t+1)}][x_j - \mu_i^{(t+1)}]^T}{\sum_j P(y=i \mid x_j, \lambda_t)}

p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m}, \quad m = \#\text{records}
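For the general case, a hedged sketch of the full updates above (means, covariances, and mixing weights); the initialization scheme, regularization constant, and function names are our own choices, not from the lecture:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov); x has shape (m, d)."""
    d = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(cov)
    quad = np.einsum('md,de,me->m', diff, inv, diff)  # (x-mu)^T inv (x-mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def em_gmm(x, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m, d = x.shape
    # lambda_t = (mu, Sigma, p): means, covariances, mixing weights
    mu = x[rng.choice(m, k, replace=False)]   # init means at random points
    cov = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * k)
    p = np.full(k, 1.0 / k)
    for t in range(iters):
        # E-step: P(y=i | x_j, lambda_t) ∝ p_i * N(x_j | mu_i, Sigma_i)
        resp = np.stack([p[i] * gaussian_pdf(x, mu[i], cov[i])
                         for i in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted MLE for each class
        w = resp.sum(axis=0)                  # effective counts per class
        mu = (resp.T @ x) / w[:, None]
        for i in range(k):
            diff = x - mu[i]
            cov[i] = (resp[:, i, None] * diff).T @ diff / w[i]
            cov[i] += 1e-6 * np.eye(d)        # regularize for stability
        p = w / m
    return mu, cov, p
```

Usage: mu, cov, p = em_gmm(x, k=3) for an (m, d) array x of datapoints.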

10 Gaussian Mixture Example: Start

11 After first iteration

12 After 2nd iteration

13 After 3rd iteration

14 After 4th iteration

15 After 5th iteration

16 After 6th iteration

17 After 20th iteration

18 Some Bio Assay data

19 GMM clustering of the assay data

20 Resulting Density Estimator

21 Three classes of assay (each learned with its own mixture model)

22 Resulting Bayes Classifier

23 Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness. Yellow means anomalous; cyan means ambiguous.

24 The general learning problem with missing data. Marginal likelihood: x is observed, z is missing:

\ell(\theta) = \sum_{j=1}^m \log P(x_j \mid \theta) = \sum_{j=1}^m \log \sum_z P(x_j, z \mid \theta)

25 E-step. x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: Q(z | x_j) for each x_j (e.g., the probability computed during the classification step; this corresponds to the classification step in K-means).

26 Jensen's inequality. Theorem: \log \sum_z P(z) f(z) \ \ge\ \sum_z P(z) \log f(z)
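To see the inequality concretely, here is a quick numeric check; the distribution and function below are arbitrary choices of ours, not from the slides:

```python
import numpy as np

# Arbitrary illustrative choices: a distribution P over 3 values of z,
# and a positive function f(z). Jensen: log E[f] >= E[log f].
P = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, 4.0, 9.0])

lhs = np.log(np.sum(P * f))   # log sum_z P(z) f(z)
rhs = np.sum(P * np.log(f))   # sum_z P(z) log f(z)
print(lhs, rhs, lhs >= rhs)   # lhs >= rhs always holds (log is concave)
```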

27 Applying Jensen's inequality. Use: \log \sum_z P(z) f(z) \ \ge\ \sum_z P(z) \log f(z)

28 The M-step maximizes a lower bound on weighted data. Lower bound from Jensen's. Corresponds to the weighted dataset:
⟨x_1, z=1⟩ with weight Q^{(t+1)}(z=1 | x_1)
⟨x_1, z=2⟩ with weight Q^{(t+1)}(z=2 | x_1)
⟨x_1, z=3⟩ with weight Q^{(t+1)}(z=3 | x_1)
⟨x_2, z=1⟩ with weight Q^{(t+1)}(z=1 | x_2)
⟨x_2, z=2⟩ with weight Q^{(t+1)}(z=2 | x_2)
⟨x_2, z=3⟩ with weight Q^{(t+1)}(z=3 | x_2)

29 The M-step. Maximization step: use expected counts instead of counts. If learning requires Count(x, z), use E_{Q^{(t+1)}}[Count(x, z)].

30 Convergence of EM. Define the potential function F(θ, Q):

F(\theta, Q) = \sum_{j=1}^m \sum_z Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}

EM corresponds to coordinate ascent on F. Thus, EM maximizes a lower bound on the marginal log likelihood.
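As a numeric sanity check of these claims (the tiny model, data, and choices of Q below are our own, and F(θ, Q) is evaluated exactly as defined above): F never exceeds the marginal log likelihood, and setting Q to the posterior makes the bound tight:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 20)                      # tiny 1-D dataset (ours)
mu, prior = np.array([-1.0, 2.0]), np.array([0.4, 0.6])

# Joint P(x_j, z | theta) for a 2-component unit-variance mixture
gauss = np.exp(-0.5 * (x[:, None] - mu[None, :])**2) / np.sqrt(2 * np.pi)
pxz = prior[None, :] * gauss                  # shape (m, 2)
loglik = np.log(pxz.sum(axis=1)).sum()        # marginal log likelihood

def F(Q):  # F(theta, Q) = sum_j sum_z Q(z|x_j) log [P(x_j,z|theta)/Q(z|x_j)]
    return np.sum(Q * (np.log(pxz) - np.log(Q)))

Q_arbitrary = np.full((len(x), 2), 0.5)       # some distribution over z
Q_posterior = pxz / pxz.sum(axis=1, keepdims=True)

print(F(Q_arbitrary) <= loglik)               # True: F lower-bounds loglik
print(np.isclose(F(Q_posterior), loglik))     # True: the E-step makes it tight
```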

31 M-step is easy: using the potential function, fix Q and maximize F(θ, Q) over θ (the weighted-data maximization above).

32 E-step also doesn't decrease potential function (1). Fixing θ to θ^{(t)}:

33 KL-divergence. Measures a distance between distributions:

KL(Q \,\|\, P) = \sum_z Q(z) \log \frac{Q(z)}{P(z)} \ \ge\ 0

KL is zero if and only if Q = P.
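A small sketch of the computation (assuming strictly positive distributions; the example numbers are ours):

```python
import numpy as np

def kl(Q, P):
    """KL(Q || P) = sum_z Q(z) log(Q(z)/P(z)); assumes P, Q strictly positive."""
    return np.sum(Q * np.log(Q / P))

P = np.array([0.1, 0.6, 0.3])
print(kl(P, P))                          # 0.0: KL is zero iff Q = P
print(kl(np.array([0.3, 0.3, 0.4]), P))  # > 0 for any other Q
print(kl(P, np.array([0.3, 0.3, 0.4])))  # note: not symmetric, so not a metric
```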

34 E-step also doesn't decrease potential function (2). Fixing θ to θ^{(t)}:

35 E-step also doesn't decrease potential function (3). Fixing θ to θ^{(t)} and maximizing F(θ^{(t)}, Q) over Q: set Q to the posterior probability:

Q^{(t+1)}(z \mid x_j) = P(z \mid x_j, \theta^{(t)})

Note that F(θ^{(t)}, Q^{(t+1)}) then equals the marginal log likelihood.

36 EM is coordinate ascent. M-step: fix Q, maximize F over θ (a lower bound on the marginal log likelihood). E-step: fix θ, maximize F over Q; this realigns F with the likelihood: F(θ^{(t)}, Q^{(t+1)}) = ℓ(θ^{(t)}).

37 What you should know. K-means for clustering: the algorithm converges because it's coordinate ascent. EM for mixtures of Gaussians: how to learn maximum-likelihood parameters (locally maximum likelihood) in the case of unlabeled data. Be happy with this kind of probabilistic analysis. Remember, E.M. can get stuck in local optima, and empirically it DOES. EM is coordinate ascent. The general case for EM.

38 Acknowledgements. The K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore. K-means applet: torial_html/appletkm.html Gaussian mixture models applet: html

39 EM for HMMs, a.k.a. the Baum-Welch algorithm. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University April 10th

40 Learning HMMs from fully observable data is easy. Hidden states X_1, ..., X_5 ∈ {a, ..., z}, with observations O_1, ..., O_5. Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t).

41 Learning HMMs from fully observable data is easy. Hidden states X_1, ..., X_5 ∈ {a, ..., z}, with observations O_1, ..., O_5. Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t). What if O is observed, but X is hidden?

42 Log likelihood for HMMs with hidden X. Marginal likelihood: O is observed, X is missing. For simplicity of notation, we'll consider training data consisting of only one sequence:

\ell(\theta) = \log P(o \mid \theta) = \log \sum_x P(x, o \mid \theta)

If there were m sequences: \ell(\theta) = \sum_{j=1}^m \log \sum_x P(x, o_j \mid \theta)

43 E-step. The E-step computes the probability of the hidden vars x given o. This will correspond to inference: use the forward-backward algorithm!

44 The M-step. Maximization step: use expected counts instead of counts. If learning requires Count(x, o), use E_{Q^{(t+1)}}[Count(x, o)].

45 Starting state probability P(X_1). Using expected counts (for one training sequence):

P(X_1 = a) = \theta_{X_1 = a} = Q(X_1 = a \mid o)

46 Transition probability P(X_{t+1} | X_t). Using expected counts:

P(X_{t+1} = a \mid X_t = b) = \theta_{X_{t+1}=a \mid X_t=b} = \frac{\sum_{t=1}^{n-1} Q(X_t = b, X_{t+1} = a \mid o)}{\sum_{t=1}^{n-1} Q(X_t = b \mid o)}

47 Observation probability P(O_t | X_t). Using expected counts:

P(O_t = a \mid X_t = b) = \theta_{O_t=a \mid X_t=b} = \frac{\sum_{t=1}^{n} Q(X_t = b \mid o)\,\mathbb{1}[o_t = a]}{\sum_{t=1}^{n} Q(X_t = b \mid o)}
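Pulling slides 45-47 together, a hedged sketch of the M-step from expected counts for a single training sequence; the names gamma and xi for the E-step posteriors are our own convention (they are produced by the forwards-backwards sketch after slide 49):

```python
import numpy as np

def hmm_m_step(gamma, xi, obs, n_states, n_obs):
    """Baum-Welch M-step from expected counts, for one training sequence.

    gamma[t, a]  = Q(x_t = a | o)               (T x S, from forward-backward)
    xi[t, b, a]  = Q(x_t = b, x_{t+1} = a | o)  ((T-1) x S x S)
    obs[t]       = index of the observed symbol o_t
    """
    T = len(obs)
    # Starting state: theta_{X1=a} = Q(x_1 = a | o)
    start = gamma[0] / gamma[0].sum()
    # Transition: expected b->a transitions / expected visits to b (t < T)
    trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # Observation: expected visits to b emitting a / expected visits to b
    emit = np.zeros((n_states, n_obs))
    for t in range(T):
        emit[:, obs[t]] += gamma[t]
    emit /= gamma.sum(axis=0)[:, None]
    return start, trans, emit
```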

48 E-step revisited. The E-step computes the probability of the hidden vars x given o. Must compute: Q(x_t = a | o), the marginal probability of each position; and Q(x_{t+1} = a, x_t = b | o), the joint distribution over pairs of adjacent positions.

49 The forwards-backwards algorithm.
Forwards pass: initialization; then for i = 2 to n, generate a forwards factor by eliminating X_{i-1}.
Backwards pass: initialization; then for i = n-1 to 1, generate a backwards factor by eliminating X_{i+1}.
∀i, the probability is: Q(x_i | o) ∝ (forwards factor at i) × (backwards factor at i).
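A compact numpy sketch of the two passes, with per-step rescaling for numerical stability; it returns both quantities the E-step needs. The parameter layout (start, trans, emit) is our own convention, not notation from the lecture:

```python
import numpy as np

def forward_backward(start, trans, emit, obs):
    """Compute Q(x_t = a | o) and Q(x_t = b, x_{t+1} = a | o).

    start[a]    = P(X_1 = a)
    trans[b, a] = P(X_{t+1} = a | X_t = b)
    emit[a, o]  = P(O_t = o | X_t = a)
    """
    T, S = len(obs), len(start)
    alpha = np.zeros((T, S))   # forwards factors (rescaled each step)
    beta = np.zeros((T, S))    # backwards factors

    alpha[0] = start * emit[:, obs[0]]           # initialization
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                        # eliminate X_{t-1}
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        alpha[t] /= alpha[t].sum()

    beta[T - 1] = 1.0                            # initialization
    for t in range(T - 2, -1, -1):               # eliminate X_{t+1}
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta                         # ∝ Q(x_t | o)
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((T - 1, S, S))                 # ∝ Q(x_t, x_{t+1} | o)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * trans
                 * (emit[:, obs[t + 1]] * beta[t + 1])[None, :])
        xi[t] /= xi[t].sum()
    return gamma, xi
```

Running hmm_m_step on the returned gamma and xi, then recomputing gamma and xi under the new parameters, is one full Baum-Welch (EM) iteration.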

50 E-step revisited. The E-step computes the probability of the hidden vars x given o. Must compute: Q(x_t = a | o), the marginal probability of each position (just forwards-backwards!); and Q(x_{t+1} = a, x_t = b | o), the joint distribution over pairs of adjacent positions (homework!).
