CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

2 Partially Observed GMs: Speech recognition

3 Partially Observed GMs: Evolution

4 Partially Observed GMs: Mixture models

5 Partially Observed GMs
A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population.

6 Unobserved Variables
A variable can be unobserved (latent) because:
1. It is an imaginary quantity meant to provide a simplified, abstract view of the data generation process (e.g., mixture models).
2. It is a real-world object and/or phenomenon that is difficult or impossible to measure (e.g., causes of a disease, evolutionary ancestors).
3. It is a real-world object and/or phenomenon that sometimes wasn't measured (e.g., due to faulty sensors).
Discrete latent variables can be used to cluster data into subgroups. Continuous latent variables (factors) can be used for dimensionality reduction.

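7 Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components: $p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$, where $\pi_k$ are the mixing proportions. This model can be used for unsupervised clustering.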

8 Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components, where Z is a latent class indicator vector and X is a conditional Gaussian variable with a class-specific mean and covariance. The likelihood of a sample marginalizes over Z.
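
A sketch of this model in standard notation. The symbols $\pi_k$ (mixing proportions) and $z_n^k$ (the k-th entry of the indicator vector $z_n$) are assumptions of this note rather than notation confirmed by the slide:

```latex
\begin{align*}
p(z_n) &= \prod_{k} \pi_k^{z_n^k}, \\
p(x_n \mid z_n^k = 1, \mu, \Sigma) &= \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \\
p(x_n \mid \mu, \Sigma) &= \sum_{k} p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma)
 = \sum_{k} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
\end{align*}
```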

9 Why is learning with latent variables harder?
In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms (at least for Bayes nets). With latent variables, all parameters become coupled via marginalization.
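
To make the contrast concrete, here is a sketch for the Gaussian mixture case, assuming a dataset $D = \{x_n\}$ with indicators $z_n$ and parameters $\theta = (\pi, \mu, \Sigma)$. With $z$ observed the log splits into separate terms; with $z$ marginalized out the log of a sum does not split:

```latex
\begin{align*}
\text{fully observed:}\quad \ell_c(\theta; D) &= \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \Sigma), \\
\text{latent } z:\quad \ell(\theta; D) &= \sum_n \log \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma).
\end{align*}
```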

10 Towards Expectation-Maximization
Let's start from the MLE for completely observed data: the data log-likelihood and its maximizer. What if we do not know $z_n$?
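
A minimal sketch of the fully observed case, assuming one-dimensional data and integer class labels (both simplifications of this note). The function name `mle_fully_observed` is hypothetical. When every $z_n$ is known, the MLE reduces to per-class counts, means, and variances:

```python
from collections import defaultdict


def mle_fully_observed(xs, zs):
    """MLE of a 1-D Gaussian mixture when every label z_n is observed."""
    groups = defaultdict(list)
    for x, z in zip(xs, zs):
        groups[z].append(x)
    n = len(xs)
    params = {}
    for k, pts in groups.items():
        pi_k = len(pts) / n                                    # fraction of points in class k
        mu_k = sum(pts) / len(pts)                             # class mean
        var_k = sum((x - mu_k) ** 2 for x in pts) / len(pts)   # ML (biased) variance
        params[k] = (pi_k, mu_k, var_k)
    return params


xs = [0.0, 2.0, 10.0, 12.0, 14.0]
zs = [0, 0, 1, 1, 1]
params = mle_fully_observed(xs, zs)
```

Each class is fitted independently here, which is exactly the decomposition that is lost once the labels are hidden.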

11 Towards Expectation-Maximization
Let's solve this problem using Expectation-Maximization.
Expectation: What do we take the expectation with? What do we take the expectation over?
Maximization: What do we maximize? What do we maximize with respect to?

12 Back to Basics: K-Means
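
As a reminder of the algorithm, here is a minimal K-means sketch on one-dimensional data. The function name `kmeans_1d`, the fixed iteration count, and the toy data are assumptions of this note:

```python
def kmeans_1d(xs, centroids, n_iters=20):
    """Plain K-means on scalars: alternate hard assignment and centroid update."""
    centroids = list(centroids)
    assign = []
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(range(len(centroids)), key=lambda k: (x - centroids[k]) ** 2)
            for x in xs
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(len(centroids)):
            members = [x for x, a in zip(xs, assign) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = sum(members) / len(members)
    return centroids, assign


centroids, assign = kmeans_1d([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], centroids=[0.0, 1.0])
```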

13 Expectation-Maximization
Start: Guess the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters.
Loop: alternate the E-step and the M-step until convergence.

15 Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components, its likelihood, and the expected complete log-likelihood.
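
A sketch of the expected complete log-likelihood for this mixture in standard notation. The indicator entries $z_n^k$, the mixing proportions $\pi_k$, and the constant $C$ (which absorbs the $d \log 2\pi$ term) are assumptions of this note:

```latex
\begin{align*}
\langle \ell_c(\theta; x, z) \rangle
&= \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)}
 + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)} \\
&= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k
 - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle
   \left( (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \right).
\end{align*}
```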

16 E-step
We maximize $\langle \ell_c(\theta) \rangle$ using the following iterative procedure.
Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and μ). Here we are doing inference.
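
A minimal sketch of this step for a one-dimensional mixture (scalar variances instead of covariance matrices, an assumption of this note). The names `normal_pdf`, `e_step`, and `taus` are hypothetical. Each row of the result is the posterior $p(z_n = k \mid x_n)$ under the current parameters:

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, pis, mus, variances):
    """Return taus[n][k] = p(z_n = k | x_n, current parameters)."""
    taus = []
    for x in xs:
        # Joint p(z_n = k, x_n) for every component k.
        joint = [pi * normal_pdf(x, mu, var) for pi, mu, var in zip(pis, mus, variances)]
        total = sum(joint)  # marginal p(x_n); assumed not to underflow to zero here
        taus.append([j / total for j in joint])
    return taus


taus = e_step([0.0, 5.0, 10.0], pis=[0.5, 0.5], mus=[0.0, 10.0], variances=[4.0, 4.0])
```

The point halfway between the two means receives equal responsibility from both components, while the points at the means are assigned almost entirely to their own component.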

17 M-step
We maximize $\langle \ell_c(\theta) \rangle$ using the following iterative procedure.
Maximization step: compute the parameters given the current expected values of the hidden variables. This is isomorphic to MLE, except that the hidden variables are replaced by their expectations.
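
A sketch of the M-step and the resulting EM loop for a one-dimensional mixture. The scalar variances, the small variance floor, the function names, and the toy data are all assumptions of this note. The updates are the fully observed MLE formulas with hard counts replaced by responsibilities:

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, pis, mus, variances):
    taus = []
    for x in xs:
        joint = [pi * normal_pdf(x, mu, var) for pi, mu, var in zip(pis, mus, variances)]
        total = sum(joint)
        taus.append([j / total for j in joint])
    return taus


def m_step(xs, taus):
    """Weighted MLE: responsibilities taus[n][k] play the role of soft counts."""
    n, num_k = len(xs), len(taus[0])
    pis, mus, variances = [], [], []
    for k in range(num_k):
        nk = sum(t[k] for t in taus)                      # expected count of class k
        mu = sum(t[k] * x for t, x in zip(taus, xs)) / nk
        var = sum(t[k] * (x - mu) ** 2 for t, x in zip(taus, xs)) / nk
        pis.append(nk / n)
        mus.append(mu)
        variances.append(max(var, 1e-6))                  # assumed floor against collapse
    return pis, mus, variances


def log_likelihood(xs, pis, mus, variances):
    return sum(
        math.log(sum(pi * normal_pdf(x, mu, var) for pi, mu, var in zip(pis, mus, variances)))
        for x in xs
    )


def em(xs, pis, mus, variances, n_iters=50):
    history = [log_likelihood(xs, pis, mus, variances)]
    for _ in range(n_iters):
        taus = e_step(xs, pis, mus, variances)
        pis, mus, variances = m_step(xs, taus)
        history.append(log_likelihood(xs, pis, mus, variances))
    return pis, mus, variances, history


xs = [-0.5, 0.0, 0.4, 0.9, 9.2, 10.1, 10.6, 11.0]
pis, mus, variances, history = em(xs, [0.5, 0.5], [1.0, 8.0], [4.0, 4.0])
```

The `history` list records the incomplete log-likelihood after every iteration, which is a convenient way to check that it never decreases.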

18 K-Means vs EM
The EM algorithm for mixtures of Gaussians is like a soft version of the K-means algorithm.

19 Complete and Incomplete Log Likelihoods
Complete log likelihood: Let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, MLE for the fully observed model would be straightforward. When Z is not observed, the complete log-likelihood is a random quantity and cannot be maximized directly.
Incomplete log likelihood: With z unobservable, our objective becomes the log of a marginal probability.
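
In symbols, writing $\ell_c$ for the complete and $\ell$ for the incomplete log-likelihood (a sketch in standard notation, with the sum replaced by an integral for continuous z):

```latex
\begin{align*}
\ell_c(\theta; x, z) &= \log p(x, z \mid \theta), \\
\ell(\theta; x) &= \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta).
\end{align*}
```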

20 Expected Complete Log Likelihood
For any distribution q(z), define the expected complete log-likelihood. This is a deterministic function of θ, and it factorizes in the same way as $\ell_c$. Does maximizing this yield a maximizer of the likelihood? Jensen's inequality provides the link.
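
A sketch of the argument in standard notation, where $H_q = -\sum_z q(z \mid x) \log q(z \mid x)$ denotes the entropy of q (a symbol assumed by this note). Jensen's inequality is applied to the concave logarithm:

```latex
\begin{align*}
\langle \ell_c(\theta; x, z) \rangle_q &= \sum_z q(z \mid x) \log p(x, z \mid \theta), \\
\ell(\theta; x) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}
&\ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
 = \langle \ell_c(\theta; x, z) \rangle_q + H_q.
\end{align*}
```

So the expected complete log-likelihood plus an entropy term that does not depend on θ is a lower bound on the likelihood.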

21 Free Energy
For fixed data x, define the free energy F(q, θ). The EM algorithm is coordinate ascent on F.
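
A sketch of the definition and the two coordinate updates in standard notation, with the iteration index t as a superscript (an assumption of this note):

```latex
\begin{align*}
F(q, \theta) &= \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le \ell(\theta; x), \\
\text{E-step:}\quad q^{t+1} &= \arg\max_q F(q, \theta^t), \\
\text{M-step:}\quad \theta^{t+1} &= \arg\max_\theta F(q^{t+1}, \theta).
\end{align*}
```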

22 E-step: maximization of expected $\ell_c$ w.r.t. q
Claim: the maximizing q is $p(z \mid x, \theta^t)$, the posterior distribution over the latent variables given the data and the parameters.
Proof: this setting attains the bound $\ell(\theta; x) \ge F(q, \theta)$.
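
The claim can be checked numerically on a tiny example. The sketch below assumes a hypothetical model with one binary latent variable and a fixed observation, specified directly by its two joint probabilities. It evaluates the free energy at the posterior and at another distribution:

```python
import math

# Hypothetical toy model: p(x, z=0 | theta) and p(x, z=1 | theta) for the observed x.
joint = [0.12, 0.28]


def free_energy(q, joint):
    """F(q, theta) = sum_z q(z) * log(p(x, z | theta) / q(z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)


log_lik = math.log(sum(joint))                 # l(theta; x) = log p(x | theta)
posterior = [pz / sum(joint) for pz in joint]  # p(z | x, theta)

f_post = free_energy(posterior, joint)         # equals log_lik up to rounding: bound is tight
f_other = free_energy([0.5, 0.5], joint)       # strictly below log_lik
```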

23 E-step = plug in posterior expectation of latent variables
Without loss of generality, assume that $p(x, z \mid \theta)$ is a generalized exponential family distribution. The expected complete log-likelihood under the E-step posterior then depends on z only through expected sufficient statistics.
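
A sketch of this statement, assuming the usual exponential family notation: sufficient statistics $f_i$, base measure $h$, normalizer $Z(\theta)$, and log-normalizer $A(\theta) = \log Z(\theta)$. None of these symbols appear in the slide text itself. The constant collects $\langle \log h(x, z) \rangle$, which does not depend on θ:

```latex
\begin{align*}
p(x, z \mid \theta) &= \frac{1}{Z(\theta)}\, h(x, z) \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}, \\
\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}
&= \sum_i \theta_i\, \langle f_i(x, z) \rangle_{p(z \mid x, \theta^t)} - A(\theta) + \text{const}.
\end{align*}
```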

24 M-step: maximization of expected $\ell_c$ w.r.t. θ
The free energy breaks into two terms: the expected complete log-likelihood and the entropy of q, which does not depend on θ. Here q is the posterior distribution over the latent variables given the data and the current parameters. In the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term.
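
A sketch of the decomposition in standard notation, with $H_q$ denoting the entropy of q (a symbol assumed by this note):

```latex
\begin{align*}
F(q, \theta) &= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
 = \langle \ell_c(\theta; x, z) \rangle_q + H_q, \\
\theta^{t+1} &= \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}
 = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta).
\end{align*}
```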

25 Example: HMM
Supervised learning: estimation when the right answer is known. Given: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the right answer is unknown. Given: 10,000 rolls of the casino player, but we don't see when he changes dice.

26 HMM: dynamic mixture models

27 The Baum-Welch algorithm
The complete log-likelihood and the expected complete log-likelihood.

28 The Baum-Welch algorithm

29 Unsupervised ML estimation
Given $X = X_1 \dots X_n$ for which the true state path $Y = Y_1 \dots Y_n$ is unknown.
Expectation Maximization:
1. Start with our best guess of a model M, parameters θ.
2. Estimate the expected counts $A_{ij}$, $B_{ik}$ in the training data.
3. Update θ according to $A_{ij}$, $B_{ik}$ (we now have a supervised learning problem).
4. Repeat 2 & 3 until convergence.
This is the Baum-Welch algorithm.
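
The steps above can be sketched for a discrete HMM as follows. This is a minimal illustration, not the lecture's own code. It assumes a single observation sequence, a scaled forward-backward pass for step 2, and hypothetical names (`forward_backward`, `baum_welch_step`) and toy casino parameters:

```python
import math


def forward_backward(obs, init, trans, emit):
    """Scaled forward-backward pass; scale[t] = p(x_t | x_1..x_{t-1})."""
    T, S = len(obs), len(init)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [init[i] * emit[i][obs[0]] for i in range(S)]
        else:
            row = [emit[j][obs[t]] * sum(alpha[t - 1][i] * trans[i][j] for i in range(S))
                   for j in range(S)]
        c = sum(row)
        scale.append(c)
        alpha.append([a / c for a in row])
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(S)) / scale[t + 1]
    return alpha, beta, scale


def baum_welch_step(obs, init, trans, emit):
    """One EM iteration: expected counts A_ij, B_ik, then normalized updates."""
    T, S, V = len(obs), len(init), len(emit[0])
    alpha, beta, scale = forward_backward(obs, init, trans, emit)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(S)] for t in range(T)]
    A = [[0.0] * S for _ in range(S)]  # expected transition counts
    B = [[0.0] * V for _ in range(S)]  # expected emission counts
    for t in range(T - 1):
        for i in range(S):
            for j in range(S):
                A[i][j] += (alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]]
                            * beta[t + 1][j] / scale[t + 1])
    for t in range(T):
        for i in range(S):
            B[i][obs[t]] += gamma[t][i]
    new_trans = [[a / sum(row) for a in row] for row in A]
    new_emit = [[b / sum(row) for b in row] for row in B]
    loglik = sum(math.log(c) for c in scale)  # log-likelihood under the input parameters
    return gamma[0], new_trans, new_emit, loglik


# Toy casino: state 0 = fair die, state 1 = loaded die; faces are coded 0..5.
obs = [0, 5, 5, 5, 2, 5, 1, 5, 5, 3, 0, 4, 2, 1, 3, 0, 4, 5, 5, 5]
init, trans = [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]]
emit = [[1 / 6] * 6, [0.1] * 5 + [0.5]]
logliks = []
for _ in range(10):
    init, trans, emit, ll = baum_welch_step(obs, init, trans, emit)
    logliks.append(ll)
```

Scaling the forward and backward messages keeps the recursion numerically stable on long sequences such as the 10,000 rolls of the example. The sketch handles a single sequence, whereas a full implementation would accumulate counts over all training sequences.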

30 EM for general BNs

31 EM Algorithm
A way of maximizing likelihood for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken into two easy pieces:
1. Estimate some missing or unobserved data from observed data and current parameters.
2. Using this complete data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess.

32 Mixture of experts
We will model $P(Y \mid X)$ using different experts, each responsible for a different region of the input space. A latent variable Z chooses the expert using a softmax. Each expert can be a linear regression model.
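
One common way to write such a model is sketched below. The gating weights $\xi_k$, the regression weights $\theta_k$, and the noise variances $\sigma_k^2$ are notation assumed by this note:

```latex
\begin{align*}
P(z^k = 1 \mid x) &= \frac{\exp(\xi_k^\top x)}{\sum_j \exp(\xi_j^\top x)}, \\
p(y \mid x, z^k = 1) &= \mathcal{N}(y \mid \theta_k^\top x, \sigma_k^2), \\
P(y \mid x) &= \sum_k P(z^k = 1 \mid x)\, p(y \mid x, z^k = 1).
\end{align*}
```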

33 EM for mixture of experts
Model, objective function, and EM (E-step and M-step).
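
As an illustration of the E-step in this setting, the sketch below computes the posterior responsibility of each expert for a single (x, y) pair. It assumes scalar inputs, a softmax gate with one (slope, bias) pair per expert, and linear-Gaussian experts. The function names and all parameter values are hypothetical:

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def moe_e_step(x, y, gate_w, expert_w, expert_var):
    """Posterior p(z = k | x, y): gate prior times expert likelihood, normalized."""
    prior = softmax([a * x + b for a, b in gate_w])               # p(z = k | x)
    lik = [normal_pdf(y, a * x + b, v)                            # p(y | x, z = k)
           for (a, b), v in zip(expert_w, expert_var)]
    joint = [p * l for p, l in zip(prior, lik)]
    total = sum(joint)
    return [j / total for j in joint]


# Expert 0 models y = 2x, expert 1 models y = -x + 3; the gate prefers expert 1 for x > 0.
resp = moe_e_step(
    x=2.0, y=1.1,
    gate_w=[(-1.0, 0.0), (1.0, 0.0)],
    expert_w=[(2.0, 0.0), (-1.0, 3.0)],
    expert_var=[1.0, 1.0],
)
```

In an M-step, these responsibilities would weight a least-squares fit for each expert and a softmax regression for the gate. That part is omitted here.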

34 Partially Hidden Data
Of course we can learn when there are missing (hidden) variables in some cases and not in others. In this case, the data can have different missing values in each sample. In the E-step we estimate the hidden variables on the incomplete cases only. The M-step optimizes the log-likelihood on the complete data plus the expected log-likelihood on the incomplete data, using the E-step estimates.
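
A minimal sketch of such an E-step for a one-dimensional Gaussian mixture, assuming the only possibly missing variable is the class label and that a missing label is encoded as `None` (both assumptions of this note). The function names are hypothetical. Observed labels are clamped to one-hot vectors, and only the incomplete cases receive posterior responsibilities:

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step_partial(xs, zs, pis, mus, variances):
    """zs[n] is the observed label of x_n, or None when the label is missing."""
    num_k = len(pis)
    taus = []
    for x, z in zip(xs, zs):
        if z is not None:
            # Complete case: the label is known, so there is nothing to infer.
            taus.append([1.0 if k == z else 0.0 for k in range(num_k)])
        else:
            # Incomplete case: posterior over the missing label.
            joint = [pi * normal_pdf(x, mu, var)
                     for pi, mu, var in zip(pis, mus, variances)]
            total = sum(joint)
            taus.append([j / total for j in joint])
    return taus


taus = e_step_partial(
    xs=[0.0, 10.0, 5.0, 0.5],
    zs=[0, 1, None, None],
    pis=[0.5, 0.5], mus=[0.0, 10.0], variances=[4.0, 4.0],
)
```

The resulting rows can be passed to the usual weighted M-step unchanged, since hard labels are just responsibilities of 0 and 1.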

35 Summary
Good things about EM: no learning rate parameter; very fast for low dimensions; each iteration is guaranteed not to decrease the likelihood.
Bad things about EM: can get stuck in local optima; requires an expensive inference step; is a maximum likelihood/MAP method.
