Probabilistic Graphical Models

Size: px

Start display at page:

Download "Probabilistic Graphical Models"

Arleen Parks
5 years ago
Views:

1 Probabilistic Graphical Models Lecture 17 EM CS/CNS/EE 155 Andreas Krause

2 Announcements Project poster session on Thursday Dec 3, 4-6pm in Annenberg 2 nd floor atrium! Easels, poster boards and cookies will be provided! Final writeup(8 pages NIPS format) due Dec 9 2

3 Approximate inference Three major classes of general-purpose approaches Message passing E.g.: Loopy Belief Propagation (today!) Inference as optimization Approximate posterior distribution by simple distribution Mean field / structured mean field Assumed density filtering / expectation propagation Sampling based inference Importance sampling, particle filtering Gibbs sampling, MCMC Many other alternatives (often for special cases) 3

4 Sample approximations of expectations x 1,,x N samples from RV X Law of large numbers: Hereby, the convergence is with probability 1 (almost sure convergence) Finite samples: 4

5 Monte Carlo sampling from a BN Sort variables in topological ordering X 1,,X n For i = 1 to n do Sample x i ~ P(X i X 1 =x 1,, X i-1 =x i-1 ) Works even with high-treewidth models! C D I G S L H J 5

6 Computing probabilities through sampling Want to estimate probabilities Draw N samples from BN C D I Marginals G L S Conditionals H J Rejection sampling problematic for rare events 6

7 Sampling from intractable distributions Given unnormalized distribution P(X) Q(X) Q(X) efficient to evaluate, but normalizer intractable For example, Q(X) = j Ψ(C j ) Want to sample from P(X) Ingenious idea: Can create Markov chain that is efficient to simulate and that has stationary distribution P(X) 7

8 Markov Chain Monte Carlo Given an unnormalized distribution Q(x) Want to design a Markov chain with stationary distribution π(x) = 1/Z Q(x) Need to specify transition probabilities P(x x )! 8

9 Designing Markov Chains 1) Proposal distribution R(X X) Given X t = x, sample proposal x ~R(X X=x) Performance of algorithm will strongly depend on R 2) Acceptance distribution: Suppose X t = x With probability set X t+1 = x With probability 1-α, set X t+1 = x Theorem [Metropolis, Hastings]: The stationary distribution is Z -1 Q(x) Proof: Markov chain satisfies detailed balance condition! 9

10 Gibbs sampling Start with initial assignment x (0) to all variables For t = 1 to do Set x (t) = x (t-1) For each variable X i Set v i = values of all x (t) except x i Sample x (t) ifrom P(X i v i ) Gibbs sampling satisfies detailed balance equation for P Can efficientlycompute conditional distributions P(X i v i ) for graphical models 10

11 Summary of Sampling Randomized approximate inference for computing expections, (conditional) probabilities, etc. Exact in the limit But may need ridiculously many samples Can even directly sample from intractable distributions Disguise distribution as stationary distribution of Markov Chain Famous example: Gibbs sampling 11

12 Summary of approximate inference Deterministic and randomized approaches Deterministic Loopy BP Mean field inference Assumed density filtering Randomized Forward sampling Markov Chain Monte Carlo Gibbs Sampling 12

13 Assumed Recall: The light side everything fully observable low treewidth no hidden variables Then everything is nice Efficient exact inference in large models Optimal parameter estimation without local minima Can even solve some structure learning tasks exactly 13

The dark side s 4 s 1 s 2 s 3 represent s 5

14 The dark side s 4 s 1 s 2 s 3 represent s 5 s7 s s 8 6 s 11 s 9 s 10 States of the world, sensor measurements, s 12 Graphical model In the real world, these assumptions are often violated.. Still want to use graphical models to solve interesting problems.. 14

15 Remaining Challenges Inference Approximate inference for high-treewidth models Learning Dealing with missing data Representation Dealing with hidden variables 15

16 Learning general BNs Fully observable Missing data Known structure Easy! Unknown structure Hard 16

17 Dealing with missing data So far, have assumed all variables are observed in each training example C D I G S In practice, often have missing data Some variables may never be observed Missing variables may be different for each example H L J 17

18 Gaussian Mixture Modeling 18

19 Learning with missing data Suppose X is observed variables, Z hidden variables Training data: x (1), x (2),, x (N) Marginal likelihood: Marginal likelihood doesn t decompose 19

20 Intuition: EM Algorithm Iterative algorithm for parameter learning in case of missing data EM Algorithm Expectation Step: Hallucinate hidden values Maximization Step: Train model as if data were fully observed Repeat Will converge to local maximum 20

21 E-Step: x: observed data; z: hidden data Hallucinate missing values by computing distribution over hidden variables using current parameter estimate: For each example x (j), compute: Q (t+1) (z x (j) ) = P(z x (j), θ (t) ) 21

22 Towards M-step: Jensen inequality Marginal likelihood doesn t decompose Theorem [Jensen s inequality]: For any distribution P(z) and function f(z), 22

23 Lower-bounding marginal likelihood Jensen s inequality: From E-step: 23

24 Lower bound on marginal likelihood Bound of marginal likelihood with hidden variables Recall: Likelihood in fully observable case: Lower-bound interpreted as weighted data set 24

25 M-step: Maximize lower bound Lower bound: Choose θ (t+1) to maximize lower bound Use expected sufficient statistics (counts). Will see: Whenever we used Count(x,z) in fully observable case, replace by E Q t+1[count(x,z)] 25

26 Coordinate Ascent Interpretation Define energy function For any distribution Q and parameters θ: EM algorithm performs coordinate ascent on F: Monotonically converges to local maximum 26

27 EM for Gaussian Mixtures 27

28 EM Iterations [by Andrew Moore] 28

29 EM in BayesNets Complete data likelihood E B A J M 29

30 EM in BayesNets Incomplete data likelihood E B A J M 30

31 Need to compute E-Step for BNs For fixed z, x: Can compute using inference Naively specifying full distribution would be intractable E B A J M 31

32 M-step for BNs Can optimize each CPT independently! MLE in fully observed case: MLE with hidden data: 32

33 Computing expected counts Suppose we observe O=o Variables A hidden 33

34 Learning general BNs Fully observable Missing data Known structure Easy! EM Unknown structure Hard (2.) Now 34

35 Structure learning with hidden data Fully observable case: Score(D;G) = likelihood of data under most likely parameters Decomposes over families Score(D;G) = ι FamScore i (X i Pa Xi ) Can recompute score efficiently after adding/removing edges Incomplete data case: Score(D;G) = lower bound from EM Does not decompose over families Search is very expensive Structure-EM: Iterate Computing of expected counts Multiple iterations of structure search for fixed counts Guaranteed to monotonically improve likelihood score 35

36 Hidden variable discovery Sometimes, invention of a hidden variable can drastically simplify model 36

37 Learning general BNs Fully observable Missing data Known structure Easy! EM Unknown structure Hard (2.) Structure-EM 37

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University Expectation Maximization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University April 10 th, 2006 1 Announcements Reminder: Project milestone due Wednesday beginning of class 2 Coordinate