Expectation Maximization: Inferring model parameters and class labels

Size: px

Start display at page:

Download "Expectation Maximization: Inferring model parameters and class labels"

Lewis French
5 years ago
Views:

1 Expectation Maximization: Inferring model parameters and class labels Emily Fox University of Washington February 27, 2017 Mixture of Gaussian recap 1

2 2/27/2017 Jumble of unlabeled images HISTOGRAM blue How do we model this distribution? 3 Model of jumble of unlabeled images blue blue 0.8 blue 2

3 2/27/2017 What if image types not equally represented? e.g., forest images are very likely in the collection 0.8 blue Mixture of Gaussians (1D) Each mixture component represents a unique cluster specified by: {πk, μk, σk2 } π1 π2 π3 π = [ ] π1 σ1 π2 π3 σ2 πk πk = 1 Relative proportion of each class in world from which we get data σ3 μ2 6 0 μ1 μ3 x 3

4 Mixture of Gaussians (general) μ 1,Σ 1 π 1 π 2 π 3 Each mixture component represents a unique cluster specified by: {π k, μ k, Σ k } μ 3,Σ 3 μ 2,Σ 2 7 According to the model Without observing the image content, what s the probability it s from cluster k? (e.g., prob. of seeing clouds image) Given observation x i is from cluster k, what s the likelihood of seeing x i? (e.g., just look at distribution for clouds ) π 1 π 2 π 3 8 x 4

5 Inferring soft assignments with expectation maximization (EM) Inferring cluster labels Data Desired soft assignments 10 5

6 Part 1: What if we knew the cluster parameters {π k,μ k,σ k }? Compute responsibilities Responsibility cluster k takes for observation i probability of assignment to cluster k given model parameters and observed value 12 6

7 Responsibilities in pictures Green cluster takes more responsibility Blue cluster takes more responsibility Uncertain split responsibility 13 Responsibilities in pictures Need to weight by cluster probabilities, not just cluster shapes Still uncertain, but green cluster seems more probable takes more responsibility 14 7

8 Responsibilities in equations Responsibility cluster k takes for observation i Initial probability of being from cluster k How likely is the observed value x i under this cluster assignment? 15 Responsibilities in equations Responsibility cluster k takes for observation i Normalized over all possible cluster assignments 16 8

9 Recall: According to the model Without observing the image content, what s the probability it s from cluster k? (e.g., prob. of seeing clouds image) Given observation x i is from cluster k, what s the likelihood of seeing x i? (e.g., just look at distribution for clouds ) π 1 π 2 π 3 17 x Formally: An application of Bayes rule Responsibility cluster k takes for observation i A params B A B A 18 9

10 Formally: An application of Bayes rule Responsibility cluster k takes for observation i A params B C B C 19 Formally: An application of Bayes rule r ik = p(a B,params) = p(a params)p(b A,params) p(c params)p(b C,params) = p(a params)p(b A,params) p(b params) 20 10

11 Part 1 summary Desired soft assignments (responsibilities) are easy to compute when cluster parameters {π k, μ k, Σ k } are known But, we don t know these! 21 Part 2a: Imagine we knew the cluster (hard) assignments z i 11

12 Estimating cluster parameters Imagine we know the cluster assignments Estimation problem decouples across clusters Is green point informative of fuchsia cluster parameters? NO! 23 Data table decoupling over clusters R G B Cluster x 1 [1] x 1 [2] x 1 [3] 3 x 2 [1] x 2 [2] x 2 [3] 3 R G B Cluster x 3 [1] x 3 [2] x 3 [3] 3 R G B Cluster x 4 [1] x 4 [2] x 4 [3] 1 x 5 [1] xx 5 [2] 5 [2] xx 5 [3] 5 [3] 2 x 6 [1] xx 6 [2] 6 [2] xx 6 [3] 6 [3]

13 Maximum likelihood estimation R G B Cluster x 1 [1] x 1 [2] x 1 [3] 3 x 2 [1] x 2 [2] x 2 [3] 3 x 3 [1] x 3 [2] x 3 [3] 3 Estimate {π k, μ k, Σ k } given data assigned to cluster k maximum likelihood estimation (MLE) Find parameters that maximize the score, or likelihood, of data 25 Mean/covariance MLE R G B Cluster x 1 [1] x 1 [2] x 1 [3] 3 x 2 [1] x 2 [2] x 2 [3] 3 x 3 [1] x 3 [2] x 3 [3] 3 Scalar case: 26 13

14 Cluster proportion MLE R G B Cluster x 4 [1] x 4 [2] x 4 [3] 1 # obs in cluster k R G B Cluster x 5 [1] x 5 [2] x 5 [3] 2 x 6 [1] x 6 [2] x 6 [3] 2 R G B Cluster x 1 [1] x 1 [2] x 1 [3] 3 x 2 [1] x 2 [2] x 2 [3] 3 x 3 [1] x 3 [2] x 3 [3] 3 total # of obs True for general mixtures of i.i.d. data, not just Gaussian clusters 27 Part 2a summary needed to compute soft assignments Cluster parameters are simple to compute if we know the cluster assignments But, we don t know these! 28 14

15 Part 2b: What can we do with just soft assignments r ij? Estimating cluster parameters from soft assignments Instead of having a full observation x i in cluster k, just allocate a portion r ik x i divided across all clusters, as determined by r ik 30 15

16 Maximum likelihood estimation from soft assignments Just like in boosting with weighted observations R G B r i1 r i2 r i3 x 1 [1] x 1 [2] x 1 [3] x 2 [1] x 2 [2] x 2 [3] x 3 [1] x 3 [2] x 3 [3] x 4 [1] x 4 [2] x 4 [3] x 5 [1] x 5 [2] x 5 [3] x 6 [1] x 6 [2] x 6 [3] % chance this obs is in cluster 3 31 Total weight in cluster: (effective # of obs) Maximum likelihood estimation from soft assignments R G B Cluster Cluster 1 data weights weights r i1 r i2 r i3 x 1 [1] x 1 [2] x 1 [3] x 2 [1] x 2 [2] x 2 [3] x 3 [1] x 3 [2] x 3 [3] x 4 [1] x 4 [2] x 4 [3] x 5 [1] x 5 [2] x 5 [3] x 6 [1] x 6 [2] x 6 [3]

17 Maximum likelihood estimation from soft assignments 33 R G B Cluster 1 weights x 1 [1] x 1 [2] x 1 [3] 0.30 x 2 [1] x 2 [2] x 2 [3] 0.01 Cluster 2 R G B weights x 3 [1] x 3 [2] x 3 [3] x 4 [1] x 1 x[1] 4 [2] x 1 x[2] 4 [3] x [3] 0.18 x 5 [1] x 2 x[1] 5 [2] x 2 x[2] 5 [3] x [3] 0.26 Cluster 3 R G B x 6 [1] x 3 x[1] 6 [2] x 3 x[2] 6 [3] x [3] weights x 4 [1] x 1 [1] x 4 [2] x 1 [2] x 4 [3] x 1 [3] x 5 [1] x 2 x[1] 5 [2] x 2 [2] x 5 [3] x 2 [3] x 6 [1] x 3 x[1] 6 [2] x 3 [2] x 6 [3] x 3 [3] x 4 [1] x 4 [2] x 4 [3] 0.15 x 5 [1] x 5 [2] x 5 [3] 0.02 x 6 [1] x 6 [2] x 6 [3] 0.01 Cluster-specific location/shape MLE R G B Cluster 1 weights x 1 [1] x 1 [2] x 1 [3] 0.30 x 2 [1] x 2 [2] x 2 [3] 0.01 x 3 [1] x 3 [2] x 3 [3] x 4 [1] x 4 [2] x 4 [3] 0.75 x 5 [1] x 5 [2] x 5 [3] 0.05 x 6 [1] x 6 [2] x 6 [3] 0.13 Compute cluster parameter estimates with weights on each row operation Total weight in cluster k = effective # obs 34 17

18 MLE of cluster proportions r i1 r i2 r i3 Total weight in cluster: Total weight in dataset: 6 Estimate cluster proportions from relative weights # datapoints N Total weight in cluster k = effective # obs Defaults to hard assignment case when r ij in {0,1} Hard assignments have: R G B r i1 r i2 r i3 x 1 [1] x 1 [2] x 1 [3] x 2 [1] x 2 [2] x 2 [3] x 3 [1] x 3 [2] x 3 [3] x 4 [1] x 4 [2] x 4 [3] x 5 [1] x 5 [2] x 5 [3] x 6 [1] x 6 [2] x 6 [3] One-hot encoding of cluster assignment Total weight in cluster:

19 Equating the estimates 37 Part 2b summary Still straightforward to compute cluster parameter estimates from soft assignments 38 19

Expectation Maximization: Inferring model parameters and class labels

Expectation Maximization: Inferring model parameters and class labels Emily Fox University of Washington February 27, 2017 Mixture of Gaussian recap 1 2/26/17 Jumble of unlabeled images HISTOGRAM blue