Expectation Maximization: Inferring model parameters and class labels


Emily Fox, University of Washington, February 27, 2017

Mixture of Gaussians recap

Jumble of unlabeled images: look at a histogram of blue intensity across the collection. How do we model this distribution?

Model of the jumble of unlabeled images: fit a mixture of Gaussians to the blue-intensity histogram.

What if image types are not equally represented? e.g., forest images are very likely in the collection.

Mixture of Gaussians (1D): each mixture component represents a unique cluster specified by {π_k, μ_k, σ_k²}. The vector of mixture weights π = [π_1, π_2, π_3] satisfies 0 ≤ π_k ≤ 1 and Σ_k π_k = 1, and gives the relative proportion of each class in the world from which we get data.
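To make the 1D definition concrete, here is a minimal sketch assuming Python with NumPy (the slides give no code); the parameter values are made up for illustration and the function names are mine.

```python
import numpy as np

def gaussian_pdf_1d(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def mixture_pdf_1d(x, pi, mu, sigma2):
    """1D mixture of Gaussians density: sum_k pi_k N(x | mu_k, sigma2_k)."""
    return sum(p * gaussian_pdf_1d(x, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

# Illustrative (made-up) parameters for K = 3 components; the weights sum to 1.
pi = [0.47, 0.35, 0.18]
mu = [0.2, 0.5, 0.8]
sigma2 = [0.01, 0.02, 0.005]
print(mixture_pdf_1d(0.45, pi, mu, sigma2))
```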

Mixture of Gaussians (general): each mixture component represents a unique cluster specified by {π_k, μ_k, Σ_k}.

According to the model:
- Without observing the image content, what's the probability it's from cluster k (e.g., the probability of seeing a clouds image)? p(z_i = k) = π_k
- Given that observation x_i is from cluster k, what's the likelihood of seeing x_i (e.g., just look at the distribution for clouds)? p(x_i | z_i = k, μ_k, Σ_k) = N(x_i | μ_k, Σ_k)
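The two statements above describe a generative process: draw a cluster label from the proportions, then draw the observation from that cluster's Gaussian. A sketch of that process with made-up parameters, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for a K = 3, 2-dimensional mixture.
pi = np.array([0.5, 0.3, 0.2])                                   # cluster proportions, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])             # cluster means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # cluster covariances

def sample_from_mixture(n):
    """Draw z_i ~ Categorical(pi), then x_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    z = rng.choice(len(pi), size=n, p=pi)          # unobserved cluster labels
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

x, z = sample_from_mixture(500)
print(x.shape, np.bincount(z) / len(z))            # empirical proportions roughly match pi
```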

Inferring soft assignments with expectation maximization (EM): given the data, infer the cluster labels in the form of the desired soft assignments.

Part 1: What if we knew the cluster parameters {π_k, μ_k, Σ_k}? Then we could compute responsibilities. The responsibility cluster k takes for observation i is

r_ik = p(z_i = k | {π_j, μ_j, Σ_j}_{j=1}^K, x_i),

the probability of assignment to cluster k given the model parameters and the observed value.

Responsibilities in pictures: near one center the green cluster takes more responsibility, near the other the blue cluster takes more responsibility, and in between the assignment is uncertain and responsibility is split.

Responsibilities in pictures (continued): we need to weight by cluster probabilities, not just cluster shapes. A point may still be uncertain, but if the green cluster seems more probable, it takes more responsibility.

Responsibilities in equations: the responsibility cluster k takes for observation i is proportional to π_k N(x_i | μ_k, Σ_k), the initial probability of being from cluster k times how likely the observed value x_i is under this cluster assignment. Normalizing over all possible cluster assignments,

r_ik = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | μ_j, Σ_j).

Recall, according to the model: p(z_i = k) = π_k, and p(x_i | z_i = k, μ_k, Σ_k) = N(x_i | μ_k, Σ_k).

Formally, the responsibility is an application of Bayes' rule:

r_ik = p(z_i = k | {π_j, μ_j, Σ_j}_{j=1}^K, x_i) = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | μ_j, Σ_j).

Writing A for the assignment event z_i = k and B for the observation x_i, this is exactly Bayes' rule:

r_ik = p(A | B, params) = p(A | params) p(B | A, params) / p(B | params)
     = p(A | params) p(B | A, params) / Σ_C p(C | params) p(B | C, params),

where the sum in the denominator runs over all possible cluster assignments C.
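A sketch of this E-step computation, assuming NumPy and SciPy's multivariate_normal for the Gaussian densities; the function name is mine. Each row of the returned matrix sums to 1, one responsibility per cluster.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mu, Sigma):
    """E-step: r[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    N, K = x.shape[0], len(pi)
    lik = np.zeros((N, K))
    for k in range(K):
        lik[:, k] = pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
    return lik / lik.sum(axis=1, keepdims=True)    # normalize over clusters
```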

Part 1 summary: the desired soft assignments (responsibilities) are easy to compute when the cluster parameters {π_k, μ_k, Σ_k} are known. But we don't know these!

Part 2a: Imagine we knew the cluster (hard) assignments z_i.

Estimating cluster parameters: imagine we know the cluster assignments. The estimation problem then decouples across clusters; is a green point informative about the fuchsia cluster's parameters? No!

Data table, decoupling over clusters: each row holds the features (R, G, B) and a hard cluster label, and the table splits into one sub-table per cluster, e.g.

R | G | B | Cluster
x_1[1] | x_1[2] | x_1[3] | 3
x_2[1] | x_2[2] | x_2[3] | 3
x_3[1] | x_3[2] | x_3[3] | 3
x_4[1] | x_4[2] | x_4[3] | 1
x_5[1] | x_5[2] | x_5[3] | 2
x_6[1] | x_6[2] | x_6[3] | 2

Maximum likelihood estimation: estimate {π_k, μ_k, Σ_k} given the data assigned to cluster k. Maximum likelihood estimation (MLE) finds the parameters that maximize the score, or likelihood, of the data.

Mean/covariance MLE, using only the rows assigned to cluster k:

μ̂_k = (1 / N_k) Σ_{i in k} x_i
Σ̂_k = (1 / N_k) Σ_{i in k} (x_i - μ̂_k)(x_i - μ̂_k)^T

In the scalar case these reduce to the sample mean and sample variance σ̂_k² of the points in cluster k.
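A minimal sketch of these hard-assignment MLEs, assuming Python with NumPy (the lecture gives no code); it also includes the cluster-proportion estimate π̂_k = N_k / N covered just below, and assumes integer labels z in {0, ..., K-1}. The function name is mine.

```python
import numpy as np

def mle_per_cluster(x, z, K):
    """MLE of {pi_k, mu_k, Sigma_k} from hard assignments z; estimation decouples by cluster."""
    N, d = x.shape
    pi_hat = np.zeros(K)
    mu_hat = np.zeros((K, d))
    Sigma_hat = np.zeros((K, d, d))
    for k in range(K):
        xk = x[z == k]                     # only the rows assigned to cluster k matter
        Nk = len(xk)
        pi_hat[k] = Nk / N                 # hat(pi)_k = N_k / N
        mu_hat[k] = xk.mean(axis=0)        # hat(mu)_k = (1/N_k) sum_{i in k} x_i
        centered = xk - mu_hat[k]
        Sigma_hat[k] = centered.T @ centered / Nk   # MLE covariance (divides by N_k, not N_k - 1)
    return pi_hat, mu_hat, Sigma_hat
```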

Cluster proportion MLE:

π̂_k = N_k / N,

where N_k is the number of observations in cluster k and N is the total number of observations. This is true for general mixtures of i.i.d. data, not just Gaussian clusters.

Part 2a summary: the cluster parameters (which we needed in order to compute soft assignments) are simple to compute if we know the cluster assignments. But we don't know these!

Part 2b: What can we do with just soft assignments r_ik? Estimating cluster parameters from soft assignments: instead of placing the full observation x_i in a single cluster k, just allocate a portion r_ik of it there; x_i is divided across all clusters, as determined by the r_ik.

Maximum likelihood estimation from soft assignments: just as in boosting with weighted observations, each row of the data table now carries weights r_i1, r_i2, r_i3 instead of a hard cluster label; r_i3, for example, is the chance that observation i is in cluster 3. The total weight in a cluster is the effective number of observations assigned to it.

Maximum likelihood estimation from soft assignments: to estimate cluster 1's parameters, use the whole data table together with the cluster-1 weights r_i1.

Maximum likelihood estimation from soft assignments: likewise, cluster 2 uses the full data table with weights r_i2, and cluster 3 with weights r_i3 (the slides show example weight columns, e.g., 0.30, 0.01, ...).

Cluster-specific location/shape MLE: compute the cluster parameter estimates with weights on each row operation,

μ̂_k = (1 / N_k^soft) Σ_{i=1}^N r_ik x_i
Σ̂_k = (1 / N_k^soft) Σ_{i=1}^N r_ik (x_i - μ̂_k)(x_i - μ̂_k)^T

where N_k^soft = Σ_{i=1}^N r_ik is the total weight in cluster k, i.e., the effective number of observations.

MLE of cluster proportions: estimate the cluster proportions from the relative weights,

π̂_k = N_k^soft / N,

where N is the number of datapoints and N_k^soft = Σ_{i=1}^N r_ik is the total weight in cluster k (the effective number of observations). This defaults to the hard-assignment case when every r_ik is in {0, 1}: hard assignments have r_ik = 1 if observation i is in cluster k and 0 otherwise, a one-hot encoding of the cluster assignment, so the total weight in a cluster is just its count of observations.

Equating the estimates, the soft-assignment MLEs are

π̂_k = N_k^soft / N
μ̂_k = (1 / N_k^soft) Σ_{i=1}^N r_ik x_i
Σ̂_k = (1 / N_k^soft) Σ_{i=1}^N r_ik (x_i - μ̂_k)(x_i - μ̂_k)^T
N_k^soft = Σ_{i=1}^N r_ik

Part 2b summary: it is still straightforward to compute cluster parameter estimates from soft assignments.
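As the summary says, these estimates are straightforward to compute. A minimal sketch of this M-step assuming NumPy, with r the N-by-K matrix of responsibilities; the function name is mine.

```python
import numpy as np

def m_step(x, r):
    """M-step: weighted MLE of {pi_k, mu_k, Sigma_k} from responsibilities r (shape N x K)."""
    N, d = x.shape
    K = r.shape[1]
    Nk_soft = r.sum(axis=0)                       # effective number of observations per cluster
    pi_hat = Nk_soft / N                          # hat(pi)_k = N_k^soft / N
    mu_hat = (r.T @ x) / Nk_soft[:, None]         # hat(mu)_k = (1/N_k^soft) sum_i r_ik x_i
    Sigma_hat = np.zeros((K, d, d))
    for k in range(K):
        centered = x - mu_hat[k]
        Sigma_hat[k] = (r[:, k, None] * centered).T @ centered / Nk_soft[k]
    return pi_hat, mu_hat, Sigma_hat
```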

Expectation maximization (EM): an iterative algorithm. Parts 1 and 2b together motivate an iterative algorithm:
1. E-step: estimate cluster responsibilities given the current parameter estimates
2. M-step: maximize the likelihood over the parameters given the current responsibilities
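A compact, self-contained sketch of the full loop, assuming NumPy and SciPy; it folds an E-step and an M-step into one function, uses a crude random initialization, adds a tiny diagonal term to each covariance (anticipating the regularization discussed later), and monitors the data log likelihood to decide when to stop. All names and default values are my own choices, not the lecture's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a mixture of Gaussians: alternate E-step (responsibilities) and M-step (weighted MLE)."""
    rng = np.random.default_rng(seed)
    N, d = x.shape
    # Crude initialization: K random data points as means, global covariance, uniform proportions.
    mu = x[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: r[i, k] proportional to pi_k N(x_i | mu_k, Sigma_k).
        lik = np.column_stack([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                               for k in range(K)])
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted MLE from the soft assignments.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ x) / Nk[:, None]
        for k in range(K):
            centered = x - mu[k]
            Sigma[k] = (r[:, k, None] * centered).T @ centered / Nk[k] + 1e-6 * np.eye(d)
        # Convergence check on the data log likelihood (non-decreasing across iterations).
        ll = np.log(lik.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```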

EM for mixtures of Gaussians in pictures: the slides show the fitted clusters and soft assignments at initialization, after the 1st iteration, after the 2nd iteration, and at the converged solution (followed by a replay of the sequence).

The nitty gritty of EM

Convergence and initialization of EM. Convergence of EM: EM is a coordinate-ascent algorithm; the E- and M-steps can be equated with alternating maximizations of an objective function. It converges to a local mode. We will assess convergence via the (log) likelihood of the data under the current parameter and responsibility estimates.
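One way to compute that log-likelihood objective, again assuming NumPy/SciPy and names of my own choosing; monitored across iterations, it should be non-decreasing for EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, pi, mu, Sigma):
    """Data log likelihood: sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    lik = np.column_stack([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                           for k in range(len(pi))])
    return np.log(lik.sum(axis=1)).sum()
```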

Initialization: there are many ways to initialize the EM algorithm, and the choice matters for convergence rates and the quality of the local mode found. Examples (the first two are sketched in the code below):
- Choose K observations at random to define K centroids; assign the other observations to the nearest centroid to form initial parameter estimates.
- Pick centers sequentially to provide good coverage of the data, as in k-means++.
- Initialize from a k-means solution.
- Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed.
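A sketch of the first two strategies, assuming NumPy; the helper names, and how the remaining parameters would be filled in afterward, are my own choices.

```python
import numpy as np

def init_random_points(x, K, rng):
    """Choose K observations at random as initial cluster means."""
    return x[rng.choice(len(x), size=K, replace=False)]

def init_plusplus(x, K, rng):
    """k-means++-style seeding: pick centers sequentially, favoring points far from existing centers."""
    centers = [x[rng.integers(len(x))]]
    for _ in range(K - 1):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
print(init_plusplus(x, 3, rng))
```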

Potential of vanilla EM to overfit

Overfitting of MLE: maximizing likelihood can overfit to the data. Imagine a K=2 example with one observation assigned to cluster 1 and the others assigned to cluster 2. What parameter values maximize the likelihood? Set the cluster-1 center equal to that point and shrink its variance to 0; the likelihood goes to infinity!

Overfitting in high dimensions, a document-clustering example: imagine only one document assigned to cluster k has word w (or all documents in the cluster agree on the count of word w). The likelihood is maximized by setting μ_k[w] = x_i[w] and σ²_{w,k} = 0. Then the likelihood of any document with a different count on word w being in cluster k is 0!
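A tiny numeric illustration of that blow-up in the K=2 example (assuming NumPy): the density of the lone observation under a Gaussian centered exactly on it grows without bound as the variance shrinks.

```python
import numpy as np

x_i = 0.7                                          # the lone observation in cluster 1
for sigma2 in [1e-1, 1e-3, 1e-5, 1e-7]:
    density = 1.0 / np.sqrt(2 * np.pi * sigma2)    # N(x_i | mu=x_i, sigma2): exponent term equals 1
    print(sigma2, density)                         # density grows without bound as sigma2 -> 0
```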

Simple regularization of the M-step for mixtures of Gaussians. Simple fix: don't let the variances go to 0! Add a small amount to the diagonal of the covariance estimate. Alternatively, take a Bayesian approach and place a prior on the parameters; the idea is similar, but all parameter estimates are smoothed via cluster pseudo-observations.
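The first fix amounts to one extra line in the covariance update. A sketch assuming NumPy, with the ridge size eps an arbitrary illustrative value and the function name my own.

```python
import numpy as np

def regularized_cov(x, r_k, mu_k, eps=1e-6):
    """Soft-assignment covariance MLE with a small ridge added to the diagonal."""
    centered = x - mu_k
    Sigma_k = (r_k[:, None] * centered).T @ centered / r_k.sum()
    return Sigma_k + eps * np.eye(x.shape[1])      # keeps variances away from 0
```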

Formally relating k-means and EM for mixtures of Gaussians

Relationship to k-means: consider a Gaussian mixture model with spherically symmetric clusters, Σ_k = σ²I, and let the variance parameter σ go to 0.
- With spherical clusters of equal variance, the relative likelihoods are just a function of the distance to each cluster center.
- As the variances go to 0, the likelihood ratio becomes 0 or 1.
- Responsibilities weigh in the cluster proportions, but they are dominated by the likelihood disparity.
Each datapoint gets fully assigned to the nearest center, just as in k-means.

Infinitesimally small variance EM = k-means:
1. E-step: estimate cluster responsibilities given the current parameter estimates; with infinitesimally small variances this becomes a decision based on the distance to the nearest cluster center.
2. M-step: maximize the likelihood over the parameters given the current responsibilities (now hard assignments!)
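A quick numeric check of this limit (assuming NumPy, with made-up points): with spherical, equal-variance clusters, a point's responsibilities approach a one-hot vector on the nearest center as σ² shrinks.

```python
import numpy as np

x_i = np.array([1.0, 0.4])
centers = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
pi = np.array([1 / 3, 1 / 3, 1 / 3])

for sigma2 in [1.0, 0.1, 0.01]:
    # Spherical Gaussians: the likelihood depends only on squared distance to each center.
    d2 = ((centers - x_i) ** 2).sum(axis=1)
    unnorm = pi * np.exp(-0.5 * d2 / sigma2)
    print(sigma2, unnorm / unnorm.sum())           # tends to one-hot on the nearest center
```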

Summary for the EM algorithm. What you can do now:
- Estimate soft assignments (responsibilities) given mixture model parameters
- Solve maximum likelihood parameter estimation using soft assignments (weighted data)
- Implement an EM algorithm for inferring soft assignments and cluster parameters
  - Determine an initialization strategy
  - Implement a variant that helps avoid overfitting issues
- Compare and contrast with k-means
  - Soft vs. hard assignments
  - k-means as a limiting special case of EM for mixtures of Gaussians
