Expectation-Maximization Nuno Vasconcelos ECE Department, UCSD
Plan for today
last time we started talking about mixture models and introduced the main ideas behind EM
to motivate EM, we looked at classification-maximization, which is a generalization of K-means
today: I will briefly review the main ideas, introduce EM, and solve EM for the case of learning Gaussian mixtures
next class: proof that EM maximizes the likelihood of the incomplete data
Mixture density estimate
we have seen that EM is a framework for ML estimation with missing data
canonical example: we want to classify vehicles into commercial/private
X: vehicle weight
the density is multimodal because there is a hidden variable Z (type of car), z ∈ {compact, sedan, station wagon, pickup, van}
for a given car type the weight is approximately Gaussian (or has some other parametric form)
the density is a mixture of Gaussians
Mixture model
the sample consists of pairs (x_i, z_i), D = {(x_1,z_1), …, (x_N,z_N)}, but we never get to see the z_i
e.g. bridge example: the sensor only registers weight; the car class was certainly there, but it is lost by the sensor
for this reason Z is called hidden
in general, there are two types of random variables
X: observed random variable
Z: hidden random variable
The basics of EM
as usual, we start from an iid sample D = {x_1, …, x_N}
the goal is to find the parameters Ψ* that maximize the likelihood with respect to D
the set D_c = {(x_1,z_1), …, (x_N,z_N)} is called the complete data
the set D = {x_1, …, x_N} is called the incomplete data
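In the notation above, this goal is the usual ML criterion for the observed (incomplete) data; a sketch of the objective:
\[
\Psi^* \;=\; \arg\max_{\Psi}\; \log P_X(D;\Psi) \;=\; \arg\max_{\Psi} \sum_{i=1}^{N} \log P_X(x_i;\Psi)
\]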
Complete vs incomplete data
in general, the problem would be trivial if we had access to the complete data
we have illustrated this with the specific example of a Gaussian mixture of C components, with parameters Ψ = {(π_1,µ_1,Σ_1), …, (π_C,µ_C,Σ_C)}
and shown that, given the complete data D_c, we only need to split the training set according to the labels z_i
D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
and solve, for each c, the ML estimation problem sketched below
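A sketch of this per-class problem, in the notation above (standard ML estimation restricted to the points assigned to class c; the prior π_c follows from the relative size of D_c):
\[
\Psi_c^* \;=\; \arg\max_{\Psi_c} \sum_{x_i \in D_c} \log P_{X|Z}(x_i \mid c;\,\Psi_c)
\]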
Learning with complete data
the solution is the usual per-class ML estimate, sketched below
hence, all the hard work seems to be in figuring out what the z_i are
the EM algorithm does this iteratively
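For Gaussian components this is the familiar closed form (sample fraction, sample mean, and sample covariance of each D_c):
\[
\pi_c = \frac{|D_c|}{N}, \qquad
\mu_c = \frac{1}{|D_c|}\sum_{x_i \in D_c} x_i, \qquad
\Sigma_c = \frac{1}{|D_c|}\sum_{x_i \in D_c} (x_i-\mu_c)(x_i-\mu_c)^T
\]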
Learning with incomplete data (EM)
the basic idea is quite simple
1. start with an initial parameter estimate Ψ^(0)
2. E-step: given the current parameters Ψ^(n) and the observations in D, guess what the values of the z_i are
3. M-step: with the new z_i, we have a complete data problem; solve this problem for the parameters, i.e. compute Ψ^(n+1)
4. go to 2.
this can be summarized as a loop that alternates between two steps: the E-step fills in the class assignments z_i, and the M-step re-estimates the parameters (a sketch of this loop follows)
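A minimal sketch of this loop; the function names e_step and m_step, and the likelihood-based stopping test, are hypothetical placeholders, not part of the lecture:

```python
import numpy as np

def expectation_maximization(x, init_params, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate guessing the hidden assignments (E-step)
    and re-estimating the parameters from them (M-step)."""
    params = init_params
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: fill in the hidden variables given the current parameters
        assignments, log_likelihood = e_step(x, params)
        # M-step: solve the resulting complete-data problem
        params = m_step(x, assignments)
        # stop when the likelihood no longer improves
        if log_likelihood - prev_ll < tol:
            break
        prev_ll = log_likelihood
    return params
```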
Classification-maximization
to gain intuition for this we considered a variation that we called classification-maximization
the idea is the following: after the M-step we have an estimate of all the parameters, i.e. an estimate of the densities that compose the mixture
we want to find the class assignments z_i (recall that z_i = k if x_i is a sample from the k-th component)
but this is a classification problem, and we know how to solve those: just use the BDR
the steps are as follows
Classification-maximization
C-step: given the estimates Ψ^(n) = {Ψ_1^(n), …, Ψ_C^(n)}, determine the z_i by the BDR (sketched below) and split the training set according to the labels z_i
D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step: as before, determine the parameters of each class independently
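In the notation above, the BDR assignment is the standard maximum-posterior rule; for Gaussian components, writing G(x; µ, Σ) for the Gaussian density:
\[
z_i \;=\; \arg\max_{c}\; P_{Z|X}\big(c \mid x_i;\Psi^{(n)}\big)
\;=\; \arg\max_{c}\;\Big[\log G\big(x_i;\mu_c^{(n)},\Sigma_c^{(n)}\big) + \log \pi_c^{(n)}\Big]
\]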
For Gaussian mixtures
C-step: split the training set according to the labels z_i
D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step: re-estimate the parameters of each class from its assigned points (sketched below)
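The M-step is then the complete-data solution applied to each D_c (the same closed form as before, per class):
\[
\pi_c^{(n+1)} = \frac{|D_c|}{N}, \qquad
\mu_c^{(n+1)} = \frac{1}{|D_c|}\sum_{x_i \in D_c} x_i, \qquad
\Sigma_c^{(n+1)} = \frac{1}{|D_c|}\sum_{x_i \in D_c} \big(x_i-\mu_c^{(n+1)}\big)\big(x_i-\mu_c^{(n+1)}\big)^T
\]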
K-means
when the covariances are identity and the priors uniform
C-step: split the training set according to the labels z_i
D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step: recompute the means
this is the K-means algorithm, aka the generalized Lloyd algorithm, aka the LBG algorithm in the vector quantization literature: assign points to the closest mean; recompute the means (a sketch follows)
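A minimal NumPy sketch of this assign/recompute loop; the initialization from randomly chosen points and the fixed iteration count are arbitrary choices for illustration:

```python
import numpy as np

def kmeans(x, C, n_iter=50, seed=0):
    """K-means: assign each point to the closest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=C, replace=False)].astype(float)  # initial means
    for _ in range(n_iter):
        # C-step: label each point with the index of the closest mean
        dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each mean from the points assigned to it
        for c in range(C):
            if np.any(labels == c):
                means[c] = x[labels == c].mean(axis=0)
    return means, labels
```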
K-means (thanks to Andrew Moore, CMU): a sequence of figure slides illustrating the assignment and mean-update steps over several iterations
Two open questions
filling in the z_i with the BDR seems intuitive, but
Q1: what about problems that are not about classification? the missing data does not need to be class labels, it could be a continuous random variable
Q2: how do I know that this converges to anything interesting?
we will look at Q2 in the next class
for Q1, EM suggests doing the most intuitive operation that is ALWAYS possible: don't worry about the z_i; directly estimate the likelihood of the complete data by its expected value given the observed data (E-step), then maximize this expected value (M-step)
this leads to the so-called Q-function
The Q function
it is a bit tricky: it is the expected value of the likelihood of the complete data (joint X and Z), given that we observed the incomplete data (X)
note that the likelihood is a function of Ψ (the parameters that we want to determine), but to compute the expected value we need to use the parameter values from the previous iteration (because we need a distribution for Z given X)
the EM algorithm, stated in terms of this function, is therefore as follows (the definition is written out below)
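In the notation above, the two steps have the standard form: an expectation over the hidden Z taken under the previous parameter estimate, followed by a maximization over Ψ:
\[
Q\big(\Psi;\Psi^{(n)}\big) \;=\; E_{Z|X}\Big[\log P_{X,Z}(D_c;\Psi)\,\Big|\,D;\Psi^{(n)}\Big],
\qquad
\Psi^{(n+1)} \;=\; \arg\max_{\Psi}\; Q\big(\Psi;\Psi^{(n)}\big)
\]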
Expectation-maximization
E-step: given the estimates Ψ^(n) = {Ψ_1^(n), …, Ψ_C^(n)}, compute the expected log-likelihood of the complete data, i.e. the Q function above
M-step: find the parameter set Ψ^(n+1) that maximizes this expected log-likelihood
let's make this more concrete by looking at the mixture case
Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one
EM for mixtures (step 1)
the first thing we always do in an EM problem is compute the likelihood of the COMPLETE data
a very neat trick to use when z is discrete (classes): instead of using z ∈ {1, 2, …, C}, use a binary vector of size equal to the number of classes
what was z = j in the z ∈ {1, 2, …, C} notation now becomes the vector z = (0, …, 0, 1, 0, …, 0), with the 1 in position j
EM for mixtures (step 1)
we can now write the complete data likelihood in product form (sketched below); for example, if z = k in the z ∈ {1, 2, …, C} notation, only the k-th factor survives
the advantage is that the log-likelihood becomes LINEAR in the components z_j!!!
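A sketch in the notation above, where π_j are the class priors and P_{X|Z}(x | j; Ψ_j) the component densities; with a one-hot z, every factor with z_j = 0 equals 1:
\[
P_{X,Z}(x,z;\Psi) \;=\; \prod_{j=1}^{C} \big[\pi_j\, P_{X|Z}(x \mid j;\Psi_j)\big]^{z_j}
\]
\[
\log P_{X,Z}(x,z;\Psi) \;=\; \sum_{j=1}^{C} z_j\,\big[\log \pi_j + \log P_{X|Z}(x \mid j;\Psi_j)\big]
\]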
EM for mixtures (step 1)
for the complete iid dataset D_c = {(x_1,z_1), …, (x_N,z_N)}, the complete data log-likelihood is the sum of these terms (written out below)
the term multiplying each z_ij does not depend on z and simply becomes a constant for the expectation that we have to compute in the E-step
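Writing z_ij for the j-th component of the vector z_i, the complete-data log-likelihood is:
\[
\log P_{X,Z}(D_c;\Psi) \;=\; \sum_{i=1}^{N}\sum_{j=1}^{C} z_{ij}\,\big[\log \pi_j + \log P_{X|Z}(x_i \mid j;\Psi_j)\big]
\]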
Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one
important E-step advice: do not compute terms that you do not need
at the end of the day we only care about the parameters, so terms of Q that do not depend on the parameters are useless
e.g. in Q = E[f(z,Ψ)] + E[log(sin z)], the expected value of log(sin z) appears to be difficult and is completely unnecessary, since it is dropped in the M-step
EM for mixtures (step 2)
once we have the complete data likelihood, to compute the Q function we only need to compute the expected values E[z_ij | x_i; Ψ^(n)]
since z_ij is binary, and z_i only depends on x_i, the E-step reduces to computing the posterior probability of each point under each class!
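In the notation above, this expectation is just Bayes rule under the previous parameter estimate; h_ij is the name used for it in the following slides:
\[
h_{ij} \;\equiv\; E\big[z_{ij} \mid x_i;\Psi^{(n)}\big] \;=\; P_{Z|X}\big(j \mid x_i;\Psi^{(n)}\big)
\;=\; \frac{\pi_j^{(n)}\, P_{X|Z}\big(x_i \mid j;\Psi_j^{(n)}\big)}{\sum_{k=1}^{C} \pi_k^{(n)}\, P_{X|Z}\big(x_i \mid k;\Psi_k^{(n)}\big)}
\]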
EM for mixtures (step 2)
defining h_ij as this posterior probability, the Q function is written out below
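Substituting the expectations h_ij into the complete-data log-likelihood gives:
\[
Q\big(\Psi;\Psi^{(n)}\big) \;=\; \sum_{i=1}^{N}\sum_{j=1}^{C} h_{ij}\,\big[\log \pi_j + \log P_{X|Z}(x_i \mid j;\Psi_j)\big]
\]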
Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one
EM vs CM
let's compare this with the CM algorithm
the C-step assigns each point to the class of largest posterior
the E-step assigns the point to all classes, with weights given by the posterior
for this reason, EM is said to make soft assignments: it does not commit to any of the classes (unless the posterior is one for that class), i.e. it is less greedy
we no longer partition the space into rigid cells; the boundaries are now soft
EM vs CM
what about the M-steps?
the CM and EM update equations have the same form; they are the same if we threshold the h_ij so that, for each i, max_j h_ij is set to 1 and all other h_ij to 0
i.e. the M-steps are the same up to the difference in assignments (a sketch follows)
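A small NumPy sketch of this point, assuming a matrix h of posteriors with one row per point and one column per class; the function names are illustrative only. Thresholding turns EM's soft weights into CM's hard assignments, while the mean update formula itself is unchanged:

```python
import numpy as np

def weighted_means(x, h):
    """Mean update shared by EM (soft h) and CM (hard, thresholded h):
    mu_j = sum_i h_ij x_i / sum_i h_ij."""
    return (h.T @ x) / h.sum(axis=0)[:, None]

def harden(h):
    """Threshold the posteriors: 1 for the largest entry in each row, 0 elsewhere."""
    hard = np.zeros_like(h)
    hard[np.arange(len(h)), h.argmax(axis=1)] = 1.0
    return hard

# EM uses weighted_means(x, h); CM uses weighted_means(x, harden(h))
```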
EM for Gaussian mixtures
in summary: CM = EM + hard assignments; since CM is a special case, it cannot be better
let's look at the special case of Gaussian mixtures
E-step: compute the posterior assignments h_ij, written out below for Gaussian components
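With Gaussian components, writing G(x; µ, Σ) for the Gaussian density, the E-step is:
\[
h_{ij} \;=\; \frac{\pi_j^{(n)}\, G\big(x_i;\mu_j^{(n)},\Sigma_j^{(n)}\big)}{\sum_{k=1}^{C} \pi_k^{(n)}\, G\big(x_i;\mu_k^{(n)},\Sigma_k^{(n)}\big)}
\]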
M-step for Gaussian mixtures
M-step: maximize the Q function with respect to the parameters
important note: in the M-step, the optimization must be subject to whatever constraints may hold
in particular, we always have the constraint that the priors sum to one, so as usual we introduce a Lagrangian (written out below)
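In the notation above, the constraint and the corresponding Lagrangian are:
\[
\sum_{j=1}^{C} \pi_j = 1,
\qquad
L(\Psi,\lambda) \;=\; Q\big(\Psi;\Psi^{(n)}\big) + \lambda\Big(\sum_{j=1}^{C}\pi_j - 1\Big)
\]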
M-step for Gaussian mixtures
setting the derivatives of the Lagrangian with respect to the parameters (and λ) to zero gives the solution
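For example, for the priors the calculation is standard: differentiate, solve for π_j, and use the constraint (and Σ_j h_ij = 1) to determine λ = −N:
\[
\frac{\partial L}{\partial \pi_j} = \sum_{i=1}^{N}\frac{h_{ij}}{\pi_j} + \lambda = 0
\;\;\Longrightarrow\;\;
\pi_j = -\frac{1}{\lambda}\sum_{i=1}^{N} h_{ij}
\;\;\Longrightarrow\;\;
\pi_j^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N} h_{ij}
\]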
M-step for Gaussian mixtures
this leads to the update equations written out below
comparing to those of CM, they are the same up to hard vs soft assignments
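The resulting updates are the soft counterparts of the complete-data estimates; replacing the h_ij with hard 0/1 assignments recovers the CM updates:
\[
\pi_j^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N} h_{ij},
\qquad
\mu_j^{(n+1)} = \frac{\sum_{i} h_{ij}\, x_i}{\sum_{i} h_{ij}},
\qquad
\Sigma_j^{(n+1)} = \frac{\sum_{i} h_{ij}\,\big(x_i-\mu_j^{(n+1)}\big)\big(x_i-\mu_j^{(n+1)}\big)^T}{\sum_{i} h_{ij}}
\]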