The E-M Algorithm
    Biostatistics 615/815, Lecture 17
Last Lecture: The Simplex Method
    General method for optimization
        Makes few assumptions about function
    Crawls towards minimum
    Some recommendations
        Multiple starting points
        Restart maximization at proposed solution
Summary: The Simplex Method
    [Figure: the original simplex, with high and low points marked, and its possible moves: reflection, reflection and expansion, contraction, multiple contraction]
Improvements to amoeba()
    Different scaling along each dimension
        If parameters have different impact on the likelihood
    Track total function evaluations
        Avoid getting stuck if function does not cooperate
    Rotate simplex
        If the current simplex is leading to slow improvement
optim() Function in R
    optim(point, function, method)
        Point: starting point for minimization
        Function: accepts point as argument
        Method can be
            "Nelder-Mead" for simplex method (default)
            "BFGS", "CG" and other options use gradient
Other Methods for Minimization in Multiple Dimensions
    Typically, sophisticated methods will...
    Use derivatives
        May be calculated numerically. How?
    Select a direction for minimization, using:
        Weighted average of previous directions
        Current gradient
        Avoid right angle turns
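One common answer to the "How?" above is a central finite difference. The block below is a minimal sketch under stated assumptions, not code from the course: the names `objective_fn`, `num_gradient` and `example_bowl`, and the step-size rule, are all hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical signature for an objective evaluated at a k-dimensional point */
typedef double (*objective_fn)(const double * point, int k);

/* Example objective (illustration only): sum of squared coordinates */
double example_bowl(const double * point, int k)
{
    int i;
    double total = 0.0;
    for (i = 0; i < k; i++)
        total += point[i] * point[i];
    return total;
}

/* Central-difference estimate of the gradient of f at point.
   point is perturbed in place and restored before returning. */
void num_gradient(objective_fn f, double * point, int k, double * grad)
{
    int i;
    assert(k > 0);

    for (i = 0; i < k; i++) {
        double saved = point[i];
        /* step scaled to the size of the coordinate (an assumed heuristic) */
        double h = 1e-5 * (fabs(saved) + 1.0);
        double f_plus, f_minus;

        point[i] = saved + h;
        f_plus = f(point, k);
        point[i] = saved - h;
        f_minus = f(point, k);
        point[i] = saved;

        grad[i] = (f_plus - f_minus) / (2.0 * h);
    }
}
```

Each gradient costs 2k function evaluations, which is worth keeping in mind when comparing evaluation counts between methods.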
One parameter at a time
    Simple but inefficient approach
    Consider
        Parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$
        Function $f(\theta)$
    Maximize $f(\theta)$ with respect to each $\theta_i$ in turn
    Cycle through parameters
The Inefficiency
    [Figure: illustration plotted on axes $\theta_1$ and $\theta_2$]
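The inefficiency can be seen on a toy problem. The sketch below is an illustration constructed for this purpose, not course code: it maximizes the hypothetical function f(t1, t2) = -(t1^2 + 2 c t1 t2 + t2^2), whose maximum is at (0, 0), one parameter at a time. For this function each one-dimensional maximization has a closed form (setting the partial derivative to zero gives t1 = -c t2, and likewise for t2).

```c
#include <assert.h>
#include <math.h>

/* Counts the cycles needed by one-parameter-at-a-time maximization of
   f(t1, t2) = -(t1*t1 + 2*c*t1*t2 + t2*t2), starting from (1, 1),
   until both coordinates are within tol of the maximum at (0, 0).
   Requires |c| < 1 so that f has a unique maximum. */
int coordinate_ascent_cycles(double c, double tol)
{
    double t1 = 1.0, t2 = 1.0;
    int cycles = 0;

    assert(c > -1.0 && c < 1.0);

    while ((fabs(t1) > tol || fabs(t2) > tol) && cycles < 10000) {
        t1 = -c * t2;   /* maximize over t1 with t2 held fixed */
        t2 = -c * t1;   /* maximize over t2 with t1 held fixed */
        cycles++;
    }
    return cycles;
}
```

With c = 0 the parameters do not interact and a single cycle reaches the maximum. As c approaches 1 the parameters become strongly correlated and each cycle shrinks the error only by a factor of c squared, so dozens of cycles are needed.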
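Steepest Descent
    Consider
        Parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$
        Function $f(\theta; x)$
    Score vector
        $S = \frac{d \ln f}{d\theta} = \left( \frac{d \ln f}{d\theta_1}, \ldots, \frac{d \ln f}{d\theta_k} \right)$
    Find maximum along $\theta + \delta S$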
Still inefficient
    Consecutive steps are still perpendicular!
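The perpendicularity can be checked numerically. The sketch below is an illustration under assumptions, not course code: it uses the hypothetical function f(t1, t2) = -(a t1^2 + b t2^2) / 2, for which the best step length along the gradient has a closed form, and returns the dot product of the gradients before and after one step.

```c
#include <assert.h>

/* One steepest-ascent step with exact line search on
   f(t1, t2) = -(a*t1*t1 + b*t2*t2) / 2, with a, b > 0.
   Updates theta in place and returns the dot product of the
   gradient before the step with the gradient after it. */
double steepest_step_dot(double a, double b, double theta[2])
{
    double g1, g2, h1, h2, delta;

    assert(a > 0.0 && b > 0.0);

    g1 = -a * theta[0];              /* gradient at the current point */
    g2 = -b * theta[1];
    if (g1 == 0.0 && g2 == 0.0)
        return 0.0;                  /* already at the maximum */

    /* exact maximizer of f along theta + delta * g */
    delta = (g1 * g1 + g2 * g2) / (a * g1 * g1 + b * g2 * g2);
    theta[0] += delta * g1;
    theta[1] += delta * g2;

    h1 = -a * theta[0];              /* gradient at the new point */
    h2 = -b * theta[1];
    return g1 * h1 + g2 * h2;
}
```

The returned dot product is zero up to rounding: maximizing exactly along the gradient leaves a new gradient with no component in the old direction, so the next step must turn through a right angle.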
Other Strategies for Multidimensional Optimization
    Most strategies will define a series of vectors or lines through parameter space
    Estimate of minimum improved by adding an optimal multiple of each vector
    Some intuitive choices might be:
        The function gradient
        Unit vectors along one dimension
The key is to avoid right angle turns!
    Most methods that use derivatives don't simply optimize the function along the current gradient or the unit vectors
Today
    The E-M algorithm
        General algorithm for missing data problems
        Requires "specialization" to the problem at hand
        Frequently applied to mixture distributions
The E-M Algorithm
    Original Citation
        Dempster, Laird and Rubin (1977), J Royal Statistical Society (B) 39:1-38
        Cited in over 9,184 research articles
    For comparison
        Nelder and Mead (1965), Computer Journal 7:308-313
        Cited in over 8,094 research articles
The Basic E-M Strategy
    X = (Y, Z)
        Complete data X (e.g. what we'd like to have!)
        Observed data Y (e.g. individual observations)
        Missing data Z (e.g. class assignments)
    The algorithm
        Use estimated parameters to infer Z
        Update estimated parameters using Y and Z
        Repeat until convergence
The E-M Algorithm
    Consider a set of starting parameters
    Use these to estimate the missing data
    Use complete data to update parameters
    Repeat as necessary
Setting for the E-M Algorithm...
    Problem is simpler to solve for complete data
        Maximum likelihood estimates can be calculated using standard methods
    Estimates of mixture parameters could be obtained in a straightforward manner if the origin of each observation is known
Filling In Missing Data
    The missing data is the group assignment for each observation
    Complete data generated by assigning observations to groups
        Probabilistically
        We will use fractional assignments
The E-Step: Mixture of Normals
    Estimate missing data
        Estimate assignment of observations to groups
    How?
        Conditional on current parameter values
    Basically, classify each observation
Classification Probabilities
    $\Pr(Z_i = j \mid x_i, \pi, \phi, \eta) = \frac{\pi_j f(x_i \mid \phi_j, \eta_j)}{\sum_l \pi_l f(x_i \mid \phi_l, \eta_l)}$
    Results from the application of Bayes' theorem
    Implemented in classprob() function
        classprob(int j, double x, int k, double *prob, double *mean, double *sd)
C Code: Updating Group Memberships
    void update_class_prob(int n, double * data, int k, double * prob,
                           double * mean, double * sd, double ** class_prob)
    {
        int i, j;

        /* E-step: membership probability of every observation in every group */
        for (i = 0; i < n; i++)
            for (j = 0; j < k; j++)
                class_prob[i][j] = classprob(j, data[i], k, prob, mean, sd);
    }
The M-Step
    Update mixture parameters to maximize the likelihood of the data
    Appears tricky, but becomes simple when we assume cluster assignments are correct
    We simply use the sample proportions, and weighted means and variances to update parameters
    This step is guaranteed never to decrease likelihood
Updating Mixture Proportions
    $\hat{\pi}_j = \frac{1}{n} \sum_i \Pr(Z_i = j \mid x_i, \pi, \phi, \eta)$
    "Count" the observations assigned to each group
C Code: Updating Mixture Proportions
    void update_prob(int n, double * data, int k, double * prob,
                     double ** class_prob)
    {
        int i, j;

        for (j = 0; j < k; j++) {
            prob[j] = 0.0;

            /* sum the fractional assignments to group j */
            for (i = 0; i < n; i++)
                prob[j] += class_prob[i][j];

            prob[j] /= n;
        }
    }
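Updating Component Means
    $\hat{\mu}_j = \frac{\sum_i x_i \Pr(Z_i = j \mid x_i, \pi, \phi, \eta)}{\sum_i \Pr(Z_i = j \mid x_i, \pi, \phi, \eta)} = \frac{\sum_i x_i \Pr(Z_i = j \mid x_i, \pi, \phi, \eta)}{n \hat{\pi}_j}$
    Here $\hat{\pi}_j$ is the updated mixture proportion for group j
    Calculate weighted mean for group
    Weights are probabilities of group membership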
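C Code: Update Component Means
    void update_mean(int n, double * data, int k, double * prob,
                     double * mean, double ** class_prob)
    {
        int i, j;

        for (j = 0; j < k; j++) {
            mean[j] = 0.0;

            /* weighted sum of observations, weights are membership probabilities */
            for (i = 0; i < n; i++)
                mean[j] += data[i] * class_prob[i][j];

            /* TINY is assumed to be a small positive constant, defined
               elsewhere, that guards against division by zero */
            mean[j] /= n * prob[j] + TINY;
        }
    }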
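Updating Component Variances
    $\hat{\sigma}_j^2 = \frac{\sum_i (x_i - \hat{\mu}_j)^2 \Pr(Z_i = j \mid x_i, \pi, \phi, \eta)}{n \hat{\pi}_j}$
    Here $\hat{\mu}_j$ and $\hat{\pi}_j$ are the updated mean and mixture proportion for group j
    Calculate weighted sum of squared differences
    Weights are probabilities of group membership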
C Code: Update Component Std Deviations
    void update_sd(int n, double * data, int k, double * prob,
                   double * mean, double * sd, double ** class_prob)
    {
        int i, j;

        for (j = 0; j < k; j++) {
            sd[j] = 0.0;

            /* weighted sum of squared differences from the group mean */
            for (i = 0; i < n; i++)
                sd[j] += square(data[i] - mean[j]) * class_prob[i][j];

            sd[j] /= (n * prob[j] + TINY);
            sd[j] = sqrt(sd[j]);
        }
    }
C Code: Update Mixture
    void update_parameters(int n, double * data, int k, double * prob,
                           double * mean, double * sd, double ** class_prob)
    {
        // First, we update the mixture proportions
        update_prob(n, data, k, prob, class_prob);

        // Next, update the mean for each component
        update_mean(n, data, k, prob, mean, class_prob);

        // Finally, update the standard deviation
        update_sd(n, data, k, prob, mean, sd, class_prob);
    }
E-M Algorithm For Mixtures
    1. Guesstimate starting parameters
    2. Use Bayes' theorem to calculate group assignment probabilities
    3. Update parameters using estimated assignments
    4. Repeat steps 2 and 3 until likelihood is stable
C Code: The E-M Algorithm
    double em(int n, double * data, int k, double * prob,
              double * mean, double * sd, double eps)
    {
        double llk = 0, prev_llk = 0;
        /* note: class_prob is never released in this listing */
        double ** class_prob = alloc_matrix(n, k);

        start_em(n, data, k, prob, mean, sd);

        do {
            prev_llk = llk;

            /* E-step, then M-step, then re-evaluate the log-likelihood */
            update_class_prob(n, data, k, prob, mean, sd, class_prob);
            update_parameters(n, data, k, prob, mean, sd, class_prob);

            llk = mixllk(n, data, k, prob, mean, sd);
        } while (!check_tol(llk, prev_llk, eps));

        return llk;
    }
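The em() listing relies on check_tol() and alloc_matrix(), which are not shown in these slides. The block below sketches what they might look like. Both bodies are assumptions: the relative-tolerance rule is one common convention, and `free_matrix` is a hypothetical counterpart that the slides do not mention.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Returns 1 when two successive log-likelihoods agree to within a
   relative tolerance eps (an assumed definition), 0 otherwise */
int check_tol(double llk, double prev_llk, double eps)
{
    double scale = fabs(llk) + fabs(prev_llk) + 1e-10;
    return 2.0 * fabs(llk - prev_llk) <= eps * scale;
}

/* Allocates a zero-filled rows x cols matrix addressable as m[i][j].
   All entries live in one contiguous block. Returns NULL on failure. */
double ** alloc_matrix(int rows, int cols)
{
    int i;
    double ** m;

    assert(rows > 0 && cols > 0);

    m = malloc((size_t) rows * sizeof *m);
    if (m == NULL)
        return NULL;

    m[0] = calloc((size_t) rows * (size_t) cols, sizeof **m);
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }

    for (i = 1; i < rows; i++)
        m[i] = m[0] + (size_t) i * (size_t) cols;
    return m;
}

/* Hypothetical counterpart: releases a matrix made by alloc_matrix() */
void free_matrix(double ** m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

With these definitions, the first pass through the loop in em() compares the fitted log-likelihood against the initial value of 0 and so never stops early for eps below 2. A call such as free_matrix(class_prob) before the return would release the membership matrix.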
Picking Starting Parameters
    Mixing proportions
        Assumed equal
    Means for each group
        Pick one observation as the group mean
    Variances for each group
        Use overall variance
C Code: Picking Starting Parameters
    void start_em(int n, double * data, int k, double * prob,
                  double * mean, double * sd)
    {
        int i, j;
        double mean1 = 0.0, sd1 = 0.0;

        /* overall mean */
        for (i = 0; i < n; i++)
            mean1 += data[i];
        mean1 /= n;

        /* overall standard deviation */
        for (i = 0; i < n; i++)
            sd1 += square(data[i] - mean1);
        sd1 = sqrt(sd1 / n);

        for (j = 0; j < k; j++) {
            prob[j] = 1.0 / k;             /* equal mixing proportions */
            mean[j] = data[rand() % n];    /* a randomly chosen observation */
            sd[j] = sd1;
        }
    }
Example Application
    Old Faithful Eruptions (n = 272)
    [Figure: histogram "Old Faithful Eruptions"; x-axis: Duration (mins), y-axis: Frequency]
Using Simplex Method: A Mixture of Two Normals
    Fit 5 parameters
        Proportion in 1st component, 2 means, 2 variances
    44/50 runs found minimum
    Required about 700 evaluations
    First component contributes 0.348 of mixture
    Means are 2.018 and 4.273
    Variances are 0.055 and 0.191
    Maximum log-likelihood = -276.36
Using E-M Algorithm: A Mixture of Two Normals
    Fit 5 parameters
    50/50 runs found maximum
    Required about 25 evaluations
    First component contributes 0.348 of mixture
    Means are 2.018 and 4.273
    Variances are 0.055 and 0.191
    Maximum log-likelihood = -276.36
Two Components
    [Figure: two panels. "Old Faithful Eruptions": Frequency against Duration (mins). "Fitted Distribution": Density against Duration (mins)]
Simplex Method: A Mixture of Three Normals
    Fit 8 parameters
        2 proportions, 3 means, 3 variances
    Required about 1400 evaluations
        Found best solution in 7/50 runs
        Other solutions effectively included only 2 components
    The best solutions
        Components contributing 0.339, 0.512 and 0.149
        Component means are 2.002, 4.401 and 3.727
        Variances are 0.0455, 0.106 and 0.2959
        Maximum log-likelihood = -267.89
Three Components
    [Figure: two panels. "Old Faithful Eruptions": Frequency against Duration (mins). "Fitted Distribution": Density against Duration (mins)]
E-M Algorithm: A Mixture of Three Normals
    Fit 8 parameters
        2 proportions, 3 means, 3 variances
    Required about 150 evaluations
        Found log-likelihood of about -267.89 in 42/50 runs
        Found log-likelihood of about -263.91 in 7/50 runs
    The best solutions
        Components contributing 0.160, 0.195 and 0.644
        Component means are 1.856, 2.182 and 4.289
        Variances are 0.00766, 0.0709 and 0.172
        Maximum log-likelihood = -263.91
Three Components
    [Figure: two panels. "Old Faithful Eruptions": Frequency against Duration (mins). "Fitted Density": Density against Duration (mins)]
Convergence for E-M Algorithm
    [Figure: "LogLikelihood": log-likelihood against Iteration (0 to 200), shown at two vertical scales]
Convergence for E-M Algorithm
    [Figure: "Mixture Means": Mean against Iteration (0 to 200)]
E-M Algorithm: A Mixture of Four Normals
    Fit 11 parameters
        3 proportions, 4 means, 4 variances
    Required about 300 evaluations
        Found log-likelihood of about -267.89 in 1/50 runs
        Found log-likelihood of about -263.91 in 2/50 runs
        Found log-likelihood of about -257.46 in 47/50 runs
    "Appears" more reliable than with 3 components
Four Components
    [Figure: two panels. "Old Faithful Eruptions": Frequency against Duration (mins). Fitted density: Density against Duration]
Today
    The E-M algorithm
    Missing data formulation
    Application to mixture distributions
    Consider multiple starting points
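The advice to consider multiple starting points can be wrapped in a small driver. The block below is a sketch under assumptions, not course code: `fit_fn`, `best_of_starts`, `struct replay` and `replay_fit` are hypothetical names. The driver knows nothing about mixtures. It repeatedly calls a user-supplied fit routine, which is expected to draw its own random start, and keeps the best log-likelihood. The stand-in fit routine simply replays a list of values so the driver can be exercised without data.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback: runs one fit from a fresh random start and
   returns the log-likelihood it reached */
typedef double (*fit_fn)(void * context);

/* Runs the fit `starts` times and returns the best log-likelihood.
   If best_start is not NULL it receives the index of the winning run. */
double best_of_starts(fit_fn fit, void * context, int starts,
                      unsigned seed, int * best_start)
{
    int s;
    double best = 0.0;

    assert(starts > 0);
    srand(seed);   /* makes starts drawn with rand() reproducible */

    for (s = 0; s < starts; s++) {
        double llk = fit(context);
        if (s == 0 || llk > best) {
            best = llk;
            if (best_start != NULL)
                *best_start = s;
        }
    }
    return best;
}

/* Stand-in for a real fit (illustration only): replays stored values */
struct replay {
    const double * values;
    int next;
};

double replay_fit(void * context)
{
    struct replay * r = context;
    return r->values[r->next++];
}
```

In real use the callback would wrap a call to em() and copy the fitted proportions, means and standard deviations somewhere safe whenever the returned log-likelihood improves, since each new run overwrites them.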
Further Reading
    There is a nice discussion of the E-M algorithm, with application to mixtures, at:
    http://en.wikipedia.org/wiki/em_algorithm