Expectation-Maximization. Nuno Vasconcelos ECE Department, UCSD


Plan for today: last time we started talking about mixture models and introduced the main ideas behind EM; to motivate EM, we looked at classification-maximization, a generalization of K-means. Today I will briefly review the main ideas, introduce EM, and solve EM for the case of learning Gaussian mixtures. Next class: proof that EM maximizes the likelihood of the incomplete data.

Mixture density estimate: we have seen that EM is a framework for ML estimation with missing data. Canonical example: we want to classify vehicles into commercial/private, where X is the vehicle weight. The density of X is multimodal because there is a hidden variable Z (type of car), z ∈ {compact, sedan, station wagon, pick-up, van}. For a given car type the weight is approximately Gaussian (or has some other parametric form), so the density is a mixture of Gaussians.
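For reference, the mixture density described here can be written in the standard form (the original slide's equation did not survive the transcription, so this is a reconstruction):

P_X(x) = \sum_{c=1}^{C} P_Z(c)\, P_{X|Z}(x \mid c) = \sum_{c=1}^{C} \pi_c\, \mathcal{G}(x;\mu_c,\Sigma_c)

where \mathcal{G}(x;\mu,\Sigma) denotes a Gaussian of mean \mu and covariance \Sigma, and \pi_c = P_Z(c) is the prior probability of component c.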

Mixture model: the underlying sample consists of pairs (x_i, z_i), {(x_1,z_1), …, (x_N,z_N)}, but we never get to see the z_i. E.g., in the bridge example, the sensor only registers weight: the car class was certainly there, but it is lost by the sensor. For this reason Z is called hidden. In general, there are two types of random variables: X, the observed random variable, and Z, the hidden random variable.

The basics of EM: as usual, we start from an iid sample D = {x_1, …, x_N}. The goal is to find the parameters Ψ* that maximize the likelihood with respect to D. The set D_c = {(x_1,z_1), …, (x_N,z_N)} is called the complete data; the set D = {x_1, …, x_N} is called the incomplete data.
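Written out (a standard restatement; the slide's equation was not preserved), the goal and the incomplete-data likelihood are

\Psi^* = \arg\max_{\Psi} \log P_X(D;\Psi), \qquad P_X(x;\Psi) = \sum_{z} P_{X,Z}(x,z;\Psi),

i.e. the incomplete-data likelihood marginalizes out the hidden variable, which is what makes direct maximization hard.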

Complete vs incomplete data: in general, the problem would be trivial if we had access to the complete data. We have illustrated this with the specific example of a Gaussian mixture of C components, with parameters Ψ = {(π_1,µ_1,Σ_1), …, (π_C,µ_C,Σ_C)}, and shown that, given the complete data D_c, we only need to split the training set according to the labels z_i, D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}, and solve a separate ML estimation problem for each c.
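The per-component problem referred to here is, in standard form (a reconstruction, since the slide's equation was lost),

(\mu_c^*, \Sigma_c^*) = \arg\max_{\mu_c,\Sigma_c} \sum_{x_i \in D_c} \log \mathcal{G}(x_i;\mu_c,\Sigma_c),

with the prior π_c estimated from the relative size of D_c.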

Learning with complete data: the solution is the usual set of per-class ML estimates. Hence, all the hard work seems to be in figuring out what the z_i are; the EM algorithm does this iteratively.
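The estimates themselves did not survive the transcription; for the Gaussian mixture they are the standard complete-data ML solutions

\pi_c^* = \frac{|D_c|}{N}, \qquad \mu_c^* = \frac{1}{|D_c|}\sum_{x_i \in D_c} x_i, \qquad \Sigma_c^* = \frac{1}{|D_c|}\sum_{x_i \in D_c} (x_i - \mu_c^*)(x_i - \mu_c^*)^T.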

Learning with incomplete data (EM): the basic idea is quite simple. 1. start with an initial parameter estimate Ψ(0); 2. E-step: given the current parameters Ψ(i) and the observations in D, guess what the values of the z_i are; 3. M-step: with the new z_i we have a complete data problem, so solve this problem for the parameters, i.e. compute Ψ(i+1); 4. go to 2. This can be summarized as a loop that alternates between filling in the class assignments z_i (E-step) and re-estimating the parameters (M-step).

Classification-maximization: to gain intuition for this, we considered a variation that we called Classification-Maximization. The idea is the following. After the M-step we have an estimate of all the parameters, i.e. an estimate of the densities that compose the mixture. We want to find the class assignments z_i (recall that z_i = k if x_i is a sample from the k-th component). But this is a classification problem, and we know how to solve those: just use the BDR. The steps are as follows.

Classification-maximization. C-step: given the estimates Ψ(n) = {Ψ_1(n), …, Ψ_C(n)}, determine the z_i by the BDR and split the training set according to the labels z_i, D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}. M-step: as before, determine the parameters of each class independently.
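The BDR referred to here, written out for the mixture with current estimates Ψ(n) (standard form, not transcribed from the slide):

z_i = \arg\max_{j}\; P_{Z|X}(j \mid x_i;\Psi^{(n)}) = \arg\max_{j}\; \pi_j^{(n)}\, P_{X|Z}(x_i \mid j;\Psi_j^{(n)}).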

For Gaussian mixtures. C-step: split the training set according to the labels z_i, D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}. M-step: re-estimate the parameters of each component from its assigned points, using the same per-class ML estimates as in the complete-data case above.

K-means: when the covariances are identity and the priors uniform, the C-step assigns each point to the closest mean (i.e. splits the training set into D_1 = {x_i | z_i = 1}, …, D_C = {x_i | z_i = C}) and the M-step recomputes the means. This is the K-means algorithm, aka the generalized Lloyd algorithm, aka the LBG algorithm in the vector quantization literature: assign points to the closest mean; recompute the means.
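A minimal sketch of this in Python (NumPy only); this is my own illustration rather than code from the lecture, and the initialization by sampling C data points is just one common choice:

import numpy as np

def kmeans(X, C, n_iter=100, seed=0):
    """Minimal K-means; X has shape (N, d), C is the number of clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=C, replace=False)].astype(float)
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # C-step: assign each point to the closest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (N, C)
        z = d2.argmin(axis=1)
        # M-step: recompute each mean from the points assigned to it
        for c in range(C):
            if np.any(z == c):
                mu[c] = X[z == c].mean(axis=0)
    return mu, z

Each pass is exactly one C-step (closest-mean assignment) followed by one M-step (mean recomputation), as described above.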

K-means illustrated step by step (a sequence of example figures, thanks to Andrew Moore, CMU).

Two open questions: filling in the z_i with the BDR seems intuitive, but Q1: what about problems that are not about classification? The missing data does not need to be class labels; it could be a continuous random variable. Q2: how do I know that this converges to anything interesting? We will look at Q2 in the next class. For Q1, EM suggests the most intuitive operation that is ALWAYS possible: don't worry about the z_i; directly estimate the likelihood of the complete data by its expected value given the observed data (E-step), then maximize this expected value (M-step). This leads to the so-called Q-function.

The Q function is a bit tricky: it is defined as the expected value of the complete-data likelihood (joint in X and Z), given that we observed the incomplete data (X). Note that the likelihood is a function of Ψ (the parameters that we want to determine), but to compute the expected value we need to use the parameter values from the previous iteration (because we need a distribution for Z|X). The EM algorithm is, therefore, as follows.
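The defining equation was lost in the transcription; in standard notation it reads

Q(\Psi;\Psi^{(n)}) = E_{Z|X}\!\left[\,\log P_{X,Z}(D_c;\Psi) \;\middle|\; D;\Psi^{(n)}\right],

where the expectation over the hidden variables uses the previous estimate \Psi^{(n)}, while \Psi is the free variable to be optimized.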

Expectation-maximization. E-step: given the estimates Ψ(n) = {Ψ_1(n), …, Ψ_C(n)}, compute the expected log-likelihood of the complete data. M-step: find the parameter set that maximizes this expected log-likelihood. Let's make this more concrete by looking at the mixture case.
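In equation form (standard, reconstructed rather than transcribed): the E-step computes Q(Ψ;Ψ(n)) as defined above, and the M-step sets

\Psi^{(n+1)} = \arg\max_{\Psi}\; Q(\Psi;\Psi^{(n)}).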

Expectation-maximization: to derive an EM algorithm you need to do the following. 1. write down the likelihood of the COMPLETE data; 2. E-step: write down the Q function, i.e. its expectation given the observed data; 3. M-step: solve the maximization, deriving a closed-form solution if there is one.

EM for mixtures (step 1): the first thing we always do in an EM problem is compute the likelihood of the COMPLETE data. A very neat trick to use when z is discrete (classes): instead of using z ∈ {1, 2, …, C}, use a binary vector of size equal to the number of classes, so that z = j in the z ∈ {1, 2, …, C} notation becomes a vector with a 1 in position j and 0 everywhere else.
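Concretely (my notation for the equation that was lost in transcription), z = j in the scalar notation corresponds to the one-hot vector

\mathbf{z} = (z_1,\ldots,z_C), \qquad z_j = 1, \quad z_k = 0 \;\text{ for } k \neq j.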

EM for mixtures (step 1): we can now write the complete-data likelihood as a product over components, each raised to the power z_j; for example, if z = k in the z ∈ {1, 2, …, C} notation, only the k-th factor survives. The advantage is that the complete-data log-likelihood becomes LINEAR in the components z_j!
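In standard form (reconstructed, since the slide's equations were lost), the complete-data likelihood of a single pair (x, z) is

P_{X,Z}(x,\mathbf{z};\Psi) = \prod_{j=1}^{C} \left[\pi_j\, \mathcal{G}(x;\mu_j,\Sigma_j)\right]^{z_j},

and taking logs gives

\log P_{X,Z}(x,\mathbf{z};\Psi) = \sum_{j=1}^{C} z_j \left[\log\pi_j + \log\mathcal{G}(x;\mu_j,\Sigma_j)\right],

which is indeed linear in the z_j.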

EM for mixtures (step 1): for the complete iid dataset D_c = {(x_1,z_1), …, (x_N,z_N)}, the complete-data log-likelihood is the sum of these terms over the sample, as written out below. Note that the factor log[π_j G(x_i; µ_j, Σ_j)] multiplying each z_ij does not depend on z and simply becomes a constant for the expectation that we have to compute in the E-step.
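The equation referred to, in standard form (a reconstruction):

\log P_{X,Z}(D_c;\Psi) = \sum_{i=1}^{N} \sum_{j=1}^{C} z_{ij} \left[\log\pi_j + \log\mathcal{G}(x_i;\mu_j,\Sigma_j)\right],

where z_{ij} is the j-th entry of the binary vector encoding z_i.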

Expectation-maximization: to derive an EM algorithm you need to do the following. 1. write down the likelihood of the COMPLETE data; 2. E-step: write down the Q function, i.e. its expectation given the observed data; 3. M-step: solve the maximization, deriving a closed-form solution if there is one. Important E-step advice: do not compute terms that you do not need. At the end of the day we only care about the parameters, so terms of Q that do not depend on the parameters are useless; e.g., in Q = f(z,Ψ) + log(sin z), the expected value of log(sin z) appears to be difficult and is completely unnecessary, since it is dropped in the M-step.

EM for mixtures (step 2): once we have the complete-data likelihood, to compute the Q function we only need to compute the expected values of the z_ij given the observed data. Since z_ij is binary and depends only on x_i, the E-step reduces to computing the posterior probability of each point under each class!
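In standard form (reconstructed), and using the symbol h_ij that reappears in the EM vs CM comparison below, the E-step computes

h_{ij} \equiv E\!\left[z_{ij} \mid x_i;\Psi^{(n)}\right] = P_{Z|X}\!\left(j \mid x_i;\Psi^{(n)}\right) = \frac{\pi_j^{(n)}\,\mathcal{G}(x_i;\mu_j^{(n)},\Sigma_j^{(n)})}{\sum_{k=1}^{C} \pi_k^{(n)}\,\mathcal{G}(x_i;\mu_k^{(n)},\Sigma_k^{(n)})}.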

EM for mixtures (step 2): defining h_ij as this posterior, the Q function is obtained by replacing each z_ij in the complete-data log-likelihood by its expected value h_ij.
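That is (standard form, reconstructed):

Q(\Psi;\Psi^{(n)}) = \sum_{i=1}^{N} \sum_{j=1}^{C} h_{ij} \left[\log\pi_j + \log\mathcal{G}(x_i;\mu_j,\Sigma_j)\right].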

Expectation-maximization (recap of the recipe): 1. write down the likelihood of the COMPLETE data; 2. E-step: write down the Q function, i.e. its expectation given the observed data; 3. M-step: solve the maximization, deriving a closed-form solution if there is one.

EM vs CM: let's compare this with the CM algorithm. The C-step assigns each point to the class of largest posterior; the E-step assigns the point to all classes, with weights given by the posteriors. For this reason, EM is said to make soft assignments: it does not commit to any of the classes (unless the posterior is one for that class), i.e. it is less greedy. We no longer partition the space into rigid cells; the boundaries are now soft.

EM vs CM: what about the M-steps? The CM update uses the hard class assignments, while the EM update uses the soft weights h_ij. These become the same if we threshold the h_ij so that, for each i, the largest h_ij is set to 1 and all other h_ij to 0: the M-steps are the same up to the difference of assignments.
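As an illustration of "the same up to assignments", compare the two mean updates (standard forms, not transcribed from the slide): CM averages over its hard partition, EM computes a weighted average,

\mu_j^{CM} = \frac{1}{|D_j|}\sum_{x_i \in D_j} x_i, \qquad \mu_j^{EM} = \frac{\sum_i h_{ij}\, x_i}{\sum_i h_{ij}},

and thresholding the h_{ij} to 0/1 turns the second expression into the first.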

EM for Gaussian mixtures. In summary: CM = EM + hard assignments, so CM is a special case and cannot be better. Let's look at the special case of Gaussian mixtures. E-step: compute the posterior assignment probabilities h_ij given above.

M-step for Gaussian mixtures. Important note: in the M-step, the optimization must be subject to whatever constraints may hold; in particular, we always have the constraint that the class priors sum to one, Σ_j π_j = 1. As usual, we introduce a Lagrangian.

M-step for Gaussian mixtures: write down the Lagrangian and set its derivatives to zero.
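The Lagrangian and the resulting conditions were lost in the transcription; in standard form (my reconstruction, with a multiplier λ for the prior constraint):

L(\Psi,\lambda) = \sum_{i=1}^{N} \sum_{j=1}^{C} h_{ij}\left[\log\pi_j + \log\mathcal{G}(x_i;\mu_j,\Sigma_j)\right] + \lambda\Big(1 - \sum_{j=1}^{C}\pi_j\Big),

\frac{\partial L}{\partial \pi_j} = \frac{\sum_i h_{ij}}{\pi_j} - \lambda = 0.

Summing this condition over j and using \sum_j \pi_j = 1 and \sum_j h_{ij} = 1 gives \lambda = N; the derivatives with respect to \mu_j and \Sigma_j give the usual weighted Gaussian ML conditions.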

M-step for Gaussian mixtures: this leads to the update equations below. Comparing to those of CM, they are the same up to hard vs soft assignments.
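The update equations themselves did not survive the transcription; in the standard form implied by the derivation above they are

\pi_j^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N} h_{ij}, \qquad \mu_j^{(n+1)} = \frac{\sum_i h_{ij}\, x_i}{\sum_i h_{ij}}, \qquad \Sigma_j^{(n+1)} = \frac{\sum_i h_{ij}\,(x_i - \mu_j^{(n+1)})(x_i - \mu_j^{(n+1)})^T}{\sum_i h_{ij}}.

Putting the E-step and M-step together, here is a compact sketch of EM for a Gaussian mixture in Python; this is my own illustration (the names, the SciPy dependency, and the small covariance regularizer are mine, not from the lecture):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, C, n_iter=100, seed=0):
    """EM for a C-component Gaussian mixture; X has shape (N, d)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(C, 1.0 / C)                                    # uniform priors
    mu = X[rng.choice(N, size=C, replace=False)].astype(float)  # means from data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * C)      # shared initial covariance
    for _ in range(n_iter):
        # E-step: soft assignments h[i, j] = P(z_i = j | x_i; current parameters)
        h = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(C)], axis=1)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the complete-data ML estimates
        Nj = h.sum(axis=0)                                      # effective counts per component
        pi = Nj / N
        mu = (h.T @ X) / Nj[:, None]
        for j in range(C):
            Xc = X - mu[j]
            Sigma[j] = (h[:, j, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, h

Thresholding h to 0/1 inside the loop would turn this into the CM algorithm discussed earlier.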
