Expectation Maximization. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University. April 10th, 2006 1

Announcements. Reminder: project milestone due Wednesday at the beginning of class. 2

Coordinate descent algorithms. Want: $\min_a \min_b F(a,b)$. Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to an (often good) local optimum, as we saw in the applet (play with it!). K-means is a coordinate descent algorithm! 3
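To make the connection concrete, here is a minimal NumPy sketch (not part of the original lecture; the function name and interface are my own) of K-means written explicitly as coordinate descent on F(assignments, centroids):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """K-means as coordinate descent on F(a, mu) = sum_j ||x_j - mu_{a_j}||^2:
    alternately minimize over assignments a (centroids fixed) and over
    centroids mu (assignments fixed)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initialize centroids at random points
    for _ in range(n_iters):
        # Fix mu, minimize over a: assign each point to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        a = dists.argmin(axis=1)
        # Fix a, minimize over mu: each centroid becomes the mean of its points.
        new_mu = np.array([X[a == i].mean(axis=0) if np.any(a == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):  # no further improvement: a local optimum
            break
        mu = new_mu
    return a, mu
```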

Expectation Maximization 4

Back to Unsupervised Learning of GMMs: a simple case. Remember: We have unlabeled data x_1, x_2, ..., x_m. We know there are k classes. We know P(y=1), P(y=2), ..., P(y=k). We don't know mu_1, mu_2, ..., mu_k. We can write
$$P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^{m} p(x_j \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^{m} \sum_{i=1}^{k} p(x_j \mid \mu_i)\, P(y=i) \propto \prod_{j=1}^{m} \sum_{i=1}^{k} \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$ 5

EM for simple case of GMMs: The E-step. If we know mu_1, ..., mu_k, we can easily compute the probability that point x_j belongs to class y=i:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$ 6

EM for simple case of GMMs: The M-step. If we know the probability that point x_j belongs to class y=i, the MLE for mu_i is a weighted average: imagine k copies of each x_j, each with weight P(y=i | x_j):
$$\mu_i = \frac{\sum_{j=1}^{m} P(y=i \mid x_j)\, x_j}{\sum_{j=1}^{m} P(y=i \mid x_j)}$$ 7

E.M. for GMMs. E-step: compute the expected class of every datapoint, for each class (just evaluate a Gaussian at x_j):
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right) P(y=i)$$
M-step: compute the maximum-likelihood mu given our data's class membership distributions:
$$\mu_i = \frac{\sum_{j=1}^{m} P(y=i \mid x_j)\, x_j}{\sum_{j=1}^{m} P(y=i \mid x_j)}$$ 8
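The E-step and M-step above fit in a few lines of NumPy. The sketch below is mine, not the lecture's; it assumes a shared, known spherical variance sigma^2 and known class priors, matching this simple case, and simply alternates the two updates:

```python
import numpy as np

def em_simple_gmm(X, priors, sigma, n_iters=50, seed=0):
    """EM for the simple GMM case on these slides: k classes with known priors
    P(y=i), shared spherical variance sigma^2, and unknown means mu_i."""
    m, _ = X.shape
    k = len(priors)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]  # initialize means at random datapoints
    for _ in range(n_iters):
        # E-step: P(y=i | x_j) proportional to exp(-||x_j - mu_i||^2 / (2 sigma^2)) P(y=i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)    # shape (m, k)
        log_r = -sq / (2 * sigma ** 2) + np.log(priors)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                            # class responsibilities
        # M-step: mu_i = weighted average of the x_j with weights P(y=i | x_j)
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
    return mu, r
```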

E.M. for General GMMs. Iterate. On the t-th iteration let our estimates be
$$\lambda_t = \{\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_k^{(t)}, \Sigma_1^{(t)}, \Sigma_2^{(t)}, \ldots, \Sigma_k^{(t)}, p_1^{(t)}, p_2^{(t)}, \ldots, p_k^{(t)}\}$$
E-step: compute the expected class of every datapoint, for each class (just evaluate a Gaussian at x_j); p_i^{(t)} is shorthand for the estimate of P(y=i) on the t-th iteration:
$$P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})$$
M-step: compute the maximum-likelihood parameters given our data's class membership distributions:
$$\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)} \qquad \Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\,[x_j - \mu_i^{(t+1)}][x_j - \mu_i^{(t+1)}]^T}{\sum_j P(y=i \mid x_j, \lambda_t)} \qquad p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m}$$
where m = #records. 9
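Here is a short NumPy/SciPy sketch of the general updates above (again mine, not the lecture's; the initialization and the small ridge added to the covariances for numerical stability are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a general GMM: unknown means mu_i, covariances Sigma_i,
    and mixing weights p_i."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    p = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibilities P(y=i | x_j, lambda_t) prop. to p_i N(x_j; mu_i, Sigma_i)
        r = np.stack([p[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                      for i in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE of means, covariances, and mixing weights
        Nk = r.sum(axis=0)                        # "expected counts" per class
        mu = (r.T @ X) / Nk[:, None]
        for i in range(k):
            Xc = X - mu[i]
            Sigma[i] = (r[:, i, None] * Xc).T @ Xc / Nk[i] + 1e-6 * np.eye(d)
        p = Nk / m
    return mu, Sigma, p, r
```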

Gaussian Mixture Example: Start 10

After first iteration 11

After 2nd iteration 12

After 3rd iteration 13

After 4th iteration 14

After 5th iteration 15

After 6th iteration 16

After 20th iteration 17

Some Bio Assay data 18

GMM clustering of the assay data 19

Resulting Density Estimator 20

Three classes of assay (each learned with its own mixture model) 21

Resulting Bayes Classifier 22

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness. Yellow means anomalous; cyan means ambiguous. 23

The general learning problem with missing data. Marginal likelihood: x is observed, z is missing. 24
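The marginal log likelihood the slide refers to (the equation itself was a figure in the original deck; this is the standard form) is
$$\ell(\theta; \mathcal{D}) = \sum_{j=1}^{m} \log P(x_j \mid \theta) = \sum_{j=1}^{m} \log \sum_{z} P(x_j, z \mid \theta)$$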

E-step. x is observed, z is missing. Compute the probability of the missing data given the current choice of theta: Q(z | x_j) for each x_j (e.g., the probability computed during a classification step); this corresponds to the classification step in K-means. 25

Jensen's inequality. Theorem: $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$ 26

Applying Jensen's inequality. Use: $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$ 27
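The bound derived on this slide (the derivation was a figure in the original deck) can be reconstructed in the standard way: introduce any distribution Q(z | x_j) over the missing variable and apply the inequality with P = Q and f = P(x_j, z | theta) / Q(z | x_j):
$$\ell(\theta; \mathcal{D}) = \sum_{j=1}^{m} \log \sum_{z} Q(z \mid x_j)\, \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)} \;\ge\; \sum_{j=1}^{m} \sum_{z} Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}$$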

The M-step maximizes the lower bound on weighted data. The lower bound from Jensen's corresponds to the weighted dataset:
<x_1, z=1> with weight Q^(t+1)(z=1 | x_1)
<x_1, z=2> with weight Q^(t+1)(z=2 | x_1)
<x_1, z=3> with weight Q^(t+1)(z=3 | x_1)
<x_2, z=1> with weight Q^(t+1)(z=1 | x_2)
<x_2, z=2> with weight Q^(t+1)(z=2 | x_2)
<x_2, z=3> with weight Q^(t+1)(z=3 | x_2) 28

The M-step. Maximization step: use expected counts instead of counts. If learning requires Count(x,z), use E_{Q^(t+1)}[Count(x,z)]. 29

Convergence of EM. Define the potential function F(theta, Q). EM corresponds to coordinate ascent on F, and thus maximizes a lower bound on the marginal log likelihood. 30
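The potential function itself appeared as a figure in the original slide; the standard definition, consistent with the Jensen bound above, is
$$F(\theta, Q) = \sum_{j=1}^{m} \sum_{z} Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)} \;\le\; \ell(\theta; \mathcal{D})$$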

M-step is easy. Using the potential function: 31

E-step also doesn't decrease the potential function (1). Fixing theta to theta^(t): 32

KL-divergence. Measures the distance between distributions; KL is zero if and only if Q = P. 33
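For reference (the formula was a figure in the original slide), the KL divergence between Q and P is
$$KL(Q \,\|\, P) = \sum_{z} Q(z) \log \frac{Q(z)}{P(z)} \;\ge\; 0$$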

E-step also doesn't decrease the potential function (2). Fixing theta to theta^(t): 34

E-step also doesn't decrease the potential function (3). Fixing theta to theta^(t) and maximizing F(theta^(t), Q) over Q sets Q to the posterior probability: Q^(t+1)(z | x_j) = P(z | x_j, theta^(t)). Note that with this choice the bound is tight: F(theta^(t), Q^(t+1)) equals the marginal log likelihood at theta^(t). 35

EM is coordinate ascent. M-step: fix Q, maximize F over theta (a lower bound on the marginal log likelihood). E-step: fix theta, maximize F over Q; this realigns F with the likelihood. 36

What you should know. K-means for clustering: the algorithm converges because it's coordinate ascent. EM for mixtures of Gaussians: how to learn maximum-likelihood parameters (locally maximum likelihood) in the case of unlabeled data. Be happy with this kind of probabilistic analysis. Remember, E.M. can get stuck in local optima, and empirically it DOES. EM is coordinate ascent. General case for EM. 37

Acknowledgements. The K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/ K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html 38

EM for HMMs, a.k.a. The Baum-Welch Algorithm. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University. April 10th, 2006 39

Learning HMMs from fully observable data is easy. [Figure: HMM with hidden states X_1, ..., X_5, each taking values in {a, ..., z}, and corresponding observations O_1, ..., O_5.] Learn 3 distributions: P(X_1), P(X_{t+1} | X_t), P(O_t | X_t). 40

Learning HMMs from fully observable data is easy. [Figure: same HMM, hidden states X_1, ..., X_5 over {a, ..., z} with observations O_1, ..., O_5.] Learn 3 distributions. What if O is observed, but X is hidden? 41
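When both X and O are observed, the three distributions are just normalized counts. A minimal sketch (mine, not the lecture's; it assumes states and observations are encoded as small integers and that every state appears in the data):

```python
import numpy as np

def mle_fully_observed_hmm(state_seqs, obs_seqs, n_states, n_obs):
    """MLE for an HMM from fully observed sequences: count and normalize."""
    pi = np.zeros(n_states)                 # P(X_1)
    A = np.zeros((n_states, n_states))      # P(X_{t+1} = a | X_t = b) stored as A[b, a]
    B = np.zeros((n_states, n_obs))         # P(O_t = o | X_t = b) stored as B[b, o]
    for xs, obs in zip(state_seqs, obs_seqs):
        pi[xs[0]] += 1
        for t in range(len(xs) - 1):
            A[xs[t], xs[t + 1]] += 1        # transition counts
        for x, o in zip(xs, obs):
            B[x, o] += 1                    # emission counts
    # Normalize counts into conditional distributions.
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```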

Log likelihood for HMMs with hidden X. Marginal likelihood: O is observed, X is missing. For simplicity of notation, we'll assume the training data consists of only one sequence; if there were m sequences, we would sum the per-sequence log likelihoods. 42
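The corresponding equations (figures in the original slides) take the standard form
$$\ell(\theta; o) = \log P(o \mid \theta) = \log \sum_{x} P(x, o \mid \theta), \qquad \text{and for } m \text{ sequences:} \quad \ell(\theta) = \sum_{j=1}^{m} \log \sum_{x} P(x, o^{(j)} \mid \theta)$$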

E-step. [Figure: same HMM as before.] The E-step computes the probability of the hidden variables x given o. This corresponds to inference: use the forward-backward algorithm! 43

The M-step. [Figure: same HMM as before.] Maximization step: use expected counts instead of counts. If learning requires Count(x,o), use E_{Q^(t+1)}[Count(x,o)]. 44

Starting state probability P(X_1). Using expected counts: P(X_1 = a) = theta_{X_1 = a}. 45

Transition probability P(X_{t+1} | X_t). Using expected counts: P(X_{t+1} = a | X_t = b) = theta_{X_{t+1} = a | X_t = b}. 46

Observation probability P(O_t | X_t). Using expected counts: P(O_t = a | X_t = b) = theta_{O_t = a | X_t = b}. 47
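The explicit expected-count updates appeared as figures in the original slides; the standard Baum-Welch forms, reconstructed here for a single training sequence o of length n, are
$$\theta_{X_1=a} = Q(X_1 = a \mid o), \qquad \theta_{X_{t+1}=a \mid X_t=b} = \frac{\sum_{t=1}^{n-1} Q(X_{t+1}=a, X_t=b \mid o)}{\sum_{t=1}^{n-1} Q(X_t=b \mid o)}, \qquad \theta_{O=a \mid X=b} = \frac{\sum_{t=1}^{n} Q(X_t=b \mid o)\, \mathbb{1}[o_t = a]}{\sum_{t=1}^{n} Q(X_t=b \mid o)}$$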

E-step revisited. [Figure: same HMM as before.] The E-step computes the probability of the hidden variables x given o. Must compute: Q(x_t = a | o), the marginal probability of each position, and Q(x_{t+1} = a, x_t = b | o), the joint distribution between pairs of positions. 48

The forwards-backwards algorithm. [Figure: same HMM as before.] Forwards pass: after initialization, for i = 2 to n, generate a forwards factor by eliminating X_{i-1}. Backwards pass: after initialization, for i = n-1 down to 1, generate a backwards factor by eliminating X_{i+1}. For every i, the desired probability is obtained by multiplying the forwards and backwards factors at position i and normalizing. 49
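A compact NumPy sketch of those two passes (mine, not the lecture's; it uses the same pi, A, B parameterization as the counting sketch above and returns the per-position posteriors Q(x_i | o)):

```python
import numpy as np

def forwards_backwards(obs, pi, A, B):
    """Forward-backward for an HMM: alpha[i] is the forwards factor obtained by
    eliminating X_1..X_{i-1}; beta[i] is the backwards factor obtained by
    eliminating X_{i+1}..X_n. Their product, normalized, gives Q(x_i | o)."""
    n, k = len(obs), len(pi)
    alpha = np.zeros((n, k))
    beta = np.zeros((n, k))
    # Forwards pass: initialize with P(X_1) P(o_1 | X_1), then eliminate X_{i-1}.
    alpha[0] = pi * B[:, obs[0]]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ A) * B[:, obs[i]]
    # Backwards pass: initialize with ones, then eliminate X_{i+1}.
    beta[n - 1] = np.ones(k)
    for i in range(n - 2, -1, -1):
        beta[i] = A @ (B[:, obs[i + 1]] * beta[i + 1])
    # For every i, Q(x_i = a | o) is proportional to alpha[i, a] * beta[i, a].
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

For long sequences one would rescale alpha and beta at each step (or work in log space) to avoid numerical underflow; the unscaled version above keeps the correspondence with the slide's factors as direct as possible.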

E-step revisited. [Figure: same HMM as before.] The E-step computes the probability of the hidden variables x given o. Must compute: Q(x_t = a | o), the marginal probability of each position (just forwards-backwards!), and Q(x_{t+1} = a, x_t = b | o), the joint distribution between pairs of positions (homework!). 50