Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9

Learning Parameters to Probability Distributions Given fully observed training data, setting parameters θ for Bayes nets is straightforward. However, in many settings not all variables are observed (labelled) in the training data: x_i = (x_i, h_i). e.g. Speech recognition: have speech signals, but not phoneme labels. e.g. Object recognition: have object labels (car, bicycle), but not part labels (wheel, door, seat). Unobserved variables are called latent variables. (Figures: output of the feature detector on motorbike images, from Fergus et al.)

Latent Variables and Simplicity Latent variables can explain observed correlations with a simple model. Fewer parameters. Common in science: the heart, genes, energy, gravity,... (Figure: two Bayes nets for heart disease. (a) With a latent HeartDisease variable: 2 + 2 + 2 + 54 + 6 + 6 + 6 = 78 parameters. (b) Fully observed: 2 + 2 + 2 + 54 + 162 + 486 = 708 parameters. Fig. Russell and Norvig 20.1)

Latent Variable Models: Pros Statistically powerful, often good predictions. Many applications: Learning with missing data. Clustering: missing cluster label for data points. Principal Component Analysis: data points are generated in a linear fashion from a small set of unobserved components. (more later) Matrix Factorization, Recommender Systems: assign users to unobserved user types, assign items to unobserved item types; use the similarity between user type and item type to predict the preference of a user for an item. Winner of the $1M Netflix challenge. If latent variables have an intuitive interpretation (e.g., "action movie" factors), this discovers new features.

Latent Variable Models: Cons From a user's point of view, like a black box if the latent variables don't have an intuitive interpretation. Statistically, hard to guarantee convergence to a correct model with more data (the identifiability problem). Harder computationally: usually no closed form for maximum likelihood estimates. However, the Expectation-Maximization algorithm provides a general-purpose local search algorithm for learning parameters in probabilistic models with latent variables.

Outline K-Means EM Example: Gaussian Mixture Models The Expectation Maximization Algorithm

Unsupervised Learning We will start with an unsupervised learning (clustering) problem: given a dataset {x_1,..., x_N}, each x_i ∈ R^D, partition the dataset into K clusters. Intuitively, a cluster is a group of points which are close together and far from others.

Distortion Measure Formally, introduce prototypes (or cluster centers) µ_k ∈ R^D. Use binary indicators r_nk: r_nk = 1 if point n is in cluster k, 0 otherwise (1-of-K coding scheme again). Find {µ_k}, {r_nk} to minimize the distortion measure: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||^2
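
As a quick illustration (not from the slides), a NumPy sketch of computing J, assuming X is an (N, D) array of points, mu a (K, D) array of prototypes, and r an (N, K) 0/1 membership matrix:

```python
import numpy as np

def distortion(X, mu, r):
    # squared distances ||x_n - mu_k||^2 for every (n, k) pair, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # J = sum_n sum_k r_nk ||x_n - mu_k||^2
    return (r * sq_dists).sum()
```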

Minimizing Distortion Measure Minimizing J directly is hard: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||^2. However, two things are easy: if we know µ_k, minimizing J wrt r_nk; if we know r_nk, minimizing J wrt µ_k. This chicken-and-egg scenario suggests an iterative procedure. Start with an initial guess for µ_k. Iteration of two steps: minimize J wrt r_nk; minimize J wrt µ_k. Rinse and repeat until convergence.


Determining Membership Variables Step 1 in an iteration of K-means is to minimize the distortion measure J wrt the cluster membership variables r_nk: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||^2. Terms for different data points x_n are independent, so for each data point set r_nk to minimize Σ_{k=1}^K r_nk ||x_n − µ_k||^2. How? Simply set r_nk = 1 for the cluster center µ_k with the smallest distance.

Determining Cluster Centers Step 2: fix r_nk, minimize J wrt the cluster centers µ_k: J = Σ_{k=1}^K Σ_{n=1}^N r_nk ||x_n − µ_k||^2 (switch order of sums). Minimize wrt each µ_k separately (how?). Take the derivative and set it to zero: 2 Σ_{n=1}^N r_nk (x_n − µ_k) = 0, so µ_k = Σ_n r_nk x_n / Σ_n r_nk, i.e. the mean of the data points x_n assigned to cluster k.

K-means Algorithm Start with an initial guess for µ_k. Iteration of two steps: minimize J wrt r_nk (assign points to the nearest cluster center); minimize J wrt µ_k (set each cluster center to the average of the points in its cluster). Rinse and repeat until convergence.
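
A minimal sketch of this loop in NumPy; the function name, initialization and convergence test are my own choices, not from the lecture:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)]      # initial guess for the centers
    for _ in range(n_iters):
        # Step 1: minimize J wrt r_nk -- assign each point to its nearest center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        labels = dists.argmin(axis=1)
        # Step 2: minimize J wrt mu_k -- move each center to the mean of its points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                    # no change -> converged
            break
        mu = new_mu
    return mu, labels
```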

K-means example (figure: panel (a))

K-means example (figure: panel (b))

K-means example (figure: panel (c))

K-means example (figure: panel (d))

K-means example (figure: panel (e))

K-means example (figure: panel (f))

K-means example (figure: panel (g))

K-means example (figure: panel (h))

K-means example (figure: panel (i)) Next step doesn't change membership: stop

K-means Convergence Repeat the steps until there is no change in cluster assignments. For each step, the value of J either goes down, or we stop. There is a finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually. Note it may be a local minimum of J rather than the global minimum to which we converge.

K-means Example - Image Segmentation Original image. K-means clustering on pixel colour values. Pixels in a cluster are coloured by the cluster mean. Represent each pixel (e.g. a 24-bit colour value) by a cluster number (e.g. 4 bits for K = 10): a compressed version.
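
A rough sketch of how this might be done with SciPy's kmeans2; the random array stands in for a real (H, W, 3) image, and K = 10 is just an illustrative choice:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

img = np.random.rand(64, 64, 3)                        # placeholder for a real RGB image
pixels = img.reshape(-1, 3)                            # one row per pixel
centers, labels = kmeans2(pixels, 10, minit='points')  # K = 10 colour clusters
compressed = centers[labels].reshape(img.shape)        # recolour each pixel by its cluster mean
```

Storing the per-pixel cluster labels plus the K cluster means is what gives the compression described above.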

K-means Generalized: the set-up Let's generalize the idea. Suppose we have the following set-up. x denotes all observed variables (e.g., data points). Z denotes all latent (hidden, unobserved) variables (e.g., cluster means). J(x, Z | θ) measures the goodness of an assignment of latent variables given the data points and the parameters θ. e.g., J = the negative distortion measure; parameters = assignment of points to clusters. It's easy to maximize J(x, Z | θ) wrt θ for fixed Z. It's easy to maximize J(x, Z | θ) wrt Z for fixed θ.

Outline K-Means EM Example: Gaussian Mixture Models The Expectation Maximization Algorithm

Hard Assignment vs. Soft Assignment In the K-means algorithm, a hard assignment of points to clusters is made. However, for points near the decision boundary, this may not be a good idea. Instead, we could think about making a soft assignment of points to clusters.

Gaussian Mixture Model The Gaussian mixture model (or mixture of Gaussians, MoG) models the data as a combination of Gaussians. (Figure: a: constant density contours; b: marginal probability p(x); c: surface plot.) Widely used general approximation for multi-modal distributions. See text Sec. 2.2.6 for the kernel version.

Gaussian Mixture Model The figure shows a dataset generated by drawing samples from three different Gaussians.

Generative Model The mixture of Gaussians is a generative model. To generate a datapoint x, we first generate a value C = i for the component/cluster, C ∈ {1, 2, 3}. We then generate a vector x using P(x | C = i) = N(x | µ_i, Σ_i), i.e. the Gaussian parameters for component i.
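
A small sketch of this ancestral sampling procedure; the mixture parameters w, mu, Sigma below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])                                 # mixing weights P(C = i)
mu = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])           # component means
Sigma = [0.5 * np.eye(2), 1.0 * np.eye(2), 0.3 * np.eye(2)]   # component covariances

def sample_mog(n):
    comps = rng.choice(len(w), size=n, p=w)                   # first draw the component C = i
    # then draw x ~ N(mu_i, Sigma_i) for the chosen component
    return np.array([rng.multivariate_normal(mu[c], Sigma[c]) for c in comps]), comps

X, C = sample_mog(500)
```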

MoG Marginal over Observed Variables The marginal distribution p(x | θ) for this model with parameter values θ is: p(x) = Σ_{i=1}^3 p(x, C = i) = Σ_{i=1}^3 P(C = i) p(x | C = i) = Σ_{i=1}^3 w_i N(x | µ_i, Σ_i), a mixture of Gaussians.
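
A sketch of evaluating this marginal density with SciPy; it assumes w, mu, Sigma are lists/arrays of mixture parameters as in the sampling sketch above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(x, w, mu, Sigma):
    # p(x) = sum_i w_i N(x | mu_i, Sigma_i)
    return sum(w_i * multivariate_normal.pdf(x, mean=m_i, cov=S_i)
               for w_i, m_i, S_i in zip(w, mu, Sigma))

# e.g. mog_density([0.5, 0.5], w, mu, Sigma)
```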

Learning Gaussian Mixture Models The EM algorithm can be used to learn Bayes net parameters with unobserved variables. 1. Initialize the parameters randomly (or using k-means). 2. Repeat the following: Expectation (E-step): find the probability of cluster assignments. Maximization (M-step): update the parameter values.

E-step: Probability of Cluster Assignments How to find p_ij = P(C = i | x_j)? Use Bayes' Theorem: p_ij ∝ P(x_j | C = i) P(C = i). Bayes net inference!
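
A possible NumPy/SciPy sketch of this E-step, returning a (K, N) matrix of responsibilities p_ij; the function and argument names are mine:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, w, mu, Sigma):
    # p[i, j] = P(C = i | x_j), proportional to P(x_j | C = i) P(C = i)
    K = len(w)
    p = np.stack([w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
                  for i in range(K)])           # unnormalized, shape (K, N)
    return p / p.sum(axis=0, keepdims=True)     # normalize over components i
```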

EM for Gaussian Mixtures: Update Parameters Update the parameters wrt the soft cluster assignments p_ij: n_i = Σ_j p_ij; w_i = P(C = i) := n_i / N; µ_i := Σ_j p_ij x_j / n_i; Σ_i := Σ_j p_ij (x_j − µ_i)(x_j − µ_i)^T / n_i. Here n_i is the effective (estimated) number of data points from component i.
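
A sketch of these M-step updates, assuming p is the (K, N) responsibility matrix from the E-step sketch above:

```python
import numpy as np

def m_step(X, p):
    # p has shape (K, N): p[i, j] = P(C = i | x_j) from the E-step
    N = X.shape[0]
    n = p.sum(axis=1)                        # n_i = sum_j p_ij
    w = n / N                                # w_i = P(C = i) = n_i / N
    mu = (p @ X) / n[:, None]                # mu_i = sum_j p_ij x_j / n_i
    Sigma = []
    for i in range(len(n)):
        d = X - mu[i]                        # centred data, shape (N, D)
        Sigma.append((p[i][:, None] * d).T @ d / n[i])   # weighted outer products
    return w, mu, np.array(Sigma)
```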

MoG EM - Example (figure: panel (a)) Same initialization as with K-means before. Often, K-means is actually used to initialize EM.

MoG EM - Example (figure: panel (b)) Calculate the cluster probabilities p_ij, shown using mixtures of colours.

MoG EM - Example (figure: panel (c)) Calculate the model parameters {P(C = i), µ_i, Σ_i} using the cluster probabilities.

MoG EM - Example (figure: panel (d)) Iteration 2

MoG EM - Example (figure: panel (e)) Iteration 5

MoG EM - Example (figure: panel (f)) Iteration 20 - converged

EM and Maximum Likelihood 1. The EM procedure is guaranteed to increase the data log-likelihood ln p(D | θ) at each step. 2. Therefore it converges to a local maximum of the log-likelihood. 3. The data log-likelihood of the true model is shown in the figure for comparison. (Figure: log-likelihood L vs. iteration number. Fig. Russell and Norvig 20.12)
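
To make the monotone increase concrete, here is a sketch of a full EM loop that tracks ln p(D | θ) per iteration; it assumes the responsibilities() and m_step() sketches defined earlier, and the parameter names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, w, mu, Sigma):
    # ln p(D | theta) = sum_n ln sum_i w_i N(x_n | mu_i, Sigma_i)
    dens = sum(w_i * multivariate_normal.pdf(X, mean=m_i, cov=S_i)
               for w_i, m_i, S_i in zip(w, mu, Sigma))
    return np.log(dens).sum()

def em(X, w, mu, Sigma, n_iters=50):
    history = []
    for _ in range(n_iters):
        p = responsibilities(X, w, mu, Sigma)   # E-step (sketch defined earlier)
        w, mu, Sigma = m_step(X, p)             # M-step (sketch defined earlier)
        history.append(log_likelihood(X, w, mu, Sigma))
    return (w, mu, Sigma), history              # history should be non-decreasing
```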

Outline K-Means EM Example: Gaussian Mixture Models The Expectation Maximization Algorithm

Hard EM Algorithm We assume a probabilistic model, specifically the complete-data likelihood function P(x, Z = z | θ). Goodness of the model is the log-likelihood L(x, Z = z | θ) ≡ ln P(x, Z = z | θ). Guess the value for the latent variables that is the expected value given the current parameter settings: E[Z], where the distribution over latent variables is given by P(Z | x, θ^(i)). Given the latent variable values, the next parameter values θ^(i+1) are evaluated by maximizing the likelihood L(x, Z = E[Z] | θ). Example: fill in missing values, single variable, Gaussian model.
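
A toy sketch of that missing-value example (hard EM for a single Gaussian variable); the data values are made up:

```python
import numpy as np

x = np.array([1.2, 0.8, np.nan, 1.5, np.nan, 0.9])   # nan marks a missing observation
mu, var = 0.0, 1.0                                   # initial parameter guess
for _ in range(20):
    filled = np.where(np.isnan(x), mu, x)            # "E-step": fill missing values with the current mean
    mu, var = filled.mean(), filled.var()            # "M-step": re-estimate mean and variance
```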

Generalized EM Algorithm In Bayesian fashion, do not guess a best value for the latent variables Z. Instead, average over the distribution P(Z | x, θ^(i)) given the current hypothesis. Given a latent variable distribution, the next parameter values θ^(i+1) are found by maximizing the expected likelihood L(x, Z = z | θ) over all possible latent variable settings. In one equation: θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ)

EM Algorithm: The procedure 1. Guess an initial parameter setting θ^(0). 2. Repeat until convergence: 2.1 The E-step: evaluate P(Z = z | x, θ^(i)). 2.2 The M-step: evaluate the function Q(θ, θ^(i)) ≡ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ); maximize Q(θ, θ^(i)) wrt θ; update θ^(i+1) to be the maximizer.

EM for Gaussian Mixtures: E-step (I) Want to evaluate Q(θ, θ^(i)) ≡ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ). Let Z_ij = 1 denote the event that data point j is assigned to cluster i. Then L(x, Z = z | θ) = ln Π_i Π_j P(x_j | C = i, θ)^(z_ij) P(C = i | θ)^(z_ij) = Σ_i Σ_j z_ij [ln P(x_j | C = i, θ) + ln P(C = i | θ)] ≡ Σ_i Σ_j z_ij l_ij

EM for Gaussian Mixtures: E-step (II) Want to evaluate Q(θ, θ^(i)) ≡ Σ_z P(Z = z | x, θ^(i)) Σ_i Σ_j z_ij l_ij. This is the expectation of a sum = sum of expectations (see assignment). Therefore Σ_z P(Z = z | x, θ^(i)) Σ_i Σ_j z_ij l_ij = Σ_i Σ_j E_{Z_ij | x, θ^(i)}[Z_ij l_ij] = Σ_i Σ_j P(Z_ij = 1 | x, θ^(i)) l_ij = Σ_i Σ_j p_ij l_ij

EM for Gaussian Mixtures: E-step (III) Want to evaluate Q(θ, θ^(i)) ≡ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ). We showed that this is equal to Σ_i Σ_j p_ij l_ij. Deriving the maximization solution is more or less straightforward and leads to the updates shown before. Variations on this theme work for many models where the likelihood function factors (think Bayes nets).

EM - Summary EM finds a local maximum of the (incomplete) data likelihood P(x | θ) = Σ_Z P(x, Z | θ). It iterates two steps: the E-step calculates the distribution of the missing variables Z (hard EM fills in the variables); the M-step maximizes the expected complete log-likelihood (expectation wrt the E-step distribution). Often the updates can be obtained in closed form, e.g. for discrete Bayesian networks (see Ch. 20.3.2).

Conclusion K-means clustering. Gaussian mixture model. What about K? Model selection: either cross-validation or a Bayesian version (average over all values of K). Expectation-maximization: a general method for learning parameters of models when not all variables are observed.