CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

Partially Observed GMs: Speech recognition

Partially Observed GMs: Evolution

Partially Observed GMs: Mixture models

Partially Observed GMs. A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians); each mode may correspond to a different sub-population.

Unobserved Variables. A variable can be unobserved (latent) because: it is an imaginary quantity meant to provide a simplified but abstract view of the data-generation process (as in mixture models); it is a real-world object or phenomenon that is difficult or impossible to measure (e.g., the causes of a disease, evolutionary ancestors); or it is a real-world object or phenomenon that simply was not measured (e.g., due to faulty sensors). Discrete latent variables can be used to cluster data into subgroups; continuous latent variables (factors) can be used for dimensionality reduction.

Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components. This model can be used for unsupervised clustering.

Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components: Z is a latent class-indicator vector, and X is a conditional Gaussian variable with a class-specific mean and covariance. The likelihood follows by marginalizing out Z.
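The mixture density itself appears only as a figure in the original slides; a standard reconstruction, with z a 1-of-K indicator, is
\[
p(z^k = 1) = \pi_k, \qquad p(x \mid z^k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k).
\]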

Why is learning with latent variables harder? In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms (at least for Bayesian networks). With latent variables, all parameters become coupled via marginalization.
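To make the coupling explicit, the marginal log-likelihood (a standard form, not reproduced from the slide) puts the sum over z inside the log:
\[
\ell(\theta; D) = \sum_{n} \log p(x_n \mid \theta) = \sum_{n} \log \sum_{z_n} p(x_n, z_n \mid \theta),
\]
so the log no longer distributes over local factors and the component parameters interact.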

Towards Expectation-Maximization. Let's start from the MLE for completely observed data: write down the data log-likelihood and maximize it. What if we do not know z_n?
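For reference, a standard form of the complete-data log-likelihood for the GMM (the slide's own formula is a figure) is
\[
\ell_c(\theta; D) = \sum_n \sum_k z_n^k \log \pi_k + \sum_n \sum_k z_n^k \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\]
which decomposes over components when the z_n are observed and yields closed-form MLEs such as \hat{\pi}_k = \frac{1}{N}\sum_n z_n^k and \hat{\mu}_k = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}.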

Towards Expectation-Maximization. Let's solve this problem using Expectation-Maximization. Expectation: what do we take the expectation with? What do we take the expectation over? Maximization: what do we maximize? What do we maximize with respect to?

Back to Basics: K-Means
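As a refresher, here is a minimal K-means sketch in Python (NumPy only); the random initialization and the stopping rule are illustrative choices, not taken from the lecture.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at K distinct random data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels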

Expectation-Maximization. Start: guess the centroid μ_k and covariance Σ_k of each of the K clusters. Loop: alternate the E-step and M-step (detailed below) until convergence, as sketched next.
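A minimal Python sketch of this EM loop for a GMM, using scipy.stats for the Gaussian density; the initialization, the fixed iteration count, and the covariance regularizer are simplifying assumptions, not choices from the lecture.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: soft assignments (E-step), then parameter updates (M-step)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = X[rng.choice(N, size=K, replace=False)]  # means initialized at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = p(z_n = k | x_n, theta).
        tau = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates.
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, tau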

Gaussian Mixture Models (GMMs). Starting from the mixture likelihood above, we can write down the expected complete log-likelihood.
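A standard reconstruction of that quantity (the slide's formula is a figure), writing \tau_n^k = \langle z_n^k \rangle:
\[
\langle \ell_c(\theta) \rangle_q = \sum_n \sum_k \tau_n^k \log \pi_k + \sum_n \sum_k \tau_n^k \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
\]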

E-step. We maximize ⟨l_c(θ)⟩ using an iterative procedure. Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and μ). This is an inference step.
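Concretely, for the GMM the E-step computes the responsibilities (standard form, not reproduced from the slide):
\[
\tau_n^{k(t)} = p(z_n^k = 1 \mid x_n, \theta^{(t)})
= \frac{\pi_k^{(t)} \,\mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_j \pi_j^{(t)} \,\mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}.
\]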

M-step. Maximization step: re-estimate the parameters given the current expected values of the hidden variables. This is isomorphic to MLE, except that the hidden variables are replaced by their expectations.
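The resulting GMM updates (the standard closed-form solutions of the weighted MLE problem) are
\[
\pi_k^{(t+1)} = \frac{1}{N}\sum_n \tau_n^{k(t)}, \qquad
\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}, \qquad
\Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^\top}{\sum_n \tau_n^{k(t)}}.
\]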

K-Means vs. EM. The EM algorithm for mixtures of Gaussians is like a soft version of the K-means algorithm.

Complete and Incomplete Log-Likelihoods. Complete log-likelihood: let X denote the observable variable(s) and Z the latent variable(s). If Z could be observed, the MLE for fully observed models is straightforward; but when Z is not observed, the complete log-likelihood is a random quantity and cannot be maximized directly. Incomplete log-likelihood: with Z unobserved, our objective becomes the log of a marginal probability.
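In symbols (a reconstruction of the two definitions the slide refers to):
\[
\ell_c(\theta; x, z) = \log p(x, z \mid \theta), \qquad
\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta).
\]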

Expected Complete Log-Likelihood. For any distribution q(z), define the expected complete log-likelihood ⟨l_c(θ)⟩_q. This is a deterministic function of θ, and it factorizes in the same way as l_c. Does maximizing it yield a maximizer of the likelihood? Jensen's inequality provides the answer.
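The bound follows from Jensen's inequality applied to any q(z) (a standard derivation):
\[
\ell(\theta; x) = \log \sum_z q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
\;\ge\; \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
= \langle \ell_c(\theta; x, z) \rangle_q + H(q).
\]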

Free Energy. For fixed data x, define the free energy F(q, θ). The EM algorithm is coordinate ascent on F.
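A standard form of the free energy and of the coordinate-ascent view (the slide's expressions are figures):
\[
F(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
= \ell(\theta; x) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) \le \ell(\theta; x),
\]
with the E-step q^{(t+1)} = \arg\max_q F(q, \theta^{(t)}) and the M-step \theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta).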

E-step: maximization of the expected l with respect to q. Claim: the maximizing q is the posterior distribution over the latent variables given the data and the parameters. Proof sketch: this choice attains the bound l(θ; x) ≥ F(q, θ) with equality.

E-step = plug in the posterior expectation of the latent variables. Without loss of generality, assume that p(x, z | θ) is a generalized exponential-family distribution. The expected complete log-likelihood under the posterior q(z) = p(z | x, θ) is then obtained by plugging the posterior expectations of the sufficient statistics into the complete log-likelihood.

M-step: maximization of the expected l with respect to θ. The free energy breaks into two terms: the expected complete log-likelihood under q and the entropy of q. In the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term.
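Explicitly (standard decomposition; only the first term depends on θ):
\[
F(q, \theta) = \langle \log p(x, z \mid \theta) \rangle_q + H(q)
= \langle \ell_c(\theta; x, z) \rangle_q + H(q),
\qquad
\theta^{(t+1)} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{(t+1)}}.
\]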

Example: HMM. Supervised learning: estimation when the right answer is known. Given: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls. Unsupervised learning: estimation when the right answer is unknown. Given: 10,000 rolls of the casino player, but we don't see when he changes dice.

HMM: dynamic mixture models

The Baum-Welch algorithm. We first write the complete log-likelihood over the hidden state sequence y, and then its expectation under the posterior (the expected complete log-likelihood).
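A standard form of the HMM complete log-likelihood (the slide's expressions are figures), with hidden states y_{n,t} and observations x_{n,t}:
\[
\ell_c(\theta; x, y) = \sum_n \Big( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \Big).
\]
The expected complete log-likelihood replaces the state indicators (and pairs of consecutive indicators) by their posterior expectations, which the forward-backward algorithm computes.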

Unsupervised ML estimation. Given X = X_1 … X_n for which the true state path Y = Y_1 … Y_n is unknown, apply Expectation-Maximization: 1. Start with our best guess of a model M with parameters θ. 2. Estimate the expected counts A_ij and B_ik on the training data. 3. Update θ according to A_ij and B_ik (now a supervised learning problem). 4. Repeat steps 2 and 3 until convergence. This is the Baum-Welch algorithm.
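The expected counts in step 2 take the standard form (a reconstruction, with one-hot state indicators y_{n,t}^i and one-hot discrete observations x_{n,t}^k):
\[
A_{ij} = \sum_n \sum_t p\big(y_{n,t-1}^i = 1,\, y_{n,t}^j = 1 \mid x_n, \theta\big), \qquad
B_{ik} = \sum_n \sum_t p\big(y_{n,t}^i = 1 \mid x_n, \theta\big)\, x_{n,t}^k,
\]
and step 3 sets the transition and emission parameters proportional to these counts, exactly as in the supervised case but with expected rather than observed counts.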

EM for general Bayesian networks (BNs)

EM Algorithm. A way of maximizing the likelihood for latent-variable models. It finds the MLE of the parameters when the original (hard) problem can be broken into two easy pieces: estimate the missing or unobserved data from the observed data and the current parameters; then, using this completed data, find the maximum-likelihood parameter estimates. We alternate between filling in the latent variables with our best guess (the posterior) and updating the parameters based on this guess.
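In symbols, the alternation is the standard EM recursion:
\[
\text{E-step: } q^{(t+1)}(z) = p(z \mid x, \theta^{(t)}), \qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_\theta \big\langle \log p(x, z \mid \theta) \big\rangle_{q^{(t+1)}}.
\]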

Mixture of experts. We model P(Y | X) using different experts, each responsible for a different region of the input space. A latent variable Z chooses the expert via a softmax gating function, and each expert can be a linear regression model.
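One common parameterization consistent with this description (a hedged sketch; the lecture's exact notation is not reproduced): a softmax gate over the input and linear-Gaussian experts,
\[
p(z^k = 1 \mid x, \xi) = \frac{\exp(\xi_k^\top x)}{\sum_j \exp(\xi_j^\top x)}, \qquad
p(y \mid x, z^k = 1, \theta_k) = \mathcal{N}(y \mid w_k^\top x, \sigma_k^2),
\]
so that p(y \mid x) = \sum_k p(z^k = 1 \mid x, \xi)\, \mathcal{N}(y \mid w_k^\top x, \sigma_k^2).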

EM for mixture of experts. Write down the model and the objective function, then apply EM with the corresponding E-step and M-step.

Partially Hidden Data. Of course, we can also learn when there are missing (hidden) variables in some cases and not in others; the data can have different missing values in each sample. In the E-step we estimate the hidden variables on the incomplete cases only. The M-step optimizes the log-likelihood on the complete data plus the expected log-likelihood on the incomplete data, using the E-step estimates.

Summary. Good things about EM: no learning-rate parameter; very fast for low dimensions; each iteration is guaranteed to improve the likelihood. Bad things about EM: it can get stuck in local optima; it requires a potentially expensive inference step; it is a maximum likelihood/MAP method.