Inference and Representation

Inference and Representation. Rachel Hodos, New York University. Lecture 5, October 6, 2015.

Today: Learning with hidden variables.
Outline:
- Unsupervised learning
- Example: clustering
- Review of k-means clustering
- Probabilistic perspective -> GMMs
- EM algorithm for GMMs
- General derivation of the EM algorithm
- Identifiability

K-Means and Gaussian Mixture Models. David Rosenberg, New York University. DS-GA 1003, June 15, 2015.

K-Means Clustering. Example: the Old Faithful geyser data. It looks like two clusters. How do we find these clusters algorithmically?

K-Means Clustering. k-means by example: standardize the data and choose two cluster centers. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(a).

K-Means Clustering. k-means by example: assign each point to the closest center. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(b).

K-Means Clustering. k-means by example: compute new cluster centers. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(c).

K-Means Clustering. k-means by example: assign points to the closest center. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(d).

K-Means Clustering. k-means by example: compute cluster centers. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(e).

K-Means Clustering. k-means by example: iterate until convergence. From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(i).
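
To make the alternating assign/update steps concrete, here is a minimal NumPy sketch of the k-means loop (an illustrative sketch, not the course's code; the function name, random initialization, and stopping rule are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return centers, labels
```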

k-means: Failure Cases. k-means: Suboptimal Local Minimum. The clustering for k = 3 below is a local minimum, but suboptimal. From Sontag's DS-GA 1003, 2014, Lecture 8.

Gaussian Mixture Models. Probabilistic Model for Clustering. Let's consider a generative model for the data. Suppose:
1. There are k clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster z ∈ {1, 2, ..., k}: Z ~ Multi(π_1, ..., π_k).
2. Choose a point from the distribution for cluster Z: X | Z = z ~ p(x | z).

Gaussian Mixture Models. Gaussian Mixture Model (k = 3).
1. Choose Z ∈ {1, 2, 3} ~ Multi(1/3, 1/3, 1/3).
2. Choose X | Z = z ~ N(x | µ_z, Σ_z).
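
This two-step generative story translates directly into code. Below is a small sampling sketch; the specific means and covariances are made-up placeholders, not values from the slides:

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Sample n points from a GMM: pick Z ~ Multi(pi), then X | Z = z ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)  # latent cluster assignments
    X = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in z])
    return X, z

# Example: k = 3 clusters with weights (1/3, 1/3, 1/3); means/covariances are illustrative.
pis = [1/3, 1/3, 1/3]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2), np.eye(2)]
X, z = sample_gmm(500, pis, mus, Sigmas)
```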

Gaussian Mixture Models. Gaussian Mixture Model: Joint Distribution. Factorize the joint according to the Bayes net: p(x, z) = p(z) p(x | z) = π_z N(x | µ_z, Σ_z). Here π_z is the probability of choosing cluster z, and X | Z = z has distribution N(µ_z, Σ_z). The z corresponding to x is the true cluster assignment.

Gaussian Mixture Models. Latent Variable Model. Back in reality, we observe X, not (X, Z). The cluster assignment Z is called a hidden variable. Definition: a latent variable model is a probability model for which certain variables are never observed. For example, the Gaussian mixture model is a latent variable model.

Gaussian Mixture Models. Model-Based Clustering. We observe X = x. The conditional distribution of the cluster Z given X = x is p(z | X = x) = p(x, z) / p(x). This conditional distribution is a soft assignment to clusters. A hard assignment is z* = argmax over z ∈ {1, ..., k} of P(Z = z | X = x). So if we have the model, clustering is trivial.
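
Given known parameters, both assignments are short computations. A minimal sketch using SciPy's multivariate normal density (the helper names are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Soft assignments p(z | x): proportional to pi_z * N(x | mu_z, Sigma_z), normalized over z."""
    dens = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                            for pi, mu, S in zip(pis, mus, Sigmas)])
    return dens / dens.sum(axis=1, keepdims=True)

def hard_assignments(X, pis, mus, Sigmas):
    """Hard clustering: z* = argmax_z p(z | x) for each point."""
    return responsibilities(X, pis, mus, Sigmas).argmax(axis=1)
```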

Gaussian Mixture Models. Estimating/Learning the Gaussian Mixture Model. We'll use the common acronym GMM. What does it mean to have or know the GMM? It means knowing the parameters: cluster probabilities π = (π_1, ..., π_k), cluster means µ = (µ_1, ..., µ_k), and cluster covariance matrices Σ = (Σ_1, ..., Σ_k). We have a probability model, so let's find the MLE. Suppose we have data D = {x_1, ..., x_n}. We need the model likelihood for D.

Gaussian Mixture Models. Gaussian Mixture Model: Marginal Distribution. Since we only observe X, we need the marginal distribution: p(x) = sum_{z=1}^{k} p(x, z) = sum_{z=1}^{k} π_z N(x | µ_z, Σ_z). Note that p(x) is a convex combination of probability densities. This is a common form for a probability model.
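
This marginal density is exactly what we need to score a dataset, so the log-likelihood of D is a direct translation. A small sketch (the helper name is mine, assuming SciPy for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(D) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    dens = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                            for pi, mu, S in zip(pis, mus, Sigmas)])
    return np.log(dens.sum(axis=1)).sum()
```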

Gaussian Mixture Models. Mixture Distributions (or Mixture Models). Definition: a probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities, that is, p(x) = sum_{i=1}^{k} w_i p_i(x), where w_i ≥ 0, sum_{i=1}^{k} w_i = 1, and each p_i is a probability density. In our Gaussian mixture model, X has a mixture distribution. More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample X from the chosen distribution.
Then X has a mixture distribution.

The EM Algorithm for GMM. EM Algorithm for GMM: Overview.
1. Initialize parameters µ, Σ, π.
2. E step. Evaluate the responsibilities using the current parameters: for i = 1, ..., n and j = 1, ..., k,
   γ_i^j = π_j N(x_i | µ_j, Σ_j) / sum_{c=1}^{k} π_c N(x_i | µ_c, Σ_c).
3. M step. Re-estimate the parameters using the responsibilities, with n_c = sum_{i=1}^{n} γ_i^c:
   µ_c^new = (1 / n_c) sum_{i=1}^{n} γ_i^c x_i,
   Σ_c^new = (1 / n_c) sum_{i=1}^{n} γ_i^c (x_i − µ_c^new)(x_i − µ_c^new)^T,
   π_c^new = n_c / n.
4. Repeat from Step 2 until the log-likelihood converges.
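
For reference, here is a compact NumPy/SciPy sketch of these updates (a simplified illustration rather than the course's reference implementation; the initialization and convergence test are deliberately crude):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate the E step (responsibilities) and M step (weighted MLEs)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Crude initialization: random data points as means, shared data covariance, uniform weights.
    mus = X[rng.choice(n, size=k, replace=False)].copy()
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pis = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: gamma[i, c] is proportional to pi_c * N(x_i | mu_c, Sigma_c).
        dens = np.column_stack([pis[c] * multivariate_normal.pdf(X, mean=mus[c], cov=Sigmas[c])
                                for c in range(k)])
        ll = np.log(dens.sum(axis=1)).sum()            # current log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities.
        n_c = gamma.sum(axis=0)                        # effective cluster sizes
        mus = (gamma.T @ X) / n_c[:, None]
        for c in range(k):
            diff = X - mus[c]
            Sigmas[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c] + 1e-6 * np.eye(d)
        pis = n_c / n
        if ll - prev_ll < tol:
            break                                      # log-likelihood has converged
        prev_ll = ll
    return pis, mus, Sigmas, gamma
```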

The EM Algorithm for GMM. EM for GMM: initialization. From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

The EM Algorithm for GMM. EM for GMM: first soft assignment. From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

The EM Algorithm for GMM. EM for GMM: after 5 rounds of EM. From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

The EM Algorithm for GMM. EM for GMM: after 20 rounds of EM. From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

The EM Algorithm for GMM. Relation to k-Means. EM for GMM seems a little like k-means, and in fact there is a precise correspondence. First, fix each cluster covariance matrix to be σ²I. Then, as we take σ² → 0, the update equations converge to those of k-means. If you do a quick experiment yourself, you'll find that the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
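
One quick way to see this numerically: with equal weights and covariance σ²I for every cluster, the responsibilities reduce to a softmax over negative squared distances to the centers, and shrinking σ² pushes them toward one-hot assignments. A small illustrative sketch (my own, not from the slides):

```python
import numpy as np

def soft_assignments_isotropic(X, mus, sigma2):
    """Responsibilities when every cluster has weight 1/k and covariance sigma2 * I.
    As sigma2 -> 0 they collapse onto the nearest center, i.e. k-means assignments."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances (n, k)
    logits = -d2 / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)                 # for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

Evaluating this on the same data at, say, sigma2 = 1.0 and sigma2 = 1e-4 makes the collapse from soft to hard assignments directly visible.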

Overview of the EM algorithm. Motivation: with hidden variables, the MLE is harder to compute (there is not always a closed-form solution). Also, we may want to estimate the expected states of the hidden variables. The EM algorithm can help with both. EM is an iterative algorithm for maximizing the log-likelihood.