K-Means and Gaussian Mixture Models

David Rosenberg, New York University
DS-GA 1003, June 15, 2015

K-Means Clustering: Example, Old Faithful Geyser
The data look like two clusters. How do we find these clusters algorithmically?

K-Means Clustering: k-means by example
(From Bishop's Pattern Recognition and Machine Learning, Figure 9.1.)
1. Standardize the data. Choose two cluster centers. (Figure 9.1(a))
2. Assign each point to the closest center. (Figure 9.1(b))
3. Compute new cluster centers. (Figure 9.1(c))
4. Assign points to the closest center. (Figure 9.1(d))
5. Compute new cluster centers. (Figure 9.1(e))
6. Iterate until convergence. (Figure 9.1(i))

K-Means Clustering: k-means formalization
Dataset: D = {x_1, ..., x_n} ⊂ R^d.
Goal (version 1): partition the data into k clusters.
Goal (version 2): partition R^d into k regions.
Let µ_1, ..., µ_k denote the cluster centers.

K-Means Clustering: k-means formalization
For each x_i, use a one-hot encoding to designate cluster membership:
r_i = (0, 0, ..., 0, 1, 0, ..., 0) ∈ R^k
Let r_ic = 1(x_i assigned to cluster c). Then r_i = (r_i1, r_i2, ..., r_ik).

K-Means Clustering: k-means objective function
Find cluster centers and cluster assignments minimizing
J(r, µ) = \sum_{i=1}^n \sum_{c=1}^k r_ic ||x_i - µ_c||^2.
Is the objective function convex? What is the domain of J?
We have r ∈ {0,1}^{n×k}, which is not a convex set, so the domain of J is not convex and J is not a convex function. We should expect local minima.
We could replace ||·||^2 with another dissimilarity (any distance metric); that gives k-medoids.
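
As a concrete check on the formula, here is a minimal NumPy sketch of J(r, µ); the array names X, r, and mu are illustrative assumptions, not from the lecture.

```python
import numpy as np

def kmeans_objective(X, r, mu):
    """J(r, mu) = sum_i sum_c r_ic * ||x_i - mu_c||^2.

    X:  (n, d) data matrix
    r:  (n, k) one-hot assignment matrix
    mu: (k, d) cluster centers
    """
    # Squared distance from every point to every center: shape (n, k).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dists).sum())
```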

K-Means Clustering: k-means algorithm
For fixed r (cluster assignments), minimizing over µ is easy:
J(r, µ) = \sum_{i=1}^n \sum_{c=1}^k r_ic ||x_i - µ_c||^2 = \sum_{c=1}^k J_c(µ_c),
where J_c(µ_c) = \sum_{i=1}^n r_ic ||x_i - µ_c||^2 = \sum_{i : x_i belongs to cluster c} ||x_i - µ_c||^2.
J_c is minimized by µ_c = mean({x_i : x_i belongs to cluster c}).

K-Means Clustering: k-means algorithm
For fixed µ (cluster centers), minimizing over r is easy:
J(r, µ) = \sum_{i=1}^n \sum_{c=1}^k r_ic ||x_i - µ_c||^2
For each i, exactly one of the terms
r_i1 ||x_i - µ_1||^2, r_i2 ||x_i - µ_2||^2, ..., r_ik ||x_i - µ_k||^2
is nonzero. Take
r_ic = 1(c = argmin_j ||x_i - µ_j||^2).
That is, assign x_i to the cluster c with minimum distance ||x_i - µ_c||^2.

K-Means Clustering: k-means algorithm (summary)
We use an alternating minimization algorithm:
1. Choose initial cluster centers µ = (µ_1, ..., µ_k), e.g. k randomly chosen data points.
2. Repeat:
   a. For the given cluster centers, find the optimal cluster assignments:
      r_ic^new = 1(c = argmin_j ||x_i - µ_j||^2)
   b. Given the cluster assignments, find the optimal cluster centers:
      µ_c^new = argmin_{m ∈ R^d} \sum_{i : r_ic = 1} ||x_i - m||^2
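
Below is a minimal NumPy sketch of this alternating minimization, assuming the data are stored in an (n, d) array X; it illustrates the two updates above rather than reproducing the lecture's code.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize centers with k randomly chosen data points.
    mu = X[rng.choice(n, size=k, replace=False)]
    assign = np.zeros(n, dtype=int)
    for _ in range(n_iters):
        # Assignment step: r_ic = 1(c = argmin_j ||x_i - mu_j||^2).
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        assign = sq_dists.argmin(axis=1)                                # (n,)
        # Update step: mu_c = mean of the points assigned to cluster c
        # (keep the old center if a cluster becomes empty).
        new_mu = np.array([X[assign == c].mean(axis=0) if np.any(assign == c)
                           else mu[c] for c in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign
```

Because the objective is non-convex, it is worth running this several times with different seeds and keeping the best objective value, as the convergence slide below notes.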

K-Means Clustering: k-means algorithm convergence
Note: the objective value never increases in an update (in the worst case, everything stays the same).
Consider the sequence of objective values J_1, J_2, J_3, ...: it is monotonically decreasing and bounded below by zero.
Therefore, the k-means objective value converges to inf_t J_t.
Reminder: this is convergence to a local minimum. It is best to repeat k-means several times with different starting points.

K-Means Clustering: objective function convergence
Blue circles: after the E step (assigning each point to a cluster). Red circles: after the M step (recomputing the cluster centers).
(From Bishop's Pattern Recognition and Machine Learning, Figure 9.2.)

K-Means Clustering: standardizing the data
With standardizing: (figure)
Without standardizing: (figure)

k-means Failure Cases: suboptimal local minimum
The clustering below for k = 3 is a local minimum, but suboptimal.
(From Sontag's DS-GA 1003, 2014, Lecture 8.)

Gaussian Mixture Models: a probabilistic model for clustering
Let's consider a generative model for the data. Suppose:
1. There are k clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster Z ∈ {1, 2, ..., k}, with Z ~ Multi(π_1, ..., π_k).
2. Choose a point from the distribution for cluster Z: X | Z = z ~ p(x | z).

Gaussian Mixture Models: Gaussian mixture model (k = 3)
1. Choose Z ∈ {1, 2, 3} with Z ~ Multi(1/3, 1/3, 1/3).
2. Choose X | Z = z ~ N(x | µ_z, Σ_z).
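
To make the generative story concrete, here is a small sampling sketch for a k = 3 mixture with equal weights; the means and covariances are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([1/3, 1/3, 1/3])                          # mixing weights
mus = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])    # illustrative means
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.eye(2)]        # illustrative covariances

def sample_gmm(n):
    z = rng.choice(3, size=n, p=pi)   # step 1: choose the cluster
    x = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in z])  # step 2
    return x, z

X, Z = sample_gmm(500)
```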

Gaussian Mixture Models: joint distribution
Factorize the joint according to the Bayes net:
p(x, z) = p(z) p(x | z) = π_z N(x | µ_z, Σ_z)
Here π_z is the probability of choosing cluster z, and X | Z = z has distribution N(µ_z, Σ_z).
The z corresponding to x is the true cluster assignment.

Gaussian Mixture Models: latent variable model
Back in reality, we observe X, not (X, Z). The cluster assignment Z is called a hidden variable.
Definition: A latent variable model is a probability model in which certain variables are never observed.
e.g. the Gaussian mixture model is a latent variable model.

Gaussian Mixture Models: model-based clustering
We observe X = x. The conditional distribution of the cluster Z given X = x is
p(z | X = x) = p(x, z) / p(x).
The conditional distribution gives a soft assignment to clusters. A hard assignment is
z* = argmax_{z ∈ {1,...,k}} P(Z = z | X = x).
So if we have the model, clustering is trivial.
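
If the parameters were known, the soft assignment is just Bayes' rule applied to p(x, z); a small sketch with assumed (illustrative) parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters, not from the slides.
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]

def soft_assignment(x):
    # Unnormalized posterior: p(x, z) = pi_z * N(x | mu_z, Sigma_z).
    joint = np.array([pi[z] * multivariate_normal.pdf(x, mus[z], Sigmas[z])
                      for z in range(len(pi))])
    return joint / joint.sum()              # p(z | X = x), the soft assignment

p_z = soft_assignment(np.array([1.0, 0.2]))
hard = int(p_z.argmax())                    # hard assignment: argmax_z P(Z = z | X = x)
```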

Gaussian Mixture Models: estimating/learning the Gaussian mixture model
We'll use the common acronym GMM. What does it mean to have or know the GMM? It means knowing the parameters:
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
We have a probability model, so let's find the MLE. Suppose we have data D = {x_1, ..., x_n}. We need the model likelihood for D.

Gaussian Mixture Models: marginal distribution
Since we only observe X, we need the marginal distribution:
p(x) = \sum_{z=1}^k p(x, z) = \sum_{z=1}^k π_z N(x | µ_z, Σ_z)
Note that p(x) is a convex combination of probability densities. This is a common form for a probability model...

Gaussian Mixture Models: mixture distributions (or mixture models)
Definition: A probability density p(x) represents a mixture distribution (or mixture model) if we can write it as a convex combination of probability densities, that is,
p(x) = \sum_{i=1}^k w_i p_i(x),
where w_i ≥ 0, \sum_{i=1}^k w_i = 1, and each p_i is a probability density.
In our Gaussian mixture model, X has a mixture distribution.
More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample X from the chosen distribution.
Then X has a mixture distribution.

Gaussian Mixture Models: estimating/learning the Gaussian mixture model
The model likelihood for D = {x_1, ..., x_n} is
L(π, µ, Σ) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k π_z N(x_i | µ_z, Σ_z).
As usual, we take our objective function to be the log of this:
J(π, µ, Σ) = \sum_{i=1}^n log { \sum_{z=1}^k π_z N(x_i | µ_z, Σ_z) }

Gaussian Mixture Models: properties of the GMM log-likelihood
GMM log-likelihood:
J(π, µ, Σ) = \sum_{i=1}^n log { \sum_{z=1}^k π_z N(x_i | µ_z, Σ_z) }
Compare this to the log-likelihood for a single Gaussian:
\sum_{i=1}^n log N(x_i | µ, Σ) = -(nd/2) log(2π) - (n/2) log|Σ| - (1/2) \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ)
For a single Gaussian, the log cancels the exp in the Gaussian density, so things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation, so the expression is more complicated: there is no closed-form expression for the MLE.
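
Numerically, this log-likelihood is usually evaluated with a log-sum-exp over per-component log-densities to avoid underflow; a hedged sketch (function and variable names are assumptions):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    # log_comp[i, z] = log pi_z + log N(x_i | mu_z, Sigma_z)
    log_comp = np.column_stack([
        np.log(pi[z]) + multivariate_normal.logpdf(X, mus[z], Sigmas[z])
        for z in range(len(pi))
    ])
    return float(logsumexp(log_comp, axis=1).sum())
```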

Issues with MLE for GMM: identifiability issues
Suppose we have found parameters
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
that are at a local optimum. What happens if we shuffle the clusters, e.g. switch the labels for clusters 1 and 2? We get the same likelihood. How many such equivalent settings are there? Assuming all clusters are distinct, there are k! equivalent solutions. This is not a problem per se, but something to be aware of.

Issues with MLE for GMM: singularities
Consider the GMM in Bishop's Pattern Recognition and Machine Learning, Figure 9.7, fit to 7 data points. Let σ^2 be the variance of the skinny component. What happens to the likelihood as σ^2 → 0? It diverges to infinity: when a component collapses onto a single data point, its density at that point grows without bound.
In practice, we end up in local optima that do not have this problem, or we keep restarting the optimization until we do. A Bayesian approach or regularization will also solve the problem.

Issues with MLE for GMM: gradient descent / SGD for GMM
What about running gradient descent or SGD on
J(π, µ, Σ) = \sum_{i=1}^n log { \sum_{z=1}^k π_z N(x_i | µ_z, Σ_z) } ?
It can be done, but we need to be clever about it. Each matrix Σ_1, ..., Σ_k has to be positive semidefinite. How do we maintain that constraint?
Rewrite Σ_i = M_i M_i^T, where M_i is an unconstrained matrix; then Σ_i is positive semidefinite. But we actually prefer positive definite, to avoid singularities.

Issues with MLE for GMM: Cholesky decomposition for SPD matrices
Theorem: Every symmetric positive definite matrix A ∈ R^{d×d} has a unique Cholesky decomposition A = L L^T, where L is a lower triangular matrix with positive diagonal elements.
A lower triangular matrix has half the number of parameters. Symmetric positive definite is better because it avoids singularities. This requires a non-negativity constraint on the diagonal elements, e.g. use a projected SGD method like we did for the lasso.
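
One way to realize this parameterization (a sketch, not the lecture's prescription): keep an unconstrained d×d parameter matrix, take its lower triangle as L, and force a positive diagonal, here by exponentiating the diagonal entries rather than by projection; then Σ = L L^T is symmetric positive definite.

```python
import numpy as np

def sigma_from_unconstrained(theta):
    """Map an unconstrained (d, d) parameter matrix to a symmetric positive
    definite covariance via a Cholesky factor with positive diagonal."""
    d = theta.shape[0]
    L = np.tril(theta)                                # keep the lower triangle
    L[np.diag_indices(d)] = np.exp(np.diag(theta))    # strictly positive diagonal
    return L @ L.T                                    # Sigma = L L^T

theta = np.random.default_rng(0).normal(size=(3, 3))
Sigma = sigma_from_unconstrained(theta)               # symmetric positive definite
```

The slides instead suggest projected SGD on the diagonal; the exponential map above is just one alternative that keeps the optimization unconstrained.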

The EM Algorithm for GMM: MLE for the Gaussian model
Let's start by considering the MLE for a single Gaussian model. For data D = {x_1, ..., x_n}, the log-likelihood is
\sum_{i=1}^n log N(x_i | µ, Σ) = -(nd/2) log(2π) - (n/2) log|Σ| - (1/2) \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ).
With some calculus, we find that the MLE parameters are
µ_MLE = (1/n) \sum_{i=1}^n x_i
Σ_MLE = (1/n) \sum_{i=1}^n (x_i - µ_MLE)(x_i - µ_MLE)^T
For the GMM, if we knew the cluster assignment z_i for each x_i, we could compute the MLEs for each cluster.
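
These two formulas translate directly into NumPy; a minimal sketch, assuming X is an (n, d) data array:

```python
import numpy as np

def gaussian_mle(X):
    n = X.shape[0]
    mu = X.mean(axis=0)                  # (1/n) sum_i x_i
    centered = X - mu
    Sigma = centered.T @ centered / n    # (1/n) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma
```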

The EM Algorithm for GMM: cluster responsibilities (some new notation)
Denote the probability that observed value x_i comes from cluster j by
γ_i^j = P(Z = j | X = x_i),
the "responsibility" that cluster j takes for observation x_i. Computationally,
γ_i^j = P(Z = j | X = x_i) = p(Z = j, X = x_i) / p(x_i) = π_j N(x_i | µ_j, Σ_j) / \sum_{c=1}^k π_c N(x_i | µ_c, Σ_c).
The vector (γ_i^1, ..., γ_i^k) is exactly the soft assignment for x_i.
Let n_c = \sum_{i=1}^n γ_i^c be the number of points soft-assigned to cluster c.

The EM Algorithm for GMM: overview
1. Initialize the parameters µ, Σ, π.
2. E step. Evaluate the responsibilities using the current parameters:
   γ_i^j = π_j N(x_i | µ_j, Σ_j) / \sum_{c=1}^k π_c N(x_i | µ_c, Σ_c),
   for i = 1, ..., n and j = 1, ..., k.
3. M step. Re-estimate the parameters using the responsibilities:
   µ_c^new = (1/n_c) \sum_{i=1}^n γ_i^c x_i
   Σ_c^new = (1/n_c) \sum_{i=1}^n γ_i^c (x_i - µ_c^new)(x_i - µ_c^new)^T
   π_c^new = n_c / n
4. Repeat from step 2 until the log-likelihood converges.
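
A compact NumPy/SciPy sketch of these E and M steps, under simplifying assumptions (random initialization, identity starting covariances, no singularity safeguards); the names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random means, identity covariances, uniform weights.
    mu = X[rng.choice(n, size=k, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E step: responsibilities gamma[i, j] = P(Z = j | X = x_i).
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the responsibilities.
        n_c = gamma.sum(axis=0)                      # soft counts per cluster
        mu = (gamma.T @ X) / n_c[:, None]
        for c in range(k):
            diff = X - mu[c]
            Sigma[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c]
        pi = n_c / n
    return pi, mu, Sigma, gamma
```

In practice one would also monitor the log-likelihood (step 4) and stop when it no longer increases appreciably; the sketch simply runs a fixed number of iterations.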

The EM Algorithm for GMM: EM for GMM in pictures
(From Bishop's Pattern Recognition and Machine Learning, Figure 9.8: initialization, the first soft assignment, then the fit after 5 and after 20 rounds of EM.)

The EM Algorithm for GMM: relation to k-means
EM for GMM seems a little like k-means, and in fact there is a precise correspondence. First, fix each cluster covariance matrix to be σ^2 I. Then, as we take σ^2 → 0, the EM update equations converge to the k-means updates. If you do a quick experiment yourself, you'll find that the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
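
A quick experiment of the kind suggested here: with equal weights and shared covariance σ^2 I, each center's responsibility for a point is a softmax of the negative squared distances scaled by 1/(2σ^2), and it sharpens to a hard assignment as σ^2 shrinks (all numbers below are illustrative).

```python
import numpy as np

mu = np.array([[0.0], [1.0]])      # two 1-d cluster centers
x = np.array([0.4])                # a point slightly closer to the first center

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    # With equal weights and covariance sigma^2 * I, the responsibilities
    # reduce to a softmax of -||x - mu_c||^2 / (2 sigma^2).
    logits = -((x - mu) ** 2).sum(axis=1) / (2 * sigma2)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(sigma2, gamma)           # gamma -> (1, 0): a hard assignment to the closer center
```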

The EM Algorithm for GMM: possible topics for next time
In the last lecture, we will give a high-level view of several topics. Possibilities:
General EM algorithm
Bandit problems
LDA / topic models
Ranking problems
Collaborative filtering
Generalization bounds
Sequence models (maximum entropy Markov models, HMMs)