Note Set 4: Finite Mixture Models and the EM Algorithm

Padhraic Smyth, Department of Computer Science, University of California, Irvine
ICS 274, Probabilistic Learning: Winter 2016

Finite Mixture Models

A finite mixture model with K components, for a vector x, is defined as

    p(x | Θ) = Σ_{k=1}^K α_k p_k(x | z_k = 1, θ_k)        (1)

In this equation we have:

p_k(x | z_k = 1, θ_k) is the kth component density with parameters θ_k. Typically this takes some simple parametric form, such as a multivariate Gaussian density, for each component. In general the components need not all have the same parametric form, e.g., for 1-dimensional data one component could be Gaussian, another could be exponential, etc. Here we work with real-valued x, but we could instead have discrete x and probability distributions for p(x | Θ) and the K components p_k(x | z_k = 1, θ_k).

z represents a K-dimensional vector of indicator variables that indicates which of the K components generated x. The notation z_k = 1 denotes that the kth component of z is 1 and all the other components are 0.

α_k = P(z_k = 1) is the marginal distribution of the components, i.e., the probability that a randomly selected x was generated by component k (think of it as the relative frequency with which each component generates data vectors). The α_k's are referred to as the mixture weights. Note that Σ_{k=1}^K α_k = 1 since each x is assumed to be generated by one (but only one) of the K components.

The full set of parameters in the mixture model is Θ = {α_1, ..., α_K, θ_1, ..., θ_K}.
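To make Equation 1 concrete, here is a minimal sketch (not part of the original notes; the two-component parameter values are made up purely for illustration) that evaluates a 1-dimensional Gaussian mixture density:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """Univariate Gaussian density N(x | mu, sigma^2)."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    def mixture_density(x, alphas, mus, sigmas):
        """Equation 1: p(x | Theta) = sum_k alpha_k p_k(x | theta_k)."""
        return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

    # Hypothetical two-component mixture (alpha_1 = 0.3, alpha_2 = 0.7).
    alphas, mus, sigmas = [0.3, 0.7], [-2.0, 1.0], [1.0, 0.5]
    print(mixture_density(0.0, alphas, mus, sigmas))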

Membership Probabilities

Another way to think of a mixture model is via the law of total probability, where the marginal density p(x) is expressed as

    p(x) = Σ_{k=1}^K p(x, z_k = 1) = Σ_{k=1}^K p_k(x | z_k = 1, θ_k) P(z_k = 1) = Σ_{k=1}^K p_k(x | z_k = 1, θ_k) α_k        (2)

(ignoring the dependence on parameters for a moment). The marginal density for x is a weighted (convex) sum of component densities. As mentioned above, this model assumes that each x was generated by one of the K components.

If we know (or fix) the parameters of the model, and we are given an observed vector x_i, we can use Bayes' rule to compute the probability that it was generated by a particular component k, also referred to as the membership probabilities:

    w_ik = α_k p_k(x_i | θ_k) / Σ_{m=1}^K α_m p_m(x_i | θ_m)

where we use p_k(x_i | θ_k) as shorthand for p_k(x_i | z_k = 1, θ_k). By definition Σ_{k=1}^K w_ik = 1, and x_i is some data vector, e.g., from a data set {x_1, ..., x_N}.

An important point is that the membership probabilities w_ik reflect our uncertainty about which of the K components generated x_i. For example, if w_ik = 0.5 and K = 2, this means that we are completely uncertain about which of the 2 components generated x_i. So, if we had a mixture model for 1-dimensional data with 2 equally weighted Gaussians with the same variance but different means, then a point that is exactly half-way between the 2 means would have a membership probability of 0.5 for each component, and the membership probabilities would move closer to 0 or 1 as we move along the x axis towards either of the two means.

Using Maximum Likelihood to Learn the Parameters of a Mixture Model

Consider an observed data set D_x = {x_1, ..., x_N} where each vector x_i, 1 ≤ i ≤ N, is a d-dimensional vector. For simplicity, assume in our likelihood that the x_i are conditionally independent given the parameters of a true underlying density p(x), and further assume that p(x) is a K-component mixture model with parameters Θ and some particular parametric form for the K components (e.g., multivariate Gaussians). The general form of the log-likelihood can be written as

    log L(Θ) = Σ_{i=1}^N log p(x_i | Θ) = Σ_{i=1}^N log ( Σ_{k=1}^K α_k p_k(x_i | θ_k) )        (3)

Even for relatively simple component models (such as 1-dimensional Gaussians), this log-likelihood does not yield simple closed-form solutions for the maximum likelihood parameters. In general we get a set of coupled non-linear equations that must be solved in an iterative manner.
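The membership probabilities and the log-likelihood in Equation 3 can be computed directly. The sketch below is my own illustration (assuming 1-dimensional Gaussian components, made-up data and parameters, and scipy.stats.norm for the component densities), not code from the notes:

    import numpy as np
    from scipy.stats import norm

    def membership_weights(x_data, alphas, mus, sigmas):
        """N x K matrix with w_ik = alpha_k p_k(x_i | theta_k) / sum_m alpha_m p_m(x_i | theta_m)."""
        x = np.asarray(x_data)[:, None]                                  # shape (N, 1)
        joint = np.asarray(alphas) * norm.pdf(x, loc=mus, scale=sigmas)  # alpha_k p_k(x_i), shape (N, K)
        return joint / joint.sum(axis=1, keepdims=True)                  # each row sums to 1

    def log_likelihood(x_data, alphas, mus, sigmas):
        """Equation 3: log L(Theta) = sum_i log sum_k alpha_k p_k(x_i | theta_k)."""
        x = np.asarray(x_data)[:, None]
        joint = np.asarray(alphas) * norm.pdf(x, loc=mus, scale=sigmas)
        return np.sum(np.log(joint.sum(axis=1)))

    # Tiny made-up 1-D data set and parameters, purely for illustration.
    x_data = [-2.1, 0.0, 1.3]
    print(membership_weights(x_data, [0.3, 0.7], [-2.0, 1.0], [1.0, 0.5]))
    print(log_likelihood(x_data, [0.3, 0.7], [-2.0, 1.0], [1.0, 0.5]))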

The difficulty in maximizing the log-likelihood above arises because we don't know which component generated each x_i. If we did know which component generated each data vector x_i, we could just group the data by component and estimate the parameters θ_k for each component separately (it is easy to show that this would maximize the log-likelihood, as long as the parameters θ_k for each component do not have any dependency on the parameters of other components). We are interested in the situation where we don't have the component indicators but would like to fit a mixture model nonetheless.

Let z_i be the K-dimensional latent (or hidden) indicator vector for each x_i, with a 1 for the component that generated x_i and a 0 in all the other components. So we can think of our data as being in two parts: the observed part D_x, an N × d matrix of observations, and a latent/unobserved part D_z, an N × K matrix of latent indicator variables.

There is a general framework for generating maximum likelihood parameter estimates in the presence of missing data known as the Expectation-Maximization (EM) Algorithm. In the notes below we describe the specification of EM for the specific case of a finite mixture model with K multivariate Gaussian components, although note that the EM algorithm applies much more generally to estimation problems with missing data.

The Gaussian Mixture Model

In the remainder of this note set we will assume that we are working with a mixture of Gaussians model, i.e., each component is a Gaussian density

    p_k(x | θ_k) = (1 / ((2π)^{d/2} |Σ_k|^{1/2})) exp( -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) )        (4)

with its own parameters θ_k = {μ_k, Σ_k}. Note that we can compute the membership probabilities w_ik, given any vector x_i and mixture model parameters Θ, by plugging the functional form for p_k(x | θ_k) above into Equation 2.

The EM Algorithm for Gaussian Mixtures

The EM (Expectation-Maximization) algorithm for Gaussian mixtures is defined as a combination of E and M steps:

E-Step: Denote the current parameter values as Θ. Compute w_ik (using the equations above) for all data points x_i, 1 ≤ i ≤ N, and all mixture components 1 ≤ k ≤ K. Note that for each data point x_i the membership weights satisfy Σ_{k=1}^K w_ik = 1 by definition (from Equation 2). This yields an N × K matrix of membership weights w_ik, where each of the rows sums to 1.

M-Step: Now we use the matrix of membership weights and the data to calculate new parameter values. Specifically,

    α_k^new = (1/N) Σ_{i=1}^N w_ik,   1 ≤ k ≤ K.        (5)
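A minimal sketch of this E-step for multivariate Gaussian components is shown below; the function name e_step and the use of scipy.stats.multivariate_normal for the density in Equation 4 are my choices for illustration, not part of the notes:

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, alphas, mus, Sigmas):
        """Return the N x K membership-weight matrix W; each row sums to 1."""
        N, K = X.shape[0], len(alphas)
        W = np.empty((N, K))
        for k in range(K):
            # alpha_k * p_k(x_i | mu_k, Sigma_k), with p_k the Gaussian of Equation 4
            W[:, k] = alphas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        return W / W.sum(axis=1, keepdims=True)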

These are the new mixture weights, with Σ_{k=1}^K α_k^new = 1 by definition.

Let N_k = Σ_{i=1}^N w_ik. This is the effective number of data points that are assigned to component k, 1 ≤ k ≤ K. Then

    μ_k^new = (1/N_k) Σ_{i=1}^N w_ik x_i,   1 ≤ k ≤ K.        (6)

The updated mean is calculated in a manner similar to how we would compute a standard empirical average, except that the ith measurement x_i has a fractional weight w_ik. Note that this is a vector equation since μ_k^new and x_i are both d-dimensional vectors.

    Σ_k^new = (1/N_k) Σ_{i=1}^N w_ik (x_i - μ_k^new)(x_i - μ_k^new)^T,   1 ≤ k ≤ K.        (7)

Again we get an equation that is similar in form to how we would normally compute an empirical covariance matrix, except that the contribution of each data point is weighted by w_ik. Note that this is a matrix equation with d × d terms on each side.

The equations in the M-step need to be computed in this order, i.e., first compute the K new α's, then the K new μ's, and finally the K new Σ's. After we have computed all of the new parameters, the M-step is complete and we can now go back and recompute the membership weights in the E-step, then recompute the parameters again in the M-step, and continue updating the parameters in this manner. Each pair of E and M steps is considered to be one iteration.

Initialization

The EM algorithm can be started by either

1. Weight Initialization: initialize by starting with a matrix of randomly selected weights w_ik (ensuring that the random weights in each row sum to 1) and then start the algorithm with an M-step. This is relatively easy to implement. For large N it can produce initial components (after the first M-step) that are heavily overlapped, with means close to each other, which can sometimes lead to slower convergence than with other methods.

2. Parameter Initialization: initialize the algorithm with a set of heuristically-selected initial parameters and then start the algorithm with an E-step. The initial set of parameters or weights can be determined by any of a variety of heuristics. For example, select K random data points as the K initial means and use the covariance matrix of the whole data set (or a scaled version of it) for each of the initial K covariance matrices. Or the parameters could be chosen by using the non-probabilistic k-means algorithm to first cluster the data and then defining weights based on k-means memberships.
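The M-step updates in Equations 5-7 can be written compactly in terms of the N × K weight matrix from the E-step. The sketch below is my own illustration (the function name m_step is assumed), computing the updates in the order described above:

    import numpy as np

    def m_step(X, W):
        """Return updated (alphas, mus, Sigmas) from data X (N x d) and weights W (N x K)."""
        N, d = X.shape
        Nk = W.sum(axis=0)                    # effective counts N_k, shape (K,)
        alphas = Nk / N                       # Equation 5
        mus = (W.T @ X) / Nk[:, None]         # Equation 6: weighted means, shape (K, d)
        Sigmas = []
        for k in range(W.shape[1]):
            diff = X - mus[k]                 # (N, d)
            Sigmas.append((W[:, k, None] * diff).T @ diff / Nk[k])  # Equation 7: (d, d)
        return alphas, mus, np.array(Sigmas)

With weight initialization (option 1), one would draw a random row-normalized weight matrix and run this M-step once before the first E-step; with parameter initialization (option 2), one would run an E-step first using the heuristically chosen parameters.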

Convergence Detection

Convergence is generally detected by computing the value of the log-likelihood (as defined in Equation 3) after each iteration and halting when it appears not to be changing in a significant manner from one iteration to the next. One potentially tricky issue here is that the log-likelihood values may be on very different scales from one problem to the next, since they will depend on the scale of the data points x. So an alternative method (that also works well and is simple to implement) is to monitor how much the membership weights w_ik are changing from one iteration to the next, and halt the EM algorithm when the average change in these weights (across all N × K weights) is less than some small value such as 10^-6.

Note that EM in general can be shown to converge to a local maximum of the log-likelihood function rather than a global maximum. The particular solution it converges to in parameter space depends on where the algorithm is initialized. So in practice one often runs EM multiple times with different randomly-seeded initializations and selects the solution with the highest log-likelihood across the different runs.

The K-Means Algorithm

The K-means algorithm is a well-known clustering algorithm that is based on minimizing the mean-squared error to the cluster centers. It is not based on a probabilistic model, but nonetheless shares some similarities with EM for Gaussian mixtures. Assume again we have a data set D_x = {x_1, ..., x_N}. We would like to cluster the N data vectors into K clusters. K-means has 2 steps per iteration, like EM.

Membership Assignment: Given a current set of cluster means μ_k, k = 1, ..., K, each data point x_i is assigned to the cluster whose mean it is closest to in terms of Euclidean distance. Note that this is similar to the E-step in EM, except that (a) Euclidean distance is being used (there is no notion of a covariance matrix), and (b) no membership probabilities are being computed, i.e., a hard decision on membership is being made for each data point x_i.

Updating of Cluster Centers: Given a set of cluster assignments (i.e., each data vector x_i has been assigned to one of the K clusters), we can now update the vector representing the mean of each cluster by computing

    μ_k = (1/N_k) Σ_{i : x_i ∈ cluster k} x_i,   1 ≤ k ≤ K,

i.e., compute the average of the N_k data points that are assigned to cluster k, for each cluster.

The algorithm begins by either (a) randomly selecting K data vectors to act as initial cluster centers or (b) randomly assigning each data point to a cluster, and then starting with the appropriate step. The algorithm converges when the cluster assignments do not change from one iteration to the next (which means that the updated means will not change, which implies we have reached a fixed point).
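The two K-means steps and the assignments-unchanged stopping rule can be sketched as follows (a rough illustration, assuming initialization option (a) with K randomly selected data vectors as initial centers; the function name kmeans is mine, not from the notes):

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Alternate hard assignments and mean updates; stop when assignments stop changing."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)].copy()
        assignments = np.full(len(X), -1)
        for _ in range(max_iter):
            # Membership assignment: each x_i goes to its closest center (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                break                                   # fixed point: assignments unchanged
            assignments = new_assignments
            # Updating of cluster centers: average of the points assigned to each cluster.
            for k in range(K):
                if np.any(assignments == k):
                    centers[k] = X[assignments == k].mean(axis=0)
        return centers, assignments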

The K-means algorithm above can be shown to be trying to find the set of cluster means μ_k that minimize the sum of squared errors between each data point x_i and its closest center. We can also think of this as trying to perform data compression by finding the K best vectors μ_k to represent the full set of N vectors, where "best" is measured in a mean-squared sense. We can relate this to EM by considering the mean-square error to be a special case of the log-likelihood, by assuming that the covariances are fixed to be the identity matrix for each cluster, and by replacing the E-step with a hard assignment of each vector x_i to its closest cluster center.

Like EM, K-means is only guaranteed to find a local minimum of the mean-square error function, rather than a global minimum, and which one it finds depends on where it is initialized. Thus, as with EM, it is typical to run it multiple times from different randomly-seeded starting points and then select the solution with the lowest mean-square error.

In practice both the K-means and Gaussian mixture algorithms are useful for clustering and data compression. K-means can be more robust since it does not require parametric assumptions about the shapes of the clusters, but the mixture model approach is in general more flexible in terms of the types of clusters it can find.