Table of Contents

Chapter 9: Mixture Models and EM
- K-means Clustering
- Gaussian Mixture Models (GMM)
- Expectation Maximization (EM) for Mixture Parameter Estimation

Introduction
- Mixture models allow complex distributions to be formed from simpler distributions of observed and latent variables.
- The distribution of the observed variables alone is obtained by marginalization.
- Mixture models provide a method for clustering data.
- Maximum likelihood estimation in a mixture model is performed with the Expectation Maximization (EM) algorithm.

K-means Clustering
- Given a data set {x_1, ..., x_N} in D-dimensional Euclidean space, partition it into K clusters; the partition is represented by one-of-K coding (illustrated below). This is also called unsupervised classification.
- μ_k is the center of the k-th cluster.
- Indicator variables r_nk ∈ {0, 1}, for k = 1, ..., K, describe which of the K clusters data point x_n is assigned to: r_nk = 1 and r_nj = 0 for j ≠ k, because of the one-of-K coding.
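As a concrete illustration of one-of-K coding, the following tiny Python snippet (illustrative, not from the original slides) builds the indicator vector r_n for a point assigned to cluster k = 1 when K = 3:

```python
import numpy as np

# One-of-K (one-hot) coding: a point assigned to cluster k has
# r_nk = 1 and r_nj = 0 for all j != k.
K, k = 3, 1
r_n = np.zeros(K, dtype=int)
r_n[k] = 1
print(r_n)  # [0 1 0]
```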
Sum of Squared Errors (Distortion Measure)
- J is the sum of squared distances of each point to its closest cluster center μ_k:

      J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

- The goal is to find values for the {r_nk} and the {μ_k} that minimize J.
- This can be done by an iterative procedure consisting of two optimization steps, one w.r.t. r_nk and one w.r.t. μ_k.

Iterative Procedure: K-Means Clustering
1. Initialize μ_k.
2. Minimize J w.r.t. r_nk, keeping μ_k fixed (Expectation): assign the n-th point to the closest cluster.
3. Minimize J w.r.t. μ_k, keeping r_nk fixed (Maximization): re-estimate the cluster centers from the current point assignments.
4. Repeat steps 2 and 3 until there is no or little change in either μ_k or r_nk. (A sketch of this loop follows below.)

Termination of K-Means
- Two phases: re-assigning data points to clusters, and re-computing the means of the clusters.
- They are repeated until there is no further change in the assignments.
- Since each phase reduces J, convergence is assured.
- K-means may converge to a local minimum of J.
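The two-phase loop above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the slides' reference code; the name kmeans and the parameters n_iters, tol, and seed are assumptions:

```python
import numpy as np

def kmeans(X, K, n_iters=100, tol=1e-6, seed=0):
    """Two-step K-means: hard assignment (E), mean update (M)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize mu_k by picking K distinct data points as centers.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    J_old = np.inf
    for _ in range(n_iters):
        # E-step: r_nk -- assign each point to its nearest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        r = d2.argmin(axis=1)
        # M-step: re-estimate each center as the mean of its points.
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
        # Distortion J: sum of squared distances to the closest centers.
        J = d2[np.arange(len(X)), r].sum()
        if J_old - J < tol:  # little or no change: terminate
            break
        J_old = J
    return mu, r, J
```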
Image Segmentation
- Goal: partition an image into regions, each of which has a homogeneous visual appearance or corresponds to objects or parts of objects.
- Each pixel is a point in (R, G, B) space.
- K-means clustering is used with a palette of K colors: classify each pixel by its intensity into K clusters, e.g. K = 2, 3, and 10.
- The method does not take into account the spatial proximity of different pixels.

Online K-Means Clustering
- Online version (Robbins-Monro procedure):

      \mu_k^{new} = \mu_k^{old} + \eta_n (x_n - \mu_k^{old})

  where η_n is a learning-rate parameter made to decrease monotonically as more samples are observed.
- Only the M-step is performed. (A sketch of one online update follows below.)

Dissimilarity Measures
- Euclidean distance has limitations: it is inappropriate for categorical labels, and cluster means are not robust to outliers.
- Use a more general or robust dissimilarity measure ν(x, x') (instead of the Euclidean distance as in K-means) between a point and a cluster center, with distortion measure

      J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \nu(x_n, \mu_k)

- When ν(·, ·) measures the average dissimilarity to all the objects in the cluster, this gives the K-medoids algorithm.
- The M-step is then potentially more complex than for K-means.
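One online update step might look as follows; this sketch assumes the common learning-rate schedule η_n = 1/N_k, where N_k counts the samples assigned to cluster k so far, and the names online_kmeans_update and counts are illustrative:

```python
import numpy as np

def online_kmeans_update(mu, counts, x):
    """One Robbins-Monro step: move the nearest center toward sample x."""
    # Find the winning center (only the M-step is performed online).
    k = np.argmin(((mu - x) ** 2).sum(axis=1))
    counts[k] += 1
    eta = 1.0 / counts[k]          # monotonically decreasing learning rate
    mu[k] += eta * (x - mu[k])     # mu_k <- mu_k + eta_n (x_n - mu_k)
    return mu, counts
```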
Gaussian Mixture Models (GMMs)
- Allow a complex distribution to be represented by a combination of simpler distributions, hence providing a richer class of density models than a single Gaussian.
- Parameterized and enhanced clustering.
- Introduce the concept of a latent variable.
- Motivate the EM algorithm for ML estimation.

GMM Formulation
- To characterize a complex distribution of x, introduce the latent variable z: a discrete K-dimensional vector of binary states, exactly one of which equals 1 (one-of-K coded).
- Define the joint distribution p(x, z) = p(x|z) p(z).
- p(x) is obtained by marginalizing over z:

      p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

  where π_k = p(z_k = 1) is the probability that the k-th element of z is 1.
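Evaluating the marginal p(x) is a direct sum over components. A small sketch using scipy.stats.multivariate_normal (the function name gmm_density and the example parameters are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k), marginalizing the latent z."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

# Example: a two-component mixture in 2-D.
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```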
GMM Parameter Estimation
- Given training data X = {x_1, ..., x_N}, use ML to estimate the GMM parameters π, μ, and Σ, where

      \pi = [\pi_1, \pi_2, .., \pi_K], \quad \mu = [\mu_1, \mu_2, .., \mu_K], \quad \Sigma = [\Sigma_1, \Sigma_2, .., \Sigma_K]

- The joint likelihood:

      L(\pi, \mu, \Sigma) = p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

- The log joint likelihood:

      \ln L(\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big\}

- Taking the derivatives of the log joint likelihood with respect to π_k, μ_k, and Σ_k and setting them to zero yields updates expressed in terms of the responsibilities

      \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

  γ(z_nk) measures the contribution of the k-th component to x_n, but it still depends on the parameters.

Expectation Maximization (EM) for GMM Parameter Estimation
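A hedged sketch of one EM iteration for a GMM, using the closed-form updates obtained by setting the above derivatives to zero; the function name em_step and the array layout are assumptions, not the slides' code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM on data X of shape (N, D)."""
    N, K = len(X), len(pis)
    # E-step: responsibilities gamma(z_nk) under the current parameters.
    gamma = np.column_stack(
        [pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
         for k in range(K)])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    Nk = gamma.sum(axis=0)            # effective number of points per component
    pis = Nk / N
    mus = [gamma[:, k] @ X / Nk[k] for k in range(K)]
    Sigmas = [(gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]
    return pis, mus, Sigmas, gamma
```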
Relation to K-Means Clustering
- GMMs and K-means both identify clusters, but they differ:
  - A GMM gives cluster centers, cluster covariance matrices, and cluster weights, while K-means only gives cluster centers.
  - K-means performs hard point assignment, while a GMM performs soft assignment (see the small illustration below).

EM Algorithm
- EM is a general technique for finding maximum likelihood solutions for probabilistic models with latent variables (Dempster et al., 1977).
- Goal of EM: find maximum likelihood solutions for models having latent variables or missing data.
- Latent variables are introduced in order to represent a complicated distribution with simpler components.
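A tiny illustration of the soft-versus-hard distinction, with made-up centers and a single 1-D point (equal weights and unit variances assumed):

```python
import numpy as np

# One point between two 1-D cluster centers.
x, mu = 0.4, np.array([0.0, 1.0])

# K-means: hard assignment to the single nearest center.
hard = np.argmin((x - mu) ** 2)   # -> 0

# GMM (equal weights, unit variances): soft responsibilities.
dens = np.exp(-(x - mu) ** 2 / 2)
gamma = dens / dens.sum()         # -> approximately [0.52, 0.48]
print(hard, gamma)
```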
EM Algorithm
- Let X be the observed data, Z the latent variables, and θ the parameters.
- Goal: maximize the marginal log-likelihood of the observed data,

      \ln p(X \mid \theta) = \ln \Big\{ \sum_Z p(X, Z \mid \theta) \Big\}

- Maximizing p(X, Z|θ) is simple, but maximizing p(X|θ) is difficult because the log appears before the sum.
- Assume maximization of the complete-data log-likelihood ln p(X, Z|θ) is straightforward; we cannot do it directly, since Z is unknown.
- The latent Z is known only through the posterior p(Z|X, θ), if θ is known. We therefore consider the expected complete-data log-likelihood.

EM: Algorithm
1. Initialization: choose an initial set of parameters θ^old.
2. E-step: use the current parameters θ^old to compute the posterior p(Z|X, θ^old), and form the expected complete-data log-likelihood for general θ:

      Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \, \ln p(X, Z \mid \theta)

3. M-step: determine θ^new by maximizing Q:

      \theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})

4. Check convergence: stop, or set θ^old = θ^new and go to the E-step.

EM Properties
- EM, although defined heuristically, can be proved to maximize the likelihood function.
- The proof involves obtaining a lower bound on the log-likelihood function.
- EM guarantees that the likelihood improves (never decreases) after each iteration. (A runnable illustration follows below.)
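The monotone improvement of the likelihood can be observed numerically. The sketch below runs EM on a synthetic 1-D two-component mixture (the data, initial values, and iteration count are assumptions for the demo) and checks that the log-likelihood never decreases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two 1-D Gaussian clusters.
X = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Deliberately rough initial parameters theta_old.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for it in range(20):
    # E-step: component densities and responsibilities under theta_old.
    dens = pi * np.exp(-(X[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    ll = np.log(dens.sum(axis=1)).sum()
    assert ll >= prev_ll - 1e-9, "EM should never decrease the likelihood"
    prev_ll = ll
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize Q(theta, theta_old) in closed form.
    Nk = gamma.sum(axis=0)
    pi, mu = Nk / len(X), (gamma * X[:, None]).sum(axis=0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    print(f"iter {it:2d}  log-likelihood {ll:.3f}")
```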