Gaussian Mixture Models for Clustering Data: Soft Clustering and the EM Algorithm
K-Means Clustering
Input: Observations $x_i \in \mathbb{R}^d$, $i \in \{1, \dots, N\}$; number of clusters $k$.
Output: Cluster assignments and cluster centroids $\mu_j \in \mathbb{R}^d$, $j \in \{1, \dots, k\}$.
K-Means Clustering
Let $z_i$ be a binary vector of dimension $k$ associated with each observation. If the $i$-th observation belongs to the $j$-th cluster, then $z_{ij} = 1$ and all other components of $z_i$ are zero. Thus, $z_i$ can be considered a cluster label vector associated with each observation.
[Figure: 1-of-k representation for cluster assignment, with example label vectors for Cluster-1, Cluster-2, and Cluster-3.]
K-Means Clustering
We can now cast k-means as a minimization problem with the objective function:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \lVert x_n - \mu_k \rVert^2$$
where $\lVert x_n - \mu_k \rVert^2$ is the squared distance between observation $x_n$ and centroid $\mu_k$. We need to find the $z_{nk}$ and $\mu_k$ that minimize $J$.
K-Means Clustering
Minimizing this function w.r.t. $z_{nk}$: the $N$ points are independent, so we can optimize each one separately. Choose $z_{nk}$ to be 1 for whichever value of $k$ gives the minimum squared distance, i.e., assign the current observation to the nearest cluster center.
Minimizing this function w.r.t. $\mu_k$: take the derivative of $J$ w.r.t. $\mu_k$ and equate it to zero, which gives
$$\mu_k = \frac{\sum_{n=1}^{N} z_{nk} \, x_n}{\sum_{n=1}^{N} z_{nk}}$$
Finally! Iterative Algorithm for K-Means:
Initialize $k$ centroids.
Repeat till convergence:
- Calculate $z_{nk}$ (assign each observation to its nearest centroid).
- Update $\mu_k$ (recompute each centroid as the mean of its assigned observations).
A minimal implementation is sketched below.
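As a concrete illustration, here is a minimal NumPy sketch of this loop (the function and variable names are my own, not from the slides): it alternates the two updates derived above until the assignments stop changing.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means: X is (N, d); returns assignments z (N,) and centroids mu (k, d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initialize k centroids at random points
    z = np.full(len(X), -1)
    for _ in range(max_iters):
        # Minimize J w.r.t. z_nk: assign each point to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, k) squared distances
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments unchanged -> converged
        z = z_new
        # Minimize J w.r.t. mu_k: set each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(z == j):
                mu[j] = X[z == j].mean(axis=0)
    return z, mu
```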
Example
Looking for two clusters. Initialize the centroids (the blue and the red crosses).
[Figure: data points with the two initial centroids.]
Example
Calculate the cluster assignments.
[Figure: points colored by their initial cluster assignment.]
Example
Re-calculate the centroids based on the new cluster assignments.
[Figure: centroids moved to the means of their assigned points.]
Example
[Figure: iteration 2 shows the new $z_{nk}$ and the new $\mu_k$ (new blue and red centroids); iteration 3 repeats both steps and reaches the final result.]
Minimizing the Objective Function
[Figure: value of the objective function $J$ plotted against iteration number; $J$ decreases at every step until convergence.]
Terminology
Model Parameters ($\theta = \{\mu_{1..k}\}$): the centroids.
Complete Data ($y = \{x_{1..n}, z_{1..n}\}$): the observations along with the cluster assignments.
Incomplete Data ($x_{1..n}$): only the observations.
In clustering problems we only have incomplete data and need to find estimates for both $\theta$ and $z_{1..n}$.
Dissecting the K-Means Algorithm
Initialize model parameters $\theta^{(old)}$; estimate the cluster assignments $z_{nk}$ (E-Step); estimate new model parameters $\theta^{(new)}$ (M-Step).
There is a cyclic dependency between the model parameters and the cluster assignments. Initially we guess the model parameters. Based on this guess we estimate new cluster assignments, which in turn impact the model parameters, which are then re-estimated.
Another Problem
K-Means makes hard guesses for cluster assignments. In some cases our model may not be sure about the exact cluster assignment.
[Figure: a point lying roughly midway between the two centroids.]
Can we make this probabilistic, so that $z_{nk}$ defines the probability that the $n$-th observation belongs to the $k$-th cluster?
Probabilistic Clustering
Let's place a Gaussian centered at each of the means discovered by K-Means (assume we know the covariances). Since we have run the k-means algorithm, we have access to the complete data, i.e., $y = \{x_{1..n}, z_{1..n}\}$.
The probability of the complete data (the complete-data likelihood) is:
$$P(X, Z \mid \theta) = \prod_n \prod_k \{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\}^{z_{nk}}$$
Probabilistic Clustering
We don't know the values of $Z$ for our data; they are missing/hidden/latent. We need to get rid of $Z$ to calculate the data likelihood, so we marginalize it out:
$$P(X \mid \theta) = \sum_Z P(X, Z \mid \theta)$$
Let's see what happens to our complete-data likelihood when we marginalize out $Z$:
$$P(x_i \mid \theta) = \sum_{z_i} \prod_k \{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\}^{z_{ik}} = \sum_k \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
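To see why the product collapses to a sum, consider $K = 2$: the only valid 1-of-k assignments are $z_i = (1, 0)$ and $z_i = (0, 1)$, so the sum over $z_i$ picks out exactly one factor per assignment:
$$\sum_{z_i} \prod_{k=1}^{2} \{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\}^{z_{ik}} = \pi_1 \, \mathcal{N}(x_i \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(x_i \mid \mu_2, \Sigma_2)$$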
Gaussian Mixture Model
Data generated from a mixture distribution:
$$P(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
A linear superposition of $K$ Gaussians, with the added constraints $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ (a multinomial distribution over components).
Generating data: pick one of the Gaussians at random with probability $\pi_k$, then sample a value from the Gaussian centered at $\mu_k$. A sampling sketch follows.
Parameters of the GMM: $\theta = \{\pi_{1..k}, \mu_{1..k}, \Sigma_{1..k}\}$.
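To make the generative story concrete, here is a small NumPy sketch (the names and the example parameter values are illustrative, not from the slides) that samples exactly as described: first a component index, then a draw from that component's Gaussian.

```python
import numpy as np

def sample_gmm(pis, mus, Sigmas, n, seed=0):
    """Draw n samples from a GMM with mixing weights pis, means mus, covariances Sigmas."""
    rng = np.random.default_rng(seed)
    # Step 1: pick a component for each sample with probability pi_k (multinomial).
    ks = rng.choice(len(pis), size=n, p=pis)
    # Step 2: sample from the chosen component's Gaussian N(mu_k, Sigma_k).
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Illustrative parameters for a 2-component mixture in R^2.
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
X, ks = sample_gmm(pis, mus, Sigmas, n=500)
```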
Estimating the Parameters
We want to estimate our model parameters such that the probability of the data being generated by the model is maximized:
$$\theta^* = \arg\max_\theta P(X \mid \theta)$$
which is equivalent to
$$\theta^* = \arg\max_\theta \log P(X \mid \theta)$$
Let's apply this to our incomplete-data likelihood:
$$P(X \mid \theta) = \prod_n \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
$$\log P(X \mid \theta) = \sum_n \log \left( \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
STUCK! The log cannot be pushed inside the sum over components, so the parameters stay coupled and there is no closed-form maximum.
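Even though it has no closed-form maximizer, this log likelihood is easy to evaluate. A sketch using SciPy (function name is my own) computes it stably in log space via logsumexp:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log P(X | theta) = sum_n log( sum_k pi_k N(x_n | mu_k, Sigma_k) )."""
    # log_pdfs[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_pdfs = np.column_stack([
        np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pis))
    ])
    # logsumexp over components gives log P(x_n | theta) without underflow.
    return logsumexp(log_pdfs, axis=1).sum()
```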
Estimating the Parameters
To make things a bit simpler, assume we know $Z$. Now we can maximize the complete-data log likelihood and estimate the model parameters:
$$P(X, Z \mid \theta) = \prod_n \prod_k \{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\}^{z_{nk}}$$
$$\log P(X, Z \mid \theta) = \sum_n \sum_k z_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$
This is much easier to work with: the parameters are decoupled and we can maximize each one separately.
Estimating the Parameters
If we maximize the complete-data log likelihood we get the following estimates (with $N_k = \sum_n z_{nk}$):
$$\mu_k = \frac{\sum_n z_{nk} \, x_n}{N_k} \qquad \Sigma_k = \frac{1}{N_k} \sum_n z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T \qquad \pi_k = \frac{N_k}{N}$$
Are we done? What about the $z_{nk}$? We assumed they are known, but they are not!
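If the assignments were known, these three estimates would be direct to compute. A minimal sketch, assuming $z$ is given as an $(N, K)$ indicator matrix (names are illustrative):

```python
import numpy as np

def mstep(X, z):
    """Closed-form estimates of pi_k, mu_k, Sigma_k from the complete data (X, z)."""
    N, d = X.shape
    Nk = z.sum(axis=0)                     # N_k = sum_n z_nk, shape (K,)
    pis = Nk / N                           # pi_k = N_k / N
    mus = (z.T @ X) / Nk[:, None]          # mu_k = sum_n z_nk x_n / N_k
    Sigmas = []
    for k in range(z.shape[1]):
        diff = X - mus[k]                  # (N, d) deviations from mu_k
        # Sigma_k = (1/N_k) sum_n z_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigmas.append((z[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(Sigmas)
```

The same function will serve as the M-step of EM later, with the responsibilities $\gamma(z_{nk})$ substituted for the binary $z_{nk}$.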
What about $z_{nk}$?
Recall the game we played while using k-means: guess the parameters $\theta^{(old)}$, estimate the cluster assignments $z_{nk}$, then estimate the new model parameters $\theta^{(new)}$.
Fixing the parameters to some values, we now get a distribution over the missing $Z$, i.e., $P(Z \mid X, \theta)$.
OK! But this is a distribution; how do I get individual values for $z_{nk}$?
What about $z_{nk}$?
Let's evaluate the expected value of each $z_{nk}$ under $P(Z \mid X, \theta)$:
$$E_{P(Z \mid X, \theta)}[z_{nk}] = 1 \cdot P(z_{nk} = 1 \mid x_n, \theta_k) + 0 \cdot P(z_{nk} = 0 \mid x_n, \theta_k) = P(z_{nk} = 1 \mid x_n, \theta_k)$$
Using Bayes' theorem we have:
$$P(z_{nk} = 1 \mid x_n, \theta_k) = \frac{P(x_n \mid z_{nk} = 1, \theta_k) \, P(z_{nk} = 1 \mid \theta_k)}{P(x_n \mid \theta_k)}$$
The left-hand side is the probability that the $k$-th component was chosen to generate $x_n$; the numerator combines the probability of generating $x_n$ using the $k$-th component with the prior probability of choosing that component; the denominator is the incomplete-data likelihood for $x_n$.
What about $z_{nk}$?
So,
$$E_{P(Z \mid X, \theta)}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk})$$
This quantity can be viewed as the responsibility that the $k$-th component takes for explaining the observation $x_n$.
Finally, we can substitute this value for $z_{nk}$ in our parameter estimates as our best guess for the value of $z_{nk}$ given our current model parameters. A sketch of this computation follows.
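A minimal sketch of the responsibility computation (reusing SciPy's multivariate normal; the function name is my own), normalized per observation so each row sums to 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def estep(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    # Unnormalized responsibilities: pi_k * N(x_n | mu_k, Sigma_k), shape (N, K).
    gamma = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
        for k in range(len(pis))
    ])
    # Normalize each row so responsibilities sum to 1 over components.
    return gamma / gamma.sum(axis=1, keepdims=True)
```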
EM for GMM-Based Clustering
1. Initialize the model parameters $\theta^{(0)}$.
2. E-Step: Evaluate the responsibilities using the current parameter estimates:
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
3. M-Step: Re-estimate the parameters using the current responsibilities:
$$\mu_k = \frac{\sum_n \gamma(z_{nk}) \, x_n}{N_k} \qquad \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T \qquad \pi_k = \frac{N_k}{N}$$
where $N_k = \sum_n \gamma(z_{nk})$.
4. If the convergence criterion is not satisfied, go back to step 2.
EM for GMM-Based Clustering
Convergence criterion: check for a change in the values of the parameters, or calculate the incomplete-data log likelihood
$$\log P(X \mid \theta) = \sum_n \log \left( \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
and stop if its value on the current iteration has not changed from the previous value, or the change is negligible (below a preset tolerance). The full loop is sketched below.
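Putting the pieces together: a compact driver for the whole algorithm, reusing the `estep`, `mstep`, and `gmm_log_likelihood` sketches defined earlier. This is one plausible arrangement rather than a reference implementation; the random-sample initialization and tolerance are illustrative choices.

```python
import numpy as np

def em_gmm(X, K, max_iters=200, tol=1e-6, seed=0):
    """EM for a K-component GMM; returns responsibilities and parameters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: initialize theta^(0): uniform weights, random means, identity covariances.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.array([np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(max_iters):
        gamma = estep(X, pis, mus, Sigmas)            # Step 2: responsibilities
        pis, mus, Sigmas = mstep(X, gamma)            # Step 3: re-estimate parameters
        ll = gmm_log_likelihood(X, pis, mus, Sigmas)  # Step 4: convergence check
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return gamma, pis, mus, Sigmas
```

Note that `mstep` is reused unchanged: as described above, the M-step simply substitutes the responsibilities $\gamma(z_{nk})$ for the unknown binary $z_{nk}$.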
Example
[Figure: scatter plot of data sampled from three Gaussians centered at $[-1,-3]$, $[-3,-3]$, and $[-4.75,-3]$.]
Number of clusters?
Example
[Figure: two panels, Ground Truth vs. K-Means clustering of the same data.]
Run K-Means with 5 random starting points. Select the solution that has the minimum sum of squared distances.
Example
[Figure: two panels, Ground Truth vs. contour plot of the final GMM.]
Soft clustering using a three-component Gaussian Mixture Model with a random starting point.
Example
Original means: $[-1,-3]$, $[-3,-3]$, $[-4.75,-3]$
K-Means centroids: $[-1.335,-4.57]$, $[-1.581, -1.6458]$, $[-4.3681, -3.49]$
Means of the three Gaussians discovered by GMM: $[-1.6, -.9663]$, $[-.9747, -.991]$, $[-4.7488, -.9717]$
EM Algorithm
A very powerful method for dealing with probabilistic models that involve latent/missing variables.
Each iteration of EM is guaranteed not to decrease the data log likelihood.
Guaranteed to converge to a local maximum; sensitive to starting points.
We have applied it to Gaussian Mixture Models, which can model densities of arbitrary shape.
Can be used for data density estimation aside from clustering.