Note Set 4: Finite Mixture Models and the EM Algorithm
Padhraic Smyth, Department of Computer Science, University of California, Irvine

Finite Mixture Models

A finite mixture model with K components, for a vector x, is defined as

    p(x | Θ) = \sum_{k=1}^{K} α_k p_k(x | z_k = 1, θ_k)        (1)

In this equation p_k(x | z_k = 1, θ_k) is the kth component density with parameters θ_k. Typically this takes some simple parametric form, such as a multivariate Gaussian density, for each component. In general the components need not all have the same parametric form, e.g., for 1-dimensional data one component could be Gaussian, another could be exponential, etc. Here we work with real-valued x, but we could instead have discrete x and probability distributions for p(x | Θ) and the K components p_k(x | z_k = 1, θ_k).

z represents a K-dimensional vector of indicator variables that indicate which of the K components generated x. The notation z_k = 1 denotes that the kth component of z is 1 and all the other components are 0. α_k = P(z_k = 1) is the marginal distribution of the components, i.e., the probability that a randomly selected x was generated by component k (think of it as the relative frequency with which each component generates data vectors). The α_k's are referred to as the mixture weights. Note that \sum_{k=1}^{K} α_k = 1, since each x is assumed to be generated by one (but only one) of the K components.

The full set of parameters in the mixture model is Θ = {α_1, ..., α_K, θ_1, ..., θ_K}.
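To make the generative process concrete, here is a minimal sketch of sampling from a 1-dimensional Gaussian mixture. The function name and the restriction to 1-d Gaussian components are illustrative assumptions, not part of the notes: first draw a component indicator z with probabilities α_k, then draw x from the selected component.

```python
import numpy as np

def sample_mixture(n, alphas, mus, sigmas, rng):
    """Generate n points from a 1-d Gaussian mixture: for each point,
    pick component z = k with probability alpha_k, then draw x from
    the kth Gaussian. (Illustrative sketch, not from the notes.)"""
    z = rng.choice(len(alphas), size=n, p=alphas)
    x = rng.normal(np.take(mus, z), np.take(sigmas, z))
    return x, z

# Example: a two-component mixture with weights 0.3 and 0.7.
x, z = sample_mixture(10000, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0],
                      np.random.default_rng(0))
```

With many samples, the fraction of points generated by component k approaches α_k, matching the "relative frequency" interpretation of the mixture weights.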
Note Set 4: Mixture Models and the EM Algorithm: ICS 274, Probabilistic Learning: Winter 2016

Membership Probabilities

Another way to think of a mixture model is via the law of total probability, where the marginal density p(x) is expressed as

    p(x) = \sum_{k=1}^{K} p(x, z_k = 1) = \sum_{k=1}^{K} p_k(x | z_k = 1, θ_k) P(z_k = 1) = \sum_{k=1}^{K} p_k(x | z_k = 1, θ_k) α_k        (2)

(ignoring the dependence on parameters for a moment). The marginal density for x is a weighted (convex) sum of component densities. As mentioned above, this model assumes that each x was generated by one of the K components. If we know (or fix) the parameters of the model, and we are given an observed vector x_i, we can use Bayes' rule to compute the probability that it was generated by a particular component k, also referred to as the membership probability:

    w_ik = \frac{α_k p_k(x_i | θ_k)}{\sum_{m=1}^{K} α_m p_m(x_i | θ_m)}

where we use p_k(x_i | θ_k) as shorthand for p_k(x_i | z_k = 1, θ_k). By definition \sum_{k=1}^{K} w_ik = 1, and x_i is some data vector, e.g., from a data set {x_1, ..., x_N}.

An important point is that the membership probabilities w_ik reflect our uncertainty about which of the K components generated x_i. For example, if w_ik = 0.5 and K = 2, this means that we are completely uncertain about which of the 2 components generated x_i. So, if we had a mixture model for 1-dimensional data with 2 Gaussians with the same variance but different means, then a point exactly half-way between the 2 means would have a membership probability of 0.5 for each component, and the membership probabilities would move closer to 0 or 1 as we move along the x axis towards either of the two means.

Using Maximum Likelihood to Learn the Parameters of a Mixture Model

Consider an observed data set D_x = {x_1, ..., x_N}, where each vector x_i, 1 ≤ i ≤ N, is a d-dimensional vector.
For simplicity, assume in our likelihood that the x_i are conditionally independent given the parameters of a true underlying density p(x), and further assume that p(x) is a K-component mixture model with parameters Θ and some particular parametric form for the K components (e.g., multivariate Gaussians). The general form of the log-likelihood can be written as:

    log L(Θ) = \sum_{i=1}^{N} log p(x_i | Θ) = \sum_{i=1}^{N} log ( \sum_{k=1}^{K} α_k p_k(x_i | θ_k) )        (3)

Even for relatively simple component models (such as 1-dimensional Gaussians), this log-likelihood does not yield simple closed-form solutions for the maximum likelihood parameters. In general we get a set of coupled non-linear equations that must be solved in an iterative manner.
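As a sanity check on the formula, here is a minimal sketch of evaluating Equation 3 for a 1-dimensional Gaussian mixture. The function name and the 1-d restriction are illustrative assumptions.

```python
import numpy as np

def gmm_log_likelihood(X, alphas, mus, sigma2s):
    """Log-likelihood (Equation 3) of 1-d data X under a Gaussian
    mixture with weights alphas, means mus, variances sigma2s.
    Illustrative sketch, not from the notes."""
    X = np.asarray(X, dtype=float)
    # N x K matrix whose (i, k) entry is alpha_k * p_k(x_i | theta_k)
    dens = np.stack([
        a * np.exp(-0.5 * (X - m) ** 2 / s) / np.sqrt(2 * np.pi * s)
        for a, m, s in zip(alphas, mus, sigma2s)
    ], axis=1)
    # Sum over components inside the log, then sum the logs over data.
    return np.sum(np.log(dens.sum(axis=1)))
```

Note that the sum over components sits inside the logarithm, which is precisely why no closed-form maximum exists in general.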
The difficulty in maximizing the log-likelihood above arises because we don't know which component generated each x_i. If we did know which component generated each data vector x_i, we could just group the data by component and estimate the parameters θ_k for each component separately (it's easy to show that this would maximize the log-likelihood, as long as the parameters θ_k for each component do not have any dependency on the parameters of other components). We are interested in the situation where we don't have the component indicators but would like to fit a mixture model nonetheless.

Let z_i be the K-dimensional latent (or hidden) indicator vector for each x_i, with a 1 for the component that generated x_i and a 0 for all the other components. So we can think of our data as being in two parts: the observed part D_x, an N × d matrix of observations, and a latent/unobserved part D_z, an N × K matrix of latent indicator variables.

There is a general framework for generating maximum likelihood parameter estimates in the presence of missing data known as the Expectation-Maximization (EM) algorithm. In the notes below we describe the specification of EM for the specific case of a finite mixture model with K multivariate Gaussian components, although note that the EM algorithm has much wider application in general for estimation in problems with missing data.

The Gaussian Mixture Model

In the remainder of this note set we will assume that we are working with a mixture of Gaussians model, i.e., each component is a Gaussian density

    p_k(x | θ_k) = \frac{1}{(2π)^{d/2} |Σ_k|^{1/2}} e^{-\frac{1}{2}(x − µ_k)^T Σ_k^{-1} (x − µ_k)}        (4)

with its own parameters θ_k = {µ_k, Σ_k}. Note that we can compute the membership probabilities w_ik, given any vector x_i and mixture model parameters Θ, by plugging the functional form for p_k(x | θ_k) above into the membership probability expression following Equation 2.
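Putting Equation 4 and the membership probability formula together, a minimal sketch might look like the following. The function names are illustrative assumptions, and Σ_k is assumed positive definite.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density of Equation 4 (Sigma assumed
    positive definite)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def membership_matrix(X, alphas, mus, Sigmas):
    """N x K matrix of membership probabilities w_ik: each row applies
    Bayes' rule across the K Gaussian components and sums to 1."""
    W = np.array([[a * gaussian_density(x, m, S)
                   for a, m, S in zip(alphas, mus, Sigmas)]
                  for x in X])
    return W / W.sum(axis=1, keepdims=True)
```

For two equal-weight components with identity covariances and means (−1, 0) and (1, 0), the point at the origin gets membership probability 0.5 for each component, consistent with the half-way-point example above.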
The EM Algorithm for Gaussian Mixtures

The EM (Expectation-Maximization) algorithm for Gaussian mixtures is defined as a combination of E and M steps:

E-Step: Denote the current parameter values as Θ. Compute w_ik (using the equations above) for all data points x_i, 1 ≤ i ≤ N, and all mixture components 1 ≤ k ≤ K. Note that for each data point x_i the membership weights satisfy \sum_{k=1}^{K} w_ik = 1 by definition (from Equation 2). This yields an N × K matrix of membership weights w_ik, where each row sums to 1.

M-Step: Now we use the matrix of membership weights and the data to calculate new parameter values. Specifically,

    α_k^new = \frac{1}{N} \sum_{i=1}^{N} w_ik,   1 ≤ k ≤ K.        (5)
These are the new mixture weights, with \sum_{k=1}^{K} α_k^new = 1 by definition. Let N_k = \sum_{i=1}^{N} w_ik. This is the effective number of data points assigned to component k, 1 ≤ k ≤ K.

    µ_k^new = \frac{1}{N_k} \sum_{i=1}^{N} w_ik x_i,   1 ≤ k ≤ K.        (6)

The updated mean is calculated in a manner similar to how we would compute a standard empirical average, except that the ith measurement x_i has a fractional weight w_ik. Note that this is a vector equation, since µ_k^new and x_i are both d-dimensional vectors.

    Σ_k^new = \frac{1}{N_k} \sum_{i=1}^{N} w_ik (x_i − µ_k^new)(x_i − µ_k^new)^T,   1 ≤ k ≤ K.        (7)

Again we get an equation similar in form to how we would normally compute an empirical covariance matrix, except that the contribution of each data point is weighted by w_ik. Note that this is a matrix equation, with d × d terms on each side.

The equations in the M-step need to be computed in this order: first the K new α's, then the K new µ's, and finally the K new Σ's. After we have computed all of the new parameters, the M-step is complete, and we can go back and recompute the membership weights in the E-step, then recompute the parameters in the M-step, and continue updating the parameters in this manner. Each pair of E and M steps is considered one iteration.

Initialization

The EM algorithm can be started by either

1. Weight Initialization: start with a matrix of randomly selected weights w_ik (ensuring that the random weights in each row sum to 1) and then begin the algorithm with an M-step. This is relatively easy to implement. For large N it can produce initial components (after the first M-step) that are heavily overlapped, with means close to each other, which can sometimes lead to slower convergence than with other methods.

2. Parameter Initialization: initialize the algorithm with a set of heuristically-selected initial parameters and then begin the algorithm with an E-step.
The initial set of parameters or weights can be determined by any of a variety of heuristics. For example, select K random data points as the K initial means, and use the covariance matrix of the whole data set (or a scaled version of it) for each of the initial K covariance matrices. Or the parameters could be chosen by using the non-probabilistic K-means algorithm to first cluster the data and then defining weights based on the K-means memberships.
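The M-step updates of Equations 5-7 and the weight-initialization option above can be sketched as follows. This is a minimal sketch; the function names are illustrative assumptions.

```python
import numpy as np

def m_step(X, W):
    """M-step of Equations 5-7, given data X (N x d) and membership
    weights W (N x K). Returns updated (alphas, mus, Sigmas)."""
    N, d = X.shape
    K = W.shape[1]
    Nk = W.sum(axis=0)                      # effective counts N_k
    alphas = Nk / N                         # Equation 5
    mus = (W.T @ X) / Nk[:, None]           # Equation 6 (weighted means)
    Sigmas = np.empty((K, d, d))
    for k in range(K):                      # Equation 7 (weighted covariances)
        diff = X - mus[k]
        Sigmas[k] = (W[:, k, None] * diff).T @ diff / Nk[k]
    return alphas, mus, Sigmas

def init_weights(N, K, rng):
    """Weight initialization (option 1 above): random rows summing to 1."""
    W = rng.random((N, K))
    return W / W.sum(axis=1, keepdims=True)
```

With hard (0/1) weights, these updates reduce to ordinary per-group sample means and covariances, which matches the observation above that knowing the component labels would make estimation straightforward.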
Convergence Detection

Convergence is generally detected by computing the value of the log-likelihood (as defined in Equation 3) after each iteration and halting when it no longer appears to be changing in a significant manner from one iteration to the next. One potentially tricky issue here is that the log-likelihood values may be on very different scales from one problem to the next, since they depend on the scale of the data points x. So an alternative method (that also works well and is simple to implement) is to monitor how much the membership weights w_ik are changing from one iteration to the next, and halt the EM algorithm when the average change in these weights (across all N × K weights) is less than some small value such as 10^{-6}.

Note that EM in general can be shown to converge to a local maximum of the log-likelihood function rather than a global maximum. The particular solution it converges to in parameter space depends on where the algorithm is initialized. So in practice one often runs EM multiple times with different randomly-seeded initializations and selects the solution with the highest log-likelihood across the different runs.

The K-Means Algorithm

The K-means algorithm is a well-known clustering algorithm based on minimizing mean-squared error to cluster centers. It is not based on a probabilistic model, but nonetheless shares some similarities with EM for Gaussian mixtures. Assume again that we have a data set D_x = {x_1, ..., x_N}. We would like to cluster the N data vectors into K clusters. K-means has 2 steps per iteration, like EM.

Membership Assignment: Given a current set of cluster means µ_k, k = 1, ..., K, each data point x_i is assigned to the cluster whose mean is closest in terms of Euclidean distance.
Note that this is similar to the E-step in EM, except that (a) Euclidean distance is being used (there is no notion of a covariance matrix), and (b) no membership probabilities are being computed, i.e., a hard decision on membership is made for each data point x_i.

Updating of Cluster Centers: Given a set of cluster assignments (i.e., each data vector x_i has been assigned to one of the K clusters), we can update the vector representing the mean of each cluster by computing

    µ_k = \frac{1}{N_k} \sum_{i: x_i ∈ cluster k} x_i,   1 ≤ k ≤ K,

i.e., compute the average of the N_k data points assigned to cluster k, for each cluster. The algorithm begins by either (a) randomly selecting K data vectors to act as initial cluster centers or (b) randomly assigning each data point to a cluster, and then starting with the appropriate step. The algorithm converges when the cluster assignments do not change from one iteration to the next (which means that the updated means will not change, which implies we have reached a fixed point).
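The two steps above, together with random-data-point initialization (option (a)) and the assignment-based convergence test, can be sketched as follows. Function and variable names are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    """K-means sketch: hard assignment to the nearest center, then
    recompute each center as the mean of its assigned points."""
    mus = X[rng.choice(len(X), size=K, replace=False)]  # option (a) init
    assign = None
    for _ in range(max_iter):
        # Membership assignment: nearest center in Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                      # assignments unchanged -> fixed point
        assign = new_assign
        # Updating of cluster centers (skip any empty cluster).
        for k in range(K):
            if np.any(assign == k):
                mus[k] = X[assign == k].mean(axis=0)
    return mus, assign
```

On two well-separated groups of points, the returned centers settle at the two group means after a few iterations, regardless of which data points are picked as initial centers.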
The K-means algorithm above can be shown to be trying to find the set of cluster means µ_k that minimizes the sum of squared errors between each data point x_i and its closest center. We can also think of this as performing data compression: finding the K best vectors µ_k to represent the full set of N vectors, where "best" is measured in a mean-squared sense. We can relate this to EM by considering the mean-squared error to be a special case of the log-likelihood, by assuming that the covariances are fixed to the identity matrix for each cluster, and by replacing the E-step with a hard assignment of each vector x_i to its closest cluster center.

Like EM, K-means only guarantees a local minimum of the mean-squared error function, rather than a global minimum, depending on where it is initialized. Thus, as with EM, it is typical to run it multiple times from different randomly-seeded starting points and then select the solution with the lowest mean-squared error.

In practice both the K-means and Gaussian mixture algorithms are useful for clustering and data compression. K-means can be more robust since it does not require parametric assumptions on the shapes of the clusters, but the mixture model approach is in general more flexible in the types of clusters it can find.
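The error criterion used to compare K-means runs can be sketched as follows (an illustrative helper, not from the notes):

```python
import numpy as np

def sse(X, mus, assign):
    """Sum of squared errors between each data point and the center of
    the cluster it is assigned to."""
    return float(sum(np.sum((X[assign == k] - mus[k]) ** 2)
                     for k in range(len(mus))))
```

One would run K-means several times from different random starts, compute this quantity for each run, and keep the run with the smallest value, mirroring the highest-log-likelihood selection rule used for EM.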