Mixture models and clustering


Lecture topics: Mixture models and clustering, k-means; distance and clustering.

Mixture models and clustering

We have so far used mixture models as flexible ways of constructing probability models for prediction tasks. The motivation behind the mixture model was that the available data may include unobserved (sub)groups, and by incorporating such structure in the model we could obtain more accurate predictions. We are also interested in uncovering that group structure. This is a clustering problem. For example, in the case of modeling exam scores it would be useful to understand what types of students there are. In a biological context, it would be useful to uncover which genes are active (expressed) in which cell types when the measurements are from tissue samples involving multiple cell types in unknown quantities.

Clustering problems are ubiquitous. There are many different types of clustering algorithms. Some generate a series of nested clusters by merging simple clusters into larger ones (hierarchical agglomerative clustering), while others try to find a pre-specified number of clusters that best capture the data (e.g., k-means). Which algorithm is the most appropriate to use depends in part on what we are looking for, so it is very useful to know more than one clustering method.

Mixture models as generative models require us to articulate the type of clusters or subgroups we are looking to identify. The simplest type of clusters we could look for are spherical Gaussian clusters, i.e., we would be estimating Gaussian mixtures of the form

P(x; θ, m) = Σ_{j=1}^{m} P(j) N(x; μ_j, σ_j² I)   (1)

where the parameters θ include {P(j)}, {μ_j}, and {σ_j²}. Note that we are estimating the mixture models with a different objective in mind: we are more interested in finding where the clusters are than in how good the mixture model is as a generative model.

There are many questions to resolve. For example, how many such spherical Gaussian clusters are there in the data? This is a model selection problem. If such clusters exist, do we have enough data to identify them? If we have enough data, can we hope to find the clusters via the EM algorithm? Is our approach robust, i.e., does our method degrade gracefully when the data contain background samples or impurities along with spherical Gaussian clusters? Can we make the clustering algorithm more robust? Similar questions apply to other clustering algorithms. We will touch on some of these issues in the context of each algorithm.
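As a concrete illustration of equation (1), here is a minimal NumPy sketch of evaluating a spherical Gaussian mixture density; the function name and argument layout are illustrative choices, not part of the notes:

```python
import numpy as np

def spherical_gmm_density(x, weights, means, variances):
    """Evaluate P(x; theta, m) = sum_j P(j) N(x; mu_j, sigma_j^2 I)  (equation 1)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        # Spherical Gaussian: covariance sigma_j^2 * I, so only the squared
        # Euclidean distance to the mean matters.
        norm_const = (2.0 * np.pi * var) ** (-d / 2.0)
        sq_dist = np.sum((x - np.asarray(mu)) ** 2)
        total += w * norm_const * np.exp(-sq_dist / (2.0 * var))
    return total
```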

Mixtures and K-means

It is often helpful to search for simple clusters. For example, consider a mixture of two Gaussians where the variances of the spherical Gaussians may be different:

P(x; θ) = Σ_{j=1}^{2} P(j) N(x; μ_j, σ_j² I)   (2)

Points far away from the two means, in whatever direction, would always be assigned to the Gaussian with the larger variance (the tails of that distribution approach zero much more slowly; the posterior assignment is based on the ratio of probabilities). While this is perfectly fine for modeling the density, it does not quite agree with the clustering goal. To avoid such issues (and to keep the discussion simpler) we will restrict the covariance matrices to be all identical and equal to σ²I, where σ² is common to all clusters. Moreover, we will fix the mixing proportions to be uniform, P(j) = 1/m. With such restrictions we can more easily understand the type of clusters we will discover. The resulting simplified mixture model has the form

P(x; θ) = (1/m) Σ_{j=1}^{m} N(x; μ_j, σ² I)   (3)

Let's begin by trying to understand how points are assigned to clusters within the EM algorithm. To this end, we can see how the mixture model partitions the space into regions such that within each region a specific component has the highest posterior probability. To start with, consider any two components i and j and the boundary where P(i|x, θ) = P(j|x, θ), i.e., the set of points for which components i and j have the same posterior probability. Since the Gaussians have equal prior probabilities, this happens when N(x; μ_j, σ²I) = N(x; μ_i, σ²I). Both Gaussians have the same spherical covariance matrix, so the probability they assign to a point is based on the Euclidean distance to their mean vectors. The posteriors therefore can be equal only when the component means are equidistant from the point:

‖x − μ_i‖ = ‖x − μ_j‖, or equivalently 2x^T(μ_j − μ_i) = μ_j^T μ_j − μ_i^T μ_i   (4)
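To make the posterior assignment concrete, here is a small sketch (an illustration, not taken from the notes) of computing P(j|x) for the simplified model of equation (3), where all components share σ² and the mixing proportions are uniform:

```python
import numpy as np

def posterior_assignments(x, means, sigma2):
    """Posterior P(j | x) for the simplified mixture of equation (3)."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    # With uniform P(j) and a common sigma^2, the posterior depends only on
    # the squared Euclidean distances to the component means.
    sq_dists = np.sum((means - x) ** 2, axis=1)
    logits = -sq_dists / (2.0 * sigma2)
    logits -= logits.max()          # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

Shrinking sigma2 in this sketch makes the returned posteriors approach hard 0/1 assignments, which previews the K-means limit discussed below.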

The boundary is therefore linear in x (the linearity holds whenever the two Gaussians have the same covariance matrix, spherical or not). We can draw such boundaries between any pair of components as illustrated in Figure 1. The pairwise comparisons induce a Voronoi partition of the space where, for example, the region enclosing μ_1 in the figure corresponds to all the points whose closest mean is μ_1. This is also the region where P(1|x, θ) takes the highest value among all the posterior probabilities.

Figure 1: Voronoi regions resulting from pairwise comparison of posterior assignment probabilities. Each line corresponds to a decision between the posterior probabilities of two components. A line is highlighted (solid) when the boundary is an active constraint for deciding the highest-probability component.

Note that the posterior assignment probabilities P(j|x, θ) evaluated in the E-step of the EM algorithm do not vanish across the boundaries. The regions merely highlight where one assignment has a higher probability than the others. The overall variance parameter σ² controls how sharply the posterior probabilities change when we cross each boundary. Small σ² results in very sharp transitions (e.g., from nearly zero to nearly one) while the posteriors change smoothly when σ² is large.

K-means. We can define a simpler and faster version of the EM algorithm by just assigning each point to the component with the highest posterior probability (its closest mean). Geometrically, the points within each Voronoi region are assigned to one component. The M-step then simply repositions each mean vector to the center of the points within the region. The updated means define new Voronoi regions, and so on. The resulting algorithm for finding cluster means is known as the K-means algorithm:

E-step: assign each point x_t to its closest mean, i.e., j_t = arg min_j ‖x_t − μ_j‖²
M-step: recompute each μ_j as the mean of the points assigned to it.
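A minimal sketch of the two alternating steps just described, assuming the data are rows of a matrix X and the means are given as an initial (k, d) array (names and stopping rule are illustrative):

```python
import numpy as np

def kmeans(X, means, n_iters=100):
    """Alternate the hard E-step and M-step of the K-means algorithm."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    for _ in range(n_iters):
        # E-step: assign each point to its closest mean.
        sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # M-step: recompute each mean as the average of its assigned points
        # (keep the old mean if a cluster happens to be empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(len(means))])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```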

This is a highly popular hard-assignment version of the EM algorithm. We can view the algorithm as optimizing the complete log-likelihood of the data with respect to the assignments j_1, ..., j_n (E-step) as well as the means μ_1, ..., μ_m (M-step). This is equivalent to minimizing the overall squared error (distortion) to the cluster means:

J(j_1, ..., j_n, μ_1, ..., μ_m) = Σ_{t=1}^{n} ‖x_t − μ_{j_t}‖²   (5)

Each step of the algorithm, E-step or M-step, decreases the objective until convergence. The algorithm can never return to the same assignments j_1, ..., j_n, since that would yield the same value of the objective. Since there are only finitely many possible assignments of the n points to the k means (at most k^n), the algorithm has to converge in a finite number of steps. At convergence, the cluster means μ̂_1, ..., μ̂_m are locally optimal solutions to the minimum-distortion objective

J(μ̂_1, ..., μ̂_m) = Σ_{t=1}^{n} min_j ‖x_t − μ̂_j‖²   (6)

The speed of the K-means algorithm comes at a cost: it is even more sensitive to proper initialization than a mixture of Gaussians trained with EM. A typical and reasonable initialization corresponds to setting the first mean to be a randomly selected training point, the second to be the training point furthest from the first mean, the third to be the training point furthest from the first two means, and so on. This initialization is a greedy packing algorithm.

Identifiability. K-means or a mixture of Gaussians trained with the EM algorithm can only do as well as the maximum likelihood solution they aim to recover. In other words, even when the underlying clusters are spherical, the number of training examples may be too small for the globally optimal (non-trivial) maximum likelihood solution to provide any reasonable clusters. As the number of points increases, the quality of the maximum likelihood solution improves, although it may still be hard to find with the EM algorithm; with a large enough number of training points the solution also becomes easier to find. We could therefore, roughly speaking, divide the clustering problem into three regimes depending on the number of training points: not solvable, solvable but hard, and easy. For a recent discussion of such regimes, see Srebro et al., "An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering," ICML 2006.
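A sketch of the greedy packing initialization described above, which could be used to produce the initial means for the kmeans sketch earlier (the helper name and use of NumPy's random generator are assumptions for illustration):

```python
import numpy as np

def farthest_point_init(X, k, seed=None):
    """Greedy packing initialization: pick a random first mean, then repeatedly
    pick the training point furthest from all means chosen so far."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of each point to its nearest already-chosen mean.
        sq_dists = np.min(((X[:, None, :] - np.array(means)[None, :, :]) ** 2).sum(axis=2),
                          axis=1)
        means.append(X[np.argmax(sq_dists)])
    return np.array(means)
```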

Distance and clustering

Gaussian mixture models and the K-means algorithm make use of the Euclidean distance between points. Why should the points be compared in this manner? In many cases the vector representation of the objects to be clustered is derived, i.e., comes from some feature transformation. It is not at all clear that the Euclidean distance is an appropriate way of comparing the resulting feature vectors. Consider, for example, a document clustering problem. We might treat each document as a bag of words and map it into a feature vector based on term frequencies:

n_w(x) = number of times word w appears in document x   (7)

f(w|x) = n_w(x) / Σ_{w'} n_{w'}(x) = term frequency   (8)

φ_w(x) = f(w|x) · log( # of docs / # of docs with word w )   (9)

where the log factor is the inverse document frequency (IDF) and the feature vector φ(x) is known as the TFIDF mapping (there are many variations of this). The IDF term aims to de-emphasize words that appear in all documents, words that are unlikely to be useful for clustering or classification. We can now interpret the vectors φ(x) as points in a Euclidean space and apply, e.g., the K-means clustering algorithm.

There are, however, many other ways of defining a distance metric for clustering documents. The distance metric plays a central role in clustering, regardless of the algorithm. For example, a simple hierarchical agglomerative clustering algorithm is defined almost entirely on the basis of the distance function: the algorithm proceeds by successively merging the two closest points or closest clusters (e.g., in terms of average squared distance), as illustrated in Figure 2. In solving clustering problems, the focus should be on defining the distance (similarity) metric, perhaps at the expense of a specific algorithm; most algorithms could be applied with the chosen metric.

One avenue for defining a distance metric is through model selection. Let's see how this can be done. The simplest model over the words in a document is a unigram model, where each word is an independent sample from a multinomial distribution {θ_w}, Σ_w θ_w = 1. The probability of all the words in document x is therefore

P(x|θ) = Π_{w ∈ x} θ_w = Π_w θ_w^{n_w(x)}   (10)
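Returning for a moment to the TFIDF mapping of equations (7)-(9), here is a minimal sketch of that mapping with documents represented as lists of word tokens; the helper name and the dictionary-based sparse representation are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of tokens) to a sparse TFIDF vector,
    following equations (7)-(9): term frequency * log(#docs / #docs with word)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))       # count documents containing each word
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({w: (c / total) * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return vectors
```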

Figure 2: (a)-(c) successive merging of clusters in a hierarchical clustering algorithm, and (d) the resulting cluster hierarchy.

It is often a good idea to normalize the documents so they have the same length. We could, for example, say that each document is of length one and that the word counts are given by the term frequencies f(w|x). Accordingly,

P(x̃|θ) = Π_w θ_w^{f(w|x)}   (11)

where x̃ denotes the normalized version of document x. Consider a cluster of documents C. The unigram model that best describes the normalized documents in the cluster can be obtained by maximizing the log-likelihood

l(C; θ) = Σ_{x ∈ C} log P(x̃|θ) = Σ_{x ∈ C} Σ_w f(w|x) log θ_w   (12)

The maximum likelihood solution is, as before, obtained through empirical counts (represented here by term frequencies):

θ̂_w = (1/|C|) Σ_{x ∈ C} f(w|x)   (13)
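A small sketch of the maximum likelihood estimate in equation (13), averaging the per-document term frequencies over a cluster (the function name and token-list representation are assumptions, matching the earlier TFIDF sketch):

```python
from collections import Counter

def cluster_unigram_mle(cluster_docs):
    """Equation (13): theta_hat_w = (1/|C|) * sum over documents of f(w|x)."""
    theta = Counter()
    for doc in cluster_docs:            # each doc is a list of word tokens
        counts = Counter(doc)
        total = sum(counts.values())
        for w, c in counts.items():
            theta[w] += (c / total) / len(cluster_docs)
    return dict(theta)                  # the values sum to one over the vocabulary
```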

The resulting log-likelihood value is

l(C; θ̂) = Σ_{x ∈ C} Σ_w f(w|x) log θ̂_w   (14)
        = |C| Σ_w [ (1/|C|) Σ_{x ∈ C} f(w|x) ] log θ̂_w   (15)
        = |C| Σ_w θ̂_w log θ̂_w = −|C| H(θ̂)   (16)

where H(θ̂) is the entropy of the unigram distribution.

We can now proceed to define a distance corresponding to the unigram model. Consider any two clusters C_i and C_j and their combination C = C_i ∪ C_j. When C_i and C_j are treated as distinct clusters, we use a different unigram model to capture the term frequencies of each. The combined cluster C, on the other hand, is modeled with a single unigram model. We can now treat the documents in the two clusters as observations and devise a model selection problem for deciding whether the clusters should be viewed as separate or merged. To this end, we evaluate the resulting log-likelihoods in two ways, corresponding to whether the clusters are modeled as separate or combined. When separate:

l(C_i; θ̂_i) + l(C_j; θ̂_j) = −|C_i| H(θ̂_i) − |C_j| H(θ̂_j)   (17)
                           = −(|C_i| + |C_j|) Σ_{y ∈ {i,j}} P̂(y) H(θ̂_y)   (18)

where P̂(y = i) = |C_i| / (|C_i| + |C_j|) is the probability that a randomly drawn document from C = C_i ∪ C_j comes from cluster C_i. The parameter estimates θ̂_i and θ̂_j are obtained as before. Similarly, when the two clusters are modeled with the same unigram:

l(C_i, C_j; θ̂) = −(|C_i| + |C_j|) H(θ̂)   (19)

where the parameter estimate θ̂ can be written as θ̂_w = Σ_{y ∈ {i,j}} P̂(y) θ̂_{y,w}, i.e., as a cluster-size-weighted combination of the estimates from the individual clusters. We can now define a (squared) distance between C_i and C_j according to how much better they are modeled as separate clusters (a log-likelihood ratio statistic):

d²(C_i, C_j) = l(C_i; θ̂_i) + l(C_j; θ̂_j) − l(C_i, C_j; θ̂)   (20)
             = (|C_i| + |C_j|) [ H(θ̂) − Σ_{y ∈ {i,j}} P̂(y) H(θ̂_y) ]   (21)
             = (|C_i| + |C_j|) Î(w; y)   (22)

where Î(w; y) is the mutual information between a word sampled from the combined cluster C and the identity of the cluster (i or j) that the sample came from. (Recall a similar derivation in the feature selection context.)
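A minimal sketch of equations (20)-(22), computing d² from the two cluster unigram estimates (as returned by the cluster_unigram_mle sketch above) and the cluster sizes; the helper names are illustrative:

```python
import math

def entropy(theta):
    """Entropy H(theta) of a unigram distribution given as a dict of probabilities."""
    return -sum(p * math.log(p) for p in theta.values() if p > 0)

def cluster_distance_sq(theta_i, size_i, theta_j, size_j):
    """d^2(C_i, C_j) = (|C_i| + |C_j|) * I(w; y), equations (20)-(22)."""
    total = size_i + size_j
    p_i, p_j = size_i / total, size_j / total
    # Combined unigram: cluster-size-weighted mixture of the two estimates.
    combined = {w: p_i * theta_i.get(w, 0.0) + p_j * theta_j.get(w, 0.0)
                for w in set(theta_i) | set(theta_j)}
    mutual_info = entropy(combined) - p_i * entropy(theta_i) - p_j * entropy(theta_j)
    return total * mutual_info
```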

More precisely, the mutual information is computed on the basis of the joint distribution P̂(w, y) = P̂(y) θ̂_{y,w}. Note that d²(C_i, C_j) is symmetric and non-negative but need not satisfy all the properties of a metric. It nevertheless compares the two clusters in a very natural way: we measure how distinguishable they are based on their word distributions. In other words, Î(w; y) tells us how much a word sampled at random from a document in the combined cluster C tells us about which cluster, C_i or C_j, the sample came from. Low information content means that the two clusters have nearly identical distributions over words.

The (squared) distance d²(C_i, C_j) can be used immediately, e.g., in a hierarchical agglomerative clustering algorithm (the same algorithm, derived from a different perspective, is known as the agglomerative information bottleneck method); a sketch of such a merge loop follows below. We could take the model selection idea further and decide when to stop merging clusters, rather than simply providing a scale for deciding how different two clusters are.
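A naive sketch of plugging d² into a hierarchical agglomerative merge loop, assuming the cluster_unigram_mle and cluster_distance_sq sketches above are available; a practical implementation would cache the pairwise distances rather than recompute them at every step:

```python
def agglomerative_merge(clusters, n_final):
    """Repeatedly merge the pair of clusters with the smallest d^2 until
    n_final clusters remain. Each cluster is a list of tokenized documents."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > n_final:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d2 = cluster_distance_sq(cluster_unigram_mle(clusters[a]), len(clusters[a]),
                                         cluster_unigram_mle(clusters[b]), len(clusters[b]))
                if best is None or d2 < best[0]:
                    best = (d2, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters
```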
