Clustering web search results

Size: px

Start display at page:

Download "Clustering web search results"

Owen Walsh
5 years ago
Views:

1 Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, Clustering images Set of Images [Goldberger et al.] 2 1

2 Clustering web search results 3 Some Data 4 2

3 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 5 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 6 3

4 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. (Thus each Center owns a set of datapoints) 7 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. 4. Each Center finds the centroid of the points it owns 8 4

5 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. 4. Each Center finds the centroid of the points it owns 5. and jumps there 6. Repeat until terminated! 9 K-means Randomly initialize k centers µ (0) = µ 1 (0),, µ k (0) Classify: Assign each point j {1, N} to nearest center: Recenter: µ i becomes centroid of its point: Equivalent to µ i average of its points! 10 5

6 What is K-means optimizing? Potential function F(µ,C) of centers µ and point allocations C: N Optimal K-means: min µ min C F(µ,C) 11 Does K-means converge??? Part 1 Optimize potential function: Fix µ, optimize C 12 6

7 Does K-means converge??? Part 2 Optimize potential function: Fix C, optimize µ 13 Coordinate descent algorithms Want: min a min b F(a,b) Coordinate descent: fix a, minimize b fix b, minimize a repeat Converges!!! if F is bounded to a (often good) local optimum as we saw in applet (play with it!) (For LASSO it converged to the global optimum, because of convexity) K-means is a coordinate descent algorithm! 14 7

8 Mixtures of Gaussians Machine Learning CSE546 Emily Fox University of Washington November 4, (One) bad case for k-means Clusters may overlap Some clusters may be wider than others 16 8

9 Density Estimation Estimate a density based on x 1,,x N Emily Fox Density Estimation Contour Plot of Joint Density Emily Fox

10 Density as Mixture of Gaussians Approximate density with a mixture of Gaussians Mixture of 3 Gaussians Contour Plot of Joint Density Emily Fox Gaussians in d Dimensions 1 P(x) = (2π ) m/2 Σ exp # 1 1/2 $ % 2 x µ ( ) T Σ 1 ( x µ ) & ' ( 20 10

11 Density as Mixture of Gaussians Approximate density with a mixture of Gaussians Mixture of 3 Gaussians p(x i,µ, ) = Emily Fox Density as Mixture of Gaussians Approximate with density with a mixture of Gaussians Mixture of 3 Gaussians Our actual observations (b) C. Bishop, Emily Pattern Fox 2013 Recognition & Machine Learning 22 11

12 Clustering our Observations Imagine we have an assignment of each x i to a Gaussian Our actual observations 1 (a) 1 (b) Complete data labeled by true cluster assignments C. Bishop, Emily Pattern Fox 2013 Recognition & Machine Learning 23 Clustering our Observations Imagine we have an assignment of each x i to a Gaussian 1 (a) Introduce latent cluster indicator variable z i 0.5 Then we have p(x i z i,,µ, ) = Complete data labeled by true cluster assignments C. Bishop, Emily Pattern Fox 2013 Recognition & Machine Learning 24 12

13 Clustering our Observations We must infer the cluster assignments from the observations (c) Posterior probabilities of assignments to each cluster *given* model parameters: r ik = p(z i = k x i,,µ, ) = Soft assignments to clusters C. Bishop, Emily Pattern Fox 2013 Recognition & Machine Learning 25 Unsupervised Learning: not as hard as it looks Sometimes easy Sometimes impossible and sometimes in between 26 13

Summary of GMM Concept Estimate a density based on x 1,,x N KX p(x i, µ, ) = z in (x i µ z i, z i) z i =1 1 (a) 0.5 0 0 0.

14 Summary of GMM Concept Estimate a density based on x 1,,x N KX p(x i, µ, ) = z in (x i µ z i, z i) z i =1 1 (a) Complete data labeled by true cluster assignments Surface Plot of Joint Density, Marginalizing Cluster Assignments Emily Fox Summary of GMM Components Observations x i i 2 R d, i =1, 2,...,N Hidden cluster labels z i 2{1, 2,...,K}, i =1, 2,...,N Hidden mixture means µ k 2 R d, k =1, 2,...,K Hidden mixture covariances Hidden mixture probabilities k 2 R d d, k =1, 2,...,K KX k, k =1 k=1 Gaussian mixture marginal and conditional likelihood : KX p(x i, µ, ) = z i p(x i z i,µ, ) z i =1 p(x i z i,µ, ) =N (x i µ z i, z i) Emily Fox

15 Expectation Maximization Machine Learning CSE546 Emily Fox University of Washington November 6, Next back to Density Estimation What if we want to do density estimation with multimodal or clumpy data? 30 15

16 But we don t see class labels!!! MLE: argmax i P(z i,x i ) But we don t know z i Maximize marginal likelihood: argmax i P(x i ) = argmax i k=1 K P(z i =k,x i ) 31 Special case: spherical Gaussians and hard assignments P(z i = k, x i 1 ) = (2π ) m/2 Σ k exp # 1 1/2 2 xi µ $ % k ( ) T Σ 1 k ( x i µ k ) & ' ( P(zi = k) If P(X z=k) is spherical, with same σ for all classes: # P(x i z i = k) exp 1 x i 2& µ $ % 2σ 2 k ' ( If each x i belongs to one class C(i) (hard assignment), marginal likelihood: N K N % P(x i, z i = k) exp 1 & ' 2σ 2 i=1 k=1 i=1 x i µ C(i) ( ) * 2 Same as K-means!!! 32 16

17 Supervised Learning of Mixtures of Gaussians Mixtures of Gaussians: Prior class probabilities: P(z=k) Likelihood function per class: P(x z=k) Suppose, for each data point, we know location x and class z Learning is easy J For prior P(z) For likelihood function: 33 EM: Reducing Unsupervised Learning to Supervised Learning If we knew assignment of points to classes è Supervised Learning! Expectation-Maximization (EM) Guess assignment of points to classes In standard ( soft ) EM: each point associated with prob. of being in each class Recompute model parameters Iterate 34 17

18 Form of Likelihood Conditioned on class of point x i... p(x i z i,µ, ) = Marginalizing class assignment: p(x i,µ, ) = Emily Fox Gaussian Mixture Model Most commonly used mixture model Observations: Parameters: Likelihood: Ex. z i = country of origin, x i = height of i th person k th mixture component = distribution of heights in country k Emily Fox

19 Example (Taken from Kevin Murphy s ML textbook) Data: gene expression levels Goal: cluster genes with similar expression trajectories Emily Fox Mixture models are useful for Density estimation Allows for multimodal density Clustering Want membership information for each observation e.g., topic of current document Soft clustering: p(z i = k x i, )= Hard clustering: z i = arg max p(z i = k x i, )= k Emily Fox

20 Issues Label switching Color = label does not matter Can switch labels and likelihood is unchanged Log likelihood is not convex in the parameters Problem is simpler for complete data likelihood Emily Fox ML Estimate of Mixture Model Params Log likelihood L x ( ), log p({x i } ) = X i log X z i p(x i,z i ) Want ML estimate ˆ ML = Neither convex nor concave and local optima Emily Fox

21 If complete data were observed z i Assume class labels were observed in addition to L x,z ( ) = X log p(x i,z i ) i x i Compute ML estimates Separates over clusters k! Example: mixture of Gaussians (MoG) = { k,µ k, k } K k=1 Emily Fox Iterative Algorithm Motivates a coordinate ascent-like algorithm: 1. Infer missing values z i given estimate of parameters ˆ 2. Optimize parameters to produce new ˆ given filled in data z i 3. Repeat Example: MoG (derivation soon + HW) 1. Infer responsibilities r ik = p(z i = k x i, ˆ (t 1) )= 2. Optimize parameters max w.r.t. k : max w.r.t. µ k, k : Emily Fox

22 E.M. Convergence EM is coordinate ascent on an interesting potential function Coord. ascent for bounded pot. func. è convergence to a local optimum guaranteed This algorithm is REALLY USED. And in high dimensional state spaces, too. E.G. Vector Quantization for Speech Data 43 Gaussian Mixture Example: Start Emily Fox

23 After first iteration Emily Fox After 2nd iteration Emily Fox

24 After 3rd iteration Emily Fox After 4th iteration Emily Fox

25 After 5th iteration Emily Fox After 6th iteration Emily Fox

26 After 20th iteration Emily Fox Some Bio Assay data Emily Fox

GMM clustering of the assay data Emily Fox 2013

27 GMM clustering of the assay data Emily Fox Resulting Density Estimator Emily Fox

28 Expectation Maximization (EM) Setup More broadly applicable than just to mixture models considered so far Model: x y observable incomplete data not (fully) observable complete data parameters Interested in maximizing (wrt ): p(x ) = X y p(x, y ) Special case: x = g(y) Emily Fox Expectation Maximization (EM) Derivation Step 1 Rewrite desired likelihood in terms of complete data terms p(y ) =p(y x, )p(x ) Step 2 Assume estimate of parameters Take expectation with respect to ˆ p(y x, ˆ ) Emily Fox

29 Expectation Maximization (EM) Derivation Step 3 Consider log likelihood of data at any L x ( ) L x (ˆ ) relative to log likelihood at ˆ Aside: Gibbs Inequality Proof: E p [log p(x)] E p [log q(x)] Emily Fox Expectation Maximization (EM) Derivation L x ( ) L x (ˆ ) =[U(, ˆ ) U(ˆ, ˆ )] [V (, ˆ ) V (ˆ, ˆ )] Step 4 Determine conditions under which log likelihood at exceeds that at Using Gibbs inequality: ˆ If Then L x ( ) L x (ˆ ) Emily Fox

30 Motivates EM Algorithm Initial guess: Estimate at iteration t: E-Step Compute M-Step Compute Emily Fox Example Mixture Models E-Step Compute M-Step Compute U(, ˆ (t) )=E[log p(y ) x, ˆ (t) ] ˆ (t+1) = arg max U(, ˆ (t) ) Consider y i = {z i,x i } i.i.d. p(x i,z i ) = z ip(x i E qt [log p(y )] = X z i)= E qt [log p(x i,z i )] = i Emily Fox

31 Coordinate Ascent Behavior Bound log likelihood: L x ( ) = U(, ˆ (t) )+V(, ˆ (t) ) L x (ˆ (t) )= U(ˆ (t), ˆ (t) )+V (ˆ (t), ˆ (t) ) Figure from KM textbook Emily Fox Comments on EM Since Gibbs inequality is satisfied with equality only if p=q, any step that changes should strictly increase likelihood In practice, can replace the M-Step with increasing U instead of maximizing it (Generalized EM) Under certain conditions (e.g., in exponential family), can show that EM converges to a stationary point of L x ( ) Often there is a natural choice for y has physical meaning If you want to choose any y, not necessarily x=g(y), replace p(y ) in U with p(y, x ) Emily Fox

32 Initialization In mixture model case where y i = {z i,x i } there are many ways to initialize the EM algorithm Examples: Choose K observations at random to define each cluster. Assign other observations to the nearest centriod to form initial parameter estimates Pick the centers sequentially to provide good coverage of data Grow mixture model by splitting (and sometimes removing) clusters until K clusters are formed Can be quite important to convergence rates in practice Emily Fox What you should know K-means for clustering: algorithm converges because it s coordinate ascent EM for mixture of Gaussians: How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data Be happy with this kind of probabilistic analysis Remember, E.M. can get stuck in local minima, and empirically it DOES EM is coordinate ascent 64 32

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014