Clustering K-means
Machine Learning CSE546, Emily Fox, University of Washington, November 4, 2013

Clustering images
Set of images [Goldberger et al.]

Clustering web search results

Some Data

K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means
Randomly initialize k centers: µ^(0) = µ_1^(0), ..., µ_k^(0)
Classify: assign each point j ∈ {1, ..., N} to the nearest center:
  C(j) ← argmin_i ||µ_i − x_j||²
Recenter: µ_i becomes the centroid of the points it owns:
  µ_i ← argmin_µ Σ_{j : C(j) = i} ||µ − x_j||²
Equivalent to µ_i ← average of its points!
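A minimal sketch of this procedure in Python/NumPy follows; it is illustrative rather than the lecture's reference code, and the initialization (random data points as centers) and stopping test are my own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the 'Classify' and 'Recenter' steps.

    X: (N, d) data matrix; k: number of clusters chosen by the user.
    Returns (centers, assignments). A sketch, not optimized code.
    """
    rng = np.random.default_rng(seed)
    # Randomly guess k cluster center locations (here: k random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Classify: each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center jumps to the centroid of the points it owns.
        new_centers = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):   # terminated: centers stopped moving
            break
        centers = new_centers
    return centers, assign
```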

What is K-means optimizing?
Potential function F(µ, C) of centers µ and point allocations C:
  F(µ, C) = Σ_{j=1}^N ||µ_{C(j)} − x_j||²
Optimal K-means: min_µ min_C F(µ, C)

Does K-means converge??? Part 1
Optimize the potential function: fix µ, optimize C.
Assigning each point to its closest center, C(j) ← argmin_i ||µ_i − x_j||², minimizes F over C for the fixed centers, so this step can only decrease F.

Does K-means converge??? Part 2
Optimize the potential function: fix C, optimize µ.
For each cluster i, setting µ_i to the mean of the points assigned to it minimizes Σ_{j : C(j) = i} ||µ_i − x_j||², so this step can only decrease F as well.

Coordinate descent algorithms
Want: min_a min_b F(a, b)
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded) to a, often good, local optimum, as we saw in the applet (play with it!). (For LASSO it converged to the global optimum, because of convexity.)
K-means is a coordinate descent algorithm!
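To make the coordinate-descent view concrete, this small helper (an illustrative addition, not from the slides) evaluates the potential F(µ, C); logging it once per K-means iteration should show a non-increasing sequence.

```python
import numpy as np

def kmeans_potential(X, centers, assign):
    """K-means objective F(mu, C) = sum_j ||mu_{C(j)} - x_j||^2.

    Both the 'Classify' and 'Recenter' steps can only decrease this value,
    so logging it once per iteration gives a simple convergence check.
    """
    diffs = X - centers[assign]   # (N, d) residual of each point to its assigned center
    return float(np.sum(diffs ** 2))
```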

Mixtures of Gaussians
Machine Learning CSE546, Emily Fox, University of Washington, November 4, 2013

(One) bad case for K-means
Clusters may overlap.
Some clusters may be wider than others.

Density Estimation
Estimate a density based on x_1, ..., x_N.
[Figure: contour plot of the joint density]

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians.
[Figures: mixture of 3 Gaussians; contour plot of the joint density]

Gaussians in d Dimensions
  P(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
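A short sketch of evaluating this d-dimensional Gaussian density, once directly from the formula and once via scipy.stats.multivariate_normal as a cross-check; the example mean and covariance are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """P(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

# Illustrative parameters (not from the lecture).
mu = np.array([0.5, 0.4])
Sigma = np.array([[0.02, 0.005], [0.005, 0.01]])
x = np.array([0.45, 0.42])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))     # should match
```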

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: mixture of 3 Gaussians; our actual observations. From C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian.
Introduce a latent cluster indicator variable z_i. Then we have
  p(x_i | z_i, π, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: (a) our actual observations, (b) complete data labeled by true cluster assignments. From C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
We must infer the cluster assignments from the observations.
Posterior probabilities of assignment to each cluster *given* the model parameters (soft assignments to clusters):
  r_ik = p(z_i = k | x_i, π, µ, Σ) = π_k N(x_i | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | µ_j, Σ_j)
[Figure: (c) soft assignments to clusters. From C. Bishop, Pattern Recognition & Machine Learning]

Unsupervised Learning: not as hard as it looks
Sometimes easy, sometimes impossible, and sometimes in between.
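A hedged sketch of computing these responsibilities r_ik with NumPy/SciPy; the function name and parameter layout are my own, and the log-space arithmetic is just a numerical-stability choice, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """r[i, k] = p(z_i = k | x_i, pi, mu, Sigma) for a Gaussian mixture."""
    K = len(pis)
    # log( pi_k * N(x_i | mu_k, Sigma_k) ) for every point and component
    log_joint = np.column_stack([
        np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        for k in range(K)
    ])
    # Normalize over components in log space.
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```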

Summary of GMM Concept
Estimate a density based on x_1, ..., x_N:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments]

Summary of GMM Components
Observations: x_i ∈ R^d, i = 1, 2, ..., N
Hidden cluster labels: z_i ∈ {1, 2, ..., K}, i = 1, 2, ..., N
Hidden mixture means: µ_k ∈ R^d, k = 1, 2, ..., K
Hidden mixture covariances: Σ_k ∈ R^{d×d}, k = 1, 2, ..., K
Hidden mixture probabilities: π_k, with Σ_{k=1}^K π_k = 1
Gaussian mixture marginal and conditional likelihood:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} p(x_i | z_i, µ, Σ)
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
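To make the generative story concrete, this short sketch samples data exactly as the model describes: draw z_i from the mixture probabilities π, then draw x_i from the corresponding Gaussian. The parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in R^2 (values not from the lecture).
pis = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

N = 500
z = rng.choice(len(pis), size=N, p=pis)                                  # hidden labels z_i ~ pi
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])    # x_i | z_i ~ N(mu_{z_i}, Sigma_{z_i})
```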

Expectation Maximization
Machine Learning CSE546, Emily Fox, University of Washington, November 6, 2013

Next... back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?

But we don't see class labels!!!
MLE: argmax_θ Π_i P(z_i, x_i)
But we don't know z_i.
Maximize the marginal likelihood instead:
  argmax_θ Π_i P(x_i) = argmax_θ Π_i Σ_{k=1}^K P(z_i = k, x_i)

Special case: spherical Gaussians and hard assignments
  P(z_i = k, x_i) = P(z_i = k) · 1/((2π)^{d/2} |Σ_k|^{1/2}) exp( −(1/2) (x_i − µ_k)ᵀ Σ_k⁻¹ (x_i − µ_k) )
If P(x | z = k) is spherical, with the same σ for all classes:
  P(x_i | z_i = k) ∝ exp( −||x_i − µ_k||² / (2σ²) )
If each x_i belongs to one class C(i) (hard assignment), the marginal likelihood becomes
  Π_{i=1}^N Σ_{k=1}^K P(x_i, z_i = k) ∝ Π_{i=1}^N exp( −||x_i − µ_{C(i)}||² / (2σ²) )
Same as K-means!!!
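A tiny numeric check of this equivalence (toy data and fixed centers, all made up): with a shared spherical covariance σ²I and hard assignments C(i), the negative log of Π_i P(x_i | z_i = C(i)) equals F(µ, C)/(2σ²) plus a constant, so maximizing it over µ and C is exactly the K-means problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # toy data
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])          # two fixed centers
C = np.linalg.norm(X[:, None] - mus[None], axis=2).argmin(axis=1)   # hard assignments
sigma2 = 0.3

# Per-point Gaussian log-likelihoods with shared spherical covariance sigma^2 * I.
d = X.shape[1]
log_lik = -0.5 * d * np.log(2 * np.pi * sigma2) \
          - np.sum((X - mus[C]) ** 2, axis=1) / (2 * sigma2)

F = np.sum((X - mus[C]) ** 2)                      # K-means potential F(mu, C)
# Negative log-likelihood = F / (2 sigma^2) + constant; the two printed values match.
print(-log_lik.sum())
print(F / (2 * sigma2) + 0.5 * len(X) * d * np.log(2 * np.pi * sigma2))
```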

Supervised Learning of Mixtures of Gaussians
Mixtures of Gaussians:
  Prior class probabilities: P(z = k)
  Likelihood function per class: P(x | z = k)
Suppose, for each data point, we know both the location x and the class z. Then learning is easy :)
  For the prior: P(z = k) = (# points in class k) / N
  For the likelihood function: fit µ_k, Σ_k to the points in class k (sample mean and covariance).

EM: Reducing Unsupervised Learning to Supervised Learning
If we knew the assignment of points to classes → supervised learning!
Expectation-Maximization (EM):
  Guess the assignment of points to classes.
  In standard ("soft") EM: each point is associated with a probability of being in each class.
  Recompute the model parameters.
  Iterate.
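A brief sketch of the "learning is easy when z is known" case: maximum-likelihood estimates of the class priors, means, and covariances from fully labeled data. Function and variable names are illustrative assumptions.

```python
import numpy as np

def fit_gmm_supervised(X, z, K):
    """MLE of a Gaussian mixture when every point's class label z_i is observed.

    Assumes every class 0..K-1 appears at least once in z.
    """
    pis = np.array([(z == k).mean() for k in range(K)])                     # P(z = k)
    mus = np.array([X[z == k].mean(axis=0) for k in range(K)])              # per-class mean
    Sigmas = np.array([np.cov(X[z == k].T, bias=True) for k in range(K)])   # per-class covariance (MLE)
    return pis, mus, Sigmas
```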

Form of Likelihood
Conditioned on the class of point x_i:
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
Marginalizing the class assignment:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})

Gaussian Mixture Model
The most commonly used mixture model.
Observations: x_i ∈ R^d
Parameters: θ = {π_k, µ_k, Σ_k}_{k=1}^K
Likelihood: p(x_i | θ) = Σ_{k=1}^K π_k N(x_i | µ_k, Σ_k)
Example: z_i = country of origin, x_i = height of the i-th person; the k-th mixture component = distribution of heights in country k.

Example (taken from Kevin Murphy's ML textbook)
Data: gene expression levels.
Goal: cluster genes with similar expression trajectories.

Mixture models are useful for...
Density estimation: allows for a multimodal density.
Clustering: we want membership information for each observation, e.g., the topic of the current document.
  Soft clustering: p(z_i = k | x_i, θ)
  Hard clustering: z_i* = argmax_k p(z_i = k | x_i, θ)

Issues
Label switching: "color = label" does not matter; we can switch labels and the likelihood is unchanged.
The log likelihood is not convex in the parameters. The problem is simpler for the complete-data likelihood.

ML Estimate of Mixture Model Params
Log likelihood:
  L_x(θ) := log p({x_i} | θ) = Σ_i log Σ_{z_i} p(x_i, z_i | θ)
Want the ML estimate:
  θ̂_ML = argmax_θ L_x(θ)
Neither convex nor concave, and has local optima.
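A short helper (my own naming, not the lecture's code) for evaluating this marginal log likelihood L_x(θ); it is handy for monitoring the EM iterations below, and uses logsumexp purely for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """L_x(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    log_joint = np.column_stack([
        np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        for k in range(len(pis))
    ])                                    # shape (N, K): log pi_k + log N(x_i | mu_k, Sigma_k)
    return float(logsumexp(log_joint, axis=1).sum())
```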

If complete data were observed...
Assume the class labels z_i were observed in addition to x_i:
  L_{x,z}(θ) = Σ_i log p(x_i, z_i | θ)
Computing the ML estimates is then easy: the objective separates over clusters k!
Example: mixture of Gaussians (MoG), θ = {π_k, µ_k, Σ_k}_{k=1}^K.

Iterative Algorithm
Motivates a coordinate-ascent-like algorithm:
1. Infer the missing values z_i given the current estimate of the parameters θ̂.
2. Optimize the parameters to produce a new θ̂ given the filled-in data z_i.
3. Repeat.
Example: MoG (derivation soon + HW); a sketch of the resulting loop is given below.
1. Infer responsibilities:
  r_ik = p(z_i = k | x_i, θ̂^(t−1)) = π̂_k N(x_i | µ̂_k, Σ̂_k) / Σ_{j=1}^K π̂_j N(x_i | µ̂_j, Σ̂_j)
2. Optimize parameters:
  max w.r.t. π_k:  π̂_k = (1/N) Σ_i r_ik
  max w.r.t. µ_k, Σ_k:  µ̂_k = Σ_i r_ik x_i / Σ_i r_ik,   Σ̂_k = Σ_i r_ik (x_i − µ̂_k)(x_i − µ̂_k)ᵀ / Σ_i r_ik
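Putting the E-step and M-step together, here is a compact, self-contained sketch of the EM loop for a mixture of Gaussians; the initialization, the small ridge added to each covariance, and the convergence test are illustrative choices, not the course's reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model. Returns (pis, mus, Sigmas, log_liks)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Simple initialization: uniform weights, random data points as means, shared data covariance.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_liks = []
    for _ in range(n_iters):
        # E-step: responsibilities r_ik = p(z_i = k | x_i, theta_hat), in log space for stability.
        log_joint = np.column_stack([
            np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
            for k in range(K)
        ])
        log_norm = logsumexp(log_joint, axis=1, keepdims=True)
        r = np.exp(log_joint - log_norm)                     # (N, K)
        # M-step: weighted MLE updates.
        Nk = r.sum(axis=0)                                   # effective counts per cluster
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # Marginal log likelihood under the parameters used in this E-step (non-decreasing).
        log_liks.append(float(log_norm.sum()))
        if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < tol:
            break
    return pis, mus, Sigmas, log_liks
```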

E.M. Convergence
EM is coordinate ascent on an interesting potential function.
Coordinate ascent for a bounded potential function → convergence to a local optimum is guaranteed.
This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g., vector quantization for speech data.

Gaussian Mixture Example: Start
[Figure: initial configuration of the Gaussian mixture fit]

[Figures: the Gaussian mixture fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]

Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator

Expectation Maximization (EM): Setup
More broadly applicable than just to the mixture models considered so far.
Model:
  x: observable, "incomplete" data
  y: not (fully) observable, "complete" data
  θ: parameters
Interested in maximizing (w.r.t. θ):
  p(x | θ) = Σ_y p(x, y | θ)
Special case: x = g(y).

Expectation Maximization (EM): Derivation
Step 1: Rewrite the desired likelihood in terms of complete-data terms:
  p(y | θ) = p(y | x, θ) p(x | θ)
Step 2: Assume an estimate θ̂ of the parameters and take the expectation with respect to p(y | x, θ̂). Writing
  U(θ, θ̂) := E[log p(y | θ) | x, θ̂]   and   V(θ, θ̂) := −E[log p(y | x, θ) | x, θ̂],
this gives  L_x(θ) = log p(x | θ) = U(θ, θ̂) + V(θ, θ̂).

Expectation Maximization (EM): Derivation
Step 3: Consider the log likelihood of the data at any θ relative to the log likelihood at θ̂, i.e., L_x(θ) − L_x(θ̂).
Aside, Gibbs inequality: E_p[log p(x)] ≥ E_p[log q(x)] (the difference is the KL divergence KL(p‖q) ≥ 0).

Expectation Maximization (EM): Derivation
  L_x(θ) − L_x(θ̂) = [U(θ, θ̂) − U(θ̂, θ̂)] + [V(θ, θ̂) − V(θ̂, θ̂)]
Step 4: Determine conditions under which the log likelihood at θ exceeds that at θ̂.
Using the Gibbs inequality, V(θ, θ̂) ≥ V(θ̂, θ̂), so:
  If U(θ, θ̂) ≥ U(θ̂, θ̂), then L_x(θ) ≥ L_x(θ̂).
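For reference, the argument on these two slides can be written out compactly. This is a reconstruction of the standard EM bound under the sign convention V(θ, θ̂) = −E[log p(y | x, θ) | x, θ̂], which makes L_x = U + V as on the "Coordinate Ascent Behavior" slide; treat that convention as an assumption of this sketch.

```latex
% Reconstructed convention: q(y) := p(y \mid x, \hat\theta),
% U(\theta,\hat\theta) := E_q[\log p(y \mid \theta)],
% V(\theta,\hat\theta) := -E_q[\log p(y \mid x, \theta)].
\begin{align*}
\log p(x \mid \theta)
  &= \log p(y \mid \theta) - \log p(y \mid x, \theta)\\
\Rightarrow\ L_x(\theta)
  &= U(\theta,\hat\theta) + V(\theta,\hat\theta)
  \qquad \text{(take $E_q$ of both sides)}\\
V(\theta,\hat\theta) - V(\hat\theta,\hat\theta)
  &= E_q\!\left[\log \frac{p(y \mid x,\hat\theta)}{p(y \mid x,\theta)}\right] \ge 0
  \qquad \text{(Gibbs inequality)}\\
\Rightarrow\ L_x(\theta) - L_x(\hat\theta)
  &= \bigl[U(\theta,\hat\theta) - U(\hat\theta,\hat\theta)\bigr]
   + \bigl[V(\theta,\hat\theta) - V(\hat\theta,\hat\theta)\bigr]
  \ \ge\ U(\theta,\hat\theta) - U(\hat\theta,\hat\theta).
\end{align*}
```

So any θ that increases U(·, θ̂) cannot decrease the log likelihood, which is exactly what the M-step exploits.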

Motivates EM Algorithm
Initial guess: θ̂^(0)
Estimate at iteration t: θ̂^(t)
E-Step: Compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: Compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))

Example: Mixture Models
E-Step: Compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: Compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))
Consider y_i = {z_i, x_i} i.i.d., with p(x_i, z_i | θ) = π_{z_i} p(x_i | z_i, θ) = π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i}). Then
  E_{q_t}[log p(y | θ)] = Σ_i E_{q_t}[log p(x_i, z_i | θ)] = Σ_i Σ_{k=1}^K r_ik^(t) [log π_k + log N(x_i | µ_k, Σ_k)]

Coordinate Ascent Behavior
Bound the log likelihood:
  L_x(θ) = U(θ, θ̂^(t)) + V(θ, θ̂^(t))
  L_x(θ̂^(t)) = U(θ̂^(t), θ̂^(t)) + V(θ̂^(t), θ̂^(t))
[Figure from the KM (Kevin Murphy) textbook]

Comments on EM
Since the Gibbs inequality is satisfied with equality only if p = q, any step that changes θ should strictly increase the likelihood.
In practice, the M-Step can be replaced by merely increasing U instead of maximizing it (Generalized EM).
Under certain conditions (e.g., in the exponential family), one can show that EM converges to a stationary point of L_x(θ).
Often there is a natural choice for y; it has physical meaning.
If you want to choose any y, not necessarily with x = g(y), replace p(y | θ) in U with p(y, x | θ).

Initialization
In the mixture model case, where y_i = {z_i, x_i}, there are many ways to initialize the EM algorithm. Examples:
  Choose K observations at random to define each cluster; assign the other observations to the nearest centroid to form initial parameter estimates.
  Pick the centers sequentially to provide good coverage of the data.
  Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed.
Initialization can be quite important to convergence rates in practice.

What you should know
K-means for clustering: the algorithm converges because it is coordinate descent on the K-means potential.
EM for a mixture of Gaussians: how to learn maximum likelihood parameters (a local maximum of the likelihood) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember: EM can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.