Clustering: K-means. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Clustering images: a set of images [Goldberger et al.]
Clustering web search results
Example (taken from Kevin Murphy's ML textbook). Data: gene expression levels. Goal: cluster genes with similar expression trajectories.
Some Data
K-means
1. Ask user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center owns a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated! (A minimal sketch of this loop follows below.)
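A minimal NumPy sketch of the loop above (Lloyd's algorithm); the helper name `kmeans`, the random initialization from data points, and the iteration cap are illustrative assumptions, not something specified on the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and recentering steps."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: each datapoint finds which center it is closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Steps 4-5: each center finds the centroid of the points it owns and jumps there
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        # Step 6: repeat until terminated (here: until the centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

centers, labels = kmeans(np.random.randn(200, 2), k=5)
```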
K-means. Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$. Classify: assign each point $j \in \{1, \ldots, N\}$ to its nearest center: $C^{(t)}(j) \leftarrow \arg\min_i \|\mu_i^{(t)} - x_j\|_2^2$. Recenter: $\mu_i$ becomes the centroid of its points: $\mu_i^{(t+1)} \leftarrow \arg\min_\mu \sum_{j : C^{(t)}(j) = i} \|\mu - x_j\|_2^2$. Equivalent to $\mu_i$ being the average of its points!
What is K-means optimizing? Potential function $F(\mu, C)$ of centers $\mu$ and point allocations $C$: $F(\mu, C) = \sum_{j=1}^{N} \|\mu_{C(j)} - x_j\|_2^2$. Optimal K-means: $\min_\mu \min_C F(\mu, C)$.
Does K-means converge??? Part 1. Optimize the potential function: fix $\mu$, optimize $C$. With the centers fixed, assigning each point to its nearest center minimizes $F$ over $C$, so this step can only decrease $F$.
Does K-means converge??? Part 2. Optimize the potential function: fix $C$, optimize $\mu$. With the assignments fixed, setting each $\mu_i$ to the mean of the points it owns minimizes $F$ over $\mu$, so this step also can only decrease $F$.
Coordinate descent algorithms. Want: $\min_a \min_b F(a, b)$. Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). (For LASSO it converged to the global optimum, because of convexity.) K-means is a coordinate descent algorithm! A toy sketch of coordinate descent follows below.
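To make the coordinate descent picture concrete, here is a small hedged sketch on a toy convex function; the particular F(a, b) and the step count are illustrative assumptions.

```python
def coordinate_descent(steps=50):
    """Minimize F(a, b) = (a - 1)**2 + (b + 2)**2 + a*b by alternating exact
    one-dimensional minimizations; each step can only decrease F."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        a = 1.0 - b / 2.0    # fix b, minimize over a: dF/da = 2(a - 1) + b = 0
        b = -2.0 - a / 2.0   # fix a, minimize over b: dF/db = 2(b + 2) + a = 0
    return a, b              # converges to the global optimum here because F is convex

print(coordinate_descent())  # approximately (8/3, -10/3)
```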
Mixtures of Gaussians. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
(One) bad case for k-means: clusters may overlap; some clusters may be wider than others.
Nonspherical data
Quick Review of Gaussians: univariate and multivariate Gaussians.
Two-Dimensional Gaussians. [Figure: contour and surface plots of 2-D Gaussians with spherical, diagonal, and full covariance matrices.]
Gaussians in d Dimensions: $P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]$
Learning Gaussians. $P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]$. Given data $x_1, \ldots, x_N$: MLE for the mean: $\hat\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$. MLE for the covariance: $\hat\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat\mu)(x_i - \hat\mu)^T$.
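A short NumPy sketch of these MLE formulas and the d-dimensional density from the previous slide; the function names and the example data are assumptions for illustration.

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a d-dimensional Gaussian from an (N, d) data matrix X."""
    mu = X.mean(axis=0)                         # mean: average of the data points
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / len(X)                # covariance: (1/N) sum (x - mu)(x - mu)^T
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """P(x) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x - mu)^T Sigma^-1 (x - mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu, Sigma = gaussian_mle(np.random.randn(500, 2))
print(gaussian_density(np.zeros(2), mu, Sigma))
```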
When the world is not Gaussian: distribution of male heights in the US; distribution of male heights in Sweden. What if we mix these together?
Gaussian Mixture Model: the most commonly used mixture model. Observations: $x_1, \ldots, x_N$. Parameters: mixture weights $\pi$, means $\mu_k$, covariances $\Sigma_k$. Cluster indicator: $z_i$. Per-cluster likelihood: $p(x_i \mid z_i = k) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$. Example: $z_i$ = country of origin, $x_i$ = height of the i-th person; the k-th mixture component = distribution of heights in country k.
Generative Model. We can think of sampling observations from the model. For each observation i: sample a cluster assignment $z_i \sim \mathrm{Categorical}(\pi)$, then sample the observation from the selected Gaussian, $x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$.
Density Estimation: estimate a density based on $x_1, \ldots, x_N$.
Density Estimation. [Figure: contour plot of the joint density.]
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figure: mixture of 3 Gaussians next to the contour plot of the joint density.]
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figure: mixture of 3 Gaussians.] $p(x_i \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figures: mixture of 3 Gaussians; our actual observations.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: imagine we have an assignment of each $x_i$ to a Gaussian. [Figures: our actual observations; complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: imagine we have an assignment of each $x_i$ to a Gaussian. Introduce a latent cluster indicator variable $z_i$. Then we have $p(x_i \mid z_i, \pi, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$. [Figure: complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: we must infer the cluster assignments from the observations. Posterior probabilities of assignments to each cluster *given* model parameters: $r_{ik} = p(z_i = k \mid x_i, \pi, \mu, \Sigma) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$. [Figure: soft assignments to clusters.] (C. Bishop, Pattern Recognition & Machine Learning)
Unsupervised Learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between.
Summary of GMM Concept. Estimate a density based on $x_1, \ldots, x_N$: $p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$. [Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments.]
Summary of GMM Components. Observations $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$. Hidden cluster labels $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$. Hidden mixture means $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$. Hidden mixture covariances $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$. Hidden mixture probabilities $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$. Gaussian mixture marginal and conditional likelihood: $p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, p(x_i \mid z_i, \mu, \Sigma)$, with $p(x_i \mid z_i, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$.
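A hedged sketch of the marginal likelihood and the soft assignments $r_{ik}$ defined above, reusing the `gaussian_density` helper from the earlier sketch; the parameter containers (lists of means and covariances) are illustrative assumptions.

```python
import numpy as np

def gmm_density(x, pi, mus, Sigmas):
    """p(x | pi, mu, Sigma) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(len(pi)))

def responsibilities(x, pi, mus, Sigmas):
    """r_k = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = np.array([pi[k] * gaussian_density(x, mus[k], Sigmas[k])
                         for k in range(len(pi))])
    return weighted / weighted.sum()
```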
Application to Document Modeling. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Cluster Documents: cluster documents based on topic.
Document Representation: bag of words model.
Issues with Document Representation. Word counts are bad for standard similarity metrics. Term Frequency Inverse Document Frequency (tf-idf): increase the importance of rare words.
TF-IDF. Term frequency: tf(t, d) = count of term t in document d (could also use {0, 1}, 1 + log tf(t, d), ...). Inverse document frequency: idf(t, D) = log(|D| / |{d in D : t appears in d}|). tf-idf: tfidf(t, d, D) = tf(t, d) * idf(t, D). High for a document d with a high frequency of term t (high term frequency) and few documents containing term t in the corpus (high inverse document frequency).
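A small sketch of the tf-idf weighting above; the toy corpus, the tokenization, and the raw-count choice of tf are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tfidf(t, d, D) = tf(t, d) * log(|D| / #{d in D : t in d})."""
    tf = Counter(doc)[term]                                # raw count of t in d
    n_containing = sum(1 for d in corpus if term in d)     # documents containing t
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf

corpus = [["the", "dog", "ran"], ["the", "cat", "sat"], ["dog", "bites", "dog"]]
print(tfidf("dog", corpus[2], corpus))   # frequent in d, rarer in D -> larger weight
```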
A Generative Model. Documents: the observations $x_i$ (e.g., tf-idf vectors). Associated topics: cluster indicators $z_i$. Parameters, as in a simple mixture of Gaussians: $\{\pi_k, \mu_k, \Sigma_k\}$.
What you get from a mixture model for documents: words give topic; topic proportions; topic distribution of each document.
Results from Wikipedia data using a similar model (LDA)
Expectation Maximization. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Next... back to Density Estimation. What if we want to do density estimation with multimodal or clumpy data?
Learning Model Parameters: want to learn the model parameters. [Figures: mixture of 3 Gaussians; our actual observations.] (C. Bishop, Pattern Recognition & Machine Learning)
ML Estimate of Mixture Model Params. Log likelihood: $L_x(\theta) \triangleq \log p(\{x_i\} \mid \theta) = \sum_i \log \sum_{z_i} p(x_i, z_i \mid \theta)$. Want the ML estimate: $\hat\theta_{ML} = \arg\max_\theta L_x(\theta)$. Neither convex nor concave, and has local optima.
Complete Data: imagine we have an assignment of each $x_i$ to a cluster. [Figures: our actual observations; complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
If complete data were observed. Assume the class labels $z_i$ were observed in addition to $x_i$: $L_{x,z}(\theta) = \sum_i \log p(x_i, z_i \mid \theta)$. Compute ML estimates: the objective separates over clusters k! Example: mixture of Gaussians (MoG), $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
Cluster Responsibilities: we must infer the cluster assignments from the observations. Posterior probabilities of assignments to each cluster *given* model parameters: $r_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$. [Figure: soft assignments to clusters.] (C. Bishop, Pattern Recognition & Machine Learning)
Iterative Algorithm. Motivates a coordinate-ascent-like algorithm:
1. Infer missing values $z_i$ given an estimate $\hat\theta$ of the parameters
2. Optimize the parameters to produce a new $\hat\theta$ given the filled-in data $z_i$
3. Repeat
Example: MoG (a sketch follows below).
1. Infer responsibilities: $r_{ik} = p(z_i = k \mid x_i, \hat\theta^{(t-1)}) = \frac{\hat\pi_k \, \mathcal{N}(x_i \mid \hat\mu_k, \hat\Sigma_k)}{\sum_j \hat\pi_j \, \mathcal{N}(x_i \mid \hat\mu_j, \hat\Sigma_j)}$
2. Optimize parameters. Max w.r.t. $\pi_k$: $\hat\pi_k = \frac{1}{N} \sum_i r_{ik}$. Max w.r.t. $\mu_k, \Sigma_k$: $\hat\mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$, $\hat\Sigma_k = \frac{\sum_i r_{ik} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T}{\sum_i r_{ik}}$
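A hedged NumPy sketch of the E-step/M-step updates above for a mixture of Gaussians; the initialization, the small ridge added to the covariances, and the fixed iteration count are assumptions for illustration.

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture: the E-step computes responsibilities r_ik,
    the M-step re-estimates pi, mu, Sigma from the soft assignments."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(iters):
        # E-step: r_ik proportional to pi_k N(x_i | mu_k, Sigma_k)
        r = np.zeros((N, K))
        for k in range(K):
            diff = X - mu[k]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
            norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma[k]))
            r[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates using the responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, r
```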
E.M. Convergence. EM is coordinate ascent on an interesting potential function. Coordinate ascent on a bounded potential function implies convergence to a local optimum is guaranteed. This algorithm is REALLY USED, and in high dimensional state spaces too, e.g., vector quantization for speech data.
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Initialization. In the mixture model case (complete data $y_i = \{z_i, x_i\}$) there are many ways to initialize the EM algorithm. Examples: choose K observations at random to define each cluster, and assign the other observations to the nearest centroid to form initial parameter estimates; pick the centers sequentially to provide good coverage of the data (a sketch of one such scheme follows below); grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed. Initialization can be quite important to convergence rates in practice.
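One way to pick the centers sequentially for good coverage is a k-means++-style seeding; this is a hedged sketch of that idea, not necessarily the scheme the lecture had in mind.

```python
import numpy as np

def sequential_init(X, K, seed=0):
    """Pick centers one at a time: each new center is sampled with probability
    proportional to its squared distance from the closest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```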
Label switching: color = label does not matter. Can switch labels and the likelihood is unchanged.
What you should know. K-means for clustering: the algorithm converges because it's coordinate ascent. EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data. Remember, E.M. can get stuck in local minima, and empirically it DOES. EM is coordinate ascent.
Dimensionality Reduction / PCA. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Dimensionality reduction. Input data may have thousands or millions of dimensions! E.g., text data has one dimension per vocabulary word. Dimensionality reduction: represent the data with fewer dimensions. Easier learning: fewer parameters. Visualization: hard to visualize more than 3D or 4D. Discover the intrinsic dimensionality of the data: high dimensional data that is truly lower dimensional.
Lower dimensional projections. Rather than picking a subset of the features, we can construct new features that are combinations of existing features. Let's see this in the unsupervised setting: just X, but no Y.
Linear projection and reconstruction. [Figure: 2-D points $(x_1, x_2)$ projected into one dimension $z_1$.] Reconstruction: knowing only $z_1$, what was $(x_1, x_2)$?
Principal component analysis, basic idea. Project n-dimensional data into a k-dimensional space while preserving information: e.g., project a space of 10000 words into 3 dimensions; e.g., project 3-D into 2-D. Choose the projection with minimum reconstruction error.
Linear projections, a review. Project a point into a (lower dimensional) space. Point: $x = (x_1, \ldots, x_d)$. Select a basis: a set of basis vectors $(u_1, \ldots, u_k)$; we consider an orthonormal basis: $u_i \cdot u_i = 1$, and $u_i \cdot u_j = 0$ for $i \neq j$. Select a center $\bar{x}$, which defines the offset of the space. The best coordinates in the lower dimensional space are defined by dot products: $(z_1, \ldots, z_k)$, $z_i = (x - \bar{x}) \cdot u_i$; these give the minimum squared error. A small sketch follows below.
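A tiny sketch of the projection and reconstruction formulas above; the particular basis vector and points are illustrative assumptions.

```python
import numpy as np

def project(x, xbar, U):
    """Coordinates z_i = (x - xbar) . u_i for an orthonormal basis U of shape (k, d)."""
    return U @ (x - xbar)

def reconstruct(z, xbar, U):
    """Reconstruction xhat = xbar + sum_i z_i u_i."""
    return xbar + U.T @ z

x, xbar = np.array([3.0, 1.0]), np.array([1.0, 1.0])
U = np.array([[1.0, 1.0]]) / np.sqrt(2)        # a single orthonormal basis vector
z = project(x, xbar, U)
squared_error = np.sum((x - reconstruct(z, xbar, U)) ** 2)   # reconstruction error
```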
PCA finds the projection that minimizes reconstruction error. Given N data points $x^i = (x_1^i, \ldots, x_d^i)$, $i = 1, \ldots, N$, we will represent each point as a projection $\hat{x}^i = \bar{x} + \sum_{j=1}^{k} z_j^i u_j$, where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^i$ and $z_j^i = (x^i - \bar{x}) \cdot u_j$. PCA: given $k \ll d$, find $(u_1, \ldots, u_k)$ minimizing the reconstruction error $\mathrm{error}_k = \sum_{i=1}^{N} \|x^i - \hat{x}^i\|^2$.
Understanding the reconstruction error. Note that $x^i$ can be represented exactly by a d-dimensional projection: $x^i = \bar{x} + \sum_{j=1}^{d} z_j^i u_j$. Given $k \ll d$, find $(u_1, \ldots, u_k)$ minimizing $\mathrm{error}_k = \sum_{i=1}^{N} \|x^i - \hat{x}^i\|^2$. Rewriting the error: $\mathrm{error}_k = \sum_{i=1}^{N} \sum_{j=k+1}^{d} (z_j^i)^2 = \sum_{j=k+1}^{d} \sum_{i=1}^{N} \big(u_j \cdot (x^i - \bar{x})\big)^2$.
Reconstruction error and covariance matrix: $\mathrm{error}_k = \sum_{j=k+1}^{d} \sum_{i=1}^{N} \big(u_j \cdot (x^i - \bar{x})\big)^2 = \sum_{j=k+1}^{d} u_j^T \left[ \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T \right] u_j = N \sum_{j=k+1}^{d} u_j^T \Sigma \, u_j$, where $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T$ is the empirical covariance matrix.
Minimizing reconstruction error and eigenvectors. Minimizing the reconstruction error is equivalent to picking an orthonormal basis $(u_1, \ldots, u_d)$ minimizing $N \sum_{j=k+1}^{d} u_j^T \Sigma \, u_j$. Eigenvector: $\Sigma u = \lambda u$. Minimizing the reconstruction error is equivalent to picking $(u_{k+1}, \ldots, u_d)$ to be the eigenvectors with the smallest eigenvalues.
Basic PCA algorithm. Start from the N x d data matrix X. Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$. Compute the covariance matrix: $\Sigma \leftarrow \frac{1}{N} X_c^T X_c$. Find the eigenvectors and eigenvalues of $\Sigma$. Principal components: the k eigenvectors with the highest eigenvalues. A sketch follows below.
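A NumPy sketch of the algorithm on this slide, using `np.linalg.eigh` on the covariance; the function name and the choice to also return the projected coordinates are assumptions.

```python
import numpy as np

def pca_eig(X, k):
    """Basic PCA: recenter, form the covariance matrix, keep the top-k eigenvectors."""
    Xc = X - X.mean(axis=0)                 # recenter: subtract the mean from each row
    Sigma = (Xc.T @ Xc) / len(X)            # covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(Sigma)      # eigenvalues ascending (symmetric Sigma)
    order = np.argsort(vals)[::-1][:k]      # k eigenvectors with the highest eigenvalues
    components = vecs[:, order].T           # (k, d) principal components
    return components, Xc @ components.T    # components and projected coordinates
```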
PCA example
PCA example: reconstruction using only the first principal component
Eigenfaces [Turk, Pentland '91]. Input images; principal components.
Eigenfaces reconstruction: each image corresponds to adding 8 principal components.
Scaling up. The covariance matrix can be really big! $\Sigma$ is d by d; with, say, only 10000 features, finding the eigenvectors is very slow. Use the singular value decomposition (SVD): it finds the top k eigenvectors, and great implementations are available, e.g., GraphLab, Python, R, Matlab svd.
SVD. Write $X = W S V^T$. X: data matrix, one row per datapoint. W: weight matrix, one row per datapoint, the coordinates of $x^i$ in eigenspace. S: singular value matrix, a diagonal matrix; in our setting each entry is an eigenvalue $\lambda_j$. $V^T$: singular vector matrix; in our setting each row is an eigenvector $v_j$.
PCA using SVD algorithm. Start from the N x d data matrix X. Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$. Call an SVD algorithm on $X_c$, asking for the top k singular vectors. Principal components: the k singular vectors with the highest singular values (rows of $V^T$). The coefficients become $z_j^i = s_j w_j^i$, i.e., the rows of $W S$. A sketch follows below.
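A matching sketch of the SVD route, using `np.linalg.svd` on the recentered data; again the function name and return values are illustrative.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the recentered data: X_c = W S V^T."""
    Xc = X - X.mean(axis=0)                       # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # top-k singular vectors (rows of V^T)
    Z = W[:, :k] * s[:k]                          # coefficients: rows of W S (= Xc @ V_k)
    return components, Z
```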
What you need to know. Dimensionality reduction: why and when it's important. Simple feature selection. Principal component analysis: minimizing reconstruction error; relationship to the covariance matrix and eigenvectors; using SVD.