Clustering: K-means. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Clustering images: a set of images [Goldberger et al.]
Clustering web search results
Example (taken from Kevin Murphy's ML textbook). Data: gene expression levels. Goal: cluster genes with similar expression trajectories.
Some Data
K-means
1. Ask user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center owns a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated! (A minimal sketch of this loop follows below.)
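A minimal NumPy sketch of the loop above (Lloyd's algorithm); the helper name `kmeans`, the random initialization from data points, and the iteration cap are illustrative assumptions, not something specified on the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and recentering steps."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: each datapoint finds which center it is closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Steps 4-5: each center finds the centroid of the points it owns and jumps there
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        # Step 6: repeat until terminated (here: until the centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

centers, labels = kmeans(np.random.randn(200, 2), k=5)
```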
K-means. Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$. Classify: assign each point $j \in \{1, \ldots, N\}$ to its nearest center: $C^{(t)}(j) \leftarrow \arg\min_i \|\mu_i^{(t)} - x_j\|_2^2$. Recenter: $\mu_i$ becomes the centroid of its points: $\mu_i^{(t+1)} \leftarrow \arg\min_\mu \sum_{j : C^{(t)}(j) = i} \|\mu - x_j\|_2^2$. Equivalent to $\mu_i$ being the average of its points!
What is K-means optimizing? Potential function $F(\mu, C)$ of centers $\mu$ and point allocations $C$: $F(\mu, C) = \sum_{j=1}^{N} \|\mu_{C(j)} - x_j\|_2^2$. Optimal K-means: $\min_\mu \min_C F(\mu, C)$.
Does K-means converge??? Part 1. Optimize the potential function: fix $\mu$, optimize $C$. With the centers fixed, assigning each point to its nearest center minimizes $F$ over $C$, so this step can only decrease $F$.
Does K-means converge??? Part 2. Optimize the potential function: fix $C$, optimize $\mu$. With the assignments fixed, setting each $\mu_i$ to the mean of the points it owns minimizes $F$ over $\mu$, so this step also can only decrease $F$.
Coordinate descent algorithms. Want: $\min_a \min_b F(a, b)$. Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). (For LASSO it converged to the global optimum, because of convexity.) K-means is a coordinate descent algorithm! A toy sketch of coordinate descent follows below.
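To make the coordinate descent picture concrete, here is a small hedged sketch on a toy convex function; the particular F(a, b) and the step count are illustrative assumptions.

```python
def coordinate_descent(steps=50):
    """Minimize F(a, b) = (a - 1)**2 + (b + 2)**2 + a*b by alternating exact
    one-dimensional minimizations; each step can only decrease F."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        a = 1.0 - b / 2.0    # fix b, minimize over a: dF/da = 2(a - 1) + b = 0
        b = -2.0 - a / 2.0   # fix a, minimize over b: dF/db = 2(b + 2) + a = 0
    return a, b              # converges to the global optimum here because F is convex

print(coordinate_descent())  # approximately (8/3, -10/3)
```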
Mixtures of Gaussians. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
(One) bad case for k-means: clusters may overlap; some clusters may be wider than others.
Nonspherical data
Quick Review of Gaussians: univariate and multivariate Gaussians.
Two-Dimensional Gaussians. [Figure: contour and surface plots of 2-D Gaussians with spherical, diagonal, and full covariance matrices.]
Gaussians in d Dimensions: $P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]$
Learning Gaussians. $P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]$. Given data $x_1, \ldots, x_N$: MLE for the mean: $\hat\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$. MLE for the covariance: $\hat\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat\mu)(x_i - \hat\mu)^T$.
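A short NumPy sketch of these MLE formulas and the d-dimensional density from the previous slide; the function names and the example data are assumptions for illustration.

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a d-dimensional Gaussian from an (N, d) data matrix X."""
    mu = X.mean(axis=0)                         # mean: average of the data points
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / len(X)                # covariance: (1/N) sum (x - mu)(x - mu)^T
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """P(x) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x - mu)^T Sigma^-1 (x - mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu, Sigma = gaussian_mle(np.random.randn(500, 2))
print(gaussian_density(np.zeros(2), mu, Sigma))
```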
When the world is not Gaussian: distribution of male heights in the US; distribution of male heights in Sweden. What if we mix these together?
Gaussian Mixture Model: the most commonly used mixture model. Observations: $x_1, \ldots, x_N$. Parameters: mixture weights $\pi$, means $\mu_k$, covariances $\Sigma_k$. Cluster indicator: $z_i$. Per-cluster likelihood: $p(x_i \mid z_i = k) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$. Example: $z_i$ = country of origin, $x_i$ = height of the i-th person; the k-th mixture component = distribution of heights in country k.
Generative Model. We can think of sampling observations from the model. For each observation i: sample a cluster assignment $z_i \sim \mathrm{Categorical}(\pi)$, then sample the observation from the selected Gaussian, $x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$.
Density Estimation: estimate a density based on $x_1, \ldots, x_N$.
Density Estimation. [Figure: contour plot of the joint density.]
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figure: mixture of 3 Gaussians next to the contour plot of the joint density.]
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figure: mixture of 3 Gaussians.] $p(x_i \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
Density as Mixture of Gaussians: approximate the density with a mixture of Gaussians. [Figures: mixture of 3 Gaussians; our actual observations.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: imagine we have an assignment of each $x_i$ to a Gaussian. [Figures: our actual observations; complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: imagine we have an assignment of each $x_i$ to a Gaussian. Introduce a latent cluster indicator variable $z_i$. Then we have $p(x_i \mid z_i, \pi, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$. [Figure: complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
Clustering our Observations: we must infer the cluster assignments from the observations. Posterior probabilities of assignments to each cluster *given* model parameters: $r_{ik} = p(z_i = k \mid x_i, \pi, \mu, \Sigma) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$. [Figure: soft assignments to clusters.] (C. Bishop, Pattern Recognition & Machine Learning)
Unsupervised Learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between.
Summary of GMM Concept. Estimate a density based on $x_1, \ldots, x_N$: $p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$. [Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments.]
Summary of GMM Components. Observations $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$. Hidden cluster labels $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$. Hidden mixture means $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$. Hidden mixture covariances $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$. Hidden mixture probabilities $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$. Gaussian mixture marginal and conditional likelihood: $p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, p(x_i \mid z_i, \mu, \Sigma)$, with $p(x_i \mid z_i, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$.
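A hedged sketch of the marginal likelihood and the soft assignments $r_{ik}$ defined above, reusing the `gaussian_density` helper from the earlier sketch; the parameter containers (lists of means and covariances) are illustrative assumptions.

```python
import numpy as np

def gmm_density(x, pi, mus, Sigmas):
    """p(x | pi, mu, Sigma) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(len(pi)))

def responsibilities(x, pi, mus, Sigmas):
    """r_k = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = np.array([pi[k] * gaussian_density(x, mus[k], Sigmas[k])
                         for k in range(len(pi))])
    return weighted / weighted.sum()
```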
Application to Document Modeling. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Cluster Documents: cluster documents based on topic.
Document Representation: bag of words model.
Issues with Document Representation. Word counts are bad for standard similarity metrics. Term Frequency Inverse Document Frequency (tf-idf): increase the importance of rare words.
TF-IDF. Term frequency: tf(t, d) = count of term t in document d (could also use {0, 1}, 1 + log tf(t, d), ...). Inverse document frequency: idf(t, D) = log(|D| / |{d in D : t appears in d}|). tf-idf: tfidf(t, d, D) = tf(t, d) * idf(t, D). High for a document d with a high frequency of term t (high term frequency) and few documents containing term t in the corpus (high inverse document frequency).
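A small sketch of the tf-idf weighting above; the toy corpus, the tokenization, and the raw-count choice of tf are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tfidf(t, d, D) = tf(t, d) * log(|D| / #{d in D : t in d})."""
    tf = Counter(doc)[term]                                # raw count of t in d
    n_containing = sum(1 for d in corpus if term in d)     # documents containing t
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf

corpus = [["the", "dog", "ran"], ["the", "cat", "sat"], ["dog", "bites", "dog"]]
print(tfidf("dog", corpus[2], corpus))   # frequent in d, rarer in D -> larger weight
```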
A Generative Model. Documents: the observations $x_i$ (e.g., tf-idf vectors). Associated topics: cluster indicators $z_i$. Parameters, as in a simple mixture of Gaussians: $\{\pi_k, \mu_k, \Sigma_k\}$.
What you get from a mixture model for documents: words give topic; topic proportions; topic distribution of each document.
Results from Wikipedia data using a similar model (LDA)
Expectation Maximization. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Next... back to Density Estimation. What if we want to do density estimation with multimodal or clumpy data?
Learning Model Parameters: want to learn the model parameters. [Figures: mixture of 3 Gaussians; our actual observations.] (C. Bishop, Pattern Recognition & Machine Learning)
ML Estimate of Mixture Model Params. Log likelihood: $L_x(\theta) \triangleq \log p(\{x_i\} \mid \theta) = \sum_i \log \sum_{z_i} p(x_i, z_i \mid \theta)$. Want the ML estimate: $\hat\theta_{ML} = \arg\max_\theta L_x(\theta)$. Neither convex nor concave, and has local optima.
Complete Data: imagine we have an assignment of each $x_i$ to a cluster. [Figures: our actual observations; complete data labeled by true cluster assignments.] (C. Bishop, Pattern Recognition & Machine Learning)
If complete data were observed. Assume the class labels $z_i$ were observed in addition to $x_i$: $L_{x,z}(\theta) = \sum_i \log p(x_i, z_i \mid \theta)$. Compute ML estimates: the objective separates over clusters k! Example: mixture of Gaussians (MoG), $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
Cluster Responsibilities: we must infer the cluster assignments from the observations. Posterior probabilities of assignments to each cluster *given* model parameters: $r_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$. [Figure: soft assignments to clusters.] (C. Bishop, Pattern Recognition & Machine Learning)
Iterative Algorithm. Motivates a coordinate-ascent-like algorithm:
1. Infer missing values $z_i$ given an estimate $\hat\theta$ of the parameters
2. Optimize the parameters to produce a new $\hat\theta$ given the filled-in data $z_i$
3. Repeat
Example: MoG (a sketch follows below).
1. Infer responsibilities: $r_{ik} = p(z_i = k \mid x_i, \hat\theta^{(t-1)}) = \frac{\hat\pi_k \, \mathcal{N}(x_i \mid \hat\mu_k, \hat\Sigma_k)}{\sum_j \hat\pi_j \, \mathcal{N}(x_i \mid \hat\mu_j, \hat\Sigma_j)}$
2. Optimize parameters. Max w.r.t. $\pi_k$: $\hat\pi_k = \frac{1}{N} \sum_i r_{ik}$. Max w.r.t. $\mu_k, \Sigma_k$: $\hat\mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$, $\hat\Sigma_k = \frac{\sum_i r_{ik} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T}{\sum_i r_{ik}}$
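A hedged NumPy sketch of the E-step/M-step updates above for a mixture of Gaussians; the initialization, the small ridge added to the covariances, and the fixed iteration count are assumptions for illustration.

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture: the E-step computes responsibilities r_ik,
    the M-step re-estimates pi, mu, Sigma from the soft assignments."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(iters):
        # E-step: r_ik proportional to pi_k N(x_i | mu_k, Sigma_k)
        r = np.zeros((N, K))
        for k in range(K):
            diff = X - mu[k]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
            norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma[k]))
            r[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates using the responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, r
```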
E.M. Convergence. EM is coordinate ascent on an interesting potential function. Coordinate ascent on a bounded potential function implies convergence to a local optimum is guaranteed. This algorithm is REALLY USED, and in high dimensional state spaces too, e.g., vector quantization for speech data.
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Initialization. In the mixture model case (complete data $y_i = \{z_i, x_i\}$) there are many ways to initialize the EM algorithm. Examples: choose K observations at random to define each cluster, and assign the other observations to the nearest centroid to form initial parameter estimates; pick the centers sequentially to provide good coverage of the data (a sketch of one such scheme follows below); grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed. Initialization can be quite important to convergence rates in practice.
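One way to pick the centers sequentially for good coverage is a k-means++-style seeding; this is a hedged sketch of that idea, not necessarily the scheme the lecture had in mind.

```python
import numpy as np

def sequential_init(X, K, seed=0):
    """Pick centers one at a time: each new center is sampled with probability
    proportional to its squared distance from the closest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```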
Label switching: color = label does not matter. Can switch labels and the likelihood is unchanged.
What you should know. K-means for clustering: the algorithm converges because it's coordinate ascent. EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data. Remember, E.M. can get stuck in local minima, and empirically it DOES. EM is coordinate ascent.
Dimensionality Reduction / PCA. Machine Learning CSEP546. Carlos Guestrin, University of Washington, February 18, 2014.
Dimensionality reduction. Input data may have thousands or millions of dimensions! E.g., text data has one dimension per vocabulary word. Dimensionality reduction: represent the data with fewer dimensions. Easier learning: fewer parameters. Visualization: hard to visualize more than 3D or 4D. Discover the intrinsic dimensionality of the data: high dimensional data that is truly lower dimensional.
Lower dimensional projections. Rather than picking a subset of the features, we can construct new features that are combinations of existing features. Let's see this in the unsupervised setting: just X, but no Y.
Linear projection and reconstruction. [Figure: 2-D points $(x_1, x_2)$ projected into one dimension $z_1$.] Reconstruction: knowing only $z_1$, what was $(x_1, x_2)$?
Principal component analysis, basic idea. Project n-dimensional data into a k-dimensional space while preserving information: e.g., project a space of 10000 words into 3 dimensions; e.g., project 3-D into 2-D. Choose the projection with minimum reconstruction error.
Linear projections, a review. Project a point into a (lower dimensional) space. Point: $x = (x_1, \ldots, x_d)$. Select a basis: a set of basis vectors $(u_1, \ldots, u_k)$; we consider an orthonormal basis: $u_i \cdot u_i = 1$, and $u_i \cdot u_j = 0$ for $i \neq j$. Select a center $\bar{x}$, which defines the offset of the space. The best coordinates in the lower dimensional space are defined by dot products: $(z_1, \ldots, z_k)$, $z_i = (x - \bar{x}) \cdot u_i$; these give the minimum squared error. A small sketch follows below.
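A tiny sketch of the projection and reconstruction formulas above; the particular basis vector and points are illustrative assumptions.

```python
import numpy as np

def project(x, xbar, U):
    """Coordinates z_i = (x - xbar) . u_i for an orthonormal basis U of shape (k, d)."""
    return U @ (x - xbar)

def reconstruct(z, xbar, U):
    """Reconstruction xhat = xbar + sum_i z_i u_i."""
    return xbar + U.T @ z

x, xbar = np.array([3.0, 1.0]), np.array([1.0, 1.0])
U = np.array([[1.0, 1.0]]) / np.sqrt(2)        # a single orthonormal basis vector
z = project(x, xbar, U)
squared_error = np.sum((x - reconstruct(z, xbar, U)) ** 2)   # reconstruction error
```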
PCA finds the projection that minimizes reconstruction error. Given N data points $x^i = (x_1^i, \ldots, x_d^i)$, $i = 1, \ldots, N$, we will represent each point as a projection $\hat{x}^i = \bar{x} + \sum_{j=1}^{k} z_j^i u_j$, where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^i$ and $z_j^i = (x^i - \bar{x}) \cdot u_j$. PCA: given $k \ll d$, find $(u_1, \ldots, u_k)$ minimizing the reconstruction error $\mathrm{error}_k = \sum_{i=1}^{N} \|x^i - \hat{x}^i\|^2$.
Understanding the reconstruction error. Note that $x^i$ can be represented exactly by a d-dimensional projection: $x^i = \bar{x} + \sum_{j=1}^{d} z_j^i u_j$. Given $k \ll d$, find $(u_1, \ldots, u_k)$ minimizing $\mathrm{error}_k = \sum_{i=1}^{N} \|x^i - \hat{x}^i\|^2$. Rewriting the error: $\mathrm{error}_k = \sum_{i=1}^{N} \sum_{j=k+1}^{d} (z_j^i)^2 = \sum_{j=k+1}^{d} \sum_{i=1}^{N} \big(u_j \cdot (x^i - \bar{x})\big)^2$.
Reconstruction error and covariance matrix: $\mathrm{error}_k = \sum_{j=k+1}^{d} \sum_{i=1}^{N} \big(u_j \cdot (x^i - \bar{x})\big)^2 = \sum_{j=k+1}^{d} u_j^T \left[ \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T \right] u_j = N \sum_{j=k+1}^{d} u_j^T \Sigma \, u_j$, where $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T$ is the empirical covariance matrix.
Minimizing reconstruction error and eigenvectors. Minimizing the reconstruction error is equivalent to picking an orthonormal basis $(u_1, \ldots, u_d)$ minimizing $N \sum_{j=k+1}^{d} u_j^T \Sigma \, u_j$. Eigenvector: $\Sigma u = \lambda u$. Minimizing the reconstruction error is equivalent to picking $(u_{k+1}, \ldots, u_d)$ to be the eigenvectors with the smallest eigenvalues.
Basic PCA algorithm. Start from the N x d data matrix X. Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$. Compute the covariance matrix: $\Sigma \leftarrow \frac{1}{N} X_c^T X_c$. Find the eigenvectors and eigenvalues of $\Sigma$. Principal components: the k eigenvectors with the highest eigenvalues. A sketch follows below.
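A NumPy sketch of the algorithm on this slide, using `np.linalg.eigh` on the covariance; the function name and the choice to also return the projected coordinates are assumptions.

```python
import numpy as np

def pca_eig(X, k):
    """Basic PCA: recenter, form the covariance matrix, keep the top-k eigenvectors."""
    Xc = X - X.mean(axis=0)                 # recenter: subtract the mean from each row
    Sigma = (Xc.T @ Xc) / len(X)            # covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(Sigma)      # eigenvalues ascending (symmetric Sigma)
    order = np.argsort(vals)[::-1][:k]      # k eigenvectors with the highest eigenvalues
    components = vecs[:, order].T           # (k, d) principal components
    return components, Xc @ components.T    # components and projected coordinates
```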
PCA example
PCA example: reconstruction using only the first principal component
Eigenfaces [Turk, Pentland '91]. Input images; principal components.
Eigenfaces reconstruction: each image corresponds to adding 8 principal components.
Scaling up. The covariance matrix can be really big! $\Sigma$ is d by d; with, say, only 10000 features, finding the eigenvectors is very slow. Use the singular value decomposition (SVD): it finds the top k eigenvectors, and great implementations are available, e.g., GraphLab, Python, R, Matlab svd.
SVD. Write $X = W S V^T$. X: data matrix, one row per datapoint. W: weight matrix, one row per datapoint, the coordinates of $x^i$ in eigenspace. S: singular value matrix, a diagonal matrix; in our setting each entry is an eigenvalue $\lambda_j$. $V^T$: singular vector matrix; in our setting each row is an eigenvector $v_j$.
PCA using SVD algorithm. Start from the N x d data matrix X. Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$. Call an SVD algorithm on $X_c$, asking for the top k singular vectors. Principal components: the k singular vectors with the highest singular values (rows of $V^T$). The coefficients become $z_j^i = s_j w_j^i$, i.e., the rows of $W S$. A sketch follows below.
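A matching sketch of the SVD route, using `np.linalg.svd` on the recentered data; again the function name and return values are illustrative.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the recentered data: X_c = W S V^T."""
    Xc = X - X.mean(axis=0)                       # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # top-k singular vectors (rows of V^T)
    Z = W[:, :k] * s[:k]                          # coefficients: rows of W S (= Xc @ V_k)
    return components, Z
```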
What you need to know. Dimensionality reduction: why and when it's important. Simple feature selection. Principal component analysis: minimizing reconstruction error; relationship to the covariance matrix and eigenvectors; using SVD.