ECE 5424: Introduction to Machine Learning Topics: Unsupervised Learning: K-means, GMM, EM Readings: Barber 20.1-20.3 Stefan Lee Virginia Tech
Tasks Supervised Learning: Classification (x → y, y discrete), Regression (x → y, y continuous). Unsupervised Learning: Clustering (x → c, c a discrete cluster ID), Dimensionality Reduction (x → z, z continuous). (C) Dhruv Batra 2
Unsupervised Learning Learning only with X; Y is not present in the training data. Some example unsupervised learning problems: Clustering / Factor Analysis, Dimensionality Reduction / Embeddings, Density Estimation with Mixture Models (C) Dhruv Batra 3
New Topic: Clustering Slide Credit: Carlos Guestrin 4
Synonyms: Clustering, Vector Quantization, Latent Variable Models, Hidden Variable Models, Mixture Models. Algorithms: K-means, Expectation Maximization (EM). (C) Dhruv Batra 5
Some Data (C) Dhruv Batra Slide Credit: Carlos Guestrin 6
K-means 1. Ask user how many clusters they'd like. (e.g. k=5) (C) Dhruv Batra Slide Credit: Carlos Guestrin 7
K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations (C) Dhruv Batra Slide Credit: Carlos Guestrin 8
K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints) (C) Dhruv Batra Slide Credit: Carlos Guestrin 9
K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns (C) Dhruv Batra Slide Credit: Carlos Guestrin 10
K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. Repeat until terminated! (C) Dhruv Batra Slide Credit: Carlos Guestrin 11
K-means
Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$
Assign: assign each point $i \in \{1, \ldots, n\}$ to its nearest center:
$$C(i) \leftarrow \arg\min_j \|x_i - \mu_j\|^2$$
Recenter: $\mu_j$ becomes the centroid of the points it owns.
(C) Dhruv Batra Slide Credit: Carlos Guestrin 12
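A minimal sketch of these two alternating steps in NumPy (not from the slides; the function name, the initialization by sampling k data points, and the convergence check are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate the Assign and Recenter steps."""
    rng = np.random.default_rng(seed)
    # Randomly initialize k centers by picking k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: each point goes to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center becomes the centroid of the points it owns
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: assignments will no longer change
        centers = new_centers
    return centers, assign
```

With k=5 this mirrors the recipe on the previous slides: the Assign line is step 3 and the Recenter line is step 4, repeated until termination.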
K-means Demo http://mlehman.github.io/kmeans-javascript/ (C) Dhruv Batra 13
What is K-means optimizing?
Objective $F(\mu, C)$: a function of the centers $\mu$ and the point allocations $C$:
$$F(\mu, C) = \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2$$
Equivalently, with a 1-of-k encoding $a_{ij} \in \{0, 1\}$:
$$F(\mu, a) = \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$$
Optimal K-means: $\min_{\mu} \min_{a} F(\mu, a)$
(C) Dhruv Batra 14
Coordinate descent algorithms Want: $\min_a \min_b F(a, b)$. Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). K-means is a coordinate descent algorithm! (C) Dhruv Batra Slide Credit: Carlos Guestrin 15
K-means as Coordinate Descent
Optimize the objective function:
$$\min_{\mu_1,\ldots,\mu_k} \; \min_{a_1,\ldots,a_N} \; F(\mu, a) = \min_{\mu_1,\ldots,\mu_k} \; \min_{a_1,\ldots,a_N} \; \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$$
Step 1: Fix $\mu$, optimize $a$ (or $C$), the Assign step.
(C) Dhruv Batra Slide Credit: Carlos Guestrin 16
K-means as Coordinate Descent
Optimize the objective function:
$$\min_{\mu_1,\ldots,\mu_k} \; \min_{a_1,\ldots,a_N} \; F(\mu, a) = \min_{\mu_1,\ldots,\mu_k} \; \min_{a_1,\ldots,a_N} \; \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$$
Step 2: Fix $a$ (or $C$), optimize $\mu$, the Recenter step (worked out below).
(C) Dhruv Batra Slide Credit: Carlos Guestrin 17
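A short worked derivation of the two sub-minimizations, added here since the slides state them without proof: fixing $\mu$, the best assignment sends each point to its nearest center; fixing $a$, setting the gradient with respect to $\mu_j$ to zero gives the centroid of the points assigned to cluster j.

$$a_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{j'} \|x_i - \mu_{j'}\|^2 \\ 0 & \text{otherwise,} \end{cases}
\qquad
\frac{\partial F}{\partial \mu_j} = -2 \sum_{i=1}^{N} a_{ij}(x_i - \mu_j) = 0
\;\Longrightarrow\;
\mu_j = \frac{\sum_{i=1}^{N} a_{ij}\, x_i}{\sum_{i=1}^{N} a_{ij}}$$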
One important use of K-means Bag-of-words models in computer vision (C) Dhruv Batra 18
Bag of Words model
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
Zaire 0
(C) Dhruv Batra Slide Credit: Carlos Guestrin 19
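A tiny illustration of how such a count vector could be built from text; the vocabulary and sentence below are made up for the example:

```python
from collections import Counter

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
document = "all about oil and gas all over africa about the oil"

# Count word occurrences; word order is discarded, hence "bag" of words
counts = Counter(document.lower().split())
bow = [counts[word] for word in vocabulary]
print(dict(zip(vocabulary, bow)))
```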
Object Bag of words Fei-Fei Li
Fei-Fei Li
Interest Point Features Detect patches [Mikolajczyk and Schmid 02] [Matas et al. 02] [Sivic et al. 03], normalize each patch, then compute a SIFT descriptor [Lowe 99]. Slide credit: Josef Sivic
Patch Features Slide credit: Josef Sivic
Dictionary formation Slide credit: Josef Sivic
Clustering (usually k-means) Vector quantization Slide credit: Josef Sivic
Clustered Image Patches Fei-Fei et al. 2005
Image representation [Figure: histogram of codeword frequencies; the image is represented by how often each visual codeword occurs.] Fei-Fei Li
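A sketch of the vector-quantization step that produces this histogram, assuming a codebook of k-means centers and an array of local descriptors (e.g., 128-d SIFT); the function name and the normalization choice are illustrative:

```python
import numpy as np

def bag_of_words_histogram(descriptors, codebook):
    """Quantize each local descriptor to its nearest codeword and count occurrences.

    descriptors: (n, d) array of local features from one image (e.g., 128-d SIFT)
    codebook:    (k, d) array of cluster centers learned by k-means on many images
    """
    # Nearest codeword for each descriptor (squared Euclidean distance)
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    # Histogram of codeword frequencies = the image's bag-of-words representation
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()  # normalize so images with different patch counts are comparable
```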
(One) bad case for K-means: clusters may overlap; some clusters may be wider than others. GMM to the rescue! (C) Dhruv Batra Slide Credit: Carlos Guestrin 28
GMM [Figure: example 2D data modeled by a Gaussian mixture.] (C) Dhruv Batra Figure Credit: Kevin Murphy 29
Recall Multi-variate Gaussians (C) Dhruv Batra 30
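For reference, since the slide leaves the details to the board, the multivariate Gaussian density in d dimensions is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)$$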
GMM [Figure: density of a Gaussian mixture model.] (C) Dhruv Batra Figure Credit: Kevin Murphy 31
Hidden Data Causes Problems #1 Fully Observed (Log) Likelihood factorizes; Marginal (Log) Likelihood doesn't factorize. All parameters coupled! (C) Dhruv Batra 32
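Concretely (writing out the step the slide refers to), with observed component labels $z_i$ the log-likelihood splits into separate per-component terms, whereas the marginal log-likelihood has a sum inside the log, coupling all of $\pi_k, \mu_k, \Sigma_k$:

$$\log p(x, z \mid \theta) = \sum_{i=1}^{N} \Big( \log \pi_{z_i} + \log \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \Big)
\qquad \text{vs.} \qquad
\log p(x \mid \theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$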
GMM vs Gaussian Joint Bayes Classifier (on board): Observed Y vs Unobserved Z; Likelihood vs Marginal Likelihood. (C) Dhruv Batra 33
Hidden Data Causes Problems #2 [Figure: 2D data drawn from a mixture, with the component labels unobserved.] (C) Dhruv Batra Figure Credit: Kevin Murphy 34
Hidden Data Causes Problems #2 Identifiability [Figure: likelihood surface plotted over (µ1, µ2); swapping the component labels gives an equivalent solution, so the surface has multiple symmetric modes.] (C) Dhruv Batra Figure Credit: Kevin Murphy 35
Hidden Data Causes Problems #3 Likelihood has singularities if one Gaussian collapses. [Figure: a 1D density p(x) in which one component has shrunk to a narrow spike around a single data point.] (C) Dhruv Batra 36
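One way to see the singularity (a standard argument, not spelled out on the slide): if a component's mean sits exactly on a data point and its variance shrinks to zero, that likelihood term blows up, so the marginal likelihood is unbounded above:

$$\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2 I)\Big|_{\mu_k = x_i} = \frac{\pi_k}{(2\pi \sigma_k^2)^{d/2}} \;\longrightarrow\; \infty \quad \text{as } \sigma_k \to 0.$$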
Special case: spherical Gaussians and hard assignments
If $P(X \mid Z = k)$ is spherical, with the same $\sigma$ for all classes:
$$P(x_i \mid z = j) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2 \right)$$
If each $x_i$ belongs to one class $C(i)$ (hard assignment), the marginal likelihood:
$$\prod_{i=1}^{N} \sum_{j=1}^{k} P(x_i, y = j) \propto \prod_{i=1}^{N} \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)$$
M(M)LE same as K-means!!!
(C) Dhruv Batra Slide Credit: Carlos Guestrin 37
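Taking the negative log of the right-hand side (a step left implicit on the slide) makes the equivalence explicit: maximizing this likelihood over the means and hard assignments is exactly minimizing the K-means objective $F(\mu, C)$ from earlier, up to the constant $1/(2\sigma^2)$:

$$-\log \prod_{i=1}^{N} \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2
= \frac{1}{2\sigma^2} \, F(\mu, C)$$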
The K-means GMM assumption There are k components. Component i has an associated mean vector $\mu_i$. [Figure: three component means $\mu_1, \mu_2, \mu_3$.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 38
The K-means GMM assumption There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: [Figure: three component means $\mu_1, \mu_2, \mu_3$.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 39
The K-means GMM assumption There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability $P(y = i)$. [Figure: component $\mu_2$ selected.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 40
The K-means GMM assumption There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability $P(y = i)$. 2. Datapoint $\sim \mathcal{N}(\mu_i, \sigma^2 I)$. [Figure: a datapoint x drawn near $\mu_2$.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 41
The General GMM assumption There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability $P(y = i)$. 2. Datapoint $\sim \mathcal{N}(\mu_i, \Sigma_i)$. [Figure: three full-covariance components with means $\mu_1, \mu_2, \mu_3$.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 42
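A minimal sampling sketch of this two-step recipe; the mixture weights, means, and covariances below are made up purely for illustration:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n points from a Gaussian mixture following the two-step recipe."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, means.shape[1]))
    for t in range(n):
        # 1. Pick a component at random with probability P(y = i)
        i = rng.choice(len(weights), p=weights)
        # 2. Draw the datapoint from that component's Gaussian N(mu_i, Sigma_i)
        X[t] = rng.multivariate_normal(means[i], covs[i])
    return X

# Illustrative 2D mixture with three components (numbers are made up)
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]])
covs = np.array([np.eye(2), [[2.0, 0.8], [0.8, 1.0]], [[1.0, -0.5], [-0.5, 2.0]]])
X = sample_gmm(500, weights, means, covs)
```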