ECE 5424: Introduction to Machine Learning


1 ECE 5424: Introduction to Machine Learning Topics: Unsupervised Learning: K-means, GMM, EM Readings: Barber Stefan Lee Virginia Tech

2 Tasks. Supervised Learning: Classification (x to y, discrete); Regression (x to y, continuous). Unsupervised Learning: Clustering (x to c, discrete ID); Dimensionality Reduction (x to z, continuous). (C) Dhruv Batra 2

3 Unsupervised Learning. Learning only with X; Y is not present in the training data. Some example unsupervised learning problems: Clustering / Factor Analysis; Dimensionality Reduction / Embeddings; Density Estimation with Mixture Models. (C) Dhruv Batra 3

4 New Topic: Clustering Slide Credit: Carlos Guestrin 4

5 Synonyms: Clustering, Vector Quantization, Latent Variable Models, Hidden Variable Models, Mixture Models. Algorithms: K-means, Expectation Maximization (EM). (C) Dhruv Batra 5

6 Some Data (C) Dhruv Batra Slide Credit: Carlos Guestrin 6

7 K-means 1. Ask user how many clusters they'd like. (e.g. k=5) (C) Dhruv Batra Slide Credit: Carlos Guestrin 7

8 K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations (C) Dhruv Batra Slide Credit: Carlos Guestrin 8

9 K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints) (C) Dhruv Batra Slide Credit: Carlos Guestrin 9

10 K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns (C) Dhruv Batra Slide Credit: Carlos Guestrin 10

11 K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. Repeat until terminated! (C) Dhruv Batra Slide Credit: Carlos Guestrin 11

12 K-means. Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$. Assign: assign each point $i \in \{1, \ldots, n\}$ to the nearest center: $C(i) \leftarrow \arg\min_j \|x_i - \mu_j\|^2$. Recenter: $\mu_j$ becomes the centroid of its points. (C) Dhruv Batra Slide Credit: Carlos Guestrin 12
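The assign and recenter steps can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the function names, the `init` argument, the convergence check, and the choice to leave a center in place when it owns no points are all assumptions made for the sketch.

```python
import numpy as np

def nearest_center(X, centers):
    """Assign step: index of the closest center for every row of X."""
    # Squared Euclidean distance from each point to each center, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iters=100, seed=0, init=None):
    """Alternate assign and recenter steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumption: initialize at k distinct, randomly chosen data points.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float).copy()
    for _ in range(n_iters):
        assign = nearest_center(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)  # recenter: centroid
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(X, centers)

# Toy data (hypothetical): two tight blobs around (0, 0) and (5, 5).
demo_rng = np.random.default_rng(1)
X = np.vstack([demo_rng.normal(0.0, 0.1, (20, 2)),
               demo_rng.normal(5.0, 0.1, (20, 2))])
centers, assign = kmeans(X, k=2)
```

The result depends on the initialization, so in practice the procedure is often run from several random starts and the run with the lowest objective is kept.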

13 K-means Demo (C) Dhruv Batra 13

14 What is K-means optimizing? Objective $F(\mu, C)$: function of centers $\mu$ and point allocations $C$: $F(\mu, C) = \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2$. With 1-of-k encoding: $F(\mu, a) = \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$. Optimal K-means: $\min_{\mu} \min_{a} F(\mu, a)$. (C) Dhruv Batra 14
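To make the two encodings concrete, the sketch below evaluates the objective both ways on a tiny hand-made example and checks that they agree. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def objective_assignments(X, centers, C):
    """F(mu, C): sum of squared distances from each point to its own center."""
    return float(((X - centers[C]) ** 2).sum())

def objective_one_hot(X, centers, A):
    """F(mu, a): same quantity, with A[i, j] = 1 iff point i belongs to center j."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((A * d2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
C = np.array([0, 0, 1])        # allocation as one index per point
A = np.eye(len(centers))[C]    # the same allocation in 1-of-k form

f_c = objective_assignments(X, centers, C)  # 1 + 1 + 0 = 2.0
f_a = objective_one_hot(X, centers, A)      # 2.0 as well
```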

15 Coordinate descent algorithms. Want: $\min_a \min_b F(a, b)$. Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). K-means is a coordinate descent algorithm! (C) Dhruv Batra Slide Credit: Carlos Guestrin 15
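A small numeric illustration of the fix-one, minimize-the-other loop is sketched below. The function F here is a hypothetical convex quadratic chosen so that each one-dimensional minimization has a closed form; it is not from the slides.

```python
def F(a, b):
    # Hypothetical objective: a convex quadratic with its minimum at (a, b) = (0, 2).
    return (a - 1.0) ** 2 + (b - 2.0) ** 2 + a * b

def coordinate_descent(a, b, n_iters=50):
    """Alternately minimize F over a (b fixed) and over b (a fixed)."""
    history = [F(a, b)]
    for _ in range(n_iters):
        a = 1.0 - b / 2.0  # solves dF/da = 2(a - 1) + b = 0
        history.append(F(a, b))
        b = 2.0 - a / 2.0  # solves dF/db = 2(b - 2) + a = 0
        history.append(F(a, b))
    return a, b, history

a_opt, b_opt, history = coordinate_descent(5.0, -3.0)
```

Because every half-step is an exact minimization over one variable, the recorded values of F never increase. That monotone decrease is the property that carries over to K-means.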

16 K-means as Co-ordinate Descent. Optimize objective function: $\min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} F(\mu, a) = \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$. Fix $\mu$, optimize a (or C). (C) Dhruv Batra Slide Credit: Carlos Guestrin 16

17 K-means as Co-ordinate Descent. Optimize objective function: $\min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} F(\mu, a) = \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$. Fix a (or C), optimize $\mu$. (C) Dhruv Batra Slide Credit: Carlos Guestrin 17
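Each of the two half-steps has a closed-form solution. The block below sketches the standard argument in the 1-of-k notation of the objective. It is a reconstruction rather than a transcription of the slides, and it assumes every cluster owns at least one point.

```latex
% Fix \mu, optimize a: the objective separates over points, so each point
% independently picks the center with the smallest squared distance.
a_{ij} =
\begin{cases}
1 & \text{if } j = \arg\min_{j'} \|x_i - \mu_{j'}\|^2 \\
0 & \text{otherwise}
\end{cases}

% Fix a, optimize \mu: set the gradient with respect to \mu_j to zero.
\frac{\partial F}{\partial \mu_j}
  = -2 \sum_{i=1}^{N} a_{ij} (x_i - \mu_j) = 0
\quad\Longrightarrow\quad
\mu_j = \frac{\sum_{i=1}^{N} a_{ij} x_i}{\sum_{i=1}^{N} a_{ij}}
```

The first solution is the assign step and the second is the recenter step (the centroid of the points owned by center j).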

18 One important use of K-means: Bag-of-words models in computer vision (C) Dhruv Batra 18

19 Bag of Words model: aardvark: 0, about: 2, all: 2, Africa: 1, apple: 0, anxious: 0, ..., gas: 1, ..., oil: 1, Zaire: 0 (C) Dhruv Batra Slide Credit: Carlos Guestrin 19
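A bag-of-words vector just counts how often each vocabulary word occurs, ignoring word order. The sketch below uses a hypothetical one-sentence document chosen to reproduce the counts listed on the slide; the tokenizer (lowercase alphabetic runs) is a simplifying assumption.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count occurrences of each vocabulary word in text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word.lower()] for word in vocabulary]

vocabulary = ["aardvark", "about", "all", "Africa", "apple",
              "anxious", "gas", "oil", "Zaire"]
text = "All about oil and gas: all about Africa."  # hypothetical document
vector = bag_of_words(text, vocabulary)
# vector == [0, 2, 2, 1, 0, 0, 1, 1, 0]
```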

20 Object Bag of words Fei-Fei Li

21 Fei-Fei Li

22 Interest Point Features: Detect patches [Mikolajczyk and Schmid 02] [Matas et al. 02] [Sivic et al. 03]; Normalize patch; Compute SIFT descriptor [Lowe 99]. Slide credit: Josef Sivic

23 Patch Features Slide credit: Josef Sivic

24 dictionary formation Slide credit: Josef Sivic

25 Clustering (usually k-means) Vector quantization Slide credit: Josef Sivic

26 Clustered Image Patches Fei-Fei et al. 2005

27 Image representation [Figure: histogram of frequency over codewords] Fei-Fei Li
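The image representation can be sketched as vector quantization followed by counting: each patch descriptor is mapped to its nearest codeword, and the image becomes a histogram of codeword frequencies. The 2-D descriptors and the three-entry codebook below are toy assumptions (real SIFT descriptors are 128-dimensional and the codebook would come from K-means).

```python
import numpy as np

def codeword_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest codeword and count occurrences."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest codeword per patch
    hist = np.bincount(words, minlength=len(codebook))
    return words, hist

codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 1.0],
                        [1.0, 9.0], [0.0, 8.0]])
words, hist = codeword_histogram(descriptors, codebook)
# words == [0, 0, 1, 2, 2], hist == [2, 1, 2]
```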

28 (One) bad case for k-means: Clusters may overlap. Some clusters may be "wider" than others. GMM to the rescue! (C) Dhruv Batra Slide Credit: Carlos Guestrin 28

29 GMM (C) Dhruv Batra Figure Credit: Kevin Murphy 29

30 Recall Multi-variate Gaussians (C) Dhruv Batra 30

31 GMM (C) Dhruv Batra Figure Credit: Kevin Murphy 31

32 Hidden Data Causes Problems #1: Fully Observed (Log) Likelihood factorizes. Marginal (Log) Likelihood doesn't factorize. All parameters coupled! (C) Dhruv Batra 32
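Written out for a Gaussian mixture, the contrast looks as follows. This is a sketch in assumed notation that does not appear on the slide: $\pi_j = P(z = j)$ for the mixing weights, $\mu_j$ and $\Sigma_j$ for the parameters of component j, and $\theta$ for all parameters together.

```latex
% Fully observed: z_i is known, so the log-likelihood is a sum of terms,
% each involving the parameters of a single component.
\log p(x, z \mid \theta)
  = \sum_{i=1}^{N} \left[ \log \pi_{z_i}
    + \log \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right]

% Hidden z_i: the sum over components sits inside the log, so the terms
% no longer separate by component and all parameters are coupled.
\log p(x \mid \theta)
  = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \pi_j \,
    \mathcal{N}(x_i \mid \mu_j, \Sigma_j)
```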

33 GMM vs Gaussian Joint Bayes Classifier (on board): Observed Y vs Unobserved Z; Likelihood vs Marginal Likelihood. (C) Dhruv Batra 33

34 Hidden Data Causes Problems # (C) Dhruv Batra Figure Credit: Kevin Murphy 34

35 Hidden Data Causes Problems #2: Identifiability [Figure: plot over the component means µ] (C) Dhruv Batra Figure Credit: Kevin Murphy 35

36 Hidden Data Causes Problems #3: Likelihood has singularities if one Gaussian "collapses" [Figure: p(x) plotted against x] (C) Dhruv Batra 36

37 Special case: spherical Gaussians and hard assignments. If P(X | Z=k) is spherical, with the same $\sigma$ for all classes: $P(x_i \mid z = j) \propto \exp\left(-\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2\right)$. If each $x_i$ belongs to one class C(i) (hard assignment), marginal likelihood: $\prod_{i=1}^{N} \sum_{j=1}^{k} P(x_i, y = j) \propto \prod_{i=1}^{N} \exp\left(-\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2\right)$. M(M)LE same as K-means!!! (C) Dhruv Batra Slide Credit: Carlos Guestrin 37
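The equivalence follows from taking the negative log of the hard-assignment likelihood. A short sketch of that step, with F the K-means objective and $\sigma$ treated as a fixed constant:

```latex
-\log \prod_{i=1}^{N}
  \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2
  = \frac{1}{2\sigma^2} \, F(\mu, C)
```

Since $1/(2\sigma^2)$ is a positive constant, maximizing the likelihood over $\mu$ and C is the same problem as minimizing $F(\mu, C)$.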

38 The K-means GMM assumption. There are k components. Component i has an associated mean vector $\mu_i$. [Figure: component means $\mu_1$, $\mu_2$, $\mu_3$] (C) Dhruv Batra Slide Credit: Carlos Guestrin 38

39 The K-means GMM assumption. There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: [Figure: component means $\mu_1$, $\mu_2$, $\mu_3$] (C) Dhruv Batra Slide Credit: Carlos Guestrin 39

40 The K-means GMM assumption. There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). [Figure: chosen component mean $\mu_2$] (C) Dhruv Batra Slide Credit: Carlos Guestrin 40

41 The K-means GMM assumption. There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). 2. Datapoint ~ $N(\mu_i, \sigma^2 I)$. [Figure: component mean $\mu_2$ and sampled point x] (C) Dhruv Batra Slide Credit: Carlos Guestrin 41

42 The General GMM assumption. There are k components. Component i has an associated mean vector $\mu_i$. Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). 2. Datapoint ~ $N(\mu_i, \Sigma_i)$. [Figure: component means $\mu_1$, $\mu_2$, $\mu_3$] (C) Dhruv Batra Slide Credit: Carlos Guestrin 42
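The two-step generative recipe can be sketched directly with NumPy's random generator. The function name, the three components, and their weights, means, and covariances below are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n points from a Gaussian mixture using the two-step recipe."""
    rng = np.random.default_rng(seed)
    # Step 1: pick a component for each point with probability P(y = i).
    ys = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw the point from that component's Gaussian N(mu_i, Sigma_i).
    xs = np.stack([rng.multivariate_normal(means[y], covs[y]) for y in ys])
    return xs, ys

weights = np.array([0.5, 0.3, 0.2])                       # P(y = i), sums to 1
means = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
covs = np.array([[[0.01, 0.0], [0.0, 0.01]],
                 [[0.02, 0.01], [0.01, 0.02]],
                 [[0.03, 0.0], [0.0, 0.01]]])
xs, ys = sample_gmm(200, weights, means, covs)
```

Setting every covariance to the same $\sigma^2 I$ recovers the restricted "K-means GMM" case from the earlier slides.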
