CS325 Artificial Intelligence Ch. 20 Unsupervised Machine Learning

Rosalyn Small
5 years ago

1 CS325 Artificial Intelligence Cengiz Spring 2013

5 Unsupervised Learning
Missing teacher: no labels, y; just input data, x.
What can you learn with it?
1. Simplifying data (e.g., dimensionality reduction)
2. Organizing data (e.g., clustering)
Works by finding structure in the data and exploiting redundancies.
Entry survey: Unsupervised Learning (0.5 points of final grade). What is it good for in real life? Where would you use it?

12 The Google PageRank Algorithm
Why is it called Google PageRank?
Assigns importance to each page based on its incoming links.
Before PageRank:
- Manually made online directories (e.g., Yahoo!)
- Bag of words, maximum likelihood
PageRank improves on the bag-of-words model:
- Iterative algorithm that models a surfer randomly clicking away
- Each page's contribution to a target page is its own PageRank divided by its number of outgoing links.
Example: a world with four pages A, B, C, and D. Initialize PR(x) = 0.25 for every page.
- If B, C, and D each link only to A, then PR(A) = PR(B) + PR(C) + PR(D) = 0.75.
- If instead B links to C and A, C links only to A, and D links to all three other pages, then PR(A) = PR(B)/2 + PR(C) + PR(D)/3 ≈ 0.458.
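The second four-page example can be checked with a minimal sketch of one update of the simplified (undamped) PageRank rule. The function name `pagerank_step` and the link table are illustrative assumptions, not part of the slides. The table also assumes that C links only to A, which is what the PR(C) term in the formula implies, and that A has no outgoing links.

```python
def pagerank_step(links, ranks):
    """One update of simplified PageRank: each page splits its rank
    equally among the pages it links to (no damping factor)."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        if not targets:
            continue  # assumption: pages with no outgoing links pass nothing on
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical link table for the four-page world.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = {page: 0.25 for page in links}  # Init PR(x) = 0.25

new_ranks = pagerank_step(links, ranks)
print(round(new_ranks["A"], 3))  # PR(B)/2 + PR(C) + PR(D)/3
```

Calling `pagerank_step` repeatedly on its own output gives the iterative version. Practical PageRank also adds a damping factor, which this sketch leaves out.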

14 Other Unsupervised Learning Examples
Dimensionality reduction:
1. Principal/independent component analysis (PCA/ICA)
2. Factor analysis
3. Google PageRank
Clustering:
1. Blind source separation
2. k-means clustering
3. Competitive learning
4. Expectation maximization (EM)
5. Self-organizing maps (SOM)

16 k-means Clustering
Algorithm:
1. Randomly place k cluster centers
2. Assign each point to its closest center
3. Move each center to the center of gravity of the points assigned to it
4. Go back to step 2 until the assignments no longer change
Problems:
- Choosing the appropriate k
- Local minima
- High dimensionality
- Not mathematical
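The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not the course's reference code: the function name `kmeans`, the choice to draw the initial centers from the data points, the squared Euclidean distance, and the toy data are all assumptions.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: random initial centers (assumption: sampled from the data).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step 4: stop when nothing changed.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(sorted(centers))
```

On data this well separated the result does not depend on the random start. On harder data, different seeds can end in different local minima, which is one of the problems listed above.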

18 Improving k-means with Gaussians
Gaussian or normal distribution function:
$$N(\mu, \sigma) = P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$
Mean and variance parameters can be approximated from data:
$$\mu = \frac{1}{M}\sum_i x_i, \qquad \sigma^2 = \frac{1}{M}\sum_i (x_i - \mu)^2$$
Watch Dr. Thrun use Maximum Likelihood to derive these!
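As a small sketch of the two estimates, the code below computes the maximum-likelihood mean and variance of a sample and evaluates the resulting density. The function names and the sample values are made up for illustration.

```python
import math


def fit_gaussian(xs):
    """Maximum-likelihood mean and variance (divides by M, not M - 1)."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return mu, var


def gaussian_pdf(x, mu, var):
    """Density of a 1D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample
mu, var = fit_gaussian(data)
print(mu, var)                    # mean 5.0, variance 4.0
print(gaussian_pdf(mu, mu, var))  # density at the mean
```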

21 Multi-variate Gaussians
What would a 2D Gaussian look like?
$$N(\mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),$$
where
$$\Sigma = \frac{1}{M}\sum_i (x_i-\mu)(x_i-\mu)^T,$$
and d is the number of dimensions.
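For d = 2 the density can be evaluated without a linear algebra library, because a 2x2 covariance matrix has a closed-form inverse and determinant. The sketch below is illustrative only; the function name `gaussian_2d_pdf` and the example parameters are assumptions.

```python
import math


def gaussian_2d_pdf(x, mu, cov):
    """Density of a 2D Gaussian; cov is ((a, b), (c, d)), symmetric, positive definite."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    # (2 pi)^{-d/2} |Sigma|^{-1/2} with d = 2.
    norm = 1.0 / (2 * math.pi * math.sqrt(det))
    return norm * math.exp(-0.5 * quad)


identity = ((1.0, 0.0), (0.0, 1.0))
peak = gaussian_2d_pdf((0.0, 0.0), (0.0, 0.0), identity)
off_peak = gaussian_2d_pdf((1.0, 0.0), (0.0, 0.0), identity)
print(peak, off_peak)
```

With the identity covariance the contours are circles. Off-diagonal entries in `cov` tilt them into rotated ellipses.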

22 Fitting Multi-variate Gaussians

25 Using Gaussians for Clusters
Assume points belong to clusters with multi-variate Gaussian distributions.
We could use Maximum Likelihood, but we don't know the Gaussian parameters (mean and variance). It's a chicken-and-egg problem!
Solution: pretend we have centers. Choose them randomly, as in k-means, and then run Expectation Maximization.

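28 Expectation Maximization
Expectation Maximization (EM): two-step iterative algorithm
1. Expectation step: for all i, j, calculate the probability that x_j belongs to cluster i:
$$p_{ij} = P(C = i \mid x_j) = \alpha\, P(x_j \mid C = i)\, P(C = i)$$
2. Maximization step: recalculate the parameters:
$$\mu_i = \sum_j p_{ij}\, x_j / n_i, \qquad \Sigma_i = \sum_j p_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^T / n_i$$
Here α is a normalization constant and $n_i = \sum_j p_{ij}$ is the effective number of points in cluster i.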
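A one-dimensional version of the two steps can be sketched as below. It is a hedged illustration rather than the lecture's implementation. The function names, the fixed iteration count, the small variance floor, and the toy data are assumptions. The cluster priors P(C = i) are also re-estimated as n_i / N, which the slide does not show.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, mus, variances, weights, n_iter=50):
    """EM for a 1D mixture of Gaussians; returns updated copies of the parameters."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: p[j][i] = P(C = i | x_j), normalized over clusters (the alpha factor).
        p = []
        for x in xs:
            joint = [weights[i] * normal_pdf(x, mus[i], variances[i]) for i in range(k)]
            total = sum(joint)
            p.append([value / total for value in joint])
        # M-step: recompute parameters from the soft assignments.
        for i in range(k):
            n_i = sum(row[i] for row in p)
            mus[i] = sum(row[i] * x for row, x in zip(p, xs)) / n_i
            var_i = sum(row[i] * (x - mus[i]) ** 2 for row, x in zip(p, xs)) / n_i
            variances[i] = max(var_i, 1e-6)  # assumption: floor to avoid collapse
            weights[i] = n_i / len(xs)       # assumption: prior updated as n_i / N
    return mus, variances, weights


xs = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]  # two obvious groups
mus, variances, weights = em_gmm_1d(xs, mus=[0.0, 6.0], variances=[1.0, 1.0],
                                    weights=[0.5, 0.5])
print(mus, weights)
```

Unlike k-means, every point contributes to every cluster's update, weighted by its probability of belonging to that cluster.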

31 We Find the Gaussian Clusters
[Figure: Gaussian clusters fitted to the data]
Unsupervised learning of Gaussians is also used in Radial Basis Function neural networks.

33 Can Also Use Gaussians for Density Estimation
[Figure: estimated density]

36 Summary for Expectation Maximization
Expectation Maximization:
- All points belong to all centers, each with some probability
- Better solution
- Less susceptible to local minima
What else can Expectation Maximization do?
- Not limited to learning Gaussians
- Find hidden variables in a Bayes net if we cannot count them as in the spam example
- Find hidden (latent) variables in other algorithms, like Hidden Markov Models
- Learning structures of problems with unknowns (e.g., Bayes nets)

37 Dimensionality Reduction
Number of dimensions needed to represent these data?
[Figure: scatter plot with axes x1 and x2]

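41 Linear Dimensionality Reduction
[Figure: data with original axes x1, x2 and rotated axes x1', x2']
How to do this:
1. Find the Gaussian parameters of the data (mean and covariance matrix)
2. Find the eigenvectors and eigenvalues of the covariance matrix
3. Choose the eigenvectors with the largest eigenvalues
4. Project the data onto the space spanned by the selected eigenvectors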
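For two-dimensional data the four steps can be sketched with the closed-form eigen-decomposition of a symmetric 2x2 matrix, so no linear algebra library is needed. The function name `pca_2d` and the toy points are assumptions for illustration. Higher-dimensional data would normally use a numerical eigen-solver instead.

```python
import math


def pca_2d(points):
    """Project 2D points onto their first principal component."""
    m = len(points)
    # Step 1: Gaussian parameters (mean and covariance).
    mean = (sum(p[0] for p in points) / m, sum(p[1] for p in points) / m)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    sxx = sum(x * x for x, _ in centered) / m
    syy = sum(y * y for _, y in centered) / m
    sxy = sum(x * y for x, y in centered) / m
    # Step 2: eigenvalues of [[sxx, sxy], [sxy, syy]] in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Step 3: eigenvector belonging to the largest eigenvalue.
    if sxy != 0:
        v = (lam1 - syy, sxy)
    elif sxx >= syy:
        v = (1.0, 0.0)
    else:
        v = (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v = (v[0] / norm, v[1] / norm)
    # Step 4: project the centered data onto that eigenvector.
    projected = [x * v[0] + y * v[1] for x, y in centered]
    return (lam1, lam2), v, projected


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]  # all on the line x2 = x1
(lam1, lam2), v, projected = pca_2d(points)
print(lam1, lam2, v, projected)
```

Because these points lie exactly on a line, the second eigenvalue is zero, so one coordinate along the direction `v` represents the data without loss.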

42 Linear Dimensionality Reduction Example

46 Reducing from Large Dimensional Spaces: Eigenfaces
Face example = 2,500 pixels (dimensions)
Reduce to 12 eigenface dimensions

47 Reducing from Large Dimensional Spaces: Bodies
Body example: three dimensions are enough to distinguish height, size, and gender.
The trick is to use piecewise linear projections.
See locally linear embedding and Isomap for more info.

49 Clustering by Affinity
Would EM or k-means work well? No.

50 Spectral Clustering
Rank deficient matrix
Can use Principal Components Analysis to find orthogonal components
Example with clustering?

51 Competitive Learning: Neural Gas Neural gas: Growing Neural Gas Gesture recognition

55 Source Separation
Cocktail party problem
Blind source separation
Independent component analysis
Difference between PCA and ICA? PCA finds orthogonal components. ICA finds statistically independent components. Thus, ICA is better for signal separation.

57 Summary
- Learning without teachers is still useful
- Makes sense of hidden structure within data
- Many uses, like clustering, separation, dimension reduction, density estimation, ...
- Both iterative algorithms and mathematical solutions
- Make sense of natural data: faces, bodies
- Competitive learning can be used to find best adapted solutions: e.g., find the best on-screen keyboard for typing?
- Use unsupervised learning first to simplify data and then combine with supervised learning!
Exit survey: Unsupervised Learning. What changed in your understanding? Any new suggestions on where you would use it?
