Machine Learning Department, School of Computer Science, Carnegie Mellon University. K-Means + GMMs


1 Introduction to Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University. K-Means + GMMs. Clustering readings: Murphy 25.5; Bishop 12.1, 12.3; HTF; Mitchell. EM / GMM readings: Murphy; Bishop 9; HTF; Mitchell. Matt Gormley, Lecture 16, March 20.

2 Reminders Homework 5: Readings / Application of ML Release: Wed, Mar. 08 Due: Wed, Mar. 22 at 11:59pm 2

3 Peer Tutoring (figure: tutor / tutee pairing) 3

4 K-MEANS 5

5 K-Means Outline: Clustering: Motivation / Applications; Optimization Background: Coordinate Descent, Block Coordinate Descent; Clustering: Inputs and Outputs, Objective-based Clustering; K-Means: K-Means Objective, Computational Complexity, K-Means Algorithm / Lloyd's Method; K-Means Initialization: Random, Farthest Point, K-Means++ (Last Lecture / This Lecture) 6

6 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would we want to do this? Useful for: Automatically organizing data. Understanding hidden structure in data. Preprocessing for further analysis. Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes). Slide courtesy of Nina Balcan

7 Applications (Clustering comes up everywhere...) Cluster news articles or web pages or search results by topic. Cluster protein sequences by function or genes according to expression profile. Cluster users of social networks by interest (community detection). (Figures: Facebook network, Twitter network.) Slide courtesy of Nina Balcan

8 Applications (Clustering comes up everywhere...) Cluster customers according to purchase history. Cluster galaxies or nearby stars (e.g. Sloan Digital Sky Survey). And many, many more applications. Slide courtesy of Nina Balcan

9 Whiteboard: Optimization Background: Coordinate Descent, Block Coordinate Descent 10

10 Clustering Question: Which of these partitions is better? 11

11 Whiteboard: Clustering Inputs and Outputs, Objective-based Clustering 12

12 Whiteboard: K-Means: K-Means Objective, Computational Complexity, K-Means Algorithm / Lloyd's Method 13
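The whiteboard derivation of the K-Means objective is not transcribed here. As a companion to the algorithm named on this slide, below is a minimal numpy sketch of Lloyd's method, alternating the two steps the following slides animate (assign each point to its nearest center, then recompute each center as the mean of its cluster). The function and variable names are our own illustration, not code from the lecture.

import numpy as np

def lloyds_method(X, centers, n_iters=100):
    """Lloyd's method (minimal sketch).

    X       : (n, d) array of data points
    centers : (k, d) array of initial centers (see the initialization slides)
    """
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break  # converged: centers no longer change
        centers = new_centers
    return centers, assign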

13 Whiteboard: K-Means Initialization: Random, Furthest Traversal, K-Means++ 14

14 Lloyd's method: Random Initialization Slide courtesy of Nina Balcan

15 Lloyd's method: Random Initialization Example: Given a set of datapoints Slide courtesy of Nina Balcan

16 Lloyd's method: Random Initialization Select initial centers at random Slide courtesy of Nina Balcan

17 Lloyd's method: Random Initialization Assign each point to its nearest center Slide courtesy of Nina Balcan

18 Lloyd's method: Random Initialization Recompute optimal centers given a fixed clustering Slide courtesy of Nina Balcan

19 Lloyd's method: Random Initialization Assign each point to its nearest center Slide courtesy of Nina Balcan

20 Lloyd's method: Random Initialization Recompute optimal centers given a fixed clustering Slide courtesy of Nina Balcan

21 Lloyd's method: Random Initialization Assign each point to its nearest center Slide courtesy of Nina Balcan

22 Lloyd's method: Random Initialization Recompute optimal centers given a fixed clustering Get a good quality solution in this example. Slide courtesy of Nina Balcan

23 Lloyd's method: Performance It always converges, but it may converge at a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score. Slide courtesy of Nina Balcan

24 Lloyd's method: Performance Local optimum: every point is assigned to its nearest center and every center is the mean value of its points. Slide courtesy of Nina Balcan

25 Lloyd's method: Performance It can be arbitrarily worse than the optimal solution. Slide courtesy of Nina Balcan

26 Lloyd's method: Performance This bad performance can happen even with well-separated Gaussian clusters. Slide courtesy of Nina Balcan

27 Lloyd's method: Performance This bad performance can happen even with well-separated Gaussian clusters. Some Gaussians are combined. Slide courtesy of Nina Balcan

28 Lloyd's method: Performance If we do random initialization, as k increases, it becomes more likely we won't have perfectly picked one center per Gaussian in our initialization (so Lloyd's method will output a bad solution). For k equal-sized Gaussians, Pr[each initial center is in a different Gaussian] ≈ k!/k^k ≈ 1/e^k, which becomes unlikely as k gets large. Slide courtesy of Nina Balcan

29 Another Initialization Idea: Furthest Point Heuristic. Choose c_1 arbitrarily (or at random). For j = 2, ..., k: pick c_j among the datapoints x_1, x_2, ..., x_n to be the one farthest from the previously chosen c_1, c_2, ..., c_{j-1}. Fixes the Gaussian problem, but it can be thrown off by outliers. Slide courtesy of Nina Balcan
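A minimal sketch of the furthest-point heuristic just described, assuming numpy arrays; the function name and layout are our own illustration, not code from the lecture.

import numpy as np

def farthest_point_init(X, k, rng=None):
    """Choose c_1 at random, then each c_j as the point farthest from all chosen centers."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # An outlier with a large distance is always picked, hence the sensitivity to outliers.
        centers.append(X[d.argmax()])
    return np.stack(centers)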

30 Furthest point heuristic does well on previous example Slide courtesy of Nina Balcan

31 Furthest point initialization heuristic sensitive to outliers Assume k=3. Points: (0,1) (-2,0) (3,0) (0,-1) Slide courtesy of Nina Balcan

32 Furthest point initialization heuristic sensitive to outliers Assume k=3. Points: (0,1) (-2,0) (3,0) (0,-1) Slide courtesy of Nina Balcan

33 K-means++ Initialization: D² sampling [AV07]. Interpolate between random and furthest point initialization. Let D(x) be the distance between a point x and its nearest center. Choose the next center proportional to D²(x). Choose c_1 at random. For j = 2, ..., k: pick c_j among x_1, x_2, ..., x_n according to the distribution Pr(c_j = x_i) ∝ D²(x_i) = min_{j' < j} ||x_i − c_{j'}||². Theorem: K-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation. Running Lloyd's can only further improve the cost. Slide courtesy of Nina Balcan
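A minimal sketch of the D² sampling rule above, assuming numpy; names are illustrative, not taken from [AV07] or the lecture.

import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++: choose c_1 at random, then sample c_j with probability proportional to D^2(x)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D^2(x): squared distance from each point to its nearest already-chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.stack(centers)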

34 K-means++ Idea: D^α sampling. Interpolate between random and furthest point initialization. Let D(x) be the distance between a point x and its nearest center. Choose the next center proportional to D^α(x). α = 0: random sampling. α = ∞: furthest point (side note: it actually works well for k-center). α = 2: k-means++. Side note: α = 1 works well for k-median. Slide adapted from Nina Balcan

35 K-means++ Fix. Points: (0,1) (-2,0) (3,0) (0,-1) Slide courtesy of Nina Balcan

36 K-means++ / Lloyd's Running Time. K-means++ initialization: O(nd) and one pass over the data to select each next center, so O(nkd) time in total. Lloyd's method: repeat until there is no change in the cost. For each j: C_j ← {x ∈ S whose closest center is c_j}. For each j: c_j ← mean of C_j. Each round takes time O(nkd). Exponential # of rounds in the worst case [AV07]. Expected polynomial time in the smoothed analysis (non worst-case) model! Slide courtesy of Nina Balcan

37 K-means++ / Lloyd's Summary Running Lloyd's can only further improve the cost. Exponential # of rounds in the worst case [AV07]. Expected polynomial time in the smoothed analysis model! Does well in practice. K-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation. Slide courtesy of Nina Balcan

38 What value of k? Heuristic: Find a large gap between the (k−1)-means cost and the k-means cost. Hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task). Try hierarchical clustering. Slide courtesy of Nina Balcan
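One way to apply the "large gap" heuristic is to run k-means for a range of k and look at the successive drops in cost. A sketch of this, assuming the lloyds_method and kmeans_pp_init functions from the earlier sketches are in scope; the helper names are our own.

import numpy as np

def kmeans_cost(X, centers, assign):
    """Sum of squared distances from each point to its assigned center."""
    return float(((X - centers[assign]) ** 2).sum())

def cost_gaps(X, k_max, rng=None):
    """Return the change in k-means cost when moving from (k-1) to k clusters."""
    costs = []
    for k in range(1, k_max + 1):
        centers = kmeans_pp_init(X, k, rng)           # from the k-means++ sketch above
        centers, assign = lloyds_method(X, centers)   # from the Lloyd's method sketch above
        costs.append(kmeans_cost(X, centers, assign))
    # A large drop (big negative difference) between consecutive costs suggests a natural k.
    return np.diff(costs)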

39 EM AND GMMS 43

40 Expectation-Maximization Outline: Background: Multivariate Gaussian Distribution, Marginal Probabilities; Building up to GMMs: Distinction #1: Model vs. Objective Function, Gaussian Naïve Bayes (GNB), Gaussian Discriminant Analysis, Gaussian Mixture Model (GMM); Expectation-Maximization: Distinction #2: Data Likelihood vs. Marginal Data Likelihood, Distinction #3: Latent Variables vs. Parameters, Objective Functions for EM, Hard (Viterbi) EM Algorithm, Example: K-Means as Hard EM, Soft (Standard) EM Algorithm, Example: Soft EM for GMM; Properties of EM: Nonconvexity / Local Optimization, Example: Grammar Induction, Variants of EM 44

41 Whiteboard Background Multivariate Gaussian Distribution Marginal Probabilities 46

42 GAUSSIAN MIXTURE MODEL (GMM) 47

43 Whiteboard Building up to GMMs Distinction #1: Model vs. Objective Function Gaussian Naïve Bayes (GNB) Gaussian Discriminant Analysis Gaussian Mixture Model (GMM) 48

44 Gaussian Discriminant Analysis. Data: D = {(x^(i), z^(i))}_{i=1}^N where x^(i) ∈ R^M and z^(i) ∈ {1, ..., K}. Generative Story: z ~ Categorical(φ); x ~ Gaussian(μ_z, Σ_z). Model (Joint): p(x, z; φ, μ, Σ) = p(x | z; μ, Σ) p(z; φ). Log-likelihood: ℓ(φ, μ, Σ) = Σ_{i=1}^N log p(x^(i), z^(i); φ, μ, Σ) = Σ_{i=1}^N [log p(x^(i) | z^(i); μ, Σ) + log p(z^(i); φ)]. 49

45 Gaussian Discriminant Analysis. Data: D = {(x^(i), z^(i))}_{i=1}^N where x^(i) ∈ R^M and z^(i) ∈ {1, ..., K}. Log-likelihood: ℓ(φ, μ, Σ) = Σ_{i=1}^N [log p(x^(i) | z^(i); μ, Σ) + log p(z^(i); φ)]. Maximum Likelihood Estimates (take the derivative of the Lagrangian, set it equal to zero, and solve): φ_k = (1/N) Σ_{i=1}^N I(z^(i) = k), ∀k; μ_k = [Σ_{i=1}^N I(z^(i) = k) x^(i)] / [Σ_{i=1}^N I(z^(i) = k)], ∀k; Σ_k = [Σ_{i=1}^N I(z^(i) = k) (x^(i) − μ_k)(x^(i) − μ_k)^T] / [Σ_{i=1}^N I(z^(i) = k)], ∀k. Implementation: just counting. 50
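The "just counting" remark can be made concrete: each estimate is a per-class count, mean, or scatter matrix over the labeled examples. A minimal sketch with illustrative names, assuming numpy arrays and that every class appears at least once in the data.

import numpy as np

def gda_mle(X, z, K):
    """MLE for Gaussian Discriminant Analysis from fully labeled data (x^(i), z^(i))."""
    N, M = X.shape
    phi = np.zeros(K)
    mu = np.zeros((K, M))
    Sigma = np.zeros((K, M, M))
    for k in range(K):
        Xk = X[z == k]
        phi[k] = len(Xk) / N               # fraction of examples with z = k
        mu[k] = Xk.mean(axis=0)            # per-class mean
        diff = Xk - mu[k]
        Sigma[k] = diff.T @ diff / len(Xk)  # per-class covariance
    return phi, mu, Sigma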

46 Gaussian Mixture-Model. Assume we are given data, D, consisting of N fully unsupervised examples in M dimensions. Data: D = {x^(i)}_{i=1}^N where x^(i) ∈ R^M. Generative Story: z ~ Categorical(φ); x ~ Gaussian(μ_z, Σ_z). Model (Joint): p(x, z; φ, μ, Σ) = p(x | z; μ, Σ) p(z; φ). Marginal: p(x; φ, μ, Σ) = Σ_{z=1}^K p(x | z; μ, Σ) p(z; φ). (Marginal) Log-likelihood: ℓ(φ, μ, Σ) = Σ_{i=1}^N log p(x^(i); φ, μ, Σ) = Σ_{i=1}^N log Σ_{z=1}^K p(x^(i) | z; μ, Σ) p(z; φ). 51

47 Mixture-Model. Assume we are given data, D, consisting of N fully unsupervised examples in M dimensions. Data: D = {x^(i)}_{i=1}^N where x^(i) ∈ R^M. Generative Story: z ~ Categorical(φ); x ~ p_θ(· | z). Model (Joint): p_{θ,φ}(x, z) = p_θ(x | z) p_φ(z). Marginal: p_{θ,φ}(x) = Σ_{z=1}^K p_θ(x | z) p_φ(z). (Marginal) Log-likelihood: ℓ(θ, φ) = Σ_{i=1}^N log p_{θ,φ}(x^(i)) = Σ_{i=1}^N log Σ_{z=1}^K p_θ(x^(i) | z) p_φ(z). 52

48 Mixture-Model. Assume we are given data, D, consisting of N fully unsupervised examples in M dimensions. Data: D = {x^(i)}_{i=1}^N where x^(i) ∈ R^M. Generative Story: z ~ Categorical(φ); x ~ p_θ(· | z). (Note: p_θ(· | z) could be any arbitrary distribution parameterized by θ; today we're thinking about the case where it is a Multivariate Gaussian.) Model (Joint): p_{θ,φ}(x, z) = p_θ(x | z) p_φ(z). Marginal: p_{θ,φ}(x) = Σ_{z=1}^K p_θ(x | z) p_φ(z). (Marginal) Log-likelihood: ℓ(θ, φ) = Σ_{i=1}^N log p_{θ,φ}(x^(i)) = Σ_{i=1}^N log Σ_{z=1}^K p_θ(x^(i) | z) p_φ(z). 53
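For the Gaussian case, the marginal log-likelihood above can be evaluated directly by summing the component densities weighted by the mixture proportions. A minimal sketch using scipy's multivariate normal density; a log-sum-exp formulation would be more numerically stable, and the names are our own illustration.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_marginal_log_likelihood(X, phi, mu, Sigma):
    """Sum over i of log [ sum over z of p(x^(i) | z; mu, Sigma) p(z; phi) ]."""
    K = len(phi)
    # comp[i, k] = p(z = k; phi) * p(x^(i) | z = k; mu, Sigma)
    comp = np.stack([phi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(K)], axis=1)
    return float(np.log(comp.sum(axis=1)).sum())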

49 Learning a Mixture Model. Supervised Learning: the parameters decouple! D = {(x^(i), z^(i))}_{i=1}^N. ℓ(θ, φ) = Σ_{i=1}^N log p_θ(x^(i) | z^(i)) p_φ(z^(i)) = Σ_{i=1}^N log p_θ(x^(i) | z^(i)) + Σ_{i=1}^N log p_φ(z^(i)). Unsupervised Learning: the parameters are coupled by marginalization. D = {x^(i)}_{i=1}^N. ℓ(θ, φ) = Σ_{i=1}^N log Σ_{z=1}^K p_θ(x^(i) | z) p_φ(z). (Graphical model figures in both cases: Z → X_1, X_2, ..., X_M.) 54

50 Learning a Mixture Model. Supervised Learning: the parameters decouple! D = {(x^(i), z^(i))}_{i=1}^N. θ*, φ* = argmax_{θ,φ} Σ_{i=1}^N log p_θ(x^(i) | z^(i)) p_φ(z^(i)), which splits into θ* = argmax_θ Σ_{i=1}^N log p_θ(x^(i) | z^(i)) and φ* = argmax_φ Σ_{i=1}^N log p_φ(z^(i)). Unsupervised Learning: the parameters are coupled by marginalization. D = {x^(i)}_{i=1}^N. θ*, φ* = argmax_{θ,φ} Σ_{i=1}^N log Σ_{z=1}^K p_θ(x^(i) | z) p_φ(z). Training certainly isn't as simple as the supervised case. In many cases, we could still use some black-box optimization method (e.g. Newton-Raphson) to solve this coupled optimization problem. This lecture is about a more problem-specific method: EM. 55
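As a preview of the problem-specific method this slide refers to, here is a minimal sketch of (soft) EM for a Gaussian mixture: the E-step computes posteriors over z under the current parameters, and the M-step reuses the supervised counting estimates with those posteriors as soft weights. This is our own illustration of the standard updates, not code from the lecture.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, phi, mu, Sigma, n_iters=50):
    """Soft EM for a Gaussian mixture (minimal sketch; no convergence check)."""
    N, M = X.shape
    K = len(phi)
    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] = p(z = k | x^(i)) under current parameters.
        comp = np.stack([phi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        r = comp / comp.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the fully supervised MLE formulas.
        Nk = r.sum(axis=0)
        phi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        Sigma = np.stack([
            ((r[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / Nk[k]
            for k in range(K)
        ])
    return phi, mu, Sigma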

51 EXAMPLE: K-MEANS VS GMM 56

52–59 Example: K-Means (figure-only slides; no text content transcribed) 57–64

60–81 Example: GMM (figure-only slides; no text content transcribed) 65–86

82 K-Means vs. GMM Convergence: K-Means tends to converge much faster than a GMM. Speed: Each iteration of K-Means is computationally less intensive than each iteration of a GMM. Initialization: To initialize a GMM, we typically first run K-Means and use the resulting cluster centers as the means of the Gaussian components. Output: A GMM yields a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment. 87
