Clustering: Introduction to Data Science, University of Colorado Boulder. Slides adapted from Lauren Hannah.

1 Clustering: Introduction to Data Science, University of Colorado Boulder. Slides adapted from Lauren Hannah.

2 Clustering Lab
  - Review of k-means
  - Work through a k-means example
  - Connection to GMM

3 k-means

     1: procedure KMeans(X, M)
     2:     s ← ∞
     3:     Z ← AssignToClosestCluster(X, M)
     4:     while s > Score(X, Z, M) do              ▷ iterate until the score stops changing
     5:         s ← Score(X, Z, M)                   ▷ compute the score of the old configuration
     6:         Z ← AssignToClosestCluster(X, M)
     7:         for k ∈ {1, …, K} do                 ▷ for each cluster mean
     8:             v ← 0, µ_k ← 0
     9:             for i ∈ {1, …, N} do             ▷ for each observation
    10:                 if z_i = k then              ▷ if the observation is assigned to cluster k
    11:                     µ_k ← µ_k + x_i          ▷ add the observation to the running sum
    12:                     v ← v + 1                ▷ count points in cluster k
    13:             µ_k ← µ_k / v                    ▷ divide by the number of observations
    14:     return Z

4 Score

     1: procedure Score(X, Z, M)                     ▷ current objective function
     2:     s ← 0
     3:     for i ∈ {1, …, N} do                     ▷ for each observation
     4:         s ← s + ‖x_i − µ_{z_i}‖              ▷ accumulate its distance from its assigned mean
     5:     return s

5 Find closest

     1: procedure AssignToClosestCluster(X, M)
     2:     Z ← Vector(N)                            ▷ initialize assignments Z as an N-vector
     3:     for i ∈ {1, …, N} do                     ▷ for each observation
     4:         d ← ∞
     5:         for k ∈ {1, …, K} do
     6:             if ‖x_i − µ_k‖ < d then
     7:                 z_i ← k
     8:                 d ← ‖x_i − µ_k‖
     9:     return Z
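The three procedures above translate nearly line for line into NumPy. Here is a minimal sketch (the function names, the empty-cluster guard, and the demo at the end are mine, not the slides'), run on the six points and two initial means from the Two Points example that follows:

    import numpy as np

    def assign_to_closest_cluster(X, M):
        # Distance from every observation to every mean, shape (N, K).
        # np.argmin returns the FIRST index attaining the minimum, which
        # matches the strict-< tie-breaking in the pseudocode above.
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        return np.argmin(dists, axis=1)

    def score(X, Z, M):
        # Current objective: summed distance of each point from its assigned mean.
        return np.linalg.norm(X - M[Z], axis=1).sum()

    def kmeans(X, M):
        M = np.array(M, dtype=float)
        s = np.inf
        Z = assign_to_closest_cluster(X, M)
        while s > score(X, Z, M):          # iterate until the score stops changing
            s = score(X, Z, M)             # score of the old configuration
            Z = assign_to_closest_cluster(X, M)
            for k in range(len(M)):        # recompute each cluster mean
                members = X[Z == k]
                if len(members):           # guard: leave an empty cluster's mean alone
                    M[k] = members.mean(axis=0)
        return Z, M

    # Demo on the Two Points data from the next slides:
    X = np.array([(-3, 3), (3, -3), (4, -3), (4, -4), (-4, 3), (-4, 4)], dtype=float)
    Z, M = kmeans(X, [(-3, 3), (-4, 3)])
    print(Z)  # [1 0 0 0 1 1]
    print(M)  # approx. [[ 3.67 -3.33], [-3.67  3.33]]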

6 Two Points. [Figure: the data points plotted in the x-y plane.]

7 Two Points. [Figure: the data with initial means A = (−3.00, 3.00) and B = (−4.00, 3.00).]

8–10 Two Points: updating the means

    µ_A = ((−3, 3) + (3, −3) + (4, −3) + (4, −4)) / 4 = (2, −1.75)
    µ_B = ((−4, 3) + (−4, 4)) / 2 = (−4, 3.5)
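The update is easy to check with a few lines of plain Python (cluster memberships as on the slide):

    A = [(-3, 3), (3, -3), (4, -3), (4, -4)]  # points currently assigned to A
    B = [(-4, 3), (-4, 4)]                    # points currently assigned to B
    mu_A = tuple(sum(c) / len(A) for c in zip(*A))
    mu_B = tuple(sum(c) / len(B) for c in zip(*B))
    print(mu_A, mu_B)  # (2.0, -1.75) (-4.0, 3.5)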

11 Two Points. [Figure: updated means A = (2.00, −1.75) and B = (−4.00, 3.50).]

12–14 Two Points: second update

    µ_A = ((3, −3) + (4, −3) + (4, −4)) / 3 = (3.67, −3.33)
    µ_B = ((−4, 3) + (−4, 4) + (−3, 3)) / 3 = (−3.67, 3.33)

15 Two Points. [Figure: final means A = (3.67, −3.33) and B = (−3.67, 3.33).]

16 Four Points. [Figure: the data with four initial means.]

    µ_A = (−3, 3), µ_B = (−4, 3), µ_C = (3, −3), µ_D = (4, −3)

17–21 Four Points
The observation at (3, 3) is the same distance from µ_A and µ_C. Because the comparison on Line 6 of AssignToClosestCluster is a strict <, the first mean with the smallest distance gets the assignment, so (3, 3) is assigned to cluster A. The updated means:

    µ_A = (−1, 1)
    µ_B = (−4, 0)
    µ_C = (3, −3)
    µ_D = (4, 0)
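The tie, and the first-wins rule, are easy to verify numerically; np.argmin breaks ties the same way by returning the first index that attains the minimum:

    import numpy as np

    x = np.array([3.0, 3.0])
    means = np.array([(-3, 3), (-4, 3), (3, -3), (4, -3)], dtype=float)  # initial µ_A..µ_D
    d = np.linalg.norm(means - x, axis=1)
    print(d)             # [6.   7.   6.   6.08]: µ_A and µ_C tie at distance 6
    print(np.argmin(d))  # 0, so (3, 3) goes to cluster A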

22 Four Points. [Figure: initial means A = (−3.00, 3.00), B = (−4.00, 3.00), C = (3.00, −3.00), D = (4.00, −3.00).]

23 Four Points. [Figure: means after the first update: A = (−1.00, 1.00), B = (−4.00, 0.00), C = (3.00, −3.00), D = (4.00, 0.00).]

24–28 Four Points: second update

    µ_A = (−3, 3)
    µ_B = (−3.8, −0.6)
    µ_C = (3.67, −3.33)
    µ_D = (3.67, 3.33)

29 Four Points. [Figure: A = (−3.00, 3.00), B = (−3.80, −0.60), C = (3.67, −3.33), D = (3.67, 3.33).]

30–34 Four Points: third update

    µ_A = (−3.67, 3.33)
    µ_B = (−3.67, −3.33)
    µ_C = (3.67, −3.33)
    µ_D = (3.67, 3.33)

35 Four Points. [Figure: converged means A = (−3.67, 3.33), B = (−3.67, −3.33), C = (3.67, −3.33), D = (3.67, 3.33).]
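Working backwards from the means above, the Four Points data set appears to be the twelve mirrored points (±3, ±3), (±4, ±3), (±4, ±4); that is an inference, not something the slides state. Feeding them to the kmeans sketch from earlier, with the four initial means, reproduces every update on these slides, tie-break included:

    import numpy as np

    # All sign combinations of (3,3), (4,3), (4,4): twelve points in total.
    X = np.array([(sx * a, sy * b)
                  for (a, b) in [(3, 3), (4, 3), (4, 4)]
                  for sx in (1, -1) for sy in (1, -1)], dtype=float)
    M0 = [(-3, 3), (-4, 3), (3, -3), (4, -3)]  # initial µ_A, µ_B, µ_C, µ_D
    Z, M = kmeans(X, M0)                       # kmeans as defined in the earlier sketch
    print(M)  # approx. (-3.67, 3.33), (-3.67, -3.33), (3.67, -3.33), (3.67, 3.33)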

36 Bad Initialization. [Figure: the data with a poorly chosen set of initial means.]

37 Bad Initialization. [Figure: intermediate means A = (−0.50, −2.50), B = (−0.50, −4.25), C = (0.25, 4.25), D = (0.25, 2.50).]

38 Bad Initialization. [Figure: converged means A = (0.00, −3.00), B = (0.00, −4.00), C = (0.00, 4.00), D = (0.00, 3.00).] k-means has converged, but to a poor local optimum.
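The usual defense against a bad start like this is restarting: run k-means from several random initializations and keep the run with the lowest score. A sketch reusing the kmeans and score functions defined earlier (kmeans_restarts is my name for it, not the slides'):

    import numpy as np

    def kmeans_restarts(X, K, n_restarts=10, seed=0):
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(n_restarts):
            # Draw K distinct data points to serve as the initial means.
            M0 = X[rng.choice(len(X), size=K, replace=False)]
            Z, M = kmeans(X, M0)
            s = score(X, Z, M)
            if best is None or s < best[0]:   # keep the lowest-score run
                best = (s, Z, M)
        return best[1], best[2]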

39 How does it change for GMM?

40 How does it change for GMM? Instead of just computing each cluster's mean, you also compute its variance.
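Concretely, EM for a Gaussian mixture fits a mean, a covariance, and a mixing weight per component, and replaces k-means' hard assignments with soft responsibilities. A sketch with scikit-learn's GaussianMixture on the Two Points data (the library choice is mine; the slide stops at the conceptual difference):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.array([(-3, 3), (3, -3), (4, -3), (4, -4), (-4, 3), (-4, 4)], dtype=float)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.means_)            # per-component means, as in k-means
    print(gmm.covariances_)      # per-component covariances: the new piece
    print(gmm.predict_proba(X))  # soft responsibilities instead of hard assignments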
