Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Size: px

Start display at page:

Download "Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling"

Lorena Parrish
5 years ago
Views:

1 Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 / 1

2 Outline 2. Mixture Models & EM Algorithm 2 / 1

3 Syllabus Wed (1) 0. Introduction A. Supervised Learning Wed (2) A.1 Linear Regression Wed (3) A.2 Linear Classification Wed (4) A.3 Regularization (Given by Martin) Wed (5) A.4 High-dimensional Data Wed (6) A.5 Nearest-Neighbor Models Wed (7) A.6 Support Vector Machines Wed (8) A.7 Decision Trees Wed (9) A.8 A First Look at Bayesian and Markov Networks Extra: Wed (E) Invited Talk: Recommender Systems in work at Volkswagen B. Unsupervised Learning Wed (10) B.1 Clustering Wed (11) B.2 Dimensionality Reduction Wed (12) B.3 Frequent Pattern Mining C. Reinforcement Learning Wed.??.??. (13) C.1 State Space Models Wed.??.??. (14) C.2 Markov Decision Processes 1 / 1

4 Outline 2. Mixture Models & EM Algorithm 1 / 1

5 Unsupervised Learning For supervised learning problems, we were always given some training data D train = {(x 1, y 1 ),..., (x N, y N )} x i X corresponds to a measurement (a data instance) y i Y is a label Then the goal was to find a model f : X Y with minimal training error and decent generalization ability. In unsupervised learning, there are no labels given!! 1 / 1

6 Cluster Analysis Assume we have a dataset with no further information given. D train = {x 1,..., x N } cluster analysis tries to find commonalities among all data instances to group them into K many groups. we have to find a partition of X. 2 / 1

7 Partitions Let X := {x 1,..., x N } be a finite set. A set P := {X 1,..., X K } of subsets X k X is called a partition of X if the subsets 1. are pairwise disjoint: X k X j =, k, j {1,..., K}, k j K 2. cover X : X k = X, and k=1 3. do not contain the empty set: X k, k {1,..., K}. The sets X k are also called clusters, a partition P a clustering. K N is called number of clusters. 3 / 1

8 Partitions Let X be a finite set. A surjective function is called a partition function of X. p : {1,..., X } {1,..., K} The sets X k := p 1 (k) form a partition P := {X 1,..., X K }. x i p(x i ) x 1 1 x 2 2 x 3 2 x 4 1 p 1 (1) = {x 1, x 4 } p 1 (2) = {x 2, x 3 } 4 / 1

9 Partitions Let X := {x 1,..., x N } be a finite set. A binary N K matrix P {0, 1} N K is called a partition matrix of X if it K 1. is row-stochastic: P i,k = 1, i {1,..., N} 2. does not contain a zero column: X i,k (0,..., 0) T, k {1,..., K} The sets X k := {x i P i,k = 1} form a partition P := {X 1,..., X K }. P.,k is called membership vector of class k. k=1 5 / 1

10 Partitions For the example given through: the partition matrix would look like: x i p(x i ) x 1 1 x 2 2 x 3 2 x P = / 1

11 The Cluster Analysis Problem Given a set X called data space, e.g., X := R m, a set X X called data, and a function D : X X Part(X ) R + 0 called distortion measure where D(P) measures how bad a partition P Part(X ) for a data set X X is, find a partition P = {X 1, X 2,... X K } Part (X ) with minimal distortion D(P). 7 / 1

12 The Cluster Analysis Problem (given K) Given a set X called data space, e.g., X := R m, a set X X called data, a function D : X X Part(X ) R + 0 called distortion measure where D(P) measures how bad a partition P Part(X ) for a data set X X is, and a number K N of clusters, find a partition P = {X 1, X 2,... X K } Part K (X ) with K clusters with minimal distortion D(P). 7 / 1

13 Distortion Measures: Intuition Assume we have the following data and two cluster centers µ 1 and µ 2 : we would assign the left points to the red cluster, the right points to the blue cluster we want a distortion measure that encourages this behaviour 8 / 1

14 k-means: Distortion Sum of Distances to Cluster Centers Find a partition P such that the sum of squared distances to cluster centers in minimal: with D(P) := K k=1 n i=1: P i.k =1 x i µ k 2 µ k := mean {x i P i,k = 1, i = 1,..., n} 9 / 1

15 k-means: Distortion Sum of Distances to Cluster Centers Find a partition P such that the sum of squared distances to cluster centers in minimal: D(P) := n i=1 k=1 K P i,k x i µ k 2 = K k=1 n i=1: P i.k =1 x i µ k 2 with µ k := n i=1 P i,kx i n i=1 P i,k = mean {x i P i,k = 1, i = 1,..., n} 10 / 1

16 On the role of K Minimizing D over partitions with varying number of clusters (varying K) does not make sense a singleton clustering, where each point is its own cluster center and K = N has minimal D only minimizing with a given K makes sense Minimizing D is not easy as reassigning a point to a different cluster also shifts the cluster centers. 11 / 1

17 k-means: Minimizing Distances to Cluster Centers Add cluster centers µ as auxiliary optimization variables: D(P, µ) := n K P i,k x i µ k 2 i=1 k=1 12 / 1

18 k-means: Minimizing Distances to Cluster Centers Add cluster centers µ as auxiliary optimization variables: D(P, µ) := n K P i,k x i µ k 2 i=1 k=1 Block coordinate descent: 1. fix µ, optimize P reassign data points to clusters: P i,k := δ(k = l i ), l i := arg min x i µ k 2 k {1,...,K} 12 / 1

19 k-means: Minimizing Distances to Cluster Centers Add cluster centers µ as auxiliary optimization variables: Block coordinate descent: D(P, µ) := n i=1 k=1 K P i,k x i µ k 2 1. fix µ, optimize P reassign data points to clusters: P i,k := δ(k = l i ), l i := arg min x i µ k 2 k {1,...,K} 2. fix P, optimize µ recompute cluster centers: µ k := n i=1 P i,kx i n i=1 P i,k Iterate until partition is stable. 12 / 1

20 k-means: Initialization k-means is usually initialized by picking K data points as cluster centers at random: 1. pick the first cluster center µ 1 out of the data points at random and then 2. sequentially select the data point with the largest sum of distances to already chosen cluster centers as next cluster center k 1 µ k := x i, i := arg max i {1,...,n} x i µ l 2, l=1 k = 2,..., K 13 / 1

21 k-means: Initialization k-means is usually initialized by picking K data points as cluster centers at random: 1. pick the first cluster center µ 1 out of the data points at random and then 2. sequentially select the data point with the largest sum of distances to already chosen cluster centers as next cluster center k 1 µ k := x i, i := arg max i {1,...,n} x i µ l 2, l=1 k = 2,..., K Different initializations may lead to different local minima. run k-means with different random initializations and keep only the one with the smallest distortion (random restarts). 13 / 1

22 k-means Algorithm 1: procedure cluster-kmeans(d := {x 1,..., x N } R M, K N, ɛ R + ) 2: i 1 unif({1,..., N}), µ 1 := x i1 3: for k := 2,..., K do 4: i k := arg max k 1 n {1,...,N} l=1 xn µ l, µ i := x ik,. 5: repeat 6: µ old := µ 7: for n := 1,..., N do 8: P n := arg min k {1,...,K} x n µ k 9: for k := 1,..., K do 10: µ k := mean {x n P n = k} K k=1 µ k µ old k < ɛ 1 K 11: until 12: return P Note: In implementations, the two loops over the data (lines 6 and 9) can be combined in one loop. 14 / 1

23 Example x 1 x 2 x 3 x 4 x 5 x 6 15 / 1

24 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 x 2 x 3 x 4 = µ 1 x 5 x 6 15 / 1

25 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 15 / 1

26 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 15 / 1

27 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 15 / 1

28 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 15 / 1

29 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 x 2 = µ 2 x 3 x 4 x 5 = µ 1 x 6 15 / 1

30 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 x 2 = µ 2 x 3 x 4 x 5 = µ 1 x 6 x 1 x 2 = µ 2 x 3 x 4 x 5 = µ 1 x 6 15 / 1

31 Example x 1 x 2 x 3 x 4 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 x 1 = µ 2 x 2 x 3 x 4 = µ 1 x 5 x 6 d = 33 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 x 1 µ 2 x 2 x 3 x 4 µ 1 x 5 x 6 d = 23.7 x 1 x 2 = µ 2 x 3 x 4 x 5 = µ 1 x 6 x 1 x 2 = µ 2 x 3 x 4 x 5 = µ 1 x 6 d = / 1

32 K-medoids: K-means for General Distances One can generalize k-means to general distances d: D(P, µ) := n K P i,k d(x i, µ k ) i=1 k=1 16 / 1

33 K-medoids: K-means for General Distances One can generalize k-means to general distances d: D(P, µ) := n K P i,k d(x i, µ k ) i=1 k=1 step 1 assigning data points to clusters remains the same P i,k := arg min d(x i, µ k ) k {1,...,K} but step 2 finding the best cluster representatives µ k is not solved by the mean and may be difficult in general. 16 / 1

34 K-medoids: K-means for General Distances One can generalize k-means to general distances d: D(P, µ) := n i=1 k=1 K P i,k d(x i, µ k ) step 1 assigning data points to clusters remains the same P i,k := arg min d(x i, µ k ) k {1,...,K} but step 2 finding the best cluster representatives µ k is not solved by the mean and may be difficult in general. idea k-medoids: choose cluster representatives out of cluster data points: n µ k := x j, j := arg min j {1,...,n}:P j,k =1 P i,k d(x i, x j ) i=1 16 / 1

35 2. Mixture Models & EM Algorithm Outline 2. Mixture Models & EM Algorithm 17 / 1

36 2. Mixture Models & EM Algorithm Soft Partitions: Row Stochastic Matrices Let X := {x 1,..., x N } be a finite set. A N K matrix P [0, 1] N K is called a soft partition matrix of X if it 1. is row-stochastic: K P i,k = 1, i {1,..., N}, k {1,..., K} 2. does not contain a zero column: X i,k (0,..., 0) T, k {1,..., K}. k=1 P i,k is called the membership degree of instance i in class k or the cluster weight of instance i in cluster k. P.,k is called membership vector of class k. SoftPart(X ) denotes the set of all soft partitions of X. Note: Soft partitions are also called soft clusterings and fuzzy clusterings. 17 / 1

37 2. Mixture Models & EM Algorithm The Soft Clustering Problem Given a set X called data space, e.g., X := R m, a set X X called data, and a function D : X X SoftPart(X ) R + 0 called distortion measure where D(P) measures how bad a soft partition P SoftPart(X ) for a data set X X is, find a soft partition P SoftPart (X ) with minimal distortion D(P). 18 / 1

38 2. Mixture Models & EM Algorithm The Soft Clustering Problem (with given K) Given a set X called data space, e.g., X := R m, a set X X called data, a function D : X X SoftPart(X ) R + 0 called distortion measure where D(P) measures how bad a soft partition P SoftPart(X ) for a data set X X is, and a number K N of clusters, find a soft partition P SoftPart K (X ) [0, 1] X K with K clusters with minimal distortion D(P). 18 / 1

39 2. Mixture Models & EM Algorithm Mixture Models For our data, no clusters are given, but this does not mean that they do not exist, there is just no way for us to measure them. Mixture models assume that there exists an unobserved nominal variable Z with K levels, which is distributed according to: or for some probabilities π k with Z Cat(π) p(z = k) = π k K π k = 1 k=1 19 / 1

40 2. Mixture Models & EM Algorithm Mixture Models Mixture models then model the joint probability of X and Z: p(x, Z) = p(z)p(x Z) = = K k=1 π δ(z=k) k K p(x Z = k) δ(z=k) k=1 K (π k p(x Z = k)) δ(z=k) k=1 And the marginal probability of a given X is: p(x ) = K p(z = k)p(x Z = k) = k=1 All we need to specify is p(x Z)! K π k p(x Z = k) k=1 20 / 1

41 2. Mixture Models & EM Algorithm Gaussian Mixture Models Gaussian mixture models are mixture models where the probability of seeing an instance, given its cluster membership is a Gaussian: Or equivalently: p(x = x Z = k) = p(x i z i = k) = N (x i ; µ k, Σ k ) 1 (2π) m Σ k e 1 2 (x µ k) T Σ 1 k (x µ k ) for a mean µ k and Covariance Matrix Σ k 21 / 1

42 2. Mixture Models & EM Algorithm Mixture Models: Intuition Clearly, we see that the hidden variable Z has two outcomes! 22 / 1

43 2. Mixture Models & EM Algorithm Mixture Models: Intuition Data may come from a mixture of two Gaussians! 23 / 1

44 2. Mixture Models & EM Algorithm Mixture Models: Intuition Heightlines of a mixture of two Gaussians! 24 / 1

45 2. Mixture Models & EM Algorithm Maximum Likelihood Estimate? The complete data loglikelihood of the completed data (X, Z) then is n K l(θ; X, Z) := δ(z i = k)(ln π k + ln p(x = x i Z = k; θ k ) i=1 k=1 with Θ := (π 1,..., π K, θ 1,..., θ K ) θ k = (µ k, Σ k ) l cannot be computed because z i s are unobserved. We cannot learn this model by computing a maximum likelihood estimate! 25 / 1

46 2. Mixture Models & EM Algorithm Expected Complete Likelihood & EM Algorithm We calculate the expected value of the loglikelihood with respect to the conditional distribution of Z given X under the currently estimated θ t 1 : Q(Θ Θ t 1 ) = E Z X,Θ t 1[l(Θ; X, Z)] From old Θ t 1, we know the distribution of the Z, then compute the expectation value of l with respect to Z (Expectation Step) We derive a quantity Q, that we can then maximize by optimizing Θ (Maximization Step) From the new Θ, we can then update Z and repeat the process 26 / 1

47 2. Mixture Models & EM Algorithm Expected Complete Likelihood [ N ] Q(Θ Θ t 1 ) = E log p(x i, z i Θ) = = = = i=1 [ [ N K ]] E log π k p(x i θ k )) δ(z i =k) i=1 N i=1 k=1 N i=1 k=1 N k=1 K E[δ(z i = k)] log[π k p(x i θ k ] K p(z i = k x i, Θ t 1 ) log[π k p(x i θ k ] K r ik log π k + i=1 k=1 i=1 k=1 N K r ik log p(x i θ k ) 27 / 1

48 2. Mixture Models & EM Algorithm Expected Complete Likelihood (Expectation Step) [ N ] Q(Θ Θ t 1 ) = E log p(x i, z i Θ) i=1 =... N K = r ik log π k + i=1 k=1 N i=1 k=1 K r ik log p(x i θ k ) with π k p(x i θ k ) r ik = p(z = k x i, Θ) = k π k p(x i θ k ) which is called the responsibilities of a cluster k to an instance i Computing the r ik yields the probabilites for Z (Expectation Step) 28 / 1

49 2. Mixture Models & EM Algorithm Maximization Step (I) Q(Θ Θ t 1 ) needs to be maximized for all π k and for the parameters of the individual Gaussians θ k = (µ k, Σ k ). For π, Q is maximized by setting π k = 1 N i r ik k For µ k and Σ k, we only have to look at the second part of Q N K l(µ k, Σ k ) = r ik log p(x i Θ k ) i=1 k=1 = 1 r ik [log Σ k + (x i µ k ) Σ k (x i µ k )] 2 i 29 / 1

50 2. Mixture Models & EM Algorithm Maximization Step (II) l(µ k, Σ k ) is maximized for: µ k = n i=1 r i,kx i k i=1 r i,k And Σ k = = n i=1 r i,k(x i µ k ) (x i µ k ) n i=1 r i,k n i=1 r i,kxi T x i µ k µ k n i=1 r i,k 30 / 1

51 2. Mixture Models & EM Algorithm Gaussian Mixtures for Soft Clustering The responsibilities r [0, 1] N K are a soft partition. P := r The negative expected loglikelihood can be used as cluster distortion: D(P) := max Q(Θ, r) Θ To optimize D, we simply can run EM. 31 / 1

52 2. Mixture Models & EM Algorithm Gaussian Mixtures for Soft Clustering The responsibilities r [0, 1] N K are a soft partition. P := r The negative expected loglikelihood can be used as cluster distortion: D(P) := max Q(Θ, r) Θ To optimize D, we simply can run EM. For hard clustering: assign points to the cluster with highest responsibility (hard EM): r (t 1) i,k = δ(k = arg max r (t 1) k i,k ) (0b ) =1,...,K 31 / 1

53 2. Mixture Models & EM Algorithm Gaussian Mixtures for Soft Clustering / Example / 1

54 2. Mixture Models & EM Algorithm Gaussian Mixtures for Soft Clustering / Example 32 / 1

55 2. Mixture Models & EM Algorithm Gaussian Mixtures for Soft Clustering / Example 32 / 1

56 2. Mixture Models & EM Algorithm Model-based Cluster Analysis Different parametrizations of the covariance matrices Σ k restrict possible cluster shapes: full Σ: all sorts of ellipsoid clusters. diagonal Σ: ellipsoid clusters with axis-parallel axes unit Σ: spherical clusters. One also distinguishes cluster-specific Σ k : each cluster can have its own shape. shared Σ k = Σ: all clusters have the same shape. 33 / 1

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany