Lecture 3 January 22



EE 381V: Large Scale Learning, Spring 2013
Lecture 3, January 22
Lecturer: Caramanis & Sanghavi
Scribe: Subhashini Krishnasamy

3.1 Clustering

In the last lecture, we saw Locality Sensitive Hashing (LSH), which uses hash functions to map high dimensional data to points in a lower dimensional space. We will now study a closely related learning problem called clustering, which also involves dimensionality reduction. Examples of applications with high dimensional data include recommendation engines (grouping of consumers/consumer products), search engines (grouping of files/sites), and the Sloan Sky Survey Project, which catalogues millions of sky objects with the aim of constructing a 3-D map of the sky.

Learning algorithms can be classified based on the data available. While supervised learning uses labelled data (sample input and output points) to learn the mapping function, unsupervised learning methods like clustering try to find structure in unlabelled data. More specifically, the problem of clustering is to partition a set of $n$ data points $x_1, x_2, \dots, x_n \in \mathcal{X}$ into $k$ groups based on some notion of similarity. We will study clustering algorithms for two kinds of models.

1. Probabilistic Model: In this kind of model, it is assumed that each of the data points was independently generated from a mixture of $k$ distributions. The conditional densities are modelled with cluster dependent parameters, and the algorithms then try to solve the maximum likelihood problem:

$$X_i \sim p(x \mid c) = \sum_{a=1}^{k} q(x \mid c_a),$$

where $c_a$ are the cluster dependent parameters and any mixing weights are absorbed into $q$. The goal is to maximize $\prod_{i=1}^{n} p(x_i \mid c)$.

2. Optimization Model: Some algorithms consider the problem in an optimization framework where the goal is to maximize agreements/similarities within clusters. Given some pseudo-distance metric $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, partition the points such that the sum of distances to the cluster centers is minimized. The center of a group $\{x_i\}_{i \in S}$ is defined as

$$c(S) := \arg\min_{z} \Big\{ \sum_{i \in S} d(x_i, z) : z \in \mathcal{X} \Big\}.$$

The problem is to find a partition $S = \{S_1, S_2, \dots, S_k\}$ of the index set $[n]$ that minimizes

$$C(S) = \sum_{a=1}^{k} \sum_{i \in S_a} d(x_i, c(S_a)).$$

This optimization problem is NP-hard even for points on a 2-D plane, and so we consider approximate algorithms.

3.2 k-means Algorithm

k-means is an approximate algorithm that aims to solve the optimization problem above. The algorithm is given below.

Data: $x_1, x_2, \dots, x_n$
Result: $S_1, S_2, \dots, S_k$
Initialize partition $S_1, S_2, \dots, S_k$.
while $C(S)$ improves do
    1. Centers update: Compute the cluster centers of the partition, $c(S_a) := \arg\min_{z} \{ \sum_{i \in S_a} d(x_i, z) : z \in \mathcal{X} \}$.
    2. Cluster assignment: Compute the partition induced by the centers, $S_a \leftarrow \{ i \in [n] : a = \arg\min_{b \in [k]} d(x_i, c_b) \}$, where $c_b = c(S_b)$.
    Compute the objective $C(S)$.
end
Algorithm 1: k-means

Computing the partitions is straightforward; it is the update of the cluster centers that drives the computational cost. The k-means algorithm is popular in applications with a distance metric that allows easily computable cluster centers. Some examples are given below.

1. Euclidean distance: If $\mathcal{X} = \mathbb{R}^m$ and $d(x, y) = \|x - y\|_2^2 = \sum_{i=1}^{m} (x_i - y_i)^2$, then $c(S) = \frac{1}{|S|} \sum_{i \in S} x_i$.

2. Spherical distance: If $\mathcal{X} = \mathbb{S}^{m-1} = \{x \in \mathbb{R}^m : \|x\|_2 = 1\}$ (all points lie on a unit sphere) and $d(x, y) = -\langle x, y \rangle$, then

$$c(S) = \arg\min_{z} \Big\{ \sum_{i \in S} d(x_i, z) : \|z\|_2 = 1 \Big\} = \arg\max_{z} \Big\{ \sum_{i \in S} \langle x_i, z \rangle : \|z\|_2 = 1 \Big\} = \frac{\bar{c}(S)}{\|\bar{c}(S)\|_2},$$

where $\bar{c}(S) = \frac{1}{|S|} \sum_{i \in S} x_i$.

3. KL divergence: If $\mathcal{X} = \mathcal{P}^m = \{x \in \mathbb{R}^m : x > 0, \|x\|_1 = 1\}$ (all points lie on the probability simplex) and $d(x, y) = D(x \,\|\, y) = \langle x, \log(x/y) \rangle$, with the logarithm and division taken elementwise, then $c(S) = \frac{1}{|S|} \sum_{i \in S} x_i$.

For all three distance metrics, the cluster center is easy to compute, as the cost is only that of vector addition. But it should be noted that the cluster centers need not be one of the data points and might not be meaningful in the context of the application. Also, the squared Euclidean distance (which is a convex function) gives large weights to larger distances, and therefore k-means with this distance measure is not very robust to outliers. An alternative algorithm is the k-medoids algorithm, which forces the cluster centers to be one of the data points. Here again, the most computationally intensive step is the determination of cluster centers given a partition of the data points.

Theorem 3.1. k-means terminates after at most $k^n$ iterations.

Proof: The cost $C(S)$ decreases monotonically at each iteration, and stabilizes once successive partitions coincide. This is true because the algorithm alternately minimizes the cost over all partitions and over all cluster centers. Since there are at most $k^n$ possible partitions, and the algorithm does not get trapped in cycles, it terminates within $k^n$ iterations.

The main drawback of the k-means algorithm is that the solution may not be the global optimum, and it may take a long time to converge. Finding the global optimum is known to be NP-hard in $\mathbb{R}^d$ for a general $d$ and $k = 2$, and for the 2-D plane and a general $k$. For fixed $d$ and $k$, polynomial time algorithms exist.

A variant of k-means is the Incremental k-means algorithm. A drawback of this algorithm is that it may not converge, or it may converge to a local optimum.

Data: $x_1, x_2, \dots, x_n$
Result: $S_1, S_2, \dots, S_k$
Initialize partition $S_1, S_2, \dots, S_k$.
while $C(S)$ improves do
    For each $a, b \in [k]$ and each $i \in S_a$, shift $i$ to $S_b$ if the new partition improves the cost.
end
Algorithm 2: Incremental k-means

Another issue is the choice of the number of clusters, $k$.
One strategy to choose $k$ is to look at the cost versus number of clusters curve and identify the $k$ above which the curve flattens out (see Fig. 3.1). That is, increasing the number of clusters beyond this value provides only a marginal improvement in the total cost, and therefore there is no incentive to choose a higher value of $k$. This technique is called the Elbow Method. The following theorem gives a lower bound on the cost that an optimal algorithm can achieve in the Euclidean distance case.
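As an illustration of the elbow method, here is a minimal sketch in Python. It assumes squared Euclidean distance, a toy 2-D data set, and a contiguous-chunk initialization of the partition; none of these choices come from the lecture, and the helper names (`sq_dist`, `mean`, `kmeans`) are hypothetical. It runs the two alternating steps of Algorithm 1 for several values of k and prints the resulting cost curve.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def mean(points):
    """Coordinate-wise mean: the cluster center for squared Euclidean distance."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    """Alternate centers update and cluster assignment until the cost stops improving.

    Assumes 1 <= k <= len(points). Returns (labels, centers, cost).
    """
    n = len(points)
    # Assumed initialization (not from the lecture): k contiguous index chunks.
    labels = [i * k // n for i in range(n)]
    centers = [None] * k
    best = float("inf")
    for _ in range(max_iter):
        # 1. Centers update: mean of each non-empty cluster.
        for a in range(k):
            members = [x for x, lab in zip(points, labels) if lab == a]
            if members:  # an empty cluster keeps its previous center
                centers[a] = mean(members)
        # 2. Cluster assignment: each point goes to its nearest center.
        labels = [min(range(k), key=lambda b: sq_dist(x, centers[b])) for x in points]
        cost = sum(sq_dist(x, centers[a]) for x, a in zip(points, labels))
        if cost >= best:  # no improvement: stop
            break
        best = cost
    return labels, centers, best


# Toy data (assumed): three well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]

# Cost versus number of clusters: the curve inspected by the elbow method.
costs = [kmeans(points, k)[2] for k in range(1, 6)]
for k, c in enumerate(costs, start=1):
    print(f"k = {k}: cost = {c:.3f}")
```

On this toy data the cost falls steeply up to k = 3 and changes little afterwards, which is the flattening that the elbow method looks for. Since k-means only finds a local optimum, the curve obtained in practice depends on the initialization.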

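[Figure 3.1: Elbow Method. Cost of k clusters plotted against the number of clusters k, with the elbow region marked.]

Theorem 3.2.

$$\min_{S} C(S) \ge \sum_{l=k+1}^{\min(m,n)} \sigma_l(X)^2,$$

where $X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times m}$, $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ are the data points, and $\sigma_l(X)$ is the $l$-th largest singular value of $X$.

Remark: Here is a simple example where the bound is tight. Suppose that the data points $x_1, x_2, \dots, x_n \in \{z_1, z_2, \dots, z_k\}$, where $z_i \perp z_j$ for all $i \ne j \in [k]$. Then, after ordering the points by value, $XX^T$ is a block diagonal matrix with $k$ diagonal blocks, and the rank of each block is 1. Thus the rank of $XX^T$ is at most $k$, and the LHS and the RHS in the above inequality are both equal to zero.

Proof: Let $n_a = |S_a|$ and $c_a = c(S_a) = \frac{1}{n_a} \sum_{j \in S_a} x_j$. Then

$$C(S) = \sum_{a=1}^{k} \sum_{i \in S_a} \|x_i - c_a\|^2 = \sum_{a=1}^{k} \sum_{i \in S_a} \big( \|x_i\|^2 - \|c_a\|^2 \big) = \sum_{i=1}^{n} \|x_i\|^2 - \sum_{a=1}^{k} \frac{1}{n_a} \Big\| \sum_{j \in S_a} x_j \Big\|^2.$$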

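Let $u_a = \sum_{i \in S_a} e_i$, where $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^n$, let $n_a = |S_a|$, and let

$$Y = \begin{bmatrix} \frac{1}{\sqrt{n_1}} u_1^T \\ \frac{1}{\sqrt{n_2}} u_2^T \\ \vdots \\ \frac{1}{\sqrt{n_k}} u_k^T \end{bmatrix} \in \mathbb{R}^{k \times n}.$$

Then

$$F(S) := \sum_{a=1}^{k} \frac{1}{n_a} \Big\| \sum_{j \in S_a} x_j \Big\|^2 = \sum_{a=1}^{k} \frac{1}{n_a} \|X^T u_a\|^2 = \sum_{a=1}^{k} \frac{1}{n_a} u_a^T XX^T u_a = \mathrm{Tr}(Y XX^T Y^T).$$

Note that since $S_a \cap S_b = \emptyset$ for $a \ne b$, we have $u_a \perp u_b$, and since $\|u_a\|^2 = n_a$, it follows that $YY^T = I_{k \times k}$. Now

$$\max_{S} F(S) = \max \Big\{ \mathrm{Tr}(Y XX^T Y^T) : Y = \big[ \tfrac{1}{\sqrt{n_1}} u_1, \dots, \tfrac{1}{\sqrt{n_k}} u_k \big]^T,\ YY^T = I_{k \times k},\ Y \in \mathbb{R}^{k \times n} \Big\}$$
$$\le \max \Big\{ \mathrm{Tr}(Y XX^T Y^T) : YY^T = I_{k \times k},\ Y \in \mathbb{R}^{k \times n} \Big\} = \sum_{l=1}^{k} \sigma_l(X)^2 \quad \text{from Lemma 3.3.}$$

Therefore, using $\sum_{i=1}^{n} \|x_i\|^2 = \|X\|_F^2 = \sum_{l=1}^{\min(m,n)} \sigma_l(X)^2$,

$$\min_{S} C(S) = \sum_{i=1}^{n} \|x_i\|^2 - \max_{S} F(S) \ge \sum_{l=k+1}^{\min(m,n)} \sigma_l(X)^2.$$

Lemma 3.3.

$$\max \Big\{ \mathrm{Tr}(Y XX^T Y^T) : YY^T = I,\ Y \in \mathbb{R}^{k \times n} \Big\} = \sigma_1(X)^2 + \dots + \sigma_k(X)^2.$$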

Proof:

$$\mathrm{Tr}(Y XX^T Y^T) = \sum_{i=1}^{k} \lambda_i(Y XX^T Y^T) \le \sum_{i=1}^{k} \lambda_i(XX^T)$$

(since $YY^T = I_{k \times k} \implies \lambda_i(Y XX^T Y^T) \le \lambda_i(XX^T)$ for all $i \le k$). Here $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue, and $\lambda_i(XX^T) = \sigma_i(X)^2$. In the above inequality, equality is achieved for $Y$ that has the top $k$ eigenvectors of $XX^T$ as its rows. This completes the proof.

3.3 Next time: Probabilistic Model

As already mentioned, the k-means algorithm views clustering in an optimization framework. Alternatively, one could view the problem in a probabilistic setting where the data set is modelled by a mixture of distributions, with each cluster represented by a parametric distribution. Very often, it is assumed that each data point is equally likely to be from any of the clusters. Thus, the data points $x_1, x_2, \dots, x_n$ are assumed to be generated according to

$$p(\cdot \mid c) = \frac{1}{k} \sum_{a=1}^{k} q(\cdot \mid c_a),$$

and are used to find an ML estimate of the parameters. A widely used model is the Gaussian mixture model with fixed variances, in which the means (interpreted as the cluster centers) are the parameters of the component Gaussian distributions. Another interesting model is one in which the means are fixed and the variance is cluster dependent, but we will not consider that model here. Thus, given that

$$q(x \mid c) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\Big( -\frac{\|x - c\|^2}{2\sigma^2} \Big),$$

we would like to minimize the negative log-likelihood, which, up to additive constants that do not depend on $c$, is given by

$$\ell(c \mid x_1, \dots, x_n) = -\sum_{i=1}^{n} \log\Big( \sum_{a=1}^{k} \exp\Big( -\frac{\|x_i - c_a\|^2}{2\sigma^2} \Big) \Big) \xrightarrow{\ \sigma \to 0\ } -\sum_{i=1}^{n} \log\Big( \max_{a} \Big\{ \exp\Big( -\frac{\|x_i - c_a\|^2}{2\sigma^2} \Big) \Big\} \Big) = \sum_{i=1}^{n} \min_{a} \Big\{ \frac{\|x_i - c_a\|^2}{2\sigma^2} \Big\}.$$

This establishes a connection between the optimization and the probabilistic models. For very small $\sigma$, $\ell(c \mid x) \approx \frac{1}{2\sigma^2} \min_{S} C(S, c)$, where $C(S, c) = \sum_{a=1}^{k} \sum_{i \in S_a} \|x_i - c_a\|^2$. The k-means algorithm tries to minimize $\min_{S} C(S, c)$ while the EM algorithm (which we will see in the next class) tries to minimize $\ell(c \mid x)$.
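The small-variance limit above can be checked numerically. The following is a minimal sketch in Python, assuming a toy 1-D data set and two fixed centers chosen only for illustration; the function names `neg_log_likelihood` and `kmeans_cost` are hypothetical. It evaluates the negative log-likelihood with constants dropped, rescales it by $2\sigma^2$, and compares the result with the k-means cost $\sum_i \min_a \|x_i - c_a\|^2$ as $\sigma$ shrinks.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def neg_log_likelihood(points, centers, sigma):
    """-sum_i log sum_a exp(-||x_i - c_a||^2 / (2 sigma^2)), constants dropped."""
    total = 0.0
    for x in points:
        exps = [-sq_dist(x, c) / (2 * sigma ** 2) for c in centers]
        m = max(exps)  # shift by the max (log-sum-exp) for numerical stability
        total -= m + math.log(sum(math.exp(e - m) for e in exps))
    return total


def kmeans_cost(points, centers):
    """sum_i min_a ||x_i - c_a||^2: the cost of the best assignment to the centers."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Toy data and centers (assumed for illustration only).
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
centers = [(0.5,), (2.5,)]
sigmas = [2.0, 1.0, 0.5, 0.1]

hard = kmeans_cost(points, centers)
gaps = []
for sigma in sigmas:
    scaled = 2 * sigma ** 2 * neg_log_likelihood(points, centers, sigma)
    gaps.append(hard - scaled)
    print(f"sigma = {sigma}: 2*sigma^2*l = {scaled:.6f}, k-means cost = {hard:.6f}")
```

For each point, the log-sum-exp differs from the max term by at most $\log k$, so the gap between the two quantities is at most $2\sigma^2 n \log k$ and vanishes as $\sigma \to 0$.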
