Data Mining Learning from Large Data Sets


Terence Shaw
5 years ago

1 Data Mining: Learning from Large Data Sets. Lecture 8: Clustering large data sets. Andreas Krause

2 Announcements
- Homework 4 out tomorrow


3 Course organization
- Retrieval
  - Given a query, find the most similar item in a large data set
  - Determine relevance of search results
  - Applications: Google Goggles, Shazam, ...
- Supervised learning (classification, regression)
  - Learn a concept (function mapping queries to labels)
  - Applications: spam filtering, predicting price changes, ...
- Unsupervised learning (clustering, dimension reduction)
  - Identify clusters, "common patterns"; anomaly detection
  - Applications: recommender systems, fraud detection, ...
- Learning with limited feedback
  - Learn to optimize a function that's expensive to evaluate
  - Applications: online advertising, optimizing UIs, learning rankings, ...

4 Unsupervised learning
- "Learning without labels"
- Typically useful for exploratory data analysis ("find patterns"; visualization; ...)
- Most common methods:
  - Clustering (unsupervised classification)
  - Dimension reduction (unsupervised regression)

5 What is clustering?
- Given data points, group into clusters such that
  - Similar points are in the same cluster
  - Dissimilar points are in different clusters
- Points are typically represented either
  - in (high-dimensional) Euclidean space
  - in a metric space, given in terms of pairwise distances (Jaccard, cosine, ...)
- Anomaly / outlier detection: identification of points that don't fit well in any of the clusters

6 Examples of clustering
- Cluster
  - Documents based on the words they contain
  - Images based on image features
  - DNA sequences based on edit distance
  - Products based on which customers bought them
  - Customers based on their purchase history
  - Web surfers based on their queries / sites they visit
  - ...


7 Standard approaches to clustering
- Hierarchical clustering
  - Build a tree (either bottom-up or top-down), representing the distances among the data points
  - Example: single-, average-linkage agglomerative clustering
- Partitional approaches
  - Define and optimize a notion of "goodness" defined over partitions
  - Example: spectral clustering, graph-cut based approaches
- Model-based approaches
  - Maintain cluster "models" and infer cluster membership (e.g., assign each point to closest center)
  - Example: k-means, Gaussian mixture models, ...

8 We will
- Review standard clustering algorithms
  - K-means
  - Probabilistic mixture models
- Discuss how to scale them to massive data sets and data streams

9 Clustering example

10 k-means
- Assumes points are in Euclidean space: $x_i \in \mathbb{R}^d$
- Represent clusters as centers $\mu_j \in \mathbb{R}^d$
- Each point is assigned to closest center
- Goal: Pick centers to minimize average squared distance
  $$\sum_{i=1}^{N} \min_j \|\mu_j - x_i\|_2^2$$
- Non-convex optimization
- NP-hard, so can't solve optimally in general

11 Classical k-means algorithm
- Initialize cluster centers
  - E.g., pick one point at random, the other ones with maximum distance
- While not converged
  - Assign each point $x_i$ to closest center
  - Update center as mean of assigned data points
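The two alternating steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's reference code: the function names (`sq_dist`, `closest`, `kmeans`) are made up here, and initialization simply draws k random data points rather than the farthest-point heuristic mentioned on the slide.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def closest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initialize with k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_assignment = [closest(p, centers) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[j] = [sum(m[i] for m in members) / len(members)
                              for i in range(dim)]
    return centers, assignment
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns the final centers and one cluster index per point. Every iteration touches all N points, which is the cost the later slides try to avoid.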

12-18 K-means [figure slides]

19 Properties of k-means
- Guaranteed to monotonically decrease average squared distance in each iteration
  $$L(\mu) = \sum_{i=1}^{N} \min_j \|\mu_j - x_i\|_2^2$$
- Converges to a local optimum
- Complexity:
  - Per iteration: $O(Nkd)$ (N points, k centers, d dimensions)
  - Have to process entire data set in each iteration

20 K-means for large data sets / streams
- What if data set does not fit in main memory?
  - In principle not a problem (why?)
  - But each iteration still requires an entire pass through the data set
- Recall supervised learning (online SVM, etc.)
  - There we were able to process one data point at a time
  - Get (provably) good solutions from a single pass through the data
  - Could even do it in parallel!
- Can we do the same thing for clustering?

21 Streaming clustering
- How should we maintain clusters as new data arrives?

22 Recall online SVM
- Recall online SVMs (& stochastic gradient descent)
- Loss function decomposes additively over data set
  $$L(w) = \sum_i \mathrm{hinge}(x_i; y_i, w)$$
- Can take a (sub-)gradient step for each data point

23 Online k-means
- For k-means, loss function also decomposes additively over data set
  $$L(\mu) = \sum_{i=1}^{N} \min_j \|\mu_j - x_i\|_2^2$$
- Let's try to take a (sub-)gradient step for each data point

24 Calculating the gradient
  $$L(\mu) = \sum_{i=1}^{N} \min_j \|\mu_j - x_i\|_2^2$$
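The slide's derivation did not survive extraction. A short worked version for a single data point $x_t$ is given below as a sketch; it assumes the closest center is unique, and the symbols $\ell_t$ (per-point loss) and $\eta_t$ (step size) are notation introduced here.

```latex
% Per-point loss and its closest center (assumed unique)
\ell_t(\mu) = \min_j \|\mu_j - x_t\|_2^2, \qquad c = \arg\min_j \|\mu_j - x_t\|_2

% Only the closest center receives a nonzero gradient
\nabla_{\mu_j} \ell_t(\mu) =
\begin{cases}
  2(\mu_c - x_t) & \text{if } j = c \\
  0 & \text{otherwise}
\end{cases}

% Gradient step with step size \eta_t / 2
\mu_c \leftarrow \mu_c - \tfrac{\eta_t}{2} \cdot 2(\mu_c - x_t) = \mu_c + \eta_t (x_t - \mu_c)
```

The factor 2 is absorbed into the step size, so each step moves only the closest center a fraction $\eta_t$ of the way toward the new point.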

25 Online k-means algorithm
- Initialize centers randomly
- For t = 1:N
  - Find $c = \arg\min_j \|\mu_j - x_t\|_2$
  - Set $\mu_c \leftarrow \mu_c + \eta_t (x_t - \mu_c)$
- To converge to local optimum, need that
  $$\sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty$$
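A minimal single-pass sketch of this update rule follows. The function name `online_kmeans` and the default step size $\eta_t = 1/t$ are choices made here for illustration; $1/t$ is one schedule whose sum diverges while the sum of its squares stays finite.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def online_kmeans(stream, centers, step=lambda t: 1.0 / t):
    """Process each point once; `centers` is a list of lists, updated in place."""
    for t, x in enumerate(stream, start=1):
        # Find the closest center c for the current point x_t.
        c = min(range(len(centers)), key=lambda j: sq_dist(centers[j], x))
        eta = step(t)
        # mu_c <- mu_c + eta_t * (x_t - mu_c); all other centers stay put.
        centers[c] = [m + eta * (xi - m) for m, xi in zip(centers[c], x)]
    return centers
```

For example, `online_kmeans(points, [[0.0], [10.0]])` consumes `points` once and needs memory only for the k centers. A common variant, not shown on the slide, keeps a separate counter per center instead of the global `t`.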

26 Practical aspects
- Generally works best if data is "randomly" ordered (like stochastic gradient descent)
- Typically, want to choose larger value for k
- How can one implement multiple random restarts in one pass?

27 Problems with online k-means
- Have to commit to k in advance
- Objective function non-convex (and problem NP-hard), so guarantees for online convex programming / SGD do not apply!
- Not clear how to parallelize

28 Alternative: Summarizing large data sets
- Idea:
  - Efficiently construct a compact version C of the data set D such that solving k-means on C gives a good solution to D
- Approach:
  - First construct C such that it allows us to approximately answer "k-means queries", i.e., approximately evaluate
    $$L(\mu) = \sum_{i=1}^{N} \min_j \|\mu_j - x_i\|_2^2$$
  - Then solve k-means using the approximate loss function

29-31 k-means queries [figure slides]

32 Data set summarization for k-means

33 Data set summarization for k-means
- Key idea: Replace many points by one weighted representative
  $$L_k(\mu; C) = \sum_{(w,x) \in C} w \min_{j \in \{1,\dots,k\}} \|\mu_j - x\|_2^2$$

34 Coresets
  $$L_k(\mu; C) = \sum_{(w,x) \in C} w \min_{j \in \{1,\dots,k\}} \|\mu_j - x\|_2^2$$
- C is called a (k,ε)-coreset for data set D, if
  $$(1 - \varepsilon) L_k(\mu; D) \le L_k(\mu; C) \le (1 + \varepsilon) L_k(\mu; D)$$
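To make the weighted loss concrete, here is a small sketch that evaluates $L_k(\mu; C)$ on a list of `(weight, point)` pairs and tests the coreset inequality for one given set of centers. The names `kmeans_cost` and `within_coreset_bounds` are hypothetical, and the check covers only the single `centers` argument passed in, so it can refute the coreset property but cannot prove it.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_cost(weighted_points, centers):
    """L_k(mu; C): sum over (w, x) in C of w * min_j ||mu_j - x||^2."""
    return sum(w * min(sq_dist(x, mu) for mu in centers)
               for w, x in weighted_points)

def within_coreset_bounds(D, C, centers, eps):
    """Check (1 - eps) L(D) <= L(C) <= (1 + eps) L(D) for these centers only."""
    full = kmeans_cost([(1.0, x) for x in D], centers)  # unit weights on D
    approx = kmeans_cost(C, centers)
    return (1 - eps) * full <= approx <= (1 + eps) * full

# Four points collapse exactly into two weighted representatives.
D = [(0.0,), (0.0,), (2.0,), (2.0,)]
C = [(2.0, (0.0,)), (2.0, (2.0,))]
print(kmeans_cost(C, [(1.0,)]), within_coreset_bounds(D, C, [(1.0,)], eps=0.1))
```

Duplicated points are the trivial case where the summary is exact; the interesting constructions trade a small ε for a much smaller C.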

35 Constructing coresets
- Suppose for all pairs of points: $\|x_i - x_j\|_2 \le 1$
- First attempt: Random sampling
  - Pick n << N points C uniformly at random from D
    $$\mathbb{E}[L_k(\mu; C)] = L_k(\mu; D)$$
  - Hoeffding's inequality gives
    $$\Pr\left(|L_k(\mu; C) - L_k(\mu; D)| \ge \varepsilon\right) \le 2\exp(-2\varepsilon^2 n)$$
  - Thus need $O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ points to ensure absolute error at most ε with probability at least 1-δ

36 Constructing coresets
- Suppose for all pairs of points: $\|x_i - x_j\|_2 \le 1$
- First attempt: Random sampling
  - Pick n << N points C uniformly at random from D
  - Assign uniform weights N/n
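The uniform-sampling attempt is short enough to write out. The sketch below assumes sampling with replacement (the slide does not say which), and the helper `hoeffding_sample_size` just solves $2\exp(-2\varepsilon^2 n) \le \delta$ for n; both function names are made up for this example.

```python
import math
import random

def uniform_coreset(D, n, seed=0):
    """Draw n points uniformly at random and give each weight N/n."""
    rng = random.Random(seed)
    N = len(D)
    # Assumption: i.i.d. draws with replacement, matching the Hoeffding setting.
    sample = [rng.choice(D) for _ in range(n)]
    return [(N / n, x) for x in sample]

def hoeffding_sample_size(eps, delta):
    """Smallest n with 2 * exp(-2 * eps^2 * n) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

D = [(float(i),) for i in range(1000)]
C = uniform_coreset(D, n=hoeffding_sample_size(0.1, 0.05))
print(len(C), sum(w for w, _ in C))
```

The weights N/n keep the total weight equal to N, so the weighted loss on C is an unbiased estimate of the loss on D. The guarantee is only an absolute error, which is why the next slides move to non-uniform sampling.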

37 Can we get small relative error?
- To ensure low multiplicative error, need more complex construction
- Will use non-uniform sampling!

38 Sampling Distribution
- Bias sampling towards small clusters
[figure: sampling distribution]

39 Importance Weights
[figure: sampling distribution, weights]

40-48 Creating a Sampling Distribution
- Iteratively find representative points
  - Sample a small set uniformly at random
  - Remove half the blue points nearest the samples
- Small clusters are represented
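The iterative procedure on these slides can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `find_representatives`, the `batch` parameter, and the stopping rule (stop once at most `batch` points remain and keep them all) are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def find_representatives(D, batch=4, seed=0):
    rng = random.Random(seed)
    remaining = list(D)
    reps = []
    while len(remaining) > batch:
        # Sample a small set uniformly at random from the remaining points.
        picked = set(rng.sample(range(len(remaining)), batch))
        samples = [remaining[i] for i in picked]
        reps.extend(samples)
        rest = [x for i, x in enumerate(remaining) if i not in picked]
        # Remove the half of the remaining points nearest to the samples.
        rest.sort(key=lambda x: min(sq_dist(x, s) for s in samples))
        remaining = rest[(len(rest) + 1) // 2:]
    # Assumption: the few points left at the end all become representatives.
    reps.extend(remaining)
    return reps
```

Because every round discards the points closest to the current samples, points in dense regions disappear quickly, while isolated points survive to later rounds and are more likely to end up as representatives.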

49 Creating a Sampling Distribution
- Partition data via a Voronoi diagram centered at the representative points

50 Creating a Sampling Distribution
- Points in sparse cells get more mass, and so do points far from centers
[figure: sampling distribution]

51 Importance Weights
- Points in sparse cells get more mass, and so do points far from centers
[figure: sampling distribution, weights]
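The slides give the sampling distribution only as a picture. The sketch below uses one plausible form, which is an assumption and not the formula from the lecture: half of the probability mass is proportional to squared distance from the nearest representative, and half is split evenly across Voronoi cells and then across the points inside each cell. The importance weight $1/(m\,q(x))$ is the standard choice that keeps the weighted loss unbiased.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sampling_distribution(D, reps):
    """Hypothetical q(x): favors far-away points and points in sparse cells."""
    nearest = [min(range(len(reps)), key=lambda j: sq_dist(x, reps[j])) for x in D]
    d2 = [sq_dist(x, reps[c]) for x, c in zip(D, nearest)]
    cell_size = [nearest.count(j) for j in range(len(reps))]
    nonempty = sum(1 for s in cell_size if s > 0)
    total_d2 = sum(d2)
    q = []
    for dist, c in zip(d2, nearest):
        # Mass for being far from the cell's center (uniform if all distances are 0).
        far_term = dist / total_d2 if total_d2 > 0 else 1.0 / len(D)
        # Mass for sitting in a sparse cell: each cell shares 1/nonempty equally.
        sparse_term = 1.0 / (nonempty * cell_size[c])
        q.append(0.5 * far_term + 0.5 * sparse_term)
    return q

def importance_sample(D, q, m, seed=0):
    """Draw m points i.i.d. from q and weight each by 1 / (m * q(x))."""
    rng = random.Random(seed)
    idx = rng.choices(range(len(D)), weights=q, k=m)
    return [(1.0 / (m * q[i]), D[i]) for i in idx]
```

With eight points in one cell and two in another, all at the same distance from their representative, each point in the small cell gets roughly twice the sampling probability of a point in the large cell, and a correspondingly smaller weight when it is drawn.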

52 Non-uniform sample

53 Coresets via Adaptive Sampling
- C is a (k,ε)-coreset of size polynomial in d, k, log n, 1/ε, 1/δ

54 Can do better: Coresets for k-means
- Theorem [Har-Peled and Kushal, 05]: One can efficiently find a (k,ε)-coreset for k-means of size $O(k^3 / \varepsilon^{d+1})$
- Theorem [Feldman et al. 07]: One can efficiently find a weak (k,ε)-coreset of size $O(\mathrm{poly}(k, 1/\varepsilon))$
  - Allows PTAS for k-means!

55 Coresets exist for
- K-means, K-median
- K-line means / median
- PageRank
- SVMs
- Diameter of a point set
- Matrix low-rank approximation
- ...

56-57 Composition of Coresets [cf. Har-Peled, Mazumdar 04]
- Merge
- Compress

58-60 Coresets on Streams [Har-Peled, Mazumdar 04]
- Merge
- Compress
- Error grows linearly with number of compressions

61 Coresets on Streams
- Error grows with height of tree
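The merge and compress operations combine into a binary tree over the stream: the union of two summaries summarizes the union of their data, and compressing a summary again compounds the error, which is why the error depends on the tree height. The sketch below shows only the tree bookkeeping. Its `compress` is a stand-in (weighted resampling that preserves total weight), not a real (k,ε)-coreset construction, and all names are hypothetical.

```python
import random

def merge(c1, c2):
    """The union of two weighted summaries summarizes the union of their data."""
    return c1 + c2

def compress(weighted, size, rng):
    """Stand-in compression (assumption): resample `size` points by weight."""
    if len(weighted) <= size:
        return list(weighted)
    total = sum(w for w, _ in weighted)
    picks = rng.choices(weighted, weights=[w for w, _ in weighted], k=size)
    return [(total / size, x) for _, x in picks]

def stream_coreset(stream, chunk, size, seed=0):
    rng = random.Random(seed)
    levels = []  # levels[h] holds at most one summary of tree height h

    def push(summary):
        h = 0
        # Like binary addition: merge and compress while the slot is occupied.
        while h < len(levels) and levels[h] is not None:
            summary = compress(merge(levels[h], summary), size, rng)
            levels[h] = None
            h += 1
        if h == len(levels):
            levels.append(None)
        levels[h] = summary

    buffer = []
    for x in stream:
        buffer.append((1.0, x))
        if len(buffer) == chunk:
            push(compress(buffer, size, rng))
            buffer = []
    result = list(buffer)  # leftover points keep unit weight
    for s in levels:
        if s is not None:
            result = merge(result, s)
    return result
```

At any time only one summary per tree level is held, so memory is about `size` times the logarithm of the number of chunks, and each original point passes through at most that many compressions.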

62 Coresets in Parallel

63 k-means clustering with coresets
- Given data set D, desired number of clusters k, precision ε
- Construct (k,ε)-coreset C
  - E.g., in parallel using MapReduce
- Solve k-means on coreset
  - If coreset small, can even do exhaustive search!
  - In practice, run k-means with many restarts
- Resulting solution will be (1+ε)-optimal for D
  - Provably near-optimal solution!
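Solving k-means on a coreset only requires the weighted version of the classical algorithm: the assignment step is unchanged and the update step uses a weighted mean. The following sketch runs it with several random restarts and keeps the cheapest solution. Function names and the restart count are illustrative choices, and weights are assumed to be positive.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cost(C, centers):
    """Weighted k-means loss on a list of (weight, point) pairs."""
    return sum(w * min(sq_dist(x, mu) for mu in centers) for w, x in C)

def weighted_lloyd(C, centers, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: group coreset points by closest center.
        groups = [[] for _ in centers]
        for w, x in C:
            j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            groups[j].append((w, x))
        # Update step: weighted mean of each group (empty groups keep their center).
        new_centers = []
        for old, g in zip(centers, groups):
            if not g:
                new_centers.append(old)
                continue
            total = sum(w for w, _ in g)
            new_centers.append([sum(w * x[i] for w, x in g) / total
                                for i in range(len(old))])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers

def kmeans_on_coreset(C, k, restarts=10, seed=0):
    """Run weighted k-means from several random starts; return (cost, centers)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        init = [list(x) for _, x in rng.sample(C, k)]
        centers = weighted_lloyd(C, init)
        c = cost(C, centers)
        if best is None or c < best[0]:
            best = (c, centers)
    return best
```

Since the coreset is small, many restarts are cheap. The returned centers can then be evaluated on, or used to assign, the full data set D.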

64 Summary so far
- Clustering is a central problem in unsupervised learning
- Two main classes of approaches
  - Hierarchical (difficult to scale)
  - Assignment based
- Discussed k-means algorithm
  - Widely used clustering algorithm
  - Non-linear versions available
  - Can scale to large data sets using online optimization and coreset constructions

65 Acknowledgments
- The slides are partly based on material by Chris Bishop, Andrew Moore and Danny Feldman
