MapReduce ML & Clustering Algorithms


1 MapReduce ML & Clustering Algorithms

2 Reminder MapReduce: A trade-off between ease of use & possible parallelism Graph Algorithms Approaches: Reduce input size (filtering) Graph specific optimizations (Pregel & Giraph) 2

3 Today Machine Learning More filtering -- reducing input size Machine Learning Optimizations & AllReduce Applications: k-means clustering 3

4 Machine Learning Definition: Fitting a function to the data 4

5 Machine Learning Definition: Fitting a function to the data Examples: Classification: Data (x, y) pairs, $x \in \mathbb{R}^d$, $y = \pm 1$. Temperature = 15, Pressure = 755mm, Cloud Cover = 90% : No Rain. Temperature = 17, Pressure = 760mm, Cloud Cover = 75% : No Rain. Temperature = 23, Pressure = 766mm, Cloud Cover = 95% : Rain. Temperature = 19, Pressure = 740mm, Cloud Cover = 100% : ??? 5

6 Machine Learning Definition: Fitting a function to the data Examples: Classification. Regression: Data (x, y) pairs, $x \in \mathbb{R}^d$, $y \in \mathbb{R}$. Temperature = 15, Pressure = 755mm, Cloud Cover = 90% : 1mm Rain. Temperature = 17, Pressure = 760mm, Cloud Cover = 75% : 0mm Rain. Temperature = 23, Pressure = 766mm, Cloud Cover = 95% : 9mm Rain. Temperature = 19, Pressure = 740mm, Cloud Cover = 100% : ??? 6

7 Machine Learning Definition: Fitting a function to the data Examples: Classification. Regression. Clustering: Data $x \in \mathbb{R}^d$. Goal: find a sensible grouping into $k$ groups 7

8 Machine Learning Definition: Fitting a function to the data Examples: Classification. Regression. Clustering. Today: Regression & Clustering 8

9 Regression Data: Aarhus (yesterday) Temperature = 15, Pressure = 755mm, Cloud Cover = 90% : 1mm. Temperature = 17, Pressure = 760mm, Cloud Cover = 75% : 0mm. Temperature = 23, Pressure = 766mm, Cloud Cover = 95% : 9mm. Temperature = 11, Pressure = 740mm, Cloud Cover = 100% : 5mm. Matrix form (columns: Temperature, Pressure, Cloud Cover): $X = \begin{pmatrix} 15 & 755 & 90 \\ 17 & 760 & 75 \\ 23 & 766 & 95 \\ 11 & 740 & 100 \end{pmatrix}$, $y = \begin{pmatrix} 1 \\ 0 \\ 9 \\ 5 \end{pmatrix}$ 9

10 Linear Regression (same $X$, $y$ as above) Approximate $y$ by a linear function of $x$ with weights $\beta$: Example: weight 0.05 on Temperature, 0 on Pressure, 0.1 on Cloud Cover 10

11 Linear Regression (same $X$, $y$ as above) Approximate $y$ by a linear function of $x$ with weights $\beta$: Example: weight 0.05 on Temperature, 0 on Pressure, 0.1 on Cloud Cover. Predictions: $X\beta$. Find the $\beta$ that minimizes the squared distance: $\|X\beta - y\|^2$ 11

12 Linear Regression (same $X$, $y$ as above) Approximate $y$ by a linear function of $x$ with weights $\beta$: Example: weight 0.05 on Temperature, 0 on Pressure, 0.1 on Cloud Cover. Predictions: $X\beta$. Find the $\beta$ that minimizes the squared distance $\|X\beta - y\|^2$. Very simple! Is this complex enough to capture all of the data? 12

13 Linear Regression (same $X$, $y$ as above) Approximate $y$ by a linear function of $x$ with weights $\beta$: Example: weight 0.05 on Temperature, 0 on Pressure, 0.1 on Cloud Cover. Predictions: $X\beta$. Find the $\beta$ that minimizes the squared distance $\|X\beta - y\|^2$. Very simple! Is this complex enough to capture all of the data? Maybe if you have a lot of features 13

14 Doing Regression Problem: Examples $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n \times 1}$. Find a set of weights $\beta \in \mathbb{R}^{d \times 1}$ that minimizes $\|X\beta - Y\|^2$ 14

15 Doing Regression Problem: Examples $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n \times 1}$. Find a set of weights $\beta \in \mathbb{R}^{d \times 1}$ that minimizes $\|X\beta - Y\|^2$. Exact Solution: $\beta = (X^T X)^{-1} X^T Y$ 15

16 Doing Regression Problem: Examples $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n \times 1}$. Find a set of weights $\beta \in \mathbb{R}^{d \times 1}$ that minimizes $\|X\beta - Y\|^2$. Exact Solution: $\beta = (X^T X)^{-1} X^T Y$. If the number of examples is large but the dimensionality is small ($n \gg d$), then $X^T X \in \mathbb{R}^{d \times d}$ and $X^T Y \in \mathbb{R}^{d \times 1}$ are small... 16

17 Doing Regression Problem: Examples $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n \times 1}$. Find a set of weights $\beta \in \mathbb{R}^{d \times 1}$ that minimizes $\|X\beta - Y\|^2$. Exact Solution: $\beta = (X^T X)^{-1} X^T Y$. If the number of examples is large but the dimensionality is small ($n \gg d$), then $X^T X \in \mathbb{R}^{d \times d}$ and $X^T Y \in \mathbb{R}^{d \times 1}$ are small... ...and therefore fit onto a single machine 17

18 Doing Regression Problem: Examples $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n \times 1}$. Find a set of weights $\beta \in \mathbb{R}^{d \times 1}$ that minimizes $\|X\beta - Y\|^2$. Exact Solution: $\beta = (X^T X)^{-1} X^T Y$. If the number of examples is large but the dimensionality is small ($n \gg d$), then $X^T X \in \mathbb{R}^{d \times d}$ and $X^T Y \in \mathbb{R}^{d \times 1}$ are small... ...and therefore fit onto a single machine. Idea: Compute $X^T X$, $X^T Y$ in parallel, finish on a single machine 18
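This idea maps directly onto one MapReduce round: each mapper holds a shard of the examples and emits its local contribution to $X^T X$ and $X^T Y$, a reducer sums the small $d \times d$ and $d \times 1$ pieces, and the final solve happens on one machine. A minimal sketch in Python/NumPy (the shard layout and function names are illustrative, not from the slides):

```python
import numpy as np

def map_partial_stats(X_shard, y_shard):
    """Mapper: one shard's contribution to X^T X and X^T y."""
    return X_shard.T @ X_shard, X_shard.T @ y_shard

def reduce_and_solve(partials):
    """Reducer: sum the d x d and d x 1 pieces, then solve on a single machine."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)  # beta = (X^T X)^{-1} X^T Y

# Toy usage: the four weather examples split across two "machines".
X = np.array([[15, 755, 90], [17, 760, 75], [23, 766, 95], [11, 740, 100]], dtype=float)
y = np.array([1.0, 0.0, 9.0, 5.0])
shards = [(X[:2], y[:2]), (X[2:], y[2:])]
beta = reduce_and_solve([map_partial_stats(Xs, ys) for Xs, ys in shards])
```

Only the $O(d^2)$-sized statistics ever cross the network, which is why this works when $d$ is small even if $n$ is huge.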

19 Computation Idea: Compute $X^T X$, $X^T Y$ in parallel, finish on a single machine 19

20 Computation Idea: Compute $X^T X$, $X^T Y$ in parallel, finish on a single machine. How hard? Computing Matrix-Matrix & Matrix-Vector products. For square matrices this takes $O(1)$ rounds when the machine memory is $m = n^{3/4}$ and the total memory is $M = n^{3/2}$ 20

21 Computation Idea: Compute $X^T X$, $X^T Y$ in parallel, finish on a single machine. How hard? Computing Matrix-Matrix & Matrix-Vector products. For square matrices, both products take $O(1)$ rounds when the machine memory is $m = n^{3/4}$ and the total memory is $M = n^{3/2}$ 21

22 How far can you go? ML Theory: the statistical query model. Interact with the data only via some statistic $f$ that is averaged over all of the examples: $f(X, Y) = \frac{1}{n} \sum_{(x, y) \in (X, Y)} f(x, y)$. This is trivial to parallelize 22

23 Statistical Query Model What can you do using the statistical query model? Linear Regression, Naive Bayes, Logistic Regression, Neural Networks, Principal Component Analysis (PCA) 23

24 Statistical Query Model What can you do using the statistical query model? But required runtime per step... Linear Regression: $O(d^2)$, Naive Bayes: $O(d)$, Logistic Regression: $O(d^2)$, Neural Networks: $O(d)$, Principal Component Analysis (PCA): $O(d^2)$ 24

25 Statistical Query Model What can you do using the statistical query model? But required runtime per step... Linear Regression: $O(d^2)$, Naive Bayes: $O(d)$, Logistic Regression: $O(d^2)$, Neural Networks: $O(d)$, Principal Component Analysis (PCA): $O(d^2)$ ...Applicable only to low dimensional spaces 25

26 Large Dimensionality What if the dimension is too high? Goal: Minimize $J(\beta) = \|X\beta - Y\|^2$ 26

27 Large Dimensionality What if the dimension is too high? Goal: Minimize $J(\beta) = \|X\beta - Y\|^2$. Greedy solution: gradient descent, $\beta_{new} = \beta_{old} + \eta\, \nabla J(\beta_{old})$. For a single coordinate $j$ and a single example $(x, y)$: $\frac{\partial}{\partial \beta_j} J(\beta) = (y - \beta \cdot x)\, x_j$ 27

28 Batch Gradient Descent Given examples: 1. Compute the gradient for every example, for every coordinate $j$: $\frac{\partial}{\partial \beta_j} J(\beta) = (y - \beta \cdot x)\, x_j$. Easy to parallelize! 28

29 Batch Gradient Descent Given examples: 1. Compute the gradient for every example, for every coordinate $j$: $\frac{\partial}{\partial \beta_j} J(\beta) = (y - \beta \cdot x)\, x_j$. Easy to parallelize! 2. Update: $\beta_{new}(j) = \beta_{old}(j) + \eta \sum_{i=1}^{n} (y^{(i)} - \beta \cdot x^{(i)})\, x^{(i)}_j$ 29

30 Batch Gradient Descent Given examples: 1. Compute the gradient for every example, for every coordinate $j$: $\frac{\partial}{\partial \beta_j} J(\beta) = (y - \beta \cdot x)\, x_j$. Easy to parallelize! 2. Update: $\beta_{new}(j) = \beta_{old}(j) + \eta \sum_{i=1}^{n} (y^{(i)} - \beta \cdot x^{(i)})\, x^{(i)}_j$. 3. Repeat until convergence. Will converge given mild conditions on the step size $\eta$ 30
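Steps 1-3 are exactly a statistical-query computation: the per-example terms are independent (that is the part a MapReduce job would distribute) and only their sum feeds the update. A small sequential sketch of the loop, assuming a fixed step size $\eta$ (`eta` in the code); a parallel version would only change how the inner sum over examples is computed:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=1e-6, max_rounds=1000, tol=1e-8):
    """Minimize ||X beta - y||^2 with full-gradient updates."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(max_rounds):
        residual = y - X @ beta          # (y_i - beta . x_i) for every example
        grad = X.T @ residual            # sum_i (y_i - beta . x_i) x_i: the parallelizable sum
        beta_new = beta + eta * grad     # beta_new(j) = beta_old(j) + eta * sum_i (...)
        if np.linalg.norm(beta_new - beta) < tol:   # step 3: repeat until convergence
            return beta_new
        beta = beta_new
    return beta
```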

31 Performance Batch Gradient Descent in MR: Easy to write. But requires many, many rounds to converge... ...and this is inefficient in MapReduce. Remember, we aimed for $O(1)$ rounds. 31

32 Performance Batch Gradient Descent in MR: Easy to write. But requires many, many rounds to converge... ...and this is inefficient in MapReduce. Remember, we aimed for $O(1)$ rounds. The same problem exists sequentially: Batch Gradient Descent looks at all examples in every round! 32

33 Sequential Improvement What if we update the gradient after every example? Read one example: $(x^{(i)}, y^{(i)})$ 33

34 Sequential Improvement What if we update the gradient after every example? Read one example: $(x^{(i)}, y^{(i)})$. Update the parameter: $\beta_{new}(j) = \beta_{old}(j) + \eta (y^{(i)} - \beta \cdot x^{(i)})\, x^{(i)}_j$ 34

35 Sequential Improvement What if we update the gradient after every example? Read one example: $(x^{(i)}, y^{(i)})$. Update the parameter: $\beta_{new}(j) = \beta_{old}(j) + \eta (y^{(i)} - \beta \cdot x^{(i)})\, x^{(i)}_j$. Repeat until the changes are minor. Again, converges to somewhere near the local minimum. Known as Stochastic Gradient Descent 35
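For contrast with the batch version, here is a sketch of the same update applied one example at a time, i.e. stochastic gradient descent; the step size and number of passes are illustrative choices, not values from the slides:

```python
import numpy as np

def sgd(X, y, eta=1e-6, epochs=10):
    """One epoch = read each example once and update beta immediately."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):        # read one example (x_i, y_i)
            residual = y[i] - X[i] @ beta
            beta = beta + eta * residual * X[i]   # beta(j) += eta * (y_i - beta . x_i) * x_ij
    return beta
```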

36 Stochastic GD & All-Reduce How to parallelize stochastic gradient descent? Main Loop: First compute the residual $y^{(i)} - \beta \cdot x^{(i)}$ (requires all coordinates). Then perform the update $\beta_{new}(j) = \beta_{old}(j) + \eta (y^{(i)} - \beta \cdot x^{(i)})\, x^{(i)}_j$ (easy to parallelize by coordinate) 36

37 All-Reduce Optimizes taking a sum & propagating the updates. 37

44 All-Reduce Optimizes taking a sum & propagating the updates Very optimized! Can get a significant speed up over straightforward Hadoop Mostly maintain fault tolerance given by Hadoop Pipelining means nodes are never idle Delay propagation of gradient by a few rounds 44

45 All-Reduce Optimizes taking a sum & propagating the updates Very optimized! Can get a significant speed up over straightforward Hadoop Mostly maintain fault tolerance given by Hadoop Pipelining means nodes are never idle Delay propagation of gradient by a few rounds Gets good results! Can be made to work with other Statistical Query Algorithms 45
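The primitive underneath is simple: every node contributes a partial result (e.g. the gradient over its shard), the contributions are summed, and the sum is broadcast back so that every node ends up holding the same aggregate. A toy single-process sketch of that sum-then-broadcast contract; real implementations, such as the Hadoop-compatible AllReduce described in the Agarwal et al. reference below, run it over a spanning tree of the nodes with pipelining:

```python
import numpy as np

def allreduce_sum(partials):
    """All-reduce over a list of per-node vectors: every node receives
    the elementwise sum of all contributions."""
    total = np.zeros_like(partials[0])
    for p in partials:                       # "reduce": aggregate the partial gradients
        total += p
    return [total.copy() for _ in partials]  # "broadcast": the same sum at every node

# Each node computed a partial gradient over its shard of the data.
node_gradients = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([2.0, 0.0])]
synced = allreduce_sum(node_gradients)       # every entry equals [3.5, 1.0]
```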

46 Regression Overview Two approaches: Exact Computation Works only if the dimension is small (quadratic algorithms on dimension allowed) Streaming style computation Works even if dimension is large AllReduce makes MapReduce more scalable 46

47 Today Machine Learning More filtering -- reducing input size Machine Learning Optimizations & AllReduce Applications: k-means clustering 47

48 Clustering Clustering: Group similar items together One of the oldest problems in CS: Thousands of papers Hundreds of algorithms 48

49 Lloyd s Method: k-means Initialize with random clusters 49

50 Lloyd s Method: k-means Assign each point to nearest center 50

51 Lloyd s Method: k-means Recompute optimum centers (means) 51

52 Lloyd s Method: k-means Repeat: Assign points to nearest center 52

53 Lloyd s Method: k-means Repeat: Recompute centers 53

54 Lloyd s Method: k-means Repeat... 54

55 Lloyd s Method: k-means Repeat...Until clustering does not change 55

56 Lloyd s Method: k-means Repeat...Until clustering does not change Total error reduced at every step - guaranteed to converge. 55

57 Lloyd's Method: k-means Repeat...Until clustering does not change. Total error reduced at every step - guaranteed to converge. Minimizes: $\phi(X, C) = \sum_{x \in X} d(x, C)^2$ 56
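A compact sketch of the Lloyd iteration just described: assign every point to its nearest center, recompute each center as the mean of its cluster, and stop when nothing changes (each step can only decrease $\phi(X, C)$). The initialization is left to the caller, which is exactly the issue the next slides turn to:

```python
import numpy as np

def lloyd(X, centers, max_iters=100):
    """Lloyd's method: alternate assignment and mean steps until stable."""
    for _ in range(max_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of its cluster (keep the old center if empty).
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```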

58 k-means Initialization Random? 57

60 k-means Initialization Random? A bad idea 59

61 k-means Initialization Random? A bad idea Even with many random restarts! 59

62 Easy Fix Select centers using a furthest point algorithm (2-approximation to k-Center clustering). 60
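A sketch of that furthest-point rule: start from an arbitrary point and repeatedly add the point whose distance to the centers chosen so far is largest (this distance is the quantity $D(p)$ used on the following slides):

```python
import numpy as np

def furthest_point_init(X, k, first=0):
    """Greedy k-Center seeding: each new center is the point furthest
    from all centers selected so far."""
    centers = [first]
    dist = np.linalg.norm(X - X[first], axis=1)            # D(p) for every point p
    for _ in range(k - 1):
        nxt = int(dist.argmax())                            # furthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers]
```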

67 Sensitive to Outliers 65

72 k-means++ Interpolate between the two methods. Give preference to further points. Let $D(p)$ be the distance between $p$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(p)$. 67

73 k-means++ Interpolate between the two methods. Give preference to further points. Let $D(p)$ be the distance between $p$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(p)$. kmeans++: Select the first point uniformly at random; for (int i=1; i < k; ++i) { Select the next point p with probability $D^{\alpha}(p) / \sum_x D^{\alpha}(x)$; UpdateDistances(); } 68

74 k-means++ Interpolate between the two methods. Give preference to further points. Let $D(p)$ be the distance between $p$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(p)$. kmeans++: Select the first point uniformly at random; for (int i=1; i < k; ++i) { Select the next point p with probability $D^{\alpha}(p) / \sum_x D^{\alpha}(x)$; UpdateDistances(); } Original Lloyd's: $\alpha = 0$. Furthest Point: $\alpha = \infty$. k-means++: $\alpha = 2$. 69
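A sketch of the seeding loop with the exponent $\alpha$ left as a knob, so that $\alpha = 0$ reproduces uniform (Lloyd-style) seeding, a very large $\alpha$ approaches the furthest-point rule, and $\alpha = 2$ is k-means++ proper; the function name and RNG handling are illustrative:

```python
import numpy as np

def kmeanspp_seed(X, k, alpha=2.0, seed=0):
    """Sample k centers, each with probability D(p)^alpha / sum_x D(x)^alpha."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]               # first point uniformly at random
    dist = np.linalg.norm(X - centers[0], axis=1)     # D(p): distance to the nearest chosen center
    for _ in range(k - 1):
        weights = dist ** alpha
        p = rng.choice(len(X), p=weights / weights.sum())
        centers.append(X[p])
        dist = np.minimum(dist, np.linalg.norm(X - X[p], axis=1))   # UpdateDistances()
    return np.array(centers)
```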

79 k-means++ Theorem [AV 07]: k-means++ guarantees an $O(\log k)$ approximation 71

80 Algorithm Initialization: kmeans++: Select the first point uniformly at random; for (int i=1; i < k; ++i) { Select the next point p with probability $D^2(p) / \sum_p D^2(p)$; UpdateDistances(); } Very Sequential! Must update all distances before selecting the next cluster 72

81 Goal: Simulate k-means++ k-means++: Is a soft greedy algorithm Adapt sample & prune technique Sample multiple points in each round In the end, prune back down to k points 73

82 k-means|| kmeans++: Select the first point uniformly at random; for (int i=1; i < k; ++i) { Select the next point p with probability $D^2(p) / \sum_p D^2(p)$; UpdateDistances(); } 74

83 k-means|| kmeans++: Select the first point c uniformly at random; for (int i=1; i < $\log_\ell(\phi(X, c))$; ++i) { Select each point p independently with probability $k\, \ell\, D^2(p) / \sum_x D^2(x)$; UpdateDistances(); } Prune to k points total by clustering the clusters 75

84 k-means|| kmeans++: Select the first point c uniformly at random; for (int i=1; i < $\log_\ell(\phi(X, c))$; ++i) { Select each point p independently with probability $k\, \ell\, D^2(p) / \sum_x D^2(x)$; UpdateDistances(); } Prune to k points total by clustering the clusters. Independent selection: Easy MR 76

85 k-means|| ($\ell$: Oversampling Parameter) kmeans++: Select the first point c uniformly at random; for (int i=1; i < $\log_\ell(\phi(X, c))$; ++i) { Select each point p independently with probability $k\, \ell\, D^2(p) / \sum_x D^2(x)$; UpdateDistances(); } Prune to k points total by clustering the clusters. Independent selection: Easy MR 77

86 k-means|| ($\ell$: Oversampling Parameter) kmeans++: Select the first point c uniformly at random; for (int i=1; i < $\log_\ell(\phi(X, c))$; ++i) { Select each point p independently with probability $k\, \ell\, D^2(p) / \sum_x D^2(x)$; UpdateDistances(); } Prune to k points total by clustering the clusters (the re-clustering step). Independent selection: Easy MR 78
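A rough sketch of the oversampling loop above. The per-point coin flips are independent, so each iteration of the outer loop is one MapReduce round over the data; the sampling probability and round count follow the slide's notation and are assumptions of this sketch. The returned centers still have to be pruned back to k by clustering them (e.g. with a weighted run of k-means++):

```python
import numpy as np

def kmeans_parallel_seed(X, k, ell, rounds, seed=0):
    """k-means|| style oversampling: each round keeps every point independently
    with probability min(1, k * ell * D^2(p) / sum_x D^2(x))."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                 # first center c uniformly at random
    dist2 = np.sum((X - centers[0]) ** 2, axis=1)       # D^2(p) for every point
    for _ in range(rounds):                             # roughly log_ell(phi(X, c)) rounds
        probs = np.minimum(1.0, k * ell * dist2 / dist2.sum())
        for i in np.nonzero(rng.random(len(X)) < probs)[0]:   # independent selection: easy MR
            centers.append(X[i])
            dist2 = np.minimum(dist2, np.sum((X - X[i]) ** 2, axis=1))
    return np.array(centers)   # about k * ell * rounds points; prune back down to k
```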

87 k-means||: Analysis How Many Rounds? Theorem: After $O(\log_\ell(n\phi))$ rounds, we guarantee an $O(1)$ approximation. In practice: fewer iterations are needed. Need to re-cluster $O(k\, \ell\, \log_\ell(n\phi))$ intermediate centers. Discussion: The number of rounds is independent of k. Tradeoff between the number of rounds and memory 79

88 How well does this work? (figure: clustering cost vs. number of rounds on the KDD dataset, k=65, comparing random initialization, k-means++, and k-means|| with $\ell/k$ = 1, 2, 4) 80

89 Performance vs. k-means++ Even better on small datasets: 4600 points, 50 dimensions (SPAM). (table: accuracy and time in iterations) 81

90 Conclusion: ML Three prevalent ideas: If the dimension is small: prune (in a smart way) If the dimension is large, figure out what to optimize All-Reduce For clustering, adapt methods by oversampling & pruning 82

91 Conclusion: MapReduce Overall: A robust implementation of the BSP model Easy to work with, easy to think about Things to Keep in Mind: The data will be skewed! For specific classes of problems, additional optimizations possible Graphs with Pregel, ML with All-Reduce Wisely sampling the input & using the sample gets you very far 83

92 Conclusion: MapReduce Overall: A robust implementation of the BSP model Easy to work with, easy to think about Things to Keep in Mind: The data will be skewed! For specific classes of problems, additional optimizations possible Graphs with Pregel, ML with All-Reduce Wisely sampling the input & using the sample gets you very far Apparently it never rains in Aarhus :-) 84

93 References
Map-Reduce for Machine Learning on Multicore. Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, Kunle Olukotun. NIPS 2006.
Space-Round Tradeoffs for MapReduce Computations. Andrea Pietracaprina, Geppino Pucci, Matteo Riondato, Francesco Silvestri, Eli Upfal. ICS 2012.
Efficient Noise-Tolerant Learning from Statistical Queries. Michael Kearns. STOC 1993.
A Reliable Effective Terascale Linear Learning System. Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford. arXiv 2011.
k-means++: The Advantages of Careful Seeding. David Arthur, S.V. SODA 2007.
Scalable k-means++. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, S.V. VLDB 2012.
