Data-Intensive Computing with MapReduce
1 Data-Intensive Computing with MapReduce
Session 7: Clustering and Classification
Jimmy Lin, University of Maryland
Thursday, March 7, 2013
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
2 Today's Agenda
- Personalized PageRank
- Clustering
- Classification
3 Clustering Source: Wikipedia (Star cluster)
4 Problem Setup
Arrange items into clusters:
- High similarity between objects in the same cluster
- Low similarity between objects in different clusters
Cluster labeling is a separate problem
5 Applications
- Exploratory analysis of large collections of objects
- Image segmentation
- Recommender systems
- Cluster hypothesis in information retrieval
- Computational biology and bioinformatics
- Pre-processing for many other algorithms
6 Three Approaches
- Hierarchical
- K-Means
- Gaussian Mixture Models
7 Hierarchical Agglomerative Clustering
Start with each document in its own cluster.
Until there is only one cluster:
- Find the two clusters c_i and c_j that are most similar
- Replace c_i and c_j with a single cluster c_i \cup c_j
The history of merges forms the hierarchy
8 HAC in Action [dendrogram built over items A, B, C, D, E, F, G, H]
9 Cluster Merging
Which two clusters do we merge? What's the similarity between two clusters?
- Single Link: similarity of the two most similar members
- Complete Link: similarity of the two least similar members
- Group Average: average similarity between members
10 Link Functions
Single link uses the maximum similarity of pairs:
sim(c_i, c_j) = \max_{x \in c_i, y \in c_j} sim(x, y)
- Can result in straggly (long and thin) clusters due to a chaining effect
Complete link uses the minimum similarity of pairs:
sim(c_i, c_j) = \min_{x \in c_i, y \in c_j} sim(x, y)
- Makes tighter, more spherical clusters
11 MapReduce Implementation
What's the inherent challenge?
One possible approach:
- Iteratively use LSH to group together similar items
- When the dataset is small enough, run HAC in memory on a single machine
- Observation: structure at the leaves is not very important
12 K-Means Algorithm
Let d be the distance between documents. Define the centroid of a cluster as:
\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
Select k random instances {s_1, s_2, ..., s_k} as seeds.
Until clusters converge:
- Assign each instance x_i to the cluster c_j such that d(x_i, s_j) is minimal
- Update the seeds to the centroid of each cluster: for each cluster c_j, s_j = \mu(c_j)
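The assign/update loop above can be sketched for 1-D points. A fixed iteration count stands in for the convergence check, and all names are illustrative, not from the slides.

```python
# Minimal sketch of the k-means loop for 1-D points; illustrative names only.

def kmeans(points, seeds, iters=20):
    centroids = list(seeds)
    for _ in range(iters):
        # assignment step: each instance goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            j = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            clusters[j].append(x)
        # update step: each seed becomes the centroid (mean) of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids
```

With well-separated seeds, e.g. `kmeans([0.0, 1.0, 10.0, 11.0], [0.0, 10.0])`, the centroids settle on the two group means after the first iteration.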
13 K-Means Clustering Example
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
14 Basic MapReduce Implementation
class Mapper
  method Configure()
    c ← LoadClusters()
  method Map(id i, point p)
    n ← NearestClusterID(clusters c, point p)
    p ← ExtendPoint(point p)    // just a clever way to keep track of the denominator
    Emit(clusterid n, point p)
class Reducer
  method Reduce(clusterid n, points [p1, p2, ...])
    s ← InitPointSum()
    for all point p ∈ points do
      s ← s + p
    m ← ComputeCentroid(point s)
    Emit(clusterid n, centroid m)
15 MapReduce Implementation w/ IMC
class Mapper
  method Configure()
    c ← LoadClusters()
    H ← InitAssociativeArray()
  method Map(id i, point p)
    n ← NearestClusterID(clusters c, point p)
    p ← ExtendPoint(point p)
    H{n} ← H{n} + p
  method Close()
    for all clusterid n ∈ H do
      Emit(clusterid n, point H{n})
class Reducer
  method Reduce(clusterid n, points [p1, p2, ...])
    s ← InitPointSum()
    for all point p ∈ points do
      s ← s + p
    m ← ComputeCentroid(point s)
    Emit(clusterid n, centroid m)
16 Implementation Notes
Standard setup of iterative MapReduce algorithms:
- Driver program sets up the MapReduce job
- Waits for completion
- Checks for convergence
- Repeats if necessary
Must be able to keep cluster centroids in memory:
- With large k and large feature spaces, potentially an issue
- Memory requirements of centroids grow over time!
Variant: k-medoids
17 Clustering w/ Gaussian Mixture Models Model data as a mixture of Gaussians Given data, recover model parameters Source: Wikipedia (Cluster analysis)
18 Gaussian Distributions
Univariate Gaussian (i.e., Normal):
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
A random variable with such a distribution is written x \sim N(\mu, \sigma^2).
Multivariate Gaussian:
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
A vector-valued random variable with such a distribution is written x \sim N(\mu, \Sigma).
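The univariate density above is a one-liner to evaluate; this is a direct transcription of the formula (the function name is an assumption), with `var` standing for the variance \sigma^2:

```python
import math

# Univariate Gaussian density p(x; mu, sigma^2), transcribed from the formula.
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

At the mean the density equals 1 / \sqrt{2\pi\sigma^2}, and it falls off monotonically on either side.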
19 Univariate Gaussian Source: Wikipedia (Normal Distribution)
20 Multivariate Gaussians [figure: heatmaps of the density of axis-aligned multivariate Gaussians with mean \mu and diagonal covariance matrix \Sigma]
Source: Lecture notes by Chuong B. Do (IIT Delhi)
21 Gaussian Mixture Models
Model parameters:
- Number of components: K
- Mixing weight vector: \pi_{1:K}
- For each Gaussian, a mean and covariance matrix: \mu_{1:K}, \Sigma_{1:K}
Varying constraints on covariance matrices:
- Spherical vs. diagonal vs. full
- Tied vs. untied
22 Learning for Simple Univariate Case
Problem setup:
- Given the number of components: K
- Given points: x_{1:N}
- Learn parameters: \pi, \mu_{1:K}, \sigma^2_{1:K}
Model selection criterion: maximize the likelihood of the data
- Introduce indicator variables: z_{n,k} = 1 if x_n is in cluster k, 0 otherwise
- Likelihood of the data: p(x_{1:N}, z_{1:N,1:K} \mid \mu_{1:K}, \sigma^2_{1:K}, \pi)
23 EM to the Rescue!
We're faced with this: p(x_{1:N}, z_{1:N,1:K} \mid \mu_{1:K}, \sigma^2_{1:K}, \pi)
- It'd be a lot easier if we knew the z's!
Expectation Maximization:
- Guess the model parameters
- E-step: compute the posterior distribution over the latent (hidden) variables given the model parameters
- M-step: update the model parameters using the posterior distribution computed in the E-step
- Iterate until convergence
24
25 EM for Univariate GMMs
Initialize: \pi, \mu_{1:K}, \sigma^2_{1:K}
Iterate:
- E-step: compute the expectation of the z variables
E[z_{n,k}] = \frac{\pi_k N(x_n \mid \mu_k, \sigma^2_k)}{\sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \sigma^2_{k'})}
- M-step: compute new model parameters
\pi_k = \frac{1}{N} \sum_n E[z_{n,k}]
\mu_k = \frac{\sum_n E[z_{n,k}] x_n}{\sum_n E[z_{n,k}]}
\sigma^2_k = \frac{\sum_n E[z_{n,k}] (x_n - \mu_k)^2}{\sum_n E[z_{n,k}]}
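The E- and M-steps can be sketched for the univariate case. This is an illustrative in-memory sketch, not the deck's code: \pi denotes the mixing weights, a fixed iteration count stands in for a convergence test, and the toy data and initialization are assumptions.

```python
import math

def normal_pdf(x, mu, var):
    # univariate Gaussian density N(x | mu, sigma^2)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, pi, mus, sigma2s, iters=50):
    K, N = len(mus), len(xs)
    for _ in range(iters):
        # E-step: responsibilities E[z_{n,k}]
        z = []
        for x in xs:
            w = [pi[k] * normal_pdf(x, mus[k], sigma2s[k]) for k in range(K)]
            total = sum(w)
            z.append([wk / total for wk in w])
        # M-step: re-estimate pi_k, mu_k, sigma^2_k from the responsibilities
        nk = [sum(z[n][k] for n in range(N)) for k in range(K)]
        pi = [nk[k] / N for k in range(K)]
        mus = [sum(z[n][k] * xs[n] for n in range(N)) / nk[k] for k in range(K)]
        sigma2s = [sum(z[n][k] * (xs[n] - mus[k]) ** 2 for n in range(N)) / nk[k]
                   for k in range(K)]
    return pi, mus, sigma2s

# toy data: two well-separated groups around -1 and 4
pi, mus, sigma2s = em_gmm(
    [-1.1, -0.9, -1.0, 3.9, 4.1, 4.0],
    pi=[0.5, 0.5], mus=[-2.0, 2.0], sigma2s=[1.0, 1.0])
```

On this toy data the component means converge to the two group means (-1 and 4) with equal mixing weights.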
26 MapReduce Implementation
Map (E-step): for each point x_n, compute the responsibilities E[z_{n,1}], ..., E[z_{n,K}]:
E[z_{n,k}] = \frac{\pi_k N(x_n \mid \mu_k, \sigma^2_k)}{\sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \sigma^2_{k'})}
Reduce (M-step): recompute the model parameters:
\pi_k = \frac{1}{N} \sum_n E[z_{n,k}]
\mu_k = \frac{\sum_n E[z_{n,k}] x_n}{\sum_n E[z_{n,k}]}
\sigma^2_k = \frac{\sum_n E[z_{n,k}] (x_n - \mu_k)^2}{\sum_n E[z_{n,k}]}
27 K-Means vs. GMMs
Map:
- K-Means: compute the distance of points to centroids
- GMM: E-step, compute the expectation of the z indicator variables
Reduce:
- K-Means: recompute the new centroids
- GMM: M-step, update the values of the model parameters
28 Summary
- Hierarchical clustering: difficult to implement in MapReduce
- K-Means: straightforward implementation in MapReduce
- Gaussian Mixture Models: implementation conceptually similar to k-means, with more bookkeeping
29 Classification Source: Wikipedia (Sorting)
30 Supervised Machine Learning
The generic problem of function induction given sample instances of input and output:
- Classification: output draws from finite discrete labels
- Regression: output is a continuous value
Focus here on supervised classification:
- Suffices to illustrate large-scale machine learning
This is not meant to be an exhaustive treatment of machine learning!
31 Applications
- Spam detection
- Content (e.g., movie) classification
- POS tagging
- Friendship recommendation
- Document ranking
- Many, many more!
32 Supervised Binary Classification
Restrict the output label to be binary:
- Yes/No
- 1/0
Binary classifiers form a primitive building block for multi-class problems:
- One-vs-rest classifier ensembles
- Classifier cascades
33 Limits of Supervised Classification?
Why is this a big data problem?
- Isn't gathering labels a serious bottleneck?
Solution: user behavior logs
- Learning to rank
- Computational advertising
- Link recommendation
The virtuous cycle of data-driven products
34 The Task
Given D = \{(x_i, y_i)\}_{i=1}^n, where x_i = [x_1, x_2, x_3, ..., x_d] is a (sparse) feature vector and y \in \{0, 1\} is the label.
Induce f : X \to Y
- Such that the loss is minimized: \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)
Typically, consider functions of a parametric form:
\arg\min_\theta \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)
where \ell is the loss function and \theta are the model parameters.
35 Key insight: machine learning as an optimization problem! (closed form solutions generally not possible)
36 Gradient Descent: Preliminaries
Rewrite: \arg\min_\theta L(\theta), where L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)
Compute the gradient \nabla L(\theta):
- Points in the direction of fastest increase
So, at any point a, a small step against the gradient decreases the loss:
b = a - \gamma \nabla L(a) \implies L(a) \geq L(b)  (with caveats)
37 Gradient Descent: Iterative Update
Start at an arbitrary point, iteratively update:
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})
We have: L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq ...
Lots of details:
- Figuring out the step size
- Getting stuck in local minima
- Convergence rate
- ...
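The iterative update can be sketched on a toy one-dimensional least-squares loss. The loss, data, and fixed step size are illustrative assumptions, not from the slides.

```python
# Batch gradient descent on L(theta) = (1/n) * sum_i (theta * x_i - y_i)^2.

def gradient_descent(xs, ys, theta=0.0, step=0.1, iters=100):
    n = len(xs)
    for _ in range(iters):
        # average gradient of the loss over all training instances
        grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= step * grad  # step against the gradient
    return theta

# data generated by y = 2x, so the minimizer is theta = 2
theta = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With this quadratic loss the error shrinks by a constant factor per iteration, so theta converges to 2 quickly; with a step size that is too large, the same loop diverges, which is the "figuring out the step size" detail above.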
38 Gradient Descent
Repeat until convergence:
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^n \nabla \ell(f(x_i; \theta^{(t)}), y_i)
39 Intuition behind the math
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^n \nabla \ell(f(x_i; \theta^{(t)}), y_i)
New weights = old weights, updated based on the average gradient \nabla \ell of the loss \ell (the multivariate analogue of the derivative \frac{d}{dx}\ell).
40
41 Gradient Descent
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^n \nabla \ell(f(x_i; \theta^{(t)}), y_i)
Source: Wikipedia (Hills)
42 Lots More Details
Gradient descent is a first-order optimization technique:
- Often, slow convergence
- Conjugate techniques accelerate convergence
Newton and quasi-Newton methods:
- Intuition: Taylor expansion
f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2
- Requires the Hessian (square matrix of second-order partial derivatives): impractical to fully compute
43 Logistic Regression Source: Wikipedia (Hammer)
44 Logistic Regression: Preliminaries
Given D = \{(x_i, y_i)\}_{i=1}^n, x_i = [x_1, x_2, x_3, ..., x_d], y \in \{0, 1\}
Let's define f(x; w) : R^d \to \{0, 1\} as:
f(x; w) = 1 if w \cdot x \geq t; 0 if w \cdot x < t
Interpretation:
\ln\left[\frac{\Pr(y=1 \mid x)}{\Pr(y=0 \mid x)}\right] = w \cdot x
\ln\left[\frac{\Pr(y=1 \mid x)}{1 - \Pr(y=1 \mid x)}\right] = w \cdot x
45 Relation to the Logistic Function
After some algebra:
\Pr(y=1 \mid x) = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}}
\Pr(y=0 \mid x) = \frac{1}{1 + e^{w \cdot x}}
The logistic function:
f(z) = \frac{e^z}{1 + e^z}
46 Training an LR Classifier
Maximize the conditional likelihood:
\arg\max_w \prod_{i=1}^n \Pr(y_i \mid x_i, w)
Define the objective in terms of the conditional log likelihood:
L(w) = \sum_{i=1}^n \ln \Pr(y_i \mid x_i, w)
- We know y \in \{0, 1\}, so:
\Pr(y \mid x, w) = \Pr(y=1 \mid x, w)^y \left[1 - \Pr(y=1 \mid x, w)\right]^{(1-y)}
- Substituting:
L(w) = \sum_{i=1}^n y_i \ln \Pr(y_i=1 \mid x_i, w) + (1 - y_i) \ln \Pr(y_i=0 \mid x_i, w)
47 LR Classifier Update Rule
Take the derivative:
L(w) = \sum_{i=1}^n y_i \ln \Pr(y_i=1 \mid x_i, w) + (1 - y_i) \ln \Pr(y_i=0 \mid x_i, w)
\frac{\partial}{\partial w_d} L(w) = \sum_{i=1}^n x_{i,d} \left(y_i - \Pr(y_i=1 \mid x_i, w)\right)
General form of the update rule:
w^{(t+1)} \leftarrow w^{(t)} + \gamma^{(t)} \nabla_w L(w^{(t)})
Final update rule:
w_i^{(t+1)} \leftarrow w_i^{(t)} + \gamma^{(t)} \sum_{j=1}^n x_{j,i} \left(y_j - \Pr(y_j=1 \mid x_j, w^{(t)})\right)
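The final update rule can be sketched as batch gradient ascent on the conditional log likelihood. The toy data, step size, and iteration count are assumptions for illustration; note the '+' in the update, since we maximize the likelihood.

```python
import math

def sigmoid(z):
    # Pr(y = 1 | x, w) = e^{w.x} / (1 + e^{w.x}) = 1 / (1 + e^{-w.x})
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(data, d, step=0.5, iters=200):
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(d):
                grad[i] += x[i] * (y - p)  # x_{j,i} * (y_j - Pr(y_j=1 | x_j, w))
        w = [wi + step * gi for wi, gi in zip(w, grad)]  # ascent: '+' not '-'
    return w

# toy separable data with a bias feature x[0] = 1; label flips around x[1] = 0.5
data = [([1.0, -1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
w = train_lr(data, d=2)
```

After training, the learned weights put positive mass on the second feature and classify the training points on the correct side of Pr(y=1|x) = 0.5.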
48 Lots more details Regularization Different loss functions Want more details? Take a real machine-learning course!
49 MapReduce Implementation
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^n \nabla \ell(f(x_i; \theta^{(t)}), y_i)
[diagram: mappers compute partial gradients; a single reducer updates the model; iterate until convergence]
50 Shortcomings
Hadoop is bad at iterative algorithms:
- High job startup costs
- Awkward to retain state across iterations
High sensitivity to skew:
- Iteration speed bounded by the slowest task
Potentially poor cluster utilization:
- Must shuffle all data to a single reducer
Some possible tradeoffs:
- Number of iterations vs. complexity of computation per iteration
- E.g., L-BFGS: faster convergence, but more to compute
51 Gradient Descent Source: Wikipedia (Hills)
52 Stochastic Gradient Descent Source: Wikipedia (Water Slide)
53 Batch vs. Online
Gradient Descent: batch learning, update the model after considering all training instances
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^n \nabla \ell(f(x_i; \theta^{(t)}), y_i)
Stochastic Gradient Descent (SGD): online learning, update the model after considering each (randomly selected) training instance
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)
In practice, just as good!
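The online update can be sketched on the same style of toy least-squares loss: shuffle the instances each epoch, then update after each single instance. Data, step size, epoch count, and names are illustrative assumptions.

```python
import random

# SGD on the per-instance loss l(theta) = (theta * x - y)^2.

def sgd(xs, ys, theta=0.0, step=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # randomly shuffle training instances
        for i in idx:
            # gradient of the loss on ONE instance, not the average over all n
            grad = 2 * (theta * xs[i] - ys[i]) * xs[i]
            theta -= step * grad
    return theta

# data generated by y = 2x, so every per-instance loss is minimized at theta = 2
theta = sgd([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because every instance here agrees on the same minimizer, SGD converges to theta = 2 exactly; on noisy data the iterates instead hover near the minimizer unless the step size is decayed.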
54 Practical Notes
Most common implementation:
- Randomly shuffle the training instances
- Stream instances through the learner
Single- vs. multi-pass approaches
Mini-batching as a middle ground between batch and stochastic gradient descent
We've solved the iteration problem! What about the single-reducer problem?
55 Ensembles Source: Wikipedia (Orchestra)
56 Ensemble Learning
Learn multiple models, combine results from the different models to make a prediction
Why does it work?
- If errors are uncorrelated, multiple classifiers being wrong is less likely
- Reduces the variance component of error
A variety of different techniques:
- Majority voting
- Simple weighted voting: y = \arg\max_{y \in Y} \sum_{k=1}^n \lambda_k\, p_k(y \mid x)
- Model averaging
- ...
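Simple weighted voting is a one-liner over the component classifiers' posteriors. The toy "classifiers" below are hard-coded scorers and the weights are assumptions, purely for illustration.

```python
# Simple weighted voting: y = argmax_y sum_k lambda_k * p_k(y | x).

def weighted_vote(classifiers, weights, x, labels):
    return max(labels,
               key=lambda y: sum(lam * p_k(y, x)
                                 for lam, p_k in zip(weights, classifiers)))

# three toy scorers returning p_k(y | x) over labels {0, 1}
c1 = lambda y, x: 0.8 if y == 1 else 0.2
c2 = lambda y, x: 0.6 if y == 1 else 0.4
c3 = lambda y, x: 0.3 if y == 1 else 0.7
pred = weighted_vote([c1, c2, c3], [1.0, 1.0, 1.0], x=None, labels=[0, 1])
```

With equal weights, two of the three toy classifiers favor label 1, so the vote goes to 1; upweighting the dissenting classifier flips the prediction, which is how the weights trade off trust among ensemble members.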
57 Practical Notes
Common implementation:
- Train classifiers on different input partitions of the data
- Embarrassingly parallel!
Contrast with bagging
Contrast with boosting
58 MapReduce Implementation
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)
[diagram: each mapper trains its own model via SGD on its partition of the training data]
59 MapReduce Implementation: Details
Shuffle/re-sort the training instances before learning
Two possible implementations:
- Mappers write the model out as side data
- Mappers emit the model as intermediate output
60 Sentiment Analysis Case Study
Lin and Kolcz, SIGMOD 2012
Binary polarity classification: {positive, negative} sentiment
- Independently interesting task
- Illustrates the end-to-end flow
- Use the emoticon trick to gather data
Data:
- Test: 500k positive / 500k negative tweets from 9/1/2011
- Training: {1m, 10m, 100m} instances from before (50/50 split)
Features: sliding-window byte-4grams
Models:
- Logistic regression with SGD (L2 regularization)
- Ensembles of various sizes (simple weighted voting)
61 [chart: accuracy vs. number of classifiers in the ensemble, for single classifiers and ensembles trained on 1m, 10m, and 100m instances]
Ensembles with 10m examples are better than a single classifier with 100m! Diminishing returns; extra accuracy "for free."
62 Takeaway Lesson
Big data recipe for problem solving:
- Simple technique
- Simple features
- Lots of data
Usually works very well!
63 Today's Agenda
- Personalized PageRank
- Clustering
- Classification
64 Questions? Source: Wikipedia (Japanese rock garden)
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationAn Introduction to PDF Estimation and Clustering
Sigmedia, Electronic Engineering Dept., Trinity College, Dublin. 1 An Introduction to PDF Estimation and Clustering David Corrigan corrigad@tcd.ie Electrical and Electronic Engineering Dept., University
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationUnsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.
March 19, 2018 1 / 40 1 2 3 4 2 / 40 What s unsupervised learning? Most of the data available on the internet do not have labels. How can we make sense of it? 3 / 40 4 / 40 5 / 40 Organizing the web First
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP
More informationClustering in R d. Clustering. Widely-used clustering methods. The k-means optimization problem CSE 250B
Clustering in R d Clustering CSE 250B Two common uses of clustering: Vector quantization Find a finite set of representatives that provides good coverage of a complex, possibly infinite, high-dimensional
More informationMixture Models and the EM Algorithm
Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is
More informationCLUSTERING. Quiz information. today 11/14/13% ! The second midterm quiz is on Thursday (11/21) ! In-class (75 minutes!)
CLUSTERING Quiz information! The second midterm quiz is on Thursday (11/21)! In-class (75 minutes!)! Allowed one two-sided (8.5x11) cheat sheet! Solutions for optional problems to HW5 posted today 1% Quiz
More informationCluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008
Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a
More informationk-means demo Administrative Machine learning: Unsupervised learning" Assignment 5 out
Machine learning: Unsupervised learning" David Kauchak cs Spring 0 adapted from: http://www.stanford.edu/class/cs76/handouts/lecture7-clustering.ppt http://www.youtube.com/watch?v=or_-y-eilqo Administrative
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationSGN (4 cr) Chapter 11
SGN-41006 (4 cr) Chapter 11 Clustering Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology February 25, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006 (4 cr) Chapter
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationRandom Forest A. Fornaser
Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 12 Combining
More informationMachine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University K- Means + GMMs Clustering Readings: Murphy 25.5 Bishop 12.1, 12.3 HTF 14.3.0 Mitchell
More informationColorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.
Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Image Segmentation Some material for these slides comes from https://www.csd.uwo.ca/courses/cs4487a/
More informationCS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek
CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function
More informationNatural Language Processing
Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without
More informationHomework #4 Programming Assignment Due: 11:59 pm, November 4, 2018
CSCI 567, Fall 18 Haipeng Luo Homework #4 Programming Assignment Due: 11:59 pm, ovember 4, 2018 General instructions Your repository will have now a directory P4/. Please do not change the name of this
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Unsupervised learning Daniel Hennes 29.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Supervised learning Regression (linear
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Wiltrud Kessler & Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 0-- / 83
More informationUnsupervised Learning
Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised
More information