Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce
Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist
*ICSI Berkeley / Zuse Institut Berlin
4/26/2010  Joos-Hendrik Boese  Slide 2/14

Motivation & Idea
Exploratory data studies with MapReduce are tedious:
- MR jobs are typically long-running
- The user is forced to wait for the final result of a job before she can decide on the next query
- But: "speed of thought" performance is required for interactive analysis! DSS and OLAP users are demanding
- An exact (final) result is rarely required in non-lookup queries!
How to reduce time-to-solution?
i. Increase the degree of parallelization
ii. Exploit preliminary results, as seen in DBMSs (Online Aggregation)
[Figure: axes "degree of parallelization" (MapReduce) vs. "degree of exploiting preliminary results" (Online Aggregation); combining both targets time-to-solution beyond today's systems]
Idea: combine MR parallelization with Online Aggregation to achieve time-to-solution beyond today's systems!
Agenda
- Online Aggregation
- Streaming MapReduce Framework
- Assessing the Quality of Preliminary Results
- Experiments for Single-Phase ML Algorithms
- How to Implement Online Aggregation for Multi-Phase ML Algorithms
- Reducing K-Means to a Single-Phase Algorithm
- Experimental Results for Online K-Means
Online Aggregation
- Developed for large-scale data analysis on RDBMSs [Hellerstein et al. 1997]
- For an SQL aggregate, progress, a preliminary result, and confidence intervals are provided during query execution
- Puts the user in the driver's seat:
    SELECT AVG(grade) FROM ENROLL GROUP BY major;
  i. the user can cancel futile queries prematurely
  ii. the user can stop execution once the preliminary result is precise enough
- Often the aggregate estimate is close to the final result after processing only a small fraction of the input data
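The running-estimate idea can be sketched in a few lines. This is a minimal, single-threaded illustration (not Hellerstein et al.'s implementation), assuming a CLT-based confidence interval for a streaming AVG; the function name and parameters are illustrative:

```python
import math

def online_avg(stream, z=1.96, report_every=1000):
    """Maintain a running AVG with an approximate 95% confidence interval,
    as in Online Aggregation: after each block of rows the user sees the
    current estimate and can decide to stop early."""
    n, total, sq_total = 0, 0.0, 0.0
    reports = []
    for x in stream:
        n += 1
        total += x
        sq_total += x * x
        if n % report_every == 0:
            mean = total / n
            var = max(sq_total / n - mean * mean, 0.0)
            half_width = z * math.sqrt(var / n)  # CLT-based interval
            reports.append((n, mean, half_width))
    return reports
```

As more rows are consumed, the interval half-width shrinks roughly with 1/sqrt(n), which is why estimates often stabilize after a small fraction of the input.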
Online Aggregation in MapReduce
How to implement online aggregation? The standard MapReduce system model is batch-oriented!
- Each operator executes completely before producing any output
- The map phase must be finished before the reduce phase is started
[Figure: standard MapReduce system model — m map workers read the input files and write intermediate result files; n reduce workers read, sort, and reduce them into k result files; the map phase strictly precedes the reduce phase]
To implement online aggregation, results must be continuously streamed from the mappers to the reducers!
Streaming MapReduce
- Mappers send <k,v> pairs directly to the reducer responsible for k
- Progress information is added to each <k,v> pair by a control component
- Reducers output their results regularly to a collector component
- The collector computes preliminary (and final) results
- Preliminary results are used for convergence estimation and visualization
- Currently implemented as a shared-memory framework
[Figure: streaming MapReduce system model — m mapper threads feed n combiners (with a dictionary for key reduction) and n reducer threads; a control thread monitors progress; a collector thread delivers preliminary results (e.g. at 10%) and the final result (100%) to the online collector/visualizer]
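The contrast with the batch model can be illustrated with a toy, single-threaded word-count sketch (the real framework is multi-threaded with combiners and a control thread; this only shows the streaming idea): mappers emit <word, 1> straight into per-key reducer state instead of waiting for the map phase to finish, and the collector snapshots preliminary results as input is consumed.

```python
from collections import defaultdict

def streaming_wordcount(input_files, snapshot_every=1):
    """Toy sketch of the streaming model: emitted pairs update reducer
    state immediately; the collector takes a snapshot (a preliminary
    result) after every `snapshot_every` input files."""
    counts = defaultdict(int)   # reducer state, keyed by word
    snapshots = []              # collector: preliminary results
    for i, text in enumerate(input_files, 1):
        for word in text.split():   # mapper: emit <word, 1>
            counts[word] += 1       # reducer: incremental aggregate
        if i % snapshot_every == 0:
            snapshots.append(dict(counts))
    return snapshots
```

Each snapshot is a usable preliminary result, which is exactly what the convergence estimation on the next slide consumes.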
Convergence Estimation
- The collector component saves and visualizes intermediate results using signatures of intermediate results
- A signature sig(R) of each intermediate result R is saved to plot a convergence graph
- sig(R) is either the full result R, a compact representation of R, or the quality of R
- A convergence graph depicts the distance between intermediate results using a specific metric dif(sig(R_i), sig(R_j))
Example: relative word frequencies
- R is a histogram (word -> relative frequency)
- sig(R) = R
- dif() is the mean squared error
The online convergence curve plots dif(sig(R_{i-1}), sig(R_i)); its slope indicates the degree of change between successive results
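For the word-frequency example above, the curve can be computed as follows — a small sketch assuming histogram signatures stored as dicts (missing keys count as frequency 0):

```python
def mse(sig_a, sig_b):
    """dif() for histogram signatures: mean squared error over the
    union of keys of the two histograms."""
    keys = set(sig_a) | set(sig_b)
    return sum((sig_a.get(k, 0.0) - sig_b.get(k, 0.0)) ** 2 for k in keys) / len(keys)

def online_convergence_curve(signatures):
    """Points dif(sig(R_{i-1}), sig(R_i)) for successive preliminary
    results; a curve flattening toward 0 suggests convergence."""
    return [mse(signatures[i - 1], signatures[i]) for i in range(1, len(signatures))]
```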
Experiments: Single-Phase Algorithms
Single-phase algorithms and datasets:
- word count (wc), texts of the Gutenberg project
- linear regression (lr), synthetic dataset (s)
- PCA (pca), synthetic dataset (s)
- Naive Bayes (nb), (i) spam dataset (m), (ii) URL dataset (u)
Offline convergence curves:
- Offline convergence curve: dif(sig(R_i), sig(R_f)), where R_f is the final result
- D is the 95th percentile of the dif values
- S(y) is the smallest fraction of the input data after which all subsequently encountered dif values are smaller than y% of D
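The S(y) statistic can be read off an offline convergence curve like this — a sketch under the stated definition, assuming the curve is given as parallel lists of input fractions and dif values sorted by fraction (the function name is illustrative):

```python
def s_metric(fractions, difs, y):
    """S(y): smallest input fraction after which every later dif value
    stays below y% of D, where D is the 95th percentile of all difs."""
    d = sorted(difs)[int(0.95 * (len(difs) - 1))]  # 95th percentile
    threshold = (y / 100.0) * d
    s = fractions[-1]
    # scan backwards while the tail of the curve stays under the threshold
    for frac, dif in zip(reversed(fractions), reversed(difs)):
        if dif >= threshold:
            break
        s = frac
    return s
```

A small S(y) means the algorithm's preliminary results stabilize after seeing only a small slice of the input.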
Multi-Phase Algorithms
Iterative algorithms like k-means, EM, and SVM require multiple MR phases
Example: multi-phase k-means clustering on MR
- Input: k cluster centroids G_i (i = 1..k) and partitioned input data D
- Map task m computes C^m_i.s, the sum of the sample vectors it assigned to centroid G_i, and C^m_i.n, their count
- The reducer can compute the new centroids as G'_i = (sum_{m=1..n} C^m_i.s) / (sum_{m=1..n} C^m_i.n)
[Figure: MapReduce job 1 — mappers 1..n send their C^m_i.s, C^m_i.n to a reducer that computes G'; MapReduce job 2 repeats this with G' over D to compute G'', and so on]
How to derive preliminary results?
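One such phase can be sketched directly from the formulas above — a sequential stand-in where each partition plays the role of one map task and a single reducer aggregates the partial sums (the function name and data layout are assumptions for illustration):

```python
import math

def kmeans_mr_iteration(partitions, centroids):
    """One MapReduce phase of multi-phase k-means: per centroid i, each
    'map task' contributes the partial sum C^m_i.s and count C^m_i.n of
    the samples it assigned to G_i; the 'reducer' forms
    G'_i = (sum_m C^m_i.s) / (sum_m C^m_i.n)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]   # reducer: sum_m C^m_i.s
    counts = [0] * k                          # reducer: sum_m C^m_i.n
    for part in partitions:                   # one loop body per map task
        for x in part:
            i = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            sums[i] = [a + b for a, b in zip(sums[i], x)]
            counts[i] += 1
    new = []
    for i in range(k):
        if counts[i]:
            new.append([s / counts[i] for s in sums[i]])
        else:
            new.append(list(centroids[i]))    # keep empty clusters unchanged
    return new
```

The batch algorithm launches a fresh MR job per iteration; the next slides remove that restriction.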
Single-Phase K-Means
Idea: re-submit preliminary centroids G to the mappers
- G is updated via an event mechanism
[Figure: mappers assign the samples of their partitions D_1..D_n to the current G and emit C^m_i.s, C^m_i.n; the reducer periodically computes updated global centroids G and feeds them back to the mappers; the online collector/visualizer writes preliminary results (e.g. at 10%) and the final result (100%)]
Problem: a mapper has to revisit samples it already processed whenever an update of the global centroids G is received
- Not all samples may fit into memory!
Auxiliary Clusters
- Each sample s is stored in an auxiliary cluster (AC)
- An AC represents a group of similar samples by two values: a_c, the centroid of the AC, and a_n, the number of samples it represents
- The a_c are treated as data points weighted by a_n
[Figure: mapper m assigns the samples of D_m to the current G, creates/updates ACs, and assigns the a_c (weighted by a_n) to G, emitting C^m_i.s, C^m_i.n; the reducer periodically computes and updates the global centroids G]
Update and creation of ACs:
- A new AC is created for each s until the maximum number L of ACs is reached
- s is added to the closest AC if its distance is smaller than the largest distance between two ACs
- Otherwise the two ACs with the smallest distance are merged, and a new AC is created for s
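The three update rules can be sketched as follows — a straightforward O(L^2) illustration of the bookkeeping described above, with each AC stored as a pair [a_c, a_n] (the function name and representation are assumptions, not the paper's code):

```python
import math

def update_acs(acs, s, max_acs):
    """Auxiliary-cluster bookkeeping: create a new AC per sample until
    the limit L (max_acs) is reached; then either fold the sample into
    its closest AC or merge the two closest ACs to free a slot for it."""
    if len(acs) < max_acs:
        acs.append([list(s), 1])
        return acs
    # distance of s to its closest AC
    nearest = min(range(len(acs)), key=lambda i: math.dist(s, acs[i][0]))
    d_near = math.dist(s, acs[nearest][0])
    # all pairwise distances between ACs
    pairs = [(math.dist(acs[i][0], acs[j][0]), i, j)
             for i in range(len(acs)) for j in range(i + 1, len(acs))]
    d_max = max(p[0] for p in pairs)
    if d_near < d_max:
        c, n = acs[nearest]           # fold s into the closest AC
        acs[nearest] = [[(ci * n + si) / (n + 1) for ci, si in zip(c, s)], n + 1]
    else:
        _, i, j = min(pairs)          # merge the two closest ACs ...
        (ci, ni), (cj, nj) = acs[i], acs[j]
        merged = [[(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)], ni + nj]
        acs[j:j + 1] = []             # remove j (j > i, so i stays valid)
        acs[i] = merged
        acs.append([list(s), 1])      # ... and open a new AC for s
    return acs
```

Since only L centroid/count pairs are kept per mapper, re-assigning "samples" after a centroid update touches at most L points regardless of partition size.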
Experiments: Online K-Means
- Dataset: US Census (1990) with 2.4 million samples
- Parameters: k = 5; number of ACs L = 1000; dif = MSE
[Figure: offline convergence graph; the total sum of distances of online k-means approximates the offline result]
- R_f of online k-means differs from that of standard k-means
- Accuracy is evaluated using the total sum of distances U
- Standard k-means: U_off = 1.52*10^9
- Online k-means: U_on = 1.98*10^9
- Accuracy increases with L
THANK YOU!
Summary & Conclusion
- We presented a system model and approaches to exploit the potential of online aggregation in MapReduce
- We presented an approach to handle iterative algorithms that need multiple MapReduce phases
- Experiments show that our algorithms are capable of combining the benefits of online aggregation with parallelism in a streaming MapReduce framework
Future research includes:
- adapting other iterative data mining algorithms such as SVM or EM
- porting extensions for online aggregation to Hadoop Online
Progress Estimation
To estimate progress we monitor:
i. the size of the input data processed by all mappers
ii. for each key k, how many pairs are emitted by the mappers per input byte
- Progress is monitored per key k, i.e., how much of the input contributed to a <k,v> pair emitted by a reducer
- This allows for algorithm-specific progress metrics when the progress of reducers is heterogeneous
Example: (<k_a, v>, p=25%) and (<k_b, v>, p=75%) arrive at the online collector, which emits a preliminary result at 25% progress
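One use of the per-key progress annotation is extrapolating preliminary aggregates. A minimal collector-side sketch, assuming count-like aggregates and a linear extrapolation (function and key names are illustrative, matching the p=25%/p=75% example above):

```python
def preliminary_estimate(partial_counts):
    """Given partial per-key counts annotated with progress p (the
    fraction of the input that contributed to each key), scale them up
    to an estimate of the final counts: e.g. 50 occurrences seen at
    p=25% extrapolate to ~200."""
    return {k: count / p for k, (count, p) in partial_counts.items()}
```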
Experimental Results: Single-Phase K-Means
[Figure slide: experimental results for single-phase k-means]