Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University

1 Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University

2 Roadmap: Call to action to improve automatic optimization techniques in MapReduce frameworks. Challenges & promising directions. [Diagram labels: Pig, Hive, JAQL; Hadoop; HDFS.]

3 Lifecycle of a MapReduce Job: Map function. Reduce function. Run this program as a MapReduce job.

5 Lifecycle of a MapReduce Job. [Timeline figure: Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2.] How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
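The question on this slide can be made concrete with a little arithmetic. The sketch below assumes one map task per input split and treats a "wave" as one round of tasks that fills every available slot. The 64 MB split size, the 34 concurrent slots, and the 60 reduce tasks are illustrative assumptions, not values taken from the slides.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers (avoids float rounding)."""
    return -(-a // b)


def num_splits(input_bytes: int, split_bytes: int) -> int:
    """Number of input splits, assuming fixed-size splits."""
    return ceil_div(input_bytes, split_bytes)


def num_waves(num_tasks: int, concurrent_slots: int) -> int:
    """Rounds needed to run num_tasks when concurrent_slots tasks run at once."""
    return ceil_div(num_tasks, concurrent_slots)


GB = 1024 ** 3
MB = 1024 ** 2

# Assumed values for illustration only.
map_tasks = num_splits(50 * GB, 64 * MB)   # one map task per split
map_waves = num_waves(map_tasks, 34)       # 34 concurrent map slots (assumed)
reduce_waves = num_waves(60, 34)           # 60 reduce tasks, 34 slots (assumed)

print(map_tasks, map_waves, reduce_waves)
```

Changing the split size or the slot count changes the number of waves, which is one reason these settings interact with running time.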

6 Job Configuration Parameters: 190+ parameters in Hadoop. Set manually, or defaults are used. Are defaults or rules of thumb good enough?

7 Experiments: On EC2 and local clusters. [Figure: four plots; y-axes are running time in minutes (two plots) and running time in seconds (two plots).]

8 Illustrative Result: 50GB Terasort. 17-node cluster, concurrent map+reduce slots. [Table columns: mapred.reduce.tasks, io.sort.factor, io.sort.record.percent, running time; one row is marked "based on popular rule of thumb".] Performance at default and rule-of-thumb settings can be poor. Cross-parameter interactions are significant.
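The slide does not say which rule of thumb it used. As one illustration of what such a rule looks like, the Hadoop MapReduce tutorial suggests setting the number of reduces to roughly 0.95 or 1.75 times the total number of reduce slots in the cluster. The sketch below applies that rule; the figure of 2 reduce slots per node is an assumption, and the function name is hypothetical.

```python
def rule_of_thumb_reducers(nodes: int,
                           reduce_slots_per_node: int,
                           factor: float = 0.95) -> int:
    """Reducer count from the 0.95x / 1.75x rule of thumb.

    factor=0.95 lets all reduces start in a single wave;
    factor=1.75 lets faster nodes run a second wave for load balancing.
    """
    return max(1, int(factor * nodes * reduce_slots_per_node))


# 17 nodes as on the slide; 2 reduce slots per node is an assumption.
single_wave = rule_of_thumb_reducers(17, 2, factor=0.95)
two_waves = rule_of_thumb_reducers(17, 2, factor=1.75)

setting = {"mapred.reduce.tasks": single_wave}
print(setting, two_waves)
```

A rule like this fixes one parameter in isolation, so it cannot account for the cross-parameter interactions the slide reports.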

9 Problem Space. [Diagram labels: complexity; space of execution choices; multi-job workflows; declarative HiveQL/Pig operations; job configuration parameters; energy considerations; cost in pay-as-you-go environment; performance objectives.] Current approaches: predominantly manual; post-mortem analysis. Is this where we want to be?

10 Can DB Query Optimization Technology Help? [Diagram: the database pipeline (query -> optimizer: enumerate, cost, search -> plan -> database execution engine -> results) overlaid with its MapReduce analogue (MapReduce job -> optimizer: enumerate, cost, search -> good setting of parameters -> Hadoop -> results).] But: MapReduce jobs are not declarative; there is no schema about the data; what is the impact of concurrent jobs & scheduling?; the space of parameters is huge. Can we: borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years? Or innovate! Exploit design & usage properties of MapReduce frameworks.
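The enumerate / cost / search loop named on this slide can be sketched as an exhaustive search over a small grid of parameter settings. This is a minimal illustration, not the author's optimizer: the candidate values are made up, and `toy_cost` is a hypothetical stand-in for a real cost model (or a measured running time), with an interaction term added only to show that parameters cannot be tuned one at a time.

```python
from itertools import product


def enumerate_settings(space):
    """Enumerate: yield every combination of candidate parameter values."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def toy_cost(setting):
    """Cost: hypothetical model, NOT derived from Hadoop measurements."""
    r = setting["mapred.reduce.tasks"]
    f = setting["io.sort.factor"]
    # Per-parameter terms plus an r*f interaction term.
    return 600 / r + 0.5 * r + 100 / f + 0.01 * r * f


def search(space, cost):
    """Search: return the enumerated setting with the lowest estimated cost."""
    return min(enumerate_settings(space), key=cost)


# Candidate values are illustrative assumptions.
space = {
    "mapred.reduce.tasks": [1, 16, 32, 64],
    "io.sort.factor": [10, 100, 500],
}

best = search(space, toy_cost)
print(best)
```

Exhaustive enumeration works for 12 combinations but not for a space of 190+ parameters, which is the slide's point about the parameter space being huge.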

11 Spectrum of Query Optimizers. Conventional optimizers: cost models + statistics about data; rule-based. AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks. Insight: Predictability(RBO) >> Predictability(CBO).

12 Spectrum of Query Optimizers. Conventional optimizers: cost models + statistics about data; rule-based. Learning optimizers (learn from execution & adapt). Tuning optimizers (proactively try different plans). AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks. Insight: Predictability(RBO) >> Predictability(CBO).

13 Spectrum of Query Optimizers. Conventional optimizers: cost models + statistics about data; rule-based. Learning optimizers (learn from execution & adapt). Tuning optimizers (proactively try different plans). Exploit usage & design properties of MapReduce frameworks: high ratio of repeated jobs to new jobs; schema can be learned (e.g., Pig scripts); common sort-partition-merge skeleton; mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results); fine-grained and pluggable scheduler.
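A high ratio of repeated jobs to new jobs is what makes a learning optimizer practical: each run of a job is also a measurement of one parameter setting. The sketch below shows the simplest version of that idea, remembering observed running times per job and recommending the fastest setting seen so far. The class, the job signature string, and the timings are all hypothetical.

```python
from collections import defaultdict


class ExecutionHistory:
    """Remembers (running time, setting) pairs for each repeated job."""

    def __init__(self):
        self._runs = defaultdict(list)

    def record(self, job_signature, setting, seconds):
        """Store one observed execution of a job under a given setting."""
        self._runs[job_signature].append((seconds, dict(setting)))

    def recommend(self, job_signature, default):
        """Return the fastest setting seen for this job, else the default."""
        runs = self._runs.get(job_signature)
        if not runs:
            return dict(default)
        fastest = min(runs, key=lambda run: run[0])
        return dict(fastest[1])


# Made-up job name and timings, for illustration only.
default = {"mapred.reduce.tasks": 1}
history = ExecutionHistory()
history.record("nightly-sort", {"mapred.reduce.tasks": 1}, seconds=900.0)
history.record("nightly-sort", {"mapred.reduce.tasks": 32}, seconds=310.0)
history.record("nightly-sort", {"mapred.reduce.tasks": 64}, seconds=350.0)

print(history.recommend("nightly-sort", default))
print(history.recommend("never-seen-job", default))
```

This only replays the best setting already tried; a tuning optimizer in the slide's sense would also choose new settings to try proactively.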

14 Summary: Call to action to improve automatic optimization techniques in MapReduce frameworks. Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc. Rich history to learn from. MapReduce execution creates unique opportunities/challenges.
