Optimizing MapReduce for Highly-Distributed Environments

1 Optimizing MapReduce for Highly-Distributed Environments
Abhishek Chandra
Associate Professor, Department of Computer Science and Engineering
University of Minnesota
http://

2 Big Data
Data-rich enterprises and communities
  Both user-facing services and batch data processing
  Commercial, social, scientific
  E.g.: Google, Facebook, Amazon, Akamai, LHC, ...
Data analysis is key
  Search and indexing
  Ad optimization
  Accounting and billing
  Spam detection and network monitoring
  Scientific data analysis

3 Geographically Distributed Data
Commercial. E.g.: warehouse, e-commerce data
Public/social. E.g.: user blogs, traffic data
Access logs. E.g.: CDNs
Scientific. E.g.: oceanic, atmospheric data
Mobile. E.g.: phone pics, sensors

4 Distributed Computation Resources
Distributed data centers/clouds
  E.g.: Amazon EC2 regions
Edge servers
  E.g.: Akamai CDN servers
Computational Grids
  E.g.: FutureGrid, BOINC

5 Highly Distributed Environments
Question: How to analyze distributed data efficiently in such environments?

6 Talk Outline
Motivation
Highly-Distributed MapReduce
Our Research: MapReduce Optimization
Concluding Remarks

7 Highly Distributed Computation
Data import
Initial embarrassingly parallel computation
Grouping / reorganization
Final summarizing computation

8 Highly Distributed Computation
Data import: Push
Initial embarrassingly parallel computation: Map
Grouping / reorganization: Shuffle
Final summarizing computation: Reduce

9 Highly Distributed MapReduce
Our focus: Efficient execution of MapReduce in highly-distributed environments
MapReduce is simple and powerful:
  Designed for scalability and fault-tolerance
  Can express several data analysis algorithms
MapReduce is widely used
  Popularized by open-source Hadoop project
  A rich ecosystem of higher-level languages, tools

10 MapReduce Dataflow
[Figure: dataflow from Input Data through the Push, Map, Shuffle, and Reduce phases]

11 Traditional MapReduce
[Figure: Input Data flowing through the Push, Map, Shuffle, and Reduce phases]
Network and compute nodes largely homogeneous

12 Highly-Distributed MapReduce
[Figure: Push, Map, Shuffle, and Reduce phases spanning Datacenter 1 ... Datacenter N, with inputs Input Data 1 ... Input Data N]

13 Problem: Heterogeneity
[Figure: Push, Map, Shuffle, and Reduce phases across Datacenter 1 ... Datacenter N]
How can MapReduce handle this heterogeneity?

14 Possible Solutions
Centralized Execution
  Push data over WAN
  May limit parallelism
  Problem if large input data
Local push
  Shuffle over WAN
  Poor load balancing
  Problem if large intermediate data
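
The trade-off between the two options can be illustrated with a rough count of the bytes that cross the WAN under each. This is only a sketch under stated assumptions, not the talk's model: the function names are hypothetical, the map output is assumed to be `alpha` times the map input, and keys are assumed to be spread uniformly over one reducer group per site.

```python
def wan_bytes_centralized(input_sizes, central_site):
    """Bytes crossing the WAN when every site pushes its input to one site.

    input_sizes: dict mapping site name -> input bytes stored at that site.
    """
    return sum(size for site, size in input_sizes.items() if site != central_site)


def wan_bytes_local_push(input_sizes, alpha):
    """Bytes crossing the WAN when data is mapped in place and shuffled.

    Assumption: each site emits alpha * size bytes of intermediate data and
    keys are spread uniformly, so a fraction (n - 1) / n leaves the site.
    """
    n = len(input_sizes)
    return sum(alpha * size * (n - 1) / n for size in input_sizes.values())


sites = {"us": 100.0, "eu": 100.0}
# Heavy aggregation (alpha = 0.1): local push moves far less data over the WAN.
print(wan_bytes_centralized(sites, "us"), wan_bytes_local_push(sites, 0.1))
# Data expansion (alpha = 10): the WAN shuffle dominates instead.
print(wan_bytes_centralized(sites, "us"), wan_bytes_local_push(sites, 10.0))
```

Under these assumptions, centralized execution pays for the input size and local push pays for the intermediate size, which matches the two "Problem if ..." bullets on the slide.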

15 Experimental Results: Amazon EC2
Amazon EC2: 6 US, 3 EU small instances, 1 data node each
[Figure: two bar charts of time in seconds for Push US, Push EU, Map, Reduce, Result Combine, and Total, comparing Local MR, Centralized MR, and Distributed MR. Panels: WordCount (Text), large input data; WordCount (Random), large intermediate data. Annotations: data push cost dominant; shuffle cost dominant.]
Performance depends on network, application characteristics

16 Talk Outline
Motivation
Highly-Distributed MapReduce
Our Research: MapReduce Optimization
Concluding Remarks

17 Optimizing MapReduce: Key Ideas
Heterogeneity-aware execution
  Data placement and task scheduling should consider network locality, node speeds
Application-aware optimization
  High data aggregation => Reduce push cost
  Low data aggregation => Reduce shuffle cost
Make globally optimal decisions
  Optimize across phase boundaries by factoring in downstream effects

18 Research Overview
Approach 1: Model-driven MapReduce optimization
Approach 2: Cross-phase optimization in Hadoop

19 Approach 1: Model-Driven Optimization
Key idea: optimize multiple phases to minimize end-to-end execution time
Model MapReduce data flow
Using model, derive optimal execution plan

20 MapReduce Execution Model

21 MapReduce Execution Model
Parameters
  $D_i$: Size of input data at data source $i$
  $B_{ij}$: Link bandwidth from node $i$ to node $j$
  $C_i$: Mapper/Reducer compute rates
  $\alpha$: Ratio of size_out / size_in for map phase
Execution plan
  Each source: where to push data
  Each mapper: where to shuffle data

22 Optimization
Objective: minimize makespan
Constraints
  Each data source (mapper) must push (shuffle) all of its data:
  $\forall (i,j) \in E:\ 0 \le x_{ij} \le 1$
  $\forall i \in V:\ \sum_{(i,j) \in E} x_{ij} = 1$
  One-reducer-per-key: $y_k$ denotes fraction reduced at reducer $k$:
  $\forall j \in M,\ k \in R:\ x_{jk} = y_k$
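
To make the model concrete, the following sketch checks a plan against the constraints above and evaluates its makespan. The slides do not give the makespan formula, so the one used here is an assumption: a global barrier between phases, with each phase lasting as long as its slowest link or node. The names `check_plan` and `makespan` are hypothetical, and the one-reducer-per-key constraint is encoded by sharing a single `y` across all mappers.

```python
def check_plan(push, y, tol=1e-9):
    """Check the slide's constraints on an execution plan.

    push[i][j]: fraction x_ij of source i's data pushed to mapper j.
    y[k]: fraction of every mapper's output shuffled to reducer k.
    """
    for fractions in list(push.values()) + [y]:
        if any(f < -tol or f > 1 + tol for f in fractions.values()):
            return False  # violates 0 <= x_ij <= 1
        if abs(sum(fractions.values()) - 1.0) > tol:
            return False  # node does not send out all of its data
    return True


def makespan(D, B, C, alpha, push, y):
    """Makespan under the assumed barrier model (not taken from the slides).

    D[i]: input bytes at source i; B[(a, b)]: bandwidth from a to b;
    C[n]: compute rate of mapper or reducer n; alpha: map output/input ratio.
    """
    map_input = {}
    push_time = 0.0
    for i, fractions in push.items():
        for j, f in fractions.items():
            volume = D[i] * f
            map_input[j] = map_input.get(j, 0.0) + volume
            if volume > 0:
                push_time = max(push_time, volume / B[(i, j)])
    map_time = max(v / C[j] for j, v in map_input.items())
    shuffle_time = 0.0
    for j, v in map_input.items():
        for k, f in y.items():
            volume = alpha * v * f
            if volume > 0:
                shuffle_time = max(shuffle_time, volume / B[(j, k)])
    intermediate = alpha * sum(map_input.values())
    reduce_time = max(intermediate * f / C[k] for k, f in y.items())
    return push_time + map_time + shuffle_time + reduce_time


# One source, one mapper, one reducer: 10 + 5 + 10 + 5 seconds.
D = {"s1": 100.0}
B = {("s1", "m1"): 10.0, ("m1", "r1"): 5.0}
C = {"m1": 20.0, "r1": 10.0}
plan_push, plan_y = {"s1": {"m1": 1.0}}, {"r1": 1.0}
print(check_plan(plan_push, plan_y), makespan(D, B, C, 0.5, plan_push, plan_y))
```

An optimizer would search over `push` and `y` to minimize this value; the sketch only evaluates a given plan.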

23 Obvious Solutions Aren't
PlanetLab measurements: 4 US, 2 Europe, 2 Asia nodes; 1 data source each
[Figure: two stacked-bar charts of makespan (s) by optimization algorithm, broken down into Push, Map, Shuffle, and Reduce; one panel has α = 0.1, the α value of the other panel is missing]
Neither purely {centralized, distributed} is always better.

24 Benefit of Optimization
PlanetLab measurements: 4 US, 2 Europe, 2 Asia nodes; 1 data source each
[Figure: two stacked-bar charts of makespan (s) by Optimization Algorithm (Uniform, Myopic, Optimized e2e multi), broken down into Push, Map, Shuffle, and Reduce. Panels: α=0.1 (Data Aggregation); α=10 (Data Expansion)]
Model-driven optimization performs best under different scenarios

25 Comparison to Hadoop
Emulated PlanetLab, Hadoop (modified for model-based execution plans)
[Figure: two stacked-bar charts of makespan (s) by Execution Plan (Uniform, Hadoop, Optimized), broken down into Push, Map, and Reduce. Panels: Word Count; Full Inverted Index]
Optimized plan outperforms Hadoop for different applications

26 Approach 2: Cross-phase Optimization in Hadoop
Key idea: factor in downstream effects
Proactive techniques:
  Map-aware Push
  Shuffle-aware Map
Implemented in Hadoop

27 Push/Map Barrier
Push, then Map
Push/map barrier:
  Waiting → waste
  Mappers cannot demand more or less work

28 Map-aware Push
Pipeline push and map
  Hide latency
Feedback: mappers pull on demand
Infer locality dynamically
  No model of racks / switches
  Monitor bandwidth at runtime
  Choose nearest task
Proactively optimize data movement, task placement together
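
The pull-based feedback described above might look roughly like the sketch below. It is not the Hadoop implementation from the talk: `BandwidthMonitor` and `pull_next_split` are hypothetical names, and the exponentially weighted moving average is an assumed way to "monitor bandwidth at runtime". An idle mapper asks for work and receives a split from the source it has measured as nearest, meaning the one with the highest observed bandwidth.

```python
class BandwidthMonitor:
    """Tracks a smoothed bandwidth estimate per data source (assumed EWMA)."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing
        self.estimate = {}  # source -> smoothed bytes per second

    def record(self, source, num_bytes, seconds):
        """Fold one observed transfer into the estimate for this source."""
        observed = num_bytes / seconds
        old = self.estimate.get(source)
        if old is None:
            self.estimate[source] = observed
        else:
            self.estimate[source] = (
                self.smoothing * observed + (1 - self.smoothing) * old
            )


def pull_next_split(pending, monitor):
    """Pick the next split for an idle mapper, or None if nothing is left.

    pending: dict mapping source -> list of split ids not yet pushed.
    Unmeasured sources rank first so that each one is probed at least once.
    """
    candidates = [s for s, splits in pending.items() if splits]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: monitor.estimate.get(s, float("inf")))
    return best, pending[best].pop(0)


monitor = BandwidthMonitor()
monitor.record("eu", 100, 10)  # slow link as seen from this mapper
monitor.record("us", 100, 1)   # fast link as seen from this mapper
pending = {"eu": ["e1"], "us": ["u1"]}
print(pull_next_split(pending, monitor))  # the split from the faster source
```

Because the mapper only pulls when it is ready for more input, pushing and mapping overlap, and no static model of racks or switches is needed.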

29 Map/Shuffle Bottlenecks
Map outputs
Shuffle, then Reduce
Slow shuffle links can create bottlenecks.

30 Shuffle-aware Map
Key idea: do not assign work to mappers that will slow shuffle
Estimate time $T_m$ for mapper $m$ to finish task
  Push, map, and shuffle
  Include accumulated map outputs
  Dynamic, based on history, network monitoring
Refuse work to possible bottleneck mappers
  Refuse if $T_m > \min_m T_m + \alpha$
  Large $\alpha$ → traditional Hadoop
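
The refusal rule can be sketched as follows. Only the inequality comes from the slide; the way $T_m$ is estimated here (push time plus map time plus time to shuffle the backlog and the new output) is an assumption, and all function and parameter names are hypothetical. On this slide $\alpha$ appears to act as a tolerance, so the code calls it `slack` to keep it apart from the output/input ratio of the execution model.

```python
def estimate_finish_time(split_bytes, push_bw, map_rate, shuffle_bw,
                         output_ratio, backlog_bytes):
    """Assumed estimate of T_m: push + map + shuffle for one more split.

    backlog_bytes is the map output this mapper has accumulated but not
    yet shuffled; it delays the shuffle of any new output.
    """
    push = split_bytes / push_bw
    compute = split_bytes / map_rate
    shuffle = (backlog_bytes + output_ratio * split_bytes) / shuffle_bw
    return push + compute + shuffle


def should_refuse(mapper, estimates, slack):
    """Slide's rule: refuse if T_m > min over mappers of T_m, plus slack."""
    return estimates[mapper] > min(estimates.values()) + slack


estimates = {
    "fast": estimate_finish_time(100, 100, 100, 100, 1.0, 0),
    "slow": estimate_finish_time(100, 100, 100, 10, 1.0, 100),
}
print(should_refuse("slow", estimates, slack=5))    # slow shuffle link: refused
print(should_refuse("slow", estimates, slack=100))  # large slack: never refused
```

With a very large slack no mapper is ever refused, which is the "traditional Hadoop" behavior the slide mentions.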

31 Benefit of Map-aware Push
Two PlanetLab data sources (EU, US)
Four map/reduce workers (2 EU, 2 US)
[Figure: Word Count on PlanetLab (1GB Random Text); makespan (s) by Push/Map Approach (Push-then-map, Map-aware Push), broken down into Push, Map, Overlapped Push/Map, and Reduce]
21% reduction in time for push & map

32 Benefit of Shuffle-aware Map
[Figure: InvertedIndex on PlanetLab (800MB ebook data); makespan (s) by Scheduling Approach (Hadoop Default, Shuffle-aware Map), broken down into Overlapped Push/Map and Reduce]
Worse push & map for better shuffle & reduce

33 End-to-end Performance
[Figure: InvertedIndex on PlanetLab (800MB ebook data); makespan (s) by Scheduling Approach (Hadoop Default, End-to-end), broken down into Push, Map, Overlapped Push/Map, and Reduce]
Map-aware Push and Shuffle-aware Map compose

34 Concluding Remarks
Geographically distributed data, resources
Many applications fit MapReduce
Optimizing for highly-distributed environments:
  Consider multiple phases together
  Minimize end-to-end execution time
Acknowledgments:
  Students: Ben Heintz, Chenyu Wang
  Ramesh Sitaraman (UMASS), Jon Weissman (UMN)
  NSF support

35 Thank You!
http://
