Data Engineering. How MapReduce Works. Shivnath Babu

1 Data Engineering How MapReduce Works Shivnath Babu

2 Lifecycle of a MapReduce Job Map function Reduce function Run this program as a MapReduce job
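As a rough illustration of what "a map function, a reduce function, and a program that runs them as a job" can look like, here is a minimal single-process sketch in plain Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, the word-count task is an assumed example, and the grouping step merely stands in for the shuffle that a real cluster performs.

```python
from collections import defaultdict

def map_fn(line):
    """Map function: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce function: combine all values that share the same key."""
    return (word, sum(counts))

def run_job(lines):
    """Toy driver (assumption, not a Hadoop API): map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)      # stands in for the shuffle
    return dict(reduce_fn(key, values) for key, values in groups.items())

result = run_job(["a b a", "b c"])
print(result)
```

The user supplies only the two functions; everything inside `run_job` is what the framework does on the user's behalf.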

4 Lifecycle of a MapReduce Job Time Input Splits Map Wave 1 Map Wave 2 Reduce Wave 1 Reduce Wave 2

5 Job Submission

6 Initialization

7 Scheduling

8 Execution

9 Map Task

10 Reduce Tasks

11 Hadoop Distributed File-System (HDFS)

12 Lifecycle of a MapReduce Job Time Input Splits Map Wave 1 Map Wave 2 Reduce Wave 1 Reduce Wave 2 How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
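One part of this question has a simple arithmetic core: in Hadoop the number of map tasks follows the number of input splits, and the number of reduce tasks is a job setting. The sketch below estimates the split count under a simplifying assumption of fixed-size splits; the 1000 MB file and 128 MB split size are made-up numbers, and `num_splits` is a hypothetical helper, not a Hadoop function.

```python
MB = 1024 * 1024

def num_splits(file_size_bytes, split_size_bytes):
    """Ceiling division: one split per full chunk, plus one for any remainder."""
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes

# Assumed example: a 1000 MB input file cut into 128 MB splits.
splits = num_splits(1000 * MB, 128 * MB)
print(splits)  # one map task would be created per split
```

Real input formats apply extra rules (minimum and maximum split sizes, tolerance for a slightly larger last split), so treat this as an estimate only.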

13 Map Wave 1 Map Wave 2 Map Wave 3 Reduce Wave 1 Shuffle Reduce Wave 2 What if #reduces increased to 9? Map Wave 1 Map Wave 2 Map Wave 3 Reduce Wave 1 Reduce Wave 2 Reduce Wave 3
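The wave counts on this slide can be reasoned about with ceiling division: tasks run in batches as large as the number of task slots available at once. The sketch below is an assumption-laden illustration; the slide does not state the slot count or the original number of reduces, so 4 slots and 8 reduces are chosen only because they reproduce the two-wave and three-wave pictures.

```python
def num_waves(num_tasks, num_slots):
    """Number of waves needed when at most num_slots tasks run concurrently."""
    return (num_tasks + num_slots - 1) // num_slots

def schedule_waves(num_tasks, num_slots):
    """Group task ids into waves, filling each wave before starting the next."""
    task_ids = list(range(num_tasks))
    return [task_ids[i:i + num_slots] for i in range(0, num_tasks, num_slots)]

REDUCE_SLOTS = 4                        # assumed, not given on the slide
print(num_waves(8, REDUCE_SLOTS))       # 8 reduces fit in two waves
print(num_waves(9, REDUCE_SLOTS))       # one extra reduce forces a third wave
print(schedule_waves(9, REDUCE_SLOTS))  # the last wave holds a single task
```

The third wave runs a single task while the other slots sit idle, which is why a small increase in task count can noticeably lengthen the job.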

14 Spark Programming Model. User (Developer) writes the Driver Program: sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache; rdd.count; rdd.map. Components shown: SparkContext, Cluster Manager, two Worker Nodes (each with an Executor holding Tasks and a Cache), Datanodes, HDFS. Taken from

15 Spark Programming Model. User (Developer) writes the Driver Program: sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache; rdd.count; rdd.map. RDD (Resilient Distributed Dataset): immutable data structure; in-memory (explicitly); fault tolerant; parallel data structure; controlled partitioning to optimize data placement; can be manipulated using a rich set of operators.

16 RDD Programming Interface: The programmer can perform 3 types of operations. Transformations: create a new dataset from an existing one. Lazy in nature: they are executed only when some action is performed. Examples: map(func), filter(func), distinct(). Actions: return a value to the driver program or export data to a storage system after performing a computation. Examples: count(), reduce(func), collect(), take(). Persistence: for caching datasets in memory for future operations. Option to store on disk, in RAM, or mixed (Storage Level). Examples: persist(), cache().
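To make the three operation types concrete, here is a toy model in plain Python. It is not Spark: `LazyDataset` is a hypothetical class written for this illustration, and the `evaluations` counter exists only to show when work actually happens.

```python
import functools

class LazyDataset:
    """Toy stand-in for an RDD (assumption: single process, no partitions)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing records
        self._cache = None
        self._cache_enabled = False
        self.evaluations = 0           # times the records were actually computed

    def _records(self):
        if self._cache is not None:
            return self._cache
        self.evaluations += 1
        records = list(self._compute())
        if self._cache_enabled:
            self._cache = records
        return records

    # Transformations: return a new dataset and run nothing yet.
    def map(self, func):
        return LazyDataset(lambda: (func(x) for x in self._records()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._records() if pred(x)))

    def distinct(self):
        return LazyDataset(lambda: dict.fromkeys(self._records()))

    # Persistence: keep the computed records for later actions.
    def cache(self):
        self._cache_enabled = True
        return self

    # Actions: force evaluation and return a value to the caller.
    def count(self):
        return len(self._records())

    def collect(self):
        return list(self._records())

    def take(self, n):
        return self._records()[:n]

    def reduce(self, func):
        return functools.reduce(func, self._records())

base = LazyDataset(lambda: [1, 2, 2, 3, 4])
evens = base.filter(lambda x: x % 2 == 0).cache()
before = evens.evaluations      # still 0: the filter has not run
n = evens.count()               # first action triggers the computation
collected = evens.collect()     # second action is served from the cache
total = evens.reduce(lambda a, b: a + b)
```

Without the `cache()` call, each of the three actions would recompute the filter from `base`.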

17 How Spark works. RDD: parallel collection with partitions. A user application can create RDDs, transform them, and run actions. This results in a DAG (Directed Acyclic Graph) of operators. The DAG is compiled into stages. Each stage is executed as a collection of Tasks (one Task for each Partition).

18 Summary of Components. Task: the fundamental unit of execution in Spark. Stage: collection of Tasks that run in parallel. DAG: logical graph of RDD operations. RDD: parallel dataset of objects with partitions.

19 Example sc.textFile("/wiki/pagecounts") RDD[String] textFile

20 Example sc.textFile("/wiki/pagecounts").map(line => line.split("\t")) RDD[String] RDD[Array[String]] textFile map

21 Example sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)) RDD[String] RDD[Array[String]] RDD[(String, Int)] textFile map map

22 Example sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _) RDD[String] RDD[Array[String]] RDD[(String, Int)] RDD[(String, Int)] textFile map map reduceByKey

23 Example sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _, 3).collect() RDD[String] RDD[Array[String]] RDD[(String, Int)] RDD[(String, Int)] Array[(String, Int)] textFile map map reduceByKey collect
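To see what each step of this pipeline produces, the same sequence can be traced on a handful of records in plain Python. This is a sketch without Spark: the four input lines are invented sample data (the slides do not show the real pagecounts contents), and the dictionary loop only imitates what reduceByKey computes.

```python
from collections import defaultdict

# Invented sample input: tab-separated key and count, one record per line.
lines = ["en\t3", "fr\t1", "en\t4", "de\t2"]        # plays the role of RDD[String]

fields = [line.split("\t") for line in lines]        # first map: split on tab
pairs = [(r[0], int(r[1])) for r in fields]          # second map: (key, count)

totals = defaultdict(int)                             # reduceByKey with addition
for key, value in pairs:
    totals[key] += value

result = sorted(totals.items())                       # collect, sorted for display
print(result)
```

In Spark the order of collected pairs is not guaranteed, which is why the sketch sorts before printing.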

24 Execution Plan. Stage 1: textFile, map, map. Stage 2: reduceByKey, collect. Stages are sequences of RDDs that don't have a shuffle in between.
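The rule that a stage is a run of operators with no shuffle in between can be sketched as a simple cut of the operator chain. The code below is plain Python under two assumptions: the job is a linear chain rather than a general DAG, and reduceByKey is the only shuffle operator; `split_into_stages` is a hypothetical helper, not part of Spark.

```python
def split_into_stages(operators, shuffle_ops=("reduceByKey",)):
    """Start a new stage at every operator that needs shuffled input."""
    stages = [[]]
    for op in operators:
        if op in shuffle_ops and stages[-1]:
            stages.append([])      # shuffle boundary: close the current stage
        stages[-1].append(op)
    return stages

stages = split_into_stages(["textFile", "map", "map", "reduceByKey", "collect"])
for number, ops in enumerate(stages, start=1):
    print(f"Stage {number}: {ops}")
```

Both maps stay in the first stage because each output record depends on a single input record, so no data has to move between partitions.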

25 Execution Plan. Stage 1 (textFile, map, map): 1. Read HDFS split 2. Apply both the maps 3. Start partial reduce 4. Write shuffle data. Stage 2 (reduceByKey, collect): 1. Read shuffle data 2. Final reduce 3. Send result to driver program.
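The two lists of steps can be imitated in a few lines of plain Python. This is a sketch, not Spark's shuffle implementation: the input partitions are invented, `partition_for` is a toy partitioner chosen to be deterministic, and in-memory dictionaries stand in for shuffle files.

```python
from collections import defaultdict

NUM_REDUCE_PARTITIONS = 3   # mirrors the 3 passed to reduceByKey in the example

def partition_for(key, num_partitions):
    """Toy partitioner (assumption): byte sum of the key modulo partition count."""
    return sum(key.encode("utf-8")) % num_partitions

def run_map_task(records):
    """Stage 1 task: partial reduce inside one partition, then bucket by reducer."""
    partial = defaultdict(int)
    for key, value in records:
        partial[key] += value                       # partial reduce
    buckets = defaultdict(dict)
    for key, value in partial.items():
        buckets[partition_for(key, NUM_REDUCE_PARTITIONS)][key] = value
    return buckets                                  # stands in for shuffle data

def run_reduce_task(partition_id, all_map_outputs):
    """Stage 2 task: read one bucket from every map output and finish the reduce."""
    final = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets.get(partition_id, {}).items():
            final[key] += value
    return dict(final)

# Invented input: two partitions of (key, count) pairs, as left by the two maps.
input_partitions = [
    [("en", 3), ("fr", 1), ("en", 4)],
    [("de", 2), ("en", 1)],
]
map_outputs = [run_map_task(p) for p in input_partitions]
reduced = [run_reduce_task(i, map_outputs) for i in range(NUM_REDUCE_PARTITIONS)]
result = {k: v for part in reduced for k, v in part.items()}   # sent to the driver
print(result)
```

The partial reduce shrinks the first partition's three records to two before anything is written, which is the point of starting the reduce on the map side.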

26 Stage Execution (figure labels: Task 1, Task 2, Task 2, Task 2). Create a task for each partition in the input RDD. Serialize the task. Schedule and ship tasks to slaves. All of this happens internally (you don't have to do anything).

27 Spark Executor (Slave) [Figure: Core 1, Core 2, and Core 3 each run a sequence of tasks; every task consists of Fetch Input, Execute Task, Write Output.]
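The executor picture, with a fixed number of cores each working through tasks of the form fetch, execute, write, resembles a bounded worker pool. The sketch below uses Python's standard thread pool as an analogy only: the seven partitions, the summing computation, and `run_task` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 3   # the figure shows three cores

def run_task(partition):
    data = list(partition)      # fetch input
    output = sum(data)          # execute task (placeholder computation)
    return output               # write output (here: hand it back to the caller)

# Invented input: seven partitions, so seven tasks share three cores.
partitions = [[1, 2], [3, 4], [5], [6, 7], [8], [9, 10], [11]]

with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    outputs = list(pool.map(run_task, partitions))   # results keep input order

print(outputs)
```

At most three tasks run at a time; each core picks up the next waiting task as soon as it finishes its current one.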
