MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

1 MapReduce
Kiril Valev, LMU

2 Agenda
1 MapReduce
  - Motivation
  - Definition
  - Example
  - Why MapReduce?
  - Distributed Environment
  - Fault Tolerance
  - Pro and Con
  - Summary

6 Motivation
- Process large amounts of data
- Process it fast (in parallel)
- Focus on the problem, not on the implementation
- MapReduce Framework!

7 Definition
What is MapReduce?
  MapReduce is a programming model and an associated implementation for processing and generating large data sets. [1]
What is map?
  A map function processes a key/value pair to generate a set of intermediate key/value pairs. [1]
What is reduce?
  A reduce function merges all intermediate values associated with the same intermediate key. [1]

8 Map
k and v are the input key and value of the map function. They are processed and mapped to a new list that contains zero, one, or many pairs, depending on the implementation.
Formal definition: $(k, v) \rightarrow [(l_1, x_1), \ldots, (l_n, x_n)]$
The results of the map function are called intermediate key/value pairs.

9 Map
Map function for counting words in a collection of documents.
Example (map pseudocode):
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");
The map function iterates over every word in a document and emits an intermediate result for each one.
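The map pseudocode can be sketched in Python as a generator. This is a minimal sketch, not the implementation from [1]: the name `map_word_count`, splitting on whitespace, and emitting the integer 1 instead of the string "1" are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word in a document.

    key:   document name (unused, kept to mirror the pseudocode)
    value: document contents
    """
    # Assumption: words are separated by whitespace.
    for word in value.split():
        yield (word, 1)


# Hypothetical input document.
pairs = list(map_word_count("doc1", "this is document one"))
```

Every word produces its own pair, so a word that occurs twice in a document is emitted twice; summing the counts is left to the reduce step.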

10 Reduce
The reduce function operates on the output of the map function (the intermediate results).
Formal definition: $(l, [y_1, \ldots, y_n]) \rightarrow [w_1, \ldots, w_m]$
For each key l, the list of corresponding values is passed to the reduce function, which reduces them to a new result.

11 Reduce
Reduce function for counting words in a collection of documents.
Example (reduce pseudocode):
    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
The reduce function aggregates the counts for a specific word (the key).
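A Python counterpart of the reduce pseudocode might look as follows. The name `reduce_word_count` and the choice to return a `(word, total)` tuple instead of calling `Emit` are assumptions of this sketch.

```python
from typing import Iterable, Tuple, Union


def reduce_word_count(key: str, values: Iterable[Union[int, str]]) -> Tuple[str, int]:
    """Sum all counts that were emitted for one word.

    key:    a word
    values: the counts collected for that word
    """
    result = 0
    for v in values:
        # int() plays the role of ParseInt, so "1" and 1 are both accepted.
        result += int(v)
    return (key, result)


# Hypothetical intermediate values for the word "this".
total = reduce_word_count("this", [1, 1])
```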

14 Example
Input:
  key   value
  doc1  this is document one.
  doc2  and this is document two.
Map output (intermediate key/value pairs):
  doc1: (this, 1) (is, 1) (document, 1) (one, 1)
  doc2: (and, 1) (this, 1) (is, 1) (document, 1) (two, 1)
Reduce output:
  (and, 1) (document, 2) (is, 2) (one, 1) (this, 2) (two, 1)
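The example can be replayed on a single machine with a few lines of Python. This is only an illustrative sketch: `map_fn`, `reduce_fn`, the regular expression used to drop punctuation, and the in-memory grouping step are assumptions, and nothing here is distributed.

```python
import re
from collections import defaultdict

# The two input documents from the example.
documents = {
    "doc1": "this is document one.",
    "doc2": "and this is document two.",
}


def map_fn(name, contents):
    # Assumption: a word is a run of word characters, so "one." yields "one".
    for word in re.findall(r"\w+", contents.lower()):
        yield word, 1


def reduce_fn(word, counts):
    return word, sum(counts)


# Map phase: every document produces intermediate (word, 1) pairs.
intermediate = [
    pair for name, text in documents.items() for pair in map_fn(name, text)
]

# Grouping step: collect all values that share the same intermediate key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: one call per unique key, in sorted key order.
word_counts = [reduce_fn(word, grouped[word]) for word in sorted(grouped)]
```

With these inputs, `word_counts` holds the six pairs shown as the final result of the example.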

15 Why MapReduce?
Question: What is special about it? Map and reduce functions have existed for a long time.
Answer: It's not the functions; the framework is special!
- Programs written in this functional style are automatically parallelized and executed on a large cluster of machines.
- This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

16 MapReduce [1]

18 Fork and Assign
The user specifies how many mappers and reducers are allocated.
How many jobs? The number of map and reduce jobs should be larger than the number of workers.
Typical setup: M = , R = 5,000, worker =

20 Read
- M data pieces, M map tasks
- 16 MB to 64 MB per piece
- Locality
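How the piece size determines M can be illustrated with a short calculation. The helper names `number_of_splits` and `split_ranges`, the fixed-size byte ranges, and the 64 MB default are assumptions of this sketch; a real framework may split along record or block boundaries instead.

```python
from typing import List, Tuple

MB = 1024 * 1024


def number_of_splits(input_bytes: int, split_bytes: int = 64 * MB) -> int:
    """Number of input pieces, which equals the number of map tasks M."""
    # Integer ceiling division, so a partial last piece still counts.
    return (input_bytes + split_bytes - 1) // split_bytes


def split_ranges(input_bytes: int, split_bytes: int = 64 * MB) -> List[Tuple[int, int]]:
    """Half-open byte ranges [start, end), one per map task."""
    return [
        (start, min(start + split_bytes, input_bytes))
        for start in range(0, input_bytes, split_bytes)
    ]


# Hypothetical 1 GB input, split with the largest and smallest piece size.
m_large_pieces = number_of_splits(1024 * MB, 64 * MB)
m_small_pieces = number_of_splits(1024 * MB, 16 * MB)
```

Smaller pieces give more map tasks for the same input, which leaves the master more freedom when distributing work.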

21 Locality
- Network bandwidth is a relatively scarce resource
- Depending on the file system, the data might already be on the same machine
- Reading local or nearby (same rack) data makes input faster
- The master tries to assign data to workers efficiently
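One way to picture the locality preference is a master that ranks idle workers by their distance to the input data. The `Worker` class, the `pick_worker` function, and the three distance levels (same machine, same rack, anywhere) are hypothetical; the slides do not specify the actual scheduling policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Worker:
    name: str
    rack: str


def pick_worker(idle: List[Worker], data_hosts: List[Worker]) -> Optional[Worker]:
    """Choose the idle worker that is closest to a copy of the input piece."""
    if not idle:
        return None
    host_names = {w.name for w in data_hosts}
    host_racks = {w.rack for w in data_hosts}

    def distance(worker: Worker) -> int:
        if worker.name in host_names:
            return 0  # data is on the same machine
        if worker.rack in host_racks:
            return 1  # data is in the same rack
        return 2      # data must cross racks

    return min(idle, key=distance)


# Hypothetical cluster: the input piece is stored on machine "a" in rack "r1".
a, b, c = Worker("a", "r1"), Worker("b", "r1"), Worker("c", "r2")
chosen = pick_worker([c, b], data_hosts=[a])
```

Here `chosen` is worker `b`: machine `a` is not idle, so the sketch falls back to a worker in the same rack before considering `c`.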

23 Local Write
- Buffered data is written periodically
- Partitioned into R regions
- The master is notified about the location

26 Combiner Function
The combiner function is a refinement of the map phase. Before the output is written to local storage, the combiner function preprocesses the intermediate key/value pairs:
  (this, 1) (this, 1) (this, 1) (this, 1) → (this, 4)
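For word counting, a combiner can simply pre-sum the counts on the map worker. The function name `combine` and the use of `collections.Counter` are assumptions of this sketch; it relies on the counts being added, an operation where the order and grouping of partial sums does not change the result.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def combine(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Merge pairs with the same key before they leave the map worker."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    # Keys keep the order in which they were first seen.
    return list(totals.items())


# The four pairs from the slide collapse into a single pair.
combined = combine([("this", 1), ("this", 1), ("this", 1), ("this", 1)])
```

The reduce step still receives correct totals, but far fewer pairs have to be written to disk and sent over the network.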

27 Partitioning Function
The output of the map function (the intermediate key/value pairs) is kept in a buffer and periodically written to the local disk. A partitioning function is applied to the key, producing R partitions.
Partitioning function: hash(key) mod R
The function can also be supplied by the user, for example to put all URLs of the same host into one partition.
Partitioning function: hash(hostname(key)) mod R
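Both partitioning functions can be sketched in Python. The names `default_partition` and `host_partition` are hypothetical, and `zlib.crc32` stands in for the unspecified hash function because Python's built-in `hash()` for strings changes between interpreter runs, while all map workers must agree on the partition of a key.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key: str, r: int) -> int:
    """hash(key) mod R, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % r


def host_partition(url: str, r: int) -> int:
    """hash(hostname(key)) mod R: all URLs of one host share a partition."""
    host = urlparse(url).hostname or ""
    return default_partition(host, r)


R = 5  # hypothetical number of reduce tasks
p1 = host_partition("http://example.com/index.html", R)
p2 = host_partition("http://example.com/about?lang=en", R)
```

`p1` and `p2` are equal because only the host name enters the hash, so both URLs end up in the same output partition.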

29 Remote Read
- The reduce task is notified by the master
- RPC to get the data from the map worker
- Data is sorted by intermediate key
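After the remote reads, a reduce worker holds one chunk of pairs per map worker and has to bring equal keys together. The sketch below does this in memory with `itertools.groupby`; the function name `sort_and_group` and the in-memory sort are assumptions, and the RPC itself is not modeled.

```python
from itertools import groupby
from operator import itemgetter
from typing import List, Tuple


def sort_and_group(chunks: List[List[Tuple[str, int]]]) -> List[Tuple[str, List[int]]]:
    """Merge the chunks fetched from map workers and group values by key."""
    merged = [pair for chunk in chunks for pair in chunk]
    # Sorting places all pairs with the same key next to each other.
    merged.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical data fetched from two map workers.
grouped = sort_and_group([
    [("this", 1), ("is", 1)],
    [("is", 1), ("this", 1), ("and", 1)],
])
```

Sorting first is what makes `groupby` sufficient: it only merges adjacent items, so unsorted input would produce several groups for the same key.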

31 Write
- The reduce function iterates over the sorted, unique intermediate keys
- Output is appended to the partition file
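The write step can be sketched as a loop that calls the reduce function once per key and appends each result to an output stream. The name `run_reduce_task`, the tab-separated line format, and the use of `io.StringIO` in place of a real partition file are assumptions made for illustration.

```python
import io
from typing import Callable, Iterable, List, TextIO, Tuple


def run_reduce_task(
    grouped: Iterable[Tuple[str, List[int]]],
    reduce_fn: Callable[[str, List[int]], int],
    out: TextIO,
) -> None:
    """Apply reduce_fn to each (key, values) group and append the result."""
    for key, values in grouped:
        out.write(f"{key}\t{reduce_fn(key, values)}\n")


# Hypothetical sorted input for one reduce partition.
buffer = io.StringIO()
run_reduce_task(
    [("and", [1]), ("is", [1, 1])],
    lambda key, values: sum(values),
    buffer,
)
output_text = buffer.getvalue()
```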

32 Fault Tolerance: Worker Failure
- The master periodically checks each worker's heartbeat
- The failed machine could have been assigned a map or a reduce task
- Completed map tasks are re-executed
- In-progress reduce tasks are re-executed
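The re-execution rules can be written down as a small piece of master bookkeeping. The `Task` class, the state names, and `handle_worker_failure` are hypothetical. Following [1], the sketch also resets in-progress map tasks, and it leaves completed reduce tasks alone on the assumption that their output no longer lives on the failed machine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                    # "map" or "reduce"
    state: str                   # "idle", "in_progress" or "completed"
    worker: Optional[str] = None


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset every task whose work was lost and return the reset tasks."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output sits on the failed worker's local disk.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


# Hypothetical task table at the moment worker "w1" stops responding.
tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "completed", "w2"),
]
rescheduled = handle_worker_failure(tasks, "w1")
```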

33 Fault Tolerance: Master Failure
- There is only one master, so failure is unlikely
- Accept re-executing the job in case of failure
- Or: write checkpoints of the master data structures and restore from them in case of failure

34 Backup Tasks
- Deal with slow workers (bottleneck)
- Stragglers: workers that take an unusually long time to complete
- Start backup tasks when the job is close to completion
- Redundant execution of the remaining tasks
- A task is completed when either the primary or the backup execution finishes
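The backup-task idea can be expressed as a simple check on the master. The function `tasks_needing_backup` and the 90% completion threshold are assumptions of this sketch; the slides only say that backups start when the job is close to completion.

```python
from typing import Dict, List


def tasks_needing_backup(states: Dict[str, str], threshold: float = 0.9) -> List[str]:
    """Return the in-progress tasks to duplicate once the job is nearly done.

    states maps a task id to "idle", "in_progress" or "completed".
    """
    total = len(states)
    done = sum(1 for state in states.values() if state == "completed")
    # Assumption: "close to completion" means at least `threshold` of all tasks.
    if total == 0 or done / total < threshold:
        return []
    return sorted(task for task, state in states.items() if state == "in_progress")


# Hypothetical job: 19 of 20 tasks are finished, one straggler remains.
states = {f"t{i:02d}": "completed" for i in range(19)}
states["t19"] = "in_progress"
backups = tasks_needing_backup(states)
```

Whichever copy of `t19` finishes first, primary or backup, marks the task as completed; the result of the other copy is simply ignored.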

35 Pro [2]
- Simple and easy to use
- Flexible
- Independent of the storage
- Fault tolerance
- Scalability

36 Con [2]
- No high-level language
- No schema and no index
- A single fixed dataflow
- Very young (compared to DBMS)

37 Summary
Dataflow: Input → Map → Combine/Partition → Reduce → Output
MapReduce:
- Simple (no experience in distributed computing needed!)
- Fault tolerance
- Focus on the problem, not the implementation
- Many problems can be modeled

38 MapReduce
The End

39 Discussion on Hacker News

40 Data
- User generated
- Logfiles
- Images
- Raw text

41 Problem
- Analytics
- Profiling/Advertising
- Pattern Recognition
- Parallel Job Scheduler
- Protein Binding
- Monte-Carlo

42 References
[1] Dean, Jeffrey and Ghemawat, Sanjay (2008). MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51(1).
[2] Lee, Kyong-Ha; Lee, Yoon-Joon; Choi, Hyunsik; Chung, Yon Dohn; Moon, Bongki (2011). Parallel Data Processing with MapReduce: A Survey. SIGMOD Record 40(4), December 2011.
[3] Hacker News (2013). Ask HN: To everybody who uses MapReduce: what problems do you solve?
