Parallel Computing: MapReduce
Jin, Hai
School of Computer Science and Technology
Huazhong University of Science and Technology

- MapReduce is a distributed/parallel computing framework introduced by Google to support computing on large data sets on clusters of computers.
- The framework is inspired by the map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as in their original forms).
- The user implements the map() and reduce() functions
- The runtime library takes care of EVERYTHING else

- Large-scale data processing
- Want to process lots of data (> 1 TB)
- Want to parallelize across hundreds/thousands of CPUs
- Want to make this easy

- A simple programming model that applies to certain large-scale computing problems
- Hide messy details in the MapReduce runtime library:
  - automatic parallelization
  - load balancing
  - network and disk transfer optimization
  - handling of machine failures
  - robustness
- Improvements to the core library benefit all users of the library!

Typical problem solved by MapReduce
- Read a lot of data
- Map: extract something you care about from each record
- Shuffle and Sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
- The outline stays the same, but map and reduce change to fit the problem

Programming Model
- Functions borrowed from functional programming languages
- Users implement an interface of two functions:
- map()
  - Processes a key/value pair to generate intermediate key/value pairs
  - map (in_key, in_value) -> (out_key, intermediate_value) list
- reduce()
  - Merges all intermediate values associated with the same key
  - reduce (out_key, intermediate_value list) -> out_value list

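The two signatures above can be exercised with a small sequential driver. This is only a sketch of the contract, not of the distributed runtime: `run_mapreduce`, `map_ext` and `reduce_max` are hypothetical names invented for this illustration, and everything runs in one process.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, reduce."""
    # Map: each (in_key, in_value) yields a list of (out_key, intermediate_value).
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    # Reduce: each out_key is reduced together with all of its intermediate
    # values; keys are visited in sorted order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical job: longest line length per file extension.
def map_ext(filename, text):
    ext = filename.rsplit(".", 1)[-1]
    return [(ext, len(line)) for line in text.splitlines()]

def reduce_max(ext, lengths):
    return [max(lengths)]

result = run_mapreduce(
    [("a.txt", "hi\nhello"), ("b.txt", "hey"), ("c.md", "# title")],
    map_ext,
    reduce_max,
)
print(result)  # {'md': [7], 'txt': [5]}
```

Only `map_ext` and `reduce_max` are specific to the problem; the driver stays the same for any other job.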
Example: Counting Words
- Counting words in a large set of documents:
- map()
  - Input: <filename, file_text>
  - Parses the file and emits <word, count> pairs, e.g. <hello, 1>
- reduce()
  - Sums all values for the same key and emits <word, TotalCount>, e.g. <hello, (3 5 2 7)> => <hello, 17>

Example: Use of MapReduce

map(string key, string value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(string key, iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

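For readers who want to run the word-count pseudo-code, here is a minimal single-process Python rendering. It is a sketch, not the C++ library API: `map_words`, `reduce_counts` and `word_count` are names made up for this example, and the grouping step that the real runtime performs is done by hand with a dictionary.

```python
from collections import defaultdict

def map_words(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        # Counts travel as strings, which is why the pseudo-code
        # uses ParseInt and AsString.
        yield word, "1"

def reduce_counts(word, counts):
    # key: a word, values: a list of counts (as strings)
    return str(sum(int(c) for c in counts))

def word_count(documents):
    # Stand-in for the runtime: group intermediate values by word.
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            intermediate[word].append(count)
    return {w: reduce_counts(w, cs) for w, cs in sorted(intermediate.items())}

counts = word_count({"d1": "hello world hello", "d2": "hello mapreduce"})
print(counts)  # {'hello': '3', 'mapreduce': '1', 'world': '1'}
```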
Actual Source Code
- The example is written in pseudo-code
- The actual implementation is in C++, using a MapReduce library
- The real code is somewhat more involved (it defines how the input keys/values are divided up and accessed, etc.)

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently.
- Synchronization is required between the two functions.

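The structure described above (parallel maps, a synchronization point, then parallel reduces) can be sketched with Python's standard `concurrent.futures` module. Threads here merely stand in for machines, and `map_task`, `reduce_task` and `parallel_word_count` are hypothetical names for this illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(split):
    # One map task per input split; emits (word, 1) pairs.
    return [(word, 1) for word in split.split()]

def reduce_task(item):
    # One reduce call per distinct output key.
    key, values = item
    return key, sum(values)

def parallel_word_count(splits, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map tasks are independent, so they can all run concurrently.
        # list() drains every result: this is the synchronization point.
        map_outputs = list(pool.map(map_task, splits))
        groups = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                groups[key].append(value)
        # Reduce tasks are independent too: each handles a different key.
        return dict(pool.map(reduce_task, groups.items()))

totals = parallel_word_count(["a b a", "b c", "a"])
print(totals)  # {'a': 3, 'b': 2, 'c': 1}
```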
How MapReduce Works
- User to-do list:
  - Write map() and reduce() functions
  - Indicate:
    - Input/output files
    - M: number of map tasks
    - R: number of reduce tasks
    - W: number of machines
  - Submit the job
- This requires no knowledge of parallel and distributed systems!
- What about everything else?

Data Distribution
- Input files are split into M pieces on the distributed file system
  - Typically ~16 to 128 MB blocks
- Intermediate files created by map tasks are written to local disk
- Output files are written to the distributed file system

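How M follows from the file size and the block size can be shown with a few lines of arithmetic. This is a sketch under assumptions: `split_input` is a hypothetical helper, the 64 MB default is just one value inside the range quoted above, and real splits may also respect record boundaries.

```python
def split_input(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs that cover file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
splits = split_input(200 * MB, block_size=64 * MB)
M = len(splits)  # number of map tasks for this input
print(M, splits[-1][1] // MB)  # 4 8  (three full blocks plus an 8 MB tail)
```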
Assigning Tasks
- Many copies of the user program (Master/Workers) are started
- The master finds idle machines and assigns them tasks
  - It tries to exploit data locality by running map tasks on the machines that already hold the data

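One way to picture locality-aware assignment is a greedy two-pass scheduler. The policy below is an assumption made for illustration, not the master's actual algorithm, and `assign_map_tasks` and its arguments are hypothetical.

```python
def assign_map_tasks(pending, idle_workers, block_locations):
    """Greedy, locality-first assignment.

    pending: list of split ids
    idle_workers: list of worker names
    block_locations: split id -> set of workers holding a replica
    Returns {split_id: worker}; splits with no idle worker stay pending.
    """
    idle = list(idle_workers)
    assignment = {}
    remaining = []
    # First pass: give each split to an idle worker that already stores it.
    for split in pending:
        local = [w for w in idle if w in block_locations.get(split, set())]
        if local:
            assignment[split] = local[0]
            idle.remove(local[0])
        else:
            remaining.append(split)
    # Second pass: any idle worker will do; the data is read over the network.
    for split in remaining:
        if not idle:
            break
        assignment[split] = idle.pop(0)
    return assignment

plan = assign_map_tasks(
    pending=["s0", "s1", "s2"],
    idle_workers=["w1", "w2", "w3"],
    block_locations={"s0": {"w2"}, "s1": {"w2", "w3"}, "s2": {"w9"}},
)
print(plan)  # {'s0': 'w2', 's1': 'w3', 's2': 'w1'}
```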
Execution (map)
- Map workers read in the contents of the corresponding input partition
- They perform the user-defined map computation to create intermediate <key,value> pairs
- Buffered output pairs are periodically written to local disk
  - Partitioned into R regions by a partitioning function

Partition Function
- In addition to the map and reduce functions, you may specify a partition function:
  - partition(k', number of reducers) -> choice of reducer for k'
- Applied in the shuffle-and-sort stage, which is invisible to the user
- Default partition:
  - hash(k') mod R, where R is the number of reducers
- Each reducer sees the keys in its partition in sorted order

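A small sketch of the default partitioner and of a user-supplied alternative. Assumptions: `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between processes, and `default_partition`, `first_letter_partition` and `partition_and_sort` are hypothetical names.

```python
import zlib

def default_partition(key, num_reducers):
    # hash(k') mod R, with a hash that is stable across processes.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def first_letter_partition(key, num_reducers):
    # A user-supplied alternative: route keys by their first letter.
    return (ord(key[0].lower()) - ord("a")) % num_reducers

def partition_and_sort(pairs, num_reducers, partition=default_partition):
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    # Each reducer sees the keys of its partition in sorted order.
    return [sorted(region) for region in regions]

pairs = [("pear", 1), ("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
regions = partition_and_sort(pairs, 2, first_letter_partition)
print(regions[0])  # [('apple', 1), ('apple', 1), ('cherry', 1)]
print(regions[1])  # [('banana', 1), ('pear', 1)]
```

Whatever function is chosen, all pairs with the same key must land in the same region, otherwise a reducer would see only part of a key's values.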
Execution (reduce)
- Reduce workers iterate over the ordered intermediate data
  - For each unique key encountered, the key and its values are passed to the user's reduce function
  - e.g. <key, [value1, value2, ..., valueN]>
- The output of the user's reduce function is written to an output file on the distributed file system
- When all tasks have completed, the master wakes up the user program

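Because the intermediate data arrives sorted by key, a reduce worker can find each run of equal keys in a single pass. The sketch below uses `itertools.groupby` from the Python standard library; `run_reduce` is a hypothetical name and the output is returned as a list rather than written to a file.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(sorted_pairs, reduce_fn):
    output = []
    # groupby only merges adjacent equal keys, so the input must be sorted.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output

sorted_pairs = sorted([("b", 2), ("a", 1), ("b", 5), ("a", 4)])
out = run_reduce(sorted_pairs, lambda key, values: sum(values))
print(out)  # [('a', 5), ('b', 7)]
```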
Observations
- No reduce can begin until map is complete
- Tasks are scheduled based on the location of data
- If a map worker fails any time before reduce finishes, the task must be completely rerun
- The master must communicate the locations of intermediate files
- The MapReduce library does most of the hard work for us!

MapReduce: Granularity
- Fine-granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing

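The load-balancing benefit can be illustrated with a toy simulation in which every idle worker pulls the next task from a shared queue. All numbers and the `makespan` helper are made up for this sketch; it ignores scheduling overhead and data transfer.

```python
import heapq

def makespan(total_work, num_tasks, worker_speeds):
    """Finish time when total_work is cut into num_tasks equal tasks
    and each worker takes a new task as soon as it becomes idle."""
    task_work = total_work / num_tasks
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(worker_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)
        done = t + task_work / worker_speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

speeds = [1.0, 1.0, 1.0, 0.25]         # one machine is four times slower
coarse = makespan(120.0, 4, speeds)    # one task per machine
fine = makespan(120.0, 120, speeds)    # many more tasks than machines
print(coarse > fine)  # True
```

With one task per machine, the job waits for the slow machine to finish its full share; with many small tasks, the fast machines absorb most of the work.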
Optimization
- No reduce can start until map is complete:
  - A single slow disk controller can rate-limit the whole process
- The master redundantly executes slow-moving map tasks and uses the result of whichever copy finishes first

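The effect of backup copies on a straggler can be sketched with a toy model. The trigger policy here (launch a backup once a task has run longer than `threshold` times the typical duration) is an assumption for illustration, not the master's documented rule, and `map_phase_time` is a hypothetical helper.

```python
def map_phase_time(task_times, typical_time, threshold=2.0):
    """Return (finish time without backups, finish time with backups)."""
    cutoff = threshold * typical_time
    without_backup = max(task_times)
    finish_times = []
    for t in task_times:
        if t > cutoff:
            # The backup starts at the cutoff and takes the typical time;
            # whichever copy finishes first is used.
            finish_times.append(min(t, cutoff + typical_time))
        else:
            finish_times.append(t)
    return without_backup, max(finish_times)

slow, fast = map_phase_time([10, 11, 9, 60], typical_time=10)
print(slow, fast)  # 60 30.0
```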
Optimization
- Combiner functions can run on the same machine as a mapper
- This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth

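For word counting, a combiner can pre-sum the counts produced by one mapper before anything crosses the network. This works because addition is associative and commutative. The function names below are hypothetical, and the sketch only compares the number of pairs a mapper would have to send.

```python
from collections import Counter

def map_words(text):
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Mini-reduce on the mapper's machine: one pair per distinct word.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

raw = map_words("to be or not to be")
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```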
Fault Tolerance
- Worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion is committed through the master
- Master failure:
  - State is checkpointed to a replicated file system
  - New master recovers and continues
- Very robust: lost 1600 of 1800 machines once, but finished fine

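The worker-failure rules above can be written down as two small functions. This is a sketch with a hypothetical task-table layout (`id`, `kind`, `state`, `worker`), not the master's real data structures.

```python
def failed_workers(last_heartbeat, now, timeout):
    # A worker is presumed dead if its last heartbeat is older than timeout.
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)

def tasks_to_reexecute(tasks, failed_worker):
    redo = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in ("in_progress", "completed"):
            # Map output lives on the failed machine's local disk, so even
            # completed map tasks have to run again.
            redo.append(task["id"])
        elif task["kind"] == "reduce" and task["state"] == "in_progress":
            # Completed reduce output is already in the distributed file system.
            redo.append(task["id"])
    return redo

tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "kind": "map", "state": "in_progress", "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"id": "m3", "kind": "map", "state": "completed", "worker": "w2"},
]
dead = failed_workers({"w1": 0.0, "w2": 9.0}, now=10.0, timeout=5.0)
redo = tasks_to_reexecute(tasks, dead[0])
print(dead, redo)  # ['w1'] ['m1', 'm2', 'r2']
```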
Applications
- Structure of the Web:
  - Input is (URL, contents)
  - Scan through the document's contents looking for links to other URLs
  - If map outputs (URL, linked-to URL), you get a simple representation of the WWW link graph
  - If map outputs (linked-to URL, URL), you get the reverse link graph: what web pages link to me?
  - If map outputs (linked-to URL, anchor text), you get: how do other web pages characterize me?

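The reverse link graph can be sketched in a few lines. Assumptions: links are found with a deliberately naive regular expression rather than an HTML parser, the example pages and URLs are invented, and `map_reverse_links` and `reduce_sources` are hypothetical names.

```python
import re
from collections import defaultdict

HREF = re.compile(r'href="([^"]+)"')

def map_reverse_links(url, contents):
    # Emit (linked-to URL, URL) for every link found in the page.
    return [(target, url) for target in HREF.findall(contents)]

def reduce_sources(target, sources):
    # All pages that link to target, without duplicates.
    return sorted(set(sources))

pages = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a>',
}
inbound = defaultdict(list)
for url, contents in pages.items():
    for target, source in map_reverse_links(url, contents):
        inbound[target].append(source)
reverse_graph = {t: reduce_sources(t, s) for t, s in sorted(inbound.items())}
print(reverse_graph["http://c.example/"])  # ['http://a.example/', 'http://b.example/']
```

Swapping the emitted pair to (URL, linked-to URL) gives the forward link graph with the same reduce function.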
Applications
- Google uses MapReduce for:
  - Page indexing pipeline: what are all the pages that match this query?
  - PageRank: what are the best pages that match this query?
  - and many others
- It greatly simplifies large-scale computations at Google

References
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters - http://labs.google.com/papers/mapreduce.html
- Ralf Lämmel, Google's MapReduce Programming Model Revisited - http://www.cs.vu.nl/~ralf/mapreduce/paper.pdf
- http://code.google.com/edu/parallel/mapreduce-tutorial.html