The MapReduce Framework In Partial Fulfilment of the Requirements for Course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab
Overview MapReduce was first introduced by Google in 2004. MapReduce is a programming model for processing large data sets, realized as a software framework that allows developers to write programs that process massive amounts of data in parallel across a distributed cluster of computers. Implementations of MapReduce are available in a variety of programming languages, including Java, C++, Python, Perl, Ruby, and C.
Overview MapReduce has gained popularity and has been used extensively at Google, processing 20 petabytes of data per day. At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses. MapReduce is also used at Facebook for production jobs, including data import, hourly reports, etc.
Motivations Google, Yahoo, and similar companies deal with massive amounts of data (terabytes) that must be processed fairly quickly, using very large numbers of machines. There is therefore a demand for large-scale data processing. With lots of machines come coordination and scaling issues, and fault tolerance is essential. MapReduce provides an efficient solution that satisfies these requirements.
MapReduce model The MapReduce model is inspired by the map and reduce functions commonly used in functional programming: Map: extract something we care about from each record of data. Reduce: aggregate, summarize, filter, or transform. The framework is divided into two parts: a Map function that farms work out to different nodes in the distributed cluster, and a Reduce function that collects the work and combines the results into a single value.
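The functional-programming roots mentioned above can be illustrated with Python's built-in map and functools.reduce (an illustrative sketch, not part of the original slides):

```python
from functools import reduce

records = [3, 1, 4, 1, 5]

# Map: extract something from each record (here, square each value)
squares = list(map(lambda x: x * x, records))

# Reduce: aggregate the mapped values into a single result (here, a sum)
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [9, 1, 16, 1, 25]
print(total)    # 52
```

MapReduce generalizes this pair of operations to key/value pairs distributed across many machines.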
MapReduce Flow The input is a list of records. The records are split among the different computers by Map. The result of the Map computation is a list of key/value pairs. Reduce combines the set of values that share the same key into a single value.
Example: Word Count Each document is split into words, and each word is counted by the map function. The framework combines all pairs with the same key and feeds them to reduce. The reduce function sums all input values to find the total number of appearances of that word.
Example: Word Count Map Function:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Reduce Function:

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.
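The pseudocode above can be sketched as runnable plain Python (an illustrative simulation, with the framework's shuffle step that groups values by key emulated by a dictionary):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for each word in the document
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word
    return word, sum(counts)

def map_reduce(documents):
    # Map phase: run map_fn over every input record
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle: group intermediate values by key
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: combine each key's values into a single result
    return dict(reduce_fn(k, v) for k, v in grouped.items())

result = map_reduce({"d1": "the cat sat", "d2": "the cat ran"})
print(result)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real deployment the map calls, the shuffle, and the reduce calls each run in parallel across many machines; the sequential loops here stand in for that distribution.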
MapReduce Architecture The framework splits the input data into a set of M splits, which can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function. The number of partitions (R) and the partitioning function are specified by the user.
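The default partitioning function described in the MapReduce paper is hash(key) mod R; a minimal sketch:

```python
R = 4  # number of reduce partitions, chosen by the user

def partition(key, num_partitions=R):
    # Route every occurrence of a key to the same reduce partition
    return hash(key) % num_partitions

# All copies of the same intermediate key land in one partition,
# so a single reduce task sees every value for that key.
assert partition("apple") == partition("apple")
assert 0 <= partition("banana") < R
```

A user-supplied partitioner replaces this function when keys need custom routing, e.g. partitioning URLs by hostname so all pages of one site go to the same reduce task.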
Execution overview Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.
Fault Tolerance in Map/Reduce Fault tolerance is one of the most critical issues for MapReduce. MapReduce handles failures through re-execution. The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Once a machine failure is detected, in-progress map or reduce tasks on that machine are re-executed on other machines.
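The ping-and-re-execute scheme above can be sketched as follows (an illustrative simulation; a real master tracks far more state, and the timeout value here is an arbitrary assumption):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping before a worker is marked failed

class Master:
    def __init__(self):
        self.last_ping = {}    # worker id -> time of last response
        self.assignments = {}  # worker id -> list of in-progress tasks

    def record_ping(self, worker):
        self.last_ping[worker] = time.monotonic()

    def failed_workers(self, now=None):
        now = now if now is not None else time.monotonic()
        return [w for w, t in self.last_ping.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def reassign(self, failed, healthy):
        # Re-execute the failed worker's in-progress tasks on another machine
        tasks = self.assignments.pop(failed, [])
        self.assignments.setdefault(healthy, []).extend(tasks)
        return tasks

master = Master()
master.assignments = {"w1": ["map-3", "map-7"], "w2": ["reduce-1"]}
master.record_ping("w2")
master.last_ping["w1"] = time.monotonic() - 60  # w1 stopped responding
for w in master.failed_workers():
    master.reassign(w, "w2")
print(master.assignments)  # {'w2': ['reduce-1', 'map-3', 'map-7']}
```

Because map and reduce tasks are deterministic and side-effect-free, re-running them on another machine produces the same output, which is what makes this simple recovery strategy correct.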
Hadoop Map/Reduce Framework Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is now used by Yahoo, Amazon, IBM, Facebook, Rackspace, The New York Times, etc. MapReduce is considered the heart of Hadoop. MapReduce programs have been implemented internally at Google over the past nine years, and an average of one hundred thousand MapReduce jobs are executed on Google clusters every day.
Hadoop Map/Reduce Framework Here are some statistics for a subset of MapReduce jobs run at Google in various months. Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.
Hadoop Map/Reduce Framework The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Hadoop Map/Reduce: Word Count Program Map Function: Reduce Function:
Hadoop Map/Reduce: Word Count Program Main Function:
Conclusions MapReduce is a flexible programming framework for many applications, built around a pair of restricted Map()/Reduce() constructs. MapReduce hides the details of parallelization, fault tolerance, locality optimization, and load balancing. The model is easy to use, even for programmers without experience with parallel and distributed systems. A number of frameworks supporting MapReduce are in development.
References
1. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
2. Hadoop. [Online]. Available: http://lucene.apache.org/hadoop
3. Map/Reduce tutorial. [Online]. Available: http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html
4. J. Dean, "Designs, lessons and advice from building large distributed systems," in The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), Big Sky, MT, October 2009.
5. Amazon Elastic MapReduce. [Online]. Available: http://aws.amazon.com/elasticmapreduce/
6. Distributed Systems. [Online]. Available: http://code.google.com/edu/parallel/index.html
7. On Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/mapreduce/
Thank You! Questions & Answers