Chapter 5. The MapReduce Programming Model and Implementation

- Traditional computing: data-to-computing (send data to computing)
  * Data stored in a separate repository
  * Data brought into the system for computing (time consuming and limits interactivity)
  * Data transfer usually becomes the bottleneck if the amount of data is huge
- Cloud computing: computing-to-data (send computing to data)
  * System collects and maintains data (shared, active data set)
  * Computation co-located with storage
  * Data transfer is faster
(Figure: data-to-computing vs. computing-to-data)
5-1
- Example of data-intensive computing: Google
  * Data-intensive computing: processes 20 petabytes of data per day
  * Runs the MapReduce system on top of the Google File System (GFS)
  * In the GFS, data are partitioned into chunks, and each chunk is replicated
  * Data processing is co-located with data storage
    # When a file needs to be processed, the job scheduler finds the host nodes for each file chunk and then schedules a map process on each node

(a) MapReduce programming model

- MapReduce
  * A software framework for solving large-scale computing problems
  * Developed by Google
  * Computation is expressed as map and reduce functions
  * Map function
    # Written by the user
    # Processes a key/value pair to generate a set of intermediate key/value pairs: map(key1, value1) -> list(key2, value2)
  * Reduce function
    # Written by the user
    # Merges all intermediate values associated with the same intermediate key: reduce(key2, list(value2)) -> list(value2)
5-2
- Example: wordcount
  * Counts the occurrences of each word in a large collection of documents
  * Steps:
    # Read the input (typically from a distributed file system)
    # Break the input into key/value pairs
    # Partition the pairs into groups for processing
      - E.g., the Map function emits a word and its associated count of occurrence: 1
    # Reduce the key/value pairs, once for each unique key in the sorted list, to produce a combined result
      - E.g., the Reduce function sums all the counts emitted for a particular word
  * Pseudo code (for the document "to be or not to be"):

    def map(key, value):
        # key: document name
        # value: document contents
        result = []
        for word in value.split():
            result.append((word, 1))
        return result

    def reduce(key, listofvalues):
        # key: a word
        # listofvalues: a list of counts
        result = 0
        for x in listofvalues:
            result += x
        return (key, result)
5-3
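The pseudocode above can be completed into a runnable single-machine sketch of the whole pipeline. The in-memory "shuffle" step stands in for the sort/group machinery the framework provides between the map and reduce phases; the function names here are illustrative, not part of any MapReduce API:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    # Emit (word, 1) for every word in the document
    return [(word, 1) for word in value.split()]

def reduce_fn(key, listofvalues):
    # key: a word; listofvalues: a list of counts
    return (key, sum(listofvalues))

def run_wordcount(doc_name, contents):
    intermediate = map_fn(doc_name, contents)   # map phase
    groups = defaultdict(list)                  # shuffle: group values by key
    for word, count in intermediate:
        groups[word].append(count)
    # reduce phase: one call per unique key
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(run_wordcount("doc1", "to be or not to be"))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

This reproduces the trace on the next slide: the map output contains (to, 1) twice and (be, 1) twice, and the reduce phase collapses each group into a single count.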
- Output of Map: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
- Process of Reduce:
  * key = to,  values = [1, 1] -> 2
  * key = be,  values = [1, 1] -> 2
  * key = or,  values = [1]    -> 1
  * key = not, values = [1]    -> 1
- Output of Reduce: (to, 2), (be, 2), (or, 1), (not, 1)
- Main features of MapReduce
  * Data-aware: when scheduling, the MapReduce master node takes into consideration the data locations retrieved from the GFS master node
  * Simplicity: allows parallel and distributed applications to be designed easily
  * Manageability: input and output data are easier to manage because data and computation are co-located (taking advantage of the GFS)
  * Scalability: adding nodes increases performance
  * Fault tolerance: data in the GFS are distributed; hardware failures can be handled by simply removing the failed nodes and installing new ones
  * Reliability: tasks can be assigned to many nodes; a failed task can be reassigned to another node; slow tasks can be handled by adding more nodes
- Execution of MapReduce:
  * First split the input file into M pieces (16 to 64 MB per piece)
  * Start many copies of the program
5-4
    # One is the master and the others are workers
    # Jobs of the master: scheduling and monitoring
      - Scheduling: assigns the map and reduce tasks to the workers
      - Monitoring: monitors the task progress and the worker health
  * The master assigns a map task to an idle worker, taking data locality into account
5-5
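A locality-aware assignment like the one the master performs can be sketched as a greedy policy: for each chunk, prefer an idle worker that already stores a replica. This is a simplified assumption for illustration; Google's scheduler also weighs rack locality and queues tasks when no worker is idle:

```python
def assign_map_tasks(chunk_locations, idle_workers):
    """Greedily assign each chunk's map task to an idle worker,
    preferring a worker that already stores a replica of the chunk.
    Simplified sketch: assumes at least as many idle workers as chunks."""
    idle = set(idle_workers)
    assignment = {}
    for chunk, replica_hosts in sorted(chunk_locations.items()):
        # Prefer a worker holding the data (data locality)
        local = sorted(idle & set(replica_hosts))
        worker = local[0] if local else sorted(idle)[0]
        assignment[chunk] = worker
        idle.discard(worker)
    return assignment

# Hypothetical chunk-to-replica table and idle worker list
chunks = {"c1": ["w1", "w2"], "c2": ["w3"], "c3": ["w5"]}
print(assign_map_tasks(chunks, ["w1", "w3", "w4"]))
# {'c1': 'w1', 'c2': 'w3', 'c3': 'w4'}
```

Chunks c1 and c2 land on workers that hold their data; c3's only replica host (w5) is busy, so its task falls back to a remote worker (w4) and must read the chunk over the network.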
  * The map worker reads the content of the split and emits key/value pairs to the Map function
  * The Map function produces intermediate key/value pairs and buffers them in memory
  * The map worker passes the locations of the stored pairs to the reduce workers
  * The reduce worker reads the buffered data using remote procedure calls (RPC)
  * The reduce worker sorts the keys, groups values of the same key together, and then passes the intermediate values to the Reduce function
  * Finally, the Reduce function produces the output in R output files (one per reduce task)
- Google MapReduce implementation
  * Large clusters of Linux PCs connected through Ethernet switches
  * Tasks are forked using RPCs
  * Buffering and communication occur by reading and writing files on the GFS
  * The runtime library is written in C++ with interfaces in Python and Java
  * MapReduce jobs are spread across Google's massive computing clusters
  * Example: MapReduce statistics for different months

    |                            | Aug. 04 | Mar. 06 | Sep. 07 |
    | Number of jobs (1000s)     | 29      | 171     | 2,217   |
    | Avg. completion time (sec) | 634     | 874     | 395     |
    | Machine years used         | 217     | 2,002   | 11,081  |
    | Map input data (TB)        | 3,288   | 52,254  | 403,152 |
    | Map output data (TB)       | 758     | 6,743   | 34,774  |
    | Reduce output data (TB)    | 193     | 2,970   | 14,018  |
    | Avg. machines per job      | 157     | 268     | 394     |
5-6
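The partitioning into R reduce tasks and the reduce-side sort/group steps described in the execution flow above can be sketched in a few lines. The in-memory lists stand in for intermediate files, and the byte-based partition function is a toy deterministic stand-in for the hash(key) mod R policy:

```python
from itertools import groupby
from operator import itemgetter

R = 2  # number of reduce tasks (and therefore output files)

def partition(key, R):
    # Toy deterministic stand-in for hash(key) mod R, chosen so the
    # example is reproducible; real systems use a proper hash function
    return key.encode()[0] % R

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]

# Map side: split the intermediate pairs into R partitions
buckets = [[] for _ in range(R)]
for k, v in pairs:
    buckets[partition(k, R)].append((k, v))

# Reduce side: each reduce task sorts its partition, groups values
# by key, and produces one output "file" per task
outputs = []
for bucket in buckets:
    bucket.sort(key=itemgetter(0))
    outputs.append([(k, sum(v for _, v in grp))
                    for k, grp in groupby(bucket, key=itemgetter(0))])
print(outputs)
# [[('be', 2), ('not', 1), ('to', 2)], [('or', 1)]]
```

All pairs with the same key land in the same partition, so each reduce task can compute final counts independently of the others.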
(b) Major MapReduce implementations for the cloud

- MapReduce implementations around the world

  | Owner      | Imp. Name                                 | Start Time | Distribution Model |
  | Google     | Google MapReduce                          | 2004       | Internal use       |
  | Apache     | Hadoop                                    | 2004       | Open source        |
  | GridGain   | GridGain                                  | 2005       | Open source        |
  | Nokia      | Disco                                     | 2008       | Open source        |
  | Geni.com   | SkyNet                                    | 2007       | Open source        |
  | Manjrasoft | MapReduce.NET (optional service of Aneka) | 2008       | Commercial         |

- Comparison of MapReduce implementations

  |                             | Google MapReduce    | Hadoop                           | Disco                            | MapReduce.NET                             | Skynet                                | GridGain                               |
  | Focus                       | Data-intensive      | Data-intensive                   | Data-intensive                   | Data- and compute-intensive               | Data-intensive                        | Data- and compute-intensive            |
  | Architecture, platform      | Master-slave, Linux | Master-slave, cross-platform     | Master-slave, Linux and Mac OS X | Master-slave, .NET on Windows             | P2P, OS-independent                   | Master-slave, Windows, Linux, Mac OS X |
  | Storage system              | GFS                 | HDFS, CloudStore, S3             | GlusterFS                        | WinDFS, CIFS, and NTFS                    | Message queuing: Tuplespace and MySQL | Data grid                              |
  | Implementation technology   | C++                 | Java                             | Erlang                           | C#                                        | Ruby                                  | Java                                   |
  | Programming environment     | Java and Python     | Java; shell utilities using Hadoop streaming; C++ using Hadoop pipes | Python | C#                       | Ruby                                  | Java                                   |
  | Deployment                  | On Google clusters  | Private and public cloud (EC2)   | Private and public cloud (EC2)   | Using Aneka, on private and public clouds | Web application (Rails)               | Private and public cloud               |
  | Some users and applications | Google              | Baidu, NetSeer, A9.com, Facebook | Nokia Research Center            | Vel Tech University                       | Geni.com                              | MedVoxel, Pointloyalty, Traficon       |

- Hadoop
  * Top-level Apache open-source project
  * Advocated by Google, Yahoo!, Microsoft, and Facebook
  * Subprojects of Hadoop:
5-8
    # Hadoop Common: common utilities that support the other Hadoop subprojects
    # Avro: a data serialization system that provides dynamic integration with scripting languages
    # Chukwa: a data collection system for managing large distributed systems
    # HBase: a scalable, distributed database that supports structured data storage for large tables
    # HDFS: a distributed file system that provides high-throughput access to application data
    # Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
    # MapReduce: a software framework for distributed processing of large data sets
    # Pig: a high-level data-flow language and execution framework for parallel computation
    # ZooKeeper: a high-performance coordination service for distributed applications
  * Hadoop MapReduce overview:
    # Hadoop Common (formerly Hadoop Core)
      - Includes file system, RPC, and serialization libraries
      - Provides basic services for building a cloud computing environment
    # Two subprojects
      * MapReduce framework: has a master/slave architecture
        # Master (also called JobTracker)
          - Responsible for querying the NameNode for the block locations
          - Schedules the tasks on the slaves that host the blocks
          - Monitors the successes and failures of the tasks
        # Slave (also called TaskTracker): executes the tasks as directed by the master
      * Hadoop Distributed File System (HDFS)
        # A distributed file system that runs on clusters of commodity machines
5-9
        # Highly fault-tolerant
        # High-speed data access
        # Appropriate for data-intensive applications
  * Major enterprise solutions based on Hadoop: Yahoo!, Cloudera, Amazon, Sun Microsystems, IBM,
  * Organizations using Hadoop to run distributed applications:
5-10
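As a concrete illustration of the Hadoop streaming programming environment mentioned in the comparison table, a wordcount job can be written as two small scripts that read and write plain text streams. The sketch below parameterizes the stream handles so the logic can be exercised without a cluster; on a real cluster each function would read `sys.stdin`, and the sort between the two phases is performed by the framework:

```python
import io
from itertools import groupby

def mapper(instream, outstream):
    # Streaming mapper: emit one "word<TAB>1" line per word
    for line in instream:
        for word in line.split():
            outstream.write(f"{word}\t1\n")

def reducer(instream, outstream):
    # Streaming reducer: Hadoop delivers lines sorted by key, so
    # consecutive lines that share a word can be summed directly
    pairs = (line.rstrip("\n").split("\t") for line in instream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        outstream.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Local dry run; the sort between the phases mimics the shuffle
mapped = io.StringIO()
mapper(io.StringIO("to be or not to be\n"), mapped)
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
reducer(shuffled, reduced)
print(reduced.getvalue(), end="")
# be	2
# not	1
# or	1
# to	2
```

The tab-separated key/value convention is what lets Hadoop streaming sort and route the mapper's output to the right reducers without understanding the application's data.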
- Disco
  * An open-source MapReduce implementation developed by Nokia
  * Started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing
  * The core of Disco is written in Erlang; users of Disco write their code in Python
  * Based on a master-slave architecture:
5-11
    # When the master receives jobs from a client, it adds them to the job queue and runs them in the cluster when CPUs become available
    # On each node, there is a worker supervisor responsible for spawning and monitoring all the running Python worker processes within that node
    # A Python worker runs the assigned tasks and then sends the addresses of the resulting files to the master through its supervisor
    # An httpd daemon (Web server) runs on each node, which enables a remote Python worker to access files on the local disk of that particular node
- MapReduce.NET
  * A realization of MapReduce for the .NET platform of Microsoft
5-12
  * Objective: provide support for a wider variety of data-intensive and compute-intensive applications, e.g., MRPGA: MapReduce for parallel GA applications
  * The MapReduce.NET runtime library is assisted by several component services from Aneka and runs on WinDFS
    # Aneka is a .NET-based platform for enterprise and public cloud computing
    # WinDFS: a Windows distributed file system
  * MapReduce.NET can also work with the Common Internet File System (CIFS) or NTFS
- Skynet
  * A Ruby implementation of MapReduce, created by Geni
  * An adaptive, self-upgrading, fault-tolerant, and fully distributed system
  * At the heart of Skynet is a plug-in based message queue architecture, with the message queue allowing workers to watch out for each other
  * Tasks put on the message queue are picked up by Skynet workers
  * Skynet tells the workers where all the needed code is, and the workers put their results back on the message queue
- GridGain
  * An open cloud platform, developed in Java, for Java
  * Enables users to develop and run applications on private or public clouds
  * New features are added in addition to MapReduce:
    # Distributed task sessions, checkpoints for long-running tasks, early and late load balancing, and affinity co-location with data grids
5-13
(c) MapReduce impacts and research directions

- MapReduce's influence
  * Many projects are exploring ways to support MapReduce on various types of distributed architectures and for a wider range of applications
  * Examples:
    # Qt Concurrent
      - A C++ library for multi-threaded applications
      - Provides a MapReduce implementation for multi-core computers
5-14
    # Stanford's Phoenix
      - A MapReduce implementation that targets shared-memory architectures
    # Mars framework
      - Aims to provide a generic framework for developers to implement data- and computation-intensive tasks correctly, efficiently, and easily on the GPU
    # Hadoop, Disco, Skynet, and GridGain
      - Open-source implementations of MapReduce for large-scale data processing
    # Map-Reduce-Merge
      - An extension of MapReduce that adds a merge phase to easily process data relationships among heterogeneous datasets
    # Microsoft Dryad
      - A distributed execution engine for coarse-grain data-parallel applications
      - Tasks are expressed as a directed acyclic graph (DAG)
5-15
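The merge phase that Map-Reduce-Merge adds can be illustrated with a small sketch: two independently reduced datasets, each sorted by key, are combined by a user-supplied merger function, much like a sort-merge join. The record names and merger below are hypothetical, chosen only to show the shape of the operation:

```python
def merge(left, right, merger):
    # Sort-merge join of two key-sorted reduced outputs; the
    # user-supplied `merger` combines the two values per shared key
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk == rk:
            out.append((lk, merger(lv, rv)))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

# Hypothetical reduced outputs of two separate MapReduce jobs:
# per-employee bonus totals and per-employee departments
bonuses = [("e1", 4500), ("e2", 3800), ("e3", 5100)]
departments = [("e1", "sales"), ("e3", "research")]
print(merge(bonuses, departments, lambda b, d: (b, d)))
# [('e1', (4500, 'sales')), ('e3', (5100, 'research'))]
```

Plain MapReduce would need an extra job with tagged keys to express this join; the point of the merge phase is to make such cross-dataset relationships a first-class step.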