Chapter 5. The MapReduce Programming Model and Implementation


- Traditional computing: data-to-computing (send the data to the computation)
  * Data stored in a separate repository
  * Data brought into the system for computing (time-consuming and limits interactivity)
  * Data transfer usually becomes the bottleneck if the amount of data is huge
- Cloud computing: computing-to-data (send the computation to the data)
  * The system collects and maintains the data (a shared, active data set)
  * Computation is co-located with storage
  * Data transfer is faster
  (Figure: data-to-computing vs. computing-to-data)

- Example of data-intensive computing: Google
  * Processes some 20 petabytes of data per day
  * Runs the MapReduce system on top of the Google File System (GFS)
  * In GFS, data are partitioned into chunks, and each chunk is replicated
  * Data processing is co-located with data storage
    # When a file needs to be processed, the job scheduler finds the host nodes for each file chunk and then schedules a map process on each node

(a) MapReduce programming model

- MapReduce
  * A software framework for solving large-scale computing problems
  * Developed by Google
  * Computation is expressed as map and reduce functions
  * Map function
    # Written by the user
    # Processes a key/value pair to generate a set of intermediate key/value pairs:
      map(key1, value1) -> list(key2, value2)
  * Reduce function
    # Written by the user
    # Merges all intermediate values associated with the same intermediate key:
      reduce(key2, list(value2)) -> list(value2)

- Example: wordcount
  * Counts the occurrences of each word in a large collection of documents
  * Steps:
    # Read the input (typically from a distributed file system)
    # Break the input into key/value pairs
    # Partition the pairs into groups for processing
      - E.g., the Map function emits each word with its associated count of occurrence: 1
    # Reduce the key/value pairs, once for each unique key in the sorted list, to produce a combined result
      - E.g., the Reduce function sums all the counts emitted for a particular word
  * Pseudocode (for the document "to be or not to be"):

      def map(key, value):
          # key: document name
          # value: document contents
          for word in value.split():
              emit(word, 1)

      def reduce(key, list_of_values):
          # key: a word
          # list_of_values: a list of counts
          result = 0
          for count in list_of_values:
              result += count
          emit(key, result)
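The wordcount pseudocode above can be exercised end-to-end with a small, self-contained Python simulation. The `shuffle` helper below is an illustrative stand-in for the framework's sort-and-group step; it is not part of the original pseudocode.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in value.split()]

def shuffle(pairs):
    # Stand-in for the framework's sort/group step: collect all
    # values emitted under the same intermediate key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # Sum all counts emitted for a particular word.
    return (key, sum(values))

pairs = map_fn("doc1", "to be or not to be")
result = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Running this reproduces the map output and reduce output traced on the next slide.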

- Output of map: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
- Process of reduce:
  key = to,  values = 1, 1 -> 2
  key = be,  values = 1, 1 -> 2
  key = or,  values = 1    -> 1
  key = not, values = 1    -> 1
- Output of reduce: (to, 2), (be, 2), (or, 1), (not, 1)

- Main features of MapReduce
  * Data-aware: when scheduling, the MapReduce master node takes into consideration the data locations retrieved from the GFS master node
  * Simplicity: allows parallel and distributed applications to be designed easily
  * Manageability: input and output data are easier to manage because data and computation are co-allocated (taking advantage of the GFS)
  * Scalability: adding nodes increases performance
  * Fault tolerance: data in GFS are distributed; a hardware failure can be handled by simply removing the failed node and installing a new one
  * Reliability: tasks can be assigned to many nodes; a failed task can be reassigned to another node; slow tasks can be handled by adding more nodes

- Execution of MapReduce:
  * First split the input file into M pieces (16 to 64 MB per piece)
  * Start many copies of the program

    # One copy is the master and the others are workers
    # Jobs of the master: scheduling and monitoring
      - Scheduling: assigns the map and reduce tasks to the workers
      - Monitoring: monitors the task progress and the worker health
  * The master assigns each map task to an idle worker, taking data locality into account
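The master's locality-aware assignment of map tasks can be sketched as follows. The data structures and the preference order (a replica-holding node first, otherwise any idle worker) are illustrative assumptions, not Google's actual scheduler.

```python
def assign_map_tasks(splits, idle_workers):
    # splits: {split_id: set of workers holding a replica of that split}
    # idle_workers: worker ids with a free map slot
    # Prefer a worker that already stores the split (data locality);
    # otherwise fall back to any idle worker. Sketch only: assumes
    # there are at least as many idle workers as splits.
    assignments = {}
    idle = list(idle_workers)
    for split_id, replica_hosts in splits.items():
        local = [w for w in idle if w in replica_hosts]
        chosen = local[0] if local else idle[0]
        idle.remove(chosen)
        assignments[split_id] = chosen
    return assignments

splits = {"s0": {"w1", "w2"}, "s1": {"w3"}, "s2": {"w9"}}
print(assign_map_tasks(splits, ["w1", "w3", "w4"]))
# {'s0': 'w1', 's1': 'w3', 's2': 'w4'}
```

Here "s2" has no idle replica holder, so it falls back to the non-local worker "w4".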

  * The map worker reads the content of its split and passes key/value pairs to the Map function
  * The Map function produces intermediate key/value pairs, which are buffered in memory and periodically written to local disk
  * The map worker passes the locations of the stored pairs to the reduce workers (via the master)
  * The reduce worker reads the buffered data using remote procedure calls (RPC)
  * The reduce worker sorts the keys, groups values of the same key together, and then passes the intermediate values to the Reduce function
  * Finally, the Reduce function produces the output in R output files (one per reduce task)

- Google MapReduce implementation
  * Large clusters of Linux PCs connected through Ethernet switches
  * Tasks are forked using RPCs
  * Buffering and communication occur by reading and writing files on the GFS
  * The runtime library is written in C++, with interfaces in Python and Java
  * MapReduce jobs are spread across Google's massive computing clusters
  * Example: MapReduce statistics for different months

                                   Aug. '04   Mar. '06   Sep. '07
      Number of jobs (1000s)             29        171      2,217
      Avg. completion time (sec)        634        874        395
      Machine years used                217      2,002     11,081
      Map input data (TB)             3,288     52,254    403,152
      Map output data (TB)              758      6,743     34,774
      Reduce output data (TB)           193      2,970     14,018
      Avg. machines per job             157        268        394
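The routing of intermediate keys to the R reduce tasks is done with a partition function; the default described in the MapReduce paper is hash(key) mod R. A minimal sketch, with Python's built-in hash() standing in for the framework's hash function:

```python
R = 3  # number of reduce tasks (and hence output files)

def partition(key, num_reducers):
    # Default partitioning: hash(key) mod R, which guarantees that
    # every pair with the same key reaches the same reduce task.
    return hash(key) % num_reducers

# Intermediate pairs emitted by the map phase of the wordcount example.
intermediate = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]

# Group the pairs into one bucket per reduce task.
buckets = {r: [] for r in range(R)}
for k, v in intermediate:
    buckets[partition(k, R)].append((k, v))

# Both occurrences of "to" (and of "be") land in the same bucket,
# so each reduce task sees all values for the keys it owns.
```

Users can also supply a custom partition function (for example, hashing only part of the key) when related keys must end up in the same output file.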

(b) Major MapReduce implementations for the cloud

- MapReduce implementations around the world

      Owner       Implementation                              Start   Distribution model
      Google      Google MapReduce                            2004    Internal use
      Apache      Hadoop                                      2004    Open source
      GridGain    GridGain                                    2005    Open source
      Nokia       Disco                                       2008    Open source
      Geni.com    Skynet                                      2007    Open source
      Manjrasoft  MapReduce.NET (optional service of Aneka)   2008    Commercial

- Comparison of MapReduce implementations
  * Google MapReduce
    # Focus: data-intensive; architecture: master-slave; platform: Linux
    # Storage system: GFS
    # Implementation technology: C++; programming environment: Java and Python
    # Deployment: on Google clusters
    # Some users and applications: Google
  * Hadoop
    # Focus: data-intensive; architecture: master-slave; platform: cross-platform
    # Storage system: HDFS, CloudStore, S3
    # Implementation technology: Java; programming environment: Java, shell utilities via Hadoop streaming, C++ via Hadoop pipes
    # Deployment: private and public cloud (EC2)
    # Some users and applications: Baidu, NetSeer, A9.com, Facebook
  * Disco
    # Focus: data-intensive; architecture: master-slave; platform: Linux, Mac OS X
    # Storage system: GlusterFS
    # Implementation technology: Erlang; programming environment: Python
    # Deployment: private and public cloud (EC2)
    # Some users and applications: Nokia Research Center
  * MapReduce.NET
    # Focus: data- and compute-intensive; architecture: master-slave; platform: .NET on Windows
    # Storage system: WinDFS, CIFS, and NTFS
    # Implementation technology: C#; programming environment: C#
    # Deployment: using Aneka, can be deployed on private and public clouds
    # Some users and applications: Vel Tech University
  * Skynet
    # Focus: data-intensive; architecture: P2P; platform: OS-independent
    # Storage system: message queuing (Tuplespace and MySQL)
    # Implementation technology: Ruby; programming environment: Ruby
    # Deployment: Web application (Rails)
    # Some users and applications: Geni.com
  * GridGain
    # Focus: data- and compute-intensive; architecture: master-slave; platform: Windows, Linux, Mac OS X
    # Storage system: data grid
    # Implementation technology: Java; programming environment: Java
    # Deployment: private and public cloud
    # Some users and applications: MedVoxel, Pointloyalty, Traficon

- Hadoop
  * A top-level Apache open-source project
  * Advocated by Google, Yahoo!, Microsoft, and Facebook
  * Subprojects of Hadoop:

    # Hadoop Common: common utilities that support the other Hadoop subprojects
    # Avro: a data serialization system that provides dynamic integration with scripting languages
    # Chukwa: a data collection system for managing large distributed systems
    # HBase: a scalable, distributed database that supports structured data storage for large tables
    # HDFS: a distributed file system that provides high-throughput access to application data
    # Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
    # MapReduce: a software framework for distributed processing of large data sets
    # Pig: a high-level data-flow language and execution framework for parallel computation
    # ZooKeeper: a high-performance coordination service for distributed applications

- Hadoop MapReduce overview
  * Hadoop Common (formerly Hadoop Core)
    # Includes file system, RPC, and serialization libraries
    # Provides basic services for building a cloud computing environment
    # Two subprojects:
  * MapReduce framework: has a master/slave architecture
    # Master (also called the JobTracker)
      - Queries the NameNode for the block locations
      - Schedules tasks on the slaves that host the blocks
      - Monitors the successes and failures of the tasks
    # Slave (also called the TaskTracker): executes tasks as directed by the master
  * Hadoop Distributed File System (HDFS)
    # A distributed file system that runs on clusters of commodity machines

    # Highly fault-tolerant
    # High-speed data access
    # Appropriate for data-intensive applications
  * Major enterprise solutions based on Hadoop: Yahoo!, Cloudera, Amazon, Sun Microsystems, IBM, among others
  * Many organizations use Hadoop to run distributed applications
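Hadoop streaming (listed as a programming environment in the comparison above) lets the map and reduce steps be plain scripts that read stdin and write tab-separated key/value lines. A hedged sketch of a word-count pair in that style; the local pipeline at the bottom, which emulates the framework's sort phase in-process, is illustrative:

```python
import io

def mapper(stream):
    # Streaming-style mapper: emit one "word<TAB>1" line per word.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stream):
    # Streaming-style reducer: input arrives sorted by key, so counts
    # can be accumulated until the key changes.
    current, count = None, 0
    for line in stream:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Emulate the framework's shuffle/sort between the two scripts.
mapped = sorted(mapper(io.StringIO("to be or not to be")))
out = list(reducer(iter(mapped)))
print(out)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

On a real cluster the two functions would live in separate scripts passed to the streaming utility, with HDFS supplying the input and the framework performing the sort between them.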

- Disco
  * An open-source MapReduce implementation developed by Nokia
  * Started at the Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing
  * The core of Disco is written in Erlang; users of Disco write their code in Python
  * Based on a master-slave architecture:

    # When the master receives jobs from a client, it adds them to the job queue and runs them in the cluster as CPUs become available
    # On each node, a worker supervisor is responsible for spawning and monitoring all the running Python worker processes on that node
    # A Python worker runs its assigned tasks and then sends the addresses of the resulting files to the master through its supervisor
    # An httpd daemon (Web server) runs on each node, enabling a remote Python worker to access files on the local disk of that particular node

- MapReduce.NET
  * A realization of MapReduce for Microsoft's .NET platform
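The master's job queue and the pull-based worker model described above can be sketched with a thread-safe queue. The worker count and task payloads are illustrative; this is a toy simulation, not Disco's actual Erlang internals.

```python
import queue
import threading

def worker(tasks, results):
    # Stand-in for a Python worker: pull tasks from the master's
    # queue as CPU becomes free, run them, and report the "address"
    # of each result back to the master.
    while True:
        try:
            task_id, payload = tasks.get_nowait()
        except queue.Empty:
            return  # queue drained; worker exits
        results.put((task_id, f"result-of-{payload}"))
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
for i, chunk in enumerate(["chunk0", "chunk1", "chunk2", "chunk3"]):
    tasks.put((i, chunk))

# Two workers drain the queue concurrently; each task runs exactly once.
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

collected = sorted(results.queue)
print(len(collected))  # 4
```

The queue hands each task to exactly one worker, which is the property that lets the master stay simple while workers come and go.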

  * Objective: support a wider variety of data-intensive and compute-intensive applications, e.g., MRPGA (MapReduce for parallel genetic-algorithm applications)
  * The MapReduce.NET runtime library is assisted by several component services from Aneka and runs on WinDFS
    # Aneka is a .NET-based platform for enterprise and public cloud computing
    # WinDFS is a Windows distributed file system
  * MapReduce.NET can also work with the Common Internet File System (CIFS) or NTFS

- Skynet
  * A Ruby implementation of MapReduce, created by Geni
  * An adaptive, self-upgrading, fault-tolerant, and fully distributed system
  * At the heart of Skynet is a plug-in-based message queue architecture, with the message queue allowing workers to watch out for each other
  * Tasks put on the message queue are picked up by Skynet workers
  * Skynet tells each worker where all the needed code is, and the workers put their results back on the message queue

- GridGain
  * An open cloud platform, developed in Java, for Java
  * Enables users to develop and run applications on private or public clouds
  * New features are added on top of MapReduce:
    # Distributed task sessions, checkpoints for long-running tasks, early and late load balancing, and affinity co-location with data grids

(c) MapReduce impacts and research directions

- MapReduce's influence
  * Many projects are exploring ways to support MapReduce on various types of distributed architectures and for a wider range of applications
  * Examples:
    # Qt Concurrent
      - A C++ library for multi-threaded applications
      - Provides a MapReduce implementation for multi-core computers

    # Stanford's Phoenix
      - A MapReduce implementation that targets shared-memory architectures
    # The Mars framework
      - Aims to provide a generic framework for developers to implement data- and computation-intensive tasks correctly, efficiently, and easily on the GPU
    # Hadoop, Disco, Skynet, and GridGain
      - Open-source implementations of MapReduce for large-scale data processing
    # Map-Reduce-Merge
      - An extension of MapReduce that adds a merge phase to easily process data relationships among heterogeneous datasets
    # Microsoft Dryad
      - A distributed execution engine for coarse-grained data-parallel applications
      - Tasks are expressed as a directed acyclic graph (DAG)
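The merge phase that Map-Reduce-Merge adds can be illustrated by joining the reduce outputs of two separate MapReduce jobs on a shared key, which is exactly the kind of heterogeneous-dataset relationship plain MapReduce cannot express in a single job. The department data below is a made-up example, not from the original paper.

```python
def merge(left, right):
    # Merge phase: join two reducer outputs on their shared key.
    right_index = dict(right)
    for key, lval in left:
        if key in right_index:
            yield key, (lval, right_index[key])

# Reduce outputs of two hypothetical upstream jobs, keyed by department id.
dept_totals = [("d1", 300), ("d2", 120)]        # job 1: salary totals
dept_names  = [("d1", "Sales"), ("d3", "R&D")]  # job 2: department names

print(list(merge(dept_totals, dept_names)))
# [('d1', (300, 'Sales'))]
```

Only "d1" appears in both inputs, so the merge emits a single joined record; "d2" and "d3" are dropped, as in an inner join.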