The MapReduce Framework

The MapReduce Framework. In partial fulfilment of the requirements for course CMPT 816. Presented by: Ahmed Abdel Moamen, Agents Lab

Overview
MapReduce was first introduced by Google in 2004. It is a programming model for processing large data sets, and also a software framework that lets developers write programs that process massive amounts of data in parallel across a distributed cluster of computers. Implementations of MapReduce are available in a variety of programming languages, such as Java, C++, Python, Perl, Ruby, and C.

MapReduce has been gaining popularity and is used extensively at Google, where it processes 20 petabytes of data per day. At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses. MapReduce is also used at Facebook for production jobs, including data import and hourly reports.

Motivations
Companies such as Google and Yahoo deal with massive amounts of data (terabytes), need to process that data fairly quickly, and use very large numbers of machines to do so. There is therefore a demand for large-scale data processing. Using many machines raises coordination and scaling issues and makes fault tolerance essential. MapReduce provides an efficient solution that satisfies these requirements.

MapReduce model
The MapReduce model is inspired by the map and reduce functions commonly used in functional programming:
Map: extract something we care about from each record of data.
Reduce: aggregate, summarize, filter, or transform.
The framework is divided into two parts: a Map function, whose work is divided among the different nodes of the distributed cluster, and a Reduce function, which collates that work and resolves the results into a single value.
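As a small illustration of the functional-programming roots mentioned above, the sketch below uses Python's built-in `map` and `functools.reduce` on a single machine. The input list `records` is a made-up example, not data from the presentation, and nothing here is distributed.

```python
from functools import reduce

# Hypothetical input records (illustrative only).
records = ["3", "5", "8"]

# map: apply the same function to every record independently.
values = list(map(int, records))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, x: acc + x, values, 0)

print(values, total)
```

Because each record is mapped independently, the map step can be spread over many machines. Only the reduce step has to bring values together.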

MapReduce Flow
The input is a list of records. The records are split among the different computers, each of which applies Map to its share. The result of the Map computation is a list of key/value pairs. Reduce combines the set of values that share the same key into a single value.

Example: Word Count
Each document is split into words. The map function emits a count for each word. The framework combines all pairs with the same key and feeds them to reduce. The reduce function sums all input values to find the total number of appearances of that word.

Example: Word Count

Map function:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

Reduce function:

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.
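The pseudocode above can be tried out on a single machine. The following is a minimal sketch, not the Google or Hadoop implementation: `map_fn`, `reduce_fn`, `run_word_count`, and the sample documents are hypothetical names chosen for illustration, and the grouping step that the framework would perform across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return word, sum(counts)

def run_word_count(documents):
    # Simulated shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    # Apply reduce once per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

# Hypothetical sample input.
docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = run_word_count(docs)
print(counts)
```

In a real deployment the calls to `map_fn` and `reduce_fn` would run on different machines. Only the two user-written functions would stay the same.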

MapReduce Architecture
Map invocations are distributed across machines by partitioning the input data into a set of M splits, which can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function. The number of partitions (R) and the partitioning function are specified by the user.
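A rough sketch of the two partitioning steps is shown below. The helper names `split_input` and `partition` are hypothetical. The hash-modulo-R rule is the default partitioning function described in the Dean and Ghemawat paper, and `zlib.crc32` stands in here for whatever hash a real system would use.

```python
import zlib

def split_input(records, M):
    # Divide the records into exactly M contiguous splits of near-equal size.
    n = len(records)
    return [records[i * n // M:(i + 1) * n // M] for i in range(M)]

def partition(key, R):
    # Assign an intermediate key to one of R reduce partitions.
    # A stable hash is used so the same key always maps to the same
    # partition (Python's built-in hash() of str varies between processes).
    return zlib.crc32(key.encode("utf-8")) % R

splits = split_input(list(range(10)), 3)
bucket = partition("apple", 4)
print(splits, bucket)
```

Because every occurrence of a key hashes to the same partition, a single reduce task sees all the values for that key.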

Execution overview (Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.)

Fault Tolerance in Map/Reduce
Fault tolerance is one of the most critical issues for MapReduce. MapReduce handles failures through re-execution. The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Once a machine failure is detected, the in-progress map or reduce tasks on that machine need to be re-executed on other machines.
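The ping-and-re-execute idea can be sketched as follows. This is an illustrative single-process model under simplifying assumptions, not the actual master logic: the function names, the worker and task names, the timestamps, and the round-robin reassignment policy are all made up for the example.

```python
def find_failed_workers(last_response, now, timeout):
    # A worker is marked failed if it has not answered a ping
    # within `timeout` seconds.
    return {w for w, t in last_response.items() if now - t > timeout}

def reschedule(assignments, failed, healthy):
    # Move in-progress tasks from failed workers to healthy ones
    # (round-robin is an assumed policy, chosen for simplicity).
    new_assignments = {}
    i = 0
    for task, worker in assignments.items():
        if worker in failed:
            worker = healthy[i % len(healthy)]
            i += 1
        new_assignments[task] = worker
    return new_assignments

# Hypothetical state: time of the last ping response from each worker.
last_response = {"w1": 100.0, "w2": 91.0, "w3": 99.5}
failed = find_failed_workers(last_response, now=101.0, timeout=5.0)
healthy = sorted(set(last_response) - failed)

assignments = {"map-0": "w1", "map-1": "w2", "reduce-0": "w2"}
new_assignments = reschedule(assignments, failed, healthy)
print(failed, new_assignments)
```

Re-execution is safe in this model because a map or reduce task depends only on its input, so running it again on another machine yields the same result.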

Hadoop Map/Reduce Framework
Hadoop is an open-source software framework that supports data-intensive distributed applications. It is now used by Yahoo, Amazon, IBM, Facebook, Rackspace, the New York Times, and others. MapReduce is considered the heart of Hadoop. MapReduce programs have been implemented internally at Google over the past nine years, and an average of one hundred thousand MapReduce jobs are executed on Google clusters every day.

Here are some statistics for a subset of MapReduce jobs run at Google in various months (Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.).

The Hadoop MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling a job's component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

Hadoop Map/Reduce: Word Count Program Map Function: Reduce Function:

Hadoop Map/Reduce: Word Count Program Main Function:

Conclusions
MapReduce is a flexible programming framework that serves many applications through a pair of restricted constructs, Map() and Reduce(). It hides the details of parallelization, fault tolerance, locality optimization, and load balancing. The model is easy to use, even for programmers without experience with parallel and distributed systems. A number of frameworks supporting MapReduce are in development.

References
1. J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
2. Hadoop. [Online]. Available: http://lucene.apache.org/hadoop
3. Map/reduce tutorial. [Online]. Available: http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html
4. J. Dean, Designs, lessons and advice from building large distributed systems, in The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), Big Sky, MT, October 2009.
5. Amazon elastic mapreduce. [Online]. Available: http://aws.amazon.com/elasticmapreduce/
6. Distributed Systems. [Online]. Available: http://code.google.com/edu/parallel/index.html
7. On wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/mapreduce/
