MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
Presented by Zachary Bischof
EECS 345 Distributed Systems, Winter '10

Outline
- Motivation
- Summary
- Example
- Implementation
- Discussion

Motivation for MapReduce
- Parallel programming is hard
- Goal of MapReduce:
  - Process a lot of data (terabytes) over a lot of machines (hundreds or thousands)
  - Hide the details of parallelization
- Need a new programming model

Summary
How does MapReduce help?
- Hides details of parallelization from the programmer
- Provides some transparency
- Handles fault tolerance
- Data distribution is automatic
- Does load balancing and scheduling
- Monitors the status of machines and the overall progress of the program
- Uses the Google File System (GFS)

How it Works
- Restricts the programming model
  - Work is divided into key/value pairs
  - Makes it easier to use
- Programmers write Map and Reduce functions
- MapReduce handles the rest

User-defined Map
- Input: a key/value pair (input_key, input_value)
- Output: a list of intermediate key/value pairs, list(output_key, intermediate_value)
- Analysis of a worker's dataset produces intermediate values
- Input and intermediate values may be from different domains

User-defined Reduce
- Input: an intermediate key and its list of values, (output_key, list(intermediate_value))
- Output: a list of output values, list(output_value)
- Merges together all intermediate values for a particular key into a new set of values
- The output is often a single value, but it does not have to be
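To make the shapes of the two functions explicit, here is a minimal sketch of their conceptual types in Python; the alias names are illustrative, not part of any MapReduce API:

    from typing import Callable, Iterable, List, Tuple

    # Conceptual signatures from the paper:
    #   map:    (k1, v1)       -> list((k2, v2))
    #   reduce: (k2, list(v2)) -> list(v2)
    # Input keys/values (k1, v1) may come from a different domain than
    # the intermediate and output keys/values (k2, v2).
    MapFn = Callable[[str, str], List[Tuple[str, str]]]
    ReduceFn = Callable[[str, Iterable[str]], List[str]]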

Example: Counting words

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
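For comparison, the same job can be written as a small, runnable Python sketch. The run_mapreduce driver below is only a single-machine stand-in for the framework (it simulates the shuffle by grouping intermediate pairs by key); it is not Google's library, and the function names are made up for illustration.

    from collections import defaultdict

    def map_fn(input_key, input_value):
        # input_key: document name, input_value: document contents
        # (counts are emitted as ints here; the paper's pseudocode passes strings)
        for word in input_value.split():
            yield (word, 1)

    def reduce_fn(output_key, intermediate_values):
        # output_key: a word, intermediate_values: a list of counts
        yield (output_key, sum(intermediate_values))

    def run_mapreduce(inputs, mapper, reducer):
        # Local stand-in for the framework: run map over every input pair,
        # group intermediate pairs by key (the "shuffle"), then run reduce
        # once per intermediate key.
        groups = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in mapper(key, value):
                groups[ikey].append(ivalue)
        results = []
        for ikey, ivalues in sorted(groups.items()):
            results.extend(reducer(ikey, ivalues))
        return results

    if __name__ == "__main__":
        docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
        print(run_mapreduce(docs, map_fn, reduce_fn))
        # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]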

Example (cont'd)
- The document is split up among workers
- Map step:
  - Each word gets an initial value of 1
  - Each word is a key with a list of values
- Reduce step:
  - Takes a key (in this case a word) and a list of values (all 1)
  - Adds them up
  - Passes them up the tree

Other Examples
- Distributed grep
- Distributed sort
- Machine learning
- Reverse web-link graph (sketched below)
- And more
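As one concrete sketch, the reverse web-link graph described in the paper maps each (source, target) link to the pair (target, source), and the reduce step concatenates all sources for a given target. The names below are illustrative, and the functions plug into the toy run_mapreduce driver shown earlier.

    def map_links(source_url, target_urls):
        # target_urls: the list of URLs linked to from the page at source_url
        for target in target_urls:
            yield (target, source_url)

    def reduce_links(target_url, source_urls):
        # Collect every page that links to target_url
        yield (target_url, list(source_urls))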

Implementation
- Uses a distributed file system to manage data: GFS (SOSP 2003)
- Network bandwidth is a bottleneck
  - Request data locations from GFS
  - Assign tasks to the machine that holds the data, or to one on the same switch (localizes activity)
- Combiner function
  - Does partial merging of intermediate data on the map worker
  - Reduces network traffic
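A combiner is essentially a reduce pass run on a map worker's local output before anything crosses the network. For an associative and commutative operation like word counting, a minimal sketch (hypothetical helper, not the library's API) looks like this:

    from collections import defaultdict

    def combine_word_counts(intermediate_pairs):
        # Partially merge the (word, count) pairs produced by a single map
        # task so that at most one pair per word crosses the network.
        partial = defaultdict(int)
        for word, count in intermediate_pairs:
            partial[word] += count
        return list(partial.items())

    # e.g. [("the", 1), ("the", 1), ("cat", 1)] -> [("the", 2), ("cat", 1)]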

Implementation (cont'd)
- Worker failure
  - Detected with heartbeats (the master pings workers periodically)
  - Backup tasks are used to reduce stragglers
- Some failures are caused by bad input records
  - Debug and fix? There is a local execution mode for debugging
  - The worker sends a message to the master from a signal handler on a segmentation fault
  - The master skips a record after seeing two failures on it
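The record-skipping mechanism amounts to simple bookkeeping on the master: workers report the record they were processing when they crashed, and once the master has seen the same record fail more than once, it tells the next re-execution to skip it. A toy sketch of that bookkeeping, with illustrative names rather than the real master's data structures:

    from collections import defaultdict

    class SkipTracker:
        # Toy model of the master's bad-record bookkeeping.

        def __init__(self, failure_threshold=2):
            self.failure_threshold = failure_threshold
            self.failures = defaultdict(int)

        def report_failure(self, record_id):
            # Called when a worker's signal handler reports a crash on record_id.
            self.failures[record_id] += 1

        def should_skip(self, record_id):
            # On re-execution, the record is skipped after repeated failures.
            return self.failures[record_id] >= self.failure_threshold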

Sort
  Normal execution:      891 seconds
  Without backup tasks: 1283 seconds
  200 tasks killed:      933 seconds

[Figure 3 from the paper: data transfer rates over time for the sort program, plotted as output (MB/s) versus seconds, for (a) normal execution, (b) no backup tasks, and (c) 200 tasks killed.]

Comments/Questions?

Discussion
- What does MapReduce provide that is novel? Some benefits of MapReduce are not new.
- Master failure (use checkpoints?): "our current implementation aborts the MapReduce computation if the master fails."
  - Is there a better way to handle master failure? When would checkpoints be useful? (One possibility is sketched below.)
- What are some other types of problems that we could solve using MapReduce?
- What are some limitations of MapReduce?
- MapReduce vs. DBMS
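On the master-failure question: one answer mentioned in the paper is to have the master write periodic checkpoints of its data structures, so a new copy could restart from the last checkpointed state instead of aborting. A minimal sketch of that idea; the file name and state layout are assumptions for illustration, not the real implementation.

    import json

    def checkpoint_master_state(state, path="master_checkpoint.json"):
        # Periodically persist the master's task/state tables so that a
        # replacement master could resume from the last checkpoint instead
        # of aborting the whole computation.
        with open(path, "w") as f:
            json.dump(state, f)

    def recover_master_state(path="master_checkpoint.json"):
        # Load the most recent checkpoint when starting a replacement master.
        with open(path) as f:
            return json.load(f)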