MapReduce. Kiril Valev, LMU

MapReduce Kiril Valev LMU valevk@cip.ifi.lmu.de 23.11.2013 Kiril Valev (LMU) MapReduce 23.11.2013 1 / 35

Agenda
1 MapReduce
- Motivation
- Definition
- Example
- Why MapReduce?
- Distributed Environment
- Fault Tolerance
- Pro and Con
- Summary

Motivation
- Process large amounts of data
- Process it fast (in parallel)
- Focus on the problem, not on the implementation
- MapReduce framework!

Definition
What is MapReduce? MapReduce is a programming model and an associated implementation for processing and generating large data sets. [1]
What is map? A map function processes a key/value pair to generate a set of intermediate key/value pairs. [1]
What is reduce? A reduce function merges all intermediate values associated with the same intermediate key. [1]

Map
k and v are the input key and value of the map function. They are processed and mapped to a new list that contains zero, one, or many pairs, depending on the implementation.
Formal definition: $(k, v) \to [(l_1, x_1), \ldots, (l_n, x_n)]$
The results of the map function are called intermediate key/value pairs.

Map
Map function for counting words in a collection of documents.
Example (map pseudocode):
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");
The map function iterates over every word in a document and emits an intermediate result for each one.
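The map pseudocode can be tried out in plain Python. This is only a sketch: the name `map_word_count` and the tokenization rule (lowercased runs of word characters) are assumptions made for illustration, and a generator stands in for EmitIntermediate.

```python
import re
from typing import Iterator, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """key: document name (unused here), value: document contents."""
    # Assumption: a "word" is a lowercased run of letters, digits or underscores.
    for word in re.findall(r"\w+", value.lower()):
        # Each yield plays the role of EmitIntermediate(w, "1").
        yield (word, 1)


pairs = list(map_word_count("doc1", "this is document one."))
print(pairs)
```

Emitting the count as an integer instead of the string "1" is a simplification; a real framework would serialize intermediate values.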

Reduce
The reduce function operates on the output of the map function (the intermediate results).
Formal definition: $(l, [y_1, \ldots, y_n]) \to [w_1, \ldots, w_m]$
For each key l, the list of all values associated with it is passed to the reduce function, which reduces them to a new result.

Reduce
Reduce function for counting words in a collection of documents.
Example (reduce pseudocode):
    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
The reduce function aggregates the word counts for a specific word (the key).
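A Python counterpart of the reduce pseudocode might look as follows. The name `reduce_word_count` is hypothetical; values are converted with `int()` so that both the string counts of the pseudocode and plain integers are accepted.

```python
from typing import Iterable, Iterator, Union


def reduce_word_count(key: str, values: Iterable[Union[int, str]]) -> Iterator[int]:
    """key: a word, values: all counts emitted for that word."""
    # int(v) mirrors ParseInt(v); the yield plays the role of Emit(...).
    yield sum(int(v) for v in values)


total = list(reduce_word_count("this", ["1", "1"]))
print(total)
```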

Example
Input:
    input key | input value
    doc1      | this is document one.
    doc2      | and this is document two.
Map output (intermediate key/value pairs):
    doc1: (this, 1) (is, 1) (document, 1) (one, 1)
    doc2: (and, 1) (this, 1) (is, 1) (document, 1) (two, 1)
Reduce output:
    (and, 1) (document, 2) (is, 2) (one, 1) (this, 2) (two, 1)
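The example can be reproduced with a tiny single-process driver. This is a minimal sketch, not how a distributed framework is built: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, and grouping by key in a dictionary replaces the distributed shuffle.

```python
from collections import defaultdict


def map_fn(key, value):
    # Assumption: words are separated by whitespace and "." is dropped.
    for word in value.replace(".", "").split():
        yield word, 1


def reduce_fn(key, values):
    yield sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: collect every intermediate value under its key.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    # Reduce phase: call reduce_fn once per unique key, in sorted key order.
    return {ik: list(reduce_fn(ik, groups[ik])) for ik in sorted(groups)}


docs = [("doc1", "this is document one."), ("doc2", "and this is document two.")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```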

Why MapReduce?
Question: What is special about it? Map and reduce functions have existed for a long time.
Answer: It's not the functions, it's the framework that is special!
- Programs written in this functional style are automatically parallelized
- and are thus executed on a large cluster of machines
- This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system

MapReduce (figure from [1])

Fork and Assign
The user specifies how many mappers and reducers are allocated.
How many tasks? The number of map tasks (M) and reduce tasks (R) should be much larger than the number of workers.
Typical setup: M = 200,000, R = 5,000, workers = 2,000

Read
- The input is split into M data pieces
- One map task per piece (M map tasks)
- Each piece is typically 16 MB to 64 MB
- Locality
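How M follows from the input size and the piece size can be sketched as below. The function name `split_ranges` and the 200 MB input are made up for illustration; real frameworks also respect record and block boundaries, which this sketch ignores.

```python
MB = 1024 * 1024


def split_ranges(total_bytes: int, split_bytes: int = 64 * MB):
    """Return one (offset, length) pair per input piece; M is the list length."""
    ranges = []
    offset = 0
    while offset < total_bytes:
        # The last piece may be shorter than split_bytes.
        length = min(split_bytes, total_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


pieces = split_ranges(200 * MB, 64 * MB)
print(len(pieces), pieces[-1])
```

With these assumed numbers the input yields four pieces, three of 64 MB and a final one of 8 MB, so M = 4.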

Locality
- Network bandwidth is a relatively scarce resource
- Depending on the file system, the input data may already be stored on the worker's own machine
- Reading local or nearby (same rack) data makes input faster
- The master tries to assign each map task to a worker that holds, or is close to, its input data

Local Write
- Buffered data is periodically written to local disk
- Partitioned into R regions
- The master is notified about the locations

Combiner Function
The combiner function is a refinement of the map phase. Before the output is written to local storage, the combiner function preprocesses the intermediate key/value pairs:
    (this, 1) (this, 1) (this, 1) (this, 1)  ->  (this, 4)
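For word counting, a combiner can be sketched as a local pre-aggregation on the map worker. The function name `combine` is an assumption; the approach is valid here because addition is associative and commutative, so partial sums do not change the final result.

```python
from collections import Counter


def combine(pairs):
    """Merge intermediate (word, count) pairs produced by one map task."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    # Sorted only to make the output deterministic for the example.
    return sorted(combined.items())


out = combine([("this", 1), ("this", 1), ("this", 1), ("this", 1)])
print(out)
```

The benefit is less data written locally and sent over the network: one pair per word and map task instead of one pair per occurrence.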

Partitioning Function
The output of the map function (intermediate key/value pairs) is kept in a buffer and periodically written to the local disk. A partitioning function is applied to the key, producing R partitions:
    Partitioning function: hash(key) mod R
The function can also be supplied by the user (for example, to place all URLs of the same host in one partition):
    Partitioning function: hash(hostname(key)) mod R
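Both partitioning functions can be sketched in Python. Assumptions: `zlib.crc32` is swapped in as the hash because Python's built-in `hash()` for strings varies between processes, and the names `partition`, `partition_by_host` and the example.com URLs are purely illustrative.

```python
import zlib
from urllib.parse import urlparse


def partition(key: str, R: int) -> int:
    """Default scheme: hash(key) mod R, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % R


def partition_by_host(url: str, R: int) -> int:
    """User-defined scheme: hash(hostname(key)) mod R."""
    host = urlparse(url).hostname or ""
    return partition(host, R)


R = 8
a = partition_by_host("http://example.com/a", R)
b = partition_by_host("http://example.com/b?x=1", R)
print(a, b)
```

Because only the host name is hashed, both URLs land in the same partition and therefore in the same reduce task and output file.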

Remote Read
- The reduce worker is notified by the master about the data locations
- It uses RPC to fetch the data from the map workers
- The fetched data is sorted by intermediate key
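Why the fetched data is sorted can be seen in a small sketch: after sorting, all values of a key are adjacent and can be handed to the reduce function in one pass. The name `group_sorted` and the sample pairs are assumptions; an in-memory sort stands in for whatever sort the framework performs.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Sort (key, value) pairs by key and yield (key, [values]) per unique key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Pairs as they might arrive from two different map workers.
fetched = [("this", 1), ("is", 1), ("and", 1), ("this", 1), ("is", 1)]
grouped = list(group_sorted(fetched))
print(grouped)
```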

Write
- The reduce worker iterates over the sorted intermediate data and calls the reduce function once per unique intermediate key
- The output is appended to the output file of this reduce partition

Fault Tolerance
Worker Failure
- The master periodically checks each worker's heartbeat
- The failed machine may have been assigned a map or a reduce task
- Completed map tasks are re-executed (their output is stored on the failed machine's local disk)
- In-progress reduce tasks are re-executed
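The rescheduling rules can be sketched as a small state update on the master. Everything here is hypothetical (the `Task` record, the state names, `handle_worker_failure`); following [1], the sketch also resets in-progress map tasks and leaves completed reduce tasks alone, since their output is already in the global file system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str                     # "idle", "in_progress" or "completed"
    worker: Optional[str] = None   # worker the task is or was assigned to


def handle_worker_failure(tasks: List[Task], failed: str) -> List[Task]:
    """Reset every task that must be re-executed and return those tasks."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed:
            continue
        # Completed map output lives on the failed machine's local disk.
        lost_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_output:
            task.state, task.worker = "idle", None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "completed", "w2"),
]
redo = handle_worker_failure(tasks, "w1")
print(len(redo))
```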

Fault Tolerance
Master Failure
- There is only one master, so a failure is unlikely
- Either put up with re-executing the job in case of failure
- Or: write checkpoints of the master data structures and restore from them in case of failure

Backup Tasks
- Deal with slow workers (bottleneck)
- Stragglers: workers that take an unusually long time to complete their task
- Start backup tasks when the job is close to completion
- Redundant execution of the remaining tasks
- A task is completed as soon as either the primary or the backup execution finishes
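The effect of backup tasks on the overall job time can be illustrated with a toy model. All numbers and the name `job_time` are invented: each task has a primary finish time, and every task still running at `backup_start` gets a backup copy that needs `backup_duration`; the earlier of the two finishes counts.

```python
from typing import List, Optional


def job_time(primary_finish: List[float],
             backup_start: Optional[float] = None,
             backup_duration: float = 0.0) -> float:
    """The job ends when its slowest task ends; backups may shorten stragglers."""
    finish_times = []
    for finish in primary_finish:
        if backup_start is not None and finish > backup_start:
            # Task still running when backups are launched: first finisher wins.
            finish = min(finish, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


without_backup = job_time([10, 12, 11, 60])
with_backup = job_time([10, 12, 11, 60], backup_start=12, backup_duration=10)
print(without_backup, with_backup)
```

In this made-up run a single straggler finishing at time 60 dominates the job; with a backup started at time 12 the job ends at time 22.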

Pro [2]
- Simple and easy to use
- Flexible
- Independent of the storage
- Fault tolerance
- Scalability

Con [2]
- No high-level language
- No schema and no index
- A single fixed dataflow
- Very young (compared to DBMS)

Summary
Dataflow: Input -> Map -> Combine/Partition -> Reduce -> Output
MapReduce:
- Simple (no experience in distributed computing required!)
- Fault tolerance
- Focus on the problem, not on the implementation
- Many problems can be modeled

The End

Discussion on Hacker News

Data (bar chart; categories: User generated, Logfiles, Images, Raw text)

Problem (bar chart; categories: Analytics, Profiling/Advertising, Pattern Recognition, Parallel Job Scheduler, Protein Binding, Monte-Carlo)

References
[1] Dean, Jeffrey and Ghemawat, Sanjay (2008). MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51(1), 107-113.
[2] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, Bongki Moon (2011). Parallel Data Processing with MapReduce: A Survey. SIGMOD Record, December 2011 (Vol. 40, No. 4).
[3] Hacker News (2013). Ask HN: To everybody who uses MapReduce: what problems do you solve? https://news.ycombinator.com/item?id=6706545