Parallel Computing: MapReduce (Jin, Hai)


Parallel Computing: MapReduce
Jin, Hai
School of Computer Science and Technology, Huazhong University of Science and Technology

- MapReduce is a distributed/parallel computing framework introduced by Google to support computing on large data sets on clusters of computers.
- The framework is inspired by the map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as in their original forms).
- The user implements the map() and reduce() functions.
- The runtime library takes care of EVERYTHING else.

- Large-scale data processing:
  - Want to process lots of data (> 1 TB)
  - Want to parallelize across hundreds/thousands of CPUs
  - Want to make this easy

- A simple programming model that applies to certain large-scale computing problems
- Hide messy details in the MapReduce runtime library:
  - automatic parallelization
  - load balancing
  - network and disk transfer optimization
  - handling of machine failures
  - robustness
- Improvements to the core library benefit all users of the library!

Typical problem solved by MapReduce
- Read a lot of data
- Map: extract something you care about from each record
- Shuffle and sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
- The outline stays the same, but map and reduce change to fit the problem

Programming Model
- Functions borrowed from functional programming languages
- Users implement an interface of two functions (sketched below):
  - map(): process a key/value pair to generate intermediate key/value pairs
    map(in_key, in_value) -> (out_key, intermediate_value) list
  - reduce(): merge all intermediate values associated with the same key
    reduce(out_key, intermediate_value list) -> out_value list
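As a concrete rendering of the two signatures above, here is a minimal type-level sketch in Python. It is illustrative only: in Google's system keys and values are strings and the real interface is a C++ library class, so the alias names MapFn and ReduceFn are inventions of this sketch.

from typing import Callable, Iterable, List, Tuple

# map:    (in_key, in_value)             -> list of (out_key, intermediate_value)
# reduce: (out_key, intermediate_values) -> list of out_value
# In Google's MapReduce, keys and values are all strings.
MapFn = Callable[[str, str], List[Tuple[str, str]]]
ReduceFn = Callable[[str, Iterable[str]], List[str]]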

Example: Counting Words
- Counting words in a large set of documents
- map()
  - Input: <filename, file_text>
  - Parses the file and emits <word, count> pairs, e.g. <"hello", 1>
- reduce()
  - Sums all values for the same key and emits <word, TotalCount>, e.g. <"hello", (3 5 2 7)> => <"hello", 17>

Example: Use of MapReduce

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Actual Source Code
- The example above is written in pseudo-code
- The actual implementation is in C++, using a MapReduce library
- The true code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.); a simplified runnable sketch follows below
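Below is a minimal, single-process simulation of the word-count job, written in Python rather than C++. It is not the Google MapReduce library; the driver run_mapreduce and its in-memory shuffle are inventions of this sketch, intended only to show how map output is grouped by key before reduce runs.

from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_fn(filename: str, file_text: str) -> Iterator[Tuple[str, int]]:
    # Emit an intermediate <word, 1> pair for every word in the document.
    for word in file_text.split():
        yield (word, 1)

def reduce_fn(word: str, counts: Iterable[int]) -> int:
    # Sum all partial counts seen for one word.
    return sum(counts)

def run_mapreduce(inputs: Dict[str, str]) -> Dict[str, int]:
    # "Shuffle and sort": group every intermediate value by its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for filename, text in inputs.items():
        for key, value in map_fn(filename, text):
            groups[key].append(value)
    # Reduce phase: one reduce call per unique key, keys visited in sorted order.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

if __name__ == "__main__":
    docs = {"a.txt": "hello world", "b.txt": "hello mapreduce world"}
    print(run_mapreduce(docs))  # {'hello': 2, 'mapreduce': 1, 'world': 2}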

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Synchronization is required between the map and reduce phases (see the sketch below)
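A minimal sketch of that independence, reusing the word-count example: the per-document map work runs in a process pool with no coordination, and the shuffle in the middle is the only synchronization point. The helper names here are invented for illustration; this is not how the distributed runtime itself is structured.

from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List, Tuple

def count_words_in_doc(item: Tuple[str, str]) -> List[Tuple[str, int]]:
    # Map task: one document in, a list of <word, 1> pairs out.
    _, text = item
    return [(word, 1) for word in text.split()]

def parallel_wordcount(docs: Dict[str, str], workers: int = 4) -> Dict[str, int]:
    # Map phase: every document is processed independently, so the calls
    # can run in parallel with no coordination between them.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(count_words_in_doc, docs.items()))
    # Shuffle: the single synchronization point between the two phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in map_outputs:
        for word, n in pairs:
            groups[word].append(n)
    # Reduce phase: each key is reduced independently of every other key.
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":  # needed for process pools on some platforms
    docs = {"a.txt": "hello world", "b.txt": "hello mapreduce world"}
    print(parallel_wordcount(docs))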

How MapReduce Works
- User to-do list:
  - Write the map() and reduce() functions
  - Indicate:
    - Input/output files
    - M: number of map tasks
    - R: number of reduce tasks
    - W: number of machines
  - Submit the job
- This requires no knowledge of parallel and distributed systems!
- What about everything else?


Data Distribution
- Input files are split into M pieces on the distributed file system
  - Typically ~16 to 128 MB blocks
- Intermediate files created by map tasks are written to local disk
- Output files are written to the distributed file system

Assigning Tasks
- Many copies of the user program (master/workers) are started
- The master finds idle machines and assigns them tasks
- It tries to exploit data locality by running map tasks on machines that already hold the input data

Execution (map)
- Map workers read in the contents of their corresponding input partition
- They perform the user-defined map computation to create intermediate <key, value> pairs
- Buffered output pairs are periodically written to local disk
  - Partitioned into R regions by a partitioning function

Partition Function
- In addition to the map and reduce functions, you may specify:
  - Partition(k', number of reducers): the choice of reducer for key k'
  - Implemented by the invisible shuffle-and-sort stage
- Default partition: hash(k') mod R (the number of reducers); a sketch follows below
- Each reducer sees the keys in its partition in sorted order
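A sketch of the default rule, assuming string keys. The CRC32 call stands in for a hash that is stable across machines (Python's built-in hash() is randomized per process, so it would not be suitable in a distributed setting).

import zlib

def default_partition(key: str, num_reducers: int) -> int:
    # hash(k') mod R: every occurrence of key k' is routed to the same
    # reduce task, so all of its intermediate values end up together.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Example: with R = 4 reduce tasks, "hello" always lands in the same partition.
print(default_partition("hello", 4))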

Execution (reduce)
- Reduce workers iterate over the sorted intermediate data
- For each unique key encountered, the associated values are passed to the user's reduce function, e.g. <key, [value1, value2, ..., valueN]> (see the grouping sketch below)
- The output of the user's reduce function is written to an output file on the distributed file system
- When all tasks have completed, the master wakes up the user program
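The per-key iteration can be pictured with itertools.groupby: because the intermediate data arrives sorted by key, one linear pass turns each run of identical keys into a single reduce call. The small in-memory list below stands in for the sorted intermediate files a real reduce worker would read.

from itertools import groupby
from operator import itemgetter

# Sorted intermediate <word, count> pairs, as a reduce worker would see them.
sorted_pairs = [("cat", 1), ("hello", 1), ("hello", 1), ("world", 1)]

for key, group in groupby(sorted_pairs, key=itemgetter(0)):
    values = [v for _, v in group]   # e.g. <"hello", [1, 1]>
    print(key, sum(values))          # in the real system this record goes to the DFS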

Observations
- No reduce can begin until the map phase is complete
- Tasks are scheduled based on the location of data
- If a map worker fails at any time before the reduce finishes, its task must be completely rerun
- The master must communicate the locations of intermediate files
- The MapReduce library does most of the hard work for us!


MapReduce: Granularity
- Fine-granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing

Optimization
- No reduce can start until the map phase is complete:
  - A single slow disk controller can rate-limit the whole process
- The master re-assigns slow-moving map tasks and uses the results of whichever copy finishes first

Optimization
- Combiner functions can run on the same machine as a mapper
- This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth (sketched below)
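A sketch of what a combiner does for the word-count job, which works because its reduce operation (addition) is associative and commutative; the function name and the plain list representation are inventions of this example.

from collections import defaultdict
from typing import Iterable, List, Tuple

def combine(map_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    # Mini-reduce on the mapper's machine: collapse repeated <word, 1> pairs
    # into <word, partial_count> before anything crosses the network.
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return sorted(partial.items())

# [("the", 1), ("cat", 1), ("the", 1)]  ->  [("cat", 1), ("the", 2)]
print(combine([("the", 1), ("cat", 1), ("the", 1)]))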

Fault Tolerance
- Worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion is committed through the master
- Master failure:
  - State is checkpointed to a replicated file system
  - A new master recovers and continues
- Very robust: once lost 1600 of 1800 machines, but finished fine

Applications
- Structure of the Web:
  - Input is <URL, contents>
  - Scan through each document's contents looking for links to other URLs
  - If map outputs (URL, linked-to URL), you get a simple representation of the WWW link graph
  - If map outputs (linked-to URL, URL), you get the reverse link graph: what web pages link to me?
  - If map outputs (linked-to URL, anchor text), you get: how do other web pages characterize me? (all three variants are sketched below)
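Sketches of the three map variants listed above, assuming <URL, HTML contents> input records. extract_links() is a hypothetical helper (not part of the slides), implemented here with a crude regex that is good enough for illustration.

import re

def extract_links(contents):
    # Hypothetical helper: pull (target_url, anchor_text) pairs out of raw HTML.
    return re.findall(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', contents, flags=re.S)

def map_forward_links(url, contents):
    # Emits (URL, linked-to URL): a simple representation of the WWW link graph.
    for target, _anchor in extract_links(contents):
        yield (url, target)

def map_reverse_links(url, contents):
    # Emits (linked-to URL, URL): the reverse link graph ("what pages link to me?").
    for target, _anchor in extract_links(contents):
        yield (target, url)

def map_anchor_text(url, contents):
    # Emits (linked-to URL, anchor text): how other pages characterize this page.
    for target, anchor in extract_links(contents):
        yield (target, anchor)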

Applications
- Google uses MapReduce for:
  - The page-indexing pipeline: what are all the pages that match this query?
  - PageRank: what are the best pages that match this query?
  - and many more
- MapReduce greatly simplifies large-scale computations at Google

References
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" - http://labs.google.com/papers/mapreduce.html
- Ralf Lämmel, "Google's MapReduce Programming Model Revisited" - http://www.cs.vu.nl/~ralf/mapreduce/paper.pdf
- http://code.google.com/edu/parallel/mapreduce-tutorial.html