L22: SC Report, Map Reduce

Similar documents
Parallel Computing: MapReduce Jin, Hai

Parallel Programming Concepts

MapReduce: Simplified Data Processing on Large Clusters

The MapReduce Abstraction

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

The MapReduce Framework

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

Big Data Management and NoSQL Databases

CS /5/18. Paul Krzyzanowski 1. Credit. Distributed Systems 18. MapReduce. Simplest environment for parallel processing. Background.

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Map Reduce Group Meeting

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

CS MapReduce. Vitaly Shmatikov

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Nested Loops

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Large-Scale GPU programming

Concurrency for data-intensive applications

CS 61C: Great Ideas in Computer Architecture. MapReduce

MapReduce: A Programming Model for Large-Scale Distributed Computation

Introduction to Data Management CSE 344

Introduction to MapReduce

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

MI-PDB, MIE-PDB: Advanced Database Systems

1. Introduction to MapReduce

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

MapReduce Simplified Data Processing on Large Clusters

MapReduce-style data processing

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

Advanced Data Management Technologies

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 345A Data Mining. MapReduce

Batch Processing Basic architecture

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Clustering Lecture 8: MapReduce

Database Systems CSE 414

Introduction to Map Reduce

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

An introduction to Big Data. Presentation by Devesh Sharma, Zubair Asghar & Andreas Aalsaunet

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CS427 Multicore Architecture and Parallel Computing

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

Introduction to Data Management CSE 344

MapReduce: Simplified Data Processing on Large Clusters

Tutorial Outline. Map/Reduce. Data Center as a computer [Patterson, cacm 2008] Acknowledgements

Introduction to MapReduce

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

Parallel data processing with MapReduce

Part A: MapReduce. Introduction Model Implementation issues

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce. Cloud Computing COMP / ECPE 293A

In the news. Request- Level Parallelism (RLP) Agenda 10/7/11

CSE 414: Section 7 Parallel Databases. November 8th, 2018

Introduction to Data Management CSE 344

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters

"Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Introduction to MapReduce

CS-2510 COMPUTER OPERATING SYSTEMS

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

How to Implement MapReduce Using. Presented By Jamie Pitts

DISTRIBUTED SYSTEMS [COMP9243] Distributed Object based: Lecture 7: Middleware. Slide 1. Slide 3. Message-oriented: MIDDLEWARE

Survey on MapReduce Scheduling Algorithms

Distributed Systems. COS 418: Distributed Systems Lecture 1. Mike Freedman. Backrub (Google) Google The Cloud is not amorphous

CompSci 516: Database Systems

Lecture 11 Hadoop & Spark

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Improving the MapReduce Big Data Processing Framework

Programming Systems for Big Data

Advanced Database Technologies NoSQL: Not only SQL

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Introduction to MapReduce (cont.)

Database Applications (15-415)

Tutorial Outline. Map/Reduce. Data Center as a computer [Patterson, cacm 2008] Acknowledgements

Introduction to MapReduce

MapReduce for Data Intensive Scientific Analyses

Agenda. Request- Level Parallelism. Agenda. Anatomy of a Web Search. Google Query- Serving Architecture 9/20/10

MapReduce and Hadoop

Map-Reduce. John Hughes

EECS 498 Introduction to Distributed Systems

Clustering websites using a MapReduce programming model

Distributed Programming the Google Way

MapReduce on Multi-core

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground

Map-Reduce (PFP Lecture 12) John Hughes

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Transcription:

L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source 11/23/10 What is MapReduce? Parallel programming model meant for large clusters () Reduce - User implements Map() and Parallel computing framework - Libraries take care of EVERYTHING else - Parallelization - Fault Tolerance - Data Distribution - Load Balancing Functional Abstractions Hide Parallelism Map and Reduce Functions borrowed from functional programming languages (eg. Lisp) Map() - Process a key/value pair to generate intermediate key/value pairs Reduce() - Merge all intermediate values associated with the same key Useful model for many practical tasks (large data) 11/23/10 1

Example: Counting Words Example Use of MapReduce Counting words in a large set of documents () Map - Input <filename, file text> - Parses file and emits <word, count> pairs - eg. < hello, 1> () Reduce - Sums values for the same key and emits <word, TotalCount> - eg. < hello, (3 5 2 7)> => < hello, 17> map(string key, string ( value //key: document name //value: document contents for each word w in value EmitIntermediate(w, 1 ); reduce(string key, iterator ( values //key: word //values: list of counts int results = 0; for each v in values result += ParseInt(v); Emit(AsString(result)); How MapReduce Works Data Distribution User to do list: - indicate: - Input/output files - M: number of map tasks - R: number of reduce tasks - W: number of machines - Write map and reduce functions - Submit the job Input files are split into M pieces on distributed file system - Typically ~ 64 MB blocks Intermediate files created from map tasks are written to local disk Output files are written to distributed file system This requires no knowledge of parallel/distributed systems!!! What about everything else? 2

Assigning Tasks Many copies of user program are started Tries to utilize data localization by running map tasks on machines with data One instance becomes the Master Master finds idle machines and assigns them tasks ( map ) Execution Map workers read in contents of corresponding input partition Perform user-defined map computation to create intermediate <key,value> pairs Periodically buffered output pairs written to local disk - Partitioned into R regions by a partitioning function Partition Function ( reduce ) Execution Example partition function: hash(key) mod R Why do we need this? Example Scenario: - Want to do word counting on 10 documents - 5 map tasks, 2 reduce tasks Reduce workers iterate over ordered intermediate data - Each unique key encountered values are passed to user's reduce function - eg. <key, [value1, value2,..., valuen]> Output of user's reduce function is written to output file on global file system When all tasks have completed, master wakes up user program 3

Observations No reduce can begin until map is complete Tasks scheduled based on location of data If map worker fails any time before reduce finishes, task must be completely rerun Master must communicate locations of intermediate files MapReduce library does most of the hard work for us! Fault Tolerance Workers are periodically pinged by master - No response = failed worker Master writes periodic checkpoints On errors, workers send last gasp UDP packet to master - Detect records that cause deterministic crashes and skips them 4

Fault Tolerance Debugging Input file blocks stored on multiple machines When computation almost done, reschedule inprogress tasks - Avoids stragglers Offers human readable status info on http server - Users can see jobs completed, in-progress, processing rates, etc. Sequential implementation - Executed sequentially on a single machine - Allows use of gdb and other debugging tools MapReduce Conclusions References Simplifies large-scale computations that fit this model Allows user to focus on the problem without worrying about details Computer architecture not very important - Portable model Jeffery Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters Josh Carter, http://multipart-mixed.com/software/ mapreduce_presentation.pdf Ralf Lammel, Google's MapReduce Programming Model Revisited http://code.google.com/edu/parallel/mapreduce-tutorial.html RELATED - Sawzall - Pig - Hadoop 5