Parallel Programming Concepts


Parallel Programming Concepts: MapReduce
Frank Feinbube
Source: "MapReduce: Simplified Data Processing on Large Clusters", Dean et al.

Examples for Parallel Programming Support

MapReduce
- Programming model plus associated implementation for processing and generating large data sets
- Map: transforms a key/value pair into intermediate key/value pairs
- Reduce: merges all intermediate values associated with the same intermediate key
- Origin: Lisp

Run-time System
- Automated parallelization and distribution
- Partitions the input data
- Schedules the program's execution across a set of machines
- Handles machine failures
- Manages the required inter-machine communication
- Programmers do not have to think about the specifics of parallel and distributed systems

Programming Model
- Input: key/value pairs
- Map produces an intermediate set of key/value pairs, grouped by intermediate key
- Reduce produces the output key/value pairs; typically zero or one output value per intermediate key

Example: Wordcount
Counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
    // key: document name, value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
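The pseudocode above can be sketched as a small, runnable single-machine Python program. The mapreduce driver below is a simplification for illustration (in-memory grouping stands in for the distributed shuffle); all names are ours, not from any real MapReduce library:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Group all intermediate values by intermediate key,
    # then apply the reduce function once per key.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

docs = [("d1", "the quick fox"), ("d2", "the fox")]
print(mapreduce(docs, map_fn, reduce_fn))  # {'the': 2, 'quick': 1, 'fox': 2}
```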

Example: Wordcount (figure)

Types

Type specification:
    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)

Example (work items: ID and binary content; output: character and number of occurrences):
    map    (long, byte[])     -> list(char, int)
    reduce (char, list(int))  -> list(int)
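The type discipline above can be sketched with Python type hints. This is an illustrative instance only (we count whole words rather than single characters for readability, and all names are ours):

```python
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2")

# map:    (k1, v1)       -> list(k2, v2)
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]
# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, list[V2]], list[V2]]

# Instance following the slide's example: work-item ID and binary
# content in, token and occurrence count out.
def wc_map(work_id: int, content: bytes) -> list[tuple[str, int]]:
    return [(w, 1) for w in content.decode().split()]

def wc_reduce(token: str, counts: list[int]) -> list[int]:
    return [sum(counts)]
```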

More Examples
- Distributed grep: Map emits a line if it matches the pattern; Reduce emits the lines unchanged.
- Count of URL access frequency: Map processes logs of requests and emits <URL, 1>; Reduce adds the values per URL and emits <URL, total count>.
- Reverse web-link graph: Map emits <target, source> if a link to target is found in source; Reduce emits <target, list(source)>.
- Term vector per host (list of most important words): Map emits <hostname, term vector> for each input document; Reduce adds all term vectors together and emits <hostname, term vector>.
- Inverted index: Map parses a document and emits <word, document ID>; Reduce sorts the IDs and emits <word, list(document ID)>.
- Distributed sort: Map extracts the key from each record and emits <key, record>; Reduce emits the pairs unchanged (sorting is done by the ordering properties).
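As one concrete instance from the list above, the inverted-index example can be sketched in Python. This is a single-machine illustration: the grouping loop stands in for the runtime's shuffle, and all names are ours:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Parse the document and emit a (word, document ID) pair
    # once per distinct word.
    for word in set(text.split()):
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # Sort the posting list and emit (word, list(document ID)).
    return sorted(doc_ids)

# Group by intermediate key (the shuffle step), then reduce:
docs = {"d1": "map reduce map", "d2": "reduce"}
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in index_map(doc_id, text):
        grouped[word].append(d)
index = {w: index_reduce(w, ds) for w, ds in grouped.items()}
# index == {"map": ["d1"], "reduce": ["d1", "d2"]}
```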

Google Implementation of MapReduce
- Large cluster of standard PCs with local disks (x86, 100 Mbit/s to 1 Gbit/s Ethernet, 2-4 GB RAM, IDE disks)
- Custom global file system, with replication for availability and reliability
- Job scheduling system: maps a set of tasks to a set of machines
- Machine failures are common, due to the large number of machines

Execution Overview

Execution Overview
- Split the input files into M pieces, typically 16-64 MB per piece
- Start up copies of the program on a cluster of machines
- Example: 200,000 map tasks, 5,000 reduce tasks, 2,000 workers

Execution Overview
- The program copies become workers, which execute the M map tasks and R reduce tasks
- One special copy is the master, which assigns tasks to idle workers

Execution Overview
- A worker assigned a map task reads its input split, parses the key/value pairs, executes the map function, and creates intermediate pairs

Execution Overview
- Buffered intermediate pairs are periodically written to local disk, partitioned into R regions by the partitioning function
- The locations are passed to the master, which forwards them to the reduce workers
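The partitioning step described above can be sketched as follows. This is a minimal in-memory illustration; Python's built-in hash stands in for the real partitioning function, and the pair data is invented:

```python
R = 4  # number of reduce tasks, and hence of regions per map worker

def partition(key, R):
    # Default partitioning: hash(key) mod R decides which of the
    # R regions (and so which reduce worker) receives the pair.
    return hash(key) % R

# Each map worker buckets its buffered intermediate pairs
# into R regions before writing them to local disk:
buffered = [("fox", 1), ("the", 1), ("fox", 1)]
regions = [[] for _ in range(R)]
for key, value in buffered:
    regions[partition(key, R)].append((key, value))
```

Because the region depends only on the key, all pairs with the same intermediate key end up at the same reduce worker, regardless of which map worker produced them.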

Execution Overview
- A reduce worker uses RPC to read the buffered data from the given locations (the local disks of the map workers)
- The data is sorted by intermediate key, grouping all occurrences of the same key together
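The sort-and-group step on the reduce worker can be sketched with itertools.groupby, which requires the pairs to be sorted by key first (the pair data is invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs fetched from several map workers (unsorted):
pairs = [("fox", 1), ("the", 1), ("fox", 1), ("quick", 1)]

# Sort by intermediate key, then group so that each reduce call
# sees one key together with all of its values.
pairs.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group]
           for key, group in groupby(pairs, key=itemgetter(0))}
print(grouped)  # {'fox': [1, 1], 'quick': [1], 'the': [1]}
```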

Execution Overview
- For each unique intermediate key, the reduce worker executes the reduce function on the key and the corresponding set of intermediate values
- The output is appended to the final output file for this reduce partition

Execution Overview
- When all map and reduce tasks have been completed, the master wakes up the user program

Specific Google Properties
- The network is a bottleneck in the Google cluster:
  - The master tries to use locality information about the input data, which is stored in the distributed file system
  - For large MapReduce tasks, most input data is therefore read locally
- Fault tolerance:
  - Periodic heartbeat between master and workers
  - For a failed worker, the completed and in-progress map tasks of that particular worker are re-executed
  - If the master fails, the MapReduce operation is aborted and the user has to re-execute it
  - Backup tasks (cloned workers running the same task) are spawned when a MapReduce operation is close to completion, to compensate for faulty or slow workers

Refinements
- Partitioning function: user functions for data partitioning are possible (hash(key) mod R is the default)
- Ordering guarantees: intermediate key/value pairs are processed in increasing key order
- Combiner function: partial merging of local data on the map worker (like reduce)
- Input and output types: some standard formats; the user can specify more
- Side effects: additional output files have to be handled by the user
- Skipping bad records: records that cause deterministic crashes can be ignored (configurable)
- Local execution: a special MapReduce library allows sequential execution for debugging
- Status information: the master runs an internal HTTP server for diagnosis
- Counters: count occurrences of various events; user-defined counters are possible
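The combiner refinement, partial map-side merging that has the same shape as reduce, can be sketched as follows; the function and variable names are ours, for illustration:

```python
from collections import defaultdict

def combine(pairs):
    # Combiner: merge intermediate values locally on the map worker
    # before they are written to disk and shipped over the network.
    # For wordcount this is the same operation as the reduce function.
    merged = defaultdict(int)
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

map_output = [("the", 1), ("the", 1), ("fox", 1), ("the", 1)]
print(combine(map_output))  # [('the', 3), ('fox', 1)]
```

A combiner is only valid when the reduce function is associative and commutative, as summing counts is; it shrinks the data each reduce worker must fetch via RPC.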