MapReduce. Tom Anderson


Last Time
- Difference between local state and knowledge about another node's local state
- Failures are endemic
- Communication costs matter

Why Is DS So Hard?
- System design
  - Partitioning of responsibilities: what should the client do, what should the server do? Which servers should do what?
- Failures are endemic, partial, and ambiguous
  - If the server doesn't reply, how do you tell whether it is (a) the network, (b) the server, or (c) neither: both are just being slow?
- Concurrency and consistency
  - Distributed state, replicated state, caching
  - How do we keep this state consistent?

Why Is DS So Hard?
- Performance
  - Generating a single Facebook or Google page involves calls to hundreds of different machines
  - Performance can be variable and unpredictable
  - Tail latency: you are only as fast as the slowest machine
- Implementation and testing
  - Nearly impossible to test/reproduce all failure cases
- Security
  - An adversary can silently compromise machines and manipulate messages

MapReduce
- A programming model to help unsophisticated programmers use a data center without thinking about failures and distribution
- A popular distributed programming framework, with many related frameworks
- Lab 1:
  - Helps you get up to speed on Go and distributed programming
  - Gives exposure to some fault tolerance
  - Motivates better fault tolerance in later labs

MapReduce Computational Model
- For each key k with value v, compute a new set of key-value pairs:
    map(k, v) -> list(k', v')
- For each key k' and list of values v', compute a new (hopefully smaller) list of values:
    reduce(k', list(v')) -> list(v')
- The user writes the map and reduce functions. The framework takes care of parallelism, distribution, and fault tolerance.
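To make the model concrete, here is a minimal word-count sketch in Go, the classic first MapReduce example. The names KeyValue, mapF, and reduceF are illustrative, not Lab 1's actual interface, and the grouping that the framework normally does is inlined into main:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// KeyValue is one intermediate pair (illustrative type).
type KeyValue struct {
	Key   string
	Value string
}

// mapF: for each word in the document, emit <word, "1">.
func mapF(document string, contents string) []KeyValue {
	var kvs []KeyValue
	for _, w := range strings.Fields(contents) {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// reduceF: sum the counts for one word.
func reduceF(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	// Map phase over one "split", then group by key, then reduce.
	kvs := mapF("doc1", "the quick the lazy the")
	grouped := map[string][]string{}
	for _, kv := range kvs {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %s\n", k, reduceF(k, grouped[k])) // e.g. "the 3"
	}
}
```

The point of the model is that only mapF and reduceF are application-specific; everything in main is what the framework does for you, at scale and with fault tolerance.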

MapReduce Steps
1. Split the document into a set of <k1, v1> pairs
2. Run Map(k1, v1) on each element of each split to produce a set of <k2, v2> pairs
3. Coalesce the results from each mapper into a (sorted) list for each key
4. Run Reduce(k2, list(v2)) -> list(v2)
   - Optionally run the reduce function on the results for each key produced by each mapper, to reduce network bandwidth
5. Merge the results

MapReduce In Action

Example: grep (find lines that match a text pattern)
1. The master splits the file into M almost-equal chunks at line boundaries
2. The master hands each partition to a mapper
3. Map phase: for each partition, call map on each line of text
   - Search the line for the word
   - Output (line number, line of text) if the word shows up, nil if not
4. Partition the results among R reducers
   - map writes each output record into a file, hashed on key

Example: grep, continued
5. Reduce phase: each reduce job collects 1/R of the output from each map job
   - All map jobs must have completed!
   - The reduce function is the identity: v1 in, v1 out
6. Merge phase: the master merges the R outputs
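The grep map function, key-hash partitioning, and identity reduce can be sketched in Go. mapGrep, partition, and reduceGrep are hypothetical names, and a real implementation would write each partitioned record to an intermediate file rather than printing it:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

const R = 2 // number of reducers (illustrative)

// KeyValue: key is the line number (as text), value is the matching line.
type KeyValue struct {
	Key   string
	Value string
}

// mapGrep scans one partition's lines for the pattern and emits
// <line number, line> for each match; non-matches produce nothing.
func mapGrep(lines []string, firstLineNo int, pattern string) []KeyValue {
	var out []KeyValue
	for i, line := range lines {
		if strings.Contains(line, pattern) {
			out = append(out, KeyValue{fmt.Sprint(firstLineNo + i), line})
		}
	}
	return out
}

// partition picks which of the R reducers receives a record, by hashing its key.
func partition(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % R
}

// reduceGrep is the identity: v1 in, v1 out.
func reduceGrep(key string, values []string) []string { return values }

func main() {
	matches := mapGrep([]string{"to be", "or not", "to go"}, 1, "to")
	for _, kv := range matches {
		fmt.Printf("reducer %d gets line %s: %q\n", partition(kv.Key), kv.Key, kv.Value)
	}
}
```

Because the partition function depends only on the key, every mapper sends records for the same key to the same reducer, which is what lets the reduce phase see all values for a key together.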

Another Example: PageRank
- Computes the importance of web pages, for search result ordering
- Pages are important if linked to by important pages
- Initially: assign every page a default value
- Map: for every page k with outlink l, emit <l, value>
- Reduce: for each target page l, output the new value as the average of its inlink values
- Repeat MapReduce until done
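One iteration of the slide's simplified update rule can be sketched in Go. Note this follows the slide's average-of-inlinks rule; real PageRank instead divides a page's value among its outlinks and applies a damping factor. pageRankIteration is an illustrative name:

```go
package main

import "fmt"

// pageRankIteration runs one map+reduce round of the slide's simplified rule:
// map emits <outlink target, current value of the linking page>;
// reduce sets each target's new value to the average of its inlink values.
func pageRankIteration(links map[string][]string, value map[string]float64) map[string]float64 {
	// Map: for every page k with outlink l, emit <l, value[k]>.
	emitted := map[string][]float64{}
	for page, outlinks := range links {
		for _, l := range outlinks {
			emitted[l] = append(emitted[l], value[page])
		}
	}
	// Reduce: for each target page l, average its inlink values.
	next := map[string]float64{}
	for target, vals := range emitted {
		sum := 0.0
		for _, v := range vals {
			sum += v
		}
		next[target] = sum / float64(len(vals))
	}
	return next
}

func main() {
	links := map[string][]string{"a": {"b", "c"}, "b": {"c"}}
	value := map[string]float64{"a": 1, "b": 1, "c": 1}
	// b = avg(1) = 1; c = avg(1, 1) = 1
	fmt.Println(pageRankIteration(links, value))
}
```

"Repeat until done" then means running pageRankIteration again on its own output until the values stop changing (or for a fixed number of rounds), with each round being a fresh MapReduce job.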

Questions
Suppose we run MapReduce across N workers, with M map partitions and R reducers.
Example: M = 200K, R = 5K, N = 2K
- Number of tasks?
- Number of intermediate files?
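One way to work the arithmetic, under the assumption (a common implementation choice, not stated on the slide) that each map task writes one intermediate file per reducer:

```go
package main

import "fmt"

func main() {
	// Illustrative numbers from the slide.
	M, R := 200_000, 5_000

	// One map task per partition plus one reduce task per reducer.
	tasks := M + R
	// If each map task writes one intermediate file per reducer,
	// there are M * R intermediate files in total.
	files := M * R

	fmt.Println("tasks:", tasks) // tasks: 205000
	fmt.Println("files:", files) // files: 1000000000
}
```

The billion-file count under this scheme is why real systems care about how intermediate data is batched and stored.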

Lab 1 Hint
Lab 1 provides code to do all the steps of MapReduce, but on a single node: RunSingle

Questions
- How much speedup do we expect on N servers?
- What are the bottlenecks to performance?

Question
Why the roughly fixed (64MB) size for the initial input split files?

Question
How does the master tell a specific worker to do a specific (map, reduce) task?

Remote Procedure Call (RPC)
A request from the client to execute a function on the server.

RPC Implementation

For the MapReduce master and worker, who's the client? Who's the server?

Is an RPC like a normal function call?
- Binding
  - The client needs a connection to the server
  - The server must implement the required function
- Performance
  - Local call: maybe 10 cycles = ~3 ns
  - In the data center: 10 microseconds, roughly 1,000x slower
  - In the wide area: millions of times slower
- Failures
  - What happens if messages get dropped?
  - What if the client crashes? What if the server crashes?
  - What if the server appears to crash but is just slow?
  - What if the network partitions?

MapReduce Fault Tolerance Model
- The master is not fault tolerant
  - Assumption: this single machine won't fail while running a MapReduce app
- There are many workers, so we have to handle their failures
  - Assumption: workers are fail-stop
    - They can fail and stop
    - They may reboot
    - They don't send garbled packets after a failure

What kinds of faults does MapReduce need to tolerate?
- Network:
- Worker:

Tools for Dealing With Faults
- Retry
  - If a packet is lost: resend
  - Worker crash: give the task to another worker
  - May execute an MR job twice! (Is this OK? Why?)
- Replicate
  - E.g., input files
- Replace
  - E.g., a new worker can be added

Lab 1 Simplifications
- No key in map
- Assume a global file system
- No partial failures
  - Files are either completely written or not created
  - If we restart a failed operation, it is OK to write to the same filename

DeWitt/Stonebraker Critique
- "A giant step backward in the programming paradigm for large-scale data-intensive applications"
- "A sub-optimal implementation, in that it uses brute force instead of indexing"
- "Not novel at all: represents a specific implementation of well-known techniques developed nearly 25 (now 35) years ago"
- "Missing most of the features that are routinely included in current DBMS"
- "Incompatible with all of the tools DBMS users have come to depend on"