Distributed Systems. COS 418: Distributed Systems, Lecture 1. Mike Freedman.

Transcription:

Distributed Systems
COS 418: Distributed Systems, Lecture 1
Mike Freedman

Backrub (Google), 1997
Google, 2012
The Cloud is not amorphous

[Image slides: Google, Microsoft, Facebook]

100,000s of physical servers; 10s of MW energy consumption
Facebook Prineville: $250M physical infra, $1B IT infra
Pods provide 7.68 Tbps to backplane

Everything changes at scale

The goal of distributed systems
- Service with higher-level abstractions/interface
  - e.g., file system, database, key-value store, programming model, RESTful web service, ...
- Hide complexity
  - Scalable (scale-out)
  - Reliable (fault-tolerant)
  - Well-defined semantics (consistent)
  - Security
- Do heavy lifting so the app developer doesn't need to

Research results matter: NoSQL
Research results matter: Paxos
Research results matter: MapReduce

Course Organization

Learning the material: People
- Lecture: Professors Mike Freedman, Kyle Jamieson
  - Slides available on course website
  - Office hours immediately after lecture
- Precept: TAs Themis Melissaris, Daniel Suo
- Main Q&A forum: www.piazza.com
  - Graded on class participation: so ask & answer!
  - No anonymous posts or questions
  - Can send private messages to instructors

Learning the material: Books
- Lecture notes! No required textbooks.
- Programming reference: The Go Programming Language, Alan Donovan and Brian Kernighan (www.gopl.io, $17 on Amazon!)
- Topic references:
  - Distributed Systems: Principles and Paradigms, Andrew S. Tanenbaum and Maarten van Steen
  - Guide to Reliable Distributed Systems, Kenneth Birman

Grading
- Five assignments (5% for the first, then 10% each)
  - Late submissions: 90% credit if 24 hours late, 80% if 2 days late, 50% if more than 5 days late
  - THREE free late days (we'll figure out which ones are best)
  - The only failing grades I've given are to students who don't (try to) do the assignments
- Two exams (50% total)
  - Midterm exam before spring break (25%)
  - Final exam during exam period (25%)
- Class participation (5%): in lecture, precept, and Piazza

Policies: Write Your Own Code
Programming is an individual creative process. At first, discussing ideas with friends is fine. When writing code, however, the program must be your own work. Do not copy another person's programs, comments, README description, or any part of a submitted assignment. This includes character-by-character transliteration as well as derivative works. You cannot use another's code, even while citing them. Writing code for use by another, or using another's code, is academic fraud in the context of coursework. Do not publish your code, e.g., on GitHub, during or after the course!

Assignment 1
- Learn how to program in Go
- Implement sequential Map/Reduce
- Instructions on assignment web page
- Due September 28 (two weeks)

Clickers: Quick Surveys
Why are you here?
A. Needed to satisfy course requirements
B. Want to learn Go or distributed programming
C. Interested in concepts behind distributed systems
D. Thought this course was Bridges

Case Study: MapReduce
(Data-parallel programming at scale)

Application: Word Count
SELECT count(word) FROM data GROUP BY word

cat data.txt | tr -s '[[:punct:][:space:]]' '\n' | sort | uniq -c
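For readers who want to try this in Go (the language used for the course assignments), here is a minimal sequential word-count sketch that mirrors the shell pipeline above; the input file name data.txt and the tokenization rule are assumptions carried over from the example, and the output is unordered rather than sorted.

// Minimal sequential word count in Go, analogous to the shell pipeline above.
// Assumes an input file named data.txt in the current working directory.
package main

import (
	"fmt"
	"os"
	"strings"
	"unicode"
)

func main() {
	data, err := os.ReadFile("data.txt")
	if err != nil {
		panic(err)
	}
	// Split on any run of punctuation or whitespace,
	// like `tr -s '[[:punct:][:space:]]' '\n'`.
	words := strings.FieldsFunc(string(data), func(r rune) bool {
		return unicode.IsPunct(r) || unicode.IsSpace(r)
	})
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}
	for w, n := range counts {
		fmt.Printf("%7d %s\n", n, w)
	}
}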

Using partial aggregation
1. Compute word counts from individual files
2. Then merge intermediate output
3. Compute word count on merged outputs

Using partial aggregation (in parallel)
1. In parallel, send to workers: compute word counts from individual files
   - Collect results, wait until all have finished
2. Then merge intermediate output
3. Compute word count on merged intermediates

MapReduce: Programming Interface
- map(key, value) -> list(<k', v'>)
  - Applies function to a (key, value) pair and produces a set of intermediate pairs
- reduce(key, list<value>) -> <k', v'>
  - Applies aggregation function to values
  - Outputs result

MapReduce: Programming Interface (word count)
map(key, value):
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(key, list(values)):
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
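A Go rendering of this word-count pair of functions might look like the following; the KeyValue type and the exact signatures here are illustrative assumptions, not necessarily the API used in the course assignments.

// Illustrative Go versions of the word-count map and reduce functions above.
package mapreduce

import (
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is an intermediate (key, value) pair emitted by Map.
type KeyValue struct {
	Key   string
	Value string
}

// Map emits <word, "1"> for every word in the input value.
func Map(key, value string) []KeyValue {
	words := strings.FieldsFunc(value, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce sums the "1"s emitted for a single word and returns the count as a string.
func Reduce(key string, values []string) string {
	result := 0
	for _, v := range values {
		n, _ := strconv.Atoi(v)
		result += n
	}
	return strconv.Itoa(result)
}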

MapReduce: Optimizations
- combine(list<key, value>) -> list<k, v>
  - Perform partial aggregation on the mapper node: <the, 1>, <the, 1>, <the, 1> -> <the, 3>
  - reduce() should be commutative and associative
- partition(key, int) -> int
  - Need to aggregate intermediate values with the same key
  - Given n partitions, map key to partition 0 <= i < n
  - Typically via hash(key) mod n
  - (A Go sketch of both hooks appears after this group of slides.)

Putting it together:
map -> combine -> partition -> reduce

Synchronization Barrier
- Note: All-to-all shuffle between mappers and reducers
- Written to disk ("materialized") between each stage

Fault Tolerance in MapReduce
- Map worker writes intermediate output to local disk, separated by partitioning. Once completed, it tells the master node.
- Reduce worker is told the location of the map task outputs, pulls its partition's data from each mapper, and executes the reduce function across that data.
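As a rough illustration of the two optimization hooks above, here is a Go sketch of a combiner and a hash-based partition function; it reuses the illustrative KeyValue type from the earlier sketch, and the names and signatures are again assumptions rather than any framework's real API.

// Illustrative combiner and partitioner for the word-count example.
package mapreduce

import (
	"hash/fnv"
	"strconv"
)

// combine performs partial aggregation on the mapper node, merging the
// values emitted for each key into a single pair (e.g., three <the, 1>
// pairs become one <the, 3>). This is only safe because the reduce
// function (summation) is commutative and associative.
func combine(kvs []KeyValue) []KeyValue {
	counts := make(map[string]int)
	for _, kv := range kvs {
		n, _ := strconv.Atoi(kv.Value)
		counts[kv.Key] += n
	}
	out := make([]KeyValue, 0, len(counts))
	for k, n := range counts {
		out = append(out, KeyValue{Key: k, Value: strconv.Itoa(n)})
	}
	return out
}

// partition maps a key to one of n reduce partitions via hash(key) mod n,
// so all intermediate values with the same key reach the same reducer.
func partition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}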

Fault Tolerance in MapReduce
- Master node monitors the state of the system
  - If the master fails, the job aborts and the client is notified
- Map worker failure
  - Both in-progress and completed tasks are marked as idle
  - Reduce workers are notified when a map task is re-executed on another map worker
- Reduce worker failure
  - In-progress tasks are reset to idle (and re-executed)
  - Completed tasks had already been written to the global file system

Straggler Mitigation in MapReduce
- Tail latency means some workers finish late
- For slow map tasks, execute in parallel on a second map worker as a backup, and race to complete the task (a toy sketch of this idea appears at the end of this section)

You'll build (simplified) MapReduce!
- Assignment 1: Sequential Map/Reduce
  - Learn to program in Go! Due September 28 (two weeks)
- Assignment 2: Distributed Map/Reduce
  - Learn Go's concurrency, network I/O, and RPCs. Due October 19

This Friday: Grouped Precept, Room CS 105
"Program your next service in Go"
Sameer Ajmani, manages the Go language team @ Google
(Earlier: PhD w/ Barbara Liskov @ MIT)
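To make the backup-task idea from the straggler-mitigation slide concrete, here is a toy Go sketch that races two copies of the same task, assuming the task is deterministic and safe to run twice; runTask, the worker names, and the timings are purely illustrative.

// Toy sketch of backup ("speculative") execution for a slow map task.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runTask simulates executing a map task on one worker and reports
// which worker finished it.
func runTask(taskID int, worker string, done chan<- string) {
	// Simulated variable execution time; a straggler is just the slow copy.
	time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
	done <- worker
}

func main() {
	// Buffered so the losing copy does not block forever when it finishes.
	done := make(chan string, 2)

	// Primary execution of task 42.
	go runTask(42, "worker-A", done)

	// The master suspects a straggler and launches a backup copy,
	// then simply takes whichever finishes first.
	go runTask(42, "worker-B", done)

	winner := <-done
	fmt.Printf("task 42 completed first by %s; the other copy's result is ignored\n", winner)
}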