Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Similar documents
Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Database Systems CSE 414

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344

CSE 344 MAY 2 ND MAP/REDUCE

Introduction to Data Management CSE 344

Why compute in parallel?

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

Introduction to Data Management CSE 344

CSE 414: Section 7 Parallel Databases. November 8th, 2018

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 11 Parallel DBMSs and MapReduce

MapReduce: Simplified Data Processing on Large Clusters

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

CompSci 516: Database Systems

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

The MapReduce Abstraction

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

Parallel Computing: MapReduce Jin, Hai

L22: SC Report, Map Reduce

The MapReduce Framework

6.830 Lecture Spark 11/15/2017

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Part A: MapReduce. Introduction Model Implementation issues

MI-PDB, MIE-PDB: Advanced Database Systems

Batch Processing Basic architecture

CS 61C: Great Ideas in Computer Architecture. MapReduce

Database Applications (15-415)

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Big Data Management and NoSQL Databases

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Introduction to Map Reduce

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

MapReduce, Hadoop and Spark. Bompotas Agorakis

Parallel Programming Concepts

Cloud Computing CS

Databases 2 (VU) ( / )

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Hadoop Map Reduce 10/17/2018 1

Clustering Lecture 8: MapReduce

Introduction to MapReduce

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Programming Models MapReduce

Improving the MapReduce Big Data Processing Framework

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

Large-Scale GPU programming

TI2736-B Big Data Processing. Claudia Hauff

Concurrency for data-intensive applications

CSE 544: Principles of Database Systems

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Advanced Database Technologies NoSQL: Not only SQL

Distributed Computation Models

MapReduce: A Programming Model for Large-Scale Distributed Computation

Agenda. Request- Level Parallelism. Agenda. Anatomy of a Web Search. Google Query- Serving Architecture 9/20/10

CS MapReduce. Vitaly Shmatikov

MapReduce-style data processing

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

CSE 344 Final Review. August 16 th

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Distributed Systems. COS 418: Distributed Systems Lecture 1. Mike Freedman. Backrub (Google) Google The Cloud is not amorphous

Lecture 11 Hadoop & Spark

MapReduce. Simplified Data Processing on Large Clusters (Without the Agonizing Pain) Presented by Aaron Nathan

Map- Reduce. Everything Data CompSci Spring 2014

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

"Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I

Introduction to MapReduce

Distributed computing: index building and use

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Introduction to Database Systems CSE 414

MapReduce: Algorithm Design for Relational Operations

MapReduce and Friends

CSE 444: Database Internals. Lecture 23 Spark

1. Introduction to MapReduce

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Introduction to Data Management CSE 344

MapReduce and Hadoop

Data Partitioning and MapReduce

In the news. Request- Level Parallelism (RLP) Agenda 10/7/11

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

Introduction to MapReduce

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Chapter 4: Apache Spark

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Transcription:

Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s lecture HW6 released Not due until next Friday 5/11 No WQ6 (Yay!) CSE 414 - Spring 2018 1 2 Approaches to Parallel Query Evaluation Inter-query parallelism One query per node Good for transactional (OLTP) workloads Inter-operator parallelism Operator per node Good for analytical (OLAP) workloads Parallel Data Processing in the 20 th Century Intra-operator parallelism Operator on multiple nodes Good for both? CSE 414 - Spring 2018 3 We study only intra-operator parallelism: most scalable 4 Parallel Execution of RA Operators: Partitioned Hash-Join Data: R(, A, ), S(,, C) Query: R(, A, ) S(,, C) Initially, both R and S are partitioned on and Reshuffle R on R. and S on S. Each server computes the join locally R 1, S 1 R 2, S 2... R P, S P R 1, S 1 R 2, S 2... R P, S P CSE 414 - Spring 2018 5 Data: R(,A, ), S(,, C) Query: R(,A,) S(,,C) Partition Shuffle on Local Join 1 20 2 50 Parallel Join Illustration R1 S1 R2 S2 1 20 3 20 4 20 M1 101 50 102 50 3 20 4 20 R1 S1 R2 S2 M1 201 20 2 50 M2 M2 201 20 202 50 101 50 102 50 202 50 6 1

Data: R(A, ), S(C, D) Query: R(A,) =C S(C,D) roadcast Join roadcast S Reshuffle R on R. R 1 R 2... R P S Parallel Data Processing @ 2000 R 1, S R 2, S... R P, S Why would you want to do this? CSE 414 - Spring 2018 7 CSE 414 - Spring 2018 8 Optional Reading Original paper: https://www.usenix.org/legacy/events/osdi04/t ech/dean.html Rebuttal to a comparison with parallel Ds: http://dl.acm.org/citation.cfm?doid=1629175.1 629198 Chapter 2 (Sections 1,2,3 only) of Mining of Massive Datasets, by Rajaraman and Ullman http://i.stanford.edu/~ullman/mmds.html CSE 414 - Spring 2018 9 Motivation We learned how to parallelize relational database systems While useful, it might incur too much overhead if our query plans consist of simple operations MapReduce is a programming model for such computation First, let s study how data is stored in such systems CSE 414 - Spring 2018 10 Distributed File System (DFS) For very large files: Ts, Ps Each file is partitioned into chunks, typically 64M Each chunk is replicated several times ( 3), on different racks, for fault tolerance Implementations: Google s DFS: GFS, proprietary Hadoop s DFS: HDFS, open source MapReduce Google: paper published 2004 Free variant: Hadoop MapReduce = high-level programming model and implementation for large-scale parallel data processing CSE 414 - Spring 2018 11 CSE 414 - Spring 2018 12 2

Typical Problems Solved by MR Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, transform Write the results Paradigm stays the same, change map and reduce functions for different problems Files! Data Model A file = a bag of (key, value) pairs Sounds familiar after HW5? A MapReduce program: Input: a bag of (inputkey, value) pairs Output: a bag of (outputkey, value) pairs outputkey is optional CSE 414 - Spring 2018 13 slide source: Jeff Dean CSE 414 - Spring 2018 14 Step 1: the MAP Phase Step 2: the REDUCE Phase User provides the MAP-function: Input: (input key, value) Output: bag of (intermediate key, value) User provides the REDUCE function: Input: (intermediate key, bag of values) Output: bag of output (values) System applies the map function in parallel to all (input key, value) pairs in the input file System groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function CSE 414 - Spring 2018 15 CSE 414 - Spring 2018 16 Example MAP REDUCE Counting the number of occurrences of each word in a large collection of documents Each Document The key = document id (did) The value = set of words (word) (did1,v1) (did2,v2) (did3,v3) (w3,1) Shuffle (w1, (1,1,1,,1)) (w2, (1,1,)) (w3,(1)) (w1, 25) (w2, 77) (w3, 12) map(string key, String value): // key: document name // value: document contents for each word w in value: emitintermediate(w, 1 ); reduce(string key, Iterator values): // key: a word // values: a list of counts int result = 0; result += ParseInt(v);.... emit(asstring(result)); 17 CSE 414 - Spring 2018 18 3

Workers A worker is a process that executes one task at a time Typically there is one worker per processor, hence 4 or 8 per node (did1,v1) (did2,v2) (did3,v3) MAP Tasks (M) (w3,1) Shuffle REDUCE Tasks (R) (w1, (1,1,1,,1)) (w1, 25) (w2, (1,1,)) (w2, 77) (w3,(1)) (w3, 12).... CSE 414 - Spring 2018 19 20 Fault Tolerance If one server fails once every year... then a job with 10,000 servers will fail in less than one hour MapReduce handles fault tolerance by writing intermediate files to disk: Mappers write file to local disk Reducers read the files (=reshuffling); if the server fails, the reduce task is restarted on another server CSE 414 - Spring 2018 21 Implementation There is one master node Master partitions input file into M splits, by key Master assigns workers (=servers) to the M map tasks, keeps track of their progress Workers write their output to local disk, partition into R regions Master assigns workers to the R reduce tasks Reduce workers read regions from the map workers local disks CSE 414 - Spring 2018 22 Interesting Implementation Details ackup tasks: Straggler = a machine that takes unusually long time to complete one of the last tasks. E.g.: ad disk forces frequent correctable errors (30M/s à 1M/s) The cluster scheduler has scheduled other tasks on that machine Stragglers are a main reason for slowdown Solution: pre-emptive backup execution of the last few remaining in-progress tasks Worker 1 Worker 2 Worker 3 Straggler Example Killed ackup execution Straggler Killed time CSE 414 - Spring 2018 23 CSE 414 - Spring 2018 24 4

Using MapReduce in Practice: Implementing RA Operators in MR Relational Operators in MapReduce Given relations R(A,) and S(,C) compute: Selection: σ A=123 (R) Group-by: γ A,sum() (R) Join: R S CSE 414 - Spring 2018 26 Selection σ A=123 (R) Selection σ A=123 (R) if t.a = 123: EmitIntermediate(t.A, t); (123, [ t 2, t 3 ] ) if t.a = 123: EmitIntermediate(t.A, t); A t 1 23 t 2 123 t 3 123 t 4 42 reduce(string A, Iterator values): Emit(v); ( t 2, t 3 ) 27 reduce(string A, Iterator values): Emit(v); No need for reduce. ut need system hacking in Hadoop to remove reduce from MapReduce 28 Group y γ A,sum() (R) Join EmitIntermediate(t.A, t.); (23, [ t 1 ] ) (42, [ t 4 ] ) (123, [ t 2, t 3 ] ) Two simple parallel join algorithms: Partitioned hash-join (we saw it, will recap) A t 1 23 10 t 2 123 21 t 3 123 4 t 4 42 6 reduce(string A, Iterator values): s = 0 s = s + v Emit(A, s); ( 23, 10 ), ( 42, 6 ), (123, 25) 29 roadcast join CSE 414 - Spring 2018 30 5

R(A,) =C S(C,D) Partitioned Hash-Join R(A,) =C S(C,D) Partitioned Hash-Join Reshuffle R on R. and S on S. Initially, both R and S are horizontally partitioned R 1, S 1 R 2, S 2... R P, S P type case t.relationname of R : EmitIntermediate(t., ( R, t)); S : EmitIntermediate(t.C, ( S, t)); actual tuple Each server computes the join locally R 1, S 1 R 2, S 2... R P, S P CSE 414 - Spring 2018 31 reduce(string k, Iterator values): R = empty; S = empty; case v.type of: R : R.insert(v) S : S.insert(v); for v1 in R, for v2 in S Emit(v1,v2); 32 R(A,) =C S(C,D) Reshuffle R on R. roadcast Join R 1 R 2... R P roadcast S S R(A,) =C S(C,D) roadcast Join map(string value): readfromnetwork(s); /* over the network */ hashtable = new HashTable() for each w in S: hashtable.insert(w.c, w) map should read several records of R: value = some group of tuples from R Read entire table S, build a Hash Table R 1, S R 2, S... R P, S CSE 414 - Spring 2018 33 for each v in value: for each w in hashtable.find(v.) Emit(v,w); reduce(): /* empty: map-side only */ 34 HW6 HW6 will ask you to write SQL queries and MapReduce tasks using Spark You will get to implement SQL using MapReduce tasks Can you beat Spark s implementation? 6