CSE 344 MAY 2ND MAP/REDUCE



ADMINISTRIVIA
- HW5 due tonight
- Practice midterm
- Section tomorrow: exam review

PERFORMANCE METRICS FOR PARALLEL DBMSS
Nodes = processors, computers.
- Speedup: more nodes, same data → higher speed
- Scaleup: more nodes, more data → same speed

LINEAR VS. NON-LINEAR SPEEDUP
[Figure: speedup plotted against the number of nodes P (1, 5, 10, 15); ideal speedup grows linearly with P.]

LINEAR VS. NON-LINEAR SCALEUP
[Figure: batch scaleup plotted against the number of nodes P (1, 5, 10, 15) and data size growing together; ideal scaleup stays flat.]

WHY SUB-LINEAR SPEEDUP AND SCALEUP?
- Startup cost: the cost of starting an operation on many nodes
- Interference: contention for resources between nodes
- Skew: the slowest node becomes the bottleneck

SHARED NOTHING
A cluster of commodity machines on a high-speed interconnection network; often called "clusters" or "blade servers". Each machine has its own memory and disk: lowest contention. Example: Google. Because all machines today have many cores and many disks, shared-nothing systems typically run many "nodes" on a single physical machine. Shared nothing is easy to maintain and to scale, but the most difficult to administer and tune. We discuss only shared nothing in class.

APPROACHES TO PARALLEL QUERY EVALUATION
- Inter-query parallelism: one transaction per node. Good for transactional workloads.
- Inter-operator parallelism: one operator per node. Good for analytical workloads.
- Intra-operator parallelism: one operator on multiple nodes. Good for both?
We study only intra-operator parallelism: it is the most scalable.

DISTRIBUTED QUERY PROCESSING
Data is horizontally partitioned across many servers. Operators may require data reshuffling. First, let's discuss how to distribute data across multiple nodes/servers.

HORIZONTAL DATA PARTITIONING
[Figure: a relation with schema (K, A, B) is split into chunks stored on servers 1, 2, ..., P.] Which tuples go to what server?

HORIZONTAL DATA PARTITIONING
- Block partition: partition tuples arbitrarily such that size(R_1) ≈ ... ≈ size(R_P).
- Hash partitioned on attribute A: tuple t goes to chunk i, where i = h(t.A) mod P + 1. (Recall: calling hash functions is free in this class.)
- Range partitioned on attribute A: partition the range of A into −∞ = v_0 < v_1 < ... < v_P = ∞; tuple t goes to chunk i if v_{i−1} < t.A ≤ v_i.
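The two value-based schemes above can be sketched in a few lines. This is a minimal single-process simulation, not a distributed implementation; the relation R, the hash function (Python's built-in hash), and the split points are illustrative.

```python
def hash_partition(tuples, attr, P, h=hash):
    """Send tuple t to chunk i = h(t[attr]) mod P + 1 (chunks are 1-indexed)."""
    chunks = {i: [] for i in range(1, P + 1)}
    for t in tuples:
        chunks[h(t[attr]) % P + 1].append(t)
    return chunks

def range_partition(tuples, attr, splits):
    """splits = [v_1, ..., v_{P-1}]; chunk i holds tuples with v_{i-1} < t[attr] <= v_i."""
    P = len(splits) + 1
    chunks = {i: [] for i in range(1, P + 1)}
    for t in tuples:
        i = 1
        while i <= len(splits) and t[attr] > splits[i - 1]:
            i += 1  # walk right until t[attr] falls below the next split point
        chunks[i].append(t)
    return chunks

# Toy relation R(K, A): 9 tuples, A cycles through 0, 1, 2.
R = [{"K": k, "A": k % 3} for k in range(9)]
by_hash = hash_partition(R, "A", P=2)
by_range = range_partition(R, "K", splits=[2, 5])
```

Every tuple lands in exactly one chunk under either scheme; hashing on A already shows mild skew here, because A takes only three distinct values.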

UNIFORM DATA VS. SKEWED DATA
Let R(K, A, B, C); which of the following partition methods may result in skewed partitions?
- Block partition: uniform.
- Hash-partition on the key K: uniform, assuming a good hash function.
- Hash-partition on the attribute A: may be skewed; e.g., when all records have the same value of the attribute A, then all records end up in the same partition.
- Range partition: may be skewed.
Keep this in mind in the next few slides.

PARALLEL EXECUTION OF RA OPERATORS: GROUPING
Data: R(K, A, B, C). Query: γ_{A, sum(C)}(R). How to compute the group-by if:
- R is hash-partitioned on A?
- R is block-partitioned?
- R is hash-partitioned on K?

PARALLEL EXECUTION OF RA OPERATORS: GROUPING
Data: R(K, A, B, C). Query: γ_{A, sum(C)}(R). If R is block-partitioned or hash-partitioned on K:
1. Reshuffle R on attribute A (partitions R_1, ..., R_P become R'_1, ..., R'_P).
2. Run the grouping on the reshuffled partitions.
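The two steps above can be simulated in one process. This is a sketch under toy assumptions: partitions are plain Python lists, "servers" are list indices, and Python's built-in hash stands in for the partitioning hash function.

```python
from collections import defaultdict

def reshuffle(partitions, attr, P):
    """Route every tuple to server hash(t[attr]) mod P, so equal A-values co-locate."""
    out = [[] for _ in range(P)]
    for part in partitions:
        for t in part:
            out[hash(t[attr]) % P].append(t)
    return out

def local_group_sum(part, key, val):
    """gamma_{key, sum(val)} computed entirely on one server's partition."""
    acc = defaultdict(int)
    for t in part:
        acc[t[key]] += t[val]
    return dict(acc)

# R(K, A, C), initially block-partitioned across P = 2 servers (toy data).
R_parts = [[{"K": 1, "A": "x", "C": 10}, {"K": 2, "A": "y", "C": 5}],
           [{"K": 3, "A": "x", "C": 7}]]
shuffled = reshuffle(R_parts, "A", P=2)
result = {}
for part in shuffled:
    result.update(local_group_sum(part, "A", "C"))  # groups never span servers
```

After the reshuffle, all tuples with the same A value sit on one server, so each server's local group-by is already the final answer for its groups: here, {"x": 17, "y": 5}.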

SPEEDUP AND SCALEUP
Consider the query γ_{A, sum(C)}(R); for the runtime, consider only I/O costs.
- If we double the number of nodes P, what is the new running time? Half: each server holds 1/2 as many chunks.
- If we double both P and the size of R, what is the new running time? The same: each server holds the same number of chunks.
But only if the data is without skew!

SKEWED DATA
R(K, A, B, C). Informally, we say that the data is skewed if one server holds much more data than the average. E.g., we hash-partition on A, and some value of A occurs many times; then the server holding that value will be skewed.

PARALLEL EXECUTION OF RA OPERATORS: PARTITIONED HASH-JOIN
Data: R(K1, A, B), S(K2, B, C). Query: R(K1, A, B) ⋈ S(K2, B, C). Initially, R and S are partitioned on K1 and K2, respectively.
1. Reshuffle R on R.B and S on S.B, producing co-located partitions (R_1, S_1), ..., (R_P, S_P).
2. Each server computes the join of its R_i and S_i locally.
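A single-process sketch of the two phases, under the same toy assumptions as before (lists as partitions, Python's built-in hash as the partitioning function); the local join is a classic build-and-probe hash join.

```python
from collections import defaultdict

def reshuffle_on(partitions, attr, P):
    """Route each tuple to server hash(t[attr]) mod P."""
    out = [[] for _ in range(P)]
    for part in partitions:
        for t in part:
            out[hash(t[attr]) % P].append(t)
    return out

def local_hash_join(r_part, s_part, attr):
    """One server's join: build a hash table on its R chunk, probe with its S chunk."""
    table = defaultdict(list)
    for r in r_part:
        table[r[attr]].append(r)
    return [{**r, **s} for s in s_part for r in table[s[attr]]]

# Toy R(K1, B) and S(K2, B), each starting on 2 servers, partitioned by key.
R = [[{"K1": 1, "B": 20}, {"K1": 2, "B": 50}], [{"K1": 3, "B": 20}]]
S = [[{"K2": 101, "B": 50}], [{"K2": 201, "B": 20}]]
P = 2
R2, S2 = reshuffle_on(R, "B", P), reshuffle_on(S, "B", P)
joined = [row for i in range(P) for row in local_hash_join(R2[i], S2[i], "B")]
```

Because both relations were reshuffled with the same hash function on B, matching tuples are guaranteed to land on the same server, so the union of the local joins equals the global join.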

PARALLEL JOIN ILLUSTRATION
Data: R(K1, A, B), S(K2, B, C). Query: R(K1, A, B) ⋈ S(K2, B, C). Tuples shown as (K1, B) and (K2, B).
Partition (by key):
  M1: R1 = {(1, 20), (2, 50)}, S1 = {(101, 50), (102, 50)}
  M2: R2 = {(3, 20), (4, 20)}, S2 = {(201, 20), (202, 50)}
Shuffle on B, then local join:
  M1 (B = 20): R1 = {(1, 20), (3, 20), (4, 20)}, S1 = {(201, 20)}
  M2 (B = 50): R2 = {(2, 50)}, S2 = {(101, 50), (102, 50), (202, 50)}

BROADCAST JOIN
Data: R(A, B), S(C, D). Query: R(A, B) ⋈_{B=C} S(C, D). Instead of reshuffling R on R.B, broadcast S: each server i keeps its existing partition R_i and receives a full copy of S, then computes the join locally over (R_i, S). Why would you want to do this?
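A sketch of the broadcast variant under the same toy conventions; note that R is never reshuffled, and the whole (small) S is handed to every server.

```python
def broadcast_join(R_parts, S, r_attr, s_attr):
    """Ship the full (small) S to every server; each joins its local R_i against S."""
    out = []
    for R_i in R_parts:              # R stays partitioned exactly as it was
        for r in R_i:
            for s in S:              # every server sees all of S
                if r[r_attr] == s[s_attr]:
                    out.append({**r, **s})
    return out

# Toy R(A, B) on 2 servers; S(C, D) is a small relation worth broadcasting.
R_parts = [[{"A": 1, "B": 10}], [{"A": 2, "B": 20}, {"A": 3, "B": 10}]]
S = [{"C": 10, "D": "x"}]
rows = broadcast_join(R_parts, S, "B", "C")
```

The payoff is avoiding the reshuffle of the big relation R entirely; broadcasting only pays off when S is small enough that P copies of it cost less than moving R.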

Order(oid, item, date), Line(item, )
EXAMPLE PARALLEL QUERY PLAN
Find all orders from today, along with the items ordered.
SELECT *
FROM Order o, Line i
WHERE o.item = i.item
  AND o.date = today()
Logical plan: join on o.item = i.item, with inputs (1) a selection date = today() over a scan of Order o, and (2) a scan of Item i.

Order(oid, item, date), Line(item, )
PARALLEL QUERY PLAN
The Order side of the plan: each of nodes 1, 2, 3 scans its partition of Order o, applies the selection date = today(), and hash-partitions the surviving tuples on h(o.item) to send them to the join nodes.

Order(oid, item, date), Line(item, )
PARALLEL QUERY PLAN
The Item side of the plan: each of nodes 1, 2, 3 scans its partition of Item i and hash-partitions its tuples on h(i.item) to send them to the join nodes.

Order(oid, item, date), Line(item, )
EXAMPLE PARALLEL QUERY PLAN
Finally, each node computes the join o.item = i.item locally: node 1 contains all orders and all lines where hash(item) = 1, node 2 where hash(item) = 2, and node 3 where hash(item) = 3.

MOTIVATION
We learned how to parallelize relational database systems. While useful, a full DBMS might incur too much overhead when our query plans consist of simple operations. MapReduce is a programming model for exactly such computation. First, let's study how data is stored in such systems.

DISTRIBUTED FILE SYSTEM (DFS)
- For very large files: TBs, PBs.
- Each file is partitioned into chunks, typically 64MB.
- Each chunk is replicated several times (≥ 3), on different racks, for fault tolerance.
Implementations: Google's DFS (GFS, proprietary); Hadoop's DFS (HDFS, open source).

MAPREDUCE
Google: paper published in 2004. Free variant: Hadoop. MapReduce = a high-level programming model and implementation for large-scale parallel data processing.

TYPICAL PROBLEMS SOLVED BY MR
1. Read a lot of data.
2. Map: extract something you care about from each record.
3. Shuffle and sort.
4. Reduce: aggregate, summarize, filter, transform.
5. Write the results.
The paradigm stays the same; change the map and reduce functions for different problems.

DATA MODEL
Files! A file = a bag of (key, value) pairs. A MapReduce program:
- Input: a bag of (inputKey, value) pairs.
- Output: a bag of (outputKey, value) pairs.

STEP 1: THE MAP PHASE
The user provides the MAP function. Input: (input key, value). Output: bag of (intermediate key, value) pairs. The system applies the map function in parallel to all (input key, value) pairs in the input file.

STEP 2: THE REDUCE PHASE
The user provides the REDUCE function. Input: (intermediate key, bag of values). Output: bag of output values. The system groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function.

EXAMPLE
Counting the number of occurrences of each word in a large collection of documents. For each document: the key = document id (did); the value = set of words (word).

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
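The pseudocode above can be run end to end as a small single-process simulation of the three phases (map, shuffle, reduce); the document contents and function names here are illustrative, not part of any MapReduce API.

```python
from collections import defaultdict

def map_fn(did, contents):
    """MAP: emit (word, 1) for each word occurrence in the document."""
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    """REDUCE: sum the emitted counts for one word."""
    return (word, sum(counts))

def mapreduce(docs):
    # Map phase: apply map_fn to every (did, contents) pair.
    intermediate = [kv for did, contents in docs.items()
                    for kv in map_fn(did, contents)]
    # Shuffle: group all emitted values by their intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce_fn call per intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce({"did1": "a b a", "did2": "b c"})
```

Running this on the two toy documents yields {"a": 2, "b": 2, "c": 1}; the shuffle step is exactly the grouping by intermediate key described in the reduce-phase slide.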

MAP REDUCE
Map:
  (did1, v1) → (w1, 1), (w2, 1), (w3, 1)
  (did2, v2) → (w1, 1), (w2, 1)
  (did3, v3) → ...
Shuffle, then reduce:
  (w1, (1, 1, 1, ..., 1)) → (w1, 25)
  (w2, (1, 1, ...)) → (w2, 77)
  (w3, (1)) → (w3, 12)