
Introduction to Data Management CSE 344
Lecture 26: Parallel Databases and MapReduce
CSE 344 - Winter 2013

HW8
- MapReduce (Hadoop) with a declarative language (Pig). The cluster will run in Amazon's cloud (AWS): give your credit card, click, click, click, and you have a MapReduce cluster.
- We will analyze a real 0.5 TB graph. Processing the entire data takes hours. Required problems: queries on a subset only. Extra credit problem #4: the entire data.
- Tomorrow's quiz sections: step-by-step instructions on how to connect to AWS. Don't miss!

Amazon Warning
"We HIGHLY recommend you remind students to turn off any instances after each class/session as this can quickly diminish the credits and start charging the card on file. You are responsible for the overages. AWS customers can now use billing alerts to help monitor the charges on their AWS bill. You can get started today by visiting your Account Activity page to enable monitoring of your charges. Then, you can set up a billing alert by simply specifying a bill threshold and an e-mail address to be notified as soon as your estimated charges reach the threshold."

Outline
- Today: query processing in parallel DBs
- Next lecture: parallel data processing at massive scale (MapReduce)
- Reading assignment: Chapter 2 (Sections 1, 2, and 3 only) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html

Review
- Why parallel processing?
- What are the possible architectures for a parallel database system?
- What are speedup and scaleup?


Basic Query Processing: Quick Review in Class
Basic query processing on one node. Given relations R(A,B) and S(B,C), no indexes, how do we compute:
- Selection σ_A=123(R): scan file R, select records with A=123.
- Group-by γ_A,sum(B)(R): scan file R, insert into a hash table using attribute A as the key; when a new key is equal to an existing one, add B to the value.
- Join R ⋈ S: scan file S, insert into a hash table using attribute B as the key; then scan file R and probe the hash table using attribute B.
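The three scans above can be sketched in a few lines of Python; the relation contents below are made-up sample data for illustration:

```python
from collections import defaultdict

# R(A, B) and S(B, C) as lists of tuples -- sample data
R = [(123, 10), (123, 20), (7, 5)]
S = [(10, "x"), (5, "y"), (10, "z")]

# Selection sigma_{A=123}(R): scan R, keep matching records
selection = [t for t in R if t[0] == 123]

# Group-by gamma_{A,sum(B)}(R): scan R, hash on A, add B into the running sum
sums = defaultdict(int)
for a, b in R:
    sums[a] += b

# Join R |><| S on B: build a hash table on S.B, then probe it with R.B
table = defaultdict(list)
for b, c in S:
    table[b].append(c)
join = [(a, b, c) for a, b in R for c in table[b]]
```

Note that each operator reads every input tuple exactly once, which is why the runtime is dominated by scanning.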

Parallel Query Processing
How do we compute these operations on a shared-nothing parallel DB?
- Selection σ_A=123(R): that's easy, won't discuss.
- Group-by γ_A,sum(B)(R)
- Join R ⋈ S
Before we answer that: how do we store R (and S) on a shared-nothing parallel DB?

Horizontal Data Partitioning
Data: a relation with attributes K, A, B. Servers: 1, 2, ..., P. Each server stores a subset of the tuples. Which tuples go to what server?

Horizontal Data Partitioning
- Block partition: partition tuples arbitrarily such that size(R_1) ≈ ... ≈ size(R_P).
- Hash partitioned on attribute A: tuple t goes to chunk i, where i = h(t.A) mod P + 1.
- Range partitioned on attribute A: partition the range of A into -∞ = v_0 < v_1 < ... < v_P = ∞; tuple t goes to chunk i if v_(i-1) < t.A ≤ v_i.
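The three schemes can be sketched directly; a minimal Python illustration (the hash function and the range bounds v_i are stand-ins chosen for the example):

```python
P = 4  # number of servers

def block_partition(tuples):
    # block partition: deal tuples out round-robin, so chunk sizes stay roughly equal
    chunks = [[] for _ in range(P)]
    for i, t in enumerate(tuples):
        chunks[i % P].append(t)
    return chunks

def hash_chunk(a):
    # hash partition on A: chunk i = h(t.A) mod P + 1 (1-based, as on the slide)
    return hash(a) % P + 1

def range_chunk(a, bounds):
    # range partition on A: bounds = [v_1, ..., v_(P-1)]; the tuple goes to the
    # first chunk whose upper bound is >= t.A, and chunk P catches the rest
    for i, v in enumerate(bounds, start=1):
        if a <= v:
            return i
    return P
```

For example, `range_chunk(25, [10, 20, 30])` returns chunk 3, since 20 < 25 ≤ 30.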

Parallel GroupBy
Data: R(K,A,B,C). Query: γ_A,sum(C)(R). Discuss in class how to compute in each case:
- R is hash-partitioned on A
- R is block-partitioned
- R is hash-partitioned on K

Parallel GroupBy
Data: R(K,A,B,C). Query: γ_A,sum(C)(R).
If R is block-partitioned or hash-partitioned on K: reshuffle R on attribute A, so that all tuples with the same value of A end up on the same server; then each server computes the group-by locally on its new partition.
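The reshuffle step can be simulated in a few lines; a sketch, assuming hash-based reshuffling and made-up sample tuples:

```python
from collections import defaultdict

P = 3
# R(K, A, B, C) block-partitioned across P servers -- sample data
partitions = [
    [(1, "x", 0, 10), (2, "y", 0, 5)],
    [(3, "x", 0, 7)],
    [(4, "y", 0, 1), (5, "x", 0, 2)],
]

# Reshuffle on A: every server sends each tuple to server h(A) mod P
reshuffled = [[] for _ in range(P)]
for part in partitions:
    for t in part:
        reshuffled[hash(t[1]) % P].append(t)

# Each server now computes gamma_{A,sum(C)} locally -- all tuples with the
# same A are guaranteed to sit on the same server after the reshuffle
result = {}
for part in reshuffled:
    local = defaultdict(int)
    for k, a, b, c in part:
        local[a] += c
    result.update(local)
```

Because each group lives entirely on one server after the reshuffle, the local results can simply be concatenated.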


Parallel Join
Data: R(K1,A,B), S(K2,B,C). Query: R(K1,A,B) ⋈ S(K2,B,C). Initially, both R and S are horizontally partitioned on K1 and K2.
Step 1: reshuffle R on R.B and S on S.B, so that tuples that can join end up on the same server.
Step 2: each server computes the join locally on its fragments R_i and S_i.
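The two steps can be sketched the same way as the parallel group-by; a minimal Python simulation with made-up tuples:

```python
from collections import defaultdict

P = 2
# R(K1, A, B) and S(K2, B, C), each partitioned on its own key -- sample data
R_parts = [[(1, "a", 10)], [(2, "b", 20), (3, "c", 10)]]
S_parts = [[(7, 20, "u")], [(8, 10, "v")]]

# Step 1: reshuffle both relations on B using the same hash function
R_sh = [[] for _ in range(P)]
S_sh = [[] for _ in range(P)]
for part in R_parts:
    for t in part:
        R_sh[hash(t[2]) % P].append(t)
for part in S_parts:
    for t in part:
        S_sh[hash(t[1]) % P].append(t)

# Step 2: each server joins its fragments locally with a hash join on B
join = []
for i in range(P):
    table = defaultdict(list)
    for k2, b, c in S_sh[i]:
        table[b].append((k2, c))
    for k1, a, b in R_sh[i]:
        for k2, c in table[b]:
            join.append((k1, a, b, k2, c))
```

Using the same hash function for both relations is what guarantees that matching tuples meet on the same server.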


Speedup and Scaleup
Consider the query γ_A,sum(C)(R); its runtime is dominated by reading chunks from disk.
- If we double the number of nodes P, what is the new running time? Half: each server holds half as many chunks.
- If we double both P and the size of R, what is the new running time? The same: each server holds the same number of chunks.


Uniform Data vs. Skewed Data
Let R(K,A,B,C); which of the following partition methods may result in skewed partitions?
- Block partition: uniform.
- Hash-partition on the key K: uniform, assuming a good hash function.
- Hash-partition on the attribute A: may be skewed. E.g., when all records have the same value of attribute A, all records end up in the same partition.
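A quick sketch makes the skew concrete: with made-up data where every record has the same value of A, hash-partitioning on A sends everything to one server, while hash-partitioning on the unique key K spreads the load evenly.

```python
from collections import Counter

P = 4
# R(K, A): keys are unique, but attribute A is identical in every record
R = [(k, "same") for k in range(1000)]

by_key = Counter(hash(k) % P for k, a in R)  # hash-partition on the key K
by_a = Counter(hash(a) % P for k, a in R)    # hash-partition on attribute A
# by_key: ~250 tuples on each of the 4 servers
# by_a: all 1000 tuples pile up on a single server
```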

Parallel DBMS
- Parallel query plan: a tree of parallel operators (intra-operator parallelism). Data streams from one operator to the next. Typically all cluster nodes process all operators.
- Can run multiple queries at the same time (inter-query parallelism). Queries share the nodes in the cluster.
- Notice that the user does not need to know how their SQL query was processed.

Loading Data into a Parallel DBMS
Example using the Teradata system. AMP = Access Module Processor = unit of parallelism.

Example Parallel Query Execution
Order(oid, item, date), Line(item, ...)
Find all orders from today, along with the items ordered:
SELECT * FROM Order o, Line i WHERE o.item = i.item AND o.date = today()
Logical plan: scan Order o and apply the selection date = today(); scan Item i; join on o.item = i.item.

Example Parallel Query Execution (continued)
Each AMP scans its fragment of Order o, applies the selection date = today(), and hash-partitions the result on h(o.item), sending each tuple to the AMP that owns that hash bucket.

Example Parallel Query Execution (continued)
Similarly, each AMP scans its fragment of Item i and hash-partitions it on h(i.item).

Example Parallel Query Execution (continued)
Each AMP then joins its incoming fragments on o.item = i.item: AMP 1 contains all orders and all lines where hash(item) = 1, AMP 2 where hash(item) = 2, and AMP 3 where hash(item) = 3.

Parallel Dataflow Implementation
- Use relational operators unchanged.
- Add a special shuffle operator: it handles data routing, buffering, and flow control, and is inserted between consecutive operators in the query plan.
- Two components: ShuffleProducer and ShuffleConsumer. The producer pulls data from its operator and sends it to n consumers; it acts as the driver for the operators below it in the query plan. The consumer buffers input data from n producers and makes it available to its operator through the getNext interface.
- You will use this extensively in 444.

Parallel Data Processing at Massive Scale

Data Centers Today
- Large number of commodity servers, connected by a high-speed commodity network.
- Rack: holds a small number of servers.
- Data center: holds many racks.

Data Processing at Massive Scale
- Want to process petabytes of data and more.
- Massive parallelism: 100s, or 1,000s, or 10,000s of servers; many hours.
- Failure: if the mean time between failures is 1 year, then 10,000 servers have one failure per hour.

Distributed File System (DFS)
- For very large files: TBs, PBs.
- Each file is partitioned into chunks, typically 64 MB.
- Each chunk is replicated several times (typically 3), on different racks, for fault tolerance.
- Implementations: Google's DFS (GFS, proprietary); Hadoop's DFS (HDFS, open source).

MapReduce
- Google: paper published in 2004.
- Free variant: Hadoop.
- MapReduce = high-level programming model and implementation for large-scale parallel data processing.

Observation: your favorite parallel algorithm looks like a Map phase, followed by a Shuffle, followed by a Reduce phase.

Typical Problems Solved by MR
- Read a lot of data.
- Map: extract something you care about from each record.
- Shuffle and sort.
- Reduce: aggregate, summarize, filter, transform.
- Write the results.
The outline stays the same; map and reduce change to fit the problem. (slide source: Jeff Dean)

Data Model
Files! A file = a bag of (key, value) pairs.
A MapReduce program:
- Input: a bag of (inputkey, value) pairs
- Output: a bag of (outputkey, value) pairs

Step 1: the MAP Phase
User provides the MAP function:
- Input: (input key, value)
- Output: bag of (intermediate key, value)
The system applies the map function in parallel to all (input key, value) pairs in the input file.

Step 2: the REDUCE Phase
User provides the REDUCE function:
- Input: (intermediate key, bag of values)
- Output: bag of output values
The system groups all pairs with the same intermediate key and passes the bag of values to the REDUCE function.

Example
Counting the number of occurrences of each word in a large collection of documents. For each document: the key = document id (did); the value = set of words (word).

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

MAP → shuffle → REDUCE:
(did1,v1) maps to (w1,1), (w2,1), (w3,1); (did2,v2) maps to (w1,1), (w2,1); and so on for (did3,v3), ...
The shuffle groups the pairs by word: (w1,(1,1,1,...,1)), (w2,(1,1,...)), (w3,(1,...)).
The reduce sums each group: (w1,25), (w2,77), (w3,12).
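The pseudocode above can be turned into a tiny runnable simulation of the three steps; a sketch in Python (document ids and contents are made up):

```python
from collections import defaultdict

def map_fn(did, contents):
    # emit an intermediate (word, 1) pair for every word in the document
    for w in contents.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # sum the counts collected for one word
    return (word, sum(counts))

docs = {"did1": "a rose is a rose", "did2": "a b"}

# MAP: apply map_fn to every (input key, value) pair in the input
intermediate = [kv for did, v in docs.items() for kv in map_fn(did, v)]

# SHUFFLE: group all pairs by the intermediate key
groups = defaultdict(list)
for w, c in intermediate:
    groups[w].append(c)

# REDUCE: apply reduce_fn to each (key, bag of values) group
output = dict(reduce_fn(w, cs) for w, cs in groups.items())
```

In a real system each step runs in parallel across workers; only the grouping discipline of the shuffle is essential.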

Jobs vs. Tasks
- A MapReduce job: one single query, e.g., count the words in all docs. More complex queries may consist of multiple jobs.
- A map task, or a reduce task: a group of instantiations of the map or reduce function, scheduled on a single worker.

Workers
- A worker is a process that executes one task at a time.
- Typically there is one worker per processor, hence 4 or 8 per node.

MAP Tasks vs. REDUCE Tasks: the same word-count diagram as before, with the map side grouped into map tasks and the reduce side into reduce tasks.

MapReduce Execution Details
- Map tasks read their input from the file system (GFS or HDFS); the data is not necessarily local.
- Intermediate data goes to local disk.
- After the shuffle, reduce tasks write their output to disk, replicated in the cluster.

MR Phases
Each Map and Reduce task has multiple phases, with intermediate data staged in local storage.

Example: CloudBurst
[Figure: per-slot timeline of the Map, Shuffle, Sort, and Reduce phases. CloudBurst on the Lake Washington dataset (1.1 GB), with 80 mappers and 80 reducers.]

Implementation
- There is one master node.
- The master partitions the input file into M splits, by key.
- The master assigns workers (= servers) to the M map tasks and keeps track of their progress.
- Map workers write their output to local disk, partitioned into R regions.
- The master assigns workers to the R reduce tasks.
- Reduce workers read the regions from the map workers' local disks.

Interesting Implementation Details
Worker failure: the master pings workers periodically; if one is down, it reassigns its task to another worker.

Interesting Implementation Details
Backup tasks: a straggler is a machine that takes an unusually long time to complete one of the last tasks, e.g., because a bad disk forces frequent correctable errors (30 MB/s → 1 MB/s), or because the cluster scheduler has scheduled other tasks on that machine. Stragglers are a main reason for slowdown. Solution: pre-emptive backup execution of the last few remaining in-progress tasks.

MapReduce Summary
- Hides scheduling and parallelization details.
- However, very limited queries: difficult to write more complex queries; need multiple MapReduce jobs.
- Solution: declarative query languages.

Declarative Languages on MR
- Pig Latin (Yahoo!): new language, like relational algebra; open source.
- HiveQL (Facebook): SQL-like language; open source.
- SQL / Tenzing (Google): SQL on MR; proprietary.

Parallel DBMS vs MapReduce
Parallel DBMS:
- Relational data model and schema.
- Declarative query language: SQL.
- Many pre-defined operators (relational algebra); can easily combine operators into complex queries.
- Query optimization, indexing, and physical tuning.
- Streams data from one operator to the next without blocking.
- Can do more than just run queries: data management, updates and transactions, constraints, security, etc.

Parallel DBMS vs MapReduce
MapReduce:
- Data model is a file with key-value pairs! No need to load data before processing it.
- Easy to write user-defined operators.
- Can easily add nodes to the cluster (no need to even restart).
- Uses less memory, since it processes one key group at a time.
- Intra-query fault tolerance thanks to results on disk; intermediate results on disk also facilitate scheduling.
- Handles adverse conditions, e.g., stragglers.
- Arguably more scalable, but also needs more nodes!

Review: Parallel DBMS
[Figure: parallel execution of a SQL query. From: Greenplum Database whitepaper.]