Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation


CSE 414 - Spring 2018

Announcements
- HW5 + WQ5 due tomorrow
- Midterm this Friday in class!
  - Review session this Wednesday evening
  - See course website
- HW6 will be released later this week
  - Due on Friday 5/11
- No WQ6 (yet)!

Class Overview
- Unit 1: Intro
- Unit 2: Relational Data Models and Query Languages
- Unit 3: Non-relational data
- Unit 4: RDBMS internals and parallel query processing
- Unit 5: DBMS usability, conceptual design
- Unit 6: Transactions
- Unit 7: Advanced topics

From Logical RA Plans to Physical Plans

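Query Evaluation Steps Review
[Figure: pipeline from the SQL query down to the disk]
1. Parse Query
2. Generate Logical Plan, which produces a logical plan (RA)
3. Select Physical Plan, which produces a physical plan
4. Query Execution, which reads from disk
Steps 2 and 3 together make up query optimization.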

Logical vs Physical Plans
- Logical plans:
  - Created by the parser from the input SQL text
  - Expressed as a relational algebra tree
  - Each SQL query has many possible logical plans
- Physical plans:
  - Goal is to choose an efficient implementation for each operator in the RA tree
  - Each logical plan has many possible physical plans

Review: Relational Algebra
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

SELECT sname
FROM Supplier x, Supply y
WHERE x.sid = y.sid and y.pno = 2 and x.scity = 'Seattle' and x.sstate = 'WA'

π sname (σ scity='Seattle' and sstate='WA' and pno=2 (Supplier ⋈ sid=sid Supply))

The relational algebra expression is also called the logical query plan.

Physical Query Plan 1
Same schema and query as in the relational algebra review.
A physical query plan is a logical query plan annotated with physical implementation details:

π sname (on the fly)
  σ scity='Seattle' and sstate='WA' and pno=2 (on the fly)
    ⋈ sid = sid (nested loop)
      Supplier (file scan)
      Supply (file scan)

Physical Query Plan 2
Same schema and query as in the relational algebra review.
Same logical query plan, different physical plan:

π sname (on the fly)
  σ scity='Seattle' and sstate='WA' and pno=2 (on the fly)
    ⋈ sid = sid (hash join)
      Supplier (file scan)
      Supply (file scan)
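To make the difference between the two physical plans concrete, here is a minimal Python sketch, not how a real DBMS implements these operators. The sample rows and the function names (`keep`, `plan_nested_loop`, `plan_hash_join`) are made up for illustration. Both functions evaluate the same logical plan; only the join algorithm differs.

```python
from collections import defaultdict

# Toy instances of Supplier(sid, sname, scity, sstate) and
# Supply(sid, pno, quantity). The rows are invented for illustration.
supplier = [
    (1, "Acme", "Seattle", "WA"),
    (2, "Bolt", "Portland", "OR"),
    (3, "Cargo", "Seattle", "WA"),
]
supply = [(1, 2, 10), (2, 2, 5), (3, 7, 1), (3, 2, 4)]


def keep(s, y):
    """Selection applied on the fly to a joined (Supplier, Supply) pair."""
    return s[2] == "Seattle" and s[3] == "WA" and y[1] == 2


def plan_nested_loop(supplier, supply):
    """Plan 1: nested-loop join, then on-the-fly selection and projection."""
    out = []
    for s in supplier:            # file scan of Supplier
        for y in supply:          # file scan of Supply, once per Supplier tuple
            if s[0] == y[0] and keep(s, y):
                out.append(s[1])  # project sname
    return out


def plan_hash_join(supplier, supply):
    """Plan 2: hash join on sid, then on-the-fly selection and projection."""
    by_sid = defaultdict(list)
    for s in supplier:            # build phase: hash Supplier on sid
        by_sid[s[0]].append(s)
    out = []
    for y in supply:              # probe phase: look up each Supply tuple
        for s in by_sid.get(y[0], []):
            if keep(s, y):
                out.append(s[1])
    return out


print(sorted(plan_nested_loop(supplier, supply)))  # ['Acme', 'Cargo']
print(sorted(plan_hash_join(supplier, supply)))    # ['Acme', 'Cargo']
```

The two plans return the same answer. The nested loop compares every pair of tuples, while the hash join touches each tuple once plus its matches.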

Physical Query Plan 3
Same schema and query as in the relational algebra review.
Different but equivalent logical query plan; different physical plan:

(d) π sname (on the fly)
  (c) ⋈ sid = sid (sort-merge join)
    (a) σ scity='Seattle' and sstate='WA' (scan & write to T1)
      Supplier (file scan)
    (b) σ pno=2 (scan & write to T2)
      Supply (file scan)

Query Optimization Problem
- For each SQL query: many logical plans
- For each logical plan: many physical plans
- Choosing the best one among them is the goal of query optimization
- More on this later in the quarter

Distributed query processing

Why compute in parallel?
- Multi-cores:
  - Most processors have multiple cores
  - This trend will likely increase in the future
- Big data: too large to fit in main memory
  - Distributed query processing on 100x-1000x servers
  - Widely available now using cloud services
  - Recall HW3 and motivation for NoSQL!

Performance Metrics for Parallel DBMSs
Nodes = processors, computers
- Speedup: more nodes, same data → higher speed
- Scaleup: more nodes, more data → same speed

Linear vs. Non-linear Speedup
[Figure: speedup plotted against the number of nodes P (1, 5, 10, 15), with the ideal linear curve marked]

Linear vs. Non-linear Scaleup
[Figure: batch scaleup plotted against the number of nodes P (1, 5, 10, 15) AND data size, with the ideal curve marked]

Why Sub-linear Speedup and Scaleup?
- Startup cost: cost of starting an operation on many nodes
- Interference: contention for resources between nodes
- Skew: slowest node becomes the bottleneck

Approaches to Parallel Query Evaluation
[Figure: join plans over Customer, Product and Purchase (joins on cid=cid and pid=pid), placed on nodes in three different ways]
- Inter-query parallelism: one query per node; good for transactional (OLTP) workloads
- Inter-operator parallelism: operator per node; good for analytical (OLAP) workloads
- Intra-operator parallelism: operator on multiple nodes; good for both?
We study only intra-operator parallelism: most scalable.

Parallel Data Processing in the 20th Century

Let's parallelize RDBMS
- Data is horizontally partitioned on many servers
- Operators may require data reshuffling
- First let's discuss how to distribute data across multiple nodes / servers

Horizontal Data Partitioning
[Figure: a table with attributes K, A, B is split by rows across servers 1, 2, ..., P; each server holds a fragment with the same attributes K, A, B]
Which tuples go to what server?

Recall: Horizontal Data Partitioning
- Block partition: partition tuples arbitrarily such that $size(R_1) \approx \dots \approx size(R_P)$
- Hash partitioned on attribute A: tuple t goes to chunk i, where $i = h(t.A) \bmod P + 1$
  - Recall: calling hash functions is free in this class
- Range partitioned on attribute A: partition the range of A into $-\infty = v_0 < v_1 < \dots < v_P = \infty$
  - Tuple t goes to chunk i if $v_{i-1} < t.A \le v_i$
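The three partitioning schemes can be sketched in a few lines of Python. This is an illustrative sketch under some assumptions: tuples are Python tuples, `attr` is a column index, chunks are stored 0-indexed (chunk i of the slide is list index i - 1), Python's built-in `hash` stands in for h, and a value equal to a range boundary is assigned to the lower chunk. The function names are hypothetical.

```python
import bisect


def block_partition(tuples, p):
    """Deal tuples round-robin so that chunk sizes differ by at most one."""
    chunks = [[] for _ in range(p)]
    for n, t in enumerate(tuples):
        chunks[n % p].append(t)
    return chunks


def hash_partition(tuples, p, attr, h=hash):
    """Tuple t goes to chunk h(t[attr]) mod p (0-indexed)."""
    chunks = [[] for _ in range(p)]
    for t in tuples:
        chunks[h(t[attr]) % p].append(t)
    return chunks


def range_partition(tuples, attr, boundaries):
    """boundaries = [v_1, ..., v_{P-1}], sorted; v_0 = -inf and v_P = +inf
    are implicit. Tuple t goes to the chunk with v_{i-1} < t[attr] <= v_i."""
    chunks = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        # bisect_left finds the first boundary >= t[attr]
        chunks[bisect.bisect_left(boundaries, t[attr])].append(t)
    return chunks


rows = [(k, k % 3) for k in range(10)]          # toy tuples (K, A)
print([len(c) for c in block_partition(rows, 3)])           # [4, 3, 3]
print([len(c) for c in hash_partition(rows, 3, 0)])         # [4, 3, 3]
print([len(c) for c in range_partition(rows, 0, [3, 6])])   # [4, 3, 3]
```

Round-robin is only one way to realize "arbitrary" block partitioning; any assignment with roughly equal chunk sizes would do.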

Uniform Data vs. Skewed Data
Let R(K,A,B,C); which of the following partition methods may result in skewed partitions?
- Block partition: uniform
- Hash-partition on the key K: uniform, assuming a good hash function
- Hash-partition on the attribute A: may be skewed. E.g., when all records have the same value of the attribute A, then all records end up in the same partition
Keep this in mind in the next few slides.

Parallel Execution of RA Operators: Grouping
Data: R(K,A,B,C)
Query: γ A,sum(C) (R)
How to compute group by if:
- R is hash-partitioned on A?
- R is block-partitioned?
- R is hash-partitioned on K?

Parallel Execution of RA Operators: Grouping
Data: R(K,A,B,C)
Query: γ A,sum(C) (R)
R is block-partitioned or hash-partitioned on K:
1. Start from the partitions R_1, R_2, ..., R_P
2. Reshuffle R on attribute A, giving new partitions R_1', R_2', ..., R_P'
3. Run grouping on the reshuffled partitions
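The reshuffle-then-group strategy can be simulated on one machine, with each "server" represented by a Python list. This is a sketch under assumptions: tuples of R(K,A,B,C) are Python tuples, Python's built-in `hash` stands in for the hash function, and the function names are made up.

```python
from collections import defaultdict


def reshuffle(partitions, attr, p):
    """Send every tuple to server hash(t[attr]) mod p."""
    out = [[] for _ in range(p)]
    for part in partitions:
        for t in part:
            out[hash(t[attr]) % p].append(t)
    return out


def local_group_sum(partition, group_attr, sum_attr):
    """Ordinary single-node group-by with sum."""
    sums = defaultdict(int)
    for t in partition:
        sums[t[group_attr]] += t[sum_attr]
    return dict(sums)


def parallel_group_sum(partitions, group_attr, sum_attr):
    """Reshuffle on the grouping attribute, then group locally on each server."""
    shuffled = reshuffle(partitions, group_attr, len(partitions))
    result = {}
    for part in shuffled:
        # After the reshuffle each group lives on exactly one server,
        # so the local results never overlap.
        result.update(local_group_sum(part, group_attr, sum_attr))
    return result


# R(K, A, B, C) block-partitioned over three servers (toy data)
R = [[(1, 10, 0, 5), (2, 20, 0, 1)], [(3, 10, 0, 7)], [(4, 30, 0, 2)]]
print(parallel_group_sum(R, group_attr=1, sum_attr=3))
```

If R is already hash-partitioned on A, the reshuffle step moves nothing and each server can group its own partition directly.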

Speedup and Scaleup
Consider:
- Query: γ A,sum(C) (R)
- Runtime: only consider I/O costs
If we double the number of nodes P, what is the new running time?
- Half (each server holds ½ as many chunks)
If we double both P and the size of R, what is the new running time?
- Same (each server holds the same # of chunks)
But only if the data is without skew!
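The arithmetic behind these two answers fits in a few lines. The sketch below assumes a deliberately simple cost model that is not from the slides in this exact form: runtime equals the number of blocks one server has to read, and blocks are spread perfectly evenly. The block count is a made-up number.

```python
def io_time(num_blocks, p):
    """Blocks read per server when num_blocks are spread evenly over p servers.
    Assumes no skew, no startup cost and no interference."""
    return num_blocks / p


blocks = 1200                       # hypothetical size of R in blocks
base = io_time(blocks, 4)           # baseline: P = 4

print(io_time(blocks, 8) / base)      # 0.5 -> double P: half the time
print(io_time(2 * blocks, 8) / base)  # 1.0 -> double P and R: same time
```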

Skewed Data
R(K,A,B,C)
- Informally: we say that the data is skewed if one server holds much more data than the average
- E.g., we hash-partition on A, and some value of A occurs very many times ("Justin Bieber")
- Then the server holding that value will be skewed
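A small experiment shows the effect. The data below is invented: 1000 tuples with unique integer keys K, of which 900 share the same value of A. Integers are used in place of names so that Python's built-in `hash`, which stands in for the hash function here, gives the same result on every run.

```python
from collections import Counter


def partition_sizes(values, p):
    """Number of tuples each of the p servers receives under hash partitioning."""
    sizes = Counter(hash(v) % p for v in values)
    return [sizes.get(i, 0) for i in range(p)]


keys = list(range(1000))                  # K: unique per tuple
a_values = [7] * 900 + list(range(100))   # A: the value 7 is very popular

print(partition_sizes(keys, 4))      # [250, 250, 250, 250]
print(partition_sizes(a_values, 4))  # [25, 25, 25, 925]
```

Partitioning on K is balanced, while partitioning on A sends every tuple with the popular value to one server, which then becomes the bottleneck.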

Parallel Execution of RA Operators: Partitioned Hash-Join
Data: R(K1, A, B), S(K2, B, C)
Query: R(K1, A, B) ⋈ S(K2, B, C)
1. Initially, both R and S are partitioned on K1 and K2: servers hold (R_1, S_1), (R_2, S_2), ..., (R_P, S_P)
2. Reshuffle R on R.B and S on S.B, giving new partitions (R_1', S_1'), (R_2', S_2'), ..., (R_P', S_P')
3. Each server computes the join locally

Parallel Join Illustration
Data: R(K1, A, B), S(K2, B, C)
Query: R(K1, A, B) ⋈ S(K2, B, C)

Partition (initial placement):
  M1: R1 (K1, B) = (1, 20), (2, 50)            S1 (K2, B) = (101, 50), (102, 50)
  M2: R2 (K1, B) = (3, 20), (4, 20)            S2 (K2, B) = (201, 20), (202, 50)

Shuffle on B, then local join:
  M1: R1 (K1, B) = (1, 20), (3, 20), (4, 20)   S1 (K2, B) = (201, 20)
  M2: R2 (K1, B) = (2, 50)                     S2 (K2, B) = (101, 50), (102, 50), (202, 50)
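The illustration can be replayed in Python. This is a sketch under assumptions: tuples are reduced to the (K1, B) and (K2, B) columns shown in the illustration, and `toy_hash` is a hypothetical hash function chosen so that B = 20 lands on M1 and B = 50 on M2, as in the figure. The function names are made up.

```python
from collections import defaultdict


def shuffle(partitions, attr, p, h):
    """Route every tuple to server h(t[attr]) mod p."""
    out = [[] for _ in range(p)]
    for part in partitions:
        for t in part:
            out[h(t[attr]) % p].append(t)
    return out


def local_hash_join(r_part, s_part, r_attr, s_attr):
    """Single-node hash join: build on R, probe with S."""
    table = defaultdict(list)
    for r in r_part:
        table[r[r_attr]].append(r)
    return [r + s for s in s_part for r in table.get(s[s_attr], [])]


def partitioned_hash_join(r_parts, s_parts, r_attr, s_attr, h=hash):
    """Shuffle both inputs on the join attribute, then join locally per server."""
    p = len(r_parts)
    r_sh = shuffle(r_parts, r_attr, p, h)
    s_sh = shuffle(s_parts, s_attr, p, h)
    return [local_hash_join(r, s, r_attr, s_attr) for r, s in zip(r_sh, s_sh)]


def toy_hash(b):
    """Hypothetical hash: sends B = 20 to server 0 (M1) and B = 50 to server 1 (M2)."""
    return b // 10


R = [[(1, 20), (2, 50)], [(3, 20), (4, 20)]]              # (K1, B) on M1, M2
S = [[(101, 50), (102, 50)], [(201, 20), (202, 50)]]      # (K2, B) on M1, M2

for name, rows in zip(["M1", "M2"], partitioned_hash_join(R, S, 1, 1, toy_hash)):
    print(name, sorted(rows))
```

Because matching tuples share the same value of B, they are routed to the same server, so no join result is lost by joining locally.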

Broadcast Join
Data: R(A, B), S(C, D)
Query: R(A,B) ⋈ B=C S(C,D)
[Figure: R is spread over the servers as R_1, R_2, ..., R_P (labelled "Reshuffle R on R.B"); S is broadcast, so the servers hold (R_1, S), (R_2, S), ..., (R_P, S)]
Why would you want to do this?
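A minimal sketch of the broadcast idea follows. It assumes that R's chunks stay where they are (any partitioning of R gives the same answer) and that S is small enough to copy to every server; the function name and the sample rows are made up.

```python
from collections import defaultdict


def broadcast_join(r_parts, s, r_attr, s_attr):
    """Each server joins its local chunk of R with a full copy of S."""
    results = []
    for r_part in r_parts:                 # one iteration = one server
        s_index = defaultdict(list)        # the server's own copy of S
        for t in s:
            s_index[t[s_attr]].append(t)
        results.append(
            [r + t for r in r_part for t in s_index.get(r[r_attr], [])]
        )
    return results


R = [[(1, 20)], [(2, 50), (3, 20)]]   # R(A, B) on two servers
S = [(20, "x"), (50, "y")]            # S(C, D), small
print(broadcast_join(R, S, r_attr=1, s_attr=0))
# [[(1, 20, 20, 'x')], [(2, 50, 50, 'y'), (3, 20, 20, 'x')]]
```

One common answer to the slide's question: when S is much smaller than R, copying S to every server can move far less data than reshuffling R.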

Putting it Together: Example Parallel Query Plan
Order(oid, item, date), Line(item, ...)
Find all orders from today, along with the items ordered:

SELECT *
FROM Order o, Line i
WHERE o.item = i.item AND o.date = today()

Plan:
join o.item = i.item
  select date = today()
    scan Order o
  scan Item i

Example Parallel Query Plan
Order(oid, item, date), Line(item, ...)
Order side of the join o.item = i.item. Each of Node 1, Node 2 and Node 3 runs:
1. scan Order o
2. select date = today()
3. hash h(o.item), which routes each surviving tuple to Node 1, Node 2 or Node 3

Example Parallel Query Plan
Order(oid, item, date), Line(item, ...)
Item side of the join o.item = i.item. Each of Node 1, Node 2 and Node 3 runs:
1. scan Item i
2. hash h(i.item), which routes each tuple to Node 1, Node 2 or Node 3

Example Parallel Query Plan
Order(oid, item, date), Line(item, ...)
Each node computes join o.item = i.item locally:
- Node 1 contains all orders and all lines where hash(item) = 1
- Node 2 contains all orders and all lines where hash(item) = 2
- Node 3 contains all orders and all lines where hash(item) = 3
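The whole plan can be simulated end to end, with the three nodes represented by Python lists. This is a sketch under assumptions: items are integers so that Python's built-in `hash` is deterministic, `today` is passed in rather than read from the clock, and the second column of Line (a price) is hypothetical because the slides elide Line's other attributes.

```python
from collections import defaultdict
from datetime import date


def run_plan(order_parts, line_parts, today, p=3):
    # Step 1 (every node): scan Order, keep today's orders, route by hash of item.
    orders_at = [[] for _ in range(p)]
    for part in order_parts:
        for o in part:                    # o = (oid, item, date)
            if o[2] == today:
                orders_at[hash(o[1]) % p].append(o)
    # Step 2 (every node): scan Line and route by the same hash of item.
    lines_at = [[] for _ in range(p)]
    for part in line_parts:
        for ln in part:                   # ln = (item, price), price is hypothetical
            lines_at[hash(ln[0]) % p].append(ln)
    # Step 3 (every node): join the local orders with the local lines on item.
    result = []
    for orders, lines in zip(orders_at, lines_at):
        by_item = defaultdict(list)
        for ln in lines:
            by_item[ln[0]].append(ln)
        result.extend(o + ln for o in orders for ln in by_item.get(o[1], []))
    return result


today = date(2018, 4, 30)
old = date(2018, 4, 1)
orders = [[(1, 7, today), (2, 8, old)], [(3, 9, today)], []]
lines = [[(7, 100)], [(8, 200)], [(9, 300)]]
print(sorted(run_plan(orders, lines, today)))
```

Filtering on the date before hashing means only today's orders are sent over the network, which is why the selection sits below the shuffle in the plan.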