Homework 4: Query Processing, Query Optimization (due March 21 st, 2016, 4:00pm, in class hard-copy please)

Similar documents
Homework 2: Query Processing/Optimization, Transactions/Recovery (due February 16th, 2017, 9:30am, in class hard-copy please)

Homework 5: Miscellanea (due April 26 th, 2013, 9:05am, in class hard-copy please)

Project Assignment 2 (due April 6 th, 2016, 4:00pm, in class hard-copy please)

Homework 1: Relational Algebra and SQL (due February 10 th, 2016, 4:00pm, in class hard-copy please)

Project Assignment 2 (due April 6 th, 2015, 4:00pm, in class hard-copy please)

Homework 2: E/R Models and More SQL (due February 17 th, 2016, 4:00pm, in class hard-copy please)

Reminders - IMPORTANT:

Homework 6: FDs, NFs and XML (due April 13 th, 2016, 4:00pm, hard-copy in-class please)

Homework 1: RA, SQL and B+-Trees (due September 24 th, 2014, 2:30pm, in class hard-copy please)

Homework 1: RA, SQL and B+-Trees (due Feb 7, 2017, 9:30am, in class hard-copy please)

Homework 6: FDs, NFs and XML (due April 15 th, 2015, 4:00pm, hard-copy in-class please)

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Homework 7: Transactions, Logging and Recovery (due April 22nd, 2015, 4:00pm, in class hard-copy please)

CS 4320/5320 Homework 2

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Database Applications (15-415)

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Carnegie Mellon University in Qatar Spring Problem Set 4. Out: April 01, 2018

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

CompSci 516 Data Intensive Computing Systems

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

More about Databases. Topics. More Taxonomy. And more Taxonomy. Computer Literacy 1 Lecture 18 30/10/2008

CS 564 Final Exam Fall 2015 Answers

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

Query Processing and Advanced Queries. Query Optimization (4)

Database Systems CSE 414

CSE 190D Spring 2017 Final Exam

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Hash-Based Indexing 165

Principles of Data Management. Lecture #9 (Query Processing Overview)

Overview of Implementing Relational Operators and Query Evaluation

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

CMU - SCS / Database Applications Spring 2013, C. Faloutsos Homework 1: E.R. + Formal Q.L. Deadline: 1:30pm on Tuesday, 2/5/2013

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Modern Database Systems CS-E4610

Module 9: Selectivity Estimation

INFO 1103 Homework Project 2

Relational Query Optimization

University of California, Berkeley. CS 186 Introduction to Databases, Spring 2014, Prof. Dan Olteanu MIDTERM

Database Applications (15-415)

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] cs186-

Database Systems CSE 414

Parser: SQL parse tree

CS54100: Database Systems

CSE 444: Database Internals. Lecture 11 Query Optimization (part 2)

CSE 444 Homework 1 Relational Algebra, Heap Files, and Buffer Manager. Name: Question Points Score Total: 50

Midterm 1: CS186, Spring 2015

Midterm 2: CS186, Spring 2015

Database Applications (15-415)

Database Systems. Project 2

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2015 Quiz I

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

EECS 647: Introduction to Database Systems

CS4604 Prakash Spring 2016! Project 3, HTML and PHP. By Sorour Amiri and Shamimul Hasan April 20 th, 2016

Evaluation of Relational Operations: Other Techniques

CS 186 Midterm, Spring 2003 Page 1

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Query Evaluation Overview, cont.

Chapters 15 and 16b: Query Optimization

Carnegie Mellon University Department of Computer Science /615 - Database Applications C. Faloutsos & A. Pavlo, Spring 2014.

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

CS 461: Database Systems. Final Review. Julia Stoyanovich

Implementation of Relational Operations

University of Waterloo Midterm Examination Sample Solution

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

CSE 544, Winter 2009, Final Examination 11 March 2009

Evaluation of Relational Operations

Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

Query Processing & Optimization. CS 377: Database Systems

Introduction to Database Systems CSE 344

CSC 261/461 Database Systems Lecture 19

Review The Big Picture

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

CSE 190D Spring 2017 Final Exam Answers

Homework 6. Question Points Score Query Optimization 20 Functional Dependencies 20 Decompositions 30 Normal Forms 30 Total: 100

Query Evaluation Overview, cont.

Final Exam CSE232, Spring 97

Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

CPSC 421 Database Management Systems. Lecture 11: Storage and File Organization

Modern Database Systems Lecture 1

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

University of California, Berkeley. (2 points for each row; 1 point given if part of the change in the row was correct)

Query Processing and Query Optimization. Prof Monika Shah

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

CS330. Query Processing

Databases IIB: DBMS-Implementation Exercise Sheet 13

Outline. Database Tuning. Join Strategies Running Example. Outline. Index Tuning. Nikolaus Augsten. Unit 6 WS 2014/2015

Transcription:

Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Homework 4: Query Processing, Query Optimization (due March 21 st, 2016, 4:00pm, in class hard-copy please) Reminders: a. Out of 100 points. Contains 6 pages. b. Rough time-estimates: 6~8 hours. c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader. Only drawings may be hand-drawn, as long as they are neat and legible. d. There could be more than one correct answer. We shall accept them all. e. Whenever you are making an assumption, please state it clearly. f. Lead TA for this HW: Sorour Amiri. g. If you have any problems with the server, contact TAs (Sorour Amiri/Shamimul Hasan). Q1. Relational Operators [25 points] Consider the natural join R S of relations R(a, b, c) and S(c, d, e) given the following: Relation R contains NR = 60,000 tuples with 15 tuples per page. Relation S has NS = 5,000 tuples with 100 tuples per page. B = 201 buffer pages are available. Assume that both relations are stored as simple heap files and that neither relation has any indexes built on it. Q1.1. (5 points) Which is the smaller relation i.e. which is the one with the fewer number of pages? Write down the number of pages in the smaller relation (call it M) and the number of pages in the larger relation (call it N). Q1.2. (20=5x4 points) Find the costs (in terms of disk accesses) of the following joins (make sure you also write the formulae you use for each one in terms of M, N, and B): A. Nested Loops Join (page 454 of textbook or check slides) B. Block nested loops Join (page 455 of textbook or check slides) C. Sort-Merge Join (page 461 of textbook or check slides) D. Hash Join (page 464 of textbook or check slides) Q2. Query Optimization (Shoe Store) [25 points] Consider the following schema of a shoe store database: Store(sid, location) 1

ShoeType(id, style, size, color) Inventory(type, sid, quantity) Table Store has the data of each shoe store in the database which contains a unique id and the location. The ShoeType table has the data of each possible type of shoe in the database. This table keeps a unique value as the id, the style, size and the color of the shoe. The Inventory table listed the number of shoe types in each store. Inventory.type is a foreign key to ShoeType.id and Inventory.sid is a foreign key to Store.sid. We are given the following information about the database: Store contains 500 records with 10 records per page. ShoeType contains 1000 records with 40 records per page. Inventory contains 50,000 records with 50 records per page. There are 100 values for Inventory.type. There are 50 value for Store.location There are 10 values for ShoeType.size (5,6,,14) There are 1000 values for Inventory.quantity Consider the following queries: Query 1: SELECT sid FROM Store where location= Blacksburg ; Query 2: SELECT S.sid, S.location, T.size FROM Store S, ShoeType T, Inventory I WHERE S.sid=I.sid AND I.type = T.id; Query 3: SELECT S.sid, S.location, T.size FROM Store S, ShoeType T, Inventory I WHERE S.sid=I.sid AND I.type = T.id AND T.size>7 AND I.quantity = 20; Q2.1. (5 points) Assuming uniform distribution of values, estimate size of the result for Query 1 (number of tuples). Q2.2. (5 points) Again assuming uniform distribution of values and column independence, estimate the number of tuples returned by Query 3. Also assume you are given that the size of Query 2 is 50,000 tuples. (Hint: Estimate the selectivity of T.size>7 AND I.quantity = 20) Q2.3. (5 points) Draw all possible unique left-deep join query trees for Query 2. Q2.4. (10 points) From your answer in Q2.3, consider only those trees where the smallest relation is the left most relation. Now, for the first join in each such tree 2

(i.e. the join at the bottom of the tree), what join algorithm would work best (i.e. cost the least)? Assume that you have 70 pages of buffer memory. There are no indexes; so indexed nested loop is not an option. Consider the Page Oriented Nested Join (PNJ), Block-NJ, Sort-Merge-J, and Hash-J. Make sure you show your work, and also write down the formula you use for each case (check slides). Q3. Query Optimization (Stats and Range Queries) [20 points] Consider the following three relations: cdtoc(id, release, discid, track_count, leadout_offset, track_offset) release(id, title, artist, year, comment_count, catalogue, source, barcode) track(id, release, title, artist, sequence) The above information was gathered from the MusicBrainz Database where various pieces of information about music are stored, from artists and their releases to works and their composers. The cdtoc table is the table of contents of each CD. In the release table we keep the information of any released album. The track table has the information of tracks inside the released albums. Note that we have slightly modified the dataset to make it more understandable. Assume there are no existing indexes on these tables and relations are stored as simple heap files. These three tables are stored in cs4604.cs.vt.edu server. This is the first time you will be accessing the PostgreSQL server, so refer to the guidelines here (account information etc.): http://courses.cs.vt.edu/~cs4604/spring16/project/postgresql.html Use the following commands to copy the tables to your private database. Note that you need to run these commands at the command prompt of cs4604.cs.vt.edu, NOT at the psql prompt (i.e. run these on the first prompt you get after ssh-ing to the server): pg_dump -U YOUR-PID -t cdtoc cs4604s16 psql -d YOUR-PID pg_dump -U YOUR-PID -t release cs4604s16 psql -d YOUR- PID pg_dump -U YOUR-PID -t track cs4604s16 psql -d YOUR-PID Sanity Check: run the following two statements and verify the output. select count(*) from cdtoc; // output count = 284301 select count(*) from release; // output count = 284301 select count(*) from track; // output count = 3409654 For this question, it may help to familiarize you with the pg_class and pg_stats tables, provided by PostgreSQL as part of their catalog. Please see the links below: http://www.postgresql.org/docs/8.4/static/view-pg-stats.html 3

http://www.postgresql.org/docs/8.4/static/catalog-pg-class.html Now answer the following questions: Q3.1. (6 points) Using a single SQL SELECT query on the pg_class table, for each relation cdtoc, release' and track to find (a) the number of rows, (b) the number of attributes and (c) if the relation has a primary key. Write down the SQL query you used and also paste the output. Q3.2. (8 points) We want to find the top five years in which most albums are added to the release table. There are two ways to do that: A. (4 points) Write a SQL query that uses the release table find the top 5 years in which most albums are added to the release table. Paste the output too when you run the query. B. (4 points) Now write a SQL query that uses only the pg_stats table instead to find the top 5 years in which albums are added to the release table. Again, paste the output. Hint 1: You may find the most_common_vals column in the table pg_stats useful. Hint 2: to get the first three elements of a list type (e.g. mylist ), you can use mylist[1:5], and to get the first three elements of anyarray type (e.g. myarry ), you can use (myarray::varchar::varchar[])[1:5] see this link: http://www.postgresql.org/docs/8.4/static/view-pg-stats.html Q3.3. (6 points) Consider the following query that retrieves all the albums which are added after 2010. Note: You do not need to run anything on the server for this part. Query 3: SELECT * FROM release WHERE year > 2010; A. (3 points) Recall there is no index on year attribute. Will the optimizer use an Index scan or a simple File Sequential Scan? Explain in only 1 line. B. (3 points) What would happen if someone creates a non-clustered B+-Tree index on year attribute? Will it help to speed up the retrieval? Again explain in only 1. 4

Q4. Query Optimization (Joins) [30 points] Again consider the same three relations and the database tables given in Q3 before. In this question, we want to explore the effect of creating indexes on join optimization. Again start with the assumption that there are no indexes on the tables. It will help to familiarize yourself with the EXPLAIN and ANALYZE commands of PostgreSQL. Please visit the links given at the end for the same and study how to run it on a given SQL query and what output these commands return. Q4.1. (5 points) Consider the same Query 3 from Q3.3 in the previous page. Run the explain analyze command sequence on it and answer the following questions. A. (1 point) Copy-paste the output of using EXPLAIN ANALYZE on Query 3. B. (2 points) What is the estimated result cardinality of Query 3? C. (2 points) What is the number of rows Query 3 actually returns? Note: No need to paste the actual output rows of Query 3 since it will be too large. We just want the output of running EXPLAIN ANALYZE. Q4.2 (10 points) Consider the following query Query 4: SELECT release.title, track.artist FROM release,track, cdtoc WHERE release.title = track.title AND release.year >= 2011 AND release.catalogue = 5 AND release.comment_count = 10; The catalogue is a number that can often be found on the spine or near the barcode. The comment_count records the number of comments that are on a released album in the database. Run the explain analyze command sequence on Query 4 and get the query plan returned by the optimizer to answer the following questions. A. (2 points) Explain what does Query 4 do? B. (1 points) Copy-paste the output of using EXPLAIN ANALYZE on Query 4. C. (5 points) Draw the query plan returned using tree and relational algebra notations as given in lecture slides. 5

D. (2 points) Report the estimated result cardinality, actual result cardinality (i.e. the true number of rows returned) and the total runtime of executing Query 4. Note: Again, we do not want the actual output rows of Query 4 itself since it will be too large. Q4.3 (5 points) For the Query 4, which attributes for the three tables you think should have index? Create index on these three tables for the proper attributes. Write the SQL queries you used to create these indexes (give these indexes any names you want). Q4.4 (10 points) Update the statistics using the VACCUM and ANALYZE commands. Now run the EXPLAIN ANALYZE command again on the Query 4 given in Q4.2 and get the new query plan returned by the optimizer to answer the following questions. A. (2 points) Copy-paste the output of using EXPLAIN ANALYZE on Query 4. B. (6 points) Draw the query plan returned using tree and relational algebra notations as given in lecture slides. C. (2 points) Report the actual runtime of Query 4 now, and compare it with your previous answer in Q4.2(C). Was it worth constructing the indexes? Hints: 1. Check the statistics collected by PostgreSQL: http://www.postgresql.org/docs/8.4/static/planner-stats.html 2. How to use EXPLAIN command and understand its output: http://www.postgresql.org/docs/8.4/static/sql-explain.html http://www.postgresql.org/docs/8.4/static/performance-tips.html 6