CS 4320/5320 Homework 2, Fall 2017

Due on Friday, 20th of October 2017 at 11:59 pm

This assignment is out of 75 points and accounts for 10% of your overall grade. All answers for this homework should be contained in a single PDF file submitted on CMS.

Selecting Indices (25 Points)

In this exercise, we ask you to compare the execution cost of different access methods and to identify the optimal one. Consider the schema created by the following statement:

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        artist VARCHAR(100) NOT NULL,
        genre VARCHAR(50),
        release_date DATE);

The Songs table contains information on songs such as their name, the artist who recorded the song, and the release date. We assume that the following indices are available on that table:

1. An unclustered hash index on column artist.
2. A clustered B+ tree index on column genre.
3. An unclustered B+ tree index on columns (genre, release_date).

The following information is available in the DB catalog:

- Table Songs consumes 20,000 disk pages.
- Table Songs contains 12,000,000 tuples.
- The number of distinct values in column name is 12,000,000.
- The number of distinct values in column artist is 60,000.
- The number of distinct values in column genre is 20.
- The column release_date keeps track of only the year and month and covers the years 2012 to 2015 (4 years, range [2012, 2015]); from this we know that the number of distinct values in column release_date is 48.

We additionally make the following assumptions to simplify solutions:

1. Data assumptions
   - Uniform distribution of tuples over their respective value domains.
   - Independence between different predicates, i.e. the probability that a tuple satisfies one predicate is independent of the probability that it satisfies another predicate. Due to the latter assumption, given a first predicate $p_1$ with selectivity $s_1$ and a second predicate $p_2$ with selectivity $s_2$, the selectivity of $p_1 \wedge p_2$ is $s_1 \cdot s_2$.
   - Assume key boundaries are aligned with page boundaries.
2. Index assumptions
   - Each hash index requires on average 1.2 page I/Os to find the right index entry.
   - Each tree index requires on average 4 page I/Os to retrieve a leaf page.
   - 1000 index entries fit on a page.

Below, we give a succinct natural-language description of several SQL queries. We ask you to point out all possibilities for using the existing indices to answer those queries (keep in mind that multiple indices might be used together to answer a query!) and to identify the optimal one. To justify that you have indeed selected the optimal access method, calculate cost estimates for reasonable alternatives.

(a) (5 points) Retrieve songs in the genre Folk.
(b) (5 points) Retrieve songs released in March, April, or May 2014.
(c) (5 points) Retrieve songs in the genre Pop Punk released in March, April, or May 2014.
(d) (5 points) Retrieve songs released in 2015 that are recorded by artists in a list of 100 artists.
(e) (5 points) Retrieve the songs released by the artist Shaggy in the genre Reggae Fusion.

Estimating Execution Cost (25 points)

We ask you to estimate the execution cost for the query plan depicted in Figure 1. This query plan joins three tables created by the following commands:
    CREATE TABLE Artists (
        artist_id INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL);

    CREATE TABLE Albums (
        album_id INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL,
        genre VARCHAR(50),
        release_date DATE,
        FOREIGN KEY (artist_id) REFERENCES Artists(artist_id));

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        album_id INTEGER NOT NULL,
        name VARCHAR(50) NOT NULL,
        FOREIGN KEY (album_id) REFERENCES Albums(album_id));

Figure 1: Query plan for joining three tables from the music database

The following information about the relations is available in the database catalog:

- We store 1,000 artists. The artist id field uses 10 bytes per tuple and the artist name uses 50 bytes per tuple.
- We store 100,000 albums. The album id field uses 10 bytes per tuple, artist id uses 10 bytes per tuple, name uses 100 bytes per tuple, genre uses 50 bytes per tuple, and date uses 30 bytes per tuple.
- We store 2,000,000 songs. The song id field uses 10 bytes per tuple, the album id uses 10 bytes per tuple, and the song name uses 50 bytes per tuple.

Assume that we have 100 buffer pages and that each buffer page stores 1000 bytes. We again make the uniformity and independence assumptions described in the text accompanying Question 1.

To calculate the execution cost of the plan in Figure 1, we assume that no indices are available and that no data is initially stored in main memory. We assume that all operations which can be executed on the fly are indeed executed on the fly to minimize run time. We also assume that the buffer is optimally exploited for each join (in particular, as many buffer pages as possible are allocated to the outer relation). In your calculations, break down the execution cost of the entire plan and justify, for each operation, what cost it incurs. You may assume that "Frightened Rabbit" is unique to a single artist id.

Proposing a Good Query Plan (25 Points)

In this exercise, we ask you to propose a good query plan for a given query and to thoroughly justify your decisions. Consider the database schema created by the following commands:

    CREATE TABLE Artists (
        artist_id INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL);

    CREATE TABLE Albums (
        album_id INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL,
        genre VARCHAR(50),
        release_date DATE,
        FOREIGN KEY (artist_id) REFERENCES Artists(artist_id));

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        album_id INTEGER NOT NULL,
        name VARCHAR(50) NOT NULL,
        FOREIGN KEY (album_id) REFERENCES Albums(album_id));

The database has the following indices:

- A clustered B+ tree index on the Albums table over (genre, release_date).

The database catalog contains the following information on the stored data:

- The Artists table contains 100,000 artists.
- The Albums table contains 2,000,000 albums.
- The Songs table contains 10,000,000 songs.

To simplify calculations, we assume that each integer column takes 10 bytes of storage per tuple, and that each VARCHAR column takes N bytes, where N is the declared length of the string (i.e. VARCHAR(100) takes 100 bytes). We make the same uniformity and independence assumptions as in Questions 1 and 2.

Consider the following query on the aforementioned database schema, which retrieves all the songs in albums by the artist Bon Iver that were released after July 2012 and are not in the Experimental genre.

    SELECT AR.name, AL.name, S.name
    FROM Artists AR, Albums AL, Songs S
    WHERE AR.artist_id = AL.artist_id
      AND AR.name = 'Bon Iver'
      AND AL.album_id = S.album_id
      AND AL.genre != 'Experimental'
      AND AL.release_date > '2012-07'

Propose a good left-deep query plan for this example query. Draw out the intended query plan. Drawing programs such as Inkscape will be a big help. You may also handwrite and scan your plan, but it must be legible for credit. Use cost estimates to justify join orders, join operations, indexes used or not used, and the point at which predicates and projections are applied. Finally, calculate the cost of the query plan you propose by breaking down the cost into components that are associated with the different operations.
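One way to organize such a per-operation breakdown is sketched below. This is only an illustration under stated assumptions: it uses the common textbook page-I/O formula for a block nested loops join (read the outer relation once, and scan the inner relation once per block of the outer, with two buffer pages reserved for inner input and output), and the function names and page counts are hypothetical placeholders rather than values taken from this assignment.

```python
import math


def bnl_join_cost(outer_pages, inner_pages, buffer_pages):
    """Page I/Os of a block nested loops join (textbook formula, assumed here).

    Assumption: one buffer page is reserved for reading the inner relation
    and one for output, so an outer block holds buffer_pages - 2 pages.
    """
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages")
    block_size = buffer_pages - 2
    outer_blocks = math.ceil(outer_pages / block_size)
    # Read the outer once, plus one full inner scan per outer block.
    return outer_pages + outer_blocks * inner_pages


def plan_cost(components):
    """Total plan cost as the sum of the per-operation cost components."""
    return sum(components.values())


# Hypothetical page counts, chosen only to illustrate the breakdown.
components = {
    "block nested loops join (outer R, inner S)": bnl_join_cost(500, 2000, 100),
    "selection applied on the fly": 0,  # pipelined operators add no page I/O
}

for operation, cost in components.items():
    print(f"{operation}: {cost} page I/Os")
print("total:", plan_cost(components), "page I/Os")
```

With these placeholder numbers, the outer relation is split into 6 blocks of at most 98 pages, so the join term is 500 + 6 * 2000 = 12,500 page I/Os. Listing each operation with its own term makes it easy to see which component dominates and to justify each number separately.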

Additionally, we introduce one pruning heuristic (see the lecture 15 slides) to avoid Cartesian products where possible: we do not combine a partial plan with a relation unless there is a nontrivial join condition between them (i.e., we eliminate all join orders that join Artists directly with Songs, as there is no connecting predicate). You may make the same index assumptions as in Question 1.
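The effect of this pruning heuristic on the search space can be sketched as follows. The snippet is a minimal illustration, not the optimizer from the lecture: it treats a left-deep plan as an ordering of the three relations and keeps an ordering only if every newly added relation shares a join predicate with some relation already in the partial plan. The names `JOIN_EDGES`, `connected`, and `left_deep_orders` are hypothetical helpers; the join edges are read off the foreign keys in the schema.

```python
from itertools import permutations

# Join predicates in the example query: Artists-Albums via artist_id,
# Albums-Songs via album_id. There is no Artists-Songs predicate.
JOIN_EDGES = {
    frozenset({"Artists", "Albums"}),
    frozenset({"Albums", "Songs"}),
}


def connected(partial_plan, relation):
    """True if `relation` has a join predicate with a relation in the partial plan."""
    return any(frozenset({r, relation}) in JOIN_EDGES for r in partial_plan)


def left_deep_orders(relations):
    """Left-deep join orders that never require a Cartesian product."""
    orders = []
    for order in permutations(relations):
        # Each relation after the first must connect to the prefix before it.
        if all(connected(order[:i], order[i]) for i in range(1, len(order))):
            orders.append(order)
    return orders


for order in left_deep_orders(["Artists", "Albums", "Songs"]):
    print(" JOIN ".join(order))
```

Of the six possible orderings, the two that begin by pairing Artists with Songs are discarded, leaving four candidate join orders to cost.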