CS 4320/5320 Homework 2
Fall 2017
Due on Friday, 20th of October 2017 at 11:59 pm

This assignment is out of 75 points and accounts for 10% of your overall grade. All answers for this homework should be contained in a single PDF file submitted on CMS.

Selecting Indices (25 Points)

In this exercise, we ask you to compare the execution cost of different access methods and to identify the optimal one. Consider the schema created by the following statement:

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        artist VARCHAR(100) NOT NULL,
        genre VARCHAR(50),
        release_date DATE);

The Songs table contains information on songs such as their name, the artist who recorded the song, and the release date. We assume that the following indices are available on that table:

1. An unclustered hash index on column artist.
2. A clustered B+ tree index on column genre.
3. An unclustered B+ tree index on columns (genre, release_date).

The following information is available in the DB catalog:

- Table Songs consumes 20,000 disk pages.
- Table Songs contains 12,000,000 tuples.
- The number of distinct values in column name is 12,000,000.
- The number of distinct values in column artist is 60,000.
- The number of distinct values in column genre is 20.
- The column release_date tracks only the year and month and covers the years 2012 to 2015 (4 years, range [2012, 2015]). From this we know that the number of distinct values in column release_date is 48.

We additionally make the following assumptions to simplify solutions:

1. Data assumptions
   - Uniform distribution of tuples over their respective value domains.
   - Independence between different predicates, i.e., the probability that a tuple satisfies one predicate is independent of the probability that it satisfies another predicate. Due to this assumption, given a first predicate $p_1$ with selectivity $s_1$ and a second predicate $p_2$ with selectivity $s_2$, the selectivity of $p_1 \wedge p_2$ is $s_1 \cdot s_2$.
   - Assume key boundaries are aligned with page boundaries.
2. Index assumptions
   - Each hash index requires on average 1.2 page I/Os to find the right index.
   - Each tree index requires on average 4 page I/Os to retrieve a leaf page.
   - 1000 index entries fit on a page.

In the following, we give a succinct natural language description of several SQL queries. We ask you to point out all possibilities for using the existing indices to answer those queries (keep in mind that multiple indices might be used together to answer a query!) and to identify the optimal one. To justify that you have indeed selected the optimal access method, calculate cost estimates for reasonable alternatives.

(a) (5 points) Retrieve songs in the genre Folk.
(b) (5 points) Retrieve songs released in March, April, or May 2014.
(c) (5 points) Retrieve songs in the genre Pop Punk released in March, April, or May 2014.
(d) (5 points) Retrieve songs released in 2015 that are recorded by artists in a list of 100 artists.
(e) (5 points) Retrieve the songs released by the artist Shaggy in the genre Reggae Fusion.

Estimating Execution Cost (25 Points)

We ask you to estimate the execution cost for the query plan depicted in Figure 1. This query plan joins three tables created by the following commands.

    CREATE TABLE Artists (
        artist_id INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL);

    CREATE TABLE Albums (
        album_id INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL,
        genre VARCHAR(50),
        release_date DATE,
        FOREIGN KEY (artist_id) REFERENCES Artists(artist_id));

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        album_id INTEGER NOT NULL,
        name VARCHAR(50) NOT NULL,
        FOREIGN KEY (album_id) REFERENCES Albums(album_id));

Figure 1: Query plan for joining three tables from the music database

The following information about the relations is available in the database catalog:

- We store 1,000 artists. The artist_id field uses 10 bytes per tuple and the artist name uses 50 bytes per tuple.
- We store 100,000 albums. The album_id field uses 10 bytes per tuple, artist_id uses 10 bytes per tuple, name uses 100 bytes per tuple, genre uses 50 bytes per tuple, and release_date uses 30 bytes per tuple.
- We store 2,000,000 songs. The song_id field uses 10 bytes per tuple, album_id uses 10 bytes per tuple, and the song name uses 50 bytes per tuple.

Assume that we have 100 buffer pages and that each buffer page stores 1000 bytes. We again make the uniformity and independence assumptions described in the text accompanying Question 1.

To calculate the execution cost for the plan in Figure 1, we assume that no indices are available and that no data is initially stored in main memory. We assume that all operations which can be executed on-the-fly are indeed executed on-the-fly to minimize run time. We also assume that the buffer is optimally exploited for each join (in particular, the part of the buffer that holds the outer relation is as large as possible).

In your calculations, break down the execution cost of the entire plan and justify, for each operation, the cost it incurs. You may assume that the name Frightened Rabbit belongs to a single artist_id.

Proposing a Good Query Plan (25 Points)

In this exercise, we ask you to propose a good query plan for a given query and to thoroughly justify your decisions. Consider the database schema created by the following commands:

    CREATE TABLE Artists (
        artist_id INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL);

    CREATE TABLE Albums (
        album_id INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL,
        genre VARCHAR(50),
        release_date DATE,
        FOREIGN KEY (artist_id) REFERENCES Artists(artist_id));

    CREATE TABLE Songs (
        song_id INTEGER PRIMARY KEY,
        album_id INTEGER NOT NULL,
        name VARCHAR(50) NOT NULL,
        FOREIGN KEY (album_id) REFERENCES Albums(album_id));

The database has the following indices:

- A clustered B+ tree index on the Albums table over (genre, release_date).

The database catalog contains the following information on the stored data:

- The Artists table contains 100,000 artists.
- The Albums table contains 2,000,000 albums.
- The Songs table contains 10,000,000 songs.

To simplify calculations, we assume that each integer column takes 10 bytes of storage per tuple, and that each VARCHAR column takes N bytes, where N is the declared length of the string (e.g., VARCHAR(100) takes 100 bytes). We make the same uniformity and independence assumptions as in Questions 1 and 2.

Consider the following query on the aforementioned database schema, which retrieves all the songs in albums by the artist Bon Iver that were released after July 2012 and are not in the Experimental genre.

    SELECT AR.name, AL.name, S.name
    FROM Artists AR, Albums AL, Songs S
    WHERE AR.artist_id = AL.artist_id
      AND AR.name = 'Bon Iver'
      AND AL.album_id = S.album_id
      AND AL.genre != 'Experimental'
      AND AL.release_date > '2012-07'

Propose a good left-deep query plan for this example query. Draw the intended query plan. Drawing programs such as Inkscape will be a big help. You may also handwrite and scan your drawing, but it must be legible to receive credit. Use cost estimates to justify join orders, join operations, indexes used or not used, and the point at which predicates and projections are applied.
Finally, calculate the cost of the query plan you propose by breaking down cost into components that are associated with the different operations.
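As an illustration of what such a per-operation breakdown can look like, here is a minimal Python sketch for a single block nested loops join. It is not a solution to this exercise: the relation sizes in the usage note are hypothetical, and the helper names (`relation_pages`, `block_nested_loop_io`) are made up for this sketch. It assumes that tuples are packed densely into pages and that, out of B buffer pages, one is reserved for the inner relation and one for output, leaving B - 2 pages for the outer relation. Check both assumptions against the conventions used in lecture.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def relation_pages(num_tuples, tuple_bytes, page_bytes):
    """Pages occupied by a relation.

    Assumption: tuples are packed densely, so only the total byte count matters.
    """
    return ceil_div(num_tuples * tuple_bytes, page_bytes)


def block_nested_loop_io(outer_pages, inner_pages, buffer_pages):
    """Page I/Os of a block nested loops join, broken down by component.

    Assumption: buffer_pages - 2 pages hold the current block of the outer
    relation (one page is kept for the inner relation, one for output).
    """
    outer_block = buffer_pages - 2
    passes = ceil_div(outer_pages, outer_block)  # one scan of the inner per outer block
    return {
        "read_outer": outer_pages,
        "read_inner": passes * inner_pages,
    }


# Hypothetical sizes, not taken from this assignment.
outer = relation_pages(num_tuples=5_000, tuple_bytes=80, page_bytes=1_000)
inner = relation_pages(num_tuples=40_000, tuple_bytes=50, page_bytes=1_000)
breakdown = block_nested_loop_io(outer, inner, buffer_pages=100)
total_io = sum(breakdown.values())
```

With these hypothetical inputs the outer relation occupies 400 pages and the inner 2,000 pages, so the inner relation is scanned 5 times and the total is 400 + 10,000 = 10,400 page I/Os. Keeping the components in a dictionary makes it easy to state the cost of each operation separately before summing.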
Additionally, we introduce one pruning heuristic (see the lecture 15 slides) to avoid Cartesian products where possible: we do not combine a partial plan with a relation unless there is a nontrivial join condition between them (i.e., we eliminate all join orders that join Artists directly with Songs, as there is no connecting predicate). You may make the same index assumptions as in Question 1.
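The effect of the pruning heuristic can be sketched in a few lines of Python. The snippet below enumerates all left-deep join orders over the three relations and keeps only those in which every newly added relation shares a join predicate with a relation already in the partial plan. The join predicates are read off the WHERE clause of the query; the function names are hypothetical and chosen only for this sketch. It enumerates candidate orders and does not attach costs to them.

```python
from itertools import permutations

RELATIONS = ["Artists", "Albums", "Songs"]

# Join predicates from the query:
#   AR.artist_id = AL.artist_id  and  AL.album_id = S.album_id
JOIN_PREDICATES = {
    frozenset({"Artists", "Albums"}),
    frozenset({"Albums", "Songs"}),
}


def has_join_condition(partial_plan, candidate):
    """True if the candidate joins with some relation already in the partial plan."""
    return any(frozenset({rel, candidate}) in JOIN_PREDICATES for rel in partial_plan)


def left_deep_orders(relations):
    """All left-deep join orders that never introduce a Cartesian product."""
    valid = []
    for order in permutations(relations):
        partial_plan = [order[0]]
        for candidate in order[1:]:
            if not has_join_condition(partial_plan, candidate):
                break  # pruned: no connecting predicate
            partial_plan.append(candidate)
        else:
            valid.append(order)
    return valid


orders = left_deep_orders(RELATIONS)
for order in orders:
    print(" JOIN ".join(order))
```

Of the six possible orders, the two that place Artists and Songs next to each other at the start of the plan are pruned, which leaves four candidate orders to compare by cost.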