University of Waterloo Midterm Examination Solution

Similar documents
University of Waterloo Midterm Examination Sample Solution

Statistics. Duplicate Elimination

CMSC424: Database Design. Instructor: Amol Deshpande

Query Processing. Solutions to Practice Exercises Query:

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

CS 245 Midterm Exam Winter 2014

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Introduction to Database Systems CSE 444, Winter 2011

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Advanced Database Systems

Implementing Joins 1

Chapter 13: Query Processing

CSIT5300: Advanced Database Systems

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Hash-Based Indexing 165

University of California, Berkeley. (2 points for each row; 1 point given if part of the change in the row was correct)

Chapter 12: Query Processing

Database Applications (15-415)

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

Evaluation of Relational Operations: Other Techniques

TotalCost = 3 (1, , 000) = 6, 000

Chapter 12: Query Processing. Chapter 12: Query Processing

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Problem Set 2 Solutions

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

Query Processing & Optimization

Database System Concepts

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

Midterm Review. March 27, 2017

CSE 344 MAY 7 TH EXAM REVIEW

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Chapter 12: Query Processing

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Datenbanksysteme II: Implementing Joins. Ulf Leser

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

CS-245 Database System Principles

Chapter 12: Indexing and Hashing. Basic Concepts

Dtb Database Systems. Announcement

University of California, Berkeley. CS 186 Introduction to Databases, Spring 2014, Prof. Dan Olteanu MIDTERM

Overview of Implementing Relational Operators and Query Evaluation

Database Applications (15-415)

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

Database Systems CSE 414

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

Chapter 12: Indexing and Hashing

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

RELATIONAL OPERATORS #1

Implementation of Relational Operations

Chapter 11: Indexing and Hashing

Evaluation of Relational Operations: Other Techniques

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Evaluation of Relational Operations. Relational Operations

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

COSC-4411(M) Midterm #1

CSIT5300: Advanced Database Systems

Query Processing and Advanced Queries. Query Optimization (4)

Lecture 15: The Details of Joins

CSE 544 Principles of Database Management Systems

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

192 Chapter 14. TotalCost=3 (1, , 000) = 6, 000

Chapter 17: Parallel Databases

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

CSE 444 Final Exam. August 21, Question 1 / 15. Question 2 / 25. Question 3 / 25. Question 4 / 15. Question 5 / 20.

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Parser: SQL parse tree

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Architecture and Implementation of Database Systems (Summer 2018)

CS 186/286 Spring 2018 Midterm 1

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Architecture and Implementation of Database Systems Query Processing

CS 222/122C Fall 2017, Final Exam. Sample solutions

Storage hierarchy. Textbook: chapters 11, 12, and 13

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Query Processing & Optimization. CS 377: Database Systems

CS330. Query Processing

Principles of Database Management Systems

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Database Management System

Plan for today. Query Processing/Optimization. Parsing. A query s trip through the DBMS. Validation. Logical plan

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

EECS 647: Introduction to Database Systems

CS54100: Database Systems

IMPORTANT: Circle the last two letters of your class account:

Review. Support for data retrieval at the physical level:

Transcription:

University of Waterloo Midterm Examination Solution Winter, 2011 1. (6 total marks) The diagram below shows an extensible hash table with four hash buckets. Each number x in the buckets represents an entry for a record for which the hashed key value is x. For example, the number 20 in the first hash bucket represents a record for which the hashed key value is 20. The maximum number of records per bucket is four. directory 00 01 10 11 hash table 20,4,16 5,25,21,13 27,15,19 a. (2 marks) Give a sequence of two insertions that will cause exactly one new hash bucket to be added without changing the size of the directory. An example of an insertion is insert 9, which refers to insertion of a tuple with a hashed key value of 9. The two hashed key values in your sequence should be distinct, and should be distinct from the values already shown in the hash table. Both insertions should be into the bucket pointed to by directory entries 00 and 10, and at least one should have hash value 10. For example: insert 8 insert 10 Give a sequence of two insertions that will not cause the hash table s directory to grow and that will not cause any new hash buckets to be added. Start from the original hash table shown in the figure above, not from the hash table that would result from the inserts from part (a). The two hashed key values in your sequence should be distinct, and should be distinct from the values already shown in the hash table. One insertion must go to bucket 00 (or 10), and the other insertion must go to bucket 11. For example insert 8 insert 11 c. (2 marks) Give a sequence of two insertions that will cause the hash table s directory to double twice. If this is not possible, write NOT POSSIBLE. Start from the original hash table shown in the figure above. The two hashed key values in your sequence should be distinct, and should be distinct from the values already shown in the hash table. To cause the directory to double twice, both insertions must go into what will become bucket 101 after the directory doubles for the first time. In other words x mod 8 must equal 5. For example: insert 29 insert 37 CS448/648 1 of 6

2. (6 total marks) Suppose that a B+Tree index has been created for some attribute R.x of a relation R. a. (1 mark) Assume that the B+Tree is clustered and of Type I (index leaves contain tuples of R). Consier the two rightmost leaf blocks of the B+Tree. Is it necessarily true that the these two blocks are sequential in the file or disk that contains the B+Tree? Answer YES or NO. No. Again, assume that the B+Tree is clustered and of Type I. Suppose that x 1, x 2, and x 3 are values of R.x, and x 1 < x 2 < x 3. Suppose also that the tuples corresponding to x 1 and x 3 are located on the same page, p. Is it necessarily true that the tuple corresponding to x 2 is also located on page p? Answer YES or NO. Yes. c. (2 marks) Repeat question (b), but this time under the assumption that the index is unclustered and of Type II (index leaves contain tuple IDs). No. d. (1 mark) Which of these two types (Type I clustered, Type II unclustered) of indexes will allow the tuples of R to be retrieved in order of their R.x values? Answer Type I clustered, Type II unclustered, both, or neither. Both. 3. (12 total marks) For this question, consider a query that performs a join of tables R and S with the join condition R.a = S.b. R.a is the primary key of R, and there is a clustered index on R.a. S.b is a candidate key of S, but there is no index on S.b. Suppose that the size of R is B blocks, that the size of S is also B blocks, and that the size of the index on R.a is negligible (much less than B). a. (4 marks) Suppose the plan for the query is a block nested loop join of R and S, with S as the outer. M blocks of memory are available for the join operation, where M < B. Give an expression (in terms of B and M) for the I/O cost of this plan. Assume that the plan output is not materialized. S will be read once at a cost of B. R will be read B/M times, at a cost of B each time. So, the total cost is: B + B M B = B + B2 M Suppose instead that the plan for the query is a merge join of R and S. Tuples of R are read in R.a order using the clustered index on R.a. S is sorted using an external merge sort, the output of which is pipelined into the join. A total of M blocks of memory are available, where B < M < B. How much memory should be used for the merge join operator, and how much should be used for the sort operator? Briefly justify your answer. Since R.a is a key of R and S.b is a candidate key of S, there are no duplicate join keys. In that case, the merge join operator only needs to hold one tuple from each relation. Therefore, all, or almost all of the available memory should be given to the sort operator. CS448/648 2 of 6

3. (cont d) c. (4 marks) Estimate the I/O cost of the plan from part (b), assuming that the M blocks of memory are allocated as you specified in your answer to that part. Assume that block writes and block reads have the same cost and that the plan output is not materialized. For the sort, the run formation phase will involve reading S (cost B) and writing B/M sorted runs (cost B). Since B < M, the number of runs will be less then M, and a single merge pass will suffice. That at will have an I/O cost of B to read in the runs, and no cost to pipeline to the sorted output data. So, the total I/O cost for the sort will be 3B. The join will consume the pipelined sort output (no I/O cost) will read R using the index on R.a. Since the index is clusterd, each block of R will be read once, for a cost of B. In addition, some index blocks will have to be read, but this cost is negligible since the index is assumed to be very small. Thus, the total cost is 4B: 3B from the sort and B from reading R. d. (2 marks) Assuming that B < M < B, under what circumstances is the block nested loop join plan preferable to the merge join plan? Express your answer in terms of B and M, and briefly justify it. The block nested loop join is preferable when which occurs when B 3 < M B + B2 M < 4B CS448/648 3 of 6

4 (8 marks) The pseudo-code below implements the GetNext method of an iterator version of tuple-oriented nestedloop join, as discussed in class. Iterator state: Iterator R; // right (inner) child Iterator L; // left (outer) child Tuple t; // current outer tuple GetNext() { tuple t2; // current inner tuple if (t == NULL) t = L.GetNext(); while (t <> NULL) { while((t2 = R.GetNext()) <> NULL) { if ( t matches t2 ) return ( t join t2 ) R.Close(); R.Open(); t = L.GetNext(); return(null); Implement the GetNext method of an iterator for merge join, assuming that both the left and right inputs are already sorted on their join keys. Declare the iterator state and write your code in the style of the example above. To simplify this problem, you may assume that the left input (only) does not contain duplicate join keys. The right input may contain duplicates. Iterator state: Iterator R; // right (inner) child Iterator L; // left (outer) child Tuple tl; // current left tuple (need to save, as it may match multiple // right tuples) // on exit, either tl has matched tr or one or both // inputs are exhausted GetNext() { Tuple tr; // right tuple // don t get a new tl except when GetNext is first called if (tl == NULL) tl = L.GetNext(); tr = R.GetNext(); while (tl <> NULL and tr <> NULL) { if (tl matches tr) then return (tl join tr); if (joinkey(t1) < joinkey(tr)) then tl = L.GetNext()l else tr = R.GetNext() return(null) CS448/648 4 of 6

5 (8 total marks) Consider the hybrid hash join algorithm presented in the notes and discussed class. Assume that a total of M blocks of memory are available for the join, of which k blocks are used to build the hash table for the first partition and the remaining M k blocks are used to stage the remaining partitions to temporary files. For the purposes of this question, make the simplifying assumption that the hash table is perfectly space-efficient, i.e., k blocks worth of tuples can be stored in a hash table of size k blocks. a. (4 marks) What is the size of the largest build relation that can be handled by this join operator without having to recursively re-partition the build relation? Express your answer in terms of M and k, and justify your answer. Each partition must have at most k blocks, since it needs to fit into the hash table. The maximum number of partitions is M k (actually, M k + 1), since we need one block to stage each partition that is being sent to a temporary file on the disk. Thus, the largest build file that can be handled is about k(m k) = Mk k 2 What is the best value for k, if the goal is to maximize the size of build table that can be handled by this operator without re-partitioning? Justify your answer. d(mk k 2 ) dk = M 2k This reaches zero when k = M/2, which maximizes build file size. c. (2 marks) Does memory size impose a limit on the maximum size of the probe input that can be handled (without re-partitioning) by this operator? If so, what is the maximum probe table input size (in terms of M and k) that can be handled by this operator? If not, write NO LIMIT. NO LIMIT Memory imposes a limit on the number of partitions of the probe relation (which will be M k +1), but not on the size of the partitions, since they do not need to be loaded into the in-memory hash table. CS448/648 5 of 6

6 (10 total marks) Provide brief answers to the each of the following questions. Brief means a sentence or two. a. (2 marks) What does pipelining mean, in the context of relational query processing? Pipelining means that each query operator makes its output tuples available to the next operator as soon as they are produced, rather than waiting until it has finished producing all of its output tuples. What is the database physical design problem? It is the problem if choosing which indexes (or other physical structures) to create for a particular database, usually given some information about the types of queries and updates that are expected. c. (2 marks) An external merge sort operates in two phases. What are the two phases, and what occurs in each phase? The first phase is run formation, during which the entire input is consumed and several individually-sorted runs are produced. The second phase is merging, during which multiple sorted runs are merged to produce longer runs - and eventually one single sorted run. d. (2 marks) How does the generalized clock replacement algorithm differ from the basic clock algorithm? The basic algorithm maintains one bit per cached page to indicate whether that page has been used recently, and tries to evict pages that have not been used recently. The generalized algorithm instead maintains a counter per cached page and tries to evict pages with lower counter values. e. (2 marks) What is the cache inclusion problem? In a two-level cache hierarchy, cache inclusion refers to the problem that the lower level cache may contain many of the same pages that are found in the upper level cache. CS448/648 6 of 6