CS 245 Midterm Exam Solution Winter 2015

This exam is open book and notes. You can use a calculator and your laptop to access course notes and videos (but not to communicate with other people). You have 70 minutes to complete the exam.

Print your name: CS 245 Staff

The Honor Code is an undertaking of the students, individually and collectively:

1. that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading;
2. that they will do their share and take an active part in seeing to it that others as well as themselves uphold the spirit and letter of the Honor Code.

The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code. While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.

I acknowledge and accept the Honor Code.

Signed: CS 245 Staff

Problem   Points   Maximum
1         10       10
2         10       10
3         10       10
4         10       10
5         10       10
6         10       10
Total     60       60

Problem 1 (10 points)

For parts (a) and (b), consider a B+ tree of order n (where n is odd) that has a depth L > 1.

(a) What is the minimum number of record pointers the tree can contain?

$2 \left( \frac{n+1}{2} \right)^{L-1}$

(b) What is the maximum number of record pointers the tree can contain?

$n (n+1)^{L-1}$

(c) Imagine you have a B+ tree with 3 levels in which each node has exactly 10 keys. There is a record for each key 1, 2, 3, ..., N, where N is the number of records. How many nodes must be examined to find all records with keys in the range [95, 134]?

By following the sequence pointers, we only need to look at one node in each of the first two levels. We only need to look at leaf nodes containing a record in this range, of which there are 5 (the nodes holding records 91-100, 101-110, 111-120, 121-130, and 131-140). This means we must read a total of 1 + 1 + 5 = 7 nodes.

(d) Draw the state of the following B+ tree after deleting the record with key 3. Note we are using the textbook's notation for representing nodes.

[Figure: the original B+ tree; not reproduced in this transcription.]

The resulting B+ tree is shown below:

[Figure: the B+ tree after deleting key 3; not reproduced in this transcription.]
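The counts in parts (a) through (c) can be checked numerically. The following is a minimal Python sketch, not part of the original solution; the function names are hypothetical, and it assumes the textbook convention that a node of order n holds up to n keys and n + 1 pointers, with keys 1, 2, 3, ... packed into leaves in order.

```python
def min_record_pointers(n: int, depth: int) -> int:
    """Minimum record pointers in a B+ tree of odd order n and depth > 1.

    The root has at least 2 pointers; every other node (including each
    leaf) has at least (n + 1) / 2 pointers.
    """
    half = (n + 1) // 2
    return 2 * half ** (depth - 1)


def max_record_pointers(n: int, depth: int) -> int:
    """Maximum record pointers: full fan-out n + 1 above leaves of n records."""
    return n * (n + 1) ** (depth - 1)


def nodes_for_range(lo: int, hi: int, keys_per_leaf: int = 10,
                    upper_levels: int = 2) -> int:
    """Nodes read for a range query when keys 1..N fill the leaves in order.

    One node is read per upper level, then the leaves are followed
    through their sequence pointers.
    """
    first_leaf = (lo - 1) // keys_per_leaf
    last_leaf = (hi - 1) // keys_per_leaf
    return upper_levels + (last_leaf - first_leaf + 1)


if __name__ == "__main__":
    print(min_record_pointers(3, 2), max_record_pointers(3, 2))
    print(nodes_for_range(95, 134))
```

For the range [95, 134] the sketch touches leaves 91-100 through 131-140, which matches the count of 7 nodes in part (c).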

Problem 2 (10 points)

For this problem consider two relations R and S, with schemas R(A, B, C) and S(B, C, D). The following sub-problems ask you to rewrite given logical queries by manipulating the order of various selection and projection operations, without changing the final result. We use $\sigma_X$ as shorthand to denote a selection based on a predicate that involves attributes/conditions X. For example, $\sigma_A$ represents $\sigma_{A=a}$ (for some value a) and $\sigma_{A \wedge B}$ represents $\sigma_{(A=a) \wedge (B=b)}$ (for some values a, b). Assume that R and S are sets (not bags). But note that some operators may yield a bag (even if the input was a set), so the final answer may be a bag. All your answers should contain at most 2 join operators.

NOTE: We will give full points if your final answer is correct, but show intermediate steps wherever possible/applicable, as you can receive partial points for them if your final answer is incorrect. If you only give an incorrect final answer with no steps, no points will be awarded.

(a) Consider Expression 1: $\sigma_B(\sigma_{A \wedge C}(R) \bowtie \sigma_D(S))$.

(a1) Rewrite Expression 1 by pushing the selections as far in as possible (smallest possible join(s)).

$\sigma_B(\sigma_{A \wedge C}(R) \bowtie \sigma_D(S))$
$= \sigma_B(\sigma_C(\sigma_A(R) \bowtie \sigma_D(S)))$ (0.1)
$= \sigma_B(\sigma_{A \wedge C}(R) \bowtie \sigma_{D \wedge C}(S))$ (0.2)
$= \sigma_{A \wedge C \wedge B}(R) \bowtie \sigma_{D \wedge C \wedge B}(S)$ (0.3)

Answer: $\sigma_{A \wedge B \wedge C}(R) \bowtie \sigma_{B \wedge C \wedge D}(S)$

(a2) Rewrite Expression 1 by pushing the selections as far out as possible (largest possible join(s)).

Answer: $\sigma_{A \wedge B \wedge C \wedge D}(R \bowtie S)$

(b) Consider Expression 2: $\sigma_B(\sigma_{A \vee C}(R) \bowtie \sigma_D(S))$.

(b1) Rewrite Expression 2 by pushing the selections as far in as possible (smallest possible join(s)).

$\sigma_B(\sigma_{A \vee C}(R) \bowtie \sigma_D(S))$
$= \sigma_B(\sigma_A(R) \bowtie \sigma_D(S)) \cup \sigma_B(\sigma_C(R) \bowtie \sigma_D(S))$ (0.4)
$= (\sigma_{A \wedge B}(R) \bowtie \sigma_{D \wedge B}(S)) \cup (\sigma_{C \wedge B}(R) \bowtie \sigma_{D \wedge B}(S))$ (0.5)
$= (\sigma_{A \wedge B}(R) \bowtie \sigma_{D \wedge B}(S)) \cup (\sigma_{C \wedge B}(R) \bowtie \sigma_{D \wedge B \wedge C}(S))$ (0.6)

Answer: $(\sigma_{A \wedge B}(R) \bowtie \sigma_{B \wedge D}(S)) \cup (\sigma_{B \wedge C}(R) \bowtie \sigma_{B \wedge C \wedge D}(S))$

(b2) Rewrite Expression 2 by pushing the selections as far out as possible (largest possible join(s)).

Answer: $\sigma_{B \wedge (A \vee C) \wedge D}(R \bowtie S)$

(c) Consider Expression 3: $\pi_{A,B}(\sigma_C(R) \bowtie S)$. Rewrite Expression 3 by pushing the selection and projection as far in as possible (smallest possible join(s)).

$\pi_{A,B}(\sigma_C(R) \bowtie S)$
$= \pi_{A,B}(\sigma_C(R) \bowtie \sigma_C(S))$ (0.7)
$= \pi_{A,B}(\sigma_C(R) \bowtie \pi_{B,C}(\sigma_C(S)))$ (0.8) (can project out D, as it is not part of the join or the final projected view)
$= \pi_{A,B}(\sigma_C(R) \bowtie \pi_B(\sigma_C(S)))$ (0.9) (can project out C from S, as $\sigma_C$ already ensures the join condition over C)
$= \pi_{A,B}(\sigma_C(R)) \bowtie \pi_B(\sigma_C(S))$ (0.10) (can project out C from R, as the join no longer uses C)

Note that performing the last two steps (projecting out C from R and S) without first pushing $\sigma_C$ to S is incorrect, as it could produce a different logical result.

Answer: $\pi_{A,B}(\sigma_C(R)) \bowtie \pi_B(\sigma_C(S))$
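The rewrites for Expressions 1 and 2 can be sanity-checked on small random relations. The sketch below is an illustration only, not part of the original solution: it models R(A, B, C) and S(B, C, D) as Python sets of tuples (so it checks set semantics, not bag semantics), and all helper names are hypothetical.

```python
import random


def select(rel, pred):
    """Selection: keep the tuples of rel that satisfy pred."""
    return {t for t in rel if pred(t)}


def natural_join(r, s):
    """Natural join of R(A, B, C) and S(B, C, D) on the shared attributes B, C."""
    return {(a, b, c, d)
            for (a, b, c) in r
            for (b2, c2, d) in s
            if (b, c) == (b2, c2)}


def random_relation(size, domain, rng):
    """A random set of 3-attribute tuples with values in range(domain)."""
    return {tuple(rng.randrange(domain) for _ in range(3)) for _ in range(size)}


def rewrites_agree(r, s, a=0, b=0, c=0, d=0):
    """Check that the pushed-in and pushed-out forms match Expressions 1 and 2."""
    joined = natural_join(r, s)
    s_d = select(s, lambda t: t[2] == d)

    # Expression 1: sigma_B(sigma_{A and C}(R) join sigma_D(S))
    e1 = select(natural_join(select(r, lambda t: t[0] == a and t[2] == c), s_d),
                lambda t: t[1] == b)
    e1_in = natural_join(select(r, lambda t: t == (a, b, c)),
                         select(s, lambda t: t == (b, c, d)))
    e1_out = select(joined, lambda t: t == (a, b, c, d))

    # Expression 2: sigma_B(sigma_{A or C}(R) join sigma_D(S))
    e2 = select(natural_join(select(r, lambda t: t[0] == a or t[2] == c), s_d),
                lambda t: t[1] == b)
    e2_in = (natural_join(select(r, lambda t: t[0] == a and t[1] == b),
                          select(s, lambda t: t[0] == b and t[2] == d))
             | natural_join(select(r, lambda t: t[1] == b and t[2] == c),
                            select(s, lambda t: t == (b, c, d))))
    e2_out = select(joined,
                    lambda t: t[1] == b and (t[0] == a or t[2] == c) and t[3] == d)

    return e1 == e1_in == e1_out and e2 == e2_in == e2_out


if __name__ == "__main__":
    rng = random.Random(0)
    r, s = random_relation(12, 2, rng), random_relation(12, 2, rng)
    print(rewrites_agree(r, s))
```

A small value domain is used on purpose so that the joins are non-empty; a random test like this can expose a wrong rewrite but does not prove a correct one.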

Problem 3 (10 points)

Consider a relational DBMS with the following employee relation E:

- E has attributes id, name, age, salary.
- id is the primary key attribute (no duplicate ids).
- E holds 1024 tuples.
- All records (tuples) in E are stored sequentially (in increasing order) based on their id.
- Each data block that holds E records contains 4 records. (Blocks have been compacted so they are all full.)

We are building a traditional index (not a B+-tree) for this sequential file. Each index block can hold up to 8 (value, pointer) entries (where the pointer can identify either a block or a record). Index entries are sorted by id. Index blocks are stored contiguously, so we use binary search to look for a key in the index. (We know the address of the first block and the number of index blocks, so we can probe the block in the middle at each step.)

(a) Suppose we construct a dense index INDEX1 on E.id, and we search for a record with a given id. In the worst case scenario, how many blocks must the system read to retrieve the record? Assume that a record with the given id exists.

Number of blocks: 8 + 1 = 9

The 1024 index entries fit in 1024 / 8 = 128 INDEX1 blocks. To perform a binary search and read one INDEX1 block, we need at most $\log_2 128 + 1 = 8$ INDEX1 reads. We also need 1 record read. Therefore, the total is 8 + 1 = 9.

In the worst case, $\mathrm{BinSearch}(2^n) = \log_2 2^n + 1 = n + 1$ reads (we need n reads to determine which index block to read and 1 read to actually read the index block).

(b) To improve access speed, we add an index INDEX2 on INDEX1. What type of index makes sense, sparse or dense?

Type of index: Sparse

(c) After building INDEX2 we want to find a record with a given id. In the worst case scenario, how many blocks must the system read to retrieve the contents of the (existing) record?

To simplify your work for parts (c) through (e), assume you have a function BinSearch(blocks) that returns the number of blocks you need to access for a binary search of blocks sequential blocks, to find and retrieve an (id, ptr) entry that exists in the index. (The result does not include the cost of accessing lower level indexes (if any) or of fetching the actual record.) You can use this function in your answer if you like.

Number of blocks: BinSearch(16) + 2 [= 7]

In the worst case, $\mathrm{BinSearch}(2^n) = \log_2 2^n + 1 = n + 1$ reads (we need n reads to determine which index block to read and 1 read to actually read the index block). Recall from part (a) that we have 128 INDEX1 blocks. Therefore, there are 128 / 8 = 16 INDEX2 blocks. To perform the search, we need at most BinSearch(16) = 5 INDEX2 reads + 1 INDEX1 read + 1 record read.

(d) Next let us build a dense secondary index on E.salary (let us call it INDEX3). Suppose that there are 64 distinct salary values, with an equal number of records per salary value. First let us say that we implement the E.salary index in a naive way, where duplicate values are stored in the index. That is, if there are x records with salary = 1000, the index contains x entries of the form (1000, pointer). (Salaries are sorted in the index and the index blocks are stored contiguously.) Let 1000 be the second largest salary value (i.e., only one salary is larger than 1000). (The search procedure does not know this fact.) Say we want to read the content of all records with salary = 1000. In the worst case scenario, how many blocks need to be fetched?

Number of blocks: (BinSearch(128) + 1) + 16 = BinSearch(128) + 17 [= 25]

Since the 1024 index entries fit in 128 INDEX3 blocks, a binary search needs at most BinSearch(128) = 8 INDEX3 reads. These 8 reads allow us to find and retrieve the first index block containing a (1000, pointer) entry. Since there are 1024 / 64 = 16 records with a salary of 1000 (thus 16 (1000, pointer) entries) and each index block can only store 8 (value, pointer) entries, we must read 2 INDEX3 blocks, not just the first one. Therefore, we need to add 1 read. Lastly, we read 16 records.

(e) Next, suppose we build the E.salary index without duplicate values. For instance, if there are x records with salary = 1000, there will be a single entry in the index, pointing to a set of contiguous blocks that store the x record pointers. Salaries are still sorted in the index, and the index blocks are stored contiguously. Each pointer block can store 16 pointer entries. In the worst case scenario, how many blocks need to be fetched to read the content of all records with salary = 1000?

Number of blocks: BinSearch(8) + 17 [= 21]

The 64 index entries (one per distinct salary value) fit in 64 / 8 = 8 INDEX3 blocks. We need at most BinSearch(8) = 4 INDEX3 reads + 1 pointer block read (16 pointers to records at 16 per block) + 16 reads for the 16 records.
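The block counts for parts (a), (c), (d), and (e) follow one pattern: a binary search over the top-level index, plus the reads below it. Here is a minimal Python sketch of that arithmetic, assuming block counts are powers of two as in this problem; `bin_search` mirrors the exam's BinSearch, while `problem3_block_counts` and its parameter names are hypothetical.

```python
import math


def bin_search(blocks: int) -> int:
    """Worst-case reads for a binary search over contiguous index blocks.

    Uses the solution's formula log2(blocks) + 1 and assumes that
    `blocks` is a power of two.
    """
    return int(math.log2(blocks)) + 1


def problem3_block_counts(records=1024, index_entries_per_block=8,
                          distinct_salaries=64, pointers_per_block=16):
    """Worst-case block reads for parts (a), (c), (d), and (e)."""
    index1_blocks = records // index_entries_per_block        # dense index on id
    index2_blocks = index1_blocks // index_entries_per_block  # sparse index on INDEX1
    per_salary = records // distinct_salaries                 # records per salary value

    # (d): the matching entries span ceil(16 / 8) = 2 index blocks,
    # so one read is needed beyond the block found by the binary search.
    extra_index_blocks = math.ceil(per_salary / index_entries_per_block) - 1

    # (e): one index entry per distinct salary, then pointer blocks.
    index3_blocks = math.ceil(distinct_salaries / index_entries_per_block)
    pointer_blocks = math.ceil(per_salary / pointers_per_block)

    return {
        "a": bin_search(index1_blocks) + 1,
        "c": bin_search(index2_blocks) + 1 + 1,
        "d": bin_search(index1_blocks) + extra_index_blocks + per_salary,
        "e": bin_search(index3_blocks) + pointer_blocks + per_salary,
    }


if __name__ == "__main__":
    print(problem3_block_counts())
```

With the default parameters the sketch reproduces the answers above: 9, 7, 25, and 21 block reads.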

Problem 4 (10 points)

Consider two relations $R_1(A, B, C)$ and $R_2(B, C, D)$. We are given the following information:

- $T(R_1) = 100{,}000$; $V(R_1, A) = 100$; $V(R_1, B) = 50$; $V(R_1, C) = 20$.
- $T(R_2) = 10{,}000$; $V(R_2, B) = 20$; $V(R_2, C) = 50$; $V(R_2, D) = 10$.
- All attributes are 10 bytes in size.
- We use the containment of value sets assumption.
- We assume query values are selected from values in the relations.

Compute the number of tuples and size (in bytes) of the following expressions:

(a) $W = (\sigma_{A=3} R_1) \bowtie (\sigma_{B=5} R_2)$

Let's call $M = \sigma_{A=3} R_1$ and $N = \sigma_{B=5} R_2$.

$T(W) = \frac{T(M)\,T(N)}{\max(V(M,B), V(N,B)) \cdot \max(V(M,C), V(N,C))} = \frac{\frac{100000}{100} \cdot \frac{10000}{20}}{50 \cdot 50} = 200$

$S(W) = 4 \cdot 10 = 40$

(b) $Y = \pi_{CD}[(\sigma_{B=3 \wedge C=5} R_2)]$

Let's call $P = \sigma_{B=3 \wedge C=5} R_2$. We need to calculate $T(Y) = \max(V(P, C), V(P, D))$, but we know $V(P, C) = 1$, so we need $V(P, D)$. By containment of value sets, we have that $V(P, D) = \min(T(P), V(R_2, D))$. But note that $T(P) = \frac{T(R_2)}{V(R_2, B)\,V(R_2, C)} = 10$.

$T(Y) = \min(T(P), V(R_2, D)) = 10$

$S(Y) = 2 \cdot 10 = 20$

(c) $U = \sigma_{D=1}[(\sigma_{A=3 \wedge B=5} R_1) \bowtie (\sigma_{C=3 \wedge D=5} R_2)]$

Because of $\sigma_{C=3 \wedge D=5}$, the join $(\sigma_{A=3 \wedge B=5} R_1) \bowtie (\sigma_{C=3 \wedge D=5} R_2)$ only contains tuples with D = 5. Applying $\sigma_{D=1}$ to this relation yields an empty relation. Therefore, $T(U) = 0$.
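The estimates in parts (a) and (b) can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the course's reference code: the function name and dictionary layout are hypothetical, and capping a value count at the tuple count with `min` is a common convention that happens to agree with the numbers used in the solution.

```python
def problem4_estimates():
    """Recompute the tuple counts and tuple sizes for parts (a) and (b)."""
    t_r1, v_r1 = 100_000, {"A": 100, "B": 50, "C": 20}
    t_r2, v_r2 = 10_000, {"B": 20, "C": 50, "D": 10}
    attr_bytes = 10

    # (a) M = sigma_{A=3}(R1), N = sigma_{B=5}(R2), W = M join N on B and C
    t_m = t_r1 // v_r1["A"]                      # 1000 tuples
    t_n = t_r2 // v_r2["B"]                      # 500 tuples
    v_m = {"B": min(v_r1["B"], t_m), "C": min(v_r1["C"], t_m)}
    v_n = {"B": 1, "C": min(v_r2["C"], t_n)}     # B is fixed to one value in N
    t_w = (t_m * t_n) // (max(v_m["B"], v_n["B"]) * max(v_m["C"], v_n["C"]))
    s_w = 4 * attr_bytes                         # W has attributes A, B, C, D

    # (b) P = sigma_{B=3 and C=5}(R2), Y = pi_{C,D}(P)
    t_p = t_r2 // (v_r2["B"] * v_r2["C"])        # 10 tuples
    v_p_c = 1                                    # C is fixed to one value in P
    v_p_d = min(t_p, v_r2["D"])
    t_y = max(v_p_c, v_p_d)
    s_y = 2 * attr_bytes                         # Y has attributes C, D

    return {"T(W)": t_w, "S(W)": s_w, "T(Y)": t_y, "S(Y)": s_y}


if __name__ == "__main__":
    print(problem4_estimates())
```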

Problem 5 (10 points)

Consider a hard disk with the following specifications:

- 6000 RPM.
- 3.5 in. in diameter.
- 250 GB usable capacity.
- 100 cylinders, numbered from 1 (innermost) to 100 (outermost).
- Takes 1 + (t/100) milliseconds to move the head across t cylinders.
- 20% overhead between blocks (gaps).
- Block size is 32 MB.
- Transfer rate is 16 GB/sec.

For this problem 1 GB is $10^9$ bytes and 1 MB is $10^6$ bytes.

(a) Based on the specifications, calculate the average rotational delay (in milliseconds) of this disk.

The average rotational delay is half of the maximum rotational delay:

$\frac{1}{2} \cdot \frac{60\ \text{s}}{6000} = 5\ \text{ms}$

(b) Suppose that we have just finished reading a block at track 50, and we next want to read a block at track 10. What is the total read time (time for the desired block to appear in memory)? You can ignore the delays we ignored in class.

We will only consider seek time, rotational delay, and transfer time:

$\left(1 + \frac{50 - 10}{100}\right) + 5 + \left(\frac{32 \times 10^6}{16 \times 10^9} \times 10^3\right) = 8.4\ \text{ms}$
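The timing in parts (a) and (b) is the sum of three terms. A minimal Python sketch of that calculation follows; the function names are hypothetical, and like the solution it ignores the gap overhead and every other delay beyond seek, rotation, and transfer.

```python
def avg_rotational_delay_ms(rpm: float) -> float:
    """Half of one full rotation, in milliseconds."""
    return 0.5 * 60_000 / rpm


def seek_ms(from_track: int, to_track: int) -> float:
    """Seek time 1 + t/100 ms for a move across t cylinders."""
    return 1 + abs(from_track - to_track) / 100


def transfer_ms(block_bytes: float, bytes_per_second: float) -> float:
    """Time to transfer one block, in milliseconds."""
    return block_bytes / bytes_per_second * 1000


def read_time_ms(from_track, to_track, rpm=6000,
                 block_bytes=32e6, bytes_per_second=16e9):
    """Seek + average rotational delay + transfer, as in part (b)."""
    return (seek_ms(from_track, to_track)
            + avg_rotational_delay_ms(rpm)
            + transfer_ms(block_bytes, bytes_per_second))


if __name__ == "__main__":
    print(avg_rotational_delay_ms(6000))
    print(round(read_time_ms(50, 10), 3))
```

The three terms for part (b) come out to 1.4 ms of seek, 5 ms of rotational delay, and 2 ms of transfer.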

(c) Suppose that we have just finished reading a block at track 50, and seven IO requests are currently pending. These requests are for blocks at the following tracks, shown in the order in which the requests arrived: 60, 40, 90, 35, 5, 100, 45. That is, the request for track 60 was the first one that arrived, the request for 40 was the second one, and so on. Consider the following three disk-scheduling policies:

- FCFS (first-come-first-serve)
- Elevator policy
- SSTF (shortest-seek-time-first)

For each of the policies, list the requests in the order in which they will be serviced. For example, if you write 5, 35, 40, ... it means that the request for a block at track 5 will be serviced first, followed by the one for track 35, and so on.

Service order for FCFS: 60, 40, 90, 35, 5, 100, 45

Service order for Elevator: You can either move up or down first, so there are two possible options. You only need to state one.

- 60, 90, 100, 45, 40, 35, 5
- 45, 40, 35, 5, 60, 90, 100

Service order for SSTF: 45, 40, 35, 60, 90, 100, 5
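The three service orders can be reproduced with a short simulation. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that no new requests arrive while the pending ones are being serviced.

```python
def fcfs(start, requests):
    """First-come-first-serve: service requests in arrival order."""
    return list(requests)


def elevator(start, requests, direction="up"):
    """Sweep in one direction from `start`, then reverse for the rest."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down if direction == "up" else down + up


def sstf(start, requests):
    """Shortest-seek-time-first: always pick the closest pending track."""
    pending = list(requests)
    position = start
    order = []
    while pending:
        nearest = min(pending, key=lambda t: abs(t - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


if __name__ == "__main__":
    pending = [60, 40, 90, 35, 5, 100, 45]
    print(fcfs(50, pending))
    print(elevator(50, pending, "up"))
    print(elevator(50, pending, "down"))
    print(sstf(50, pending))
```

Note that SSTF leaves track 5 for last even though it arrived fifth, which illustrates how the policy can delay requests that are far from the head.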

Problem 6 (10 points)

State if the following statements are true or false. Please write TRUE or FALSE in the space provided.

(a) With a linear hash table, when it is necessary to grow the directory, the directory is doubled in size.
FALSE
Explanation: This is true for extensible hashing. For linear hashing, you add one more bucket at a time.

(b) Fixed-format records must always be fixed-size.
FALSE
Explanation: The schema of fixed-format records only specifies the number of fields, the type of each field, their order in the record, and the meaning of each field. It does not necessarily specify the length of each field. For instance, you can have a fixed-format record with two fields, integer and string, in which the integer specifies the length of the string.

(c) When the index fits in memory and most queries are searching for keys that do not exist in the relation, dense indexes are always better than sparse indexes.
TRUE
Explanation: With the entire dense index in memory, the system can determine that a key does not exist without accessing any records on storage.

(d) A grid index is only effective when there are two dimensions.
FALSE
Explanation: You can make a multi-dimensional grid. The logic is exactly the same.

(e) Encrypting records in a database makes it harder to build indexes for them.
TRUE
Explanation: Searching for keys is not as simple with encryption.

(f) Stanford ID numbers definitely do not follow a Zipfian distribution.
TRUE
Explanation: The IDs are assigned in order. There is no particular reason why they should follow a Zipfian distribution.

(g) A column store is usually superior to the more traditional row store when queries refer to many attributes (columns).
FALSE
Explanation: If the query refers to one attribute, a column store is superior, since we only need to look at a small part of the storage. However, for queries referring to multiple attributes, the system still needs to read a big chunk of data from storage.

(h) The expression σ p (R S) = σ p (R) S does not hold when R and S are bags.
FALSE
Explanation: Try to convince yourself the expression holds.

(i) In a multi-attribute hash table (as discussed in class), a different hash function can be used for each attribute.
TRUE
Explanation: The only necessary condition is that for each attribute, you need to be consistent with your choice of hash function.

(j) The 5 minute rule is not useful when data resides on an SSD.
We will accept both answers, depending on how you interpret the phrase "5 minute rule".
Explanation for TRUE: An SSD has a different IO-per-second rate and a different price. You cannot directly apply the result of the calculation for HDD and main memory to an SSD. You can, however, use a similar calculation to come up with the appropriate time interval.
Explanation for FALSE: The name "5 minute rule" only refers to the technique we use for the calculation, not the exact number five. You can use the same technique to come up with the appropriate time interval for an SSD.