CS 222/122C Fall 2016, Midterm Exam

Similar documents
Spring 2013 CS 122C & CS 222 Midterm Exam (and Comprehensive Exam, Part I) (Max. Points: 100)

CS222P Fall 2017, Final Exam

University of California, Berkeley. CS 186 Introduction to Databases, Spring 2014, Prof. Dan Olteanu MIDTERM

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

IMPORTANT: Circle the last two letters of your class account:

CS 186/286 Spring 2018 Midterm 1

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] cs186-

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Lecture Data layout on disk. How to store relations (tables) in disk blocks. By Marina Barsky Winter 2016, University of Toronto

CS 222/122C Fall 2017, Final Exam. Sample solutions

Lecture 8 Index (B+-Tree and Hash)

Storage hierarchy. Textbook: chapters 11, 12, and 13

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

University of California, Berkeley. (2 points for each row; 1 point given if part of the change in the row was correct)

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

CSE 544 Principles of Database Management Systems

CSE 444: Database Internals. Lectures 5-6 Indexing

Database Applications (15-415)

CSC 261/461 Database Systems Lecture 17. Fall 2017

COSC-4411(M) Midterm #1

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Unit 3 Disk Scheduling, Records, Files, Metadata

Database Applications (15-415)

CS 186/286 Spring 2018 Midterm 1

CS-245 Database System Principles

CS220 Database Systems. File Organization

Tree-Structured Indexes. Chapter 10

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Midterm Exam #2 (Version A) CS 122A Winter 2017

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

CS 245 Midterm Exam Winter 2014

Kathleen Durant PhD Northeastern University CS Indexes

Final Examination CSE 100 UCSD (Practice)

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Lecture 13. Lecture 13: B+ Tree

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Tree-Structured Indexes

Midterm 2, CS186 Prof. Hellerstein, Spring 2012

Tree-Structured Indexes

Database Systems II. Record Organization

Tree-Structured Indexes

RAID in Practice, Overview of Indexing

Midterm Exam (Version B) CS 122A Spring 2017

Database Applications (15-415)

University of Waterloo Midterm Examination Sample Solution

CS542: Database I, Summer Course Project

COSC-4411(M) Midterm #1

CAS CS 460/660 Introduction to Database Systems. File Organization and Indexing

Midterm 1: CS186, Spring 2015

Indexing: Overview & Hashing. CS 377: Database Systems

Database Management Systems (COP 5725) Homework 3

STORING DATA: DISK AND FILES

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Some Practice Problems on Hardware, File Organization and Indexing

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Midterm Exam #1 Version A CS 122A Winter 2017

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

CS 564 Final Exam Fall 2015 Answers

Midterm Exam #1 Version B CS 122A Winter 2017

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Physical Database Design: Outline

Topics to Learn. Important concepts. Tree-based index. Hash-based index

ACCESS METHODS: FILE ORGANIZATIONS, B+TREE

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators)

Indexing and Hashing

CS 461: Database Systems. Final Review. Julia Stoyanovich

CSE 373 Sample Midterm #2 (closed book, closed notes, calculators o.k.)

Database Management Systems Written Exam

INDEXES MICHAEL LIUT DEPARTMENT OF COMPUTING AND SOFTWARE MCMASTER UNIVERSITY

Storage and Indexing

ΗΥ460 Συστήματα Διαχείρισης Βάσεων Δεδομένων Χειμερινό Εξάμηνο 2017 Διδάσκοντες: Βασίλης Χριστοφίδης, Δημήτρης Πλεξουσάκης, Χαρίδημος Κονδυλάκης

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

CS 61B Summer 2005 (Porter) Midterm 2 July 21, SOLUTIONS. Do not open until told to begin

R16 SET - 1 '' ''' '' ''' Code No: R

Data Management for Data Science

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Storage and Indexing, Part I

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

CS 405G: Introduction to Database Systems. Storage

System Structure Revisited

CSIT5300: Advanced Database Systems

Midterm spring. CSC228H University of Toronto

Outlines. Chapter 2 Storage Structure. Structure of a DBMS (with some simplification) Structure of a DBMS (with some simplification)

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Chapter 11: Indexing and Hashing

Data on External Storage

Disks & Files. Yanlei Diao UMass Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Chapter 17 Indexing Structures for Files and Physical Database Design

CS 186 Midterm, Spring 2003 Page 1

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

COS 226 Fall 2015 Midterm Exam pts.; 60 minutes; 8 Qs; 15 pgs :00 p.m. Name:

CSE344 Midterm Exam Winter 2017

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Transcription:

STUDENT NAME: STUDENT ID: Instructions: CS 222/122C Fall 2016, Midterm Exam Principles of Data Management Department of Computer Science, UC Irvine Prof. Chen Li (Max. Points: 100) This exam has six (6) questions. This exam is closed book. However, you can use one cheat sheet (A4 size). The total time for the exam is 80 minutes, so budget your time accordingly. Be sure to answer each part of each question after reading the whole question carefully. If you don t understand something, ask for clarification. If you still find ambiguities in a question, write down the interpretation you are taking and then answer the question based on that interpretation. The last two pages of this exam are blank; you can use them as scratch paper. QUESTION POINTS SCORE 1 12 2 13 3 20 4 20 5 20 6 15 TOTAL 100 100 1

Question 1 (12 points) Short questions a) (4 pts) Explain the purpose of the pin() and unpin() operations for a page in the buffer pool and how they are related to cache replacement. A requester for a page needs to pin the page to indicate that the requester is using it. Internally we keep a pin count to know the number of requesters for this page. pin() increments this count by 1. The buffer manager can replace this page only if the pin count is 0. A requester should unpin the page after it finishes the operations on the page. unpin() decrements this count by 1. b) (4 pts) The search performance of B+-tree depends on its length, which is related to the fanout of each node. For each of the INT type and VARCHAR type for an index key, give a technique to increase the fanout. Explain each technique using a simple example. INT: since all the key values within a node are sorted, we can store the first key value and a sequence of deltas. These deltas tend to be small, and can be stored with fewer bytes using variable-length coding. VARCHAR: Suffix compression: David Smith can be shortened to Dav. In general, as long as the condition the right-most key in the left child < key <= the left-most key in the right child, we have the opportunity to reduce the size of the key to save space. Another solution is prefix compression. The main idea is the following. For those keys share some common prefix, we can store the prefix once then store the additional characters for each key. c) (4 pts) Briefly explain the main idea behind a column store and at least one main advantage compared to a row store. 2

Store all the values of one attribute/column physically together. Advantages: (1) If a query only retrieves a few columns, we can reduce the number of disk IOs by not reading the bytes of other irrelevant columns; (2) all the values with the same type can be compressed more easily. 3

Question 2 (13 points): Record Manager a) (5 pts) Suppose we have a table with the following schema: Player(pid INT, name VARCHAR(30), team VARCHAR(30), points_per_game FLOAT) We store the table as a heap file of variable-length records. Each record is stored as a sequence of bytes with a directory of pointers. Assume that: (1) The directory uses 2-byte offsets to store the ending positions of the fields; (2) The schema is known in all record operations so there is no need to store the number of fields in each record; (3) Integers and floating point numbers are 4-byte long. (4) Each NULL value is represented using a special value in the corresponding pointer in the directory. 1) Please fill in the following byte array to explain how the following record is stored. Clearly indicate the number of bytes per segment in the array, and all their values. (23, NULL, Chicago, 37.1) Solution: 2) Explain how to use the directory to access the team field of this record. Make sure to include all the details in the calculation. Solution: Since team is the 2nd field (starting from 0) of the record, we access the 2nd slot in the directory, which has a value of 19. Since it s a positive value, we know it s not a NULL. Then we access its previous slot, which has a value -12, indicating the previous value is a NULL, and it ends at position 12. (If the design uses an arbitrary negative value to indicate the NULL value, we have to keep looking backwards until we see the first non-null value, which is still 12.) Then we know the length of this field is 19-12 = 7. We read 7 bytes as the value for the team field. 4

b) (4 pts) Suppose we have an empty page with a size of 4,096 bytes. Use a diagram to show the layout of the page after we insert 3 new records to the page with a byte-array size of 30, 25, and 35, respectively. We assume each offset takes 2 bytes, and each array length also uses 2 bytes. Finish the following diagram by filling in the missing values, including the record offsets and lengths, number of bytes in free space, and number of slots. 5

Solution : c) (4 pts) In what operation do we need a tombstone for a record? When a record is updated with a larger size and it no longer fits in the current page. What information should be included in a tombstone? The new RID where the moved record is stored. 6

Question 3 (20 points): B+ Tree For sub-question a), b), and c), consider the following B+ tree (called tree 1 ): a) (4 pts) Suppose a data entry with key 90 is inserted to the B+ tree. Write down all the nodes that need to be read and all the nodes that need to be updated. You can mark R (for read ) and U (for Update ) on the nodes in the tree above directly. Read: A, C, K Write: K b) (4 pts) Consider the original B+ tree 1 above. Draw the new B+ tree after inserting a data entry with key 42. If there is no ambiguity, you can just draw the part that is changed. Solution : c) (4 pts) Consider the original B+ tree 1 above. Draw the new B+ tree after inserting a data entry with key 79. If there is no ambiguity, you can just draw the part that is changed. 7

Solution : For sub-questions d) and e), consider the following B+ tree (called tree 2 ): d) (4 pts) Consider the original B+ tree 2 above. Draw the new B+ tree after deleting the data entry with key 23. Assume we need to redistribute data entries if possible. If there is no ambiguity, you can just draw the part that is changed. 8

Solution : e) (4 pts) Consider the original B+ tree 2 above. Draw the new B+ tree after deleting the data entry with key 50. If there is no ambiguity, you can just draw the part that is changed. Solution : We also accept solution if the entry 50 in the root node is replaced by 59. 9

Question 4. Dynamic Hashing (20 Points) We have a table: Flight (flightno INT, aircraft INT, origin VARCHAR(10), destination VARCHAR(10), price INT). It has an extendible hashing index shown below. The directory consists of an array of 4-byte pointers, each of which points to a page/bucket. The hash function h uses the least significant bits of h(k). Up to 4 records can fit into one page/bucket. For the following questions, we assume that we split a page whenever it has an overflow page. Binary value table Decimal Binary 3 00000011 4 00000100 5 00000101 6 00000110 7 00000111 10 00001010 12 00001100 14 00001110 15 00001111 16 00010000 20 00010100 21 00010101 22 00010110 36 00100100 38 00100110 43 00101011 51 00110011 64 01000000 68 01000100 10

a) (5 Points) We insert a data entry with a hash value 43. Show the file structure after the insertion. If there is no ambiguity, you can just draw the part that changed. 11

b) (5 Points) On the original hash table, we delete a record for an entry with hash value 6. Suppose the merge policy is that whenever we can merge a bucket with its mirror bucket without causing an overflow, we merge these two buckets. Draw the new index structure after the deletion. If there is no ambiguity, you can just draw the part that changed. 12

c) (10 Points) We have a linear hashing index (shown below) for the same table: Flight (flightno INT, aircraft INT, origin VARCHAR(10), destination VARCHAR(10), price INT) Each bucket can hold up to 4 records. Our split policy is to split the page indicated by Next pointer whenever an overflow page is created due to an insertion. Binary value table Decimal Binary 7 00000111 8 00001000 9 00001001 10 00001010 11 00001011 14 00001110 15 00001111 17 00010001 18 00010010 21 00010101 24 00011000 25 00011001 26 00011010 30 00011110 31 00011111 35 00100011 36 00100100 41 00101001 44 00101100 57 00111001 13

Show the final file structure that will result if entries with hash values 15, 26, 57, and 25 are inserted to the linear hash table one by one. Make sure to include the details about the Level and Next pointer. Solution: structure after inserting 15 : 14

The final structure: 15

Question 5. Query Performance (20 points) Consider a table Flight (flightno, aircraft, origin, destination, price) containing 100,000 records organized as a heap file. flightno is an integer with values distributed uniformly over the range [0,, 99,999]. aircraft is an integer with values distributed uniformly over the range [100, 400]. price is a double with values distributed over the range [0,, 10,000]. We have the following secondary indexes on the table: 1) Unclustered B+ tree index on aircraft ; 2) Unclustered B+ tree index on price ; 3) Unclustered B+ tree index on composite key <aircraft, price> ; 4) Unclustered B+ tree index on composite key <price, aircraft>. a) (5 points) Draw a single diagram to illustrate the main idea of these four indexes and how their leaf nodes point to records in the heap file. 16

For each of the following queries, decide a most efficient access strategy (either a scan on the heap file or using one of the four indexes). Briefly explain the reason. b) (5 points) SELECT * FROM Flight F WHERE F.aircraft = 393; Index 1 (equality on aircraft) c) (5 points) SELECT * FROM Flight F WHERE F.price > 5,000. Heap-file scan or Index 2 (sorting record ids before accessing the heap file). d) (5 points) SELECT F.aircraft, count (*) FROM Flight F WHERE F.price < 500 GROUP BY aircraft; Index 3 (index-only plan); or Index 4 (Index-only plan) (Index 4 is accepted only when selectivity of price is explained) 17

Question 6. PAX Record Format (15 points) In class we talked about a PAX format to store records within the page as multiple mini-pages corresponding to different attributes. The following diagram illustrates its main idea for a simple table Emp(id INT, name VARCHAR(20), age DOUBLE). Notice the diagram is very abstract and conceptual, without giving the byte-level details. a) (3 points) What information do we need to put into the page header? Why? # of slots, which is # of mini-pages (optional); The offset and length of each minipage. b) (3 points) For the directory of each mini-page, what optimization can we do to utilize space more efficiently? For a mini-page of an attribute with a fixed length (e.g., id and age ), there is no need to store the length of each value. We could even get rid of the slots in the directory. c) (3 points) We want to allow the mini-pages to be able to move or shift within the page so that the space can be used more efficiently. Give a scenario where a mini-mage needs to be shifted, and which part(s) of the page structure needs to be changed accordingly. 18

When we insert a new record, the mini-page for the name field is running out of space. At this moment, we can shrink the mini-pages of the other two attributes to allocate more space for this name field. Correspondingly, we need to modify the offsets and lengths in the header for these mini-pages. d) (3 points) Suppose we want to implement this PAX format in our course projects. We would like to keep the same API of the record manager so that other modules are not affected by this change. Consider the following function: RC insertrecord(filehandle &filehandle, const vector<attribute> &recorddescriptor, const void *data, RID &rid); Notice that each RID still consists of a page number and an offset. Briefly explain how this function is internally implemented using the PAX format. For each field, use the header of the page to locate the beginning of its mini-page. Go to the directory of the mini-page. Add a field value to the mini-page, and add its offset to its directory. If a mini-page is full, try to shrink other mini-pages to allocate more space for this mini-page. e) (3 points) Give one query where this PAX format is more efficient than the traditional page format that is implemented in our projects. SELECT age FROM emp; Reason: We only need to access the values from the mini-page for the age attribute. 19