COSC-4411(M) Midterm #1


12 February 2004, COSC-4411(M) Midterm #1 & answers

COSC-4411(M) Midterm #1

Sur / Last Name:            Given / First Name:
Student ID:
Instructor: Parke Godfrey
Exam Duration: 75 minutes
Term: Winter 2004

Answer the following questions to the best of your knowledge. Be precise and be careful. The exam is open-book and open-notes. Write any assumptions you need to make along with your answers, whenever necessary.

There are five major questions. Points for each question and sub-question are as indicated. In total, the exam is out of 50 points. If you need additional space for an answer, just indicate clearly where you are continuing.

Regrade Policy: Regrading should only be requested in writing. Write what you would like to be reconsidered. Note, however, that an exam accepted for regrading will be reviewed and regraded in its entirety (all questions).

Grading Box
1.    2.    3.    4.    5.    Total

1. (10 points) Buffer Pool. Okay, I've replaced the replacement strategy. What next? [short answer / analysis]

Dr. Mark Dogfurry of Very Small Databases, Inc., has devised the following replacement strategy. Within the database system, every transaction has a unique timestamp value, start, which is the time the transaction commenced. (A transaction is, for example, a query executing. It will pin, and then unpin, a number of pages.) It is always a transaction that pins a page.

Associated with each buffer pool frame is an xtime and a ctime. When a page's pin count = 0 and the page is then pinned, or the page is initially fetched into the pool, its frame's xtime is set equal to the start value of the transaction that requested (pinned) the page. If the page is already pinned (pin count > 0), its frame's xtime is set to the transaction's start if start is newer than the frame's current xtime; otherwise, its xtime value is left as is. The frame's ctime is set to the current clock time whenever the page's pin count becomes 0.

For replacement, the page with the oldest xtime over all pages with pin count = 0 is chosen. In the case of ties for oldest xtime, the one with the newest ctime among those is chosen.

a. (3 points) What type of replacement strategy is this? (LRU, MRU, Clock, hybrid, etc.?) Briefly describe.

The strategy acts most like LRU. A page is chosen for replacement based on the oldest timestamp (xtime), which is what LRU does. It differs from LRU in how it handles ties for oldest (here, ties on xtime). LRU might pick randomly among ties for oldest. To be fair, on a single-processor machine, LRU would not see any ties; unpin times would be sequential. Dogfurry's strategy, on the other hand, chooses the page with the youngest ctime across pages tied for the oldest xtime. So this acts as MRU over pages of the same xtime.

Furthermore, ties on xtime will be common, since a single transaction will pin and unpin many pages. Thus, his strategy is truly a hybrid.

A note: While this is mostly LRU-like, it differs in a significant way. LRU is generally done with respect to when pages were unpinned (to pin count 0). This strategy is based not on the pages' times, but on the transactions' times.
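The victim-selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the exam: the Frame record and the function name choose_victim are my own, and only the selection rule (oldest xtime among unpinned frames, ties broken by newest ctime) comes from the problem statement.

```python
# Sketch of Dr. Dogfurry's victim selection, assuming frames are simple
# records carrying pin_count, xtime, and ctime fields (names are mine).
from dataclasses import dataclass

@dataclass
class Frame:
    page_id: int
    pin_count: int
    xtime: float  # start timestamp of the newest transaction to pin the page
    ctime: float  # clock time when the pin count last dropped to 0

def choose_victim(frames):
    """Pick the unpinned frame with the oldest xtime;
    break ties by the *newest* ctime (the MRU-like twist)."""
    candidates = [f for f in frames if f.pin_count == 0]
    if not candidates:
        return None  # every page is pinned; no replacement possible
    # min over (xtime ascending, ctime descending)
    return min(candidates, key=lambda f: (f.xtime, -f.ctime))

frames = [
    Frame(page_id=1, pin_count=0, xtime=100.0, ctime=205.0),
    Frame(page_id=2, pin_count=0, xtime=100.0, ctime=210.0),  # ties on xtime, newer ctime
    Frame(page_id=3, pin_count=1, xtime=90.0,  ctime=180.0),  # pinned: ineligible
    Frame(page_id=4, pin_count=0, xtime=120.0, ctime=190.0),
]
victim = choose_victim(frames)  # frames 1 and 2 tie on oldest xtime; 2 wins on ctime
```

Note how the tie-break inverts ctime, which is exactly where the strategy stops being pure LRU.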

b. (4 points) Identify an advantage Dr. Dogfurry's replacement strategy might have (compared with the basic replacement strategies of LRU, MRU, and Clock).

This should solve the problem of sequential flooding that LRU can have, for many cases. When a given transaction must read a sequence of pages repeatedly, since these pages are marked with the same xtime, replacement over those will be done by MRU. So sequential flooding will not occur here. Most cases of (potential) sequential flooding probably do occur within the scope of a transaction, so Dogfurry's strategy may offer a good solution.

However, the strategy does not solve all cases of sequential flooding. If a sequence of pages were being pinned by different transactions, then they would be replaced according to LRU. This would seem to be an unusual scenario, though.

c. (3 points) Identify a disadvantage Dr. Dogfurry's replacement strategy might have (compared with the basic replacement strategies of LRU, MRU, and Clock).

The strategy favors newer transactions over older ones, since pages with older xtimes are replaced first. Thus, an old transaction is not likely to have its requests in the buffer pool, so it will slow down. Longer transactions will run longer and become older in comparison to other transactions. Thus, long transactions will be slowed down under this strategy. On the other hand, short transactions should speed up. Our buffer pool replacement strategy should not play favorites among transactions (unless we have designed it to on purpose); this is a side effect.

Within a transaction, MRU is used. This ignores data locality, which is the reason LRU tends to be better than MRU generally. So MRU may not be the best choice of strategy within a transaction.

This strategy has more overhead than plain LRU, MRU, or Clock. It probably could be implemented efficiently, so this is a minor complaint. But it is important that the replacement strategy be very fast, because it is at the core of the buffer pool routines and is called extremely often.

This still doesn't accommodate all forms of sequential flooding, namely flooding that can occur due to the interaction of multiple transactions. Whether this type of sequential flooding is common (and thus important to remedy) would need to be studied.

I was not looking for all these answers, but for a reasonable observation about what the disadvantages might be.

2. (10 points) Index Logic. Take the next index to the left. [short answer / exercise]

a. (5 points) You are told that the following indexes are available on the table Employee:

   key             type  clustered?
A. name, address   tree  yes
B. age, salary     hash  no
C. name            tree  no
D. salary, age     hash  no
E. name, age       tree  yes

You are suspicious that this information is not correct. Why? Identify three problems with what is reported.

It is impossible that C. is unclustered if A. or E. is clustered. It is not possible to have two clustered indexes on the same table with different keys: A. & E. It makes no sense to have both B. and D.; they are functionally identical. ...

b. (5 points) Consider the query

SELECT order#, amount, when
FROM Purchases
WHERE amount BETWEEN 25 AND 30
  AND when > '1999-11-14';

There are 10,000,000 purchase records. There are 25 records on each data-record page, on average. 4,000,000 purchase records have when > '1999-11-14'. 50,000 purchase records have 25 <= amount <= 30.

Two indexes are available:

A. A clustered B+ tree index on when of type alternative 2. The index pages are three deep, with the leaf pages at depth four.
B. An unclustered B+ tree index on amount of type alternative 2. The index pages are three deep, with the leaf pages at depth four.

For each index, 50 data entries fit per data-entry (leaf) page. What is the I/O cost of using each index to evaluate the query? So which index is best for this?

For A., it will cost 3 I/Os to read the index pages from the root down, one I/O to read the data-entry page at the beginning of the range, and then 160,000 I/Os to read the data-record pages with the matching records. Four million records match the when condition. At 25 records per page, they occupy 160,000 pages. The index is clustered, so the matching records are clustered together. Therefore, it costs about 160,004 I/Os to fetch the records; we check the amount condition on the fly.

Some said that we would read 80,000 data-entry pages (all the matching entries), and fetch the records based on the entries, reading roughly 160,000 data-record pages. Thus, the total would be 240,003 I/Os. This is not how the textbook presents it; we can read data-record pages sequentially for a range. However, in real systems, a clustered index is not fully clustered; the data-record pages are allowed to become slightly unsorted. This is a compromise for efficiency on updates. As a consequence, under this design, the data-entry pages must be read. I counted this as right too.

For B., it will cost 3 I/Os to read the index pages from the root down, and 1,000 I/Os to read the data entries that match on the amount condition. This time we read all the matching data entries regardless, because the index is unclustered. 50,000 records match, and since 50 data entries fit per page, that is 1,000 pages. Then, to fetch each of the 50,000 records, it will cost us one I/O each to fetch the appropriate data-record page. So 51,003 I/Os. (We might save some of the 50,000 due to hits in the buffer pool. However, the file is 400,000 pages in size, so the savings here will be negligible.)

Using B. wins.
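The two cost tallies above can be checked with a back-of-the-envelope calculation. This sketch is not part of the exam; the function names are mine, and it simply encodes the stated assumptions: 25 records per data-record page, 50 data entries per leaf page, and 3 I/Os for the non-leaf index pages.

```python
# Back-of-the-envelope check of the I/O costs for indexes A and B.
INDEX_DEPTH_IOS = 3     # non-leaf index pages read, root down
RECORDS_PER_PAGE = 25   # data records per data-record page
ENTRIES_PER_LEAF = 50   # data entries per leaf (data-entry) page

def cost_clustered(matching_records):
    """Clustered index: one leaf I/O to find the start of the range,
    then sequential data-record pages holding the matches."""
    data_pages = matching_records // RECORDS_PER_PAGE
    return INDEX_DEPTH_IOS + 1 + data_pages

def cost_unclustered(matching_records):
    """Unclustered index: read every matching leaf page,
    then one data-record I/O per matching record."""
    leaf_pages = matching_records // ENTRIES_PER_LEAF
    return INDEX_DEPTH_IOS + leaf_pages + matching_records

cost_a = cost_clustered(4_000_000)  # index A on `when`: 160,004 I/Os
cost_b = cost_unclustered(50_000)   # index B on `amount`: 51,003 I/Os
```

The arithmetic makes the moral of the exercise plain: a clustered index pays per page of matches, while an unclustered index pays per matching record, so the much more selective predicate wins despite the unclustered index.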

3. (10 points) General. Grab bag. [multiple choice]

a. (2 points) Consider
   I. clustered tree indexes
   II. unclustered tree indexes
   III. clustered hash indexes
   IV. unclustered hash indexes
Range queries can benefit from
   A. Just I.
   B. Just I & II.
   C. Just I, II, & III.
   D. Just I & III.
   E. Potentially any of I, II, III, & IV.

b. (2 points) The buffer manager manages
   A. lock management for transaction processing
   B. query processing
   C. file allocation and deallocation
   D. disk memory
   E. main memory for the database system.

c. (2 points) Which of the following is false?
   A. Locating a record by key in a sorted file by binary search and locating it via a B+ tree make practically the same number of key comparisons.
   B. Locating a record by key in a sorted file by binary search requires more I/Os than locating it via a B+ tree, in general.
   C. A bulk build of a B+ tree is faster than building it by inserting a record at a time.
   D. If the data records are kept in a sorted file, there is no need for a B+ tree index based on the same search / sort key.
   E. If there is an unclustered B+ tree index over the data records, this does not mean that the records are necessarily sorted.

d. (2 points) Which of the following is false?
   A. The trend is that disk I/O speeds are getting faster in ratio to CPU speeds.
   B. Page size is dictated by the hardware.
   C. Generally, many records fit on a page.
   D. Sequential reads and writes are important to a database system's performance.
   E. Generally, I/O time dominates CPU time in database operations.

e. (2 points) The external merge sort routine
   A. for its merge passes requires that the input runs all be of equal length.
   B. can accommodate variable-length input runs in a merge pass, but may in that case need to allocate more output frames.
   C. must use quick-sort in its pass 0.
   D. may not be faster sorting a given input file given twice the buffer pool allocation.
   E. cannot sort a file that is already sorted on a different key.

4. (10 points) Index Mechanics. Always losing your keys? [exercise]

A linear hash has just been started. The linear hashed file currently has just one bucket (primary page). The current hash function pair is h0, h1. Here, h0 masks for zero (!) right-hand bits from the hashed key, and so always returns bucket address 0. Hash function h1 masks for 1 right-hand bit, h2 for 2, and so forth. Assume that each page can hold two entries. The file currently has one entry, 21 (binary 10101):

bucket 0 (next): [21]

A split should be triggered whenever an overflow page is created. Show the linear hashed file after each of the following inserts: 14 (binary 1110), 7 (binary 111), 35 (binary 100011), and 28 (binary 11100). The insertions are cumulative, so your final hashed file should contain 21, 14, 7, 35, and 28.

After inserting 14:

bucket 0 (next): [21, 14]

Inserting 7 creates an overflow page, which triggers a split of bucket 0 on h1:

bucket 0 (next): [21, 14] -> overflow: [7]

bucket 0 (next): [14]
bucket 1:        [21, 7]

(The round is now complete, so next returns to bucket 0 and the hash pair becomes h1, h2.)

Inserting 35 (which hashes to bucket 1) creates an overflow page, which triggers a split of bucket 0 on h2 (14, binary 1110, moves to bucket 10):

bucket 00:        []
bucket 01 (next): [21, 7] -> overflow: [35]
bucket 10:        [14]

Inserting 28 (binary 11100) hashes to bucket 00, which has room, so no split occurs:

bucket 00:        [28]
bucket 01 (next): [21, 7] -> overflow: [35]
bucket 10:        [14]
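The sequence of file states can be reproduced with a short simulation. This is an illustrative sketch, not part of the exam: the class and method names are mine, and buckets are modeled as plain lists whose first two slots stand for the primary page and whose extra slots stand for overflow. The split-on-overflow trigger and the two-entry pages follow the problem statement.

```python
# Simulation of the exam's linear hashed file, split on overflow.
PAGE_SIZE = 2  # entries per primary page, per the problem statement

class LinearHashFile:
    def __init__(self):
        self.buckets = [[21]]  # one bucket, already holding 21
        self.level = 0         # h_level masks `level` right-hand bits
        self.next = 0          # next bucket due to be split

    def _addr(self, key):
        # Hash with h_level; a bucket before `next` was already split
        # this round, so rehash those keys with h_{level+1}.
        a = key & ((1 << self.level) - 1)
        if a < self.next:
            a = key & ((1 << (self.level + 1)) - 1)
        return a

    def insert(self, key):
        b = self._addr(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > PAGE_SIZE:  # overflow page created
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        hi_bit = 1 << self.level  # distinguishing bit of h_{level+1}
        self.buckets[self.next] = [k for k in old if not (k & hi_bit)]
        self.buckets.append([k for k in old if k & hi_bit])
        self.next += 1
        if self.next == (1 << self.level):  # round complete
            self.level += 1
            self.next = 0

lhf = LinearHashFile()
for k in (14, 7, 35, 28):
    lhf.insert(k)
# Final state: bucket 00 -> [28], bucket 01 -> [21, 7, 35]
# (35 on an overflow page), bucket 10 -> [14]; next points at bucket 01.
```

Note that the split always hits the bucket `next` points to, not necessarily the bucket that overflowed, which is why bucket 01 keeps its overflow page in the final state.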

5. (10 points) External Sorting. I'll pass. [analysis]

Consider the following "optimization". If the last merge of the current merge pass would involve a k-way merge where k < (B-1), then the routine fills out the last merge by adding (B-1) - k current runs (runs just created in this same pass in previous merge steps) to make for a (B-1)-way merge, when possible.

a. (5 points) Say that pass (i-1) produced 8 runs and that we are doing 3-way merges. In pass i, we first take 3 of the 8 runs and merge them into a new single run. We next take the next 3 of the 8 runs and merge them into a new run. Now we only have 2 of the original 8 runs left. In the basic external sort routine, we would finish pass i by merging these 2 runs in a 2-way merge. Under the revised algorithm (with the "optimization"), we would perform a 3-way merge again instead, using the remaining 2 runs of the 8 (from pass (i-1)) and one of the 2 new runs just created in the first two merges of pass i.

Does this help in this example? Why or why not?

Let us say i = 1 here, and pass zero made 8 runs of length 4 (= B). So the file is 32 pages in size. In pass one, in the first merge, we merge 3 of the 8 runs into one run of length 12; in the second merge, we merge the next 3 of the 8 runs into one run of length 12; in the third merge, we normally merge the remaining 2 of the 8 runs into one run of length 8. This costs 64 I/Os (= 2 x the size of the file).

Under the modification, in the third merge of pass one, we would borrow one of the runs we made previously in pass one so we could do a 3-way merge instead of a 2-way merge. The run we borrow is of length 12, and we merge it with the remaining 2 runs of length 4 each, resulting in a run of length 20. The cost of the pass is now 88 I/Os, however! We had to read in the borrowed run and write it out in addition to the other I/Os.

In pass two, in the original version, we merge the three resulting runs from pass one into a single run, and we are done. This costs 64 I/Os. In pass two, in the new version, we merge the two resulting runs from pass one into a single run, and we are done. This also costs 64 I/Os.

So in this example, pass one costs more for the new version than for the original version. The other passes cost the same. So the new version was more expensive!
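The pass-by-pass accounting in part a. can be tallied mechanically. This sketch is not from the exam; it only encodes the convention used above, that a merge step costs one I/O per input page read plus one per output page written.

```python
# Tally of the merge costs in part a.: pass 0 leaves 8 runs of 4 pages each.
def merge_cost(run_lengths):
    """I/Os for one merge step: read every input page, write every output page."""
    return 2 * sum(run_lengths)

# Pass one, basic routine: two 3-way merges, then a 2-way merge.
basic_pass_one = merge_cost([4, 4, 4]) + merge_cost([4, 4, 4]) + merge_cost([4, 4])
# Pass one, "optimized": the last merge borrows a length-12 run just created.
opt_pass_one = merge_cost([4, 4, 4]) + merge_cost([4, 4, 4]) + merge_cost([4, 4, 12])

# Pass two costs the same either way: all 32 pages are read and written once.
basic_pass_two = merge_cost([12, 12, 8])  # three runs left
opt_pass_two = merge_cost([12, 20])       # two runs left
```

The 24 extra I/Os in the optimized pass one are exactly the cost of re-reading and re-writing the borrowed 12-page run, with no compensating savings later.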

b. (5 points) Does this "optimization" actually always / ever make an external sort more efficient? That is, does it always / ever save I/Os? Argue briefly why or why not.

In part a., we established that there are cases in which the new algorithm costs more. Does it always cost more? No. When the new version results in fewer passes, it is less expensive, since it saves all the I/Os of an additional pass.

Consider that in the example above, the file was instead only 28 pages long. Under the original algorithm, pass zero makes 7 runs of length 4 each. Pass one would make 3 runs: 2 of length 12, and one of length 4 (a 1-way merge!). Pass two finishes with a single run. Each pass reads and writes the 28-page file, so the total is 3 x 56 = 168 I/Os.

Under the new algorithm, pass zero makes 7 runs of length 4 each, as before. Pass one would make 2 runs, as before, in the first and second merges. For the third merge, it would borrow the two runs just created to fill out the 3-way merge. This would result in a single run. So no pass two is necessary! The extra cost is reading and writing the two borrowed runs an extra time: 2 x 2 x 12 = 48. Plus the 2 passes: 2 x 56 = 112. So 160 I/Os in all.

Okay, not a huge savings in this example. However, we can show other cases in which the savings is much more significant.

(Scratch space.)

Relax. Turn in your exam. Go home.