12 February 2004    COSC-4411(M) Midterm #1 & answers    p. 1 of 10

COSC-4411(M) Midterm #1

Sur / Last Name:
Given / First Name:
Student ID:

Instructor: Parke Godfrey
Exam Duration: 75 minutes
Term: Winter 2004

Answer the following questions to the best of your knowledge. Be precise and be careful. The exam is open-book and open-notes. Write any assumptions you need to make along with your answers, whenever necessary.

There are five major questions. Points for each question and sub-question are as indicated. In total, the exam is out of 50 points. If you need additional space for an answer, just indicate clearly where you are continuing.

Regrade Policy
Regrading should only be requested in writing. Write what you would like to be reconsidered. Note, however, that an exam accepted for regrading will be reviewed and regraded in entirety (all questions).

Grading Box
1.  2.  3.  4.  5.  Total
1. (10 points) Buffer Pool. Okay, I've replaced the replacement strategy. What next? [short answer / analysis]

Dr. Mark Dogfurry of Very Small Databases, Inc., has devised the following replacement strategy. Within the database system, every transaction has a unique timestamp value, start, which is the time the transaction commenced. (A transaction is, for example, a query executing. It will pin, and then unpin, a number of pages.) It is always a transaction that pins a page.

Associated with each buffer pool frame is an xtime and a ctime. When a page's pin count = 0 and the page is then pinned, or the page is initially fetched into the pool, its frame's xtime is set equal to the start value of the transaction that requested (pinned) the page. If the page is already pinned (pin count > 0), its frame's xtime is set to the transaction's start if start is newer than the frame's current xtime; otherwise, its xtime value is left as is. The frame's ctime is set to the current clock time whenever the page's pin count becomes 0. For replacement, the page with the oldest xtime over all pages with pin count = 0 is chosen. In the case of ties for oldest xtime, the one with the newest ctime of those is chosen.

a. (3 points) What type of replacement strategy is this? (LRU, MRU, Clock, hybrid, etc.?) Briefly describe.

The strategy acts most like LRU. A page is chosen for replacement based on the oldest timestamp (xtime), which is what LRU does. It differs from LRU in how it handles ties for oldest (here, ties on xtime). LRU might pick randomly among ties for oldest. To be fair, on a single-processor machine, LRU would not see any ties; unpin times would be sequential. Dogfurry's strategy, on the other hand, chooses the page with the youngest ctime across pages tied for the oldest xtime. So it acts as MRU over pages with the same xtime.
Furthermore, ties on xtime will be common, since a single transaction will pin and unpin many pages. Thus, his strategy is truly a hybrid. A note: while this is mostly LRU-like, it differs in a significant way. LRU is generally done with respect to when pages were unpinned (to pin count 0). This strategy is based not on the pages' times, but on the transactions' times.
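The victim-selection rule above can be sketched in a few lines. (A hypothetical sketch: the frame fields pin_count, xtime, and ctime come from the description, but the dictionary representation is an illustration, not a prescribed implementation.)

```python
def choose_victim(frames):
    """Dogfurry's rule: among frames with pin_count == 0, take the one
    with the oldest xtime; break ties on xtime by the newest ctime."""
    candidates = [f for f in frames if f["pin_count"] == 0]
    if not candidates:
        return None  # every page is pinned: nothing can be replaced
    # min over (xtime, -ctime): oldest xtime first, then newest ctime
    return min(candidates, key=lambda f: (f["xtime"], -f["ctime"]))
```

Note the negated ctime in the key: an ordinary lexicographic minimum then prefers the oldest xtime but, within a tie, the largest (newest) ctime.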
b. (4 points) Identify an advantage Dr. Dogfurry's replacement strategy might have (compared with the basic replacement strategies of LRU, MRU, and Clock).

This should solve the problem of sequential flooding that LRU can have, in many cases. When a given transaction must read a sequence of pages repeatedly, since these pages are marked with the same xtime, replacement over them is done by MRU. So sequential flooding will not occur here. Most cases of (potential) sequential flooding probably do occur within the scope of a transaction, so Dogfurry's strategy may offer a good solution.

However, the strategy does not solve all cases of sequential flooding. If a sequence of pages were being pinned by different transactions, then they would be replaced according to LRU. This would seem to be an unusual scenario, though.

c. (3 points) Identify a disadvantage Dr. Dogfurry's replacement strategy might have (compared with the basic replacement strategies of LRU, MRU, and Clock).

The strategy favors newer transactions over older ones, since older xtimes are replaced first. Thus, an old transaction is not likely to have its requested pages in the buffer pool, so it will slow down. Longer transactions will run longer and become older in comparison to other transactions. Thus long transactions will be slowed down under this strategy. On the other hand, short transactions should speed up. Our buffer pool replacement strategy should not play favorites among transactions (unless we have designed it to do so on purpose); this is a side-effect.

Within a transaction, MRU is used. This ignores data locality, which is the reason LRU tends to be better than MRU generally. So MRU may not be the best choice of strategy within a transaction.

This strategy has more overhead than plain LRU, MRU, or Clock. It probably could be implemented efficiently, so this is a minor complaint.
But it is important that the replacement strategy be very fast, because it is at the core of the buffer pool routines and is called extremely often. This still does not accommodate all forms of sequential flooding, namely flooding that can occur due to the interaction of multiple transactions. Whether this type of sequential flooding is common (and thus important to remedy) would need to be studied. I was not looking for all of these answers, but for a reasonable observation about what the disadvantages might be.
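The sequential-flooding contrast can be seen in a small simulation. (A sketch with made-up numbers, not part of the exam: a 4-frame pool repeatedly scanning 5 pages. Under LRU every access misses; under MRU the pool warms up and then misses only once per scan.)

```python
def simulate(pages, n_frames, policy):
    """Tiny buffer-pool model; returns the number of misses (page I/Os).
    `pool` holds resident pages ordered least- to most-recently used."""
    pool, misses = [], 0
    for p in pages:
        if p in pool:
            pool.remove(p)          # hit: refresh recency below
        else:
            misses += 1
            if len(pool) == n_frames:
                if policy == "LRU":
                    pool.pop(0)     # evict least recently used
                else:               # "MRU"
                    pool.pop()      # evict most recently used
        pool.append(p)              # p is now the most recently used
    return misses

# Scan pages 0..4 three times with only 4 frames available.
workload = [p for _ in range(3) for p in range(5)]
lru_misses = simulate(workload, 4, "LRU")   # every access misses: 15
mru_misses = simulate(workload, 4, "MRU")   # 5 cold misses, then 1 per lap: 7
```

This is exactly the pathology discussed above: with a scan one page larger than the pool, LRU always evicts the page it is about to need.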
2. (10 points) Index Logic. Take the next index to the left. [short answer / exercise]

a. (5 points) You are told that the following indexes are available on the table Employee:

   key             type  clustered?
A. name, address   tree  yes
B. age, salary     hash  no
C. name            tree  no
D. salary, age     hash  no
E. name, age       tree  yes

You are suspicious that this information is not correct. Why? Identify three problems with what is reported.

It is impossible that C. is unclustered if A. or E. is clustered. It is not possible to have two clustered indexes on the same table with different keys: A. & E. It makes no sense to have both B. and D.; they are functionally identical.
b. (5 points) Consider the query

SELECT order#, amount, when
FROM Purchases
WHERE amount BETWEEN 25 AND 30
  AND when > '1999-11-14';

There are 10,000,000 purchase records. There are 25 records on each data-record page, on average. 4,000,000 purchase records have when > '1999-11-14'. 50,000 purchase records have 25 ≤ amount ≤ 30. Two indexes are available:

A. A clustered B+ tree index on when of type alternative 2. The index pages are three deep, with the leaf pages at depth four.
B. An unclustered B+ tree index on amount of type alternative 2. The index pages are three deep, with the leaf pages at depth four.

For each index, 50 data entries fit per data-entry (leaf) page. What is the I/O cost of using each index to evaluate the query? So which index is best for this?

For A., it will cost 3 I/Os to read the index pages from the root down, one I/O to read the data-entry page at the beginning of the range, and then 160,000 I/Os to read the data-record pages with the matching records. Four million records match the when condition. At 25 records per page, they occupy 160,000 pages. The index is clustered, so the matching records are clustered together. Therefore, it costs about 160,004 I/Os to fetch the records; we check the amount condition on the fly.

Some said that we would read 80,000 data-entry pages (all the matching entries), and fetch the records based on the entries, reading roughly 160,000 data-record pages. Thus, the total would be 240,003 I/Os. This is not how the textbook presents it; we can read the data-record pages sequentially for a range. However, in real systems, a clustered index is not fully clustered; the data-record pages are allowed to become slightly unsorted. This is a compromise for efficiency on updates. As a consequence, though, under this design the data-entry pages must be read. I counted this as right too.
For B., it will cost 3 I/Os to read the index pages from the root down, and 1,000 I/Os to read the data-entry pages that match on the amount condition. This time we read all the matching data entries regardless, because the index is unclustered. 50,000 records match, and since 50 data entries fit per page, that is 1,000 pages. Then, to fetch each of the 50,000 records, it costs us an I/O each to fetch the appropriate data-record page. So 51,003 I/Os. (We might save some of the 50,000 due to hits in the buffer pool. However, the file is 400,000 pages in size, so the savings here will be negligible.)

Using B. wins.
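The two cost estimates reduce to straightforward arithmetic; a quick check with the numbers given in the question:

```python
import math

RECS_PER_PAGE = 25        # data records per data-record page
ENTRIES_PER_LEAF = 50     # data entries per leaf (data-entry) page
NONLEAF_IOS = 3           # index pages read from the root down

# Index A (clustered B+ tree on `when`): walk the index, read one leaf
# to find the start of the range, then scan the matching data-record
# pages sequentially, checking the amount condition on the fly.
matching_when = 4_000_000
cost_A = NONLEAF_IOS + 1 + math.ceil(matching_when / RECS_PER_PAGE)

# Index B (unclustered B+ tree on `amount`): read every matching leaf
# page, then pay one data-record-page I/O per matching record.
matching_amount = 50_000
cost_B = (NONLEAF_IOS
          + math.ceil(matching_amount / ENTRIES_PER_LEAF)  # leaf pages
          + matching_amount)                               # record fetches
```

cost_A works out to 160,004 and cost_B to 51,003, so the unclustered index on the more selective predicate wins despite its one-I/O-per-record fetch cost.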
3. (10 points) General. Grab bag. [multiple choice]

a. (2 points) Consider
   I. clustered tree indexes
   II. unclustered tree indexes
   III. clustered hash indexes
   IV. unclustered hash indexes
Range queries can benefit from
   A. Just I.
   B. Just I & II.
   C. Just I, II, & III.
   D. Just I & III.
   E. Potentially any of I, II, III, & IV.

b. (2 points) The buffer manager manages
   A. lock management for transaction processing
   B. query processing
   C. file allocation and deallocation
   D. disk memory
   E. main memory for the database system.

c. (2 points) Which of the following is false?
   A. Locating a record by key in a sorted file by binary search and locating it via a B+ tree make practically the same number of key comparisons.
   B. Locating a record by key in a sorted file by binary search requires more I/Os than locating it via a B+ tree, in general.
   C. A bulk build of a B+ tree is faster than building it by inserting a record at a time.
   D. If the data records are kept in a sorted file, there is no need for a B+ tree index based on the same search / sort key.
   E. If there is an unclustered B+ tree index over the data records, this does not mean that the records are necessarily sorted.

d. (2 points) Which of the following is false?
   A. The trend is that disk I/O speeds are getting faster in ratio to CPU speeds.
   B. Page size is dictated by the hardware.
   C. Generally, many records fit on a page.
   D. Sequential reads and writes are important to a database system's performance.
   E. Generally, I/O time dominates CPU time in database operations.

e. (2 points) The external merge sort routine
   A. for its merge passes requires that the input runs all be of equal length.
   B. can accommodate variable-length input runs in a merge pass, but may in that case need to allocate more output frames.
   C. must use quick-sort in its pass 0.
   D. may not be faster sorting a given input file given twice the buffer pool allocation.
   E. cannot sort a file that is already sorted on a different key.
4. (10 points) Index Mechanics. Always losing your keys? [exercise]

A linear hashed file has just been started. The file currently has just one bucket (primary page). The current hash function pair is h0, h1. Here, h0 masks for zero (!) right-hand bits from the hashed key, and so always returns bucket address 0. Hash function h1 masks for 1 right-hand bit, h2 for 2, and so forth. Assume that each page can hold two entries. The file currently has one entry, 21 (10101₂).

next→ 0: [21]

A split should be triggered whenever an overflow page is created. Show the linear hashed file after each of the following inserts: 14 (1110₂), 7 (111₂), 35 (100011₂), and 28 (11100₂). The insertions are cumulative, so your final hashed file should contain 21, 14, 7, 35, and 28.

After inserting 14:
next→ 0: [21, 14]

Inserting 7 creates an overflow page, triggering a split of bucket 0:
next→ 0: [21, 14] + overflow [7]
After the split (next wraps around; the h1 level is now complete):
next→ 0: [14]
      1: [21, 7]

Inserting 35 overflows bucket 1, triggering a split of bucket 0 (next):
      00: []
next→ 01: [21, 7] + overflow [35]
      10: [14]

After inserting 28 (it fits in bucket 00, so no split):
      00: [28]
next→ 01: [21, 7] + overflow [35]
      10: [14]
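The trace above can be checked with a small simulation. (A sketch: overflow pages are modeled simply by letting a bucket's list grow past the two-entry primary page, and a split fires as soon as the first overflow entry arrives, which is all this exercise requires.)

```python
def h(key, bits):
    """The hash family h_i: mask off all but the low-order `bits` bits."""
    return key & ((1 << bits) - 1)

class LinearHashFile:
    CAPACITY = 2                       # entries per primary page

    def __init__(self, first_key):
        self.buckets = [[first_key]]   # bucket contents, overflow included
        self.level = 0                 # current (h_level, h_level+1) pair
        self.next = 0                  # next bucket due to be split

    def insert(self, key):
        b = h(key, self.level)
        if b < self.next:              # bucket already split this round
            b = h(key, self.level + 1)
        self.buckets[b].append(key)
        # the first entry past CAPACITY creates an overflow page,
        # which triggers a split
        if len(self.buckets[b]) == self.CAPACITY + 1:
            self._split()

    def _split(self):
        old, self.buckets[self.next] = self.buckets[self.next], []
        self.buckets.append([])        # image bucket at next + 2^level
        for k in old:                  # redistribute with h_{level+1}
            self.buckets[h(k, self.level + 1)].append(k)
        self.next += 1
        if self.next == 1 << self.level:   # finished a round of splits
            self.level += 1
            self.next = 0

lhf = LinearHashFile(21)
for key in (14, 7, 35, 28):
    lhf.insert(key)
# final state: 00 = [28], 01 = [21, 7, 35], 10 = [14], next -> bucket 01
```

Running this reproduces the final file shown above, with 35 sitting on bucket 01's overflow page and next pointing at bucket 01.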
5. (10 points) External Sorting. I'll pass. [analysis]

Consider the following optimization. If the last merge of the current merge pass would involve a k-way merge where k < (B − 1), then the routine fills out the last merge by adding (B − 1) − k current runs (runs just created in this same pass in previous merge steps) to make for a (B − 1)-way merge, when possible.

a. (5 points) Say that pass (i − 1) produced 8 runs and that we are doing 3-way merges. In pass i, we first take 3 of the 8 runs and merge them into a new single run. We next take the next 3 of the 8 runs and merge them into a new run. Now we have only 2 of the original 8 runs left. In the basic external sort routine, we would finish pass i by merging these 2 runs in a 2-way merge. Under the revised algorithm (with the "optimization"), we would perform a 3-way merge again instead, using the remaining 2 runs of the 8 (from pass (i − 1)) and one of the 2 new runs just created in the first two merges of pass i. Does this help in this example? Why or why not?

Let us say i = 1 here, and pass zero made 8 runs of length 4 (B = 4). So the file is 32 pages in size. In pass one, in the first merge, we merge 3 of the 8 runs into one run of length 12; in the second merge, we merge the next 3 of the 8 runs into one run of length 12; in the third merge, we normally merge the remaining 2 of the 8 runs into one run of length 8. This costs 64 I/Os (= 2 × the size of the file).

Under the modification, in the third merge of pass one, we would borrow one of the runs we made previously in pass one so that we could do a 3-way merge instead of a 2-way merge. The run we borrow is of length 12, and we merge it with the remaining 2 runs of length 4 each, resulting in a run of length 20. The cost of the pass is now 88 I/Os, however! We had to read in the borrowed run and write it out again, in addition to the other I/Os.
In pass two, in the original version, we merge the three resulting runs from pass one into a single run, and we are done. This costs 64 I/Os. In pass two, in the new version, we merge the two resulting runs from pass one into a single run, and we are done. This also costs 64 I/Os. So in this example, pass one costs more in the new version than in the original version, and the other passes cost the same. The new version was more expensive!
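The pass-one comparison reduces to simple arithmetic: every page of the file is read and written once per pass, plus the borrowed run is read and written a second time.

```python
RUN_LEN = 4                # pages per pass-0 run (B = 4)
file_pages = 8 * RUN_LEN   # 8 runs -> a 32-page file

# Basic pass one: two 3-way merges and a 2-way merge; every page of
# the file is read once and written once.
cost_basic = 2 * file_pages                 # 64 I/Os

# "Optimized" pass one: the final merge borrows a 12-page run already
# produced earlier in this pass, so those pages are read and written
# a second time.
borrowed_pages = 3 * RUN_LEN                # one 12-page run
cost_optimized = cost_basic + 2 * borrowed_pages   # 88 I/Os
```

Since pass two costs 64 I/Os either way, the 24 extra I/Os in pass one are a pure loss here.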
b. (5 points) Does this optimization actually always / ever make an external sort more efficient? That is, does it always / ever save I/Os? Argue briefly why or why not.

In part a., we established that there are cases in which the new algorithm costs more. Does it always cost more? No. When the new version results in fewer passes, it is less expensive, since it saves all the I/Os of an additional pass.

Consider the example above, but with a file only 28 pages long. Under the original algorithm, pass zero makes 7 runs of length 4 each. Pass one would make 3 runs: 2 of length 12, and one of length 4 (a 1-way merge!). Pass two finishes with a single run. Each pass reads and writes the whole file, costing 2 × 28 = 56 I/Os, so the total is 3 × 56 = 168 I/Os.

Under the new algorithm, pass zero makes 7 runs of length 4 each, like before. Pass one would make 2 runs like before in the first and second merges. For the third merge, it would borrow the two just-created runs to fill out the 3-way merge. This would result in a single run. So no pass two is necessary! The extra cost is reading and writing the two borrowed runs an extra time: 2 × 2 × 12 = 48 I/Os. Plus the 2 passes: 2 × 56 = 112. So 160 I/Os in all.

Okay, not a huge savings in this example. However, one can construct other cases in which the savings is much more significant.
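The 28-page variant likewise reduces to arithmetic; each pass over the 28-page file reads and writes all 28 pages, i.e., 56 I/Os per pass.

```python
RUN_LEN = 4
file_pages = 28                      # 7 pass-0 runs of 4 pages each
pass_cost = 2 * file_pages           # read + write the whole file: 56 I/Os

# Original: pass 0, pass 1 (two 3-way merges plus a 1-way "merge"),
# pass 2 (merge the three remaining runs) -- three full passes.
cost_original = 3 * pass_cost                  # 168 I/Os

# Optimized: pass 1's last merge borrows the two 12-page runs just
# created, yielding one 28-page run, so pass 2 disappears.  The two
# borrowed runs are read and written one extra time each.
extra = 2 * (2 * (3 * RUN_LEN))                # 48 extra I/Os
cost_optimized = 2 * pass_cost + extra         # 160 I/Os
```

Saving one full pass (56 I/Os) outweighs the 48 I/Os of re-merging the borrowed runs, so the optimization wins whenever it eliminates a pass.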
(Scratch space.)

Relax. Turn in your exam. Go home.