Database Management Systems (COP 5725) Homework 3

Instructor: Dr. Daisy Zhe Wang
TAs: Yang Chen, Kun Li, Yang Peng
yang, kli, ypeng@cise.ufl.edu
November 26, 2013

Name:
UFID:
Email Address:

Pledge (must be signed according to the UF Honor Code): On my honor, I have neither given nor received unauthorized aid in doing this assignment.
Signature:

For grading use only:

Question:  I    II   III  IV   Total
Points:    26   25   24   25   100
Score:

COP5725, Fall 2013 Homework 3 Page 1 of 7

I. [26 points] Indexing.

Make the following assumptions for questions (1) and (2): A bucket can hold two keys and a pointer. The initial database D contains one object with key 10100. Six objects with the following keys are inserted into D in the following order: 00110, 11010, 10011, 01010, 10110, and 01011.

(1) [3 points] Assume that an extensible hash table is used to index the database. Show the index structure after the insertions.

(2) [3 points] Assume that a linear hash table is used to index the database with the restriction that at most 80% of the hash table can be full at any time. Show the index structure after the insertions.

For (1): extensible hash table with i = 3.

  Directory entries   Bucket contents   Local depth
  000, 001            00110             2
  010, 011            01010, 01011      2
  100                 10011             3
  101                 10100, 10110      3
  110, 111            11010             2

For (2): linear hash table with i = 3, n = 5, r = 7 and buckets 000, 001, 010, 011, 100, holding the keys 00110, 01010, 01011, 10011, 10100, 10110, 11010.

(3) [3 points] Describe a scenario when extensible hash tables are preferred over linear hash tables. Explain your answer.

Two possibilities:
- When insertions are very frequent. A linear hash table is reorganized every time a new bucket is added, and under frequent insertions new buckets are added frequently.
- When the keys that the index is built on are uniformly distributed (i.e., the hash table is filled uniformly).

For (4)-(6), consider a B+ tree whose nodes contain up to 4 keys (5 pointers).

(4) [3 points] Bulkload the B+ tree with values 46, 10, 70, 49, 23, 40, 59, 29, 34, 54, 75, 30.
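One way to carry out the bulk load in (4) is sketched below. It assumes leaves are packed completely full (4 keys each) and that each separator key is the smallest key of the child to its right; the handout states neither policy, and the function name `bulk_load` is only illustrative. The sketch does not rebalance an underfull last node, which the 12 keys here never produce.

```python
def bulk_load(keys, max_keys=4):
    """Bottom-up bulk load: sort, pack full leaves, then build separator levels.

    Returns a list of levels, root level first. Each level is a list of nodes,
    and each node is a list of keys.
    """
    keys = sorted(keys)
    leaves = [keys[i:i + max_keys] for i in range(0, len(keys), max_keys)]
    levels = [leaves]
    # Smallest key found under each node of the current level.
    mins = [leaf[0] for leaf in leaves]
    fanout = max_keys + 1  # an internal node has one more pointer than keys
    while len(mins) > 1:
        groups = [mins[i:i + fanout] for i in range(0, len(mins), fanout)]
        # Separators of an internal node: the minimum of every child but the first.
        levels.append([group[1:] for group in groups])
        mins = [group[0] for group in groups]
    return levels[::-1]


if __name__ == "__main__":
    tree = bulk_load([46, 10, 70, 49, 23, 40, 59, 29, 34, 54, 75, 30])
    for depth, level in enumerate(tree):
        print(depth, level)
```

With the twelve values from (4), this yields a single root over three full leaves.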

  Root:    [34 | 54]
  Leaves:  [10 23 29 30] [34 40 46 49] [54 59 70 75]

(5) [3 points] Show the resulting B+ tree after inserting values 80, 24, 42.

  Root:      [42]
  Internal:  [24 | 34] [54 | 70]
  Leaves:    [10 23] [24 29 30] [34 40] [42 46 49] [54 59] [70 75 80]

(6) [3 points] Based on the B+ tree in (5), show the resulting B+ tree after deleting values 10, 40.

  Root:    [29 | 42 | 54 | 70]
  Leaves:  [23 24] [29 30 34] [42 46 49] [54 59] [70 75 80]

(7) [8 points] Fill in the cost table below for Alternative 1 ISAM and B+ tree indices. Assume each index takes P pages on disk, has height H, and fanout F at each internal node. Assume there are R tuples in the relation, and B tuples fit on a leaf (or overflow) page. In each case, assume infinite buffer pool size, but the buffer pool starts out empty. For each page that gets dirty, add 1 to your I/O cost since it will eventually have to be flushed to disk. For ISAM, assume that a leaf node maintains only a pointer to the beginning of an overflow list. Given the constraints of a B+ Tree/ISAM, assume whatever data you want in the tree for each case below.

Worst-case # I/Os for range query
  ISAM: P (the index consists of a root with a linear string of overflow pages; all overflow pages must be read since they are not sorted), or H + R/B, or H + (F^H − 1) + R/B (look at the whole leaf level and all data in the last leaf's overflow chain).
  B+ tree: H + F^H (the range query covers the whole table), or H + R/B.
  Note: P was not accepted here, as this would imply only 2 I/Os, given the structure of the index.

Worst-case # I/Os for insert
  ISAM: P + 2 (the index consists of a root with a string of overflow pages; scan to the end, add a new overflow page in the worst case, and update the previous last overflow page with a pointer), or H + R/B + 2.
  B+ tree: 3H + 1 (every node needs to split, +1 for the new root; pages that will be split are read on the way down, so they need not be read again).

II. [25 points] Query Evaluation.

Suppose we want to compute (R(a, b) ⋈ S(a, c)) ⋈ T(a, d) in the order indicated. We have M = 101 main memory buffers, and the number of disk blocks (pages) for R and S is B(R) = B(S) = 2000. We decide to use one-pass or two-pass sort-merge-join algorithms to implement the query.

(1) [2 points] Would you use a one- or two-pass sort-merge-join for R ⋈ S? Explain.

Two-pass sort-merge-join, since both operands are larger than main memory.

(2) We shall use the appropriate number of passes for the second join, first dividing T into some number of sublists sorted by a, and merging them with the sorted and pipelined stream of tuples from the join R ⋈ S. For what values of B(T) should we choose, for the join of T with R ⋈ S:

i. [3 points] A one-pass join; i.e., we read T into memory, and compare its tuples with the tuples of R ⋈ S as they are generated.

B(T) ≤ 60.

ii. [3 points] A two-pass join; i.e., we create sorted sublists for T and keep one buffer in memory for each sorted sublist, while we generate tuples of R ⋈ S.

B(T) > 60.

iii. [4 points] For the cases in i. and ii., what is the total number of disk I/Os (in terms of B(T))?

For i., we need 3 × (2000 + 2000) = 12,000 I/Os to perform the two-pass sort-merge-join of R and S, and B(T) I/Os to read T in the one-pass join of (R ⋈ S) ⋈ T. The total number of I/Os is 12,000 + B(T).
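The totals can be tabulated with a short sketch. The one-pass formula is the one derived above; the two-pass formula, 12,000 + 3B(T), is the answer given for case ii. The function names and the `one_pass_limit` parameter are illustrative, and the threshold of 60 is taken from the answers to (2) rather than re-derived.

```python
def sort_merge_join_cost(b_r, b_s):
    """Two-pass sort-merge join: each block of R and S is read, written, and read again."""
    return 3 * (b_r + b_s)


def total_cost(b_t, b_r=2000, b_s=2000, one_pass_limit=60):
    """Total disk I/Os for (R join S) join T, following the homework's answers."""
    base = sort_merge_join_cost(b_r, b_s)
    if b_t <= one_pass_limit:
        # Case i: T is read once and kept in memory.
        return base + b_t
    # Case ii: T is read, written as sorted sublists, and read again.
    return base + 3 * b_t


if __name__ == "__main__":
    for b_t in (10, 60, 61, 500):
        print(b_t, total_cost(b_t))
```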

For ii., we need 2B(T) disk I/Os to sort T into sublists; 12,000 disk I/Os to join R ⋈ S; and B(T) to read the sorted sublists of T. The total number of disk I/Os is 12,000 + 3B(T).

(3) [4 points] Consider the query (R(a, b) ⋈ S(a, c)) ⋈ T(c, d), i.e., the second join is based on attribute c instead of a. How would you choose the join algorithms? Provide a new cost estimation if your choices differ from (2).

We need to re-sort the intermediate result R ⋈ S on attribute c. The new cost is 12,000 + 3B(T) + 2B(R ⋈ S), where the 2B(R ⋈ S) term comes from writing out the sorted sublists of R ⋈ S and reading them back in while joining (R ⋈ S) ⋈ T.

For (4)-(6), you are given M memory blocks and a relation R.

(4) [3 points] Describe a two-pass hash-based algorithm for duplicate elimination, δ(R). (Hint: review the aggregation algorithm with grouping.)

Hash R into M − 1 buckets based on all attributes. Perform δ on each bucket in isolation, using M memory blocks.

(5) [3 points] What is the largest relation your algorithm can handle given M blocks of main memory?

M(M − 1).

(6) [3 points] What is the number of disk I/Os of your algorithm?

B(R) for reading R and hashing; B(R) for writing out the buckets; B(R) for reading the buckets and performing the actual duplicate elimination. 3B(R) in total.

III. [24 points] Query Optimization.

Consider the following database schema:

  Employees(eid: integer, ename: string, sal: integer, title: string, age: integer)

Suppose that the following indexes, all using Alternative (2) for data entries, exist: a hash index on eid, a B+ tree index on sal, a hash index on age, and a clustered B+ tree index on (age, sal). Each Employees record is 100 bytes long, and you can assume that each index data entry is 20 bytes long. The Employees relation contains 10,000 pages.
(1) Consider each of the following selection conditions and, assuming that the reduction factor (RF) for each term that matches an index is 0.1, compute the cost of the most selective access path for retrieving all Employees tuples that satisfy the condition (in terms of the number of I/Os):

i. [4 points] age=25.

The clustered B+ tree index would be the best option here, with a cost of 2 (lookup) + 10,000 × 0.1 (data pages) + 10,000 × 0.2 × 0.1 (index pages) = 1202. Although the hash index has a shorter lookup time, the potential number of record lookups (10,000 × 0.1 × 20 tuples per page = 20,000) renders the clustered index more efficient.

ii. [4 points] sal>200 AND age>30 AND title='CFO'.

Here an age condition is present, so the clustered B+ tree index on (age, sal) can be used. The cost is 2 + 10,000 × 0.2 × 0.1 (all index pages satisfying age>30 need to be fetched) + 10,000 × 0.1 × 0.1 (data pages) = 302.

Consider the following relational schema and SQL query:

  Emp(eid, did, sal, hobby)
  Dept(did, dname, floor, phone)
  Finance(did, budget, sales, expences)

  SELECT D.dname, F.budget
  FROM Emp E, Dept D, Finance F
  WHERE E.did = D.did AND D.did = F.did AND D.floor = 1
    AND E.sal >= 59000 AND E.hobby = 'yodelling';

(2) [5 points] Identify a query plan that a decent query optimizer would choose.

  π D.dname, F.budget
    ⋈
      ⋈
        π E.did
          σ E.sal ≥ 59000, E.hobby = "yodelling"
            E
        π D.did, D.dname
          σ D.floor = 1
            D
      π F.did, F.budget
        F

(3) Suppose that the following additional information is available: Unclustered B+ tree indexes exist on Emp.did, Emp.sal, Dept.did, and Finance.did (each leaf page contains up to 200 entries). The system's statistics indicate that employee salaries range from 10,000 to 60,000, and employees enjoy 200 different hobbies. The company owns two floors in the building. There are a total of 50,000 employees and 5,000 departments (each with corresponding financial information) in the database. The DBMS used by the company has just one join method available, namely index nested loops.
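These statistics can be turned into cardinality estimates for the non-join predicates of the query. The sketch below assumes uniformly distributed values and independent predicates, the usual System R style simplifications; the handout does not state these assumptions explicitly, and the helper names are illustrative.

```python
from fractions import Fraction


def fraction_at_least(cutoff, low, high):
    """Share of a uniformly distributed [low, high] range with value >= cutoff."""
    return Fraction(high - cutoff, high - low)


def estimated_tuples():
    """Tuples surviving the non-join predicates on each base relation."""
    # Emp: sal >= 59000 over the range 10,000..60,000, and one hobby out of 200.
    emp = 50_000 * fraction_at_least(59_000, 10_000, 60_000) * Fraction(1, 200)
    # Dept: floor = 1, with two floors in the building.
    dept = 5_000 * Fraction(1, 2)
    # Finance: no non-join predicate, so every tuple qualifies.
    finance = Fraction(5_000)
    return {"Emp": int(emp), "Dept": int(dept), "Finance": int(finance)}


if __name__ == "__main__":
    print(estimated_tuples())
```

Exact fractions are used so that the intermediate reduction factors (1/50 and 1/200) do not pick up floating-point error.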

i. [3 points] For each of the query's base relations, estimate the number of tuples that would be initially selected from that relation if all of the non-join predicates on that relation were applied to it before any join processing begins.

Emp: 50,000 × (1,000/50,000) × (1/200) = 5. Dept: 5,000 × (1/2) = 2,500. Finance: 5,000.

ii. [8 points] Under the System R approach, determine a join order that has the least estimated cost. Compute the cost of your plan (in terms of the number of disk I/Os).

((D ⋈ E) ⋈ F). First, we use the fact that there is a B+ tree index on salary to retrieve the tuples from E such that E.sal >= 59000. We estimate that 50,000/50 = 1,000 such tuples are selected, with a cost of 1 tree traversal (say 3 I/Os to get to the leaf) + the cost of scanning the leaf pages (1000/200 + 1 − 1 = 5) + the cost of retrieving the 1,000 tuples (since the index is unclustered, each tuple is potentially 1 disk I/O) = 3 + 5 + 1000 = 1008.

Of these 1,000 retrieved tuples, select on the fly only those that have hobby = "yodelling"; we estimate there will be 1000/200 = 5 such tuples. Pipeline these 5 tuples from E one at a time to D. By using the B+ tree index on D.did and the fact that D.did is a key, we can find the matching tuples for the join by searching the D.did B+ tree and retrieving at most 1 matching tuple per tuple from E. The cost of E ⋈ D is hence the total cost of the index nested loops join: 5 × (tree traversal of the D.did B+ tree + record retrieval) = 5 × (3 + 1) = 20.

Now select on the fly the ⌈5/2⌉ = 3 tuples that have D.floor = 1 and pipeline them to the next level, F. (This is done after E ⋈ D is done.) Use the B+ tree index on F.did and the fact that F.did is a key to retrieve at most 1 tuple for each of the 3 pipelined tuples. This cost is at most 3 × (3 + 1) = 12.

Ignoring the cost of writing out the final result, we get a total cost of 1008 + 20 + 12 = 1040.

IV. [25 points] Transactions and Concurrency Control.
(1) For each of the following schedules:

a) r1(A); r2(B); w1(B); w2(C); r3(C); w3(A);
b) r1(A); r2(A); r1(B); r2(B); r3(A); r4(B); w1(A); w2(B);

answer the following questions:

i. [4 points] What is the precedence graph for the schedule?
ii. [4 points] Is the schedule conflict-serializable? If so, what is an equivalent serial schedule?

a) i. T2 → T1, T2 → T3, T1 → T3. ii. Yes; an equivalent serial schedule is T2, T1, T3.
b) i. T2 → T1, T3 → T1, T1 → T2, T4 → T2. ii. No, there is a cycle in the precedence graph (T2 → T1, T1 → T2).

(2) [17 points] Consider the following two transactions:

T1: w1(C); r1(A); w1(A); r1(B); w1(B);
T2: r2(B); w2(B); r2(A); w2(A);

Say our scheduler performs exclusive locking only (i.e., no shared locks). For each of the following three instances of transactions T1 and T2 annotated with lock and unlock actions, say whether the annotated transactions:

1. obey two-phase locking,
2. will necessarily result in a conflict-serializable schedule (if no deadlock occurs),
3. will necessarily result in a strict schedule (if no deadlock occurs),
4. will necessarily result in a serial schedule (if no deadlock occurs), and
5. may result in a deadlock.

a) T1: l1(B); l1(C); w1(C); l1(A); r1(A); w1(A); r1(B); w1(B); Commit; u1(A); u1(C); u1(B);
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);

b) T1: l1(C); l1(A); r1(A); w1(C); w1(A); l1(B); r1(B); w1(B); u1(A); u1(C); u1(B); Commit;
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);

c) T1: l1(C); w1(C); l1(A); r1(A); w1(A); l1(B); r1(B); w1(B); Commit; u1(A); u1(C); u1(B);
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);

Format your answer in a table with Yes (Y) / No (N) entries.

      2PL   Necessarily conflict-   Necessarily strict   Necessarily serial   May result
            serializable            schedule             schedule             in deadlock
  a)  Y     Y                       Y                    Y                    N
  b)  Y     Y                       N                    Y                    Y
  c)  Y     Y                       Y                    N                    Y
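Several of the checks in this section are mechanical and can be sketched in a few lines of Python: building the precedence graph of a schedule from IV(1), testing it for cycles, and testing an annotated transaction from IV(2) for two-phase locking. The function names are illustrative and not part of the assignment. `unlocks_after_commit` checks only whether every unlock follows Commit, which is used here as a simple indicator for the strictness column, not as a full strictness test.

```python
import re
from itertools import combinations

ACTION = re.compile(r"([rwlu])\s*(\d+)\s*\(\s*(\w+)\s*\)")


def parse(text):
    """'r1(A); w2(B)' -> [('r', 1, 'A'), ('w', 2, 'B')]; 'Commit' is skipped."""
    return [(op, int(txn), item) for op, txn, item in ACTION.findall(text)]


def precedence_edges(schedule):
    """Edge (Ti, Tj) when an action of Ti conflicts with a later action of Tj."""
    actions = [a for a in parse(schedule) if a[0] in "rw"]
    edges = set()
    # combinations() keeps schedule order, so the first action is the earlier one.
    for (op1, t1, x1), (op2, t2, x2) in combinations(actions, 2):
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle; a cycle means not conflict-serializable."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


def obeys_2pl(transaction):
    """Two-phase locking: no lock request after the first unlock."""
    unlocked = False
    for op, _, _ in parse(transaction):
        if op == "u":
            unlocked = True
        elif op == "l" and unlocked:
            return False
    return True


def unlocks_after_commit(transaction):
    """True when every unlock step appears after the Commit step."""
    steps = [s.strip() for s in transaction.split(";") if s.strip()]
    if "Commit" not in steps:
        return False
    return all(not s.startswith("u") for s in steps[:steps.index("Commit")])


if __name__ == "__main__":
    schedule_b = "r1(A); r2(A); r1(B); r2(B); r3(A); r4(B); w1(A); w2(B);"
    print(sorted(precedence_edges(schedule_b)), has_cycle(precedence_edges(schedule_b)))
```

Running the checks on the schedules in IV(1) and on T1 from instances a) and b) reproduces the corresponding entries in the answers above.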