CSE 190D Spring 2017 Final Exam

Full Name:
Student ID:
Major:

INSTRUCTIONS
1. You have up to 2 hours and 59 minutes to complete this exam.
2. You can have up to one letter/A4-sized sheet of notes, formulae, etc. Apart from this, the exam is closed book/notes/electronics/peers.
3. Please wait until being told to start reading and working on the exam.
4. Please ensure that your writing is clear and legible!

Question      Points  Score
Q 1           20
Q 2           10
Q 3           25
Q 4           21
Q 5           24
Total         100
Extra Credit  +5

Q 1. [20pts] For the following questions, clearly circle True or False.

1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join algorithm.
2. Data shuffling among worker nodes is a key component of parallel query processing in MapReduce/Hadoop, Spark, and parallel DBMSs.
3. Using the double buffering technique typically reduces the total number of passes needed for an external merge sort.
4. All four SQL isolation levels guarantee that there will be no WW conflicts during a concurrent execution of transactions.
5. It is typically possible to speed up the processing of the aggregation $\gamma_{\mathrm{COUNT}(*)}(R)$ by using a B+ tree index on R.
6. An avoiding-cascading-aborts (ACA) schedule is always guaranteed to be a recoverable schedule.

7. Apart from schema information, the RDBMS catalog also stores statistics about both relation instances and indexes.
8. A hash index is typically efficient for answering selection queries with ≠ predicates (denoted <> in SQL).
9. Spark's API mostly subsumes the MapReduce programming abstraction.
10. The clock algorithm is an approximation of the MRU buffer replacement policy.

Q 2. [10pts] Hash Join with non-uniform partitioning. (This question is a small tweak on a quiz question!)

We are joining two tables R and S, which have $4BN_R$ and $12BN_S$ pages respectively, using a hash join. We are given that $4BN_R \geq 12BN_S$ and that the number of available buffer pages is $4B + 1$. The buffer pool is initially empty. We are also given that $2FN_S = 4B - 1$, where F is the hash table fudge factor.

The distribution of the join attribute values in R and S is such that after the first hash partitioning phase, we get exactly $4B$ partitions each of R and S. Each partition of R is $N_R$ pages long, but the partitions of S have multiple lengths. Suppose that S gets partitioned as follows: $B$ partitions of length $N_S$ pages each, $2B$ partitions of length $3N_S$ pages each, and $B$ partitions of length $5N_S$ pages each.

What is the I/O cost of the above join using the regular hash join algorithm discussed in class? Exclude the cost of writing the output. Assume that perfect uniform splitting occurs during the recursive repartitioning and that we do not need to recurse more than once. Briefly explain and show all of your calculations clearly.

(Hint: The answer is of the following form: $xBN_R + yBN_S$, where $x \in \{10, 12, 14, 16, 18, 20\}$ and $y \in \{50, 54, 58, 62, 66, 70\}$.)
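The cost accounting that this kind of question relies on can be sketched in a few lines of Python. This is only an illustrative sketch, not the method from class: the function name `hash_join_io`, the choice of S as the build side, the reservation of two buffer pages (one input, one output) during the build phase, and the toy numbers in the usage note are all assumptions made here. It also assumes, as the question does, that one level of recursive repartitioning is always enough.

```python
def hash_join_io(r_parts, s_parts, buffer_pages, fudge):
    """Page I/Os of a partitioned hash join, excluding output writes.

    r_parts / s_parts: page counts of the matching partitions produced by
    the first partitioning pass. S is treated as the build side (assumption).
    """
    n_r, n_s = sum(r_parts), sum(s_parts)
    cost = 2 * (n_r + n_s)        # pass 1: read both inputs, write all partitions
    capacity = buffer_pages - 2   # assumption: 1 input page and 1 output page reserved
    for r, s in zip(r_parts, s_parts):
        if fudge * s > capacity:  # hash table on this S partition does not fit
            cost += 2 * (r + s)   # repartition this pair once: read and rewrite it
        cost += r + s             # final build-and-probe read of the pair
    return cost
```

With hypothetical inputs, `hash_join_io([10] * 4, [5] * 4, 12, 1.4)` gives the familiar three-pass cost of 3 × (40 + 20) = 180 pages, while the skewed `hash_join_io([10] * 4, [5, 5, 5, 15], 12, 1.4)` gives 260 pages because one oversized S partition (and its R counterpart) has to be repartitioned.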

Q 3. [25pts] Query processing and optimization. We are given a relational database schema with the following three relations, wherein A and B are discrete (string) attributes and C, D, and E are numeric attributes.

R(A, B, C), S(A, B), and T(B, D, E)

Q 3.1 [10pts] For each of the following questions, clearly indicate whether the given queries Q1 and Q2 are equivalent.

1. Q1: σ A= a B= b ((R S) T) and Q2: (σ A= a B= b (R S)) T
2. Q1: π E (σ D=1 (R T)) and Q2: π E (T) π B,E (σ D=1 (R T))
3. Q1: π B (R S) and Q2: (π B (R)) (π B (S))
4. Q1: π B,C,E (R (σ E=1 (T))) and Q2: σ E=1 (π B,C,E (R T))
5. Q1: R S T and Q2: (R S) (S T) (T R)

Q 3.2 [9pts] We are given an instance of the relation T whose heap file has a size of 10 million pages. We are also given a B+ tree index on T with the index key (B, D) that follows the alternative of storing RID lists in the leaf pages. The index has 5 million leaf pages. Assume that each B value is 8 bytes in size and that there are a million unique values of B in the given instance of T. A page is 8 KB in size.

Describe an efficient physical query plan with I/O cost less than 9 million pages for processing the following SQL query, given 4 GB of buffer memory and an initially empty buffer pool. Mention if each physical operator is pipelined or materialized and compute the total I/O cost of your plan (in number of pages) rounded to the nearest million.

    SELECT B, AVG(D) FROM T GROUP BY B

Q 3.3 [6pts] Name three indexes on R (including at least one hash index) that match the predicate in the following SQL query and briefly explain why each index matches.

    SELECT * FROM R WHERE (B = "b" OR A <> "a") AND NOT (C <= 20 OR A <> "a")

Q 4. [21pts] Transaction management and concurrency control. We are given a database with three distinct data objects A, B, and C. We are also given the following three transactions that arrive concurrently.

T1: R(A), W(A), R(B), W(B), Commit
T2: R(B), W(B), R(C), W(C), Commit
T3: R(C), W(C), R(A), W(A), Commit

Q 4.1 [2pts] Give a clear example of a serial schedule of the three transactions.

Q 4.2 [5pts] Give a clear example of a serializable schedule of the three transactions that is not serial but is equivalent to the serial schedule in your above answer.

Q 4.3 [14pts] Consider the following interleaved schedule. For each of the questions that follow, clearly circle yes or no.

$R_{T1}(A)$, $R_{T2}(B)$, $R_{T3}(C)$, $W_{T1}(A)$, $W_{T2}(B)$, $W_{T3}(C)$, $R_{T1}(B)$, $W_{T1}(B)$, $\mathrm{Commit}_{T1}$, $R_{T2}(C)$, $W_{T2}(C)$, $\mathrm{Commit}_{T2}$, $R_{T3}(A)$, $W_{T3}(A)$, $\mathrm{Commit}_{T3}$

1. Is the schedule serializable?
2. Is the schedule recoverable?
3. Does the schedule have a WW conflict between any pair of transactions?
4. Does the schedule have a WR conflict between any pair of transactions?
5. Does the schedule have a RW(R) conflict between any pair of transactions?
6. Suppose we use the READ UNCOMMITTED isolation level of SQL. Will it lead to a deadlock with the given schedule?
7. Suppose we use the SERIALIZABLE isolation level of SQL. Will it lead to a deadlock with the given schedule?
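One common way to reason about schedules like these is a precedence graph: add an edge from Ti to Tj whenever an operation of Ti conflicts with a later operation of Tj on the same object, then test the graph for cycles. The Python sketch below illustrates that idea under stated assumptions: it checks conflict serializability only, it ignores commits and aborts, and the tuple encoding `(txn, op, obj)` and both function names are hypothetical choices made here rather than anything from the exam.

```python
from itertools import combinations

def precedence_edges(schedule):
    """schedule: list of (txn, op, obj) tuples in execution order, op in {'R', 'W'}."""
    edges = set()
    # combinations() keeps input order, so the first tuple is the earlier operation.
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        # Two operations conflict if they come from different transactions,
        # touch the same object, and at least one of them is a write.
        if t1 != t2 and x1 == x2 and "W" in (op1, op2):
            edges.add((t1, t2))
    return edges

def is_conflict_serializable(schedule):
    edges = precedence_edges(schedule)
    nodes = {txn for txn, _, _ in schedule}
    # Repeatedly peel off transactions with no incoming edge from a remaining node.
    while nodes:
        free = {n for n in nodes
                if not any(dst == n and src in nodes for src, dst in edges)}
        if not free:      # every remaining node has a predecessor: there is a cycle
            return False
        nodes -= free
    return True
```

For a toy schedule on a single hypothetical object X, `[("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]` yields edges in both directions between T1 and T2, so the check returns `False`; the serial order `[("T1", "R", "X"), ("T1", "W", "X"), ("T2", "R", "X"), ("T2", "W", "X")]` returns `True`.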

Q 5. [24pts] For the following questions, clearly circle the right answer (only one option is correct).

1. Which of the following symbols does not represent a relational operator from the extended relational algebra?
   (a) γ (b) (c) µ (d) π (e)
2. Which of the following relational operators do not preserve the schema of (at least one of) their inputs?
   (a) Set union (b) Set intersection (c) Set difference (d) Select (e) Project
3. Which of the following relational operators can be processed using a regular (unmodified) hash join implementation?
   (a) Set union (b) Set intersection (c) Set difference (d) Select (e) Project
4. Which of the following SQL aggregates require a shuffle among worker nodes in a parallel DBMS when the GROUP BY list is empty?
   (a) SUM (b) AVG (c) VARIANCE (d) MEDIAN (e) MAX
5. Which file organization is typically the most efficient for inserting new records?
   (a) Heap file (b) Sorted file (c) B+ tree index with AltRecord
6. We are given this join query: $R \bowtie S \bowtie T \bowtie U$. Recall that some query optimizers only consider left-deep join trees for join order enumeration. How many different left-deep join trees exist for this query? (Hint: Swapping the left and right input of a $\bowtie$ in a given tree yields a different tree.)
   (a) 4 (b) 6 (c) 15 (d) 16 (e) 24

7. Which is the dominant parallelism paradigm that is used in parallel DBMSs, MapReduce/Hadoop, and Spark?
   (a) Shared-nothing (b) Shared-memory (c) Shared-disk
8. Given the following bags A = {a, b, c, a, a, x, y, x, z} and B = {e, a, b, d, x, y, a, y}, what is the result of the bag intersection of A and B (INTERSECT ALL in SQL)?
   (a) {a, b, x, y} (b) {b, a, y, x, a} (c) {x, b, a, x, y} (d) {a, b, c, x, y, z}
9. In a hard disk, which of the following components of the data access time accounts for the delay caused by the radial movement of the arm?
   (a) Seek time (b) Rotational delay (c) Transfer time
10. Given 1 GB of buffer memory, what is roughly the largest file size (to the nearest order of magnitude) that can technically be sorted with only 2 passes?
   (a) 10 GB (b) 1 TB (c) 100 TB (d) 10 PB (e) 100 PB
11. Suppose we are given two union-compatible relations R and S with sizes $N_R$ and $N_S$ ( $N_R$ ) pages respectively. Suppose we have $B = 2 + FN_S/2$ pages of buffer memory (F is the hash table fudge factor) and the buffer pool is initially empty. What is the minimum possible I/O cost of a UNION ALL of R and S, excluding output write cost?
   (a) $6N_R + 6N_S$ (b) $4N_R + 4N_S$ (c) $2N_R + 2N_S$ (d) $N_R + 2N_S$ (e) $N_R + N_S$
12. In a B+ tree index, which nodes are allowed to have duplicates of the index key?
   (a) Leaf nodes (b) Root node (c) Non-root internal nodes
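As a reminder of the bag semantics behind INTERSECT ALL, each value appears in the result as many times as the smaller of its two multiplicities. A minimal Python sketch of that rule follows, using `collections.Counter` as a stand-in for a bag; the function name `bag_intersection` and the sample bags in the usage note are made up for illustration and are not the bags from the question.

```python
from collections import Counter

def bag_intersection(a, b):
    """Multiset intersection: keep min(count in a, count in b) copies of each value."""
    common = Counter(a) & Counter(b)   # '&' on Counters takes the minimum of the counts
    return sorted(common.elements())   # sorted only to make the output deterministic
```

For example, `bag_intersection(["p", "p", "q", "r"], ["p", "q", "q", "p", "p"])` returns `["p", "p", "q"]`: "p" occurs twice in the first bag and three times in the second, so two copies survive, and "r" is dropped because it is absent from the second bag.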

Extra Credit Q. [5pts] Suppose you are working for Netflix and using a Hadoop cluster. You are given a dump of a large Ratings table with the same schema from class: Ratings (RID, MovieID, UserID, Stars, RateDate). The table is stored as a distributed CSV file on HDFS that is hash partitioned on RID. You are tasked with obtaining the number of five-star ratings for each movie in the given file. Unfortunately, your Hadoop cluster only has regular MapReduce (no Hive, Pig, Spark, or any SQL engine).

How will you implement this task in MapReduce? A proper explanation of the Map and Reduce phases is enough; no need to write code in Java, Python, etc.

Tired of writing low-level MapReduce programs, you decide to convince your boss that Netflix really needs to use Hive or SparkSQL. How can you use the above task to achieve your goal?
