CSE 444 Final Exam 2nd Midterm

CSE 444 Final Exam
2nd Midterm
August 20, 2010

Name: Sample Solution

    Question 1    / 16
    Question 2    / 16
    Question 3    / 30
    Question 4    / 20
    Question 5    / 18
    Total         / 100

The exam is open textbook and lecture slides only. No other written materials, including homework solutions, and no other devices, including laptops, phones, signal flares, or other means of communication.

CSE 444 Final, Aug. 20, 2010 Sample Solution Page 1 of 10

Question 1. Performance Tuning (16 points)

Suppose we have a relation R(a,b,c,d), where attribute a is the key. The relation is clustered on a, and there is a single B+ tree index on a.

Consider each of the following query workloads, independently of each other. If it is possible to speed it up significantly by adding a single additional B+ tree index to relation R, specify which attribute or set of attributes that index should cover. You may only add at most one new index. If two or more possible indices would do an equally good job, pick one of them. If adding a new index would not make a significant difference, you should say so. Give a brief justification for your answers.

(a) All queries have the form:

    select * from R where b > ? and b < ? and d = ?

An index on (d,b) would be best since it allows us to select exactly the tuples that satisfy these queries. (Indices on d only would require reading through unneeded values of b. Indices on b alone or (b,d) also are less selective since although they allow us to access the specified values of b, some of the d values in the selected b range might match, while others might not.)

(b) 100,000 queries have the form:

    select * from R where b = ? and c = ?

and 100,000 queries have the form:

    select * from R where b = ? and d = ?

An index on either (b,c) or (b,d) would be most selective for half of the queries and would be helpful for the others. An index on b alone would not be quite as good. It would help all queries equally well, but miss the opportunity to speed up half of them further.
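The reasoning in part (a) can be illustrated with a small sketch. It models an index on (d,b) as a sorted list of composite keys, which is only a stand-in for a real B+ tree; the rows, the `lookup` helper, and all values are hypothetical and not part of the exam.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of R(a, b, c, d); the values are made up for illustration.
rows = [
    (1, 10, "x", 7),
    (2, 15, "y", 7),
    (3, 20, "z", 7),
    (4, 15, "w", 8),
    (5, 12, "v", 9),
]

# Model an index on (d, b) as a sorted list of ((d, b), a) entries.
# A sorted list is an assumption standing in for the B+ tree leaf level.
index_db = sorted(((d, b), a) for (a, b, c, d) in rows)
keys = [key for key, _ in index_db]


def lookup(d, b_low, b_high):
    """Return the keys a of rows with the given d and b_low < b < b_high."""
    lo = bisect_right(keys, (d, b_low))   # first entry past (d, b_low)
    hi = bisect_left(keys, (d, b_high))   # first entry at or past (d, b_high)
    return [a for _, a in index_db[lo:hi]]


result = lookup(7, 10, 20)
print(result)
```

Because d is the leading attribute, all entries for one d value are contiguous and sorted by b, so the matching tuples form a single contiguous slice. With (b,d) instead, the slice for the b range would interleave matching and non-matching d values.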

Question 2. Query translation (16 points)

Suppose we have two relations

    R(a,b,c)
    S(p,q)

Draw a logical query plan (relational algebra tree) for the following query. You may choose any query plan as long as it is correct; don't worry about efficiency.

    SELECT x.c, count(*)
    FROM R as x, S as y, R as z
    WHERE x.a = y.p and y.q = z.a and z.b = 'xyzzy'
    GROUP BY x.c

Here is a straightforward translation. One useful optimization would be to push the select down below the uppermost join, but that would not be expected from a simple query parser. It's fine if you did that however. It's also fine if the joins were in the opposite order.

    π x.c, cnt
      γ x.c, count(*) -> cnt
        σ z.b = 'xyzzy'
          ⋈ y.q = z.a
            ⋈ x.a = y.p
              R as x
              S as y
            R as z

Question 3. Query plans (30 points)

We would like to explore query plans involving three new relations:

    R(a,b)
    S(p,q)
    T(x,y)

There are no foreign key relationships between these tables. Each table is clustered on its primary key and has a B+ tree index on that key. We have the following statistics about the tables:

    B(R) = 50     T(R) = 2000    V(R,a) = 50     V(R,b) = 20
    B(S) = 200    T(S) = 4000    V(S,p) = 400    V(S,q) = 100
    B(T) = 100    T(T) = 1000    V(T,x) = 200    V(T,y) = 100

Now consider the following logical query plan:

    π R.b, S.q
      σ R.a = 42
        ⋈ R.b = T.y
          R
          ⋈ S.q = T.x
            S
            T

Answer questions about this query on the next pages. You may remove this page for reference if it is convenient.

Question 3 (cont.)

(a) (15 points) What is the estimated cost of the logical query plan given in the diagram on the previous page? For full credit you should both give formulas involving things like T( ), B( ), and V( , ), and then substitute actual numbers and simplify to get your final estimate. Be sure to show your work so we can fairly award partial credit if there is a problem.

Hint: Cost estimates for logical plans do not include physical operations, but do consider things like the sizes of relations and estimated sizes of intermediate results. The estimated size of a join T(X ⋈ Y) on a common attribute a is T(X) T(Y) / max(V(X,a), V(Y,a)).

First join:

    T(S ⋈ T) = T(S) T(T) / max(V(S,q), V(T,x))
             = 4,000 * 1,000 / max(100, 200)
             = 4,000,000 / 200
             = 20,000

Second join:

    T(first join) T(R) / max(V(R,b), V(T,y))
             = 20,000 * 2,000 / max(20, 100)
             = 40,000,000 / 100
             = 400,000

The select on attribute a is estimated to return 1/V(R,a) of the input, so we estimate the number of tuples in the select result to be

    T(second join) / V(R,a) = 400,000 / 50 = 8,000.

A project does not change the size of the result, so the estimated size of the final result is 8,000 tuples.

[During the exam one student noticed a bug in this question. Attribute a is a key for R, but V(R,a) is only 50. We took that into account when grading in the one case where it mattered.]
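As a quick check of the arithmetic in part (a), the estimates can be recomputed in a few lines. This is only a sketch: the dictionaries hold the statistics from the problem statement, while the `join_size` helper and the variable names are introduced here for illustration.

```python
# Statistics copied from the problem statement.
T = {"R": 2000, "S": 4000, "T": 1000}
V = {
    ("R", "a"): 50, ("R", "b"): 20,
    ("S", "p"): 400, ("S", "q"): 100,
    ("T", "x"): 200, ("T", "y"): 100,
}


def join_size(t_left, t_right, v_left, v_right):
    """Estimated join size: T(X) * T(Y) / max(V(X,a), V(Y,a))."""
    return t_left * t_right // max(v_left, v_right)


# First join: S with T on S.q = T.x.
first_join = join_size(T["S"], T["T"], V[("S", "q")], V[("T", "x")])
# Second join: the first join's result with R on R.b = T.y.
second_join = join_size(first_join, T["R"], V[("T", "y")], V[("R", "b")])
# Selection R.a = 42 keeps an estimated 1 / V(R,a) of its input.
after_select = second_join // V[("R", "a")]

print(first_join, second_join, after_select)
```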

Question 3 (cont.)

(b) (15 points) Assume that we have M = 180 blocks of memory available. Without rearranging this logical plan, pick the best (cheapest) physical plan to execute it in the available memory. The query is repeated below and you should annotate the diagram showing which physical operations you would use. You should indicate how you plan to access the files as well as how to implement the operations. Be sure it is clear whether intermediate results are materialized on disk or not, which relations are stored in hash tables or other data structures in memory, etc. You do not need to justify your answers, but some brief explanations would help, particularly if it is necessary to assign partial credit.

[There is enough room in main memory to store both R and T in hash tables, then scan S sequentially and pull tuples through the pipelines on the fly without materializing any intermediate results. Total cost would be B(R) + B(S) + B(T).]

    π R.b, S.q             (use main memory pipelines to connect operations on the fly)
      σ R.a = 42
        ⋈ R.b = T.y        (hash join: hash R into main memory)
          R                (file scan)
          ⋈ S.q = T.x      (hash join: hash T into main memory)
            S              (file scan)
            T              (file scan)
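The plan in part (b) can be sketched as two pipelined in-memory hash joins. The block counts and M come from the problem statement; the `hash_join` helper, the tiny table instances, and the assumption that hash-table overhead is negligible are illustrative only and not from the exam.

```python
# Block counts and memory budget copied from the problem statement.
B = {"R": 50, "S": 200, "T": 100}
M = 180

# Assumption: hash-table overhead is ignored, so R and T need B(R) + B(T) blocks.
build_blocks = B["R"] + B["T"]
fits_in_memory = build_blocks < M      # leaves at least one block to scan S
total_io = B["R"] + B["S"] + B["T"]    # each relation is read exactly once


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build an in-memory hash table on build_rows, then stream probe_rows."""
    table = {}
    for row in build_rows:
        table.setdefault(build_key(row), []).append(row)
    for probe in probe_rows:
        for match in table.get(probe_key(probe), []):
            yield match, probe


# Hypothetical tiny instances of R(a,b), S(p,q), T(x,y).
R_rows = [(42, 5), (7, 5), (42, 9)]
S_rows = [(1, 100), (2, 200)]
T_rows = [(100, 5), (200, 6)]

# Lower join: hash T, stream S. Yields (t, s) pairs with S.q = T.x.
st = hash_join(T_rows, iter(S_rows), lambda t: t[0], lambda s: s[1])
# Upper join: hash R, stream the lower join. Yields (r, (t, s)) with R.b = T.y.
rst = hash_join(R_rows, st, lambda r: r[1], lambda ts: ts[0][1])
# Select R.a = 42 and project (R.b, S.q), still one tuple at a time.
result = [(r[1], s[1]) for r, (t, s) in rst if r[0] == 42]

print(fits_in_memory, total_io, result)
```

Because `hash_join` is a generator, the lower join hands each matching pair directly to the upper join, so no intermediate result is materialized. With the exam's numbers the two build sides take 150 of the 180 blocks and the total I/O is 350 block reads.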

Question 4. Oink! (20 points)

Consider the following Pig Latin program, which uses the same input data as in the Pig Tutorial:

    raw = LOAD 's3n://uw cse444 proj4/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
    a = GROUP raw BY user;
    b = FOREACH a GENERATE group AS user, COUNT(raw) AS n_searches;
    c = GROUP b ALL;
    d = FOREACH c GENERATE AVG(b.n_searches), MIN(b.n_searches), MAX(b.n_searches);
    STORE d INTO '/user/hadoop/answer.txt' USING PigStorage();

(a) (6 points) What does this program compute and store into the final output file? Describe what the result is, not the details of how it is computed. (Hint: GROUP b ALL sends all tuples of bag/relation b to a single group.)

The output contains a single row:

    avg  min  max

where

    avg = mean number of searches by one user
    min = minimum number of searches by one user
    max = maximum number of searches by one user

[For those who are curious, the actual output when this was run against the excite.log.bz2 data was:

    4.705126673040153  1  452

Thanks to Michael, our intrepid TA, for running the program to verify that it worked, as well as coming up with it in the first place.]

(b) (14 points) In order to run this program, the Pig system translates it into a sequence of one or more Hadoop Map Reduce jobs. On the next page describe the Map Reduce job(s) needed to execute this program. For each job you should describe the input and output of each map and reduce phase, including the keys and values at each step. You do not need to guess exactly how Pig would translate the program as long as your answer gives a reasonable implementation as a sequence of map reduce jobs. Please write your answer on the next page.

Question 4. (cont.)

(b) Give a series of one or more map reduce jobs needed to execute this program. Be sure to describe the key/value pairs at each stage.

    raw = LOAD 's3n://uw cse444 proj4/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
    a = GROUP raw BY user;
    b = FOREACH a GENERATE group AS user, COUNT(raw) AS n_searches;
    c = GROUP b ALL;
    d = FOREACH c GENERATE AVG(b.n_searches), MIN(b.n_searches), MAX(b.n_searches);
    STORE d INTO '/user/hadoop/answer.txt' USING PigStorage();

This program generates two map reduce jobs.

Job 1:

    map 1 input: keys = tuple IDs, values = (user, time, query) tuples.
    map 1 output: key = user, value = a marker. The actual value doesn't matter; it could be the integer 1, a single character, or any other marker to record one search by that user.
    reduce 1 input: key = user, value = [markers], i.e., array of search markers.
    reduce 1 output: key = user, value = length of array, i.e., search count for that user.

Job 2:

    map 2 input: key = user, value = search count for that user.
    map 2 output: key = x, value = search count v. Here the key value doesn't matter except that it needs to be the same for all map outputs.
    reduce 2 input: key = x (same as map 2 output key), value = [v], i.e., array of all individual user search counts.
    reduce 2 output: average, min, and max of values in input array [v].
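The two jobs described above can be simulated in plain Python to make the key/value flow concrete. This is a minimal sketch, not how Pig or Hadoop actually executes the program: `run_job`, the mapper and reducer names, the constant key "all", and the four log records are all hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one map reduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key, values in groups.items():
        output.extend(reducer(out_key, values))
    return output


# Hypothetical input: (tuple ID, (user, time, query)) pairs.
log = [
    (0, ("u1", "t1", "cats")),
    (1, ("u2", "t2", "dogs")),
    (2, ("u1", "t3", "pig latin")),
    (3, ("u1", "t4", "hadoop")),
]


def map1(_tuple_id, row):
    user, _time, _query = row
    yield user, 1                      # the value is just a marker


def reduce1(user, markers):
    yield user, len(markers)           # search count for that user


def map2(_user, count):
    yield "all", count                 # same key for every output


def reduce2(_key, counts):
    yield sum(counts) / len(counts), min(counts), max(counts)


per_user = run_job(log, map1, reduce1)
stats = run_job(per_user, map2, reduce2)
print(per_user, stats)
```

The constant key in `map2` plays the role of GROUP b ALL: it forces every per-user count into a single reduce call, which is the only place the average, minimum, and maximum can be computed together.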

Question 5. XML (18 points)

Suppose we have an XML file named food.xml that describes a series of products carried in a grocery store according to the following DTD:

    <!DOCTYPE groceries [
      <!ELEMENT groceries (product)* >
      <!ELEMENT product (stock_nbr, name, description, price) >
      <!ELEMENT description (category, color?) >
    ]>

The rest of the DTD is omitted to save space. It describes all of the remaining elements as #PCDATA, i.e., character strings. Each product has a unique stock_nbr element (stock number). The price is given in pennies, i.e., 317 means $3.17.

(a) Write an XPath or XQuery expression that returns the names of all products whose price is less than or equal to 200. The output should be a sequence of elements:

    <name>gum</name>
    <name>banana</name>

There are many possible answers using either XPath or XQuery. Here is a simple XPath version:

    /groceries/product[price/text() <= 200]/name

Here text() is optional and would be inferred if it were omitted.
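For readers who want to try the part (a) query on data, here is a rough Python equivalent. It uses the standard library's ElementTree, whose limited XPath support has no numeric comparisons, so the price test is done in Python instead of in the path. The sample document is hypothetical; only its structure follows the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data that follows the DTD; the products are made up.
FOOD_XML = """
<groceries>
  <product>
    <stock_nbr>1</stock_nbr><name>gum</name>
    <description><category>candy</category></description>
    <price>99</price>
  </product>
  <product>
    <stock_nbr>2</stock_nbr><name>banana</name>
    <description><category>fruit</category><color>yellow</color></description>
    <price>59</price>
  </product>
  <product>
    <stock_nbr>3</stock_nbr><name>steak</name>
    <description><category>meat</category></description>
    <price>899</price>
  </product>
</groceries>
"""

root = ET.fromstring(FOOD_XML)

# Same intent as /groceries/product[price <= 200]/name.
cheap = [
    product.find("name")
    for product in root.findall("product")
    if int(product.findtext("price")) <= 200
]
names = [ET.tostring(name, encoding="unicode").strip() for name in cheap]
print(names)
```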

Question 5. (cont.)

DTD repeated for reference:

    <!DOCTYPE groceries [
      <!ELEMENT groceries (product)* >
      <!ELEMENT product (stock_nbr, name, description, price) >
      <!ELEMENT description (category, color?) >
    ]>

(b) Write an XQuery that returns a well-formed XML document that contains a list of product names and prices of products whose category is "veggie". The output should be formatted as follows, but don't worry about where the line breaks are as long as the proper elements are included in the right order:

    <result>
      <item><name>broccoli</name> <price>149</price> </item>
      <item> </item>
    </result>

There are many possible ways to do this. Here's a fairly straightforward one:

    <result>
    { for $x in document("food.xml")/groceries
      for $y in $x/product[description/category = "veggie"]
      return <item> { $y/name, $y/price } </item> }
    </result>

Many solutions used a where clause to specify the veggie condition instead of including it in the path.

That's it! Have a great summer vacation!!
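The part (b) query can also be mimicked in Python as a rough cross-check of the expected output shape. This sketch builds the result tree with the standard library's ElementTree instead of running XQuery; the sample document is hypothetical and only its structure follows the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data that follows the DTD; the products are made up.
FOOD_XML = """
<groceries>
  <product>
    <stock_nbr>1</stock_nbr><name>gum</name>
    <description><category>candy</category></description>
    <price>99</price>
  </product>
  <product>
    <stock_nbr>4</stock_nbr><name>broccoli</name>
    <description><category>veggie</category><color>green</color></description>
    <price>149</price>
  </product>
</groceries>
"""

root = ET.fromstring(FOOD_XML)

# Build <result><item><name/><price/></item>...</result> for veggie products.
result = ET.Element("result")
for product in root.findall("product"):
    if product.findtext("description/category") == "veggie":
        item = ET.SubElement(result, "item")
        for tag in ("name", "price"):
            child = ET.SubElement(item, tag)
            child.text = product.findtext(tag)

output = ET.tostring(result, encoding="unicode")
print(output)
```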