CSE wi Final Exam Sample Solution

Similar documents
CSE 344 APRIL 27 TH COST ESTIMATION

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Database Systems CSE 414

Lecture 21: Query Optimization (1)

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Lecture 19: Query Optimization (1)

CSE 414 Database Systems section 10: Final Review. Joseph Xu 6/6/2013

Introduction to Database Systems CSE 344

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Review. Administrivia (Preview for Friday) Lecture 21: Query Optimization (1) Where We Are. Relational Algebra. Relational Algebra.

CSE 344 APRIL 20 TH RDBMS INTERNALS

CSE 344 FEBRUARY 14 TH INDEXING

CSE 344 Final Examination

CSE 444 Final Exam. August 21, Question 1 / 15. Question 2 / 25. Question 3 / 25. Question 4 / 15. Question 5 / 20.

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

CSE 444 Midterm Exam

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

CS 564 Final Exam Fall 2015 Answers

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Introduction to Data Management CSE 344. Lectures 9: Relational Algebra (part 2) and Query Evaluation

CSE 444 Final Exam. December 17, Question 1 / 24. Question 2 / 20. Question 3 / 16. Question 4 / 16. Question 5 / 16.

CSE 444 Final Exam 2 nd Midterm

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

IMPORTANT: Circle the last two letters of your class account:

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam

CSE 414 Final Examination

. : B.Sc. (H) Computer Science. Section A is compulsory. Attempt all parts together. Section A. Specialization lattice and Specialization hierarchy

McGill April 2009 Final Examination Database Systems COMP 421

CSE 444 Midterm Exam

CSE 344 Midterm Exam

CSE 544 Principles of Database Management Systems

CSE 344 Midterm. Wednesday, February 19, 2014, 14:30-15:20. Question Points Score Total: 100

CSE 344 Midterm. Wednesday, February 19, 2014, 14:30-15:20. Question Points Score Total: 100

CS 4320/5320 Homework 2

PLEASE HAND IN UNIVERSITY OF TORONTO Faculty of Arts and Science

CSE 344 Final Exam. March 17, Question 1 / 10. Question 2 / 30. Question 3 / 18. Question 4 / 24. Question 5 / 21.

Midterm 2: CS186, Spring 2015

In these relations, vin is a foreign key in ORDER referring to the CAR relation, and techid is a foreign key

CSE 444, Winter 2011, Midterm Examination 9 February 2011

CSE 344 Final Examination

1 (10) 2 (8) 3 (12) 4 (14) 5 (6) Total (50)

Introduction to Database Systems CSE 444

CSE 444, Winter 2011, Final Examination. 17 March 2011

CSE344 Final Exam Winter 2017

CSE 344 Final Examination

CSE 344 Final Examination

CSE 344 Final Examination

CSE 544 Principles of Database Management Systems

Query Processing & Optimization

CSE344 Midterm Exam Fall 2016

CPS 216 Spring 2003 Homework #1 Assigned: Wednesday, January 22 Due: Monday, February 10

Review: Query Evaluation Steps. Example Query: Logical Plan 1. What We Already Know. Example Query: Logical Plan 2.

Final Exam CSE232, Spring 97

CSE 544 Principles of Database Management Systems

CS 461: Database Systems. Final Review. Julia Stoyanovich

Announcements. Agenda. Database/Relation/Tuple. Discussion. Schema. CSE 444: Database Internals. Room change: Lab 1 part 1 is due on Monday

12. MS Access Tables, Relationships, and Queries

Database Management Systems

CSE 344 Midterm. Monday, Nov 4, 2013, 9:30-10:20. Question Points Score Total: 100

PLEASE HAND IN UNIVERSITY OF TORONTO Faculty of Arts and Science

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2009 Quiz I Solutions

CSE 414 Midterm. April 28, Name: Question Points Score Total 101. Do not open the test until instructed to do so.

VIEW OTHER QUESTION PAPERS

Database Management Systems Paper Solution

CS348: INTRODUCTION TO DATABASE MANAGEMENT (Winter, 2011) FINAL EXAMINATION

[18 marks] Consider the following schema for tracking customers ratings of books for a bookstore website.

CSE 530 Midterm Exam

CSE 414 Final Examination. Name: Solution

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

CS145 Midterm Examination

CSE 544, Winter 2009, Final Examination 11 March 2009

Agenda. Discussion. Database/Relation/Tuple. Schema. Instance. CSE 444: Database Internals. Review Relational Model

A7-R3: INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS

1. (a) Briefly explain the Database Design process. (b) Define these terms: Entity, Entity set, Attribute, Key. [7+8] FIRSTRANKER

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

IMPORTANT: Circle the last two letters of your class account:

Multiple-Choice. 3. When you want to see all of the awards, even those not yet granted to a student, replace JOIN in the following

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

CSE 444: Database Internals. Lecture 3 DBMS Architecture

Final Review. May 9, 2017

Final Review. May 9, 2018 May 11, 2018

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Thursday 16th January 2014 Time: 09:45-11:45. Please answer BOTH Questions

CSE 344 Final Examination

CSE 344 Midterm. November 9, 2011, 9:30am - 10:20am. Question Points Score Total: 100

Section 1: Redundancy Anomalies [10 points]

CSE344 Midterm Exam Winter 2017

L Information Systems for Engineers. Final exam. ETH Zurich, Autumn Semester 2017 Friday

Chapter 13: Query Optimization. Chapter 13: Query Optimization

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Announcements. Agenda. Database/Relation/Tuple. Schema. Discussion. CSE 444: Database Internals

Bachelor in Information Technology (BIT) O Term-End Examination

CSE 444: Database Internals. Lecture 3 DBMS Architecture

CS3DB3/SE4DB3/SE6DB3 TUTORIAL

Computer Science 597A Fall 2008 First Take-home Exam Out: 4:20PM Monday November 10, 2008 Due: 3:00PM SHARP Wednesday, November 12, 2008

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

CSE Midterm - Spring 2017 Solutions

CSE 444 Midterm Exam

Where Are We? Next Few Lectures. Integrity Constraints Motivation. Constraints in E/R Diagrams. Keys in E/R Diagrams

Transcription:

Question 1. (21 points, 7 each) SQL. We have a small database with information about people and events. Where: Person(pid, name) Event(eid, name, start_time, end_time) Invited(pid, eid, going) Person contains information about all people in the database. Each person has a unique integer id (pid), which is the key. Event is a list of all events. Each event has a unique integer event id (eid), which is the key. The start and end time of each event is given as an integer, which represents the number of seconds that have elapsed since 00:00:00 on January 1, 1970 (i.e., times are given as the number of seconds from a specific time in the past.) You may assume that the start_time of each event is always less than the event end_time. Each tuple in Invited indicates that the Person identified by pid has been invited to the Event identified by eid. The attribute going has the value 1 if the person will attend the event and is 0 if not. Attributes pid and eid are foreign keys referencing Person and Event respectively. Write SQL queries that return the information described below. Grading note: in many cases there are other possible queries that produce the requested answers. Those also received full credit if they were correct. (a) (7 points) Write a SQL query that returns the number of people who are invited to and are going to attend the event Film Festival. SELECT COUNT(*) FROM Invited i, Event e WHERE i.eid = e.eid AND e.name = Film Festival AND i.going = 1; (b) (7 points) Write a SQL query that finds the event or events that the most people have been invited to attend. The query should return the event id(s) (eid) and event name(s) (name) of the event(s) that have the most persons invited, regardless of whether those people plan to attend. SELECT e.eid, e.name FROM Event e, Invited i WHERE e.eid = i.eid GROUP BY e.eid, e.name HAVING COUNT(i.pid) >= ALL (SELECT COUNT(i2.pid) FROM Invited i2, Event e2 WHERE i2.eid = e2.eid GROUP BY e2.eid); CSE 414 Final Exam, March 17, 2014 Page 1 of 13

Question 1. (cont) (c) (7 points) Write a SQL query that returns the ids (pid) and names of all persons who have been invited to two distinct events that overlap in time. Two events overlap if Person(pid, name) Event(eid, name, start_time, end_time) Invited(pid, eid, going) one starts while the other is in progress. If the previous event ends at exactly the same time that a later one starts, the two events do not overlap. The query result should include each person (pid and name) only once, even if they have been invited to more than one overlapping event. Persons should be included in the result regardless of whether or not they plan to attend either or both events. SELECT DISTINCT p.pid, p.name FROM Person p, Invited i1, Invited i2, Event e1, Event e2 WHERE p.pid = i1.pid AND p.pid = i2.pid AND i1.eid = e1.eid AND i2.eid = e2.eid AND e1.eid <> e2.eid AND e1.start_time <= e2.start_time AND e2.start_time < e1.end_time; - - distinct events Question 2. (9 points, 3 each) Relational algebra. Suppose we have two relations A and B. Relation A has na tuples and relation B has nb tuples. For each of the following relational algebra expressions, give the minimum and maximum number of tuples that can appear in the result. (You do not need to show your work, but it might be helpful in case we need to assign partial credit to a not- quite- right answer.) (a) A B max = min(na, nb) min = 0 (b) σ c (A) B (where c is some logical (Boolean) condition) max = na min = 0 (c) A B (recall that is the left- outer- join operator) max = na * nb min = na CSE 414 Final Exam, March 17, 2014 Page 2 of 13

Question 3. (12 points, 4 each) XML. Suppose we have the following XML document describing a music catalog. The information is organized by label and lists for each label the artists and albums that have been released on that label. A single artist may have released albums on more than one label, so an artist name may appear under multiple label listings (A DTD for this data is given in the next problem. Feel free to reference that if it is useful. You also can remove this page for reference if you wish.) <Catalog> <Label> <Name> Swan Song </Name> <Artist> <Name> Led Zeppelin </Name> <Album> <Title> Physical Graffiti </Title> <Song> Custard Pie </Song> <Song> Kashmir </Song> <Price> 19.99 </Price> <Year> 1975 </Price> </Album> </Artist> </Label > <Label> <Name> Albert </Name> <Artist> <Name> AC/DC </Name> <Album> <Title> T.N.T. </Title> <Song> Live Wire </Song> <Song> T.N.T. </Song> <Song> High Voltage </Song> <Price> 19.99 </Price> <Year> 1975 </Year> </Album> <Album> <Title> Back in Black </Title> <Song> Shoot to Thrill </Song> <Song> Back in Black </Song> <Song> Rock and Roll Ain t Noise Pollution </Song> <Price> 25.99 </Price> <Year> 1980 </Year> </Album> </Artist> </Label> <Label> <Name> RCA </Name> <Artist> <Name> Justin Timberlake </Name> <Album> <Title> The 20/20 Experience </Title> <Song> Suit & Tie </Song> <Song> Mirrors </Song> <Price> 12.99 </Price> <Year> 2013 </Year> </Album> </Artist> </Label> </Catalog> (continued on next page) CSE 414 Final Exam, March 17, 2014 Page 3 of 13

Question 3. (cont.) Give the results of the following XPath expressions when they are applied to the XML document on the previous page. (a) //Price <Price> 19.99 </Price> <Price> 19.99 </Price> <Price> 25.99 </Price> <Price> 12.99 </Price> (b) Catalog/Label//Album[//Year/text() >= 1980]/Title <Title> Back in Black </Title> <Title> The 20/20 Experience </Title> (c) Catalog/Label[count(//Album/Song) > 2]/Name <Name> Albert </Name> CSE 414 Final Exam, March 17, 2014 Page 4 of 13

Question 4. (12 points, 6 each) For reference, the DTD of the previous question s XML data is: <!DOCTYPE Catalog [ <!ELEMENT Catalog (Label+)> <!ELEMENT Label (Name, Artist+)> <!ELEMENT Artist (Name, Album+)> <!ELEMENT Album (Title, Song+, Price, Year)> <!ELEMENT Name(#PCDATA)> <!ELEMENT Title(#PCDATA)> <!ELEMENT Song(#PCDATA)> <!ELEMENT Price(#PCDATA)> <!ELEMENT Year(#PCDATA)> ]> Below, give XQuery expressions that return the specified results. The queries should work on any valid XML document described by the above DTD, not just the sample data given in the previous problem. The results should be well- formed XML. If needed, you can assume the data is in a file named catalog.xml. Grading note: As with the SQL queries, there are other answers that produce the requested results. As long as they worked properly, they received full credit. (a) Return a list giving the names of every artist included in the catalog and, for each artist, the titles of all of the albums they have released on any label. <result> for $artist in distinct- values(doc( catalog.xml )//Artist/Name) return <Artist> {$artist/text()}{ for $album in doc( catalog.xml )//Label/Artist[/Name = $artist]//title return <Album> {$album/text()} </Album> } </Artist> </result> (b) Return a list showing for each year the names of artists who released an album during that year. <result> for $year in distinct- values(doc( catalog.xml )//Year) return <Year> {$year/text()} { for $artist in distinct- values(doc( catalog.xml )//Artist[//Year = $year]) return <Artist> {$artist/name/text()} </Artist> } </Year> </result> CSE 414 Final Exam, March 17, 2014 Page 5 of 13

Question 5. (20 points) Database design. We would like to design a database to describe the members of the Marvel/DC comic book and movie universe. This universe has the following properties that we want to model. A citizen has a name and unique social security number (ssn) A superhero has a name and a unique superhero id number (sidn) A superhero power has a type and an effect, and that combination is unique A city has a name, population, latitude and longitude A superhero may have a secret identity as a citizen A superhero may have one or more powers Citizens and superheroes reside within a city (i.e. Gotham, Metropolis) (a) (10 points) Draw an E/R diagram that captures this information. There are obviously several possible variations on this diagram and as long as your solution was a reasonable design for the given constraints it received credit. For instance a solution could use Name as the key for the City entities instead of latitude/longitude. Or it might be reasonable to restrict a Superhero to having a single Power, or allow a single Citizen to be the Secret Identity of more than one Superhero. (continued next page) CSE 414 Final Exam, March 17, 2014 Page 6 of 13

Question 5. (cont.) (b) (10 points) Give a series of SQL Create Table statements to create tables to hold the information in the E/R diagram from part (a) of this question. Your SQL statements should identify the key or keys of each table and include any appropriate constraints, including foreign key constraints. Create table City(name varchar(50), population INTEGER, latitude INTEGER, longitude INTEGER, primary key(latitude, longitude) ); Create table Person(name varchar(50), latitude INTEGER, longitude INTEGER, foreign key (latitude, longitude) references City(latitude, longitude) ); Create table Superhero(SIDN INTEGER primary key, foreign key (latitude, longitude) references Person(latitude, longitude) ); Create table Citizen(SSN INTEGER primary key, foreign key (latitude, longitude) references Person(latitude, longitude) Create table Power(type varchar(20), effect varchar(50), SIDN INTEGER references Superhero(SIDN), primary key(type, effect) ); Create table secretidentity( SIDN INTEGER references Superhero(SIDN), SSN INTEGER references Citizen(SSN), primary key(sidn, SSN) ); Grading note: We allowed any reasonable definitions that matched the E/R diagram given as an answer to the previous part of the question. CSE 414 Final Exam, March 17, 2014 Page 7 of 13

Question 6. (15 points) Database design. We have a database with information about university courses. Unfortunately, it was designed by Mr. Frumble, and it has only one table with this schema: Course(CourseID, Course_name, Instructor, Instructor_email, Max_students, Room) These functional dependencies exist between attributes in this table: CourseID - > Course_name CourseID - > Instructor CourseID - > Room Instructor - > Instructor_email Room - > Max_students Is this schema in BCNF? If so, give a justification for your answer and identify the key(s) in the schema. If not, identify the bad functional dependencies and decompose the schema into two or more tables that are in BCNF. Identify the key(s) of each table. No, it is not in BCNF. One bad dependency is Instructor - > Instructor_email because Instructor+ = {Instructor, Instructor_email}, which does not include all of the attributes. The other bad dependency is Room - > Max_students because Room+ = {Room, Max_students}, which also is not the entire set of attributes. First decomposition: break the original table into: I(Instructor, Instructor_email) and C2(CourseID, Course_name, Instructor, Max_students, Room) Instructor is the key of the I relation. C2 still has the bad dependency Room - > Max_students, so we break that into two tables: R(Room, Max_students) C(CourseID, Course_name, Instructor, Room) The three tables I, R, and C are in BCNF because they have no bad dependencies. Another solution would be to use the Room - > Max_students dependency for the first decomposition, then break apart the resulting tables using the Instructor - > Instructor_email one. CSE 414 Final Exam, March 17, 2014 Page 8 of 13

Question 7. (18 points, 6 each) Serializibility. For each of the following schedules, draw the precedence (conflict) graph and decide if the schedule is conflict- serializable. If the schedule is conflict- serializable, give an equivalent serial schedule by listing the transactions in an order they could occur (you do not need to write down all of the read- write operations, just give the order of the transactions). If the schedule is not conflict- serializable, explain why not. (a) R1(A) W4(C) R3(C) W2(B) R4(B) R3(B) Yes this is serializable and one possible order is T1, T2, T4, T3. T1 can happen anywhere, but the ordering T2, T4, T3 is the only possible ordering of those three transactions. (b) R3(B) R1(A) R1(B) W1(C) R2(A) R1(D) R2(C) W1(B) This is also serializable. The only possible order is T3, T1, T2. (c) R1(A) R2(A) W2(A) R3(B) R2(B) W3(B) W2(B) W1(A) Not serializable. There is a cycle between T1 and T2 on A, and a cycle between T2 and T3 on B. CSE 414 Final Exam, March 17, 2014 Page 9 of 13

Question 8. (16 points) Query estimation. One of our favorite queries involves tables of suppliers and parts. The two tables have the following schemas: Supplier(sid, sname, scity, sstate) Supply(sid, pno, quantity) We assume that these tables have the following characteristics: T(Supplier) = 1000 T(Supply) = 10000 B(Supplier) = 100 B(Supply) = 100 V(Supplier, scity) = 20 V(Supply, pno) = 2500 V(Supplier, state) = 10 Now, consider the following query, where the physical operators to be used are shown in parentheses. (On the fly) #!3,6+$ (On the fly)!!&"'(%)*+,-.+/$"$!!','+%)01/$"$234%5$ (Block nested loop join)!"#$%$!"#$ Supplier (File scan) Supply (File scan) (a) (4 points) Translate this query into SQL. SELECT sname FROM Supplier x, Supply y WHERE x.sid = y.sid AND scity = Seattle AND sstate = WA AND pno = 2 CSE 414 Final Exam, March 17, 2014 Page 10 of 13

Question 8 (cont.) (b) (6 points) What is the expected cost of this physical query? Remember that the cost of a query is measured in terms of the expected number of I/O operations, not including writing the result. Also, you should assume that there is enough main memory to hold all necessary data, if that is a consideration. Query diagram and relation information repeated for convenience: (On the fly) (On the fly) #!3,6+$!!&"'(%)*+,-.+/$"$!!','+%)01/$"$234%5$ Supplier(sid, sname, scity, sstate) Supply(sid, pno, quantity) T(Supplier) = 1000 B(Supplier) = 100 V(Supplier, scity) = 20 V(Supplier, state) = 10 (Block nested loop join)!"#$%$!"#$ T(Supply) = 10000 B(Supply) = 100 V(Supply, pno) = 2500 Supplier (File scan) Supply (File scan) Write your cost estimate below. It would help to show enough of your work so we can follow it, in case it is necessary to award partial credit. The cost of the block nested loop join is B(Supplier) + B(Supplier)*B(Supply) = 100 + 100*100 = 10100. There is no additional cost for the select and project operations since these don t involve additional disk I/O operations. (continued next page) CSE 414 Final Exam, March 17, 2014 Page 11 of 13

Question 8. (cont) (c) (6 points) Now suppose we use this different plan for the same SQL query. (On the fly) #!2,3+$ Supplier(sid, sname, scity, sstate) Supply(sid, pno, quantity) (Block nested loop join)!"#$%$!"#$ T(Supplier) = 1000 B(Supplier) = 100 V(Supplier, scity) = 20 V(Supplier, state) = 10!!&"'(%)*+,-.+/$"$!!','+%)01/$! $425%6$ T(Supply) = 10000 B(Supply) = 100 V(Supply, pno) = 2500 Supplier (File scan) Supply (File scan) What is the cost estimate for this different plan? (Again, it would help to show enough of your work so we can follow it, but be sure the final answer is clearly indicated.) The cost estimate for scanning the two files is B(Supplier) + B(Supply) = 200. The project operation on Supply will yield an estimated 10000/2500 = 4 tuples, which is far less than one block. The project operation on Supplier involves cities, which individually yields an estimated 1000/20 = 50 tuples and states, which is estimated to yield 1000/10 = 100 tuples. In any case, the project operation on Supplier will yield at most 5 blocks worth of data to be streamed to the join operation to be joined with the block of data from Supply. All of that can be streamed in memory with no additional disk operations, so the total cost estimate is 200. CSE 414 Final Exam, March 17, 2014 Page 12 of 13

Question 9. (12 points) Map- Reduce. One of the original applications of the map- reduce framework was to compute the page rank of each page on the web. The basic idea behind page rank is that pages are more likely to contain useful information if many other pages on the web contain links to them. For this problem we d like to compute a simplified version of page rank by counting the number of other pages that link to each page in our data. The input is a large collection of key- value data pairs (pageurl, linkurl) where pageurl is the key of the input data file and linkurl is the associated value. A (pageurl, linkurl) pair indicates that the page identified by pageurl contains a link to the page identified by linkurl. We assume that the data contains no duplicates, i.e., there is only one key- value pair for links from one page to another even if the page identified by pageurl contains multiple links to linkurl. Describe a sequence of one or more map- reduce jobs (not Pig programs) that will report the number of pages that link to each page that appears as a linkurl in the input data. The output should be a collection of key- value pairs (linkurl, count), where count is the number of pages that contain a link to linkurl. You need to clearly describe the (key, value) pairs that are input to and output from each map and reduce phase of the map- reduce job(s) needed. Be sure to clearly describe the input and output (key, value) pairs from each map and reduce stage, and explain (concisely) how the output of each map or reduce stage is computed from its input. If it takes more than one map- reduce job to compute the final result you should show how the output of each job is used as input to subsequent ones. This can be done in a single map- reduce pass since we are basically counting the number of times each linkurl appears in the second part of an input tuple. Map: input: (pageurl, linkurl) tuples output: (linkurl, 1) for each linkurl read in an input tuple Reduce: input: (linkurl, set of {1,1,1,1 } with a 1 for each copy of linkurl found in the map phase) output: (linkurl, n) where n is the number of 1 s associated with linkurl in the input To cut down on the communication overhead, the each worker in the map phase could store the linkurls it discovers in a hash table and count how many times each linkurl is seen, then emit a single (linkurl, c) tuple for each linkurl, where c is the number of occurrences discovered by that map worker. The reduce phase would then have as input (linkurl, {c1, c2,, cn}) where the ci numbers are the individual map counts. The output would be (linkurl, n) as before except that n would be the sum of c1+c2+ +cn. CSE 414 Final Exam, March 17, 2014 Page 13 of 13