CSEP 514 Midterm
Tuesday, Feb. 7, 2017, 5-6:20pm

Name:

Question   Points   Score
1          50
2          25
3          50
4          25
Total:     150

This exam is CLOSED book and CLOSED devices. You are allowed ONE letter-size page with notes (both sides). You have 80 minutes; budget time carefully.

Please read all questions carefully before answering them. Some questions are easier, others harder. Plan to answer all questions; do not get stuck on one question. If you have no idea how to answer a question, write your thoughts about the question for partial credit.

Good luck!
1 SQL and Indexing

1. (50 points) The following database contains the answers collected during a poll:

    Subject(sid, name, age)
    Question(qid, description, category)
    Answer(sid, qid, vote)

Subject stores all subjects that have been polled. Question stores a set of questions. For example, a question could be "Would you rather live in a large city than in a suburb?" Answer stores the answers. The vote is 0 or 1, representing no and yes respectively. Answers are voluntary, so not every subject answers every question: if a subject did not answer a question, then that (subject, question) pair is not inserted in the relation Answer.

All primary keys are underlined: sid in Subject, qid in Question, and (sid, qid) in Answer. The attribute types are as follows: sid, qid, age, vote are integers; name, description, category are text.
(a) (12 points) Write SQL statements to create the tables for the polling database. Choose the right types for the attributes, and define all key and foreign key constraints. You should turn in CREATE TABLE statements.

Solution:

    drop table if exists Answer;
    drop table if exists Subject;
    drop table if exists Question;

    create table Subject(sid int primary key, name text, age int);
    create table Question(qid int primary key, description text, category text);
    create table Answer(sid int references Subject,
                        qid int references Question,
                        vote int,
                        primary key (sid, qid));

    insert into Subject values (1, 'Alice', 22);
    insert into Subject values (2, 'Bob', 33);
    insert into Subject values (3, 'Carol', 44);

    insert into Question values (10, 'Q1?', 'sports');
    insert into Question values (20, 'Q2?', 'sports');
    insert into Question values (30, 'Q3?', 'living');

    insert into Answer values (1, 10, 1);
    insert into Answer values (1, 30, 1);
    insert into Answer values (2, 10, 0);
    insert into Answer values (2, 20, 1);
    insert into Answer values (2, 30, 1);
    insert into Answer values (3, 20, 0);
    insert into Answer values (3, 30, 0);

Grading notes:
2 points off for missing keys
2 points off for missing references
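The effect of the key and foreign key declarations can be checked with a small script. This is a minimal sketch that assumes SQLite (through Python's standard sqlite3 module) instead of the course's database system; the helper name rejected is hypothetical, and SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

# In-memory SQLite database; foreign keys are off by default in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table Subject(sid int primary key, name text, age int);
create table Question(qid int primary key, description text, category text);
create table Answer(sid int references Subject, qid int references Question,
                    vote int, primary key (sid, qid));
insert into Subject values (1, 'Alice', 22);
insert into Question values (10, 'Q1?', 'sports');
insert into Answer values (1, 10, 1);
""")

def rejected(sql):
    """Return True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Same (sid, qid) pair twice: violates the primary key of Answer.
duplicate_key = rejected("insert into Answer values (1, 10, 0)")
# Subject 99 does not exist: violates the foreign key to Subject.
dangling_sid = rejected("insert into Answer values (99, 10, 1)")
print(duplicate_key, dangling_sid)
```

Both inserts should be refused, which is what the "missing keys" and "missing references" deductions are about.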
(b) (15 points) Write a SQL query to compute, for every question, the number of yes votes. Return the questions in decreasing order of their number of yes votes. Your query should return the question id, its description, and its number of yes votes.

Solution:

    select z.qid, z.description, count(*) as cnt
    from Subject x, Answer y, Question z
    where x.sid = y.sid and y.qid = z.qid and y.vote = 1
    group by z.qid, z.description
    order by cnt desc;

Another solution is to replace count(*) with sum(y.vote) and drop the condition y.vote = 1.

Grading notes:
2 points off if vote is ignored.
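The two variants can be compared on the sample data from part (a). This is a sketch under some assumptions: it uses SQLite via Python's sqlite3, it omits the join with Subject for brevity, it adds qid as a tie-breaker in the ordering, and it adds a hypothetical question 40 that only received a no vote.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Question(qid int primary key, description text, category text);
create table Answer(sid int, qid int, vote int, primary key (sid, qid));
insert into Question values (10,'Q1?','sports'), (20,'Q2?','sports'),
                            (30,'Q3?','living'), (40,'Q4?','living');
insert into Answer values (1,10,1), (1,30,1), (2,10,0), (2,20,1),
                          (2,30,1), (3,20,0), (3,30,0), (3,40,0);
""")

# Variant 1: filter on vote = 1 and count the remaining rows.
by_count = conn.execute("""
    select z.qid, z.description, count(*) as cnt
    from Answer y, Question z
    where y.qid = z.qid and y.vote = 1
    group by z.qid, z.description
    order by cnt desc, z.qid
""").fetchall()

# Variant 2: keep every answer and sum the 0/1 votes.
by_sum = conn.execute("""
    select z.qid, z.description, sum(y.vote) as cnt
    from Answer y, Question z
    where y.qid = z.qid
    group by z.qid, z.description
    order by cnt desc, z.qid
""").fetchall()

print(by_count)
print(by_sum)
```

On this data the variants agree except for question 40: the count(*) form omits it, while the sum form reports it with 0 yes votes. Neither form lists a question that nobody answered at all.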
(c) (15 points) We say that two subjects are similar if they gave the same answers to at least 50 questions. Write a SQL query to return the names of all subjects who are similar to Alice. You may assume that Alice is an existing subject in your database, and that the name Alice is unique.

Solution:

    select x.sid, x.name
    from Subject a, Answer b, Subject x, Answer y
    where a.name = 'Alice' and a.sid = b.sid
      and x.sid = y.sid
      and b.qid = y.qid and b.vote = y.vote
    group by x.sid, x.name
    having count(*) >= 50;

The condition b.qid = y.qid ensures that two votes are compared only when they answer the same question.

Grading notes:
2 points off for computing the aggregate in a subquery
3 points partial credit for solutions that made no sense to me
7-8 points partial credit for using having count(*) < 50 combined with not exists
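A quick way to sanity-check the similarity query is to run it on the three-subject sample data from part (a) with a smaller threshold. The sketch below assumes SQLite via Python's sqlite3, assumes that the two Answer copies are matched on qid as well as on vote, and uses a hypothetical helper similar_to_alice whose parameter stands in for the constant 50.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Subject(sid int primary key, name text, age int);
create table Answer(sid int, qid int, vote int, primary key (sid, qid));
insert into Subject values (1,'Alice',22), (2,'Bob',33), (3,'Carol',44);
insert into Answer values (1,10,1), (1,30,1), (2,10,0), (2,20,1),
                          (2,30,1), (3,20,0), (3,30,0);
""")

def similar_to_alice(threshold):
    """Subjects sharing at least `threshold` identical answers with Alice."""
    return conn.execute("""
        select x.sid, x.name
        from Subject a, Answer b, Subject x, Answer y
        where a.name = 'Alice' and a.sid = b.sid
          and x.sid = y.sid
          and b.qid = y.qid and b.vote = y.vote
        group by x.sid, x.name
        having count(*) >= ?
        order by x.sid
    """, (threshold,)).fetchall()

at_least_one = similar_to_alice(1)
at_least_two = similar_to_alice(2)
print(at_least_one, at_least_two)
```

Alice and Bob agree only on question 30, and Alice and Carol agree on nothing, so Bob appears for threshold 1 but not for threshold 2. Alice is always similar to herself, since the query does not exclude x.sid = a.sid.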
(d) Consider the following four queries stated in English:

1. Find all categories where some question received a yes vote.
2. Find all categories where every question received a yes vote.
3. Find all categories where some question received only yes votes.
4. Find all categories where every question received only yes votes.

For each of the SQL queries below, indicate which of the four English queries above it corresponds to, or write NONE if it does not correspond to any English query.

i. (2 points)

    select distinct x.category
    from question x
    where not exists
        (select *
         from question u, answer v
         where x.category = u.category and u.qid = v.qid and v.vote = 0);

To which English query does it correspond? 4

ii. (2 points)

    select distinct x.category
    from question x
    where not exists
        (select *
         from answer v
         where x.qid = v.qid and v.vote = 0);

To which English query does it correspond? 3

iii. (2 points)

    select distinct x.category
    from question x
    where not exists
        (select *
         from question u
         where x.category = u.category
           and not exists
               (select *
                from answer v
                where u.qid = v.qid and v.vote = 1));

To which English query does it correspond? 2

iv. (2 points)

    select distinct x.category
    from question x, answer v
    where x.qid = v.qid and v.vote = 1;

To which English query does it correspond? 1
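The four SQL queries can be told apart on a small instance. The data below is hypothetical and was chosen so that each query returns a different set of categories; the sketch assumes SQLite via Python's sqlite3 and leaves descriptions empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Question(qid int primary key, description text, category text);
create table Answer(sid int, qid int, vote int, primary key (sid, qid));
""")
# Hypothetical instance: (qid, category) pairs and (sid, qid, vote) triples.
questions = [(1, 'a'), (2, 'a'), (3, 'b'), (4, 'b'), (5, 'c'),
             (6, 'c'), (7, 'd'), (8, 'e'), (9, 'e')]
answers = [(1, 1, 1), (2, 1, 1), (1, 2, 1), (1, 3, 1), (2, 3, 0), (1, 4, 1),
           (1, 5, 1), (2, 5, 0), (1, 6, 0), (1, 7, 0), (1, 8, 1), (1, 9, 0)]
conn.executemany("insert into Question values (?, '', ?)", questions)
conn.executemany("insert into Answer values (?, ?, ?)", answers)

queries = {
    "i": """select distinct x.category from question x where not exists
            (select * from question u, answer v
             where x.category = u.category and u.qid = v.qid and v.vote = 0)""",
    "ii": """select distinct x.category from question x where not exists
             (select * from answer v where x.qid = v.qid and v.vote = 0)""",
    "iii": """select distinct x.category from question x where not exists
              (select * from question u where x.category = u.category
               and not exists (select * from answer v
                               where u.qid = v.qid and v.vote = 1))""",
    "iv": """select distinct x.category from question x, answer v
             where x.qid = v.qid and v.vote = 1""",
}
result = {name: sorted(row[0] for row in conn.execute(sql))
          for name, sql in queries.items()}
print(result)
```

Category e, for example, has one question with only yes votes and one with only no votes, so it is returned by queries ii and iv but not by i or iii.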
2 Relational Algebra

2. (25 points)

(a) (10 points) Write a Relational Algebra expression in the form of a logical query plan (i.e., draw a tree, or write an RA expression) that is equivalent to the SQL query below.

    select distinct x.category
    from question x
    where not exists
        (select *
         from question u, answer v
         where x.category = u.category and u.qid = v.qid and v.vote = 0);

Solution (the plan tree, written here as an RA expression):

    $\delta\big(\Pi_{category}(Question) - \Pi_{category}(Question \bowtie_{qid=qid} \sigma_{vote=0}(Answer))\big)$

Figure annotation: This join may be dropped.

Grading notes:
-5 if a join
-3 for extra join
OK if missing δ
-2 points if missing or Π
3 points partial credit for nonsense
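The "not exists" query can be read as a set difference between all categories and the categories that contain a no vote. The sketch below checks that reading on a small instance; it assumes SQLite via Python's sqlite3, uses SQL's EXCEPT as a stand-in for the difference operator, and adds a hypothetical category food to the sample data so that the answer is non-empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Question(qid int primary key, description text, category text);
create table Answer(sid int, qid int, vote int, primary key (sid, qid));
insert into Question values (10,'Q1?','sports'), (20,'Q2?','sports'),
                            (30,'Q3?','living'), (40,'Q4?','food');
insert into Answer values (1,10,1), (1,30,1), (2,10,0), (2,20,1),
                          (2,30,1), (3,20,0), (3,30,0), (1,40,1);
""")

# The SQL query from the question.
not_exists = conn.execute("""
    select distinct x.category from question x
    where not exists
        (select * from question u, answer v
         where x.category = u.category and u.qid = v.qid and v.vote = 0)
    order by x.category
""").fetchall()

# All categories minus the categories that have some no vote.
difference = conn.execute("""
    select category from question
    except
    select u.category from question u, answer v
    where u.qid = v.qid and v.vote = 0
    order by category
""").fetchall()

print(not_exists, difference)
```

EXCEPT removes duplicates, which plays the role of δ in the plan. Agreement on one instance is only a sanity check, not a proof of equivalence.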
(b) Consider three relations R(A, B), S(C, D), T(E, F), where all attributes are integers. Which of the following relational algebra expressions are equivalent?

i. (3 points) $(R \bowtie_{B=C} S) \bowtie_{D=E} T = R \bowtie_{B=C} (S \bowtie_{D=E} T)$

Equivalent? Yes

ii. (3 points) $\sigma_{A<D}(R \bowtie_{B=C} S) = \sigma_{A<B}(R) \bowtie_{B=C} \sigma_{C<D}(S)$

Equivalent? No

iii. (3 points) $\gamma_{A,\, sum(D) \rightarrow K}(R \bowtie_{B=C} S) = \pi_{A,K}(R \bowtie_{B=C} \gamma_{C,\, sum(D) \rightarrow K}(S))$

Equivalent? No

iv. (3 points) $\gamma_{A,\, sum(D) \rightarrow K}(R \bowtie_{B=C} S) = \gamma_{A,\, sum(L) \rightarrow K}(R \bowtie_{B=C} \gamma_{C,\, sum(D) \rightarrow L}(S))$

Equivalent? Yes

v. (3 points) Assume B is a key in R(A, B): $R \bowtie_{B=C} S = R \bowtie_{B=C} \Pi_{CD}(R \bowtie_{B=C} S)$

Equivalent? Yes
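Answers iii and iv can be illustrated on a tiny instance where one A value joins with two different C groups. The sketch below is an assumption-laden toy: relations are Python lists of tuples, and the helpers join and gamma_sum are hypothetical names, not part of any library.

```python
from collections import defaultdict

def join(left, right, i, j):
    """Equi-join of two bags of tuples on left[i] == right[j]."""
    return [l + r for l in left for r in right if l[i] == r[j]]

def gamma_sum(rows, group, value):
    """Group by column `group`, sum column `value`; sorted (group, sum) pairs."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group]] += row[value]
    return sorted(totals.items())

R = [(1, 10), (1, 20)]   # R(A, B): A = 1 joins with two C groups
S = [(10, 5), (20, 7)]   # S(C, D)

# Left-hand side of iii and iv: aggregate after the join.
lhs = gamma_sum(join(R, S, 1, 0), 0, 3)

# Pre-aggregate S by C, then join with R; columns are (A, B, C, partial sum).
pre = gamma_sum(S, 0, 1)
joined = join(R, pre, 1, 0)

rhs_iii = sorted((a, k) for (a, b, c, k) in joined)   # projection only
rhs_iv = gamma_sum(joined, 0, 3)                      # re-aggregate on A

print(lhs, rhs_iii, rhs_iv)
```

The projection in iii leaves one partial sum per C group, while the second aggregation in iv adds them up again. The instance is a counterexample for iii and only a consistency check for iv.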
3 Query Execution and Indexes

3. (50 points)

(a) Answer true or false:

i. (2 points) Physical data independence means the ability of the query optimizer to select the best plan.

True or false? False

ii. (2 points) Physical data independence means that the database does not need to change when the underlying technology evolves over the years, such as the increased density of the data on hard discs.

True or false? False

iii. (2 points) Physical data independence means that the SQL queries don't need to change when we modify the physical store of the database, such as adding or removing indices or re-organizing the layout of the relations.

True or false? True

iv. (2 points) Given sufficient time and manpower, every Java program can be rewritten entirely in SQL.

True or false? False
(b) Assume that the table Answer(sid, qid, vote) is very large, and consider the following three queries:

    Q1 = select * from Answer where sid = 123456;
    Q2 = select * from Answer where qid = 333333;
    Q3 = select * from Answer where sid = 123456 and qid = 333333;

Further assume that there are many subjects and many questions, and that each subject answered only a small number of questions, and each question is answered by only a small number of subjects.

i. (2 points) Which of the queries Q1, Q2, Q3 can be answered efficiently using a B+-tree index on Answer(sid)? Indicate all queries that might benefit from this index.

Answer: Q1, Q3

ii. (2 points) Which of the queries Q1, Q2, Q3 can be answered efficiently using a B+-tree index on Answer(qid)? Indicate all queries that might benefit from this index.

Answer: Q2, Q3

iii. (2 points) Which of the queries Q1, Q2, Q3 can be answered efficiently using a B+-tree index on Answer(sid, qid)? Indicate all queries that might benefit from this index.

Answer: Q1, Q3

iv. (2 points) Which of the queries Q1, Q2, Q3 can be answered efficiently using a B+-tree index on Answer(qid, sid)? Indicate all queries that might benefit from this index.

Answer: Q2, Q3
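Answer iii can be observed in a real system. The sketch below assumes SQLite via Python's sqlite3, whose indexes are B-trees, and reads the plan with EXPLAIN QUERY PLAN; the index name idx_sid_qid and the helper plan are hypothetical, and other systems may choose differently once statistics are available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Answer(sid int, qid int, vote int)")
# Composite index with sid as the leading attribute, as in part iii.
conn.execute("create index idx_sid_qid on Answer(sid, qid)")

def plan(sql):
    """Join the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " ".join(row[3] for row in rows)

q1 = plan("select * from Answer where sid = 123456")
q2 = plan("select * from Answer where qid = 333333")
q3 = plan("select * from Answer where sid = 123456 and qid = 333333")

uses_index = {"Q1": "idx_sid_qid" in q1,
              "Q2": "idx_sid_qid" in q2,
              "Q3": "idx_sid_qid" in q3}
print(uses_index)
```

A predicate on qid alone does not constrain the leading attribute of the (sid, qid) index, so matching entries are scattered across the whole index and Q2 falls back to a table scan.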
(c) We have a very large relation Subject(sid, name, age) and need to compute the following logical query plan:

    $\sigma_{age=30}(Subject)$

i. (2 points) If there exists a clustered index on age then an index based selection is always more efficient than a sequential scan.

True or false? True

ii. (2 points) If there exists an unclustered index on age then an index based selection is always more efficient than a sequential scan.

True or false? False

(d) Let R(A, B), S(C, D) be two large relations, and assume we have four indexes, on R(A), on R(B), on S(C), and on S(D), denoted IA, IB, IC, ID. For each expression below, indicate which indices may be useful to compute it. If you have a choice, then write accordingly, e.g. IA or both (IB and IC); if it is best not to use an index at all, then write NONE.

i. (2 points) $R \bowtie_{B=C} S$

Answer: NONE

ii. (2 points) $\sigma_{A=1234}(R) \bowtie_{B=C} S$

Answer: IA and IC

iii. (2 points) $R \bowtie_{B=C} \sigma_{D=5678}(S)$

Answer: ID and IB

iv. (2 points) $\sigma_{A=1234}(R) \bowtie_{B=C} \sigma_{D=5678}(S)$

Answer: (IA and IC) or (ID and
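The answers to (c) follow from the usual textbook I/O estimates: a sequential scan reads B(R) blocks, a clustered index selection reads about B(R)/V(R, a) blocks, and an unclustered one reads about T(R)/V(R, a) blocks, where T is the number of tuples and V the number of distinct values. The sketch below plugs in hypothetical statistics for Subject and ignores the cost of traversing the index itself.

```python
def scan_cost(blocks):
    """Sequential scan: read every block once."""
    return blocks

def clustered_cost(blocks, distinct):
    """Clustered index: matching tuples sit on consecutive blocks."""
    return blocks / distinct

def unclustered_cost(tuples, distinct):
    """Unclustered index: up to one block read per matching tuple."""
    return tuples / distinct

# Hypothetical statistics for Subject and the attribute age.
B = 10_000        # blocks
T = 1_000_000     # tuples
V_age = 50        # distinct ages

scan = scan_cost(B)
clustered = clustered_cost(B, V_age)
unclustered = unclustered_cost(T, V_age)
print(scan, clustered, unclustered)
```

With these numbers the clustered index needs 200 block reads, the scan 10,000, and the unclustered index 20,000, so the unclustered index loses to the scan. In this model the clustered estimate can never exceed B(R), which matches answer i.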
(e) Consider three large relations R(A, B), S(C, D), T(E, F), and the following query plan:

    $(R \bowtie_{B=C} S) \bowtie_{D=E} T$

The optimizer uses the following physical plan:

    Create in main memory a hash table for S
    Create in main memory a hash table for T
    Probe R.

Assume that the result of the plan is sent directly to the client, and is not stored in the main memory.

i. (2 points) Is R pipelined?

Answer: Yes

ii. (2 points) Is S pipelined?

Answer: No

iii. (2 points) This physical plan is executable if and only if the relation R fits entirely in main memory.

Answer: No

iv. (2 points) This physical plan is executable if and only if both S and T fit together in main memory.

Answer: Yes

v. (2 points) This physical plan is executable if and only if all three relations R, S, and T fit together in main memory.

Answer: No
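The physical plan can be mimicked with Python generators. This is only a sketch of the idea, with hypothetical function names build and plan and made-up tuples: S and T are fully read into in-memory hash tables, while R is consumed one tuple at a time and each result is handed to the caller immediately.

```python
from collections import defaultdict

def build(rows, key):
    """Blocking build phase: read every row and hash it on column `key`."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def plan(R, S, T):
    """(R join S on B=C) join T on D=E, with S and T in memory and R streamed."""
    hs = build(S, 0)                      # hash S on C
    ht = build(T, 0)                      # hash T on E
    for (a, b) in R:                      # R is probed tuple by tuple, never stored
        for (c, d) in hs.get(b, []):
            for (e, f) in ht.get(d, []):
                yield (a, b, c, d, e, f)  # sent to the client right away

# Hypothetical data; R is a generator to stress that it is only streamed.
R = ((a, b) for (a, b) in [(1, 10), (2, 30), (3, 20)])
S = [(10, 100), (20, 200)]
T = [(100, 7), (200, 8), (100, 9)]

output = list(plan(R, S, T))
print(output)
```

Only the two hash tables occupy memory for the duration of the plan, which is why the memory requirement concerns S and T together and not R.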
(f) Answer yes or no:

i. (2 points) If a physical operator is pipelined, then it can return answers to the parent before its child operator finishes processing all the data.

Answer: Yes

ii. (2 points) If a physical operator is blocking, then it can return answers to the parent before its child operator finishes processing all the data.

Answer: No

iii. (2 points) In general, a blocking operator is more efficient than a pipelined operator.

Answer: No

iv. (2 points) In general, a blocking operator requires more memory or more disk space than a pipelined operator.

Answer: Yes

v. (2 points) A selection operator $\sigma_{A=30}(R)$ is blocking.

Answer: No

vi. (2 points) A merge-join operator $R \bowtie_{B=C} S$ is blocking.

Answer: Yes
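The difference between a pipelined and a blocking operator can be made visible by counting how many input rows have been pulled when the first output row appears. The sketch below is a toy with hypothetical helpers counted, select, and sort; the sort stands in for the sorting step that a merge-join typically needs on unsorted inputs.

```python
def counted(rows, log):
    """Yield rows while recording each row pulled from the child."""
    for row in rows:
        log.append(row)
        yield row

def select(rows, pred):
    """Pipelined selection: emits a qualifying row as soon as it is seen."""
    return (row for row in rows if pred(row))

def sort(rows, key):
    """Blocking sort: reads the whole input before emitting anything."""
    yield from sorted(rows, key=key)

data = [(30, 'a'), (25, 'b'), (30, 'c'), (41, 'd')]

# Ask each operator for its first output row only.
seen_select = []
first_selected = next(select(counted(data, seen_select), lambda r: r[0] == 30))

seen_sort = []
first_sorted = next(sort(counted(data, seen_sort), key=lambda r: r[0]))

print(first_selected, len(seen_select))
print(first_sorted, len(seen_sort))
```

The selection returns its first row after reading one input row, whereas the sort has to consume all four rows first, and it must keep them somewhere in the meantime.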
4 Entity-Relationship Diagrams

4. (25 points)

(a) (15 points) You are volunteering to help a political party reach out to voters. Design an E/R diagram for your database to store the following information.

Political parties: they have a name.
People: they have names, address, phone number.
Voters are people.
Volunteers are people; each volunteer is affiliated with a party.
Contact: is a relationship between a volunteer and a voter. Each contact has a date, when that contact was made.
Leans: some voters lean towards a political party (only one).

Solution:

[E/R diagram: entity Person with attributes name, address, phone; Volunteer and Voter are each connected to Person by isa; relationship Contact, with attribute date, between Volunteer and Voter; relationship Affiliation between Volunteer and Party; relationship Leans between Voter and Party.]
(b) (10 points) Consider the E/R Diagram below:

[E/R diagram: entity Employee with attributes eid, name; Developer (attribute level) and Manager (attribute department) are each connected to Employee by isa; relationship Manages between Employee and Manager; relationship Review between Developer and Manager.]

Design the corresponding relational schema. Choose reasonable types for the attributes (integer or text). Show all keys and foreign keys. You should turn in a set of CREATE TABLE statements.

Solution:

    drop table if exists Review;
    drop table if exists Developer;
    drop table if exists Manager;
    drop table if exists Employee;

    create table Employee (
        eid int primary key,
        name text,
        m int -- references Manager
        -- note: circular references in Postgres require ALTER TABLE
        -- of course, this was not required on the exam
    );

    create table Manager (
        eid int primary key references Employee,
        department text);

    create table Developer (
        eid int primary key references Employee,
        level int);

    create table Review (
        did int references Developer,
        mid int references Manager);