FS 2016
Data Modelling and Databases
Date: June 9, 2016
ETH Zurich, Systems Group
Prof. Gustavo Alonso

Exam

Name:

Question | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | Total
Points   | 15 | 20 | 15 | 10 | 10 | 15 | 10 | 15 | 10 | 10 | 20 | 150
Score    |    |    |    |    |    |    |    |    |    |    |    |

Rules (please read carefully)

- You have 120 minutes for the exam.
- Please write your name and Legi number on this cover page.
- Please write your Legi number on all other pages, including any additional pages you use.
- Please write your answers on the exam sheets.
- Use blue or black ink. DO NOT USE red ink. DO NOT USE pencils.
- Write as clearly as possible and cross out everything that you do not consider to be part of your solution.
- Answers can be given in either English or German.

Remarks

Most questions are designed such that the fastest way to solve them is to solve the problem without looking at the answers and only then to find the given answer that corresponds to your solution. We expect it to be slower to try out all combinations of questions and solutions in order to find the ones that match.

1 Mapping ER to Relational model (15 points)

Consider the ER models 1 to 4 and the relational models (a) to (j) given below. For each ER model, give a relational model that represents it correctly by adding a checkmark (✓) in the corresponding field in the table. Add only one checkmark per ER model. If an ER model is represented by none of the given relational models, add a checkmark in the "none" column.

ER Model   | none | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i) | (j)
ER model 1 |      |     |     |     |     |     |     |     |     |     |
ER model 2 |      |     |     |     |     |     |     |     |     |     |
ER model 3 |      |     |     |     |     |     |     |     |     |     |
ER model 4 |      |     |     |     |     |     |     |     |     |     |

[Figure: ER diagrams for ER model 1, ER model 2, ER model 3, and ER model 4. The diagrams involve the entities B, C, and D, the attributes b, c, d, s, and r, and the relationships R and S (an is_a relationship in ER model 4), with 1 and N cardinality annotations.]

Relational models:

(a) B(b)   C(c)   D(c,d,s)   R(b,c,d,r)
(b) B(b)   C(c)   D(c,d,s)   R(b,c,c,r)
(c) B(b)   C(c)   D(c,d,s)   R(b,c,c,d,r), extra key: b,d,c
(d) B(b)   C(c)   D(c,d,s)   R(b,c,d,r), extra key: b,d
(e) B(b)   R(b,c,d,r), extra key: b,d   S(c,d,s)
(f) B(b)   C(c)   D(c,d,s)   R(b,c,c,d,r)
(g) B(b)   C(c)   D(c,d,s)   R(b,c,d,r), extra key: b,d
(h) B(b)   D(c,d)   R(b,c,d,r), extra key: b,d   S(c,d,s)
(i) B(b)   C(c)   D(c,d,s)   R(b,c,d,r)
(j) B(b)   C(c,d,s)   D(d)   R(b,c,d,r), extra key: b,d

2 Relational Algebra (20 points)

Consider the following relational schema:

    Runner (Name, Birthday, Country)
    Run (Name, Distance, Time)

A runner can run in several runs over different race distances. Thanks to high-speed cameras, two runners cannot have the exact same time in the same run.

(a) (10 points) For every description, find a matching relational algebra query. For some descriptions there is no matching query.

1. All 100m race distance runs in which only runners from Switzerland (CH) participated.
2. All runs with a distance greater than 100m in which only runners from Switzerland (CH) participated.
3. All runs in which only runners from Switzerland (CH) participated.
4. All 100m race distance runs in which the runners were not from Switzerland (CH).
5. All runs in which the runners were not from Switzerland (CH).

A) Π Name,Distance,Time ((Runner σ Country != CH (Runner)) (Run σ Distance < 100 (Run)))
B) Π Name,Distance,Time ((σ Country = CH (Runner) σ Country != CH (Runner)) σ Distance != 100 (Run σ Distance < 100 (Run)))
C) Π Name,Distance,Time ((σ Country != CH (Runner) σ Country = CH (Runner)) σ Distance = 100 (Run σ Distance > 100 (Run)))
D) Π Name,Distance,Time ((σ Country = CH (Runner) σ Country != CH (Runner)) σ Distance = 100 (Run σ Distance < 100 (Run)))
E) Π Name,Distance,Time (σ Country != CH (Runner) ) σ Distance > 100 (Run σ Distance < 100 (Run)))

Fill in the table below by writing the letter of the matching query under each description number. If there is no matching query for a description, put a cross.

Description | 1 | 2 | 3 | 4 | 5
Query       |   |   |   |   |

(b) (5 points) Which of the following relational algebra expressions finds all runners who participated only in 100m race distance runs?

1. Π Name (σ Distance=100 Run)
2. Π Name (Runner) Π Name (σ Distance=100 Run)
3. Π Name (Run) Π Name (σ Distance=100 Run)
4. Π Name (Run) Π Name (σ Distance!=100 Run)
5. Π Name (Runner) Π Name (σ Distance!=100 Run)

(c) (5 points) We now want to find the winners for every distance. A winner has the shortest time for a given distance.

1. Π Name,Country,Distance,Time (Runner (Run Π Run1.Name,Run1.Distance,Run1.Time ( σ Run1.time>Run2.time (ρ Run1 (Run) Name ρ Run2 (Run)))))
2. Π Name,Country,Distance,Time (Runner (Run Π Run1.Name,Run1.Distance,Run1.Time ( σ Run1.time>Run2.time (ρ Run1 (Run) Distance ρ Run2 (Run)))))
3. Π Name,Country,Distance,Time (Runner (Run Π Run1.Name,Run1.Distance,Run1.Time ( σ Run1.time<Run2.time (ρ Run1 (Run) Name (ρ Run2 (Run))))))
4. Π Name,Country,Distance,Time (Runner (Run Π Run1.Name,Run1.Distance,Run1.Time ( σ Run1.time<Run2.time (ρ Run1 (Run) Distance ρ Run2 (Run)))))

Mark all the queries that find the winners with a checkmark (✓) in the table below.

1 | 2 | 3 | 4
  |   |   |

3 Integrity constraints (15 points)

Consider the following schemas:

Schema A

    CREATE TABLE tab1 (
        tab1_key INT NOT NULL UNIQUE);
    CREATE TABLE tab2 (
        tab2_ref INT NOT NULL,
        FOREIGN KEY (tab2_ref) REFERENCES tab1 (tab1_key)
            ON DELETE RESTRICT ON UPDATE NO ACTION);

Schema B

    CREATE TABLE tab1 (
        tab1_key INT NOT NULL);
    CREATE TABLE tab2 (
        tab2_ref INT NOT NULL);

Schema C

    CREATE TABLE tab1 (
        tab1_key INT NOT NULL UNIQUE);
    CREATE TABLE tab2 (
        tab2_ref INT NOT NULL,
        FOREIGN KEY (tab2_ref) REFERENCES tab1 (tab1_key));

Schema D

    CREATE TABLE tab1 (
        tab1_key INT PRIMARY KEY);
    CREATE TABLE tab2 (
        tab2_key INT PRIMARY KEY,
        tab2_ref INT NOT NULL,
        CONSTRAINT tab2_fk FOREIGN KEY (tab2_ref) REFERENCES tab1 (tab1_key)
            ON DELETE CASCADE ON UPDATE SET NULL);
    CREATE TABLE tab3 (
        tab3_ref INT,
        CONSTRAINT tab3_fk FOREIGN KEY (tab3_ref) REFERENCES tab2 (tab2_key)
            ON DELETE CASCADE ON UPDATE CASCADE);

Schema E

    CREATE TABLE tab1 (
        tab1_key INT PRIMARY KEY);
    CREATE TABLE tab2 (
        tab2_key INT PRIMARY KEY,
        tab2_ref INT NOT NULL,
        FOREIGN KEY (tab2_ref) REFERENCES tab1 (tab1_key)
            ON DELETE CASCADE ON UPDATE CASCADE);
    CREATE TABLE tab3 (
        tab3_ref INT,
        FOREIGN KEY (tab3_ref) REFERENCES tab2 (tab2_key)
            ON DELETE SET NULL ON UPDATE CASCADE);

For schemas A-C, the tables initially have the following content:

    tab1: tab1_key      tab2: tab2_ref
          1                   1
          2                   2

For schemas D and E, the tables initially have the following content:

    tab1: tab1_key      tab2: tab2_key | tab2_ref      tab3: tab3_ref
          1                   10       | 1                   10
          2                   20       | 2                   20

How many records will each schema hold (summed over all of its tables) after executing individually one of the following statements? Fill in the table below and mark an execution error with X.

Statements:

I.   UPDATE tab2 SET tab2_ref=3 WHERE tab2_ref=1;
II.  UPDATE tab1 SET tab1_key=3 WHERE tab1_key=1;
III. DELETE FROM tab1 WHERE tab1_key=1;

Schema | I | II | III
A      |   |    |
B      |   |    |
C      |   |    |
D      |   |    |
E      |   |    |
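As a rough illustration of how referential actions behave, the sketch below uses Python's built-in sqlite3 module with two hypothetical tables, parent and child, that are not part of the exam schemas. It assumes SQLite semantics with foreign key enforcement switched on; other database systems may differ in details, so it should not be read as a solution to the table above.

```python
import sqlite3


def total_rows(conn):
    """Sum the row counts of the two hypothetical tables."""
    return sum(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("parent", "child")
    )


conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " ref INTEGER NOT NULL,"
    " FOREIGN KEY (ref) REFERENCES parent (id) ON DELETE CASCADE)"
)
conn.executemany("INSERT INTO parent VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO child VALUES (?)", [(1,), (2,)])

rows_before = total_rows(conn)

# A child row that references a missing parent violates the constraint.
try:
    conn.execute("INSERT INTO child VALUES (3)")
    dangling_insert_rejected = False
except sqlite3.IntegrityError:
    dangling_insert_rejected = True

# Deleting a parent row also removes the child rows that reference it.
conn.execute("DELETE FROM parent WHERE id = 1")
rows_after = total_rows(conn)
```

In this sketch the rejected insert leaves the row count unchanged, while the cascading delete removes one row from each table.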

4 SQL I (10 points)

Given is the following schema (based on the ZVV timetable) with two tables, trips and stop_times, where trips contains a list of the trips that trams make within one day, and stop_times contains all the stops made for every trip:

    CREATE TABLE trips(
        trip_id INTEGER NOT NULL PRIMARY KEY,
        tram_number INTEGER NOT NULL
    );

    CREATE TABLE stop_times(
        trip_id INTEGER NOT NULL REFERENCES trips(trip_id),
        stop_sequence INTEGER NOT NULL,
        stop_name VARCHAR(50) NOT NULL,
        arrival_time TIMESTAMP NOT NULL,
        departure_time TIMESTAMP NOT NULL,
        PRIMARY KEY (trip_id, stop_sequence)
    );

(a) (5 points) The following queries can be executed on the schema. All queries are functional and return some result.

1. SELECT arrival_time
   FROM stop_times
   ORDER BY arrival_time DESC
   LIMIT 1

2. SELECT MAX(arrival_time)
   FROM stop_times
   GROUP BY trip_id

3. SELECT MAX(arrival_time)
   FROM stop_times JOIN trips USING (trip_id)

4. SELECT MAX(arrival_time) as arrival_time
   FROM stop_times st, trips t
   WHERE st.trip_id = t.trip_id
   GROUP BY t.trip_id
   ORDER BY arrival_time DESC
   LIMIT 1

Which of the above queries are equivalent? Two queries are equivalent if they return the same set of results for any data that the database may contain.

(b) (5 points) A tram track is defined as a tuple of two consecutive stops (stop_name1, stop_name2). Two stops are consecutive if there is a trip (trip_id) which contains both stops and in which their stop_sequence numbers differ by 1, i.e., stop_sequence(stop_name2) = 1 + stop_sequence(stop_name1). Fill in the blanks below to obtain a SQL query that finds the number of trips for each tram track and lists the 10 most frequented tram tracks.

    SELECT   st1.stop_name, st2.stop_name, COUNT(*) as tcount
    FROM     stop_times st1, stop_times st2
    WHERE    ______________________________
    AND      ______________________________
    GROUP BY ______________________________
    ORDER BY ______________________________
    LIMIT 10

5 SQL II (10 points)

Consider two tables:

    director (id, name)
    movie (title, dir_id, year)

Here, dir_id is a foreign key and references the id of the director. In addition, the name column of director is unique. Assume that the database already contains data but no NULL values, and that the earliest year for a movie stored in the database is 1919.

The following five inserts are executed one after the other in the database:

1: INSERT INTO director (id, name) VALUES (10001, 'Jack Thompson')
2: INSERT INTO movie (title, dir_id, year) VALUES ('Foo Movie 2', 10001, NULL)
3: INSERT INTO director (id, name) VALUES (10002, 'Thomas Smith')
4: INSERT INTO movie (title, dir_id, year) VALUES ('Some Movie', 10001, NULL)
5: INSERT INTO movie (title, dir_id, year) VALUES ('Some Movie 2', NULL, 1999)

Following the inserts, the three queries below are executed on the database:

A. SELECT DISTINCT dir_id FROM movie WHERE year=1999
B. SELECT count(*) FROM movie, director WHERE movie.dir_id=director.id AND director.name LIKE 'Thom%'
C. SELECT year, count(title) FROM movie GROUP BY year

Using the table below, mark which of the three queries (A, B, and C) would return a different set of records before and after running the inserts (1 to 5 above). Identify all inserts that modify the result. If the result does not change, leave that column empty.

Query | Result set changes? (Y/N) | If yes, these inserts modify it:
A.    |                           |
B.    |                           |
C.    |                           |

6 Functional Dependencies & 3NF (15 points)

Consider the following relation:

    R(A, B, C, D, E)

with the following functional dependencies (there are no further non-trivial functional dependencies):

    D → AC
    AB → CD
    BD → AE
    A → C
    ABC → E

(a) (9 points) For each set of attributes in the table below, decide whether it is a candidate key of R. If it is not a candidate key, add a functional dependency that, if added to the functional dependencies above, would make it a candidate key. This functional dependency for the potential candidate key X contains X on the left-hand side and, on the right-hand side, all attributes that are not in the closure of X. For example, if R_example = (W, X, Y, Z) with X → Y, then X becomes a candidate key by adding the dependency X → WZ. If there is no such dependency, please write None.

Possible Candidate Key | Is Cand. Key | Is not Cand. Key | Added Functional Dependency
A                      |              |                  |
D                      |              |                  |
AB                     |              |                  |
BC                     |              |                  |
BD                     |              |                  |
ABC                    |              |                  |

(b) (6 points) Apply the synthesis algorithm to transform the schema into 3NF (lossless and dependency preserving). You might need to find the minimal basis of the relation first.
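The closure-based reasoning in the R_example illustration of part (a) can be sketched in a few lines of Python. The helper names closure and is_candidate_key are hypothetical, and functional dependencies are encoded as (left-hand side, right-hand side) pairs of attribute strings; this is one possible encoding, not a prescribed one.

```python
def closure(attrs, fds):
    """Return the attribute closure of attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply a dependency when its left-hand side is already covered.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_candidate_key(attrs, all_attrs, fds):
    """A candidate key determines every attribute and is minimal."""
    attrs = set(attrs)
    if closure(attrs, fds) != set(all_attrs):
        return False
    return all(closure(attrs - {a}, fds) != set(all_attrs) for a in attrs)


# The example from the question text: R_example = (W, X, Y, Z) with X -> Y.
R_EXAMPLE = "WXYZ"
fds = [("X", "Y")]

closure_of_x = closure("X", fds)
# Attributes missing from the closure form the right-hand side to add.
missing = "".join(sorted(set(R_EXAMPLE) - closure_of_x))
fds_extended = fds + [("X", missing)]

x_is_key_before = is_candidate_key("X", R_EXAMPLE, fds)
x_is_key_after = is_candidate_key("X", R_EXAMPLE, fds_extended)
```

Here missing evaluates to "WZ", which matches the added dependency X → WZ from the example.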

7 Query Processing (10 points)

In the following query, the query processor creates a join for the R.id = S.id condition and decides to push the inequality (i.e., > instead of = in the join condition) into another join:

    SELECT *
    FROM R, S, T
    WHERE R.id = S.id AND S.id > T.id

Furthermore, the following assumptions about the joins and the relations in this query hold:

- Relation R contains 100k tuples.
- Relation S contains 2k tuples.
- Relation T contains 10k tuples.
- Both joins have a selectivity of 10%. That means that, given the Cartesian product of any two join relations, 10% of the tuples get passed on to the next stage. Example: R1 contains 20 tuples and R2 contains 10 tuples, therefore the Cartesian product contains 20*10=200 tuples. Out of these, 20 (=10%) match the join condition and are the output of the join.

There exist three join implementations that the query processor may apply:

- Nested-Loop Join (NLJ)
- Grace Hash Join (GHJ)
- Sort-Merge Join (SMJ)

(a) (4 points) Determine which of the three above implementations of the join R ⋈ S has which execution complexity in the following table. Here, |R| and |S| denote the number of tuples in table R and S, respectively. Assume that |R| > |S| holds.

Complexity          | O(|R| + |S|) | O(|R| · |S|) | O(|R| log |R|)
Join Implementation |              |              |

(b) (6 points) The query processor chooses query plans that have the lowest cost for execution. Assume a cost model where the cost of a join is based on its execution complexity. For example, if the complexity is O(|R| + |S|) with |R| = 10 and |S| = 20, then the cost of this join is 10 + 20 = 30. Rank the query plans on the left-hand side of the table below by execution cost. A rank of 1 means that the plan has the lowest execution cost. Assume a logarithm of base 10. Justify your decision:

Query Plan          | Rank
(R ⋈SMJ S) ⋈NLJ T   |
R ⋈SMJ (S ⋈NLJ T)   |
(R ⋈GHJ S) ⋈NLJ T   |
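The two worked numbers in this question (the 10% selectivity example and the 10 + 20 = 30 cost example) can be reproduced with a small Python sketch. The function names are hypothetical, and the three cost functions simply evaluate the complexity formulas from part (a) without saying which join implementation each belongs to.

```python
import math


def join_output_size(left, right, selectivity_percent=10):
    """Tuples passed on by a join: a fixed share of the Cartesian product."""
    return left * right * selectivity_percent // 100


def cost_sum(r, s):
    """Cost for complexity O(|R| + |S|)."""
    return r + s


def cost_product(r, s):
    """Cost for complexity O(|R| * |S|)."""
    return r * s


def cost_sort(r):
    """Cost for complexity O(|R| log |R|), with a base-10 logarithm."""
    return r * math.log10(r)


# Selectivity example from the text: 20 and 10 tuples give 200 pairs, 20 pass.
example_output = join_output_size(20, 10)

# Cost example from the text: |R| = 10 and |S| = 20 under O(|R| + |S|).
example_cost = cost_sum(10, 20)
```

To cost a full plan under these assumptions, the output size of the inner join would be fed in as an input size of the outer join, and the costs of both joins added up.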

8 Decomposition Lemma (15 points)

Consider the following relational schema

    Student_Lecture (Student_Id, Student_Name, Student_Address, Lecture, Date_Enrolled, Teaching_Assistant)

with the following functional dependencies:

    Student_Id → Student_Name, Student_Address
    Student_Id, Lecture → Date_Enrolled, Teaching_Assistant

(a) (3 points) Give an example of a candidate key and a super key of Student_Lecture.

(b) (3 points) In which normal form is the relation Student_Lecture?

    1NF    2NF    3NF    BCNF    4NF

(c) (3 points) Which functional dependencies prevent the relation from being in the next higher normal form?

    Student_Id → Student_Name, Student_Address
    Student_Id, Lecture → Date_Enrolled, Teaching_Assistant

(d) (6 points) Determine a lossless decomposition of the schema into R1 and R2 that preserves all the functional dependencies. Formally prove that the decomposition is lossless.

9 Minimal Basis (10 points)

Consider the following set S of functional dependencies:

    (1) A → B
    (2) B → C
    (3) AD → E
    (4) BC → D
    (5) AC → DG
    (6) C → E
    (7) A → C
    (8) CD → F

Which of the following is a minimal basis of S? If multiple solutions exist, mark all of them. If no solution is correct, choose "None of the above".

- A → B    B → C    A → G    B → D    C → E    CD → F
- A → B,E    B → C    BE → D    AD → G    CD → F
- A → B    AE → DG    B → D    D → E    BD → C    CD → F
- C → G    B → C    AC → E    A → B    D → E    CD → F
- None of the above

10 Transactions (10 points)

For each of the following histories, indicate the strictest recoverability class and, if the history is serializable, one possible serialization order.

Operations:

- r_i(A): Transaction i reads data object A.
- w_i(A): Transaction i writes to data object A.
- c_i: Transaction i commits.
- a_i: Transaction i aborts.

History:

1. w1(A) r2(B) w1(B) c1 w2(A) c2
2. r2(A) r1(A) w1(A) r1(B) w1(B) w2(A) c1 c2
3. w2(B) w1(A) r2(A) w2(A) c2 c1
4. w3(A) w2(A) c3 w1(A) r1(B) w2(A) c2 w1(B)
5. w2(A) w1(A) r2(A) w2(A) c1 c2
6. w1(A) w2(A) r2(A) w2(A) c2 r1(A) c1
7. w2(A) r1(A) w3(A) a3 c2 c1
8. r2(C) w1(A) r2(B) w1(B) c1 w2(C) c2

History | Not Recoverable | Recoverable | ACA | Strict | Serialization Order | Not Serializable
1       |                 |             |     |        |                     |
2       |                 |             |     |        |                     |
3       |                 |             |     |        |                     |
4       |                 |             |     |        |                     |
5       |                 |             |     |        |                     |
6       |                 |             |     |        |                     |
7       |                 |             |     |        |                     |
8       |                 |             |     |        |                     |
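One common way to look for a serialization order is to build a precedence graph from conflicting operations and sort it topologically. The Python sketch below does this for conflict serializability only; it ignores commits and aborts, so it says nothing about recoverability classes. The function names and the two sample histories over the objects X and Y are hypothetical and are not taken from the table above.

```python
from itertools import combinations


def precedence_edges(history):
    """Edges (Ti, Tj): an operation of Ti conflicts with a later one of Tj.

    history is a list of (op, txn, obj) tuples with op in {"r", "w"}.
    Two operations conflict if they belong to different transactions,
    touch the same object, and at least one of them is a write.
    """
    edges = set()
    for (op1, t1, x1), (op2, t2, x2) in combinations(history, 2):
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def serialization_order(history):
    """Return one serial order, or None if the precedence graph is cyclic."""
    edges = precedence_edges(history)
    remaining = {t for _, t, _ in history}
    order = []
    while remaining:
        # A transaction is ready when no remaining transaction precedes it.
        ready = sorted(
            t for t in remaining
            if not any(a in remaining and b == t for a, b in edges)
        )
        if not ready:
            return None
        order.append(ready[0])
        remaining.remove(ready[0])
    return order


# Hypothetical history: w1(X) r2(X) w2(Y) r3(Y)
serializable = [("w", 1, "X"), ("r", 2, "X"), ("w", 2, "Y"), ("r", 3, "Y")]
# Hypothetical history: r1(X) w2(X) w1(X)
cyclic = [("r", 1, "X"), ("w", 2, "X"), ("w", 1, "X")]

order_ok = serialization_order(serializable)
order_cyclic = serialization_order(cyclic)
```

For the first sample history the graph is a chain and yields the order T1, T2, T3; the second contains the cycle T1 → T2 → T1, so no order is returned.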

11 Commit protocols (20 points)

(a) (7 points) The coordinator A and two participants B1, B2 run the 2PC protocol. We assume that the coordinator is also a participant. We model the execution of the protocol as a series of events, which are either message events or failure events. We define a message event (P, Q, M) as "P sends the message M to Q", where P, Q ∈ {A, B1, B2} and the message M ∈ {request, yes, no, abort, commit}, meaning request to vote, voting yes, voting no, request to abort, and request to commit, respectively. We define a failure event (P, fail) as the failure of node P.

Consider now the following series of events:

timestep | event
I        | (A, B1, request)
II       | (B1, A, yes)
III      | (A, B2, request)
IV       | (B2, A, yes)
V        | (A, B1, commit)
VI       | (A, B2, commit)

For each of the scenarios listed in the table below, replace the event at one of the timesteps I-VI with a different event for the scenario to happen. If there are multiple possibilities, replace the earliest one. Assume that all the actions following the given modification will also change according to the 2PC protocol.

Scenario                                                                  | timestep (I-VI) | event
2PC aborts, but no node has failed                                        |                 |
a participant experiences a timeout waiting for a message                 |                 |
the coordinator experiences a timeout waiting for a message               |                 |
2PC blocks                                                                |                 |
a Cooperative Termination Protocol is run and the protocol finishes       |                 |
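The event notation above can be made concrete with a minimal Python sketch of a failure-free 2PC round. The function two_phase_commit is hypothetical, and it makes simplifying assumptions that the question does not state: no node fails, no message is lost, participants are contacted one after the other, and the coordinator collects every vote before announcing its decision to all participants.

```python
def two_phase_commit(votes, coordinator="A"):
    """Return (decision, events) for a failure-free 2PC round.

    votes maps each participant name to "yes" or "no". Events use the
    (sender, receiver, message) notation from the question text.
    """
    events = []
    # Voting phase: request a vote from each participant and record its reply.
    for participant, vote in votes.items():
        events.append((coordinator, participant, "request"))
        events.append((participant, coordinator, vote))
    # Decision phase: commit only if every participant voted yes.
    decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"
    for participant in votes:
        events.append((coordinator, participant, decision))
    return decision, events


decision, events = two_phase_commit({"B1": "yes", "B2": "yes"})
for step, event in enumerate(events, start=1):
    print(step, event)
```

With two yes votes, the generated events match timesteps I to VI of the series given in the question.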

(b) (7 points) Assume now that the nodes A, B1, and B2 from the previous exercise run the 3PC protocol. We include the additional message pre-commit. What is the minimum number of messages that the coordinator has to send before it can fail without preventing the participants from committing? Assume that this is the only failure. Write down such a scenario using the notation introduced above, up to the moment when the coordinator fails.

Minimum number of coordinator messages:

Scenario:

(c) (6 points) Considering the scenario from the previous exercise (b), will the participants eventually terminate if the coordinator does not send the last message? If yes, what will be their decision? If no, why?

Empty Pages You can use the following pages at your convenience. If you use them for a solution, mark as clearly as possible to which question your solution belongs and what is part of that solution and what is not.