Assignment 6: SQL III Solution

Data Modelling and Databases
Exercise dates: April 12/April 13, 2018
Ce Zhang, Gustavo Alonso
Last update: April 16, 2018
Spring Semester 2018
Head TA: Ingo Müller

This assignment will be discussed during the exercise slots indicated above. If you want feedback for your copy, hand it in during the lecture on the Wednesday before (preferably stapled and with your e-mail address). You can also annotate your copy with questions you think should be discussed during the exercise session. If you have questions that are not answered by the solution we provide, send them to Sabir (sabir.akhadov@inf.ethz.ch).

This exercise sheet builds on the previous ones and uses the Employees, ZVV and TPC-H schemas and data, which you can find on the course website.

1 Arrival times (ZVV)

The following queries can be executed on the ZVV schema. All queries are valid and return some result. What do they compute, and which are equivalent? Two queries are equivalent if they return the same set of results for any data that the database may contain.

1.  SELECT arrival_time
    FROM stop_times
    ORDER BY arrival_time DESC
    LIMIT 1

2.  SELECT MAX(arrival_time)
    FROM stop_times
    JOIN trips USING (trip_id)

3.  SELECT MAX(arrival_time)
    FROM stop_times
    GROUP BY trip_id

4.  SELECT MAX(arrival_time) AS arrival_time
    FROM stop_times st
    JOIN trips t USING (trip_id)
    GROUP BY t.trip_id
    ORDER BY arrival_time DESC
    LIMIT 1

Queries 1, 2 and 4 are equivalent and return the maximum arrival time in the database. Query 3 returns the maximum arrival time per trip.

2 Trip count (ZVV)

Repeat the task from question 1 for the following queries:

1.  SELECT COUNT(*)
    FROM trips

2.  SELECT COUNT(*)
    FROM trips t
    JOIN stop_times st USING (trip_id)

3.  SELECT COUNT(*)
    FROM stop_times
    GROUP BY trip_id

4.  SELECT COUNT(*)
    FROM trips
    JOIN stop_times USING (trip_id)
    JOIN stops USING (stop_id)
    WHERE stop_name LIKE '%'

5.  SELECT SUM(vala)
    FROM (SELECT stop_name, COUNT(*) AS vala
          FROM stop_times
          JOIN stops USING (stop_id)
          GROUP BY stop_name) taba

6.  SELECT COUNT(DISTINCT trip_id)
    FROM stop_times

Query 1 returns the number of trips in the database, and query 6 returns the number of trips having at least one stop. They are not equivalent unless we assume that every trip has at least one stop, which is reasonable but technically not enforced by the schema. Queries 2, 4 and 5 are equivalent and return the number of stops in the database, counted once for every trip that serves them (that is, the number of stop_times entries). The WHERE predicate in query 4 evaluates to true for all non-null values, and the schema guarantees that stop_name is never null. Query 3 returns the number of stops per trip.
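The difference between queries 1 and 6 can be checked on a toy instance. The following is only a sketch: it uses Python's sqlite3 module instead of the course database, and the two tables are a hypothetical, heavily simplified version of the ZVV schema with made-up rows, where trip 3 has no stops at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the ZVV tables (assumption, not the real schema).
    CREATE TABLE trips (trip_id INTEGER PRIMARY KEY);
    CREATE TABLE stop_times (trip_id INTEGER, stop_id INTEGER, stop_sequence INTEGER);
    INSERT INTO trips VALUES (1), (2), (3);   -- trip 3 has no stops
    INSERT INTO stop_times VALUES (1, 10, 1), (1, 11, 2), (1, 12, 3), (2, 10, 1);
""")

# Query 1: number of trips.
q1 = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
# Query 2: number of stop_times entries that belong to a trip.
q2 = conn.execute(
    "SELECT COUNT(*) FROM trips t JOIN stop_times st USING (trip_id)"
).fetchone()[0]
# Query 3: one count per trip (ordered here only to make the output stable).
q3 = conn.execute(
    "SELECT COUNT(*) FROM stop_times GROUP BY trip_id ORDER BY trip_id"
).fetchall()
# Query 6: number of trips with at least one stop.
q6 = conn.execute("SELECT COUNT(DISTINCT trip_id) FROM stop_times").fetchone()[0]

print(q1, q2, q3, q6)
```

On this instance queries 1 and 6 disagree because of the trip without stops, while query 3 returns one row per trip rather than a single number.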

3 Tram tracks and terminals (ZVV)

1. A tram track is defined as a tuple of two consecutive stops (stop_name1, stop_name2). Two stops are consecutive if there is a trip (trip_id) which contains both stops and in which their stop_sequence numbers differ by 1:

    $\text{stop\_sequence}_{\text{stop\_name2}} = 1 + \text{stop\_sequence}_{\text{stop\_name1}}$

Fill in the blanks below to obtain an SQL query that finds the number of trips for each tram track and lists the 10 most frequented ones.

    SELECT s1.stop_name, s2.stop_name, COUNT(*) AS tcount
    FROM stop_times st1, stop_times st2, stops s1, stops s2
    WHERE st1.trip_id = st2.trip_id
      AND st1.stop_sequence + 1 = st2.stop_sequence
      AND s1.stop_id = st1.stop_id
      AND s2.stop_id = st2.stop_id
    GROUP BY s1.stop_name, s2.stop_name
    ORDER BY tcount DESC
    LIMIT 10

2. A tram stop (stop_name) is terminal if it is the last stop for any trip (as identified through the trip_id). The last stop of a trip can be identified by its stop_sequence. For example, if a tram makes a trip along six tram stops, stop_sequence = 6 identifies the last stop of the trip. Fill in the blanks below to obtain an SQL query that finds all terminal tram stops (stop_name). Also make sure the result does not contain duplicate entries.

    SELECT DISTINCT stop_name
    FROM stops s
    JOIN stop_times st1 USING (stop_id),
         (SELECT trip_id, MAX(stop_sequence) AS maxstop
          FROM stop_times
          GROUP BY trip_id) st2
    WHERE st1.trip_id = st2.trip_id
      AND st1.stop_sequence = st2.maxstop

Knowing the database schema, how else could we formulate the previous query? Compare the results.

    SELECT DISTINCT trip_headsign
    FROM trips;
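Both queries of this section can be tried on a small hand-made instance. The sketch below is an illustration under assumptions: it runs on Python's sqlite3 rather than the course database, the schema is reduced to the columns the queries touch, and the three trips over stops A, B and C are invented. The subquery is attached with JOIN ... ON instead of a comma, which is an equivalent formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the ZVV tables (assumption, not the real schema).
    CREATE TABLE stops (stop_id INTEGER PRIMARY KEY, stop_name TEXT NOT NULL);
    CREATE TABLE stop_times (trip_id INTEGER, stop_id INTEGER, stop_sequence INTEGER);
    INSERT INTO stops VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO stop_times VALUES
        (1, 1, 1), (1, 2, 2), (1, 3, 3),   -- trip 1: A -> B -> C
        (2, 3, 1), (2, 2, 2), (2, 1, 3),   -- trip 2: C -> B -> A
        (3, 1, 1), (3, 2, 2), (3, 3, 3);   -- trip 3: A -> B -> C
""")

# Tram tracks: pairs of consecutive stops with the number of trips using them.
tracks = conn.execute("""
    SELECT s1.stop_name, s2.stop_name, COUNT(*) AS tcount
    FROM stop_times st1, stop_times st2, stops s1, stops s2
    WHERE st1.trip_id = st2.trip_id
      AND st1.stop_sequence + 1 = st2.stop_sequence
      AND s1.stop_id = st1.stop_id
      AND s2.stop_id = st2.stop_id
    GROUP BY s1.stop_name, s2.stop_name
    ORDER BY tcount DESC, s1.stop_name, s2.stop_name  -- extra keys only break ties
    LIMIT 10
""").fetchall()

# Terminal stops: stops whose stop_sequence is the maximum of their trip.
terminals = conn.execute("""
    SELECT DISTINCT stop_name
    FROM stops s
    JOIN stop_times st1 USING (stop_id)
    JOIN (SELECT trip_id, MAX(stop_sequence) AS maxstop
          FROM stop_times
          GROUP BY trip_id) st2
      ON st1.trip_id = st2.trip_id AND st1.stop_sequence = st2.maxstop
    ORDER BY stop_name
""").fetchall()

print(tracks)
print(terminals)
```

Without DISTINCT, stop C would be listed twice here, once for each trip that ends there.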

4 Union (TPC-H)

Reminder: In the TPC-H schema, one part can be provided by multiple suppliers. The Supplypart relation stores this connection, and (partid, supplierid) is its key.

Given the following queries:

1.  SELECT p.partid, p.partname
    FROM Part p
    JOIN Supplypart sp ON sp.partid = p.partid
    WHERE sp.supplierid = 6 OR sp.supplierid = 33

2.  (SELECT p.partid, p.partname
     FROM Part p
     JOIN Supplypart sp ON sp.partid = p.partid
     WHERE sp.supplierid = 6)
    UNION
    (SELECT p.partid, p.partname
     FROM Part p
     JOIN Supplypart sp ON sp.partid = p.partid
     WHERE sp.supplierid = 33)

1. What is the difference between the results of these two queries?

Both queries return the ids and names of the parts provided by the supplier with id 6 or the supplier with id 33. However, the first query does not return distinct results: a part provided by both suppliers appears twice. UNION, on the other hand, guarantees that all returned rows are distinct.

2. How can we quickly make the first query equivalent to the second?

Use the keyword DISTINCT.

3. How can we quickly make the second query equivalent to the first?

Replace UNION with UNION ALL. The latter returns all results, even those which are repeated.
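The three answers above can be verified on a toy instance. This is a minimal sketch, assuming Python's sqlite3 as a stand-in for the course database and a made-up Part/Supplypart instance in which part 1 is provided by both suppliers. SQLite does not accept parentheses around the operands of UNION, so they are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced TPC-H-like tables with invented rows (assumption).
    CREATE TABLE Part (partid INTEGER PRIMARY KEY, partname TEXT);
    CREATE TABLE Supplypart (partid INTEGER, supplierid INTEGER,
                             PRIMARY KEY (partid, supplierid));
    INSERT INTO Part VALUES (1, 'bolt'), (2, 'nut'), (3, 'gear');
    -- Part 1 is supplied by both supplier 6 and supplier 33.
    INSERT INTO Supplypart VALUES (1, 6), (1, 33), (2, 6), (3, 7);
""")

BASE = """SELECT {distinct} p.partid, p.partname
          FROM Part p JOIN Supplypart sp ON sp.partid = p.partid
          WHERE {cond}"""

def run(sql):
    """Return the result as a sorted list so that bags can be compared."""
    return sorted(conn.execute(sql).fetchall())

either = "sp.supplierid = 6 OR sp.supplierid = 33"
s6 = BASE.format(distinct="", cond="sp.supplierid = 6")
s33 = BASE.format(distinct="", cond="sp.supplierid = 33")

q_or = run(BASE.format(distinct="", cond=either))                  # query 1
q_or_distinct = run(BASE.format(distinct="DISTINCT", cond=either)) # query 1 + DISTINCT
q_union = run(f"{s6} UNION {s33}")                                 # query 2
q_union_all = run(f"{s6} UNION ALL {s33}")                         # query 2 with UNION ALL

print(q_or, q_union)
```

Comparing sorted lists rather than sets matters here, because the whole point is whether duplicates are kept.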

5 Rewriting subquery expressions (Employees)

What do the following queries do? Rewrite them without using subquery expressions [1] so that they return the same results. We do not count combinations using UNION, INTERSECT and EXCEPT as subquery expressions.

[1] www.postgresql.org/docs/9.6/static/functions-subquery.html

1.  SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    WHERE e.emp_no IN (SELECT dm.emp_no
                       FROM dept_manager dm
                       JOIN salaries s ON dm.emp_no = s.emp_no
                       WHERE dm.to_date > NOW()
                         AND s.to_date > NOW()
                       ORDER BY s.salary DESC
                       LIMIT 5)

The query returns the employee numbers, first names and last names of the five current managers with the highest current salaries. This can be rewritten as:

    SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    JOIN dept_manager dm ON e.emp_no = dm.emp_no
    JOIN salaries s ON dm.emp_no = s.emp_no
    WHERE dm.to_date > NOW()
      AND s.to_date > NOW()
    ORDER BY salary DESC
    LIMIT 5

Here we assume that a department manager cannot manage several departments at the same time. Otherwise we could use the DISTINCT keyword.

2.  SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    JOIN dept_emp de ON e.emp_no = de.emp_no
    JOIN salaries s ON e.emp_no = s.emp_no
    WHERE de.to_date > NOW()
      AND s.to_date > NOW()
      AND s.salary > (SELECT s1.salary
                      FROM dept_manager dm
                      JOIN salaries s1 ON s1.emp_no = dm.emp_no
                      WHERE dm.dept_no = de.dept_no
                        AND dm.to_date > NOW()

                        AND s1.to_date > NOW())

The query returns the employee numbers, first names and last names of all employees currently earning more than their managers. This can be rewritten as:

    SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    JOIN dept_emp de ON e.emp_no = de.emp_no
    JOIN dept_manager dm ON dm.dept_no = de.dept_no
    JOIN salaries se ON e.emp_no = se.emp_no
    JOIN salaries sm ON dm.emp_no = sm.emp_no
    WHERE de.to_date > NOW()
      AND dm.to_date > NOW()
      AND se.to_date > NOW()
      AND sm.to_date > NOW()
      AND se.salary > sm.salary

3.  SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    WHERE e.emp_no NOT IN (SELECT dm.emp_no
                           FROM dept_manager dm
                           WHERE dm.to_date > NOW())

The query returns the employee numbers, first names and last names of all employees who are not currently managers. This can be rewritten as:

    (SELECT e.emp_no, e.first_name, e.last_name
     FROM employees e)
    EXCEPT
    (SELECT e.emp_no, e.first_name, e.last_name
     FROM employees e
     JOIN dept_manager dm ON e.emp_no = dm.emp_no
     WHERE dm.to_date > NOW())

Alternatively, we can use the LEFT OUTER JOIN ... IS NULL pattern to find all employees who are not managers, while still allowing employees who have been managers in the past.

    SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e
    LEFT OUTER JOIN dept_manager dm USING (emp_no)
    WHERE dm.emp_no IS NULL
       OR dm.to_date < NOW();

6 Having (Employees)

Write queries using the HAVING clause which do the following:

1. Find all employees who have worked in more than one department.

    SELECT e.first_name, e.last_name
    FROM employees e
    JOIN dept_emp de ON e.emp_no = de.emp_no
    GROUP BY e.emp_no
    HAVING COUNT(DISTINCT dept_no) > 1

2. Find the names of all departments where the current average salary is at least 5000 higher than in the year 2000.

    SELECT d.dept_name
    FROM dept_emp de
    JOIN salaries s ON de.emp_no = s.emp_no
    JOIN departments d ON d.dept_no = de.dept_no
    WHERE s.to_date > NOW()
      AND de.to_date > NOW()
    GROUP BY d.dept_name
    HAVING AVG(s.salary) - 5000 >= (
        SELECT AVG(s1.salary)
        FROM dept_emp de1
        JOIN salaries s1 ON de1.emp_no = s1.emp_no
        JOIN departments d1 ON d1.dept_no = de1.dept_no
        WHERE d1.dept_name = d.dept_name
          AND EXTRACT(YEAR FROM s1.to_date) >= 2000
          AND EXTRACT(YEAR FROM s1.from_date) <= 2000

          AND EXTRACT(YEAR FROM de1.to_date) >= 2000
          AND EXTRACT(YEAR FROM de1.from_date) <= 2000
          AND s1.to_date >= de1.from_date
          AND de1.to_date >= s1.from_date)

Alternatively, with a join:

    SELECT d.dept_name
    FROM dept_emp de
    JOIN salaries s ON de.emp_no = s.emp_no
    JOIN departments d ON d.dept_no = de.dept_no
    JOIN dept_emp de2000 ON de2000.dept_no = de.dept_no
    JOIN salaries s2000 ON s2000.emp_no = de2000.emp_no
    WHERE s.to_date > NOW()
      AND de.to_date > NOW()
      AND EXTRACT(YEAR FROM s2000.to_date) >= 2000
      AND EXTRACT(YEAR FROM s2000.from_date) <= 2000
      AND EXTRACT(YEAR FROM de2000.to_date) >= 2000
      AND EXTRACT(YEAR FROM de2000.from_date) <= 2000
      AND s2000.to_date >= de2000.from_date
      AND de2000.to_date >= s2000.from_date
    GROUP BY d.dept_name
    HAVING AVG(s.salary) >= AVG(s2000.salary) + 5000;

7 Null values (Employees)

Warning: This part of the exercise will alter the schema and data of the employee database and might make solutions to other exercises not work as expected. To be safe, create a new database for this exercise where you reload the employee data from scratch.

We will alter the schema of the employee database slightly to explore a use for the NULL value. More specifically, if an employment is currently open-ended in the database, the date for its end is set to the year 9999. An alternative is to use a NULL value for the end date, which signifies that no termination has been decided. In our current schema we cannot add NULL to the columns, so to play around we first need to modify the schema.

    ALTER TABLE salaries ALTER COLUMN to_date DROP NOT NULL

Now we can replace the infinity dates with NULLs.

    UPDATE salaries
    SET to_date = NULL
    WHERE EXTRACT(YEAR FROM to_date) = 9999
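Once end dates may be NULL, it is worth recalling that a comparison with NULL evaluates to unknown rather than true, so a row with a NULL to_date never satisfies a plain date comparison. The sketch below illustrates this on an invented four-row salaries table, using Python's sqlite3 as a stand-in for the course database; dates are stored as ISO-formatted text, which SQLite compares lexicographically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy version of the salaries table with made-up rows (assumption).
    CREATE TABLE salaries (emp_no INTEGER, from_date TEXT, to_date TEXT);
    INSERT INTO salaries VALUES
        (1, '1996-06-01', NULL),          -- open-ended, started before December 1996
        (2, '1995-01-01', '1996-12-15'),  -- ended in December 1996
        (3, '1990-01-01', '1996-11-30'),  -- ended before December 1996
        (4, '1997-01-01', NULL);          -- open-ended, started after December 1996
""")

def emp_nos(where):
    """Return the emp_no values of all rows matching the given WHERE clause."""
    rows = conn.execute(f"SELECT emp_no FROM salaries WHERE {where} ORDER BY emp_no")
    return [r[0] for r in rows]

# Plain interval test: the open-ended row 1 is silently dropped.
plain = emp_nos("from_date <= '1996-12-31' AND to_date >= '1996-12-01'")

# NULL-aware interval test: open-ended rows are handled explicitly.
null_aware = emp_nos("from_date <= '1996-12-31' "
                     "AND (to_date >= '1996-12-01' OR to_date IS NULL)")

# The comparison itself yields NULL, which Python sees as None.
null_cmp = conn.execute("SELECT NULL >= '1996-12-01'").fetchone()[0]

print(plain, null_aware, null_cmp)
```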

1. Now that this is done, run the following three queries and decide which one(s) correctly return(s) the number of employees who received a salary in December of 1996. To receive money in a month, a person has to work at least one day in that month.

    SELECT COUNT(*)
    FROM salaries
    WHERE to_date >= '1996-12-01'

    SELECT COUNT(*)
    FROM salaries
    WHERE from_date <= '1996-12-31'
      AND to_date >= '1996-12-01'

    SELECT COUNT(*)
    FROM salaries
    WHERE from_date <= '1996-12-31'
      AND (to_date >= '1996-12-01' OR to_date IS NULL)

2. We want to find out how many contracts (entries in the salaries table) ended in each particular year. For this we wrote the following query (Q1):

    SELECT EXTRACT(YEAR FROM to_date) AS year_expired,
           COUNT(*) AS num_expired
    FROM salaries
    GROUP BY year_expired
    ORDER BY year_expired

We wrote a similar query (Q2) in a more verbose way, as follows:

    (SELECT EXTRACT(YEAR FROM to_date) AS year_expired,
            COUNT(*) AS num_expired
     FROM salaries
     WHERE EXTRACT(YEAR FROM to_date) <= 1990
     GROUP BY year_expired
     ORDER BY year_expired)
    UNION ALL
    (SELECT EXTRACT(YEAR FROM to_date) AS year_expired,
            COUNT(*) AS num_expired
     FROM salaries
     WHERE EXTRACT(YEAR FROM to_date) > 1990
     GROUP BY year_expired
     ORDER BY year_expired)

For this exercise, first try to answer the question without using the database. Then verify your answer by running the actual queries. Which of the following statement(s) are true?

    Q1 and Q2 return the same set of groups
    Q1 returns more groups than Q2
    Q2 returns more groups than Q1
    Q1 has one group more than Q2
    Q2 has one group more than Q1

8 Case (TPC-H)

Select the queries which return correct results for the following task: Get the order ids, names of customers and date labels for all orders placed in years 1995 and later. For orders from years 1995-1997, the date labels should say '95-97'. For orders from year 1998 and later, the labels should say '98-XX'.

1.  SELECT orderid, customername,
           (CASE WHEN o.orderdate >= '01-01-1998' THEN '98-XX'
                 WHEN o.orderdate >= '01-01-1995' THEN '95-97'
            END) AS date_label
    FROM Orders o, Customer c
    WHERE o.customerid = c.customerid

2.  SELECT orderid, customername, date_label
    FROM (SELECT orderid, customername,
                 (CASE WHEN o.orderdate >= '01-01-1998' THEN '98-XX'
                       WHEN o.orderdate >= '01-01-1995' THEN '95-97'
                       ELSE 'NOTHING'
                  END) AS date_label
          FROM Orders o, Customer c
          WHERE o.customerid = c.customerid) AS labeled
    WHERE date_label != 'NOTHING'

3.  SELECT orderid, customername,
           (CASE WHEN o.orderdate >= '01-01-1995' THEN '95-97'
                 WHEN o.orderdate >= '01-01-1998' THEN '98-XX'
            END) AS date_label
    FROM Orders o, Customer c
    WHERE o.customerid = c.customerid
      AND o.orderdate >= '01-01-1995'

4.  SELECT orderid, customername,
           (CASE WHEN o.orderdate >= '01-01-1998' THEN '98-XX'
                 WHEN o.orderdate >= '01-01-1995' THEN '95-97'
            END) AS date_label
    FROM Orders o, Customer c
    WHERE o.customerid = c.customerid
      AND o.orderdate >= '01-01-1995'

Explanation: Query 1 is incorrect because it returns all orders. For those placed before 1995, the column date_label is NULL. Query 3 is incorrect because the column date_label has the value '95-97' for all returned results. This is because the CASE expression is evaluated sequentially: once the first matching case is found, no other cases are checked.

9 Updatable views (Employees)

1. Which of the following views are updatable?

1.  CREATE VIEW DeptManager AS
    SELECT e.emp_no, e.first_name, e.last_name
    FROM employees e, dept_manager dm
    WHERE e.emp_no = dm.emp_no

2.  CREATE VIEW Age AS
    SELECT emp_no, first_name, last_name,
           DATE_PART('year', age(NOW(), birth_date)) AS age
    FROM employees

3.  CREATE VIEW HiredPast97 AS
    SELECT *
    FROM employees
    WHERE hire_date >= '01.01.1997'

4.  CREATE VIEW DeptEmployee AS
    SELECT e.emp_no, e.first_name, e.last_name, d.dept_no, d.dept_name
    FROM employees e, dept_emp de, departments d
    WHERE e.emp_no = de.emp_no
      AND de.dept_no = d.dept_no

5.  CREATE VIEW Depts AS
    SELECT e.emp_no, COUNT(*) AS number_of_depts
    FROM employees e, dept_emp de, departments d
    WHERE e.emp_no = de.emp_no
      AND de.dept_no = d.dept_no
    GROUP BY e.emp_no

6.  CREATE VIEW HiredIn AS
    SELECT EXTRACT(YEAR FROM hire_date) AS year_hired, COUNT(*)
    FROM employees
    GROUP BY EXTRACT(YEAR FROM hire_date)

Explanation: Views 1 and 4 are not updatable because of table joins. View 2 is not updatable because of a one-way projection (a computed expression) in the select list. View 5 is not updatable because of table joins and aggregation. View 6 is not updatable because of aggregation.

2. Suppose we add a new column to the employees relation: annotation of type text.

    ALTER TABLE employees ADD annotation TEXT

Can we update it using the updatable views above?

The column can be updated through view 3 only if we create the view after adding the new column, because SELECT * is expanded into the list of columns that exist at the time the view is created.

10 Custom database

Create a schema using the following commands:

    CREATE TABLE team (
        team_id SERIAL,
        team_name VARCHAR(30) NOT NULL,
        city_name VARCHAR(30),
        PRIMARY KEY (team_id)
    );

    CREATE TABLE referee (
        referee_id SERIAL,
        first_name VARCHAR(30) NOT NULL,
        last_name VARCHAR(30) NOT NULL,
        PRIMARY KEY (referee_id)
    );

    CREATE TABLE game (
        game_date DATE NOT NULL,
        home_team_id INT NOT NULL,
        away_team_id INT NOT NULL,
        referee_id INT NOT NULL,
        PRIMARY KEY (game_date, home_team_id, away_team_id)
    );

10.1 Altering tables

1. Add a nullable column short_name to the team relation. Its values should not be longer than 3 characters.

    ALTER TABLE team ADD short_name VARCHAR(3)

2. Change the type of short_name to a string of up to 5 characters.

    ALTER TABLE team ALTER short_name TYPE VARCHAR(5)

3. Remove the created column.

    ALTER TABLE team DROP short_name

10.2 Data manipulation

Assume that all tables are empty, i.e. the following statements have been executed:

    DELETE FROM team;
    DELETE FROM referee;
    DELETE FROM game;

Select the statements that will execute correctly on the empty database.

    INSERT INTO team (team_id, team_name)
    VALUES (0, 'a'), (3, 'b'), (5, 'c');

Explanation: Although this command works, it is a dangerous practice to explicitly insert values into a column that is defined as auto-incremented. If rows are later added without an explicit team_id, the automatically generated value does not take the explicitly inserted ones into account and may collide with one of them, violating the primary key constraint.

    INSERT INTO referee (first_name)
    VALUES ('John');

Explanation: The column last_name cannot be NULL.

    INSERT INTO team (team_name)
    VALUES ('x'), ('y');

    INSERT INTO game (game_date, home_team_id, away_team_id, referee_id)
    VALUES (NOW(), 0, 1, 1337);

    DELETE FROM team
    WHERE team_id = 0

Explanation: Currently, a game can have a referee whose id is not in the database. Also, a team can be deleted even if there are games related to it in the database. To prevent the use of a non-existent referee id and the deletion of entries that are still referenced, you can use a foreign key. You will learn about foreign keys later in this course.

    INSERT INTO team (team_name)
    VALUES ('Real Madrid C.F.');

    UPDATE team
    SET city_name = 'Barcelona'
    WHERE team_name = 'FC Barcelona';

    INSERT INTO referee (first_name, last_name)
    VALUES ('Jane', 'Smith');

    INSERT INTO referee (first_name, last_name)
    VALUES ('Jane', 'Smith');

Explanation: There is no uniqueness constraint on the (first_name, last_name) tuple in the database.

    INSERT INTO game (game_date, home_team_id, away_team_id, referee_id)
    VALUES ('1307-10-13', 123, 321, 0),
           ('1307-10-13', 123, 321, 1);

Explanation: (game_date, home_team_id, away_team_id) is the primary key of the game relation. For this reason, inserting the same three values twice violates the primary key constraint.
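The constraint behaviour discussed in this section can be reproduced in a few lines. The following is a sketch under assumptions, not the course setup: it uses Python's sqlite3 instead of PostgreSQL, replaces SERIAL with SQLite's INTEGER PRIMARY KEY, and the helper try_sql is a hypothetical name introduced only for this illustration. The SERIAL collision issue is deliberately left out, since SQLite generates keys differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE referee (
        referee_id INTEGER PRIMARY KEY,   -- stands in for SERIAL (assumption)
        first_name VARCHAR(30) NOT NULL,
        last_name  VARCHAR(30) NOT NULL
    );
    CREATE TABLE game (
        game_date    DATE NOT NULL,
        home_team_id INT NOT NULL,
        away_team_id INT NOT NULL,
        referee_id   INT NOT NULL,
        PRIMARY KEY (game_date, home_team_id, away_team_id)
    );
""")

def try_sql(sql):
    """Run one statement; return None on success or the error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

# NOT NULL on last_name rejects this insert.
missing_last_name = try_sql("INSERT INTO referee (first_name) VALUES ('John')")

# No uniqueness constraint on names: both inserts succeed.
first_jane = try_sql("INSERT INTO referee (first_name, last_name) VALUES ('Jane', 'Smith')")
second_jane = try_sql("INSERT INTO referee (first_name, last_name) VALUES ('Jane', 'Smith')")

# No foreign key: a game may reference a referee id that does not exist.
orphan_referee = try_sql("INSERT INTO game VALUES ('2018-04-12', 0, 1, 1337)")

# Same (game_date, home_team_id, away_team_id) twice: primary key violation.
duplicate_key = try_sql(
    "INSERT INTO game VALUES ('1307-10-13', 123, 321, 0), ('1307-10-13', 123, 321, 1)"
)

games = conn.execute("SELECT COUNT(*) FROM game").fetchone()[0]
referees = conn.execute("SELECT COUNT(*) FROM referee").fetchone()[0]
print(missing_last_name, duplicate_key, games, referees)
```

The failing multi-row insert leaves no rows behind, so only the game with the non-existent referee remains in the table.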