Virginia Tech. Computer Science
CS 5614 (Big) Data Management Systems
Fall 2014, Prakash

Homework 1: RA, SQL and B+-Trees
(due September 24th, 2014, 2:30pm, in class; hard copy please)

Reminders:
a. Out of 100 points. Contains 5 pages.
b. Rough time estimate: 5~8 hours.
c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader. Only drawings may be hand-drawn, as long as they are neat and legible.
d. There could be more than one correct answer. We shall accept them all.
e. Whenever you make an assumption, please state it clearly.
f. Unless otherwise mentioned, you may use any SQL/RA operator seen in class or in the textbook.
g. Unless otherwise specified, assume set-semantics for RA and bag-semantics for SQL.
h. Feel free to use the linear notation for RA and create intermediate views for SQL (unless otherwise mentioned in the problem).
i. Each HW has to be done individually, without taking any help from non-class resources (e.g., websites).

Q1. RA: Bars [30 points]

Consider the following relational database that stores information about bars and customers:

Drinker (name, address)
Bar (name, address)
Beer (name, brewer)
Frequents (drinker, bar, times_a_week)
Likes (drinker, beer)
Serves (bar, beer, price)

Write the following queries in relational algebra:

Q1.1. (2 points) Find all drinkers who frequent James Joyce Pub.
Q1.2. (2 points) Find all bars that serve both Amstel and Corona.
Q1.3. (3 points) Find all bars that serve at least one of the beers Amy likes for no more than $2.50.
Q1.4. (3 points) For each bar, find all beers served at this bar that are liked by none of the drinkers who frequent that bar.
Q1.5. (5 points) Find all drinkers who frequent only those bars that serve some beers they like.
Q1.6. (5 points) Find all drinkers who frequent every bar that serves some beers they like.
Q1.7. (10 points) Find those drinkers who enjoy exactly the same set of beers as Amy.

Q2. SQLite [25 points]

This problem uses a database containing data about a university. The relations are in a SQLite database. Download and install SQLite3 from http://www.sqlite.org

The schema of the database is provided below (keys are in bold; field types are omitted, but they can easily be identified using SQLite):

student(sid, sname, sex, age, year, gpa)
dept(dname, numphds)
prof(pname, dname)
course(cno, cname, dname)
major(dname, sid)
section(dname, cno, sectno, pname)
enroll(sid, grade, dname, cno, sectno)

Before you start, it would be a good idea to take a look at the database file and familiarize yourself with its contents. You can run this file in SQLite, and the database and tables will be loaded properly. The file can be found here:
http://people.cs.vt.edu/~badityap/classes/cs5614-fall14/homeworks/hw1/database.txt

In this assignment, you will deal only with the querying part of SQL. You are NOT allowed to tamper with (change the contents of) the database, i.e., no CREATE, INSERT, DELETE, ALTER, UPDATE, etc. However, please feel free to issue any query-oriented SQL statements, even if they are not related to the questions in this assignment.

Queries

Write SQL queries that answer the questions below (one query per question) and run them on SQLite. The query answers must not contain duplicates, but you should use the SQL keyword distinct only when necessary. For this question, creation of temporary tables is NOT allowed, i.e., for each question you have to write exactly one SQL statement (possibly using nested SQL). Note that the answer to some of the questions may be empty.

Q2.1. (2 points) Find the name of the oldest student.
Q2.2. (2 points) Find the names and gpas of the students who are enrolled in 312.
Q2.3. (2 points) Find the names and majors of students who are taking one of the Artificial Intelligence courses.
Q2.4. (2 points) Find the names of students who are enrolled in a course from both the "Computer Sciences" and "Chemical Engineering" departments.
Q2.5. (3 points) For each department, find the average age of the students majoring in that department along with the age difference between the oldest and youngest students.
Q2.6. (3 points) Find the names of students being taught by professor "Jason Singer".
Q2.7. (4 points) How many students have more than one major? (Hint: requires a nested query)
Q2.8. (4 points) Find the name(s) of the oldest first year student {year = 1}. (Hint: requires a nested query)
Q2.9. (3 points) For those departments that have no majors taking a "Computer Sciences" course, print the department name and the number of PhD students in the department.

Assignment Submission

Format your answers as follows (in the hard copy itself):

1. Query: SQL statement 1 (for query 1);
   Result: Copy-paste output for query 1
2. Query: SQL statement 2 (for query 2);
   Result: Copy-paste output for query 2
...
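As an illustration of the Query/Result format above, here is a minimal sketch using Python's standard sqlite3 module. It deliberately does not touch the homework database: the table toy_student and its rows are made up for this example, and run_readonly and format_answer are hypothetical helper names, not part of the assignment. The "|" column separator is also an assumption, chosen to resemble the sqlite3 shell's default list output.

```python
import sqlite3


def run_readonly(conn, sql):
    """Run one SELECT statement and return its rows as a list of tuples."""
    # Crude guard (an assumption of this sketch): refuse anything that does
    # not start with SELECT, since the database must not be modified.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


def format_answer(number, sql, rows):
    """Render one answer in the 'Query: ...; Result: ...' layout."""
    lines = [f"{number}. Query: {sql};", "   Result:"]
    for row in rows:
        lines.append("   " + "|".join(str(value) for value in row))
    return "\n".join(lines)


# Toy in-memory data, invented for the example (not the course database).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_student (sid INTEGER PRIMARY KEY, sname TEXT, age INTEGER);
    INSERT INTO toy_student VALUES (1, 'Ada', 20);
    INSERT INTO toy_student VALUES (2, 'Bob', 22);
    INSERT INTO toy_student VALUES (3, 'Cy', 22);
    """
)

# Two students share age 22, so DISTINCT is needed to avoid a duplicate row.
sql = "SELECT DISTINCT age FROM toy_student ORDER BY age"
rows = run_readonly(conn, sql)
print(format_answer(1, sql, rows))
```

The example query needs distinct only because two toy rows share the same age; a query over a key column would not.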
Q3: Crypt-arithmetic [20 points]

This exercise is designed to help you think out of the box on the use of database programming for solving problems. You are given the crypt-arithmetic puzzle:

    SEND
  + MORE
  ------
   MONEY

The goal of the puzzle is to substitute digits (from zero to nine) for letters, so that the addition works out. Your solution should respect the following constraints:

1. The same digit should be used for a given letter throughout. For example, if you guess "5" for the letter E, then E should get the value "5" at all the places it occurs.
2. Different letters should get different digits, e.g., you cannot assign "4" to both E and M.
3. None of the numbers SEND, MORE, or MONEY has any leading zeroes, i.e., they do not begin with a sequence of zeroes.

Explain how you will solve this puzzle by creating database tables and writing a query.

Q3.1. (5 points) The schema of the tables you use.
Q3.2. (10 points) Your SQL query.
Q3.3. (5 points) The solution you get for the puzzle when you use an SQL interpreter and RDBMS to solve this puzzle. Copy-paste the output you get.

Hints
The SQL query may be quite long, so you may find it useful to create the query in a text file and use the source command (or equivalent) in your SQL interpreter to read in and execute the query.

Q4: Social Network Friends [10 points]

We are in charge of an online social network, MyBook. Consider the relation MyBookFriends(Id, FriendId), which is a giant table covering every user on MyBook: for each user, it keeps track of all friends of that user. Together, the two attributes comprise the (only) key for this relation. Researchers studying social networks are interested in counting the number of people who have k friends, for every possible value of k.
(10 points) Write a SQL query that operates on MyBookFriends and returns a relation Counts(NumFriends, NumIds). If this relation stores the tuple (k, l), then it means that there are l distinct users (Id values) in MyBookFriends, each of whom has exactly k friends.

For no points: imagine you run this query on a real social network like Facebook, and then plot the values with k on the x-axis and l on the y-axis. What shape do you expect the plot to have? Uniform? Linear? Non-linear? Any other particular function?

Hints
As you can imagine, you do not know beforehand all the different values of k (or l) that should appear in Counts. Hence your SQL query should be able to figure out all these values automatically and correctly.

Q5. B+ Tree [15 points]

Assume the following B+ tree exists with d = 2:

[Figure: initial B+ tree]

Sketch the state of the B+ tree after each step in the following sequence of insertions and deletions, maintaining at least 50% occupancy at each step, with splits triggered by overflow. In the diagram above we have not shown pointers between the leaf nodes for simplicity, but remember that the leaf nodes form a linked list.

Note: Use the insertion and deletion algorithms given in the textbook, Sections 10.5 (page 349) and 10.6 (page 353) respectively (also in the slides). The root node can have 1 to 2d keys. During deletion, redistribute the leaf pages wherever possible.

Q5.1. (3 points) Insert 34
Q5.2. (3 points) Insert 2
Q5.3. (3 points) Insert 15
Q5.4. (3 points) Delete 28
Q5.5. (3 points) Delete 8
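The overflow-triggered leaf split mentioned in Q5 can be sketched as follows. This is a minimal illustration for a single leaf only, assuming the common textbook convention that the first d entries stay in the old leaf, the remaining d + 1 move to a new leaf, and the smallest key of the new leaf is copied up to the parent. The function name insert_into_leaf and the keys used are made up for the example and are not taken from the homework's tree; check the convention against the textbook sections cited above before relying on it.

```python
import bisect


def insert_into_leaf(leaf, key, d=2):
    """Insert key into a sorted leaf that may hold at most 2*d keys.

    Returns (left, right, copied_up). When no split is needed, right and
    copied_up are None and left is the updated leaf.
    """
    keys = list(leaf)          # work on a copy; the input leaf is unchanged
    bisect.insort(keys, key)   # keep the keys in sorted order

    if len(keys) <= 2 * d:
        return keys, None, None  # still fits, so no split

    # Overflow: 2*d + 1 keys. Assumed convention: the first d keys stay,
    # the remaining d + 1 move to a new right sibling.
    left, right = keys[:d], keys[d:]
    # For a leaf split the separator is copied up (it also stays in the leaf).
    return left, right, right[0]


# Made-up keys, d = 2: a full leaf overflows and splits.
print(insert_into_leaf([10, 20, 30, 40], 25))
# A half-full leaf simply absorbs the new key.
print(insert_into_leaf([10, 20], 15))
```

Both resulting leaves hold at least d keys after the split, which is what keeps the 50% occupancy requirement satisfied. Handling the parent (and splitting index nodes, where the middle key is pushed up rather than copied) is left out of this sketch.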