6 December 2004 CS-6421 Final Exam p. 1 of 7

CSCI-6421 Final Exam
York University
Fall Term 2004
Due: 6pm Wednesday 15 December 2004

Last Name:
First Name:
Instructor: Parke Godfrey
Exam Duration: take home
Term: Fall 2004

Your assignment, should you choose to accept it, is to answer the following questions to the best of your knowledge. Try to keep answers brief and to the point, but be precise and be careful. Write any assumptions you make along with your answers, whenever necessary.

The exam is open-book and open-notes. The exam is take-home.

There are ten main questions. Each is worth five points. #1 and #10 are compulsory. You must do the compulsories and five others of your choosing, for seven questions in all. So the test is 35 points in total.

You may do an additional (eighth) problem. If so, I will drop the non-compulsory with the lowest score. If you do more than eight (the two compulsories plus more than six others), I shall randomly dispense with non-compulsories until I have eight to grade!

For the logicians:

1. (5 points) Datalog. What's that again, in English this time!?

   Consider (again) the schema

       Movie(title, director, year)
       Cast(actor, title, role)    FK (title) refs Movie

   Consider the following rules which are used in the queries to follow.

       castin(A, M) ← cast(A, M, R).
       actor(A) ← castin(A, M).
       castout(A, M) ← castin(A, M), castin(A, N), M ≠ N.
       dicast(A, M) ← cast(A, M, R_1), cast(A, M, R_2), R_1 ≠ R_2.

   For each of the following Datalog queries, restate the query in concise, understandable English.

   a. query(A) ← dicast(A, M).
   b. query(A) ← actor(A), movie(M, D, Y), not castin(A, M).
      actor(A), not query(A).
   c. query(A) ← castin(A, M_1), castout(A, M_2).
   d. query(A) ← castin(A, M), not castout(A, M).
   e. query(A) ← castin(A, M), not dicast(A, M), not castout(A, M).

2. (5 points) Negation Semantics. A somewhat stable model.

   a. (3 points) Is there a Datalog database P such that p (a positive atomic consequence) is a consequence of P with respect to the stable model semantics, but p is not a consequence of P with respect to the well founded semantics, and P has a unique stable model? If this cannot happen, explain why not. Otherwise, provide an example.

   b. (2 points) Is there a Datalog database P such that p (a positive atomic consequence) is a consequence of P with respect to the well founded semantics, but p is not a consequence of P with respect to the stable model semantics? (Note that when P has no stable models, everything is a consequence of P with respect to the stable model semantics.) If this cannot happen, explain why not. Otherwise, provide an example.

3. (5 points) Datalog to SQL. To the max!

   Consider the Datalog rules

       maximalp(X_1, ..., X_k) ← p(X_1, ..., X_k), not trumpedp(X_1, ..., X_k).
       trumpedp(X_1, ..., X_k) ← p(X_1, ..., X_k), p(Y_1, ..., Y_k),
                                 greater([Y_1, ..., Y_k], [X_1, ..., X_k]).
       greater([X|Xs], [Y|Ys]) ← X > Y, greatereq(Xs, Ys).
       greater([X|Xs], [X|Ys]) ← greater(Xs, Ys).
       greatereq([X|Xs], [Y|Ys]) ← X ≥ Y, greatereq(Xs, Ys).
       greatereq([], []).

   and the query

       maximalp(X_1, ..., X_k).

   a. (2 points) What does this query evaluate?
   b. (3 points) Write the query in SQL.

4. (5 points) Containment. Stop being repetitious and redundant.

   A rule R (defining predicate r) is logically redundant with respect to database DB if any query Q possible with respect to DB has the same answers whether evaluated against DB or against DB ∪ {R}. (Assume r has other rules defining it already in DB and that R is an additional rule for r.)

   a. (1 point) Consider the DB and R

      DB:
          p(X, Y) ← e(X, A), e(A, B), e(B, Y).
          e(X, Y) ← f(X, Y).
          e(X, Y) ← f(Y, X).

          f(a, b).   f(a, f).   f(f, h).
          f(b, c).   f(b, g).   f(f, i).
          f(c, d).   f(c, h).   f(g, i).
          f(d, e).   f(d, i).   f(g, j).
          f(e, a).   f(e, j).   f(h, j).

      R:
          p(X, Y) ← e(X, Y).

      Is R logically redundant with respect to DB? Why or why not?

   b. (3 points) Describe a general method to determine whether a rule R is redundant with respect to DB.

   c. (1 point) Is your procedure decidable? Why or why not?

5. (5 points) Expressiveness. Express yourself!

   a. One could, albeit with much effort, code up chess via the win and recursion-through-negation like we did for the stones game in class. If our chess program is locally stratified, then this means that there is a perfect model, and everything is assigned true or false. This means win(beginning board state) is either true or false. So it would be known that white (the first player) could always win playing a perfect game or that black (the second player) could always win playing a perfect game. Does this mean that a game of chess is necessarily winnable by the perfect white player or the perfect black player? Why or why not?

   b. Are there any types of queries that can be expressed in SQL but not Datalog?

   c. Are there any types of queries that can be expressed in SQL but not Datalog? (Careful.)

   d. Is Datalog a superset of first-order predicate calculus (logic)? Why or why not?

   e. Is Datalog interpreted under negation-as-finite-failure, the well founded semantics, or the stable model semantics a subset of first-order predicate calculus (logic)? Why or why not?

For the engineers:

6. (5 points) Sequential Reads. Speed it up.

   Consider each of the join algorithms that we have studied:

   a. BNLJ (block nested loops join),
   b. INLJ (index nested loops join),
   c. HJ (two-pass hash join),
   d. SMJ (two-pass sort merge join), and
   e. MJ (merge join, with outer and inner sorted prior).

   Explain briefly whether sequential reads and writes would be advantageous in each case. Assume that sequential reads are generally not advantageous for filescans of base tables. Base tables become fragmented on disk over time due to inserts and deletes.

7. (5 points) Indexes. In a mess.

   You have just joined the team at Very Small Databases, Inc., (VSDB). You have been assigned to work with the infamous database expert Dr. Mark Dogfurry. Your first job is to work with him to tune a database being built for the company Geisel & Associates. The Geisel & Associates database includes two tables, Sneech and Whovillian. Dr. Dogfurry tells you the following.

   For table Sneech, there are the following indexes.

   1. A unique clustered hash index on name.
   2. An unclustered B+ tree index on specialty + no_stars.
   3. An unclustered B+ tree index on birthdate + hometown.
   4. A clustered B+ tree index on birthdate + hometown + school_level.

   For table Whovillian, there are the following indexes.

   5. A unique unclustered hash index on name.
   6. A unique unclustered hash index on age.
   7. A unique hash index on address + name.
   8. An unclustered B+ tree index on grade.

   The order of the attributes as listed for the composite index keys (for example, birthdate + hometown) is important: this is the order of the attributes by which the index is built.

   The information that Dr. Dogfurry has given you is suspect. That is, there is reason to believe that there are mistakes in what he has told you. State five distinct problems with the above information. Explain why each is a problem: that it is an impossibility; that it is useless; that it is redundant; and so forth.

8. (5 points) Index Mechanics. Always losing your keys?

   a. (3 points) A linear hash has just been started. The linear hashed file currently just has one bucket (primary page). The current hash function pair is h_0, h_1. Here, h_0 masks for zero (!) right-hand bits from the hashed key, and so always returns bucket address 0. Hash function h_1 masks for 1 right-hand bit, h_2 for 2, and so forth. Assume that each page can hold two entries. The file currently has one entry of 21 (10101_2).

      [Figure: the initial file, with bucket 0 holding the entry 21 and the next pointer marked.]

      A split should be triggered whenever an overflow page is created. Show the linear hashed file after each of the following inserts: 30 (11110_2), 18 (10010_2), 35 (100011_2), 17 (10001_2), and 13 (1101_2). The insertions are cumulative, so your final hashed file should contain 30, 18, 35, 17, and 13.

   b. (2 points) Consider an extendible hash index that has 2^10 directory slots. What can you say about how many buckets (so data-record pages if alternative #1, data-entry pages if alternative #2 or #3) the index has?

9. (5 points) Joins. Is it further to New York or by train?

   Consider the schema R(A, B) and S(C, A). The underlined attributes designate the primary key. S has a foreign key cast on R through A, and S.A is not nullable. Consider the query

       select * from R, S where R.A = S.A;

   There is a clustered tree index on R.A and an unclustered tree index on S.A. Each index is of alternative #2 and has two layers of index pages. So the third layer in each case consists of the data-entry pages.

   Let N_T generically denote the number of records in table T, and V_T.C denote the number of distinct values found in column C of table T. N_R ≪ N_S. That is, the number of records in table R is much less than the number of records in table S.

   a. (3 points) Which INLJ (index nested loops) join is better for the query?

      A. R as the outer, using the unclustered index on S.A for probing.
      B. S as the outer, using the clustered index on R.A for probing.

      Justify your claim. (Use N_R, N_S, V_R.A, etc. in your argument.)

   b. (2 points) Would employing an INL join for this query make sense? Why or why not?
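
As an illustration of the notation in question 8a only, the hash function family described there (h_i keeps the i right-hand bits of the key) can be sketched in Python as below. The function names `h` and `bucket_for` are hypothetical, and the addressing rule in `bucket_for` is the usual textbook convention for linear hashing, assumed here rather than stated in the question. The sketch is applied only to the key 21 that is already in the file.

```python
def h(i: int, key: int) -> int:
    """Sketch of h_i: keep the i right-hand bits of key, so h(0, key) is always 0."""
    return key & ((1 << i) - 1)


def bucket_for(key: int, level: int, next_ptr: int) -> int:
    """Assumed textbook addressing rule for the pair (h_level, h_level+1).

    Buckets below next_ptr have already been split in the current round,
    so keys that land there are rehashed with one more bit.
    """
    bucket = h(level, key)
    if bucket < next_ptr:
        bucket = h(level + 1, key)
    return bucket


if __name__ == "__main__":
    key = 21  # 10101 in binary, the entry already in the file
    for i in range(4):
        print(f"h_{i}({key}) = {h(i, key)}")
    # With a single bucket, level 0, and next at bucket 0, every key maps to bucket 0.
    print(bucket_for(key, level=0, next_ptr=0))
```

Split handling and overflow pages are deliberately left out; they are what the question asks you to work through by hand.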

10. (5 points) Query Optimization. Simply the best plan available.

    Schema:
        Employee(eid, name, did, jobcat, salary)
        JobBenefits(title, jobcat, since)    FK (title) refs Benefit
        Benefit(title, description, cost)

    Statistics:
        Employee: 100,000 records on 2,000 pages
            jobcat: 500 distinct values
            did (department ID): 100 distinct values (department #13 is accounting)
        JobBenefits: 3,500 records on 70 pages
            title: 200 distinct values
            jobcat: 500 distinct values (same values as in Employee.jobCat)
        Benefit: 200 records on 20 pages
            cost: ranges over $500,...,$10,500

    Indexes:
        Employee:
            Clustered tree index on eid. (Index pages two deep; third layer, data-entry pages.)
            Unclustered tree index on did, jobcat. (Index pages two deep; third layer, data-entry pages.)
        JobBenefits:
            Clustered tree index on jobcat, title. (Index page one deep; second layer, data-entry pages.)
        Benefit:
            Hash index on title.

    Query:
        select name, eid, B.title
        from Employee E, JobBenefits J, Benefit B
        where E.jobCat = J.jobCat
          and J.title = B.title
          and E.did = 13
          and B.cost > 10000;

    You have an allocation of twelve buffer frames.

    a. (1 point) Estimate the cardinality (the resulting number of records) of the query.

    b. (4 points) Devise a good query plan for the query. Show the query tree, fully annotated with the chosen algorithms and access paths. Estimate the cost of your plan. For full credit, you should have a plan that costs less than 1,500 I/Os.