INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

Exam 1
16 June 2014

The duration of this exam is 2.5 hours. You can access your own written materials, but the exam is to be done individually. You are not allowed to use computers, tablets, or mobile phones. The maximum grade of the exam is 20 pts. Write your answers below the questions. Write your number and name at the top of each page. Present all calculations performed. After the exam starts, you can leave the room one hour after delivering the exam.

The following table is to be used by instructors ONLY:

1    2    3    4    5    SUM
4    4    4    4    4    20

1. (4 pts) Indexing

1.1. (2.5 pts) Suppose that we are using extendable hashing on a file that contains records with the following search-key values:

Search-key value    Hash value
Ronaldo             00010
Messi               00011
Hernandez           00101
Iniesta             00111
Lahm                00011
Ibrahimovic         00001
Rooney              00011
Neymar              00111

Show the extendable hash structure for this file, if the hash values for each search key are as shown in the table above, and if buckets can hold up to three records. Use the least significant bits of the hash value.
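As an aid for checking a hand-drawn answer to 1.1, the following is a minimal sketch of extendable hashing that indexes the directory with the least significant bits. It assumes the records are inserted in the order in which the table lists them, which the question does not state; the class and function names are illustrative only.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth: number of hash bits this bucket distinguishes
        self.items = []     # (key, hash value) pairs


class ExtendableHash:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]

    def insert(self, key, h):
        # Note: more than `capacity` keys with identical hash values would
        # split forever; the exam data does not trigger this case.
        while True:
            # Index the directory with the global_depth least significant bits.
            bucket = self.directory[h & ((1 << self.global_depth) - 1)]
            if len(bucket.items) < self.capacity:
                bucket.items.append((key, h))
                return
            self._split(bucket)  # make room, then retry the insertion

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Doubling: entries i and i + 2^d share their d low bits,
            # so the second half starts as a copy of the first.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)
        old_items, bucket.items = bucket.items, []
        for k, h in old_items:
            (sibling if h & new_bit else bucket).items.append((k, h))
        # Redirect the directory entries whose newly examined bit is 1.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling


# Assumption: insertion order follows the table in question 1.1.
RECORDS = [
    ("Ronaldo", 0b00010), ("Messi", 0b00011), ("Hernandez", 0b00101),
    ("Iniesta", 0b00111), ("Lahm", 0b00011), ("Ibrahimovic", 0b00001),
    ("Rooney", 0b00011), ("Neymar", 0b00111),
]

table = ExtendableHash(capacity=3)
for key, h in RECORDS:
    table.insert(key, h)

# One line per directory entry: bit pattern, local depth, keys in the bucket.
for i, bucket in enumerate(table.directory):
    print(format(i, f"0{table.global_depth}b"), bucket.depth,
          [k for k, _ in bucket.items])
```

Directory entries that print the same bucket contents point to one shared bucket whose local depth is smaller than the global depth.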

1.2. (1 pt) Suppose that you have a sorted file and want to construct a dense primary B+ tree index on this file.

a) One way to accomplish this task is to scan the file, record by record, inserting each one using the B+ tree insertion procedure. What performance and storage utilization problems are there with this approach?

b) Explain how the bulk-loading algorithm improves upon this scheme.

1.3. (0.5 pt) What is the difference between a B+-tree index and a B+-tree file organization? Indicate one advantage of each scheme.
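For question 1.2 b), the idea of building the tree bottom-up from already sorted keys can be sketched as follows. This is a simplified illustration, not the exact algorithm of any particular textbook: nodes are plain lists, every node is packed full, and each internal node stores the smallest key of each of its children rather than the usual separator keys. The function name and the `fanout` parameter are hypothetical.

```python
def bulk_load(sorted_keys, fanout=4):
    """Build the levels of a packed tree bottom-up.

    Returns a list of levels, leaves first. Each level is a list of nodes,
    and each node is a list of keys. Assumption: an internal node holds the
    smallest key of each child, which simplifies the real B+ tree layout.
    """
    # Leaf level: cut the sorted input into consecutive, fully packed leaves.
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    # Build each upper level from the level below until a single root remains.
    while len(level) > 1:
        smallest = [node[0] for node in level]
        level = [smallest[i:i + fanout] for i in range(0, len(smallest), fanout)]
        levels.append(level)
    return levels


levels = bulk_load(list(range(1, 11)), fanout=4)
for depth, nodes in enumerate(levels):
    print("level", depth, nodes)
```

Each key is written to a node exactly once and no node is ever split, in contrast to inserting the records one at a time through the root.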

2. (4 pts) Query Processing and Optimization

2.1. (2.5 pts) Consider performing a natural join between the following two relations:

Client(Name,ID)
ClientDetails(ID,Property,Value)

Assume that the Client tuples are stored contiguously on 2000 disk blocks and that the ClientDetails tuples are stored contiguously on 400 blocks. Each block of Client or ClientDetails holds up to 50 tuples. There are 102 memory blocks available. Compute the I/O cost for each of the following join algorithms, justifying your result. Ignore the I/O cost of writing the output to disk. Unless stated otherwise, the tuples in the relations are not sorted.

a) Merge join, sorted relations (i.e., assume that both relations are sorted).

b) Merge join, unsorted relations (i.e., assume that both relations are unsorted).

c) Index join. Assume that there is an index on the ID column of Client. We read a block of ClientDetails and, for each tuple in this block, we use the index to find all matching tuples of Client. Each of these Client tuples is read into memory and joined with the tuples from ClientDetails. We repeat the process for all blocks of ClientDetails. Assume that the index is entirely in memory, and assume that, on average, each tuple of ClientDetails matches 4 tuples of Client.

d) Hash join. Assume Client as the build relation.
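The arithmetic behind 2.1 can be sketched as below. The result depends on the counting convention, so the assumptions are stated explicitly and are not taken from the exam: all blocks are full (50 tuples each), every sort pass reads and writes the whole relation (including writing the final sorted file), each matching Client tuple in the index join costs one block read, and the hash join is a partitioned join with a single partitioning pass. Other textbooks count some of these steps differently.

```python
import math

B_CLIENT = 2000        # blocks of Client
B_DETAILS = 400        # blocks of ClientDetails
TUPLES_PER_BLOCK = 50  # assumption: every block is full
M = 102                # memory blocks available


def merge_join_sorted(b_r, b_s):
    # Both inputs are already sorted: one sequential read of each relation.
    return b_r + b_s


def external_sort(b, m):
    # Run generation produces ceil(b / m) sorted runs; each merge pass
    # combines up to m - 1 runs. Every pass reads and writes b blocks.
    runs = math.ceil(b / m)
    merge_passes = 0
    while runs > 1:
        runs = math.ceil(runs / (m - 1))
        merge_passes += 1
    return 2 * b * (1 + merge_passes)


def merge_join_unsorted(b_r, b_s, m):
    # Sort both relations completely, then merge-join the sorted files.
    return external_sort(b_r, m) + external_sort(b_s, m) + merge_join_sorted(b_r, b_s)


def index_join(b_outer, tuples_per_block, matches_per_tuple):
    # Read the outer relation once, plus one block read per matching tuple.
    # The index itself is in memory, so index lookups cost no I/O.
    return b_outer + b_outer * tuples_per_block * matches_per_tuple


def partitioned_hash_join(b_build, b_probe):
    # Partitioning reads and writes both relations; the join phase reads
    # them once more. Assumption: each build partition fits in memory.
    return 3 * (b_build + b_probe)


costs = {
    "a) merge join, sorted": merge_join_sorted(B_CLIENT, B_DETAILS),
    "b) merge join, unsorted": merge_join_unsorted(B_CLIENT, B_DETAILS, M),
    "c) index join": index_join(B_DETAILS, TUPLES_PER_BLOCK, 4),
    "d) hash join": partitioned_hash_join(B_CLIENT, B_DETAILS),
}
for name, cost in costs.items():
    print(name, cost)
```

Under a different convention, for example merging the sorted runs directly into the join instead of first writing the sorted files, the figure for b) would be lower.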

2.2. (1 pt) Consider the following database relations:

Client(Name,Address,ClientID)
ClientSubscriptions(ClientID,SubscriptionType)
ClientID: Foreign Key(Client)

The relation Client has 200 tuples and the relation ClientSubscriptions has 600 tuples. Answer the following questions:

a) Estimate the number of tuples of Client X ClientSubscriptions.

b) Consider the selection: σ ClientID=2 (Client X ClientSubscriptions). Estimate the number of tuples returned by the selection.

2.3. (0.5 pt) What would change in the answer to question 2.2.b) if the selection condition was ClientID>2?

3. (4 pts) Transactions and Concurrency Control

3.1. (2.5 pts) Consider a multi-granularity locking system, with lock modes IX, X, IS, S and SIX. The objects are arranged in the following hierarchy:

              relation
             /        \
       block A        block B
      /   |   \        /   \
    a1   a2   a3     b1    b2

Assume there are two active transactions T1 and T2, and there are no lock upgrades.

a) Transaction T1 has already obtained the following locks: IS lock on R, S on B. Transaction T2 wants to modify b2 (and nothing else) while T1 is active. What locks does T2 need to get? Which of these locks can T2 get at this point?

b) Transaction T1 has already obtained the following locks: IS lock on R, S on A. Transaction T2 wants to read b2 and modify a1 (and nothing else) while T1 is active. What locks does T2 need to get? Which of these locks can T2 get at this point?

c) Transaction T1 has already obtained the following locks: IX lock on R, IX on A, X on a1. Transaction T2 wants to read a2 and modify a3 (and nothing else) while T1 is active. What locks does T2 need to get? Which of these locks can T2 get at this point?

d) Transaction T1 has already obtained the following locks: IX lock on R, IX on A, X on a1. Transaction T2 wants to read a2 and modify a1 (and nothing else) while T1 is active. What locks does T2 need to get? Which of these locks can T2 get at this point?
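The reasoning pattern for 3.1 can be sketched with the standard compatibility matrix for the five lock modes. The sketch assumes that R names the relation at the root, that a1, a2 and a3 belong to block A and b1 and b2 to block B, and that locks are requested from the root downwards so that a transaction stops at the first conflict. The helper names are illustrative, and only case a) is worked through.

```python
# Standard multi-granularity compatibility matrix (symmetric):
# COMPATIBLE[held] is the set of modes another transaction may still obtain.
COMPATIBLE = {
    "IS": {"IS", "IX", "S", "SIX"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "SIX": {"IS"},
    "X": set(),
}

# Assumed hierarchy: R is the relation, A and B are its blocks.
PARENT = {"A": "R", "B": "R",
          "a1": "A", "a2": "A", "a3": "A",
          "b1": "B", "b2": "B"}


def locks_needed(node, mode):
    """Locks for one access: intention locks on all ancestors, root first,
    then the requested mode (S or X) on the node itself."""
    intention = "IS" if mode == "S" else "IX"
    ancestors = []
    current = node
    while current in PARENT:
        current = PARENT[current]
        ancestors.append(current)
    return [(n, intention) for n in reversed(ancestors)] + [(node, mode)]


def grantable(requests, held):
    """Return the prefix of `requests` that can be granted given the locks
    `held` by the other transaction (a dict: node -> list of modes)."""
    granted = []
    for node, mode in requests:
        if any(mode not in COMPATIBLE[h] for h in held.get(node, [])):
            break  # a parent lock was refused, so stop descending
        granted.append((node, mode))
    return granted


# Case a): T1 holds IS on R and S on B; T2 wants to modify b2.
t1_locks = {"R": ["IS"], "B": ["S"]}
t2_requests = locks_needed("b2", "X")
t2_granted = grantable(t2_requests, t1_locks)
print("needed: ", t2_requests)
print("granted:", t2_granted)
```

The other cases can be checked the same way by changing `t1_locks` and calling `locks_needed` once per object that T2 accesses.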

3.2. (1 pt) Consider the following two transactions:

T1: R1(X) W1(X) R1(Y) W1(Y)
T2: R2(Y) W2(Y)

Which of the following schedules would be allowed under the 2-Phase Locking protocol? Justify your answer. Assume that the L lock actions in the schedules are exclusive.

a) L1(X) L1(Y) L2(Y) R1(X) W1(X) R1(Y) W1(Y) R2(Y) W2(Y) U2(Y) U1(Y) U1(X)

b) L1(Y) L1(X) R1(Y) W1(Y) R1(X) W1(X) U1(X) U1(Y) L2(Y) R2(Y) W2(Y) U2(Y)
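The two conditions that matter in 3.2 can be checked mechanically: an exclusive lock cannot be granted while another transaction holds it, and a transaction may not lock anything after its first unlock. The sketch below is one possible checker under those rules; the function name is illustrative, and it assumes that the action printed as W(Y) in schedule a) is meant to be W2(Y).

```python
import re

ACTION = re.compile(r"([LURW])(\d+)\((\w+)\)")


def check_2pl(schedule):
    """Return (allowed, reason) for a schedule with exclusive locks only."""
    holder = {}        # item -> transaction currently holding its lock
    shrinking = set()  # transactions that have already released a lock
    for action in schedule.split():
        op, txn, item = ACTION.fullmatch(action).groups()
        if op == "L":
            if txn in shrinking:
                return False, f"{action}: T{txn} locks after an unlock"
            if item in holder:
                return False, f"{action}: {item} is held by T{holder[item]}"
            holder[item] = txn
        elif op == "U":
            if holder.get(item) != txn:
                return False, f"{action}: T{txn} does not hold {item}"
            del holder[item]
            shrinking.add(txn)
        else:  # R or W must happen under the transaction's own lock
            if holder.get(item) != txn:
                return False, f"{action}: T{txn} does not hold {item}"
    return True, "ok"


# Assumption: the exam's "W(Y)" in schedule a) is read as W2(Y).
SCHEDULE_A = ("L1(X) L1(Y) L2(Y) R1(X) W1(X) R1(Y) W1(Y) "
              "R2(Y) W2(Y) U2(Y) U1(Y) U1(X)")
SCHEDULE_B = ("L1(Y) L1(X) R1(Y) W1(Y) R1(X) W1(X) U1(X) U1(Y) "
              "L2(Y) R2(Y) W2(Y) U2(Y)")

print("a)", check_2pl(SCHEDULE_A))
print("b)", check_2pl(SCHEDULE_B))
```

The checker reports only the first violation it meets, which is enough to justify rejecting a schedule.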

3.3. (0.5 pt) Consider the following schedule for three concurrent transactions and indicate whether it is possible under the timestamp-based protocol. Justify.

T1 T2 T3
--------------------------------------------
R(A) R(C) W(A) W(B) W(B) W(C)

4. (4 pts) Recovery Management

4.1. (2.5 pts) Briefly answer the following questions regarding the ARIES recovery algorithm:

a) If the system fails repeatedly during recovery, what is the maximum number of log records that can be written (as a function of the number of update and other log records written before the crash) before restart completes successfully? Justify.

b) What is the oldest log record we need to retain? Justify.

4.2. (1 pt) In the context of the ARIES algorithm, explain the purpose of the checkpoint mechanism. Explain how the frequency of checkpoints affects:

(i) The system's performance when no failure occurs.
(ii) The time it takes to recover from a system crash.
(iii) The time it takes to recover from a disk crash (i.e., a crash on stable storage).

4.3. (0.5 pt) How does the recovery manager ensure atomicity of transactions? How does it ensure durability?

5. (4 pts) Miscellaneous

5.1. (1 pt) Discuss how the SQL Server DBMS supports data indexing by answering the following questions:

a) Are clustered indexes on non-key attributes supported?

b) Are composite indexes supported? Besides composite indexes, is there any other way to implement the idea of "covering indexes" through non-clustered indexes?

c) Can the non-clustered indexes be sparse, or do they have to be dense?

d) How are tuple locators (i.e., the pointers to the actual tuples associated with the index keys) implemented in the case of non-clustered indexes (i.e., what is the relationship between clustered and non-clustered indexes)?

e) Why should clustered index keys use as few columns as possible?

5.2. (1 pt) Explain the two general strategies through which parallelism can be used in DBMS processing. Clearly indicate how DBMSs can make use of these two general strategies (i.e., what are the typical operations where each of the strategies can be employed).

5.3. (1 pt) Consider a database storing information about the expenses and revenues associated with the employees of a given company, specifically (i) the monthly revenues generated by each employee, and (ii) the yearly salary for each employee. Consider also the following two equivalent SQL queries, which return the employees whose yearly pay equals the revenue they generate for the company:

SELECT * FROM Employees WHERE salary = 12*revenue;

SELECT * FROM Employees WHERE salary/12 = revenue;

Explain which query is better (i.e., in which situations one query is better than the other, in terms of execution efficiency). Justify your answer.

5.4. (1 pt) Explain the difference between a system crash and a "disaster." Indicate what kinds of strategies are used in database management systems for handling both types of problems.