Department of Computer Science and Engineering
Database Administration and Tuning
2013/2014, 2nd semester
Mini-Project 2 - solution

Question 1
Equivalence Between Relational Algebra Expressions

Consider the following two information needs, and the alternative SQL queries that are presented to address each of them.

Information Need 1

  First alternative:
    SELECT *
    FROM T1 JOIN (T2 EXCEPT T3) TR ON T1.A1 = TR.B1

  Second alternative:
    ( SELECT * FROM T1 JOIN T2 ON T1.A1 = T2.B1 )
    EXCEPT
    ( SELECT * FROM T1 JOIN T3 ON T1.A1 = T3.B1 )

Information Need 2

  First alternative:
    SELECT T1.A2
    FROM T1 LEFT OUTER JOIN T2 ON T1.A1 = T2.B1
    WHERE T1.A2 > 10

  Second alternative:
    SELECT T1.A2
    FROM ( SELECT * FROM T1 WHERE T1.A2 > 10 ) TR
         LEFT OUTER JOIN T2 ON TR.A1 = T2.B1

(a) Write the four SQL queries in the form of relational algebra expressions, considering a direct translation of the SQL instructions that are shown.

(b) Using the equivalence rules between relational algebra expressions, show that the two alternative SQL queries presented for each information need are indeed equivalent.

(c) For each of the information needs, explain how one of the considered alternative queries can be said to be more efficient.

Question 2
Query Optimization and Estimation of Join Sizes

Consider the following two relations:

R (A, B, C)
S (C, D)

Consider also the following information regarding the two relations:

IST/DEI Page 1 of 12
Nr tuples R = 10000
Nr tuples S = 5000
V(R, A) = 20
V(R, B) = 50
V(R, C) = 150
V(S, C) = 200
V(S, D) = 30

Estimate the number of tuples that results from the expression

$\pi_{A,C,D}\left[\left(\sigma_{A=3 \wedge B=5}(R)\right) \bowtie S\right]$

Question 3
Transaction Isolation Levels and Associated Problems

Consider a relational database for information related to product sales in a store, with the following four tables. The tables are slightly modified versions of those that are available in the AdventureWorks database:

PRODUCTION.PRODUCT ( ProductID, Name, ListPrice, QuantityInStock )
SALES.CUSTOMER ( CustomerID, Phone, Email, FirstName, LastName )
SALES.SALE ( SaleID, DeliveryAddress, CreditCard, SaleDate, CustomerID, SaleTotal )
SALES.SALE_PRODUCT ( SaleID, ProductID, Quantity )

Assume that the following three stored procedures can run concurrently in a given application that is supported by the relational database:

create_product: creates new rows in the PRODUCTION.PRODUCT table, for new products that will be sold in the store. This procedure uses only the PRODUCTION.PRODUCT table to insert new products.

create_modify_sale: records new sales, or modifies existing SALES.SALE and SALES.SALE_PRODUCT information. This procedure writes to the SALES.SALE and SALES.SALE_PRODUCT tables, updates the corresponding tuples in the PRODUCTION.PRODUCT table, and may read from the SALES.CUSTOMER and PRODUCTION.PRODUCT tables.

create_customer: records new customer data in the database. This procedure uses only the SALES.CUSTOMER table to insert new customers.

(a) Give a scenario that leads to a possible dirty read in the concurrent execution of operations from this group of stored procedures, or explain why a dirty read cannot happen in this group of stored procedures.

(b) Give a scenario that leads to a possible non-repeatable read in the concurrent execution of operations from this group of stored procedures, or explain why a non-repeatable read cannot happen in this group of stored procedures.
(c) Give an example of a possible phantom read in the concurrent execution of operations from this group of stored procedures, or explain why a phantom read cannot happen in this group of stored procedures.

(d) Indicate which transaction isolation level you would use for executing each of the three procedures above, and why. For each procedure, you should use the least restrictive transaction isolation level that ensures correctness.

(e) Consider now an abstract situation, involving the concurrent execution of two transactions. In the description shown below, X() means acquiring an exclusive lock, S() means acquiring a shared lock, R() refers to a read operation, and W() refers to a write operation. All locks are released during commit or abort.

T1: X(A), R(A), W(A), X(B), R(B), W(B), Commit
T2: X(B), R(B), W(B), S(A), R(A), Abort

Indicate which of the following problems can occur if the transactions run concurrently, and explain why:

- Deadlock in the execution of both transactions;
- Dirty read in transaction T1;
- Non-repeatable read in transaction T2.

Question 4
Concurrency Control

4.1. Consider the following schedule for two concurrent transactions:

Step   T1           T2
1      lock-S(A)
2      read(A)
3                   lock-X(B)
4                   write(B)
5                   unlock(B)
6      lock-S(B)
7      read(B)
8      unlock(A)
9      unlock(B)

(a) Is the schedule allowed in Strict 2-Phase Locking? Justify.

(b) Is the schedule allowed by the timestamp-based protocol? Justify.
4.2. Consider now the following schedule for three concurrent transactions:

Step   T1          T2          T3
1      write(A)
2                  write(A)
3                              write(A)
4      write(B)
5                  write(B)

(a) Is the schedule allowed in Strict 2-Phase Locking? Justify.

(b) Is the schedule allowed by the timestamp-based protocol? Justify.

Question 5
Recovery System

Consider the following simplified representation for the log records that correspond to a given execution, and suppose the ARIES algorithm is followed by the recovery system:

LSN   Type               Transaction   Page
00    Begin_checkpoint   -             -
10    End_checkpoint     -             -
20    Update             T1            P1
30    Update             T2            P2
40    Update             T3            P3
50    Commit             T2            -
60    Update             T3            P2
70    Update             T1            P5
CRASH!!!

Consider also that the system crashes during recovery from the crash that is represented above, after writing two log records to stable storage. Note that the active transaction table and the dirty page table are empty at the time of the checkpoint.

Show the contents of the log after (a) the analysis phase, (b) the redo phase, and (c) the undo phase.
Solutions to Question 1

(a) The two equivalent expressions for the first information need are as follows:

$T1 \bowtie_{T1.A1 = B1} (T2 - T3)$

$(T1 \bowtie_{T1.A1 = T2.B1} T2) - (T1 \bowtie_{T1.A1 = T3.B1} T3)$

The two equivalent expressions for the second information need are as follows, where ⟕ denotes the left outer join:

$\pi_{T1.A2}(\sigma_{T1.A2 > 10}(T1 ⟕_{T1.A1 = T2.B1} T2))$

$\pi_{T1.A2}((\sigma_{T1.A2 > 10}(T1)) ⟕_{T1.A1 = T2.B1} T2)$

(b) In the case of the first information need, let us name $T1 \bowtie_{T1.A1 = B1} (T2 - T3)$ as R1, $T1 \bowtie_{T1.A1 = T2.B1} T2$ as R2, and $T1 \bowtie_{T1.A1 = T3.B1} T3$ as R3. If a tuple t belongs to R1, it will also belong to R2, because $T2 - T3 \subseteq T2$. If a tuple t belongs to R3, then the part of t that comes from the right operand of the join belongs to T3, and hence t cannot belong to R1. From these two observations we can say that:

$\forall t,\; t \in R1 \Rightarrow t \in (R2 - R3)$

If a tuple t belongs to $R2 - R3$, then the part of t that comes from the right operand of the join belongs to T2 and does not belong to T3 (otherwise t would also belong to R3). Therefore:

$\forall t,\; t \in (R2 - R3) \Rightarrow t \in R1$

The above two implications give the equivalence in the case of the first information need.

In the case of the second information need, let us first notice that the selection condition (T1.A2 > 10) uses only attributes from T1. Therefore, if any tuple t in the output of the operation $T1 ⟕ T2$ is filtered out by the selection of the left-hand side, then all the tuples in T1 that agree with t on the attributes of T1 are filtered out by the selection of the right-hand side. Therefore:

$\forall t,\; t \notin \sigma_{T1.A2 > 10}(T1 ⟕_{T1.A1 = T2.B1} T2) \Rightarrow t \notin (\sigma_{T1.A2 > 10}(T1)) ⟕_{T1.A1 = T2.B1} T2$

Using a similar reasoning, we can also conclude that:
$\forall t,\; t \notin (\sigma_{T1.A2 > 10}(T1)) ⟕_{T1.A1 = T2.B1} T2 \Rightarrow t \notin \sigma_{T1.A2 > 10}(T1 ⟕_{T1.A1 = T2.B1} T2)$

The above two implications give the equivalence in the case of the second information need.

(c) The first equivalence is helpful because evaluating the expression in which the difference is executed prior to the join, $T1 \bowtie (T2 - T3)$, avoids producing many join tuples that are anyway going to be removed from the result. This expression can therefore, in principle, be evaluated more efficiently than the one that performs two joins and only then computes the difference.

The second equivalence is also helpful: in the expression that applies the selection only after the join, the join may involve many tuples that will finally be removed from the result. The expression that applies the selection to T1 first can, in principle, be evaluated more efficiently, given that the selection filters tuples prior to the join.

Solutions to Question 2

R (A, B, C)
S (C, D)
Nr tuples R = 10000, Nr tuples S = 5000
V(R, A) = 20, V(R, B) = 50, V(R, C) = 150, V(S, C) = 200, V(S, D) = 30

Expression: $\pi_{A,C,D}\left[\left(\sigma_{A=3 \wedge B=5}(R)\right) \bowtie S\right]$

In what follows, $n(E)$ denotes the estimated number of tuples of an expression E.

$n(\sigma_{A=3 \wedge B=5}(R)) = n(\sigma_{A=3}(\sigma_{B=5}(R)))$

$n(\sigma_{B=5}(R)) = n(R) / V(R, B) = 10000 / 50 = 200$

$n(\sigma_{A=3}(\sigma_{B=5}(R))) = n(\sigma_{B=5}(R)) / V(\sigma_{B=5}(R), A) = 200 / 20 = 10$

Let $Y = \sigma_{A=3 \wedge B=5}(R)$. Then Attributes(Y) ∩ Attributes(S) = {C}, and C is a key of neither Y nor S. We take V(Y, C) = V(R, C) = 150.

$n(Y \bowtie S) = \min\left(\frac{n(Y) \cdot n(S)}{V(S, C)}, \frac{n(Y) \cdot n(S)}{V(Y, C)}\right) = \min\left(\frac{10 \cdot 5000}{200}, \frac{10 \cdot 5000}{150}\right) = \min(250, 333.3) = 250$

Let $Z = Y \bowtie S$, so that $n(Z) = 250$. Every tuple of Y has B = 5, so adding B to the projection list does not change the number of distinct tuples, and:

$n(\pi_{A,C,D}(Z)) = n(\pi_{A,B,C,D}(Z)) = V(Z, (A,B,C,D))$

{A,B,C,D} contains attributes from Y and from S. In particular, A1 = {A,B,C} belongs to Y and A2 = {C,D} belongs to S, so:

$V(Z, (A,B,C,D)) = \min\left(V(Y, (A,B,C)) \cdot V(S, D),\; V(Y, (A,B)) \cdot V(S, (C,D)),\; n(Z)\right)$

$V(Y, (A,B,C)) = \min(V(Y,A) \cdot V(Y,B) \cdot V(Y,C),\; n(Y)) = \min(1 \cdot 1 \cdot 150,\; 10) = 10$

$V(Y, (A,B)) = \min(V(Y,A) \cdot V(Y,B),\; n(Y)) = \min(1,\; 10) = 1$

$V(S, (C,D)) = \min(V(S,C) \cdot V(S,D),\; n(S)) = \min(200 \cdot 30,\; 5000) = 5000$

Therefore:

$V(Z, (A,B,C,D)) = \min(10 \cdot 30,\; 1 \cdot 5000,\; 250) = \min(300,\; 5000,\; 250) = 250$

Solutions to Question 3

(a) A dirty read can occur when two concurrent transactions execute the operation named create_modify_sale. For instance, a transaction T1 can start its execution in the read uncommitted isolation level, in order to record a sale for a given product. Concurrently, a transaction T2 starts to create a sale for that same product, and registers a change in the QuantityInStock attribute. Before transaction T2 commits, transaction T1 can read a dirty value for the QuantityInStock attribute.

(b) A non-repeatable read can occur when two concurrent transactions execute the operation named create_modify_sale. For instance, a transaction T1 can start its execution in either the read uncommitted or the read committed isolation level, and read the QuantityInStock attribute for a given product. Concurrently, a transaction T2 creates a sale for that same product and commits. If T1 reads QuantityInStock again, it obtains a different value, and if T1 updates QuantityInStock based on its first read, the update uses a value that is no longer correct.
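The interleaving described in (b) can be sketched as a deterministic, single-threaded simulation. This is only an illustration under stated assumptions: the dictionary `stock` is a hypothetical in-memory stand-in for one PRODUCTION.PRODUCT row, the quantities (10 in stock, sales of 3 and 4 units) are made up, and no real DBMS or isolation-level machinery is involved.

```python
# Hypothetical stand-in for a single PRODUCTION.PRODUCT row (not a real database).
stock = {"QuantityInStock": 10}

# T1 (create_modify_sale) reads the stock level, intending to sell 3 units.
t1_first_read = stock["QuantityInStock"]

# T2 (create_modify_sale) sells 4 units of the same product and commits
# before T1 finishes.
stock["QuantityInStock"] -= 4

# T1 reads the same attribute again: the value differs from its first read,
# which is exactly a non-repeatable read.
t1_second_read = stock["QuantityInStock"]

# T1 writes back a value computed from its first (stale) read,
# silently discarding the effect of T2's committed sale.
stock["QuantityInStock"] = t1_first_read - 3

print("first read :", t1_first_read)
print("second read:", t1_second_read)
print("final stock:", stock["QuantityInStock"], "(a serial execution would leave 3)")
```

With repeatable read (or a stricter level), the lock or snapshot held by T1 on the row would prevent T2's committed change from becoming visible between the two reads of T1.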
(c) A phantom read will not occur in the concurrent execution of the three stored procedures described in this exercise, given that there are no operations for deleting tuples from the PRODUCTION.PRODUCT or SALES.CUSTOMER tables, and given that the create_modify_sale procedure only reads data from specific tuples in the PRODUCTION.PRODUCT and SALES.CUSTOMER tables (i.e., there are no range queries involved).

(d) I would use the read uncommitted isolation level for executing create_product and create_customer, given that these stored procedures do not perform reads, and they only insert new tuples into the PRODUCTION.PRODUCT and SALES.CUSTOMER tables. This assumes that these creation operations can be performed atomically, and that adding a new tuple does not involve checking for the existence of another tuple with the same primary key; otherwise, one can choose the read committed isolation level. I would use repeatable read for the create_modify_sale procedure, in order to avoid the problem of non-repeatable reads.

(e) A deadlock can occur in the concurrent execution of both transactions, as shown in the following example:

T1: X(A)
T2: X(B)
T1: R(A)
T1: W(A)
T2: R(B)
T2: W(B)
T1: X(B) - waits until T2 releases its X-lock on B
T2: S(A) - waits until T1 releases its X-lock on A
DEADLOCK

A dirty read cannot happen in transaction T1, given that both transactions acquire X-locks on the variables before write operations, and S-locks or X-locks on the variables before read operations. A non-repeatable read cannot occur in transaction T2 either, given that the transactions hold their locks (X-locks and S-locks) until the commit or abort operations, never "downgrading" them. In brief, both transactions follow two-phase locking, which ensures serializable schedules.

Solutions to Question 4

4.1.
This schedule is not allowed in the timestamp protocol. T1 starts first, so TS(T1) < TS(T2). At step 7, the W-timestamp of B is TS(T2), because T2 wrote B at step 4, and therefore TS(T1) < W-timestamp(B): the read of B by T1 is rejected.

The schedule is allowed under S2PL because each transaction first requests locks and, once it starts to unlock, doesn't request any more locks (2PL), and because each transaction can keep its exclusive locks until commit time. This means that T2 must commit at step 5 and T1 at step 8.

4.2.
This schedule cannot have lock instructions added to make it legal under the strict two-phase locking protocol, because the transaction that writes A at step 2 and B at step 5 (T2) must unlock A between steps 2 and 3, and must lock B between steps 4 and 5. It would therefore acquire a lock after having released one.

The schedule works under the timestamp-based protocol because, each time there is a write operation over A (or B), the timestamp of the writing transaction Ti satisfies TS(Ti) >= W-timestamp(A) (respectively, W-timestamp(B)).

Solutions to Question 5

LSN   Type               Transaction   Page
00    Begin_checkpoint   -             -
10    End_checkpoint     -             -
20    Update             T1            P1
30    Update             T2            P2
40    Update             T3            P3
50    Commit             T2            -
60    Update             T3            P2
70    Update             T1            P5
CRASH!!!

a) 1st Analysis phase (after 1st crash):

LSN 20 Add (T1,20) to ATT and (P1,20) to DPT
LSN 30 Add (T2,30) to ATT and (P2,30) to DPT
LSN 40 Add (T3,40) to ATT and (P3,40) to DPT
LSN 50 Change status of T2 to C
LSN 60 Change (T3,40) to (T3,60)
LSN 70 Change (T1,20) to (T1,70) and add (P5,70) to DPT

At the end of analysis, the active transaction table (ATT) contains the following entries: (T1,70,U), (T2,50,C) and (T3,60,U). The dirty page table (DPT) has the following entries: (P1,20), (P2,30), (P3,40), and (P5,70).

No changes to the log.

b) 1st Redo phase (after 1st crash):

Redo starts from LSN 20 (minimum recLSN in the DPT).

LSN 20 Check whether P1 has a pageLSN greater than 10 (i.e., at least 20); if not, redo the change in P1
LSN 30 Redo the change in P2
LSN 40 Redo the change in P3
LSN 50 No action
LSN 60 Redo the changes on P2
LSN 70 Redo the changes on P5

Since T2 is committed, write the record "LSN 80 T2 end" to the log and remove T2 from the ATT.

c) 1st Undo phase (after 1st crash):

ToUndo list consists of (70, 60).

Read LSN 70: Undo the changes in P5. Append a CLR (which receives LSN 90): Undo T1 LSN 70, undonextlsn = 20. Add 20 to the ToUndo list.

2nd CRASH!

d) 2nd Analysis phase (after 2nd crash):

LSN 20 Add (T1,20) to ATT and (P1,20) to DPT
LSN 30 Add (T2,30) to ATT and (P2,30) to DPT
LSN 40 Add (T3,40) to ATT and (P3,40) to DPT
LSN 50 Change status of T2 to C
LSN 60 Change (T3,40) to (T3,60)
LSN 70 Change (T1,20) to (T1,70) and add (P5,70) to DPT
LSN 80 Remove T2 from ATT
LSN 90 Change (T1,70) to (T1,90)

At the end of analysis, the active transaction table (ATT) contains the following entries: (T1,90,U) and (T3,60,U). The dirty page table (DPT) has the following entries: (P1,20), (P2,30), (P3,40), and (P5,70).

No changes to the log.

e) 2nd Redo phase (after 2nd crash):

Redo starts from LSN 20 (minimum recLSN in the DPT).

LSN 20 Check whether P1 has a pageLSN greater than 10 (i.e., at least 20); if not, redo the change in P1. ARIES repeats history, so the update is redone even though T1 did not commit.
LSN 30 Redo the change in P2
LSN 40 Redo the change in P3
LSN 50 No action
LSN 60 Redo the changes on P2
LSN 70 Redo the changes on P5
LSN 80 No action
LSN 90 Redo the CLR, i.e., reapply the undo of the changes on P5

No changes to the log.

f) 2nd Undo phase (after 2nd crash):

ToUndo list consists of (90, 60).

Read LSN 90: CLR with undonextlsn = 20, so add 20 to the ToUndo list. ToUndo consists of (60, 20).

Read LSN 60: Undo the changes on P2. Append a CLR with LSN 100: Undo T3 LSN 60, with undonextlsn = 40. Add 40 to ToUndo. ToUndo consists of (40, 20).

Read LSN 40: Undo the changes on P3. Append two records to the log:
110 CLR: Undo T3 LSN 40 with undonextlsn = null
120 T3 End
ToUndo consists of (20).

Read LSN 20: Undo the changes on P1. Append two records to the log:
130 CLR: Undo T1 LSN 20 with undonextlsn = null
140 T1 End

The log looks like the following after recovery:

LSN 00 begin checkpoint
LSN 10 end checkpoint
LSN 20 update: T1 writes P1
LSN 30 update: T2 writes P2
LSN 40 update: T3 writes P3
LSN 50 T2 commit prevlsn = 30
LSN 60 update: T3 writes P2 prevlsn = 40
LSN 70 update: T1 writes P5 prevlsn = 20
LSN 80 T2 end
LSN 90 CLR: Undo T1 LSN 70 undonextlsn = 20
LSN 100 CLR: Undo T3 LSN 60 undonextlsn = 40
LSN 110 CLR: Undo T3 LSN 40 undonextlsn = null
LSN 120 T3 end
LSN 130 CLR: Undo T1 LSN 20 undonextlsn = null
LSN 140 T1 end
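The two analysis passes of this solution can be replayed with a short sketch. It is a simplified illustration, not a full ARIES implementation: the tuple layout of `LOG`, the function name `analysis`, and the status letters are assumptions made for this example, and the checkpoint records are skipped because both tables are empty at the checkpoint in this exercise.

```python
# Log records as (LSN, type, transaction, page); None where a field does not apply.
# Records 80 and 90 are the two records written before the second crash.
LOG = [
    (0, "begin_checkpoint", None, None),
    (10, "end_checkpoint", None, None),
    (20, "update", "T1", "P1"),
    (30, "update", "T2", "P2"),
    (40, "update", "T3", "P3"),
    (50, "commit", "T2", None),
    (60, "update", "T3", "P2"),
    (70, "update", "T1", "P5"),
    (80, "end", "T2", None),
    (90, "clr", "T1", "P5"),
]


def analysis(log):
    """Rebuild the active transaction table (ATT) and dirty page table (DPT)."""
    att = {}  # transaction -> (lastLSN, status), status "U" (undo) or "C" (committed)
    dpt = {}  # page -> recLSN (LSN of the first record that dirtied the page)
    for lsn, kind, txn, page in log:
        if kind in ("begin_checkpoint", "end_checkpoint"):
            continue  # checkpoint tables are empty in this exercise
        if kind == "end":
            att.pop(txn, None)  # transaction is finished, drop it from the ATT
            continue
        att[txn] = (lsn, "C" if kind == "commit" else "U")
        if page is not None:
            dpt.setdefault(page, lsn)  # keep the earliest LSN as recLSN
    return att, dpt


# First analysis pass: the log ends at LSN 70.
att1, dpt1 = analysis(LOG[:8])
# Second analysis pass: the log also contains LSN 80 and LSN 90.
att2, dpt2 = analysis(LOG)
# Undo starts from the lastLSN of every transaction still marked "U".
to_undo = sorted((lsn for lsn, status in att2.values() if status == "U"), reverse=True)

print(att1, dpt1)
print(att2, dpt2)
print(to_undo)
```

Running the sketch yields the same ATT and DPT contents as listed in a) and d), and the initial ToUndo list (90, 60) used in f).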