permanent. Otherwise (only one process refuses or crashs), the state before beginning the operations is reload.

Size: px

Start display at page:

Download "permanent. Otherwise (only one process refuses or crashs), the state before beginning the operations is reload."

Irene Davis
6 years ago
Views:

1 Distributed Transactions Simple World 1960s: all data were held on magnetic tape; simple transaction management Example supermarket with automatic inventory system: ¾One tape for yesterday's inventory, one for today's changes; the output is the new inventory ¾In case of a failure: rewind tapes and restart job Modern transaction concept originates from database management systems (1980) Transactions for distributed objects: late 1980s and 1990s Transactions Concept of transactions is strongly related to Mutual Exclusion: Mutual exclusion ¾Shared resources (data, servers,...) are controlled in a way, that no more than one process is allowed to access it at a time Transactions ¾Protect shared data against simultaneous access by concurrent processes to ensure a consistent state ¾Avoid 'faulty accesses' caused by process failures Atomic operations, i.e. a sequence of steps in a transaction are finished all or none. After starting a transaction, all involved processes can perform operations (create, delete, modify). Sometime, the initiator asks them all to commit their work. If all do so, the changes became permanent. Otherwise (only one process refuses or crashs), the state before beginning the operations is reload. That means: A transaction is a sequence of operations that is guaranteed to be atomic, even in presence of multiple concurrent clients and process crashes. Problem today Example banking application using a PC (an account is a remote object) 1. Withdraw amount x from account a 2. Deposit amount x to account b Transaction T: a.withdraw(100); b.deposit(100); c.withdraw(200); b.deposit(200); deposit(amount) deposit amount in the account withdraw(amount) withdraw amount from the account getbalance() amount return the balance of the account setbalance(amount) set the balance of the account to amount create(name) account create a new account with a given name lookup(name) account return a reference to the account with the given name branchtotal() amount return the total of all the balances at the branch But... When there is a connection loss after the frist step, the money is lost. Thus: group the operations belonging together in a transaction. Key issue: rolling back to the initial state

2 Transaction Primitives Programming with transactions: special primitives are required Typical example: Primitive Description BEGIN_TRANSACTION Make the start of a transaction END_TRANSACTION Terminate the transaction and try to commit ABORT_TRANSACTION Kill the transaction and restore the old values READ Read data from a file, a table, or otherwise WRITE Write data to a file, a table, or otherwise Characteristics of Transactions Each transaction has to fulfil the ACID properties: A Atomic. From 'outside', all operations of a transaction seem to be a single unit C Consistent. The transaction does not violate system invariants I Isolated. Concurrent transactions do not interfere with each other D Durable. When a transaction is permitted, all changes become permanent That means: 1. Either all of the operations must be completed successfully or they must have no effect at all even in the presence of server crashes (A, D, C) 2. Transactions are free from interference by operations performed on behalf of other concurrent clients (I, C) Transaction Example Simple example: booking a stop-over flight Transaction to reserve three flights commits: BEGIN_TRANSACTION reserve WP -> JFK; reserve JFK -> Nairobi; reserve Nairobi -> Malindi; END_TRANSACTION Transaction aborts when third flight is unavailable: BEGIN_TRANSACTION reserve WP -> JFK; reserve JFK -> Nairobi; reserve Nairobi -> Malindi full => Now the bookings for the first two ABORT_TRANSACTION flights have to be undone. Classification of Transactions Flat transactions Simply a series of operations (in a single machine) that satisfy the ACID properties. Simplest type of transactions, most often used. Limitations: partial results cannot be committed or aborted what to do with transactions distributed across several machines? Better approaches Nested transactions Transactions can fork into sub-transactions Distributed transactions Basing on flat or nested transactions, but covering the distribution aspect

Nested Transactions Nested transactions allow more concurrency: a transaction is divided into a number of sub-transactions Top-level transaction forks off children which run (independently) in

3 Nested Transactions Nested transactions allow more concurrency: a transaction is divided into a number of sub-transactions Top-level transaction forks off children which run (independently) in parallel Hierarchies of sub-transactions are allowed Considerable administration! T : top-level transaction T 1 = opensubtransaction T 2 = opensubtransaction T 1 : T 2 : opensubtransaction opensubtransaction opensubtransaction T 11 : T 12 : pre. commit pre. commit pre. commit T 21 : opensubtransaction T 211 : commit pre. commit abort pre.commit Distributed Transactions Problem with flat and nested transactions: transactions cannot be distributed across multiple machines Solution: distributed transactions: flat or nested transactions that operate on distributed data Problem now: need for distributed algorithms to guarantee the ACID properties M Flat transaction Nested transaction X T 11 X T Client Y T T Client Z T 1 T 2 Y T 12 T 21 T 22 N P Sub-Transactions Logical division of the work of the top-level transaction When a (sub-)transaction starts, it gets a private copy of all data. This makes rollbacks simple. To a parent, a sub-transaction is atomic with respect to failures and concurrent access transactions at the same level (e.g. T 1 and T 2 ) can run concurrently but access to common objects is serialised a sub-transaction can fail independently of its parent and other sub-transactions, but its parent decides, what to do (give up, start another sub-transaction,...) Only the top-level transaction has permanence; with it commits, all (successful) sub-transactions become permanent Distributed Transactions Distributed transactions normally have a coordinator which coordinates all transaction operations at the involved machines: BEGIN_TRANSACTION END_TRANSACTION join. join participant A a.withdraw(4); BranchX T participant b.withdraw(t, 3); Client B b.withdraw(3); T = BEGIN_TRANSACTION a.withdraw(4); c.deposit(4); b.withdraw(3); d.deposit(3); END_TRANSACTION join Note: the coordinator is in one of the servers, e.g. BranchX BranchY participant C D BranchZ c.deposit(4); d.deposit(3);

4 Transaction Management &OLHQWV 7 3 7UDQVDFWLRQ0DQDJHU 2SHUDWLRQVHTXHQFHV 7UDQVDFWLRQ0DQDJHU&RRUGLQDWRU $OORFDWLRQRIWUDQVDFWLRQ,'V7,'V $VVLJQLQJ7,'V ZLWKRSHUDWLRQV &RRUGLQDWLRQRIFRPPLWPHQWV DERUWVDQGUHFRYHU\ %(*,1B75$16$&7,21(1'B75$16$&7,21 6FKHGXOHU 2SHUDWLRQVHTXHQFHV ZLWK7,'V 6HULDOLVHG RSHUDWLRQV 6FKHGXOHU &RQFXUUHQF\FRQWURO /RFNFRQWURO/2&.5(/($6( 7LPHVWDPSRSHUDWLRQV 'HDGORFNGHWHFWLRQ 'DWD5HFRYHU\0DQDJHU UHDGZULWH RSHUDWLRQV /RJ '% UHDGZULWH RSHUDWLRQV 'DWD5HFRYHU\0DQDJHU /RJJLQJRSHUDWLRQV 0DNHWUDQVDFWLRQVSHUPDQHQW 8QGRWUDQVDFWLRQV 5HFRYHUDIWHUDFUDVK ([HFXWHUHDGZULWHRSHUDWLRQV Atomicity: manager transforms transaction primitives into scheduling requests Isolation and consistency: guaranteed by scheduler Durability: realised by the data manager Implementation of Transactions There are two common methods for implementing transactions: Private Workspace A starting process is given copies of all the files which it has to access Until commit/abort, all operations are made on the private workspace Optimisations: ¾ Do not copy files which only have to be read, only set a pointer to the parent's workspace. ¾ For writing into files, only copy the file's index. When writing in a file block, only copy this block and update the address in the copied file index. In distributed transactions: one process is started on each involved machine, getting a workspace containing the files necessary on this machine (as above) Writeahead Log Files are modified in place, but before changing a block, a record is written to a log, containing: which transaction is making the change, which file/block is changed, what are the old and new values. In case of an abort: make a simple rollback, read the log from the end and undo all recorded changes Distributed Transaction Management Distributed transactions are similar to centralised ones. The problem is to coordinate the operations necessary on distributed data &OLHQWV 'LVWULEXWHG 'LVWULEXWHG7UDQFDFWLRQ 0DQDJHU $OORFDWLRQRIWUDQVDFWLRQ,'V7,'V $VVLJQLQJ7,'V ZLWKRSHUDWLRQV &RRUGLQDWLRQRIFRPPLWPHQWV DERUWVDQGUHFRYHU\ %(*,1B75$16$&7,21(1'B75$16$&7,21 6FKHGXOHU 'DWD 0DQDJHU /RJ '% 5HVRXUFH 6FKHGXOHU 'DWD 0DQDJHU /RJ '% 5HVRXUFH 6FKHGXOHU 'DWD 0DQDJHU /RJ '% 5HVRXUFH /RFDO6FKHGXOHU /RFDOFRQFXUUHQF\ FRQWURO /RFDO ORFN FRQWURO/2&.5(/($6( /RFDOWLPHVWDPS RSHUDWLRQV &RRSHUDWLYHGHDGORFN GHWHFWLRQ /RFDO'DWD5HFRYHU\0DQDJHU 0DQDJHORFDOORJ 0DNHORFDOWUDQVDFWLRQVSHUPDQHQW 8QGRORFDOWUDQVDFWLRQV /RFDOO\UHFRYHU\ DIWHUDFUDVK ([HFXWHORFDOUHDGZULWHRSHUDWLRQV Private Workspace a) The file index and disk blocks for a three-block file b) The situation after a transaction has modified block 0 and appended block 3 c) After committing, the changes are stored

5 Writeahead Log x = 0; y = 0; BEGIN_TRANSACTION; x = x + 1; y = y + 2 x = y * y; END_TRANSACTION; (a) Log [x = 0/1] (b) Log [x = 0/1] [y = 0/2] (c) Log [x = 0/1] [y = 0/2] [x = 1/4] (d) a) A transaction b) d) The log before each statement is executed The Lost Update Problem Transaction T: balance = b.getbalance(); b.setbalance(balance*1.1); a.withdraw(balance/10) Transaction U: balance = b.getbalance(); b.setbalance(balance*1.1); c.withdraw(balance/10) balance = b.getbalance(); $200 balance = b.getbalance(); $200 b.setbalance(balance*1.1); $220 b.setbalance(balance*1.1); $220 a.withdraw(balance/10) $80 c.withdraw(balance/10) $280 The initial balances of accounts A, B, C are $100, $200, $300 Bothtransfer transactions increase B s balance by 10% - the final result should be $242. Finally, B only gets to $220 - T had worked with a wrong value, thus the update made by U gets lost Concurrency Control Problem for the scheduler: guarantee consistency and isolation Controlling the execution of concurrent transactions working on shared data at the same time Achieved by giving concurrent transactions access to data in a specific order. The final result must be the same as if all transactions would have been executed sequentially Concurrency Control If there would be no concurrency control, two problems with concurrent transactions would arise: 1.) Lost update problem: two transactions both read the same old value of a variable and use it to calculate a new value 2.) Inconsistent retrievals problem: a retrieval transaction observes values that are involved in an ongoing updating transaction The Inconsistent Retrievals Problem Transaction V: a.withdraw(100) b.deposit(100) Transaction W: abranch.branchtotal() a.withdraw(100); $100 total = a.getbalance() $100 total = total+b.getbalance() $300 total = total+c.getbalance() b.deposit(100) $300 A and B both initially have $200. V transfers $100 from A to B while W calculates branch total (which should be $400 for A and B together) it occurs an inconsistent retrieval because V has only done the withdraw part when W sums the balances of A and B

6 Serialisation The sketched problems can be avoided: The transactions are scheduled to avoid overlapping access to the accounts accessed by both of them Aserially equivalent interleaving is one in which the combined effect is the same as if the transactions had been done one at a time in some order The same effect means The read operations return the same values The instance variables of the objects have the same values at the end Serially Equivalent Interleaving of V and W (Inconsistent Retrievals) Transaction V: a.withdraw(100); b.deposit(100) Transaction W: abranch.branchtotal() a.withdraw(100); $100 b.deposit(100) $300 total = a.getbalance() $100 total = total+b.getbalance() $400 total = total+c.getbalance()... If W is run before or after V, the problem will not occur Therefore it will not occur in a serially equivalent ordering of V and W The illustration is serial, but it need not to be Serially Equivalent Interleaving of T and U (Lost Updates) Transaction T: balance = b.getbalance() b.setbalance(balance*1.1) a.withdraw(balance/10) Transaction U: balance = b.getbalance() b.setbalance(balance*1.1) c.withdraw(balance/10) balance = b.getbalance() $200 b.setbalance(balance*1.1) $220 balance = b.getbalance() $220 a.withdraw(balance/10) $80 b.setbalance(balance*1.1) $242 c.withdraw(balance/10) $278 If one of T and U runs before the other, they can t get a lost update The same is true if they are run in a serially equivalent ordering Serialisation How to achieve a serially equivalent interleaving? BEGIN_TRANSACTION x = 0; x = x + 1; END_TRANSACTION BEGIN_TRANSACTION x = 0; x = x + 2; END_TRANSACTION BEGIN_TRANSACTION x = 0; x = x + 3; END_TRANSACTION Three transactions working on the same data item. Possible schedules would be: Schedule 1 x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3 Legal Schedule 2 x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; Legal Schedule 3 x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; Illegal Schedule 1 is serialised Schedule 2 is not serialised, but produces no illegal results Schedule 3 is illegal: conflicting write operations!

7 Serialisation Representation of transactions is a series of read and write operations Example: T: x = 0; x = x+1 write(t, x); read(t, x); write(t, x) Concurrency control schedules conflicting operations to serialise them Conflict: two operations operate on the same data item, at least one is a write operation For two transactions to be serially equivalent, it is necessary and sufficient that all pairs of conflicting operations of the two transactions are executed in the same order at all of the objects they both access read and write Operation Conflict Rules Operations of different transactions conflict Reason read read No Because the effect of a pair of read operations does not depend on the order in which they are executed read write Yes Because the effect of a read and a writeoperation depends on the order of their execution write write Yes Because the effect of a pair of write operations depends on the order of their execution Conflicting operations A pair of operations conflicts if their combined effect depends on the order in which they were performed e.g. read and write (whose effects are the result returned by read and the value set by write) Non-serially Equivalent Interleaving of Operations Consider T: x = read(i); write(i, 10); write(j, 20); U: y = read(j); write(j, 30); z = read (i); Transaction T: Transaction U: x = read(i) write(i, 10) y = read(j) write(j, 30) write(j, 20) z = read (i) Each transaction s access to i and j is serialised w.r.t one another, but T makes all accesses to i before U does U makes all accesses to j before T does Therefore this interleaving is not serially equivalent Concurrency Control Algorithms Three categories of algorithms: Locking: ¾ A transaction is granted the exclusive access by setting locks ¾ Most used practice for concurrency control. Timestamp: ¾ Operations are ordered by using timestamps before they are carried out ¾ No conflicts can happen Optimistic: ¾ No control while executing operations ¾ Synchronisation takes place at the end of a transaction ¾ If conflicts had occurred, some transactions are forced to abort

8 Locking Oldest and most widely used algorithm is locking: When a process needs to read/write on a data item, it requests the scheduler to grant it a lock for this data item. After performing read/write, the scheduler is asked to release the lock The scheduler is responsible to make valid schedules from all requests; it must be provided with an algorithm that produces only serialised schedules Known algorithm: Two-phase locking Steps in two-phase locking (2PL): 1. When the scheduler receives a request op(t, x) from the transaction manager, it tests, if the operation conflicts with any other operation already granted a lock. Yes: the operation is delayed No: a lock is granted for data item x; the operation is passed on to the data manager 2. When the data manager acknowledges the execution of the operation, the belonging lock is released by the scheduler 3. When a lock for T is released, no more locks are granted for T, no matter for which data item T is requesting a lock The algorithm is widely used, because by rule 3, serialisability is given Transactions T and U with Exclusive Locks when T is about to use b, it is locked for T Transaction T: balance = b.getbalance() b.setbalance(bal*1.1) a.withdraw(bal/10) Transaction U: balance = b.getbalance() b.setbalance(bal*1.1) c.withdraw(bal/10) Operations Locks Operations Locks when U is about to use b, it is still locked for T, and U waits BEGIN_TRANSACTION bal = b.getbalance() lock B b.setbalance(bal*1.1) BEGIN_TRANSACTION a.withdraw(bal/10) lock A bal = b.getbalance() waits for T's lock on B END_TRANSACTION when T commits, it unlocks B unlock A,B b.setbalance(bal*1.1) lock B c.withdraw(bal/10) lock C END_TRANSACTION unlock B,C U can now continue Thus: the use of the lock on b effectively serialises access to b Locks hold by a Transaction Growing phase: locks are requested by the transaction when needed Shrinking phase: when the first lock is released, no more locks can be requested

9 Strict Two-Phase Locking In many systems, locks are released only if the transaction commits or aborts. (delaying read and write operations) Main advantage: If T 2 reads a data item formerly modified by T 1, it can be sure that the value is already committed and no rollback is made after its reading operation This avoids cascaded aborts Two-phase Locking in Distributed Systems Several ways for implementation in distributed systems: Centralised 2PL A single site (lock manager) is responsible for granting and releasing locks. Each transaction manager communicates with it to get its locks. If a lock is granted, the transaction manager communicates directly with the responsible data managers (notice: data can be replicated over several machines). After finishing all operations, the lock is given back to the lock manager. Primary 2PL Each data item is assigned a primary copy. The lock manager on that copy's machine is responsible for granting and releasing locks. By this, centralised 2PL is enhanced by distributing the locking across multiple machines. Distributed 2PL Data may be replicated across multiple machines. The schedulers on each machine now are not only granting and releasing local locks, but also forward operations to the (local) data manager. Types of Locks Concurrency control protocols are designed to deal with conflicts between operations in different transactions on the same object Describe protocols in terms of read and write operations, which are assumed to be atomic read operations of different transactions do not conflict, therefore exclusive locks reduce concurrency more than necessary The many reader/single writer scheme allows several transactions to read an object or a single transaction to write it (but not both) It uses read locks (sometimes calles shared locks) and write locks For one object Lock requested read write Lock already set none OK OK read OK wait write wait wait Implementation of Lock Manager The granting of locks will be implemented by a separate object in a server that is called lock manager. The lock manager holds a set of locks, for example in a hash table. Each lock is an instance of the class Lock and is associated with a particular object. its variables refer to the object, the holder(s) of the lock and its type The lock manager code uses wait (when an object is locked) and notify when the lock is released The lock manager provides setlock and unlock operations for use by the server

10 Lock Class public class Lock { private Object object; // the object being protected by the lock private Vector holders; // the TIDs of current holders private LockType locktype; // the current type No interference between threads public synchronized void acquire(transid trans, LockType alocktype ){ while(/*another transaction holds the lock in conflicting mode*/) { try { wait(); catch ( InterruptedException e){/*...*/ if(holders.isempty()) { // no TIDs hold lock holders.addelement(trans); locktype = alocktype; Get shared lock else if(/*another transaction holds the lock, share it*/ ) ){ if(/* this transaction not a holder*/) holders.addelement(trans); else if (/* this transaction is a holder but needs a more exclusive lock*/) locktype.promote(); Wait for lock to be released Get lock Make shared lock to exclusiv lock LockManager Class public class LockManager { private Hashtable thelocks; public void setlock(object object, TransID trans, LockType locktype){ Lock foundlock; synchronized(this){ // find the lock associated with object // if there isn t one, create it and add to the hashtable foundlock.acquire(trans, locktype); // synchronize this one because we want to remove all entries public synchronized void unlock(transid trans) { Enumeration e = thelocks.elements(); while(e.hasmoreelements()){ Lock alock = (Lock)(e.nextElement()); if(/* trans is a holder of this lock*/ ) alock.release(trans); Lock Class public synchronized void release(transid trans ){ holders.removeelement(trans); // remove this holder // set locktype to none notifyall(); Release lock; all waiting transactions are notified Deadlocks when using Locks Notice: locking can lead to deadlocks: if two processes acquire the same two locks in opposite order, a deadlock may result Transaction T Transaction U Operations Locks Operations Locks a.deposit(100); write lock A b.deposit(200) write lock B b.withdraw(100) waits for U's lock on B a.withdraw(200); waits for T's lock on A When locks are used, each of T and U acquires a lock on one account and then gets blocked when it tries to access the account the other one has locked. We have a deadlock. The lock manager must be designed to deal with deadlocks.

11 Deadlocks Definition of deadlock: deadlock is a state in which each member of a group of transactions is waiting for some other member to release a lock. a wait-for graph can be used to represent the waiting relationships between current transactions (nodes = transactions, edges = waitfor relationships between transactions) Waits for A T U means: T U Waits for B Deadlock prevention... is unrealistic: e.g. lock all of the objects used by a transaction when it starts unnecessarily restricts access to shared resources. it is sometimes impossible to predict at the start of a transaction which objects will be used. Deadlocks can also be prevented by requesting locks on objects in a predefined order but this can result in premature locking and a reduction in concurrency More complex wait-for Graph V T W T C U U W V B Waits for T, U and V share a read lock on C and W holds write lock on B (which V is waiting for) T and W then request write locks on C and a deadlock occurs e.g. V is in two cycles - look on the left Deadlock detection...by finding cycles in the wait-for graph After detecting a deadlock, a transaction must be selected to be aborted to break the cycle The software for deadlock detection can be part of the lock manager The lock manager holds a representation of the wait-for graph so that it can check it for cycles from time to time Edges are added to the graph and removed from the graph by the lock manager s setlock and unlock operations When a cycle is detected, choose a transaction to be aborted and then remove from the graph all the edges belonging to it it is hard to choose a victim - e.g. choose the oldest transaction or the one in the most cycles

12 Timeouts on locks Alternative for deadlock detection by analysing the wait-for graph: Lock timeouts can be used (are commonly used) to resolve deadlocks ¾ Each lock is given a limited period in which it is invulnerable ¾ After this time, a lock becomes vulnerable ¾ Provided that no other transaction is competing for the locked object, the vulnerable lock is allowed to remain ¾ But if any other transaction is waiting to access the object protected by a vulnerable lock, the lock is broken (that is, the object is unlocked) and the waiting transaction resumes ¾ The transaction whose lock has been broken is normally aborted y Problems with lock timeouts ¾ Locks may be broken when there is no deadlock ¾ If the system is overloaded, lock timeouts will happen more often and long transactions will be penalised ¾ It is hard to select a suitable length for a timeout Phantom Deadlocks A deadlock that is detected, but is not really one Happens when there appears to be a cycle, but one of the transactions has released a lock, due to time lags in distributing graphs In the figure suppose U releases the object at X then waits for V at Y and the global detector gets Y s graph before X s (T U V T) local wait-for graph local wait-for graph global deadlock detector T U V T T U V X U Y Distributed Deadlocks Distributed transactions lead to distributed deadlocks In theory a global wait-for graph can be constructed from local ones One server is the global deadlock detector and collects local wait-for graphs in certain intervals A cycle in a global wait-for graph that is not in local ones is a distributed deadlock W W Waits for C D A Z Waits for V X Held by U V U Held by B Waits for Y But a centralised solution has poor availablity, no fault tolerance, bad scalability, and causes high communication costs. Furthermore: Phantom Deadlocks Edge Chasing - a Distributed Approach to Deadlock Detection No global graph is constructed, but each server knows about some of the edges Servers try to find cycles by sending probes which follow the edges of the graph through the distributed system Send a probe for an edge T 1 T 2 when T 2 is waiting for some time By forwarding probes along a waiting chain, cycles can be found Each coordinator records whether its transactions are active or waiting the local lock manager tells coordinators if transactions start/stop waiting when a transaction is aborted to break a deadlock, the coordinator tells the participants, locks are removed and edges taken from wait-for graphs

13 Edge-Chasing Algorithm Three steps: Initiation: When a server notes that T starts waiting for U, where U is waiting at another server, it initiates detection by sending a probe containing the edge < T U > to the server where U is blocked. If U is sharing a lock, probes are sent to all the holders of the lock. Detection: Detection consists of receiving probes and deciding whether a deadlock has occurred and whether to forward the probes. E.g. when a server receives probe < T U > it checks if U is waiting, e.g. U V, if so it forwards < T U V > to server where V waits When a server adds a new edge, it checks whether a cycle is created Resolution: When a cycle is detected, a transaction in the cycle is aborted to break the deadlock. Edge Chasing Conclusion Probe to detect a cycle with N transactions will require 2(N-1) messages Studies of databases show that the average deadlock involves 2 transactions the above algorithm detects deadlock provided that waiting transactions do not abort no process crashes, no lost messages to be realistic it would need to allow for the above failures refinements of the algorithm to avoid more than one transaction causing a detection to start and then more than one being aborted Probes Transmitted to Detect Deadlocks example of edge chasing starts with X sending <W U>, then Y sends <W U V >, then Z sends <W U V W> W U V W Deadlock detected Z C Waits for V W U V W Waits for Initiation W U U A X Waits for Y B (Pessimistic) Timestamp Ordering Different approach to concurrency control: Assign a timestamp ts(t) to transaction T at the moment it starts UsingLamport's timestamps: total order is given. In distributed systems, a timestamp consists of a pair <local timestamp, server-id>. ts(t) is assigned with each operation done by T. This allows a total order on all read and write events. Each data item x is assigned with two timestamps ts RD (x) and ts WR (x) for read/write operations. The read timestamp is set to ts(t) of the transaction T which most recently read x (analogously: write timestamp) If two operations conflict, the data manager processes the one with the lower timestamp first A transaction's request to write an object is valid only if that object was last read and written by earlier transactions. A transaction's request to read an object is valid only if that object was last written by an earlier transaction. (Tentative operations: work on own copy, make them permanent when committing.)

14 Rules for read and write Operations Rule T c T i 1. write read T c must not write an object that has been read by any T i where This requires that T c WKHPD[LPXPUHDGWLPHVWDPSRIWKHREMHFW T i >T c 2. write write T c must not write an object that has been written by any T i where This requires that T c > write timestamp of the committed object. T i >T c 3. read write T c must not read an object that has been written by any T i where This requires that T c > write timestamp of the committed object. T i >T c Example: T sends a read(t, x) with ts(t) < ts WR (x) T is aborted T sends a read(t, x) with ts(t) > ts WR (x) read operation is performed, ts RD (x) = max{ts(t), ts RD (x) T sends a write(t, x) with ts(t) < ts RD (x) T is aborted T sends a write(t, x) with ts(t) > ts RD (x) no younger operation has read the value; change it, set ts WR (x) = max{ts(t), ts WR (x) Write Operations and Timestamps if (T c PD[LPXPUHDGWLPHVWDPSRQobject D && T c > write timestamp on committed version of D) perform write operation on tentative version of D with write timestamp T c else /* write is too late */ Abort transaction T c (a) T 3 write (b) T 3 write Before After T 2 T 2 T 3 Time Before After T 1 T 1 T Key: 2 T 2 T 3 Time T i Committed T i Tentative (c) T 3 write Before T 1 T 4 (d) T 3 Before write T 4 Transaction aborts object produced by transaction T i (with write timestamp T i ) T 1 <T 2 <T 3 <T 4 After T 1 T 3 T After 4 T 4 Time Time Pessimistic Timestamp Ordering Let ts RD (x) = ts WR (x) = ts(t 1 ), where T 1 has ended some time before. T 2 and T 3 start concurrently with ts(t 2 ) < ts(t 3 ) Look at T 2 's write (a - d) and read (e - h): Read Operations and Timestamps if ( T c > write timestamp on committed version of D) { let D selected be the version of D with the maximum write timestamp T c if (D selected is committed) perform read operation on the version D selected else Wait until the transaction that made version D selected commits or aborts then reapply the read rule else Abort transaction T c (a) T 3 read T 2 Selected read proceeds Time (b) T 3 read T 2 T 4 Selected read proceeds Time Key: T i Committed (c) T 3 read (d) T 3 read T 1 T 2 Selected read waits Time T 4 Transaction aborts Time T i Tentative object produced by transaction T i (with write timestamp T i ) T 1 < T 2 < T 3 < T 4

15 Transaction Commits with Timestamp Ordering When a coordinator receives a commit request, it will always be able to carry it out because all operations have been checked for consistency with earlier transactions Committed versions of an object must be created in timestamp order The server may sometimes need to wait, but the client need not wait To ensure recoverability, the server will save the waiting to be committed versions in permanent storage The timestamp ordering algorithm is strict because The read rule delays each read operation until previous transactions that had written the object had committed or aborted Writing the committed versions in order ensures that the write operation is delayed until previous transactions that had written the object have committed or aborted Optimistic Concurrency Control Third approach to concurrency control: Just do your operations without taking care to any operations of other processes. Problems can be thought about later. (Optimistic control) This should work, because in practise, conflicts are seldom Control algorithm: ¾ keep track on all data items used by a transaction ¾ When committing, all other transactions are checked, if they had modified one of the used data items since the transaction had started. ¾ If yes, the transaction is aborted. Optimistic concurrency control fits best when implementing private workspaces Deadlock free, no process has to wait for locks Disadvantage: a transaction can fail unnecessarily and forced to run again. In cases of heavy load, this decreases performance heavily No real comparison with other approaches are possible: has primarily focused on non-distributed systems, has seldom been implemented in commercial systems Remarks on Timestamp Ordering Concurrency Control The method avoids deadlocks, but e.g. a transaction seeing a larger timestamp aborts, whereas in locking it could wait or even proceed immediatelly (Pessimistic ordering) Modification known as ignore obsolete write rule is an improvement If a write is too late it can be ignored instead of aborting the transaction, because if it had arrived in time its effects would have been overwritten anyway. However, if another transaction has read the object, the transaction with the late write fails due to the read timestamp on the item Multiversion timestamp ordering Allows more concurrency by keeping multiple committed versions late read operations need not be aborted Optimistic Concurrency Control Each transaction has three phases: 1. Working phase the transaction uses a tentative version of the objects it accesses (dirty reads can t occur as we read from a committed version or a copy of it) the coordinator records the readset and writeset of each transaction 2. Validation phase at END_TRANSACTION the coordinator validates the transaction (looks for conflicts) if the validation is successful the transaction can commit. if it fails, either the current transaction, or one it conflicts with is aborted 3. Update phase If validated, the changes in its tentative versions are made permanent. read-only transactions can commit immediately after passing validation.

16 Validation of Transactions Use the read-write conflict rules to ensure that a particular transaction is serially equivalent with respect to all other overlapping transactions Each transaction is given a transaction number (defines its position in time) when it starts validation (the number is kept if it commits) The rules ensure serialisability of transaction T v (transaction being validated) with respect to transaction T i T v T i Rule write read 1. T i must not read objects written by T v read write 2. T v must not read objects written by T i write write 3. T i must not write objects written by T v and T v must not write objects written by T i Rule 3 is valid if only one process at once can be in the update phase Rules 1 and 2 are more complicated: test for overlaps between T v and T i Backward Validation of Transactions check T v with preceding overlapping transactions Backward validation of transaction T v : boolean valid = true; for (int T i = starttn+1; T i <= finishtn; T i ++){ if (read set of T v intersects write set of T i ) valid = false; starttn is the biggest transaction number assigned to some other committed transaction when T v started its working phase finishtn is biggest transaction number assigned to some other committed transaction when T v started its validation phase In figure, StartTn + 1 = T 2 and finishtn = T 3. In backward validation, the read set of T v must be compared with the write sets of T 2 and T 3. The only way to resolve a conflict is to abort T v Validation of Transactions Working Validation Update T 1 T 2 Earlier committed transactions T 3 Transaction being validated T v Later active transactions active 1 active 2 Two forms of validation: Backward validation Forward validation Forward Validation Rule 1: the write set of T v is compared with the read sets of all overlapping active transactions (active1 and active2). Rule 2: (read T v vs write T i ) is automatically fulfilled because the active transactions do not write until T v has completed. Forward validation of transaction T v : boolean valid = true; for (int Tid = active1; Tid <= activen; Tid++){ if (write set of T v intersects read set of Tid) valid = false;

17 Comparison of Forward and Backward Validation Which transaction should be aborted? Forward validation allows flexibility, whereas backward validation allows only one choice (the one being validated) In general, read sets are larger than write sets. Backward validation compares a possibly large read set against the old write sets overhead of storing old write sets Forward validation checks a small write set against the read sets of active transactions need to allow for new transactions starting during validation Comparison of Methods for Concurrency Control Pessimistic approachs (detect conflicts as they arise) Timestamp ordering: serialisation order decided statically Locking: serialisation order decided dynamically Timestamp ordering is better for transactions where reads >> writes, Locking is better for transactions where writes >> reads Strategy for aborts Timestamp ordering immediate Locking waits but can get deadlock Optimistic method All transactions proceed, but may need to abort at the end Efficient operations when there are few conflicts, but aborts lead to repeating work Optimistic Control in Distributed Systems Problem: different servers can order transactions in different ways Parallel validation of transactions Solution: Global validation: the orderings of all servers are checked to be serialisable Globally unique transaction numbers for validation Committing Transactions Commit: the operations should have been performed by all processes in a group, or by none. Often realised by using a coordinator Simple algorithm: One-phase commit ¾ The coordinator asks all processes in the group to perform an operation. ¾ Problem: if one process cannot perform its operation, it cannot notify the coordinator. Thus in practise better schemes are needed. Most common: Two-phase commit (2PC)

18 Two-phase Commit Assume: a distributed transaction involves several processes, each running on a different machine and no failures occur Algorithm: 2 phases with 2 steps each a. Voting phase b. Decision phase 1. The coordinator sends a VOTE_REQUEST message to all participants 2. Each receiving participant sends back a VOTE_COMMIT message to notify the coordinator that it is locally prepared to commit its part of the transaction 3. If the coordinator receives the VOTE_COMMIT from all participants, it commits the whole transaction by sending a GLOBAL_COMMIT to all participants. If one or more participants had voted to abort, the coordinator aborts as well and sends a GLOBAL_ABORT to all participants 4. A participant receiving the GLOBAL_COMMIT, commits locally. Same for abort. Two-Phase Commit A participant P blocked in the READY state can contact another participant Q trying to decide from Q's state what to do: State of Q Action by P COMMIT Make transition to COMMIT ABORT Make transition to ABORT INIT Make transition to ABORT READY Contact another participant When all participants are in state READY, no decision can be made. In that case, the only option is blocking until the coordinator recovers. Recovery means to save current states. Two-Phase Commit a) The finite state machine for the coordinator in 2PC. b) The finite state machine for a participant. But... several problems arise if failures can occur: process crashes, message loss Timeouts are needed to avoid indefinite waiting for crashed processes in the INIT, WAIT resp. READY state. This is no problem with INIT and WAIT, because no commit or abort was decided till then. But when waiting in state READY, the participant cannot abort, because the coordinator may have given a commit to other processes Recoverability If a transaction aborts, the server must make sure that other concurrent transactions do not see any of its effects Two problems: Dirty reads An interaction between a read operation in transaction A and an earlier write operation on the same object by a transaction B that then aborts A has committed before B aborts, so it has seen a value that has never existed A transaction that committed with a dirty read is not recoverable. Solution: delay the own commit of the transaction performing a read which could become a dirty read. Premature writes Interactions between write operations on the same object by different transactions, one of which aborts Write operations must be delayed until the first transaction which has modified the data item has committed or aborted

19 A Dirty Read when Transaction T aborts Transaction T: Transaction U: a.getbalance() a.setbalance(balance + 10) a.getbalance() a.setbalance(balance + 20) balance = a.getbalance() $100 a.setbalance(balance + 10) $110 yu reads A s balance (which was set by T) and then commits balance = a.getbalance() $110 a.setbalance(balance + 20) $130 T subsequently aborts. commit transaction abort transaction U has committed, so it cannot be undone. It has performed a dirty read Two-Phase Commit: Steps taken by the coordinator Actions by coordinator: saving states for recovery write START _2PC to local log; multicast VOTE_REQUEST to all participants; while not all votes have been collected { wait for any incoming vote; if timeout { write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; exit; record vote; if all participants sent VOTE_COMMIT and coordinator votes COMMIT{ write GLOBAL_COMMIT to local log; multicast GLOBAL_COMMIT to all participants; else { write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; 1. phase 2. phase Premature Writes - Overwriting uncommitted Values Transaction T: Transaction U: a.setbalance(105) a.setbalance(110) $100 a.setbalance(105) $105 interaction between write operations when a transaction aborts a.setbalance(110) $110 Some database systems keep before images and restore them after aborts. e.g. $100 is before image of T s write, $105 is before image of U s write If U aborts we get the correct balance of $105, But if U commits and then T aborts, we get $100 instead of $110 Two-Phase Commit: Steps taken by a Participant Actions by participant: write INIT to local log; wait for VOTE_REQUEST from coordinator; if timeout { write VOTE_ABORT to local log; exit; if participant votes COMMIT { write VOTE_COMMIT to local log; send VOTE_COMMIT to coordinator; wait for DECISION from coordinator; if timeout { multicast DECISION_REQUEST to other participants; wait until DECISION is received; /* remain blocked */ write DECISION to local log; if DECISION == GLOBAL_COMMIT write GLOBAL_COMMIT to local log; else if DECISION == GLOBAL_ABORT write GLOBAL_ABORT to local log; else { write VOTE_ABORT to local log; send VOTE_ABORT to coordinator; 1. phase 2. phase

20 Two-Phase Commit: Second Thread of Participant Actions for handling decision requests: /* executed by separate thread */ while true { wait until any incoming DECISION_REQUEST is received; /* remain blocked */ read most recently recorded STATE from the local log; if STATE == GLOBAL_COMMIT send GLOBAL_COMMIT to requesting participant; else if STATE == INIT or STATE == GLOBAL_ABORT send GLOBAL_ABORT to requesting participant; else skip; /* participant remains blocked */ Stay blocked. If all processes are in the READY state, no more operations can be taken to recover. Thus: blocking commit protocol Solution: three-phase commit protocol (3PC) Transaction T decides whether to commit T 11 abort (at M) T 1 provisional commit (at X) T T 12 T 21 T 2 aborted (at Y) T 22 provisional commit (at N) provisional commit (at N) provisional commit (at P) Recall that 1. A parent can commit even if a sub-transaction aborts 2. If a parent aborts, then its sub-transactions must abort In the figure, each sub-transaction has either provisionally committed or aborted Two-phase commit protocol for nested transactions Recall top-level transaction T and sub-transactions T 1, T 2, T 11, T 12, T 21, T 22 A sub-transaction starts after its parent and finishes before it When a sub-transaction completes, it makes an independent decision either to commit provisionally or to abort. A provisional commit is not the same as being prepared: it is a local decision and is not backed up on permanent storage. If the server crashes subsequently, its replacement will not be able to carry out a provisional commit. A two-phase commit protocol is needed for nested transactions It allows servers of provisionally committed transactions that have crashed to abort them when they recover. Information held by coordinators of nested transactions When a top-level transaction commits it carries out a 2PC Each coordinator has a list of its sub-transactions At provisional commit, a sub-transaction reports its status and the status of its descendents to its parent If a sub-transaction aborts, it tells its parent Coordinator of transaction Child transactions Participant Provisional Abort list commit list T T 1, T 2 yes T 1, T 12 T 11, T 2 T 1 T 11, T 12 yes T 1, T 12 T 11 T 2 T 21, T 22 no (aborted) T 2 T 11 no (aborted) T 11 T 12, T 21 T 12 but not T 21 T 21, T 12 T 22 no (parent aborted) T 22

21 Three-Phase Commit Avoid blocking in the presence of process failures. (Not applied often in practise, because the blocking process failures occur very seldom) Coordinator: if all VOTE_COMMITs are received, a PRECOMMIT is sent. If one process crashs and do not respond to this request, the coordinator commits, because the crashed participant has already written a VOTE_COMMIT to its log. Participant: if all participants wait in state PRECOMMIT, then commit. If all are in state READY, then abort (No problem, because no crashed process can be in state COMMIT). Coordinator Participant

Distributed Transactions

Distributed Transactions Distributed ransactions 1 ransactions Concept of transactions is strongly related to Mutual Exclusion: Mutual exclusion Shared resources (data, servers,...) are controlled in a way, that not more than