Transactions. CS6450: Distributed Systems Lecture 16. Ryan Stutsman

Transactions CS6450: Distributed Systems Lecture 16 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Franz Kaashoek, and Nickolai Zeldovich. 1

The transaction Definition: A unit of work: May consist of multiple data accesses or updates Must commit or abort as a single atomic unit Transactions can either commit, or abort When commit, all updates performed on database are made permanent, visible to other transactions When abort, database restored to a state such that the aborting transaction never executed 2

Defining properties of transactions Atomicity: Either all constituent operations of the transaction complete successfully, or none do Consistency: Each transaction in isolation preserves a set of integrity constraints on the data Isolation: Transactions behavior not impacted by presence of other concurrent transactions Durability: The transaction s effects survive failure of volatile (memory) or non-volatile (disk) storage 3

Challenges 1. High transaction speed requirements If always fsync() to disk for each result on transaction, yields terrible performance 2. Atomic and durable writes to disk are difficult In a manner to handle arbitrary crashes Hard disks and solid-state storage use write buffers in volatile memory

Today 1. Techniques for achieving ACID properties Write-ahead logging and checkpointing Serializability and two-phase locking 2. Algorithms for Recovery and Isolation Exploiting Semantics (ARIES) 5

What does the system need to do? Transactions properties: ACID Atomicity, Consistency, Isolation, Durability Application logic checks consistency (C) This leaves two main goals for the system: 1. Handle failures (A, D) 2. Handle concurrency (I) 6

Failure model: crash failures Standard crash failure model: Machines are prone to crashes: Disk contents (non-volatile storage) okay Memory contents (volatile storage) lost Machines don t misbehave ( Byzantine ) 7

Account transfer transaction Transfers $10 from account A to account B transaction transfer(a, B): begin_tx a ß read(a) if a < 10 then abort_tx else write(a, a 10) b ß read(b) write(b, b+10) commit_tx 8

Problem Suppose $100 in A, $100 in B transaction transfer(a, B): begin_tx a ß read(a) if a < 10 then abort_tx else write(a, a 10) b ß read(b) write(b, b+10) commit_tx commit_tx starts the commit protocol: write(a, $90) to disk write(b, $110) to disk What happens if system crash after first write, but before second write? After recovery: Partial writes, money is lost Lack atomicity in the presence of failures 9

System structure Smallest unit of storage that can be atomically written to non-volatile storage is called a page Buffer manager moves pages between buffer pool (in volatile memory) and disk (in non-volatile storage) Buffer manager Buffer pool Page Disk Non-volatile storage 10

Two design choices 1. Force all a transaction s writes to disk before transaction commits? Yes: force policy No: no-force policy 2. May uncommitted transactions writes overwrite committed values on disk? Yes: steal policy No: no-steal policy 11

Performance implications 1. Force all a transaction s writes to disk before transaction commits? Yes: force policy Then slower disk writes appear on the critical path of a committing transaction 2. May uncommitted transactions writes overwrite committed values on disk? Then buffer manager loses write scheduling flexibility No: no-steal policy 12

Undo & redo 1. Force all a transaction s writes to disk before transaction commits? Choose no: no-force policy Need support for redo: complete a committed transaction s writes on disk 2. May uncommitted transactions writes overwrite committed values on disk? Choose yes: steal policy Need support for undo: removing the effects of an uncommitted transaction on disk 13

How to implement undo & redo? Log: A sequential file that stores information about transactions and system state Resides in separate, non-volatile storage One entry in the log for each update, commit, abort operation: called a log record Log record contains: Monotonic-increasing log sequence number (LSN) Old value (before image) of the item for undo New value (after image) of the item for redo 14

System structure Buffer pool (volatile memory) and disk (non-volatile) The log resides on a separate partition or disk (in nonvolatile storage) Buffer pool Buffer manager Page Disk Log Non-volatile storage 15

Write-ahead Logging (WAL) Ensures atomicity in the event of system crashes under no-force/steal buffer management 1. Force all log records pertaining to an updated page into the (non-volatile) log before any writes to page itself 2. A transaction is not considered committed until all its log records (including commit record) are forced into the log 16

WAL example force_log_entry(a, old=$100, new=$90) force_log_entry(b, old=$100, new=$110) write(a, $90) write(b, $110) force_log_entry(commit) Does not have to flush to disk What if the commit log record size > the page size? How to ensure each log record is written atomically? Write a checksum of entire log entry 17

Goal #2: Concurrency control Transaction isolation 18

Two concurrent transactions transaction sum(a, B): begin_tx a ß read(a) b ß read(b) print a + b commit_tx transaction transfer(a, B): begin_tx a ß read(a) if a < 10 then abort_tx else write(a, a 10) b ß read(b) write(b, b+10) commit_tx 19

Isolation between transactions Isolation: sum appears to happen either completely before or completely after transfer Sometimes called before-after atomicity Schedule for transactions is an ordering of the operations performed by those transactions 20

Problem for concurrent execution: Inconsistent retrieval Serial execution of transactions transfer then sum: debit transfer: r A w A r B w B sum: r A r B Concurrent execution resulting in inconsistent retrieval, result differing from any serial execution: debit credit credit transfer: r A w A r B w B sum: r A r B Time à = commit 21

Isolation between transactions Isolation: sum appears to happen either completely before or completely after transfer Sometimes called before-after atomicity Given a schedule of operations: Is that schedule in some way equivalent to a serial execution of transactions? 22

Equivalence of schedules Two operations from different transactions are conflicting if: 1. They read and write to the same data item 2. The write and write to the same data item Two schedules are equivalent if: 1. They contain the same transactions and operations 2. They order all conflicting operations of non-aborting transactions in the same way 23

Conflict serializability Ideal isolation semantics: conflict serializability A schedule is conflict serializable if it is equivalent to some serial schedule i.e., non-conflicting operations can be reordered to get a serial schedule 24

A serializable schedule Ideal isolation semantics: conflict serializability A schedule is conflict serializable if it is equivalent to some serial schedule i.e., non-conflicting operations can be reordered to get a serial schedule transfer: r A w A r B w B sum: r A r B r A Serial schedule Conflict-free! Time à = commit 25

A non-serializable schedule Ideal isolation semantics: conflict serializability A schedule is conflict serializable if it is equivalent to some serial schedule i.e., non-conflicting operations can be reordered to get a serial schedule transfer: r A w A r B w B sum: r A r B But in a serial schedule, sum s reads either both before w A or both after w B Conflicting ops Conflicting ops Time à = commit 26

Testing for serializability Each node t in the precedence graph represents a transaction t Edge from s to t if some action of s precedes and conflicts with some action of t 27

Serializable schedule, acyclic graph Each node t in the precedence graph represents a transaction t Edge from s to t if some action of s precedes and conflicts with some action of t transfer: r A w A r B w B sum: r A r B Serializable transfer sum Time à = commit 28

Non-serializable schedule, cyclic graph Each node t in the precedence graph represents a transaction t Edge from s to t if some action of s precedes and conflicts with some action of t transfer: r A w A r B w B sum: r A r B transfer sum Non-serializable Time à = commit 29

Testing for serializability Each node t in the precedence graph represents a transaction t Edge from s to t if some action of s precedes and conflicts with some action of t In general, a schedule is conflict-serializable if and only if its precedence graph is acyclic 30

How to ensure a serializable schedule? Locking-based approaches Strawman 1: Big Global Lock Acquire the lock when transaction starts Release the lock when transaction ends Results in a serial transaction schedule at the cost of performance 31

Locking Locks maintained by transaction manager Transaction requests lock for a data item Transaction manager grants or denies lock Lock types Shared: Need to have before read object Exclusive: Need to have before write object Shared (S) Exclusive (X) Shared (S) Yes No Exclusive (X) No No 32

How to ensure a serializable schedule? Strawman 2: Grab locks independently, for each data item (e.g., bank accounts A and B) transfer: A r A w A A B r B w B B sum: A r A A B r B B Permits this non-serializable interleaving Time à = commit / = exclusive- / Shared-lock; / = X- / S-unlock 33

Two-phase locking (2PL) 2PL rule: Once a transaction has released a lock it is not allowed to obtain any other locks A growing phase when transaction acquires locks A shrinking phase when transaction releases locks In practice: Growing phase is the entire transaction Shrinking phase is during commit 34

2PL allows only serializable schedules 2PL rule: Once a transaction has released a lock it is not allowed to obtain any other locks transfer: A r A w A A B r B w B B sum: A r A A B r B B 2PL precludes this non-serializable interleaving Time à = commit / = X- / S-lock; / = X- / S-unlock 35

2PL and transaction concurrency 2PL rule: Once a transaction has released a lock it is not allowed to obtain any other locks transfer: A r A A w A B r B B w B sum: A r A B r B 2PL permits this serializable, interleaved schedule Time à = commit / = X- / S-lock; / = X- / S-unlock; = release all locks 36

2PL doesn t exploit all opportunities for concurrency 2PL rule: Once a transaction has released a lock it is not allowed to obtain any other locks transfer: r A w A r B w B sum: r A r B 2PL precludes this serializable, interleaved schedule Time à = commit (locking not shown) 37

Issues with 2PL What if a lock is unavailable? Is deadlock possible? Yes; but a central controller can detect deadlock cycles and abort involved transactions The phantom problem Database has fancier ops than key-value store T1: begin_tx; update employee (set salary = 1.1 salary) where dept = CS ; commit_tx T2: insert into employee ( carol, CS ) Even if they lock individual data items, could result in nonserializable execution 38

Serializability versus linearizability Linearizability: a guarantee about single operations on single objects Once write completes, all later reads (by wall clock) should reflect that write Serializability is a guarantee about transactions over one or more objects Doesn t impose realtime constraints Linearizability + serializability = strict serializability Transaction behavior equivalent to some serial execution And that serial execution agrees with real-time 39

Today 1. Techniques for achieving ACID properties Write-ahead logging and check-pointing à A,D Serializability and two-phase locking à I 2. Algorithms for Recovery and Isolation Exploiting Semantics (ARIES) 40

ARIES (Mohan, 1992) In IBM DB2 & MSFT SQL Server, gold standard Key ideas: 1. Refinement of WAL (steal/no-force buffer management policy) 2. Repeating history after restart due to a crash (redo) 3. Log every change, even undo operations during crash recovery Helps for repeated crash/restarts 41

ARIES stable storage data structures Log, composed of log records, each containing: LSN: Log sequence number (strictly increasing) prevlsn: Pointer to the previous log record for the same transaction A linked list for each transaction, threaded through the log Pages pagelsn: Uniquely identifies the log record for the latest update applied to this page 42

ARIES in-memory data structures Transaction table (T-table): one entry per transaction Transaction identifier Transaction status (running, committed, aborted) lastlsn: LSN of the most recent log record written by the transaction Dirty page table: one entry per page Page identifier recoverylsn: LSN of log record for earliest change to that page not on disk 43

Transaction commit 1. Write log records associated with this transaction to the log as transaction executes 2. Write end log record to the log This is the actual commit point Often also defines the equivalent serial order too 44

Checkpoint Happens while other transactions are running, as a separate transaction Does not flush dirty pages to disk Does tell us how much to fix on crash 1. Write begin checkpoint to log 2. Write current transaction table, dirty page table, and end checkpoint to log 3. Force log to non-volatile storage 4. Store begin checkpoint LSN à master record 45

Log: T2: Crash recovery: Phase 1 (Analysis) firstlsn T1: Earliest recoverylsn Checkpoint T3: T4: 1. Start with checkpointed T- & dirty page-tables 2. Read log forward from checkpoint, updating tables For end entries, remove T from T-table (T1, T3) For other log entries, add (T3, T4) or update T-table Add LSN to dirty page table s recoverylsn end end 46

Crash recovery: Phase 2 (REDO) Log: firstlsn T2: T1: T3: T4: Start at firstlsn, scan log entries forward in time Reapply action, update pagelsn Database state now matches state as recorded by log at the time of crash 47

Crash recovery: Phase 3 (UNDO) Log: firstlsn T2: T1: T3: T4: Scan log entries backwards from the end. For updates: Write compensation log record (CLR) to log Contains prevlsn for update: UndoNextLSN Undo the update s operation 48

Crash recovery: Phase 3 (UNDO) Log: firstlsn T2: T1: T3: T4: Scan log entries backwards from the end. For CLRs: If UndoNextLSN = null, write end record Undo for that transaction is done Else, skip to UndoNextLSN for processing Turned the undo into a redo, done in Phase 2 49

Log (time ) Write page 1 Write page 1 Write page 1 CLR FOR LSN 30 CLR FOR LSN 20 CLR FOR LSN 10 LSN: 10 20 30 Restart Undo Undo Restart Undo 40 50 60 50

ARIES: Concluding thoughts Brings together all the concepts we ve discussed for ACID, concurrent transactions Introduced redo for repeating history, novel undo logging for repeated crashes For the interested: Compare with System R (not discussed in this class) 51

OCC: What if access patterns rarely, if ever, conflict? 52

Serializability Execution of a set of transactions over multiple items is equivalent to some serial execution of the same set of transactions 53

Lock-based concurrency control Big Global Lock: Results in a serial transaction schedule at the cost of performance Two-phase locking with finer-grain locks: Growing phase when txn acquires locks Shrinking phase when txn releases locks (typically commit) Allows txn to execute concurrently, improving performance 54

Be optimistic! Goal: Low overhead for non-conflicting txns Assume success! Process transaction as if would succeed Check for serializability only at commit time If fails, abort transaction Optimistic Concurrency Control (OCC) Higher performance when few conflicts vs. locking Lower performance when many conflicts vs. locking 55

2PL & OCC = strict serialization Provides semantics as if only one transaction was running on DB at time, in serial order + Real-time guarantees 2PL: Pessimistically get all the locks first OCC: Optimistically create copies, but then recheck all read + written items before commit 56

OCC: Three-phase approach Active phase: Txn can read values of committed data items Txn caches updates until validation phase Validation phase Ensure reads are still up-to-date and no write-write conflicts If validates, transaction s updates applied to DB Otherwise, transaction restarted Validation order is typically serial; defines the order for serializability Commit phase or Finished 57

OCC: Why validation is necessary txn coord O New txn reads into cache P and Q P and Q s copies at inconsistent state When commit txn updates, new versions overwrite old P Q txn coord 58

Validation in a Nutshell Phases: active, validating, finished To validate T: If RS(T) WS(U) for U validated while T active, then abort T U is committing ahead of T and it wrote into T s read set If WS(T) WS(U) for U validated while T validating, then abort T U is committing ahead of T and it wrote into T s write set 59

Validation Example 1D W m 3D W N Y 1D W :Z* N 3D W Z* L ± ) - q U k &W & 3 k [W; OW& 5 k G F 1D W i n 1D W n 3D K : Z* L ± 3D W N* A ± 60

Tradeoffs 2PL Simple implementation Fast under low contention Long stalls, blocking when contention increases, but ok Moderate bookkeeping costs in normal case Deadlocks Optimistic Great when low contention, collapses under high Low overhead No stalls/blocking inside database Starvation

Compromise: Timestamp-based CC Simple approach Allocate a timestamp to each tx when it starts Ensure that txns appear to execute in that order For enforcement tag values with timestamps Popular now for in-memory databases Still optimistic in sense that it aborts on conflicts More pessimistic since it protects reads Doesn t require validation * 62

Timestamp Management Let RT(X) be the read timestamp of object X Let WT(X) be the write timestamp of object X Let c(x) = true, if tx last wrote X known committed Global, strictly increasing timestamp on begin On read check If TS(T) WT(X), then read is in serial order If c(x), allow read and if TS(T) > RT(X) let RT(X) = TS(T) If!c(X),... 63

Timestamp Management On write check If TS(T) RT(X) and TS(T) WT(X), then write is in serial order Update X WT(X) = TS(T) Set c(x) = false If TS(T) RT(X) but TS(T) < WT(X), then write is in serial order, but a later write exists If c(x), ignore T on X, if!c(x)... danger If TS(T) < RT(X), abort On commit, mark all records X in T as c(x) 64

Read too late Y ] & V - WO V - &W & Y &W & 65

Write too late Y WO V - ] & V i i 0 ' - &W & Y &W & 66

Dirty Reads Y ] & V - WO V HmO Y &W & - &W & Y W-K & 67

Thomas Write Rule & Pitfalls Y ] & V - ] & V - V ~o) - &W & Y &W & - /KY Y & Y W-K & 68

Timestamp-based CC Example '< '< '4 Z R L JNN pgm pxg wtkm 6Tkm wtkm 6Tkm wtkm 6Tkm EnR - k HZ'C wtkpgm! h#x* wtkpxg V< nr - 6Tkumm &h/d lf 6Tkumm =k HL'f -K &f =! Z'2 ' w &K NN 69

Can we do better? - -k -e -l Z pgm umm pxg uug wtkm 6Tkm HZ' wtkpgm = HZ' 6Tkpgm k HZ' s UU &K KK =k HZ' 6Tkumm e HZ' -K & ^HZ' wtkuug 70

Can we do better? - -k -e -l Z pgm umm pxg uug wtkm 6Tkm HZ' wtkpgm = HZ' 6Tkpgm k HZ' s UU &K KK =k HZ' 6Tkumm e HZ' -K & ^HZ' wtkuug A@0 A@150 A@200 71

Multi-version concurrency control Generalize use of multiple versions of objects 72

Multi-version concurrency control Maintain multiple versions of objects, each with own timestamp. Deliver correct version to reads. Reduces read/write conflicts Can also use for weaker isolation models: Snapshot Isolation Unlike 2PL/OCC, reads never rejected Occasionally run garbage collection to clean up No version of a record can be seen that is earlier than the first version before the oldest active transaction 73

Timestamps in MVCC Transactions are assigned timestamps, which may get assigned to versions of records txns read/write Every object version O V has both read and write TS ReadTS: Largest timestamp of txn that read O V WriteTS: Timestamp of txn that wrote O V 74

Executing transaction T in MVCC Find version of object O to read: # Determine the last version written before tx timestamp Find O V s.t. max { WriteTS(O V ) WriteTS(O V ) <= TS(T) } ReadTS(O V ) = max(ts(t), ReadTS(O V )) Return O V to T wt k vm " C A " 3 gm a 3pmm && YZ& &K ] & -< & WG W/& KG ] &F & Y &WYZ :m Perform write of object O or abort if conflicting: Find O V s.t. max { WriteTS(O V ) WriteTS(O V ) <= TS(T) } # Abort if another T exists and has read O after T If ReadTS(O V ) > TS(T) Abort and roll-back T Else Create new version O W Set ReadTS(O W ) = WriteTS(O W ) = TS(T) 75

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 write(o) by TS=3 O 76

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 W(1) = 3 R(1) = 3 write(o) by TS=5 O 77

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 O write(o) by TS = 4 W(1) = 3 R(1) = 3 W(2) = 5 R(2) = 5 Find v such that max WriteTS(v) <= (TS = 4) Þ v = 1 has (WriteTS = 3) <= 4 If ReadTS(1) > 4, abort Þ 3 > 4: false Otherwise, write object 78

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 W(1) = 3 R(1) = 3 W(3) = 4 R(3) = 4 W(2) = 5 R(2) = 5 O Find v such that max WriteTS(v) <= (TS = 4) Þ v = 1 has (WriteTS = 3) <= 4 If ReadTS(1) > 4, abort Þ 3 > 4: false Otherwise, write object 79

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 W(1) = 3 R(1) = 3 R(1) = 5 O BEGIN Transaction tmp = READ(O) WRITE (O, tmp + 1) END Transaction Find v such that max WriteTS(v) <= (TS = 5) Þ v = 1 has (WriteTS = 3) <= 5 Set R(1) = max(5, R(1)) = 5 80

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 O W(1) = 3 R(1) = 3 R(1) = 5 BEGIN Transaction tmp = READ(O) WRITE (O, tmp + 1) END Transaction W(2) = 5 R(2) = 5 Find v such that max WriteTS(v) <= (TS = 5) Þ v = 1 has (WriteTS = 3) <= 5 If ReadTS(1) > 5, abort Þ 5 > 5: false Otherwise, write object 81

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 O write(o) by TS = 4 W(1) = 3 R(1) = 35 W(2) = 5 R(2) = 5 Find v such that max WriteTS(v) <= (TS = 4) Þ v = 1 has (WriteTS = 3) <= 4 If ReadTS(1) > 4, abort Þ 5 > 4: true 82

Digging deeper Notation txn txn txn W(1) = 3: Write creates version 1 with WriteTS = 3 TS = 3 TS = 4 TS = 5 R(1) = 3: Read of version 1 returns timestamp 3 O W(1) = 3 R(1) = 3 5 BEGIN Transaction tmp = READ(O) WRITE (P, tmp + 1) END Transaction W(2) = 5 R(2) = 5 Find v such that max WriteTS(v) <= (TS = 4) Þ v = 1 has (WriteTS = 3) <= 4 Set R(1) = max(4, R(1)) = 5 Read on O succeeds in the past, write on P doesn t conflict, so commit ok 83

Snapshot Isolation (SI) Most commercial DBMS give up serializability to reduce aborts and CC overhead Simple relaxation: choose timestamp read from committed txns < timestamp no need to protect/validate read set in MVCC but read at a single timestamp Split transaction into read set and write set All reads execute as if one snapshot All writes execute as if one later snapshot Yields snapshot isolation < serializability 84

Serializability vs. Snapshot isolation Intuition: Bag of marbles: ½ white, ½ black Transactions: T1: Change all white marbles to black marbles T2: Change all black marbles to white marbles Serializability (2PL, OCC) T1 T2 or T2 T1 In either case, bag is either ALL white or ALL black Snapshot isolation (MVCC) T1 T2 or T2 T1 or T1 T2 Bag is ALL white, ALL black, or ½ white ½ black 85