Multi-version concurrency control


Allen Ferguson
5 years ago

Spanner Storage insights
CS 518: Advanced Computer Systems
Lecture 6
Michael Freedman

2PL & OCC = strict serialization
  Provides semantics as if only one transaction was running on the DB at a time, in serial order
  + Real-time guarantees
  2PL: Pessimistically get all the locks first
  OCC: Optimistically create copies, but then recheck all read + written items before commit

Multi-version concurrency control
  Generalize use of multiple versions of objects
  Maintain multiple versions of objects, each with own timestamp. Allocate correct version to reads.
  Prior example of MVCC:

Multi-version concurrency control
  Maintain multiple versions of objects, each with own timestamp. Allocate correct version to reads.
  Unlike 2PL/OCC, reads never rejected
  Occasionally run garbage collection to clean up

MVCC Intuition
  Split transaction into read set and write set
  All reads execute as if one "snapshot"
  All writes execute as if one later "snapshot"
  Yields snapshot isolation < serializability

Serializability vs. Snapshot isolation
  Intuition: Bag of marbles: ½ white, ½ black
  Transactions:
    T1: Change all white marbles to black marbles
    T2: Change all black marbles to white marbles
  Serializability (2PL, OCC)
    T1 → T2 or T2 → T1
    In either case, bag is either ALL white or ALL black
  Snapshot isolation (MVCC)
    T1 → T2 or T2 → T1 or T1 || T2
    Bag is ALL white, ALL black, or ½ white ½ black

Timestamps in MVCC
  Transactions are assigned timestamps, which may get assigned to objects those txns read/write
  Every object version O_V has both read and write TS
    ReadTS: Largest timestamp of txn that reads O_V
    WriteTS: Timestamp of txn that wrote O_V
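The marble intuition can be played out in a few lines. The following is only an illustrative sketch: the function names (`t1`, `t2`, `run_serial`, `run_snapshot`) and the four-marble bag are assumptions made for the example, not part of the lecture. Each transaction reads a snapshot of the bag and returns a write set; under snapshot isolation both transactions read the same snapshot, and because their write sets are disjoint both commit.

```python
def t1(bag):
    """T1: change all white marbles to black. Returns a write set {index: color}."""
    return {i: "black" for i, color in enumerate(bag) if color == "white"}


def t2(bag):
    """T2: change all black marbles to white. Returns a write set {index: color}."""
    return {i: "white" for i, color in enumerate(bag) if color == "black"}


def apply_writes(bag, writes):
    """Return a new bag with the write set applied."""
    out = list(bag)
    for i, color in writes.items():
        out[i] = color
    return out


def run_serial(bag, first, second):
    """Serializable execution: the second transaction sees the first one's writes."""
    bag = apply_writes(bag, first(bag))
    return apply_writes(bag, second(bag))


def run_snapshot(bag, a, b):
    """Snapshot isolation: both transactions read the same snapshot.

    Their write sets do not overlap, so neither is rejected.
    """
    writes_a, writes_b = a(bag), b(bag)
    assert not (writes_a.keys() & writes_b.keys()), "write-write conflict"
    return apply_writes(apply_writes(bag, writes_a), writes_b)


bag = ["white", "white", "black", "black"]
serial_12 = run_serial(bag, t1, t2)     # T1 then T2
serial_21 = run_serial(bag, t2, t1)     # T2 then T1
concurrent = run_snapshot(bag, t1, t2)  # T1 || T2
print(serial_12, serial_21, concurrent)
```

The two serial orders leave the bag a single color, while the concurrent run simply swaps the colors and leaves the bag half white, half black: an outcome that no serial order produces.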

Executing transaction T in MVCC

  Find version of object O to read:
    # Determine the last version written before read snapshot time
    Find O_V s.t. max { WriteTS(O_V) | WriteTS(O_V) <= TS(T) }
    ReadTS(O_V) = max(TS(T), ReadTS(O_V))
    Return O_V to T

  Perform write of object O or abort if conflicting:
    Find O_V s.t. max { WriteTS(O_V) | WriteTS(O_V) <= TS(T) }
    # Abort if another T' exists and has read O after T
    If ReadTS(O_V) > TS(T)
      Abort and roll-back T
    Else
      Create new version O_W
      Set ReadTS(O_W) = WriteTS(O_W) = TS(T)

Example, where W(v) and R(v) denote the WriteTS and ReadTS of version v of object O:

  write(O) by TS=3

  State: W(1) = 3, R(1) = 3
  write(O) by TS=5

  State: W(1) = 3, R(1) = 3; W(2) = 5, R(2) = 5
  write(O) by TS=4
    Find v such that max WriteTS(v) <= (TS = 4)
      ⇒ v = 1 has (WriteTS = 3) <= 4
    If ReadTS(1) > 4, abort
      ⇒ 3 > 4: false
    Otherwise, write object

  State after the write(O) by TS=4: W(1) = 3, R(1) = 3; W(3) = 4, R(3) = 4; W(2) = 5, R(2) = 5

  Transaction with TS=5:
    BEGIN Transaction
      tmp = READ(O)
      WRITE(O, tmp + 1)
    END Transaction

  State: W(1) = 3, R(1) = 3
  READ(O) by TS=5
    Find v such that max WriteTS(v) <= (TS = 5)
      ⇒ v = 1 has (WriteTS = 3) <= 5
    Set R(1) = max(5, R(1)) = 5

  State: W(1) = 3, R(1) = 5
  WRITE(O, tmp + 1) by TS=5
    Find v such that max WriteTS(v) <= (TS = 5)
      ⇒ v = 1 has (WriteTS = 3) <= 5
    If ReadTS(1) > 5, abort
      ⇒ 5 > 5: false
    Otherwise, write object, giving W(2) = 5, R(2) = 5

  State: W(1) = 3, R(1) = 5; W(2) = 5, R(2) = 5
  write(O) by TS=4
    Find v such that max WriteTS(v) <= (TS = 4)
      ⇒ v = 1 has (WriteTS = 3) <= 4
    If ReadTS(1) > 4, abort
      ⇒ 5 > 4: true, so the write aborts
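The read and write rules traced above can be sketched as a small single-object store. This is a minimal illustration, not the lecture's implementation: the names `Version`, `MVCCObject`, and `Abort` are hypothetical, timestamps are plain integers, and there is no garbage collection or rollback of other state.

```python
from dataclasses import dataclass


class Abort(Exception):
    """Raised when a write conflicts with a later read."""


@dataclass
class Version:
    value: object
    write_ts: int  # timestamp of the transaction that wrote this version
    read_ts: int   # largest timestamp of any transaction that read it


class MVCCObject:
    def __init__(self, value, ts):
        # The initial write creates a version with ReadTS = WriteTS = ts.
        self.versions = [Version(value, ts, ts)]

    def _latest_before(self, ts):
        # Version with the largest WriteTS that is <= ts.
        candidates = [v for v in self.versions if v.write_ts <= ts]
        if not candidates:
            raise Abort(f"no version visible at ts={ts}")
        return max(candidates, key=lambda v: v.write_ts)

    def read(self, ts):
        v = self._latest_before(ts)
        v.read_ts = max(ts, v.read_ts)
        return v.value

    def write(self, value, ts):
        v = self._latest_before(ts)
        if v.read_ts > ts:
            # A transaction with a later timestamp already read this version.
            raise Abort(f"version read at ts={v.read_ts} > {ts}")
        self.versions.append(Version(value, ts, ts))


# First trace: writes by TS=3, TS=5, then TS=4 all succeed.
obj = MVCCObject(10, ts=3)
obj.write(20, ts=5)
obj.write(15, ts=4)

# Second trace: TS=5 reads then writes; a later write by TS=4 must abort.
obj2 = MVCCObject(10, ts=3)
tmp = obj2.read(ts=5)
obj2.write(tmp + 1, ts=5)
try:
    obj2.write(99, ts=4)
    aborted = False
except Abort:
    aborted = True
print(aborted)
```

Note that reads never fail once some version is visible; only writes are checked against the ReadTS of the version they would supersede.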

  Transaction with TS=4:
    BEGIN Transaction
      tmp = READ(O)
      WRITE(P, tmp + 1)
    END Transaction

  State: W(1) = 3, R(1) = 5; W(2) = 5, R(2) = 5
  READ(O) by TS=4
    Find v such that max WriteTS(v) <= (TS = 4)
      ⇒ v = 1 has (WriteTS = 3) <= 4
    Set R(1) = max(4, R(1)) = 5
    Then write on P succeeds as well

Distributed Transactions

Consider partitioned data over servers
  [Figure: operations on objects partitioned across servers]

  Why not just use 2PL?
    Grab locks over entire read and write set
    Perform writes
    Release locks (at commit time)

  How do you get serializability?
    On single machine, single COMMIT op in the WAL
    In distributed setting, assign global timestamp to txn (at sometime after lock acquisition and before commit)
      Centralized txn manager
      Distributed consensus on timestamp (not all ops)
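The three 2PL steps over partitioned data (grab locks over the entire read and write set, perform writes, release locks at commit time) might look roughly like this. It is a deliberately simplified, single-threaded sketch under several assumptions: `Partition` and `run_txn` are hypothetical names, every lock is exclusive, locks are requested in sorted order, and a transaction that cannot get a lock simply gives up instead of waiting.

```python
class Partition:
    """One server holding a slice of the data plus its own lock table."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.locks = {}  # key -> id of the transaction holding the lock

    def try_lock(self, key, txn):
        # Succeeds if the key is free or already held by this transaction.
        return self.locks.setdefault(key, txn) == txn

    def release(self, txn):
        self.locks = {k: owner for k, owner in self.locks.items() if owner != txn}


def run_txn(txn, partitions, reads, writes):
    """Run one transaction; keys are (partition_name, key) pairs.

    Returns the values read, or None if a lock could not be acquired.
    """
    lock_set = sorted(set(reads) | set(writes))
    try:
        # Phase 1: grab locks over the entire read and write set.
        for part, key in lock_set:
            if not partitions[part].try_lock(key, txn):
                return None
        result = {rk: partitions[rk[0]].data.get(rk[1]) for rk in reads}
        # Perform writes while holding every lock.
        for (part, key), value in writes.items():
            partitions[part].data[key] = value
        return result
    finally:
        # Phase 2: release locks at commit (or abort) time.
        for part in {p for p, _ in lock_set}:
            partitions[part].release(txn)


servers = {"A": Partition("A"), "B": Partition("B")}
servers["A"].data["P"] = 1
ok = run_txn("T1", servers, reads=[("A", "P")], writes={("B", "Q"): 2})

servers["B"].locks["Q"] = "T2"  # pretend T2 currently holds the lock on Q
blocked = run_txn("T3", servers, reads=[("A", "P")], writes={("B", "Q"): 99})
print(ok, blocked)
```

The sketch shows locking only; it says nothing about how the partitions agree on a single commit point, which is the question the slides turn to next.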

Strawman: Consensus per group?
  [Figure: operations on objects partitioned across servers]
  Single Lamport clock, consensus per group?
    Linearizability composes!
    But doesn't solve the problem of concurrent, non-overlapping transactions

Spanner: Google's Globally-Distributed Database
  OSDI 2012

Google's Setting
  Dozens of zones (datacenters)
  Per zone, many servers
  Per server, many partitions (tablets)
  Every tablet replicated for fault-tolerance (e.g., 5x)

Scale-out vs. fault tolerance
  [Figure: tablets P and Q, each replicated several times]
  Every tablet replicated via Paxos (with leader election)
  So every "operation" within transactions across tablets is actually a replicated operation within a Paxos RSM
  Paxos groups can stretch across datacenters!
    (COPS took same approach within datacenter)

TrueTime
  Disruptive idea:
    Do clocks really need to be arbitrarily unsynchronized?
    Can you engineer some max divergence?

TrueTime
  "Global wall-clock time" with bounded uncertainty
  [Figure: time axis on which TT.now() is an interval from earliest to latest, of width 2*ε]
  Consider event e_now which invoked tt = TT.now():
    Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest

Timestamps and TrueTime
  Acquired locks
  Pick s > TT.now().latest
  Commit wait: wait until TT.now().earliest > s
  Release locks
  [Figure: timeline of transaction T; the intervals on either side of s are each labeled "average ε"]

Commit Wait and Replication
  [Figure: timeline of transaction T with the events: Acquired locks, Pick s, Start consensus, Achieve consensus, Notify followers, Commit wait done, Release locks]
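Commit wait is easy to see with a simulated clock. The sketch below is an assumption-laden toy, not Spanner's API: `SimTrueTime` and `commit_with_wait` are hypothetical names, time is an integer count of microseconds, ε is fixed, and "waiting" just advances the simulated absolute time in fixed steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: int
    latest: int


class SimTrueTime:
    """Simulated TrueTime: absolute time is known only to the simulation."""

    def __init__(self, epsilon_us):
        self.epsilon_us = epsilon_us
        self.true_time_us = 0

    def now(self):
        # The true time is guaranteed to lie inside [earliest, latest].
        return TTInterval(self.true_time_us - self.epsilon_us,
                          self.true_time_us + self.epsilon_us)

    def advance(self, dt_us):
        self.true_time_us += dt_us


def commit_with_wait(tt, step_us=1000):
    """Pick s > TT.now().latest, then wait until TT.now().earliest > s."""
    s = tt.now().latest + 1
    waited_us = 0
    while not tt.now().earliest > s:
        tt.advance(step_us)  # stand-in for sleeping while holding locks
        waited_us += step_us
    return s, waited_us      # locks may be released once this returns


tt = SimTrueTime(epsilon_us=4000)
s, waited_us = commit_with_wait(tt)
print(s, waited_us, tt.true_time_us)
```

With ε = 4 ms the wait comes out slightly above 2ε, and by the time it ends the absolute time has certainly passed s, which is what lets the commit timestamp be treated as a real-time ordering.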

Client-driven transactions
  Client:
    1. Issues reads to leader of each tablet group, which acquires read locks and returns most recent data
    2. Locally performs writes
    3. Chooses coordinator from set of leaders, initiates commit
    4. Sends commit message to each leader, including identity of coordinator and buffered writes
    5. Waits for commit from coordinator

Commit Wait and 2-Phase Commit
  On commit msg from client, leaders acquire local write locks
    If non-coordinator:
      Choose prepare ts > previous local timestamps
      Log prepare record through Paxos
      Notify coordinator of prepare timestamp
    If coordinator:
      Wait until hear from other participants
      Choose commit timestamp >= prepare ts, > local ts
      Log commit record through Paxos
      Wait commit-wait period
      Send commit timestamp to replicas, other leaders, client
    All apply at commit timestamp and release locks

Commit Wait and 2-Phase Commit
  [Figure: timelines of coordinator T_C and participants T_P1, T_P2 with the events: Acquired locks, Start logging, Done logging, Compute s_p for each, Prepared, Send s_p, Compute overall s_c, Commit wait done, Committed, Notify participants of s_c, Release locks]

Example
  T_C: Remove X from friend list (s_p = 6, s_c = 8)
  T_P: Remove myself from X's friend list (s_p = 8, s_c = 8)
  T_2: Risky post P (s = 15)

  Time    My friends    My posts    X's friends
  <8      [X]                       [me]
  8       []                        []
  15                    [P]
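The coordinator's timestamp choice and the friend-list example can be replayed with a tiny versioned table. This is a sketch under stated assumptions: `choose_commit_ts` and `VersionedTable` are hypothetical names, the initial rows are placed at timestamp 0, the coordinator's previous local timestamp is taken to be 5, and Paxos logging and commit wait are left out entirely.

```python
def choose_commit_ts(prepare_timestamps, coordinator_local_ts):
    """Commit timestamp >= every prepare timestamp and > the coordinator's local ts."""
    return max(max(prepare_timestamps), coordinator_local_ts + 1)


class VersionedTable:
    """Keeps every version of each key so reads can be served at any timestamp."""

    def __init__(self):
        self.history = {}  # key -> list of (ts, value), appended in timestamp order

    def apply(self, key, value, ts):
        self.history.setdefault(key, []).append((ts, value))

    def read_at(self, key, ts):
        # Latest value whose write timestamp is <= the read timestamp.
        visible = [value for write_ts, value in self.history.get(key, []) if write_ts <= ts]
        return visible[-1] if visible else None


table = VersionedTable()
table.apply("my_friends", ["X"], 0)
table.apply("x_friends", ["me"], 0)
table.apply("my_posts", [], 0)

# Prepare timestamps 6 and 8 from the example; both removals apply at s_c.
s_c = choose_commit_ts([6, 8], coordinator_local_ts=5)
table.apply("my_friends", [], s_c)
table.apply("x_friends", [], s_c)

# The risky post commits later, at s = 15.
table.apply("my_posts", ["P"], 15)

print(s_c, table.read_at("x_friends", 10), table.read_at("my_posts", 10))
```

Because both removals carry the same commit timestamp, no snapshot can observe one friend list updated without the other, and any snapshot that includes the post P (timestamp 15) also sees X already unfriended.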

Read-only optimizations
  Given global timestamp, can implement read-only transactions lock-free (snapshot isolation)
  Step 1: Choose timestamp s_read = TT.now().latest
  Step 2: Snapshot read (at s_read) to each tablet
    Can be served by any up-to-date replica

TrueTime Architecture
  [Figure: GPS and atomic-clock time masters in Datacenter 1, Datacenter 2, ..., Datacenter n, polled by a client]
  Compute reference [earliest, latest] = now ± ε

TrueTime implementation
  now = reference now + local-clock offset
  ε = reference ε + worst-case local-clock drift (a 1ms term plus a drift term in μs/sec)
  [Figure: ε versus time at 0sec, 30sec, 60sec, 90sec, annotated +6ms]

What about faulty clocks?
  Bad CPUs 6x more likely than bad clocks in 1 year of empirical data
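The uncertainty formula can be turned into a short calculation. Two values here are assumptions rather than something recoverable from these slides: the drift rate of 200 μs/sec (the figure quoted in the Spanner paper) and the 30-second interval between clock synchronizations (suggested by the time axis). The function name `epsilon_us` is likewise made up for the sketch.

```python
REFERENCE_EPSILON_US = 1000  # 1ms reference uncertainty, from the slide
DRIFT_US_PER_SEC = 200       # ASSUMPTION: drift rate taken from the Spanner paper
SYNC_INTERVAL_SEC = 30       # ASSUMPTION: clocks resynchronize every 30 seconds


def epsilon_us(seconds_since_start):
    """Uncertainty bound in microseconds at a given time.

    ε = reference ε + worst-case local-clock drift, where the drift term
    grows linearly since the last synchronization and resets at each sync.
    """
    since_sync = seconds_since_start % SYNC_INTERVAL_SEC
    return REFERENCE_EPSILON_US + DRIFT_US_PER_SEC * since_sync


def worst_case_epsilon_us():
    """Bound just before the next synchronization."""
    return REFERENCE_EPSILON_US + DRIFT_US_PER_SEC * SYNC_INTERVAL_SEC


for t in (0, 15, 29, 30, 45):
    print(t, epsilon_us(t))
print(worst_case_epsilon_us())
```

Under these assumptions ε follows a sawtooth: it starts at 1ms after each synchronization and climbs by up to 6ms before the next one, which is consistent with the +6ms annotation on the slide.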

Known unknowns > unknown unknowns
  Rethink algorithms to reason about uncertainty

The case for log storage: Hardware tech affecting software design

Latency Numbers Every Programmer Should Know
  June 7, 2012

~2016
  Seagate ($50) 1TB HDD 7200RPM, Model: STD1000DM003-1SB10C

  Operation               HDD Performance
  Sequential Read         176 MB/s
  Sequential Write        190 MB/s
  Random Read 4KiB        121 IOPS
  Random Write 4KiB       224 IOPS
  DQ Random Read 4KiB     292 IOPS
  DQ Random Write 4KiB    227 IOPS

~2016
  HDD: Seagate ($50) 1TB HDD 7200RPM, Model: STD1000DM003-1SB10C
  SSD: Samsung ($330) 512 GB 960 Pro NVMe PCIe M.2, Model: MZ-V6P512B

  Operation               HDD Performance    SSD Performance
  Sequential Read         176 MB/s           2268 MB/s
  Sequential Write        190 MB/s           1696 MB/s
  Random Read 4KiB        121 IOPS           10,962 IOPS
  Random Write 4KiB       224 IOPS           151 MB/s, 36,865 IOPS
  DQ Random Read 4KiB     292 IOPS           348 MB/s
  DQ Random Write 4KiB    227 IOPS           399 MB/s, 97,412 IOPS

  Idea: Traditionally disks laid out with spatial locality due to cost of seeks
  Observation: main memory getting bigger → most reads from memory
  Implication: Disk workloads now write-heavy → avoid seeks → write log
  New problem: Many seeks to read, need to occasionally defragment
  New tech solution: SSDs → seeks cheap, erase blocks change defrag
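The log-storage idea (write sequentially, read through memory, occasionally defragment) can be sketched in memory. This is an illustration only: `LogStore` is a hypothetical name, a Python list stands in for the on-disk log, and `compact` is a stand-in for defragmentation that rewrites only the live records.

```python
class LogStore:
    """Append-only log with an in-memory index of the latest record per key."""

    def __init__(self):
        self.log = []    # sequential writes only: list of (key, value) records
        self.index = {}  # key -> position of the newest record for that key

    def put(self, key, value):
        # Every write is an append, so no seek to an old location is needed.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        # The in-memory index points straight at the live record.
        return self.log[self.index[key]][1]

    def compact(self):
        """Drop overwritten records and rebuild the index (defragmentation)."""
        live_positions = sorted(self.index.values())
        new_log = [self.log[pos] for pos in live_positions]
        self.index = {key: i for i, (key, _) in enumerate(new_log)}
        self.log = new_log


store = LogStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)  # supersedes the first record for "a"
size_before = len(store.log)
store.compact()
size_after = len(store.log)
print(size_before, size_after, store.get("a"), store.get("b"))
```

The dead first record for "a" is what compaction reclaims; on an SSD that cleanup has to be planned around erase blocks rather than seek distance.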
