Multi-version concurrency control

Transcription:

MVCC and Distributed Txns (Spanner)
COS 518: Advanced Computer Systems, Lecture 6
Michael Freedman

2PL & OCC = strict serializability
Provides semantics as if only one transaction were running on the DB at a time, in serial order, plus real-time guarantees.
2PL: pessimistically get all the locks first.
OCC: optimistically create copies, but then recheck all read and written items before commit.

Multi-version concurrency control
Maintain multiple versions of objects, each with its own timestamp. Allocate the correct version to reads.
Prior example of MVCC:
Generalize the use of multiple versions of objects.

Multi-version concurrency control
Maintain multiple versions of objects, each with its own timestamp. Allocate the correct version to reads.
Unlike 2PL/OCC, reads are never rejected.
Occasionally run garbage collection to clean up old versions.

MVCC intuition
Split a transaction into its read set and write set.
All reads execute as if from one snapshot; all writes execute as if in one later snapshot.
Yields snapshot isolation, which is weaker than serializability.

Serializability vs. snapshot isolation
Intuition: a bag of marbles, ½ white and ½ black.
Transactions:
T1: change all white marbles to black marbles.
T2: change all black marbles to white marbles.
Serializability (2PL, OCC): T1 then T2, or T2 then T1. In either case, the bag ends up all white or all black.
Snapshot isolation (MVCC): T1 then T2, or T2 then T1, or T1 and T2 concurrently. The bag ends up all white, all black, or ½ white and ½ black.

Timestamps in MVCC
Transactions are assigned timestamps, which may get assigned to the objects those transactions read or write.
Every object version V has both a read and a write timestamp:
ReadTS(V): largest timestamp of a transaction that has read V.
WriteTS(V): timestamp of the transaction that wrote V.

Executing transaction T in MVCC
Find the version of an object O to read:
# Determine the last version written before the read snapshot time
Find V such that WriteTS(V) = max { WriteTS(V') : WriteTS(V') <= TS(T) }
Set ReadTS(V) = max(TS(T), ReadTS(V))
Return V to T

Perform a write of object O, or abort if conflicting:
Find V such that WriteTS(V) = max { WriteTS(V') : WriteTS(V') <= TS(T) }
# Abort if another transaction has read V with a timestamp later than T's
If ReadTS(V) > TS(T): abort and roll back T
Else: create a new version V_new and set ReadTS(V_new) = WriteTS(V_new) = TS(T)

Worked example: writes to an object O
write(O) by a transaction with TS=3: creates version O1 with WriteTS(O1) = ReadTS(O1) = 3.
write(O) by a transaction with TS=5: creates version O2 with WriteTS(O2) = ReadTS(O2) = 5.
write(O) by a transaction with TS=4:
Find V with the largest WriteTS <= 4 ⇒ V = O1 (WriteTS = 3).
If ReadTS(O1) > 4, abort ⇒ 3 > 4 is false.
Otherwise, write the object.
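
To make the version-selection and conflict rules above concrete, here is a minimal single-node sketch in Python. The names (Version, MVCCStore) and structure are illustrative assumptions, not the lecture's code; it implements only the ReadTS/WriteTS bookkeeping described above.

    # Minimal single-node MVCC sketch following the rules above.
    class Version:
        def __init__(self, value, ts):
            self.value = value
            self.write_ts = ts   # WriteTS: timestamp of the txn that wrote this version
            self.read_ts = ts    # ReadTS: largest timestamp of a txn that has read it

    class MVCCStore:
        def __init__(self):
            self.versions = {}   # key -> list of Versions

        def _visible(self, key, ts):
            # Version with the largest WriteTS that is <= the transaction's timestamp
            candidates = [v for v in self.versions.get(key, []) if v.write_ts <= ts]
            return max(candidates, key=lambda v: v.write_ts) if candidates else None

        def read(self, key, ts):
            v = self._visible(key, ts)
            if v is None:
                return None
            v.read_ts = max(v.read_ts, ts)   # remember the latest reader; reads never rejected
            return v.value

        def write(self, key, value, ts):
            v = self._visible(key, ts)
            # Abort if a later transaction has already read the version this write follows
            if v is not None and v.read_ts > ts:
                raise RuntimeError(f"abort txn ts={ts}: version already read at ts={v.read_ts}")
            self.versions.setdefault(key, []).append(Version(value, ts))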

After this write, the versions of O are O1 (WriteTS=3, ReadTS=3), O3 (WriteTS=4, ReadTS=4), and O2 (WriteTS=5, ReadTS=5).

Second worked example, starting from a single version O1 with WriteTS(O1) = ReadTS(O1) = 3.
A transaction with TS=5 runs:
BEGIN Transaction; tmp = READ(O); WRITE(O, tmp + 1); END Transaction
The read finds V = O1 (the largest WriteTS <= 5) and sets ReadTS(O1) = max(5, ReadTS(O1)) = 5.
The write again finds V = O1. If ReadTS(O1) > 5, abort ⇒ 5 > 5 is false, so it creates O2 with WriteTS(O2) = ReadTS(O2) = 5.

Now write(O) by a transaction with TS=4:
Find V with the largest WriteTS <= 4 ⇒ V = O1.
If ReadTS(O1) > 4, abort ⇒ 5 > 4 is true, so the transaction aborts: a later transaction (TS=5) has already read the state this write would change.
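
Assuming the hypothetical MVCCStore sketch above, the second walkthrough can be replayed directly; the TS=4 write aborts exactly as in the slides.

    store = MVCCStore()
    store.write("O", "v0", ts=3)    # O1: WriteTS=3, ReadTS=3
    print(store.read("O", ts=5))    # read at TS=5 bumps ReadTS(O1) to 5
    store.write("O", "v1", ts=5)    # O2: WriteTS=5, ReadTS=5
    store.write("O", "v2", ts=4)    # raises: ReadTS(O1)=5 > 4, so this write aborts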

By contrast, consider a transaction with TS=4 that reads O but writes a different object P:
BEGIN Transaction; tmp = READ(O); WRITE(P, tmp + 1); END Transaction
The read finds V = O1 (the largest WriteTS <= 4) and sets ReadTS(O1) = max(4, ReadTS(O1)) = 5 (unchanged).
The write is on P, not O, so it does not conflict, and the write on P succeeds as well.

Distributed Transactions

Consider data partitioned over servers (partitions P, Q, R, ... spread across machines).
Why not just use 2PL? How do you get serializability?
Grab locks over the entire read and write set.
On a single machine: perform the writes, issue a single COMMIT op in the WAL, then release the locks (at commit time); see the sketch below.
In a distributed setting, assign a global timestamp to the transaction (at some time after lock acquisition and before commit), using either a centralized transaction manager or distributed consensus on the timestamp (not on all ops).
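
As a rough illustration of the single-machine 2PL flow just described, here is a minimal Python sketch; the lock manager, WAL, and deadlock avoidance by lock ordering are assumptions made for the example, not part of the lecture.

    import threading

    # Hypothetical single-machine 2PL flow: lock, write, one COMMIT record, unlock.
    class TwoPLStore:
        def __init__(self):
            self.locks = {}   # key -> threading.Lock (no deadlock detection here)
            self.data = {}
            self.wal = []     # append-only write-ahead log

        def run_txn(self, read_set, writes):
            keys = sorted(set(read_set) | set(writes))   # fixed order avoids deadlock
            for k in keys:                               # growing phase: take all locks
                self.locks.setdefault(k, threading.Lock()).acquire()
            try:
                snapshot = {k: self.data.get(k) for k in read_set}
                for k, v in writes.items():              # perform the writes
                    self.wal.append(("write", k, v))
                    self.data[k] = v
                self.wal.append(("COMMIT",))             # single commit point in the WAL
                return snapshot
            finally:
                for k in keys:                           # release locks at commit time
                    self.locks[k].release()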

Strawman: consensus per group?
A single Lamport clock, with consensus per partition group? Linearizability composes!
But this doesn't solve the problem of concurrent, non-overlapping transactions.

Spanner: Google's Globally-Distributed Database (OSDI 2012)

Google's setting
Dozens of zones (datacenters).
Per zone, 100-1000s of servers.
Per server, 100-1000 partitions (tablets).
Every tablet replicated for fault tolerance (e.g., 5x).

Scale-out vs. fault tolerance
Every tablet is replicated via Paxos (with leader election).
So every operation within a transaction across tablets is actually a replicated operation within a Paxos RSM.
Paxos groups can stretch across datacenters! (COPS took the same approach within a datacenter.)

TrueTime
Disruptive idea: do clocks really need to be arbitrarily unsynchronized? Can you engineer some maximum divergence?

Global wall-clock time with bounded uncertainty: TT.now() returns an interval [earliest, latest] of width 2ε.
Consider the event e_now that invoked tt = TT.now().
Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest

Timestamps and TrueTime
While holding the transaction's locks, pick commit timestamp s > TT.now().latest.
Commit wait: wait until TT.now().earliest > s, then release the locks.
The commit wait takes about ε on average.

Commit Wait and Replication
While holding locks: pick s, start consensus, achieve consensus, notify the followers; once the commit wait is done, release the locks.
The consensus round and the commit wait each take about ε on average, and they can overlap.
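
A minimal sketch of the commit-wait rule above, assuming a hypothetical truetime_now() stub that returns an (earliest, latest) pair; this is illustrative, not Spanner's actual code.

    import time

    # Hypothetical TrueTime stub: a real implementation derives the uncertainty
    # from GPS / atomic-clock references; here epsilon is simply a fixed bound.
    EPSILON = 0.007  # 7 ms, an assumed value

    def truetime_now():
        t = time.time()
        return (t - EPSILON, t + EPSILON)   # (earliest, latest)

    def commit_with_wait():
        # Pick the commit timestamp s above the latest possible current time.
        s = truetime_now()[1]
        # Commit wait: do not expose the commit (or release locks) until every
        # correct clock is guaranteed to be past s.
        while truetime_now()[0] <= s:
            time.sleep(EPSILON / 10)
        return s   # safe to release locks and report the commit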

Client-driven transactions
The client:
1. Issues reads to the leader of each tablet group, which acquires read locks and returns the most recent data.
2. Locally performs its writes.
3. Chooses a coordinator from the set of leaders and initiates commit.
4. Sends a commit message to each leader, including the identity of the coordinator and the buffered writes.
5. Waits for the commit from the coordinator.

Commit Wait and 2-Phase Commit
On the commit message from the client, the leaders acquire local write locks.
If non-coordinator:
Choose a prepare timestamp greater than all previous local timestamps.
Log the prepare record through Paxos.
Notify the coordinator of the prepare timestamp.
If coordinator:
Wait until hearing from the other participants.
Choose a commit timestamp >= every prepare timestamp and > its local timestamps.
Log the commit record through Paxos.
Wait out the commit-wait period.
Send the commit timestamp to the replicas, the other leaders, and the client.
All apply at the commit timestamp and release their locks.

(Timeline figure: the coordinator T_C and participants T_P1, T_P2 each compute a prepare timestamp s_p, log the prepare record through Paxos, and, once prepared, send s_p to the coordinator, while the client carries out steps 1-4 above.)
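
A minimal sketch of the timestamp bookkeeping in the two-phase commit above, reusing the hypothetical truetime_now() and EPSILON from the previous sketch; roles, messaging, and Paxos logging are reduced to comments and are assumptions made for illustration.

    MIN_TICK = 1e-6   # smallest timestamp increment in this sketch

    def participant_prepare(last_local_ts):
        # Non-coordinator leader: prepare ts must exceed all previous local timestamps.
        s_p = max(last_local_ts + MIN_TICK, truetime_now()[1])
        # paxos_log(("prepare", s_p))   # Paxos logging elided
        return s_p   # notify the coordinator of s_p

    def coordinator_commit(prepare_timestamps, last_local_ts):
        # Commit ts >= every prepare ts, > the coordinator's own local timestamps,
        # and > TT.now().latest (per the earlier timestamp rule).
        s_c = max(max(prepare_timestamps), last_local_ts + MIN_TICK, truetime_now()[1])
        # paxos_log(("commit", s_c))    # Paxos logging elided
        while truetime_now()[0] <= s_c:   # commit wait before exposing the commit
            time.sleep(EPSILON / 10)
        return s_c   # send s_c to replicas, other leaders, and the client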

(Timeline figure, continued: once the participants are prepared and have sent their s_p values, the coordinator T_C computes the overall commit timestamp s_c, logs the commit, finishes the commit wait, notifies the participants of s_c, and everyone releases locks; meanwhile the client, in step 5, waits for the commit from the coordinator.)

Example
One transaction removes X from my friend list and removes me from X's friend list: the participants prepare at s_p = 6 and s_p = 8, and the coordinator commits at s_c = 8.
A second transaction, the risky post P, commits later at s = 15.

Time   My friends   My posts   X's friends
<8     [X]          []         [me]
8      []           []         []
15     []           [P]        []

Because the unfriending commits at 8 and the post at 15, no consistent snapshot shows the risky post while X is still a friend.

Read-only optimizations
Given a global timestamp, read-only transactions can be implemented lock-free (snapshot isolation).
Step 1: Choose timestamp s_read = TT.now().latest.
Step 2: Do a snapshot read (at s_read) at each tablet.
Such reads can be served by any up-to-date replica.

Disruptive idea: do clocks really need to be arbitrarily unsynchronized? Can you engineer some maximum divergence?
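
A minimal sketch of the lock-free read-only transaction described above, reusing the hypothetical truetime_now() and the MVCCStore sketch from earlier; the tablet lookup is reduced to a dictionary and is an assumption made for illustration.

    def read_only_txn(tablets, keys):
        # Step 1: choose the snapshot timestamp at the top of the uncertainty interval.
        s_read = truetime_now()[1]
        # Step 2: snapshot-read each key at s_read; any up-to-date replica may serve it.
        results = {}
        for key in keys:
            replica = tablets[key]           # hypothetical: key -> an MVCCStore replica
            results[key] = replica.read(key, ts=s_read)
        return s_read, results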

TrueTime Architecture
GPS timemasters and an atomic-clock timemaster in each datacenter (Datacenter 1, 2, ..., n).
Clients poll several masters and compute a reference interval: [earliest, latest] = now ± ε.

TrueTime implementation
now = reference now + local-clock offset
ε = reference ε + worst-case local-clock drift = 1 ms + 200 μs/sec
(Figure: masters are polled roughly every 30 seconds; between polls a client's ε grows by about 6 ms at that drift rate, then snaps back.)
What about faulty clocks? Bad CPUs proved 6x more likely than bad clocks in 1 year of empirical data.

Known unknowns > unknown unknowns
Rethink algorithms to reason about uncertainty.

Monday lecture: Caching.
Project proposals due Monday night, 11:59 pm.