Distributed Transactions


Distributed Transactions CS6450: Distributed Systems Lecture 17 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Franz Kaashoek, and Nickolai Zeldovich. 1

Consider partitioned data over servers
[Diagram: partitions O, P, Q, each running a lock/read/write/unlock sequence]
Why not just use 2PL?
Grab locks over the entire read and write set, perform writes, release locks (at commit time)
Max throughput for a hot item: 1/RTT. 10 txns/s? 100? 1000?
Readers are blocked by writers 2
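
To put rough numbers on that bound (the RTT figures here are assumptions for illustration, not from the slides): if a transaction holds its locks on a hot item for one cross-datacenter round trip of about 100 ms, that item tops out near 1/0.1 s = 10 txns/s; with a 10 ms RTT the cap is about 100 txns/s, and even a 1 ms RTT only reaches about 1000 txns/s.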

Consider partitioned data over servers
[Diagram: partitions O, P, Q, each running a lock/read/write/unlock sequence]
How do you get serializability? Need to generate a total order/timestamp? Hard to escape.
On one machine, this is often the COMMIT record in the WAL, or a timestamp for timestamp-based CC
Distributed: a global timestamp per transaction. A centralized transaction manager? Distributed consensus on the timestamp?
Validation looks like an out, but it's not really... 3

Problems with Partitioning Validation
T1: R(x)=0 W(x)=1
T2: R(x)=0 W(y)=1
T3: R(y)=0 R(x)=1 4


Problems with Partitioning Validation
S1 validates just the x information: T1: R(x)=0 W(x)=1; T2: R(x)=0; T3: R(x)=1. Answer is "yes" (T2 -> T1 -> T3)
S2 validates just the y information: T2: W(y)=1; T3: R(y)=0. Answer is "yes" (T3 -> T2)
But the real answer is "no"!
Q: What went wrong here? 6
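
What went wrong is that each partition only checks its own dependency edges, and each local order is acyclic even though their union is not. A minimal sketch in Go of exactly this example (the toy dependency-graph representation and cycle checker are mine, not from the slides):

```go
package main

import "fmt"

// hasCycle reports whether the directed dependency graph (adjacency lists)
// contains a cycle, using a three-color depth-first search.
func hasCycle(edges map[string][]string) bool {
	color := map[string]int{} // 0 = unvisited, 1 = in progress, 2 = done
	var visit func(string) bool
	visit = func(n string) bool {
		color[n] = 1
		for _, m := range edges[n] {
			if color[m] == 1 || (color[m] == 0 && visit(m)) {
				return true
			}
		}
		color[n] = 2
		return false
	}
	for n := range edges {
		if color[n] == 0 && visit(n) {
			return true
		}
	}
	return false
}

func main() {
	// Edges S1 can see from x alone: T2 -> T1 (T2 read x before T1 wrote it)
	// and T1 -> T3 (T3 read T1's write). Acyclic, so S1 answers "yes".
	s1 := map[string][]string{"T2": {"T1"}, "T1": {"T3"}}
	// Edge S2 can see from y alone: T3 -> T2 (T3 read y before T2 wrote it).
	// Acyclic, so S2 answers "yes".
	s2 := map[string][]string{"T3": {"T2"}}
	fmt.Println(hasCycle(s1), hasCycle(s2)) // false false: both partitions accept

	// The union of both partitions' edges is T2 -> T1 -> T3 -> T2: a cycle,
	// so the combined schedule is not serializable even though each local
	// validation succeeded.
	all := map[string][]string{"T2": {"T1"}, "T1": {"T3"}, "T3": {"T2"}}
	fmt.Println(hasCycle(all)) // true
}
```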

Spanner: Google's Globally-Distributed Database OSDI 2012 7

Google's Setting
Dozens of zones (datacenters)
Per zone, 100-1000s of servers
Per server, 100-1000 partitions (tablets)
Every tablet replicated for fault tolerance (e.g., 5x); the paper notes replicas are placed in nearby datacenters 8

#1 Rule: Never Lose a Sale But can we do it with low(-ish) latency and linearizability/serializability + external consistency? 9

Scale-out vs. fault tolerance
[Diagram: tablets O, P, Q each replicated across the Santa Clara, San Francisco, and Fremont datacenters]
Every tablet is replicated via Paxos (with leader election)
So every operation within a transaction across tablets is actually a replicated operation within a Paxos RSM
Paxos groups can stretch across datacenters! 10

Scale-out vs. fault tolerance
[Diagram: tablets O, P, Q each replicated via Paxos across the Santa Clara, San Francisco, and Fremont datacenters, with 2PC running across the Paxos groups]
Every tablet is replicated via Paxos (with leader election)
So every operation within a transaction across tablets is actually a replicated operation within a Paxos RSM
Paxos groups can stretch across datacenters!
Question: what was a serious concern for 2PC at scale? 11

Want it all
All state replicated in multiple datacenters
Externally consistent transactions that span partitions (involve multiple Paxos groups)
Externally consistent, non-transactional reads at the local Paxos replica!!!
Non-blocking snapshot reads and consistent, lock-free read-only transactions in the past
Hard: how can a Paxos replica guarantee a read saw up-to-date state? 12

Idea: Physically Timestamp Everything
Physical timestamps + multi-versioning
All reads, writes, and transactions are physically timestamped
If servers execute operations at their respective timestamps, everything is totally ordered, and no operation that starts after the completion of another can appear to execute before it 13
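
To make the multi-versioning half concrete, here is a minimal sketch (the types and names are assumptions, not Spanner's API) of a store that keeps a timestamped version per write and serves reads as of a timestamp:

```go
package mvstore

import "sort"

type version struct {
	ts  uint64 // physical timestamp assigned to the write
	val string
}

// MVStore keeps every version of every key, sorted by timestamp.
type MVStore struct {
	versions map[string][]version
}

func NewMVStore() *MVStore {
	return &MVStore{versions: map[string][]version{}}
}

// Write installs a new version of key at timestamp ts.
func (s *MVStore) Write(key, val string, ts uint64) {
	vs := append(s.versions[key], version{ts: ts, val: val})
	sort.Slice(vs, func(i, j int) bool { return vs[i].ts < vs[j].ts })
	s.versions[key] = vs
}

// ReadAt returns the value of key as of timestamp ts: the newest version
// whose timestamp is <= ts, if any exists.
func (s *MVStore) ReadAt(key string, ts uint64) (string, bool) {
	vs := s.versions[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= ts {
			return vs[i].val, true
		}
	}
	return "", false
}
```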

Race Due to Reading at Replicas
[Diagram: client C1 runs "chmod -r album" then uploads secret.jpg. W1 (chmod) is acked by Paxos group GA (replicas A1, A2, A3); W2 (photo) is acked by group GB (replicas B1, B2, B3). Client C2 then reads the photo from GB (R1) and the ACL from GA (R2), which answers "ok!"]

W1@10: chmod -r; W2@15: post photo
[Diagram: GA's replicas hold ACL versions +r@5 and -r@10; GB's replicas hold photo versions null@6 and pic@15]
read(acl)@8 and read(photo)@8: can't see the photo @8.

W1@10: chmod -r; W2@15: post photo
[Diagram: GA's replicas hold ACL versions +r@5 and -r@10; GB's replicas hold photo versions null@6 and pic@15]
read(acl)@14 and read(photo)@14: can't see the photo @14.

W1@10: chmod -r; W2@15: post photo
[Diagram: GA's replicas hold ACL versions +r@5 and -r@10; GB's replicas hold photo versions null@6 and pic@15]
read(acl)@20 and read(photo)@20: can see the photo @20, but then we must see the -r as well, so access is denied.

How can C2 tell A1 is out-of-date?
[Diagram: the same race; C2's ACL read (R2) is served by stale replica A1, which answers "ok!"]
Tracking causality is possible here!

How can C2 tell A1 is out-of-date?
[Diagram: C1's W1 (chmod) is acked by GA, but C2's ACL read (R2) is served by stale replica A1, which answers "ok!"]
But R2 must see the acked W1 even if there is no photo access!
Idea: delay reads at the replica until it has seen a write with a later timestamp

Stall reads until the state machine is up-to-date
[Diagram: replica A1's state machine advances through SM@5, SM@10, SM@20 as writes W1 (chmod@10) and W2 (@20) are applied; C2's read acl@15 is held until the state machine has moved past timestamp 15]
But R2 must see the acked W1 even if there is no photo access!
Idea: delay reads at the replica until it has seen a write with a later timestamp
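
A minimal sketch of that idea (names are assumed; real Spanner uses Paxos "safe time" plus multi-versioning rather than a single applied timestamp): the replica tracks the newest timestamp it has applied and holds a read at timestamp ts until it has caught up to ts.

```go
package replica

import "sync"

// Replica is a toy replicated state machine that delays reads until it has
// applied state at least as new as the read timestamp.
type Replica struct {
	mu        sync.Mutex
	cond      *sync.Cond
	appliedTS uint64            // highest write timestamp applied so far
	data      map[string]string // toy state
}

func NewReplica() *Replica {
	r := &Replica{data: map[string]string{}}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// Apply installs a replicated write at timestamp ts (assumed to arrive in order).
func (r *Replica) Apply(key, val string, ts uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[key] = val
	r.appliedTS = ts
	r.cond.Broadcast() // wake any reads waiting for this timestamp
}

// ReadAt blocks until the replica has applied writes up through ts, so it
// cannot miss an earlier acknowledged write such as the chmod.
func (r *Replica) ReadAt(key string, ts uint64) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	for r.appliedTS < ts {
		r.cond.Wait()
	}
	return r.data[key]
}
```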

One (big) problem
Must also ensure that all reads and writes (at any node) that don't overlap in wall time have increasing timestamps!
[Diagram: W1 (chmod) is timestamped @10, but the later W2 (post photo) is timestamped @8; a state machine at @9 would be assumed ok to read from, so the photo may be visible]
Need ts(w1) < ts(w2), else the photo may be visible
How can we make this safe? Communication? But is it possible with (uncertain) clocks?

Disruptive idea: Do clocks really need to be arbitrarily unsynchronized? Can you engineer some max divergence? 22

TrueTime Architecture
[Diagram: each datacenter (Datacenter 1, 2, ..., n) runs GPS timemasters and an atomic-clock timemaster; clients sync against them]
Compute reference [earliest, latest] = now ± ε 23

TrueTime implementation
now = reference now + local-clock offset
ε = reference ε + worst-case local-clock drift = 1 ms + 200 μs/sec
[Plot: ε sawtooths between clock syncs, climbing toward +6 ms over each ~30 sec interval (time axis 0-90 sec)]
What about faulty clocks? Bad CPUs are 6x more likely than bad clocks, per 1 year of empirical data 24
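
Working through those numbers: just before a resynchronization 30 seconds after the last one, ε ≈ 1 ms + 200 μs/s × 30 s = 7 ms, consistent with the roughly +6 ms of accumulated drift shown between syncs; right after a sync, ε drops back toward 1 ms.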

Progressive now() calls
[Diagram: TT.now() returns an interval [earliest, latest] of width 2ε]
Consider an event e_now which invoked tt = TT.now():
Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest 25
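
A minimal sketch of the interface this gives callers (the names and the fixed ε are assumptions; the real bound comes from the timemasters): Now returns an interval guaranteed to bracket true time, and the After/Before helpers play the role of the paper's TT.after and TT.before.

```go
package truetime

import "time"

// Interval brackets the true absolute time: Earliest <= t_abs <= Latest.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock models TT; Epsilon is the current worst-case uncertainty
// (e.g., reference ε plus worst-case local drift).
type Clock struct {
	Epsilon time.Duration
}

// Now returns [local now - ε, local now + ε].
func (c Clock) Now() Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-c.Epsilon), Latest: t.Add(c.Epsilon)}
}

// After reports whether t is definitely in the past (t < now().earliest).
func (c Clock) After(t time.Time) bool {
	return c.Now().Earliest.After(t)
}

// Before reports whether t is definitely in the future (t > now().latest).
func (c Clock) Before(t time.Time) bool {
	return c.Now().Latest.Before(t)
}
```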

Remember the Goal
Must also ensure that all reads and writes (at any node) that don't overlap in wall time have increasing timestamps!
[Diagram: W1 (chmod) is timestamped @10, but the later W2 (post photo) is timestamped @8; a state machine at @9 would be assumed ok to read from, so the photo may be visible]
Need ts(w1) < ts(w2), else the photo may be visible

Idea: Wait Out Uncertainty
TrueTime: TT.now() -> (earliest, latest); guarantee: real-world time is in the interval
Imagine W1 and W2, where W2 started after W1 finished
[Diagram: the now() intervals returned for W1 and W2 may overlap]
Choose ts(w1) and ts(w2)? Use latest? Earliest?

Commit Wait
Use ts = now().latest
Enforce that ts is less than the timestamps of all ops that start after this one finishes?
Wait; don't expose the write until ts is definitely in the past, i.e., until ts < now().earliest
Then no machine can return now().latest < ts
Remember: if a time is definitely in the past, it cannot be definitely in the future (> now().latest) for any machine
Else that machine would violate the invariant that, for every server, real time is in the range of now()
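
A minimal sketch of commit wait (the names and the fixed ε stand-in are assumptions; a real implementation would use TrueTime and sleep roughly 2ε rather than poll): pick the timestamp at the top of the uncertainty interval, then hold the result until that timestamp is definitely in the past.

```go
package commitwait

import "time"

// epsilon stands in for TrueTime's uncertainty bound (an assumption here).
const epsilon = 7 * time.Millisecond

// nowInterval is a stand-in for TT.now(): [now-ε, now+ε].
func nowInterval() (earliest, latest time.Time) {
	t := time.Now()
	return t.Add(-epsilon), t.Add(epsilon)
}

// CommitWait picks ts = TT.now().latest and blocks until ts < TT.now().earliest,
// so any operation that starts after we return must pick a larger timestamp.
func CommitWait() time.Time {
	_, latest := nowInterval()
	ts := latest // commit timestamp: upper edge of the uncertainty interval
	for {
		earliest, _ := nowInterval()
		if ts.Before(earliest) { // ts is now definitely in the past
			return ts
		}
		time.Sleep(time.Millisecond) // polling keeps the sketch simple
	}
}
```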

[Timeline: W1 is assigned ts = @15 while now is only @10; the commit is exposed only once now has passed @15] 29

[Timeline: W1 exposes its commit at @15; W2 starts afterward]
Regardless of where W2 executes or how wide its uncertainty window is, its now().latest must be > 15 if it saw W1 30

What did we get?
Keep timestamped versions of all records
Reads are similar: pick ts = now().latest as the timestamp at which to perform the read
Wait until the state machine has writes with a higher timestamp, then read there
This also works to read at a consistent snapshot across partitions
Reads can go to the Paxos replicas: a 3 to 5x throughput gain and consistent reads with no cross-datacenter communication
Also, such reads work lock-free alongside ongoing transactions 31

All writes go through the leader
Writes take locks at the leader, both local and 2PC
Non-transactional reads can skip the leader/lock table
Read-only txns can skip the lock table too 32

Known unknowns Rethink algorithms to reason about uncertainty 33

Commit Wait and Replication
[Timeline: T acquires locks, picks s, starts consensus, achieves consensus, and notifies followers; once commit wait is done, locks are released] 34

Client-driven transactions
Client:
1. Issues reads to the leader of each tablet group, which acquires read locks and returns the most recent data
2. Chooses a coordinator from the set of leaders and initiates commit
3. Sends a commit message to each leader, including the identity of the coordinator and the buffered writes
4. Waits for the commit from the coordinator 35

Commit Wait and 2-Phase Commit
On the commit message from the client, leaders acquire local write locks
If non-coordinator:
- Choose a prepare ts > previous local timestamps
- Log a prepare record through Paxos
- Notify the coordinator of the prepare timestamp
If coordinator:
- Wait until it hears from the other participants
- Choose a commit timestamp >= the prepare timestamps and > local timestamps
- Log the commit record through Paxos
- Wait out the commit-wait period
- Send the commit timestamp to replicas, other leaders, and the client
All apply at the commit timestamp and release locks 36
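
A minimal sketch of the coordinator-side timestamp rules (the function names, the fixed ε stand-in, and the missing message plumbing are assumptions): the commit timestamp must be at least every participant's prepare timestamp, strictly above anything this leader has already handed out, and at least TT.now().latest, and it is only announced after commit wait.

```go
package twopc

import "time"

const epsilon = 7 * time.Millisecond // stand-in for TrueTime's uncertainty bound

func ttEarliest() time.Time { return time.Now().Add(-epsilon) }
func ttLatest() time.Time   { return time.Now().Add(epsilon) }

// ChooseCommitTS picks s_c >= every participant prepare timestamp, strictly
// greater than the highest timestamp this leader has assigned (localMax),
// and at least TT.now().latest.
func ChooseCommitTS(prepareTS []time.Time, localMax time.Time) time.Time {
	s := ttLatest()
	if !s.After(localMax) {
		s = localMax.Add(time.Nanosecond) // strictly above prior local timestamps
	}
	for _, p := range prepareTS {
		if p.After(s) {
			s = p // s_c must be >= every prepare timestamp
		}
	}
	return s
}

// CommitWait blocks until s is definitely in the past; only then does the
// coordinator announce the commit timestamp to participants and the client.
func CommitWait(s time.Time) {
	for !ttEarliest().After(s) {
		time.Sleep(time.Millisecond)
	}
}
```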

Commit Wait and 2-Phase Commit
[Timeline: coordinator T_C and participants T_P1, T_P2 each acquire locks; each participant computes its prepare timestamp s_p, logs a prepare record, and sends s_p; the coordinator computes the overall s_c, logs the commit record, waits out commit wait, notifies the participants, and everyone releases locks] 37

Read-only optimizations
Given a global timestamp, can implement read-only transactions lock-free (snapshot isolation)
Step 1: Choose timestamp s_read = TT.now().latest
Step 2: Do a snapshot read (at s_read) at each tablet
Can be served by any up-to-date replica 38
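
A minimal sketch of those two steps (the Tablet interface, the names, and the fixed ε are assumptions): pick s_read once, then issue snapshot reads at that single timestamp against whichever replicas are up to date.

```go
package rotxn

import (
	"fmt"
	"time"
)

const epsilon = 7 * time.Millisecond // stand-in for TrueTime's uncertainty bound

// Tablet is whatever interface a replica exposes for timestamped snapshot reads.
type Tablet interface {
	SnapshotRead(key string, ts time.Time) (string, error)
}

// ReadOnlyTxn reads the given keys (key -> tablet name) at one global
// timestamp, without taking any locks.
func ReadOnlyTxn(tablets map[string]Tablet, keys map[string]string) (map[string]string, error) {
	sRead := time.Now().Add(epsilon) // s_read = TT.now().latest
	out := map[string]string{}
	for key, tabletName := range keys {
		t, ok := tablets[tabletName]
		if !ok {
			return nil, fmt.Errorf("no tablet %q", tabletName)
		}
		val, err := t.SnapshotRead(key, sRead) // served by any up-to-date replica
		if err != nil {
			return nil, err
		}
		out[key] = val
	}
	return out, nil
}
```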

Disruptive ideas: Do clocks really need to be arbitrarily unsynchronized? Can you engineer some max divergence? Known unknowns Rethink algorithms to reason about uncertainty 39


Example
Unfriend transaction: remove X from my friend list and remove myself from X's friend list (2PC across two partitions); T_C prepares at s_p = 6, T_P prepares at s_p = 8, so both commit at s_c = 8
Then: risky post P at s = 15
[Table of versions over time: before 8, my friends = [X], X's friends = [me], my posts = []; at 8, my friends = [] and X's friends = []; at 15, my posts = [P]] 41
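
Reading the table back through the external-consistency lens: any snapshot read late enough to see the risky post P (timestamp >= 15) necessarily also sees the unfriending at 8 (my friends = [], X's friends = []), and a read before 15 sees neither, so P can never be observed alongside the old friend lists.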

Timestamps and TrueTime
[Timeline: T acquires locks, picks s > TT.now().latest, then waits until TT.now().earliest > s (the commit wait, about two average ε long) before releasing locks] 42

TrueTime
Global wall-clock time with bounded uncertainty
[Diagram: TT.now() returns an interval [earliest, latest] of width 2ε on the time axis]
Consider an event e_now which invoked tt = TT.now():
Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest 43