CS6450: Distributed Systems Lecture 13. Ryan Stutsman


Eventual Consistency. CS6450: Distributed Systems, Lecture 13. Ryan Stutsman. Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University, licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Frans Kaashoek, and Nickolai Zeldovich.

Availability versus consistency. NFS and 2PC both had single points of failure, so they are not available under failures. Distributed consensus algorithms allow a view change to elect a new primary; they give a strong consistency model, but they have strong reachability requirements. If the network fails (the common case), can we provide any consistency when we replicate?

Eventual consistency. Eventual consistency: if no new updates are made to an object, eventually all accesses will return the last updated value. Common in practice: git, iPhone sync, Dropbox, Amazon Dynamo. Why do people like eventual consistency? Fast reads and writes of the local copy (no primary, no Paxos), and disconnected operation. Issue: conflicting writes to different copies. How do we reconcile them when they are discovered?

Bayou: A Weakly Connected Replicated Storage System. A meeting-room calendar application serves as a case study in ordering and conflicts in a distributed system with poor connectivity. Each calendar entry = room, time, set of participants. We want everyone to see the same set of entries, eventually; otherwise users may double-book the room.

BYTE Magazine (1991)


[Figure 4. Bayou Database Organization: the O, C, and F timestamp vectors; the write log with committed and tentative sections; the committed tuple store, undo log, and tuple store (checkpoint); split between in-memory and stable storage.]

What's wrong with a central server? I want my calendar on a disconnected mobile phone; i.e., each user wants the database replicated on her mobile device, with no master copy. The phone has only intermittent connectivity: mobile data is expensive when roaming, Wi-Fi is not everywhere all the time, and Bluetooth is useful for direct contact with other calendar users' devices but has very short range.

Swap complete databases? Suppose two users are in Bluetooth range. Each sends its entire calendar database to the other, possibly expending lots of network bandwidth. What if there is a conflict, i.e., two concurrent meetings? iPhone sync keeps both meetings. We want to do better: automatic conflict resolution.

Automatic conflict resolution. Can't just view the calendar database as abstract bits; that is too little information to resolve conflicts: 1. Both files have changed: we can falsely conclude the entire databases conflict. 2. A distinct record in each database changed: we can falsely conclude there is no conflict.

Application-specific conflict resolution. We want intelligence that knows how to resolve conflicts, more like the users' own updates: read the database, think, change the request to eliminate the conflict. Must ensure all nodes resolve conflicts in the same way to keep replicas consistent.

What's in a write? Suppose a calendar update takes the form: "10 AM meeting, Room=305, staff". How would this handle conflicts? Better: a write is an update function for the app: "1-hour meeting at 10 AM if the room is free, else 11 AM, Room=305, staff". We want all nodes to execute the same instructions in the same order, eventually.
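To make the idea concrete, here is a minimal sketch of a write expressed as an update function. The CalendarDB interface (is_free, add_entry, log_conflict) is hypothetical, not Bayou's actual API; the point is only that every replica runs the same deterministic function against its own copy of the database.

```python
# Hypothetical sketch: a Bayou-style write as an application-level update
# function rather than a raw record. Every replica replays this same
# deterministic function against its local calendar database.

def schedule_staff_meeting(db):
    """Book a 1-hour staff meeting in Room 305 at 10 AM if free, else 11 AM."""
    for start in ("10:00", "11:00"):
        if db.is_free(room=305, start=start, hours=1):   # assumed interface
            db.add_entry(room=305, start=start, hours=1, who="staff")
            return
    # Neither slot is free: record the conflict for a human to resolve.
    db.log_conflict("staff meeting", room=305)
```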

Problem. Node A asks for meeting M1 at 10 AM, else 11 AM. Node B asks for meeting M2 at 10 AM, else 11 AM. X syncs with A, then B; Y syncs with B, then A. X will put meeting M1 at 10:00; Y will put meeting M1 at 11:00. So we can't just apply update functions to DB replicas as they arrive.

Insight: total ordering of updates. Maintain an ordered list of updates at each node: the write log. [Diagram: two nodes' write logs, each an ordered list of updates, become the same ordered list after nodes 1 and 2 sync.] Make sure every node holds the same updates and applies them in the same order, and make sure updates are a deterministic function of database contents. If we obey the above, sync is a simple merge of two ordered lists, as in the sketch below.
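A minimal sketch of that merge, assuming each write carries a totally ordered key (the timestamp/node-ID pair defined on the next slide) and that writes present in both logs should appear once:

```python
import heapq

def merge_logs(log1, log2):
    """Merge two write logs that are already sorted by order_key.

    order_key is assumed to be the (timestamp, node ID) pair described
    on the next slide; duplicates are dropped.
    """
    merged = []
    for w in heapq.merge(log1, log2, key=lambda w: w.order_key):
        if not merged or merged[-1].order_key != w.order_key:
            merged.append(w)
    return merged
```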

Agreeing on the update order. Timestamp: <local timestamp T, originating node ID>. Ordering updates a and b: a < b if a.T < b.T, or (a.T = b.T and a.ID < b.ID).
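In code, this ordering rule is just a lexicographic comparison of (timestamp, node ID) tuples; a small sketch with illustrative field names:

```python
def write_order_key(write):
    """Sort key for Bayou's tentative order: (local timestamp, node ID).

    Node IDs break ties when two nodes pick the same timestamp.
    """
    return (write.ts, write.node_id)

# Example: sorted(write_log, key=write_order_key) puts <701, A> before
# <770, B>, and <701, A> before <701, B> on a timestamp tie.
```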

Write log example. <701, A>: A asks for meeting M1 at 10 AM, else 11 AM. <770, B>: B asks for meeting M2 at 10 AM, else 11 AM. (The first field is the timestamp.) Pre-sync database state: A has M1 at 10 AM; B has M2 at 10 AM. What's the correct eventual outcome? Timestamp order: M1 at 10 AM, M2 at 11 AM.

Write log example: sync problem. <701, A>: A asks for meeting M1 at 10 AM, else 11 AM. <770, B>: B asks for meeting M2 at 10 AM, else 11 AM. Now A and B sync with each other. Then each sorts the new entries into its own log, ordering by timestamp, and both now know the full set of updates. A can just run B's update function, but B has already run B's operation, too soon!

Solution: roll back and replay. B needs to roll back the DB and re-run both ops in the correct order. So, in the user interface, displayed meeting-room calendar entries are tentative at first: B's user saw M2 at 10 AM, then it moved to 11 AM. Big point: the log at each node holds the truth; the DB is just an optimization.
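"The log is the truth" can be stated as code: the visible database is just the result of replaying the ordered log from an empty state, so rollback-and-replay is conceptually a full rebuild (real systems, including Bayou, use an undo log to avoid replaying everything). A sketch under that assumption, reusing the hypothetical CalendarDB and write_order_key from above:

```python
def rebuild_db(write_log):
    """Recompute the tentative database view from the ordered log."""
    db = CalendarDB()                                   # hypothetical empty DB
    for write in sorted(write_log, key=write_order_key):
        write.update_function(db)                       # deterministic, app-supplied
    return db
```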

Is the update order consistent with the wall clock? <701, A>: A asks for meeting M1 at 10 AM, else 11 AM. <770, B>: B asks for meeting M2 at 10 AM, else 11 AM. Maybe B asked first by the wall clock, but because of clock skew A's meeting has the lower timestamp, so it gets priority. No, the order is not externally consistent.

Does the update order respect causality? Suppose another example: <701, A>: A asks for meeting M1 at 10 AM, else 11 AM. <700, B>: delete update <701, A>. B's clock was slow, so now the delete will be ordered before the add.

Lamport logical clocks respect causality. We want event timestamps such that if a node observes E1 and then generates E2, then TS(E1) < TS(E2). Let Tmax = the highest TS seen from any node (including self); to generate a new TS, use T = max(Tmax + 1, wall-clock time). Recall the properties: E1 then E2 on the same node ⇒ TS(E1) < TS(E2); but TS(E1) < TS(E2) does not imply that E1 necessarily came before E2.
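A minimal Lamport-clock sketch following the slide's rule T = max(Tmax + 1, wall-clock time); the class and method names are illustrative:

```python
import time

class LamportClock:
    def __init__(self):
        self.t_max = 0                     # highest timestamp seen so far

    def observe(self, remote_ts):
        """Call when a write stamped remote_ts arrives from another node."""
        self.t_max = max(self.t_max, remote_ts)

    def next_ts(self):
        """Generate a timestamp for a new local write."""
        self.t_max = max(self.t_max + 1, int(time.time()))
        return self.t_max
```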

Lamport clocks solve the causality problem. <701, A>: A asks for meeting M1 at 10 AM, else 11 AM. The delete, previously <700, B>, now becomes <702, B>: delete update <701, A>. When B sees <701, A> it sets Tmax ← 701, so it will then generate the delete update with a later timestamp.

Timestamps for write ordering: limitations. Ordering by timestamp arbitrarily constrains the order. You never know whether some write from the past may yet reach your node, so all entries in the log must remain tentative forever, and you must store the entire log forever. Problem: how can we allow committing a tentative entry, so we can trim logs and actually hold meetings?

Fully decentralized commit. Strawman proposal: update <10, A> is stable if all nodes have seen all updates with TS ≤ 10. Have sync always send in log order; then if you have seen updates with TS > 10 from every node, you'll never again see one ordered before <10, A>, so <10, A> is stable. Why doesn't Bayou do this? A server that remains disconnected could prevent writes from stabilizing, and then many writes might be rolled back when it reconnects.
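The strawman's stability test is easy to sketch; this assumes we know the full membership and track the highest timestamp received from each node (essentially the F version vector introduced later):

```python
def is_stable(write, highest_ts_seen):
    """Strawman stability: <write.ts, write.node_id> is stable once a write
    with a higher timestamp has been seen from every node in the system.

    highest_ts_seen: dict mapping node ID -> highest TS received from it,
    with an entry for every node. This is exactly what a disconnected
    server breaks: its entry never advances, so nothing stabilizes.
    """
    return all(ts > write.ts for ts in highest_ts_seen.values())
```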

Criteria for committing writes. For log entry X to be committed, all servers must agree: 1. on the total order of all previous committed writes; 2. that X is next in the total order; 3. that all uncommitted entries are ordered after X.

How Bayou commits writes. Bayou uses a primary commit scheme: one designated node (the primary) commits updates. The primary marks each write it receives with a permanent CSN (commit sequence number); that write is then committed. The complete timestamp is <CSN, local TS, node ID>. Advantage: you can pick a primary server close to the locus of update activity.
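One way to picture the complete timestamp is as a sort key in which committed writes (finite CSN) come before all tentative writes, committed writes sort by CSN, and tentative writes fall back to the (local TS, node ID) order. A hedged sketch with illustrative field names:

```python
INF = float("inf")

def complete_order_key(write):
    """Sort key over the complete timestamp <CSN, local TS, node ID>.

    write.csn is None for tentative writes, so they sort after every
    committed write; among themselves they keep the tentative order.
    """
    csn = write.csn if write.csn is not None else INF
    return (csn, write.ts, write.node_id)

# sorted(write_log, key=complete_order_key) yields all committed writes in
# CSN order, followed by the tentative writes in (TS, node ID) order.
```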

How Bayou commits writes (2). Nodes exchange CSNs when they sync with each other. CSNs define a total order for committed writes; all nodes eventually agree on that total order, and uncommitted writes come after all committed writes. [Diagram, in three steps: (1) a server holds tentative writes <-,10,A>, <-,20,B>, <-,30,A>, while the primary holds committed <1,8,C> plus the same tentative writes; (2) the primary commits them, so its log becomes <1,8,C>, <2,10,A>, <3,20,B>, <4,30,A>; (3) after syncing with the primary, the server's log matches: <1,8,C>, <2,10,A>, <3,20,B>, <4,30,A>.]

Showing users that writes are committed. It is still not safe to show users that an appointment request has committed! All committed writes up to the newly committed write must be known; otherwise, upon learning about an earlier committed write, we would have to re-run conflict resolution, possibly with a different outcome. Bayou propagates writes between nodes so as to enforce this invariant, i.e., Bayou propagates writes in CSN order.

Committed vs. tentative writes. Suppose a node has seen every CSN up to a write, as guaranteed by the propagation protocol. It can then show the user that the write has committed. A slow or disconnected node cannot prevent commits! The primary replica allocates CSNs; the global order of writes may not reflect real-time write times.

Tentative writes. What about tentative writes, though? How do they behave, as seen by users? Two nodes may disagree on the meaning of tentative (uncommitted) writes, even if those two nodes have synced with each other! Only CSNs from the primary replica can resolve these disagreements permanently.

Example: disagreement on tentative writes. [Timeline diagram: C writes <0,C>, B writes <1,B>, and A writes <2,A>; A and B sync, then B and C sync. Resulting logs: A holds <1,B>, <2,A>, while B and C hold <0,C>, <1,B>, <2,A>. A and B disagree on the tentative order even though they synced with each other, because B later learned of C's earlier write.]

Tentative order ≠ commit order. [Timeline diagram: A writes <-,10,A> and B writes <-,20,B>; after syncing, nodes hold the tentative order <-,10,A>, <-,20,B>. The writes reach the primary with B's write first, so the primary commits B's write as <5,20,B> and then A's as <6,10,A>; after syncing with the primary, all nodes hold the commit order <5,20,B>, <6,10,A>, which reverses the tentative timestamp order.]

Trimming the log. When nodes receive new CSNs, they can discard all committed log entries seen up to that point. The update protocol ⇒ CSNs are received in order. Keep a copy of the whole database as of the highest CSN. Result: no need to keep years of log data.
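A sketch of that trimming step, assuming the log is already sorted by the complete-timestamp key above and that stable_db is the checkpoint as of the previously highest CSN:

```python
def trim_log(write_log, stable_db, new_highest_csn):
    """Fold newly committed writes into the checkpoint and drop them.

    Assumes write_log is sorted by complete_order_key, so committed writes
    are applied to the checkpoint in CSN order.
    """
    remaining = []
    for w in write_log:
        if w.csn is not None and w.csn <= new_highest_csn:
            w.update_function(stable_db)      # apply to the checkpoint
        else:
            remaining.append(w)               # keep tentative / newer writes
    return remaining
```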

Can the primary commit writes in any order? Suppose a user creates a meeting, then decides to delete or change it. What CSN order must these ops have? Create first, then delete or modify; this must also be true in every node's view of the tentative log entries. Rule: the primary's total write order must preserve the causal order of writes made at each node, though not necessarily the order among different nodes' writes.

Syncing with trimmed logs. Suppose nodes discard all writes in the log that have CSNs, and just keep a copy of the stable DB reflecting the discarded entries. They cannot receive writes that conflict with the stable DB: a conflict could only arise from a write with a CSN less than a discarded CSN, but we already saw all writes with lower CSNs in the right order, so if we see them again we can discard them.

Syncing with trimmed logs (2). To propagate to node X: if X's highest CSN is less than mine, send X my full stable DB; X uses that as its starting point, discards all of its CSN log entries, and plays its tentative writes into that DB. If X's highest CSN is greater than mine, X can ignore my DB!

How to sync, quickly? What about tentative updates? Example logs: A holds <-,10,X>, <-,20,Y>, <-,30,X>, <-,40,X>; B holds <-,10,X>, <-,20,Y>, <-,30,X>. B tells A the highest local TS it has for each other node, e.g., X: 30, Y: 20. In response, A sends all of X's updates after <-,30,X>, all of Y's updates after <-,20,Y>, etc. This summary is a version vector (the F vector in Figure 4): A's F is [X:40, Y:20]; B's F is [X:30, Y:20].
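That exchange is simple to sketch: the receiver ships its F vector, and the sender returns every write whose origin timestamp the vector hasn't yet covered (names are illustrative):

```python
def updates_to_send(my_log, peer_f):
    """Anti-entropy step: return the writes the peer is missing.

    peer_f maps originating node ID -> highest local TS the peer has seen
    from that node (its F version vector).
    """
    return [w for w in my_log if w.ts > peer_f.get(w.node_id, 0)]

# Slide's example: A's log ends with <-,40,X> and B's F is {"X": 30, "Y": 20},
# so A sends only X's update with timestamp 40.
```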

[Figure 4 (repeated). Bayou Database Organization, showing the O, C, and F timestamp vectors alongside the write log, tuple stores, and undo log.]

Let's step back. Is eventual consistency a useful idea? Yes: people want fast writes to local copies (iPhone sync, Dropbox, Dynamo, etc.). Are update conflicts a real problem? Yes: all systems have some more or less awkward solution.

Is Bayou's complexity warranted? That is, the update-function log, version vectors, and tentative ops. It is only critical if you want peer-to-peer sync, i.e., both disconnected operation and ad hoc connectivity, and only tolerable if humans are the main consumers of the data. Otherwise you can sync through a central server, or read locally but send updates through a master.

What are Bayou's take-away ideas? 1. Update functions for automatic, application-driven conflict resolution. 2. The ordered update log is the real truth, not the DB. 3. Application of Lamport logical clocks for causal consistency.

Thursday topic: Scaling Services: Key-Value Storage. Amazon's Dynamo.

New server. A new server Z joins. Could it just start generating writes, e.g., <-, 1, Z>? And could other nodes just start including Z in their version vectors? If A syncs to B and A has <-, 10, Z> but B has no Z in its version vector, A should pretend B's version vector was [Z:0, ...].

Server retirement. We want to stop including Z in version vectors! Z sends the update <-, ?, Z>: "retiring". If you see a retirement update, omit Z from the VV. Problem: how to deal with a VV that's missing Z? Say A has log entries from Z, but B's VV has no Z entry; e.g., A has <-, 25, Z> and B's VV is just [A:20, B:21]. Maybe Z has retired, B knows, and A does not; or maybe Z is new, A knows, and B does not. We need a way to disambiguate.

Bayou's retirement plan. Idea: Z joins by contacting some server X. Z's new server identifier is the pair <Tz, X>, where Tz is X's logical clock as of when Z joined. X issues the update <-, Tz, X>: "new server Z".

Bayou's retirement plan (continued). Suppose Z's ID is <20, X>, and A syncs to B. A has a log entry from Z, <-, 25, <20,X>>, but B's VV has no Z entry. One case: B's VV is [X:10, ...]; since 10 < 20, B hasn't yet seen X's "new server Z" update. The other case: B's VV is [X:30, ...]; since 20 < 30, B once knew about Z but then saw a retirement update.
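That disambiguation rule fits in a few lines; a sketch assuming server IDs are the <Tz, X> pairs above and version vectors are plain dicts:

```python
def missing_entry_meaning(peer_vv, server_id):
    """Interpret a version vector with no entry for server Z = <Tz, X>.

    If the peer's knowledge of X extends past Tz, it must have seen X's
    "new server Z" write, so the missing entry means Z has retired;
    otherwise the peer simply hasn't heard of Z yet.
    """
    t_z, x = server_id
    if peer_vv.get(x, 0) >= t_z:
        return "retired"          # peer saw Z created, so the gap means retirement
    return "not yet created"      # peer hasn't seen X's "new server Z" update yet
```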