Two-Phase Commit, Consensus, and Raft (COS 518: Advanced Computer Systems, Lecture 4)

Recall: Linearizability (Strong Consistency)
COS 518: Advanced Computer Systems, Lecture 4. Andrew Or, Michael Freedman. Raft slides heavily based on those from Diego Ongaro and John Ousterhout.
Provide the behavior of a single copy of the object: a read should return the most recent write, and subsequent reads should return the same value, until the next write.
Telephone intuition: 1. Alice updates a Facebook post. 2. Alice calls Bob on the phone: "Check my Facebook post!" 3. Bob reads Alice's wall and sees her post.

Two-phase commit protocol
Roles: Client C, Primary P, Backups A and B.
1. C → P: request <op>
2. P → A, B: prepare <op>
3. A, B → P: prepared or error
4. P → C: result exec<op> or failed
5. P → A, B: commit <op>
What if the primary fails? A backup fails? Okay (i.e., the op is stable) if it has been written to > ½ of the nodes.
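The slide's message flow can be made concrete with a short sketch of the primary's side in Go. Everything here (the Backup interface, the Prepare/Commit method names, the error handling) is an illustrative assumption rather than code from the lecture, and the failure cases the slide asks about are deliberately left open.

```go
// Sketch of the primary's side of the 2PC exchange on the slide.
// Types and method names (Prepare, Commit) are illustrative assumptions.
package twopc

import "fmt"

type Op string

// Backup models one backup replica reachable over some transport.
type Backup interface {
	Prepare(op Op) error // steps 2-3: prepare, reply "prepared" or error
	Commit(op Op)        // step 5: commit
}

// Primary coordinates the commit across its backups.
type Primary struct {
	backups []Backup
}

// HandleRequest follows steps 1-5: on a client request, ask every backup
// to prepare; only if all prepare successfully, execute locally and
// commit, otherwise report failure back to the client.
func (p *Primary) HandleRequest(op Op) (string, error) {
	for _, b := range p.backups {
		if err := b.Prepare(op); err != nil {
			return "", fmt.Errorf("failed: backup could not prepare: %w", err)
		}
	}
	result := fmt.Sprintf("exec<%s>", op) // execute the op at the primary
	for _, b := range p.backups {
		b.Commit(op)
	}
	return result, nil
}
```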

Two-phase commit protocol (continued)
Commit sets always overlap: any two sets of > ½ of the nodes share at least one node, so any later > ½ quorum is guaranteed to see a committed op, provided the set of nodes stays consistent (a small worked example follows below).

Consensus
Definition: 1. A general agreement about something. 2. An idea or opinion that is shared by all the people in a group. Origin: Latin, from consentire.

Consensus used in systems
A group of servers attempting to:
Make sure all servers in the group receive the same updates in the same order as each other.
Maintain their own lists (views) of who is currently a member of the group, and update those lists when somebody leaves or fails.
Elect a leader in the group, and inform everybody.
Ensure mutually exclusive (one process at a time) access to a critical resource such as a file.

Paxos: the original consensus protocol
Safety: Only a single value is chosen. Only a proposed value can be chosen. Only chosen values are learned by processes.
Liveness (***): Some proposed value is eventually chosen if fewer than half of the processes fail. If a value is chosen, a process eventually learns it.
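As a quick sanity check of the "> ½ nodes" overlap argument, the following small Go program computes the minimum possible overlap of two majority quorums for a few cluster sizes. The program is only an illustration; the numbers, not the code, are the point.

```go
// Worked example of the ">½ nodes" argument: any two majorities of the
// same N servers must share at least one server, so a later majority is
// guaranteed to contain a node that saw the committed op.
package main

import "fmt"

func majority(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 5, 7} {
		q := majority(n)
		// Worst case: two quorums chosen to overlap as little as possible.
		minOverlap := 2*q - n
		fmt.Printf("N=%d  quorum=%d  minimum overlap of two quorums=%d\n",
			n, q, minOverlap)
	}
}
```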

Basic fault-tolerant Replicated State Machine (RSM) approach
1. Consensus protocol to elect a leader.
2. 2PC to replicate operations from the leader.
3. All replicas execute ops once they are committed (sketched in code below).

Why bother with a leader?
Not necessary, but:
Decomposition: normal operation vs. leader changes.
Simplifies normal operation (no conflicts).
More efficient than leader-less approaches.
Obvious place to handle non-determinism.

Goal: Replicated Log
Clients send commands to the servers; a replicated log yields a replicated state machine: all servers execute the same commands in the same order, and the consensus module ensures proper log replication.

Raft: A Consensus Algorithm for Replicated Logs
Diego Ongaro and John Ousterhout, Stanford University.
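A minimal sketch of the replicated-state-machine apply loop, under the assumption that entries carry opaque string commands and that every server runs the same loop; the type and field names are mine, not from the slides.

```go
// Sketch of the "replicated log => replicated state machine" idea: every
// server applies committed log entries to its local state machine in log
// order, so all copies compute the same state. Names are illustrative.
package rsm

type Entry struct {
	Index   int
	Term    int
	Command string
}

type StateMachine interface {
	Apply(cmd string) (result string)
}

type Server struct {
	log         []Entry // log[0] holds the entry at index 1
	commitIndex int     // highest log index known to be committed
	lastApplied int     // highest log index applied to the state machine
	sm          StateMachine
}

// applyCommitted feeds newly committed entries to the state machine,
// strictly in index order; because every replica sees the same committed
// prefix, every replica ends up in the same state.
func (s *Server) applyCommitted() {
	for s.lastApplied < s.commitIndex {
		s.lastApplied++
		s.sm.Apply(s.log[s.lastApplied-1].Command)
	}
}
```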

Raft Overview
1. Leader election.
2. Normal operation (basic log replication).
3. Safety and consistency after leader changes.
4. Neutralizing old leaders.
5. Client interactions.
6. Reconfiguration.

Server States
At any given time, each server is either:
Leader: handles all client interactions and log replication.
Follower: completely passive.
Candidate: used to elect a new leader.
Normal operation: 1 leader, N-1 followers.

Liveness Validation (sketched in code below)
Servers start as followers.
Leaders send heartbeats (empty AppendEntries RPCs) to maintain authority.
If electionTimeout elapses with no RPCs (100-500 ms), a follower assumes the leader has crashed and starts a new election.
Transitions: a follower becomes a candidate on timeout; a candidate becomes leader on receiving votes from a majority of servers, returns to follower on discovering the current leader or a higher term, or starts a new election on another timeout; a leader steps down on discovering a server with a higher term.

Terms (aka epochs)
Time is divided into terms (Term 1, Term 2, Term 3, ...): each term consists of an election (which either failed or resulted in one leader) followed by normal operation under a single leader.
Each server maintains the current term value.
Key role of terms: identify obsolete information.
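The state diagram on these slides can be sketched roughly as follows in Go; the field names, the tick-driven structure, and the timeout handling are illustrative assumptions, and real implementations differ in how they schedule elections.

```go
// Sketch of the three Raft server states and the follower's
// election-timeout check. Timings and names are illustrative.
package raftstate

import "time"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Server struct {
	state           State
	currentTerm     int
	lastHeardLeader time.Time     // reset on every valid heartbeat/AppendEntries
	electionTimeout time.Duration // e.g. somewhere in 100-500 ms
}

// tick is called periodically. A follower (or candidate) that has not
// heard from a leader within electionTimeout assumes the leader crashed
// and becomes a candidate for the next term.
func (s *Server) tick(now time.Time) {
	if s.state != Leader && now.Sub(s.lastHeardLeader) >= s.electionTimeout {
		s.state = Candidate
		s.currentTerm++ // start an election in a new term
		// ...vote for self and send RequestVote RPCs (not shown)...
	}
}
```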

Elections
Start an election: increment the current term, change to candidate state, vote for self.
Send RequestVote to all other servers, retry until either:
1. Votes are received from a majority of servers: become leader; send AppendEntries heartbeats to all other servers.
2. An RPC arrives from a valid leader: return to follower state.
3. No one wins the election (election timeout elapses): increment the term, start a new election.

Elections (continued, see the sketch below)
Safety: allow at most one winner per term.
Each server votes only once per term (the vote persists on disk), so two different candidates can't both get majorities in the same term: if a majority of servers voted for candidate A, then B can't also get a majority.
Liveness: some candidate must eventually win.
Each server chooses its election timeout randomly in [T, 2T]; one server usually initiates and wins the election before the others start. Works well if T >> network RTT.

Log Structure
entry = <index, term, command>: the leader and followers each hold a log indexed 1, 2, 3, ... whose entries record a term number and a command (add, div, sub, ...).
The log is stored on stable storage (disk) and survives crashes.
An entry is committed if it is known to be stored on a majority of servers; a committed entry is durable/stable and will eventually be executed by the state machines.

Normal operation
The client sends a command to the leader.
The leader appends the command to its log.
The leader sends AppendEntries RPCs to the followers.
Once the new entry is committed: the leader passes the command to its state machine and sends the result to the client; the leader piggybacks the commitment to followers in later AppendEntries; followers pass committed commands to their state machines.
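A sketch of the two election rules above (one persisted vote per term, and timeouts drawn from [T, 2T]); the RequestVote fields and helper names are assumptions, and the log-completeness check that a later slide adds to vote granting is omitted here.

```go
// Sketch of Raft's election safety and liveness rules: at most one vote
// per term per server, and randomized election timeouts.
package raftelect

import (
	"math/rand"
	"time"
)

type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type Server struct {
	currentTerm int
	votedFor    *int // nil = no vote granted in currentTerm; persisted to disk
}

// grantVote returns true if this server may vote for the candidate in
// the candidate's term. (Log-completeness checking is omitted.)
func (s *Server) grantVote(args RequestVoteArgs) bool {
	if args.Term < s.currentTerm {
		return false // stale candidate
	}
	if args.Term > s.currentTerm {
		s.currentTerm = args.Term
		s.votedFor = nil
	}
	if s.votedFor == nil || *s.votedFor == args.CandidateID {
		s.votedFor = &args.CandidateID
		// persist currentTerm and votedFor to stable storage here
		return true
	}
	return false
}

// randomElectionTimeout picks a timeout uniformly in [T, 2T].
func randomElectionTimeout(T time.Duration) time.Duration {
	return T + time.Duration(rand.Int63n(int64(T)))
}
```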

Normal operation (continued)
What about crashed or slow followers? The leader retries RPCs until they succeed.
Performance is optimal in the common case: one successful RPC to any majority of servers.

Log Operation: Highly Coherent
If log entries on different servers have the same index and term:
They store the same command.
The logs are identical in all preceding entries.
If a given entry is committed, all preceding entries are also committed.

Log Operation: Consistency Check (sketched in code below)
AppendEntries carries the <index, term> of the entry preceding the new ones.
The follower must contain a matching entry; otherwise it rejects the request.
This implements an induction step that ensures coherency: AppendEntries succeeds on a matching entry and fails on a mismatch.

Leader Changes
The new leader's log is the truth: no special steps, just start normal operation.
The leader will eventually make the followers' logs identical to its own.
An old leader may have left entries partially replicated, and multiple crashes can leave many extraneous log entries (the slide shows servers s1-s5 whose logs diverge in both length and terms).
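The consistency check can be sketched as follows; the struct fields mirror the <index, term> idea described above, but the exact names (PrevLogIndex, PrevLogTerm, and so on) and the conflict-handling loop are illustrative assumptions rather than the lecture's code.

```go
// Sketch of the AppendEntries consistency check: the RPC carries the
// <index, term> of the entry preceding the new ones, and the follower
// rejects unless it has a matching entry at that slot.
package raftlog

type Entry struct {
	Term    int
	Command string
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int // index of the entry immediately preceding Entries
	PrevLogTerm  int // term of that entry
	Entries      []Entry
	LeaderCommit int
}

type Follower struct {
	currentTerm int
	log         []Entry // log[i] holds the entry at index i+1
}

// handleAppendEntries performs the consistency check (the induction
// step): accept the new entries iff the preceding entry matches; on a
// mismatch, reject so the leader can back up and retry.
func (f *Follower) handleAppendEntries(args AppendEntriesArgs) bool {
	if args.Term < f.currentTerm {
		return false // from a stale leader
	}
	if args.PrevLogIndex > 0 {
		if len(f.log) < args.PrevLogIndex ||
			f.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return false // mismatch: missing or conflicting entry
		}
	}
	// Append new entries, deleting any conflicting suffix first.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + i // slice index of this entry
		if idx < len(f.log) {
			if f.log[idx].Term == e.Term {
				continue // already have this entry
			}
			f.log = f.log[:idx] // conflict: delete it and everything after
		}
		f.log = append(f.log, e)
	}
	return true
}
```

In a real implementation the follower would also advance its own commit index using LeaderCommit; that part is omitted here.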

Challenge: Log Inconsistencies
Leader changes can result in log inconsistencies: relative to the leader for term 8, possible followers (a)-(f) may have missing entries, extraneous entries, or both.

Repairing Follower Logs (sketched in code below)
The new leader must make the follower logs consistent with its own: delete extraneous entries and fill in missing entries.
The leader keeps nextIndex for each follower: the index of the next log entry to send to that follower, initialized to (1 + the leader's last index).
If the AppendEntries consistency check fails, decrement nextIndex and try again; in the slide's example, the extraneous entries on follower (f) are eventually deleted and replaced with the entries from the leader for term 7.

Safety Requirement
Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry.
Raft safety property: if a leader has decided a log entry is committed, that entry will be present in the logs of all future leaders.
Why does this guarantee the higher-level goal?
1. Leaders never overwrite entries in their logs.
2. Only entries in the leader's log can be committed.
3. Entries must be committed before being applied to a state machine.
Committed => present in future leaders' logs. This requires restrictions on commitment and restrictions on leader election.
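A sketch of the nextIndex backup loop just described, assuming a single retry loop per follower and a stand-in send function; the one-entry-at-a-time decrement follows the slide, while the names and everything else are illustrative assumptions.

```go
// Sketch of the nextIndex repair loop: the leader tracks, per follower,
// the next log index to send; when the AppendEntries consistency check
// fails, it decrements nextIndex and retries until the logs line up.
package raftrepair

type Entry struct {
	Term    int
	Command string
}

type Leader struct {
	log       []Entry     // leader's log; log[i] = entry at index i+1
	nextIndex map[int]int // follower ID -> next index to send
}

// appendSender stands in for the real RPC; it returns whether the
// follower accepted the entries (i.e., the consistency check passed).
type appendSender func(follower, prevIndex, prevTerm int, entries []Entry) bool

// replicateTo keeps backing up nextIndex for one follower until an
// AppendEntries succeeds, at which point the follower's log matches the
// leader's through the entries just sent.
func (l *Leader) replicateTo(follower int, send appendSender) {
	for {
		next := l.nextIndex[follower] // first entry we think the follower is missing
		prevIndex := next - 1
		prevTerm := 0
		if prevIndex >= 1 {
			prevTerm = l.log[prevIndex-1].Term
		}
		if send(follower, prevIndex, prevTerm, l.log[next-1:]) {
			l.nextIndex[follower] = len(l.log) + 1
			return
		}
		l.nextIndex[follower]-- // mismatch: back up one entry and retry
	}
}
```

Practical implementations often back up more than one entry at a time, but the single-step decrement is what the slide describes.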

Picking the Best Leader
During a leader transition we can't tell which entries are committed (part of the cluster may be unavailable).
So elect the candidate most likely to contain all committed entries:
In RequestVote, candidates include the index and term of their last log entry.
The voter V denies its vote if its own log is more complete: pick the log whose last entry has the higher term; if the last log terms are the same, pick the longer log.
The leader will then have the most complete log among the electing majority.
(The slides work through several "Which one is more complete?" examples, comparing logs by last term and then by length.)
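The "more complete" comparison reduces to two integer comparisons, sketched below with a made-up example; the function name and the demo values are assumptions used only for illustration.

```go
// Sketch of the "which log is more complete?" rule used by a voter to
// deny its vote to a candidate with a less complete log: compare the
// term of the last entry first, then log length.
package main

import "fmt"

// moreComplete reports whether log A (summarized by its last entry's
// term and its length) is at least as complete as log B.
func moreComplete(lastTermA, lenA, lastTermB, lenB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB // higher last term wins
	}
	return lenA >= lenB // same last term: longer log wins
}

func main() {
	// Candidate's last entry has term 2 and its log has 3 entries; the
	// voter's last entry has term 3 with 2 entries. The voter's log is
	// more complete, so it denies the vote.
	fmt.Println(moreComplete(2, 3, 3, 2)) // false -> deny vote
	fmt.Println(moreComplete(3, 4, 3, 2)) // true  -> may grant vote
}
```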

Committing Entry from Current Term
Case #1: the leader decides that an entry from its current term is committed (AppendEntries just succeeded on a majority).
Safe: the servers missing the entry can't be elected leader for term 3, so any leader for term 3 must contain entry 4.

Committing Entry from Earlier Term (see the sketch below)
Case #2: the leader is trying to finish committing an entry from an earlier term.
Entry 3 is not safely committed even though it sits on a majority: s5 can still be elected leader for term 5, and if elected it will overwrite entry 3 on s1, s2, and s3.

Linearizable Reads? Not yet
5 nodes: A (leader), B, C, D, E. A is partitioned from B, C, D, E.
B is elected as the new leader and commits a bunch of ops, but A still thinks it is the leader, so it can answer reads.
If a client contacts A, the client will get stale values!
Fix: ensure you can contact a majority before serving reads, by committing an extra log entry for each read. This guarantees you are still the rightful leader.

Monday lecture
1. Consensus papers.
2. From single-register consistency to multi-register transactions.
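These two cases motivate Raft's commitment restriction: a leader only counts replication for entries from its own term; earlier entries become committed indirectly. A rough sketch, with illustrative names and a simple majority count:

```go
// Sketch of the restriction on commitment: a leader advances its commit
// index only to entries from the current term that are stored on a
// majority; everything before such an entry commits with it.
package raftcommit

type Entry struct {
	Term    int
	Command string
}

type Leader struct {
	currentTerm int
	log         []Entry     // log[i] = entry at index i+1
	matchIndex  map[int]int // follower ID -> highest index known replicated there
	commitIndex int
	clusterSize int // number of servers, including the leader
}

// advanceCommitIndex finds the highest index N > commitIndex that is
// stored on a majority AND whose entry is from the current term, then
// commits it (and, implicitly, everything before it).
func (l *Leader) advanceCommitIndex() {
	for n := len(l.log); n > l.commitIndex; n-- {
		if l.log[n-1].Term != l.currentTerm {
			break // earlier-term entry: never commit it directly
		}
		count := 1 // the leader itself stores entry n
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > l.clusterSize/2 {
			l.commitIndex = n
			return
		}
	}
}
```

The early break relies on the fact that terms never decrease along a leader's log.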

Neutralizing Old Leaders (sketched in code below)
A leader may be temporarily disconnected; the other servers elect a new leader; the old leader reconnects and attempts to commit log entries.
Terms are used to detect stale leaders (and candidates):
Every RPC contains the term of the sender.
If the sender's term < the receiver's: the receiver rejects the RPC (via an ACK, which the sender processes).
If the receiver's term < the sender's: the receiver reverts to follower, updates its term, and processes the RPC.
An election updates the terms of a majority of servers, so a deposed server cannot commit new log entries.

Client Protocol
Clients send commands to the leader; if the leader is unknown, contact any server, which redirects the client to the leader.
The leader only responds after the command has been logged, committed, and executed by the leader.
If a request times out (e.g., the leader crashes): the client reissues the command to the new leader (after a possible redirect).
Ensure exactly-once semantics even with leader failures: e.g., the leader can execute a command and then crash before responding. The client should embed a unique ID in each command; this client ID is included in the log entry, and before accepting a request the leader checks its log for an entry with the same ID.

Configuration Changes
View configuration: { leader, { members }, settings }.
Consensus must support changes to the configuration: replace a failed machine, change the degree of replication.
We cannot switch directly from one configuration to another: conflicting majorities could arise (during the switch, a majority of C_old and a majority of C_new could make independent decisions).
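The term rule for stale leaders can be sketched as a single check run at the start of every RPC handler; the function and field names are assumptions, and persistence and vote bookkeeping are only noted in comments.

```go
// Sketch of the term rule applied to every RPC: if the sender's term is
// older, reject; if the receiver's term is older, step down to follower
// and adopt the newer term before processing.
package raftterm

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Server struct {
	state       State
	currentTerm int
}

// checkTerm is called at the start of handling any RPC (and any RPC
// reply). It returns false if the message is stale and must be rejected.
func (s *Server) checkTerm(senderTerm int) bool {
	if senderTerm < s.currentTerm {
		return false // stale leader/candidate: reject; sender will step down
	}
	if senderTerm > s.currentTerm {
		s.currentTerm = senderTerm
		s.state = Follower // deposed leaders and candidates revert to follower
		// persist currentTerm; clear votedFor for the new term (not shown)
	}
	return true
}
```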

2-Phase Approach via Joint Consensus (sketched in code below)
Joint consensus in an intermediate phase: elections and commitment need a majority of both the old and the new configurations.
A configuration change is just a log entry; it is applied immediately on receipt (committed or not).
Once the joint consensus entry is committed, begin replicating the log entry for the final configuration.

2-Phase Approach via Joint Consensus (continued)
Any server from either configuration can serve as leader.
If the leader is not in C_new, it must step down once C_new is committed.
Timeline: C_old can make unilateral decisions until the C_old+new entry is committed; C_new can make unilateral decisions only after the C_new entry is committed; in between, neither can decide alone, and a leader not in C_new steps down when the C_new entry is committed.

Raft vs. VR
Viewstamped Replication: "A new primary copy method to support highly-available distributed systems", Oki and Liskov, PODC 1988.
Strong leader: log entries flow only from the leader to the other servers; the leader is selected from a limited set so it doesn't need to catch up.
Leader election: randomized timers to initiate elections.
Membership changes: a new joint-consensus approach with overlapping majorities; the cluster can operate normally during configuration changes.
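A sketch of the joint-consensus decision rule, assuming configurations are just lists of server IDs and acknowledgements arrive as a set; the types are illustrative, not from the lecture.

```go
// Sketch of the joint-consensus rule: while the configuration is
// C_old+new, a decision (election or commitment) needs a majority of
// C_old AND a majority of C_new separately.
package raftconfig

type Config struct {
	Old []int // server IDs in C_old (empty once C_new alone is active)
	New []int // server IDs in C_new
}

func majorityOf(members []int, votes map[int]bool) bool {
	count := 0
	for _, id := range members {
		if votes[id] {
			count++
		}
	}
	return count > len(members)/2
}

// decided reports whether the given set of acknowledgements is enough to
// elect a leader or commit an entry under the current configuration.
func (c Config) decided(votes map[int]bool) bool {
	if len(c.Old) > 0 && !majorityOf(c.Old, votes) {
		return false // joint phase: a C_old majority is also required
	}
	return majorityOf(c.New, votes)
}
```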

View changes on failure
1. Backups monitor the primary.
2. If a backup thinks the primary has failed, it initiates a View Change (leader election).
3. This requires 2f + 1 nodes to handle f failures. Intuitive safety argument: a view change requires agreement from f + 1 nodes, and an op is committed once written to f + 1 nodes, so at least one node both saw the write and is in the new view.
4. More advanced: adding or removing nodes ("reconfiguration").