Failures, Elections, and Raft

CS 138, Lecture XI. Copyright 2016 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. The Raft portion is based on Raft slides by John Ousterhout.

Distributed Banking

- SFO: add interest based on current balance
- PVD: deposit $1000

(The outcome depends on the order in which the two operations are applied at each replica.)

Synchronous vs. Asynchronous

- Execution speed: synchronous bounded; asynchronous unbounded
- Message transmission delays: synchronous bounded; asynchronous unbounded
- Local clock drift rate: synchronous bounded; asynchronous unbounded

Failures

- Omission failures: something doesn't happen
  - process crashes
  - data lost in transmission
  - etc.
- Byzantine (arbitrary) failures: something bad happens
  - message is modified
  - message received twice
  - any kind of behavior, including malicious
- Timing failures: something takes too long

Detecting Crashes

- Synchronous systems: timeouts
- Asynchronous systems: ?
- Fail-stop: an oracle lets us know

Consensus

Setting: a group of processes, some correct, some faulty. Two primitives: propose(v) and decide(v). If each correct process proposes a value:

- Termination: every correct process eventually decides on a value
- Agreement: if a correct process decides v, then all correct processes eventually decide v
- Integrity: if a correct process decides v, then v was previously proposed by some process

Consensus

Variations on the problem, depending on assumptions:

- Synchronous vs. asynchronous
- Fail-stop vs. crash/omission failures vs. Byzantine failures

Equivalent to reliable, totally- and causally-ordered broadcast (atomic broadcast).

Byzantine Agreement Problem

[Figure: generals exchanging "Attack"/"Retreat" messages, with traitors relaying conflicting values.] Will cover in a future lecture.

Impossibility of Consensus

There is no deterministic protocol that solves consensus in a message-passing asynchronous system in which at most one process may fail by crashing. ("Solve" means to satisfy both safety and liveness.) Famous result from Fischer, Lynch, Paterson (FLP), 1985.

There is hope, though:

- Introducing (some) synchrony: timeouts
- Introducing randomization
- Introducing failure detectors

OceanStore: Replicated Primary

[Figure: a primary replica maintained by an inner ring of servers, with file replicas around it.]

State-Machine Replication

- An approach to dealing with failures by replication
- A data structure with deterministic operations, replicated among servers
- State is consistent if every server sees the same sequence of operations
- Multiple rounds of consensus
- Common algorithms (non-Byzantine): Paxos [Lamport], Viewstamped Replication [Oki and Liskov], Zab [Junqueira et al.], Raft [Ongaro and Ousterhout]
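
As a concrete illustration, here is a minimal Go sketch (the types and names are invented for this example, not taken from any real system): any replica that applies the same deterministic operations in the same order reaches the same state.

```go
package main

import "fmt"

// Op is one deterministic operation in the replicated log.
type Op struct {
	Key   string
	Value int
}

// StateMachine applies operations in log order; because every replica
// applies the same sequence deterministically, all replicas converge.
type StateMachine struct {
	kv map[string]int
}

func (sm *StateMachine) Apply(op Op) {
	sm.kv[op.Key] = op.Value
}

func main() {
	// Every replica that applies this same sequence ends in the same state.
	log := []Op{{"x", 1}, {"y", 2}, {"x", 3}}
	sm := &StateMachine{kv: make(map[string]int)}
	for _, op := range log {
		sm.Apply(op)
	}
	fmt.Println(sm.kv) // map[x:3 y:2]
}
```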

Raft

- Proposed by Ongaro and Ousterhout in 2014
- Four components: leader election, log replication, safety, membership changes
- Assumes crash failures
- No dependency on time for safety, but depends on time for availability

Demo: http://thesecretlivesofdata.com/raft

Server States

At any given time, each server is either:

- Leader: handles all client interactions and log replication; at most 1 viable leader at a time
- Follower: completely passive (issues no RPCs, responds to incoming RPCs)
- Candidate: used to elect a new leader

Normal operation: 1 leader, N-1 followers.
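
The three roles translate naturally into a small enumeration; a Go sketch with illustrative names:

```go
package main

import "fmt"

// ServerState mirrors the three Raft roles described above.
type ServerState int

const (
	Follower  ServerState = iota // passive: only responds to RPCs
	Candidate                    // campaigning for leadership
	Leader                       // handles clients, replicates the log
)

func (s ServerState) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

func main() {
	s := Follower // servers start up as followers
	fmt.Println("starting as", s)
}
```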

Life as a Leader

- Client sends command to leader
- Leader appends command to its log
- Leader sends AppendEntries RPCs to followers
- Once the new entry is committed:
  - Leader passes the command to its state machine, returns the result to the client
  - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  - Followers pass committed commands to their state machines
- Crashed/slow followers? Leader retries RPCs until they succeed
- Performance is optimal in the common case: one successful RPC to any majority of servers
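
A much-simplified, single-threaded Go sketch of this path. sendAppendEntries and applyToStateMachine are hypothetical stubs standing in for real RPCs and command execution; real Raft replicates asynchronously and retries failed RPCs indefinitely.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

type Leader struct {
	currentTerm int
	log         []LogEntry
	peers       []string
	commitIndex int
}

func sendAppendEntries(peer string, log []LogEntry) bool { return true }       // stub
func applyToStateMachine(cmd string) string              { return "ok: " + cmd } // stub

// HandleClientCommand follows the slide: append to the leader's log,
// replicate, and once a majority stores the entry, commit and apply it.
func (l *Leader) HandleClientCommand(cmd string) string {
	l.log = append(l.log, LogEntry{Term: l.currentTerm, Command: cmd})
	acks := 1 // the leader's own copy counts
	for _, p := range l.peers {
		if sendAppendEntries(p, l.log) { // real Raft retries until success
			acks++
		}
	}
	if acks > (len(l.peers)+1)/2 { // majority of the whole cluster
		l.commitIndex = len(l.log)
		return applyToStateMachine(cmd) // result returned to the client
	}
	return "" // not yet committed
}

func main() {
	l := &Leader{currentTerm: 1, peers: []string{"s2", "s3", "s4", "s5"}}
	fmt.Println(l.HandleClientCommand("x=1"))
}
```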

Heartbeats and Timeouts

- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
- If the election timeout elapses with no RPCs:
  - Follower assumes the leader has crashed
  - Follower starts a new election
- Timeouts are typically 100-500 ms
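
A small runnable Go sketch of the follower side, with an illustrative channel standing in for real RPCs: the election timer is reset on every heartbeat and fires only when the leader goes quiet.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	heartbeat := make(chan struct{})
	electionTimeout := 300 * time.Millisecond // typical range is 100-500 ms

	// Simulate a leader that heartbeats twice and then crashes.
	go func() {
		for i := 0; i < 2; i++ {
			time.Sleep(100 * time.Millisecond)
			heartbeat <- struct{}{}
		}
	}()

	timer := time.NewTimer(electionTimeout)
	for {
		select {
		case <-heartbeat:
			// Any AppendEntries (even empty) from the leader resets the timer.
			timer.Reset(electionTimeout)
			fmt.Println("heartbeat received; leader still alive")
		case <-timer.C:
			fmt.Println("election timeout: assume leader crashed, start election")
			return
		}
	}
}
```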

Terms

[Figure: time divided into Term 1 through Term 5; each term begins with an election, followed by normal operation; a split vote yields a term with no leader.]

- Time is divided into terms: an election, then normal operation under a single leader
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains its current term value
- Key role of terms: identify obsolete information

Server State Machine

[Figure: state transitions among Follower, Candidate, and Leader — a server starts as a follower; a timeout starts an election; a majority of votes makes a candidate leader; discovering a higher term reverts a server to follower.]

Election Basics

- Increment current term
- Change to Candidate state
- Vote for self
- Send RequestVote RPCs to all other servers; retry until either:
  1. Receive votes from a majority of servers: become leader, send AppendEntries heartbeats to all other servers
  2. Receive an RPC from a valid leader: return to follower state
  3. No one wins the election (election timeout elapses): increment term, start a new election
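
A Go sketch of the candidate's side (requestVote is a stub for the real RPC; persistence of currentTerm and votedFor is omitted):

```go
package main

import "fmt"

type Node struct {
	id          int
	currentTerm int
	votedFor    int
	peers       []int
}

func requestVote(peer, term, candidateID int) bool { return true } // stub

// startElection follows the steps above: new term, vote for self,
// solicit votes, and become leader on a majority.
func (n *Node) startElection() bool {
	n.currentTerm++   // increment current term
	n.votedFor = n.id // vote for self (persisted to disk in real Raft)
	votes := 1
	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id) {
			votes++
		}
	}
	return votes > (len(n.peers)+1)/2 // majority of peers + self
}

func main() {
	n := &Node{id: 1, peers: []int{2, 3, 4, 5}}
	won := n.startElection()
	fmt.Printf("term %d: won=%v\n", n.currentTerm, won)
}
```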

Elections, cont'd

Safety: allow at most one winner per term

- Each server gives out only one vote per term (the vote is persisted on disk)
- Two different candidates can't both accumulate majorities in the same term: if a majority of servers voted for candidate A, B can't also get a majority

Liveness: some candidate must eventually win

- Choose election timeouts randomly in [T, 2T]
- One server usually times out and wins the election before the others wake up
- Works well if T >> broadcast time
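
Picking a timeout uniformly in [T, 2T] is a one-liner; a small Go sketch:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks uniformly in [T, 2T]; T should be much
// larger than the broadcast (RPC round-trip) time.
func randomElectionTimeout(T time.Duration) time.Duration {
	return T + time.Duration(rand.Int63n(int64(T)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout(150 * time.Millisecond))
	}
}
```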

Log Structure

[Figure: leader and follower logs indexed 1-8; each entry shows its term and a command (add, cmp, ret, mov, jmp, div, shl, sub); entries stored on a majority of servers are marked committed.]

- Log entry = index, term, command
- Log stored on stable storage (disk); survives crashes
- An entry is committed if known to be stored on a majority of servers: it is durable and will eventually be executed by the state machines
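
In Go, an entry and a log might look like this (field names are illustrative; the index is implicit in the slice position, 1-based by the paper's convention):

```go
package raft

// LogEntry carries the term in which it was created plus the command.
type LogEntry struct {
	Term    int
	Command string
}

// Log is a server's log plus the commit point: everything at or before
// commitIndex is known to be on a majority and is safe to apply.
type Log struct {
	entries     []LogEntry
	commitIndex int
}
```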

Log Consistency

If log entries on different servers have the same index and term:

- They store the same command
- The logs are identical in all preceding entries

[Figure: two logs identical through the matching entry; they may diverge afterward.]

If a given entry is committed, all preceding entries are also committed.

AppendEntries Consistency Check

- Each AppendEntries RPC contains the index and term of the entry preceding the new ones
- The follower must contain a matching entry; otherwise it rejects the request
- This implements an induction step that ensures coherency

[Figure: AppendEntries succeeds when the follower has a matching preceding entry; it fails on a term mismatch.]
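
A hedged Go sketch of the follower-side check, assuming 1-based log indexes with index 0 as an empty-log sentinel:

```go
package raft

type LogEntry struct {
	Term    int
	Command string
}

// accepts reports whether a follower with the given log should accept an
// AppendEntries RPC whose new entries follow (prevLogIndex, prevLogTerm).
func accepts(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start always matches
	}
	if prevLogIndex > len(log) {
		return false // follower is missing entries
	}
	return log[prevLogIndex-1].Term == prevLogTerm // terms must match
}
```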

Leader Changes

At the beginning of a new leader's term:

- The old leader may have left entries partially replicated
- No special steps by the new leader: just start normal operation
- The leader's log is "the truth"
- Followers' logs will eventually be made identical to the leader's
- Multiple crashes can leave many extraneous log entries

[Figure: logs of servers s1-s5 after repeated crashes, with differing lengths and terms per slot.]

Safety Requirement

Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry.

Raft safety property: if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders. This guarantees the safety requirement because:

- Leaders never overwrite entries in their logs
- Only entries in the leader's log can be committed
- Entries must be committed before being applied to a state machine

To make the property hold, Raft imposes restrictions on commitment (committed implies present in future leaders' logs) and restrictions on leader election.

Picking the Best Leader

We can't tell which entries are committed (commitment may be unknowable during a leader transition), so during elections choose the candidate whose log is most likely to contain all committed entries:

- Candidates include log info in RequestVote RPCs (index and term of the last log entry)
- Voting server V denies its vote if its log is "more complete": (lastTermV > lastTermC) || (lastTermV == lastTermC && lastIndexV > lastIndexC)
- The leader will then have the most complete log among the electing majority
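
The "more complete" comparison translates directly into code; a Go sketch with illustrative names (votedFor == 0 means "no vote cast this term"):

```go
package raft

// logIsMoreComplete implements the rule from the slide: compare the last
// entries' terms first, then log lengths.
func logIsMoreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
	return lastTermV > lastTermC ||
		(lastTermV == lastTermC && lastIndexV > lastIndexC)
}

// grantVote also enforces one vote per term (votedFor must be persisted
// to disk before replying in real Raft).
func grantVote(votedFor *int, candidateID, lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
	if *votedFor != 0 && *votedFor != candidateID {
		return false // already voted for someone else this term
	}
	if logIsMoreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC) {
		return false // candidate's log may be missing committed entries
	}
	*votedFor = candidateID
	return true
}
```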

Picking the Best Leader

Who can be elected for term 8? [Figure: logs of s1-s5 with entries through index 8.] s1, s2, and s3 can; s4 and s5 cannot, because their logs are less complete than those of a majority of voters.

Committing an Entry from the Current Term

Case #1 of 2: the leader decides an entry from its current term is committed.

[Figure: leader for term 2; an AppendEntries for entry 4 just succeeded on a majority.]

Safe: servers missing the entry can't be elected as leader for term 3, so the leader for term 3 must contain entry 4.

Committing an Entry from an Earlier Term

Case #2 of 2: the leader is trying to finish committing an entry from an earlier term.

[Figure: leader for term 4; an AppendEntries for an earlier-term entry just succeeded on a majority.]

The entry is not safely committed: s5 can still be elected as leader for term 5, and if elected, it will overwrite the entry on s1, s2, and s3!

New Commitment Rules

For a leader to decide an entry is committed:

- It must be stored on a majority of servers
- At least one new entry from the leader's own term must also be stored on a majority of servers

[Figure: leader for term 4; once entry 4 is committed, s5 cannot be elected leader for term 5, and entries 3 and 4 are both safe.]

The combination of the election rules and the commitment rules makes Raft safe.
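
A Go sketch of the rule, assuming an illustrative matchIndex slice recording how far each follower's log is known to match the leader's:

```go
package raft

type LogEntry struct{ Term int }

// advanceCommitIndex applies the rule above: an entry may be committed
// directly only if a majority stores it AND it belongs to the leader's
// current term; earlier-term entries become committed indirectly.
func advanceCommitIndex(log []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(log); n > commitIndex; n-- {
		if log[n-1].Term != currentTerm {
			continue // never count replicas directly for old-term entries
		}
		count := 1 // the leader stores every entry
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(matchIndex)+1)/2 {
			return n // highest index satisfying both conditions
		}
	}
	return commitIndex
}
```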

Log Inconsistencies

Leader changes can result in log inconsistencies: followers may be missing entries, have extraneous entries, or both.

[Figure: log of the leader for term 8 (indexes 1-10), with possible followers (a)-(f) showing missing and extraneous entries.]

Repairing Follower Logs

The new leader must make followers' logs consistent with its own:

- Delete extraneous entries
- Fill in missing entries

The leader keeps a nextIndex for each follower:

- The index of the next log entry to send to that follower
- Initialized to (1 + leader's last index)

When the AppendEntries consistency check fails, decrement nextIndex and try again.

[Figure: leader for term 7 backing nextIndex down until it matches followers (a) and (b).]
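
A simplified Go sketch of the repair loop for one follower (appendEntriesRPC is a stub; real implementations batch entries and often back up faster than one entry at a time):

```go
package raft

type LogEntry struct{ Term int }

func appendEntriesRPC(follower, prevIndex, prevTerm int, entries []LogEntry) bool {
	return true // stub: real code sends an RPC and returns the follower's verdict
}

// repairFollower walks nextIndex backwards until the consistency check
// passes, then sends everything from that point on; it returns the new
// nextIndex for this follower.
func repairFollower(log []LogEntry, follower, nextIndex int) int {
	for {
		prev := nextIndex - 1
		prevTerm := 0 // index 0 is the empty-log sentinel
		if prev > 0 {
			prevTerm = log[prev-1].Term
		}
		if appendEntriesRPC(follower, prev, prevTerm, log[nextIndex-1:]) {
			return len(log) + 1 // follower now matches the leader's log
		}
		nextIndex-- // consistency check failed: back up one entry and retry
	}
}
```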

Repairing Logs, cont'd

When a follower overwrites an inconsistent entry, it deletes all subsequent entries.

[Figure: follower log before and after repair; everything after the mismatch is discarded.]

Neutralizing Old Leaders

A deposed leader may not be dead:

- Temporarily disconnected from the network
- Other servers elect a new leader
- The old leader becomes reconnected and attempts to commit log entries

Terms are used to detect stale leaders (and candidates):

- Every RPC contains the term of the sender
- If the sender's term is older, the RPC is rejected; the sender reverts to follower and updates its term
- If the receiver's term is older, it reverts to follower, updates its term, then processes the RPC normally

An election updates the terms of a majority of servers, so a deposed server cannot commit new log entries.
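
The term rule applied to every incoming RPC, as a small Go sketch with illustrative names:

```go
package raft

// Server holds the pieces of state the term rule needs.
type Server struct {
	currentTerm int
	state       string // "follower", "candidate", or "leader"
}

// onRPC returns false if the sender is stale and the RPC must be rejected.
func (s *Server) onRPC(senderTerm int) bool {
	if senderTerm < s.currentTerm {
		return false // sender is stale: reject; the sender will step down
	}
	if senderTerm > s.currentTerm {
		s.currentTerm = senderTerm // adopt the newer term
		s.state = "follower"       // and revert to follower
	}
	return true // process the RPC normally
}
```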

Client Protocol

- Send commands to the leader
  - If the leader is unknown, contact any server
  - If the contacted server is not the leader, it will redirect to the leader
- The leader does not respond until the command has been logged, committed, and executed by the leader's state machine
- If the request times out (e.g., leader crash):
  - The client reissues the command to some other server
  - It is eventually redirected to the new leader and retries the request with it

Client Protocol, cont'd

What if the leader crashes after executing a command but before responding? We must not execute the command twice.

Solution: the client embeds a unique id in each command

- The server includes the id in the log entry
- Before accepting a command, the leader checks its log for an entry with that id
- If the id is found in the log, ignore the new command and return the response from the old command

Result: exactly-once semantics as long as the client doesn't crash. This enforces linearizability (more in an upcoming lecture).
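
A minimal Go sketch of the dedup idea (the real mechanism stores the ids in log entries; here a map of cached responses stands in for scanning the log):

```go
package raft

// Result is whatever the state machine returns for a command.
type Result struct{ Value string }

// Dedup remembers the response for each command id and replays it
// instead of re-executing the command.
type Dedup struct {
	done map[string]Result // command id -> cached response
}

func NewDedup() *Dedup { return &Dedup{done: make(map[string]Result)} }

// Execute runs apply(cmd) at most once per id.
func (d *Dedup) Execute(id, cmd string, apply func(string) Result) Result {
	if r, ok := d.done[id]; ok {
		return r // duplicate: return the old response, do not re-execute
	}
	r := apply(cmd)
	d.done[id] = r
	return r
}
```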

Partitioned Leader

A client talking to a partitioned leader could be delayed forever. Solution: the leader steps down after a number of rounds of heartbeats with no response from a majority.

What if clients can crash?

Servers maintain a session for each client:

- Keep track of the latest sequence number processed for the client, and the response
- This generalizes to multiple outstanding requests:
  - The server keeps a set of <seq, resp> pairs for the client
  - The client includes with each request the lowest seq with no response
  - The server can discard smaller sequence numbers

Sessions must expire, and all replicas must agree on when to do this: Raft uses the leader's timestamp, committed to the log.
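
A Go sketch of one client's session under these rules, with illustrative names; apply stands in for committing and executing the command:

```go
package raft

// Session tracks responses for a single client's outstanding requests.
type Session struct {
	responses map[int]string // seq -> response, for unacknowledged requests
}

func NewSession() *Session { return &Session{responses: make(map[int]string)} }

// Handle processes (seq, cmd), where lowestUnacked is the smallest seq the
// client still lacks a response for; smaller entries can be forgotten.
func (s *Session) Handle(seq, lowestUnacked int, cmd string, apply func(string) string) string {
	for old := range s.responses {
		if old < lowestUnacked {
			delete(s.responses, old) // client already has this response
		}
	}
	if r, ok := s.responses[seq]; ok {
		return r // retransmission: reply with the cached response
	}
	r := apply(cmd)
	s.responses[seq] = r
	return r
}
```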

Alive clients with expired sessions

How do we distinguish a client that exited from one that just took too long?

- Require clients to register with the leader when starting a session (RegisterClient RPC)
- The leader returns a unique ID to the client
- The client uses this ID in subsequent requests
- If a server receives a request for a non-existent session, it returns an error; the current implementation crashes the client, forcing a restart

Configuration Changes

We cannot switch directly from one configuration to another: conflicting majorities could arise. See the paper for details.

[Figure: servers 1-5 transitioning from C_old to C_new over time; at some instant a majority of C_old and a majority of C_new could decide independently.]