CS 5450 State Machine Replication
Key Ideas
- To tolerate faults, replicate functionality!
- Can represent a deterministic distributed system as a replicated state machine (SMR)
- Each replica reaches the same conclusion about the system independently
- Key examples of distributed algorithms that generically implement SMR
- Formalizes notions of fault tolerance in SMR
Motivation
[Figure: two clients send get(x) to a single server; one reply is 10, the other client gets no response.]
Motivation
- Need replication for fault tolerance
- What happens in these scenarios without replication?
  - Storage: disk failure
  - Web service: network failure
- Be able to reason about failure tolerance
- How badly can things go wrong and have our system continue to function?
State Machines
- State variables
- Deterministic commands
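A minimal sketch of the two ingredients above, assuming a key-value store as the state. The class name `KVStateMachine` and the tuple command format are hypothetical, chosen only for illustration.

```python
class KVStateMachine:
    """State variables plus deterministic commands (illustrative only)."""

    def __init__(self):
        self.state = {}  # the state variables

    def apply(self, command):
        """Apply one command; the result depends only on state and command."""
        op, key, *rest = command
        if op == "put":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown command: {op}")


sm = KVStateMachine()
sm.apply(("put", "x", 10))
result = sm.apply(("get", "x"))
```

Because `apply` uses no clocks, randomness, or other outside input, two copies that see the same commands in the same order end in the same state.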
Requests and Causality
Process order is consistent with potential causality:
- If client A sends r, then r', then r is processed before r'
- If r causes client B to send r', then r is processed before r'
Coding State Machines
- State machines are procedures
- Client calls procedure
- Avoid loops
- More flexible structure
State Machine Replication
[Figure: a command c is sent to every state machine replica, each holding X = Y; every replica applies f(c) and ends with X = Z.]
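Write
[Figure: a client sends put(x,10) to the replicas.]
After the Write
[Figure: every replica has applied the write. Great!]
Need Agreement
[Figure: put(x,10) has not been handled by every replica, so two get(x) calls return 10 and 3. Problem!]
Replicas need to agree which requests have been handled.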
Two Writes
[Figure: two clients send put(x,10) and put(x,30).]
Either Outcome is Fine
[Figure: all replicas end with the same value of x, whichever write is applied last.]
Order Matters
[Figure: replicas receive put(x,10) and put(x,30) in different orders and end with different values of x.]
Replicas need to handle requests in the same order.
Requirements
All non-faulty servers need:
- Agreement: every replica needs to accept the same set of requests
- Order: all replicas process requests in the same relative order
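The order requirement can be illustrated with a short sketch. The helper `run` is hypothetical, and the initial value x = 0 is an assumption made for the example.

```python
def run(commands):
    """Apply a sequence of put(key, value) commands to a fresh replica."""
    state = {"x": 0}  # assumed initial state
    for key, value in commands:
        state[key] = value
    return state


# Same set of requests, different orders: the replicas diverge.
replica_a = run([("x", 10), ("x", 30)])
replica_b = run([("x", 30), ("x", 10)])

# Same set of requests, same order: the replicas agree.
replica_c = run([("x", 10), ("x", 30)])
```

Here `replica_a` and `replica_c` match, while `replica_b` ends with a different value of x even though it accepted the same set of requests.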
Idea for Agreement
- Someone proposes a request
- If the proposer is non-faulty, all servers will accept that request
Agreement
[Figure: a non-faulty transmitter sends put(x,10) to all replicas.]
Idea for Order
Assign unique IDs to requests and process them in ascending order.
- How do we assign unique IDs in a distributed system?
- How do we know when every replica has processed a given request?
Order
[Figure: put(x,30) and put(x,10) are assigned a total ordering with request IDs 1 and 2. Once a replica cannot receive a request with a smaller ID, the request is now stable.]
Generating IDs
- Order via clocks (client timestamp = ID)
  - Logical clocks
  - Synchronized clocks
- Two-phase ID generation
  - Every replica proposes a candidate
  - One candidate is chosen and agreed upon by all replicas
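One way to picture the logical-clock option is a Lamport-style counter. This is a sketch only: the class `LamportClock` and its method names are hypothetical, and breaking ties by process ID is an assumption added here so that IDs are unique.

```python
class LamportClock:
    """Logical clock that issues (counter, process_id) request IDs."""

    def __init__(self, process_id):
        self.process_id = process_id
        self.counter = 0

    def tick(self):
        """Advance the clock and return a fresh ID for a new request."""
        self.counter += 1
        return (self.counter, self.process_id)

    def receive(self, remote_counter):
        """Move past the counter carried by an incoming message."""
        self.counter = max(self.counter, remote_counter) + 1


a = LamportClock("A")
b = LamportClock("B")

id_r = a.tick()        # client A issues request r
b.receive(id_r[0])     # r causes client B to act
id_r2 = b.tick()       # client B issues request r'
```

Tuples compare by counter first and process ID second, so `id_r < id_r2` holds and processing in ascending ID order respects potential causality.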
Replica ID Generation
[Figure: four replicas order put(x,30) and put(x,10).]
1) Propose candidates: the replicas propose 1.1, 1.2, 1.3, 1.4 and 2.1, 2.2, 2.3, 2.4
2) Accept: 2.4 is accepted for one request
3) Accept: 2.2 is accepted for the other request, which is now stable
4) Apply: the request with ID 2.2
5) Apply: the request with ID 2.4
Rules for Replica-Generated IDs
- Any new candidate ID must be > the ID of any accepted request
- The ID selected from the candidate list must be >= each candidate
- When is a candidate stable?
  - It has been accepted
  - No other pending request has a smaller candidate ID
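The rules above can be sketched as follows. The class `Replica`, the (sequence, replica) tuple encoding of IDs such as 2.4, the request labels "A" and "B", and the arrival order are all assumptions made for illustration; picking the maximum candidate is one choice that satisfies the ">= each candidate" rule.

```python
class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.highest_seen = 0   # largest sequence number proposed or accepted
        self.pending = {}       # request -> candidate ID
        self.accepted = {}      # request -> final ID

    def propose(self, request):
        # Rule 1: a new candidate exceeds every ID accepted so far.
        self.highest_seen += 1
        candidate = (self.highest_seen, self.replica_id)
        self.pending[request] = candidate
        return candidate

    def accept(self, request, final_id):
        self.pending.pop(request, None)
        self.accepted[request] = final_id
        self.highest_seen = max(self.highest_seen, final_id[0])

    def is_stable(self, request):
        # Stable: accepted, and no pending request has a smaller candidate.
        if request not in self.accepted:
            return False
        final_id = self.accepted[request]
        return all(c > final_id for c in self.pending.values())


replicas = [Replica(i) for i in (1, 2, 3, 4)]
# Assumed arrival order: replicas 1 and 2 see A first, 3 and 4 see B first.
arrival = {1: ["A", "B"], 2: ["A", "B"], 3: ["B", "A"], 4: ["B", "A"]}

candidates = {"A": [], "B": []}
for r in replicas:
    for request in arrival[r.replica_id]:
        candidates[request].append(r.propose(request))

# Rule 2: the selected ID is >= each candidate (here, the maximum).
final_ids = {req: max(ids) for req, ids in candidates.items()}

for r in replicas:
    r.accept("B", final_ids["B"])
stable_early = [r.is_stable("B") for r in replicas]

for r in replicas:
    r.accept("A", final_ids["A"])
stable_late = [r.is_stable("B") for r in replicas]

apply_order = sorted(final_ids, key=final_ids.get)
```

Under these assumptions B receives ID (2, 2) and A receives (2, 4). B is not yet stable at replicas 1 and 2 while A is still pending there with the smaller candidates (1, 1) and (1, 2); once A is accepted, every replica applies B and then A.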
Faults
- Fail-stop
  - A faulty server can be detected as faulty
- Byzantine
  - Faulty servers can do arbitrary, perhaps malicious things
  - This includes crash failures (server can stop responding without notification)
Fail-Stop Tolerance
[Figure: put(x,30) is handled by the one remaining server: 1) propose candidates (1.1), 2) accept (1.1), 3) apply. If that server fails too: GAME OVER!!!]
Fail-Stop Fault Tolerance
- To tolerate t failures, need t + 1 servers
- As long as 1 server remains, we're OK
- Only need to participate in protocols with other live servers
Byzantine Fault Tolerance
- To tolerate t failures, need 2t + 1 servers
- Protocols now involve votes
  - Can only trust a server response if the majority of servers say the same thing
- t + 1 servers need to participate in replication protocols
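The voting step can be sketched as below. The function `vote` is hypothetical; it assumes the client has collected one response from each of the 2t + 1 servers.

```python
from collections import Counter


def vote(responses, t):
    """Return the value reported by at least t + 1 servers, else None."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    # With at most t faulty servers, t + 1 matching replies must include
    # at least one reply from a non-faulty server.
    return value if count >= t + 1 else None


trusted = vote([10, 10, 99], t=1)    # one faulty server is outvoted
undecided = vote([10, 99, 7], t=1)   # no value reaches t + 1 votes
```

With t = 1 and three servers, a single faulty reply of 99 cannot outvote the two matching replies of 10.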
Lamport (1978)
Fault-Tolerant State Machines
- Implement the state machine on multiple processors
- State machine replication
  - Each starts in the same initial state
  - Executes the same requests
  - Requires consensus to execute in the same order
  - Deterministic, each will do the exact same thing
  - Produce the same output
Consensus
- Termination
- Validity
- Integrity
- Agreement
Ensures procedures are called in the same order across all machines