Replicated State Machine in Wide-area Networks
Yanhua Mao, CSE223A WI09
Building a replicated state machine with consensus
- A general approach to replicating stateful deterministic services
- Provides strong consistency, e.g., a fault-tolerant file system for LANs
- Servers agree on the same sequence of commands to execute
- A consensus protocol (e.g., Paxos, Fast Paxos) is used to reach agreement
(Figure: clients submit commands x, y, z over the network; every server executes the identical sequence)
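The idea can be sketched in a few lines of Python (a hypothetical key-value service, not from the slides): because the state machine is deterministic, replicas that apply the same agreed command sequence end in identical states.

```python
# Minimal sketch of the RSM idea with a hypothetical key-value service:
# a deterministic state machine plus an agreed-on command order yields
# identical state at every replica.
class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, cmd):
        # Commands are ("put", key, value); determinism is essential.
        op, key, value = cmd
        if op == "put":
            self.state[key] = value

# The consensus protocol (e.g., Paxos) fixes this sequence; execution
# is then purely local at each replica.
agreed_sequence = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
replicas = [KVStateMachine() for _ in range(3)]
for cmd in agreed_sequence:
    for r in replicas:
        r.apply(cmd)

assert all(r.state == {"x": 3, "y": 2} for r in replicas)
```

The assertion at the end is the whole point: agreement on the order, not on the state itself, is what consensus provides.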
Why wide-area?
- Reliability and availability: to cope with co-location failures (fire, etc.)
- Consistency: to coordinate wide-area applications, e.g., Yahoo!'s ZooKeeper, Google's Chubby
- Are existing protocols efficient in the wide area? If not, can we improve them?
More on wide-area
Wide-area is more complex than simply "not local-area". Classify systems by the distribution of servers and clients:

                    Servers in LAN                Servers in WAN
Clients in LAN      (L,L): local-area system      (L,W)
Clients in WAN      (W,L): cluster-based system   (W,W): grid, SOA, mobile
Paxos vs. Fast Paxos in (W, L)
- Topology: all servers in the same LAN; clients talk to the servers via the WAN
- Sources of asynchrony: the wide-area network, the operating system
- Observation: high variance in delays; WAN delay dominates
(Figure: clients connect across the WAN to a LAN of servers)
Consensus speed
- Theoretically: count the number of message delays to reach consensus, assuming each message takes the same delay
- Practically: message delay varies greatly; a WAN message delay may be hundreds of times larger than that of a LAN message
(Figure: clients send cmd to the replicas and receive result)
Paxos analysis
(Figure: the client sends cmd across the WAN to the leader; the 1a/1b/2a traffic among the replicas stays in the LAN)
Wide-area delay for Paxos: the one message from the client to the leader
Fast Paxos analysis
(Figure: the client sends cmd across the WAN directly to all replicas; the 1a/1b/2a and ANY messages stay in the LAN)
- Quorum size: 2f+1; 2f+1 replicas must receive the cmd from the client to make progress
- Wide-area delay (f=1): the third-fastest message from the client to the replicas
Classic Paxos vs. Fast Paxos in (W, L)

                     Paxos    Fast Paxos
Learning delay       3        2
Number of replicas   2f+1     3f+1
Quorum size          f+1      2f+1
Collisions           No       Yes

Question: is Fast Paxos always faster in practice in collision-free runs?
Analysis result
- Assumptions: WAN message delay dominates; WAN message delays are IID and follow a continuous distribution
- Result: Classic Paxos has a fixed probability (60%) of being faster (f=1), and does even better with larger f
- Intuition: the WAN dominates and has high variance
Quick proof
- Given a wide-area latency distribution, draw 5 IID variables A-E
- A < B < C < D (sorted) are the client-to-replica delays for Fast Paxos; E is the client-to-leader delay for Paxos
- E's position relative to A-D gives 5 cases, each with the same probability
- Fast Paxos waits for the third-fastest delay, C, so Paxos is faster exactly when E < C: 3 of the 5 cases, i.e., 60%
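A quick Monte Carlo check of this argument (a sketch, using a uniform distribution; any continuous IID distribution gives the same answer, since only the ranks matter):

```python
import random

def paxos_faster_prob(trials=200_000, seed=42):
    """Estimate P(Paxos is faster) for f=1 under the slide's model:
    Paxos waits on one client-to-leader WAN delay (E), while Fast Paxos
    waits on the third fastest of four client-to-replica WAN delays (C)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        delays = sorted(rng.random() for _ in range(4))  # A < B < C < D
        e = rng.random()                                 # E, for Paxos
        if e < delays[2]:                                # Paxos beats C
            wins += 1
    return wins / trials

print(paxos_faster_prob())  # ~ 0.6, matching the 3-out-of-5 argument
```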
Summary
- Protocol performance is sensitive to the network setting: Fast Paxos is better in (L,L); Paxos is better in (W,L) and (W,W)
- How to design a protocol with the best of both across a wide range of network settings?
- Run both Paxos and Fast Paxos concurrently; switch between them when network conditions change
High-performance replicated state machine for (W, W) systems
- Model: a small number n of sites inter-connected by an asynchronous wide-area network; up to f of the n servers can fail by crashing
- System load: possibly variable and changing load from each site; network-bound (large request size) or CPU-bound (small request size)
- Goals: low latency under low client load; high throughput under high client load
Paxos in steady state
(Figure: client 0 sends x to the leader P0, which proposes it in instance 0; client 2's request z is forwarded by P2 to the leader for instance 1; propose/accept/learn messages follow)
- Pros: simple, with low message complexity (3n-3); low latency at the leader (2 steps)
- Cons: high latency at the non-leader replicas (4 steps); not load-balanced: bottleneck at the leader or the leader's links
Fast Paxos in steady state
(Figure: P0 and P3 each propose their clients' requests x and z directly; propose/accept messages flow among all replicas)
- Pros: load-balanced; low latency at all servers under no contention (2 steps)
- Cons: collisions add latency under contention; high message complexity (n^2-1); more replicas needed (3f+1)
Paxos vs. Fast Paxos in a (W, W) system

                      Paxos                      Fast Paxos          New alg?
WAN learning latency  Leader: 2, non-leader: 4   2 (no collision)    2
Collisions            No                         Yes                 No
Number of replicas    2f+1                       3f+1                2f+1
WAN msg complexity    Leader: 3n-3,              n^2-1               3n-3
                      non-leader: 3n-2
Load balancing        No                         Yes                 Yes

Neither algorithm is perfect, but Paxos is better. Question: can we achieve the best of both with a new algorithm?
Deriving Mencius from Paxos
- Approach: a rotating leader, a variant of consensus, and optimizations
- Advantages: safety is easy to ensure when the protocol is derived from an existing one; the design is flexible, so others may derive their own protocols
First step: rotating the leader
- Each instance of consensus is assigned to a coordinator server, which is the default leader of that instance
- Simple, well-known assignment, e.g., round-robin
- A server proposes client requests immediately to the next available instance it coordinates: no waiting on other servers, which reduces latency
- A server only proposes client requests to instances it coordinates, so there is no contention
Mencius Rule 1
A server p maintains its index I_p, i.e., the next consensus instance it coordinates.
Rule 1: upon receiving a client request v, p immediately proposes v to instance I_p and updates its index accordingly.
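A minimal sketch of Rule 1, assuming the round-robin assignment from the previous slide (the class and names are illustrative, not the paper's code):

```python
class Rule1Server:
    """Server p of n; with round-robin assignment, p coordinates
    instances p, p+n, p+2n, ..."""
    def __init__(self, p, n):
        self.p, self.n = p, n
        self.index = p                 # I_p: next instance p coordinates

    def on_client_request(self, v):
        i = self.index                 # Rule 1: propose v here immediately
        self.index += self.n           # advance to the next coordinated instance
        return ("suggest", i, v)

s = Rule1Server(p=1, n=3)
print([s.on_client_request(v)[1] for v in ("a", "b", "c")])  # [1, 4, 7]
```

No coordination with other servers is needed to pick the instance number, which is exactly why latency drops.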
Benefits of rotating the leader
- All servers can now propose requests directly: potentially low latency at all sites
- Load balancing at the servers: higher throughput under CPU-bound client load
- Balanced communication pattern: higher throughput under network-bound load
(Figure: in Paxos, forward/propose/accept/learn traffic funnels through the leader; in Mencius, propose/accept/learn traffic is spread evenly across all servers)
Ensure liveness when servers have different load
- Rule 1 only works well when all the servers have the same load, but servers may observe different client loads
- Servers can skip their turns (propose no-ops)
(Example: slots 0-8 are assigned round-robin; without skipping, the requests x1, y1, z1, y2, z2, z3 stall waiting for the lightly loaded server's turns; with skipping, that server fills its slots with no-ops and the sequence commits promptly)
Mencius Rule 2
Rule 2: if server p receives a propose message with some value v other than no-op for instance i, and i > I_p, then before accepting v and sending back an accept message, p updates its index to a new value I_p' > i and proposes no-ops for each instance in the range [I_p, I_p') that p coordinates.
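Rule 2 can be sketched like this (illustrative names; round-robin assignment assumed):

```python
class Rule2Server:
    def __init__(self, p, n):
        self.p, self.n = p, n
        self.index = p                     # I_p: next instance p coordinates

    def on_suggest(self, i, v):
        """On a suggestion of a real value v for instance i > I_p, advance
        the index past i and return the coordinated instances in between,
        which p now proposes no-ops for (it then accepts v for i)."""
        noops = []
        if v != "no-op" and i > self.index:
            while self.index < i:          # i belongs to another server, so
                noops.append(self.index)   # the loop exits with index > i
                self.index += self.n
        return noops

s = Rule2Server(p=0, n=3)
print(s.on_suggest(7, "x"), s.index)  # [0, 3, 6] 9
```

Note the loop cannot stop at index == i: p only ever lands on its own instances, and i is coordinated by someone else.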
Proposing no-ops is costly
- Consider the case where only one server proposes requests: it takes 3n(n-1) messages to get one value v chosen
- 3(n-1) for choosing v, plus 3(n-1)^2 for the n-1 no-ops
- Solution: use a simpler abstraction than consensus
Simple consensus for efficiency
- Simple consensus: the coordinator can propose either a client request or a no-op; non-coordinators can propose only no-op
- Benefits: a no-op can be learned in one message delay if the coordinator skips (proposes a no-op); no-ops are easy to piggyback, improving efficiency at essentially no extra cost
Coordinated Paxos
- Starting state: the coordinator is the default leader; start from a state as if phase 1 has already been completed by the coordinator
- Suggest: the coordinator proposes a request by running phase 2
- Skip: the coordinator proposes a no-op; learning a skip is fast
- Revoke: a replica runs the full phase 1 and phase 2 to propose a no-op; only needed when the coordinator is detected as having failed
Skips with simple consensus
(Figure: replicas suggest x and y via 2a/3a messages and broadcast skip messages for their unused slots in between, yielding the learned sequence x, no-op, no-op, y)
- Pros: a simple and correct protocol
- Cons: high message complexity: (n-1)(n+2)
Reduce message complexity
- Idea: piggyback skips onto existing messages
- Can piggyback onto future 2a messages too
- No longer need to broadcast skip messages
(Figure: the skip information rides on the 2a/3a messages for x and y; replica 0's view is still x, no-op, no-op, y)
Mencius Optimizations 1 & 2
When a server p receives a suggest message from server q for instance i, let r be a server other than p or q.
Optimization 1: p does not send a separate skip message to q. Instead, p uses the accept message that replies to the suggest to promise not to suggest any client request to instances smaller than i in the future.
Optimization 2: p does not send a skip message to r immediately. Instead, p waits for a future suggest message from p to r to indicate that p has promised not to suggest any client requests to instances smaller than i.
Gaps at idle replicas
- A potentially unbounded number of requests can wait to be committed
- This only happens at idle replicas: once the replica sends requests, the pending skips are resolved
- Idle replicas can also send skips periodically to accelerate resolution
(Figure: replica 0 has learned the full sequence x, no-op, no-op, y, no-op, no-op, z, while replicas 1 and 2 still have gaps)
Mencius Accelerator 1
When a server p receives a suggest message from server q, let r be a server other than p or q.
Accelerator 1: p propagates skip messages to r if the total number of outstanding skip messages to r exceeds some constant α, or a message has been deferred for longer than some time τ.
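A sketch of Accelerator 1's batching logic (class and parameter names hypothetical): skips bound for a third server are buffered and flushed only once more than α are outstanding or the oldest has waited longer than τ.

```python
class SkipBuffer:
    """Buffer skip messages for one destination server r."""
    def __init__(self, alpha, tau):
        self.alpha, self.tau = alpha, tau
        self.pending = []                        # (enqueue_time, instance)

    def defer(self, instance, now):
        self.pending.append((now, instance))
        return self.maybe_flush(now)

    def maybe_flush(self, now):
        """Flush when > alpha skips are outstanding, or the oldest one
        has been deferred longer than tau; otherwise keep buffering."""
        if self.pending and (len(self.pending) > self.alpha
                             or now - self.pending[0][0] > self.tau):
            batch = [i for _, i in self.pending]
            self.pending.clear()
            return batch                         # one combined skip message
        return None

buf = SkipBuffer(alpha=2, tau=0.5)
print(buf.defer(10, now=0.0))   # None: only 1 outstanding
print(buf.defer(13, now=0.1))   # None: 2 outstanding, not > alpha
print(buf.defer(16, now=0.2))   # [10, 13, 16]: count threshold crossed
```

The time threshold τ bounds how stale an idle replica's view can get; the count threshold α bounds the per-skip message overhead.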
Failures
- Faulty processes cannot skip, leaving gaps in the request sequence
(Figure: replicas 0 and 1 have learned x, no-op, y, but the slots coordinated by the crashed replica remain unfilled)
Mencius Rule 3
Rule 3: let q be a server that another server p suspects has failed, and let C_q be the smallest instance that is coordinated by q and not learned by p; then p revokes q for all instances in the range [C_q, I_p] that q coordinates.
Imperfect failure detector
- A false suspicion may result in no-op being chosen, even if the coordinator has not crashed and has proposed some value v
- Solution: re-suggest v when the coordinator learns the no-op
- What if a false suspicion happens again? Keep trying?
- ◇W is not enough to provide liveness: one server can be permanently falsely suspected
- So we require ◇P instead: eventually there are no false suspicions, hence ◇P guarantees eventual liveness
Mencius Rule 4
Rule 4: if server p suggests a value v other than no-op to instance i, and p learns that no-op is chosen for i, then p suggests v again.
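Rule 4 might look like the following sketch (illustrative; assumes the round-robin index from Rule 1):

```python
class Rule4Server:
    def __init__(self, p, n):
        self.p, self.n = p, n
        self.index = p
        self.inflight = {}             # instance -> value we suggested there

    def suggest(self, v):
        i, self.index = self.index, self.index + self.n
        self.inflight[i] = v
        return i

    def on_learn(self, i, chosen):
        """Rule 4: if a revocation got no-op chosen where we had suggested
        a real value, suggest that value again at our next instance."""
        v = self.inflight.pop(i, None)
        if v is not None and chosen == "no-op":
            return self.suggest(v)     # retry at the next coordinated slot
        return None

s = Rule4Server(p=0, n=3)
first = s.suggest("x")                 # suggested at instance 0
retry = s.on_learn(first, "no-op")     # x revoked, so re-suggested
print(first, retry)                    # 0 3
```

With an eventually perfect failure detector, false suspicions eventually stop, so this retry loop terminates.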
Deal with failure
- Revoke: propose no-ops on behalf of the faulty processes
- Problem: the full three phases of Paxos are costly
- Solution: revoke in large blocks
(Figure, animated over several slides: the instance space is divided into blocks, e.g., 0-29, 30-59, 60-89, 90-119; the surviving servers' slots fill with requests x and y while the failed server's slots are revoked a block at a time)
Mencius failure recovery
- P2 may come back after failure recovery or a false suspicion
- It finds the next available slot it coordinates and starts proposing requests to that slot
(Figure: P2 recovers and proposes request z to its next available slot, in the block starting at 120)
Mencius failure recovery, continued
- The other servers catch up with P2 by skipping
(Figure: P0 and P1 skip their turns up to P2's slot upon receiving the request z from P2)
Mencius Optimization 3
Optimization 3: let q be a server that another server p suspects has failed, and let C_q be the smallest instance that is coordinated by q and not learned by p. For some constant β, p revokes q for all instances in the range [C_q, I_p + 2β] that q coordinates if C_q < I_p + β.
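Optimization 3 reduces to a small range computation (a sketch with hypothetical names):

```python
def revocation_block(c_q, i_p, beta):
    """If the suspected server q's first unlearned instance c_q is within
    beta of our own index i_p, revoke a whole block of q's instances up to
    i_p + 2*beta, amortizing the three phases of Paxos over the block."""
    if c_q < i_p + beta:
        return (c_q, i_p + 2 * beta)   # inclusive range [C_q, I_p + 2*beta]
    return None                        # q's slots are far enough ahead

print(revocation_block(c_q=5, i_p=10, beta=30))    # (5, 70)
print(revocation_block(c_q=100, i_p=10, beta=30))  # None: no revocation yet
```

Revoking past I_p (up to I_p + 2β) means a fresh revocation is only needed once p's own index has advanced by another β instances.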
Mencius summary
- Rule 1: suggest a client request immediately to the next consensus instance the server coordinates
- Rule 2: update the index and skip unused instances when accepting a value
- Rule 3: revoke suspected servers
- Rule 4: re-suggest a client request after a false suspicion
Mencius summary, continued
- Optimization 1: piggyback skips onto accept messages
- Optimization 2: piggyback skips onto future suggest (2a) messages
- Accelerator 1: bound how many skip messages can be outstanding, and for how long, before flushing them
- Optimization 3: issue revocations in large blocks to amortize the cost
Mencius vs. Paxos and Fast Paxos in a (W, W) system

                      Paxos                      Fast Paxos          Mencius
WAN learning latency  Leader: 2, non-leader: 4   2 (no collision)    2*
Collisions            No                         Yes                 No
Number of replicas    2f+1                       3f+1                2f+1
WAN msg complexity    Leader: 3n-3,              n^2-1               3n-3
                      non-leader: 3n-2
Load balancing        No                         Yes                 Yes

Mencius achieves the best of both Paxos and Fast Paxos.
Delayed commit
(Example: P0 suggests x to instance 0 while P2 suggests y to instance 2; P2 learns y first but must wait to commit it until it learns x for instance 0)
- Up to one round trip of extra delay
- Out-of-order commit reduces the extra delay when requests are commutable
- A benefit of using simple consensus
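The in-order commit constraint behind delayed commit can be sketched as follows (an illustrative helper, not the paper's code); out-of-order commit relaxes exactly this gap check for requests that commute.

```python
def commit_in_order(learned, highest):
    """Requests commit in instance order: a learned value sitting behind
    an unlearned gap must wait, which is the 'delayed commit' above.
    `learned` maps instance number -> learned value."""
    committed = []
    for i in range(highest + 1):
        if i not in learned:
            break                      # gap: everything behind it is delayed
        committed.append(learned[i])
    return committed

# P2 has learned y (instance 2) and the no-op (instance 1) but not x yet:
print(commit_in_order({1: "no-op", 2: "y"}, 2))           # [] -- y is delayed
print(commit_in_order({0: "x", 1: "no-op", 2: "y"}, 2))   # ['x', 'no-op', 'y']
```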
Experimental setup
- Mencius vs. Paxos on a clique topology of three sites; each inter-site link has 50 ms delay and 5-20 Mbps bandwidth
- A simple replicated register service with k registers in total: Mencius-k
- ρ bytes of dummy payload: ρ = 4000 is network-bound, ρ = 0 is CPU-bound
- Out-of-order commit as an option
- Clients send requests at a fixed rate
Latency vs. system load
(Plot: latency (ms) vs. throughput (ops) for Mencius, Mencius-1024, Mencius-128, Mencius-16, and Paxos; network-bound load, ρ = 4000, 20 Mbps on all links)
- Take-away: Mencius has good latency
- Latency goes to infinity as the maximum throughput is reached
- Mencius's latency is as good as Paxos's under high contention
- Enabling out-of-order commit reduces latency by up to 30%
Throughput
(Plot: throughput (ops) for Mencius, Mencius-1024, Mencius-128, Mencius-16, and Paxos, under ρ = 4000, network-bound, and ρ = 0, CPU-bound)
- 200% improvement when network-bound
- Up to 50% improvement when CPU-bound
- Enabling out-of-order commit hurts throughput
- Take-away: Mencius has good throughput
Throughput under failure
(Plot: throughput (ops) vs. time (sec) for Mencius, Paxos with a crashed follower, and Paxos with a crashed leader; the times when the failure is suspected and reported are marked)
- Mencius stalls temporarily when any server fails, caused by delayed commits, but its throughput remains higher than Paxos's
- Paxos stalls temporarily when the leader fails, caused by duplicates
Fault scalability
Star topology: a 10 Mbps link from each site to the central node.
(Plot: throughput (ops) for Mencius and Paxos at 3, 5, and 7 sites, under ρ = 4000, network-bound, and ρ = 0, CPU-bound)
Take-away: Mencius scales better than Paxos
More evaluation results
- Lower latency at all servers under low contention: true even without out-of-order commit
- Adapts to the available bandwidth: adaptation happens automatically, making it easy to take advantage of all available bandwidth
Summary
- Mencius: rotating-leader Paxos
- Easy to ensure safety, with a flexible design
- Simple consensus is the foundation of the efficiency optimizations
- High performance: high throughput under high load; low latency under low load; low latency under medium-to-high load when out-of-order commit is enabled
- Better load balancing and better scalability
References
- Flavio Junqueira, Yanhua Mao, Keith Marzullo. "Classic Paxos vs. Fast Paxos: Caveat Emptor." HotDep 2007, Edinburgh, UK.
- Yanhua Mao, Flavio Junqueira, Keith Marzullo. "Mencius: Building Efficient Replicated State Machines for WANs." OSDI 2008, San Diego, CA, USA.