HP: Hybrid Paxos for WANs

1 HP: Hybrid Paxos for WANs
Dan Dobre, Matthias Majuntke, Marco Serafini and Neeraj Suri
TU Darmstadt, Germany
Dependable Embedded Systems & SW Group
Neeraj Suri, EU-NSF ICT, March 2006

2 Safety Critical Systems
Resilience against catastrophic failures
State Machine Replication
  Resilience of critical services
  Illusion of a single server that never fails
  n ≥ 2t+1 replicas
Wide Area Replication
  Large and unpredictable delays in WANs
  Latency-optimal protocol
[Figure: client requests and replies, single server (no reply) vs. SMR replicas (reply)]
EDCC, Valencia, May 18, 2010, Matthias Majuntke

3 Which Consensus Protocol
State Machine Replication (SMR)
  Clients propose commands to replicas
  Agreement on the sequence of commands: replicas are in a consistent state when executing the command sequence
  Consensus protocol needed
Latency-optimal protocols
  Latency: number of message delays between when a client proposes a command and when the command is learned by a learner (to be executed)
Two protocols by Lamport
  Classic Paxos (CP)
    3 message delays (during normal operation)
    Majority quorum for recovery
  Fast Paxos (FP)
    2 message delays (during normal operation)
    6 message delays in presence of collisions
    Larger quorum for recovery
[Figure: message flow Client, Leader, Acceptors, Client (CP) and Client, Acceptors, Client (FP)]

4 Paxos vs. Fast Paxos
Compared latency
PlanetLab experiments
  Simulation of the CP and FP message patterns (different topologies)
  FP not always faster than CP
  Some clients prefer CP, some FP
  Single crash can turn setting

5 Motivation for a Hybrid Protocol
No clear winner between CP and FP with respect to latency
Hybrid protocol: Hybrid Paxos (HP)
  Runs CP and FP in parallel
  Chooses quickest outcome of the two protocols
  Implements Generalized Consensus
    Commuting commands may be chosen in any order
  Does not negatively affect throughput
    FP mode switched off when not beneficial

6 Outline of the Talk
Contribution
System Model
Background on Paxos and Generalized Consensus
Hybrid Paxos protocol
Evaluation
Discussion
Conclusion

7 Contribution
Hybrid Paxos (HP)
  CP with additional fast mode
  Fast learning in absence of collisions
  3 message delays, as CP, in presence of collisions
  Latency optimal
  2f+1 servers, f may crash (optimal)
  Linear number of messages (optimal)
First efficient implementation of Generalized Consensus
Experiments using Emulab
  HP reaches theoretical minimum of latency
  HP does not negatively affect throughput

8 System Model
Distributed System
  n servers
  Any number of clients (may crash)
  Communication via reliable FIFO channels
Crash-stop model
  At most a minority of servers fails (n ≥ 2f+1), f = number of crashes
Asynchrony
  Ω failure detector (eventually outputs the same correct leader)
Generalized Consensus
  Command history: equivalence class of command sequences
  Sequences c1 and c2 are equivalent iff executing them produces the same outputs and state (commuting commands)

9 Background on Generalized Consensus
Protocol operates on command history = equivalence class of command sequences
Terms on histories
  Prefix relation ⊑ on histories
  glb of histories (largest common prefix, intersection)
  lub of histories (smallest common extension, union)
  h and h' are compatible iff there exists g such that h ⊑ g and h' ⊑ g
Definition of Generalized Consensus
  Consistency: every two learned histories are compatible.
  Nontriviality: if a history is chosen, then all commands it contains have been proposed.
  Conservatism: if history h is learned, then h was chosen.
  Progress: if command c is proposed, eventually a history containing c is learned.
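
The terms on histories above can be made concrete with a minimal Python sketch. It assumes every command is unique and that a user-supplied `conflicts(a, b)` predicate says which commands do not commute; the `(operation, id)` encoding and the function names are illustrative and do not come from the talk. A history is stored as one representative sequence of its equivalence class.

```python
from typing import Callable, List, Tuple

Command = Tuple[str, int]  # hypothetical encoding: (operation, unique id)
Conflicts = Callable[[Command, Command], bool]


def is_prefix(h: List[Command], g: List[Command], conflicts: Conflicts) -> bool:
    """True iff history h is a prefix of history g, up to swaps of commuting commands."""
    pos_h = {c: i for i, c in enumerate(h)}
    pos_g = {c: i for i, c in enumerate(g)}
    if any(c not in pos_g for c in h):
        return False
    for a in g:
        for b in h:
            if a == b or not conflicts(a, b):
                continue
            if a in pos_h:
                # Both commands are in h: a conflicting pair must keep its order.
                if (pos_h[a] < pos_h[b]) != (pos_g[a] < pos_g[b]):
                    return False
            elif pos_g[a] < pos_g[b]:
                # a is missing from h, so it must not precede a conflicting command of h.
                return False
    return True


def lub(h1: List[Command], h2: List[Command]) -> List[Command]:
    """Candidate smallest common extension: h1 followed by the commands only in h2."""
    seen = set(h1)
    return h1 + [c for c in h2 if c not in seen]


def compatible(h1: List[Command], h2: List[Command], conflicts: Conflicts) -> bool:
    """h1 and h2 are compatible iff the candidate lub really extends both."""
    return is_prefix(h2, lub(h1, h2), conflicts)


def bank_conflicts(a: Command, b: Command) -> bool:
    # Assumption: only deposit/deposit pairs commute.
    return not (a[0] == "deposit" and b[0] == "deposit")
```

For example, with `d1 = ("deposit", 1)`, `d2 = ("deposit", 2)` and `w = ("withdraw", 3)`, the histories `[d1]` and `[d2]` are compatible, while `[d1, w]` and `[d2]` are not, because `w` and `d2` conflict and neither history orders them.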

10 Background on Paxos Family
Following holds for CP, FP, and HP
  Clients are proposers and learners
  Servers are acceptors
  Cooperate to choose a single command history
  Acceptors query Ω and elect a leader among them
  Unique leader needed for progress only
Paxos* protocols operate in rounds
  Each leader is preassigned a set of round numbers
Operation modes
  Recovery, to change rounds (must ensure consistency)
  Normal operation
Quorums of acceptors
  CP: any two quorums intersect
  FP: requires larger fast quorums FQ
  The intersection of a quorum and a fast quorum FQ is larger than n - |FQ|, i.e., it contains at least n - |FQ| + 1 acceptors
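
The quorum condition above can be turned into a small calculation. The sketch below assumes majority quorums for CP and uses the worst-case overlap |Q| + |FQ| - n of a quorum Q and a fast quorum FQ; the function names are illustrative.

```python
def classic_quorum_size(n: int) -> int:
    """Smallest majority of n acceptors; any two majorities intersect."""
    return n // 2 + 1


def min_fast_quorum_size(n: int) -> int:
    """Smallest |FQ| whose worst-case overlap with a classic quorum,
    |Q| + |FQ| - n, is larger than n - |FQ|."""
    q = classic_quorum_size(n)
    for fq in range(q, n + 1):
        if q + fq - n > n - fq:
            return fq
    raise ValueError("no fast quorum size satisfies the condition")
```

Under these assumptions, n = 3 needs all 3 acceptors in a fast quorum, n = 5 needs 4, and n = 7 needs 6.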

11 CP and FP Message Patterns
[Figure: message patterns between client (cl), leader (ld) and acceptors (acc)]
Recovery (all protocols): Phase 1 (1a, 1b), Phase 2 (2a, 2b)
Normal operation of CP: propose, 2a, 2b
Normal operation of FP: propose, 2bfast (fast mode), chosen
Recovery from collision: 1a, 1b, 2a, 2b

12 Ideas behind Message Patterns
Normal operation CP
  Client sends proposal (command) to leader
  Leader appends command to history and sends history to acceptors (2a)
  Acceptors accept history as local history
  Acceptors send history back to client (2b)
Normal operation FP
  Client sends proposal to acceptors
  Acceptors append commands to local fast history (optimistic)
  Acceptors send history back to client (and leader) (2bfast)
  Collision: recovery triggered by leader
Recovery (to start a new round)
  Phase 1: initiated by new leader (1a)
    Acceptors send local histories to leader (1b)
    Leader determines chosen history (core of protocol)
  Phase 2: leader synchronizes acceptors to chosen history (2a)
    Reply to clients (2b)

13 Combining the two protocols
[Figure: CP, HP and FP message patterns side by side (cl, ld, acc; propose, 2a, 2b, 2bfast, chosen)]
Execute CP and FP pattern in parallel
  CP with additional FP mode
Acceptors locally maintain fast and classic history
  History from ld as classic history
  Commands from cl appended to fast history
No naïve combination
Clients learn either by receiving
  Quorum of equal 2b messages (learn)
  Fast quorum of equal 2bfast messages and one 2b message (hybrid learn)
    Needed also in FP for speculative execution
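
The two learning rules above can be sketched as a client-side check. This is a simplified illustration, not the protocol's actual code: it ignores round numbers, represents histories as hashable tuples, and assumes that the single 2b message required for a hybrid learn carries the same history as the 2bfast messages. The message encoding and the name `try_learn` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

# Hypothetical message encoding: (acceptor id, "2b" or "2bfast", history)
Message = Tuple[int, str, Hashable]


def try_learn(messages: Iterable[Message], quorum: int,
              fast_quorum: int) -> Optional[Tuple[Hashable, str]]:
    """Return (history, rule) if the messages received so far allow learning."""
    votes: Dict[Tuple[str, Hashable], Set[int]] = defaultdict(set)
    for acceptor, kind, history in messages:
        votes[(kind, history)].add(acceptor)  # count each acceptor once

    # Rule 1 (learn): a quorum of equal 2b messages.
    for (kind, history), acceptors in votes.items():
        if kind == "2b" and len(acceptors) >= quorum:
            return history, "learn"

    # Rule 2 (hybrid learn): a fast quorum of equal 2bfast messages
    # plus at least one 2b message with the same history (assumption).
    for (kind, history), acceptors in votes.items():
        if (kind == "2bfast" and len(acceptors) >= fast_quorum
                and votes.get(("2b", history))):
            return history, "hybrid learn"
    return None
```

With n = 3, a quorum of 2 and a fast quorum of 3, three equal 2bfast messages alone are not enough; the history is learned once one matching 2b message arrives.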

14 Hybrid Recovery
Same message pattern
Acceptors maintain separate histories
  Classic history
  Fast history
Leader performs CP-like and FP-like recoveries in parallel
  Determines history fh from FP recovery
  Determines history h from CP recovery
  Neither h nor fh is sufficient; goal: lub of h and fh
Problem: h and fh might be incompatible (no common extension)
  Determine largest prefix pfh of fh which is compatible with h
  Pick lub of pfh and h (smallest common extension)
Why is this correct (sufficient for Consistency)?
  To show: any history lh learned by hybrid learn is a prefix of pfh.
  lh ⊑ fh, and all prefixes of fh compatible with h are prefixes of pfh
  Sufficient to show: lh is compatible with h
  By hybrid learning: some acceptor holds lh as classic history
  lh and h have been sent by the leader, hence lh and h are compatible
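
The choice of pfh and of the final history can be sketched as follows. This is a simplified model under stated assumptions, not the authors' implementation: commands are unique, a `conflicts(a, b)` predicate marks non-commuting pairs, and histories are stored as one representative sequence. The greedy scan keeps a command of fh only if the result is still a prefix of fh and still compatible with h. All names are illustrative.

```python
from typing import Callable, List, Tuple

Command = Tuple[str, int]  # hypothetical encoding: (operation, unique id)
Conflicts = Callable[[Command, Command], bool]


def is_prefix(h: List[Command], g: List[Command], conflicts: Conflicts) -> bool:
    """True iff h is a prefix of g, up to swaps of commuting commands."""
    pos_h = {c: i for i, c in enumerate(h)}
    pos_g = {c: i for i, c in enumerate(g)}
    if any(c not in pos_g for c in h):
        return False
    for a in g:
        for b in h:
            if a == b or not conflicts(a, b):
                continue
            if a in pos_h:
                # Conflicting pair inside h must be ordered as in g.
                if (pos_h[a] < pos_h[b]) != (pos_g[a] < pos_g[b]):
                    return False
            elif pos_g[a] < pos_g[b]:
                # A command missing from h precedes a conflicting command of h.
                return False
    return True


def lub(h1: List[Command], h2: List[Command]) -> List[Command]:
    """h1 followed by the commands that appear only in h2."""
    seen = set(h1)
    return h1 + [c for c in h2 if c not in seen]


def compatible(h1: List[Command], h2: List[Command], conflicts: Conflicts) -> bool:
    return is_prefix(h2, lub(h1, h2), conflicts)


def hybrid_recovery(h: List[Command], fh: List[Command],
                    conflicts: Conflicts) -> List[Command]:
    """Return lub(pfh, h), where pfh is the largest prefix of fh compatible with h."""
    pfh: List[Command] = []
    for c in fh:
        candidate = pfh + [c]
        if is_prefix(candidate, fh, conflicts) and compatible(h, candidate, conflicts):
            pfh = candidate
    return lub(pfh, h)


def bank_conflicts(a: Command, b: Command) -> bool:
    # Assumption: only deposit/deposit pairs commute.
    return not (a[0] == "deposit" and b[0] == "deposit")
```

With deposits `d1`, `d2` and a withdraw `w`, a classic history `[d1, w]` and a fast history `[d2, d1, w]` are incompatible; the sketch keeps only `[d1]` from the fast history and returns `[d1, w]`.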

15 Implementation Optimization
Optimization 1 (message complexity)
  Leader does not send entire history to acceptors (2a)
  FIFO channels
Optimization 2 (execution)
  Implementing state machine at servers
  Only leader executes commands (speculatively)
  Prevents rollbacks at acceptors
  Clients receive history digests + result
Optimization 3 (latency)
  Diverging fast and classic histories during normal mode prevent hybrid learning
  Periodically acceptors locally align fh to h (as in hybrid recovery)
Optimization 4 (throughput)
  FP mode switched off during high load
  Leader monitors load
  Also true for FP

16 Evaluation
Experimental setting
  Banking system with two operations, deposit and withdraw
  deposit operations commute (Generalized Consensus)
Emulab test bed
  20 ms link delay between client and servers, 100 Mbps
  Topology similar to the Europe topology from the beginning of the presentation
  Servers: 600 MHz PC, Fedora 6
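
Why deposits commute while a withdraw does not can be illustrated with a toy state machine. The sketch assumes that a deposit only returns an acknowledgement and that a withdraw is rejected when the balance is too low; the talk does not specify these semantics, so both are assumptions, as is the `(operation, amount)` encoding.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, int]  # hypothetical encoding: (operation, amount)


def execute(commands: List[Command], balance: int = 0) -> Tuple[int, Dict[Command, str]]:
    """Run the commands in order; return the final balance and each command's output."""
    outputs: Dict[Command, str] = {}
    for command in commands:
        op, amount = command
        if op == "deposit":
            balance += amount
            outputs[command] = "ok"  # assumption: deposits only acknowledge
        elif op == "withdraw":
            if amount <= balance:
                balance -= amount
                outputs[command] = "ok"
            else:
                outputs[command] = "insufficient funds"
        else:
            raise ValueError(f"unknown operation: {op}")
    return balance, outputs


def equivalent(s1: List[Command], s2: List[Command]) -> bool:
    """Two sequences are equivalent iff they yield the same outputs and state."""
    return execute(s1) == execute(s2)
```

Swapping two deposits leaves the outputs and final balance unchanged, so the two sequences are equivalent. Swapping a deposit of 50 with a withdraw of 30 changes both the withdraw's output and the final balance.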

17 Latency
[Figure: latency of HP with varying withdraw rate (= probability of collisions)]
[Figure: latency vs. throughput (with and without batching)]

18 Throughput
[Figure: throughput with increasing number of clients]
[Figure: throughput with increasing f]

19 Related Work
[Lamport: ACM TOCS 1998] The Part-Time Parliament
[Lamport: Dist. Comp. 2006] Fast Paxos
[Lamport: TR 2005] Generalized Consensus and Paxos
[Dobre, Suri: DSN 2006] One-step Consensus with Zero-Degradation
[Charron-Bost, Schiper: PRDC 2006] Improving Fast Paxos: Being Optimistic with No Overhead
  Minimum latency of FP and CP only in failure-free runs
[Camargos, Schmidt, Pedone: NCA 2008] Multicoordinated Agreement Protocols for Higher Availability
  Improved availability of CP by multiple leaders; collision resolution required
[Zielinski: DISC 2005] Optimistic Generic Broadcast
  Parallel execution of CP and FP; not resilience optimal; quadratic message complexity
[Mao, Junqueira, Marzullo: OSDI 2008] Mencius: Building Efficient Replicated State Machines for WANs
  Based on CP; partitions consensus instances among several leaders (throughput)
  Each client has a LAN connection to one leader (latency)
  Perfect failure detector needed

20 Discussion
Comparison to CP
  Implements CP
  Never worse than CP
  FP mode switched off when leader is highly loaded
Comparison to FP
  HP and FP need 2 message delays in absence of collisions
  HP needs 3, FP needs 6 message delays in presence of collisions
  Experiments: collision rate grows faster than server utilization rate
  Servers underutilized when hybrid learning rate is below 50%
  FP would spend >50% of the time recovering from collisions
Optimizations
  Batching possible
  Increasing throughput by an order of magnitude
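
The message-delay counts above give a rough expected-latency comparison. The sketch assumes that each command collides independently with probability `collision_rate` and that all message delays are equal; both are simplifications that the talk does not state, and the function name is illustrative.

```python
# (delays without collision, delays with collision), taken from the counts above;
# CP needs 3 message delays in normal operation and has no fast path.
DELAYS = {"CP": (3, 3), "FP": (2, 6), "HP": (2, 3)}


def expected_message_delays(protocol: str, collision_rate: float) -> float:
    """Expected number of message delays per command under the stated assumptions."""
    if not 0.0 <= collision_rate <= 1.0:
        raise ValueError("collision_rate must be between 0 and 1")
    no_collision, collision = DELAYS[protocol]
    return (1.0 - collision_rate) * no_collision + collision_rate * collision
```

Under this model FP matches CP at a collision rate of 25% (2 + 4p = 3) and is slower beyond it, while HP never exceeds 3 message delays.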

21 Summary
HP: Hybrid Paxos
  Idea: add fast learning to Paxos
  Generalized Consensus protocol
First protocol with 2 message delays in absence of collisions and 3 message delays otherwise
  Optimal latency, resilience and number of messages
Generalized Consensus is a practical approach for WAN replication
  HP can outperform state-of-the-art protocols
HP is a Generalized Consensus protocol that features minimal latency and maximum throughput in most situations!

22 Thank you for your attention! Questions?
