Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Size: px

Start display at page:

Download "Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering"

Timothy Beasley
6 years ago
Views:

1 Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports

2 Server failures are the common case in data centers

3 Server failures are the common case in data centers

4 Server failures are the common case in data centers

5 Server failures are the common case in data centers

6 State Machine Replication Operation A Operation B Operation C

7 State Machine Replication Operation A Operation A Operation A Operation B Operation B Operation B Operation C Operation C Operation C

8 State Machine Replication Operation A Operation A Operation A Operation B Operation B Operation B Operation C Operation C Operation C

9 Paxos for state machine replication request prepare prepareok reply Client Leader Replica Replica Replica

10 Paxos for state machine replication request prepare prepareok reply Client Leader Replica Replica Throughput Bottleneck Replica

11 Paxos for state machine replication request prepare prepareok reply Client Leader Replica Replica Throughput Bottleneck Replica Latency Penalty

12 Can we eliminate Paxos overhead? Performance overhead due to worst-case network assumptions valid assumptions for the Internet data center networks are different What properties should the network have to enable faster replication?

13 Network properties determine replication complexity Asynchronous Network Messages may be: dropped reordered delivered with arbitrary latency Paxos Paxos protocol on every operation High performance cost

14 Network properties determine replication complexity Asynchronous Network All replicas: receive the same set of messages receive them in the same order Reliability Ordering Paxos Paxos protocol on every operation High performance cost

15 Network properties determine replication complexity Asynchronous Network All replicas: receive the same set of messages receive them in the same order Reliability Ordering Paxos Paxos protocol on every operation High performance cost Replication is trivial

16 Network properties determine replication complexity Asynchronous Network All replicas: receive the same set of messages receive them in the same order Reliability Ordering Paxos Paxos protocol on every operation High performance cost Replication is trivial Network implementation has the same complexity as Paxos

17 Reliability Asynchronous Network Ordering Paxos Weak Network Guarantee Strong

18 Reliability Asynchronous Network Ordering Paxos Weak Network Guarantee Strong

19 Can we build a network model that: provides performance benefits can be implemented more efficiently Reliability Asynchronous Network Ordering Paxos Weak Network Guarantee Strong

20 This Talk

21 This Talk A new network model with near-zero-cost implementation: Ordered Unreliable Multicast

22 This Talk A new network model with near-zero-cost implementation: Ordered Unreliable Multicast +

23 This Talk A new network model with near-zero-cost implementation: Ordered Unreliable Multicast + A coordination-free replication protocol: Network-Ordered Paxos

24 This Talk A new network model with near-zero-cost implementation: Ordered Unreliable Multicast + A coordination-free replication protocol: Network-Ordered Paxos =

25 This Talk A new network model with near-zero-cost implementation: Ordered Unreliable Multicast + A coordination-free replication protocol: Network-Ordered Paxos = replication within 2% throughput overhead

26 Outline 1. Background on state machine replication and data center network 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

27 Towards an ordered but unreliable network Key Idea: Separate ordering from reliable delivery in state machine replication Network provides ordering Replication protocol handles reliability

28 OUM Approach Designate one sequencer in the network Sequencer maintains a counter for each OUM group 1. Forward OUM messages to the sequencer 2. Sequencer increments counter and writes counter value into packet headers 3. Receivers use sequence numbers to detect reordering and message drops

29 Ordered Unreliable Multicast Counter: 0 Senders Receivers

30 Ordered Unreliable Multicast 1 2 Counter: Senders Receivers

31 Ordered Unreliable Multicast Counter: Senders Receivers

32 Ordered Unreliable Multicast Counter: DROP Senders Receivers

33 Ordered Unreliable Multicast Ordered Multicast: no coordination required to determine order of messages Counter: DROP Senders Receivers

34 Ordered Unreliable Multicast Ordered Multicast: no coordination required to determine order of messages Counter: DROP 4 Drop Detection: coordination only required when messages are dropped Senders Receivers

35 Sequencer Implementations In-switch sequencing next generation programmable switches implemented in P4 nearly zero cost Middlebox prototype Cavium Octeon network processor connects to root switches adds 8 us latency End-host sequencing no specialized hardware required incurs higher latency penalties similar throughput benefits

36 Sequencer Implementations In-switch sequencing next generation programmable switches implemented in P4 nearly zero cost Middlebox prototype Cavium Octeon network processor connects to root switches adds 8 us latency End-host sequencing no specialized hardware required incurs higher latency penalties similar throughput benefits

nearly zero cost Middlebox prototype Cavium Octeon

latency End-host sequencing no specialized hardware

37 Sequencer Implementations In-switch sequencing next generation programmable switches implemented in P4 nearly zero cost Middlebox prototype Cavium Octeon network processor connects to root switches adds 8 us latency End-host sequencing no specialized hardware required incurs higher latency penalties similar throughput benefits

38 Outline 1. Background on state machine replication and data center network 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

39 NOPaxos Overview Built on top of the guarantees of OUM Client requests are totally ordered but can be dropped No coordination in the common case Replicas run agreement on drop detection View change protocol for leader or sequencer failure

40 Normal Operation Client Replica (leader) Replica Replica

41 Normal Operation request Client Replica (leader) OUM Replica Replica

42 Normal Operation request reply Client Replica (leader) Replica OUM Execute Replica

43 Client Replica (leader) Normal Operation replies from request reply majority including OUM Execute waits for leader s Replica Replica

44 Client Replica (leader) Replica Normal Operation replies from request reply majority including OUM Execute waits for leader s no coordination Replica

45 Normal Operation 1 Round Trip Time waits for replies from request reply majority Client Replica (leader) Replica Replica OUM Execute including leader s no coordination

46 Gap Agreement Replicas detect message drops Non-leader replicas: recover the missing message from the leader Leader replica: coordinates to commit a NO-OP (Paxos) Efficient recovery from network anomalies

47 View Change Handles leader or sequencer failure Ensures that all replicas are in a consistent state Runs a view change protocol similar to VR view-number is a tuple of <leader-number, session-number>

48 Outline 1. Background on state machine replication and data center network 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

49 Evaluation Setup 3-level fat-tree network testbed 5 replicas with 2.5 GHz Intel Xeon E Middle box sequencer Sequencer

50 NOPaxos achieves better throughput and latency Latency (us) better Throughput (ops/sec) better

51 NOPaxos achieves better throughput and latency 1000 Latency (us) better , , , ,000 Throughput (ops/sec) better

52 NOPaxos achieves better throughput and latency 1000 Latency (us) better 250 Paxos Fast Paxos NOPaxos 0 65, , , ,000 Throughput (ops/sec) better

53 NOPaxos achieves better throughput and latency 1000 Latency (us) 750 better Paxos 4.7X throughput and more than 40% reduction in latency Fast Paxos NOPaxos 0 65, , , ,000 Throughput (ops/sec) better

54 NOPaxos achieves better throughput and latency Paxos + Batching Latency (us) better Paxos 4.7X throughput and more than 40% reduction in latency Fast Paxos NOPaxos 0 65, , , ,000 Throughput (ops/sec) better

NOPaxos achieves better throughput and latency 1000 750 Paxos + Batching Latency (us) better 500 250 Paxos 4.

55 NOPaxos achieves better throughput and latency Paxos + Batching Latency (us) better Paxos 4.7X throughput and 25% higher throughput and more than 40% 6X lower latency reduction in latency Fast Paxos NOPaxos 0 65, , , ,000 Throughput (ops/sec) better

56 NOPaxos is resilient to network anomalies 260,000 NOPaxos Paxos SpecPaxos Throughput (ops/sec) 195, ,000 65, % 0.01% 0.1% 1% Packet Drop Rate

57 NOPaxos is resilient to network anomalies 260,000 NOPaxos Paxos SpecPaxos NOPaxos Throughput (ops/sec) 195, ,000 Speculative Paxos 65,000 Paxos % 0.01% 0.1% 1% Packet Drop Rate

58 NOPaxos is resilient to network anomalies 260,000 NOPaxos Paxos SpecPaxos NOPaxos Throughput (ops/sec) 195, ,000 Speculative Paxos 65,000 Paxos % 0.01% 0.1% 1% drops to 24% of maximum throughput Packet Drop Rate

59 NOPaxos attains throughput within 2% of an unreplicated system

60 NOPaxos attains throughput within 2% of an unreplicated system 500 Latency (us) better , , , ,000 Throughput (ops/sec) better

61 NOPaxos attains throughput within 2% of an unreplicated system Latency (us) 250 Paxos better , , , ,000 Throughput (ops/sec) better NOPaxos Unreplicated

62 NOPaxos attains throughput within 2% of an unreplicated system Latency (us) better Paxos within 2% throughput and 16us latency of an unreplicated system 0 65, , , ,000 Throughput (ops/sec) better NOPaxos Unreplicated

63 NOPaxos attains throughput within 2% of an unreplicated system Latency (us) better Paxos within 2% throughput and 16us latency of an unreplicated system 0 65, , , ,000 Throughput (ops/sec) NOPaxos using end-host sequencer better NOPaxos Unreplicated

64 NOPaxos attains throughput within 2% of an unreplicated system 500 Latency (us) better Paxos similar throughput but 36% higher latency within 2% throughput and 16us latency of an unreplicated system 0 65, , , ,000 Throughput (ops/sec) NOPaxos using end-host sequencer better NOPaxos Unreplicated

65 Related Work Group communication systems Virtual Synchrony [Birman, et al.], CATOCS [Cheriton, et al.], Amoeba [Kaashoek, et al.] Consensus protocols Fast Paxos [Lamport], Optimistic Atomic Broadcast [Pedone, et al.], Speculative Paxos [Ports, et al.] Egalitarian Paxos [Moraru, et al.], Tapir [Zhang, et al.] Network and Hardware support for distributed systems SwitchKV [Li, et al.], NetPaxos [Dang, et al.], FaRM [Dragojevic, et al.], Consensus in a Box [Istvan, et al.]

66 Summary Separate ordering from reliable delivery in state machine replication A new network model OUM that provides ordered but unreliable message delivery A more efficient replication protocol NOPaxos that ensures reliable delivery The combined system achieves performance equivalent to an unreplicated system

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE Today s most