1 MENCIUS: BUILDING EFFICIENT REPLICATED STATE MACHINES FOR WANS By: Yanhua Mao, Flavio P. Junqueira, Keith Marzullo. Presented by Fabian Fuxa and Chun-Yu Hsiung. November 14, 2018

2 AGENDA 1. Motivation 2. Breakthrough 3. Rules of Mencius 4. Optimization of Mencius 5. Evaluation 6. Conclusion

3 WIDE AREA NETWORK (WAN) MODEL Model the system as n sites. Each site contains a server and some clients.

4 MOTIVATION: INSUFFICIENCY OF PAXOS Relies on a single leader. The leader server processes more messages (CPU load). An unbalanced communication pattern limits throughput. Higher latency for clients in non-leader sites.

5 COORDINATED PAXOS To save bandwidth: reduces ACK messages from n * (n - 1) to 2 * n. (Message diagram: leader election and value proposal between the leader and the other servers, with PREPARE, ACK, PROPOSE, ACCEPT, and LEARN messages.)

6 PAXOS: LIMITED THROUGHPUT All servers are mutually connected, but only the links to the leader server are used.

7 PAXOS: HIGHER LATENCY FOR OTHERS A client in the leader site needs 2 message transmissions to learn a value, while a client in a non-leader site needs 4.

8 PAXOS: HIGHER LATENCY FOR NON-LEADER SITE

9 MENCIUS IMPROVEMENT Rotate the leader: assign each slot to a server (blue: 0, 3, 6; green: 1, 4, 7; yellow: 2, 5, 8). For each slot, only the assigned server may propose a non-no_op value; all servers may propose no_op.
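
A minimal sketch of the rotating slot assignment described above, assuming servers are numbered 0..n-1 and slot s is coordinated by server s mod n; the struct and method names are illustrative, not taken from the Mencius implementation:

#include <cstdint>

// Rotating coordinator: slot s belongs to server (s % n). Illustrative sketch only.
struct SlotAssignment {
    uint32_t n;  // number of servers (sites)

    // Which server may propose a non-no_op value in this slot.
    uint32_t coordinator(uint64_t slot) const { return static_cast<uint32_t>(slot % n); }

    // The first slot owned by `server` that is >= `from`.
    uint64_t next_owned_slot(uint32_t server, uint64_t from) const {
        uint64_t r = from % n;
        return (r <= server) ? from + (server - r) : from + (n - r) + server;
    }
};

With n = 3, coordinator(4) is 1 (green) and next_owned_slot(0, 1) is 3, matching the blue/green/yellow assignment on the slide.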

10 ASSUMPTIONS Crashed servers can recover. Unreliable failure detector to detect failed servers. Asynchronous FIFO channels (TCP).

11 THREE ACTIONS 1. Suggest: ordinarily propose a value. 2. Skip: the leader itself skips its turn. 3. Revoke: another server takes over the turn and proposes no_op. (Example: P2 is the leader initially but is considered failed.)

12 FOUR ACTIONS TO HANDLE 1. Propose 2. Accept 3. Fill bubbles 4. Crashed-server recovery

13 PROPOSE A server needs to know which slot to propose in, so it maintains the next propose slot. (Blue owns slots 0, 3, 6.)

14 PROPOSE A server needs to know which slot to propose in, so it maintains the next propose slot. (Blue proposes v for slot 0; blue owns slots 0, 3, 6.)
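
A hedged sketch of this propose step: the server suggests its pending value in its current next_slot and then advances by n to its next turn. broadcast_suggest is an assumed messaging primitive, not part of the paper's code:

#include <cstdint>
#include <string>

void broadcast_suggest(uint64_t slot, const std::string& value);  // assumed primitive

// Propose in the next owned slot, then move the index n slots ahead (our next turn).
void propose(uint64_t& next_slot, uint32_t n, const std::string& value) {
    broadcast_suggest(next_slot, value);  // SUGGEST <slot, value> to the other servers
    next_slot += n;
}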

15 ACCEPT Adjust the next propose slot according to the received message, and obey serializability. (Blue, Index = 0, receives suggestion v for slot 1.)

16 ACCEPT: CASE 1 The next propose slot is above the incoming message's slot. (Blue has 0:v, Index = 0; it receives suggestion v for slot 1.)

17 ACCEPT: CASE 1 The next propose slot is above the incoming message's slot. (Blue has 0:v0 and 1:v, Index = 0; it receives suggestion v for slot 1 and simply accepts (1, v).)

18 ACCEPT: CASE 2 The next propose slot is below the incoming message's slot. (Blue, Index = 0; it receives suggestion v for slot 1 and accepts (1, v); Index becomes 3.)

19 ACCEPT: CASE 2 The next propose slot is below the incoming message's slot. (Blue: 0:no_op, 1:v; Index goes from 0 to 3; it receives suggestion v for slot 1, accepts (1, v), and SKIPs by proposing no_op for slot 0.)
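
A sketch covering both accept cases just shown, under the assumption that next_slot holds the server's next owned slot and that send_accept/send_skip are the messaging primitives (all names illustrative):

#include <cstdint>
#include <string>

void send_accept(uint64_t slot, const std::string& value);  // assumed primitives
void send_skip(uint64_t slot);                              // propose no_op for one of our own slots

// Handle a SUGGEST for `slot` carrying `value` from another server's turn.
void on_suggest(uint64_t slot, const std::string& value, uint64_t& next_slot, uint32_t n) {
    send_accept(slot, value);               // always accept the suggestion
    if (next_slot < slot) {                 // case 2: our next slot is below the incoming slot
        for (uint64_t s = next_slot; s < slot; s += n)
            send_skip(s);                   // give up our earlier turns with no_op
        while (next_slot < slot)            // advance to our first owned slot after `slot`
            next_slot += n;
    }
    // case 1: next_slot already above `slot`, so accepting is all that is needed
}

For blue with next_slot = 0 and n = 3, a suggestion for slot 1 triggers a SKIP for slot 0 and moves next_slot to 3, as on the slide.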

20 FILL BUBBLES A crashed server does not propose values. A slot can commit only when there are no earlier bubbles. (Blue: 0:v0, 1:?, 2:v2, 3:v3, 4:?, 5:v5. Gaps at slots 1 and 4; cannot commit; revoke!)

21 REVOKE Another server holds an election, takes over leadership of the slots, and proposes NO_OP for them. (Figure: P0, P1, P2.)

22 FILL BUBBLES A crashed server does not broadcast SKIP. A slot can commit only when there are no earlier bubbles, so revoke the slots assigned to the suspected crashed server and fill the bubbles with no_op. (Blue: 0:v0, 1:no_op, 2:v2, 3:v3, 4:no_op, 5:v5. The gaps at slots 1 and 4 are now filled.)
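
A minimal sketch of the "no earlier bubbles" commit rule, assuming decided slots (including no_op outcomes) are kept in a map keyed by slot number; the names and the deliver callback are illustrative:

#include <cstdint>
#include <map>
#include <string>

// Commit, in slot order, every decided slot with no undecided slot before it.
// `decided` maps slot -> chosen value (possibly "no_op"); `next_commit` is the first
// slot not yet committed. Illustrative sketch, not the paper's code.
void commit_ready(std::map<uint64_t, std::string>& decided, uint64_t& next_commit,
                  void (*deliver)(uint64_t, const std::string&)) {
    auto it = decided.find(next_commit);
    while (it != decided.end()) {
        deliver(next_commit, it->second);  // in-order commit; no_op slots carry no client request
        decided.erase(it);
        ++next_commit;
        it = decided.find(next_commit);
    }
}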

23 SERVER RECOVERY The recovering server's next propose slot has been assigned NO_OP by the others. (Blue/yellow view: 0:v0, 1:no_op, 2:v2, 3:v3, 4:no_op, 5:v5, 6:v6, 7, 8. Green server: 0:v0, Index = 1, proposes v1 for slot 1.)

24 SERVER RECOVERY The recovering server's next propose slot has been assigned NO_OP by the others, so the value must be proposed again. (Green server, Index = 1, proposes v1 for slot 1 and then learns that slots 1 and 4 are no_op.)

25 SERVER RECOVERY The recovering server's next propose slot has been assigned NO_OP by the others, so the value is proposed again. (Green server: it learns slots 1 and 4 are no_op, its Index goes from 1 to 7, and it proposes v1 for slot 7. Log: 0:v0, 1:no_op, 4:no_op, 7:v1.)
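
A sketch of this recovery step: when the recovering server finds that its owned slots were already filled (e.g. with no_op by revocation), it re-suggests the pending value in its next free owned slot. The decided map and broadcast_suggest are assumptions carried over from the earlier sketches:

#include <cstdint>
#include <map>
#include <string>

void broadcast_suggest(uint64_t slot, const std::string& value);  // assumed primitive

// Re-propose a pending client value after our intended slots were revoked to no_op.
void repropose_after_revocation(const std::map<uint64_t, std::string>& decided,
                                uint64_t& next_slot, uint32_t n,
                                const std::string& pending_value) {
    while (decided.count(next_slot))              // skip owned slots others already filled
        next_slot += n;
    broadcast_suggest(next_slot, pending_value);  // suggest the same value in the new slot
    next_slot += n;
}

For green with next_slot = 1 and slots 1 and 4 already decided as no_op, this re-suggests the value at slot 7, matching the slide.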

26 OPTIMIZATION Worst case: only one server keeps proposing values while the other n - 1 servers are idle, so every suggestion forces the idle servers to SKIP their slots with no_op. (Example: v0 and v3 are proposed, Index goes from 0 to 3; a suggestion v for slot 1 is accepted and no_op is proposed for slot 0.)

27 OPTIMIZATION Worst case: only one server keeps proposing values while the other n - 1 servers are idle. Fact we can exploit: the channels are FIFO.

28 ACCEPT INCLUDES SKIP Due to FIFO channels, the leader knows that server 1 did not propose values for slots 1 and 4 before sending the ACK. (Blue: 0:v0, 1:?, 2:v2, 3:v3, 4:?, 5:v5; it proposes a value for slot 6 and receives the ACCEPT.)

29 ACCEPT INCLUDES SKIP Due to FIFO channels, the leader knows that server 1 did not propose values for slots 1 and 4 before sending the ACK, so the ACCEPT doubles as a SKIP. After the ACCEPT, the green server updates its next propose slot to above 6. (Log becomes: 0:v0, 1:no_op, 2:v2, 3:v3, 4:no_op, 5:v5, 6:v6.)

30 PROPOSE INCLUDES SKIP Due to FIFO channels, a server knows the leader did not propose values for slots 0 and 3 before proposing for slot 6, so the PROPOSE doubles as a SKIP. (Green before: 0:?, 1:v1, 2:v2, 3:?, 4:v4, 5:v5.) After proposing, the leader updates its next propose slot to above 6. (Green's learned log: 0:no_op, 1:v1, 2:v2, 3:no_op, 4:v4, 5:v5, 6:v6.)
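
A hedged sketch of the implicit-skip idea from the last two slides: because channels are FIFO, a SUGGEST or ACCEPT for slot s from server q implies that q will never suggest in its own still-empty slots below s, so the receiver can mark them as no_op without an explicit SKIP. The decided map is the same assumed structure as above:

#include <cstdint>
#include <map>
#include <string>

// Mark every still-empty slot owned by `sender` below `slot` as no_op.
// Safe under FIFO channels: `sender` cannot later suggest in those slots.
void apply_implicit_skips(std::map<uint64_t, std::string>& decided,
                          uint32_t sender, uint64_t slot, uint32_t n) {
    for (uint64_t s = sender; s < slot; s += n)  // sender owns slots sender, sender+n, ...
        decided.emplace(s, "no_op");             // emplace never overwrites an existing decision
}

Applied to the ACCEPT example: when blue suggests at slot 6 and green's ACCEPT arrives, blue calls this with sender = green (1), filling slots 1 and 4 with no_op.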

31 REVOKE MORE OF A FAULTY SERVER'S SLOTS Don't revoke one slot at a time; a server can revoke several slots at once. How many more slots to revoke is a tuned parameter. (Figure: the faulty server's slots are progressively filled with no_op.)

32 STILL NEED SKIP MESSAGES When two or more servers are idle, the implicit skips carried on PROPOSE and ACCEPT are not enough. (Message diagram: PROPOSE/ACCEPT exchange between the active server and the others.)

33 STILL NEED SKIP MESSAGES (Blue: 0:v0, 1:no_op, 2:no_op, 3:v3, 4:no_op, 5:no_op, 6:v6. Green: 0:v0, 1:no_op, 2:?, 3:v3, 4:no_op, 5:?, 6:v6. Yellow: 0:v0, 1:?, 2:no_op, 3:v3, 4:?, 5:no_op, 6:v6.) The idle servers cannot commit slots 3 and 6 for each other. Limit the number of outstanding skipped slots by sending a SKIP message (tuned parameter α). Also send SKIP periodically (tuned parameter τ).
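
A sketch of the two accelerators mentioned here: buffer the skips a server intends to make and flush them as a single combined SKIP message once α of them are outstanding or the oldest has waited τ. The timer handling and broadcast_combined_skip are assumptions, and the defaults mirror the values quoted later in the deck:

#include <chrono>
#include <cstdint>
#include <vector>

void broadcast_combined_skip(const std::vector<uint64_t>& slots);  // assumed primitive

struct SkipAccelerator {
    std::vector<uint64_t> pending;                  // owned slots we intend to skip
    std::chrono::steady_clock::time_point oldest;   // creation time of the oldest pending skip
    std::size_t alpha = 20;                         // flush after this many outstanding skips
    std::chrono::milliseconds tau{50};              // or after this much time has passed

    void add(uint64_t slot) {
        if (pending.empty()) oldest = std::chrono::steady_clock::now();
        pending.push_back(slot);
        maybe_flush();
    }

    void maybe_flush() {                            // also called from a periodic timer
        if (pending.empty()) return;
        bool too_many = pending.size() >= alpha;
        bool too_old  = std::chrono::steady_clock::now() - oldest >= tau;
        if (too_many || too_old) {
            broadcast_combined_skip(pending);       // one SKIP message covering all pending slots
            pending.clear();
        }
    }
};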

34 OUT-OF-ORDER COMMIT DELAY A slot can commit only when all previous slots have committed, so a delay occurs under concurrent suggestions. (In the example, y is learned first, then x.)

35 CONDITION FOR COMMIT DELAY Server 1 sent PROPOSE for y before ACCEPTing y; Server 0 sent LEARN before ACCEPTing y. (Message diagram: S0 proposes x at slot 0 and learns x with no commit delay; S1 proposes y at slot 1 and learns x.)

36 CONDITION FOR x TO COMMIT BEFORE y If Server 0 sends PROPOSE for x after its ACCEPT of y, then x cannot be ordered before y. (Message diagram: P0 proposes x and learns x; P1 proposes y at slot 1.)

37 OUT-OF-ORDER COMMIT DELAY Commit delay happens only when a server sends an ACCEPT message to others between sending PROPOSE and LEARN. The commit delay is therefore at most one communication cycle. (Message diagrams: P0 proposes x at slot 0 or slot 1 while y is proposed concurrently; P0 then learns x.)

38 CHOOSING α, τ, AND β Recall: α: send if α SKIP messages are outstanding (Accelerator 1). τ: send if τ time has passed since the outstanding SKIP message was created (Accelerator 1). β: p revokes q's proposals in the range [C_q, I_p + 2β] if C_q < I_p + 2β (Optimization 3).

39 CHOOSING τ Should be large enough to amortize SKIP messages, but too large means extra commit delay. Mencius: τ = 50 ms. Accelerator 1 then generates at most 20 SKIP msg/s, and the extra delay is at most 50 ms, which can occur naturally anyway (packet loss, delay, etc.).

40 CHOOSING α Limits the number of outstanding SKIP messages before servers p and q catch up. If τ is large enough, the SKIP messages can be combined into one, reducing overhead by a factor of α. Mencius: α = 20, a 95% cost reduction.

41 CHOOSING β A large β means slow recoveries during false suspicion or failure, but the overhead of having a large β is negligible: update the index to the next available slot, and on SUGGEST the other replicas skip turns and catch up (Rule 2). Mencius: β = 100,000; see the paper for calculation details.
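
A small sketch of the β rule quoted on slide 38, where C_q is q's lowest unlearned slot as known to p and I_p is p's own index; revoke_range is an assumed primitive:

#include <cstdint>

void revoke_range(uint32_t target, uint64_t first_slot, uint64_t last_slot);  // assumed primitive

// p revokes suspected server q's turns in [C_q, I_p + 2*beta] when q has fallen behind.
void maybe_revoke(uint32_t q, uint64_t C_q, uint64_t I_p, uint64_t beta) {
    if (C_q < I_p + 2 * beta)
        revoke_range(q, C_q, I_p + 2 * beta);
}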

42 EVALUATION Mencius vs. traditional Paxos. DETER testbed, TCP, C++. API: PROPOSE(v), ONCOMMIT(v), ISCOMMUTE(u, v). Mencius only: out-of-order commit enabled. Nagle's algorithm. α = 20, τ = 50 ms, β = 100,000.
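
The slide names a C++ API with PROPOSE, ONCOMMIT, and ISCOMMUTE. A minimal sketch of what such an interface could look like follows; only the three operation names come from the slide, while the signatures and class name are guesses:

#include <string>

// Hypothetical shape of the replication library interface used in the evaluation.
class ReplicatedStateMachine {
public:
    virtual ~ReplicatedStateMachine() = default;

    // Submit a client request to be ordered and replicated.
    virtual void PROPOSE(const std::string& v) = 0;

    // Upcall invoked once v is committed; the application executes the request here.
    virtual void ONCOMMIT(const std::string& v) = 0;

    // Whether two requests commute, which enables out-of-order commit.
    virtual bool ISCOMMUTE(const std::string& u, const std::string& v) = 0;
};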

43 THROUGHPUT ρ = 4,000 (network-bound): Mencius 1,550 ops (82.7% utilization), Paxos 540 ops. ρ = 0 (CPU-bound): Paxos 6,000 ops (leader at 100% utilization, others at 50%); Mencius 9,000 ops (all at 100% utilization!). Fewer registers == lower throughput.

44 THROUGHPUT Figure 5: Mencius uses available bandwidth even when channels are asymmetric (A→*: 20 Mbps, B→*: 15 Mbps, C→*: 10 Mbps). Figure 6: Mencius is able to adapt to changing bandwidth.

45 THROUGHPUT UNDER FAILURE 3 servers, network-bound; failure after 30 seconds.

46 SCALABILITY

47 LATENCY 3-site clique topology, low to medium latency.

48 LATENCY

49 OTHER OPTIMIZATIONS Batch requests: higher throughput, but higher latency. Eliminate Phase 3 and broadcast ACCEPT: for Paxos this cuts the learning delay by 1, for Mencius it cuts the upper bound on delayed commit by 1; it increases message complexity (decreasing throughput if CPU-bound). Broadcast the body of requests and reach consensus on a unique request ID: not effective if CPU-bound.

50 RELATED WORK Consensus: Fast Paxos, CoReFP. Moving sequencer/leader: Totem, S protocol. Atomic broadcast: Zieliński, M-Consensus. High-throughput consensus / fault scalability: FSR, PBFT, Zyzzyva, Steward.

51 FUTURE WORK AND OPEN ISSUES Byzantine failures Coordinator allocation Sites with faulty servers

52 CONCLUSION High performance: higher throughput than Paxos (whether CPU- or network-bound) and better scalability. Suitable for wide-area applications, with at least Paxos-like commit latency.

53 THANK YOU Questions?
