Cluster Consensus When Aeron Met Raft. Martin Thompson

1 Cluster Consensus: When Aeron Met Raft. Martin Thompson

3 What does consensus mean?

5 con·sen·sus noun \ kən-ˈsen(t)-səs \
: general agreement : unanimity
: the judgment arrived at by most of those concerned
Source:

6 Consensus on what?

9 Raft in a Nutshell

10 Roles: Follower, Candidate, Leader

11 RPCs
1. RequestVote RPC: invoked by candidates to gather votes
2. AppendEntries RPC: invoked by the leader to replicate log entries and to heartbeat
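To make the RequestVote RPC concrete, here is a minimal sketch of how a node might decide whether to grant its vote, following the rules described in the Raft paper. It is not Aeron Cluster code: the class name `RaftVoteSketch`, the `NO_VOTE` marker, and the method signature are assumptions made for illustration, and persistence of the term and vote is left out.

```java
// Hypothetical, minimal model of Raft's RequestVote handling; not Aeron Cluster code.
final class RaftVoteSketch
{
    private static final int NO_VOTE = -1;

    private long currentTerm;
    private int votedFor = NO_VOTE; // candidate voted for in currentTerm, if any
    private final long lastLogIndex;
    private final long lastLogTerm;

    RaftVoteSketch(final long currentTerm, final long lastLogIndex, final long lastLogTerm)
    {
        this.currentTerm = currentTerm;
        this.lastLogIndex = lastLogIndex;
        this.lastLogTerm = lastLogTerm;
    }

    /** Returns true if the vote is granted to the candidate. */
    boolean onRequestVote(
        final long term,
        final int candidateId,
        final long candidateLastLogIndex,
        final long candidateLastLogTerm)
    {
        if (term < currentTerm)
        {
            return false; // request from a stale term
        }

        if (term > currentTerm)
        {
            currentTerm = term; // a newer term resets any earlier vote
            votedFor = NO_VOTE;
        }

        // The candidate's log must be at least as up to date as the local log.
        final boolean logUpToDate = candidateLastLogTerm > lastLogTerm ||
            (candidateLastLogTerm == lastLogTerm && candidateLastLogIndex >= lastLogIndex);

        if ((votedFor == NO_VOTE || votedFor == candidateId) && logUpToDate)
        {
            votedFor = candidateId; // at most one vote per term
            return true;
        }

        return false;
    }

    public static void main(final String[] args)
    {
        final RaftVoteSketch follower = new RaftVoteSketch(3, 10, 3);
        System.out.println("stale term granted: " + follower.onRequestVote(2, 1, 10, 3));
        System.out.println("newer term granted: " + follower.onRequestVote(4, 1, 10, 3));
        System.out.println("second candidate granted: " + follower.onRequestVote(4, 2, 10, 3));
    }
}
```

Granting at most one vote per term is what underpins the Election Safety guarantee: two candidates cannot both collect a majority in the same term.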

12 Safety Guarantees: Election Safety, Leader Append-Only, Log Matching, Leader Completeness, State Machine Safety

13 Monotonic Functions

14 Version all the things!

15 Clustering Aeron

16 Is it Guaranteed Delivery???

17 What is the Architect really looking for?

18 Need to know...

19 Guaranteed Processing

20-23 [Diagram build: five nodes labelled Client]

24 NIO Pain!

25 Do servers crash?

26
    FileChannel channel = null;
    try
    {
        channel = FileChannel.open(directory.toPath());
    }
    catch (final IOException ignore)
    {
        // ignored: a directory cannot be opened as a channel on every platform
    }

    if (null != channel)
    {
        channel.force(true);
    }

29 Directory Sync
    Files.force(directory.toPath(), true);
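As far as I am aware, `java.nio.file.Files` has no `force` method, so the one-liner above reads as the API one would like to have. A minimal sketch of a helper that approximates it with the `FileChannel` approach from the earlier slide follows. The class name `DirectorySync` and the choice to report failure as `false` are assumptions; unlike the slide's code, this sketch also treats a failing `force` call as "not supported" rather than letting the exception propagate.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; the name and return convention are illustrative only.
final class DirectorySync
{
    /**
     * Attempts to flush directory metadata (for example a newly created file entry) to storage.
     * Returns false if the directory could not be opened or forced.
     */
    static boolean syncDirectory(final Path directory)
    {
        try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ))
        {
            channel.force(true);
            return true;
        }
        catch (final IOException ex)
        {
            // Opening a directory as a channel is not supported everywhere.
            return false;
        }
    }

    public static void main(final String[] args) throws IOException
    {
        final Path dir = Files.createTempDirectory("dir-sync");
        Files.createFile(dir.resolve("segment.log"));
        System.out.println("directory synced: " + syncDirectory(dir));
    }
}
```

Whether the call succeeds depends on the operating system and file system, which is why the result is returned instead of assumed.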

30 Performance

31 Let's consider an RPC design approach

32-40 [Diagram build: five nodes labelled Client]

41 Concurrency and parallelism with Replicated State Machines?

42
1. Parallel is the opposite of Serial
2. Concurrent is the opposite of Sequential
3. Vector is the opposite of Scalar
- John Gustafson

43-49 Instruction Pipelining

Time ->
Fetch   Decode  Execute Retire
        Fetch   Decode  Execute Retire
                Fetch   Decode  Execute Retire
                        Fetch   Decode  Execute Retire

50-56 Pipeline

Time ->
Order     Log       Transmit  Commit    Execute
          Order     Log       Transmit  Commit    Execute
                    Order     Log       Transmit  Commit    Execute
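The overlap in the pipeline above can be sketched with a small model. It assumes an idealised pipeline in which every stage takes exactly one time step and item `i` enters at step `i`; the class and method names are hypothetical, and real stage durations will differ.

```java
// Idealised pipeline model: every stage is assumed to take exactly one time step.
final class PipelineSketch
{
    static final String[] STAGES = {"Order", "Log", "Transmit", "Commit", "Execute"};

    /** Stage occupied by an item at a time step, or null if the item is not in the pipeline. */
    static String stageAt(final int item, final int timeStep)
    {
        final int stageIndex = timeStep - item; // item i enters the first stage at step i
        return stageIndex >= 0 && stageIndex < STAGES.length ? STAGES[stageIndex] : null;
    }

    /** Time steps needed when the stages of successive items overlap. */
    static int pipelinedSteps(final int items)
    {
        return items == 0 ? 0 : STAGES.length + items - 1;
    }

    /** Time steps needed when each item finishes all stages before the next one starts. */
    static int serialSteps(final int items)
    {
        return STAGES.length * items;
    }

    public static void main(final String[] args)
    {
        final int items = 3;
        for (int item = 0; item < items; item++)
        {
            final StringBuilder row = new StringBuilder();
            for (int step = 0; step < pipelinedSteps(items); step++)
            {
                final String stage = stageAt(item, step);
                row.append(String.format("%-10s", stage == null ? "" : stage));
            }
            System.out.println(row.toString().stripTrailing());
        }
        System.out.println("pipelined steps: " + pipelinedSteps(items));
        System.out.println("serial steps: " + serialSteps(items));
    }
}
```

In this model the latency of a single item is unchanged, but throughput approaches one item per time step once the pipeline is full.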

57-65 [Diagram build: five nodes labelled Client]

66 NIO Pain!

67 ByteBuffer byte[] copies
    ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024);
    byteBuffer.putInt(index, value);

68 ByteBuffer byte[] copies
    ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024);
    byteBuffer.putBytes(index, bytes); // ByteBuffer declares no putBytes method
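`ByteBuffer` has an absolute `putInt(index, value)`, but no `putBytes(index, bytes)`. A common workaround, sketched below, is to write through a duplicate so that the original buffer's position is left untouched. The helper name `putBytes` and the class `AbsoluteBulkPut` are assumptions for illustration; I believe JDK 13 and later also offer an absolute bulk `put(int, byte[])`, which this sketch avoids so that it runs on older JDKs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing an absolute bulk put built from relative operations.
final class AbsoluteBulkPut
{
    /** Copies src into buffer starting at index without moving the buffer's own position. */
    static void putBytes(final ByteBuffer buffer, final int index, final byte[] src)
    {
        final ByteBuffer view = buffer.duplicate(); // shares content, has its own position and limit
        view.position(index);
        view.put(src);
    }

    public static void main(final String[] args)
    {
        final ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024);
        final byte[] bytes = "hello".getBytes(StandardCharsets.US_ASCII);

        putBytes(byteBuffer, 128, bytes);

        System.out.println("position after put: " + byteBuffer.position());
        System.out.println("first byte written: " + (char)byteBuffer.get(128));
    }
}
```

The duplicate is an extra allocation on every call, which is part of the pain the slide points at.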

70 How can Aeron help?

71 Message Index => Byte Index

72 Multicast, MDC, and Spy based Messaging

73 Counters => Bounded Consumption

74-75 Batching: Amortising Costs
[Chart: average overhead per item or operation in batch, axis from 0% to 100%]
System calls
Network round trips
Disk writes
Expensive computations
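The amortisation behind the chart can be worked through with a small calculation. It assumes a simple cost model in which a fixed cost (such as one system call or one disk write) is paid once per batch and a smaller cost is paid per item; the class name, method names, and the numbers in `main` are illustrative assumptions, not measurements.

```java
// Simple cost model: one fixed cost per batch plus a cost per item. Names are illustrative.
final class BatchingSketch
{
    /** Average cost per item when the fixed cost is shared by the whole batch. */
    static double costPerItem(final double fixedCost, final double perItemCost, final int batchSize)
    {
        if (batchSize < 1)
        {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }

        return fixedCost / batchSize + perItemCost;
    }

    /** Fixed overhead carried by each item, as a percentage of the overhead of a batch of one. */
    static double overheadPercent(final int batchSize)
    {
        if (batchSize < 1)
        {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }

        return 100.0 / batchSize;
    }

    public static void main(final String[] args)
    {
        for (final int batchSize : new int[] {1, 2, 4, 10, 100})
        {
            System.out.println(
                "batch=" + batchSize +
                " overhead=" + overheadPercent(batchSize) + "%" +
                " costPerItem=" + costPerItem(1000.0, 10.0, batchSize));
        }
    }
}
```

Most of the gain arrives with small batches: under this model a batch of ten already removes 90% of the per-item fixed overhead.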

76 Interesting Features

77 Timers

78 All state must enter the system as a message!

79 Timers
    public void foo()
    {
        // Decide to schedule a timer
        cluster.scheduleTimer(correlationId, cluster.timeMs() + TimeUnit.SECONDS.toMillis(5));
    }

    public void onTimerEvent(final long correlationId, final long timestampMs)
    {
        // Look up the correlationId associated with the timer
    }

82 Back Pressure and Stashed Work

83 Back Pressure
    public ControlledFragmentAssembler.Action onSessionMessage(
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final long clusterSessionId,
        final long correlationId)
    {
        final ClusterSession session = sessionByIdMap.get(clusterSessionId);
        if (null == session || session.state() == CLOSED)
        {
            // Unknown or closed session: consume the message and move on
            return ControlledFragmentHandler.Action.CONTINUE;
        }

        final long nowMs = cachedEpochClock.time();
        if (session.state() == OPEN && logPublisher.appendMessage(buffer, offset, length, nowMs))
        {
            session.lastActivity(nowMs, correlationId);
            return ControlledFragmentHandler.Action.CONTINUE;
        }

        // Not appended to the log: abort so the message is offered again on a later poll
        return ControlledFragmentHandler.Action.ABORT;
    }
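The CONTINUE and ABORT pattern in the handler above can be sketched without any Aeron types. In this hypothetical model a bounded log refuses appends when full, the handler answers ABORT, and the poller leaves the message at the head of the inbound queue for the next attempt. All names here (`BackPressureSketch`, `appendToLog`, `poll`) are assumptions for illustration, not the Aeron API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of back pressure via CONTINUE/ABORT; it does not use Aeron types.
final class BackPressureSketch
{
    enum Action { CONTINUE, ABORT }

    private final Deque<String> inbound = new ArrayDeque<>();
    private final Deque<String> log = new ArrayDeque<>();
    private final int logCapacity;

    BackPressureSketch(final int logCapacity)
    {
        this.logCapacity = logCapacity;
    }

    void offerInbound(final String message)
    {
        inbound.addLast(message);
    }

    /** Bounded append: fails instead of queueing without limit. */
    private boolean appendToLog(final String message)
    {
        if (log.size() >= logCapacity)
        {
            return false;
        }

        log.addLast(message);
        return true;
    }

    Action onMessage(final String message)
    {
        return appendToLog(message) ? Action.CONTINUE : Action.ABORT;
    }

    /** Consumes inbound messages until one is aborted; the aborted message stays queued. */
    int poll()
    {
        int consumed = 0;
        while (!inbound.isEmpty())
        {
            if (onMessage(inbound.peekFirst()) == Action.ABORT)
            {
                break;
            }

            inbound.removeFirst();
            consumed++;
        }

        return consumed;
    }

    /** Simulates downstream progress by removing the oldest log entry. */
    String drainOneFromLog()
    {
        return log.pollFirst();
    }

    int pendingInbound()
    {
        return inbound.size();
    }

    public static void main(final String[] args)
    {
        final BackPressureSketch sketch = new BackPressureSketch(2);
        sketch.offerInbound("a");
        sketch.offerInbound("b");
        sketch.offerInbound("c");

        System.out.println("consumed on first poll: " + sketch.poll());
        sketch.drainOneFromLog();
        System.out.println("consumed after drain: " + sketch.poll());
    }
}
```

Because the aborted message is never removed, ordering is preserved and no unbounded intermediate queue is needed.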

87-89 Log Replay and Snapshots
Distributed File System?
Aeron Archive Recorded Streams

90 Multiple s on the same stream

91-92 [Diagram build: five nodes labelled Client]

93 NIO Pain!

94-95 [Diagram build: MappedByteBuffer and DirectByteBuffer, with steps numbered 1 and 2]

96 In Closing

97 What's the Roadmap?

99 Questions?
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." - Leslie Lamport
