Intuitive distributed algorithms with F#


1 Intuitive distributed algorithms with F# Natallia Dzenisenka

2 A tour of a variety of intuitive distributed algorithms used in practical distributed systems, and how to prototype them with F# for better understanding

3 Why distributed algorithms?

4 Sorting algorithms

5 Not everyone needs to know... But it can be really useful

6 Things Will Break

7 No blind troubleshooting

8 TUNING

9 Edge cases... or even creating your own new distributed system *** WARNING! ***

10 Riak

11 Cassandra

12 HBase

13 Abstract words actually mean concrete algorithms

14 Liveness Safety

16 Not "to be or not to be": we will look at algorithms that provide different types of guarantees for different purposes

17 A system should be ready for events, changes and failures! A node can die [FAILURE DETECTION]. A node can join [GOSSIP, ANTI-ENTROPY]. Recovery from failure [SNAPSHOTS, CHECKPOINTING]. And a whole bunch of other things!

18 Contents: ✓ Intro → Consistency Levels • Gossip • Consensus • Snapshot • Failure Detectors • Distributed Broadcast • Summary

19 Consistency Levels

20 Good situation: every replica agrees, Balance = $100.

21 Bad situation: most replicas hold $100, but a stale replica returns Balance = $0.

22 Good news: we can choose the level of consistency

23 More consistency, less availability

24 Let's look at different consistency levels and how they affect a distributed system

25 The weakest consistency level: a writer client writes to any node, and a reader client reads from any node.

26 Weak Consistency. Advantages: high availability, low latency. Trade-offs: low consistency, high propagation time, no guarantee of successful update propagation.

27 Reducing propagation latency: the node that takes the write propagates it to more nodes right away; clients still write to and read from any node.

28 NODE FAILS

29 Hinted Handoff: if a replica is unavailable, the node that took the write stores a hint and replays the update once the replica recovers, giving faster update propagation after failures. Clients still write to and read from any node.

30 But hints can be lost: the coordinator fails, or the hint file is corrupted.

31 Read Repair: on a read, the coordinator gathers hashes of the value from the replicas and repairs any stale replicas it finds. Clients still write to and read from any node.

32 Can we get stronger consistency?

33 Quorum: a write must be acknowledged by several nodes before it succeeds, and a read consults several nodes as well.

34 Quorum. N = the number of replicas; W = the number of replicas to which a client synchronously writes; R = the number of replicas from which a client reads. [WRITE QUORUM] W > N/2. [READ QUORUM] W + R > N. Following these rules ensures that at least one replica returns fresh data during read operations.
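
To make the quorum arithmetic concrete, here is a minimal F# check of the two rules; N, W and R are the slide's parameters, and the concrete numbers are illustrative:

```fsharp
// W > N/2 prevents two conflicting write quorums, and W + R > N
// guarantees every read quorum overlaps the latest write quorum.
let isWriteQuorumValid n w = w > n / 2
let readSeesLatestWrite n w r = w + r > n

// Example: N = 5 replicas, synchronous writes to W = 3, reads from R = 3.
printfn "Write quorum valid: %b" (isWriteQuorumValid 5 3)     // true
printfn "Read overlaps write: %b" (readSeesLatestWrite 5 3 3) // true
```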

35 Quintuplets and approximate estimation of Cassandra partition size

36 How to prevent inconsistency?

37 Centralization

38 Consensus... to be continued

39 Consistency levels in real distributed systems: Cassandra and Riak use a weak consistency protocol for metadata exchange, with hinted handoff and read repair to improve consistency. You can choose read and write quorums in Cassandra. HBase uses master/slave asynchronous replication; ZooKeeper writes are also performed by the master.

40 Contents: ✓ Intro ✓ Consistency Levels → Gossip • Consensus • Snapshot • Failure Detectors • Distributed Broadcast • Summary

41 Gossip

42 Common use cases of Gossip: failure detection (determining which nodes are down); solving inconsistency; metadata exchange; cluster membership (i.e. changes in distributed DB topology: new, failed or recovered nodes); and many more.

43 Use Gossip when the system needs to be fast and scalable, fault-tolerant, and extremely decentralized, or when it has a huge number of nodes.

44 General gossip: an endless process in which each node periodically selects N other nodes and sends information to them

45 General gossip Node #1 Node #2 Node #3 Node #4 Node #5 List of known nodes: #2 #3 #4 #5 List of known nodes: #1 #3 #4 #5 List of known nodes: #1 #2 #4 #5 List of known nodes: #1 #2 #3 #5 List of known nodes: #1 #2 #3 #4

46 Two replicas disagree: Key: 42, Value: status: green vs. Key: 42, Value: status: red

47 With timestamps attached: Key: 42, Value: status: green, Timestamp: ... vs. Key: 42, Value: status: red, Timestamp: ...
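
The actual timestamp values are not preserved in this transcription, but the resolution rule the slide builds toward is last-write-wins. A minimal F# sketch, with illustrative values and times:

```fsharp
// Last-write-wins: of two conflicting versions of a key, keep the one
// with the later timestamp.
type Versioned = { Value: string; Timestamp: System.DateTime }

let resolve (a: Versioned) (b: Versioned) =
    if a.Timestamp >= b.Timestamp then a else b

let green = { Value = "status: green"; Timestamp = System.DateTime(2017, 5, 1, 12, 0, 0) }
let red   = { Value = "status: red";   Timestamp = System.DateTime(2017, 5, 1, 12, 0, 5) }

printfn "Winner: %s" (resolve green red).Value  // status: red
```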

48 Anti-entropy repair: resolves inconsistently replicated data

49 Unlike read repair, which runs during reads (gathering hashes from the replicas), anti-entropy can run constantly, periodically, or be initiated manually.

50 Merkle trees for anti-entropy: a tree data structure with hashes of data at the leaves and hashes of children at the non-leaf nodes. Cassandra, Riak and DynamoDB implement anti-entropy using Merkle trees. When hashes are unequal, an exchange of the actual data takes place.

51 Merkle trees: Cassandra builds short-lived in-memory trees; Riak keeps on-disk persistent trees.
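
A minimal F# sketch of the structure itself (the hashing scheme and block layout here are illustrative, not Cassandra's or Riak's actual formats): leaves hash data blocks, inner nodes hash their children, and equal root hashes mean two replicas hold the same data.

```fsharp
open System.Security.Cryptography
open System.Text

// Hex-encoded SHA-256 of a string.
let sha256 (s: string) =
    use sha = SHA256.Create()
    sha.ComputeHash(Encoding.UTF8.GetBytes s)
    |> Array.map (sprintf "%02x")
    |> String.concat ""

type Merkle =
    | Leaf of hash: string
    | Node of hash: string * left: Merkle * right: Merkle

let hashOf = function Leaf h | Node (h, _, _) -> h

// Pair nodes level by level until a single root remains.
let rec combine nodes =
    match nodes with
    | [] -> failwith "no data blocks"
    | [root] -> root
    | _ ->
        nodes
        |> List.chunkBySize 2
        |> List.map (function
            | [l; r] -> Node (sha256 (hashOf l + hashOf r), l, r)
            | [single] -> single
            | _ -> failwith "unreachable")
        |> combine

let build blocks = blocks |> List.map (sha256 >> Leaf) |> combine

// Two replicas with one differing block: the root hashes differ, so the
// replicas descend into the tree to find which block to exchange.
let replicaA = build [ "k1=green"; "k2=blue" ]
let replicaB = build [ "k1=red"; "k2=blue" ]
printfn "In sync: %b" (hashOf replicaA = hashOf replicaB)  // false
```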

52 Rumor mongering: the message "Node #5 is down!" spreads from node to node, step by step.

53 Real world use cases of Gossip. Riak: communicates ring state and bucket properties around the cluster. Cassandra: propagates information about database topology and other metadata, and is used for failure detection. Consul: to discover new members and failures and to perform reliable and fast event broadcasts for events like leader election; uses Serf (based on SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol). Amazon S3: to propagate server state to the system.

54 NOT gossip: a single node sending to everyone else gets overloaded.

55 Trade-offs with Gossip: propagation latency may be high; there is a risk of spreading wrong information; weak consistency.

56 General Gossip example

58 CODE
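
The code from this slide is not reproduced in the transcription. Below is a minimal F# sketch of the general gossip round described above: each round, a node picks a few random peers and pushes what it knows to them. The node ids, fanout and state shape are illustrative assumptions, not the talk's original code.

```fsharp
// One push-style gossip round. States map node id -> that node's known
// key/value data; fanout is how many random peers a node contacts.
let random = System.Random()

let pickPeers fanout self nodes =
    nodes
    |> List.filter (fun n -> n <> self)
    |> List.sortBy (fun _ -> random.Next())  // cheap shuffle, fine for a sketch
    |> List.truncate fanout

let gossipRound fanout self (states: Map<int, Map<string, string>>) =
    let known = states.[self]
    pickPeers fanout self (states |> Map.toList |> List.map fst)
    |> List.fold (fun (acc: Map<int, Map<string, string>>) peer ->
        // Merge our entries into the peer's state (ours win on conflict).
        let merged = known |> Map.fold (fun m k v -> Map.add k v m) acc.[peer]
        Map.add peer merged acc) states

// Example: only node 1 knows the key at first.
let initial =
    Map.ofList
        [ 1, Map.ofList [ "status", "green" ]
          2, Map.empty; 3, Map.empty; 4, Map.empty; 5, Map.empty ]

let afterRound = gossipRound 2 1 initial
let informed = afterRound |> Map.filter (fun _ s -> s.ContainsKey "status")
printfn "Nodes that know the key after one round: %d" informed.Count
```

Repeating the round spreads the key to all nodes within a few iterations, which is exactly the epidemic-style propagation the slides describe.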

59 Contents: ✓ Intro ✓ Consistency Levels ✓ Gossip → Consensus • Snapshot • Failure Detectors • Distributed Broadcast • Summary

60 Consensus

61 "In a fully asynchronous system there is no consensus solution that can tolerate one or more crash failures, even when only requiring the non-triviality property." (the Fischer-Lynch-Paterson [FLP] result)

62 Real world applications of Consensus: distributed coordination services (i.e. log coordination); locking protocols; state machine and primary-backup replication; distributed transaction resolution; agreement to move to the next stage of a distributed algorithm; leader election for higher-level protocols; and many more.

63 PAXOS: a protocol for state-machine replication in an asynchronous environment that admits crash failures

66 Paxos roles: Proposers, Acceptors, Learners

67 Prepare request. The proposer chooses a new proposal number N and asks the acceptors to: 1. Promise to never accept a proposal with number < N. 2. Return the highest-numbered proposal with number < N that they have accepted, if any.

68 Accept request. If a majority of acceptors responded, the proposer sets X = the value of the highest-numbered proposal returned by the acceptors (or its own value, if none was returned). The proposer then issues an accept request with proposal number N and value X.
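
To make the two phases concrete, here is a minimal F# simulation of a single Basic Paxos round with in-memory acceptors and synchronous calls; it is a sketch of the message flow, not a fault-tolerant networked implementation.

```fsharp
type Acceptor =
    { mutable Promised: int
      mutable Accepted: (int * string) option }

let newAcceptor () = { Promised = -1; Accepted = None }

// Phase 1: prepare(n). Promise to ignore proposals numbered below n and
// report the highest-numbered proposal accepted so far, if any.
let prepare n (a: Acceptor) =
    if n > a.Promised then
        a.Promised <- n
        Some a.Accepted
    else None

// Phase 2: accept(n, v). Honored unless a higher-numbered prepare arrived.
let accept n v (a: Acceptor) =
    if n >= a.Promised then
        a.Promised <- n
        a.Accepted <- Some (n, v)
        true
    else false

let propose n value (acceptors: Acceptor list) =
    let majority = List.length acceptors / 2 + 1
    let promises = acceptors |> List.choose (prepare n)
    if List.length promises < majority then None
    else
        // Adopt the value of the highest-numbered accepted proposal if any
        // acceptor reported one; otherwise propose our own value.
        let x =
            match promises |> List.choose id with
            | [] -> value
            | accepted -> accepted |> List.maxBy fst |> snd
        let acks = acceptors |> List.filter (accept n x) |> List.length
        if acks >= majority then Some x else None

let cluster = List.init 5 (fun _ -> newAcceptor ())
printfn "Decided: %A" (propose 1 "X = 15" cluster)  // Some "X = 15"
```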

69 Multi Paxos. Deciding on one value: running one round of Basic Paxos. Deciding on a sequence of values: running Paxos many times... right?

70 Multi Paxos. Deciding on one value: running one round of Basic Paxos. Deciding on a sequence of values: running Paxos many times. Almost. We can optimize consecutive Paxos rounds by skipping the prepare phase, assuming a stable leader.

72 Cheap Paxos. Based on Multi-Paxos. To tolerate f failures it needs f + 1 acceptors (not 2f + 1, as in traditional Paxos), plus auxiliary acceptors to step in when acceptors fail.

73 Fast Paxos. Based on Multi-Paxos. Reduces end-to-end message delays for a replica to learn a chosen value. Needs 3f + 1 acceptors instead of 2f + 1.

74 Vertical Paxos. Reconfigurable: enables reconfiguration while the state machine is deciding on commands. Uses an auxiliary master for reconfiguration operations. A special case of the Primary-Backup protocol.

76 ZAB: ZooKeeper's Atomic Broadcast. Slightly stricter ordering guarantees: if the leader fails, the new leader cannot arbitrarily reorder uncommitted state updates or apply them starting from a different initial state.

77 Contents: ✓ Intro ✓ Consistency Levels ✓ Gossip ✓ Consensus → Snapshot • Failure Detectors • Distributed Broadcast • Summary

78 Snapshots

79 Prototyping a Chandy-Lamport Snapshot in F#

85 Assumptions: channels are FIFO; all nodes can reach each other; all nodes are available. Basic messages carry regular data; marker messages are used for snapshots. Any node can initiate a snapshot at any moment.

86 Initial state of the system: Lena $50, Maggie $0, Natasha $75.

87 Money in flight: Lena sends $5 to Natasha (Lena is now at $45), and Natasha sends $10 to Maggie (Natasha is now at $65); Maggie still holds $0.

88 Lena initiates the snapshot: she records her own state ($45) and sends markers to Maggie and Natasha while the $5 and $10 are still in flight.

89 Maggie receives Lena's marker: she records her own state ($0) and the channel Lena→Maggie = {empty}, then relays markers. Meanwhile Natasha has received the $5 (now at $70), and the $10 to Maggie is delayed in the network.

90 Natasha receives Lena's marker: she records her state ($70) and relays markers. Recorded so far: Lena $45, Maggie $0, Natasha $70, with channels Maggie→Lena, Lena→Maggie, Lena→Natasha and Maggie→Natasha all {empty}. The $10 message is still delayed in the network.

91 The delayed $10 reaches Maggie after she recorded her state but before Natasha's marker, so it belongs to the channel snapshot: messages to snapshot = {$10}.

92 The completed snapshot. Node states: Lena $45, Maggie $0, Natasha $70. Channel states: Lena→Maggie = {empty}, Lena→Natasha = {empty}, Maggie→Lena = {empty}, Maggie→Natasha = {empty}, Natasha→Lena = {empty}, Natasha→Maggie = {$10}. Total: $45 + $0 + $70 + $10 = $125, matching the initial $50 + $0 + $75.

94 CODE
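
The F# prototype shown on these slides is not reproduced in the transcription. Below is a minimal sketch of the per-node marker rule the walkthrough illustrates; it models only the receiving side (relaying markers onward is noted but omitted), and the message and channel shapes are illustrative assumptions.

```fsharp
type Message =
    | Basic of amount: int  // regular data (money transfers here)
    | Marker                // snapshot marker

type Snapshot =
    { OwnState: int option               // own balance, captured on the first marker
      Recording: Map<string, int list> } // in-flight messages per incoming channel

let start channels =
    { OwnState = None
      Recording = channels |> List.map (fun c -> c, []) |> Map.ofList }

// Handle one incoming message on `channel`; `balance` is the node's
// current balance. Returns the updated snapshot state.
let onReceive balance channel msg (snap: Snapshot) =
    match msg, snap.OwnState with
    | Marker, None ->
        // First marker seen: record own state; this channel's snapshot is
        // empty and closed. (A real node would now relay markers onward.)
        { OwnState = Some balance
          Recording = Map.remove channel snap.Recording }
    | Marker, Some _ ->
        // Marker on another channel: stop recording that channel.
        { snap with Recording = Map.remove channel snap.Recording }
    | Basic amount, Some _ when snap.Recording.ContainsKey channel ->
        // Basic message after the snapshot began, before this channel's
        // marker: it belongs to the channel's snapshot.
        { snap with
            Recording = snap.Recording |> Map.add channel (amount :: snap.Recording.[channel]) }
    | _ -> snap  // snapshot not started yet, or channel already closed

// Maggie's view from the walkthrough: Lena's marker arrives first, then
// the delayed $10 from Natasha, then Natasha's marker.
let s1 = start [ "Lena"; "Natasha" ] |> onReceive 0 "Lena" Marker
let s2 = s1 |> onReceive 0 "Natasha" (Basic 10)
let s3 = s2 |> onReceive 0 "Natasha" Marker
printfn "Maggie's recorded balance: %A" s3.OwnState                       // Some 0
printfn "Natasha -> Maggie channel snapshot: %A" s2.Recording.["Natasha"] // [10]
```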

96 Contents: ✓ Intro ✓ Consistency Levels ✓ Gossip ✓ Consensus ✓ Snapshot → Failure Detectors • Distributed Broadcast • Summary

97 Failure detectors

98 Failure detectors are critical for: leader election; agreement (e.g. Paxos); reliable broadcast; and more.

99 How are faults detected?

100 Heartbeat failure detection: node #1 sends "I am healthy!" to node #2 every 300 ms. When node #1 fails, no message arrives within the timeout (500 ms), so #2 adds it to the suspected nodes list: #1.
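
A minimal F# sketch of this timeout rule; the node names and the 300/500 ms figures follow the slide's example and are otherwise illustrative:

```fsharp
open System

// A node is suspected once its last heartbeat is older than the timeout.
let suspects (timeout: TimeSpan) (now: DateTime) (lastHeartbeat: Map<string, DateTime>) =
    lastHeartbeat
    |> Map.filter (fun _ last -> now - last > timeout)
    |> Map.toList
    |> List.map fst

let now = DateTime.UtcNow
let heartbeats =
    Map.ofList
        [ "#1", now - TimeSpan.FromMilliseconds 500.0   // silent for 500 ms
          "#2", now - TimeSpan.FromMilliseconds 100.0 ] // heard recently

printfn "Suspected nodes: %A" (suspects (TimeSpan.FromMilliseconds 300.0) now heartbeats)
// ["#1"]
```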

101 Completeness. Strong: eventually, all failed processes are suspected by all correct ones. Weak: eventually, all failed processes are suspected by at least one correct one. (In the example, #2 and #3 have failed: under strong completeness every correct node eventually lists both; under weak completeness at least one correct node does.)

102 A detector that suspects every process as failed (here #1 lists Failed: #2, #3, #4, #5) trivially satisfies completeness, which is why completeness alone is not enough.

103 Accuracy comes in four flavors: strong, weak, eventually strong, eventually weak. Accuracy restricts the mistakes: it limits the number of false positives, or, more precisely, limits the number of nodes that are mistakenly suspected by the correct nodes.

104 Accuracy. Strong: no correct process is ever suspected. Weak: at least one correct process is never suspected (in the example, correct node #5 isn't in any suspected list, ever).

105 Accuracy. Eventually strong: after some point in time, no correct process is suspected. Eventually weak: after some point in time, at least one correct process is never suspected. (Nodes could have had wrong suspicions in the past.)

106 Failure detector classes

107 Perfect Failure Detector: strong completeness + strong accuracy. Possible only in a synchronous system. Uses timeouts (the waiting time for a heartbeat to arrive).

109 Eventually Perfect Failure Detector: strong completeness + eventually strong accuracy. Timeouts are adjusted (e.g. toward the maximum or average observed delay); eventually the timeout grows large enough that all suspicions are correct.

110 Eventually Perfect Failure Detector: with a 300 ms timeout, node #1's heartbeat fails to arrive in time and #1 lands on the suspected list. When its heartbeat shows up after 700 ms, #1 is removed from the list and the timeout is raised to 700 ms.
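
The adjustment rule itself is tiny; a one-line sketch in F#, using the slide's 300 ms and 700 ms figures:

```fsharp
// When a suspected node turns out to be alive, raise the timeout to the
// observed delay so the same mistake is eventually never repeated.
let adjustTimeout currentMs observedDelayMs = max currentMs observedDelayMs

printfn "New timeout: %d ms" (adjustTimeout 300 700)  // 700
```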

111 Failure Detectors in Consensus

112 Failure detectors are present in almost every distributed system

114 SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol. Strong completeness and configurable accuracy. 1. The initiator picks a random node from its membership list and sends a message to it. 2. If the chosen node doesn't acknowledge in time, the initiator sends a ping request to M other randomly selected nodes, asking them to message the suspicious node. 3. All of them try to get an acknowledgement from the suspicious node and forward it to the initiator. If none of the M nodes gets an acknowledgement, the initiator marks the node as dead and disseminates information about the failed node. Choosing a bigger M increases accuracy.
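
A minimal F# sketch of one SWIM probe round, with a stand-in ping function in place of real network I/O; the member names and the helper count M are illustrative assumptions:

```fsharp
let swimRandom = System.Random()

// One probe: direct ping first, then an indirect ping through M randomly
// chosen helpers. Returns true if the target appears alive.
let probe (ping: string -> string -> bool) m self target members =
    if ping self target then true
    else
        members
        |> List.filter (fun n -> n <> self && n <> target)
        |> List.sortBy (fun _ -> swimRandom.Next())
        |> List.truncate m
        |> List.exists (fun helper -> ping helper target)
        // if this is false, mark the target dead and disseminate via gossip

// Example: node #5 answers no one, so direct and indirect pings both fail.
let ping _ target = target <> "#5"
printfn "#5 alive: %b" (probe ping 3 "#1" "#5" [ "#1"; "#2"; "#3"; "#4"; "#5" ])  // false
```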

115 Contents: ✓ Intro ✓ Consistency Levels ✓ Gossip ✓ Consensus ✓ Snapshot ✓ Failure Detectors → Distributed Broadcast • Summary

116 Distributed Broadcast: sending a message to every node in the cluster

117 Messages can be lost and nodes can fail. Various broadcast algorithms let us choose between the consistency of the system and its availability.

118 Best-effort Broadcast: ensures delivery of the message to all correct processes, provided the sender doesn't fail during the exchange. Messages are sent over perfect point-to-point links (no message loss, no duplicates). In the example, one node sends X = 15 to all the others.

119 Reliable Broadcast: if any correct node delivers the message, all correct nodes should deliver the message. The normal scenario results in O(N) messages; the worst case (processes failing one after another) takes O(N) steps with O(N²) messages. Uses a failure detector.

120 Each message carries the sender's ID: node #1 broadcasts X = 15 tagged with Sender: #1.

121 Nodes constantly check on the sender: "Are you ok?"

122 If a failure is detected, the message is relayed: here node #5 relays X = 15 (now tagged Sender: #5). Nodes can even fail one by one; the remaining nodes will detect each failure and relay the message.

123 If the failure detector marks the sender as failed when it is not, unnecessary messages are sent, which hurts performance but not correctness. However, if the detector fails to report a real failure, needed messages won't be sent.

124 Some reliable broadcasts do not use failure detectors What if a node processes a message, and then fails?...

125 Uniform Reliable Broadcast: if any node (no matter whether correct or crashed) delivers the message, all correct nodes should deliver the message. Works with a failure detector, so nodes don't wait forever for replies from failed nodes.

126 All-Ack Algorithm: the node that initiated the broadcast sends the message to all nodes. Each node that gets the message relays it and, for now, considers it pending (tracking, e.g., Got X = 15 from: #1).

127 Each node keeps track of the nodes from which it has received the current message (e.g. Got X = 15 from: #1, #5). When a process has received the current message from all of the correct nodes, it considers the message delivered.
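
The delivery condition is easy to state in F#; a minimal sketch with illustrative node ids:

```fsharp
// Deliver once the message has been received (relayed) from every
// correct node.
let canDeliver (correctNodes: Set<string>) (receivedFrom: Set<string>) =
    Set.isSubset correctNodes receivedFrom

let correctNodes = set [ "#1"; "#2"; "#3"; "#4"; "#5" ]
printfn "Deliver: %b" (canDeliver correctNodes (set [ "#1"; "#5" ]))  // false, still pending
printfn "Deliver: %b" (canDeliver correctNodes correctNodes)          // true
```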

128 Contents: ✓ Intro ✓ Consistency Levels ✓ Gossip ✓ Consensus ✓ Snapshot ✓ Failure Detectors ✓ Distributed Broadcast → Summary

129 Now I know distributed algorithms

130 Thank you and reach out! Natallia Dzenisenka
