Introduction to riak_ensemble. Joseph Blomstedt Basho Technologies

1 Introduction to riak_ensemble Joseph Blomstedt Basho Technologies

2 riak_ensemble Paxos framework for scalable consistent system 2

3 node node node node node node node node 3

4 What about state? 4

5 App App App App Database 5

6 App App App App Riak Riak Riak Riak Riak Riak Riak Riak 6

7 What if I'm writing a database? 7

8 What about embedded state? 8

9 Mnesia! 9

10 {inconsistent_database, running_partitioned_network} 10

11 CAP Theorem 11

12 Consistency Availability Partition-tolerance 12

13 Consistency Availability Partition-tolerance 13

14 CP AP Consistency Availability Partition-tolerance 14

15 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 15

16 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 16

17 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 17

18 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 18

19 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 19

20 Node 1 Node 2 Node 3 Node 4 Node 5 client client client 20

21 Node 1 Node 2 Node 3 Node 4 Node 5 client client client client client 21

22 Eventual Consistency 22

23 A A A 23

24 A A A 24

25 A A A B C 25

26 A A A B C 26

27 A A A B C {B,C} {B,C} {B,C} 27

28 Write Once, Immutable, Last Write Wins, Business Rules, Sets/Counters/Maps 28
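As an illustration of the "Last Write Wins" strategy for collapsing siblings such as {B,C}, here is a minimal Erlang sketch. The module name `lww_sketch` and the `{Timestamp, Value}` sibling representation are assumptions made for this example; they are not taken from the slides or from Riak's actual object format.

```erlang
-module(lww_sketch).
-export([resolve/1]).

%% Each sibling is a hypothetical {Timestamp, Value} pair (an assumption
%% for this sketch, not Riak's real sibling representation).
-spec resolve([{integer(), term()}, ...]) -> term().
resolve(Siblings) ->
    %% Tuples compare element by element, so lists:max/1 picks the
    %% sibling with the highest timestamp (ties fall back to term order
    %% of the value).
    {_Timestamp, Value} = lists:max(Siblings),
    Value.
```

Calling `lww_sketch:resolve([{10, b}, {12, c}])` keeps `c` and silently discards `b`, which is the trade-off that makes last-write-wins simple but lossy.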

29 Consensus 29

30 quorum consensus chain replication virtual synchrony 30

31 quorum consensus chain replication virtual synchrony 31

32 Quorum Consensus: Paxos, ZK Atomic Broadcast, Raft 32

33 Paxos 33

34 34

35 Rinse/repeat for each request 35

36 2 round trips/request 36

37 Multi-Paxos 37

38 First Request 38

39 39

40 Each Additional Request 40

41 41

42 1 round trip/request (common case) 42

43 Problem: Shipping entire state each request is expensive 43

44 Solution: Paxos + Replicated Log 44

45 Problem: Now I have N problems 45

46 Log recovery, Log trimming, Rollup, Snapshots, Fault Recovery 46

47 47

48 Better Solution: Build log replication into protocol 48

49 Better Solution: ZK Atomic Broadcast, Raft 49

50 Zab 50

51 51

52 52

53 53

54 54

55 riak_zab 55

56 Raft 56

57 57

58 raftconsensus.github.io 58

59 rafter 59

60 riak_ensemble 60

61 riak_ensemble Paxos framework for scalable consistent system 61

62 Problem: Shipping entire state each request is expensive 62

63 Solution: Micro-states 63

64 Also solves: Scalability 64

65 Key/Value 65

66 Each key is independent state 66

67 Semantics 67

68 Conditional single key atomic operations 68

69 get/modify/put fails if object changed (e.g. concurrent put) 69
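The conditional single-key semantics can be sketched with a purely local store. Everything here is an assumption made for illustration: the module name `cas_sketch`, the map-based store, and the `{Epoch, Seq}` version pair are not the riak_ensemble client API, and no consensus is involved. The point is only that a put succeeds when the version the caller read is still current.

```erlang
-module(cas_sketch).
-export([new/0, read/2, put_if_unchanged/4]).

%% Hypothetical in-memory store: Key => {Epoch, Seq, Value}.
new() -> #{}.

%% Returns {Epoch, Seq, Value} or notfound.
read(Key, Store) ->
    maps:get(Key, Store, notfound).

%% Expected is the {Epoch, Seq} the caller saw when it read the key,
%% or notfound if the key did not exist at that time.
put_if_unchanged(Key, Expected, Value, Store) ->
    case version(Key, Store) of
        Expected ->
            {Epoch, Seq} = next_version(Expected),
            {ok, Store#{Key => {Epoch, Seq, Value}}};
        _Changed ->
            %% Someone else wrote in between: reject the put.
            {error, failed}
    end.

version(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, {Epoch, Seq, _Value}} -> {Epoch, Seq};
        error -> notfound
    end.

next_version(notfound) -> {1, 1};
next_version({Epoch, Seq}) -> {Epoch, Seq + 1}.
```

A caller that loses the race gets `{error, failed}` and has to re-read and retry, which mirrors the get/modify/put behaviour described on the slide.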

70 Design 70

71 Simple multi-paxos per key 71

72 1B keys = 1B consensus groups? 72

73 No 73

74 Partition keys across N consensus groups 74

75 Partition keys across N ensembles 75
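One simple way to partition keys across N ensembles is to hash each key into a fixed range. The sketch below assumes hash-based placement and a hypothetical `{ensemble, Index}` identifier; the slides do not say how riak_ensemble or Riak KV actually map keys to ensembles.

```erlang
-module(partition_sketch).
-export([ensemble_for/2]).

%% Map a key to one of N ensembles. erlang:phash2/2 returns an integer
%% in the range 0..N-1, and the same key always maps to the same index.
%% The {ensemble, Index} identifier is an assumption for this example.
-spec ensemble_for(term(), pos_integer()) -> {ensemble, non_neg_integer()}.
ensemble_for(Key, N) when is_integer(N), N > 0 ->
    {ensemble, erlang:phash2(Key, N)}.
```

Because every key deterministically lands in one ensemble, a billion keys only need N leaders and N epochs rather than a billion consensus groups.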

76 Ensembles emulate paxos per key 76

77 Each Ensemble: Elects leader, Establishes epoch, Supports get/put/modify 77

78 Establish a new epoch 78

79 79

80 consensus state: epoch, sequence, membership, leader 80

81 K/V objects: epoch, sequence, key, value 81

82 Put 82

83 83

84 84

85 2 roundtrips/put (worst), 1 roundtrip/put (best) 85

86 Get 86

87 87

88 88

89 2 roundtrips/get (worst), 0 roundtrips/get (best) 89

90 Leader abandons leadership if any quorum operation ever fails 90

91 Which forces new epoch to be established 91

92 Partial Writes 92

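93 failed partial write:
   before: epoch 2, replicas X (2), X (2), X (2)
   after:  epoch 3, replicas X (2), X (2), Y (2)

94 read / rewrite / reply X:
   before: epoch 3, replicas X (2), X (2), Y (2)
   after:  epoch 3, replicas X (3), X (3), Y (2)

95 read / repair / reply X:
   before: epoch 3, replicas X (3), X (3), Y (2)
   after:  epoch 3, replicas X (3), X (3), X (3)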
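The repair step on slides 94-95 can be sketched as "pick the newest object among the replies, then find the peers that hold something older". This sketch assumes the parenthesised numbers on the slides are per-object epochs, and it uses a hypothetical reply shape `{PeerId, {Epoch, Seq, Value}}`; neither is confirmed by the slides.

```erlang
-module(repair_sketch).
-export([latest/1, stale_peers/1]).

%% Replies is a non-empty list of {PeerId, {Epoch, Seq, Value}} tuples
%% (a hypothetical shape chosen for this example).
latest([{_Peer, First} | Rest]) ->
    lists:foldl(fun({_P, Obj}, Best) -> newer(Obj, Best) end, First, Rest).

%% Objects are ordered by {Epoch, Seq} only; the value is ignored.
newer({E1, S1, _} = A, {E2, S2, _} = B) ->
    case {E1, S1} > {E2, S2} of
        true -> A;
        false -> B
    end.

%% Peers whose object is older than the newest one need a repair write.
stale_peers(Replies) ->
    {Epoch, Seq, _Value} = latest(Replies),
    [Peer || {Peer, {E, S, _}} <- Replies, {E, S} < {Epoch, Seq}].
```

With replies `X (3), X (3), Y (2)` the rewritten X wins because it carries the newer epoch, and only the peer still holding Y is selected for repair.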

96 Architecture 96

97 riak_ensemble_sup... sup..._manager..._peer_sup..._..._peer 97

98 riak_kv_ensemble_peer ensemble riak_ensemble_backend 98

99
%% Initialization callback that returns initial module state.
-callback init(ensemble_id(), peer_id(), [any()]) -> state().

100
%% Create a new opaque key/value object using whatever
%% representation the defining module desires.
-callback new_obj(epoch(), seq(), key(), value()) -> obj().
%% Accessors to retrieve epoch/seq/key/value from an opaque object.
-callback obj_epoch(obj()) -> epoch().
-callback obj_seq  (obj()) -> seq().
-callback obj_key  (obj()) -> term().
-callback obj_value(obj()) -> term().
%% Setters for epoch/seq/value for opaque objects.
-callback set_obj_epoch(epoch(), obj()) -> obj().
-callback set_obj_seq  (seq(),   obj()) -> obj().
-callback set_obj_value(term(),  obj()) -> obj().
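A backend is free to choose any object representation as long as it provides these accessors and setters. The following sketch uses a plain record. The module name `obj_sketch` and the record layout are assumptions for illustration, and the module deliberately does not declare the behaviour so that it compiles without riak_ensemble on the code path.

```erlang
-module(obj_sketch).
-export([new_obj/4,
         obj_epoch/1, obj_seq/1, obj_key/1, obj_value/1,
         set_obj_epoch/2, set_obj_seq/2, set_obj_value/2]).

%% Hypothetical opaque object: one record holding all four fields.
-record(obj, {epoch, seq, key, value}).

new_obj(Epoch, Seq, Key, Value) ->
    #obj{epoch = Epoch, seq = Seq, key = Key, value = Value}.

%% Accessors
obj_epoch(#obj{epoch = Epoch}) -> Epoch.
obj_seq(#obj{seq = Seq}) -> Seq.
obj_key(#obj{key = Key}) -> Key.
obj_value(#obj{value = Value}) -> Value.

%% Setters return a new object; the original is left untouched.
set_obj_epoch(Epoch, Obj = #obj{}) -> Obj#obj{epoch = Epoch}.
set_obj_seq(Seq, Obj = #obj{}) -> Obj#obj{seq = Seq}.
set_obj_value(Value, Obj = #obj{}) -> Obj#obj{value = Value}.
```

Keeping the object opaque lets the peer update epoch and sequence during rewrites without knowing how the backend stores values.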

101
%% Callback for get operations. Responsible for sending a reply
%% to the waiting `from' process using {@link reply/2}.
-callback get(key(), from(), state()) -> state().
%% Callback for put operations. Responsible for sending a reply
%% to the waiting `from' process using {@link reply/2}.
-callback put(key(), obj(), from(), state()) -> state().

102
%% Callback for sync_request sent from a remote peer that wants to
%% sync with this peer. Responsible for sending a reply to the
%% waiting `from' peer using {@link reply/2}.
-callback sync_request(from(), state()) -> state().
%% Callback that should do whatever is necessary to bring this peer
%% up-to-date. Passed in a list of replies generated by `sync_request'
%% from a quorum of peers from each view. This callback can either
%% directly make the peer current and return `ok', or initiate some
%% longer lived background process and return `async', followed by
%% calling {@link sync_complete/1} or {@link sync_failed/1} when
%% finished/failed.
-callback sync([{peer_id(), any()}], state()) -> {ok, state()} |
                                                 {async, state()} |
                                                 {{error,_}, state()}.

103
%% Callback for periodic leader tick. This function is called
%% periodically by an elected leader. Can be used to implement
%% custom housekeeping.
-callback tick(epoch(), seq(), peer_id(), views(), state()) -> state().
-callback ping(state()) -> {ok | async | failed, state()}.

104 Clustering 104

105 gossip manager gossip state manager gossip manager state state 105

106 id: A; nodes: node1; ensembles: --; enabled: false 106

107 enable manager state 107

108 id: A; nodes: node1; ensembles: root: A; enabled: true 108

109 manager state peer_sup root (peer) 109

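110 Cluster state on two nodes (left | right):
    id: A | B
    nodes: node1 | node2
    ensembles: root: A | --
    enabled: true | false

111 Cluster state on two nodes (left | right):
    id: A | A
    nodes: node1 | node1
    ensembles: root: A | root: A
    enabled: true | true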

112 cluster cluster cluster Node 1 Node 2 Node 3 112

113 join cluster cluster cluster Node 1 Node 2 Node 3 113

114 cluster Node 1 Node 2 Node 3 114

115 Creating Ensemble 115

116 create ensemble directory directory directory manager manager manager root peer root peer root peer 116

117 directory directory directory manager manager manager root peer root peer root peer 117

118 directory directory directory manager manager manager root peer root peer root peer 118

119 directory directory directory manager manager manager root peer foo peer root peer foo peer root peer foo peer 119

120 election directory directory directory manager manager manager root peer foo peer root peer foo peer root peer foo peer 120

121 directory directory directory manager manager manager root peer foo peer root peer foo peer root peer foo peer 121

122 Membership 122

123 A B C -> A B C + A B D E -> A B D E 123
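During a membership change an ensemble has more than one view, and the `sync` callback above speaks of "a quorum of peers from each view". A minimal sketch of that check follows, assuming a simple majority quorum per view; the module name `views_sketch` and the list-of-lists representation are hypothetical.

```erlang
-module(views_sketch).
-export([quorum_met/2]).

%% Views is a list of views, each a list of peer ids.
%% Acks is the list of peers that acknowledged the operation.
%% The operation counts only if every view has a majority of acks.
quorum_met(Views, Acks) ->
    lists:all(fun(View) -> majority(View, Acks) end, Views).

majority(View, Acks) ->
    Hits = length([Peer || Peer <- View, lists:member(Peer, Acks)]),
    Hits >= length(View) div 2 + 1.
```

With views `[a,b,c]` and `[a,b,d,e]`, acknowledgements from `a` and `b` satisfy the old view (2 of 3) but not the new one (3 of 4 needed), so the operation must also reach `d` or `e` before it counts.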

124 riak_ensemble Paxos framework for scalable consistent system 124

125 Questions? 125
