How Eventual is Eventual Consistency?

Size: px

Start display at page:

Download "How Eventual is Eventual Consistency?"

Dina Bradford
6 years ago
Views:

1 Probabilistically Bounded Staleness How Eventual is Eventual Consistency? Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica (UC Berkeley) BashoChats 002, 28 February 2012

2 what: PBS consistency prediction why: how: weak consistency is fast measure latencies use WARS model

3 1. Fast 2. Scalable 3. Available

4 solution: replicate for 1. capacity 2. fault-tolerance

6 keep replicas in sync

7 keep replicas in sync slow

8 keep replicas in sync slow alternative: sync later

9 keep replicas in sync slow alternative: sync later inconsistent

10 consistency, latency contact more replicas, read more recent data consistency, latency contact fewer replicas, read less recent data

11 Dynamo: Amazon s Highly Available Key-value Store SOSP 2007 Apache, DataStax Project Voldemort

12 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

13 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator read( key ) client

14 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=3 Coordinator client

15 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

16 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator ( key, 1) client

17 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

18 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator read( key ) client

19 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=3 Coordinator client

20 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

21 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

22 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

23 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator ( key, 1) client

24 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator client

25 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator read( key ) client

26 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=1 Coordinator client send read to all

27 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator client send read to all

28 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

29 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

30 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

31 N replicas/key read: wait for R replies write: wait for W acks

32 R1( key, 1) R2( key, 1) R3( key, 1) Coordinator W=1

33 R1( key, 1) R2( key, 1) R3( key, 1) Coordinator W=1 write( key, 2)

34 R1( key, 1) R2( key, 1) R3( key, 1) write( key, 2) Coordinator W=1

35 R1( key, 2) R2( key, 1) R3( key, 1) ack( key, 2) Coordinator W=1

36 R1( key, 2) R2( key, 1) R3( key, 1) ack( key, 2) Coordinator W=1 ack( key, 2)

37 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) read( key )

38 R1( key, 2) R2( key, 1) R3( key, 1) read( key ) Coordinator W=1 Coordinator R=1 ack( key, 2)

39 R1( key, 2) R2( key, 1) R3( key, 1) ( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2)

40 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

41 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

42 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

43 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

44 R1 ( key, 2) R2 ( key, 2) R3( key, 1) ack( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

45 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ack( key, 2) ack( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

46 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

47 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

48 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ( key, 2) ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

49 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

50 if: then: else: W > N/2 R+W > N read (at least) last committed version eventual consistency

51 eventual consistency If no new updates are made to the object, eventually all accesses will return the last updated value W. Vogels, CACM 2008

52 How eventual? How long do I have to wait?

53 How consistent? What happens if I don t wait?

54 Riak Defaults N=3 R=2 W=2

55 Riak Defaults N=3 R=2 W=2 2+2 > 3

56 Riak Defaults N=3 Phew, I m safe! R=2 W=2 2+2 > 3

57 Riak Defaults N=3 R=2 W=2 2+2 > 3 Phew, I m safe!...but what s my latency cost?

58 Riak Defaults N=3 R=2 W=2 2+2 > 3 Phew, I m safe!...but what s my latency cost? Should I change?

59 R+W

60 R+W strong consistency

61 R+W strong consistency low latency

62 Cassandra: R=W=1, N=3 by default (1+1 3)

63 "In the general case, we typically use [Cassandra s] consistency level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, HBase vs Cassandra: why we moved February

66 Low Value Data n = 2, r = 1, w = 1 Consistency or Bust: Breaking a Riak Cluster Sunday, July 31, 11

67 Low Value Data n = 2, r = 1, w = 1 Consistency or Bust: Breaking a Riak Cluster Sunday, July 31, 11

68 Mission Critical Data n = 5, r = 1, w = 5, dw = 5 Consistency or Bust: Breaking a Riak Cluster Sunday, July 31, 11

69 Mission Critical Data n = 5, r = 1, w = 5, dw = 5 Consistency or Bust: Breaking a Riak Cluster Sunday, July 31, 11

70 LinkedIn very low latency and high availability : R=W=1, N=3 N=3 not required, some consistency : R=W=1, personal communication

71 Anecdotally, EC worthwhile for many kinds of data

72 Anecdotally, EC worthwhile for many kinds of data How eventual? How consistent?

73 Anecdotally, EC worthwhile for many kinds of data How eventual? How consistent? eventual and consistent enough

74 Can we do better?

75 Can we do better? Probabilistically Bounded Staleness can t make promises can give expectations

76 PBS is: a way to quantify latency-consistency trade-offs what s the latency cost of consistency? what s the consistency cost of latency?

77 PBS is: a way to quantify latency-consistency trade-offs what s the latency cost of consistency? what s the consistency cost of latency? a SLA for consistency

78 How eventual? t-visibility: consistent reads with probability p after after t seconds (e.g., 99.9% of reads will be consistent after 10ms)

79 Coordinator once per replica Replica

80 Coordinator once per replica write Replica

81 Coordinator once per replica write Replica ack

82 Coordinator once per replica write Replica wait for W responses ack

83 Coordinator once per replica write Replica wait for W responses ack t seconds elapse

84 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read

85 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response

86 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read wait for R responses response

87 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

88 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

89 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

90 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

91 T I M E N=2

92 write write T I M E N=2

93 write write T I M ack ack E N=2

94 write write T I M W=1 ack ack E N=2

95 write write T I M W=1 ack ack E N=2

96 write write T I M W=1 ack ack E read read N=2

97 write write T I M W=1 ack ack E read read response response N=2

98 write write T I M W=1 ack ack E read read R=1 response response N=2

99 write write T I M W=1 ack ack E read read R=1 response response N=2 good

100 T I M E N=2

101 write T write I M E N=2

102 write T write I M ack E N=2 ack

103 write T write I W=1 ack M E N=2 ack

104 write T write I W=1 ack M E N=2 ack

105 write T write I W=1 ack M E read read N=2 ack

106 write T write I W=1 ack M E read read response response N=2 ack

107 write T write I W=1 ack M E read read R=1 response response N=2 ack

108 write T write I W=1 ack M E read read R=1 response N=2 response bad ack

109 write T write I W=1 ack M E read read R=1 response N=2 response bad ack

110 write T write I W=1 ack M E read read R=1 response N=2 response bad ack

111 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

112 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2)

113 R1( key, 2) R2( key, 1) R3( key, 1) ( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

114 R1( key, 2) R2( key, 1) R3( key, 1) R3 replied before last write arrived! ( key, 1) write( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

115 Coordinator once per replica write Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

116 Coordinator once per replica write (W) Replica wait for W responses ack t seconds elapse read response is wait for R responses response stale if read arrives before write

117 Coordinator once per replica write (W) Replica wait for W responses (A) ack t seconds elapse read response is wait for R responses response stale if read arrives before write

118 Coordinator once per replica write (W) Replica wait for W responses (A) ack t seconds elapse wait for R responses read (R) response response is stale if read arrives before write

119 Coordinator once per replica write (W) Replica wait for W responses (A) ack t seconds elapse wait for R responses read (R) (S) response response is stale if read arrives before write

120 Solving WARS: hard Monte Carlo methods: easy

121 To use WARS: gather latency data run simulation Cassandra implementation validated simulations; available on Github

122 How eventual? t-visibility: consistent reads with probability p after after t seconds key: WARS model need: latencies

123 How consistent? What happens if I don t wait?

124 Probability of reading later older than k versions is exponentially reduced by k Pr(reading latest write) = 99% Pr(reading one of last two writes) = 99.9% Pr(reading one of last three writes) = 99.99%

125 Riak Defaults N=3 R=2 W=2 2+2 > 3 Phew, I m safe!...but what s my latency cost? Should I change?

126 LinkedIn 150M+ users built and uses Voldemort Yammer 100K+ companies uses Riak Thanks production latencies

127 N=3

128 N=3 10 ms

129 LNKD-DISK 99.9% consistent reads: R=2, W=1 t = 13.6 ms Latency: ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

130 LNKD-DISK 16.5% faster 99.9% consistent reads: R=2, W=1 t = 13.6 ms Latency: ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

131 LNKD-DISK 16.5% faster worthwhile? 99.9% consistent reads: R=2, W=1 t = 13.6 ms Latency: ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

132 N=3

133 N=3

134 N=3

135 LNKD-SSD 99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency: 1.32 ms 100% consistent reads: R=3, W=1 N=3 Latency: 4.20 ms Latency is combined read and write latency at 99.9th percentile

136 LNKD-SSD 59.5% faster 99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency: 1.32 ms 100% consistent reads: R=3, W=1 N=3 Latency: 4.20 ms Latency is combined read and write latency at 99.9th percentile

137 LNKD-SSD 59.5% faster better payoff! 99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency: 1.32 ms 100% consistent reads: R=3, W=1 N=3 Latency: 4.20 ms Latency is combined read and write latency at 99.9th percentile

138 Coordinator once per replica Replica wait for W responses write (W) (A) ack critical factor in staleness t seconds elapse wait for R responses read (R) (S) response response is stale if read arrives before write

139 Probability Density Function low variance Probability Latency

140 Probability Density Function low variance Probability high variance Latency

141 Coordinator once per replica Replica wait for W responses write (W) (A) ack SSDs reduce variance compared to disks! t seconds elapse wait for R responses read (R) (S) response response is stale if read arrives before write

142 YMMR 99.9% consistent reads: R=1, W=1 t = ms Latency: 43.3 ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

143 YMMR 81.1% faster 99.9% consistent reads: R=1, W=1 t = ms Latency: 43.3 ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

144 YMMR 81.1% faster even better payoff! 99.9% consistent reads: R=1, W=1 t = ms Latency: 43.3 ms 100% consistent reads: R=3, W=1 N=3 Latency: ms Latency is combined read and write latency at 99.9th percentile

145 Riak Defaults N=3 R=2 W=2 2+2 > 3 Phew, I m safe!...but what s my latency cost? Should I change?

146 Is low latency worth it?

147 Is low latency worth it? PBS can tell you.

148 Is low latency worth it? PBS can tell you. (and PBS is easy)

149

150 Workflow 1. Metrics 2. Simulation 3. Set N, R, W 4. Profit

151 what: PBS consistency prediction why: how: weak consistency is fast measure latencies use WARS model

152 R+W strong consistency low latency

153

154 PBS latency vs. consistency trade-offs fast and simple modeling large benefits be more

155 VLDB 2012 early print tinyurl.com/pbspaper cassandra patch github.com/pbailis/cassandra-pbs

156 Extra Slides

157 PBS and apps

158 staleness requires either: staleness-tolerant data structures timelines, logs cf. commutative data structures logical monotonicity asynchronous compensation code detect violations after data is returned; see paper write code to fix any errors cf. Building on Quicksand memories, guesses, apologies

159 asynchronous compensation minimize: (compensation cost) (# of expected anomalies)

160 Read only newer data? monotonic reads session guarantee) # versions tolerable staleness = client s read rate global write rate (for a given key)

161 Failure?

162 Treat failures as latency spikes

163 How l o n g do partitions last?

164 what time interval? 99.9% uptime/yr 8.76 hours downtime/yr 8.76 consecutive hours down bad 8-hour rolling average

165 what time interval? 99.9% uptime/yr 8.76 hours downtime/yr 8.76 consecutive hours down bad 8-hour rolling average hide in tail of distribution OR continuously evaluate SLA, adjust

166 Give me (and academia) failure data!

167 In paper: Closed-form analysis Monotonic reads Staleness detection Varying N WAN model Production latency data tinyurl.com/pbspaper

168 t-visibility depends on: 1) message delays 2) background version exchange (anti-entropy)

169 t-visibility depends on: 1) message delays 2) background version exchange (anti-entropy) anti-entropy: only decreases staleness comes in many flavors hard to guarantee rate Focus on message delays first

170 LNKD-SSD LNKD-DISK YMMR WAN 1.0 R=3 0.8 CDF Write Latency (ms) N=3 (LNKD-SSD and LNKD-DISK identical for reads)

171 LNKD-SSD LNKD-DISK YMMR WAN 1.0 W=3 0.8 CDF Write Latency (ms) N=3

172 N=3

173 N=3

174 N=3

175 Synthetic, Exponential Distributions

176 W 1/4x ARS Synthetic, Exponential Distributions

177 W 1/4x ARS Synthetic, Exponential Distributions W 10x ARS

178 N = 3 replicas R1 R2 R3 Write to W, read from R replicas

179 N = 3 replicas R1 R2 R3 Write to W, read from R replicas quorum system: guaranteed intersection { } R1 R2 R3 R=W=3 replicas { R2} { R3} { R3} } R1 R2 R1 R=W=2 replicas

180 N = 3 replicas R1 R2 R3 Write to W, read from R replicas quorum system: guaranteed intersection { } R1 R2 R3 R=W=3 replicas { R2} { R3} { R3} } R1 R2 R1 R=W=2 replicas partial quorum system: may not intersect { }{ }{ } R1 R2 R3 R=W=1 replicas

@ Twitter Peter Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica. UC Berkeley

@ Twitter Peter Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica. UC Berkeley PBS @ Twitter 6.22.12 Peter Bailis @pbailis Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica UC Berkeley PBS Probabilistically Bounded Staleness 1. Fast 2. Scalable 3. Available solution: