PBS @ Twitter. Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica. UC Berkeley


1 PBS @ Twitter. Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica. UC Berkeley

2 PBS Probabilistically Bounded Staleness

3 1. Fast 2. Scalable 3. Available

4 solution: replicate for 1. request capacity 2. reliability

5 solution: replicate for 1. request capacity 2. reliability

6 solution: replicate for 1. request capacity 2. reliability

7

8 keep replicas in sync

9 keep replicas in sync

10 keep replicas in sync

11 keep replicas in sync

12 keep replicas in sync

13 keep replicas in sync

14 keep replicas in sync

15 keep replicas in sync slow

16 keep replicas in sync slow alternative: sync later

17 keep replicas in sync slow alternative: sync later

18 keep replicas in sync slow alternative: sync later

19 keep replicas in sync slow alternative: sync later inconsistent

20 keep replicas in sync slow alternative: sync later inconsistent

21 keep replicas in sync slow alternative: sync later inconsistent

22 keep replicas in sync slow alternative: sync later inconsistent

23 higher consistency, higher latency: contact more replicas, read more recent data; lower consistency, lower latency: contact fewer replicas, read less recent data

24 higher consistency, higher latency: contact more replicas, read more recent data; lower consistency, lower latency: contact fewer replicas, read less recent data

25 eventual consistency: "if no new updates are made to the object, eventually all accesses will return the last updated value" (W. Vogels, CACM 2008)

26 How eventual? How long do I have to wait?

27 How consistent? What happens if I don't wait?

28 PBS. problem: no guarantees with eventual consistency. solution: consistency prediction. technique: measure latencies; use the WARS model

29 Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007); Apache Cassandra (DataStax); Project Voldemort

30 Cassandra, Voldemort, and Riak users: Adobe, Rackspace, IBM, Spotify, Palantir, Twitter, Cisco, Netflix, Reddit, Morningstar, Digg, Mozilla, Rhapsody, Shazam, Gowalla, Soundcloud, LinkedIn, Gilt Groupe, Aol, GitHub, Ask.com, Yammer, Best Buy, Comcast, Boeing, JoyentCloud

31 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

32 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator read( key ) client

33 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=3 Coordinator client

34 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

35 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator ( key, 1) client

36 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

37 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator read( key ) client

38 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=3 Coordinator client

39 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

40 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

41 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator client

42 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=3 Coordinator ( key, 1) client

43 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator client

44 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator read( key ) client

45 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) read( key ) read R=1 Coordinator client send read to all

46 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator client send read to all

47 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

48 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

49 N = 3 replicas R1 R2 R3 ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) ( key, 1) read R=1 Coordinator ( key, 1) client send read to all

50 N replicas/key read: wait for R replies write: wait for W acks
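The read and write paths illustrated above boil down to a small amount of coordinator logic. Below is a minimal, single-process sketch in Python; the Replica and Coordinator classes and their method names are invented for illustration, not any particular system's API, and because the sketch is synchronous, "messages still in flight" are modeled by simply not delivering the write to the remaining replicas.

```python
import random

class Replica:
    """A single in-memory replica holding one versioned value per key."""
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        # Keep the incoming write only if it is newer than what we already have.
        if version > self.store.get(key, (0, None))[0]:
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))


class Coordinator:
    """Dynamo-style partial quorums: send to all N, wait for W acks / R replies."""
    def __init__(self, replicas, r, w):
        self.replicas, self.r, self.w = replicas, r, w

    def write(self, key, version, value):
        acks = 0
        for replica in self.replicas:        # send the write to the replicas
            replica.write(key, version, value)
            acks += 1
            if acks == self.w:               # return to the client after W acks;
                break                        # the rest stand in for messages still in flight
        return acks >= self.w

    def read(self, key):
        # Model "wait for the first R replies" by asking R arbitrary replicas,
        # then returning the newest version among them.
        replies = [rep.read(key) for rep in random.sample(self.replicas, self.r)]
        return max(replies, key=lambda vv: vv[0])


replicas = [Replica() for _ in range(3)]     # N = 3
c = Coordinator(replicas, r=1, w=1)
c.write("key", 1, "v1")
print(c.read("key"))                         # may hit a replica the write never reached
```

With R=W=1 the read can consult a replica that has not yet seen the write, which is exactly the staleness the rest of the deck quantifies.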

51 R1( key, 1) R2( key, 1) R3( key, 1) Coordinator W=1

52 R1( key, 1) R2( key, 1) R3( key, 1) Coordinator W=1 write( key, 2)

53 R1( key, 1) R2( key, 1) R3( key, 1) write( key, 2) Coordinator W=1

54 R1( key, 2) R2( key, 1) R3( key, 1) ack( key, 2) Coordinator W=1

55 R1( key, 2) R2( key, 1) R3( key, 1) ack( key, 2) Coordinator W=1 ack( key, 2)

56 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) read( key )

57 R1( key, 2) R2( key, 1) R3( key, 1) read( key ) Coordinator W=1 Coordinator R=1 ack( key, 2)

58 R1( key, 2) R2( key, 1) R3( key, 1) ( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2)

59 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

60 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

61 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

62 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

63 R1 ( key, 2) R2 ( key, 2) R3( key, 1) ack( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

64 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ack( key, 2) ack( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

65 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

66 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

67 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) ( key, 2) ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

68 R1 ( key, 2) R2 ( key, 2) R3 ( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

69 if R+W > N: strong consistency; else: eventual consistency
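As a quick sanity check (the helper name and example configurations are illustrative only):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects every write quorum."""
    return r + w > n

print(quorums_overlap(3, 2, 2))  # True  -> strong consistency (quorums intersect)
print(quorums_overlap(3, 1, 1))  # False -> eventual consistency (1 + 1 < 3)
print(quorums_overlap(3, 3, 1))  # True  -> reads see every acknowledged write
```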

70 R+W

71 higher R+W: strong consistency

72 higher R+W: strong consistency; lower R+W: lower latency

73 higher R+W: strong consistency; lower R+W: lower latency

74 higher R+W: strong consistency; lower R+W: lower latency

75 Cassandra: R=W=1, N=3 by default (1+1 < 3)

76 "In the general case, we typically use [Cassandra s] consistency level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, HBase vs Cassandra: why we moved February

77

78

79 NoSQL Primer (from "Consistency or Bust: Breaking a Riak Cluster", Sunday, July 31, '11)

80 Low Value Data: n = 2, r = 1, w = 1 (from "Consistency or Bust: Breaking a Riak Cluster", Sunday, July 31, '11)

81 Low Value Data: n = 2, r = 1, w = 1 (from "Consistency or Bust: Breaking a Riak Cluster", Sunday, July 31, '11)

82 Mission Critical Data: n = 5, r = 1, w = 5, dw = 5 (from "Consistency or Bust: Breaking a Riak Cluster", Sunday, July 31, '11)

83 Mission Critical Data: n = 5, r = 1, w = 5, dw = 5 (from "Consistency or Bust: Breaking a Riak Cluster", Sunday, July 31, '11)

84 LinkedIn: "very low latency and high availability": R=W=1, N=3; "N=3 not required, some consistency": R=W=1, N=2 (Alex Feinberg, personal communication)

85 Anecdotally, EC worthwhile for many kinds of data

86 Anecdotally, EC worthwhile for many kinds of data How eventual? How consistent?

87 Anecdotally, EC worthwhile for many kinds of data How eventual? How consistent? eventual and consistent enough

88 Can we do better?

89 Can we do better? Probabilistically Bounded Staleness: can't make promises, can give expectations

90 PBS is: a way to quantify latency-consistency trade-offs. What's the latency cost of consistency? What's the consistency cost of latency?

91 PBS is: a way to quantify latency-consistency trade-offs. What's the latency cost of consistency? What's the consistency cost of latency? An SLA for consistency.

92 How eventual? t-visibility: probability p of consistent reads after t seconds (e.g., 99.9% of reads will be consistent after 10 ms)

93 t-visibility depends on: 1) message delays 2) background version exchange (anti-entropy)

94 t-visibility depends on: 1) message delays, 2) background version exchange (anti-entropy). Anti-entropy only decreases staleness, comes in many flavors, and its rate is hard to guarantee. Focus on message delays.

95 focus on steady state; with failures: unavailable or sloppy

96-107 [Timing diagram, built up over several slides; time flows downward.] Coordinator and replica, once per replica: the coordinator sends the write; the replica acks; the coordinator waits for W ack responses; t seconds elapse; the coordinator sends the read; the replica responds; the coordinator waits for R responses. The response is stale if the read arrives at the replica before the write does.

108-116 [Example timeline, N=2, W=1, R=1; time flows downward.] Writes are sent to both replicas; the acks return; the write completes after the first ack (W=1); t elapses; reads are sent; both replicas respond; with R=1 the read returns the new value: good.

117-127 [Example timeline, N=2, W=1, R=1; time flows downward.] Writes are sent to both replicas; one ack arrives quickly and the write completes (W=1) while the second write is still in flight; reads are sent; the lagging replica answers the read before the write reaches it; with R=1 the read returns the old value: bad.

128 [Timing diagram recap.] Coordinator and replica, once per replica: write; ack; wait for W responses; t seconds elapse; read; wait for R responses; response. The response is stale if the read arrives at the replica before the write.

129 R1( key, 2) R2( key, 1) R3( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2)

130 R1( key, 2) R2( key, 1) R3( key, 1) ( key, 1) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

131 R1( key, 2) R2( key, 1) R3( key, 1) R3 replied before last write arrived! ( key, 1) write( key, 2) Coordinator W=1 Coordinator R=1 ack( key, 2) ( key,1)

132-136 [Timing diagram, annotated with the WARS latencies.] Coordinator and replica, once per replica: write (W); ack (A); the coordinator waits for W responses; t seconds elapse; read (R); response (S); the coordinator waits for R responses. The response is stale if the read arrives at the replica before the write.
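Spelled out, the condition the diagram encodes looks roughly like the following (notation assumed for this writeup: w_i, a_i, r_i, s_i are the sampled W, A, R, S latencies for replica i, with the write issued at time 0):

```latex
% The write returns to the client once the W-th fastest (write + ack) round trip finishes:
t_{\mathrm{commit}} = \text{$W$-th smallest value of } (w_i + a_i), \quad i = 1, \dots, N
% Replica i still holds the old value iff the read request reaches it before the write does:
t_{\mathrm{commit}} + t + r_i < w_i
% The read issued t seconds after commit is consistent iff at least one of the first R
% responders (ordered by r_i + s_i) does NOT satisfy the inequality above.
```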

137 Solving WARS: hard. Monte Carlo methods: easier.

138 To use WARS: gather latency data (W, A, R, S); run a simulation (Monte Carlo sampling).
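A minimal sketch of that simulation in Python, assuming exponential latency distributions as stand-ins for measured W, A, R, S data; the function and parameter names are made up here, and this is not the authors' released tooling:

```python
import random

def simulate_t_visibility(n, r, w, t, trials=100_000,
                          lat_w=lambda: random.expovariate(1 / 5.0),   # write latency (ms)
                          lat_a=lambda: random.expovariate(1 / 5.0),   # ack latency (ms)
                          lat_r=lambda: random.expovariate(1 / 5.0),   # read latency (ms)
                          lat_s=lambda: random.expovariate(1 / 5.0)):  # response latency (ms)
    """Estimate Pr[a read issued t ms after the write commits returns the new value]."""
    consistent = 0
    for _ in range(trials):
        ws = [lat_w() for _ in range(n)]                  # when the write reaches each replica
        acks = sorted(wi + lat_a() for wi in ws)
        commit = acks[w - 1]                              # write returns after the W-th ack
        reads = [(lat_r(), lat_s()) for _ in range(n)]
        # Order replicas by total read round trip and keep the first R responders.
        responders = sorted(range(n), key=lambda i: reads[i][0] + reads[i][1])[:r]
        # Consistent if any responder had already received the write
        # when the read request arrived at it.
        if any(commit + t + reads[i][0] >= ws[i] for i in responders):
            consistent += 1
    return consistent / trials

if __name__ == "__main__":
    for n, r, w in [(3, 1, 1), (3, 2, 1), (3, 3, 1)]:
        print((n, r, w), round(simulate_t_visibility(n, r, w, t=10.0), 4))
```

Swapping the exponential samplers for empirical samples of measured latencies gives the t-visibility estimates the talk reports.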

139 How eventual? t-visibility: consistent reads with probability p after t seconds. Key: the WARS model. Need: latencies.

140 How consistent? What happens if I don't wait?

141 The probability of reading data older than k versions decreases exponentially in k. Pr(reading latest write) = 99%; Pr(reading one of last two writes) = 99.9%; Pr(reading one of last three writes) = 99.99%.
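A hedged way to read those numbers: if a single read returns the latest version with probability p = 0.99 and staleness across successive versions is treated as independent (the assumption behind the slide), then:

```latex
\Pr[\text{read one of the last } k \text{ versions}] = 1 - (1 - p)^{k}
% k = 1: 1 - 0.01 = 0.99 \qquad k = 2: 1 - 0.01^{2} = 0.999 \qquad k = 3: 1 - 0.01^{3} = 0.9999
```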

142

143 Cassandra cluster, injected latencies: WARS simulation accuracy. t-staleness RMSE: 0.28%; latency N-RMSE: 0.48%.

144 LinkedIn: 150M+ users, built and uses Voldemort. Yammer: 100K+ companies, uses Riak. Production latencies fit Gaussian mixtures.

145 N=3

146 N=3 10 ms

147 LNKD-DISK (N=3). 99.9% consistent reads: R=2, W=1, t = 13.6 ms, latency: ms. 100% consistent reads: R=3, W=1, latency: ms. Latency is combined read and write latency at the 99.9th percentile.

148 LNKD-DISK (N=3): 16.5% faster. 99.9% consistent reads: R=2, W=1, t = 13.6 ms, latency: ms. 100% consistent reads: R=3, W=1, latency: ms. Latency is combined read and write latency at the 99.9th percentile.

149 LNKD-DISK (N=3): 16.5% faster. Worthwhile? 99.9% consistent reads: R=2, W=1, t = 13.6 ms, latency: ms. 100% consistent reads: R=3, W=1, latency: ms. Latency is combined read and write latency at the 99.9th percentile.

150 N=3

151 N=3

152 N=3

153 LNKD-SSD (N=3). 99.9% consistent reads: R=1, W=1, t = 1.85 ms, latency: 1.32 ms. 100% consistent reads: R=3, W=1, latency: 4.20 ms. Latency is combined read and write latency at the 99.9th percentile.

154 LNKD-SSD (N=3): 59.5% faster. 99.9% consistent reads: R=1, W=1, t = 1.85 ms, latency: 1.32 ms. 100% consistent reads: R=3, W=1, latency: 4.20 ms. Latency is combined read and write latency at the 99.9th percentile.

155 [Timing diagram.] Coordinator and replica, once per replica: write (W) and ack (A) (the critical factor in staleness); wait for W responses; t seconds elapse; read (R); response (S); wait for R responses. The response is stale if the read arrives at the replica before the write.

156-158 [Plot: CDF of write latency (ms), W=3, N=3, for LNKD-SSD and LNKD-DISK.]

159 [Timing diagram.] SSDs reduce variance compared to disks! Coordinator and replica, once per replica: write (W); ack (A); wait for W responses; t seconds elapse; read (R); response (S); wait for R responses. The response is stale if the read arrives at the replica before the write.

160 N=3

161 N=3

162 N=3

163 YMMR (N=3). 99.9% consistent reads: R=1, W=1, t = ms, latency: 43.3 ms. 100% consistent reads: R=3, W=1, latency: ms. Latency is combined read and write latency at the 99.9th percentile.

164 YMMR (N=3): 81.1% faster. 99.9% consistent reads: R=1, W=1, t = ms, latency: 43.3 ms. 100% consistent reads: R=3, W=1, latency: ms. Latency is combined read and write latency at the 99.9th percentile.

165

166 Workflow 1. Tracing 2. Simulation 3. Tune N, R, W 4. Profit
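Steps 2 and 3 amount to a parameter sweep over (R, W). A rough sketch, assuming the simulate_t_visibility function from the earlier block is in scope and using R+W as a crude stand-in for the latency cost (the real workflow would rank configurations by measured latency):

```python
def pick_config(n=3, t=10.0, target=0.999):
    """Return the cheapest (R, W) configuration meeting the consistency target at time t."""
    candidates = []
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            p = simulate_t_visibility(n, r, w, t)
            if p >= target:
                # r + w serves as a crude latency proxy for this sketch only.
                candidates.append((r + w, r, w, p))
    return min(candidates) if candidates else None

print(pick_config())  # e.g. (2, 1, 1, 0.9993) if R=W=1 already meets the SLA at t = 10 ms
```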

167

168

169 PBS. problem: no guarantees with eventual consistency. solution: consistency prediction. technique: measure latencies; use the WARS model

170 consistency is a metric: to measure, to predict

171 higher R+W: strong consistency; lower R+W: lower latency

172 R+W

173 PBS: latency vs. consistency trade-offs; simple modeling with WARS; model staleness in time and versions

174 PBS: latency vs. consistency trade-offs; simple modeling with WARS; model staleness in time and versions. Eventual consistency: often fast, often consistent. PBS helps explain when and why.

175 PBS: latency vs. consistency trade-offs; simple modeling with WARS; model staleness in time and versions. Eventual consistency: often fast, often consistent. PBS helps explain when and why.

176 VLDB 2012 early print: tinyurl.com/pbsvldb; Cassandra patch: tinyurl.com/pbspatch

177 Extra Slides

178 Related Work

179 Quorum system theory, e.g., probabilistic quorums, k-quorums. Deterministic staleness, e.g., TACT/conits, FRACS.

180 Consistency verification, e.g., Golab et al. (PODC '11), Bermbach and Tai (M4WSOC '11)

181 PBS and apps

182 staleness requires either: staleness-tolerant data structures (timelines, logs; cf. commutative data structures, logical monotonicity), or asynchronous compensation code: detect violations after data is returned (see paper) and write code to fix any errors (cf. Building on Quicksand: memories, guesses, apologies). A sketch of the second option follows.
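A minimal sketch of the asynchronous-compensation option (function names and the callback signature are invented for illustration): return a possibly stale value immediately, then re-read authoritatively in the background and invoke application-specific compensation if a newer version turns up.

```python
import threading

def read_with_async_compensation(coordinator, key, full_read, compensate):
    """Return a possibly-stale value now; fix things up later if a violation is detected.

    coordinator.read(key) -> (version, value) from a partial quorum (fast, maybe stale)
    full_read(key)        -> (version, value) from all replicas (slow, authoritative)
    compensate(key, stale, fresh) -> application-specific apology / repair code
    """
    version, value = coordinator.read(key)

    def check():
        fresh_version, fresh_value = full_read(key)
        if fresh_version > version:
            # Violation detected only after the data was already returned.
            compensate(key, (version, value), (fresh_version, fresh_value))

    threading.Thread(target=check, daemon=True).start()
    return value
```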

183 asynchronous compensation: minimize (compensation cost) × (# of expected anomalies)

184 Read only newer data? (monotonic reads session guarantee) # versions of tolerable staleness = client's read rate / global write rate (for a given key)

185 Failure?

186 Treat failures as latency spikes

187 How l o n g do partitions last?

188 what time interval? 99.9% uptime/yr = 8.76 hours downtime/yr; 8.76 consecutive hours down = bad 8-hour rolling average

189 what time interval? 99.9% uptime/yr = 8.76 hours downtime/yr; 8.76 consecutive hours down = bad 8-hour rolling average. Hide in tail of distribution OR continuously evaluate SLA and adjust.

190 [Plot: CDF of write latency (ms), W=3, N=3, for LNKD-SSD, LNKD-DISK, YMMR, WAN.]

191 [Plot: CDF of read latency (ms), R=3, N=3, for LNKD-SSD, LNKD-DISK, YMMR, WAN. LNKD-SSD and LNKD-DISK are identical for reads.]

192 Probabilistic quorum systems: $p_{\text{inconsistent}} = \binom{N-W}{R} \big/ \binom{N}{R}$
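A small check of that expression for N = 3, using Python's math.comb (these are per-read probabilities from the quorum-overlap formula alone; they ignore the time dimension that the WARS model adds):

```python
from math import comb

def p_inconsistent(n, r, w):
    """Probability a read quorum of size R misses all W replicas that saw the write."""
    return comb(n - w, r) / comb(n, r) if r <= n - w else 0.0

for r, w in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    print(f"N=3 R={r} W={w}: {p_inconsistent(3, r, w):.3f}")
# N=3 R=1 W=1: 0.667   N=3 R=1 W=2: 0.333   N=3 R=2 W=1: 0.333   N=3 R=2 W=2: 0.000
```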

193 How consistent? k-staleness: probability p of reading one of the last k versions

194 How consistent? $1 - \left(\binom{N-W}{R} \big/ \binom{N}{R}\right)^{k}$

195 How consistent? $1 - \left(\binom{N-W}{R} \big/ \binom{N}{R}\right)^{k}$
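Rearranging the same expression gives the k required to hit a target probability, which follows directly from the formula above (shown here for convenience):

```latex
% Valid when the quorums can fail to intersect, i.e. R + W <= N:
1 - \left(\frac{\binom{N-W}{R}}{\binom{N}{R}}\right)^{k} \ge p_{\mathrm{target}}
\quad\Longleftrightarrow\quad
k \ge \frac{\ln(1 - p_{\mathrm{target}})}{\ln\!\left(\binom{N-W}{R} \big/ \binom{N}{R}\right)}
```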

196 <k,t>-staleness: versions and time

197 <k,t>-staleness: versions and time approximation: exponentiate t-staleness by k

198 strong consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)

199 N = 3 replicas R1 R2 R3 Write to W, read from R replicas

200 N = 3 replicas: R1 R2 R3. Write to W, read from R replicas. Quorum system: guaranteed intersection. {R1 R2 R3} for R=W=3 replicas; {R1 R2} {R2 R3} {R1 R3} for R=W=2 replicas.

201 N = 3 replicas: R1 R2 R3. Write to W, read from R replicas. Quorum system: guaranteed intersection. {R1 R2 R3} for R=W=3 replicas; {R1 R2} {R2 R3} {R1 R3} for R=W=2 replicas. Partial quorum system: may not intersect. {R1} {R2} {R3} for R=W=1 replicas.

202 Synthetic, exponential distributions. N=3, W=1, R=1.

203 Synthetic, exponential distributions. N=3, W=1, R=1. W latency = 1/4× ARS.

204 Synthetic, exponential distributions. N=3, W=1, R=1. W latency = 1/4× ARS vs. W latency = 10× ARS.

205 concurrent writes: deterministically choose between ("key", 1) and ("key", 2) (Coordinator, R=2)
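A minimal sketch of "deterministically choose": last-writer-wins by timestamp with replica id as a tiebreaker is one common rule; the slide does not specify which rule is used, so treat this as illustrative.

```python
def resolve(versions):
    """Pick one winner among concurrent versions given as (timestamp, replica_id, value)."""
    return max(versions, key=lambda v: (v[0], v[1]))

print(resolve([(17, "R1", 1), (17, "R2", 2)]))  # -> (17, 'R2', 2): same timestamp, higher id wins
```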

206

207

208

209
