Surviving congestion in geo-distributed storage systems

1 Surviving congestion in geo-distributed storage systems. Brian Cho (University of Illinois at Urbana-Champaign), Marcos K. Aguilera (Microsoft Research Silicon Valley)

2 Geo-distributed data centers. Web applications are increasingly deployed across geo-distributed data centers (e.g., social networks, online stores, messaging). App data is replicated across data centers for disaster tolerance and access locality.

3 Congestion between geo-distributed data centers. Bandwidth between data centers is limited (e.g., leased lines, MPLS VPN). Bandwidth is expensive: ~K $/Mbps [SprintMPLS], so links are provisioned for typical (not peak) usage. Each data center has many machines.

4 Congestion delay between geo-distributed data centers. Congestion can cause significant delays: TCP messaging delay increases to the order of seconds (see figure), as observed across Amazon EC2 data centers [Kraska et al]. Users do not tolerate delays (<s) [Nielsen].
FIGURE: RPC round trip delay under congestion (0-30s).

5 Replication techniques applied to geo-distributed data centers.
Weak consistency (e.g., Amazon Dynamo, Yahoo PNUTS, COPS): good performance, because updates can be propagated asynchronously; semantics are undesirable in some cases (e.g., writes get re-ordered across replicas).
Strong consistency (e.g., ABD, Paxos; available in Google Megastore, Amazon SimpleDB): avoids the many problems of weak consistency, but must wait for updates to propagate across data centers, so app delay requirements are difficult to meet under congestion.

6 Contributions. Vivace: a strongly consistent key-value store that is resilient to congestion across geo-distributed data centers.
Approach: new algorithms send a small amount of critical information across data centers in separate prioritized messages.
Challenges: still provide strong consistency; keep prioritized messages small; avoid delay overhead in the absence of congestion.

8 Vivace algorithms. Enhance previous strongly consistent algorithms; prioritize a small amount of critical information across sites. Two algorithms:
1. Read/write algorithm: very simple; based on the traditional quorum algorithm [ABD]; linearizable read() and write(); read() contains a write-back phase.
2. State machine replication algorithm: more complex, details in paper.

9-11 Traditional quorum algorithm: write. The client sends <WRITE,key,val,ts> to the replicas; val is large compared with key and ts. The replicas reply with <ACK-WRITE>, and the write is done.

12-16 Traditional quorum algorithm: read. Step 1: the client sends <READ,key>, and the replicas reply with <ACK-READ,val,ts>, which carries the large val. Step 2 (writeback): the client sends <WRITE,key,val,ts>, carrying the large val again, and waits for <ACK-WRITE>. The writeback ensures strong consistency (linearizability). The read is then done.
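
The traditional write and read walked through above can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions, not the implementation from the talk: the names Replica, check_quorum, quorum_write, and quorum_read are hypothetical, messages are modeled as method calls, timestamps are plain integers supplied by the caller, and `reachable` stands for the indices of the replicas that answer in time.

```python
class Replica:
    """In-memory stand-in for one storage replica (hypothetical)."""

    def __init__(self):
        self.store = {}  # key -> (ts, val)

    def on_write(self, key, val, ts):
        # <WRITE,key,val,ts>: keep val only if ts is newer than what is stored.
        if ts > self.store.get(key, (0, None))[0]:
            self.store[key] = (ts, val)
        return "ACK-WRITE"

    def on_read(self, key):
        # <READ,key>: reply <ACK-READ,val,ts> with the stored pair.
        ts, val = self.store.get(key, (0, None))
        return val, ts


def check_quorum(replicas, reachable):
    # A majority of replicas must answer before an operation can complete.
    if len(reachable) < len(replicas) // 2 + 1:
        raise RuntimeError("no majority quorum reachable")


def quorum_write(replicas, reachable, key, val, ts):
    check_quorum(replicas, reachable)
    for i in reachable:
        replicas[i].on_write(key, val, ts)  # the large val crosses the link


def quorum_read(replicas, reachable, key):
    check_quorum(replicas, reachable)
    # Phase 1: collect <ACK-READ,val,ts> and pick the largest timestamp.
    replies = [replicas[i].on_read(key) for i in reachable]
    val, ts = max(replies, key=lambda reply: reply[1])
    # Phase 2 (writeback): resend the large val so a quorum stores it.
    quorum_write(replicas, reachable, key, val, ts)
    return val
```

Note that the large val travels three times per read in this sketch (once in the reply, then again in the writeback), which is the cost the slides point out.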

17-23 Vivace: write. The write uses a new quorum of local replicas. Step 1: val is sent locally in <W-LOCAL,key,val,ts>, which is acknowledged with <ACK-W-LOCAL>. Step 2 (prioritized): <W-TS,key,ts> carries no val, so it is a small message; it is acknowledged with <ACK-W-TS>. The write is then done: replicas 1, 2, 3 have a consistent view of key and ts, but no val (yet). Step 2*: <W-REMOTE,key,val,ts> lets replicas 1, 2, 3 add val to their consistent view of key and ts. val is still large, but not in the critical path.
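
The Vivace write steps above can be sketched as follows. This is a minimal in-memory sketch based only on the messages named on the slides, not the authors' code: the Replica class, vivace_write, and the `background` queue are hypothetical, the timestamp is supplied by the caller, and every simulated replica answers, so waiting for a quorum of acknowledgments is not modeled.

```python
class Replica:
    """In-memory stand-in for a replica (hypothetical)."""

    def __init__(self):
        self.ts = {}    # key -> largest timestamp seen
        self.data = {}  # (key, ts) -> val

    def on_w_ts(self, key, ts):
        # <W-TS,key,ts>: small message, records only the timestamp.
        self.ts[key] = max(self.ts.get(key, 0), ts)
        return "ACK-W-TS"

    def on_w_data(self, key, val, ts):
        # <W-LOCAL,...> or <W-REMOTE,...>: stores the large val.
        self.data[(key, ts)] = val
        return "ACK"


def vivace_write(local_replicas, replicas, key, val, ts, background):
    # Step 1: send val to the local replicas only (cheap local round trip).
    for r in local_replicas:
        r.on_w_data(key, val, ts)
    # Step 2: prioritized, small <W-TS,key,ts> to the replicas across sites.
    for r in replicas:
        r.on_w_ts(key, ts)
    # The write is done here. Step 2*: queue the large <W-REMOTE,...>
    # messages so they travel outside the critical path.
    for r in replicas:
        background.append(lambda r=r: r.on_w_data(key, val, ts))
```

A caller would drain `background` later, for example with `for send in background: send()`; until then the replicas across sites know key and ts but hold no val.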

24 Write comparison. Traditional algorithm: a remote RTT. Vivace algorithm: a prioritized remote RTT + a local RTT.

25-31 Vivace: read. Step 1 (prioritized): the client asks only for ts with <R-TS,key>; the replies <ACK-R-TS,ts> are small messages. Step 2: the client asks for the data with the largest ts using <R-DATA,key,ts>; the reply <ACK-R-DATA,val> carries the large val, but the client waits for only one reply (common case: local). Step 3 (prioritized): the writeback <W-TS,key,ts> carries only the small ts and is acknowledged with <ACK-W-TS>. The read is then done.
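
The three read steps above can be sketched in the same style. Again this is a minimal in-memory sketch under assumptions, not the authors' code: the Replica class and vivace_read are hypothetical names, values are stored per (key, ts) pair for simplicity, and every simulated replica answers, so quorum waiting is not modeled.

```python
class Replica:
    """In-memory stand-in for a replica (hypothetical)."""

    def __init__(self):
        self.ts = {}    # key -> largest timestamp seen
        self.data = {}  # (key, ts) -> val

    def on_r_ts(self, key):
        # <R-TS,key>: reply <ACK-R-TS,ts> with the timestamp only.
        return self.ts.get(key, 0)

    def on_r_data(self, key, ts):
        # <R-DATA,key,ts>: reply with val only if this exact version is held.
        return self.data.get((key, ts))

    def on_w_ts(self, key, ts):
        # <W-TS,key,ts>: small writeback of the timestamp.
        self.ts[key] = max(self.ts.get(key, 0), ts)
        return "ACK-W-TS"


def vivace_read(local_replicas, replicas, key):
    # Step 1 (prioritized): learn the largest timestamp, no val on the wire.
    ts = max(r.on_r_ts(key) for r in replicas)
    # Step 2: fetch val for that timestamp; one reply is enough, and the
    # local replicas are tried first because that is the common case.
    val = None
    for r in list(local_replicas) + list(replicas):
        val = r.on_r_data(key, ts)
        if val is not None:
            break
    # Step 3 (prioritized): write back only the timestamp.
    for r in replicas:
        r.on_w_ts(key, ts)
    return val
```

Only step 2 moves the large val, and in the common case it never leaves the local data center.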

32 Read comparison. Traditional algorithm: 2 remote RTTs. Vivace algorithm: 2 prioritized remote RTTs + a local RTT.
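
The write and read comparisons reduce to simple arithmetic over three round-trip times. The helper below is a hypothetical illustration of that arithmetic, assuming one remote RTT for the traditional write; the input values are made up for the example and are not measurements from the talk.

```python
def critical_path_delays(remote_rtt, prioritized_remote_rtt, local_rtt):
    """Per-operation delay implied by the RTT counts on the comparison slides."""
    return {
        "traditional_write": remote_rtt,
        "vivace_write": prioritized_remote_rtt + local_rtt,
        "traditional_read": 2 * remote_rtt,
        "vivace_read": 2 * prioritized_remote_rtt + local_rtt,
    }


# Illustrative inputs in milliseconds (assumed): a congested link where
# ordinary messages see a 5000 ms RTT and prioritized ones see 100 ms.
delays = critical_path_delays(
    remote_rtt=5000, prioritized_remote_rtt=100, local_rtt=1
)
```

The point of the model is that the Vivace delays depend only on the prioritized and local RTTs, so congestion on ordinary traffic does not appear in them.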

33 Evaluation topics. Practical prioritization setup. Delay with congestion: KV-store operations; Twitter clone web app operations. Delay without congestion: overhead of Vivace algorithms compared to traditional algorithms.

34 Evaluation setup. Cluster (Illinois) <-> Amazon EC2 (Ireland). DSCP bit prioritization on the local router's egress port; prioritization is applied there only. Congestion is generated with iperf.
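
The slide does not say how the sender marks its prioritized messages. One common way, shown here as an assumption rather than the setup used in the evaluation, is to set the DSCP field on the sending socket through the standard IP_TOS option. The function name and the choice of the Expedited Forwarding code point (46) are hypothetical, and the router still has to be configured separately to prioritize marked packets.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding code point; an assumed choice


def make_prioritized_socket(dscp=DSCP_EF):
    """Return a UDP socket whose outgoing packets carry the given DSCP value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # DSCP occupies the upper six bits of the former TOS byte,
    # so the code point is shifted left by two bits.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return sock
```

Small messages such as <W-TS,key,ts> would go out on a socket like this, while the large val messages would use an unmarked socket. Support for IP_TOS varies by operating system.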

35 Evaluation: does prioritization work in practice? Simple ping experiment: prioritized messages bypass congestion, so router-based prioritization is effective.

36 Evaluation: how well does Vivace perform under congestion?
FIGURE: KV-store operations: (a) read algorithms, (b) write algorithms, (c) state machine algorithms.
FIGURE: Twitter-clone operations: (a) post tweet, (b) read user timeline, (c) read friends timeline.

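37 Evaluation: how well does Vivace perform under congestion?
FIGURE: (a) Read algorithms. Annotations: the traditional algorithm takes 2 remote RTTs and suffers buffering delay and TCP resends on packet loss; the Vivace algorithm takes 2 prioritized remote RTTs + local RTT and avoids congestion delays.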

38 Evaluation: what is the overhead of Vivace without congestion? (Results in paper.) No measurable overhead compared to traditional algorithms; the extra message phases are not harmful.

39 Conclusion. Proposed two new algorithms: read/write (simple, in talk) and state machine (more complex, in paper). Both algorithms avoid delay due to congestion by prioritizing a small amount of critical information, while still providing strong consistency, keeping prioritized messages small, avoiding delay overhead in the absence of congestion, and using a practical prioritization infrastructure. Careful use of prioritized messages can be an effective strategy in geo-distributed data centers.
