Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes

1 Tailwind: Fast and Atomic RDMA-based Replication Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes

2 In-Memory Key-Value Stores General-purpose in-memory key-value stores are widely used nowadays

3 In-Memory Key-Value Stores General-purpose in-memory key-value stores are widely used nowadays Recent systems can process millions of requests/second/machine, e.g., RAMCloud, FaRM, MICA

4 In-Memory Key-Value Stores General-purpose in-memory key-value stores are widely used nowadays Recent systems can process millions of requests/second/machine, e.g., RAMCloud, FaRM, MICA Key enablers: eliminating network overheads (e.g., kernel bypass) and leveraging multicore architectures

5 In-Memory Key-Value Stores CPU is becoming a bottleneck in modern kv-stores -> Random memory access, key-value GET/PUT processing

6 In-Memory Key-Value Stores CPU is becoming a bottleneck in modern kv-stores -> Random memory access, key-value GET/PUT processing Persistent in-memory kv-stores have to replicate data to survive failures

7 In-Memory Key-Value Stores CPU is becoming a bottleneck in modern kv-stores -> Random memory access, key-value GET/PUT processing Persistent in-memory kv-stores have to replicate data to survive failures

8 In-Memory Key-Value Stores CPU is becoming a bottleneck in modern kv-stores -> Random memory access, key-value GET/PUT processing Persistent in-memory kv-stores have to replicate data to survive failures -> Replication contends with normal request processing

9 In-Memory Key-Value Stores [Figure: a client issues PUT(Y) to a primary, which keeps copies of Y]

10 In-Memory Key-Value Stores [Figure: the primary stores copies of Y on two backups]

11 In-Memory Key-Value Stores [Figure: while serving PUT(Y), the server also receives a Replicate X request]

12 In-Memory Key-Value Stores [Figure: the server now holds X as a backup alongside its own data Y]

13 In-Memory Key-Value Stores [Figure: several clients issue PUT Y and GET Y while Replicate X requests keep arriving; the server's CPU handles both streams]

14 Replication in In-Memory Key-Value Stores Replication impedes modern kv-stores

15 Replication in In-Memory Key-Value Stores Replication impedes modern kv-stores and co-located applications (block storage, graph processing, stream processing)

16 Replication in In-Memory Key-Value Stores How to mitigate replication overhead?

17 Replication in In-Memory Key-Value Stores How to mitigate replication overhead? Techniques like Remote Direct Memory Access (RDMA) seem promising

18 Replication in In-Memory Key-Value Stores [Figure: the same workload, but Replicate X requests now target the server's NIC]

19 RDMA-Based Replication [Figure: the NIC DMAs the replicated data X directly into the server's memory while clients' PUT Y and GET Y requests proceed as before]

20 RDMA-Based Replication [Figure: replication in isolation; Replicate X flows through the NIC and is DMAed into the backup's memory without touching its CPU]

21 Existing Protocols Many systems use one-sided RDMA for replication: FaRM [NSDI'14], DARE [HPDC'15], HydraDB [SC'16] [Figure: the receiver polls a ring buffer (MSG1, MSG2, MSG3), performing an integrity check and testing whether each message was fully received] This still involves the backup's CPU; it defeats the purpose of RDMA!
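
To make the cost concrete, here is a minimal C++ sketch of the receiver-side polling these protocols rely on (the message layout and all names are illustrative, not taken from FaRM, DARE, or HydraDB): even though the payload arrives via one-sided RDMA, a backup core must still spin over the ring buffer, detect arrival, and verify integrity.

#include <cstdint>
#include <cstring>

// Hypothetical wire format: a header the sender writes ahead of each payload.
struct MessageHeader {
    uint32_t length;    // payload bytes that follow; 0 means "slot still empty"
    uint32_t checksum;  // integrity check computed by the sender
};

// Toy checksum standing in for whatever integrity check the system uses.
uint32_t computeChecksum(const uint8_t* data, uint32_t len) {
    uint32_t c = 0;
    for (uint32_t i = 0; i < len; ++i) c = c * 31 + data[i];
    return c;
}

// The backup burns CPU here: it polls the slot until a complete,
// integrity-checked message has landed in the ring buffer.
bool pollSlot(const uint8_t* slot) {
    MessageHeader hdr;
    std::memcpy(&hdr, slot, sizeof(hdr));             // re-read on every poll
    if (hdr.length == 0)
        return false;                                 // message not here yet
    const uint8_t* payload = slot + sizeof(hdr);
    if (computeChecksum(payload, hdr.length) != hdr.checksum)
        return false;                                 // partially received
    return true;                                      // fully received, valid
}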

22 Outline Context Existing RDMA-based replication protocols Tailwind's design Evaluation Conclusion

23 RDMA-based Replication Why not just perform RDMA and leave the target idle?

24 RDMA-based Replication Why not just perform RDMA and leave the target idle? PUT (A) Replicate (A) A

25 RDMA-based Replication Why not just perform RDMA and leave the target idle? PUT (A) Replicate (A) A

26 RDMA-based Replication Why not just perform RDMA and leave the target idle? Recover (A) A

27 RDMA-based Replication Why not just perform RDMA and leave the target idle? Recover (A) Replicate (A) A The backup cannot guess the length of the data; it needs some metadata

28 RDMA-based Replication Metadata A backup needs to ensure the integrity of the backed-up log data + the length of the data [Metadata: A@0x1, A=1kB, CRC=0x1] A

29 RDMA-based Replication Metadata The last checksum can cover all previous log entries [Metadata: B@0x2, B=2kB, CRC=0x2] A B

30 RDMA-based Replication Metadata The last checksum can cover all previous log entries [Metadata: C@0x4, C=1kB, CRC=0x3] A B C
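
A sketch of how the last checksum can cover the whole log prefix, assuming zlib's crc32() (the chaining idea is the slide's; the function name is mine): each entry's CRC is seeded with the running CRC of everything written before it, so verifying only the latest checksum validates all earlier entries too.

#include <zlib.h>    // link with -lz; crc32() is zlib's standard CRC-32
#include <cstddef>
#include <cstdint>

// Seed each entry's CRC with the running CRC of all earlier log bytes,
// so the most recent stored checksum covers the entire log prefix.
uint32_t appendToLogCrc(uint32_t runningCrc, const uint8_t* entry, size_t len) {
    return static_cast<uint32_t>(
        crc32(runningCrc, entry, static_cast<uInt>(len)));
}

// Usage: uint32_t crc = crc32(0L, Z_NULL, 0);    // empty-log seed
//        crc = appendToLogCrc(crc, a, aLen);     // CRC=0x1 covers A
//        crc = appendToLogCrc(crc, b, bLen);     // CRC=0x2 covers A and B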

31 RDMA-based Replication Second Try The primary writes the metadata to a remote buffer PUT (A) Replicate (A) A [A@0x1, A=1kB, CRC=0x1]

32 RDMA-based Replication Second Try The primary writes the metadata to a remote buffer PUT (A) Metadata A [A@0x1, A=1kB, CRC=0x1] Current RDMA implementations can only target a single, contiguous memory region

33 RDMA-based Replication Second Try The primary writes the metadata to a remote buffer PUT (A) [A@0x1, A=1kB, CRC=0x1] A 2x messages; plain RPCs would be more efficient

34 RDMA-based Replication Second Try The primary writes the metadata to a remote buffer PUT (B) [A@0x1, A=1kB, CRC=0x1] Replicate (B) A B

35 RDMA-based Replication Second Try The primary updates the metadata on the remote buffer Update Metadata A B [B@0x2, $##!!!, $##!!!]

36 RDMA-based Replication Second Try The primary fails in the midst of updating the metadata! Recover A B [B@0x2, $##!!!, $##!!!]

37 RDMA-based Replication Second Try Data is unusable! Recover A B [B@0x2, $##!!!, $##!!!]

38 RDMA-based Replication Third Try The primary replicates A and the corresponding metadata with a single RDMA write PUT (A) Replicate (A) A

39 RDMA-based Replication Third Try The primary replicates B and the corresponding metadata, right after A PUT (B) Replicate (B) A B

40 RDMA-based Replication Third Try The primary fully replicates C, but only partially replicates the metadata PUT (C) Replicate (C) A B C Corrupt metadata invalidates the entire backup log

41 RDMA-based Replication Fourth Try The primary replicates A and the corresponding metadata PUT (A) Replicate (A) A

42 RDMA-based Replication Fourth Try The primary partially replicates B, then fails PUT (B) Replicate (B) A B

43 RDMA-based Replication Fourth Try The backup checks whether objects were fully received A B

44 RDMA-based Replication Fourth Try The backup checks whether objects were fully received A B Only fully-received and correct objects are recovered!

45 Tailwind Keeps the same client-facing interface (RPCs) Targets strongly-consistent primary-backup systems Appends only a 4-byte CRC32 checksum after each record Relies on Reliable-Connected queue pairs: messages are delivered at most once, in order, and without corruption Assumes fail-stop failures
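
The framing rule can be sketched as follows (a non-authoritative sketch assuming zlib's crc32(); the buffer layout and names are mine, not RAMCloud's API): the primary appends each record to its local image of the backup buffer and follows it with a 4-byte CRC32 chained over everything written so far, then ships record + checksum in one RDMA WRITE.

#include <zlib.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Append one record followed by a 4-byte chained CRC32, mirroring the
// layout the primary later pushes to the backup with one RDMA WRITE.
// (Tailwind additionally forbids a stored checksum of zero; that rule
// appears with the failure scenarios below.)
void appendRecord(std::vector<uint8_t>& bufferImage, uint32_t& runningCrc,
                  const uint8_t* record, size_t len) {
    bufferImage.insert(bufferImage.end(), record, record + len);
    runningCrc = static_cast<uint32_t>(
        crc32(runningCrc, record, static_cast<uInt>(len)));
    uint8_t crcBytes[4];
    std::memcpy(crcBytes, &runningCrc, sizeof(crcBytes));
    bufferImage.insert(bufferImage.end(), crcBytes, crcBytes + 4);
}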

46 RDMA Buffers Allocation A primary chooses a backup and requests an RDMA buffer NIC Pre-registered buffers

47 RDMA Buffers Allocation A primary chooses a backup and requests an RDMA buffer RPC: Replicate (A) + Get Buffer; Ack + Buffer NIC A Pre-registered buffers

48 RDMA Buffers Allocation All subsequent replication requests are performed with one-sided RDMA WRITEs RDMA: Replicate (B) NIC A B

49 RDMA Buffers Allocation When the primary fills a buffer, it chooses a backup and repeats all the steps RPC: Replicate (C) + Get Buffer; Ack + Buffer NIC C A B Pre-registered buffers Buffer filled Buffers are asynchronously flushed to secondary storage, after which they can be reused
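
Putting the allocation steps together, a minimal primary-side sketch (all types and calls below are stand-ins, not RAMCloud's real API) looks like this: the first record to a backup travels in an RPC that also reserves a pre-registered buffer; subsequent records go out as one-sided RDMA WRITEs; a full buffer is sealed for asynchronous flushing and the cycle restarts with a fresh RPC.

#include <cstddef>
#include <cstdint>

// Stand-in types; in a real system these wrap verbs/RPC machinery.
struct Record { const uint8_t* data; size_t len; };
struct RdmaBuffer {
    size_t capacity = 1 << 23;   // e.g., an 8 MB pre-registered region
    size_t used = 0;
    size_t remaining() const { return capacity - used; }
};

// Stubs: an RPC that ships a record and reserves a buffer (backup CPU runs
// once), a one-sided write (backup CPU idle), and an async flush request.
RdmaBuffer* rpcReplicateAndGetBuffer(const Record&) { return new RdmaBuffer; }
void rdmaWrite(RdmaBuffer&, const Record&) {}
void sealForAsyncFlush(RdmaBuffer*) {}

void replicate(RdmaBuffer*& buf, const Record& r) {
    const size_t need = r.len + 4;              // record + trailing CRC32
    if (buf == nullptr) {
        buf = rpcReplicateAndGetBuffer(r);      // first record: RPC path
        buf->used = need;
        return;
    }
    if (buf->remaining() < need) {
        sealForAsyncFlush(buf);                 // flushed to storage later
        buf = rpcReplicateAndGetBuffer(r);      // pick a backup, new buffer
        buf->used = need;
        return;
    }
    rdmaWrite(*buf, r);                         // common case: one-sided
    buf->used += need;
}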

50 Tailwind: Failures Failures can happen at any moment RDMA complicates primary replica failures Secondary replica failures are naturally dealt with in storage systems

51 Failure scenarios: Fully Replicated Objects Case 1: The object + its metadata are correctly transferred PUT (B) Replicate (B) A B

52 Failure scenarios: Fully Replicated Objects Case 1: The object + its metadata are correctly transferred A B

53 Failure scenarios: Fully Replicated Objects Case 1: The object + its metadata are correctly transferred A B

54 Failure scenarios: Fully Replicated Objects Case 1: The object + its metadata are correctly transferred A B The last entry must always be a checksum + checksums are never allowed to be zero
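
These two invariants suggest a simple backup-side check, sketched here under the assumption that backups zero-fill buffers before handing them out (the helper name is mine): since a well-formed buffer must end in a checksum and a checksum is never zero, a zero in the checksum position proves the tail was never (fully) written.

#include <cstddef>
#include <cstdint>
#include <cstring>

// A zero-filled buffer plus "checksums are never zero" means any
// well-formed log tail ends in a non-zero 4-byte checksum.
bool tailLooksComplete(const uint8_t* buf, size_t usedBytes) {
    if (usedBytes < sizeof(uint32_t))
        return false;                         // nothing replicated yet
    uint32_t lastWord;
    std::memcpy(&lastWord, buf + usedBytes - sizeof(lastWord),
                sizeof(lastWord));
    return lastWord != 0;                     // zero == never written
}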

55 Failure scenarios: Partially Written Checksum Case 2: Partially transferred checksum A B

56 Failure scenarios: Partially Written Checksum Case 2: Partially transferred checksum A B

57 Failure scenarios: Partially Written Checksum Case 2: Partially transferred checksum A B Backups re-compute the checksum during recovery and compare it with the stored one

58 Failure scenarios: Partially Written Object Case 3: Partially transferred object A B Metadata acts as an end-of-transmission marker
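
Taken together, the three cases yield a recovery scan along these lines (a sketch assuming zlib's crc32(), a length-prefixed entry layout, and zero-filled buffers; none of these names are RAMCloud's): the backup walks the buffer entry by entry, recomputes the chained checksum, and recovers only the prefix up to the last entry whose stored, non-zero checksum matches.

#include <zlib.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed entry layout: [4-byte length][payload][4-byte chained CRC32].
// Returns how many bytes of the buffer are safe to recover.
size_t recoverablePrefix(const uint8_t* buf, size_t len) {
    size_t off = 0, good = 0;
    uLong crc = crc32(0L, Z_NULL, 0);                  // empty-log seed
    while (off + sizeof(uint32_t) <= len) {
        uint32_t entryLen;
        std::memcpy(&entryLen, buf + off, sizeof(entryLen));
        if (entryLen == 0)
            break;                                     // zeroed: end of log
        size_t end = off + sizeof(entryLen) + entryLen + sizeof(uint32_t);
        if (end > len)
            break;                                     // partial object (case 3)
        crc = crc32(crc, buf + off, sizeof(entryLen) + entryLen);
        uint32_t stored;
        std::memcpy(&stored, buf + end - sizeof(stored), sizeof(stored));
        if (stored == 0 || stored != static_cast<uint32_t>(crc))
            break;                                     // partial checksum (case 2)
        good = end;                                    // fully replicated (case 1)
        off = end;
    }
    return good;
}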

59 Outline Context Existing RDMA-based replication protocols Tailwind's design Evaluation Conclusion

60 Implementation Implemented in the RAMCloud in-memory kv-store Low latency, large scale, strongly consistent RPCs leverage fast networking and kernel bypass Keeps all data in memory, with durable storage for backups

61 RAMCloud Threading Architecture [Figure: (1) a client sends PUT(B) to M1; (2) a worker core on M1 stores B and issues Replicate(B); (3) the dispatch core pushes Replicate(B) through the NIC to backup M3; (4) M3's worker core copies B into a non-volatile buffer backed by storage; dispatch cores poll the NICs]

62 Evaluation Configuration Yahoo! Cloud Serving Benchmark (workloads A (50% PUT), B (5% PUT), and WRITE-ONLY) 20 million objects with 30-byte keys Requests generated with a Zipfian distribution RAMCloud replication vs. Tailwind (3-way replication)
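
For reference, a minimal sketch of a Zipfian key sampler of the kind such benchmarks use (this is a plain Zipfian over n keys, not YCSB's exact scrambled-Zipfian generator; theta = 0.99 is YCSB's default skew constant):

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Build a Zipfian distribution over keys 0..n-1: key i is drawn with
// probability proportional to 1 / (i+1)^theta.
std::discrete_distribution<size_t> makeZipf(size_t n, double theta = 0.99) {
    std::vector<double> weights(n);
    for (size_t i = 0; i < n; ++i)
        weights[i] = 1.0 / std::pow(static_cast<double>(i + 1), theta);
    return std::discrete_distribution<size_t>(weights.begin(), weights.end());
}

// Usage (small n; materializing 20 million weights this way is wasteful):
//   std::mt19937_64 rng(42);
//   auto zipf = makeZipf(100000);
//   size_t key = zipf(rng);   // hot keys are drawn far more often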

63 Evaluation Goals How many CPU cycles can Tailwind save? Does it improve performance? Is there any overhead?

64 Evaluation: CPU Savings [Plots: CPU core utilization (dispatch and worker) vs. throughput (KOp/s) for Tailwind and RAMCloud; values are aggregated over a 4-node cluster under the WRITE-ONLY workload; annotations mark up to 3x savings]

65 Evaluation: Latency Improvement [Plots: median and 99th-percentile latency (µs) vs. throughput (KOp/s) for Tailwind and RAMCloud; the median-latency plot is annotated 2x] Durable writes take 16 µs under heavy load

66 Evaluation: Throughput Throughput/server in a 4-node cluster [Plots: throughput (KOp/s) per server vs. number of clients for YCSB-B, YCSB-A, and WRITE-ONLY; annotated gains of 1.3x and 1.7x] Even with only 5% PUTs, Tailwind increases throughput by 30%

67 Evaluation: Recovery Overhead [Plot: recovery time (s) vs. recovery size (1M, 10M objects) for Tailwind and RAMCloud] Recoveries with up to 10 million objects

68 Related Work One-sided RDMA systems: Pilaf [ATC'13], HERD [SIGCOMM'14], FaRM [NSDI'14], DrTM [SOSP'15], DrTM+R [EuroSys'16] Mitigating replication overheads / tuning consistency: RedBlue [OSDI'12], Correctables [OSDI'16] Tailwind reduces the replication CPU footprint and improves performance without sacrificing durability, availability, or consistency

69 Conclusion Tailwind leverages one-sided RDMA to perform replication and leaves backups completely idle Provides backups with a protocol to protect against failure scenarios Reduces replication-induced CPU usage while improving performance and latency Tailwind preserves the client-facing RPC interface Thank you! Questions?
