Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes



In-Memory Key-Value Stores
General-purpose in-memory key-value stores are widely used nowadays.
Recent systems can process millions of requests/second/machine, e.g. RAMCloud, FaRM, MICA.
Key enablers: eliminating network overheads (e.g., kernel bypass) and leveraging multicore architectures.

In-Memory Key-Value Stores
The CPU is becoming a bottleneck in modern kv-stores: random memory access and key-value GET/PUT processing.
Persistent in-memory kv-stores have to replicate data to survive failures, and replication contends with normal request processing.

In-Memory Key-Value Stores
[Figure: clients issue PUT(Y) and GET(Y) requests to servers, which at the same time must handle Replicate(X) traffic for the copies they back up.]
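
The flow above is the conventional RPC-based replication path that Tailwind later replaces. As a minimal sketch under assumed names (handlePut and sendReplicateRpc are not RAMCloud's actual API), a primary might handle a PUT like this:

```cpp
#include <future>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical backup handle: sendReplicateRpc() stands in for an RPC that the
// backup's CPU must receive, process, and acknowledge.
struct Backup {
    std::future<void> sendReplicateRpc(const std::string& key,
                                       const std::string& value) {
        (void)key; (void)value;
        // Stub: a real implementation would serialize the record and send it
        // over the network; here we return an already-satisfied future.
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }
};

// On the primary: apply the write locally, fan it out to every backup, and
// wait for all acks before answering the client. Both the primary's worker
// thread and every backup's CPU stay busy for the whole operation.
void handlePut(std::unordered_map<std::string, std::string>& table,
               std::vector<Backup>& backups,
               const std::string& key, const std::string& value) {
    table[key] = value;
    std::vector<std::future<void>> acks;
    acks.reserve(backups.size());
    for (auto& b : backups)
        acks.push_back(b.sendReplicateRpc(key, value));
    for (auto& a : acks)
        a.get();
    // ...acknowledge the client here.
}
```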

Replication in In-Memory Key-Value Stores
Replication impedes modern kv-stores and co-located applications (block storage, graph processing, stream processing).

Replication in In-Memory Key-Value Stores
How can we mitigate replication overhead? Techniques like Remote Direct Memory Access (RDMA) seem promising.

RDMA-Based Replication
[Figure: the same replication diagram, now with Replicate(X) requests handled by the backups' NICs and DMA-ed directly into their memory, while clients' PUT/GET requests are served as before.]

Existing Protocols
Many systems use one-sided RDMA for replication: FaRM [NSDI '14], DARE [HPDC '15], HydraDB [SC '16].
[Figure: messages MSG1, MSG2, MSG3 are RDMA-written into a ring buffer; the receiver polls the buffer, performs an integrity check, and detects when each message has been fully received.]
This still involves the backup's CPU, which defeats the purpose of RDMA!
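
To see why the backup's CPU stays busy in such designs, here is a rough sketch of receiver-side polling over an RDMA-written ring buffer. The MsgHeader layout and pollRingBuffer routine are hypothetical, not code from FaRM, DARE, or HydraDB:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout of a message the sender RDMA-writes into a ring buffer
// on the backup (not any system's actual wire format).
struct MsgHeader {
    uint32_t length;    // non-zero only once the message has fully landed
    uint32_t checksum;  // integrity check over the payload
};

// The backup's CPU must keep polling the ring, validating each message, and
// advancing its read cursor -- the very work one-sided RDMA was meant to avoid.
void pollRingBuffer(volatile uint8_t* ring, size_t ringSize, size_t& readPos) {
    for (;;) {
        auto* hdr = reinterpret_cast<volatile MsgHeader*>(ring + readPos);
        if (hdr->length == 0)
            continue;  // nothing complete yet; keep spinning
        // ...verify hdr->checksum over the payload and hand the data to the log...
        readPos = (readPos + sizeof(MsgHeader) + hdr->length) % ringSize;
    }
}
```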

Outline
Context, existing RDMA-based replication protocols, Tailwind's design, evaluation, conclusion.

RDMA-based Replication
Why not just perform RDMA and leave the target idle?
[Figure: the primary replicates A to the backup with a one-sided RDMA write; later, a Recover(A) request arrives at the backup.]
The backup cannot guess the length of the data; it needs some metadata.

RDMA-based Replication: Metadata
A backup needs to ensure the integrity of the backed-up log data and to know its length, so each entry carries metadata, e.g. A@0x1, A=1kB, CRC=0x1.
The last checksum can cover all previous log entries: after appending B (CRC=0x2) and C (CRC=0x3), the latest checksum vouches for the whole log.
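
To make the chained-checksum idea concrete, here is a minimal sketch of appending a record plus per-entry metadata to an in-memory image of the backup log. The EntryMetadata layout and helper are illustrative assumptions, not Tailwind's actual format; zlib's crc32() stands in for any CRC32 implementation:

```cpp
#include <cstdint>
#include <vector>
#include <zlib.h>  // crc32()

// Hypothetical per-entry metadata appended after each record in the backup log.
struct EntryMetadata {
    uint64_t offset;    // where the entry starts in the log ("A@0x1")
    uint32_t length;    // entry length ("A=1kB")
    uint32_t checksum;  // CRC32 accumulated over all log data written so far
};

// Append one record and its metadata. Because the CRC is chained from the
// previous value, the latest checksum also covers every earlier entry.
uint32_t appendWithMetadata(std::vector<uint8_t>& log, uint32_t runningCrc,
                            const void* record, uint32_t len) {
    uint64_t offset = log.size();
    const auto* bytes = static_cast<const uint8_t*>(record);
    log.insert(log.end(), bytes, bytes + len);

    runningCrc = crc32(runningCrc, bytes, len);  // extend the running CRC

    EntryMetadata md{offset, len, runningCrc};
    const auto* mdBytes = reinterpret_cast<const uint8_t*>(&md);
    log.insert(log.end(), mdBytes, mdBytes + sizeof(md));
    return runningCrc;  // the caller keeps this for the next append
}
```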

RDMA-based Replication: Second Try
The primary writes the data and its metadata to a remote buffer.
Current RDMA implementations can only target a single, contiguous memory region, so data and metadata require separate writes: 2x messages, at which point RPCs would be more efficient.
Worse, the metadata is updated in place. If the primary fails in the midst of updating the metadata, the backup is left with garbage metadata and the data becomes unusable.
[Figure: after replicating B, the primary crashes while overwriting the metadata entry; recovery finds corrupted metadata ("$##!!!").]

RDMA-based Replication: Third Try
The primary replicates A together with its metadata in a single RDMA write, then B and its metadata right after A.
But if the primary fully replicates C while only partially writing its metadata, the corrupt metadata invalidates the entire backup log.

RDMA-based Replication: Fourth Try
The primary replicates A and its corresponding metadata, then partially replicates B and fails.
During recovery, the backup checks whether objects were fully received: only fully received and correct objects are recovered!

Tailwind
- Keeps the same client-facing interface (RPCs).
- Targets strongly-consistent primary-backup systems.
- Appends only a 4-byte CRC32 checksum after each record.
- Relies on Reliable-Connected queue pairs: messages are delivered at most once, in order, and without corruption (see the sketch below).
- Assumes stop failures.
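
For reference, a one-sided RDMA WRITE on a Reliable-Connected queue pair looks roughly like the sketch below using the verbs API. The wrapper name and the way addresses, keys, and completions are handled are assumptions for illustration; Tailwind's actual code paths in RAMCloud differ:

```cpp
#include <cstdint>
#include <infiniband/verbs.h>

// Post a one-sided RDMA WRITE: the local buffer [laddr, laddr+len) is copied
// into the backup's registered memory at raddr. The backup's CPU never sees
// the operation; its NIC DMAs the bytes straight into the target buffer.
bool postRdmaWrite(ibv_qp* qp, void* laddr, uint32_t len, uint32_t lkey,
                   uint64_t raddr, uint32_t rkey) {
    ibv_sge sge{};
    sge.addr = reinterpret_cast<uint64_t>(laddr);
    sge.length = len;
    sge.lkey = lkey;

    ibv_send_wr wr{};
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   // request a sender-side completion
    wr.wr.rdma.remote_addr = raddr;      // backup buffer address (from the setup RPC)
    wr.wr.rdma.rkey = rkey;              // rkey of the backup's registered region

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad) == 0;
    // The sender later reaps the completion from its CQ (ibv_poll_cq); the
    // Reliable-Connected QP delivers writes at most once, in order, uncorrupted.
}
```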

RDMA Buffer Allocation
A primary chooses a backup and requests an RDMA buffer: the first replication request ("Replicate(A) + Get Buffer") travels as an RPC and returns an ack together with a pre-registered buffer.
All subsequent replication requests are performed with one-sided RDMA WRITEs.
When the primary fills a buffer, it chooses a backup and repeats all the steps. Filled buffers are asynchronously flushed to secondary storage, after which they can be reused.
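
Putting those steps together, the primary's replication path might look like the following sketch. BackupBuffer, replicateViaRpcAndGetBuffer, and rdmaWriteToBackup are assumed names used only to illustrate the RPC-then-RDMA split; they are not the real RAMCloud/Tailwind interfaces:

```cpp
#include <cstdint>

// Hypothetical view of one replication target from the primary's side.
struct BackupBuffer {
    uint64_t remoteAddr = 0;   // base of the pre-registered buffer on the backup
    uint32_t rkey = 0;         // remote key for one-sided writes
    uint32_t capacity = 0;
    uint32_t writeOffset = 0;  // how much of the buffer the primary has filled
};

// Stubs standing in for the two transports (assumed, not the real API).
BackupBuffer replicateViaRpcAndGetBuffer(const void* record, uint32_t len);
void rdmaWriteToBackup(const BackupBuffer& buf, uint32_t offset,
                       const void* record, uint32_t len);

// The first record sent to a backup travels by RPC ("Replicate + Get Buffer"),
// which also hands back a pre-registered buffer. Every later record goes out
// as a one-sided RDMA WRITE; once the buffer is full, the primary picks a
// backup again and repeats the RPC step while the old buffer is flushed to
// secondary storage in the background.
void replicateRecord(BackupBuffer& buf, const void* record, uint32_t len) {
    if (buf.remoteAddr == 0 || buf.writeOffset + len > buf.capacity) {
        buf = replicateViaRpcAndGetBuffer(record, len);    // slow path: RPC
        return;
    }
    rdmaWriteToBackup(buf, buf.writeOffset, record, len);  // fast path: RDMA
    buf.writeOffset += len;                                // backup CPU stays idle
}
```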

Tailwind: Failures
Failures can happen at any moment. RDMA complicates primary-replica failures; secondary-replica failures are naturally dealt with in storage systems.

Failure Scenarios: Fully Replicated Objects
Case 1: the object and its metadata are correctly transferred.
[Figure: the backup buffer holds A and B, followed by zero-filled space.]
The last object must always be a checksum, and checksums are not allowed to be zero.

Failure Scenarios: Partially Written Checksum
Case 2: the checksum is only partially transferred.
Backups re-compute the checksum during recovery and compare it with the stored one.

Failure Scenarios: Partially Written Object
Case 3: the object itself is only partially transferred.
The metadata acts as an end-of-transmission marker: with no valid checksum following it, the partial object is ignored at recovery.
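
The three cases reduce to one scan at recovery time: walk the zero-filled buffer from the start and keep only the prefix that checks out. A rough sketch, with a hypothetical record layout (a length header, the payload, then a 4-byte CRC32 chained over everything written so far) rather than Tailwind's actual one:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <zlib.h>  // crc32()

// Scan the buffer from the start; stop at the first record whose trailing
// checksum is still zero (not yet written) or does not match the recomputed
// value. Everything before that point was fully received and is correct;
// everything after it (a torn record or a torn checksum) is discarded.
size_t verifiedPrefix(const uint8_t* buf, size_t size) {
    size_t pos = 0;
    uint32_t crc = 0;
    while (pos + sizeof(uint32_t) <= size) {
        uint32_t len;
        std::memcpy(&len, buf + pos, sizeof(len));
        if (len == 0)
            break;                                   // end-of-transmission marker
        size_t end = pos + sizeof(len) + len + sizeof(uint32_t);
        if (end > size)
            break;                                   // record runs past the buffer
        crc = crc32(crc, buf + pos, static_cast<uint32_t>(sizeof(len) + len));
        uint32_t stored;
        std::memcpy(&stored, buf + end - sizeof(stored), sizeof(stored));
        if (stored == 0 || stored != crc)
            break;                                   // torn or corrupt checksum
        pos = end;                                   // this record is recoverable
    }
    return pos;  // number of bytes recovery may safely use
}
```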

Outline
Context, existing RDMA-based replication protocols, Tailwind's design, evaluation, conclusion.

Implementation
Tailwind is implemented in the RAMCloud in-memory kv-store: low latency, large scale, strongly consistent. RPCs leverage fast networking and kernel bypass; all data is kept in memory, with durable storage for backups.

RAMCloud Threading Architecture
[Figure: a client's PUT(B) arrives at the primary's NIC, is polled by the dispatch core, and is handed to a worker core; the resulting Replicate(B) requests travel to backup servers (M1, M3), where they are placed in non-volatile buffers in front of storage.]

Evaluation: Configuration
- Yahoo! Cloud Serving Benchmark (workload A: 50% PUT, workload B: 5% PUT, plus a WRITE-ONLY workload).
- 20 million objects of 100 bytes, with 30-byte keys.
- Requests generated with a Zipfian distribution.
- RAMCloud replication vs. Tailwind (3-way replication).

Evaluation: Goals
How many CPU cycles can Tailwind save? Does it improve performance? Is there any overhead?

Evaluation: CPU Savings
[Plots: dispatch-core and worker-core utilization vs. throughput (KOp/s), Tailwind vs. RAMCloud, WRITE-ONLY workload, values aggregated over a 4-node cluster; the annotations mark roughly 4x (dispatch) and 3x (worker) differences.]

Evaluation: Latency Improvement
[Plots: median and 99th-percentile latency (µs) vs. throughput (KOp/s), Tailwind vs. RAMCloud; the annotations mark roughly 2x (median) and 3x (tail) differences.]
Durable writes take 16 µs under heavy load.

Evaluation: Throughput
[Plots: throughput per server (KOp/s) in a 4-node cluster vs. number of clients, for YCSB-B (1.3x), YCSB-A, and WRITE-ONLY (1.7x), Tailwind vs. RAMCloud.]
Even with only 5% PUTs, Tailwind increases throughput by 30%.

Evaluation: Recovery Overhead
[Plot: recovery time (s) vs. recovery size (1M and 10M objects), Tailwind vs. RAMCloud, for recoveries with up to 10 million objects.]

Related Work
One-sided RDMA systems: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14], DrTM [SOSP '15], DrTM+R [EuroSys '16].
Mitigating replication overheads / tuning consistency: RedBlue [OSDI '12], Correctables [OSDI '16].
Tailwind reduces the replication CPU footprint and improves performance without sacrificing durability, availability, or consistency.

Conclusion
Tailwind leverages one-sided RDMA to perform replication while leaving backups completely idle. It provides backups with a protocol that protects against failure scenarios, reduces replication-induced CPU usage while improving performance and latency, and preserves the client-facing RPC interface.
Thank you! Questions?