Tailwind: Fast and Atomic RDMA-based Replication Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes
In-Memory Key-Value Stores
General-purpose in-memory key-value stores are widely used today.
Recent systems can process millions of requests per second per machine, e.g., RAMCloud, FaRM, MICA.
Key enablers: eliminating network overheads (e.g., kernel bypass) and leveraging multicore architectures.
In-Memory Key-Value Stores
The CPU is becoming a bottleneck in modern kv-stores: random memory accesses and key-value GET/PUT processing.
Persistent in-memory kv-stores have to replicate data to survive failures, and replication contends with normal request processing.
[Diagram, built up over several slides: clients issue PUT(Y) and GET(Y) requests to a primary that keeps copies on other servers; each Replicate(X) request competes with normal PUT/GET processing on the servers holding the copies.]
Replication in In-Memory Key-Value Stores
Replication impedes modern kv-stores and collocated applications (block storage, graph processing, stream processing).
How can we mitigate replication overhead? Techniques like Remote Direct Memory Access (RDMA) seem promising.
[Diagram: with RDMA-based replication, Replicate(X) traffic is DMAed by the NIC directly into the backups' memory, bypassing their CPUs, while client PUT/GET requests are processed as before.]
Existing Protocols
Many systems use one-sided RDMA for replication: FaRM (NSDI '14), DARE (HPDC '15), HydraDB (SC '16).
[Diagram: the sender writes MSG1, MSG2, MSG3 into a ring buffer; the receiver polls until a message is fully received, then runs an integrity check.]
This still involves the backup's CPU, which defeats the purpose of RDMA!
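For concreteness, a minimal sketch of the receiver-side polling these protocols rely on; the names and layout are illustrative, not taken from FaRM, DARE, or HydraDB. The point is the loop: even though the transfer itself is one-sided, the backup's CPU must spin and verify.

```c
#include <stdint.h>

struct ring_msg {
    uint32_t len;       /* payload length                              */
    uint32_t crc;       /* written last; nonzero means fully received  */
    uint8_t  payload[]; /* message body                                */
};

/* Busy-polls one ring-buffer slot until the NIC has DMAed a complete
 * message; this loop is exactly the backup CPU cost Tailwind removes. */
static const struct ring_msg *poll_slot(volatile struct ring_msg *slot)
{
    while (slot->crc == 0)
        ;  /* spin: burns a receiver core despite the one-sided write */
    return (const struct ring_msg *)slot;
}
```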
Outline
Context
Existing RDMA-based replication protocols
Tailwind's design
Evaluation
Conclusion
RDMA-based Replication: First Try
Why not just perform one-sided RDMA writes and leave the target completely idle?
[Diagram: PUT(A) triggers Replicate(A), which lands directly in the backup's memory; later, Recover(A) finds only raw bytes.]
The backup cannot guess the length of the data; it needs some metadata.
RDMA-based Replication: Metadata
A backup needs to ensure the integrity of the log-backup data and to know its length, e.g., { A@0x1, A=1kB, CRC=0x1 }.
The last checksum can cover all previous log entries: { B@0x2, B=2kB, CRC=0x2 }, then { C@0x4, C=1kB, CRC=0x3 }.
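A minimal sketch of how such a prefix-covering checksum can be maintained, assuming zlib's crc32 (its chaining of a previous CRC into the next call gives exactly the "covers all previous entries" property); the function name is illustrative, not Tailwind's code.

```c
#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

/* Extends a running CRC32 over one more log entry. Because zlib's
 * crc32() takes the previous CRC as its starting state, the value
 * stored after entry i covers entries 0..i: verifying the latest
 * checksum vouches for the whole log prefix. */
uint32_t extend_checksum(uint32_t prev, const void *entry, size_t len)
{
    return (uint32_t)crc32(prev, (const Bytef *)entry, (uInt)len);
}

/* Start state: crc32(0L, Z_NULL, 0) == 0, per zlib convention. */
```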
RDMA-based Replication: Second Try
The primary writes the metadata to a separate remote buffer.
Current RDMA implementations can only target a single, contiguous memory region, so every object costs 2x messages; plain RPCs would be more efficient.
Worse: if the primary fails in the midst of updating the metadata, the backup holds data for B but metadata still describing A, and the data is unusable!
RDMA-based Replication: Third Try
The primary replicates each object and its corresponding metadata with a single RDMA write: A and its metadata, then B and its metadata right after.
But if the primary fully replicates C while writing the metadata only partially, the corrupt metadata invalidates the entire backup log.
RDMA-based Replication: Fourth Try
The primary replicates A together with per-object metadata. Suppose it then partially replicates B and fails.
During recovery, the backup checks whether each object was fully received: only fully received, correct objects are recovered!
Tailwind
Keeps the same client-facing interface (RPCs).
Targets strongly consistent primary-backup systems.
Appends only a 4-byte CRC32 checksum after each record.
Relies on Reliable Connected queue pairs: messages are delivered at most once, in order, and without corruption.
Assumes stop failures.
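A minimal sketch of what appending a record plus its trailing 4-byte CRC32 to the staging buffer could look like; the struct and field names are assumptions for illustration, not Tailwind's actual on-wire layout.

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Illustrative record header matching the slides' "A@0x1, A=1kB" metadata. */
struct record_hdr {
    uint64_t log_offset; /* where this record lives in the log */
    uint32_t length;     /* size of the object that follows    */
};

/* Appends header + object + running CRC32 to the buffer that will be
 * pushed to the backup with a single RDMA write. Returns the new
 * offset; *crc carries the running checksum across calls. */
size_t append_record(uint8_t *buf, size_t off, uint32_t *crc,
                     const struct record_hdr *hdr, const void *obj)
{
    memcpy(buf + off, hdr, sizeof *hdr);
    memcpy(buf + off + sizeof *hdr, obj, hdr->length);
    *crc = (uint32_t)crc32(*crc, buf + off,
                           (uInt)(sizeof *hdr + hdr->length));
    off += sizeof *hdr + hdr->length;
    /* Assumption: the real protocol keeps stored checksums nonzero so
     * that a zero word always means "end of log" to the backup. */
    memcpy(buf + off, crc, sizeof *crc);
    return off + sizeof *crc;
}
```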
RDMA Buffer Allocation
A primary chooses a backup and requests an RDMA buffer from its pool of pre-registered buffers (RPC: Replicate(A) + Get Buffer; reply: Ack + Buffer).
All subsequent replication requests are performed with one-sided RDMA WRITEs (e.g., Replicate(B)).
When the primary fills a buffer, it chooses a backup and repeats all the steps (RPC: Replicate(C) + Get Buffer).
Filled buffers are asynchronously flushed to secondary storage, after which they can be reused.
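A hedged sketch of the one-sided write itself using the standard libibverbs API. It assumes a Reliable Connected QP is already set up, the local staging buffer is registered (lkey), and the backup returned the remote buffer address and rkey in the Get Buffer RPC; completion reaping and error handling are elided.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Posts one one-sided RDMA WRITE carrying records + checksums into the
 * backup's pre-registered buffer; the backup's CPU is not involved. */
int replicate_rdma_write(struct ibv_qp *qp,
                         void *local, uint32_t len, uint32_t lkey,
                         uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED, /* reap completion locally */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad); /* 0 on success */
}
```

Because the write is one-sided, the backup does no work here; correctness checks are deferred to the recovery-time scan shown later.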
Tailwind: Failures
Failures can happen at any moment. RDMA complicates primary-replica failures; secondary-replica failures are naturally dealt with in storage systems.
Failure Scenarios: Fully Replicated Objects
Case 1: the object and its metadata are correctly transferred; the rest of the buffer is zero-filled.
The last entry must always be a checksum, and checksums are not allowed to be zero, so the backup can tell where the log ends.
Failure Scenarios: Partially Written Checksum
Case 2: the checksum itself is partially transferred.
Backups re-compute the checksum during recovery and compare it with the stored one.
Failure Scenarios: Partially Written Object
Case 3: the object is partially transferred. The metadata acts as an end-of-transmission marker.
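Putting the three cases together, a sketch of the scan a backup could run at recovery, assuming zero-filled buffers and the illustrative record layout from the earlier sketch; this is a reconstruction of the slides' logic, not Tailwind's actual code.

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>

struct record_hdr {            /* same illustrative layout as above */
    uint64_t log_offset;
    uint32_t length;
};

/* Returns how many bytes of the buffer hold verified records; anything
 * beyond that was torn by the primary's failure and is discarded. */
size_t scan_backup_buffer(const uint8_t *buf, size_t cap)
{
    size_t off = 0, good = 0;
    uint32_t crc = 0;          /* zlib CRC32 start state */
    while (off + sizeof(struct record_hdr) + sizeof(uint32_t) <= cap) {
        struct record_hdr hdr;
        memcpy(&hdr, buf + off, sizeof hdr);
        if (hdr.length == 0)
            break;             /* case 1: zero-filled space, end of log */
        size_t body = sizeof hdr + hdr.length;
        if (off + body + sizeof(uint32_t) > cap)
            break;             /* case 3: object runs past the buffer */
        crc = (uint32_t)crc32(crc, buf + off, (uInt)body);
        uint32_t stored;
        memcpy(&stored, buf + off + body, sizeof stored);
        if (stored == 0 || stored != crc)
            break;             /* case 2/3: torn checksum or torn object */
        off += body + sizeof stored;
        good = off;            /* record fully received and correct */
    }
    return good;
}
```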
Outline
Context
Existing RDMA-based replication protocols
Tailwind's design
Evaluation
Conclusion
Implementation
Implemented on the RAMCloud in-memory kv-store:
Low latency, large scale, strongly consistent.
RPCs leverage fast networking and kernel bypass.
Keeps all data in memory, with durable storage for backups.
RAMCloud Threading Architecture
[Diagram: (1) a client sends PUT(B) to M1; (2) M1's dispatch core polls the NIC and hands the request to a worker core, which stores B in a non-volatile buffer; (3) the worker issues Replicate(B) through the NIC; (4) on backup M3, the dispatch core polls and a worker core places B in its non-volatile buffer; meanwhile another client's GET(C) is served.]
Evaluation Configuration
Yahoo! Cloud Serving Benchmark (workloads A (50% PUT), B (5% PUT), and WRITE-ONLY).
20 million objects of 100 bytes plus a 30-byte key.
Requests generated with a Zipfian distribution.
RAMCloud replication vs. Tailwind (3-way replication).
Evaluation Goals
How many CPU cycles can Tailwind save?
Does it improve performance?
Is there any overhead?
Evaluation: CPU Savings
[Plots: CPU cores used for dispatch and for workers vs. throughput (KOp/s), Tailwind vs. RAMCloud; Tailwind uses up to 4x fewer dispatch cores and 3x fewer worker cores.]
Values are aggregated over a 4-node cluster running the WRITE-ONLY workload.
Evaluation: Latency Improvement
[Plots: median and 99th-percentile latency (µs) vs. throughput (KOp/s), Tailwind vs. RAMCloud; Tailwind improves median latency by up to 2x and 99th-percentile latency by up to 3x.]
Durable writes take 16 µs under heavy load.
Evaluation: Throughput
[Plots: per-server throughput (KOp/s) in a 4-node cluster vs. number of clients for YCSB-B, YCSB-A, and WRITE-ONLY; Tailwind improves throughput by up to 1.3x on YCSB-B and 1.7x on WRITE-ONLY.]
Even with only 5% PUTs, Tailwind increases throughput by 30%.
Evaluation: Recovery Overhead
[Plot: recovery time (s) for recovery sizes of 1M and 10M objects, Tailwind vs. RAMCloud.]
Recoveries were performed with up to 10 million objects.
Related Work
One-sided RDMA systems: Pilaf (ATC '13), HERD (SIGCOMM '14), FaRM (NSDI '14), DrTM (SOSP '15), DrTM+R (EuroSys '16).
Mitigating replication overheads / tuning consistency: RedBlue (OSDI '12), Correctables (OSDI '16).
Tailwind reduces the CPU footprint of replication and improves performance without sacrificing durability, availability, or consistency.
Conclusion
Tailwind leverages one-sided RDMA to perform replication while leaving backups completely idle.
It provides backups with a protocol to protect against failure scenarios.
It reduces replication-induced CPU usage while improving throughput and latency.
Tailwind preserves the client-facing RPC interface.
Thank you! Questions?