Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes



In-Memory Key-Value Stores
General-purpose in-memory key-value stores are widely used nowadays.
Recent systems can process millions of requests/second/machine, e.g. RAMCloud, FaRM, MICA.
Key enablers: eliminating network overheads (e.g., kernel bypass) and leveraging multicore architectures.

In-Memory Key-Value Stores
The CPU is becoming a bottleneck in modern kv-stores: random memory access and key-value GET/PUT processing.
Persistent in-memory kv-stores have to replicate data to survive failures, and replication contends with normal request processing.

In-Memory Key-Value Stores
[Figure: clients issue PUT(Y) and GET(Y) requests to servers, which at the same time must handle Replicate(X) traffic for the copies they back up.]
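
The flow above is the conventional RPC-based replication path that Tailwind later replaces. As a minimal sketch under assumed names (handlePut and sendReplicateRpc are not RAMCloud's actual API), a primary might handle a PUT like this:

```cpp
#include <future>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical backup handle: sendReplicateRpc() stands in for an RPC that the
// backup's CPU must receive, process, and acknowledge.
struct Backup {
    std::future<void> sendReplicateRpc(const std::string& key,
                                       const std::string& value) {
        (void)key; (void)value;
        // Stub: a real implementation would serialize the record and send it
        // over the network; here we return an already-satisfied future.
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }
};

// On the primary: apply the write locally, fan it out to every backup, and
// wait for all acks before answering the client. Both the primary's worker
// thread and every backup's CPU stay busy for the whole operation.
void handlePut(std::unordered_map<std::string, std::string>& table,
               std::vector<Backup>& backups,
               const std::string& key, const std::string& value) {
    table[key] = value;
    std::vector<std::future<void>> acks;
    acks.reserve(backups.size());
    for (auto& b : backups)
        acks.push_back(b.sendReplicateRpc(key, value));
    for (auto& a : acks)
        a.get();
    // ...acknowledge the client here.
}
```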

Replication in In-Memory Key-Value Stores
Replication impedes modern kv-stores and co-located applications (block storage, graph processing, stream processing).

Replication in In-Memory Key-Value Stores
How can we mitigate replication overhead? Techniques like Remote Direct Memory Access (RDMA) seem promising.

RDMA-Based Replication
[Figure: the same replication diagram, now with Replicate(X) requests handled by the backups' NICs and DMA-ed directly into their memory, while clients' PUT/GET requests are served as before.]

Existing Protocols
Many systems use one-sided RDMA for replication: FaRM [NSDI '14], DARE [HPDC '15], HydraDB [SC '16].
[Figure: messages MSG1, MSG2, MSG3 are RDMA-written into a ring buffer; the receiver polls the buffer, performs an integrity check, and detects when each message has been fully received.]
This still involves the backup's CPU, which defeats the purpose of RDMA!
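
To see why the backup's CPU stays busy in such designs, here is a rough sketch of receiver-side polling over an RDMA-written ring buffer. The MsgHeader layout and pollRingBuffer routine are hypothetical, not code from FaRM, DARE, or HydraDB:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout of a message the sender RDMA-writes into a ring buffer
// on the backup (not any system's actual wire format).
struct MsgHeader {
    uint32_t length;    // non-zero only once the message has fully landed
    uint32_t checksum;  // integrity check over the payload
};

// The backup's CPU must keep polling the ring, validating each message, and
// advancing its read cursor -- the very work one-sided RDMA was meant to avoid.
void pollRingBuffer(volatile uint8_t* ring, size_t ringSize, size_t& readPos) {
    for (;;) {
        auto* hdr = reinterpret_cast<volatile MsgHeader*>(ring + readPos);
        if (hdr->length == 0)
            continue;  // nothing complete yet; keep spinning
        // ...verify hdr->checksum over the payload and hand the data to the log...
        readPos = (readPos + sizeof(MsgHeader) + hdr->length) % ringSize;
    }
}
```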

Outline
Context, existing RDMA-based replication protocols, Tailwind's design, evaluation, conclusion.

RDMA-based Replication
Why not just perform RDMA and leave the target idle?
[Figure: the primary replicates A to the backup with a one-sided RDMA write; later, a Recover(A) request arrives at the backup.]
The backup cannot guess the length of the data; it needs some metadata.

RDMA-based Replication: Metadata
A backup needs to ensure the integrity of the backed-up log data and to know its length, so each entry carries metadata, e.g. A@0x1, A=1kB, CRC=0x1.
The last checksum can cover all previous log entries: after appending B (CRC=0x2) and C (CRC=0x3), the latest checksum vouches for the whole log.
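
To make the chained-checksum idea concrete, here is a minimal sketch of appending a record plus per-entry metadata to an in-memory image of the backup log. The EntryMetadata layout and helper are illustrative assumptions, not Tailwind's actual format; zlib's crc32() stands in for any CRC32 implementation:

```cpp
#include <cstdint>
#include <vector>
#include <zlib.h>  // crc32()

// Hypothetical per-entry metadata appended after each record in the backup log.
struct EntryMetadata {
    uint64_t offset;    // where the entry starts in the log ("A@0x1")
    uint32_t length;    // entry length ("A=1kB")
    uint32_t checksum;  // CRC32 accumulated over all log data written so far
};

// Append one record and its metadata. Because the CRC is chained from the
// previous value, the latest checksum also covers every earlier entry.
uint32_t appendWithMetadata(std::vector<uint8_t>& log, uint32_t runningCrc,
                            const void* record, uint32_t len) {
    uint64_t offset = log.size();
    const auto* bytes = static_cast<const uint8_t*>(record);
    log.insert(log.end(), bytes, bytes + len);

    runningCrc = crc32(runningCrc, bytes, len);  // extend the running CRC

    EntryMetadata md{offset, len, runningCrc};
    const auto* mdBytes = reinterpret_cast<const uint8_t*>(&md);
    log.insert(log.end(), mdBytes, mdBytes + sizeof(md));
    return runningCrc;  // the caller keeps this for the next append
}
```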

RDMA-based Replication: Second Try
The primary writes the data and its metadata to a remote buffer.
Current RDMA implementations can only target a single, contiguous memory region, so data and metadata require separate writes: 2x messages, at which point RPCs would be more efficient.
Worse, the metadata is updated in place. If the primary fails in the midst of updating the metadata, the backup is left with garbage metadata and the data becomes unusable.
[Figure: after replicating B, the primary crashes while overwriting the metadata entry; recovery finds corrupted metadata ("$##!!!").]

RDMA-based Replication: Third Try
The primary replicates A together with its metadata in a single RDMA write, then B and its metadata right after A.
But if the primary fully replicates C while only partially writing its metadata, the corrupt metadata invalidates the entire backup log.

RDMA-based Replication: Fourth Try
The primary replicates A and its corresponding metadata, then partially replicates B and fails.
During recovery, the backup checks whether objects were fully received: only fully received and correct objects are recovered!

Tailwind
- Keeps the same client-facing interface (RPCs).
- Targets strongly-consistent primary-backup systems.
- Appends only a 4-byte CRC32 checksum after each record.
- Relies on Reliable-Connected queue pairs: messages are delivered at most once, in order, and without corruption (see the sketch below).
- Assumes stop failures.
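
For reference, a one-sided RDMA WRITE on a Reliable-Connected queue pair looks roughly like the sketch below using the verbs API. The wrapper name and the way addresses, keys, and completions are handled are assumptions for illustration; Tailwind's actual code paths in RAMCloud differ:

```cpp
#include <cstdint>
#include <infiniband/verbs.h>

// Post a one-sided RDMA WRITE: the local buffer [laddr, laddr+len) is copied
// into the backup's registered memory at raddr. The backup's CPU never sees
// the operation; its NIC DMAs the bytes straight into the target buffer.
bool postRdmaWrite(ibv_qp* qp, void* laddr, uint32_t len, uint32_t lkey,
                   uint64_t raddr, uint32_t rkey) {
    ibv_sge sge{};
    sge.addr = reinterpret_cast<uint64_t>(laddr);
    sge.length = len;
    sge.lkey = lkey;

    ibv_send_wr wr{};
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   // request a sender-side completion
    wr.wr.rdma.remote_addr = raddr;      // backup buffer address (from the setup RPC)
    wr.wr.rdma.rkey = rkey;              // rkey of the backup's registered region

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad) == 0;
    // The sender later reaps the completion from its CQ (ibv_poll_cq); the
    // Reliable-Connected QP delivers writes at most once, in order, uncorrupted.
}
```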

RDMA Buffer Allocation
A primary chooses a backup and requests an RDMA buffer: the first replication request ("Replicate(A) + Get Buffer") travels as an RPC and returns an ack together with a pre-registered buffer.
All subsequent replication requests are performed with one-sided RDMA WRITEs.
When the primary fills a buffer, it chooses a backup and repeats all the steps. Filled buffers are asynchronously flushed to secondary storage, after which they can be reused.
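
Putting those steps together, the primary's replication path might look like the following sketch. BackupBuffer, replicateViaRpcAndGetBuffer, and rdmaWriteToBackup are assumed names used only to illustrate the RPC-then-RDMA split; they are not the real RAMCloud/Tailwind interfaces:

```cpp
#include <cstdint>

// Hypothetical view of one replication target from the primary's side.
struct BackupBuffer {
    uint64_t remoteAddr = 0;   // base of the pre-registered buffer on the backup
    uint32_t rkey = 0;         // remote key for one-sided writes
    uint32_t capacity = 0;
    uint32_t writeOffset = 0;  // how much of the buffer the primary has filled
};

// Stubs standing in for the two transports (assumed, not the real API).
BackupBuffer replicateViaRpcAndGetBuffer(const void* record, uint32_t len);
void rdmaWriteToBackup(const BackupBuffer& buf, uint32_t offset,
                       const void* record, uint32_t len);

// The first record sent to a backup travels by RPC ("Replicate + Get Buffer"),
// which also hands back a pre-registered buffer. Every later record goes out
// as a one-sided RDMA WRITE; once the buffer is full, the primary picks a
// backup again and repeats the RPC step while the old buffer is flushed to
// secondary storage in the background.
void replicateRecord(BackupBuffer& buf, const void* record, uint32_t len) {
    if (buf.remoteAddr == 0 || buf.writeOffset + len > buf.capacity) {
        buf = replicateViaRpcAndGetBuffer(record, len);    // slow path: RPC
        return;
    }
    rdmaWriteToBackup(buf, buf.writeOffset, record, len);  // fast path: RDMA
    buf.writeOffset += len;                                // backup CPU stays idle
}
```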

Tailwind: Failures
Failures can happen at any moment. RDMA complicates primary-replica failures; secondary-replica failures are naturally dealt with in storage systems.

Failure Scenarios: Fully Replicated Objects
Case 1: the object and its metadata are correctly transferred.
[Figure: the backup buffer holds A and B, followed by zero-filled space.]
The last object must always be a checksum, and checksums are not allowed to be zero.

Failure Scenarios: Partially Written Checksum
Case 2: the checksum is only partially transferred.
Backups re-compute the checksum during recovery and compare it with the stored one.

Failure Scenarios: Partially Written Object
Case 3: the object itself is only partially transferred.
The metadata acts as an end-of-transmission marker: with no valid checksum following it, the partial object is ignored at recovery.
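
The three cases reduce to one scan at recovery time: walk the zero-filled buffer from the start and keep only the prefix that checks out. A rough sketch, with a hypothetical record layout (a length header, the payload, then a 4-byte CRC32 chained over everything written so far) rather than Tailwind's actual one:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <zlib.h>  // crc32()

// Scan the buffer from the start; stop at the first record whose trailing
// checksum is still zero (not yet written) or does not match the recomputed
// value. Everything before that point was fully received and is correct;
// everything after it (a torn record or a torn checksum) is discarded.
size_t verifiedPrefix(const uint8_t* buf, size_t size) {
    size_t pos = 0;
    uint32_t crc = 0;
    while (pos + sizeof(uint32_t) <= size) {
        uint32_t len;
        std::memcpy(&len, buf + pos, sizeof(len));
        if (len == 0)
            break;                                   // end-of-transmission marker
        size_t end = pos + sizeof(len) + len + sizeof(uint32_t);
        if (end > size)
            break;                                   // record runs past the buffer
        crc = crc32(crc, buf + pos, static_cast<uint32_t>(sizeof(len) + len));
        uint32_t stored;
        std::memcpy(&stored, buf + end - sizeof(stored), sizeof(stored));
        if (stored == 0 || stored != crc)
            break;                                   // torn or corrupt checksum
        pos = end;                                   // this record is recoverable
    }
    return pos;  // number of bytes recovery may safely use
}
```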

Outline
Context, existing RDMA-based replication protocols, Tailwind's design, evaluation, conclusion.

Implementation
Tailwind is implemented in the RAMCloud in-memory kv-store: low latency, large scale, strongly consistent. RPCs leverage fast networking and kernel bypass; all data is kept in memory, with durable storage for backups.

RAMCloud Threading Architecture
[Figure: a client's PUT(B) arrives at the primary's NIC, is polled by the dispatch core, and is handed to a worker core; the resulting Replicate(B) requests travel to backup servers (M1, M3), where they are placed in non-volatile buffers in front of storage.]

Evaluation: Configuration
- Yahoo! Cloud Serving Benchmark (workload A: 50% PUT, workload B: 5% PUT, plus a WRITE-ONLY workload).
- 20 million objects of 100 bytes, with 30-byte keys.
- Requests generated with a Zipfian distribution.
- RAMCloud replication vs. Tailwind (3-way replication).

Evaluation: Goals
How many CPU cycles can Tailwind save? Does it improve performance? Is there any overhead?

Evaluation: CPU Savings
[Plots: dispatch-core and worker-core utilization vs. throughput (KOp/s), Tailwind vs. RAMCloud, WRITE-ONLY workload, values aggregated over a 4-node cluster; the annotations mark roughly 4x (dispatch) and 3x (worker) differences.]

Evaluation: Latency Improvement
[Plots: median and 99th-percentile latency (µs) vs. throughput (KOp/s), Tailwind vs. RAMCloud; the annotations mark roughly 2x (median) and 3x (tail) differences.]
Durable writes take 16 µs under heavy load.

Evaluation: Throughput
[Plots: throughput per server (KOp/s) in a 4-node cluster vs. number of clients, for YCSB-B (1.3x), YCSB-A, and WRITE-ONLY (1.7x), Tailwind vs. RAMCloud.]
Even with only 5% PUTs, Tailwind increases throughput by 30%.

Evaluation: Recovery Overhead
[Plot: recovery time (s) vs. recovery size (1M and 10M objects), Tailwind vs. RAMCloud, for recoveries with up to 10 million objects.]

Related Work
One-sided RDMA systems: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14], DrTM [SOSP '15], DrTM+R [EuroSys '16].
Mitigating replication overheads / tuning consistency: RedBlue [OSDI '12], Correctables [OSDI '16].
Tailwind reduces the replication CPU footprint and improves performance without sacrificing durability, availability, or consistency.

Conclusion
Tailwind leverages one-sided RDMA to perform replication while leaving backups completely idle. It provides backups with a protocol that protects against failure scenarios, reduces replication-induced CPU usage while improving performance and latency, and preserves the client-facing RPC interface.
Thank you! Questions?