FaRM: Fast Remote Memory

Problem Context
- DRAM prices have decreased significantly, making it cost-effective to build commodity servers with hundreds of GBs of DRAM
  - E.g., a cluster of 100 machines can hold tens of TBs of main memory
- Keeping data in memory removes the overhead of disk/flash and enables fast, small, random data accesses
- Network communication is still a bottleneck!
  - Fast networks alone won't remove this bottleneck while systems still use TCP/IP networking

Problem Context (continued)
- Remote Direct Memory Access (RDMA) allows computers in a network to exchange main-memory data without involving the processor, cache, or OS of either computer
- Provides reliable user-level reads/writes of remote memory
- Achieves low latency and high throughput
- Bypasses the kernel, avoiding complex protocol stack overheads and freeing up CPU resources

The Solution: FaRM
- FaRM is a main-memory distributed computing platform
- Exploits RDMA to improve latency and throughput
  - More than an order of magnitude better than state-of-the-art main-memory systems that use TCP/IP
- Simplified programming model
  - All of the memory of the machines in the cluster is exposed as a shared address space
  - Transactions are sufficient for most application code: applications use them to allocate, read, write, and free objects in the address space, with location transparency

FaRM: Communication Primitives
- Uses one-sided RDMA reads for direct data access
- Uses RDMA writes to implement a fast message-passing primitive (see the sketch below)
  - A circular buffer implements a unidirectional channel
  - The buffer is stored on the receiver
  - There is one buffer for each sender/receiver pair
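
A minimal C sketch of this channel, assuming a receiver-hosted ring in which a non-zero length word marks a complete message. The struct layout and helper names are hypothetical, and a plain memcpy stands in for the sender's one-sided RDMA write; wraparound at the end of the ring and sender flow control are omitted for brevity.

```c
#include <stdint.h>
#include <string.h>

#define RING_BYTES 4096

struct msg_hdr { uint32_t len; };     /* len == 0 marks an empty slot */

struct channel {
    uint8_t  buf[RING_BYTES];   /* lives in (and is registered by) the receiver */
    uint32_t head;              /* receiver's private read cursor */
    uint32_t tail;              /* sender's private write cursor */
};

/* Sender: place header + payload at the tail. In FaRM this copy would
 * be a single one-sided RDMA write into the receiver's buffer. */
void send_msg(struct channel *ch, const void *data, uint32_t len)
{
    struct msg_hdr h = { len };
    memcpy(ch->buf + ch->tail, &h, sizeof h);
    memcpy(ch->buf + ch->tail + sizeof h, data, len);
    ch->tail = (ch->tail + sizeof h + len) % RING_BYTES;
}

/* Receiver: poll the length word at the head; non-zero means a full
 * message has arrived. Zero the slot after consuming it so the space
 * can be reused. Returns the message length, or 0 if nothing is ready. */
uint32_t poll_msg(struct channel *ch, void *out)
{
    struct msg_hdr h;
    memcpy(&h, ch->buf + ch->head, sizeof h);
    if (h.len == 0)
        return 0;
    memcpy(out, ch->buf + ch->head + sizeof h, h.len);
    memset(ch->buf + ch->head, 0, sizeof h + h.len);
    ch->head = (ch->head + sizeof h + h.len) % RING_BYTES;
    return h.len;
}
```

Because the buffer sits in the receiver's memory, receiving costs only a local poll, and sending costs a single RDMA write with no CPU involvement on the receiving side.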

FaRM: Architecture
- Communication primitives are fast, but local accesses to main memory still achieve up to a 23x higher request rate
- FaRM is therefore designed to let applications improve performance by collocating data and computation on the same machine
- FaRM machines store data in main memory and also execute application threads
- The memory of all machines in the cluster is exposed as a shared address space

FaRM: Distributed Memory Management
- The shared address space consists of numerous 2GB shared memory regions
  - A region is the unit of address mapping, recovery, and RDMA registration with the NIC
- An object's address is a 32-bit region identifier plus a 32-bit offset relative to the start of the region
- Objects are located through consistent hashing, which maps a region identifier to the machine that stores the object (sketched below)
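
A short C sketch of this addressing scheme. The packing of region id and offset follows the slide; the ring-style consistent hash used to place regions on machines is a simplified stand-in for FaRM's actual placement and does not model replication or recovery.

```c
#include <stdint.h>

struct farm_addr {
    uint32_t region;   /* which 2GB shared-memory region */
    uint32_t offset;   /* byte offset from the start of that region */
};

/* Pack/unpack a 64-bit shared-address word: high 32 bits hold the
 * region identifier, low 32 bits the offset. */
static inline uint64_t addr_pack(struct farm_addr a)
{
    return ((uint64_t)a.region << 32) | a.offset;
}

static inline struct farm_addr addr_unpack(uint64_t raw)
{
    struct farm_addr a = { (uint32_t)(raw >> 32), (uint32_t)raw };
    return a;
}

/* Toy consistent hashing: hash the region id onto a 64-bit ring and
 * pick the machine whose ring position is its closest successor. */
uint32_t region_to_machine(uint32_t region,
                           const uint64_t *machine_pos, uint32_t n)
{
    uint64_t h = (uint64_t)region * 0x9E3779B97F4A7C15ULL;
    uint32_t best = 0;
    uint64_t best_dist = UINT64_MAX;
    for (uint32_t i = 0; i < n; i++) {
        uint64_t dist = machine_pos[i] - h;   /* wraps mod 2^64 */
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;
}
```

With this scheme, any machine can turn a 64-bit shared address into a (machine, region, offset) triple locally and issue an RDMA read directly, without a directory lookup.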

FaRM: Lock-Free Operations
- An application is guaranteed to read a consistent object state, even when the read is concurrent with writes to the same object
- Relies on cache-coherent DMA
- lockfreeread: reads the object with RDMA, then checks that the header version is unlocked and matches all of the object's cache-line versions (see the sketch below)
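
A C sketch of the consistency check behind lockfreeread, assuming each 64-byte cache line of an object begins with a copy of the object's version and the header's top bit serves as the lock bit; the exact layout in FaRM differs, and rdma_read() is a hypothetical transport stub.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64
#define LOCK_BIT   (1ULL << 63)

/* Hypothetical transport stub: fetch len bytes from a remote address. */
extern void rdma_read(uint64_t remote_addr, void *dst, size_t len);

/* A snapshot is consistent iff the header version is unlocked and
 * every cache line carries the same version as the header. */
static bool consistent(const uint8_t *obj, size_t len)
{
    uint64_t hdr;
    memcpy(&hdr, obj, sizeof hdr);
    if (hdr & LOCK_BIT)
        return false;                      /* a writer holds the lock */
    for (size_t off = CACHE_LINE; off < len; off += CACHE_LINE) {
        uint64_t v;
        memcpy(&v, obj + off, sizeof v);
        if (v != hdr)
            return false;                  /* torn read: retry */
    }
    return true;
}

/* Re-issue the RDMA read until a consistent snapshot is observed. */
void lockfree_read(uint64_t remote_addr, uint8_t *buf, size_t len)
{
    do {
        rdma_read(remote_addr, buf, len);
    } while (!consistent(buf, len));
}
```

The idea is that a writer locks the header, updates the object, and bumps every cache-line version along with the header version, so any torn read is detectable by a version mismatch and simply retried.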

FaRM: Hashtables
- FaRM provides a general key-value store interface
- Implemented as a hashtable on top of the shared address space
- Used to obtain pointers to shared objects (a lookup sketch follows)
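
A hedged C sketch of what such a lookup can look like: hash the key to a bucket, fetch a small run of neighboring buckets with one RDMA read, and scan them locally. The bucket layout and neighborhood size are illustrative assumptions rather than FaRM's exact hashtable design, and bounds handling at the end of the table is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define NEIGHBORHOOD 8     /* buckets fetched per lookup */

struct bucket {
    uint64_t key;          /* 0 marks an empty bucket */
    uint64_t addr;         /* packed region-id/offset of the object */
};

/* Hypothetical transport stub: fetch len bytes from a remote address. */
extern void rdma_read(uint64_t remote_addr, void *dst, size_t len);

/* Return the shared address of the object stored under `key`,
 * or 0 if the key is absent from the fetched neighborhood. */
uint64_t kv_lookup(uint64_t table_base, uint64_t nbuckets, uint64_t key)
{
    struct bucket run[NEIGHBORHOOD];
    uint64_t slot = (key * 0x9E3779B97F4A7C15ULL) % nbuckets;

    /* One RDMA read covers the whole neighborhood of buckets. */
    rdma_read(table_base + slot * sizeof(struct bucket), run, sizeof run);

    for (int i = 0; i < NEIGHBORHOOD; i++)
        if (run[i].key == key)
            return run[i].addr;
    return 0;              /* not found */
}
```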

Evaluation
FaRM's performance was compared to a baseline system that uses TCP/IP for messaging:
- FaRM outperforms MemC3, the best main-memory key-value store in the literature
- An order of magnitude better throughput and latency than the baseline
- These results hold over a wide range of settings

Related Work: Pilaf
- Pilaf is a key-value store that:
  - Uses send/receive verbs to send update operations to the server
  - Uses one-sided RDMA reads to implement lookups
  - Provides linearizability, using 64-bit CRCs (cyclic redundancy checks) to detect inconsistent reads
- Compared to Pilaf, FaRM:
  - Uses a more general technique to detect inconsistent reads
  - Has better hashtable performance and uses fewer RDMAs to perform lookups
  - Achieves higher space utilization

Related Work: RAMCloud
- RAMCloud describes techniques for logging and recovery in a main-memory key-value store, but doesn't provide much information about normal-case operation
- FaRM uses similar techniques for logging and recovery, but extends them:
  - Deals with transactions on general data structures
  - Provides a shared address space
  - Focuses on techniques to achieve good performance in the normal case

Limitations
- Requires a major overhaul of applications: TCP/IP is no longer used, so applications must be rewritten against the FaRM API
- Requires overhauling existing datacenter infrastructure
  - RDMA NICs are needed on every server
  - InfiniBand is needed for datacenters larger than 100 servers, because RoCE doesn't scale well
- 2GB pages can lead to resource fragmentation

Next Steps
- The holy grail in this area would be a drop-in replacement for TCP/IP that existing applications could use without modification
- This would allow applications to better utilize the network bandwidth available with modern hardware