NVMe over Fabrics support in Linux Christoph Hellwig Sagi Grimberg

NVMe over Fabrics support in Linux. Christoph Hellwig, Sagi Grimberg. 2016 Storage Developer Conference. All Rights Reserved.

NVMe over Fabrics: the beginning. An early 2014 demo apparently involved Linux (neither of us was involved). During early spec development there were at least two implementations: Intel (plus a few others) and HGST. Various Linux developers were, and still are, involved with the NVMe and NVMe over Fabrics specifications.

HGST prototype. Initially just tunneled NVMe commands over the existing SRP protocol structures, then tried to accommodate the existing draft spec where possible; where that was not possible, we proposed changes to the spec.

NVMe Linux Fabrics Driver WG. In 2015 a new working group of the NVM Express organization was created to merge the different Linux development streams. Multiple members contributed to the code, and even more tested it. We tried to follow Linux-style development as much as possible: a private git repository and a mailing list.

NVMe Linux Host Driver. Even before the release of the spec we started splitting the existing Linux NVMe driver into a common part and a PCIe-specific part: added block layer pass-through support for internal NVMe commands (similar to what Linux does for SCSI), separated data structures into common and PCIe-specific ones, added struct nvme_ctrl_ops, and moved the code around (new drivers/nvme/ directory).
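To illustrate the split, here is a simplified sketch of the kind of ops vtable struct nvme_ctrl_ops provides; the fields are abbreviated from the drivers/nvme/host/nvme.h of that era and are not the exact upstream definition.

```c
/*
 * Simplified sketch of struct nvme_ctrl_ops: the hooks each transport
 * (PCIe, RDMA, ...) fills in so the common core can drive the controller.
 * Abbreviated and illustrative, not the exact upstream definition.
 */
struct nvme_ctrl_ops {
	const char *name;
	struct module *module;

	/* Controller register access: PCIe uses MMIO, fabrics drivers
	 * translate this into Property Get/Set commands on the wire. */
	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);

	/* Transport-specific controller lifetime management. */
	int (*reset_ctrl)(struct nvme_ctrl *ctrl);
	void (*free_ctrl)(struct nvme_ctrl *ctrl);
	void (*submit_async_event)(struct nvme_ctrl *ctrl, int aer_idx);
};
```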

NVMe Linux Host Driver. The existing NVMe host driver was split into a layered subsystem. Core components: nvme-core implements the native NVMe specification in a fabric-independent fashion, and nvme-fabrics implements the transport-independent parts of the fabrics specification. Drivers: nvme-pci, the PCIe-specific host driver, and nvme-rdma, the RDMA-specific host driver.

NVMe over Fabrics Host Driver. The new fabrics drivers use the existing common code. The transport driver is in control of the actual I/O path (no additional indirections in the fast path). Existing user space APIs of the PCIe driver are also all supported when using fabrics. A new sub-command of the existing nvme-cli tool is used to connect to remote fabrics controllers.

NVMe over Fabrics Host Driver. Features supported: various RDMA transports (InfiniBand/RoCE/iWARP), dynamic connect/disconnect to/from multiple controllers, discovery support (using nvme-cli), keep-alive, and basic multi-pathing (using dm-multipath). Features not supported (yet): authentication (spec not done yet) and the Fibre Channel transport.

NVMe Linux Host Driver now. Most code is shared between the different transports: the common core and the common fabrics code sit underneath the PCIe and RDMA drivers, and the transport drivers themselves are fairly small (~2000 lines of code).

NVMe Target. Supports implementing NVMe controllers in the Linux kernel. Initially just NVMe over Fabrics; adding PCIe support (e.g. using vhost) could be done later. Split into a generic target core and transport drivers: RDMA and loop (for local testing); Fibre Channel support is work in progress.

NVMe Target. The NVMe target can use any Linux block device, e.g. NVMe, SAS, SATA, ramdisk, or virtio, and uses the block layer to communicate with the device. Early experiments with NVMe command pass-through were not continued.

NVMe target. Supported features: all mandatory NVMe I/O commands, the full set of NVMe (over Fabrics) admin commands, Data Set Management (DSM) for deallocate, dynamic subsystem/controller/namespace provisioning, and discovery service support (including referrals).

NVMe target. Features not supported (yet): SMART/error log pages, authentication (spec not done yet), persistent reservations, fused commands, NVMe security protocols (Security Send/Receive), and PCIe peer-to-peer I/O (e.g. a CMB in a backend NVMe device).

NVMe Target. Again most code is in the core: the whole core (~3000 lines of code) is smaller than many SCSI target transport drivers. The target is layered as a core plus fabrics, RDMA and loop pieces. We aggressively tried to offload work to common code (the RDMA subsystem, the configfs subsystem) and will continue to do so for new features, e.g. Persistent Reservations.

NVMe Target configuration. Uses a configfs interface to let user space tools configure the target; simpler and more integrated than the SCSI target. The primary user space tool is called nvmetcli and is written in Python. It allows interactive configuration using a console interface, and allows saving configurations to JSON files and restoring them.

Nvmetcli

Linux RDMA stack

Linux RDMA stack. The Linux RDMA stack has three layers: consumers (called ULPs), the RDMA core layer (API and management), and device drivers. As part of the NVMe RDMA development a lot of shared logic and duplicated code was moved into the core, and both ULP drivers and device drivers benefited from the code simplification and optimization.

RDMA memory registration. In order to allow remote access, a driver needs to register memory buffers with remote access permissions. The stack offered six different methods to register memory, which required a lot of effort to support and resulted in code duplication in both ULPs and device drivers.

RDMA memory registration. Converged on a single memory registration method: Fast Registration Work Requests (FRWR). The new, simpler memory registration API leaves most of the work to the core and provides a simple interface for both ULPs and providers (it uses the common scatterlist structure). Existing ULPs were migrated to the new API.
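As a rough sketch of what the converged path looks like for a ULP, the fragment below registers a scatterlist with ib_alloc_mr()/ib_map_mr_sg() and posts an IB_WR_REG_MR work request; error handling and MR teardown are trimmed, and the demo_* names are illustrative.

```c
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* Register a scatterlist for remote access using FRWR (sketch only). */
static int demo_register_mr(struct ib_pd *pd, struct ib_qp *qp,
			    struct scatterlist *sg, int sg_nents)
{
	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *bad_wr;
	struct ib_mr *mr;
	int n;

	/* Allocate an MR suitable for fast registration work requests. */
	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	/* Map the scatterlist onto the MR; the core builds the page list. */
	n = ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE);
	if (n < sg_nents)
		return n < 0 ? n : -EINVAL;	/* MR cleanup omitted */

	/* Post an IB_WR_REG_MR work request to make the mapping live. */
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;

	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}
```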

RDMA CQ API. The RDMA completion queue polling implementation was duplicated across the ULPs. Each got some of it right (and some of it wrong); several had CQ re-arming, fairness and context-abuse issues. We converged on a higher-level API that gets it right once and provides a simple interface to the ULPs.

RDMA CQ API. Exposes a simple interface for ULPs to attach a done handler to each work request (similar to the Linux block layer bio interface). Offers several CQ polling contexts: soft-IRQ (using the new irq-poll library), workqueue (process-like context), and direct polling (for application-driven polling). Existing ULPs were migrated to the new API.
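A minimal sketch of how a ULP uses this interface, based on the in-kernel ib_alloc_cq()/ib_cqe conventions described above; the demo_* names and the request structure are illustrative.

```c
#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

struct demo_request {
	struct ib_cqe	cqe;	/* embedded completion entry */
	/* ... ULP-specific per-request state ... */
};

/* Done handler, invoked by the core from the chosen polling context. */
static void demo_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct demo_request *req =
		container_of(wc->wr_cqe, struct demo_request, cqe);

	if (wc->status != IB_WC_SUCCESS)
		pr_err("demo send failed: %d\n", wc->status);
	/* ... complete the request, e.g. end the block layer request ... */
	(void)req;
}

static int demo_post_send(struct ib_qp *qp, struct demo_request *req,
			  struct ib_sge *sge)
{
	struct ib_send_wr wr = { }, *bad_wr;

	req->cqe.done = demo_send_done;	/* attach the done handler */
	wr.wr_cqe = &req->cqe;		/* instead of a raw wr_id */
	wr.sg_list = sge;
	wr.num_sge = 1;
	wr.opcode = IB_WR_SEND;
	wr.send_flags = IB_SEND_SIGNALED;

	return ib_post_send(qp, &wr, &bad_wr);
}

/* CQ allocation picks one of the polling contexts described above. */
static struct ib_cq *demo_alloc_cq(struct ib_device *dev)
{
	return ib_alloc_cq(dev, NULL /* private */, 128 /* nr_cqe */,
			   0 /* comp_vector */, IB_POLL_SOFTIRQ);
}
```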

RDMA CQ Pooling API. RDMA applications can achieve better performance and scalability by sharing RDMA completion queues: better completion aggregation and fewer interrupts. Smart and fair spreading of completion queues according to affinity requirements provides better parallelism in completion processing.

RDMA CQ Pooling API. Moves completion queue allocation and assignment into the RDMA core. Completion queue pools are allocated lazily, on demand, and ULPs can pass an optional affinity hint, which moves topology knowledge from the ULPs into the core. Work in progress, expected to make kernel 4.9.
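For illustration, the sketch below uses the ib_cq_pool_get()/ib_cq_pool_put() interface that this effort eventually produced upstream; the exact form may differ from the work in progress described here, and the demo_* wrappers are illustrative.

```c
#include <rdma/ib_verbs.h>

/* Sketch: a ULP asking the core for a shared, pooled CQ. */
static struct ib_cq *demo_get_shared_cq(struct ib_device *dev,
					unsigned int nr_cqe,
					int comp_vector_hint)
{
	/* The core picks (or lazily allocates) a pooled CQ matching the
	 * polling context and the optional affinity hint. */
	return ib_cq_pool_get(dev, nr_cqe, comp_vector_hint,
			      IB_POLL_WORKQUEUE);
}

static void demo_put_shared_cq(struct ib_cq *cq, unsigned int nr_cqe)
{
	/* Return the reserved completion entries to the pool. */
	ib_cq_pool_put(cq, nr_cqe);
}
```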

Storage RDMA data transfer model. All the RDMA storage ULPs share the classic RPC model: the initiator sends a command, the target moves the data with RDMA READ (for writes) or RDMA WRITE (for reads) operations, and the target then sends a response back to the initiator.

RDMA Read/Write implementations. Every single ULP implemented the same sequence of operations to transfer data over RDMA, so again there was a lot of code duplication. All the logic lived in the ULPs, with none in the core or the HCA drivers. Some got it right (and most got it wrong somewhere), and ULPs struggled to support iWARP, which is slightly different from IB/RoCE.

RDMA Read/Write API. A new core library implements generic data transfer using RDMA READ/WRITE. It exposes a simple and intuitive interface for ULPs (using the common scatterlist structure), masks the RDMA transport differences (iWARP vs. IB/RoCE), provides support for the Mellanox signature extension (to implement T10 PI), and optimizes the implementation with correct work request pipelining, completion signalling and other features. Most existing ULPs were migrated to the new API.
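A minimal sketch of the rdma_rw_ctx_* helpers this library exposes, here posting an RDMA WRITE of local data to a remote buffer as a target would do to complete a read; error handling is trimmed, teardown via rdma_rw_ctx_destroy() is omitted, and demo_rdma_write() is an illustrative wrapper.

```c
#include <linux/dma-direction.h>
#include <linux/scatterlist.h>
#include <rdma/rw.h>

static int demo_rdma_write(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
			   u8 port_num, struct scatterlist *sg, u32 sg_cnt,
			   u64 remote_addr, u32 rkey, struct ib_cqe *done_cqe)
{
	int ret;

	/* Map the local scatterlist and build the WRITE work request
	 * chain; the core hides the iWARP vs. IB/RoCE differences here. */
	ret = rdma_rw_ctx_init(ctx, qp, port_num, sg, sg_cnt, 0,
			       remote_addr, rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	/* Post the chain; done_cqe->done is called once the transfer
	 * (and any required invalidation) has completed. */
	return rdma_rw_ctx_post(ctx, qp, port_num, done_cqe, NULL);
}
```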

Initial Performance Measurements. 13us latency for QD=1 random reads; sub-10us network contribution.

More Performance Measurements. Polling allows for sub-7us added latency.

Status. All the code mentioned has been merged into Linux 4.8 (only release candidates so far). Fibre Channel support for both the host and the target will be submitted soon. The updated nvme-cli with fabrics support and nvmetcli still need to get into the distributions.

Links
NVMe over Fabrics source code: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git and http://git.infradead.org/nvme-fabrics.git (for-next)
nvme-cli repository: https://github.com/linux-nvme/nvme-cli
nvmetcli repository: http://git.infradead.org/users/hch/nvmetcli.git

Questions?