Important new NVMe features for optimizing the data pipeline

Dr. Stephen Bates, CTO, Eideticom

Outline

- Intro to NVMe Controller Memory Buffers (CMBs)
- Use cases for CMBs:
  - Submission Queue Support (SQS) only
  - RDS (Read Data Support) and WDS (Write Data Support) for NVMe p2p copies
  - SQS, RDS and WDS for optimized NVMe over Fabrics
- Software for NVMe CMBs:
  - SPDK (Storage Performance Development Kit) work for NVMe copies
  - Linux kernel work for p2pdma and for offload
- Roadmap for the future

Intro to Controller Memory Buffers

CMBs were introduced to the NVMe standard in 2014, in version 1.2. An NVMe CMB is a PCIe BAR (or part thereof) that can be used for certain NVMe-specific data types. The main purpose of the CMB is to provide an alternative to placing queues in host memory and to placing data for DMA in host memory.

As well as a BAR, two optional NVMe registers are needed: CMBLOC (location) and CMBSZ (size and supported types). Multiple vendors support CMBs today (Intel, Eideticom, Everspin) or soon (Toshiba, Samsung, WDC, etc.).

Intro to Controller Memory Buffers (annotated lspci example)

The slide walks through an annotated lspci listing for a CMB-enabled NVMe device (the listing itself is not reproduced here):

A - This device's manufacturer has registered its vendor and device IDs with the PCIe ID database, so lspci shows a human-readable description of it.
B - This device has three PCIe BARs; BAR0 is 16KB and is the standard NVMe register BAR that any compliant NVMe device must have.
C - The third BAR is the Controller Memory Buffer (CMB), which can be used for both NVMe queues and NVMe data.
F - Since this device is an NVMe device, it is bound to the standard Linux kernel NVMe driver.

The slide also shows the CMBLOC and CMBSZ register values for a CMB-enabled NVMe device.
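As a rough companion to the register description above, here is a minimal sketch, assuming a memory-mapped view of BAR0, of how the CMBLOC and CMBSZ fields defined in NVMe 1.2 can be decoded. The register offsets and bit fields follow the published specification; the function and variable names are illustrative only and not part of the talk.

```c
/*
 * Illustrative sketch: decode the CMBLOC and CMBSZ controller registers
 * (NVMe 1.2+) from a memory-mapped BAR0. Helper names are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

#define NVME_REG_CMBLOC 0x38  /* Controller Memory Buffer Location */
#define NVME_REG_CMBSZ  0x3C  /* Controller Memory Buffer Size     */

static void decode_cmb(const volatile uint32_t *bar0)
{
    uint32_t cmbloc = bar0[NVME_REG_CMBLOC / 4];
    uint32_t cmbsz  = bar0[NVME_REG_CMBSZ / 4];

    /* CMBSZ.SZU encodes the size unit: 0=4KiB, 1=64KiB, 2=1MiB, ... 6=64GiB */
    uint64_t szu_bytes = 4096ULL << (4 * ((cmbsz >> 8) & 0xF));
    uint64_t cmb_size  = (uint64_t)(cmbsz >> 12) * szu_bytes;  /* CMBSZ.SZ   */

    printf("CMB lives in BAR %u at offset 0x%llx, size %llu bytes\n",
           (unsigned)(cmbloc & 0x7),                              /* CMBLOC.BIR  */
           (unsigned long long)((cmbloc >> 12) * szu_bytes),      /* CMBLOC.OFST */
           (unsigned long long)cmb_size);

    /* Which data types may be placed in the CMB */
    printf("Supports: SQs:%u CQs:%u PRP/SGL lists:%u read data:%u write data:%u\n",
           (unsigned)(cmbsz & 1), (unsigned)((cmbsz >> 1) & 1),
           (unsigned)((cmbsz >> 2) & 1), (unsigned)((cmbsz >> 3) & 1),
           (unsigned)((cmbsz >> 4) & 1));
}
```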

Some Fun Use Cases for CMBs

- Placing some (or all) of your NVMe queues in the CMB rather than in host memory. This reduces latency [Linux kernel 1 and SPDK 1].
- Using the CMB as a DMA buffer allows for offloaded NVMe copies. This can improve performance and offloads the host CPU [SPDK 1].
- Using the CMB as a DMA buffer allows RDMA NICs to place NVMe-oF data directly into the NVMe SSD. This reduces latency and CPU load [Linux kernel 2].

Figure: traditional DMAs (left) load the CPU; P2P DMAs (right) do not.

1 Upstream in the relevant tree. 2 Proposed patches (see last slide for git repo).

Software for CMBs - SPDK

The Storage Performance Development Kit (SPDK) is a free and open-source (FOSS) user-space framework for high-performance storage, with a focus on NVMe and NVMe-oF. Code was added in February 2018 to enable P2P NVMe copies when CMBs allow it. A simple example application using this new API is also in the SPDK examples (cmb_copy). cmb_copy uses SPDK's APIs to copy data between NVMe SSDs using P2P DMAs, bypassing the CPU's memory and PCIe subsystems.

Software for CMBs - SPDK

cmb_copy moves data directly from SSD A to SSD B using the NVMe CMB(s). 99.99% of Upstream Port (USP) traffic on the PCIe switch is eliminated. The OS remains in complete control of the I/O and handles any status/error messages. The NVMe SQEs can also be placed in the CMB (or not) as desired, and either SGLs or PRPs can be supported. See https://asciinema.org/a/bkd32zdlykviq7f8m5bbvdx42
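To make that data path concrete, below is a minimal sketch of the core of a cmb_copy-style P2P copy using SPDK's NVMe driver. It assumes the 2018-era SPDK API (in particular spdk_nvme_ctrlr_alloc_cmb_io_buffer()); probe/attach setup, error handling, and exact function names should be checked against the SPDK release in use. It is not the cmb_copy example itself, just an outline of the same idea.

```c
/*
 * Hedged sketch of a cmb_copy-style peer-to-peer copy: read blocks from
 * SSD A into a buffer carved out of SSD B's CMB, then write that same
 * buffer to SSD B. The DMA traffic stays below the PCIe switch; host
 * memory is never touched. The CMB buffer allocator shown here is the
 * 2018-era SPDK call and may differ in current releases.
 */
#include <stdbool.h>
#include <stdint.h>
#include "spdk/nvme.h"

static bool io_done;

static void io_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    io_done = true;
}

static void wait_for(struct spdk_nvme_qpair *qpair)
{
    while (!io_done)
        spdk_nvme_qpair_process_completions(qpair, 0);
    io_done = false;
}

/* Copy 'lba_count' blocks at 'lba' from (ns_a, qp_a) to (ns_b, qp_b). */
static int p2p_copy(struct spdk_nvme_ctrlr *ctrlr_b,
                    struct spdk_nvme_ns *ns_a, struct spdk_nvme_qpair *qp_a,
                    struct spdk_nvme_ns *ns_b, struct spdk_nvme_qpair *qp_b,
                    uint64_t lba, uint32_t lba_count)
{
    size_t len = (size_t)lba_count * spdk_nvme_ns_get_sector_size(ns_b);

    /* Allocate the bounce buffer inside SSD B's CMB, not in host DRAM. */
    void *cmb_buf = spdk_nvme_ctrlr_alloc_cmb_io_buffer(ctrlr_b, len);
    if (cmb_buf == NULL)
        return -1; /* controller has no (usable) CMB */

    /* SSD A DMAs its data straight into SSD B's CMB ... */
    spdk_nvme_ns_cmd_read(ns_a, qp_a, cmb_buf, lba, lba_count,
                          io_complete, NULL, 0);
    wait_for(qp_a);

    /* ... and SSD B then writes from its own CMB to flash. */
    spdk_nvme_ns_cmd_write(ns_b, qp_b, cmb_buf, lba, lba_count,
                           io_complete, NULL, 0);
    wait_for(qp_b);

    spdk_nvme_ctrlr_free_cmb_io_buffer(ctrlr_b, cmb_buf, len);
    return 0;
}
```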

Software for CMBs - The Linux Kernel

A P2P framework called p2pdma is being proposed for the Linux kernel. It is much more general than NVMe CMBs: any PCIe device can utilize it (NICs, GPGPUs, etc.). PCIe drivers can register device memory (e.g. CMBs or BARs) or request access to P2P memory for DMA. The initial patches use p2pdma to optimize the NVMe-oF target code.

The p2pdma framework can be used to improve NVMe-oF targets. Here we show results from an NVMe-oF target system: p2pdma can reduce CPU memory load by 50x and CPU PCIe load by 25x. NVMe offload can also be employed to reduce CPU core load by 50x.
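As a rough illustration of that register-and-allocate flow, the sketch below shows how a PCIe driver might expose part of a BAR as P2P memory and how a consumer (such as the NVMe-oF target) might allocate from it. It is based on the pci-p2pdma interfaces that were eventually merged upstream (pci_p2pdma_add_resource(), pci_alloc_p2pmem(), etc.); the API in the proposed patches at the time of this talk may have differed slightly.

```c
/*
 * Sketch of the p2pdma provider/consumer flow, assuming the pci-p2pdma
 * API that landed upstream (include/linux/pci-p2pdma.h). Error handling
 * is trimmed and the surrounding driver structure is omitted.
 */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Provider: an NVMe-style driver publishing its CMB BAR as P2P memory. */
static int expose_cmb(struct pci_dev *pdev, int bar, size_t size, u64 offset)
{
    int rc;

    /* Register part of the BAR with the p2pdma framework ... */
    rc = pci_p2pdma_add_resource(pdev, bar, size, offset);
    if (rc)
        return rc;

    /* ... and advertise it so other subsystems may find and use it. */
    pci_p2pmem_publish(pdev, true);
    return 0;
}

/* Consumer: e.g. the NVMe-oF target allocating a P2P buffer for RDMA. */
static void *grab_p2p_buffer(struct pci_dev *provider, size_t len)
{
    /* Memory comes from the provider's BAR, not from host DRAM. */
    return pci_alloc_p2pmem(provider, len);
}

static void release_p2p_buffer(struct pci_dev *provider, void *buf, size_t len)
{
    pci_free_p2pmem(provider, buf, len);
}
```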

Software for CMBs - The Linux Kernel

The hardware setup for the NVMe-oF p2pdma testing is as shown on the right (figure not reproduced). The software setup consisted of a modified Linux kernel and standard NVMe-oF configuration tools (mostly nvme-cli and nvmet). The kernel used added support for NVMe offload and peer-to-peer DMAs using an NVMe CMB provided by the Eideticom NVMe device.

In the diagram, green marks the legacy data path and red the p2pdma data path. This is the NVMe-oF target configuration used; note that the RDMA NIC is connected to the PCIe switch and not to a CPU root port.

Software for CMBs - The Linux Kernel: FMS2018 Demo!

Eideticom puts accelerators behind NVMe namespaces. We can combine this with CMBs to add compute to p2pdma. Demo with AMD and Xilinx in the Xilinx booth: >3GB/s compression (input) via a U.2 NVMe accelerator (NoLoad), with p2pdma offloading the CPU by >99%. It leverages p2pdma to move data from the input NVMe SSD to the NoLoad, and from the NoLoad to the output NVMe SSD.

Persistent Memory Regions

PMRs add persistence to CMBs. The PMR features are still being discussed and the standard will be updated - get involved! Small PMRs (10MB-1GB) are interesting as a write cache, as a journal for SQEs, or as a persistent scratchpad for metadata. Big PMRs (>>1GB) are really interesting.

Persistent Memory Regions

Large NVMe PMRs would enable PMoF-aware filesystems! See https://www.youtube.com/watch?v=oieem6hhass&t=1310s

Roadmap for CMBs, PMRs and the Software

NVMe CMBs have been in the standard for a while. However, it is only now that they are starting to become available and that software is starting to utilize them. SPDK and the Linux kernel are the two main locations for CMB software enablement today: SPDK has NVMe P2P copies, with NVMe-oF updates coming; in the Linux kernel, the p2pdma framework should be upstream soon and will be expanded to support other NVMe/PCIe resources (e.g. doorbells). Persistent Memory Regions add non-volatile CMBs and will require (lots of) software enablement too; they will enable a path to persistent memory storage on the PCIe bus.