An NVMe-based Offload Engine for Storage Acceleration
Sean Gibb, Eideticom
Stephen Bates, Raithlin

Overview
- Acceleration for Storage
- NVMe for Acceleration
  - How are we using (abusing ;-)) NVMe to support acceleration?
- Embedded NVMe Controller
  - RISC-V on FPGA
  - Performance
- Fabrics, Peer-to-Peer, and CMB

Acceleration
[Diagram: host CPU attached over a PCIe bus to NVMe SSDs, an HDD, an RDMA NIC, and a NoLoad(TM) accelerator card]
- Storage I/O bandwidth increasing rapidly
- Storage workloads can be taxing on host CPU
- Hyperconverged storage exacerbates the problem
- Reconfigurable logic can provide compelling solution for storage workloads

Acceleration Over NVMe
- Using the NVMe host controller interface to provide data and control to accelerator functions
- No need for proprietary drivers
  - Avoid driver development and take advantage of improvements in the NVMe standard, drivers and tools
- Leverage industry-standard NVMe test tools
  - Assist with deployment and benchmarking
  - Test tools, software and ecosystem
- Can tie into NVMe over Fabrics
- Can leverage inbox drivers in all modern OSes
- Can leverage servers and storage systems developed for NVMe

Basic Architecture
[Diagram: NoLoad(TM) Accelerator Board — the host CPU connects over PCIe to a PCIe controller and DMA engine, an NVMe controller, and accelerators on an internal FPGA bus, with a DDR controller and DDR memory]
- Presents as an NVMe 1.2 device with multiple namespaces
- Host CPU communicates with accelerators via NVMe commands to the NVMe controller
- NVMe controller pulls commands and data via the DMA engine
- Accelerators easily integrated on an internal bus
- Accelerators are mapped to NVMe namespaces to enable discovery and command and data routing
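Because each accelerator presents as an ordinary NVMe namespace, the host reaches it through the inbox driver and block layer. Below is a minimal sketch of that data path, assuming a hypothetical accelerator namespace at /dev/nvme0n2 and 4 KB-aligned direct I/O; it is not Eideticom's actual host API.

```c
/* Hypothetical sketch: moving data to/from an accelerator namespace using
 * ordinary block I/O through the inbox NVMe driver. The device path and
 * 4 KB alignment are assumptions, not NoLoad specifics. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define XFER_SIZE (128 * 1024)   /* one transfer to the accelerator */

int main(void)
{
    /* Namespace 2 of controller 0 is assumed to be an accelerator namespace. */
    int fd = open("/dev/nvme0n2", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, XFER_SIZE)) return 1;  /* O_DIRECT needs alignment */
    memset(buf, 0xAB, XFER_SIZE);

    /* NVMe Write delivers input data to the accelerator ... */
    if (pwrite(fd, buf, XFER_SIZE, 0) != XFER_SIZE) { perror("pwrite"); return 1; }
    /* ... and NVMe Read retrieves the results. */
    if (pread(fd, buf, XFER_SIZE, 0) != XFER_SIZE) { perror("pread"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```

Because this is plain block I/O, the existing NVMe drivers, servers and test tools apply unchanged.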

Acceleration Over NVMe Commands
- Identify Namespace is used for accelerator discovery
  - Vendor specific field used to provide accelerator-specific information
- Write is used to provide data to accelerators
- Read is used to retrieve results from accelerators
- Writing to and reading from a specific namespace communicates with a specific accelerator
- Vendor specific commands available for accelerator-specific control
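From user space, the discovery step can be exercised with a standard NVMe admin passthrough. The sketch below issues Identify Namespace through the Linux inbox driver and inspects the vendor-specific region of the returned structure; the device path, namespace ID, and the meaning of those vendor-specific bytes are assumptions for illustration.

```c
/* Sketch (not Eideticom's actual API): discovering an accelerator namespace
 * with a standard NVMe Identify Namespace admin command and peeking at its
 * vendor-specific bytes. */
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    uint8_t id_ns[4096];
    int fd = open("/dev/nvme0", O_RDONLY);   /* NVMe admin (char) device */
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_admin_cmd cmd = {
        .opcode   = 0x06,                    /* Identify */
        .nsid     = 2,                       /* assumed accelerator namespace */
        .addr     = (uint64_t)(uintptr_t)id_ns,
        .data_len = sizeof(id_ns),
        .cdw10    = 0,                       /* CNS=0: Identify Namespace */
    };

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("ioctl"); return 1; }

    /* In NVMe 1.2 the tail of the Identify Namespace structure (from byte 384)
     * is vendor specific; an accelerator could describe itself there. */
    printf("vendor-specific byte 0: 0x%02x\n", id_ns[384]);

    close(fd);
    return 0;
}
```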

Embedded NVMe Controller
- For flexibility, developed an embedded controller
  - Faster turnaround on compliance debugging
  - Quickly implement new features
- Downside is that getting performance from an embedded controller on an FPGA is more difficult
  - Requires coprocessors and offload

Processor Selection
- Which processor to use for the controller?
- Requirements
  - Platform agnostic
  - Broad software ecosystem
- Soft requirements
  - Extensible instruction set
  - 32-bit and 64-bit addressing available

RISC-V
- RISC-V is an instruction set architecture
  - Gaining momentum in academia and industry
  - Originally developed in 2010 at UC Berkeley
- Several commercial and open-source processor implementations available
  - Can be autogenerated using an open-source toolchain
- Meets our hard and soft requirements with a few caveats

RISC-V Software Ecosystem
- RISC-V includes software support for:
  - GCC toolchain with GDB support
  - LLVM toolchain
  - Spike simulator
  - QEMU model
- Includes OS support for GNU/Linux, FreeBSD, and NetBSD

RISC-V Core
- Original plan was to use the Rocket core generator
  - Rocket designs are best suited to ASICs
  - Only achieved 50 MHz on FPGA
- Alternative was to start from ORCA
  - BSD license
  - FPGA-optimized RISC-V CPU
  - Using the 32-bit instruction set to reduce size
  - Original design achieved 125 MHz

Development Process
- Wrote software for our controller and replaced the QEMU NVMe model to verify functionality
- Ported to RISC-V
  - Porting the DMA accounted for 90% of the effort
- Verified the controller against Linux and Windows drivers with a backing RAM drive
- Performance testing for the RAM drive

Test Setup
- Intel i5-6500
- PLX 9797 PCIe switch
- Eideticom NoLoad accelerator targeted to a Xilinx XCVU095 UltraScale FPGA
  - PCIe Gen3 x8
  - 2 x 2.5 GB DDR4
- Samsung 960 EVO 250 GB M.2 SSD
- Intel SSDPEKKW256G7 256 GB M.2 SSD
- Viavi PCIe capture card

FIO Performance
[Chart: FIO Throughput by Block Size — throughput (GB/s, 0 to 8) vs FIO block size (0.25 KB to 4096 KB); series: Read FIO, Write FIO]
DMA Performance
- Big block transfers saturate PCIe Gen3 x8 throughput
- Small block transfers require further work by adding more RISC-V cores and command processing offload engines
- Capable of saturating PCIe Gen3 x8 with the current DMA engine for most block sizes
- Verified with the PCIe capture card that we are saturating PCIe bandwidth
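As a rough illustration of what each point on the chart measures, the sketch below times queue-depth-1 direct-I/O sequential reads of one block size. fio drives far deeper queues and multiple jobs, so this is only a stand-in; the device path, block size and run length are assumptions, not the original fio job.

```c
/* Minimal stand-in for one point of the fio throughput sweep: time O_DIRECT
 * sequential reads of a single block size against an NVMe namespace. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const size_t bs    = 1024 * 1024;   /* 1024 KB block size */
    const int    iters = 4096;          /* 4 GiB total        */
    void *buf;

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }
    if (posix_memalign(&buf, 4096, bs)) return 1;   /* O_DIRECT alignment */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        if (pread(fd, buf, bs, (off_t)i * bs) != (ssize_t)bs) { perror("pread"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s at %zu KB blocks\n", (double)bs * iters / secs / 1e9, bs / 1024);

    free(buf);
    close(fd);
    return 0;
}
```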

Latency
[Chart: Read Latency by Block Size — latency (µs) vs block size (0.5 KB to 1024 KB); measured values range from roughly 19 µs to 268 µs]
- NVMe latency is largely due to the software path
- The accelerator use model will tend to focus more on throughput than latency
- Future improvements in NVMe command processing will improve latency

RISC-V Complications
- No external debugger for RISC-V yet
  - Difficult to track down bugs in an embedded system without an external debugger
  - Built our own internal primitive debugger
- Trade-offs between code size and clock rate in the FPGA design are persistent
- Instruction bubbles in the processor were slowing us down
  - Fixing the ORCA implementation improved performance
  - Turned up and fixed several ORCA bugs during this process
  - Managed to get ORCA to 190 MHz on FPGA
- Built DMA offload to handle data transfers and Completion commands (see the sketch below)
- Underbaked or missing features in ORCA
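For illustration, the pattern behind such a DMA offload, from the soft core's point of view, looks something like the following; the register block, addresses, and bit layout are entirely hypothetical and are not NoLoad's actual design.

```c
/* Generic firmware-side sketch of kicking a DMA offload engine from the
 * RISC-V soft core: fill descriptor registers, ring a doorbell, poll for
 * completion. All addresses and bits below are hypothetical. */
#include <stdint.h>

#define DMA_BASE    0x40000000u   /* hypothetical MMIO base for the engine */
#define REG(off)    (*(volatile uint32_t *)(uintptr_t)(DMA_BASE + (off)))
#define DMA_SRC     0x00          /* source address      */
#define DMA_DST     0x08          /* destination address */
#define DMA_LEN     0x10          /* transfer length     */
#define DMA_CTRL    0x14          /* control/doorbell    */
#define DMA_STATUS  0x18          /* status              */
#define CTRL_START  (1u << 0)
#define STATUS_DONE (1u << 0)

/* Copy 'len' bytes from 'src' to 'dst' using the offload engine. */
int dma_copy(uint32_t src, uint32_t dst, uint32_t len)
{
    REG(DMA_SRC)  = src;
    REG(DMA_DST)  = dst;
    REG(DMA_LEN)  = len;
    REG(DMA_CTRL) = CTRL_START;               /* ring the doorbell */

    while (!(REG(DMA_STATUS) & STATUS_DONE))  /* busy-poll for completion */
        ;
    return 0;
}
```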

Error Correction Accelerator
- Built an RS(32, 4) erasure code (EC) accelerator
  - ISA-L compatible
  - Utilizes 16 KB block sizes
  - Saturates PCIe Gen3 x8 throughput (i.e. 8 GB/s)
- Modified the ISA-L perf test in less than an hour to use the NoLoad NVMe accelerator
  - Using our host-side API, it integrates into host software with 10 lines of code
- Roadmap includes a PCIe Gen3 x16 (PCIe Gen4 x8) dual-namespace version capable of 16 GB/s
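For context, the ISA-L software path that such an accelerator can stand in for looks roughly like the sketch below. The 32-data/4-parity reading of RS(32, 4), the fragment size, and the buffer handling are assumptions, and the NoLoad host-side API itself is not shown.

```c
/* Baseline ISA-L erasure-code path that an RS(32, 4)-style accelerator could
 * offload. Sketch only: the data/parity split and fragment size are assumed. */
#include <isa-l/erasure_code.h>

#define K   32          /* data fragments   (assumed)           */
#define P   4           /* parity fragments (assumed)           */
#define LEN (16 * 1024) /* 16 KB fragment size, as on the slide */

int encode_stripe(unsigned char *data[K], unsigned char *parity[P])
{
    unsigned char encode_matrix[(K + P) * K];
    unsigned char g_tbls[K * P * 32];

    /* Build a Reed-Solomon generator matrix and expand it into the
     * multiplication tables that ec_encode_data() consumes. */
    gf_gen_rs_matrix(encode_matrix, K + P, K);
    ec_init_tables(K, P, &encode_matrix[K * K], g_tbls);

    /* Compute the P parity fragments from the K data fragments. */
    ec_encode_data(LEN, K, P, g_tbls, data, parity);
    return 0;
}
```

In the offloaded case, the same fragments would presumably be written to the EC accelerator's namespace and the parity read back, as in the earlier data-path sketch.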

Peer-to-Peer CMB
[Diagram: host CPU, two NVMe drives, and the NoLoad card with its internal storage and CMB. An NVMe Read is issued to one drive with its destination in the CMB (the drive performs a PCIe memory write to the CMB, then sends the read completion); an NVMe Write command is issued with the CMB as source (write completion follows); an internal DMA moves data between the CMB and NoLoad storage.]
- Added full CMB support to the accelerator
  - Took 1 day thanks to the software controller
- With a data CMB, only one external DMA is required
  - Removes load on the host CPU for memcpy

Peer-to-Peer CMB (Staging Buffer)
[Diagram: two NVMe drives exchange data through the NoLoad CMB. An NVMe Read is issued to one drive with its destination in the CMB (PCIe memory write to the CMB), then an NVMe Write command is issued to the other drive with the CMB as source (PCIe memory read from the CMB); completions are returned to the host.]
- Added full CMB support to the accelerator
  - Took 1 day thanks to the software controller
- With a data CMB, only one external DMA is required
  - Removes load on the host CPU for memcpy
- CMB can be used as a staging buffer between two devices that do not support CMB (see the sketch below)
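A hypothetical user-space sketch of the staging-buffer flow, assuming out-of-tree p2pmem-style support that exposes the CMB as an mmap()-able character device (all device names are illustrative). The buffer handed to pread() and pwrite() lives in the NoLoad BAR, so the drive-to-drive copy never lands in system memory.

```c
/* Hypothetical sketch: drive-to-drive copy staged through a CMB that is
 * exposed as an mmap()-able character device. Device paths are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* 1 MB staging chunk */

int main(void)
{
    int cmb = open("/dev/p2pmem0", O_RDWR);             /* NoLoad CMB (assumed) */
    int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
    if (cmb < 0 || src < 0 || dst < 0) { perror("open"); return 1; }

    /* Map a chunk of the CMB into our address space; these pages are PCIe BAR
     * memory on the NoLoad card, not system RAM. */
    void *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, cmb, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* NVMe Read with its destination in the CMB (drive writes into the BAR),
     * then NVMe Write with the CMB as source (drive reads from the BAR). */
    if (pread(src, buf, CHUNK, 0) != CHUNK)  { perror("pread");  return 1; }
    if (pwrite(dst, buf, CHUNK, 0) != CHUNK) { perror("pwrite"); return 1; }

    munmap(buf, CHUNK);
    close(cmb); close(src); close(dst);
    return 0;
}
```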

Peer-to-Peer CMB Results
[Chart: Peer-to-Peer Testing with CMB vs System Memory — throughput (MB/s, 0 to 1800) vs block size (4 KB to 131072 KB); series: Via CMB, Via System Memory]
- Current setup saturates due to insufficient sources in the test environment

Peer-to-Peer with CMB as Staging Buffer
[Chart: Peer-to-Peer Staging Buffer, CMB vs System Memory — throughput vs block size (4 KB to 131072 KB); series: via CMB, via System Memory]
- Our test setup has insufficient sources to demonstrate expected maximum performance