14th ANNUAL WORKSHOP 2018 NVMF TARGET OFFLOAD. Liran Liss. Mellanox Technologies. April 2018

AGENDA
Introduction
  NVMe
  NVMf
NVMf target driver
Offload model
Verbs interface
Status

INTRODUCTION

NVME
Standard PCIe host controller interface for solid-state storage
  Driven by an industry consortium of 80+ members
  Standardized feature, command, and register sets
  Leverages PCIe capabilities: low latency, scalable bandwidth, power efficiency, etc.
Focus on efficiency, scalability, and performance
  All parameters for a 4KB command in a single 64B DMA fetch
  Simple command set (13 required commands)
  MSI-X support

NVME OVER FABRICS
Standard NVMe
  Devices are constrained to the server / storage box
  Limited number of PCIe-attached devices
  PCIe, SAS, and SATA are not scale-out: distance limitations, hard to share, complex error handling
NVMe Over Fabrics (NVMf)
  Announced September 2014
  Standard V1.0 done, published June 2016
NVMf properties
  Provides scale-out without requiring SCSI protocol translation
  Preserves the NVMe command set
  Simplifies storage virtualization, migration, and failover
  Goal: <10µs added latency compared to local NVMe

NVME MAPPING TO FABRICS
Queues:      NVMe uses a Submission Queue (SQ) and Completion Queue (CQ); NVMf uses a QP Send Queue and QP Receive Queue (+CQ)
Submission:  NVMe host writes the SQE and rings a doorbell; NVMf initiator sends an SQE capsule
Completion:  NVMe device writes the CQE, the host takes an interrupt and polls the CQ; NVMf target sends a CQE capsule, the initiator takes an interrupt and polls the Rx CQ
Data:        NVMe exchanges data over PCIe; NVMf uses RDMA Read/Write, with immediate (in-capsule) data up to 8K

COMMAND/RESPONSE CAPSULES
[Figure: command capsule and response capsule layouts]

LINUX NVMF TARGET
[Diagram: software NVMf target. IO and admin queues from HostX and HostY arrive through the NIC and NIC driver; the NVMf target (the NVMf subsystem) submits BIOs through the block layer to the NVMe PCI driver and device.]

NVMF TARGET OFFLOAD
[Diagram: offloaded NVMf target. The NVMf target binds host-facing QPs to the NVMf offload in the NIC and hands it the back-end IOQ and CQ obtained via GetProperties; offloaded IO queues from HostX and HostY are driven by the NIC hardware directly against the NVMe PCI device, while admin queues remain in software.]

OFFLOAD FLOW
[Diagram: offloaded IO-READ flow; no CPU cycles consumed. The initiator's fabric SEND (IO-READ request) is handled by the NIC, which posts the SQE to the NVMe IOQ in host memory and rings the doorbell; the NVMe device fetches the SQE, DMAs the data into the data buffer, and posts a CQE; the NIC polls the CQ, RDMA-WRITEs the data to the initiator, and sends the fabric response.]

OFFLOAD FLOW CMB
[Diagram: the same offloaded flow with the IOQ, CQ, and data buffer placed in the NVMe Controller Memory Buffer (CMB); no CPU cycles consumed and no traffic passes through system memory.]

NVMF OFFLOAD VERBS

MODEL
[Diagram: offload model. A front-end subsystem, represented by an NVMf SRQ, is bound to host-facing QPs with their send CQs (SCQ) and a receive CQ (RCQ); it maps to back-end controllers A and B, each with its own IO SQ/CQ pair, exposing namespaces X, Y, and Z.]

OBSERVATIONS
Offloaded commands are not visible to SW
  Do not consume SRQ Rx WQEs
  Do not generate SRQ completions
  Do not consume WQEs on QP Send queues
  Do not generate Send completions
Non-offloaded commands follow standard Verbs processing
  Consume SRQ Rx WQEs and generate CQEs on the SRQ CQ
  Responses are posted on the corresponding QP Send queues and generate completions
The back-end IOQ + CQ are under exclusive HW ownership
  SW may use other IOQs/CQs for non-offloaded operations
Front-end NSIDs are not necessarily equal to back-end NSIDs

VERBS FLOW
Identify NVMe-oF offload capabilities of the device
Create SRQ
  Represents a front-end NVMf subsystem
Create NVMe backend
  Represents locally attached back-end NVMe subsystems
Map namespaces
  From: front-end facing namespace id
  To: NVMe backend + namespace id
Create QP
  Connected to hosts
  Bound to an NVMf SRQ
Modify QP
  Enable NVMf offload following CONNECT completion

NVMF CAPABILITIES: IBV_QUERY_DEVICE_EX()

struct ibv_device_attr_ex {
        ...
        struct ibv_nvmf_caps nvmf_caps;
};

struct ibv_nvmf_caps {
        enum nvmf_offload_type offload_type;
        uint32_t max_backend_ctrls_total;
        uint32_t max_backend_ctrls;
        uint32_t max_namespaces;
        uint32_t max_staging_buf_pages;
        uint32_t min_staging_buf_pages;
        uint32_t max_io_sz;
        uint16_t max_nvme_queue_sz;
        uint16_t min_nvme_queue_sz;
        uint32_t max_ioccsz;
        uint32_t min_ioccsz;
        uint16_t max_icdoff;
};
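A minimal sketch of probing for the offload, assuming <infiniband/verbs.h> with the RFC additions above (nvmf_caps and nvmf_offload_type are not in upstream libibverbs, and treating a zero offload_type as "no offload" is an assumption):

/* Sketch: check whether a device advertises NVMf target offload (RFC API). */
#include <stdio.h>
#include <infiniband/verbs.h>

static int has_nvmf_offload(struct ibv_context *ctx)
{
        struct ibv_device_attr_ex attr = {};

        if (ibv_query_device_ex(ctx, NULL, &attr))
                return 0;

        /* Assumption: a zero offload_type means no NVMf offload capability. */
        if (!attr.nvmf_caps.offload_type)
                return 0;

        printf("NVMf offload: %u back-end ctrls per SRQ, %u namespaces, max IO %u bytes\n",
               attr.nvmf_caps.max_backend_ctrls,
               attr.nvmf_caps.max_namespaces,
               attr.nvmf_caps.max_io_sz);
        return 1;
}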

SUBSYSTEM REPRESENTATION: IBV_CREATE_SRQ_EX()

struct ibv_srq_init_attr_ex {
        ...
        struct ibv_nvmf_attrs nvmf_attr;
};

/* SRQ type: IBV_SRQT_NVMF */
struct ibv_nvmf_attrs {
        enum nvmf_offload_type offload_type;
        uint32_t max_namespaces;
        uint8_t nvme_log_page_sz;
        uint32_t ioccsz;
        uint16_t icdoff;
        uint32_t max_io_sz;
        uint16_t nvme_queue_sz;
        struct ibv_mr *staging_buf_mr;
        uint64_t staging_buf_addr;
        uint64_t staging_buf_len;
};
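A minimal creation sketch, assuming <stdlib.h> and the RFC definitions above; the pd/cq/caps arguments, the staging-buffer size, and the attribute values are illustrative assumptions rather than values mandated by the interface (error handling omitted):

/* Sketch: create an SRQ that represents one front-end NVMf subsystem. */
static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
                                       struct ibv_pd *pd, struct ibv_cq *cq,
                                       const struct ibv_nvmf_caps *caps)
{
        /* Staging buffer the NIC uses for offloaded data; size is arbitrary here. */
        size_t staging_len = 16 * 1024 * 1024;
        void *staging = calloc(1, staging_len);
        struct ibv_mr *staging_mr = ibv_reg_mr(pd, staging, staging_len,
                                               IBV_ACCESS_LOCAL_WRITE);

        struct ibv_srq_init_attr_ex attr = {
                .attr      = { .max_wr = 1024, .max_sge = 1 },
                .comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD |
                             IBV_SRQ_INIT_ATTR_CQ,   /* plus the RFC's NVMf attr flag */
                .srq_type  = IBV_SRQT_NVMF,          /* proposed SRQ type */
                .pd        = pd,
                .cq        = cq,
                .nvmf_attr = {
                        .offload_type     = caps->offload_type,
                        .max_namespaces   = 16,
                        .nvme_log_page_sz = 12,          /* 4KB NVMe pages */
                        .ioccsz           = 4,           /* 64B capsules, in 16B units */
                        .icdoff           = 0,
                        .max_io_sz        = 128 * 1024,
                        .nvme_queue_sz    = 128,
                        .staging_buf_mr   = staging_mr,
                        .staging_buf_addr = (uintptr_t)staging,
                        .staging_buf_len  = staging_len,
                },
        };

        return ibv_create_srq_ex(ctx, &attr);
}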

NVME BACKEND ASSOCIATION: IBV_SRQ_CREATE_NVME_CTRL()

struct ibv_mr_sg {
        struct ibv_mr *mr;
        union {
                void *addr;
                uint64_t offset;
        };
        uint64_t len;
};

struct nvme_ctrl_attrs {
        struct ibv_mr_sg sq_buf;
        struct ibv_mr_sg cq_buf;
        struct ibv_mr_sg sqdb;
        struct ibv_mr_sg cqdb;
        uint16_t sqdb_ini;
        uint16_t cqdb_ini;
        uint16_t timeout_ms;
        uint32_t comp_mask;
};

NVME BACKEND ASSOCIATION (CONT.)

struct ibv_nvme_ctrl *ibv_srq_create_nvme_ctrl(
        struct ibv_srq *srq,
        struct nvme_ctrl_attrs *nvme_attrs);

int ibv_srq_remove_nvme_ctrl(
        struct ibv_srq *srq,
        struct ibv_nvme_ctrl *nvme_ctrl);
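A sketch of handing one back-end NVMe IO queue pair to the NIC. In a real target the queue and doorbell memory comes from the NVMe driver; the MRs, offsets, queue depth, and timeout below are hypothetical placeholders:

/* Sketch: bind a locally attached NVMe IO SQ/CQ pair (and its doorbells)
 * to the front-end SRQ so the NIC can drive the queue directly. */
static struct ibv_nvme_ctrl *attach_backend(struct ibv_srq *srq,
                                            struct ibv_mr *sq_mr,
                                            struct ibv_mr *cq_mr,
                                            struct ibv_mr *db_mr,
                                            uint64_t sqdb_off, uint64_t cqdb_off)
{
        struct nvme_ctrl_attrs attrs = {
                .sq_buf     = { .mr = sq_mr, .offset = 0, .len = 128 * 64 },  /* 128 SQEs */
                .cq_buf     = { .mr = cq_mr, .offset = 0, .len = 128 * 16 },  /* 128 CQEs */
                .sqdb       = { .mr = db_mr, .offset = sqdb_off, .len = 4 },
                .cqdb       = { .mr = db_mr, .offset = cqdb_off, .len = 4 },
                .sqdb_ini   = 0,            /* initial doorbell values */
                .cqdb_ini   = 0,
                .timeout_ms = 1000,         /* arbitrary command timeout */
                .comp_mask  = 0,
        };

        return ibv_srq_create_nvme_ctrl(srq, &attrs);
}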

MAP NAMESPACES: IBV_MAP_NVMF_NSID()

int ibv_map_nvmf_nsid(
        struct ibv_nvme_ctrl *nvme_ctrl,
        uint32_t fe_nsid,
        uint16_t lba_data_size,
        uint32_t nvme_nsid);

int ibv_unmap_nvmf_nsid(
        struct ibv_nvme_ctrl *nvme_ctrl,
        uint32_t fe_nsid);

Maps <fe_nsid> to <nvme_ctrl, nvme_nsid>
The front-end subsystem is determined by the SRQ associated with the nvme_ctrl
Indicates the namespace LBA data size
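For example, a sketch that exposes back-end namespace 1 of a controller as front-end NSID 1; the NSIDs and the 512-byte LBA size are illustrative assumptions:

/* Sketch: map back-end namespace 1 to front-end NSID 1 with 512-byte LBAs. */
static int expose_namespace(struct ibv_nvme_ctrl *nvme_ctrl)
{
        int ret = ibv_map_nvmf_nsid(nvme_ctrl, /*fe_nsid=*/1,
                                    /*lba_data_size=*/512, /*nvme_nsid=*/1);

        /* Later, ibv_unmap_nvmf_nsid(nvme_ctrl, 1) removes the mapping. */
        return ret;
}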

ENABLE OFFLOAD: IBV_QP_SET_NVMF()

enum {
        IBV_QP_NVMF_ATTR_FLAG_ENABLE = 1 << 0,
};

int ibv_modify_qp_nvmf(
        struct ibv_qp *qp,
        int flags);

Enables NVMf offload on the given QP
A QP should be enabled for offload only after processing the CONNECT fabric command
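A sketch of this final step, using the flag and entry point proposed above: once the software target has handled the fabric CONNECT command on a host-facing QP bound to the NVMf SRQ, offload is switched on and subsequent IO commands on that QP are handled in hardware (assumes <stdio.h>):

/* Sketch: turn on NVMf offload for a QP after its CONNECT has been processed. */
static int enable_offload(struct ibv_qp *qp)
{
        int ret = ibv_modify_qp_nvmf(qp, IBV_QP_NVMF_ATTR_FLAG_ENABLE);

        if (ret)
                fprintf(stderr, "enabling NVMf offload failed: %d\n", ret);
        return ret;
}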

EXCEPTIONS
Transport errors
  The QP transitions to the error state, raising an IBV_EVENT_QP_FATAL async event
NVMe errors
  Reported as nvme_ctrl async events

enum ibv_event_type {
        ...
        IBV_EVENT_NVME_PCI_ERR,
        IBV_EVENT_NVME_TIMEOUT,
        ...
};

struct ibv_async_event {
        union {
                ...
                struct ibv_nvme_ctrl *nvme_ctrl;
        } element;
        ...
};
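A sketch of an async-event loop that separates the usual transport errors from the new back-end NVMe events; the IBV_EVENT_NVME_* values and the nvme_ctrl element are RFC additions, and the recovery actions in the comments are suggestions, not prescribed behavior:

/* Sketch: async event loop handling both transport and NVMe back-end errors. */
static void handle_async_events(struct ibv_context *ctx)
{
        struct ibv_async_event ev;

        while (ibv_get_async_event(ctx, &ev) == 0) {
                switch (ev.event_type) {
                case IBV_EVENT_QP_FATAL:
                        /* Transport error: tear down / reconnect the host QP. */
                        break;
                case IBV_EVENT_NVME_PCI_ERR:      /* proposed RFC events */
                case IBV_EVENT_NVME_TIMEOUT:
                        /* ev.element.nvme_ctrl identifies the failing back-end;
                         * e.g. detach it with ibv_srq_remove_nvme_ctrl() and recover. */
                        break;
                default:
                        break;
                }
                ibv_ack_async_event(&ev);
        }
}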

STATUS
Submitted RFC for user-space Verbs: https://www.spinics.net/lists/linux-rdma/msg58512.html
Added/Updated files (lines):
  Documentation/nvmf_offload.md              172
  libibverbs/man/ibv_create_srq_ex.3          48
  libibverbs/man/ibv_get_async_event.3        15
  libibverbs/man/ibv_map_nvmf_nsid.3          89
  libibverbs/man/ibv_qp_set_nvmf.3            53
  libibverbs/man/ibv_query_device_ex.3        26
  libibverbs/man/ibv_srq_create_nvme_ctrl.3   89
  libibverbs/verbs.h                         107
Kernel discussions ongoing

14th ANNUAL WORKSHOP 2018 THANK YOU Liran Liss Mellanox Technologies