EXPERIENCES WITH NVME OVER FABRICS


13th ANNUAL WORKSHOP 2017
EXPERIENCES WITH NVME OVER FABRICS
Parav Pandit, Oren Duer, Max Gurtovoy - Mellanox Technologies
31 March, 2017

BACKGROUND: NVME TECHNOLOGY
- Optimized for flash and next-generation non-volatile memory; traditional SCSI interfaces were designed for spinning disk
- NVMe bypasses unneeded layers
- NVMe flash outperforms SAS/SATA flash: 2x-2.5x more bandwidth, 40-50% lower latency, up to 3x more IOPS

TRADITIONAL SAS/SATA STORAGE ARCHITECTURE (JBOD/JBOF)
- Multiple hardware components, software stacks, and drivers

MORE EFFICIENT AND COST-EFFECTIVE NVME ARCHITECTURE (JBOF)
- Single interconnect technology, one software stack and driver

NVME MARKET PROJECTED TO GROW TO $57B BY 2020
- >50% of enterprise servers and storage appliances will support NVMe by 2020
- ~40% of all-flash arrays will be NVMe-based by 2020
- Shipments of NVMe SSDs will grow to 25+ million by 2020
- 740,000 NVMe-oF adapters shipped by 2020
- RDMA NICs will claim >75% of the NVMe-oF market

NVME OVER FABRICS ENABLES STORAGE NETWORKING OF NVME DEVICES
- Sharing NVMe-based storage across multiple servers/CPUs
  - Better utilization: capacity, rack space, power
  - Scalability, management, fault isolation
- NVMe over Fabrics industry standard developed; version 1.0 completed in June 2016
- RDMA protocol is part of the standard: NVMe-oF version 1.0 includes a transport binding specification for RDMA (InfiniBand or Ethernet/RoCE)

SOME NVME-OF DEMOS AT FMS AND IDF 2016
- Flash Memory Summit: E8 Storage; Mangstor (with initiators from VMs on VMware ESXi); Micron (Windows & Linux initiators to Linux target); Newisis (Sanmina); Pavilion Data (in Seagate booth)
- Intel Developer Forum: E8 Storage; HGST (WD) NVMe-oF on InfiniBand; Intel (NVMe over Fabrics with SPDK); Newisis (Sanmina); Samsung; Seagate

SOME NVME-OF PRODUCTS IN THE MARKET TODAY

HOW DOES NVME OVER FABRICS MAINTAIN NVME PERFORMANCE?
- Extends NVMe efficiency over a fabric: NVMe commands and data structures are transferred end to end
- Relies on RDMA for performance, bypassing TCP/IP
- Early pre-standard versions also used RDMA
- For more information on NVMe over Fabrics (NVMe-oF): https://community.mellanox.com/docs/doc-2186
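
One way to see that a connected controller really uses an RDMA transport rather than TCP/IP is to inspect the host's sysfs entries. Below is a minimal Python sketch, assuming a Linux host with the nvme-rdma driver loaded; the attribute names (transport, address, subsysnqn, state) are the ones exposed by the upstream Linux NVMe core, but exact availability can vary by kernel version.

    import glob
    import os

    def list_nvme_controllers():
        """Print transport, address, and subsystem NQN for each NVMe controller."""
        for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
            info = {}
            for attr in ("transport", "address", "subsysnqn", "state"):
                try:
                    with open(os.path.join(ctrl, attr)) as f:
                        info[attr] = f.read().strip()
                except OSError:
                    info[attr] = "n/a"
            # transport is "pcie" for local drives, "rdma" for NVMe-oF over RoCE/InfiniBand
            print(f"{os.path.basename(ctrl)}: transport={info['transport']} "
                  f"address={info['address']} nqn={info['subsysnqn']} state={info['state']}")

    if __name__ == "__main__":
        list_nvme_controllers()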

RDMA & NVME: A PERFECT FIT

NVME OVER FABRICS (NVME-OF) STANDARD 1.0 OPEN SOURCE DRIVER
- Released with the standard in June 2016
- Instructions for using the open source driver: https://community.mellanox.com/docs/doc-2504
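
As an illustration of how the open source host driver is typically exercised, the sketch below drives the standard nvme-cli discover/connect flow from Python. The target address and subsystem NQN are placeholders; the nvme-cli options (-t, -a, -s, -n) and the default NVMe-oF service id 4420 follow the upstream tool.

    import subprocess

    TARGET_ADDR = "192.168.1.10"                          # placeholder: IP of the NVMe-oF target
    TARGET_NQN = "nqn.2017-03.com.example:nvme:subsys1"   # placeholder subsystem NQN
    PORT = "4420"                                         # standard NVMe-oF service id

    def run(cmd):
        """Run a command and return its stdout, raising on failure."""
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Load the NVMe-oF RDMA host driver (part of the upstream Linux stack).
    run(["modprobe", "nvme-rdma"])

    # Discover subsystems exported by the target over RDMA.
    print(run(["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", PORT]))

    # Connect to one subsystem; a new /dev/nvmeXnY block device appears on success.
    run(["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN, "-a", TARGET_ADDR, "-s", PORT])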

NVME-OF PERFORMANCE WITH OPEN SOURCE LINUX DRIVERS
- Setup: initiator hosts with ConnectX-4 Lx 25GbE NICs, SN2700 switch, target with a ConnectX-4 Lx 50GbE NIC and 4 P3700 NVMe SSDs
- Added fabric latency ~12us (BS = 512B)
- BS = 4KB, 16 jobs, IO depth = 64:
  - Bandwidth (target side): 5.2GB/sec
  - IOPS (target side): 1.3M
  - Online cores: 4, each at 50% utilization

INTRODUCTION TO LINUX NVME STACK
- Originally written for the PCIe interface, extended for the fabrics specification
- Host side:
  - Extended for various fabrics: RDMA, FC, loop
  - Architecture allows fabric extensions to share common functionality through common code
  - Extends nvme-cli for fabric configuration and discovery
  - Multi-queue implementation to benefit from CPU affinity
- Target side:
  - Ground-up kernel implementation of the target that leverages the block storage interface
  - nvmetcli for configuration via configfs
  - Shares common functionality through common code
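
To make the target-side flow concrete, the sketch below configures a simple NVMe-oF target directly through the kernel's nvmet configfs interface, which is what nvmetcli automates. The NQN, backing device, and IP address are placeholders; the layout under /sys/kernel/config/nvmet (subsystems, namespaces, ports) follows the upstream target, but attribute names can differ across kernel versions, so treat this as a sketch rather than a recipe.

    import os
    import subprocess

    NVMET = "/sys/kernel/config/nvmet"
    NQN = "nqn.2017-03.com.example:nvme:subsys1"   # placeholder subsystem NQN
    BACKING_DEV = "/dev/nvme0n1"                   # placeholder local NVMe namespace
    PORT_ADDR = "192.168.1.10"                     # placeholder target IP

    def write(path, value):
        """Write a single value to a configfs attribute."""
        with open(path, "w") as f:
            f.write(value)

    # Load the target core and its RDMA transport.
    subprocess.run(["modprobe", "nvmet"], check=True)
    subprocess.run(["modprobe", "nvmet-rdma"], check=True)

    # Create the subsystem and allow any host to connect (demo only).
    subsys = os.path.join(NVMET, "subsystems", NQN)
    os.makedirs(subsys, exist_ok=True)
    write(os.path.join(subsys, "attr_allow_any_host"), "1")

    # Create namespace 1 backed by a local block device and enable it.
    ns = os.path.join(subsys, "namespaces", "1")
    os.makedirs(ns, exist_ok=True)
    write(os.path.join(ns, "device_path"), BACKING_DEV)
    write(os.path.join(ns, "enable"), "1")

    # Create an RDMA port on the standard NVMe-oF service id 4420.
    port = os.path.join(NVMET, "ports", "1")
    os.makedirs(port, exist_ok=True)
    write(os.path.join(port, "addr_trtype"), "rdma")
    write(os.path.join(port, "addr_adrfam"), "ipv4")
    write(os.path.join(port, "addr_traddr"), PORT_ADDR)
    write(os.path.join(port, "addr_trsvcid"), "4420")

    # Expose the subsystem on the port.
    os.symlink(subsys, os.path.join(port, "subsystems", NQN))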

INTRODUCTION TO LINUX NVME RDMA STACK
- Connection establishment is through the standard RDMA CM
- Data path and resource setup through kernel IB verbs
- Supports target/subsystem-defined inline data for write commands
- Utilizes the common IB core stack for CQ processing and read-write operations; common code for IB and RoCEv2
- NVMe queues map to RDMA queues:
  - NVMe admin queue, IO queue -> RDMA QP, completion queue
  - NVMe completion queue -> RDMA QP, completion queue
  - Block layer SGEs map to RDMA SGE(s) and a memory region
- NVMe commands and completions are transported through RDMA SQ entries

LINUX KERNEL NVME SOFTWARE STACK

NVME FABRICS CODE ORGANIZATION
- NVMe fabric host code distribution: core common 40%, fabric common 8%, rdma 16%, fc 20%, pci 16%
- NVMe fabric target code distribution: fabric common 39%, fc 31%, rdma 20%, loop 10%

MICRO BENCHMARK SETUP
- Hardware: 64-core x86_64 host and target systems, 64GB RAM, 100Gb Ethernet ConnectX-4 NICs
- Software stack: Linux NVMe host and target software stack with kernel 4.10+
  - 250GB null target, 4K queue depth, 64 MQs, single LUN or namespace
  - NULL block driver with multiple queues for fabric performance characteristics
- Tool: fio, 16 jobs, 256 queue depth, 70% write / 30% read
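
A fio invocation along the following lines reproduces the shape of this workload. The device path is a placeholder for the NVMe-oF block device that appears on the host after connecting; the fio options used (randrw, rwmixwrite, numjobs, iodepth, libaio) are standard fio parameters, and the block-size sweep mirrors the IO sizes shown in the charts that follow.

    import subprocess

    DEVICE = "/dev/nvme1n1"   # placeholder: block device created by 'nvme connect'

    def run_fio(block_size):
        """Run the 70% write / 30% read random workload used in the slides."""
        cmd = [
            "fio",
            "--name=nvmeof-micro-bench",
            f"--filename={DEVICE}",
            "--ioengine=libaio",
            "--direct=1",
            "--rw=randrw",
            "--rwmixwrite=70",       # 70% writes, 30% reads
            "--numjobs=16",
            "--iodepth=256",
            f"--bs={block_size}",
            "--runtime=60",
            "--time_based",
            "--group_reporting",
        ]
        subprocess.run(cmd, check=True)

    # Sweep the IO sizes reported in the latency/IOPS charts.
    for bs in ("512", "1k", "2k", "4k", "8k", "16k", "32k"):
        run_fio(bs)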

RANDOM READ WRITE 30-70 LATENCY CHART
- 20 times lower latency compared to iSCSI-TCP up to 4K IO size
- 10 times lower latency compared to iSER for 8K and higher
- 2 times lower latency compared to iSER for all IO sizes
- Block layer MQ support comes natively to NVMe

Random 70-30 latency in usec (lower is better):

  IO size   nvmeof   isert   iscsi
  512       5.45     81      149
  1K        5.29     81      162
  2K        5.35     84      169
  4K        5.62     84      184
  8K        18       86      212
  16K       23       123     266
  32K       34       136     379

RANDOM READ WRITE 30-70 IOPS CHART
- 20 times higher IOPS compared to iSCSI-TCP up to 4K size
- 4 times higher IOPS compared to iSER for 8K size and higher

Random 70-30 IOPS in thousands (higher is better):

  IO size   nvmeof   isert   iscsi
  512       2349     194     104.4
  1K        2354     195     103.7
  2K        2294     184     98.1
  4K        2131     188     81
  8K        797      181     75.2
  16K       628      122     60.3
  32K       440      108     39.6

MICRO ENHANCEMENTS FOR RDMA FABRICS
- Host side: use more SGEs to handle non-contiguous pages
  - Per-IOQ setting; might be done based on the advertised inline data size
- Target side: use a higher inline size for incoming write commands
  - Currently a static configuration; will become a per-host setting via configfs in the future
- CPU utilization is unchanged with these new features
- Patches are under review and being updated for dynamic configuration
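
In later upstream kernels the target-side inline size did become configurable per port through the nvmet configfs attribute param_inline_data_size; at the time of this talk the value was still static. A minimal sketch of checking and raising it to 16KB, assuming the port layout from the target configuration sketch above:

    import os

    PORT = "/sys/kernel/config/nvmet/ports/1"             # port created when configuring the target
    ATTR = os.path.join(PORT, "param_inline_data_size")   # bytes of inline write data per command

    # Read the current inline data size advertised to hosts.
    with open(ATTR) as f:
        print("current inline data size:", f.read().strip())

    # Raise it to 16KB (the value evaluated in the slides); port parameters
    # generally must be set before the port is enabled, i.e. before any
    # subsystem is linked to it.
    with open(ATTR, "w") as f:
        f.write(str(16 * 1024))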

IMPROVEMENTS WITH 16K INLINE SIZE CHANGES
- Chart: 70/30 write/read micro enhancements, showing IOPS improvement and latency reduction per IO size, 512B to 32K (higher is better)
- Peak gains of approximately 119% in IOPS and 117% in latency reduction, gains of roughly 50% at another IO size, and negligible change (under 10%) for the remaining sizes

WIRESHARK SUPPORT FOR FABRIC COMMANDS (BASIC)

WIRESHARK TRACES FOR LATENCY ANALYSIS

13th ANNUAL WORKSHOP 2017 THANK YOU Parav Pandit Mellanox Technologies