SPDK China Summit 2018. Ziye Yang, Senior Software Engineer, Network Platforms Group, Intel Corporation

Agenda: SPDK programming framework; Accelerated NVMe-oF via SPDK; Conclusion

Why an environment abstraction? Flexibility for the user.

Environment abstraction
- Memory allocation (pinned for DMA) and address translation
- PCI enumeration and resource mapping
- Thread startup (pinned to cores)
- Lock-free ring and memory pool data structures
Files: env.h, env.c, init.c, pci.c, pci_ioat.c, pci_nvme.c, pci_virtio.c, threads.c, vtophys.c

Environment abstraction
- Configurable: ./configure --with-env=...
- Interface defined in spdk/env.h
- Default implementation uses DPDK (lib/env_dpdk)
- Flexibility: decoupling and DPDK enhancements
Files: env.h, env.c, init.c, pci.c, pci_ioat.c, pci_nvme.c, pci_virtio.c, threads.c, vtophys.c
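As a rough illustration of how an application sits on top of this abstraction, the sketch below initializes the environment through spdk/env.h and allocates pinned, DMA-capable memory. It assumes a recent SPDK release (for example, spdk_env_init() returning an int and the option fields shown); details differ slightly across versions, and the names used here are illustrative.

/* Sketch: bring up the SPDK environment layer and allocate pinned,
 * DMA-capable memory. Assumes a recent spdk/env.h; in older releases
 * spdk_env_init() returned void instead of int. */
#include <stdio.h>
#include "spdk/env.h"

int
main(void)
{
	struct spdk_env_opts opts;
	void *buf;

	spdk_env_opts_init(&opts);      /* fill in defaults */
	opts.name = "env_demo";         /* illustrative application name */
	opts.core_mask = "0x1";         /* pin to core 0 */

	if (spdk_env_init(&opts) < 0) {
		fprintf(stderr, "unable to initialize the SPDK environment\n");
		return 1;
	}

	/* 4 KiB, 4 KiB-aligned buffer from hugepage-backed, pinned memory */
	buf = spdk_dma_zmalloc(4096, 0x1000, NULL);
	if (buf == NULL) {
		fprintf(stderr, "spdk_dma_zmalloc() failed\n");
		return 1;
	}

	spdk_dma_free(buf);
	return 0;
}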

How do we combine SPDK components? The SPDK App Framework provides the glue.

App Framework Components: Reactor, Event, I/O Channel

[Diagram: one reactor per core (Reactor 0 ... Reactor N on Core 0 ... Core N), each with its own event ring and the I/O devices it polls]

Poller
- Essentially a task running on a reactor
- Primarily checks hardware for asynchronous events; can run periodically on a timer
- Example: poll a completion queue; the I/O completion callback runs to completion on the reactor thread; the completion handler may send an event
[Diagram: I/O submitted to the device SQ; the poller reaps the CQ and invokes the I/O completion callback]
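To make the idea concrete, here is a minimal sketch of a poller that drains a device completion queue. It assumes the poller API from recent SPDK headers (spdk/thread.h, int-returning poll callback); older releases exposed the same functions from spdk/io_channel.h with slightly different conventions. my_device and my_poll_completions are illustrative names, not SPDK API.

/* Sketch: a poller that reaps a device completion queue (illustrative names). */
#include "spdk/thread.h"

struct my_device {
	struct spdk_poller *poller;
	/* ... hardware queue handles ... */
};

static int
my_poll_completions(void *ctx)
{
	struct my_device *dev = ctx;
	int completions = 0;

	/* completions = reap_completion_queue(dev);  -- poll the CQ and run
	 * each I/O completion callback to completion on this reactor thread */
	(void)dev;

	return completions;    /* non-zero: this poll did useful work */
}

static void
my_device_start_polling(struct my_device *dev)
{
	/* period 0: run on every reactor iteration; a non-zero value in
	 * microseconds turns this into a timer-driven poller */
	dev->poller = spdk_poller_register(my_poll_completions, dev, 0);
}

static void
my_device_stop_polling(struct my_device *dev)
{
	spdk_poller_unregister(&dev->poller);
}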

Event [Diagram: the same per-core reactor layout as above, highlighting the event rings between reactors]

Event
- Cross-thread communication: a function pointer plus arguments
- One-shot message passed between reactors
- Multi-producer/single-consumer ring
- Runs to completion on the reactor thread
[Diagram: Reactor A allocates and calls the event; Reactor B executes and frees it]
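A minimal sketch of that message-passing pattern, assuming the legacy event API from spdk/event.h (spdk_event_allocate/spdk_event_call); say_hello and send_hello are illustrative names.

/* Sketch: one-shot cross-reactor message (legacy event API). The sender
 * allocates the event; the framework frees it after the callback runs to
 * completion on the target reactor. */
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/event.h"

static void
say_hello(void *arg1, void *arg2)           /* runs on the target core */
{
	const char *who = arg1;

	(void)arg2;
	printf("hello %s from core %u\n", who, spdk_env_get_current_core());
}

static void
send_hello(uint32_t target_core)
{
	/* function pointer + two argument pointers, queued on the target
	 * reactor's multi-producer/single-consumer event ring */
	struct spdk_event *e = spdk_event_allocate(target_core, say_hello,
						   "reactor", NULL);

	spdk_event_call(e);
}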

I/O Channel
- Abstracts hardware I/O queues
- Register I/O devices; create one I/O channel per thread/device combination
- Provides hooks for driver resource allocation; I/O channel creation drives poller creation
- Pervasive in SPDK
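The sketch below shows the usual registration pattern, assuming the I/O channel API from recent SPDK headers; the exact spdk_io_device_register() signature (for example, the trailing name argument) has changed across releases, and the my_* names are illustrative.

/* Sketch: register an I/O device and fetch a per-thread channel for it. */
#include "spdk/thread.h"

struct my_channel_ctx {
	int qpair_id;   /* stand-in for per-thread hardware queue resources */
};

static int
my_channel_create(void *io_device, void *ctx_buf)
{
	struct my_channel_ctx *ctx = ctx_buf;

	(void)io_device;
	ctx->qpair_id = 0;   /* allocate a hardware queue, register pollers, ... */
	return 0;
}

static void
my_channel_destroy(void *io_device, void *ctx_buf)
{
	(void)io_device;
	(void)ctx_buf;       /* release the per-thread resources */
}

static void
my_device_register(void *my_device)
{
	spdk_io_device_register(my_device, my_channel_create, my_channel_destroy,
				sizeof(struct my_channel_ctx), "my_device");
}

static struct my_channel_ctx *
my_device_get_channel(void *my_device)
{
	/* each calling thread gets (and reference-counts) its own channel */
	struct spdk_io_channel *ch = spdk_get_io_channel(my_device);

	return ch ? spdk_io_channel_get_ctx(ch) : NULL;
}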

NVMe over Fabrics Target Example
- Initialization: nvmf_tgt_advance_state / spdk_nvmf_parse_conf (listen on the transport)
- NVMe-oF target I/O channel creation: spdk_nvmf_tgt_create
- Group data poller creation on each core: the I/O channel create_cb (spdk_nvmf_tgt_create_poll_group) is triggered, after which spdk_nvmf_poll_group_poll runs on each core
- Acceptor network poller creation: spdk_nvmf_tgt_accept is used to handle connect events
[Diagram: Reactor 0 ... Reactor N, each with a poll group and its connections; the acceptor poller faces the network]

NVMe over Fabrics Target Example
- The acceptor network poller handles connect events; each new qpair (connection) is assigned to a core in round-robin fashion using asynchronous message passing, after which spdk_nvmf_poll_group_add is called (a sketch of this dispatch pattern follows)
- An I/O request arrives over the network and is handled by the group poller on the designated core
- The I/O is submitted to storage; the storage device poller checks for completions; the response is sent
- All asynchronous work is driven by pollers
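The following sketch illustrates only the dispatch pattern described above (accept, pick a core round-robin, hand off via a one-shot event); accept_new_qpair() and add_qpair_on_core() are hypothetical placeholders standing in for the real target code, not SPDK API.

/* Sketch of the dispatch pattern: an acceptor poller takes a new connection
 * and hands it to a core chosen round-robin via a one-shot event. */
#include <stdint.h>
#include "spdk/env.h"
#include "spdk/event.h"

static void *
accept_new_qpair(void)
{
	return NULL;   /* placeholder: poll the transport for a new connection */
}

static void
add_qpair_on_core(void *arg1, void *arg2)   /* runs on the chosen core */
{
	/* the real target calls spdk_nvmf_poll_group_add() here */
	(void)arg1;
	(void)arg2;
}

static uint32_t
next_core_round_robin(void)
{
	static uint32_t core = UINT32_MAX;

	core = (core == UINT32_MAX) ? spdk_env_get_first_core()
				    : spdk_env_get_next_core(core);
	if (core == UINT32_MAX) {        /* wrapped past the last core */
		core = spdk_env_get_first_core();
	}
	return core;
}

static int
acceptor_poll(void *ctx)                     /* registered as a poller */
{
	void *qpair = accept_new_qpair();

	(void)ctx;
	if (qpair == NULL) {
		return 0;                    /* nothing accepted this time */
	}

	spdk_event_call(spdk_event_allocate(next_core_round_robin(),
					    add_qpair_on_core, qpair, NULL));
	return 1;
}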

Agenda: SPDK programming framework; Accelerated NVMe-oF via SPDK; Conclusion

SPDK NVMe-oF Components
NVMe over Fabrics Target (released July 2016, alongside the spec):
- Hardening: Intel test infrastructure, discovery simplification, correctness and kernel interoperability
- Performance improvements: read latency improvement, scalability validation (up to 150 Gbps), event framework enhancements, multiple-connection performance improvement (e.g., group transport polling)
NVMe over Fabrics Host (Initiator) (new component added in December 2016):
- Performance improvements: eliminated copies (now true zero copy), SGL support (single SGL element)
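For the host side, a hedged sketch of connecting to a remote subsystem over RDMA with the SPDK initiator is shown below; the address, service ID and NQN are illustrative, and the SPDK environment is assumed to be initialized already.

/* Sketch: connect the SPDK NVMe-oF host (initiator) to a remote subsystem
 * over RDMA. Assumes spdk_env_init() has already run. */
#include <stdio.h>
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *
connect_remote_ctrlr(void)
{
	struct spdk_nvme_transport_id trid = {};

	if (spdk_nvme_transport_id_parse(&trid,
			"trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 "
			"trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode1") != 0) {
		fprintf(stderr, "failed to parse the transport ID\n");
		return NULL;
	}

	/* NULL/0: use default controller options */
	return spdk_nvme_connect(&trid, NULL, 0);
}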

SPDK NVMe-oF transport work
Existing work: RDMA transport
- Uses DPDK components encapsulated in libspdk_env_dpdk.a, e.g., PCI device management, CPU/thread scheduling, memory management (e.g., lock-free rings), log management
Upcoming work: TCP transport
- Kernel-based TCP transport
- VPP/DPDK-based user-space TCP transport: uses DPDK Ethernet PMDs and a user-space TCP/IP stack (e.g., VPP)

NVMe-oF Target Throughput Performance (RDMA)
[Chart: SPDK vs. kernel NVMe-oF I/O efficiency; both reach 3x 50GbE line-rate IOPS, SPDK with roughly 3 CPU cores vs. roughly 30 cores for the kernel target]
NVMe over Fabrics Target features: utilizes the NVM Express (NVMe) polled-mode driver; RDMA queue pair group polling; connections pinned to CPU cores.
Realized benefit: reduced overhead per NVMe I/O; no interrupt overhead; no synchronization overhead.
SPDK reduces NVMe over Fabrics software overhead up to 10x!
System configuration: Target system: Supermicro SYS-2028U-TN24R4T+, 2x Intel Xeon E5-2699v4 (HT off), Intel Speed Step enabled, Intel Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, 12x Intel P3700 NVMe SSD (800GB) per socket, -1H0 FW; Network: Mellanox* ConnectX-4 LX 2x25Gb RDMA, direct connection between initiators and target; Initiator OS: CentOS* Linux* 7.2, Linux kernel 4.10.0; Target OS (SPDK): Fedora 25, Linux kernel 4.9.11; Target OS (Linux kernel): Fedora 25, Linux kernel 4.9.11. Performance as measured by fio, 4KB random read I/O, 2 RDMA QPs per remote SSD, numjobs=4 per SSD, queue depth 32 per job. SPDK commit ID: 4163626c5c

NVM Express* Driver Software Overhead
[Chart: I/O submission software overhead in nanoseconds, Linux kernel vs. SPDK]
Source of overhead vs. SPDK approach: interrupts vs. asynchronous polled mode; synchronization vs. lockless design; system calls vs. user-space hardware access; DMA mapping vs. hugepages; generic block layer vs. a layer specific to flash latencies.
SPDK reduces NVM Express* (NVMe) software overhead up to 10x!
System configuration: 2x Intel Xeon E5-2695v4 (HT off), Intel Speed Step enabled, Intel Turbo Boost Technology disabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, CentOS* Linux* 7.2, Linux kernel 4.7.0-rc1, 1x Intel P3700 NVMe SSD (800GB), 4x per CPU socket, FW 8DV10102, I/O workload 4KB random read, queue depth 1 per SSD. Performance measured by Intel using the SPDK overhead tool; Linux kernel data using Linux AIO.

NVM Express* Driver Throughput Scalability
[Chart: 4KB random read throughput (KIOPS) on a single Intel Xeon core vs. number of Intel SSD DC P3700 drives (1-8), Linux kernel vs. SPDK]
- Systems with multiple NVM Express* (NVMe) SSDs are capable of millions of I/Os per second
- A kernel-based, interrupt-driven driver model spends many cores of software overhead to reach that rate
- SPDK enables more CPU cycles for storage services and lower I/O latency
SPDK saturates 8 NVMe SSDs with a single CPU core!
System configuration: 2x Intel Xeon E5-2695v4 (HT off), Intel Speed Step enabled, Intel Turbo Boost Technology disabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, CentOS* Linux* 7.2, Linux kernel 4.10.0, 8x Intel P3700 NVMe SSD (800GB), 4x per CPU socket, FW 8DV101H0, I/O workload 4KB random read, queue depth 128 per SSD. Performance measured by Intel using the SPDK perf tool; Linux kernel data using Linux AIO.

SPDK Host + Target vs. Kernel Host + Target
[Chart: average I/O round-trip time (usec) at queue depth 1 on an Intel Optane (ColdStream) SSD, 4K random read and 4K random write, kernel NVMe-oF target/initiator vs. SPDK NVMe-oF target/initiator, split into local latency and fabric + software latency]
SPDK reduces Optane NVMe-oF read latency by 44% and write latency by 36%!
System configuration: 2x Intel Xeon E5-2695v4 (HT on), Intel Speed Step enabled, Intel Turbo Boost Technology enabled, 64GB DDR4 memory (8x 8GB DDR4 2400 MT/s), Ubuntu 16.04.1, Linux kernel 4.10.1, 1x 25GbE Mellanox 2P CX-4, CX-4 FW=14.16.1020, mlx5_core=3.0-1 driver, 1 ColdStream SSD connected to socket 0; 1 initiator connected to the NVMe-oF subsystems using 2P 25GbE Mellanox. Performance measured by Intel using the SPDK perf tool, 4KB random read I/O, queue depth 1 per NVMe-oF subsystem, numjobs=1, 300 sec runtime, direct=1, norandommap=1, FIO-2.12. SPDK commit # 42eade49

Agenda: SPDK programming framework; Accelerated NVMe-oF via SPDK; Conclusion

Conclusion
In this presentation we introduced the SPDK library and the accelerated NVMe-oF target built from it. SPDK proves useful for accelerating storage applications built on NVMe-based devices.
Call to action: use SPDK in the storage area (much as DPDK is used in networking) and contribute to the SPDK community.

Notices & Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
Intel Advanced Vector Extensions (Intel AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.