Accelerating NVMe I/Os in Virtual Machine via SPDK vhost* Solution. Ziye Yang, Changpeng Liu, Senior Software Engineers, Intel

Accelerating NVMe I/Os in Virtual Machine via SPDK vhost* Solution. Ziye Yang, Changpeng Liu, Senior Software Engineers, Intel. @optimistyzy

Notices & Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Intel Advanced Vector Extensions (Intel AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

Outline
- Background
- SPDK vhost solution
- Experiments
- Conclusion

Background

NVMe & virtualization
- The NVMe specification enables highly optimized drives (e.g., NVMe SSDs); for example, multiple I/O queues allow lockless submission from many CPU cores in parallel.
- However, even the best kernel-mode drivers have non-trivial software overhead: a long I/O stack in the kernel with resource contention.
- Virtualization adds further overhead: a long I/O stack in both the guest OS kernel and the host OS kernel, plus context-switch overhead (e.g., VM_EXITs caused by I/O interrupts in the guest OS).

What is in QEMU's solution? QEMU offers several ways to virtualize an NVMe device:
- Virtio virtualization: Virtio SCSI/block controllers.
- NVMe controller virtualization: the QEMU-emulated NVMe device (file-based NVMe backend), and the QEMU NVMe block driver based on VFIO (exclusive access by QEMU).
- Hardware-assisted virtualization.

What is in QEMU's solution? Virtio is a paravirtualized driver specification: common mechanisms and layouts for device discovery, I/O queues, etc.
- The guest VM (Linux*, Windows*, FreeBSD*, etc.) runs the virtio front-end drivers; the hypervisor (i.e., QEMU/KVM) provides the virtio back-end drivers and device emulation; the two sides communicate through virtqueues.
- Virtio device types include: virtio-net, virtio-blk, virtio-scsi, virtio-gpu, virtio-rng, virtio-crypto.
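
As a concrete illustration of the virtqueue mentioned above, this is roughly what one split-ring descriptor and the available ring look like. This is a minimal sketch following the virtio specification's split-ring layout; field names mirror the spec, not QEMU's internal types:

```c
#include <stdint.h>

/* One entry of the virtqueue descriptor table (virtio split ring).
 * Each descriptor points at a guest-physical buffer; descriptors can be
 * chained via the NEXT flag to describe a multi-buffer request. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* length of the buffer in bytes */
#define VRING_DESC_F_NEXT     1  /* chain continues via the next field */
#define VRING_DESC_F_WRITE    2  /* buffer is device write-only (vs read-only) */
#define VRING_DESC_F_INDIRECT 4  /* buffer holds a table of descriptors */
    uint16_t flags;
    uint16_t next;   /* index of the next descriptor in the chain */
};

/* The driver publishes request heads here; the device consumes them. */
struct vring_avail {
    uint16_t flags;
    uint16_t idx;      /* next free slot the driver will fill */
    uint16_t ring[];   /* head descriptor indices */
};
```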

Accelerate virtio via vhost target
- The guest VM (Linux*, Windows*, FreeBSD*, etc.) keeps its virtio front-end drivers and virtqueues unchanged.
- I/O processing moves out of QEMU into a separate process: a vhost target, in the kernel or in userspace.
- The hypervisor (i.e., QEMU/KVM) keeps device emulation but hands off the data path; the vhost protocol communicates the guest VM parameters to the target: guest memory, number of virtqueues, and virtqueue locations.
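
For the userspace (vhost-user) case, the parameters listed above travel as messages over a UNIX domain socket. Below is a heavily abridged sketch of the message set and the memory-table payload, loosely following the vhost-user protocol; the real protocol has many more messages and passes file descriptors as ancillary data:

```c
#include <stdint.h>

/* Subset of vhost-user requests QEMU sends to the vhost target. */
enum vhost_user_request {
    VHOST_USER_SET_OWNER      = 3,   /* claim the device */
    VHOST_USER_SET_MEM_TABLE  = 5,   /* share guest memory regions (with fds) */
    VHOST_USER_SET_VRING_NUM  = 8,   /* number of entries in a virtqueue */
    VHOST_USER_SET_VRING_ADDR = 9,   /* guest addresses of desc/avail/used rings */
    VHOST_USER_SET_VRING_KICK = 12,  /* eventfd the guest kicks on submission */
    VHOST_USER_SET_VRING_CALL = 13,  /* eventfd the target signals on completion */
};

/* One shared guest memory region; the fd for mmap() travels as
 * SCM_RIGHTS ancillary data on the socket. */
struct vhost_memory_region {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;   /* QEMU's virtual address for the region */
    uint64_t mmap_offset;
};
```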

SPDK vhost solution

What is SPDK? The Storage Performance Development Kit:
- Intel's platform storage reference architecture, optimized for Intel platform characteristics.
- Open-source building blocks (BSD licensed), available via github.com/spdk or spdk.io.
- Scalable and efficient software ingredients: user-space, lockless, polled-mode components; up to millions of IOPS per core; designed for Intel Optane technology latencies.
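
The "user space, lockless, polled mode" point is easiest to see in SPDK's NVMe driver API: each thread owns its queue pair and polls for completions instead of taking interrupts. A minimal sketch follows; it assumes the controller and namespace were already attached via spdk_nvme_probe(), and error handling is trimmed:

```c
#include <stdbool.h>
#include "spdk/nvme.h"
#include "spdk/env.h"

static volatile bool g_done;

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;   /* completion callback runs on the polling thread */
}

/* Issue a single-LBA read and poll the queue pair until it completes. */
static int read_one_lba(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    /* Per-thread queue pair: no locks are needed on the submission path. */
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);   /* DMA-able buffer */

    g_done = false;
    spdk_nvme_ns_cmd_read(ns, qpair, buf,
                          0 /* starting LBA */, 1 /* LBA count */,
                          read_done, NULL, 0);

    while (!g_done) {
        /* Polled mode: reap completions instead of waiting for an interrupt. */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    return 0;
}
```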

SPDK architecture (components tagged by release: 18.01, 18.04, 18.07)
- Storage protocols: NVMe-oF* target (RDMA, TCP), iSCSI target, vhost-scsi target, vhost-blk target, vhost-NVMe target, Linux nbd.
- Storage services: block device abstraction (BDEV), logical volumes (snapshots, clones), Blobstore and BlobFS; bdev modules for Linux AIO, Ceph RBD, PMDK blk, Virtio SCSI, Virtio Blk, GPT, QoS, and encryption; integrations with Cinder, VPP TCP/IP, RocksDB, Ceph, and QEMU.
- Drivers: NVMe-oF* initiator (RDMA, TCP), NVMe* PCIe driver (NVMe devices, OCSSD, 3rd-party NVMe), iSCSI initiator, Intel QuickData Technology driver.
- Core: application framework.

Combine virtio and NVMe to form a uniform SPDK vhost solution
- Virtio path: the QEMU guest VM sees a Virtio controller; its virtqueues live in guest VM memory shared with the host; QEMU and the SPDK vhost target (DPDK vhost) exchange vhost messages and eventfds over a UNIX domain socket.
- NVMe path: the QEMU guest VM sees an NVMe controller; its submission/completion queues (sq/cq) likewise live in shared guest VM memory, and the same UNIX domain socket / eventfd mechanism connects QEMU to the SPDK vhost-NVMe target.
- In both cases the guest's queues sit in memory shared with the host, so the SPDK target can process I/O directly.
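
Because the queues sit in guest memory shared with the target, the vhost target must translate guest-physical addresses into its own virtual addresses using the memory table it received over the socket. A simplified sketch of that lookup is below; it assumes the regions were already mmap()ed, and real implementations also deal with hugepage alignment and holes:

```c
#include <stdint.h>
#include <stddef.h>

/* One guest memory region after the target has mmap()ed the fd it
 * received with VHOST_USER_SET_MEM_TABLE. */
struct vhost_mem_region {
    uint64_t guest_phys_addr;   /* start of the region in guest-physical space */
    uint64_t size;
    void    *host_vaddr;        /* where the target mapped it locally */
};

/* Translate a guest-physical address (e.g., a PRP entry or a vring
 * descriptor's addr field) into a pointer the target can dereference. */
static void *gpa_to_vva(struct vhost_mem_region *regions, size_t nregions,
                        uint64_t gpa)
{
    for (size_t i = 0; i < nregions; i++) {
        struct vhost_mem_region *r = &regions[i];
        if (gpa >= r->guest_phys_addr && gpa < r->guest_phys_addr + r->size) {
            return (uint8_t *)r->host_vaddr + (gpa - r->guest_phys_addr);
        }
    }
    return NULL;   /* address not covered by any shared region */
}
```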

Virtio vs. NVMe: both use ring data structures for I/O submission. The virtio available ring is advanced via its available index; the NVMe submission queue is advanced via its tail pointer.
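
A side-by-side sketch of the two submission paths (pseudocode-level C; reuses struct vring_avail from the earlier virtqueue sketch, and ignores memory barriers and wrap-around subtleties for brevity):

```c
#include <stdint.h>

struct nvme_sqe { uint8_t bytes[64]; };   /* a 64-byte NVMe submission queue entry */

/* virtio: publish a filled descriptor chain to the device. */
static void virtio_submit(struct vring_avail *avail, uint16_t head_desc,
                          uint16_t queue_size)
{
    avail->ring[avail->idx % queue_size] = head_desc;
    avail->idx++;                      /* device now sees the new entry */
    /* ...then kick the device (eventfd/MMIO write) if notifications are on. */
}

/* NVMe: publish a command to the submission queue and ring the doorbell. */
static void nvme_submit(struct nvme_sqe *sq, uint16_t *sq_tail,
                        const struct nvme_sqe *cmd, uint16_t queue_size,
                        volatile uint32_t *sq_tail_doorbell)
{
    sq[*sq_tail] = *cmd;
    *sq_tail = (uint16_t)((*sq_tail + 1) % queue_size);
    *sq_tail_doorbell = *sq_tail;      /* MMIO write; in a VM this traps (VM_EXIT) */
}
```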

Virtio-SCSI and NVMe protocol format comparison
- A virtio-SCSI request occupies at least three 16-byte vring descriptors (ADDR/LEN/FLAGS/NEXT each), pointing to the SCSI request, the data buffer, and the SCSI response: (16 * 3 + SCSI_Req + SCSI_Rsp + Data) bytes in total.
- An NVMe request is just the command, the data, and the completion: (NVMe_Req + Data + NVMe_Rsp) bytes.

SPDK vhost architecture
- QEMU (with a separate patch released for the vhost-NVMe driver) runs Guest 1 with a vhost-SCSI driver, Guest 2 with a vhost-BLK driver, and Guest 3 with a vhost-NVMe driver, exposed to the guests as a Virtio SCSI controller, a Virtio BLK controller, and an NVMe controller respectively.
- The kernel provides KVM; the SPDK vhost target implements the SCSI, BLK, and NVMe device backends on top of the BDEV layer.

Comparison of known solutions (per solution: guest OS interface / backend device sharing / live migration support / VFIO dependency / QEMU support):
- QEMU emulated NVMe device: NVMe / Y / N / N / Y.
- QEMU VFIO-based solution: NVMe / N / N / Y / Y.
- QEMU PCI passthrough (SR-IOV): NVMe / N / N / N / Y.
- SPDK vhost-SCSI: Virtio SCSI / Y / Y / can use UIO or VFIO / Y.
- SPDK vhost-BLK: Virtio BLK / Y / Y / can use UIO or VFIO / Y.
- SPDK vhost-NVMe: NVMe / Y / no, feature is WIP / can use UIO or VFIO / no, upstream is WIP.
- Mediated NVMe (VFIO mdev): NVMe / depends on implementation / depends on implementation / Y / Y.

SPDK vhost NVMe implementation details

Vhost NVMe implementation details
- The guest VM in QEMU sees a regular NVMe controller; its admin queue and I/O submission/completion queues (sq/cq) are placed in shared guest VM memory.
- SPDK vhost-NVMe runs an NVMe I/O queue poller against those shared queues, while admin commands are relayed from QEMU over the vhost UNIX domain socket (DPDK vhost).
- Namespaces map to SPDK BDEVs (NS1 BDEV, NS2 BDEV, ...); KVM in the kernel handles the rest of the virtualization.
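
A heavily simplified sketch of what such an I/O queue poller might do on each iteration. This is illustrative only, not the actual SPDK vhost-NVMe code; struct guest_sq and submit_to_bdev() are hypothetical placeholders, and struct nvme_sqe and gpa_to_vva() come from the earlier sketches:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative per-queue state; not the actual SPDK vhost-NVMe structures. */
struct guest_sq {
    struct nvme_sqe         *entries;              /* SQ in shared guest memory */
    uint16_t                 size, head;
    volatile uint32_t       *shadow_tail_doorbell; /* guest-updated shadow doorbell */
    struct vhost_mem_region *mem_regions;          /* from the vhost memory table */
    size_t                   nregions;
};

static void submit_to_bdev(struct guest_sq *sq, struct nvme_sqe *cmd);   /* hypothetical */

/* Poll one guest submission queue: pick up new commands and hand them to
 * the backing bdev; completion callbacks later write entries into the guest
 * completion queue in shared memory and advance the eventidx. */
static void nvme_io_queue_poll(struct guest_sq *sq)
{
    /* The guest publishes its tail in the shadow doorbell buffer rather than
     * an MMIO register, so reading it here causes no VM_EXIT. */
    uint16_t tail = (uint16_t)*sq->shadow_tail_doorbell;

    while (sq->head != tail) {
        struct nvme_sqe *cmd = &sq->entries[sq->head];
        /* PRP entries in the command are translated with gpa_to_vva() before
         * the data buffer is handed to the bdev layer. */
        submit_to_bdev(sq, cmd);
        sq->head = (uint16_t)((sq->head + 1) % sq->size);
    }
}
```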

Create I/O queue flow
1. Guest: build the Create I/O Queue admin command (QSIZE, QID, CQID, QPRIO, PC, PRP1).
2. Guest: submit it to the admin queue and write the admin doorbell.
3. QEMU: pick up the admin command and send it to SPDK via the UNIX domain socket.
4. SPDK: start creating the I/O queue, performing memory translation of the guest queue address (PRP1).
5. Result: both the guest and SPDK now see the same I/O submission queue (sq).
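
For reference, this is roughly how the admin command in step 1 is laid out per the NVMe specification (a sketch with generic field names; opcode 0x01 is Create I/O Submission Queue, and the caller is assumed to have zeroed the struct first):

```c
#include <stdint.h>

/* Simplified 64-byte NVMe admin command layout. */
struct nvme_admin_cmd {
    uint8_t  opc;        /* opcode */
    uint8_t  flags;
    uint16_t cid;        /* command identifier */
    uint32_t nsid;
    uint64_t rsvd;
    uint64_t mptr;
    uint64_t prp1;       /* guest-physical base address of the new SQ */
    uint64_t prp2;
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12, cdw13, cdw14, cdw15;
};

static void build_create_sq(struct nvme_admin_cmd *cmd, uint16_t qid,
                            uint16_t qsize, uint16_t cqid, uint64_t sq_gpa)
{
    cmd->opc   = 0x01;                                 /* Create I/O SQ */
    cmd->prp1  = sq_gpa;                               /* queue memory (PRP1) */
    cmd->cdw10 = ((uint32_t)(qsize - 1) << 16) | qid;  /* QSIZE (0-based) | QID */
    cmd->cdw11 = ((uint32_t)cqid << 16) | 0x1;         /* CQID | PC=1 (contiguous) */
    /* QPRIO would go in cdw11 bits 2:1 when weighted round robin is enabled. */
}
```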

New feature to address guest NVMe performance issue
- When the guest submits a new I/O, it writes the submission queue doorbell (e.g., the SQ1 doorbell); each such MMIO write causes a VM_EXIT.
- NVMe 1.3 adds an optional admin command, Doorbell Buffer Config, used only by emulated NVMe controllers: the guest updates a shadow doorbell buffer instead of writing the submission queue's doorbell register.
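
The guest-side decision of whether an MMIO doorbell write is still needed follows the same eventidx scheme virtio uses. A sketch of that check, mirroring the logic the Linux NVMe driver applies when Doorbell Buffer Config is active (memory barriers omitted):

```c
#include <stdint.h>
#include <stdbool.h>

/* Has the doorbell moved past the eventidx the host advertised? If not,
 * the host poller will notice the shadow doorbell update on its own and
 * the guest can skip the (VM_EXIT-causing) MMIO write. */
static bool dbbuf_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Guest submission path once a shadow doorbell has been configured. */
static void ring_sq_doorbell(volatile uint32_t *shadow_db,   /* guest-updated */
                             volatile uint32_t *event_idx,   /* host-updated  */
                             volatile uint32_t *mmio_db, uint32_t new_tail)
{
    uint32_t old_tail = *shadow_db;
    *shadow_db = new_tail;                       /* always update the shadow copy */
    if (dbbuf_need_event((uint16_t)*event_idx, (uint16_t)new_tail,
                         (uint16_t)old_tail)) {
        *mmio_db = new_tail;                     /* fall back to the real doorbell */
    }
}
```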

Shadow doorbell buffer layout (one 4-byte slot per doorbell):
  00h-03h  Submission Queue 0 Tail Doorbell or EventIdx (Admin)
  04h-07h  Completion Queue 0 Head Doorbell or EventIdx (Admin)
  08h-0Bh  Submission Queue 1 Tail Doorbell or EventIdx
  0Ch-0Fh  Completion Queue 1 Head Doorbell or EventIdx

Doorbell Buffer Config command:
  PRP1: shadow doorbell memory address, updated by the guest NVMe driver
  PRP2: EventIdx memory address, updated by the SPDK vhost target
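
Building that command is straightforward once the two pages are allocated. A sketch reusing the admin command struct from the Create I/O SQ example (opcode 0x7C is Doorbell Buffer Config in NVMe 1.3):

```c
/* Tell the (emulated) controller where the shadow doorbell and eventidx
 * pages live. Both buffers are one page each, allocated by the guest driver. */
static void build_doorbell_buffer_config(struct nvme_admin_cmd *cmd,
                                         uint64_t shadow_db_gpa,
                                         uint64_t event_idx_gpa)
{
    cmd->opc  = 0x7C;            /* Doorbell Buffer Config */
    cmd->prp1 = shadow_db_gpa;   /* updated by the guest NVMe driver */
    cmd->prp2 = event_idx_gpa;   /* updated by the SPDK vhost target */
}
```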

Experiments

1 VM with 1 NVMe SSD. Charts (values omitted here): IOPS (K), CPU utilization (%) split into guest usr/sys and host usr/sys, and KVM event counts, each compared across QEMU-NVMe, Vhost-SCSI, Vhost-BLK, and Vhost-NVMe.

System configuration: 2 x Intel Xeon E5-2699 v4 @ 2.2 GHz; 128 GB DDR4-2667, 6 memory channels; SSD: Intel Optane P4800X, FW E2010324, 375 GiB; BIOS: HT disabled, Turbo disabled; host OS: Fedora 25, kernel 4.16.0. 1 VM, VM config: 4 vCPUs, 4 GB memory, 4 I/O queues; VM OS: Fedora 27, kernel 4.16.5-200, blk-mq enabled. Software: QEMU 2.12.0 with the SPDK vhost-NVMe driver patch; 1 vhost core for SPDK; FIO 3.3, iodepth=32, numjobs=4, direct=1, block size=4k, total tested data size=400 GiB.

8 VMs with 4 NVMe SSDs. Charts (values omitted here): random-read IOPS (K) and latency (us) compared across Vhost-SCSI, Vhost-BLK, and Vhost-NVMe.

Note: the Linux kernel NVMe driver polls the completion queue when submitting a new request, which helps decrease interrupt counts and vm_exit events.

System configuration: 2 x Intel Xeon E5-2699 v4 @ 2.2 GHz; 256 GB DDR4-2667, 6 memory channels; SSDs: Intel DC P4510, FW VDV10110, 2 TiB; BIOS: HT disabled, Turbo disabled; host OS: CentOS 7, kernel 4.16.7. 8 VMs, VM config: 4 vCPUs, 4 GB memory, 4 I/O queues; guest OS: Fedora 27, kernel 4.16.5-200, blk-mq enabled. Software: QEMU 2.12.0 with the SPDK vhost-NVMe driver patch; 2 vhost cores for SPDK; FIO 3.3, iodepth=128, numjobs=4, direct=1, block size=4k, runtime=300s, ramp_time=60s; SSDs preconditioned with 2 hours of random writes before the random-read tests.

Conclusion

Conclusion
- In this presentation we introduced the SPDK vhost solution (i.e., vhost-SCSI/BLK/NVMe) to accelerate NVMe I/Os in virtual machines.
Future work
- VM live migration support for the whole SPDK vhost solution (i.e., vhost SCSI/BLK/NVMe).
- Upstreaming the QEMU vhost-NVMe driver.
Promotion
- Welcome to evaluate and use the SPDK vhost target!
- Welcome to contribute to the SPDK community!

Q & A