Changpeng Liu, Senior Storage Software Engineer, Intel Data Center Group

Legal Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Intel Advanced Vector Extensions (Intel AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

SPDK vhost Architecture

[Diagram: QEMU runs the guest VM with virtio-scsi/blk devices; the SPDK vhost target, built on the DPDK vhost library, runs as a separate host process. Control messages travel over a UNIX domain socket, completions are signaled through eventfd, and the virtqueues live in guest VM memory that is shared with the host.]
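
The shared guest memory in the diagram is what allows the target to dereference guest buffers directly. Below is a minimal sketch of that idea, assuming the memory regions have already been received over the vhost-user socket and mmap()ed; the struct is a simplified stand-in for the vhost-user memory-region layout, not SPDK's actual type.

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified view of one region from the vhost-user memory table, after
     * the target has mmap()ed the file descriptor that accompanied it. */
    struct mem_region {
        uint64_t guest_phys_addr;  /* start of the region in guest physical space */
        uint64_t size;             /* length of the region in bytes               */
        void    *host_vaddr;       /* where the vhost target mapped it locally    */
    };

    /* Translate a guest physical address (e.g. from a virtqueue descriptor)
     * into a pointer the vhost target can use directly. */
    void *gpa_to_hva(const struct mem_region *regions, size_t nregions,
                     uint64_t gpa)
    {
        for (size_t i = 0; i < nregions; i++) {
            const struct mem_region *r = &regions[i];
            if (gpa >= r->guest_phys_addr && gpa < r->guest_phys_addr + r->size)
                return (char *)r->host_vaddr + (gpa - r->guest_phys_addr);
        }
        return NULL;  /* address not covered by the memory table */
    }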

SPDK vhost I/O Flow

1. The guest application submits I/O and the guest kernel places the request on a virtqueue.
2. The SPDK vhost target polls the virtqueue from the host.
3. The device (SPDK vhost target) executes the I/O.
4. The guest receives a completion interrupt, injected through KVM.
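
The polling step can be illustrated with a short sketch. This is not the actual SPDK implementation: the structure below is a simplified stand-in for the split virtqueue the DPDK vhost library exposes, memory barriers and the used-ring element are omitted, and execute_io() is a hypothetical backend hook.

    #include <stdint.h>
    #include <unistd.h>

    /* Simplified stand-in for a split virtqueue; real code also needs
     * memory barriers and fills a used-ring element per completion. */
    struct vq {
        uint16_t           size;        /* number of ring entries            */
        uint16_t           last_avail;  /* last available index consumed     */
        volatile uint16_t *avail_idx;   /* written by the guest driver       */
        uint16_t          *avail_ring;  /* descriptor heads posted by guest  */
        volatile uint16_t *used_idx;    /* written by the device (the target)*/
        int                callfd;      /* eventfd used to interrupt guest   */
    };

    /* Hypothetical backend hook: execute one request described by 'head'. */
    void execute_io(struct vq *q, uint16_t head);

    /* One pass of the vhost reactor: poll, execute, complete, notify. */
    void poll_virtqueue(struct vq *q)
    {
        int completed = 0;

        while (q->last_avail != *q->avail_idx) {              /* 2. poll    */
            uint16_t head = q->avail_ring[q->last_avail % q->size];
            q->last_avail++;
            execute_io(q, head);                              /* 3. execute */
            (*q->used_idx)++;                                 /* complete   */
            completed = 1;
        }
        if (completed) {
            uint64_t one = 1;
            (void)write(q->callfd, &one, sizeof(one));        /* 4. notify  */
        }
    }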

Comparison with Existing Solutions

[Diagram: three I/O paths side by side, each with a guest VM using the virtio_scsi driver.
- QEMU virtio-scsi target: virtio-scsi-pci device emulated inside QEMU, backed by the host kernel NVMe driver.
- Vhost kernel target: vhost-scsi-pci device configured via ioctl, backed by the LIO vhost target in the host kernel and the host kernel NVMe driver.
- Vhost userspace target: vhost-user-scsi-pci device configured over a UNIX domain socket, backed by the SPDK vhost target and its userspace NVMe polled-mode driver.]

SPDK vhost Target Summary

Vhost target        QEMU support            Guest support        Container / similar solutions
Vhost-SCSI target   Yes                     Yes (kernel + PMD)   Yes
Vhost-Blk target    Yes                     Yes (kernel + PMD)   Yes
Vhost-NVMe target   No (SPDK QEMU branch)   Yes (kernel + PMD)   No

- Vhost-SCSI: QEMU 2.9 added vhost-user-scsi-pci host driver support.
- Vhost-Blk: QEMU 2.11 added vhost-user-blk-pci host driver support.
- Vhost-NVMe: a new device type that presents an NVMe controller to the VM, so the guest's native kernel NVMe driver can be used.

Virtual Machine Acceleration (Released)

SPDK vhost target:
- Provides dynamic block device provisioning
- Increases VM density
- Decreases guest latency
- Works with KVM/QEMU

[Diagram: the VM attaches to the SPDK vhost target, whose SCSI layer sits on top of the bdev layer; bdevs include logical volumes on Blobstore, NVMe bdevs backed by the SPDK NVMe driver and data-center SSDs, and NVMe-oF bdevs backed by the NVMe-oF initiator connecting to a remote NVMe-oF target.]

Virtio-SCSI/Blk Driver (Released)

Virtio-SCSI/Blk is an initiator for the SPDK vhost target, exposed to applications through the SPDK bdev layer.

The Virtio-SCSI/Blk driver supports two usage models:
- PCI mode: a polling-mode driver inside the guest VM, sitting on top of the virtio-blk/virtio-scsi PCI device.
- User vhost: connects directly to the vhost target through a UNIX domain socket, e.g. from containers or multi-process applications.

48 VMs: vhost-scsi Performance (SPDK vs. Kernel)
Intel Xeon Platinum 8180 processor, 24x Intel P4800X 375 GB, 2 partitions per VM, 10 vhost I/O processing cores

[Chart: aggregate IOPS in millions across all 48 VMs. 4K 100% random read: 9.23 (SPDK) vs. 2.86 (kernel), 3.2x; 4K 100% random write: 8.98 vs. 2.77, 3.2x; 4K 70% read / 30% write: 9.49 vs. 3.4, 2.7x.]

- Aggregate IOPS across all 48 VMs reported; all VMs run on separate cores from the vhost-scsi cores.
- 10 vhost-scsi cores used for I/O processing; cgroups used to restrict the kernel vhost-scsi processes to 10 cores.
- SPDK vhost-scsi delivers up to 3.2x higher IOPS with 4K 100% random reads.

System configuration: Intel Xeon Platinum 8180 @ 2.5 GHz, 56 physical cores; 6x 16 GB DDR4-2667, 6 memory channels; SSDs: 24x Intel P4800X 375 GB; BIOS: HT disabled, P-states enabled, turbo enabled; Ubuntu 16.04.1 LTS, kernel 4.11.0 x86_64; 48 VMs, 2 partitions each; VM config: 1 core, 1 GB memory; VM OS: Fedora 25, blk-mq enabled; software: QEMU 2.9, libvirt 3.0.0, SPDK (3bfecec994); I/O distribution: 10 vhost cores for SPDK/kernel, remaining 46 cores for QEMU via cgroups; FIO 2.1.10 with SPDK plugin, iodepth=1, 8, 32, numjobs=1, direct=1, 4K block size.

NUMA vs. Non-NUMA: SPDK vhost-scsi
Intel Xeon Platinum 8180 processor, 24x Intel P4800X 375 GB, 48 VMs, 10 vhost-scsi cores

[Chart: aggregate IOPS in millions (higher is better). 4K 100% read: 8.43 (non-NUMA) vs. 9.23 (NUMA optimized); 4K 100% write: 8.15 vs. 8.98; 4K 70% read / 30% write: 8.3 vs. 9.49.]

- Roughly 10% performance improvement with NUMA optimization.
- The NUMA optimization ensures each vhost-scsi core is matched to the socket where its NVMe drives are attached.

System configuration: identical to the 48-VM test above.

Virtio-Blk Protocol Extension

- Faster than the virtio-scsi protocol because it eliminates the SCSI mid-layer inside the guest kernel.
- The Linux block layer supports multiple queues for virtio-blk.
- Historically lacked DISCARD/WRITE ZEROES commands; the virtio-blk protocol specification has now added this feature (see https://github.com/oasis-tcs/virtio-spec). Linux kernel and QEMU driver support is expected to land soon.
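
For reference, the extension adds two request types whose data payload is a list of segments. The sketch below shows the layout; field names and numeric values follow the published virtio specification and Linux uapi headers, but consult the spec linked above for the authoritative definition.

    #include <stdint.h>

    /* Request types added by the DISCARD / WRITE ZEROES extension. */
    #define VIRTIO_BLK_T_DISCARD       11
    #define VIRTIO_BLK_T_WRITE_ZEROES  13

    /* Each such request carries one or more of these segments as its data
     * payload; all fields are little-endian on the wire. */
    struct virtio_blk_discard_write_zeroes {
        uint64_t sector;       /* starting sector, in 512-byte units   */
        uint32_t num_sectors;  /* number of sectors to discard or zero */
        uint32_t flags;        /* bit 0: unmap hint for WRITE ZEROES   */
    };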

Vhost-NVMe

What is vhost-nvme?
- Uses the NVMe specification as the communication protocol between the guest and the slave I/O target.
- Uses a UNIX domain socket as the message channel to set up I/O queues and interrupt notifiers for the guest.
- Builds on the NVMe 1.3 specification virtualization enhancements.

What is the benefit?
- The native kernel NVMe driver can be used inside the VM without any extra modification.
- It eliminates the SCSI mid-layer present in the existing vhost-scsi solution, which improves performance.

[Diagram: QEMU exposes a vhost-user-nvme PCI device (an emulated NVMe controller) to the guest kernel and talks to the SPDK vhost target over a UNIX domain socket; the target services I/O through the bdev layer and the userspace NVMe polled-mode driver.]

NVMe 1.3 Specification Enhancement

NVMe 1.3 virtualization enhancements used here:
1. Optional Admin Command Support
2. Doorbell Buffer Config

[Diagram: the submission and completion queues sit in host (guest) memory together with a shadow submission queue tail buffer and a shadow completion queue head buffer; the real submission queue tail and completion queue head doorbells remain in the controller's PCI BAR.]
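
The point of the shadow buffers is to spare the guest a VM exit on every doorbell write: the driver records the new queue index in the shadow buffer, which the target polls, and only touches the real (trapping) doorbell register when the value crosses the event index the device advertised. The sketch below shows the guest-side decision, modeled on the wrap-safe comparison used by the Linux NVMe driver for Doorbell Buffer Config; it is simplified and leaves out the memory barriers real code requires.

    #include <stdbool.h>
    #include <stdint.h>

    /* Wrap-safe check (same idea as virtio event indexes): did new_idx pass
     * event_idx since the last value (old) the device saw? */
    static bool dbbuf_need_event(uint16_t event_idx, uint16_t new_idx,
                                 uint16_t old)
    {
        return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
    }

    /* Guest driver side: update the shadow doorbell and only write the real
     * MMIO doorbell (causing a VM exit) when the device asked for it. */
    static void ring_sq_doorbell(volatile uint32_t *real_db,  /* trapping BAR reg  */
                                 uint32_t *shadow_db,         /* shared shadow buf */
                                 const uint32_t *event_idx,   /* device-written    */
                                 uint16_t new_tail)
    {
        uint16_t old = (uint16_t)*shadow_db;

        *shadow_db = new_tail;                 /* visible to the polling target */
        if (dbbuf_need_event((uint16_t)*event_idx, new_tail, old))
            *real_db = new_tail;               /* fall back to the MMIO doorbell */
    }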

Vhost-NVMe Implementation

Table 1: Vhost socket messages
- Get Controller Capabilities: returns the Controller Capabilities register defined by the NVMe specification.
- Get Device ID: returns the vendor ID of the NVMe controller emulated by QEMU.
- Get/Set Controller Configuration: enables/disables the emulated NVMe controller.
- Admin Command Pass-through: Admin commands routed to the slave target.
- Set Memory Table: sets the memory map regions on the slave target so it can translate the I/O queue addresses.
- Set Guest Notifier: sets the event file descriptor used to interrupt the guest when an I/O completes.
- Set Event Notifier: sets the event file descriptor for AER.

Table 2: Mandatory Admin commands in the slave target
- Identify / Identify Namespace: QEMU gets the identify data from the slave target and can cache a local copy to avoid repeated vhost messages.
- Create/Delete Submission Queue, Create/Delete Completion Queue: QEMU allocates/deletes the queues and sends the Admin command to the slave target; each Create Completion Queue command is followed by a Set Guest Notifier for IRQ notification.
- Abort: the slave target processes the Abort command.
- Asynchronous Event Request: QEMU sends the command to the slave target, followed by a Set Event Notifier for the real AER.
- Doorbell Buffer Config: sets the shadow doorbell buffer in the slave target.
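
As a rough illustration of how Table 1 could map onto the vhost socket transport, the sketch below assigns one request ID per message plus a small header. The identifiers are invented for illustration only and do not claim to match the constants used in the SPDK/QEMU vhost-user-nvme branch; real vhost-user messages also carry flags and pass file descriptors (eventfds, memory fds) as ancillary data on the socket.

    #include <stdint.h>

    /* Illustrative message IDs for the vhost socket protocol in Table 1
     * (hypothetical names, not the actual SPDK/QEMU constants). */
    enum vhost_nvme_msg_id {
        VHOST_NVME_GET_CAP = 1,         /* Get Controller Capabilities          */
        VHOST_NVME_GET_DEVICE_ID,       /* Get Device ID (vendor ID)            */
        VHOST_NVME_GET_CONFIG,          /* Get Controller Configuration         */
        VHOST_NVME_SET_CONFIG,          /* Set Controller Configuration         */
        VHOST_NVME_ADMIN_PASSTHROUGH,   /* Admin command routed to the target   */
        VHOST_NVME_SET_MEM_TABLE,       /* share guest memory regions           */
        VHOST_NVME_SET_GUEST_NOTIFIER,  /* eventfd for I/O completion interrupt */
        VHOST_NVME_SET_EVENT_NOTIFIER,  /* eventfd for AER                      */
    };

    /* Minimal message header followed by a payload, e.g. a 64-byte NVMe admin
     * command for VHOST_NVME_ADMIN_PASSTHROUGH. */
    struct vhost_nvme_msg {
        uint32_t id;          /* one of enum vhost_nvme_msg_id */
        uint32_t size;        /* number of valid payload bytes */
        uint8_t  payload[64];
    };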

Use Cases

Integrating SPDK Blobstore with RocksDB for MySQL inside VMs:
- The Optane SSD can be partitioned into several logical volumes, one per VM, for latency-critical log usage.
- Enabling the WAL on SPDK provides a short I/O path without any data copies.

[Diagram: each QEMU VM runs MySQL on RocksDB. The binlog and SST files go through the POSIX API and the guest filesystem to an NVMe SSD (Intel DC P3600/P3700/P4500/P4600 series) attached via PCI passthrough with the native NVMe driver; the WAL goes through SPDK Blobstore to an emulated NVMe device (10 GiB per VM) served by the SPDK vhost-user NVMe target, whose admin and I/O queues are backed by an Intel Optane DC P4800X.]

Benchmarks
1 VM with 1 NVMe SSD, 4 vCPUs

[Charts: 4K random-read IOPS (K, higher is better) and CPU usage (usr+sys, lower is better) at numjobs = 1, 2, 4, comparing QEMU-NVMe, kernel vhost-scsi, SPDK vhost-scsi, SPDK vhost-blk, and SPDK vhost-nvme.]

System configuration: 2x Intel Xeon E5-2699 v4 @ 2.2 GHz; 128 GB DDR4-2667, 6 memory channels; SSD: Intel P3700 800 GB, FW 8DV101H0; BIOS: HT disabled; CentOS 7.4 (kernel 4.12.5); 1 VM, VM config: 4 cores, 4 GB memory; VM OS: Fedora 25 (kernel 4.14.0), blk-mq enabled; software: QEMU 2.11; I/O distribution: 1 vhost core for SPDK; FIO: iodepth=128, numjobs=1/2/4, direct=1, 4K block size.

Benchmarks and KVM Events

[Charts: left, 4K random-read IOPS (K, higher is better) for 2, 4, and 8 VMs sharing 4 NVMe SSDs (4 vCPUs each), comparing QEMU-NVMe, kernel vhost-scsi, SPDK vhost-scsi, SPDK vhost-blk, and SPDK vhost-nvme; right, KVM event counts (lower is better) for 1 VM with 1 NVMe SSD — kvm_msi_set_irq, kvm_exit, and guest controller interrupts — comparing SPDK vhost-scsi, SPDK vhost-blk, and SPDK vhost-nvme.]

System configuration: 2x Intel Xeon E5-2699 v4 @ 2.2 GHz; 128 GB DDR4-2667, 6 memory channels; SSD: Intel P4510 2 TB, FW QDV1013A; BIOS: HT disabled; CentOS 7.4 (kernel 4.12.5); VM config: 4 cores, 4 GB memory; VM OS: Fedora 25 (kernel 4.14.0), blk-mq enabled; software: QEMU 2.11; I/O distribution: 1 vhost core for SPDK; FIO: iodepth=128 with numjobs=2, and iodepth=64 with numjobs=4, size=100 GB, direct=1, 4K block size.