vhost-user-scsi: offloading virtio-scsi to userspace


vhost-user-scsi: offloading virtio-scsi to userspace. Felipe Franciosi (AHV Engineering, Nutanix), Jim Harris (SPDK Architect, Intel). 27 October 2017


State of the Art: QEMU handles VQs

[Diagram: two VMs, each running a storage workload on a virtio-scsi device (virtual HW). Requests are placed on virtqueues in guest memory; each QEMU process's main loop (device emulation) picks them up, with kvm.ko relaying guest kicks and interrupts, and the host Linux kernel driving the HW. Numbered steps 1-8 trace a request from guest submission to completion.]

Motivation

The current design imposes certain limitations:
- A 1:1 mapping of VM to host process for handling VQs
- Impossible to have a single process handling multiple VMs
- Impossible to efficiently poll on multiple VMs
- Non-trivial to virtualise a single NVMe device to multiple VMs from userspace

Solution:
- Create a mechanism to offload the datapath to a separate process
- Precedent: vhost-user-net

[Further frames of the diagram above: virtqueue processing moves out of the QEMU main loop into a vhost-user-scsi (VUS) application with dedicated IO threads; first one VUS process per VM, then a single VUS process whose IO threads service the virtqueues of multiple VMs directly from shared guest memory.]

Trade-offs

Main benefit: better performance
- Easier to implement MQ support (with multiple IO threads)
- Easier to optimise for better batching: first pick up everything from all VQs (multiple VMs), then submit it all together to your backend
- Simple to implement VQ polling for multiple VMs

Drawbacks
- When using a single backend process, security and stability concerns arise; with separate processes, these are no different than today
- LUN management is independent (QEMU is no longer aware of LUNs)
- Features generally available from the QEMU block layer are lost

Side-by-side: vhost-user and vhost

[Diagram: on the vhost-user side, QEMU allocates the VQs and exposes the PCI config to the guest, then hands the datapath to a VUS process over a unix socket; guest physical addresses resolve through shared memory mappings, and kicks/irqs are relayed via kvm.ko. On the kernel vhost side, QEMU instead drives a vhost device via ioctl(), with vhost-scsi in the kernel servicing the VQs.]

Design/Implementation Highlights

Small refactor of vhost-scsi (in QEMU)
- Original vhost-scsi split into: vhost-scsi-common and vhost-scsi
- New vhost-user-scsi introduced under vhost-scsi-common

Live Migration
- Very straightforward, but ended up dropped from the merge
- Trick: flush the device on GET_VRING_BASE
- Future work: throttling vhost(-user) devices for convergence

Key difference with vhost: unix messages are async
- ioctl() blocks until the kernel module has finished processing the command
- On vhost-user, use F_REPLY_ACK to wait for completion (see the sketch below)
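
As a concrete illustration of that asynchronous-message point, here is a minimal Python sketch of the vhost-user framing with the need_reply bit set, which is how a master waits for completion once F_REPLY_ACK has been negotiated. The header layout, flag bits and status payload follow the public vhost-user specification; the helper name and error handling are ours.

```python
import socket
import struct

# vhost-user framing, per the spec: a 12-byte header (request, flags, size)
# in native byte order, followed by `size` bytes of payload.
VHOST_USER_VERSION = 0x1        # protocol version lives in flags bits 0-1
VHOST_USER_NEED_REPLY = 1 << 3  # valid once F_REPLY_ACK has been negotiated

def send_with_reply_ack(sock: socket.socket, request: int, payload: bytes) -> None:
    """Send one message and block until the slave acknowledges it.
    The helper name and error handling are ours, not part of any real API."""
    flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY
    sock.sendall(struct.pack("=III", request, flags, len(payload)) + payload)
    # The slave answers with the same header plus a u64 status (0 = success).
    req, _rflags, size = struct.unpack("=III", sock.recv(12))
    (status,) = struct.unpack("=Q", sock.recv(size))
    if status != 0:
        raise RuntimeError(f"slave reported failure for request {req}")
```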

Storage Performance Development Kit (SPDK)

What is SPDK?
- Userspace polled-mode drivers, libraries and applications for storage, storage networking and storage virtualization
- Leverages DPDK for hugepage memory management and PCI device enumeration
- Started in 2013, open sourced in 2015
- BSD licensed
- http://spdk.io

Storage Performance Development Kit (SPDK)

[Architecture diagram]
- Storage Protocols: NVMe-oF* Target, iSCSI Target, vhost-user-scsi Target, vhost-user-blk Target (released Q4 '17)
- Block Device Layer (bdev): NVMe, Linux AIO, Ceph RBD, GPT, virtio-scsi, libpmemblk
- Storage Services: Logical Volumes, Blobstore, BlobFS
- Polled Mode Drivers: NVMe PMD (NVMe-oF* Initiator, NVMe* PCIe), virtio-scsi PMD, Intel QuickData Technology

Basic Architecture: configure the vhost-scsi controller

- JSON RPC creates the SPDK constructs for the vhost device and its backing storage: vhost-scsi ctrlr -> scsi dev -> scsi lun -> bdev -> nvme (NVMe SSD); illustrated below
- This also creates a controller-specific vhost domain socket (e.g. /spdk/vhost.0)
- [Diagram also shows Logical Core 0 and Logical Core 1, idle at this stage]
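
For flavor, a sketch of what that JSON RPC configuration could look like when sent to the SPDK application over its RPC channel. The method and parameter names follow the 2017-era SPDK RPC API as best we recall (they have since been renamed), and the socket path is an assumption; treat this as illustrative, not authoritative.

```python
import json
import socket

def spdk_rpc(method: str, params: dict) -> dict:
    """Send one JSON-RPC 2.0 request to a local SPDK app and return the reply.
    /var/tmp/spdk.sock is an assumed default; adjust to your deployment."""
    req = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect("/var/tmp/spdk.sock")
        s.sendall(json.dumps(req).encode())
        return json.loads(s.recv(65536))

# Attach the local NVMe SSD as a bdev, create the vhost-scsi controller
# (which creates the /spdk/vhost.0 domain socket), then expose the bdev as
# a LUN on SCSI target 0. Method names are 2017-era assumptions.
spdk_rpc("construct_nvme_bdev",
         {"name": "Nvme0", "trtype": "PCIe", "traddr": "0000:05:00.0"})
spdk_rpc("construct_vhost_scsi_controller", {"ctrlr": "vhost.0"})
spdk_rpc("add_vhost_scsi_lun",
         {"ctrlr": "vhost.0", "scsi_target_num": 0, "bdev_name": "Nvme0n1"})
```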

Basic Architecture: launch the VM

- QEMU connects to the domain socket (/spdk/vhost.0); see the launch sketch below
- SPDK assigns a logical core, starts the vhost-scsi poller, allocates an NVMe queue pair, and starts the bdev-nvme poller
- [Diagram: the VM's three VQs now feed the vhost-scsi ctrlr -> scsi dev -> scsi lun -> bdev -> nvme chain, with vhost-scsi and bdev-nvme pollers running on Logical Core 0 and a queue pair (QP) on the NVMe SSD]
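
A sketch of the corresponding QEMU invocation, assembled in Python to stay consistent with the other examples. vhost-user requires guest RAM to be shareable with the SPDK process, hence the file-backed, share=on memory object; paths, sizes and IDs are assumptions.

```python
import subprocess

# Guest RAM must live in shared, file-backed memory so the SPDK process can
# map the virtqueues directly. Paths and IDs here are assumptions.
qemu_cmd = [
    "qemu-system-x86_64", "-enable-kvm",
    "-cpu", "host", "-smp", "1", "-m", "2G",
    "-object", "memory-backend-file,id=mem0,size=2G,"
               "mem-path=/dev/hugepages,share=on",
    "-numa", "node,memdev=mem0",
    # Character device pointing at the SPDK vhost domain socket...
    "-chardev", "socket,id=spdk_vhost0,path=/spdk/vhost.0",
    # ...and the vhost-user-scsi device that attaches to it.
    "-device", "vhost-user-scsi-pci,chardev=spdk_vhost0",
    "-drive", "file=guest.img,format=qcow2,if=virtio",
]
subprocess.run(qemu_cmd, check=True)
```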

Basic Architecture: repeat for additional VMs

- Pollers spread across the available cores
- [Diagram: vhost-scsi/bdev-nvme poller pairs distributed across Logical Core 0 and Logical Core 1, each VM with its own queue pair (QP) on the shared NVMe SSD]

Asynchronous Polling

- A reactor on each core iterates through its pollers round-robin (modeled in the sketch below)
- vhost-scsi poller: polls for new I/O requests and submits them to the NVMe SSD
- bdev-nvme poller: polls for I/O completions and completes them back to the guest VM
- [Diagram: the same per-core poller pairs as the previous slide, annotated with this polling loop]
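
A minimal sketch of that reactor pattern, assuming nothing about SPDK internals beyond what is stated above: one thread per core, no locks, and every registered poller invoked in turn.

```python
from typing import Callable, List

class Reactor:
    """One reactor per core: spins over its pollers round-robin, never blocks.
    An illustrative model only, not SPDK's actual implementation."""

    def __init__(self) -> None:
        self.pollers: List[Callable[[], None]] = []

    def register(self, poller: Callable[[], None]) -> None:
        self.pollers.append(poller)

    def run(self) -> None:
        while True:
            for poll in self.pollers:
                # e.g. vhost-scsi: drain VQs and submit to the NVMe QP;
                #      bdev-nvme: reap completions back to the guest.
                poll()

# Usage (poller functions are hypothetical):
#   reactor = Reactor()
#   reactor.register(vhost_scsi_poll)
#   reactor.register(bdev_nvme_poll)
#   reactor.run()
```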

Sharing SSDs in userspace

- Typically not a 1:1 mapping of VM to locally attached NVMe SSD; otherwise just use PCI direct assignment
- What about SR-IOV? SR-IOV SSDs are not prevalent yet, and SR-IOV precludes features such as snapshots
- What about LVM? LVM depends on the Linux kernel block layer and storage drivers (i.e. nvme), while SPDK wants to use userspace polled-mode drivers
- Answer: SPDK Blobstore and Logical Volumes!

Blobstore Design: Design Goals

- Minimalistic, for targeted storage use cases like Logical Volumes and RocksDB
- Deliver only the basics needed to enable another class of application
- Designed for fast storage media

Blobstore Design: High Level

- Applications interact with chunks of data called blobs: mutable arrays of pages of data, accessible via ID
- Asynchronous: no blocking, queueing or waiting (see the toy sketch below)
- Fully parallel: no locks in the I/O path
- Atomic metadata operations: relies on SSD atomicity (i.e. NVMe); 1+ 4KB metadata pages per blob
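
To illustrate the fully asynchronous, callback-driven style this implies, a toy sketch follows. The function names (bs_create_blob, blob_write) are invented for illustration and do not mirror SPDK's actual API; an in-memory dict stands in for the SSD.

```python
from typing import Callable

# Toy async blob API: every operation returns immediately and reports
# completion via callback, so the calling reactor thread never blocks.
# Names are hypothetical; an in-memory dict stands in for the SSD.
_blobs: dict = {}

def bs_create_blob(cb: Callable[[int, int], None]) -> None:
    blob_id = len(_blobs)
    _blobs[blob_id] = bytearray()  # on real media this metadata write is atomic
    cb(blob_id, 0)                 # completion delivered via callback

def blob_write(blob_id: int, offset_pages: int, data: bytes,
               cb: Callable[[int, int], None]) -> None:
    buf = _blobs[blob_id]
    start, end = offset_pages * 4096, offset_pages * 4096 + len(data)
    buf.extend(b"\0" * max(0, end - len(buf)))
    buf[start:end] = data
    cb(blob_id, 0)

def on_write_done(blob_id: int, rc: int) -> None:
    print(f"blob {blob_id}: write completed, rc={rc}")

# Create a blob, then write one 4KB page to it; both steps complete via
# callbacks rather than blocking.
bs_create_blob(lambda blob_id, rc: blob_write(blob_id, 0, b"\x01" * 4096,
                                              on_write_done))
```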

Logical Volumes

Blobstore plus:
- UUID xattrs for the lvolstore and its lvols
- Friendly names: an lvol name is unique within its lvolstore; an lvolstore name is unique within the application (example below)
- Future: snapshots (requires blobstore support)

[Diagram: lvol bdevs stacked on an lvolstore, which sits on blobstore, which sits on an nvme bdev backed by the NVMe SSD]
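
Continuing the RPC sketch from the Basic Architecture section, carving logical volumes out of an NVMe bdev could look like the following. The method names again follow the 2017-era RPC API as best we recall and are assumptions, as is the spdk_rpc() helper defined earlier.

```python
# Create an lvolstore on the NVMe bdev, then a 100GB logical volume on it.
# Names are friendly and scoped: an lvol is unique within its lvolstore, an
# lvolstore within the application. Method names are assumptions.
spdk_rpc("construct_lvol_store", {"bdev_name": "Nvme0n1", "lvs_name": "lvs0"})
spdk_rpc("construct_lvol_bdev",
         {"lvs_name": "lvs0", "lvol_name": "vm0disk", "size": 100 * 1024**3})
```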

SPDK vhost Performance: Configuration

- 2x Intel Xeon Platinum 8180 (28 cores each): 46 cores for VMs, 10 cores for vhost
- 23x Intel Optane SSD DC P4800X: one SPDK lvolstore or LVM volume group per SSD
- 46 VMs, each with 1 vCPU, 2GB DRAM and a 100GB logical volume; 2 VMs per lvolstore/volume group

SPDK vhost Performance

[Charts: IO/s (in millions) for Linux vs SPDK, and QD=1 latency (in microseconds) for Linux vs QEMU vs SPDK]

System configuration: 2S Intel Xeon Platinum 8180 (28C, 2.5GHz, HT off, Intel Turbo Boost Technology enabled), 12x 16GB DDR4-2133, 1 DIMM per channel, Ubuntu* Server 16.04.2 LTS with 4.11 kernel, 23x Intel Optane SSD DC P4800X 375GB, 1 SPDK lvolstore or LVM volume group per SSD, SPDK commit ID c5d8b108f22ab, 46 VMs (CentOS, 3.10 kernel, 1 vCPU, 2GB DRAM, 100GB logical volume), vhost dedicated to 10 cores.

As measured by fio 2.10.1: direct=1, 4KB random read I/O, ramp_time=30s, runtime=180s, norandommap=1, ioengine=libaio, numjobs=1.

Legend: Linux = kernel vhost-scsi; QEMU = virtio-blk dataplane; SPDK = userspace vhost-scsi.

Future

- vhost-user-blk
- Live Migration
- Logical Volume snapshots

SPDK Community

- Home Page: http://www.spdk.io/
- GitHub: https://github.com/spdk/spdk
- Trello: https://trello.com/spdk
- GerritHub: https://review.gerrithub.io/#/q/project:spdk/spdk+status:open
- IRC: https://freenode.net/ we're on #spdk

Thank you! Questions? james.r.harris@intel.com felipe@nutanix.com