vhost-user-scsi: offloading virtio-scsi to userspace
Felipe Franciosi (AHV Engineering, Nutanix)
Jim Harris (SPDK Architect, Intel)
27 October 2017
Disclaimer
This presentation and the accompanying oral commentary may include express and implied forward-looking statements, including but not limited to statements concerning our business plans and objectives, product features and technology that are under development or in process and capabilities of such product features and technology, our plans to introduce product features in future releases, the implementation of our products on additional hardware platforms, strategic partnerships that are in process, product performance, competitive position, industry environment, and potential market opportunities. These forward-looking statements are not historical facts, and instead are based on our current expectations, estimates, opinions and beliefs. The accuracy of such forward-looking statements depends upon future events, and involves risks, uncertainties and other factors beyond our control that may cause these statements to be inaccurate and cause our actual results, performance or achievements to differ materially and adversely from those anticipated or implied by such statements, including, among others: failure to develop, or unexpected difficulties or delays in developing, new product features or technology on a timely or cost-effective basis; delays in or lack of customer or market acceptance of our new product features or technology; the failure of our software to interoperate on different hardware platforms; failure to form, or delays in the formation of, new strategic partnerships and the possibility that we may not receive anticipated results from forming such strategic partnerships; the introduction, or acceleration of adoption of, competing solutions, including public cloud infrastructure; a shift in industry or competitive dynamics or customer demand; and other risks detailed in our Form 10-Q for the fiscal quarter ended April 30, 2017, filed with the Securities and Exchange Commission.
These forward-looking statements speak only as of the date of this presentation and, except as required by law, we assume no obligation to update forward-looking statements to reflect actual results or subsequent events or circumstances. Any future product or roadmap information is intended to outline general product directions, and is not a commitment, promise or legal obligation for Nutanix to deliver any material, code, or functionality. This information should not be used when making a purchasing decision. Further, note that Nutanix has made no determination as to if separate fees will be charged for any future product enhancements or functionality which may ultimately be made available. Nutanix may, in its own discretion, choose to charge separate fees for the delivery of any product enhancements or functionality which are ultimately made available. Certain information contained in this presentation and the accompanying oral commentary may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this presentation, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. This document contains information on products in the design phase of development. Applies only to halogenated flame retardants and PVC in components.
Pursuant to JEP-709, halogens are below 1,000ppm bromine and 1,000ppm chlorine. The replacement of halogenated flame retardants and/or PVC may not be better for the environment. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2016 Intel Corporation. All rights reserved.
State of the Art: QEMU handles VQs
[Diagram: VMs running a virtio-scsi storage workload on virtual hardware; each QEMU process holds the guest's memory and virtqueue requests, its main loop performs device emulation, and kvm.ko mediates between guest, Linux host and HW; numbered steps 1-8 trace a request through the stack]
Motivation
Current design imposes certain limitations:
- A 1:1 mapping of VM to host process for handling VQs
- Impossible to have a single process handling multiple VMs
- Impossible to efficiently poll on multiple VMs
- Non-trivial to virtualise a single NVMe device to multiple VMs from userspace
Solution:
- Create a mechanism to offload the datapath to a separate process
- Precedent: vhost-user-net
[Diagram, two build steps: VQ handling is offloaded from QEMU's main loop into a separate VUS APP (vhost-user-scsi application) with its own IO Threads — first one VUS APP per VM, then a single VUS APP whose IO Threads serve multiple VMs]
Trade-offs
Main benefit: better performance
- Easier to implement MQ support (with multiple IO threads)
- Easier to optimise for better batching
  - First pick up everything from all VQs (multiple VMs)
  - Then submit it all together to your backend
- Simple to implement VQ polling for multiple VMs
Drawbacks
- When using a single backend process, security and stability
- With separate processes, these are no different than today
- LUN management is independent (QEMU is not aware of LUNs anymore)
- Features generally available from the QEMU block layer are lost
Side-by-side: vhost-user and vhost
[Diagram, two stacks. Left, vhost-user: QEMU allocates the VQs and exposes the PCI config; guest memory (GPAs) is shared with the VUS process over a unix socket; kicks and irqs are relayed via kvm.ko. Right, kernel vhost: QEMU sets up a vhost dev via ioctl(); the vhost-scsi kernel module handles the VQs, with kicks and irqs through kvm.ko.]
Design/Implementation Highlights
Small refactor of vhost-scsi (in QEMU)
- Original vhost-scsi split into: vhost-scsi-common and vhost-scsi
- New vhost-user-scsi introduced under vhost-scsi-common
Live Migration
- Very straightforward, but ended up dropped from the merge
- Trick: flush the device on GET_VRING_BASE
- Future work: throttling vhost(-user) devices for convergence
Key difference from vhost: unix msgs are async
- ioctl() blocks until the kernel module has finished processing the command
- On vhost-user, use F_REPLY_ACK to wait for completion
Storage Performance Development Kit (SPDK)
What is SPDK?
- Userspace polled-mode drivers, libraries and applications for storage, storage networking and storage virtualization
- Leverages DPDK for hugepage memory management and PCI device enumeration
- Started in 2013, open sourced in 2015
- BSD licensed
http://spdk.io
Storage Performance Development Kit (SPDK)
[Architecture diagram]
- Storage Protocols: NVMe-oF* Target, iSCSI Target, vhost-user-scsi Target (SCSI), vhost-user-blk Target (released Q4 '17)
- Storage Services: Block Device Layer (bdev) — Logical Volumes, Linux AIO, Ceph RBD, GPT, virtio-scsi, libpmemblk, BlobFS, Blobstore
- Polled Mode Drivers: NVMe (PCIe), NVMe-oF* Initiator, virtio-scsi PMD, Intel QuickData Technology
Basic Architecture
Configure vhost-scsi controller:
- JSON RPC creates the SPDK constructs for the vhost device and backing storage
- Creates a controller-specific vhost domain socket (/spdk/vhost.0)
[Diagram: vhost-scsi ctrlr → scsi dev → scsi lun → bdev → nvme → NVMe SSD, with logical cores 0 and 1]
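The RPC flow above can be sketched with SPDK's rpc.py script. A minimal sketch, assuming an SPDK vhost target is already running, an NVMe SSD at PCI address 0000:01:00.0 (a placeholder), and the RPC method names from SPDK releases of this era (later releases renamed them, e.g. to vhost_create_scsi_controller):

```shell
# Create a bdev on the local NVMe SSD (produces a bdev named NVMe0n1).
./scripts/rpc.py construct_nvme_bdev -b NVMe0 -t PCIe -a 0000:01:00.0

# Create the vhost-scsi controller; this also creates the
# controller-specific domain socket (e.g. /spdk/vhost.0, relative
# to the vhost target's socket directory).
./scripts/rpc.py construct_vhost_scsi_controller vhost.0

# Attach the bdev as SCSI target 0 behind the controller.
./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 NVMe0n1
```

Each RPC is a JSON request over the target's RPC socket; nothing here touches the kernel block layer.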
Basic Architecture
Launch VM:
- QEMU connects to the domain socket (/spdk/vhost.0)
- SPDK assigns a logical core, starts vhost-scsi, allocates an NVMe queue pair, and starts bdev-nvme
[Diagram: VM VQs feeding the SPDK vhost-scsi ctrlr → scsi dev → scsi lun → bdev → nvme → NVMe SSD queue pair]
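The VM launch can be sketched as below. The memory-backend-file object is needed because vhost-user requires the guest RAM to be shareable with the SPDK process; the disk image name, memory size and hugepage path are illustrative:

```shell
qemu-system-x86_64 \
    --enable-kvm -smp 1 -m 2G \
    -object memory-backend-file,id=mem0,size=2G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=spdk_vhost0,path=/spdk/vhost.0 \
    -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost0 \
    -drive file=guest.img,format=qcow2,if=virtio
```

Once QEMU connects to the chardev socket, the vhost-user negotiation hands the VQs over to SPDK and the guest sees an ordinary virtio-scsi device.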
Basic Architecture
Repeat for additional VMs: VQs spread across available cores
[Diagram: vhost-scsi/bdev-nvme poller pairs for each VM distributed over logical cores 0 and 1, sharing the NVMe SSD through separate queue pairs]
Asynchronous Polling
Poller execution:
- A reactor on each core iterates through its pollers round-robin
- vhost-scsi poller: polls for new I/O requests, submits them to the NVMe SSD
- bdev-nvme poller: polls for I/O completions, completes them to the guest VM
[Diagram: VM VQs → scsi dev → scsi lun → bdev → nvme → NVMe SSD queue pair, with vhost-scsi/bdev-nvme poller pairs on logical cores 0 and 1]
Sharing SSDs in userspace
- Typically not 1:1 VM to locally attached NVMe SSD (otherwise just use PCI direct assignment)
- What about SR-IOV? SR-IOV SSDs not prevalent yet; precludes features such as snapshots
- What about LVM? LVM depends on the Linux kernel block layer and storage drivers (i.e. nvme); SPDK wants to use userspace polled-mode drivers
- SPDK Blobstore and Logical Volumes!
Blobstore Design
Design Goals:
- Minimalistic, for targeted storage use cases like Logical Volumes and RocksDB
- Deliver only the basics to enable another class of application
- Designed for fast storage media
Blobstore Design
High Level:
- Application interacts with chunks of data called blobs: mutable arrays of pages of data, accessible via ID
- Asynchronous: no blocking, queueing or waiting
- Fully parallel: no locks in the IO path
- Atomic metadata operations: depends on SSD atomicity (i.e. NVMe); 1+ 4KB metadata pages per blob
Logical Volumes
Blobstore plus:
- UUID xattr for lvolstore and lvols
- Friendly names: lvol name unique within lvolstore; lvolstore name unique within application
- Future: snapshots (requires blobstore support)
[Diagram: bdev lvols carved out of an lvolstore/blobstore, backed by bdev nvme and the NVMe SSD]
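Carving logical volumes out of an NVMe bdev can be sketched with the same era's RPC names (hedged: the bdev name NVMe0n1, the lvolstore/lvol names, and the size-in-MiB interpretation of the last argument are assumptions):

```shell
# Create an lvolstore (a blobstore) covering the whole NVMe bdev.
./scripts/rpc.py construct_lvol_store NVMe0n1 lvs0

# Create a 100GB logical volume bdev inside it
# (size assumed to be given in MiB here).
./scripts/rpc.py construct_lvol_bdev -l lvs0 lvol0 102400
```

The resulting lvol bdev can then be attached behind a vhost-scsi controller like any other bdev, which is how the performance setup below provisions 100GB volumes per VM.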
SPDK vhost Performance - Configuration
- 2x Intel Xeon Platinum 8180 (28 cores each): 46 cores for VMs, 10 cores for vhost
- 23x Intel P4800x Optane SSD: one SPDK lvolstore / LVM lvgroup per SSD
- 46 VMs: 1 vCPU, 2GB DRAM, 100GB logical volume each; 2 VMs per lvolstore/lvgroup
SPDK vhost Performance
[Charts: IO/s (in millions) and QD=1 latency (in µs), comparing Linux, QEMU and SPDK configurations]
System Configuration: 2S Intel Xeon Platinum 8180: 28C, E5-2699v3: 18C, 2.5GHz (HT off), Intel Turbo Boost Technology enabled, 12x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* Server 16.04.2 LTS, 4.11 kernel, 23x Intel P4800x Optane SSD 375GB, 1 SPDK lvolstore or LVM lvgroup per SSD, SPDK commit ID c5d8b108f22ab, 46 VMs (CentOS 3.10 kernel, 1 vCPU, 2GB DRAM, 100GB logical volume), vhost dedicated to 10 cores.
As measured by: fio 2.10.1 — Direct=Yes, 4KB random read I/O, Ramp Time=30s, Run Time=180s, Norandommap=1, I/O Engine=libaio, Numjobs=1.
Legend: Linux: kernel vhost-scsi; QEMU: virtio-blk dataplane; SPDK: userspace vhost-scsi
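The fio parameters in the configuration line translate roughly to the invocation below, run inside each VM (the target device path /dev/sdb is a placeholder for the virtio-scsi disk):

```shell
fio --name=randread \
    --filename=/dev/sdb \
    --ioengine=libaio \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --iodepth=1 \
    --norandommap=1 \
    --numjobs=1 \
    --ramp_time=30 \
    --runtime=180 \
    --time_based
```

iodepth=1 matches the QD=1 latency measurement; aggregate IO/s comes from running this concurrently across all 46 VMs.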
Future
- vhost-user-blk
- Live Migration
- Logical Volume snapshots
SPDK Community
- Home Page: http://www.spdk.io/
- GitHub: https://github.com/spdk/spdk
- Trello: https://trello.com/spdk
- GerritHub: https://review.gerrithub.io/#/q/project:spdk/spdk+status:open
- IRC: https://freenode.net/ — we're on #spdk
Thank you! Questions?
james.r.harris@intel.com
felipe@nutanix.com