Improve VNF safety with Vhost-User/DPDK IOMMU support
No UIO anymore!

Maxime Coquelin, Software Engineer
KVM Forum 2017

AGENDA

- Background
- Vhost-user device IOTLB implementation
- Benchmarks
- Future improvements
- Conclusion & Questions

Background

Why do we need IOMMU support?

Current status (i.e. without IOMMU support):
- Use of UIO, or VFIO with enable_unsafe_noiommu_mode = 1
  - Taints the kernel, and requires a rebuild on some distros
- Guest physical addresses (GPAs) are used for virtqueues and buffers
- The vhost-user backend mmaps all of the guest memory with RW permissions
- A DPDK app in the guest could therefore make the vhost backend access memory the app itself has no access to:
  - The guest app passes an arbitrary GPA as a descriptor buffer address
  - The vhost backend overwrites that memory with packet content, or leaks it as a packet (see the sketch below)
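To make the failure mode concrete, here is a minimal sketch of the address translation a backend performs without an IOMMU. The types and the gpa_to_vva() helper are hypothetical, loosely modeled on DPDK's vhost library but not its actual code:

    #include <stdint.h>
    #include <stddef.h>

    /* One region of guest memory the backend has mmap'd RW (hypothetical type). */
    struct guest_mem_region {
        uint64_t gpa_start;   /* guest physical base of the region  */
        uint64_t size;        /* region size in bytes               */
        uint64_t host_vaddr;  /* where it is mmap'd in the backend  */
    };

    /*
     * Translate a guest physical address to a backend virtual address.
     * Without an IOMMU, the only "check" is whether the GPA falls inside
     * guest RAM at all.  Nothing verifies that the guest *application*
     * which queued the descriptor may touch this memory, so any GPA found
     * in a descriptor gets read or written with packet data.
     */
    static uint64_t gpa_to_vva(const struct guest_mem_region *regions,
                               size_t nregions, uint64_t gpa)
    {
        for (size_t i = 0; i < nregions; i++) {
            if (gpa >= regions[i].gpa_start &&
                gpa < regions[i].gpa_start + regions[i].size)
                return regions[i].host_vaddr + (gpa - regions[i].gpa_start);
        }
        return 0; /* not guest RAM */
    }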

IOMMU support in guest

[Diagram: the two guest-side mapping paths, both ending at the IOMMU driver]
- DPDK path: the Virtio PMD (since DPDK v16.11) runs in userspace and maps memory through VFIO, via ioctl(..., VFIO_IOMMU_MAP_DMA, ...), which reaches iommu_map() in the kernel IOMMU framework ((struct iommu_ops).map())
- Kernel path: the Virtio-net driver (since Linux v4.6) goes through the DMA framework, dma_map_sg() and (struct dma_ops).map_page(), which ends up in the IOMMU driver as well

Static vs. dynamic mappings

We consider two types of DMA mappings:
- Dynamic mappings (kernel Virtio-net driver):
  - At least one dma_map()/dma_unmap() per packet
  - At least one IOTLB miss/invalidate per packet
- Static mappings (DPDK Virtio PMD):
  - A single iommu_map()/iommu_unmap() for the whole memory pool, at device probe/remove
  - IOTLB misses only the first time pages are accessed (see the VFIO sketch below)
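As an illustration of the static-mapping path, the work boils down to a single VFIO call at probe time. This sketch uses the real VFIO UAPI from linux/vfio.h; the container setup is elided:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Map one memory-pool region into the IOMMU once, at device probe.
     * 'container_fd' is an open, configured /dev/vfio/vfio container;
     * 'vaddr' is the process virtual address backing the pool, 'iova'
     * the IO virtual address the device will use.
     */
    static int map_pool_static(int container_fd, void *vaddr,
                               uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map dma_map;

        memset(&dma_map, 0, sizeof(dma_map));
        dma_map.argsz = sizeof(dma_map);
        dma_map.vaddr = (uintptr_t)vaddr;
        dma_map.iova  = iova;
        dma_map.size  = size;
        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

        /* One ioctl for the whole pool: no per-packet mapping cost. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
    }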

vIOMMU support in QEMU

- Emulated IOMMU device implementations in QEMU
  - x86 and PowerPC supported, ARM ongoing
  - A platform-agnostic Virtio-IOMMU device spec is being discussed
- Provides IO translation & device isolation, as physical IOMMUs do
- A generic IOTLB/IOMMU API is provided in QEMU:
  - Get an IOTLB entry from (IOVA, perm)
  - IOMMU notifiers (MAP/UNMAP), sketched below
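As a rough illustration of that API, this is how a consumer might register for MAP/UNMAP events with QEMU's memory API, approximately as it looked around QEMU 2.10. These signatures have shifted across releases, so treat this as a sketch rather than current QEMU code:

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    /* Called by the vIOMMU when the guest maps or unmaps an IOVA range. */
    static void my_iommu_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        if (entry->perm == IOMMU_NONE) {
            /* UNMAP: drop cached translations for [iova, iova + addr_mask]. */
        } else {
            /* MAP: entry->translated_addr is now valid for entry->iova. */
        }
    }

    static IOMMUNotifier my_notifier;

    static void register_my_notifier(MemoryRegion *iommu_mr)
    {
        /* Ask for both MAP and UNMAP events over the whole address range. */
        iommu_notifier_init(&my_notifier, my_iommu_notify,
                            IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP,
                            0, UINT64_MAX);
        memory_region_register_iommu_notifier(iommu_mr, &my_notifier);
    }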

vIOMMU support for vhost backend devices in QEMU

- Initially introduced with the kernel backend
- Implements Address Translation Services (ATS) from the PCIe spec
- Uses QEMU's IOTLB/IOMMU APIs
- Vhost-backend changes:
  - Notify the backend of IOTLB invalidations
  - Notify the backend of IOTLB updates
  - Handle backend IOTLB miss requests

vIOMMU support for the vhost backend in the kernel

- Implements a new protocol using vhost-kernel chardev reads/writes
  - Other vhost-kernel requests use ioctls
  - Needed so the kernel can be polled for IOTLB miss requests
- struct vhost_iotlb_msg message types:
  - VHOST_IOTLB_MISS: ask QEMU for an IOTLB entry
  - VHOST_IOTLB_UPDATE: give the kernel a new IOTLB entry
  - VHOST_IOTLB_INVALIDATE: tell the kernel an IOTLB entry is now invalid
- Device IOTLB implemented in the vhost kernel driver
  - Relies on an interval tree for better cache lookup performance: O(log(n))
  - Dedicated cache for virtqueue metadata: O(1)
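For reference, all three message types share one small payload. This is the structure from the kernel UAPI (include/uapi/linux/vhost.h when it was introduced, later moved to vhost_types.h), reproduced here with explanatory comments; double-check against your kernel headers:

    /* Kernel UAPI excerpt: shared payload of all vhost IOTLB messages. */
    struct vhost_iotlb_msg {
        __u64 iova;   /* IO virtual address the entry covers       */
        __u64 size;   /* length of the mapping in bytes            */
        __u64 uaddr;  /* corresponding userspace (QEMU) address    */
    #define VHOST_ACCESS_RO      0x1
    #define VHOST_ACCESS_WO      0x2
    #define VHOST_ACCESS_RW      0x3
        __u8  perm;   /* access permission of the mapping          */
    #define VHOST_IOTLB_MISS           1
    #define VHOST_IOTLB_UPDATE         2
    #define VHOST_IOTLB_INVALIDATE     3
    #define VHOST_IOTLB_ACCESS_FAIL    4
        __u8  type;   /* which of the message types above this is  */
    };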

Vhost-user device IOTLB implementation

Vhost-user protocol update

- Goal: a design as close as possible to the vhost-kernel protocol
- Problem: IOTLB miss requests require slave-initiated requests,
  but the vhost-user socket only supports master-initiated requests
- Solution: introduction of a new socket for slave requests
  - Slave request channel:
    - VHOST_USER_PROTOCOL_F_SLAVE_REQ protocol feature
    - VHOST_USER_SET_SLAVE_REQ_FD request to share the new socket's fd
  - Re-uses the master's message structure, with new request IDs
  - Only used by the IOMMU feature for now

Vhost-user protocol update (cont'd)

IOTLB protocol update (since QEMU v2.10; see the Vhost-user spec for details):
- Master-initiated: VHOST_USER_IOTLB_MSG
  - IOTLB update & invalidation requests
  - Same payload as the vhost-kernel counterpart (struct vhost_iotlb_msg)
  - Reply-ack mandatory
- Slave-initiated: VHOST_USER_SLAVE_IOTLB_MSG
  - IOTLB miss requests
  - Also uses struct vhost_iotlb_msg as payload
  - Reply-ack optional (see the message layout sketched below)
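Putting the two slides together, an IOTLB message on either channel looks roughly like this on the wire, following the header layout in the vhost-user specification (the struct name here is illustrative, not from the spec):

    #include <stdint.h>

    /*
     * Illustrative on-the-wire layout of a vhost-user message carrying an
     * IOTLB payload: a 12-byte header (request, flags, payload size)
     * followed by the payload itself, per the vhost-user specification.
     */
    struct vhost_user_iotlb_wire {
        uint32_t request;   /* e.g. VHOST_USER_IOTLB_MSG (master channel)
                               or VHOST_USER_SLAVE_IOTLB_MSG (slave channel) */
        uint32_t flags;     /* protocol version in the low bits, plus the
                               "need reply" bit when reply-ack is required   */
        uint32_t size;      /* sizeof(struct vhost_iotlb_msg)                */
        struct vhost_iotlb_msg payload; /* same payload as vhost-kernel      */
    };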

IOTLB miss/update sequence

[Sequence diagram between DPDK (PMD thread, vhost-user protocol thread) and QEMU (main thread, vCPU thread)]
1. DPDK PMD thread -> QEMU, over the slave channel: IOTLB miss request
2. QEMU -> DPDK vhost-user protocol thread, over the master channel: IOTLB update request
3. DPDK vhost-user protocol thread: inserts the entry into the device IOTLB cache
4. DPDK vhost-user protocol thread -> QEMU, over the master channel: IOTLB update ack
5. QEMU -> DPDK PMD thread, over the slave channel: IOTLB miss ack (optional)

IOTLB invalidate sequence

[Sequence diagram between DPDK (PMD thread, vhost-user protocol thread) and QEMU (main thread, vCPU thread)]
1. QEMU vCPU thread: traps the guest unmap
2. QEMU -> DPDK vhost-user protocol thread, over the master channel: IOTLB invalidate request
3. DPDK vhost-user protocol thread: removes the entry from the device IOTLB cache
4. DPDK vhost-user protocol thread -> QEMU, over the master channel: IOTLB invalidate ack

IOTLB cache implementation

- Device IOTLB cache implemented in the vhost-user backend
  - Avoids querying QEMU for every address translation
- Single writer, multiple readers on the IOTLB cache:
  - Writer: vhost-user protocol threads (IOTLB updates/invalidates)
  - Readers: PMD threads (IOTLB cache lookups)
- A great case for RCU!
  - Prototyped and tested, but liburcu is LGPLv2.1, so only small functions can be inlined
  - Adds a dependency to the DPDK build
  - Some distros don't ship liburcu

IOTLB cache implementation (cont'd)

- Fallback: readers-writer locks (rte_rwlock)
  - Better than regular mutexes
  - But the read lock uses rte_atomic32_cmpset(), so optimizations are needed to reduce its cost:
    - Per-virtqueue IOTLB cache
    - Read lock taken once per packet burst (see the sketch below)
- Initial cache implementation based on sorted lists
  - Not efficient, but enough with 1G pages; a better implementation is needed for smaller pages
- Cache sized large enough not to face misses with static mappings
  - IOTLB cache evictions should only happen with buggy or malicious guests
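A minimal sketch of the per-burst locking optimization, assuming a hypothetical per-virtqueue cache type (rte_rwlock is DPDK's real readers-writer lock; everything else here is illustrative):

    #include <stdint.h>
    #include <rte_rwlock.h>

    #define IOTLB_CACHE_SIZE 2048

    struct iotlb_entry {
        uint64_t iova, uaddr, size;
    };

    /* Hypothetical per-virtqueue IOTLB cache: one rwlock per virtqueue, so
     * PMD threads polling different queues never contend with each other. */
    struct vq_iotlb_cache {
        rte_rwlock_t lock;   /* writer: protocol thread; readers: PMD threads */
        uint32_t nb_entries;
        struct iotlb_entry entries[IOTLB_CACHE_SIZE]; /* kept sorted by iova */
    };

    /* Linear lookup for brevity; the real cache walks a sorted list. */
    static uint64_t iotlb_cache_lookup(struct vq_iotlb_cache *c, uint64_t iova)
    {
        for (uint32_t i = 0; i < c->nb_entries; i++) {
            struct iotlb_entry *e = &c->entries[i];
            if (iova >= e->iova && iova < e->iova + e->size)
                return e->uaddr + (iova - e->iova);
        }
        return 0; /* miss: caller must trigger an IOTLB miss request */
    }

    /* Take the read lock once for the whole burst rather than once per
     * packet, amortizing the rte_atomic32_cmpset() cost of acquisition. */
    static void translate_burst(struct vq_iotlb_cache *c,
                                const uint64_t *iovas, uint64_t *vvas,
                                uint16_t nb_pkts)
    {
        rte_rwlock_read_lock(&c->lock);
        for (uint16_t i = 0; i < nb_pkts; i++)
            vvas[i] = iotlb_cache_lookup(c, iovas[i]);
        rte_rwlock_read_unlock(&c->lock);
    }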

Benchmarks

Physical-Virtual-Physical (PVP)

[Diagram: traffic generator <-> 2 x 10G NICs <-> DUT running TestPMD (io) on the host, with TestPMD (macswap) in a VM]
- PVP benchmark based on TestPMD
  - IO forwarding on the host side
  - MAC swapping (macswap) in the guest, so the packet header is actually accessed
- Setup information:
  - Traffic generator: T-Rex + binary-search.py from lua-trafficgen (MoonGen/IXIA/... also usable)
  - DUT: E5-2667 v4 @ 3.20GHz (Broadwell), 32GB RAM @ 2400MHz
  - 2 x 10G Intel X710 NICs
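For flavor, one plausible pair of TestPMD invocations for this setup (illustrative only; the exact EAL options, core lists, and vdev arguments depend on the machine):

    # Host side: IO forwarding between the physical NICs and the vhost-user ports
    testpmd -l 1,2,3 --socket-mem 1024 \
        --vdev 'net_vhost0,iface=/tmp/vhost-user1' \
        -- -i --forward-mode=io

    # Guest side: macswap forwarding on the virtio ports
    testpmd -l 1,2 -- -i --forward-mode=macswap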

Physical-Virtual-Physical (cont'd)

PVP reference benchmark with IOTLB series v2

[Bar chart: throughput in Mpps for 1G/1G and 2M/2M hugepage configurations; series: Base (DPDK v17.08), + IOTLB series with IOMMU=off, + IOTLB series with IOMMU=on]
- Parameters: 64B packets, 0.005% acceptable loss, bidirectional testing (result is the sum)
- 1G/1G hugepages (host/guest):
  - IOMMU on/off: no performance regression
  - The Virtio PMD is the bottleneck
- 2M/2M hugepages:
  - IOMMU off: no performance regression
  - IOMMU on: ~25% degradation, caused by IOTLB cache lookup overhead

Micro-benchmark using TestPMD

[Bar charts: throughput in Mpps for guest -> host, host -> guest, and IO loopback, with 1G and 2M hugepages; series: Base (DPDK v17.08), + IOTLB series with IOMMU=off, + IOTLB series with IOMMU=on]

Future improvements

Contiguous IOTLB entries merging

- Performance penalty with 2MB hugepages due to the higher number of IOTLB entries
  - IOTLB cache lookup overhead
- Most IOTLB entries are both virtually AND physically contiguous
- A rough prototype merging such entries fixes the performance penalty:
  - Fewer IOTLB cache lookup iterations
  - Better CPU cache utilization
- Remaining questions (see the merge sketch below):
  - Invalidation strategy: invalidate the whole merged entry, or split it?
  - Is there a performance impact with dynamic mappings?
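A minimal sketch of the merge test such a prototype needs, with a hypothetical entry layout (not DPDK's actual one):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical IOTLB cache entry. */
    struct iotlb_entry {
        uint64_t iova;
        uint64_t uaddr;
        uint64_t size;
        uint8_t  perm;
    };

    /* Try to merge a new mapping into an existing entry.  Merging is only
     * valid when the ranges are contiguous in IOVA space *and* in backend
     * virtual address space, with identical permissions; one entry then
     * covers both, shortening every subsequent lookup. */
    static bool iotlb_try_merge(struct iotlb_entry *prev,
                                const struct iotlb_entry *next)
    {
        if (prev->perm != next->perm)
            return false;
        if (prev->iova + prev->size != next->iova)
            return false;               /* not IOVA-contiguous      */
        if (prev->uaddr + prev->size != next->uaddr)
            return false;               /* not virtually contiguous */

        prev->size += next->size;       /* extend instead of insert */
        return true;
    }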

Interval tree based IOTLB cache

- The vhost-kernel backend uses an interval tree for its IOTLB cache implementation
  - O(log(n)) lookup
- The current vhost-user backend only implements a sorted list
  - O(n) lookup
- Required work:
  - New interval tree library in DPDK
  - Convert vhost-user's IOTLB cache implementation to it

IOTLB misses batching

- IOMMU support with the Virtio-net kernel driver is not viable today due to poor performance
  - Bursting is broken by an IOTLB miss for every packet
- Idea: before starting the packet burst loop, translate all descriptor buffer addresses
  - If no translations are missing, start the burst
  - If some are missing, send IOTLB miss requests for all of them at once (sketched below)
- Might also improve overall performance with multiple vhost-user ports per lcore
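A sketch of the batching idea, with hypothetical helper names for the cache probe and the slave-channel miss request:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers (illustrative names): a non-blocking cache
     * probe, and an asynchronous miss request on the slave channel. */
    bool iotlb_cache_probe(uint64_t iova, uint64_t *vva);
    void iotlb_send_miss_request(uint64_t iova);

    /* Pre-translate a whole burst.  Returns true if every descriptor
     * buffer already has a cached translation; otherwise fires one miss
     * request per missing address and skips the burst, instead of
     * stalling on each packet one by one. */
    static bool burst_pretranslate(const uint64_t *desc_iovas, uint64_t *vvas,
                                   uint16_t nb_desc)
    {
        bool complete = true;

        for (uint16_t i = 0; i < nb_desc; i++) {
            if (!iotlb_cache_probe(desc_iovas[i], &vvas[i])) {
                iotlb_send_miss_request(desc_iovas[i]);
                complete = false;   /* keep going: batch all the misses */
            }
        }
        return complete; /* caller starts the burst only when true */
    }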

Conclusion

Conclusion

- The vhost-user design stays close to vhost-kernel
- Reasonable performance impact with static mappings
  - And more improvements are coming soon!
- The performance impact remains a blocker for dynamic mappings
- Special thanks to:
  - Jason Wang & Wei Xu: vhost-kernel IOMMU support
  - Peter Xu: vIOMMU support in QEMU

Questions? THANK YOU!