Shrinking the Hypervisor One Subsystem at a Time
A Userspace Packet Switch for Virtual Machines

Julian Stecklina
OS Group, TU Dresden
jsteckli@os.inf.tu-dresden.de

VEE 2014, Salt Lake City
1. Motivation
2. Userspace Switch
3. Evaluation
4. Summary

TU Dresden, Userspace Packet Switch, slide 2 of 30
01 Microkernels and TCB
Systems built on microkernels are usually structured as multiserver systems with strong isolation between subsystems. Applications depend only on the subsystems they use.
[Diagram: a µkernel in kernel mode; application, networking, and filesystem servers isolated from each other in user mode.]
01 Towards Monolithic Hypervisors
[Diagram, built up over four slides: the device model (Qemu with virtio-net) starts out in user mode; then KVM, the virtual APIC (vapic), and in-kernel virtio networking (v-net) successively move hypervisor functionality into kernel mode.]
01 User vs. Kernel
Attacks by malicious guest code are a serious concern.
Successful attacks on Qemu achieve unprivileged code execution: dangerous, but manageable.
Successful attacks on KVM achieve code execution in kernel mode: game over. <- You are here.
01 KVM/vhost Trusted Code
In a modern KVM installation the complete networking path is in the TCB of all applications on the host:
- (simple) instruction decoding
- the virtio-net device implementation
- the NIC driver
Does it have to be?
1. Motivation
2. Userspace Switch
3. Evaluation
4. Summary
02 KVM Networking
[Diagram, built up over three slides: with vhost, a VM exit on a VCPU of VM A triggers a notification to an in-kernel vhost thread, which injects an IRQ into VM B; with sv3, the same notification and IRQ-injection path runs through the user-mode sv3 process instead, with VCPUs and sv3 placed on separate CPUs (CPU 1, CPU 2, CPU 3).]
02 A Userspace Switch: sv3
A userspace packet switch running as an ordinary process on top of the host Linux. It implements:
- virtio-net
- (no) packet memory management
- the NIC driver
Every sv3 instance is a complete, isolated networking subsystem.
02 Vhost and KVM
KVM and vhost are only loosely coupled; Qemu ties them together using ioctls and eventfds:
- KVM can trigger eventfds on VM exits.
- eventfds can be used to trigger IRQ injection in KVM.
The same eventfds can be used from userspace as well, without involving vhost at all.
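The eventfd primitive at the heart of this coupling can be shown standalone. A minimal sketch; the KVM-specific registration (the KVM_IOEVENTFD ioctl turns a VM exit into an eventfd signal, KVM_IRQFD turns an eventfd signal into an IRQ injection) is only mentioned in comments, not performed:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal an eventfd twice, then read the accumulated counter back.
 * KVM hooks into exactly this primitive: KVM_IOEVENTFD signals the
 * fd on a VM exit, KVM_IRQFD injects an IRQ when the fd is signaled. */
uint64_t eventfd_roundtrip(void)
{
    int efd = eventfd(0, 0);        /* counter starts at 0 */
    uint64_t one = 1, val = 0;

    write(efd, &one, sizeof(one));  /* "notify": counter += 1 */
    write(efd, &one, sizeof(one));  /* notifications coalesce  */
    read(efd, &val, sizeof(val));   /* returns the counter (2) and resets it */
    close(efd);
    return val;
}
```

Note that notifications coalesce: the reader sees one value covering all signals since the last read, which is exactly the behavior a polling switch wants.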
02 Tying sv3 into Qemu
Enhanced Qemu to support out-of-process PCI devices:
- Qemu connects to sv3 via an AF_LOCAL socket.
- Qemu passes file descriptors over it to establish shared memory.
- Qemu exchanges eventfds for VM exit notification and IRQ injection.
sv3 implements the complete virtio-net logic, and it feels like L4!
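Exchanging descriptors over an AF_LOCAL socket works via SCM_RIGHTS ancillary data. A minimal sketch with hypothetical helper names; the actual Qemu/sv3 protocol carries more state alongside the fds:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over an AF_LOCAL socket as SCM_RIGHTS
 * ancillary data, alongside a single dummy payload byte. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent with send_fd(); returns -1 on error.
 * The kernel installs a fresh fd in the receiving process. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    int fd = -1;
    struct cmsghdr *cm;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm && cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

This is what makes the "capabilities" analogy in the summary concrete: possession of the socket endpoint is what entitles a process to receive the shared-memory and eventfd descriptors.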
02 Zero-Copy Packet Transmission
sv3 creates linear mappings of guest memory with mmap, so packet data can be moved with a single plain memcpy; no additional copies are necessary. If no buffer space is available in the receiving VM, the packet is dropped. As a result, no dynamic memory management for packets is needed.
[Diagram: a copy from VM A through an sv3-owned buffer into VM B, versus a direct copy from VM A's memory into VM B's memory.]
02 External Communication
A userspace driver for the Intel X520 10 GBit NIC using VFIO (requires an IOMMU), supporting static offloads:
- TCP Segmentation Offload
- Large Receive Offload
- Checksum Offload
Translating virtio descriptors directly into HW descriptors allows zero-copy send with all offloads.
Lesson learned: better to reuse existing drivers next time...
02 Switching Loop
sv3 is mostly single-threaded and lockless. Userspace RCU is used to synchronize adding and removing switch ports.
1. disable events on all virtio queues
2. disable HW IRQs
3. poll for work until all queues are empty
4. re-enable events/IRQs
5. poll one last time; if a packet is seen, goto 1
6. block on eventfd
In overload scenarios, sv3 naturally operates in polling mode.
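The six steps can be sketched around a nonblocking eventfd. A simplified skeleton, not the sv3 implementation: queue draining is reduced to counting notifications, and the final blocking read is replaced by a return so the sketch terminates:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Poll-then-block skeleton of the switching loop.  efd must be
 * created with EFD_NONBLOCK; "processing" a notification here just
 * means counting it. */
int switch_loop(int efd, int max_rounds)
{
    uint64_t n;
    int processed = 0;

    for (int round = 0; round < max_rounds; round++) {
        /* steps 1-3: with events disabled, drain all pending work */
        while (read(efd, &n, sizeof(n)) == sizeof(n))
            processed += (int)n;

        /* step 4 (re-enabling events/IRQs) has no effect in this
         * sketch; step 5: poll one last time to close the race with
         * a notification that arrived while events were disabled */
        if (read(efd, &n, sizeof(n)) == sizeof(n)) {
            processed += (int)n;
            continue;   /* goto 1 */
        }

        /* step 6: nothing seen -> the real loop blocks on the
         * eventfd here; the sketch simply returns instead */
        break;
    }
    return processed;
}
```

The re-check in step 5 is the crucial detail: without it, a notification raised between "queues empty" and "block" would be lost until the next unrelated wakeup.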
1. Motivation
2. Userspace Switch
3. Evaluation
4. Summary
03 Resource Consumption
Processes are lightweight alternatives to driver VMs.

  sv3           <  2 MiB
  NIC driver    + 14 MiB
  sv3 total     < 16 MiB
  smallest VM     32 MiB [1]
  netback VM     128 MiB [1]

[1] Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor, Colp et al., SOSP '11
03 Evaluation System
- Intel Core i7 3770S (Ivy Bridge); C-states, HT, and frequency scaling disabled
- 16 GiB RAM @ 159 Gbit/s
- Host: Fedora 19 with Linux 3.10 (vanilla)
- Guest: Linux 3.10 (vanilla), 256 MiB RAM
- Qemu 1.5 (plus patches)
03 Cost of Userspace Execution
[Bar chart: roundtrip latency in μs (0 to 14) for vhost vs. sv3.]
Notification-to-IRQ-injection times for vhost vs. sv3 without any packet processing: the cost of the additional trip to the syscall layer and the mode switch.
03 Latency
[Bar chart: roundtrip latency in μs (0 to 100) for vhost-external, vhost-vm, sv3-external, sv3-vm.]
Latency measured using netperf UDP_RR.
03 VM-to-VM Bandwidth
[Plot: CPU utilization (0 to 3) vs. bandwidth in GBit/s (0 to 30) for vhost with TSO, sv3 with TSO, and sv3 only.]
03 VM-to-VM Bandwidth (cont'd)
[Plot: CPU utilization (0 to 3) vs. bandwidth in GBit/s (1 to 9) for vhost without TSO, sv3 without TSO, and sv3 only.]
03 External-to-VM Bandwidth
[Plot: CPU utilization (0 to 3) vs. bandwidth in GBit/s (1 to 8) for vhost with TSO, sv3 with TSO, and sv3 only.]
1. Motivation
2. Userspace Switch
3. Evaluation
4. Summary
04 Summary
sv3 is an efficient, lockless userspace packet switch for VMs running on (unmodified) Linux/KVM. Code: https://github.com/blitz/sv3
Linux/KVM has all the mechanisms to make a microkernel-style design possible and efficient:
- rights transfer via AF_LOCAL sockets (capabilities)
- efficient notifications via eventfds
- drivers in userspace via VFIO, using eventfds for IRQs
- tying eventfds to VM exits / IRQ injection
- address-space switch cost is not a factor in performance
There are few reasons left to write new systems functionality in kernel mode.
Questions?
04 External-to-VM Bandwidth
[Plot: CPU utilization (0 to 3) vs. bandwidth in GBit/s (1 to 6) for sv3 without TSO and sv3 only.]