I/O Scalability in Xen


I/O Scalability in Xen
Kevin Tian <kevin.tian@intel.com>
Eddie Dong <eddie.dong@intel.com>
Yang Zhang <yang.zhang@intel.com>

Agenda
- Overview of I/O scalability issues
  - Excessive interrupts hurt I/O
  - NUMA challenge
- Proposals
  - Soft interrupt throttling in Xen
  - Interrupt-Less NAPI (ILNAPI)
  - Host I/O NUMA
  - Guest I/O NUMA

Retrospect
2009 Xen Summit (Eddie Dong): "Extending I/O Scalability in Xen"
Covered topics:
- VNIF: multiple TX/RX tasklets, notification frequency
- VT-d: vEOI optimization, vintr delivery
- SR-IOV: adaptive interrupt coalescing (AIC)
Interrupt is the hotspot!

New Challenges Always Exist
- Interrupt overhead is increasingly high
  - One 10G Niantic NIC may incur 512k intr/s: 64 (VFs + PF) x 8000 intr/s
  - Similar for dom0 when multiple queues are used
  - 40G NICs are coming
- Prevalent NUMA architecture (even on 2-node low-end servers)
  - The DMA distance to the memory node matters (I/O NUMA)
  - Without I/O NUMA awareness, DMA accesses may be suboptimal
- Need a breakthrough in software architecture

Excessive Interrupts Hurt! (SR-IOV Rx Netperf)
[Chart: CPU% (left axis, 0-350%) and bandwidth in Mb/s (right axis, 0-10000) for 1vm through 7vm, at ITR = 8000, 4000, 2000 and 1000 intr/s.]

Excessive Interrupts Hurt!
[Same chart, annotated:]
- Bandwidth is not saturated with a low interrupt rate!
- CPU% increases fast with a high interrupt rate!

Excessive Interrupts Hurt! (Cont.)
Excessive VM-exits (7vm as example):
- External Interrupts: 35k/s
- APIC Access: 49k/s
- Interrupt Window: 7k/s
(See "Tackling the Management Challenges of Server Consolidation on Multi-core System", Hui Lv, Xen Summit 2011 SC)
Excessive context switches
Excessive ISR/softirq overhead, both in Xen and in the guest
Similar impact for dom0 using a multi-queue NIC

NUMA Status in Xen
[Diagram: processor nodes with attached memory and memory buffers, an IOH/PCH with integrated PCI-e devices, and I/O devices.]
- Host CPU/Memory NUMA: administrable based on capacity plan
- Guest CPU/Memory NUMA: not supported, but extensively discussed
- Lack of manageability for host I/O NUMA and guest I/O NUMA

NUMA Related Structures
An integral combo for CPU, memory and I/O devices:
- System Resource Affinity Table (SRAT): associates CPUs and memory ranges with a proximity domain
- System Locality Distance Table (SLIT): distance among proximity domains
- _PXM (Proximity) object: the standard way to describe proximity info for I/O devices
Solely acquiring _PXM info of I/O devices is not enough to construct I/O NUMA knowledge!

Host I/O NUMA Issues
No host I/O NUMA awareness in Dom0
- Dom0 owns the majority of I/O devices
- Dom0 memory is first allocated by skipping the DMA zone; DMA memory is reallocated for contiguity later
- The above allocations are made round-robin within the node_affinity mask, with no consideration of the actual I/O NUMA topology
Complex and confusing if dom0 handles host I/O NUMA itself
- Implies physical CPU/Memory awareness in dom0 too: virtual NUMA vs. host NUMA?
- Xen, however, has no knowledge of _PXM()

Guest I/O NUMA Issues
Guest needs I/O NUMA awareness to handle assigned devices
- Guest NUMA is the premise, but guest NUMA is not upstream yet!
- Extensive talks in previous Xen summits:
  - "VM Memory Allocation Schemes and PV NUMA Guests", Dulloor Rao
  - "Xen Guest NUMA: General Enabling Part", Jun Nakajima
- Already extensive discussions and work; now is the time to push it upstream!
No I/O NUMA information is exposed to the guest
Lack of I/O NUMA awareness in the device assignment process

Proposals
Per-interrupt overhead has been studied extensively; now we want to reduce the number of interrupts!

The Effect of Dynamic Interrupt Rate
A manual tweak of ITR based on VM number (8000 / vm_num)
[Chart: CPU% and bandwidth in Mb/s for 1vm through 7vm, comparing ITR=8000, ITR=1000, and dynamic ITR.]

Software Interrupt Throttling in Xen
Throttle virtual interrupts based on administrative policies
- Based on shared resources (e.g. bandwidth / VM number)
- Based on priority and SLAs
- Applies to both PV and HVM guests
Fewer virtual interrupts reduce guest ISR/softirq overhead
It may further throttle physical interrupts too, if the device doesn't trigger a new interrupt while an earlier request is still pending

Interrupt-Less NAPI (ILNAPI)
NAPI itself doesn't eliminate interrupts:
- NAPI logic is scheduled by the rx interrupt handler
- The interrupt is masked when NAPI is scheduled, and unmasked when NAPI completes the current poll
What about scheduling NAPI without interrupts?
- We can piggyback the NAPI schedule on other events: system calls, other interrupts, scheduling, ...
- The internal NAPI schedule overhead is much less than a heavy device->Xen->VM interrupt path
Yes, that's Interrupt-Less NAPI (ILNAPI)

Interrupt-Less NAPI (Cont.)
[Diagram: the net core syscall path and other ISRs feed a NAPI event pool, which schedules the poll in the IXGBEVF driver; the IXGBE NIC IRQ remains as a fallback path.]
ILNAPI_HIGH watermark:
- Activated when there are too many notifications within the guest
- Serves as the high watermark for NAPI schedule frequency
ILNAPI_LOW watermark:
- Activated when there are insufficient notifications
- Serves as the low watermark to ensure reasonable traffic handling
- May move back to the interrupt-driven manner

Interrupt-Less NAPI (Cont.)
[Chart: CPU% and bandwidth in Mb/s for 1vm through 7vm, comparing ITR=8000, ITR=1000, and ILNAPI.]

Interrupt-Less NAPI (Cont.)
Watermarks can be adaptively chosen by the driver
- Based on bandwidth/buffer estimation
- Or via an enlightened scheme: Xen may provide guidance through a shared buffer
  - Resource utilization (e.g. VM number)
  - Administrative policies
  - SLA requirements
ILNAPI can be turned on/off dynamically under Xen's control
- E.g. in cases where latency is a major concern

Proposals
We need to close the Xen architecture gaps for both host I/O NUMA and guest I/O NUMA!

Host I/O NUMA
Give Xen full NUMA information:
- Xen already sees SRAT/SLIT
- New hypercall to convey I/O proximity info (_PXM) from Dom0; Xen needs to extend _PXM to all child devices
- Extend the DMA reallocation hypercall to carry a device ID; may need a Xen version of set_dev_node
Xen reallocates DMA memory based on proximity info
CPU access in dom0 remains NUMA-unaware
- E.g. the communication between backend/frontend drivers

Guest I/O NUMA
Okay, let's help guest NUMA support in Xen!
An IOMMU may also span nodes:
- ACPI defines the Remapping Hardware Static Affinity (RHSA) table: the association between an IOMMU and a proximity domain
- Allocate the remapping table based on RHSA and proximity domain info

Guest I/O NUMA (Cont.)
Make up guest I/O NUMA awareness:
- Construct a _PXM method for assigned devices in the device model, based on guest NUMA info (SRAT/SLIT)
Extend the control panel to favor I/O NUMA:
- Assign devices which are in the same proximity domain as the specified nodes of the guest
- Or, affine the guest to the node to which the assigned device is affined
- The policy for SR-IOV may be more constrained, e.g. all guests sharing the same SR-IOV device run on the same node
- Warn the user when optimal placement can't be assured

Summary
- I/O scalability is challenging every time we re-examine it!
- Excessive interrupts hurt I/O scalability, but there are means, both in Xen and in the guest, to mitigate them!
- CPU/Memory NUMA is well managed in Xen, but I/O NUMA awareness is still not in place!

Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. All dates provided are subject to change without notice. Intel is a trademark of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2007, Intel Corporation. All rights reserved.
