I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班

Similar documents
Intel Virtualization Technology Roadmap and VT-d Support in Xen

SR-IOV support in Xen. Yaozu (Eddie) Dong Yunhong Jiang Kun (Kevin) Tian

Intel Graphics Virtualization on KVM. Aug KVM Forum 2011 Rev. 3


Making Nested Virtualization Real by Using Hardware Virtualization Features

Nested Virtualization Update From Intel. Xiantao Zhang, Eddie Dong Intel Corporation

KVM for IA64. Anthony Xu

Junhong Jiang, Kevin Tian, Chris Wright, Don Dugger

Graphics Pass-through with VT-d

Hardware-Assisted Mediated Pass-Through with VFIO. Kevin Tian Principal Engineer, Intel

Enhancing pass through device support with IOMMU. Haitao Shan Yunhong Jiang Allen M Kay Eddie (Yaozu) Dong

Intel Virtualization Technology for Directed I/O

Intel Virtualization Technology for Directed I/O Architecture Specification

COSC6376 Cloud Computing Lecture 14: CPU and I/O Virtualization

I/O Scalability in Xen

Intel Virtualization Technology for Directed I/O

KVM as The NFV Hypervisor

Practical Xen Testing at Intel

Intel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation

How to abstract hardware acceleration device in cloud environment. Maciej Grochowski Intel DCG Ireland


Virtualization. ! Physical Hardware Processors, memory, chipset, I/O devices, etc. Resources often grossly underutilized

Virtualization. Starting Point: A Physical Machine. What is a Virtual Machine? Virtualization Properties. Types of Virtualization

Achieve Low Latency NFV with Openstack*

Nested Virtualization Friendly KVM

Linux and Xen. Andrea Sarro. andrea.sarro(at)quadrics.it. Linux Kernel Hacking Free Course IV Edition

Interrupt Swizzling Solution for Intel 5000 Chipset Series based Platforms

evm for Windows* User Manual

The Transition to PCI Express* for Client SSDs

Intel s s Security Vision for Xen

Runtime VM Protection By Intel Multi-Key Total Memory Encryption (MKTME)

Knut Omang Ifi/Oracle 20 Oct, Introduction to virtualization (Virtual machines) Aspects of network virtualization:

Advanced Operating Systems (CS 202) Virtualization

Knut Omang Ifi/Oracle 6 Nov, 2017

Chapter 5 C. Virtual machines

Intel Transparent Computing

Intel Virtualization Technology for Directed I/O

CS-580K/480K Advanced Topics in Cloud Computing. VM Virtualization II

Intel s Virtualization Extensions (VT-x) So you want to build a hypervisor?

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

Introduction to SGX (Software Guard Extensions) and SGX Virtualization. Kai Huang, Jun Nakajima (Speaker) July 12, 2017

Virtualization. Operating Systems, 2016, Meni Adler, Danny Hendler & Amnon Meisels

Introduction to Virtual Machines. Carl Waldspurger (SB SM 89 PhD 95) VMware R&D

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names

Live Migration of vgpu

Virtualization. Pradipta De

Non-Volatile Memory Cache Enhancements: Turbo-Charging Client Platform Performance

I/O Virtualization The Next Virtualization Frontier

Micro VMMs and Nested Virtualization

Programmed I/O accesses: a threat to Virtual Machine Monitors?

Optimizing and Enhancing VM for the Cloud Computing Era. 20 November 2009 Jun Nakajima, Sheng Yang, and Eddie Dong

Hardware assisted Virtualization in Embedded

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Fast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names

Introduction to Intel Boot Loader Development Kit (Intel BLDK) Intel SSG/SSD/UEFI

Distributed Systems COMP 212. Lecture 18 Othon Michail

Intel Rapid Storage Technology (Intel RST) Production Version Release

System Virtual Machines

Virtualization. Virtualization

Virtual Virtual Memory

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

BIOS Update Release Notes

Zhang Tianfei. Rosen Xu

MxGPU Setup Guide with VMware

I/O and virtualization

Nested Virtualization and Server Consolidation

Intel Rapid Storage Technology (Intel RST) Production Version Release

Pacifica Next Generation Architecture for Efficient Virtual Machines

BIOS Update Release Notes

Background. IBM sold expensive mainframes to large organizations. Monitor sits between one or more OSes and HW

Towards More Power Friendly Xen

CSE 120 Principles of Operating Systems

Innovating and Integrating for Communications and Storage

VGPU ON KVM VFIO BASED MEDIATED DEVICE FRAMEWORK Neo Jia & Kirti Wankhede, 08/25/2016

BIOS Update Release Notes

Rack Disaggregation Using PCIe Networking

Network device virtualization: issues and solutions

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems

Operating Systems 4/27/2015

VIRTIO: VHOST DATA PATH ACCELERATION TORWARDS NFV CLOUD. CUNMING LIANG, Intel

Development of I/O Pass-through: Current Status & the Future. Nov 21, 2008 Yuji Shimada NEC System Technologies, Ltd.

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Virtualization, Xen and Denali

Module 1: Virtualization. Types of Interfaces

Intel Rapid Storage Technology (Intel RST) Production Version Release

Intel Rapid Storage Technology (Intel RST) (Intel Optane Memory SDK Only) Production Version Release

Reference Boot Loader from Intel

Jeff Dodson / Avago Technologies

VIRTIO-NET: VHOST DATA PATH ACCELERATION TORWARDS NFV CLOUD. CUNMING LIANG, Intel

Virtualisation: The KVM Way. Amit Shah

Virtually Impossible

Virtualization (II) SPD Course 17/03/2010 Massimo Coppola

SR-IOV Networking in Xen: Architecture, Design and Implementation

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

BIOS Update Release Notes

Intel SRCS28X RAID Controller 814G Firmware Upgrade for Intel Storage System SSR316MJ+

BIOS Update Release Notes

Evolving Small Cells. Udayan Mukherjee Senior Principal Engineer and Director (Wireless Infrastructure)

Transcription:

I/O virtualization Jiang, Yunhong <yunhong.jiang@intel.com> Yang, Xiaowei <xiaowei.yang@intel.com> 1

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright 2009 Intel Corporation. 2

Agenda Overview Software Emulation Para-virtualization Direct I/O SR-IOV 3

A Retrospect on Physical Platform Processor Memory USB SATA Disk USB host controller USB Device PCIe PCIe IOH ICH 10 Intel QuickPath Interconnect PCIe PCIe LPC Display NIC Legacy Device 4

Device Semantic Software communicate with device Port I/O MMIO Interrupt Device transfer data to and from system memory DMA access Events notification from device Interrupt Device discovery and configuration Configuration space access in PCI device Processor I/O System Memory DMA Device 5

I/O Virtualization Overview Present virtual I/O to VM Isolation Performance Scalability Reliability Approaches Software Emulation Para-virtualization Direct I/O 6

Agenda Overview Software Emulation Para-virtualization Direct I/O SR-IOV 7

Overview Emulate existing hardware completely in software components (aka. Device Model) Must maintain same SW interface as native device E.g. I/O, MMIO, Interrupt, DMA But, could use arbitrary media to emulate the virtualized device E.g. VMM can use a SATA disk (or a partition, a file) to emulate a virtualized IDE disk Device Model IRQ Emul VM Exit Apps Native Driver I/O IRQ Virtual Interrupt Transparent to VM software stack Driver Hypervisor Device 8

I/O Access Emulation Device Model Apps 0x70, w, 1 State machine 0x90, w, 2 Handlers Native Driver Driver VM Exit VM Entry VMM Device HW 9

I/O Access Emulation (II) VMCS IO bitmap decides which IO port access will cause VM Exit CR3 VMM setup shadow page table/ept table to trap MMIO access In shadow model, corresponding shadow L1 entry will be set nonpresent In EPT model, the EPT table will not setup the mapping for the MMIO range shadow CR3 L2 L1 MMIO Device Not present Shadow L2 Shadow L1 MMIO trap in shadow model 10

DMA Emulation Device Model State machine 0x70, w, 1 0x90, w, 2 Handlers Apps Native Driver Driver DMA target address VMM Device 2 1 3 HW 11

Interrupt Emulation CPU IDTR IDT Virtual CPU vidtr vidt PIC VM Entry Interrupt Injection PIRQA ICH Device Model Device API Virtual PIC INTA# Device Native Interrupt Injection Virtual Interrupt Injection 12

Case Study IDE sector read Driver allocate the PRDT Driver write register BMIDTPP to setup the PRDT address VMM trap the write operation and notify DM (Device Model) Device model map the guest memory pointed by PRDT Driver set the buffer address/length in the PRDT Device Model Buffer Buffer IDE Driver APPs PRDT Buffer Driver write register to set the target IDE vector VMM trap the write operation and notify DM DM keep this information in state machine Driver write the BMICX command register to trigger a DMA read VMM trap the write operation and notify DM DM translate the IDE sector to offset in the image file DM read corresponding data DM copy the data to guest s memory SATA Driver IRQ Emul VM Exit I/O IRQ Virtual Interrupt Hypervisor A interrupt is injected after DMA read DM inject a virtual IDE interrupt to guest OS acknowledge the interrupt and consume the data in the buffer SATA Disk 2 3 13

PCI Device Discovery & Configuration in Native PCI device is presented as (BUS, Device, Function) in PCI hierarchy Software enumerate devices through configuration space Configuration space access return 0xFF for non-exist device Device ID/Vendor ID/Revision ID to identify devices Device s Port IO/MMIO base address is configurable BAR (Base Address Register) in configuration space Dev0 F0 Host-PCI Bridge Dev 2 F0 PCI-PCI Bridge Dev2 F1 Dev3 F0 14

PCI Device Discovery in Software Emulation Device model emulates a virtual PCI hierarchy A host-pci bridge is emulated PCI device s configuration space access is emulated, like Device ID/Vendor ID for emulated device Port IO/MMIO BAR 15

Device Emulation Pros & Cons Pros Transparent to VM software stack Agnostic to physical device in the platform. Thus Legacy SW can still run, even after HW upgrade Smooth VM migration across different platforms Good physical device sharing Cons Un-optimum performance Cannot enjoy latest & greatest HW Lack of modern device emulation, since too complex Poor scalability Isolation and stability depends on implementation 16

Agenda Overview Software Emulation Para-virtualization Direct I/O SR-IOV 17

Overview Goal -- Improve I/O performance through SW approach VMM presents a specialized virtual device to VM A new & efficient interface between VMM & VM driver Usually high-level abstraction Requires a specialized driver in VM Tactics to further reduce overhead caused by VMM Batched I/O Shared memory Back-end Driver Driver Shared Memory Hyper call Apps Front-end Driver Notification Hypervisor Device 18

Case Study PV block device read FE (front-end) driver receives an OS standard read request FE grants the write permission of read buffer to the domain where BE (backend) driver is in FE produces a corresponding request (buffer s index, offset, length) in shared data structure, then notifies BE Upon notification, BE consumes the requests by mapping the remote domain s buffer, reassembling it in a OS standard format delivering downwards When the request is complete, BE callback function assembles the response in the shared data structure and notifies FE FE removes the buffer s granted permission and completes the read request finally Back-end Driver Driver Device Shared Memory Hyper call Apps Front-end Driver Notification Hypervisor 19

Para-virtualization Pros & Cons Pros Better performance Agnostic to physical device in the platform. Benefit smooth VM migration Cons Need install specialized driver in VM High CPU utilization for the I/O interface and memory copy Not so good scalability because of CPU utilization Isolation and stability depends on implementation 20

Agenda Overview Software Emulation Para-virtualization Direct I/O SR-IOV 21

Device I/O - Overview Assign a physical device to a VM directly VM access the device directly, w/ VMM intervention reduced to minimum Physical device access guest s memory directly with help of Intel s VT-d technology Interrupt will be injected to guest through hypervisor VMM emulates PCI configuration space (may through device model) Write Through Hypervisor Apps Native Driver I/O IRQ Virtual Interrupt Physical Interrupt 22

Case Study IDE sector read Driver allocate the PRDT Driver write register BMIDTPP to setup the PRDT address The access is passed-through to hardware Driver set the buffer address/length in the PRDT Driver write register to set the target IDE vector The access is passed-through to hardware Driver write the BMICX command register to trigger a DMA read The access is passed-through to hardware IDE disk perform DMA access to the memory through VT-d A interrupt is injected after DMA read DM inject a virtual IDE interrupt to guest OS acknowledge the interrupt and consume the data in the buffer Hypervisor Passthrough IDE Driver I/O IRQ APPs PRDT Buffer Interrupt 2 3 VT-d IDE disk 23

DMA Operation Driver in guest use guest physical address as DMA target Device issue DMA access using address setup by guest Also, driver in VM can set bogus address as DMA target Guest Pseudo Physical Memory Machine Physical Memory VM1 VM2 VM3 VM4 1 2 3 4 5 Hypervisor 3 1 5 2 4 24

Intel VT-d Technology DMA Remapping All DMA transaction is captured by chipset Intel VT-d technology provides DMA remapping, to translate. 25

Intel VT-d Technology DMA Remapping DMA Requests Device ID Virtual Address Length Fault Generation Bus 255 Bus N Bus 0 Dev 31, Func 7 Dev P, Func 2 Dev P, Func 1 Dev 0, Func 0 4KB Page Tables 4KB Page Frame DMA Remapping Engine Translation Cache Context Cache Device Assignment Structures Device D1 Device D2 Address Translation Structures Address Translation Structures Memory-resident Partitioning & Translation Structures 26

DMA Remapping Translation Flow Context entry points to the translation structure Multiple-level page table is used for address translation 0s 9 bit DMA address 9 bit 9 bit 12 bit Context Entry 27

DMA Remapping PCIe ATS PCIe ATS (Address Translation Services ) specification enable I/O devices to participate in DMA remapping Device can request a translation explicitly before DMA access A special flag to indicate if DMA access s target address is translated or not VT-d will only translate DMA request that isn t translated by device 28

I/O Access Pass-through Guest may change device s Port IO/MMIO base address Will not impact physical MMIO BAR through configuration space emulation Guest has different port IO/MMIO base address with physical one MMIO access pass-through For shadow mode, setup shadow page table according to physical address For EPT, setup the EPT table for the translation CR3 shadow CR3 L2 L1 MMIO Device Port I/O access is passthrough only if guest doesn t change Port IO BAR Otherwise need trap-andemulation Shadow L2 MMIO Device 29

Device Initial State Device should have clean state before assigned to guest To avoid information leak when device detach To avoid in-flight device action VMM need quiescent and reset a PCIe function before device assignment 30

Device Direct Assignment Pros & Cons Pros Near-to-native performance Minimum VMM intervention, thus low CPU utilization Good isolation Cons Exclusive device access PCI slots in system is limited. Thus not a very scalable solution. 31

Agenda I/O Virtualization Overview Device Emulation Device Para-virtualization Device Direct Assignment SR-IOV 32

SR-IOV - Overview A standard defined by PCI SIG, to provide virtualization friendly device Device can be shared by VMs, while still enjoying the benefits of device direct assignment 33

SR-IOV - Overview A PCI standard to provide virtualization friendly device Device can be shared by VMs, while still enjoying the benefits of device direct assignment Start with a single function device HW under the control of privileged software (VMM) Includes a SR-IOV capability Physical Function Replicate the resources needed by a VM MMIO for direct communication RID for tag DMA traffict Minimal configuration space Virtual Function (VF) 34

SR-IOV Runtime Operation Guest s run-time operation to VF is same to Device Direct assignment Same MMIO pass-through mechanism No Port IO supported in VF DMA through DMA remapping Interrupt emulation through hypervisor The only difference is device configuration How to enable the VF VF does not exists by default How to locate the VF s resource, like MMIO base address 35

Network Performance and Scalability with SR- IOV dom0% vm% BW dom0% vm% BW CPU % 1200 1000 800 600 400 200 0 10 20 40 60 10 8 6 4 2 0 Bandwidth ( Gbp s ) CPU % 1200 1000 800 600 400 200 0 10 20 40 60 10 8 6 4 2 0 Bandwidth ( Gbp s ) VM # VM # Xen network scalability with software virtualization Xen network scalability with Intel VT-c (SR-IOV) 36

SR-IOV - Summary Pros Good performance Good Scalability A cost effective I/O virtualization solution 37

Agenda I/O Virtualization Overview Device Emulation Device Para-virtualization Device Direct Assignment SR-IOV 38

Backup 39

Interrupt Isolation Challenges In existing interrupt architecture on x86 Platform Interrupt is an DMA write to a specific address (0xFEEX_XXXX) Interrupt attribute is contained in the data/address Device issue DMA write according to guest driver setup Guest can trigger arbitrary interrupt through bogus DMA access it is not page grained, DMA remapping can t cover this range 40

Intel VT-d Technology - Interrupt Remapping Interrupt remapping redefines the interrupt message format The DMA write include only an index to a remapping table The Remapping table is setup in memory by VMM and includes the interrupt attribute Interrupt Remapping Engine will translate the index to interrupt attribute VMM need intercept guest s setup to interrupt attribute (Message Interrupt setup) Vector Target, etc CPU Interrupt Remapping Engine Index Device search Interrupt Remapping Table vector, destination 0 n 41