Re-architecting Virtualization in Heterogeneous Multicore Systems

Size: px

Start display at page:

Download "Re-architecting Virtualization in Heterogeneous Multicore Systems"

Juniper Newton
5 years ago
Views:

1 Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College of Computing, School of ECE Georgia Institute of Technology 1

2 Overview Addressing Virtualization in Heterogeneous Multicores Functional Approach V-Core Approach Key Aspects High performance Scalability Enhanced functionality Example: Functional Approach Self-Virtualized NIC (SV-NIC) using IXP NP Ongoing Work: V-Core Applying V-Core approach to IXP NP, Cell, N-VIDIA GPU and Virtex FPGAs Research Challenges 2

3 Functional Approach Use Heterogeneous Core to Implement Certain Virtual Platform Related Functionality Guest VMs Oblivious to type of Core(s) Example: Self-Virtualized I/O Devices Offload I/O virtualization to smart peripherals Componentization, Functional Partitioning of Cores (Utilize processing near physical device for fast communication), Optimal Resource Placement (Efficiently utilize the I/O interconnect between device and host cores), Minimization of Host Interactions in I/O Path, Utilize Specialized Cores Functionalities Management of virtual devices I/O Creation/Removal/Reconfiguration Between guest VMs and physical device via scalable multiplexing/demultiplexing of virtual devices 3

4 Smart Device ENP2611 Source: Radisys 4

5 SV-NIC Host Controller Domain Guest 0 Guest 1 Mgmt VIF0 VIF1 Hypervisor PCI Communication Mgmt VIFs ENP2611 Network 5

6 SV-NIC HV Interaction Example 6

7 HV-NIC Host Driver Domain Guest 0 Guest 1 VIF0 VIF1 Hypervisor PCI Communication ENP2611 Network 7

8 Implementation Details PCI Implications Asymmetric performance Asymmetric resource access S/W I/O MMU for I/O translation Signaling Via PCI interrupt Virtualization via Interrupt Identifier Via asynchronous messaging Sidecore 8

9 Experimental Results Latency 9

10 Experimental Results Throughput 10

11 V-Core Approach Provide Virtual Heterogeneous Cores to Guest VMs Extensions to Hypervisor and Management Domain Efficient co-scheduling of accelerators at hypervisor Resource allocation and VM creation with admission control at dom0 Virtualized Accelerator Abstraction Layer (VAAL) Enable interaction between VMs and devices Setup corresponding control and data path 11

12 V-Core Implementation Architecture V-Core Management Extension DOM 0 Existing Drivers Accelerator Driver Domain (e.g. Cell, IXP NP) VAAL Setup Module Accelerator Driver Domain (e.g. GPU, FPGA) VAAL Backend VIF Guest VM VAAL Access Module Xen Hypervisor with V-Core Scheduling Extension G-CPU1 G-CPU2 G-CPU3 G-CPU4 IBM Cell VAAL Backend on PPE App threads on SPE NVidia GPU CUDA Management Hardware App threads 12

13 VM Execution Model Programming Model Source Program Control Thread Memory Accelerator based Code Segment compiled for specific device/driver combination Stream Compilation Environment Stream Elements Kernel Kernel Configuration of the Machine Model Multicore processor 1 Accelerator 1 Cache Local Memory ACC DMA FIFO System Architecture Description Architecture description specifies configuration of accelerators and processors & communicates QoS requirements 13

14 Virtualization With Xen: Some Initial Thoughts Separate domains for applications, work queues, accelerator controllers XEN maps physical accelerator resources to controllers XEN provides virtual interfaces for inter-domain communication Application 1 API Application 2 API Work Queue Completion Queue 8800GTX Controller Cell PCIe Card Controller XEN NVIDIA 8800 GTX Cell PCIe Card 14

15 V-Core API Guest VM/Application API Work queue Completion queue Dependence checking and synchronization Accelerator Controller API NVIDIA, FPGA, Cell, IXP NP 15

16 Research Challenges System Definition Language Describe the desired target VM Communicate QoS requirements System Partitioning & Allocation Allocate physical resources for a VM Strategies to reduce the complexity of the problem Scheduling Co-scheduling accelerators in a VM Static and run-time schedulers: real-time vs. best effort Space vs. time-sharing among VMs, based on capabilities Applications 16

17 Thank You 17

18 Extra Slides 18

19 Cost of I/O Virtualization Source: Menon et al, VEE'05 19

20 Functionalities Breakdown Virtual Interface Management Jointly done by host CPU and XScale core API for Creating new VIFs Removing existing VIFs Configuration changes Network I/O Jointly done by host CPU and IXP micro-engines 20

21 Experimental Results Host Configuration Dell PE2650 server, 2-way HT Xeon 2.8GHz 2GB RAM Xen 3.0-unstable Linux para-virtualized kernel for Dom0 and DomU BVT scheduling ENP2611 Configuration 256 MB SDRAM, 8MB SRAM Linux kernel 21

22 PCI Asymmetry 22

23 Micro-Benchmarks Egress (SV-NIC) 23

24 Micro-Benchmark Ingress (SV-NIC) Pkt size = 64 bytes 24

25 Microbenchmarks 25

26 Micro-Benchmark Host Latency (SV-NIC) Num VIFS Interrupt Virtualization Cost uS uS uS 26

27 Virtual Interrupt Identifier 27

28 Core-Core Communication Latency 28

29 IXP2400 Network Processor Source: IXP2400 H/W Ref. Manual 29

30 Sidecore Utilize a Core in Appropriate State Better cache utilization Reduced or no cost for processor state changes Specialized processing SV-NIC + Sidecore Limitations of Interrupt Identifier May result in redundant signaling Signal virtualization via dedicated host core No interrupts from device - Polling 30

31 V-Core Issues Some accelerators (NVIDIA CUDA) do not support resource sharing An accelerators consist of both a device and driver Some accelerators do not support sufficiently powerful, programmable back-ends, e.g., some FPGA devices 31

32 VAAL VAAL Front-End Interface to devices visible at the application layer Enables transmission of executable from the application to the accelerators for execution, e.g., thread binary VAAL Back-End Functions needed to virtualize devices, such as memory isolation and kernel to hardware mapping May execute on the device or within an accelerator driver domain depending on strength of the device 32

33 Example: NVIDIA CUDA Memory Control Thread Inputs Outputs Inputs Outputs Kernel Kernel Describe kernels by function, inputs, and outputs Treat kernels as schedulable entities Draw upon insights from OOO superscalar execution 33

34 Run-Time Scheduler: Work Queue input kernel output input kernel input output kernel output #3 #2 #1 input kernel output input kernel output One queue per accelerator type? 34

35 Accelerator Controllers Pull Kernels From the Work Queue Runtime DRIVER/CUDA NVIDIA 8800GTX #3 #2 #1 Runtime DRIVER/API CELL PCIe Card Runtime OS/API CELL Blade 35

36 Upon Completion, Results Are Pushed Into A Completion Queue Runtime DRIVER/CUDA NVIDIA 8800GTX Runtime #3 outputs #1 outputs #2 outputs DRIVER/API CELL PCIe Card Runtime OS/API CELL Blade 36

37 HVM Reorder Results to Avoid Hazards Commit #1 #3 outputs #1 outputs #2 outputs Commit #2 Commit #3 Forward outputs from the completion queue to inputs in the work queue as needed 37

38 HVM Applications Register Kernels With Accelerator Controllers At Runtime FFT/Inputs/Outputs/Acc/Hash DCT/Inputs/Outputs/Acc/Hash Filter/Inputs/Outputs/Acc/Hash Window/Inputs/Outputs/Acc/Hash Runtime DRIVER/CUDA NVIDIA 8800GTX MatrixMul/Inputs/Outputs/Acc/Hash Convolve/Inputs/Outputs/Acc/Hash Filter/Inputs/Outputs/Acc/Hash Runtime OS/API CELL Blade Window/Inputs/Outputs/Acc/Hash 38

GViM: GPU-accelerated Virtual Machines

GViM: GPU-accelerated Virtual Machines Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvardhan Kharche @ Georgia Tech Niraj Tolia, Vanish Talwar, Partha Ranganathan @ HP Labs Trends in Processor