GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017

1 GaaS Workload Characterization under NUMA Architecture for Virtualized GPU
Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li. Presented by Huixiang Chen.
ISPASS 2017, April 24, 2017, Santa Rosa, California
IDEAL (Intelligent Design of Efficient Architectures Laboratory), Department of Electrical and Computer Engineering, University of Florida

2 Talk Overview: 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS

3 Graphics-as-a-Service (GaaS): Cloud Gaming, Video Streaming, Virtual Desktop (VDI)

4 Graphics-as-a-Service (GaaS): GPU Virtualization!

5 GPU Virtualization: 1. API intercept 2. GPU pass-through 3. Shared virtualized GPU

6 GPU Virtualization: 1. API intercept 2. GPU pass-through 3. Virtualized GPU. Examples shown on the slide: Intel GVT-s, Intel GVT-d, Intel GVT-g, AMD FirePro, vCUDA, NVIDIA GPU pass-through, NVIDIA GRID.

7 NVIDIA GRID GPU Virtualization [Architecture diagram. Hypervisor side: XenServer hypervisor, NVIDIA GRID vGPU Manager, NVIDIA kernel driver, NVIDIA GPU management interface. Guest side: guest VMs running guest applications, each with a guest VM driver. Paths between them: a paravirtualized interface, CPU access, and a direct GPU access channel. GPU side: requests from VMs are served under timeshared scheduling by the 3D graphics engine, copy engine, video encoder, video decoder, and streaming engine; the GPU MMU holds VM1 and VM2 page tables, and the framebuffer is divided into VM1 FB and VM2 FB.]

8 GPU NUMA Issue [Two diagrams, "Unified Architecture" and "Discrete Architecture". Each shows Socket 0 and Socket 1 connected by QPI, with a CPU, cache controller, and last-level cache per socket, and one GPU per socket (GPU0, GPU1). In the discrete architecture the GPUs are attached over PCI Express.]

9 GPU NUMA Issue [Diagram: two sockets connected by QPI. Each socket has cores with L1/L2 caches, a core interconnect, an uncore with the LL cache, memory controller (MC), and PCIe interface, its own memory, and an attached GPU (GPU A, GPU B). An application and its I/O thread reach the GPU through either local access or remote access; the two paths are labeled "Ideal" and "Real case".]

10 Talk Overview: 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS

11 Experiment Setup
Platform configuration: 4U Supermicro server; XenServer 7.0; Intel QPI, 6.4 GT/s; NVIDIA GRID K2, 8 GB GDDR5, 225 W, PCIe 3.0 x16.
GRID K2: 2 physical GPUs.
[Table: vGPU type, frame buffer (MB), maximum vGPUs per GPU. Four K-series vGPU types are listed; the row values are not preserved.]

12 Workload Selection: Workloads and Metrics
GaaS workloads: Unigine-Heaven, Unigine-Valley, 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze). Performance metric: frames per second (FPS).
GPGPU workloads: Rodinia benchmark. Performance metric: execution time.
Local mapping: the guest VM's vCPUs are statically pinned to the local socket, the one close to the GPU.
Remote mapping: the vCPUs are statically pinned to the remote socket. (XenServer automatically places the VM's memory close to its CPU affinity.)
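The local and remote mappings above can be sketched as a choice of physical CPU list for the VM's vCPUs. This is a minimal illustration, not the authors' tooling: the helper names, the two-socket layout, and the assumption that CPUs are numbered contiguously per socket are all hypothetical, and a real host may number its CPUs differently.

```python
def cpus_for_socket(socket, cores_per_socket):
    """Return the physical CPU ids of one socket.

    Assumption: CPUs are numbered contiguously per socket
    (socket 0 owns 0..n-1, socket 1 owns n..2n-1).
    """
    start = socket * cores_per_socket
    return list(range(start, start + cores_per_socket))


def vcpu_mask(gpu_socket, mapping, cores_per_socket=8, num_sockets=2):
    """Build a comma-separated CPU list for a 'local' or 'remote' mapping.

    'local'  pins the vCPUs to the socket the GPU is attached to.
    'remote' pins them to the other socket (two-socket host assumed).
    """
    if mapping not in ("local", "remote"):
        raise ValueError("mapping must be 'local' or 'remote'")
    if mapping == "local":
        socket = gpu_socket
    else:
        socket = (gpu_socket + 1) % num_sockets
    return ",".join(str(cpu) for cpu in cpus_for_socket(socket, cores_per_socket))


if __name__ == "__main__":
    # Hypothetical host: 2 sockets, 4 cores each, GPU attached to socket 0.
    print("local :", vcpu_mask(0, "local", cores_per_socket=4))
    print("remote:", vcpu_mask(0, "remote", cores_per_socket=4))
```

The resulting list is the kind of value a hypervisor's vCPU affinity setting takes; how it is applied on XenServer is outside this sketch.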

13 Talk Overview: 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS

14 NUMA Transfer Bandwidth [Figure: four panels plotting bandwidth (MB/s) against transfer size (KB range up to 64 MB), each comparing local with remote placement: CPU to GPU with pinned memory (LocalHtoD vs. RemoteHtoD); GPU to CPU with pinned memory (LocalDtoH vs. RemoteDtoH); CPU to GPU with pageable memory (local vs. remote host-to-device); GPU to CPU with pageable memory (LocalDtoH vs. RemoteDtoH).]

15 NUMA Transfer Bandwidth
Pinned memory: 10% NUMA overhead when writing data to the GPU and 20% when reading data back from the GPU.
Pageable memory: close to zero NUMA overhead when writing and 50% when reading data back from the GPU.
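One way to read these percentages is as the relative bandwidth lost when the transfer crosses sockets. The sketch below assumes that definition (the slide does not state its formula), and the bandwidth values are illustrative placeholders, not measurements from the talk.

```python
def numa_overhead(local_bw_mbps, remote_bw_mbps):
    """Fraction of local bandwidth lost when the transfer is remote.

    Assumed definition: (local - remote) / local.
    """
    if local_bw_mbps <= 0:
        raise ValueError("local bandwidth must be positive")
    return (local_bw_mbps - remote_bw_mbps) / local_bw_mbps


if __name__ == "__main__":
    # Illustrative numbers only (MB/s), chosen to land on round percentages.
    samples = {
        "pinned, HtoD": (1000.0, 900.0),
        "pinned, DtoH": (1000.0, 800.0),
        "pageable, DtoH": (1000.0, 500.0),
    }
    for name, (local, remote) in samples.items():
        print(f"{name}: {numa_overhead(local, remote):.0%} overhead")
```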

16 NUMA Performance Difference: GPGPU Workloads
Note (partly garbled in the source): only ... can be configured using K2 for CUDA programs.
[Figure: normalized execution time, local vs. remote, for streamcluster, srad_v2, backprop, bfs, b+tree, gaussian, heartwall, nn, pathfinder, mummergpu, and dwt2d. A second panel breaks the execution time of the same benchmarks (0% to 100%) into Kernel and CPU+Other.]
Remarks: Among the GPGPU workloads, streamcluster, srad_v2, and backprop stand out. A further breakdown shows that the more time a GPGPU workload spends on CPU-GPU communication, the higher its NUMA overhead.
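The remark about communication time can be illustrated with a simple serial model: if only the CPU-GPU communication part of a run slows down under remote mapping, the normalized remote execution time grows linearly with the communication fraction. This is a hedged sketch under that assumption; the function name, the model, and the example fractions are illustrative and do not come from the talk.

```python
def normalized_remote_time(comm_fraction, comm_slowdown):
    """Remote execution time normalized to local execution time.

    Assumed model: local time = kernel + communication (no overlap), and
    remote mapping stretches only the communication part by
    (1 + comm_slowdown). The result is 1 + comm_fraction * comm_slowdown.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be within [0, 1]")
    kernel_part = 1.0 - comm_fraction
    comm_part = comm_fraction * (1.0 + comm_slowdown)
    return kernel_part + comm_part


if __name__ == "__main__":
    # Illustrative fractions: a communication-heavy and a kernel-heavy workload.
    for label, fraction in (("communication-heavy", 0.5), ("kernel-heavy", 0.05)):
        print(label, round(normalized_remote_time(fraction, 0.2), 3))
```

Under this model a workload that spends half of its time on transfers pays ten times the NUMA penalty of one that spends 5%, which matches the direction of the trend reported on the slide.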

17 NUMA Performance Difference: GaaS Workloads [Figure: FPS under local vs. remote mapping for the 3DMark scenes (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze), Unigine-Heaven, and Unigine-Valley. The x-axis groups are K240 (2 VMs and 4 VMs), K260 (2 VMs), and K280.] Remark: GaaS workloads show little NUMA overhead.

18 GaaS Overhead Analysis Cont. (1) [Figure: GPU queue timelines. GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley) each show 1. a GPU compute queue doing 3D graphics processing and 2. a copy queue carrying copies between CPU and GPU. GPGPU workloads (streamcluster, backprop, srad_v2, heartwall) each show 1. a GPU compute queue and 2. a copy queue carrying copies between CPU and GPU.]

19 GaaS Overhead Analysis Cont. (1) 1. For GaaS workloads, most memory copy operations between CPU and GPU overlap with graphics processing. GPGPU workloads differ: little overlap occurs. 2. Communication time is trivial compared with GPU computing in the graphics queue, which clearly shows that GaaS workloads are GPU-computation intensive.
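The overlap described above can be quantified from queue timelines as the share of copy time that coincides with GPU compute. The sketch below assumes the timelines are available as lists of (start, end) intervals; the function names and the sample intervals are hypothetical and are not taken from the authors' traces.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlapped_copy_fraction(copy_intervals, compute_intervals):
    """Fraction of total copy time that runs while the GPU is computing."""
    copies = merge_intervals(copy_intervals)
    computes = merge_intervals(compute_intervals)
    total_copy = sum(end - start for start, end in copies)
    if total_copy == 0:
        return 0.0
    hidden = 0.0
    for copy_start, copy_end in copies:
        for comp_start, comp_end in computes:
            # Length of the intersection, or zero if the intervals are disjoint.
            hidden += max(0.0, min(copy_end, comp_end) - max(copy_start, comp_start))
    return hidden / total_copy


if __name__ == "__main__":
    # Hypothetical timelines in milliseconds.
    graphics_like = overlapped_copy_fraction([(1, 2), (3, 4)], [(0, 10)])
    gpgpu_like = overlapped_copy_fraction([(0, 2), (8, 10)], [(2, 8)])
    print("graphics-like overlap:", graphics_like)
    print("GPGPU-like overlap:", gpgpu_like)
```

A fraction near 1 means the copies are hidden behind compute, so a slower remote copy has little visible cost; a fraction near 0 means every extra microsecond of copy time adds to the run.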

20 GaaS Overhead Analysis Cont. (2) [Figure: GPU compute and copy queue timelines for the GaaS workloads 3DMark, Unigine-Heaven, and Unigine-Valley, shown next to heartwall, whose copy queue carries cudaMemcpy (HtoD) and cudaMemcpy (DtoH) operations.]

21 GaaS Overhead Analysis Cont. (2) GaaS workloads incur more real-time processing than GPGPU workloads. This behavior makes it easier for memory transfers to overlap with GPU computing.

22 Influence of CPU Uncore [Figure: normalized L3 miss rate of VM2, VM3, and VM4 for 3DMark, Unigine-Heaven, and Unigine-Valley, comparing 4 VMs on the same socket with 4 VMs on separate sockets.] Remark: The CPU uncore has little performance influence on GPU NUMA for GaaS.

23 Talk Overview: 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS

24 DVFS-CPU [Figure: FPS for RP, FF, CF, DF, UH, and UV under the Performance, Ondemand, and Powersave CPU governors, plus power (watts) over time (s) for 3DMark, Unigine-Heaven, and Unigine-Valley under the same three governors.] Remarks: Ondemand CPU frequency scaling achieves the best tradeoff between performance and energy for GaaS.
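A performance-versus-energy tradeoff of this kind can be made concrete as energy per rendered frame: average power divided by FPS, compared across governors subject to a minimum acceptable frame rate. The sketch below assumes that metric, which the slide does not spell out, and its power and FPS values are invented for illustration rather than read from the figure.

```python
def energy_per_frame(avg_power_watts, fps):
    """Joules spent per rendered frame (watts divided by frames per second)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return avg_power_watts / fps


def best_governor(measurements, min_fps):
    """Pick the governor with the lowest energy per frame above an FPS floor.

    measurements maps a governor name to (average power in watts, FPS).
    Returns None when no governor reaches min_fps.
    """
    candidates = {
        name: energy_per_frame(power, fps)
        for name, (power, fps) in measurements.items()
        if fps >= min_fps
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Invented example values, not data from the talk.
    example = {
        "performance": (300.0, 60.0),
        "ondemand": (270.0, 59.0),
        "powersave": (240.0, 40.0),
    }
    print(best_governor(example, min_fps=50.0))
```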

25 DVFS-GPU [Figure: FPS for RP, FF, CF, DF, UH, and UV at high and low GPU core clocks (axis labels 745 and 575 MHz) and at high and low GPU memory clocks (axis labels 1250 and 750 MHz).] Remarks: The GPU memory frequency can be lowered within a certain range to save energy with little performance degradation for GaaS.

26 Conclusions
In this work, we characterize virtual GPU workloads on XenServer. We found no NUMA overhead for GaaS workloads, because most memory copy operations overlap with GPU computation.
GaaS workloads behave differently from GPGPU workloads.
Ondemand CPU frequency scaling achieves the best tradeoff between performance and energy for GaaS.
The GPU memory clock can be lowered within a certain range to save energy for GaaS.

27 Thanks For Your Attention!
