Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
1 Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
Michela Becchi, University of Missouri, nps.missouri.edu
2 GPUs in Clusters & Clouds
Many-core GPUs are used in supercomputers
- 3 out of the top 10 supercomputers use GPUs
- Titan: > 20 petaflops, > 700 terabytes of memory
- 18,688 nodes, each with a 16-core AMD CPU and 1 Nvidia Tesla K20 GPU
Many-core GPUs are used in cloud computing
3 Different usage paradigms
Accelerator model:
- 1 application
- GPU: dedicated resource
- Explicit procurement of GPUs
- Static (or programmer-defined) binding of application to GPUs
- Intra-application scheduling
- Memory management within the application
Cluster/cloud model:
- Multi-tenancy
- GPU: shared resource
- Resource virtualization & transparency
- Dynamic (or runtime) binding of applications to GPUs, for better resource utilization and load balancing
- Intra- and inter-application scheduling
- Advanced memory management across applications required
4 Context
[Figure: GPU-accelerated applications deployed across shared cluster nodes: AMBER, GROMACS, NAMD, GPUBlast, LAMMPS]
5 We have designed a runtime that:
- Abstracts GPUs from end-users
- Schedules applications on GPUs
- Dynamically binds applications to GPUs
- Allows GPU sharing
- Provides memory management
- Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade
6 Deployment scenarios
With cluster-level schedulers (e.g., TORQUE, SLURM):
- The cluster-level scheduler dispatches CUDA applications to the nodes
- On each node, every CUDA application loads an intercept library; our runtime sits between the applications and the OS with the CUDA GPU driver/runtime, which manages GPUs 1 through n
With VM-based systems for cloud computing (e.g., Eucalyptus):
- Each VM runs CUDA applications with the intercept library on top of its guest OS
- Our runtime runs on the host OS, below the VM manager and above the CUDA GPU driver/runtime, which manages GPUs 1 through n
A sketch of such an intercept library appears after this list.
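The slides do not show the intercept library's code. As a hedged illustration, this minimal sketch (an assumed design, not the authors' implementation) interposes on cudaMalloc via LD_PRELOAD; the logging call stands in for the point where a request could be forwarded to a node-level runtime before the real CUDA runtime is invoked.

```c
/* Minimal sketch of a CUDA API intercept library (assumed design).
 * Build as a shared object and activate with LD_PRELOAD so unmodified
 * CUDA applications pass through it. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime_api.h>

/* Pointer to the real cudaMalloc, resolved lazily from the CUDA runtime. */
static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;

cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    if (!real_cudaMalloc)
        real_cudaMalloc =
            (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    /* Hypothetical hook: here the shim would notify the node-level runtime
     * so it can account for GPU memory and defer or redirect the request. */
    fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);

    return real_cudaMalloc(devPtr, size);
}
```

Compiled with something like `gcc -shared -fPIC intercept.c -o libintercept.so -ldl` and preloaded, such a shim lets the runtime manage applications without modifying or recompiling them.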
7 GPU sharing
Inter-kernel sharing [HPDC 11]
- When: GPU underutilized within a kernel
- Why: limited parallelism, small datasets
- How: kernel consolidation across applications
Inter-application sharing [HPDC 12]
- When: GPU underutilized within an application
- Why: long CPU phases
- How: application multiplexing on the GPU
[Timelines: consolidated kernels k1 and k2 sharing the GPU; app1 and app2 alternating between CPU and GPU phases over time]
8 GPU sharing
Multi-process application sharing [HPDC 13]
- When: GPU underutilized by multi-process applications (e.g., MPI)
- Why: synchronization leads to intra- & inter-application imbalance
- How: preempt some inactive processes to allow other processes to progress
[Timelines: processes A0, A1, B0, B1 on GPU 0 and GPU 1 over time, before and after preemptive sharing]
9 Inter-kernel sharing [HPDC 11]
Each application goes through malloc (m), host-to-device copy (c HD), kernel (k), device-to-host copy (c DH), and free (f).
- Serialized execution: app1's sequence (m, c HD, k1, c DH, f) runs before app2's (m, c HD, k2, c DH, f)
- Inter-kernel sharing: both mallocs and host-to-device copies are performed, then k1 and k2 are combined into a single kernel, followed by the two device-to-host copies and frees
app1 and app2 have no conflicting memory requirements. A stream-based intuition sketch follows.
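The consolidation mechanism itself is not shown on the slide. As a rough intuition for why two small kernels can share one GPU, this standalone CUDA sketch (streams-based, explicitly not the paper's kernel-consolidation technique) overlaps two under-sized kernels on the same device.

```c
// Illustrative only: two small kernels from independent work overlapped on
// one GPU via CUDA streams. The paper instead consolidates kernels across
// applications into a single launch; this just shows why a GPU left
// underutilized by one small kernel can absorb another.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void k1(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void k2(float *b) { b[threadIdx.x] *= 2.0f; }

int main(void)
{
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each kernel uses only one thread block, far fewer than the GPU's SMs,
    // so the two launches can execute concurrently.
    k1<<<1, 256, 0, s1>>>(a);
    k2<<<1, 256, 0, s2>>>(b);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}
```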
10 Space- vs. time-sharing: some results
[Chart: relative throughput benefit of space-sharing and time-sharing across four batches on GPU1 and GPU2, for workload mixes BS+KM, BO+KNN, PDE+MD, EU+IP, BS+BO, KM+KNN, BO+EU, BS+MD]
11 Idea: Molding
Downgrade the execution configuration of kernels so as to force beneficial sharing
- Penalize a single application to improve overall throughput
- Limiting the number of blocks forces space sharing
- Limiting the number of threads per block forces time sharing with interleaved execution
Example: kernel 1 (blocks b11..b14) downgraded to 3 blocks and kernel 2 (blocks b21..b24) downgraded to 2 blocks can space-share the GPU after molding. A kernel-side sketch of molding follows.
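The slide describes molding abstractly. The following CUDA sketch shows one plausible way a kernel could tolerate a downgraded block count, using a grid-stride loop over logical blocks; this is my assumption of the mechanics, not the authors' implementation.

```c
/* Sketch of "molding" (assumed, simplified): instead of launching a kernel
 * with its natural grid of N blocks, the runtime launches fewer blocks and
 * each physical block loops over several logical block indices, freeing SMs
 * for a co-scheduled kernel. */
__global__ void molded_kernel(float *data, int n, int logical_blocks)
{
    /* Each physical block processes several logical blocks. */
    for (int lb = blockIdx.x; lb < logical_blocks; lb += gridDim.x) {
        int i = lb * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;    /* placeholder per-element work */
    }
}

/* Natural launch:
 *   molded_kernel<<<logical_blocks, 256>>>(d_data, n, logical_blocks);
 * Molded launch, downgraded to 3 blocks to force space-sharing:
 *   molded_kernel<<<3, 256>>>(d_data, n, logical_blocks);
 */
```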
12 Molding: some results
[Chart: relative throughput benefit with and without molding on GPU1 and GPU2, for forced time-sharing (IP+BS, PDE+MD) and forced space-sharing (IP+BO, BS+KM) workload mixes]
Molding can improve overall throughput despite penalizing single applications
13 Inter-application sharing [HPDC 12]
- app1: malloc, host-to-device copy, kernel k11, CPU phase, kernel k12, device-to-host copy, free, CPU phase
- app2: CPU phase, malloc, host-to-device copy, kernel k21, device-to-host copy, free, CPU phase, malloc, host-to-device copy, kernel k22, CPU phase, kernel k23, device-to-host copy, free
[Timelines: serialized execution; GPU/CPU sharing without conflicting memory requirements; GPU/CPU sharing with conflicting memory requirements, which adds GPU-to-CPU and CPU-to-GPU transfers for app1]
14 Our runtime: node-level view
Applications app1 .. appN each load the intercept library and talk to our runtime, which contains:
- Memory manager: virtual memory handling, page table, swap area
- Connection manager and offload control: waiting, assigned, and failed contexts; node-to-node offloading
- Dispatcher: scheduling, GPU binary registration
- Virtual GPUs (vgpu 11 .. vgpu nk): an abstraction layer over the CUDA driver/runtime and physical GPUs 1..n that enables GPU sharing
A sketch of the per-node bookkeeping these components imply follows.
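As a hedged illustration of the bookkeeping this architecture implies, the following C sketch (all names and fields are assumptions, not the authors' data structures) models contexts moving among waiting, assigned, and failed lists, plus virtual GPUs backed by physical GPUs.

```c
/* Hypothetical sketch of the node-level runtime's bookkeeping. */
#include <stddef.h>

typedef enum { CTX_WAITING, CTX_ASSIGNED, CTX_FAILED } ctx_state_t;

typedef struct context {
    int             app_id;    /* owning application */
    ctx_state_t     state;     /* waiting / assigned / failed */
    int             vgpu_id;   /* virtual GPU it is bound to, -1 if none */
    struct context *next;      /* intrusive list link used by the dispatcher */
} context_t;

typedef struct {
    int        gpu_id;         /* physical GPU backing this virtual GPU */
    context_t *current;        /* context currently using it, if any */
} vgpu_t;

typedef struct {
    context_t *waiting, *assigned, *failed;  /* dispatcher queues */
    vgpu_t    *vgpus;                        /* n GPUs x k vGPUs each */
    size_t     num_vgpus;
} runtime_state_t;
```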
15 Mapping and scheduling (FCFS)
[Figure: app1 (threads t1, t2), app2 (t1..t3), and app3 (t1..t3) issue contexts c11..c33 through the front-end library; the dispatcher holds them in the waiting-context list before binding them to virtual GPUs vgpu11..vgpu32 backed by GPUs 1..3]
16 Mapping and scheduling (FCFS)
[Figure: the dispatcher moves contexts from the waiting list to the assigned list and binds them to virtual GPUs in FCFS order]
- Hardware configuration and application-GPU mapping are abstracted from end-users
- Time-sharing of GPUs
17 Delayed binding
app1 issues malloc, copyHD, copyHD, kernel, copyDH; app2 issues malloc, copyHD, kernel, copyDH. The memory manager records their data (d1, d2) in the page table and swap area before any GPU is chosen.
- Deferring the application-GPU mapping enables better scheduling decisions
- GPU memory is allocated only when needed
- The memory manager in the runtime enables dynamic binding
A sketch of such deferred allocation at the intercept layer follows.
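A minimal sketch of what delayed binding could look like at the intercept layer, assuming, as the slide suggests, that allocations and transfers are recorded in a page table and a host-side swap area instead of touching the GPU immediately; names such as rt_malloc and pte_t are illustrative, not the authors' API.

```c
/* Sketch of delayed binding (assumed behavior): cudaMalloc is not forwarded
 * to the GPU; the runtime only records a page-table entry and allocates a
 * host-side swap buffer. The real device allocation and transfer happen
 * lazily at kernel launch, once a GPU has been chosen. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  *swap;    /* host-side staging copy of the data */
    void  *dev;     /* device pointer, NULL until bound to a GPU */
    size_t size;
    int    dirty;   /* swap copy newer than device copy */
} pte_t;

/* Intercepted "malloc": create a PTE and a swap buffer, no GPU touched yet. */
pte_t *rt_malloc(size_t size)
{
    pte_t *p = calloc(1, sizeof(*p));
    p->swap = malloc(size);
    p->size = size;
    return p;
}

/* Intercepted host-to-device "copy": stage the data into the swap area. */
void rt_copy_hd(pte_t *p, const void *host_src)
{
    memcpy(p->swap, host_src, p->size);
    p->dirty = 1;
}
```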
18 Dynamic binding & swapping
[Figure: app1..app4 each issue malloc, copyHD, and kernel calls; when GPU 1 fills up, the memory manager swaps app1's data (d1) out to the swap area so that app4's data (d4) can be placed, while contexts are rebound across vgpu11..vgpu22 on GPU 1 and GPU 2]
- GPU sharing among applications with conflicting memory requirements
- Migration of applications from slower to faster GPUs
- High availability in case of GPU failure
- Load balancing in case of GPU upgrade/downgrade
A swap-out sketch follows.
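A sketch of the swap-out step under the same assumptions as the earlier pte_t sketch: when a GPU fills up, a victim buffer is copied back to its host-side swap area and freed, so another application's data can be placed, or so the context can later be rebound to a different GPU. The function name is illustrative.

```c
#include <cuda_runtime_api.h>
#include <stddef.h>

/* Same shape as the pte_t in the delayed-binding sketch. */
typedef struct { void *swap, *dev; size_t size; int dirty; } pte_t;

int rt_swap_out(pte_t *victim)
{
    if (victim->dev == NULL)
        return 0;                           /* nothing resident on the GPU */

    /* Preserve the device copy in the host swap area before releasing it. */
    if (cudaMemcpy(victim->swap, victim->dev, victim->size,
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    cudaFree(victim->dev);
    victim->dev   = NULL;
    victim->dirty = 1;    /* the swap copy is now the authoritative one */
    return 0;
}
```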
19 Experiments: sharing & swapping
[Chart: total execution time (sec) vs. fraction of CPU code, serialized execution (1 vgpu) vs. GPU sharing (4 vgpus)]
- 2 Tesla C2050 and 1 Tesla C1060 GPUs
- 36 matmul jobs with 5 kernel calls each and varying CPU phases
- Sharing increases performance by hiding CPU phases
20 Experiments: cluster with TORQUE
[Chart: total and average execution time (sec) for 16, 32, and 48 jobs under serialized execution, GPU sharing (4 vgpus), and GPU sharing + load balancing]
- 3-node cluster with 2 GPU nodes (2 Tesla C2050s and 1 C1060)
- >2x performance improvement due to sharing, a further 20% due to offloading
21 Load imbalance in multi-process applications [HPDC 13]
Causes of load imbalance:
- Intrinsic load imbalance (intra-application)
- Different GPU capabilities (intra-application)
- Unmatched number of GPUs and processes (inter-application)
- Synchronization among processes
22 Intra-application Imbalance
23 Inter-application Imbalance
24 Preemption policies
Maximum idle time-driven preemption
- Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time
- Pros: easy implementation
- Cons: the maximum idle time parameter must be tuned
Synchronization call-driven preemption
- Preempt a context (process) whenever a collective communication or synchronization call is serviced
- Pros: no parameters to set
- Cons: either bookkeeping is required (complex implementation, overhead), or unnecessary preemptions occur (e.g., when the last process enters the synchronization point)
A sketch of the idle-time-driven policy follows.
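A minimal sketch of the maximum idle time-driven policy, assuming the dispatcher timestamps each context's last GPU activity; the threshold constant is exactly the parameter the slide lists as a drawback, and all names are illustrative.

```c
/* Sketch of maximum idle time-driven preemption (assumed logic): a periodic
 * check preempts any assigned context that has been idle longer than a
 * configurable threshold, so that a waiting context can take its vGPU. */
#include <time.h>

#define MAX_IDLE_SECONDS 2.0   /* tunable parameter, the policy's main drawback */

typedef struct {
    time_t last_gpu_activity;  /* updated on every kernel launch / memcpy */
    int    preempted;
} sched_ctx_t;

/* Called periodically by the dispatcher for every assigned context. */
void check_idle_preemption(sched_ctx_t *ctx, int waiting_contexts)
{
    double idle = difftime(time(NULL), ctx->last_gpu_activity);

    /* Only preempt if another context is actually waiting for the GPU. */
    if (waiting_contexts > 0 && idle > MAX_IDLE_SECONDS && !ctx->preempted) {
        ctx->preempted = 1;    /* swap its data out and release the vGPU */
    }
}
```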
25 Experiment: node-level (intra-application imbalance)
[Chart: overall execution time (seconds) vs. percentage imbalance, for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- The batch scheduler fails to capture intra-application imbalance
- N-way sharing hides CPU execution behind the GPU execution phases of co-located processes
- Combining 2-way sharing and preemption further improves performance
26 Experiment: node-level (inter-application imbalance)
[Chart: overall execution time (seconds) for workload compositions x[4] + 1x[3], 2x[4] + 2x[3], 1x[4] + 3x[3], and 4x[3], under batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- The batch scheduler causes GPU underutilization, leading to performance loss
- N-way sharing provides improvement only if the imbalance is high
- Preemptive sharing corrects the imbalance and leads to performance improvement
27 Experiment: cluster-level
[Chart: overall execution time (seconds) for different numbers of processes per application (including 6 and 8) under batch scheduling, 4-way sharing, and preemptive 2-way sharing]
- 2 nodes with 7 GPUs
- The batch scheduler is unable to schedule jobs with more processes than GPUs
- 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvement, respectively
28 Conclusion
Node-level runtime providing:
- GPU virtualization (manageability)
- GPU sharing (utilization, latency hiding)
- Flexible scheduling (configurability)
- Dynamic binding & preemption (utilization, latency hiding)
What lies ahead:
- Integration with cluster-level schedulers
- Dynamic scheduling at the cluster level
- Power efficiency considerations
29 Thanks
My coauthors (MU, AMD, NEC), and you all for your attention!
30 Understanding GPU resource utilization (cont'd)
If a kernel's number of thread blocks is smaller than the number of SMs, some SMs sit idle (SM underutilization), and a co-scheduled kernel can fill them:
- Time sharing: thread blocks of k1 and k2 interleave on the same SMs over time
- Space sharing: thread blocks of k1 and k2 occupy different SMs, keeping all SMs busy
If co-scheduled thread blocks have no conflicting register/shared-memory requirements, their execution is interleaved and latencies are hidden (best case); otherwise thread-block execution is serialized (worst case).
A small decision sketch follows.
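A small decision sketch of the logic above, with illustrative names and thresholds that are not from the paper: two kernels space-share when their combined thread blocks fit on the available SMs, and otherwise fall back to time-sharing.

```c
/* Sketch of the sharing decision illustrated on the slide (assumed logic). */
typedef enum { SPACE_SHARING, TIME_SHARING } sharing_mode_t;

sharing_mode_t pick_sharing_mode(int blocks_k1, int blocks_k2, int num_sms)
{
    /* If one kernel leaves SMs idle and the other kernel's blocks fit
     * alongside it, the two kernels can space-share the GPU. */
    if (blocks_k1 < num_sms && blocks_k1 + blocks_k2 <= num_sms)
        return SPACE_SHARING;

    /* Otherwise interleave (or serialize) their thread blocks over time. */
    return TIME_SHARING;
}
```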
31 Experiments: runtime overhead
[Chart: total execution time (sec) vs. number of jobs, for the bare CUDA runtime and for 1, 2, 4, and 8 vgpus]
- 1 Tesla C2050 GPU, short-running jobs
- Overhead < 10% in the worst case, and amortized through GPU sharing
32 HPDC 13
From single-process, single-threaded applications to multi-process/multi-threaded applications
- Challenge: synchronizations (e.g., barrier synchronizations, communication primitives) can introduce GPU underutilization
- Solution: preemptive GPU sharing
33 Scenario 1: Intra-application imbalance
[Gantt charts comparing (a) batch scheduling, (b) controlled 2-way sharing, and (c) preemptive sharing of processes from applications A, B, and C across GPUs 0-3, with their synchronization points]
34 Scenario 2: Inter-application imbalance
[Gantt charts comparing (a) batch scheduling, (b) controlled 2-way sharing, (c) preemptive sharing, and (d) preemptive 2-way sharing of processes from applications A, B, and C across GPUs 0-3 and their vGPUs, with idle time and synchronization points marked]
35 Types of swapping operations
Inter-application swapping
- Time-sharing of a GPU among applications with conflicting memory requirements
Intra-application swapping
- The memory footprint of one application is the memory footprint of its largest kernel

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
36 Types of swapping operations (cont'd)
On the bare CUDA runtime:

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);      // MEMORY CAPACITY EXCEEDED: RUNTIME ERROR!
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
37 Types of swapping operations (cont'd)
On our runtime:

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // first memory allocation & data transfer to GPU (A_d & B_d)
    matmul(B_d, B_d, C_d);   // SWAP(A_d) & memory allocation (C_d)
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
38 Experiments: load balancing with dynamic binding
[Chart: total execution time (sec) vs. number of jobs, with and without load balancing through dynamic binding, for CPU fraction = 0 and CPU fraction = 1]
- Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
- Especially on small batches of jobs, dynamic binding improves performance
39 Runtime configurations
Only initial memory transfer deferral
- Only memory transfers before the 1st kernel call are deferred
- Pros: overlaps computation and communication
- Cons: more swapping overhead
Unconditional memory transfer deferral
- All memory transfers are deferred
- Pros: less swapping overhead
- Cons: no overlapping of computation and communication
40 Application call: actions performed by the runtime (errors returned by the runtime)
Malloc:
- Create PTE (error: a virtual address cannot be assigned)
- Allocate swap (error: swap memory cannot be allocated)
Copy HD:
- Check valid PTE (error: no valid PTE)
- Move data to swap (error: swap-data size mismatch)
Copy DH:
- Check valid PTE (error: no valid PTE)
- If (PTE.toCopy2Swap) cudaMemcpy DH
Free:
- Check valid PTE (error: no valid PTE)
- De-allocate swap (error: cannot de-allocate swap)
- If (PTE.isAllocated) cudaFree
Launch:
- Check valid PTE (error: no valid PTE)
- If (!PTE.isAllocated) cudaMalloc
- If (PTE.toCopy2Dev) cudaMemcpy HD
- cudaLaunch
Swap:
- Check valid PTE (error: no valid PTE)
- If (PTE.toCopy2Swap) cudaMemcpy DH
- If (PTE.isAllocated) cudaFree
A sketch of the Launch path follows.
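A sketch of the Launch row of this list, assuming a PTE that carries the three flags shown on the next slide; the struct layout and function name are illustrative rather than the authors' code.

```c
/* Sketch of the Launch path (assumed implementation): on a kernel launch,
 * every PTE the kernel references is validated, the device buffer is
 * allocated if needed, pending swap data is pushed to the device, and only
 * then is the kernel launched. */
#include <cuda_runtime_api.h>
#include <stddef.h>

typedef struct {
    void  *swap, *dev;
    size_t size;
    int    isAllocated;   /* device buffer currently allocated */
    int    toCopy2Dev;    /* swap copy must be pushed to the device */
    int    toCopy2Swap;   /* device copy must be pulled back to swap */
} pte_t;

/* Prepare one argument's PTE before launching; returns 0 on success. */
int rt_prepare_for_launch(pte_t *pte)
{
    if (pte == NULL)
        return -1;                          /* "no valid PTE" */

    if (!pte->isAllocated) {
        if (cudaMalloc(&pte->dev, pte->size) != cudaSuccess)
            return -1;                      /* may trigger a swap of another PTE */
        pte->isAllocated = 1;
    }
    if (pte->toCopy2Dev) {
        if (cudaMemcpy(pte->dev, pte->swap, pte->size,
                       cudaMemcpyHostToDevice) != cudaSuccess)
            return -1;
        pte->toCopy2Dev = 0;
    }
    return 0;   /* the runtime then issues the actual kernel launch */
}
```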
41 Flags for page table entry handling
Each PTE carries three flags: isAllocated / toCopy2Dev / toCopy2Swap.
[State diagram: the flag triple moves among the states F/F/F, F/T/F, T/T/F, T/F/F, and T/F/T as copyHD, copyDH, launch, and swap operations are applied to the entry]