Speeding up the execution of numerical computations and simulations with rCUDA. José Duato


1 Speeding up the execution of numerical computations and simulations with rCUDA. José Duato, Universidad Politécnica de Valencia, Spain

2 Outline: 1. Introduction to GPU computing; 2. What is remote GPU virtualization?; 3. Why is remote GPU virtualization needed?; 4. Frameworks for remote GPU virtualization; 5. Cons of remote GPU virtualization; 6. Pros of remote GPU virtualization. An introduction to the remote GPU virtualization technique and frameworks. Speeding up the execution of numerical computations and simulations. ICETE/PECCS

3 Outline, 1st part: Introduction to GPU computing

4 Current computing needs. Many applications require huge computing power, and this demand increases over time. Applications can be accelerated to reduce execution time (parallel processing, vector instructions). GPU computing has experienced remarkable growth in recent years.

5 GPUs reduce energy and time. GPU-Blast is an accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. Command: blastp -db sorted_env_nr -query SequenceLength_ txt -num_threads X -gpu [t|f]. Testbed: dual-socket Intel Xeon E v2 node with an NVIDIA K20 GPU. [Chart: runs with the GPU enabled versus CPU-only runs; the GPU delivers improvements of 2.36x and 3.56x.]

6 GPUs reduce energy and time. Same benchmark and testbed as above. [Chart: a green zone, where GPUs are better than CPUs, and a red zone, where GPUs are worse than CPUs.] GPUs are not the magic solution for all applications.

7 Current GPU computing facilities. The basic building block is a node with one or more GPUs.

8 Basics of GPU computing. [Diagram: basic behavior of CUDA, with the CPU offloading kernels and data to the GPU.]

9 Basics of GPU computing. Example CUDA program:

#include <cuda.h>
#include <stdio.h>

const int N = 8;

// Function that will be executed in the GPU
__global__ void my_gpu_function(int *a, int *b)
{ b[threadIdx.x] = a[threadIdx.x] * 2; }

int main()
{
    int a[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    int b[N];
    int *ad, *bd;
    const int isize = N * sizeof(int);

    // Perform some computations in the CPU
    // ... CPU code ...

    // Allocate GPU memory
    cudaMalloc((void**)&ad, isize);
    cudaMalloc((void**)&bd, isize);

    // Copy data to GPU memory
    cudaMemcpy(ad, a, isize, cudaMemcpyHostToDevice);

    // Execute function in the GPU
    my_gpu_function<<<1, N>>>(ad, bd);

    // Copy results from GPU memory
    cudaMemcpy(b, bd, isize, cudaMemcpyDeviceToHost);

    // Free GPU memory
    cudaFree(ad);
    cudaFree(bd);
    return 0;
}

10 Heterogeneous clusters. From the programming point of view: a set of nodes, each with one or more CPUs (several cores per CPU) and one or more GPUs (typically between 1 and 4), plus an interconnection network.

11 Outline, 2nd part: improving overall performance and energy consumption of your cluster with remote GPU virtualization. What is remote GPU virtualization?

12 It has to do with GPUs, obviously!

13 Remote GPU virtualization: a software technology that enables a more flexible use of GPUs in computing facilities. [Diagram: a node with a local CUDA GPU next to a node with no GPU that uses a remote one.]

14 Same source code for using virtual GPUs. The application source code (the same CUDA program shown above) does not change at all; the only change is linking against the remote GPU virtualization libraries instead of the original CUDA libraries.

15 Basics of remote GPU virtualization

16 Basics of remote GPU virtualization

17 The remote GPU virtualization vision. Remote GPU virtualization enables a new view of a GPU deployment, moving from the usual cluster configuration, in which each node owns its GPUs (the physical configuration), to a logical configuration in which the GPUs are decoupled from the nodes and reached through logical connections over the interconnection network.

18 The remote GPU virtualization vision. Without GPU virtualization, each node can only use its real local GPUs. With GPU virtualization, every node sees all the virtualized remote GPUs in the cluster: GPU virtualization allows all nodes to share all GPUs.

19 Outline, 3rd part: Why is remote GPU virtualization needed?

20 What is the problem with GPU-enabled clusters?

21 The problem is the cluster architecture. A GPU-enabled cluster is a set of independent, self-contained nodes. The cluster follows the shared-nothing approach: nothing is directly shared among nodes (MPI is required to aggregate computing resources within the cluster, including GPUs), and GPUs can only be used within the node they are attached to.

22 First concern with accelerated clusters: non-accelerated applications keep GPUs idle in the nodes where they use all the CPU cores. Hybrid MPI/shared-memory non-accelerated applications usually span all the cores of n nodes; a CPU-only application spreading over those nodes makes their GPUs unavailable to accelerated applications.

23 Money leakage in current clusters? For some workloads, GPUs may be idle for significant periods of time: the initial acquisition costs are not amortized, the GPUs reduce density (space), and idle GPUs keep consuming power. [Chart: idle power (Watts) over time for a 1-GPU node (two E v2 sockets, 32 GB DDR3, one Tesla K20) and a 4-GPU node (two E v2 sockets, 128 GB DDR3, four Tesla K20s); the chart marks a 25% difference between them.]

24 First concern with accelerated clusters (II): accelerated applications keep CPU cores idle in the nodes where they execute. Hybrid MPI/shared-memory non-accelerated applications usually span all the cores of n nodes, so an accelerated application using just one core may prevent other jobs from being dispatched to that node.

25 First concern with accelerated clusters (II): accelerated applications keep CPU cores idle in the nodes where they execute. An accelerated MPI application using just one core per node may keep a large part of the cluster busy while leaving most of its cores idle.

26 Second concern with accelerated clusters: non-MPI multi-GPU applications cannot make use of the tremendous GPU resources available across the cluster, even if those GPUs are idle. All the GPUs in other nodes are out of reach of the multi-GPU application being executed.

27 One more concern with accelerated clusters: do applications completely squeeze the GPUs available in the cluster? When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used: the application may present a low level of parallelism, CPU code may be executing (GPU assigned but not working), GPU cores may stall due to lack of data, etc.

28 GPU usage of GPU-Blast. [Profile: intervals where the GPU is assigned but not used.]

29 GPU usage of CUDA-MEME. [Profile: GPU utilization is far from the maximum.]

30 GPU usage of LAMMPS. [Profile: an interval where the GPU is assigned but not used.]

31 GPU usage of CUDASW++. [Profile: intervals where the GPU is assigned but not used.]

32 GPU allocation vs GPU utilization. [Chart: GPUs are assigned but not used for a noticeable fraction of the time.]

33 Sharing a GPU among jobs: GPU-Blast. [Profile: the first of two concurrent instances of GPU-Blast.]

34 Sharing a GPU among jobs: GPU-Blast. One instance alone required about 51 seconds. [Profile: two concurrent instances of GPU-Blast.]

35 Sharing a GPU among jobs: GPU-Blast. [Profile: the first and second of two concurrent instances of GPU-Blast.]

36 Sharing a GPU among jobs: GPU-Blast. One instance alone required about 51 seconds. [Profile: four concurrent instances of GPU-Blast.]

37 Sharing a GPU among jobs. Memory footprints on a K20 GPU: LAMMPS, 876 MB; mCUDA-MEME, 151 MB; BarraCUDA, 3319 MB; MUMmerGPU, 2104 MB; GPU-LIBSVM, 145 MB.

38 Back to the first concern. Accelerated applications keep CPU cores idle in the nodes where they execute. Hybrid MPI/shared-memory non-accelerated applications usually span all the cores of n nodes. An accelerated MPI application using just one core per node may keep a large part of the cluster busy.

39 Back to the first concern (continued). The idle cores in nodes executing accelerated applications could be used for other applications.

40 Performance in GPU clusters is lost. In summary: there are scenarios where GPUs are available but cannot be used, CPU cores are available but cannot be used, and accelerated applications do not make use of 100% of the GPU resources. In conclusion: GPU and CPU cycles are lost, reducing cluster performance.

41 We need something more in the cluster. The current model for using GPUs is too rigid. What is missing is some flexibility for using the GPUs in the cluster.

42 We need something more in the cluster. The current model for using GPUs is too rigid. What is missing is some flexibility for using the GPUs in the cluster: a way of seamlessly sharing GPUs across nodes (remote GPU virtualization).

43 Remote GPU virtualization. [Diagram: the physical configuration, each node with its own GPU, versus the logical configuration, in which the GPUs are reachable from any node through logical connections over the interconnection network.]

44 Increasing flexibility: one step further. Once GPUs are shared, their amount can be reduced to match the actual workload. This would increase GPU utilization and lower power consumption, while also reducing the initial acquisition costs.

45 What is needed for increased flexibility? In summary, this new cluster configuration requires: a way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization), and enhanced job schedulers that take the new virtual GPUs into account.

46 Outline, 4th part: Frameworks for remote GPU virtualization

47 Remote GPU virtualization frameworks. Several efforts to implement remote GPU virtualization have been made over the last years. Publicly available: rCUDA (CUDA 7.0), GVirtuS (CUDA 3.2), DS-CUDA (CUDA 4.1). Not publicly available: vCUDA (CUDA 1.1), GViM (CUDA 1.1), GridCUDA (CUDA 2.3), V-GPU (CUDA 4.0). rCUDA is a development by the Technical University of Valencia.

48 Remote GPU virtualization frameworks. [Chart: attained bandwidth for host-to-device and device-to-host transfers, with pageable and pinned memory, over FDR InfiniBand with a K20 GPU.]

49 rCUDA uses optimized transfers. rCUDA features optimized data transfers: pipelined transfers to improve performance, preallocated pinned memory buffers, and an optimal pipeline block size.
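The buffer-reuse pattern behind these features can be sketched in plain C. This is an illustrative mock-up, not rCUDA code: memcpy stands in for the real stages (host memory to pinned network buffer, RDMA transfer, network buffer to GPU memory), and the function name, the netbuf buffer, and the 64 KiB block size are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK (64 * 1024)   /* pipeline block size; rCUDA tunes this per network */

static char netbuf[BLOCK];  /* small preallocated "network" buffer (pinned in the real system) */

/* Illustrative chunked copy: the whole source area flows through one small,
   reusable buffer, one pipeline block at a time. In rCUDA the three stages
   run concurrently on different blocks; here they are serialized, but the
   chunking and buffer-reuse pattern is the same. */
void pipelined_copy(char *gpu_dst, const char *host_src, size_t len)
{
    for (size_t off = 0; off < len; off += BLOCK) {
        size_t chunk = (len - off < BLOCK) ? len - off : BLOCK;
        memcpy(netbuf, host_src + off, chunk);  /* stage 1: main memory -> network buffer */
        /* stage 2: RDMA transfer to the remote server would happen here */
        memcpy(gpu_dst + off, netbuf, chunk);   /* stage 3: server buffer -> GPU memory */
    }
}
```

Because each block is forwarded as soon as it is ready, the three stages can overlap on different blocks, which is where the bandwidth improvement over a single monolithic copy comes from.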

50-54 Moving data WITHOUT a pipeline. A pageable memory area must be copied to a destination memory area in the remote GPU. First, the memory area is copied into network buffers for RDMA; the data then crosses the network and arrives at the network buffers of the remote server; finally, it is copied from those buffers into the destination memory area in the GPU. The transmission time without a pipeline is the sum of the three stages: main memory to network buffers, network transfer from client to remote server, and server buffers to GPU memory.

55-64 Moving data WITH a pipeline. The pageable memory area to be copied to the remote GPU is split into chunks (pipeline blocks); only two small network buffers, at source and destination, are required. Transmission starts by moving the first block forward; soon two pipeline stages, and then all three, are concurrently moving data forward, block after block, until the whole area has reached the destination memory in the remote GPU.

65 Moving data with vs. without a pipeline. Managing the pipelined transfer is much more complex, but the transmission time with the pipeline is clearly shorter than without it, because the three stages (main memory to network buffers, client to remote server, server buffers to GPU memory) now overlap.
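The benefit fits a simple three-stage pipeline model (a sketch, assuming for simplicity that all three stages take the same time per block). With n pipeline blocks and a per-block, per-stage time t:

```latex
% Sketch: three-stage transfer pipeline, n blocks, per-block stage time t
T_{\mathrm{no\ pipeline}} = 3\,n\,t,
\qquad
T_{\mathrm{pipeline}} \approx (n + 2)\,t,
\qquad
\frac{T_{\mathrm{no\ pipeline}}}{T_{\mathrm{pipeline}}} \xrightarrow[\;n \to \infty\;]{} 3
```

so for large transfers a balanced three-stage pipeline approaches a threefold reduction of the transmission time; unbalanced stages are limited by the slowest one.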

66 Basic performance analysis: the optimal pipeline block size for InfiniBand Connect-IB. It was 2 MB with IB QDR and 1 MB with FDR. [Charts: bandwidth (MB/s) versus copy size (MB) for H2D pinned transfers, with one and with two Connect-IB ports.] Testbed: NVIDIA Tesla K40; Mellanox Connect-IB adapters and an SX6036 Mellanox switch.

67 How applications use remote GPUs. Environment variables are properly initialized on the client side and used by the rCUDA client, transparently to the application: the server name or IP address and GPU of each remote device, and the number of GPUs assigned to the application.
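In practice this is typically a handful of shell exports before launching the unmodified binary (a sketch based on the rCUDA user's guide; the library path, server addresses, GPU indices, and application name are made up for the example):

```shell
# Point the dynamic linker at the rCUDA client library instead of the CUDA one
export LD_LIBRARY_PATH="$HOME/rCUDA/lib:$LD_LIBRARY_PATH"

# Two remote GPUs assigned to the application
# (variable names as in the rCUDA user's guide; values are illustrative)
export RCUDA_DEVICE_COUNT=2
export RCUDA_DEVICE_0=192.168.0.1:0   # first GPU of server 192.168.0.1
export RCUDA_DEVICE_1=192.168.0.2:0   # first GPU of server 192.168.0.2

# The unmodified CUDA binary now sees the two remote GPUs:
# ./my_cuda_app
```

The application itself is never edited or recompiled for a different GPU count; changing the exports changes the set of remote GPUs it sees.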

68 Outline, 5th part: Cons of remote GPU virtualization

69 Problem with remote GPU virtualization. The main drawback of remote GPU virtualization is the reduced bandwidth to the remote GPU, since a node with no GPU must reach it across the network.

70 Using InfiniBand networks

71 Performance of rCUDA. CUDASW++, bioinformatics software for Smith-Waterman protein database searches. [Chart: execution time (s) and rCUDA overhead (%) versus sequence length, comparing CUDA with rCUDA over FDR InfiniBand, QDR InfiniBand, and GbE.] Testbed: dual-socket Intel Xeon E v2 node with an NVIDIA K20 GPU.

72 Outline, 6th part: Pros of remote GPU virtualization

73 1: more GPUs for a single application. GPU virtualization is useful for multi-GPU applications: without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application through logical connections.

74 1: more GPUs for a single application. [Screenshot: a single application using 64 GPUs!]

75 1: more GPUs for a single application. MonteCarlo Multi-GPU (from the NVIDIA samples), on FDR InfiniBand with NVIDIA Tesla K20 GPUs. [Charts: throughput (higher is better) and execution time (lower is better) as GPUs are added.]

76 2: busy CPU cores do not block GPUs. [Diagram: physical configuration versus logical configuration; with remote GPUs, a node whose cores are all busy no longer blocks the use of its GPU.]

77 3: easier cluster upgrade. A cluster without GPUs needs to be upgraded to use GPUs. GPUs require large power supplies: are the power supplies already installed in the nodes large enough? GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?

78 3: easier cluster upgrade. Approach 1: augment the cluster with some CUDA GPU-enabled nodes; only those GPU-enabled nodes can execute accelerated applications.

79 3: easier cluster upgrade. Approach 2: augment the cluster with some rCUDA servers; now all nodes can execute accelerated applications.

80 3: easier cluster upgrade. Testbed: FDR InfiniBand based cluster of dual-socket Intel Xeon E5-2620 v2 nodes with 32 GB of RAM; 15 nodes without GPU plus 1 node with 4 K20 GPUs; jobs dispatched by the Slurm scheduler.

81 3: easier cluster upgrade. Applications used for the tests: GPU-Blast (21 s; 1 GPU; 1599 MB), LAMMPS short (90 s; 1 GPU; 2633 MB), LAMMPS long 2p (149 s; 2 GPUs; 3950 MB), LAMMPS long 4p (71 s; 4 GPUs; 2385 MB), mCUDA-MEME short (510 s; 1 GPU; 151 MB), mCUDA-MEME long 2p (1182 s; 2 GPUs; 152 MB), mCUDA-MEME long 4p (631 s; 4 GPUs; 152 MB), BarraCUDA (10 min; 1 GPU; 3319 MB), GPU-LIBSVM (5 min; 1 GPU; 145 MB), MUMmerGPU (5 min; 1 GPU; 2804 MB); non-GPU: GROMACS (167 s), NAMD (11 min).

82 3: easier cluster upgrade

83 3: easier cluster upgrade. [Chart: changes relative to the baseline cluster: -68%, -60%, +131%, +119%, -63%, -56%.]

84 4: GPU consolidation. Concentrate the GPUs into dedicated GPU boxes (each with a low-power CPU) and allow GPU task migration.

85 4: GPU consolidation. Box A has 4 GPUs but only one is busy; Box B has 8 GPUs but only two are busy. 1. Move the jobs from Box B to Box A and switch off Box B. 2. Migration should be transparent to applications (decided by the global scheduler). TRUE GREEN GPU COMPUTING.

86 4: GPU consolidation, one step beyond: job granularity instead of GPU granularity.

87 Get a free copy of rCUDA (more than 650 requests worldwide). rCUDA is a development by the Technical University of Valencia.

88 Thanks! Questions? rCUDA is a development by the Technical University of Valencia.


More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost

More information

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011 NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for

More information

Document downloaded from:

Document downloaded from: Document downloaded from: http://hdl.handle.net/10251/70225 This paper must be cited as: Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories.

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

An approach to provide remote access to GPU computational power

An approach to provide remote access to GPU computational power An approach to provide remote access to computational power University Jaume I, Spain Joint research effort 1/84 Outline computing computing scenarios Introduction to rcuda rcuda structure rcuda functionality

More information

ENABLING NEW SCIENCE GPU SOLUTIONS

ENABLING NEW SCIENCE GPU SOLUTIONS ENABLING NEW SCIENCE TESLA BIO Workbench The NVIDIA Tesla Bio Workbench enables biophysicists and computational chemists to push the boundaries of life sciences research. It turns a standard PC into a

More information

GROMACS (GPU) Performance Benchmark and Profiling. February 2016

GROMACS (GPU) Performance Benchmark and Profiling. February 2016 GROMACS (GPU) Performance Benchmark and Profiling February 2016 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Mellanox, NVIDIA Compute

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

SNAP Performance Benchmark and Profiling. April 2014

SNAP Performance Benchmark and Profiling. April 2014 SNAP Performance Benchmark and Profiling April 2014 Note The following research was performed under the HPC Advisory Council activities Participating vendors: HP, Mellanox For more information on the supporting

More information

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark

More information

computational power computational

computational power computational rcuda: rcuda: an an approach approach to to provide provide remote remote access access to to computational computational power power Rafael Mayo Gual Universitat Jaume I Spain (1 of 59) HPC Advisory Council

More information

OCTOPUS Performance Benchmark and Profiling. June 2015

OCTOPUS Performance Benchmark and Profiling. June 2015 OCTOPUS Performance Benchmark and Profiling June 2015 2 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the

More information

LAMMPSCUDA GPU Performance. April 2011

LAMMPSCUDA GPU Performance. April 2011 LAMMPSCUDA GPU Performance April 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Intel, Mellanox Compute resource - HPC Advisory Council

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

PART-I (B) (TECHNICAL SPECIFICATIONS & COMPLIANCE SHEET) Supply and installation of High Performance Computing System

PART-I (B) (TECHNICAL SPECIFICATIONS & COMPLIANCE SHEET) Supply and installation of High Performance Computing System INSTITUTE FOR PLASMA RESEARCH (An Autonomous Institute of Department of Atomic Energy, Government of India) Near Indira Bridge; Bhat; Gandhinagar-382428; India PART-I (B) (TECHNICAL SPECIFICATIONS & COMPLIANCE

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Game-changing Extreme GPU computing with The Dell PowerEdge C4130

Game-changing Extreme GPU computing with The Dell PowerEdge C4130 Game-changing Extreme GPU computing with The Dell PowerEdge C4130 A Dell Technical White Paper This white paper describes the system architecture and performance characterization of the PowerEdge C4130.

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

S8688 : INSIDE DGX-2. Glenn Dearth, Vyas Venkataraman Mar 28, 2018

S8688 : INSIDE DGX-2. Glenn Dearth, Vyas Venkataraman Mar 28, 2018 S8688 : INSIDE DGX-2 Glenn Dearth, Vyas Venkataraman Mar 28, 2018 Why was DGX-2 created Agenda DGX-2 internal architecture Software programming model Simple application Results 2 DEEP LEARNING TRENDS Application

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010

Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010 Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010 Windows HPC Server 2008 R2 Windows HPC Server 2008 R2 makes supercomputing

More information

MAHA. - Supercomputing System for Bioinformatics

MAHA. - Supercomputing System for Bioinformatics MAHA - Supercomputing System for Bioinformatics - 2013.01.29 Outline 1. MAHA HW 2. MAHA SW 3. MAHA Storage System 2 ETRI HPC R&D Area - Overview Research area Computing HW MAHA System HW - Rpeak : 0.3

More information

TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016

TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM

More information

Pedraforca: a First ARM + GPU Cluster for HPC

Pedraforca: a First ARM + GPU Cluster for HPC www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

GROMACS Performance Benchmark and Profiling. September 2012

GROMACS Performance Benchmark and Profiling. September 2012 GROMACS Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

AcuSolve Performance Benchmark and Profiling. October 2011

AcuSolve Performance Benchmark and Profiling. October 2011 AcuSolve Performance Benchmark and Profiling October 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox, Altair Compute

More information

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center The Stampede is Coming Welcome to Stampede Introductory Training Dan Stanzione Texas Advanced Computing Center dan@tacc.utexas.edu Thanks for Coming! Stampede is an exciting new system of incredible power.

More information

Towards Transparent and Efficient GPU Communication on InfiniBand Clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake

Towards Transparent and Efficient GPU Communication on InfiniBand Clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake Towards Transparent and Efficient GPU Communication on InfiniBand Clusters Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake MPI and I/O from GPU vs. CPU Traditional CPU point-of-view

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

System Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering.

System Design of Kepler Based HPC Solutions. Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. System Design of Kepler Based HPC Solutions Saeed Iqbal, Shawn Gao and Kevin Tubbs HPC Global Solutions Engineering. Introduction The System Level View K20 GPU is a powerful parallel processor! K20 has

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

GPUs as better MPI Citizens

GPUs as better MPI Citizens s as better MPI Citizens Author: Dale Southard, NVIDIA Date: 4/6/2011 www.openfabrics.org 1 Technology Conference 2011 October 11-14 San Jose, CA The one event you can t afford to miss Learn about leading-edge

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

SMB Direct Update. Tom Talpey and Greg Kramer Microsoft Storage Developer Conference. Microsoft Corporation. All Rights Reserved.

SMB Direct Update. Tom Talpey and Greg Kramer Microsoft Storage Developer Conference. Microsoft Corporation. All Rights Reserved. SMB Direct Update Tom Talpey and Greg Kramer Microsoft 1 Outline Part I Ecosystem status and updates SMB 3.02 status SMB Direct applications RDMA protocols and networks Part II SMB Direct details Protocol

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

NAMD Performance Benchmark and Profiling. January 2015

NAMD Performance Benchmark and Profiling. January 2015 NAMD Performance Benchmark and Profiling January 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource

More information

Server-Grade Performance Computer-on-Modules:

Server-Grade Performance Computer-on-Modules: June, 2018 Version 0.x Server-Grade Performance Computer-on-Modules: COM Express Based on Intel Xeon Solution June, 2018 Version 0.x Table of Contents Introduction: Market Overview and Challenges... 1

More information

STAR-CCM+ Performance Benchmark and Profiling. July 2014

STAR-CCM+ Performance Benchmark and Profiling. July 2014 STAR-CCM+ Performance Benchmark and Profiling July 2014 Note The following research was performed under the HPC Advisory Council activities Participating vendors: CD-adapco, Intel, Dell, Mellanox Compute

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Agenda 1 Agenda-Day 1 HPC Overview What is a cluster? Shared v.s. Distributed Parallel v.s. Massively Parallel Interconnects

More information

Gen-Z Memory-Driven Computing

Gen-Z Memory-Driven Computing Gen-Z Memory-Driven Computing Our vision for the future of computing Patrick Demichel Distinguished Technologist Explosive growth of data More Data Need answers FAST! Value of Analyzed Data 2005 0.1ZB

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

GPGPU Programming & Erlang. Kevin A. Smith

GPGPU Programming & Erlang. Kevin A. Smith GPGPU Programming & Erlang Kevin A. Smith What is GPGPU Programming? Using the graphics processor for nongraphical programming Writing algorithms for the GPU instead of the host processor Why? Ridiculous

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016 RECENT TRENDS IN GPU ARCHITECTURES Perspectives of GPU computing in Science, 26 th Sept 2016 NVIDIA THE AI COMPUTING COMPANY GPU Computing Computer Graphics Artificial Intelligence 2 NVIDIA POWERS WORLD

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Future Routing Schemes in Petascale clusters

Future Routing Schemes in Petascale clusters Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract

More information

MILC Performance Benchmark and Profiling. April 2013

MILC Performance Benchmark and Profiling. April 2013 MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU

E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè ARM64 and GPGPU 1 E4 Computer Engineering Company E4 Computer Engineering S.p.A. specializes in the manufacturing of high performance IT systems of medium

More information

The Future of Interconnect Technology

The Future of Interconnect Technology The Future of Interconnect Technology Michael Kagan, CTO HPC Advisory Council Stanford, 2014 Exponential Data Growth Best Interconnect Required 44X 0.8 Zetabyte 2009 35 Zetabyte 2020 2014 Mellanox Technologies

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

NAMD GPU Performance Benchmark. March 2011

NAMD GPU Performance Benchmark. March 2011 NAMD GPU Performance Benchmark March 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Intel, Mellanox Compute resource - HPC Advisory

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Comet Virtualization Code & Design Sprint

Comet Virtualization Code & Design Sprint Comet Virtualization Code & Design Sprint SDSC September 23-24 Rick Wagner San Diego Supercomputer Center Meeting Goals Build personal connections between the IU and SDSC members of the Comet team working

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Architectures for Scalable Media Object Search

Architectures for Scalable Media Object Search Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation

More information

SwitchX Virtual Protocol Interconnect (VPI) Switch Architecture

SwitchX Virtual Protocol Interconnect (VPI) Switch Architecture SwitchX Virtual Protocol Interconnect (VPI) Switch Architecture 2012 MELLANOX TECHNOLOGIES 1 SwitchX - Virtual Protocol Interconnect Solutions Server / Compute Switch / Gateway Virtual Protocol Interconnect

More information