rcuda: from virtual machines to hybrid CPU-GPU clusters

1 rcuda: from virtual machines to hybrid CPU-GPU clusters Federico Silla Universitat Politècnica de València HPC ADMINTECH 2018

3 Outline What is rcuda?

4 Basics of GPU computing Basic behavior of CUDA GPU Remark: GPUs can only be used within the node they are attached to

6 A different approach: remote GPU virtualization

7 A different approach: remote GPU virtualization A software technology that enables a more flexible use of GPUs in computing facilities. rcuda: remote CUDA. rcuda is a development by Universitat Politècnica de València

8 Basics of rcuda Access to remote GPU is transparent to applications: no source code modification is needed
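
The transparency claim above can be illustrated with a toy model. This is only a sketch of the general client/server forwarding idea, not rcuda's implementation: `FakeGpu`, `RemoteGpu`, `serve`, and `application` are hypothetical names, the "device" is a Python dictionary, and the JSON-over-socket protocol is an assumption made for the example. The point it shows is that the same `application` function runs unchanged against a local device and against a proxy that forwards every call to a server.

```python
import json
import socket
import threading


class FakeGpu:
    """Hypothetical stand-in for a device: stores buffers, runs one 'kernel'."""

    def __init__(self):
        self.buffers = {}
        self.next_handle = 1

    def malloc_and_copy(self, data):
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = list(data)
        return handle

    def scale(self, handle, factor):
        self.buffers[handle] = [x * factor for x in self.buffers[handle]]

    def copy_back(self, handle):
        return self.buffers[handle]


def serve(conn, gpu):
    """Server side: decode each request, run it on the local device, reply."""
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        for line in reader:
            request = json.loads(line)
            result = getattr(gpu, request["call"])(*request["args"])
            writer.write(json.dumps(result) + "\n")
            writer.flush()


class RemoteGpu:
    """Client side: exposes the same method names, forwards every call."""

    def __init__(self, conn):
        self.reader = conn.makefile("r")
        self.writer = conn.makefile("w")

    def __getattr__(self, name):
        # Only reached for names that are not regular attributes,
        # i.e. the device calls the application makes.
        def forward(*args):
            self.writer.write(json.dumps({"call": name, "args": args}) + "\n")
            self.writer.flush()
            return json.loads(self.reader.readline())
        return forward


def application(gpu):
    """Unmodified 'application': it only sees the device interface."""
    handle = gpu.malloc_and_copy([1, 2, 3])
    gpu.scale(handle, 10)
    return gpu.copy_back(handle)


# A connected socket pair plays the role of the network between two nodes.
client_sock, server_sock = socket.socketpair()
threading.Thread(target=serve, args=(server_sock, FakeGpu()), daemon=True).start()

local_result = application(FakeGpu())               # device in the same process
remote_result = application(RemoteGpu(client_sock))  # device behind the server
```

Both calls return the same values, which is the property the slide describes: the application code does not change when the device moves to the other end of a connection.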

11 rcuda supports RDMA transfers

12 rcuda vision rcuda allows a new vision of GPU deployment, moving from the usual cluster configuration (physical configuration) to a logical configuration

13 Outline Performance of rcuda?

14 Performance of rcuda K20 GPU and FDR InfiniBand K40 GPU and EDR InfiniBand Lower is better

15 Performance of rcuda P100 GPU and EDR InfiniBand. Panels: BarraCUDA, CUDA-MEME. Lower is better

16 Performance of data movements among GPUs CUDA rcuda rcuda scenario 1 rcuda scenario 2

17 Performance of data movements among GPUs Higher is better

18 Performance of data movements to/from GPUs CUDA rcuda

19 Performance of data movements to/from GPUs CPU to GPU Higher is better GPU to CPU

22 Performance of data movements to/from GPUs New communication module in progress

23 Outline Benefits of rcuda?

24 Outline Benefits of rcuda: 1. Many GPUs for an application 2. Server consolidation 3. GPU acceleration for virtual machines 4. Increased cluster throughput

25 Providing many GPUs to an application with rcuda

26 Providing many GPUs to an application with rcuda K20 GPUs and FDR InfiniBand Lower is better MonteCarlo multi-gpu program running in 14 NVIDIA Tesla K20 GPUs

27 Providing many GPUs to an application with rcuda 64 GPUs!!

28 Providing many GPUs to an application with rcuda Work in progress!! K20 GPUs (GPU 1 to GPU 16), non-optimized (yet) version of rcuda!!!

29 Server consolidation with rcuda (figure: GPU utilization (%), with some entries marked off)

30 Server consolidation with rcuda The GPU-Blast application is migrated up to 5 times among K40 GPUs. The aggregated volume of GPU data is 1300 MB (consisting of 9 memory regions). Lower is better. The Reference line is the execution time of the application when using CUDA with a local GPU and without any migration.

31 Virtual machines may need access to GPUs How to access the GPU in the native domain from inside of virtual machines?

32 Virtual machines may need access to GPUs The GPU is assigned exclusively to a single virtual machine by using PCI passthrough. Concurrent usage of the GPU is not possible.

33 Using rcuda to access the GPU High performance network fabric available. If InfiniBand is available, the rcuda server can be placed in another node. Several GPUs can be provided to the VMs, either in a single remote node or in several remote nodes.

34 Using rcuda to access the GPU High performance network is not available. When InfiniBand is not available, the rcuda server may be placed in the native domain and the rcuda clients inside the VMs. The virtual network provided by the hypervisor is then used to exchange data between the rcuda clients and the rcuda server. This configuration allows the use of more than one GPU at the host.

35 Using rcuda to access the GPU

36 Increased cluster throughput One rcuda box serves multiple clients...

37 Increased cluster throughput Lower is better - 58% 1. BarraCUDA 2. CUDA-MEME 3. CUDASW++ 4. GPU-Blast 5. Gromacs 6. Magma

38 Increased cluster throughput GPU assigned but not used

39 Outline One more benefit: Heterogeneous environments

40 rcuda availability rcuda is available for the x86, POWER and ARM processors

41 Outline Performance of rcuda on ARM systems

42 From ARM to x86 with rcuda ThunderX

43 Application performance Work in progress. A couple of applications have already been analyzed: 1. Cloverleaf: a mini-app that solves the compressible Euler equations on a Cartesian grid 2. Flow: a mini-app that implements a 2D hydrodynamics simulator

44 Application performance: Cloverleaf Single node executions Lower is better Estimation over multiple nodes

45 Application performance: Cloverleaf Single node executions Lower is better Rough energy estimation: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts. 40*80 versus 1*80+3*250+2*140: 3200 watts versus 1110 watts. Estimation over multiple nodes

46 Application performance: Flow Single node executions Lower is better Estimation over multiple nodes

47 Application performance: Flow Single node executions Lower is better Rough energy estimation: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts. 60*80 versus 1*80+3*250+2*140: 4800 watts versus 1110 watts. Estimation over multiple nodes
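
The rough energy estimations for Cloverleaf and Flow can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the TDP values and the node counts (40 and 60 ThunderX nodes) are taken from the slides, while reading the hybrid setup as one ThunderX client plus a server with three P100 GPUs and two Xeon CPUs is an interpretation, chosen because it is the combination that yields the 1110 watts quoted. The function names are hypothetical.

```python
# TDP figures quoted on the slides, in watts.
TDP_WATTS = {"thunderx": 80, "p100": 250, "xeon": 140}


def cpu_only_power(thunderx_nodes):
    """Total TDP of a CPU-only run spread over ThunderX nodes."""
    return thunderx_nodes * TDP_WATTS["thunderx"]


def hybrid_power(thunderx_nodes=1, p100_gpus=3, xeon_cpus=2):
    """Total TDP of the hybrid setup (component counts are an assumption)."""
    return (thunderx_nodes * TDP_WATTS["thunderx"]
            + p100_gpus * TDP_WATTS["p100"]
            + xeon_cpus * TDP_WATTS["xeon"])


cloverleaf_cpu_only = cpu_only_power(40)  # 40*80
flow_cpu_only = cpu_only_power(60)        # 60*80
hybrid = hybrid_power()                   # 1*80 + 3*250 + 2*140

print(f"Cloverleaf: {cloverleaf_cpu_only} W versus {hybrid} W")
print(f"Flow:       {flow_cpu_only} W versus {hybrid} W")
```

TDP is only an upper bound on sustained power draw, so these figures compare worst-case budgets and not measured energy.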

48 Hybrid CPU-GPU clusters High density ARM-based nodes

50 Hybrid CPU-GPU clusters High density ARM-based nodes rcuda clients rcuda servers

51 Get a free copy of rcuda at More than 900 requests worldwide rcuda is a development by Universitat Politècnica de València, Spain

52 Tony Díaz, Pablo Higueras, Javier Prades, Jaime Sierra, Cristian Peñaranda, Federico Silla, Carlos Reaño
