rCUDA: an approach to provide remote access to GPU computational power


1 rCUDA: an approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. HPC Advisory Council Workshop.

2 Outline: GPU computing; cost of a GPU node; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.


4 GPU computing. GPU computing will be the near future in HPC. In fact, it is already here!

5 GPU computing. It is massively parallel. For the right kind of code, the use of GPU computing brings huge benefits. Development tools and libraries facilitate the use of the GPU.

6 GPU computing. Two approaches in GPU computing: CUDA (NVIDIA proprietary) and OpenCL (open standard).

7 GPU computing. OpenCL and CUDA have basically the same work scheme:
Compilation: separate CPU code and GPU code (GPU kernel).
Running: data transfers between the CPU and GPU memory spaces.
  1. Before GPU kernel execution: data from the CPU memory space to the GPU memory space.
  2. Computation: kernel execution.
  3. After GPU kernel execution: results from the GPU memory space to the CPU memory space.
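The three running steps on this slide can be sketched with a toy model. This is only an illustration in plain Python, not CUDA or OpenCL code: `ToyDevice` and all of its methods are hypothetical names that stand in for a device with its own memory space.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU that owns a separate memory space."""

    def __init__(self):
        self.memory = {}      # device memory space: handle -> list of values
        self._next_handle = 0

    def malloc(self, n):
        """Reserve a buffer of n elements in device memory and return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self.memory[handle] = [0.0] * n
        return handle

    def copy_to_device(self, handle, host_data):
        """Step 1: CPU memory space -> device memory space."""
        self.memory[handle] = list(host_data)

    def launch(self, kernel, out, *ins):
        """Step 2: run the kernel once per element, using device memory only."""
        args = [self.memory[h] for h in ins]
        self.memory[out] = [kernel(*xs) for xs in zip(*args)]

    def copy_to_host(self, handle):
        """Step 3: device memory space -> CPU memory space."""
        return list(self.memory[handle])


def vector_add(host_a, host_b):
    dev = ToyDevice()
    d_a, d_b, d_c = (dev.malloc(len(host_a)) for _ in range(3))
    dev.copy_to_device(d_a, host_a)                 # before kernel execution
    dev.copy_to_device(d_b, host_b)
    dev.launch(lambda x, y: x + y, d_c, d_a, d_b)   # kernel execution
    return dev.copy_to_host(d_c)                    # after kernel execution
```

Calling `vector_add([1.0, 2.0], [3.0, 4.0])` walks through all three steps. In a remote-GPU setting, the two copy steps are the ones that would also have to cross the network.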

8 GPU computing. Not all algorithms benefit from GPU power. In some cases only part of a program must be run on a GPU. Depending on the algorithm, the GPU can be idle for long periods.


10 GPU computing cost. Tesla S2050: near 900 W (TDP, manufacturer specification). Usage: 75% of the time, so 25% idle time. Each node then wastes about 160 kWh/month, or about 2 MWh/year. This could amount to several hundred kg of CO2 per year.
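The figures on this slide can be reproduced with a short calculation. The sketch below assumes, as the slide's numbers appear to, that the unit draws its full 900 W TDP while idle and that a month has 30 days; both are simplifications, and the function name is only illustrative.

```python
TDP_KW = 0.9               # Tesla S2050, near 900 W (from the slide)
IDLE_FRACTION = 0.25       # idle 25% of the time (from the slide)
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def wasted_energy_kwh(power_kw, idle_fraction, hours):
    """Energy drawn while idle, assuming idle power equals power_kw."""
    return power_kw * idle_fraction * hours


monthly_kwh = wasted_energy_kwh(TDP_KW, IDLE_FRACTION, HOURS_PER_MONTH)
yearly_mwh = monthly_kwh * 12 / 1000

print(f"about {monthly_kwh:.0f} kWh/month, about {yearly_mwh:.1f} MWh/year")
```

The result is roughly 162 kWh per month and just under 2 MWh per year, which agrees with the rounded values on the slide. The CO2 estimate depends on an emission factor that the slide does not give, so it is left out here.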

11 GPU computing. You can find two different scenarios. Scenario 1: if all your programs are going to use the GPU for long periods, add a GPU to each node. You don't need our tool.

12 GPU computing. You can find two different scenarios. Scenario 2: you could consider adding a GPU only to some nodes. OUR TOOL CAN HELP YOU!


14-16 rCUDA (figure slides)


18-20 rCUDA structure (figure). Labels: CUDA application; client side; server side; application.

21 rCUDA functionality

22 rCUDA functionality. Supported CUDA 4.0 Runtime functions:
Module                          Functions   Supported
Device management
Error handling                  3           3
Event management                7           7
Execution control               7           7
Memory management
Peer device memory access       5           4
Stream management               2           2
Surface reference management    8           8
Texture reference management    6           6
Thread management               6           6
Version management              2           2

23 rCUDA functionality. NOT YET supported CUDA 4.0 Runtime functions:
Module                          Functions   Supported
Unified addressing              11          0
Peer device memory access       3           0
OpenGL interoperability         3           0
Direct3D 9 interoperability     5           0
Direct3D 10 interoperability    5           0
Direct3D 11 interoperability    5           0
VDPAU interoperability          4           0
Graphics interoperability       6           0

24 rCUDA functionality. Supported CUBLAS functions, by module: helper function reference, BLAS-1, BLAS-2, BLAS-3.


26 rCUDA: basic TCP/IP version

27 rCUDA: basic TCP/IP version. Example of rCUDA interaction: rCUDA initialization.

28 rCUDA: basic TCP/IP version. Example of rCUDA interaction: cudaMemcpy(..., cudaMemcpyHostToDevice);
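The slides do not show the messages exchanged for this call, so the following is only a guess at the general pattern: the client side packs the call and its data into a request, and the server side unpacks it and applies it to the GPU. Everything in the sketch (the header layout, `FUNC_MEMCPY_H2D`, `ToyServer`, the error code 1) is hypothetical and is not the rCUDA protocol. The network is replaced by passing bytes directly.

```python
import struct

FUNC_MEMCPY_H2D = 1               # hypothetical identifier for a host-to-device copy
HEADER = struct.Struct("!IIQ")    # function id, destination handle, payload size


def client_memcpy_h2d(dst_handle, data):
    """Client side: serialize the call and its data into one request message."""
    return HEADER.pack(FUNC_MEMCPY_H2D, dst_handle, len(data)) + data


class ToyServer:
    """Server side: owns the (simulated) GPU memory and executes requests."""

    def __init__(self):
        self.device_memory = {}   # stand-in for GPU memory: handle -> bytes

    def handle(self, message):
        func_id, dst, size = HEADER.unpack_from(message)
        payload = message[HEADER.size:HEADER.size + size]
        if func_id == FUNC_MEMCPY_H2D and len(payload) == size:
            self.device_memory[dst] = payload   # here a real server would copy to the GPU
            return 0                            # 0 mirrors cudaSuccess
        return 1                                # hypothetical nonzero error code


server = ToyServer()
request = client_memcpy_h2d(7, b"\x01\x02\x03\x04")
status = server.handle(request)   # in a real deployment this crosses the network
```

The application on the client keeps calling the usual CUDA function; only the library underneath changes, which is why the payload size matters so much for the overhead discussed next.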

29 rCUDA: basic TCP/IP version. Main problem: data movement overhead. On CUDA this overhead is due to the PCIe transfer. On rCUDA this overhead is due to the network transfer plus the PCIe transfer (which also appears in CUDA).
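A rough model makes the two overhead terms concrete. The sketch below assumes that the network and PCIe transfers happen one after the other, that transfer time is simply size divided by bandwidth, and that a matrix multiplication moves three single-precision n x n matrices (two in, one out). The bandwidth values are nominal or invented for illustration, not measurements from the talk.

```python
def transfer_time(n_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time: no latency, no protocol overhead."""
    return n_bytes / bandwidth_bytes_per_s


def cuda_overhead(n_bytes, pcie_bw):
    """Local GPU: data only crosses PCIe."""
    return transfer_time(n_bytes, pcie_bw)


def rcuda_overhead(n_bytes, net_bw, pcie_bw):
    """Remote GPU: data crosses the network, then PCIe on the server."""
    return transfer_time(n_bytes, net_bw) + transfer_time(n_bytes, pcie_bw)


def matmul_bytes(n, elem_size=4):
    """Two input matrices sent plus one result matrix returned."""
    return 3 * n * n * elem_size


PCIE_BW = 5e9  # assumed effective PCIe bandwidth in bytes/s (illustrative)
NETWORKS = {   # nominal link rates converted to bytes/s
    "1 Gbit/s": 1e9 / 8,
    "10 Gbit/s": 10e9 / 8,
    "40 Gbit/s": 40e9 / 8,
}


def overhead_table(n):
    size = matmul_bytes(n)
    return {name: rcuda_overhead(size, bw, PCIE_BW) for name, bw in NETWORKS.items()}


for name, seconds in overhead_table(4096).items():
    print(f"{name}: {seconds:.3f} s")
```

Under these assumptions the network term dominates on slow links and shrinks toward the PCIe-only cost as the link gets faster.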

30 rCUDA: basic TCP/IP version. Figure: data transfer time for matrix multiplication (2 from client to remote GPU, 1 from remote GPU to client). Axes: time (sec) vs. matrix dimension. Series: rCUDA, CUDA.

31 rCUDA: basic TCP/IP version. Figure: execution time for matrix multiplication (Tesla C1060; Intel Xeon E5410, 2.33 GHz). Axes: time (sec) vs. matrix dimension. Series: CPU, rCUDA kernel, rCUDA data transfer, rCUDA misc.

32 rCUDA: basic TCP/IP version. Figure: estimated execution time for matrix multiplication, including data transfers, for some HPC networks. Axes: time (sec) vs. matrix dimension. Series: CPU, 10 Gbit Ethernet, 10 Gbit InfiniBand, 40 Gbit InfiniBand.

33 rCUDA: basic TCP/IP version. We have shown the functionality. As we decrease the network overhead, the performance of our solution will approach that of the CUDA solution.

34-36 rCUDA: InfiniBand version (figure slides)

37 rCUDA: InfiniBand Verbs implementation. Same user-level functionality. Use of GPUDirect. Use of pipelined transfers. Client to/from remote GPU bandwidth near the peak of InfiniBand network performance.

38-41 rCUDA: GPUDirect. Without GPUDirect: 1. GPU to main memory. 2. Main memory to main memory. 3. Main memory to network. (Diagram labels: CPU, main memory, InfiniBand, chipset, GPU memory.)

42-44 rCUDA: GPUDirect. With GPUDirect: 1. GPU to main memory. 2. Main memory to network. A memory copy is avoided. (Diagram labels: CPU, main memory, InfiniBand, chipset, GPU memory.)

45 rCUDA: pipelined transfers. Diagram: a client node (main memory, CPU, chipset, InfiniBand) and a server node (CPU, main memory, InfiniBand, chipset, GPU memory), with rows labelled Client, Network and Server.

46-48 rCUDA: pipelined transfers. Without pipelined transfers, the three stages run one after another: copy to network buffers (client), data transfer (network), copy to GPU (server).

49-53 rCUDA: pipelined transfers. With pipelined transfers, the same three stages are repeated for successive blocks of data (three in the diagram) and overlap: while one block is in the data transfer stage on the network, the client is already copying the next block to the network buffers and the server is copying the previous block to the GPU.
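The gain from overlapping the stages can be estimated with a simple timing model. This sketch is an illustration under stated assumptions, not rCUDA code: the function names are hypothetical, each block takes a fixed time in each stage, a stage handles one block at a time, and splitting the data adds no cost.

```python
def sequential_time(n_blocks, stage_times):
    """Without pipelining: every block goes through all stages before the next starts."""
    return n_blocks * sum(stage_times)


def pipelined_time(n_blocks, stage_times):
    """With pipelining: a block enters a stage as soon as the block has left the
    previous stage and the stage has finished the previous block."""
    stage_free_at = [0.0] * len(stage_times)
    for _ in range(n_blocks):
        ready = 0.0  # time at which this block leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


# Stages: copy to network buffers, data transfer, copy to GPU (arbitrary time units).
stages = (1, 2, 1)
print(sequential_time(3, stages))  # 12
print(pipelined_time(3, stages))   # 8.0
```

With three blocks and the slowest stage taking 2 units, the pipelined total is the time for one block to traverse all stages plus two more passes through the slowest stage. For many blocks, the slowest stage sets the throughput, which is consistent with the claim that bandwidth approaches the network peak.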

54 rCUDA: optimized InfiniBand version. Figure: bandwidth for GEMM 13824x13824. Axis: bandwidth (MB/sec). Series: rCUDA GigaE, rCUDA IPoIB, rCUDA IBVerbs, CUDA.

55 rCUDA: optimized InfiniBand version. Figure: time for GEMM 4096x4096. Axis: time (sec). Series: rCUDA GigaE, rCUDA IPoIB, rCUDA IBVerbs, CUDA, CPU (MKL, Intel Xeon E5645). Bar values shown (sec): 2.28, 1.30, 0.70, 0.65, 0.62; the extracted text does not preserve which value belongs to which series.


57 rCUDA: work in progress. rCUDA port to Microsoft. Thread-safe rCUDA. rCUDA support for CUDA 4.0. Support for CUDA C extensions. rOpenCL.

58 rCUDA: near future. Dynamic remote GPU scheduling. Workload balance. Remote data cache. Remote kernel cache.

59 rCUDA: more information
GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization and High-Performance Cloud Computing, VHPC'09.
rCUDA: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM.
Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011 (accepted).

60 rCUDA: credits
Parallel Architectures Group, Technical University of València
High Performance Computing and Architectures Group, University Jaume I of Castelló
