
1 rCUDA: an approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. (1 of 59) HPC Advisory Council Workshop

2 Outline: GPU computing, rCUDA goals, rCUDA structure, rCUDA implementations, rCUDA current status.


4 GPU computing. It will be the near future in HPC. In fact, it is already here!

5 GPU computing. It has been the first massively parallel hardware. For the right kind of code, GPU computing brings huge benefits in terms of performance and energy. Development tools and libraries facilitate the use of the GPU.

6 GPU computing. Two main approaches in GPU computing development environments: CUDA (NVIDIA proprietary) and OpenCL (open standard).

7 GPU computing. OpenCL and CUDA follow basically the same work scheme. Compilation: CPU code and GPU code (the GPU kernel) are compiled separately. Running: the CPU and the GPU have separate memory spaces, so data must be transferred between them. Before kernel execution, data moves from the CPU memory space to the GPU memory space. Computation is the kernel execution. After kernel execution, results move from the GPU memory space back to the CPU memory space.
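The three running phases on this slide (copy inputs in, run the kernel, copy results out) can be sketched in plain Python. This is only a simulation under stated assumptions: SimulatedDevice, its methods and the vector_add kernel are hypothetical names, and no real GPU or CUDA call is involved. The point is that the device works on its own copies of the data.

```python
import array


class SimulatedDevice:
    """Hypothetical stand-in for a GPU: a separate memory space plus a launcher."""

    def __init__(self):
        self.memory = {}  # handle -> buffer living in "device memory"

    def copy_to_device(self, handle, host_data):
        # Phase 1: the data is copied, so host and device do not share storage.
        self.memory[handle] = array.array("f", host_data)

    def launch(self, kernel, out_handle, *in_handles):
        # Phase 2: the kernel only sees buffers that live in device memory.
        inputs = [self.memory[h] for h in in_handles]
        self.memory[out_handle] = array.array("f", kernel(*inputs))

    def copy_to_host(self, handle):
        # Phase 3: results are copied back to the host memory space.
        return list(self.memory[handle])


def vector_add(a, b):
    """Hypothetical kernel: element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def run_vector_add(a, b):
    dev = SimulatedDevice()
    dev.copy_to_device("a", a)
    dev.copy_to_device("b", b)
    dev.launch(vector_add, "c", "a", "b")
    return dev.copy_to_host("c")
```

Every byte the kernel touches crosses the host/device boundary twice, which is why the later slides concentrate on data movement overhead.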

8 GPU computing. Not all algorithms benefit from GPU power. In some cases only part of a program must be run on a GPU. Depending on the algorithm, the GPU can be idle for long periods.

9 GPU computing. You can find two different scenarios. Scenario 1: if all your programs are going to use the GPU for long periods, add a GPU to each node. You don't need our tool.

10 GPU computing. You can find two different scenarios. Scenario 2: only part of your programs are going to use the GPU, or all your programs use the GPU, but only part-time. You could think of adding a GPU only to some nodes. OUR TOOL CAN HELP YOU!

11 GPU computing. Cost from the energy point of view. NVIDIA Tesla S2050: near 900 W (TDP specification). Usage: 75%, so 25% is wasted. Then, for each node (approx.): 160 kWh are wasted per month and 2 MWh are wasted per year. That could be several hundred kg of CO2 per year.
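The figures on this slide can be reproduced with a short calculation. The sketch below assumes a 30-day month and that the idle fraction draws the full TDP, which is how the slide appears to count. The CO2 emission factor is a purely hypothetical placeholder, since the slide gives none.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def wasted_energy_kwh(tdp_watts, idle_fraction, hours):
    """Energy drawn while idle, in kWh (assumes idle power equals TDP)."""
    return tdp_watts * idle_fraction * hours / 1000.0


# Slide values: about 900 W TDP, 75% usage, so 25% idle.
monthly_kwh = wasted_energy_kwh(900, 0.25, HOURS_PER_MONTH)  # 162.0 kWh
yearly_kwh = 12 * monthly_kwh                                # 1944.0 kWh, about 2 MWh

# Hypothetical emission factor; real values depend on the electricity mix.
ASSUMED_KG_CO2_PER_KWH = 0.3
yearly_co2_kg = yearly_kwh * ASSUMED_KG_CO2_PER_KWH
```

With these inputs the monthly and yearly results round to the 160 kWh and 2 MWh quoted on the slide, and the assumed emission factor gives a CO2 figure in the range of several hundred kg per year.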


13 rCUDA. Add only the GPU computing nodes that give the necessary computational power.

14 rCUDA. rCUDA provides remote access from each node to any GPU in the system.


16 rCUDA structure. CUDA application: the application runs on top of the CUDA driver + runtime.

17 rCUDA structure. Client side: application, CUDA driver + runtime. Server side: application, CUDA driver + runtime.


18 rCUDA structure. Client side: application, rCUDA library, network interface. Server side: rCUDA daemon, network interface, CUDA driver + runtime.
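The client/server split on this slide can be illustrated with a toy request forwarder. This is a minimal sketch, not rCUDA's actual protocol: the length-prefixed JSON wire format, the SERVER_FUNCTIONS table and the function names are all assumptions made for illustration, and a local socket pair stands in for the network.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    """Send one length-prefixed JSON message (hypothetical wire format)."""
    payload = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length))


# Hypothetical operations the daemon can run on behalf of a client.
SERVER_FUNCTIONS = {
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
}


def daemon(sock):
    """Server side: receive one request, run it locally, send the result back."""
    request = recv_msg(sock)
    result = SERVER_FUNCTIONS[request["function"]](*request["args"])
    send_msg(sock, {"result": result})


def remote_call(sock, function, *args):
    """Client side: forward the call instead of executing it locally."""
    send_msg(sock, {"function": function, "args": list(args)})
    return recv_msg(sock)["result"]


def demo():
    client_sock, server_sock = socket.socketpair()
    worker = threading.Thread(target=daemon, args=(server_sock,))
    worker.start()
    try:
        return remote_call(client_sock, "vector_add", [1, 2], [3, 4])
    finally:
        worker.join()
        client_sock.close()
        server_sock.close()
```

The application only ever calls remote_call, in the same way that an application linked against the client-side library never talks to the device directly.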

19 rCUDA functionality. CUDA programming consists of C extensions and a runtime library. C extensions: not supported in the current version of rCUDA. We don't want to rewrite a compiler (for now). Runtime library: support for almost all functions. For some internal functions, NVIDIA does not provide information (not supported in rCUDA).

20 rCUDA functionality. Supported CUDA 4.0 Runtime Functions

Module                         Functions  Supported
Device management
Error handling                 3          3
Event management               7          7
Execution control              7          7
Memory management
Peer device memory access      5          4
Stream management              2          2
Surface reference management   8          8
Texture reference management   6          6
Thread management              6          6
Version management             2          2

21 rCUDA functionality. NOT YET Supported CUDA 4.0 Runtime Functions

Module                         Functions  Supported
Unified addressing             11         0
Peer device memory access      3          0
OpenGL interoperability        3          0
Direct3D 9 interoperability    5          0
Direct3D 10 interoperability   5          0
Direct3D 11 interoperability   5          0
VDPAU interoperability         4          0
Graphics interoperability      6          0

22 rCUDA functionality. Supported CUBLAS Functions

Module                         Functions  Supported
Helper function reference
BLAS-1
BLAS-2
BLAS-3


24 rCUDA: basic TCP/IP version. Characteristics: uses the TCP/IP stack. It is a basic version to show the functionality and to estimate the overhead due to the communication network. Runs over all TCP/IP networks: Ethernet, InfiniBand, etc.

25 rCUDA: basic TCP/IP version. Example of rCUDA interaction: rCUDA initialization. [Diagram: timeline between the client application and the server daemon across the network. Recoverable labels: query, locate and send kernel, load kernel, return result, data transfer (send and receive).]

26 rCUDA: basic TCP/IP version. Example of rCUDA interaction: cudaMemcpy(..., cudaMemcpyHostToDevice); Client application: copy data from the application to the send buffers, then send the buffers to the server. Server daemon: copy data from the receive buffers to the daemon buffers, then copy the data to GPU memory. [Diagram: timeline of the data transfer (send and receive) between client and server.]

27 rCUDA: basic TCP/IP version. Main problem: data movement overhead. On CUDA this overhead is due to the PCIe transfer. On rCUDA this overhead is due to the network transfer plus the PCIe transfer (which also appears in CUDA).

28 rCUDA: basic TCP/IP version. Data transfer time for matrix-matrix multiplication (GEMM): 2 data matrices from the client to the remote GPU, 1 result matrix from the remote GPU to the client. [Plot: time (msec) versus matrix dimension over 10Gb Ethernet; series: rCUDA, CUDA.]
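The data volume behind this plot follows from the matrix dimension, and a first-order transfer time follows from the link bandwidth. The sketch below is a rough model, not the measurement method used in the talk: it assumes square single-precision matrices (4 bytes per element), ignores latency and protocol overhead, and takes both bandwidths as caller-supplied hypothetical values.

```python
def gemm_transfer_bytes(n, element_size=4):
    """Bytes moved for an n x n GEMM: two input matrices in, one result out.

    element_size=4 assumes single precision; the slides do not state it.
    """
    return 3 * n * n * element_size


def transfer_seconds(num_bytes, bandwidth_mb_per_s):
    """Ideal transfer time at a sustained bandwidth, ignoring latency."""
    return num_bytes / (bandwidth_mb_per_s * 1e6)


def estimate(n, network_mb_per_s, pcie_mb_per_s):
    """Return (cuda_seconds, rcuda_seconds) of pure data movement.

    CUDA pays only the PCIe hop; rCUDA pays the network hop and the PCIe hop.
    """
    size = gemm_transfer_bytes(n)
    cuda = transfer_seconds(size, pcie_mb_per_s)
    rcuda = cuda + transfer_seconds(size, network_mb_per_s)
    return cuda, rcuda
```

For example, estimate(4096, network_mb_per_s=1250, pcie_mb_per_s=5000) models a link with the nominal 10 Gb/s line rate and an assumed PCIe rate. As the network bandwidth grows, the rCUDA estimate approaches the CUDA one.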

29 rCUDA: basic TCP/IP version. Execution time for matrix-matrix multiplication. Platform: Tesla C1060, Intel Xeon E5410 2.33 GHz, 10Gb Ethernet. [Plot: time (sec) versus matrix dimension; legend entries: CPU, kernel execution, data transfer, rCUDA kernel and data transfer, rCUDA data transfer, rCUDA misc.]

30 rCUDA: basic TCP/IP version. Estimated execution time for matrix multiplication, including data transfers, for some HPC networks. [Plot: time (sec) versus matrix dimension; series: CPU, Gbit Ethernet, 10Gbit InfiniBand, 40Gbit InfiniBand.]

31 rCUDA: basic TCP/IP version. We have shown the functionality (almost all CUDA SDK examples have been tested). As we decrease the network overhead, our solution will approach the performance of the CUDA solution.

32 rCUDA: InfiniBand version. Why? InfiniBand is the most used HPC network: low latency, high bandwidth. As shown, good results are expected...

33 rCUDA: InfiniBand version. InfiniBand version facts: use of IB-Verbs. All the overhead of the TCP/IP software stack is gone. The bandwidth test of our IB network is about MB/sec. Our goal is to get near this peak bandwidth.

34 rCUDA: InfiniBand version. But... the bandwidth is far from the peak. We want to be closer to the peak IB bandwidth. What can we do? Reduce the data movements between memory buffers. Overlap the memory accesses with the network communication.

35 rCUDA: optimized InfiniBand version. Same user-level functionality. Client to/from remote GPU bandwidth near the peak of the InfiniBand network bandwidth. Use of GPUDirect: reduce the number of memory copies. Use of pipelined transfers: overlap memory copies and communications.

36-39 rCUDA: GPUDirect. Without GPUDirect, the data takes three steps: GPU to main memory, main memory to main memory, main memory to network. [Diagram: GPU memory, CPU, chipset, main memory and InfiniBand adapter, with the three steps shown one per slide.]

40-42 rCUDA: GPUDirect. With GPUDirect, the data takes two steps: GPU to main memory, main memory to network. A memory copy is avoided. [Diagram: GPU memory, CPU, chipset, main memory and InfiniBand adapter.]

43-46 rCUDA: pipelined transfers. [Diagram: a client node and a server node linked by InfiniBand, each with CPU, chipset and main memory; the server node also has GPU memory.] Without pipelined transfers, the steps run one after another: copy to network buffers (client), data transfer (network), copy to GPU (server).

47-51 rCUDA: pipelined transfers. With pipelined transfers, the same three steps (copy to network buffers on the client, data transfer on the network, copy to GPU on the server) overlap: while one block is being copied to the GPU, the next one is already in transit and a third is being copied to the network buffers. [Diagram: staggered timeline of the three steps across client, network and server.]
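The benefit of overlapping the three steps can be estimated with the standard pipeline timing formula. The sketch below is a simplified model, not a measurement of rCUDA: it assumes the data is split into equal blocks, each step takes a fixed time per block, and each step handles one block at a time. The function names and the stage times in the example are hypothetical.

```python
def sequential_time(stage_times, chunks):
    """No overlap: every block goes through all steps before the next starts."""
    return chunks * sum(stage_times)


def pipelined_time(stage_times, chunks):
    """Overlapped steps: the first block fills the pipeline, then the slowest
    step decides how often a block completes."""
    return sum(stage_times) + (chunks - 1) * max(stage_times)


# Hypothetical per-block times (ms): copy to network buffers, network
# transfer, copy to GPU.
stages = [1.0, 2.0, 1.0]
no_overlap = sequential_time(stages, chunks=4)  # 4 * 4.0 = 16.0 ms
overlap = pipelined_time(stages, chunks=4)      # 4.0 + 3 * 2.0 = 10.0 ms
```

With many blocks the pipelined time tends toward the number of blocks times the slowest step, so the memory copies are effectively hidden behind the network transfer whenever the network is the bottleneck.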

52 rCUDA: optimized InfiniBand version. Bandwidth for matrix-matrix product 4096x. [Bar chart: bandwidth (MB/sec) over 40Gb InfiniBand; bars: rCUDA GigaE, rCUDA IPoIB, rCUDA IBVerbs, CUDA; reference line: IB peak bandwidth, 2900 MB/sec.]

53 rCUDA: optimized InfiniBand version. Time for matrix-matrix product (4096x4096). Platform: GeForce GTX, Intel Xeon E5645. [Bar chart: time (sec), axis from 0.00 to 2.50; bar labels: rCUDA IPoIB, rCUDA GigaE, CUDA, rCUDA IBVerbs, CPU (MKL); bar values: 2.28, 1.30, 0.70, 0.65, 0.62.]


55 rCUDA: work in progress. rCUDA port to Microsoft. Thread-safe rCUDA. rCUDA support for CUDA 4.0. Support for CUDA C/C++ extensions. rOpenCL.

56 rCUDA: near future. Support for iWARP communications. Dynamic remote GPU scheduling. Workload balance. Remote data cache. Remote kernels cache.

57 rCUDA: more information.
GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization and High-Performance Cloud Computing, VHPC'09.
rCUDA: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM.
Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011 (accepted).

58 rCUDA: credits. Parallel Architectures Group, Technical University of València (Spain). High Performance Computing and Architectures Group, University Jaume I of Castelló (Spain).

59 rCUDA. Thanks to and for their hardware donation for the development of this work. MORE INFORMATION: POSTER SESSION (Tuesday 21 and Wednesday 22). Thanks for your attention. Questions?
