An approach to provide remote access to GPU computational power


1 An approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. Joint research effort. 1/84

2 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 2/84

3 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 3/84

4 GPU computing. GPU computing covers all the technological issues (hardware and software) involved in using the GPU computational power to execute general-purpose code. This leads to a heterogeneous system. GPU computing has grown considerably in recent years. 4/84

5 GPU computing. Nov 2008, TOP500 list: the first supercomputer on the TOP500 (#29) using GPU computing, at the Tokyo Institute of Technology. 5/84

6 GPU computing. TOP500 June 2011 (Rank, Site, Computer, Vendor): 1 RIKEN Advanced Institute for Computational Science (AICS), Japan - K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011, Fujitsu. 2 National Supercomputing Center in Tianjin, China - Tianhe-1A, NUDT TH MPP, Xeon X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010, NUDT. 3 DOE/SC/Oak Ridge National Laboratory, United States - Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz / 2009, Cray Inc. 4 National Supercomputing Centre in Shenzhen (NSCS), China - Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 / 2010, Dawning. 5 GSIC Center, Tokyo Institute of Technology, Japan - TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows / 2010, NEC/HP. 6/84

7 GPU computing. TOP500 June 2011: same top-5 list as the previous slide, note that three of the five systems (Tianhe-1A, Nebulae and TSUBAME 2.0) use NVIDIA GPUs. 7/84

8 GPU computing. Green500 June 2011 (Rank, Site, Computer): 1 IBM Thomas J. Watson Research Center - NNSA/SC Blue Gene/Q Prototype 2. 2 IBM Thomas J. Watson Research Center - NNSA/SC Blue Gene/Q Prototype 2. 3 Nagasaki University - DEGIMA Cluster, Intel i5, ATI Radeon GPU, InfiniBand QDR. 4 GSIC Center, Tokyo Institute of Technology - HP ProLiant SL390s G7, Xeon 6C, NVIDIA GPU, Linux/Windows. 5 CINECA / SCS Supercomputing Solution - iDataPlex DX360M3, Xeon 2.4, NVIDIA GPU, InfiniBand. 8/84

9 GPU computing. Green500 June 2011: same top-5 list as the previous slide, note that GPU-accelerated systems take three of the five places. 9/84

10 GPU computing. GPUs have been the first commodity massively parallel processors. For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy. Development tools have been introduced to ease the programming of GPUs. 10/84

11 GPU computing. Basic construction node: GPUs inside the box. (Diagram: a node with CPU, main memory and GPUs in the same box.) 11/84

12 GPU computing. Basic construction node: GPUs outside the box. (Diagram: a node with CPU and main memory, with the GPUs attached in an external box.) 12/84

13 GPU computing. From the programming point of view: a set of nodes, each one with one or more CPUs (with several cores per CPU) and one or more GPUs (1-4), plus an interconnection network. (Diagram: several nodes, each with CPU, main memory and GPUs, linked by the interconnection network.) 13/84

14 GPU computing. Two main approaches in GPU computing development environments: CUDA (NVIDIA proprietary) and OpenCL (open standard). 14/84

15 GPU computing. Basically, OpenCL and CUDA have the same work scheme. Compilation: separate CPU code and GPU code (the kernel). 15/84

16 GPU computing. Basically, OpenCL and CUDA have the same work scheme. Running: data transfers between the CPU and GPU memory spaces. 1. Before kernel execution: data from the CPU memory space to the GPU memory space. 2. Computation: kernel execution. 3. After kernel execution: results from the GPU memory space to the CPU memory space. 16/84
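A minimal CUDA sketch of these three steps (illustrative vector-addition kernel, not taken from the slides):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];          /* one element per thread: data parallelism */
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = malloc(bytes), *h_b = malloc(bytes), *h_c = malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    /* 1. Before kernel execution: CPU memory space -> GPU memory space */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 2. Computation: kernel execution on the GPU */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* 3. After kernel execution: GPU memory space -> CPU memory space */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```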

17 GPU computing. What does the right kind of code mean? There must be data parallelism in the code: this is the only way to take advantage of the hundreds of processing cores in a GPU. 17/84

18 GPU computing. What does the right kind of code mean? There must be a limited overhead due to data movement between the CPU memory space and the GPU memory space. 18/84

19 GPU computing. What does the right kind of code mean? Influence of data transfers for SGEMM. (Chart: percentage of time devoted to data transfers vs. matrix size, for pinned and non-pinned host memory.) Source: Matrix computations on graphics processors and clusters of GPUs, Francisco D. Igual Peña, PhD dissertation. 19/84
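The pinned-memory effect shown in the chart can be reproduced with a small experiment; the following sketch (matrix size and timing code are illustrative assumptions) times the same host-to-device copy from pageable and from page-locked host memory:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Times one host-to-device copy and returns the elapsed milliseconds. */
static float time_h2d(const void *src, void *dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    size_t bytes = (size_t)4096 * 4096 * sizeof(float);   /* one SGEMM-sized matrix */
    void *d_buf, *pinned, *pageable = malloc(bytes);
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&pinned, bytes);                        /* page-locked (pinned) host memory */

    printf("pageable: %.2f ms\n", time_h2d(pageable, d_buf, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(pinned,   d_buf, bytes));

    cudaFreeHost(pinned); free(pageable); cudaFree(d_buf);
    return 0;
}
```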

20 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 20/84

21 GPU computing scenarios. Different scenarios from the point of view of the application: low amount of data parallelism; high amount of data parallelism; moderate amount of data parallelism; applications for multi-GPU computing. 21/84

22 GPU computing scenarios. Low amount of data parallelism: the application has only a small part where data parallelism can be extracted. BAD for GPU computing. No GPU is needed in the system; just proceed with the traditional HPC strategies. 22/84

23 GPU computing scenarios. High amount of data parallelism: a lot of data parallelism can be extracted from every application. GOOD for GPU computing. Add as many GPUs as possible to each node in the system and rewrite the applications to use them. 23/84

24 GPU computing scenarios. Moderate amount of data parallelism: the application has a moderate level of data parallelism (around 40%-80%). What about GPU computing? If every node in the system includes GPUs, these GPUs are used only when data parallelism appears in some part of the application. The rest of the time the GPUs are idle, which is an extra cost in both acquisition and maintenance (energy). 24/84

25 GPU computing scenarios. Applications for multi-GPU computing: an application can use a large number of GPUs in parallel. What about GPU computing? The code running in a node can only access the GPUs in that node, but it could run faster if it were possible to access more GPUs. 25/84

26 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 26/84

27 Introduction to rCUDA. A tool that enables code running in one node to access GPUs in other nodes. It is useful when you have: a moderate level of data parallelism; applications for multi-GPU computing. 27/84

28 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: every node has its own GPUs.) Adding a set of GPUs to each node leads to having a set of GPUs idle for long periods. This is a waste of money and energy. 28/84

29 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: only some nodes have GPUs.) Add only the GPUs that can actually be used, considering the applications and their amount of data parallelism, and... 29/84

30 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: logical interconnections from every node to the GPUs.) Add only the GPUs that can actually be used, considering the applications and their amount of data parallelism, and make all of them accessible from every node. 30/84

31 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: each node with its own GPUs.) From each node it is only possible to access the GPUs attached to that node. 31/84

32 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: logical interconnections from every node to all GPUs.) Make all GPUs accessible from every node. 32/84

33 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: logical interconnections from every node to all GPUs.) Make all GPUs accessible from every node and enable access from one node to as many GPUs as necessary. 33/84

34 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 34/84

35 rCUDA structure. (Diagram: in plain CUDA, the application runs on top of the CUDA driver + runtime, which drives the local GPU.) 35/84

36 rCUDA structure. (Diagram: the stack is split into a client side, which runs the CUDA application, and a server side, which holds the CUDA driver + runtime and the GPU.) 36/84

37 rCUDA structure. Client side: the CUDA application is linked against the rCUDA library, which takes the place of the CUDA runtime. Server side: the rCUDA daemon sits on top of the CUDA driver + runtime and accesses the GPU devices. 37/84

38 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 38/84

39 rCUDA functionality. CUDA programming consists of C extensions and the runtime library. C extensions: not supported in the current version of rCUDA; we are working on it. Runtime library: support for almost all functions; for some internal functions, NVIDIA does not give information (not supported in rCUDA). 39/84
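To make the distinction concrete, here is a minimal sketch (hypothetical kernel, not taken from the slides): the <<<...>>> launch syntax is a CUDA C extension expanded by nvcc, while calls such as cudaMalloc and cudaMemcpy belong to the runtime library that rCUDA can intercept and forward.

```c
#include <cuda_runtime.h>

__global__ void scale(float *v, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= alpha;
}

void run(float *host, int n) {
    float *dev;
    size_t bytes = n * sizeof(float);

    /* Runtime-library calls: plain C functions that a wrapper library can forward. */
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    /* C extension: the <<<grid, block>>> launch is expanded by nvcc into
       internal runtime calls, which is what makes it harder to virtualize. */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```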

40 rCUDA functionality. Supported CUDA 4.0 runtime functions (module: functions / supported): Device management; Error handling (3/3); Event management (7/7); Execution control (7/7); Memory management; Peer device memory access (3/3); Stream management (5/5); Surface reference management (2/2); Texture reference management (8/8); Thread management (6/6); Unified addressing (1/1); Version management. 40/84

41 rCUDA functionality. NOT YET supported CUDA 4.0 runtime functions (module: functions / supported): OpenGL interoperability (4/0); Direct3D 9 interoperability (5/0); Direct3D 10 interoperability (5/0); Direct3D 11 interoperability (5/0); VDPAU interoperability (4/0); Graphics interoperability. 41/84

42 rCUDA functionality. Supported CUBLAS functions. (Table: number of functions and number supported for the helper function reference and the BLAS1, BLAS2 and BLAS3 modules.) 42/84

43 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 43/84

44 Basic TCP/IP version. Proof of concept: it uses the TCP/IP stack. It is a basic version to show the functionality and to estimate the overhead due to the communication network. It runs over all TCP/IP networks: Ethernet, InfiniBand, etc. 44/84

45 Basic TCP/IP version. Example of rCUDA interaction: initialization. (Diagram: the client application queries the GPU software; the rCUDA library locates the kernel and sends it to the server daemon, which loads the kernel on the GPU and returns the result.) 45/84

46 Basic TCP/IP version. Example of rCUDA interaction: cudaMemcpy(..., cudaMemcpyHostToDevice). (Diagram: the client application sends the data to the server daemon, which copies it to the GPU memory and returns the result.) 46/84
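The following sketch illustrates the general idea of forwarding a memory copy to a remote daemon over a TCP socket; the message layout, names and opcode are assumptions for illustration, not rCUDA's actual protocol.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical message header; the real rCUDA wire format is not reproduced here. */
enum { RPC_MEMCPY_H2D = 1 };
struct rpc_header {
    uint32_t op;         /* which CUDA call is being forwarded           */
    uint64_t dev_ptr;    /* destination address in the remote GPU memory */
    uint64_t size;       /* payload size in bytes                        */
};

static int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Client-side stub: instead of touching a local GPU, ship the copy request. */
int remote_memcpy_h2d(int sock, uint64_t dev_ptr, const void *src, size_t size) {
    struct rpc_header h = { RPC_MEMCPY_H2D, dev_ptr, size };
    if (send_all(sock, &h, sizeof(h)) != 0) return -1;
    return send_all(sock, src, size);    /* the server performs the real cudaMemcpy */
}
```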

47 Basic TCP/IP version. Main problem: data movement overhead. On CUDA this overhead is due to PCIe data transfers. On rCUDA this overhead is due to PCIe data transfers plus network data transfers. 47/84

48 Basic TCP/IP version. Data transfer time for matrix-matrix multiplication (GEMM): two data matrices from the client to the remote GPU and one result matrix from the remote GPU back to the client. (Chart: transfer time in msec vs. matrix dimension, comparing rCUDA over 10Gb Ethernet with local CUDA.) 48/84

49 Basic TCP/IP version. Execution time for matrix-matrix multiplication (GEMM) on a Tesla C1060 with an Intel Xeon E5410 at 2.33 GHz, over 10Gb Ethernet. (Chart: execution time in sec vs. matrix dimension, broken down into kernel execution, data transfers and rCUDA misc operations.) 49/84

50 Basic TCP/IP version. Estimated execution time for matrix multiplication, including data transfers, for some HPC networks. (Chart: time in sec vs. matrix dimension for Gb Ethernet, Gb InfiniBand and local CUDA.) 50/84

51 Basic TCP/IP version. The functionality has been shown: almost all CUDA SDK examples have been tested. If the network overhead can be minimized, a remote rCUDA device will have performance close to a local CUDA device. 51/84

52 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 52/84

53 InfiniBand version. Why an InfiniBand version? InfiniBand is the most used HPC network, with low latency and high bandwidth. (Chart: interconnect share in the TOP500 list of June 2011: InfiniBand QDR, InfiniBand DDR, InfiniBand DDR 4x, Gigabit Ethernet, proprietary, custom and other interconnects.) 53/84

54 InfiniBand version. Why an InfiniBand version? InfiniBand is the most used HPC network, with low latency and high bandwidth, so good results are expected. (Chart: SGEMM execution time in sec vs. matrix dimension for rCUDA over Gb InfiniBand and local CUDA.) 54/84

55 InfiniBand version. InfiniBand version facts: it uses IB Verbs, so all the TCP/IP stack overhead is avoided. Our goal is to get as close as possible to the network peak performance. A bandwidth test of our IB network gives about 2900 MB/sec. 55/84

56 InfiniBand version. Same user-level functionality. Bandwidth between the client and the remote GPU close to the peak InfiniBand network bandwidth. Use of GPUDirect to reduce the number of intra-node data movements. Use of pipelined transfers to overlap intra-node data movements and network transfers. 56/84

57 InfiniBand version. Intra-node data movement, basic method: two different main memory zones are needed, one associated with the GPU and one associated with the InfiniBand network card. (Diagram: GPU, chipset, main memory, processor and InfiniBand adapter.) 57/84

58 InfiniBand version. Intra-node data movement, basic method, step 1: copy data from the GPU memory to the main memory buffer associated with the GPU. 58/84

59 InfiniBand version. Intra-node data movement, basic method, step 2: copy data between the two main memory buffers. 59/84

60 InfiniBand version. Intra-node data movement, basic method, step 3: send data from the main memory buffer associated with the network card. 60/84

61 InfiniBand version. Intra-node data movement, basic method: three data movements have been needed. 61/84

62 InfiniBand version. Intra-node data movement with GPUDirect: only ONE main memory zone is needed. This zone is bound to both the GPU and the network device. 62/84

63 InfiniBand version. Intra-node data movement with GPUDirect, step 1: copy data from the GPU memory to the main memory. 63/84

64 InfiniBand version. Intra-node data movement with GPUDirect, step 2: send data from the main memory. 64/84

65 InfiniBand version. Intra-node data movement with GPUDirect: only TWO data movements have been needed. 65/84
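A minimal sketch of the single shared memory zone (assumptions: a previously created InfiniBand protection domain, and driver support for sharing the pinned buffer between the CUDA runtime and the verbs layer, as in GPUDirect v1):

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

/* One host buffer, page-locked for the GPU and registered with the HCA,
   so the same zone serves both the CUDA copy and the InfiniBand send. */
void *alloc_shared_zone(struct ibv_pd *pd, size_t bytes, struct ibv_mr **mr_out) {
    void *buf;
    /* Pinned allocation usable by the CUDA runtime for fast DMA transfers. */
    if (cudaHostAlloc(&buf, bytes, cudaHostAllocPortable) != cudaSuccess)
        return NULL;
    /* Register the same buffer with the InfiniBand device. */
    *mr_out = ibv_reg_mr(pd, buf, bytes,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (*mr_out == NULL) { cudaFreeHost(buf); return NULL; }
    return buf;
}
```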

66 InfiniBand version. Standard data transfers between nodes. (Diagram: rCUDA client and rCUDA server nodes, each with its chipset, main memory and InfiniBand adapter, and the GPU on the server side.) 66/84

67 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers.) 67/84

68 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers and then sends it over InfiniBand.) 68/84

69 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers, sends it, and finally the server copies it to the GPU; the three stages run one after the other.) 69/84

70 InfiniBand version. Pipelined data transfers. (Timeline diagram: the data is split into chunks; the client copies the first chunk to the network buffer.) 70/84

71 InfiniBand version. Pipelined data transfers. (Timeline diagram: successive chunks are copied to the network buffer.) 71/84

72 InfiniBand version. Pipelined data transfers. (Timeline diagram: the copy of the next chunk to the network buffer starts.) 72/84

73 InfiniBand version. Pipelined data transfers. (Timeline diagram: the send of a chunk overlaps with the copy of the next chunk to the network buffer.) 73/84

74 InfiniBand version. Pipelined data transfers. (Timeline diagram: chunk copies and sends overlap, and the server starts copying received chunks to the GPU.) 74/84

75 InfiniBand version. Pipelined data transfers. (Timeline diagram: copies to the network buffer, sends and server-side copies to the GPU proceed concurrently.) 75/84

76 InfiniBand version. Pipelined data transfers. (Timeline diagram: the pipeline is full; every stage works on a different chunk at the same time.) 76/84

77 InfiniBand version. Pipelined data transfers. (Timeline diagram: copies to the network buffer, sends and copies to the GPU overlap; the highlighted span is the overhead for transferring data to the remote node.) 77/84
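A minimal sketch of the pipelining idea (chunk size, buffer handling and the send_chunk() helper are assumptions, not rCUDA's implementation): the GPU-to-staging-buffer copy of one chunk overlaps with the network send of the previous chunk, as on the server side of a cudaMemcpy(..., cudaMemcpyDeviceToHost).

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Assumed helper: pushes one staged chunk onto the wire (e.g. an IB verbs post). */
void send_chunk(const void *buf, size_t bytes);

/* Stream a device buffer to the network in CHUNK-sized pieces, overlapping
   the GPU->host copy of the next chunk with the send of the current one. */
void pipelined_d2h_send(const char *dev_src, size_t total, cudaStream_t stream) {
    enum { CHUNK = 1 << 20 };            /* 1 MiB staging chunks (assumed size)   */
    static char *staging[2];             /* two pinned buffers, used in ping-pong */
    if (!staging[0]) {
        cudaHostAlloc((void **)&staging[0], CHUNK, cudaHostAllocDefault);
        cudaHostAlloc((void **)&staging[1], CHUNK, cudaHostAllocDefault);
    }

    size_t off = 0, pending = 0;
    int cur = 0, have_pending = 0;
    while (off < total || have_pending) {
        if (off < total) {               /* start copying the next chunk */
            size_t n = (total - off < CHUNK) ? total - off : CHUNK;
            cudaMemcpyAsync(staging[cur], dev_src + off, n,
                            cudaMemcpyDeviceToHost, stream);
            off += n;
            if (have_pending)            /* overlap: send the previous chunk now */
                send_chunk(staging[cur ^ 1], pending);
            cudaStreamSynchronize(stream);
            pending = n; have_pending = 1; cur ^= 1;
        } else {                         /* drain the last staged chunk */
            send_chunk(staging[cur ^ 1], pending);
            have_pending = 0;
        }
    }
}
```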

78 InfiniBand version. Bandwidth for a 4096 x 4096 single-precision matrix. (Chart: bandwidth in MB/sec for rCUDA over Gigabit Ethernet, rCUDA over IPoIB, rCUDA over IB Verbs and local CUDA; the InfiniBand peak bandwidth is about 2900 MB/sec.) 78/84

79 InfiniBand version. Execution time for matrix-matrix multiplication (dim = 4096) on a GeForce 9800 GTX with an Intel Xeon E5645. (Chart: times between 0.62 and 2.28 sec for rCUDA over IPoIB, rCUDA over Gigabit Ethernet, local CUDA, rCUDA over IB Verbs and the CPU with MKL.) 79/84

80 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 80/84

81 Work in progress: dynamic remote GPU scheduling; port to Microsoft Windows; full support for CUDA 4.0; support for the CUDA C/C++ extensions; applying the approach to OpenCL. 81/84

82 Near future: support for iWARP communications; workload balancing; remote data cache; remote kernel cache. 82/84

83 More information. GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization in High-Performance Cloud Computing, VHPC 2009. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM 2010. Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011. Enabling CUDA acceleration within virtual machines using rCUDA. J. Duato, A. J. Peña, F. Silla, J. C. Fernández, R. Mayo, E. S. Quintana. High Performance Computing Conference, HiPC 2011. 83/84

84 People: Antonio Peña, José Duato, Federico Silla, Enrique S. Quintana-Ortí, Rafael Mayo. Thanks to Mellanox and AIC for their support of this work. 84/84
