GPU/CPU Work Sharing on GPU Clusters with the XcalableMP Acceleration Device Extension and StarPU

Tetsuya Odajima 1,a)  Raymond Namyst 3  Samuel Thibault 3  Olivier Aumage 3


Abstract: GPU clusters pair accelerators with multi-core CPUs, but accelerated programs typically leave the CPU cores idle while the GPUs compute. XMP-dev, the acceleration device extension of the PGAS language XcalableMP (XMP), offloads distributed arrays and loops to the GPUs of each node. This report describes XMP-dev/StarPU, a prototype that re-implements the XMP-dev runtime on top of the StarPU runtime system so that the chunks of an offloaded loop can be scheduled over both the GPUs and the CPU cores of each node. A preliminary evaluation with an N-body kernel shows that GPU/CPU hybrid execution outperforms GPU-only execution.

1. Introduction

The Graphics Processing Unit (GPU) is now widely used for general-purpose computation (GPGPU). With programming environments such as NVIDIA CUDA (Compute Unified Device Architecture) [1] and OpenCL [2], GPUs have become common accelerators in HPC systems, from PC clusters to large-scale machines. Programming a GPU cluster, however, usually requires combining MPI or OpenMP with CUDA, which is complex and error-prone. Moreover, while a GPU offers high peak performance, the multi-core CPUs with SIMD units on each node also provide substantial compute power that is wasted when the host merely drives the GPU.

The PGAS (Partitioned Global Address Space) language XcalableMP [3] (XMP) provides directive-based programming for distributed-memory systems, and its acceleration device extension XMP-dev [4] offloads distributed data and loops to GPUs. So far XMP-dev uses only the GPUs, leaving the CPU cores idle during offloaded computation. In this work we port the XMP-dev runtime onto StarPU [5], a runtime system developed at INRIA that schedules tasks over heterogeneous GPU/CPU resources, and realize GPU/CPU work sharing within the unchanged XMP-dev programming model.

1  Graduate School of System and Information Engineering, University of Tsukuba
2  Center for Computational Sciences, University of Tsukuba
3  Bordeaux Sud-Ouest INRIA research center
a) odajima@hpcs.cs.tsukuba.ac.jp

2. XcalableMP and XMP-dev

2.1 XcalableMP

XcalableMP (XMP) [6] is a directive-based PGAS language for distributed-memory systems, designed in the spirit of OpenMP while generating MPI-level communication. Directives of the form #pragma xmp describe data distribution and work mapping: nodes declares the set of executing nodes, template declares a virtual index space, distribute maps a template onto the nodes, align maps an array onto a template, and loop maps the iterations of a for loop onto the template so that each node executes the iterations touching its local part of the array. Figure 1 shows a simple XMP program in which an array x of N elements is block-distributed over four nodes, together with the resulting layout of x over node1 to node4.

    int x[N];
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align x[i] with t(i)

    int main() {
        int i;
    #pragma xmp loop on t(i)
        for (i = 0; i < N; i++)
            x[i] = func(i);
    }

Figure 1: A simple XMP program: x is block-distributed over nodes p(1)-p(4) and the loop iterations follow the distribution.

For neighbor communication, XMP provides the shadow and reflect directives (a small sketch follows at the end of this section). shadow declares halo regions on the boundaries of each node's block of a distributed array, and reflect fills them by peer-to-peer communication with the neighboring nodes. Declaring a full shadow replicates the entire array on every node; in that case reflect is implemented with MPI_Allgather.

2.2 XMP-dev

XMP-dev [4] is the acceleration device extension of XMP. It adds device directives that replicate distributed arrays onto the accelerator memory of each node and offload distributed loops from the host CPU to the device.
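To make the shadow/reflect mechanism of Section 2.1 concrete, the following is a minimal sketch of a 1-D stencil (our illustration, not code from the report); the arrays a and b and the halo width of one element are assumptions:

    #define N 1024
    double a[N], b[N];
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)
    #pragma xmp align b[i] with t(i)
    #pragma xmp shadow a[1:1]   /* one halo element on each side of a's local block */

    void smooth() {
        int i;
    #pragma xmp reflect (a)     /* peer-to-peer halo exchange with neighbor nodes */
    #pragma xmp loop on t(i)
        for (i = 1; i < N - 1; i++)
            b[i] = 0.5 * (a[i - 1] + a[i + 1]);  /* boundary reads hit the shadow */
    }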

The XMP-dev compiler translates each offloaded loop into a device kernel, CUDA [4] or OpenCL [9], while keeping the MPI-level communication generated for ordinary XMP loops, so an existing XMP program is ported to a GPU cluster by inserting a few device directives. XMP-dev provides the following directives.

#pragma xmp device replicate (list)
    Allocates the listed distributed arrays on the device memory of each node; the device copies are freed at the end of the replicate region.

#pragma xmp device replicate_sync sync_clause, where sync_clause := in (list) | out (list)
    The in clause copies the listed arrays from host to device; the out clause copies them back from device to host.

#pragma xmp device loop
    Placed immediately before a for loop, offloads the loop to the device. Work mapping follows the same on clause semantics as the XMP loop directive.

Figure 2 shows an XMP-dev program and its per-node execution flow: device memory allocation (Data Allocate), host-to-device copy (Data Copy H->D), execution on the device, and device-to-host copy (Data Copy D->H).

    int x[N], y[N];
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align [i] with t(i) :: x, y

    int main() {
        int i;
    #pragma xmp loop on t(i)
        for (i = 0; i < N; i++) {
            x[i] = func(i);
            y[i] = func(i);
        }
    #pragma xmp device replicate (x, y)
        {
    #pragma xmp device replicate_sync in (x, y)
    #pragma xmp device loop on t(i)
            for (i = 0; i < N; i++)
                y[i] += x[i];
    #pragma xmp device replicate_sync out (y)
        }
    }

Figure 2: An XMP-dev program and its execution flow (Data Allocate -> Data Copy H->D -> Execution on Device -> Data Copy D->H; H: Host, D: Device).

3. StarPU

StarPU [7], [8] is a runtime system developed at INRIA that schedules tasks over heterogeneous processing units, such as multi-core CPUs, the Cell Broadband Engine, and NVIDIA CUDA GPUs, and manages the data transfers those tasks require.

3.1 Codelet

The unit of computation in StarPU is a task described by a codelet. A codelet bundles one implementation of the same computation per architecture, and the scheduler picks the appropriate one when it assigns the task to a CPU core or a GPU:

    starpu_codelet cl = {
        .where     = STARPU_CPU | STARPU_CUDA,
        .cpu_func  = cpu_function,
        .cuda_func = cuda_function,
        .nbuffers  = 0
    };
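For concreteness, here is a sketch (ours, not code from the report) of the two entry points such a codelet can point to, written against the StarPU API of that period and assuming a variant of cl with .nbuffers = 1 that scales a vector; the kernel scal_kernel is an assumption:

    #include <starpu.h>

    void cpu_function(void *buffers[], void *cl_arg)
    {
        /* StarPU hands the task the buffer it was registered with. */
        float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= 2.0f;
    }

    #ifdef STARPU_USE_CUDA
    __global__ void scal_kernel(float *v, unsigned n)
    {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            v[i] *= 2.0f;
    }

    void cuda_function(void *buffers[], void *cl_arg)
    {
        float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        /* Launch on the CUDA stream StarPU associates with this worker. */
        scal_kernel<<<(n + 255) / 256, 256, 0, starpu_cuda_get_local_stream()>>>(v, n);
        cudaStreamSynchronize(starpu_cuda_get_local_stream());
    }
    #endif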

3.2 Data management in StarPU

StarPU manages all data accessed by tasks through opaque descriptors called data handles (starpu_data_handle). Once a memory region is registered, StarPU keeps its copies in CPU and GPU memories coherent, transferring data automatically according to the access mode each task declares (read only, write only, read-write). When the application itself needs to touch registered data, it brackets the access with starpu_data_acquire and starpu_data_release.

A typical StarPU program registers its arrays with starpu_data_register, splits each handle into chunks with starpu_data_partition, and submits one task per chunk with starpu_task_submit; the scheduler distributes the chunks over the CPU cores and GPUs. After the tasks complete, starpu_data_unpartition gathers the chunks and starpu_data_unregister returns the data to the application. Figure 3 illustrates this flow.

Figure 3: StarPU data management flow (starpu_data_register -> starpu_data_partition -> starpu_task_submit -> starpu_data_unpartition -> starpu_data_unregister).

The performance gap between the processing units is large: a single CPU core delivers only a few GFLOPS, while an NVIDIA Tesla M2090 reaches 665 GFLOPS of peak double-precision performance and the following Kepler generation exceeds 1 TFLOPS. The StarPU scheduler must therefore account for this gap when it distributes chunks over CPU cores and GPUs.
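The register/partition/submit sequence of Figure 3 can be sketched as follows (our illustration, written against the pre-1.0 StarPU API of that period, and assuming a variant of the Section 3.1 codelet cl with .nbuffers = 1):

    #include <starpu.h>
    #define N       (1 << 20)
    #define NCHUNKS 16

    extern starpu_codelet cl;   /* codelet from Section 3.1, with .nbuffers = 1 */
    static float x[N];

    int run(void)
    {
        starpu_data_handle handle;
        starpu_init(NULL);

        /* Hand the array over to StarPU. */
        starpu_vector_data_register(&handle, 0, (uintptr_t)x, N, sizeof(x[0]));

        /* Split it into NCHUNKS equal sub-vectors. */
        struct starpu_data_filter f = {
            .filter_func = starpu_block_filter_func_vector,
            .nchildren   = NCHUNKS,
        };
        starpu_data_partition(handle, &f);

        /* One task per chunk; the scheduler maps each to a CPU core or a GPU. */
        for (unsigned i = 0; i < NCHUNKS; i++) {
            struct starpu_task *task = starpu_task_create();
            task->cl = &cl;
            task->buffers[0].handle = starpu_data_get_sub_data(handle, 1, i);
            task->buffers[0].mode   = STARPU_RW;
            starpu_task_submit(task);
        }
        starpu_task_wait_for_all();

        /* Gather the chunks and return the data to the application. */
        starpu_data_unpartition(handle, 0);
        starpu_data_unregister(handle);
        starpu_shutdown();
        return 0;
    }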

4. XMP-dev/StarPU

4.1 Design

XMP-dev/StarPU replaces the device part of the XMP-dev runtime with StarPU. Where XMP-dev compiles an offloaded loop directly to CUDA [4] or OpenCL [9], XMP-dev/StarPU compiles it into a StarPU codelet carrying both a CPU and a CUDA implementation; inter-node communication remains the MPI code generated by XMP. Because StarPU hides where each task actually runs, the XMP-dev programming interface is unchanged: the same device directives now drive GPU/CPU hybrid execution.

4.2 Implementation

Figure 4 shows how XMP-dev/StarPU lays out a distributed array. As in XMP-dev/CUDA [4], an XMP global array aligned with a template is divided into a per-node Local array. XMP-dev/StarPU adds, on each node, a Replicate array that is registered with StarPU: the Local array is the host copy visible to the XMP-generated MPI code, and data moves between the Local array and the Replicate array under StarPU's acquire-release protocol. Conceptually, the Global array is the image of the Local arrays of node1, node2, and so on, each paired with its Replicate array.

Figure 4: Global array, per-node Local arrays, and StarPU-managed Replicate arrays in XMP-dev/StarPU.

The device directives are mapped onto StarPU as follows. Where XMP-dev/CUDA implements #pragma xmp device replicate by allocating device memory, XMP-dev/StarPU registers the Replicate array with StarPU and leaves allocation and placement to the runtime. #pragma xmp device replicate_sync, a host-device copy in XMP-dev/CUDA, becomes a copy between the Local array and the Replicate array: the in clause copies Local to Replicate, and the out clause copies Replicate back to Local. Finally, where XMP-dev/CUDA turns #pragma xmp device loop into a single CUDA kernel launch per node, XMP-dev/StarPU divides the iteration space of the for loop (and the Replicate arrays) into chunks and submits one StarPU task per chunk (starpu_task_submit); the scheduler then spreads the chunks over the GPUs and the CPU cores, as illustrated in Figure 5.

Figure 5: A device loop is split into chunks, each submitted as a StarPU task and scheduled on a GPU or a CPU core.
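A rough sketch (ours) of this lowering of device loop: the runtime walks the chunks of the partitioned handles and submits one task per chunk. The helper xmp_starpu_submit_loop and the two-buffer codelet loop_body_cl are hypothetical names, and the handles are assumed to have been partitioned into nchunks pieces:

    #include <starpu.h>

    extern starpu_codelet loop_body_cl;  /* CPU and CUDA loop bodies, .nbuffers = 2 */

    static void xmp_starpu_submit_loop(starpu_data_handle x, starpu_data_handle y,
                                       unsigned nchunks)
    {
        for (unsigned c = 0; c < nchunks; c++) {
            struct starpu_task *task = starpu_task_create();
            task->cl = &loop_body_cl;
            task->buffers[0].handle = starpu_data_get_sub_data(x, 1, c);
            task->buffers[0].mode   = STARPU_R;
            task->buffers[1].handle = starpu_data_get_sub_data(y, 1, c);
            task->buffers[1].mode   = STARPU_RW;
            starpu_task_submit(task);   /* asynchronous; the scheduler picks a unit */
        }
        starpu_task_wait_for_all();     /* the directive completes when all chunks do */
    }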

5. Performance Evaluation of XMP-dev/StarPU

5.1 Environment and benchmark

We evaluate the prototype on up to 4 nodes of the HA-PACS cluster [10]. Table 1 summarizes one node.

Table 1: Evaluation environment (HA-PACS)
    CPU            Intel Xeon E5-2670 x 2 (16 cores/node)
    Memory         DDR3 128 GB
    GPU            NVIDIA Tesla M2090 x 4 (6 GB memory per GPU)
    CUDA Toolkit   4.1
    MPI            OpenMPI 1.4.3
    Interconnect   InfiniBand 4x QDR
    # of nodes     4

The benchmark is an all-pairs N-body kernel in the style of the GPU implementation of [11]: for each particle i, the contributions of all particles j are accumulated, and the i loop is distributed over the nodes and offloaded with device loop. We compare XMP-dev/StarPU against the original CUDA-based implementation, XMP-dev/CUDA.

We first study the chunk size. Figure 6 plots execution time, normalized to XMP-dev/CUDA, against the number of chunks per node, for per-node sizes from 16k to 256k elements (totals of 64k to 1024k on 4 nodes) and for N = 104k and 208k. The best chunk size is consistently 8k (= 8192) particles; for example, a 32k-element local array is split into 32k/4 = 8k-element chunks. We therefore fix the chunk size at 8k in the following experiments, so a per-node size of 96k elements corresponds to 12 x 8k = 96k, i.e. 12 chunks per node.
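To indicate the shape of the benchmark, here is a minimal XMP-dev-style sketch of the N-body force loop (our illustration, not the exact benchmark code); the position arrays are simply duplicated on every node, and interact is an assumed helper:

    #define N (96 * 1024)
    extern double interact(double dx, double dy, double dz);

    double px[N], py[N], pz[N];   /* positions, duplicated on every node */
    double ax[N];                 /* accelerations (x component only, for brevity) */
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align ax[i] with t(i)

    void forces() {
        int i, j;
    #pragma xmp device replicate (px, py, pz, ax)
        {
    #pragma xmp device replicate_sync in (px, py, pz)
    #pragma xmp device loop on t(i)
            for (i = 0; i < N; i++) {          /* the i loop is split into chunks */
                double a = 0.0;
                for (j = 0; j < N; j++)        /* all-pairs interaction */
                    a += interact(px[i] - px[j], py[i] - py[j], pz[i] - pz[j]);
                ax[i] = a;
            }
    #pragma xmp device replicate_sync out (ax)
        }
    }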

5.2 GPU/CPU hybrid execution

We then enable work sharing, so that the chunks of each device loop are scheduled over the 4 GPUs and the CPU cores of every node. With N = 96k elements per node, GPU/CPU hybrid execution outperforms GPU-only execution, and the advantage is retained when the total problem is scaled up to N = 3M (96k x 8 x 4 = 3072k elements over 4 nodes) and to N = 2784k (= 96k + 96k x 7 x 4) on 4 nodes.

Figure 7: Execution time [sec] and speed-up of GPU/CPU hybrid over GPU-only execution (from N = 96k per node up to N = 3M on 4 nodes).

Because starpu_data_partition divides a handle into equal chunks while one CPU core is much slower than one GPU, equal chunks make the CPU cores stragglers. We therefore introduce a CPU Weight parameter that shrinks the amount of work a CPU core receives relative to a GPU. On this kernel the measured speed ratio of one CPU core to one GPU is 1/7.24, or about 0.138, which indicates the ideal weight. Figure 8 plots execution time against CPU Weight for N = 256k: with a well-chosen weight, around 0.2, the hybrid configuration clearly improves on XMP-dev/CUDA, while too large a weight makes the CPU cores the bottleneck.

Figure 8: Execution time [sec] vs. CPU Weight (N = 256k), compared with XMP-dev/CUDA.

The 8k chunk size also matches the GPU architecture: with 256 CUDA threads per block, an 8k-element chunk yields 8k/256 = 32 thread blocks, enough to cover the 16 Streaming Multiprocessors (SMs) of the Tesla M2090.
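A small sketch (ours, with hypothetical names) of the weight logic described above: a GPU receives full chunks, while a CPU core receives chunks scaled down by the CPU Weight:

    /* Hypothetical helper: shrink the chunk handed to a CPU core by the weight. */
    static unsigned cpu_chunk_size(unsigned gpu_chunk, double cpu_weight)
    {
        unsigned n = (unsigned)(gpu_chunk * cpu_weight);
        return n ? n : 1;   /* never hand out an empty chunk */
    }

    /* Example: with 8k-element GPU chunks and the measured ratio 0.138,
     * a CPU core gets cpu_chunk_size(8192, 0.138) = 1130 elements. */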

6. Related Work

PGI Accelerator Compilers [12] and HMPP Workbench [13] are directive-based compilers for accelerators: the PGI compiler generates NVIDIA CUDA code, while HMPP Workbench generates both CUDA and OpenCL. The OpenACC standard [14] consolidates this directive-based approach. These systems offload to the accelerator of a single node and do not by themselves share a distributed computation between GPUs and CPUs. Hybrid CPU/GPU implementations of GEMM have also been reported [15]. Agullo et al. [16] used StarPU to run dense linear algebra on a hybrid machine with Intel Nehalem X5550 CPUs and NVIDIA FX GPUs; our work differs in providing GPU/CPU work sharing at the level of a distributed-memory PGAS language, while staying compatible with XMP-dev/CUDA programs.

7. Conclusion

We presented XMP-dev/StarPU, a port of the XMP-dev runtime of the PGAS language XcalableMP onto the StarPU runtime system, which realizes GPU/CPU hybrid work sharing on GPU clusters without changing the XMP-dev programming interface. An N-body evaluation on 4 nodes of HA-PACS showed that hybrid execution outperforms the GPU-only XMP-dev/CUDA once the chunk size and the CPU Weight are tuned. Future work includes automatic tuning of these parameters and evaluation at larger scale on the full HA-PACS system.

References
[1]  NVIDIA: CUDA C Programming Guide. nvidia.com/nvidia-gpu-computing-documentation.
[2]  OpenCL.
[3]  XcalableMP.
[4]  Implementation of the acceleration device extension of the PGAS language XcalableMP (in Japanese). IPSJ Trans. ACS, Mar. 2012.
[5]  StarPU. http://runtime.bordeaux.inria.fr/StarPU/.
[6]  XcalableMP (in Japanese). IPSJ Trans. ACS, 2010.
[7]  C. Augonnet and R. Namyst: A Unified Runtime System for Heterogeneous Multi-core Architectures. In Euro-Par 2008 Workshops - Parallel Processing, 2008.
[8]  C. Augonnet, S. Thibault and R. Namyst: StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines. Concurrency Computat.: Pract. Exper., Mar. 2010.
[9]  T. Nomizu, D. Takahashi, J. Lee, T. Boku and M. Sato: Implementation of XcalableMP Device Acceleration Extension with OpenCL. In Multicore and GPU Programming Models, Languages and Compilers Workshop (co-located with IPDPS 2012), May 2012.
[10] HA-PACS. http://www.ccs.tsukuba.ac.jp/ccs/research/project/ha-pacs.
[11] A 160 GFLOPS GPGPU implementation (in Japanese), Oct.
[12] PGI Accelerator Compilers. .../SPG/Pgi/Accel/index.html.
[13] HMPP Workbench. www.caps-entreprise.com/hmpp.html.
[14] OpenACC. http://www.openacc.org/.
[15] Hybrid CPU/GPU GEMM (in Japanese).
[16] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault and S. Tomov: Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs. In GPU Computing Gems, Sep. 2010.

(c) 2012 Information Processing Society of Japan
