First Lessons Learned from Using GPUs for Structural Mechanics Applications



2 Outline Work carried out within the OpenGPU project with the support of DGCIS. Implicit method: solution of sparse linear systems. Explicit method: Smoothed Particle Hydrodynamics (SPH). Copyright ESI Group, All rights reserved.


3 Multi-frontal Solver and CUBLAS One of the major workhorses of VPS implicit is the multifrontal direct linear solver (MUMPS). By design, the multifrontal method operates on dense submatrices for performance: GEMM and TRSM BLAS Level 3 kernels, sometimes with a large number of right-hand sides (RHS). In VPS, the main focus is on double precision real and complex operands. What happens on some industrial test cases if we simply use the CUBLAS library provided by NVIDIA?

4 CUBLAS (3.2) Level 3 Performance Performance measured on a C2070 (ECC on), including data transfers. GEMM and [SY,HE]RK are optimized; little has been done for the performance of the other Level 3 BLAS routines. The same holds for the other precisions (D, C and Z). TRSM is important for solves with multiple RHS. [Figure: Single Precision Level 3 CUBLAS. Gflop/s versus problem size for SGEMM, SSYMM, SSYRK, SSYR2K, STRMM and STRSM.]

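5 Recursive GEMM based Level 3 BLAS Recursive formulation of the TRSM operation. With the lower triangular matrix $A$ and the right-hand sides $B$ partitioned as

$$A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix},$$

the solve proceeds in three steps:

$$B_1 := A_{11}^{-1} B_1 \quad \text{(TRSM)}$$
$$B_2 := B_2 - A_{21} B_1 \quad \text{(GEMM)}$$
$$B_2 := A_{22}^{-1} B_2 \quad \text{(TRSM)}$$

The native (slower) TRSM is used on the leaves of the recursion tree and the (fast) GEMM everywhere else. The method can be applied to all Level 3 (and Level 2) operations.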
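The recursive TRSM formulation on this slide can be sketched in a few lines. This is a minimal pure-Python illustration, not the CUBLAS-based implementation used in the study: the function names (`gemm_sub`, `trsm_leaf`, `trsm_recursive`) and the `leaf` cutoff are hypothetical, and it assumes a left-side solve with a lower triangular matrix stored as a list of rows.

```python
def gemm_sub(C, A, B):
    """C := C - A @ B, in place (stand-in for a fast GEMM call)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] -= sum(A[i][k] * B[k][j] for k in range(len(B)))


def trsm_leaf(L, B):
    """B := L^{-1} B by forward substitution (stand-in for the native TRSM)."""
    for i in range(len(L)):
        for j in range(len(B[0])):
            s = B[i][j] - sum(L[i][k] * B[k][j] for k in range(i))
            B[i][j] = s / L[i][i]


def trsm_recursive(L, B, leaf=2):
    """Solve L X = B for lower triangular L; X overwrites B."""
    n = len(L)
    if n <= max(leaf, 1):          # small block: fall back to the native solve
        trsm_leaf(L, B)
        return
    h = n // 2
    L11 = [row[:h] for row in L[:h]]
    L21 = [row[:h] for row in L[h:]]
    L22 = [row[h:] for row in L[h:]]
    B1, B2 = B[:h], B[h:]          # slices share their row lists with B
    trsm_recursive(L11, B1, leaf)  # B1 := L11^{-1} B1   (TRSM)
    gemm_sub(B2, L21, B1)          # B2 := B2 - L21 B1   (GEMM)
    trsm_recursive(L22, B2, leaf)  # B2 := L22^{-1} B2   (TRSM)
```

As the leaf size shrinks relative to the matrix, a growing share of the floating-point work lands in the GEMM update, which is why the recursive variant can approach GEMM speed on large problems.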

6 (Recursive) CUBLAS Level 3 Performance The recursive routines asymptotically achieve GEMM performance. [Figures: Gflop/s versus problem size for DGEMM (original), DTRSM (original), DTRSM (recursive), ZGEMM (original), ZSYMM (recursive), ZHEMM (recursive), ZSYR2K (recursive), ZHER2K (recursive), ZTRMM (recursive) and ZTRSM (recursive).] [SY,HE] rank-2k updates should be implemented by a GEMM call followed by a triangular in-place copy-add. The recursive algorithm should be used until there is enough memory to apply the above algorithm.
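The "GEMM call followed by a triangular in-place copy-add" idea for rank-2k updates can be illustrated as follows. This is a hedged pure-Python sketch for the real symmetric case (SYR2K, lower triangle), assuming the standard BLAS definition C := alpha*(A B^T + B A^T) + beta*C; the function names are hypothetical and the explicit n x n workspace `W` is what makes the memory requirement visible.

```python
def gemm_nt(A, B):
    """Return W = A @ B^T (stand-in for one fast GEMM call)."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]


def syr2k_lower(alpha, A, B, beta, C):
    """Lower triangle of C := alpha*(A B^T + B A^T) + beta*C, in place."""
    W = gemm_nt(A, B)              # n x n workspace holding A B^T
    for i in range(len(C)):
        for j in range(i + 1):     # triangular copy-add: W + W^T, lower part only
            C[i][j] = beta * C[i][j] + alpha * (W[i][j] + W[j][i])
```

Since B A^T is the transpose of A B^T, a single GEMM is enough; the price is the extra n x n workspace, which is why a recursive fallback is still needed when memory is tight.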

7 VPS Implicit: Non-Linear Static Test Case Double precision real, 1 RHS. 12 numerical factorizations and 12 solves. Problem size = [missing], non-zero terms = [missing]. Speed-up: 20% over 1 Nehalem core. [Figure: time in minutes, CPU versus CPU-GPU, for Total, Matrix and Solver.]

8 VPS-Implicit: NVH Frequency Response Double precision complex, 1258 RHS. 25 numerical factorizations and 175 solves. Problem size = [missing], non-zero terms = [missing]. Speed-up: 2x over 1 Nehalem core. [Figure: Internal Acoustics. Time in minutes, CPU versus CPU-GPU, for Total, Matrix and Solver.]

9 Conclusions A naïve (no overlap of data transfers and computation) recursive GEMM-based implementation was necessary to handle a large number of RHS efficiently. The library approach makes the GPU particularly easy to use within complex applications; the performance gain, however, remains limited. More work is necessary to get better speedups for sparse direct solvers on GPUs.
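One way to read the limited overall gain is through Amdahl's law: only the fraction of the run time spent in the offloaded dense kernels benefits from the GPU. The sketch below is a generic illustration, not an analysis performed in this study; the function name and the numbers in the usage note are hypothetical.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only part of it is accelerated.

    accelerated_fraction: share of the original run time spent in offloaded kernels (0..1).
    kernel_speedup: speedup obtained on that share alone (> 0).
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)
```

For example, with a hypothetical 40% of the time in offloaded kernels and a 5x kernel speedup, `overall_speedup(0.4, 5.0)` is about 1.47, far below the kernel-level figure.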

10 SPH The granularity of SPH computations makes it a method of choice for GPU computing. Single precision real arithmetic. Most of the computation is evenly distributed over (only) 3 hot-spots comprising 5 routines in total. The reported execution times include data transfers to the card (no overlap). Comparison of computing times between 1 Nehalem W5590 core and one Nvidia Fermi card (C2070, 6 GB of RAM). Industrial case: vehicle driving through water ([missing] points, [missing] particles, [missing] plates).

11 CUDA kernels for one hot-spot [Table: Simulation (ms), GPU (s), CPU (s), Gain (%); values missing.] The speedup seems to increase slightly with the simulation time. [Figure: Elapsed time CPU - GPU (1). Elapsed time (s) versus simulation time (ms), GPU and CPU.]

12 Estimation for 3 hot-spots [Table: Simulation (ms), GPU (s), CPU (s), Gain (%); values missing.] Data re-use (fewer data transfers) as the number of kernels increases should lead to an even better speedup. [Figure: Elapsed time CPU - GPU - estimation. Elapsed time (s) versus simulation time (s), GPU and CPU.]

13 CUDA kernels for 3 hot-spots [Table: Simulation (ms), GPU (s), CPU (s), Gain (%); values missing.] The number of registers is constant, so the thread block size had to be reduced for the kernels to run successfully, at the cost of performance. The size of the kernel argument list is limited: 256 bytes on compute capability 1.x (C1060), 4 KB on 2.0 (C2070).
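The two hardware limits mentioned on this slide can be checked with a small back-of-the-envelope helper. This is a simplified sketch: the per-multiprocessor register counts and maximum block sizes are the published figures for compute capability 1.3 and 2.0, the argument limits are those quoted above, and the helper names are hypothetical. It ignores register allocation granularity and argument alignment padding, so real limits can be somewhat tighter.

```python
WARP_SIZE = 32

# Per compute capability: 32-bit registers per multiprocessor,
# maximum threads per block, kernel argument list limit in bytes.
LIMITS = {
    "1.3": {"regs_per_sm": 16384, "max_threads": 512, "arg_bytes": 256},    # e.g. C1060
    "2.0": {"regs_per_sm": 32768, "max_threads": 1024, "arg_bytes": 4096},  # e.g. C2070
}


def max_block_size(regs_per_thread, cc):
    """Largest warp-multiple block size that fits in the register file."""
    lim = LIMITS[cc]
    threads = min(lim["regs_per_sm"] // regs_per_thread, lim["max_threads"])
    return (threads // WARP_SIZE) * WARP_SIZE


def args_fit(arg_sizes, cc):
    """True if the summed argument sizes (bytes) stay within the limit."""
    return sum(arg_sizes) <= LIMITS[cc]["arg_bytes"]
```

A kernel using many registers per thread quickly caps the block size, and a long list of 8-byte pointers can overflow the 256-byte limit on 1.x devices while still fitting on 2.0; packing arguments into a structure stored in device memory is one common workaround.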

14 Conclusions and future work SPH: very promising for GPU computing; the kernels still need work to achieve their potential. Hybrid GPU(s)-CPU computing: to be investigated. Other explicit-method topics to investigate: Finite Pointset Method (FPM), internal force computation, contact mechanics, experiments on clusters of GPUs (MPI+OpenMP+GPUs). Evaluation of tools for kernel generation: HMPP, PGI.
