GPU Implementation of Elliptic Solvers in NWP. Numerical Weather- and Climate- Prediction


1/8 GPU Implementation of Elliptic Solvers in Numerical Weather- and Climate-Prediction
Eike Hermann Müller, Robert Scheichl
Department of Mathematical Sciences
EHM, Xu Guo, Sinan Shi and RS: Bath HPC Symposium, Jun 4th 2013

2/8 Why fast elliptic solvers?
COMPUTATIONAL CHALLENGES IN NUMERICAL WEATHER- AND CLIMATE PREDICTION
- Substantial increase in global model resolution: 25 km $\to$ few km, giving $10^{10}$ degrees of freedom per atmospheric variable
- Model runtime: 1 hour for 5 day forecast (Unified Model)
Key requirement
- Model needs to scale to 10... cores
- Use emerging architectures (GPUs, Intel MIC)
- Repeatedly solve elliptic PDE for pressure correction with $10^{10}$ dof in 1 second
PDEs with similar structure in flat geometries arise in
- Ocean modelling
- Subsurface flow simulations
- Oil- and gas reservoir modelling

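3/8 Equation and solvers
DISCRETISATION
3d PDE for pressure correction $u(x, y, z)$ $\to$ linear algebra problem

$$
\begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots \\
A_{2,1} & \ddots  &        \\
\vdots  &         & A_{n,n}
\end{pmatrix}
\begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}
\quad\Longleftrightarrow\quad
Au = b
$$

Properties of the $n \times n$ matrix $A$
- large $n$ in realistic applications
- sparse, defined by local stencil
Key question: recompute or store stencil? Depends on architecture and FLOPs per memory reference.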
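The "recompute or store" question can be made concrete with a small matrix-free operator application. The following is a minimal sketch, not the authors' code: it assumes a hypothetical constant-coefficient 7-point stencil for a model operator $c\,u - a_h(u_{xx} + u_{yy}) - a_v u_{zz}$ on a unit grid with zero values outside the domain. The names `idx`, `apply_stencil`, `ah`, `av` and `c` are illustrative and do not appear in the slides.

```python
def idx(i, j, k, ny, nz):
    """Linear index of grid point (i, j, k); the vertical index k runs fastest."""
    return (i * ny + j) * nz + k


def apply_stencil(x, nx, ny, nz, ah=1.0, av=100.0, c=1.0):
    """Return y = A x for a 7-point stencil without storing the matrix A.

    Assumed model operator (hypothetical): c*u - ah*(u_xx + u_yy) - av*u_zz,
    with av >> ah mimicking strong vertical coupling. Stencil entries are
    recomputed on the fly instead of being read from memory.
    """
    y = [0.0] * (nx * ny * nz)
    diag = c + 4.0 * ah + 2.0 * av
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                s = diag * x[idx(i, j, k, ny, nz)]
                # Horizontal neighbours
                if i > 0:
                    s -= ah * x[idx(i - 1, j, k, ny, nz)]
                if i < nx - 1:
                    s -= ah * x[idx(i + 1, j, k, ny, nz)]
                if j > 0:
                    s -= ah * x[idx(i, j - 1, k, ny, nz)]
                if j < ny - 1:
                    s -= ah * x[idx(i, j + 1, k, ny, nz)]
                # Vertical neighbours
                if k > 0:
                    s -= av * x[idx(i, j, k - 1, ny, nz)]
                if k < nz - 1:
                    s -= av * x[idx(i, j, k + 1, ny, nz)]
                y[idx(i, j, k, ny, nz)] = s
    return y
```

A stored-matrix variant would read seven coefficients per grid point in addition to the vector entries; here only `x` and `y` touch memory, which is the trade-off the key question refers to.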

4/8 Solvers and parallelisation
PARALLEL SOLVER
Preconditioned Conjugate Gradient (PCG) method
(Gaussian elimination is $O(n^3)$ and impossible for large $n$)
Iteratively improve solution: $u^{(0)} \to u^{(1)} \to u^{(2)} \to \dots \to u = A^{-1} b$
Key components (kernels in GPU implementation)
- SpMV: apply matrix stencil, $y \leftarrow Ax$
- Preconditioner: solve 1d tridiagonal system in vertical columns
- Level 1 BLAS operations: $y \leftarrow y + \alpha x$, $\kappa \leftarrow \|r\|_2$, ...
GPU parallelisation (CUDA)
- Assign 1 thread to each vertical column
- Multiple GPU implementation (in progress): parallel domain decomposition implemented in GCL library [Mauro Bianco et al., CSCS]
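The preconditioner step, a tridiagonal solve in each vertical column, can be illustrated with the standard Thomas algorithm. This is a sketch under the assumption that a plain sequential forward-elimination and back-substitution is what one thread would run for its column; the slides do not name the algorithm, and the function name `solve_tridiagonal` and its argument layout are choices made here.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row k reads: lower[k]*x[k-1] + diag[k]*x[k] + upper[k]*x[k+1] = rhs[k].
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix is assumed to be diagonally dominant or symmetric positive definite.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination, sweeping up the column
    for k in range(1, n):
        denom = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / denom
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / denom
    # Back substitution, sweeping down the column
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x
```

Both sweeps are inherently sequential in the vertical index, which is why the natural parallelism in this setting is across columns rather than within one.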

5/8 GPU implementation
EFFICIENT MATRIX FREE PCG SOLVER ON GPU
with Xu Guo, Sinan Shi [EPCC], Eero Vainikko, Kait Kasak [Tartu, Estonia]
- Keep data on GPU: host transfer once per solve
- nvidia M2090: 30 FLOPs per FP read from global memory
  - Seriously memory bound, minimise global memory access
  - High gains from not storing matrix stencil explicitly
- Preconditioner: sequential tridiagonal solver in vertical, 1 thread per column
  - 1 kB per column and variable: use global memory and coalesce access by horizontally aligning data
- Keep data in cache by fusing kernels
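What "horizontally aligning data" means for coalescing can be seen from the index arithmetic alone. The sketch below compares two hypothetical storage layouts for a field with `ncol` vertical columns of `nz` levels each; the function names are illustrative. With one thread per column, all threads work on the same level `k` at the same time, so the relevant quantity is the address distance between neighbouring columns at fixed `k`.

```python
def index_vertical_fastest(col, k, ncol, nz):
    """Layout in which each column is contiguous in memory."""
    return col * nz + k


def index_horizontal_fastest(col, k, ncol, nz):
    """Layout in which each horizontal level is contiguous in memory."""
    return k * ncol + col


def stride_between_threads(index_fn, k, ncol, nz):
    """Address distance between the elements read by neighbouring threads
    (thread t handles column t) when both are at vertical level k."""
    return index_fn(1, k, ncol, nz) - index_fn(0, k, ncol, nz)
```

In the horizontally aligned layout the stride is 1, so neighbouring threads touch consecutive addresses and their loads can be combined; in the column-contiguous layout the stride is `nz`, which scatters the accesses of a warp across memory.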

6/8 Bath HPC
HOW DO WE USE AQUILA?
Code development and performance measurements, no production runs
We are different from most other aquila users
- Handwritten code (+ libraries: CUBLAS, CUSPARSE, ...)
- Stepping stone for bigger machines (HECToR, EMERALD)
Job characteristics
- Large core counts (up to 768)
- Very short jobs (1-10 minutes)
- Negligible IO and storage requirements
- (multi) GPU: aquila has 2 3 nvidia M2090 Fermi GPUs!
Requirements
- Very fast turnaround crucial (queue times < 1 min or interactive)
- Good support for installing software etc. essential

7/8 Performance
TOTAL PERFORMANCE (all times in seconds)
- Problem size on nvidia M2090 Fermi
- Sequential code on Intel Sandybridge CPU [AVX disabled]
- 25% to 50% of peak global memory bandwidth
[Table: setup, data transfer and total time, with speedups, for the C and CUDA implementations (CSR matrix-explicit, matrix-free, matrix-free fused)]

8/8 Outlook: Further improvements and GPU clusters
WORK IN PROGRESS
with Kait Kasak, Eero Vainikko [Tartu]
- Parallelise tridiagonal solver in vertical direction to reduce memory per thread: substructuring / cyclic reduction. Expected speedup: 2 to 4 (global memory BW saturation)
- Replace PCG by geometric multigrid. Expected speedup: 4 (experience with CPU code)
GPU clusters
- Halo exchange: $T(\mathrm{comm}) \ll T(\mathrm{calc})$; MPI+PCIe or GPU direct?
  $t_{\mathrm{lat}} + 4n\, t_{\mathrm{word}} \ll n^2\, t_{\mathrm{calc}} \;\Rightarrow\; n \gg 1$
- Existing libraries: GCL [Mauro Bianco et al., CSCS]
- Time on EMERALD


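10/8 Fused PCG algorithm
PCG main loop
for k = 1, maxiter do
    $r \leftarrow r - \alpha q$                          Update residual
    $z \leftarrow P^{-1} r$                              Apply preconditioner [Kernel 1]
    $\|r\| \leftarrow \sqrt{\langle r, r \rangle}$       Calculate norm of residual
    $\kappa \leftarrow \langle r, z \rangle$
    if ($\|r\| / \|r_0\| < \epsilon$ or $\|r\| < \tau$) then exit
    $\beta \leftarrow \kappa / \kappa_{\mathrm{old}}$, $\kappa_{\mathrm{old}} \leftarrow \kappa$
    $u \leftarrow u + \alpha p$                          Update solution
    $p \leftarrow z + \beta p$                           Update search direction
    $q \leftarrow Az + \beta q$                          Apply operator (SpMV) [Kernel 2]
    $\sigma \leftarrow \langle p, q \rangle$
    $\alpha \leftarrow \kappa_{\mathrm{old}} / \sigma$
end do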
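The reordered loop can be checked against a small test problem. The following is a minimal sketch in plain Python rather than CUDA, with no kernel fusion; it only follows the update order of the slide. Three things are assumptions made here and not taken from the slides: the initialisation before the loop (zero initial guess, $p = z$, $q = Ap$), the default tolerances, and moving the solution update ahead of the exit test so that the returned `u` matches the residual that passed the test. The operator and preconditioner are passed in as functions, and the names `pcg`, `apply_A` and `apply_Pinv` are illustrative.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


def pcg(apply_A, apply_Pinv, b, eps=1e-10, tau=1e-14, maxiter=200):
    """Preconditioned CG with deferred updates: q = A p is maintained by the
    recurrence q <- A z + beta*q, so A is applied once per iteration."""
    u = [0.0] * len(b)
    r = list(b)  # residual b - A u for the zero initial guess
    r0_norm = math.sqrt(dot(r, r))
    if r0_norm == 0.0:
        return u, 0
    z = apply_Pinv(r)
    p = list(z)
    q = apply_A(p)
    kappa_old = dot(r, z)
    alpha = kappa_old / dot(p, q)
    for k in range(1, maxiter + 1):
        r = [ri - alpha * qi for ri, qi in zip(r, q)]  # update residual
        z = apply_Pinv(r)                              # apply preconditioner
        r_norm = math.sqrt(dot(r, r))
        kappa = dot(r, z)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]  # update solution (old p)
        if r_norm / r0_norm < eps or r_norm < tau:
            return u, k
        beta = kappa / kappa_old
        kappa_old = kappa
        p = [zi + beta * pi for zi, pi in zip(z, p)]   # update search direction
        Az = apply_A(z)                                # apply operator (SpMV)
        q = [ai + beta * qi for ai, qi in zip(Az, q)]  # q = A p by linearity
        alpha = kappa_old / dot(p, q)
    return u, maxiter


# Usage on a hypothetical 1d model problem: A = tridiag(-1, 2, -1), Jacobi preconditioner.
def apply_A(x):
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def apply_Pinv(r):
    return [ri / 2.0 for ri in r]


b = apply_A([1.0, 2.0, 3.0, 4.0, 5.0])
u, iters = pcg(apply_A, apply_Pinv, b)
```

The recurrence for `q` is valid because $p_{\mathrm{new}} = z + \beta p_{\mathrm{old}}$ implies $A p_{\mathrm{new}} = Az + \beta q_{\mathrm{old}}$, which is what allows the search-direction update and the operator application to share one pass over the data.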

11/8 Performance
TIME PER ITERATION
Problem size on nvidia M2090 Fermi
[Table: number of memory references (matrix-free) per kernel, by caching level (no, mat., col.), for PCG (SpMV, prec, BLAS, total) and fused PCG (SpMV, prec, total)]
[Figure: time [ms] per iteration, split into SpMV, preconditioner, BLAS nrm2, BLAS dot, BLAS scal and BLAS axpy, for matrix-explicit, matrix-free and matrix-free fused]
