Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling


Slide 1 of 22
Sections: Iterative Solvers | Numerical Results | Conclusion and outlook

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Part II: GPU Implementation and Scaling on Titan

Eike Hermann Müller (a), Robert Scheichl (a)
Collaborators: Andreas Dedner (b), Markus Gross (c), Xu Guo (d), Benson Muite (e), Sinan Shi (d), Eero Vainikko (f)
(a) University of Bath, (b) University of Warwick, (c) Met Office, (d) EPCC, (e) KAUST, (f) University of Tartu

CCP-ASEArch Study Group, Daresbury, Jan 21st/22nd 2014

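Slide 2: Solving sparse linear systems on GPUs is challenging

Challenges for linear solvers
- 3d elliptic PDE for pressure correction $u(x, y, z)$. The slide text preserves the terms but not the leading sign or the exact placement of the $1/r^2$ factor, so the form below is schematic:
  \[ -\omega^2 \left\{ \xi_r\, \partial_r u + \partial_r\!\left(\alpha_r\, \partial_r u\right) + \frac{1}{r^2}\, \nabla_S \cdot \left(\alpha_S\, \nabla_S u\right) \right\} + \beta u \]
- Very large sparse linear algebra problem, $n \sim 10^{10}$:
  \[ \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots \\ A_{2,1} & \ddots & \\ \vdots & & A_{n,n} \end{pmatrix} \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}, \qquad Au = b \]
- Iterative solvers (key components: SpMV, tridiagonal solve)
- #FLOPs $\approx$ 2-5 $\times$ #memory references (matrix-free!), so $t_{\text{mem}} \gg t_{\text{flop}}$
- Application seriously memory bound on GPUs [Fermi M2090, Kepler GK110]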
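The memory-bound claim can be checked with a rough estimate that compares the time to move one double-precision value with the time to perform the 2-5 floating-point operations that use it. The sketch below is only an illustration: the function name is hypothetical, and the M2090 figures (665 GFLOP/s double precision, 177 GB/s memory bandwidth) are assumed nominal vendor values, not numbers taken from the slides.

```python
def mem_to_flop_time_ratio(flops_per_ref, bytes_per_ref, peak_flops, peak_bandwidth):
    """Return t(mem) / t(flop) for one memory reference.

    flops_per_ref  : floating-point operations per memory reference
    bytes_per_ref  : bytes moved per memory reference (8 for a double)
    peak_flops     : peak floating-point rate in FLOP/s (assumed value)
    peak_bandwidth : peak global memory bandwidth in bytes/s (assumed value)
    """
    t_flop = flops_per_ref / peak_flops
    t_mem = bytes_per_ref / peak_bandwidth
    return t_mem / t_flop


# Assumed nominal figures for an M2090 (not taken from the slides).
PEAK_FLOPS = 665e9
PEAK_BANDWIDTH = 177e9

ratio_low_intensity = mem_to_flop_time_ratio(2, 8, PEAK_FLOPS, PEAK_BANDWIDTH)
ratio_high_intensity = mem_to_flop_time_ratio(5, 8, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions both ratios are well above 1, which is the sense in which memory traffic, not arithmetic, limits the solver.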

Slide 3: Key achievements

Today
1. Optimal iterative solver algorithm
   - Conjugate Gradient and geometric multigrid
2. Efficient single-GPU implementation
   - Minimise global memory access, optimise cache reuse (kernel fusion)
   - Optimised matrix-free implementation 4× faster than matrix-explicit (CSR) CUSparse
   - Utilise 30%-60% of theoretical peak global memory bandwidth
3. Scalable multi-GPU parallelisation
   - GCL-based implementation on 16,384 GPUs (Titan, #2 on top500.org)
   - 0.67 PFLOP/s, about 1/30th of LINPACK performance; 0.5 trillion dof in 1 sec

[Figure: time in ms, split into SpMV, preconditioner, BLAS nrm2, BLAS dot, BLAS scal and BLAS axpy, for the matrix-explicit, matrix-free and fused matrix-free implementations.]
[Figure: FLOP rate (GigaFLOP to PetaFLOP scale) against number of GPUs for CG and multigrid, with single GPU peak and Titan theoretical peak marked. The slide also carries the label "Yesterday".]

Slide 4: Overview
1. Iterative solvers for anisotropic PDEs
   - Model equation
   - Conjugate Gradient and multigrid solvers
   - Matrix-free GPU implementation
2. Numerical Results
   - Single-GPU performance
   - Massively parallel scaling on EMERALD/Titan
3. Conclusion and outlook

Slide 5: Model equation and grid structure

Tensor product grid (cubed sphere)
[Figure: horizontal grid combined with a vertical column of levels 1, 2, 3, ..., $n_z - 1$, $n_z$.]
- Horizontal grid: one panel of cubed sphere
- Regular (graded) 1d vertical grid, $n_z = O(100)$
- $R_{\text{earth}}/H \sim 100$
- Grid-aligned anisotropy: $(\Delta x / \Delta z)^2 \gg 1$

Tensor product elliptic operator
\[ A = \omega^2 \left( M^{(r)} \otimes D_h^{(\text{horiz})} + \lambda^2\, D^{(r)} \otimes M_h^{(\text{horiz})} \right) + M^{(r)} \otimes M_h^{(\text{horiz})} \]
- Dominant part ⇒ preconditioner $P$, only vertical couplings
- $P$ can be inverted by an independent tridiagonal solve in each column
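The independent tridiagonal solve in each vertical column can be sketched with the standard Thomas algorithm. This is a minimal illustration, not the authors' code: the function names and the list-of-columns data layout are assumptions, and the same tridiagonal coefficients are assumed for every column.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: solve T x = rhs for a tridiagonal matrix T.

    lower[k] multiplies x[k-1] and upper[k] multiplies x[k+1];
    lower[0] and upper[-1] are never used.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: inherently sequential in the vertical index k.
    for k in range(1, n):
        denom = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / denom if k < n - 1 else 0.0
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x


def apply_preconditioner_inverse(lower, diag, upper, columns):
    """Apply P^{-1} column by column; the columns are independent of each other."""
    return [solve_tridiagonal(lower, diag, upper, col) for col in columns]
```

The loop over `k` carries a dependency from one level to the next, while the loop over columns does not, which is why the horizontal direction is the natural one to parallelise.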

Slide 6: Simplified model equation

Setup for all GPU runs
- Simplified model equation
  \[ -\omega^2 \left[ \Delta_S + \lambda^2\, \frac{1}{r^2} \frac{\partial}{\partial r} \left( r^2 \frac{\partial}{\partial r} \right) \right] u + u = f, \qquad \omega \propto \frac{\Delta t}{\Delta x} \]
- Solve on one panel of cubed sphere
- Factorising profiles

Slide 7: Iterative solvers

Preconditioned Conjugate Gradient
- Very simple iterative algorithm: $u^{(0)} \to u^{(1)} \to \dots \to u^{(k)}$
- SpMV: $y \leftarrow Ax$
- Preconditioner: $y \leftarrow P^{-1} x$ (vertical anisotropy)
- 60 iterations for $\epsilon = 10^{-5}$

Tensor-product multigrid
1. Only coarsen in horizontal direction
2. Smoother = vertical line relaxation (anisotropy): $x \leftarrow x + \rho_{\text{relax}}\, P^{-1} (b - Ax)$
- Number of iterations significantly (6×) smaller
- Time per iteration 2-3× larger
- Overall, multigrid is 2-3× faster than CG

[Figure: multigrid hierarchy with $A^{(h)} u^{(h)} = b^{(h)}$ on the fine grid and $A^{(2h)} e^{(2h)} = r^{(2h)}$, $A^{(4h)} e^{(4h)} = r^{(4h)}$, $A^{(8h)} e^{(8h)} = r^{(8h)}$ on the coarser grids.]
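The preconditioned Conjugate Gradient iteration named on this slide can be sketched as follows. This is the textbook algorithm written for plain Python lists, not the authors' GPU code; the function names, the callable interface for $A$ and $P^{-1}$, and the relative-residual stopping test are assumptions made for the illustration.

```python
import math


def dot(x, y):
    """Euclidean inner product of two equally long lists."""
    return sum(a * b for a, b in zip(x, y))


def pcg(apply_A, apply_Pinv, b, eps=1e-5, max_iter=200):
    """Solve A u = b with preconditioned CG, starting from u = 0.

    apply_A(x)    returns A x       (the SpMV step)
    apply_Pinv(x) returns P^{-1} x  (the preconditioner step)
    Stops when ||r|| <= eps * ||r_0||; returns (u, iterations used).
    """
    u = [0.0] * len(b)
    r = list(b)
    r0_norm = math.sqrt(dot(r, r))
    if r0_norm == 0.0:
        return u, 0
    z = apply_Pinv(r)
    p = list(z)
    kappa = dot(r, z)
    for k in range(1, max_iter + 1):
        q = apply_A(p)
        alpha = kappa / dot(p, q)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]  # update residual
        if math.sqrt(dot(r, r)) <= eps * r0_norm:
            return u, k
        z = apply_Pinv(r)                               # apply preconditioner
        kappa_new = dot(r, z)
        p = [zi + (kappa_new / kappa) * pi for zi, pi in zip(z, p)]
        kappa = kappa_new
    return u, max_iter
```

Each iteration needs one application of $A$, one application of $P^{-1}$ and a handful of level 1 BLAS operations, which is the list of kernels the implementation has to provide.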

Slide 8: Components

So what do we have to implement?

GPU kernels
- SpMV: apply matrix stencil, $y \leftarrow Ax$
- Preconditioner: solve 1d tridiagonal system in each vertical column, $y \leftarrow P^{-1} x$
- Multigrid smoother: $x \leftarrow x + \rho_{\text{relax}}\, P^{-1} (b - Ax)$
- Level 1 BLAS operations: $y \leftarrow y + \alpha x$, $\kappa \leftarrow \|r\|_2$, ...
- Intergrid operators (multigrid): $x^{(l)} \leftarrow R_{l,l+1}\, x^{(l+1)}$, ...
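The smoother update $x \leftarrow x + \rho_{\text{relax}} P^{-1}(b - Ax)$ reuses the SpMV and preconditioner kernels. A minimal sketch, with a hypothetical function name and with $A$ and $P^{-1}$ passed in as callables (an interface chosen for this illustration only):

```python
def smooth(apply_A, apply_Pinv, x, b, rho_relax, n_sweeps=1):
    """Relaxation sweeps x <- x + rho_relax * P^{-1} (b - A x).

    apply_A(x) returns A x and apply_Pinv(r) returns P^{-1} r.
    With P holding the vertical couplings this is vertical line relaxation.
    """
    for _ in range(n_sweeps):
        Ax = apply_A(x)
        residual = [bi - ai for bi, ai in zip(b, Ax)]
        correction = apply_Pinv(residual)
        x = [xi + rho_relax * ci for xi, ci in zip(x, correction)]
    return x
```

If $P$ were equal to $A$ and $\rho_{\text{relax}} = 1$, a single sweep would return the exact solution; with $P$ containing only part of $A$, each sweep damps part of the error and the coarse grids handle the rest.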

Slide 9: Matrix-free algorithm

Tensor-product operator
\[ A = \omega^2 \left( M^{(r)} \otimes D_h^{(\text{horiz})} + \lambda^2\, D^{(r)} \otimes M_h^{(\text{horiz})} \right) + M^{(r)} \otimes M_h^{(\text{horiz})} \]

Matrix-free implementation
- Recompute local matrix stencil ⇒ $O(n_{\text{horiz}} + n_z) \ll O(n_{\text{horiz}} \cdot n_z)$ memory references
- Number of vertical levels: $n_z$
- Small overhead from calculation of the horizontal couplings $D_h^{(\text{horiz})}$, $M_h^{(\text{horiz})}$ (and indirect addressing on unstructured grids)
- Precompute $D^{(r)}$, $M^{(r)}$ ⇒ 4 vectors of length $n_z$ (keep in cache)
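A matrix-free application of the tensor-product operator can be sketched on a small model grid. The simplifications here are assumptions for illustration and are not taken from the slides: the horizontal direction is one-dimensional with a unit mass matrix and a three-point stencil, values outside the grid are zero, and $D^{(r)}$ is stored as three vectors and $M^{(r)}$ as one (four vectors of length $n_z$ in total).

```python
def apply_operator(u, omega2, lam2, m_r, dr_lower, dr_diag, dr_upper):
    """Return y = A u without ever storing A.

    u[i][k] : value in horizontal cell i at vertical level k
    m_r     : diagonal of the vertical mass matrix M^(r)
    dr_*    : lower, main and upper diagonal of the vertical matrix D^(r)
    Model assumption: M_h = identity and D_h = 1d stencil (-1, 2, -1).
    """
    nx, nz = len(u), len(u[0])
    y = [[0.0] * nz for _ in range(nx)]
    for i in range(nx):
        for k in range(nz):
            # Horizontal coupling, recomputed on the fly.
            west = u[i - 1][k] if i > 0 else 0.0
            east = u[i + 1][k] if i < nx - 1 else 0.0
            horiz = 2.0 * u[i][k] - west - east
            # Vertical coupling from the precomputed vectors.
            vert = dr_diag[k] * u[i][k]
            if k > 0:
                vert += dr_lower[k] * u[i][k - 1]
            if k < nz - 1:
                vert += dr_upper[k] * u[i][k + 1]
            y[i][k] = omega2 * (m_r[k] * horiz + lam2 * vert) + m_r[k] * u[i][k]
    return y
```

Only the field `u` and four short vectors are read; the stencil entries are formed from them at each grid point instead of being loaded from a stored sparse matrix.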

Slide 10: GPU implementation

1. Thread-level parallelism
   - Dependency in tridiagonal solve ⇒ 1 thread per vertical column
   - (Work in progress on vertical parallelisation via substructuring/cyclic reduction)
2. Multiple GPU implementation
   - Parallel domain decomposition in horizontal direction
   - GCL library & GPUDirect [Mauro Bianco et al., CSCS]

[Figure: threads within one GPU; GCL (using GPUDirect) between GPUs.]

Slide 11: GPU implementation

Efficient main memory access and cache reuse
- Keep data on GPU: host transfer once per solve
- Preconditioner: sequential tridiagonal solver in the vertical, 1 thread per column ⇒ 1 kB per column and variable
- Shared memory too small for explicit prefetching ⇒ rely on hardware caching: use global memory and coalesce access by horizontally aligning data
- Keep data in cache by fusing kernels ⇒ 30%-50% gain

Fused kernel:
- Update residual: $r \leftarrow r - \alpha q$
- Apply preconditioner: $z \leftarrow P^{-1} r$
- Calculate norm of residual: $\langle r, r \rangle$, and $\kappa \leftarrow \langle r, z \rangle$
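The restructuring behind kernel fusion can be sketched by writing the three steps (update residual, apply preconditioner, accumulate inner products) once as separate passes and once as a single pass per column. Python cannot show the cache effect itself; the sketch only shows that the fused form touches each column once and gives the same result. All names are hypothetical, and the same tridiagonal preconditioner is assumed for every column.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for T x = rhs (lower[0] and upper[-1] unused)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for k in range(1, n):
        denom = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / denom if k < n - 1 else 0.0
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x


def unfused_update(r, q, alpha, lower, diag, upper):
    """Three separate passes over all columns, as with one kernel per step."""
    r_new = [[rk - alpha * qk for rk, qk in zip(rc, qc)] for rc, qc in zip(r, q)]
    z = [solve_tridiagonal(lower, diag, upper, rc) for rc in r_new]
    rr = sum(rk * rk for rc in r_new for rk in rc)
    rz = sum(rk * zk for rc, zc in zip(r_new, z) for rk, zk in zip(rc, zc))
    return r_new, z, rr, rz


def fused_update(r, q, alpha, lower, diag, upper):
    """One pass: each column is read once and all three steps act on it."""
    r_new, z, rr, rz = [], [], 0.0, 0.0
    for rc, qc in zip(r, q):
        col = [rk - alpha * qk for rk, qk in zip(rc, qc)]  # update residual
        zc = solve_tridiagonal(lower, diag, upper, col)    # apply preconditioner
        rr += sum(rk * rk for rk in col)                   # <r, r>
        rz += sum(rk * zk for rk, zk in zip(col, zc))      # <r, z>
        r_new.append(col)
        z.append(zc)
    return r_new, z, rr, rz
```

On a GPU the fused version corresponds to one kernel launch in which the column data stay in cache between the three steps, instead of being written to and re-read from global memory.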


Slide 13: Performance

CG performance: time per iteration
Problem size 128 on NVIDIA M2090 Fermi

[Table: number of FLOPs and memory references per kernel (matrix-free), with no caching and with perfect caching. Rows: SpMV, prec, BLAS and total for CG; SpMV, prec and total for fused CG. Numerical values missing.]
[Figure: time in ms, split into SpMV, preconditioner, BLAS nrm2, BLAS dot, BLAS scal and BLAS axpy, for the matrix-explicit, matrix-free and fused matrix-free implementations.]

Slide 14: Performance

Total performance (all times in seconds)
- Problem size on NVIDIA M2090 Fermi
- Sequential code on Intel Sandybridge CPU [AVX disabled]
- 60% of peak global memory bandwidth

[Table: setup time, data transfer time, total time and speedup (vs. C and vs. CSR) for the C and CUDA versions of the matrix-explicit, matrix-free and fused matrix-free implementations. Numerical values missing.]

Slide 15: Multigrid

Comparison CG vs. multigrid on NVIDIA K20X, GK110 Kepler (times in ms)

[Table: rows CG and Multigrid; columns $t_{\text{iter}}$, % of peak BW, number of iterations, memory copy & transpose, total time (incl. memcpy) and speedup. Numerical values missing.]

Performance analysis
[Table: FLOPs per kernel call and memory references with no caching and with perfect caching. Multigrid rows: Smooth, Restrict + Smooth, Residual, Prolongate, Level 1 BLAS, Total (fine level only); final row: fused CG total. Only the parenthesised entries (122), (53), (23) survive.]

Slide 16: Multigrid

Cost per multigrid level, 1 GPU

[Figure: time in ms per multigrid level, from the 512x512 level down to the 32x32 level, split into Smooth, Restrict + Smooth, Residual and Prolongate.]
[Figure: time per call in ms against multigrid level for the same kernels, with reference lines for quadratic decay and linear decay.]

- Dominated by fine grid (the slide relates $n_{\text{thread}}$ to $n_{\text{cores}}$ here; the values are missing)
- Cost decays as expected, like $4^{-(l-1)}$ for level $l$
- 4-5 multigrid levels sufficient for our problem (zero-order term)
- Impact of parallel communications?
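The expected $4^{-(l-1)}$ decay means the whole hierarchy costs only a little more than the fine level alone, since the geometric series is bounded by $4/3$. The short sketch below evaluates that idealised sum; it is an arithmetic illustration only, with a hypothetical function name, and it ignores the slower decay seen on the coarsest levels in the measurements.

```python
def relative_hierarchy_cost(n_levels):
    """Cost of all levels relative to the fine level (level l = 1),
    assuming the cost on level l scales like 4**(-(l - 1))."""
    return sum(4.0 ** (-(level - 1)) for level in range(1, n_levels + 1))


cost_with_five_levels = relative_hierarchy_cost(5)
```

With five levels the idealised total is about 1.33 times the fine-level cost, so optimising the fine-level kernels matters most.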

Slide 17: Multigrid

Communication overhead, 64 GPUs

[Figure: time in ms per multigrid level, from the 512x512 level down to the 32x32 level, split into Smooth, Restrict + Smooth, Residual and Prolongate.]
[Figure: time per call in ms against multigrid level for the same kernels, with reference lines for quadratic decay and linear decay.]

- Worse calc/comm ratio:
  \[ \frac{T_{\text{comm}}}{T_{\text{calc}}} = R \cdot \frac{BW_{\text{mem}}}{BW_{\text{MPI}}} \]
- Still dominated by fine grid
- Still 2× faster than CG overall

[Table: ratio $R$ per solver, with rows CG, Multigrid and Multigrid (fine level). Numerical values missing.]
$BW_{\text{mem}} / BW_{\text{MPI}} \approx 100$ (measured)

Slide 18: Performance on HECToR and Titan

Multiple GPUs: comparison CPU vs. GPU
- Titan (#2 on top500.org): 18,688 NVIDIA GK110 Kepler
- HECToR: 16-core AMD Opteron 2.3 GHz Interlagos

Problem sizes
[Table: number of sockets, number of GPU cores and number of CPU cores, with the global problem size for three settings of $n_x$ (one of them $n_x = 512$). Most entries missing.]

Slide 19: Solution times

[Figure: time per iteration in ms against number of GPUs (lower axis) and number of CPU cores (upper axis), for CG and multigrid on Titan [NVIDIA Kepler] and HECToR [AMD Opteron].]
[Figure: total solution time in s (incl. host-device data transfer) against the same axes, for the same solvers and machines.]

Largest system solved: 0.5 trillion unknowns in 1 s on 16,384 GPUs

Slide 20: Massively parallel runs on Titan

Absolute performance on Titan

[Figure: FLOP rate (GigaFLOP to PetaFLOP scale) against number of GPUs for CG and multigrid, with single GPU peak and Titan theoretical peak marked.]
[Figure: percentage of peak global memory bandwidth (25% to 100%) against number of GPUs for CG and multigrid.]

- Achieve 0.67 PetaFLOPs ($\approx$ 3% of peak) on 16,384 GPUs
- Utilise 25%-50% of peak memory BW
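The "3% of peak" figure can be reproduced with a one-line estimate, shown here as a sketch. The per-GPU peak of 1.31 TFLOP/s (double precision, K20X) is an assumed nominal vendor value, not a number from the slides, and the function name is hypothetical; the peak is taken over the 16,384 GPUs used, not over the whole machine.

```python
def fraction_of_peak(sustained_flops, n_gpus, peak_flops_per_gpu):
    """Sustained floating-point rate as a fraction of the aggregate peak."""
    return sustained_flops / (n_gpus * peak_flops_per_gpu)


# 0.67 PFLOP/s on 16,384 GPUs; 1.31 TFLOP/s per GPU is an assumed value.
fraction = fraction_of_peak(0.67e15, 16384, 1.31e12)
```

Under this assumption the fraction is roughly 0.03. A low fraction of peak FLOPs alongside 25%-50% of peak memory bandwidth is what one expects from a memory-bound solver.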

Slide 21: Conclusion and Outlook

Conclusions
Efficient multi-GPU solvers for anisotropic problems in NWP
1. Optimal iterative solver algorithm (multigrid)
2. Efficient single-GPU implementation
3. Scalable multi-GPU parallelisation
- Utilise significant fraction of peak memory bandwidth
- Solvers scale to 16,384 GPUs, PFLOP performance on Titan

Outlook
- Strong scaling
- Use both CPU and GPU simultaneously (additive multigrid?)
- Overlapping calc/comm (important on coarse multigrid levels)
- More realistic grids, discretisation, equations, ...
- Multithreaded tridiagonal solver

Slide 22: References

- EHM, Xu Guo, RS, Sinan Shi: Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs. Computing and Visualization in Science (2013)
- EHM, RS et al.: Petascale performance of elliptic solvers for anisotropic PDEs on large GPU clusters (in preparation)
- Mauro Bianco: An Interface for Halo Exchange Pattern
