Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Similar documents
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Total efficiency of core components in Finite Element frameworks

M. Geveler, D. Ribbrock, D. Göddeke, P. Zajac, S. Turek Corresponding author:

GPU Cluster Computing for FEM

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM

Performance engineering for hardware-, numericaland energy-efficiency in FEM frameworks: modeling and control

High Performance Computing for PDE Towards Petascale Computing

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM

Accelerating Double Precision FEM Simulations with GPUs

GPU Cluster Computing for Finite Element Applications

Realization of a low energy HPC platform powered by renewables - A case study: Technical, numerical and implementation aspects

Fast and Accurate Finite Element Multigrid Solvers for PDE Simulations on GPU Clusters

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures

Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters

Computing on GPU Clusters

High Performance Computing for PDE Some numerical aspects of Petascale Computing

GPU Acceleration of Unmodified CSM and CFD Solvers

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs

Case study: GPU acceleration of parallel multigrid solvers

Resilient geometric finite-element multigrid algorithms using minimised checkpointing

Lecture 15: More Iterative Ideas

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures

Large scale Imaging on Current Many- Core Platforms

GPU Implementation of Elliptic Solvers in NWP. Numerical Weather- and Climate- Prediction

Hardware-Oriented Finite Element Multigrid Solvers for PDEs

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction

Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs)

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations

Performance. Computing (UCHPC)) for Finite Element Simulations

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations

Automated Finite Element Computations in the FEniCS Framework using GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

Contents. I The Basic Framework for Stationary Problems 1

Parallel High-Order Geometric Multigrid Methods on Adaptive Meshes for Highly Heterogeneous Nonlinear Stokes Flow Simulations of Earth s Mantle

Hardware-Oriented Numerics - High Performance FEM Simulation of PDEs

Two-Phase flows on massively parallel multi-gpu clusters

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead

Accelerating Double Precision FEM Simulations with GPUs

Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication. Steve Rennich Nvidia Developer Technology - Compute

How to Optimize Geometric Multigrid Methods on GPUs

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016

AMS526: Numerical Analysis I (Numerical Linear Algebra)

Distributed Schur Complement Solvers for Real and Complex Block-Structured CFD Problems

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

ME964 High Performance Computing for Engineering Applications

GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

PROGRAMMING OF MULTIGRID METHODS

Large Displacement Optical Flow & Applications

Iterative Sparse Triangular Solves for Preconditioning

Multi-GPU Acceleration of Algebraic Multigrid Preconditioners

Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Figure 6.1: Truss topology optimization diagram.

Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction

Performance of Implicit Solver Strategies on GPUs

smooth coefficients H. Köstler, U. Rüde

Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

cuibm A GPU Accelerated Immersed Boundary Method

Highly Parallel Multigrid Solvers for Multicore and Manycore Processors

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015

Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Part 2: Applications on GPU Clusters

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

Multi-GPU simulations in OpenFOAM with SpeedIT technology.

Krishnan Suresh Associate Professor Mechanical Engineering

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

GPU-based Parallel Reservoir Simulators

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Lecture 17: More Fun With Sparse Matrices

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Multigrid for Matrix-Free Finite Element Computations on Graphics Processors

Solving Large Complex Problems. Efficient and Smart Solutions for Large Models

Transcription:

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund, Germany markus.geveler@math.tu-dortmund.de ParCFD Barcelona, May 8, 20

Motivation FEM highly accurate for solving PDEs: high order (non-conforming) FEs arbitrarily unstructured grids to resolve complex geometries grid adaptivity Pressure-Schur-Complement Preconditioning... in connection with Geometric Multigrid solvers: convergence rates indepent of mesh width h superlinear convergence effect possible ( high order FE spaces) Finite Element Geometric Multigrid enhances numerical efficiency.

Motivation GPUs high on-chip memory bandwidth maximisation of the overall throughput of a large set of tasks parallelisation techniques for FEM software are being explored stronger smoothers are still an issue SPAI, ILU complete Geometric Multigrid solvers haven t had much attention yet SIMT-MP SIMT-MP GPU... SIMT-MP... Device Memory SIMT-MP 20-80 GB/s -8 GB/s PCIe CPU-Socket... cores cores caches caches shared caches and MCs 6-35 GB/s Host Memory -2 GB/s Infiniband (PCIe) But: bare MachoFlop -performance does not count! Today: Realising FE-gMG on the GPU hardware-oriented numerics

Solution approach Idea: One performance-critical kernel: SpMV coarse-grid solver: Conjugate Gradients smoothers: based on preconditioned Richardson iteration defect calculations What s left some BLAS- (dot-product, norm,...) grid transfer can be reduced to SpMV too (later) Benefits solver must be implemented only once oblivious of FE space and domain dimension performance tuning reduced to one kernel

Solution approach Grid transfers chose the standard Lagrange bases for two consecutively refined Q k finite element spaces V 2h and V h function u 2h V 2h can be interpolated in order to prolongate it u h := m i= x i ϕ (i) h, x i := u 2h (ξ (i) h ) for the basis functions of V 2h and u 2h = n j= y j ϕ (j) 2h with coefficient vector y, we can write the prolongation as u h := m i= x i ϕ (i) h, x := P h 2h y restriction matrix R 2h h = (P h 2h )T

Solution approach Grid transfer: Simplified example - 2D, Q on regular grid 4 5 6 2 3 0.5 7 8 9 7 20 8 2 9 7 24 8 25 9 4 5 5 6 6 2 22 3 23 4 0 2 3 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25

left to right: 2-Level, Cuthill McKee, Coordinate-based and Hierarchical orderings sparsity pattern (and bandwidth) depends on DOF numbering technique performance Solution approach Grid transfer: Prolongation matrix examples

Implementation Sparse matrix-vector multiply on the GPU: ELLPACK-R store sparse matrix S in two arrays A (non-zeros in column-major order) and j (column index for each entry in A) A has size (#rows in S) (maximum number of non-zeros in any row of S) shorter rows are padded with zeros additional array rl to store effective count of non-zeros in every row without the padding-zeros (stop computation on a row after the actual non-zeros) 7 0 0 7 0 2 S = 0 2 8 0 5 0 3 9 A = 2 8 5 3 9 j = 2 0 2 3 rl = 2 3 0 6 0 4 6 4 3 2

Implementation Sparse matrix-vector multiply on the GPU y i = rl i nz=0 A i,nz x jnz based on the ELLPACK-R format y = Ax can be performed by computing each entry y i of the result vector y independently (one GPU-thread per y i ) regular access pattern on data of y and A access pattern on x depends highly on sparsity pattern of A data access to all three arrays is fully coalesced due to column-major ordering x-values can be cached (texture-cache or L2 on FERMI) no synchronisation between threads necessary no branch divergence

Results Benchmark setup u =, x Ω u = 0, x Γ u =, x Γ 2 Q Q 2 L N non-zeros N non-zeros 4 576 4552 276 3292 5 276 8208 8448 28768 6 8448 72832 33280 55072 7 33280 29328 32096 2078720 8 32096 72480 526336 835744 9 526336 4704256 20248 33480704 0 20248 8845696 - - Poisson problem as a fundamental component in many practical situations different FE spaces different DOF numbering techniques Jacobi preconditioning, V-cycle Intel Core i7 980 Gulftown hexacore workstation / NVIDIA Fermi GPU (Tesla C2070)

Results In addition: stronger preconditioning with SPAI I MA 2 F = n e T k m T k A 2 2 = k= n A T m k e k 2 2 k= where e k is the k-th unit-vector and m k is the k-th row of M. for n columns of M n least squares opt.-problems: min mk A T m k e k 2, k =,... n. sparsity-pattern of the stiffness-matrix is used for pattern of preconditioner

Results Sparse matrix-vector multiply on the GPU

Results Its numerics, that counts: #iterations potential degradation of /000 hardware may offer an order of magnitude speedup

Results FE-gMG: a closer look at preconditioning SPAI offers /2; SPAI+Q 2 works well

Results Execution times for finest discretisation and reasonable numbering-techniques: CPU, GPU

Results Geometric Multigrid mission accomplished: SpMV performance transported to solver level clever sorting may pay off gap between weak solvers + unthoughtful DOF-ordering + unoptimised kernels (with respect to hardware) and FE-gMG + clever reordering + hardware-acceleration is huge current design oblivious of FE-spaces, domain-dimension, preconditioning, grid properties,...

Conclusion Summary FE-gMG is efficient and flexible GPU vs. multicore CPU: close to one order of magnitude speedup sophisticated (sparse) preconditioners make the difference single-node hardware-oriented FE-gMG is ready from the solver-side, but... Future work assembly of preconditioners, system matrices, transfer-matrices still unresolved, especially for unstructured grids cross-effects with resorting the degrees of freedom in combination with a specific matrix storage format and associated SpMV kernel other related data-parallel operations: adaptive grid-deformation,...

Acknowledgements Supported by BMBF, HPC Software für skalierbare Parallelrechner: SKALB project 0IH08003D.