Distributed NVAMG: Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs


Distributed NVAMG: Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs. Istvan Reguly (istvan.reguly at oerc.ox.ac.uk), Oxford e-Research Centre. NVIDIA Summer Internship Project, with Jonathan Cohen, Robert Strzodka and many others.

Algebraic Multigrid. Solving sparse linear systems Ax = b with 100K to 400M unknowns (CFD, reservoir simulation). An iterative solver, trying to minimise the residual ||A x_k - b||. Usually part of a non-linear outer solver, so the tolerance may be low or high.

Why AMG? Smoothers. (Figure: error of the initial guess, and the error after 5, 10 and 15 smoothing sweeps.)

Algebraic Multigrid: a multi-level sparse iterative solver. (Diagram: smooth the error, restrict, interpolate & apply the correction, smooth again.)

Algebraic Multigrid. Setup, starting from the fine matrix A: compute the interpolator P and restrictor R (= P^T), compute the coarse matrix A_C = RAP, repeat on A_C; set up smoothers for every level. Solve, input x, b: smooth x, restrict the residual b_C = R(b - Ax), repeat on the coarser level, then prolongate & apply the correction x += P x_C. O(N) cost; convergence independent of problem size.

NVAMG. Main focus on AMG as a preconditioner. Classical and aggregation AMG; implements Krylov solvers and various smoothers: Jacobi, Gauss-Seidel, DILU (diagonal ILU), ILU0, etc. Dynamic solver hierarchy: the user specifies preconditioners, smoothers, coarse solvers etc., e.g. FGMRES preconditioned by AMG with a DILU smoother and a PBiCGStab coarse solver, itself preconditioned by another solver. Supports arbitrary block sizes (e.g. 4x4 for u, v, w, p). ~200k lines; makes extensive use of libraries: Thrust, cuSPARSE, CUSP.

AMG on the GPU. Lots of sequential algorithms in the literature: SpMM, aggregation, Gauss-Seidel, ILU, coloring. These are integer/logical algorithms: difficult to parallelise, and even then bandwidth-limited. Less and less work on the coarser levels leads to utilisation problems. Expensive setup: depending on the tolerance, it may be up to 60% of the runtime, and there is no way around it, since convergence is not just a matter of time. Memory-limited (~5 GB/GPU).

Colored smoothers. The algorithm works on vertices + edges, updating vertices (in x), so there are data races during the updates of x. Gauss-Seidel sweep: x_i^(k+1) = (1/a_ii) ( b_i - Σ_{j=1}^{i-1} a_ij x_j^(k+1) - Σ_{j=i+1}^{n} a_ij x_j^(k) ). The solution on node i is updated based on already-updated information (1 .. i-1) and old information (i+1 .. n).

Colored smoothers. Execute sets of vertices color by color: x_i^(k+1) = (1/a_ii) ( b_i - Σ_{j=1, j≠i}^{n} a_ij x_j ), where x_j is already updated for previously processed colors and old otherwise. No write/read conflicts! Still, not quite the same as serial Gauss-Seidel.

Colored smoothers. (Figure: a graph colored with colors 1-6.) Update the result (an SpMV) for each color. Forward sweep: only use nonzeros with color < current color (DILU). Backward sweep, colors 6 -> 1: only nonzeros with color > current color (DILU). Implemented with cuSPARSE bsrxmv. How do we get the coloring?
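A minimal sketch of the colour-by-colour sweep, assuming a dense matrix and a precomputed colouring (illustrative only; the library does this with masked sparse SpMV kernels such as cuSPARSE bsrxmv):

```python
# Multicolour Gauss-Seidel: rows of one colour share no edges, so every
# row of that colour can be updated in parallel with no write/read
# conflict, using only the current values of the other colours.
import numpy as np

def colored_gs_sweep(A, x, b, colors, order):
    """One sweep; `colors` maps row -> colour, `order` lists colours to visit."""
    for c in order:
        rows = np.where(colors == c)[0]
        for i in rows:  # conceptually a parallel-for over the colour
            # b_i minus all off-diagonal contributions, divided by a_ii.
            s = b[i] - A[i] @ x + A[i, i] * x[i]
            x[i] = s / A[i, i]
    return x

# Usage: 1D Poisson is 2-colourable (classic red-black Gauss-Seidel).
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
colors = np.arange(n) % 2          # red = even rows, black = odd rows
x = np.zeros(n)
for _ in range(200):
    x = colored_gs_sweep(A, x, b, colors, order=[0, 1])
```

Reversing `order` gives the backward sweep mentioned on the slide.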

Parallel coloring. A very sequential problem (it propagates information); a parallel version will use more colors because decisions are local. Iterative algorithm (Jones-Plassmann): assign a random number to the uncolored nodes; local maxima take the current color.

Min-max coloring. 1. Assign a random number to every vertex. 2. For i = 1, 2, ...: assign color i to the local maxima and mark colored vertices as done. 3. Stop at x% uncolored. (Figure: the greedy scheme takes 6 colors in 6 iterations; min-max takes 6 colors in 3 iterations.)

Parallel coloring, min-max extension: in the same iteration, also assign the next color to the local minima. All nodes need the same information: either generate a vector of random numbers, or use a hash on the node IDs. Multi-hash: several random values per node give several possible colors per iteration.
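The Jones-Plassmann idea above can be sketched as follows. This is an assumption-level illustration, not the library's implementation: weights come from a hash of the node ID, so every node can recompute its neighbours' weights without any communication.

```python
# Jones-Plassmann-style parallel colouring: in round i, every uncoloured
# node whose hashed weight beats all still-uncoloured neighbours takes
# colour i. Adjacent nodes can never win the same round.
import hashlib

def weight(node_id):
    # Deterministic per-node pseudo-random number from a hash of the ID.
    return int(hashlib.md5(str(node_id).encode()).hexdigest(), 16)

def jones_plassmann(adjacency):
    """adjacency: dict node -> set of neighbours. Returns node -> colour."""
    color = {}
    current = 0
    while len(color) < len(adjacency):
        # One conceptual parallel round: purely local decisions.
        newly = [u for u in adjacency if u not in color
                 and all(v in color or weight(u) > weight(v)
                         for v in adjacency[u])]
        for u in newly:
            color[u] = current
        current += 1
    return color

# Usage: a 6-cycle; adjacent nodes must end up with different colours.
adj = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
c = jones_plassmann(adj)
assert all(c[u] != c[v] for u in adj for v in adj[u])
```

The min-max variant from the slide would additionally colour the local minima with `current + 1` in the same round, roughly halving the number of rounds.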

Graph matching - Aggregation. A graph matching is a set of edges such that no two share a vertex. Select neighbors that are strongly connected and merge them together on the next coarser level. There is a trivial serial implementation; parallelisation is similar to coloring. One-phase handshaking: extend a hand to your strongest neighbor; if a handshake happens, we have an aggregate; repeat with the unmatched vertices. There is also two-phase handshaking.

One-phase handshaking
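A sketch of one-phase handshaking (a hypothetical illustration of the scheme named above, not NVAMG code): every unmatched vertex extends a hand to its strongest unmatched neighbour, and wherever two hands meet, the pair becomes an aggregate.

```python
# One-phase handshaking for graph matching: repeat rounds of
# "point at your strongest unmatched neighbour" until no new
# handshakes occur; leftover vertices stay unmatched (singletons).
def handshake_matching(strength):
    """strength: dict u -> dict v -> edge weight. Returns matched pairs."""
    unmatched = set(strength)
    pairs = []
    progress = True
    while progress:
        # Each unmatched vertex extends a hand to its strongest
        # still-unmatched neighbour.
        hand = {}
        for u in unmatched:
            cands = {v: w for v, w in strength[u].items() if v in unmatched}
            if cands:
                hand[u] = max(cands, key=cands.get)
        progress = False
        for u, v in list(hand.items()):
            if hand.get(v) == u and u < v:   # a handshake: u <-> v
                pairs.append((u, v))
                unmatched -= {u, v}
                progress = True
    return pairs

# Usage: a path 0-1-2-3 with a strong middle edge. Nodes 1 and 2 shake
# hands first; nodes 0 and 3 then have no unmatched neighbours left.
s = {0: {1: 1.0}, 1: {0: 1.0, 2: 5.0}, 2: {1: 5.0, 3: 1.0}, 3: {2: 1.0}}
pairs = handshake_matching(s)
```

Two-phase handshaking, also mentioned on the slide, resolves more conflicts per round before falling back to repetition.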

Graph matching performance. (Figure from Cohen & Castonguay @ GTC 2012.)

NVAMG vs. HYPRE (12-core Xeon). (Chart: NVAMG and HYPRE setup and solve times on reservoir case 1 (191k, 1.3m), reservoir case 2 (350k, 2.2m), reservoir case 3 (1.4m, 10m), reservoir case 4 (289k, 1.66m), and a 9-pt Poisson problem (1m, 9m).)

Other GPU algorithms. SpMM with dynamic-size hash maps: see Julien Demouth @ GTC 2012, ID S0285. For more on aggregation & coloring: Jonathan Cohen @ GTC 2012, ID S0332. Classical AMG (PMIS, D1, D2 selectors): Simon Layton @ GTC 2012, ID S0305.

Distributed NVAMG. (Diagram: halo exchange between two partitions α and β: data is exported through the B2L(α, β) map, transferred, and imported through the L2H(α, β) map; the symmetric maps B2L(β, α) and L2H(β, α) handle the opposite direction.)

(Diagram: an example partitioning into six nodes, each with its node ID, list of neighbors and index range within 0-600, plus its B2L maps, halo rows and comms module.)

Internal representation. Numbering from 0 to N-1. Renumbering & reordering so that interior nodes come first, followed by the boundary ones, to support latency hiding. Mappings are kept from the original numbering to the new one, and from the new one back to global indices. Halo rows are appended to the matrix neighbor by neighbor, renumbered but not reordered; unknown edges are thrown away.
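The interior-first reordering can be sketched as follows (a hypothetical illustration of the idea, not NVAMG's data layout):

```python
# Renumber local rows so interior rows come first and boundary rows are
# appended, returning both directions of the mapping described on the
# slide (old local numbering <-> new latency-hiding-friendly numbering).
def renumber(n_rows, boundary_rows):
    """Return (old_to_new, new_to_old) permutations, interior rows first."""
    boundary = set(boundary_rows)
    interior = [i for i in range(n_rows) if i not in boundary]
    new_to_old = interior + sorted(boundary)   # interior first, boundary last
    old_to_new = [0] * n_rows
    for new, old in enumerate(new_to_old):
        old_to_new[old] = new
    return old_to_new, new_to_old

# Usage: 6 local rows, of which rows 1 and 4 touch a partition boundary.
old_to_new, new_to_old = renumber(6, [1, 4])
# Interior rows 0, 2, 3, 5 come first; boundary rows 1, 4 are appended.
```

With this layout, "all interior rows" and "all boundary rows" are contiguous ranges, which is what makes the interior/boundary split for latency hiding cheap.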

Communications. It's not as difficult as it sounds, just some bookkeeping. In user code you insert "I need up-to-date data here", and everything else happens behind the scenes. The communications module is isolated: nothing else knows about the distributed environment. Plug-in style modules exist for OpenMP and MPI, and can do all sorts of tricks: P2P, GPUDirect, blocking communications etc.

GPUDirect vs. host buffers - Solve. (Chart: relative solve performance on 2 GPUs for Matrix1 ~80M, Matrix2 ~80M, Matrix3 ~50M, Matrix4 ~20M, Matrix5 ~90M, Matrix6 ~50M, Matrix7 ~200M and Matrix8 ~310M.)

Aggregation. Represented by a vector: for every fine node, the coarse node it is aggregated into.

Galerkin product (A_C = R A_f P). The Galerkin product merges edges between fine nodes that went into the same aggregate. Perform the SpMM on locally consistent matrices; some values in the coarse halo may then be different, so update them.
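When P is defined by the aggregation vector from the previous slide (and R = P^T is piecewise constant), entry (I, J) of A_C is simply the sum of all fine entries a_ij with agg[i] = I and agg[j] = J, so edges inside an aggregate merge into the coarse diagonal. A small sketch under those assumptions (not the library's SpMM-based implementation):

```python
# Galerkin product A_C = R A P for piecewise-constant aggregation:
# accumulate every fine matrix entry into the coarse entry addressed
# by the aggregation vector.
from collections import defaultdict

def galerkin_product(A_entries, agg):
    """A_entries: dict (i, j) -> value; agg: fine node -> coarse aggregate."""
    Ac = defaultdict(float)
    for (i, j), v in A_entries.items():
        Ac[(agg[i], agg[j])] += v
    return dict(Ac)

# Usage: a 4-node chain aggregated in pairs {0,1} -> 0 and {2,3} -> 1.
A = {(0, 0): 2.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 2.0,
     (1, 2): -1.0, (2, 1): -1.0, (2, 2): 2.0, (2, 3): -1.0,
     (3, 2): -1.0, (3, 3): 2.0}
agg = [0, 0, 1, 1]
Ac = galerkin_product(A, agg)
# Edges inside each pair merge into the coarse diagonal entries.
```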

Algorithmic differences. No aggregation through partition boundaries. No globally consistent coloring: doing a halo exchange between colors would be expensive. Latency hiding relies on the separation of interior and boundary vertices: init comms & execute the interior -> wait -> execute the boundary. This causes differences in both colored & non-colored smoothers, and brings occupancy & latency issues as well as numerical stability issues.
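The "init comms -> execute interior -> wait -> execute boundary" order can be sketched as follows. This is a hypothetical illustration: a Python thread stands in for an asynchronous halo exchange, and `update` stands in for one smoother step on a row.

```python
# Latency hiding: start the halo exchange, overlap it with the interior
# update (which needs no halo data), then wait and finish the boundary.
import threading

def smooth_with_latency_hiding(interior, boundary, fetch_halo, update):
    halo = {}
    def exchange():                 # stands in for an async MPI halo exchange
        halo.update(fetch_halo())
    t = threading.Thread(target=exchange)
    t.start()                                           # init comms
    new_interior = [update(v, {}) for v in interior]    # interior: no halo
    t.join()                                            # wait for the halo
    new_boundary = [update(v, halo) for v in boundary]  # boundary: uses halo
    return new_interior, new_boundary

# Usage: boundary rows add the received halo value; interior rows don't.
ni, nb = smooth_with_latency_hiding(
    interior=[1.0, 2.0], boundary=[3.0],
    fetch_halo=lambda: {"neighbor": 0.5},
    update=lambda v, halo: v + sum(halo.values()))
```

The row threshold discussed on the next slide decides when the boundary part is too small for this overlap to pay off.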

DILU sensitivity to latency hiding. (Chart: row threshold for stopping latency hiding on 2 nodes, for Matrix1 ~80M through Matrix8 ~310M; values range from 59.9%, 52.9%, 42.9%, 27.5% and 13.0% down to about 3%.)

Scaling 1 - Setup. Scaling with FGMRES preconditioned by Jacobi/GS/DILU, latency hiding turned on/off, no AMG; 3D Poisson problem. (Chart: setup speedup on 1-6 GPUs for Jacobi, GS and DILU.)

Scaling 1 - Solve. (Chart: solve speedup on 1-6 GPUs for Jacobi, GS and DILU, each with latency hiding on (LH1) and off (LH0).)

Scaling 2 - Incompressible Navier-Stokes. A set of incompressible Navier-Stokes matrices (4x4 blocks for u, v, w, pressure), numerically difficult and sensitive to the DILU/GS decoupling caused by latency hiding; the number of levels is capped at something sensible. The solve sits inside a non-linear solver, so it runs to a loose tolerance (0.1) and setup time is important. By default the solver is deterministic (strict coloring); often this can be turned off for some performance increase.

Scaling 2 - Total speedup. (Chart: total speedup on 1-6 GPUs for Matrix1 ~80M through Matrix8 ~310M.)

Optimisations. Partitioning: currently we use naïve partitioning, which gives bad edge cuts. Support for already-partitioned problems, since the underlying solver may already do partitioning. Partition migration: less communication and more work on the coarser levels, with better numerical behavior. Tuning, tuning, tuning. Public release TBA.