GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA


ANSYS Fluent

Fluent control flow: each non-linear iteration assembles the linear system of equations (~33% of runtime) and then solves the linear system Ax = b (~67% of runtime; accelerate this first). If not converged, repeat; if converged, stop.

Aggregation (Un-smoothed) AMG

SETUP
1. Choose aggregates based on A_f
2. Construct coarsening operator (R = P^T)
3. Construct coarse matrix (A_c = R A_f P)
4. Initialize smoother (if needed)

SOLVE
1. Smooth
2. Compute residual (r_f = b_f - A_f x_f)
3. Restrict residual (r_c = R r_f)
4. Recurse to solve the coarse problem
5. Prolongate correction (x_f = x_f + P e_c)
6. Smooth
7. If not converged, go to 1
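To make the recursion in the SOLVE phase explicit, here is a minimal host-side structural sketch of one V-cycle. The CsrMatrix/Level types and the helper routines (smooth, residual, restrict_to_coarse, prolongate_add, solve_coarsest) are hypothetical placeholders standing in for the SpMV/SpMM kernels discussed on the following slides; this is an illustrative skeleton, not NVAMG code.

```cuda
#include <algorithm>
#include <vector>

struct CsrMatrix { int n; /* CSR arrays (row_ptr, col_idx, val) would live here */ };
struct Level { CsrMatrix A; CsrMatrix P; };  // A: operator on this level; P: prolongation, R = P^T

// Placeholder building blocks (declared only); in NVAMG these map to SpMV / SpMM kernels.
void smooth(const CsrMatrix& A, const double* b, double* x);                    // e.g. Jacobi / DILU
void residual(const CsrMatrix& A, const double* b, const double* x, double* r); // r = b - A x
void restrict_to_coarse(const CsrMatrix& P, const double* r_f, double* r_c);    // r_c = P^T r_f
void prolongate_add(const CsrMatrix& P, const double* e_c, double* x_f);        // x_f += P e_c
void solve_coarsest(const CsrMatrix& A, const double* b, double* x);

// One V-cycle on level l; b, x, r hold per-level work vectors (plain host arrays for clarity).
void v_cycle(const std::vector<Level>& levels, int l,
             std::vector<double*>& b, std::vector<double*>& x, std::vector<double*>& r)
{
    const Level& L = levels[l];
    if (l + 1 == (int)levels.size()) { solve_coarsest(L.A, b[l], x[l]); return; }

    smooth(L.A, b[l], x[l]);                                 // 1. pre-smooth
    residual(L.A, b[l], x[l], r[l]);                         // 2. r_f = b_f - A_f x_f
    restrict_to_coarse(L.P, r[l], b[l + 1]);                 // 3. r_c = R r_f becomes the coarse RHS
    std::fill(x[l + 1], x[l + 1] + levels[l + 1].A.n, 0.0);  // 4. zero initial coarse correction,
    v_cycle(levels, l + 1, b, x, r);                         //    recurse: solve A_c e_c = r_c
    prolongate_add(L.P, x[l + 1], x[l]);                     // 5. x_f = x_f + P e_c
    smooth(L.A, b[l], x[l]);                                 // 6. post-smooth
}
```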

Challenge of Multigrid on GPU

CPU approach (domain decomposition): chop the problem into one subdomain per CPU core, run a serial algorithm per subdomain, and fix up the boundaries. Cores are fast, and subdomains are big enough that the boundaries stay small.

GPU requirements are different: each thread is slow, but there are O(10,000) threads per GPU, so the subdomain size is O(1). Typical domain decomposition approaches break down; methods must scale to the limit of one grid cell per thread.

Computational Pattern: SpMV

SpMV (sparse matrix-vector product) computes y = Ax. It is a parallel map that visits each edge, followed by a reduction:

for each row i in matrix A
    row_sum = 0
    for each non-zero entry j
        row_sum += A_i,j * x_j
    y_i = row_sum
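As a concrete illustration of this pattern, the following CUDA kernel is a minimal one-thread-per-row CSR SpMV matching the pseudocode above. The CSR array names are assumptions chosen for the example, not NVAMG identifiers.

```cuda
// One-thread-per-row CSR SpMV (y = A x): a serial loop over row i's edges,
// summed into row_sum, exactly as in the pseudocode above.
__global__ void spmv_csr_scalar(int n, const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const double* __restrict__ val,
                                const double* __restrict__ x,
                                double* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // row / vertex index
    if (i < n) {
        double row_sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)   // visit each edge of row i
            row_sum += val[j] * x[col_idx[j]];
        y[i] = row_sum;
    }
}
// Launch example: spmv_csr_scalar<<<(n + 255) / 256, 256>>>(n, row_ptr, col_idx, val, x, y);
```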

SpMV-like pattern:
for every vertex a_i
    for every neighbor(a_i) b_i
        compute F(a_i, b_i)
    reduce {F(a_i, b_i)}
    write result into position i

SpMM-like pattern:
for every vertex a_i
    for every neighbor(a_i) b_i
        for every neighbor(b_i) c_k
            segmented_reduce {F(a_i, b_i, c_k) by (i,k)}
for every (i,k)
    write result into position i,k

CUDA Implementation: SpMV-like

Scalar kernel: one thread per row/vertex, a for-loop over the edges, and a serial reduction.
Vector kernel: multiple threads per row/vertex, edges processed in parallel, and a parallel reduction.
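The scalar kernel is the one sketched after the SpMV slide; below is a sketch of the vector variant, with one warp per row and a warp-level parallel reduction. Again, this is an illustrative sketch with assumed CSR array names, not NVAMG source.

```cuda
// Vector kernel: one warp per row, edges processed in parallel across the 32
// lanes, followed by a __shfl_down_sync parallel reduction within the warp.
__global__ void spmv_csr_vector(int n, const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const double* __restrict__ val,
                                const double* __restrict__ x,
                                double* __restrict__ y)
{
    const int lane = threadIdx.x & 31;                              // lane within the warp
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // one warp per row
    if (row >= n) return;

    double sum = 0.0;
    for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        sum += val[j] * x[col_idx[j]];                              // strided over the row's edges

    for (int offset = 16; offset > 0; offset >>= 1)                 // parallel reduction
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[row] = sum;
}
// Launch with 32 threads per row overall, e.g. 256-thread blocks covering 8 rows each.
```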

CUDA Implementation: SpMM-like

For every vertex in the 2-ring: segmented reduction of F(a_i, b_i, c_k) by (i,k).
Challenge: #{(a_i, b_i, c_k) triplets} > #{unique (i,k)}, giving unpredictable data expansion and contraction. This is difficult with 10,000 threads!
Approaches:
- Shared-memory / register-file caching (spill to global memory)
- Count, allocate, execute
- Everything in global memory with tuples (cusp/thrust approach)

Scaling Beyond Single GPU

Workload breakdown approach:
SpMV-like => communicate the edges and attached vertices which connect data between GPUs.
SpMM-like => communicate the edges and attached rows (e.g. the 1-ring) which connect data between GPUs.
Plus lots of complex software engineering.

Workload breakdown

SETUP
1. Choose aggregates based on A_f: SpMV
2. Construct coarsening operator (R = P^T): transpose (sort)
3. Construct coarse matrix (A_c = R A_f P): SpMM
4. Initialize smoother (if needed): SpMV / SpMM (ILU1)

SOLVE
1. Smooth: SpMV
2. Compute residual (r_f = b_f - A_f x_f): SpMV
3. Restrict residual (r_c = R r_f): SpMV
4. Recurse on coarse problem
5. Prolongate correction (x_f = x_f + P e_c): SpMV
6. Smooth: SpMV
7. If not converged, go to 1: reduction


Size-2 Aggregation via Graph Matching

Graph matching: a set of edges such that no two edges share a vertex. A maximum matching is a matching that includes the largest number of edges. Equivalent view: an independent set on the dual of the graph, i.e. independent pairs of connected vertices.

One-Phase Handshaking

Each vertex extends a hand to its strongest neighbor. Each vertex then checks if its strongest neighbor extended a hand back. Repeat with the unmatched vertices.
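A minimal CUDA sketch of one handshaking round, assuming a CSR graph with per-edge strength weights w and an aggregate array in which a negative entry means "not yet matched". The array names, the tie-breaking rule, and the pair-id convention are assumptions for illustration, not NVAMG's implementation.

```cuda
// Each unmatched vertex extends a hand to its strongest unmatched neighbour.
__global__ void extend_hand(int n, const int* row_ptr, const int* col_idx,
                            const double* w, const int* aggregate, int* hand)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || aggregate[i] >= 0) return;
    int best = -1; double best_w = -1.0;
    for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e) {
        int j = col_idx[e];
        if (j != i && aggregate[j] < 0 && w[e] > best_w) { best_w = w[e]; best = j; }
    }
    hand[i] = best;
}

// Each vertex checks whether its strongest neighbour extended a hand back;
// mutual pairs form a size-2 aggregate.
__global__ void check_handshake(int n, const int* hand, int* aggregate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || aggregate[i] >= 0) return;
    int j = hand[i];
    if (j >= 0 && hand[j] == i) aggregate[i] = min(i, j);   // pair id = smaller vertex index
}
// Host: repeat the two kernels on the remaining unmatched vertices until few
// or none are left; leftover singletons become their own aggregates.
```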

Create P from Aggregates

P_{i,j} = 1 if vertex j is in aggregate i, 0 otherwise. (The slide shows a small example 0/1 matrix P.)
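A small sketch of building this operator from an aggregate-id array. Here it is stored with one row per fine vertex and one column per aggregate, i.e. the orientation used in x_f = x_f + P e_c, so each fine row holds a single unit entry and R = P^T follows by a transpose/sort as noted in the workload breakdown. Array names are illustrative assumptions.

```cuda
// Build the piecewise-constant prolongation P in CSR from an aggregate-id array:
// fine row i has exactly one non-zero, a 1.0 in column aggregate[i].
__global__ void build_P(int n_fine, const int* __restrict__ aggregate,
                        int* __restrict__ row_ptr, int* __restrict__ col_idx,
                        double* __restrict__ val)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_fine) return;
    row_ptr[i] = i;                          // one entry per fine row
    col_idx[i] = aggregate[i];               // column = coarse (aggregate) index
    val[i]     = 1.0;                        // un-smoothed aggregation: entries are 0/1
    if (i == 0) row_ptr[n_fine] = n_fine;    // close the CSR row pointer
}
```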


Tricks for Computing A_c

For UA-AMG, the entries of P are either 0 or 1, so the Galerkin product simplifies to

A^c_{I,J} = sum over {fine points i in aggregate I and fine points j in aggregate J} of the fine-matrix entries A^f_{i,j}

which is an SpMM-like kernel.

Computing A_c is SpMM-like

for every coarse vertex I
    for every contained fine vertex i
        for every non-zero entry j with coarse(j) == J
            segmented_sum {A^f_{i,j} by (I,J)}
for every (I,J)
    write result into position A^c_{I,J}

Example from the slide: with aggregate I = 0 containing fine vertices i = 0, 1 and aggregate J = 1 containing j = 2, A^c_{0,1} = A^f_{0,2} + A^f_{1,2}.
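For illustration, a hedged CUDA sketch of this simplified Galerkin product that accumulates each fine entry into A^c_{aggregate[i], aggregate[j]}. To keep it short the coarse matrix is held dense and updated with atomicAdd; the real kernel produces a sparse A_c with a segmented reduction as described above, so treat this dense version as a toy stand-in (atomicAdd on double needs compute capability >= 6.0).

```cuda
// Un-smoothed-aggregation Galerkin product into a dense coarse matrix
// (n_coarse x n_coarse, row-major): A^c[I][J] += A^f[i][j] for aggregate(i)=I,
// aggregate(j)=J.
__global__ void galerkin_ua(int n_fine, const int* __restrict__ row_ptr,
                            const int* __restrict__ col_idx,
                            const double* __restrict__ val,
                            const int* __restrict__ aggregate,
                            int n_coarse, double* __restrict__ Ac)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // fine row
    if (i >= n_fine) return;
    int I = aggregate[i];                            // coarse row
    for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e) {
        int J = aggregate[col_idx[e]];               // coarse column
        atomicAdd(&Ac[(size_t)I * n_coarse + J], val[e]);
    }
}
```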


Smoothers

Preconditioned Richardson iteration: x_{k+1} = x_k + M^{-1} (b - A x_k). M is called the preconditioning matrix; the smoother is encoded via the choice of M, and the solution vector x is updated iteratively.

Jacobi Smoother

Trivial parallelism: for Jacobi, M is block-diagonal (each block could be n x n).

Jacobi Smoother

Jacobi setup phase: compute the inverses of the D blocks.
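A minimal sketch of the resulting sweep for the scalar (1x1-block) case: with the diagonal inverses precomputed in the setup phase, every row independently applies the Richardson update x_new = x + D^{-1}(b - A x). Names are illustrative; a block version would apply the stored block inverses instead of d_inv.

```cuda
// One Jacobi relaxation sweep: x_new = x_old + D^{-1} (b - A x_old),
// with d_inv[i] = 1 / A_ii precomputed in the smoother setup phase.
__global__ void jacobi_sweep(int n, const int* __restrict__ row_ptr,
                             const int* __restrict__ col_idx,
                             const double* __restrict__ val,
                             const double* __restrict__ d_inv,
                             const double* __restrict__ b,
                             const double* __restrict__ x_old,
                             double* __restrict__ x_new)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double r = b[i];
    for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e)
        r -= val[e] * x_old[col_idx[e]];             // r_i = b_i - (A x_old)_i
    x_new[i] = x_old[i] + d_inv[i] * r;              // Richardson step with M = D
}
```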

Graph Coloring

Assignment of a color (integer) to each vertex such that no two adjacent vertices share the same color. Each color forms an independent set (conflict-free), revealing the parallelism inherent in the graph topology.
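One common way to compute such a coloring in parallel is a Jones-Plassmann-Luby style scheme: in each round, an uncolored vertex whose random weight beats those of all its still-competing neighbors takes the current color. The sketch below is a generic version of that idea with assumed array names; it is not necessarily the exact scheme used in NVAMG.

```cuda
// One round of Jones-Plassmann-Luby style colouring on a CSR graph.
// color[i] < 0 means vertex i is still uncoloured; rnd holds per-vertex
// random weights (e.g. hashed indices).
__global__ void jpl_color_round(int n, const int* row_ptr, const int* col_idx,
                                const unsigned int* rnd, int* color,
                                int current_color, int* n_uncolored)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || color[i] >= 0) return;                      // already coloured
    bool local_max = true;
    for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e) {
        int j = col_idx[e];
        if (j == i) continue;
        int cj = color[j];
        if (cj >= 0 && cj != current_color) continue;         // removed in earlier rounds
        if (rnd[j] > rnd[i] || (rnd[j] == rnd[i] && j > i)) { local_max = false; break; }
    }
    if (local_max) color[i] = current_color;                  // i joins the current colour class
    else atomicAdd(n_uncolored, 1);                           // still uncoloured after this round
}
// Host: reset n_uncolored to 0, launch with current_color = 0, 1, 2, ... until
// the counter stays at zero; every colour class is an independent set.
```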

Reordering via Graph Coloring

DILU Smoother

The DILU preconditioner has the form M = (E + L) E^{-1} (E + U), where L and U are the strict lower and upper triangles of A and E is a diagonal matrix chosen such that diag(M) = diag(A). It is the same as the ILU(0) preconditioner for some banded matrices, and it only requires one extra diagonal of storage. Cheap, strong, low-storage.

DILU Smoother

Setup is sequential. The solve is also sequential (two triangular solves).

Multi-color DILU Smoother

Use coloring to extract parallelism.
Setup:
Forward solve: include neighbors whose color is less than yours in the SpMV-like updates.
Backward solve: include neighbors whose color is greater than yours in the SpMV-like updates.
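A hedged sketch of the forward sweep under that scheme, assuming a precomputed inverse of the DILU diagonal E and a per-vertex color array: the host loops over colors in increasing order, and each kernel launch updates all rows of one color in parallel using only lower-colored neighbors. The backward sweep mirrors it with decreasing colors and greater-colored neighbors. Array names are illustrative assumptions.

```cuda
// Forward substitution of the multi-colour DILU sweep: solve (E + L) y = r,
// where the "L" couplings are the neighbours with a smaller colour. Within one
// colour every row depends only on lower-coloured rows, so it runs in parallel.
__global__ void dilu_forward_color(int n, const int* row_ptr, const int* col_idx,
                                   const double* val, const int* color,
                                   int current_color, const double* e_inv,
                                   const double* r, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || color[i] != current_color) return;
    double s = r[i];
    for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e) {
        int j = col_idx[e];
        if (color[j] < current_color)        // only lower-coloured neighbours
            s -= val[e] * y[j];
    }
    y[i] = e_inv[i] * s;                     // y_i = E_ii^{-1} (r_i - sum_{color(j)<color(i)} A_ij y_j)
}
// Host: for (int c = 0; c < num_colors; ++c) dilu_forward_color<<<...>>>(..., c, ...);
```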


NVAMG Library

Implements AMG, Krylov methods, and some utilities. Supports standard matrix formats with an easy-to-integrate C API. MPI support: interoperates with MPI-enabled applications. General beta release in July 2013.

ANSYS Fluent AMG Solver Time (sec): ANSYS Fluent 14.5 with NVAMG (beta feature), Helix model (lower is better).

2 x Xeon X5650, only 1 core used: 2832 s (CPU) vs. 517 s (CPU + Tesla C2075), a 5.5x speedup.
2 x Xeon X5650, all 12 cores used: 933 s (CPU) vs. 517 s (CPU + Tesla C2075), a 1.8x speedup.

Helix geometry: 1.2M hex cells, unsteady, laminar, coupled PBNS, DP. AMG F-cycle on CPU, AMG V-cycle on GPU.
NOTE: This is a performance preview; GPU support is a beta feature; all jobs solver time only.

Fluent + NVAMG Preview Results

ANSYS Fluent AMG time on a single CPU/GPU in ms, best solver settings on each platform (lower is better). Cases: Helix (hex, 208K cells), Helix (tet, 1173K cells), Airfoil (hex, 784K cells); platforms: a single K20X GPU vs. a 6-core 3930K CPU. [bar chart; individual timings not transcribed]

NVAMG Strong Scaling: 3M unknowns

ANSYS Fluent AMG Solver Time (sec): GPU acceleration of a 14M-cell truck body model, times for 1 iteration (lower is better).

1 node, 2 CPUs (12 cores total): 69 s on CPUs only.
2 nodes, 4 CPUs (24 cores total) + 8 GPUs (4 per node): 41 s on CPUs only vs. 12 s with GPUs (3.5x).
4 nodes, 8 CPUs (48 cores total) + 16 GPUs (4 per node): 28 s on CPUs only vs. 9 s with GPUs (3.3x).

Truck body model: 14M mixed cells, DES turbulence, coupled PBNS, SP. CPU: Intel Xeon E5-2667, 2.90 GHz, AMG F-cycle. GPU: Tesla K20X, preconditioned FGMRES with AMG.
NOTE: All jobs solver time only.

Tremendous Collaboration and Team Thanks to an awesome team (in alphabetical order): Marat Arsaev, Patrice Castonguay, Jonathan Cohen, Julien Demouth, Joe Eaton, Justin Luitjens, Nikolay Markovskiy, Maxim Naumov, Stan Posey, Nikolai Sakharnykh, Robert Strzodka, Zhenhai Zhu Star interns: Peter Zaspel, Simon Layton, Lu Wang, Istvan Reguly, Francesco Rossi, Christoph Franke, Felix Abecassis Our collaborators and AMG advisors: ANSYS: Sunil Sathe, Prasad Alavilli, Rongguang Jia PSU: James Brannick, Ludmil Zikatanov, Jinchao Xu, Xiaozhe Hu Developers at companies to be named later