AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015

2 Agenda
- Introduction to AmgX
- Current Capabilities
- Scaling V2.0
- Roadmap for the future

3 AmgX
- Fast, scalable linear solvers, emphasis on iterative methods
- Flexible toolkit for GPU accelerated Ax = b solver
- Simple API makes it easy to solve your problems faster

4 "Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation."
Garf Bowen, Ridgeway Kite Software

5 AmgX in Reservoir Simulation
- Solve Faster
- Solve Larger Systems
- Flexible High Level API
[Figure: bar chart of Application Time (seconds), lower is better; bars labeled CPU, GPU Custom, AmgX; values printed on the chart: 1150, 197, 98.]
3-phase Black Oil Reservoir Simulation. 400K grid blocks solved fully implicitly.
CPU: Intel Xeon CPU E
GPU: NVIDIA Tesla K10

6 AmgX 2.0: New Features since 1.0
- Classical AMG with truncation, robust aggressive coarsening
- Complex arithmetic
- GPUDirect, RDMA-async
- Power8 support, Maxwell support
- Crash-proof object management
- Re-usable setup phase
- Adaptors for major solver packages: HYPRE, PETSc, Trilinos
- Import data structures directly to AmgX for solve, export solution
- Host or Device pointer support
- JSON configuration

7 Key Features
- Un-smoothed Aggregation AMG
- Krylov methods: CG, GMRES, BiCGStab, IDR
- Smoothers and Solvers: Block-Jacobi, Gauss-Seidel, Incomplete LU, Dense LU, KPZ-Polynomial, Chebyshev
- Flexible composition system
- Scalar or coupled block systems, multi-precision
- MPI, OpenMP support
- Auto-consolidation
- Flexible, simple high level C API

8 Minimal Example With Config

    // One header
    #include "amgx_c.h"

    // Initialize the library, then read the config file
    AMGX_initialize();
    AMGX_config_create_from_file(&cfg, cfgfile);

    // Create resources based on config
    AMGX_resources_create_simple(&res, cfg);

    // Create solver object, A, x, b; mode sets the precision
    AMGX_solver_create(&solver, res, mode, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);

    // Read coefficients from a file (matrix, right-hand side, solution)
    AMGX_read_system(A, b, x, matrixfile);

    // Setup and solve loop
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    // Download result into a caller-allocated host buffer
    AMGX_vector_download(x, result);

Config file:

    solver(main)=fgmres
    main:max_iters=100
    main:convergence=relative_max
    main:tolerance=0.1
    main:preconditioner(amg)=amg
    amg:algorithm=aggregation
    amg:selector=size_8
    amg:cycle=v
    amg:max_iters=1
    amg:max_levels=10
    amg:smoother(smoother)=block_jacobi
    amg:relaxation_factor=0.75
    amg:presweeps=1
    amg:postsweeps=2
    amg:coarsest_sweeps=4
    determinism_flag=1
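The config above selects a Jacobi-type smoother with a relaxation factor of 0.75 and a fixed number of pre- and post-sweeps. As a rough illustration of what one such sweep computes, here is a minimal sketch of a damped Jacobi sweep on a CSR matrix in the scalar case (block size 1). The function name `jacobi_sweep` and its argument layout are assumptions made for this example; this is not AmgX code and says nothing about how AmgX implements its smoothers.

```c
#include <assert.h>

/* One damped Jacobi sweep on a CSR matrix (scalar case, block size 1):
 *   x_new[i] = x[i] + omega * (b[i] - sum_j A[i][j] * x[j]) / A[i][i]
 * Hypothetical helper for illustration only; not part of the AmgX API. */
void jacobi_sweep(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *b, double omega,
                  const double *x, double *x_new)
{
    assert(n > 0 && omega > 0.0);
    for (int i = 0; i < n; ++i) {
        double diag = 0.0;
        double residual = b[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col_idx[k] == i)
                diag = val[k];              /* remember the diagonal entry */
            residual -= val[k] * x[col_idx[k]];
        }
        assert(diag != 0.0);                /* Jacobi needs a nonzero diagonal */
        x_new[i] = x[i] + omega * residual / diag;
    }
}
```

With `omega = 0.75`, calling this once before and twice after the coarse-grid correction would mirror the `presweeps=1` and `postsweeps=2` settings, under the assumptions stated above.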

9 Integrates easily
- MPI and OpenMP domain decomposition
- Adding GPU support to existing applications raises new issues:
  - Proper ratio of CPU cores / GPU?
  - How can multiple CPU cores (MPI ranks) share a single GPU?
  - How does MPI switch between two sets of ranks: one set for CPUs, one set for GPUs?
- AmgX handles this via Consolidation
  - Consolidate multiple smaller sub-matrices into a single matrix
  - Handled automatically during PCIE data copy

10 [Figure: consolidation example. The Original Problem (unknowns u1 to u7) is Partitioned to 2 MPI Ranks (Rank 0 and Rank 1, with a boundary exchange between them); each partition is copied over PCIE and the two are Consolidated onto 1 GPU.]
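The consolidation idea can be sketched in a few lines of C. The example below assumes that each rank owns a contiguous block of global rows in rank order, stores its rows in CSR with local column indices, and carries a local-to-global column map that also covers its halo (boundary) columns. The `SubMatrix` struct and `consolidate` function are hypothetical names for illustration; the slides do not show AmgX's internal data layout.

```c
#include <assert.h>

/* One rank's piece of a distributed CSR matrix. Column indices are local;
 * local_to_global maps owned and halo columns to global column ids.
 * Hypothetical layout for illustration; AmgX's internal format may differ. */
typedef struct {
    int n_rows;                 /* rows owned by this rank */
    const int *row_ptr;         /* size n_rows + 1 */
    const int *col_idx;         /* local column indices */
    const double *val;          /* nonzero values */
    const int *local_to_global; /* local column -> global column */
} SubMatrix;

/* Append the rows of each part, in rank order, into one CSR matrix with
 * global column indices. Output arrays must be preallocated by the caller.
 * Returns the number of nonzeros written. */
int consolidate(const SubMatrix *parts, int n_parts,
                int *row_ptr, int *col_idx, double *val)
{
    assert(n_parts > 0);
    int row = 0, nnz = 0;
    row_ptr[0] = 0;
    for (int p = 0; p < n_parts; ++p) {
        const SubMatrix *m = &parts[p];
        for (int i = 0; i < m->n_rows; ++i) {
            for (int k = m->row_ptr[i]; k < m->row_ptr[i + 1]; ++k) {
                col_idx[nnz] = m->local_to_global[m->col_idx[k]];
                val[nnz] = m->val[k];
                ++nnz;
            }
            row_ptr[++row] = nnz;   /* close the row just copied */
        }
    }
    return nnz;
}
```

After consolidation, entries that referred to halo columns on one rank point at ordinary interior columns of the merged matrix, so no boundary exchange is needed between the ranks that share a GPU.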

11 Consolidation Examples
- 1 CPU socket <=> 1 GPU
- Dual socket CPU <=> 2 GPUs
- Dual socket CPU <=> 4 GPUs
- Arbitrary Cluster: 4 nodes x [2 CPUs + 3 GPUs], IB
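One simple way to decide which ranks consolidate onto which GPU is to split the node-local ranks into contiguous groups, one group per device. The sketch below shows that policy only; `gpu_for_rank` is a hypothetical helper, it assumes at least as many ranks as GPUs on a node, and the slides do not state which assignment policy AmgX itself uses.

```c
#include <assert.h>

/* Map a node-local MPI rank to a GPU so that neighboring ranks share a
 * device: the ranks are split into gpus_per_node contiguous groups.
 * Hypothetical helper; assumes ranks_per_node >= gpus_per_node. */
int gpu_for_rank(int local_rank, int ranks_per_node, int gpus_per_node)
{
    assert(gpus_per_node > 0 && ranks_per_node >= gpus_per_node);
    assert(local_rank >= 0 && local_rank < ranks_per_node);
    return (int)((long)local_rank * gpus_per_node / ranks_per_node);
}
```

For a dual-socket node running 16 ranks with 2 GPUs, this places ranks 0 to 7 on GPU 0 and ranks 8 to 15 on GPU 1.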

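12 PETSc KSP vs AmgX performance test
PDE: $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$
BCs: $\left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=1} = \left.\frac{\partial u}{\partial y}\right|_{y=0} = \left.\frac{\partial u}{\partial y}\right|_{y=1} = \left.\frac{\partial u}{\partial z}\right|_{z=0} = \left.\frac{\partial u}{\partial z}\right|_{z=1} = 0$
Exact solution: $u(x,y,z) = \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$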
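As a quick consistency check on this test problem, the stated exact solution can be differentiated directly. Each second derivative contributes a factor of $-4\pi^2$, which fixes the sign and magnitude of the right-hand side, and the first derivatives vanish on the faces of the unit cube because $\sin(0) = \sin(2\pi) = 0$:

```latex
\begin{aligned}
u &= \cos(2\pi x)\cos(2\pi y)\cos(2\pi z), \\
\frac{\partial^2 u}{\partial x^2} &= \frac{\partial^2 u}{\partial y^2}
  = \frac{\partial^2 u}{\partial z^2} = -4\pi^2 u, \\
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}
  + \frac{\partial^2 u}{\partial z^2}
  &= -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z), \\
\frac{\partial u}{\partial x} &= -2\pi \sin(2\pi x)\cos(2\pi y)\cos(2\pi z)
  = 0 \quad \text{at } x = 0 \text{ and } x = 1.
\end{aligned}
```

The same argument applies to the $y$ and $z$ derivatives, so the solution satisfies all six homogeneous Neumann conditions.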

13 PETSc vs AmgX
- 7x speedup @4M unknowns: 16 cores vs 1 GPU
- 8x speedup @100M unknowns: 512 cores vs 32 GPUs
Machine specification
- GPU nodes: two K20m GPUs per node
- CPU nodes: two Intel Xeon E CPUs per node (16 cores per node in total)
- PETSc KSP solver

14 SPE10 Cases
We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA stencil.
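For readers unfamiliar with TPFA (two-point flux approximation), the matrix coefficient that couples two neighboring grid cells is the harmonic combination of the two cells' half-transmissibilities. The sketch below shows that standard formula for a single face; the function name and parameters are chosen for this example and are not taken from the SPE10 test setup used in the slides.

```c
#include <assert.h>

/* TPFA transmissibility across the face shared by cells i and j.
 * Each half-transmissibility is k * area / (distance from the cell
 * center to the face); the face value is their harmonic combination.
 * Hypothetical helper for illustration. */
double tpfa_transmissibility(double k_i, double k_j, double area,
                             double half_dx_i, double half_dx_j)
{
    assert(k_i > 0.0 && k_j > 0.0 && area > 0.0);
    assert(half_dx_i > 0.0 && half_dx_j > 0.0);
    double t_i = k_i * area / half_dx_i;
    double t_j = k_j * area / half_dx_j;
    return t_i * t_j / (t_i + t_j);
}
```

Because the result is dominated by the smaller of the two permeabilities, a strongly heterogeneous permeability field such as SPE10 produces matrix coefficients that vary over many orders of magnitude, which is what makes these cases demanding for a solver. Refining the grid in z shrinks the center-to-face distances in that direction and therefore changes the vertical coefficients.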

15 SPE10 Matrix Tests
[Figure: Speedup vs Millions of Unknowns, 1 Socket vs 1 GPU.]
GPU: NVIDIA K40
CPU: HYPRE on 10 core IvyBridge Xeon E GHz

16 Scaling up the right way

17 Poisson Equation / Laplace operator
Aggregation and Classical Weak Scaling, 8 Million DOF per GPU
[Figure: Setup Time (s) vs Number of GPUs; series: AmgX 1.0 (PMIS), AmgX 1.0 (AGG).]
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 2.2GHz

18 Solve Time: Poisson Equation / Laplace operator
Aggregation and Classical Weak Scaling, 8 Million DOF per GPU
[Figure: Time per Iteration vs Log(P), x-axis Number of GPUs; series: ClassicalAMGSolve, AggregationAMGSolve, each with a linear trend line and R² value.]
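The trend lines on this slide fit time per iteration against the logarithm of the GPU count. A minimal sketch of such a fit is given below, assuming the GPU counts are powers of two so that log2(P) is an integer. The function names are hypothetical, the R² value is not computed, and any timings passed in would be the reader's own measurements rather than the values from the chart.

```c
#include <assert.h>

/* Integer log2 for a power-of-two GPU count (for example 8 -> 3). */
static int log2_pow2(int p)
{
    assert(p > 0 && (p & (p - 1)) == 0);
    int e = 0;
    while (p > 1) {
        p >>= 1;
        ++e;
    }
    return e;
}

/* Least-squares fit  time ~= slope * log2(P) + intercept  over n samples.
 * Hypothetical helper for illustration. */
void fit_time_vs_logp(const int *gpus, const double *time, int n,
                      double *slope, double *intercept)
{
    assert(n >= 2);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int i = 0; i < n; ++i) {
        double x = (double)log2_pow2(gpus[i]);
        sx += x;
        sy += time[i];
        sxx += x * x;
        sxy += x * time[i];
    }
    double denom = n * sxx - sx * sx;
    assert(denom != 0.0);   /* needs at least two distinct GPU counts */
    *slope = (n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / n;
}
```

A small, roughly constant slope means each doubling of the GPU count adds a fixed cost per iteration, which is the behavior a logarithmic weak-scaling plot is meant to expose.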

19 Iterations: Poisson Equation / Laplace operator
Classical AMG Preconditioner, 8 Million DOF per GPU
[Figure: Iterations vs Number of GPUs; series: PCG, GMRES.]

20 Solve Time (s): Poisson Equation / Laplace operator
Classical AMG Preconditioner, 8 Million DOF per GPU
[Figure: Solve Time (s) vs Number of GPUs; series: GMRES, PCG.]

21 AmgX 2.0: MPI with GPUDirect RDMA
- 4x lower latency
- 3x Bandwidth
- 45% lower CPU utilization

22 Basic Coarsening

23 Basic Coarsening

24 Aggressive Coarsening

25 Aggressive Coarsening: Less Memory, Faster Setup

26 AmgX 2.0 Licensing
- Developer/Academic License: non commercial use, free
- Commercial License, Developer License, Premier Support Service
- Subscription License (node/year)
  - Includes Support and Maintenance
  - Volume based pricing
- Site License
- Perpetual License
  - 20% Maintenance and Support

27 AmgX Roadmap
AmgX 2.0 Release Q
- Classical AMG: multi node, multi GPU, Aggressive coarsening
- Complex Arithmetic + Aggregation
- Easy interfaces, python
- PETSc, HYPRE, Trilinos
- Robust convergence on SPE10
- GPUDirect v2.0
- Commercial License
- Premier Support
AmgX 2.5 Q
- Scalable Sparse Eigensolvers
- Scaling past 512 GPUs
- Range Decomposition AMG
- Guaranteed convergence aggregation
- CUDA 8.0 with Pascal Support
- Tuning for Maxwell
- Continuous Improvement
[Figure axes: Features, Availability.]

28 AmgX 2.0 was made by a great team of contributors.
- AmgX 2.0 Team: Marat Arsaev, Joe Eaton, Alex Fender, Andrei Schaffer
- AmgX 2.0 Devtechs: Simon Layton, Nikolai Sakharnykh, Nikolay Markovskiy
- Interns: Rohit Gupta, Constantine Stulov

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016

2 Agenda
- Challenges
- What is Algebraic Multi-Grid (AMG)?
- Why use AMG?
- When to use AMG?
- NVIDIA AmgX
- Results