Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Transcription:

Workshop on Batched, Reproducible, and Reduced Precision BLAS, Atlanta, GA, 02/25/2017. Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs. Hartwig Anzt, joint work with Goran Flegar, Enrique Quintana-Orti. http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf

Workshop on Batched, Reproducible, and Reduced Precision BLAS, Atlanta, GA, 02/25/2017. Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs. Hartwig Anzt, joint work with Goran Flegar, Enrique Quintana-Orti. Pitch from the sparse community: small sizes (<30), somewhat flexible; add an inversion routine to Batched BLAS; go for Gauss-Jordan elimination instead of LU. http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf

Iterative linear system solvers. Goal: solve a linear system Ax = b with A ∈ R^{n×n}. An iterative Krylov method generates the solution approximation, and convergence typically benefits from using a preconditioner. We focus on Jacobi preconditioners based on diagonal scaling, P = (diag(A))^{-1}. They provide a high level of parallelism and are therefore attractive for many-core accelerators like GPUs. Scalar Jacobi acts on the main diagonal; block-Jacobi acts on the block diagonal. 3
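
In code, applying the scalar Jacobi preconditioner is just a diagonal scaling z = (diag(A))^{-1} r with one independent operation per row. A minimal CUDA sketch (illustrative names, not the MAGMA-sparse API; dinv is assumed to hold the precomputed reciprocals of the diagonal entries):

    // z = diag(A)^{-1} * r, one thread per row
    __global__ void scalar_jacobi_apply(int n, const double *dinv,
                                        const double *r, double *z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = dinv[i] * r[i];   // z_i = r_i / a_ii
    }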

Block-Jacobi preconditioner. It can reflect the block structure of sparse matrices (e.g., higher-order FEM), is often superior to the scalar Jacobi preconditioner, is light-weight compared to ILU, and allows a parallel preconditioner application (batched trsv / SpMV + vector scaling). Two realizations: inversion of the diagonal blocks in the preconditioner setup + SpMV in the preconditioner application, or factorization of the diagonal blocks in the setup + trsv in the application. The preconditioner setup (inversion of the diagonal blocks) can become a challenge: it requires the batched inversion of many small systems, as the diagonal blocks are typically of small size. 6
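
With explicit inversion, the preconditioner application is a block-diagonal SpMV: each sub-vector is multiplied by the corresponding inverted diagonal block. A simplified CUDA sketch, assuming a uniform block size bs <= 32 and the inverses stored block after block in column-major order (the actual kernels support variable block sizes):

    __global__ void block_jacobi_apply(int bs, const double *block_inv,
                                       const double *r, double *z)
    {
        // one CUDA thread block per diagonal block, one thread per row of the block
        int b   = blockIdx.x;                                   // index of the diagonal block
        int row = threadIdx.x;                                  // local row inside the block
        const double *Dinv = block_inv + (size_t)b * bs * bs;   // column-major bs x bs inverse
        const double *rb   = r + (size_t)b * bs;

        double sum = 0.0;
        for (int col = 0; col < bs; ++col)
            sum += Dinv[col * bs + row] * rb[col];              // row 'row' of D_b^{-1} times r_b
        z[(size_t)b * bs + row] = sum;
    }

Launched, for example, as block_jacobi_apply<<<num_diag_blocks, bs>>>(bs, d_inv, d_r, d_z).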

Block-Jacobi preconditioner generation: 1. extract the diagonal block from the sparse data structure; 2. invert the diagonal block via Gauss-Jordan elimination; 3. insert the inverse as a diagonal block into the preconditioner matrix. 7
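
On the host side, the unmerged variant of this setup amounts to separate batched kernel launches, one per step. A hypothetical driver sketch (illustrative names, not the MAGMA-sparse API), assuming a uniform block size BS <= 32, one warp per diagonal block, and d_block_start holding the first row of each block; the extraction and inversion kernels it calls are sketched after the later slides:

    template <int BS>
    void block_jacobi_setup(int num_blocks, const int *d_block_start,
                            const int *d_rowptr, const int *d_colind, const double *d_val,
                            double *d_blocks, double *d_prec)
    {
        size_t shmem = (size_t)BS * BS * sizeof(double);
        // 1. extract the diagonal blocks from the CSR structure into a dense batch
        extract_diag_blocks_shared<<<num_blocks, 32, shmem>>>(BS, d_block_start,
                                                              d_rowptr, d_colind,
                                                              d_val, d_blocks);
        // 2. invert every block with batched Gauss-Jordan elimination (one warp per system)
        batched_gje_inverse<BS><<<num_blocks, 32>>>(d_blocks, d_prec);
        // 3. insertion is implicit here: the inverted blocks already lie block after block
        //    in d_prec; a merged variant fuses all three steps into a single kernel.
    }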

Batched Gauss-Jordan elimination (BGJE) on GPUs: flexible size up to 32 x 32, one warp per system, inversion in registers, communication via shfl(), and implicit pivoting (row/column swaps are accumulated and realized when writing the results). Advantages over an LU-based approach: no operations on triangular systems, implicit pivoting, and the implementation can use a static system size. 9
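
The core idea can be sketched as follows (a simplified version: fixed system size N as a template parameter and no pivoting, whereas the BGJE kernel of the talk adds implicit pivoting and handles any size up to 32). Each lane of the warp owns one column of the block in registers, all inter-column communication goes through __shfl_sync, and the in-place Gauss-Jordan update is applied for every pivot index k:

    template <int N>                                  // block size, N <= 32
    __global__ void batched_gje_inverse(const double *A, double *Ainv)
    {
        const unsigned FULL = 0xffffffffu;
        int sys  = blockIdx.x;                        // one warp (32 threads) per system
        int lane = threadIdx.x;                       // lane j owns column j

        double a[N];                                  // column j, kept in registers via full unrolling
        #pragma unroll
        for (int i = 0; i < N; ++i)
            a[i] = (lane < N) ? A[(size_t)sys * N * N + lane * N + i] : 0.0;   // column-major load

        #pragma unroll
        for (int k = 0; k < N; ++k) {
            double rowk = a[k];                       // A(k, lane)
            double piv  = __shfl_sync(FULL, rowk, k); // pivot A(k, k), broadcast from lane k
            double rk   = rowk / piv;
            #pragma unroll
            for (int i = 0; i < N; ++i) {
                double colk = __shfl_sync(FULL, a[i], k);   // A(i, k), read before it is overwritten
                if (i == k)
                    a[i] = (lane == k) ? 1.0 / piv : rk;
                else
                    a[i] = (lane == k) ? -colk / piv : a[i] - colk * rk;
            }
        }

        #pragma unroll
        for (int i = 0; i < N; ++i)
            if (lane < N)
                Ainv[(size_t)sys * N * N + lane * N + i] = a[i];
    }

Without pivoting, this sketch is only numerically safe for blocks that do not require row exchanges (e.g., diagonally dominant ones); the implicit pivoting of the actual implementation accumulates the swaps and realizes them when writing the result.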

NVIDIA Pascal architecture (P100). NVIDIA whitepaper: 56 SMX, 5.3 TF DP, 768 GB/s (6:1). Inversion costs 2n^3 operations; flexible-size inversion up to size 32 x 32, one warp per system. (Figure: BGJE performance in GFlop/s over the batch size and over the matrix size.) Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak. Anzt et al.: Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs. PMAM'17, 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 1-10, 2017. 10

Challenge: diagonal block extraction from sparse data structures. In the three-step generation (extract the diagonal block, invert it via Gauss-Jordan elimination, insert the inverse into the preconditioner matrix), the extraction from the CSR format is costly: assigning one thread per row means that the unbalanced nonzero distribution results in load imbalance, dense rows become the bottleneck, and the memory access is non-coalescent. 12

Challenge: diagonal block extraction from sparse data structures. (Figure: memory requests of the cached extraction, reading the diagonal block directly from the CSR row-ptr and col-indices arrays.) With the matrix in CSR format, diagonal block extraction is costly: one thread per row, load imbalance due to the unbalanced nonzero distribution, dense rows as the bottleneck, and non-coalescent memory access. 13

Challenge: diagonal block extraction from sparse data structures. (Figure: memory requests of the cached extraction versus the shared-memory extraction, operating on the CSR row-ptr and col-indices arrays.) Cached extraction assigns one thread per row: the unbalanced nonzero distribution results in load imbalance, dense rows become the bottleneck, and the memory access is non-coalescent. The shared-memory extraction adds an intermediate step: extract the diagonal blocks to shared memory, use multiple threads for each row, access the CSR data in a coalescent fashion, and store the block in column-major order. 14

The setup can further be organized as one merged kernel [ extraction + inversion + insertion ] or as 3 separate kernels [ extraction ] + [ inversion ] + [ insertion ]. 16
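
A sketch of the shared-memory extraction step (simplified: uniform block size bs, one warp per diagonal block; block_start, giving the first row of each block, is a hypothetical input). All 32 threads sweep every row together, so the reads of the CSR column indices and values are coalesced, and the block is stored in column-major order ready for the inversion; in the merged variant this code would be fused with the inversion kernel instead of writing the block back to global memory:

    __global__ void extract_diag_blocks_shared(int bs, const int *block_start,
                                               const int *rowptr, const int *colind,
                                               const double *val, double *blocks_out)
    {
        extern __shared__ double blk[];               // bs * bs scratch block, column-major
        int lane      = threadIdx.x;                  // 32 threads cooperate on every row
        int first_row = block_start[blockIdx.x];      // one warp per diagonal block
        double *out   = blocks_out + (size_t)blockIdx.x * bs * bs;

        for (int idx = lane; idx < bs * bs; idx += 32)        // zero-fill (blocks may be sparse)
            blk[idx] = 0.0;
        __syncthreads();

        for (int r = 0; r < bs; ++r) {                        // all lanes sweep the same row
            int row = first_row + r;
            for (int j = rowptr[row] + lane; j < rowptr[row + 1]; j += 32) {
                int c = colind[j] - first_row;                // column relative to the block
                if (c >= 0 && c < bs)
                    blk[c * bs + r] = val[j];                 // column-major in shared memory
            }
        }
        __syncthreads();

        for (int idx = lane; idx < bs * bs; idx += 32)        // hand the dense block to the inversion
            out[idx] = blk[idx];
    }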

Diagonal block extraction from sparse data structures. Strategies: US = unmerged + shared extraction, UC = unmerged + cached extraction, MS = merged + shared extraction, MC = merged + cached extraction. Results show the runtime of the complete block-Jacobi generation relative to the BGJE inversion alone for 56 test matrices (SuiteSparse), arranged w.r.t. the performance of the (on average) best strategy. (Figure: BJ generation time / BGJE time for block sizes 8 and 32 on a K40.) 17

Diagonal block extraction from sparse data structures (US / UC / MS / MC as before). Tridiagonal-type matrix structure: balanced nnz per row. Arrow-type matrix structure: a few rows contain many elements, and the corresponding threads become the bottleneck for cached extraction. (Figure: BJ generation time / BGJE time over the matrix size for tridiagonal-type and arrow-type matrices.) 18

Block-Jacobi preconditioner setup cost. (Figure: block-Jacobi generation time in seconds, between 1e-5 and 1e-1, for block sizes 4, 8, 16, 24, and 32 over a set of SuiteSparse test matrices.) The preconditioner setup needs at most about 0.01 s on a P100 GPU and is irrelevant for the overall iterative solver execution time. 19

Block-Jacobi preconditioner (recap): either inversion of the diagonal blocks in the preconditioner setup + SpMV in the preconditioner application, or factorization of the diagonal blocks in the setup + trsv in the application. The following slides consider the factorization-based variant and compare it with the inversion-based one. 20

Factorization-based block-Jacobi. Preconditioner setup: 1. extract the diagonal block from the sparse data structure; 2. generate the factorization and pivoting; 3. insert the factors as a block into the preconditioner matrix and store the pivoting (*ipiv). Preconditioner application: apply trsv using the pivoting information to turn the right-hand-side sub-vector into the solution of the diagonal-block linear problem. 21
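
The application step of this variant can be sketched as a batched dense triangular solve per block. Simplified sketch under a hypothetical storage scheme (unit-lower and upper factors packed into one column-major bs x bs block, pivoting stored as a row permutation, one thread solving one block sequentially; the actual kernels parallelize inside a warp):

    __global__ void block_jacobi_apply_lu(int bs, int num_blocks,
                                          const double *lu, const int *ipiv,
                                          const double *r, double *z)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per diagonal block
        if (b >= num_blocks) return;

        const double *LU = lu   + (size_t)b * bs * bs;   // packed L (unit diagonal) and U
        const int    *pv = ipiv + (size_t)b * bs;        // pv[i]: original row placed at position i
        const double *rb = r    + (size_t)b * bs;

        double y[32];                                    // bs <= 32
        for (int i = 0; i < bs; ++i)                     // apply the row permutation to the rhs
            y[i] = rb[pv[i]];
        for (int i = 0; i < bs; ++i)                     // forward solve L y = P r
            for (int j = 0; j < i; ++j)
                y[i] -= LU[i + j * bs] * y[j];
        for (int i = bs - 1; i >= 0; --i) {              // backward solve U z = y
            for (int j = i + 1; j < bs; ++j)
                y[i] -= LU[i + j * bs] * y[j];
            y[i] /= LU[i + i * bs];
        }
        for (int i = 0; i < bs; ++i)
            z[(size_t)b * bs + i] = y[i];
    }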

Factorization-based block-Jacobi. Batched routines: small-size LU, Gauss-Huard, and Gauss-Huard-T, where Gauss-Huard-T writes the factors in an access-friendly fashion. Flexible size up to a limit of 32 x 32; one warp per system, factorization in registers. (Figure: GFlop/s over the matrix size for small-size LU, Gauss-Huard, and Gauss-Huard-T compared against cuBLAS LU, dgetrf and dgetri, on an NVIDIA P100 GPU.) 22

NVIDIA Pascal architecture (P100). (Figure: time per matrix in milliseconds over the batch size for Gauss-Jordan elimination (BGJE), MAGMA dgetrf-based inversion (BLU), Gauss-Huard (BGH), and Gauss-Huard with transposition (BGHT) on an NVIDIA P100 GPU.) BGJE and BLU invert the diagonal blocks, whereas BGH / BGHT only factorize them. BGHT writes the factors in a non-coalescent fashion in exchange for a faster preconditioner application. 23

Inversion-based block-Jacobi vs. factorization-based block-Jacobi: convergence of BiCGSTAB with a block-Jacobi preconditioner for 300 matrices. (Figure: histogram of the test cases over the iteration-count difference for block sizes 4, 8, 16, 24, and 32; cases where BGH needs more iterations favor explicit inversion, cases where BGJE needs more iterations favor factorization.) NVIDIA P100 GPU. 24

Block-Jacobi preconditioning: performance. (Figures: setup and solve runtimes of BGJE and BGH for ten selected test matrices, and the total execution time of BiCGSTAB with a Jacobi(24) preconditioner using BGJE, BGH, and BGHT over 300 test matrices.) There is almost no difference in total execution time: BGH is better if the setup is significant relative to the iterative solver time, BGJE is better for long solver runs. NVIDIA P100 GPU. 25

Pitch from the sparse community: small sizes (<30), somewhat flexible; add an inversion routine to Batched BLAS; go for Gauss-Jordan elimination instead of LU. http://icl.cs.utk.edu/magma/ This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (University of Jaume I), and is partly funded by the Department of Energy. http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf 26