GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research Engineer @ EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013

INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks» Approximate inverse preconditioner» GPU implementation details» In-depth benchmarks» Direct solvers» EM Photonics CULA Sparse library

PREVIOUS WORK CULA DENSE» Dense LAPACK for CUDA» System solvers» Least squares solvers» Eigenvalues» Singular value decomposition» Multi-GPU support» 3-7x speedup vs. optimized CPU» Presented at GTC over the past few years

SPARSE SYSTEM INTRODUCTION» Physical phenomena modeled by partial differential equations» Forces, momenta, velocities, energy, temperature, etc.» Sparse systems appear in discretization methods» Finite element, finite volume, topology methods, etc.» Fluid & structural dynamics, electromagnetics, etc.

SPARSE SYSTEM INTRODUCTION [Figures: a finite element mesh and the sparse matrix assembled from it. Images courtesy of Iterative Methods for Sparse Linear Systems by Yousef Saad]

ITERATIVE SOLVERS

ITERATIVE SOLVERS - BASICS» Want to solve for x in Ax = b» Each iteration works toward lowering the residual r = Ax - b» Krylov subspace methods» K_m(A, v) = span{v, Av, A^2 v, ..., A^(m-1) v}» Sparse matrix-vector multiplication» CG, GMRES, BiCGSTAB, Minres, etc.

ITERATIVE SOLVERS CG METHOD [Figure: preconditioned conjugate gradient algorithm, annotated to show where the preconditioner M is applied with a matrix-vector solve and where the sparse matrix-vector multiplication occurs]
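
To make the two highlighted operations concrete, here is a minimal preconditioned conjugate gradient sketch in plain C over a CSR matrix. It is illustrative only and not CULA Sparse code; the csr_spmv, dot, pcg, and identity_precond names and the callback-based preconditioner interface are assumptions made for this sketch.

/* Minimal preconditioned CG in plain C over CSR storage (illustrative sketch). */
#include <math.h>
#include <stdlib.h>

/* y = A*x for a CSR matrix given by (row_ptr, col_idx, val), n rows */
static void csr_spmv(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Preconditioner application z = M^{-1} r, supplied as a callback. */
typedef void (*precond_fn)(int n, const double *r, double *z, void *data);

static void identity_precond(int n, const double *r, double *z, void *data)
{
    (void)data;
    for (int i = 0; i < n; ++i) z[i] = r[i];        /* M = I: no preconditioning */
}

/* Solve Ax = b for symmetric positive definite A; x holds the initial guess. */
static int pcg(int n, const int *row_ptr, const int *col_idx, const double *val,
               const double *b, double *x, precond_fn M, void *Mdata,
               int max_iter, double tol)
{
    double *r = malloc(n * sizeof *r), *z  = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);
    csr_spmv(n, row_ptr, col_idx, val, x, Ap);      /* r = b - A*x0            */
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    M(n, r, z, Mdata);                              /* z = M^{-1} r            */
    for (int i = 0; i < n; ++i) p[i] = z[i];
    double rz = dot(n, r, z);
    int it;
    for (it = 0; it < max_iter; ++it) {
        csr_spmv(n, row_ptr, col_idx, val, p, Ap);  /* sparse matrix-vector    */
        double alpha = rz / dot(n, p, Ap);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (sqrt(dot(n, r, r)) < tol) break;        /* residual small enough?  */
        M(n, r, z, Mdata);                          /* apply preconditioner    */
        double rz_new = dot(n, r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(Ap);
    return it;
}

Calling pcg(..., identity_precond, NULL, ...) runs plain CG; the preconditioners discussed next plug into the same callback, and on the GPU the same loop structure is kept while the matrix-vector product, vector updates, and preconditioner application become device kernels or library calls.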

ITERATIVE SOLVERS - PRECONDITIONERS» Modify original system to make it easier to solve» Ax = b → MAx = Mb» (Hopefully) reduce the number of iterations needed to solve» One time setup & generation cost» Per-iteration application cost

PRECONDITIONERS JACOBI» Jacobi / Block Jacobi» M = diag(A)» M = block_diag(A, bs)» Minor reduction in iterations» Lightweight and parallel in generation and application
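
As a concrete example of how lightweight this preconditioner is, the setup below is one pass over the CSR diagonal and the per-iteration application is an elementwise product; both loops are trivially parallel. The jacobi_setup and jacobi_precond names follow the callback interface of the pcg sketch above and are assumptions of this sketch, not CULA's API.

/* One-time setup: invert the diagonal of a CSR matrix. */
static void jacobi_setup(int n, const int *row_ptr, const int *col_idx,
                         const double *val, double *inv_diag)
{
    for (int i = 0; i < n; ++i) {
        inv_diag[i] = 1.0;                           /* fallback for a missing diagonal */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            if (col_idx[j] == i && val[j] != 0.0)
                inv_diag[i] = 1.0 / val[j];
    }
}

/* Per-iteration application: z = D^{-1} r, an elementwise product. */
static void jacobi_precond(int n, const double *r, double *z, void *data)
{
    const double *inv_diag = (const double *)data;
    for (int i = 0; i < n; ++i) z[i] = inv_diag[i] * r[i];
}

Used as pcg(n, row_ptr, col_idx, val, b, x, jacobi_precond, inv_diag, max_iter, tol); the block variant works the same way with small dense diagonal blocks inverted during setup.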

PRECONDITIONERS ILU0» Zero fill incomplete LU (ilu0)» M = LU ≈ A» M has same pattern as A» Significant reduction in iterations» Performance depends on matrix structure» Parallelism from graph coloring» Applied with sparse triangular solve» Limited performance on GPU
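
Applying ILU0 every iteration means two sparse triangular solves, z = U^{-1}(L^{-1} r). The sequential sketch below (assuming L is stored as its strictly lower CSR part with an implicit unit diagonal, and U as an upper CSR factor with its diagonal stored) makes the row-to-row dependence visible: each row needs results from earlier rows, which is exactly what level scheduling or graph coloring has to break before this runs well on a GPU.

/* Forward solve L*y = r, L unit lower triangular in CSR
   (only strictly lower entries stored; unit diagonal implied). */
static void lower_solve_unit(int n, const int *row_ptr, const int *col_idx,
                             const double *val, const double *r, double *y)
{
    for (int i = 0; i < n; ++i) {                  /* rows must be visited in order...     */
        double s = r[i];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            s -= val[j] * y[col_idx[j]];           /* ...because earlier y values are read */
        y[i] = s;
    }
}

/* Backward solve U*z = y, U upper triangular in CSR with its diagonal stored. */
static void upper_solve(int n, const int *row_ptr, const int *col_idx,
                        const double *val, const double *y, double *z)
{
    for (int i = n - 1; i >= 0; --i) {
        double s = y[i], d = 1.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            if (col_idx[j] == i) d = val[j];       /* diagonal entry */
            else                 s -= val[j] * z[col_idx[j]];
        }
        z[i] = s / d;
    }
}

The preconditioner application is then lower_solve_unit followed by upper_solve on the ILU0 factors.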

PRECONDITIONERS APPROXIMATE INVERSE» Approximate Inverse» M^-1 ≈ A^-1, i.e. AM^-1 ≈ I» Factorized Approximate Inverse (for SPD systems)» M^-1 = LL^T ≈ A^-1» M^-1 has pattern of A (or A^2, A^3, ...)» Moderate reduction in iterations compared to ILU0» Very expensive to generate but highly parallel» Applied with sparse matrix-vector multiplication» Extremely high performance on GPU
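
Because the factorized approximate inverse is held explicitly, applying it is just two sparse matrix-vector products with no ordering constraint between rows, which is why it maps so well onto the GPU. Here is a short sketch reusing csr_spmv from the CG example above; storing L and L^T as two separate CSR matrices is an assumption of this sketch.

/* Apply z = M^{-1} r = L * (L^T * r). Both steps are independent per row,
   so each one maps directly onto a parallel SpMV kernel on the GPU. */
static void fainv_apply(int n,
                        const int *L_ptr,  const int *L_idx,  const double *L_val,
                        const int *Lt_ptr, const int *Lt_idx, const double *Lt_val,
                        const double *r, double *tmp, double *z)
{
    csr_spmv(n, Lt_ptr, Lt_idx, Lt_val, r, tmp);   /* tmp = L^T r */
    csr_spmv(n, L_ptr,  L_idx,  L_val,  tmp, z);   /* z   = L tmp */
}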

PRECONDITIONER PERFORMANCE - ITERATIONS [Plot: relative residual per iteration, from 100 down to 1e-6, over 0 to 3000 iterations; curves for none, ilu0, jacobi, block inverse, and fainv; circuit simulation problem with 7.5M non-zero elements]

PRECONDITIONER PERFORMANCE - RESIDUAL OVER TIME, CPU (MULTI-CORE) [Plot: relative residual vs. time from 0 to 120 seconds; curves for none, ilu0, jacobi, block inverse, and fainv; circuit simulation problem with 7.5M non-zero elements]

PRECONDITIONER PERFORMANCE - RESIDUAL OVER TIME, GPU (MANY-CORE) [Plot: relative residual vs. time from 0 to 30 seconds; curves for none, ilu0, jacobi, block inverse, and fainv; circuit simulation problem with 7.5M non-zero elements]

APPROXIMATE INVERSE DETAILS

FACTORIZED APPROXIMATE INVERSE MATH LITE» Minimization of Frobenius norm: ||AM - I||_F^2 = sum_{k=1}^{n} ||A m_k - e_k||_2^2» Can be solved with per-column parallelism!» Finding A_k is memory intensive» extract a dense small matrix relative to the number of elements in the column» Minimization is compute intensive» positive definite solve on the small dense matrix» least squares solve for non-SPD inputs
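
A sketch of the per-column parallelism: assuming the small dense blocks A_k (and the matching right-hand sides) have already been gathered, which is the memory-intensive extraction step elided here, every column reduces to an independent small SPD solve. The small_chol_solve and fainv_columns names are illustrative; the implementation described in this talk batches these solves in custom CUDA kernels rather than the OpenMP loop shown.

#include <math.h>

/* Solve the small dense SPD system A*x = b (A is m x m, row-major; A is
   overwritten by its Cholesky factor). Unblocked -- fine for small blocks. */
static void small_chol_solve(int m, double *A, const double *b, double *x)
{
    for (int j = 0; j < m; ++j) {                   /* factor A = L*L^T         */
        for (int k = 0; k < j; ++k)
            A[j * m + j] -= A[j * m + k] * A[j * m + k];
        A[j * m + j] = sqrt(A[j * m + j]);
        for (int i = j + 1; i < m; ++i) {
            for (int k = 0; k < j; ++k)
                A[i * m + j] -= A[i * m + k] * A[j * m + k];
            A[i * m + j] /= A[j * m + j];
        }
    }
    for (int i = 0; i < m; ++i) {                   /* forward solve L*y = b    */
        double s = b[i];
        for (int k = 0; k < i; ++k) s -= A[i * m + k] * x[k];
        x[i] = s / A[i * m + i];
    }
    for (int i = m - 1; i >= 0; --i) {              /* backward solve L^T*x = y */
        double s = x[i];
        for (int k = i + 1; k < m; ++k) s -= A[k * m + i] * x[k];
        x[i] = s / A[i * m + i];
    }
}

/* Columns are independent, so the whole construction parallelizes trivially. */
static void fainv_columns(int ncols, const int *blk_size,
                          double **Ahat, double **rhs, double **mk)
{
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < ncols; ++k)
        small_chol_solve(blk_size[k], Ahat[k], rhs[k], mk[k]);
}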

FACTORIZED APPROXIMATE INVERSE - IMPLEMENTATION» High performance hybrid CPU/GPU solution» Find chunks of successive A_k matrices using the CPU» batched with OpenMP» Perform positive definite solves on the GPU» batched with custom CUDA kernels» Overlap chunks using asynchronous operations [Timeline: the CPU extracts four successive chunks of A_k (n/4 columns each) while the GPU solves the corresponding M chunks, with the two stages overlapped]

FACTORIZED APPROXIMATE INVERSE TEST MATRICES (from the UF Sparse Matrix Collection)» G3_circuit - circuit simulation, n = 1.5M, nnz = 7.6M» apache2 - structure, n = 700K, nnz = 4.8M» msdoor - structure, n = 400K, nnz = 19M» ecology2 - 2d/3d problem, n = 1M, nnz = 5M» offshore - electromagnetics, n = 250K, nnz = 4.2M» audikw_1 - structure, n = 950K, nnz = 77M» ldoor - structure, n = 900K, nnz = 42M» thermal2 - heat transfer, n = 1.2M, nnz = 8.5M

FACTORIZED APPROXIMATE INVERSE BENCHMARKS [Bar chart: total execution time in seconds (0-600, smaller is better) for ilu0 (cpu), fainv (cpu), ilu0 (gpu), and ainv (gpu) on each test matrix; some configurations did not converge in reasonable time] GPU: NVIDIA C2070, CPU: Dual Intel Xeon 5660. Speedups*: G3_circuit 9.92x, apache2 7.14x, msdoor 11.06x, ecology2 10.73x, offshore 8.21x, audikw_1 8.08x, ldoor 10.73x, thermal2 15.00x. * Best CUDA algorithm relative to best CPU algorithm; includes data transfers, preconditioner generation, and iterations.

FACTORIZED APPROXIMATE INVERSE BENCHMARKS [Bar chart: preconditioner generation time relative to CPU ilu0 (0-12, smaller is better) for ilu-host, ainv-host, ilu-gpu, and ainv-gpu on G3_circuit, apache2, msdoor, ecology2, offshore, audikw_1, ldoor, and thermal2]

FACTORIZED APPROXIMATE INVERSE - SUMMARY» Successful in reducing number of iterations» Very expensive to generate» Massively parallel» Excellent preconditioners for the GPU» Choose an algorithm that fits your platform

DIRECT SPARSE LINEAR SYSTEM SOLVERS

DIRECT SPARSE SOLVERS» LU, Cholesky, or QR factorization of the matrix» Dense methods require the full MxN matrix» Express the problem as a number of relatively small dense problems with task and data parallelism» Factorization stages: 1. Symbolic - analyze graph structure to determine a plan of attack that best suits the platform 2. Numeric - perform the factorization calculations

DIRECT SPARSE SOLVERS - SYMBOLIC FACTORIZATION» The nonzero pattern forms an elimination tree describing an order of column (vector) operations» Reordering allows for better parallelism and data locality [Spy plots: sparsity pattern before and after reordering, nz = 33]

DIRECT SPARSE SOLVERS - SYMBOLIC FACTORIZATION» Merge adjacent nodes into supernodes that can now utilize high-performance matrix operations» Fill in with explicit zeros if needed» Make choices based on the hardware available» Zero fill thresholds» Number of processors/GPUs [Spy plot: sparsity pattern with supernode blocks marked, nz = 33]

DIRECT SPARSE SOLVERS» Tree can now be expressed as a task dependency graph of dense matrix operations» matrix multiply, triangular solve, small dense factorization» Traverse the graph, queuing operations on the GPU» Process smaller problems on the host» Manage memory locality to prevent ping-ponging» Kepler's Hyper-Q improves performance when processing different sized matrices performing different tasks
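
Conceptually, the numeric phase is then a topological sweep over this task dependency graph, routing large dense blocks to the GPU and small ones to the host. The sketch below is a simplified host-side illustration with made-up names (task_t, gpu_threshold, run_on_gpu, run_on_host); it is not the library's scheduler, and in practice the GPU work is queued asynchronously on streams rather than invoked synchronously as shown.

#include <stdlib.h>

/* One supernode in the dependency graph (illustrative). */
typedef struct {
    int  rows, cols;      /* dimensions of the dense block                  */
    int  ndeps;           /* number of unfinished children this node awaits */
    int  nparents;
    int *parents;         /* indices of nodes that depend on this one       */
} task_t;

/* Process tasks in dependency order; route each to GPU or host by size. */
static void schedule(task_t *tasks, int ntasks, int gpu_threshold,
                     void (*run_on_gpu)(task_t *), void (*run_on_host)(task_t *))
{
    int *ready = malloc(ntasks * sizeof *ready);
    int nready = 0;
    for (int i = 0; i < ntasks; ++i)              /* leaves are immediately ready */
        if (tasks[i].ndeps == 0) ready[nready++] = i;

    while (nready > 0) {
        task_t *t = &tasks[ready[--nready]];
        if (t->rows * t->cols >= gpu_threshold)
            run_on_gpu(t);                        /* large dense work goes to the GPU queue */
        else
            run_on_host(t);                       /* small problems stay on the host        */
        for (int p = 0; p < t->nparents; ++p)     /* release parents whose children are done */
            if (--tasks[t->parents[p]].ndeps == 0)
                ready[nready++] = t->parents[p];
    }
    free(ready);
}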

CULA SPARSE

CULA SPARSE» Software package available from www.culatools.com» Free for personal academic usage» Version S4 available now» Version S5 beta starting in early April

» Platforms» Multi-threaded host, CUDA, CUDA (device pointers)» Solvers» CG, BiCG, BiCGStab, BiCGStab(L), GMRES, Minres,» Preconditioners:» Jacobi, Block Jacobi, SSOR, ILU0, Ainv, Fainv,» Formats» Coo, Csr, Csc, Ell, Dense,

CULA SPARSE PSEUDO-CODE USAGE EXAMPLE
// create plan
culaSparsePlan plan;
culaSparseCreatePlan(&plan);
// configure plan
culaSparseSetCudaPlatform(plan, &cudaOptions);
culaSparseSetDcooFormat(plan, &cooOptions, n, nnz, a, b, x);
culaSparseSetFainvPreconditioner(plan, &fainvOptions);
culaSparseSetCgSolver(plan, &cgOptions);
// execute plan
culaSparseExecute(plan, &config, &results);

CULA SPARSE PSEUDO-CODE USAGE EXAMPLE
// change preconditioner to ilu0
culaSparseSetIlu0Preconditioner(plan, &ilu0Options);
// re-execute plan with cached data (avoids data transfer)
culaSparseExecute(plan, &config, &results);
// change solver to bicg
culaSparseSetBicgSolver(plan, &bicgOptions);
// re-execute plan with cached data (avoids preconditioner regeneration)
culaSparseExecute(plan, &config, &results);

FUTURE WORK» Matrix free methods» Multi-GPU support» More solvers, preconditioners, data formats, and performance» PETSc integration» Intel Xeon Phi platform» Direct solver releases

THANKS! Questions?» Convention hall @ booth 311» More information @ www.culatools.com