On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators


Karl Rupp, Barry Smith
rupp@mcs.anl.gov
Mathematics and Computer Science Division, Argonne National Laboratory
FEMTEC 2013, May 20th, 2013

Hardware: Parallel Hardware Constraints

GPUs: Disillusion
Computing Architecture Schematic
- CPU: 2x12 GB/s, 100 GFLOPs SP, 50 GFLOPs DP
- GPU: 8x20 GB/s, 1000 GFLOPs SP, 250 GFLOPs DP
- PCI Express: 8 GB/s, ~1 us latency
- Good for large FLOP-intensive tasks, high memory bandwidth
- PCI-Express can be a bottleneck
- 10-fold speedups (usually) not backed by hardware
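The figures on this slide can be turned into a rough back-of-the-envelope estimate. The sketch below is not from the talk: it assumes that "2x12 GB/s" and "8x20 GB/s" are the CPU and GPU memory bandwidths, that a vector addition is purely bandwidth-bound, and that PCI Express latency is paid once per transfer. The function name `vector_add_times` is hypothetical.

```python
# Rough transfer-vs-compute estimate built from the numbers on the slide.
# Assumption: the "2x12 GB/s" and "8x20 GB/s" figures are memory bandwidths.
CPU_MEM_BANDWIDTH = 2 * 12e9   # bytes/s
GPU_MEM_BANDWIDTH = 8 * 20e9   # bytes/s
PCIE_BANDWIDTH = 8e9           # bytes/s
PCIE_LATENCY = 1e-6            # seconds, approximate


def vector_add_times(n, bytes_per_entry=8):
    """Estimated times (seconds) for x = y + z with n entries.

    Returns (time on GPU, time on CPU, time to move the same data over PCIe).
    The model counts only memory traffic: read y, read z, write x.
    """
    traffic = 3 * n * bytes_per_entry
    on_gpu = traffic / GPU_MEM_BANDWIDTH
    on_cpu = traffic / CPU_MEM_BANDWIDTH
    over_pcie = PCIE_LATENCY + traffic / PCIE_BANDWIDTH
    return on_gpu, on_cpu, over_pcie


on_gpu, on_cpu, over_pcie = vector_add_times(10**6)
speedup = on_cpu / on_gpu          # ratio of the two bandwidths, below 10
```

Under these assumptions the bandwidth-bound speedup is the ratio 160/24, and moving the operands across PCI Express takes longer than the addition on either device, which is one way to read the "bottleneck" remark.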

Benchmarks
[Figure: Vector addition x = y + z. Execution time (sec) vs. vector size (10^1 to 10^7). Curves: NVIDIA GTX 285, CUDA; NVIDIA GTX 285, OpenCL; AMD Radeon HD 7970, OpenCL; Intel Xeon Phi Beta, OpenCL; Intel Xeon Phi Beta, native; Intel Xeon X5550, OpenMP; Intel Xeon X5550, single-threaded.]

Basic Idea
- Factor sparse matrix A ≈ LŨ
- L and Ũ sparse, triangular
- ILU0: Pattern of L, Ũ equal to A
- ILUT: Keep k elements per row
Solver Cycle Phase
- Residual correction LŨx = z
- Forward solve Ly = z
- Backward solve Ũx = y
- Little parallelism in general
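The factor-then-substitute cycle above can be sketched in a few lines. This is a minimal illustration, not the ViennaCL or PETSc implementation: it stores the matrix densely as a list of lists for readability, and the names `ilu0` and `apply_ilu` are hypothetical. The factorization is plain Gaussian elimination in which every update outside the nonzero pattern of A is skipped.

```python
def ilu0(A):
    """ILU(0) of a dense list-of-lists matrix A.

    Returns one matrix holding L (strictly lower part, unit diagonal implied)
    and U (diagonal and upper part). Entries outside the pattern of A stay 0.
    """
    n = len(A)
    LU = [row[:] for row in A]
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            LU[i][k] /= LU[k][k]                 # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i][j]:                # no fill-in outside the pattern
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def apply_ilu(LU, z):
    """Solve L U x = z by a forward solve followed by a backward solve."""
    n = len(z)
    y = z[:]
    for i in range(n):                           # forward solve L y = z
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):                 # backward solve U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


# A tridiagonal matrix produces no fill-in, so ILU(0) is an exact LU here.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
x = apply_ilu(ilu0(A), [2.0, 4.0, 10.0])
```

Both substitution loops carry a dependency from row to row, which is the "little parallelism" the slide refers to.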

Level Scheduling
- Build dependency graph
- Substitute as many entries as possible simultaneously
- Trade-off: Each step vs. multiple steps in a single kernel
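A minimal sketch of the idea, assuming the lower-triangular factor is stored in CSR format with its diagonal; the function names are hypothetical. A row's level is one more than the highest level among the rows it depends on, so all rows of one level can be substituted simultaneously. The serial inner loop below stands in for what would be one parallel kernel launch (or one phase inside a kernel) on an accelerator.

```python
def compute_levels(row_ptr, col_idx):
    """Level of each row of a lower-triangular CSR matrix.

    level[i] = 0 if row i has no off-diagonal entries,
    otherwise 1 + max(level[j]) over its off-diagonal columns j < i.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


def group_by_level(level):
    """Rows grouped by level; rows inside one group are independent."""
    groups = [[] for _ in range(max(level) + 1)] if level else []
    for i, lev in enumerate(level):
        groups[lev].append(i)
    return groups


def forward_solve_by_levels(row_ptr, col_idx, values, z, groups):
    """Solve L y = z level by level."""
    y = [0.0] * len(z)
    for rows in groups:          # levels must run one after another
        for i in rows:           # rows of one level could run in parallel
            s, diag = z[i], 1.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[p]
                if j < i:
                    s -= values[p] * y[j]
                elif j == i:
                    diag = values[p]
            y[i] = s / diag
    return y


# L = [[2,0,0,0],[0,2,0,0],[1,1,2,0],[0,0,1,2]] in CSR form
row_ptr = [0, 1, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 2, 3]
values = [2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0]
groups = group_by_level(compute_levels(row_ptr, col_idx))
y = forward_solve_by_levels(row_ptr, col_idx, values, [2.0, 2.0, 4.0, 3.0], groups)
```

In this example rows 0 and 1 form the first level, followed by row 2 and then row 3, so four rows are processed in three steps.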


Interpretation on Structured Grids
- 2d finite-difference discretization
- Substitution whenever all neighbors with smaller index computed
- Works particularly well in 3d

Numberings of the 5x4 example grid shown on the slide (rows listed top to bottom):

Row by row:
16 17 18 19 20
11 12 13 14 15
 6  7  8  9 10
 1  2  3  4  5

Row by row, alternating direction:
20 19 18 17 16
11 12 13 14 15
10  9  8  7  6
 1  2  3  4  5

Diagonal by diagonal:
10 14 17 19 20
 6  9 13 16 18
 3  5  8 12 15
 1  2  4  7 11

Red-black:
17  9 18 10 20
 6 16  7 17  8
13  4 14  5 15
 1 11  2 12  3
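How strongly the numbering affects the number of levels can be checked directly. The sketch below assumes a 5-point finite-difference stencil on a 5x4 grid and generates a lexicographic and a red-black numbering itself rather than reading them from the slide; all function names are hypothetical. A grid point can be substituted once all of its stencil neighbors with a smaller index are done.

```python
def lexicographic(nx, ny):
    """Row-by-row numbering, 1-based; row 0 is the bottom row."""
    return [[r * nx + c + 1 for c in range(nx)] for r in range(ny)]


def red_black(nx, ny):
    """Checkerboard numbering: all 'red' points first, then all 'black'."""
    cells = [(r, c) for r in range(ny) for c in range(nx)]
    red = [rc for rc in cells if (rc[0] + rc[1]) % 2 == 0]
    black = [rc for rc in cells if (rc[0] + rc[1]) % 2 == 1]
    grid = [[0] * nx for _ in range(ny)]
    for idx, (r, c) in enumerate(red + black, start=1):
        grid[r][c] = idx
    return grid


def count_levels(numbering):
    """Number of forward-substitution levels for a 5-point stencil."""
    ny, nx = len(numbering), len(numbering[0])
    level = {}
    cells = sorted((numbering[r][c], r, c) for r in range(ny) for c in range(nx))
    for idx, r, c in cells:                      # increasing index order
        lev = 0
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < ny and 0 <= cc < nx and numbering[rr][cc] < idx:
                lev = max(lev, level[numbering[rr][cc]] + 1)
        level[idx] = lev
    return max(level.values()) + 1


levels_lex = count_levels(lexicographic(5, 4))   # one level per anti-diagonal
levels_rb = count_levels(red_black(5, 4))        # red points, then black points
```

For the lexicographic numbering the levels are the anti-diagonals of the grid, so an nx-by-ny grid needs nx + ny - 1 of them. In 3d the corresponding levels are planes rather than lines, so each level holds more unknowns, which is consistent with the remark that the approach works particularly well in 3d.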

Block-ILU
- Apply ILU to diagonal blocks
- Higher parallelism
- Usually more iterations required (problem-dependent)
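A minimal sketch of the blocking step, assuming a dense list-of-lists matrix and equally sized contiguous blocks (both are simplifications, and the function names are hypothetical). Each diagonal block can then be factored and solved independently of the others. The couplings between blocks are simply discarded, which is a plausible reason for the higher iteration counts mentioned above.

```python
def diagonal_blocks(A, block_size):
    """Split A into its diagonal blocks, ignoring couplings between blocks."""
    n = len(A)
    blocks = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        blocks.append([row[start:stop] for row in A[start:stop]])
    return blocks


def dropped_entries(A, block_size):
    """Count nonzeros of A that lie outside the diagonal blocks."""
    return sum(
        1
        for i, row in enumerate(A)
        for j, v in enumerate(row)
        if v != 0 and i // block_size != j // block_size
    )


A = [
    [4.0, -1.0, 0.0, 0.0],
    [-1.0, 4.0, -1.0, 0.0],
    [0.0, -1.0, 4.0, -1.0],
    [0.0, 0.0, -1.0, 4.0],
]
blocks = diagonal_blocks(A, 2)     # two independent 2x2 systems
dropped = dropped_entries(A, 2)    # couplings the preconditioner no longer sees
```

The block size trades parallelism against preconditioner quality: smaller blocks give more independent work but drop more entries.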


Benchmark Setup
Hardware
- NVIDIA GTX 580 (default)
- AMD HD 7970 (only for final benchmark)
- Intel Core2Quad 9550
Numbering
- Lexicographic
- Red-Black
- Minimum Degree
Remarks
- Setup purely on CPU, not included
- Data transfer costs not included
- OpenCL for both GPUs

Case Study 1: 2D Poisson, Structured Grid

[Figure: 2D FDM, Solver Time. Execution time (sec) vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, CPU unpreconditioned.]
[Figure: 2D FDM, Iterations. Iterations vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]
[Figure: 2D FDM, Preconditioner Application. Execution time (sec) vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]
[Figure: 2D FDM, ILU Levels. ILU levels vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]

Case Study 2: 3D Poisson, Structured Grid

[Figure: 3D FDM, Solver Time. Execution time (sec) vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, CPU unpreconditioned.]
[Figure: 3D FDM, Iterations. Iterations vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]
[Figure: 3D FDM, Preconditioner Application. Execution time (sec) vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]
[Figure: 3D FDM, ILU Levels. ILU levels vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]

Case Study 3: 2D Poisson, Unstructured Grid

[Figure: 2D FEM, Solver Time. Execution time (sec) vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, CPU unpreconditioned.]
[Figure: 2D FEM, Iterations. Iterations vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]
[Figure: 2D FEM, Preconditioner Application. Execution time (sec) vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]
[Figure: 2D FEM, ILU Levels. ILU levels vs. unknowns (10^2 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]

Case Study 4: 3D Poisson, Unstructured Grid

[Figure: 3D FEM, Solver Time. Execution time (sec) vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, CPU unpreconditioned.]
[Figure: 3D FEM, Iterations. Iterations vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]
[Figure: 3D FEM, Preconditioner Application. Execution time (sec) vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]
[Figure: 3D FEM, ILU Levels. ILU levels vs. unknowns (10^3 to 10^6). Curves: Lexicographic, Red-Black, Min-Bandwidth.]

Coloring
- Color dependency graph
- Purely algebraic
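One common way to color a dependency graph is a greedy pass that gives each node the smallest color not used by an already colored neighbor. The talk does not state which coloring algorithm was used, so the sketch below is an assumption, with hypothetical function names. It needs only the sparsity pattern of the matrix, which is what makes the approach purely algebraic.

```python
def greedy_coloring(adjacency):
    """Greedy coloring of an undirected graph.

    adjacency maps each node to an iterable of its neighbors.
    Returns a dict node -> color, with colors 0, 1, 2, ...
    """
    color = {}
    for node in sorted(adjacency):
        used = {color[nb] for nb in adjacency[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


def color_ordering(color):
    """Renumber unknowns color by color; ties keep the original order."""
    return sorted(color, key=lambda node: (color[node], node))


# Pattern of a 1D chain of four unknowns: 0 - 1 - 2 - 3
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
color = greedy_coloring(adjacency)
order = color_ordering(color)
```

After renumbering by color, unknowns of the same color have no mutual dependencies, so the triangular solves need only as many levels as there are colors. On this chain that is two levels instead of four.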


Benchmarks
[Figure: 2D FEM, Solver Time. Execution time (sec) vs. unknowns (10^2 to 10^6). Legend entries as transcribed: "NVIDIA GTX 580, ILU"; "NVIDIA GTX 580, no"; "AMD HD 7970, ILU"; "AMD HD 7970, no"; "CPU, no".]

Conclusion
ILU Preconditioners
- Fine-grained parallelism exploitable (if done right)
- Higher-order discretizations less parallel
Matrix Pattern
- CPU: banded for cache reuse
- GPU: colored for parallelism
Availability
- ViennaCL: http://viennacl.sourceforge.net/
- (PETSc: http://www.mcs.anl.gov/petsc/)
- ViennaCL + PETSc tutorial on Thursday afternoon!