A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois


A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs: How to Build a gtsv for CUSPARSE 2013. Li-Wen Chang, Wen-mei Hwu, University of Illinois

Material in this Session
This talk is based on our SC 2012 paper: Li-Wen Chang, John Stratton, Hee-Seok Kim, and Wen-mei Hwu, "A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC '12).
But it contains more:
- Details not shown in the paper due to the page limit
- Extension work done with the NVIDIA CUSPARSE team

Comparison among Tridiagonal Solvers

Solver                   Numerical Stability   CPU Performance   GPU Performance   Cluster Scalability
Matlab (backslash)       Yes                   Poor              Not supported     Not supported
Intel MKL (gtsv)         Yes                   Good              Not supported     Not supported
Intel SPIKE              Yes                   Good              Not supported     Supported
CUSPARSE gtsv (2012)     No                    Not supported     Good              Not supported
Our gtsv                 Yes                   Not supported     Good              Supported
Our heterogeneous gtsv   Yes                   Good              Good              Supported

Numerical Stability on GPUs
All previous related work for GPUs used:
- Unstable algorithms, like the Thomas algorithm, cyclic reduction (CR), or parallel cyclic reduction (PCR)
- No pivoting
Why is pivoting important?

CUSPARSE gtsv (2012)
- CR (+ PCR)
- But when the b_i's are 0s, the CR elimination step
  b_i' = b_i - a_i (c_{i-1} / b_{i-1}) - c_i (a_{i+1} / b_{i+1})
  divides by zero
[Slide shows the tridiagonal matrix, with right-hand side e_i, before and after one CR elimination step.]

Why Numerical Stability is Difficult on GPUs
Why didn't people apply pivoting on GPUs? They worried about performance:
- Pivoting does not seem to fit GPUs
  - Pivoting may serialize computation
  - Pivoting requires data-dependent control flow
- GPUs like regular computation and regular memory access
  - Branch divergence may hurt performance

Our gtsv
- For parallelization, the SPIKE algorithm is applied to decompose the problem
  - An optimization technique, data layout transformation, is applied to achieve high memory efficiency
- For data-dependent control flow, diagonal pivoting is chosen
  - An optimization technique, dynamic tiling, is proposed to achieve high memory efficiency

Part 1: SPIKE Algorithm
The SPIKE algorithm decomposes a tridiagonal matrix A into several blocks.

SPIKE Algorithm
- D and S can be defined such that A = DS
- AX = F can then be solved by solving DY = F, and then SX = Y

A Small Example
[Slides show a small tridiagonal matrix A partitioned into two blocks A_1 and A_2 with coupling blocks B and C, the right-hand side F, and the factorization A = DS with the spike columns v and w appearing in S.] (David Kirk/NVIDIA and Wen-mei W. Hwu)

SPIKE Algorithm
- How to build S? Solve DY = F
- Solve several independent tridiagonal systems A_i

SPIKE Algorithm
How to solve SX = Y?
- Solve the collection of the first and last rows of all blocks
- Reduction*: problem size 4L -> 6
- Backward substitution
[Diagram: the spike elements v_i and w_i in the first and last rows of each of the four blocks.]
*E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: The SPIKE algorithm," Parallel Computing, vol. 32, no. 2, pp. 177-194, 2006.
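To make the flow concrete, below is a minimal, runnable sequential sketch of the SPIKE steps for two partitions (illustrative code, not the production kernel: the block solves use the Thomas algorithm purely for brevity, whereas the real gtsv solves each partition with diagonal pivoting, one partition per GPU thread; with two partitions the reduced system degenerates to 2-by-2):

```cuda
#include <cstdio>
#include <vector>

// Thomas algorithm for one tridiagonal block: overwrites d with the solution.
static void thomas(int n, const double *a, const double *b, const double *c,
                   std::vector<double> &d) {
    std::vector<double> cp(n);
    cp[0] = c[0] / b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
        cp[i] = c[i] * m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    for (int i = n - 2; i >= 0; --i) d[i] -= cp[i] * d[i + 1];
}

int main() {
    // 8x8 tridiagonal system A x = f, two SPIKE partitions of L = 4 rows.
    const int n = 8, L = 4;
    double a[n], b[n], c[n], f[n];
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 4; c[i] = 1; f[i] = i + 1; }
    a[0] = c[n - 1] = 0;

    // DY = F: each partition solves its own block, ignoring the couplings.
    std::vector<double> y1(f, f + L), y2(f + L, f + n);
    thomas(L, a, b, c, y1);
    thomas(L, a + L, b + L, c + L, y2);

    // Spike columns: v = A1^{-1}(c_{L-1} e_last), w = A2^{-1}(a_L e_first).
    std::vector<double> v(L, 0.0), w(L, 0.0);
    v[L - 1] = c[L - 1]; thomas(L, a, b, c, v);
    w[0] = a[L];         thomas(L, a + L, b + L, c + L, w);

    // Reduced system (SX = Y restricted to the interface rows):
    //   x[L-1] + v[L-1] * x[L] = y1[L-1],   w[0] * x[L-1] + x[L] = y2[0]
    double det = 1.0 - v[L - 1] * w[0];
    double xb = (y1[L - 1] - v[L - 1] * y2[0]) / det;  // x[L-1]
    double xt = (y2[0] - w[0] * y1[L - 1]) / det;      // x[L]

    // Backward substitution recovers the interior unknowns.
    for (int i = 0; i < L; ++i) { y1[i] -= v[i] * xt; y2[i] -= w[i] * xb; }
    for (int i = 0; i < n; ++i)
        printf("x[%d] = %f\n", i, i < L ? y1[i] : y2[i - L]);
    return 0;
}
```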

Part 2: Diagonal Pivoting for Tridiagonal Matrices
How to solve each block A_i in a numerically stable way? Diagonal pivoting*
- Each A_i can be solved sequentially by one thread
Why diagonal pivoting?
- A better form of data-dependent control flow, which we can handle on GPUs
*J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, pp. 269-275, 2010.

Diagonal Pivoting
- A tridiagonal matrix A can be decomposed into LBM^T, instead of LDU
  - L and M are unit lower triangular matrices
  - B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks
- Criteria for choosing 1-by-1 or 2-by-2 blocks: asymmetric Bunch-Kaufman pivoting

LBM^T Decomposition
- B_d is a 1-by-1 or 2-by-2 block
- A_s is also a tridiagonal matrix
  - A_s is obtained by modifying the leading elements of T_22
  - A_s can be decomposed recursively
- For a 2-by-2 pivot, Δ = b_1 b_2 - a_2 c_1

Diagonal Pivoting
- A can be solved by solving L, B, and then M^T
- It has data-dependent control flow: B contains 1-by-1 or 2-by-2 blocks
- It is better than other pivoting methods: only nearby rows are accessed
- It requires dynamic tiling to perform efficiently on GPUs

More Optimization
- L and M^T are not stored; they are computed on the fly
- We store only the pivoting conditions and the leading elements of B
- d = 1 (1-by-1 pivot): the L_2 B_2 M_2^T factors are built from b_1, a_2/b_1, and c_1/b_1
- d = 2 (2-by-2 pivot): with Δ = b_1 b_2 - a_2 c_1, the factors are built from terms like -c_1 c_2/Δ, -a_2 a_3/Δ, b_1 a_3/Δ, and b_1 c_2/Δ
[Slide shows the tridiagonal entries b_1, c_1; a_2, b_2, c_2; a_3, b_3, c_3; a_4, b_4 and the corresponding factors for each pivot size.]

An Example
[Slide shows a small tridiagonal matrix A, the pivot size d chosen at each step, and what is actually stored: the leading elements of B plus the condition flags.]

Pivoting Criteria
Bunch-Kaufman algorithm for unsymmetric cases:
- κ = (√5 - 1)/2
- σ = max(|c_1|, |a_2|, |b_2|, |c_2|, |a_3|)
- If |b_1| σ >= κ |c_1| |a_2|: 1-by-1 pivoting; else: 2-by-2 pivoting
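Putting the criterion and the LBM^T recursion together, here is a minimal, runnable sequential sketch of one block solve (illustrative code built from the slide's formulas, not the production kernel; in the real gtsv each GPU thread runs a loop like this over its own SPIKE partition and stores only the pivot conditions and leading elements):

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static double mag(const std::vector<double> &v, int i) {   // |v[i]|, 0 if out of range
    return (i >= 0 && i < (int)v.size()) ? std::fabs(v[i]) : 0.0;
}

// Solve T x = f for tridiagonal T (lower a, main b, upper c) with 1-by-1 /
// 2-by-2 diagonal pivoting; f is overwritten with the solution x.
void dp_solve(std::vector<double> a, std::vector<double> b,
              std::vector<double> c, std::vector<double> &f) {
    const int n = (int)b.size();
    const double kappa = (std::sqrt(5.0) - 1.0) / 2.0;     // ~0.618
    std::vector<int> piv, d;                               // pivot starts and sizes

    for (int i = 0; i < n; ) {                             // forward elimination
        double sigma = std::max({mag(c, i), mag(a, i + 1), mag(b, i + 1),
                                 mag(c, i + 1), mag(a, i + 2)});
        if (i == n - 1 ||
            std::fabs(b[i]) * sigma >= kappa * mag(c, i) * mag(a, i + 1)) {
            if (i + 1 < n) {                               // 1-by-1 pivot
                double m = a[i + 1] / b[i];
                b[i + 1] -= m * c[i];
                f[i + 1] -= m * f[i];
            }
            piv.push_back(i); d.push_back(1); i += 1;
        } else {                                           // 2-by-2 pivot
            double delta = b[i] * b[i + 1] - c[i] * a[i + 1];
            if (i + 2 < n) {                               // update leading elements
                b[i + 2] -= a[i + 2] * b[i] * c[i + 1] / delta;
                f[i + 2] -= a[i + 2] * (b[i] * f[i + 1] - a[i + 1] * f[i]) / delta;
            }
            piv.push_back(i); d.push_back(2); i += 2;
        }
    }
    for (int k = (int)piv.size() - 1; k >= 0; --k) {       // backward substitution
        int j = piv[k];
        if (d[k] == 1) {
            f[j] = (f[j] - (j + 1 < n ? c[j] * f[j + 1] : 0.0)) / b[j];
        } else {                                           // solve the 2-by-2 block
            double delta = b[j] * b[j + 1] - c[j] * a[j + 1];
            double g0 = f[j];
            double g1 = f[j + 1] - (j + 2 < n ? c[j + 1] * f[j + 2] : 0.0);
            f[j]     = (b[j + 1] * g0 - c[j] * g1) / delta;
            f[j + 1] = (b[j] * g1 - a[j + 1] * g0) / delta;
        }
    }
}

int main() {
    // Zero main diagonal: Thomas and CR divide by zero here; pivoting does not.
    std::vector<double> a{0, 1, 1, 1}, b{0, 0, 0, 0}, c{1, 1, 1, 0}, f{1, 2, 3, 4};
    dp_solve(a, b, c, f);
    for (int i = 0; i < 4; ++i) printf("x[%d] = %f\n", i, f[i]);  // -2 1 4 2
    return 0;
}
```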

Our gtsv Algorithm
- Solving the A_i's dominates the runtime
  - Using diagonal pivoting
  - Each A_i is solved sequentially, and all A_i's are solved in parallel
- Requires data layout transformation to perform efficiently on GPUs

Data Layout
Observation:
- A GPU requires stride-one memory access to fully utilize memory bandwidth
Contradiction:
- Consecutive elements of a diagonal are stored in consecutive memory in the gtsv interface, but each block is processed by one thread
Solution:
- Data layout transformation

Data Layout Transformation
- Local transpose (the b_i's shown are elements of one diagonal)
- Example: 16 4-element blocks (one block per SPIKE partition)
[Diagram: the diagonal in address order before and after the local transpose.]
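A minimal CUDA sketch of the local transpose (illustrative: TILE and CHUNK are assumed tile parameters, the array length is assumed to be a multiple of TILE*CHUNK, and only one diagonal is shown; the real gtsv marshals all diagonals and the right-hand side):

```cuda
#define TILE  32   // chunks (threads) per tile -- assumed parameter
#define CHUNK 4    // consecutive elements owned by one thread -- assumed

// in:  chunk-contiguous layout (the gtsv interface order)
// out: locally transposed layout, so that the i-th elements of all TILE
//      chunks are adjacent and the per-thread sweeps become coalesced.
// Launch: local_transpose<<<n / (TILE * CHUNK), TILE>>>(d_in, d_out);
__global__ void local_transpose(const double *in, double *out) {
    __shared__ double tile[TILE * CHUNK];
    int base = blockIdx.x * TILE * CHUNK;

    // Coalesced read of the whole tile in the original layout.
    for (int k = threadIdx.x; k < TILE * CHUNK; k += blockDim.x)
        tile[k] = in[base + k];
    __syncthreads();

    // Coalesced write in the transposed layout: output slot j = e*TILE + t
    // receives element e of chunk t (strided shared-memory reads are cheap
    // compared to strided global-memory accesses).
    for (int j = threadIdx.x; j < TILE * CHUNK; j += blockDim.x) {
        int t = j % TILE, e = j / TILE;
        out[base + j] = tile[t * CHUNK + e];
    }
}
```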

Data Layout Transformation
Runtime (ms), old layout (gtsv interface) vs. proposed layout:
- Random: 69.46 -> 59.87
- Diagonally dominant: 38.63 -> 9.68 (4-5x)
- Zero diagonal: 34.59 -> 7.7 (4-5x)
- Data marshaling overhead: 4.73
Matrix types:
- Random: 1-by-1 or 2-by-2 pivoting
- Diagonally dominant: always 1-by-1 pivoting
- Zero diagonal: always 2-by-2 pivoting

Dynamic Tiling
Observation:
- Memory accesses with a compact footprint are handled well by the L1 cache, even when branch divergence exists
- A scattered footprint dramatically reduces memory efficiency
Solution:
- Insert barriers to regularize the memory access footprint (see the sketch after the next slide)
[Diagram: threads T1-T4 stepping through addresses at data-dependent rates; the footprint starts compact and gradually scatters.]

Dynamic Tiling
[Diagram: the same threads T1-T4 with estimated tiling boundaries and real barriers inserted; after each barrier the threads re-align, keeping the footprint compact.]
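A CUDA sketch of the idea (illustrative: TILE_ELEMS is an assumed tuning knob, chunk is assumed to be a multiple of it, and pivot_size is a hypothetical stand-in for the 1-by-1/2-by-2 decision; the structure, a uniform outer loop over estimated tiles with a real barrier at each boundary, is the point):

```cuda
#include <math.h>

// Hypothetical stand-in for the Bunch-Kaufman decision: threads advance by
// 1 or 2 elements per step depending on the data they see.
__device__ int pivot_size(const double *b, int i) {
    return (fabs(b[i]) < 1e-8) ? 2 : 1;
}

#define TILE_ELEMS 16   // estimated tile width -- assumed tuning knob

// Each thread sweeps its own chunk at a data-dependent rate. Without
// barriers the threads drift apart and the block's footprint scatters; the
// barrier at each estimated boundary re-aligns them so the footprint stays
// compact enough to live in L1.
__global__ void dyn_tiled_sweep(const double *b, double *out, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const double *myb = b + (size_t)tid * chunk;
    double acc = 0.0;
    int pos = 0;

    // The outer loop is uniform across the block, so every thread executes
    // the same sequence of __syncthreads() calls (one real barrier per tile).
    for (int boundary = TILE_ELEMS; boundary <= chunk; boundary += TILE_ELEMS) {
        while (pos < boundary) {
            acc += myb[pos];               // stands in for elimination work
            pos += pivot_size(myb, pos);   // 1-by-1 or 2-by-2 step
        }
        __syncthreads();                   // real barrier at estimated boundary
    }
    out[tid] = acc;
}
```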

Dynamic Tiling
Runtime (ms), data layout only vs. dynamic tiling (with data layout):
- Random: 59.87 -> 16.83 (3.5x)
- Diagonally dominant: 9.68 -> 9.88
- Zero diagonal: 7.7 -> 7.3

Dynamic Tiling
[Bar chart: performance counters (%) with and without tiling for random, diagonally dominant, and zero-diagonal matrices: global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency. On the random case, tiling improves the load efficiency and L1 hit rate by roughly 3x and the store efficiency by roughly 1.8x; warp execution efficiency stays low because of branch divergence.]

Final Evaluation
Three kinds of evaluation:
- Numerical stability: a backward analysis, ||Ax - b|| / ||b||, on 16 selected types of matrices*
- One-GPU performance
- Cluster scalability: multiple GPUs, and multiple GPUs + multiple CPUs
*J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, pp. 269-275, 2010.

Numerical Stability
Relative backward error, ||Ax - b|| / ||b||:

Matrix type   Our gtsv    Our dtsvb   CUSPARSE   MKL        Intel SPIKE   Matlab
1             .82E-4      .97E-4      7.4E-2     .88E-4     .39E-5        .96E-4
2             .27E-6      .27E-6      .69E-6     .3E-6      .2E-6         .3E-6
3             .55E-6      .52E-6      2.57E-6    .35E-6     .29E-6        .35E-6
4             .37E-4      .22E-4      .39E-2     3.E-5      .69E-5        2.78E-5
5             .7E-4       .3E-4       .82E-4     .56E-4     4.62E-5       2.93E-4
6             .5E-6       .6E-6       .57E-6     9.34E-7    9.5E-7        9.34E-7
7             2.42E-6     2.46E-6     5.3E-6     2.52E-6    2.55E-6       2.27E-6
8             2.4E-4      2.4E-4      .5E+       3.76E-4    2.32E-6       2.4E-4
9             2.32E-5     3.9E-4      .93E+8     3.5E-5     9.7E-6        .9E-5
10            4.27E-5     4.83E-5     2.74E+5    3.2E-5     4.72E-6       3.2E-5
11            7.52E-4     6.59E-2     4.54E+     2.99E-4    2.2E-5        2.28E-4
12            5.58E-5     7.95E-5     5.55E-4    2.24E-5    5.52E-5       2.24E-5
13            5.5E-       5.45E-      .2E+6      3.34E-     3.92E-5       3.8E-
14            2.86E+49    4.49E+49    2.92E+5    .77E+48    3.86E+54      .77E+48
15            2.9E+6      NaN         NaN        .47E+59    Fail          3.69E+58
16            Inf         NaN         NaN        Inf        Fail          4.68E+7

GPU Performance
[Bar chart: runtime of solving an 8M-row matrix (ms, axis 0-300): our dgtsv (GPU), our ddtsvb (GPU), CUSPARSE dgtsv (GPU), data transfer (pageable), data transfer (pinned), and MKL dgtsv (sequential, CPU), for random and diagonally dominant matrices.]

Our Heterogeneous gtsv
- SPIKE algorithm
  - OpenMP for multiple cores within one node
  - CUDA streams for multiple GPUs
  - MPI for multiple nodes
- MKL gtsv for CPUs
- Our gtsv for GPUs

Cluster Scalability (GPUs)
[Log-scale bar chart: strong-scaling runtime (ms) of our gtsv, with and without predistributed data, on 1, 2, 4, 8, and 16 GPUs.]

Cluster Scalability (GPUs)
[Log-scale bar chart: weak-scaling runtime (ms) of our gtsv, with and without predistributed data: 1 GPU on a 2M-sized matrix, 2 GPUs on 4M, 4 GPUs on 8M, 8 GPUs on 16M, and 16 GPUs on 32M.]

Cluster Scalability (GPUs+CPUs)
[Charts: strong scaling, strong scaling with predistributed data, and weak scaling.]

Short Summary

Solver                   Numerical Stability   CPU Performance   GPU Performance   Cluster Scalability
Matlab (backslash)       Yes                   Poor              Not supported     Not supported
Intel MKL (gtsv)         Yes                   Good              Not supported     Not supported
Intel SPIKE              Yes                   Good              Not supported     Supported
CUSPARSE gtsv (2012)     No                    Not supported     Good              Not supported
Our gtsv                 Yes                   Not supported     Good              Supported
Our heterogeneous gtsv   Yes                   Good              Good              Supported

More Features of Our gtsv
- Supports 4 data types (in CUSPARSE 2013): float (S), double (D), complex (C), double complex (Z)
- Supports arbitrary sizes
- Supports multiple right-hand-side vectors
- Supports both general matrices (gtsv) and diagonally dominant matrices (dtsvb)

More Details
- 4 data types: CUSPARSE built-in operators
- dtsvb: SPIKE + Thomas algorithm
- Arbitrary sizes: padding
  - Pad 1s for the main diagonal, and 0s for the lower and upper diagonals (see the sketch below)
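For instance, a host-side sketch of the padding trick (illustrative names; M stands for whatever internal partition size the solver needs the length to be a multiple of):

```cuda
#include <vector>

// Extend an n-row system to the next multiple of M with identity rows:
// 1s on the main diagonal, 0s on the lower/upper diagonals, 0s in the RHS.
// The padded rows solve to x = 0 and, because the seam entries are 0,
// never couple back into the real unknowns.
void pad_system(std::vector<double> &dl, std::vector<double> &d,
                std::vector<double> &du, std::vector<double> &rhs, int M) {
    size_t padded = (d.size() + M - 1) / M * M;   // round up to a multiple of M
    dl.resize(padded, 0.0);    // lower diagonal: pad 0s
    d.resize(padded, 1.0);     // main diagonal:  pad 1s
    du.resize(padded, 0.0);    // upper diagonal: pad 0s
    rhs.resize(padded, 0.0);
}
```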

More Details
- Multiple right-hand-side vectors: the Y_i's have multiple columns, but the W_i's and V_i's have only one column each

More Details
- Solve the V_i's, the W_i's, and the first column of the Y_i's; build L, B, and M^T
- Then solve the remaining columns of the Y_i's using the pre-built L, B, and M^T
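The same factor-once / solve-many structure in a minimal sketch (using an unpivoted tridiagonal LU here for brevity; the real gtsv reuses the stored L, B, and M^T diagonal-pivoting factors in exactly this way):

```cuda
#include <vector>

struct Factors {
    std::vector<double> m;    // L multipliers
    std::vector<double> p;    // U main diagonal (pivots)
    std::vector<double> du;   // U upper diagonal (unchanged)
};

// Factor T = LU once (unpivoted, for illustration only).
Factors factor(const std::vector<double> &a, const std::vector<double> &b,
               const std::vector<double> &c) {
    int n = (int)b.size();
    Factors F{std::vector<double>(n, 0.0), std::vector<double>(n, 0.0), c};
    F.p[0] = b[0];
    for (int i = 1; i < n; ++i) {
        F.m[i] = a[i] / F.p[i - 1];
        F.p[i] = b[i] - F.m[i] * c[i - 1];
    }
    return F;
}

// Reuse the pre-built factors for each right-hand-side column.
void solve_with(const Factors &F, std::vector<double> &x) {
    int n = (int)x.size();
    for (int i = 1; i < n; ++i) x[i] -= F.m[i] * x[i - 1];   // forward (L)
    x[n - 1] /= F.p[n - 1];
    for (int i = n - 2; i >= 0; --i)                          // backward (U)
        x[i] = (x[i] - F.du[i] * x[i + 1]) / F.p[i];
}
```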

Summary
- The first numerically stable tridiagonal solver for GPUs
  - Numerical stability comparable to Intel MKL
  - Speed comparable to NVIDIA CUSPARSE 2012
  - Supports large matrices
- CUSPARSE gtsv 2013: the cluster support is removed
- Source code for a prototype is available at http://impact.crhc.illinois.edu/ with a BSD-like license

Something We Forgot
- How about a batch version? ("Batch" means multiple matrices of the same size)
- Currently, you can simply merge them into one large matrix (see the sketch below)
- This even works for multiple matrices of different sizes
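A host-side sketch of the merging trick (illustrative struct-of-vectors layout; the zero sub- and super-diagonal entries at every seam keep the systems decoupled, which is also why systems of different sizes can be stacked):

```cuda
#include <vector>

struct Tridiag { std::vector<double> dl, d, du, rhs; };  // one system per struct

// Stack a batch of tridiagonal systems into one big system that a single
// gtsv call can solve. Each system's dl[0] and du[n-1] are 0, so after
// concatenation every seam has a 0 in the sub- and super-diagonal and the
// systems remain independent.
Tridiag merge_batch(const std::vector<Tridiag> &batch) {
    Tridiag big;
    for (const Tridiag &t : batch) {
        big.dl.insert(big.dl.end(), t.dl.begin(), t.dl.end());
        big.d.insert(big.d.end(),   t.d.begin(),  t.d.end());
        big.du.insert(big.du.end(), t.du.begin(), t.du.end());
        big.rhs.insert(big.rhs.end(), t.rhs.begin(), t.rhs.end());
    }
    return big;
}
```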

A Case Study
Empirical Mode Decomposition (EMD)
- An adaptive time- (or spatial-) frequency analysis
- Applications: climate research, orbit research, structural health monitoring, water wave analysis, biomedical signal analysis

Empirical Mode Decomposition
Spline interpolation is at the core of the sifting procedure; a sketch of how that interpolation reaches the tridiagonal solver follows below.
[Flow diagram: the sifting procedure detects maxima and minima, runs spline interpolation on both envelopes (each interpolation calls a tridiagonal solver), takes the mean of the envelopes, and subtracts it from the signal; the IMF procedure repeats sifting on the residues r_i(t) to extract the IMFs c_1(t), c_2(t), ..., c_N(t) from x(t).]
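A sketch of why the envelope interpolation produces tridiagonal systems (assuming the standard natural cubic spline formulation; the actual EMD code may use a different boundary condition):

```cuda
#include <vector>

// Build the tridiagonal system for the second derivatives M_i of a natural
// cubic spline through the extrema (x_i, y_i) -- this is the system the EMD
// envelope interpolation hands to the tridiagonal solver.
void spline_system(const std::vector<double> &x, const std::vector<double> &y,
                   std::vector<double> &dl, std::vector<double> &d,
                   std::vector<double> &du, std::vector<double> &rhs) {
    int n = (int)x.size();
    dl.assign(n, 0.0); d.assign(n, 1.0); du.assign(n, 0.0); rhs.assign(n, 0.0);
    for (int i = 1; i < n - 1; ++i) {
        double h0 = x[i] - x[i - 1], h1 = x[i + 1] - x[i];
        dl[i]  = h0;
        d[i]   = 2.0 * (h0 + h1);
        du[i]  = h1;
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h1 - (y[i] - y[i - 1]) / h0);
    }
    // Natural boundary conditions: M_0 = M_{n-1} = 0 (identity rows).
}
```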

Characteristics of Tridiagonal Matrices in EMD
- Large size
- Different numbers of matrices, depending on the dimensions or channels of the signals
  - Simultaneous tridiagonal matrices: 1D (1-channel) signals / 1D multiple-channel signals / 2D signals
- Variations of EMD
  - Ensemble EMD (EEMD): adding noise and performing EMD several times
  - Multi-dimensional EEMD

Benefits of Our gtsv
- Large matrices: some previous GPU EMD work used B-splines to approximate the spline because it could not solve large systems efficiently; our gtsv fits perfectly
- Multiple matrices of different sizes: our gtsv fits perfectly

Short Summary
- This is still ongoing work
- New GPU EMD source code is coming soon; check http://impact.crhc.illinois.edu/
- A joint project with Norden Huang's group: http://rcada.ncu.edu.tw

Q & A
Thank you
Li-Wen Chang at SC'12