A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois


A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs: How to Build a gtsv for CUSPARSE 2013. Li-Wen Chang, Wen-mei Hwu, University of Illinois

Material in this Session
This talk is based on our SC 2012 paper: Li-Wen Chang, John Stratton, Hee-Seok Kim, and Wen-mei Hwu, "A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC '12).
But it contains more:
- Details not shown in the paper due to the page limit
- Extension work done with the NVIDIA CUSPARSE team

Comparison among Tridiagonal Solvers

Solver                   Numerical Stability   CPU Performance   GPU Performance   Cluster Scalability
Matlab (backslash)       Yes                   Poor              Not supported     Not supported
Intel MKL (gtsv)         Yes                   Good              Not supported     Not supported
Intel SPIKE              Yes                   Good              Not supported     Supported
CUSPARSE gtsv (2012)     No                    Not supported     Good              Not supported
Our gtsv                 Yes                   Not supported     Good              Supported
Our heterogeneous gtsv   Yes                   Good              Good              Supported

Numerical Stability on GPUs
All previous related work for GPUs used:
- Unstable algorithms, like the Thomas algorithm, cyclic reduction (CR), or parallel cyclic reduction (PCR)
- No pivoting
Why is pivoting important?

CUSPARSE gtsv (2012)
- CR (+ PCR)
- But when the b_i's are 0s, the CR elimination step
  b_i' = b_i - a_i (c_{i-1} / b_{i-1}) - c_i (a_{i+1} / b_{i+1})
  divides by zero
[Slide shows the tridiagonal matrix, with right-hand side e_i, before and after one CR elimination step.]

Why Numerical Stability is Difficult on GPUs
Why didn't people apply pivoting on GPUs? They worried about performance:
- Pivoting does not seem to fit GPUs
  - Pivoting may serialize computation
  - Pivoting requires data-dependent control flow
- GPUs like regular computation and regular memory access
  - Branch divergence may hurt performance

Our gtsv
- For parallelization, the SPIKE algorithm is applied to decompose the problem
  - An optimization technique, data layout transformation, is applied to achieve high memory efficiency
- For data-dependent control flow, diagonal pivoting is chosen
  - An optimization technique, dynamic tiling, is proposed to achieve high memory efficiency

Part 1: SPIKE Algorithm
The SPIKE algorithm decomposes a tridiagonal matrix A into several blocks.

SPIKE Algorithm
- D and S can be defined such that A = DS
- AX = F can then be solved by solving DY = F, and then SX = Y

A Small Example
[Slides show a small tridiagonal matrix A partitioned into two blocks A_1 and A_2 with coupling blocks B and C, the right-hand side F, and the factorization A = DS with the spike columns v and w appearing in S.] (David Kirk/NVIDIA and Wen-mei W. Hwu)

SPIKE Algorithm
- How to build S? Solve DY = F
- Solve several independent tridiagonal systems A_i

SPIKE Algorithm
How to solve SX = Y?
- Solve the collection of the first and last rows of all blocks
- Reduction*: problem size 4L -> 6
- Backward substitution
[Diagram: the spike elements v_i and w_i in the first and last rows of each of the four blocks.]
*E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: The SPIKE algorithm," Parallel Computing, vol. 32, no. 2, pp. 177-194, 2006.
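To make the flow concrete, below is a minimal, runnable sequential sketch of the SPIKE steps for two partitions (illustrative code, not the production kernel: the block solves use the Thomas algorithm purely for brevity, whereas the real gtsv solves each partition with diagonal pivoting, one partition per GPU thread; with two partitions the reduced system degenerates to 2-by-2):

```cuda
#include <cstdio>
#include <vector>

// Thomas algorithm for one tridiagonal block: overwrites d with the solution.
static void thomas(int n, const double *a, const double *b, const double *c,
                   std::vector<double> &d) {
    std::vector<double> cp(n);
    cp[0] = c[0] / b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
        cp[i] = c[i] * m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    for (int i = n - 2; i >= 0; --i) d[i] -= cp[i] * d[i + 1];
}

int main() {
    // 8x8 tridiagonal system A x = f, two SPIKE partitions of L = 4 rows.
    const int n = 8, L = 4;
    double a[n], b[n], c[n], f[n];
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 4; c[i] = 1; f[i] = i + 1; }
    a[0] = c[n - 1] = 0;

    // DY = F: each partition solves its own block, ignoring the couplings.
    std::vector<double> y1(f, f + L), y2(f + L, f + n);
    thomas(L, a, b, c, y1);
    thomas(L, a + L, b + L, c + L, y2);

    // Spike columns: v = A1^{-1}(c_{L-1} e_last), w = A2^{-1}(a_L e_first).
    std::vector<double> v(L, 0.0), w(L, 0.0);
    v[L - 1] = c[L - 1]; thomas(L, a, b, c, v);
    w[0] = a[L];         thomas(L, a + L, b + L, c + L, w);

    // Reduced system (SX = Y restricted to the interface rows):
    //   x[L-1] + v[L-1] * x[L] = y1[L-1],   w[0] * x[L-1] + x[L] = y2[0]
    double det = 1.0 - v[L - 1] * w[0];
    double xb = (y1[L - 1] - v[L - 1] * y2[0]) / det;  // x[L-1]
    double xt = (y2[0] - w[0] * y1[L - 1]) / det;      // x[L]

    // Backward substitution recovers the interior unknowns.
    for (int i = 0; i < L; ++i) { y1[i] -= v[i] * xt; y2[i] -= w[i] * xb; }
    for (int i = 0; i < n; ++i)
        printf("x[%d] = %f\n", i, i < L ? y1[i] : y2[i - L]);
    return 0;
}
```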

Part 2: Diagonal Pivoting for Tridiagonal Matrices
How to solve each block A_i in a numerically stable way? Diagonal pivoting*
- Each A_i can be solved sequentially by one thread
Why diagonal pivoting?
- A better form of data-dependent control flow, which we can handle on GPUs
*J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, pp. 269-275, 2010.

Diagonal Pivoting
- A tridiagonal matrix A can be decomposed into LBM^T, instead of LDU
  - L and M are unit lower triangular matrices
  - B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks
- Criteria for choosing 1-by-1 or 2-by-2 blocks: asymmetric Bunch-Kaufman pivoting

LBM^T Decomposition
- B_d is a 1-by-1 or 2-by-2 block
- A_s is also a tridiagonal matrix
  - A_s is obtained by modifying the leading elements of T_22
  - A_s can be decomposed recursively
- For a 2-by-2 pivot, Δ = b_1 b_2 - a_2 c_1

Diagonal Pivoting
- A can be solved by solving L, B, and then M^T
- It has data-dependent control flow: B contains 1-by-1 or 2-by-2 blocks
- It is better than other pivoting methods: only nearby rows are accessed
- It requires dynamic tiling to perform efficiently on GPUs

More Optimization
- L and M^T are not stored; they are computed on the fly
- We store only the pivoting conditions and the leading elements of B
- d = 1 (1-by-1 pivot): the L_2 B_2 M_2^T factors are built from b_1, a_2/b_1, and c_1/b_1
- d = 2 (2-by-2 pivot): with Δ = b_1 b_2 - a_2 c_1, the factors are built from terms like -c_1 c_2/Δ, -a_2 a_3/Δ, b_1 a_3/Δ, and b_1 c_2/Δ
[Slide shows the tridiagonal entries b_1, c_1; a_2, b_2, c_2; a_3, b_3, c_3; a_4, b_4 and the corresponding factors for each pivot size.]

An Example
[Slide shows a small tridiagonal matrix A, the pivot size d chosen at each step, and what is actually stored: the leading elements of B plus the condition flags.]

Pivoting Criteria
Bunch-Kaufman algorithm for unsymmetric cases:
- κ = (√5 - 1)/2
- σ = max(|c_1|, |a_2|, |b_2|, |c_2|, |a_3|)
- If |b_1| σ >= κ |c_1| |a_2|: 1-by-1 pivoting; else: 2-by-2 pivoting
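Putting the criterion and the LBM^T recursion together, here is a minimal, runnable sequential sketch of one block solve (illustrative code built from the slide's formulas, not the production kernel; in the real gtsv each GPU thread runs a loop like this over its own SPIKE partition and stores only the pivot conditions and leading elements):

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static double mag(const std::vector<double> &v, int i) {   // |v[i]|, 0 if out of range
    return (i >= 0 && i < (int)v.size()) ? std::fabs(v[i]) : 0.0;
}

// Solve T x = f for tridiagonal T (lower a, main b, upper c) with 1-by-1 /
// 2-by-2 diagonal pivoting; f is overwritten with the solution x.
void dp_solve(std::vector<double> a, std::vector<double> b,
              std::vector<double> c, std::vector<double> &f) {
    const int n = (int)b.size();
    const double kappa = (std::sqrt(5.0) - 1.0) / 2.0;     // ~0.618
    std::vector<int> piv, d;                               // pivot starts and sizes

    for (int i = 0; i < n; ) {                             // forward elimination
        double sigma = std::max({mag(c, i), mag(a, i + 1), mag(b, i + 1),
                                 mag(c, i + 1), mag(a, i + 2)});
        if (i == n - 1 ||
            std::fabs(b[i]) * sigma >= kappa * mag(c, i) * mag(a, i + 1)) {
            if (i + 1 < n) {                               // 1-by-1 pivot
                double m = a[i + 1] / b[i];
                b[i + 1] -= m * c[i];
                f[i + 1] -= m * f[i];
            }
            piv.push_back(i); d.push_back(1); i += 1;
        } else {                                           // 2-by-2 pivot
            double delta = b[i] * b[i + 1] - c[i] * a[i + 1];
            if (i + 2 < n) {                               // update leading elements
                b[i + 2] -= a[i + 2] * b[i] * c[i + 1] / delta;
                f[i + 2] -= a[i + 2] * (b[i] * f[i + 1] - a[i + 1] * f[i]) / delta;
            }
            piv.push_back(i); d.push_back(2); i += 2;
        }
    }
    for (int k = (int)piv.size() - 1; k >= 0; --k) {       // backward substitution
        int j = piv[k];
        if (d[k] == 1) {
            f[j] = (f[j] - (j + 1 < n ? c[j] * f[j + 1] : 0.0)) / b[j];
        } else {                                           // solve the 2-by-2 block
            double delta = b[j] * b[j + 1] - c[j] * a[j + 1];
            double g0 = f[j];
            double g1 = f[j + 1] - (j + 2 < n ? c[j + 1] * f[j + 2] : 0.0);
            f[j]     = (b[j + 1] * g0 - c[j] * g1) / delta;
            f[j + 1] = (b[j] * g1 - a[j + 1] * g0) / delta;
        }
    }
}

int main() {
    // Zero main diagonal: Thomas and CR divide by zero here; pivoting does not.
    std::vector<double> a{0, 1, 1, 1}, b{0, 0, 0, 0}, c{1, 1, 1, 0}, f{1, 2, 3, 4};
    dp_solve(a, b, c, f);
    for (int i = 0; i < 4; ++i) printf("x[%d] = %f\n", i, f[i]);  // -2 1 4 2
    return 0;
}
```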

Our gtsv Algorithm
- Solving the A_i's dominates the runtime
  - Using diagonal pivoting
  - Each A_i is solved sequentially, and all A_i's are solved in parallel
- Requires data layout transformation to perform efficiently on GPUs

Data Layout
Observation:
- A GPU requires stride-one memory access to fully utilize memory bandwidth
Contradiction:
- Consecutive elements of a diagonal are stored in consecutive memory in the gtsv interface, but each block is processed by one thread
Solution:
- Data layout transformation

Data Layout Transformation
- Local transpose (the b_i's shown are elements of one diagonal)
- Example: 16 4-element blocks (one block per SPIKE partition)
[Diagram: the diagonal in address order before and after the local transpose.]
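A minimal CUDA sketch of the local transpose (illustrative: TILE and CHUNK are assumed tile parameters, the array length is assumed to be a multiple of TILE*CHUNK, and only one diagonal is shown; the real gtsv marshals all diagonals and the right-hand side):

```cuda
#define TILE  32   // chunks (threads) per tile -- assumed parameter
#define CHUNK 4    // consecutive elements owned by one thread -- assumed

// in:  chunk-contiguous layout (the gtsv interface order)
// out: locally transposed layout, so that the i-th elements of all TILE
//      chunks are adjacent and the per-thread sweeps become coalesced.
// Launch: local_transpose<<<n / (TILE * CHUNK), TILE>>>(d_in, d_out);
__global__ void local_transpose(const double *in, double *out) {
    __shared__ double tile[TILE * CHUNK];
    int base = blockIdx.x * TILE * CHUNK;

    // Coalesced read of the whole tile in the original layout.
    for (int k = threadIdx.x; k < TILE * CHUNK; k += blockDim.x)
        tile[k] = in[base + k];
    __syncthreads();

    // Coalesced write in the transposed layout: output slot j = e*TILE + t
    // receives element e of chunk t (strided shared-memory reads are cheap
    // compared to strided global-memory accesses).
    for (int j = threadIdx.x; j < TILE * CHUNK; j += blockDim.x) {
        int t = j % TILE, e = j / TILE;
        out[base + j] = tile[t * CHUNK + e];
    }
}
```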

Data Layout Transformation
Runtime (ms), old layout (gtsv interface) vs. proposed layout:
- Random: 69.46 -> 59.87
- Diagonally dominant: 38.63 -> 9.68 (4-5x)
- Zero diagonal: 34.59 -> 7.7 (4-5x)
- Data marshaling overhead: 4.73
Matrix types:
- Random: 1-by-1 or 2-by-2 pivoting
- Diagonally dominant: always 1-by-1 pivoting
- Zero diagonal: always 2-by-2 pivoting

Dynamic Tiling
Observation:
- Memory accesses with a compact footprint are handled well by the L1 cache, even when branch divergence exists
- A scattered footprint dramatically reduces memory efficiency
Solution:
- Insert barriers to regularize the memory access footprint (see the sketch after the next slide)
[Diagram: threads T1-T4 stepping through addresses at data-dependent rates; the footprint starts compact and gradually scatters.]

Dynamic Tiling
[Diagram: the same threads T1-T4 with estimated tiling boundaries and real barriers inserted; after each barrier the threads re-align, keeping the footprint compact.]
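A CUDA sketch of the idea (illustrative: TILE_ELEMS is an assumed tuning knob, chunk is assumed to be a multiple of it, and pivot_size is a hypothetical stand-in for the 1-by-1/2-by-2 decision; the structure, a uniform outer loop over estimated tiles with a real barrier at each boundary, is the point):

```cuda
#include <math.h>

// Hypothetical stand-in for the Bunch-Kaufman decision: threads advance by
// 1 or 2 elements per step depending on the data they see.
__device__ int pivot_size(const double *b, int i) {
    return (fabs(b[i]) < 1e-8) ? 2 : 1;
}

#define TILE_ELEMS 16   // estimated tile width -- assumed tuning knob

// Each thread sweeps its own chunk at a data-dependent rate. Without
// barriers the threads drift apart and the block's footprint scatters; the
// barrier at each estimated boundary re-aligns them so the footprint stays
// compact enough to live in L1.
__global__ void dyn_tiled_sweep(const double *b, double *out, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const double *myb = b + (size_t)tid * chunk;
    double acc = 0.0;
    int pos = 0;

    // The outer loop is uniform across the block, so every thread executes
    // the same sequence of __syncthreads() calls (one real barrier per tile).
    for (int boundary = TILE_ELEMS; boundary <= chunk; boundary += TILE_ELEMS) {
        while (pos < boundary) {
            acc += myb[pos];               // stands in for elimination work
            pos += pivot_size(myb, pos);   // 1-by-1 or 2-by-2 step
        }
        __syncthreads();                   // real barrier at estimated boundary
    }
    out[tid] = acc;
}
```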

Dynamic Tiling
Runtime (ms), data layout only vs. dynamic tiling (with data layout):
- Random: 59.87 -> 16.83 (3.5x)
- Diagonally dominant: 9.68 -> 9.88
- Zero diagonal: 7.7 -> 7.3

Dynamic Tiling
[Bar chart: performance counters (%) with and without tiling for random, diagonally dominant, and zero-diagonal matrices: global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency. On the random case, tiling improves the load efficiency and L1 hit rate by roughly 3x and the store efficiency by roughly 1.8x; warp execution efficiency stays low because of branch divergence.]

Final Evaluation
Three kinds of evaluation:
- Numerical stability: a backward analysis, ||Ax - b|| / ||b||, on 16 selected types of matrices*
- One-GPU performance
- Cluster scalability: multiple GPUs, and multiple GPUs + multiple CPUs
*J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, pp. 269-275, 2010.

Numerical Stability
Relative backward error, ||Ax - b|| / ||b||:

Matrix type   Our gtsv    Our dtsvb   CUSPARSE   MKL        Intel SPIKE   Matlab
1             .82E-4      .97E-4      7.4E-2     .88E-4     .39E-5        .96E-4
2             .27E-6      .27E-6      .69E-6     .3E-6      .2E-6         .3E-6
3             .55E-6      .52E-6      2.57E-6    .35E-6     .29E-6        .35E-6
4             .37E-4      .22E-4      .39E-2     3.E-5      .69E-5        2.78E-5
5             .7E-4       .3E-4       .82E-4     .56E-4     4.62E-5       2.93E-4
6             .5E-6       .6E-6       .57E-6     9.34E-7    9.5E-7        9.34E-7
7             2.42E-6     2.46E-6     5.3E-6     2.52E-6    2.55E-6       2.27E-6
8             2.4E-4      2.4E-4      .5E+       3.76E-4    2.32E-6       2.4E-4
9             2.32E-5     3.9E-4      .93E+8     3.5E-5     9.7E-6        .9E-5
10            4.27E-5     4.83E-5     2.74E+5    3.2E-5     4.72E-6       3.2E-5
11            7.52E-4     6.59E-2     4.54E+     2.99E-4    2.2E-5        2.28E-4
12            5.58E-5     7.95E-5     5.55E-4    2.24E-5    5.52E-5       2.24E-5
13            5.5E-       5.45E-      .2E+6      3.34E-     3.92E-5       3.8E-
14            2.86E+49    4.49E+49    2.92E+5    .77E+48    3.86E+54      .77E+48
15            2.9E+6      NaN         NaN        .47E+59    Fail          3.69E+58
16            Inf         NaN         NaN        Inf        Fail          4.68E+7

GPU Performance
[Bar chart: runtime of solving an 8M-row matrix (ms, axis 0-300): our dgtsv (GPU), our ddtsvb (GPU), CUSPARSE dgtsv (GPU), data transfer (pageable), data transfer (pinned), and MKL dgtsv (sequential, CPU), for random and diagonally dominant matrices.]

Our Heterogeneous gtsv
- SPIKE algorithm
  - OpenMP for multiple cores within one node
  - CUDA streams for multiple GPUs
  - MPI for multiple nodes
- MKL gtsv for CPUs
- Our gtsv for GPUs

Cluster Scalability (GPUs)
[Log-scale bar chart: strong-scaling runtime (ms) of our gtsv, with and without predistributed data, on 1, 2, 4, 8, and 16 GPUs.]

Cluster Scalability (GPUs)
[Log-scale bar chart: weak-scaling runtime (ms) of our gtsv, with and without predistributed data: 1 GPU on a 2M-sized matrix, 2 GPUs on 4M, 4 GPUs on 8M, 8 GPUs on 16M, and 16 GPUs on 32M.]

Cluster Scalability (GPUs+CPUs)
[Charts: strong scaling, strong scaling with predistributed data, and weak scaling.]

Short Summary

Solver                   Numerical Stability   CPU Performance   GPU Performance   Cluster Scalability
Matlab (backslash)       Yes                   Poor              Not supported     Not supported
Intel MKL (gtsv)         Yes                   Good              Not supported     Not supported
Intel SPIKE              Yes                   Good              Not supported     Supported
CUSPARSE gtsv (2012)     No                    Not supported     Good              Not supported
Our gtsv                 Yes                   Not supported     Good              Supported
Our heterogeneous gtsv   Yes                   Good              Good              Supported

More Features of Our gtsv
- Supports 4 data types (in CUSPARSE 2013): float (S), double (D), complex (C), double complex (Z)
- Supports arbitrary sizes
- Supports multiple right-hand-side vectors
- Supports both general matrices (gtsv) and diagonally dominant matrices (dtsvb)

More Details
- 4 data types: CUSPARSE built-in operators
- dtsvb: SPIKE + Thomas algorithm
- Arbitrary sizes: padding
  - Pad 1s for the main diagonal, and 0s for the lower and upper diagonals (see the sketch below)
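For instance, a host-side sketch of the padding trick (illustrative names; M stands for whatever internal partition size the solver needs the length to be a multiple of):

```cuda
#include <vector>

// Extend an n-row system to the next multiple of M with identity rows:
// 1s on the main diagonal, 0s on the lower/upper diagonals, 0s in the RHS.
// The padded rows solve to x = 0 and, because the seam entries are 0,
// never couple back into the real unknowns.
void pad_system(std::vector<double> &dl, std::vector<double> &d,
                std::vector<double> &du, std::vector<double> &rhs, int M) {
    size_t padded = (d.size() + M - 1) / M * M;   // round up to a multiple of M
    dl.resize(padded, 0.0);    // lower diagonal: pad 0s
    d.resize(padded, 1.0);     // main diagonal:  pad 1s
    du.resize(padded, 0.0);    // upper diagonal: pad 0s
    rhs.resize(padded, 0.0);
}
```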

More Details
- Multiple right-hand-side vectors: the Y_i's have multiple columns, but the W_i's and V_i's have only one column each

More Details
- Solve the V_i's, the W_i's, and the first column of the Y_i's; build L, B, and M^T
- Then solve the remaining columns of the Y_i's using the pre-built L, B, and M^T
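The same factor-once / solve-many structure in a minimal sketch (using an unpivoted tridiagonal LU here for brevity; the real gtsv reuses the stored L, B, and M^T diagonal-pivoting factors in exactly this way):

```cuda
#include <vector>

struct Factors {
    std::vector<double> m;    // L multipliers
    std::vector<double> p;    // U main diagonal (pivots)
    std::vector<double> du;   // U upper diagonal (unchanged)
};

// Factor T = LU once (unpivoted, for illustration only).
Factors factor(const std::vector<double> &a, const std::vector<double> &b,
               const std::vector<double> &c) {
    int n = (int)b.size();
    Factors F{std::vector<double>(n, 0.0), std::vector<double>(n, 0.0), c};
    F.p[0] = b[0];
    for (int i = 1; i < n; ++i) {
        F.m[i] = a[i] / F.p[i - 1];
        F.p[i] = b[i] - F.m[i] * c[i - 1];
    }
    return F;
}

// Reuse the pre-built factors for each right-hand-side column.
void solve_with(const Factors &F, std::vector<double> &x) {
    int n = (int)x.size();
    for (int i = 1; i < n; ++i) x[i] -= F.m[i] * x[i - 1];   // forward (L)
    x[n - 1] /= F.p[n - 1];
    for (int i = n - 2; i >= 0; --i)                          // backward (U)
        x[i] = (x[i] - F.du[i] * x[i + 1]) / F.p[i];
}
```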

Summary
- The first numerically stable tridiagonal solver for GPUs
  - Numerical stability comparable to Intel MKL
  - Speed comparable to NVIDIA CUSPARSE 2012
  - Supports large matrices
- CUSPARSE gtsv 2013: the cluster support is removed
- Source code for a prototype is available at http://impact.crhc.illinois.edu/ with a BSD-like license

Something We Forgot
- How about a batch version? ("Batch" means multiple matrices of the same size)
- Currently, you can simply merge them into one large matrix (see the sketch below)
- This even works for multiple matrices of different sizes
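A host-side sketch of the merging trick (illustrative struct-of-vectors layout; the zero sub- and super-diagonal entries at every seam keep the systems decoupled, which is also why systems of different sizes can be stacked):

```cuda
#include <vector>

struct Tridiag { std::vector<double> dl, d, du, rhs; };  // one system per struct

// Stack a batch of tridiagonal systems into one big system that a single
// gtsv call can solve. Each system's dl[0] and du[n-1] are 0, so after
// concatenation every seam has a 0 in the sub- and super-diagonal and the
// systems remain independent.
Tridiag merge_batch(const std::vector<Tridiag> &batch) {
    Tridiag big;
    for (const Tridiag &t : batch) {
        big.dl.insert(big.dl.end(), t.dl.begin(), t.dl.end());
        big.d.insert(big.d.end(),   t.d.begin(),  t.d.end());
        big.du.insert(big.du.end(), t.du.begin(), t.du.end());
        big.rhs.insert(big.rhs.end(), t.rhs.begin(), t.rhs.end());
    }
    return big;
}
```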

A Case Study
Empirical Mode Decomposition (EMD)
- An adaptive time- (or spatial-) frequency analysis
- Applications: climate research, orbit research, structural health monitoring, water wave analysis, biomedical signal analysis

Empirical Mode Decomposition
Spline interpolation is at the core of the sifting procedure; a sketch of how that interpolation reaches the tridiagonal solver follows below.
[Flow diagram: the sifting procedure detects maxima and minima, runs spline interpolation on both envelopes (each interpolation calls a tridiagonal solver), takes the mean of the envelopes, and subtracts it from the signal; the IMF procedure repeats sifting on the residues r_i(t) to extract the IMFs c_1(t), c_2(t), ..., c_N(t) from x(t).]
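A sketch of why the envelope interpolation produces tridiagonal systems (assuming the standard natural cubic spline formulation; the actual EMD code may use a different boundary condition):

```cuda
#include <vector>

// Build the tridiagonal system for the second derivatives M_i of a natural
// cubic spline through the extrema (x_i, y_i) -- this is the system the EMD
// envelope interpolation hands to the tridiagonal solver.
void spline_system(const std::vector<double> &x, const std::vector<double> &y,
                   std::vector<double> &dl, std::vector<double> &d,
                   std::vector<double> &du, std::vector<double> &rhs) {
    int n = (int)x.size();
    dl.assign(n, 0.0); d.assign(n, 1.0); du.assign(n, 0.0); rhs.assign(n, 0.0);
    for (int i = 1; i < n - 1; ++i) {
        double h0 = x[i] - x[i - 1], h1 = x[i + 1] - x[i];
        dl[i]  = h0;
        d[i]   = 2.0 * (h0 + h1);
        du[i]  = h1;
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h1 - (y[i] - y[i - 1]) / h0);
    }
    // Natural boundary conditions: M_0 = M_{n-1} = 0 (identity rows).
}
```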

Characteristics of Tridiagonal Matrices in EMD
- Large size
- Different numbers of matrices, depending on the dimensions or channels of the signals
  - Simultaneous tridiagonal matrices: 1D (1-channel) signals / 1D multiple-channel signals / 2D signals
- Variations of EMD
  - Ensemble EMD (EEMD): adding noise and performing EMD several times
  - Multi-dimensional EEMD

Benefits of Our gtsv
- Large matrices: some previous GPU EMD work used B-splines to approximate the spline because it could not solve large systems efficiently; our gtsv fits perfectly
- Multiple matrices of different sizes: our gtsv fits perfectly

Short Summary
- This is still ongoing work
- New GPU EMD source code is coming soon; check http://impact.crhc.illinois.edu/
- A joint project with Norden Huang's group: http://rcada.ncu.edu.tw

Q & A
Thank you
Li-Wen Chang at SC'12