S4289: Efficient solution of multiple scalar and block-tridiagonal equations
1 S4289: Efficient solution of multiple scalar and block-tridiagonal equations
Endre László, endre.laszlo [at] oerc.ox.ac.uk
Oxford e-Research Centre, University of Oxford, UK
Pázmány Péter Catholic University, Budapest, Hungary
Mike Giles (Oxford), Jeremy Appleyard (NVIDIA)
GPU Technology Conference, March 26th, 2014, San Jose
2 Outline for GPU developers
1 Batch scalar-tridiagonal solvers
- ADI (Alternating Direction Implicit) method
- Thomas algorithm
- Multi-dimensional data structures - access patterns
- Optimization: local data transposition in shared memory
- Optimization: local data transposition with __shfl()
- Thomas-PCR hybrid
- Comparison to CPU, Xeon Phi and the LAPACK tridiagonal solver
2 Batch block-tridiagonal solver
- Block-tridiagonal data structure - access patterns
- Work-sharing on the GPU
- Comparison to CPU and the LAPACK banded solver
Conclusion
3 Example: Solving the heat equation with ADI
The heat-diffusion equation is the PDE that is solved with the method:
$$\frac{\partial u}{\partial t} = \nabla^2 u \qquad (1)$$
ADI (Alternating Direction Implicit) method:
- Classical FD scheme
- Computationally cheaper than Crank-Nicolson
- Relies on approximate factorization
- $O(\Delta t^2, \Delta x^2)$ order accurate in both space and time
- Unconditionally stable if the parameters are chosen right (positive)
- Introduced by Peaceman and Rachford [1]
[1] D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.
4 Example: Solving the heat equation with ADI
3 tridiagonal solves along dimensions X, Y, Z:
$$\begin{aligned}
\text{preproc:} &\quad u^{(0)} = \lambda\,(\delta_x^2 + \delta_y^2 + \delta_z^2)\,u^n \\
\text{x dim:} &\quad (1 - \lambda\delta_x^2)\,u^{(1)} = u^{(0)} \\
\text{y dim:} &\quad (1 - \lambda\delta_y^2)\,u^{(2)} = u^{(1)} \\
\text{z dim:} &\quad (1 - \lambda\delta_z^2)\,u^{(3)} = u^{(2)} \\
\text{add:} &\quad u^{n+1} = u^n + u^{(3)}
\end{aligned}$$
The upcoming discussion of tridiagonal solvers is in the context of the ADI method.
5 A tridiagonal system
Storage:
- 3 coefficient arrays
- 1 solution array
- 1 RHS array
All stored in a cubic data structure.
$$\begin{pmatrix}
b_0 & c_0 & & & \\
a_1 & b_1 & c_1 & & \\
 & a_2 & b_2 & c_2 & \\
 & & \ddots & \ddots & \ddots \\
 & & & a_{N-1} & b_{N-1}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{N-1} \end{pmatrix}
=
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_{N-1} \end{pmatrix}$$
6 Solving tridiagonal systems
Assumptions:
- Stems from real-world CFD and financial applications
- Computation domain: structured, multidimensional
- Hypercubic-ish: $\Omega = \mathbb{R}^{N_0 \times N_1 \times \cdots \times N_D}$, $D = 2..8$
- Large number of systems: $N^2$ on an $N^3$ cube - enough to saturate the GPU
- System sizes on the order of 100s-1000s
- Each system has its own coefficients and RHS
- No pivoting: diagonal dominance required
7 Batch scalar-tridiagonal solvers
cusparse?gtsvStridedBatch() - inefficient under the previous assumptions:
- CR-PCR hybrid
- Lack of multidimensional support: a global data transpose is needed in the Y and Z dimensions
- Extra space requirements: 768 MB for a SP problem
- Uses two kernel calls
There is enough parallelism in batch problems - CR/PCR is not necessarily needed:
- Tesla K40 has 12 GB device memory
- Multidimensional problem domain $N^d$ for dimensions $d = 2..8$: $N^{d-1}$ parallel systems per solve dimension
8 Thomas algorithm
Algorithm 1: Thomas algorithm
Require: thomas(a, b, c, d)
 1: d_0 := d_0 / b_0
 2: c_0 := c_0 / b_0
 3: for i = 1, ..., N-1 do
 4:   r := 1 / (b_i - a_i * c_{i-1})
 5:   d_i := r * (d_i - a_i * d_{i-1})
 6:   c_i := r * c_i
 7: end for
 8: for i = N-2, ..., 0 do
 9:   d_i := d_i - c_i * d_{i+1}
10: end for
11: return d
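To make the pseudocode concrete, here is a minimal CUDA sketch (not the speakers' library code; the signature, the cprime scratch array and the stride parameter are assumptions): one thread solves one system, and the stride argument lets the same routine walk a system laid out along X (stride 1), Y (stride NX) or Z (stride NX*NY).

__device__ void thomas(const float* a, const float* b, const float* c,
                       float* d, float* cprime, int N, int stride)
{
    // Forward pass: normalize row 0, then eliminate the sub-diagonal.
    cprime[0] = c[0] / b[0];
    d[0]      = d[0] / b[0];
    for (int i = 1; i < N; i++) {
        float r = 1.0f / (b[i*stride] - a[i*stride] * cprime[i-1]);
        d[i*stride] = r * (d[i*stride] - a[i*stride] * d[(i-1)*stride]);
        cprime[i]   = r * c[i*stride];
    }
    // Backward pass: back-substitution; the solution overwrites d.
    for (int i = N - 2; i >= 0; i--)
        d[i*stride] = d[i*stride] - cprime[i] * d[(i+1)*stride];
}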
9 Multi-dimensional data structures - access patterns
Data layout: idx = k*NX*NY + j*NX + i
Performance depends on how the threads are mapped to the domain - different efficiency along different dimensions.
Assume a sequential dependence in the algorithm iterating along a dimension (see the sketch below):
- X: stride = 1 - worst performance
- Y: stride = NX - best performance
- Z: stride = NX*NY - good performance if a high TLB miss rate is avoided
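As a concrete illustration (a hypothetical kernel, not the talk's code; the in-place update is a stand-in for a real sweep), here is a Y-solve thread mapping where consecutive threads touch consecutive i, so every access is coalesced:

__global__ void y_solve_map(float* d, int NX, int NY, int NZ)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NX * NZ) return;          // one thread per (i,k) column
    int i = tid % NX;                    // consecutive threads -> consecutive i
    int k = tid / NX;
    // Each thread iterates its own system along j with stride NX; across
    // the warp the addresses differ by 1 element, i.e. fully coalesced.
    for (int j = 1; j < NY; j++) {
        int idx = k*NX*NY + j*NX + i;
        d[idx] += d[idx - NX];           // stand-in for a forward-sweep update
    }
}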
10 Mapping threads to the domain: X/Y/Z-dimension solves
[Figure: time per grid element (ns) for PreProc, X-solve, Y-solve and Z-solve, in SP and DP; an x16.5 gap is marked at the X-solve.]
- X: offset = 1, stride = NX - 4 byte / 32 byte = 12.5% cache-line utilization in SP
- Y: offset = NX, stride = 1 - perfectly coalesced, 100% utilization
- Z: offset = NX*NY, stride = 1 - perfectly coalesced, 100% utilization + moderate TLB hit rate
11 Mapping threads to the domain: X/Y/Z-dimension solves
[Figure: achieved bandwidth (GB/s) of the X-, Y- and Z-solves in SP and DP; Nvidia Tesla K40 (GK110B) peak: 288 GB/s.]
12 TLB (Translation Lookaside Buffer) miss rate
- CUDA uses a Unified Virtual Address Space
- The virtual address space uses memory pages
- Memory page frame pointers are cached in the TLB
- The TLB is a coarser cache that works with the LLC: it translates an address tag to a frame pointer and caches frame pointers from main memory
- On NVIDIA devices the TLB is implemented in hardware and page sizes cannot be changed
- Small page size + long strides -> high TLB miss rate
- NVVP reports it implicitly within the "Global memory replay overhead" counter
- 753 clock cycles of latency in case of a TLB page miss [2]
[2] Measured on GT200 by Wong et al. in "Demystifying GPU microarchitecture through microbenchmarking," Performance Analysis of Systems and Software (ISPASS), 2010.
13 How to cope with TLB miss rate and coalescence?
TLB is easy: remap your solver for better locality
- Change the 2D thread block mapping into 1D thread blocks, so that threads within a block solve the closest neighboring set of systems
- Perform cache/register blocking
Coalesced memory access is more difficult - it is only a problem in the X dimension. Cache blocking is needed:
- Local transpose in shared memory, or
- Local transpose with register shuffle (__shfl() intrinsic), or
- Caching a whole system - Thomas-PCR hybrid
14 Thomas with shared memory transpose
Forward pass:
1 Wrap a warp (32 threads) into 4x8 blocks to perform non-caching (32-byte) loads
2 Load a 32x8 tile into shared memory: 8 steps of 4x8 block loads
3 Transpose the data by putting values into registers: float a[8]; is compiled to 8 registers if the array indexing is known at compile time
4 Perform the calculation with the 8 values along the X dimension
5 Repeat from 2 until the end of the X dimension is reached
Backward pass:
1 Do the same backwards: transpose + store
[Figure: a 32-row x 8-column tile loaded from the register file into shared memory in 32-byte steps, then transposed so that each thread holds one row in float reg[8].]
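A minimal CUDA sketch of steps 1-3 (a hypothetical fragment, not the talk's kernel; d, NX, sysBase and x0 are assumed names, and a full solver would repeat this per step 5):

__global__ void x_tile_load(const float* d, int NX, int sysBase, int x0)
{
    __shared__ float tile[32][8 + 1];    // +1 padding column avoids bank conflicts
    int lane = threadIdx.x & 31;         // lane id within the warp
    int row  = lane / 8;                 // warp wrapped into 4x8 blocks:
    int col  = lane % 8;                 // 8 lanes fetch one 32-byte segment
    // Step 2: 8 steps of 4x8 block loads fill the 32x8 tile.
    for (int s = 0; s < 8; s++)
        tile[4*s + row][col] = d[(sysBase + 4*s + row) * NX + x0 + col];
    __syncthreads();
    // Step 3: transposed view - thread 'lane' now owns one row of 8
    // consecutive x-values; compile-time indexing keeps reg[] in registers.
    float reg[8];
    #pragma unroll
    for (int v = 0; v < 8; v++)
        reg[v] = tile[lane][v];
    // Step 4 (the Thomas forward sweep on reg[0..7]) would go here.
}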
15 Thomas with register shuffle transpose
Forward pass:
1 Wrap 32 threads into 8x4 blocks to perform 4 x float4 vector loads
2 Load a 32x16 tile into registers: 4 threads read 4 consecutive float4 vectors = 64 bytes; do this 4 times for the rows under each other
3 Transpose the data within 4 threads: the 4 threads exchange data on a 4x4 2D array with __shfl() on float4 values; each element in the 2D array is a float4 vector
4 Perform the calculation with the 16 values along the X dimension
5 Repeat from 2 until the end of the X dimension is reached
Backward pass:
1 Do the same backwards: transpose + store
[Figure: 4 read steps of 64-byte float4 loads filling a 32-row x 16-column tile in the register file, then a 4x4 transpose of float4 elements leaving each thread with float reg[16].]
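The shfl(float4) exchange in step 3 is not a single hardware instruction; a helper like the sketch below (an assumption for illustration, not the talk's code) composes it from four 32-bit shuffles, written here with the modern __shfl_sync() intrinsic:

__device__ float4 shfl_float4(float4 v, int srcLane)
{
    float4 r;
    r.x = __shfl_sync(0xffffffffu, v.x, srcLane);   // each component is a
    r.y = __shfl_sync(0xffffffffu, v.y, srcLane);   // separate 32-bit shuffle
    r.z = __shfl_sync(0xffffffffu, v.z, srcLane);
    r.w = __shfl_sync(0xffffffffu, v.w, srcLane);
    return r;
}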
16 Thomas/PCR hybrid algorithm
17 Performance comparison
[Figure: time per grid element (ns) for Trid-X and Trid-Y, in SP and DP, comparing: Naïve, Shared transpose, Register shuffle, Thomas-PCR hybrid, cuSPARSE, 2-socket Xeon LAPACKE, and Xeon Phi.]
CPU: Intel Xeon E5-2680, 2-socket, 40 MB, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
18 Scalar-tridiagonal library use with OpenACC

int main() {
    int n = NX*NY*NZ;
    float* u  = (float *) malloc(sizeof(float)*n);
    float* ax = (float *) acc_malloc(sizeof(float)*n);
    ...
    #pragma acc data copy(u[0:n]) deviceptr(ax,bx,cx,ay,by,cy,az,bz,cz,du)
    for (it = 0; it < iter; it++) {
        ...
        // calculate r.h.s. and set tri-diagonal coefficients
        int ndim = 3;
        int dims[3] = {256,256,256};
        int pads[3] = {320,320,320};
        int solvedim = 0; // X-solve
        tridSmtsvStridedBatch(ax, bx, cx, du, u, ndim, solvedim, dims, pads);
        solvedim = 1; // Y-solve
        tridSmtsvStridedBatch(ay, by, cy, du, u, ndim, solvedim, dims, pads);
    }
    ...
}
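Note the design visible in this listing: the same padded cube is passed to every call and only solvedim changes, so the Y-solve (and a Z-solve with solvedim = 2) needs no global data transpose - the main inefficiency of the cuSPARSE route noted earlier. The pads argument (320 vs. dims of 256) presumably pads each row so that accesses stay aligned.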
19 Batch block-tridiagonal solver
Motivation for a block solver:
- State variables in CFD/finance PDEs have inter-dependence
- Block matrices with block sizes of 2-8
Sub-problems to be solved:
- Inverting and multiplying blocks (matrices) involves branching
- Data storage shortage limits the number of systems on the device
Optimization strategies on GPUs:
- Data storage - for better data locality
- Work sharing - to increase parallelism
- Inter-thread communication with shared memory
- Inter-thread communication with register shuffle
20 Batch block-tridiagonal solver work sharing
Threads within a warp compute (a sketch of the matrix-vector case follows below):
- Matrix-matrix products
- Matrix-vector products
- Gauss-Jordan block solves
A thread stores one column of a block and one scalar value of a vector. Special attention is needed to help register allocation. The algorithms are implemented with shared memory or __shfl() intrinsic communication. In the worst case (M = 7), 4 threads out of 32 are idle.
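A minimal sketch of the matrix-vector product under this layout (an assumption for illustration, not the talk's code; it takes M to be a power of two and the M cooperating lanes aligned within the warp): thread j holds column j of an MxM block plus the scalar x_j, and a butterfly reduction delivers each row sum.

__device__ float block_matvec_row(const float col[/*M*/], float x, int i, int M)
{
    // Thread j's contribution to row i of y = A*x: A[i][j] * x_j.
    float p = col[i] * x;
    // Butterfly reduction across the M cooperating lanes;
    // afterwards every lane holds y_i.
    for (int ofs = M >> 1; ofs > 0; ofs >>= 1)
        p += __shfl_xor_sync(0xffffffffu, p, ofs);
    return p;
}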
21 Batch block-tridiagonal matrix storage
- Blocks are stored in a row-major format
- Blocks of different problems are interleaved for better data locality:
$$A_0^0\, A_1^0 \cdots A_{P-1}^0\;\; A_0^1\, A_1^1 \cdots A_{P-1}^1\;\; A_0^2\, A_1^2 \cdots \qquad (2)$$
where $A_p^i$ denotes block $i$ of problem $p$ and $P$ is the number of problems.
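In code, the layout of Eq. (2) amounts to an addressing helper like this sketch (a hypothetical convention consistent with the equation; row-major MxM blocks assumed):

__host__ __device__ int block_offset(int p, int i, int P, int M)
{
    // All P problems' i-th blocks sit next to each other, so threads
    // working on different problems at the same step access neighbours.
    return (i * P + p) * M * M;   // start of block i of problem p
}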
22 Batch block-tridiagonal solver
- Two versions: shared memory, register shuffle
- Register spill above 8x8 DP block size
- Approx. 8-16k threads saturate the GPU
- Low shared memory use - 576 (1125) bytes per thread block in SP (DP) - good occupancy
Profile of the shared-memory SP 8x8 version:
- Shared memory efficiency: 84.5%
- Shared memory load/store throughput: 1700 GB/s
- L2 hit rate (L1 reads): 51.9%
- Executed IPC: 1.68
- Texture cache hit rate: 50%
In SP the shuffle version is better; in DP the shared memory version is better.
23 Data and compute throughput
[Figure: (a) effective data throughput (GB/s) and (b) compute throughput (GFLOPS) versus block size M, in SP and DP.]
CPU: Intel Xeon E5-2690, 2-socket, 40 MB, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
24 Performance comparison
Baseline: multi-threaded LAPACKE_?gbsv_work() banded solver
[Figure: speedup over LAPACKE versus block size M for CPU, GPU - Shared, and GPU - Shuffle; (a) single precision, (b) double precision.]
25 Conclusion
Batch tridiagonal solvers:
- Scalar solver
  - Different optimization strategies: Thomas with shared memory transpose, Thomas with register shuffle transpose, Thomas/PCR hybrid
  - Library-quality solution for scalar tridiagonal solvers
- Block solver
  - High-throughput solver
  - Higher performance than a vectorized, multi-threaded CPU block solver or the banded LAPACK(E) solver
The contributions of the NVIDIA-funded summer interns James Whittle and Catherine Hastings are acknowledged.