Flexible Batched Sparse Matrix-Vector Product on GPUs

Size: px

Start display at page:

Download "Flexible Batched Sparse Matrix-Vector Product on GPUs"

Gordon Kelley
5 years ago
Views:

1 ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra, Goran Flegar, Enrique, S. Quintana-Orti

2 Input A, x, y Output y = A x Matrix A contains only few nonzero elements. Storing all entries results in large overhead (memory & computation). 2

3 Input A, x, y Output y = A x Matrix A contains only few nonzero elements. Storing all entries results in large overhead (memory & computation). Idea: Store only nonzero elements [nz] explicitly. A = C A value = [ ] Value 3

4 Input A, x, y Output y = A x Matrix A contains only few nonzero elements. Storing all entries results in large overhead (memory & computation). Idea: Store only nonzero elements [nz] explicitly. Need to also store location of nonzero elements! COO format : A = C A Memory footprint of COO format: nz(val) + 2*nz(int) value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 4

5 Matrix A contains only few nonzero elements. Storing all entries results in large overhead (memory & computation). Idea: Store only nonzero elements [nz] explicitly. Need to also store location of nonzero elements! CSR format : A = C A Memory footprint of COO format: nz(val) + 2*nz(int) Memory footprint of CSR format: nz(val) + nz(int) + (n+1) (int) value = [ ] Value colidx = [ ] Column-index rowptr = [ ] Points to the first element in each row Number of nonzero elements 5

6 How to parallelize this? A = C A value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 6

7 How to parallelize this? Parallelize by rows: Every thread handles the computation of one sum in local memory. Significant workload imbalance! Need branching logic, branch divergence on vector machines. T1 T2 T3 A = T4 B T C A * access to input vector x = access to output vector y T1 T2 T3 T4 T5 T6 value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 7

8 How to parallelize this? Parallelize by rows: Every thread handles the computation of one sum in local memory. Balanced workload. Can result in significant overhead for unbalanced problems. ELL format : T1 T2 T3 A = T4 B T C A * access to input vector x = access to output vector y T1 T2 T3 T4 T5 T6 Values and column-index padded for uniform row-length value = [ ] colidx = [ ] Number of nonzero elements 8

9 How to parallelize this? Parallelize by rows: Every thread handles the computation of one sum in local memory. Significant workload imbalance! Ordered access to input vector x. T1 T2 T3 A = T4 B T C A * access to input vector x = access to output vector y T1 T2 T3 T4 T5 T6 value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 9

10 How to parallelize this? Parallelize by elements: Balanced workload. Partial sums need synchronization: Write conflicts! A = C A * access to input vector x = access to output vector y T1 T2 T3 T4 T5 T6 value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 1

11 How to parallelize this? Parallelize by elements: Balanced workload. Partial sums need synchronization: Write conflicts! A = C A * access to input vector x = access to output vector y T1 T2 T3 T4 T5 T6 value = [ ] Value colidx = [ ] Column-index rowidx = [ ] Row-index 11 G. Flegar et al.: Overcoming Load Imbalance for Irregular Sparse Matrices, IA 3, 217.

12 Different kernels optimal for different problem classes CSR small memory footprint Needs some logic for row-parallel processing ELL zero-padding allows for efficient SIMD execution Efficient for balanced matrices COO can compensate workload imbalance for irregular patterns 12

13 Different kernels optimal for different problem classes CSR small memory footprint Needs some logic for row-parallel processing ELL zero-padding allows for efficient SIMD execution Efficient for balanced matrices COO can compensate workload imbalance for irregular patterns For a single problem, we can usually find an optimal kernel, BUT 13

14 What if we process many different matrices at a time? (Assume they are all small) 14

15 What if we process many different matrices at a time? (Assume they are all small) Design a batched SpMV kernel. Process a large number of data-independent problems in parallel. 15

16 What if we process many different matrices at a time? (Assume they are all small) Design a batched SpMV kernel. Process a large number of data-independent problems in parallel. 16

17 What if we process many different matrices at a time? (Assume they are all small) Design a batched SpMV kernel. Process a large number of data-independent problems in parallel. Are the problems Same Size? Same number of nonzerosoverall? Same number of nonzerosin every row? Same sparsity pattern? 17

What if we process many different matrices at a time? (Assume they are all small) Design a batched SpMV kernel. Process a large number of data-independent problems in parallel.

18 What if we process many different matrices at a time? (Assume they are all small) Design a batched SpMV kernel. Process a large number of data-independent problems in parallel. Are the problems Same Size? Same number of nonzerosoverall? Same number of nonzerosin every row? Same sparsity pattern? Let s be as flexible as possible! Flexible means everything can be different Every Thread block handles one system Memory pointers to distinct systems Load input vector x into shared memory Kernel for all matrices in CSR, COO, ELL 18

19 What if we process many different matrices at a time? (Assume they are all small) GPU thread blocks Load respective vector x Load respective vector x All matrices stored the same format. Let s be as flexible as possible! Flexible means everything can be different Every Thread block handles one system Memory pointers to distinct systems Load input vector x into shared memory Kernel for all matrices in CSR, COO, ELL 19

20 Flexible batched SpMV First experiment: Use different batched SpMV kernels (COO, CSR, ELL ) A batch consisting of the same matrices (homogeneous batch) NVIDIA P1 GPU 56 SMX, 5.3 TF DP GB/s CAGE_8 CAN838 DWT_922 EX25 EX27 GR_3_3 2

21 Flexible batched SpMV First experiment: Use different batched SpMV kernels (COO, CSR, ELL ) A batch consisting of the same matrices (homogeneous batch) NVIDIA P1 GPU 56 SMX, 5.3 TF DP GB/s CAGE_8 CAN838 DWT_922 EX2 EX27 GR_3_3 Disclaimer: This is an artificial problem setting! In a real-world scenario, a homogeneous batched SpMVwould be handled as SpMM. 21

22 Flexible batched SpMV batched ELL std. ELL batched CSR std. CSR batched COO 1 GFLOPs batched ELL std. ELL batched CSR std. CSR batched COO batch size batched ELL std. ELL batched CSR std. CSR batched COO batch size 22 GFLOPs 8 GFLOPs 8 batched ELL std. ELL batched CSR std. CSR batched COO GR_3_ batch size ex batch size ex2 batched ELL std. ELL batched CSR std. CSR batched COO 2 GFLOPs DWT_922 CAN838 GFLOPs GFLOPs CAGE8 4 3 batched ELL std. ELL batched CSR std. CSR batched COO batch size batch size 8 1

23 Flexible batched SpMV 12 1 batched ELL batched CSR batched COO 8 GFLOPs

24 Flexible batched SpMV Second experiment: Use different batched SpMV kernels (COO, CSR, ELL ) A batch consisting of different matrices (in-homogeneous batch) 1. somewhat similar (similar size, nonzero count) 2. completely different 24

25 Flexible batched SpMV Batch of random similar-sized problems n 2 [9, 1] nz 2 [3, 4] GFLOPs Performance batched ELL batched CSR batched COO std. CSR std. ELL batch size GB/s Bandwidth batched ELL batched CSR batched COO batch size 25

26 Flexible batched SpMV Batch of random similar-sized problems n 2 [9, 1] nz 2 [3, 4] Batch of random any-sized problems. n 2 [1, 1] nz 2 [1, 4] GFLOPs GFLOPs Performance batched ELL batched CSR batched COO std. CSR std. ELL batch size batched ELL batched CSR batched COO std. CSR std. ELL batch size GB/s GB/s Bandwidth batched ELL batched CSR batched COO batch size batched ELL batched CSR batched COO batch size 26

27 ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible batched SpMV on GPUs Large number of small SpMV simultaneously Matrices can be different in size, nnz, pattern COO format most suitable for inhomogeneous batches This work is in Collaboration with:

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran