Dynamic Sparse Matrix Allocation on GPUs. James King

1 Dynamic Sparse Matrix Allocation on GPUs James King

2 Graph Applications: Dynamic updates to graphs. Adding edges adds entries to the sparse matrix representation.

3 Motivation: Graph operations that add edges (e.g. transitive closure of a graph). Iterative updates: Ax = b, A = A + B, A = A (repeat).

4 Sparse Matrices: Store only the nonzero values. Sparsity of an MxN matrix: $\mathrm{nnz} / (M N)$. Formats include COO, ELL, CSR, HYB, DIA, etc.

5 Coordinate Format (COO): Stores tuples of row index, column index, and value using 3 arrays: row indices, column indices, and values.
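As a minimal sketch of the three-array layout, the following builds COO arrays from a small dense matrix. The function name `dense_to_coo` is hypothetical and the code runs on the CPU purely for illustration.

```python
def dense_to_coo(dense):
    """Return (row_indices, col_indices, values) for the nonzeros of a dense matrix."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                # One tuple (i, j, v) per nonzero, split across three parallel arrays.
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


rows, cols, vals = dense_to_coo([[1, 0, 2],
                                 [0, 0, 3],
                                 [4, 0, 0]])
```

Appending a new entry to this layout only requires pushing one element onto each array, which is why COO is easy to update but ends up unsorted.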

6 Compressed Sparse Row (CSR): Row indices are compressed into a row offsets array. Column indices and values are stored uncompressed. Arrays: row offsets, column indices, and values.
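The row compression can be sketched as a count per row followed by an exclusive scan. This is an illustrative CPU version with a hypothetical name `coo_to_csr`, and it accepts unsorted COO input.

```python
def coo_to_csr(num_rows, rows, cols, vals):
    """Convert (possibly unsorted) COO arrays into CSR arrays."""
    # Count the entries in each row.
    counts = [0] * num_rows
    for r in rows:
        counts[r] += 1

    # Exclusive scan of the counts gives the row offsets (length num_rows + 1).
    row_offsets = [0] * (num_rows + 1)
    for i in range(num_rows):
        row_offsets[i + 1] = row_offsets[i] + counts[i]

    # Place each entry into the next free slot of its row.
    next_slot = row_offsets[:-1]
    col_idx = [0] * len(vals)
    out_vals = [0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        p = next_slot[r]
        col_idx[p] = c
        out_vals[p] = v
        next_slot[r] += 1
    return row_offsets, col_idx, out_vals
```

Row i then occupies positions `row_offsets[i]` up to, but not including, `row_offsets[i + 1]`. Inserting a single entry shifts every later position, which is the rebuild cost that motivates a dynamic format.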

7 Ellpack (ELL) / Hybrid-Ellpack (HYB): ELL stores all rows with a fixed column width W. HYB combines ELL with a COO matrix for overflow rows. [Figure: ELL column indices and values, with COO row indices, column indices, and values for the overflow entries]
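A small sketch of the ELL plus COO split, assuming CSR input and a padding marker of -1 for unused ELL slots. Both the marker and the name `csr_to_hyb` are assumptions made for illustration, not details from the slides.

```python
PAD = -1  # assumed marker for an unused ELL slot


def csr_to_hyb(row_offsets, col_idx, vals, width):
    """Split a CSR matrix into a fixed-width ELL part and a COO overflow part."""
    num_rows = len(row_offsets) - 1
    ell_cols = [[PAD] * width for _ in range(num_rows)]
    ell_vals = [[0] * width for _ in range(num_rows)]
    coo_rows, coo_cols, coo_vals = [], [], []
    for i in range(num_rows):
        for k, p in enumerate(range(row_offsets[i], row_offsets[i + 1])):
            if k < width:
                # The first `width` entries of each row go into the ELL part.
                ell_cols[i][k] = col_idx[p]
                ell_vals[i][k] = vals[p]
            else:
                # Remaining entries overflow into the COO part.
                coo_rows.append(i)
                coo_cols.append(col_idx[p])
                coo_vals.append(vals[p])
    return ell_cols, ell_vals, (coo_rows, coo_cols, coo_vals)
```

With `width=1`, a row holding two entries keeps the first in ELL and sends the second to the COO overflow arrays.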

8 Current Formats: Current formats are inefficient for dynamic updates. Compressed sparse formats like CSR must be rebuilt for each update. COO-like formats must be sorted for efficient SpMV.

9 Dynamic Compressed Sparse Row (DCSR): Stores up to K segments per row, with the starting and ending index of each segment. Segments can be dynamically allocated and need not be in order. Arrays: row offsets (segments), column indices, values, and row sizes.

10 Memory Footprint for an MxN sparse matrix, by format: COO: 3(nnz); ELL: 2MW; HYB: 2MW + 3(ovf); CSR: M + 2(nnz); DCSR: 2MK + 2(nnz).

11 COO Memory Fill: Elements are appended to the end of the matrix.

12 DCSR Memory Fill: Elements are added in new segments. Gaps between segments are allowed.

13 Dynamic Allocation: A memory offset pointer keeps track of the currently allocated space. When a new segment is allocated, the offset pointer is atomically adjusted. Defragmentation compacts the segments of each row into 1 segment.
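The allocation scheme can be sketched on the CPU as a bump allocator over preallocated arrays. The class `DCSR`, its field names, and the policy of giving every insertion a fresh segment are simplifying assumptions for illustration. On a GPU the offset update would be an atomic add; here it is a plain increment.

```python
class DCSR:
    """Toy dynamic CSR: each row owns up to max_segments (start, end) ranges."""

    def __init__(self, num_rows, max_segments, capacity):
        self.max_segments = max_segments
        self.segments = [[] for _ in range(num_rows)]  # half-open ranges per row
        self.row_sizes = [0] * num_rows
        self.col_idx = [None] * capacity
        self.vals = [None] * capacity
        self.mem_offset = 0  # first unallocated position in col_idx / vals

    def _allocate(self, count):
        # Bump the offset pointer and return the start of the reserved range.
        start = self.mem_offset
        if start + count > len(self.col_idx):
            raise MemoryError("backing arrays are full")
        self.mem_offset = start + count
        return start

    def insert(self, row, cols, vals):
        if len(self.segments[row]) >= self.max_segments:
            raise RuntimeError("row has no free segment; defragment first")
        start = self._allocate(len(cols))
        end = start + len(cols)
        self.col_idx[start:end] = cols
        self.vals[start:end] = vals
        self.segments[row].append((start, end))
        self.row_sizes[row] += len(cols)

    def row_entries(self, row):
        return [(self.col_idx[p], self.vals[p])
                for s, e in self.segments[row] for p in range(s, e)]


m = DCSR(num_rows=2, max_segments=2, capacity=8)
m.insert(0, [0], [1.0])
m.insert(1, [2], [3.0])
m.insert(0, [2], [2.0])  # row 0 now spans two non-adjacent segments
```

After these three insertions row 0 owns two segments that are not contiguous in memory, which is the situation defragmentation later repairs.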

14 Dynamic Insertions [Figure: inserting entries (row indices, column indices, values) into a DCSR matrix, showing the segments, row offsets, column indices, and values before and after insertion]

15 Defragmentation: Gaps can form between segments due to insertion and deletion. Defragmentation performs a prefix sum operation on the row sizes. Row segments are scattered to their appropriate location in newly allocated arrays.

16 Defragmentation Algorithm: 1. Exclusive scan on row sizes to get offsets. 2. Allocate new memory space. 3. Shuffle column indices and values to offsets within new memory space. 4. Adjust entries in row offsets array.
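The steps of the defragmentation algorithm can be sketched sequentially as follows. The representation (a list of half-open `(start, end)` segments per row) and the name `defragment` are assumptions for illustration. A GPU version would run the scan and the scatter in parallel, and this sketch does not sort column indices within a row.

```python
from itertools import accumulate


def defragment(segments, col_idx, vals):
    """Compact each row's segments into one contiguous segment."""
    row_sizes = [sum(e - s for s, e in segs) for segs in segments]

    # 1. Exclusive scan on row sizes to get the new row offsets.
    offsets = [0] + list(accumulate(row_sizes))

    # 2. Allocate new memory space.
    new_cols = [None] * offsets[-1]
    new_vals = [None] * offsets[-1]

    # 3. Shuffle column indices and values to their offsets in the new arrays.
    for i, segs in enumerate(segments):
        dst = offsets[i]
        for s, e in segs:
            n = e - s
            new_cols[dst:dst + n] = col_idx[s:e]
            new_vals[dst:dst + n] = vals[s:e]
            dst += n

    # 4. Adjust the row offsets: every row now has exactly one segment.
    new_segments = [[(offsets[i], offsets[i + 1])] for i in range(len(segments))]
    return new_segments, new_cols, new_vals, offsets
```

The returned `offsets` list has the same form as a CSR row offsets array, so the compacted arrays can be read directly as CSR.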

17 Defragmentation [Figure: defragmentation example showing segments, row offsets, column indices, and values, with the row sizes and the new offsets computed from them]

18 Inserting Entries and Defragmentation [Figure: entries (row indices, column indices, values) inserted into a DCSR matrix, followed by defragmentation to new offsets; shows segments, row offsets, column indices, and values at each stage]

19 SpMV: Compatible with standard CSR SpMV algorithms (CSR-scalar, CSR-vector, etc.). A loop is added to SpMV to iterate over segments. [Figure: SpMV throughput in GFLOPS for CSR, DCSR, defragmented DCSR, and HYB on matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]
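A sequential sketch of a CSR-scalar style SpMV with the extra loop over segments. The segment representation (half-open `(start, end)` pairs per row) and the name `dcsr_spmv` are assumptions for illustration. On a GPU the outer loop over rows would be mapped to threads.

```python
def dcsr_spmv(segments, col_idx, vals, x):
    """Compute y = A x where row i of A is stored in the ranges segments[i]."""
    y = [0.0] * len(segments)
    for i, segs in enumerate(segments):      # one thread per row in CSR-scalar
        acc = 0.0
        for start, end in segs:              # extra loop over the row's segments
            for p in range(start, end):      # usual CSR inner loop
                acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y
```

With a single segment per row this reduces to the ordinary CSR loop, which is why a defragmented matrix behaves like plain CSR.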

20 SpMV Optimizations: DCSR is compatible with CSR optimizations. Bin rows by row size for optimized performance (Ashari et al. 14). Sort tuples of bin ID, row index, and row size. Prefix-sum over permuted row sizes to get offsets. Defragmentation will group by row sizes.
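The binning steps can be sketched as below. The power-of-two bin boundaries are an assumption chosen for illustration (the slide does not state the binning rule), and `bin_rows` is a hypothetical name.

```python
from itertools import accumulate


def bin_rows(row_sizes):
    """Return (row permutation, offsets) with rows grouped by size bin."""

    def bin_id(size):
        # Assumed rule: sizes 0-1 -> bin 0, 2 -> bin 1, 3-4 -> bin 2, 5-8 -> bin 3, ...
        return max(size - 1, 0).bit_length()

    # Sort tuples of (bin ID, row index, row size).
    tuples = sorted((bin_id(size), row, size) for row, size in enumerate(row_sizes))
    permutation = [row for _, row, _ in tuples]
    permuted_sizes = [size for _, _, size in tuples]

    # Prefix sum over the permuted row sizes gives each row's new offset.
    offsets = [0] + list(accumulate(permuted_sizes))
    return permutation, offsets


permutation, offsets = bin_rows([5, 1, 2, 8, 1])
```

Feeding these offsets to the defragmentation scatter would place rows of similar length next to each other in memory.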

21 Optimized SpMV: Improved memory access due to row size groupings and row sorting. [Figure: SpMV throughput in GFLOPS for CSR, ADCSR, defragmented ADCSR, and HYB on matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]

22 Results: Iterative updates with SpMV operations. [Figure: relative speedup for DCSR, HYB, and CSR on matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]

23 Conversion Between Formats [Figure: relative conversion times for COO -> CSR, COO -> DCSR, COO -> HYB, CSR -> DCSR, CSR -> HYB, and DCSR -> CSR]

24 Sorting vs. Defragmentation [Figure: relative time comparison of DCSR and HYB on matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]

25 Sparse Matrix-Matrix Multiplication (SpMM): The GEMM algorithm is inefficient in the sparse case. Many entries are zero and need not be operated on.

26 Related Work: Bell et al. 13, Exploiting Fine-Grained Parallelism in Algebraic Multigrid Methods. 1. Compute partial products. 2. Sort partial products. 3. Reduce partial products.

27 SpMM Work Efficient: Given A (MxK), B (KxN), and C (MxN) with AB = C, the set of partial products is $a_{ij} b_{jk}$, $k = 1 \dots \mathrm{nnz}(B_j)$, for all nonzeros $a_{ij} \in A$.
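A compact CPU sketch of the expand, sort, and reduce approach over this set of partial products. The input layout (A as a list of `(i, j, value)` tuples, B as a dictionary from row index to `(column, value)` pairs) and the name `spmm_esc` are choices made for illustration only.

```python
def spmm_esc(a_coo, b_rows):
    """Multiply sparse A and B; returns C as sorted (row, col, value) tuples."""
    # 1. Compute partial products a_ij * b_jk for every nonzero a_ij of A.
    partial = [(i, k, a * b)
               for i, j, a in a_coo
               for k, b in b_rows.get(j, [])]

    # 2. Sort partial products by row and column.
    partial.sort(key=lambda t: (t[0], t[1]))

    # 3. Reduce runs of partial products that share the same (row, column).
    result = []
    for i, k, v in partial:
        if result and result[-1][0] == i and result[-1][1] == k:
            result[-1] = (i, k, result[-1][2] + v)
        else:
            result.append((i, k, v))
    return result


# A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
c = spmm_esc([(0, 0, 1), (0, 1, 2), (1, 1, 3)],
             {0: [(0, 4)], 1: [(0, 5), (1, 6)]})
```

Only nonzero pairs are ever multiplied, so the work is proportional to the number of partial products rather than to M·K·N.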

28 SpMM Example [Figure: worked example of A x B = C showing the set of partial products (row indices, column indices, values), the same set sorted by row and column, and the result reduced by row and column]

29 Format Conversions: CSR -> COO. Compute C matrix. Sort and reduction performed in COO format. COO -> CSR.

30 Improved SpMM: Compute rows asynchronously. Dynamically update C matrix in DCSR format. Defragment C matrix to get result in sorted CSR format. Avoids conversion to and from COO format.

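31 Adaptive Binning: Partial product row size of C is computed by a first pass over A and B: $rs_i = \sum_{j \,:\, a_{ij} \neq 0} \mathrm{nnz}(b_j)$, a sum over the $\mathrm{nnz}(a_i)$ nonzeros $a_{ij}$ of row $a_i$, where $b_j$ is row j of B.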
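The first pass can be sketched with CSR inputs: for each row of A, add up the lengths of the rows of B selected by its column indices. The function and parameter names are hypothetical.

```python
def partial_product_row_sizes(a_offsets, a_cols, b_offsets):
    """For each row i of A, count the partial products that row i of C receives."""
    sizes = []
    for i in range(len(a_offsets) - 1):
        total = 0
        for j in a_cols[a_offsets[i]:a_offsets[i + 1]]:
            # Each nonzero a_ij contributes nnz(b_j) partial products.
            total += b_offsets[j + 1] - b_offsets[j]
        sizes.append(total)
    return sizes


# A has rows {0, 1} and {1}; B has 1 nonzero in row 0 and 2 in row 1.
sizes = partial_product_row_sizes([0, 2, 3], [0, 1, 1], [0, 1, 3])
```

These sizes are upper bounds on the final row lengths of C, since partial products that share a column are merged during reduction.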

32 Adaptive Binning: Rows are binned by partial product row size.

33 Adaptive Binning: Rows are grouped by size of partial product set, up to the shared memory limit.

34 Asynchronous Computations: Kernels asynchronously compute rows by bin. Bandwidth is reduced since the row is implicit. Shared memory is faster than global memory. Row size is not known until after reduction.

35 Row Updates [Figure: kernels 1-3 each writing their rows into the column indices and values arrays of the C matrix]

36 Defragmentation: Defragmentation is faster than sorting the equivalent COO matrix. Bandwidth is reduced by a third since rows are compressed. Segments are shuffled directly to their offset location without a sort.

37 SpMM Results [Figure: relative speedup of DCSR SpMM versus CSR SpMM across the test matrices]

38 Summary: DCSR allows for efficient dynamic updates to sparse matrices. It is suitable for graph applications. The defragmentation method is faster than sorting the equivalent COO matrix. Code will soon be available through the Scientific Computing and Imaging (SCI) Institute GPUTUM library.

39 Questions

Dynamic Sparse-Matrix Allocation on GPUs

Dynamic Sparse-Matrix Allocation on GPUs James King, Thomas Gilray, Robert M. Kirby, and Matthew Might University of Utah {jsking2,tgilray,kirby,might}@cs.utah.edu Abstract. Sparse matrices are a core