GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2015; 27: Published online 7 October 2014 in Wiley Online Library (wileyonlinelibrary.com)

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Yuan Tao 1,2,3, Yangdong Deng 4, Shuai Mu 4, Zhenzhong Zhang 1,2, Mingfa Zhu 1,2,*, Limin Xiao 1,2 and Li Ruan 1,2

1 State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
2 School of Computer Science and Engineering, Beihang University, Beijing, China
3 College of Mathematics, Jilin Normal University, Jilin, China
4 School of Software, Tsinghua University, Beijing, China

SUMMARY

Many high performance computing applications require computing both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi-core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (ecsb), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enabling a high computing throughput. We also use a hybrid storage format, selected dynamically at runtime, to store the elements in each block. Experimental results show that the proposed techniques implemented on a Kepler GPU deliver similar throughput on both SMVP and SMTVP, and the throughput is up to 13 times higher than that of the CPU-based CSB implementation.
In addition, our ecsb procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively. We validate the effectiveness of ecsb by means of the wall-clock time of the bi-conjugate gradient algorithm: our ecsb is 25% faster than compressed sparse rows (CSR) and 6% faster than HYB. Copyright 2014 John Wiley & Sons, Ltd.

Received 31 October 2013; Revised 23 July 2014; Accepted 10 September 2014

KEY WORDS: sparse matrix-transpose vector product; sparse matrix-vector product; compressed sparse block; CSB; compressed sparse rows; CSR; GPU

1. INTRODUCTION

Sparse matrices are extensively used in high performance computing applications because they naturally reflect the characteristics of a wide range of scientific and engineering problems [1]. Among the various computation patterns of sparse matrices, an especially interesting one is to compute both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) in the same application with the underlying matrices stored in a single format [2, 3]. Such a pattern is exemplified by the bi-conjugate gradient algorithm (BCG) [3, 4]. As shown by the BCG pseudo-code in Figure 1, SMVP and SMTVP are computed in lines 7 and 8, respectively. Because these two steps have no data dependency, they can be assigned to two computing nodes for parallel processing. Under such a circumstance, a key requirement is a similar computation throughput for the two

*Correspondence to: Mingfa Zhu, School of Computer Science and Engineering, Beihang University, Beijing, China. zhumf@buaa.edu.cn
Y. TAO ET AL.

Figure 1. Bi-conjugate gradient algorithm [3, 4].

problems. Meanwhile, both should be performed in a highly efficient manner, while the underlying sparse matrix should be stored in a single format to save memory space. Buluç et al. [5] proposed the so-called compressed sparse block (CSB) format so that both the SMVP and SMTVP procedures can be processed with similar throughputs on multi-core CPUs. Recently, significant breakthroughs have been made in exploiting graphics processing units (GPUs) for sparse matrix computations. In particular, the seminal work of Bell and Garland [6] proved that a throughput of 16 GFLOPS in single precision could be achieved on a single GPU node for SMVP. On the other hand, the problem of computing SMVP and SMTVP in a single format has not been extensively studied. At present, SMTVP is typically solved by a transposition followed by an SMVP procedure. Such a strategy was adopted by a leading-edge GPU-based sparse matrix library, CUSPARSE [7]. The significant overhead of transposing a sparse matrix, however, makes the strategy less efficient. Our experiments using CUSPARSE evidenced that the transposition cost dominates the overall computation time [8]. The other solution for SMTVP is based on the atomic addition operation on NVIDIA GPUs with compute capability 2.0 or higher. Experiments show that SMTVP by atomic additions can outperform CUSPARSE by up to eightfold [8], but there would still be a gap of up to sevenfold between SMVP and SMTVP [8]. It is thus natural to consider porting CSB to GPUs to solve the aforementioned problem. However, a straightforward porting will not deliver sufficient computing throughput on GPUs because of its inherent irregularity. Accordingly, we investigate efficient GPU solutions to the problem of computing both SMVP and SMTVP in this work. We first perform experiments to identify the performance gap between SMVP and SMTVP on GPUs.
On this basis, we redesign the CSB data structure into the expanded CSB (ecsb) to fully take advantage of GPU SIMD hardware. We also adopt the hybrid storage format proposed by Bell and Garland [6] for storing a block of ecsb and develop a heuristic to select among the different representations of ecsb. Our experimental results prove that the proposed data structure and the corresponding GPU implementation enable similar computation times for both SMVP and SMTVP at a throughput that is up to 12 times higher than that of a 4-core, 8-thread CPU-based implementation of CSB. Our techniques can also be faster than the previous GPU-based CSR and HYB [6] procedures on over half of the sparse matrix benchmarks. The main contributions of this work are as follows. We redesigned the CSB data structure for GPU-based computing to achieve high performance SMVP and SMTVP at the same time. A dynamic selection algorithm is proposed to determine the proper data storage format for ecsb according to the distribution of non-zeros.
SPARSE MATRIX MULTIPLICATION ON GPUs

We validate the effectiveness of the proposed ecsb data structure with extensive experiments. The data structure was compared with existing data structures in terms of the ratio between SMVP and SMTVP computation throughputs, the total throughput, the total kernel execution time of SMVP and SMTVP, the wall-clock time of the BCG algorithm, and the utilization. The remainder of the paper is organized as follows. The preliminaries of this work are introduced in Section 2. Section 3 analyzes the challenges of porting CSB to GPUs. Sections 4 and 5 explain the so-called ecsb data structure and its efficient GPU implementation to compute both SMVP and SMTVP. Section 6 reports and discusses the experimental results. We review related works in Section 7. In the last section, we conclude the paper and outline future work.

2. PRELIMINARIES

In this section, we introduce the preliminaries related to this work. First, we explain the basic idea of CSB. Then, the fundamental concepts of GPU computing are discussed.

2.1. Compressed sparse block

Sparse matrices allow using a compressed format that stores only the non-zero elements to save memory space. Such formats usually remove zero elements along either the row or the column direction. However, such formats cannot guarantee efficient access to non-zeros along both row and column directions, which leads to a performance gap between SMVP and SMTVP. Buluç et al. [5] introduced CSB to support efficiently computing both SMVP and SMTVP on CPUs. In CSB, a matrix A is partitioned into square blocks of identical size. An example matrix and its CSB encoding are illustrated in Figure 2. The definition of the CSB data structure is listed in Figure 3. It consists of three arrays: blk_ptr, comb_lowbits, and val. Array blk_ptr is a two-dimensional array recording the offset of the first non-zero of each block.
The array comb_lowbits stores the concatenated row and column indexes of the non-zeros in a block (in Figure 2, row and column indexes are shown separately). Array val stores the non-zeros. In CSB, a row (column) of blocks is designated as a blockrow (blockcolumn). The CSB computation is designed for coarse grain parallel execution on multi-core CPUs. The pseudo-codes

Figure 2. Compressed sparse block format with bit-interleaved intra-block ordering for sparse matrices [5].

Figure 3. Structure of CPU-based compressed sparse block.
Figure 4. Various forms of parallelism used in parallel compressed sparse block computing.

for computing SMVP and SMTVP are listed in Algorithms 1 and 2. The algorithms use a rather complicated procedure to recursively partition every block into groups of non-zeros. The procedure consists of two steps. The first step packs blocks into a linear array according to the density of each block. If the density of a block is greater than a threshold, it is packed into the linear array as a single element. Otherwise, multiple sparse blocks are combined into one element. The second step performs SMVP or SMTVP in parallel on the elements in the linear array. Given an element corresponding to a dense block, the block may need to be divided into multiple sub-blocks until the density of the sub-blocks is smaller than a threshold. Then, the sub-blocks can be handled concurrently. The basic element of parallel execution is a group, containing either a dense sub-block or multiple sparse blocks. Such an element is manipulated by a CPU thread. The aforementioned process is illustrated in Figure 4. Note that CSB offers good locality for both SMVP and SMTVP.
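To make the three CSB arrays concrete, the following host-side Python sketch builds them from a small dense matrix. This is our own illustration, not the authors' code: the array names follow the paper, but for simplicity the intra-block key concatenates the row and column low bits directly rather than using the paper's bit-interleaved ordering.

```python
# Simplified sketch of CSB construction (illustrative, not the authors' code).
# blk_ptr[i][j] records where block (i, j) starts in val;
# comb_lowbits packs the intra-block (row, col) of each non-zero;
# val holds the non-zero values, grouped block by block.
def dense_to_csb(A, beta):
    n_rows, n_cols = len(A), len(A[0])
    nb_r = (n_rows + beta - 1) // beta   # number of blockrows
    nb_c = (n_cols + beta - 1) // beta   # number of blockcolumns
    blk_ptr = [[0] * (nb_c + 1) for _ in range(nb_r)]
    comb_lowbits, val = [], []
    bits = max(1, (beta - 1).bit_length())  # bits per intra-block index
    for bi in range(nb_r):
        for bj in range(nb_c):
            blk_ptr[bi][bj] = len(val)
            for r in range(bi * beta, min((bi + 1) * beta, n_rows)):
                for c in range(bj * beta, min((bj + 1) * beta, n_cols)):
                    if A[r][c] != 0:
                        comb_lowbits.append(((r % beta) << bits) | (c % beta))
                        val.append(A[r][c])
        blk_ptr[bi][nb_c] = len(val)  # sentinel closing the blockrow
    return blk_ptr, comb_lowbits, val
```

With blk_ptr in this form, blk_ptr[i][j+1] - blk_ptr[i][j] gives the non-zero count of block (i, j), the property the ecsb design in Section 4 also relies on.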
Figure 5. A generic architecture of graphics processing unit [13].

2.2. General purpose computing with graphics processing units

Recently, GPUs have emerged as a new general purpose computing platform [9, 10]. GPUs were originally designed as application specific integrated circuits for graphics rendering. Pushed by the relentless pursuit of better visual experiences, GPUs evolved to offer both high programmability and superior computing throughput. Now the two major GPU vendors, NVIDIA and AMD, both offer GPU hardware and software designs with support for general purpose computing. We use NVIDIA GPUs and the corresponding programming model, CUDA [11], in this work. Modern GPUs adopt a single program multiple data [12] programming model. A GPU consists of tens of multiprocessors, each equipped with multiple lanes of computing logic. A multiprocessor fetches each instruction and dispatches it to all lanes, which then operate on different data. To hide latency, the single program multiple data model is enhanced with a multithreading mechanism. A GPU program follows a three-tier organization. A thread is the basic unit of parallel execution. Multiple threads constitute a thread block, which is assigned to a multiprocessor for execution. A number of thread blocks form a grid and are scheduled by a scheduler on the GPU. Threads in a block can synchronize with each other and exchange data through a fast on-chip memory. A warp of threads is executed on a multiprocessor as a single unit, in which each thread is mapped to a different lane. A warp is suspended at a memory access instruction, and another warp with all data ready then resumes execution. The various multiprocessors execute the program independently. GPU memory is more complicated than its CPU counterpart. The instruction operands are stored in a register file located on a multiprocessor, which also has a shared memory.
The shared memory works as a software controlled cache. Each multiprocessor also has its own L1 instruction and data caches. The whole GPU has a unified L2 cache. GPU main memory, that is, global memory, is installed off-chip and connected to the GPU through a wide dedicated bus. The global memory has a long latency but offers a high bandwidth. To hide latency, modern GPUs adopt a memory coalescing mechanism. When a warp of threads executes a memory instruction, the corresponding memory requests can be merged into a smaller number of references if they fall into a single segment of memory space aligned at a multiple of 64 bytes. The computing power of GPUs depends on the capability of concurrently running a massive number of threads. It is thus essential to guarantee the correctness of concurrent accesses to a single address by multiple threads. Recent GPUs support atomic functions that enable a complete memory write without being
interrupted by other requests to the same address. In NVIDIA GPUs, the atomic function for global memory is realized in the L2 cache. All atomic operations to the same address of global memory are buffered in a first-in first-out queue and processed by an arithmetic logic unit (ALU) in a serialized manner, as shown in Figure 5. Graphics processing units exhibit different characteristics from multi-core CPUs. First of all, the computing cores on a GPU are much simpler than those deployed on multi-core CPUs. These cores are organized into single instruction multiple data (SIMD) lanes to deliver high computing throughput. The simpler cores have less hardware support for memory pre-fetching and branch instructions. Second, GPUs exploit a massive number of threads to hide memory latency, while the caches deployed on GPUs have much less capacity than their equivalents on multi-core CPUs. On the other hand, CPUs heavily depend on caches to hide memory latency but only support a relatively small number of threads. Third, GPU threads are directly supported by hardware. In other words, every active thread (i.e., a thread that is already assigned to a GPU core for execution) has its own registers to store program state, so no context switching is necessary when a thread is suspended or activated. This is drastically different from CPU threads.

3. CHALLENGES OF COMPUTING BOTH SPARSE MATRIX-VECTOR PRODUCT AND SPARSE MATRIX-TRANSPOSE VECTOR PRODUCT WITH GRAPHICS PROCESSING UNITS

Graphics processing units have been extensively used in many scientific and engineering applications. However, GPUs were originally designed to support highly regular workloads. The irregular data structures exemplified by sparse matrices pose considerable challenges to GPUs and require programmers to design algorithms that can effectively extract data level parallelism.
Many research works have been proposed to improve the efficiency of GPU computing on sparse matrices (e.g., [6, 7, 14-17]). Among these works, CUSPARSE systematically addresses the problem of SMVP and provides routines for popular sparse matrix formats. Currently, SMVP can be computed with a throughput of 16 GFLOPS in single precision on a single GPU [6]. To solve the problem of computing SMVP and SMTVP on GPUs with a single format, there are two basic approaches. The first is to choose an existing sparse matrix format and develop a corresponding procedure, while the second is to port the CSB data structure to GPUs. In the remainder of this section, we investigate the feasibility of these two approaches.

3.1. Computing sparse matrix-vector product and sparse matrix-transpose vector product with existing sparse matrix formats

There exist many sparse matrix formats, among which COO, CSR, and ELL [3] are representative. COO stores the row and column indexes of each non-zero, while CSR compresses the row index information by only storing the information needed to locate the first non-zero in each row. ELL, or Ellpack, uses a fixed row width to store non-zeros. If a row has fewer non-zeros, it has to be padded with zeros. ELL needs two arrays to store the column indexes and the non-zeros. Given structured matrices, ELL is more appropriate for GPU computing because the regular row width enables a predictable computing pattern. However, too many zeros may need to be padded when processing unstructured matrices. The HYB format was introduced by Bell and Garland [6] for efficient computation of SMVP on GPUs by combining ELL and COO. In HYB, most non-zeros of each row are stored in the ELL format, while the remaining non-zeros beyond the ELL row width are recorded in COO. Figure 6 illustrates the three storage formats for an example matrix.
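As a stand-in for the example in Figure 6, the following Python sketch (our own illustration, not taken from the paper) builds the three layouts for a small matrix; the function names are ours.

```python
# Illustrative construction of COO, CSR, and zero-padded ELL layouts.
def to_coo(A):
    nz = [(r, c, v) for r, row in enumerate(A) for c, v in enumerate(row) if v != 0]
    return ([r for r, _, _ in nz], [c for _, c, _ in nz], [v for _, _, v in nz])

def to_csr(A):
    row_ptr, col_ind, val = [0], [], []
    for row in A:
        for c, v in enumerate(row):
            if v != 0:
                col_ind.append(c)
                val.append(v)
        row_ptr.append(len(val))   # offset of the first non-zero of the next row
    return row_ptr, col_ind, val

def to_ell(A):
    width = max(sum(1 for v in row if v != 0) for row in A)  # fixed row width
    col_ind = [[0] * width for _ in A]   # unused slots padded with zeros
    val = [[0] * width for _ in A]
    for r, row in enumerate(A):
        k = 0
        for c, v in enumerate(row):
            if v != 0:
                col_ind[r][k], val[r][k] = c, v
                k += 1
    return col_ind, val
```

The padding in to_ell shows why ELL wastes space on matrices whose row lengths vary widely, which is exactly the case HYB was designed to handle.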
From Figure 6, it can be seen that existing formats cannot enable efficient access to non-zeros along both row and column directions, while efficient computing of both SMVP and SMTVP with a single format does require quick access in both directions. In addition, the available formats cannot guarantee efficient use of the cache in both SMVP and SMTVP.
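The asymmetry can be seen in a serial Python sketch (ours, not the paper's) of computing y = A*x and y = A^T*x from the same CSR arrays: the former gathers each row's products into a single output entry, while the latter scatters updates across y, which on a GPU would force atomic additions.

```python
# y = A x and y = A^T x from one CSR encoding (serial reference sketch).
def csr_smvp(row_ptr, col_ind, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(row_ptr) - 1):
        # Gather: every partial product of row r reduces into the single y[r].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += val[k] * x[col_ind[k]]
    return y

def csr_smtvp(row_ptr, col_ind, val, x, n_cols):
    y = [0.0] * n_cols
    for r in range(len(row_ptr) - 1):
        # Scatter: updates land at arbitrary columns; parallel threads
        # would race on y here, hence the need for atomic additions on a GPU.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[col_ind[k]] += val[k] * x[r]
    return y
```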
3.2. Porting compressed sparse block to graphics processing units

The CSB data structure proves to be efficient on multi-core CPUs because it better utilizes the locality of the non-zeros of the input matrix and vectors. A straightforward porting of CSB to GPUs, however, turns out to be inefficient for several reasons. First, the CSB algorithm performs a relatively complex recursive partitioning process to decompose the original matrix into small chunks. Here, one chunk can be a row of blocks, a few blocks along a row, or a partial block, according to the number of non-zeros inside. As illustrated in Figure 4 and elaborated in Section 2.1, the computing procedure actually uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks. Such a combination of various types of dynamic parallelism suggests that this scheme is more appropriate for coarse grain multithreaded execution on multi-core CPUs, whereas GPUs are built on the foundation of SIMD execution, which requires all parallel threads to run exactly the same program. Second, CSB operations are performed on blocks. Such blocks have a varying number of non-zeros and have to be processed as single units. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resources are substantially wasted. In addition, the memory bandwidth is under-utilized because a warp accesses memory as a complete unit. Third, Buluç et al. [5] depend on Cilk [18] to distribute computing tasks to cores at runtime for dynamic load balancing. However, such dynamic schedulers are not available on GPUs yet. Instead, we have to carefully handcraft a thread scheduling mechanism to achieve good load balance.

4. THE EXPANDED COMPRESSED SPARSE BLOCK DATA STRUCTURE

Considering the problems discussed in Section 3.2, a natural question arises: what data structures are appropriate for GPU-based CSB computations?
On the one hand, the block structure is necessary to maintain the two-dimensional parallelism that is essential to compute both SMVP and SMTVP in an efficient manner. In other words, it is desirable to have blk_ptr[i][j+1] - blk_ptr[i][j] give the number of non-zeros of block (i, j), where i is the blockrow offset and j is the blockcolumn offset. On the other hand, we want to break the block structure so that the number of wasted threads (i.e., threads assigned zero elements) can be minimized. To meet these seemingly contradictory requirements, we redesigned the CSB data structure as follows. The values of the non-zeros are kept in an array val. The original two-dimensional array, blk_ptr, is converted into a one-dimensional array in the GPU memory, recording the position of the first non-zero of each blockrow in array val. A new array, block_ind, stores the block index for each non-zero element.

Figure 6. Comparison of common sparse matrix formats.

Figure 7. Expanded compressed sparse block data structure.

Such an arrangement simplifies the processing, because when handling a non-zero, a thread can easily determine which block it belongs to. In fact, an element of the array block_ind is the concatenation of the row and column indexes of the block. The array comb_lowbits is the same as its equivalent in CSB. The proposed ecsb data structure is listed in Figure 7. A key idea of ecsb is that we preserve the CSB-based ordering of non-zeros but handle a blockrow as the basic unit of parallel processing. One critical issue is that the distribution of non-zeros may vary sizably. Some sparse matrices have a relatively uniform distribution of non-zeros among rows, while others may have a few denser rows and a larger number of sparse rows. Accordingly, we extend the hybrid format, HYB, proposed by Bell and Garland [6] to adapt to the non-zero distribution. Given a sparse matrix with a relatively regular non-zero distribution, we use the ELL format because it helps improve load balance. In our implementation, we do not perform padding of zeros when the number of non-zeros is less than the row width. We use the COO format for highly irregular sparse matrices. A hybrid format is used on all remaining matrices; the basic idea is to store non-zeros within a certain row width in the ELL format and the other non-zeros in the COO format. We propose a heuristic to choose a proper format. Note that the ELL and COO formats are layered on top of CSB because the non-zeros keep the ordering defined within each block. The non-zeros in each block can be stored in different orders, among which row-major and column-major orderings are commonly used. The original CSB data structure uses a bit-interleaving ordering, or Z-Morton ordering [5]. With the row-major ordering, we need to compute the SMVP problem with atomic operations (the details will be discussed in Section 5).
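A host-side Python sketch (ours; the array names follow the paper, the code does not) shows the bit-interleaved (Z-Morton) intra-block key and the flattening of CSB's two-dimensional blk_ptr into the one-dimensional blk_ptr plus per-non-zero block_ind of ecsb.

```python
# Bit-interleaved (Z-Morton) intra-block key, plus flattened ecsb arrays.
def morton_key(r, c, bits):
    """Interleave the low `bits` bits of (r, c): r supplies the odd bit
    positions and c the even ones, giving the Z-order rank inside a block."""
    key = 0
    for b in range(bits):
        key |= ((c >> b) & 1) << (2 * b)
        key |= ((r >> b) & 1) << (2 * b + 1)
    return key

def csb_to_ecsb(blk_ptr2d, nb_cols):
    """Flatten CSB's 2-D blk_ptr: keep one offset per blockrow and emit a
    per-non-zero block_ind = (blockrow << shift) | blockcolumn."""
    shift = max(1, (nb_cols - 1).bit_length())
    blk_ptr, block_ind = [], []
    for bi, row in enumerate(blk_ptr2d):
        blk_ptr.append(row[0])
        for bj in range(nb_cols):
            block_ind += [(bi << shift) | bj] * (row[bj + 1] - row[bj])
    blk_ptr.append(blk_ptr2d[-1][nb_cols])  # sentinel closing the last blockrow
    return blk_ptr, block_ind
```

With this flattening, one offset per blockrow suffices, and each non-zero carries enough information (block_ind plus comb_lowbits) for a thread to locate it without consulting a 2-D index.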
Atomic addition is also used for SMTVP when using the column-major ordering. Both orderings introduce collisions and lower the overall throughput, so we adopt the bit-interleaving ordering for GPU-based storage.

5. GRAPHICS PROCESSING UNIT IMPLEMENTATION

Because ecsb exploits a hybrid format, we develop both ELL and COO procedures for SMVP and SMTVP. The remainder of this section explains the details. We also propose a heuristic to choose a proper format from three possible combinations: ELL, COO, and HYB.

5.1. ELL based sparse matrix-vector product and sparse matrix-transpose vector product

ELL was designed to support relatively regular sparse matrices. Non-zeros are compressed into a dense matrix with a fixed row width. In this work, one row of ELL corresponds to a blockrow of ecsb. We use one thread block to handle one row. Each thread computes the product of a non-zero and a vector element in one iteration, and the threads in a block iterate through a single row. Figure 8(a) and (b) lists the CUDA pseudo-code for computing the SMVP and SMTVP of the ELL part of a sparse matrix encoded in the ecsb data structure. In the code listed in Figure 8(a), line 4 derives the starting position of the first non-zero in the current blockrow via the offset (line 2) stored in the blk_ptr array. The offset is also used to extract the combined word of row and column indices of the current block (line 6) and the combined word of the intra-block row and column indices (line 7). In the loop starting at line 8, a thread fetches multiple non-zeros and the corresponding vector elements. Then, it performs multiplications to generate a series of intermediate results that should be summed together. Note that there is no guarantee on the order of processing by multiple threads on a massively parallel platform like a GPU. In addition, the random distribution of the numbers of non-zeros makes it infeasible to store the intermediate results
Figure 8. CUDA pseudo-code of the ELL parts of the expanded compressed sparse block format for computing y = y + A*x and y = y + A^T*x.

in the shared memory directly. As a result, we resort to atomic operations to add the intermediate results to the corresponding entry of the product vector located in shared memory. Current CUDA compilers only support atomic addition on integers and single-precision floating point numbers. For double-precision floating point numbers, one way is to cast them into integers of long long type, which are supported by an atomic compare and swap function, atomicCAS [11], to avoid data races. The corresponding CUDA pseudo-code is listed in Figure 9. In the code snippet, __longlong_as_double and __double_as_longlong at line 3 are CUDA library functions that convert between 64-bit integer values and double-precision floating point numbers. The codes of Figure 8(a) at lines 16-f and 16-d are for single and double-precision floating point numbers, respectively. Figure 8(b) lists the code of SMTVP. The difference between SMVP and SMTVP is that the SMTVP procedure has its atomic additions done in global memory because of the wider distribution of data entries. Non-zeros in a blockrow are stored in a block-by-block order, while inside a block, the order is defined by CSB. As a result, it is feasible to visit the non-zeros in a block along both the row direction and the column direction in a similar manner. Such a similarity accounts for the comparable throughputs between SMVP and SMTVP.
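Figure 8 does not survive this transcription; as a reconstruction under assumptions, the following serial Python sketch conveys the traversal, with plain additions standing in for the atomic additions. Both products walk the same ecsb arrays and differ only in which decoded index feeds x and which feeds y.

```python
# Serial stand-in for the ELL-part kernels of Figure 8 (assumed layout:
# blk_ptr holds one offset per blockrow; block_ind packs (blockrow, blockcol);
# lowbits packs the intra-block (row, col); beta is the block edge length).
def ecsb_products(blk_ptr, block_ind, lowbits, val, x, n, beta, bits, shift):
    y_smvp, y_smtvp = [0.0] * n, [0.0] * n
    for bi in range(len(blk_ptr) - 1):          # one blockrow per thread block
        for k in range(blk_ptr[bi], blk_ptr[bi + 1]):
            bj = block_ind[k] & ((1 << shift) - 1)            # blockcolumn
            r = bi * beta + (lowbits[k] >> bits)              # global row
            c = bj * beta + (lowbits[k] & ((1 << bits) - 1))  # global column
            y_smvp[r] += val[k] * x[c]   # SMVP: atomic add in shared memory
            y_smtvp[c] += val[k] * x[r]  # SMTVP: atomic add in global memory
    return y_smvp, y_smtvp
```

Because the same decoded (r, c) pair serves both products, the two computations issue the same memory traffic pattern over the matrix data, which is the source of the balanced throughputs discussed above.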
Figure 9. The compare-and-swap based atomic addition for double-precision data (adapted from [11]).

5.2. COO-based sparse matrix-vector product and sparse matrix-transpose vector product

The COO format is used to handle highly irregularly distributed non-zero elements. Here, we do not need the array blk_ptr because the block index plus the intra-block row and column indices are sufficient. Each thread first computes the product of one non-zero element and its respective entry of x and then atomically adds the product to the corresponding entry of the product vector y. Figure 10(a) and (b) shows the pseudo-code of the SMVP and SMTVP of the COO part of ecsb. The difference between Figure 10(a) and (b) lies in lines 7-9, because SMVP differs from SMTVP in the ordering of accessing vectors x and y.

5.3. Determining the storage format for expanded compressed sparse block

As explained in Section 4, a sparse matrix encoded in ecsb can be physically stored in ELL, COO, or a hybrid ELL and COO format. Given a sparse matrix, it is essential to determine the best storage format. We propose a heuristic to choose a proper format according to the characteristics of the non-zero distribution. Bell and Garland [6] proposed the HYB format to store sparse matrices as well as a procedure to derive the best row width for the ELL part. The procedure starts by removing the rows with the least number of non-zeros. If the number of remaining rows is beyond a threshold, the procedure continues by removing the rows with the second least number of non-zeros. This process stops when the threshold is reached. The least number of non-zeros among the remaining rows is the width of ELL. An important feature of ecsb is that the width of a blockrow varies significantly because a blockrow actually consists of multiple rows of the original matrix. Hence, we extend Bell and Garland's procedure to select a proper storage format.
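The COO-based procedure described above admits the same kind of serial stand-in (ours; plain additions replace the atomic adds, and the array layout is the one assumed in the earlier sketches).

```python
# Serial stand-in for the COO-part kernels of Figure 10 (assumed layout:
# block_ind packs (blockrow, blockcol); lowbits packs intra-block (row, col)).
def coo_products(block_ind, lowbits, val, x, n, beta, bits, shift):
    y_smvp, y_smtvp = [0.0] * n, [0.0] * n
    for k in range(len(val)):                    # one non-zero per thread
        bi = block_ind[k] >> shift               # blockrow
        bj = block_ind[k] & ((1 << shift) - 1)   # blockcolumn
        r = bi * beta + (lowbits[k] >> bits)
        c = bj * beta + (lowbits[k] & ((1 << bits) - 1))
        y_smvp[r] += val[k] * x[c]   # would be an atomic add to y on a GPU
        y_smtvp[c] += val[k] * x[r]  # the indices of x and y swap for SMTVP
    return y_smvp, y_smtvp
```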
We first run Bell and Garland's procedure to derive two parameters: the width of the ELL format and the number of remaining rows. We then use the following heuristic to select a format:
1) If the width of ELL is the maximum width among all rows, we use the ELL format to store the matrix.
2) Otherwise,
2.1) If the number of remaining rows is less than 20% of the total number of rows, and the width of ELL of ecsb is more than 90% of the maximum width among all rows, we use COO.
2.2) Otherwise, we use the HYB form of ecsb.
The rationale is that the maximum number of non-zeros among all rows is the main factor affecting the GPU throughput of SMVP and SMTVP. If the width of ELL is the maximum width among all
Figure 10. CUDA pseudo-code of the COO parts of the expanded compressed sparse block format for computing y = y + A*x and y = y + A^T*x.

rows after Bell and Garland's procedure, then many rows have the maximum number of non-zeros. Accordingly, we select the ELL format, because any further division of these rows into a HYB format would result in additional overhead. If the non-zeros are highly randomly distributed, we choose the COO format. Otherwise, we use HYB to store the ecsb-based sparse matrix.

6. RESULTS AND DISCUSSION

In this section, we evaluate the effectiveness of our techniques for both SMVP and SMTVP. Our results are compared with the multi-core CPU-based CSB implementation by Buluç et al. [5] as well as the GPU-based CSR implementation provided in the CUSPARSE package and the HYB implementation provided in the CUSP package. We chose HYB because it delivers the highest throughput on unstructured matrices [6], and CSR because it is the most commonly used format. Because CUSPARSE is not open-source, the GPU-based SMTVP procedure is implemented by modifying the code of CUSP, the research prototype of CUSPARSE. This implementation is proved to offer a higher throughput than the SMTVP of CUSPARSE [8].

6.1. Experimental setup

In this work, all experiments are conducted on the platforms listed in Table I. The benchmark matrices are summarized in Table II; they were taken from the University of Florida sparse matrix collection [19]. These matrices exhibit diverse characteristics, and most of them have been used by Buluç et al. in a related work [5].
Table I. Test-beds used in this work.

Platform             GTX Titan             C2075                 Intel Core i CPU
Architecture         Kepler                Fermi                 x86_64
Compute capability                                               N/A
Number of cores      2698 stream cores     448 stream cores      4 cores/8 threads
Peak performance     3.52 TFLOPS           1.03 TFLOPS           GFLOPS
(single precision)
Bandwidth            GB/s                  144 GB/s              25.6 GB/s
Runtime              CUDA 5.0, Thrust      CUDA 5.0, Thrust      Cilk 8503
ECC support          Disabled              Enabled               Yes
Operating system     CentOS release 6.2    CentOS release 6.2    Ubuntu LTS
Kernel               x86_64                x86_64                generic x86_64
CPU compiler         gcc                   gcc                   gcc

Table II. Benchmark matrices used in this work.

Name            Description                              Row/column     Non-zeros   Non-zeros per column (mean/max)
asic_320k       Circuit simulation                       321k/321k      2635k       8/203.8k
parabolic_fem   Parabolic FEM                            525k/525k      2100k       3/7
kkt_power       Optimal power flow,                      2063k/2063k    8130k       3/84
                nonlinear optimization
torso1          Finite difference/boundary elem.         116k/116k      8516k       73/1224
sme3dc          3D structural mechanics problem          42k/42k        3148k       73/405
Rucci           Ill-conditioned least-squares problem    1977k/109k     7791k       70/108
bone010         3D trabecular bone                       986k/986k      36326k      36/42
ldoor           INDEED test matrix                       952k/952k      23737k      24/77
cont11_l        Linear programming problem               1468k/1961k    5382k       2/7
rajat31         Rajat circuit simulation matrix          4690k/4690k    20316k      4/1252
cage15          DNA                                      5154k/5154k    99199k      19/

6.2. Experimental results

In this subsection, we present the experimental results. The performance is evaluated with five metrics: the ratio of SMVP throughput to SMTVP throughput, the total throughput of SMVP and SMTVP, the kernel execution time, the wall-clock time of the BCG algorithm, and the utilization. To evaluate the effectiveness of our techniques, we implement CSR and HYB based procedures by extending CUSP. We use the function provided in CUSPARSE to compute SMVP on CSR and the one in CUSP to compute it on HYB. The corresponding SMTVP computation is implemented with a reduction-based procedure [8] by tweaking the code of CUSP [6].
We also compare our work with the original CSB results collected on an Intel Core i7 CPU. All the results reported in this subsection are the averages of 100 runs.

6.2.1. Ratio of sparse matrix-vector product throughput to sparse matrix-transpose vector product throughput

Tables III and IV compare the ratio of computation throughput between SMVP and SMTVP in single precision and double precision, respectively. The GPU-based implementations are tested on the GTX Titan. The CPU results are collected by running eight parallel threads. To better evaluate the results, we use the root mean square deviation (RMSD) [20] to measure how well each approach delivers a balanced computation throughput of
both SMVP and SMTVP. RMSD is defined as RMSD(θ) = sqrt(E[(θ - θ̂)^2]), where θ̂ is the target ratio, set to 1.0 in this paper, and θ is the ratio collected in our experiments.

Table III. Comparison of the ratio of SMVP and sparse matrix-transpose vector product throughputs on single-precision data.
(Columns: matrix name; ecsb, CSR, and HYB on the GTX Titan GPU; CSB on the CPU. Rows: the 11 benchmark matrices; the final row gives the root mean square deviation. The numeric entries could not be recovered.)
CSB, compressed sparse block; ecsb, expanded compressed sparse block.

Table IV. Comparison of the ratio of SMVP and sparse matrix-transpose vector product throughputs on double-precision data.
(Same layout as Table III; the numeric entries could not be recovered.)

The results in Table III show that the proposed techniques enable a highly balanced computation throughput between SMVP and SMTVP. The ecsb-based results are significantly more balanced than those based on CSR and HYB; in fact, ecsb achieves an even better RMSD than the original CSB algorithm implemented on a multi-core CPU. Two matrices, asic_320k and torso1, have relatively unbalanced ratios because they contain a few denser rows that lead to more collisions. We also evaluate the throughput ratio for double-precision data, and the results are listed in Table IV. The imbalance is larger than in the single-precision case, but the ecsb results are still significantly more balanced than those of CSR and HYB. The larger difference between SMVP and SMTVP stems from the lack of direct support for atomic operations on double-precision data: current GPUs only allow the generic compare-and-swap procedure, illustrated in Figure 9, for double-precision data [11].
The performance of the resulting atomic operations is considerably slower than that of their single-precision counterparts, whose atomic additions can be performed by the raster operation (ROP) units in the level-2 cache [21].
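The compare-and-swap pattern just described can be illustrated by emulating a double-precision atomic add on top of a 64-bit CAS primitive. The sketch below is a single-threaded Python model for exposition only; on a real GPU this is the well-known atomicCAS retry loop, and every failed attempt is an extra round trip, which is why CAS-emulated double atomics lag native single-precision ones:

```python
import struct

def double_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

class Cell:
    """A 64-bit memory word with a compare-and-swap primitive."""
    def __init__(self, value=0.0):
        self.bits = double_to_bits(value)

    def cas(self, expected_bits, new_bits):
        old = self.bits
        if old == expected_bits:
            self.bits = new_bits
        return old  # CAS returns the word it found, like atomicCAS

def atomic_add_double(cell, val):
    """Retry loop: read, add, attempt CAS; repeat until no other
    writer intervened between the read and the swap."""
    old = cell.bits
    while True:
        assumed = old
        new = double_to_bits(bits_to_double(assumed) + val)
        old = cell.cas(assumed, new)
        if old == assumed:          # swap succeeded
            return bits_to_double(assumed)
```

With one writer the loop always succeeds on the first attempt; contention on the same y entry multiplies the number of retries, matching the collision behavior observed for the denser rows.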
6.2.2. Computing throughput

In this subsection, we report the computing throughputs of both SMVP and SMTVP. We collect single-precision and double-precision results on the GTX Titan; the results are shown in Tables V and VI. We also perform the same experiment on single-precision data with the C2075, and the results are shown in Table VII. The computing throughput is derived by dividing the total number of floating-point operations by the kernel time. Table V compares the single-precision performance of the proposed ecsb data structure with the HYB and CSR structures on the GTX Titan. On average, ecsb-based SMVP is comparable with or slightly better than HYB and outperforms CSR by 20%, while ecsb-based SMTVP delivers a throughput higher than HYB and CSR by 31.5% and 195.8%, respectively. Compared with the CPU, the proposed ecsb data structure attains at least a 5.0 times speedup for SMVP and SMTVP. We also evaluate the double-precision performance on the GTX Titan, with results listed in Table VI: ecsb delivers a slightly lower SMVP performance than HYB and CSR but outperforms them by 365.4% and 69.8%, respectively, in SMTVP. To study the performance scalability across GPUs, we compare the single-precision results on the C2075 in Table VII, where ecsb-based SMTVP outperforms HYB and CSR based SMTVP by 7.6% and 79.9%, respectively. In Table V, two matrices, asic_320k and torso1, have a lower SMVP performance when stored in the ecsb data structure. The reason is that the numbers of non-zeros in their most populated rows are much higher than the average, and such a distribution leads to more collisions in the atomic additions. Our GPU implementation of the ecsb data structure delivers a higher throughput than the original HYB-based implementations of the CUSP and CUSPARSE packages. The major reason is that the block structure enables better spatial locality in both the matrix and the vectors in SMVP or SMTVP computations.
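The blocked access pattern underlying this locality argument can be sketched as follows: a single block-organized layout serves both y = y + A x and y = y + A^T x, with the transposed product merely swapping the roles of the local row and column indices. This is a simplified Python model of the idea; the real implementation uses CUDA kernels with atomic additions rather than sequential accumulation:

```python
def block_spmv(blocks, x, out_len, transpose=False):
    """Multiply using a block-organized sparse layout.  `blocks` maps a
    (row_offset, col_offset) pair to a local COO list of (i, j, val)
    triples for that block.  `out_len` is the length of the result
    (number of rows of A, or of A^T when transpose=True).  Both
    products traverse memory in the same block order, which is why the
    two kernels achieve similar throughput."""
    y = [0.0] * out_len
    for (bi, bj), triples in blocks.items():
        for i, j, v in triples:
            if transpose:
                # a column of A becomes a row of A^T
                y[bj + j] += v * x[bi + i]
            else:
                y[bi + i] += v * x[bj + j]
    return y
```

Because all non-zeros of one block touch only a short contiguous range of x and y, the target entries of y are likely already cached when consecutive rows of the block are processed.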
The CSB data structure stores the non-zeros of adjacent rows and columns together. Given dense blocks in a sparse matrix, adjacent elements of the vectors x and y have a better chance of being reused after they are loaded into the cache. In fact, when computing y[i] += A[i,j] * x[j], the neighboring elements of y[i] are automatically loaded into the cache. Because we compute multiple rows inside a block, it is highly likely that the target entries of y are already cached. In the case of CSR-based computing, the locality in vector elements also exists but can hardly be fully exploited, because programmers cannot explicitly control the scheduling of warps.

6.2.3. Computation time

In this section, we report the computation time for single-precision data, including both kernel time and wall-clock time, on the GTX Titan. Kernel time is the total kernel time of both SMVP and SMTVP, and wall-clock time is the total time of BCG on the GPU, including both the CPU-GPU data transfer time and the kernel time.

Table V. Throughput of single precision on GTX Titan graphics processing unit and CPU (MFLOPS).
(Columns: SMVP and SMTVP throughput for TITAN ecsb, TITAN HYB, TITAN CSR, and CPU; rows: the 11 benchmark matrices and their average. The numeric entries could not be recovered.)
GTX Titan GPU and CPU denote the platform; ecsb, HYB, and CSR denote the matrix representation; SMVP and SMTVP denote the operation. The notation in the other tables follows these rules.
Table VI. Throughput of double precision on GTX Titan graphics processing unit and CPU (MFLOPS).
(Same layout as Table V; the numeric entries could not be recovered.)
SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

Table VII. Throughput of single precision on C2075 graphics processing unit (MFLOPS).
(Columns: SMVP and SMTVP throughput for ecsb, HYB, and CSR; rows: the 11 benchmark matrices and their average. The numeric entries could not be recovered.)
ecsb, expanded compressed sparse block; SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

Kernel time. Figure 11 compares the total kernel time of SMVP plus SMTVP. Our ecsb outperforms HYB on 8 of the 11 benchmark matrices by up to 44.8%, and CSR on 10 of the 11 by up to 82.3%. Three matrices, asic_320k, torso1, and rajat31, have longer kernel times with ecsb than with HYB. The reason is again the highly irregular distribution of non-zeros: some rows of these matrices have significantly more non-zeros, which leads to more frequent collisions in the atomic operations.

Wall-clock time on the GTX Titan GPU. The ecsb structure enables a higher computing throughput for both SMVP and SMTVP, but at the cost of a larger memory footprint. Because the memory overhead implies a longer data transfer time between CPU and GPU, it is necessary to validate whether ecsb really yields a performance advantage in typical application scenarios. We use the BCG algorithm mentioned in Section 1 as a case study and derive a GPU implementation on top of CUSP by incorporating the ecsb data structures and the respective procedures proposed in this work.
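To make the role of the two kernels concrete, the sketch below shows an unpreconditioned bi-conjugate gradient iteration, in which every step performs one product with A and one with A^T, so an imbalance between SMVP and SMTVP throughput stalls the whole solve. The solver used in the paper is the preconditioned BCG built on CUSP; this Python version is only a structural illustration:

```python
def bicg(matvec, matvec_T, b, tol=1e-8, max_iter=100):
    """Unpreconditioned BiCG.  matvec(v) computes A v and matvec_T(v)
    computes A^T v; both are called once per iteration."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                     # residual for x0 = 0
    rt = list(r)                    # shadow residual
    p, pt = list(r), list(rt)
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    rho = dot(rt, r)
    for _ in range(max_iter):
        Ap, Atpt = matvec(p), matvec_T(pt)   # SMVP and SMTVP
        alpha = rho / dot(pt, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rt = [ri - alpha * qi for ri, qi in zip(rt, Atpt)]
        if dot(r, r) ** 0.5 < tol:
            break
        rho_new = dot(rt, r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        pt = [ri + beta * pi for ri, pi in zip(rt, pt)]
        rho = rho_new
    return x
```

Because A stays fixed across iterations while both A v and A^T v are needed, a single storage format that serves both products, such as ecsb, avoids keeping two copies of the matrix or paying for an explicit transposition.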
The BCG procedure is tested on the sparse matrix parabolic_fem, which is derived from finite element analysis. The BCG computation converges in eight iterations with an accuracy tolerance of 1e-5. We also implement the HYB and CSR based procedures in the framework of CUSP. Figure 12 compares the wall-clock times of the three data structures. The reported wall-clock time includes both the GPU computation time and the CPU-GPU data transfer
time.

Figure 11. Total kernel time of both sparse matrix-vector product and sparse matrix-transpose vector product (single-precision data on the GTX Titan graphics processing unit).

Figure 12. Wall-clock time of the preconditioned bi-conjugate gradient algorithm for the matrix parabolic_fem (single-precision data on the GTX Titan graphics processing unit).

The ecsb data structure outperforms HYB and CSR by 6% and 25%, respectively. The results suggest that the ecsb data structure delivers a higher computing throughput despite the memory overhead.

6.2.4. Utilization

The effectiveness of a data structure and its manipulation procedures is reflected in the utilization of the GPU hardware resources. Utilization is defined as the ratio of the average number of active warps per cycle to the maximum number of warps allowed on a multiprocessor [22]. We used the NVIDIA Visual Profiler [22] to collect the utilization data. Figures 13 and 14 give the single-precision utilization values of SMVP and SMTVP over all benchmark matrices. Clearly, the ecsb data structure enables a higher utilization of the GPU hardware for both SMVP and SMTVP, which suggests that ecsb is better suited to GPU computing and likely to be more scalable.

7. RELATED WORK

Sparse matrix-transpose vector product is a commonly used computing pattern that is already supported by the BLAS package [23]. Traditionally, SMTVP is solved by a transposition followed by an SMVP procedure, so its solution is closely related to work on SMVP. Bell and Garland [6] developed an efficient solution for sparse matrix-vector multiplication on GPUs. The key idea is to use a warp to compute the inner product of a row of the sparse matrix with the input vector. Such a solution strikes a good balance between parallelism and load balancing. Meanwhile,
the summation operation in the computation of the inner product is realized by an intra-warp reduction. This work dramatically improved the performance of GPU-based SMVP computation. It was released in an open-source package, CUSP [6], and later incorporated into NVIDIA's CUSPARSE package [7].

Figure 13. Average utilization of sparse matrix-vector product on the GTX Titan graphics processing unit.

Figure 14. Average utilization of sparse matrix-transpose vector product on the GTX Titan graphics processing unit.

Our CSR-based atomic addition builds on CUSP [6] because its source code is available. CUSP [6] solves the SMTVP problem by transposition followed by an SMVP procedure, whereas the CSR-based atomic addition takes a different approach and delivers a much higher throughput [8]. Many optimization techniques have also been introduced for GPU-based SMVP. Choi et al. [15] developed a model-driven auto-tuning technique to compute SMVP on GPUs. They used the blocked compressed sparse row and blocked ELLPACK (BELLPACK) formats and achieved up to 29.0 GFLOPS in single precision and 15.7 GFLOPS in double precision on an NVIDIA T10 processor (C1060). Deng et al. [14] proposed data-reordering techniques for GPU-accelerated SMVP computation, which are especially suited to very sparse matrices. Su and Keutzer [16] developed the Cocktail format, a new sparse matrix format, and its OpenCL [24] implementation, clSpMV; the framework analyzes a sparse matrix at runtime and recommends a suitable format for a given GPU. CUSPARSE [7] has also been heavily optimized, but such optimizations are orthogonal to the techniques proposed in CUSP [6]. Tang et al. [17] proposed compressing row/column indices to reduce memory traffic for an improved throughput on GPUs. Fu et al. [25] presented a parallel FEM solver for unstructured meshes and observed speedups of up to 87 times.
However, none of the aforementioned works achieves the same throughput
for both SMVP and SMTVP on the GPU, because commonly used sparse formats do not allow efficient row-wise and column-wise accesses at the same time. Buluç et al. [5] proposed a novel solution to simultaneously compute SMVP and SMTVP on multi-core platforms. The work is based on a new data structure, CSB, for storing sparse matrices; the block size is adapted to the size of the last-level cache so as to enhance the locality of data accesses. This approach is effective on multi-core CPUs but delivers only limited performance on GPUs, because a costly pre-processing step is needed to traverse all non-zeros and change their storage layout. Gomez-Luna et al. [26] provided a collision model for atomic functions on GPU shared memory. A key observation is that collisions in shared memory are equivalent to the bank-conflict problem. A collision model for global memory is still open, and that problem is more complicated.

8. CONCLUSION AND FUTURE WORK

Sparse matrix-vector product and SMTVP are two common computing patterns in sparse matrix based algorithms. Despite the progress in GPU-based sparse matrix computations, it remains challenging to compute both with a similar throughput on GPUs. In this work, we propose a GPU-oriented CSB data structure, designated ecsb, together with an atomic-operation-based computing procedure, to narrow the gap between SMVP and SMTVP computations. We also extend previous work by using a hybrid storage format within ecsb. Systematic experiments validate the proposed techniques: ecsb enables a higher throughput for SMVP and SMTVP than previous works on Kepler GPUs, and the two products can be computed in much closer times. We also test our work on the BCG algorithm, which further validates the effectiveness of ecsb. In the future, we plan to extend this research in several directions.
First, we will develop a performance model for atomic operations that accounts for the overhead of both collisions and cache misses; it will help construct adaptive algorithms for SMVP and SMTVP computations. Second, we will refine the ecsb data structure to reduce its memory overhead. Finally, we will explore heterogeneous methods involving both the CPU and the GPU for CSB-based computing; in particular, we want to investigate using the CPU to recursively partition the input sparse matrix into suitably small blocks and then launching the GPU for massively parallel execution.

ACKNOWLEDGEMENTS

The authors would like to thank Professor Songping Wu of Beihang University for discussions on the BCG algorithm. The work is partially supported by the National Natural Science Foundation of China under grant no. , the National Natural Science Foundation of China under grant no. , the Fundamental Research Funds for the Central Universities under grant no. YWF-14-JSJXY-14, Beijing Natural Science Foundation under grant no. , and the fund of the State Key Laboratory of Software Development Environment under grant no. SKLSDE-2014ZX-13.

REFERENCES

1. Duff SI. A survey of sparse matrix research. Proceedings of the IEEE 1977; 65(4). DOI: /PROC.
2. Vuduc R, Gyulassy A, Demmel JW, Yelick KA. Memory hierarchy optimizations and performance bounds for sparse A^T Ax. Proceedings of the 2003 International Conference on Computational Science: Part III (ICCS), June 2003. Springer: Berlin, Heidelberg, 2003.
3. Saad Y. Iterative Methods for Sparse Linear Systems (2nd edn). SIAM: Philadelphia, USA, 2003.
4. Barrett R, Berry M, Chan TF, Demmel J, Donato JM, Dongarra J, Eijkhout V, Pozo R, Romine C, Vorst HVD. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. (Available from: org/templates/templates.pdf) [Accessed 6 May 2014].
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationExploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture K. Akbudak a, C.Aykanat
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationHow naive is naive SpMV on the GPU?
How naive is naive SpMV on the GPU? Markus Steinberger, Andreas Derler, Rhaleb Zayer and Hans-Peter Seidel Max Planck Institute for Informatics Saarbrücken, Germany Email: {msteinbe,rzayer,hpseidel}@mpi-inf.mpg.de
More informationAutomatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation
Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Dominik Grewe Institute for Computing Systems Architecture School of Informatics University
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationFlexible Batched Sparse Matrix-Vector Product on GPUs
ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationLecture 6: Input Compaction and Further Studies
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationSummary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory
Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Background to Workshop Organised by CERN OpenLab and held in IT Auditorium,
More informationReport of Linear Solver Implementation on GPU
Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationGenerating Optimized Sparse Matrix Vector Product over Finite Fields
Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationStorage Formats for Sparse Matrices in Java
Storage Formats for Sparse Matrices in Java Mikel Luján, Anila Usman, Patrick Hardie, T.L. Freeman, and John R. Gurd Centre for Novel Computing, The University of Manchester, Oxford Road, Manchester M13
More informationHighly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs
Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationEvaluation Of The Performance Of GPU Global Memory Coalescing
Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea
More information