GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2015; 27: Published online 7 October 2014 in Wiley Online Library (wileyonlinelibrary.com)

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Yuan Tao 1,2,3, Yangdong Deng 4, Shuai Mu 4, Zhenzhong Zhang 1,2, Mingfa Zhu 1,2,*, Limin Xiao 1,2 and Li Ruan 1,2

1 State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
2 School of Computer Science and Engineering, Beihang University, Beijing, China
3 College of Mathematics, Jilin Normal University, Jilin, China
4 School of Software, Tsinghua University, Beijing, China

SUMMARY

Many high performance computing applications require computing both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi-core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (ecsb), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enabling a high computing throughput. We also use a hybrid storage format, selected dynamically at runtime, to store the elements in each block. Experimental results show that the proposed techniques implemented on a Kepler GPU deliver similar throughput on both SMVP and SMTVP, and the throughput is up to 13 times higher than that of the CPU-based CSB implementation.
In addition, our ecsb procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively. We validate the effectiveness of ecsb by means of the wall-clock time of the bi-conjugate gradient algorithm: our ecsb is 25% faster than compressed sparse rows (CSR) and 6% faster than HYB. Copyright 2014 John Wiley & Sons, Ltd.

Received 31 October 2013; Revised 23 July 2014; Accepted 10 September 2014

KEY WORDS: sparse matrix-transpose vector product; sparse matrix-vector product; compressed sparse block; CSB; compressed sparse rows; CSR; GPU

1. INTRODUCTION

Sparse matrices are extensively used in high performance computing applications because they naturally reflect the characteristics of a wide range of scientific and engineering problems [1]. Among the various computation patterns of sparse matrices, an especially interesting one is to compute both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) in the same application with the underlying matrices stored in a single format [2, 3]. Such a pattern is exemplified by the bi-conjugate gradient algorithm (BCG) [3, 4]. As shown by the BCG pseudo-code in Figure 1, SMVP and SMTVP are computed in lines 7 and 8, respectively. Because these two steps have no data dependency, they can be assigned to two computing nodes for parallel processing. Under such a circumstance, a key requirement is a similar computation throughput for the two

*Correspondence to: Mingfa Zhu, School of Computer Science and Engineering, Beihang University, Beijing, China. zhumf@buaa.edu.cn
Y. TAO ET AL.

Figure 1. Bi-conjugate gradient algorithm [3, 4].

problems. Meanwhile, both should be performed in a highly efficient manner, while the underlying sparse matrix should be stored in a single format to save memory space. Buluç et al. [5] proposed the so-called compressed sparse block (CSB) format so that both the SMVP and SMTVP procedures can be processed with similar throughputs on multi-core CPUs. Recently, significant breakthroughs have been made in exploiting graphics processing units (GPUs) for sparse matrix computations. In particular, the seminal work of Bell and Garland [6] proved that a throughput of 16 GFLOPS in single precision could be achieved on a single GPU node for SMVP. On the other hand, the problem of computing SMVP and SMTVP in a single format has not been extensively studied. At present, SMTVP is typically solved by a transposition followed by an SMVP procedure. Such a strategy was adopted by a leading-edge GPU-based sparse matrix library, CUSPARSE [7]. The significant overhead of transposing a sparse matrix, however, makes the strategy less efficient. Our experiments using CUSPARSE evidenced that the transposition cost dominates the overall computation time [8]. The other solution for SMTVP is based on the atomic addition operation on NVIDIA GPUs with compute capability 2.0 or higher. Experiments show that SMTVP by atomic additions can outperform CUSPARSE by up to eightfold [8], but there would still be a gap of up to sevenfold between SMVP and SMTVP [8]. It is thus natural to consider porting CSB to GPUs to solve the aforementioned problem. However, a straightforward porting will not deliver sufficient computing throughput on GPUs because of its inherent irregularity. Accordingly, we investigate efficient GPU solutions to the problem of computing both SMVP and SMTVP in this work. We first perform experiments to identify the performance gap between SMVP and SMTVP on GPUs.
On this basis, we redesign the CSB data structure into the expanded CSB (ecsb) to fully take advantage of GPU SIMD hardware. We also adopt the hybrid storage format proposed by Bell and Garland [6] for storing a block of ecsb and develop a heuristic to select among the different representations of ecsb. Our experimental results prove that the proposed data structure and the corresponding GPU implementation enable similar computation times for both SMVP and SMTVP at a throughput that is up to 12 times higher than that of a 4-core, 8-thread CPU-based implementation of CSB. Our techniques can also be faster than the previous GPU-based CSR and HYB [6] procedures on over half of the sparse matrix benchmarks. The main contributions of this work are as follows. We redesigned the CSB data structure for GPU-based computing to achieve high performance SMVP and SMTVP at the same time. A dynamic selection algorithm is proposed to determine the proper data storage format for ecsb according to the distribution of non-zeros.
SPARSE MATRIX MULTIPLICATION ON GPUs

We validate the effectiveness of the proposed ecsb data structure with extensive experiments. The data structure was compared with existing data structures in terms of the ratio between SMVP and SMTVP computation throughputs, the total throughput, the total kernel execution time of SMVP and SMTVP, the wall-clock time of the BCG algorithm, and the utilization. The remainder of the paper is organized as follows. The preliminaries of this work are introduced in Section 2. Section 3 analyzes the challenges of porting CSB to GPUs. Sections 4 and 5 explain the so-called ecsb data structure and its efficient GPU implementation to compute both SMVP and SMTVP. Section 6 reports and discusses the experimental results. We review related works in Section 7. In the last section, we conclude the paper and outline future work.

2. PRELIMINARIES

In this section, we introduce the preliminaries related to this work. First, we explain the basic idea of CSB. Then, the fundamental concepts of GPU computing are discussed.

2.1. Compressed sparse block

Sparse matrices allow using a compressed format that stores only the non-zero elements to save memory space. Such formats usually remove zero elements along either the row or the column direction. However, such formats cannot guarantee efficient access to non-zeros along both row and column directions, which leads to a performance gap between SMVP and SMTVP. Buluç et al. [5] introduced CSB to support efficiently computing both SMVP and SMTVP on CPUs. In CSB, a matrix A is partitioned into square blocks of identical size. An example matrix and its CSB encoding are illustrated in Figure 2. The definition of the CSB data structure is listed in Figure 3. It consists of three arrays: blk_ptr, comb_lowbits, and val. Array blk_ptr is a two-dimensional array recording the offset of the first non-zero of each block.
The array comb_lowbits stores the concatenated row and column indexes of the non-zeros in a block (in Figure 2, row and column indexes are shown separately). Array val stores the non-zeros. In CSB, a row (column) of blocks is designated as a blockrow (blockcolumn). The CSB computation is designed for coarse grain parallel execution on multi-core CPUs. The pseudo-codes

Figure 2. Compressed sparse block format with bit-interleaved intra-block ordering for sparse matrices [5].

Figure 3. Structure of CPU-based compressed sparse block.
Figure 4. Various forms of parallelism used in parallel compressed sparse block computing.

for computing SMVP and SMTVP are listed in Algorithms 1 and 2. The algorithms use a rather complicated procedure to recursively partition every block into groups of non-zeros. The procedure consists of two steps. The first step packs blocks into a linear array according to the density of each block. If the density of a block is greater than a threshold, it is packed into the linear array as a single element. Otherwise, multiple sparse blocks are combined into one element. The second step performs SMVP or SMTVP in parallel on the elements in the linear array. Given an element corresponding to a dense block, the block may need to be divided into multiple sub-blocks until the density of the sub-blocks is smaller than a threshold. Then, the sub-blocks can be handled concurrently. The basic element of parallel execution is a group, containing either a dense sub-block or multiple sparse blocks. Such an element is manipulated by a CPU thread. The aforementioned process is illustrated in Figure 4. Note that CSB offers good locality for both SMVP and SMTVP.
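To make the three CSB arrays concrete, the following host-side Python sketch builds them from a small dense matrix. This is our own illustration, not the authors' code: the array names follow the paper, but for simplicity the intra-block key concatenates the row and column low bits directly rather than using the paper's bit-interleaved ordering.

```python
# Simplified sketch of CSB construction (illustrative, not the authors' code).
# blk_ptr[i][j] records where block (i, j) starts in val;
# comb_lowbits packs the intra-block (row, col) of each non-zero;
# val holds the non-zero values, grouped block by block.
def dense_to_csb(A, beta):
    n_rows, n_cols = len(A), len(A[0])
    nb_r = (n_rows + beta - 1) // beta   # number of blockrows
    nb_c = (n_cols + beta - 1) // beta   # number of blockcolumns
    blk_ptr = [[0] * (nb_c + 1) for _ in range(nb_r)]
    comb_lowbits, val = [], []
    bits = max(1, (beta - 1).bit_length())  # bits per intra-block index
    for bi in range(nb_r):
        for bj in range(nb_c):
            blk_ptr[bi][bj] = len(val)
            for r in range(bi * beta, min((bi + 1) * beta, n_rows)):
                for c in range(bj * beta, min((bj + 1) * beta, n_cols)):
                    if A[r][c] != 0:
                        comb_lowbits.append(((r % beta) << bits) | (c % beta))
                        val.append(A[r][c])
        blk_ptr[bi][nb_c] = len(val)  # sentinel closing the blockrow
    return blk_ptr, comb_lowbits, val
```

With blk_ptr in this form, blk_ptr[i][j+1] - blk_ptr[i][j] gives the non-zero count of block (i, j), the property the ecsb design in Section 4 also relies on.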
Figure 5. A generic architecture of graphics processing unit [13].

2.2. General purpose computing with graphics processing units

Recently, GPUs have emerged as a new general purpose computing platform [9, 10]. GPUs were originally designed as application specific integrated circuits for graphics rendering. Pushed by the relentless pursuit of better visual experiences, GPUs evolved to offer both high programmability and superior computing throughput. Now the two major GPU vendors, NVIDIA and AMD, both offer GPU hardware and software designs with support for general purpose computing. We use NVIDIA GPUs and the corresponding programming model, CUDA [11], in this work. Modern GPUs adopt a single program multiple data [12] programming model. A GPU consists of tens of multiprocessors, each equipped with multiple lanes of computing logic. A multiprocessor fetches each instruction and dispatches it to all lanes, which then operate on different data. To hide latency, the single program multiple data model is enhanced with a multithreading mechanism. A GPU program follows a three-tier organization. A thread is the basic unit of parallel execution. Multiple threads constitute a thread block, which is assigned to a multiprocessor for execution. A number of thread blocks form a grid and are scheduled by a scheduler on the GPU. Threads in a block can synchronize with each other and exchange data through a fast on-chip memory. A warp of threads is executed on a multiprocessor as a single unit, in which each thread is mapped to a different lane. A warp is suspended at a memory access instruction, and another warp with all data ready then resumes execution. The various multiprocessors execute the program independently. GPU memory is more complicated than its CPU counterpart. The instruction operands are stored in a register file located on a multiprocessor, which also has a shared memory.
The shared memory works as a software controlled cache. Each multiprocessor also has its own L1 instruction and data caches. The whole GPU has a unified L2 cache. GPU main memory, that is, global memory, is installed off-chip and connected to the GPU through a wide dedicated bus. The global memory has a long latency but offers a high bandwidth. To hide latency, modern GPUs adopt a memory coalescing mechanism. When a warp of threads executes a memory instruction, the corresponding memory requests can be merged into a smaller number of references if they fall into a single segment of memory space aligned at a multiple of 64 bytes. The computing power of GPUs depends on the capability of concurrently running a massive number of threads. It is thus essential to guarantee the correctness of concurrent accesses to a single address by multiple threads. Recent GPUs support atomic functions that enable a complete memory write without being
interrupted by other requests to the same address. In NVIDIA GPUs, the atomic function for global memory is realized in the L2 cache. All atomic operations to the same address of global memory are buffered in a first-in first-out queue and processed by an arithmetic logic unit (ALU) in a serialized manner, as shown in Figure 5. Graphics processing units exhibit different characteristics from multi-core CPUs. First of all, the computing cores on a GPU are much simpler than those deployed on multi-core CPUs. These cores are organized into single instruction multiple data (SIMD) lanes to deliver high computing throughput. The simpler cores have less hardware support for memory pre-fetching and branch instructions. Second, GPUs exploit a massive number of threads to hide memory latency, while the caches deployed on GPUs have much less capacity than their equivalents on multi-core CPUs. On the other hand, CPUs heavily depend on caches to hide memory latency but only support a relatively small number of threads. Third, GPU threads are directly supported by hardware. In other words, every active thread (i.e., a thread that is already assigned to a GPU core for execution) has its own registers to store program state, so no context switching is necessary when a thread is suspended or activated. This is drastically different from CPU threads.

3. CHALLENGES OF COMPUTING BOTH SPARSE MATRIX-VECTOR PRODUCT AND SPARSE MATRIX-TRANSPOSE VECTOR PRODUCT WITH GRAPHICS PROCESSING UNITS

Graphics processing units have been extensively used in many scientific and engineering applications. However, GPUs were originally designed to support highly regular workloads. The irregular data structures exemplified by sparse matrices pose considerable challenges to GPUs and require programmers to design algorithms that can effectively extract data level parallelism.
Many research works have been proposed to improve the efficiency of GPU computing on sparse matrices (e.g., [6, 7, 14-17]). Among these works, CUSPARSE systematically addresses the problem of SMVP and provides routines for popular sparse matrix formats. Currently, SMVP can be computed with a throughput of 16 GFLOPS in single precision on a single GPU [6]. To solve the problem of computing SMVP and SMTVP on GPUs with a single format, there are two basic approaches. The first is to choose an existing sparse matrix format and develop a corresponding procedure, while the second is to port the CSB data structure to GPUs. In the remainder of this section, we investigate the feasibility of these two approaches.

3.1. Computing sparse matrix-vector product and sparse matrix-transpose vector product with existing sparse matrix formats

There exist many sparse matrix formats, among which COO, CSR, and ELL [3] are representative. COO stores the row and column indexes of each non-zero, while CSR compresses the row index information by only storing the information needed to locate the first non-zero in each row. ELL, or Ellpack, uses a fixed row width to store non-zeros. If a row has fewer non-zeros, it has to be padded with zeros. ELL needs two arrays to store the column indexes and the non-zeros. Given structured matrices, ELL is more appropriate for GPU computing because the regular row width enables a predictable computing pattern. However, too many zeros may need to be padded when processing unstructured matrices. The HYB format was introduced by Bell and Garland [6] for efficient computation of SMVP on GPUs by combining ELL and COO. In HYB, most non-zeros of each row are stored in the ELL format, while the remaining non-zeros beyond the ELL row width are recorded in COO. Figure 6 illustrates the three storage formats for an example matrix.
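As a stand-in for the example in Figure 6, the following Python sketch (our own illustration, not taken from the paper) builds the three layouts for a small matrix; the function names are ours.

```python
# Illustrative construction of COO, CSR, and zero-padded ELL layouts.
def to_coo(A):
    nz = [(r, c, v) for r, row in enumerate(A) for c, v in enumerate(row) if v != 0]
    return ([r for r, _, _ in nz], [c for _, c, _ in nz], [v for _, _, v in nz])

def to_csr(A):
    row_ptr, col_ind, val = [0], [], []
    for row in A:
        for c, v in enumerate(row):
            if v != 0:
                col_ind.append(c)
                val.append(v)
        row_ptr.append(len(val))   # offset of the first non-zero of the next row
    return row_ptr, col_ind, val

def to_ell(A):
    width = max(sum(1 for v in row if v != 0) for row in A)  # fixed row width
    col_ind = [[0] * width for _ in A]   # unused slots padded with zeros
    val = [[0] * width for _ in A]
    for r, row in enumerate(A):
        k = 0
        for c, v in enumerate(row):
            if v != 0:
                col_ind[r][k], val[r][k] = c, v
                k += 1
    return col_ind, val
```

The padding in to_ell shows why ELL wastes space on matrices whose row lengths vary widely, which is exactly the case HYB was designed to handle.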
From Figure 6, it can be seen that existing formats cannot enable efficient access to non-zeros along both row and column directions, while efficient computing of both SMVP and SMTVP with a single format does require quick access in both directions. In addition, the available formats cannot guarantee efficient use of the cache in both SMVP and SMTVP.
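The asymmetry can be seen in a serial Python sketch (ours, not the paper's) of computing y = A*x and y = A^T*x from the same CSR arrays: the former gathers each row's products into a single output entry, while the latter scatters updates across y, which on a GPU would force atomic additions.

```python
# y = A x and y = A^T x from one CSR encoding (serial reference sketch).
def csr_smvp(row_ptr, col_ind, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(row_ptr) - 1):
        # Gather: every partial product of row r reduces into the single y[r].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += val[k] * x[col_ind[k]]
    return y

def csr_smtvp(row_ptr, col_ind, val, x, n_cols):
    y = [0.0] * n_cols
    for r in range(len(row_ptr) - 1):
        # Scatter: updates land at arbitrary columns; parallel threads
        # would race on y here, hence the need for atomic additions on a GPU.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[col_ind[k]] += val[k] * x[r]
    return y
```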
3.2. Porting compressed sparse block to graphics processing units

The CSB data structure proves to be efficient on multi-core CPUs because it better utilizes the locality of the non-zeros of the input matrix and vectors. A straightforward porting of CSB to GPUs, however, turns out to be inefficient for several reasons. First, the CSB algorithm performs a relatively complex recursive partitioning process to decompose the original matrix into small chunks. Here, one chunk can be a row of blocks, a few blocks along a row, or a partial block, according to the number of non-zeros inside. As illustrated in Figure 4 and elaborated in Section 2.1, the computing procedure actually uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks. Such a combination of various types of dynamic parallelism suggests that this scheme is more appropriate for coarse grain multithreaded execution on multi-core CPUs, whereas GPUs are built on the foundation of SIMD execution, which requires all parallel threads to run exactly the same program. Second, CSB operations are performed on blocks. Such blocks have a varying number of non-zeros and have to be processed as single units. As a result, we generally choose to use one warp to handle one block. Given a block with few non-zeros, however, the computing resources are substantially wasted. In addition, the memory bandwidth is under-utilized because a warp accesses memory as a complete unit. Third, Buluç et al. [5] depend on Cilk [18] to distribute computing tasks to cores at runtime for dynamic load balancing. However, such dynamic schedulers are not available on GPUs yet. Instead, we have to carefully handcraft a thread scheduling mechanism to achieve good load balance.

4. THE EXPANDED COMPRESSED SPARSE BLOCK DATA STRUCTURE

Considering the problems discussed in Section 3.2, a natural question arises: what data structures are appropriate for GPU-based CSB computations?
On the one hand, the block structure is necessary to maintain the two-dimensional parallelism that is essential to compute both SMVP and SMTVP in an efficient manner. In other words, it is desirable to have blk_ptr[i][j+1] - blk_ptr[i][j] give the number of non-zeros of block (i, j), where i is the blockrow offset and j is the blockcolumn offset. On the other hand, we want to break the block structure so that the number of wasted threads (i.e., threads assigned zero elements) can be minimized. To meet these seemingly contradictory requirements, we redesigned the CSB data structure as follows. The values of the non-zeros are kept in an array val. The original two-dimensional array, blk_ptr, is converted into a one-dimensional array in the GPU memory, recording the position of the first non-zero of each blockrow in array val. A new array, block_ind, stores the block index for each non-zero element.

Figure 6. Comparison of common sparse matrix formats.

Figure 7. Expanded compressed sparse block data structure.

Such an arrangement simplifies the processing, because when handling a non-zero, a thread can easily determine which block it belongs to. In fact, an element of the array block_ind is the concatenation of the row and column indexes of the block. The array comb_lowbits is the same as its equivalent in CSB. The proposed ecsb data structure is listed in Figure 7. A key idea of ecsb is that we preserve the CSB-based ordering of non-zeros but handle a blockrow as the basic unit of parallel processing. One critical issue is that the distribution of non-zeros may vary sizably. Some sparse matrices have a relatively uniform distribution of non-zeros among rows, while others may have a few denser rows and a larger number of sparse rows. Accordingly, we extend the hybrid format, HYB, proposed by Bell and Garland [6] to adapt to the non-zero distribution. Given a sparse matrix with a relatively regular non-zero distribution, we use the ELL format because it helps improve load balance. In our implementation, we do not perform padding of zeros when the number of non-zeros is less than the row width. We use the COO format for highly irregular sparse matrices. A hybrid format is used on all remaining matrices; the basic idea is to store non-zeros within a certain row width in the ELL format and the other non-zeros in the COO format. We propose a heuristic to choose a proper format. Note that the ELL and COO formats are layered on top of CSB because the non-zeros keep the ordering defined within each block. The non-zeros in each block can be stored in different orders, among which row-major and column-major orderings are commonly used. The original CSB data structure uses a bit-interleaving ordering, or Z-Morton ordering [5]. With the row-major ordering, we need to compute the SMVP problem with atomic operations (the details will be discussed in Section 5).
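A host-side Python sketch (ours; the array names follow the paper, the code does not) shows the bit-interleaved (Z-Morton) intra-block key and the flattening of CSB's two-dimensional blk_ptr into the one-dimensional blk_ptr plus per-non-zero block_ind of ecsb.

```python
# Bit-interleaved (Z-Morton) intra-block key, plus flattened ecsb arrays.
def morton_key(r, c, bits):
    """Interleave the low `bits` bits of (r, c): r supplies the odd bit
    positions and c the even ones, giving the Z-order rank inside a block."""
    key = 0
    for b in range(bits):
        key |= ((c >> b) & 1) << (2 * b)
        key |= ((r >> b) & 1) << (2 * b + 1)
    return key

def csb_to_ecsb(blk_ptr2d, nb_cols):
    """Flatten CSB's 2-D blk_ptr: keep one offset per blockrow and emit a
    per-non-zero block_ind = (blockrow << shift) | blockcolumn."""
    shift = max(1, (nb_cols - 1).bit_length())
    blk_ptr, block_ind = [], []
    for bi, row in enumerate(blk_ptr2d):
        blk_ptr.append(row[0])
        for bj in range(nb_cols):
            block_ind += [(bi << shift) | bj] * (row[bj + 1] - row[bj])
    blk_ptr.append(blk_ptr2d[-1][nb_cols])  # sentinel closing the last blockrow
    return blk_ptr, block_ind
```

With this flattening, one offset per blockrow suffices, and each non-zero carries enough information (block_ind plus comb_lowbits) for a thread to locate it without consulting a 2-D index.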
Atomic addition is also used for SMTVP when using the column-major ordering. Both orderings introduce collisions and lower the overall throughput, so we adopt the bit-interleaving ordering for GPU-based storage.

5. GRAPHICS PROCESSING UNIT IMPLEMENTATION

Because ecsb exploits a hybrid format, we develop both ELL and COO procedures for SMVP and SMTVP. The remainder of this section explains the details. We also propose a heuristic to choose a proper format from three possible combinations: ELL, COO, and HYB.

5.1. ELL based sparse matrix-vector product and sparse matrix-transpose vector product

ELL was designed to support relatively regular sparse matrices. Non-zeros are compressed into a dense matrix with a fixed row width. In this work, one row of ELL corresponds to a blockrow of ecsb. We use one thread block to handle one row. Each thread computes the product of a non-zero and a vector element in one iteration, and the threads in a block iterate through a single row. Figure 8(a) and (b) lists the CUDA pseudo-code for computing the SMVP and SMTVP of the ELL part of a sparse matrix encoded in the ecsb data structure. In the code listed in Figure 8(a), line 4 derives the starting position of the first non-zero in the current blockrow via the offset (line 2) stored in the blk_ptr array. The offset is also used to extract the combined word of row and column indices of the current block (line 6) and the combined word of the intra-block row and column indices (line 7). In the loop starting at line 8, a thread fetches multiple non-zeros and the corresponding vector elements. Then, it performs multiplications to generate a series of intermediate results that should be summed together. Note that there is no guarantee on the order of processing by multiple threads on a massively parallel platform like a GPU. In addition, the random distribution of the numbers of non-zeros makes it infeasible to store the intermediate results
Figure 8. CUDA pseudo-code of the ELL parts of the expanded compressed sparse block format for computing y = y + A*x and y = y + A^T*x.

in the shared memory directly. As a result, we resort to atomic operations to add the intermediate results to the corresponding entry of the product vector located in shared memory. Current CUDA compilers only support atomic addition on integers and single-precision floating point numbers. For double-precision floating point numbers, one way is to cast them into integers of long long type, which are supported by an atomic compare and swap function, atomicCAS [11], to avoid data races. The corresponding CUDA pseudo-code is listed in Figure 9. In the code snippet, __longlong_as_double and __double_as_longlong at line 3 are CUDA library functions that convert between 64-bit integer values and double-precision floating point numbers. The codes of Figure 8(a) at lines 16-f and 16-d are for single and double-precision floating point numbers, respectively. Figure 8(b) lists the code of SMTVP. The difference between SMVP and SMTVP is that the SMTVP procedure has its atomic additions done in global memory because of the wider distribution of data entries. Non-zeros in a blockrow are stored in a block-by-block order, while inside a block, the order is defined by CSB. As a result, it is feasible to visit the non-zeros in a block along both the row direction and the column direction in a similar manner. Such a similarity accounts for the comparable throughputs between SMVP and SMTVP.
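Figure 8 does not survive this transcription; as a reconstruction under assumptions, the following serial Python sketch conveys the traversal, with plain additions standing in for the atomic additions. Both products walk the same ecsb arrays and differ only in which decoded index feeds x and which feeds y.

```python
# Serial stand-in for the ELL-part kernels of Figure 8 (assumed layout:
# blk_ptr holds one offset per blockrow; block_ind packs (blockrow, blockcol);
# lowbits packs the intra-block (row, col); beta is the block edge length).
def ecsb_products(blk_ptr, block_ind, lowbits, val, x, n, beta, bits, shift):
    y_smvp, y_smtvp = [0.0] * n, [0.0] * n
    for bi in range(len(blk_ptr) - 1):          # one blockrow per thread block
        for k in range(blk_ptr[bi], blk_ptr[bi + 1]):
            bj = block_ind[k] & ((1 << shift) - 1)            # blockcolumn
            r = bi * beta + (lowbits[k] >> bits)              # global row
            c = bj * beta + (lowbits[k] & ((1 << bits) - 1))  # global column
            y_smvp[r] += val[k] * x[c]   # SMVP: atomic add in shared memory
            y_smtvp[c] += val[k] * x[r]  # SMTVP: atomic add in global memory
    return y_smvp, y_smtvp
```

Because the same decoded (r, c) pair serves both products, the two computations issue the same memory traffic pattern over the matrix data, which is the source of the balanced throughputs discussed above.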
Figure 9. The compare-and-swap based atomic addition for double-precision data (adapted from [11]).

5.2. COO-based sparse matrix-vector product and sparse matrix-transpose vector product

The COO format is used to handle highly irregularly distributed non-zero elements. Here, we do not need the array blk_ptr because the block index plus the intra-block row and column indices are sufficient. Each thread first computes the product of one non-zero element and its respective entry of x and then atomically adds the product to the corresponding entry of the product vector y. Figure 10(a) and (b) shows the pseudo-code of the SMVP and SMTVP of the COO part of ecsb. The difference between Figure 10(a) and (b) lies in lines 7-9, because SMVP differs from SMTVP in the ordering of accessing vectors x and y.

5.3. Determining the storage format for expanded compressed sparse block

As explained in Section 4, a sparse matrix encoded in ecsb can be physically stored in ELL, COO, or a hybrid ELL and COO format. Given a sparse matrix, it is essential to determine the best storage format. We propose a heuristic to choose a proper format according to the characteristics of the non-zero distribution. Bell and Garland [6] proposed the HYB format to store sparse matrices as well as a procedure to derive the best row width for the ELL part. The procedure starts by removing the rows with the least number of non-zeros. If the number of remaining rows is beyond a threshold, the procedure continues by removing the rows with the second least number of non-zeros. This process stops when the threshold is reached. The least number of non-zeros among the remaining rows is the width of ELL. An important feature of ecsb is that the width of a blockrow varies significantly because a blockrow actually consists of multiple rows of the original matrix. Hence, we extend Bell and Garland's procedure to select a proper storage format.
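The COO-based procedure described above admits the same kind of serial stand-in (ours; plain additions replace the atomic adds, and the array layout is the one assumed in the earlier sketches).

```python
# Serial stand-in for the COO-part kernels of Figure 10 (assumed layout:
# block_ind packs (blockrow, blockcol); lowbits packs intra-block (row, col)).
def coo_products(block_ind, lowbits, val, x, n, beta, bits, shift):
    y_smvp, y_smtvp = [0.0] * n, [0.0] * n
    for k in range(len(val)):                    # one non-zero per thread
        bi = block_ind[k] >> shift               # blockrow
        bj = block_ind[k] & ((1 << shift) - 1)   # blockcolumn
        r = bi * beta + (lowbits[k] >> bits)
        c = bj * beta + (lowbits[k] & ((1 << bits) - 1))
        y_smvp[r] += val[k] * x[c]   # would be an atomic add to y on a GPU
        y_smtvp[c] += val[k] * x[r]  # the indices of x and y swap for SMTVP
    return y_smvp, y_smtvp
```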
We first run Bell and Garland's procedure to derive two parameters: the width of the ELL format and the number of remaining rows. We then use the following heuristic to select a format:
1) If the width of ELL is the maximum width among all rows, we use the ELL format to store the matrix.
2) Otherwise,
2.1) If the number of remaining rows is less than 20% of the total number of rows, and the width of ELL of ecsb is more than 90% of the maximum width among all rows, we use COO.
2.2) Otherwise, we use the HYB form of ecsb.
The rationale is that the maximum number of non-zeros among all rows is the main factor affecting the GPU throughput of SMVP and SMTVP. If the width of ELL is the maximum width among all
Figure 10. CUDA pseudo-code of the COO parts of the expanded compressed sparse block format for computing y = y + A*x and y = y + A^T*x.

rows after Bell and Garland's procedure, then many rows have the maximum number of non-zeros. Accordingly, we select the ELL format, because any further division of these rows into a HYB format would result in additional overhead. If the non-zeros are highly randomly distributed, we choose the COO format. Otherwise, we use HYB to store the ecsb-based sparse matrix.

6. RESULTS AND DISCUSSION

In this section, we evaluate the effectiveness of our techniques for both SMVP and SMTVP. Our results are compared with the multi-core CPU-based CSB implementation by Buluç et al. [5] as well as the GPU-based CSR implementation provided in the CUSPARSE package and the HYB implementation provided in the CUSP package. We chose HYB because it delivers the highest throughput on unstructured matrices [6], and CSR because it is the most commonly used format. Because CUSPARSE is not open-source, the GPU-based SMTVP procedure is implemented by modifying the code of CUSP, the research prototype of CUSPARSE. This implementation is proved to offer a higher throughput than the SMTVP of CUSPARSE [8].

6.1. Experimental setup

In this work, all experiments are conducted on the platforms listed in Table I. The benchmark matrices are summarized in Table II; they were taken from the University of Florida sparse matrix collection [19]. These matrices exhibit diverse characteristics, and most of them have been used by Buluç et al. in a related work [5].
Table I. Test-beds used in this work.

Platform             GTX Titan             C2075                 Intel Core i CPU
Architecture         Kepler                Fermi                 x86_64
Compute capability                                               N/A
Number of cores      2698 stream cores     448 stream cores      4 cores/8 threads
Peak performance     3.52 TFLOPS           1.03 TFLOPS           GFLOPS
(single precision)
Bandwidth            GB/s                  144 GB/s              25.6 GB/s
Runtime              CUDA 5.0, Thrust      CUDA 5.0, Thrust      Cilk 8503
ECC support          Disabled              Enabled               Yes
Operating system     CentOS release 6.2    CentOS release 6.2    Ubuntu LTS
Kernel               x86_64                x86_64                generic x86_64
CPU compiler         gcc                   gcc                   gcc

Table II. Benchmark matrices used in this work.

Name            Description                              Row/column     Non-zeros   Non-zeros per column (mean/max)
asic_320k       Circuit simulation                       321k/321k      2635k       8/203.8k
parabolic_fem   Parabolic FEM                            525k/525k      2100k       3/7
kkt_power       Optimal power flow,                      2063k/2063k    8130k       3/84
                nonlinear optimization
torso1          Finite difference/boundary elem.         116k/116k      8516k       73/1224
sme3dc          3D structural mechanics problem          42k/42k        3148k       73/405
Rucci           Ill-conditioned least-squares problem    1977k/109k     7791k       70/108
bone010         3D trabecular bone                       986k/986k      36326k      36/42
ldoor           INDEED test matrix                       952k/952k      23737k      24/77
cont11_l        Linear programming problem               1468k/1961k    5382k       2/7
rajat31         Rajat circuit simulation matrix          4690k/4690k    20316k      4/1252
cage15          DNA                                      5154k/5154k    99199k      19/

6.2. Experimental results

In this subsection, we present the experimental results. The performance is evaluated with five metrics: the ratio of SMVP throughput to SMTVP throughput, the total throughput of SMVP and SMTVP, the kernel execution time, the wall-clock time of the BCG algorithm, and the utilization. To evaluate the effectiveness of our techniques, we implement CSR and HYB based procedures by extending CUSP. We use the function provided in CUSPARSE to compute SMVP on CSR and the one in CUSP to compute it on HYB. The corresponding SMTVP computation is implemented with a reduction-based procedure [8] by tweaking the code of CUSP [6].
We also compare our work with the original CSB results collected on an Intel Core i7 CPU. All the results reported in this subsection are the averages of 100 runs.

6.2.1. Ratio of sparse matrix-vector product throughput to sparse matrix-transpose vector product throughput

Tables III and IV compare the ratio of computation throughput between SMVP and SMTVP in single precision and double precision, respectively. The GPU-based implementations are tested on the GTX Titan. The CPU results are collected by running eight parallel threads. To better evaluate the results, we use the root mean square deviation (RMSD) [20] to measure how well each approach delivers a balanced computation throughput of
both SMVP and SMTVP. RMSD is defined as RMSD(θ) = sqrt(E[(θ - θ̂)^2]), where θ̂ is the target ratio, set to 1.0 in this paper, and θ is the ratio collected in our experiments.

Table III. Comparison of the ratio of SMVP and sparse matrix-transpose vector product throughputs on single-precision data.
(Columns: matrix name; ecsb, CSR, and HYB on the GTX Titan GPU; CSB on the CPU. Rows: the 11 benchmark matrices; the final row gives the root mean square deviation. The numeric entries could not be recovered.)
CSB, compressed sparse block; ecsb, expanded compressed sparse block.

Table IV. Comparison of the ratio of SMVP and sparse matrix-transpose vector product throughputs on double-precision data.
(Same layout as Table III; the numeric entries could not be recovered.)

The results in Table III show that the proposed techniques enable a highly balanced computation throughput between SMVP and SMTVP. The ecsb-based results are significantly more balanced than those based on CSR and HYB; in fact, ecsb achieves an even better RMSD than the original CSB algorithm implemented on a multi-core CPU. Two matrices, asic_320k and torso1, have relatively unbalanced ratios because they contain a few denser rows that lead to more collisions. We also evaluate the throughput ratio for double-precision data, and the results are listed in Table IV. The imbalance is larger than in the single-precision case, but the ecsb results are still significantly more balanced than those of CSR and HYB. The larger difference between SMVP and SMTVP stems from the lack of direct support for atomic operations on double-precision data: current GPUs only allow the generic compare-and-swap procedure, illustrated in Figure 9, for double-precision data [11].
The performance of the resulting atomic operations is considerably slower than that of their single-precision counterparts, whose atomic additions can be performed by the raster operation (ROP) units in the level-2 cache [21].
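The compare-and-swap pattern just described can be illustrated by emulating a double-precision atomic add on top of a 64-bit CAS primitive. The sketch below is a single-threaded Python model for exposition only; on a real GPU this is the well-known atomicCAS retry loop, and every failed attempt is an extra round trip, which is why CAS-emulated double atomics lag native single-precision ones:

```python
import struct

def double_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

class Cell:
    """A 64-bit memory word with a compare-and-swap primitive."""
    def __init__(self, value=0.0):
        self.bits = double_to_bits(value)

    def cas(self, expected_bits, new_bits):
        old = self.bits
        if old == expected_bits:
            self.bits = new_bits
        return old  # CAS returns the word it found, like atomicCAS

def atomic_add_double(cell, val):
    """Retry loop: read, add, attempt CAS; repeat until no other
    writer intervened between the read and the swap."""
    old = cell.bits
    while True:
        assumed = old
        new = double_to_bits(bits_to_double(assumed) + val)
        old = cell.cas(assumed, new)
        if old == assumed:          # swap succeeded
            return bits_to_double(assumed)
```

With one writer the loop always succeeds on the first attempt; contention on the same y entry multiplies the number of retries, matching the collision behavior observed for the denser rows.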
6.2.2. Computing throughput

In this subsection, we report the computing throughputs of both SMVP and SMTVP. We collect single-precision and double-precision results on the GTX Titan; the results are shown in Tables V and VI. We also perform the same experiment on single-precision data with the C2075, and the results are shown in Table VII. The computing throughput is derived by dividing the total number of floating-point operations by the kernel time. Table V compares the single-precision performance of the proposed ecsb data structure with the HYB and CSR structures on the GTX Titan. On average, ecsb-based SMVP is comparable with or slightly better than HYB and outperforms CSR by 20%, while ecsb-based SMTVP delivers a throughput higher than HYB and CSR by 31.5% and 195.8%, respectively. Compared with the CPU, the proposed ecsb data structure attains at least a 5.0 times speedup for SMVP and SMTVP. We also evaluate the double-precision performance on the GTX Titan, with results listed in Table VI: ecsb delivers a slightly lower SMVP performance than HYB and CSR but outperforms them by 365.4% and 69.8%, respectively, in SMTVP. To study the performance scalability across GPUs, we compare the single-precision results on the C2075 in Table VII, where ecsb-based SMTVP outperforms HYB and CSR based SMTVP by 7.6% and 79.9%, respectively. In Table V, two matrices, asic_320k and torso1, have a lower SMVP performance when stored in the ecsb data structure. The reason is that the numbers of non-zeros in their most populated rows are much higher than the average, and such a distribution leads to more collisions in the atomic additions. Our GPU implementation of the ecsb data structure delivers a higher throughput than the original HYB-based implementations of the CUSP and CUSPARSE packages. The major reason is that the block structure enables better spatial locality in both the matrix and the vectors in SMVP or SMTVP computations.
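The blocked access pattern underlying this locality argument can be sketched as follows: a single block-organized layout serves both y = y + A x and y = y + A^T x, with the transposed product merely swapping the roles of the local row and column indices. This is a simplified Python model of the idea; the real implementation uses CUDA kernels with atomic additions rather than sequential accumulation:

```python
def block_spmv(blocks, x, out_len, transpose=False):
    """Multiply using a block-organized sparse layout.  `blocks` maps a
    (row_offset, col_offset) pair to a local COO list of (i, j, val)
    triples for that block.  `out_len` is the length of the result
    (number of rows of A, or of A^T when transpose=True).  Both
    products traverse memory in the same block order, which is why the
    two kernels achieve similar throughput."""
    y = [0.0] * out_len
    for (bi, bj), triples in blocks.items():
        for i, j, v in triples:
            if transpose:
                # a column of A becomes a row of A^T
                y[bj + j] += v * x[bi + i]
            else:
                y[bi + i] += v * x[bj + j]
    return y
```

Because all non-zeros of one block touch only a short contiguous range of x and y, the target entries of y are likely already cached when consecutive rows of the block are processed.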
The CSB data structure stores the non-zeros of adjacent rows and columns together. Given dense blocks in a sparse matrix, adjacent elements of the vectors x and y have a better chance of being reused after they are loaded into the cache. In fact, when computing y[i] += A[i,j] * x[j], the neighboring elements of y[i] are automatically loaded into the cache. Because we compute multiple rows inside a block, it is highly likely that the target entries of y are already cached. In the case of CSR-based computing, the locality in vector elements also exists but can hardly be fully exploited, because programmers cannot explicitly control the scheduling of warps.

6.2.3. Computation time

In this section, we report the computation time for single-precision data, including both kernel time and wall-clock time, on the GTX Titan. Kernel time is the total kernel time of both SMVP and SMTVP, and wall-clock time is the total time of BCG on the GPU, including both the CPU-GPU data transfer time and the kernel time.

Table V. Throughput of single precision on GTX Titan graphics processing unit and CPU (MFLOPS).
(Columns: SMVP and SMTVP throughput for TITAN ecsb, TITAN HYB, TITAN CSR, and CPU; rows: the 11 benchmark matrices and their average. The numeric entries could not be recovered.)
GTX Titan GPU and CPU denote the platform; ecsb, HYB, and CSR denote the matrix representation; SMVP and SMTVP denote the operation. The notation in the other tables follows these rules.
Table VI. Throughput of double precision on GTX Titan graphics processing unit and CPU (MFLOPS).
(Same layout as Table V; the numeric entries could not be recovered.)
SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

Table VII. Throughput of single precision on C2075 graphics processing unit (MFLOPS).
(Columns: SMVP and SMTVP throughput for ecsb, HYB, and CSR; rows: the 11 benchmark matrices and their average. The numeric entries could not be recovered.)
ecsb, expanded compressed sparse block; SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

Kernel time. Figure 11 compares the total kernel time of SMVP plus SMTVP. Our ecsb outperforms HYB on 8 of the 11 benchmark matrices by up to 44.8%, and CSR on 10 of the 11 by up to 82.3%. Three matrices, asic_320k, torso1, and rajat31, have longer kernel times with ecsb than with HYB. The reason is again the highly irregular distribution of non-zeros: some rows of these matrices have significantly more non-zeros, which leads to more frequent collisions in the atomic operations.

Wall-clock time on the GTX Titan GPU. The ecsb structure enables a higher computing throughput for both SMVP and SMTVP, but at the cost of a larger memory footprint. Because the memory overhead implies a longer data transfer time between CPU and GPU, it is necessary to validate whether ecsb really yields a performance advantage in typical application scenarios. We use the BCG algorithm mentioned in Section 1 as a case study and derive a GPU implementation on top of CUSP by incorporating the ecsb data structures and the respective procedures proposed in this work.
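To make the role of the two kernels concrete, the sketch below shows an unpreconditioned bi-conjugate gradient iteration, in which every step performs one product with A and one with A^T, so an imbalance between SMVP and SMTVP throughput stalls the whole solve. The solver used in the paper is the preconditioned BCG built on CUSP; this Python version is only a structural illustration:

```python
def bicg(matvec, matvec_T, b, tol=1e-8, max_iter=100):
    """Unpreconditioned BiCG.  matvec(v) computes A v and matvec_T(v)
    computes A^T v; both are called once per iteration."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                     # residual for x0 = 0
    rt = list(r)                    # shadow residual
    p, pt = list(r), list(rt)
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    rho = dot(rt, r)
    for _ in range(max_iter):
        Ap, Atpt = matvec(p), matvec_T(pt)   # SMVP and SMTVP
        alpha = rho / dot(pt, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rt = [ri - alpha * qi for ri, qi in zip(rt, Atpt)]
        if dot(r, r) ** 0.5 < tol:
            break
        rho_new = dot(rt, r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        pt = [ri + beta * pi for ri, pi in zip(rt, pt)]
        rho = rho_new
    return x
```

Because A stays fixed across iterations while both A v and A^T v are needed, a single storage format that serves both products, such as ecsb, avoids keeping two copies of the matrix or paying for an explicit transposition.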
The BCG procedure is tested on the sparse matrix parabolic_fem, which is derived from finite element analysis. The BCG computation converges in eight iterations with an accuracy tolerance of 1e-5. We also implement the HYB and CSR based procedures in the framework of CUSP. Figure 12 compares the wall-clock times of the three data structures. The reported wall-clock time includes both the GPU computation time and the CPU-GPU data transfer
time.

Figure 11. Total kernel time of both sparse matrix-vector product and sparse matrix-transpose vector product (single-precision data on the GTX Titan graphics processing unit).

Figure 12. Wall-clock time of the preconditioned bi-conjugate gradient algorithm for the matrix parabolic_fem (single-precision data on the GTX Titan graphics processing unit).

The ecsb data structure outperforms HYB and CSR by 6% and 25%, respectively. The results suggest that the ecsb data structure delivers a higher computing throughput despite the memory overhead.

6.2.4. Utilization

The effectiveness of a data structure and its manipulation procedures is reflected in the utilization of the GPU hardware resources. Utilization is defined as the ratio of the average number of active warps per cycle to the maximum number of warps allowed on a multiprocessor [22]. We used the NVIDIA Visual Profiler [22] to collect the utilization data. Figures 13 and 14 give the single-precision utilization values of SMVP and SMTVP over all benchmark matrices. Clearly, the ecsb data structure enables a higher utilization of the GPU hardware for both SMVP and SMTVP, which suggests that ecsb is better suited to GPU computing and likely to be more scalable.

7. RELATED WORK

Sparse matrix-transpose vector product is a commonly used computing pattern that is already supported by the BLAS package [23]. Traditionally, SMTVP is solved by a transposition followed by an SMVP procedure, so its solution is closely related to work on SMVP. Bell and Garland [6] developed an efficient solution for sparse matrix-vector multiplication on GPUs. The key idea is to use a warp to compute the inner product of a row of the sparse matrix with the input vector. Such a solution strikes a good balance between parallelism and load balancing. Meanwhile,
the summation operation in the computation of the inner product is realized by an intra-warp reduction. This work dramatically improved the performance of GPU-based SMVP computation. It was released in an open-source package, CUSP [6], and later incorporated into NVIDIA's CUSPARSE package [7].

Figure 13. Average utilization of sparse matrix-vector product on the GTX Titan graphics processing unit.

Figure 14. Average utilization of sparse matrix-transpose vector product on the GTX Titan graphics processing unit.

Our CSR-based atomic addition builds on CUSP [6] because its source code is available. CUSP [6] solves the SMTVP problem by transposition followed by an SMVP procedure, whereas the CSR-based atomic addition takes a different approach and delivers a much higher throughput [8]. Many optimization techniques have also been introduced for GPU-based SMVP. Choi et al. [15] developed a model-driven auto-tuning technique to compute SMVP on GPUs. They used the blocked compressed sparse row and blocked ELLPACK (BELLPACK) formats and achieved up to 29.0 GFLOPS in single precision and 15.7 GFLOPS in double precision on an NVIDIA T10 processor (C1060). Deng et al. [14] proposed data-reordering techniques for GPU-accelerated SMVP computation, which are especially suited to very sparse matrices. Su and Keutzer [16] developed the Cocktail format, a new sparse matrix format, and its OpenCL [24] implementation, clSpMV; the framework analyzes a sparse matrix at runtime and recommends a suitable format for a given GPU. CUSPARSE [7] has also been heavily optimized, but such optimizations are orthogonal to the techniques proposed in CUSP [6]. Tang et al. [17] proposed compressing row/column indices to reduce memory traffic for an improved throughput on GPUs. Fu et al. [25] presented a parallel FEM solver for unstructured meshes and observed speedups of up to 87 times.
However, none of the aforementioned works achieves the same throughput
for both SMVP and SMTVP on the GPU, because commonly used sparse formats do not allow efficient row-wise and column-wise accesses at the same time. Buluç et al. [5] proposed a novel solution to simultaneously compute SMVP and SMTVP on multi-core platforms. The work is based on a new data structure, CSB, for storing sparse matrices; the block size is adapted to the size of the last-level cache so as to enhance the locality of data accesses. This approach is effective on multi-core CPUs but delivers only limited performance on GPUs, because a costly pre-processing step is needed to traverse all non-zeros and change their storage layout. Gomez-Luna et al. [26] provided a collision model for atomic functions on GPU shared memory. A key observation is that collisions in shared memory are equivalent to the bank-conflict problem. A collision model for global memory is still open, and that problem is more complicated.

8. CONCLUSION AND FUTURE WORK

Sparse matrix-vector product and SMTVP are two common computing patterns in sparse matrix based algorithms. Despite the progress in GPU-based sparse matrix computations, it remains challenging to compute both with a similar throughput on GPUs. In this work, we propose a GPU-oriented CSB data structure, designated ecsb, together with an atomic-operation-based computing procedure, to narrow the gap between SMVP and SMTVP computations. We also extend previous work by using a hybrid storage format within ecsb. Systematic experiments validate the proposed techniques: ecsb enables a higher throughput for SMVP and SMTVP than previous works on Kepler GPUs, and the two products can be computed in much closer times. We also test our work on the BCG algorithm, which further validates the effectiveness of ecsb. In the future, we plan to extend this research in several directions.
First, we will develop a performance model for atomic operations that accounts for the overhead of both collisions and cache misses; it will help construct adaptive algorithms for SMVP and SMTVP computations. Second, we will refine the ecsb data structure to reduce its memory overhead. Finally, we will explore heterogeneous methods involving both the CPU and the GPU for CSB-based computing; in particular, we want to investigate using the CPU to recursively partition the input sparse matrix into suitably small blocks and then launching the GPU for massively parallel execution.

ACKNOWLEDGEMENTS

The authors would like to thank Professor Songping Wu of Beihang University for discussions on the BCG algorithm. The work is partially supported by the National Natural Science Foundation of China under grant no. , the National Natural Science Foundation of China under grant no. , the Fundamental Research Funds for the Central Universities under grant no. YWF-14-JSJXY-14, Beijing Natural Science Foundation under grant no. , and the fund of the State Key Laboratory of Software Development Environment under grant no. SKLSDE-2014ZX-13.

REFERENCES

1. Duff SI. A survey of sparse matrix research. Proceedings of the IEEE 1977; 65(4). DOI: /PROC.
2. Vuduc R, Gyulassy A, Demmel JW, Yelick KA. Memory hierarchy optimizations and performance bounds for sparse A^T Ax. Proceedings of the 2003 International Conference on Computational Science: Part III (ICCS), June 2003. Springer: Berlin, Heidelberg, 2003.
3. Saad Y. Iterative Methods for Sparse Linear Systems (2nd edn). SIAM: Philadelphia, USA, 2003.
4. Barrett R, Berry M, Chan TF, Demmel J, Donato JM, Dongarra J, Eijkhout V, Pozo R, Romine C, Vorst HVD. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. (Available from: org/templates/templates.pdf) [Accessed 6 May 2014].
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationExploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture K. Akbudak a, C.Aykanat
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationHow naive is naive SpMV on the GPU?
How naive is naive SpMV on the GPU? Markus Steinberger, Andreas Derler, Rhaleb Zayer and Hans-Peter Seidel Max Planck Institute for Informatics Saarbrücken, Germany Email: {msteinbe,rzayer,hpseidel}@mpi-inf.mpg.de
More informationAutomatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation
Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Dominik Grewe Institute for Computing Systems Architecture School of Informatics University
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationFlexible Batched Sparse Matrix-Vector Product on GPUs
ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationLecture 6: Input Compaction and Further Studies
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationSummary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory
Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Background to Workshop Organised by CERN OpenLab and held in IT Auditorium,
More informationReport of Linear Solver Implementation on GPU
Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationGenerating Optimized Sparse Matrix Vector Product over Finite Fields
Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationStorage Formats for Sparse Matrices in Java
Storage Formats for Sparse Matrices in Java Mikel Luján, Anila Usman, Patrick Hardie, T.L. Freeman, and John R. Gurd Centre for Novel Computing, The University of Manchester, Oxford Road, Manchester M13
More informationHighly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs
Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationEvaluation Of The Performance Of GPU Global Memory Coalescing
Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea
More information