
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2015; 27:3771-3788
Published online 7 October 2014 in Wiley Online Library (wileyonlinelibrary.com)

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Yuan Tao 1,2,3, Yangdong Deng 4, Shuai Mu 4, Zhenzhong Zhang 1,2, Mingfa Zhu 1,2,*, Limin Xiao 1,2 and Li Ruan 1,2

1 State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
2 School of Computer Science and Engineering, Beihang University, Beijing, China
3 College of Mathematics, Jilin Normal University, Jilin, China
4 School of Software, Tsinghua University, Beijing, China

*Correspondence to: Mingfa Zhu, School of Computer Science and Engineering, Beihang University, Beijing, China. E-mail: zhumf@buaa.edu.cn

SUMMARY

Many high performance computing applications require computing both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi-core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (ecsb), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enabling a high computing throughput. We also use a hybrid storage format to store the elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU deliver similar throughputs on both SMVP and SMTVP, up to 13 times faster than the CPU-based CSB implementation. In addition, our ecsb procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively. We also validate the effectiveness of ecsb by means of the wall-clock time of the bi-conjugate gradient algorithm: ecsb is 25% faster than compressed sparse rows (CSR) and 6% faster than HYB. Copyright 2014 John Wiley & Sons, Ltd.

Received 31 October 2013; Revised 23 July 2014; Accepted 10 September 2014

KEY WORDS: sparse matrix-transpose vector product; sparse matrix-vector product; compressed sparse block; CSB; compressed sparse rows; CSR; GPU

1. INTRODUCTION

Sparse matrices are extensively used in high performance computing applications because they naturally reflect the characteristics of a wide range of scientific and engineering problems [1]. Among the various computation patterns of sparse matrices, an especially interesting one is to compute both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) in the same application with the underlying matrices stored in a single format [2, 3]. Such a pattern is exemplified by the bi-conjugate gradient algorithm (BCG) [3, 4]. As shown by the BCG pseudo-code in Figure 1, SMVP and SMTVP are computed in lines 7 and 8, respectively. Because these two steps have no data dependency, they can be assigned to two computing nodes for parallel processing.
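A minimal, self-contained sketch of the unpreconditioned BCG iteration, following Saad [3], illustrates the point; the COO storage and all helper names here are illustrative assumptions, not the exact pseudo-code of Figure 1.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Coo { int n; std::vector<int> row, col; std::vector<double> val; };

/* q = A*p (SMVP) and q = A^T*p (SMTVP): same triplets, row/col roles swapped. */
static void smvp(const Coo& A, const std::vector<double>& p, std::vector<double>& q) {
    std::fill(q.begin(), q.end(), 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k) q[A.row[k]] += A.val[k] * p[A.col[k]];
}
static void smtvp(const Coo& A, const std::vector<double>& p, std::vector<double>& q) {
    std::fill(q.begin(), q.end(), 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k) q[A.col[k]] += A.val[k] * p[A.row[k]];
}
static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

/* Solve A x = b starting from x = 0; breakdown checks omitted for brevity. */
void bcg(const Coo& A, const std::vector<double>& b, std::vector<double>& x,
         int maxIter, double tol) {
    std::vector<double> r = b, rt = b, p = b, pt = b, q(A.n), qt(A.n);
    double rho = dot(rt, r);
    for (int it = 0; it < maxIter && dot(r, r) > tol * tol; ++it) {
        smvp (A, p,  q);    /* "line 7": q  = A   * p  -- independent of ...      */
        smtvp(A, pt, qt);   /* "line 8": qt = A^T * pt -- ... this call, so the
                               two products can run on two nodes concurrently     */
        double alpha = rho / dot(pt, q);
        for (int i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; rt[i] -= alpha * qt[i]; }
        double rhoNew = dot(rt, r), beta = rhoNew / rho;
        for (int i = 0; i < A.n; ++i) { p[i] = r[i] + beta * p[i]; pt[i] = rt[i] + beta * pt[i]; }
        rho = rhoNew;
    }
}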
Under such a circumstance, a key requirement is a similar computation throughput for the two problems.

Figure 1. Bi-conjugate gradient algorithm [3, 4].

Meanwhile, both should be performed in a highly efficient manner, while the underlying sparse matrix should be stored in a single format to save memory space. Buluç et al. [5] proposed the so-called compressed sparse block (CSB) format so that both the SMVP and SMTVP procedures can be processed with similar throughputs on multi-core CPUs. Recently, significant breakthroughs have been made in exploiting graphics processing units (GPUs) for sparse matrix computations. Notably, the seminal work of Bell and Garland [6] proved that a throughput of 16 GFLOPS in single precision could be achieved on a single GPU node for SMVP. On the other hand, the problem of computing SMVP and SMTVP in a single format has not been extensively studied. At present, SMTVP is typically solved by a transposition followed by an SMVP procedure. Such a strategy was adopted by a leading-edge GPU-based sparse matrix library, CUSPARSE [7]. The significant overhead of transposing a sparse matrix, however, makes the strategy less efficient. Our experiments using CUSPARSE evidenced that the transposition cost dominates the overall computation time [8]. The other solution for SMTVP is based on the atomic addition operation of NVIDIA GPUs with compute capability 2.0 or higher. Experiments show that SMTVP by atomic additions can outperform CUSPARSE by up to eightfold [8], but there would still be a gap of up to sevenfold between SMVP and SMTVP [8]. It is thus natural to consider porting CSB to GPUs to solve the aforementioned problem. However, a straightforward porting will not deliver sufficient computing throughput on GPUs because of its inherent irregularity. Accordingly, we investigate efficient GPU solutions to the problem of computing both SMVP and SMTVP in this work. We first perform experiments to identify the performance gap between SMVP and SMTVP on GPUs. On such a basis, we re-design the CSB data structure, designated as the expanded CSB (ecsb), to fully take advantage of GPU SIMD hardware. We also adopt the hybrid storage format proposed by Bell and Garland [6] for storing a block of ecsb and develop a heuristic to select among the different expressions of ecsb. Our experimental results prove that the proposed data structure and the corresponding GPU implementation enable similar computation times for both SMVP and SMTVP at a throughput that is up to 12 times faster than that of a 4-core, 8-thread CPU-based implementation of CSB. Our techniques can also be faster than the previous GPU-based CSR and HYB [6] procedures on over half of the sparse matrix benchmarks.

The main contributions of this work are as follows.
- We redesigned the CSB data structure for GPU-based computing to achieve high-performance SMVP and SMTVP at the same time.
- We propose a dynamic selection algorithm to determine the proper data storage format for ecsb according to the distribution of non-zeros.
- We validate the effectiveness of the proposed ecsb data structure with extensive experiments, comparing it with existing data structures in terms of the ratio between SMVP and SMTVP computation throughput, the throughput itself, the total kernel execution time of SMVP and SMTVP, the wall-clock time of the BCG algorithm, and hardware utilization.

The remainder of the paper is organized as follows. The preliminaries of this work are introduced in Section 2. Section 3 analyzes the challenges of porting CSB to GPUs. Sections 4 and 5 explain the ecsb data structure and its efficient GPU implementation for computing both SMVP and SMTVP. Section 6 reports and discusses the experimental results. We review related work in Section 7. In the last section, we conclude the paper and outline future work.

2. PRELIMINARIES

In this section, we introduce the preliminaries related to this work. First, we explain the basic idea of CSB. Then, the fundamental concepts of GPU computing are discussed.

2.1. Compressed sparse block

Sparse matrices allow using a compressed format that stores only the non-zero elements to save memory space. Such formats usually remove zero elements along either the row or the column direction. However, they cannot guarantee efficient access to non-zeros along both the row and column directions, and they thus lead to a performance gap between SMVP and SMTVP. Buluç et al. [5] introduced CSB to support efficient computation of both SMVP and SMTVP on CPUs. In CSB, a matrix A is partitioned into square blocks of identical size. An example matrix and its CSB encoding are illustrated in Figure 2. The definition of the CSB data structure is listed in Figure 3. It consists of three arrays: blk_ptr, comb_lowbits, and val. Array blk_ptr is a two-dimensional array recording the offset of the first non-zero of each block. The array comb_lowbits stores the concatenated row and column indexes of the non-zeros in a block (in Figure 2, row and column indexes are shown separately). Array val stores the non-zeros. In CSB, a row (column) of blocks is designated as a blockrow (blockcolumn). The CSB computation is designed for coarse-grain parallel execution on multi-core CPUs.

Figure 2. Compressed sparse block format with bit-interleaved ordering within blocks [5].
Figure 3. Structure of CPU-based compressed sparse block.
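In place of the listing of Figure 3, the following is a hedged sketch of the container just described; the field names follow the text, while the concrete types are assumptions.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    int        nrows, ncols;   /* matrix dimensions                           */
    int        beta;           /* block dimension: blocks are beta x beta     */
    size_t   **blk_ptr;        /* blk_ptr[i][j]: offset in val of the first
                                  non-zero of block (i,j)                     */
    uint32_t  *comb_lowbits;   /* per non-zero: intra-block row and column
                                  indexes concatenated into one word          */
    double    *val;            /* non-zero values                             */
} csb_matrix;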

Figure 4. Various forms of parallelism used in parallel compressed sparse block computing.

The pseudo-codes for computing SMVP and SMTVP are listed in Algorithms 1 and 2. The algorithms use a rather complicated procedure to recursively partition every block into groups of non-zeros. The procedure consists of two steps. The first step packs blocks into a linear array according to the density of each block. If the density of a block is bigger than a threshold, it is packed into the linear array as a single element. Otherwise, multiple sparse blocks are combined into one element. The second step performs SMVP or SMTVP in parallel on the elements of the linear array. Given an element corresponding to a dense block, the block may need to be divided into multiple sub-blocks until the density of the sub-blocks is smaller than a threshold. Then, the sub-blocks can be handled concurrently. The basic element of parallel execution is a group, containing either a dense sub-block or multiple sparse blocks. Such an element is manipulated by a CPU thread. The aforementioned process is illustrated in Figure 4, and a simplified sketch of the packing step follows. Note that CSB offers good locality for both SMVP and SMTVP.
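As Algorithms 1 and 2 are not reproduced here, this is a simplified host-side sketch of the first, packing step as described above; the names and the density test are assumptions, not the exact code of Buluç et al. [5].

#include <cstddef>
#include <vector>

struct WorkItem { int firstBlk, lastBlk; };  /* one dense block, or a run of sparse blocks */

std::vector<WorkItem> packBlockrow(const std::size_t* blkPtr, int nblocks, std::size_t threshold)
{
    std::vector<WorkItem> items;
    int start = 0;
    std::size_t acc = 0;
    for (int j = 0; j < nblocks; ++j) {
        std::size_t nnz = blkPtr[j + 1] - blkPtr[j];   /* non-zeros of block j        */
        if (nnz >= threshold) {                        /* dense block: its own item,  */
            if (acc > 0) { items.push_back({start, j - 1}); acc = 0; }
            items.push_back({j, j});                   /* split further at execution  */
            start = j + 1;
        } else if ((acc += nnz) >= threshold) {        /* enough sparse blocks merged */
            items.push_back({start, j});
            acc = 0;
            start = j + 1;
        }
    }
    if (start < nblocks) items.push_back({start, nblocks - 1});
    return items;
}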

Figure 5. A generic architecture of graphics processing units [13].

2.2. General purpose computing with graphics processing units

Recently, GPUs have emerged as a new general purpose computing platform [9, 10]. GPUs were originally designed as application-specific integrated circuits for graphics rendering. Pushed by the relentless pursuit of better visual experiences, GPUs evolved to offer both high programmability and superior computing throughput. Now, the two major GPU vendors, NVIDIA and AMD, both offer GPU hardware and software designs with support for general purpose computing. We use NVIDIA GPUs and the corresponding programming model, CUDA [11], in this work.

Modern GPUs adopt a single program multiple data [12] programming model. A GPU consists of tens of multiprocessors, each equipped with multiple lanes of computing logic. A multiprocessor fetches each instruction and dispatches it to all lanes, which then operate on different data. To hide latency, the single program multiple data model is enhanced with a multithreading mechanism. A GPU program follows a three-tier organization. A thread is the basic unit of parallel execution. Multiple threads constitute a thread block, which will be assigned to a multiprocessor for execution. A number of thread blocks form a grid and are scheduled by a scheduler on the GPU. Threads in a block can synchronize with each other and exchange data through a fast on-chip memory. A warp of threads is executed on a multiprocessor as a single unit in which each thread is mapped to a different lane. A warp is suspended at a memory access instruction, and another warp with all data ready then resumes execution. The multiprocessors execute the program independently of one another.

Graphics processing unit memory is more complicated than its CPU counterpart. The instruction operands are stored in a register file located on a multiprocessor, which also has a shared memory. The shared memory works as a software-controlled cache. Each multiprocessor also has its own L1 instruction and data caches. The whole GPU has a unified L2 cache. GPU main memory, that is, global memory, is installed off-chip and connected to the GPU through a wide dedicated bus. The global memory has a long latency but offers a high bandwidth. To hide latency, modern GPUs adopt a memory coalescing mechanism. When a warp of threads executes a memory instruction, the corresponding memory requests can be merged into a smaller number of references if they fall into a single segment of memory space aligned at multiples of 64 bytes. The computing power of GPUs depends on the capability of concurrently running a massive number of threads. It is thus essential to guarantee the correctness of concurrent accesses to a single address by multiple threads. Recent GPUs support atomic functions that complete a memory read-modify-write without being interrupted by other requests to the same address.

In NVIDIA GPUs, the atomic function for global memory is realized in the L2 cache. All atomic operations to the same address of global memory are buffered in a first-in first-out queue and processed by an arithmetic logic unit (ALU) in a serialized manner, as shown in Figure 5.

Graphics processing units exhibit different characteristics from multi-core CPUs. First of all, the computing cores on a GPU are much simpler than those deployed on multi-core CPUs. These cores are organized into single instruction multiple data (SIMD) lanes to deliver high computing throughput. The simpler cores have less hardware support for memory pre-fetching and branching instructions. Second, GPUs exploit a massive number of threads to hide memory latency, while the caches deployed on GPUs have much less capacity than their equivalents on multi-core CPUs. CPUs, on the other hand, heavily depend on caches to hide memory latency but only support a relatively small number of threads. Third, GPU threads are directly supported by hardware. In other words, every active thread (i.e., a thread that is already assigned to a GPU core for execution) has its own registers to store program state, so no context switching is necessary when a thread is suspended or activated. This is drastically different from CPU threads.

3. CHALLENGES OF COMPUTING BOTH SPARSE MATRIX-VECTOR PRODUCT AND SPARSE MATRIX-TRANSPOSE VECTOR PRODUCT WITH GRAPHICS PROCESSING UNITS

Graphics processing units have been extensively used in many scientific and engineering applications. However, GPUs were originally designed to support highly regular workloads. The irregular data structures exemplified by sparse matrices pose considerable challenges, requiring programmers to design algorithms that can effectively extract data level parallelism. Many research works have been proposed to improve the efficiency of GPU computing on sparse matrices (e.g., [6, 7, 14-17]). Among these works, CUSPARSE systematically addresses the problem of SMVP and provides routines for popular sparse matrix formats. Currently, SMVP can be computed with a throughput of 16 GFLOPS in single precision on a single GPU [6]. To solve the problem of computing SMVP and SMTVP on GPUs with a single format, there are two basic approaches. The first one is to choose an existing sparse matrix format and develop a corresponding procedure, while the second is to port the CSB data structure to GPUs. In the remainder of this section, we investigate the feasibility of these two approaches.

3.1. Computing sparse matrix-vector product and sparse matrix-transpose vector product with existing sparse matrix formats

There exist many sparse matrix formats, among which COO, CSR, and ELL [3] are representative. COO stores the row and column indexes of each non-zero, while CSR compresses the row index information by only storing the information needed to locate the first non-zero in each row. ELL, or Ellpack format, uses a fixed row width to store non-zeros. If a row has fewer non-zeros, it has to be padded with zeros. ELL needs two arrays to store the column indexes and the non-zeros. Given structured matrices, ELL is more appropriate for GPU computing because the regular row width enables a predictable computing pattern. However, too many zeros may need to be padded when processing unstructured matrices.
The HYB format was introduced by Bell and Garland [6] for efficient computation of SMVP on GPUs by combining ELL and COO. In HYB, most non-zeros of each row are stored in the ELL format, while the remaining non-zeros beyond the ELL row width are recorded in COO. Figure 6 illustrates the three storage formats for an example matrix.

Figure 6. Comparison of common sparse matrix formats.

From Figure 6, it can be seen that the existing formats cannot enable efficient access to non-zeros along both the row and column directions, while efficient computing of both SMVP and SMTVP with a single format does need quick access in both directions. In addition, the available formats cannot guarantee efficient use of the cache in both SMVP and SMTVP.
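As the example matrix of Figure 6 is not reproduced here, the following small stand-in (values chosen for illustration, not taken from the paper) shows how one matrix is laid out in the three formats discussed above; HYB would keep the first, full column of the ELL arrays and spill the remaining non-zeros to COO.

/*        | 1 0 2 |
 *    A = | 0 3 0 |
 *        | 4 5 0 |
 */
/* COO: explicit row and column index per non-zero. */
int    coo_row[] = {0, 0, 1, 2, 2};
int    coo_col[] = {0, 2, 1, 0, 1};
double coo_val[] = {1, 2, 3, 4, 5};
/* CSR: row_ptr[i] .. row_ptr[i+1] delimit row i in col/val. */
int    csr_row_ptr[] = {0, 2, 3, 5};
int    csr_col[]     = {0, 2, 1, 0, 1};
double csr_val[]     = {1, 2, 3, 4, 5};
/* ELL: fixed row width (2 here) with padding; stored column-major so that a
 * warp reads the k-th entries of consecutive rows in one coalesced access.  */
int    ell_col[] = {0, 1, 0,  2, -1, 1};   /* -1 marks padding */
double ell_val[] = {1, 3, 4,  2,  0, 5};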

3.2. Porting compressed sparse block to graphics processing units

The CSB data structure proves to be efficient on multi-core CPUs because it better utilizes the locality of the non-zeros of the input matrix and the vectors. A straightforward porting of CSB to GPUs, however, turns out to be inefficient for several reasons. First, the CSB algorithm performs a relatively complex recursive partition process to decompose the original matrix into small chunks. Here, one chunk can be a row of blocks, a few blocks along a row, or a partial block, according to the number of non-zeros available inside. As illustrated in Figure 4 and elaborated in Section 2.1, the computing procedure actually uses multiple forms of dynamic parallelism existing among different rows, sub-rows, and sub-blocks. Such a combination of various types of dynamic parallelism suggests that this scheme is better suited to coarse-grain multithreaded execution on multi-core CPUs. GPUs are built on the foundation of SIMD execution, which requires all parallel threads to run exactly the same program. Second, CSB operations are performed on blocks. Such blocks have a varying number of non-zeros and have to be processed as a single unit. As a result, we generally have to use one warp to handle one block. Given a block with few non-zeros, however, the computing resources are substantially wasted. In addition, the memory bandwidth is also under-utilized because a warp accesses memory as a complete unit. Third, the implementation of Buluç et al. [5] depends on Cilk [18] to distribute computing tasks to cores at runtime for dynamic load balancing. Such dynamic schedulers are not available on GPUs yet; instead, we have to carefully handcraft a thread scheduling mechanism to obtain a good load balance.

4. THE EXPANDED COMPRESSED SPARSE BLOCK DATA STRUCTURE

Considering the problems discussed in Section 3.2, a natural question arises: what data structures are appropriate for GPU-based CSB computations? On the one hand, the block structure is necessary to maintain the two-dimensional parallelism that is essential to compute both SMVP and SMTVP in an efficient manner. In other words, it is desirable to have blk_ptr[i][j+1] - blk_ptr[i][j] as the number of non-zeros of block (i,j), where i is the offset of the blockrow and j is the offset of the blockcolumn. On the other hand, we want to break the block structure so that the number of wasted threads (i.e., threads assigned zero elements) can be minimized. To meet these seemingly contradictory requirements, we redesigned the CSB data structure as follows. The values of the non-zeros are kept in an array val. The original two-dimensional array blk_ptr is converted into a one-dimensional array in GPU memory, recording the position of the first non-zero of each blockrow in array val. A new array, block_ind, stores the block index of each non-zero element. Such an arrangement simplifies processing, because when handling a non-zero, a thread can easily determine which block it belongs to. In fact, an element of the array block_ind is the concatenation of the row and column indexes of the block. The array comb_lowbits is the same as its equivalent in CSB. The proposed ecsb data structure is listed in Figure 7.

Figure 7. Expanded compressed sparse block data structure.
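In place of the listing of Figure 7, a hedged sketch of the container just described; the field names follow the text, while the concrete types are assumptions.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    int       nrows, ncols;    /* matrix dimensions                              */
    int       beta;            /* block dimension: blocks are beta x beta        */
    size_t   *blk_ptr;         /* 1-D: offset in val of the first non-zero of
                                  each blockrow                                  */
    uint32_t *block_ind;       /* per non-zero: blockrow and blockcolumn indexes
                                  concatenated into one word                     */
    uint32_t *comb_lowbits;    /* per non-zero: intra-block row and column
                                  indexes concatenated, as in CSB                */
    float    *val;             /* non-zero values (double for double precision)  */
} ecsb_matrix;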

A key idea of ecsb is that we preserve the CSB-based ordering of non-zeros but handle a blockrow as the basic unit of parallel processing. One critical issue is that the distribution of non-zeros may vary considerably. Some sparse matrices have a relatively uniform distribution of non-zeros among rows, while others may have a few denser rows and a large number of sparse rows. Accordingly, we extend the hybrid format, HYB, proposed by Bell and Garland [6] to adapt to the non-zero distribution. Given a sparse matrix with a relatively regular non-zero distribution, we use the ELL format because it helps improve load balance; in our implementation, we do not perform padding with zeros when the number of non-zeros is less than the row width. We use the COO format for highly irregular sparse matrices. A hybrid format is used for all remaining matrices; the basic idea is to store the non-zeros within a certain row width in the ELL format and the other non-zeros in the COO format. We propose a heuristic to choose the proper format. Note that the ELL and COO formats are built on top of CSB because the non-zeros are arranged in the ordering defined within a block. The non-zeros in each block can be stored in different orders, among which row-major and column-major orderings are commonly used. The original CSB data structure uses a bit-interleaving, or Z-Morton, ordering [5]. With the row-major ordering, we would need to compute the SMVP problem with atomic operations (the details are discussed in Section 5); atomic addition would likewise be needed for SMTVP under the column-major ordering. Both orderings introduce collisions and lower the overall throughput, so we adopt the bit-interleaving ordering for GPU-based storage.

5. GRAPHICS PROCESSING UNIT IMPLEMENTATION

Because ecsb exploits a hybrid format, we develop both ELL and COO procedures for SMVP and SMTVP. The remainder of this section explains the details. We also propose the heuristic to choose a proper format from the three possible combinations: ELL, COO, and HYB.

5.1. ELL-based sparse matrix-vector product and sparse matrix-transpose vector product

ELL was designed to support relatively regular sparse matrices. Non-zeros are compressed into a dense matrix with a fixed row width. In this work, one row of ELL corresponds to a blockrow of ecsb. We use one thread block to handle one row. One thread computes the product of a non-zero and a vector element in one iteration, and the threads in a block iterate through a single row. Figure 8(a) and (b) lists the CUDA pseudo-code for computing the SMVP and SMTVP of the ELL part of a sparse matrix encoded in the ecsb data structure.

Figure 8. CUDA pseudo-code of the ELL parts of the expanded compressed sparse block based format for computing y = y + A*x and y = y + A^T*x.

In the code listed in Figure 8(a), line 4 derives the starting position of the first non-zero in the current blockrow via the offset (line 2) stored in the blk_ptr array. The offset is also used to extract the combined word of the row and column indices of the current block (line 6) and the combined word of the intra-block row and column indices (line 7). In the loop started at line 8, a thread fetches multiple non-zeros and the corresponding vector elements. It then performs multiplications to generate a series of intermediate results that must be summed together. Note that there is no guarantee on the order of processing by multiple threads on a massively parallel platform like a GPU. In addition, the random distribution of the numbers of non-zeros makes it infeasible to store the intermediate results in the shared memory directly. As a result, we resort to atomic operations to add the intermediate results to the corresponding entries of the product vector located in shared memory.
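As the listing of Figure 8 is not reproduced here, the following is a hedged CUDA sketch of the ELL-part SMVP kernel reconstructed from the description above. The block dimension BETA, the bit layouts of block_ind and comb_lowbits, and the padding of the vector dimensions to a multiple of BETA are assumptions, not the paper's exact code; the line numbers cited in the text are marked in comments.

#define BETA      1024                      /* assumed block dimension             */
#define LOW_BITS  10                        /* log2(BETA): index width per axis    */
#define LOW_MASK  ((1u << LOW_BITS) - 1u)
#define BCOL_MASK 0xFFFFu                   /* assumed: blockcolumn in low 16 bits */

__global__ void ecsb_ell_smvp(const int      *blk_ptr,       /* first non-zero of each blockrow */
                              const unsigned *block_ind,     /* packed (blockrow, blockcolumn)  */
                              const unsigned *comb_lowbits,  /* packed intra-block (r, c)       */
                              const float    *val,
                              const float    *x,
                              float          *y)              /* one thread block per blockrow  */
{
    __shared__ float y_tile[BETA];           /* this blockrow's slice of y         */
    for (int i = threadIdx.x; i < BETA; i += blockDim.x) y_tile[i] = 0.0f;
    __syncthreads();

    int begin = blk_ptr[blockIdx.x];         /* lines 2 and 4: offset and start    */
    int end   = blk_ptr[blockIdx.x + 1];
    for (int k = begin + threadIdx.x; k < end; k += blockDim.x) {   /* line 8 */
        unsigned bcol = block_ind[k] & BCOL_MASK;                   /* line 6 */
        unsigned r = (comb_lowbits[k] >> LOW_BITS) & LOW_MASK;      /* line 7 */
        unsigned c =  comb_lowbits[k] & LOW_MASK;
        /* line 16-f: single-precision atomic add into the shared tile */
        atomicAdd(&y_tile[r], val[k] * x[bcol * BETA + c]);
    }
    __syncthreads();
    for (int i = threadIdx.x; i < BETA; i += blockDim.x)     /* flush the tile:    */
        y[blockIdx.x * BETA + i] += y_tile[i];               /* this thread block
                                                                owns the blockrow  */
}

The SMTVP kernel of Figure 8(b) would follow the same pattern but, because one blockrow scatters contributions across the whole output vector, it performs its atomic additions directly on y in global memory, as explained below.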

Current CUDA compilers only support atomic addition on integers and single-precision floating-point numbers. For double-precision numbers, one way is to reinterpret them as integers of long long type, for which an atomic compare-and-swap function, atomicCAS [11], is available to avoid data races. The corresponding CUDA pseudo-code is listed in Figure 9 and reproduced below. In the code, __longlong_as_double and __double_as_longlong at line 3 are CUDA library functions that convert between 64-bit unsigned integer values and double-precision floating-point numbers. The codes of Figure 8(a) at lines 16-f and 16-d are for single- and double-precision floating-point numbers, respectively.

Figure 8(b) lists the code of SMTVP. The difference from SMVP is that the SMTVP procedure has its atomic additions done in global memory because of the wider distribution of the data entries. Non-zeros in a blockrow are stored in a block-by-block order, while inside a block, the order is defined by CSB. As a result, it is feasible to visit the non-zeros of a block along both the row direction and the column direction in a similar manner. Such a similarity accounts for the more comparable throughputs of SMVP and SMTVP.

Figure 9. The original atomic addition by compare and swap for double-precision data (adapted from [11]).
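This compare-and-swap loop is the standard pattern from the CUDA C Programming Guide [11]; a self-contained transcription follows, named atomicAddDouble here so as not to clash with the native double-precision atomicAdd of later GPUs.

__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;                       /* try to swap in (assumed + val)     */
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);                /* retry if another thread intervened */
    return __longlong_as_double(old);
}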

5.2. COO-based sparse matrix-vector product and sparse matrix-transpose vector product

The COO format is used to handle highly irregularly distributed non-zero elements. Here, we do not need the array blk_ptr, because the block index plus the intra-block row and column indices are sufficient. Each thread first computes the product of one non-zero element and the respective entry of x and then atomically adds the product to the corresponding entry of the product vector y. Figure 10(a) and (b) shows the pseudo-code of SMVP and SMTVP for the COO part of ecsb. The difference between Figure 10(a) and (b) lies in lines 7-9, because SMVP differs from SMTVP in the ordering of accesses to the vectors x and y.

Figure 10. CUDA pseudo-code of the COO parts of the expanded compressed sparse block based format for computing y = y + A*x and y = y + A^T*x.
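Again in place of the missing listing, a hedged sketch of the COO-part SMVP kernel, reusing the macros and bit-layout assumptions of the ELL sketch above; it would be launched with one thread per non-zero, e.g. <<<(nnz + 255) / 256, 256>>>.

__global__ void ecsb_coo_smvp(const unsigned *block_ind,
                              const unsigned *comb_lowbits,
                              const float    *val,
                              const float    *x,
                              float          *y,
                              int             nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per non-zero */
    if (k >= nnz) return;
    unsigned brow = block_ind[k] >> 16;              /* assumed bit layout      */
    unsigned bcol = block_ind[k] & BCOL_MASK;
    unsigned r = (comb_lowbits[k] >> LOW_BITS) & LOW_MASK;
    unsigned c =  comb_lowbits[k] & LOW_MASK;
    /* lines 7-9: SMVP gathers from x by the column index and scatters into y
       by the row index; the SMTVP kernel simply swaps the two index roles.   */
    atomicAdd(&y[brow * BETA + r], val[k] * x[bcol * BETA + c]);
}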

5.3. Determining the storage format for expanded compressed sparse block

As explained in Section 4, a sparse matrix encoded in ecsb can be physically stored in ELL, COO, or a hybrid ELL and COO format. Given a sparse matrix, it is an essential problem to determine the best storage format. We propose a heuristic that chooses a proper format according to the characteristics of the non-zero distribution. Bell and Garland [6] proposed the HYB format for storing sparse matrices as well as a procedure to derive the best row width for the ELL part. The procedure starts by removing the rows with the least number of non-zeros. If the number of remaining rows is beyond a threshold, the procedure continues by removing the rows with the second least number of non-zeros, and so on until the threshold is reached. The least number of non-zeros among the remaining rows is then the width of ELL. An important feature of ecsb is that the width of a blockrow varies significantly, because a blockrow actually consists of multiple rows of the original matrix. Hence, we extend Bell and Garland's procedure to select a proper storage format. We first run Bell and Garland's procedure to derive two parameters: the width of the ELL format and the number of remaining rows. We then use the following heuristic to select a format (a sketch is given after the list):

1) If the width of ELL is the maximum width among all rows, we use the ELL format to store the matrix.
2) Otherwise,
   2.1) If the number of remaining rows is less than 20% of the total number of rows, and the width of ELL of ecsb is more than 90% of the maximum width among all rows, we use COO.
   2.2) Otherwise, we use the HYB format of ecsb.

The rationale is that the maximum number of non-zeros among all rows is the main factor affecting the GPU throughput of SMVP and SMTVP. If the width of ELL is still the maximum width among all rows after Bell and Garland's procedure, there are many rows having the maximum number of non-zeros; accordingly, we select the ELL format, because further dividing these rows and using a HYB format would only incur additional overhead. If the non-zeros are highly randomly distributed, we choose the COO format. Otherwise, we use HYB to store the ecsb-based sparse matrix.
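A hedged host-side sketch of this selection; ellWidth and remainingRows are the two parameters produced by Bell and Garland's procedure [6], maxWidth is the maximum number of non-zeros among all rows, and all names here are illustrative.

enum EcsbFormat { ECSB_ELL, ECSB_COO, ECSB_HYB };

EcsbFormat selectFormat(int ellWidth, int maxWidth, int remainingRows, int totalRows)
{
    if (ellWidth == maxWidth)                    /* rule 1: many full-width rows */
        return ECSB_ELL;
    if (remainingRows < 0.2 * totalRows &&       /* rule 2.1: highly irregular   */
        ellWidth > 0.9 * maxWidth)
        return ECSB_COO;
    return ECSB_HYB;                             /* rule 2.2: everything else    */
}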

6. RESULTS AND DISCUSSION

In this section, we evaluate the effectiveness of our techniques for both SMVP and SMTVP. Our results are compared with the multi-core CPU-based CSB implementation by Buluç et al. [5] as well as with the GPU-based CSR implementation provided in the CUSPARSE package and the HYB implementation provided in the CUSP package. We chose HYB because it delivers the highest throughput on unstructured matrices [6] and CSR because it is the most commonly used format. Because CUSPARSE is not open-source, the GPU-based SMTVP procedure is implemented by modifying the code of CUSP, the research prototype of CUSPARSE; this implementation has been shown to offer a higher throughput than the SMTVP of CUSPARSE [8].

6.1. Experimental setup

All experiments in this work are conducted on the platforms listed in Table I. The benchmark matrices are summarized in Table II; they were taken from the University of Florida sparse matrix collection [19]. These matrices exhibit diverse characteristics, and most of them were also used by Buluç et al. in the related work [5].

Table I. Test-beds used in this work.

                        GTX Titan            C2075                Intel Core i7
  Architecture          Kepler               Fermi                x86-64
  Compute capability    3.5                  2.0                  N/A
  Number of cores       2688 stream cores    448 stream cores     4 cores/8 threads
  Peak single-prec.     3.52 TFLOPS          1.03 TFLOPS          -
  Bandwidth             288.4 GB/s           144 GB/s             25.6 GB/s
  Runtime               CUDA 5.0, Thrust     CUDA 5.0, Thrust     Cilk 8503
  ECC support           Disabled             Enabled              Yes
  Operating system      CentOS release 6.2   CentOS release 6.2   Ubuntu LTS
  Kernel                ...el... (x86_64)    ...el... (x86_64)    ...-generic (x86_64)
  CPU compiler          gcc                  gcc                  gcc

Table II. Benchmark matrices used in this work.

  Name           Description                                 Rows/columns   Non-zeros   Non-zeros per column (mean/max)
  asic_320k      Circuit simulation                          321k/321k      2635k       8/203.8k
  parabolic_fem  Parabolic FEM                               525k/525k      2100k       3/7
  kkt_power      Optimal power flow, nonlinear optimization  2063k/2063k    8130k       3/84
  torso1         Finite difference/boundary elements         116k/116k      8516k       73/1224
  sme3dc         3D structural mechanics problem             42k/42k        3148k       73/405
  Rucci1         Ill-conditioned least-squares problem       1977k/109k     7791k       70/108
  bone010        3D trabecular bone                          986k/986k      36326k      36/42
  ldoor          INDEED test matrix                          952k/952k      23737k      24/77
  cont11_l       Linear programming problem                  1468k/1961k    5382k       2/7
  rajat31        Rajat circuit simulation matrix             4690k/4690k    20316k      4/1252
  cage15         DNA                                         5154k/5154k    99199k      19/-

6.2. Experimental results

In this subsection, we present the experimental results. The performance is evaluated with five metrics: the ratio of SMVP to SMTVP throughput, the total throughput of SMVP and SMTVP, the kernel execution time, the wall-clock time of the BCG algorithm, and the hardware utilization. To evaluate the effectiveness of our techniques, we implement CSR- and HYB-based procedures by extending CUSP. We use the function provided in CUSPARSE to compute SMVP on CSR and the one in CUSP to compute it on HYB. The corresponding SMTVP computation is implemented with a reduction-based procedure [8] by tweaking the code of CUSP [6]. We also compare our work with the original CSB results collected on an Intel i7 CPU. All results reported in this subsection are averages over 100 runs.

6.2.1. Ratio of sparse matrix-vector product throughput to sparse matrix-transpose vector product throughput

Tables III and IV compare the ratio of the computation throughputs of SMVP and SMTVP in single precision and double precision, respectively. For GPU-based ecsb, the implementations are tested on the GTX Titan. The CPU results are collected by running eight parallel threads. To better evaluate the results, we use the root mean square deviation (RMSD) [20] to measure the quality of the different approaches in delivering a balanced computation throughput of both SMVP and SMTVP. RMSD is defined as RMSD(θ) = sqrt(E[(θ - θ̂)^2]), where θ̂ is the target ratio, set to 1.0 in this paper, and θ is the ratio collected in our experiments.
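A direct transcription of this definition over the per-matrix throughput ratios (an illustrative helper, not code from the paper):

#include <cmath>
#include <vector>

double rmsd(const std::vector<double>& ratios, double target = 1.0)
{
    double s = 0.0;
    for (double r : ratios) s += (r - target) * (r - target);   /* (theta - theta_hat)^2 */
    return std::sqrt(s / ratios.size());
}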

Table III. Comparison of the ratios of SMVP and SMTVP throughputs on single-precision data, for ecsb, CSR, and HYB on the GTX Titan GPU and for CSB on the CPU, together with the root mean square deviation of each approach over the 11 benchmark matrices. CSB, compressed sparse block; ecsb, expanded compressed sparse block.

Table IV. Comparison of the ratios of SMVP and SMTVP throughputs on double-precision data, for the same four configurations and benchmark matrices as in Table III.

The results in Table III prove that the proposed techniques enable a highly balanced computation throughput of SMVP and SMTVP. The ecsb-based results are significantly better than those based on CSR and HYB. In fact, ecsb allows an even better RMSD than the original CSB algorithm implemented on a multi-core CPU. Two matrices, asic_320k and torso1, have relatively unbalanced ratios. The reason is that these matrices have a few denser rows that lead to more collisions. We also evaluate the ratio of computation throughputs for double-precision data, and the results are listed in Table IV. The performance difference is bigger than in the single-precision case, but the results of ecsb are still significantly more balanced than those with CSR and HYB. The larger difference between SMVP and SMTVP is due to the lack of direct support for atomic operations on double-precision data. Current GPUs only allow the generic compare-and-swap procedure illustrated in Figure 9 for double-precision data [11]. The resulting atomic operations are considerably slower than those on single-precision data, whose atomic additions can be performed by the raster operation unit (raster operations processor) in the level 2 cache [21].

6.2.2. Computing throughput

In this subsection, we report the computing throughputs of both SMVP and SMTVP. We collect single-precision and double-precision results on the GTX Titan, demonstrated in Tables V and VI. We also perform the same experiment on single-precision data with the C2075, with the results in Table VII. The computing throughput is derived by dividing the total number of floating-point operations by the kernel time.

Table V compares the single-precision performance of the proposed ecsb data structure with the HYB and CSR structures on the GTX Titan. On average, ecsb-based SMVP is comparable with or slightly better than HYB and outperforms CSR by 20%, while ecsb-based SMTVP delivers a throughput higher than HYB and CSR by 31.5% and 195.8%, respectively. Compared with the CPU for SMVP and SMTVP, the proposed ecsb data structure attains at least a 5.0 times speedup. We also evaluate the double-precision performance on the GTX Titan, with the results listed in Table VI. ecsb delivers a slightly lower SMVP performance than HYB and CSR but outperforms HYB and CSR by 365.4% and 69.8%, respectively, in SMTVP. To study the performance scalability across GPUs, we compare the single-precision results on the C2075 in Table VII: ecsb-based SMTVP outperforms HYB- and CSR-based SMTVP by 7.6% and 79.9%, respectively.

In Table V, two matrices, asic_320k and torso1, have a lower SMVP performance when stored in the ecsb data structure. The reason is that the numbers of non-zeros in their most populated rows are much higher than the average, and such a distribution leads to more collisions in the atomic additions. Our GPU implementation of the ecsb data structure delivers a higher throughput than the original HYB-based implementations of the CUSP and CUSPARSE packages. The major reason is that the block structure enables a better spatial locality in both the matrix and the vectors of the SMVP or SMTVP computation. The CSB data structure allows non-zeros of adjacent rows and columns to be stored together. Given dense blocks in a sparse matrix, the adjacent elements of the vectors x and y have a better chance of being reused after being loaded into the cache. In fact, when computing y[i] += A[i,j] * x[j], the neighboring elements of y[i] can be automatically loaded into the cache, and because we compute multiple rows inside a block, it is highly likely that the target entries of y are already cached. In the case of CSR-based computing, the locality in the vector elements also exists but can hardly be fully exploited, because programmers cannot explicitly control the scheduling of warps.

6.2.3. Computation time

In this section, we report the computation times on single-precision data, including both the kernel time and the wall-clock time, on the GTX Titan. The kernel time is the total kernel time of both SMVP and SMTVP, and the wall-clock time is the total time of BCG on the GPU, including both the CPU-GPU data transfer time and the kernel time.

Table V. Throughput of single precision on the GTX Titan GPU and the CPU (MFLOPS), for SMVP and SMTVP under ecsb, HYB, and CSR on the GPU and CSB on the CPU. GTX Titan GPU and CPU denote the platform; ecsb, HYB, and CSR denote the matrix representation; SMVP and SMTVP denote the operation. The notation in the other tables follows these rules.

Table VI. Throughput of double precision on the GTX Titan GPU and the CPU (MFLOPS). SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

Table VII. Throughput of single precision on the C2075 GPU (MFLOPS). ecsb, expanded compressed sparse block; SMVP, sparse matrix-vector product; SMTVP, sparse matrix-transpose vector product.

6.2.3.1. Kernel time

Figure 11 compares the total kernel time of both SMVP and SMTVP. Our ecsb outperforms HYB on eight out of the 11 benchmark matrices by up to 44.8% and CSR on 10 out of the 11 by up to 82.3%. The three matrices asic_320k, torso1, and rajat31 have a longer kernel time with ecsb than with HYB. This is again due to the highly irregular distribution of non-zeros: some rows of these matrices have significantly more non-zeros, which leads to more frequent collisions in the atomic operations.

6.2.3.2. Wall-clock time on the GTX Titan GPU

The ecsb structure enables a higher computing throughput for both SMVP and SMTVP, but at the cost of a larger memory footprint. Because the memory overhead implies a longer data transfer time between CPU and GPU, it is necessary to validate whether ecsb really leads to a performance advantage in typical application scenarios. We use the BCG algorithm mentioned in Section 1 as a case study. We derive a GPU implementation on top of CUSP by incorporating the ecsb data structures and the respective procedures proposed in this work. The BCG procedure is tested on a sparse matrix, parabolic_fem, which is derived from finite element analysis. The BCG computation converges in eight iterations with an accuracy tolerance of 1e-5. We also implement the HYB- and CSR-based procedures in the framework of CUSP. Figure 12 compares the wall-clock times of the data structures; the reported wall-clock time includes both the GPU computation time and the CPU-GPU data transfer time.

Figure 11. Total kernel time of both sparse matrix-vector product and sparse matrix-transpose vector product (single-precision data on the GTX Titan GPU).

Figure 12. Wall-clock time of the preconditioned bi-conjugate gradient algorithm for the matrix parabolic_fem (single-precision data on the GTX Titan GPU).

The ecsb data structure outperforms HYB and CSR by 6% and 25%, respectively. The results suggest that the ecsb data structure delivers a higher computing throughput despite the memory overhead.

6.2.4. Utilization

The effectiveness of a data structure and its manipulation procedures is also reflected by the utilization of GPU hardware resources. Utilization is defined as the ratio of the average number of active warps per cycle to the maximum number of warps supported by a multiprocessor [22]. We used the NVIDIA Visual Profiler [22] to collect the utilization data. Figures 13 and 14 give the utilization values for single-precision SMVP and SMTVP over all benchmark matrices. Obviously, the ecsb data structure enables a higher utilization of the GPU hardware in both SMVP and SMTVP. This observation suggests that ecsb is better suited to GPU computing and likely to be more scalable.

7. RELATED WORK

Sparse matrix-transpose vector product is a commonly used computing pattern that is already supported by the BLAS package [23]. Traditionally, SMTVP is solved by a transposition followed by an SMVP procedure, so its solution is closely related to work on SMVP. Bell and Garland [6] developed an efficient solution for sparse matrix-vector multiplication on GPUs. The key idea is to use a warp to derive the inner product of a row of the sparse matrix and the input vector. Such a solution strikes a good balance between parallelism and load balancing.

Figure 13. Average utilization of sparse matrix-vector product on the GTX Titan GPU.

Figure 14. Average utilization of sparse matrix-transpose vector product on the GTX Titan GPU.

Meanwhile, the summation in the computation of the inner product is realized by an intra-warp reduction. This work dramatically improved the performance of GPU-based SMVP computation. It was released in an open-source package, CUSP [6], and later incorporated in NVIDIA's CUSPARSE package [7]. The CSR-based atomic addition is built on CUSP [6] because of the availability of its source code. CUSP [6] solves the SMTVP problem by transposition followed by SMVP, while the CSR-based atomic addition takes a different approach and delivers a much higher throughput [8].

Many optimization techniques have also been introduced for GPU-based SMVP. Choi et al. [15] developed a model-driven auto-tuning technique to compute SMVP on GPUs. They used the blocked compressed sparse row and blocked ELLPACK (BELLPACK) formats and achieved up to 29.0 GFLOPS in single precision and 15.7 GFLOPS in double precision on an NVIDIA T10P multiprocessor (C1060). Deng et al. [14] proposed data-reordering techniques for GPU-accelerated SMVP computation; their techniques are especially suited to very sparse matrices. Su and Keutzer [16] developed the Cocktail format, a new sparse matrix format, and its OpenCL [24] implementation, clSpMV. The framework can analyze a sparse matrix at runtime and recommend a suitable format for a given GPU. CUSPARSE [7] was also heavily optimized, but such optimizations are orthogonal to the techniques proposed in CUSP [6]. Tang et al. [17] proposed compressing the row/column indexes to reduce memory traffic for an improved throughput on GPUs. Fu et al. [25] presented a parallel FEM solver for unstructured meshes and observed a speedup of up to 87 times. However, all of the aforementioned works could not achieve the same throughput for both SMVP and SMTVP on GPUs, because commonly used sparse formats do not allow efficient row-wise and column-wise accesses at the same time.

Buluç et al. [5] proposed a novel solution to simultaneously compute SMVP and SMTVP on multi-core platforms. The work is based on a new data structure, CSB, for storing sparse matrices. The size of the blocks is determined by adapting to the size of the last-level cache so as to enhance the locality of data accesses. This approach is effective on multi-core CPUs but only delivers limited performance on GPUs, because a costly pre-processing step is needed to traverse all non-zeros and change their storage layout. Gómez-Luna et al. [26] provide a collision model for atomic functions on the shared memory of GPUs; a key observation is that collisions on the shared memory are equivalent to the bank conflict problem. A collision model for global memory remains open, and the problem there is more complicated.

8. CONCLUSION AND FUTURE WORK

Sparse matrix-vector product and SMTVP are two common computing patterns of sparse matrix based algorithms. Despite the progress in GPU-based sparse matrix computations, it is still challenging to compute both problems with a similar throughput on GPUs. In this work, we proposed a GPU-oriented CSB data structure, designated as ecsb, as well as an atomic-operation-based computing procedure to shorten the gap between the SMVP and SMTVP computations. We also extended previous work by using a hybrid storage format for ecsb. We performed experiments to validate the proposed techniques in a systematic manner. They show that ecsb enables a higher throughput on SMVP and SMTVP than previous works on Kepler GPUs; in addition, both SMVP and SMTVP can be computed in much closer times. We also tested our work on the BCG algorithm and further validated the effectiveness of ecsb.

In the future, we are going to extend this research in several directions. First, we will develop a performance model for atomic operations that considers the overhead of both collisions and cache misses; it will help construct adaptive algorithms for SMVP and SMTVP computations. Second, we will refine the ecsb data structure to reduce the memory overhead. Finally, we are also going to explore heterogeneous methods involving both the CPU and the GPU for CSB-based computing. Especially, we want to investigate the possibility of using the CPU to recursively partition the input sparse matrix into properly small blocks and then launching the GPU for massively parallel execution.

ACKNOWLEDGEMENTS

The authors would like to thank Professor Songping Wu of Beihang University for discussions of the BCG algorithm. The work is partially supported by two grants of the National Natural Science Foundation of China, the Fundamental Research Funds for the Central Universities under grant no. YWF-14-JSJXY-14, the Beijing Natural Science Foundation, and the fund of the State Key Laboratory of Software Development Environment under grant no. SKLSDE-2014ZX-13.

REFERENCES

1. Duff IS. A survey of sparse matrix research. Proceedings of the IEEE 1977; 65(4).
2. Vuduc R, Gyulassy A, Demmel JW, Yelick KA. Memory hierarchy optimizations and performance bounds for sparse A^T Ax. Proceedings of the 2003 International Conference on Computational Science: Part III (ICCS), June 2003. Springer: Berlin, Heidelberg, 2003.
3. Saad Y. Iterative Methods for Sparse Linear Systems (2nd edn). SIAM: Philadelphia, PA, USA, 2003.
4. Barrett R, Berry M, Chan TF, Demmel J, Donato JM, Dongarra J, Eijkhout V, Pozo R, Romine C, van der Vorst H. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. (Available from: ...org/templates/templates.pdf) [Accessed 6 May 2014].


Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Tuning CUDA Applications for Fermi. Version 1.2

Tuning CUDA Applications for Fermi. Version 1.2 Tuning CUDA Applications for Fermi Version 1.2 7/21/2010 Next-Generation CUDA Compute Architecture Fermi is NVIDIA s next-generation CUDA compute architecture. The Fermi whitepaper [1] gives a detailed

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision

SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision Toshiaki Hishinuma 1, Hidehiko Hasegawa 12, and Teruo Tanaka 2 1 University of Tsukuba, Tsukuba, Japan 2 Kogakuin

More information

Blocked-Based Sparse Matrix-Vector Multiplication on Distributed Memory Parallel Computers

Blocked-Based Sparse Matrix-Vector Multiplication on Distributed Memory Parallel Computers The International Arab Journal of Information Technology, Vol. 8, No., April Blocked-Based Sparse Matrix-Vector Multiplication on Distributed Memory Parallel Computers Rukhsana Shahnaz and Anila Usman

More information

arxiv: v1 [cs.dc] 24 Feb 2010

arxiv: v1 [cs.dc] 24 Feb 2010 Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

GPU-based Parallel Reservoir Simulators

GPU-based Parallel Reservoir Simulators GPU-based Parallel Reservoir Simulators Zhangxin Chen 1, Hui Liu 1, Song Yu 1, Ben Hsieh 1 and Lei Shao 1 Key words: GPU computing, reservoir simulation, linear solver, parallel 1 Introduction Nowadays

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture

Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture K. Akbudak a, C.Aykanat

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

How naive is naive SpMV on the GPU?

How naive is naive SpMV on the GPU? How naive is naive SpMV on the GPU? Markus Steinberger, Andreas Derler, Rhaleb Zayer and Hans-Peter Seidel Max Planck Institute for Informatics Saarbrücken, Germany Email: {msteinbe,rzayer,hpseidel}@mpi-inf.mpg.de

More information

Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation

Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Dominik Grewe Institute for Computing Systems Architecture School of Informatics University

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Flexible Batched Sparse Matrix-Vector Product on GPUs

Flexible Batched Sparse Matrix-Vector Product on GPUs ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Lecture 6: Input Compaction and Further Studies

Lecture 6: Input Compaction and Further Studies PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory

Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Background to Workshop Organised by CERN OpenLab and held in IT Auditorium,

More information

Report of Linear Solver Implementation on GPU

Report of Linear Solver Implementation on GPU Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL (stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Generating Optimized Sparse Matrix Vector Product over Finite Fields

Generating Optimized Sparse Matrix Vector Product over Finite Fields Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

Storage Formats for Sparse Matrices in Java

Storage Formats for Sparse Matrices in Java Storage Formats for Sparse Matrices in Java Mikel Luján, Anila Usman, Patrick Hardie, T.L. Freeman, and John R. Gurd Centre for Novel Computing, The University of Manchester, Oxford Road, Manchester M13

More information

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information