Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs


2012 SC Companion: High Performance Computing, Networking, Storage and Analysis

Kazuya Matsumoto, Naohito Nakasato, and Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering
The University of Aizu
Aizu-Wakamatsu City, Fukushima, Japan

Abstract—OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes the performance of OpenCL programs portable across different processors as well. We have developed an auto-tuning system with a code generator for fast matrix multiply kernels in OpenCL. This paper presents results of a performance evaluation of DGEMM (double-precision general matrix multiply) and SGEMM (single-precision GEMM) implementations produced by the auto-tuning system. Performance evaluations are conducted on two AMD GPUs (Tahiti and Cayman), two NVIDIA GPUs (Kepler and Fermi), and two CPUs (Intel Sandy Bridge and AMD Bulldozer). Our GEMM implementations on the AMD GPUs show higher performance than the highly tuned vendor library, while the implementations on the NVIDIA GPUs are comparable.

I. INTRODUCTION

Matrix-matrix multiplication is a fundamental routine in linear algebra, called GEMM (GEneral Matrix Multiply) in the BLAS (Basic Linear Algebra Subprograms) standard [1]. GEMM is used in many important numerical algorithms, and it is a building block of LAPACK (Linear Algebra PACKage) [2] and of the other Level-3 BLAS routines [3]. GEMM algorithms have high computational intensity and regularity and are therefore good candidates for performance acceleration.

OpenCL (Open Computing Language) is a standard framework for parallel programming [4], [5]. Programs in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and other computing devices such as FPGAs. OpenCL offers an abstract hardware layer that allows programmers to develop applications without knowing the details of the underlying processor architectures. However, performance is not always portable across different processors in OpenCL.

Parallel processing on GPUs and multi-core CPUs is widely used. State-of-the-art GPUs contain more than a thousand processing elements at around 1 GHz clock speed and are capable of achieving around 1 TFlop/s in double precision and more than 3 TFlop/s in single precision. Many numerical applications rely on the high computational power of GPUs. On the other hand, current commodity CPUs have up to a few tens of processing cores at a few GHz clock speed. The peak performance of CPUs is a few hundred GFlop/s, but CPUs are capable of running many more types of applications.

Automatic performance tuning (auto-tuning for short) is an important technique for resolving the problem of performance portability, and it is a well-accepted solution for fast GEMM implementations. PHiPAC (Portable High Performance ANSI C) [6] and ATLAS (Automatically Tuned Linear Algebra Software) [7] are famous projects for auto-tuned BLAS routines on CPUs. Several other auto-tuning systems for GEMM have also been developed [8]-[13]. In CUDA, an auto-tuning framework for NVIDIA GPUs named ASTRA (Automatic Stencil TuneR for Accelerators) has been implemented [10]. In OpenCL, Du et al. [12] presented auto-tuned GEMM routines on an NVIDIA Fermi GPU and an AMD Cypress GPU.
We have also previously implemented a GEMM code generator that produces fast GEMM kernels through an auto-tuning process on an AMD Tahiti GPU [13]. One contribution of this study is that we have applied our auto-tuning system to different GPUs and CPUs and evaluated the resulting performance. The processors are the following four GPUs and two CPUs:

1) AMD Tahiti GPU (Radeon HD 7970);
2) AMD Cayman GPU (Radeon HD 6970);
3) NVIDIA Kepler GPU (GeForce GTX 670, overclocked);
4) NVIDIA Fermi GPU (Tesla M2090);
5) Intel Sandy Bridge CPU (Core i7 3960X);
6) AMD Bulldozer CPU (FX-8150).

Another contribution is that we have improved our previous GEMM code generator so that it supports a greater number of parameters. The new parameters include a parameter designating a specific matrix multiply algorithm and additional blocking factors for more flexible usage of local memory (sharable memory). The new generator succeeds in producing GEMM kernels that run faster on the Tahiti GPU.

The maximum performance of the DGEMM (double-precision GEMM) kernel is increased to 863 GFlop/s (91% of the peak performance) from 848 GFlop/s, and that of the SGEMM (single-precision GEMM) kernel is improved to 3047 GFlop/s (80% of the peak) from 2646 GFlop/s.

The rest of this paper is organized as follows. Section II reviews OpenCL basics. Section III describes our improved GEMM code generator and explains the differences between the improved generator and the previous one. Section IV presents the results of the performance evaluation on different processors. Finally, Section V concludes the paper.

II. OPENCL BASICS

OpenCL is an open standard framework for general-purpose parallel programming on heterogeneous platforms. The OpenCL framework includes a C99-based language for writing parallel functions called kernels, and runtime APIs (application programming interfaces) for controlling OpenCL platforms and devices.

An OpenCL platform is composed of one or more OpenCL devices connected to a host. An OpenCL device comprises multiple compute units (CUs), each of which has multiple processing elements (PEs). When an OpenCL kernel is submitted for execution on the device, an N-dimensional index space, called an NDRange, is defined. In this study, we consider only a two-dimensional index space, which is suitable for matrix data. Each instance in an NDRange is called a work-item, and each work-item has a unique ID. Several work-items organize a work-group. A work-item runs on one or more PEs, and the task of a work-group is processed by all the PEs of a CU.

In an OpenCL kernel, four distinct memory regions are accessible to work-items.

1) Global memory is a memory region in which data can be read/written by all work-items. Work-items cannot be synchronized through this memory during a kernel execution.
2) Constant memory is a read-only region of global memory. Data in this region do not change during execution.
3) Local memory is a memory region specific to a work-group. Work-items in a work-group can share data in local memory.
4) Private memory is a memory region specific to a work-item. Data in the private memory of a work-item are not visible to other work-items. On most OpenCL devices, private memory resides in the register file.

III. GEMM CODE GENERATOR

The GEMM routine in single or double precision is defined as C ← α·op(A)·op(B) + βC, where α and β are scalar values, and op(A), op(B), and C are M × K, K × N, and M × N matrices, respectively. Since op(X) is either X (non-transposed matrix) or X^T (transposed matrix), there are four multiplication types:

(NN) C ← αAB + βC;
(NT) C ← αAB^T + βC;
(TN) C ← αA^T B + βC;
(TT) C ← αA^T B^T + βC.

An auto-tuning system consists of two core components: a code generator and a heuristic search engine. Our code generator takes a set of parameters as input; given the input, it produces the corresponding GEMM kernel code written in OpenCL as output. We can pass different input parameters to the generator. We have tuned the code generator so that it produces a fast A^T B + C kernel in which data are properly aligned in row-major order. For simplicity, the multiplications by the scalars α and β are omitted in the following descriptions, though they are performed in the kernel. Our approach to fast GEMM implementations is to utilize the A^T B + C kernel: the implementation first copies the matrices, applying matrix transposition and data layout changes if needed, and then executes the A^T B + C kernel.
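For reference, the following plain C routine states what the A^T B + C kernel computes in the TN case. It is a minimal, unoptimized sketch for clarity (our own illustration, not generator output), assuming row-major storage as in the tuned kernel, with A stored as a K × M array so that op(A) = A^T is M × K.

    /* Unoptimized reference for the TN case C <- alpha*A^T*B + beta*C.
     * All matrices are row-major: A is K x M (so A^T is M x K),
     * B is K x N, C is M x N. Illustration only, not the tuned kernel. */
    void gemm_tn_ref(int M, int N, int K, float alpha,
                     const float *A, const float *B, float beta, float *C)
    {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int p = 0; p < K; p++)
                    acc += A[p * M + i] * B[p * N + j]; /* (A^T)(i,p) = A(p,i) */
                C[i * N + j] = alpha * acc + beta * C[i * N + j];
            }
        }
    }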
The following explanations of each parameter are targeted at a multiplication of a transposed matrix A^T and a non-transposed matrix B.

A. Blocking

Matrix blocking (or tiling) in a matrix multiply algorithm is a necessary technique for high-performance computation. Blocking increases the data reuse ratio in the multi-level memory hierarchy of today's processors by exploiting the fact that matrix multiplication requires O(N^3) multiply-add floating-point operations over O(N^2) data. Two levels of blocking are used in our matrix multiply algorithms. In this paper, we refer to the first level of blocking, which alleviates access latencies between local memory and private memory in work-item processing, as work-item blocking. The other level of blocking serves to efficiently use the local memory (and data caches) of a compute unit for work-group processing; we call this blocking work-group blocking.

Let M_wg, N_wg, K_wg be the blocking factors of work-group blocking, where M, N, K are considered to be divisible by M_wg, N_wg, K_wg, correspondingly. The blocking divides the three matrices A^T, B, C into blocks of M_wg × K_wg, K_wg × N_wg, and M_wg × N_wg elements, respectively. Fig. 1 shows a matrix multiply-add partitioned by the blocking factors. The elements of each M_wg × N_wg block of C are computed by a work-group. The work-group involves an M_wg × K block of A^T and a K × N_wg block of B for the multiplication, and an M_wg × N_wg block of C for the addition with the result of the multiplication. The blocked matrix multiply requires K/K_wg iterations in the outermost loop of our GEMM algorithms.

Figure 1. Blocked matrix multiply-add partitioned with factors M_wg, N_wg, K_wg in the work-group blocking.

In every iteration, the work-group updates its M_wg × N_wg block by multiplying an M_wg × K_wg block of A^T with a K_wg × N_wg block of B and adding the product to the M_wg × N_wg block of C.

Fig. 2(a) depicts the further blocked matrix multiply-add. Each block is additionally divided with blocking factors M_wi, N_wi in the work-item blocking. The two blocking factors M_wi, N_wi are not parameters of the code generator. Instead, the size (M_dimc, N_dimc) of a work-group is parameterized, where M_wg and N_wg are multiples of M_dimc and N_dimc, respectively. Using the parameters M_dimc, N_dimc, the two blocking factors are calculated as M_wi = M_wg/M_dimc and N_wi = N_wg/N_dimc, correspondingly. A work-item of the work-group is in charge of multiplying an M_wi × K sub-block of A^T by a K × N_wi sub-block of B and accumulating the product on an M_wi × N_wi sub-block of C.

In addition, the code generator supports another parameter, K_wi, which determines the degree of unrolling in the innermost loop of our GEMM algorithms. Note that K_wg is divisible by K_wi, and we categorize the parameter K_wi as one of the blocking factors. Loop unrolling [14] is an optimization technique in which the body of a loop is replaced with multiple copies of itself. As positive effects, the technique explicitly exposes parallelism in an OpenCL kernel to the compiler and reduces loop overheads such as loop-counter increments; however, unrolling also has the side effect of increasing the number of required registers. The unrolling degree therefore needs to be parameterized.
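To make the two-level decomposition concrete, the host-side sketch below derives the NDRange of a generated kernel from the blocking parameters. The helper and its names are our own illustration (assuming M and N divisible by M_wg and N_wg), not part of the generator.

    /* Hypothetical host-side sketch: each work-group of M_dimc x N_dimc
     * work-items computes one M_wg x N_wg block of C, and each work-item
     * accumulates an M_wi x N_wi sub-block, so the NDRange follows
     * directly from the blocking factors. */
    #include <CL/cl.h>

    cl_int launch_gemm(cl_command_queue q, cl_kernel k,
                       size_t M, size_t N,
                       size_t M_wg, size_t N_wg,
                       size_t M_dimc, size_t N_dimc)
    {
        size_t M_wi = M_wg / M_dimc;               /* rows of C per work-item    */
        size_t N_wi = N_wg / N_dimc;               /* columns of C per work-item */
        size_t global[2] = { M / M_wi, N / N_wi }; /* total work-items */
        size_t local[2]  = { M_dimc, N_dimc };     /* work-group shape */
        return clEnqueueNDRangeKernel(q, k, 2, NULL, global, local,
                                      0, NULL, NULL);
    }

For example, with M_wg = N_wg = 64 and M_dimc = N_dimc = 16 (hence M_wi = N_wi = 4), a 4096 × 4096 multiplication launches 1024 × 1024 work-items arranged in 64 × 64 work-groups.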
B. Vector width and stride memory access

The width of vector variables is a parameter to the code generator. Vector variables in OpenCL resemble arrays containing multiple elements of the same variable type. The vector width vw affects the performance of the generated kernels, and the best width depends on the processor and the algorithm.

In the work-item blocking of Fig. 2(a), each work-item is in charge of computation for adjacent (unit-stride) elements in an M_wi × N_wi block of C; in the figure, dark shading indicates the elements accessed by one work-item. In previous works [9], [10], [15], a non-unit-stride memory access is utilized instead of the unit-stride access for performance optimization on Fermi GPUs. Fig. 2(b) depicts the computation with our non-unit-stride memory access, where the stride size in the M direction is M_dimc and the one in the N direction is N_dimc. If vector variables are used (vw >= 2), the stride sizes are multiplied by the vector width, i.e., the sizes become vw × M_dimc and vw × N_dimc.

Figure 2. Further blocked matrix multiply-add with factors M_wi, N_wi in the work-item blocking: (a) with a unit-stride (adjacent) memory access; (b) with a non-unit-stride access.
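The toy kernel below contrasts the two mappings of a work-item's M_wi × N_wi sub-block onto C. It only marks the touched elements instead of performing a GEMM; the fixed factors and the nonunit switch are our own simplifications for illustration.

    /* Illustrative OpenCL C kernel (ours, not generator output): each
     * work-item touches its M_wi x N_wi sub-block of C either with unit
     * stride (Fig. 2(a)) or with strides M_dimc, N_dimc (Fig. 2(b)).
     * C is row-major with N columns; M_wi = N_wi = 4 fixed for brevity. */
    #define M_WI 4
    #define N_WI 4

    __kernel void touch_c(__global float *C, const int N, const int nonunit)
    {
        const int mdim = get_local_size(0), ndim = get_local_size(1);
        /* first element of this work-item's sub-block */
        const int row0 = get_group_id(0) * (mdim * M_WI) +
                         get_local_id(0) * (nonunit ? 1 : M_WI);
        const int col0 = get_group_id(1) * (ndim * N_WI) +
                         get_local_id(1) * (nonunit ? 1 : N_WI);
        const int rs = nonunit ? mdim : 1;  /* stride between sub-block rows */
        const int cs = nonunit ? ndim : 1;  /* stride between sub-block cols */

        for (int i = 0; i < M_WI; i++)
            for (int j = 0; j < N_WI; j++)
                C[(row0 + i * rs) * N + (col0 + j * cs)] += 1.0f;
    }

With nonunit = 1, one work-item's elements are spread apart with strides M_dimc and N_dimc, so neighboring work-items touch neighboring elements of C, which helps coalesce global memory accesses on GPUs such as Fermi.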

C. Usage of local memory

We parameterize the usage of local memory for sharing data among the work-items of a work-group. Local memory offers the advantage of reusing matrix data of A and B that are loaded only once from global memory. A disadvantage of using local memory is that it requires barrier synchronization between the work-items, which takes a certain amount of time. This means that using local memory does not always lead to higher performance.

When local memory is used, the assignment pattern of work-items in a work-group can be reshaped (this reshaping technique is also used in [10]). To represent the reshape, let us introduce the values M_dima, K_dima, K_dimb, N_dimb. Reshaping a block is possible as long as the three shapes for A, B, C completely overlay the corresponding matrix. We add the two parameters M_dima and N_dimb; the other two values are calculated as K_dima = (M_dimc × N_dimc)/M_dima and K_dimb = (M_dimc × N_dimc)/N_dimb, respectively.

D. Matrix data layouts

The code generator supports A^T B + C kernels in which the matrices A, B are aligned in block-major data layouts in addition to a row-major layout. Fig. 3 shows the supported layouts for an example M × K transposed matrix A^T with blocking factors M_wg, K_wg. Fig. 3(a) is the row-major layout. Fig. 3(b) depicts a column-block-row-major layout (CBL), in which the data of each M × K_wg column-block are stored in row-major order. In CBL, the matrix data required for a multiplication of an M × K_wg column-block of A^T by a K_wg × N_wg column-block of B are in contiguous memory space. Fig. 3(c) shows a row-block-row-major layout (RBL), in which the data of each M_wg × K_wg sub-block of an M_wg × K row-block are aligned in row-major order. In RBL, the matrix data for a multiplication between an M_wg × K_wg sub-block and a K_wg × N_wg sub-block are in sequential memory space. Both CBL and RBL have better spatial locality than the row-major layout, and GEMM kernels using either of them are expected to read matrix data more efficiently.

Figure 3. Matrix data layouts of an M × K transposed matrix with blocking factors M_wg, K_wg: (a) row-major layout; (b) column-block-row-major layout (CBL); (c) row-block-row-major layout (RBL).

To make use of a fast A^T B + C kernel for the GEMM routines, matrix data have to be copied into separately allocated buffers in global memory before executing the kernel. For example, to implement an AB + C routine where data are stored in row-major order, the matrix A is copied into a buffer with matrix transposition, and the matrix B is copied into another buffer without transposition. If the designated data layouts are not row-major, the matrix data are converted into the required layouts during the copying.
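As a concrete reading of Fig. 3(b), the host-side sketch below packs a row-major matrix into CBL; the function is our own illustration of the layout, not the paper's copy routine.

    /* Pack a row-major M x K matrix into the column-block-row-major
     * layout (CBL) of Fig. 3(b): each M x K_wg column-block is stored
     * contiguously in row-major order. K is assumed divisible by K_wg.
     * Hand-written illustration, not the paper's copy routine. */
    void pack_cbl(size_t M, size_t K, size_t K_wg,
                  const float *src, float *dst)
    {
        for (size_t cb = 0; cb < K / K_wg; cb++)     /* column-block index */
            for (size_t i = 0; i < M; i++)           /* row inside block   */
                for (size_t j = 0; j < K_wg; j++)
                    *dst++ = src[i * K + cb * K_wg + j];
    }

After packing, the M × K_wg column-block used in one outer-loop iteration occupies one contiguous span of M · K_wg elements, which is exactly the locality benefit described above.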
E. Algorithms

We have implemented three GEMM algorithms, and the code generator has a parameter to indicate one of them. In the OpenCL language, kernels are written in an SPMD (Single-Program Multiple-Data) fashion, where a kernel describes the behavior of each work-item in the NDRange. Note that the following algorithms are examples that use local memory for both matrices A and B. If a generated kernel does not use local memory, data elements are loaded directly from global memory into private memory. In the descriptions of the algorithms, barrier means a barrier synchronization between the work-items in a work-group to ensure correct access to local memory.

Fig. 4 presents the first of the GEMM algorithms. We consider it the basic algorithm (BA); it is similar to the GEMM algorithm by Volkov and Demmel [16]. In the body of the outer loop (lines 2-12), a work-item first loads matrix elements of A and B from global memory into local memory. The inner loop body (lines 6-10) loads the elements from local memory into the private memory A_pm, B_pm, multiplies A_pm by B_pm, and adds the product to C_pm, also in private memory. The inner loop is fully unrolled. When the work-item exits the outer loop, it merges the computed results in C_pm with the corresponding elements of C (line 13). The scalars α and β are applied at the same time as the merging.

     1: C_pm = 0
     2: for p_wg = 0 to K - K_wg step K_wg do
     3:   load M_wia × K_wia elements of A into A_lm
     4:   load K_wib × N_wib elements of B into B_lm
     5:   barrier
     6:   for p_wi = 0 to K_wg - K_wi step K_wi do
     7:     load M_wi × K_wi elements of A_lm into A_pm
     8:     load K_wi × N_wi elements of B_lm into B_pm
     9:     C_pm += A_pm × B_pm
    10:   end for
    11:   barrier
    12: end for
    13: merge C_pm with M_wi × N_wi elements of C

Here, pm denotes private memory, lm denotes local memory, M_wia = M_wg/M_dima, K_wia = K_wg/K_dima, K_wib = K_wg/K_dimb, and N_wib = N_wg/N_dimb. (The same notations are used in the following algorithms.)

Figure 4. Basic GEMM algorithm (BA)
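To show what Fig. 4 looks like as real OpenCL C, here is a hand-written, simplified BA-style SGEMM kernel with fixed factors M_wg = N_wg = 32, K_wg = 16, M_dimc = N_dimc = 16 (hence M_wi = N_wi = 2), unit-stride accesses, no reshaping, and α, β applied at the merge. It is a sketch under these assumptions; generated kernels vary all of these choices.

    /* Simplified BA-style kernel (our illustration, not generator output).
     * A is K x M row-major (holding A^T data), B is K x N, C is M x N;
     * M, N, K divisible by the blocking factors. Launch with local size
     * 16 x 16 and global size (M/2, N/2). */
    #define MWG 32
    #define NWG 32
    #define KWG 16

    __kernel void sgemm_tn_ba(const int M, const int N, const int K,
                              const float alpha, const float beta,
                              __global const float *A,
                              __global const float *B,
                              __global float *C)
    {
        const int lx = get_local_id(0);          /* 0..15, M direction */
        const int ly = get_local_id(1);          /* 0..15, N direction */
        const int row0 = get_group_id(0) * MWG;  /* block origin in C  */
        const int col0 = get_group_id(1) * NWG;

        __local float A_lm[KWG][MWG];
        __local float B_lm[KWG][NWG];
        float c[2][2] = {{0.0f, 0.0f}, {0.0f, 0.0f}};  /* C_pm */

        for (int p = 0; p < K; p += KWG) {
            /* Fig. 4, lines 3-4: each work-item loads 2 elements of A, 2 of B. */
            A_lm[ly][2 * lx]     = A[(p + ly) * M + row0 + 2 * lx];
            A_lm[ly][2 * lx + 1] = A[(p + ly) * M + row0 + 2 * lx + 1];
            B_lm[lx][2 * ly]     = B[(p + lx) * N + col0 + 2 * ly];
            B_lm[lx][2 * ly + 1] = B[(p + lx) * N + col0 + 2 * ly + 1];
            barrier(CLK_LOCAL_MEM_FENCE);        /* Fig. 4, line 5 */

            for (int q = 0; q < KWG; q++) {      /* inner loop, unrollable */
                const float a0 = A_lm[q][2 * lx], a1 = A_lm[q][2 * lx + 1];
                const float b0 = B_lm[q][2 * ly], b1 = B_lm[q][2 * ly + 1];
                c[0][0] += a0 * b0;  c[0][1] += a0 * b1;
                c[1][0] += a1 * b0;  c[1][1] += a1 * b1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);        /* Fig. 4, line 11 */
        }

        /* Fig. 4, line 13: merge C_pm with C, applying alpha and beta. */
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                const int idx = (row0 + 2 * lx + i) * N + col0 + 2 * ly + j;
                C[idx] = alpha * c[i][j] + beta * C[idx];
            }
    }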

Fig. 5 shows the second GEMM algorithm. It uses a software pipelining (PL) technique and is based on the GEMM algorithm proposed in [9], [10]. The PL algorithm has a prologue and an epilogue in addition to the pipelined loop body. The main feature of the algorithm is that the loop body loads elements of A and B from global memory while it also loads elements from local memory and computes the multiply-add. This strategy is considered effective on OpenCL devices where large memory access latencies to global memory are a bottleneck.

     1: C_pm = 0
     2: load M_wia × K_wia elements of A into A_lm
     3: load K_wib × N_wib elements of B into B_lm
     4: barrier
     5: for p_wg = 0 to K - 2·K_wg step K_wg do
     6:   load M_wia × K_wia elements of A into A_pm
     7:   load K_wib × N_wib elements of B into B_pm
     8:   barrier
     9:   for p_wi = 0 to K_wg - K_wi step K_wi do
    10:     load M_wi × K_wi elements of A_lm into A_pm1
    11:     load K_wi × N_wi elements of B_lm into B_pm1
    12:     C_pm += A_pm1 × B_pm1
    13:   end for
    14:   barrier
    15:   store M_wia × K_wia elements of A_pm into A_lm
    16:   store K_wib × N_wib elements of B_pm into B_lm
    17:   barrier
    18: end for
    19: for p_wi = 0 to K_wg - K_wi step K_wi do
    20:   load M_wi × K_wi elements of A_lm into A_pm1
    21:   load K_wi × N_wi elements of B_lm into B_pm1
    22:   C_pm += A_pm1 × B_pm1
    23: end for
    24: merge C_pm with M_wi × N_wi elements of C

Figure 5. GEMM algorithm with software pipelining (PL)

The third GEMM algorithm is shown in Fig. 6. It is a variant of the algorithm with the double-buffering strategy (DB) by Tan et al. [15]. In contrast to the PL algorithm, the DB algorithm requires less private memory. Private memory is usually allocated in registers, and on some processors (especially GPUs) the number of used registers affects kernel performance [5]: it determines the number of work-groups that can be launched on a compute unit, and if the number of work-groups is too small, the processor cannot hide memory access latencies. A drawback of the DB algorithm is that it requires more space in local memory than the other two algorithms.

     1: C_pm = 0
     2: load M_wia × (K_wia/2) elements of A into A_lm0
     3: load (K_wib/2) × N_wib elements of B into B_lm0
     4: for p_wg = 0 to K - 2·K_wg step K_wg do
     5:   barrier
     6:   load M_wia × (K_wia/2) elements of A into A_lm1
     7:   load (K_wib/2) × N_wib elements of B into B_lm1
     8:   for p_wi = 0 to K_wg/2 - K_wi step K_wi do
     9:     load M_wi × K_wi elements of A_lm0 into A_pm
    10:     load K_wi × N_wi elements of B_lm0 into B_pm
    11:     C_pm += A_pm × B_pm
    12:   end for
    13:   barrier
    14:   load M_wia × (K_wia/2) elements of A into A_lm0
    15:   load (K_wib/2) × N_wib elements of B into B_lm0
    16:   for p_wi = K_wg/2 to K_wg - K_wi step K_wi do
    17:     load M_wi × K_wi elements of A_lm1 into A_pm
    18:     load K_wi × N_wi elements of B_lm1 into B_pm
    19:     C_pm += A_pm × B_pm
    20:   end for
    21: end for
    22: barrier
    23: load M_wia × (K_wia/2) elements of A into A_lm1
    24: load (K_wib/2) × N_wib elements of B into B_lm1
    25: for p_wi = 0 to K_wg/2 - K_wi step K_wi do
    26:   load M_wi × K_wi elements of A_lm0 into A_pm
    27:   load K_wi × N_wi elements of B_lm0 into B_pm
    28:   C_pm += A_pm × B_pm
    29: end for
    30: barrier
    31: for p_wi = K_wg/2 to K_wg - K_wi step K_wi do
    32:   load M_wi × K_wi elements of A_lm1 into A_pm
    33:   load K_wi × N_wi elements of B_lm1 into B_pm
    34:   C_pm += A_pm × B_pm
    35: end for
    36: merge C_pm with M_wi × N_wi elements of C

Figure 6. GEMM algorithm with double-buffering strategy (DB)
F. Modifications of the GEMM code generator

Our GEMM code generator has been modified from [13] in several aspects. The number of parameters related to blocking factors is increased from six to eight, and the previous limitation that each blocking factor be a power of two has been eliminated. A non-unit-stride memory access is implemented in addition to the unit-stride access. The current generator can produce GEMM kernels that use local memory for both matrices A and B, while the previous generator was incomplete in producing such kernels. GEMM kernels access data in buffer objects in global memory; image objects, another possible type of memory object in OpenCL, are currently not used. The three GEMM algorithms discussed above are now supported, whereas only the BA algorithm was available in the previous version of the code generator.

Our strategy for searching for the fastest (best) kernel has not changed significantly. We search tens of thousands of kernel variants per GEMM type on an OpenCL device; kernels that fail in code generation, compilation, or testing are not counted toward this number. The many variants are chosen heuristically. We implemented a heuristic search engine and selected the fastest kernel; to find the best set of parameters for each GEMM kernel, the search engine has to run for more than five hours. The procedure for selecting the best kernel is as follows:

1) Measure the performance in GFlop/s of every generated GEMM kernel for a problem size N = ceil(4096/LCM) × LCM on GPU devices and N = ceil(1536/LCM) × LCM on CPU devices, where the matrices are square (M = N = K) and LCM is the least common multiple of the work-group blocking factors M_wg, N_wg, K_wg (see the sketch after this list).
2) Further measure the performance of the fastest 50 kernels from step 1) for problem sizes N that are multiples of LCM with N <= 8192.
3) Select the fastest kernel among the 50 kernels tested in 2).
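As a concrete reading of step 1, the helper below computes that benchmark size; the function and its names are ours, not part of the search engine.

    /* Smallest multiple of LCM(M_wg, N_wg, K_wg) at or above the target
     * size (4096 for GPUs, 1536 for CPUs). Our own helper names. */
    static size_t gcd(size_t a, size_t b) { return b ? gcd(b, a % b) : a; }
    static size_t lcm(size_t a, size_t b) { return a / gcd(a, b) * b; }

    size_t tuning_size(size_t M_wg, size_t N_wg, size_t K_wg, size_t target)
    {
        size_t l = lcm(lcm(M_wg, N_wg), K_wg);
        return (target + l - 1) / l * l;   /* ceil(target / l) * l */
    }

For the Tahiti DGEMM factors (96, 32, 48), for example, LCM = 96 and tuning_size(96, 32, 48, 4096) returns 4128.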

IV. PERFORMANCE EVALUATION

In this study, we have made performance measurements on four different GPUs (AMD Tahiti, AMD Cayman, NVIDIA Kepler, and NVIDIA Fermi) and two different CPUs (Intel Sandy Bridge and AMD Bulldozer). The specifications of the processors are shown in Table I. The Kepler GPU has a boost function that dynamically increases the processor frequency; the measured performance may therefore be higher than the listed peak performance. Note that the presented performance numbers do not take into account the data transfer time between the host and the OpenCL device.

Table I. PROCESSOR SPECIFICATIONS

    Code name          | Tahiti           | Cayman          | Kepler          | Fermi        | Sandy Bridge    | Bulldozer
    Product name       | Radeon HD 7970   | Radeon HD 6970  | GTX 670 OC      | Tesla M2090  | Core i7 3960X   | FX-8150
    L3 cache [MB]      | -                | -               | -               | -            | -               | 8 (a)
    L2 cache [KB]      | 768 (a)          | 512 (a)         | 512 (a)         | 768 (a)      | 256 (b)         | 2048 (c)
    L1 cache [KB]      | 16 (b)           | 8 (b)           | 64 (b)          | 16 (b)       | 32 (b)          | 64 (c)
    Local memory [KB]  | 64 (b)           | 32 (b)          | 48 (b)          | 48 (b)       | -               | -
    Local memory type  | Scratchpad       | Scratchpad      | Scratchpad      | Scratchpad   | Global          | Global
    OpenCL SDK         | AMD APP 2.6      | AMD APP 2.6     | CUDA 5.0 RC (f) | CUDA         | Intel 2013 beta | AMD APP 2.7
    Display driver     | 12.3 (d)         | - (d)           | - (e)           | - (e)        | -               | -

    (a) Size per processor. (b) Size per compute unit (core). (c) Size per two cores.
    (d) Catalyst driver version. (e) CUDA driver version. (f) CUDA 5.0 Release Candidate.

A. Performance of GEMM kernels

Fig. 7 depicts the performance of the selected fastest DGEMM and SGEMM kernels as a function of the problem size. Table II shows the sets of parameters and the observed maximum performance of the kernels. The Tahiti GPU shows the highest performance: 863 GFlop/s (91% of the peak performance) in DGEMM and 3047 GFlop/s (80%) in SGEMM. These numbers are higher than our previous results [13]. In particular, the SGEMM performance is significantly increased from 2646 GFlop/s. The main reason for the improvement is that the new SGEMM kernel uses local memory for both matrices A and B. Besides the Tahiti case, local memory usage also improves performance on the Kepler and Fermi GPUs; for instance, if local memory is not used for both matrices on the Kepler, the maximum SGEMM performance decreases from 1440 GFlop/s to 1150 GFlop/s. On the CPUs, no prominent performance difference is seen depending on the local memory usage.
The Cayman runs slower when local memory is utilized, probably because the cost of the barrier synchronizations is too high.

The selection of the GEMM algorithm also affects the performance of the GEMM kernels. Fig. 8 depicts the relative performance of the three GEMM algorithms with respect to the maximum performance from Table II for each processor. Note that DGEMM kernels with the PL algorithm always fail to execute on the Bulldozer. The BA algorithm is clearly the best on the Tahiti GPU. For the other three GPUs, the best algorithm differs between the DGEMM and SGEMM kernels. The performance variations on the CPUs are relatively small.

GEMM kernels using block-major matrix layouts show the highest performance on all tested processors. The influence of the block-major layouts on performance is large on the two AMD GPUs, while it is relatively small on the other processors. The fastest DGEMM kernel that does not use block-major data layouts demonstrates a maximum performance of 837 GFlop/s on the Tahiti, and its performance for some problem sizes (such as multiples of 2048) deteriorates drastically because of memory bank conflicts.

Figure 7. Performance of the fastest DGEMM and SGEMM C ← αA^T B + βC kernels produced by our code generator on the different processors (Tahiti, Cayman, Kepler, Fermi, Sandy Bridge, Bulldozer).

Table II. PARAMETERS FOR THE FASTEST C ← αA^T B + βC KERNELS AND THE MAXIMUM PERFORMANCE

    DGEMM
    Parameter           | Tahiti   | Cayman   | Kepler  | Fermi   | Sandy Bridge | Bulldozer
    M_wg, N_wg, K_wg    | 96,32,48 | 64,32,48 | 32,64,8 | 64,64,8 | 64,32,64     | 48,32,96
    M_wi, N_wi, K_wi    | 6,2,2    | 4,4,24   | 2,4,4   | 4,4,2   | 4,8,4        | 2,8,16
    M_dimc, N_dimc      | 16,16    | 16,8     | 16,16   | 16,16   | 16,4         | 24,4
    M_dima, K_dima      | 16,16    | 16,8     | 32,8    | 64,4    | 16,4         | 24,4
    K_dimb, N_dimb      | 16,16    | 16,8     | 8,32    | 4,64    | 16,4         | 48,2
    Stride (b)          | -        | N        | N       | N       | -            | M
    Shared (c)          | B        | -        | A, B    | A, B    | B            | B
    Layout (d)          | CBL,CBL  | CBL,CBL  | CBL,CBL | CBL,RBL | RBL,RBL      | CBL,RBL
    Algorithm           | BA       | BA       | BA      | PL      | DB           | DB
    Max perf. [GFlop/s] | 863      | -        | -       | -       | -            | -
    Efficiency          | 91%      | 86%      | 105%    | 56%     | 40%          | 32%

    SGEMM
    Parameter           | Tahiti   | Cayman    | Kepler  | Fermi    | Sandy Bridge | Bulldozer
    M_wg, N_wg, K_wg    | 96,96,16 | 128,64,96 | 64,64,8 | 64,64,16 | 64,64,64     | 32,48,192
    M_wi, N_wi, K_wi    | 6,6,2    | 8,8,24    | 8,4,8   | 8,4,16   | 8,8,8        | 4,12,4
    M_dimc, N_dimc      | 16,16    | 16,8      | 8,16    | 8,16     | 8,8          | 8,4
    M_dima, K_dima      | 16,16    | 16,8      | 32,4    | 32,4     | 8,8          | 8,4
    K_dimb, N_dimb      | 16,16    | 16,8      | 4,32    | 8,16     | 8,8          | 8,4
    Stride (b)          | M        | N         | M       | M, N     | M            | M
    Shared (c)          | A, B     | -         | A, B    | A, B     | B            | -
    Layout (d)          | CBL,CBL  | CBL,CBL   | CBL,CBL | CBL,CBL  | RBL,RBL      | CBL,CBL
    Algorithm           | BA       | PL        | PL      | BA       | BA           | BA
    Max perf. [GFlop/s] | 3047     | -         | 1440    | -        | -            | -
    Efficiency          | 80%      | 80%       | 49%     | 67%      | 44%          | 38%

    (b) Non-unit-stride access in each direction ("-" means unit stride).
    (c) Matrix whose data are shared in local memory.
    (d) Data layouts for matrices A, B, respectively.

B. Performance of GEMM implementations

Our GEMM implementations execute the C ← αA^T B + βC kernel after copying the matrix data. The matrix data are transposed and converted into a block-major order during the copying. When a matrix size is not a multiple of a blocking factor, we use a zero padding technique. This section presents the performance results of our GEMM implementations where the matrix data are stored in column-major order. Table III shows a summary of the measured performance and a comparison with the vendor BLAS libraries.

Fig. 9 depicts the performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU. In the figure, we also plot the performance of our previous implementation [13] and of AMD APPML (Accelerated Parallel Processing Math Libraries) clBLAS. Our current implementation shows the highest performance.

Table III. MAXIMUM PERFORMANCE [GFLOP/S] OF OUR GEMM IMPLEMENTATIONS AND OF THE GEMM ROUTINES OF VENDOR LIBRARIES, WHERE MATRIX DATA ARE STORED IN COLUMN-MAJOR ORDER. For each processor, our implementation ("Ours") and the vendor library are compared on the NN, NT, TN, and TT types for both DGEMM and SGEMM, where NN: C ← αAB + βC; NT: C ← αAB^T + βC; TN: C ← αA^T B + βC; TT: C ← αA^T B^T + βC. The vendor libraries are: AMD APPML (Accelerated Parallel Processing Math Libraries) clBLAS on Tahiti and Cayman; NVIDIA CUBLAS in CUDA 5.0 RC on Kepler; NVIDIA CUBLAS on Fermi; Intel Math Kernel Library (MKL) on Sandy Bridge; AMD Core Math Library (ACML) 5.1 on Bulldozer.

Figure 8. Relative performance of the GEMM kernels using the different algorithms (BA, PL, DB; DGEMM and SGEMM) with respect to the maximum performance from Table II for each processor.

Note that the current implementation is not fast for small sizes because the ratio of the copying time to the total time is relatively large. When the matrix size is large, the copying overhead is amortized, since copying an N × N matrix requires O(N^2) memory operations while the matrix multiplication requires O(N^3) arithmetic operations.

The performance of the GEMM implementations on the Fermi and Kepler GPUs is shown in Fig. 10. The figure compares the performance of our current implementation with NVIDIA CUBLAS and MAGMA (Matrix Algebra on GPU and Multicore Architectures) on the Fermi, and with CUBLAS 5.0 RC on the Kepler. As can be seen, our implementation in OpenCL is comparable to these CUDA implementations. The performance of our OpenCL implementation does not depend strongly on the GEMM type (see Table III).

The OpenCL implementations on the CPUs do not fare as well against the vendor libraries. The performance in OpenCL is two or more times lower than Intel MKL (Math Kernel Library) on the Sandy Bridge. One possible reason for the low utilization efficiency is that current OpenCL compilers for CPUs are not as mature as those for GPUs. Another reason is that our auto-tuning system is not particularly optimized for CPUs.

On the Sandy Bridge CPU, we have tested another version of the Intel OpenCL SDK (version 2012) in addition to the latest version (2013 beta). Fig. 11 shows the performance of the different DGEMM implementations. Using the newer SDK improves the performance by around 20%. In the figure, we also compare our DGEMM performance with that of the DGEMM routine auto-tuned by ATLAS [7]. ATLAS searches for the best BLAS kernels written in the C language. The ATLAS performance is higher, even though both C and OpenCL are high-level languages.

C. Comparison to other works

Matrix multiplication is compute-intensive and important in high-performance computing, and there have been a number of works on fast matrix multiplication. Kurzak et al. [10] developed an auto-tuning system for all GEMM routines on a Fermi GPU; their system develops fast GEMM kernels written in CUDA. In [17], they also reported results of a performance evaluation of the auto-tuning system on a Kepler GPU (GeForce GTX 680) with a peak single-precision performance of 3090 GFlop/s. The reported SGEMM performance is around 1150 GFlop/s for size M = N = K = 4096. Although the experimental environments, including the GPU model, are different, our current SGEMM implementation shows higher performance, 1340 GFlop/s, on a Kepler GPU.

Tan et al. [15] presented a fast DGEMM implementation on a Fermi GPU (Tesla C2050).
The DB algorithm in Fig. 6 is based on their GEMM algorithm with the double-buffering strategy. They reported that their DGEMM kernel achieves 362 GFlop/s, which is 70% utilization efficiency. The tuned kernel is written in Fermi's native machine language. They claim that such high processor utilization is impossible with CUDA C or the PTX language; as our experiments show, this also holds for OpenCL.

Figure 9. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU (this study, our previous study [13], and AMD clBLAS).

Figure 10. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Fermi and Kepler GPUs (this study on the Fermi and Kepler, CUBLAS and MAGMA on the Fermi, CUBLAS 5.0 RC on the Kepler).

Nakasato [18] implemented GEMM kernels in an assembly-like intermediate language (IL). His GEMM kernels read matrix data through the texture cache (image). In our measurement, the performance of his DGEMM kernel is up to 498 GFlop/s (92% efficiency) on an AMD/ATI Cypress GPU (Radeon HD 5870). We applied our auto-tuning system to the same GPU, and the fastest generated DGEMM implementation in OpenCL achieves 495 GFlop/s.

Du et al. [12] presented auto-tuned SGEMM and DGEMM routines in OpenCL. The maximum performance of their DGEMM routine is 308 GFlop/s (57% efficiency) on the Cypress GPU. We consider that the large performance difference between our implementation and theirs comes from two main reasons: 1) the OpenCL SDK they used is older and less mature (they used the ATI Stream SDK 2.1, while we use a much newer AMD APP SDK); 2) the sets of parameters in the code generators are different (their parameters include the vector variable width, blocking factors, texture cache usage, and local memory usage).

V. CONCLUSION

We have shown that our tuning system for fast matrix multiplication is applicable to a wide range of processors that support OpenCL. The performance demonstrated by the best GEMM kernels is superior to the vendor library (clBLAS) on AMD GPUs. On NVIDIA GPUs, the GEMM performance is almost equivalent to that of libraries in CUDA (CUBLAS and MAGMA). For CPUs, our current implementations do not perform as well as they do for GPUs.

The high performance of our GEMM kernels relies on the usage of block-major layouts for storing matrix data; block-major layouts contribute to the performance improvement on all tested processors. We have also implemented three different GEMM algorithms and measured the performance differences among them.

Figure 11. Performance of different DGEMM C ← αAB + βC implementations on the Sandy Bridge CPU (Intel MKL, ATLAS 3.10, this study with Intel SDK 2013 beta, and this study with Intel SDK 2012).

Our implementations copy matrix data in order to use a GEMM kernel with a block-major layout. For small sizes, the overhead of this copying is relatively large, and the implementation therefore does not run fast. One possible solution for such sizes is to use another GEMM kernel that works without the matrix copying. A future work is to implement such a kernel and combine it with the current implementation.

ACKNOWLEDGMENT

A part of this work has been carried out under the Interdisciplinary Computational Science Program in the Center for Computational Sciences, University of Tsukuba.

REFERENCES

[1] Basic Linear Algebra Subprograms Technical Forum Standard, Aug. 2001. [Online]. Available: http://www.netlib.org/blas/blast-forum/blas-report.pdf
[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1999.
[3] B. Kågström, P. Ling, and C. Van Loan, "GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark," ACM Transactions on Mathematical Software, vol. 24, no. 3, pp. 268-302, 1998.
[4] Khronos Group. (Accessed Sep. 3, 2012) OpenCL - The open standard for parallel programming of heterogeneous systems. [Online]. Available: http://www.khronos.org/opencl/
[5] AMD Inc., AMD Accelerated Parallel Processing OpenCL Programming Guide, rev. 2.3.
[6] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel, "Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology," Computer Science Department, University of Tennessee, Tech. Rep., May 1996. [Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn111.pdf
[7] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, no. 1-2, pp. 3-35, Jan. 2001.
[8] C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), Sep. 2005.
[9] R. Nath, S. Tomov, and J. Dongarra, "An improved MAGMA GEMM for Fermi graphics processing units," International Journal of High Performance Computing Applications, vol. 24, no. 4, pp. 511-515, 2010.
[10] J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning GEMM kernels for the Fermi GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, Nov. 2012.
[11] C. Jang. (Accessed Sep. 3, 2012) GATLAS: GPU Automatically Tuned Linear Algebra Software.
[12] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Computing, vol. 38, no. 8, pp. 391-407, Oct. 2012.
[13] K. Matsumoto, N. Nakasato, and S. G. Sedukhin, "Implementing a code generator for fast matrix multiplication in OpenCL on the GPU," in Proceedings of the IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), Aizu-Wakamatsu City, Japan: IEEE Computer Society, Sep. 2012.
[14] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan, "Optimal loop unrolling for GPGPU programs," in Proceedings of the 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Apr. 2010.
[15] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun,
"Fast implementation of DGEMM on Fermi GPU," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), Seattle, WA, USA: ACM, Nov. 2011, pp. 35:1-35:11.
[16] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08), Austin, TX, USA: IEEE Press, Nov. 2008, pp. 31:1-31:11.
[17] J. Kurzak, P. Luszczek, S. Tomov, and J. Dongarra, "Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture - GeForce GTX 680," LAPACK Working Note 267, 2012.
[18] N. Nakasato, "A fast GEMM implementation on the Cypress GPU," ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 50-55, Mar. 2011.


More information

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

GRAPHICS Processing Units (GPUs) maintain a strong lead

GRAPHICS Processing Units (GPUs) maintain a strong lead IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 11, NOVEMBER 2012 2045 Autotuning GEMM Kernels for the Fermi GPU Jakub Kurzak, Member, IEEE, Stanimire Tomov, Member, IEEE, and Jack

More information

A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD

A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD KELEFOURAS, Vasileios , KRITIKAKOU, Angeliki and GOUTIS, Costas Available

More information

Optimization Space Pruning without Regrets

Optimization Space Pruning without Regrets Optimization Space Pruning without Regrets Ulysse Beaugnon Antoine Pouille Marc Pouzet ENS, France name.surname@ens.fr Jacques Pienaar Google, USA jpienaar@google.com Albert Cohen INRIA, France albert.cohen@inria.fr

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1 Overview radio telescopes six radio telescope algorithms on

More information

Accurate Cache and TLB Characterization Using Hardware Counters

Accurate Cache and TLB Characterization Using Hardware Counters Accurate Cache and TLB Characterization Using Hardware Counters Jack Dongarra, Shirley Moore, Philip Mucci, Keith Seymour, and Haihang You Innovative Computing Laboratory, University of Tennessee Knoxville,

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

GPUBenchmark results for tesla2

GPUBenchmark results for tesla2 Benchmark results for tesla2 May 4, 202 Abstract This report shows the Benchmark results obtained on tesla2 on May 4, 202. Contents Introduction 2 Hardware description 3 Transfer speed between hard disk

More information

On the efficiency of the Accelerated Processing Unit for scientific computing

On the efficiency of the Accelerated Processing Unit for scientific computing 24 th High Performance Computing Symposium Pasadena, April 5 th 2016 On the efficiency of the Accelerated Processing Unit for scientific computing I. Said, P. Fortin, J.-L. Lamotte, R. Dolbeau, H. Calandra

More information

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL Matthias Bach and David Rohr Frankfurt Institute for Advanced Studies Goethe University of Frankfurt I: INTRODUCTION 3 Scaling

More information

Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra

Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra Dominik Grewe Anton Lokhmotov Media Processing Division ARM School of Informatics University of Edinburgh December 13, 2010 Introduction

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach

Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School 8-21 Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an

More information

Towards a Performance- Portable FFT Library for Heterogeneous Computing

Towards a Performance- Portable FFT Library for Heterogeneous Computing Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Application-independent Autotuning for GPUs

Application-independent Autotuning for GPUs Application-independent Autotuning for GPUs Martin Tillmann, Thomas Karcher, Carsten Dachsbacher, Walter F. Tichy KARLSRUHE INSTITUTE OF TECHNOLOGY KIT University of the State of Baden-Wuerttemberg and

More information

Rajib Nath Dept. of Electrical Engineering and Computer Science, Univ. of Tennessee

Rajib Nath Dept. of Electrical Engineering and Computer Science, Univ. of Tennessee Chapter 4 BLAS for GPUs Rajib Nath Dept. of Electrical Engineering and Computer Science, Univ. of Tennessee Stanimire Tomov Dept. of Electrical Engineering and Computer Science, Univ. of Tennessee Jack

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Redesigning Triangular Dense Matrix Computations on GPUs

Redesigning Triangular Dense Matrix Computations on GPUs Redesigning Triangular Dense Matrix Computations on GPUs Ali Charara, Hatem Ltaief, and David Keyes Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah,

More information

GPU vs FPGA : A comparative analysis for non-standard precision

GPU vs FPGA : A comparative analysis for non-standard precision GPU vs FPGA : A comparative analysis for non-standard precision Umar Ibrahim Minhas, Samuel Bayliss, and George A. Constantinides Department of Electrical and Electronic Engineering Imperial College London

More information

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri *R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

More information

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Ahmet Erdem, Gianluca Palermo 6, and Cristina Silvano 6 Department of Electronics, Information and Bioengineering Politecnico di

More information

OpenMP for next generation heterogeneous clusters

OpenMP for next generation heterogeneous clusters OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great

More information