Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs


Kazuya Matsumoto, Naohito Nakasato, and Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering, The University of Aizu
Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
Email: {d812111, nakasato, sedukhin}@u-aizu.ac.jp

Abstract: OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes the performance of OpenCL programs portable across different processors as well. We have developed an auto-tuning system with a code generator for fast matrix multiply kernels in OpenCL. This paper presents the results of a performance evaluation of DGEMM (double-precision general matrix multiply) and SGEMM (single-precision GEMM) implementations produced by the auto-tuning system. Performance evaluations are conducted on two AMD GPUs (Tahiti and Cayman), two NVIDIA GPUs (Kepler and Fermi), and two CPUs (Intel Sandy Bridge and AMD Bulldozer). Our GEMM implementations on the AMD GPUs show higher performance than the highly tuned vendor library, while the implementations on the NVIDIA GPUs are comparable.

I. INTRODUCTION

Matrix-matrix multiplication is a fundamental routine in linear algebra, which is called GEMM (GEneral Matrix Multiply) in the BLAS (Basic Linear Algebra Subprograms) standard [1]. GEMM is used in many important numerical algorithms, and it is a building block of LAPACK (Linear Algebra PACKage) [2] and of other Level-3 BLAS routines [3]. GEMM algorithms have high computational intensity and regularity and are, therefore, good candidates for performance acceleration.

OpenCL (Open Computing Language) is a standard framework for parallel programming [4], [5]. Programs in OpenCL are functionally portable across multiple processors, which include CPUs, GPUs, and other computing devices such as FPGAs. OpenCL offers an abstract hardware layer that allows programmers to develop applications without knowing the details of the underlying processor architectures. However, in OpenCL, performance is not always portable across different processors.

Parallel processing on GPUs and multi-core CPUs is widely used. State-of-the-art GPUs contain more than a thousand processing elements at around 1 GHz clock speed and are capable of achieving around 1 TFlop/s in double precision and more than 3 TFlop/s in single precision. Many numerical applications rely on the high computational power of GPUs. On the other hand, current commodity CPUs have up to a few tens of processing cores at a few GHz clock speed. The peak performance of CPUs is a few hundred GFlop/s, but CPUs are capable of running many more types of applications.

Automatic performance tuning (or auto-tuning for short) is an important technique for resolving the problem of performance portability. Auto-tuning is a well-accepted solution for fast GEMM implementations. PHiPAC (Portable High Performance ANSI C) [6] and ATLAS (Automatically Tuned Linear Algebra Software) [7] are famous projects for auto-tuned BLAS routines on CPUs. Several other auto-tuning systems for GEMM have also been developed [8]-[13]. In CUDA, an auto-tuning framework for NVIDIA GPUs has been implemented [10], which is named ASTRA (Automatic Stencil TuneR for Accelerators). In OpenCL, Du et al. [12] presented auto-tuned GEMM routines on an NVIDIA Fermi GPU and an AMD Cypress GPU. We have also previously implemented a GEMM code generator that produces fast GEMM kernels through an auto-tuning process on an AMD Tahiti GPU [13].

The contribution of this study is that we have applied our auto-tuning system to different GPUs and CPUs and evaluated the performance. The processors are the following four GPUs and two CPUs: 1) AMD Tahiti GPU (Radeon HD 7970); 2) AMD Cayman GPU (Radeon HD 6970); 3) NVIDIA Kepler GPU (GeForce GTX 670, overclocked); 4) NVIDIA Fermi GPU (Tesla M2090); 5) Intel Sandy Bridge CPU (Core i7 3960X); 6) AMD Bulldozer CPU (FX-8150). Another contribution is that we have improved our previous GEMM code generator so that it supports a greater number of parameters. The new parameters include a parameter designating a specific matrix multiply algorithm and additional blocking factors for a more flexible usage of local memory (sharable memory). The new generator has succeeded in producing GEMM kernels that run faster on the Tahiti GPU.

That is, the maximum performance of the DGEMM (double-precision GEMM) kernel is increased to 863 GFlop/s (91% of the peak performance) from 848 GFlop/s, and that of the SGEMM (single-precision) kernel is improved to 3047 GFlop/s (80% of the peak) from 2646 GFlop/s.

The remaining parts of this paper are organized as follows. Section II presents OpenCL basics. Section III describes our improved GEMM code generator; this section also explains the differences between the improved generator and the previous one. After that, Section IV presents the results of the performance evaluation on the different processors. Finally, Section V concludes the paper.

II. OPENCL BASICS

OpenCL is an open standard framework for general-purpose parallel programming on heterogeneous platforms. The OpenCL framework includes a C99-based language for writing parallel functions called kernels, and runtime APIs (application programming interfaces) for controlling OpenCL platforms and devices. An OpenCL platform is composed of one or more OpenCL devices connected to a host. An OpenCL device comprises multiple compute units (CUs), each of which has multiple processing elements (PEs). When an OpenCL kernel is submitted for execution on the device, an N-dimensional index space, which is called an NDRange, is defined. In this study, we consider a two-dimensional index space only, which is suitable for matrix data. Each instance in an NDRange is called a work-item, which has a unique ID. Several work-items organize a work-group. A work-item runs on one or more PEs, and a task of a work-group is processed by the PEs of a CU.

In an OpenCL kernel, four distinct memory regions are accessible to work-items.
1) Global memory is a memory region in which data can be read/written by all work-items. It is impossible to synchronize work-items through this memory during a kernel execution.
2) Constant memory is a read-only region of global memory. Data in this region are not changed during execution.
3) Local memory is a memory region specific to a work-group. Work-items in a work-group can share data in local memory.
4) Private memory is a memory region specific to a work-item. Data in the private memory of a work-item are not visible to other work-items. On most OpenCL devices, private memory is in the register file.

III. GEMM CODE GENERATOR

The GEMM routine in single or double precision is defined as C ← α·op(A)·op(B) + β·C, where α and β are scalar values, and op(A), op(B), and C are M×K, K×N, and M×N matrices, respectively. Since op(X) is either X (non-transposed matrix) or X^T (transposed matrix), there are four multiplication types: (NN) C ← αAB + βC, (NT) C ← αAB^T + βC, (TN) C ← αA^T B + βC, and (TT) C ← αA^T B^T + βC.

An auto-tuning system uses two core components: a code generator and a heuristic search engine. Our code generator takes a set of parameters as the input. When the input is given, the code generator produces the corresponding GEMM kernel code written in OpenCL as the output. We can set different input parameters to the generator. We have tuned the code generator so that it produces a fast A^T B + C kernel in which data are properly aligned in a row-major order. For simplicity, the multiplications with the scalars α and β are omitted in the following descriptions, though they are performed in the kernel. Our approach for fast GEMM implementations is to utilize the A^T B + C kernel. This means that an implementation first copies the matrices, with matrix transposition and data-layout change if needed, and then executes the A^T B + C kernel.
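For reference, a minimal, unblocked OpenCL C sketch of an A^T B + C style kernel is shown below. It only illustrates the computation under assumed conventions (row-major storage, double precision, one work-item per element of C, and kernel and argument names chosen here for illustration); it is not the generated code. The generated kernels described in the following subsections add blocking, vectorization, local-memory staging, and loop unrolling on top of this basic structure.

// Unblocked reference sketch of C = alpha*A^T*B + beta*C.
// A is stored as a K x M row-major array (so op(A) = A^T is M x K),
// B is K x N row-major, and C is M x N row-major.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void gemm_tn_naive(const int M, const int N, const int K,
                            const double alpha, const double beta,
                            __global const double *A,   /* K x M */
                            __global const double *B,   /* K x N */
                            __global double *C)         /* M x N */
{
    const int i = get_global_id(0);   /* row index of C    */
    const int j = get_global_id(1);   /* column index of C */
    if (i >= M || j >= N) return;

    double acc = 0.0;
    for (int p = 0; p < K; ++p)
        acc += A[p * M + i] * B[p * N + j];   /* (A^T)(i,p) * B(p,j) */

    C[i * N + j] = alpha * acc + beta * C[i * N + j];
}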
The following explanations of each parameter are targeted at the multiplication of a transposed matrix A^T by a non-transposed matrix B.

A. Blocking

Matrix blocking (or tiling) in a matrix multiply algorithm is a necessary technique for high-performance computation. The blocking technique increases the data reuse ratio in the multi-level memory hierarchy of today's processors by exploiting the fact that matrix multiplication requires O(N^3) multiply-add floating-point operations over O(N^2) data. Two levels of blocking are used in our matrix multiply algorithms. In this paper, we refer to the first level of blocking, which alleviates the access latencies between the local memory and the private memory for work-item processing, as work-item blocking. The other level of blocking is used to efficiently utilize the local memory (and data caches) of a compute unit for work-group processing. We call this blocking work-group blocking.

Let M_wg, N_wg, K_wg be the blocking factors of the work-group blocking, where M, N, K are considered to be divisible by M_wg, N_wg, K_wg, correspondingly. The blocking divides the three matrices A^T, B, C into blocks of K_wg×M_wg, K_wg×N_wg, and M_wg×N_wg elements, respectively. Fig. 1 shows a matrix multiply-add partitioned by the blocking factors. The elements of each M_wg×N_wg block of C are computed by a work-group. The work-group involves a K×M_wg block of A^T and a K×N_wg block of B for the multiplication, and an M_wg×N_wg block of C for the addition with the result of the multiplication. The blocked matrix multiply requires K/K_wg iterations in the outermost loop of our GEMM algorithms.

Figure 1. Blocked matrix multiply-add partitioned with the factors M_wg, N_wg, K_wg in the work-group blocking.

In every iteration, the work-group updates the M_wg×N_wg block by multiplying a K_wg×M_wg block of A^T with a K_wg×N_wg block of B and adding the product to the M_wg×N_wg block of C.

Fig. 2(a) depicts the further blocked matrix multiply-add. Each block is additionally divided with the blocking factors M_wi, N_wi in the work-item blocking. The two blocking factors M_wi, N_wi are not parameters of the code generator. Instead, the size (M_dimC, N_dimC) of a work-group is parameterized, where M_wg and N_wg are multiples of M_dimC and N_dimC, respectively. Using the parameters M_dimC, N_dimC, the two blocking factors M_wi, N_wi are calculated as M_wg/M_dimC and N_wg/N_dimC, correspondingly. A work-item of the work-group is in charge of the multiplication of a K×M_wi sub-block of A^T by a K×N_wi sub-block of B and of the accumulation of the product into an M_wi×N_wi sub-block of C.

In addition, the code generator supports another parameter, K_wi, which determines the degree of unrolling in the innermost loop of our GEMM algorithms. Note that K_wg is divisible by K_wi, and we categorize the parameter K_wi as one of the blocking factors. Loop unrolling [14] is an optimization technique in which the body of a loop is replaced with multiple copies of itself. As positive effects, the technique explicitly exposes parallelism in an OpenCL kernel to language compilers and reduces loop overheads such as the loop-counter increment; however, the unrolling also has the side effect of increasing the number of required registers. The unrolling degree therefore needs to be parameterized.

B. Vector width and stride memory access

The width of vector variables is a parameter to the code generator. Vector variables in OpenCL resemble arrays containing multiple elements of the same variable type. The vector width vw affects the performance of the generated kernels, and the best width depends on the processor and the algorithm. In the work-item blocking of Fig. 2(a), each work-item is in charge of the computation of adjacent (unit-stride) elements in an M_wi×N_wi block of C, and the dark shading in the figure indicates the elements accessed by one work-item. In previous works [9], [10], [15], instead of the unit-stride memory access, a non-unit-stride memory access is utilized for performance optimization on Fermi GPUs. Fig. 2(b) depicts the computation with our non-unit-stride memory access, where the stride size in the M direction is M_dimC and the one in the N direction is N_dimC. If vector variables are used (vw ≥ 2), the stride sizes are multiplied by the vector width, i.e., the sizes are vw·M_dimC and vw·N_dimC.

Figure 2. Further blocked matrix multiply-add with the factors M_wi, N_wi in the work-item blocking: (a) with a unit-stride (adjacent) memory access; (b) with a non-unit-stride access.
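To illustrate how the vector width changes a kernel, the following single-precision sketch (an assumed example, not the generator's output) lets each work-item compute vw = 4 adjacent elements of one row of C with float4 loads and stores; N is assumed to be a multiple of 4, and the scalars α and β are omitted as in the descriptions above.

// float4 (vw = 4) variant of the unblocked TN sketch: each work-item
// computes four adjacent (unit-stride) elements of one row of C.
// A is K x M row-major, B is K x N row-major, C is M x N row-major.
__kernel void gemm_tn_naive_vec4(const int M, const int N, const int K,
                                 __global const float *A,
                                 __global const float *B,
                                 __global float *C)
{
    const int i  = get_global_id(0);   /* row of C              */
    const int j4 = get_global_id(1);   /* group of four columns */

    float4 acc = (float4)(0.0f);
    for (int p = 0; p < K; ++p)
        acc += A[p * M + i] * vload4(j4, &B[p * N]);  /* scalar * float4 */

    /* accumulate into the four adjacent elements C(i, 4*j4 .. 4*j4+3) */
    float4 c = vload4(j4, &C[i * N]);
    vstore4(acc + c, j4, &C[i * N]);
}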
C. Usage of local memory

We parameterize the usage of local memory for sharing data among the work-items of a work-group. Local memory offers the advantage of re-using matrix data of A and B which are loaded once from global memory. A disadvantage of using local memory is that it needs barrier synchronizations between the work-items, which take a certain amount of time. This fact means that using local memory does not always lead to high performance. In case local memory is used, the assignment pattern of work-items in a work-group can be reshaped (this reshaping technique is also used in [10]). To represent the reshape, let us introduce the values M_dimA, K_dimA, K_dimB, N_dimB. Reshaping the block is possible as long as the three shapes for A, B, C completely overlay the corresponding matrix blocks. We add the two parameters M_dimA and N_dimB; the other values K_dimA and K_dimB are calculated as (M_dimC·N_dimC)/M_dimA and (M_dimC·N_dimC)/N_dimB, respectively.
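To make the parameter set concrete, the following sketch shows how one configuration (the Tahiti DGEMM parameters listed later in Table II) might be emitted as C preprocessor definitions at the top of a generated kernel. The macro names are illustrative assumptions, not the generator's actual output.

/* One hypothetical parameter configuration (Tahiti, DGEMM; cf. Table II). */
#define M_WG   96               /* work-group blocking factors              */
#define N_WG   32
#define K_WG   48
#define M_DIMC 16               /* work-group size: M_DIMC x N_DIMC         */
#define N_DIMC 16
#define M_WI   (M_WG / M_DIMC)  /* work-item blocking factors (6 and 2)     */
#define N_WI   (N_WG / N_DIMC)
#define K_WI    2               /* unrolling degree of the innermost loop   */
#define VW      2               /* width of vector variables                */
#define M_DIMA 16               /* reshaped work-item assignment for A      */
#define K_DIMA ((M_DIMC * N_DIMC) / M_DIMA)
#define N_DIMB 16               /* reshaped work-item assignment for B      */
#define K_DIMB ((M_DIMC * N_DIMC) / N_DIMB)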

D. Matrix data layouts

The code generator supports A^T B + C kernels in which the matrices A, B are supposed to be aligned in block-major data layouts in addition to a row-major layout. Fig. 3 shows the supported layouts (as an example of an M×K transposed matrix A^T with the blocking factors M_wg, K_wg). Fig. 3(a) is a row-major layout. Fig. 3(b) depicts a column-block-row-major layout (CBL), where the data of each K×M_wg column-block are stored in a row-major order. In CBL, the matrix data required for a multiplication of a K×M_wg column-block of A^T by a K×N_wg column-block of B are in contiguous memory space. Fig. 3(c) shows a row-block-row-major layout (RBL), where the data of each sub-block of a K_wg×M row-block are aligned in a row-major order. In RBL, the matrix data for a multiplication between a K_wg×M_wg sub-block and a K_wg×N_wg sub-block are in sequential memory space. Both CBL and RBL have better spatial locality than the row-major layout, and GEMM kernels using either of them are expected to read matrix data more efficiently.

Figure 3. Matrix data layouts of an M×K transposed matrix with the blocking factors M_wg, K_wg: (a) row-major layout; (b) column-block-row-major layout (CBL); (c) row-block-row-major layout (RBL).

To make use of a fast A^T B + C kernel for the GEMM routines, matrix data have to be copied into extra allocated buffers in global memory before executing the kernel. For example, to implement an AB + C routine where data are stored in row-major order, the matrix A is copied into a buffer with matrix transposition and the matrix B is copied into another buffer without transposition. If the designated data layouts are not row-major, the matrix data are changed into the required layouts along with the copying.
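As an illustration of the CBL layout, a host-side packing routine along the following lines (a C99 sketch under the stated assumptions and with hypothetical names; it is not the paper's code) copies a K×M row-major matrix into column-block-row-major order with column-block width m_wg.

/* Pack a K x M row-major matrix into column-block-row-major (CBL) order:
 * the matrix is cut into column blocks of width m_wg, and each K x m_wg
 * block is stored contiguously in row-major order.  M is assumed to be a
 * multiple of m_wg (otherwise zero padding would be applied first). */
void pack_cbl(int K, int M, int m_wg, const double *src, double *dst)
{
    int nblk = M / m_wg;
    for (int b = 0; b < nblk; ++b)            /* for each column block     */
        for (int k = 0; k < K; ++k)           /* rows, kept row-major      */
            for (int m = 0; m < m_wg; ++m)    /* columns inside the block  */
                dst[(b * K + k) * m_wg + m] = src[k * M + b * m_wg + m];
}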
E. Algorithms

We have implemented three GEMM algorithms, and the code generator has a parameter to indicate one of them. In the OpenCL language, kernels are written in an SPMD (Single-Program Multiple-Data) fashion, where a specific kernel describes the behavior of each work-item in the NDRange. Note that the following algorithms are examples which use local memory for both matrices A and B. If a generated kernel does not use local memory, data elements are directly loaded from global memory into private memory. In the descriptions of the algorithms, barrier means a barrier synchronization between the work-items in a work-group to ensure correct memory access to local memory.

Fig. 4 presents one of the GEMM algorithms. The algorithm is considered as a basic algorithm (BA), which is similar to the GEMM algorithm by Volkov and Demmel [16]. In the body of the outer loop (lines 2-12), a work-item first loads matrix elements of A and B from global memory into local memory. The inner loop body (lines 6-10) loads the elements from local memory into the private memory A_pm, B_pm, multiplies A_pm by B_pm, and adds the product to C_pm, which is also in private memory. The inner loop is fully unrolled. When the work-item exits from the outer loop, it merges the computed results in C_pm with the corresponding elements of C (line 13). The scalars α and β are used for multiplication at the same time as the merging.

1: C_pm = 0
2: for p_wg = 0 to K-1 step K_wg do
3:   load M_wiA × K_wiA elements of A into A_lm
4:   load K_wiB × N_wiB elements of B into B_lm
5:   barrier
6:   for p_wi = 0 to K_wg-1 step K_wi do
7:     load M_wi × K_wi elements of A_lm into A_pm
8:     load K_wi × N_wi elements of B_lm into B_pm
9:     C_pm += A_pm × B_pm
10:  end for
11:  barrier
12: end for
13: merge C_pm with M_wi × N_wi elements of C

pm - private memory, lm - local memory, M_wiA = M_wg/M_dimA, K_wiA = K_wg/K_dimA, K_wiB = K_wg/K_dimB, and N_wiB = N_wg/N_dimB. (The same notations are used in the following algorithms.)

Figure 4. Basic GEMM algorithm (BA)
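A much-simplified OpenCL C rendering of the BA structure is sketched below, assuming one element of C per work-item, square M_wg = N_wg = K_wg = 16 tiles, single precision, α and β omitted, and M, N, K being multiples of the tile size. It is a didactic sketch of the same load/barrier/compute pattern, not the generator's output.

#define TS 16   /* tile size; plays the role of M_wg = N_wg = K_wg here */

// Simplified BA-style kernel for C += A^T * B, with A stored as K x M
// row-major, B as K x N row-major, and C as M x N row-major.  Each
// work-group computes one TS x TS block of C; each work-item keeps one
// private accumulator (C_pm), and the current tiles of A^T and B are
// staged in local memory (Alm, Blm).
__kernel void gemm_tn_ba_sketch(const int M, const int N, const int K,
                                __global const float *A,
                                __global const float *B,
                                __global float *C)
{
    const int li = get_local_id(0), lj = get_local_id(1);
    const int i  = get_group_id(0) * TS + li;   /* row of C    */
    const int j  = get_group_id(1) * TS + lj;   /* column of C */

    __local float Alm[TS][TS];   /* tile of A^T */
    __local float Blm[TS][TS];   /* tile of B   */

    float acc = 0.0f;                          /* C_pm                     */
    for (int p = 0; p < K; p += TS) {          /* outer loop (lines 2-12)  */
        Alm[li][lj] = A[(p + lj) * M + i];     /* load A^T(i, p+lj)        */
        Blm[li][lj] = B[(p + li) * N + j];     /* load B(p+li, j)          */
        barrier(CLK_LOCAL_MEM_FENCE);          /* line 5                   */
        for (int t = 0; t < TS; ++t)           /* inner loop (lines 6-10)  */
            acc += Alm[li][t] * Blm[t][lj];
        barrier(CLK_LOCAL_MEM_FENCE);          /* line 11                  */
    }
    C[i * N + j] += acc;                       /* merge C_pm with C        */
}

The host would enqueue such a sketch with a TS×TS local work size and an M×N global work size.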

Fig. 5 shows another GEMM algorithm. The algorithm uses a software pipelining (PL) technique and is based on the GEMM algorithm proposed in [9], [10]. The PL algorithm has a prologue processing and an epilogue processing in addition to a loop body for the pipelining. The main feature of the algorithm is that the loop body loads elements of A and B from global memory while it loads elements from local memory and computes the multiply-add. This strategy is considered to be effective on OpenCL devices where large access latencies to global memory are a bottleneck.

1: C_pm = 0
2: load M_wiA × K_wiA elements of A into A_lm
3: load K_wiB × N_wiB elements of B into B_lm
4: barrier
5: for p_wg = 0 to K-2·K_wg step K_wg do
6:   load M_wiA × K_wiA elements of A into A_pm
7:   load K_wiB × N_wiB elements of B into B_pm
8:   barrier
9:   for p_wi = 0 to K_wg-1 step K_wi do
10:    load M_wi × K_wi elements of A_lm into A_pm1
11:    load K_wi × N_wi elements of B_lm into B_pm1
12:    C_pm += A_pm1 × B_pm1
13:  end for
14:  barrier
15:  store M_wiA × K_wiA elements of A_pm into A_lm
16:  store K_wiB × N_wiB elements of B_pm into B_lm
17:  barrier
18: end for
19: for p_wi = 0 to K_wg-1 step K_wi do
20:   load M_wi × K_wi elements of A_lm into A_pm1
21:   load K_wi × N_wi elements of B_lm into B_pm1
22:   C_pm += A_pm1 × B_pm1
23: end for
24: merge C_pm with M_wi × N_wi elements of C

Figure 5. GEMM algorithm with software pipelining (PL)

The third GEMM algorithm is shown in Fig. 6. The algorithm is a variant of the algorithm with a double-buffering strategy (DB) by Tan et al. [15]. In contrast to the PL algorithm, the DB algorithm requires less private memory. The private memory is usually allocated in registers. On some processors (especially on GPUs), the number of used registers affects the kernel performance [5]: the number of registers determines the number of work-groups launched on a compute unit, and if the number of work-groups is not large enough, the processor cannot hide the memory access latencies. A drawback of the DB algorithm is that it requires a larger space in local memory than the other two algorithms.

1: C_pm = 0
2: load M_wiA × (K_wiA/2) elements of A into A_lm
3: load (K_wiB/2) × N_wiB elements of B into B_lm
4: for p_wg = 0 to K-2·K_wg step K_wg do
5:   barrier
6:   load M_wiA × (K_wiA/2) elements of A into A_lm1
7:   load (K_wiB/2) × N_wiB elements of B into B_lm1
8:   for p_wi = 0 to K_wg/2-1 step K_wi do
9:     load M_wi × K_wi elements of A_lm into A_pm
10:    load K_wi × N_wi elements of B_lm into B_pm
11:    C_pm += A_pm × B_pm
12:  end for
13:  barrier
14:  load M_wiA × (K_wiA/2) elements of A into A_lm
15:  load (K_wiB/2) × N_wiB elements of B into B_lm
16:  for p_wi = K_wg/2 to K_wg-1 step K_wi do
17:    load M_wi × K_wi elements of A_lm1 into A_pm
18:    load K_wi × N_wi elements of B_lm1 into B_pm
19:    C_pm += A_pm × B_pm
20:  end for
21: end for
22: barrier
23: load M_wiA × (K_wiA/2) elements of A into A_lm1
24: load (K_wiB/2) × N_wiB elements of B into B_lm1
25: for p_wi = 0 to K_wg/2-1 step K_wi do
26:   load M_wi × K_wi elements of A_lm into A_pm
27:   load K_wi × N_wi elements of B_lm into B_pm
28:   C_pm += A_pm × B_pm
29: end for
30: barrier
31: for p_wi = K_wg/2 to K_wg-1 step K_wi do
32:   load M_wi × K_wi elements of A_lm1 into A_pm
33:   load K_wi × N_wi elements of B_lm1 into B_pm
34:   C_pm += A_pm × B_pm
35: end for
36: merge C_pm with M_wi × N_wi elements of C

Figure 6. GEMM algorithm with double-buffering strategy (DB)
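For comparison, the following sketch keeps the same simplified single-precision setting as the BA sketch above and shows only the double-buffering idea: two local-memory tiles per matrix alternate between being computed on and being refilled, so a single barrier per outer iteration suffices. It is an assumed illustration and does not reproduce the register blocking or the halved K_wg tiles of the algorithm in Fig. 6.

#define TS 16   /* tile size, as in the BA sketch above */

// Simplified double-buffered kernel for C += A^T * B (A: K x M, B: K x N,
// C: M x N, all row-major).  While one pair of local tiles is used for the
// multiply-add, the next pair is prefetched into the other buffer.
__kernel void gemm_tn_db_sketch(const int M, const int N, const int K,
                                __global const float *A,
                                __global const float *B,
                                __global float *C)
{
    const int li = get_local_id(0), lj = get_local_id(1);
    const int i  = get_group_id(0) * TS + li;
    const int j  = get_group_id(1) * TS + lj;

    __local float Alm[2][TS][TS], Blm[2][TS][TS];

    /* prologue: fill buffer 0 with the first tiles */
    Alm[0][li][lj] = A[lj * M + i];
    Blm[0][li][lj] = B[li * N + j];

    float acc = 0.0f;
    int buf = 0;
    for (int p = 0; p < K; p += TS) {
        barrier(CLK_LOCAL_MEM_FENCE);   /* buffer 'buf' is now complete    */
        if (p + TS < K) {               /* prefetch the next tiles into    */
            Alm[1 - buf][li][lj] = A[(p + TS + lj) * M + i];  /* buffer    */
            Blm[1 - buf][li][lj] = B[(p + TS + li) * N + j];  /* 1 - buf   */
        }
        for (int t = 0; t < TS; ++t)    /* compute on buffer 'buf'         */
            acc += Alm[buf][li][t] * Blm[buf][t][lj];
        buf = 1 - buf;
    }
    C[i * N + j] += acc;
}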
F. Modifications of the GEMM code generator

Our GEMM code generator has been modified from [13] in several aspects. The number of parameters related to blocking factors is increased from six to eight. Each blocking factor was previously limited to a power of two; this limitation has been eliminated. A non-unit-stride memory access is implemented in addition to the unit-stride access. The current generator can produce GEMM kernels that use local memory for both matrices A and B, while the previous generator was incomplete in producing such kernels. GEMM kernels access data in buffer objects on global memory; image objects, which are another possible memory object in OpenCL, are currently not used. The three GEMM algorithms discussed above are now supported, whereas only the BA algorithm was used in the previous version of the code generator.

Our strategy to search for the fastest (best) kernel has not been changed significantly. We searched tens of thousands of kernel variants per single GEMM type on an OpenCL device. For this number, kernels that fail in code generation, compilation, or testing are not counted. These many variants were heuristically chosen. We implemented a heuristic search engine and selected the fastest kernel. To find the best set of parameters for each GEMM kernel, the search engine has to run for more than five hours. The procedure for selecting the best kernel is as follows:
1) Measuring the performance in GFlop/s of every generated GEMM kernel for a problem size N = ⌊4096/LCM⌋·LCM on GPU devices and N = ⌊1536/LCM⌋·LCM on CPU devices, where the matrices are square (M = N = K) and LCM is the least common multiple of the work-group blocking factors M_wg, N_wg, K_wg.
2) Further measuring the performance of the fastest 5 kernels among the large number of kernels tested in 1), for problem sizes N that are multiples of LCM with N ≤ 8192.
3) Selecting the fastest kernel among the 5 kernels tested in 2).
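For step 1), the benchmark size can be derived directly from the blocking factors; a small C99 helper of the following form (a hypothetical illustration, not the search engine's code) shows the calculation.

/* Benchmark problem size for step 1): the largest multiple of
 * LCM(M_wg, N_wg, K_wg) that does not exceed the limit
 * (4096 for GPU devices, 1536 for CPU devices). */
static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
static int lcm(int a, int b) { return a / gcd(a, b) * b; }

int bench_size(int m_wg, int n_wg, int k_wg, int limit)
{
    int l = lcm(lcm(m_wg, n_wg), k_wg);
    return (limit / l) * l;
}

For example, work-group blocking factors of 96, 32, 48 give LCM = 96 and a benchmark size of N = 4032 on a GPU device.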
IV. PERFORMANCE EVALUATION

In this study, we have made the performance measurements for four different GPUs (AMD Tahiti, AMD Cayman, NVIDIA Kepler, and NVIDIA Fermi) and two different CPUs (Intel Sandy Bridge and AMD Bulldozer). The specifications of the processors are shown in Table I. The Kepler GPU has a boost function which dynamically increases the processor frequency; therefore, the measured performance may be higher than the listed peak performance. Note that the presented performance numbers do not take into account the data transfer time between the host and the OpenCL device.

Table I. PROCESSOR SPECIFICATION

Code name                             Tahiti        Cayman        Kepler           Fermi         Sandy Bridge     Bulldozer
Product name                          HD 7970       HD 6970       GTX 670 OC       Tesla M2090   Core i7 3960X    FX-8150
Core clock speed [GHz]                0.925         0.880         1.085            1.3           3.3              3.6
Number of compute units               32            24            7                16            6                8
Max DP operations / clock             1024          768           96               512           48               32
Max SP operations / clock             4096          3072          2688             1024          96               64
Peak DP performance [GFlop/s]         947           676           122              665           158.4            115.2
Peak SP performance [GFlop/s]         3789          2703          2916             1331          316.8            230.4
Global memory size [GB]               3             1             2                6             -                -
Peak global memory bandwidth [GB/s]   264           176           192              177           -                -
L3 cache size [MB]                    -             -             -                -             15 (a)           8 (a)
L2 cache size [kB]                    768 (a)       512 (a)       512 (a)          768 (a)       256 (b)          2048 (c)
L1 cache size [kB]                    16 (b)        8 (b)         64 (b)           16 (b)        32 (b)           64 (c)
Local memory size [kB]                64 (b)        32 (b)        48 (b)           48 (b)        32               32
Local memory type                     Scratchpad    Scratchpad    Scratchpad       Scratchpad    Global           Global
OpenCL SDK                            AMD APP 2.6   AMD APP 2.6   CUDA 5.0 RC (f)  CUDA 4.1.28   Intel 2013 beta  AMD APP 2.7
Display driver version                12.3 (d)      11.11 (d)     304.33 (e)       285.05 (e)    -                -

SP: single precision; DP: double precision; SDK: Software Development Kit.
(a): size per processor; (b): size per compute unit (core); (c): size per two cores; (d): Catalyst driver version; (e): CUDA driver version; (f): CUDA 5.0 Release Candidate.

A. Performance of GEMM kernels

Fig. 7 depicts the performance of the selected fastest DGEMM and SGEMM kernels as a function of the problem size. Table II shows the sets of parameters and the observed maximum performance of the kernels.

Figure 7. Performance of the fastest DGEMM and SGEMM C ← αA^T B + βC kernels produced by our code generator on different processors (Tahiti, Cayman, Kepler, Fermi, Sandy Bridge, and Bulldozer; GFlop/s versus problem size).

The Tahiti GPU shows the highest performance: 863 GFlop/s (91% of the peak performance) in DGEMM and 3047 GFlop/s (80%) in SGEMM. These performance numbers are higher than our previous results [13]. In particular, the SGEMM performance is significantly increased from 2646 GFlop/s. The main reason for the performance improvement is that the new SGEMM kernel uses local memory for both matrices A, B. In addition to the Tahiti case, the local memory usage affects the performance improvement on the Kepler and Fermi GPUs. For instance, if local memory is not used for both matrices on the Kepler, the maximum SGEMM performance decreases from 1440 GFlop/s to 1150 GFlop/s. A prominent performance difference depending on the local memory usage cannot be seen on the CPUs. The Cayman runs slower when local memory is utilized, probably because the cost of the barrier synchronizations is too large.

The selection of the GEMM algorithm affects the performance of the GEMM kernels. Fig. 8 depicts the relative performance of the three different GEMM algorithms with respect to the maximum performance from Table II for each processor. Note that DGEMM kernels with the PL algorithm always fail to execute on the Bulldozer. The BA algorithm is apparently the best on the Tahiti GPU. For the other three GPUs, the best algorithm differs between the DGEMM and SGEMM kernels. Performance variations on the CPUs are relatively small.

Figure 8. Relative performance of the GEMM kernels using the different algorithms (BA, PL, DB, for DGEMM and SGEMM) with respect to the maximum performance from Table II for each processor.

GEMM kernels using block-major matrix layouts show the highest performance on all tested processors. The influence of the block-major layouts on the performance is large on the two AMD GPUs, while it is relatively small on the other processors. The fastest DGEMM kernel that does not use block-major data layouts demonstrates a maximum performance of 837 GFlop/s on the Tahiti, and its performance for some problem sizes (such as multiples of 2048) is drastically deteriorated because of memory bank conflicts.

Table II. PARAMETERS FOR THE FASTEST C ← αA^T B + βC KERNEL AND THE MAXIMUM PERFORMANCE

DGEMM:
Parameter              Tahiti     Cayman     Kepler     Fermi      Sandy Bridge  Bulldozer
M_wg, N_wg, K_wg       96,32,48   64,32,48   32,64,8    64,64,8    64,32,64      48,32,96
M_wi, N_wi, K_wi       6,2,2      4,4,24     2,4,4      4,4,2      4,8,4         2,8,16
M_dimC, N_dimC         16,16      16,8       16,16      16,16      16,4          24,4
M_dimA, K_dimA         16,16      16,8       32,8       64,4       16,4          24,4
K_dimB, N_dimB         16,16      16,8       8,32       4,64       16,4          48,2
Vector (a)             2          2          1          1          4             2
Stride (b)             -          N          N          N          -             M
Shared (c)             B          -          A, B       A, B       B             B
Layout (d)             CBL,CBL    CBL,CBL    CBL,CBL    CBL,RBL    RBL,RBL       CBL,RBL
Algorithm              BA         BA         BA         PL         DB            DB
Max perf. [GFlop/s]    863        580        128        370        64            37
Efficiency             91%        86%        105%       56%        40%           32%

SGEMM:
Parameter              Tahiti     Cayman     Kepler     Fermi      Sandy Bridge  Bulldozer
M_wg, N_wg, K_wg       96,96,16   128,64,96  64,64,8    64,64,16   64,64,64      32,48,192
M_wi, N_wi, K_wi       6,6,2      8,8,24     8,4,8      8,4,16     8,8,8         4,12,4
M_dimC, N_dimC         16,16      16,8       8,16       8,16       8,8           8,4
M_dimA, K_dimA         16,16      16,8       32,4       32,4       8,8           8,4
K_dimB, N_dimB         16,16      16,8       4,32       8,16       8,8           8,4
Vector (a)             1          4          2          2          8             4
Stride (b)             M          N          M          M, N       M             M
Shared (c)             A, B       -          A, B       A, B       B             -
Layout (d)             CBL,CBL    CBL,CBL    CBL,CBL    CBL,CBL    RBL,RBL       CBL,CBL
Algorithm              BA         PL         PL         BA         BA            BA
Max perf. [GFlop/s]    3047       2167       1440       896        140           87
Efficiency             80%        80%        49%        67%        44%           38%

(a): width of vector variables; (b): non-unit-stride access in each direction; (c): matrix whose data are shared in local memory; (d): data layouts for matrices A, B, respectively.

B. Performance of GEMM implementations

Our GEMM implementations execute the C ← αA^T B + βC kernel after copying the matrix data. The matrix data are transposed and changed into a block-major order during the copying. When a matrix size is not a multiple of a blocking factor, we use a zero-padding technique. This section presents the performance results of our GEMM implementations where the matrix data are stored in column-major order. Table III shows a summary of the measured performance and a comparison with the vendor BLAS libraries.
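The host-side copy step can be sketched as follows (a C99 illustration under assumed conventions and with hypothetical names, not the paper's code): a column-major input matrix is copied into a zero-padded row-major buffer that then holds the transposed operand, after which a block-major packing such as the CBL packing sketched earlier in Section III can be applied.

#include <string.h>

/* Round x up to the next multiple of b. */
static int round_up(int x, int b) { return (x + b - 1) / b * b; }

/* Copy a column-major M x K matrix A (leading dimension M) into a
 * row-major, zero-padded Kp x Mp buffer At holding A^T, where Mp and Kp
 * are the problem sizes rounded up to multiples of the blocking factors.
 * Reading a column-major matrix column by column and writing rows is
 * exactly the required transposition. */
void copy_transpose_pad(int M, int K, int m_blk, int k_blk,
                        const double *A, double *At)
{
    int Mp = round_up(M, m_blk), Kp = round_up(K, k_blk);
    memset(At, 0, (size_t)Mp * Kp * sizeof(double));
    for (int k = 0; k < K; ++k)
        for (int m = 0; m < M; ++m)
            At[k * Mp + m] = A[k * M + m];   /* A(m,k) -> A^T(k,m) */
}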

Fig. 9 depicts the performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU. In the figure, we also plot the performance results of our previous implementation [13] and of AMD APPML (Accelerated Parallel Processing Math Libraries) clBLAS 1.8.291. Our current implementation shows the highest performance. Note that the current implementation is not fast for small sizes because the ratio of the copying time to the total time is relatively large. When the matrix size is large, the overhead of the copying is amortized, since the copying of an N×N matrix needs O(N^2) memory operations while the matrix multiplication requires O(N^3) arithmetic operations.

Figure 9. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU (this study, our previous study [13], and AMD clBLAS 1.8.291).

Table III. MAXIMUM PERFORMANCE [GFLOP/S] OF OUR GEMM IMPLEMENTATIONS AND OF THE GEMM ROUTINES OF VENDOR LIBRARIES, WHERE MATRIX DATA ARE STORED IN COLUMN-MAJOR ORDER

                              DGEMM                          SGEMM
Processor       Impl.       NN     NT     TN     TT        NN     NT     TN     TT
Tahiti          Ours        852    855    849    851       2989   3008   2970   2989
                Vendor (a)  647    731    549    650       2468   2489   1476   2281
Cayman          Ours        568    567    565    565       2060   2096   2037   2074
                Vendor (a)  329    336    320    329       171    111    662    121
Kepler          Ours        127    128    127    128       1399   1417   1382   1399
                Vendor (b)  124    122    122    122       1371   1417   1227   1361
Fermi           Ours        366    368    363    365       882    888    876    882
                Vendor (c)  405    406    408    405       830    942    920    889
Sandy Bridge    Ours        60     60     60     60        132    133    132    133
                Vendor (d)  138    139    138    138       282    285    281    283
Bulldozer       Ours        36     37     36     36        74     78     70     74
                Vendor (e)  50     50     50     50        130    110    130    110

NN: C ← αAB + βC; NT: C ← αAB^T + βC; TN: C ← αA^T B + βC; TT: C ← αA^T B^T + βC.
(a): AMD Accelerated Parallel Processing Math Libraries (APPML) clBLAS 1.8.291; (b): NVIDIA CUBLAS in CUDA 5.0 RC; (c): NVIDIA CUBLAS in CUDA 4.1.28; (d): Intel Math Kernel Library (MKL) 2011.1.319; (e): AMD Core Math Library (ACML) 5.1.0.

The performance of the GEMM implementations on the Fermi and Kepler GPUs is shown in Fig. 10. The figure compares the performance of our current implementation with NVIDIA CUBLAS 4.1.28 and MAGMA (Matrix Algebra on GPU and Multicore Architectures) 1.2.1 on the Fermi, and with CUBLAS 5.0 RC on the Kepler. As can be seen, our implementation in OpenCL is comparable to these implementations in CUDA. The performance of our OpenCL implementation does not highly depend on the GEMM type (see Table III).

Figure 10. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Fermi and Kepler GPUs (this study, CUBLAS 4.1.28 and MAGMA 1.2.1 on the Fermi, and CUBLAS 5.0 RC on the Kepler).

The OpenCL implementation on the CPUs is not as good compared with the vendor libraries. The performance in OpenCL is two or more times lower than Intel MKL (Math Kernel Library) 2011.1.319 on the Sandy Bridge. A possible reason for the low utilization efficiency is that current OpenCL compilers for CPUs are not as mature as those for GPUs. Another reason is that our auto-tuning system is not particularly optimized for CPUs. On the Sandy Bridge CPU, we have tested another version of the Intel OpenCL SDK (version 2012) in addition to the latest version (2013 beta). Fig. 11 shows the performance of the different DGEMM implementations. Using the newer SDK improves the performance by around 20%. In the figure, we also compare our DGEMM performance with the performance of the DGEMM routine auto-tuned by ATLAS [7]. ATLAS searches for the best BLAS kernels written in the C language. The performance by ATLAS is higher, though both C and OpenCL are high-level languages.

Figure 11. Performance of different DGEMM C ← αAB + βC implementations on the Sandy Bridge CPU (this study with the Intel SDK 2013 beta and the Intel SDK 2012, Intel MKL, and ATLAS).

C. Comparison to other works

Matrix multiplication is compute intensive and important in high-performance computing, and there have been a number of works on fast matrix multiplication. Kurzak et al. [10] developed an auto-tuning system for all GEMM routines on a Fermi GPU. Their system develops fast GEMM kernels written in CUDA. In [17], they also reported results of a performance evaluation by the auto-tuning system on a Kepler GPU (GeForce GTX 680) that has a peak single-precision performance of 3090 GFlop/s. The SGEMM performance is around 1150 GFlop/s for the size M = N = K = 4096. Although the experimental environments including the GPU model are different, our current SGEMM implementation shows higher performance, which is 1340 GFlop/s, on a Kepler GPU.

Tan et al. [15] presented a fast DGEMM implementation on a Fermi GPU (Tesla C2050). The DB algorithm in Fig. 6 is based on their GEMM algorithm with the double-buffering strategy. They reported that their DGEMM kernel achieves 362 GFlop/s, which is 70% utilization efficiency. The tuned kernel is written in Fermi's native machine language. They claim that such high processor utilization is impossible by using CUDA C or the PTX language, and this also holds for OpenCL, as shown in our experiments.

Nakasato [18] implemented GEMM kernels in an assembly-like intermediate language (IL). His GEMM kernels read matrix data through the texture cache (image). In our measurement, the performance of his DGEMM kernel is up to 498 GFlop/s (92% efficiency) on an AMD/ATI Cypress GPU (Radeon HD 5870). We applied our auto-tuning system to this GPU, and the fastest generated DGEMM implementation in OpenCL achieves 495 GFlop/s.

Du et al. [12] presented auto-tuned SGEMM and DGEMM routines in OpenCL. The maximum performance of their DGEMM routine is 308 GFlop/s (57% efficiency) on the Cypress GPU. We consider that the large performance difference between our and their implementations comes from the following two main reasons: 1) the OpenCL SDK they used is older and less mature (they used the ATI Stream SDK 2.1 while we use the AMD APP SDK 2.5); 2) the set of parameters in their code generator is different (the parameters include the vector variable width, blocking factors, texture cache usage, and local memory usage).

V. CONCLUSION

We have shown that our tuning system for fast matrix multiplication is widely applicable to processors that support OpenCL. The performance demonstrated by the best GEMM kernels is superior to the vendor library (clBLAS) on AMD GPUs. On NVIDIA GPUs, the GEMM performance is almost equivalent to that of the libraries in CUDA (CUBLAS and MAGMA). For CPUs, our current implementations do not perform as well as for GPUs. The high performance of our GEMM kernels relies on the usage of block-major layouts for storing matrix data. The block-major layouts contribute to the performance improvement on all tested processors. We have implemented three different GEMM algorithms and measured the performance differences among these algorithms.

Our implementations conduct a copying of the matrix data in order to use a GEMM kernel in a block-major layout. For small sizes, the overhead of the copying is relatively large; therefore, the implementation does not run fast. One possible solution for such sizes is to use another GEMM kernel that works without the matrix copying. A future work is to implement such a kernel and combine it with the current implementation.

ACKNOWLEDGMENT

A part of this work has been carried out under the Interdisciplinary Computational Science Program in the Center for Computational Sciences, University of Tsukuba.

REFERENCES

[1] Basic Linear Algebra Subprograms Technical Forum Standard, Aug. 2001. [Online]. Available: http://www.netlib.org/blas/blast-forum/blas-report.pdf
[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1999.
[3] B. Kågström, P. Ling, and C. Van Loan, "GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark," ACM Transactions on Mathematical Software, vol. 24, no. 3, pp. 268-302, 1998.
[4] Khronos Group. (Accessed Sep. 3, 2012) OpenCL - The open standard for parallel programming of heterogeneous systems. [Online]. Available: http://www.khronos.org/opencl
[5] AMD Inc., AMD Accelerated Parallel Processing OpenCL Programming Guide, rev. 2.3, Jul. 2012.
[6] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel, "Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology," Computer Science Department, University of Tennessee, Tech. Rep., May 1996. [Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn111.pdf
[7] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, no. 1-2, pp. 3-35, Jan. 2001.
[8] C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), Sep. 2005, pp. 185-194.
[9] R. Nath, S. Tomov, and J. Dongarra, "An improved MAGMA GEMM for Fermi graphics processing units," International Journal of High Performance Computing Applications, vol. 24, no. 4, pp. 511-515, 2010.
[10] J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning GEMM kernels for the Fermi GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, pp. 2045-2057, Nov. 2012.
[11] C. Jang. (Accessed Sep. 3, 2012) GATLAS: GPU Automatically Tuned Linear Algebra Software. [Online]. Available: http://golem5.org/gatlas
[12] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Computing, vol. 38, no. 8, pp. 391-407, Oct. 2011.
[13] K. Matsumoto, N. Nakasato, and S. G. Sedukhin, "Implementing a code generator for fast matrix multiplication in OpenCL on the GPU," in Proceedings of the IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), Aizu-Wakamatsu City, Japan: IEEE Computer Society, Sep. 2012, pp. 198-204.
[14] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan, "Optimal loop unrolling for GPGPU programs," in Proceedings of the 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Apr. 2010, pp. 1-11.
[15] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun, "Fast implementation of DGEMM on Fermi GPU," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), Seattle, WA, USA: ACM, Nov. 2011, pp. 35:1-35:11.
[16] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08), Austin, TX, USA: IEEE Press, Nov. 2008, pp. 31:1-31:11.
[17] J. Kurzak, P. Luszczek, S. Tomov, and J. Dongarra, "Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture - GeForce GTX 680," LAPACK Working Note 267, 2012. [Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
[18] N. Nakasato, "A fast GEMM implementation on the Cypress GPU," ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 50-55, Mar. 2011.