Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA

Ahmad Abdelfattah (a), Azzam Haidar (a), Stanimire Tomov (a), Jack Dongarra (a,b,c)

(a) Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
(b) Oak Ridge National Laboratory, Oak Ridge, USA
(c) University of Manchester, UK

Corresponding author: Ahmad Abdelfattah. Email addresses: ahmad@icl.utk.edu (Ahmad Abdelfattah), haidar@icl.utk.edu (Azzam Haidar), tomov@icl.utk.edu (Stanimire Tomov), dongarra@icl.utk.edu (Jack Dongarra).

Abstract

This paper presents a GPU-accelerated Cholesky factorization for two different modes of operation. The first is the batch mode, in which many independent factorizations of small matrices are performed concurrently. This mode supports fixed size and variable size problems, and is found in many scientific applications. The second is the native mode, in which one factorization is performed on a large matrix without any CPU involvement, which frees the CPU to do other useful work. We show that, despite the different workloads, both modes of operation share a common code base that uses the GPU only. We also show that the developed routines achieve significant speedups against a multicore CPU using the MKL library. This work is part of the MAGMA library.

Keywords: GPU computing, Cholesky factorization, batched execution

1. Introduction

High performance solutions of many small independent problems are crucial to many scientific applications, including astrophysics [1], quantum chemistry [2], metabolic networks [3], CFD and the resulting PDEs solved through direct and multifrontal solvers [4], high-order FEM schemes for hydrodynamics [5], direct-iterative preconditioned solvers [6], and image [7] and signal processing [8].

The lack of parallelism in each of the small problems drives researchers to take advantage of the mutual independence among these problems, and to develop specialized software that groups the computation into a single batched routine. Such software is relatively easy to develop for multicore CPUs using the existing optimized sequential vendor libraries as building blocks. For example, on Intel CPUs, a combination of the MKL library and OpenMP (scheduling individual cores dynamically across the input problems) usually achieves very high performance, since most of the computation can be performed through the fast CPU cache. The same technique, however, cannot be used on GPUs, fundamentally due to the lack of large caches.
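
To make this CPU baseline concrete, the following sketch (not taken from the paper) dispatches a batch of independent small Cholesky factorizations across CPU cores with OpenMP, using the sequential LAPACKE interface (as shipped with MKL) as the building block. The driver name and the vector-of-pointers batch layout are illustrative assumptions.

    #include <vector>
    #include <omp.h>
    #include <lapacke.h>   // sequential dpotrf building block (e.g., provided by MKL)

    // Hypothetical CPU batch driver: one dynamically scheduled factorization per matrix.
    // Each A[i] is a column-major n[i]-by-n[i] SPD matrix with leading dimension n[i].
    void cpu_potrf_batched(std::vector<double*>& A, const std::vector<int>& n)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < (int)A.size(); ++i) {
            // Small problems typically fit in cache, so each core works mostly out of its cache.
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n[i], A[i], n[i]);
        }
    }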

On the other hand, there is a need to develop factorizations and linear system solvers that work entirely on the GPU, with no computational work submitted to the CPU. We call this mode of operation the native mode. Native execution on the GPU allows the CPU to do other useful work. It can also be used in power-sensitive environments, or in embedded systems with relatively slow CPUs, such as the Jetson TK1. Since GPUs are inherently more energy efficient than CPUs, a native code, although slower than a hybrid code that uses both the CPU and the GPU, is expected to be more energy efficient than the hybrid one [9].

While MAGMA [10] provides high performance LAPACK functionality on GPUs, most of the MAGMA routines are hybrid, meaning that both the CPU and the GPU are engaged in the computation. This technique has been shown to achieve very high performance on large problems [11]. However, it cannot be used efficiently to solve a batch of small problems, due to the prohibitive cost of CPU-GPU communication for small problems. Nor can it be used in systems with low-end CPUs, or when the CPU is required to do other work. In general, we need a different design approach that uses the GPU only.

This paper presents a high performance Cholesky factorization that can run entirely on the GPU. We discuss two modes of operation. The first is the batch mode (we use the words batch and batched interchangeably), where many small independent problems, of the same size or of different sizes, are factorized concurrently. We extend the work presented in this direction [12] by showing a design that works for any size, not only for sizes where the panel fits into the GPU shared memory. The second mode of operation is called the native mode, where one large matrix is factorized using the GPU only. We show that the developed software for both modes shares a common code base while achieving high performance. With this work integrated into the MAGMA library, we provide various choices to perform the factorization efficiently according to the different situations summarized in Figure 1.

Figure 1: Decision flowchart for the modes of operation in MAGMA.

The paper starts with progressive optimization and tuning for the batched mode where all problems have the same size. We then proceed with the best configuration for fixed size problems and extend it to support variable size problems. The native mode is realized by using the same code base with different tuning parameters to work on one large problem at a time. We show experimental results that demonstrate the performance of the proposed routines against state-of-the-art CPU and GPU solutions.

The rest of the paper is organized as follows. Section 2 discusses previous efforts in GPU-accelerated matrix factorizations, with a focus on batched routines. The different modes of operation are discussed in Section 3. Section 4 presents a detailed description of the design approach. In Section 5, we discuss the obtained performance results. The paper ends with a conclusion in Section 6.

2. Related Work

Since the emergence of general purpose GPU computing (GPGPU), performance optimization of matrix factorization algorithms on GPUs has been a trending research topic. The hybrid algorithms in MAGMA represent the state of the art in this area, where GPUs significantly accelerate the compute-intensive trailing updates [13, 14] while the CPU, in the meantime, prepares the next panel factorization [11]. It has been shown, however, that such an algorithmic design is not suitable for batched workloads [15], mainly due to the lack of parallelism in trailing matrix updates. This has led to research efforts that deal with small matrix computations on GPUs. Small LU factorizations were investigated by Villa et al. [16, 17] (for sizes up to 128) and by Wainwright [18] (for sizes up to 32). Batch one-sided factorizations have been the focus of several research efforts, including Cholesky factorization [19, 20], and LU and QR factorizations [21, 22, 23]. Some contributions focus on very small matrices, where all the computational stages are fused and performed by a single thread block (TB), as proposed in [20], [12], and [24].

The authors of this paper introduced variable size batched matrix multiplication (GEMM) [25] as a first step toward developing LAPACK algorithms for variable size batched workloads. In addition, earlier work by the authors [12] presented an optimized batched Cholesky factorization that had a limitation on the matrix sizes it could operate on. In fact, the kernel design proposed in [12] requires a dynamic shared memory allocation that is a function of the matrix size, meaning that it cannot work on arbitrary sizes. This paper extends that work and provides a design that can work on any matrix size, while supporting batches of fixed and variable sizes. It also uses the same code base to develop a native GPU factorization for very large matrices. We also show that the performance of the developed work is portable across three different GPU architectures, achieving high performance in all scenarios.

3. Modes of Operation

As mentioned earlier, we are designing a full GPU solution that can operate in two modes, with the design goal of having a unified code base for both. As an example, Figure 2 shows the modes of operation for the POTF2 algorithm, which is used to perform the Cholesky panel factorization. A code base, written as CUDA device routines, represents the core operation for one matrix. This code base is oblivious to any tuning parameters, which are defined later for each mode. The device routines are then wrapped into three CUDA kernels, as shown in the figure. The native mode is the simplest, as it considers only one problem: the kernel passes the input arguments directly to the device routine, with no preprocessing required.

Figure 2: Modes of operation for the POTF2 routine. A common CUDA device routine (the potf2 code base) is wrapped by the potf2_native, potf2_batched, and potf2_vbatched kernels, which read the (local) arguments, determine the batchid, and, in the variable size case, terminate extra TBs using the ETM.

The batched mode requires some preprocessing. The potf2_batched kernel is used for fixed size batched problems. It is internally organized into a number of subgrids, each with a unique batchid. The batchid is used to map a certain matrix to a specific subgrid. The kernel reads the local input arguments of the assigned problem and passes them to the device routine. The variable size batched kernel (potf2_vbatched), on the other hand, assumes that each matrix has a different size and leading dimension. The kernel is configured according to the largest matrix in the batch, which means that every subgrid can accommodate this matrix. An extra preprocessing step called the Early Termination Mechanism (ETM) [25][12] trims each subgrid according to the local size of the assigned problem. This step is necessary to avoid runtime failures and memory access violations. After trimming the subgrids, the kernel passes the local input arguments to the device routine to start the execution. We use the same approach shown in Figure 2 for all the other building block routines discussed in this paper.
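
The following CUDA sketch (assumed names and launch geometry, not the actual MAGMA source) illustrates how a single device routine can be wrapped by the fixed size and variable size batched kernels, with the batchid derived from the subgrid index and a schematic block-level ETM check in the variable size case.

    #include <cuda_runtime.h>

    // Hypothetical core device routine: factorizes one matrix. The real MAGMA code base is
    // templated over the precision and the tuning parameters.
    __device__ void potf2_device(int n, double* A, int lda) { /* ... one-matrix panel code ... */ }

    // Fixed size batch: one z-slice of the grid (a subgrid) per matrix.
    __global__ void potf2_batched_kernel(int n, double** dA_array, int ldda)
    {
        const int batchid = blockIdx.z;              // map this subgrid to its matrix
        potf2_device(n, dA_array[batchid], ldda);
    }

    // Variable size batch: the grid is configured for the largest matrix in the batch.
    __global__ void potf2_vbatched_kernel(const int* n, double** dA_array, const int* ldda, int nb)
    {
        const int batchid = blockIdx.z;
        const int my_n    = n[batchid];              // local size of the assigned problem
        // ETM (schematic): thread blocks that fall outside this matrix's local size terminate.
        if (blockIdx.x * nb >= my_n) return;
        potf2_device(my_n, dA_array[batchid], ldda[batchid]);
    }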

4. Algorithmic Design

This section describes the design details of the Cholesky factorization in both the batched and native modes. Our starting point is a high performance design and implementation for fixed size batched workloads. Such a design can then be ported easily to support variable size batched workloads, as well as the native mode for large matrices.

4.1. Overall Design

Figure 3 shows the overall design of the Cholesky factorization algorithm. The three main computational stages of the algorithm are the Cholesky panel factorization (POTF2), the triangular solve (TRSM), and the Hermitian rank-k update (HERK). The right side of the figure is the conventional way of performing the computation as three separate BLAS kernels, each of which is launched by the CPU. However, if the matrix size (N) is less than a threshold (C), we use the blocked POTF2 routine to perform the entire factorization. The blocked POTF2 routine is recursively blocked to make use of level-3 BLAS operations, and thus achieves high performance (left side of the figure). It consists of three stages (unblocked POTF2, TRSM, and HERK) that are fused together into a single kernel. Fusing these routines saves global memory traffic and allows data to be reused in shared memory across the computational stages, which gives a large performance advantage for very small matrices. If the matrix size is larger than C, the blocked POTF2 routine serves as the panel factorization step on the right side of the figure.

Figure 3: Overall design of the Cholesky factorization algorithm.

The blocked POTF2 routine is probably the most important routine in Figure 3, because it is used on its own in the batched mode to factorize small matrices, and in the native mode it replaces the panel factorization otherwise done by the multicore CPU. It therefore has to be well optimized in order to deliver the best performance in the batched mode, and to introduce minimal overhead to the execution time in the native mode. This is why we focus on the design details of POTF2 in the following subsections. The other routines (TRSM and HERK) are simpler to optimize due to their reliance on our batched GEMM kernel [25].
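
A minimal host-side sketch of the dispatch in Figure 3 is shown below. The routine names, the sub-matrix offset arguments, and the crossover handling are assumptions made for illustration; they are not MAGMA's actual interfaces.

    // Placeholder batched building blocks (assumed interfaces). The offsets (i, j) select a
    // sub-matrix of every matrix in the batch; the leading dimensions are all ldda.
    void potf2_blocked_batched(int m, int n, double** dA, int i, int j, int ldda, int batchCount);
    void trsm_batched(int m, int n, double** dA, int i, int j, int ldda, int batchCount);
    // Updates A(i:i+n, i:i+n) -= A(i:i+n, j:j+k) * A(i:i+n, j:j+k)^T for every matrix.
    void herk_batched(int n, int k, double** dA, int i, int j, int ldda, int batchCount);

    // Hypothetical driver following Figure 3: below the crossover C, the fused blocked POTF2
    // factorizes the whole matrix; above it, a standard blocked loop runs, with the blocked
    // POTF2 serving as the panel factorization step.
    void potrf_batched_gpu(int N, double** dA, int ldda, int batchCount, int C, int nb)
    {
        if (N <= C) {
            potf2_blocked_batched(N, N, dA, 0, 0, ldda, batchCount);       // whole matrix at once
            return;
        }
        for (int j = 0; j < N; j += nb) {
            int jb = (nb < N - j) ? nb : (N - j);
            potf2_blocked_batched(jb, jb, dA, j, j, ldda, batchCount);     // panel: diagonal block
            if (j + jb < N) {
                trsm_batched(N - j - jb, jb, dA, j + jb, j, ldda, batchCount); // block column below
                herk_batched(N - j - jb, jb, dA, j + jb, j, ldda, batchCount); // trailing update
            }
        }
    }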

4.2. Cholesky Panel Factorization (POTF2)

Previous studies [22][12] showed that an efficient panel factorization of an N x N matrix should be recursively blocked, as shown in Figure 3, in order to use fused level-3 BLAS routines instead of memory-bound level-2 BLAS operations. For example, thanks to the recursive blocking in Figure 3, trailing matrix updates inside the blocked POTF2 routine use the HERK operation instead of the memory-bound Hermitian rank-1 update (the HER routine in level-2 BLAS). In addition, blocking at the kernel level follows a left-looking Cholesky factorization with a blocking size ib, as shown in Algorithm 1, which is known to minimize data writes (in this case, from GPU shared memory to GPU main memory).

Algorithm 1: The left-looking blocked factorization.
    for i = 0 to N step ib do
        if (i > 0) then
            // Update the current panel A(i:N, i:i+ib)
            HERK: A(i:i+ib, i:i+ib) = A(i:i+ib, i:i+ib) - A(i:i+ib, 0:i) * A(i:i+ib, 0:i)^T
            GEMM: A(i+ib:N, i:i+ib) = A(i+ib:N, i:i+ib) - A(i+ib:N, 0:i) * A(i:i+ib, 0:i)^T
        end
        // Factorize the panel A(i:N, i:i+ib)
        POTF2: factorize A(i:i+ib, i:i+ib)
        TRSM:  A(i+ib:N, i:i+ib) = A(i+ib:N, i:i+ib) * A(i:i+ib, i:i+ib)^{-T}
    end
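
For readability, here is a plain sequential C++ translation of Algorithm 1 (lower triangular factor, column-major storage), with raw loops standing in for the HERK, GEMM, unblocked POTF2, and TRSM calls. It is only a reference for the index ranges; it is not the GPU kernel described in this section.

    #include <cmath>
    #include <cstddef>
    #include <algorithm>

    // Reference translation of Algorithm 1: left-looking blocked Cholesky of a column-major
    // N-by-N SPD matrix A (leading dimension lda), producing the lower factor in place.
    void chol_left_looking_ref(int N, double* A, int lda, int ib)
    {
        auto a = [&](int r, int c) -> double& { return A[r + (std::size_t)c * lda]; };

        for (int i = 0; i < N; i += ib) {
            int jb = std::min(ib, N - i);

            // HERK + GEMM of Algorithm 1: update panel A(i:N, i:i+jb) with columns 0:i.
            for (int c = i; c < i + jb; ++c)
                for (int r = c; r < N; ++r)
                    for (int k = 0; k < i; ++k)
                        a(r, c) -= a(r, k) * a(c, k);

            // Unblocked POTF2 on the diagonal block A(i:i+jb, i:i+jb).
            for (int c = i; c < i + jb; ++c) {
                for (int k = i; k < c; ++k)
                    a(c, c) -= a(c, k) * a(c, k);
                a(c, c) = std::sqrt(a(c, c));
                for (int r = c + 1; r < i + jb; ++r) {
                    for (int k = i; k < c; ++k)
                        a(r, c) -= a(r, k) * a(c, k);
                    a(r, c) /= a(c, c);
                }
            }

            // TRSM: A(i+jb:N, i:i+jb) <- A(i+jb:N, i:i+jb) * A(i:i+jb, i:i+jb)^{-T}.
            for (int c = i; c < i + jb; ++c)
                for (int r = i + jb; r < N; ++r) {
                    for (int k = i; k < c; ++k)
                        a(r, c) -= a(r, k) * a(c, k);
                    a(r, c) /= a(c, c);
                }
        }
    }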

4.2.1. Kernel Optimization

With a left-looking Cholesky algorithm, the update writes a panel of size N x ib into fast shared memory instead of main memory, so that the unblocked POTF2 stage can execute directly in shared memory. Note that N and ib control the amount of shared memory required.

We developed an optimized, customized fused kernel that first performs the update (HERK) and keeps the updated panel in shared memory, to be used by the unblocked POTF2 and TRSM steps. The cost of the left-looking algorithm is dominated by the update step (HERK). The panel C, shown in Figure 4, is updated as C = C - A * B^T. A double buffering scheme is employed to perform the update in steps of lb, which minimizes the update cost, as described in Algorithm 2. For clarity, we prefix data arrays with r and s to denote register and shared memory, respectively. We prefetch data from A into the register array rAk while a multiplication is being performed between the register array rAkk and the array sB stored in shared memory. Since the matrix B is the shaded portion of A, our kernel avoids reading it again from global memory and instead transposes it into the shared memory array sB. Once the update is finished, the factorization (POTF2 and TRSM) is performed as one operation on the panel C, which is held in shared memory.

Figure 4: Left-looking Cholesky factorization. The panel C (of width ib and height m - i) is updated with the blocks of A to its left, in steps of width lb.

Algorithm 2: The fused kernel corresponding to one iteration of Algorithm 1.
    rAk <- A(i:N, 0:lb);  rC <- 0
    for k = 0 to N - i step lb do
        rAkk <- rAk
        sB   <- rAk(i:lb, k:k+lb)        // in-place transpose into shared memory
        barrier()
        rAk  <- A(i:N, k+lb:k+2*lb)      // prefetching the next block
        rC   <- rC + rAkk * sB           // multiplying
        barrier()
    end
    sC <- rAk - rC                        // prefetched panel minus the accumulated update
    factorize sC                          // unblocked POTF2 + TRSM on the panel in shared memory
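
The shared memory footprint is what limits the fused kernel of [12] to small sizes. The sketch below (assumed kernel name and a simplified footprint formula) shows how a launch might size the dynamic shared memory from the panel height m and the blocking sizes. For example, with ib = 32 in double precision, an m x ib panel alone needs m*32*8 bytes, so a 48 KB per-block limit caps m at 192.

    #include <cuda_runtime.h>
    #include <algorithm>

    // Hypothetical fused blocked-POTF2 kernel (declaration only; the kernel body is the
    // subject of Section 4.2) and a launch helper that sizes its dynamic shared memory.
    __global__ void potf2_fused_kernel(int m, int ib, int lb, double** dA_array, int ldda);

    void launch_potf2_fused(int m, int ib, int lb, double** dA_array, int ldda,
                            int batchCount, cudaStream_t stream)
    {
        // Shared memory holds the m-by-ib panel sC plus the ib-by-lb transposed block sB.
        size_t shmem = (size_t)(m * ib + ib * lb) * sizeof(double);
        dim3 grid(1, 1, batchCount);                  // one subgrid per matrix
        dim3 threads(std::min(m, 1024), 1, 1);        // assumption: 1-D threads, capped at 1024
        potf2_fused_kernel<<<grid, threads, shmem, stream>>>(m, ib, lb, dA_array, ldda);
    }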

4.2.2. Loop-Inclusive vs. Loop-Exclusive Kernels

In addition to fusing the computational steps of a single iteration of Algorithm 1, another level of fusion is to merge all iterations together into one GPU kernel. The motivation behind this loop-inclusive design is to maximize data reuse, not only within the computation of a single iteration, but also across iterations. For example, the factorized panel of iteration i-1 (which is in shared memory) can be reused to update the panel of iteration i, which means replacing the load from slow memory of the last block of A (illustrated in Figure 4) with a direct access from fast shared memory. However, such a design has a downside regarding occupancy, in terms of the number of factorizations that can be performed on a single Streaming Multiprocessor (SM). A loop-inclusive kernel has to be configured for the tallest sub-panel (i.e., based on the size N). As we execute more iterations of Algorithm 1, more threads become idle and more of the reserved shared memory goes unused. In other words, the kernel runs entirely at the occupancy level defined by the resource requirements of the first iteration.

The analysis of the occupancy and throughput of the loop-fusion technique motivated the development of a more occupancy-oriented design, which we call the loop-exclusive kernel. Here, each iteration of Algorithm 1 corresponds to a kernel launch that requests exactly the resources required by that iteration, with no idle threads and no wasted shared memory. While this design requires reloading the previous panel from main memory, the extra cost is alleviated by the double buffering technique in the update step. We conducted a tuning experiment for both kernels. The results, summarized in Figure 5, show that the loop-exclusive approach helps the CUDA runtime increase the throughput of factorized matrices by increasing the occupancy at the SM level.

Figure 5: Performance tuning of the loop-inclusive (inc) and loop-exclusive (exc) kernels on a K40c GPU. The value of ib is shown in brackets. Results are shown for double precision.
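
A host-side sketch of the loop-exclusive scheme, under the same assumptions as the earlier launch example: one launch per iteration of Algorithm 1, each sized for the current sub-panel only.

    #include <cuda_runtime.h>

    // Hypothetical per-iteration kernel (declaration only); the offset i selects the sub-panel.
    __global__ void potf2_fused_step_kernel(int m, int jb, int lb, int i,
                                            double** dA_array, int ldda);

    void potf2_loop_exclusive(int N, int ib, int lb, double** dA_array, int ldda,
                              int batchCount, cudaStream_t stream)
    {
        for (int i = 0; i < N; i += ib) {
            int m  = N - i;                              // height of the current sub-panel
            int jb = (ib < m) ? ib : m;                  // width of the current sub-panel
            // Resources shrink with the sub-panel: shorter panels need less shared memory and
            // fewer threads, which allows more resident factorizations per SM.
            size_t shmem = (size_t)(m * jb + jb * lb) * sizeof(double);
            dim3 grid(1, 1, batchCount);
            dim3 threads(m < 1024 ? m : 1024, 1, 1);
            potf2_fused_step_kernel<<<grid, threads, shmem, stream>>>(m, jb, lb, i, dA_array, ldda);
        }
    }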

4.2.3. Greedy vs. Lazy Scheduling for potf2_vbatched

Following the loop-exclusive design, the potf2_vbatched kernel is called as many times as required by the largest matrix in the batch. This leaves a degree of freedom in deciding when to start the factorization of the smaller matrices. We present two scheduling techniques that control when the factorization starts for every matrix in the batch. The first is greedy scheduling, where the factorization begins on all matrices at the first iteration. Once a matrix is fully factorized, the thread block (TB) assigned to it in the following iterations becomes idle and is terminated using the ETM technique. With greedy scheduling, the factorizations of individual matrices complete at different iterations. A drawback of this technique is that smaller matrices are factorized alongside larger matrices in the same iteration. Since the shared memory allocation has to accommodate the tallest sub-panel, greedy scheduling wastes shared memory on the smaller sub-panels, which in turn results in low occupancy.

The downside of greedy scheduling motivated the opposite technique, which we call lazy scheduling. Individual factorizations start at different iterations, such that they all finish at the last iteration. At each iteration i, lazy scheduling considers only matrices with local sizes within the range max_N - i to max_N - i + ib, and ignores the other matrices using ETMs. As a result, the resource allocation per iteration (number of threads and shared memory) is closest to the optimal configuration. In other words, lazy scheduling always ensures better occupancy than greedy scheduling, and is in fact more robust to variations of the sizes within the batch.
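
A schematic device-side check for lazy scheduling is sketched below (assumed names; the alignment of start iterations to ib boundaries is omitted). The key point is the ETM test: a matrix joins the computation only once the remaining height of the largest matrix has dropped to its own size, so every active matrix works on a sub-panel of roughly the same height.

    #include <cuda_runtime.h>

    // Hypothetical lazy-scheduled step of the variable size panel factorization.
    // i is the column offset reached by the largest matrix (0, ib, 2*ib, ...).
    __global__ void potf2_vbatched_lazy_step(const int* n, double** dA_array, const int* ldda,
                                             int max_N, int i, int ib)
    {
        const int batchid = blockIdx.z;
        const int my_n    = n[batchid];
        // ETM: matrices that have not started yet are ignored at this iteration.
        if (my_n < max_N - i) return;
        // Local column offset of this matrix; by construction all active matrices work on
        // sub-panels of height close to (max_N - i), so threads and shared memory fit well.
        const int my_i = my_n - (max_N - i);
        // ... update + factorize the sub-panel of matrix 'batchid' starting at (my_i, my_i) ...
        (void)dA_array; (void)ldda; (void)my_i;   // body omitted in this sketch
    }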

Figure 6 shows a performance robustness test for the greedy and lazy scheduling techniques. We conducted performance tests on 3000 matrices with a mean size of 384 and a variation of ±r, so that the interval (384 ± r) is randomly sampled 3000 times to construct the batch. The figure shows that when the variation is small, both scheduling techniques achieve roughly the same performance. However, as r increases, greedy scheduling loses performance due to the larger variation in sizes, which hurts occupancy. In fact, greedy scheduling loses up to 25% of its performance, while lazy scheduling maintains a stable performance regardless of the size variations.

Figure 6: Performance robustness test of the greedy and lazy scheduling techniques, batchCount = 3000. Sizes are randomly sampled within the (384 ± r) interval.

The discussion of scheduling techniques does not apply to the TRSM and HERK routines. Unlike the potf2_vbatched kernel, which uses a dynamic shared memory allocation based on max_N, both routines use static shared memory allocations based on tuning parameters rather than on the input sizes. Therefore, their occupancy is controlled by the tuning parameters and is minimally affected by size variations.

4.3. Triangular Solve (TRSM)

The TRSM routine starts by inverting square diagonal blocks of size tri_nb of the triangular matrix. The inversion is performed using a batched triangular inversion routine (TRTRI). The solution is then obtained by multiplying these inverses (which are stored in a workspace) with the corresponding sub-matrices of the right-hand side matrix. A carefully tuned batched GEMM [25] is used to perform the multiplication. The value of tri_nb is chosen so that GEMM dominates the computation involved in the TRSM routine. Figure 7 shows the impact of the parameter tri_nb on performance. The figure represents a typical test case invoked by the Cholesky factorization when the panel size is set to 256. The best configuration of MAGMA is 10-17% faster than MKL+OpenMP, and 4-5x faster than cuBLAS.

Figure 7: Performance tuning of batched TRSM. Experiments are performed on a K40c GPU and a 16-core Intel Sandy Bridge CPU. Results are shown for double precision.
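
The overall structure of this TRSM approach can be sketched as follows. The interfaces are placeholders (MAGMA's actual routines differ, e.g., in workspace handling); the sketch solves X * L^T = B out of place, writing the solution into a workspace dX.

    #include <cuda_runtime.h>

    // Placeholder batched kernels (assumed interfaces, not MAGMA's actual API).
    // Inverts every tri_nb-by-tri_nb diagonal block of L into the workspace dinvA.
    void trtri_diag_batched(int n, int tri_nb, double** dL, int lddl,
                            double** dinvA, int batchCount, cudaStream_t stream);
    // Per batch entry: C(rc:, cc:) = alpha * A(ra:, ca:) * B(rb:, cb:)^T + beta * C(rc:, cc:)
    void gemm_batched_nt(int m, int n, int k, double alpha,
                         double** dA, int ra, int ca, int ldda,
                         double** dB, int rb, int cb, int lddb, double beta,
                         double** dC, int rc, int cc, int lddc,
                         int batchCount, cudaStream_t stream);

    // Solve X * L^T = B for every matrix in the batch (right side, lower triangular,
    // transposed), which is the case needed by the Cholesky panel step.
    void trsm_batched_via_inverses(int m, int n, int tri_nb,
                                   double** dL, int lddl, double** dB, int lddb,
                                   double** dX, int lddx, double** dinvA,
                                   int batchCount, cudaStream_t stream)
    {
        // 1) Invert the diagonal blocks once; inverse block j sits at column j of dinvA.
        trtri_diag_batched(n, tri_nb, dL, lddl, dinvA, batchCount, stream);

        // 2) Block sweep: every step is a batched GEMM, which dominates the flops.
        for (int j = 0; j < n; j += tri_nb) {
            int jb = (tri_nb < n - j) ? tri_nb : (n - j);
            // X(:, j:j+jb) = B(:, j:j+jb) * inv(L(j,j))^T
            gemm_batched_nt(m, jb, jb, 1.0, dB, 0, j, lddb, dinvA, 0, j, tri_nb,
                            0.0, dX, 0, j, lddx, batchCount, stream);
            // B(:, j+jb:n) -= X(:, j:j+jb) * L(j+jb:n, j:j+jb)^T
            if (j + jb < n)
                gemm_batched_nt(m, n - j - jb, jb, -1.0, dX, 0, j, lddx, dL, j + jb, j, lddl,
                                1.0, dB, 0, j + jb, lddb, batchCount, stream);
        }
    }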

4.4. Hermitian Rank-k Update (HERK)

The HERK routine is key to high performance in the Cholesky factorization, as it dominates the trailing matrix updates, which represent the most compute-intensive phase of the computation when the matrix is larger than the crossover point C. MAGMA uses one of two HERK implementations based on the input size. The first is a MAGMA kernel that uses the same code base and tuning parameters as the GEMM kernel proposed in [25]. This kernel performs a normal GEMM operation, except for a preprocessing layer that terminates the thread blocks that would write to the unreferenced upper/lower triangular part of the matrix. This means that the kernel inherits all the optimization techniques and tuning effort that went into the GEMM kernel. The second implementation uses concurrent CUDA streams to launch multiple instances of the cuBLAS HERK kernel; its motivation is that it achieves very high performance when the input size becomes relatively large. MAGMA transparently decides which approach to use based on the input size.
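
A minimal sketch of the second path is shown below: one cuBLAS rank-k update per matrix, issued round-robin over a few CUDA streams so that the calls can overlap on the device. For real double precision the update is SYRK (HERK is its complex Hermitian counterpart); the driver name and batch layout are assumptions.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    // Hypothetical stream-based rank-k update for a batch of trailing matrices:
    // C_i = alpha * A_i * A_i^T + beta * C_i (lower triangular part only).
    void herk_batched_streams(int n, int k, double alpha, double beta,
                              const std::vector<const double*>& dA, int ldda,
                              const std::vector<double*>& dC, int lddc,
                              cublasHandle_t handle, const std::vector<cudaStream_t>& streams)
    {
        for (size_t i = 0; i < dC.size(); ++i) {
            cublasSetStream(handle, streams[i % streams.size()]);   // round-robin over streams
            cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        n, k, &alpha, dA[i], ldda, &beta, dC[i], lddc);
        }
    }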

5. Performance Results

5.1. System Setup

Performance experiments are conducted on a 16-core Intel Sandy Bridge CPU (Intel Xeon E5-2670, running at 2.6 GHz) and on the three GPUs summarized in Table 1. The Titan-X and GTX1080 GPUs provide only a small fraction of their single precision throughput in double precision, and are therefore not expected to deliver good double precision performance. Our test environment uses the Intel MKL library for the CPU tests and CUDA Toolkit 8.0 RC for the GPU tests.

Table 1: Summary of the GPUs used in the performance tests.

    Name      Architecture   Compute Capability   CUDA Cores   Frequency
    K40c      Kepler         --                   --           -- GHz
    Titan-X   Maxwell        --                   --           -- GHz
    GTX1080   Pascal         --                   --           -- GHz

5.2. Performance of the Batched Routines

Figure 8 compares the performance of the blocked POTF2 kernel (when used on its own) against the performance of the full POTRF routine (which internally calls POTF2). The figure shows that, if the matrix size is below a certain crossover point, it is better to perform the entire factorization using the blocked POTF2 kernel only. At such small sizes, most of the operations are memory-bound, and it is therefore important to avoid any unnecessary global memory traffic. The blocked POTF2 routine does exactly that, by fusing all operations into one kernel and increasing data reuse in shared memory among the different computational stages. For matrix sizes below 400, Figure 8 shows significant speedups against the full POTRF, ranging from 1.12x up to 4x as the matrix size gets smaller.

Figure 8: Example crossover point for dpotrf_batched.

Following the design strategy in Figure 3, the blocked POTF2 routine cannot be used for arbitrarily large problems, because its shared memory requirements are a function of the matrix size. Moreover, Figure 8 shows that its performance starts to stagnate as the problem becomes more compute-bound. At this stage, our final solution switches to the full POTRF implementation. The crossover points in MAGMA are tunable according to the precision and the GPU model.

Figures 9 and 10 show the final performance of the batched Cholesky factorization for fixed size and variable size problems, respectively. A first observation is that the MAGMA performance on the Titan-X is better than on the GTX1080, although the latter is the more recent GPU architecture. The reason is that the GTX1080 is a low-end configuration of the Pascal architecture: the Titan-X used in this work has more CUDA cores (see Table 1) and, as our benchmarks show, a higher memory bandwidth. We also point out that some graphs for the GTX1080 are incomplete because it has less memory than the other two GPUs.

Figure 9: Performance of the fixed size batched Cholesky factorization. (a) Single precision; (b) double precision.

Both figures show that the MAGMA performance in single precision is portable across the three GPU architectures. MAGMA is faster than the CPU implementation using MKL and OpenMP, and it receives a performance boost on the newer GTX1080 and Titan-X architectures, scoring speedups of up to 4-5x on the latter. Despite the expectedly low double precision performance on the GTX1080 and Titan-X GPUs, MAGMA outperforms the CPU implementation when running on the K40c GPU, scoring speedups ranging from 1.2x up to 2.5x.

For the variable size batched experiments, we constructed every test batch by randomly sampling the interval [1:N], where N is varied on the x-axis of Figures 10a and 10b. Similar to the fixed size batched routine, the MAGMA vbatched routine on the Titan-X/GTX1080 is 2-4x faster than MKL in single precision, and it is also faster on the K40c GPU. In double precision, MAGMA achieves similar speedups against MKL when running on the K40c GPU.

Figure 10: Performance of the variable size batched Cholesky factorization, batchCount = 1000. (a) Single precision; (b) double precision. Matrix sizes in each batch are randomly sampled between 1 and the maximum size shown on the x-axis.

5.3. Performance of the Native Routines

Figure 11 shows the performance of the MAGMA native Cholesky factorization. Since this test involves a single factorization of a large matrix, we switch the CPU baseline to multithreaded MKL, using all cores together for the factorization. We also point out that the native MAGMA routines, while sharing the same code base with the batched routines, usually use a larger panel size, in order to make the trailing updates more compute intensive. In single precision, the speedups scored by MAGMA are up to 4.2x, 8.8x, and 9.8x on the K40c, GTX1080, and Titan-X GPUs, respectively. Similar to the batched routines, double precision speedups are scored on the K40c GPU only, where MAGMA is up to 4.1x faster than the multithreaded MKL implementation.

Figure 11: Performance of the native GPU Cholesky factorization. (a) Single precision; (b) double precision.

6. Conclusion and Future Work

This paper introduced a high performance Cholesky factorization designed for GPUs. The proposed work can operate in a batch mode, factorizing many small matrices of similar or different sizes, or in a native mode, factorizing one large matrix using the GPU only. The paper introduced a common code base that can be used in both modes and that delivers high performance against state-of-the-art solutions using multicore CPUs. Future directions include applying the same design concept to broader functionality (e.g., LU and QR factorizations), and developing an autotuning framework to guarantee portable performance across many GPU architectures.

Acknowledgement

This material is based on work supported by NSF under Grants No. CSR and ACI, by NVIDIA, and in part by the Russian Scientific Foundation, Agreement N.

References

[1] O. Messer, J. Harris, S. Parete-Koon, M. Chertkow, Multicore and accelerator development for a leadership-class stellar astrophysics code, in: Proceedings of PARA 2012: State-of-the-Art in Scientific and Parallel Computing.

[2] A. A. Auer, G. Baumgartner, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Krishnamoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, A. Sibiryakov, Automatic code generation for many-body electronic structure methods: the tensor contraction engine, Molecular Physics 104 (2) (2006).

[3] A. Khodayari, A. R. Zomorrodi, J. C. Liao, C. Maranas, A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data, Metabolic Engineering 25C (2014).

[4] S. N. Yeralan, T. A. Davis, W. M. Sid-Lakhdar, S. Ranka, Algorithm 9xx: Sparse QR factorization on the GPU, ACM Transactions on Mathematical Software.

[5] T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra, A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU-GPU, in: IEEE 28th International Parallel & Distributed Processing Symposium (IPDPS).

[6] E.-J. Im, K. Yelick, R. Vuduc, Sparsity: Optimization framework for sparse matrix kernels, Int. J. High Perform. Comput. Appl. 18 (1) (2004).

[7] J. Molero, E. Garzón, I. García, E. Quintana-Ortí, A. Plaza, Poster: A batched Cholesky solver for local RX anomaly detection on GPUs, PUMPS (2013).

[8] M. Anderson, D. Sheffield, K. Keutzer, A predictive model for solving small linear algebra problems in GPU registers, in: IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS).

[9] A. Haidar, S. Tomov, P. Luszczek, J. Dongarra, MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing, in: 2015 IEEE High Performance Extreme Computing Conference (HPEC '15), (Best Paper Award), IEEE, Waltham, MA, 2015.

[10] Matrix Algebra on GPU and Multicore Architectures (MAGMA), 2014.

[11] S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: Proc. of the IEEE IPDPS'10, IEEE Computer Society, Atlanta, GA, 2010, pp. 1-8.

[12] A. Abdelfattah, A. Haidar, S. Tomov, J. J. Dongarra, Performance tuning and optimization techniques of fixed and variable size batched Cholesky factorization on GPUs, in: International Conference on Computational Science (ICCS 2016), 6-8 June 2016, San Diego, California, USA, 2016.

[13] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Faster, cheaper, better - a hybridization methodology to develop linear algebra software for GPUs, in: W. mei W. Hwu (Ed.), GPU Computing Gems, Vol. 2, Morgan Kaufmann.

[14] J. Dongarra, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, A. YarKhan, Model-driven one-sided factorizations on multicore accelerated systems, International Journal on Supercomputing Frontiers and Innovations 1 (1).

[15] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, On the development of variable size batched computation for heterogeneous parallel architectures, in: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2016), Chicago, IL, USA, May 23-27, 2016.

[16] O. Villa, M. Fatica, N. A. Gawande, A. Tumeo, Power/performance trade-offs of small batched LU based solvers on GPUs, in: 19th International Conference on Parallel Processing (Euro-Par 2013), Lecture Notes in Computer Science, Aachen, Germany, 2013.

[17] O. Villa, N. A. Gawande, A. Tumeo, Accelerating subsurface transport simulation on heterogeneous clusters, in: IEEE International Conference on Cluster Computing (CLUSTER 2013), Indianapolis, Indiana, 2013.

[18] I. Wainwright, Optimized LU-decomposition with full pivot for small batched matrices, GTC'13, ID S3069, April 2013.

[19] T. Dong, A. Haidar, S. Tomov, J. Dongarra, A fast batched Cholesky factorization on a GPU, in: Proc. of the 2014 International Conference on Parallel Processing (ICPP-2014), 2014.

[20] J. Kurzak, H. Anzt, M. Gates, J. Dongarra, Implementation and tuning of batched Cholesky factorization and solve for NVIDIA GPUs, IEEE Transactions on Parallel and Distributed Systems PP (99) (2015) 1-1.

[21] A. Haidar, T. Dong, S. Tomov, P. Luszczek, J. Dongarra, A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations, in: J. M. Kunkel, T. Ludwig (Eds.), High Performance Computing, Lecture Notes in Computer Science, Springer International Publishing, 2015.

[22] A. Haidar, T. Dong, P. Luszczek, S. Tomov, J. Dongarra, Batched matrix computations on hardware accelerators based on GPUs, IJHPCA 29 (2) (2015).

[23] A. Haidar, P. Luszczek, S. Tomov, J. Dongarra, Towards batched linear solvers on accelerated hardware platforms, in: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015), ACM, San Francisco, CA, 2015.

[24] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, J. J. Dongarra, High-performance matrix-matrix multiplications of very small matrices, in: Euro-Par 2016: Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings, 2016.

[25] A. Abdelfattah, A. Haidar, S. Tomov, J. Dongarra, Performance, design, and autotuning of batched GEMM for GPUs, in: High Performance Computing - 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings, 2016.


More information

High Performance Linear Algebra

High Performance Linear Algebra High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

International Conference on Computational Science (ICCS 2017)

International Conference on Computational Science (ICCS 2017) International Conference on Computational Science (ICCS 2017) Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations G. Bernabé, J. C. Cano, J. Cuenca, A.

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study J Supercomput DOI 10.1007/s11227-014-1102-4 Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study S. Tabik G. Ortega E. M. Garzón Springer Science+Business Media

More information

Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters

Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters Tingxing Dong, Veselin Dobrev, Tzanio Kolev, Robert Rieben, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory, University

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 18C (217) 1783 1792 International Conference on Computational Science, ICCS 217, 12-14 June 217, Zurich, Switzerland Variable-Size

More information

A priori power estimation of linear solvers on multi-core processors

A priori power estimation of linear solvers on multi-core processors A priori power estimation of linear solvers on multi-core processors Dimitar Lukarski 1, Tobias Skoglund 2 Uppsala University Department of Information Technology Division of Scientific Computing 1 Division

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

Dealing with Asymmetry for Performance and Energy Efficiency

Dealing with Asymmetry for Performance and Energy Efficiency Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures

More information

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment Azzam Haidar 1, Chongxiao Cao 1, Asim YarKhan 1, Piotr Luszczek 1, Stanimire Tomov 1,

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Martin Köhler Jens Saak 2 The Gauss-Jordan Elimination scheme is an alternative to the LU decomposition

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information