Cache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves


Michael Bader and Alexander Heinecke
Institut für Informatik, Technische Universität München, Germany

Abstract. Cache oblivious algorithms are designed to benefit from any existing cache hierarchy, regardless of cache size or architecture. In matrix computations, cache oblivious approaches are usually obtained from block-recursive approaches. In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). We present the respective block-oriented data structure to store sparse matrices, and give first performance results on multicore platforms.

1 Introduction

One of the problems that high performance computing is currently confronted with is a growing gap between the theoretically available peak performance of modern CPUs and parallel computing platforms, and the performance that can be achieved in practice. Multilevel memory hierarchies, specific processor extensions, and multicore architectures in general are just some of the features where a careful, hardware-oriented algorithm design and implementation is necessary to achieve a satisfying portion of the available performance. For dense linear algebra computations, the BLAS (Basic Linear Algebra Subroutines) library standard forms the basis for many highly efficient implementations. Especially for matrix multiplication, one of the core BLAS routines, highly optimised implementations are available, either vendor-supplied (such as ACML or Intel's MKL) or as third-party developments (GotoBLAS [5], ATLAS [10]).

For sparse matrix computations, the situation is much less satisfying. In contrast to dense matrix multiplication, the multiplication of sparse matrices (sparse-sparse or dense-sparse) is nearly a memory-bound problem, where the data transfer between main memory and CPU becomes the limiting factor for performance. For dense matrix computations, blocking techniques are an efficient means to exploit existing cache hierarchies and thus sustain optimal CPU performance. Both cache aware and cache oblivious techniques have been proposed and successfully used, either for BLAS implementations (cf. [5, 10]) or for other linear algebra tasks [6, 4]. For sparse matrix computations, blocking techniques have up to now mainly been popular for low-level blocking of elements, for example using the BCSR format (block compressed sparse row [8]). However, block-recursive techniques for data storage and the respective linear algebra computations have also been presented [7].

Fig. 1. Recursive construction of a Peano curve (for the patterns P and Q only).

In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). The approach is based on a block-recursive data structure to store sparse matrices, where the leaf blocks can be either zero, dense, or stored in a standard sparse format (compressed sparse row). Our current intended application is the computation of the exponential function of sparse matrices as they occur within a quantum control problem [9].

After a quick recapitulation of the Peano matrix multiplication in section 2, we introduce the Peano data structure for sparse matrices in section 3. Section 4 introduces the current test scenarios and gives first performance results on single- and multicore platforms. In section 5, we outline current and future work.

2 Dense Matrix Multiplication Based on Peano Curves

In previous works [1, 2], we have introduced a cache oblivious algorithm for matrix multiplication that uses a memory layout for the matrix elements that is derived from Peano space-filling curves. A corresponding block-recursive scheme for matrix multiplication leads to a cache oblivious algorithm with excellent inherent locality properties. The respective algorithm can be efficiently parallelised on multicore platforms [3], and was shown to be competitive with the currently fastest libraries, such as GotoBLAS [5].

2.1 Element Order and Block-recursive Multiplication

Figure 1 illustrates the Peano element order used to store dense matrices. It is based on so-called iterations of a Peano curve. A matrix is recursively subdivided into 3 × 3 subblocks, which are stored contiguously in memory according to four different block numbering patterns, marked as P (the starting pattern), Q, R, and S. The recursion is stopped once the matrix blocks are small enough to be efficiently multiplied within the level 1 cache. Hence, we refer to these smallest blocks as L1 blocks. Within these L1 blocks, simple column-major order is used.

A blockwise multiplication of matrices stored in Peano order is illustrated in equation (1). There, each matrix block is named according to its numbering scheme and indexed with the name of the global matrix and the position within the storage scheme:

$$
\underbrace{\begin{pmatrix} P_{A0} & R_{A5} & P_{A6} \\ Q_{A1} & S_{A4} & Q_{A7} \\ P_{A2} & R_{A3} & P_{A8} \end{pmatrix}}_{=:A}
\underbrace{\begin{pmatrix} P_{B0} & R_{B5} & P_{B6} \\ Q_{B1} & S_{B4} & Q_{B7} \\ P_{B2} & R_{B3} & P_{B8} \end{pmatrix}}_{=:B}
=
\underbrace{\begin{pmatrix} P_{C0} & R_{C5} & P_{C6} \\ Q_{C1} & S_{C4} & Q_{C7} \\ P_{C2} & R_{C3} & P_{C8} \end{pmatrix}}_{=:C} \tag{1}
$$

For the resulting block multiplications, the following inherently cache-efficient execution order can be derived [1] (the indices A, B, C are left out to improve readability; the operations are executed row by row, from left to right):

$$
\begin{array}{lll}
P_0 \mathrel{+}= P_0 P_0, & Q_1 \mathrel{+}= Q_1 P_0, & P_2 \mathrel{+}= P_2 P_0, \\
P_2 \mathrel{+}= R_3 Q_1, & Q_1 \mathrel{+}= S_4 Q_1, & P_0 \mathrel{+}= R_5 Q_1, \\
P_0 \mathrel{+}= P_6 P_2, & Q_1 \mathrel{+}= Q_7 P_2, & P_2 \mathrel{+}= P_8 P_2, \\
R_3 \mathrel{+}= P_8 R_3, & S_4 \mathrel{+}= Q_7 R_3, & R_5 \mathrel{+}= P_6 R_3, \\
R_5 \mathrel{+}= R_5 S_4, & S_4 \mathrel{+}= S_4 S_4, & R_3 \mathrel{+}= R_3 S_4, \\
R_3 \mathrel{+}= P_2 R_5, & S_4 \mathrel{+}= Q_1 R_5, & R_5 \mathrel{+}= P_0 R_5, \\
P_6 \mathrel{+}= P_0 P_6, & Q_7 \mathrel{+}= Q_1 P_6, & P_8 \mathrel{+}= P_2 P_6, \\
P_8 \mathrel{+}= R_3 Q_7, & Q_7 \mathrel{+}= S_4 Q_7, & P_6 \mathrel{+}= R_5 Q_7, \\
P_6 \mathrel{+}= P_6 P_8, & Q_7 \mathrel{+}= Q_7 P_8, & P_8 \mathrel{+}= P_8 P_8.
\end{array} \tag{2}
$$

Recursive extension of this blockwise multiplication leads to a cache oblivious algorithm, which gains excellent locality properties from the underlying Peano curve: after an access to an element (or L1 block) in any of the matrices A, B, or C, either the same element (L1 block) will be reused, or its direct left or right neighbour (compare equation (2)). Any sequence of k^3 floating point operations is executed on only O(k^2) contiguous elements in each matrix. Vice versa, on any memory block of k^2 contiguous elements, at least O(k^3) operations will be performed. As a result, a sharp, asymptotically optimal upper bound on the number of cache misses can be obtained (cf. [1]).

2.2 Efficient Implementation on Multicore Platforms

To achieve high performance on modern CPU cores, the block-recursive algorithm has to be combined with hardware-oriented, optimised multiplication kernels [2]. In particular, to allow efficient use of the vectorisation capabilities of CPUs (SIMD extensions, etc.), we stop the recursion in memory layout and block multiplication on so-called L1 blocks. The size of these L1 blocks proved to be optimal if two L1 blocks fit into the L1 cache; these are basically the A- and B-blocks of the matrices, as the write accesses to the C-blocks are streamed directly to the L2 cache.
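To make the schedule in equation (2) concrete, the following C++ sketch drives one recursion level of the blockwise multiplication from a table of the 27 block products. It is a minimal illustration under our own naming (peano_block_mult, leaf_mult, kPeanoSchedule); it is not TifaMMy's source code, and the naive column-major leaf kernel only stands in for the optimised, vectorised L1-block kernels described in section 2.2.

```cpp
#include <array>
#include <cstddef>

// Naive column-major leaf kernel; in TifaMMy this role is played by the
// hand-optimised (SIMD) L1-block kernels.
static void leaf_mult(double* C, const double* A, const double* B, std::size_t bs) {
    for (std::size_t j = 0; j < bs; ++j)
        for (std::size_t k = 0; k < bs; ++k)
            for (std::size_t i = 0; i < bs; ++i)
                C[i + j * bs] += A[i + k * bs] * B[k + j * bs];
}

// The 27 block products of equation (2) as {c, a, b} triples (C_c += A_a * B_b),
// listed in execution order: consecutive triples change every index by at most
// one, so each operand is accessed in the same or a neighbouring block.
static constexpr std::array<std::array<int, 3>, 27> kPeanoSchedule = {{
    {0,0,0},{1,1,0},{2,2,0},{2,3,1},{1,4,1},{0,5,1},{0,6,2},{1,7,2},{2,8,2},
    {3,8,3},{4,7,3},{5,6,3},{5,5,4},{4,4,4},{3,3,4},{3,2,5},{4,1,5},{5,0,5},
    {6,0,6},{7,1,6},{8,2,6},{8,3,7},{7,4,7},{6,5,7},{6,6,8},{7,7,8},{8,8,8}
}};

// One recursion level: C += A * B, where each operand consists of nine
// contiguous bs x bs leaf blocks stored in Peano order.
void peano_block_mult(double* C, const double* A, const double* B, std::size_t bs) {
    const std::size_t blockElems = bs * bs;
    for (const auto& t : kPeanoSchedule)
        leaf_mult(C + t[0] * blockElems, A + t[1] * blockElems,
                  B + t[2] * blockElems, bs);
}
```

In the full block-recursive algorithm, the same schedule is applied on every level: each of the 27 steps either calls the L1 kernel or recurses into the nine sub-blocks of the three operands it touches.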

Fig. 2. Relative performance of TifaMMy, GotoBLAS, and MKL on a Xeon server using up to 16 cores (four times quad-core) [3]. Performance is given in MFlops/s per core, for 1, 2, 4, 8, and 16 threads (average and maximum per library).

In a recent paper [3], we demonstrated that the resulting implementation, called TifaMMy, is competitive with up-to-date BLAS implementations (GotoBLAS and Intel's MKL). Moreover, we presented a shared-memory parallelisation of our algorithm based on OpenMP, and tested the performance of our approach on several multicore platforms. Figure 2 shows the average and maximum performance for double precision on a Xeon server with four quad-core Xeon processors (Tigerton, 2.93 GHz). In this test, TifaMMy outran both GotoBLAS and MKL in terms of absolute performance if 8 or all 16 cores were used.

3 A Block Recursive Data Structure for Sparse Matrices

To extend our approach to sparse matrix multiplications, we first need to modify our block-recursive data structure to allow efficient storage of sparse matrices. Inspired by an approach used by Herrero and Navarro [7], we allow, in a first step, that each L1 block of the matrix can be either a zero block, a dense block stored in row-major order (as already existent), or a sparse matrix block stored according to the CSR (compressed sparse row) data structure. To store a sparse matrix with n rows and m non-zero elements, CSR uses three arrays: two arrays, each with m elements, that store the matrix element values and the corresponding column indices, respectively, and a third array of size n that references the first non-zero element in each matrix row. In a second step, we allow the block recursion to stop already for larger zero blocks in the numbering scheme, i.e. as soon as a block contains only zero elements. The respective, adaptive block recursion is best described by a tree structure, where each node of the tree has 9 children (according to the Peano blocking), and where the leaves of the tree are either zero blocks (of any size) or L1 blocks that can be sparse, dense, or zero.

The data structure to store sparse matrices is accordingly split into two parts: the first part is a contiguous data stream that holds, in Peano order, the element data required to store the dense L1 blocks and the sparse blocks in CSR format. The second part describes the sparsity tree of the matrix, and also contains all management information (in particular, the respective block types). To store this sparsity tree, we use a sequentialised storage scheme according to a modified depth-first traversal. Again, the order in which the children of each node are stored in memory follows the Peano order. All block-recursive algorithms that will be implemented on this data structure have to be able to address the nine child blocks when doing a recursive call. Hence, this data has to be stored together. Therefore, for each node of the sequentialised sparsity tree, we store the start index of each child node plus a pointer to the end of this subblock (as illustrated in figure 3).

Fig. 3. Illustration of the modified depth-first storage scheme: each node stores the start indices s[0], ..., s[8] of its nine children, plus an end pointer s[9]; the leaves are dense, sparse, or zero blocks.

The blockwise multiplication is performed according to the Peano scheme given in equation (2). Multiplications that involve zero blocks are, of course, not performed. Due to the missing block operations, the Peano multiplication scheme loses its strict locality properties [1] the more zero blocks are present. However, we can still expect a positive effect on cache performance due to the underlying space-filling curve.

4 Performance Tests

We evaluated the performance of our approach on two different workstations, each with a total of eight CPU cores: a Xeon workstation that holds two quad-core processors (Clovertown, 2.66 GHz), which are connected to main memory via a single front side bus; and a Barcelona test platform that holds two AMD Opteron 2347 processors (Barcelona, quad-core, 1.9 GHz), which are connected to memory via AMD's NUMA technology. For all sparse-dense multiplications, we compared TifaMMy's implementation with the one provided by Intel's MKL. In section 4.1, we give the performance for several synthetic benchmark scenarios. In section 4.2, we examine a concrete application scenario, where sparse-dense matrix multiplication is the runtime-dominating operation during the computation of the exponential function of a structured sparse matrix.
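As an illustration of the leaf-block structure described in section 3, the following sketch shows one possible C++ representation of the three leaf types together with a CSR-times-dense leaf kernel. The type and field names (CsrBlock, LeafType, leaf_mult_csr_dense) are our own assumptions rather than TifaMMy's data structures, and the sketch uses the common CSR variant with bs + 1 row offsets (an end sentinel) instead of the n row pointers mentioned above.

```cpp
#include <cstddef>
#include <vector>

// CSR storage for a sparse leaf block with bs rows (hypothetical layout).
struct CsrBlock {
    std::vector<double> values;    // m non-zero values
    std::vector<int>    colIndex;  // m column indices
    std::vector<int>    rowStart;  // bs + 1 offsets; row i occupies [rowStart[i], rowStart[i+1])
};

// A leaf of the sparsity tree is either a zero block, a dense block, or a CSR block.
enum class LeafType { Zero, Dense, Csr };

struct LeafBlock {
    LeafType    type   = LeafType::Zero;
    std::size_t offset = 0;        // start of the block's data in the contiguous element stream
};

// Leaf kernel C += A * B for a CSR block A and dense, column-major bs x bs blocks B and C.
void leaf_mult_csr_dense(const CsrBlock& A, const double* B, double* C, std::size_t bs) {
    for (std::size_t i = 0; i < bs; ++i)
        for (int idx = A.rowStart[i]; idx < A.rowStart[i + 1]; ++idx) {
            const int    k = A.colIndex[idx];
            const double a = A.values[idx];
            for (std::size_t j = 0; j < bs; ++j)
                C[i + j * bs] += a * B[k + j * bs];
        }
}
```

The sequentialised sparsity tree itself would additionally store, for each inner node, the start indices s[0], ..., s[8] of its nine children plus the end pointer s[9], in Peano order, so that a recursive call can address all child blocks directly.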

4.1 Performance for Benchmark Scenarios

Our first performance test once more considers the multiplication of dense matrices. Here, the dense matrices are stored in TifaMMy's Peano-block hybrid format, but using only dense L1 blocks. Figure 4 gives a performance comparison between TifaMMy's dense-optimised (1.3.2) and sparse-enabled (2.0.0) versions, both in relation to the dense matrix multiplication routines provided by Intel's MKL and GotoBLAS. We see that the extension to allow hybrid and sparse matrices leads to a slight performance loss in TifaMMy. However, the sparse-enabled TifaMMy is still faster than MKL, nearly level with GotoBLAS for up to four threads, and slightly faster than GotoBLAS when all eight CPU cores are used.

Fig. 4. Performance comparison of dense matrix multiplication. Results (in MFlops/s per core, using 1 to 8 threads on the Xeon workstation) are given for TifaMMy (optimised for dense multiplication), TifaMMy (sparse-matrix version using only dense blocks), and the dgemm implementations in Intel's MKL and in GotoBLAS.

As a second benchmark, we consider an n × n band matrix with bandwidth √n, which is multiplied with a dense n × n matrix. In particular, we compare the performance difference if TifaMMy stores most of the L1 blocks as dense blocks with the performance achieved when allowing only CSR blocks. The respective results are given in figure 5. The performance of MKL for this band-dense multiplication is given for comparison; note that here the CSR format was used to store the band matrix. First, we can see that the Peano-block CSR format used by TifaMMy already leads to a performance increase compared to MKL. Allowing dense blocks in TifaMMy, however, leads to an additional 100 % gain in MFlops/s, which results from the better use of SSE instructions in the dense multiplication.

Fig. 5. Performance (single-core, on the Xeon workstation) of a band matrix multiplication (band matrix × dense matrix): MKL Sparse (using CSR format) vs. TifaMMy 2.0.0 (Peano-block CSR only) vs. TifaMMy 2.0.0 (Peano order with dense and CSR blocks, threshold 20 %).

For a first test of sparse-dense matrix multiplication, we used a 9-diagonal n × n sparse matrix, where each row i of the matrix contains non-zero elements a_ij, if j ∈ {i, i ± 1, i ± √n, i ± √n ± 1}. The respective performance comparison between TifaMMy and MKL (on both the Xeon and the Barcelona workstation) is given in figure 6. For the single-thread test, TifaMMy gave a performance gain of approximately 25 % compared to MKL. On eight cores, TifaMMy was also able to outrun MKL. Especially in this example of a sparse-dense matrix multiplication, we notice the meagre speedup when using eight cores instead of one, which illustrates the much stronger memory-boundedness of the sparse-dense matrix multiplication. Note that the Barcelona workstation with its NUMA architecture showed by far the better scalability in that test.

Fig. 6. Single-core vs. eight-core performance of the sparse-dense matrix multiplication in the quantum control problem (TifaMMy vs. Intel MKL); we also compare the Xeon with the Barcelona workstation.

4.2 Performance of Sparse-Dense Multiplication in a Quantum Control Problem

As an example of a particular application of sparse-dense matrix multiplication, we present in this section first performance tests of using TifaMMy within the simulation of a quantum control problem. There, the evolution of the quantum states is modelled by a sequence of evolution matrices $U^{(r)}(t_k)$:

$$
U^{(r)}(t_k) = e^{-i\,\Delta t\,H^{(r)}_k}\, e^{-i\,\Delta t\,H^{(r)}_{k-1}} \cdots e^{-i\,\Delta t\,H^{(r)}_1}. \tag{3}
$$

To obtain the factors $e^{-i\,\Delta t\,H^{(r)}_k}$, the exponential function of the sparse Hamiltonian matrices $H^{(r)}_k$ needs to be computed. The sparsity structure of the matrices is illustrated in figure 7. The matrices have dimension $2^q$ (where q is the number of modelled quantum bits), and have q + 1 non-zeros per matrix row (one non-zero results from the anti-diagonal elements indicated in grey). The exponential function is approximated by a Chebyshev polynomial, whose evaluation requires sparse matrix multiplication. As the accumulated result finally becomes a dense matrix, we consider only sparse-dense matrix multiplication at the moment.

Figure 8 shows the performance results on the Xeon and the Barcelona workstation. We compare TifaMMy with MKL when using 1, 2, 4, or 8 threads on these eight-core platforms. Obviously, TifaMMy can especially profit from the NUMA architecture of the AMD workstation. There, TifaMMy gives a substantial performance advantage (more than a factor of 2 on eight threads) over MKL. In contrast, on the Xeon platform this performance advantage diminishes for larger matrix sizes when 4 or 8 cores are used. In that case, both MKL and TifaMMy seem to be bound by the available memory bandwidth.

5 Conclusions, Current and Future Work

We have demonstrated that introducing a Peano block layout to allow the storage of sparse matrices in TifaMMy can lead to a solid performance gain in sparse-dense matrix multiplication compared to using the standard CSR storage scheme. One of the main advantages of the Peano approach is that both sparse and dense matrix multiplication perform well, on both single- and multicore platforms. Especially on several cores, sparse-dense multiplication already proved to be a memory-bound problem. Hence, we can expect that the sparsity structure of the involved matrices will have a strong influence on performance.
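To illustrate where the sparse-dense products of section 4.2 arise, the following sketch accumulates a Chebyshev series of a sparse matrix H via the three-term recurrence T_{k+1} = 2 H T_k - T_{k-1}: every additional term costs exactly one multiplication of the sparse H with a dense iterate. The sketch is a real-valued simplification with abstract coefficients; an actual expansion of the factors in equation (3) would work with complex arithmetic, a suitably scaled H, and matching expansion coefficients. The multiplication callback (spmm) and all names are assumptions, not TifaMMy's interface.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Dense = std::vector<double>;                       // n x n matrix, column-major
using Spmm  = std::function<void(const Dense&, Dense&)>; // Y = H * X for the sparse H

// Accumulate R = sum_k c[k] * T_k(H) with T_0 = I, T_1 = H and the recurrence
// T_{k+1} = 2 H T_k - T_{k-1}; assumes at least one coefficient in c.
Dense chebyshev_series(const Spmm& spmm, const std::vector<double>& c, std::size_t n) {
    Dense Tprev(n * n, 0.0), Tcur(n * n, 0.0), Tnext(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) Tprev[i * n + i] = 1.0;   // T_0 = I
    spmm(Tprev, Tcur);                                            // T_1 = H * I = H
    Dense R(n * n, 0.0);
    for (std::size_t i = 0; i < n * n; ++i)
        R[i] = c[0] * Tprev[i] + (c.size() > 1 ? c[1] * Tcur[i] : 0.0);
    for (std::size_t k = 2; k < c.size(); ++k) {
        spmm(Tcur, Tnext);                                        // the sparse-dense multiplication
        for (std::size_t i = 0; i < n * n; ++i) {
            Tnext[i] = 2.0 * Tnext[i] - Tprev[i];                 // T_{k+1} = 2 H T_k - T_{k-1}
            R[i]    += c[k] * Tnext[i];
        }
        Tprev.swap(Tcur);
        Tcur.swap(Tnext);
    }
    return R;                                                     // initially sparse, becomes dense
}
```

Because the iterates T_k fill in after only a few multiplications, the accumulated result quickly becomes dense, which is why only sparse-dense multiplication is considered in this scenario at the moment.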

Fig. 7. Recursive structure of the sparse Hamiltonian matrices $H^{(r)}_k$ of size $2^q$ (cases q = 4 and q = 5).

Fig. 8. Performance comparison (TifaMMy vs. MKL) of the sparse-dense matrix multiplication within the quantum control problem. Results are given for different matrix sizes, and for using 1, 2, 4, or 8 threads on (a) the Xeon workstation (two quad-core Clovertown processors) and (b) the AMD workstation (two quad-core Barcelona processors).

Our next steps will therefore be to identify and run tests on further applications of sparse-dense matrix multiplication. Concerning our quantum control example, we will also test whether including sparse-sparse multiplication can further improve the solution times. During the accumulation of the Chebyshev polynomial, the accumulated result matrix is initially sparse, and typically takes a few multiplications to become dense. As, in sparse-sparse multiplication, computing (or constantly updating) the sparsity structure of the result matrix can be rather time-consuming, we will, in a first step, use a sparse pattern of dense L1 blocks to store the accumulated results. We also intend to test sparse-sparse matrix multiplication in scenarios where the pattern of the result matrix is known a priori. One such example would be the blockwise computation of an incomplete LU decomposition, where such a sparse-sparse matrix multiplication is required in a respective block-recursive formulation. Another example could be the Galerkin-type computation of coarse-grid operators in geometric multigrid methods.

References

1. Bader, M., Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3), 2006.
2. Bader, M., Franz, R., Guenther, S., Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves. LNCS 4967.
3. Heinecke, A., Bader, M.: Parallel Matrix Multiplication based on Space-filling Curves on Shared Memory Multicore Platforms. Proc. Computing Frontiers Conf. and co-located workshops: MAW '08 & WREFT '08, Ischia, 2008.
4. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1), 2004.
5. Goto, K., van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34 (3), 2008.
6. Gustavson, F.G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6), 1997.
7. Herrero, J.R., Navarro, J.J.: Adapting Linear Algebra Codes to the Memory Hierarchy Using a Hypermatrix Scheme. LNCS 3911, 2006.
8. Im, E.-J., Yelick, K.A., Vuduc, R.: SPARSITY: An Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications 18 (1), 2004.
9. Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T., Glaser, S.J.: Optimal Control of Coupled Spin Dynamics: Design of NMR Pulse Sequences by Gradient Ascent Algorithms. Journal of Magnetic Resonance 172, 2005.
10. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27 (1-2), 2001.
