Cache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
Michael Bader and Alexander Heinecke
Institut für Informatik, Technische Universität München, Germany

Abstract. Cache oblivious algorithms are designed to benefit from any existing cache hierarchy, regardless of cache size or architecture. In matrix computations, cache oblivious approaches are usually obtained from block-recursive approaches. In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). We present the respective block-oriented data structure to store sparse matrices, and give first performance results on multicore platforms.

1 Introduction

One of the problems that high performance computing is currently confronted with is a growing gap between the theoretically available peak performance of modern CPUs and parallel computing platforms, and the performance that can be achieved in practice. Multilevel memory hierarchies, specific processor extensions, and multicore architectures in general are just some of the features for which a careful, hardware-oriented algorithm design and implementation is necessary to achieve a satisfying portion of the available performance. For dense linear algebra computations, the BLAS (Basic Linear Algebra Subprograms) library standard forms the basis for many highly efficient implementations. Especially for matrix multiplication, one of the core BLAS routines, highly optimised implementations are available, either vendor-supplied (such as ACML or Intel MKL) or third-party developments (GotoBLAS [5], ATLAS [10]). For sparse matrix computations, the situation is much less satisfying.
In contrast to dense matrix multiplication, the multiplication of sparse matrices (sparse-sparse or dense-sparse) is essentially a memory-bound problem, where the data transfer between main memory and CPU becomes the limiting factor for performance. For dense matrix computations, blocking techniques are an efficient means to exploit existing cache hierarchies and thus sustain optimal CPU performance. Both cache aware and cache oblivious techniques have been proposed and successfully used, either for BLAS implementations (cf. [5, 10]) or for other linear algebra tasks [6, 4]. For sparse matrix computations, blocking techniques have up to now mainly been popular for low-level blocking of elements, for example using the BCSR format (block compressed sparse row [8]). However, block-recursive techniques for data storage and respective linear algebra computations have also been presented [7].
Fig. 1. Recursive construction of a Peano curve (for patterns P and Q, only)

In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). The approach is based on a block-recursive data structure to store sparse matrices, where the leaf blocks can be either zero, dense, or stored in a standard sparse format (compressed sparse row). Our current intended application is the computation of the exponential function of sparse matrices as they occur within a quantum control problem [9]. After a quick recapitulation of the Peano matrix multiplication in section 2, we introduce the Peano data structure for sparse matrices in section 3. Section 4 introduces the current test scenarios and gives first performance results on single- and multicore platforms. In section 5, we outline current and future work.

2 Dense Matrix Multiplication Based on Peano Curves

In previous works [1, 2], we have introduced a cache oblivious algorithm for matrix multiplication that uses a memory layout derived from Peano space-filling curves to store the matrix elements. A corresponding block-recursive scheme for matrix multiplication leads to a cache oblivious algorithm with excellent inherent locality properties. The respective algorithm can be efficiently parallelised on multicore platforms [3], and was shown to be competitive with the currently fastest libraries, such as GotoBLAS [5].

2.1 Element Order and Block-recursive Multiplication

Figure 1 illustrates the Peano element order used to store dense matrices. It is based on so-called iterations of a Peano curve. A matrix is recursively subdivided into 3x3 subblocks, which are stored contiguously in memory according to four different block numbering patterns marked as P (starting pattern), Q, R, and S.
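The recursive construction can be sketched in a few lines of Python. The concrete traversal tables for the patterns P, Q, R, and S below are an illustrative reconstruction (not TifaMMy's actual code), chosen so that consecutive elements in the resulting storage order are always direct neighbours in the matrix:

```python
# Sketch of the recursive Peano element order with patterns P, Q, R, S.
# Visit order of the 3x3 subblocks, given as (row, col), for each pattern:
CELLS = {
    "P": [(0,0),(1,0),(2,0),(2,1),(1,1),(0,1),(0,2),(1,2),(2,2)],
    "Q": [(0,2),(1,2),(2,2),(2,1),(1,1),(0,1),(0,0),(1,0),(2,0)],  # P mirrored left-right
    "R": [(2,0),(1,0),(0,0),(0,1),(1,1),(2,1),(2,2),(1,2),(0,2)],  # P mirrored top-bottom
    "S": [(2,2),(1,2),(0,2),(0,1),(1,1),(2,1),(2,0),(1,0),(0,0)],  # both mirrors
}
# Pattern assigned to each subblock, listed in visit order:
CHILDREN = {"P": "PQPRSRPQP", "Q": "QPQSRSQPQ",
            "R": "RSRPQPRSR", "S": "SRSQPQSRS"}

def peano_order(pattern="P", row=0, col=0, n=9):
    """Yield the positions (row, col) of an n x n matrix (n a power of 3)
    in the Peano storage order, starting from the given pattern."""
    if n == 1:
        yield (row, col)
        return
    m = n // 3
    for (i, j), child in zip(CELLS[pattern], CHILDREN[pattern]):
        yield from peano_order(child, row + i * m, col + j * m, m)

order = list(peano_order("P", 0, 0, 9))
```

Each element in `order` is a direct horizontal or vertical neighbour of its predecessor, which is the contiguity property the memory layout exploits.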
The recursion is stopped once the matrix blocks are small enough to be efficiently multiplied within the level 1 cache. Hence, we refer to these smallest blocks as L1 blocks. Within these L1 blocks, a simple column-major order is used. A blockwise matrix multiplication of matrices stored in Peano order is illustrated in equation (1). There, each matrix block is named according to its numbering scheme and indexed with the name of the global matrix and the
position within the storage scheme:

    [ P_A0  R_A5  P_A6 ]   [ P_B0  R_B5  P_B6 ]   [ P_C0  R_C5  P_C6 ]
    [ Q_A1  S_A4  Q_A7 ] * [ Q_B1  S_B4  Q_B7 ] = [ Q_C1  S_C4  Q_C7 ]   (1)
    [ P_A2  R_A3  P_A8 ]   [ P_B2  R_B3  P_B8 ]   [ P_C2  R_C3  P_C8 ]
            =: A                   =: B                   =: C

For the resulting block multiplications, the following inherently cache-efficient execution order can be derived [1] (indices A, B, C are left out to improve readability; the 27 block operations are executed column by column):

    P0 += P0 P0    R3 += P8 R3    P6 += P0 P6
    Q1 += Q1 P0    S4 += Q7 R3    Q7 += Q1 P6
    P2 += P2 P0    R5 += P6 R3    P8 += P2 P6
    P2 += R3 Q1    R5 += R5 S4    P8 += R3 Q7
    Q1 += S4 Q1    S4 += S4 S4    Q7 += S4 Q7    (2)
    P0 += R5 Q1    R3 += R3 S4    P6 += R5 Q7
    P0 += P6 P2    R3 += P2 R5    P6 += P6 P8
    Q1 += Q7 P2    S4 += Q1 R5    Q7 += Q7 P8
    P2 += P8 P2    R5 += P0 R5    P8 += P8 P8

Recursive extension of this blockwise multiplication leads to a cache oblivious algorithm, which gains excellent locality properties from the underlying Peano curve: after an access to an element (or L1 block) in any of the matrices A, B, or C, either the same element (L1 block) will be reused, or its direct left or right neighbour (compare equation (2)). Any sequence of k^3 floating point operations is executed on only O(k^2) contiguous elements in each matrix. Vice versa, on any memory block of k^2 contiguous elements, at least O(k^3) operations will be performed. As a result, a sharp, asymptotically optimal upper bound on the number of cache misses can be obtained (cf. [1]).

2.2 Efficient Implementation on Multicore Platforms

To achieve high performance on modern CPU cores, the block-recursive algorithm has to be combined with hardware-oriented, optimised multiplication kernels [2]. In particular, to allow efficient use of the vectorisation properties of CPUs (SIMD extensions, etc.), we stop the recursion in memory layout and block multiplication on so-called L1 blocks. The size of these L1 blocks proved to be optimal if two L1 blocks fit into the L1 cache; these are basically the A- and B-blocks of the matrices, as the write accesses to the C-blocks are streamed directly to the L2 cache.
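The execution order of equation (2) can be checked with a small Python sketch; here, each L1 block is represented by a plain number, the mapping `POS` from storage index to block position follows equation (1), and `SCHEDULE` transcribes the 27 block operations in execution order:

```python
# Peano block numbering: storage index 0..8 -> (row, col) position in the matrix.
POS = {0: (0,0), 1: (1,0), 2: (2,0), 3: (2,1), 4: (1,1),
       5: (0,1), 6: (0,2), 7: (1,2), 8: (2,2)}

# Execution order of equation (2) as (c, a, b) triples: C_c += A_a * B_b.
SCHEDULE = [
    (0,0,0),(1,1,0),(2,2,0),(2,3,1),(1,4,1),(0,5,1),(0,6,2),(1,7,2),(2,8,2),
    (3,8,3),(4,7,3),(5,6,3),(5,5,4),(4,4,4),(3,3,4),(3,2,5),(4,1,5),(5,0,5),
    (6,0,6),(7,1,6),(8,2,6),(8,3,7),(7,4,7),(6,5,7),(6,6,8),(7,7,8),(8,8,8),
]

def peano_multiply(A, B):
    """Multiply two 3x3 block matrices given in Peano storage order
    (lists of 9 'blocks'; plain numbers stand in for the L1 blocks)."""
    C = [0.0] * 9
    for c, a, b in SCHEDULE:
        C[c] += A[a] * B[b]
    return C
```

Replacing the scalar products by recursive calls on the nine subblocks yields the full block-recursive algorithm. Note that consecutive triples in `SCHEDULE` differ by at most one in each of the three indices, which is exactly the locality property discussed above.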
Fig. 2. Relative performance of TifaMMy, GotoBLAS and MKL on a Xeon server using up to 16 cores (four times quad-core) [3]. Performance is given in MFlop/s per core.

In a recent paper [3], we demonstrated that the resulting implementation, called TifaMMy, is competitive with up-to-date BLAS implementations (GotoBLAS and Intel's MKL). Moreover, we presented a shared-memory parallelisation of our algorithm based on OpenMP, and tested the performance of our approach on several multicore platforms. Figure 2 shows the average and maximum performance for double precision on a Xeon server with four quad-core Xeon processors (Tigerton, 2.93 GHz). In this test, TifaMMy outran both GotoBLAS and MKL in terms of absolute performance if 8 or all 16 cores were used.

3 A Block Recursive Data Structure for Sparse Matrices

To extend our approach to sparse matrix multiplications, we first need to modify our block-recursive data structure to allow efficient storage of sparse matrices. Inspired by an approach used by Herrero and Navarro [7], we allow, in a first step, that each L1 block of the matrix can be either a zero block, a dense block stored in row-major order (as already existing), or a sparse matrix block stored according to the CSR (compressed sparse row) data structure. To store a sparse matrix with n rows and m non-zero elements, CSR uses three arrays: two arrays, each with m elements, that store the matrix element values and the corresponding column indices, respectively; and a third array of size n that references the first non-zero element of each matrix row. In a second step, we allow the block recursion to stop already for larger zero blocks in the numbering scheme, i.e. as soon as a block contains only zero elements.
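The CSR layout and the sparse-dense product that the later sections benchmark can be sketched as follows. This is a minimal illustration, not TifaMMy's kernel; we use the common convention of n+1 row pointers, so that the non-zeros of row i occupy positions row_ptr[i] to row_ptr[i+1]-1:

```python
def dense_to_csr(M):
    """Convert a dense matrix (list of lists) to CSR: element values,
    column indices, and row pointers (n+1 entries; entry i references
    the first non-zero element of row i)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, x in enumerate(row):
            if x != 0.0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_times_dense(values, col_idx, row_ptr, B):
    """Sparse-dense product: CSR matrix times a dense matrix B."""
    n, m = len(row_ptr) - 1, len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[p], col_idx[p]
            for k in range(m):
                C[i][k] += v * B[j][k]  # one row of B streamed per non-zero
    return C
```

Only the non-zeros of the sparse factor are touched, but every non-zero streams a full row of the dense factor, which is why this operation is so strongly memory-bound.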
The respective, adaptive block recursion is best described by a tree structure, where each node of the tree has 9 children (according to the Peano blocking), and where the leaves of the tree are either zero blocks (of any size), or L1 blocks that can be sparse, dense, or zero. The data structure to store sparse matrices is also split into two parts: the first part is a contiguous data stream that holds, in Peano order, the element data required to store the dense L1 blocks and the sparse blocks in CSR format. The second part describes the sparsity tree of the matrix, and also contains all management information (in particular, the respective block types). To store this
sparsity tree, we use a sequentialised storage scheme according to a modified depth-first traversal. Again, the order in which the children of each node are stored in memory follows the Peano order. All block-recursive algorithms implemented on this data structure have to be able to address the nine child blocks when doing a recursive call; hence, this data has to be stored together. Therefore, for each node of the sequentialised sparsity tree, we store the start index of each child node plus a pointer to the end of this subblock (as illustrated in figure 3).

Fig. 3. Illustration of the modified depth-first storage scheme

The blockwise multiplication is performed according to the Peano scheme given in equation (2). Multiplications that involve zero blocks are, of course, not performed. Due to the missing block operations, the Peano multiplication scheme loses its strict locality properties [1], the more zero blocks are present. However, we can still expect a positive effect on cache performance due to the underlying space-filling curve.

4 Performance Tests

We evaluated the performance of our approach on two different workstations, each with a total of eight CPU cores: one Xeon workstation that holds two quad-core processors (Clovertown, 2.66 GHz), which are connected to main memory via a single front side bus; and a Barcelona test platform that holds two AMD Opteron 2347 processors (Barcelona, quad-core, 1.9 GHz), which are connected to memory via AMD's NUMA technology. For all sparse-dense multiplications, we compared TifaMMy's implementation with the one provided by Intel's MKL. In section 4.1, we give the performance for several synthetic benchmark scenarios.
In section 4.2, we examine a concrete application scenario, where sparse-dense matrix multiplication is the runtime-dominating problem during the computation of the exponential function of a structured sparse matrix.
4.1 Performance for Benchmark Scenarios

Our first performance test once more considers the multiplication of dense matrices. Here, the dense matrices are stored in TifaMMy's Peano-block hybrid format, but using only dense L1 blocks. Figure 4 gives a performance comparison between TifaMMy's dense-optimised (1.3.2) and sparse-enabled (2.0.0) versions, both in relation to the dense matrix multiplication routines provided by Intel's MKL and GotoBLAS. We see that the extension to allow hybrid and sparse matrices leads to a slight performance loss in TifaMMy. However, the sparse-enabled TifaMMy is still faster than MKL, nearly level with GotoBLAS for up to four threads, and slightly faster than GotoBLAS when all eight CPU cores are used.

Fig. 4. Performance comparison of dense matrix multiplication. Results (in MFlop/s per core, using 1 to 8 threads on the Xeon workstation, matrix size ...) are given for TifaMMy (optimised for dense multiplication), TifaMMy (sparse-matrix version using only dense blocks), and the dgemm implementations in Intel's MKL and in GotoBLAS.

As a second benchmark, we consider an n x n band matrix with bandwidth sqrt(n), which is multiplied with a dense n x n matrix. In particular, we compare the performance difference if TifaMMy stores most of the L1 blocks as dense blocks, compared to the performance achieved when allowing only CSR blocks. The respective results are given in figure 5. The performance of MKL for this band-dense multiplication is given for comparison; note that here the CSR format was used to store the band matrix. For once, we can see that the Peano-block CSR format used by TifaMMy leads to a performance increase of approximately ... % compared to MKL. Allowing dense blocks in TifaMMy, however, leads to an additional 100 % gain in performance, which results from the better use of SSE instructions in the dense multiplication.
For a first test of sparse-dense matrix multiplication, we used a 9-diagonal n x n sparse matrix, where each row i of the matrix contains non-zero elements a_ij, if j is in {i, i +- 1, i +- sqrt(n), i +- sqrt(n) +- 1}. The respective performance comparison between TifaMMy and MKL (on both the Xeon and the Barcelona workstation) is given in figure 6. For the single-thread test, TifaMMy gave a performance
Fig. 5. Performance (single-core, on the Xeon workstation) of a band matrix multiplication (band matrix x dense matrix): MKL (using the CSR format) vs. TifaMMy (Peano-block, CSR only) vs. TifaMMy (Peano order with dense and CSR blocks, threshold 20%).

Fig. 6. Single-core vs. eight-core performance of the sparse-dense matrix multiplication in the quantum control problem. We also compare the Xeon with the Barcelona workstations.
gain of approximately 25 % compared to MKL. On eight cores, TifaMMy was also able to outrun MKL by ... %. Especially in this example of a sparse-dense matrix multiplication, we notice the meagre speedup when using eight cores instead of one, which illustrates the much stronger memory-boundedness of the sparse-dense matrix multiplication. Note that the Barcelona workstation with its NUMA architecture showed by far the better scalability in that test.

4.2 Performance of Sparse-Dense Multiplication in a Quantum Control Problem

As an example of a particular application of sparse-dense matrix multiplication, we will, in this section, present first performance tests of using TifaMMy within the simulation of a quantum control problem. There, the subsequent evolution of quantum states is modelled by a sequence of evolution matrices U^(r)(t_k):

    U^(r)(t_k) = e^(-i dt H^(r)_k) e^(-i dt H^(r)_{k-1}) ... e^(-i dt H^(r)_1).   (3)

To obtain the factors e^(-i dt H^(r)_k), the exponential function of the sparse Hamiltonian matrices H^(r)_k needs to be computed. The sparsity structure of the matrices is illustrated in figure 7. The matrices have dimension 2^q (where q is the number of modelled quantum bits), and have q + 1 non-zeros per matrix row (one non-zero results from the anti-diagonal elements indicated in grey). The exponential function is approximated by a Chebyshev polynomial, whose evaluation requires sparse matrix multiplication. As the accumulated result finally becomes a dense matrix, we consider only sparse-dense matrix multiplication at the moment. Figure 8 shows the performance results on the Xeon and the Barcelona workstations. We compare TifaMMy with MKL when using 1, 2, 4, or 8 threads on these eight-core platforms. Obviously, TifaMMy can especially profit from the NUMA architecture on the AMD workstation. There, TifaMMy gives a substantial performance advantage (more than a factor of 2 on eight threads) over MKL.
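The role of sparse-dense multiplication in the Chebyshev evaluation can be illustrated with a scalar sketch of the Clenshaw recurrence; the cosine-sampling scheme for the coefficients is our own illustration, not necessarily the paper's. In the actual application, x is replaced by the (rescaled) sparse Hamiltonian, the recurrence values b become dense matrices, and each step of the recurrence therefore performs one sparse-dense matrix multiplication:

```python
import math

def cheb_coeffs(f, K, N=64):
    """Chebyshev coefficients c_0..c_K of f on [-1, 1], computed by
    sampling at cosine points (a simple numerical scheme assumed here)."""
    thetas = [math.pi * (j + 0.5) / N for j in range(N)]
    return [2.0 / N * sum(f(math.cos(t)) * math.cos(k * t) for t in thetas)
            for k in range(K + 1)]

def cheb_eval(coeffs, x):
    """Clenshaw recurrence for c_0/2 + sum_{k>=1} c_k T_k(x).
    With a matrix argument, the product x * b1 in each step becomes a
    sparse-dense matrix multiplication (sparse Hamiltonian, dense b1)."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = c + 2.0 * x * b1 - b2, b1
    return coeffs[0] / 2.0 + x * b1 - b2

coeffs = cheb_coeffs(math.exp, 12)
approx = cheb_eval(coeffs, 0.3)   # close to math.exp(0.3)
```

Since the iterates start sparse and fill in only gradually, the accumulated result is what eventually turns the problem into a purely sparse-dense multiplication workload.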
In contrast, on the Xeon platform this performance advantage diminishes for larger matrix sizes when using 4 or 8 cores. In that case, both MKL and TifaMMy seem to be bound by the available memory bandwidth.

5 Conclusions, Current and Future Work

We have demonstrated that introducing a Peano block layout to allow the storage of sparse matrices in TifaMMy can lead to a solid performance gain in sparse-dense matrix multiplication compared to using the standard CSR storage scheme. One of the main advantages of the Peano approach is that both sparse and dense matrix multiplication perform well, on both single- and multicore platforms. Especially on several cores, sparse-dense multiplication already proved to be a memory-bound problem. Hence, we can expect that the sparsity structure of the involved matrices will have a strong influence on performance.
Fig. 7. Recursive structure of the sparse Hamiltonian matrices H^(r)_k of size 2^q (cases q = 4 and q = 5).

Fig. 8. Performance comparison (TifaMMy vs. MKL) of the sparse-dense matrix multiplication within the quantum control problem: (a) Xeon, 2 quad-core (Clovertown); (b) AMD, 2 quad-core (Barcelona). Results are given for different matrix sizes, and for using 1, 2, 4, or 8 threads on a Xeon and a Barcelona workstation.
Our next steps will therefore be to identify and run tests on further applications of sparse-dense matrix multiplication. Concerning our quantum control example, we will also test whether including sparse-sparse multiplication can further improve the solution times. During the accumulation of the Chebyshev polynomial, the accumulated result matrix is initially sparse, and typically takes a few multiplications to become dense. As, in sparse-sparse multiplication, computing (or constantly updating) the sparsity structure of the result matrix can be rather time-consuming, we will, in a first step, use a sparse pattern of dense L1 blocks to store the accumulated results. We also intend to test sparse-sparse matrix multiplication in scenarios where the pattern of the result matrix is known a priori. One such example would be the blockwise computation of an incomplete LU decomposition, where such a sparse-sparse matrix multiplication is required in a respective block-recursive formulation. Another example could be the Galerkin-type computation of coarse-grid operators in geometric multigrid methods.

References

1. Bader, M., Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3).
2. Bader, M., Franz, R., Guenther, S., Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves. LNCS 4967.
3. Heinecke, A., Bader, M.: Parallel Matrix Multiplication based on Space-filling Curves on Shared Memory Multicore Platforms. Proc. Computing Frontiers Conf. and co-located workshops: MAW '08 & WREFT '08, Ischia.
4. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1).
5. Goto, K., van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34 (3).
6. Gustavson, F.G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6).
7. Herrero, J.R., Navarro, J.J.: Adapting Linear Algebra Codes to the Memory Hierarchy Using a Hypermatrix Scheme. LNCS 3911, 2006.
8. Im, E.-J., Yelick, K.A., Vuduc, R.: SPARSITY: An Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications 18 (1), 2004.
9. Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T., Glaser, S.J.: Optimal Control of Coupled Spin Dynamics: Design of NMR Pulse Sequences by Gradient Ascent Algorithms. Journal of Magnetic Resonance 172, 2005.
10. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27 (1-2), 2001.
More informationMemory Hierarchy Optimizations and Performance Bounds for Sparse A T Ax
Memory Hierarchy Optimizations and Performance Bounds for Sparse A T Ax Richard Vuduc, Attila Gyulassy, James W. Demmel, and Katherine A. Yelick Computer Science Division, University of California, Berkeley
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationPlacement de processus (MPI) sur architecture multi-cœur NUMA
Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr
More informationGenerating Optimized Sparse Matrix Vector Product over Finite Fields
Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationA priori power estimation of linear solvers on multi-core processors
A priori power estimation of linear solvers on multi-core processors Dimitar Lukarski 1, Tobias Skoglund 2 Uppsala University Department of Information Technology Division of Scientific Computing 1 Division
More informationExploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems) Julie Langou Piotr Luszczek Alfredo Buttari Julien Langou
More informationCOMPUTATIONAL LINEAR ALGEBRA
COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim
More informationModern GPUs (Graphics Processing Units)
Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationParallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer
More informationAuto-tuning Multigrid with PetaBricks
Auto-tuning with PetaBricks Cy Chan Joint Work with: Jason Ansel Yee Lok Wong Saman Amarasinghe Alan Edelman Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationfspai-1.1 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.1 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 09 10 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationMemory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves
Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract
More informationPerformance comparison and optimization: Case studies using BenchIT
John von Neumann Institute for Computing Performance comparison and optimization: Case studies using BenchIT R. Schöne, G. Juckeland, W.E. Nagel, S. Pflüger, R. Wloch published in Parallel Computing: Current
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationMatrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs
Iterative Solvers Numerical Results Conclusion and outlook 1/18 Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Eike Hermann Müller, Robert Scheichl, Eero Vainikko
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationDevelopment of efficient computational kernels and linear algebra routines for out-of-order superscalar processors
Future Generation Computer Systems 21 (2005) 743 748 Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors O. Bessonov a,,d.fougère b, B. Roux
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationCluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance
Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationAutotuning (1/2): Cache-oblivious algorithms
Autotuning (1/2): Cache-oblivious algorithms Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.17] Tuesday, March 4, 2008 1 Today s sources CS 267 (Demmel
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationHigh-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen
High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationMVAPICH2 vs. OpenMPI for a Clustering Algorithm
MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract
ATLAS NOTE December 4, 2014 ATLAS offline reconstruction timing improvements for run-2 The ATLAS Collaboration Abstract ATL-SOFT-PUB-2014-004 04/12/2014 From 2013 to 2014 the LHC underwent an upgrade to
More informationA Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD
A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD KELEFOURAS, Vasileios , KRITIKAKOU, Angeliki and GOUTIS, Costas Available
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationA Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors
A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors Kai Zhang, ShuMing Chen*, Wei Liu, and Xi Ning School of Computer, National University of Defense Technology #109, Deya Road,
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationPortable, usable, and efficient sparse matrix vector multiplication
Portable, usable, and efficient sparse matrix vector multiplication Albert-Jan Yzelman Parallel Computing and Big Data Huawei Technologies France 8th of July, 2016 Introduction Given a sparse m n matrix
More information