Cache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
Michael Bader and Alexander Heinecke
Institut für Informatik, Technische Universität München, Germany

Abstract. Cache oblivious algorithms are designed to benefit from any existing cache hierarchy, regardless of cache size or architecture. In matrix computations, cache oblivious approaches are usually obtained from block-recursive approaches. In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). We present the respective block-oriented data structure to store sparse matrices, and give first performance results on multicore platforms.

1 Introduction

One of the problems that high performance computing is currently confronted with is a growing gap between the theoretically available peak performance of modern CPUs and parallel computing platforms, and the performance that can be achieved in practice. Multilevel memory hierarchies, specific processor extensions, and multicore architectures in general are just some of the features for which a careful, hardware-oriented algorithm design and implementation is necessary to achieve a satisfying portion of the available performance. For dense linear algebra computations, the BLAS (Basic Linear Algebra Subprograms) library standard forms the basis for many highly efficient implementations. Especially for matrix multiplication, one of the core BLAS routines, highly optimised implementations are available, either vendor-supplied (such as ACML or Intel MKL) or third-party developments (GotoBLAS [5], ATLAS [10]). For sparse matrix computations, the situation is much less satisfying.
In contrast to dense matrix multiplication, the multiplication of sparse matrices (sparse-sparse or dense-sparse) is essentially a memory-bound problem, where the data transfer between main memory and CPU becomes the limiting factor for performance. For dense matrix computations, blocking techniques are an efficient means to exploit existing cache hierarchies and thus sustain optimal CPU performance. Both cache aware and cache oblivious techniques have been proposed and successfully used, either for BLAS implementations (cf. [5, 10]) or for other linear algebra tasks [6, 4]. For sparse matrix computations, blocking techniques have up to now mainly been popular for low-level blocking of elements, for example using the BCSR format (block compressed sparse row [8]). However, block-recursive techniques for data storage and respective linear algebra computations have also been presented [7].
Fig. 1. Recursive construction of a Peano curve (for patterns P and Q, only)

In this article, we extend an existing cache oblivious approach for matrix operations, which is based on Peano space-filling curves, to the multiplication of sparse and dense matrices (sparse-dense, dense-sparse, sparse-sparse). The approach is based on a block-recursive data structure to store sparse matrices, where the leaf blocks can be either zero, dense, or stored in a standard sparse format (compressed sparse row). Our current intended application is the computation of the exponential function of sparse matrices as they occur within a quantum control problem [9]. After a quick recapitulation of the Peano matrix multiplication in section 2, we introduce the Peano data structure for sparse matrices in section 3. Section 4 introduces the current test scenarios and gives first performance results on single- and multicore platforms. In section 5, we outline current and future work.

2 Dense Matrix Multiplication Based on Peano Curves

In previous works [1, 2], we have introduced a cache oblivious algorithm for matrix multiplication that uses a memory layout derived from Peano space-filling curves to store the matrix elements. A corresponding block-recursive scheme for matrix multiplication leads to a cache oblivious algorithm with excellent inherent locality properties. The respective algorithm can be efficiently parallelised on multicore platforms [3], and was shown to be competitive with the currently fastest libraries, such as GotoBLAS [5].

2.1 Element Order and Block-recursive Multiplication

Figure 1 illustrates the Peano element order used to store dense matrices. It is based on so-called iterations of a Peano curve. A matrix is recursively subdivided into 3x3 subblocks, which are stored contiguously in memory according to four different block numbering patterns marked as P (starting pattern), Q, R, and S.
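The recursive construction can be sketched in a few lines of Python. The concrete traversal tables for the patterns P, Q, R, and S below are an illustrative reconstruction (not TifaMMy's actual code), chosen so that consecutive elements in the resulting storage order are always direct neighbours in the matrix:

```python
# Sketch of the recursive Peano element order with patterns P, Q, R, S.
# Visit order of the 3x3 subblocks, given as (row, col), for each pattern:
CELLS = {
    "P": [(0,0),(1,0),(2,0),(2,1),(1,1),(0,1),(0,2),(1,2),(2,2)],
    "Q": [(0,2),(1,2),(2,2),(2,1),(1,1),(0,1),(0,0),(1,0),(2,0)],  # P mirrored left-right
    "R": [(2,0),(1,0),(0,0),(0,1),(1,1),(2,1),(2,2),(1,2),(0,2)],  # P mirrored top-bottom
    "S": [(2,2),(1,2),(0,2),(0,1),(1,1),(2,1),(2,0),(1,0),(0,0)],  # both mirrors
}
# Pattern assigned to each subblock, listed in visit order:
CHILDREN = {"P": "PQPRSRPQP", "Q": "QPQSRSQPQ",
            "R": "RSRPQPRSR", "S": "SRSQPQSRS"}

def peano_order(pattern="P", row=0, col=0, n=9):
    """Yield the positions (row, col) of an n x n matrix (n a power of 3)
    in the Peano storage order, starting from the given pattern."""
    if n == 1:
        yield (row, col)
        return
    m = n // 3
    for (i, j), child in zip(CELLS[pattern], CHILDREN[pattern]):
        yield from peano_order(child, row + i * m, col + j * m, m)

order = list(peano_order("P", 0, 0, 9))
```

Each element in `order` is a direct horizontal or vertical neighbour of its predecessor, which is the contiguity property the memory layout exploits.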
The recursion is stopped once the matrix blocks are small enough to be efficiently multiplied within the level 1 cache. Hence, we refer to these smallest blocks as L1 blocks. Within these L1 blocks, a simple column-major order is used. A blockwise matrix multiplication of matrices stored in Peano order is illustrated in equation (1). There, each matrix block is named according to its numbering scheme and indexed with the name of the global matrix and the
position within the storage scheme:

    [ P_A0  R_A5  P_A6 ]   [ P_B0  R_B5  P_B6 ]   [ P_C0  R_C5  P_C6 ]
    [ Q_A1  S_A4  Q_A7 ] * [ Q_B1  S_B4  Q_B7 ] = [ Q_C1  S_C4  Q_C7 ]   (1)
    [ P_A2  R_A3  P_A8 ]   [ P_B2  R_B3  P_B8 ]   [ P_C2  R_C3  P_C8 ]
            =: A                   =: B                   =: C

For the resulting block multiplications, the following inherently cache-efficient execution order can be derived [1] (indices A, B, C are left out to improve readability; the 27 block operations are executed column by column):

    P0 += P0 P0    R3 += P8 R3    P6 += P0 P6
    Q1 += Q1 P0    S4 += Q7 R3    Q7 += Q1 P6
    P2 += P2 P0    R5 += P6 R3    P8 += P2 P6
    P2 += R3 Q1    R5 += R5 S4    P8 += R3 Q7
    Q1 += S4 Q1    S4 += S4 S4    Q7 += S4 Q7    (2)
    P0 += R5 Q1    R3 += R3 S4    P6 += R5 Q7
    P0 += P6 P2    R3 += P2 R5    P6 += P6 P8
    Q1 += Q7 P2    S4 += Q1 R5    Q7 += Q7 P8
    P2 += P8 P2    R5 += P0 R5    P8 += P8 P8

Recursive extension of this blockwise multiplication leads to a cache oblivious algorithm, which gains excellent locality properties from the underlying Peano curve: after an access to an element (or L1 block) in any of the matrices A, B, or C, either the same element (L1 block) will be reused, or its direct left or right neighbour (compare equation (2)). Any sequence of k^3 floating point operations is executed on only O(k^2) contiguous elements in each matrix. Vice versa, on any memory block of k^2 contiguous elements, at least O(k^3) operations will be performed. As a result, a sharp, asymptotically optimal upper bound on the number of cache misses can be obtained (cf. [1]).

2.2 Efficient Implementation on Multicore Platforms

To achieve high performance on modern CPU cores, the block-recursive algorithm has to be combined with hardware-oriented, optimised multiplication kernels [2]. In particular, to allow efficient use of the vectorisation properties of CPUs (SIMD extensions, etc.), we stop the recursion in memory layout and block multiplication on so-called L1 blocks. The size of these L1 blocks proved to be optimal if two L1 blocks fit into the L1 cache; these are basically the A- and B-blocks of the matrices, as the write accesses to the C-blocks are streamed directly to the L2 cache.
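The execution order of equation (2) can be checked with a small Python sketch; here, each L1 block is represented by a plain number, the mapping `POS` from storage index to block position follows equation (1), and `SCHEDULE` transcribes the 27 block operations in execution order:

```python
# Peano block numbering: storage index 0..8 -> (row, col) position in the matrix.
POS = {0: (0,0), 1: (1,0), 2: (2,0), 3: (2,1), 4: (1,1),
       5: (0,1), 6: (0,2), 7: (1,2), 8: (2,2)}

# Execution order of equation (2) as (c, a, b) triples: C_c += A_a * B_b.
SCHEDULE = [
    (0,0,0),(1,1,0),(2,2,0),(2,3,1),(1,4,1),(0,5,1),(0,6,2),(1,7,2),(2,8,2),
    (3,8,3),(4,7,3),(5,6,3),(5,5,4),(4,4,4),(3,3,4),(3,2,5),(4,1,5),(5,0,5),
    (6,0,6),(7,1,6),(8,2,6),(8,3,7),(7,4,7),(6,5,7),(6,6,8),(7,7,8),(8,8,8),
]

def peano_multiply(A, B):
    """Multiply two 3x3 block matrices given in Peano storage order
    (lists of 9 'blocks'; plain numbers stand in for the L1 blocks)."""
    C = [0.0] * 9
    for c, a, b in SCHEDULE:
        C[c] += A[a] * B[b]
    return C
```

Replacing the scalar products by recursive calls on the nine subblocks yields the full block-recursive algorithm. Note that consecutive triples in `SCHEDULE` differ by at most one in each of the three indices, which is exactly the locality property discussed above.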
Fig. 2. Relative performance of TifaMMy, GotoBLAS and MKL on a Xeon server using up to 16 cores (four times quad-core) [3]. Performance is given in MFlop/s per core.

In a recent paper [3], we demonstrated that the resulting implementation, called TifaMMy, is competitive with up-to-date BLAS implementations (GotoBLAS and Intel's MKL). Moreover, we presented a shared-memory parallelisation of our algorithm based on OpenMP, and tested the performance of our approach on several multicore platforms. Figure 2 shows the average and maximum performance for double precision on a Xeon server with four quad-core Xeon processors (Tigerton, 2.93 GHz). In this test, TifaMMy outran both GotoBLAS and MKL in terms of absolute performance if 8 or all 16 cores were used.

3 A Block Recursive Data Structure for Sparse Matrices

To extend our approach to sparse matrix multiplications, we first need to modify our block-recursive data structure to allow efficient storage of sparse matrices. Inspired by an approach used by Herrero and Navarro [7], we allow, in a first step, that each L1 block of the matrix can be either a zero block, a dense block stored in row-major order (as already existing), or a sparse matrix block stored according to the CSR (compressed sparse row) data structure. To store a sparse matrix with n rows and m non-zero elements, CSR uses three arrays: two arrays, each with m elements, that store the matrix element values and the corresponding column indices, respectively; and a third array of size n that references the first non-zero element of each matrix row. In a second step, we allow the block recursion to stop already for larger zero blocks in the numbering scheme, i.e. as soon as a block contains only zero elements.
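The CSR layout and the sparse-dense product that the later sections benchmark can be sketched as follows. This is a minimal illustration, not TifaMMy's kernel; we use the common convention of n+1 row pointers, so that the non-zeros of row i occupy positions row_ptr[i] to row_ptr[i+1]-1:

```python
def dense_to_csr(M):
    """Convert a dense matrix (list of lists) to CSR: element values,
    column indices, and row pointers (n+1 entries; entry i references
    the first non-zero element of row i)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, x in enumerate(row):
            if x != 0.0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_times_dense(values, col_idx, row_ptr, B):
    """Sparse-dense product: CSR matrix times a dense matrix B."""
    n, m = len(row_ptr) - 1, len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[p], col_idx[p]
            for k in range(m):
                C[i][k] += v * B[j][k]  # one row of B streamed per non-zero
    return C
```

Only the non-zeros of the sparse factor are touched, but every non-zero streams a full row of the dense factor, which is why this operation is so strongly memory-bound.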
The respective, adaptive block recursion is best described by a tree structure, where each node of the tree has 9 children (according to the Peano blocking), and where the leaves of the tree are either zero blocks (of any size), or L1 blocks that can be sparse, dense, or zero. The data structure to store sparse matrices is also split into two parts: the first part is a contiguous data stream that holds, in Peano order, the element data required to store the dense L1 blocks and the sparse blocks in CSR format. The second part describes the sparsity tree of the matrix, and also contains all management information (in particular, the respective block types). To store this
sparsity tree, we use a sequentialised storage scheme according to a modified depth-first traversal. Again, the order in which the children of each node are stored in memory follows the Peano order. All block-recursive algorithms implemented on this data structure have to be able to address the nine child blocks when doing a recursive call; hence, this data has to be stored together. Therefore, for each node of the sequentialised sparsity tree, we store the start index of each child node plus a pointer to the end of this subblock (as illustrated in figure 3).

Fig. 3. Illustration of the modified depth-first storage scheme

The blockwise multiplication is performed according to the Peano scheme given in equation (2). Multiplications that involve zero blocks are, of course, not performed. Due to the missing block operations, the Peano multiplication scheme loses its strict locality properties [1], the more zero blocks are present. However, we can still expect a positive effect on cache performance due to the underlying space-filling curve.

4 Performance Tests

We evaluated the performance of our approach on two different workstations, each with a total of eight CPU cores: one Xeon workstation that holds two quad-core processors (Clovertown, 2.66 GHz), which are connected to main memory via a single front side bus; and a Barcelona test platform that holds two AMD Opteron 2347 processors (Barcelona, quad-core, 1.9 GHz), which are connected to memory via AMD's NUMA technology. For all sparse-dense multiplications, we compared TifaMMy's implementation with the one provided by Intel's MKL. In section 4.1, we give the performance for several synthetic benchmark scenarios.
In section 4.2, we examine a concrete application scenario, where sparse-dense matrix multiplication is the runtime-dominating problem during the computation of the exponential function of a structured sparse matrix.
4.1 Performance for Benchmark Scenarios

Our first performance test once more considers the multiplication of dense matrices. Here, the dense matrices are stored in TifaMMy's Peano-block hybrid format, but using only dense L1 blocks. Figure 4 gives a performance comparison between TifaMMy's dense-optimised (1.3.2) and sparse-enabled (2.0.0) versions, both in relation to the dense matrix multiplication routines provided by Intel's MKL and GotoBLAS. We see that the extension to allow hybrid and sparse matrices leads to a slight performance loss in TifaMMy. However, the sparse-enabled TifaMMy is still faster than MKL, nearly level with GotoBLAS for up to four threads, and slightly faster than GotoBLAS when all eight CPU cores are used.

Fig. 4. Performance comparison of dense matrix multiplication. Results (in MFlop/s per core, using 1 to 8 threads on the Xeon workstation, matrix size ...) are given for TifaMMy (optimised for dense multiplication), TifaMMy (sparse-matrix version using only dense blocks), and the dgemm implementations in Intel's MKL and in GotoBLAS.

As a second benchmark, we consider an n x n band matrix with bandwidth sqrt(n), which is multiplied with a dense n x n matrix. In particular, we compare the performance difference if TifaMMy stores most of the L1 blocks as dense blocks, compared to the performance achieved when allowing only CSR blocks. The respective results are given in figure 5. The performance of MKL for this band-dense multiplication is given for comparison; note that here the CSR format was used to store the band matrix. For once, we can see that the Peano-block CSR format used by TifaMMy leads to a performance increase of approximately ... % compared to MKL. Allowing dense blocks in TifaMMy, however, leads to an additional 100 % gain in performance, which results from the better use of SSE instructions in the dense multiplication.
For a first test of sparse-dense matrix multiplication, we used a 9-diagonal n x n sparse matrix, where each row i of the matrix contains non-zero elements a_ij, if j is in {i, i +- 1, i +- sqrt(n), i +- sqrt(n) +- 1}. The respective performance comparison between TifaMMy and MKL (on both the Xeon and the Barcelona workstation) is given in figure 6. For the single-thread test, TifaMMy gave a performance
Fig. 5. Performance (single-core, on the Xeon workstation) of a band matrix multiplication (band matrix x dense matrix): MKL (using the CSR format) vs. TifaMMy (Peano-block, CSR only) vs. TifaMMy (Peano order with dense and CSR blocks, threshold 20%).

Fig. 6. Single-core vs. eight-core performance of the sparse-dense matrix multiplication in the quantum control problem. We also compare the Xeon with the Barcelona workstations.
gain of approximately 25 % compared to MKL. On eight cores, TifaMMy was also able to outrun MKL by ... %. Especially in this example of a sparse-dense matrix multiplication, we notice the meagre speedup when using eight cores instead of one, which illustrates the much stronger memory-boundedness of the sparse-dense matrix multiplication. Note that the Barcelona workstation with its NUMA architecture showed by far the better scalability in that test.

4.2 Performance of Sparse-Dense Multiplication in a Quantum Control Problem

As an example of a particular application of sparse-dense matrix multiplication, we will, in this section, present first performance tests of using TifaMMy within the simulation of a quantum control problem. There, the subsequent evolution of quantum states is modelled by a sequence of evolution matrices U^(r)(t_k):

    U^(r)(t_k) = e^(-i dt H^(r)_k) e^(-i dt H^(r)_{k-1}) ... e^(-i dt H^(r)_1).   (3)

To obtain the factors e^(-i dt H^(r)_k), the exponential function of the sparse Hamiltonian matrices H^(r)_k needs to be computed. The sparsity structure of the matrices is illustrated in figure 7. The matrices have dimension 2^q (where q is the number of modelled quantum bits), and have q + 1 non-zeros per matrix row (one non-zero results from the anti-diagonal elements indicated in grey). The exponential function is approximated by a Chebyshev polynomial, whose evaluation requires sparse matrix multiplication. As the accumulated result finally becomes a dense matrix, we consider only sparse-dense matrix multiplication at the moment. Figure 8 shows the performance results on the Xeon and the Barcelona workstations. We compare TifaMMy with MKL when using 1, 2, 4, or 8 threads on these eight-core platforms. Obviously, TifaMMy can especially profit from the NUMA architecture on the AMD workstation. There, TifaMMy gives a substantial performance advantage (more than a factor of 2 on eight threads) over MKL.
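The role of sparse-dense multiplication in the Chebyshev evaluation can be illustrated with a scalar sketch of the Clenshaw recurrence; the cosine-sampling scheme for the coefficients is our own illustration, not necessarily the paper's. In the actual application, x is replaced by the (rescaled) sparse Hamiltonian, the recurrence values b become dense matrices, and each step of the recurrence therefore performs one sparse-dense matrix multiplication:

```python
import math

def cheb_coeffs(f, K, N=64):
    """Chebyshev coefficients c_0..c_K of f on [-1, 1], computed by
    sampling at cosine points (a simple numerical scheme assumed here)."""
    thetas = [math.pi * (j + 0.5) / N for j in range(N)]
    return [2.0 / N * sum(f(math.cos(t)) * math.cos(k * t) for t in thetas)
            for k in range(K + 1)]

def cheb_eval(coeffs, x):
    """Clenshaw recurrence for c_0/2 + sum_{k>=1} c_k T_k(x).
    With a matrix argument, the product x * b1 in each step becomes a
    sparse-dense matrix multiplication (sparse Hamiltonian, dense b1)."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = c + 2.0 * x * b1 - b2, b1
    return coeffs[0] / 2.0 + x * b1 - b2

coeffs = cheb_coeffs(math.exp, 12)
approx = cheb_eval(coeffs, 0.3)   # close to math.exp(0.3)
```

Since the iterates start sparse and fill in only gradually, the accumulated result is what eventually turns the problem into a purely sparse-dense multiplication workload.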
In contrast, on the Xeon platform this performance advantage diminishes for larger matrix sizes when using 4 or 8 cores. In that case, both MKL and TifaMMy seem to be bound by the available memory bandwidth.

5 Conclusions, Current and Future Work

We have demonstrated that introducing a Peano block layout to allow the storage of sparse matrices in TifaMMy can lead to a solid performance gain in sparse-dense matrix multiplication compared to using the standard CSR storage scheme. One of the main advantages of the Peano approach is that both sparse and dense matrix multiplication perform well, on both single- and multicore platforms. Especially on several cores, sparse-dense multiplication already proved to be a memory-bound problem. Hence, we can expect that the sparsity structure of the involved matrices will have a strong influence on performance.
Fig. 7. Recursive structure of the sparse Hamiltonian matrices H^(r)_k of size 2^q (cases q = 4 and q = 5).

Fig. 8. Performance comparison (TifaMMy vs. MKL) of the sparse-dense matrix multiplication within the quantum control problem: (a) Xeon, 2 quad-core (Clovertown); (b) AMD, 2 quad-core (Barcelona). Results are given for different matrix sizes, and for using 1, 2, 4, or 8 threads on a Xeon and a Barcelona workstation.
Our next steps will therefore be to identify and run tests on further applications of sparse-dense matrix multiplication. Concerning our quantum control example, we will also test whether including sparse-sparse multiplication can further improve the solution times. During the accumulation of the Chebyshev polynomial, the accumulated result matrix is initially sparse, and typically takes a few multiplications to become dense. As, in sparse-sparse multiplication, computing (or constantly updating) the sparsity structure of the result matrix can be rather time-consuming, we will, in a first step, use a sparse pattern of dense L1 blocks to store the accumulated results. We also intend to test sparse-sparse matrix multiplication in scenarios where the pattern of the result matrix is known a priori. One such example would be the blockwise computation of an incomplete LU decomposition, where such a sparse-sparse matrix multiplication is required in a respective block-recursive formulation. Another example could be the Galerkin-type computation of coarse-grid operators in geometric multigrid methods.

References

1. Bader, M., Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3).
2. Bader, M., Franz, R., Guenther, S., Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves. LNCS 4967.
3. Heinecke, A., Bader, M.: Parallel Matrix Multiplication based on Space-filling Curves on Shared Memory Multicore Platforms. Proc. Computing Frontiers Conf. and co-located workshops: MAW '08 & WREFT '08, Ischia.
4. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1).
5. Goto, K., van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34 (3).
6. Gustavson, F.G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6).
7. Herrero, J.R., Navarro, J.J.: Adapting Linear Algebra Codes to the Memory Hierarchy Using a Hypermatrix Scheme. LNCS 3911, 2006.
8. Im, E.-J., Yelick, K.A., Vuduc, R.: SPARSITY: An Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications 18 (1), 2004.
9. Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T., Glaser, S.J.: Optimal Control of Coupled Spin Dynamics: Design of NMR Pulse Sequences by Gradient Ascent Algorithms. Journal of Magnetic Resonance 172, 2005.
10. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27 (1-2), 2001.
More informationMemory Hierarchy Optimizations and Performance Bounds for Sparse A T Ax
Memory Hierarchy Optimizations and Performance Bounds for Sparse A T Ax Richard Vuduc, Attila Gyulassy, James W. Demmel, and Katherine A. Yelick Computer Science Division, University of California, Berkeley
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationPlacement de processus (MPI) sur architecture multi-cœur NUMA
Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr
More informationGenerating Optimized Sparse Matrix Vector Product over Finite Fields
Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationA priori power estimation of linear solvers on multi-core processors
A priori power estimation of linear solvers on multi-core processors Dimitar Lukarski 1, Tobias Skoglund 2 Uppsala University Department of Information Technology Division of Scientific Computing 1 Division
More informationExploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems) Julie Langou Piotr Luszczek Alfredo Buttari Julien Langou
More informationCOMPUTATIONAL LINEAR ALGEBRA
COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim
More informationModern GPUs (Graphics Processing Units)
Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationParallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer
More informationAuto-tuning Multigrid with PetaBricks
Auto-tuning with PetaBricks Cy Chan Joint Work with: Jason Ansel Yee Lok Wong Saman Amarasinghe Alan Edelman Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationfspai-1.1 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.1 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 09 10 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationMemory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves
Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract
More informationPerformance comparison and optimization: Case studies using BenchIT
John von Neumann Institute for Computing Performance comparison and optimization: Case studies using BenchIT R. Schöne, G. Juckeland, W.E. Nagel, S. Pflüger, R. Wloch published in Parallel Computing: Current
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationMatrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs
Iterative Solvers Numerical Results Conclusion and outlook 1/18 Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Eike Hermann Müller, Robert Scheichl, Eero Vainikko
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationDevelopment of efficient computational kernels and linear algebra routines for out-of-order superscalar processors
Future Generation Computer Systems 21 (2005) 743 748 Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors O. Bessonov a,,d.fougère b, B. Roux
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationCluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance
Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationAutotuning (1/2): Cache-oblivious algorithms
Autotuning (1/2): Cache-oblivious algorithms Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.17] Tuesday, March 4, 2008 1 Today s sources CS 267 (Demmel
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationHigh-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen
High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationMVAPICH2 vs. OpenMPI for a Clustering Algorithm
MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract
ATLAS NOTE December 4, 2014 ATLAS offline reconstruction timing improvements for run-2 The ATLAS Collaboration Abstract ATL-SOFT-PUB-2014-004 04/12/2014 From 2013 to 2014 the LHC underwent an upgrade to
More informationA Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD
A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD KELEFOURAS, Vasileios , KRITIKAKOU, Angeliki and GOUTIS, Costas Available
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationA Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors
A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors Kai Zhang, ShuMing Chen*, Wei Liu, and Xi Ning School of Computer, National University of Defense Technology #109, Deya Road,
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationPortable, usable, and efficient sparse matrix vector multiplication
Portable, usable, and efficient sparse matrix vector multiplication Albert-Jan Yzelman Parallel Computing and Big Data Huawei Technologies France 8th of July, 2016 Introduction Given a sparse m n matrix
More information