
On Improving the Performance of Sparse Matrix-Vector Multiplication

James B. White, III, Ohio Supercomputer Center, Columbus, OH
P. Sadayappan, Ohio State University, Columbus, OH
(Supported in part by the National Science Foundation under grants DMR and CDA.)

Abstract

We analyze single-node performance of sparse matrix-vector multiplication by investigating issues of data locality and fine-grained parallelism. We examine the data-locality characteristics of the compressed-sparse-row representation and consider improvements in locality through matrix permutation. Motivated by potential improvements in fine-grained parallelism, we evaluate modified sparse-matrix representations. The results lead to general conclusions about improving single-node performance of sparse matrix-vector multiplication in parallel libraries of sparse iterative solvers.

1 Introduction

One of the core operations of iterative sparse solvers is sparse matrix-vector multiplication. To achieve high performance, a parallel implementation of sparse matrix-vector multiplication must maintain scalability. This scalability comes from a balanced mapping of the matrix and vectors among the distributed processors, a mapping that minimizes interprocessor communication. Load balancing and minimizing communication do not guarantee high performance, however; high single-node performance is also essential.

In this work, we extend the analysis of the mapping of matrices and vectors beyond the global multiple-processor distribution and consider the memory hierarchy and architecture of a single processor. The local representation of the matrix and vectors can affect performance in two significant ways: through its impact on data-cache locality and through its effect on fine-grained parallelism for pipelined, superscalar execution.

The de-facto-standard storage format for unstructured sparse matrices is the Compressed-Sparse-Row format (CSR). For a parallel implementation, the rows of a sparse matrix can be distributed among the processors, with the rows local to a processor stored in CSR. Matrix-vector multiplication is then performed using an "owner-computes" strategy. We therefore use CSR as the starting point for our analysis.

2 Compressed-Sparse-Row Format

The idea behind CSR is to pack each row, storing only the nonzero elements (Figure 1). Each row may have a different structure, so the column index of each nonzero element must also be stored. Since each row may have an arbitrary number of nonzero elements, a 2-D array is not an appropriate container for the compressed matrix. Instead, the nonzero elements and the column indices are stored in 1-D arrays, and a third array stores the index of the start of each "row" within these two arrays.

[Figure 1: Compressed-sparse-row format (CSR). Three arrays hold the start index of each row (i), the column index of each nonzero (j), and the nonzero matrix values (a).]
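As a concrete illustration of these three arrays, the following sketch (not from the paper; the 4x4 matrix is invented for illustration, and the array names follow the i, j, a naming used in Figure 2) prints a small matrix from its CSR representation.

    #include <stdio.h>

    /* A 4x4 example matrix (invented for illustration):
     *     [ 1 0 0 2 ]
     *     [ 0 3 0 0 ]
     *     [ 4 5 6 0 ]
     *     [ 0 0 0 7 ]
     * CSR stores only the nonzeros, row by row. */
    int main(void)
    {
        int    n   = 4;
        int    i[] = { 0, 2, 3, 6, 7 };       /* start of each row in a[] and j[]; i[n] = nnz */
        int    j[] = { 0, 3, 1, 0, 1, 2, 3 }; /* column index of each nonzero */
        double a[] = { 1, 2, 3, 4, 5, 6, 7 }; /* the nonzero values themselves */

        for (int k = 0; k < n; k++) {         /* walk the packed rows */
            printf("row %d:", k);
            for (int l = i[k]; l < i[k+1]; l++)
                printf("  %g (col %d)", a[l], j[l]);
            printf("\n");
        }
        return 0;
    }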

With a sparse matrix in CSR, matrix-vector multiplication can be implemented with a simple nested pair of loops; a C implementation is shown in Figure 2.

    for (k = 0; k < n; k++) {
        temp = 0.0;
        for (l = i[k]; l < i[k+1]; l++)
            temp += a[l] * x[j[l]];
        y[k] = temp;
    }

Figure 2: Sample C code for CSR sparse matrix-vector multiplication.

The outer loop is over the rows of the matrix, and the inner loop is over the elements of a particular row, as defined by the row-index array, i[0..n]. The target vector is y[0..n-1], and the source vector is x[0..n-1]. The nonzero elements of the matrix are stored in the value array, a[0..i[n]-1], and the column indices of the nonzero elements are stored in the column-index array, j[0..i[n]-1].

3 Data Locality

To analyze the single-node performance characteristics of CSR, we first consider data-locality characteristics. Start with the value array, a[l], and the column-index array, j[l]. For a given matrix-vector multiplication, each element of these arrays is used only once, so only spatial locality is possible. Since all the elements of each array are used consecutively, this spatial locality is achieved.

Next consider the target vector, y[k]. Here, each element is used a number of times, so both spatial locality and temporal locality are possible. Again, all the elements are used consecutively, so spatial locality is achieved. In addition, all of the multiple accesses of a particular element are performed consecutively, so temporal locality is also achieved, potentially in the form of CPU-register locality.

Problems with locality finally appear in the source vector, x[j[l]]. Since the vector is accessed through an index array, the access pattern is noncontiguous. Though there is potential for spatial and temporal locality, whether this locality is achieved depends on the particular structure of the sparse matrix. A sparse matrix with a relatively random structure, for example, would yield relatively poor memory-access performance.

For a particular sparse matrix, it may be possible to improve performance by modifying the structure of the matrix. For parallel sparse matrix-vector multiplication on distributed-memory multicomputers, the amount of communication is significantly affected by the way the rows of the matrix are partitioned among the processors. It may be possible to use the same partitioning strategy to more finely partition a sparse matrix on a single node to enhance data locality: the matrix can be renumbered to make elements within a cache-sized partition contiguous (a sketch of such a renumbering appears below). With a partitioning scheme that minimizes references between partitions, most references would be to elements nearby in memory, resulting in enhanced locality.

We tested the effectiveness of reordering via partitioning on a number of sample sparse matrices using METIS [3], a popular partitioning program from the University of Minnesota. The tests were performed on a single node of the Convex Exemplar SPP-1200 at the Ohio Supercomputer Center (Table 1). The SPP-1200 includes hardware that collects cache-usage statistics, and these statistics can be analyzed with the Convex Performance Analyzer. Cache hit rates and execution times were recorded for each matrix before and after partitioning by METIS.

Table 1: System specifications.

    System             Convex Exemplar SPP-1200
    Processor          120 MHz HP PA-RISC 7200
    Data cache         256 KB
    Operating system   SPP-UX 4.2
    C compiler         Convex C 6.5
    Compiler options   -O2

The performance results for matrix-vector multiplication for a grid-based test matrix are given in Table 2. The test matrix is much larger than the SPP-1200's 256 KB data cache. The results show that reordering by METIS did not improve the cache hit rate or the execution time over a wide range of partition sizes.

Table 2: The effects of METIS partitions on performance for a 2-D-grid, 5-point-stencil matrix on a single node of the SPP-1200, giving performance (MFLOPS) and cache hit rate (%) for the original and the randomly reordered matrix over a range of partition counts.
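The renumbering described above is a symmetric permutation of the rows and columns of the matrix. The following is a minimal sketch, not the authors' code, of how such a permutation might be applied to a CSR matrix; it assumes a caller-supplied array perm[] (for instance, obtained by listing the rows partition by partition) with perm[k] giving the original index of the row that becomes row k, and output arrays preallocated by the caller.

    #include <stdlib.h>

    /* Symmetrically permute an n x n CSR matrix (i, j, a) into (pi, pj, pa).
     * perm[k] = original row/column index that becomes index k.
     * Column indices within a permuted row are not re-sorted; the CSR
     * matrix-vector product does not require sorted columns. */
    void permute_csr(int n, const int *i, const int *j, const double *a,
                     const int *perm, int *pi, int *pj, double *pa)
    {
        int *inv = malloc(n * sizeof(int));   /* inv[old] = new index of 'old' */
        for (int k = 0; k < n; k++)
            inv[perm[k]] = k;

        pi[0] = 0;
        for (int k = 0; k < n; k++) {
            int old = perm[k];                /* original row that becomes row k */
            int len = i[old+1] - i[old];
            for (int l = 0; l < len; l++) {
                pa[pi[k] + l] = a[i[old] + l];
                pj[pi[k] + l] = inv[j[i[old] + l]];  /* renumber column indices */
            }
            pi[k+1] = pi[k] + len;
        }
        free(inv);
    }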
We also tested large matrices from the Harwell-Boeing collection [1]; the results are summarized in Table 3. The original matrices gave cache hit rates between 90% and 98%, and for no matrix did reordering by METIS improve the hit rate or the execution time.

Table 3: Large Harwell-Boeing matrices tested on a single node of the SPP-1200: performance and cache hit rates before and after partitioning with METIS.

    Matrix          MFLOPS (before / after)    Cache hit rate (before / after)
    bcsstk32.psa    8.24 / 8.08                90.6% / 90.5%
    lns_3937.rua    7.87 / 7.72                96.6% / 96.4%
    psmigr_2.rua    8.64 / 8.64                96.6% / 96.4%
    saylr4.rua      8.73 / 8.73                97.9% / 97.5%

A possible explanation for why reordering by METIS did not improve performance is that the tested matrices already had a significant level of structure. To test this explanation, we randomly renumbered the grid of the original test matrix and then used METIS to partition and reorder this randomly renumbered matrix. The results are given in Table 2. They confirm the explanation: given a matrix with a random structure, a METIS reordering was indeed able to improve performance.

The improvement, however, did not bring performance back to the level of the original matrix. We conclude from this that the sparse matrices arising from real-world problems usually have a significant level of structure. Toledo [4] reports similar results for partitioning schemes on an IBM RS/6000.

4 Fine-Grained Parallelism

In the previous section, we analyzed the data-locality characteristics of CSR matrix-vector multiplication. Here, we discuss the other primary issue in single-node performance, fine-grained parallelism. For modern superscalar, pipelined processors to operate efficiently, they must receive an adequate mix of instructions that can be executed concurrently. For looping algorithms such as sparse matrix-vector multiplication, this requirement means there must be an adequate number of arithmetic operations between the branch statements that perform the loops. The original CSR code falls short in this respect; only a small number of operations are performed at each iteration.

The well-established solution to this problem is loop unrolling: each iteration is modified to include operations on multiple elements instead of just one, and the number of iterations is consequently decreased. However, CSR can make loop unrolling relatively difficult and inefficient. If the number of nonzero elements per row is small, the inner loop of a CSR code runs over only a few iterations, which may marginalize the benefits of inner-loop unrolling. In addition, CSR matrices can have rows of variable length. This variability makes outer-loop unrolling almost impossible, and it forces the overhead of inner-loop unrolling to be incurred for every row, marginalizing the benefits of unrolling even further.

Improving fine-grained parallelism for CSR then amounts to improving the effectiveness of loop unrolling. A first step is improving the structure of the matrix by rearranging the rows according to length (Figure 3; a sketch of this sorting step appears below). The resulting matrix has a block structure, where each block has equal-length rows. Note that the reordering may have detrimental side effects: if the original matrix has a structure giving good data locality, the reordering may move neighboring rows far apart, eliminating some of this locality. Improvements in fine-grained parallelism must overcome any such loss of locality in order for performance to improve.

[Figure 3: Sorting rows by length. The 1-D value array is represented pictorially by a matrix with jagged rows.]
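A minimal sketch of this sorting step (not the authors' code): a counting sort over row lengths produces a permutation that makes equal-length rows contiguous, which could then be applied with a routine like the permute_csr sketch above.

    #include <stdlib.h>

    /* Compute perm[] such that applying it orders the rows of a CSR matrix by
     * increasing length, so that equal-length rows form contiguous blocks.
     * perm[k] = original index of the row placed at position k.
     * The counting sort is stable, preserving original order within a block. */
    void sort_rows_by_length(int n, const int *i, int *perm)
    {
        int maxlen = 0;
        for (int k = 0; k < n; k++)
            if (i[k+1] - i[k] > maxlen)
                maxlen = i[k+1] - i[k];

        int *count = calloc(maxlen + 2, sizeof(int));
        for (int k = 0; k < n; k++)
            count[i[k+1] - i[k] + 1]++;       /* histogram of row lengths */
        for (int len = 1; len <= maxlen + 1; len++)
            count[len] += count[len - 1];     /* prefix sums: block start offsets */
        for (int k = 0; k < n; k++)
            perm[count[i[k+1] - i[k]]++] = k; /* place each row in its block */
        free(count);
    }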
The next step is to store the reordered matrix in a form that facilitates loop unrolling. We analyze three options for data representations that may enhance the effectiveness of unrolling.

The first option is column-major CSR (CM-CSR), where the rows are compressed but the resulting jagged-row matrix is stored in column-major form (Figure 4). The array that stores the beginning of each row in original CSR instead stores the beginning of each column in CM-CSR.

[Figure 4: CM-CSR stores the row-sorted matrix in column-major form.]
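A minimal sketch of the resulting matrix-vector product (the field names ncols and cs are assumptions for illustration, not the paper's notation): with rows sorted by decreasing length, the rows that have a c-th nonzero are exactly the first cs[c+1]-cs[c] rows, so each jagged column is one long contiguous sweep.

    /* Matrix-vector product for a CM-CSR (jagged-diagonal-style) matrix whose
     * rows are sorted by decreasing length and stored column-major.
     * Assumed layout (illustrative names):
     *   ncols         length of the longest row (number of jagged columns)
     *   cs[0..ncols]  start of each jagged column in a[] and j[]
     *   a[], j[]      values and column indices, stored column-major
     * y[] is indexed in the row-sorted order and must be zeroed by the caller;
     * the result is permuted back to the original ordering afterwards. */
    void cmcsr_matvec(int ncols, const int *cs, const int *j, const double *a,
                      const double *x, double *y)
    {
        for (int c = 0; c < ncols; c++) {
            int nrows = cs[c+1] - cs[c];      /* rows that have a c-th nonzero */
            for (int r = 0; r < nrows; r++)   /* long inner loop, easy to unroll */
                y[r] += a[cs[c] + r] * x[j[cs[c] + r]];
        }
    }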

CM-CSR may make inner-loop unrolling more effective because the inner loop runs over much longer subsections of the value array. This form is equivalent to the Jagged Diagonal storage format [5], which was designed for vector machines, such as the Cray Y-MP, where short vector operations are relatively inefficient. For cache-based architectures, however, the advantage in unrolling may not be able to overcome the disadvantage in data locality. In column-major form, the temporal locality of the target vector may be lost; each element is reused only after all other elements have been accessed.

This locality problem is a motivation for the second modified form, blocked column-major CSR (BCM-CSR). After reordering, the matrix is stored in blocks of equal row length, where each block is stored in column-major form (Figure 5). The columns of a block may be short enough that some temporal locality of the target vector can be exploited. Conversely, the columns are likely to be longer than the rows, making inner-loop unrolling more efficient. The blocking also gives other bonuses. Since all the columns of a block are the same length, outer-loop unrolling is practical. For the same reason, inner-loop unrolling is more efficient, because the overhead of the unrolling can be moved to an outer loop.

The final option for modifying CSR simply takes the blocking of BCM-CSR and applies it to row-major CSR, giving BRM-CSR (Figure 5). BRM-CSR maintains all the locality of the original CSR except what is lost in reordering, and it allows the full range of inner-loop and outer-loop unrolling, including moving unrolling overhead to outer loops.

[Figure 5: Blocked matrix formats: BCM-CSR and BRM-CSR.]

5 Performance Analysis of Alternative Implementations

The various modifications of CSR have different strengths and weaknesses, and the relative performance of a particular form can vary with problem domain and machine type. We therefore tested each representation using matrices of various sizes from different application domains, and we tested them on a number of machines.

We started with a number of sample sparse matrices from the Matrix Market [1] and the University of Florida collection [2]. The matrices used are listed in Table 4; they were chosen to span a wide range of average row length and total size.

Table 4: Sparse matrices tested: bcsstk20, bcsstk32, blckhole, crystk02, ct20stif, e05r0000, memplus, msc10848, onetone1, and watson5.

For each of these matrices, we timed multiplication using multiple implementations of each representation, with varying levels of inner-loop and outer-loop unrolling (Table 5). In each case, a number of successive matrix-vector multiplications were performed using the same matrix, so that the measured performance would be representative of use in a context such as the iterative solution of sparse linear systems, where repeated matrix-vector products are performed with the same matrix.
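As an illustration of why the blocking helps, here is a minimal sketch (assumed argument names, not the authors' implementation) of the product for one BRM-CSR block of equal-length rows, with the loop across rows unrolled by four and the loop along each row left rolled, corresponding to 4x1 unrolling in the notation of Table 5.

    /* Matrix-vector product for one BRM-CSR block: 'nrows' consecutive rows of
     * equal length 'len', stored row-major starting at offset 'off' in a[]/j[].
     * row0 is the index of the block's first row in the sorted ordering. */
    void brm_block_matvec_4x1(int row0, int nrows, int len, int off,
                              const int *j, const double *a,
                              const double *x, double *y)
    {
        int r = 0;
        for (; r + 4 <= nrows; r += 4) {      /* four rows per iteration */
            double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
            const double *a0 = a + off + (r    ) * len, *a1 = a0 + len;
            const double *a2 = a1 + len,                *a3 = a2 + len;
            const int    *j0 = j + off + (r    ) * len, *j1 = j0 + len;
            const int    *j2 = j1 + len,                *j3 = j2 + len;
            for (int l = 0; l < len; l++) {   /* four independent multiply-adds */
                t0 += a0[l] * x[j0[l]];
                t1 += a1[l] * x[j1[l]];
                t2 += a2[l] * x[j2[l]];
                t3 += a3[l] * x[j3[l]];
            }
            y[row0 + r]     = t0;
            y[row0 + r + 1] = t1;
            y[row0 + r + 2] = t2;
            y[row0 + r + 3] = t3;
        }
        for (; r < nrows; r++) {              /* cleanup for leftover rows */
            double t = 0.0;
            for (int l = 0; l < len; l++)
                t += a[off + r * len + l] * x[j[off + r * len + l]];
            y[row0 + r] = t;
        }
    }

Because every row in a block has the same length, the unrolling cleanup is needed only once per block rather than once per row, and the four accumulators give the pipelined functional units independent operations to overlap.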

Table 5: Levels of unrolling tested for each form of CSR (original, CM, BRM, and BCM), displayed as "column-loop unrolling" x "row-loop unrolling": 1x1, 1x2, 1x4, 1x8, 4x1, 8x1, and 2x8. For original and BRM, the row loop is the inner loop, while the column loop is the inner loop for CM and BCM.

We measured double-precision (double) performance on single nodes of four parallel machines at the Ohio Supercomputer Center: a Convex Exemplar SPP-1200, an IBM SP2, an SGI Power Challenge, and an SGI Onyx. In addition, we measured performance on a single-processor DEC 3000/300 AXP workstation. Because of space constraints, we only discuss results for the SPP-1200 in this paper.

Figure 6 has performance plots for execution on the SPP-1200. Along with the results for original CSR, only the best-performing alternative representations are shown: 1x8 BRM-CSR, 4x1 BRM-CSR, and 1x4 BCM-CSR. Four performance regimes emerge from these figures: small systems with short rows, small systems with long rows, large systems with short rows, and large systems with long rows.

[Figure 6: Performance of sample sparse matrix-vector multiplication for a single node of the SPP-1200. The two charts divide the sample systems into those that fit into the data cache (< 256 KB: bcsstk20, blckhole, e05r0000, watson5) and those that do not (bcsstk32, crystk02, ct20stif, memplus, msc10848, onetone1).]

To evaluate each of these regimes more carefully and systematically, we used synthetic matrices based on nearest-neighbor grid structures. Their size and average row length were varied, as shown in Figure 7; the stencil sizes roughly translate into average row sizes. Again, only the results of the best-performing implementations are shown.

The figures illustrate differences among the four regimes defined above. For systems that fit into cache, the alternative representations result in higher absolute performance and greater improvements in performance over CSR. The higher absolute performance comes as no surprise, since cache access is much less expensive than main-memory access. For systems that do not fit in cache, the benefits of improved fine-grained parallelism do not yield as much performance improvement as with the systems that fit within cache; this is due to functional-unit pipeline stalls for main-memory accesses in the case of the large systems.

The figures also illustrate more subtle differences between long-row and short-row systems. The opportunities for fine-grained parallelism differ for long and short rows, so different implementations have better characteristics in the two regimes. The results from the other computer systems we tested reveal that the four-regime picture of performance is generally applicable. The split between large and small clearly depends upon cache size, while the split between long and short is consistently around 10 elements.
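A library could select among these implementations from the same two quantities. The sketch below is entirely illustrative: the function name, the thresholds, and the working-set estimate are assumptions, not part of the paper; it simply classifies a CSR matrix into one of the four regimes discussed above.

    #include <stddef.h>

    enum regime { SMALL_SHORT, SMALL_LONG, LARGE_SHORT, LARGE_LONG };

    /* Classify an n x n CSR matrix with nnz nonzeros into one of the four
     * performance regimes.  cache_bytes is the data-cache size of the target
     * machine (256 KB in the measurements above) and row_threshold the
     * long/short split (around 10 nonzeros per row).  The working-set estimate
     * counts the matrix arrays plus both vectors. */
    enum regime classify_matrix(int n, int nnz, size_t cache_bytes,
                                double row_threshold)
    {
        size_t working_set = (size_t)nnz * (sizeof(double) + sizeof(int)) /* a[], j[] */
                           + (size_t)(n + 1) * sizeof(int)                /* i[]      */
                           + 2 * (size_t)n * sizeof(double);              /* x[], y[] */
        int is_small = working_set < cache_bytes;
        int is_short = (double)nnz / n < row_threshold;
        if (is_small)
            return is_short ? SMALL_SHORT : SMALL_LONG;
        return is_short ? LARGE_SHORT : LARGE_LONG;
    }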

For the SPP-1200, the results show that a single implementation is most effective across each regime. For small-short, 1x8 BRM-CSR consistently performs best; 1x4 BCM-CSR is best for large-short; and 4x1 BRM-CSR is best for small-long and large-long. The results for the other machines show that this holds across architectures: each regime tends to have a single best implementation. Which implementation proves best for a particular regime depends on the architecture, however; the best choices for the Convex SPP-1200 differ from those for the SGI Power Challenge or the IBM SP2.

[Figure 7: Performance of grid-based sparse matrix-vector multiplication for a single node of the SPP-1200, plotted against the number of nonzero elements. The two graphs ("Performance on 2-D Grid, 5-pt. Stencil" and "Performance on 3-D Grid, 27-pt. Stencil") illustrate the differences between short and long rows.]

6 Conclusions

Single-node performance of sparse matrix-vector multiplication depends on two primary factors: data locality and fine-grained parallelism. Typical sparse matrices, such as those from grid problems or from the Harwell-Boeing collection, already have structures that support data locality. By improving fine-grained parallelism, however, a change in the local representation of a matrix can lead to significant performance improvements. The optimal choice of representation and implementation depends on three general characteristics: the size of the matrix relative to the machine's data cache, the average number of nonzero elements per row of the matrix, and the particular architecture of the machine.

Since the alternative representations we describe involve only transformations of local data, they can be used in parallel libraries with little modification. The choice of the most efficient representation is dictated by the easily determined characteristics of the input matrix described above, and no change in the existing interfaces of libraries is required. Our work can therefore be seen as complementary to other efforts in parallel-iterative-library development. The integration of these performance-enhancing techniques within existing parallel libraries appears to be straightforward, and the performance improvements promise to be significant.

References

[1] R. Boisvert, R. Pozo, K. Remington, B. Miller, and R. Lipman. Matrix Market.
[2] Tim Davis. Sparse Matrix Project: Directory Index, ~davis/sparse/hb-menu.html.
[3] George Karypis and Vipin Kumar. METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. University of Minnesota, Minneapolis, MN.
[4] Sivan Toledo. "Improving Instruction-Level Parallelism in Sparse Matrix-Vector Multiplication Using Reordering, Blocking, and Prefetching." Proc. 8th SIAM Conference on Parallel Processing for Scientific Computing.
[5] Yousef Saad. "Krylov Subspace Methods on Supercomputers." SIAM Journal on Scientific and Statistical Computing, 10 (1989).
