
Tools and Libraries for Parallel Sparse Matrix Computations

Edmond Chow and Yousef Saad
Department of Computer Science, and Minnesota Supercomputer Institute
University of Minnesota, Minneapolis, MN

June 1994

Abstract

This paper describes two portable packages for general-purpose sparse matrix computations: SPARSKIT and P_SPARSLIB. Their emphasis is on iterative techniques, with the latter also emphasizing parallel computation. The packages are collections of tools which may be used either as a library, or as templates for the development of specialized codes. The majority of this paper describes the key components of the parallel iterative solution of linear systems with P_SPARSLIB.

Key words: sparse matrix computations, parallel computing, matrix-vector product, partitioning, iterative methods, preconditioning, tools and libraries.

1 Introduction

The complexity of parallel software makes development tools and software reuse particularly necessary. Tools and libraries for sparse matrix computations are scarce compared to packages such as LAPACK that are available for dense matrix computations. Two reasons for this are the high complexity of sparse matrix routines, and the need for different solution techniques and data structures to obtain good performance on various architectures. It is arguable that current numerical problems, the problems that are worth solving, are so difficult that high-performance, customized solution procedures are required, and that portable, high-level library routines are therefore not useful. At the same time, it is clear that libraries are globally economical in the sense that the gains obtained from their overall use, however partial, can repay their development cost several times over. We wish to point out that a successful compromise may be the use of library codes as templates or as a source of algorithms for developing machine-specific codes. In the current environment of quickly changing hardware and increasingly difficult problems, this approach to software reuse may not only be viable, but also unavoidable.

Work supported by the NSF under grant NSF/CCR, and by ARPA under grant NIST 60NANB2D1272.

We also mention here that what many researchers need is not necessarily high-performance routines, but a useful collection of routines on their platform for experimenting with algorithms. The sparse matrix support in MATLAB has been invaluable to this end, but unfortunately it is neither flexible nor efficient enough for larger, more realistic problems.

This paper describes SPARSKIT and P_SPARSLIB, two FORTRAN 77 packages for sparse matrix computations. SPARSKIT is not designed to run on a parallel machine, but contains essential tools for developing research or specialized application codes, and its routines are often used as templates as described above. SPARSKIT contains conversion routines between 16 different storage formats, and has more than 200 routines for operating on sparse matrices, covering tasks such as matrix addition, reordering, iterative solution, matrix generation, and plotting. The tools work closely with matrices stored externally in the Harwell-Boeing format. Version 2 of SPARSKIT was recently released.

P_SPARSLIB is a library for parallel sparse matrix computations. For generality, parallelism is extracted using a domain decomposition approach applied to the matrix rather than to the physical problem. The code is flexible enough to handle, for example, overlapping domains. P_SPARSLIB uses message passing and runs portably on top of PVM. This layered solution for portability takes advantage of future improvements in the underlying communication library or hardware. P_SPARSLIB provides useful kernels and tools such as parallel sparse matrix-vector multiplication, parallel preconditioning, iterative solution of linear systems, partitioning, multicoloring, and reordering.

2 SPARSKIT

Because of the complexity of sparse matrix routines, a common set of tools shared among researchers should dramatically reduce the time needed to implement sparse matrix research codes. SPARSKIT is a package developed for this purpose, providing routines for extracting submatrices, matrix addition and multiplication, and so on. The package also alleviates the problem of the wide variety of sparse matrix storage formats by providing conversion routines between them, and it facilitates the exchange of data through the Harwell-Boeing format and through matrix generators. In the following, we briefly describe each module of SPARSKIT. See [7] for a complete description.

FORMATS This module contains two sets of routines. The first set is composed of routines which convert the storage format of a matrix to and from the basic Compressed Sparse Row format. Thus one can translate between any of the supported formats with at most two transformations. The formats currently supported are the following.

DNS Dense format
BND Linpack banded format
CSR Compressed Sparse Row format
CSC Compressed Sparse Column format
COO Coordinate format
ELL Ellpack-Itpack generalized diagonal format
DIA Diagonal format
BSR Block Sparse Row format
MSR Modified Compressed Sparse Row format
SSK Symmetric Skyline format
NSK Nonsymmetric Skyline format
LNK Linked list storage format
JAD Jagged Diagonal format
SSS Symmetric Sparse Skyline format
USS Unsymmetric Sparse Skyline format
VBR Variable Block Row format

The second set contains a number of routines that perform simple manipulation functions on sparse matrices, such as extracting a particular diagonal, permuting a matrix, computing norms, or filtering out small elements. For reasons of space we cannot list these routines here.

BLASSM This module contains a number of routines for performing basic linear algebra with sparse matrices. It is also composed of two sets of routines. The first set consists of matrix-matrix operations (e.g., multiplication of matrices) and the second consists of matrix-vector operations. The first set allows one to perform the following operations with sparse matrices, where A, B, C are sparse matrices, D is a diagonal matrix, and σ is a scalar: C = AB, C = A + B, C = A + σB, C = A ± B^T, C = A + σB^T, A = A + σI, C = A + D. The second set contains various routines for performing matrix-vector products and solving sparse triangular linear systems in different storage formats.

INOUT This module consists of routines to read and write matrices in the Harwell-Boeing format. For more information on this format and the Harwell-Boeing collection, see [2]. This module also provides routines for printing the pattern of the matrix in PostScript, or simply dumping the nonzeros in a readable format.

INFO The purpose of this module is to provide as many statistics as possible on a matrix at little cost. For example, the code analyzes the diagonal dominance of the matrix (by rows and by columns), its degree of symmetry (structural as well as numerical), its block structure, its diagonal structure, etc. Functionality for estimating information about the spectrum of the matrix may be added later.
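To make the Compressed Sparse Row convention concrete, the short self-contained Fortran 77 program below stores a small matrix in the three CSR arrays a, ja, ia and multiplies it by a vector. It only illustrates the storage scheme and the typical row-by-row access pattern; it is not one of the SPARSKIT routines.

c     Illustration of the Compressed Sparse Row (CSR) format: the matrix
c         ( 1. 0. 2. 0. )
c         ( 0. 3. 0. 0. )
c         ( 4. 0. 5. 6. )
c         ( 0. 0. 0. 7. )
c     is held in a  (nonzero values, row by row), ja (column index of
c     each value), and ia (position in a/ja where each row starts).
      program csrdem
      integer n, i, k
      parameter (n = 4)
      integer ia(n+1), ja(7)
      double precision a(7), x(n), y(n)
      data ia /1, 3, 4, 7, 8/
      data ja /1, 3, 2, 1, 3, 4, 4/
      data a  /1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 6.d0, 7.d0/
      data x  /1.d0, 1.d0, 1.d0, 1.d0/
c     y = A*x, one row at a time (the same loop structure as a CSR
c     matrix-vector product routine)
      do 20 i = 1, n
         y(i) = 0.0d0
         do 10 k = ia(i), ia(i+1) - 1
            y(i) = y(i) + a(k)*x(ja(k))
 10      continue
 20   continue
      write (*,*) (y(i), i = 1, n)
      end

SPARSKIT's own conversion and matrix-vector routines follow the same layout, with ia(i) pointing to the start of row i and ia(n+1) closing the last row.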

MATGEN The set of routines in this module allows one to generate test matrices. There are generators for several different types of matrices: five-point and seven-point matrices on rectangular regions discretizing a general elliptic partial differential equation, block forms of these (several degrees of freedom per grid point in the PDE), finite element matrices for the convection-diffusion problem on various domains (including user-provided ones), Markov chain matrices arising from a random walk on a triangular grid, and some others.

ORDERINGS This module provides matrix reorderings based on level sets (including Cuthill-McKee implemented with breadth-first search), coloring (including a greedy algorithm for multicolor ordering), and strongly connected components. The latter two are useful for extracting parallelism from sparse matrices.

ITSOL This module currently contains four preconditioners and nine Krylov subspace iterative methods. The preconditioners are ILUT, a robust preconditioner which uses a dual threshold for dropping elements; ILUTP, a variant with column pivoting; ILU(0); and MILU(0). The iterative solvers include popular ones such as CG, CGNR, BiCG, BiCGSTAB, TFQMR, and GMRES, and are implemented with reverse communication to make them independent of the matrix storage format and the preconditioner. See Section 3.4 for more details.

UNSUPP As suggested by its name, this module contains various unsupported software tools that are not necessarily portable or do not fit in any of the previous modules. It currently contains routines for plotting matrices and routines related to matrix exponentials.

3 P_SPARSLIB

Many sparse matrices arise from the discretization of partial differential equations. In these applications, domain decomposition has been a successful general approach for extracting parallelism. In essence, the domain of interest is partitioned into a number of subdomains and some technique is used to recover the global solution. For generality, P_SPARSLIB begins with a matrix rather than a differential equation, and partitioning is performed on the adjacency graph of the matrix, which is the same as the discretization mesh if this was indeed the source of the matrix. However, this broader notion of graph partitioning does not use the geometric information that may be available in the problem, and which may be necessary when solving more difficult problems. General descriptions of the data structures and algorithms in P_SPARSLIB may be found in [4, 6].

P_SPARSLIB has the structure shown in Figure 3.1. We have implemented a number of routines in each of the modules shown. P_SPARSLIB uses message passing for flexibility; in the figure, B-COMS is a temporary name for the communication library provided by the manufacturer, possibly augmented with a few high-level communication primitives required for sparse problems.

[Figure 3.1: General block diagram of P_SPARSLIB. Basic kernels (B-COMS, D-BLAS) support the preprocessing tools (partitioning, coloring, setup), the matrix primitives (matrix-vector product, triangular solve), and the preconditioners (D-ILU, D-SOR), on top of which sit the iterative solvers.]

P_SPARSLIB, as well as SPARSKIT described above, is primarily geared towards iterative solvers because of the growing importance of these techniques and the limitations of direct solvers, both in terms of their potential for high parallel efficiency and in terms of their unmanageable requirements for realistic 3-dimensional problems. In the remainder of this paper, we describe the key components of the parallel iterative solution of linear systems: the matrix-vector product, the iterative solvers themselves, parallel preconditioning, and partitioning. Before we begin, we discuss the data structures used for distributed sparse matrices.

3.1 Distributed sparse matrices

Assume that we have a convenient partitioning of the graph. Without any loss of generality, we can think of the matrix under consideration as originating from the discretization of a partial differential equation on a certain domain, as illustrated in Figure 3.2. We need to set up a local data structure in each processor (or subdomain, or subgraph) which will allow us to perform basic operations such as global matrix-vector products and preconditioning operations. We will assume that the rows and the associated unknowns are mapped to the same processor, i.e., the matrix is distributed row-wise to the processors according to the distribution of the variables. Note that if there is an obvious blocking, which may come from several unknowns associated with the same grid point, then this should be exploited and all such unknowns should be mapped together. In other words, in our mapping algorithms we should deal with the reduced adjacency graph corresponding to a physical grid rather than with the original adjacency graph. Another assumption we make here is that the graph is undirected, i.e., the matrix has a symmetric pattern. This restriction is only made for simplicity and because we would like to use exchanges of information across boundaries (swaps) rather than one-way sends and receives.

[Figure 3.2: Decomposition of the domain (or adjacency graph) and classification of the nodal points into internal points, local interface points, and external interface points.]

The first part of the local data structure consists of a list of all other processors with which a given processor must exchange information when performing matrix-vector products. Although the processors on this list are not necessarily physical neighbors, they hold subdomains that are adjacent to the subdomain mapped to the given processor. The information needed to find these neighboring processors is a global node-to-processor mapping, described by an array map, where map(j) is the processor to which node j is mapped. For simplicity we will assume in this description that there is no overlap, i.e., any node j belongs to only one processor, namely processor map(j). The local rows are inspected one by one, and for each nonzero a_ij with map(j) ≠ myproc, where myproc is the label of the current processor, we add map(j) to the list of neighboring processors if it is not already listed. We store the labels of the neighboring processors in an array proc(1:nproc), where nproc is the number of neighboring processors. In this initial phase, each processor myproc will also determine, for each of its neighboring processors, the list of its nodes that are coupled with nodes of that processor. We refer to these nodes as local interface nodes. When performing a matrix-vector product, neighboring processors must exchange the values of their adjacent interface nodes. In order to perform this data exchange operation efficiently, it is important to group these nodes processor by processor. Thus, we list first all those nodes that must be sent to proc(1), followed by those to be sent to proc(2), etc. Two arrays are used for this purpose: one called ix, which lists the nodes as indicated above, and a pointer array ipr, where ipr(i) points to the beginning of the list for proc(i).

Once the boundary exchange information is determined, we need to set up the distributed matrices in each processor, using a suitable data structure. In order to perform a matrix-vector product with a matrix that is distributed in the manner described earlier, we need to multiply the matrix consisting of the rows that are local to a given processor by some global vector x. Some components of this vector will be local, and some components must be moved to the current processor for the operation to complete. Let x_loc be the local components of the vector x for a given processor and let x_ext be the external components that are required. The vector x_loc itself is composed of strictly internal nodes x_int and boundary, or local interface, nodes x_bnd. Let A_0 be the local matrix, i.e., the rectangular matrix consisting of all the rows that are mapped to myproc. We will call A_loc the 'diagonal block' of A located in A_0, i.e., the submatrix of A_0 whose nonzero elements a_ij are such that j is a local variable. Similarly, we will call A_ext the 'off-diagonal block', i.e., the submatrix of A_0 whose nonzero elements a_ij are such that j is not a local variable.

[Figure 3.3: The distributed sparse matrix A. Each processor holds a block of rows consisting of a local part A_loc acting on the local nodes and an external part A_ext acting on the external nodes.]
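As an illustration of the setup phase just described, the following Fortran 77 sketch builds the list of neighboring processors from the map array and the local rows, assumed stored in CSR form with global column indices. The routine and variable names are illustrative only and do not reproduce the actual P_SPARSLIB interface; the interface-node lists ix and ipr are constructed in the same spirit.

c     Sketch: build the list of neighboring processors for myproc.
c     Assumed inputs (illustrative names, not the P_SPARSLIB interface):
c       nrow      number of local rows
c       ia, ja    local rows in CSR form, with global column indices
c       map(j)    processor owning global node j
c       myproc    label of this processor
c     Output: proc(1:nproc), the labels of the neighboring processors.
      subroutine nbrlst(nrow, ia, ja, map, myproc, proc, nproc)
      integer nrow, ia(*), ja(*), map(*), myproc
      integer proc(*), nproc
      integer i, k, p, l
      logical found
      nproc = 0
      do 30 i = 1, nrow
         do 20 k = ia(i), ia(i+1) - 1
            p = map(ja(k))
            if (p .ne. myproc) then
c              add p to proc(1:nproc) if it is not already listed
               found = .false.
               do 10 l = 1, nproc
                  if (proc(l) .eq. p) found = .true.
 10            continue
               if (.not. found) then
                  nproc = nproc + 1
                  proc(nproc) = p
               endif
            endif
 20      continue
 30   continue
      return
      end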

3.2 Matrix-vector product

Let us consider the simple case illustrated in Figure 3.3, in which the rows are assigned to five processors in block order. To perform a matrix-vector product, we start by multiplying the diagonal block A_loc by the local variables. We then multiply A_ext by the external variables. Notice that since the external interface points are not coupled with local internal points, only the rows corresponding to the boundary nodes of A_ext will have nonzero elements. Thus, we can separate the matrix-vector product into two such operations, one involving only the local variables and the other involving the external variables. We need to construct these two matrices and define a local numbering of the local variables in order to perform the two matrix-vector products efficiently each time. For convenience, the local interface points are labeled last in each processor. This is illustrated in Figure 3.4.

[Figure 3.4: The local matrix data structure for each subdomain: A_loc acts on the internal points (x_int) followed by the local interface points (x_bnd), and A_ext is the external interface matrix.]

The algorithm for the matrix-vector product is then as follows:

Algorithm 3.1 Distributed sparse matrix-vector product
  Exchange interface data: scatter x_bnd to neighbors and gather x_ext from neighbors
  Local matrix-vector product: y = A_loc x_loc
  External matrix-vector product: y = y + A_ext x_ext
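The following Fortran 77 fragment sketches Algorithm 3.1 for one processor, assuming A_loc and A_ext are held locally in CSR form with local column numberings. The routine exchng, which scatters the boundary values x_bnd and gathers x_ext using the neighbor lists of Section 3.1, is a hypothetical wrapper around the message-passing layer; it is not an actual P_SPARSLIB routine.

c     Sketch of Algorithm 3.1 on one processor.  Aloc and Aext are in
c     CSR form; exchng (hypothetical) swaps interface data with the
c     neighboring processors through the communication layer (B-COMS).
      subroutine dmatvc(nloc, aloc, jaloc, ialoc,
     &                  aext, jaext, iaext, x, xext, y)
      integer nloc, jaloc(*), ialoc(*), jaext(*), iaext(*)
      double precision aloc(*), aext(*), x(*), xext(*), y(*)
      integer i, k
c     exchange interface data: send local boundary values, receive the
c     external interface values into xext
      call exchng(x, xext)
c     local part:  y = Aloc * xloc
      do 20 i = 1, nloc
         y(i) = 0.0d0
         do 10 k = ialoc(i), ialoc(i+1) - 1
            y(i) = y(i) + aloc(k)*x(jaloc(k))
 10      continue
 20   continue
c     external part:  y = y + Aext * xext (only the boundary rows of
c     Aext contain nonzeros)
      do 40 i = 1, nloc
         do 30 k = iaext(i), iaext(i+1) - 1
            y(i) = y(i) + aext(k)*xext(jaext(k))
 30      continue
 40   continue
      return
      end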

3.3 Graph partitioning

We now outline an approach for partitioning the vertices of the matrix graph. From the previous section it is clear that, to minimize communication costs, we need to minimize the number of neighbors of each subdomain and the number of interface points (or the number of edges connecting the subdomains). Load balancing, by making the subdomains of roughly equal size or by some other criterion, is also necessary. For vertex partitioning when geometric information is not available, we are developing parallel algorithms based on the simultaneous level-set expansion from a set of center points, one for each subdomain [3]; a serial sketch of this expansion is given below. This procedure is inherently parallel, but requires unbiased arbitration in case of conflicts. The challenging aspect of the method is determining a good set of center points, and for this we have developed heuristic algorithms using cost functions. Starting from an initial, perhaps random, set of centers, the cost of this set is computed, for example, as the sum of the inverses of the distances between the centers. The centers are then moved so that the cost decreases, and when the cost cannot be decreased further, the algorithm terminates. Many criteria, such as those described at the beginning of this section, may be built into the cost function. We do not consider the mapping of subdomains to processors, since this is obviously architecture dependent, and many computer manufacturers are striving to make the difference between best-case and worst-case communication as small as possible.
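The serial sketch below illustrates only the level-set expansion itself: all centers are placed in a single queue and the fronts are grown one level at a time, each node receiving the label of the first front that reaches it. The parallel algorithm of [3] additionally requires unbiased arbitration when fronts meet and is combined with the center-selection heuristic described above; the routine and variable names used here are illustrative.

c     Serial sketch of simultaneous level-set (breadth-first) expansion
c     from ndom center points.  The graph is given by its adjacency
c     structure (ia, ja); on exit dom(i) is the subdomain assigned to
c     node i (nodes not reached by any front keep dom = 0).  iq is an
c     integer work array of length n used as the queue.
      subroutine lvlexp(n, ia, ja, ndom, center, dom, iq)
      integer n, ia(*), ja(*), ndom, center(*), dom(*), iq(*)
      integer i, k, d, head, tail, node, nbr
c     initially no node is assigned
      do 10 i = 1, n
         dom(i) = 0
 10   continue
c     seed the queue with the centers, one per subdomain
      tail = 0
      do 20 d = 1, ndom
         dom(center(d)) = d
         tail = tail + 1
         iq(tail) = center(d)
 20   continue
c     expand all fronts simultaneously; ties are broken by queue order,
c     where the parallel algorithm needs an unbiased arbitration
      head = 0
 30   if (head .lt. tail) then
         head = head + 1
         node = iq(head)
         do 40 k = ia(node), ia(node+1) - 1
            nbr = ja(k)
            if (dom(nbr) .eq. 0) then
               dom(nbr) = dom(node)
               tail = tail + 1
               iq(tail) = nbr
            endif
 40      continue
         go to 30
      endif
      return
      end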

3.4 Iterative solvers

The iterative solvers module in P_SPARSLIB is the same as that in SPARSKIT. We note that a flexible variant of GMRES called FGMRES [5], which allows the preconditioner to change at each step, is especially useful for parallel computation, as will be seen in Section 3.5. The iterative methods will not be described here, except to say that the four basic operations required are the matrix-vector product, the preconditioning operation, SAXPY, and the dot product. The matrix-vector product was described in Section 3.2, and the preconditioning operation will be described in Section 3.5. SAXPY involves no communication if the vectors are partitioned the same way across the processors, and the dot product is a global reduction operation, often provided by the hardware or the underlying software.

The iterative solvers are implemented so that they are independent of:
1. the environment, whether it is parallel or serial;
2. the storage format of the matrix and the preconditioner;
3. the preconditioning operation.

This could be achieved, for example, by using callback functions for the four basic operations described above. However, this is not entirely flexible for the matrix-vector product and preconditioning operations without using global parameters, since the callback functions must have their calling sequences fixed beforehand. P_SPARSLIB instead uses a reverse-communication mechanism for these two operations to achieve the same effect. Here, the iterative solver exits back to the caller and indicates through an output variable which operation needs to be performed on its output vector. A typical code follows. See [8] for more details.

c     reverse-communication loop: fgmres returns with icode indicating
c     which operation the caller must perform before re-entering
      icode = 0
 1    continue
      call fgmres(n,im,rhs,sol,i,vv,w,wk1,wk2,eps,maxits,iout,icode)
      if (icode .eq. 1) then
         call precon(n,wk1,wk2)
         goto 1
      else if (icode .eq. 2) then
         call matvec(n,wk1,wk2)
         goto 1
      endif

3.5 Distributed preconditioners

Krylov subspace methods generally work very poorly if no preconditioning is used. Unfortunately, the traditional and most effective preconditioners are extremely difficult to parallelize. In this section, we describe multicolor SOR and a new technique based on approximate inverses.

We begin by describing multicoloring, a useful tool for extracting parallelism from sparse matrices. Multicoloring assigns each subdomain a 'color' such that no two adjacent subdomains have the same color. The standard heuristic method to multicolor a graph is to first select an order in which to color the nodes. Consider the natural order 1, 2, ..., n. If we want to execute this algorithm in a parallel environment, we observe that a given node never needs to examine the nodes whose labels are larger than its own. Although this seems to establish a sequential procedure, there is actually a substantial amount of parallelism because of the sparsity of the graph. Typically, the degree of parallelism is of the order of the diameter of the graph. Since the coloring process is essentially a preprocessing task, this amount of parallelism is sufficient for most practical situations, even for a large number of processors. A parallel implementation, with each processor holding one subdomain, is as follows:

Algorithm 3.2 Parallel multicoloring
  Start with one colored region
  From each neighbor proc(k) with proc(k) < myproc, receive color(proc(k))
  Compute color(myproc) = min{valid colors}
  To each neighbor proc(k) with proc(k) > myproc, send color(myproc)

Once the subdomains have been colored, one can use a form of multicolor SOR or SSOR relaxation as a preconditioner. In each processor, a one-step SOR iteration takes the form (a Fortran sketch follows the algorithm):

Algorithm 3.3 Multicolor SOR preconditioning
  For k = 1, 2, ..., ncolrs do
    Exchange interface values
    If (k = mycol) then
      x_loc := x_loc + ω A_loc^{-1} [b_loc − A_loc x_loc − A_ext x_ext]
    endif
  enddo
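A Fortran 77 sketch of this sweep for one processor is given below. The routines exchng (which swaps interface values with the neighboring processors) and locslv (which returns an exact or approximate solution of a system with A_loc, for instance by ILUT-preconditioned GMRES or by an exact factorization, as discussed next) are hypothetical placeholders rather than P_SPARSLIB routines.

c     Sketch of Algorithm 3.3 (one multicolor SOR sweep) on one
c     processor.  Aloc and Aext are stored in CSR form; r and z are
c     work vectors of local length.
      subroutine mcsor(nloc, x, xext, b, aloc, jaloc, ialoc,
     &                 aext, jaext, iaext, omega, mycol, ncolrs, r, z)
      integer nloc, jaloc(*), ialoc(*), jaext(*), iaext(*)
      integer mycol, ncolrs
      double precision x(*), xext(*), b(*), aloc(*), aext(*)
      double precision omega, r(*), z(*)
      integer i, k, kc
      do 50 kc = 1, ncolrs
c        every processor takes part in the exchange at every color
         call exchng(x, xext)
         if (kc .eq. mycol) then
c           local residual  r = b - Aloc*xloc - Aext*xext
            do 30 i = 1, nloc
               r(i) = b(i)
               do 10 k = ialoc(i), ialoc(i+1) - 1
                  r(i) = r(i) - aloc(k)*x(jaloc(k))
 10            continue
               do 20 k = iaext(i), iaext(i+1) - 1
                  r(i) = r(i) - aext(k)*xext(jaext(k))
 20            continue
 30         continue
c           z = (approximate) solution of  Aloc*z = r
            call locslv(nloc, r, z)
c           relaxation step  x := x + omega*z
            do 40 i = 1, nloc
               x(i) = x(i) + omega*z(i)
 40         continue
         endif
 50   continue
      return
      end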

In Algorithm 3.3, myproc is the number of the processor on which the code is being executed, and mycol is the color of myproc. An s-step SOR iteration would simply consist of adding an outer loop to the above algorithm. The solution of a system with the local matrix A_loc, indicated by the A_loc^{-1} operation in the algorithm, is typical of parallel preconditioners such as this one and, for example, block Jacobi. A preconditioned iterative method may be used for this local solve, such as ILUT-preconditioned GMRES, which is available in SPARSKIT. Since the preconditioning operation is then itself an iterative process, FGMRES is required for the outer iterations. Sparse direct solution methods are also useful in this case: since the local systems with A_loc are small and must be solved with several right-hand sides, it may be worthwhile to factor A_loc exactly.

Another class of preconditioners is explicit: these use matrix-vector multiplication as the preconditioning step, which can be parallelized as described in Section 3.2. We have investigated approaches for approximating the inverse of the system matrix A directly by minimizing the Frobenius norm of the residual matrix

    F(M) = || I − AM ||_F

where M is the approximate right inverse. In practice, this is achieved by solving approximately

    A m_j = e_j,   j = 1, ..., n,

where m_j is the j-th column of M and e_j is the j-th coordinate vector. The approximate solution of these linear systems is an iterative procedure itself and thus involves the operations already described; a sketch of one way to carry out these columnwise solves is given at the end of this section. An important difference is that we must keep M sparse, and the Krylov basis which is used to construct the solution is also kept sparse. If it is possible for each processor to access the entire matrix A, then an alternative parallel implementation is to compute the n individual columns of M simultaneously, with no communication required. We have found that this approximate inverse preconditioner is advantageous in many cases where the matrix A is nonsymmetric or indefinite [1].
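As an illustration of the columnwise construction, the sketch below computes one column of M by a few minimal-residual steps on A m_j = e_j, followed by a simple dropping of small entries. For brevity the column is kept as a dense work vector; the method studied in [1] keeps both the iterates and the Krylov basis sparse throughout, which is essential for efficiency. The names and the dropping rule are illustrative only.

c     Sketch: one column of an approximate inverse by nsteps minimal
c     residual iterations on  A*m_j = e_j, with A in CSR form.
c     rm returns the (dense) column; r and q are work vectors.
      subroutine aicol(n, a, ja, ia, j, nsteps, droptol, rm, r, q)
      integer n, ja(*), ia(*), j, nsteps
      double precision a(*), droptol, rm(*), r(*), q(*)
      integer i, k, it
      double precision alpha, rq, qq
c     start from m_j = 0, so the residual is r = e_j
      do 10 i = 1, n
         rm(i) = 0.0d0
         r(i) = 0.0d0
 10   continue
      r(j) = 1.0d0
      do 50 it = 1, nsteps
c        q = A*r  (CSR matrix-vector product)
         do 30 i = 1, n
            q(i) = 0.0d0
            do 20 k = ia(i), ia(i+1) - 1
               q(i) = q(i) + a(k)*r(ja(k))
 20         continue
 30      continue
c        step length alpha = (r,Ar)/(Ar,Ar) minimizes ||e_j - A*m_j||_2
         rq = 0.0d0
         qq = 0.0d0
         do 40 i = 1, n
            rq = rq + r(i)*q(i)
            qq = qq + q(i)*q(i)
 40      continue
         if (qq .eq. 0.0d0) go to 60
         alpha = rq/qq
         do 45 i = 1, n
            rm(i) = rm(i) + alpha*r(i)
            r(i) = r(i) - alpha*q(i)
 45      continue
 50   continue
c     drop small entries to keep the column of M sparse
 60   do 70 i = 1, n
         if (abs(rm(i)) .lt. droptol) rm(i) = 0.0d0
 70   continue
      return
      end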

4 Conclusion

We have described some implementations and algorithms for basic sparse matrix computations which may be used as part of a library or, because of their simple and open design, as templates for the development of research or application codes. We have argued that a portable package for parallel sparse matrix computations is timely, although it may not immediately be extremely efficient. Such a code, for example, may be used as a benchmark for testing the suitability of various parallel architectures for sparse matrix computations.

Acknowledgements

The authors wish to thank Sandra Carney, Todd Goehring, Kesheng Wu, and Mike Heroux for many helpful discussions.

References

[1] E. Chow and Y. Saad. Approximate inverse preconditioners for general sparse matrices. UMSI 94/101, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, Minnesota.
[2] I. S. Duff, R. G. Grimes and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection. TR/PA/92/86, CERFACS, Toulouse.
[3] T. Goehring and Y. Saad. Heuristic algorithms for automatic graph partitioning. UMSI 94/29, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, Minnesota.
[4] Y. Saad. Data structures and algorithms for domain decomposition and distributed sparse matrix computations. In preparation, Army High Performance Computing Research Center, Minneapolis, Minnesota.
[5] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Comput., 14 (1993).
[6] Y. Saad. Krylov subspace methods in distributed computing environments. AHPCRC report, Army High Performance Computing Research Center, Minneapolis, Minnesota.
[7] Y. Saad. SPARSKIT: a basic tool kit for sparse matrix computations, Version 2. Manuscript, University of Minnesota, Minneapolis, Minnesota.
[8] Y. Saad and K. Wu. Parallel sparse matrix library (P_SPARSLIB): the iterative solvers module. AHPCRC report, Army High Performance Computing Research Center, Minneapolis, Minnesota, 1994.


More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning

Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning Edmond Chow a,, Hartwig Anzt b,c, Jennifer Scott d, Jack Dongarra c,e,f a School of

More information

NAG Library Function Document nag_sparse_nsym_sol (f11dec)

NAG Library Function Document nag_sparse_nsym_sol (f11dec) f11 Large Scale Linear Systems NAG Library Function Document nag_sparse_nsym_sol () 1 Purpose nag_sparse_nsym_sol () solves a real sparse nonsymmetric system of linear equations, represented in coordinate

More information

Sivan Toledo Coyote Hill Road. Palo Alto, CA November 25, Abstract

Sivan Toledo Coyote Hill Road. Palo Alto, CA November 25, Abstract Improving Memory-System Performance of Sparse Matrix-Vector Multiplication Sivan Toledo Xerox Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 9434 November 25, 1996 Abstract Sparse Matrix-Vector

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

Cpu time [s] BICGSTAB_RPC CGS_RPC BICGSTAB_LPC BICG_RPC BICG_LPC LU_NPC

Cpu time [s] BICGSTAB_RPC CGS_RPC BICGSTAB_LPC BICG_RPC BICG_LPC LU_NPC Application of Non-stationary Iterative Methods to an Exact Newton-Raphson Solution Process for Power Flow Equations Rainer Bacher, Eric Bullinger Swiss Federal Institute of Technology (ETH), CH-809 Zurich,

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

THE DEVELOPMENT OF THE POTENTIAL AND ACADMIC PROGRAMMES OF WROCLAW UNIVERISTY OF TECH- NOLOGY ITERATIVE LINEAR SOLVERS

THE DEVELOPMENT OF THE POTENTIAL AND ACADMIC PROGRAMMES OF WROCLAW UNIVERISTY OF TECH- NOLOGY ITERATIVE LINEAR SOLVERS ITERATIVE LIEAR SOLVERS. Objectives The goals of the laboratory workshop are as follows: to learn basic properties of iterative methods for solving linear least squares problems, to study the properties

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information