
Tools and Libraries for Parallel Sparse Matrix Computations

Edmond Chow and Yousef Saad
Department of Computer Science, and Minnesota Supercomputer Institute
University of Minnesota, Minneapolis, MN

June 1994

Abstract

This paper describes two portable packages for general-purpose sparse matrix computations: SPARSKIT and P_SPARSLIB. Their emphasis is on iterative techniques, with the latter also emphasizing parallel computation. The packages are collections of tools which may be used either as a library, or as templates for the development of specialized codes. The majority of this paper describes the key components of the parallel iterative solution of linear systems with P_SPARSLIB.

Key words: sparse matrix computations, parallel computing, matrix-vector product, partitioning, iterative methods, preconditioning, tools and libraries.

1 Introduction

The complexity of parallel software makes development tools and software reuse particularly necessary. Tools and libraries for sparse matrix computations are scarce compared to packages such as LAPACK that are available for dense matrix computations. Two reasons for this are the high complexity of sparse matrix routines, and the need for different solution techniques and data structures to obtain good performance on various architectures. It is arguable that current numerical problems, the problems that are worth solving, are so difficult that high-performance, customized solution procedures are required, and that portable, high-level library routines are therefore not useful. At the same time, it is clear that libraries are globally economical in the sense that the gains obtained from their overall use, however partial, can repay their development cost several times over. We wish to point out that a successful compromise may be the use of library codes as templates or as a source of algorithms for developing machine-specific codes. In the current environment of quickly changing hardware and increasingly difficult problems, this approach to software reuse may not only be viable, but also unavoidable.

Work supported by the NSF under grant NSF/CCR, and by ARPA under grant NIST 60NANB2D1272.

We also mention here that what many researchers need is not necessarily high-performance routines, but a useful collection of routines on their platform for experimenting with algorithms. The sparse matrix support in MATLAB has been invaluable to this end, but unfortunately it is neither flexible nor efficient enough for larger, more realistic problems.

This paper describes SPARSKIT and P_SPARSLIB, two FORTRAN 77 packages for sparse matrix computations. SPARSKIT is not designed to run on a parallel machine, but contains essential tools for developing research or specialized application codes, and its routines are often used as templates as described above. SPARSKIT contains conversion routines between 16 different storage formats, and has more than 200 routines for operating on sparse matrices, covering tasks such as matrix addition, reordering, iterative solution, matrix generation, and plotting. The tools work closely with matrices stored externally in the Harwell-Boeing format. Version 2 of SPARSKIT was recently released.

P_SPARSLIB is a library for parallel sparse matrix computations. For generality, parallelism is extracted using a domain decomposition approach applied to the matrix rather than to the physical problem. The code is flexible enough to handle, for example, overlapping domains. P_SPARSLIB uses message passing and runs portably on top of PVM. This layered solution for portability takes advantage of future improvements in the underlying communication library or hardware. P_SPARSLIB provides useful kernels and tools such as parallel sparse matrix-vector multiplication, parallel preconditioning, iterative solution of linear systems, partitioning, multicoloring, and reordering.

2 SPARSKIT

Because of the complexity of sparse matrix routines, a common set of tools shared among researchers should dramatically reduce the time needed to implement sparse matrix research codes. SPARSKIT is a package developed for this purpose, providing routines for extracting submatrices, matrix addition and multiplication, and so on. The package also alleviates the problem of the wide variety of sparse matrix storage formats by providing conversion routines between them, and it facilitates the exchange of data through the Harwell-Boeing format and through matrix generators. In the following, we briefly describe each module of SPARSKIT. See [7] for a complete description.

FORMATS This module contains two sets of routines. The first set is composed of routines which convert the storage format of a matrix to and from the basic Compressed Sparse Row format. Thus one can translate between any of the supported formats with at most two transformations. The formats currently supported are the following.

DNS Dense format
BND Linpack banded format
CSR Compressed Sparse Row format
CSC Compressed Sparse Column format
COO Coordinate format
ELL Ellpack-Itpack generalized diagonal format
DIA Diagonal format
BSR Block Sparse Row format
MSR Modified Compressed Sparse Row format
SSK Symmetric Skyline format
NSK Nonsymmetric Skyline format
LNK Linked list storage format
JAD Jagged Diagonal format
SSS Symmetric Sparse Skyline format
USS Unsymmetric Sparse Skyline format
VBR Variable Block Row format

The second set contains a number of routines that perform simple manipulation functions on sparse matrices, such as extracting a particular diagonal, permuting a matrix, computing norms, or filtering out small elements. For reasons of space we cannot list these routines here.

BLASSM This module contains a number of routines for performing basic linear algebra with sparse matrices. It is also composed of two sets of routines. The first set consists of matrix-matrix operations (e.g., multiplication of matrices) and the second consists of matrix-vector operations. The first set allows one to perform the following operations with sparse matrices, where A, B, C are sparse matrices, D is a diagonal matrix, and σ is a scalar: C = AB, C = A + B, C = A + σB, C = A ± B^T, C = A + σB^T, A = A + σI, C = A + D. The second set contains various routines for performing matrix-vector products and solving sparse triangular linear systems in different storage formats.

INOUT This module consists of routines to read and write matrices in the Harwell-Boeing format. For more information on this format and the Harwell-Boeing collection, see [2]. This module also provides routines for printing the pattern of the matrix in PostScript, or simply dumping the nonzeros in a readable format.

INFO The purpose of this module is to provide as many statistics as possible on a matrix at little cost. For example, the code analyzes the diagonal dominance of the matrix (by rows and by columns), its degree of symmetry (structural as well as numerical), its block structure, its diagonal structure, etc. Functionality for estimating information about the spectrum of the matrix may be added later.
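To make the Compressed Sparse Row convention concrete, the short self-contained Fortran 77 program below stores a small matrix in the three CSR arrays a, ja, ia and multiplies it by a vector. It only illustrates the storage scheme and the typical row-by-row access pattern; it is not one of the SPARSKIT routines.

c     Illustration of the Compressed Sparse Row (CSR) format: the matrix
c         ( 1. 0. 2. 0. )
c         ( 0. 3. 0. 0. )
c         ( 4. 0. 5. 6. )
c         ( 0. 0. 0. 7. )
c     is held in a  (nonzero values, row by row), ja (column index of
c     each value), and ia (position in a/ja where each row starts).
      program csrdem
      integer n, i, k
      parameter (n = 4)
      integer ia(n+1), ja(7)
      double precision a(7), x(n), y(n)
      data ia /1, 3, 4, 7, 8/
      data ja /1, 3, 2, 1, 3, 4, 4/
      data a  /1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 6.d0, 7.d0/
      data x  /1.d0, 1.d0, 1.d0, 1.d0/
c     y = A*x, one row at a time (the same loop structure as a CSR
c     matrix-vector product routine)
      do 20 i = 1, n
         y(i) = 0.0d0
         do 10 k = ia(i), ia(i+1) - 1
            y(i) = y(i) + a(k)*x(ja(k))
 10      continue
 20   continue
      write (*,*) (y(i), i = 1, n)
      end

SPARSKIT's own conversion and matrix-vector routines follow the same layout, with ia(i) pointing to the start of row i and ia(n+1) closing the last row.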

MATGEN The set of routines in this module allows one to generate test matrices. There are generators for several different types of matrices: five-point and seven-point matrices on rectangular regions discretizing a general elliptic partial differential equation, block forms of these (several degrees of freedom per grid point in the PDE), finite element matrices for the convection-diffusion problem on various domains (including user-provided ones), Markov chain matrices arising from a random walk on a triangular grid, and some others.

ORDERINGS This module provides matrix reorderings based on level sets (including Cuthill-McKee implemented with breadth-first search), coloring (including a greedy algorithm for multicolor ordering), and strongly connected components. The latter two are useful for extracting parallelism from sparse matrices.

ITSOL This module currently contains four preconditioners and nine Krylov subspace iterative methods. The preconditioners are ILUT, a robust preconditioner which uses a dual threshold for dropping elements; ILUTP, a variant with column pivoting; ILU(0); and MILU(0). The iterative solvers include popular ones such as CG, CGNR, BiCG, BiCGSTAB, TFQMR, and GMRES, and are implemented with reverse communication to make them independent of the matrix storage format and the preconditioner. See Section 3.4 for more details.

UNSUPP As suggested by its name, this module contains various unsupported software tools that are not necessarily portable or do not fit in any of the previous modules. It currently contains routines for plotting matrices and routines related to matrix exponentials.

3 P_SPARSLIB

Many sparse matrices arise from the discretization of partial differential equations. In these applications, domain decomposition has been a successful general approach for extracting parallelism. In essence, the domain of interest is partitioned into a number of subdomains and some technique is used to recover the global solution. For generality, P_SPARSLIB begins with a matrix rather than a differential equation, and partitioning is performed on the adjacency graph of the matrix, which is the same as the discretization mesh if this was indeed the source of the matrix. However, this broader notion of graph partitioning does not use the geometric information that may be available in the problem, and which may be necessary when solving more difficult problems. General descriptions of the data structures and algorithms in P_SPARSLIB may be found in [4, 6].

P_SPARSLIB has the structure shown in Figure 3.1. We have implemented a number of routines in each of the modules shown. P_SPARSLIB uses message passing for flexibility; in the figure, B-COMS is a temporary name for the communication library provided by the manufacturer, possibly augmented with a few high-level communication primitives required for sparse problems.

[Figure 3.1: General block diagram of P_SPARSLIB. Basic kernels (B-COMS, D-BLAS) support the preprocessing tools (partitioning, coloring, setup), the matrix primitives (matrix-vector product, triangular solve), and the preconditioners (D-ILU, D-SOR), on top of which sit the iterative solvers.]

P_SPARSLIB, as well as SPARSKIT described above, is primarily geared towards iterative solvers because of the growing importance of these techniques and the limitations of direct solvers, both in terms of their potential for high parallel efficiency and in terms of their unmanageable requirements for realistic 3-dimensional problems. In the remainder of this paper, we describe the key components of the parallel iterative solution of linear systems: the matrix-vector product, the iterative solvers themselves, parallel preconditioning, and partitioning. Before we begin, we discuss the data structures used for distributed sparse matrices.

3.1 Distributed sparse matrices

Assume that we have a convenient partitioning of the graph. Without any loss of generality, we can think of the matrix under consideration as originating from the discretization of a partial differential equation on a certain domain, as illustrated in Figure 3.2. We need to set up a local data structure in each processor (or subdomain, or subgraph) which will allow us to perform basic operations such as global matrix-vector products and preconditioning operations. We will assume that the rows and the associated unknowns are mapped to the same processor, i.e., the matrix is distributed row-wise to the processors according to the distribution of the variables. Note that if there is an obvious blocking, which may come from several unknowns associated with the same grid point, then this should be exploited and all such unknowns should be mapped together. In other words, in our mapping algorithms we should deal with the reduced adjacency graph corresponding to a physical grid rather than with the original adjacency graph. Another assumption we make here is that the graph is undirected, i.e., the matrix has a symmetric pattern. This restriction is only made for simplicity and because we would like to use exchanges of information across boundaries (swaps) rather than one-way sends and receives.

[Figure 3.2: Decomposition of the domain (or adjacency graph) and classification of the nodal points into internal points, local interface points, and external interface points.]

The first part of the local data structure consists of a list of all other processors with which a given processor must exchange information when performing matrix-vector products. Although the processors on this list are not necessarily physical neighbors, they hold subdomains that are adjacent to the subdomain mapped to the given processor. The information needed to find these neighboring processors is a global node-to-processor mapping, described by an array map, where map(j) is the processor to which node j is mapped. For simplicity we will assume in this description that there is no overlap, i.e., any node j belongs to only one processor, namely processor map(j). The local rows are inspected one by one, and for each nonzero a_ij with map(j) ≠ myproc, where myproc is the label of the current processor, we add map(j) to the list of neighboring processors if it is not already listed. We store the labels of the neighboring processors in an array proc(1:nproc), where nproc is the number of neighboring processors. In this initial phase, each processor myproc will also determine, for each of its neighboring processors, the list of its nodes that are coupled with nodes of that processor. We refer to these nodes as local interface nodes. When performing a matrix-vector product, neighboring processors must exchange the values of their adjacent interface nodes. In order to perform this data exchange operation efficiently, it is important to group these nodes processor by processor. Thus, we list first all those nodes that must be sent to proc(1), followed by those to be sent to proc(2), etc. Two arrays are used for this purpose: one called ix, which lists the nodes as indicated above, and a pointer array ipr, where ipr(i) points to the beginning of the list for proc(i).

Once the boundary exchange information is determined, we need to set up the distributed matrices in each processor, using a suitable data structure. In order to perform a matrix-vector product with a matrix that is distributed in the manner described earlier, we need to multiply the matrix consisting of the rows that are local to a given processor by some global vector x. Some components of this vector will be local, and some components must be moved to the current processor for the operation to complete. Let x_loc be the local components of the vector x for a given processor and let x_ext be the external components that are required. The vector x_loc itself is composed of strictly internal nodes x_int and boundary, or local interface, nodes x_bnd. Let A_0 be the local matrix, i.e., the rectangular matrix consisting of all the rows that are mapped to myproc. We will call A_loc the 'diagonal block' of A located in A_0, i.e., the submatrix of A_0 whose nonzero elements a_ij are such that j is a local variable. Similarly, we will call A_ext the 'off-diagonal block', i.e., the submatrix of A_0 whose nonzero elements a_ij are such that j is not a local variable.

[Figure 3.3: The distributed sparse matrix A. Each processor holds a block of rows consisting of a local part A_loc acting on the local nodes and an external part A_ext acting on the external nodes.]
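As an illustration of the setup phase just described, the following Fortran 77 sketch builds the list of neighboring processors from the map array and the local rows, assumed stored in CSR form with global column indices. The routine and variable names are illustrative only and do not reproduce the actual P_SPARSLIB interface; the interface-node lists ix and ipr are constructed in the same spirit.

c     Sketch: build the list of neighboring processors for myproc.
c     Assumed inputs (illustrative names, not the P_SPARSLIB interface):
c       nrow      number of local rows
c       ia, ja    local rows in CSR form, with global column indices
c       map(j)    processor owning global node j
c       myproc    label of this processor
c     Output: proc(1:nproc), the labels of the neighboring processors.
      subroutine nbrlst(nrow, ia, ja, map, myproc, proc, nproc)
      integer nrow, ia(*), ja(*), map(*), myproc
      integer proc(*), nproc
      integer i, k, p, l
      logical found
      nproc = 0
      do 30 i = 1, nrow
         do 20 k = ia(i), ia(i+1) - 1
            p = map(ja(k))
            if (p .ne. myproc) then
c              add p to proc(1:nproc) if it is not already listed
               found = .false.
               do 10 l = 1, nproc
                  if (proc(l) .eq. p) found = .true.
 10            continue
               if (.not. found) then
                  nproc = nproc + 1
                  proc(nproc) = p
               endif
            endif
 20      continue
 30   continue
      return
      end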

3.2 Matrix-vector product

Let us consider the simple case illustrated in Figure 3.3, in which the rows are assigned to five processors in block order. To perform a matrix-vector product, we start by multiplying the diagonal block A_loc by the local variables. We then multiply A_ext by the external variables. Notice that since the external interface points are not coupled with local internal points, only the rows corresponding to the boundary nodes of A_ext will have nonzero elements. Thus, we can separate the matrix-vector product into two such operations, one involving only the local variables and the other involving the external variables. We need to construct these two matrices and define a local numbering of the local variables in order to perform the two matrix-vector products efficiently each time. For convenience, the local interface points are labeled last in each processor. This is illustrated in Figure 3.4.

[Figure 3.4: The local matrix data structure for each subdomain: A_loc acts on the internal points (x_int) followed by the local interface points (x_bnd), and A_ext is the external interface matrix.]

The algorithm for the matrix-vector product is then as follows:

Algorithm 3.1 Distributed sparse matrix-vector product
  Exchange interface data: scatter x_bnd to neighbors and gather x_ext from neighbors
  Local matrix-vector product: y = A_loc x_loc
  External matrix-vector product: y = y + A_ext x_ext
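The following Fortran 77 fragment sketches Algorithm 3.1 for one processor, assuming A_loc and A_ext are held locally in CSR form with local column numberings. The routine exchng, which scatters the boundary values x_bnd and gathers x_ext using the neighbor lists of Section 3.1, is a hypothetical wrapper around the message-passing layer; it is not an actual P_SPARSLIB routine.

c     Sketch of Algorithm 3.1 on one processor.  Aloc and Aext are in
c     CSR form; exchng (hypothetical) swaps interface data with the
c     neighboring processors through the communication layer (B-COMS).
      subroutine dmatvc(nloc, aloc, jaloc, ialoc,
     &                  aext, jaext, iaext, x, xext, y)
      integer nloc, jaloc(*), ialoc(*), jaext(*), iaext(*)
      double precision aloc(*), aext(*), x(*), xext(*), y(*)
      integer i, k
c     exchange interface data: send local boundary values, receive the
c     external interface values into xext
      call exchng(x, xext)
c     local part:  y = Aloc * xloc
      do 20 i = 1, nloc
         y(i) = 0.0d0
         do 10 k = ialoc(i), ialoc(i+1) - 1
            y(i) = y(i) + aloc(k)*x(jaloc(k))
 10      continue
 20   continue
c     external part:  y = y + Aext * xext (only the boundary rows of
c     Aext contain nonzeros)
      do 40 i = 1, nloc
         do 30 k = iaext(i), iaext(i+1) - 1
            y(i) = y(i) + aext(k)*xext(jaext(k))
 30      continue
 40   continue
      return
      end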

3.3 Graph partitioning

We now outline an approach for partitioning the vertices of the matrix graph. From the previous section it is clear that, to minimize communication costs, we need to minimize the number of neighbors of each subdomain and the number of interface points (or the number of edges connecting the subdomains). Load balancing, by making the subdomains of roughly equal size or by some other criterion, is also necessary. For vertex partitioning when geometric information is not available, we are developing parallel algorithms based on the simultaneous level-set expansion from a set of center points, one for each subdomain [3]; a serial sketch of this expansion is given below. This procedure is inherently parallel, but requires unbiased arbitration in case of conflicts. The challenging aspect of the method is determining a good set of center points, and for this we have developed heuristic algorithms using cost functions. Starting from an initial, perhaps random, set of centers, the cost of this set is computed, for example, as the sum of the inverses of the distances between the centers. The centers are then moved so that the cost decreases, and when the cost cannot be decreased further, the algorithm terminates. Many criteria, such as those described at the beginning of this section, may be built into the cost function. We do not consider the mapping of subdomains to processors, since this is obviously architecture dependent, and many computer manufacturers are striving to make the difference between best-case and worst-case communication as small as possible.
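The serial sketch below illustrates only the level-set expansion itself: all centers are placed in a single queue and the fronts are grown one level at a time, each node receiving the label of the first front that reaches it. The parallel algorithm of [3] additionally requires unbiased arbitration when fronts meet and is combined with the center-selection heuristic described above; the routine and variable names used here are illustrative.

c     Serial sketch of simultaneous level-set (breadth-first) expansion
c     from ndom center points.  The graph is given by its adjacency
c     structure (ia, ja); on exit dom(i) is the subdomain assigned to
c     node i (nodes not reached by any front keep dom = 0).  iq is an
c     integer work array of length n used as the queue.
      subroutine lvlexp(n, ia, ja, ndom, center, dom, iq)
      integer n, ia(*), ja(*), ndom, center(*), dom(*), iq(*)
      integer i, k, d, head, tail, node, nbr
c     initially no node is assigned
      do 10 i = 1, n
         dom(i) = 0
 10   continue
c     seed the queue with the centers, one per subdomain
      tail = 0
      do 20 d = 1, ndom
         dom(center(d)) = d
         tail = tail + 1
         iq(tail) = center(d)
 20   continue
c     expand all fronts simultaneously; ties are broken by queue order,
c     where the parallel algorithm needs an unbiased arbitration
      head = 0
 30   if (head .lt. tail) then
         head = head + 1
         node = iq(head)
         do 40 k = ia(node), ia(node+1) - 1
            nbr = ja(k)
            if (dom(nbr) .eq. 0) then
               dom(nbr) = dom(node)
               tail = tail + 1
               iq(tail) = nbr
            endif
 40      continue
         go to 30
      endif
      return
      end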

3.4 Iterative solvers

The iterative solvers module in P_SPARSLIB is the same as that in SPARSKIT. We note that a flexible variant of GMRES called FGMRES [5], which allows the preconditioner to change at each step, is especially useful for parallel computation, as will be seen in Section 3.5. The iterative methods will not be described here, except to say that the four basic operations required are the matrix-vector product, the preconditioning operation, SAXPY, and the dot product. The matrix-vector product was described in Section 3.2, and the preconditioning operation will be described in Section 3.5. SAXPY involves no communication if the vectors are partitioned the same way across the processors, and the dot product is a global reduction operation, often provided by the hardware or the underlying software.

The iterative solvers are implemented so that they are independent of:
1. the environment, whether it is parallel or serial;
2. the storage format of the matrix and the preconditioner;
3. the preconditioning operation.

This could be achieved, for example, by using callback functions for the four basic operations described above. However, this is not entirely flexible for the matrix-vector product and preconditioning operations without using global parameters, since the callback functions must have their calling sequences fixed beforehand. P_SPARSLIB instead uses a reverse-communication mechanism for these two operations to achieve the same effect. Here, the iterative solver exits back to the caller and indicates through an output variable which operation needs to be performed on its output vector. A typical code follows. See [8] for more details.

c     reverse-communication loop: fgmres returns with icode indicating
c     which operation the caller must perform before re-entering
      icode = 0
 1    continue
      call fgmres(n,im,rhs,sol,i,vv,w,wk1,wk2,eps,maxits,iout,icode)
      if (icode .eq. 1) then
         call precon(n,wk1,wk2)
         goto 1
      else if (icode .eq. 2) then
         call matvec(n,wk1,wk2)
         goto 1
      endif

3.5 Distributed preconditioners

Krylov subspace methods generally work very poorly if no preconditioning is used. Unfortunately, the traditional and most effective preconditioners are extremely difficult to parallelize. In this section, we describe multicolor SOR and a new technique based on approximate inverses.

We begin by describing multicoloring, a useful tool for extracting parallelism from sparse matrices. Multicoloring assigns each subdomain a 'color' such that no two adjacent subdomains have the same color. The standard heuristic method to multicolor a graph is to first select an order in which to color the nodes. Consider the natural order 1, 2, ..., n. If we want to execute this algorithm in a parallel environment, we observe that a given node never needs to examine the nodes whose labels are larger than its own. Although this seems to establish a sequential procedure, there is actually a substantial amount of parallelism because of the sparsity of the graph. Typically, the degree of parallelism is of the order of the diameter of the graph. Since the coloring process is essentially a preprocessing task, this amount of parallelism is sufficient for most practical situations, even for a large number of processors. A parallel implementation, with each processor holding one subdomain, is as follows:

Algorithm 3.2 Parallel multicoloring
  Start with one colored region
  From each neighbor proc(k) with proc(k) < myproc, receive color(proc(k))
  Compute color(myproc) = min{valid colors}
  To each neighbor proc(k) with proc(k) > myproc, send color(myproc)

Once the subdomains have been colored, one can use a form of multicolor SOR or SSOR relaxation as a preconditioner. In each processor, a one-step SOR iteration takes the form (a Fortran sketch follows the algorithm):

Algorithm 3.3 Multicolor SOR preconditioning
  For k = 1, 2, ..., ncolrs do
    Exchange interface values
    If (k = mycol) then
      x_loc := x_loc + ω A_loc^{-1} [b_loc − A_loc x_loc − A_ext x_ext]
    endif
  enddo
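A Fortran 77 sketch of this sweep for one processor is given below. The routines exchng (which swaps interface values with the neighboring processors) and locslv (which returns an exact or approximate solution of a system with A_loc, for instance by ILUT-preconditioned GMRES or by an exact factorization, as discussed next) are hypothetical placeholders rather than P_SPARSLIB routines.

c     Sketch of Algorithm 3.3 (one multicolor SOR sweep) on one
c     processor.  Aloc and Aext are stored in CSR form; r and z are
c     work vectors of local length.
      subroutine mcsor(nloc, x, xext, b, aloc, jaloc, ialoc,
     &                 aext, jaext, iaext, omega, mycol, ncolrs, r, z)
      integer nloc, jaloc(*), ialoc(*), jaext(*), iaext(*)
      integer mycol, ncolrs
      double precision x(*), xext(*), b(*), aloc(*), aext(*)
      double precision omega, r(*), z(*)
      integer i, k, kc
      do 50 kc = 1, ncolrs
c        every processor takes part in the exchange at every color
         call exchng(x, xext)
         if (kc .eq. mycol) then
c           local residual  r = b - Aloc*xloc - Aext*xext
            do 30 i = 1, nloc
               r(i) = b(i)
               do 10 k = ialoc(i), ialoc(i+1) - 1
                  r(i) = r(i) - aloc(k)*x(jaloc(k))
 10            continue
               do 20 k = iaext(i), iaext(i+1) - 1
                  r(i) = r(i) - aext(k)*xext(jaext(k))
 20            continue
 30         continue
c           z = (approximate) solution of  Aloc*z = r
            call locslv(nloc, r, z)
c           relaxation step  x := x + omega*z
            do 40 i = 1, nloc
               x(i) = x(i) + omega*z(i)
 40         continue
         endif
 50   continue
      return
      end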

In Algorithm 3.3, myproc is the number of the processor on which the code is being executed, and mycol is the color of myproc. An s-step SOR iteration would simply consist of adding an outer loop to the above algorithm. The solution of a system with the local matrix A_loc, indicated by the A_loc^{-1} operation in the algorithm, is typical of parallel preconditioners such as this one and, for example, block Jacobi. A preconditioned iterative method may be used for this local solve, such as ILUT-preconditioned GMRES, which is available in SPARSKIT. Since the preconditioning operation is then itself an iterative process, FGMRES is required for the outer iterations. Sparse direct solution methods are also useful in this case: since the local systems with A_loc are small and must be solved with several right-hand sides, it may be worthwhile to factor A_loc exactly.

Another class of preconditioners is explicit: these use matrix-vector multiplication as the preconditioning step, which can be parallelized as described in Section 3.2. We have investigated approaches for approximating the inverse of the system matrix A directly by minimizing the Frobenius norm of the residual matrix

    F(M) = || I − AM ||_F

where M is the approximate right inverse. In practice, this is achieved by solving approximately

    A m_j = e_j,   j = 1, ..., n,

where m_j is the j-th column of M and e_j is the j-th coordinate vector. The approximate solution of these linear systems is an iterative procedure itself and thus involves the operations already described; a sketch of one way to carry out these columnwise solves is given at the end of this section. An important difference is that we must keep M sparse, and the Krylov basis which is used to construct the solution is also kept sparse. If it is possible for each processor to access the entire matrix A, then an alternative parallel implementation is to compute the n individual columns of M simultaneously, with no communication required. We have found that this approximate inverse preconditioner is advantageous in many cases where the matrix A is nonsymmetric or indefinite [1].
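As an illustration of the columnwise construction, the sketch below computes one column of M by a few minimal-residual steps on A m_j = e_j, followed by a simple dropping of small entries. For brevity the column is kept as a dense work vector; the method studied in [1] keeps both the iterates and the Krylov basis sparse throughout, which is essential for efficiency. The names and the dropping rule are illustrative only.

c     Sketch: one column of an approximate inverse by nsteps minimal
c     residual iterations on  A*m_j = e_j, with A in CSR form.
c     rm returns the (dense) column; r and q are work vectors.
      subroutine aicol(n, a, ja, ia, j, nsteps, droptol, rm, r, q)
      integer n, ja(*), ia(*), j, nsteps
      double precision a(*), droptol, rm(*), r(*), q(*)
      integer i, k, it
      double precision alpha, rq, qq
c     start from m_j = 0, so the residual is r = e_j
      do 10 i = 1, n
         rm(i) = 0.0d0
         r(i) = 0.0d0
 10   continue
      r(j) = 1.0d0
      do 50 it = 1, nsteps
c        q = A*r  (CSR matrix-vector product)
         do 30 i = 1, n
            q(i) = 0.0d0
            do 20 k = ia(i), ia(i+1) - 1
               q(i) = q(i) + a(k)*r(ja(k))
 20         continue
 30      continue
c        step length alpha = (r,Ar)/(Ar,Ar) minimizes ||e_j - A*m_j||_2
         rq = 0.0d0
         qq = 0.0d0
         do 40 i = 1, n
            rq = rq + r(i)*q(i)
            qq = qq + q(i)*q(i)
 40      continue
         if (qq .eq. 0.0d0) go to 60
         alpha = rq/qq
         do 45 i = 1, n
            rm(i) = rm(i) + alpha*r(i)
            r(i) = r(i) - alpha*q(i)
 45      continue
 50   continue
c     drop small entries to keep the column of M sparse
 60   do 70 i = 1, n
         if (abs(rm(i)) .lt. droptol) rm(i) = 0.0d0
 70   continue
      return
      end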

4 Conclusion

We have described some implementations and algorithms for basic sparse matrix computations which may be used as part of a library or, because of their simple and open design, as templates for the development of research or application codes. We have argued that a portable package for parallel sparse matrix computations is timely, although it may not immediately be extremely efficient. Such a code, for example, may be used as a benchmark for testing the suitability of various parallel architectures for sparse matrix computations.

Acknowledgements

The authors wish to thank Sandra Carney, Todd Goehring, Kesheng Wu, and Mike Heroux for many helpful discussions.

References

[1] E. Chow and Y. Saad. Approximate inverse preconditioners for general sparse matrices. UMSI 94/101, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, Minnesota.
[2] I. S. Duff, R. G. Grimes and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection. TR/PA/92/86, CERFACS, Toulouse.
[3] T. Goehring and Y. Saad. Heuristic algorithms for automatic graph partitioning. UMSI 94/29, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, Minnesota.
[4] Y. Saad. Data structures and algorithms for domain decomposition and distributed sparse matrix computations. In preparation, Army High Performance Computing Research Center, Minneapolis, Minnesota.
[5] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Comput., 14 (1993).
[6] Y. Saad. Krylov subspace methods in distributed computing environments. AHPCRC report, Army High Performance Computing Research Center, Minneapolis, Minnesota.
[7] Y. Saad. SPARSKIT: a basic tool kit for sparse matrix computations, Version 2. Manuscript, University of Minnesota, Minneapolis, Minnesota.
[8] Y. Saad and K. Wu. Parallel sparse matrix library (P_SPARSLIB): the iterative solvers module. AHPCRC report, Army High Performance Computing Research Center, Minneapolis, Minnesota, 1994.


More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning

Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning Edmond Chow a,, Hartwig Anzt b,c, Jennifer Scott d, Jack Dongarra c,e,f a School of

More information

NAG Library Function Document nag_sparse_nsym_sol (f11dec)

NAG Library Function Document nag_sparse_nsym_sol (f11dec) f11 Large Scale Linear Systems NAG Library Function Document nag_sparse_nsym_sol () 1 Purpose nag_sparse_nsym_sol () solves a real sparse nonsymmetric system of linear equations, represented in coordinate

More information

Sivan Toledo Coyote Hill Road. Palo Alto, CA November 25, Abstract

Sivan Toledo Coyote Hill Road. Palo Alto, CA November 25, Abstract Improving Memory-System Performance of Sparse Matrix-Vector Multiplication Sivan Toledo Xerox Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 9434 November 25, 1996 Abstract Sparse Matrix-Vector

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

Cpu time [s] BICGSTAB_RPC CGS_RPC BICGSTAB_LPC BICG_RPC BICG_LPC LU_NPC

Cpu time [s] BICGSTAB_RPC CGS_RPC BICGSTAB_LPC BICG_RPC BICG_LPC LU_NPC Application of Non-stationary Iterative Methods to an Exact Newton-Raphson Solution Process for Power Flow Equations Rainer Bacher, Eric Bullinger Swiss Federal Institute of Technology (ETH), CH-809 Zurich,

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

THE DEVELOPMENT OF THE POTENTIAL AND ACADMIC PROGRAMMES OF WROCLAW UNIVERISTY OF TECH- NOLOGY ITERATIVE LINEAR SOLVERS

THE DEVELOPMENT OF THE POTENTIAL AND ACADMIC PROGRAMMES OF WROCLAW UNIVERISTY OF TECH- NOLOGY ITERATIVE LINEAR SOLVERS ITERATIVE LIEAR SOLVERS. Objectives The goals of the laboratory workshop are as follows: to learn basic properties of iterative methods for solving linear least squares problems, to study the properties

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information