SCALABLE ALGORITHMS for solving large sparse linear systems of equations


SCALABLE ALGORITHMS for solving large sparse linear systems of equations. CONTENTS: sparse direct solvers (multifrontal); substructuring methods (hybrid solvers). Jacko Koster, Bergen Center for Computational Science, University of Bergen (prepared with help from the MUMPS team, esp. Patrick Amestoy and Jean-Yves L'Excellent)

Introduction. Simulation of a physical problem often leads to sparse systems A x = b. The (repeated) solution of sparse linear systems of equations is often the most computationally intensive part of a simulation process. Problem: solve A x = b, where A is square (n x n), sparse, and symmetric positive definite, symmetric indefinite, or unsymmetric, and possibly rank deficient; the vector b of length n is given; the vector x of length n is to be computed. Goal: develop software libraries tailored for parallel processing, in particular distributed-memory platforms (loosely connected SMPs, clusters).

Direct methods - Introduction. Solve A x = b:
1. LU factorization: P A Q = LU or P A Q = L D L^T, where P is a row permutation, Q is a column permutation, L is a lower triangular matrix, and U is an upper triangular matrix. The dominant part of the computation is O(n^3):

for k = 1,n /* eliminate variable k */
  for i = k+1,n /* overwrite A[ k+1:n, k+1:n ] */
    for j = k+1,n
      A[i,j] = A[i,j] - A[i,k] * A[k,j] / A[k,k]

Block version: eliminate k variables at once; with the matrix partitioned into a pivot block F, off-diagonal blocks B and C, and a trailing block D, the update becomes D = D - C [F]^{-1} B.
2. Solve L y = P b for the vector y (forward substitution, O(n^2)).
3. Solve U (Q^{-1} x) = y for the vector x, i.e., solve U z = y and set x = Q z (back substitution, O(n^2)).
Uses BLAS1, BLAS2, and BLAS3 for high performance.
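
The following is a minimal NumPy sketch of the k-i-j elimination and the two triangular solves described above, without pivoting (the permutations P and Q are discussed on the next slide); it is an illustration only, not the code of any of the solvers in this talk.

```python
# Dense right-looking LU (no pivoting) plus forward/back substitution,
# mirroring the k-i-j loop of the slide. Illustrative sketch only.
import numpy as np

def lu_factorize(A):
    """Return a copy of A overwritten with L (unit lower, below diag) and U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n):                       # eliminate variable k
        A[k+1:, k] /= A[k, k]                # multipliers = column k of L
        # rank-1 update of the trailing submatrix A[k+1:n, k+1:n]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A

def lu_solve(LU, b):
    n = LU.shape[0]
    y = b.astype(float).copy()
    for i in range(n):                       # forward substitution: L y = b
        y[i] -= LU[i, :i] @ y[:i]
    x = y
    for i in reversed(range(n)):             # back substitution: U x = y
        x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
b = np.array([1., 2., 3.])
x = lu_solve(lu_factorize(A), b)
print(np.allclose(A @ x, b))                 # True
```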

Direct methods - Introduction. Why permutations P and Q in P A Q = LU? P and Q are needed to find large pivots a_kk^(k) in the reduced matrix A^(k), for numerical accuracy. P and Q are also needed to preserve sparsity: make as few entries a_ij nonzero as possible (fill-in) and keep L and U sparse, for computational and memory savings. P and Q define the variable ordering of the matrix.

Basic idea: a large sparse matrix is a sum of smaller dense matrices. The factorization of matrix A is driven by an elimination tree that is determined by the matrix structure and the variable ordering (permutations P and Q). A node in the tree represents a partial factorization of a (small) dense matrix. An edge in the tree represents data movement between dense matrices. The tree defines a partial ordering: a node can only be processed when all its children have been processed. Leaves are processed first; the root comes last. Nodes that are not ancestors of one another can be processed simultaneously (parallelism); parallelism decreases toward the root. (Extreme cases: a diagonal matrix gives a tree of only leaves, which is embarrassingly parallel; a tridiagonal matrix with the natural ordering gives a chain, with no parallelism.)

Sparse direct methods. Variable ordering has a large impact on performance: the ordering impacts the elimination tree, parallelism, computation, and memory use. It easily pays off to spend time on analyzing the matrix before the factorization. However, finding the optimal ordering to minimize fill-in is an NP-complete problem, so heuristics are used to study/optimize the topology of the tree (number of nodes, sizes of nodes, ...). Fill-in = entries created during the factorization. (Figure: an initial matrix and a reordering of it that produces no fill-in.)
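
The effect of the ordering on fill-in can be observed with any sparse direct solver; the sketch below uses SciPy's SuperLU wrapper as an assumed stand-in (not one of the solvers discussed in the talk) and compares the natural ordering with COLAMD on a 2-D Laplacian.

```python
# Compare fill-in (nonzeros of L and U) for two column orderings of the
# 5-point 2-D Laplacian, using scipy.sparse.linalg.splu as a stand-in solver.
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 50
T = sp.diags([-1, 2, -1], [-1, 0, 1], (n, n))
A = sp.kronsum(T, T).tocsc()                 # 2-D Laplacian on an n x n grid

for perm in ("NATURAL", "COLAMD"):
    lu = splu(A, permc_spec=perm)
    print(perm, "nnz(L) + nnz(U) =", lu.L.nnz + lu.U.nnz)
```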

(Animation over several slides: the elimination tree is processed from the leaves upward; at each step some nodes are "ready to be processed" (all their children are done) and the rest are "not ready"; as nodes complete, their parents become ready, until only the root remains.)

(Figure: the elimination tree with a depth-first search numbering of its nodes; one node ready to be processed.) Note: in practice, the tree can be processed in many orders. A depth-first search (DFS) order allows the frontal matrices to be stored on a stack, as sketched below.
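
The stack property mentioned above can be seen in a few lines: in a postorder (DFS) traversal every child is processed before its parent, so contribution blocks can be pushed onto a stack and are popped exactly when the parent front is assembled. The tree representation and the factorize_front callback below are hypothetical placeholders, not MUMPS data structures.

```python
# Postorder traversal of an elimination tree with a stack of contribution blocks.

def postorder(tree, root):
    """Yield nodes of `tree` (dict: node -> list of children) children-first."""
    out, work = [], [root]
    while work:
        node = work.pop()
        out.append(node)
        work.extend(tree[node])
    return reversed(out)                 # children always precede their parent

def multifrontal_sketch(tree, root, factorize_front):
    stack = []                           # stack of contribution blocks
    for node in postorder(tree, root):
        contribs = [stack.pop() for _ in tree[node]]      # children's blocks
        stack.append(factorize_front(node, contribs))     # push own block
    return stack.pop()                   # the root's (empty) contribution

# Tiny example: node 6 is the root, nodes 0, 1, 2 are leaves.
tree = {6: [4, 5], 4: [0, 1], 5: [2], 0: [], 1: [], 2: []}
print(multifrontal_sketch(tree, 6, lambda node, contribs: f"CB({node})"))
```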

ZOOM-IN on a branch of the tree: node 1 assembles its front A1 = [F1 B1; C1 D1], eliminates its fully-summed variables, and computes the contribution block D1 = D1 - C1 [F1]^{-1} B1, which is assembled into the front A2 of its parent. Node 2 then computes D2 = D2 - C2 [F2]^{-1} B2 and assembles D2 into A3; node 3 computes D3 = D3 - C3 [F3]^{-1} B3; and so on up the tree.
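
A sketch of what happens inside one front, in the notation of the slide; partial_factorize and extend_add are hypothetical helpers for illustration, not MUMPS routines.

```python
# One front [F B; C D]: eliminate the fully-summed block F, keep the
# contribution block D - C F^{-1} B, and assemble ("extend-add") it into the
# parent front by matching global variable indices.
import numpy as np

def partial_factorize(front, k):
    """Eliminate the first k (fully-summed) variables of a dense front."""
    F, B = front[:k, :k], front[:k, k:]
    C, D = front[k:, :k], front[k:, k:]
    return D - C @ np.linalg.solve(F, B)          # contribution block

def extend_add(parent, parent_idx, contrib, contrib_idx):
    """Add `contrib` into `parent`, matching rows/columns by global indices."""
    pos = [parent_idx.index(g) for g in contrib_idx]
    parent[np.ix_(pos, pos)] += contrib

# Toy usage with made-up global variable indices:
child_front = np.array([[4., 1., 1.], [1., 3., 0.], [1., 0., 2.]])
cb = partial_factorize(child_front, 1)            # eliminate 1 variable
parent_front, parent_idx = np.zeros((3, 3)), [5, 7, 9]
extend_add(parent_front, parent_idx, cb, [7, 9])
```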

Three phases to solve A x = b:
1. Analysis of the matrix A (symbolic factorization): determine appropriate P and Q based on the nonzero structure of A to minimize fill-in in L and U; compute the elimination tree; prepare for parallel execution and map tree nodes onto processors; each processor prepares its local data structures.
2. Factorization of A (numerical factorization): compute L and U using the tree; possibly modify P and Q a posteriori to ensure numerical stability.
3. Forward/back substitution: use P, Q, L, and U to compute x = A^{-1} b.
For a sequence of systems A x1 = b1, A x2 = b2, ...: do steps 1 and 2 only once. For a sequence of systems A1 x1 = b1, A2 x2 = b2, ... with A1, A2, ... having the same nonzero structure: do step 1 only once.
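
The same analysis/factorization/solve split, and the reuse it allows, exists in other sparse direct solvers; a sketch with SciPy's SuperLU wrapper (used here only as a convenient stand-in for a solver such as MUMPS):

```python
# Factor once, then reuse the factors for several right-hand sides (step 3 only).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 500
A = (sp.random(n, n, density=0.01) + 10 * sp.identity(n)).tocsc()
lu = splu(A)                  # analysis + numerical factorization (steps 1 and 2)
for _ in range(3):
    b = np.random.rand(n)
    x = lu.solve(b)           # forward/back substitution only (step 3)
```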

In a distributed-memory environment: map the nodes of the tree onto processors. (Figure: an elimination tree distributed over four processors 0, 1, 2, 3; an edge between nodes mapped to different processors implies communication, an edge within one processor does not.) Constraints: each node has its own amount of work; the tree defines the dependencies between nodes; processors must get roughly the same amount of total work (load balance); inter-processor communication must be minimized; idling of processors must be minimized. (A toy illustration of the load-balance aspect follows below.)
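
A deliberately naive illustration of the load-balance constraint (MUMPS uses far more refined, tree-aware strategies): independent subtrees with estimated work are handed greedily to the least-loaded processor.

```python
# Greedy (largest-first) mapping of independent subtrees onto processors.
import heapq

def map_subtrees(subtree_work, nprocs):
    """subtree_work: dict subtree_id -> estimated flops. Returns id -> processor."""
    heap = [(0.0, p) for p in range(nprocs)]        # (current load, processor)
    heapq.heapify(heap)
    mapping = {}
    for sid, work in sorted(subtree_work.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)            # least-loaded processor
        mapping[sid] = proc
        heapq.heappush(heap, (load + work, proc))
    return mapping

print(map_subtrees({"T1": 9.0, "T2": 5.0, "T3": 4.0, "T4": 2.0}, 2))
# -> {'T1': 0, 'T2': 1, 'T3': 1, 'T4': 0}
```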

MUMPS parallel multifrontal scheme. Task granularity: assign subtrees to processors to minimize communication. In practice, nodes near the root require much work (often 75% of the work is in the few top levels of the tree), yet there is reduced parallelism near the root (relatively few nodes, many processors). Therefore, use multiple processors to process the large nodes near the root. (Figure: four processors P0, P1, P2, P3.)

MUMPS parallel multifrontal scheme. Allow dynamic assignment of matrices during the factorization to take care of numerical stability issues. (Figure: four processors P0, P1, P2, P3.)

MUMPS in a cluster environment. A flop-based scheduling strategy appears to be the most natural (work balance, minimize elapsed time) and works fine in a shared-memory environment. However, a good work balance does not necessarily imply a good memory balance. On clusters (of small SMPs) with little memory per node, a bad memory balance may make computing a solution impossible. Memory scalability and memory load balance are therefore just as important: there is a need for memory-aware task scheduling. Similarly, there is a need for interconnect-aware task scheduling that takes network latencies and bandwidths into account.

MUMPS package, http://graal.ens-lyon.fr/mumps
Background: project continued from the LTR European project PARASOL (1996-1999).
Developers and contributors: Patrick Amestoy (ENSEEIHT-IRIT, Toulouse), Jean-Yves L'Excellent (ReMAP project, INRIA, Lyon), Iain Duff (CERFACS, Toulouse and RAL, UK), Abdou Guermouche (ReMAP project, Toulouse), Jacko Koster (Parallab, BCCS, Norway), Stéphane Pralet (CERFACS, Toulouse), Christophe Vömel (CERFACS, Toulouse).
General purpose, competitive, many functionalities. Types of matrices: SPD, symmetric, unsymmetric. Input matrix formats: assembled, elemental, distributed. Arithmetic: real, double, complex, double complex. Numerical pivoting, scalings, backward error analysis, iterative refinement. Written in Fortran 90 and MPI; a C interface is provided. MUMPS 4.3.2 is the latest public release (July 2003). Requested/downloaded by ca. 500 users. Ca. 200,000 lines of code and growing. Freely available software.

Some MUMPS usage. PETSc (Argonne National Laboratory): MUMPS is available from the PETSc 2.2.0 library as an optional package. Academic and industrial users come from various application fields: structural mechanical engineering, biomechanics, heat transfer analysis, medical image processing, geophysics, optical problems, ad-hoc network modeling (Markov processes), econometric modeling, oil reservoir simulation, computational fluid dynamics, astrophysics, circuit simulation. Used by EADS, Dassault, CEA, Boeing, NEC, THALES, NASA, MIT, several US national labs, ...

SALSA - Introduction. Iterative substructuring for solving A x = b. The idea is to combine the best of direct methods (robustness) with the best of iterative methods (speed). Reorder A into bordered block diagonal form (see below): this represents N non-overlapping subdomains. The interior matrices A_ii can be processed simultaneously, which provides a natural source of parallelism.
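
The bordered block diagonal form itself is not reproduced in the transcription; a standard way to write it for N subdomains (notation assumed here, matching the blocks used two slides further on) is:

```latex
P A P^{T} =
\begin{pmatrix}
A_{ii}^{(1)} &              &        &              & A_{ir}^{(1)} \\
             & A_{ii}^{(2)} &        &              & A_{ir}^{(2)} \\
             &              & \ddots &              & \vdots       \\
             &              &        & A_{ii}^{(N)} & A_{ir}^{(N)} \\
A_{ri}^{(1)} & A_{ri}^{(2)} & \cdots & A_{ri}^{(N)} & A_{rr}
\end{pmatrix}
```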

SALSA - Introduction. Example: 4 subdomains, 9 interface variables, 16 interior variables (figure: each subdomain owns 4 interior variables and shares interface variables with its neighbours). The solution process for A x = b consists of three main steps:
1. Eliminate the interior variables in A_ii^(p) (in parallel for each subdomain), e.g., with a sparse direct solver like MUMPS.
2. Eliminate the interface variables in A_rr^(p) with the use of the local Schur complement matrices S(p) = A_rr^(p) - A_ri^(p) [A_ii^(p)]^{-1} A_ir^(p), typically with a preconditioned iterative method (PCG, Bi-CGSTAB, GMRES, ...).
3. Postprocessing to obtain the final solution.
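
A toy dense version of the three steps (random stand-in blocks; SALSA itself works with sparse, distributed subdomain data):

```python
# Substructuring in three steps: eliminate interiors per subdomain, solve the
# interface (Schur complement) system, then recover the interior unknowns.
import numpy as np

rng = np.random.default_rng(0)
N, ni, nr = 2, 4, 3                      # subdomains, interiors each, interface size
Aii = [5 * np.eye(ni) + rng.random((ni, ni)) for _ in range(N)]
Air = [rng.random((ni, nr)) for _ in range(N)]
Ari = [Air[p].T.copy() for p in range(N)]
Arr = 5 * np.eye(nr)
bi  = [rng.random(ni) for _ in range(N)]
br  = rng.random(nr)

# Step 1: eliminate interior variables, independently per subdomain.
S, g = Arr.copy(), br.copy()
for p in range(N):
    S -= Ari[p] @ np.linalg.solve(Aii[p], Air[p])    # S = Arr - sum_p Ari Aii^-1 Air
    g -= Ari[p] @ np.linalg.solve(Aii[p], bi[p])

# Step 2: solve the interface problem S x_r = g (explicitly here; usually iteratively).
xr = np.linalg.solve(S, g)

# Step 3: postprocess to recover the interior unknowns of each subdomain.
xi = [np.linalg.solve(Aii[p], bi[p] - Air[p] @ xr) for p in range(N)]
```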

SALSA features. SALSA accepts matrices in various formats: assembled format (traditional compressed sparse row); elemental format (sum of small dense matrices); partitioned format (sum of N matrices in CSR format, one per subdomain). The number of processors P is independent of the number of subdomains N: P = N: one subdomain per processor (default); N > P: map multiple subdomains onto one processor (of interest for convergence studies of preconditioners); P > N: map multiple processors onto one subdomain (of interest for load balancing and optimal use of available resources). SALSA works with multiple internal Schur complement matrix formats: explicit, where the matrix S(p) is computed explicitly (e.g., by MUMPS); and implicit, where the action of S(p) is derived from its definition (three matrix-vector products and a forward/backward solve), as sketched below.
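
A sketch of the implicit option: the action of S on a vector is computed from its definition and handed to a Krylov method through SciPy's LinearOperator. The blocks are random stand-ins; in a real code the interior solve would be a forward/backward substitution with the subdomain's sparse factors.

```python
# Implicit Schur complement: S x = Arr x - Ari (Aii^{-1} (Air x)).
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
ni, nr = 40, 10
Aii = 10 * np.eye(ni) + rng.random((ni, ni))
Air = rng.random((ni, nr)); Ari = Air.T.copy()
Arr = 10 * np.eye(nr) + rng.random((nr, nr))
g   = rng.random(nr)

def apply_S(x):
    return Arr @ x - Ari @ np.linalg.solve(Aii, Air @ x)

S = LinearOperator((nr, nr), matvec=apply_S)
xr, info = gmres(S, g, atol=1e-12)       # info == 0 on convergence
```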

SALSA features. 1-level preconditioners include: Jacobi/diagonal; block (incomplete) LU; Neumann-Neumann (De Roeck & Le Tallec 1991). 2-level preconditioners: based on Balancing Neumann-Neumann (Mandel 1993); these require an approximation of the local null spaces in case S(p) is singular, either 1. computed algebraically (based on deriving the smallest eigenvalues), or 2. known a priori depending on the problem type (elasticity, etc.). Direct substructuring: the interface problem can also be solved with a direct method to provide maximum robustness.
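
In its simplest form, a one-level preconditioner such as Jacobi amounts to supplying an approximation of S^{-1} to the Krylov method; a small sketch follows (S is a random SPD stand-in, not a real interface operator):

```python
# Diagonal (Jacobi) preconditioning of the interface system S x = g with CG.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(2)
B = rng.random((10, 10))
S = B @ B.T + 10 * np.eye(10)            # SPD stand-in for the Schur matrix
g = rng.random(10)

d_inv = 1.0 / np.diag(S)                 # Jacobi: invert the diagonal of S
M = LinearOperator(S.shape, matvec=lambda x: d_inv * x)
x, info = cg(S, g, M=M, atol=1e-12)      # preconditioned CG
```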

SALSA example of usage: DNV tubular joint problem. Partitioned by DNV into 58 subdomains of varying sizes; mix of solid, shell, and transitional finite elements; 97,470 dofs, of which 3,920 on the interface.

nprocs | #iter | preproc (secs) | PCG (secs) | total (secs)
1      | 32    | 29.0           | 16.6       | 45.7
2      | 31    | 16.5           | 8.7        | 25.3
4      | 31    | 9.7            | 4.3        | 13.9
8      | 31    | 6.1            | 2.1        | 8.3
12     | 31    | 5.3            | 1.8        | 7.2

Non-optimal speedups for larger numbers of processors are primarily due to differences in domain sizes (a naive cyclic mapping of domains was used). Results obtained on an IBM p690 (Power4, 1.3 GHz).

SALSA on-going/future work. More robust 2-level preconditioners: collaboration with L. Giraud (CERFACS) and with J. Mandel (Univ. of Colorado). Automatic load balancing: needed when the number of subdomains differs from the number of processors, or when subdomains vary in size and work. Inclusion of more iterative schemes: primarily based on user requests. Pursue further collaboration with Simula: to integrate SALSA into a challenging application for simulating the electrical activity of the heart (talk by Xing Cai tomorrow).

Miscellaneous. MUMPS and/or SALSA are being tested/developed on problems like the following (example applications shown in figures).

How it fits together (software stack): the application calls SALSA; SALSA calls MUMPS; underneath sit ScaLAPACK, ARPACK, LAPACK, METIS, MPI, and BLAS.