Overview of MUMPS (A multifrontal Massively Parallel Solver)

Size: px

Start display at page:

Download "Overview of MUMPS (A multifrontal Massively Parallel Solver)"

Adam Edwards
5 years ago
Views:

1 Overview of MUMPS (A multifrontal Massively Parallel Solver) E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, J.-Y. L Excellent, T. Slavova, B. Uçar CERFACS, ENSEEIHT-IRIT, INRIA (LIP and LaBRI), ( or Petascale Applications, Algorithms and Programming (PAAAP) workshop, Toulouse June 24, Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 1

2 Outline Introduction-History 1. Introduction-History 2. Sparse Direct Methods and preprocessing 3. Parallel sparse solver - Memory issues 4. Conclusion-On going work. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 2

3 History Introduction-History objective : Solve Ax = b, where A is large and sparse. At the beginning : LTR ( Long Term Research) European project, from 1996 to 1999 Led to first public domain version Now : MUMPS is supported by CERFACS (Toulouse),ENSEEIHT-IRIT (Toulouse) and INRIA (Lyon, Bordeaux). Public Domain (available free of charge) 300,000 lines of F90/C/MPI code 1000 users, 3 requests per day. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 3

4 History Introduction-History Current development team : Patrick Amestoy, ENSEEIHT-IRIT (Toulouse) Alfredo Buttari, INRIA Lyon Philippe Combes, CNRS/ENS Lyon Abdou Guermouche, LaBRI-INRIA (Bordeaux) Jean-Yves L Excellent, INRIA Bora Uçar, CERFACS Phd Students Emmanuel Agullo, ENS-Lyon Tzvetomila Slavova, CERFACS (Toulouse).. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 4

5 Users Introduction-History NORTH AMERICA 31% OCEANIA 2% AFRICA SOUTH AMERICA < 1% 4% EASTERN EUROPE 6% ASIA 19% EUROPE 39%. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 5

6 Outline Sparse Direct Methods and preprocessing 1. Introduction-History 2. Sparse Direct Methods and preprocessing 3. Parallel sparse solver - Memory issues 4. Conclusion-On going work. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 6

7 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

8 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 JOB=-1 : initialize solver type (LU, LDL T ) and default parameters. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

9 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 JOB=1 : analyse the matrix, build an ordering, prepare factorization. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

10 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 JOB=2 : (parallel) numerical factorization A = LU E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

11 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 JOB=3 : (parallel) solution step forward and backward substitutions (Ly = b,ux = y). Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

12 Sparse Direct Methods and preprocessing MUMPS MUMPS solves large systems of linear equations of the form Ax=b by factorizing A into A=LU or LDLT. It uses a multifrontal technique which is a direct method. 3 main steps (plus initialization and termination) : JOB = 1 ANALYSIS JOB = 1 FACTORIZATION JOB = 2 SOLVE JOB = 3 JOB = 2 JOB=-2 : termination deallocate all MUMPS data structures. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 7

13 Sparse solver for Ax = b : only a black box? Preprocessing and postprocessing : Symmetric permutations to reduce fill :(Ax = b P AP t P x = b) Numerical pivoting, scaling to preserve numerical accuracy Maximum transversal (set large entries on the diagonal) Preprocessing for parallelism (influence of task mapping on parallelism) Iterative refinement, error analysis Default (often automatic/adaptive) setting of the options is available. However, a better knowledge of the options can help the user to further improve : memory usage, time for solution, numerical accuracy.

14 Sparse Direct Methods and preprocessing Preprocessing - illustration Original (A =lhr01) Preprocessed matrix (A (lhr01)) nz = nz = Modified Problem :A x = b with A = P n D r P AQP t D c. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 9

15 Sparse Direct Methods and preprocessing Impact of fill-reducing heuristics Number of operations (millions) METIS SCOTCH PORD AMF AMD gupta ship twotone wang xenon Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 10

16 Sparse Direct Methods and preprocessing Multifrontal method (Duff and Reid, 83) 1 2 A= 3 L+U I= Fill in Memory is divided into two parts (that can overlap in time) : the factors the active memory Factors Contribution block Factors Active frontal matrix Stack of contribution blocks Active Memory Elimination tree represents tasks dependencies. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 11

17 Sparse Direct Methods and preprocessing Impact of fill-reducing heuristics Peak of active memory for multifrontal approach ( 10 6 entries) METIS SCOTCH PORD AMF AMD gupta ship twotone wang xenon Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 12

18 Sparse Direct Methods and preprocessing Impact of fill-reducing heuristics Time for factorization (seconds) 1p 16p 32p 64p 128p coneshl METIS PORD audi METIS PORD Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 13

19 Sparse Direct Methods and preprocessing Preprocessing - Maximum weighted matching (I) Objective : Set large entries on the diagonal Unsymmetric permutation and scaling Preprocessed matrix B = D 1 AQD 2 is such that b ii = 1 and b ij 1 0 Original (A =lhr01) 0 Permuted (A = AQ) nz = nz = Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 14

20 Sparse Direct Methods and preprocessing Preprocessing - Maximum weighted matching (II) Influence of maximum weighted matching on the performance Matrix Symmetry LU Flops Backwd (10 6 ) (10 9 ) Error twotone OFF ON fidapm11 OFF ON On very unsymmetric matrices : reduce flops, factor size and memory used. In general improve accuracy, and reduce number of iterative refinements. Improve reliability of memory estimates. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 15

21 Sparse Direct Methods and preprocessing Preprocessing - Maximum weighted matching (II) Influence of maximum weighted matching on the performance Matrix Symmetry LU Flops Backwd (10 6 ) (10 9 ) Error twotone OFF ON fidapm11 OFF ON On very unsymmetric matrices : reduce flops, factor size and memory used. In general improve accuracy, and reduce number of iterative refinements. Improve reliability of memory estimates. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 15

22 Sparse Direct Methods and preprocessing Preprocessing symmetric matrices - Compressed ordering Perform an unsymmetric weighted matching Matched entry. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 16

23 Sparse Direct Methods and preprocessing Preprocessing symmetric matrices - Compressed ordering Perform an unsymmetric weighted matching Select matched entries Select Matched entry Matched entry Selected Matched entry. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 16

24 Sparse Direct Methods and preprocessing Preprocessing symmetric matrices - Compressed ordering Perform an unsymmetric weighted matching Select matched entries Symmetrically permute matrix to set large entries near diagonal j1 j2 j3 j4 j5 j6 j1 j4 j2 j3 j5 j6 Permute B = Qt A Q Selected entries. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 16

25 Sparse Direct Methods and preprocessing Preprocessing symmetric matrices - Compressed ordering Perform an unsymmetric weighted matching Select matched entries Symmetrically permute matrix to set large entries near diagonal Compression : 2 2 diagonal blocks become supervariables. Compress permuted matrix B. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 16

26 Sparse Direct Methods and preprocessing Preprocessing symmetric matrices - Compressed ordering Perform an unsymmetric weighted matching Select matched entries Symmetrically permute matrix to set large entries near diagonal Compression : 2 2 diagonal blocks become supervariables. Compress permuted matrix B Influence of using a compressed graph (with scaling) Total time Nb of entries in factors in Millions (seconds) (estimated) (effective) Compression : OFF ON OFF ON OFF ON cont cvxqp stokes Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 16

27 Outline Parallel sparse solver - Memory issues 1. Introduction-History 2. Sparse Direct Methods and preprocessing 3. Parallel sparse solver - Memory issues 4. Conclusion-On going work. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 17

28 Parallel sparse solver - Memory issues Out-of-core Solving sparse linear systems Ax = b : 10 M variables A = LU (Direct methods) Current limits : BRGM matrix variables non zeros in A non zeros in LU flops Physical constraint. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 18

29 Parallel sparse solver - Memory issues Out-of-core Solving sparse linear systems Ax = b : 10 M variables A = LU (Direct methods) Current limits : BRGM matrix variables non zeros in A non zeros in LU flops Physical constraint Core memory Memory required Memory crash. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 18

Parallel sparse solver - Memory issues Out-of-core Solving sparse linear systems Ax = b : 10 M variables A = LU (Direct methods) Current limits : BRGM matrix 3.

30 Parallel sparse solver - Memory issues Out-of-core Solving sparse linear systems Ax = b : 10 M variables A = LU (Direct methods) Current limits : BRGM matrix variables non zeros in A non zeros in LU flops Out-of-core Core memory Disks Memory required Use of disks. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 18

31 Parallel sparse solver - Memory issues Out-of-core solution phase For a symmetric case : Ax = b A = LDL T matrix pattern elimination tree L factors Location of factors on DISK Out-Of-Core solution phase (LDL T x = b) is often as costly as the factorization. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 19

32 Parallel sparse solver - Memory issues Out-of-core Basics Objective : Reduce time for solution of Ax = b in a parallel limited memory environment (OOC) Time performance of the solution phase is related to : the bandwidth and number of disk acesses for reading data the number of processors and volume of factors per processor the regularity in the disk accesses Scheduling can significantly improve the parallel performance of the solution Matrix name Order Nb entries Factor size Nb Nodes Origin (Millions) (MBytes) in the tree qimonda Qimonda AG cas4r-l EADS Publicly available matrices can be obtained on gridtlse.org web site. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 20

33 Parallel sparse solver - Memory issues Out-of-core Basics Objective : Reduce time for solution of Ax = b in a parallel limited memory environment (OOC) Time performance of the solution phase is related to : the bandwidth and number of disk acesses for reading data the number of processors and volume of factors per processor the regularity in the disk accesses Scheduling can significantly improve the parallel performance of the solution Matrix name Order Nb entries Factor size Nb Nodes Origin (Millions) (MBytes) in the tree qimonda Qimonda AG cas4r-l EADS Publicly available matrices can be obtained on gridtlse.org web site. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 20

34 Parallel sparse solver - Memory issues Performance study - I qimonda07 (Qimonda AG company) Strategy Nb of Factor Size Workspace Fwd Bwd Procs (MB) (MB) (sec) (sec) Standard New Standard New Standard New cas4r-l15 (EADS) Strategy Nb of Factor Size Workspace Fwd Bwd Procs (MB) (MB) (sec) (sec) Standard New Standard New Standard New E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 21

35 Parallel sparse solver - Memory issues Memory related to communication buffers Communication buffer size becomes critical in Out-of-core. Idea : send message by packets that fit in a small buffer of fixed size. Cost : more synchronizations (receiver must be ready to receive before we can send the next packet). Communication scheme Matrix Large buffers Small buffers AUDIKW CONESHL MOD CONV3D ULTRASOUND Tab.: Size of the communication buffers (MB) with 32 processors. Balance to be found to avoid too many synchronizations between senders and receivers.. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 22

36 Parallel sparse solver - Memory issues Memory related to communication buffers Communication buffer size becomes critical in Out-of-core. Idea : send message by packets that fit in a small buffer of fixed size. Cost : more synchronizations (receiver must be ready to receive before we can send the next packet). Communication scheme Matrix Large buffers Small buffers AUDIKW CONESHL MOD CONV3D ULTRASOUND Total IC : 1260 Tab.: Size of the communication buffers (MB) with 32 processors. Balance to be found to avoid too many synchronizations between senders and receivers. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 22

37 Parallel sparse solver - Memory issues Memory related to communication buffers Communication buffer size becomes critical in Out-of-core. Idea : send message by packets that fit in a small buffer of fixed size. Cost : more synchronizations (receiver must be ready to receive before we can send the next packet). Communication scheme Matrix Large buffers Small buffers AUDIKW CONESHL MOD CONV3D ULTRASOUND Total OOC : 800 Tab.: Size of the communication buffers (MB) with 32 processors. Balance to be found to avoid too many synchronizations between senders and receivers. E. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 22

38 Parallel sparse solver - Memory issues Preliminary work : scheduling unfluences peak memory needed Modify tree mapping to reduce the minimum memory requirement in parallel executions. Factors Performance oriented Memory oriented In Core Max Avg Out-Of-Core Max Avg Tab.: Estimated total memory (MB) - AUDIKW 1 matrix - 16 processors.. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 23

39 Outline Conclusion-On going work 1. Introduction-History 2. Sparse Direct Methods and preprocessing 3. Parallel sparse solver - Memory issues 4. Conclusion-On going work. Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 24

40 Conclusion-On going work Conclusion-On going work Future functionalities (on-going work) Out-of-core continuing Parallel analysis phase (orderings, scalings) Rank-detection and null space basis Scheduling for memory (OOC, Fault-Tolerance) Multi-core : (Flops not best metric to minimize) On-going projects ANR CIS-SOLSTICE (ANR-06-CIS6-010) France-Berkeley (2008-) REDIMPS : CNRS/JST (Japan) cooperation project ( ) SEISCOPE (http ://www-geoazur.unice.fr/seiscope). Agullo, P. Amestoy, A. Buttari, P. Combes, A. Guermouche, Overview J.-Y. of L Excellent, MUMPS (AT. multifrontal Slavova, B. Massively Uçar Parallel Solver) 25

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future