2 Fundamentals of Serial Linear Algebra
- Philip Lloyd
Contents

2.1 Direct Solution of Linear Systems
  2.1.1 Gaussian Elimination
  2.1.2 LU Decomposition and FBS
  2.1.3 Cholesky Decomposition
  2.1.4 Multifrontal Methods
2.2 Iterative Solution of Linear Systems
  2.2.1 Jacobi Method
  2.2.2 Preconditioned Conjugate Gradient Method (PCG)
2.3 Comparison of Direct and Iterative Methods
The solution of linear systems plays a major role in the FEM; in linear static analyses, for example, it is still the most expensive part of the whole analysis. The task is to solve a linear system of equations of the form A X = B, where

- A ∈ R^(n×n): coefficient matrix (e.g. stiffness matrix)
- B ∈ R^(n×m): right-hand side vectors (e.g. load vectors)
- X ∈ R^(n×m): solution vectors to be computed (e.g. displacement vectors)

The best solution technique depends on the properties of the linear system, for example:

- sparse or dense coefficient matrix A
- symmetric or unsymmetric A
- number of right-hand sides m
- size of the system n (small, medium, large, ...)
- nonzero pattern of the coefficient matrix A, for example banded matrix
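Since sparsity is the most important of these properties in practice, it helps to see how a sparse matrix is actually stored. The following sketch (illustrative Python with made-up helper names, not from any FE code; production solvers use tuned variants of the same idea) shows the widely used compressed sparse row (CSR) layout and a matrix-vector product over it:

```python
# Sketch: compressed sparse row (CSR) storage and y = A x over it.
# Only the nonzero terms are stored and touched.

def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_index, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)     # nonzero value
                col_index.append(j)  # its column
        row_ptr.append(len(values))  # end of this row's entries
    return values, col_index, row_ptr

def csr_matvec(values, col_index, row_ptr, x):
    """y = A x, costing roughly 2*nnz flops."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_index[k]]
    return y

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 2.0]]
vals, cols, ptr = to_csr(A)
print(csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0]))  # [5.0, 4.0, 2.0]
```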
Even linear systems arising from the FEM can have very different characteristics, depending on the application area, for example:

- in linear statics up to 6 dofs per grid, in heat transfer usually 1 dof per grid
- element types: denser matrices with solid elements (TETRA, HEXA, etc.) than with 2D elements (TRIA, QUAD, etc.), because more grids are connected with each other within each element

[Figure: nonzero pattern for a mesh of QUAD4 elements, 1 dof per grid; black: element ids, blue: grid ids; density 46%]
[Figure: nonzero pattern for a mesh of HEXA8 elements, 1 dof per grid; black: element ids, blue: grid ids; density 78%]
Linear systems arising from linear static analysis are usually:

- sparse
- symmetric
- solved for a small number of right-hand sides, often m = 1
- marked by a higher concentration of nonzero terms around the diagonal, but not necessarily of banded structure
- positive definite: v^T A v > 0 for all v ≠ 0. This holds due to physics: remember that u^T K u = ∫_V σ^T ε dV was the energy built up from strains and stresses (strain energy); if the model is properly defined and fixed, any displacement u ≠ 0 should result in a positive strain energy.

But even linear systems arising from linear static analysis can vary significantly, for example:

- in density, because of element types
- in size, because of model size (number of grids and elements)
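The positive-definiteness argument can be made concrete with a tiny sketch (illustrative Python; the two-element bar model and the helper name are hypothetical, chosen only to demonstrate the claim): for a properly fixed model, the quadratic form u^T K u is positive for every nonzero u.

```python
# Sketch: the quadratic form u^T K u (twice the strain energy) for a toy
# 1D bar stiffness matrix; positive for every nonzero u once the model
# is fixed (here: left end clamped, so its dof is eliminated).

def quad_form(K, u):
    """Return u^T K u."""
    n = len(u)
    return sum(u[i] * K[i][j] * u[j] for i in range(n) for j in range(n))

# Two-element bar with unit element stiffness, left end fixed:
K = [[2.0, -1.0],
     [-1.0, 1.0]]
for u in ([1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [-3.0, 1.0]):
    assert quad_form(K, u) > 0.0  # strain energy is positive

print(quad_form(K, [1.0, 2.0]))  # 2.0
```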
Example: piston

  FE model:                          Stiffness matrix:
  # grids      9,90                  # rows       28,590
  # elements   43,084 TETRA4         # nonzeros   1,089,722
  # dofs       28,590                density      0.133 %
  # loads      1                     # RHSs       1
2.1 Direct Solution of Linear Systems

2.1.1 Gaussian Elimination
2.1.2 LU Decomposition and FBS
2.1.3 Cholesky Decomposition

- see derivation on board
- the basic dense algorithm is not acceptable in terms of run time and memory/disk requirements, even for this small example; exploitation of sparsity is needed
- possible solutions:
  - solvers exploiting bandedness. Idea: perform computations only within a band; a matrix can be transformed into pseudo-banded form by a suitable permutation matrix P (resequencing): P^T A P
  - multifrontal methods (see 2.1.4)
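The dense algorithm whose costs are criticized above (referred to below as procedure CHOLESKY; its derivation is on the board) can be sketched as follows — a plain O(n^3) factorization A = L L^T that ignores sparsity entirely. The Python rendering is illustrative, not the board's exact pseudocode:

```python
import math

def cholesky(A):
    """Dense Cholesky: factor a symmetric positive definite A into L L^T,
    L lower triangular. Every entry is visited, sparse or not."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):                      # eliminate row/column k
        s = A[k][k] - sum(L[k][j] ** 2 for j in range(k))
        L[k][k] = math.sqrt(s)              # fails (s <= 0) if A is not SPD
        for i in range(k + 1, n):
            L[i][k] = (A[i][k] - sum(L[i][j] * L[k][j] for j in range(k))) / L[k][k]
    return L

A = [[4.0, 2.0],
     [2.0, 5.0]]
print(cholesky(A))  # [[2.0, 0.0], [1.0, 2.0]]
```

For an n × n dense matrix this needs about n^3/3 operations and n^2/2 stored terms, which is exactly what becomes unaffordable for FEM-sized systems.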
2.1.4 Multifrontal Methods

- idea: exploit sparsity (example matrix on slide)
- during the decomposition of a sparse matrix A it is frequently observed that some rows can be eliminated independently
- this is due to the fact that the elimination of a row k creates a contribution to a row i only if the term in row k and column i of the transposed Cholesky factor L^T is not equal to 0, that means if l_ik ≠ 0
- the resulting partial ordering of the rows is usually represented by an elimination tree
- in an elimination tree, each row k of the linear system to be solved is represented as a node
- if a node i is an ancestor of node k in the tree, then row k must be eliminated before row i
- formally, the elimination tree is defined as a directed graph: if A is an n×n matrix, we define

  T(A) = (V_T, E_T),  V_T: vertices, E_T: edges
  V_T = { v_k : k ∈ {1, ..., n} }
  E_T = { (v_i, v_k) ∈ V_T × V_T : i = min{ q ∈ {k+1, ..., n} : l_qk ≠ 0 } }

  where l_qk is again the term in row q and column k of the Cholesky factor L, and thus the term in row k and column q of L^T
- this definition means that v_i is the parent of v_k if and only if i is the column index of the first offdiagonal term in row k of the transposed Cholesky factor L^T
- this definition is intuitive: if column i contains the first offdiagonal term in row k of L^T, then row i is the first row below row k to which the elimination of row k creates contributions; therefore it makes sense to define v_i as the closest ancestor, which means parent, of v_k in the elimination tree
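As a sketch of this definition (illustrative Python with a hypothetical 5×5 pattern; production solvers build the tree directly from A by symbolic analysis, without forming L first), the parent relation can be read straight off the nonzero pattern of L:

```python
# Sketch: build the elimination tree from the nonzero pattern of the
# Cholesky factor L: parent(k) = min{ q > k : l_qk != 0 }, i.e. the row
# index of the first off-diagonal nonzero in column k of L.

def elimination_tree(L_pattern):
    """L_pattern[q] is the set of column indices j with l_qj != 0."""
    n = len(L_pattern)
    parent = [None] * n                    # None marks a root
    for k in range(n):
        below = [q for q in range(k + 1, n) if k in L_pattern[q]]
        if below:
            parent[k] = min(below)         # first row that row k feeds
    return parent

# Hypothetical factor pattern (diagonal entries always present):
L_pattern = [{0}, {1}, {0, 2}, {1, 3}, {2, 3, 4}]
print(elimination_tree(L_pattern))  # [2, 3, 4, 4, None]
```

Here rows 0 and 1 are leaves and can be eliminated independently, exactly the parallelism the multifrontal method exploits.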
[Figure: elimination tree for our 9×9 sample matrix]

- once the elimination tree has been created, the algorithm for the multifrontal matrix decomposition can be described
- the multifrontal decomposition executes a bottom-up traversal of the elimination tree
- for each node s we create a dense nfront(s)×nfront(s) submatrix, where nfront(s) is the number of nonzero terms in row s of L^T
- this submatrix is called front s; nfront(s) is the corresponding front size
- for symmetric matrices we store only the upper (or lower) triangle
- the created front is initialized with 0s
- for a leaf node, the next step is to fill the first row of front s with the terms a_{s,j} ≠ 0
- after that, the first row of front s is eliminated by applying an algorithm similar to procedure CHOLESKY to the front, except that the outermost loop is only executed for k = 1 (only the first row is eliminated)
- after this elimination, row 1 of front s is equal to row s of L^T (and column s of L, respectively)
- the remaining rows of front s have to be passed to the parent node as the contributions of row s to other matrix rows which have not been eliminated yet
- nonleaf nodes s are processed similarly, except that between the initialization of front s with the terms a_{s,j} ≠ 0 and its elimination, the contributions of the fronts of the children in the elimination tree have to be assembled into front s
[Figure: assembly and elimination of fronts 1 and 2 — row/column of the factor and the contributions from the elimination of rows 1 and 2 — followed by the assembly of front 3]

- this procedure is continued until the root is reached
- used in MSC/NASTRAN for the direct solution of linear systems
Example: multifrontal decomposition in MSC/NASTRAN for the piston model:

  3:36:50 : SEKRRS 7 DCMP BEGN
  *** USER INFORMATION MESSAGE 4157 (DFMSYN)
      PARAMETERS FOR SPARSE DECOMPOSITION OF DATA BLOCK KLL ( TYPE=RDP ) FOLLOW
      MATRIX SIZE                  =     28590 ROWS
      NUMBER OF NONZEROES          =           TERMS
      NUMBER OF ZERO COLUMNS       =         0
      NUMBER OF ZERO DIAGONAL TERMS=         0
      CPU TIME ESTIMATE            =        09 SEC
      I/O TIME ESTIMATE            =           SEC
      MINIMUM MEMORY REQUIREMENT   =       377 K WORDS
      MEMORY AVAILABLE             =      8560 K WORDS
      MEMORY REQR'D TO AVOID SPILL =       963 K WORDS
      EST. INTEGER WORDS IN FACTOR =       965 K WORDS
      EST. NONZERO TERMS           =      5452 K TERMS
      ESTIMATED MAXIMUM FRONT SIZE =       966 TERMS
      RANK OF UPDATE               =         6
  3:36:58 : SPDC BGN TE=09
  3:37:35 : SPDC END
  *** USER INFORMATION MESSAGE 6439 (DFMSA)
      ACTUAL MEMORY AND DISK SPACE REQUIREMENTS FOR SPARSE SYM. DECOMPOSITION
      SPARSE DECOMP MEMORY USED    =       963 K WORDS
      MAXIMUM FRONT SIZE           =       966 TERMS
      INTEGER WORDS IN FACTOR      =       262 K WORDS
      NONZERO TERMS IN FACTOR      =      5452 K TERMS
      SPARSE DECOMP SUGGESTED MEMORY =     905 K WORDS
  3:37:35 : SEKRRS DCMP END

- CPU time of decomposition: 8 seconds
- factor size: 5.5 mio nonzeros, 42.6 MB
- maximum front size: maximum number of nonzeros in a column of L
- number of FLOPs: decomp: , FBS:
- enormous savings if compared to dense algorithms
2.2 Iterative Solution of Linear Systems

2.2.1 Jacobi Method
2.2.2 Preconditioned Conjugate Gradient Method (PCG)

- belongs to the nonstationary methods
- nonstationary methods use projection or direction vectors or other search algorithms to obtain updated approximate solutions
- sketch of the derivation of the CG method: basic idea: try to find a new approximate solution vector x(i+1) which minimizes the functional F(x) = ½ x^T A x − x^T b; minimization of F will decrease the residual and make x(i+1) converge
CG algorithm: [listing on slide, not reproduced in the transcription]
PCG algorithm: [listing on slide, not reproduced in the transcription]
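The PCG listing itself did not survive the transcription; as a sketch (illustrative pure Python on a dense SPD matrix, not MSC/NASTRAN's implementation), a Jacobi-preconditioned CG iteration might look like this:

```python
# Sketch: PCG with Jacobi (diagonal) preconditioning, z = D^-1 r.

def pcg_jacobi(A, b, tol=1e-10, max_iter=200):
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = list(b)                                 # r = b - A x with x = 0
    z = [r[i] / A[i][i] for i in range(n)]      # preconditioning step
    p = list(z)                                 # first direction vector
    rz = dot(r, z)
    for it in range(max_iter):
        Ap = [dot(row, p) for row in A]         # matrix-vector product
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, it + 1
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = pcg_jacobi(A, b)   # exact solution: x = [1/11, 7/11]
```

Each pass through the loop contains exactly the operations counted on the next slide: one matrix-vector product, three dot products, three scaled vector updates, and one diagonal preconditioning step.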
- the PCG method is the basis of almost every effective iterative solver found in commercial finite element programs today; they vary mainly in the applied preconditioning techniques
- example: running the iterative solver in MSC/NASTRAN with Jacobi preconditioning:
  - add NASTRAN ITER=YES on top of the data deck
  - add SMETHOD=<SID> in the case control section
  - add an ITER <SID> entry with PRECOND=J MSGFLG=YES in the bulk data section
- our piston with the iterative solver: nastran pist0000it mem=0m scr=yes
- convergence history in the f06 file:

  *** USER INFORMATION MESSAGE 6447 (SITDRV)
      ITERATIVE SOLVER DIAGNOSTIC OUTPUT
      MXY FITS INCORE
      EPS : E-06
      JACOBI PRECONDITIONING
      [table: iteration number / convergence ratio / norm of residual]
Convergence history in the f06 file (cont'd):

  [table cont'd: iteration number / convergence ratio / load number; the solver converges after 413 iterations]

- 413 iterations
- the effort in each iteration with Jacobi preconditioning is dominated by the matrix-vector multiplication:
  - 1 matrix-vector product: 2*nnz − n FLOPs
  - 3 dot products: 3*(2*n − 1) FLOPs
  - 3 scaled vector updates: 3*(2*n) FLOPs
  - Jacobi preconditioning step: n FLOPs
- for i iterations: approx. i*(2*nnz + 12*n) FLOPs
- in the piston example: number of FLOPs approx. 413*(2*1,089,722 + 12*28,590) = 1,041,802,412 ≈ 1.04e9 FLOPs
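The per-iteration counts above can be wrapped in a small helper (illustrative Python; the function name is made up). It uses the exact counts, which differ from the slide's rounded estimate 2*nnz + 12*n only by the constant −3:

```python
# Sketch: exact flop count of one Jacobi-preconditioned CG iteration,
# following the operation counts listed above.

def pcg_flops_per_iter(n, nnz):
    matvec = 2 * nnz - n          # one sparse matrix-vector product
    dots = 3 * (2 * n - 1)        # three dot products
    axpys = 3 * (2 * n)           # three scaled vector updates
    precond = n                   # Jacobi: one divide per unknown
    return matvec + dots + axpys + precond   # = 2*nnz + 12*n - 3

# Piston model: n = 28,590 rows, nnz = 1,089,722 terms
print(pcg_flops_per_iter(28590, 1089722))  # 2522521, i.e. about 2.5 MFLOP
```

Multiplying by 413 iterations reproduces the roughly 1.04e9 FLOPs quoted above.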
The f04 file pist0000it.f04 shows:

  :0:38 : STATRS 56 SOLVIT BEGN
  *** SYSTEM INFORMATION MESSAGE 4157 (SITDRV)
      PARAMETERS FOR THE ITERATIVE SOLUTION WITH DATA BLOCK KLL (TYPE = RDP ) FOLLOW
      MATRIX SIZE       =     28590 ROWS        DENSITY          =
      STRING LENGTH     =       4.9 AVG         NUMBER OF STRINGS =      59 K
      NONZERO TERMS     =      1089 K           FULL BAND WIDTH  =      548 AVG
      MEMORY AVAILABLE  =      8560 K WORDS     MIN MEMORY NEEDED =      48 K WORDS
      NUMBER OF RHS     =         1             NUMBER OF PASSES =        1
      OPTIMAL MEMORY    =         0 K WORDS     PREFACE CPU TIME =     0.00 SECONDS
      AVG. CPU/ITER     =           SECONDS
  ::5 : STATRS SOLVIT END

- 68.1 CPU seconds
- average CPU performance: 1,041,802,412 FLOP / 68.1 sec ≈ 15.3 MFLOP/sec
- why is the MFLOP/sec rate so low? The dominating operation is a sparse matrix-vector multiplication:
  - low data locality: the ratio of data transfer from/to memory to the number of operations is high
  - indexed operations (supported by special hardware in vector supercomputers!)
2.3 Comparison of Direct and Iterative Methods

Advantages of direct methods:

- robust: they deliver a solution for any properly defined finite element model
- easy to use: they can be used as black-box solvers, without the need to select special parameters
- if a linear system with multiple right-hand sides has to be solved, one (expensive) decomposition followed by multiple (cheap) FBSes is sufficient
- high data locality: the ratio of data transfer to the number of operations is low
  - good for modern computer architectures (like RISC with cache memory); highly tuned kernels can be used, e.g. BLAS
  - in the piston example, an average of 46.4 MFLOP/sec is achieved on an HP Omnibook for the multifrontal decomposition of the matrix
Disadvantages of direct methods:

- basic algorithms (e.g. dense Cholesky decomposition) are not suitable for the very large, sparse matrices arising from the FEM; sophisticated algorithms are required, for example multifrontal methods
- high number of operations, for example 30 MFLOP for the piston with the direct multifrontal solution (1,042 MFLOP for the iterative solution with simple Jacobi preconditioning!)
- the computed matrix factor (Cholesky factor) can grow very large. In the small piston example, the data for the matrix:
  - for each nonzero term we store the double precision numerical value (8 bytes) plus one integer for the row position (4 bytes)
  - storage of the upper (or lower) triangle including the diagonal is sufficient for a symmetric matrix
  - in total: (1,089,722 + 28,590)/2 · 12 bytes / (1,024 · 1,024) ≈ 6.4 MB
The data for the factor, from the f04 file (UIM 6439):

- integer words in factor: 262,000
- nonzero terms in factor: 5,452,000
- in total: (5,452,000 · 8 + 262,000 · 4) / (1,024 · 1,024) ≈ 42.6 MB
- approx. 6.7 times more than the amount of data in the matrix, due to fill-in

High amount of I/O:

- the factor is written to disk during the decomposition (absolutely necessary for large matrices)
- the factor is read twice in each FBS (once forward, once backward)
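The byte counting above (8-byte double value plus 4-byte row index per stored term, one triangle of a symmetric matrix) can be captured in a small helper (illustrative Python; the function name is an assumption, not from any tool):

```python
# Sketch: storage estimate for one stored triangle of a symmetric sparse
# matrix or factor: 8 bytes (double value) + 4 bytes (integer row index)
# per stored term, reported in MB (1 MB = 1024 * 1024 bytes).

def triangle_storage_mb(stored_terms):
    return stored_terms * (8 + 4) / (1024 * 1024)

# e.g. one million stored terms:
print(round(triangle_storage_mb(1_000_000), 1))  # 11.4
```

Applying it to the piston's (1,089,722 + 28,590)/2 = 559,156 stored matrix terms reproduces the roughly 6.4 MB quoted above; the factor's 5.45 million terms show directly why fill-in dominates the disk budget.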
Advantages of iterative methods:

- the number of operations is often lower than with direct methods; in the FEM this is in general true for solid models, i.e. models built from tetrahedra, hexahedra and wedges
- no fill-in, at least not for simple preconditioning techniques like Jacobi
- storage requirements are dominated by the memory for the matrix
- low or even no I/O traffic during the iterations if the matrix fits into memory
- iterative methods are usually the best method for solid models with quadratic elements (TETRA10, HEXA20, etc.)

Disadvantages of iterative methods:

- less robust: often convergence problems with shell models like car bodies (shell elements are, for example, quadrilateral and triangular elements)
Example: van body on IBM RS/ H

  # grids      9,066
  # elements   6,874 QUAD4; ,77 TRIA3; 57 BAR; 5 ELAS
  # dofs       47,07
  # nzts       ,336,904
  # loads      1

- with PCG and Jacobi preconditioning: 36,50 iterations, 38,96 seconds! (note the influence of round-off errors; in theory n = 47,07 iterations)
- direct multifrontal solver: 79 seconds

In FEM analysis, direct methods are therefore still preferred for shell element models like car bodies, planes, etc.
Disadvantages of iterative methods (cont'd):

- lower MFLOP/sec rates; therefore, in many cases where the number of operations would be lower than with direct methods, direct methods are still faster; this happens often with linear solid elements (TETRA4, HEXA8)
- careful selection of preconditioners is required:
  - the more elaborate the preconditioner, the lower the number of iterations
  - but the effort to compute the preconditioner and its storage requirements go up, and the preconditioning step gets more expensive
- example: block incomplete Cholesky preconditioner (BIC) in MSC/NASTRAN for the piston (P is computed by an incomplete decomposition of A; fill-in is partially ignored):

             #iterations   CPU time    memory
  PCG+J                         sec    0 KW = 8.4 MB
  PCG+BIC                       sec    3857 KW = 4.7 MB
- now faster than the direct solution!
- iterative solvers usually cannot be used as black-box solvers yet
- with a number of right-hand sides m > 1, the iterative algorithm usually has to be repeated for each RHS:
  - the number of operations increases by a factor of m
  - the increase in computation time is lower, since data locality is higher (the algorithms work on multiple vectors simultaneously)
  - note: so-called projection methods, which exploit the existence of multiple RHSs to find better direction vectors p (searching into multiple directions simultaneously), can improve the situation, but are not discussed here

Summing up: in the FEM, iterative methods often result in a lower number of operations for solid models and require less (disk) storage, but are more difficult to apply and require numerical background knowledge from the engineer.
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationRobot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss
Robot Mapping Least Squares Approach to SLAM Cyrill Stachniss 1 Three Main SLAM Paradigms Kalman filter Particle filter Graphbased least squares approach to SLAM 2 Least Squares in General Approach for
More informationGraphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General
Robot Mapping Three Main SLAM Paradigms Least Squares Approach to SLAM Kalman filter Particle filter Graphbased Cyrill Stachniss least squares approach to SLAM 1 2 Least Squares in General! Approach for
More informationParallel resolution of sparse linear systems by mixing direct and iterative methods
Parallel resolution of sparse linear systems by mixing direct and iterative methods Phyleas Meeting, Bordeaux J. Gaidamour, P. Hénon, J. Roman, Y. Saad LaBRI and INRIA Bordeaux - Sud-Ouest (ScAlApplix
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More information1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3
6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationPARDISO Version Reference Sheet Fortran
PARDISO Version 5.0.0 1 Reference Sheet Fortran CALL PARDISO(PT, MAXFCT, MNUM, MTYPE, PHASE, N, A, IA, JA, 1 PERM, NRHS, IPARM, MSGLVL, B, X, ERROR, DPARM) 1 Please note that this version differs significantly
More informationIterative Sparse Triangular Solves for Preconditioning
Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations
More informationComputational Fluid Dynamics - Incompressible Flows
Computational Fluid Dynamics - Incompressible Flows March 25, 2008 Incompressible Flows Basis Functions Discrete Equations CFD - Incompressible Flows CFD is a Huge field Numerical Techniques for solving
More informationResearch Article A PETSc-Based Parallel Implementation of Finite Element Method for Elasticity Problems
Mathematical Problems in Engineering Volume 2015, Article ID 147286, 7 pages http://dx.doi.org/10.1155/2015/147286 Research Article A PETSc-Based Parallel Implementation of Finite Element Method for Elasticity
More informationComputational issues in linear programming
Computational issues in linear programming Julian Hall School of Mathematics University of Edinburgh 15th May 2007 Computational issues in linear programming Overview Introduction to linear programming
More information(Sparse) Linear Solvers
(Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert
More informationEfficient Use of Iterative Solvers in Nested Topology Optimization
Efficient Use of Iterative Solvers in Nested Topology Optimization Oded Amir, Mathias Stolpe and Ole Sigmund Technical University of Denmark Department of Mathematics Department of Mechanical Engineering
More informationA parallel direct/iterative solver based on a Schur complement approach
A parallel direct/iterative solver based on a Schur complement approach Gene around the world at CERFACS Jérémie Gaidamour LaBRI and INRIA Bordeaux - Sud-Ouest (ScAlApplix project) February 29th, 2008
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationLeast-Squares Fitting of Data with B-Spline Curves
Least-Squares Fitting of Data with B-Spline Curves David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International
More informationFinite Element Modeling Techniques (2) دانشگاه صنعتي اصفهان- دانشكده مكانيك
Finite Element Modeling Techniques (2) 1 Where Finer Meshes Should be Used GEOMETRY MODELLING 2 GEOMETRY MODELLING Reduction of a complex geometry to a manageable one. 3D? 2D? 1D? Combination? Bulky solids
More informationPreconditioning for linear least-squares problems
Preconditioning for linear least-squares problems Miroslav Tůma Institute of Computer Science Academy of Sciences of the Czech Republic tuma@cs.cas.cz joint work with Rafael Bru, José Marín and José Mas
More informationSolid and shell elements
Solid and shell elements Theodore Sussman, Ph.D. ADINA R&D, Inc, 2016 1 Overview 2D and 3D solid elements Types of elements Effects of element distortions Incompatible modes elements u/p elements for incompressible
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationTHE application of advanced computer architecture and
544 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 45, NO. 3, MARCH 1997 Scalable Solutions to Integral-Equation and Finite-Element Simulations Tom Cwik, Senior Member, IEEE, Daniel S. Katz, Member,
More informationApplication of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures
Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application
More informationHow to use FEKO with Altair HyperMesh
How to use FEKO with Altair HyperMesh This How To applies to: FEKO Suite 6.2, HyperMesh 11.0 Users who would like to make use of the benefits of the advanced meshing features of Altair HyperMesh while
More informationUppsala University Department of Information technology. Hands-on 1: Ill-conditioning = x 2
Uppsala University Department of Information technology Hands-on : Ill-conditioning Exercise (Ill-conditioned linear systems) Definition A system of linear equations is said to be ill-conditioned when
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationEmpirical Complexity of Laplacian Linear Solvers: Discussion
Empirical Complexity of Laplacian Linear Solvers: Discussion Erik Boman, Sandia National Labs Kevin Deweese, UC Santa Barbara John R. Gilbert, UC Santa Barbara 1 Simons Institute Workshop on Fast Algorithms
More informationAnalysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms
Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms H. Anzt, V. Heuveline Karlsruhe Institute of Technology, Germany
More informationCombinatorial problems in a Parallel Hybrid Linear Solver
Combinatorial problems in a Parallel Hybrid Linear Solver Ichitaro Yamazaki and Xiaoye Li Lawrence Berkeley National Laboratory François-Henry Rouet and Bora Uçar ENSEEIHT-IRIT and LIP, ENS-Lyon SIAM workshop
More informationExample 24 Spring-back
Example 24 Spring-back Summary The spring-back simulation of sheet metal bent into a hat-shape is studied. The problem is one of the famous tests from the Numisheet 93. As spring-back is generally a quasi-static
More informationSparse matrices, graphs, and tree elimination
Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);
More informationBDDCML. solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) Jakub Šístek version 1.
BDDCML solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) 2010-2012 Jakub Šístek version 1.3 Jakub Šístek i Table of Contents 1 Introduction.....................................
More informationAMS527: Numerical Analysis II
AMS527: Numerical Analysis II A Brief Overview of Finite Element Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao SUNY Stony Brook AMS527: Numerical Analysis II 1 / 25 Overview Basic concepts Mathematical
More informationLibraries for Scientific Computing: an overview and introduction to HSL
Libraries for Scientific Computing: an overview and introduction to HSL Mario Arioli Jonathan Hogg STFC Rutherford Appleton Laboratory 2 / 41 Overview of talk Brief introduction to who we are An overview
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationA Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography
1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography
More informationComputational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science
Computational Methods CMSC/AMSC/MAPL 460 Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science Some special matrices Matlab code How many operations and memory
More information