High Performance Matrix Computations


High Performance Matrix Computations

J.-Y. L'Excellent (INRIA/LIP-ENS Lyon), Jean-Yves.L.Excellent@ens-lyon.fr
Prepared in collaboration with P. Amestoy, L. Giraud and M. Daydé (ENSEEIHT-IRIT)

Contents (lectures related to sparse direct methods)

1. Introduction to Sparse Matrix Computations
   - Motivation and main issues
   - Sparse matrices
   - Gaussian elimination
   - Symmetric matrices and graphs
2. Ordering sparse matrices
   - Objectives/outline
   - Fill-reducing orderings
   - Impact of the fill-reduction algorithm on the shape of the tree
   - Postorderings and memory usage
   - Equivalent orderings and elimination trees
   - Reordering unsymmetric matrices to special forms
   - Combining approaches
3. Factorization of sparse matrices
   - Introduction
   - Elimination tree and multifrontal approach
   - Comparison between approaches for LU factorization
   - Task mapping and scheduling
   - Distributed memory approaches
   - Some parallel solvers; case study: comparison of MUMPS and SuperLU
   - Concluding remarks
   - Related topics

Introduction to Sparse Matrix Computations

A selection of references

Books:
- Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford.
- Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM.
- George, Liu and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear.

Articles:
- Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX.
- Liu, The role of elimination trees in sparse factorization, SIMAX.
- Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review.

Motivation and main issues

Motivations:
- The solution of linear systems of equations is a key algorithmic kernel.
- Main parameters: continuous problem -> discretization -> solution of a linear system Ax = b.
- Numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...).
- Size and structure: large, square or rectangular, dense or sparse (structured or unstructured).
- Target computer (sequential/parallel).
Algorithmic choices are critical.

Motivations for designing efficient algorithms:
- Time-critical applications.
- Solve larger problems.
- Decrease elapsed time (parallelism?).
- Minimize the cost of computations (time, memory).

Difficulties:
- Access to data:
  - Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk).
  - Sparse matrix: large, irregular, dynamic data structures.
  - Exploit the locality of references to data on the computer (design algorithms providing such locality).
- Efficiency (time and memory):
  - The number of operations and the memory needed depend very much on the algorithm used and on the numerical and structural properties of the problem.
  - The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), GRID).
Algorithmic choices are critical.

Sparse matrices

Example: a small system of three equations in the three unknowns x1, x2, x3 can be represented as Ax = b, where A holds the coefficients, x = (x1, x2, x3)^T and b is the right-hand side. Sparse matrix: only the nonzeros are stored.

What is a sparse matrix? Example: matrix dwt 9.rua (structural analysis of a submarine), shown together with its nonzero pattern.

Factorization process: solution of Ax = b.
- A unsymmetric: A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix. Forward-backward substitution: solve Ly = b, then Ux = y.
- A symmetric: A = LDL^T or A = LL^T.
- A rectangular, m x n with m >= n, and min_x ||Ax - b||: A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is upper triangular. Solve y = Q^T b, then Rx = y.

Difficulties:
- Only nonzero values are stored.
- The factors L and U have far more nonzeros than A.
- Data structures are complex.
- Computations are only a small portion of the code (the rest is data manipulation).
- Memory size is a limiting factor -> out-of-core solvers.

Key numbers:
- Average size: matrices measured in MB, factors in GB, flop counts in Gflops.
- A bit more challenging (Lab. Géosciences Azur, Valbonne): a complex matrix whose storage is several GB (more with the factors), with tens of teraflops of computation.
- Typical performance (MUMPS): on the order of a Gflop/s on a Linux PC; parallel speed-ups and higher rates on a Cray T3E with many processors.

Typical test problems (MSC.Software):
- BMW car body: hundreds of thousands of unknowns and millions of nonzeros; factors with tens of millions of entries and operation counts in the billions.
- BMW crankshaft: over a hundred thousand unknowns and millions of nonzeros.

Sources of parallelism. Several levels of parallelism can be exploited:
- At problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition).
- At matrix level: parallelism arising from the sparse structure.
- At submatrix level: within dense linear algebra computations (parallel BLAS, ...).

Data structures for sparse matrices. The storage scheme depends on the pattern of the matrix and on the type of access required:
- band or variable-band matrices;
- block bordered or block tridiagonal matrices;
- general matrices;
- row, column or diagonal access.

Data formats for a general sparse matrix A. What needs to be represented:
- Assembled matrices: an M x N matrix A with NNZ nonzeros.
- Elemental matrices (unassembled): an M x N matrix A with NELT elements.
- Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes).
- Symmetric (or Hermitian): store only part of the data.
- Distributed format? Duplicate entries and/or out-of-range values?

Classical data formats for assembled matrices (illustrated on a small matrix with entries a1, ..., a5):
- Coordinate format: three arrays IRN[1:NNZ] (row indices), JCN[1:NNZ] (column indices), VAL[1:NNZ] (values).
- Compressed Sparse Column (CSC) format: arrays IRN[1:NNZ] (row indices), VAL[1:NNZ] (values) and COLPTR[1:N+1]; column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)-1.
- Compressed Sparse Row (CSR) format: similar to CSC, but row by row.
- Diagonal format: NDIAG stored diagonals, an array IDIAG of diagonal offsets, and VAL holding the diagonals (positions outside the matrix are not accessed).
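To make the CSC layout concrete, here is a minimal Python sketch (not from the course material; the array names IRN/JCN/VAL/COLPTR follow the slides, but 0-based indices are used instead of Fortran's 1-based ones) converting coordinate format to CSC:

    def coo_to_csc(n, irn, jcn, val):
        """Column j ends up stored in irn_csc/val_csc at positions
        colptr[j] .. colptr[j+1]-1 (0-based)."""
        nnz = len(val)
        colptr = [0] * (n + 1)
        for j in jcn:                    # count the entries of each column
            colptr[j + 1] += 1
        for j in range(n):               # prefix sums give the column pointers
            colptr[j + 1] += colptr[j]
        irn_csc = [0] * nnz
        val_csc = [0.0] * nnz
        nxt = colptr[:]                  # next free slot in each column
        for k in range(nnz):
            p = nxt[jcn[k]]
            irn_csc[p] = irn[k]
            val_csc[p] = val[k]
            nxt[jcn[k]] += 1
        return colptr, irn_csc, val_csc

    # Example: 3x3 matrix with entries in rows 0,2,1,0 and columns 0,0,1,2
    colptr, irn, val = coo_to_csc(3, [0, 2, 1, 0], [0, 0, 1, 2],
                                  [1.0, 2.0, 3.0, 4.0])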

Example of elemental matrix format. An N x N matrix assembled from NELT elemental matrices A_i,

    A = sum_{i=1}^{NELT} A_i,

is described by three arrays: ELTPTR[1:NELT+1] (pointers into the list of variables), ELTVAR[1:NVAR] (the variables of each element) and ELTVAL[1:NVAL] (the numerical values).
Remarks:
- NVAR = sum_i S_i, and NVAL = sum_i S_i^2 (unsymmetric case) or sum_i S_i (S_i + 1)/2 (symmetric case), with S_i = ELTPTR(i+1) - ELTPTR(i).
- The elements are stored in ELTVAL by columns.

File storage: Rutherford-Boeing.
- Standard ASCII format for files: header + data (CSC format).
- Key "xyz": x = [rcp] (real, complex, pattern); y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric, rectangular); z = [ae] (assembled, elemental). Examples: M T.RSA, SHIP.RSE.
- Supplementary files: right-hand sides, solution, permutations, ...
- A canonical format was introduced to guarantee a unique representation (order of entries in each column, no duplicates).
(Sample file: DNV-Ex, tubular joint M T, in RSA format; header and data lines omitted.)

Gaussian elimination

Set A^(1) = A and b^(1) = b. Eliminating x1 from equations 2, ..., n transforms A^(1) x = b^(1) into A^(2) x = b^(2), with

    a^(2)_ij = a_ij - (a_i1 / a_11) a_1j,    b^(2)_i = b_i - (a_i1 / a_11) b_1    (i, j >= 2),

and after n-1 such steps the system becomes A^(n) x = b^(n), with A^(n) upper triangular. The typical step k of Gaussian elimination is

    a^(k+1)_ij = a^(k)_ij - a^(k)_ik a^(k)_kj / a^(k)_kk .
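A minimal NumPy sketch of the elimination step above (dense, no pivoting; `lu_in_place` is an illustrative name, not a routine from any library discussed here):

    import numpy as np

    def lu_in_place(A):
        """Overwrite A: strict lower triangle holds L, upper triangle holds U."""
        n = A.shape[0]
        for k in range(n):
            A[k+1:, k] /= A[k, k]                              # l_ik = a_ik / a_kk
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # a_ij -= l_ik * a_kj
        return A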

Relation with the A = LU factorization

One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k), where L^(k) is the identity matrix except for column k, which holds the negated multipliers -l_{k+1,k}, ..., -l_{n,k}, with l_ik = a^(k)_ik / a^(k)_kk. Then A^(n) = U = L^(n-1) ... L^(1) A, which gives A = LU with

    L = [L^(1)]^{-1} ... [L^(n-1)]^{-1},

a unit lower triangular matrix whose entry (i, k), i > k, is l_ik. In dense codes, the entries of L and U overwrite the entries of A.

Furthermore, if A is symmetric, then A = LDL^T with d_kk = a^(k)_kk: A = LU and A = A^T imply U (L^T)^{-1} = L^{-1} U^T, which is both upper and lower triangular, hence a diagonal matrix D; thus U = D L^T and A = L (D L^T) = LDL^T.

Gaussian elimination and sparsity

Step k of the LU factorization (pivot a_kk):
- for i > k, compute l_ik = a_ik / a_kk (which overwrites a_ik);
- for i > k, j > k, update a_ij = a_ij - a_ik a_kj / a_kk, i.e. a_ij = a_ij - l_ik a_kj.
If a_ik != 0 and a_kj != 0, then a_ij != 0: if a_ij was zero, its new nonzero value must be stored. This is fill-in.

The same holds for Cholesky: for i > k, compute l_ik = a_ik / sqrt(a_kk), and for i > k, j > k, j <= i (lower triangular part), update a_ij = a_ij - a_ik a_jk / a_kk = a_ij - l_ik l_jk.
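The fill-in rule can be simulated on the pattern alone. A small Python sketch (a hypothetical helper, assuming a symmetric pattern given as an edge list) that eliminates vertices in order and counts the fill edges created:

    def symbolic_fill(n, edges):
        adj = {i: set() for i in range(n)}
        for i, j in edges:
            adj[i].add(j); adj[j].add(i)
        fill = 0
        for k in range(n):                    # eliminate vertex k
            nbrs = [v for v in adj[k] if v > k]
            for a in nbrs:                    # make the remaining neighbors a clique
                for b in nbrs:
                    if a < b and b not in adj[a]:
                        adj[a].add(b); adj[b].add(a)
                        fill += 1
        return fill

    # Star graph (arrow matrix): eliminating the center first fills everything
    print(symbolic_fill(5, [(0, i) for i in range(1, 5)]))   # -> 6 fill edges

Relabelling the center last (a star with center 4) gives zero fill, which is exactly the arrow-matrix example of the next section.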

Example. Take a matrix whose first row and first column are full (an arrow matrix): the matrix is full after the first step of elimination. After reordering (first row and column moved to the last row and column), the elimination produces no fill-in.

Ordering the variables has a strong impact on
- the fill-in,
- the number of operations.

Table 1 (Dongarra et al.) illustrates the benefits of sparsity on a test matrix: treating the system as full needs the most storage, flops and time; exploiting sparsity reduces all three substantially; exploiting sparsity together with a fill-reducing ordering reduces them further.

Efficient implementation of sparse solvers. Indirect addressing is often used in sparse calculations, e.g. the sparse SAXPY (a NumPy version is sketched at the end of this section):

    do i = 1, m
       A( ind(i) ) = A( ind(i) ) + alpha * w( i )
    enddo

Even if manufacturers provide hardware for improving indirect addressing, it penalizes performance; hence the idea of switching to dense calculations as soon as the matrix is not sparse enough.

Effect of the switch to dense calculations (matrix from a 5-point discretization of the Laplacian on a grid, Dongarra et al.): the sparse structure should be exploited only below a certain density.
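In NumPy, the same indirect-addressing kernel is a gather/scatter through an index array (a sketch with made-up data):

    import numpy as np

    A = np.zeros(10)
    ind = np.array([1, 4, 7])        # positions of the touched nonzeros
    w = np.array([1.0, 2.0, 3.0])
    alpha = 2.0
    A[ind] += alpha * w              # A(ind(i)) = A(ind(i)) + alpha * w(i)
    # Note: with duplicate indices in ind, np.add.at(A, ind, alpha*w)
    # would be needed to accumulate correctly.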

(Table: density thresholds at which switching to full code pays off, and factorization times with and without the switch; numerical values omitted.)

Symmetric matrices and graphs

Assumptions: A is symmetric and the pivots are chosen on the diagonal. The structure of A is represented by the undirected graph G = (V, E):
- vertices are associated to columns: V = {1, ..., n};
- edges are defined by (i, j) in E <=> a_ij != 0;
- G is undirected (by the symmetry of A).
Remarks:
- the number of off-diagonal nonzeros in column j is |Adj_G(j)|;
- a symmetric permutation of the matrix amounts to renumbering the graph.

The elimination graph model. Construction of the elimination graphs: let v_i denote the vertex of index i, and let G_0 = G(A). At each step i, delete v_i and its incident edges, and add edges so that the vertices of Adj(v_i) are pairwise adjacent in the resulting graph G_i = G(H_i), where H_i is the reduced matrix. The G_i are the so-called elimination graphs.

(Figures: a sequence of elimination graphs G_1, G_2, G_3, G_4 with the corresponding reduced matrices H_i.)

Introducing the filled graph G+(A). Let F = L + L^T be the filled matrix, and let G(F) be the filled graph of A, denoted G+(A).

Lemma (Parter). (v_i, v_j) ∈ G+ if and only if (v_i, v_j) ∈ G, or there exists k < min(i, j) such that (v_i, v_k) ∈ G+ and (v_k, v_j) ∈ G+.

Modeling elimination by reachable sets. A fill edge may be due to a path of length two in some elimination graph; but the edges of that path are themselves due to paths in earlier elimination graphs, so that, unfolding the process, a path in the original graph passing only through already-eliminated vertices is what is really responsible for the fill edge. This observation motivated George and Liu to introduce the notion of reachable sets.

Reachable sets

Let G = (V, E) be a graph and S a subset of V.

Definition. A vertex v is said to be reachable from a vertex w through S if there exists a path (w, v_1, ..., v_k, v) such that all v_i ∈ S (where k may be zero). Reach(w, S) is the set of vertices reachable from w through S.

Illustration: with S = {s1, s2, s3, s4} in the example graph, Reach(w, S) = {a, b, c}.

Fill-in and reachable sets. The filled graph G+(A) can be characterized with reachable sets. Let S_i = {v_1, ..., v_i} denote the set of vertices eliminated during the first i steps.

Theorem. For i < j, (v_i, v_j) ∈ G+(A) if and only if v_j ∈ Reach(v_i, S_{i-1}).

In matrix terms, Reach(v_i, S_{i-1}) is the set of row subscripts of the nonzero entries below the diagonal in column i of L: Reach(v_i, S_{i-1}) provides the column structure of column i of L.

Structure of the L factors.

Theorem. Let w be a vertex of G_i. The set of vertices adjacent to w in G_i (the column structure of w) is given by Reach(w, S_i), where the reach operator is applied to the original graph G_0(A).

The structure of L can thus be characterized in terms of the structure of A. The whole elimination process can be described implicitly by a sequence of reach operations: an implicit model of elimination, as opposed to the explicit model using elimination graphs.
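A small Python sketch of the reach operator (a hypothetical helper; `adj` maps each vertex to its neighbor set in the original graph G_0, and S is the set of eliminated vertices):

    def reach(adj, w, S):
        seen, stack, result = {w}, [w], set()
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in seen:
                    continue
                seen.add(v)
                if v in S:
                    stack.append(v)   # paths may pass through eliminated vertices
                else:
                    result.add(v)     # first non-eliminated vertex reached
        return result

By the theorem above, reach(adj, v_i, {v_1, ..., v_{i-1}}) returns exactly the row indices of the nonzeros below the diagonal in column i of L.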

This theorem also gives an indication of the impact of reordering on fill-in: suppose that a separator S splits G into two disconnected components C1 and C2. If the vertices of S are labelled after those of C1 and C2, then by the previous theorem there can be no fill edge between vertices of the two components.

A first definition of the elimination tree

A spanning tree of a connected graph G is a subgraph T of G such that if there is a path in G between i and j, then there is a path between i and j in T.

Let A be a symmetric positive definite matrix, A = LL^T its Cholesky factorization, and G+(A) its filled graph (the graph of F = L + L^T).

Definition. The elimination tree of A is the spanning tree of G+(A) satisfying the relation PARENT[j] = min{ i > j : l_ij != 0 }.

(Figure: a small example on vertices a, ..., j showing A, the filled matrix F with its fill-in entries and the entries belonging to T(A), the graphs G(A) and G+(A) = G(F), and the elimination tree T(A).)

Properties of the elimination tree. Another perspective also leads to the elimination tree:
- dependency between columns of L: column i > j depends on column j iff l_ij != 0;
- use a directed graph to express this dependency;
- simplify the redundant dependencies (a transitive reduction, in graph-theory terms).
The transitive reduction of the directed filled graph gives the elimination tree structure.
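Given the filled pattern, the elimination tree follows directly from the definition; a minimal Python sketch (hypothetical representation: lcols[j] is the set of row indices of the nonzeros of column j of L):

    def etree_from_filled(n, lcols):
        """parent[j] = min{ i > j : l_ij != 0 }; roots get None."""
        parent = [None] * n
        for j in range(n):
            below = [i for i in lcols[j] if i > j]
            parent[j] = min(below) if below else None
        return parent

In practice the tree is computed directly from A in near-linear time (Liu's algorithm), without forming the filled pattern first.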

(Figure: the directed filled graph of the example and its transitive reduction, which is the elimination tree T(A).)

Ordering sparse matrices: objectives/outline

- Reduce the fill-in and the number of operations during factorization (local and global heuristics).
- Increase parallelism (wide tree).
- Decrease memory usage (deep tree).
- Equivalent orderings: traverse the tree so as to minimize the working memory.
- Reorder unsymmetric matrices to special forms: block upper triangular matrix, with (large) nonzero entries on the diagonal (maximum transversal).
- Combining approaches.

Fill-reducing orderings

Three main classes of methods for minimizing fill-in during factorization:
- Global approach: the matrix is permuted into a matrix with a given pattern, and fill-in is restricted to occur within that structure. Examples: Cuthill-McKee (block tridiagonal matrix), nested dissection (block bordered matrix).
- Local heuristics: at each step of the factorization, select the pivot that is likely to minimize fill-in. The method is characterized by the way pivots are selected: Markowitz criterion (for a general matrix), minimum degree (for symmetric matrices).
- Hybrid approaches: once the matrix is permuted to obtain a block structure, local heuristics are used within the blocks.

Consider the following matrix and its graph:

(Figure: the example matrix A and its corresponding graph.)

Cuthill-McKee algorithm

Goal: reduce the profile/bandwidth of the matrix (the fill is restricted to the band structure). Level sets (as in breadth-first search) are built from the vertex of minimum degree (with priority to the vertex of smallest number) and the vertices are numbered level by level; on the example this yields level sets S1, ..., S4 and a reordered matrix that is banded (a sketch of the algorithm follows this section).

Reverse Cuthill-McKee

The ordering is the reverse of the one obtained with Cuthill-McKee. Reverse Cuthill-McKee is more efficient than Cuthill-McKee at reducing the envelope of the matrix.

Nested dissection

A recursive approach based on graph partitioning: a separator S splits the graph into independent subgraphs, which are numbered before S, and the process is applied recursively to the subgraphs. In the permuted matrix, the separator variables appear last and fill-in is confined to the resulting block bordered structure.
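A compact Python sketch of the level-set construction (one connected component per outer iteration; adj is a list of neighbor sets; ties broken by smallest label, as above):

    from collections import deque

    def cuthill_mckee(adj):
        n = len(adj)
        order, visited = [], [False] * n
        while len(order) < n:
            start = min((v for v in range(n) if not visited[v]),
                        key=lambda v: (len(adj[v]), v))   # min degree first
            visited[start] = True
            q = deque([start])
            while q:                                       # BFS level sets
                u = q.popleft()
                order.append(u)
                for v in sorted((w for w in adj[u] if not visited[w]),
                                key=lambda w: (len(adj[w]), w)):
                    visited[v] = True
                    q.append(v)
        return order

order[::-1] gives the Reverse Cuthill-McKee ordering.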

Local heuristics to reduce fill-in during factorization

Let G(A) be the graph associated with a matrix A that we want to order using local heuristics, and let Metric be such that Metric(v_i) < Metric(v_j) implies that v_i is a better pivot candidate than v_j.

Generic algorithm. Loop until all nodes are selected:
- Step 1: select the current node p (the so-called pivot) with minimum metric value;
- Step 2: update the elimination graph;
- Step 3: update Metric(v_j) for all non-selected nodes v_j.
Step 3 should only be applied to nodes whose Metric value might have changed.

Reordering unsymmetric matrices: Markowitz criterion

At step k of Gaussian elimination, let r_i^k be the number of nonzeros in row i of the active submatrix A^k, and c_j^k the number of nonzeros in column j of A^k. The pivot a_ij (i, j > k) must be large enough, and should minimize (r_i^k - 1)(c_j^k - 1).

Minimum degree: the Markowitz criterion specialized to symmetric, diagonally dominant matrices.

Minimum degree algorithm. Step 1: select the vertex with the smallest number of neighbors in G_0 (in the illustration — a sparse symmetric matrix and its elimination graph — the selected node/variable has minimum degree).

Notation for the elimination graph. Let G^k = (V^k, E^k) be the graph built at step k: G^k describes the structure of A^k after eliminating k pivots; G^k is undirected (A^k is symmetric); fill-in in A^k corresponds to adding edges in the graph.
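A sketch of the generic loop with Metric = current degree, i.e. basic minimum degree on explicit elimination graphs (none of the supervariables, multiple eliminations or approximate degrees introduced later):

    def minimum_degree(adj):
        """adj: dict mapping vertex -> set of neighbors; returns an ordering."""
        adj = {v: set(nb) for v, nb in adj.items()}
        order = []
        while adj:
            p = min(adj, key=lambda v: (len(adj[v]), v))   # Step 1: min metric
            nbrs = adj.pop(p)
            for v in nbrs:
                adj[v].discard(p)
            for a in nbrs:                                  # Step 2: clique update
                adj[a] |= (nbrs - {a})                      # these are fill edges
            order.append(p)
        return order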

Illustration. Step 1: elimination of the first pivot — (a) the elimination graph after the step; (b) the factors and active submatrix, showing the initial nonzeros, the nonzeros in the factors and the fill-in.

Minimum degree algorithm applied to the graph. Step k: select the node with the smallest number of neighbors; G^k is built from G^{k-1} by removing the pivot and adding the edges corresponding to fill-in.

Illustration (cont'd). (Figures: the successive elimination graphs and corresponding reduced matrices, distinguishing original nonzeros, modified original nonzeros, fill-in, and nonzeros in the factors.)

Minimum degree does not always minimize fill-in!

Consider the following matrix (remark: with the initial ordering, there is no fill-in) and its elimination graph. At the first step, minimum degree selects a pivot of minimum degree whose elimination adds a fill edge, whereas keeping the natural ordering would have produced none.

Reduce time complexity. Accelerate the selection of pivots and the update of the graph:
1. Supervariables: if several variables have the same adjacency structure in G^k, they can be eliminated simultaneously.
2. Multiple eliminations: two non-adjacent nodes of the same degree can be eliminated simultaneously.
3. Approximate Minimum Degree (AMD): the degree update of the neighbors of the pivot can be effected in an approximate way.

Reduce memory complexity. Decrease the size of the working space: using the elimination graph, the working space is of order O(#nonzeros in the factors).
Fill-in: let pivot be the pivot at step k. If i ∈ Adj_{G^k}(pivot), then the structure of the pivot column is included in the filled structure of column i. We can therefore use an implicit representation of fill-in by defining the notion of element (a variable already eliminated) and the quotient graph: a variable of the quotient graph is adjacent to variables and elements. One can show that for all k in [1 ... N], the size of the quotient graph is O(|G_0|).

Influence on the structure of the factors. Harwell-Boeing matrix dwt 9.rua (structural computing on a submarine): pattern of the LU factors with the natural ordering.

Structure of the factors after permutation: with Minimum Degree and MMD orderings the factors are much sparser, and the detection of supervariables allows more regularly structured factors to be built (easier factorization).

Comparison of implementations of Minimum Degree. Let:
- V0 be the initial algorithm (based on the elimination graph);
- MMD the version including supervariables, multiple eliminations and the quotient graph (Multiple Minimum Degree, Liu), used in MATLAB;
- AMD the version including supervariables, approximate degree update and the quotient graph (Approximate Minimum Degree, Amestoy, Davis, Duff).

Execution times on a SUN Sparc on matrices such as dwt, Wang and Orani (numerical values omitted) show that:
- the fill-in obtained is similar;
- the memory space for MMD and AMD is O(NZ) integers;
- V0 was not able to perform the reordering for the largest matrices (lack of memory after hours of computation).

Minimum fill-in heuristics

Recalling the generic algorithm: let G(A) be the graph associated with the matrix A to order, and let Metric be such that Metric(v_i) < Metric(v_j) implies that v_i is a better pivot candidate than v_j. Loop until all nodes are selected:
- Step 1: select the current node p (pivot) with minimum metric value;
- Step 2: update the elimination (or quotient) graph;
- Step 3: update Metric(v_j) for all non-selected nodes v_j (only those whose Metric value might have changed).

Minimum fill based algorithm. Metric(v_i) is the amount of fill-in that v_i would introduce if it were selected as pivot.

Illustration: a node r of degree d whose neighbors are pairwise non-adjacent has a fill-in metric of d(d-1)/2, whereas a node s of larger degree can have a much smaller fill-in metric when most of its neighbors are already pairwise adjacent.

Minimum fill-in properties. This situation typically occurs when the neighbors of s were adjacent to previously selected nodes (say e1 and e2), so that cliques have already been formed among them. The elimination of a node v_k affects the degrees of the nodes adjacent to v_k, but its fill-in metric also affects Adj(Adj(v_k)): selecting r changes the fill-in metric of nodes at distance two, because of the fill edges created among the neighbors of r.

How to compute the fill-in metric. Computing the exact minimum fill-in metric is too costly:
- only nodes adjacent to the current pivot are updated;
- only approximate metrics (using clique structures) are computed.
Let d_k be the degree of node k: d_k (d_k - 1)/2 is an upper bound on the fill. Several possibilities to sharpen it:
1. Deduct the clique area of the last selected pivot adjacent to k.
2. Deduct the largest clique area over all adjacent selected pivots.
3. If the approximate degree of AMD is used for d_k, then the cliques of all adjacent selected pivots can be deducted.

Impact of the fill-reduction algorithm on the shape of the tree

Reordering technique | Shape of the tree         | Observations
AMD                  | deep, well-balanced       | large frontal matrices on top
AMF                  | very deep, unbalanced     | small frontal matrices
PORD                 | deep, unbalanced          | small frontal matrices
SCOTCH               | very wide, well-balanced  | large frontal matrices
METIS                | wide, well-balanced       | smaller frontal matrices (than SCOTCH)

Importance of the shape of the tree. Suppose that each node in the tree corresponds to a task that consumes temporary data from its children and produces temporary data passed to its parent node. Then:
- wide tree: good parallelism, but many temporary blocks to store, hence large memory usage;
- deep tree: less parallelism, but smaller memory usage.

Postorderings and memory usage

Trees, topological orderings and postorderings. A rooted tree is a tree in which one node has been selected to be the root. A topological ordering of a rooted tree is an ordering that numbers children nodes before their parent. Postorderings are topological orderings that number the nodes of any subtree consecutively. (Figures: a connected graph G; a rooted spanning tree with a topological ordering; a rooted spanning tree with a postordering.)

Postorderings and memory usage. Assumptions:
- the tree is processed from the leaves to the root;
- a parent is processed as soon as all its children have completed (postorder of the tree);
- each node produces temporary data that is sent to and consumed by its parent.

Exercise: in what sense is a postorder-based tree traversal more interesting than a random topological ordering?

Furthermore, memory usage also depends on the postordering chosen:

(Figure: on an example tree with nodes a, ..., i, the best postordering is (a b c d e f g h i) and the worst is (h f d b a c e g i); leaves at the bottom, root at the top.)

Example 1: processing a wide tree. Memory snapshots show the unused, stack, factor and non-free memory spaces: the active memory grows with the number of simultaneously stacked contribution blocks.

Example 2: processing a deep tree.

Memory snapshots for each node: allocation of the frontal matrix, assembly step, factorization step, then stacking of the contribution block (unused, factor, stack and non-free memory spaces).

Modelization of the problem. Let:
- M_i be the memory peak for the complete subtree rooted at i,
- temp_i the temporary memory produced by node i,
- m_parent the memory for storing the parent.

    M_parent = max( max_{j=1..nbchildren} ( M_j + sum_{k=1}^{j-1} temp_k ),
                    m_parent + sum_{j=1}^{nbchildren} temp_j )            (1)

Objective: order the children so as to minimize M_parent.

Memory-minimizing schedules.

Theorem (Liu). The minimum of max_j ( x_j + sum_{i=1}^{j-1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i - y_i.

Corollary. An optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i - temp_i.

Interpretation: at each level of the tree, a child with a relatively large memory peak in its subtree (M_i large with respect to temp_i) should be processed first. Apply on the complete tree, starting from the leaves (or from the root, with a recursive approach).
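A Python sketch of Formula (1) combined with the corollary (hypothetical inputs: children[i], temp[i] and m[i] as defined above; the recursion is assumed shallow enough):

    def peak_memory(i, children, temp, m):
        """Stack peak for the subtree rooted at i, children in optimal order."""
        Ms = [peak_memory(c, children, temp, m) for c in children[i]]
        order = sorted(range(len(Ms)),
                       key=lambda k: Ms[k] - temp[children[i][k]],
                       reverse=True)                 # decreasing M_j - temp_j
        peak, stacked = 0, 0
        for k in order:
            peak = max(peak, stacked + Ms[k])        # child k runs on the stack
            stacked += temp[children[i][k]]          # its contribution stays
        return max(peak, stacked + m[i])             # then allocate the parent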

Optimal tree reordering (objective: minimize the peak of stack memory):

    Tree_Reorder(T):
       for all i in the set of root nodes do
          Process_Node(i)
       end for

    Process_Node(i):
       if i is a leaf then
          M_i = m_i
       else
          for j = 1 to nbchildren do
             Process_Node(j-th child)
          end for
          reorder the children of i in decreasing order of (M_j - temp_j)
          compute M_i at node i using Formula (1)
       end if

Equivalent orderings and elimination trees

Equivalent orderings of symmetric matrices. Let F be the filled matrix of a symmetric matrix A (that is, F = L + L^T, where A = LL^T), and G+(A) the associated filled graph.

Definition (equivalent orderings). P and Q are said to be equivalent orderings iff G+(P A P^T) = G+(Q A Q^T). By extension, a permutation P is said to be an equivalent ordering of a matrix A iff G+(P A P^T) = G+(A).

It can be shown that an equivalent reordering also preserves the amount of arithmetic operations of the sparse Cholesky factorization.

Relation with elimination trees. Let A be a reordered matrix and G+(A) its filled graph. Any traversal of the elimination tree that processes children before parents corresponds to an equivalent ordering P of A, and the elimination tree of P A P^T is identical to that of A. (Figure: a matrix on the variables u, v, w, x, y, z, its filled graph G+(A), and the elimination tree of A with a topological ordering.)

Tree rotations.

Definition. An ordering that does not introduce any fill is referred to as a perfect ordering. The natural ordering is a perfect ordering of the filled matrix F.

Theorem. For any node x of G+(A) = G(F), there exists a perfect ordering of G(F) such that x is numbered last.

Essence of tree rotations: the nodes in the clique of x in F are numbered last; the relative ordering of the other nodes is preserved.

Example of equivalent orderings: a tree rotation applied at w (the clique of w being {w, x}) renumbers that clique last while preserving the relative ordering of the other nodes with respect to the original tree; the rotated tree can then be given a postordering. Remark: tree rotations can help reduce the temporary memory usage!

Reordering unsymmetric matrices to special forms

Unsymmetric matrices and graphs. An unsymmetric matrix can be seen as:
- a bipartite graph G = (V_r, V_c, E ⊆ V_r x V_c), with (r, c) ∈ E iff there is an entry in row r, column c; or
- a directed graph (digraph) G = (V, E), with an oriented edge from the source r to the target c ((r, c) ∈ E) iff there is an entry in row r, column c.

Maximum matching/transversal orderings. Let A be a sparse matrix and G = (V_r, V_c, E ⊆ V_r x V_c) its associated bipartite graph. M ⊆ E is a matching iff for all distinct (r1, c1), (r2, c2) in M, r1 != r2 and c1 != c2.

Structural problem:
- Maximum transversal: how to permute the columns of A so as to have the maximum number of nonzero entries on the diagonal?
- Maximum matching: how to find a matching of maximum size?
Software: dmperm in MATLAB, MC21 in Fortran (HSL library).

Components of the maximum matching algorithm. Suppose that a matching of size k has been found (i.e. there exist permutations placing entries on the first k diagonal positions):
- augmenting paths: try to find a matching of size k + 1 using the matching of size k (see the sketch below);
- branch-and-bound techniques based on depth-first search.

Structurally singular matrices. If a maximal matching cannot be extended to a full transversal, the matrix is said to be structurally (or symbolically) singular. In the example, two columns have entries confined to the same rows, so no column permutation can complete the diagonal.

(Illustration of the algorithm: a sequence of bipartite graphs on rows r1, ..., r6 and columns c1, ..., c6 showing how augmenting paths successively grow the matching.)
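A Python sketch of the depth-first augmenting-path search (in the spirit of MC21-style algorithms, but not taken from any of them; cols[c] lists the rows with a nonzero in column c):

    def maximum_transversal(m, n, cols):
        match_row = [-1] * m            # column matched to each row, -1 if free

        def augment(c, seen):
            for r in cols[c]:
                if r not in seen:
                    seen.add(r)
                    if match_row[r] == -1 or augment(match_row[r], seen):
                        match_row[r] = c      # (re)match row r to column c
                        return True
            return False

        size = sum(augment(c, set()) for c in range(n))
        return size, match_row

If the returned size is smaller than n, the matrix is structurally singular.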

Maximum weighted matching. Structural + numerical problem: find a permutation P of the columns of A such that the product (or another metric: minimum, sum, ...) of the diagonal entries of AP is maximum. Same techniques as for maximum matching, but the branch and bound is more complicated and it is more efficient to do breadth-first search. Software: MC64 in Fortran 77 or 90 (HSL library).

Reduction to Block Triangular Form (BTF). Suppose that there exist permutation matrices P and Q such that PAQ is block upper triangular, with diagonal blocks B_11, ..., B_NN and off-diagonal blocks B_ij (j > i). If N > 1, A is said to be reducible (irreducible otherwise); each B_ii is supposed to be irreducible (otherwise a finer decomposition is possible).

Advantage: to solve Ax = b, only the B_ii need be factored:

    B_ii y_i = (Pb)_i - sum_{j=i+1}^{N} B_ij y_j,    i = N, ..., 1,    with y = Q^T x.

A two-stage approach to compute the reduction:
- Stage (1): compute a (column) permutation matrix Q1 such that AQ1 has nonzeros on the diagonal (find a maximum transversal of A);
- Stage (2): compute a (symmetric) permutation matrix P such that P (AQ1) P^T is BTF.
The diagonal blocks of the BTF are uniquely defined. Techniques exist to compute the BTF directly, but they present no advantage over this two-stage approach.

Main components of the algorithm. Objective: assuming AQ1 has nonzeros on the diagonal, compute P such that P (AQ1) P^T is BTF.
- Use the digraph (directed graph) associated with the matrix; symmetric permutations on the digraph amount to relabelling the nodes of the graph.
- If there is no closed path through all the nodes of the digraph, then the digraph can be subdivided into two parts.
- The strong components of the graph are the sets of nodes belonging to common closed paths; they are the diagonal blocks B_ii of the BTF.
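For stage (2), the strong components can be found with Tarjan's algorithm; a Python sketch (recursive for brevity; succ[v] lists the successors of node v in the digraph of the matrix with a zero-free diagonal):

    def strong_components(n, succ):
        index, low, onstk = {}, {}, set()
        stack, comps, counter = [], [], [0]

        def visit(v):
            index[v] = low[v] = counter[0]; counter[0] += 1
            stack.append(v); onstk.add(v)
            for w in succ[v]:
                if w not in index:
                    visit(w)
                    low[v] = min(low[v], low[w])
                elif w in onstk:
                    low[v] = min(low[v], index[w])
            if low[v] == index[v]:            # v is the root of a component
                comp = []
                while True:
                    u = stack.pop(); onstk.discard(u); comp.append(u)
                    if u == v:
                        break
                comps.append(comp)

        for v in range(n):
            if v not in index:
                visit(v)
        return comps

The components are emitted in reverse topological order of the condensation, so numbering them in reversed(comps) order yields the permutation P for a block upper triangular form.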

(Figure: an example matrix A, its digraph, and P A P^T in BTF form.)

Preamble: finding the triangular form of a matrix. Observations: a triangular matrix has a BTF form with blocks of size 1 (BTF can be considered a generalization of the triangular form, in which each diagonal entry becomes a strong component). Note that if the matrix has a triangular form, the digraph has no cycle (i.e. it is acyclic).

Sargent and Westerberg algorithm:
1. Select a starting node and trace a path until finding a node from which no path leaves.
2. Number the last node first, and eliminate it (with all edges pointing to it).
3. Continue from the previous node on the path (or choose a new starting node).

(Figure: the digraph of A, the path followed at each step, and A permuted to a triangular form — which is not unique.)

Generalizing the Sargent and Westerberg algorithm to compute a BTF. A composite node is a set of nodes belonging to a closed path. Start from any node and follow a path until:
(1) a closed path is found: collapse all the nodes of the closed path into a composite node, and continue the path from the composite node; or
(2) a node or composite node is reached from which no path leaves: that node or composite node is numbered next.

(Figure: the digraph of A, the paths and steps of the generalized algorithm, and P A P^T in BTF form.)

Hall property. Consider the bipartite representation of an unsymmetric matrix. A bipartite graph with m rows and n columns has the Hall property if every set of k column vertices is adjacent to at least k row vertices, for all k <= n.

Hall theorem: a bipartite graph has a column-complete matching iff it has the Hall property.

Strong Hall property. A bipartite graph with m rows and n columns has the strong Hall property if every set of k column vertices is adjacent to at least k+1 row vertices, for all k < n. Any matrix that is not strong Hall can be permuted to a block lower triangular form.

Combining approaches

Example (1) of a hybrid approach: top-down followed by bottom-up processing of the graph.
- Top-down: apply nested dissection (ND) on the complete graph.
- Bottom-up: apply a local heuristic on each subgraph.
This is generally better for large-scale irregular problems than pure nested dissection or pure local heuristics.

(Figure, hybrid approach (cont'd): nested dissection separators S1, S2, S3 partition the graph, minimum degree (MD) is applied within each of the four subgraphs, and the permuted matrix shows the resulting block bordered structure; ND on the elimination graph, MD inside the blocks.)

Example (2): combining maximum transversal and fill-in reduction. Consider the LU factorization A = LU of an unsymmetric matrix:
1. Compute the column permutation Q leading to a maximum numerical transversal of A: AQ has large (in some sense) numerical entries on the diagonal.
2. Find the best ordering of AQ that preserves the diagonal entries; this is equivalent to finding a symmetric permutation P such that the factorization of P (AQ) P^T has reduced fill-in.

Factorization of sparse matrices

Outline:
1. Introduction.
2. Elimination tree and multifrontal method.
3. Comparison between multifrontal, frontal and general approaches for LU factorization.
4. Task mapping and scheduling.
5. Distributed memory approaches: fan-in, fan-out, multifrontal.
6. Some parallel solvers; case study on MUMPS and SuperLU.
7. Concluding remarks.

Introduction

Recalling Gaussian elimination. Step k of the LU factorization (pivot a_kk):
- for i > k, compute l_ik = a_ik / a_kk (which overwrites a_ik);
- for i > k, j > k such that a_ik and a_kj are nonzero, update a_ij = a_ij - a_ik a_kj / a_kk.

If a_ik != 0 and a_kj != 0, then a_ij != 0: if a_ij was zero, its new nonzero value must be stored. Orderings (minimum degree, Cuthill-McKee, ND) limit the fill-in and the number of operations, and modify the task graph.

Three-phase scheme to solve Ax = b:
1. Analysis step: preprocessing of A (symmetric/unsymmetric orderings, scalings) and construction of the dependency graph (elimination tree, edag, ...).
2. Factorization (A = LU, LDL^T, LL^T, QR), with numerical pivoting.
3. Solution based on the factored matrices: triangular solves Ly = b, then Ux = y; improvement of the solution (iterative refinement), error analysis.

Control of numerical stability: numerical pivoting. In dense linear algebra, partial pivoting is commonly used (at each step, the largest entry in the column is selected). In sparse linear algebra, some flexibility is kept in order to preserve sparsity:
- Partial threshold pivoting: eligible pivots are those not too small with respect to the maximum in the column,

      set of eligible pivots = { r : |a^(k)_rk| >= u * max_i |a^(k)_ik| },   with 0 < u <= 1;

  among the eligible pivots, select one that best preserves sparsity. u is called the threshold parameter (u = 1 gives partial pivoting); it restricts the maximum possible growth of the updated entries a_ij = a_ij - a_ik a_kj / a_kk. u = 0.1 is often chosen in practice.
- Symmetric indefinite case: requires 2x2 pivots, e.g. with zeros on the diagonal and nonzero off-diagonal entries.

Threshold pivoting and numerical accuracy. Table 2 (Dongarra et al.) shows, on a test matrix, the effect of varying the threshold parameter u on the number of nonzeros in the LU factors and on the error: a larger u gives more accuracy at the price of more fill.
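A one-function Python sketch of the eligibility test (illustrative only; a real code would combine it with a Markowitz-style sparsity criterion to pick among the eligible rows):

    def eligible_pivots(col_entries, u=0.1):
        """col_entries: list of (row, value) in the active part of column k.
        Returns the rows that are numerically acceptable as pivot."""
        amax = max(abs(v) for _, v in col_entries)
        return [r for r, v in col_entries if abs(v) >= u * amax]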

Iterative refinement for linear systems. Suppose that a solver has computed A = LU (or LDL^T, or LL^T) and a solution x~ of Ax = b:
1. Compute r = b - A x~.
2. Solve LU δx = r.
3. Update x~ = x~ + δx.
4. Repeat if necessary/useful.
(A sketch is given at the end of this section.)

Elimination tree and multifrontal approach

We recall that the elimination tree expresses the dependencies between the various steps of the factorization, and that it also exhibits the parallelism arising from the sparse structure of the matrix.

Building the elimination tree:
- permute the matrix to reduce fill-in: P A P^T;
- build the filled matrix A_F = L + L^T, where P A P^T = LL^T;
- take the transitive reduction of the associated filled graph.
Each column corresponds to a node of the graph, and each node k of the tree corresponds to the factorization of a frontal matrix whose row structure is that of column k of A_F.

Illustration of the multifrontal factorization. We assume that the pivots are chosen down the diagonal, in order. (Figure: the filled matrix F and the elimination graph.) Treatment at each node:
- assembly of the frontal matrix using the contributions from the sons;
- Gaussian elimination on the frontal matrix.

Elimination of the first variable (first pivot): assemble the frontal matrix containing row and column 1 and the variables they involve; the elimination produces contributions a^(1)_ij = - a_i1 a_1j / a_11 on the affected entries i, j > 1.
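A NumPy sketch of the refinement loop above (`solve` stands for the forward/backward substitution with the already-computed factors; both names are placeholders):

    import numpy as np

    def refine(A, solve, b, steps=3):
        x = solve(b)
        for _ in range(steps):
            r = b - A @ x          # residual, ideally in higher precision
            x = x + solve(r)       # correction through the existing factors
        return x

    # Illustration only: refine(A, lambda rhs: np.linalg.solve(A, rhs), b)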

The terms a^(1)_ij = - a_i1 a_1j / a_11 of the contribution matrix are stored for later updates.

Elimination of the second variable: assemble the frontal matrix, updating the entries of the pivot row and column with the contributions from the previous updates (none here); the elimination produces contributions a^(2)_ij on the affected entries, again stored for later updates.

Elimination of the third variable: assemble the frontal matrix, updating with the previous contributions, e.g. a_33 = a_33 + a^(1)_33 + a^(2)_33; entries that receive contributions from only some of the sons remain partially summed, since contributions are transferred only between son and father. The elimination produces the contribution matrix for the parent.

Elimination of the last variable: the frontal matrix involves only the remaining entry, updated with the last contribution.

The multifrontal method (Duff, Reid). Memory is divided into two parts (that can overlap in time):
- the factors;
- the active memory.
The elimination tree represents the task dependencies.

(Figure: A = L + U with its fill-in; the factors, the active frontal matrix and the stack of contribution blocks, the latter two forming the active memory.)

Multifrontal method: at each node, the frontal matrix is partitioned as

    [ F11  F12 ]
    [ F21  F22 ]

Pivots can ONLY be chosen from the F11 block, since F22 is NOT fully summed. Eliminating the F11 variables updates F22, which becomes the contribution block passed to the parent (see the sketch at the end of this section).

Supernodal methods.

Definition. A supernode (or supervariable) is a set of contiguous columns in the factor L that share essentially the same sparsity structure.
- All algorithms (ordering, symbolic factorization, factorization, solve) generalize to blocked versions.
- Blocking allows the use of efficient matrix-matrix kernels (improved cache usage).
- It is the same concept as the supervariables of the elimination tree / minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
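A NumPy sketch of the elementary multifrontal task on a dense front F with npiv fully summed variables (no numerical pivoting shown; a real code would restrict and possibly delay pivots within F11):

    import numpy as np

    def partial_factor(F, npiv):
        """Factor the leading npiv x npiv block of F in place and return the
        Schur complement (contribution block) for the parent front."""
        for k in range(npiv):                                   # pivots in F11 only
            F[k+1:, k] /= F[k, k]
            F[k+1:, k+1:] -= np.outer(F[k+1:, k], F[k, k+1:])
        # F[:npiv, :] and F[:, :npiv] now hold the factor entries;
        # F[npiv:, npiv:] is the contribution block to assemble into the parent.
        return F[npiv:, npiv:]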

Amalgamation.
- Goal: exploit a more regular structure in the original matrix, decrease the amount of indirect addressing, increase the size of the frontal matrices.
- How? Relax the number of nonzeros of the matrix: amalgamate nodes of the elimination tree.
- Consequences? An increase in the total amount of flops, but a decrease in indirect addressing, and an increase in performance.

Remark: if a node i = (i1, i2, ..., i_f) is a son of a node j = (j1, j2, ..., j_p) and {j1, j2, ..., j_p} ⊆ {i1, i2, ..., i_f}, then the amalgamation of i and j is without fill-in. In particular, the amalgamation of supernodes (same lower-diagonal structure) is without fill-in.

Illustration of amalgamation. (Figure: the original matrix and its elimination tree, from the leaves to the root; the structure of node i is the frontal matrix denoted (i1, i2, ..., i_f).)

Amalgamation and supervariables. Amalgamation of supervariables does not cause fill-in. (Example: on the initial graph, after reordering, the supervariables are the groups of variables with the same adjacency structure; each group can be merged into a single node of the tree without fill-in.)

(Example: amalgamation WITHOUT fill-in, when the father's variables are contained in the son's front, versus amalgamation WITH fill-in, where merging the fronts introduces new entries.)

Parallelization: two levels of parallelism.
- Arising from sparsity: between nodes of the elimination tree (first level of parallelism).
- Within each node: parallel dense LU factorization (BLAS) (second level of parallelism).
Going up the tree, tree parallelism decreases while node parallelism increases: exploiting the second level of parallelism is crucial.

Multifrontal factorization: performance summary on matrix BCSSTK on an Alliant FX/8, an IBM 3090J/VF, a CRAY-2 and a CRAY Y-MP. In column (1), only the parallelism of the tree is exploited; in column (2), the two levels of parallelism are combined, with clearly better Mflops rates and speed-ups.

Other features. Dynamic management of parallelism:
- a pool of tasks for exploiting the two levels of parallelism;
- assembly operations are also parallel (but involve indirect addressing).

Dynamic management of data: storage of the LU factors, frontal and contribution matrices; the amount of memory available may conflict with exploiting the maximum parallelism.

Comparison between approaches for LU factorization

We compare three general approaches for sparse LU factorization:
- the general technique,
- the frontal method,
- the multifrontal approach.
(Distributed-memory multifrontal and supernodal approaches will be compared in another section.)

Description of the approaches.

General technique:
- numerical and sparsity pivoting performed at the same time;
- dynamic sparse data structures;
- good preservation of sparsity: local decisions influenced by numerical choices.

Frontal method:
- extension of band or variable-band schemes;
- no indirect addressing is required in the innermost loop (data are stored in dense matrices);
- simple data structures, fast methods, easier to implement; very popular; easy out-of-core implementation;
- sequential by nature.

Multifrontal approach:
- can be seen as an extension of the frontal method;
- an analysis phase computes an ordering, which can then be perturbed by numerical pivoting;
- full matrices are used in the innermost loops;
- compared to frontal schemes: more complex to implement (assembly of dense matrices, management of numerical pivoting), but preserves the sparse structure much better.

Frontal method. Step 1 (elimination of a_11): the frontal matrix holds the values needed for the elimination of a_11. Step j (elimination of a_jj): the frontal matrix holds the updates from the previous steps, plus row j and column j of the original matrix; update values f_ij = (a_i1 * a_1j) / a_11 (and the analogous terms at later steps) are generated. Properties: the band is treated as full, so an efficient reordering to minimize the bandwidth is crucial.

Frontal vs multifrontal methods. In the multifrontal method, it is sufficient to consider the frontal matrices associated with the nodes of the tree (each considered as full); several fronts move ahead simultaneously (the treatment of independent branches is independent). Characteristics of the multifrontal method:
- more complex data structures;
- usually more efficient than frontal techniques at preserving sparsity;
- parallelism arising from sparsity.

Illustration: comparison between software for LU factorization.
- General approach (MA38, Davis and Duff): control of fill-in by the Markowitz criterion; numerical stability by partial pivoting; numerical and sparsity pivoting performed in one step.
- Multifrontal method (MA41, Amestoy and Duff):

  reordering before the numerical factorization (a minimum degree type of reordering); partial pivoting for numerical stability.
- Frontal method (MA42, Duff and Scott): reordering before the numerical factorization (reordering for decreasing bandwidth); partial pivoting for numerical stability.
All these codes (MA38, MA41, MA42) are available in the HSL library.

Test problems from the Harwell-Boeing and Tim Davis (Univ. Florida) collections (orders and nonzero counts omitted):

Matrix       | Description
Orani8.rua   | economic modelling
Onetone.rua  | harmonic balance method
Garon.rua    | 2D Navier-Stokes
Wang.rua     | 3D simulation of semiconductors
mhda.rua     | spectral problem in hydrodynamics
rim.rua      | CFD, nonlinear problem

Execution times on a SUN for the three methods (general, frontal, multifrontal) on Onetone, Garon and mhda were compared in terms of flops, size of the factors and time (numerical values omitted).

Task mapping and scheduling

Affect tasks to processors to achieve a goal: makespan minimization, memory minimization, ... There are many approaches:
- Static: build the schedule before the execution and follow it at run-time.
  Advantage: very efficient, since it has a global view of the system.

  Drawback: requires a very good modelization of the platform.
- Dynamic: take the scheduling decisions dynamically at run-time.
  Advantage: reactive to the evolution of the platform, and easy to use on several platforms.
  Drawback: decisions are taken with local criteria (a decision which seems good at time t can have very bad consequences at time t + 1).

Influence of scheduling on the makespan. Objective: assign processes/tasks to processors so that the completion time, also called the makespan, is minimized. (We may also say that we minimize the maximum total processing time on any processor.)

Task scheduling on shared memory computers. The data can be shared between processors without any communication. Dynamic scheduling of the tasks (pool of ready tasks): each processor selects a task from the pool (the order can influence the performance). An ordering that is good topologically (with respect to time) is not necessarily good in terms of working memory.

Static scheduling: subtree-to-subcube (or proportional) mapping. Main objective: reduce the volume of communication between the processors. Initially, all processors are assigned to the root node; the processors are then recursively partitioned equally between the children of each node. Good at localizing communication, but not so easy if the processor partitions do not overlap at each step.

Mapping of the tasks onto the processors. Objective: find a layer L0 of the tree such that the subtrees rooted in L0 can be mapped onto the processors with a good balance.

Construction and mapping of the initial level L0:

    begin
       L0 <- roots of the assembly tree
       repeat
          find the node q in L0 whose subtree has the largest computational cost
          set L0 <- (L0 \ {q}) ∪ {children of q}
          greedily map the nodes of L0 onto the processors
          estimate the load unbalance
       until load unbalance < threshold
    end
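A Python sketch of the construction above (hypothetical inputs: children[i], and cost[i] = computational cost of the subtree rooted at i; the greedy mapping assigns the heaviest subtrees first to the least loaded processor, and the threshold value is an arbitrary choice):

    def build_L0(roots, children, cost, nprocs, threshold=0.2):
        L0 = list(roots)
        while True:
            loads = [0.0] * nprocs
            for i in sorted(L0, key=lambda i: -cost[i]):   # greedy mapping
                loads[loads.index(min(loads))] += cost[i]
            imbalance = (max(loads) - min(loads)) / max(max(loads), 1e-30)
            if imbalance < threshold:
                return L0
            q = max(L0, key=lambda i: cost[i])             # expand largest subtree
            if not children[q]:
                return L0                                  # cannot refine further
            L0.remove(q)
            L0.extend(children[q])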

(Figure: decomposition of the tree into levels; determination of the level L0 based on the subtree costs; the subtree roots below L0.) The mapping of the top of the tree can be dynamic, which can be useful for both shared and distributed memory algorithms.

Distributed memory approaches

Computational strategies for parallel direct solvers. The parallel algorithm is characterized by its computational dependency graph and by its communication graph. Three classical approaches:
1. fan-in,
2. fan-out,
3. multifrontal.

Preamble: left-looking and right-looking approaches for Cholesky factorization. Let cmod(j, k) denote the modification of column j by column k (k < j), and cdiv(j) the division of column j by a scalar.

    Left-looking approach:
       for j = 1 to n do
          for k in Struct(row L_j,*) do
             cmod(j, k)
          end for
          cdiv(j)
       end for

    Right-looking approach:
       for k = 1 to n do
          cdiv(k)
          for j in Struct(col L_*,k) do
             cmod(j, k)
          end for
       end for
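A dense NumPy sketch of the two variants, with the cmod/cdiv operations written out (illustrative code, not from any of the solvers discussed here):

    import numpy as np

    def cholesky_left(A):
        L = np.tril(A.copy())
        n = L.shape[0]
        for j in range(n):
            for k in range(j):                      # cmod(j, k): gather updates
                L[j:, j] -= L[j, k] * L[j:, k]
            L[j:, j] /= np.sqrt(L[j, j])            # cdiv(j)
        return L

    def cholesky_right(A):
        L = np.tril(A.copy())
        n = L.shape[0]
        for k in range(n):
            L[k:, k] /= np.sqrt(L[k, k])            # cdiv(k)
            for j in range(k + 1, n):               # cmod(j, k): scatter updates
                L[j:, j] -= L[j, k] * L[j:, k]
        return L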

(Figure: left-looking — the columns used for modification lie to the left of the modified column; right-looking — the current column modifies the columns to its right.)

Assumptions and notations. We assume that each column of L (each node of the tree) is assigned to a single processor, and that each processor is in charge of computing cdiv(j) for the columns j it owns. Notation:
- mycols(p): the set of columns owned by processor p;
- map(j): the processor owning column j (or task j);
- procs(L_k) = { map(j) : j ∈ Struct(L_k) } (only the processors in procs(L_k) require updates from column k; they correspond to ancestors of k in the tree).

Fan-in variant (similar to left-looking). Demand-driven algorithm: the data required are aggregated update columns computed by the sending processor.

    Fan-in(p):
       for j = 1 to n do
          u = 0
          for all k in Struct(row L_j) ∩ mycols(p) do
             u = u + cmod(j, k)
          end for
          if map(j) != p then
             send u to processor map(j)
          end if
          if map(j) == p then
             incorporate u in column j
             receive all necessary aggregated updates for column j
                and incorporate them in column j
             cdiv(j)
          end if
       end for

(Illustration, fan-in / left-looking Cholesky: for j = 1 to n, for k in Struct(L_j,*) do cmod(j, k); cdiv(j). If map(1) = map(2) = map(3) = p and map(j) != p for the target column j, (only) one aggregated message is sent by p to update column j: fan-in exploits the data locality of the proportional mapping.)

Fan-out variant (similar to right-looking). Data-driven algorithm:

    Fan-out(p):
       for all leaf nodes j in mycols(p) do
          cdiv(j)
          send column L_j to procs(col L_j)
          mycols(p) = mycols(p) \ {j}
       end for
       while mycols(p) != ∅ do
          receive any column (say L_k) of L
          for j in Struct(col L_k) ∩ mycols(p) do
             cmod(j, k)
             if column j is completely updated then
                cdiv(j)
                send column L_j to procs(col L_j)
                mycols(p) = mycols(p) \ {j}
             end if
          end for
       end while

(Illustration, fan-out / right-looking Cholesky: for k = 1 to n, cdiv(k); for j in Struct(L_*,k) do cmod(j, k).)

In the example, if map(1) = map(2) = p and map(3) != p, then two messages (for columns 1 and 2) are sent by p to update column 3.

Fan-out variant: properties.
- Historically the first implemented.
- Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of total number of messages and of total volume.
- Does not exploit the data locality of the proportional mapping.
- Improved algorithm (local aggregation): send aggregated update columns instead of individual factor columns for columns mapped on a single processor. This improves the exploitation of the data locality of the proportional mapping, but the memory increase needed to store the aggregates can be critical (as in fan-in).

Multifrontal variant.

    for k = 1 to n do
       build the full frontal matrix with all indices in Struct(L_*,k)
       partial factorization
       send the contribution block to the father
    end for

(Figure: the elimination tree with fronts and contribution blocks — the "multifrontal method".)

Some parallel solvers

Shared memory sparse direct codes:

Code     | Technique          | Scope   | Availability
MA41     | multifrontal       | UNS     | cse.clrc.ac.uk/activity/hsl
MA49     | multifrontal QR    | RECT    | cse.clrc.ac.uk/activity/hsl
PanelLLT | left-looking       | SPD     | Ng
PARDISO  | left-right looking | UNS     | Schenk
PSL      | left-looking       | SPD/UNS | SGI product (object code only)
SPOOLES  | fan-in             | SYM/UNS | netlib.org/linalg/spooles
SuperLU  | left-looking       | UNS     | nersc.gov/~xiaoye/superlu
WSMP     | multifrontal       | SYM/UNS | IBM product


More information

BLAS and LAPACK + Data Formats for Sparse Matrices. Part of the lecture Wissenschaftliches Rechnen. Hilmar Wobker

BLAS and LAPACK + Data Formats for Sparse Matrices. Part of the lecture Wissenschaftliches Rechnen. Hilmar Wobker BLAS and LAPACK + Data Formats for Sparse Matrices Part of the lecture Wissenschaftliches Rechnen Hilmar Wobker Institute of Applied Mathematics and Numerics, TU Dortmund email: hilmar.wobker@math.tu-dortmund.de

More information

Outline. CS38 Introduction to Algorithms. Linear programming 5/21/2014. Linear programming. Lecture 15 May 20, 2014

Outline. CS38 Introduction to Algorithms. Linear programming 5/21/2014. Linear programming. Lecture 15 May 20, 2014 5/2/24 Outline CS38 Introduction to Algorithms Lecture 5 May 2, 24 Linear programming simplex algorithm LP duality ellipsoid algorithm * slides from Kevin Wayne May 2, 24 CS38 Lecture 5 May 2, 24 CS38

More information

Lecture 9. Introduction to Numerical Techniques

Lecture 9. Introduction to Numerical Techniques Lecture 9. Introduction to Numerical Techniques Ivan Papusha CDS270 2: Mathematical Methods in Control and System Engineering May 27, 2015 1 / 25 Logistics hw8 (last one) due today. do an easy problem

More information

Direct Solvers for Sparse Matrices X. Li September 2006

Direct Solvers for Sparse Matrices X. Li September 2006 Direct Solvers for Sparse Matrices X. Li September 2006 Direct solvers for sparse matrices involve much more complicated algorithms than for dense matrices. The main complication is due to the need for

More information

Technical Report TR , Computer and Information Sciences Department, University. Abstract

Technical Report TR , Computer and Information Sciences Department, University. Abstract An Approach for Parallelizing any General Unsymmetric Sparse Matrix Algorithm Tariq Rashid y Timothy A.Davis z Technical Report TR-94-036, Computer and Information Sciences Department, University of Florida,

More information

Second Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering

Second Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering State of the art distributed parallel computational techniques in industrial finite element analysis Second Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering Ajaccio, France

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future

More information

CS521 \ Notes for the Final Exam

CS521 \ Notes for the Final Exam CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )

More information

Sparse matrices, graphs, and tree elimination

Sparse matrices, graphs, and tree elimination Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);

More information

The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method

The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method VLADIMIR ROTKIN and SIVAN TOLEDO Tel-Aviv University We describe a new out-of-core sparse Cholesky factorization

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Algorithm Design (8) Graph Algorithms 1/2

Algorithm Design (8) Graph Algorithms 1/2 Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of

More information

Null space computation of sparse singular matrices with MUMPS

Null space computation of sparse singular matrices with MUMPS Null space computation of sparse singular matrices with MUMPS Xavier Vasseur (CERFACS) In collaboration with Patrick Amestoy (INPT-IRIT, University of Toulouse and ENSEEIHT), Serge Gratton (INPT-IRIT,

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

F k G A S S1 3 S 2 S S V 2 V 3 V 1 P 01 P 11 P 10 P 00

F k G A S S1 3 S 2 S S V 2 V 3 V 1 P 01 P 11 P 10 P 00 PRLLEL SPRSE HOLESKY FTORIZTION J URGEN SHULZE University of Paderborn, Department of omputer Science Furstenallee, 332 Paderborn, Germany Sparse matrix factorization plays an important role in many numerical

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign

More information

Sparse matrices: Basics

Sparse matrices: Basics Outline : Basics Bora Uçar RO:MA, LIP, ENS Lyon, France CR-08: Combinatorial scientific computing, September 201 http://perso.ens-lyon.fr/bora.ucar/cr08/ 1/28 CR09 Outline Outline 1 Course presentation

More information

c 2004 Society for Industrial and Applied Mathematics

c 2004 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 26, No. 2, pp. 390 399 c 2004 Society for Industrial and Applied Mathematics HERMITIAN MATRICES, EIGENVALUE MULTIPLICITIES, AND EIGENVECTOR COMPONENTS CHARLES R. JOHNSON

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Row ordering for frontal solvers in chemical process engineering

Row ordering for frontal solvers in chemical process engineering Computers and Chemical Engineering 24 (2000) 1865 1880 www.elsevier.com/locate/compchemeng Row ordering for frontal solvers in chemical process engineering Jennifer A. Scott * Computational Science and

More information

CS Data Structures and Algorithm Analysis

CS Data Structures and Algorithm Analysis CS 483 - Data Structures and Algorithm Analysis Lecture VI: Chapter 5, part 2; Chapter 6, part 1 R. Paul Wiegand George Mason University, Department of Computer Science March 8, 2006 Outline 1 Topological

More information

FAST COMPUTATION OF MINIMAL FILL INSIDE A GIVEN ELIMINATION ORDERING

FAST COMPUTATION OF MINIMAL FILL INSIDE A GIVEN ELIMINATION ORDERING FAST COMPUTATION OF MINIMAL FILL INSIDE A GIVEN ELIMINATION ORDERING PINAR HEGGERNES AND BARRY W. PEYTON Abstract. Minimal elimination orderings were introduced by Rose, Tarjan, and Lueker in 1976, and

More information

Lecture 3: Graphs and flows

Lecture 3: Graphs and flows Chapter 3 Lecture 3: Graphs and flows Graphs: a useful combinatorial structure. Definitions: graph, directed and undirected graph, edge as ordered pair, path, cycle, connected graph, strongly connected

More information

Sparse Matrix Algorithms

Sparse Matrix Algorithms Sparse Matrix Algorithms combinatorics + numerical methods + applications Math + X Tim Davis University of Florida June 2013 contributions to the field current work vision for the future Outline Math+X

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Sparse Linear Algebra

Sparse Linear Algebra Lecture 5 Sparse Linear Algebra The solution of a linear system Ax = b is one of the most important computational problems in scientific computing. As we shown in the previous section, these linear systems

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

A Static Parallel Multifrontal Solver for Finite Element Meshes

A Static Parallel Multifrontal Solver for Finite Element Meshes A Static Parallel Multifrontal Solver for Finite Element Meshes Alberto Bertoldo, Mauro Bianco, and Geppino Pucci Department of Information Engineering, University of Padova, Padova, Italy {cyberto, bianco1,

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the

More information

CSCE 411 Design and Analysis of Algorithms

CSCE 411 Design and Analysis of Algorithms CSCE 411 Design and Analysis of Algorithms Set 4: Transform and Conquer Slides by Prof. Jennifer Welch Spring 2014 CSCE 411, Spring 2014: Set 4 1 General Idea of Transform & Conquer 1. Transform the original

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Object-oriented Design for Sparse Direct Solvers

Object-oriented Design for Sparse Direct Solvers NASA/CR-1999-208978 ICASE Report No. 99-2 Object-oriented Design for Sparse Direct Solvers Florin Dobrian Old Dominion University, Norfolk, Virginia Gary Kumfert and Alex Pothen Old Dominion University,

More information

PACKAGE SPECIFICATION HSL 2013

PACKAGE SPECIFICATION HSL 2013 MA41 PACKAGE SPECIFICATION HSL 2013 1 SUMMARY To solve a sparse unsymmetric system of linear equations. Given a square unsymmetric sparse matrix A of order n T and an n-vector b, this subroutine solves

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information

A direct solver with reutilization of previously-computed LU factorizations for h-adaptive finite element grids with point singularities

A direct solver with reutilization of previously-computed LU factorizations for h-adaptive finite element grids with point singularities Maciej Paszynski AGH University of Science and Technology Faculty of Computer Science, Electronics and Telecommunication Department of Computer Science al. A. Mickiewicza 30 30-059 Krakow, Poland tel.

More information

A parallel direct/iterative solver based on a Schur complement approach

A parallel direct/iterative solver based on a Schur complement approach A parallel direct/iterative solver based on a Schur complement approach Gene around the world at CERFACS Jérémie Gaidamour LaBRI and INRIA Bordeaux - Sud-Ouest (ScAlApplix project) February 29th, 2008

More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA

More information

Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm

Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm PATRICK R. AMESTOY, ENSEEIHT-IRIT, and TIMOTHY A. DAVIS University of Florida and IAIN S. DUFF CERFACS and Rutherford Appleton Laboratory

More information

SUPERFAST MULTIFRONTAL METHOD FOR STRUCTURED LINEAR SYSTEMS OF EQUATIONS

SUPERFAST MULTIFRONTAL METHOD FOR STRUCTURED LINEAR SYSTEMS OF EQUATIONS SUPERFAS MULIFRONAL MEHOD FOR SRUCURED LINEAR SYSEMS OF EQUAIONS S. CHANDRASEKARAN, M. GU, X. S. LI, AND J. XIA Abstract. In this paper we develop a fast direct solver for discretized linear systems using

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

A Sparse Symmetric Indefinite Direct Solver for GPU Architectures

A Sparse Symmetric Indefinite Direct Solver for GPU Architectures 1 A Sparse Symmetric Indefinite Direct Solver for GPU Architectures JONATHAN D. HOGG, EVGUENI OVTCHINNIKOV, and JENNIFER A. SCOTT, STFC Rutherford Appleton Laboratory In recent years, there has been considerable

More information

Graph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen

Graph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen Graph Partitioning for High-Performance Scientific Simulations Advanced Topics Spring 2008 Prof. Robert van Engelen Overview Challenges for irregular meshes Modeling mesh-based computations as graphs Static

More information

Preconditioning for linear least-squares problems

Preconditioning for linear least-squares problems Preconditioning for linear least-squares problems Miroslav Tůma Institute of Computer Science Academy of Sciences of the Czech Republic tuma@cs.cas.cz joint work with Rafael Bru, José Marín and José Mas

More information

ECEN 615 Methods of Electric Power Systems Analysis Lecture 11: Sparse Systems

ECEN 615 Methods of Electric Power Systems Analysis Lecture 11: Sparse Systems ECEN 615 Methods of Electric Power Systems Analysis Lecture 11: Sparse Systems Prof. Tom Overbye Dept. of Electrical and Computer Engineering Texas A&M University overbye@tamu.edu Announcements Homework

More information

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R.

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R. Progress In Electromagnetics Research Letters, Vol. 9, 29 38, 2009 AN IMPROVED ALGORITHM FOR MATRIX BANDWIDTH AND PROFILE REDUCTION IN FINITE ELEMENT ANALYSIS Q. Wang National Key Laboratory of Antenna

More information

MA41 SUBROUTINE LIBRARY SPECIFICATION 1st Aug. 2000

MA41 SUBROUTINE LIBRARY SPECIFICATION 1st Aug. 2000 MA41 SUBROUTINE LIBRARY SPECIFICATION 1st Aug. 2000 1 SUMMARY To solve a sparse unsymmetric system of linear equations. Given a square unsymmetric sparse matrix T A of order n and a matrix B of nrhs right

More information

CSCI5070 Advanced Topics in Social Computing

CSCI5070 Advanced Topics in Social Computing CSCI5070 Advanced Topics in Social Computing Irwin King The Chinese University of Hong Kong king@cse.cuhk.edu.hk!! 2012 All Rights Reserved. Outline Graphs Origins Definition Spectral Properties Type of

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert

More information

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path.

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path. Chapter 3 Trees Section 3. Fundamental Properties of Trees Suppose your city is planning to construct a rapid rail system. They want to construct the most economical system possible that will meet the

More information

CSC Design and Analysis of Algorithms

CSC Design and Analysis of Algorithms CSC : Lecture 7 CSC - Design and Analysis of Algorithms Lecture 7 Transform and Conquer I Algorithm Design Technique CSC : Lecture 7 Transform and Conquer This group of techniques solves a problem by a

More information

Review on Storage and Ordering for Large Sparse Matrix

Review on Storage and Ordering for Large Sparse Matrix Review on Storage and Ordering for Large Sparse Matrix (Final Project, Math) Wei He (SID: 5763774) Introduction Many problems in science and engineering require the solution of a set of sparse matrix equations

More information

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning Parallel sparse matrix-vector product Lay out matrix and vectors by rows y(i) = sum(a(i,j)*x(j)) Only compute terms with A(i,j) 0 P0 P1

More information

Parallel resolution of sparse linear systems by mixing direct and iterative methods

Parallel resolution of sparse linear systems by mixing direct and iterative methods Parallel resolution of sparse linear systems by mixing direct and iterative methods Phyleas Meeting, Bordeaux J. Gaidamour, P. Hénon, J. Roman, Y. Saad LaBRI and INRIA Bordeaux - Sud-Ouest (ScAlApplix

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information

CS 770G - Parallel Algorithms in Scientific Computing

CS 770G - Parallel Algorithms in Scientific Computing CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,

More information

Reducing the total bandwidth of a sparse unsymmetric matrix

Reducing the total bandwidth of a sparse unsymmetric matrix RAL-TR-2005-001 Reducing the total bandwidth of a sparse unsymmetric matrix J. K. Reid and J. A. Scott March 2005 Council for the Central Laboratory of the Research Councils c Council for the Central Laboratory

More information

HYPERGRAPH-BASED UNSYMMETRIC NESTED DISSECTION ORDERING FOR SPARSE LU FACTORIZATION

HYPERGRAPH-BASED UNSYMMETRIC NESTED DISSECTION ORDERING FOR SPARSE LU FACTORIZATION SIAM J. SCI. COMPUT. Vol. 32, No. 6, pp. 3426 3446 c 2010 Society for Industrial and Applied Mathematics HYPERGRAPH-BASED UNSYMMETRIC NESTED DISSECTION ORDERING FOR SPARSE LU FACTORIZATION LAURA GRIGORI,

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

2 Fundamentals of Serial Linear Algebra

2 Fundamentals of Serial Linear Algebra . Direct Solution of Linear Systems.. Gaussian Elimination.. LU Decomposition and FBS..3 Cholesky Decomposition..4 Multifrontal Methods. Iterative Solution of Linear Systems.. Jacobi Method Fundamentals

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 4 Graphs Definitions Traversals Adam Smith 9/8/10 Exercise How can you simulate an array with two unbounded stacks and a small amount of memory? (Hint: think of a

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Parallel Sparse LU Factorization on Different Message Passing Platforms

Parallel Sparse LU Factorization on Different Message Passing Platforms Parallel Sparse LU Factorization on Different Message Passing Platforms Kai Shen Department of Computer Science, University of Rochester Rochester, NY 1467, USA Abstract Several message passing-based parallel

More information