SUPERFAST MULTIFRONTAL METHOD FOR STRUCTURED LINEAR SYSTEMS OF EQUATIONS


S. CHANDRASEKARAN, M. GU, X. S. LI, AND J. XIA

Abstract. In this paper we develop a fast direct solver for discretized linear systems using the multifrontal method together with low-rank approximations. For linear systems arising from certain partial differential equations such as elliptic equations, we discover that during the Gaussian elimination of the matrices with proper ordering, the fill-in has a low-rank property: all off-diagonal blocks have small numerical ranks with a proper definition of off-diagonal blocks. Matrices with this low-rank property can be efficiently approximated with semiseparable structures, called hierarchically semiseparable (HSS) representations. We reveal the above low-rank property by ordering the variables with nested dissection and eliminating them with the multifrontal method. All matrix operations in the multifrontal method are performed in HSS form. We present efficient ways to build compact HSS structures along the elimination, and propose the HSS matrix operations necessary for the structured multifrontal method. By taking advantage of the low-rank property, the multifrontal method with HSS structures can be shown to have linear complexity and to require only linear storage. We therefore call this structured method a superfast multifrontal method. It is especially suitable for large problems, has natural adaptability to parallel computation, and has great potential to provide effective preconditioners. Numerical results demonstrate the efficiency.

Key words. direct solver, HSS matrix, low-rank property, superfast multifrontal method, nested dissection

AMS subject classifications. 15A23, 65F05, 65F30, 65F50

1. Introduction. In many computational and engineering problems it is critical to solve large structured linear systems of equations. Different structures come from the different natures of the original problems or from different techniques of discretization, linearization, or simplification. It is usually important to take advantage of the special structures, or to preserve them when necessary. Direct solvers often provide good chances to exploit the structures. Direct methods are also attractive because of their efficiency, reliability, and generality.

Here we are interested in the discretization of differential equations such as elliptic PDEs on a two-dimensional (2D) finite element mesh (grid) M [15, 2, 11], which is any mesh formed by partitioning a planar region with a boundary into subregions (elements) by some lines (edges). Edges interconnect at their end points (nodes). In this paper we consider a regular mesh, or more generally, a well-shaped mesh [28, 29, 33]; by a well-shaped mesh we simply mean a mesh where nested dissection or its generalizations [15, 25, 16, 18, 1, 33] can be used. A symmetric positive definite (SPD) system

    Ax = b                                                                 (1.1)

associated with M is called a finite element system if variables are associated with mesh nodes and the system has the property that A_{ij} ≠ 0 if and only if x_i and x_j are associated with nodes of the same element of M. Such matrices arise when we apply 5-point difference or finite element techniques to solve 2D linear boundary value problems such as elliptic boundary value problems on rectangular domains [14].

Author affiliations: Department of Electrical and Computer Engineering, University of California at Santa Barbara (shiv@ece.ucsb.edu); Department of Mathematics, University of California at Berkeley (mgu@math.berkeley.edu); Lawrence Berkeley National Laboratory, MS 50F-1650, One Cyclotron Rd., Berkeley, CA (xsli@lbl.gov); Department of Mathematics, University of California at Los Angeles (jxia@math.ucla.edu).

For the purpose of presentation we will sometimes refer to the following model problem.

Model Problem 1.1. We use the 2D discrete Laplacian on a 5-point stencil as our model problem for the numerical experiments, where A is a 5-diagonal n^2 x n^2 sparse matrix:

    A = \begin{pmatrix} T & -I & & \\ -I & T & \ddots & \\ & \ddots & \ddots & -I \\ & & -I & T \end{pmatrix}, \qquad
    T = \begin{pmatrix} 4 & -1 & & \\ -1 & 4 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 4 \end{pmatrix}.

Another example is the 9-point stencil.

To solve (1.1), people typically first order the mesh nodes and then factorize the discretized matrix. Direct factorization of such a matrix with a row-wise node ordering takes O(n^4) flops [15]. The nested dissection ordering of the nodes [15] gives an elimination scheme with O(n^3) cost, which is optimal for any ordering in exact arithmetic [22]. But O(n^3) is still large for a big n. Sometimes iterative methods are cheaper if effective preconditioners are available; but for hard problems good preconditioners can be difficult to find.

1.1. Superfast multifrontal method. Direct solvers have been considered expensive because of the large amount of fill-in, even if the original matrix is sparse. To handle the fill-in effectively, one can represent or approximate the dense matrices arising in complicated problems with structured matrices such as H-matrices [19, 21, 20], quasiseparable matrices [13], semiseparable matrices [7, 5, 6], etc. Similarly, when solving discretized PDEs such as elliptic equations, we can develop fast direct solvers by exploiting the numerical low-rank property and by approximating the dense matrices in these problems with compact semiseparable matrices, without compromising the accuracy. These approximations are feasible because the fill-in during the elimination with certain orderings is actually structured. The fill-in is closely related to the Green's functions via Schur complements; thus the off-diagonal blocks of the fill-in have small numerical ranks. Due to this low-rank property, we show that the structured approximations of the dense N x N subproblems in those problems can be solved with a cost of O(p^2 N), and this leads to a total complexity of O(pn^2), where p is a constant related to the PDE and to the accuracy of the matrix approximations, and n is the mesh size. To be more precise, our solver includes both the approximation stage for the dense matrices in the original problem and the direct solution stage for the structured approximations. We still call the overall procedure a direct solver.

One of the important direct methods for sparse matrix solutions is the multifrontal method by Duff and Reid [12]. Liu gave a detailed description of the multifrontal method for large sparse SPD systems in [27]. In our direct solver we order the mesh nodes with nested dissection and organize the elimination process with the multifrontal method. This procedure has been demonstrated to be effective for sparse problems. Moreover, we discover that it produces dense fill-in with the above low-rank property for many discretized PDEs. The method eliminates variables and accumulates updates according to an elimination tree or assembly tree whose nodes are the variables. Here we use a supernodal version of the multifrontal method: we treat each separator as a node in the supernodal elimination tree.

In this paper the semiseparable matrices we use to approximate the dense submatrices in the multifrontal method are called hierarchically semiseparable (HSS) matrices, introduced by Chandrasekaran, Gu, et al. in [9, 10] and further generalized in [8, 34].
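
As a concrete illustration of Model Problem 1.1, a minimal Python/SciPy sketch that assembles the matrix A from Kronecker products of the one-dimensional second-difference matrix is given below. It is only a schematic example (the function name model_problem is arbitrary) and is not part of the solver described in this paper.

```python
import numpy as np
import scipy.sparse as sp

def model_problem(n):
    """5-point Laplacian on an n x n grid (Model Problem 1.1), of order n^2."""
    T1 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    I = sp.identity(n, format="csr")
    # Block tridiagonal matrix with T = tridiag(-1, 4, -1) on the diagonal
    # and -I on the off-diagonal block positions.
    return (sp.kron(I, T1) + sp.kron(T1, I)).tocsr()

if __name__ == "__main__":
    n = 8
    A = model_problem(n)
    # Check the block structure displayed in the text.
    T = A[:n, :n].toarray()
    assert np.allclose(np.diag(T), 4.0)                      # diagonal blocks are T
    assert np.allclose(A[:n, n:2 * n].toarray(), -np.eye(n)) # off-diagonal blocks are -I
    print("average nonzeros per row:", A.nnz / A.shape[0])   # about 5 for the 5-point stencil
```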

Many fast algorithms exist for HSS structures. HSS matrices are characterized by hierarchical low-rank properties in the off-diagonal blocks (Figure 1.1), which are the block rows excluding the diagonal blocks at different levels of splittings of the matrix. We call them HSS off-diagonal blocks. Off-diagonal block columns can be considered similarly.

Fig. 1.1. Two levels of HSS off-diagonal blocks: (i) bottom level HSS blocks; (ii) upper level HSS blocks.

More specifically, we say that a matrix has the low-rank property if all its HSS off-diagonal blocks have small numerical ranks. We call the maximum numerical rank over all HSS blocks the HSS rank of the matrix. Here by numerical ranks we mean the ranks obtained by rank-revealing QR factorizations or by a τ-accurate SVD, in which all singular values less than a tolerance τ are discarded when τ is an absolute tolerance; similarly, τ can be a relative tolerance. Later we simply use ranks to mean numerical ranks.

Update matrices and frontal matrices in the multifrontal method for the model problem have the low-rank property, so HSS approximations can be used. In the multifrontal method all matrix operations are then conducted efficiently via HSS approximations. Some basic HSS operations can be found in [8, 9, 10], including structure generation, system solving, etc. In this paper we also provide additional HSS operations necessary for the structured multifrontal method. Both the multifrontal method and the HSS structure have nice hierarchical tree structures. They both take good advantage of dense matrix computations and have efficient storage, good data locality, and natural adaptability to parallel computation. The use of HSS matrices in the multifrontal method leads to an efficient structured multifrontal method. For problems such as the 5-point or 9-point finite difference Laplacian, this structured multifrontal method has nearly linear complexity. We thus call this method a superfast multifrontal method. The method is also memory efficient. By setting a relatively large tolerance in the matrix approximations, the method becomes even more efficient and can also work as an effective preconditioner. We have developed a software package for the solver and the various HSS operations.

1.2. Outline of the paper. This paper is organized as follows. In Section 2 we give an introduction to the multifrontal method, nested dissection, and HSS structures. Section 3 presents our new superfast multifrontal method. The development proceeds as follows. First, in Subsection 3.1 we investigate the low-rank properties in the multifrontal method. Certain rules are set up to guarantee small numerical ranks and the feasibility of low-rank structures. Subsection 3.3 discusses the superfast multifrontal method with HSS structures in detail; the necessary HSS operations are proposed so that all matrix operations are structured. Then in Subsection 3.4 we present some implementation issues and summarize the method in an algorithm. Section 4 demonstrates the efficiency of the superfast multifrontal method with numerical experiments. Section 5 gives some general remarks about the method.
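
The notions of τ-accurate numerical rank and HSS off-diagonal block rows can be illustrated with a minimal NumPy sketch, given below. It only measures the bottom-level block-row ranks of a dense matrix for an absolute tolerance τ, on an arbitrary smooth test matrix, and is unrelated to the software package mentioned above; the HSS rank would be the maximum of such ranks over all levels of the splitting.

```python
import numpy as np

def numerical_rank(block, tau):
    """Numerical rank with an absolute tolerance tau (tau-accurate SVD):
    singular values below tau are discarded."""
    if block.size == 0:
        return 0
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tau))

def hss_offdiag_ranks(A, block_size, tau):
    """Ranks of the bottom-level HSS off-diagonal block rows of A:
    for each block row, the block row excluding its diagonal block."""
    N = A.shape[0]
    ranks = []
    for start in range(0, N, block_size):
        stop = min(start + block_size, N)
        off = np.hstack([A[start:stop, :start], A[start:stop, stop:]])
        ranks.append(numerical_rank(off, tau))
    return ranks

if __name__ == "__main__":
    # A small dense test matrix with smooth off-diagonal interactions (illustration only).
    x = np.linspace(0.0, 1.0, 256)
    A = 1.0 / (1.0 + 50.0 * np.abs(x[:, None] - x[None, :]))
    print(hss_offdiag_ranks(A, block_size=32, tau=1e-6))
```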

2. Review of the multifrontal method and HSS representations. In this section we briefly review the multifrontal method and the HSS representations on which our superfast structured multifrontal method is built.

2.1. Multifrontal method with nested dissection ordering. The central idea of the multifrontal method is to reorganize the overall factorization of a large sparse matrix into partial updates and factorizations of small dense matrices [12, 27]. It has been widely used in numerical methods for PDEs, optimization, fluid dynamics, and other areas.

We take a look at the factorization of an N x N SPD matrix A. The elimination of j-1 variables in block form is

    A \equiv \begin{pmatrix} D_{j-1} & V_{j-1}^T \\ V_{j-1} & A_{j-1} \end{pmatrix}
      = \begin{pmatrix} L_{j-1} & 0 \\ V_{j-1} L_{j-1}^{-T} & I_{N-j+1} \end{pmatrix}
        \begin{pmatrix} L_{j-1}^T & L_{j-1}^{-1} V_{j-1}^T \\ 0 & A_j \end{pmatrix},                 (2.1)

where A_j = A_{j-1} - V_{j-1} D_{j-1}^{-1} V_{j-1}^T is the Schur complement. Cholesky factorizations can be represented by outer-product updates [27]. The matrix V_{j-1} D_{j-1}^{-1} V_{j-1}^T represents the update contributions from the first j-1 rows/columns to A_j. It is clear that V_{j-1} D_{j-1}^{-1} V_{j-1}^T can be written in terms of the first j-1 columns of the Cholesky factor L as

    V_{j-1} D_{j-1}^{-1} V_{j-1}^T = (V_{j-1} L_{j-1}^{-T})(V_{j-1} L_{j-1}^{-T})^T = \sum_{k=1}^{j-1} \tilde U_k,

where \tilde U_k = l_{j:N,k}\, l_{j:N,k}^T is an outer-product update from the k-th column of L.

To consider conveniently how the update contributions are passed, a powerful tool called the elimination tree is used. The following definitions can be found in [26, 27, 32], etc.

Definition 2.1. The elimination tree T(A) of the N x N matrix A is the tree structure with N nodes {1, ..., N} such that node p is the parent of j if and only if p = min{i > j : l_{ij} ≠ 0}.

Definition 2.2. The j-th frontal matrix F_j for A is

    F_j = \begin{pmatrix} a_{jj} & a_{j i_1} & \cdots & a_{j i_p} \\ a_{i_1 j} & & & \\ \vdots & & 0 & \\ a_{i_p j} & & & \end{pmatrix}
          - \sum_{k:\ k \text{ a proper descendant of } j} \begin{pmatrix} l_{jk} \\ l_{i_1 k} \\ \vdots \\ l_{i_p k} \end{pmatrix}
            \begin{pmatrix} l_{jk} & l_{i_1 k} & \cdots & l_{i_p k} \end{pmatrix},                   (2.2)

where j, i_1, ..., i_p are the row subscripts of the nonzeros in L_{:,j}. We call the first matrix on the right-hand side of (2.2) the initial frontal matrix and denote it by F_j^0. The update matrix U_j is the Schur complement obtained from one step of elimination of F_j to get the j-th column of L, that is,

    F_j = \begin{pmatrix} l_{jj} & 0 \\ \begin{pmatrix} l_{i_1 j} \\ \vdots \\ l_{i_p j} \end{pmatrix} & I \end{pmatrix}
          \begin{pmatrix} l_{jj} & \begin{pmatrix} l_{i_1 j} & \cdots & l_{i_p j} \end{pmatrix} \\ 0 & U_j \end{pmatrix}.

The frontal matrix F_j carries contributions from the matrix A and the outer-product contributions from those preceding nodes that are proper descendants of j. For each child c_k of j, the update matrix U_{c_k} carries contributions from the subtree with root c_k.
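
A minimal dense sketch of the relation between a frontal matrix and its update matrix is given below: it eliminates the leading pivot block of a symmetric frontal matrix and returns the corresponding columns of the Cholesky factor together with the Schur complement, which plays the role of the update matrix. It is only illustrative and assumes a generic SPD frontal matrix.

```python
import numpy as np

def eliminate_pivot_block(F, k):
    """One (supernodal) elimination step on a symmetric frontal matrix F:
    factor the leading k x k pivot block and return (L_cols, U), where L_cols
    holds the first k columns of the Cholesky factor of F and U is the
    Schur complement, i.e., the update matrix."""
    F11, F21, F22 = F[:k, :k], F[k:, :k], F[k:, k:]
    L11 = np.linalg.cholesky(F11)
    L21 = np.linalg.solve(L11, F21.T).T          # L21 = F21 * L11^{-T}
    U = F22 - L21 @ L21.T                        # Schur complement / update matrix
    return np.vstack([L11, L21]), U

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((10, 10))
    F = M @ M.T + 10 * np.eye(10)                # SPD stand-in for a frontal matrix
    L_cols, U = eliminate_pivot_block(F, k=4)
    # U equals the Schur complement F22 - F21 F11^{-1} F12.
    print(np.allclose(U, F[4:, 4:] - F[4:, :4] @ np.linalg.inv(F[:4, :4]) @ F[:4, 4:]))
```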

The contributions from all children c_k are assembled together by an operation called extend-add, denoted ⊕ below. The extend-add operator aligns the subscript sets of two matrices by padding zero entries, and then adds the matrices. For example, if two matrices

    U_{c_i} = \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix}, \quad i = 1, 2,

have subscript sets {1, 2} and {1, 3}, respectively, then

    U_{c_1} ⊕ U_{c_2} = \begin{pmatrix} a_1 & b_1 & 0 \\ c_1 & d_1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
                      + \begin{pmatrix} a_2 & 0 & b_2 \\ 0 & 0 & 0 \\ c_2 & 0 & d_2 \end{pmatrix}
                      = \begin{pmatrix} a_1 + a_2 & b_1 & b_2 \\ c_1 & d_1 & 0 \\ c_2 & 0 & d_2 \end{pmatrix}.

The relationship between frontal matrices and update matrices is revealed by

    F_j = F_j^0 ⊕ U_{c_1} ⊕ U_{c_2} ⊕ \cdots ⊕ U_{c_q},

where nodes c_1, c_2, ..., c_q are the children of j.

The manipulation of frontal and update matrices is related to the ordering of the nodes. Different orderings can lead to highly different performance of the method. Nested dissection, proposed by George [15], is one of the most attractive methods for finding an elimination ordering when solving sparse systems. Nested dissection orders mesh nodes with the aid of separators. A separator is a small set of nodes whose removal divides the rest of the mesh or submesh into two disjoint pieces. Different levels of separators are used to order the nodes; that is, lower level nodes are ordered before upper level ones. See Figure 2.1(i). Note that here by a level we mean a set of separators in the same level of division, which are either all horizontal or all vertical. It can easily be verified that the total number of levels in nested dissection is l = O(log_2 n). After the nodes are ordered, we compute the Cholesky factorization A = LL^T by Gaussian elimination, which corresponds to the elimination of the mesh nodes (unknowns) from lower levels to upper levels.

Fig. 2.1. (i) Partition with separators; (ii) node connections during the elimination, after eliminating the bottom levels.

Nodes may get connected during the elimination (Figure 2.1(ii)). Consider the matrix A_j as in (2.1) and variables x_k and x_i (j < i, k) associated with nodes k and i. We say x_k and x_i are connected if their corresponding off-diagonal entry in A_j is nonzero. Assume no zeros are created through cancellation during the elimination process. Then the elimination of x_j connects pairwise all x_i (i > j) which were previously connected to x_j [31, 15].

We use a supernodal version of the multifrontal method with the nested dissection ordering to solve the discretized problems. Each separator in nested dissection is considered to be a node in the postordered elimination tree. The separators are put into different levels of the tree (Figure 2.2). During the elimination, the separators are eliminated following the nested dissection ordering of the mesh nodes. The elimination of separator i will connect all its neighbor separators, which, in the 2D mesh, we assume to be p_1, p_2, p_3, p_4 satisfying p_1 < p_2 < p_3 < p_4 (some of them may be empty).
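
A minimal sketch of the extend-add operator is given below; it aligns two update matrices on the union of their index sets by zero padding and adds them, reproducing the small example above with subscript sets {1, 2} and {1, 3}. The function name extend_add is arbitrary.

```python
import numpy as np

def extend_add(U1, idx1, U2, idx2):
    """Extend-add of two update matrices with global index sets idx1 and idx2:
    align both onto the union of the index sets by zero padding, then add."""
    union = sorted(set(idx1) | set(idx2))
    pos = {g: k for k, g in enumerate(union)}
    out = np.zeros((len(union), len(union)))
    for U, idx in ((U1, idx1), (U2, idx2)):
        p = [pos[g] for g in idx]
        out[np.ix_(p, p)] += U
    return out, union

if __name__ == "__main__":
    U1 = np.array([[1.0, 2.0], [3.0, 4.0]])       # a1 b1 / c1 d1, indices {1, 2}
    U2 = np.array([[10.0, 20.0], [30.0, 40.0]])   # a2 b2 / c2 d2, indices {1, 3}
    S, idx = extend_add(U1, [1, 2], U2, [1, 3])
    print(idx)   # [1, 2, 3]
    print(S)     # [[a1+a2, b1, b2], [c1, d1, 0], [c2, 0, d2]]
```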

Fig. 2.2. (i) Ordering separators; (ii) separator tree (nested dissection elimination tree).

Here we say that the separators {i, p_1, p_2, p_3, p_4} form an element, and that i is the pivot separator of this element. Figure 2.3(i) shows an example. Later, when no confusion is caused, we also call i a node of the elimination tree. Here we have two cases, depending on whether i is a bottom level separator in the mesh (a leaf node in the elimination tree) or not.

Fig. 2.3. Examples of elements: (i) leaf node i; (ii) non-leaf node i. The ordering of p_1, p_2, p_3, p_4 may not necessarily follow their order in the elimination tree.

If separator i is a bottom level separator in nested dissection, or a leaf node in the elimination tree (Figure 2.3(i)), the frontal matrix F_i is simply the block form initial frontal matrix F_i^0 associated with node i. The elimination of separator i provides the block column in L corresponding to i, and the Schur complement is the i-th update matrix:

    F_i \equiv F_i^0 = \begin{pmatrix} A_{ii} & A_{iB} \\ A_{Bi} & 0 \end{pmatrix}, \qquad
    U_i = -A_{Bi} A_{ii}^{-1} A_{iB},                                                             (2.3)

where A_{iB} = ( A_{i p_1}  A_{i p_2}  A_{i p_3}  A_{i p_4} ) and A_{Bi} = A_{iB}^T.

If separator i is not a leaf node (Figure 2.3(ii)), we assume it has two children c_1 and c_2. Then F_i is obtained by assembling the initial frontal matrix and the update matrices of its children with extend-add:

    F_i = F_i^0 ⊕ U_{c_1} ⊕ U_{c_2} = \begin{pmatrix} L_{ii} & 0 \\ L_{Bi} & I \end{pmatrix}
          \begin{pmatrix} L_{ii}^T & L_{Bi}^T \\ 0 & U_i \end{pmatrix},                            (2.4)

where B denotes the element boundary {p_1, p_2, p_3, p_4}, and the elimination of separator i yields U_i.

Note that U_{c_1} carries contributions from the subtree rooted at c_1; in other words, it reflects the connections among the nodes in the separators {p_1^L, p_2, p_3^L, i}. Here we say that the separators {c_1, p_1^L, p_2, p_3^L, i} form the left element, and {c_2, p_1^R, p_2, p_3^R, i} form the right element. In some situations separators may have empty neighbors (for example, on mesh boundaries). We say an element is full if no separator is empty. Recursive application of the above procedure to all levels of the separator tree leads to a supernodal multifrontal method.

The multifrontal method is just another way to carry out Gaussian elimination, and it has the same complexity as nested dissection. That is, the number of multiplicative operations required to factorize A in (1.1) is O(n^3), and the number of nonzeros in the Cholesky factor is O(n^2 log_2 n). A stack structure is used for storing the update matrices. If the elimination tree is a full and balanced tree, then the storage required for the update matrix stack is O(n^2).

2.2. Semiseparable structures and low-rank property. In our superfast multifrontal method, frontal matrices and update matrices are represented in HSS forms. A block 4 x 4 HSS matrix with block sizes m_1, m_2, m_3, m_4 looks like

    \begin{pmatrix}
    D_1 & U_1 B_1 V_2^T & U_1 R_1 B_3 W_4^T V_4^T & U_1 R_1 B_3 W_5^T V_5^T \\
    U_2 B_2 V_1^T & D_2 & U_2 R_2 B_3 W_4^T V_4^T & U_2 R_2 B_3 W_5^T V_5^T \\
    U_4 R_4 B_6 W_1^T V_1^T & U_4 R_4 B_6 W_2^T V_2^T & D_4 & U_4 B_4 V_5^T \\
    U_5 R_5 B_6 W_1^T V_1^T & U_5 R_5 B_6 W_2^T V_2^T & U_5 B_5 V_4^T & D_5
    \end{pmatrix},                                                                                  (2.5)

where we use the postordering HSS notation introduced in [8], which is a generalized and also simplified version of the notation in [9, 10]. The matrices D_i, U_i, V_i, R_i, W_i, B_i are called generators. An HSS matrix can be conveniently represented by a binary tree structure, called the HSS tree (Figure 2.4). In an HSS tree, generators are associated with the nodes and edges. We may also say that generators are solely associated with nodes, since edges can be viewed as attached to nodes. We can identify the matrix blocks based on the paths connecting the nodes. For example, the (2, 3) block of (2.5) can be read off from the path connecting nodes 2 and 4 at the bottom level.

Fig. 2.4. HSS tree for (2.5).

The hierarchical structure of HSS matrices can be illustrated by writing (2.5) as

    \begin{pmatrix} D_3 & U_3 B_3 V_6^T \\ U_6 B_6 V_3^T & D_6 \end{pmatrix},

where the new generators for nodes 3 and 6 are defined in terms of the generators associated with their child nodes.
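
A minimal NumPy sketch of this hierarchical relation is given below, with arbitrary random generators of small sizes: it assembles a matrix of the form (2.5) directly from the bottom-level generators, then from the merged generators of nodes 3 and 6 (whose explicit formulas follow in the next paragraph), and checks that the two descriptions agree.

```python
import numpy as np
rng = np.random.default_rng(1)

m, r = 6, 2                      # leaf block size and generator column size (arbitrary)
def rnd(*shape): return rng.standard_normal(shape)

# Bottom-level generators for leaves 1, 2 (children of node 3) and 4, 5 (children of node 6).
D = {i: rnd(m, m) for i in (1, 2, 4, 5)}
U = {i: rnd(m, r) for i in (1, 2, 4, 5)}
V = {i: rnd(m, r) for i in (1, 2, 4, 5)}
R = {i: rnd(r, r) for i in (1, 2, 4, 5)}
W = {i: rnd(r, r) for i in (1, 2, 4, 5)}
B = {i: rnd(r, r) for i in (1, 2, 3, 4, 5, 6)}

# The block 4 x 4 form (2.5), with leaves ordered 1, 2, 4, 5.
H4 = np.block([
  [D[1],                       U[1] @ B[1] @ V[2].T,        U[1] @ R[1] @ B[3] @ W[4].T @ V[4].T, U[1] @ R[1] @ B[3] @ W[5].T @ V[5].T],
  [U[2] @ B[2] @ V[1].T,       D[2],                        U[2] @ R[2] @ B[3] @ W[4].T @ V[4].T, U[2] @ R[2] @ B[3] @ W[5].T @ V[5].T],
  [U[4] @ R[4] @ B[6] @ W[1].T @ V[1].T, U[4] @ R[4] @ B[6] @ W[2].T @ V[2].T, D[4],              U[4] @ B[4] @ V[5].T],
  [U[5] @ R[5] @ B[6] @ W[1].T @ V[1].T, U[5] @ R[5] @ B[6] @ W[2].T @ V[2].T, U[5] @ B[5] @ V[4].T, D[5]],
])

# Merged (upper-level) generators for nodes 3 and 6.
D3 = np.block([[D[1], U[1] @ B[1] @ V[2].T], [U[2] @ B[2] @ V[1].T, D[2]]])
D6 = np.block([[D[4], U[4] @ B[4] @ V[5].T], [U[5] @ B[5] @ V[4].T, D[5]]])
U3 = np.vstack([U[1] @ R[1], U[2] @ R[2]]);  V3 = np.vstack([V[1] @ W[1], V[2] @ W[2]])
U6 = np.vstack([U[4] @ R[4], U[5] @ R[5]]);  V6 = np.vstack([V[4] @ W[4], V[5] @ W[5]])
H2 = np.block([[D3, U3 @ B[3] @ V6.T], [U6 @ B[6] @ V3.T, D6]])

print(np.allclose(H4, H2))       # the two descriptions agree
```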

For example,

    D_3 = \begin{pmatrix} D_1 & U_1 B_1 V_2^T \\ U_2 B_2 V_1^T & D_2 \end{pmatrix}, \qquad
    U_3 = \begin{pmatrix} U_1 R_1 \\ U_2 R_2 \end{pmatrix}, \qquad
    V_3 = \begin{pmatrix} V_1 W_1 \\ V_2 W_2 \end{pmatrix}.

Due to this structure we only store the R_i, W_i, and the bottom level D_i, U_i, and V_i generators. For an HSS matrix with HSS rank p, when p is small and the sizes of all B_i are close to p, we say that the HSS matrix is compact (a more formal definition is given in [8]). For a symmetric HSS matrix we can set U_i = V_i, R_i = W_i, and B_i = B_j^T, where j is the sibling of i. It is shown in [8, 10] that for a compact HSS matrix, linear complexity system solvers exist. Many other HSS matrix operations, such as structure generation, compression, etc., are also very efficient.

3. Superfast multifrontal method. Notice that in the multifrontal method for discretized equations the frontal and update matrices are generally dense because of the mutual connections among mesh nodes (Figure 2.1(ii)). The elimination together with the extend-add operation on such an N x N matrix typically takes O(N^3) floating point operations in exact arithmetic. Here we consider approximations of these dense matrices.

Approximations of dense matrices are feasible in solving linear systems derived from discretizations of certain PDEs such as elliptic equations, as we discover that low-rank properties exist in these problems. It turns out that we can reveal these low-rank properties by carefully organizing the elimination. In fact, if we use the multifrontal method and the nested dissection ordering of the nodes, the frontal and update matrices for those discretized problems (e.g., Model Problem 1.1) have low-rank properties. Therefore, we use low-rank structures to approximate those matrices.

The main idea is to employ HSS operations in the supernodal multifrontal method, due to the hierarchical/parallel tree structures of both HSS matrices and the multifrontal method. For discretized equations the HSS operations are generally complicated. We have developed a series of HSS operations to take advantage of the low-rank properties [8, 10]. Additional HSS operations necessary for our superfast multifrontal method are presented here. With these techniques we are able to reduce the total complexity of solving the problems from O(n^3) to O(pn^2), and the storage from O(n^2 log_2 n) to O(n^2 log_2 p), where p is a parameter related to the problem and to the tolerance for the matrix approximations.

3.1. Off-diagonal numerical ranks. The low-rank property has a certain physical background. For example, the paper [4] considers a 2D physical model consisting of a set of particles with pairwise interactions satisfying Coulomb's law. The authors define well-separated sets of particles to be sets that have strong interactions within sets but weak ones between different sets. When we consider 2D meshes for discretized problems we have a similar situation. In Figure 2.1(ii) the nodes in the current-level element are mutually connected, but their connections differ between separators. When we consider the connections within a separator, we can divide the separator into several pieces (the circles in Figure 3.1(i)): nodes inside each piece have stronger connections, while the connections between any two pieces are considered to be weak. When we consider the connections between separators, we can think of the nodes close to other separators (the circles in Figure 3.1(ii)) as having stronger connections.
Thus, if we order the nodes in certain ways, we can expect the off-diagonal blocks of the frontal/update matrices to have small numerical ranks. For example, we can impose the following rule.

Fig. 3.1. Well-separated sets in separators: (i) connections within separators; (ii) connections between separators.

Rule 3.1. Separators in an element are ordered counterclockwise as shown in Figure 3.2(i). The nodes inside each separator are ordered following the uniform ordering of mesh nodes (left-right and top-down). We may also order the nodes in the boundary separators counterclockwise.

Fig. 3.2. Counterclockwise ordering of separators and the resulting rank pattern of F_i: (i) counterclockwise ordering of separators; (ii) rank pattern of F_i.

Then the corresponding frontal matrix F_i has, say, the pattern shown in Figure 3.2(ii), where dark pieces denote high-rank blocks and the other blocks are low-rank. Variations of the pattern in Figure 3.2(i) can be considered similarly, for example, a transpose of the pattern, or one with certain separators missing (e.g., for boundary elements). Note that after the elimination of separator i we get the update matrix U_i (a Schur complement):

    F_i = \begin{pmatrix} A_{ii} & A_{iB} \\ A_{Bi} & A_{BB} \end{pmatrix}
        = \begin{pmatrix} L_{ii} & 0 \\ L_{Bi} & I \end{pmatrix}
          \begin{pmatrix} L_{ii}^T & L_{Bi}^T \\ 0 & U_i \end{pmatrix}.                              (3.1)

For the frontal/update matrices, we make the following critical rank observations.

Observations. In the supernodal multifrontal method for solving Model Problem 1.1, the frontal/update matrices in (3.1) have the following properties, with appropriate tolerances and block sizes (if applicable):
- The off-diagonal block A_{iB} is a numerically low-rank matrix.
- The off-diagonal blocks of A_{ii} (the leading pivot block of F_i) have small numerical ranks.
- The off-diagonal blocks of U_i have small numerical ranks.

Here a tolerance τ is used in the rank-revealing QR factorization or τ-accurate SVD to compute the numerical ranks. We present some rank results for the multifrontal method applied to Model Problem 1.1, the 5-point discretized Laplacian. The type of off-diagonal blocks is as shown in Figure 1.1. Again, by ranks we mean numerical ranks when not explicitly stated otherwise.

(a) Numerical rank of A_{iB}.

We choose some frontal matrices F_i from problems with different mesh sizes. Each F_i is associated with a complete element in the mesh (no empty separator). We compute the numerical rank of A_{iB} in each F_i (see (3.1)). Table 3.1 shows the rank results for four A_{iB} matrices, which come from mesh sizes n = 127, 255, 511, and 1023, respectively.

Table 3.1. Numerical ranks of some A_{iB} matrices with different absolute tolerances τ (rows: tolerance τ; columns: size of A_{iB} for each mesh size).

(b) Off-diagonal numerical ranks of A_{ii}. Similarly, we test the off-diagonal ranks of the A_{ii} matrices. As an example, we use the frontal matrix corresponding to the top level separator of a mesh; in this situation F_i ≡ A_{ii}, and its order equals the length of the top separator. We choose a fixed block size and make all bottom level HSS off-diagonal blocks have the same row dimension. These off-diagonal blocks are then ordered top-down as in Figure 1.1. Their ranks are computed and plotted against their block numbers. To better demonstrate the rank distribution, we scale the block numbering to lie within [0, 1] so that the plots for different choices of block sizes can be put into one graph. Figure 3.3 shows the ranks of these off-diagonal blocks. When we increase the block sizes, the ranks increase only slightly. This illustrates the low-rank property of these matrices.

Fig. 3.3. Scaled block numbering vs. ranks of the HSS off-diagonal block rows of an A_{ii} with different block row sizes (16, 32, 64, 128, ...): (i) τ = 10^{-6}; (ii) τ = 10^{-4}.

(c) Off-diagonal numerical ranks of the update matrix U_i. Figure 3.4 shows the numerical ranks of some HSS off-diagonal blocks of an update matrix U_i of dimension N for one of the test mesh sizes n. Similar situations hold for the frontal matrices, although their off-diagonal ranks are slightly larger; see [34]. For larger mesh sizes the low-rank property is more significant, as the numerical ranks do not increase much when the matrix dimensions increase. Also, larger tolerances lead to smaller numerical ranks. The low-rank property also exists in many other discretized problems, such as certain integral equations.
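
A minimal NumPy/SciPy experiment illustrating these observations, far smaller than the tests reported above, is given below: it forms the Schur complement of the 5-point Laplacian onto a middle separator column by eliminating all remaining nodes at once, and prints the numerical ranks of its bottom-level off-diagonal block rows for an absolute tolerance τ. The mesh size, block size, and tolerance are arbitrary choices for illustration.

```python
import numpy as np
import scipy.sparse as sp

def laplacian_2d(n):
    """5-point Laplacian (Model Problem 1.1) on an n x n grid."""
    T1 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T1) + sp.kron(T1, I)).tocsc()

def numerical_rank(M, tau):
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tau))

n = 47
A = laplacian_2d(n).toarray()
# Use the middle mesh column as a separator and eliminate all other nodes;
# the Schur complement S on the separator plays the role of a dense block
# whose off-diagonal ranks can be inspected.
sep = [row * n + n // 2 for row in range(n)]
rest = sorted(set(range(n * n)) - set(sep))
Ass, Asr, Arr = A[np.ix_(sep, sep)], A[np.ix_(sep, rest)], A[np.ix_(rest, rest)]
S = Ass - Asr @ np.linalg.solve(Arr, Asr.T)

tau, bs = 1e-6, 12
for b, start in enumerate(range(0, len(sep), bs)):
    stop = min(start + bs, len(sep))
    off = np.hstack([S[start:stop, :start], S[start:stop, stop:]])
    print("block", b, "size", stop - start, "numerical rank", numerical_rank(off, tau))
```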

Fig. 3.4. Ranks of the HSS off-diagonal blocks of a U_i with different block row sizes (16, 32, 64, 128, ...): (i) τ = 10^{-6}; (ii) τ = 10^{-4}.

3.2. Overview of structured extend-add and factorization. In Subsection 3.1 we have seen that the frontal/update matrices in the multifrontal method for some discretized problems have the low-rank property. We can take advantage of this property and replace the traditional Cholesky factorization and extend-add by a structured factorization and a structured extend-add. We will use low-rank structures (HSS) to represent the matrices in the factorization and extend-add. Accordingly, we should pay special attention to the operations on the elements and nodes in the mesh, because any major operation will likely affect the matrix structures.

Figure 3.5 shows two elements, each with a separator i, its four neighbors p_1, ..., p_4 in upper levels, and its children c_1 and c_2. Element (ii) can be viewed as a transpose of (i); thus it is sufficient to consider (i).

Fig. 3.5. Extend-add for one element: (i) left-right orientation; (ii) top-down orientation. Element (ii) can be viewed as a transpose of (i).

By the rule for ordering separators in Figure 3.2, we obtain the frontal matrix F_i from the extend-add operation (2.4):

    F_i = F_i^0 ⊕ U_{c_1} ⊕ U_{c_2} \equiv F_i^0 + \hat U_{c_1} + \hat U_{c_2},                      (3.2)

where the matrices take the nonzero patterns shown in Figure 3.6. There are three key points about how the separators in the child elements appear in their parent element during the extend-add.

Firstly, in the element box for the child c_1 (the left element), the neighbors corresponding to U_{c_1} follow the order {p_2, p_1^L, i, p_3^L} according to Rule 3.1, while in the element box for i (one level up), the separators corresponding to F_i follow the order {i, p_1, p_2, p_3, p_4}. That is, the blocks in U_{c_1} and U_{c_2} do not follow the natural order of those of F_i. Thus certain permutations of U_{c_1} and U_{c_2} are needed to align their blocks with the blocks of F_i.

Fig. 3.6. Matrices in the extend-add (3.2). The +-shaped bars represent the overlap in separator p_1 (between p_1^L and p_1^R).

Secondly, separators in one child element may not appear in the other. For example, p_2 appears in the left element but not in the right one. Thus we need to pad some zero blocks into U_{c_1} and U_{c_2}. We then need the following permutation-extension.

Rule 3.2. Left permutation-extension to obtain \hat U_{c_1}: if U_{c_1} is partitioned into four blocks conformally according to the separators {p_2, p_1^L, i, p_3^L} in this order, exchange p_2 and i to get {i, p_1^L, p_2, p_3^L}. Then insert appropriate zero blocks to get {i, {p_1^L, 0}, p_2, {p_3^L, 0}, 0}, where the blocks have sizes consistent with {i, p_1, p_2, p_3, p_4}. Similarly we can define a right permutation-extension to obtain \hat U_{c_2}.

Lastly, the child elements share parts of the separators. Here the left and right elements share the entire separator i, which becomes the pivot separator in the upper element. The union of separator p_1^L in the left element and p_1^R in the right one is separator p_1, but there are overlaps between p_1^L and p_1^R. A similar situation happens to p_3^L and p_3^R. During the extend-add the overlaps may not always correspond to an entire HSS block row/column, so certain blocks may need to be split and merged with other blocks. The techniques in Subsection 3.3.2 will be used. Figure 3.6 also shows the overlap situation for separator p_1, marked by the +-shaped bars.

All these operations are done with HSS structures. After F_i is formed, we eliminate the pivot separator i. This elimination and the computation of the Schur complement are also done with HSS matrices, using the algorithms in [8]. Thus the update matrix again remains structured. Before we go into the details of those HSS operations, we give an overview of the algorithm.

Algorithm 3.3. Multifrontal method with structured factorizations.
1. Use nested dissection to order the nodes. Build the postordered elimination tree of separators.
2. Do traditional Cholesky factorizations and extend-adds at certain bottom levels.
3. Do structured factorizations and extend-adds at the upper levels:
   (a) At the switching level, construct HSS representations for the update matrices and do structured extend-add.
   (b) At later levels, factorize the pivot blocks of the structured frontal matrices to obtain structured update matrices. Do structured extend-add.

3.3. Superfast multifrontal method with HSS structures. According to the previous two subsections, when Model Problem 1.1 is solved with the multifrontal method, the update matrices and frontal matrices can be represented by compact HSS matrices. The elimination of a separator of order N can be done by the generalized HSS Cholesky factorization in [8], which has complexity O(p^2 N), where p is an appropriate HSS rank.

The computations of the Schur complements (update matrices) are also done efficiently with HSS matrices; the details can be found in [8]. In this section we discuss some new HSS algorithms on which the structured HSS extend-add is built.

3.3.1. Permuting HSS blocks. It is convenient to permute a general HSS matrix by permuting its HSS tree. In many applications, such as our superfast multifrontal method, we can find the new HSS form for the permuted matrix by updating just a few generators. For example, consider permuting two neighboring block rows/columns at a certain level of the HSS matrix. This corresponds to the permutation of two neighboring HSS subtrees (subtrees whose roots are siblings); see Figure 3.7.

Fig. 3.7. Permuting two subtrees.

Thus we can simply exchange the generators associated with i and j and all their children, without updating the matrices. The new matrix is still in HSS form. However, if the two subtrees are not neighbors, we have different options. One way is to write the new matrix as a sum of certain HSS matrices and then compress it with the techniques in [8], but this is usually expensive. A more efficient way is to identify the path connecting the two subtrees and to update the matrices associated with the nodes in the path and those connected to the path. As this varies for different situations, we come back to it in Subsection 3.3.4, where the permutations are specifically designed for our superfast multifrontal method.

3.3.2. Merging and splitting HSS blocks.

Elementary merging and splitting. The hierarchical structure of HSS matrices makes it very convenient to merge and split blocks in many situations. For example, in Figure 3.8 we can merge the leaf nodes c_1 and c_2 in (i) and update node i by setting

    U_i = \begin{pmatrix} U_{c_1} R_{c_1} \\ U_{c_2} R_{c_2} \end{pmatrix}, \qquad
    V_i = \begin{pmatrix} V_{c_1} W_{c_1} \\ V_{c_2} W_{c_2} \end{pmatrix}, \qquad
    D_i = \begin{pmatrix} D_{c_1} & U_{c_1} B_{c_1} V_{c_2}^T \\ U_{c_2} B_{c_2} V_{c_1}^T & D_{c_2} \end{pmatrix}.                (3.3)

Fig. 3.8. Merging and splitting: (i) node i with children c_1, c_2; (ii) node i.
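
A minimal NumPy sketch of the elementary merging step (3.3) is given below, with arbitrary generator sizes; it only builds the parent generators D_i, U_i, V_i from two child leaves. The reverse operation, splitting, is discussed next.

```python
import numpy as np

def merge_children(gen_c1, gen_c2, B_c1, B_c2):
    """Merge two sibling HSS leaves c1, c2 into their parent i (cf. (3.3)).
    Each gen_ck is a dict with keys D, U, V, R, W."""
    D_i = np.block([
        [gen_c1["D"],                         gen_c1["U"] @ B_c1 @ gen_c2["V"].T],
        [gen_c2["U"] @ B_c2 @ gen_c1["V"].T,  gen_c2["D"]],
    ])
    U_i = np.vstack([gen_c1["U"] @ gen_c1["R"], gen_c2["U"] @ gen_c2["R"]])
    V_i = np.vstack([gen_c1["V"] @ gen_c1["W"], gen_c2["V"] @ gen_c2["W"]])
    return D_i, U_i, V_i

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    mk = lambda m, r: {"D": rng.standard_normal((m, m)), "U": rng.standard_normal((m, r)),
                       "V": rng.standard_normal((m, r)), "R": rng.standard_normal((r, r)),
                       "W": rng.standard_normal((r, r))}
    c1, c2 = mk(5, 2), mk(5, 2)
    B_c1, B_c2 = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
    D_i, U_i, V_i = merge_children(c1, c2, B_c1, B_c2)
    print(D_i.shape, U_i.shape, V_i.shape)   # (10, 10) (10, 2) (10, 2)
```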

On the other hand, if we want to split node i in (i) into c_1 and c_2, we need to find D_{c_k}, U_{c_k}, V_{c_k}, R_{c_k}, W_{c_k}, B_{c_k}, k = 1, 2, such that (3.3) is satisfied. First, we partition U_i, V_i, D_i conformally:

    U_i = \begin{pmatrix} U_{i;1} \\ U_{i;2} \end{pmatrix}, \qquad
    V_i = \begin{pmatrix} V_{i;1} \\ V_{i;2} \end{pmatrix}, \qquad
    D_i = \begin{pmatrix} D_{c_1} & D_{i;1,2} \\ D_{i;2,1} & D_{c_2} \end{pmatrix}.                   (3.4)

Then compute QR factorizations

    \begin{pmatrix} U_{i;1} & D_{i;1,2} \end{pmatrix} = U_{c_1} \begin{pmatrix} R_{c_1} & T_1 \end{pmatrix}, \qquad
    \begin{pmatrix} T_1^T & V_{i;2} \end{pmatrix} = V_{c_2} \begin{pmatrix} B_{c_1}^T & W_{c_2} \end{pmatrix},                      (3.5)

    \begin{pmatrix} U_{i;2} & D_{i;2,1} \end{pmatrix} = U_{c_2} \begin{pmatrix} R_{c_2} & T_2 \end{pmatrix}, \qquad
    \begin{pmatrix} T_2^T & V_{i;1} \end{pmatrix} = V_{c_1} \begin{pmatrix} B_{c_2}^T & W_{c_1} \end{pmatrix}.                      (3.6)

Equations (3.4)-(3.6) provide all the necessary new generators.

Advanced merging and splitting. There are more complicated situations that are very useful. Sometimes we need to maintain the tree structure during merging or splitting. For an N x N HSS matrix H, we consider splitting a piece from a leaf node i of the HSS tree of H and then merging that piece with a neighbor j. We consider a general situation where i and j are not siblings. Without loss of generality we make two simplifications: one is that the matrix is symmetric; the other is that block j is an empty or zero block whose size is to be set by the splitting. It suffices to look at the example in Figure 3.9, where i = 5 and j = 10.

Fig. 3.9. Splitting a block (node 5). c_1 and c_2 are virtual nodes.

We first split node i into two child blocks c_1 and c_2 by the techniques above; that is, we introduce two virtual nodes into the tree. We then move node c_2 to the position of node j, and after that we merge c_1 into i. Identify the path p(c_2, 10) connecting node c_2 and node 10 in the HSS tree.

We observe that in order to get a new HSS representation for H, denoted H̃, all the generators associated with the nodes in this path, and those directly connected to it, should be updated. Here we use tilded notation for the generators of H̃, with c_1 not yet merged into i. We call the subtrees rooted at nodes 9 and 14 the left and right subtrees, respectively. To get the HSS tree for H̃, we consider the following connections (here by a connection we mean the product of the generators associated with the edges connecting two nodes):
- The connection between node 15 and c_2 should be transferred to the connection between 15 and 10.
- The connection between node c_2 and each node k (= 1, 2, 3, 4, c_1) that is directly connected to the path p(c_2, 10) should be transferred to the connection between node 10 and k.
- The connection between any two nodes k_1, k_2 (= 1, 2, 3, 4, c_1) that are directly connected to the path p(c_2, 10) should remain the same.
- The connections between node 15 and the nodes k (= 1, 2, 3, 4, c_1) that are directly connected to the path p(c_2, 10) should remain the same.

All these connection changes can be reflected by appropriate products of generators associated with the nodes; see [34] for the details. We can assemble all the matrix products into one single block equation, (3.7), whose rows correspond to the nodes 1, 2, 3, 4, c_1 (plus a row for the transfer between node 15 and node 10). The left-hand side of (3.7) collects the products of the new (tilded) generators; for example, the last row, for node c_1, is

    \tilde R_{c_1} \begin{pmatrix} \tilde B_4 & \tilde R_5 \tilde B_3 & \tilde R_5 \tilde R_6 \tilde B_2 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde B_1 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde R_8 \tilde R_9 \tilde B_9 \tilde U_{14}^T \end{pmatrix},

and the right-hand side collects the corresponding products of the original generators R_1, ..., R_9, R_{c_2}, B_1, ..., B_4, B_{c_1}, and U_{c_2}, where \tilde U_{14} is built from \tilde U_{10} and \tilde R_{10}, \tilde R_{11}, \tilde R_{12}, \tilde R_{13} and collects the connections in the entire right subtree. Equation (3.7) is partitioned into four nonzero blocks corresponding to the above four types of connection changes.

It turns out that we can construct the generators on the left-hand side sequentially by using (3.7), as follows. First consider the last row in (3.7). Compute a QR factorization of its right-hand side, so that

    \tilde R_{c_1} \begin{pmatrix} \tilde B_4 & \tilde R_5 \tilde B_3 & \tilde R_5 \tilde R_6 \tilde B_2 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde B_1 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde R_8 \tilde R_9 \tilde B_9 \tilde U_{14}^T \end{pmatrix} = Q_1 T_1.

Then we partition T_1 = ( T_{1;1}  T_{1;2} ) such that T_{1;1} has the same column dimension as \tilde B_4. Thus we can let \tilde R_{c_1} = Q_1, \tilde B_4 = T_{1;1}, and

    \begin{pmatrix} \tilde R_5 \tilde B_3 & \tilde R_5 \tilde R_6 \tilde B_2 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde B_1 & \tilde R_5 \tilde R_6 \tilde R_7 \tilde R_8 \tilde R_9 \tilde B_9 \tilde U_{14}^T \end{pmatrix} = T_{1;2}.        (3.8)

In this way we remove one layer from the last row. Next, combine (3.8) with the fourth row of (3.7): the two rows share the common right factor ( \tilde B_3  \tilde R_6 \tilde B_2  \tilde R_6 \tilde R_7 \tilde B_1  \tilde R_6 \tilde R_7 \tilde R_8 \tilde R_9 \tilde B_9 \tilde U_{14}^T ), premultiplied by the stacked block ( \tilde R_4 ; \tilde R_5 ), and the stacked right-hand side consists of the fourth-row right-hand side of (3.7) (which involves B_4, R_{c_2}, and U_{c_2}) on top of T_{1;2}.

Again compute a QR factorization Q_2 T_2 of the right-hand side. Then partition Q_2 = ( Q_{2;1} ; Q_{2;2} ) such that Q_{2;1} has the same row size as \tilde B_4, and partition T_2 = ( T_{2;1}  T_{2;2} ) such that T_{2;1} has the same number of columns as \tilde B_3. We can set \tilde R_4 = Q_{2;1}, \tilde R_5 = Q_{2;2}, \tilde B_3 = T_{2;1}, and

    \begin{pmatrix} \tilde R_6 \tilde B_2 & \tilde R_6 \tilde R_7 \tilde B_1 & \tilde R_6 \tilde R_7 \tilde R_8 \tilde R_9 \tilde B_9 \tilde U_{14}^T \end{pmatrix} = T_{2;2}.        (3.9)

Now we can combine (3.9) with the third row of (3.7), and the above procedure repeats. Finally, it is trivial to merge node c_1 into i. The overall process costs O(N) flops.

3.3.3. Inserting and deleting HSS blocks. Sometimes we need to insert a block row/column into an HSS matrix, or remove a block row/column from it. Deleting a block is usually straightforward. For simplicity, in this subsection we consider symmetric HSS matrices. As an example, in Figure 3.10(i), if we remove node 1, we remove any generators associated with it and merge node 2 into 3 by setting U_3 = U_2 R_2.

To insert a block row/column into an HSS matrix, the result depends on the desired HSS structure. For example, suppose Figure 3.10(i) is the HSS tree for an N x N matrix H, and we insert a new node i between nodes 2 and 4 to get a new matrix H̃ with the tree structure in Figure 3.10(ii) or (iii).

Fig. 3.10. Inserting a node: (i) HSS tree example; (ii) inserting a node i; (iii) another way to insert i. Dark nodes and edges should be updated or created.

We only need to update a few generators, shown in dark in Figure 3.10. Again, we can consider the connection changes among the nodes and use QR factorizations to find the new generators. The details are similar to those in the previous subsection and can be found in [34]. Note that in Figure 3.10 the HSS tree can be more general, and the node i to be inserted can itself represent another HSS tree. In all cases, we only need to update a few generators to get the new matrix. Thus the total cost is no more than O(N) flops.

3.3.4. HSS extend-add. Recall from Section 3.2 the discussion of the structured extend-add with regard to an element and its two child elements, as shown in Figure 3.5 or also Figure 3.11. The frontal and update matrices in the extend-add

    F_i = F_i^0 ⊕ U_{c_1} ⊕ U_{c_2} \equiv F_i^0 + \hat U_{c_1} + \hat U_{c_2}

have the relationship shown in Figure 3.6. Here we consider using HSS structures in the extend-add; that is, all matrices are in HSS form. The block sizes of the HSS matrices follow this rule:

Fig. 3.11. Extend-add for one element. Separators in the solid-line ovals belong to the update matrix U_{c_1}, and those in the dashed-line ovals belong to the update matrix U_{c_2}.

Rule 3.4. In the multifrontal method with HSS structures we keep the HSS block sizes approximately the same, and avoid blocks that are too large or too small. Extend-add operations accumulate the numbers of blocks.

The reason for this is that, when the sizes of the frontal/update matrices increase, their HSS ranks increase very slowly. We can achieve high efficiency by setting the block sizes close to the HSS ranks, as seen in various HSS operations [10, 8].

The general HSS extend-add procedure is as follows. Assume that before the extend-add the frontal matrices F_{c_1} and F_{c_2} for the left and right elements, respectively, are already in HSS form. These HSS forms can come recursively from lower level separators, or from simple constructions at the starting level of the structured factorizations. For simplicity, assume each separator is represented by one leaf of the HSS tree, although each leaf can potentially be a subtree. There are then five leaf nodes in the HSS tree. As we use HSS trees with full siblings, we naturally use trees as shown in the first row of Figure 3.13; a tree like that has the minimum depth among all binary trees with full siblings. The two trees in the first row of Figure 3.13 are for F_{c_1} and F_{c_2}, respectively, with their leaf nodes marked by the separators in Figure 3.11. Those graphs also show the eliminations of c_1 and c_2 to get \hat U_{c_1} and \hat U_{c_2}, respectively. Separators in the figure are also marked with group numbers {0}, {1}, etc.

Typically, there are five steps for an extend-add operation to advance from the level of c_1 and c_2 to the level of i. For convenience, we also include the factorization of the frontal matrices here, at Step 0.
0. Eliminate c_1 and c_2 and get the update matrices U_{c_1} and U_{c_2} in HSS form.
1. Permute the trees for the update matrices as in Rule 3.2.
2. Insert appropriate zero nodes into the HSS trees of U_{c_1} and U_{c_2} to get \hat U_{c_1} and \hat U_{c_2}, respectively.
3. Split HSS blocks in \hat U_{c_1} and \hat U_{c_2} to handle the overlaps in separators. Update \hat U_{c_1} and \hat U_{c_2}.
4. Write the initial frontal matrix F_i^0 in HSS form based on the tree structure of U_{c_1} ⊕ U_{c_2} ≡ \hat U_{c_1} + \hat U_{c_2}.
5. Get F_i by adding the HSS matrices F_i^0, \hat U_{c_1}, \hat U_{c_2}. Compress F_i when necessary.

Step 0 can be done by applying the generalized HSS Cholesky factorization in [8] to only the leading principal blocks of F_{c_j} corresponding to separator c_j (group {0} in Figure 3.13), for j = 1, 2. When separator c_j is removed, all the remaining nodes in the HSS tree of F_{c_j} are updated using the Schur complement computation procedure of the HSS form traditional Cholesky factorization in [8]. The Schur complement corresponding to the updated tree is the update matrix U_{c_j}.

Step 1 is to permute some branches of the HSS tree of U_{c_j}. Note that even if the HSS tree has more levels, we still only need to change a few top level nodes, because we only need to permute the four separators.

This means the cost of the permutations in the superfast multifrontal method is O(1), even if the update matrix has dimension N. Specifically, to permute U_{c_1} (see Figure 3.13, left column; converting row 1 to row 2), we exchange group p_2{1} and i{3}, as discussed in Rule 3.2. We can set the new generators (in tilded notation) of the permuted tree to be

    \tilde B_5 = B_4, \quad \tilde R_8 = B_7 R_6, \quad \tilde B_2 = R_2 R_3 B_7, \quad \tilde R_2 = R_2 B_3, \quad \tilde B_3 = I.

Note that although these formulas look specific, they are sufficient for the general extend-add in the superfast multifrontal method. Similarly, to permute U_{c_2} (see Figure 3.13, right column; converting row 1 to row 2), we permute group p_4{3} and p_3^R{4}, as discussed in Rule 3.2. The new generators of the permuted tree should satisfy \tilde B_2 = R_2 R_3 R_4 and \tilde B_8 = B_7 R_6 R_5, and the remaining new generators (involving \tilde R_2, \tilde R_4, \tilde R_5, \tilde R_8, and \tilde B_3) satisfy a small matrix equation whose right-hand side is formed from products of R_2, R_3, R_4, R_5, R_6, B_4, and B_7; an SVD of the right-hand side of that equation provides all the generators on its left-hand side.

At Step 2 we insert zero blocks into the permuted update matrices to get \hat U_{c_1} and \hat U_{c_2}. For the left element, a zero block node is attached to the matrix tree; see Figure 3.13, left column, row 3. This operation is trivial in that only certain zero generators need to be added. For the right element, a zero node is inserted between group p_1^R{2} and p_3^R{4}; this was already discussed in Subsection 3.3.3.

After we get the HSS trees as shown in the last row of Figure 3.13, we use Step 3 to handle the overlaps so that \hat U_{c_1} and \hat U_{c_2} have the same HSS tree structure. As discussed in Section 3.2, overlaps occur in separators p_1 and p_3. As an example, the overlap p_1^L ∩ p_1^R may correspond to entire HSS blocks in both \hat U_{c_1} and \hat U_{c_2}, or, more likely, to parts of HSS blocks. For the latter case (Figure 3.12(i)) we need to cut p_1^L ∩ p_1^R from either p_1^L, at its right end, or from p_1^R, at its left end. As the separator p_1^L is followed by a zero block on the right in the HSS tree of \hat U_{c_1}, we can split the overlap from p_1^L and merge it into the zero block on the right; or we can handle p_1^R similarly. We use the splitting procedure of Subsection 3.3.2. Now \hat U_{c_1} and \hat U_{c_2} have the same HSS tree structure (Figure 3.12(ii)), which becomes the structure of \hat U_{c_1} + \hat U_{c_2} and also of F_i.

Then at Step 4 we convert the initial frontal matrix F_i^0 into an HSS form following the HSS structure of \hat U_{c_1} + \hat U_{c_2}. Note that since F_i^0 is often very sparse (because the original matrix A is), we are generally able to develop its HSS form symbolically in advance. For example, for Model Problem 1.1 with the nested dissection ordering, F_i^0 in (2.3) has a sparsity pattern where A_{ii} is tridiagonal, A_{i p_2} = 0, A_{i p_4} = 0, and each of A_{i p_1} and A_{i p_3} has only one nonzero entry. Such a matrix has HSS rank 2.

Now at Step 5 we are ready to get F_i by computing the HSS sum of F_i^0, \hat U_{c_1}, and \hat U_{c_2} with the formulas in [8]. The sizes of the generators of F_i increase quickly after this addition, although the actual HSS rank of F_i does not. Thus the HSS addition is usually followed by a compression step [8]. Now F_i is in compact HSS form and we can continue the factorization along the elimination tree.

3.4. Algorithm and performance. Based on the previous discussions, we now present the main algorithm for our superfast multifrontal method with HSS structures. Before that, we first clarify a few implementation issues for the HSS operations.

Fig. 3.12. (i) HSS partitions of p_1^L and p_1^R, showing the overlap p_1^L ∩ p_1^R (the dark band in the oval) and how it builds into p_1 = p_1^L ∪ p_1^R; (ii) HSS tree structure of F_i.

3.4.1. Implementation issues. One issue concerns the HSS factorization of the frontal matrices. The techniques of the traditional Cholesky factorization of an SPD HSS matrix in [8] can be used to compute the Schur complement U_i. This computation is efficient, since it corresponds to the removal of one single HSS tree node, the root of the HSS branch corresponding to separator i.

The second issue is related to the mesh boundary. For convenience, we can assume that the mesh boundary corresponds to empty separators. We may then have empty branches in the HSS trees. An empty branch is represented by one empty node, which is not associated with any actual separator or operation. Empty nodes do not accumulate or change.

Another issue is to pre-determine the HSS structures before the actual factorizations. Similar to the traditional multifrontal method, we can have a symbolic factorization stage after the nested dissection to decide in advance the HSS structures of the frontal/update matrices at all levels of the elimination.

Finally, we usually avoid HSS block sizes that are too large or too small. Because of this, at the bottom levels of the elimination we do not use HSS factorizations; instead, we use the traditional multifrontal method. Then, starting from a certain level l_s, we switch to HSS factorizations. We call this l_s the switching level.

3.4.2. Factorization algorithm. Our superfast multifrontal method with HSS structures can be summarized in an algorithm.

Algorithm 3.5. Superfast multifrontal method with HSS structures.
1. Use nested dissection to order the nodes in the n x n mesh. Build the elimination tree with the separator ordering. Assume the total number of separators is s and the total number of levels is l = log_2 n.
2. Decide l_s, the switching level from traditional Cholesky factorizations to structured factorizations (see Theorem 3.6 below).
3. For separator i = 1, ..., s:
   (a) If separator i is at a level l_i > l_s, do a traditional Cholesky factorization and extend-add.
      i. If i is a leaf node in the elimination tree, obtain the frontal matrix F_i from A and compute U_i as in (2.3). Push U_i onto an update


More information

Approaches to Parallel Implementation of the BDDC Method

Approaches to Parallel Implementation of the BDDC Method Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Performance Evaluation of a New Parallel Preconditioner

Performance Evaluation of a New Parallel Preconditioner Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Scanning Real World Objects without Worries 3D Reconstruction

Scanning Real World Objects without Worries 3D Reconstruction Scanning Real World Objects without Worries 3D Reconstruction 1. Overview Feng Li 308262 Kuan Tian 308263 This document is written for the 3D reconstruction part in the course Scanning real world objects

More information

Downloaded 04/17/18 to Redistribution subject to SIAM license or copyright; see

Downloaded 04/17/18 to Redistribution subject to SIAM license or copyright; see SIAM J. MATRIX ANAL. APPL. Vol. 37, No. 1, pp. 338 353 c 2016 Society for Industrial and Applied Mathematics A FAST MEMORY EFFICIENT CONSTRUCTION ALGORITHM FOR HIERARCHICALLY SEMI-SEPARABLE REPRESENTATIONS

More information

BACKGROUND: A BRIEF INTRODUCTION TO GRAPH THEORY

BACKGROUND: A BRIEF INTRODUCTION TO GRAPH THEORY BACKGROUND: A BRIEF INTRODUCTION TO GRAPH THEORY General definitions; Representations; Graph Traversals; Topological sort; Graphs definitions & representations Graph theory is a fundamental tool in sparse

More information

Chapter 4. Matrix and Vector Operations

Chapter 4. Matrix and Vector Operations 1 Scope of the Chapter Chapter 4 This chapter provides procedures for matrix and vector operations. This chapter (and Chapters 5 and 6) can handle general matrices, matrices with special structure and

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

On the I/O Volume in Out-of-Core Multifrontal Methods with a Flexible Allocation Scheme

On the I/O Volume in Out-of-Core Multifrontal Methods with a Flexible Allocation Scheme On the I/O Volume in Out-of-Core Multifrontal Methods with a Flexible Allocation Scheme Emmanuel Agullo 1,6,3,7, Abdou Guermouche 5,4,8, and Jean-Yves L Excellent 2,6,3,9 1 École Normale Supérieure de

More information

Least-Squares Fitting of Data with B-Spline Curves

Least-Squares Fitting of Data with B-Spline Curves Least-Squares Fitting of Data with B-Spline Curves David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Computational Geometry 1 David M. Mount Department of Computer Science University of Maryland Fall 2005 1 Copyright, David M. Mount, 2005, Dept. of Computer Science, University of Maryland, College

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

2D/3D Geometric Transformations and Scene Graphs

2D/3D Geometric Transformations and Scene Graphs 2D/3D Geometric Transformations and Scene Graphs Week 4 Acknowledgement: The course slides are adapted from the slides prepared by Steve Marschner of Cornell University 1 A little quick math background

More information

Lecture 3: Art Gallery Problems and Polygon Triangulation

Lecture 3: Art Gallery Problems and Polygon Triangulation EECS 396/496: Computational Geometry Fall 2017 Lecture 3: Art Gallery Problems and Polygon Triangulation Lecturer: Huck Bennett In this lecture, we study the problem of guarding an art gallery (specified

More information

Organizing Spatial Data

Organizing Spatial Data Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the

More information

Null space computation of sparse singular matrices with MUMPS

Null space computation of sparse singular matrices with MUMPS Null space computation of sparse singular matrices with MUMPS Xavier Vasseur (CERFACS) In collaboration with Patrick Amestoy (INPT-IRIT, University of Toulouse and ENSEEIHT), Serge Gratton (INPT-IRIT,

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

SCALABLE ALGORITHMS for solving large sparse linear systems of equations

SCALABLE ALGORITHMS for solving large sparse linear systems of equations SCALABLE ALGORITHMS for solving large sparse linear systems of equations CONTENTS Sparse direct solvers (multifrontal) Substructuring methods (hybrid solvers) Jacko Koster, Bergen Center for Computational

More information

Hypercubes. (Chapter Nine)

Hypercubes. (Chapter Nine) Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is

More information

Homework # 1 Due: Feb 23. Multicore Programming: An Introduction

Homework # 1 Due: Feb 23. Multicore Programming: An Introduction C O N D I T I O N S C O N D I T I O N S Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.86: Parallel Computing Spring 21, Agarwal Handout #5 Homework #

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression

A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression Xiaoye Sherry Li, xsli@lbl.gov Liza Rebrova, Gustavo Chavez, Yang Liu, Pieter Ghysels Lawrence Berkeley National

More information

Journal of Computational Physics

Journal of Computational Physics Journal of Computational Physics 23 (22) 34 338 Contents lists available at SciVerse ScienceDirect Journal of Computational Physics journal homepage: www.elsevier.com/locate/jcp A fast direct solver for

More information

Lecturers: Sanjam Garg and Prasad Raghavendra March 20, Midterm 2 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra March 20, Midterm 2 Solutions U.C. Berkeley CS70 : Algorithms Midterm 2 Solutions Lecturers: Sanjam Garg and Prasad aghavra March 20, 207 Midterm 2 Solutions. (0 points) True/False Clearly put your answers in the answer box in front

More information

Mathematics. 2.1 Introduction: Graphs as Matrices Adjacency Matrix: Undirected Graphs, Directed Graphs, Weighted Graphs CHAPTER 2

Mathematics. 2.1 Introduction: Graphs as Matrices Adjacency Matrix: Undirected Graphs, Directed Graphs, Weighted Graphs CHAPTER 2 CHAPTER Mathematics 8 9 10 11 1 1 1 1 1 1 18 19 0 1.1 Introduction: Graphs as Matrices This chapter describes the mathematics in the GraphBLAS standard. The GraphBLAS define a narrow set of mathematical

More information

Methods of solving sparse linear systems. Soldatenko Oleg SPbSU, Department of Computational Physics

Methods of solving sparse linear systems. Soldatenko Oleg SPbSU, Department of Computational Physics Methods of solving sparse linear systems. Soldatenko Oleg SPbSU, Department of Computational Physics Outline Introduction Sherman-Morrison formula Woodbury formula Indexed storage of sparse matrices Types

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations

Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations Stefan Bischof, Ralf Ebner, and Thomas Erlebach Institut für Informatik Technische Universität München D-80290

More information

Report of Linear Solver Implementation on GPU

Report of Linear Solver Implementation on GPU Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,

More information

The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method

The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method The Design and Implementation Of A New Out-of-Core Sparse Cholesky Factorization Method VLADIMIR ROTKIN and SIVAN TOLEDO Tel-Aviv University We describe a new out-of-core sparse Cholesky factorization

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

CSCE 689 : Special Topics in Sparse Matrix Algorithms Department of Computer Science and Engineering Spring 2015 syllabus

CSCE 689 : Special Topics in Sparse Matrix Algorithms Department of Computer Science and Engineering Spring 2015 syllabus CSCE 689 : Special Topics in Sparse Matrix Algorithms Department of Computer Science and Engineering Spring 2015 syllabus Tim Davis last modified September 23, 2014 1 Catalog Description CSCE 689. Special

More information

Mathematics and Computer Science

Mathematics and Computer Science Technical Report TR-2006-010 Revisiting hypergraph models for sparse matrix decomposition by Cevdet Aykanat, Bora Ucar Mathematics and Computer Science EMORY UNIVERSITY REVISITING HYPERGRAPH MODELS FOR

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information

I. INTRODUCTION. 2 matrix, integral-equation-based methods, matrix inversion.

I. INTRODUCTION. 2 matrix, integral-equation-based methods, matrix inversion. 2404 IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES, VOL. 59, NO. 10, OCTOBER 2011 Dense Matrix Inversion of Linear Complexity for Integral-Equation-Based Large-Scale 3-D Capacitance Extraction Wenwen

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 15 Numerically solve a 2D boundary value problem Example:

More information

Sparse Linear Systems

Sparse Linear Systems 1 Sparse Linear Systems Rob H. Bisseling Mathematical Institute, Utrecht University Course Introduction Scientific Computing February 22, 2018 2 Outline Iterative solution methods 3 A perfect bipartite

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

CSE 417 Dynamic Programming (pt 4) Sub-problems on Trees

CSE 417 Dynamic Programming (pt 4) Sub-problems on Trees CSE 417 Dynamic Programming (pt 4) Sub-problems on Trees Reminders > HW4 is due today > HW5 will be posted shortly Dynamic Programming Review > Apply the steps... 1. Describe solution in terms of solution

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

A Parallel Implementation of the BDDC Method for Linear Elasticity

A Parallel Implementation of the BDDC Method for Linear Elasticity A Parallel Implementation of the BDDC Method for Linear Elasticity Jakub Šístek joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík Institute of Mathematics of the AS CR, Prague

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

An efficient algorithm for sparse PCA

An efficient algorithm for sparse PCA An efficient algorithm for sparse PCA Yunlong He Georgia Institute of Technology School of Mathematics heyunlong@gatech.edu Renato D.C. Monteiro Georgia Institute of Technology School of Industrial & System

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

AMS527: Numerical Analysis II

AMS527: Numerical Analysis II AMS527: Numerical Analysis II A Brief Overview of Finite Element Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao SUNY Stony Brook AMS527: Numerical Analysis II 1 / 25 Overview Basic concepts Mathematical

More information

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department CS473-Algorithms I Lecture 11 Greedy Algorithms 1 Activity Selection Problem Input: a set S {1, 2,, n} of n activities s i =Start time of activity i, f i = Finish time of activity i Activity i takes place

More information

Sequential and Parallel Algorithms for Cholesky Factorization of Sparse Matrices

Sequential and Parallel Algorithms for Cholesky Factorization of Sparse Matrices Sequential and Parallel Algorithms for Cholesky Factorization of Sparse Matrices Nerma Baščelija Sarajevo School of Science and Technology Department of Computer Science Hrasnicka Cesta 3a, 71000 Sarajevo

More information

Preconditioning for linear least-squares problems

Preconditioning for linear least-squares problems Preconditioning for linear least-squares problems Miroslav Tůma Institute of Computer Science Academy of Sciences of the Czech Republic tuma@cs.cas.cz joint work with Rafael Bru, José Marín and José Mas

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

A Comparison of Data Structures for Dijkstra's Single Source Shortest Path Algorithm. Shane Saunders

A Comparison of Data Structures for Dijkstra's Single Source Shortest Path Algorithm. Shane Saunders A Comparison of Data Structures for Dijkstra's Single Source Shortest Path Algorithm Shane Saunders November 5, 1999 Abstract Dijkstra's algorithm computes the shortest paths between a starting vertex

More information

Sparse Matrices and Graphs: There and Back Again

Sparse Matrices and Graphs: There and Back Again Sparse Matrices and Graphs: There and Back Again John R. Gilbert University of California, Santa Barbara Simons Institute Workshop on Parallel and Distributed Algorithms for Inference and Optimization

More information

Scheduling Strategies for Parallel Sparse Backward/Forward Substitution

Scheduling Strategies for Parallel Sparse Backward/Forward Substitution Scheduling Strategies for Parallel Sparse Backward/Forward Substitution J.I. Aliaga M. Bollhöfer A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ. Jaume I (Spain) {aliaga,martina,quintana}@icc.uji.es

More information

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem L. De Giovanni M. Di Summa The Traveling Salesman Problem (TSP) is an optimization problem on a directed

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

MA27 1 SUMMARY 2 HOW TO USE THE PACKAGE

MA27 1 SUMMARY 2 HOW TO USE THE PACKAGE PAKAGE SPEIFIATION HSL ARHIVE 1 SUMMARY To solve a sparse symmetric system of linear equations. Given a sparse symmetric matrix A = {a ij} n-vector b, this subroutine solves the system Ax=b. The matrix

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Probably the simplest kind of problem. Occurs in many contexts, often as part of larger problem. Symbolic manipulation packages can do linear algebra "analytically" (e.g. Mathematica,

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

An Out-of-Core Sparse Cholesky Solver

An Out-of-Core Sparse Cholesky Solver An Out-of-Core Sparse Cholesky Solver JOHN K. REID and JENNIFER A. SCOTT Rutherford Appleton Laboratory 9 Direct methods for solving large sparse linear systems of equations are popular because of their

More information

Chapter 13. Boundary Value Problems for Partial Differential Equations* Linz 2002/ page

Chapter 13. Boundary Value Problems for Partial Differential Equations* Linz 2002/ page Chapter 13 Boundary Value Problems for Partial Differential Equations* E lliptic equations constitute the third category of partial differential equations. As a prototype, we take the Poisson equation

More information

CS 770G - Parallel Algorithms in Scientific Computing

CS 770G - Parallel Algorithms in Scientific Computing CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,

More information

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Space-filling curves for 2-simplicial meshes created with bisections and reflections

Space-filling curves for 2-simplicial meshes created with bisections and reflections Space-filling curves for 2-simplicial meshes created with bisections and reflections Dr. Joseph M. Maubach Department of Mathematics Eindhoven University of Technology Eindhoven, The Netherlands j.m.l.maubach@tue.nl

More information

A Note on Scheduling Parallel Unit Jobs on Hypercubes

A Note on Scheduling Parallel Unit Jobs on Hypercubes A Note on Scheduling Parallel Unit Jobs on Hypercubes Ondřej Zajíček Abstract We study the problem of scheduling independent unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between

More information

PARDISO Version Reference Sheet Fortran

PARDISO Version Reference Sheet Fortran PARDISO Version 5.0.0 1 Reference Sheet Fortran CALL PARDISO(PT, MAXFCT, MNUM, MTYPE, PHASE, N, A, IA, JA, 1 PERM, NRHS, IPARM, MSGLVL, B, X, ERROR, DPARM) 1 Please note that this version differs significantly

More information