
SIAM J. MATRIX ANAL. APPL. Vol. 31, No. 3, pp. 1382-1411. © 2009 Society for Industrial and Applied Mathematics

SUPERFAST MULTIFRONTAL METHOD FOR LARGE STRUCTURED LINEAR SYSTEMS OF EQUATIONS

JIANLIN XIA, SHIVKUMAR CHANDRASEKARAN, MING GU, AND XIAOYE S. LI

Abstract. In this paper we develop a fast direct solver for large discretized linear systems using the supernodal multifrontal method together with low-rank approximations. For linear systems arising from certain partial differential equations such as elliptic equations, during the Gaussian elimination of the matrices with proper ordering, the fill-in has a low-rank property: all off-diagonal blocks have small numerical ranks with proper definition of off-diagonal blocks. Matrices with this low-rank property can be efficiently approximated with semiseparable structures called hierarchically semiseparable (HSS) representations. We reveal the above low-rank property by ordering the variables with nested dissection and eliminating them with the multifrontal method. All matrix operations in the multifrontal method are performed in HSS forms. We present efficient ways to organize the HSS structured operations along the elimination. Some fast HSS matrix operations using tree structures are proposed. This new structured multifrontal method has nearly linear complexity and a linear storage requirement. Thus, we call it a superfast multifrontal method. It is especially suitable for large sparse problems and also has natural adaptability to parallel computations and great potential to provide effective preconditioners. Numerical results demonstrate the efficiency.

Key words. structured direct solver, hierarchically semiseparable matrix, low-rank property, superfast multifrontal method, nested dissection

AMS subject classifications. 15A23, 65F05, 65F30, 65F50

DOI. 10.1137/09074543X

1. Introduction. In many computational and engineering problems it is critical to solve large structured linear systems of equations. Different structures come from different natures of the original problems or different techniques of discretization, linearization, or simplification.
It is usually important to take advantage of the special structures or to preserve the structures when necessary. Direct solvers often provide good chances to exploit the structures. Direct methods are attractive also due to their efficiency, reliability, and generality. Here we are interested in the discretization of differential equations such as elliptic PDEs on a two-dimensional (2D) finite element mesh (grid) M. We consider discretizations with a regular mesh or, more generally, discretizations with a well-shaped mesh [37, 38, 42], so that nested dissection or its generalizations [19, 20, 23, 32, 42] can be used. The finite element system

(1.1)  Ax = b

associated with the discretization on M is considered, where A is symmetric positive definite (SPD). Such matrices arise, say, when we apply finite difference or finite element techniques to solve 2D linear boundary value problems such as elliptic boundary problems on rectangular domains. For the purpose of presentation, we focus on 2D problems and demonstrate the potential of our ideas, although it is possible to generalize to other problems. For convenience, we will refer to the following model problem at some places.

Received by the editors January 5, 2009; accepted for publication (in revised form) by J. H. Brandts September 2, 2009; published electronically December 4, 2009. http://www.siam.org/journals/simax/31-3/74543.html. Department of Mathematics, Purdue University, West Lafayette, IN 47907 (xiaj@math.purdue.edu). Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 (shiv@ece.ucsb.edu). Department of Mathematics, University of California, Berkeley, CA 94720 (mgu@math.berkeley.edu). Lawrence Berkeley National Laboratory, MS 50F-1650, One Cyclotron Rd., Berkeley, CA 94720 (xsli@lbl.gov).

Model Problem 1.1. We use the 2D discrete Laplacian on a 5-point stencil as a model problem, where A is a 5-diagonal n^2 x n^2 sparse matrix:

(1.2)  A = \begin{pmatrix} T & -I & & \\ -I & T & \ddots & \\ & \ddots & \ddots & -I \\ & & -I & T \end{pmatrix}, \qquad T = \begin{pmatrix} 4 & -1 & & \\ -1 & 4 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 4 \end{pmatrix}.

Note we can also use the 9-point or other stencils, since the actual entries or nonzero patterns are not referenced in the descriptions of our method below.

To factorize the discretized matrix A, people typically first order the mesh points. Direct factorization of such a matrix with rowwise or columnwise mesh ordering takes O(n^4) flops [19]. The nested dissection ordering [19] gives an elimination scheme with O(n^3) cost, which is optimal for any ordering in exact arithmetic [31] (ignoring any special techniques such as Strassen's algorithm [41]). But O(n^3) is still large for a big n. Sometimes, iterative methods are cheaper if effective preconditioners are available. But for hard problems good preconditioners can be difficult to find.

1.1. Superfast multifrontal method. Direct solvers have been considered expensive because of the large amount of fill-in even if the original matrix is sparse. To effectively handle the fill-in, some people approximate full matrices in complicated problems with structured matrices such as H-matrices [26, 28, 29], H^2-matrices [3, 27, 30], quasiseparable matrices [17], semiseparable matrices [7, 9, 10], etc. Similarly, when solving discretized PDEs such as elliptic equations, we can develop fast direct solvers by exploiting the rank property and by approximating dense matrices in these problems with compact structured matrices without compromising the accuracy.
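Model Problem 1.1 is easy to reproduce numerically via Kronecker products. The sketch below is a dense NumPy illustration for small n (the helper name `model_laplacian` is ours, not the paper's; in practice A is of course stored sparse); it assembles A of (1.2) and checks that it is SPD:

```python
import numpy as np

def model_laplacian(n):
    """Assemble the n^2 x n^2 five-point Laplacian A of (1.2) as
    A = I (x) T + K (x) I, with T = tridiag(-1, 4, -1),
    K = tridiag(-1, 0, -1), and (x) the Kronecker product."""
    T = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    K = -np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T) + np.kron(K, np.eye(n))

A = model_laplacian(8)
print(A.shape)                            # (64, 64)
print(np.allclose(A, A.T))                # True: symmetric
print(np.all(np.linalg.eigvalsh(A) > 0))  # True: positive definite
```

The Kronecker form mirrors the block structure of (1.2): `kron(I, T)` places T on the block diagonal, and `kron(K, I)` places the -I coupling blocks on the block off-diagonals.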
These approximations are feasible, as we notice that the fill-in during the elimination with certain ordering is actually structured. The fill-in is closely related to the Green's functions via Schur complements. Thus the off-diagonal blocks of the fill-in have small numerical ranks, which has been observed in [1, 2, 8, 24, 25, 35, 45] and other work. Due to this low-rank property, we show that the structured approximations of the dense N x N subproblems in those problems can be solved with a cost of O(N), and this leads to a total complexity of O(pn^2), where p is a constant related to the PDE and the accuracy in the matrix approximations and n is the mesh dimension. Our solver includes both the approximation stage of dense matrices and the direct solution stage using the approximations. We still call the overall procedure a direct solver. Our work shares similar ideas as those in [1, 2, 8, 24, 25, 35, 45]. Here, we fully integrate sparse matrix techniques (nested dissection) in the context of a supernodal multifrontal method, and we use tree structures for both the overall matrix factorization and the intermediate data structure. The multifrontal method is one of the most important direct methods for sparse matrix solutions [16, 34]. In our direct solver we order the mesh nodes into separators with nested dissection [19] and organize the elimination process with a supernodal multifrontal method. The method eliminates separators and accumulates updates locally following an elimination tree [22, 33].

Moreover, the dense intermediate matrices (fill-in) in the supernodal multifrontal method for many discretized PDEs have the above low-rank property. Those dense matrices are then approximated by tree-structured semiseparable matrices, called hierarchically semiseparable (HSS) matrices [11, 13, 14, 46]. HSS matrices have a close relation to H^2-matrices in [3, 27, 30]. Many HSS operations such as matrix multiplications and system solutions can be done in nearly linear complexity. HSS matrices feature hierarchical low-rank properties in the off-diagonal blocks. More specifically, we say that a matrix has the low-rank property if all its off-diagonal blocks have small ranks or numerical ranks (Definition 2.6). In our supernodal multifrontal method, all matrix operations are conducted efficiently via HSS approximations. Some basic HSS operations can be found in [11, 13, 14]. In this paper, we provide some new HSS algorithms necessary to convert the multifrontal method into a structured one. These new HSS algorithms provide an innovative way of handling HSS matrix operations by tree techniques. Both the multifrontal method and the HSS structure have nice hierarchical tree structures. They both take good advantage of dense matrix computations and have efficient storage, good data locality, and natural adaptability to parallel computations. The usage of HSS matrices in the multifrontal method leads to an efficient structured multifrontal method. For problems such as the 5-point or 9-point finite difference Laplacian, this structured multifrontal method has nearly linear complexity. We thus call this method a superfast multifrontal method. The method is also memory efficient. By setting a relatively large tolerance in the matrix approximations, the method can also work as an efficient and effective preconditioner. We have developed a software package for the solver and various HSS operations.

1.2. Outline of the paper. This paper is organized as follows.
In section 2, we give an introduction to the multifrontal method, nested dissection, and HSS structures. Section 3 demonstrates the low-rank property and presents an overview of our new superfast multifrontal method. The detailed superfast structured multifrontal algorithm is discussed in section 4. Two major steps are covered: structured elimination of separators and structured matrix assembly (called extend-add). Some new HSS operations are proposed. Section 5 demonstrates the efficiency of the superfast multifrontal method with numerical experiments in terms of the model problem and a linear elasticity problem. Section 6 gives some general remarks.

2. Review of the multifrontal method and HSS representations. In this section, we briefly review the multifrontal method and HSS representations which build our superfast structured multifrontal method.

2.1. Multifrontal method with nested dissection ordering. In the direct factorization of a sparse matrix, usually, the rows and columns of the matrix are first ordered to reduce fill-in. Nested dissection [19] is an important method for finding an elimination ordering. Consider a discretized matrix and its associated mesh. Nested dissection orders mesh points with the aid of separators. A separator is a small set of mesh points whose removal divides the rest of the mesh or submesh into two disjoint pieces. The mesh is recursively partitioned with multiple levels of separators. Lower level separators are ordered before upper level ones; see Figure 2.1(i). Here, by a level we mean a set of separators at the same level of partition. After the nodes are ordered, we compute the Cholesky factorization A = LL^T by Gaussian elimination, which corresponds to the elimination of the mesh points

(unknowns) from lower levels to upper levels.

[Fig. 2.1. Separators in nested dissection and the connections of mesh points: (i) partition with separators; (ii) connections of points during elimination.]

Mesh points may get connected during the elimination (Figure 2.1(ii)). The elimination of a mesh point pairwise connects all its neighbor points [19, 39]. The factorization process can be conducted following the multifrontal method [16, 34]. The central idea of the multifrontal method is to reorganize the overall factorization of A into partial updates and factorizations of small dense matrices. It has been widely used in numerical methods for PDEs, optimization, fluid dynamics, and other areas. Suppose j - 1 steps of factorizations of A are finished. Some portions of the first j - 1 columns of L contribute to later computations in the form of outer-product updates [34]. The nonzero portion of column j of A and some early outer-product contributions are assembled together by an operation called extend-add. The result matrix is called a frontal matrix F_j. One step of factorization of F_j gives the jth column of L. The Schur complement is called the update matrix. See [34] for a formal discussion of the procedure. To conveniently consider how the update contributions are passed, a powerful tool called the elimination tree is used. The following definitions can be found in [33, 34, 40], etc.

Definition 2.1. The elimination tree T(A) of an N x N matrix A is the tree structure with N nodes {1, ..., N} such that node p is the parent of j if and only if p = min{ i > j : l_{ij} != 0 }, where A = LL^T and L = (l_{ij})_{N x N} is lower triangular.

In addition, the concept of an assembly tree is given in [34]. In this work, we do not distinguish between these two types of trees. We also assume the elimination follows the postordering of the elimination tree. We use a supernodal version of the multifrontal method together with nested dissection to solve the discretized problems. Each separator in nested dissection is considered to be a node in the postordering elimination tree.
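Definition 2.1 can be checked directly on small matrices. The brute-force sketch below is our own illustration (production sparse codes build T(A) symbolically without forming L): it factorizes A densely and reads each parent off the first subdiagonal nonzero of the corresponding column of L.

```python
import numpy as np

def elimination_tree(A, tol=1e-12):
    """parent[j] = min{ i > j : l_ij != 0 } with A = L L^T (Definition 2.1);
    -1 marks a root. Brute force via a dense Cholesky factor."""
    L = np.linalg.cholesky(A)
    N = A.shape[0]
    parent = np.full(N, -1)
    for j in range(N):
        rows = np.nonzero(np.abs(L[j + 1:, j]) > tol)[0]
        if rows.size:
            parent[j] = j + 1 + rows[0]
    return parent

# A tridiagonal SPD matrix factors with no fill, so its elimination
# tree is a chain: each node's parent is the next node.
T = 4 * np.eye(6) - np.eye(6, k=1) - np.eye(6, k=-1)
print(elimination_tree(T))   # [ 1  2  3  4  5 -1]
```

The chain shape reflects why orderings matter: nested dissection instead produces a short, bushy elimination tree whose branches can be eliminated independently.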
The separators are put into different levels of the tree (Figure 2.2). During elimination, the separators are eliminated following the postordering elimination tree. The elimination of a separator will connect all its neighbor separator pieces. For a separator i, let p_1, p_2, ..., p_k be the pieces of the uneliminated neighbor separators of i which are directly or indirectly connected to i (due to matrix factorization). We say that the pieces {i, p_1, p_2, ..., p_k} form an element and that i is the pivot separator of this element. Figure 2.3 shows two examples.
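The recursive partition by separators can be sketched in a few lines for a rectangular grid. This is an illustrative toy (the function name and the middle-line splitting rule are ours; nested dissection on general well-shaped meshes uses graph partitioning): the two halves are ordered first and the separator last, recursively, so lower level separators precede upper level ones.

```python
def nested_dissection_order(rows, cols, r0=0, c0=0):
    """Return the points of a rows x cols grid (offset by (r0, c0)) in
    nested dissection order: recurse on the two halves, separator last."""
    if rows <= 0 or cols <= 0:
        return []
    if rows * cols <= 3:  # small block: any local order is fine
        return [(r0 + i, c0 + j) for i in range(rows) for j in range(cols)]
    if cols >= rows:      # split by the middle column
        m = cols // 2
        return (nested_dissection_order(rows, m, r0, c0)
                + nested_dissection_order(rows, cols - m - 1, r0, c0 + m + 1)
                + [(r0 + i, c0 + m) for i in range(rows)])
    else:                 # split by the middle row
        m = rows // 2
        return (nested_dissection_order(m, cols, r0, c0)
                + nested_dissection_order(rows - m - 1, cols, r0 + m + 1, c0)
                + [(r0 + m, c0 + j) for j in range(cols)])

order = nested_dissection_order(7, 7)
print(len(order))   # 49: a permutation of all grid points
print(order[-7:])   # the top level separator (the middle column) comes last
```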

[Fig. 2.2. Ordering separators: (i) ordering of the separators on the mesh; (ii) separator tree/nested dissection elimination tree, with levels 1-4 marked.]

[Fig. 2.3. Examples of elements: (i) leaf node; (ii) non-leaf node.]

The ordering of p_1, p_2, ..., p_k may not necessarily follow their order in the elimination tree. If separator i is a bottom level separator in nested dissection (or a leaf node in the elimination tree) (Figure 2.3(i)), the frontal matrix F_i is directly formed from A:

(2.1)  F_i \equiv F_i^0 = \begin{pmatrix} A_{ii} & A_{i p_1} & \cdots & A_{i p_k} \\ A_{p_1 i} & & & \\ \vdots & & 0 & \\ A_{p_k i} & & & \end{pmatrix}, \qquad U_i = -\begin{pmatrix} A_{p_1 i} \\ \vdots \\ A_{p_k i} \end{pmatrix} A_{ii}^{-1} \begin{pmatrix} A_{i p_1} & \cdots & A_{i p_k} \end{pmatrix}.

The elimination of separator i provides the block column in L corresponding to i, and the Schur complement is the ith update matrix U_i. If separator i is not a leaf node (Figure 2.3(ii)), we assume it has two children c_1 and c_2. The update matrices U_{c_1} and U_{c_2} represent contributions from the subtrees rooted at c_1 and c_2, respectively. Then F_i is obtained by assembling F_i^0, U_{c_1}, and U_{c_2} with the extend-add process, and the elimination of separator i yields U_i:

(2.2)  F_i = F_i^0 \leftrightarrow U_{c_1} \leftrightarrow U_{c_2} = \begin{pmatrix} L_i & 0 \\ L_B & I \end{pmatrix} \begin{pmatrix} L_i^T & L_B^T \\ 0 & U_i \end{pmatrix},

where B denotes the element boundary {p_1, p_2, ..., p_k}. For convenience, when presenting the ideas of handling the connections of separators, we use the situation k = 4 as in the model problem. Situations with a general k can be similarly discussed (subsection 4.6). Here, as shown in Figure 2.4, we say that the separator pieces {c_1, p_1^L, p_2^L, p_3^L, p_4^L} form the left child element and {c_2, p_1^R, p_2^R, p_3^R, p_4^R} form the right child element.
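The extend-add operation in (2.2) is, in essence, index-aligned summation: each frontal or update matrix carries the list of global mesh indices its rows and columns correspond to, and the matrices are summed after padding to the union of their index sets. A small sketch (function and variable names are ours, not the paper's software):

```python
import numpy as np

def extend_add(blocks):
    """blocks: list of (index_list, matrix) pairs, each matrix indexed by
    global ids. Align every matrix against the sorted union of the ids,
    pad with zeros, and sum (the extend-add of the multifrontal method)."""
    union = sorted(set().union(*[set(ix) for ix, _ in blocks]))
    pos = {g: k for k, g in enumerate(union)}
    F = np.zeros((len(union), len(union)))
    for ix, M in blocks:
        p = [pos[g] for g in ix]
        F[np.ix_(p, p)] += M
    return union, F

# Two overlapping 2 x 2 updates; they share global index 2:
U1 = np.array([[1., 2.], [2., 3.]])   # indexed by [1, 2]
U2 = np.array([[4., 5.], [5., 6.]])   # indexed by [2, 3]
idx, F = extend_add([([1, 2], U1), ([2, 3], U2)])
print(idx)   # [1, 2, 3]
print(F)     # the diagonal entry for global index 2 receives 3 + 4 = 7
```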

[Fig. 2.4. In extend-add, separator pieces in the left and right child elements, marked by the solid-line oval box and the dashed-line oval box, respectively.]

[Fig. 2.5. Example: a binary tree with postordering and two levels of HSS off-diagonal blocks of a matrix, where the indices follow the postordering of the tree: (i) postordering of a binary tree (leaves 1, 2, 4, 5; internal nodes 3, 6, 7); (ii) 2nd level HSS blocks; (iii) 1st level HSS blocks.]

Recursive application of the above procedure to all nodes of the elimination tree leads to a supernodal multifrontal method. The supernodal multifrontal method with nested dissection for factorizing A in (1.2) costs O(n^3) flops. The number of nonzeros in the Cholesky factor is O(n^2 log_2 n). A stack of size O(n^2) is used to store the update matrices.

2.2. Hierarchically semiseparable structures. Semiseparable or quasiseparable structures have attracted a lot of interest in recent years. In our superfast multifrontal method, frontal matrices and update matrices are approximated by semiseparable matrices. Semiseparable forms of upper level frontal matrices are obtained from lower level ones recursively. In this subsection, we review a tree-structured semiseparable representation. Note this tree structure is used for each frontal matrix and is not associated with the outer assembly tree. Thus, this subsection can be understood independent of the multifrontal method. There are different definitions for semiseparable matrices [17, 18, 43, 44]. One definition often used is based on the low-rankness of appropriate off-diagonal blocks. Here we use the HSS off-diagonal blocks in [11, 12, 13, 14], as shown in the example in Figure 2.5. We first define HSS blocks with the aid of a full binary tree (a binary tree where each node except the root has exactly one sibling) and its postordering.

Definition 2.2 (HSS blocks). HSS blocks are block rows or columns excluding the diagonal parts, defined at different levels of splittings of a matrix as follows.
Given a full binary tree with its postordering, an N x N matrix H, and a partition sequence {m_j}_{j=1}^k, where i_j, j = 1, 2, ..., k, are the leaf nodes of the tree and \sum_{j=1}^k m_j = N, partition H into k block rows (columns) following {m_j}_{j=1}^k so that block row (column) j has m_j rows (columns) of H. Any block row (column) excluding the m_j x m_j diagonal block is called a bottom level HSS (off-diagonal) block. Associate

[Fig. 2.6. An HSS tree corresponding to Figure 2.5 and the structured form of the matrix: (i) the HSS tree, with generators D_i, U_i, V_i at the leaves and R_i, W_i, B_i along the edges; (ii) hierarchical form of the matrix.]

with each leaf a bottom level HSS block. An HSS block for an upper level node is defined recursively from child HSS blocks but with the appropriate diagonal block removed.

We emphasize that postordered trees are used in this paper so that the HSS blocks in Figure 2.5(ii) are indexed following the ordering of the tree nodes. This significantly simplifies the HSS notation below and the coding, as discussed in [11], since for each node only one index is needed instead of two or three as in [12, 13, 14]. A full binary tree with k leaves has totally 2k - 1 nodes, where node 1 is the first leaf and 2k - 1 is the root. The binary tree used in the above definition is called an HSS tree (Figure 2.6(i); also called a merge tree in [12, 13, 14]), which helps define the HSS structure.

Definition 2.3 (HSS tree and HSS representation). An HSS tree T = (V, E) that defines an HSS representation for a matrix H is the binary tree in Definition 2.2 and is further defined as follows. Let node 1 be the first leaf, 2k - 1 be the root, and i_j, j = 1, 2, ..., k, be the leaves of T. Each node i ∈ V (i < 2k - 1) is associated with matrices D_i, U_i, V_i, R_i, W_i, B_i, which are called generators of H. The HSS representation of H is given by the generators {R_i, W_i, B_i}_{i=1}^{2k-2} and {D_{i_j}, U_{i_j}, V_{i_j}}_{j=1}^{k}, which satisfy the recursive definition of upper level generators D_i, U_i, and V_i:

(2.3)  D_i = \begin{pmatrix} D_{c_1} & U_{c_1} B_{c_1} V_{c_2}^T \\ U_{c_2} B_{c_2} V_{c_1}^T & D_{c_2} \end{pmatrix}, i ∈ V a nonleaf node,

       U_i = \begin{pmatrix} U_{c_1} R_{c_1} \\ U_{c_2} R_{c_2} \end{pmatrix}, \quad V_i = \begin{pmatrix} V_{c_1} W_{c_1} \\ V_{c_2} W_{c_2} \end{pmatrix}, i ∈ V \ {2k - 1} a nonleaf node,

so that at the top level, D_{2k-1} ≡ H, where c_1 and c_2 represent the left and right children of i, respectively.

Remark 2.4. Generators are indexed following the postordering of the tree nodes.
The generators D_i, U_i, V_i for a nonleaf node i are not explicitly stored. R_i and W_i are empty matrices if i is a direct child of the root. The following is a block 4 x 4 HSS example corresponding to Figure 2.5(i) and Figure 2.6, with block row sizes m_1, m_2, m_4, m_5:

(2.4)  H = \begin{pmatrix}
D_1 & U_1 B_1 V_2^T & U_1 R_1 B_3 W_4^T V_4^T & U_1 R_1 B_3 W_5^T V_5^T \\
U_2 B_2 V_1^T & D_2 & U_2 R_2 B_3 W_4^T V_4^T & U_2 R_2 B_3 W_5^T V_5^T \\
U_4 R_4 B_6 W_1^T V_1^T & U_4 R_4 B_6 W_2^T V_2^T & D_4 & U_4 B_4 V_5^T \\
U_5 R_5 B_6 W_1^T V_1^T & U_5 R_5 B_6 W_2^T V_2^T & U_5 B_5 V_4^T & D_5
\end{pmatrix}.

To see the hierarchical structure of (2.4), we can write H as

H = \begin{pmatrix} D_3 & U_3 B_3 V_6^T \\ U_6 B_6 V_3^T & D_6 \end{pmatrix}  (block row sizes m_1 + m_2 and m_4 + m_5),

corresponding to Figure 2.5(iii), where the generators are obtained by setting i = 3, c_1 = 1, c_2 = 2 in (2.3). However, only the generators in (2.4) are explicitly stored.

Remark 2.5. In the HSS representation,
- each U_i is an appropriate column basis matrix for an HSS block row. For example, the second level HSS block row associated with node i = 1 is U_1 ( B_1 V_2^T   R_1 B_3 ( W_4^T V_4^T   W_5^T V_5^T ) ).
- we can verify the following [11]: to identify a block of H, say, the (2, 3) block in (2.4), we can use the directed path connecting the 2nd and 3rd nodes at the bottom level (nodes 2 and 4 as marked) of the HSS tree, which yields U_2 R_2 B_3 W_4^T V_4^T.
- for a symmetric HSS matrix, we can set U_i = V_i, R_i = W_i, and B_i = B_j^T, i = 1, 2, ..., 2k - 2, where j represents the sibling of each i.

Theoretically, an HSS representation can be constructed for any matrix H [11, 14] and an appropriate HSS tree. However, such a representation is generally more useful when the HSS blocks have small (numerical) ranks.

Definition 2.6 (numerical rank and HSS rank). In this work, the numerical rank of any matrix block with a relative or absolute tolerance τ is the rank obtained by applying rank revealing QR factorizations [5, 6] or τ-accurate SVD (SVD with a tolerance τ for singular values) to the block. For a matrix and an HSS tree, the maximum of the numerical ranks of the HSS blocks at all tree levels is called the HSS rank (with a given τ) of the matrix. Later, we say a matrix is hierarchically separable if its HSS rank is small with a given τ.

For an HSS matrix with a small HSS rank p, if all B_i generators in its HSS representation have sizes close to p, we say that the HSS form is compact. It is shown in [11, 14] that for a matrix in compact HSS form, nearly linear complexity system solvers exist. Many other HSS matrix operations such as structure generation, compression, etc., are also very efficient. The reader is referred to [11, 12, 13, 14] for more details on HSS representations.

3. Superfast multifrontal method: Low-rank property and overview.
Notice that in the multifrontal method for discretized matrices the frontal and update matrices are generally dense because of the mutual connections among mesh nodes (Figure 2.1(ii)). The elimination together with the extend-add operation on such a dense N x N matrix typically takes O(N^3) flops in exact arithmetic. Here we consider approximations of these dense matrices. Approximations of dense matrices are feasible in solving linear systems derived from discretizations of certain PDEs such as elliptic equations, as we discover that low-rank properties exist in these problems. Similar results can also be found in [1, 2, 8, 24, 25, 35]. In this work, we take advantage of the hierarchical tree structures of both HSS matrices and the multifrontal method. We have developed a series of efficient HSS operations [11, 14]. Additional HSS operations necessary for our superfast multifrontal method will be presented here. With these techniques, we are able to produce a structured multifrontal method, and we reduce the total complexity for solving discretized problems such as (1.2) from O(n^3) to O(pn^2), and the storage from O(n^2 log_2 n) to O(n^2 log_2 p), where p is a parameter related to the problem and the tolerance for matrix approximations.

Table 3.1. Numerical ranks with different relative tolerances τ of four F_{iB} blocks from mesh dimensions n = 127, 255, 511, and 1023, respectively. Note the size of F_{iB} can be larger than n.

  size(F_{iB})   τ = 10^-2   10^-4   10^-6   10^-8   10^-10
  31 x 161             8       13      17      20       23
  63 x 321             9       15      21      25       29
  127 x 641            9       18      24      30       36
  255 x 1281          10       20      28      35       42

Table 3.2. HSS ranks of an order 1023 block F_{ii} with different τ, where the bottom level HSS block row size is about 16 and a perfect binary HSS tree with 64 nodes is used.

  τ           10^-2   10^-4   10^-6   10^-8   10^-10
  HSS rank      6       12      17      21       26

3.1. Off-diagonal numerical ranks. It has been shown in [1, 2, 8, 24, 25] that the low-rank property exists in the LU factorizations of finite-element matrices from elliptic operators and some other problems. Here in the context of the supernodal multifrontal method, we show some rank results of the frontal and update matrices. For a separator i and all its neighbors (denoted B) as shown in Figure 2.3, we order them and their interior nodes properly (Definition 3.1 below). The corresponding frontal matrix F_i has the following form:

(3.1)  F_i = \begin{pmatrix} F_{ii} & F_{iB} \\ F_{Bi} & F_{BB} \end{pmatrix} = \begin{pmatrix} L_i & 0 \\ L_B & I \end{pmatrix} \begin{pmatrix} L_i^T & L_B^T \\ 0 & U_i \end{pmatrix},

where the elimination of separator i gives the update matrix U_i. For the frontal and update matrices in the supernodal multifrontal method for solving Model Problem 1.1, we have the following critical rank observations:
- The off-diagonal block F_{iB} has a small numerical rank.
- The HSS blocks of F_{ii} have small numerical ranks.
- The HSS blocks of U_i have small numerical ranks.

Some rank results are reported as follows.

(1) Numerical rank of F_{iB}. We choose some frontal matrices F_i and compute the numerical rank of F_{iB} in each F_i. Table 3.1 shows the rank results.

(2) Off-diagonal numerical ranks of F_{ii}. We then test the HSS ranks of F_{ii} with different relative tolerances. As an example, we use the frontal matrix corresponding to the top level separator of a 1023 x 1023 mesh. In such a situation, F_{ii} (in F_i) has order 1023.
We choose a fixed block row size and make all bottom level HSS off-diagonal blocks have approximately the same row dimension, so that the HSS tree is a perfect binary tree. (As an example, when there are four block rows, a binary tree as in Figure 2.6 is used. Again, this binary tree is not related to the elimination tree.) Table 3.2 shows the HSS ranks, which are relatively small as compared with the size of F_{ii}.

(3) Off-diagonal numerical ranks of U_i. Similarly, Table 3.3 shows HSS ranks of an update matrix U_i with different tolerances. The mesh dimension is n = 1023. Similar situations hold for the frontal matrices. When n is larger, the low-rank property is more significant.

The low-rank property has a certain physical background. For example, the paper [4] considers a 2D physical model consisting of a set of particles with pairwise interactions satisfying Coulomb's law. The authors define well-separated sets of particles to

SUPERFAST MULTIFRONTAL METHOD FOR LINEAR SYSTEMS 1391

Table 3.3
HSS ranks of a 1023 × 1023 update matrix U_i with different τ, where the bottom level HSS block row size is about 16 and a perfect binary HSS tree with 64 nodes is used.

τ        | 10^{-2} | 10^{-4} | 10^{-6} | 10^{-8} | 10^{-10}
HSS rank |    6    |   12    |   17    |   22    |   26

Fig. 3.1. Examples of well-separated sets in an element.

be the sets that have strong interactions within sets but weak ones between different sets. Here, when we consider 2D meshes for discretized problems, we have a similar situation. In Figure 2.1(i) the mesh points in the current-level element are mutually connected. But the connections differ for different points. We can think of the points closer to each other, or not well separated, as having stronger connections; see Figure 3.1. For theoretical analysis of the low-rank property, see, e.g., [1, 2, 8, 24, 25].

3.2. Overview of the structured extend-add process and the structured multifrontal method. We can take advantage of the previous low-rank property in the multifrontal process to get a new structured method. There are two major tasks: to replace the traditional dense Cholesky factorization by a structured one, and to develop a structured extend-add process. We use HSS matrices to approximate frontal and update matrices. HSS matrices can be quickly factorized due to the low-rank property. HSS forms are accumulated bottom-up along the assembly tree. The structured extend-add process is relatively complicated, since the mesh nodes and separators are generally not consistent with the HSS block partitions.
We consider a separator i, its four neighbor pieces p_1, p_2, p_3, p_4 at upper levels of the assembly tree, and its children c_1 and c_2 as in Figure 2.3(ii). In the traditional multifrontal method, the frontal matrix F_i is obtained from the extend-add operation (2.2):

(3.2)   F_i = F_i^0 ⊕ U_{c_1} ⊕ U_{c_2} ≡ F_i^0 + Û_{c_1} + Û_{c_2},

where each Û_{c_j} is a subtree update matrix obtained from U_{c_j} by matching indices to F_i^0 and inserting zero entries [34]. The matrices in (3.2) take the nonzero patterns as illustrated in Figure 3.2.
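A dense sketch of the extend-add (3.2) may help fix ideas: each child update matrix lives on a subset of the parent front's index set, so it is aligned with the parent's indices (zero rows/columns inserted where a child has no counterpart) and then added. The index lists and helper below are illustrative, not the paper's data structures.

```python
import numpy as np

def extend_add(F0, updates):
    """Dense sketch of (3.2): align each child update with the parent's
    index set and add it in (the 'hat' operation followed by a sum)."""
    parent_idx, F = F0
    pos = {g: k for k, g in enumerate(parent_idx)}   # global label -> local slot
    F = F.copy()
    for child_idx, U in updates:
        loc = np.array([pos[g] for g in child_idx])  # match indices to F0
        F[np.ix_(loc, loc)] += U                     # scatter-add the update
    return F

# Tiny example: parent front over nodes [0,1,2,3]; children over subsets.
F0 = ([0, 1, 2, 3], np.eye(4))
U1 = ([0, 1], np.full((2, 2), 2.0))
U2 = ([1, 3], np.full((2, 2), 3.0))
F = extend_add(F0, [U1, U2])
```

Node 1 lies on the boundary of both children, so its diagonal entry receives contributions from F_i^0 and from both updates; the structured extend-add of section 4 performs exactly this index bookkeeping, but on HSS trees instead of dense arrays.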
There are three key issues in developing the structured extend-add process.
(1) Uniform ordering. Firstly, in order to effectively conduct (3.2) and to handle the interactions of elements, we need to match the ordering of separators and mesh points at different levels. The ordering can be predetermined in a symbolic factorization stage. We define the following uniform ordering, whose effectiveness in revealing the low-rank property is shown by our numerical experiments.
Definition 3.1 (uniform ordering). The separator pieces and mesh points within an element are uniformly ordered if the neighbors are ordered counterclockwise as

Fig. 3.2. Matrices in the extend-add (3.2), where the +-shaped bars correspond to the overlap p_1^L ∩ p_1^R in separator p_1.

Fig. 3.3. Uniform ordering of neighbors and mesh points and the resulting rank pattern of F_i: (i) uniform ordering; (ii) transpose of (i); (iii) rank pattern of F_i.

Table 3.4
Making child element separator pieces consistent with the current level pieces p_1, p_2, p_3, p_4, following the uniform ordering.

Matrix   | Uniform ordering      | Permutation           | Padding zero blocks
F_i      | i, p_1, p_2, p_3, p_4 | /                     | /
U_{c_1}  | p_2, p_1^L, i, p_3^L  | i, p_1^L, p_2, p_3^L  | {p_1^L, 0}, p_2, {p_3^L, 0}, 0
U_{c_2}  | i, p_1^R, p_4, p_3^R  | i, p_1^R, p_3^R, p_4  | {0, p_1^R}, 0, {0, p_3^R}, p_4

shown in Figure 3.3(i) (or clockwise in Figure 3.3(ii), which can be considered as a transpose of Figure 3.3(i)), and the mesh points inside each separator piece are ordered following the natural ordering of mesh points (left-right and top-down).
According to the uniform ordering for Figure 2.3(ii), we have the ordering of the separator pieces and their corresponding matrices as shown in the first two columns of Table 3.4. Clearly, the neighbor orderings associated with U_{c_1} and U_{c_2} do not match with F_i. Thus, permutations of the separator pieces in the two child elements are needed for (3.2); see the third column of Table 3.4.
(2) Incompatible separator pieces. Secondly, the separator pieces for U_{c_1} are not fully compatible with those for U_{c_2}. That is, separator pieces in one child element may not appear in the other. For example, p_2 appears in the left child element but not in the right one. Thus, we need to insert some zero blocks into U_{c_1} and U_{c_2}. In terms of the separators, we attach a zero piece to p_1^L so that the length of {p_1^L, 0} is consistent with p_1. Other separators are processed similarly; see the fourth column of Table 3.4.
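The bookkeeping in Table 3.4 amounts to list manipulation on separator pieces: permute a child's pieces into the parent's uniform order and pad with zero pieces where the child has no counterpart. A small sketch, with hypothetical (name, length) pairs standing in for the pieces:

```python
def align_pieces(parent_layout, child_pieces):
    """Sketch of the Table 3.4 bookkeeping: order a child's separator pieces
    by the parent's layout, attaching zero pieces where the child's coverage
    is partial or missing.  Pieces are illustrative (name, length) pairs."""
    have = dict(child_pieces)
    aligned = []
    for name, length in parent_layout:
        if name in have:
            aligned.append((name, have[name]))            # permuted into place
            if have[name] < length:                       # partial piece, e.g. p1^L in p1:
                aligned.append(("0", length - have[name]))  # attach a zero piece
        else:
            aligned.append(("0", length))                 # piece absent: whole zero block
    return aligned

# Left child covers part of p1, all of p2, part of p3, and none of p4.
parent = [("p1", 8), ("p2", 4), ("p3", 8), ("p4", 4)]
child  = [("p1", 5), ("p2", 4), ("p3", 6)]   # lengths of p1^L, p2, p3^L
print(align_pieces(parent, child))
```

After padding, both children's index layouts have the same total length as the parent's neighbor set, which is what makes the addition in (3.2) well defined.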

Fig. 3.4. Illustration of the superfast multifrontal method, where unstructured matrices are in dark color and others are structured. In each oval box, a frontal matrix is partially factorized and the update matrix is computed.

(3) Overlaps. Lastly, the child elements share (parts of) the neighbors. The left and right child elements share the entire separator i, which becomes the pivot separator in the upper element. The piece p_1^L in the left child element and p_1^R in the right one satisfy p_1^L ∪ p_1^R = p_1. In general, p_1^L ∩ p_1^R is nonempty (Figures 2.3(ii), 2.4, and 3.2) and is shared by both the left and right child elements. Furthermore, p_1^L ∩ p_1^R may not always correspond to an entire HSS block row/column. Then certain blocks may need to be split and merged with others. The techniques in subsection 4.2.2 can be used. A similar situation holds for p_3^L and p_3^R.
All the matrix operations are done in HSS forms to provide a structured extend-add process. After F_i is formed, we eliminate the pivot separator i and compute the Schur complement with the fast HSS algorithms in [11]. The structured extend-add is used again, and the process repeats. Before going into the details of the HSS operations, we give an overview of the structured multifrontal algorithm. A pictorial illustration is shown in Figure 3.4.
Algorithm 3.2 (structured supernodal multifrontal method).
1. Use nested dissection to order the nodes. Build a postordering elimination tree of separators.
2. Do traditional factorization and extend-add at certain bottom levels.
3. At a switching level, construct HSS approximations of update matrices and do structured extend-add.
4. Following the nested dissection ordering, do structured factorization and extend-add at each upper level.
   (a) Eliminate a separator by factorizing the pivot blocks of the structured frontal matrix to obtain the structured update matrix.
   (b) Do structured extend-add and repeat.
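The control flow of the steps above can be sketched as a stack-driven postorder sweep over the elimination tree with a switching level. Everything matrix-related is reduced to tags here; the tree, levels, and function names are illustrative, not from the paper.

```python
def multifrontal_sweep(children, level, postorder, switching_level):
    """Record, per separator, whether it would be processed densely or in
    HSS form under Algorithm 3.2's switching rule (a pure bookkeeping toy)."""
    stack, kinds = [], {}
    for sep in postorder:
        kids = children.get(sep, [])
        popped = [stack.pop() for _ in kids]   # child update matrices
        assert len(popped) == len(kids)
        if level[sep] > switching_level:       # bottom levels: traditional dense
            kinds[sep] = "dense"
        elif level[sep] == switching_level:    # first HSS approximations built here
            kinds[sep] = "hss"
        else:                                  # upper levels: structured throughout
            kinds[sep] = "hss"
        stack.append((sep, kinds[sep]))        # push this separator's update
    return kinds

# Three-level tree: root 0 (level 0), children 1, 2 (level 1), leaves 3..6 (level 2).
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
level = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2}
postorder = [3, 4, 1, 5, 6, 2, 0]
kinds = multifrontal_sweep(children, level, postorder, 1)
```

With the switching level set to 1, the leaves are handled traditionally while levels 1 and above are structured, mirroring the two layers in Figure 3.4.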
Two layers of trees are used: the outer layer elimination tree and an inner layer HSS tree for each separator. Steps 4(a) and 4(b) are the two major structured operations.

4. Superfast multifrontal method: Detailed algorithm. According to the previous section, in the supernodal multifrontal method, the update matrices and frontal matrices are approximated by compact HSS matrices. There are two major

Fig. 4.1. Permuting two subtrees.

tasks: the structured elimination step 4(a) of Algorithm 3.2 and the HSS structured extend-add in step 4(b) of Algorithm 3.2. In this section we briefly review a fast generalized HSS factorization in [11] and then discuss in detail some new HSS algorithms which build the HSS structured extend-add.

4.1. Fast generalized HSS Cholesky factorization. The elimination of a separator with order N can be done by the fast generalized HSS Cholesky factorization in [11] in O(pN) flops, where p is an appropriate HSS rank. The fast generalized HSS Cholesky factorization computes explicit factorizations of HSS matrices where the factors (called generalized HSS factors) consist of triangular matrices, permutations, and other orthogonal matrices. For a given SPD HSS matrix with generators {D_j}, {U_j (≡ V_j)}, {W_j (≡ R_j)}, {B_j}, the major steps include the following:
1. Introduce zeros into off-diagonal blocks by compressing the U_j generators. The compression is done by rank revealing QR factorizations or τ-accurate SVDs.
2. Partially factorize D_j. The subblocks of D_j corresponding to the zero off-diagonal entries are eliminated.
3. Merge the uneliminated subblock of D_j with that of the sibling of j in the HSS tree. Pass the block to the parent.
4. The HSS matrix is reduced to a new one with a smaller size and fewer blocks. Repeat the process.
This process is applied to F_{ii} of an HSS form frontal matrix F_i as in (3.1). This elimination corresponds to the removal of the subtree for F_{ii} from the HSS tree for F_i. Or, more specifically, after this elimination the HSS subtree for F_{ii} shrinks to one single node. The generators associated with this single node are used to update the remaining nodes. This leads to the Schur complement, or the update matrix U_i in HSS form. The details are given in [11]. Note that L_i in (3.1) is now the generalized HSS factor (see [11] for an example), and U_i = F_{BB} − L_B L_B^T is essentially computed by a structured low-rank update.
This update is fast because F_{BB} and L_B share some common generators.

4.2. Some basic HSS operations needed in structured extend-add. In order to convert the standard extend-add process into a structured one, we need some basic HSS operations, which are used to address the issues discussed in subsection 3.2. These operations include permuting, merging/splitting, and inserting/deleting HSS blocks in an HSS matrix.

4.2.1. Permuting HSS blocks. It is convenient to permute an HSS matrix by permuting its HSS tree. We can generally get the new HSS form of the permuted matrix by updating just a few generators. For example, consider permuting two neighboring block rows/columns at a certain level of the HSS matrix. This corresponds to the permutation of two neighboring HSS subtrees with roots being siblings; see Figure 4.1. Thus, in this simple situation, we can directly exchange the generators associated with i and j, and all their children, without updating the matrices. The new matrix

Fig. 4.2. Merging and splitting nodes of an HSS tree.

is still in HSS form. For more complicated situations, say, if the two subtrees are not neighbors, we can identify the path connecting the two subtrees and update the matrices associated with the nodes in the path and those connected to the path. As this varies for different situations, we come back to it in subsection 4.3, where the permutations are specifically designed for our superfast multifrontal method.

4.2.2. Merging and splitting HSS blocks. We first look at a simple situation.
(1) Basic merging and splitting. The hierarchical structure of HSS matrices makes it very convenient to merge and to split HSS blocks. For a node i with two children c_1 and c_2 in an HSS tree (Figure 4.2), we can merge c_1 and c_2 and update node i as in (2.3) in Definition 2.3. On the other hand, if we want to split i into c_1 and c_2, then we need to find D_{c_k}, U_{c_k}, V_{c_k}, R_{c_k}, W_{c_k}, B_{c_k}, k = 1, 2 such that (2.3) is satisfied. First, partition U_i, V_i, D_i conformally as

(4.1)   U_i = \begin{pmatrix} U_{i;1} \\ U_{i;2} \end{pmatrix}, \quad
        V_i = \begin{pmatrix} V_{i;1} \\ V_{i;2} \end{pmatrix}, \quad
        D_i = \begin{pmatrix} D_{c_1} & D_{i;1,2} \\ D_{i;2,1} & D_{c_2} \end{pmatrix}.

Then compute QR factorizations

(4.2)   (U_{i;1} \ \ D_{i;1,2}) = U_{c_1} (R_{c_1} \ \ T_1), \qquad (V_{i;2} \ \ T_1^T) = V_{c_2} (W_{c_2} \ \ B_{c_1}^T),
(4.3)   (U_{i;2} \ \ D_{i;2,1}) = U_{c_2} (R_{c_2} \ \ T_2), \qquad (V_{i;1} \ \ T_2^T) = V_{c_1} (W_{c_1} \ \ B_{c_2}^T).

Equations (4.1)–(4.3) provide all the necessary new generators.
(2) Advanced merging and splitting. There are more complicated situations that are very useful. Sometimes, we need to maintain the tree structure during merging or splitting. For an HSS matrix H, we consider splitting a piece from a leaf node i of the HSS tree of H and then merging that piece with a neighbor j. We look at a general situation where i and j are not siblings. Without loss of generality, we make two simplifications. One is that the matrix is symmetric; another is that block j is an empty block (or a zero block whose size is to be set by the splitting). It suffices to look at an example in Figure 4.3, where i = 5 and j = 10.
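Under the symmetric simplification (V = U), the basic split (4.1)–(4.3) can be sketched with two QR factorizations. The toy version below performs no rank truncation (the Q factors are kept square), and the variable names mirror the generators in the text rather than any particular HSS library.

```python
import numpy as np

def split_leaf(U, D, m1):
    """Split a symmetric leaf with generators U (m x r) and D (m x m) into
    children c1, c2 so that U = [Uc1 Rc1; Uc2 Rc2] and D12 = Uc1 Bc1 Uc2^T,
    in the spirit of (4.1)-(4.3) (sketch only; no compression)."""
    U1, U2 = U[:m1], U[m1:]
    D11, D12 = D[:m1, :m1], D[:m1, m1:]
    D22 = D[m1:, m1:]
    r = U.shape[1]
    # Analogue of (4.2): compress the first block row [U1 D12] = Uc1 (Rc1 T1).
    Uc1, RT = np.linalg.qr(np.hstack([U1, D12]))
    Rc1, T1 = RT[:, :r], RT[:, r:]
    # Analogue of (4.3): [U2 T1^T] = Uc2 (Rc2 Bc1^T), giving D12 = Uc1 Bc1 Uc2^T.
    Uc2, RB = np.linalg.qr(np.hstack([U2, T1.T]))
    Rc2, Bc1 = RB[:, :r], RB[:, r:].T
    return (Uc1, Rc1, D11), (Uc2, Rc2, D22), Bc1

rng = np.random.default_rng(0)
m, r, m1 = 8, 2, 3
U = rng.standard_normal((m, r))
D = rng.standard_normal((m, m)); D = D + D.T
(Uc1, Rc1, D11), (Uc2, Rc2, D22), Bc1 = split_leaf(U, D, m1)
# The child generators reassemble the original leaf exactly.
assert np.allclose(np.vstack([Uc1 @ Rc1, Uc2 @ Rc2]), U)
assert np.allclose(Uc1 @ Bc1 @ Uc2.T, D[:m1, m1:])
```

Because QR reproduces its input exactly, merging the children back recovers the leaf; the rank-revealing variants mentioned in subsection 4.1 would additionally truncate small trailing blocks.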
We first split node i into two child nodes c_1 and c_2 by the method above: (4.1)–(4.3). Then move c_2 to the position of j. After that we merge c_1 into i. The details are as follows. Identify the path connecting nodes c_2 and j:

p(c_2, 10): c_2 → 5 → 6 → 7 → 8 → 9 → 14 → 13 → 12 → 11 → 10.

We observe that in order to get a new HSS representation for H (denoted H̃), all the generators associated with the nodes in this path and those directly connected to

Fig. 4.3. Splitting a block (node 5) of an HSS tree, where c_1 and c_2 are virtual nodes.

it should be updated. We use tilded notation for the generators of H̃, with c_1 not merged to i yet. Call the subtrees rooted at nodes 9 and 14 the left and right subtrees, respectively. To get the HSS tree for H̃, we consider the following connections, where by a connection we mean the product of the generators associated with the path connecting two nodes.
- The connection between nodes 15 and c_2 should be transferred to the connection between 15 and 10.
- The connection between node c_2 and each node k ∈ {1, 2, 3, 4, c_1} that is directly connected to the path p(c_2, 10) should be transferred to the connection between nodes 10 and k.
- The connection between any two nodes k_1, k_2 ∈ {1, 2, 3, 4, c_1} that are directly connected to the path p(c_2, 10) should remain the same.
- The connections between node 15 and each node k ∈ {1, 2, 3, 4, c_1} that is directly connected to the path p(c_2, 10) should remain the same.
All these relations can be reflected by appropriate products of generators associated with the nodes; see [46] for the details. We can assemble all the matrix products into one single equation

(4.4)
\begin{pmatrix}
0 \ \ \tilde R_{14}^T\tilde U_{14}^T\\
\tilde R_1(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T)\\
\tilde R_2(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T))\\
\tilde R_3(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T)))\\
\tilde R_4(\tilde B_3^T \ \ \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T))))\\
\tilde R_{c_1}(\tilde B_4^T \ \ \tilde R_5(\tilde B_3^T \ \ \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T)))))
\end{pmatrix}
=
\begin{pmatrix}
0 \ \ R_9^TR_8^TR_7^TR_6^TR_5^TR_{c_2}^TU_{c_2}^T\\
R_1R_9 \ \ B_1R_7^TR_6^TR_5^TR_{c_2}^TU_{c_2}^T\\
R_2(B_1^T \ \ R_8R_9) \ \ B_2R_6^TR_5^TR_{c_2}^TU_{c_2}^T\\
R_3(B_2^T \ \ R_7(B_1^T \ \ R_8R_9)) \ \ B_3R_5^TR_{c_2}^TU_{c_2}^T\\
R_4(B_3^T \ \ R_6(B_2^T \ \ R_7(B_1^T \ \ R_8R_9))) \ \ B_4R_{c_2}^TU_{c_2}^T\\
R_{c_1}(B_4^T \ \ R_5(B_3^T \ \ R_6(B_2^T \ \ R_7(B_1^T \ \ R_8R_9)))) \ \ B_{c_1}U_{c_2}^T
\end{pmatrix},

where \tilde U_{14} ≡ \tilde U_{10}\tilde R_{10}\tilde R_{11}\tilde R_{12}\tilde R_{13}.
Equation (4.4) is partitioned into four nonzero blocks corresponding to the above four types of connection changes. It turns out that we

Fig. 4.4. Inserting a node into an HSS tree in two different ways, where dark nodes and edges should be updated or created: (i) HSS tree example; (ii) inserting a node; (iii) another way to insert.

can construct the generators on the left-hand side of (4.4) in the following sequential way. First, consider the last row in (4.4). Compute a QR factorization of the right-hand side such that

\tilde R_{c_1}(\tilde B_4^T \ \ \tilde R_5(\tilde B_3^T \ \ \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T))))) = Q_1T_1.

Then partition T_1 = (T_{1;1} \ \ T_{1;2}) such that T_{1;1} has the same column dimension as \tilde B_4^T. Thus, we can let \tilde R_{c_1} = Q_1, \tilde B_4^T = T_{1;1}, and

(4.5)   \tilde R_5(\tilde B_3^T \ \ \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T)))) = T_{1;2}.

In this way, one layer (the last row) is removed from (4.4). Next, combine (4.5) with the fourth row of (4.4):

\begin{pmatrix} \tilde R_4 \\ \tilde R_5 \end{pmatrix}
(\tilde B_3^T \ \ \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T)))) =
\begin{pmatrix} R_4(B_3^T \ \ R_6(B_2^T \ \ R_7(B_1^T \ \ R_8R_9))) \ \ B_4R_{c_2}^TU_{c_2}^T \\ T_{1;2} \end{pmatrix}.

Again, compute a QR factorization Q_2T_2 of the right-hand side. Then partition Q_2^T = (Q_{2;1}^T \ \ Q_{2;2}^T) such that Q_{2;1} has the same row dimension as B_4, and partition T_2 = (T_{2;1} \ \ T_{2;2}) such that T_{2;1} has the same column dimension as B_3^T. We can set

(4.6)   \tilde R_4 = Q_{2;1}, \quad \tilde R_5 = Q_{2;2}, \quad \tilde B_3^T = T_{2;1}, \quad \tilde R_6(\tilde B_2^T \ \ \tilde R_7(\tilde B_1^T \ \ \tilde R_8(\tilde R_9 \ \ \tilde B_9\tilde U_{14}^T))) = T_{2;2}.

Now, we can combine (4.6) with the third row of (4.4), and the above procedure repeats. Finally, it is trivial to merge node c_2 into j. The overall process costs no more than O(N) flops for an order-N matrix H.

4.2.3. Inserting and deleting HSS blocks. Sometimes, we need to insert a block row/column into an HSS matrix or to remove one from it. To remove a block is usually straightforward. For simplicity, in this subsection we consider symmetric HSS matrices.
To remove a node i from an HSS tree, we remove any generators associated with the subtree rooted at i and merge the sibling node j of i into its parent p by setting U_p = U_jR_j and D_p = D_j if j is a leaf node. To insert a block row/column into an HSS matrix, the result depends on the desired HSS structure. For example, suppose Figure 4.4(i) is the original HSS tree for an HSS matrix, and we insert a new node between node 2 and node 4 to get a new matrix

with a tree structure as in Figure 4.4(ii) or Figure 4.4(iii). We need only to update a few generators (shown in dark in Figure 4.4). Again, we can consider the connection changes between nodes and use QR factorizations to find the new generators. The details are similar to those in the previous subsection. Note that in Figure 4.4, the HSS tree can be more general and that the node to be inserted can also represent another HSS tree. In all cases, we need only to update a few generators to get the new matrix. Thus the overall process is fast.

4.3. HSS structured extend-add. In this subsection, we use the previous basic HSS operations to build the structured extend-add process, as outlined in subsection 3.2. We consider a general element and its two child elements, as shown in Figure 2.3(ii) or 2.4, where i is the pivot separator in the assembly tree. The frontal and update matrices in the extend-add

F_i = F_i^0 ⊕ U_{c_1} ⊕ U_{c_2} ≡ F_i^0 + Û_{c_1} + Û_{c_2}

have the relationship shown in Figure 3.2, where all matrices should now be in HSS forms. The general HSS extend-add procedure is as follows. Assume that, before the extend-add, the frontal matrices F_{c_1} and F_{c_2} for the left and right child elements, respectively, are already in HSS forms. (These HSS forms come recursively from lower level separators or from simple constructions at the starting level of structured factorizations.) For simplicity, assume each separator is represented by one leaf of the HSS tree, although each leaf can potentially be a subtree. There are then five leaf nodes in each HSS tree. As we use full binary HSS trees, it is natural to use trees as shown in the first row of Figure 4.5. A tree like that has the minimum depth among all full binary trees. Note that the separators are ordered with the uniform ordering. Following the uniform ordering of the separator pieces i, p_1, p_2, p_3, p_4, the HSS tree of F_i has the form in Figure 4.6. Therefore, we should transform the tree structures in the first row of Figure 4.5 to the structure in Figure 4.6 (also see Table 3.4).
Figure 4.5 shows the process of generating Û_{c_1} and Û_{c_2} from F_{c_1} and F_{c_2}, respectively. The HSS trees of F_{c_1} and F_{c_2} are shown in the first row of Figure 4.5, with their leaf nodes marked by the separators in Figure 2.4. Typically, there are five steps as follows for an extend-add operation to advance from the level of c_1 and c_2 to the level of i (for convenience, we also include the partial factorization of frontal matrices at the beginning):
0. Eliminate c_1 and c_2 and get the update matrices U_{c_1} and U_{c_2} in HSS forms.
1. Permute the trees for U_{c_1} and U_{c_2}, as in the third column of Table 3.4.
2. Insert appropriate zero nodes into the HSS trees of U_{c_1} and U_{c_2} to get Û_{c_1} and Û_{c_2}, respectively; see the fourth column of Table 3.4.
3. Split HSS blocks in Û_{c_1} and Û_{c_2} to handle overlaps in separators.
4. Write the initial frontal matrix F_i^0 in HSS form based on the tree structure of U_{c_1} ⊕ U_{c_2} ≡ Û_{c_1} + Û_{c_2}.
5. Get F_i by adding the HSS matrices F_i^0, Û_{c_1}, Û_{c_2}. Compress F_i when necessary.
Step 0 can be done by applying the generalized HSS Cholesky factorization in subsection 4.1 to the leading principal blocks of F_{c_j} corresponding to separator c_j for j = 1, 2 (first row in Figure 4.5). When separator c_j is removed, the Schur complement/update matrix U_{c_j} is obtained by updating the remaining tree nodes.
Step 1 is to permute some branches of the HSS tree of U_{c_j}. Note that even if the HSS tree has more levels, we still just need to update a few top level nodes, because we need only to permute the four separator pieces. This means the cost of permutations in the superfast multifrontal method is O(1), even if the update matrix has dimension N. Specifically, to permute U_{c_1} (Figure 4.5, left column, from row 1 to row 2), we exchange i and p_2, as shown in the third column of Table 3.4. We can set new

Fig. 4.5. Operations on child elements in the structured extend-add process (3.2): (i) operations on the left child element; (ii) operations on the right child element.

generators (in tilded notation) of the permuted tree to be

\tilde B_5 = B_4^T, \quad \tilde R_8 = B_7^TR_6^T, \quad \tilde B_2 = R_2R_3B_7, \quad \tilde R_2 = R_2B_3, \quad \tilde B_3 = I.

Similarly, to permute U_{c_2} (Figure 4.5, right column, from row 1 to row 2), we exchange p_4 and p_3^R. The new generators of the permuted tree should satisfy

\tilde B_2 = R_2R_3R_4^T, \quad \tilde B_8 = B_7^TR_6^TR_5^T, \quad
\begin{pmatrix} \tilde R_2 \\ \tilde R_4 \end{pmatrix}(\tilde B_3 \ \ \tilde R_8\tilde R_5^T) =
\begin{pmatrix} R_2R_3B_7 & R_2R_3R_5^T \\ R_4R_6B_7 & B_4 \end{pmatrix}.

Fig. 4.6. HSS tree structure of F_i, where the dark bars represent the overlaps p_1^L ∩ p_1^R and p_3^L ∩ p_3^R, respectively.

An SVD of the right-hand side of the last equation can provide all the generators on the left-hand side. Although these formulas look specific, they are sufficient for general extend-add in the superfast multifrontal method.
At step 2, we insert zero blocks into the permuted update matrices to get Û_{c_1} and Û_{c_2}. For the left child element, a zero block (node) is attached to the matrix (tree); see Figure 4.5, left column, row 3. This operation is trivial in that only certain zero generators should be added. For the right child element, a zero node is inserted between p_1^R and p_3^R. This has already been discussed in subsection 4.2.3.
After we get the HSS trees as shown in the last row of Figure 4.5, we use step 3 to handle the overlaps so that Û_{c_1} and Û_{c_2} will have the same HSS tree structure and row/column indices. As discussed in subsection 3.2, overlaps occur in separators p_1 and p_3; see Figures 2.3(ii) and 2.4. As an example, the overlap p_1^L ∩ p_1^R may correspond to HSS blocks in both Û_{c_1} and Û_{c_2} or, more likely, parts of their HSS blocks. For the latter case, in order to match the HSS structures of p_1^L ∩ p_1^R in Û_{c_1} and Û_{c_2}, we need to cut p_1^L ∩ p_1^R from either p_1^L, on the right end, or from p_1^R, on the left end, and to merge it with a nearby zero block. We use the splitting procedure in subsection 4.2.2.
Now, Û_{c_1} and Û_{c_2} have the same HSS tree structure, which becomes the structure of Û_{c_1} + Û_{c_2} and also F_i (Figure 4.6). Then at step 4, we convert the initial frontal matrix F_i^0 into an HSS form following the HSS structure of Û_{c_1} + Û_{c_2}. Because the original matrix A is sparse, F_i^0 is generally sparse also. We are often able to write the HSS form of F_i^0 in advance. For example, for Model Problem 1.1 with nested dissection ordering, F_i^0 in (2.1) has a sparsity pattern where A_{ii} is tridiagonal, A_{ip_2} = 0, A_{ip_4} = 0, and each of A_{ip_1} and A_{ip_3} has only one nonzero entry.
Such a matrix has HSS rank 2. Now at step 5, we are ready to get F_i by computing the HSS sum of F_i^0, Û_{c_1}, and Û_{c_2} with the formulas in [11]. The sizes of the generators of F_i increase after this addition, although the actual HSS rank of F_i will not. Thus, usually the HSS addition is followed by a compression step [11]. Now F_i is in compact HSS form, and we can continue the factorization along the elimination tree.

4.4. Algorithm and performance. Based on the previous discussions, we present the main superfast multifrontal algorithm and its analysis. Before that, we first clarify a few implementation issues for the HSS operations.

4.4.1. Implementation issues. One issue is related to the mesh boundary. For convenience, we can assume the mesh boundary corresponds to empty separators. We may then have empty nodes in HSS trees. Empty nodes do not accumulate or change and are not associated with any actual operation.
Another issue is to predetermine the HSS structures before the actual factorizations. Similar to other sparse direct solvers, we can have a symbolic factorization stage which is used after nested dissection to approximately predict the HSS structures of the frontal/update matrices in the elimination.
Finally, for the purpose of computational performance, we usually avoid too large or too small HSS block sizes.

4.4.2. Factorization algorithm. We provide the superfast multifrontal method and analyze its performance.
Algorithm 4.1 (superfast multifrontal method with HSS structures).
1. Use nested dissection to order the nodes in the n × n mesh. Build an elimination tree with separator ordering. Assume the total number of separators to be k and the total number of levels to be l = log_2 n.
2. Decide l_0, the number of bottom levels of traditional factorizations (see Theorem 4.2 below).
3. For separators i = 1, ..., k:
   (a) If separator i is at level l_i > l − l_0, do traditional Cholesky factorization and extend-add.
       i. If i is a leaf node in the elimination tree, obtain the frontal matrix F_i from A and compute U_i as in (2.1). Push U_i onto an update matrix stack.
       ii. Otherwise, pop two update matrices U_{c_1} and U_{c_2} from the update matrix stack, where c_1 and c_2 are the children of i. Use extend-add to form the frontal matrix F_i as in (2.2). Factorize F_i and get U_i as in (2.2). Push U_i onto the update matrix stack.
   (b) If separator i is at the switching level l_i = l − l_0:
       i. Following step 3(a), build F_i, factorize its pivot block, and get U_i.
       ii. Construct a simple HSS form for U_i with few blocks (1, 2, or 4, etc.). Push the HSS form of U_i onto the update matrix stack.
   (c) Otherwise (separator i is at level l_i < l − l_0), do structured factorization and extend-add at upper levels.
       i. Pop two HSS matrices U_{c_1} and U_{c_2} from the stack. Use HSS extend-add to form the frontal matrix F_i, as in Figure 3.2.
       ii. Compute the generalized HSS Cholesky factorization of the leading principal blocks of F_i and compute the Schur complement, which is U_i (in HSS form). Push the HSS form of U_i onto the stack.
About the complexity and storage requirement of the algorithm, we have the following theorem.
Theorem 4.2. Assume p is the maximum of all HSS ranks of the frontal and update matrices throughout the multifrontal method. Then the optimal complexity of Algorithm 4.1 is O(pn^2). In this situation, the number of bottom levels of traditional Cholesky factorizations is l_0 = O(log_2 p), the bottom level traditional Cholesky factorizations and upper level structured factorizations take the same amount of work, the storage required for the factors is O(n^2 log_2 p), and the update matrix stack size is O(pn).
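The balance behind Theorem 4.2 can be illustrated with a crude cost model (the constants and the N^3 versus pN per-front costs are simplifications, not the paper's exact operation counts): dense Cholesky on a front of order N costs about N^3 while the structured factorization costs about pN, so the traditional levels should stop once fronts reach order about sqrt(p), which is what gives l_0 = O(log_2 p).

```python
import math

def level_costs(n, p, l0):
    """Toy work model: an n x n mesh has 2^j separators of length about
    n / 2^(j//2) at level j; the bottom l0 levels are factored densely
    (~N^3 per front), the rest in structured form (~p*N per front)."""
    total_levels = 2 * int(math.log2(n))
    dense = struct = 0.0
    for j in range(total_levels):
        count = 2 ** j                      # separators at level j
        N = max(n // 2 ** (j // 2), 1)      # typical front order at level j
        if j >= total_levels - l0:          # bottom l0 levels: traditional
            dense += count * N ** 3
        else:                               # upper levels: structured
            struct += count * p * N
    return dense, struct

n, p = 1024, 16
for l0 in (2, 8, 14):
    print(l0, level_costs(n, p, l0))
```

Pushing the switching level down (larger l_0) shifts work from the structured levels to the dense ones; in this model the two contributions balance when the dense fronts have order about sqrt(p), consistent with the theorem's l_0 = O(log_2 p).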