arxiv: v3 [cs.na] 18 Mar 2015


A Fast Block Low-Rank Dense Solver with Applications to Finite-Element Matrices

AmirHossein Aminfar a,1,*, Sivaram Ambikasaran b,2, Eric Darve c,1

a 496 Lomita Mall, Room 14, Stanford, CA, 94305
b Warren Weaver Hall, Room-115A, 251, Mercer Street, New York, NY 10012
c 496 Lomita Mall, Room 9, Stanford, CA, 94305

arXiv:1403.5337v3 [cs.NA] 18 Mar 2015

Abstract

This article presents a fast solver for the dense frontal matrices that arise from the multifrontal sparse elimination process of 3D elliptic PDEs. The solver relies on the fact that these matrices can be efficiently represented as a hierarchically off-diagonal low-rank (HODLR) matrix. To construct the low-rank approximation of the off-diagonal blocks, we propose a new pseudo-skeleton scheme, the boundary distance low-rank approximation, that picks rows and columns based on the location of their corresponding vertices in the sparse matrix graph. We compare this new low-rank approximation method to the adaptive cross approximation (ACA) algorithm and show that it achieves better speedups, especially for unstructured meshes. Using the HODLR direct solver as a preconditioner (with a low tolerance) to the GMRES iterative scheme, we can reach machine accuracy much faster than a conventional LU solver. Numerical benchmarks are provided for frontal matrices arising from 3D finite element problems corresponding to a wide range of applications.

Keywords: Fast direct solvers, Iterative solvers, Numerical linear algebra, Hierarchically off-diagonal low-rank matrices, Multifrontal elimination, Adaptive cross approximation.

1. Introduction

In many engineering applications, solving large finite element systems is of great significance. Consider the system $Ax = b$ arising from the finite element discretization of an elliptic PDE, where $A \in \mathbb{R}^{N \times N}$ is a sparse matrix with a symmetric pattern. In many practical applications, the matrix $A$ might be ill-conditioned and thus challenging for iterative methods.
On the other hand, conventional direct solver algorithms, while being robust in handling ill-conditioned matrices, are computationally expensive ($O(N^{1.5})$ for 2D meshes and $O(N^2)$ for 3D meshes). Since one of the main bottlenecks in the direct multifrontal solve process is the high computational cost of solving dense frontal matrices, we mainly focus on solving these matrices in this article. Our goal is to build an iterative solver, which utilizes a fast direct solver as a preconditioner for the dense frontal matrices. The direct solver in this scheme acts as a highly accurate preconditioner. This approach combines the advantages of the iterative and direct solve algorithms, i.e., it is fast, accurate and robust in handling ill-conditioned matrices.

To be consistent with our previous work, we adopt the notations used in [3]. We should also mention that $n$ refers to the size of dense matrices and $N$ refers to the size of sparse matrices (e.g., the number of degrees of freedom in a finite-element mesh).

In the next section, we review the previous literature on both dense structured solvers and sparse multifrontal solvers. We then introduce a hierarchical off-diagonal low-rank (from now on abbreviated as HODLR) direct solver in Section 4. In Section 5, we introduce the boundary distance low-rank (BDLR) algorithm as a robust low-rank approximation scheme for representing the off-diagonal blocks of the frontal matrices. Section 6 discusses the application of the iterative solver with a fast HODLR direct solver preconditioner to the sparse multifrontal solve process and demonstrates the solver for a variety of 3D meshes. We also show an application in combination with the FETI-DP method [], which is a family of domain decomposition algorithms to accelerate finite-element analysis on parallel computers. We present the results and numerical benchmarks in Section 7.

* Corresponding author. +1 65-644-764. Email address: aminfar@stanford.edu (AmirHossein Aminfar)
1 Mechanical Engineering Department, Stanford University
2 Courant Institute of Mathematical Sciences, New York University
Preprint submitted to Journal of Computational Physics, March 2015

2. Previous Work

2.1. Fast Direct Solvers for Dense Hierarchical Matrices

Hierarchical matrices are data-sparse representations of a certain class of dense matrices. This representation relies on the fact that these matrices can be sub-divided into a hierarchy of smaller block matrices, and certain sub-blocks (based on the admissibility criterion) can be efficiently represented as a low-rank matrix. We refer the readers to [7, 31, 6, 8, 11, 14, 1] for more details. These matrices were introduced in the context of integral equations [7, 31, 6, 41] arising out of elliptic partial differential equations and potential theory.
Subsequently, it has also been observed that dense fill-ins in finite element matrices [58], radial basis function interpolation [3], kernel density estimation in machine learning, covariance structure in statistical models [15], Bayesian inversion [3, 5, 6], Kalman filtering [43], and Gaussian processes [4] can also be efficiently represented as data-sparse hierarchical matrices. Broadly speaking, these matrices can be grouped into two general categories based on the admissibility criterion: (i) Strong admissibility: sub-blocks that correspond to the interaction between well-separated clusters are low-rank; (ii) Weak admissibility: sub-blocks corresponding to non-overlapping interactions are low-rank. Ambikasaran [1] provides a detailed description of these different hierarchical structures. We review some of the previously developed structured dense solvers for hierarchical matrices and discuss them in relation to our work.

Hackbusch [7, 6] introduced the concept of $\mathcal{H}$-matrices, which are the most general class of hierarchical matrices with the strong admissibility criterion [7, 6, 8, 3, 9, 31, 3, 8, 9, 11]. Contrary to the HODLR matrix structure, in which the off-diagonal blocks are low-rank, in $\mathcal{H}$-matrices the off-diagonal blocks are further decomposed into low-rank and full-rank blocks. Thus, the rank can be kept small. In HODLR, we make a single low-rank approximation for the off-diagonal blocks and the rank is larger as a result. Hence, the HODLR structure makes for a much simpler representation and is often used because of its simplicity compared to the $\mathcal{H}$-matrix structure. Hackbusch [6] suggests a recursive block low-rank factorization scheme for $\mathcal{H}$-matrices. This method is based on the idea that all the dense matrix algebra (matrix multiplication and matrix addition) can be replaced by $\mathcal{H}$-matrix algebra. As a result, the inverse of an $\mathcal{H}$-matrix can also be approximated as an $\mathcal{H}$-matrix itself. This results in a computational complexity of $O(n \log^2 n)$ for an $\mathcal{H}$-matrix factorization.
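The weak-admissibility assumption behind HODLR can be checked numerically on a toy problem. The sketch below is only an illustration: the kernel $1/(1+|x-y|)$, the point set, and the helper name `numerical_rank` are assumptions chosen for the example and do not come from the article. It compares the numerical rank of one off-diagonal block with that of a diagonal block of the same size.

```python
import numpy as np

def numerical_rank(block, tol=1e-10):
    """Count singular values larger than tol times the largest one."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

n = 256
x = np.linspace(0.0, 1.0, n)
# Hypothetical kernel matrix: smooth away from the diagonal, kinked on it.
K = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))

half = n // 2
rank_offdiag = numerical_rank(K[:half, half:])  # interaction of two halves
rank_diag = numerical_rank(K[:half, :half])     # self-interaction block

print("off-diagonal block rank:", rank_offdiag)
print("diagonal block rank:    ", rank_diag)
```

For this kernel the off-diagonal block needs far fewer singular vectors than the diagonal block, which is the property a single low-rank factor per off-diagonal block exploits.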

We note that the approach in this paper is based on the Woodbury matrix identity. It is therefore different from the algorithm in Hackbusch [6], for example. The latter is based on a block LU factorization, while the Woodbury identity reduces the global solve to block diagonal solves followed by a correction update.

The HODLR matrix structure is the most general off-diagonal low-rank structure with weak admissibility. Solvers for this matrix class have a computational cost of $O(n \log^2 n)$. In an HODLR matrix, the off-diagonal low-rank bases do not have a nested structure across different levels [3]. The HSS matrix is an HODLR matrix but, in addition, has a nested off-diagonal low-rank structure. Solvers for the HSS matrices have an $O(n)$ complexity [59, 13].

Martinsson and Rokhlin [49] discuss an $O(n)$ direct solver for boundary integral equations based on the HSS structure. Their method is based on the fact that for a matrix of rank $r$, there exists a well-conditioned column operation, which leaves $r$ columns unchanged and sets the remaining columns to zero. Using this idea, they derive a two-sided compressed factorization of the inverse of the HSS matrix. Their generic algorithm requires $O(n^2)$ operations to construct the inverse. However, they accelerate their algorithm to $O(n \log^{\kappa} n)$ when applied to two-dimensional contour integral equations.

Chandrasekaran et al. [14] present a fast $O(n)$ direct solver for HSS matrices. In their article, they construct an implicit $ULV^H$ factorization of an HSS matrix, where $U$ and $V$ are unitary matrices, $L$ is a lower triangular matrix and $H$ is the transpose conjugate operator. Their method is based on the Woodbury matrix identity and the fact that for a low-rank representation of the form $UBV^H$, where $U$ and $V$ are thin matrices with $r$ columns, there exists a unitary transformation $Q$ in which only the last $r$ rows of $QU$ are nonzero. They use this observation to recursively solve the linear system of equations.
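As a reminder of what "block diagonal solves followed by a correction update" means, here is a minimal sketch of a Woodbury-type solve for $(A + UV^T)x = b$. It assumes that a routine applying $A^{-1}$ is available; the names `woodbury_solve` and `solve_A` are illustrative, not taken from the article.

```python
import numpy as np

def woodbury_solve(solve_A, U, V, b):
    """Solve (A + U V^T) x = b given solve_A(rhs) = A^{-1} rhs."""
    Ainv_b = solve_A(b)                  # independent solve with A
    Ainv_U = solve_A(U)                  # r extra right-hand sides
    # Small r x r system: (I + V^T A^{-1} U) y = V^T A^{-1} b
    S = np.eye(U.shape[1]) + V.T @ Ainv_U
    y = np.linalg.solve(S, V.T @ Ainv_b)
    return Ainv_b - Ainv_U @ y           # low-rank correction update

rng = np.random.default_rng(0)
n, r = 50, 3
diag = rng.uniform(1.0, 2.0, n)          # A is diagonal here, so A^{-1} is cheap
U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((n, r))
b = rng.standard_normal(n)

x = woodbury_solve(lambda rhs: rhs / diag if rhs.ndim == 1 else rhs / diag[:, None],
                   U, V, b)
x_ref = np.linalg.solve(np.diag(diag) + U @ V.T, b)
```

Only solves with $A$ and one small $r \times r$ system are needed; the full matrix $A + UV^T$ is formed here solely to check the result.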
Since this method requires constructing an HSS tree, the authors suggest an algorithm that uses the SVD or the rank revealing QR decomposition, recursively, to construct the HSS tree in $O(n^2)$ time.

Gillman et al. [4] discuss an $O(n)$ algorithm for directly solving integral equations in one-dimensional domains. The algorithm relies on applying the Sherman-Morrison-Woodbury formula (see for example [3]) recursively to an HSS tree structure to achieve $O(r^2 n)$ solve complexity, where $r$ is the rank of the off-diagonal blocks in the HSS matrix. They also describe an $O(r^2 n)$ algorithm for constructing an HSS representation of the matrix resulting from a Nyström discretization of a boundary integral equation.

Ho and Greengard [36] present a fast direct solver for HSS matrices. They use the interpolative decomposition (ID) algorithm (see for example [16]) with random sampling to obtain the low-rank representations of the off-diagonal blocks. The computational complexity of the low-rank approximation algorithm is $O(mn \log r + r^2 n)$ for a matrix $K \in \mathbb{R}^{m \times n}$. After obtaining the hierarchical matrix representation of the original dense matrix, new variables and equations are introduced into the system of equations. Finally, all equations are assembled into an extended sparse matrix and a conventional sparse solver is used to factorize the sparse matrix. This method has a computational complexity of $O(n)$ for both the pre-computation and solution phases for boundary integral equations in 2D, while in 3D, these phases cost $O(n^{1.5})$ and $O(n \log n)$ respectively.

Kong et al. [4] have developed an $O(n^2)$ dense solver for HODLR matrices. Similar to [49], they accelerate their algorithm to $O(n \log^2 n)$ when applied to boundary integral equations. Their method uses the Sherman-Morrison-Woodbury formula to construct a one-sided hierarchical factorization of the inverse of these matrices, in which each factor is a block diagonal matrix. The low-rank approximation scheme in their paper is based on the rank revealing QR algorithm.
The authors use the pivoted Gram-Schmidt algorithm to obtain $r$ orthogonal basis vectors for the low-rank matrix in question. For a matrix $K \in \mathbb{R}^{m \times n}$ with rank $r$, this low-rank approximation scheme requires $O(mnr)$ operations. Then, they use a randomized algorithm from [55] to accelerate their low-rank approximation scheme. This accelerated low-rank approximation algorithm costs $O(mn \log(l) + lnr)$ in the general case, where $r < l < \min(m, n)$.

Ambikasaran and Darve [3] present an $O(n \log^2 n)$ solver for HODLR matrices and an $O(n \log n)$ solver for p-HSS matrices. This approach differs from the approach mentioned in [4] in the fact that, while [4] constructs the inverse, [3] constructs a factorization of the matrix. Each factor in this factorization scheme is a block diagonal matrix with each block being a low-rank perturbation of the identity matrix. The authors then use the Sherman-Morrison-Woodbury formula to invert each block in the block diagonal factors. The article uses the Chebyshev low-rank approximation scheme to factorize the off-diagonal blocks.

As mentioned above, among the hierarchically off-diagonal low-rank matrix structures, solvers for the HSS structure have the lowest computational complexity, $O(r^2 n)$, where $r$ is the rank of approximation. While the HSS structure is attractive, the nested structure makes it more complicated and more difficult to work with, compared to the simpler HODLR structure. Furthermore, the off-diagonal rank increases from root to leaves in the HSS tree, whereas the off-diagonal ranks at each level are independent from each other in the HODLR structure. This often leads to a lower average off-diagonal rank in the HODLR structure compared to HSS.

A point worth mentioning is that the solver discussed in the current article relies on a purely algebraic technique (instead of analytic or geometry based techniques) to construct the low-rank approximation of the off-diagonal blocks. Analytic low-rank approximation techniques like the Chebyshev low-rank approximation, multipole expansions, etc., are only applicable when the matrix definition involves an analytical kernel function.
In this article, we propose a boundary distance low-rank approximation (from now on abbreviated as BDLR), which relies on the underlying sparse matrix graph to choose the desired rows and columns in constructing a low-rank representation. We also compare with the adaptive cross approximation algorithm [51] (from now on abbreviated as ACA), which is also a purely algebraic scheme to construct low-rank approximations of the off-diagonal blocks. Due to its black-box nature, the solver can handle a wide range of dense matrices arising from boundary integral equations, covariance matrices in statistics, frontal matrices arising in the context of finite-element matrices, etc. Table 1 summarizes the dense solver algorithms mentioned above.

| Article | Matrix Class | Factorization | Application |
|---|---|---|---|
| Hackbusch [7, 6] | $\mathcal{H}$ | Recursive block factorization of the matrix | BEM integral operators |
| Martinsson and Rokhlin [49] | HSS | Two sided compressed factorization of the inverse | 2D boundary integral equations |
| Chandrasekaran et al. [14] | HSS | $ULV^H$ factorization of the matrix | Radial basis function matrices |
| Gillman et al. [4] | HSS | Data sparse factorization of the inverse | 1D integral equations with Nyström discretization |
| Ho and Greengard [36] | HSS | Factorization of the extended sparse system | 2D and 3D boundary integral equations |
| Kong et al. [4] | HODLR | One sided hierarchical factorization of the inverse | Boundary integral equations |
| Ambikasaran and Darve [3] | HODLR, p-HSS | Block-diagonal factorization of the matrix | Interpolation using radial basis functions |
| This article | HODLR | Recursive block LU factorization of the matrix | Finite-element matrices |

Table 1: Summary of fast dense structured solvers.

2.2. Fast Direct Solvers for Sparse Matrices

As mentioned in Section 2.1, we are interested in accelerating the direct solve process for finite-element matrices. In this article, we focus on the finite-element discretization of elliptic PDEs. One common way of factorizing such matrices is using the sparse Cholesky factorization. The efficiency of this algorithm strongly depends on the ordering of mesh nodes [54]. Sparse Cholesky factorization takes $O(N^2)$ flops in 2D with a typical row-wise or column-wise mesh ordering, where $N$ is the number of degrees of freedom [58]. The most efficient method for solving such matrices is the multifrontal method with nested dissection [], which takes $O(N^{1.5})$ flops for two-dimensional and $O(N^2)$ for three-dimensional meshes [54].

The multifrontal method was originally introduced by Duff & Reid [19], George [] and Liu [45], as an extension to the frontal method of Irons [38]. In this algorithm, the overall factorization is done by factorizing smaller dense frontal matrices [44]. For each node or super-node in the elimination tree, the frontal matrix is obtained using an update process called the extend-add process, which involves updates from the previously eliminated nodes.

Martinsson [47] uses a spiral elimination approach along with HSS compression of Schur complements to achieve $O(N \log N)$ time complexity. This approach is not based on the multifrontal method and requires a mesh that can be partitioned into concentric annuli.

Gillman et al. [3] proposed an accelerated nested dissection algorithm for obtaining the Dirichlet-to-Neumann operator associated with a 2D elliptic boundary value problem. The authors approximate the Schur complements that appear in the elimination process as hierarchically block separable (HBS) matrices, a structure similar to HSS matrices. Using this matrix structure, they are able to obtain the Dirichlet-to-Neumann operator with a cost of $O(N)$ compared to $O(N^{1.5})$ for the conventional multifrontal method with nested dissection.

There have been some recent efforts to reduce the computational cost of the multifrontal method with nested dissection. Xia et al. [58] observed that frontal and update matrices in the multifrontal elimination process can be approximated with hierarchically semi-separable (HSS) matrices.
The authors develop a structured extend-add process to facilitate the formation of the frontal matrices using the HSS structure. Next, they perform a structured dense Cholesky factorization on the obtained frontal matrix. The authors use the algorithm in [13] to compute the explicit factorizations of HSS matrices. Using this procedure, they are able to achieve nearly linear time complexity for 2D meshes. However, only regular well shaped meshes in 2D are considered in the article. Schmitz et al. [54] extend the approach of [58] to a more general setting of unstructured and adaptive grids in 2D.

Xia [56] introduced an efficient multifrontal factorization for general sparse matrices. The author approximates the frontal matrices using the HSS structure and introduces the concept of reduced HSS matrices that reduce the computational cost of operations on HSS matrices. For simplicity, this approach keeps the update matrices as dense matrices, which leads to high memory consumption for large sparse matrices. Xia [57] introduces a new algorithm that overcomes this deficiency by randomization. That is, instead of passing dense update matrices along the elimination tree, this approach passes a skinny randomized matrix vector product. In addition to saving memory, this approach only requires skinny extend-adds (extend-adds on all rows and only a subset of columns), which leads to improvements in efficiency. This method is based on the work of Martinsson [48], which provides an algorithm for constructing HSS matrices using randomized matrix vector products.

Amestoy et al. [7] introduce a new low-rank matrix format called the Block Low-Rank (BLR) structure, a flat, non-hierarchical block matrix structure, for representing frontal matrices obtained in the multifrontal elimination process. The authors show that BLR is a good alternative to hierarchical structures like $\mathcal{H}$ and HSS matrices in terms of storage costs, flop count and parallelization for representing frontal matrices. The article demonstrates that the BLR format reduces the flop count and storage requirements for factorizing frontal matrices arising from a variety of large matrices coming from different physics applications. However, there is no discussion of the extend-add operations for BLR matrices. Furthermore, the article does not demonstrate a full multifrontal solver based on the BLR frontal matrix representation.

The approach presented here is based on the multifrontal method [44]. It does not require constructing and maintaining HSS trees and can be applied to any mesh structure. Our method is based on the observation that the frontal matrices obtained during the multifrontal elimination process have an HODLR structure. This observation was also made by [58]. In order to factorize (eliminate) these frontal matrices, we present a dense HODLR structured solver. If the rank $r$ is $O(1)$ (that is, a function of $\epsilon$ only), the algorithm has a computational cost of $O(r^2 n \log^2 n)$ for an $n \times n$ frontal matrix. When solving 3D PDEs, we typically have $r = O(n^{1/2})$. In that case, the computational cost is $O(r^2 n)$, where $r$ is the largest rank found, at the top of the tree.
This cost is, in fact, slightly favorable compared to what is reported for HSS in [57] (see Table 4.3, p. 19), at least asymptotically for $n \to \infty$. The logarithmic factor disappears because the ranks decay from the top of the tree to the leaves, so the sum over levels is bounded by a geometric series.

We will benchmark the structured elimination (solve) process for frontal matrices corresponding to separators at various levels of the sparse elimination tree, for many different types of sparse matrices. It is worth mentioning that, contrary to previous works which have mainly benchmarked matrices in the University of Florida Sparse Matrix Collections [17], we focus on frontal matrices arising from large and complicated mesh structures. These matrices are often very ill-conditioned and cannot be solved using traditional iterative techniques like GMRES [5] with diagonal preconditioning. Our benchmarks show that obtaining a good preconditioner for unstructured meshes is significantly harder compared to structured meshes. Furthermore, solving 3D problems is an order of magnitude more difficult than 2D problems, as the off-diagonal rank is significantly higher in 3D. Hence, this article mainly focuses on 3D meshes. Table 2 shows a summary of various fast sparse matrix solvers in the literature.

3. An Iterative Solver with Direct Solver Preconditioning

In this paper, we investigate using a fast HODLR direct solver as a preconditioner to the GMRES [5] iterative scheme. In this case, we use a relatively low accuracy for the direct solver.
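The idea of wrapping an inexact direct solve inside GMRES can be sketched in a few lines. The code below is a dense, unrestarted, left-preconditioned GMRES written only for illustration; it is not the authors' implementation. The function name `gmres`, the test matrix, and the "low-accuracy solver" (here simply the exact inverse of a perturbed copy of $A$, standing in for a low-accuracy HODLR solve) are all assumptions.

```python
import numpy as np

def gmres(apply_A, b, apply_M_inv, tol=1e-12, max_iter=50):
    """Left-preconditioned GMRES without restarts, starting from x0 = 0."""
    r0 = apply_M_inv(b)                       # preconditioned initial residual
    beta = np.linalg.norm(r0)
    n = b.size
    Q = np.zeros((n, max_iter + 1))           # Arnoldi basis
    H = np.zeros((max_iter + 1, max_iter))    # Hessenberg matrix
    Q[:, 0] = r0 / beta
    for k in range(max_iter):
        w = apply_M_inv(apply_A(Q[:, k]))
        for j in range(k + 1):                # modified Gram-Schmidt
            H[j, k] = Q[:, j] @ w
            w = w - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        e1 = np.zeros(k + 2)
        e1[0] = beta
        # Small least-squares problem: min || beta e1 - H y ||
        y, *_ = np.linalg.lstsq(H[:k + 2, :k + 1], e1, rcond=None)
        res = np.linalg.norm(H[:k + 2, :k + 1] @ y - e1)
        if res <= tol * beta or H[k + 1, k] == 0.0:
            break
        Q[:, k + 1] = w / H[k + 1, k]
    return Q[:, :k + 1] @ y, k + 1

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
# Stand-in for a low-accuracy direct solver: exact inverse of a perturbed A.
M_inv = np.linalg.inv(A + 1e-2 * rng.standard_normal((n, n)))

x, iters = gmres(lambda v: A @ v, b, lambda v: M_inv @ v)
print(iters, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

Because the preconditioned operator is close to the identity, only a handful of iterations are needed even though the preconditioner itself is inexact.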

| Article | Methodology | Test Cases & Application |
|---|---|---|
| Martinsson [47] | HSS compression and spiral elimination | Meshes that can be partitioned into concentric annuli |
| Gillman et al. [3] | Approximating Schur complements as HBS matrices and using HBS algebra | 2D elliptic boundary value problems discretized using a 5 point stencil on a regular square grid |
| Xia et al. [58] | HSS approximation of frontal matrices and structured extend-add | 2D structured meshes |
| Schmitz et al. [54] | Modified [58] to accommodate adaptive and unstructured grids | 2D adaptive and unstructured meshes that roughly follow the pattern of a regular mesh |
| Xia [56] | Introduction of reduced HSS matrices that reduce the operation cost on HSS matrices. For simplification, the update matrices are kept as dense matrices | Helmholtz equation in 2D and University of Florida Sparse Matrix Collections [17] |
| Xia [57] | HSS compression using randomization techniques in [48]. Passing randomized matrix vector products instead of dense update matrices and performing skinny extend-add operations | Helmholtz equation in 2D and University of Florida Sparse Matrix Collections [17] |
| Amestoy et al. [7] | BLR format for representing frontal matrices. No discussion of BLR extend-add process | Large matrices coming from different physics applications |

Table 2: Summary of fast sparse direct solvers.

We will show that this approach is much faster than both a conventional LU solver and a high accuracy direct HODLR solver. We should also mention that this preconditioning method can be applied to any iterative solver (e.g., conjugate gradient (CG) [35]).

4. A Fast Direct Solver for HODLR Matrices

One bottleneck of sparse solvers is the factorization of the dense frontal matrices that appear during the multifrontal elimination process. To accelerate the factorization of dense frontal matrices, we approximate them as HODLR matrices. As mentioned in Section 2.1, HODLR matrices can be factorized in $O(n \log^2 n)$, which is a significant improvement over conventional dense factorizations, which typically scale as $O(n^3)$.

4.1. HODLR Matrices

A HODLR matrix has low-rank off-diagonal blocks at multiple levels.
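As described in [3], a 2-level HODLR matrix, $K \in \mathbb{R}^{n \times n}$, can be written as shown in Equation (2):

\[
K = \begin{bmatrix}
K_1^{(1)} & U_1^{(1)} (V_{1,2}^{(1)})^T \\
U_2^{(1)} (V_{2,1}^{(1)})^T & K_2^{(1)}
\end{bmatrix}
\tag{1}
\]

\[
= \begin{bmatrix}
\begin{bmatrix}
K_1^{(2)} & U_1^{(2)} (V_{1,2}^{(2)})^T \\
U_2^{(2)} (V_{2,1}^{(2)})^T & K_2^{(2)}
\end{bmatrix}
& U_1^{(1)} (V_{1,2}^{(1)})^T \\
U_2^{(1)} (V_{2,1}^{(1)})^T &
\begin{bmatrix}
K_3^{(2)} & U_3^{(2)} (V_{3,4}^{(2)})^T \\
U_4^{(2)} (V_{4,3}^{(2)})^T & K_4^{(2)}
\end{bmatrix}
\end{bmatrix}
\tag{2}
\]

where for a $p$-level HODLR matrix, $K_i^{(p)} \in \mathbb{R}^{n/2^p \times n/2^p}$; $U_{2i-1}^{(p)}, U_{2i}^{(p)}, V_{2i-1,2i}^{(p)}, V_{2i,2i-1}^{(p)} \in \mathbb{R}^{n/2^p \times r}$; and $r \ll n$. Further nested compression of the off-diagonal blocks will lead to HSS structures [3].

4.2. Solver Derivation and Algorithm

Contrary to the method introduced by Hackbusch [7], which utilizes sequential block LU factorization, the HODLR direct solve algorithm presented in this section is based on the Woodbury matrix identity (see for example [33, 3]). Although we do not use the formula explicitly, we perform the exact same operations. Looking at Equation (4), our method assumes that both diagonal blocks are nonsingular and factorizes them independently. However, Hackbusch [7] only assumes that the top diagonal block is invertible and factorizes the top diagonal block first. He then constructs the remaining Schur complement and continues on with the factorization. In comparing the two methods, one can see that because of the independent factorization of the diagonal blocks, the method presented in this section is better suited to parallel implementations.

Consider the following linear equation:

\[
Kx = F \tag{3}
\]

where $K \in \mathbb{R}^{n \times n}$ is an HODLR matrix and $x, F \in \mathbb{R}^{n \times s}$. Now let us write $K$ as a one-level HODLR matrix and rewrite Equation (3):

\[
\begin{bmatrix}
K_1^{(1)} & U_1^{(1)} V_{1,2}^{(1)T} \\
U_2^{(1)} V_{2,1}^{(1)T} & K_2^{(1)}
\end{bmatrix}
\begin{bmatrix} x_1^{(1)} \\ x_2^{(1)} \end{bmatrix}
=
\begin{bmatrix} F_1 \\ F_2 \end{bmatrix}
\tag{4}
\]

where $x_i^{(1)}, F_i \in \mathbb{R}^{(n/2) \times s}$. We now introduce two new variables $y_1^{(1)}$ and $y_2^{(1)}$:

\[
y_1^{(1)} = V_{2,1}^{(1)T} x_1^{(1)} \tag{5}
\]
\[
y_2^{(1)} = V_{1,2}^{(1)T} x_2^{(1)} \tag{6}
\]

Rearranging (4), we have:

\[
\underbrace{\begin{bmatrix}
K_1^{(1)} & 0 & 0 & U_1^{(1)} \\
0 & K_2^{(1)} & U_2^{(1)} & 0 \\
-V_{2,1}^{(1)T} & 0 & I & 0 \\
0 & -V_{1,2}^{(1)T} & 0 & I
\end{bmatrix}}_{\tilde{K}}
\underbrace{\begin{bmatrix} x_1^{(1)} \\ x_2^{(1)} \\ y_1^{(1)} \\ y_2^{(1)} \end{bmatrix}}_{\tilde{x}}
=
\underbrace{\begin{bmatrix} F_1 \\ F_2 \\ 0 \\ 0 \end{bmatrix}}_{\tilde{F}}
\tag{7}
\]

We now factorize the top diagonal block of $\tilde{K}$, which consists of $K_1^{(1)}$ and $K_2^{(1)}$. Since this sub-block of $\tilde{K}$ is a block diagonal matrix, we only need to factorize $K_1^{(1)}$ and $K_2^{(1)}$. After eliminating the top off-diagonal block, we are left with the Schur complement:

\[
S^{(1)} = \begin{bmatrix}
I & V_{2,1}^{(1)T} (K_1^{(1)})^{-1} U_1^{(1)} \\
V_{1,2}^{(1)T} (K_2^{(1)})^{-1} U_2^{(1)} & I
\end{bmatrix}
\tag{8}
\]

All we have to do now is to solve the Schur complement system:

\[
S^{(1)} \begin{bmatrix} y_1^{(1)} \\ y_2^{(1)} \end{bmatrix}
=
\begin{bmatrix}
V_{2,1}^{(1)T} (K_1^{(1)})^{-1} F_1 \\
V_{1,2}^{(1)T} (K_2^{(1)})^{-1} F_2
\end{bmatrix}
\tag{9}
\]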

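At this point, we can write $x_1^{(1)}$ and $x_2^{(1)}$ in terms of $(K_1^{(1)})^{-1}$ and $(K_2^{(1)})^{-1}$:

\[
\begin{bmatrix} x_1^{(1)} \\ x_2^{(1)} \end{bmatrix}
=
\begin{bmatrix}
(K_1^{(1)})^{-1} & 0 \\
0 & (K_2^{(1)})^{-1}
\end{bmatrix}
\begin{bmatrix}
F_1 - U_1^{(1)} y_2^{(1)} \\
F_2 - U_2^{(1)} y_1^{(1)}
\end{bmatrix}
\tag{10}
\]

Since both $K_1^{(1)}$ and $K_2^{(1)}$ are HODLR matrices, we can apply the same procedure for factorizing them. Thus, we have arrived at a recursive algorithm for solving (7). The factorization step corresponds to the computation and storage of all the terms that are independent of the right hand side (i.e., the Schur complements at all levels).

4.3. Algorithm Summary

We now summarize the recursive HODLR direct solver algorithm. For a matrix $K \in \mathbb{R}^{n \times n}$, we have to carry out the following procedure at each recursion level $p$:

4.3.1. Factorize

1. Find the low-rank approximation of the off-diagonal blocks ($U_{2i-1}^{(p)}$, $U_{2i}^{(p)}$, $V_{2i-1,2i}^{(p)}$, $V_{2i,2i-1}^{(p)}$).

2. Define $Z_1^{(0)} = \emptyset$. For each level $p$, starting at the top level ($p = 0$), let:

\[
\begin{bmatrix} Z_1^{(p+1)} \\ Z_2^{(p+1)} \end{bmatrix}
=
\begin{bmatrix}
\begin{matrix} U_1^{(p+1)} \\ U_2^{(p+1)} \end{matrix} & Z^{(p)}
\end{bmatrix}
\tag{11}
\]

In the equation above, on the right-hand side, we are vertically concatenating two matrices to form a matrix at level $p + 1$.

3. Recursively solve the following equations:

\[
\begin{bmatrix} d_1^{(p+1)} & c_1^{(p+1)} \end{bmatrix} = (K_1^{(p+1)})^{-1} Z_1^{(p+1)} \tag{12}
\]
\[
\begin{bmatrix} d_2^{(p+1)} & c_2^{(p+1)} \end{bmatrix} = (K_2^{(p+1)})^{-1} Z_2^{(p+1)} \tag{13}
\]

where $d^{(p+1)}$ and $c^{(p+1)}$ correspond to the $U^{(p+1)}$ and $Z^{(p)}$ portions of the right hand sides respectively.

4. Obtain $S^{(p)}$, using Equations (8) and (9):

\[
S^{(p)} = \begin{bmatrix}
I & (V_{2,1}^{(p+1)})^T d_1^{(p+1)} \\
(V_{1,2}^{(p+1)})^T d_2^{(p+1)} & I
\end{bmatrix}
\tag{14}
\]

5. Obtain $d^{(p)}$, $c^{(p)}$ for $p \ge 1$ using:

\[
\begin{bmatrix} d^{(p)} & c^{(p)} \end{bmatrix}
=
\left( I -
\begin{bmatrix} 0 & d_1^{(p+1)} \\ d_2^{(p+1)} & 0 \end{bmatrix}
(S^{(p)})^{-1}
\begin{bmatrix} V_{2,1}^{(p+1)T} & 0 \\ 0 & V_{1,2}^{(p+1)T} \end{bmatrix}
\right)
\begin{bmatrix} c_1^{(p+1)} \\ c_2^{(p+1)} \end{bmatrix}
\tag{15}
\]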
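The recursion in Equations (10) to (15) can be sketched compactly in NumPy. This is a simplified illustration under several assumptions: the matrix is stored densely, the off-diagonal blocks are compressed with a truncated SVD (a stand-in for BDLR or ACA), and factorization and solve are merged into one pass instead of storing $d$ and $S^{(p)}$ for reuse. The names `low_rank` and `hodlr_solve` and the test kernel are hypothetical.

```python
import numpy as np

def low_rank(block, tol):
    """Truncated SVD: block ~= U @ V.T (stand-in for BDLR/ACA)."""
    W, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    return W[:, :r] * s[:r], Vt[:r].T

def hodlr_solve(K, F, leaf=32, tol=1e-12):
    """Solve K X = F by recursing on the two diagonal blocks of K."""
    n = K.shape[0]
    if n <= leaf:
        return np.linalg.solve(K, F)
    h = n // 2
    U1, V12 = low_rank(K[:h, h:], tol)      # K12 ~= U1 V12^T
    U2, V21 = low_rank(K[h:, :h], tol)      # K21 ~= U2 V21^T
    r1, r2 = U1.shape[1], U2.shape[1]
    # Diagonal-block solves with right-hand sides [U, F] (Eqs. 11-13).
    dc1 = hodlr_solve(K[:h, :h], np.hstack([U1, F[:h]]), leaf, tol)
    dc2 = hodlr_solve(K[h:, h:], np.hstack([U2, F[h:]]), leaf, tol)
    d1, c1 = dc1[:, :r1], dc1[:, r1:]
    d2, c2 = dc2[:, :r2], dc2[:, r2:]
    # Schur complement (Eq. 14) for the unknowns y = [y1; y2].
    S = np.block([[np.eye(r2), V21.T @ d1],
                  [V12.T @ d2, np.eye(r1)]])
    y = np.linalg.solve(S, np.vstack([V21.T @ c1, V12.T @ c2]))
    y1, y2 = y[:r2], y[r2:]
    # Correction update (Eqs. 10 and 15).
    return np.vstack([c1 - d1 @ y2, c2 - d2 @ y1])

n = 256
t = np.linspace(0.0, 1.0, n)
# Hypothetical test matrix: smooth kernel plus identity (well conditioned).
K = 1.0 / (1.0 + np.abs(t[:, None] - t[None, :])) + np.eye(n)
F = np.random.default_rng(0).standard_normal((n, 2))
X = hodlr_solve(K, F)
print(np.linalg.norm(K @ X - F))
```

Note how each recursive call carries `r` extra right-hand sides, which is the origin of the $s + r$ argument in the cost recurrence. The two diagonal-block solves are independent of each other, which is what makes the scheme attractive for parallel implementations.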

4.3.2. Solve

1. Define $z_1^{(0)} = F$. For each level $p$, starting at the top level ($p = 0$), let:

\[ \begin{bmatrix} z_{2i-1}^{(p+1)} \\ z_{2i}^{(p+1)} \end{bmatrix} = z_i^{(p)} \tag{16} \]

2. Recursively solve the following equations:

\[ x_{2i-1}^{(p+1)} = (K_{2i-1}^{(p+1)})^{-1} z_{2i-1}^{(p+1)} \tag{17} \]

\[ x_{2i}^{(p+1)} = (K_{2i}^{(p+1)})^{-1} z_{2i}^{(p+1)} \tag{18} \]

3. Obtain $x_i^{(p)}$ for $p \ge 0$ using:

\[
x_i^{(p)} = \left( I - \begin{bmatrix} 0 & d_{2i-1}^{(p+1)} \\ d_{2i}^{(p+1)} & 0 \end{bmatrix} (S_i^{(p)})^{-1}
\begin{bmatrix} V_{2i,2i-1}^{(p+1)T} & 0 \\ 0 & V_{2i-1,2i}^{(p+1)T} \end{bmatrix} \right)
\begin{bmatrix} x_{2i-1}^{(p+1)} \\ x_{2i}^{(p+1)} \end{bmatrix} \tag{19}
\]

Note that $(S_i^{(p)})^{-1}$ was previously computed, so this step is only a series of matrix-matrix products. Hence, the computational cost is small compared to the preceding factorization.

4.4. Solver Computational Cost

Assuming we use a fast ($O(n)$) low-rank approximation scheme, the cost of constructing and storing an HODLR matrix is $O(nr \log(n))$ [3], where $r$ is the rank of approximation. Looking at the procedure described in Section 4.3, we can write the following:

\[ C^{(p)}(r, s, n) = 2\, C^{(p+1)}\!\left(r, s + r, \frac{n}{2}\right) + O(nr^2) + O(nsr) \tag{20} \]

where $C^{(p)}(r, s, n)$ is the computational cost associated with solving an $n \times n$ HODLR matrix at level $p$ with $s$ right-hand sides and off-diagonal blocks of rank $r$. Equation (20) states that the cost of solving a HODLR matrix at level $p$ with $s$ right-hand sides is made up of three contributions. The first contribution is associated with solving the two diagonal blocks at the lower level ($p + 1$) with $s + r$ right-hand sides. The second contribution comes from constructing the Schur complement $S^{(p)}$ (Equation (8)), and the third contribution is the cost of constructing the right-hand side of Equation (9). Writing Equation (20) as a sum, we have:

\[ C^{(0)}(r, s, n) = \sum_{p=1}^{\log(n/r)} O(pnr^2 + nsr) \tag{21} \]

If the off-diagonal rank is constant throughout the levels of the HODLR tree, the computational cost of the algorithm is $O(r^2 n \log^2(n))$ according to Equation (21). However, in many practical cases, the rank decays from root to leaves in the HODLR tree. Assume we can approximate $r$ as $O(n_p^{1/2})$, where $n_p$ is the size of a block at level $p$. Then we have $r_p = O(r_1 / 2^{p/2})$, where $r_1$ is the rank at the top level. According to

Equation (20), the total computational cost involves two sums:

\[
\sum_{p=1}^{\log(n/r)} r_p^2 = O(r_1^2), \qquad
\sum_{p=1}^{\log(n/r)} r_p \Big( s + \sum_{q=0}^{p} r_q \Big)
= O\Big( \sum_{p=1}^{\log(n/r)} r_p (s + r_1) \Big) = O(r_1 (s + r_1))
\]

Note in particular that the inner sum over $q$ in the second expression is $O(r_1)$ instead of $O(r \log n)$. Finally:

\[ C^{(0)}(r, s, n) = O(nr^2) \tag{22} \]

This result shows that in cases where the off-diagonal rank decreases from root to leaves, HODLR solvers can become very efficient and can compete with HSS solvers.

5. Low-Rank Approximation Schemes

In this section, we discuss the various low-rank approximation schemes used for obtaining a low-rank representation of the off-diagonal blocks of the HODLR matrices under consideration. Although a variety of low-rank approximation algorithms (SVD, rank revealing LU, rank revealing QR, randomized algorithms, etc.) are available, we require a scheme that has a computational cost of $O(rn)$, where $r$ is the rank of approximation and $n$ is the size of the matrix. In the context of this work, we cannot use randomized SVD methods since no fast matrix-vector product algorithm applies in our benchmark settings. This limits our choices to methods like Chebyshev, partial pivoting ACA (Section 5.1) and the pseudo-skeleton low-rank approximation algorithm (Section 5.3). Each of these methods has certain drawbacks:

The Chebyshev low-rank approximation algorithm is only suited to cases dealing with the interaction of points via smooth kernels.

The partial pivoting ACA algorithm works well when the leverage score of the matrix [46] is uniform, that is, when all rows and columns have roughly the same importance in constructing the low-rank approximation. However, in cases where certain rows or columns play a special role and are critical to include in the low-rank approximation, ACA might fail to properly identify them, resulting in an inaccurate low-rank approximation.

The accuracy of the pseudo-skeleton low-rank approximation scheme strongly depends on the method used for selecting rows and columns.
In order to construct a fast and robust low-rank approximation scheme, we introduce a method for selecting rows and columns in the pseudo-skeleton low-rank approximation algorithm. We call this new method the boundary distance low-rank approximation scheme (BDLR).

5.1. ACA Low-Rank Approximation

We use the ACA algorithm with partial pivoting as described by Rjasanow [51]. This algorithm is an algebraic low-rank approximation scheme and works on any dense matrix without any prior knowledge of the matrix. Both full pivoting and partial pivoting ACA search the matrix or the remaining Schur complement for the largest entry and use this entry as the pivot. The full pivoting algorithm, similar to rank revealing LU, scans all the matrix entries. Partial pivoting ACA avoids this expensive search by looking at the largest entry in a single row/column at each step. The partial pivoting ACA algorithm has a cost of $O(r(m + n))$ for a matrix $A \in \mathbb{R}^{m \times n}$ [51], where $r$ is the rank of approximation.

5.2. Randomized Algorithms

Randomized algorithms as described by [53, 34, 18, 1] arrive at a low-rank approximation of a matrix $A$ by forming a lower dimensional matrix $Y$, obtained by sampling rows and/or columns of the original matrix or by applying random projections to $A$. They then obtain an orthonormal basis $Q$ for the range of $Y$ and approximate $A$ as:

\[ A \approx QQ^T A \tag{23} \]

For a matrix of size $n \times n$, and without a fast matrix-vector product, these methods have a computational cost of $O(n^2)$. Otherwise, the cost can be brought down to $O(n)$ or $O(n \log n)$.

5.3. Pseudo-Skeleton and Boundary Distance Low-Rank Approximation

In order to construct a fast and accurate solver, we need an accurate and robust method to construct low-rank approximations. As we will show, BDLR is very robust and leads to accurate low-rank approximations. It works well in problems where the matrix can be related to a Green's function. (This is true for all linear PDE problems. Note that the Green's function needs to be smooth, with a singularity at the origin.) In that case, large entries correspond to points close in space, which we associate, as a simplification, with nodes in the graph that are connected by few edges. Although this is a simple heuristic, it worked very well in our examples and allowed us to efficiently form accurate low-rank approximations.
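For comparison with BDLR, the partially pivoted ACA of Section 5.1 can be sketched as follows. This is a generic textbook-style sketch, not the implementation of [51] or of the authors' code: the function name `aca_partial_pivoting`, the stopping rule based on the norm of the latest rank-one term, and the absolute pivot threshold `1e-14` are all assumptions made for illustration. At each step it reads one row and one column of `A`.

```python
import numpy as np


def aca_partial_pivoting(A, tol=1e-10, max_rank=None):
    """Return U (m x k) and V (k x n) with A ~= U @ V."""
    m, n = A.shape
    max_rank = min(m, n) if max_rank is None else max_rank
    U, V = np.zeros((m, 0)), np.zeros((0, n))
    unused = np.ones(m, dtype=bool)   # rows not yet tried as pivot rows
    i = 0                             # first pivot row: an arbitrary choice
    norm_sq = 0.0                     # running value of ||U @ V||_F^2
    while U.shape[1] < max_rank and unused.any():
        unused[i] = False
        row = A[i, :] - U[i, :] @ V            # row i of the residual
        j = int(np.argmax(np.abs(row)))        # pivot: largest entry in that row
        score = np.zeros(m)
        if abs(row[j]) > 1e-14:                # assumed threshold for a usable pivot
            v = row / row[j]
            u = A[:, j] - U @ V[:, j]          # column j of the residual
            # ||UV + u v^T||_F^2 = ||UV||_F^2 + 2 <UV, u v^T> + ||u||^2 ||v||^2
            norm_sq += 2.0 * np.sum((U.T @ u) * (V @ v)) + (u @ u) * (v @ v)
            U, V = np.column_stack([U, u]), np.vstack([V, v])
            if np.linalg.norm(u) * np.linalg.norm(v) <= tol * np.sqrt(norm_sq):
                break
            score = np.abs(u)
        # Next pivot row: largest entry of the new column among unused rows.
        i = int(np.argmax(np.where(unused, score, -1.0)))
    return U, V
```

Because the pivot search only ever looks at the current row and column, a row or column that is critical to the approximation but never selected goes unnoticed, which is the failure mode described in Section 5.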
The BDLR algorithm is a row and column selection algorithm for the pseudo-skeleton low-rank approximation scheme. The pseudo-skeleton algorithm allows us to construct a low-rank approximation of a matrix by choosing a subset of its rows and columns. As mentioned in [5], for a low-rank matrix $A$, we pick a set of row indices ($i \in I = \{i_1, \ldots, i_r\}$) and a set of column indices ($j \in J = \{j_1, \ldots, j_r\}$) and define matrices $C$ and $R$ such that:

\[ R = A(I, :) \tag{24} \]

\[ C = A(:, J) \tag{25} \]

Then, we can approximate $A$ as:

\[ A \approx C \hat{A}^{-1} R \tag{26} \]

where $\hat{A} = A(I, J)$. If $\hat{A}$ is not a square matrix or is rank deficient, the Moore-Penrose pseudoinverse is used in place of $\hat{A}^{-1}$. In order to achieve a certain accuracy, one can increase the number of chosen rows and columns until the desired accuracy is reached. To monitor

the error in the scheme, we pick rows and columns that are not in the set of rows and columns already chosen for the low-rank approximation. We then monitor the relative Frobenius norm error on these rows and columns and increase the rank of the approximation until this error falls below a certain tolerance. For a rank $r$ pseudo-skeleton low-rank approximation, the inversion of $\hat{A}$ has a computational cost of $O(r^3)$. Monitoring the error has a computational cost of $O(mr + nr - r^2)$ for $A \in \mathbb{R}^{m \times n}$. Thus, this method has an asymptotic complexity of $O(nr)$.

As mentioned in Section 1, we are predominantly interested in solving dense frontal matrices arising from the multifrontal elimination process of sparse finite-element matrices. In this case, every frontal matrix has a corresponding sparse matrix, which is a diagonal subblock of the original finite-element matrix. This sparse matrix describes a graph whose vertices are the rows and columns of the dense matrix; the edges of this graph correspond to nonzero entries in the sparse matrix and describe the connections between these points. We use this graph in constructing the low-rank approximation of the off-diagonal blocks.

Entries in dense matrix blocks that correspond to FEM or BEM applications can be related to the inverse of a Green's function. The Green's function is large at short distances and then decays smoothly. Our dense blocks behave similarly. Hence, we want to identify row/column pairs corresponding to large entries. These correspond to nodes in the graph that are close, that is, connected by few edges. Therefore we use the distance between a row vertex in the graph and the column vertex set (e.g., if the vertex corresponds to a row, we consider the distance to the set of vertices associated with the columns, and vice versa) as a criterion to determine whether to pick a row/column or not. For a set of row (column) vertices, we define the boundary vertices as the subset of vertices for which there exists an edge in the interaction graph connecting them to a vertex in the column (row) set.
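The pseudo-skeleton approximation of Equations (24)-(26), together with the error monitoring on held-out rows and columns, can be sketched as follows. This is only an illustration under stated assumptions: the helper names `pseudo_skeleton` and `sampled_relative_error` are hypothetical, the Moore-Penrose pseudoinverse is computed with `numpy.linalg.pinv` rather than the LU-based approach used for BDLR, and entries in the intersection of the held-out rows and columns are counted twice in the error estimate for simplicity.

```python
import numpy as np


def pseudo_skeleton(A, rows, cols):
    """Return C, pinv(A_hat), R so that A ~= C @ pinv(A_hat) @ R."""
    C = A[:, cols]                    # Equation (25): selected columns
    R = A[rows, :]                    # Equation (24): selected rows
    A_hat = A[np.ix_(rows, cols)]     # intersection block A(I, J)
    return C, np.linalg.pinv(A_hat), R


def sampled_relative_error(A, C, A_hat_pinv, R, test_rows, test_cols):
    """Relative Frobenius error measured only on held-out rows and columns."""
    row_err = A[test_rows, :] - C[test_rows, :] @ A_hat_pinv @ R
    col_err = A[:, test_cols] - C @ A_hat_pinv @ R[:, test_cols]
    num = np.linalg.norm(row_err) ** 2 + np.linalg.norm(col_err) ** 2
    den = (np.linalg.norm(A[test_rows, :]) ** 2
           + np.linalg.norm(A[:, test_cols]) ** 2)
    return np.sqrt(num / den)
```

In a rank-increasing loop one would call `pseudo_skeleton` with progressively larger index sets and stop once `sampled_relative_error` drops below the tolerance; how the index sets grow is exactly what the row and column selection rule decides.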
Figure 1(a) shows an example of a matrix which corresponds to the interactions of a set of row points with a set of column points. In this particular example, the blue vertices are the boundary vertices. That is, they are the vertices closest to the boundary between the row and column sets of points. Now that we have defined the boundary nodes, we can assign an index $d$ to every vertex in the row (column) set. This index is defined as the distance of a vertex to the vertices in the boundary set. In order to construct the low-rank approximation, we choose rows and columns based on their $d$ index value. That is, we first choose rows (columns) that are in the boundary set ($d = 0$). We then add rows (columns) at a distance of one from the boundary ($d = 1$). For example, in Figure 1(a), the green nodes are labeled $d = 1$ as they are separated from the blue boundary nodes ($d = 0$) by only one edge. We continue adding points based on the $d$ index until we reach the desired accuracy. Figure 1(b) shows that the BDLR algorithm approximates the interaction of a set of row and column nodes with the interaction of the ones that are closest to the boundary (interaction of blue nodes).

As mentioned above, calculating the pseudo-skeleton low-rank approximation requires the pseudoinverse of $\hat{A}$. For the BDLR algorithm, instead of using the SVD to calculate the pseudoinverse ($\hat{A}^{-1}$), we use a full pivoting LU factorization, which is slightly cheaper:

\[ \hat{A} = P^{-1} L U Q^{-1} \tag{27} \]

where $P$ and $Q$ are permutation matrices. Let $r$ be the rank of $\hat{A}$. Define $\hat{R}$ and $\hat{C}$ as:

\[ \hat{C} = (CQ)(:, 1:r) \, (U(1:r, 1:r))^{-1} \tag{28} \]

\[ \hat{R} = (L(1:r, 1:r))^{-1} \, (PR)(1:r, :) \tag{29} \]

where $C$ and $R$ are the subsets of columns and rows we have picked using the BDLR scheme. We then have:

\[ A \approx \hat{C} \hat{R} \tag{30} \]

The factors $(U(1:r, 1:r))^{-1}$ and $(L(1:r, 1:r))^{-1}$ correspond to lower-triangular solves. The inverse matrices are not explicitly computed.

Figure 1: Classification of vertices based on distance from the other set. (a) Full matrix representation. (b) Low-rank matrix representation. In both panels the vertices of the row set and of the column set are labeled by their distance index $d$.

6. Application for Multifrontal Solve Process

In this section, we demonstrate how our fast dense solver algorithm can be applied to a sparse multifrontal solve process. We will not explain the multifrontal algorithm in detail. For a detailed review of the multifrontal method see [44]. We applied our fast solver as described in Section 3 to a variety of 3D finite-element problems. We investigate frontal matrices at various levels of the sparse matrix elimination tree corresponding to the elasticity equation. We use SCOTCH [5] to do the reordering in the sparse multifrontal solver. Our goal is to apply our fast dense solver to the dense frontal matrices obtained in the multifrontal elimination process of a sparse finite-element matrix, and to speed up the multifrontal algorithm to approximately $O(N^{4/3})$. The results shown in this paper can be viewed as a proof of concept of this idea. We should also mention that the approach presented in this article is fully general. We use SCOTCH [5] (which can partition any

graph) to obtain the separators, and the resulting separators can always be handled by our algorithm without any change.

6.1. Elasticity Equation for a 3D Beam and a Cylinder Head Geometry

We consider the 3D Navier-Cauchy elastostatics equations with a beam geometry (Figure 2(a)):

\[ (\lambda + \mu) \nabla (\nabla \cdot u) + \mu \nabla^2 u + F = 0 \tag{31} \]

where $u$ is the displacement vector and $\lambda$ and $\mu$ are Lamé parameters. For the beam geometry, we use 10-node tetrahedral elements (see for example Section 1. of this document: http://www.colorado.edu/engineering/cas/courses.d/afem.d/afem.ch1.d/afem.ch1.pdf) to discretize the above equation. For the cylinder head geometry, the mesh is composed of 8-node hexahedral, 6-node pentahedral and 4-node tetrahedral solid elements, and also 3-node shell elements. Figures 2(a) and 2(b) show a sample beam and cylinder head geometry, respectively. As can be seen, the meshes are unstructured for both geometries.

Figure 2: 3D unstructured mesh for the beam and cylinder head geometries. (a) Beam. (b) Cylinder Head.

6.2. FETI-DP Solver for a 3D Elasticity Problem

Domain decomposition (DD) methods solve a problem by splitting it into several subdomains. Local problems are solved on each subdomain and a global linear system is used to couple these local solutions into a global solution for the entire problem [1]. FETI methods [, 4] are a family of domain decomposition algorithms with Lagrange multipliers that have been developed for the fast sequential and parallel iterative solution of large-scale systems of equations arising from the finite-element discretization of partial differential equations []. In this article, we consider two sparse local FETI-DP matrices arising from the finite-element discretization of an elasticity problem in three dimensions. The first matrix corresponds to solving the elasticity equation with a structured mesh in three dimensions (Figure 3(a)), while the second matrix corresponds to solving the same problem using the

geometry of an engine in an unstructured mesh (Figure 3(b)). Both matrices correspond to the stiffness matrix of one subdomain of a linear elastic 3D solid finite element model (Equation (31)) of their respective geometry. The discretization for the cube geometry uses 8-node (trilinear) hexahedral elements (see for example Section 11.3 of this online document: http://www.colorado.edu/engineering/cas/courses.d/afem.d/afem.ch11.d/afem.ch11.pdf), while the discretization for the engine geometry uses 10-node tetrahedral elements (see for example Section 1. of this document: http://www.colorado.edu/engineering/cas/courses.d/afem.d/afem.ch1.d/afem.ch1.pdf).

Figure 3: FETI-DP benchmark meshes. Panel (a) shows a structured and panel (b) an unstructured 3D FETI-DP mesh.

7. Numerical Benchmarks

In this section we show some numerical results and benchmarks of our code. As our code uses the Eigen C++ library for matrix manipulations, we use the Eigen direct solvers as benchmark references.

7.1. Elasticity Equation for a 3D Beam and a Cylinder Head Geometry

We apply our solvers to frontal matrices arising from the multifrontal elimination of 3D elastostatics sparse matrices (Figures 2(a), 2(b)). We compare the fast BDLR direct solver and the ACA direct solver as preconditioners to the GMRES iterative scheme. Because of the particular geometry of the beam mesh, all frontal matrices are relatively small ( K) for this particular case. As can be seen in Figure 5, the singular values of a sample frontal matrix off-diagonal block decay rapidly and the block is in fact low-rank. Figures 4(a) and 4(b) show, for each pivot obtained in the full pivoting LU factorization, the distance of its row (column) index from the boundary between the row and column sets of vertices in the interaction graph for the beam problem. As we expected, larger pivots correspond to rows and columns that are closer to the boundary.

Figures 6(a) and 6(b) compare the relative error in approximating the top off-diagonal block using SVD versus the BDLR approximation for the beam and cylinder head geometry, respectively. That is, each point $(x, y)$ in this plot represents the relative error in approximation ($y$) if we wanted a rank ($x$) approximation using one of the low-rank approximation algorithms. This corresponds to choosing the top singular values in the SVD decomposition and choosing rows and columns that are closest to the boundary in the BDLR approximation. As can be seen in the plot, the curves associated with the BDLR scheme have a tolerance ($\epsilon$). This means that after the LU factorization of $\hat{A}$ (see Section 5.3), we only keep rows and columns corresponding to pivots that are larger than $\epsilon$ times the magnitude of the largest pivot. We use this convention for all BDLR approximations in this paper. We can observe that as we decrease $\epsilon$, we obtain a more accurate low-rank representation via the BDLR algorithm for the beam geometry. For the more complicated cylinder head geometry, we see that in order to obtain a good approximation for low values of $\epsilon$, more rows and columns need to be included in the low-rank approximation, which corresponds to a higher depth parameter ($d$) in the BDLR scheme.

Figures 7(a) and 7(b) show a level by level timing of the factorization, solve and low-rank approximation of the BDLR solver applied to sample frontal matrices corresponding to the beam and cylinder head geometries, respectively. As can be seen, the off-diagonal rank decays from root to leaf, which confirms our assumptions in Section 4.4. Figures 8(a) and 8(b) show a detailed convergence analysis and comparison between the BDLR and ACA solvers as preconditioners to the GMRES iterative scheme.

7.2. FETI-DP Solver for a 3D Elasticity Problem

We apply the BDLR and ACA direct solver preconditioners to frontal matrices arising from the multifrontal elimination of local matrices in a FETI-DP solver. We considered two different classes of problems. One corresponds to solving the elasticity equation (Equation (31)) in a cube geometry with a structured mesh. The other corresponds to solving the same equation in an engine geometry with an unstructured mesh.
Figures 4(c) and 4(d) show that the largest pivot values of a sample off-diagonal block of a frontal matrix arising from the cube geometry correspond to rows and columns that are closer to the boundary. Figures 4(e) and 4(f) show that for the unstructured engine mesh, although most large pivots correspond to rows and columns near the boundary, there are some important rows and columns that are not included in the points closest to the boundary. Figure 6(c) shows that the error in the BDLR method is comparable to the SVD (optimal) algorithm for the structured cube problem. Figure 6(d) shows that, similar to Figure 6(b), we need to include more points (rows and columns) in order to achieve an accurate low-rank approximation for $\epsilon = 10^{-1}$. In other words, if there are insufficient rows and columns in the BDLR approximation, the matrix $\hat{A}$ (see Section 5.3) becomes low-rank and results in an LU factorization with very small pivots. These small pivots are the cause of the large relative error, as they become very large when inverted. Figures 8(c) and 8(d) show the convergence rate of various BDLR and ACA direct solver preconditioners for a sample frontal matrix arising from the cube and engine mesh, respectively.

7.3. Summary

Table 3 summarizes the solver timings for the various frontal matrices that we benchmarked. As can be seen, the iterative solve scheme with either a fast BDLR or ACA direct solver preconditioner can reach near machine accuracy much faster than a conventional LU solver

in almost all cases. Furthermore, both BDLR and ACA achieve a relatively good speedup for all cases. However, for very large cases (1.5M structured cube and .3M unstructured cylinder head), one can observe that BDLR achieves a higher speedup than ACA.

One important point to note is that the convergence of both BDLR and ACA depends on the chosen parameters. For ACA, one can get better results by decreasing the tolerance. For BDLR, in order to achieve a given tolerance, one has to increase the depth parameter ($d$). It is possible for BDLR not to converge for a certain combination of tolerance and depth parameter, because the depth and accuracy are related. In particular, the efficiency of the method is sometimes found to degrade if we reduce $\epsilon$ too much without increasing $d$ sufficiently. In that situation we are trying to get a more accurate low-rank approximation, but the pool of sample points is not large enough to provide the desired accuracy. Reducing $\epsilon$ may then, in fact, lead to a degradation of the preconditioner rather than an improvement.

An important advantage of the BDLR algorithm is that the rows and columns required for constructing the low-rank approximation are known a priori based on the structure of the separator graph. As we will demonstrate in a future article, this will allow us to significantly accelerate the extend-add process and to avoid constructing large dense frontal and update matrices, as we will only keep track of the rows and columns required by the BDLR algorithm.
Matrix Size ACA BDLR Speed-up Matrix Mesh Level 1e-1 1e-3 1e-5 1e-1 1e-3 1e-5 LU Type Type Sparse Dense ACA BDLR T I T I T I T I T I T I
1st 1.5M 3K 1.3e 3.85e 7 7.8e 15 1.1e 34.9e 13 6.71e 7 7.9e 5.5 6.51
1st 7.5K 6.99e 141 1.78e1 4.9e1 3 8.3e 3.8e1 9 4.77e1 7.38e1 3.4.86
nd 5.K.9e 77 6.91e 19 1.51e 3 3.3e 17 9.3e 7 1.83e1 6 8.53e 3.7.64
nd 5.K.5e 74 7.8e 19 1.68e1 3.38e 17 9.75e 6.6e1 4 7.4e.96.18
3rd.K.77e-1 45 6.15e-1 9 1.1e 3 3.33e-1 1 7.86e-1 5 1.44e 4 5.41e-1 1.95 1.6
Cube FETI Local 3rd 4K.8K 4.34e-1 61 1.4e1 13.89e1 3 7.17e-1 15 1.76e 7 3.7e 11 1.31e 3.1 1.83
3rd.K.e-1 9 5.84e-1 7 1.7e.91e-1 1 6.47e-1 5 1.1e 4 6.9e-1 3.11.37
3rd.5K 3.95e-1 41 1.9e1 7.61e 5.74e-1 13 1.33e 5.65e 4 1.e.53 1.74
4th.5K 4.65e-1 57 1.9e 13 3.9e 8.83e-1 13.58e 6 4.77e 5 1.e.15 1.13
4th.K 3.6e-1 35 1.4e 7.46e 6.6e-1 1 1.6e 5 3.3e 4 6.5e-1.4 1.7
6th 3.8K 5.7e 61 6.5e 68 4.e 16.e 157.88e 4 3.9e 3 3.4e.77 1.6
Engine 9th 4K.8K 1.31e 48 4.17e-1 6 7.5e-1 3 5.43e-1 8 4.4e-1 6.56e-1 15 1.4e 3.41 3.51
13th.5K 1.34e 48 4.14e-1 6 7.4e-1 3 3.93e-1 54 4.91e-1 18 8.37e-1 17 9.6e-1.31.44
1st 1.9K x x 5.67e-1 13 9.57e-1 4 5.14e-1 63 8.95e-1 14 1.6e 7 4.38e-1.77.85
nd 1.9K 1.31e 358 5.45e-1 7 9.44e-1 4.9e-1 3 8.37e-1 1 1.36e 4 4.5e-1.83 1.1
Beam nd 3K 1.9K x x 4.88e-1 1 9.8e-1 4 4.46e-1 6 7.84e-1 14 1.4e 5 4.1e-1.86.94
Stiffness 3rd 1.9K 6.67e-1 185 4.44e-1 6 9.e-1 3.16e-1 7 8.17e-1 1 1.44e 4 4.3e-1.91 1.7
3rd 1.9K 1.19e 369 4.64e-1 8 9.76e-1 3.84e-1 9 7.64e-1 11 1.48e 4 4.57e-1.98 1.19
5th.3M 4K 3.84e 89 x x 1.5e x x 8.7e1 15 9.5e1 13 8.5e 5.5 9.98
CHead nd 4.8K 4.69e 65 3.81e 4 1.7e1 3.45e 11.97e 9 4.54e 94 6.56e 1.7.68
33K 4th.6K 4.61e-1 88 9.85e-1 3.17e 5 4.16e-1 6 1.3e 18 1.74e 17 1.6e.3.54

Table 3: Summary of solver speed for various benchmark cases. All timings are measured in seconds. The GMRES accuracy and maximum number of iterations were set to 1 1 and 1 respectively for all cases. The letter x indicates that the solver did not converge within 1 iterations.
All LU timings are obtained using Eigen's [39] partial pivoting LU solver. Level indicates the level of the dense frontal matrix in the sparse elimination tree. T and I refer to the total solve time and the number of iterations in the iterative solver, respectively. The iterative solver time is the total solve time for the iterative solver with a fast direct BDLR (ACA) solver preconditioner (low-rank computation, direct solve, iteration, etc.). For BDLR, we used a depth of 1, 3 and 5 for tolerances $10^{-1}$, $10^{-3}$ and $10^{-5}$ respectively. For the 4.8K and 3K cylinder head matrices, the results in the last BDLR column were obtained using a tolerance of $10^{-4}$ and a depth of 1. We have calculated the speedups by comparing the runtime of the conventional LU solver to the lowest runtime for each case.

8. Conclusion and Future Work

To reach our final goal of constructing a fast multifrontal solver, we need to improve the slow dense solves for the frontal matrices, which we demonstrate through various bench-

Figure 4: Row (column) distance versus pivot size for a variety of off-diagonal blocks of sample frontal matrices. Panels: (a) Row Distance (Unstructured Beam), (b) Col Distance (Unstructured Beam), (c) Row Distance (Structured Cube), (d) Col Distance (Structured Cube), (e) Row Distance (Unstructured Engine), (f) Col Distance (Unstructured Engine). Horizontal axis: pivot size (largest to smallest); vertical axis: row (column) distance from boundary. Row (column) distance is the distance corresponding to the row (column) index of a pivot from the boundary as defined in Figure 1(a). This graph shows that large pivots are near the boundary interface, whereas the pivot size decays as we move away. This heuristically justifies our approach with BDLR. a,b) An off-diagonal block of an unstructured beam geometry frontal matrix of size .95K. c,d) An off-diagonal block of a structured cube geometry frontal matrix of size 3.75K. e,f) An off-diagonal block of an unstructured engine geometry frontal matrix of size 1.9K.