Multigrid Reduction in Time for Nonlinear Parabolic Problems


University of Colorado, Boulder. CU Scholar: Applied Mathematics Graduate Theses & Dissertations, Spring 2017.

Recommended Citation: O'Neill, Ben, "Multigrid Reduction in Time for Nonlinear Parabolic Problems" (2017). Applied Mathematics Graduate Theses & Dissertations. This dissertation is brought to you for free and open access by Applied Mathematics at CU Scholar.

Multigrid Reduction in Time for Nonlinear Parabolic Problems

by Ben O'Neill

B.Sc. (Hons), University of Waikato, NZ, 2011
M.S., University of Colorado at Boulder, 2014

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Applied Mathematics, 2017.

This thesis entitled "Multigrid Reduction in Time for Nonlinear Parabolic Problems," written by Ben O'Neill, has been approved for the Department of Applied Mathematics by Prof. Thomas Manteuffel and John Ruge. The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

O'Neill, Ben (Ph.D., Applied Mathematics)
Multigrid Reduction in Time for Nonlinear Parabolic Problems
Thesis directed by Prof. Thomas Manteuffel

The need for parallelism in the time dimension is being driven by changes in computer architectures, where recent performance increases are attributed to greater concurrency rather than faster clock speeds. Sequential time integration limits parallelism to the spatial domain, introducing a parallel scaling bottleneck. Multigrid Reduction in Time (MGRIT) is an iterative parallel-in-time algorithm that permits temporal concurrency by applying time integration on a multilevel hierarchy of temporal grids. The overarching goal of this thesis is to maximize the accuracy per computational cost (APCC) of the MGRIT algorithm in the context of both linear and nonlinear parabolic partial differential equations (PDEs).

The cost of the MGRIT algorithm is directly proportional to the cost of a time-integration step. For a linear problem with implicit time-stepping, each time step equates to solving one linear system. If an optimal spatial solver is used, the work required for a time-step evaluation is independent of the time-step size. For nonlinear problems, each time integration step involves an iterative nonlinear solve, the cost of which often increases with time-step size. This thesis develops a library of MGRIT optimizations, most importantly an alternate initial guess for the nonlinear time-step solver and delayed spatial coarsening, that reduce the cost of the algorithm for nonlinear parabolic problems. This allows nonlinear problems to be solved with parallel scaling behavior comparable to a corresponding linear problem.

An alternative approach towards maximizing the APCC is to increase the accuracy of the method. MGRIT uses multigrid reduction techniques and a multilevel hierarchy of coarse time grids to parallelize the time dimension. Richardson extrapolation (RE) uses a similar multilevel hierarchy of time grids to improve the accuracy of those same time-integration methods. In this thesis we develop the RE-MGRIT algorithm, a non-intrusive parallel-in-time algorithm that uses RE with MGRIT to improve the convergence order of the underlying time integration scheme, all with almost no extra cost when compared to the standard MGRIT algorithm. In addition to increasing the convergence order of the time integration scheme, RE can also be used as a means of temporal error estimation. This thesis introduces and tests a Richardson-based estimate of the local truncation error. When used in conjunction with an adaptive RE-MGRIT based algorithm, these estimates, available for free at each time point, allow for automatic error control without the requirement for an additional temporal error estimation procedure.

Dedication

For Lindsey and Isaac :)

Acknowledgements

I would like to thank Tom Manteuffel, Jacob Schroder and Rob Falgout for all their help throughout my time in Colorado. I would also like to thank everyone at Grandview for all the support along the way.

Contents

Chapter 1: Introduction
  1.1 Thesis Outline
  1.2 Iterative Solvers
    1.2.1 Geometric Multigrid
    1.2.2 The Full Approximation Scheme (FAS)
  1.3 Parallel Time Integration
    1.3.1 Parallel Time Integration with Multigrid
    1.3.2 The Two-level MGRIT Algorithm for Linear Problems
    1.3.3 MGRIT Error Propagation Matrix
    1.3.4 Storage Costs
    1.3.5 Extension to Multiple Levels
    1.3.6 XBraid: A Non-intrusive Parallel-in-Time Package
    1.3.7 Results for Linear Problems

Chapter 2: MGRIT for Nonlinear Parabolic Problems
  2.1 Nonlinear MGRIT with FAS
    2.1.1 Numerical Parameters
  2.2 Establishing Baselines
    2.2.1 Cost Metric
    2.2.2 Sequential Time Integration Baseline
    2.2.3 Naive MGRIT Baseline
  2.3 Efficient MGRIT for the Model Nonlinear Problem
    2.3.1 Solver ID Table
    2.3.2 MGRIT with an Improved Initial Guess for Newton's Method
    2.3.3 MGRIT with Spatial Coarsening
    2.3.4 Combining the Improved Initial Guess and Spatial Coarsening
  2.4 Further Cost Savings
    2.4.1 Skipping Unnecessary Work
    2.4.2 Cheap Initial Iterates
    2.4.3 Setting the Coarsest Grid Size
    2.4.4 Most Effective Improvements
  2.5 Scaling Studies
    2.5.1 Optimal Multigrid Scaling
    2.5.2 Strong Scaling Efficiency of Nonlinear MGRIT
  2.6 Discussion

Chapter 3: MGRIT with Richardson Extrapolation
  3.1 Two-level MGRIT with Richardson Extrapolation
    3.1.1 Convergence Analysis
    3.1.2 Numerical Example: First Order ODE
    3.1.3 Numerical Example: 1D Heat Equation
  3.2 Extension to a Multilevel Algorithm
    3.2.1 Strong Scaling Analysis
  3.3 Adaptive Time Integration with a RE Based Error Estimate
    3.3.1 Local Error Estimation with RE-MGRIT
    3.3.2 Local Embedded Error Estimator
    3.3.3 Model Problem: 2D Heat Equation
    3.3.4 Numerical Results
    3.3.5 Numerical Example: Coupled ODE
  3.4 Discussion

Chapter 4: Conclusion

Bibliography

Tables

- 1.1 Weak scaling study for linear MGRIT and the 2D heat equation
- Baseline solver costs for sequential time integration
- Cost estimates for the naive baseline
- Cost per processor for the naive baseline
- Run-times, storage costs and iteration counts for the various solver options
- Costs associated with solver 5 scaled by the naive baseline
- Costs associated with solver 8 scaled by the naive baseline
- Costs associated with solver 6 scaled by the naive baseline
- Costs associated with solver 10 scaled by those in Table
- Costs associated with solver 12 scaled by those in Table
- Costs associated with solver 14 scaled by the naive baseline
- Weak scaling study: MGRIT iteration counts
- Two-level convergence factors for RE-MGRIT with F-relaxation
- Weak scaling study for two-level RE-MGRIT
- Weak scaling, multilevel MGRIT
- Numerical parameters used for MGRIT applied to the model problem

Figures

- 1.1 Convergence analysis of the weighted Jacobi method for the 1D Poisson equation
- Multigrid V-cycle
- Nested fine and coarse temporal grid
- Sequential time integration is sequential
- MGRIT converges iteratively towards the solution
- F- and C-relaxation
- Space-time domain processor decomposition
- MGRIT V-cycle with and without spatial coarsening
- Strong scaling study for 2D linear diffusion with MGRIT
- Exact solution for the model problem
- Exact nonlinear diffusion coefficient for the model problem
- Asymptotic convergence rates for MGRIT with spatial coarsening
- Strong scaling for the p-Laplacian with 16 processors in space
- Strong scaling study for the p-Laplacian with 32 processors in space
- Time integration with Richardson extrapolation
- Butcher tableaux for SDIRK-1, SDIRK-2 and SDIRK-3
- Convergence bounds for RE-MGRIT with F-relaxation
- Convergence bounds for RE-MGRIT with FCF-relaxation
- 3.5 Error obtained solving equation 3.56 with RE-MGRIT
- RE-MGRIT strong scaling study
- Error and scaling study for the 1D heat equation with RE-MGRIT
- Strong scaling study for multilevel RE-MGRIT
- Accuracy per second for multilevel RE-MGRIT
- An example of the cycle structure for the NI based aRE-MGRIT algorithm
- Example of integer based refinement with MGRIT
- Non-uniform temporal grid hierarchy
- Manufactured solution for the 2D heat equation
- Error and temporal grids after 5 levels of refinement
- Error estimator and temporal grids for different m
- Strong scaling for the aRE-MGRIT algorithm
- Reference solution for equation
- Temporal grid found using the aRE-MGRIT algorithm with r =
- Temporal grid found using aRE-MGRIT with r =
- Temporal grid found using the aRE-MGRIT algorithm with the scaled refinement strategy
- Strong scaling study for aRE-MGRIT
- APS for each refinement strategy
- N_t vs. refinement level

Chapter 1

Introduction

Until recently, increasing clock speeds allowed for the speedup of sequential time integration simulations of a fixed size, and allowed simulations to be refined in both space and time without an increase in overall wall-clock time. However, increases in clock speed have stagnated, leading to a sequential time integration bottleneck. Future speedups will be available from more concurrency, and hence, speedups for time marching simulations must also come from increased concurrency.

By allowing for parallelism in time, much greater computational resources can be brought to bear and overall speedups can be achieved. Because of this, interest in parallel-in-time methods has grown over the last decade. Perhaps the most well known parallel-in-time algorithm, Parareal [36], is equivalent [23] to a two-level multigrid scheme. This work focuses on the multigrid reduction in time (MGRIT) method [17]. MGRIT is a true multilevel algorithm with optimal parallel communication behavior, as opposed to a two-level scheme, where the size of the coarse level limits concurrency. MGRIT solves, in parallel, a general first-order ordinary differential equation (ODE) and the corresponding time discretization:

    u_t = f(u, t), \quad u(0) = u_0, \quad t \in [0, T],    (1.1)

    u(t + \delta t) = \Phi(u(t), u(t + \delta t), \delta t) + g(t + \delta t),    (1.2)

where the operator Φ represents the chosen time-stepping routine and g is a time-dependent function that incorporates all the solution-independent terms. Based on the chosen time integration scheme, the operator Φ could represent a matrix-vector multiplication (e.g., forward Euler), a spatial inverse (e.g., backward Euler), or some combination of the two (e.g., Crank-Nicolson).
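To make the role of Φ concrete, the following sketch (in Python, for a scalar model ODE with backward Euler; the names and the forcing term are illustrative assumptions, not code from this thesis) advances equation (1.1) using the discretization (1.2):

import numpy as np

# Minimal sketch: sequential time stepping u_{j+1} = Phi(u_j) + g for the
# scalar test ODE u' = lam*u + b(t), using backward Euler.
lam, T, Nt = -1.0, 4.0, 64
dt = T / Nt

def Phi(u, t_new):
    # Backward Euler: (1 - dt*lam) u_{j+1} = u_j + dt*b(t_{j+1})
    return (u + dt * np.sin(t_new)) / (1.0 - dt * lam)

u = np.empty(Nt + 1)
u[0] = 1.0                              # initial condition u(0) = u_0
for j in range(Nt):                     # inherently sequential: each step
    u[j + 1] = Phi(u[j], (j + 1) * dt)  # needs the previous one

Sequential time marching applies Φ exactly once per time step; MGRIT, described below, applies Φ more often, but concurrently across the temporal domain.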

Sequential time marching schemes are optimal in that they move from time t = 0 to t = T using the fewest possible applications of Φ. By applying Φ iteratively, in comparably expensive but highly parallel multigrid cycles, MGRIT sacrifices additional computation for temporal concurrency. This creates a crossover point wherein the added concurrency overcomes the extra computational work. Beyond this crossover point MGRIT provides a speedup over sequential methods.

1.1 Thesis Outline

The overarching goal of this thesis is to maximize the accuracy per computational cost (APCC) of the MGRIT algorithm in the context of both linear and nonlinear parabolic partial differential equations (PDEs). This will be addressed in two chapters, the first concerning the cost of the MGRIT algorithm and the second, the accuracy.

MGRIT has been shown to be effective for linear problems, with speedups of up to 50x over sequential time integration seen in some cases (see Section 1.3.7). The application of MGRIT to the nonlinear compressible Navier-Stokes equations was considered in [18]; however, the results were somewhat underwhelming. The key difficulty is that the nonlinear time-step solver becomes progressively more expensive on coarser time levels. For the most part, this occurs because the nonlinear problem is harder to solve on coarse levels and because we lack an accurate coarse grid initial guess. In Chapter 2, based on a detailed case study of the application of MGRIT to the nonlinear p-Laplacian equation, we develop a library of MGRIT modifications for solving nonlinear parabolic problems. Each of these optimizations, most importantly an alternate initial guess for the nonlinear time-step solver and delayed spatial coarsening, is designed to reduce the average cost of a nonlinear time integration step with a focus on the coarsest grids. When applied to the model nonlinear problem, these optimizations combine to give an MGRIT algorithm with an efficiency similar to that of the equivalent linear problem.

In the third chapter, Richardson extrapolation (RE) is investigated, in the context of MGRIT, as a means of improving the accuracy of the algorithm. Like most multigrid methods, the MGRIT algorithm uses a coarse grid error correction to accelerate the convergence of a fine grid iterative solver. This is similar to Richardson based time integration, a non-intrusive algorithm that uses coarse grid approximations to improve the convergence order of the underlying time-integration scheme. The non-intrusive nature of the MGRIT algorithm makes it easy to add temporal concurrency to Richardson based time integration schemes; however, this leads to an inefficient implementation that does not fully exploit the similarities between the two methods. The algorithm developed in Chapter 3 capitalizes on those similarities, resulting in an efficient, non-intrusive algorithm that permits large scale temporal concurrency while also improving the accuracy of the solution. In addition to increasing the convergence order of the underlying time integration scheme, RE also provides, at no extra cost, a temporal error estimate at each fine grid time point. When used in conjunction with full multigrid (FMG), this error estimate allows for automatic local error control through adaptive temporal mesh refinement.

As discussed above, the novel contributions of this thesis begin in Chapter 2. The remainder of this chapter is devoted to providing an introduction to iterative solvers, multigrid, parallel time integration and the MGRIT algorithm.

1.2 Iterative Solvers

Consider the system Ax = b, where A is a linear operator, x is the (unknown) solution and b is some known right-hand-side vector. In the absence of roundoff errors, a direct solver (e.g., Gaussian elimination) can be used to obtain the exact solution in a finite sequence of operations; however, for the large, sparse and structured systems that generally result from the discretization of parabolic PDEs, this approach is often cost prohibitive [47].

In contrast, iterative methods attempt to approximate x using a cheap iterative scheme that, given an initial guess, x_0, converges towards the true solution. Of particular interest here are the stationary iterative methods. Under the right conditions, methods of this type, including the Jacobi, Gauss-Seidel and successive over-relaxation methods, can dramatically reduce the overall time to solution when compared to a direct solver.

Define e_k = x - x_k to be the error associated with the current approximation to the solution, x_k, and let r_k = b - Ax_k be the residual. For a linear operator, the residual, error and true solution are related by the residual equation

    A e_k = r_k.    (1.3)

That is to say, for any approximation x_k, the true solution can be written as x = x_k + A^{-1} r_k. Unfortunately, this reformulation of the problem does not reduce the cost. In fact, the direct calculation of the true solution using this equation is at least as expensive as solving the original system. Stationary iterative methods solve the system using a cheap approximation to A inside a fixed point iterative scheme. This iteration takes the form

    x_{k+1} = x_k + B r_k = (I - BA) x_k + B b,    (1.4)

where B is an approximation to A^{-1} which is inexpensive to apply. Notice that the true solution is a fixed point of this iteration. The iteration may or may not converge based on the choice of B. A convergence criterion for B can be obtained by rewriting equation (1.4) in terms of the error.

Subtracting each side of the iteration from x gives

    e_{k+1} = e_k - B r_k                (1.5)
            = e_k - B(b - A x_k)         (1.6)
            = e_k - BA(x - x_k)          (1.7)
            = (I - BA) e_k               (1.8)
            = (I - BA)^{k+1} e_0,        (1.9)

where the third equality uses the fact that Ax - b = 0. The matrix E = (I - BA) is known as the error propagation matrix. The iterative method is guaranteed to converge if ||E|| < 1. Moreover, the rate of convergence is bounded by the norm of the error propagation matrix and is asymptotically equal to the spectral radius. One example of a stationary iterative method is the weighted Jacobi method:

    x_{k+1} = \omega D^{-1} (b - R x_k) + (1 - \omega) x_k,

where D = diag(A), R = A - D, and ω is some weighting factor chosen based on the properties of A.

[Figure 1.1: Convergence of the weighted Jacobi method applied to the 1D Poisson equation (ω = 2/3). (a) Error vs. time. (b) Convergence rate.]

The Jacobi method is known as a relaxation method (or smoother) because it smooths out high frequency errors.
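The smoothing behavior summarized in Figure 1.1 is easy to reproduce. The sketch below (a minimal illustration, assuming the 1D Poisson model problem described next, with homogeneous Dirichlet boundary conditions and zero right-hand side so that the iterate equals the error) applies weighted Jacobi with ω = 2/3 to a low-frequency and a high-frequency error mode:

import numpy as np

n, omega = 63, 2.0 / 3.0
h = 1.0 / (n + 1)
# 1D Poisson: A = (1/h^2) tridiag(-1, 2, -1), with b = 0 so that the
# exact solution is zero and the iterate x is exactly the error.
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
d = np.diag(A)                           # Jacobi diagonal
grid = np.linspace(h, 1.0 - h, n)

for k in (1, 20):                        # low- and high-frequency modes
    x = np.sin(k * np.pi * grid)         # initial error mode
    for _ in range(50):                  # weighted Jacobi sweeps
        x = x + omega * (-A @ x) / d     # r = b - Ax = -Ax
    print(k, np.linalg.norm(x))          # k = 20 is damped far faster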

As an example, Figure 1.1 shows the local and global errors remaining after the weighted Jacobi iteration (ω = 2/3) was used to solve the linear system associated with a central differencing discretization of the one-dimensional Poisson equation:

    -u_{xx} = 0, \quad x \in [0, 1],    (1.10)
    u(0) = u(1) = 0.    (1.11)

The takeaway from this figure is that the Jacobi method smooths high frequency errors quickly, but struggles to account for the low frequency errors. Another well known relaxation method is the Gauss-Seidel method. Multigrid methods use a multilevel hierarchy of grids and a coarse grid error correction to accelerate the convergence of relaxation schemes.

1.2.1 Geometric Multigrid

As shown above, the relaxation process stagnates once the error is smooth. At this point, continuing to apply relaxation steps is ineffective. Multigrid methods use cheap coarse grid error corrections to accelerate the convergence of relaxation schemes. Recall that, after k iterations of an iterative relaxation scheme, the error is given by e_k = A^{-1} r_k. Let R and P denote spatial restriction and prolongation operators, and let r^c_\nu = R(r_\nu) denote the fine grid residual restricted to the coarse grid. A cheap coarse grid approximation of the fine grid error can be obtained by solving the coarse grid residual equation:

    A_c e^c_\nu = r^c_\nu,

where A_c is a coarse grid representation of A. For the purposes of this introduction, the coarse grid operator is defined to be A_c = RAP with R = P^T, as discussed in [5]. All going well, the fine grid error correction,

    e^f_\nu = P e^c_\nu = P(A_c^{-1} r^c_\nu),

is a good approximation of the error on the fine grid.

In reality, several post-smoothing steps are often required to smooth error introduced through interpolation.

The coarse grid solve can be completed in several ways. For larger problems, the most intuitive approach is to use an iterative scheme. The benefits of this approach are two-fold: first, smooth error appears more oscillatory on coarse grids, so relaxation methods are more effective there, and second, relaxation is cheaper on the coarse grid. However, eventually one runs into the same convergence bottleneck observed on the fine grid. To counteract this, multigrid methods recursively call the two-level algorithm until the coarse grid equation is such that an efficient direct solve can be completed. This recursive algorithm, depicted in Figure 1.2, is known as a multigrid V-cycle.

[Figure 1.2: Geometric multigrid V-cycles combine restriction, interpolation and relaxation to accelerate convergence on the fine grid. Image courtesy of Rob Falgout [32].]

1.2.2 The Full Approximation Scheme (FAS)

Equation (1.3) (i.e., the linear residual equation) does not hold when A is nonlinear. For a nonlinear operator the residual equation is

    A(x_k + e_k) - A(x_k) = r_k,

where we used the fact that A(x) = b and x = x_k + e_k. Let w^c_k = x^c_k + e^c_k. Then, for this residual equation, a cheap coarse grid approximation to the fine grid error is obtained by solving

    A_c(w^c_k) = A_c(x^c_k) + r^c_k,

where, again, A_c is the coarse grid operator. The main difference between linear and nonlinear multigrid is that, in addition to the residual, the user must restrict the current approximation of the fine grid solution to the coarse grid. If A is linear, the FAS multigrid algorithm is equivalent to the algorithm described in the previous section. After solving for w^c_k, it is simple to obtain the error, e^c_k = w^c_k - x^c_k, interpolate it to the fine grid, e^f_k = P(e^c_k), and use it to correct the fine grid solution, x^f_{k+1} = x^f_k + e^f_k. This nonlinear approach, known as Full Approximation Storage (FAS) multigrid, is outlined in Algorithm 1. Here, R and R̂ are separate restriction operators, highlighting the fact that a different restriction operator is often used to map the residual to the coarse grid. Also note that a nonlinear smoother, such as the nonlinear variants of the Jacobi and Gauss-Seidel methods, must be used. For a good introduction to nonlinear iterative methods see [54]. The MGRIT algorithm is implemented using FAS multigrid to allow linear and nonlinear problems to be solved using the same code.

Algorithm 1 FAS(A, x, b, ν)
1: Relax ν times on A(x_k) = b using a nonlinear smoother.
2: Restrict the current approximation and residual to the coarse grid: x^c_k = R x_k, r^c_k = R̂(b - A(x_k)).
3: Solve the coarse grid problem A_c(w^c_k) = A_c(x^c_k) + r^c_k.
4: Compute the coarse grid error approximation: e^c_k = w^c_k - x^c_k.
5: Correct the fine grid approximation: x_{k+1} = x_k + P e^c_k.
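A two-grid version of Algorithm 1 can be written generically. In the sketch below (illustrative only; all operators, the smoother, the transfer functions and the coarse solver are supplied as callables), each line mirrors one step of the algorithm:

def fas_two_grid(A, A_c, x, b, smooth, R, R_hat, P, coarse_solve, nu=1):
    # Two-grid FAS cycle following Algorithm 1.  A and A_c are the
    # (possibly nonlinear) fine- and coarse-grid operators; R and R_hat
    # restrict the approximation and the residual, respectively.
    for _ in range(nu):                   # 1: nonlinear relaxation
        x = smooth(A, x, b)
    xc = R(x)                             # 2: restrict approximation
    rc = R_hat(b - A(x))                  #    and residual
    wc = coarse_solve(A_c, A_c(xc) + rc)  # 3: solve A_c(w) = A_c(xc) + rc
    ec = wc - xc                          # 4: coarse-grid error
    return x + P(ec)                      # 5: correct fine-grid solution

Note that when A is linear, A_c(xc) cancels inside the coarse solve and the cycle reduces to the coarse grid correction of the previous section.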

1.3 Parallel Time Integration

Interest in parallel-in-time methods has grown over the last decade; however, work on parallel-in-time methods actually goes back at least 50 years [44] and includes a variety of approaches. Work regarding direct methods includes [43, 48, 39, 9, 24]. There are iterative approaches as well, based on multiple shooting, domain decomposition, waveform relaxation, and multigrid, including [33, 25, 38, 1, 27, 28, 52, 8, 53, 51, 30, 29, 36, 10, 42, 14, 56, 17]. For a gentle introduction to this history, please see the review paper [21]. No matter the approach, the goal is always the same: introduce concurrency into what is intuitively a sequential process.

Perhaps the most well known parallel-in-time algorithm is the Parareal method, first introduced in 2001 by Lions, Maday and Turinici [36]. This algorithm, which can be interpreted as a two-level multigrid algorithm [23], decomposes the temporal domain into a set of non-overlapping subdomains. Each subproblem is solved, in parallel, using a predictor-corrector scheme. The weakness of the Parareal algorithm is that it involves a large coarse grid solve, the cost of which increases with the number of temporal processors. Given enough time points, this coarse grid sequential solve becomes a new bottleneck.

This work focuses on multigrid approaches (and MGRIT in particular) because of multigrid's optimal algorithmic scaling for both parallel communication and number of operations. In particular, the multilevel approach is attractive because it minimizes the size of the coarse grid sequential bottleneck. An additional attraction of MGRIT is its non-intrusive nature, where the user employs an existing sequential time-stepping routine within the context of the MGRIT implementation. This work uses XBraid [57], an open source implementation of MGRIT developed at Lawrence Livermore National Laboratory (LLNL). Some applications of the MGRIT algorithm include compressible fluid dynamics [18], moving meshes [16], multistep methods [20] and power grid systems [34].

1.3.1 Parallel Time Integration with Multigrid

The MGRIT algorithm for time independent linear problems will now be presented. This presentation first appeared in [17]. The nonlinear extension of MGRIT, as well as a detailed cost analysis and optimization of the algorithm, a key contribution of this thesis, is presented in Chapter 2.

Define a uniform temporal grid with time-step δt and nodes t_j, j = 0, ..., N_t. Non-uniform time grids are discussed in Section 3.2. Further, define a coarse temporal grid with time-step ΔT = mδt and nodes T_j = jΔT, j = 0, 1, ..., N_t/m, for some coarsening factor, m. This is depicted in Figure 1.3. Let u_j be a vector representing the solution at time t = t_j, discretized on some spatial grid. Then, discretized on the time grid shown above, equation (1.2) becomes

    u_{j+1} = \Phi(u_j, \delta t) + g_{j+1},    (1.12)

for j = 0, 1, ..., N_t - 1. Alternatively, this system can be written in block triangular form:

    A u = \begin{bmatrix} I & & & \\ -\Phi & I & & \\ & \ddots & \ddots & \\ & & -\Phi & I \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N_t} \end{bmatrix} = \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{N_t} \end{bmatrix} = g.    (1.13)

[Figure 1.3: Fine- and coarse-grid temporal meshes, with fine-grid spacing δt, coarse-grid spacing ΔT = mδt, and nodes t_0, t_1, ..., t_{N_t} and T_0, T_1, .... Fine-grid points (black) are present on only the fine grid, whereas coarse-grid points (red) are on both the fine and coarse grids.]

Sequential time integration is a block forward solve of this system. This approach allows for spatial parallelism, but must be completed sequentially in time.

[Figure 1.4: Sequential time integration proceeds sequentially across the temporal domain.]

[Figure 1.5: MGRIT converges iteratively towards the solution simultaneously across the temporal domain.]

As the name suggests, the MGRIT algorithm uses multigrid techniques to solve equation (1.13) in parallel. Both sequential time integration and MGRIT are O(N_t) methods, where N_t is the number of time steps on the finest grid [17], but MGRIT is highly concurrent. To solve equation (1.13) with multigrid, one must first define the coarse grid operator, as well as a method of restriction, a method of interpolation, and a method of relaxation (see Section 1.2.1).

MGRIT Coarse Grid Operator

The MGRIT algorithm uses strategies borrowed from multigrid reduction methods to determine the coarse grid operator. The idea behind reduction methods is to successively eliminate unknowns from the system. MGRIT eliminates the F-points from the system based on the following recursion relation:

    u_{mj} = \Phi u_{mj-1} + g_{mj} = \Phi(\Phi u_{mj-2} + g_{mj-1}) + g_{mj} = \cdots    (1.14)
           = \Phi^m u_{m(j-1)} + \hat{g}_{mj}, \quad j = 1, 2, ..., N_t/m,    (1.15)

where \hat{g}_{mj} = g_{mj} + \Phi g_{mj-1} + \cdots + \Phi^{m-1} g_{m(j-1)+1}. Writing this system in block triangular form gives the ideal coarse grid operator:

    A_\Delta u_\Delta = \begin{bmatrix} I & & & \\ -\Phi^m & I & & \\ & \ddots & \ddots & \\ & & -\Phi^m & I \end{bmatrix} \begin{bmatrix} u_{\Delta,0} \\ u_{\Delta,1} \\ \vdots \\ u_{\Delta,N_t/m} \end{bmatrix} = g_\Delta,    (1.16)

where u_{\Delta,j} = u_{mj} and g_\Delta = [\hat{g}_0, \hat{g}_m, ..., \hat{g}_{N_t}]. This is referred to as ideal because the solution of equation (1.16) yields the exact solution at the coarse points. In this setting, and for the remainder of this thesis, the exact solution is the solution one would obtain by solving equation (1.13) with sequential time integration.

The limitation of this exact reduction method is that the coarse-grid problem is, in general, as expensive to solve as the original fine-grid problem (because of the Φ^m evaluations). Multigrid reduction methods address this by approximating A_Δ with B_Δ, where

    B_\Delta = \begin{bmatrix} I & & & \\ -\Phi_\Delta & I & & \\ & \ddots & \ddots & \\ & & -\Phi_\Delta & I \end{bmatrix},    (1.17)

and Φ_Δ is an approximate coarse-grid time-step operator.
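The sense in which (1.16) is ideal can be checked directly. The sketch below (assuming a scalar time-stepping operator and zero forcing, purely for illustration) confirms that one coarse step with Φ^m reproduces the fine-grid solution at every C-point:

import numpy as np

phi, m, Nc = 0.9, 4, 8                       # scalar fine operator, Nt = m*Nc
u_fine = phi ** np.arange(m * Nc + 1)        # sequential fine-grid solution
u_coarse = (phi ** m) ** np.arange(Nc + 1)   # one Phi^m step per interval

# With g = 0, the ideal coarse operator is exact at every C-point.
assert np.allclose(u_fine[::m], u_coarse)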

One obvious choice for defining Φ_Δ is to re-discretize the problem on the coarse grid so that a coarse-grid time-step is roughly as expensive as a fine-grid time-step. For instance, with backward Euler, one simply uses a larger time-step size. Convergence of MGRIT is governed by the quality of the approximation A_Δ ≈ B_Δ, and this choice of using a re-discretization of Φ with ΔT = mδt has proved effective [17, 18]. It is important to note that, while the definition of this algorithm relies upon Φ and Φ_Δ, the internals of these functions need not be known. This is the non-intrusive aspect of MGRIT. The user defines the time-step operator and can use existing codes within the MGRIT framework.

Restriction and Interpolation

Ideal restriction, R, and interpolation, P, are defined as

    R = \begin{bmatrix} I & & & \\ & \Phi^{m-1} \;\cdots\; \Phi \;\; I & & \\ & & \ddots & \\ & & & \Phi^{m-1} \;\cdots\; \Phi \;\; I \end{bmatrix},    (1.18a)

    P^T = \begin{bmatrix} I \;\; \Phi^T \;\cdots\; (\Phi^{m-1})^T & & \\ & \ddots & \\ & & I \;\; \Phi^T \;\cdots\; (\Phi^{m-1})^T \end{bmatrix}.    (1.18b)

27 F- and C- Relaxation MGRIT uses two basic types of relaxation, namely, F- and C- relaxation. Figure 1.6 shows the actions of F- and C-relaxation on a temporal grid with m = 4. F-relaxation propagates the solution forward in time from each coarse point to the neighboring F -points. On each coarse grid interval, F relaxation includes of m 1 time integration steps. For a linear problem with a backward Euler time integration scheme, this constitutes m 1 sequentially linked spatial inverses. For a nonlinear time integration operator, updating each interval of F-points requires m 1 sequentially linked nonlinear solves, each of which will likely require several nonlinear iterations. Most importantly, F relaxation is highly parallel; each interval of F -points can be updated independently during F-relaxation. Note that the sequential nature of F-relaxation places a bound on the number of temporal processors for which we expect to see a speedup. To be precise, the F-relaxation obtains a maximal parallel concurrency when N t /m temporal processors are used. At this point, additional temporal processors only acts to increase communication costs associated with F-relaxation. There may however be some advantage in terms of memory consumption (see Section 1.3.4). C-relaxation is the process of updating the coarse time points. As with F relaxation, each C-point can be updated independently, in parallel. Irregardless of the coarsening factor, each C-point update requires one time step evaluation. In [17], it was shown that F relaxation was not enough to ensure a scalable multi-level algorithm (see Section 1.3.7). To fix this, two composite types of relaxation were introduced. FCF relaxation consists of a F relaxation, followed by a C relaxation, followed by another F relaxation. F-FCF relaxation is a level based relaxation method that uses F relaxation on the finest grid and FCF relaxation on all the coarse levels. These relaxation types are discussed further in Section 1.3.7, but are included here for completeness.

[Figure 1.6: F- and C-relaxation for coarsening by a factor of 4. F-relaxation is used to update the F-points; C-relaxation updates the C-points.]

1.3.2 The Two-level MGRIT Algorithm for Linear Problems

The two-level, linear MGRIT algorithm is outlined in Algorithm 2. In terms of time step evaluations, the two-level MGRIT algorithm is more expensive than standard sequential time integration. However, as shown in Figure 1.7, the MGRIT algorithm allows for parallelism in time. This leads to a crossover point, after which the added concurrency outweighs the extra work. Beyond this point, MGRIT provides a speedup over the corresponding sequential solver.

As an example, consider the case where the cost associated with a time step is fixed across the temporal grid hierarchy. Let L indicate the number of coarse grid levels in the temporal hierarchy. Then, since the costs of restriction and interpolation are equivalent to those of C- and F-relaxation, respectively, the MGRIT algorithm with F-relaxation requires approximately

    2 N_t \sum_{i=0}^{L} \left( \frac{1}{m} \right)^i \le \frac{2 N_t m}{m - 1}

time step evaluations per iteration, proving our claim that MGRIT is an O(N_t) algorithm. Note that the final approximation is obtained by letting L → ∞ and using the formula for an infinite geometric series. Ignoring communication costs, this analysis suggests MGRIT will require about a factor of 2mν/(m - 1) more temporal processors to break even with sequential time integration, where ν is the number of iterations required for MGRIT to converge [17].

Algorithm 2 MGRIT(A, u, g)
1: repeat
2:   Relax the approximate solution, u, using F- or FCF-relaxation.
3:   Compute the fine grid residual, r = g - Au.
4:   Restrict the residual to the coarse grid, r_Δ = R_I r.
5:   Solve the coarse grid equation, B_Δ e_Δ = r_Δ.
6:   Interpolate the error to the fine grid, e = P e_Δ.
7:   Update the fine grid solution, u ← u + e.
8: until the norm of the residual is small enough.

[Figure 1.7: MGRIT allows for parallelism in both space and time. Red boxes indicate processor boundaries.]
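Algorithm 2 can likewise be prototyped for the scalar model problem. The sketch below (illustrative only; it reuses f_relax from the earlier sketch, restricts by injection, and solves the coarse system B_Δ e_Δ = r_Δ by sequential forward substitution) performs two-level MGRIT with F-relaxation and a user-supplied coarse operator phi_c approximating φ^m:

import numpy as np

def mgrit_two_level(u, phi, phi_c, g, m, iters=10):
    # Two-level MGRIT (Algorithm 2) for the scalar model problem
    # u_j = phi*u_{j-1} + g_j, with F-relaxation and injection.
    Nt = len(u) - 1
    for _ in range(iters):
        u = f_relax(u, phi, g, m)                 # relax the F-points
        r = np.zeros(Nt // m + 1)                 # residual at C-points
        for i, c in enumerate(range(m, Nt + 1, m)):
            r[i + 1] = g[c] - (u[c] - phi * u[c - 1])
        e = np.zeros_like(r)                      # solve B_d e = r by
        for i in range(1, len(e)):                # forward substitution
            e[i] = phi_c * e[i - 1] + r[i]
        u[::m] = u[::m] + e                       # correct the C-points
    return f_relax(u, phi, g, m)                  # final F-point update

With phi_c = phi**m the iteration is exact after one cycle, consistent with the exactness of the ideal coarse grid operator; with a re-discretized phi_c it converges iteratively.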

30 17 expanded on in [50]. The Schur complement approach is a non-overlapping domain decomposition method. The idea is to split the domain of interest into a series of non-overlapping sub-domains (i.e., coarse grid intervals), eliminate the interior unknowns ( i.e., the F-points) and solve the resulting problem (i.e, the coarse grid problem). Grouping together the fine (f) and coarse (c) points, one can re-order equation 1.13 as A ff A cf A fc A cc u f u c = g f g c. Assuming A ff is invertible, elimination can be used to remove the A cf term I f 0 A cf A 1 ff I c A ff A cf A fc A cc = A ff Likewise, elimination can also be used to remove the A fc term I f 0 A cf A 1 ff I c A ff A cf A fc A cc A 1 ff A fc I f 0 I c A fc 0 A cc A cf A 1 ff A fc =. A ff 0 0 A cc A cf A 1 ff A fc Multiplication on the left and right by the corresponding inverses gives the LDU (lower,. diagonal, upper) decomposition of A A ff A cf A fc A cc = I f 0 A cf A 1 ff I c A 1 ff A fc A ff 0 I f 0 A cc A cf A 1 ff A fc 0 I c. This decomposition naturally implies operators R and P, known as ideal restriction and interpolation, and S [ R = ] A fc A 1 ff I c, P = A 1 ff A fc I c, S = I f 0. (1.19) Note that these definitions for R and P are equivalent to those given in Equation In

In this notation,

    A_\Delta = RAP = \begin{bmatrix} -A_{cf} A_{ff}^{-1} & I_c \end{bmatrix} \begin{bmatrix} A_{ff} & A_{fc} \\ A_{cf} & A_{cc} \end{bmatrix} \begin{bmatrix} -A_{ff}^{-1} A_{fc} \\ I_c \end{bmatrix}    (1.20)
             = \begin{bmatrix} -A_{cf} A_{ff}^{-1} & I_c \end{bmatrix} \begin{bmatrix} 0 \\ A_{cc} - A_{cf} A_{ff}^{-1} A_{fc} \end{bmatrix}    (1.21)
             = A_{cc} - A_{cf} A_{ff}^{-1} A_{fc}    (1.22)
             = R_I A P,    (1.23)

and

    RAS = \begin{bmatrix} -A_{cf} A_{ff}^{-1} & I_c \end{bmatrix} \begin{bmatrix} A_{ff} & A_{fc} \\ A_{cf} & A_{cc} \end{bmatrix} \begin{bmatrix} I_f \\ 0 \end{bmatrix}    (1.24)
        = \begin{bmatrix} -A_{cf} A_{ff}^{-1} & I_c \end{bmatrix} \begin{bmatrix} A_{ff} \\ A_{cf} \end{bmatrix}    (1.25)
        = -A_{cf} + A_{cf}    (1.26)
        = 0,    (1.27)

and S^T A S = A_{ff}, where R_I = [0, I_c] is the injection operator. The inverse of A is

    A^{-1} = \begin{bmatrix} (S^T A S)^{-1} + A_{ff}^{-1} A_{fc} (RAP)^{-1} A_{cf} A_{ff}^{-1} & -A_{ff}^{-1} A_{fc} (RAP)^{-1} \\ -(RAP)^{-1} A_{cf} A_{ff}^{-1} & (RAP)^{-1} \end{bmatrix} = P (RAP)^{-1} R + S (S^T A S)^{-1} S^T.

Based on these relationships, the multiplicative identity is

    0 = I - A^{-1} A    (1.28)
      = I - P (RAP)^{-1} R A - S (S^T A S)^{-1} S^T A    (1.29)
      = (I - P (RAP)^{-1} R A)(I - S (S^T A S)^{-1} S^T A).    (1.30)

The first term corresponds to the error propagator of the coarse grid correction using the ideal Petrov-Galerkin coarse grid operator, RAP = A_Δ, and the second, to the error propagator for F-relaxation. This identity can be simplified further by noting that

    I - S (S^T A S)^{-1} S^T A = \begin{bmatrix} I_f & 0 \\ 0 & I_c \end{bmatrix} - \begin{bmatrix} A_{ff}^{-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} A_{ff} & A_{fc} \\ A_{cf} & A_{cc} \end{bmatrix}    (1.31)
                               = \begin{bmatrix} I_f & 0 \\ 0 & I_c \end{bmatrix} - \begin{bmatrix} I_f & A_{ff}^{-1} A_{fc} \\ 0 & 0 \end{bmatrix}    (1.32)
                               = \begin{bmatrix} 0 & -A_{ff}^{-1} A_{fc} \\ 0 & I_c \end{bmatrix}    (1.33)
                               = P R_I,    (1.34)

where again, R_I = [0, I_c] is the injection operator. In other words, interpolation is equivalent to F-relaxation. Using this simplification, the multiplicative identity becomes

    0 = I - A^{-1} A    (1.35)
      = (I - P (RAP)^{-1} R A) P R_I.    (1.36)

This identity highlights the fact that two-level MGRIT with the ideal coarse grid operator, A_Δ = RAP, constitutes an exact solve. As discussed above, inverting A_Δ is as expensive as inverting the original fine grid operator, A. Instead, MGRIT approximates A_Δ with the cheaper B_Δ (see equation (1.17)). Making this substitution, the two-grid error propagation operator for MGRIT with F-relaxation is

    (I - P B_\Delta^{-1} R_I A) P R_I = P (I - B_\Delta^{-1} A_\Delta) R_I.    (1.37)

With this substitution, the multiplicative identity is no longer equal to zero, highlighting the fact that MGRIT is in fact an iterative solver.

In [17] it was shown that FCF relaxation was required for a scalable multilevel algorithm. In short, for the problems tested in that paper, the number of iterations required for convergence with multilevel MGRIT and F-relaxation increased with problem size. When FCF relaxation was used, where FCF relaxation indicates F-relaxation, followed by C-relaxation, followed by a second F-relaxation, the iteration counts were bounded independently of the problem size (see Section 1.3.7).

F-relaxation corresponds to setting the residual equal to zero at all F-points, i.e., applying (S^T A S)^{-1} = A_{ff}^{-1}. In contrast, FC relaxation propagates each value at a C-point to the next adjacent C-point, i.e., FC relaxation is propagation with Φ^m. Combined, FCF relaxation is defined as

    P (I - A_{cc}^{-1} (A_{cc} - A_{cf} A_{ff}^{-1} A_{fc})) R_I = P (I - A_{cc}^{-1} (RAP)) R_I    (1.38)
                                                                = P (I - A_\Delta) R_I,    (1.39)

where we use the fact that A_{cc} = I for a single step time integration scheme. Note that this implies FCF relaxation is Jacobi relaxation on the coarse grid with the ideal coarse grid operator. Given this, the error propagation matrix for two-level MGRIT with FCF relaxation is

    (I - P B_\Delta^{-1} R_I A) P (I - A_\Delta) R_I = P (I - B_\Delta^{-1} A_\Delta)(I - A_\Delta) R_I.    (1.40)

MGRIT with Spatial Coarsening

As an extra cost saving measure, one can also choose to coarsen the spatial grid. A detailed investigation into the pros and cons of spatial coarsening with MGRIT is given in Section 2.3.3; however, to summarize, spatial coarsening reduces the cost of the coarse grid solve, leading to large reductions in the runtime per iteration of the MGRIT algorithm. Unfortunately, in many cases, spatial coarsening also leads to a degradation of the MGRIT convergence rate, especially for problems where the time integration operator is anisotropic.

Let P_x and R_x be spatial prolongation and restriction operators at each time instance, and define

    \bar{P}_x = \mathrm{diag}(P_x) = \begin{bmatrix} P_x & & \\ & P_x & \\ & & \ddots \end{bmatrix}, \quad \bar{R}_x = \mathrm{diag}(R_x) = \begin{bmatrix} R_x & & \\ & R_x & \\ & & \ddots \end{bmatrix}.

Note that these are rectangular matrices, so that Φ_Δ is of smaller size compared to Φ. Given this, the two-level error propagation operator for MGRIT with FCF relaxation and spatial coarsening is

    P (I - \bar{P}_x B_\Delta^{-1} \bar{R}_x A_\Delta)(I - A_\Delta) R_I.    (1.41)

The upper bound on the MGRIT convergence factor is given by the norm of this error propagation matrix. The inverse of the lower block triangular coarse grid approximation, B_Δ, is

    (B_\Delta^{-1})_{ij} = \begin{cases} 0 & i < j, \\ \Phi_\Delta^{i-j} & i \ge j, \end{cases} \qquad i, j = 0, ..., N_t/m.    (1.42)

Thus, in matrix format, the coarse grid error propagation operator is

    [E_\Delta]_{ij} = \begin{cases} 0 & i \le j, \\ (I - P_x R_x) \Phi^m & i = j + 1, \\ P_x \Phi_\Delta^{i-j-1} (R_x \Phi^m - \Phi_\Delta R_x) \Phi^m & i > j + 1, \end{cases} \qquad i, j = 0, ..., N_t/m.    (1.43)

Having demonstrated that the above E_Δ is nilpotent, we can state that the method is guaranteed to converge. In fact, convergence is guaranteed in at most N_t/2m iterations for MGRIT with FCF relaxation (N_t/m for MGRIT with F-relaxation). One can obtain an exact bound for the convergence factor; however, this bound is both problem and solver dependent (see Section 3.1.1). Unfortunately, guaranteed convergence does not imply guaranteed speedups. Rather, speedups are obtained when the convergence rate of the method is bounded well below one. An example of this is given in Section 2.3.3, where the dramatic reductions in runtime per iteration associated with full spatial coarsening are outweighed by an unavoidable degradation in the MGRIT convergence rate.
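The nilpotency claim is easy to verify numerically. The sketch below (scalar operators, no spatial coarsening, i.e., P_x = R_x = I, with illustrative values for the fine and coarse operators) assembles E_Δ from equation (1.43) and raises it to a power:

import numpy as np

phi, phi_c, m, Nc = 0.95, 0.8, 4, 6       # fine step, coarse step, Nc = Nt/m
E = np.zeros((Nc + 1, Nc + 1))
for i in range(Nc + 1):
    for j in range(Nc + 1):
        if i > j + 1:                     # entries from equation (1.43);
            E[i, j] = phi_c**(i - j - 1) * (phi**m - phi_c) * phi**m
        # i <= j gives 0, and i = j + 1 gives (I - Px*Rx)*phi^m = 0 here

# E^k is nonzero only for i >= j + 2k, so E^k = 0 once 2k > Nc.
print(np.linalg.matrix_power(E, Nc // 2 + 1))   # the zero matrix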

1.3.4 Storage Costs

The most common concern regarding the MGRIT approach to parallel time integration relates to the associated storage costs. At a minimum, the solution values, u_i, must be stored at each C-point in the temporal hierarchy (the first term in the storage cost model below). In addition, some number, k, of auxiliary vectors must be stored at all points on the coarse levels (the second term below). Let L be the number of temporal levels, p be the number of temporal processors, and s_l be the cost to store a vector on level l. The units of s_l can vary with the implementation. For the purposes of this thesis, the cost to store a vector on level l is given by the number of spatial unknowns present in the spatial grid on that level. The storage cost of MGRIT (per temporal processor) is as follows:

    S_C(k) \approx \sum_{i=0}^{L-1} \frac{N_t}{p\, m^{i+1}} s_i + k \sum_{i=1}^{L-1} \frac{N_t}{p\, m^{i}} s_i.    (1.44)

By default, MGRIT stores two auxiliary vectors on the coarse grids (k = 2): a restricted copy of the fine-grid solution and the right-hand side for the coarse FAS system. However, in Chapter 2 we show how storing one additional auxiliary vector, specifically the most recent solution from the corresponding Φ evaluation, allows the user to dramatically reduce the overall runtime. Equation (1.44) will be used in Chapter 2 to compute a memory multiplier that gives an estimate of the per-processor increase in memory usage when compared to sequential time-stepping. The memory multiplier used is defined as S_C(k)/s_0.
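The storage model is simple to evaluate. The helper below (a sketch, assuming the reconstruction of equation (1.44) above; the grid sizes are illustrative, not taken from the thesis) computes the per-processor cost and the memory multiplier S_C(k)/s_0:

def mgrit_storage(Nt, p, s, m, k=2):
    # Per-processor storage following equation (1.44): solution values at
    # C-points on every level, plus k auxiliary vectors at all points of
    # each coarse level.  s[l] is the number of spatial unknowns on level
    # l, and L = len(s) is the number of temporal levels.
    L = len(s)
    c_points = sum(Nt / (p * m ** (i + 1)) * s[i] for i in range(L))
    auxiliary = k * sum(Nt / (p * m ** i) * s[i] for i in range(1, L))
    return c_points + auxiliary

s = [129 * 129] * 5                      # five levels, no spatial coarsening
print(mgrit_storage(Nt=16384, p=128, s=s, m=4) / s[0])   # memory multiplier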

1.3.5 Extension to Multiple Levels

A variety of cycling strategies are available in multigrid (e.g., V, W, F). All results presented here use the standard V-cycle depicted in Figure 1.8. This corresponds to Algorithm 3 with the Solve step turned into a single recursive call. The recursion ends when a trivially sized grid of, say, 5 time points is reached. At this point a sequential solver is used. On the right in Figure 1.8, a V-cycle with only temporal coarsening is shown. Full spatial coarsening, depicted on the left, fixes the parabolic ratio, δt/h², on all levels. Full spatial coarsening can lead to a degradation of the MGRIT convergence rate. In Section 2.3.3, the effects of this degradation are minimized using a delayed spatial coarsening strategy. This delayed spatial coarsening strategy, that is, not coarsening in space for several temporal levels and then coarsening in space and time keeping the parabolic ratio fixed, is a novel and important contribution of this thesis.

[Figure 1.8: Example multigrid V-cycle with space-time coarsening that holds δt/h² fixed on the left, and time-only coarsening on the right.]

1.3.6 XBraid: A Non-intrusive Parallel-in-Time Package

The MGRIT algorithm is implemented in XBraid [57], an open source package developed at LLNL. The key benefits of the XBraid package are that it:

- Conforms to MGRIT's non-intrusive philosophy, requiring the user to provide an existing time-stepping routine, as well as define a few other basic operations like a state-vector norm and inner product.
- Adds temporal parallelism on top of existing spatial parallelism.
- Returns, up to a prescribed tolerance, the exact sequential solution.

- Accommodates nonlinear problems.
- Allows for both spatial and temporal adaptation.
- Is open source software released under the LGPL 2.1.

For more details, see [17] and [57].

1.3.7 Results for Linear Problems

The MGRIT algorithm and the corresponding XBraid implementation have been used to solve a variety of problems, including compressible fluid dynamics [18], moving meshes [16], multistep methods [20] and power grid systems [34]. In [17], the original MGRIT paper, MGRIT was applied to the 2D linear heat equation

    u_t - \Delta u = b(x, t), \quad x \in \Omega = [0, \pi]^2, \quad t \in [0, T],    (1.45)

subject to the following initial condition and Dirichlet boundary conditions:

    u(x, 0) = \bar{u}(x), \quad x \in \Omega,    (1.46)
    u(x, t) = 0, \quad x \in \partial\Omega, \quad t \in [0, T],    (1.47)

discretized with central finite differences in space and backward Euler in time. A detailed description of this numerical discretization, as well as the numerical parameters used, is given in [17], Section 3.1.

Table 1.1 shows the number of MGRIT V-cycles required for convergence for a variety of problem sizes. The take-home message is that FCF relaxation is required for a scalable multilevel algorithm. Another option, also pursued in [17], is to complete F-relaxation on the fine grid, and FCF relaxation on all coarse grids. This approach also leads to a scalable algorithm.
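To connect Φ to this model problem, the sketch below (a 1D analogue of equation (1.45) for brevity, with zero forcing and homogeneous Dirichlet boundary conditions; the dense solve is purely for illustration, whereas [17] and this thesis use an optimal multigrid solver) implements one backward Euler time step:

import numpy as np

def heat_phi(u, dt, h):
    # One backward Euler step for u_t - u_xx = 0:
    # (I + dt*A) u_{j+1} = u_j, where A is the central difference Laplacian.
    n = len(u)
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    return np.linalg.solve(np.eye(n) + dt * A, u)

n = 31
h = np.pi / (n + 1)
x = np.linspace(h, np.pi - h, n)
u = heat_phi(np.sin(x), dt=0.01, h=h)    # one application of Phi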

[Table 1.1: Number of two-level and multilevel MGRIT V-cycles for solving (1.45) to within a fixed residual tolerance of 10⁻⁹ on (N_x + 1)² × (N_t + 1) space-time grids with factor-2 coarsening, as published in [17], Table 1. Rows: F-relaxation (two-level, multilevel) and FCF-relaxation (two-level and multilevel, on all levels); columns: N_t = 2⁴, 2⁵, 2⁶, 2⁷, 2⁸.]

Figure 1.9 shows a strong scaling study of MGRIT for linear diffusion on the machine Vulcan, an IBM BG/Q machine at Lawrence Livermore National Laboratory (LLNL). The problem size was 129² × 16,385 (space × time). Three data sets are presented: a standard sequential time-stepping run (square markers), a time-only parallel run of MGRIT (circle markers), and a space-time parallel run of MGRIT (triangle markers). The space-time parallel runs used an 8 × 8 processor grid in space, with all additional processors added in time. Both MGRIT curves represent the use of temporal and spatial coarsening, so that the ratio δt/h² is fixed on coarse time grids, where h is the spatial mesh width. The maximum speedup over sequential time integration achieved by the blue curve (triangle markers) is approximately 50, and the crossover point where MGRIT provides a speedup is at about 128 processors in time.

The first nonlinear implementation of MGRIT was discussed in [18] for the compressible Navier-Stokes equations. That report highlighted the need for a detailed investigation into optimizing MGRIT for the solution of nonlinear problems, leading to the work presented in Chapter 2.

[Figure 1.9: Time to solve 2D linear diffusion on a 129² × 16,385 space-time grid using sequential time-stepping and two different processor decompositions of MGRIT. Figure courtesy of [17].]

Chapter 2

MGRIT for Nonlinear Parabolic Problems

The MGRIT algorithm has proven to be very effective for linear problems, with speedups of up to 50x over sequential time integration seen in some cases. When considering the performance of MGRIT, the application of Φ is the dominant cost. For a linear problem with implicit time-stepping, each application of Φ equates to solving one linear system. When an optimal spatial solver such as classical multigrid in space [40, 4, 46, 32] is used, the work required for a time-step evaluation is independent of the time-step size. For nonlinear problems, where each application of Φ involves an iterative nonlinear solve, the cost of a Φ evaluation becomes progressively more expensive on coarser time levels. This occurs because the conditioning of the nonlinear problem and the accuracy of the best available initial guess, usually the solution at the previous time step, become worse as the time step size increases.

Multigrid research has a long history of choosing a representative model problem, studying it at length, and using those results to develop strategies that can be used when solving other problems. Based on a case study of the nonlinear modified p-Laplacian equation, this chapter introduces a set of optimizations that can be used to develop an efficient MGRIT implementation for a wide range of nonlinear problems. Each of these optimizations, most importantly an alternate initial guess for the nonlinear time-step solver and delayed spatial coarsening, is designed to reduce the cost of coarse grid time integration while also balancing user concerns such as overall time-to-solution, memory use, and non-intrusiveness.

The model problem explored in this chapter is the modified p-Laplacian equation:

    u_t(x, t) - \nabla \cdot \left[ (C + |\nabla u(x, t)|^{p-2}) \nabla u(x, t) \right] = b(x, t), \quad x \in \Omega, \; t \in [0, T], \; C \ge 0,    (2.1)

subject to the following Neumann boundary and initial conditions:

    (C + |\nabla u(x, t)|^{p-2}) \nabla u(x, t) \cdot n = g(x, t), \quad x \in \partial\Omega, \; t \in (0, T],    (2.2)
    u(x, 0) = u_0(x), \quad x \in \Omega,    (2.3)

where Ω = [0, 2] × [0, 2] and T = 4s. The weak form of equations (2.1)-(2.3) reads: find u ∈ V such that

    \langle u_t, v \rangle + \langle (C + |\nabla u|^{p-2}) \nabla u, \nabla v \rangle = \langle b, v \rangle + \langle g, v \rangle_{\partial\Omega}, \quad \forall v \in V,    (2.4)

where V = W^{1,p}(\Omega) = \{ u \in L^p(\Omega) : D^\alpha u \in L^p(\Omega) \; \forall |\alpha| \le 1 \}. Let b_{j+1} = b(x, t_{j+1}), g_{j+1} = g(x, t_{j+1}) and u_0 = u(x, 0). Then, a backward Euler discretization of this weak form gives

    \left\langle \frac{u_{j+1} - u_j}{\delta t}, v \right\rangle + \langle (C + |\nabla u_{j+1}|^{p-2}) \nabla u_{j+1}, \nabla v \rangle = \langle b_{j+1}, v \rangle + \langle g_{j+1}, v \rangle_{\partial\Omega}, \quad \forall v \in V,    (2.5)

for j = 0, 1, ..., N_t - 1. Define Ψ(u)(v) and f_j(v) to be

    \Psi(u)(v) = \langle u, v \rangle + \delta t \, \langle (C + |\nabla u|^{p-2}) \nabla u, \nabla v \rangle,    (2.6)
    f_j(v) = \langle u_j + \delta t \, b_{j+1}, v \rangle + \delta t \, \langle g_{j+1}, v \rangle_{\partial\Omega}.    (2.7)

Then, the final nonlinear weak problem is: find u_{j+1} ∈ V such that

    \Psi(u_{j+1})(v) = f_j(v), \quad \forall v \in V, \quad j = 0, 1, 2, ..., N_t - 1.    (2.8)

Assume that each b_k is Lipschitz continuous, that u_0 ∈ V, and that \nabla \cdot (|\nabla u_0|^{p-2} \nabla u_0) \in L^2(\Omega). Then, for each k, the weak form presented in equation (2.8) has a unique solution u_{k+1} ∈ V [55]. Moreover, this result holds for any finite-dimensional subspace of V. Hence, the corresponding discrete weak problem, find u^h_{j+1} ∈ V_h such that

    \Psi(u^h_{j+1})(v_h) = f_j(v_h), \quad \forall v_h \in V_h, \quad j = 0, 1, 2, ..., N_t - 1,    (2.9)

where V_h is an appropriate finite element space, also has a unique solution.

Each time step corresponds to the solution of this nonlinear system (i.e., the inversion of Ψ) in the appropriate finite element space, V_h. Define the Fréchet derivative in the direction w, denoted by Ψ'(u)(v)[w], to be

    \Psi'(u)(v)[w] = \lim_{a \to 0} \frac{\Psi(u + a w)(v) - \Psi(u)(v)}{a}    (2.10)
                   = \langle w, v \rangle + \delta t \, \langle [C + |\nabla u|^{p-2} + (p-2)(\nabla u)(\nabla u)^T] \nabla w, \nabla v \rangle.    (2.11)

Then, for each time step, the corresponding Newton iteration is

    u^{i+1,h}_{j+1} = u^{i,h}_{j+1} - \delta u^{i,h},    (2.12)

where δu^{i,h} is the unique element in V_h such that

    \Psi'(u^{i,h}_{j+1})(v_h)[\delta u^{i,h}] = \Psi(u^{i,h}_{j+1})(v_h) - f_j(v_h),    (2.13)

for every v_h ∈ V_h. The uniqueness and existence of the solution δu^{i,h} ∈ V_h hinges on the fact that C + |\nabla u|^{p-2} + (p-2)(\nabla u)(\nabla u)^T is a 2 × 2 symmetric positive definite (SPD) matrix.

In this chapter, MGRIT is applied to the modified p-Laplacian with p = 4 and C = 0. To facilitate this case study, the forcing function, b(x, t), the initial condition, u_0(x), and the Neumann boundary condition, g(x, t), are all chosen such that the exact solution is

    u(x, y, t) = \sin(\kappa x) \sin(\kappa y) \sin(\tau t),

where κ = π and τ = (2 + 1/6)π. Figure 2.1 shows the exact solution at four discrete time points. It is important to note that, for this choice of parameters, there are points in the space-time domain where ∇u = 0. Figure 2.2 plots |∇u|^{p-2} at four discrete time points for p = 4. In particular, |∇u|² is zero for all (x, y, t) = (n/2, m/2, t) with n, m = 0, 1, 2, 3, 4, and for all (x, y, nπ/τ) with n = 0, 1, ..., Tτ/π. In these regions the model problem with C = 0 is irregular.
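The time-step solver defined by equations (2.12)-(2.13) has the usual Newton structure. The sketch below is a minimal illustration, with generic callables standing in for Ψ and its Fréchet derivative; the thesis assembles these with MFEM and solves the inner linear systems with BoomerAMG, so the dense solve here is only a stand-in:

import numpy as np

def newton_time_step(residual, jacobian, u0, tol=1e-10, max_iters=50):
    # One nonlinear time step: solve Psi(u) = f_j via Newton's method,
    # mirroring equations (2.12)-(2.13).  residual(u) returns Psi(u) - f_j
    # and jacobian(u) the assembled Frechet derivative; u0 is the initial
    # guess, whose quality drives the overall cost.
    u = u0.copy()
    for _ in range(max_iters):
        r = residual(u)
        if np.linalg.norm(r) < tol:           # fixed absolute tolerance
            return u
        du = np.linalg.solve(jacobian(u), r)  # linearized correction
        u = u - du                            # u^{i+1} = u^i - du
    return u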

[Figure 2.1: The exact solution, u(x, y, t), at, from left to right and top to bottom, t = 0, t = 0.4, t = 2.2, and t = 3.0.]

This irregularity can be removed by setting C > 0; however, doing so does not affect the convergence properties of the MGRIT algorithm. This assertion is confirmed by Figure 3.69, which shows that the convergence rate of the MGRIT algorithm without spatial coarsening is independent of the constant C. When spatial coarsening is used, the convergence rate degrades as C approaches zero; however, this occurs because the time integration operator is anisotropic (see Section 2.3.3). Based on those results, we choose to set C = 0 and ignore this irregularity.

[Figure 2.2: The nonlinear diffusion coefficient, |∇u|², at, from left to right and top to bottom, t = 0, t = 0.4, t = 2.2, and t = 3.0.]

To summarize, this chapter will investigate the application of MGRIT to the modified p-Laplacian with C = 0 and p = 4, an equation that is often used as a means of modeling soil erosion and transport [3] and has also found uses in image processing (denoising, segmentation and in-painting) and machine learning (see [13] for an overview and [35] for an introduction). The study begins with a naive application of MGRIT to equation (2.1), where a large increase in the cost of a nonlinear solve (for Newton's method) on the coarser temporal grids is observed. This is caused by the relatively large time-steps (compared to the finest grid) and the associated poor initial guess to the Newton solver on the coarse levels.

These increases counteract the strength of multigrid, where speedup is achieved by using cheap coarse grid problems to accelerate convergence on the fine grid. Therefore, our strategy is to minimize the cost of each nonlinear solve, which is measured with a cost estimate. The desired result is to achieve similar efficiencies for the nonlinear and linear versions of equation (2.1), while also taking into account common user concerns, such as non-intrusiveness and storage costs.

To reduce the average cost of each Φ evaluation, an improved initial guess for each time step is investigated. For sequential time-stepping, a commonly used approach is to take the previous time step as the initial guess. However, in the parallel-in-time setting this approach is flawed because on coarse levels, where the time-step size is large, this is a poor initial guess. The strategy developed here is to use information from previous parallel-in-time iterations as the initial guess. For MGRIT, this entails using the solution from a previous evaluation of Φ as the initial guess for the corresponding nonlinear solve. With this initial guess, the cost of each Newton solve reduces on each subsequent MGRIT iteration (assuming the Newton solver measures nonlinear convergence with a fixed absolute tolerance). However, this also creates a tension between memory usage and computational efficiency, because MGRIT need not store the solution from the previous Φ evaluation at all time points. A more conservative approach is to only store the solution from the previous Φ evaluation at the C-points. In this scenario, the previous time step is used as the initial guess at each F-point. This allows a user with memory constraints to reap the benefits of this improved initial guess, while limiting per-processor storage costs. This reduced storage result and the importance of improved initial guesses for MGRIT, both important contributions of this work, are discussed further in Section 2.3.2.
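The resulting initial-guess policy amounts to a small lookup. The sketch below is illustrative pseudocode (the names are assumptions, not the XBraid API): it returns the stored solution from the previous Φ evaluation when that solution was kept (all points, or C-points only under the reduced storage option), and falls back to the previous time step otherwise:

def newton_initial_guess(t_index, stored, u_prev):
    # stored maps time indices to the solution from the previous Phi
    # evaluation at that point; under the reduced-storage option only
    # C-point indices appear in it.  Fall back to the previous time
    # step, the standard sequential choice, when nothing was stored.
    return stored.get(t_index, u_prev)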

Following this, a spatial coarsening strategy, designed to limit δt/h² on the coarse time grids, is pursued. The goal is to reduce the number of Newton iterations required on coarse levels by reducing the spatial variability of the nonlinearity while also improving the condition number of the inner linear problems. Recognizing this specific importance of spatial coarsening when controlling the cost of nonlinear time-step solvers on coarse time grids is an important contribution of this work. Another important benefit of this approach is that spatial coarsening reduces coarse grid compute times and coarse grid storage costs. Unfortunately, full spatial coarsening, i.e., spatial coarsening on every temporal level, can, in some cases, lead to a degradation of the MGRIT convergence rate. Motivated by the dramatic computational savings, a delayed strategy that begins spatial coarsening on the fourth temporal level is investigated. Spatial coarsening, delayed spatial coarsening, and the degradation of the convergence rate are explored in detail in Section 2.3.3.

Two additional strategies include avoiding unnecessary work on the first MGRIT cycle and optimizing the number of levels in the MGRIT hierarchy. Further speedups can be obtained by loosening the Newton solver tolerance during the first three MGRIT iterations on all levels, where the approximate solution is still poor. These additional strategies are discussed further in Section 2.4. Overall, the most effective strategies are spatial coarsening and the improved initial guess, but the other strategies combined have a similarly significant impact on runtime. Together, these strategies produce an MGRIT algorithm for nonlinear problems that has an efficiency similar to that found for a corresponding linear problem.

In Section 2.1.1, some implementation details are given. In Section 2.2, our strategy for improving the performance of MGRIT is proposed and justified. In Section 2.3, this strategy is implemented, and in Sections 2.5.1 and 2.5.2, weak and strong scaling results are presented. However, the first step is to extend the MGRIT algorithm to allow for a nonlinear time integration operator.

2.1 Nonlinear MGRIT with FAS

The linear residual equation (see Section 1.2.2, equation (1.3)) does not hold if the fine grid operator, A, is nonlinear. Instead, for a nonlinear operator, FAS multigrid must be used. Algorithm 3 presents MGRIT as a two-level FAS method. The multilevel FAS-based MGRIT algorithm is formed by recursively applying the algorithm at Step 4. Ideal interpolation (1.18) is carried out in two steps (7 and 8) for simplicity of implementation. This description of FAS-based MGRIT first appeared in [18]. To accommodate both linear and nonlinear problems, the XBraid implementation of MGRIT is based on this FAS algorithm.

Note that an initial guess must be set at each fine grid C-point prior to completing the first MGRIT iteration. In general, the C-points should be initialized with the best available estimate of the solution. Alternatively, a good initial guess can be found using a full multigrid methodology. In this case, the fine grid C-points are obtained by interpolating a cheap coarse grid solution to the fine grid (see Section 2.4.1).

Algorithm 3 MGRIT(A, u, g)
1: Apply F- or FCF-relaxation to A(u) = g.
2: Inject the fine grid approximation and its residual to the coarse grid: u_{Δ,i} = u_{mi}, r_{Δ,i} = g_{mi} - [A(u)]_{mi}.
3: If spatial coarsening, then u_{Δ,i} = R_x(u_{Δ,i}), r_{Δ,i} = R_x(r_{Δ,i}).
4: Solve B_Δ(v_Δ) = B_Δ(u_Δ) + r_Δ.
5: Compute the coarse grid error approximation: e_Δ = v_Δ - u_Δ.
6: If spatial coarsening, then e_{Δ,i} = P_x(e_{Δ,i}).
7: Correct u at C-points: u_{mi} = u_{mi} + e_{Δ,i}.
8: If converged, then update F-points: apply F-relaxation to A(u) = g.
9: Else go to step 1.

2.1.2 Numerical Parameters

All tests were completed with T = 4 seconds and Ω = [0, 2]² on a regular grid. Unless otherwise stated, the p-Laplacian was used with p = 4 and C = 0. The spatial discretization was computed using standard bi-linear quadrilateral elements and MFEM [41], a parallel finite element code. The numerical testing parameters used throughout the chapter (unless otherwise mentioned) were as follows. The Newton tolerance was fixed at 10^{-7}. The spatial solver for each Newton iteration was BoomerAMG from hypre [32]. The BoomerAMG parameters were: HMIS coarsening (coarsen-type 10), one level of aggressive coarsening, symmetric L1 Gauss-Seidel (relax-type 8), extended classical modified interpolation (interp-type 6), and interpolation truncation equal to 4 nonzeros per row. The machine used for all numerical tests was Vulcan, an IBM BG/Q machine at LLNL. Except for the scaling studies, the test problem size was a (64) space-time grid on the domain [0, 2]² × [0, 4], using 4 processors in space and 128 processors in time. V-cycles and FCF-relaxation were employed in every test. The global MGRIT residual is defined as

    r_g = ( Σ_{j=0}^{N_t/m} ‖r_{mj}‖²_{L²} )^{1/2},

where r_{mj} is the residual at t = t_{mj}, as calculated in step 2 of Algorithm 3. In all cases, V-cycles were completed until the global MGRIT residual was reduced below a fixed stopping criterion of 10^{-9}/(√δt · h). This allowed the same tolerance, relative to the fine-grid resolution, to be used in all cases. Note that this is an overly tight tolerance with respect to discretization error, set in large part because of the desire to investigate the algorithm's asymptotic convergence properties. The temporal coarsening factor was m = 4.
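As a small illustration, the halting check might be implemented as below; the function names and the discrete norm (a plain Euclidean norm standing in for the L² norm) are assumptions.

    import numpy as np

    def global_mgrit_residual(r_c):
        # r_c: list of residual vectors at the C-points (step 2 of Algorithm 3).
        return np.sqrt(sum(np.dot(r, r) for r in r_c))

    def halt(r_c, dt, h, tol=1e-9):
        # Stopping criterion scaled by the fine-grid resolution: tol/(sqrt(dt)*h).
        return global_mgrit_residual(r_c) < tol / (np.sqrt(dt) * h)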

2.2 Establishing Baselines

The goal of this chapter is to establish general strategies that one can use to obtain an MGRIT algorithm for nonlinear problems that is as efficient as those previously seen for linear problems. To that end, an efficiency metric for computational cost is now presented. Two numerical baselines, through which all improvements will be measured, are also given.

The MGRIT cost metric used here relies on c_l^{(j)}, the average cost of a Newton solve on grid level l during MGRIT iteration j. Note that by Newton solve we mean the cost to complete a time step using Newton's method, including the multiple inner Newton iterations. This is a sensible choice because the key computational kernel for nonlinear problems is the Newton solve. Section 2.2.1 provides a detailed discussion of this metric and why it was chosen.

Two baseline tests were completed, the first being the sequential baseline. This baseline determined the average cost of a Newton solve when using a sequential solver, on a variety of space-time grids. An efficient implementation of MGRIT for nonlinear problems will match (or improve on) these cost estimates. The second baseline test is called the naive MGRIT baseline. For this baseline, MGRIT was applied to the model nonlinear problem in an out-of-the-box fashion. Typically, time-stepping routines have a hard-coded initial guess equal to the previous time-step and do not implement spatial coarsening. Hence, the naive MGRIT baseline mimics this. Note that this approach did, given enough temporal processors, provide a small speedup over the sequential routine (see Section 2.5.2); however, these speedups were much less than those seen for linear problems.

2.2.1 Cost Metric

The chosen cost metric, c_l^{(j)}, is defined as the cost of a Newton solve (Φ application) on grid level l during MGRIT iteration j, averaged over all time-steps on that level during that MGRIT iteration, i.e.,

    c_l^{(j)} = a_l^{(j)} w_l,    (2.14)

where a_l^{(j)} is the average number of Newton iterations required to take a nonlinear time-step on level l and MGRIT iteration j, and w_l is the number of spatial unknowns on level l. Thus, c_l^{(j)} is the cost estimate of a single Φ application. It is important to note that this cost metric also depends on the Newton stopping tolerance. If one loosens the stopping tolerance, the average cost per solve will decrease; however, this can lead to an increase in the number of outer MGRIT iterations required to meet the fixed MGRIT residual tolerance. As such, all the cost saving methods must be designed to balance cost savings and V-cycle accuracy.

As motivation for this choice, let us examine a simple cost model for the entire MGRIT algorithm. Restriction from level l to level l+1 has the same cost as a C-relaxation on level l. Likewise, interpolation from level l to level l−1 has the same cost as an F-relaxation on level l. Hence, when FCF-relaxation is used, 3 F-relaxations and 2 C-relaxations are completed on each level (excluding the coarsest grid). Assuming N_{t_l}, the number of time-steps on level l, is an exact multiple of m, this amounts to around (2 + (m−1)/m) N_{t_l} evaluations of Φ. Next, consider completing ν MGRIT V-cycles on a temporal grid hierarchy where 0 represents the finest grid and L represents the coarsest grid. Then, the total cost of the MGRIT algorithm is

    C(L, ν) ≈ Σ_{k=1}^{ν} ( N_{t_L} c_L^{(k)} + Σ_{l=0}^{L−1} c_l^{(k)} (2 + (m−1)/m) N_{t_l} ).    (2.15)

Equation (2.15) implies that minimizing the average cost of a Newton solve on each level will result in a reduction of the cost of each MGRIT iteration. As mentioned above, there exists an implicit relationship linking ν, the number of MGRIT iterations required for convergence, and each c_l^{(k)}. For example, allowing a maximum of one Newton iteration on the coarse time grids will reduce the cost of a single MGRIT V-cycle, but this will likely lead to a large increase in the number of MGRIT iterations required for convergence. Because the cost of a V-cycle is so large compared to a single Newton solve, any increase in the number of MGRIT iterations required for convergence can quickly outweigh the corresponding per-iteration cost savings, leading to an increase in run-time.
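Written out, the cost model (2.15) is a straightforward double sum; the sketch below is a direct transcription, where the container layout for the iteration counts and grid sizes is an assumption.

    def mgrit_cost(a, w, Nt, m):
        # a[k][l]: average Newton iterations per time step on level l at MGRIT
        #          iteration k; w[l]: spatial unknowns; Nt[l]: time steps.
        # Level L = len(w) - 1 is the coarsest grid, solved sequentially.
        L = len(w) - 1
        cost = 0.0
        for k in range(len(a)):                      # sum over MGRIT iterations
            cost += Nt[L] * a[k][L] * w[L]           # sequential coarse solve
            for l in range(L):                       # FCF-relaxation, levels 0..L-1
                cost += a[k][l] * w[l] * (2 + (m - 1) / m) * Nt[l]
        return cost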

As such, the optimizations described in this chapter are designed to reduce the cost of the Newton solve on each level without significantly increasing the number of iterations required for convergence.

Sections 2.3.2 and 2.3.3 examine the two most important strategies for minimizing c_l^{(j)}: an improved initial guess for Newton, and spatial coarsening. If the same stopping criterion is used, the improved initial guess reduces the average cost of a Newton solve on all levels without sacrificing V-cycle accuracy, albeit at the cost of storage. On the other hand, spatial coarsening reduces the average cost of a Newton solve on the coarse grids, but can lead to an increase in MGRIT iterations for problems with an anisotropic time integration operator. To mitigate these effects, a delayed strategy is introduced. This delayed strategy maintains the full spatial resolution until the parabolic ratio, δt/h², is large enough to ensure the time integration operator is sufficiently isotropic for spatial coarsening. This leads to an increase in the cost per iteration when compared to full spatial coarsening, but ensures that ν remains fixed, ultimately leading to a reduced runtime.

While the metric shown here is c_l^{(j)}, another parallel performance issue is the variance in the cost of a Newton solve over the temporal domain. On the coarser grids, where a processor might own a single time-step, synchronization effects imply that an F- or C-relaxation cannot complete until the slowest processor finishes. As such, any difference between the maximum and minimum Newton iterations across the temporal domain indicates a load imbalance. Temporal load balancing is a topic of current research, but is beyond the scope of this thesis.

2.2.2 Sequential Time Integration Baseline

We now establish the sequential time-stepping baseline. Table 2.1 shows estimates of the cost of a Newton solve on all of the space-time grids present in the space-time grid hierarchy when MGRIT is applied to the model problem outlined in Section 2.1.2. These estimates, calculated using Equation (2.14), are the average cost of a Newton solve when using a sequential solver.

As discussed above, these estimates show the average number of iterations required to complete a Newton solve, multiplied by the number of spatial unknowns in the spatial grid (129²). Table 2.1a depicts baseline estimates for MGRIT using delayed spatial coarsening. Table 2.1b provides baseline estimates for MGRIT with no spatial coarsening. An efficient implementation of MGRIT will match (or improve on) these cost estimates across all levels.

The take-home message from these results is that the average cost of a Newton iteration is highly dependent on the space-time grid, with there being a considerable advantage to coarsening simultaneously in space and time. This is because spatial coarsening reduces both a_l^{(j)} and w_l. The decrease in a_l^{(j)} is explained by considering a standard backward Euler time-step,

    ( I − (δt/h²) G )(u_{k+1}) = f(u_k),    (2.16)

where G is a nonlinear diffusion operator. As δt/h² increases, the nonlinear operator moves away from the identity, becoming more expensive to solve. Coarsening in space and time bounds this ratio, making the nonlinear solve from Equation (2.16) cheaper on the coarse grid. A detailed investigation into using spatial coarsening is given in Section 2.3.3.

The second strategy to reduce the cost of the Newton solve is to improve the initial guess (see Section 2.3.2). This importance is also evident in Table 2.1, where a_l continues to increase with δt, even when spatial coarsening is used. This is because δt is increasing, thus making the initial guess to Newton (the previous time-step) progressively worse. On the coarse levels, where δt is large, this is clearly a poor approximation to the solution. In Section 2.3.2, an alternate initial guess for Newton, where the initial guess is the solution obtained during a previous evaluation of Φ at the current time-point, is pursued.
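To make the role of the initial guess explicit, a sketch of one implicit time step of the form (2.16) solved with Newton's method follows; the residual and Jacobian-solve callables are assumptions standing in for the finite element residual and one AMG-based linear solve.

    import numpy as np

    def newton_time_step(F, J_solve, u_init, tol=1e-7, max_iter=50):
        # Solve F(u) = 0, e.g. F(u) = u - (dt/h**2)*G(u) - f_prev for (2.16).
        # J_solve(u, r) applies the (approximate) inverse Jacobian at u to r.
        # The quality of u_init (previous time step vs. a better guess) largely
        # determines the iteration count a_l that enters the cost metric (2.14).
        u = u_init.copy()
        for it in range(max_iter):
            r = F(u)
            if np.linalg.norm(r) < tol:
                return u, it               # it Newton iterations were needed
            u = u - J_solve(u, r)          # Newton update
        return u, max_iter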

Table 2.1: Baseline Newton solver costs for sequential time-stepping, for (a) delayed spatial coarsening and (b) no spatial coarsening, across the temporal levels δt = 1/1024, 1/256, 1/64, 1/16, 1/4, 1. Here c_l represents the average number of iterations required for the Newton solve to converge, a_l, multiplied by the number of spatial unknowns in the grid, w_l.

2.2.3 Naive MGRIT Baseline

Next, the so-called naive MGRIT baseline is presented. The naive implementation is an unmodified, out-of-the-box type implementation that wraps a sequential time-stepping routine without any MGRIT optimizations. In particular, the previous time-step is used as the initial guess and spatial coarsening is not implemented. Table 2.2 depicts the cost estimates, c_l^{(j)}, found using this basic parameter set. Each δt (column) value corresponds to a temporal level, while the rows represent different XBraid iterations. Due to increasing Newton iteration counts, the cost of a coarse grid Newton solve is, in some places, 6 times more expensive than a fine grid solve.

To see how this can lead to an inefficient method in parallel, consider the case where each MPI process owns m points in time on the finest level, i.e., one CF-interval. In [17] it was shown that this is an efficient decomposition. On coarser levels, each process then owns at most one point in time. Table 2.3 gives, for the processes active on each temporal level, the estimated cost of an FCF-relaxation. With this processor distribution, FCF-relaxation on the finest grid requires roughly 2m time-step evaluations per processor. Coarse grid estimates are obtained based on the fact that each processor owns at most one point, and all F-points are relaxed twice. The average cost of a time step evaluation used to calculate these estimates was taken from Table 2.2.

Iteration | δt = 1/1024 | 1/256  | 1/64   | 1/16   | 1/4    | 1
    1     |      -      | 2.10e4 | 2.20e4 | 3.16e4 | 5.86e4 | 4.92e4
    2     |      -      | 1.24e4 | 1.76e4 | 3.52e4 | 5.98e4 | 5.41e4
    3     |      -      | 1.19e4 | 1.70e4 | 3.35e4 | 6.06e4 | 5.73e4
    4     |      -      | 1.17e4 | 1.72e4 | 3.38e4 | 5.94e4 | 5.73e4
    5     |      -      | 1.17e4 | 1.72e4 | 3.39e4 | 5.94e4 | 5.73e4
    6     |      -      | 1.17e4 | 1.72e4 | 3.39e4 | 5.98e4 | 5.73e4

Table 2.2: Cost estimates, c_l^{(j)}, for the naive MGRIT baseline (solver 0). The cost estimates are given in raw, unscaled form across each temporal level and MGRIT iteration.

In this setting, it is immediately clear how expensive Newton solves on coarse levels can dominate the cost of a V-cycle, especially when taking into account the fact that the levels must be traversed in order on each processor. In contrast, for the linear setting (when using an optimal spatial solver and no spatial coarsening), the work required would be independent of time-step size. Controlling any growth in the cost of a Newton solve across temporal levels will be key to matching the results seen with MGRIT for linear problems. By itself, the naive application of MGRIT scales very poorly when placed alongside the comparable linear problem (see Sections 2.5.1 and 2.5.2).

To summarize, our goal is to present a library of MGRIT optimizations and modifications that can be used to develop efficient MGRIT implementations for a wide range of nonlinear parabolic problems. These modifications are developed with the goal of minimizing the average cost of a Newton solve, focusing on the coarse grids, without increasing the number of iterations required for convergence, while also addressing common user concerns of non-intrusiveness and memory usage.

Iteration | δt = 1/1024 | 1/256  | 1/64   | 1/16   | 1/4    | 1      | Total
    1     |      -      | 4.20e4 | 4.39e4 | 6.32e4 | 1.17e5 | 9.83e4 | 5.38e5
    2     |      -      | 2.48e4 | 3.51e4 | 7.05e4 | 1.20e5 | 1.08e5 | 4.44e5
    3     |      -      | 2.38e4 | 3.41e4 | 6.71e4 | 1.21e5 | 1.15e5 | 4.31e5
    ...

Table 2.3: For the processes active on each temporal level, the estimated average cost to carry out FCF-relaxation, based on Table 2.2.

2.3 Efficient MGRIT for the Model Nonlinear Problem

In this section, a variety of approaches designed to make MGRIT more efficient for the chosen nonlinear model problem (equation 2.1) are investigated. Improvements are measured by comparing to the naive MGRIT baseline of Section 2.2.3.

2.3.1 Solver ID Table

Given the number of options considered, and to make discussion easier, each solver is given a numerical ID. Table 2.4 presents each solver considered and its runtime for the chosen test problem size. Also given is the per-processor memory multiplier, which gives an estimate of the per-processor increase in memory usage (see Equation 1.44) when compared to sequential time-stepping. The multiplier is given for both p = 128 temporal processors and p = 4096 temporal processors, which is the largest possible p for this test problem size. For p = 128, the multiplier is quite large, but the addition of more resources can make this number relatively small.

Each solver option is discussed in more detail in the following subsections. However, a brief description of each solver is presented here for the reader's convenience. Solver 0 refers to the naive MGRIT approach from Section 2.2.3. The column heading Spatial Grids refers to the number of spatial grids used and is the option introduced in Section 2.3.3. A value of 1 for Spatial Grids indicates no spatial coarsening, while 4 means that the finest spatial grid is coarsened 3 times for a total of 4 grids.

The finest grid is 64 × 64, so coarsening further in space is not advantageous. The difference between solvers 1 and 2 is that solver 1 delays spatial coarsening so that it begins on the fourth temporal grid, while solver 2 begins spatial coarsening immediately on the first coarse time grid. Remember, for this problem with 4096 time-steps and m = 4, there are only 6 temporal grids. Given the adverse effect on convergence visible from no delay in solver 2, unless otherwise mentioned, spatial coarsening is always delayed in further tests. See Section 2.3.3 for more details.

Solvers 3, 4, 5 and 6 correspond to the improved initial guess introduced in Section 2.3.2. Here, IIG means that the improved initial guess, specifically the solution from a previous evaluation of Φ, is used as the initial guess for each Newton solve. This happens either at every point, or at all the coarse grid points and the fine grid C-points, depending on whether all the points or only the C-points are stored. Regarding the memory multiplier, the use of the IIG requires an additional vector stored on coarse levels, hence k = 3 (instead of 2) in Equation 1.44.

Solvers 7, 8, 9 and 10 correspond to turning on the Skip option introduced in Section 2.4.1, where the concept of skipping unnecessary work during the first MGRIT down cycle is introduced. Solvers 11 and 12 correspond to having Cheap first three iters, as introduced in Section 2.4.2; here, the Newton tolerance is relaxed during the first three MGRIT iterations. Solvers 13, 14, 15 and 16 reduce the number of levels in the hierarchy by setting a larger Coarsest grid size, as introduced in Section 2.4.3. For instance, with m = 4, using a coarsest grid size of 16 instead of 4 removes the coarsest level in the hierarchy. This can improve the time-to-solution by avoiding cycling between very small grids.

Solver ID | Spatial grids | IIG      | Skip | Cheap first 3 iters | Coarsest grid size | Iterations | Runtime
    0     |       1       | Never    |  No  |         No          |         4          |     -      |    -
    1     |       4       | Never    |  No  |         No          |         4          |     7      |  712s
    2     |       4*      | Never    |  No  |         No          |         4          |     -      |    -
    3     |       1       | C-points |  No  |         No          |         4          |     7      |  638s
    4     |       4       | C-points |  No  |         No          |         4          |     7      |  579s
    5     |       1       | Always   |  No  |         No          |         4          |     7      |  647s
    6     |       4       | Always   |  No  |         No          |         4          |     7      |  591s
    7     |       1       | C-points |  Yes |         No          |         4          |     7      |  453s
    8     |       4       | C-points |  Yes |         No          |         4          |     7      |  410s
    9     |       1       | Always   |  Yes |         No          |         4          |     7      |  429s
   10     |       4       | Always   |  Yes |         No          |         4          |     7      |  391s
   11     |       1       | Always   |  Yes |         Yes         |         4          |     7      |  395s
   12     |       4       | Always   |  Yes |         Yes         |         4          |     7      |  360s
   13     |       1       | Always   |  Yes |         Yes         |        16          |     7      |  408s
   14     |       4       | Always   |  Yes |         Yes         |        16          |     7      |  354s
   15     |       1       | Always   |  Yes |         Yes         |        64          |     8      |  506s
   16     |       4       | Always   |  Yes |         Yes         |        64          |     8      |  383s

Table 2.4: Overall run-times and iteration counts for the various solver options, with a (64) space-time grid. A * indicates no delay in spatial coarsening.

2.3.2 MGRIT with an Improved Initial Guess for Newton's Method

Section 2.2.2 showed that c_l^{(j)} increases with the time-step size (and hence with MGRIT level). This occurs in large part because the previous time-step becomes an increasingly inaccurate initial guess. This section explores an improved initial guess (IIG), which uses, as the initial guess for Newton's method at a specific time point and MGRIT level, the solution from the previous evaluation of Φ at that point and level. As XBraid converges, this value becomes an ever-improving initial guess.

The main drawback of using the IIG is that an extra auxiliary vector must be stored at each point on every coarse grid. Note that the solution from the previous Φ evaluation and the current FAS solution are equivalent on the fine grid, so the IIG is already available at stored fine-grid points. The extra storage at coarse levels is required because the FAS solution, already stored there, is not equal to the result of the previous Φ evaluation. This creates a tension between memory usage and computational efficiency. To limit memory usage, one can alternatively choose to store the solution at the fine grid C-points only. In this case, the IIG is available at all time steps except those completed during fine grid F-relaxation, where the previous time-step is used. This approach, referred to as IIG at C-points, is used by solvers 3, 4, 7 and 8.

Remark. For users with extreme memory constraints, results not presented here have shown that, when it is available, using the current FAS solution as the initial guess to the nonlinear solver also provides large runtime reductions, although not as impressive as those shown here.

Table 2.5 shows c_l^{(j)} over all levels and iterations, scaled entry-by-entry with the corresponding entry from the naive MGRIT baseline from Table 2.2. For example, entry (2,3) in Table 2.5 has been scaled by entry (2,3) from Table 2.2. Values less than one indicate that the cost per Newton solve was reduced by using the improved initial guess, while values greater than one indicate it increased. Solvers 3 and 5, where the IIG is used as the initial guess to the Newton solver at only the C-points and at all points, respectively, show large reductions in overall runtime, a direct consequence of the cost reductions seen in Table 2.5. Moreover, there is no degradation in MGRIT convergence. Solvers 4 and 6, where the IIG is used in conjunction with spatial coarsening, are discussed in Section 2.3.4. Interestingly, for both approaches, the costs increase during early iterations when using the IIG.

A consequence of these increased costs is that using the IIG only at C-points reduced the runtime by 44%, with only a 32% increase in storage costs over the naive approach, while using the IIG as the initial guess at all points doubled the storage costs of the method and only reduced the runtime by 43%. This occurs because MGRIT, and hence the IIG, converges towards the true solution. During the early iterations, where the solution from the previous MGRIT iteration is inaccurate, the previous time step is a better initial guess. In this case, one could opt to experimentally determine which initial guess to use on each level and iteration. Such a strategy would likely result in further cost savings, but would also be highly problem-specific. Rather than doing this, we note that further cost savings are possible, and instead choose to pursue more general strategies that, when used in conjunction with the improved initial guess, will reduce the cost of those first three iterations. Either way, the improved initial guess is an effective, non-intrusive strategy that will provide a large reduction in runtime for most nonlinear implementations of MGRIT. This ability to effectively use reduced storage only at C-points and the overall importance of improved initial guesses for nonlinear MGRIT are both important contributions of this work.

Table 2.5: Relative cost estimates, c_l^{(j)}/c_{l,naive}^{(j)}, for MGRIT with the improved initial guess at all points (solver 5), across each temporal level and MGRIT iteration. The solver setup is otherwise identical to that used for Table 2.2. Cost estimates are scaled entry-by-entry with c_{l,naive}^{(j)}, the naive MGRIT baseline costs from Table 2.2.
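A sketch of the bookkeeping for the two storage policies discussed above follows; the dictionary-based store and the function names are illustrative assumptions rather than the XBraid interface.

    def initial_guess(level, index, u_prev_step, phi_store):
        # Prefer the stored result of the previous Phi evaluation at this
        # (level, time-index) pair, i.e., the IIG; otherwise fall back to the
        # previous time step.  With the 'IIG at C-points' policy, fine-grid
        # F-points are simply absent from phi_store.
        return phi_store.get((level, index), u_prev_step)

    def record_phi(level, index, m, u_new, phi_store, store_all_fine_points):
        # Store the new Phi result so it can seed the next MGRIT iteration.
        if level > 0 or store_all_fine_points or index % m == 0:
            phi_store[(level, index)] = u_new.copy()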

2.3.3 MGRIT with Spatial Coarsening

Motivated by the results seen in Section 2.2.2 and Table 2.1a, spatial coarsening is added to the naive solver from Section 2.2, as indicated in Algorithm 3. In general, the user's code defines the separate spatial interpolation and restriction functions P_x(·) and R_x(·). At first glance spatial coarsening seems intrusive; however, MGRIT semi-coarsens in time and is agnostic to the spatial discretization, thus spatial coarsening is an extra option that can easily be implemented independently of the time-integration scheme. Here, the natural finite element spatial restriction operator (and its transpose) is used to interpolate between regularly refined spatial grids. This operator is provided by MFEM. Spatial interpolation equals the scaled (by 1/4) transpose of restriction, so that R_x P_x ≈ I. The choice of spatial interpolation operators is an area of active research; however, it is not surprising that scaling so that R_x P_x resembles an oblique projection helps MGRIT convergence. A better choice here could lead to improved results below.

The benefit of spatial coarsening is validated by the improved timings in Table 2.4, where solvers 0, 1 and 2 apply varying levels of spatial coarsening. Beginning spatial coarsening on the first coarse grid (solver 2) leads to a degradation in MGRIT convergence. For practical purposes, this solver is unusable, as the convergence degradation continues for larger problems. This is unfortunate because this approach leads to a much smaller runtime per iteration. It is important to note that this degradation is not restricted to nonlinear problems. For example, if the nonlinear coefficient of the p-Laplacian is replaced with the exact, known solution, then the resulting linear problem suffers from the same convergence rate degradation. In fact, under certain conditions, even the linear heat equation suffers from a degradation in the convergence rate. Overall, the convergence rate degrades if the strength of connection between the spatial and temporal dimensions is weak. In those cases, the time integration operator is anisotropic in the spatial direction.

To see why this causes a problem, first recall the two-level coarse error propagation matrix for MGRIT with FCF-relaxation and spatial coarsening (see Section 1.3.3),

    E_Δ = (I − P_x B_Δ^{−1} R_x A_Δ)(I − A_Δ).

Assume the time integration operator is of the form Φ = (I + εD)^{−1}, where D is some spatial operator and ε is a constant. An example of a time integrator of this form is the time integration operator associated with the 1D linear heat equation, backward Euler, and finite differencing. In that case, Φ = (I + (δt/h²) D)^{−1}, where D is the standard diffusion operator. Note that this operator becomes anisotropic as ε → 0. In what follows, we show that the convergence rate for MGRIT goes to one when spatial coarsening is completed using an anisotropic time integration operator. Define e to be a vector in the null space of R_x, i.e., R_x(e) = 0 with e ≠ 0, and define e_0 = [0, 0, ..., e] and e_1 = [0, 0, ..., Φ^m e, 0], normalized so that ‖e_1‖ = 1. Then, noting that (I − A_Δ)e_1 = e_0 and A_Δ e_0 = e_0 gives

    ‖E_Δ‖ ≥ ⟨E_Δ e_1, e_1⟩                                    (2.17)
          = ⟨(I − P_x B_Δ^{−1} R_x A_Δ)(I − A_Δ) e_1, e_1⟩    (2.18)
          = ⟨(I − P_x B_Δ^{−1} R_x A_Δ) e_0, e_1⟩             (2.19)
          = ⟨(I − P_x B_Δ^{−1} R_x) e_0, e_1⟩                 (2.20)
          = ⟨e_0, e_1⟩                                        (2.21)
          = ⟨e, Φ^m e⟩.                                       (2.22)

In the limit that ε → 0, Φ → I, and ‖E_Δ‖ → 1. In other words, the convergence rate of MGRIT with spatial coarsening degrades as the operator becomes anisotropic. In particular, this occurs because spatial coarsening leaves high frequency interpolation error that cannot be smoothed by an anisotropic time integrator.
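The bound (2.17)-(2.22) is easy to check numerically. Below is a small sketch, assuming a 1D diffusion stencil and taking for e the highest-frequency mode, which (away from the boundary) is annihilated by full-weighting restriction.

    import numpy as np

    n, m, eps = 33, 4, 1e-3                    # grid size, coarsening factor, eps
    D = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D diffusion stencil
    Phi = np.linalg.inv(np.eye(n) + eps * D)   # Phi = (I + eps*D)^{-1}

    e = np.cos(np.pi * np.arange(n))           # e = (+1, -1, +1, ...)
    e /= np.linalg.norm(e)

    # <e, Phi^m e> realizes the lower bound (2.22); it tends to 1 as eps -> 0,
    # i.e., as the time integrator Phi approaches the identity.
    print(e @ np.linalg.matrix_power(Phi, m) @ e)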

This is especially problematic because many of the most common time-integration operators are anisotropic when the parabolic ratio, δt/h², is small. For the p-Laplacian, the nonlinear diffusion coefficient takes the place of ε. As stated in Section 2, this coefficient is small for all (x, y, t) = (n/2, m/2, t), n, m = 0, 1, 2, 3, 4, and for all (x, y, t) = (x, y, nπ/τ) with n = 0, 1, ..., Tτ/π. The original modified p-Laplacian equation (equation 2.1), repeated here for convenience, is

    u_t(x, t) − ∇ · [ (C + |u(x, t)|^{p−2}) ∇u(x, t) ] = b(x, t),  x ∈ Ω, t ∈ [0, T], C ≥ 0,    (2.23)

subject to the following Neumann boundary and initial conditions:

    (C + |u(x, t)|^{p−2}) ∇u(x, t) · n = g(x, t),  x ∈ ∂Ω, t ∈ (0, T],    (2.24)
    u(x, 0) = u_0(x),  x ∈ Ω.    (2.25)

In essence, C provides a lower bound on the anisotropy of the time integration operator. For example, when C = 1, the link between the spatial and temporal domains is at least as strong as that seen for the linear heat equation. Figure 2.3 plots the asymptotic convergence factors found when using the MGRIT algorithm, with and without spatial coarsening, to solve the modified p-Laplacian (p = 4) equation using a range of C values. As expected, the convergence rate degrades as C goes to zero. The degradation associated with zeros that are local in space, i.e., the zeros near (x, y) = (n/2, m/2), n, m = 0, 1, 2, 3, 4, can be mitigated using an adaptive local mesh coarsening scheme [31]. The anisotropies caused by the zeros that span the entire spatial grid in local regions of the temporal domain, i.e., the zeros when t = nπ/τ, n = 0, 1, ..., Tτ/π, are more difficult to remove. The intuitive approach is to not coarsen the spatial mesh at these time points; however, this leads to a large coarse grid load imbalance that all but eliminates the cost savings associated with spatial coarsening.

Our goal is to develop a spatial coarsening strategy that can be applied to all nonlinear problems. As such, a detailed investigation into when and where spatial coarsening can be used is skipped. Instead, a simple delayed strategy that allows spatial coarsening to be used on most nonlinear problems that exhibit this phenomenon is developed. A similar strategy was used in [22] to make use of spatial coarsening inside a full space-time multigrid method.

Figure 2.3: Asymptotic convergence rates against C for the modified p-Laplacian with MGRIT and spatial coarsening.

The strategy, referred to as delayed spatial coarsening, is to limit spatial coarsening to the coarsest time grids. For example, solver 1 uses four spatial grids, with spatial coarsening delayed until δt/h² = 16. Table 2.6 presents a cost analysis of this strategy, scaling each c_l^{(j)} with the corresponding cost estimates found using the naive MGRIT baseline in Table 2.2. To highlight the cost saving benefits of delayed spatial coarsening, this table shows results that use the previous time-step as the initial guess for the Newton solve. Section 2.3.4 discusses the combined effect of spatial coarsening and the improved initial guess introduced in Section 2.3.2. Delayed spatial coarsening reduced both the number and cost of Newton iterations on the coarse grids. In fact, on the coarsest grid, the average cost per Newton solve is 100 times smaller than the cost to complete the same solve using the naive baseline.

Recognizing the specific importance of spatial coarsening when controlling the nonlinear time-step solver iterations on coarse time grids is an important contribution. Heuristically, the effectiveness of the delayed spatial coarsening strategy can be explained by considering the above discussion regarding anisotropic time-integrators. Each time spatial coarsening is delayed, the ratio δt/h² increases by a factor of m. This results in a coarse grid operator that is less anisotropic than the corresponding operator associated with full spatial coarsening. Eventually, δt/h² is such that the coarse grid problem converges at an acceptable rate.

The benefit of these dramatic cost reductions is seen in Table 2.4. Solver 1, where spatial coarsening is delayed until δt/h² = 16, balanced the MGRIT convergence rate and the coarse grid computational costs, leading to a 37% reduction in run-time when compared to solver 0. This is the strategy pursued in this thesis, and it has proven to be robust in our tests.

Table 2.6: Relative cost estimates, c_l^{(j)}/c_{l,naive}^{(j)}, for MGRIT with spatial coarsening (solver 1), across each temporal level and MGRIT iteration. Spatial coarsening begins when δt/h² = 16. The previous time-step is used as the initial guess at all points. The solver setup is otherwise identical to that used for Table 2.2. Cost estimates are scaled entry-by-entry with c_{l,naive}^{(j)}, the naive MGRIT baseline costs from Table 2.2.
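The delayed strategy amounts to a simple schedule: keep the finest spatial mesh until the parabolic ratio on a level is large enough, then pair each further temporal coarsening with one spatial coarsening. The sketch below illustrates this; the mesh parameters and the threshold interpretation are illustrative assumptions, chosen so that the example reproduces solver 1's behavior of beginning spatial coarsening on the fourth temporal level.

    def spatial_grid_schedule(dt0, h0, m, n_time_levels, n_space_grids,
                              min_ratio=16.0):
        # Return, for each temporal level, which spatial grid to use (0 = finest).
        # Space is held fixed until dt/h**2 reaches min_ratio; from then on each
        # temporal coarsening (dt *= m) is paired with one spatial coarsening
        # (h *= 2), which keeps dt/h**2 bounded on all coarser levels.
        schedule, dt, h, grid = [], dt0, h0, 0
        for _ in range(n_time_levels):
            schedule.append(grid)
            if dt / h**2 >= min_ratio and grid < n_space_grids - 1:
                grid += 1
                h *= 2.0
            dt *= m
        return schedule

    # Illustrative parameters (an assumption): dt = 1/1024, h = 1/32, m = 4,
    # six temporal levels and four spatial grids; coarsening then begins on
    # the fourth temporal level, as for solver 1:
    print(spatial_grid_schedule(1/1024, 1/32, 4, 6, 4))   # [0, 0, 0, 1, 2, 3]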

2.3.4 Combining the Improved Initial Guess and Spatial Coarsening

Sections 2.3.2 and 2.3.3 introduced an improved initial guess and spatial coarsening. Individually, these improvements reduced the overall runtime by 43% and 37%, respectively. Table 2.7 gives a cost analysis of the MGRIT algorithm when these strategies are combined, using 4 levels of spatial coarsening and the IIG at all points. In this table, each cost estimate is again scaled by the corresponding cost estimate calculated using the naive MGRIT baseline.

Table 2.7: Relative cost estimates, c_l^{(j)}/c_{l,naive}^{(j)}, for MGRIT when using spatial coarsening and the improved initial guess (solver 6), across each temporal level and MGRIT iteration. The solver setup is otherwise identical to that used for Table 2.2. Cost estimates are scaled entry-by-entry with c_{l,naive}^{(j)}, the naive MGRIT baseline costs from Table 2.2. Spatial coarsening again begins on the fourth time level, δt = 1/16.

Solvers 4 and 6, from Table 2.4, show the effect of using both of these options. Solver 4, where the IIG is used only at C-points, gives a 49% reduction in runtime with only a 5% increase in storage when compared to the naive baseline, solver 0. Overall, delayed spatial coarsening, and spatial coarsening in general, is an effective strategy that can be applied to most, if not all, nonlinear parabolic implementations of MGRIT. Clearly, no delay in spatial coarsening would be best; however, in cases where this causes a large degradation in the convergence rate, delayed spatial coarsening can still be used to balance the computational savings and convergence degradation. Mitigating the degradation in the convergence rate for MGRIT with full spatial coarsening is an area of active research.

2.4 Further Cost Savings

While the improved initial guess and spatial coarsening are, by far, the most effective strategies presented, several more strategies for reducing the cost of MGRIT are now considered. The effect of these strategies is significantly smaller than that of spatial coarsening and the improved initial guess; thus, we will now normalize the cost analysis tables with respect to the just previously considered solver configuration. For example, the next table, in Section 2.4.1, is scaled entry-by-entry by the raw cost estimates corresponding to Table 2.7 from Section 2.3.4. The table after that, in Section 2.4.2, is scaled entry-by-entry by the raw cost estimates corresponding to the table from Section 2.4.1, and so on. An entry greater than 1 represents a deterioration relative to the previous solver configuration, while a value less than 1 represents an improvement.

2.4.1 Skipping Unnecessary Work

Even with the improvements so far, c_l^{(j)} remains large during the first three MGRIT iterations. Consider iteration 0 and the down-cycle and up-cycle parts of Figure 1.8. For the model problem, where no prior knowledge of the solution is available, it is clear that relaxation during the down-cycle of iteration 0 provides no benefit. The first time that global information is propagated is during the coarse grid solve, and the subsequent up-cycle, of iteration 0. Therefore, all relaxation, and in fact work of any kind, during the down-cycle of iteration 0 can be omitted. In this setting, the initial condition on the finest grid is injected to the coarsest grid and then serially propagated. Then, the solution is interpolated back to the finest grid. Because we have no knowledge of the solution, the previous time step is used as the initial guess throughout this process. This strategy, similar to those found in full multigrid methods, is a non-intrusive modification of the basic MGRIT algorithm.

If some a priori knowledge of the solution is available, and/or a useful initial guess is available at all fine-grid C-points, this work should not be skipped. In all other cases, this is an effective, cheap approach to determining an initial guess at the fine-grid C-points.

Table 2.8 shows the cost analysis for this approach (solver 10) when it is added to the solver strategy from Section 2.3.4 (solver 6), where the IIG is used as the initial guess, along with four levels of spatial coarsening. The cost analysis is similar if spatial coarsening is not used. To better discern the improvements, c_l^{(j)} is scaled by the corresponding raw cost estimates generated when using solver 6. Skipping work can actually increase the cost of a Newton solve on some levels during the first few iterations. However, the corresponding solvers (7 through 10) in Table 2.4 show that this strategy reduces the overall runtime with no degradation in MGRIT convergence. The reduced runtime is due to the fact that the time-stepping routine is called far fewer times during iteration 0, despite being more expensive when it is called. For instance, the dash entries in Table 2.8 denote the fact that there is no work done on the first level for iteration 0, by far the most expensive level. On coarse levels, during iteration 0, only interpolation (F-relaxation) is performed.

Table 2.8: Relative cost estimates, c_l^{(j)}/c_{l,2.7}^{(j)}, for MGRIT when skipping work during the down-cycle of iteration 0 (solver 10), across each temporal level and MGRIT iteration. The solver setup is otherwise identical to that used for Table 2.7 (solver 6). Cost estimates are scaled entry-by-entry with c_{l,2.7}^{(j)}, the raw cost estimates used to generate Table 2.7.

2.4.2 Cheap Initial Iterates

The initial three iterates are the most expensive, with c_l^{(j)} significantly higher on all levels. The solution at this point is still inaccurate, so another obvious modification is to relax the Newton tolerance during the first three iterations on all levels.

Our simple strategy here is to use tol = 10^{-3} (as opposed to 10^{-7}) as the Newton tolerance during the first three iterations. Thereafter, tol returns to 10^{-7}. Some tuning of these parameters might be required when applying this strategy to other nonlinear problems. Table 2.9 shows the cost analysis for this approach (solver 12) when it is added to the previous solver strategy from Section 2.4.1 (solver 10). The cost analysis is similar if spatial coarsening is not used. The cost estimates, c_l^{(j)}, are scaled by the corresponding raw cost estimates generated using the previous solver 10, so that values less than 1 represent an improvement over Table 2.8. These scaled values show the expected drop in the cost estimate during iterations 0, 1 and 2 on the finest grid, where the decrease is about 40-45%. Those fine grid savings are somewhat counteracted by small increases in the cost on the coarse grids (see iteration 1, levels 2 through 6); however, the decreased runtimes shown in Table 2.4 (solvers 11 and 12) verify that this is an effective approach.

Table 2.9: Relative cost estimates, c_l^{(j)}/c_{l,2.8}^{(j)}, for MGRIT when using the cheap initial iterate strategy (solver 12), across each temporal level and MGRIT iteration. The solver setup is otherwise identical to that used for Table 2.8 (solver 10). Cost estimates are scaled entry-by-entry with c_{l,2.8}^{(j)}, the raw cost estimates used to generate Table 2.8.

2.4.3 Setting the Coarsest Grid Size

The final algorithmic enhancement considered is the size of the coarsest grid. Given how relatively expensive the Newton solver is on coarse grids, the question naturally arises whether truncating the number of levels in the hierarchy can be beneficial.

However, this involves a trade-off. When a level is truncated from the hierarchy, the expensive Newton solves are no longer done at that level, and the communication involved with visiting that level is avoided. However, the sequential part of the algorithm increases because the coarsest grid size has now increased. Thus, the best coarsest grid size is naturally problem- and machine-dependent. So far, the coarsest grid has consisted of 4 time points, but here, coarsest grids composed of 16 and 64 time steps are also considered. Since changing the coarsest grid size changes the cost estimates very little, that data is omitted.

For the case of a coarsest grid size of 16 (solvers 13 and 14), Table 2.4 shows a small speedup of 6s when using spatial coarsening, but a slowdown of 13s for the case of no spatial coarsening. This is because spatial coarsening makes the spatial grid on the coarsest level much smaller, and hence, the sequential solve required on the coarsest level is much cheaper. In other words, the penalty for a larger coarsest grid size is much smaller for the case of spatial coarsening. For the case of a coarsest grid size of 64 (solvers 15 and 16 in Table 2.4), the extra work from the larger sequential component of the solve on the coarsest level swamps any benefit, and the run times increase for both solvers 15 and 16. Given that solver 14 is the best overall performing solver, a maximum coarsest grid size of 16 time steps is used in the scaling studies below.

2.4.4 Most Effective Improvements

Not surprisingly, the largest overall reduction in runtime is seen when spatial coarsening and the improved initial guess are used in conjunction with the three improvements outlined in Sections 2.4.1 through 2.4.3. Table 2.10 gives a cost analysis of this approach (solver 14). In this table, cost estimates are scaled entry-by-entry with the naive MGRIT baseline from Table 2.2. The average cost of a Newton solve was reduced across the board, the only exception being fine grid Newton iterations calculated during iteration 2.

Comparing to Table 2.7, so that the overall effect of the improvements from Sections 2.4.1 through 2.4.3 can be measured, shows that most of the work saved is on the first two temporal levels (note that there is one less level in Table 2.10). Regarding runtime, Table 2.4 validates our strategy: solver 14 is 3.2 times faster than the naive baseline, a direct consequence of the across-the-board cost savings seen in Table 2.10.

Table 2.10: Relative cost estimates, c_l^{(j)}/c_{l,naive}^{(j)}, for MGRIT when using the most effective combination of the suggested improvements (solver 14), across each temporal level and MGRIT iteration. Cost estimates are scaled entry-by-entry with c_{l,naive}^{(j)}, the raw cost estimates used to generate Table 2.2.

While many improvements have been outlined, Table 2.4 makes it clear that two of the improvements, the improved initial guess and spatial coarsening, are the most important. This can be seen by comparing solver 0 to solver 5 (to see the impact of the improved initial guess), which shows a 47% improvement. Then, when solver 5 and solver 6 are compared (to see the impact of spatial coarsening), a further 8% improvement is seen. The other four strategies in concert combine for another 40% improvement (compare solver 6 with solver 14). The subsequent scaling studies, therefore, focus on solvers 0, 1, 4, 6 and 14.

2.5 Scaling Studies

Previous sections have focused on producing the most efficient Newton solver as a proxy for MGRIT efficiency. Here, in an effort to validate that heuristic, parallel scaling studies are presented.

2.5.1 Optimal Multigrid Scaling

In this subsection, to test the optimality of MGRIT for the chosen model problem, a domain refinement study is presented. This was completed in a manner similar to the way in which spatial multigrid optimality is tested experimentally. That is, the space-time domain was fixed ([0, 2]² × [0, 4]) while the spatial and temporal resolutions were scaled up, keeping δt/h² fixed. For runs using spatial coarsening, the number of levels of spatial coarsening was increased on each subsequent test, resulting in 4 levels of spatial coarsening on the largest space-time grid. The solvers considered are 0, 1, 4, 6 and 14, so that (respectively) the effects of spatial coarsening, the improved initial guess for the Newton solver, and all the other enhancements can be examined.

Table 2.11 shows the number of MGRIT iterations required to reduce the MGRIT residual below the prescribed tolerance for the range of weakly scaled space-time grids described above. In general, the observed iteration counts appear to be bounded independently of problem size for all the solver options considered.

Unfortunately, using this experiment for weak scaling timings requires more processors than our machine provides (131K processors are available). For example, consider a weak scaling experiment where the smallest problem size involves 8 compute nodes, with 16 processors per node, for a total of 128 processors. A base test case that involves some off-node communication is needed, hence the choice of 8 compute nodes. To maintain a constant problem size per node and a constant δt/h², the number of time points must quadruple as the spatial problem size is doubled. This corresponds to increasing the node count by a factor of 16. Thus, to obtain four data points, 128 · 16³ = 524,288 processors would be required. This level of computing resources is not available; thus, this weak scaling study is only partially complete.

Table 2.11: Weak scaling study: MGRIT iteration counts for each solver ID across the weakly scaled space-time grids.

2.5.2 Strong Scaling

Both MGRIT and sequential time-stepping are O(N) optimal, but the constant for MGRIT is larger. On the other hand, MGRIT allows for temporal parallelism. This leads to a crossover point, beyond which MGRIT is beneficial. To illustrate this, a strong scaling study of MGRIT, for the (128) space-time grid, was completed. Figures 2.4 and 2.5 show the results. The plot for "ID=14, linear problem" corresponds to the comparable linear problem of p = 2 in equation 2.1. This allows one to compare MGRIT's scaling for the nonlinear problem with the scaling for the corresponding linear problem. To generate the data for "ID=14, linear problem", we simply set p = 2 in the code and make no other optimizations, e.g., the spatial matrix and solver are still built during every application of Φ, as is required in the nonlinear case, so that the comparison is fair.² The other plots are for the sequential ("Time-stepping") time-stepping code and solver IDs 0, 1, 4, 6 and 14. This allows for a comparison of naive MGRIT (ID=0) with the effects of spatial coarsening (ID=1), the improved initial guess (IDs 4 and 6) and the effects of all the other improvements (ID=14).

Figure 2.4 depicts the results for m = 16 with 16 processors in space, while Figure 2.5 shows the results for 32 processors in space and m = 4. For the sequential solver, the bottleneck occurs at around 256 spatial processors, although the best wall clock time is at 1024 spatial processors.

² The spatial matrix and solver actually only need to be built once, as in [17], because p = 2 yields a constant coefficient heat equation. But this is not done here, for the purposes of a fair comparison between the linear and nonlinear cases.

Two different coarsening factors, m, and two different spatial processor counts are shown in order to indicate that these are now important user parameters affecting the speedup. The data points in each plot end when there are m points in time per processor on the finest level, i.e., when using more processors in time provides no benefit. This is because FCF-relaxation is sequential over each interval of m points. Observing the results for the m = 16 case, the crossover point at which MGRIT is beneficial is around 1024 processors, or about 64 processors in time. The maximum speedup over the sequential solver was a factor of 6.4. Note that delayed spatial coarsening (solver 1) does not show an improvement because the larger coarsening factor reduces the number of levels in the temporal hierarchy. For the m = 4 case, the crossover point is at about 2000 processors, or about 500 processors in time. Yet, this smaller coarsening factor allows one to use more of the machine, and at 130K processors, the speedup over the sequential solver was 21x. In both cases, solvers 4 and 6 scaled identically. This is a key result of this chapter; storing the solution at fine grid C-points reduces the storage costs of MGRIT without affecting the overall scaling of the algorithm.

Figure 2.4 highlights another important aspect of the MGRIT algorithm. Increasing processors in space allows for greater scaling potential, but, given a fixed number of processors, it is often beneficial to bias the processor distribution towards temporal processors until m equals the number of time points per processor.

Compared to the strong scaling for the linear problem presented in Figure 1.9, these results are not as good. Optimal scaling would be represented by straighter lines; however, achieving this is difficult. The dataset for the linear heat equation ("ID=14, linear problem") is very similar to the experiment in Figure 1.9. In fact, the only differences are: (1) bi-linear finite elements on a regular grid were used in space, as opposed to finite differencing; (2) the BoomerAMG solver in hypre was used, as opposed to the more efficient geometric-algebraic solver PFMG in hypre; and (3) the spatial discretization and spatial multigrid solver were built during every time-step.

Figure 2.4: Strong scaling study for a (128) space-time grid, 16 processors in space, m = 16.

However, in these tests the data set for the linear heat equation shows less than linear strong scaling. Improving strong scaling here will require investigating each of the three differences described above. Difference (1) can be discounted because it does not change the sparsity pattern of the spatial operator, nor does it qualitatively change the convergence rate of the spatial multigrid solver. Thus, the most likely culprits for the strong scaling degradation are the parallel finite-element matrix assembly and the BoomerAMG setup phase, although BoomerAMG is known to be an efficient spatial multigrid code. This is a topic for future research.

2.5.3 Efficiency of Nonlinear MGRIT

The goal of this chapter is to develop a library of non-intrusive optimizations and modifications that make an efficient implementation of the MGRIT algorithm for nonlinear problems possible. As a measure of that efficiency, the MGRIT algorithm was compared against a similar implementation of a linear problem. When comparing to the linear problem in Figure 2.4, the lines for solver IDs 14 and "14, linear prob" are essentially parallel, which is a good result, indicating equivalent scaling.

Figure 2.5: Strong scaling study for a (128) space-time grid, 32 processors in space, m = 4.

Yet, the line for "14, linear prob" is noticeably lower, indicating a faster runtime of about 3x. This is not surprising because, for the linear problem, the Newton solver need only iterate once, while for the nonlinear problem, the Newton solver takes on average 2-3 iterations on the finest level, meaning that the problem is itself 2-3 times more expensive. With the goal of further reducing the costs, current research is focused towards improving spatial coarsening so that it can begin on the first coarse grid.

2.6 Discussion

The MGRIT algorithm effectively adds temporal parallelism to existing sequential solvers and was shown to be effective for linear problems [17]. However, when moving to the nonlinear setting, the relatively large time-step sizes on coarse grids make the application of MGRIT nontrivial. The additional measures proposed in this chapter allow MGRIT to achieve similar performance to a comparable linear problem. If storage is not a concern, an FMG-like initial guess should be built by skipping work on the first down cycle. After that, fine grid F-points should be stored and the IIG used as the initial guess to the nonlinear solver (here, Newton's method).

When memory constraints inhibit this approach, storage costs can be reduced by not storing finest grid F-points, but the IIG should still be used whenever it is available, e.g., at finest grid C-points. Thus, there is a tension between memory usage and computational efficiency. Secondly, spatial coarsening should be used whenever possible. For the linear example in Figure 1.9, spatial coarsening was implemented on all levels effectively. Tests showed that for the nonlinear model problem this was not the best strategy. However, delaying spatial coarsening until the fourth temporal level nonetheless dramatically reduced the cost of a Newton solve on the coarser grids and limited degradation of the MGRIT convergence rate. These two changes gave the largest speedup. The other changes, when combined, also effected a significant speedup.

Weak scaling results showed that MGRIT is a scalable algorithm for nonlinear parabolic problems, with iteration counts bounded independently of problem size. Strong scaling showed the benefit of MGRIT, with a speedup of 21x seen over the corresponding sequential time-stepping routine. The scaling is not as ideal as in [17], but with the modifications given here, similar performance was attained when compared to the corresponding linear problem.

Chapter 3

MGRIT with Richardson Extrapolation

Richardson extrapolation (RE), first introduced in 1911 [45], is a non-intrusive, extrapolation-based technique designed to estimate and eliminate the lowest order error terms from the underlying numerical approximations. Let U(h) be an approximation to U with a step size h and error of the form

    U − U(h) = C_0 h^{k_0} + C_1 h^{k_1} + ...,    (3.1)

where the exponents k_0 < k_1 < ... are known constants and each C_i is an unknown constant. For a larger step size, mh, the same error formula becomes

    U − U(mh) = C_0 (mh)^{k_0} + C_1 (mh)^{k_1} + ....    (3.2)

Eliminating the exact solution, U, and ignoring higher order terms, we find

    C_0 h^{k_0} = ( U(h) − U(mh) ) / ( m^{k_0} − 1 ) + O(h^{k_1}).    (3.3)

Substituting for C_0 in Equation (3.1) gives

    U = ( m^{k_0} U(h) − U(mh) ) / ( m^{k_0} − 1 ) + O(h^{k_1}).    (3.4)

This process can be repeated to further improve the accuracy. The general recurrence relationship for RE is

    U_0 = U(h),    (3.5)
    U_{i+1}(h) = ( m^{k_i} U_i(h) − U_i(mh) ) / ( m^{k_i} − 1 ),    (3.6)

where U = U_{i+1}(h) + O(h^{k_{i+1}}). Note that RE does not guarantee an improvement; i.e., equations 3.1 and 3.2 suggest that one level of RE will not improve the solution if

    | C_1 (m^{k_0} − m^{k_1}) / (m^{k_0} − 1) | h^{k_1} ≥ | C_0 | h^{k_0}.    (3.7)

Sequential time integration is naturally suited to enhancement with RE because, in general, the error formula for the time integration scheme is known. Define a uniform temporal grid with time-step δt and nodes t_j, j = 0, ..., N_t (non-uniform time grids are discussed in Section 3.2). Further, define a coarse temporal grid with time-step ΔT = mδt and nodes T_j = jΔT, j = 0, 1, ..., N_t/m, for some coarsening factor, m. This is depicted in Figure 1.3. Next, consider a general first-order, ordinary differential equation (ODE) and its corresponding time discretization:

    du/dt = f(u, t),  u(0) = u_0,  t ∈ [0, T],    (3.8)
    u_{j+1} = Φ(u_{j+1}, u_j, δt) + g(t_{j+1}),  j = 0, ..., N_t − 1,    (3.9)

where u_j represents the discrete solution at t = t_j, Φ is an operator representing the chosen time-stepping routine, and g is a time dependent function that incorporates all the solution independent terms. In general, the error formula of a k_g-th order time integration scheme can be easily obtained using Taylor's theorem. The only difficulty is that, in most cases, the expansion coefficients now vary in both space and time. Because of this, RE based time integration must be completed under the assumption that the solution is sufficiently smooth and that δt is small, so that each C_i(x, t) can be assumed to be constant in time. To minimize the length of the domain on which these assumptions must hold, RE based time integration is usually completed using a discrete set of sequentially linked temporal subdomains. In the notation given above, each temporal subdomain corresponds to a coarse grid interval with time step δt and coarsening factor m.

Solving the system presented in equation 3.9 with a single-stage (i.e., i = 1 in equation 3.6) RE based time integration scheme equates to solving the following system of equations:

    u_{f,mj+1} = Φ(u_{*,mj}, δt) + g_{mj+1},    (3.10)
    u_{f,mj+2} = Φ(u_{f,mj+1}, δt) + g_{mj+2},    (3.11)
        ⋮    (3.12)
    u_{f,m(j+1)} = Φ(u_{f,m(j+1)−1}, δt) + g_{m(j+1)},    (3.13)
    u_{c,m(j+1)} = Φ(u_{*,mj}, mδt) + g_{m(j+1)},    (3.14)
    u_{*,m(j+1)} = ( m^{k_g} u_{f,m(j+1)} − u_{c,m(j+1)} ) / ( m^{k_g} − 1 ),    (3.15)

for j = 0, 1, 2, ..., N_t/m − 1, with u_{*,0} = u_{f,0} and g_{mj} = g(t_{mj}). The notation u_{f/c/*,a} indicates a fine grid (f), coarse grid (c), or RE enhanced (*) solution at time t = t_a. Algorithm 4 outlines a standard single-stage RE based time integration scheme, and Figure 3.1 provides a graphical representation of this approach for a single-stage RE based time integration scheme with m = 2.

Algorithm 4 RE based time integration(u, g(x, t), Φ, t_stop)
1: While t < t_stop do:
2:   Take a time step of size mδt: u_c = Φ(u, mδt) + g(t + mδt).
3:   Take m time steps of size δt to get u_f = Φ(Φ(...(Φ(u, δt) + g(t + δt))...), δt) + g(t + mδt).
4:   Calculate the enhanced solution: u = (m^{k_g} u_f − u_c)/(m^{k_g} − 1).
5:   t ← t + mδt.
6: Go to Step 1.

Figure 3.1: Time integration with Richardson extrapolation and a coarsening factor of 2, linking the fine solutions u_{f,1}, ..., u_{f,6}, the coarse solutions u_{c,2}, u_{c,4}, u_{c,6}, and the enhanced solutions u_{*,2}, u_{*,4}, u_{*,6}. The notation u_{f/c/*,a} indicates a fine grid (f), coarse grid (c), or RE enhanced (*) solution at time t = t_a.
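As a concrete check of equations (3.10)-(3.15), the sketch below applies single-stage RE (m = 2) to backward Euler (k_g = 1) for the scalar test problem u' = -u, an illustrative assumption; the observed error then decreases at second order rather than backward Euler's first order.

    import numpy as np

    def backward_euler(u, lam, dt, n_steps):
        # Backward Euler for u' = lam*u: u_{j+1} = u_j / (1 - lam*dt).
        for _ in range(n_steps):
            u = u / (1.0 - lam * dt)
        return u

    def re_integrate(u0, lam, dt, n_coarse, m=2, kg=1):
        # Single-stage RE over each coarse interval, as in Algorithm 4.
        u = u0
        for _ in range(n_coarse):
            uc = backward_euler(u, lam, m * dt, 1)   # one step of size m*dt
            uf = backward_euler(u, lam, dt, m)       # m steps of size dt
            u = (m**kg * uf - uc) / (m**kg - 1)      # enhanced solution (3.15)
        return u

    lam, T, u0 = -1.0, 1.0, 1.0
    for n in (64, 128, 256):
        dt = T / n
        err = abs(re_integrate(u0, lam, dt, n // 2) - np.exp(lam * T))
        print(f"n = {n:4d}   error = {err:.3e}")     # error ~ O(dt^2)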

MGRIT and RE based time integration are both non-intrusive algorithms; MGRIT adds temporal parallelism to existing time integration schemes, whereas RE based time integration improves the accuracy. The goal of this chapter is to develop a method that allows RE based time integration to be used efficiently inside the MGRIT framework. In particular, we develop a non-intrusive, parallel-in-time solver that improves the accuracy of the underlying time integration scheme without a significant increase in the total cost when compared to the standard MGRIT algorithm. This new MGRIT variant is known as RE-MGRIT.

One additional benefit of RE based time integration is that it can be modified to provide a low cost estimate of the local truncation error associated with each time step. In the second section of this chapter, the RE-MGRIT algorithm is extended to allow for automatic error control through time step size manipulation. As we will show, this adaptive algorithm is capable of mimicking the non-uniform temporal grid found using a corresponding adaptive sequential time integration scheme, while also allowing for speedups through temporal concurrency and improved accuracy through RE.

3.1 Two level MGRIT with Richardson Extrapolation

The two-level RE-MGRIT algorithm for linear, time-independent problems (i.e., Φ is constant in time) will now be presented. The nonlinear and time-dependent variations of the RE-MGRIT algorithm follow easily. The extension of RE-MGRIT to non-uniform temporal grids is discussed in Section 3.2. Note that this derivation assumes that the spatial grids are held fixed across all levels. Spatial coarsening with RE-MGRIT is an area of future research.

The Richardson based time-integration problem, equations (3.10)-(3.15), can be written in block triangular form as

in block triangular form as

$$A_\tau \mathbf{u} = \begin{bmatrix}
I & & & & & \\
-\Phi & I & & & & \\
& \ddots & \ddots & & & \\
b\Phi_\Delta & & -a\Phi & I & & \\
& & & -\Phi & I & \\
& & & & \ddots & \ddots
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_m \\ u_{m+1} \\ \vdots \\ u_{2m} \\ \vdots \end{bmatrix} = \bar{g}, \qquad (3.16)$$

where $a = m^{k_g}/(m^{k_g} - 1)$ and $b = 1/(m^{k_g} - 1) = a - 1$ (c.f. Equation (1.13) for MGRIT). Each F-point row applies the fine-grid operator to the previous value, while each C-point row encodes the RE combination of equation 3.15. Note that $\mathbf{u} = [u_0, \ldots, u_{N_t}]$ is a vector of vectors, each of which has a length equal to the number of spatial unknowns at that point in time. RE based time integration is equivalent to a block forward solve of this system. This approach allows for spatial parallelism, but must propagate forward sequentially in time.(1) RE-MGRIT solves this system iteratively, in parallel, using a coarse-grid correction scheme based on multigrid reduction. The exact coarse grid operator is found by eliminating the F-points:

$$A_{\tau,\Delta}\,\mathbf{u}_\Delta = \begin{bmatrix}
I & & & \\
b\Phi_\Delta - a\Phi^m & I & & \\
& \ddots & \ddots & \\
& & b\Phi_\Delta - a\Phi^m & I
\end{bmatrix}
\begin{bmatrix} u_{*,0} \\ u_{*,1} \\ u_{*,2} \\ \vdots \end{bmatrix} = \bar{g}_\Delta. \qquad (3.17)$$

The coarse-grid operator, $A_{\tau,\Delta}$, is referred to as ideal because the exact solution of equation (3.17) yields the exact solution at the coarse points. If this is followed by interpolation (i.e., F-relaxation), then the exact solution is also available at fine points. As a reminder, in this setting, the exact solution is the solution found at that time point using the corresponding sequential time integration scheme. In this case, the exact solution represents the enhanced RE solution, i.e., $u_{*,mj}$ in equation 3.15, for $j = 1, 2, \ldots, N_t/m$.

(1) Technically, the two (or more) solutions can proceed in parallel, however, this is a low level form of temporal parallelism that does not scale.
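As a consistency check on equations 3.16 and 3.17, the following Python sketch (illustrative only, not thesis code) assembles the scalar analogue of $A_\tau$ and verifies that its block forward solve reproduces the RE update of equation 3.15 on the first coarse interval:

import numpy as np

# Scalar analogue of A_tau in (3.16): F-point rows apply -phi to the
# previous value; C-point rows encode the RE combination of (3.15).
nt_c, m, kg = 4, 2, 1                  # coarse intervals, coarsening, order
phi, phiD = 0.9, 0.8                   # stand-ins for Phi and Phi_Delta
a = m**kg / (m**kg - 1.0)
b = 1.0 / (m**kg - 1.0)

n = nt_c * m + 1
A = np.eye(n)
for j in range(1, n):
    if j % m == 0:                     # C-point row
        A[j, j - 1] = -a * phi
        A[j, j - m] += b * phiD
    else:                              # F-point row
        A[j, j - 1] = -phi

g = np.random.rand(n)                  # g[0] plays the role of u_0
u = np.linalg.solve(A, g)              # forward solve == sequential RE stepping

# Cross-check the first coarse interval against equations (3.10)-(3.15).
uf = phi * (phi * u[0] + g[1]) + g[2]
uc = phiD * u[0] + g[2]
assert np.isclose(u[2], a * uf - b * uc)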

The limitation of this exact reduction method is that the coarse-grid problem is, in general, as expensive to solve as the original fine-grid problem (because of the $\Phi^m$ evaluations). Approximating $\Phi^m$ with $\Phi_\Delta$ and noting that $a - b = 1$ gives a cheap approximation to this exact coarse grid operator:

$$B_{\tau,\Delta} = \begin{bmatrix}
I & & & \\
-\Phi_\Delta & I & & \\
& \ddots & \ddots & \\
& & -\Phi_\Delta & I
\end{bmatrix} = B_\Delta, \qquad (3.18)$$

where $B_\Delta$ is the coarse grid operator for the original MGRIT algorithm, first defined in Chapter 1. The goal of this chapter is to use the similarities between RE based time integration and MGRIT to produce a cost efficient, non-intrusive variant of the MGRIT algorithm that permits large scale temporal concurrency while also improving the accuracy of the solution. Assuming the cost to complete a time step is equal on all levels, an approximation of the total cost can be determined by multiplying the number of time steps required per iteration by the number of iterations required for convergence. As with the standard MGRIT algorithm, RE-MGRIT uses FAS multigrid, allowing both linear and nonlinear problems to be solved.

In terms of time step evaluations, the cost of F-relaxation is the same for both RE-MGRIT and MGRIT. Likewise, the number of time step evaluations required to complete the sequential coarse grid solve, restriction (injection) and interpolation (injection followed by F-relaxation) is also fixed. The standard C-relaxation used by the original MGRIT algorithm requires exactly one time step evaluation per C-point (see Figure 1.6). For RE-MGRIT, C-relaxation on the finest grid requires two time steps per C-point, one on the fine grid, and one on the coarse grid.

In total, $N_t/(mP)$ extra time steps per temporal processor are completed per C-relaxation for the RE-MGRIT algorithm, where $P$ is the number of temporal processors. It remains to consider the cost associated with calculating the RHS of the coarse grid equation. The FAS coarse grid problem, defined in Algorithm 1, is: find $\mathbf{v}_\Delta$ such that

$$B_\Delta(\mathbf{v}_\Delta) = B_\Delta R_I \mathbf{u} + R_I(\mathbf{g} - A_\tau \mathbf{u}). \qquad (3.19)$$

Let $g_{mj}$ represent the $mj$th vector of $\mathbf{g}$ and define the notation $[A_\tau \mathbf{u}]_j$ to indicate the $j$th spatial vector in the global vector that is formed after applying the operator $A_\tau$ to $\mathbf{u}$. Then, the $j$th row of the coarse grid problem is

$$[B_\Delta(\mathbf{v}_\Delta)]_j = [B_\Delta R_I \mathbf{u}]_j + [R_I(\mathbf{g} - A_\tau \mathbf{u})]_j \qquad (3.20)$$
$$= [B_\Delta R_I \mathbf{u}]_j + g_{mj} - [A_\tau \mathbf{u}]_{mj} \qquad (3.21)$$
$$= u_{mj} - \Phi_\Delta u_{m(j-1)} + g_{mj} - [A_\tau \mathbf{u}]_{mj} \qquad (3.22)$$
$$= u_{mj} - \Phi_\Delta u_{m(j-1)} + g_{mj} + a\Phi u_{mj-1} - u_{mj} - b\Phi_\Delta u_{m(j-1)} \qquad (3.23)$$
$$= g_{mj} + a\Phi u_{mj-1} - (1 + b)\Phi_\Delta u_{m(j-1)} \qquad (3.24)$$
$$= g_{mj} + a(\Phi u_{mj-1} - \Phi_\Delta u_{m(j-1)}) \qquad (3.25)$$
$$= g_{mj} + a(u_{mj} - \Phi_\Delta u_{m(j-1)} - u_{mj} + \Phi u_{mj-1}) \qquad (3.26)$$
$$= g_{mj} + a([B_\Delta R_I \mathbf{u}]_j - [R_I A \mathbf{u}]_j), \qquad (3.27)$$

where $j = 1, 2, \ldots, N_t/m$ and the operator $A$ is the time integration operator defined in equation (1.13) for the original MGRIT algorithm. Given this, the RE-MGRIT coarse grid problem is: find $\mathbf{v}_\Delta$ such that

$$B_\Delta(\mathbf{v}_\Delta) = R_I \mathbf{g} + a(B_\Delta R_I \mathbf{u} - R_I A \mathbf{u}) \qquad (3.28)$$
$$= R_I \mathbf{g} + a\tau(\mathbf{u}), \qquad (3.29)$$

where $\tau(\mathbf{u}) = B_\Delta R_I \mathbf{u} - R_I A \mathbf{u}$. For the standard MGRIT algorithm, the coarse grid equation

is

$$B_\Delta(\mathbf{v}_\Delta) = B_\Delta R_I \mathbf{u} + R_I(\mathbf{g} - A\mathbf{u}) \qquad (3.30)$$
$$= R_I \mathbf{g} + (B_\Delta R_I \mathbf{u} - R_I A \mathbf{u}) \qquad (3.31)$$
$$= R_I \mathbf{g} + \tau(\mathbf{u}). \qquad (3.32)$$

Thus, two-level RE-MGRIT is identical to the standard MGRIT algorithm with the exception that the second term in the coarse grid RHS is scaled by the factor $a = m^{k_g}/(m^{k_g} - 1)$. The $\tau$-term can be interpreted as a correction to the coarse grid solution, $\mathbf{v}_\Delta$, to make it coincide with the fine grid solution, $\mathbf{u}$. Multiplying this term by the factor $a = m^{k_g}/(m^{k_g} - 1)$ forces the coarse grid solution to better approximate the true solution. Note that this result is equivalent to the result obtained in [2]. In that case, the extrapolation principle is used with spatial FAS multigrid to improve the approximation properties of the spatial solver, resulting in an algorithm known as $\tau$-extrapolation. In terms of time step evaluations, this means that the cost to calculate the coarse grid RHS is identical for both methods.

In summary, two-level RE-MGRIT with F-relaxation requires exactly the same number of time steps per iteration as the standard MGRIT algorithm. If FCF relaxation is used, two-level RE-MGRIT requires at most $N_t/(mP)$ extra time steps per processor per iteration. Algorithm 5 outlines the two-level RE-MGRIT algorithm. The multilevel extension of the two-level RE-MGRIT algorithm is discussed later in this chapter.

Remark: By default, XBraid uses the norm of the residual to determine when to stop iterating. The RE-MGRIT algorithm does not directly calculate the residual, however it can be obtained using a single vector addition, the cost of which is most likely negligible when compared to the cost of a MGRIT iteration.

It is important to note that the original RE equations were derived under the assumption that the solution is smooth and that the time step size is small. Of course, this is not always true, and as such, RE based time integration cannot be used in all situations.
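The following scalar Python sketch (illustrative, not thesis code) makes this comparison concrete: after F-relaxation, the coarse-grid right-hand sides of equations 3.29 and 3.32 differ only by the constant factor $a$.

import numpy as np

# After F-relaxation, u_{mj-1} = Phi^{m-1} u_{m(j-1)}, so the tau-correction
# tau_j = [B_D R_I u]_j - [R_I A u]_j reduces to (Phi^m - Phi_D) u_{m(j-1)}.
gamma, dt, m, kg = 2.0, 0.05, 4, 1
lam = 1.0 / (1.0 + dt * gamma)        # eigenvalue of Phi (backward Euler)
mu = 1.0 / (1.0 + m * dt * gamma)     # eigenvalue of Phi_Delta
a = m**kg / (m**kg - 1.0)

u_c = np.random.rand(9)               # C-point values of the current iterate
tau = (lam**m - mu) * u_c[:-1]        # tau_j for j = 1, ..., 8

rhs_mgrit = tau                       # MGRIT coarse RHS (3.32), with g = 0
rhs_re_mgrit = a * tau                # RE-MGRIT coarse RHS (3.29)
print(rhs_re_mgrit / rhs_mgrit)       # the constant factor a at every C-point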

Algorithm 5 RE-MGRIT(A, u, g)
1: Apply F- or FCF-relaxation to A(u) = g.
2: Calculate u_{f,mi} = Φ(u_{mi−1}) + g_{mi}.
3: Calculate u_{c,mi} = Φ_Δ(u_{m(i−1)}) + g_{mi}.
4: Inject u_{f,mi} and u_{c,mi} to the coarse grid: û_{Δ,i} ← u_{f,mi}, ū_{Δ,i} ← u_{c,mi}.
5: Solve B_Δ(w_Δ) = g_Δ + a(û_Δ − ū_Δ).
6: Compute the error approximation: e_{Δ,i} = w_{Δ,i} − û_{Δ,i}.
7: Correct u at C-points: u_{mi} = u_{f,mi} + e_{Δ,i}.
8: If converged, then update F-points: apply F-relaxation to A(u) = g.
9: Else go to step 1.

While the stability and error analysis of RE based time integration may be difficult, it is in no way altered by the RE-MGRIT algorithm. As such, RE-MGRIT can return a stable solution with improved convergence order if sequential time integration with RE also returns a stable solution with improved convergence order. This ability to piggyback off existing analysis is a major advantage of the non-intrusive approach to parallel time integration employed by both MGRIT algorithms. As such, determining the applicability of RE based time integration to a particular problem is beyond the scope of this thesis. Rather, our goal is to show that, when applicable, RE time integration can be used with MGRIT to dramatically improve the APCC of the MGRIT algorithm.

Convergence Analysis

In the previous section it was demonstrated that, in terms of the total number of time step evaluations completed, the costs associated with each iteration of the RE-MGRIT and MGRIT algorithms are very similar. In this section we compare the convergence rates of the two methods. In particular, the analysis in this section develops a convergence bound for the RE-MGRIT algorithm for linear, time-independent problems on an evenly spaced time grid with no spatial coarsening. For comparison, two analytic bounds on the convergence rate of the standard MGRIT algorithm are given. These bounds were first derived in [50].

Consider solving equation 3.9 on the temporal grid shown in Figure 1.3 and assume $g(x, t) = 0$. The case with $g(x, t)$ nonzero is similar. Next, assume $\Phi$ and $\Phi_\Delta$ are diagonalized by the same unitary transform,

$$\hat{\Phi} = X^* \Phi X = \mathrm{diag}(\lambda_1, \ldots, \lambda_{N_x}), \qquad (3.33)$$
$$\hat{\Phi}_\Delta = X^* \Phi_\Delta X = \mathrm{diag}(\mu_1, \ldots, \mu_{N_x}), \qquad (3.34)$$

where $N_x$ represents the number of spatial unknowns associated with each time-step, $X = (x_1, \ldots, x_{N_x})$ and $X^* X = I$. Finally, define $e_j$ to be the error at $t = t_j$ between the current approximation to the solution, $\hat{u}_j$, and the exact discrete solution, $u_j$. Then, the global space-time error vector at the C-points, $\bar{e} = [e_0^T, e_m^T, \ldots, e_{N_t}^T]^T$, satisfies

$$\|E^F \bar{e}\|_2 \leq \max_{\omega=1,2,\ldots,N_x} \left\{ |\lambda_\omega^m - \mu_\omega| \, \frac{1 - |\mu_\omega|^{N_t/m}}{1 - |\mu_\omega|} \right\} \|\bar{e}\|_2, \qquad (3.35)$$

for MGRIT with F-relaxation, and

$$\|E^{FCF} \bar{e}\|_2 \leq \max_{\omega=1,2,\ldots,N_x} \left\{ |\lambda_\omega^m - \mu_\omega| \, \frac{1 - |\mu_\omega|^{N_t/m}}{1 - |\mu_\omega|} \, |\lambda_\omega|^m \right\} \|\bar{e}\|_2, \qquad (3.36)$$

for MGRIT with FCF relaxation [50]. Here $E^{F/FCF}$ denotes the coarse grid error propagation matrices for MGRIT with F and FCF relaxation. In what follows, similar estimates for the RE-MGRIT algorithm are derived. These derivations mimic those found in [50] for the MGRIT algorithm. Following F-relaxation, the residual at $t = t_{mj}$ for the RE-MGRIT method is

$$r_{mj} = -[A(\hat{u})]_{mj} \qquad (3.37)$$
$$= -\hat{u}_{mj} + (a\Phi^m - b\Phi_\Delta)\hat{u}_{m(j-1)}. \qquad (3.38)$$

The exact solution satisfies $u_{mj} = (a\Phi^m - b\Phi_\Delta)u_{m(j-1)}$, hence, the residual can be rewritten in terms of the error, $e_{mj} = u_{mj} - \hat{u}_{mj}$, as

$$r_0 = e_0, \qquad (3.39)$$
$$r_{mj} = e_{mj} - (a\Phi^m - b\Phi_\Delta)e_{m(j-1)}, \qquad (3.40)$$

for $j = 1, 2, \ldots, N_t/m$. For two-level RE-MGRIT, the coarse grid problem is solved using forward substitution. This is an $O(N_t)$ sequential operation. The extension of RE-MGRIT to multiple levels is discussed later in this chapter. The exact coarse grid correction, $c_{mj}$, is given by

$$c_0 = r_0, \qquad (3.41)$$
$$c_{mj} = \Phi_\Delta(c_{m(j-1)}) + r_{mj} \qquad (3.42)$$
$$= \Phi_\Delta^j(r_0) + \Phi_\Delta^{j-1}(r_m) + \cdots + r_{mj}, \qquad (3.43)$$

for $j = 1, 2, \ldots, N_t/m$. Substituting for $r_{mj}$ using equations 3.39 and 3.40 gives

$$c_0 = e_0, \qquad (3.44)$$
$$c_{mj} = \Phi_\Delta^j(e_0) + \Phi_\Delta^{j-1}\big(e_m + (b\Phi_\Delta - a\Phi^m)e_0\big) + \Phi_\Delta^{j-2}\big(e_{2m} + (b\Phi_\Delta - a\Phi^m)e_m\big) + \cdots + e_{mj} + (b\Phi_\Delta - a\Phi^m)e_{m(j-1)}. \qquad (3.45)$$

Collecting like terms and noting that $b + 1 = a$ gives

$$c_0 = e_0, \qquad (3.46)$$
$$c_{mj} = a\Phi_\Delta^{j-1}(\Phi_\Delta - \Phi^m)e_0 + a\Phi_\Delta^{j-2}(\Phi_\Delta - \Phi^m)e_m + \cdots + a(\Phi_\Delta - \Phi^m)e_{m(j-1)} + e_{mj}. \qquad (3.47)$$

After the correction, the updated error at the C-points, $f_{mj} = e_{mj} - c_{mj}$, satisfies

$$f_0 = 0, \qquad (3.48)$$
$$f_{mj} = a\sum_{q=0}^{j-1} \Phi_\Delta^{j-1-q}(\Phi^m - \Phi_\Delta)e_{mq}, \qquad (3.49)$$

for $j = 1, 2, \ldots, N_t/m$. The unitary transformation given in equations 3.33 and 3.34 can be used to decompose $e_{mj}$ as $e_{mj} = \sum_{\omega=1}^{N_x} \hat{e}_{mj,\omega} x_\omega$ and $f_{mj}$ as $f_{mj} = \sum_{\omega=1}^{N_x} \hat{f}_{mj,\omega} x_\omega$. Substituting

these decompositions into equations 3.48 and 3.49 gives

$$\hat{f}_{0,\omega} = 0, \qquad (3.50)$$
$$\hat{f}_{mj,\omega} = a\sum_{q=0}^{j-1} \mu_\omega^{j-1-q}(\lambda_\omega^m - \mu_\omega)\hat{e}_{mq,\omega}, \qquad (3.51)$$

for $j = 1, 2, \ldots, N_t/m$. Defining $\hat{e}_\omega = (\hat{e}_{0,\omega}, \hat{e}_{m,\omega}, \ldots, \hat{e}_{N_t,\omega})^T$ and $\hat{f}_\omega = (\hat{f}_{0,\omega}, \hat{f}_{m,\omega}, \ldots, \hat{f}_{N_t,\omega})^T$ gives $\hat{f}_\omega = E_\omega \hat{e}_\omega$, where

$$E_\omega = a(\lambda_\omega^m - \mu_\omega)\begin{bmatrix}
0 & & & & \\
1 & 0 & & & \\
\mu_\omega & 1 & 0 & & \\
\vdots & & \ddots & \ddots & \\
\mu_\omega^{N_t/m - 1} & \cdots & \mu_\omega & 1 & 0
\end{bmatrix}.$$

Notice how the eigenmodes are decoupled. This allows us to calculate a convergence bound for each eigenmode independently. The overall RE-MGRIT convergence bound is given by the maximum bound over all the eigenmodes. A convergence bound for each eigenmode is given by

$$\|E_\omega\|_1 = \|E_\omega\|_\infty = a\,|\lambda_\omega^m - \mu_\omega| \, \frac{1 - |\mu_\omega|^{N_t/m}}{1 - |\mu_\omega|}.$$

It is straightforward to apply Theorem 3.3 from [50] to obtain a bound on the global error vector, $\bar{e} = [e_0^T, e_m^T, \ldots, e_{N_t}^T]^T$, for RE-MGRIT with F-relaxation:

$$\|\bar{E}^F \bar{e}\|_2 \leq \max_{\omega=1,2,\ldots,N_x} \left\{ \frac{m^{k_g}}{m^{k_g} - 1}\, |\lambda_\omega^m - \mu_\omega| \, \frac{1 - |\mu_\omega|^{N_t/m}}{1 - |\mu_\omega|} \right\} \|\bar{e}\|_2. \qquad (3.52)$$

A similar calculation gives the global error bound for RE-MGRIT with FCF relaxation:

$$\|\bar{E}^{FCF} \bar{e}\|_2 \leq \max_{\omega=1,2,\ldots,N_x} \left\{ \frac{m^{k_g}}{m^{k_g} - 1}\, |\lambda_\omega^m - \mu_\omega| \, \frac{1 - |\mu_\omega|^{N_t/m}}{1 - |\mu_\omega|} \, |\lambda_\omega|^m \right\} \|\bar{e}\|_2. \qquad (3.53)$$

Singly Diagonally Implicit Runge-Kutta (SDIRK) Methods

To understand these bounds, we examine the convergence rate of the RE-MGRIT algorithm for various coarse and fine grid time integrators. Consider solving

$$u' + Gu = b, \qquad (3.54)$$

where $G$ is a symmetric positive definite matrix with real eigenvalues, $\gamma_\omega$, using the first, second and third order Singly Diagonally Implicit Runge-Kutta (SDIRK) time-integration schemes.

Figure 3.2: Butcher tableaux for SDIRK-1, SDIRK-2 and SDIRK-3.

The first order SDIRK-1 method is equivalent to the backward Euler method. In that case, the fine and coarse grid operators are $\Phi = (I + \delta t G)^{-1}$ and $\Phi_\Delta = (I + m\delta t G)^{-1}$, respectively. Likewise, the fine and coarse grid eigenvalues are

$$\lambda_\omega = \frac{1}{1 + \delta t \gamma_\omega}, \qquad \mu_\omega = \frac{1}{1 + m\delta t \gamma_\omega},$$

respectively. As defined above, the operator $G$ is a symmetric positive definite matrix, hence, for $\omega = 1, 2, \ldots, N_x$, $\gamma_\omega \in (0, \infty)$, $\lambda_\omega \in (0, 1)$ and $\mu_\omega \in (0, 1)$. Figures 3.3 and 3.4 plot the convergence bound, $\|E_\omega\|$, vs. $\kappa = \delta t \gamma_\omega$ for $\gamma_\omega > 0$. For SDIRK-1, the convergence bound of the RE-MGRIT algorithm (i.e., the maximum value on each plot) is only slightly worse than that of the original MGRIT algorithm. For SDIRK-2 and SDIRK-3 the convergence bounds are almost identical. Moreover, for FCF relaxation and a larger $m$, the convergence bounds are essentially identical.
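To visualize the bound, the sketch below (Python, illustrative, not thesis code) evaluates equation 3.52 for SDIRK-1 using the eigenvalue formulas above; sweeping $\kappa$ reproduces the shape of the curves in Figure 3.3:

import numpy as np

# Two-level RE-MGRIT bound (3.52) for SDIRK-1 (backward Euler), where
# lambda = 1/(1 + kappa) and mu = 1/(1 + m*kappa) with kappa = dt*gamma.
def re_mgrit_bound_F(kappa, m, n_coarse, kg=1):
    lam = 1.0 / (1.0 + kappa)
    mu = 1.0 / (1.0 + m * kappa)
    a = m**kg / (m**kg - 1.0)
    return a * abs(lam**m - mu) * (1.0 - mu**n_coarse) / (1.0 - mu)

kappas = np.logspace(-3, 3, 400)
bounds = [re_mgrit_bound_F(k, m=4, n_coarse=256) for k in kappas]
print(max(bounds))   # worst-case two-level convergence factor over kappa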

The take home message here is that the RE-MGRIT algorithm is capable of increasing the convergence order of the underlying time-integration scheme without significantly increasing the convergence factor, especially for higher order methods. Combined with the limited increase in time step evaluations, this gives the RE-MGRIT algorithm a sizable advantage when considering the accuracy per computational cost of the method.

Figure 3.3: $\|E_\omega\|$ vs $\kappa = \delta t \gamma_\omega$ for the SDIRK algorithms with RE-MGRIT and F-relaxation. Left: $m = 4$. Right: $m = 16$.

Figure 3.4: $\|E_\omega\|$ vs $\kappa = \delta t \gamma_\omega$ for the SDIRK algorithms with RE-MGRIT and FCF relaxation. Left: $m = 4$. Right: $m = 16$.

Numerical Example: First Order ODE

Consider the following IVP:

$$y' + 4y = 1 - t, \quad t \in [0, 1], \qquad (3.55)$$
$$y(0) = 1, \qquad (3.56)$$

with exact solution $y(t) = (1/16)(-4t + 11e^{-4t} + 5)$. Figure 3.5 plots the error against time for RE-MGRIT with the SDIRK-1 and SDIRK-2 methods and 128 time steps. The error profile appears discontinuous because the RE-MGRIT algorithm corrects the solution at each C-point. Notice that in all but a few local regions (i.e., near $t = 0.45$ for RE-MGRIT with SDIRK-2), the accuracy of the RE-MGRIT solution degrades as the coarsening factor grows. This should not be surprising: the RE error formula suggests that the error of the improved solution varies as $m^{k_g}(1 - m)/(m^{k_g} - 1)\,C_1\,\delta t^{k_g+1} = -m\,C_1\,\delta t^{k_g+1}$ when $k_g = 1$. This is an unfortunate result because the analysis in the previous section predicts that, at least for SDIRK-1, decreasing the coarsening factor leads to a degradation of the convergence rate. Any exceptions to this are likely due to a fortuitous cancellation of errors. As mentioned many times throughout this chapter, this behavior occurs independently of the RE-MGRIT algorithm; the corresponding solutions calculated using sequential time integration with RE behave identically.

Figure 3.6 gives a scaling study, with respect to the time step size, for the error at $t = 0.5$. For SDIRK-1, a globally first order method, RE-MGRIT scales with an accuracy equivalent to a 2nd order method. Likewise, RE-MGRIT improves the accuracy of the second order SDIRK-2 algorithm to that of a third order method.

Figure 3.5: Error against time for the solution of equations 3.55-3.56 using RE-MGRIT and 128 time steps. Left: SDIRK-1. Right: SDIRK-2. The Richardson based error corrections occur only at the C-points.

Figure 3.6: Scaling study for RE-MGRIT applied to equations 3.55-3.56. Left: SDIRK-1. Right: SDIRK-2.

Numerical Example: 1D Heat Equation

To add some complexity, RE-MGRIT was applied to the one dimensional heat equation with Dirichlet boundary conditions:

$$u_t - u_{xx} = f(x, t), \quad x \in [0, \pi], \quad t \in [0, 2\pi], \qquad (3.57)$$
$$u(0, t) = g_1(t), \quad u(\pi, t) = g_2(t), \quad t \in [0, 2\pi], \qquad (3.58)$$
$$u(x, 0) = h(x), \quad x \in [0, \pi]. \qquad (3.59)$$

The source term, $f(x, t)$, as well as the boundary terms, $g_1(t)$ and $g_2(t)$, and the initial

condition, $h(x)$, were chosen to prescribe the exact solution $u_{exact}(x, t) = \sin(x)\cos(t)$. The temporal discretization uses the SDIRK-1 algorithm, while the spatial discretization uses standard central differences. A direct spatial solver was used, limiting parallelism to the temporal domain.

Figure 3.7 plots the error against time for RE-MGRIT with SDIRK-1, 128 time steps, and a deliberately fine spatial mesh. The error represents the $\ell_2$ norm of the difference between the exact and approximate solutions, i.e., $E_j = \|u_j - \hat{R}u(x, t_j)\|_{\ell_2}$, where $\hat{R}$ restricts the exact solution onto the discrete spatial mesh. A disproportionate number of spatial grid points was used so that the second order errors associated with the spatial discretization did not affect the overall result.

The reader will notice that RE-MGRIT performs worse than the original solver near $t = 2$ and $t = 5$. As discussed above, the RE-MGRIT algorithm converges to the exact solution found using a sequential implementation of RE. In other words, the corresponding sequential RE based time-integration scheme behaves identically. Assessing the suitability of RE for solving particular ODEs and PDEs is beyond the scope of this thesis.

Figure 3.7 also shows a scaling study for the error at the final time, $t = 2\pi$. The number of spatial grid points was held fixed for all tests. As in the previous example, the RE-MGRIT algorithm returned a solution with second order accuracy, the magnitude of which depends on the coarsening factor. Also shown is a scaling study for time integration with SDIRK-2. Clearly, the accuracy of the solution obtained with RE-MGRIT is less than that of the SDIRK-2 method. This suggests that, if accuracy is the most important aspect of the solution, the user would be better off switching to a higher order time integration scheme. It is important to note, however, that a single step of SDIRK-2 is twice as expensive as a single time step with SDIRK-1. In Section 3.2, we provide a detailed comparison of the APCC associated with each method. On top of this, there is nothing to say that RE-MGRIT cannot be used to

further improve the accuracy of the SDIRK-2 method.

Figure 3.7: Scaling study for the 1D heat equation with RE-MGRIT. Left: error against time for $N_t = 128$. Right: error at $t = 2\pi$ against $N_t$.

Convergence Rate of Two-level RE-MGRIT

Tables 3.1 and 3.2 show the convergence rates of a SDIRK-1 implementation of RE-MGRIT and MGRIT with both F and FCF relaxation. The final column, labeled Est., shows the estimated convergence factor as derived in equations 3.52 and 3.53. These numerically determined convergence factors represent the average convergence factor over the last 5 iterations for runs using two level (RE)-MGRIT, solved to a residual tolerance of $10^{-10}$, with a random initial guess. These tables highlight two important facts. First, in all cases, the analysis in the previous section provided a good bound on the convergence rate. Second, the convergence rate of both the RE-MGRIT and MGRIT algorithms does not vary noticeably with respect to the problem size. In other words, the two-level RE-MGRIT algorithm exhibits excellent weak scalability.

Table 3.1: Two-level numerical convergence factors for (RE)-MGRIT with F-relaxation, for various numbers of time steps and coarsening factors $m$. The final column indicates the estimated convergence factor derived in the previous section.

Table 3.2: Two-level numerical convergence factors for (RE)-MGRIT with FCF relaxation, for various numbers of time steps and coarsening factors $m$. The final column indicates the estimated convergence factor derived in the previous section.

Extension to a Multilevel Algorithm

The two-level RE-MGRIT algorithm can be extended to multiple levels in several ways. Of course, with more levels comes more opportunities to further enhance the solution with RE, however, in a parallel setting, the extra coarse grid time steps required for multilevel RE are not free. For this reason, the multi-level RE-MGRIT algorithm uses one level of RE-based time integration. This choice also simplifies the multi-level extension of the RE-MGRIT algorithm. In fact, because the coarse grid operator, $B_\Delta$, is identical for RE-MGRIT and MGRIT, the coarse grid equations are exactly the same for all levels beyond the first

coarse grid. As such, the multi-level RE-MGRIT algorithm uses the two-level RE-MGRIT algorithm on the fine grid, but solves the coarse grid problem using a recursive call to the original MGRIT algorithm. Table 3.3 presents the number of iterations required for the multilevel MGRIT and RE-MGRIT algorithms to converge to within a fixed residual tolerance when solving the 1D problem introduced in the previous section.

Table 3.3: Iterations required for convergence of multilevel (RE)-MGRIT with SDIRK-1, for F and FCF relaxation and various coarsening factors $m$ and numbers of time steps.

In general, the MGRIT algorithm appears to converge faster than the RE-MGRIT algorithm. On the other hand, the RE-MGRIT algorithm returns a solution with improved accuracy. For a fair comparison, one must consider the accuracy per computational cost of the algorithm (see below). Either way, the iteration counts associated with the FCF based variant of RE-MGRIT appear to be bounded independently of the problem size.

Strong Scaling Analysis

The RE-MGRIT algorithm is an $O(N_t)$ algorithm with a constant that is larger than that of the corresponding sequential solver. This leads to a cross over point where the extra concurrency accounts for the additional work. Figure 3.8 presents a strong scaling study for RE-MGRIT applied to the 1D heat equation. In all cases, the problem size included 16384

time steps; the spatial mesh was held fixed. Spatial parallelism was not used. Two sequential data points are also presented. The first, labeled sequential, NO-RE, is the wall-clock time for the standard sequential solver. The second sequential solution, labeled sequential, RE, is the wall-clock time for a sequential RE based time integration scheme with $m = 4$ (left) and $m = 16$ (right).

The time to solution, and scaling behavior in general, for the RE-MGRIT and MGRIT methods is very similar. As expected, both RE-MGRIT and MGRIT performed better when FCF relaxation was used. Comparing the two figures, we see that a larger coarsening factor ($m = 16$) leads to an earlier crossover point. Reducing the coarsening factor delays the cross-over point, but allows for greater temporal concurrency and hence, greater speedups. Overall, the largest speedup was obtained for $m = 4$ at the largest temporal processor count. To be precise, the speedup was 19x when comparing RE-MGRIT with FCF relaxation to sequential time integration without RE. The speedup was 20x when comparing RE-MGRIT with FCF relaxation to sequential time integration with RE.

Figure 3.8: Strong scaling study for RE-MGRIT applied to the 1D heat equation. Left: $m = 4$. Right: $m = 16$.

The strength of the RE-MGRIT algorithm is that it returns a solution with improved accuracy. To show this, the APCC of each algorithm was calculated. For this comparison, cost is measured in seconds and the accuracy is defined to be $\log(1/err)$, where $err$ is the

error at time $t = 2\pi$. In this way, the APCC measures the digits of accuracy per second of run-time. Figure 3.9 plots the digits of accuracy per second (APS) for the RE-MGRIT and MGRIT algorithms. In all cases, as predicted by the analysis in the previous section, the RE-MGRIT algorithm dramatically increased the APCC of the associated method.

Figure 3.9: Digits of accuracy per second for the RE-MGRIT and MGRIT algorithms applied to the 1D diffusion equation. Left: $m = 4$. Right: $m = 16$.

Another, arguably more interesting, comparison is found by comparing the results for the second order MGRIT with SDIRK-2 approach (blue lines) and the second order RE-MGRIT with SDIRK-1 approach (magenta lines). The analysis presented earlier in this section suggests that the convergence factor of MGRIT with SDIRK-2 will be about half that of the RE-MGRIT algorithm with SDIRK-1. On the other hand, each SDIRK-2 time integration step requires two spatial inverses, compared to one spatial inverse for SDIRK-1. Overall, MGRIT with SDIRK-2 outperforms RE-MGRIT with SDIRK-1. As such, it appears that switching to a higher order time integration scheme is still a better approach to increasing the APCC of the method. Of course, there is nothing to say that RE-MGRIT could not then be used with that higher order scheme to increase the accuracy even further. For users who are unable to alter the time integration scheme, RE-MGRIT is an excellent approach that can dramatically improve the APCC of the method, allowing the user to achieve a greater accuracy in the same wall clock time, or, to achieve a similar accuracy in a reduced runtime

by using fewer time steps.

3.2 Adaptive Time Integration with a RE Based Error Estimate

One of the first steps in the numerical simulation of time dependent PDEs is to determine an appropriate space-time grid. For well behaved problems a uniform grid is often satisfactory; however, for problems with local singularities, e.g., shock fronts, discontinuities and steep gradients, the computational cost associated with a uniform space-time grid that also resolves fine scale behavior is, most often, prohibitive. In these situations, a non-uniform space-time grid is appropriate, however, it is often difficult to determine this grid a-priori. Instead, some form of dynamic, adaptive mesh refinement (AMR) must be used. AMR can be completed in both space and time. Spatial AMR with MGRIT is a topic of current research, but is beyond the scope of this thesis. The most common approach to temporal AMR is adaptive time integration through local error control.

Assuming the solution at $t = t_{j-1}$ is exact, the local truncation error (LTE) introduced by a single step from $t = t_{j-1}$ to $t = t_j$ is

$$u_j - u(t_j) = \frac{\delta t^{k_l}}{k_l!}\, u^{(k_l)}(x, \psi_j), \qquad (3.60)$$

where $k_l$ is the local order of the time-integration scheme (e.g., $k_l = 2 = k_g + 1$ for backward Euler), the superscript $(k_l)$ denotes the $k_l$th derivative and $\psi_j$ is some constant in $(t_{j-1}, t_j)$. The LTE can be controlled by changing the order of the time integration scheme and/or by changing the size of the time step taken. In an effort to remain non-intrusive, this work focuses on error control through adaptive time-step manipulation. The optimal time step size is chosen so that the local error at each step is bounded within some prescribed tolerance, $\epsilon_t$. Let $E$ be an estimate of the local truncation error. Then, the optimal time step is given by

$$\delta t_{opt} = \theta_s\, \delta t\, \min\left(\max\left[\left(\frac{\epsilon_t}{E}\right)^{1/k_l}, \theta_{min}\right], \theta_{max}\right), \qquad (3.61)$$

where $(\theta_s, \theta_{min}, \theta_{max})$ are safety parameters designed to avoid large changes in the time step size.
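A direct transcription of equation 3.61 might look as follows (a Python sketch; the safety-parameter values are illustrative defaults, not values from the thesis):

def optimal_dt(dt, err_est, eps_t, k_local,
               theta_s=0.9, theta_min=0.1, theta_max=2.0):
    # Equation (3.61): clip the raw factor (eps_t/E)^(1/k_l) into
    # [theta_min, theta_max] to avoid large changes in the step size.
    factor = (eps_t / err_est) ** (1.0 / k_local)
    return theta_s * dt * min(max(factor, theta_min), theta_max)

# Example: an estimate four times larger than the tolerance with k_l = 2
# roughly halves the step (times the safety factor): prints 0.0045.
print(optimal_dt(0.01, err_est=4e-6, eps_t=1e-6, k_local=2))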

Algorithm 6 outlines a classical sequential time integration scheme with adaptive error control.

Algorithm 6 Adaptive sequential solver(u, g, Φ, ε_t)
1: While (t < t_final) do.
2: Take a time step: û = Φ(u, t + δt, δt) + g(t + δt).
3: Estimate the local truncation error, E.
4: Calculate δt_opt using equation 3.61.
5: If E > ε_t then δt ← δt_opt.
6: Else u ← û, t ← t + δt, δt ← δt_opt.
7: Goto Step 1.

The defining feature of an adaptive sequential solver is the error estimation process (step 3 in Algorithm 6). There are numerous approaches to temporal error estimation, e.g., [11, 12, 15, 19, 26, 49]. In this chapter we focus on RE based error estimation. The original RE algorithm uses solutions obtained on different grids to approximate and eliminate the lowest order error term. One consequence of this approach is that one can obtain, at no extra cost, an estimate of the local truncation error in each step. Based on equation 3.3, the local truncation error introduced by a single stage RE based time integration step is

$$E = \frac{1}{m^{k} - 1}\left(u_f - u_c\right), \qquad (3.62)$$

where $u_f$ and $u_c$ are the fine and coarse grid solutions. Algorithm 7 outlines a standard, adaptive sequential solver with RE based error estimation and temporal integration. Note that this approach simultaneously performs RE and adaptive time integration without increasing the cost of an individual RE based time integration step. In this section, we show how a similar embedded error estimate can be calculated, for free, during each RE-MGRIT iteration.

One limitation of the global approach to time integration employed by both MGRIT methods is that the temporal grids must remain fixed throughout each iteration. As such, the sequential approach to adaptive time integration, i.e., building the grid sequentially in time, cannot be used.

Algorithm 7 Adaptive RE based time integration(u_0, g(x, t), Φ, t_stop, ε_t)
1: While (t < t_stop) do.
2: Take a time step of size mδt: u_c = Φ(u, mδt) + g(t + mδt).
3: Take m time steps of size δt to get u_f = Φ(Φ(... (Φ(u) + g(t + δt))...)) + g(t + mδt).
4: Estimate the local truncation error using equation 3.62.
5: Calculate δt_opt using equation 3.61.
6: If E > ε_t then δt ← δt_opt.
7: Else t ← t + mδt, δt ← δt_opt and u = (m^{k_g} u_f − u_c)/(m^{k_g} − 1).
8: Goto Step 1.

Instead, the non-uniform temporal grids must be built from the ground up using FMG. In particular, we use a specific type of FMG known as nested iteration (NI). When used inside this adaptive NI cycle, the embedded RE-MGRIT error estimate helps the adaptive RE-MGRIT algorithm to dynamically build a non-uniform temporal grid capable of mimicking the behavior implied by the PDE, while also increasing the convergence order of the underlying time-integration scheme and obtaining parallel speedups. The adaptive, NI based, RE-MGRIT algorithm, hereafter referred to as are-MGRIT, is outlined in Algorithm 8 and depicted in Figure 3.10.

Figure 3.10: An example of the cycle structure for the NI based are-MGRIT algorithm.

Algorithm 8 Adaptive RE-MGRIT cycle(A, u, g, ε_t)
1: Solve on the current grid with RE-MGRIT(A, u, g) and obtain the error estimate, E.
2: Set refinement factors: R = SetRFactors(E).
3: Refine the space-time grid.
4: Populate the fine grid using ideal interpolation.
5: Goto Step 1.

The are-MGRIT algorithm conforms to the non-intrusive philosophy of the original MGRIT algorithm. In particular, any time integration scheme and any temporal error estimation routine can be used. Some key differences between adaptive sequential time integration and adaptive temporal mesh refinement with MGRIT include the following (a sketch of one possible refinement rule is given after this list):

- The are-MGRIT algorithm implements an integer based temporal refinement algorithm

103 90 until MGRIT has converged. For the most part, this is unavoidable, however, we do note that this type of refinement is used commonly by the spatial adaptive mesh refinement community to good effect. MGRIT converges towards the true solution on each grid. At the end of each V cycle a choice must be made: should we refine the temporal grid, or should we complete another V cycle. Allowing MGRIT to fully converge on each coarse grid increases the accuracy of the solution at the previous time step and provides a better initial guess on the next grid, but, it is highly in-efficient. Completing refinement after a single MGRIT V cycle on each level is very efficient, but the errors in the solution at the previous time step can pollute the downstream error estimation process and it gives a worse initial guess on the refined grid. As we will show, one V cycle is often enough on all levels to provide an error estimate that can accurately mimic the temporal grids found using a corresponding adaptive sequential solver, while also providing a good initial guess on the new, refined time grid. Based on these observations, the goal of this section is to prove that the are-mgrit algorithm and the corresponding error estimate can be used to build a non-uniform temporal grid that closely matches that of the corresponding solver, while also providing parallel speedup Old Grid Rfactors New Grid Figure 3.11: Example of integer based grid refinement: Refinement factors, defined at each time point, are used to create a new time grid. All points present on the old grid exist on the new grid, but the location and number of C-points can change.

Local Error Estimation with RE-MGRIT

In this section, an embedded MGRIT error estimator, based on Richardson extrapolation and similar to that shown in equation 3.62, is developed and tested. Given the results in the first half of this chapter, it should not be surprising that this estimate can be calculated inside the RE-MGRIT cycle at each fine grid time point with negligible cost. The first step towards adaptive time integration with RE-MGRIT is to extend the algorithm to allow for integration across non-uniform temporal subdomains.

Extension to Non-uniform Time Grids

Define a non-uniform temporal grid with time steps $\delta t_j$ and nodes $t_0 = 0$, $t_j = t_{j-1} + \delta t_j$, $j = 1, 2, \ldots, N_t$. Further, define a coarse temporal grid with time steps $\Delta T_j = \sum_{i=m(j-1)+1}^{mj} \delta t_i$ and nodes $T_0 = 0$, $T_j = T_{j-1} + \Delta T_j$, $j = 1, 2, \ldots, N_t/m$, for some fixed coarsening factor, $m$. This is depicted in Figure 3.12.

Figure 3.12: Non-uniform fine- and coarse-grid temporal meshes. Fine-grid points are present on only the fine-grid, whereas coarse-grid points (red) are on both the fine- and coarse-grid.

Let $\Phi_j = \Phi_j(u_{j-1}, t_j, \delta t_j)$ represent the time integration operator at $t = t_j$ for a scheme with local order $k_l$. Moreover, assume that the local truncation errors accumulate linearly, that each $\delta t_j$ is small and that the solution, $u(x, t)$, is sufficiently smooth. Then, ignoring higher order terms, the global error introduced by time integration across the $j$th coarse grid interval on the non-uniform temporal mesh is

$$u_{f,mj} - u(t_{mj}) = C_{mj} \sum_{i=0}^{m-1} \delta t_{mj-i}^{k_l}. \qquad (3.63)$$

Likewise, the error introduced by a single time step on the coarse grid is

$$u_{c,mj} - u(t_{mj}) = C_{mj}\, \Delta T_j^{k_l}. \qquad (3.64)$$

As in Section 3.1, these equations can be combined to eliminate $C_{mj}$ to give an enhanced solution

$$u_{*,mj} = \bar{a}\, u_{f,mj} - \bar{b}\, u_{c,mj}, \qquad (3.65)$$

where

$$\bar{a} = \frac{\Delta T_j^{k_l}}{\Delta T_j^{k_l} - \sum_{i=0}^{m-1} \delta t_{mj-i}^{k_l}}, \qquad (3.66)$$

$$\bar{b} = \frac{\sum_{i=0}^{m-1} \delta t_{mj-i}^{k_l}}{\Delta T_j^{k_l} - \sum_{i=0}^{m-1} \delta t_{mj-i}^{k_l}}. \qquad (3.67)$$

Apart from this simple modification, the derivation for RE-MGRIT with non-uniform temporal grids is identical to that of RE-MGRIT for uniform grids. In fact, assuming a uniform time step size, $\delta t$, and given $k_g = k_l - 1$, we have

$$\bar{a} = \frac{(m\delta t)^{k_l}}{(m\delta t)^{k_l} - m\,\delta t^{k_l}} = \frac{m^{k_g}}{m^{k_g} - 1} = a,$$

as in the uniform case derived earlier. A similar equivalence holds for $\bar{b}$.

Local Embedded Error Estimator

The key result of this section is that one can modify RE-MGRIT to return an estimate of the local truncation error introduced during each time step. Solving equations 3.63 and 3.64 for $C_{mj}$ gives

$$C_{mj} = \frac{1}{\Delta T_j^{k_l} - \sum_{i=0}^{m-1} \delta t_{mj-i}^{k_l}}\left(u_{f,mj} - u_{c,mj}\right). \qquad (3.68)$$

Hence, the local error associated with each time step is

$$E_{mj-i} = C_{mj}\, \delta t_{mj-i}^{k_l}, \qquad (3.69)$$

for $i = 0, 1, \ldots, m-1$ and $j = 1, 2, \ldots, N_t/m$.
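The following Python sketch (illustrative, not thesis code) implements the non-uniform weights of equations 3.66-3.67 and the per-step estimates of equations 3.68-3.69, and checks that $\bar{a}$ reduces to $a$ on a uniform grid:

import numpy as np

def nonuniform_re_weights(dts, k_local):
    # dts holds the m fine step sizes inside one coarse interval.
    DT = np.sum(dts)                       # coarse step Delta T
    denom = DT**k_local - np.sum(dts**k_local)
    a_bar = DT**k_local / denom            # equation (3.66)
    b_bar = np.sum(dts**k_local) / denom   # equation (3.67); a_bar - b_bar = 1
    return a_bar, b_bar

def local_error_estimates(uf, uc, dts, k_local):
    DT = np.sum(dts)
    C = (uf - uc) / (DT**k_local - np.sum(dts**k_local))   # equation (3.68)
    return C * dts**k_local                                # equation (3.69)

# Uniform sanity check: for constant dt and k_l = k_g + 1, a_bar reduces to
# a = m^kg/(m^kg - 1), as shown in the text.  Both prints give 4/3.
dts = np.full(4, 0.01)
a_bar, b_bar = nonuniform_re_weights(dts, k_local=2)
print(a_bar, 4**1 / (4**1 - 1))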

This error estimator is a byproduct of the RE-MGRIT algorithm and, as such, can be calculated without any extra time integration steps. It remains to determine how error estimates are transformed into refinement factors. For now, our goal is to investigate the robustness of the new embedded error estimate. As such, a simple tolerance based approach to temporal refinement is used; if the error indicator for a given time step is greater than the prescribed tolerance, $\epsilon_t$, then that time step is marked for refinement by a factor of $r = 2$ on the next level. An investigation into various refinement strategies is given later in this chapter.

Model Problem: 2D Heat Equation

The first model problem is the 2D linear heat equation with Dirichlet boundary conditions:

$$u_t - \Delta u = f(x, t), \quad x \in [-2, 2]^2, \quad t \in [0, 1], \qquad (3.70)$$
$$u_0 = u(0) = g(x), \qquad (3.71)$$
$$u(x, t) = h(x, t), \quad x \in \partial\Omega. \qquad (3.72)$$

The RHS, initial condition, and boundary conditions were chosen based on the method of manufactured solutions to prescribe an exact solution of

$$u(x, y, t) = \exp(-100(t - 0.5)^2 - 6(x^2 + y^2)). \qquad (3.73)$$

This solution was chosen because it exhibits the need for temporal adaptation near $t = 0.5$. In all cases, the temporal discretization was completed using the backward Euler method and the spatial discretization was handled using the Fenics finite element package [37] and the First Order System Least Squares (FOSLS) methodology (see below). Table 3.4 outlines the default set of numerical parameters. Unless otherwise stated, these are the parameters used for all tests in this section.

Parameter | Value | Description
tstart | 0.0 | Start time of the simulation
tstop | 1.0 | Final time of the simulation
nt | 64 | Number of time steps on the initial grid
tol | 1e-04 | MGRIT residual tolerance
m | 4 | MGRIT coarsening factor
ml | 100 | Maximum number of MGRIT levels
element | triangle | Shape of the elements in the spatial mesh
nx | 32 | Number of spatial elements in the x direction
ny | 32 | Number of spatial elements in the y direction
p | 1 | Polynomial order of the finite elements
t_V | 1 | MGRIT V cycles before refining in time
eps_t | 1e-05 | Local tolerance for temporal refinement
Solver | cg | Linear solver
Precond | hypre boomeramg | Preconditioner
ltol | | Absolute tolerance for each linear solve

Table 3.4: Numerical parameters used for MGRIT applied to the model problem.
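Because the problem is built by the method of manufactured solutions, the forcing consistent with equation 3.73 can be generated symbolically; the SymPy sketch below is one illustrative way to do this and is not code from the thesis:

import sympy as sp

# Derive the forcing f = u_t - (u_xx + u_yy) that makes (3.73) the exact
# solution of the heat equation (3.70).
x, y, t = sp.symbols('x y t')
u = sp.exp(-100 * (t - sp.Rational(1, 2))**2 - 6 * (x**2 + y**2))
f = sp.simplify(sp.diff(u, t) - sp.diff(u, x, 2) - sp.diff(u, y, 2))
print(f)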

Spatial Discretization with FOSLS

The First Order System Least Squares (FOSLS) methodology is used to complete the spatial solves. FOSLS is designed to solve systems of first-order equations. Denote this system of $q$ equations as $L(\mathbf{u}) = \mathbf{f}$, where $L : V \to L^2(\Omega)^q$, $V$ is an appropriate Hilbert space, and it is assumed that $f_i \in L^2(\Omega)$ for $i = 1, 2, \ldots, q$. Let $\mathbf{p} = \nabla u$. Then, for the linear heat equation,

$$L(\mathbf{u}) = \begin{pmatrix} \mathbf{p} - \nabla u \\ u_t - \nabla \cdot \mathbf{p} \\ \nabla \times \mathbf{p} \end{pmatrix}, \qquad \mathbf{u} = (u, \mathbf{p}), \qquad \mathbf{f} = (0, f, 0).$$

The curl equation is added to make the resulting matrix more amenable to solution with AMG [6, 7]. Define the FOSLS functional

$$G(\mathbf{u}; \mathbf{f}) = \sum_{i=1}^{q} \|L_i \mathbf{u} - f_i\|_{0,\Omega}^2, \qquad (3.74)$$

where $\|\cdot\|_{0,\Omega}$ is the $L^2$-norm. An $L^2$ minimization gives the exact $L^2$ solution, $\mathbf{u}^*$:

$$\mathbf{u}^* = \arg\min_{\mathbf{v} \in V} G(\mathbf{v}; \mathbf{f}). \qquad (3.75)$$

Given that $L$ is linear, the solution $\mathbf{u}$ satisfies $G'(\mathbf{u})[\mathbf{v}] = 0$, where

$$G'(\mathbf{u})[\mathbf{v}] = \lim_{\alpha \to 0} \frac{G(\mathbf{u} + \alpha\mathbf{v}; \mathbf{f}) - G(\mathbf{u}; \mathbf{f})}{\alpha} \qquad (3.76)$$

is the Frechet derivative of $G$ in the direction of $\mathbf{v}$. The weak form of the FOSLS minimization problem is as follows: find $\mathbf{u} \in V$ such that, $\forall \mathbf{v} \in V$,

$$a(\mathbf{u}, \mathbf{v}) = \langle L\mathbf{u}, L\mathbf{v} \rangle = \langle \mathbf{f}, L\mathbf{v} \rangle, \qquad (3.77)$$

where $\langle \cdot, \cdot \rangle$ denotes the standard $L^2$ inner product. Assume $a(\mathbf{u}, \mathbf{v})$ is continuous and coercive in $V$, i.e., there exist constants $c, C > 0$ such that:

$$a(\mathbf{v}, \mathbf{v}) \geq c\|\mathbf{v}\|_V^2 \quad \text{(coercivity)}, \qquad (3.78)$$
$$a(\mathbf{v}, \mathbf{u}) \leq C\|\mathbf{v}\|_V \|\mathbf{u}\|_V \quad \text{(continuity)}. \qquad (3.79)$$
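In a discrete setting the FOSLS minimization reduces to a least-squares normal-equations solve; the following linear-algebra sketch (Python, illustrative; it is not the Fenics discretization used in the thesis) mirrors equations 3.74-3.77 with a generic matrix L:

import numpy as np

# Discrete analogue of (3.74)-(3.77): the minimizer of ||L u - f||^2 solves
# the normal equations a(u, v) = <Lu, Lv> = <f, Lv>, i.e. (L^T L) u = L^T f.
rng = np.random.default_rng(0)
L = rng.standard_normal((40, 10))     # stand-in for a discretized first-order system
f = rng.standard_normal(40)
u = np.linalg.solve(L.T @ L, L.T @ f) # SPD system, amenable to multigrid/CG
r = L @ u - f
print(np.linalg.norm(L.T @ r))        # normal-equations residual, essentially 0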

Then

$$c\,\|\mathbf{v}\|_V^2 \leq G(\mathbf{v}; 0) \leq C\,\|\mathbf{v}\|_V^2. \qquad (3.80)$$

The Riesz representation theorem implies that there exists a unique solution $\mathbf{u} \in V$ that satisfies equation 3.77. This result holds for any finite-dimensional subspace $V^h \subset V$. Hence, the corresponding discrete weak form: find $\mathbf{u}^h \in V^h$ such that $\forall \mathbf{v}^h \in V^h$:

$$a(\mathbf{u}^h, \mathbf{v}^h) = \langle \mathbf{f}^h, L\mathbf{v}^h \rangle, \qquad (3.81)$$

also has a unique solution. If $V$ is a product of $H^1$ spaces then these conditions also imply that there exists an optimal spatial multigrid solver for the discrete system [6, 7].

Numerical Results

The goal of this section is to prove that the embedded error estimator derived above can be used with the are-MGRIT algorithm to build a non-uniform temporal grid that closely matches that of the corresponding sequential solver, while also providing parallel speedup.

One of the strengths of nested iteration is that the coarse grid solution provides a good initial guess to the solution on the next level. However, one does not need to solve the coarse grid problem exactly. Figure 3.14 shows the error estimate and temporal grid after 5 levels of temporal refinement. Here, V cycs represents the number of MGRIT V cycles that were completed on each level before temporal refinement was completed. In all but a few places, the fine grid temporal grid and error estimator are identical, regardless of the number of V cycles completed on each level. Based on these results, one V cycle will be used on all levels for the remainder of the thesis.

For RE-MGRIT, the accuracy of the enhanced RE solution depended on the coarsening factor. As such, it is reasonable to assume the accuracy of this error estimate might also degrade as $m$ increases.

Figure 3.13: Manufactured solution 3.73 at $t = 0.2$, $0.4$, $0.5$, and $0.8$.

Figure 3.15 plots the error estimate and temporal grids found using the are-MGRIT algorithm to solve the same problem, this time varying the coarsening factor. Except in a few local regions, the temporal grids obtained using the embedded error estimate are identical.

The are-MGRIT algorithm completes the majority of the work on the coarse grid. This is not ideal in terms of parallel efficiency; however, the are-MGRIT algorithm is still capable of achieving large speedups. To illustrate this, a strong scaling study of the are-MGRIT algorithm was completed. As discussed above, one V cycle was completed on each level before refinement took place.

Figure 3.14: Error estimate (left) and time step size (right) after 5 levels of refinement.

Figure 3.15: Error estimate (left) and temporal grids (right) returned for various coarsening factors after 5 levels of refinement with 1 V cycle on each level.

For comparison, a sequential time integration scheme based on Algorithm 6 was implemented. In this case, error estimation was completed using Richardson extrapolation. Note that the sequential method allows for non-integer based refinement. The local temporal refinement tolerance was set to $10^{-6}$. Two studies were completed. The first used a coarsening factor of $m = 4$ and 16 spatial processors. The second study was completed with $m = 2$ and 2 processors in space.

The are-MGRIT algorithm is capable of effectively and efficiently building a non-uniform temporal mesh that closely approximates the temporal mesh found using the corresponding sequential solver. The maximum speedup was 13.

It is important to notice that these speedups were obtained with little to no tuning. A more aggressive refinement strategy combined with some of the optimizations outlined in the first chapter would likely lead to further speedups.

Figure 3.16: Strong scaling study for the are-MGRIT algorithm applied to the linear heat equation on a $32^2$ spatial grid. Left: temporal grids. Right: strong scaling study.

Numerical Example: Coupled ODE

In this section we investigate using the are-MGRIT algorithm to solve a coupled system of ODEs with a discontinuous forcing function:

$$\frac{dy_1(t)}{dt} = -6y_1(t) + u(t), \qquad (3.82)$$
$$\frac{dy_2(t)}{dt} = 6y_1(t) - 0.6y_2(t), \qquad (3.83)$$

where the forcing function, $u(t)$, is defined as

$$u(t) = \begin{cases} 1 & \text{if } \operatorname{mod}(10t, 10) \leq 1, \\ 0 & \text{else.} \end{cases} \qquad (3.84)$$

In particular, we wish to test the ability of the are-MGRIT algorithm to refine around a sharp discontinuity. Figure 3.17 shows a reference solution for this equation, obtained using the second order SDIRK-2 method with $10^8$ time steps.
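For reference, the sketch below (Python, illustrative; the zero initial condition and the reconstructed signs in equations 3.82-3.83 are assumptions, not taken from the thesis) advances the coupled system with backward Euler, i.e., SDIRK-1:

import numpy as np

def forcing(t):
    # Equation (3.84): the forcing is on for the first 10% of each unit interval.
    return 1.0 if np.mod(10.0 * t, 10.0) <= 1.0 else 0.0

A = np.array([[-6.0, 0.0],
              [6.0, -0.6]])

def backward_euler_step(y, t_new, dt):
    # Solve (I - dt*A) y_new = y + dt*[u(t_new), 0].
    rhs = y + dt * np.array([forcing(t_new), 0.0])
    return np.linalg.solve(np.eye(2) - dt * A, rhs)

y, t, dt = np.zeros(2), 0.0, 10.0 / 5000   # assumed y(0) = 0
while t < 10.0 - 1e-12:
    t += dt
    y = backward_euler_step(y, t, dt)
print(y)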

For the remaining tests presented in this section, the initial grid consisted of 5000 time steps, the coarsening factor was $m = 4$, and the coarsest grid size was 300 time steps. This choice of the coarsest grid size was made to ensure that the forcing function does not alias to zero on the coarsest temporal grid.

Figure 3.17: Reference solution for the coupled ODE system. Top left: $y_1$. Top right: $y_2$. Bottom left: phase plane. Bottom right: forcing function. The color scale on the phase plot fades from red to blue as $t$ increases from $t = 0$ to $t = 10$.

Figure 3.18a compares the non-uniform temporal grid found using the corresponding sequential solver with the temporal grids found using RE-MGRIT after 2, 5 and 9 levels of temporal refinement. The sequential solver uses Algorithm 6 with SDIRK-1 and RE based temporal error estimation. Note that, because of this choice of error estimation scheme, the sequential solver can also use RE to increase the accuracy of the solution at no extra cost. For RE-MGRIT, temporal refinement was completed using a local tolerance based approach with a refinement factor of $r = 2$ and a prescribed local tolerance.

Overall, the are-MGRIT algorithm built a non-uniform temporal grid that closely matched the grid found using the sequential solver. In total, 11 levels of refinement plus two extra fine grid iterations were required to obtain a solution within the desired adaptive tolerance.


More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 15 Numerically solve a 2D boundary value problem Example:

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Index. C m (Ω), 141 L 2 (Ω) space, 143 p-th order, 17

Index. C m (Ω), 141 L 2 (Ω) space, 143 p-th order, 17 Bibliography [1] J. Adams, P. Swarztrauber, and R. Sweet. Fishpack: Efficient Fortran subprograms for the solution of separable elliptic partial differential equations. http://www.netlib.org/fishpack/.

More information

MATHEMATICAL ANALYSIS, MODELING AND OPTIMIZATION OF COMPLEX HEAT TRANSFER PROCESSES

MATHEMATICAL ANALYSIS, MODELING AND OPTIMIZATION OF COMPLEX HEAT TRANSFER PROCESSES MATHEMATICAL ANALYSIS, MODELING AND OPTIMIZATION OF COMPLEX HEAT TRANSFER PROCESSES Goals of research Dr. Uldis Raitums, Dr. Kārlis Birģelis To develop and investigate mathematical properties of algorithms

More information

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Distributed NVAMG Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Istvan Reguly (istvan.reguly at oerc.ox.ac.uk) Oxford e-research Centre NVIDIA Summer Internship

More information

AMS527: Numerical Analysis II

AMS527: Numerical Analysis II AMS527: Numerical Analysis II A Brief Overview of Finite Element Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao SUNY Stony Brook AMS527: Numerical Analysis II 1 / 25 Overview Basic concepts Mathematical

More information

PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS. Ioana Chiorean

PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS. Ioana Chiorean 5 Kragujevac J. Math. 25 (2003) 5 18. PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS Ioana Chiorean Babeş-Bolyai University, Department of Mathematics, Cluj-Napoca, Romania (Received May 28,

More information

LINE SEARCH MULTILEVEL OPTIMIZATION AS COMPUTATIONAL METHODS FOR DENSE OPTICAL FLOW

LINE SEARCH MULTILEVEL OPTIMIZATION AS COMPUTATIONAL METHODS FOR DENSE OPTICAL FLOW LINE SEARCH MULTILEVEL OPTIMIZATION AS COMPUTATIONAL METHODS FOR DENSE OPTICAL FLOW EL MOSTAFA KALMOUN, LUIS GARRIDO, AND VICENT CASELLES Abstract. We evaluate the performance of different optimization

More information

Adaptive Smoothed Aggregation (αsa) Multigrid

Adaptive Smoothed Aggregation (αsa) Multigrid SIAM REVIEW Vol. 47,No. 2,pp. 317 346 c 2005 Society for Industrial and Applied Mathematics Adaptive Smoothed Aggregation (αsa) Multigrid M. Brezina R. Falgout S. MacLachlan T. Manteuffel S. McCormick

More information

Recent developments for the multigrid scheme of the DLR TAU-Code

Recent developments for the multigrid scheme of the DLR TAU-Code www.dlr.de Chart 1 > 21st NIA CFD Seminar > Axel Schwöppe Recent development s for the multigrid scheme of the DLR TAU-Code > Apr 11, 2013 Recent developments for the multigrid scheme of the DLR TAU-Code

More information

cuibm A GPU Accelerated Immersed Boundary Method

cuibm A GPU Accelerated Immersed Boundary Method cuibm A GPU Accelerated Immersed Boundary Method S. K. Layton, A. Krishnan and L. A. Barba Corresponding author: labarba@bu.edu Department of Mechanical Engineering, Boston University, Boston, MA, 225,

More information

The 3D DSC in Fluid Simulation

The 3D DSC in Fluid Simulation The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation

A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation Amir Nejat * and Carl Ollivier-Gooch Department of Mechanical Engineering, The University of British Columbia, BC V6T 1Z4, Canada

More information

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank Hans De Sterck Department of Applied Mathematics University of Waterloo, Ontario, Canada joint work with Steve McCormick,

More information

MPI OpenMP algorithms for the parallel space time solution of Time Dependent PDEs

MPI OpenMP algorithms for the parallel space time solution of Time Dependent PDEs MPI OpenMP algorithms for the parallel space time solution of Time Dependent PDEs Ronald D. Haynes 1 and Benjamin Ong 2 1 Introduction Modern high performance computers offer hundreds of thousands of processors

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert

More information

Numerical Modelling in Fortran: day 6. Paul Tackley, 2017

Numerical Modelling in Fortran: day 6. Paul Tackley, 2017 Numerical Modelling in Fortran: day 6 Paul Tackley, 2017 Today s Goals 1. Learn about pointers, generic procedures and operators 2. Learn about iterative solvers for boundary value problems, including

More information

Highly Parallel Multigrid Solvers for Multicore and Manycore Processors

Highly Parallel Multigrid Solvers for Multicore and Manycore Processors Highly Parallel Multigrid Solvers for Multicore and Manycore Processors Oleg Bessonov (B) Institute for Problems in Mechanics of the Russian Academy of Sciences, 101, Vernadsky Avenue, 119526 Moscow, Russia

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Richardson Extrapolation-Based High Accuracy High Efficiency Computation for Partial Differential Equations

Richardson Extrapolation-Based High Accuracy High Efficiency Computation for Partial Differential Equations University of Kentucky UKnowledge Theses and Dissertations--Computer Science Computer Science 24 Richardson Extrapolation-Based High Accuracy High Efficiency Computation for Partial Differential Equations

More information

Driven Cavity Example

Driven Cavity Example BMAppendixI.qxd 11/14/12 6:55 PM Page I-1 I CFD Driven Cavity Example I.1 Problem One of the classic benchmarks in CFD is the driven cavity problem. Consider steady, incompressible, viscous flow in a square

More information

A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units

A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units Markus Wagner, Karl Rupp,2, Josef Weinbub Institute for Microelectronics, TU

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Fully discrete Finite Element Approximations of Semilinear Parabolic Equations in a Nonconvex Polygon

Fully discrete Finite Element Approximations of Semilinear Parabolic Equations in a Nonconvex Polygon Fully discrete Finite Element Approximations of Semilinear Parabolic Equations in a Nonconvex Polygon Tamal Pramanick 1,a) 1 Department of Mathematics, Indian Institute of Technology Guwahati, Guwahati

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Hybrid Space-Time Parallel Solution of Burgers Equation

Hybrid Space-Time Parallel Solution of Burgers Equation Hybrid Space-Time Parallel Solution of Burgers Equation Rolf Krause 1 and Daniel Ruprecht 1,2 Key words: Parareal, spatial parallelization, hybrid parallelization 1 Introduction Many applications in high

More information

An introduction to mesh generation Part IV : elliptic meshing

An introduction to mesh generation Part IV : elliptic meshing Elliptic An introduction to mesh generation Part IV : elliptic meshing Department of Civil Engineering, Université catholique de Louvain, Belgium Elliptic Curvilinear Meshes Basic concept A curvilinear

More information

arxiv: v1 [math.na] 26 Jun 2014

arxiv: v1 [math.na] 26 Jun 2014 for spectrally accurate wave propagation Vladimir Druskin, Alexander V. Mamonov and Mikhail Zaslavsky, Schlumberger arxiv:406.6923v [math.na] 26 Jun 204 SUMMARY We develop a method for numerical time-domain

More information

The Use of Projection Operators with the Parareal Algorithm to Solve the Heat and the KdVB Equation

The Use of Projection Operators with the Parareal Algorithm to Solve the Heat and the KdVB Equation MATIMYÁS MATEMATIKA Journal of the Mathematical Society of the Philippines ISSN 0115-6926 Vol. 40 Nos. 1-2 (2017) pp. 41-56 The Use of Projection Operators with the Parareal Algorithm to Solve the Heat

More information

Hierarchical Hybrid Grids

Hierarchical Hybrid Grids Hierarchical Hybrid Grids IDK Summer School 2012 Björn Gmeiner, Ulrich Rüde July, 2012 Contents Mantle convection Hierarchical Hybrid Grids Smoothers Geometric approximation Performance modeling 2 Mantle

More information

Collocation and optimization initialization

Collocation and optimization initialization Boundary Elements and Other Mesh Reduction Methods XXXVII 55 Collocation and optimization initialization E. J. Kansa 1 & L. Ling 2 1 Convergent Solutions, USA 2 Hong Kong Baptist University, Hong Kong

More information

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr.

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr. Mid-Year Report Discontinuous Galerkin Euler Equation Solver Friday, December 14, 2012 Andrey Andreyev Advisor: Dr. James Baeder Abstract: The focus of this effort is to produce a two dimensional inviscid,

More information

2 The Elliptic Test Problem

2 The Elliptic Test Problem A Comparative Study of the Parallel Performance of the Blocking and Non-Blocking MPI Communication Commands on an Elliptic Test Problem on the Cluster tara Hafez Tari and Matthias K. Gobbert Department

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

FAST ALGORITHMS FOR CALCULATIONS OF VISCOUS INCOMPRESSIBLE FLOWS USING THE ARTIFICIAL COMPRESSIBILITY METHOD

FAST ALGORITHMS FOR CALCULATIONS OF VISCOUS INCOMPRESSIBLE FLOWS USING THE ARTIFICIAL COMPRESSIBILITY METHOD TASK QUARTERLY 12 No 3, 273 287 FAST ALGORITHMS FOR CALCULATIONS OF VISCOUS INCOMPRESSIBLE FLOWS USING THE ARTIFICIAL COMPRESSIBILITY METHOD ZBIGNIEW KOSMA Institute of Applied Mechanics, Technical University

More information

Exploring unstructured Poisson solvers for FDS

Exploring unstructured Poisson solvers for FDS Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests

More information

Finite Element Convergence for Time-Dependent PDEs with a Point Source in COMSOL 4.2

Finite Element Convergence for Time-Dependent PDEs with a Point Source in COMSOL 4.2 Finite Element Convergence for Time-Dependent PDEs with a Point Source in COMSOL 4.2 David W. Trott and Matthias K. Gobbert Department of Mathematics and Statistics, University of Maryland, Baltimore County,

More information

Model Reduction for Variable-Fidelity Optimization Frameworks

Model Reduction for Variable-Fidelity Optimization Frameworks Model Reduction for Variable-Fidelity Optimization Frameworks Karen Willcox Aerospace Computational Design Laboratory Department of Aeronautics and Astronautics Massachusetts Institute of Technology Workshop

More information

High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms

High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering (CiE) Scientific Computing

More information

Application of Finite Volume Method for Structural Analysis

Application of Finite Volume Method for Structural Analysis Application of Finite Volume Method for Structural Analysis Saeed-Reza Sabbagh-Yazdi and Milad Bayatlou Associate Professor, Civil Engineering Department of KNToosi University of Technology, PostGraduate

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

ADAPTIVE SMOOTHED AGGREGATION (αsa)

ADAPTIVE SMOOTHED AGGREGATION (αsa) ADAPTIVE SMOOTHED AGGREGATION (αsa) M. BREZINA, R. FALGOUT, S. MACLACHLAN, T. MANTEUFFEL, S. MCCORMICK, AND J. RUGE Abstract. Substantial effort has been focused over the last two decades on developing

More information

THE MORTAR FINITE ELEMENT METHOD IN 2D: IMPLEMENTATION IN MATLAB

THE MORTAR FINITE ELEMENT METHOD IN 2D: IMPLEMENTATION IN MATLAB THE MORTAR FINITE ELEMENT METHOD IN D: IMPLEMENTATION IN MATLAB J. Daněk, H. Kutáková Department of Mathematics, University of West Bohemia, Pilsen MECAS ESI s.r.o., Pilsen Abstract The paper is focused

More information

MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP

MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP Vol. 12, Issue 1/2016, 63-68 DOI: 10.1515/cee-2016-0009 MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP Juraj MUŽÍK 1,* 1 Department of Geotechnics, Faculty of Civil Engineering, University

More information

Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees

Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees Rahul S. Sampath, Santi S. Adavani, Hari Sundar, Ilya Lashuk, and George Biros University of Pennsylvania Abstract In this

More information

Structured-grid multigrid with Taylor-Hood finite elements

Structured-grid multigrid with Taylor-Hood finite elements Structured-grid multigrid with Taylor-Hood finite elements by c Lukas Spies A thesis submitted to the School of Graduate Studies in partial fulfillment of the requirements for the degree of Masters of

More information

Journal of Engineering Research and Studies E-ISSN

Journal of Engineering Research and Studies E-ISSN Journal of Engineering Research and Studies E-ISS 0976-79 Research Article SPECTRAL SOLUTIO OF STEADY STATE CODUCTIO I ARBITRARY QUADRILATERAL DOMAIS Alavani Chitra R 1*, Joshi Pallavi A 1, S Pavitran

More information

CS205b/CME306. Lecture 9

CS205b/CME306. Lecture 9 CS205b/CME306 Lecture 9 1 Convection Supplementary Reading: Osher and Fedkiw, Sections 3.3 and 3.5; Leveque, Sections 6.7, 8.3, 10.2, 10.4. For a reference on Newton polynomial interpolation via divided

More information

Using Adaptive Sparse Grids to Solve High-Dimensional Dynamic Models. J. Brumm & S. Scheidegger BFI, University of Chicago, Nov.

Using Adaptive Sparse Grids to Solve High-Dimensional Dynamic Models. J. Brumm & S. Scheidegger BFI, University of Chicago, Nov. Using Adaptive Sparse Grids to Solve High-Dimensional Dynamic Models J. Brumm & S. Scheidegger, Nov. 1st 2013 Outline I.) From Full (Cartesian) Grids to Sparse Grids II.) Adaptive Sparse Grids III.) Time

More information

A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods

A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods Scott A. Sarra, Derek Sturgill Marshall University, Department of Mathematics, One John Marshall Drive, Huntington

More information

Performance Studies for the Two-Dimensional Poisson Problem Discretized by Finite Differences

Performance Studies for the Two-Dimensional Poisson Problem Discretized by Finite Differences Performance Studies for the Two-Dimensional Poisson Problem Discretized by Finite Differences Jonas Schäfer Fachbereich für Mathematik und Naturwissenschaften, Universität Kassel Abstract In many areas,

More information

Image deblurring by multigrid methods. Department of Physics and Mathematics University of Insubria

Image deblurring by multigrid methods. Department of Physics and Mathematics University of Insubria Image deblurring by multigrid methods Marco Donatelli Stefano Serra-Capizzano Department of Physics and Mathematics University of Insubria Outline 1 Restoration of blurred and noisy images The model problem

More information

Efficient Solution Techniques

Efficient Solution Techniques Chapter 4 The secret to walking on water is knowing where the rocks are. Herb Cohen Vail Symposium 14 poster Efficient Solution Techniques In the previous chapter, we introduced methods for implementing

More information

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC 2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy

More information

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible

More information

Chapter 6. Semi-Lagrangian Methods

Chapter 6. Semi-Lagrangian Methods Chapter 6. Semi-Lagrangian Methods References: Durran Chapter 6. Review article by Staniford and Cote (1991) MWR, 119, 2206-2223. 6.1. Introduction Semi-Lagrangian (S-L for short) methods, also called

More information

1 Exercise: 1-D heat conduction with finite elements

1 Exercise: 1-D heat conduction with finite elements 1 Exercise: 1-D heat conduction with finite elements Reading This finite element example is based on Hughes (2000, sec. 1.1-1.15. 1.1 Implementation of the 1-D heat equation example In the previous two

More information