CU Scholar. University of Colorado, Boulder. David John Appelhans University of Colorado Boulder,

Size: px
Start display at page:

Download "CU Scholar. University of Colorado, Boulder. David John Appelhans University of Colorado Boulder,"

Transcription

1 University of Colorado, Boulder CU Scholar Applied Mathematics Graduate Theses & Dissertations Applied Mathematics Spring Trading Computation for Communication: A Low Communication Algorithm for the Parallel Solution of PDEs Using Range Decomposition, Nested Iteration, and Adaptive Mesh Refinement David John Appelhans University of Colorado Boulder, dappelha@gmail.com Follow this and additional works at: Part of the Applied Mathematics Commons, and the Theory and Algorithms Commons Recommended Citation Appelhans, David John, "Trading Computation for Communication: A Low Communication Algorithm for the Parallel Solution of PDEs Using Range Decomposition, Nested Iteration, and Adaptive Mesh Refinement" (2014). Applied Mathematics Graduate Theses & Dissertations This Dissertation is brought to you for free and open access by Applied Mathematics at CU Scholar. It has been accepted for inclusion in Applied Mathematics Graduate Theses & Dissertations by an authorized administrator of CU Scholar. For more information, please contact cuscholaradmin@colorado.edu.

2 Trading Computation for Communication: A Low Communication Algorithm for the Parallel Solution of PDEs Using Range Decomposition, Nested Iteration, and Adaptive Mesh Refinement by David John Appelhans B.S., Colorado School of Mines, 2008 M.S., Colorado School of Mines, 2009 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Applied Math 2014

3 This thesis entitled: Trading Computation for Communication: A Low Communication Algorithm for the Parallel Solution of PDEs Using Range Decomposition, Nested Iteration, and Adaptive Mesh Refinement written by David John Appelhans has been approved for the Department of Applied Math Tom Manteuffel Steve McCormick Date The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

4 iii Appelhans, David John (Ph.D., Applied Math) Trading Computation for Communication: A Low Communication Algorithm for the Parallel Solution of PDEs Using Range Decomposition, Nested Iteration, and Adaptive Mesh Refinement Thesis directed by Prof. Tom Manteuffel In this thesis we propose a new algorithm for solving PDEs on massively parallel computers. The Nested Iteration Adaptive Mesh Refinement Range Decomposition (NI-AMR- RD) algorithm uses nested iteration and adaptive mesh refinement locally before performing a global communication step. Only a few such steps are observed to be necessary before reaching a solution that is on the order of discretization error. The target application is peta- and exa-scale machines, where traditional parallel numerical PDE communication patterns stifle scalability. The RD algorithm uses a partition of unity to equally distribute the error and thus the work. The computational advantages of this approach are that the decomposed problems can be solved using nested iteration and any multigrid cycle type, with communication needed only a few times when the partitioned solutions are summed. This offers potential advantages in the paradigm of expensive communication but very cheap computation. This thesis introduces the method and explains the details of the communication step. Two performance models are developed, showing that the communication cost associated with a traditional parallel implementation of nested iteration is proportional to log(p ) 2, whereas the NI-AMR-RD method reduces the communication time to log(p ). Numerical results for the Laplace problem with dirichlet boundary conditions demonstrate this enhanced performance.

5 Dedication To my grandfather who would have been proud to attend my graduation, and to my parents that have offered such good advice throughout the years.

6 v Acknowledgements Many thanks are due to those on whose shoulders I am perched. To all the past and present generations of Grandview students who have added to the FOSPACK code, I appreciate you. A special thanks is in order to John Ruge, who helped me navigate FOSPACK and wrote many routines necessary to implementing the RD method. I would still be stuck on a bug somewhere in the depths of FOSPACK if it were not for the help of John Ruge and Marian Brezina. I would also like to thank my advisors Tom Manteuffel and Steve McCormick for having faith in me and teaching me so much.

7 vi Contents Chapter 1 Introduction Preliminaries FOSLS Adaptive Refinement (ACE) Nested Iteration Range Decomposition Handling Inhomogeneous Boundary Conditions NI-AMR-RD Algorithm Iterative Description Communication and Union of Meshes Error Analysis Notation NI-AMR-RD Asymptotic Discretization Error Lower Bound Functional Reduction on 1st Step Functional Reduction Assumptions Functional Reduction Proof Numerical Validation

8 vii 4 Performance Model Traditional Implementation of Nested Iteration Communication Cost of a Single V(1,1) Cycle in 2D Nested iteration cost RD algorithm performance model One RD iteration Performance model comparison Model Parameters Large P Small P for Comparison to Numerical Results Numerical Tests Graded Mesh Convergence Laplace Equation Weak Scaling with Fixed Right Hand Side Weak Scaling When Right Hand Side Scales What If The Solution is Smooth? Conclusion 67 Bibliography 69 Appendix A Lemmas For Error Analysis 72

9 viii Tables Table 4.1 Relation between nested iteration level and number of MG levels Graded mesh convergence to union grid solution using 16 processes Graded mesh convergence to union grid solution using 64 processes Values of Q when the initial f is smooth

10 ix Figures Figure process example of the initial coarse mesh from which each process would start nested iteration, and the resulting refined mesh for process 6. The coarsemesh elements are numbered according to processes, and they need not be a uniform grid. Notice that most of the refinement happens in a process own home turf (a) The butterfly communication pattern used to sum the solutions and union the grids. The dark arrows emphasize the direct communication of another process solution to process 6, while the gray arrows denote when a process solution in region 6 is passed on to be indirectly communicated to process Example of what is sent and received by process 6. During the first step, (a), process 6 sends regions 9-16 and receives process 14 s solution in region 1-8. (b) Process 6 now receives regions 5-8 from process 2, and the elements marked in red are the additional elements process 2 created when it was unioned with process 10 on the previous step. (c) On step 3, process 6 and 8 communicate 1/8 of domain. (d) In the final communication step, process 5 sends to 6 its partially summed solution and unioned grid only from region Plot showing relative functional reduction on the first iteration of NI-AMR- RD for a problem that approximately satisfies the assumptions. E is fixed at 4000 elements, Q is measured, and P is increased from 16 to

11 x 4.1 Starting from nested iteration level j, a V-cycle goes from the fine grid of k(j) nodes to 1 node and back to nested iteration level j. Adaptive refinement then generates the next nested refinement level, and the approximate guess for the solution is interpolated (arrow) onto the new level (a) Example of nodes along a process boundary that must be passed to a neighboring process to perform relaxation, restriction, or evaluate the functional. There are 8 neighboring processes so 8 seperate messages must be passed to perform any of the multigrid operations. If the number of nodes along a side is n then the total number of nodes that must be sent for relaxation is 4n + 4. For restriction with a full weighting stencil, all 4n + 4 boundary points must be communicated. Interpolation, shown in (b), only requires approximately half as many nodes be communicated Depiction of a V-cycle coarsening until 1 node per process. Past that point, some processes go idle but the slowest process always has at least one node Plot of data and least squares fit of average node to node communication time on Stampede, a Bluegene Q supercomputer. The entire range of data is shown in (a), while a zoomed view of the data points clustered near the y intercept are shown in (b) Plot of the performance model for traditional nested iteration (a) and the RD method (b), showing how the communication times of the methods scale into the regime of millions of processes. The traditional method is latency bound, while RD scales much better and is bandwidth bound Plot of the communication costs for (a) traditional nested iteration and (b) the RD method, for 16 to 1024 processes. The enhanced scaling of the RD method is not as obvious on this scale, but numerical tests were conducted for this range of processes. The traditional method is latency bound, while RD scales much better and is bandwidth bound

12 xi 5.1 An example process p component of the solution to the decomposed problem for (a) 16 processes and (b) 1024 processes Weak scaling results for solving Laplace equation using 4000 elements per process. RD iterations scale very well, while the time for the traditional implementation begins growing. The error achieved by the two methods is of course different, and is shown in later plots The relative error reduction with respect to iterations is shown in (a), while (b) shows that the asymptotic error reduction behaves like 1/QEP as expected when using quadratic elements Comparison of RD accuracy to traditional accuracy with the time required to achieve. The blue dots correspond to RD iterations, the green line corresponds to the accuracy achieved by a traditional solve, and the red line is the accuracy of the traditional solve multiplied by 1/Q Asymptotic values of Q taken from runs on 16 to 1024 processes showing that if the resources, E, are sufficiently greater than the number of MPI tasks, P, then Q is relatively constant and close to Computed values of Q at each RD iteration for various numbers of processes (a). The asymptotic value of Q decreases as P increases because the coarse grid influences the fine grid when E is not much greater than P. When the resources are increased (b), the asymptotic value of Q returns to Contour plots of the solution using 1024 processes for the case where (a) the frequency of f is fixed, and (b) the frequency of f is scaled with increasing number of processes Weak scaling results using 4000 elements per process to solve Laplace equation with a solution that increases frequency with increasing P Error and timings when the frequency of the solution is varied proportional to P. Note that the convergence rates are almost identical

13 xii 5.10 The slow convergence of Q for the study of fixed frequency solution shown in (a) is no longer present when the frequency scales with P, as shown in (b) An example process solution to the decomposed problem with smooth input f on the first (a) and the second (b) iteration. The respective adapted meshes are shown in (c) and (d)

14 Chapter 1 Introduction Modern supercomputing architectures achieve increased performance through the addition of more cores rather than faster cores. Communication costs, especially latency costs, continue to become an increasingly important bottleneck in the scalability of previously developed algorithms. Traditional parallel PDE solvers decompose the computational domain into regions entirely owned by a process, resulting in matrix vector multiplications that require the exchange of boundary data among neighboring processes. In the case of relaxation in an Algebraic Multigrid (AMG) context, the communication of boundary points occurs for each relaxation sweep and at each level [21]. This communication is 1-to-1 and can be a local communication, but neighbors in the computational grid are often not topological neighbors in the machine, resulting in long-range communication. Furthermore, as the grid becomes coarser, a higher fraction of the grid points make up the boundary, causing communication to dominate computation on coarser grids. The communication on these coarser grids is of smaller message packets, but the latency cost remains the same. The latency expense of communication has prompted a paradigm shift in PDE solver design. The goal has been to trade expensive communication for relatively cheap computation. Most previous research aimed at reducing communication cost has focused on distributing the domain among processes, including, for example, Mitchell s work on domain partitions [24] and Bank and Holst s domain decomposition approach to parallel adaptive

15 2 meshing [3]. In this thesis I introduce the Nested Iteration Adaptive Mesh Refinement Range Decomposition (NI-AMR-RD) method, which decomposes the range, or the error, instead of the domain. The method generates independent problems that can be solved in parallel and that only require a small number of global communication steps. The independent problems are solved using nested iteration and adaptive mesh refinement without incurring any additional communication penalties. Because RD distributes the error among processes, load balancing is a natural feature of the algorithm. The strategy of range decomposition is presented here in the context of First-Order System Least-Squares (FOSLS) finite element discretization and an AMG solver [28], but the concept can be applied to any discretization with an a-priori error estimate for adaptive mesh refinement and a suitable linear system solver. 1.1 Preliminaries The building blocks upon which the Nested Iteration, Adaptive Mesh Refinement Range Decomposition (NI-AMR-RD) method is constructed are briefly explained below. References are provided should the reader desire a more than cursory overview of the topics. Throughout the thesis, processes will be referred to instead of processors, because the algorithm is not restricted to physical cores, but instead could be thought of in terms of a parallel hierarchy where a processes is actually executed by a group of processors in parallel FOSLS The First Order System Least Squares (FOSLS) finite element method reformulates a PDE or system of PDEs into a problem of minimizing a functional based on the least-squares minimization of a system of first-order equations. For example, suppose a PDE has been rewritten as the first order system, Lu = f. Assuming f L 2 (Ω), then the associated FOSLS

16 3 functional is G(u, f) = Lu f 2 0,Ω. (1.1) The problem then becomes that of minimizing this functional over an appropriate space: where W is a Hilbert space, often a product of H 1 spaces. u = arg min w W Lw f 0,Ω. (1.2) For many applications that satisfy general regularity assumptions, G(u, f) is elliptic, meaning that the homogeneous part of the FOSLS functional is equivalent to the squared W norm, c 1 G(w, 0) w 2 W c 2, (1.3) for some positive constants c 1 and c 2 and for every w W. The well-posedness of (1.2) follows directly from this ellipticity [16][15]. If W is a product of H 1 spaces, then this ellipticity also enables an optimal multigrid solver [15]. Other typical spaces, for example, H(Div) and H(Curl), also admit fast multilevel solvers. Let W h W be a finite-dimensional subspace of W. Then the discrete FOSLS formulation can be stated in weak form as: find u h W h W such that, for every z h k Wh k, Lu h, Lz h = f, Lz h. (1.4) This is equivalent to u h = arg min w h W h Lw h f 0,Ω. (1.5) The FOSLS functional also gives a local sharp and globally reliable error bound for adaptive mesh refinement. To see this, for any element τ Ω, define the local FOSLS functional as G τ (u h, f) = Lu h f 2 0,τ. (1.6) The locally sharp property is seen by using the ellipticity in (1.3) to imply that 1 c 2 G τ (u h, f) = 1 c 2 G τ (u h u, 0) u h u 2 W,τ. (1.7)

17 4 In other words, if the functional contribution from an element is large, then the W norm of the error is also large in that element. Using the lower bound provided by the ellipticity gives u h u 2 W 1 G(u h u, 0) = 1 G τ (u h, f), (1.8) c 2 c 1 which means that a small sum of the local functionals implies a small global error. This is what is meant by a globally reliable error estimate. τ Ω Adaptive Refinement (ACE) The FOSLS functional provides an a posteriori error measure for adaptive mesh refinement. This measure has been combined with a measure of the work in assembling and solving the AMG system to develop a refinement strategy based on accuracy per computational efficiency (ACE) [14]. The basic idea is as follows. Abstractly order the elements according to the error in each, largest to smallest. Let r [0, 1] be the fraction of elements to be refined, starting with the largest, and let γ(r) be an estimate of the expected reduction in the functional norm of the error. Let W (r) be the estimated computational cost of solving the resulting system after the fraction r of the elements have been refined. Define the effective functional reduction measure as γ(r) 1 W (r), (1.9) which incorporates the added work into the functional reduction. Then, the ACE refinement algorithm finds r opt = arg min γ(r) 1 W (r), (1.10) r and refines the first r of the elements that have the largest error. The algorithm can easily be extended to allow multiple refinements [14, 2]. Because the ACE refinement strategy refines elements with higher functional error, after several levels of refinement, it results in refined meshes with the functional error equally

18 5 distributed among elements. Meshes in which the functional error is equally distributed across elements has been proven to be optimal in 1D [19], and the result is believed to hold true for higher dimensions as well. A sequence of ACE refinements produces a mesh in which the functional error is approximately equally distributed and can therefore be considered an approximately optimal mesh, which I define more precicely in section Nested Iteration Nested Iteration, or full multigrid, is a solution process that begins on a coarse level where the element sizes are relatively large and computations are relatively cheap. An approximate solution on the coarse to fine levels in turn, is attained through fast iterative solvers (multigrid V-cycle), with the approximation on each coarse level interpolated to the next finer level to provide an initial guess for the V-cycle there. This bootstrapping process continues until a user-specified criteria is reached, which, for the purposes of this paper is until the fixed memory resources of a process are fully occupied. That is, nested iteration and AMR stop when a fixed number of elements per process have been reached. Nested Iteration is often much faster than solving directly on the finest mesh since the coarser levels provide cheap approximations to the solution on finer levels. In the context of adaptive mesh refinement, the finest mesh is not known a priori, and nested iteration also provides for the generation of the mesh on the next level based on an error indicator, such as the FOSLS functional, from the current level. In this way, the optimal solution is approximated on a nearly optimal grid at relatively low cost [14, 2]. In parallel, consideration must be given to the fact that computation and bandwidth are cheap on coarse levels, but latent communication expenses remain fixed regardless of the number of elements on a level. A performance model of nested iteration in parallel is derived in section 4.

19 6 1.2 Range Decomposition Whereas a traditional PDE solver equally distributes the computational domain among processes, range decomposition equally distributes the residual and, therefore, the error across processes. This is done by applying a partition of unity to the right-hand side of the PDE. For example, consider the linear system Lu = f in Ω, (1.11) Bu = 0 on Ω. (1.12) Here, the right-hand side, f, can be viewed as the residual resulting from a preprocessing step on a very coarse grid that can be accomplished on a single process, and therefore, on all processes simultaneously. In that context, and in what follows, the solution, u, is a correction to an existing approximation. It is, therefore, natural to pose homogeneous boundary conditions in (1.11). Let the number of processes be P and partition the domain into P subdomains, Ω i = Ω. Assume that the subdomains are sized in such a way that f 2 is approximately equally divided among subdomains. Define a partition of unity using the characteristic function 1 in Ω i χ i =. (1.13) 0 Ω j i This is a discontinuous partition of unity. Smoother partitions of unity could also be chosen. Range decomposition equidistributes the residual (error) among the P subdomains by applying χ i to the right-hand side of (1.11). Then, process i solves the decomposed system Lu i = χ i f in Ω, Bu = 0 on Ω. (1.14) The final solution is given by a sum of the solutions from each process, P u = u i. (1.15) i=1

20 7 Notice that, unlike domain decomposition methods, the approximation on a process of the solution of (1.14) is defined over the entire domain Ω. We refer to the subdomain where χ = 1 as a process home turf, because it is only over the home turf that a process has a nonzero right-hand side. Outside of the home turf, the PDE is homogeneous and u i is a local null vector of operator L. We say that u is a local L-harmonic in a region if Lu = 0 in that region. For the first-order system arising from elliptic PDEs, the dominant L-harmonic becomes smoother further away from the home turf and therefore requires a small number of points to accurately represent the solution outside the home turf. This property is important to the efficiency of our range decomposition approach. It is a theoretical though impractical note that if each process used (and were capable of using) a globablly fine grid in each of the P solves of eqn. 1.14, the summed approximation u would be exactly equal to the direct approximation of eqn computed on that globally fine grid. Computationally this saves no work over having a single process directly performing an approximate solve of The computational savings only appear when combining range decomposition with a graded mesh Handling Inhomogeneous Boundary Conditions Inhomogeneous boundary conditions for (1.11) can always be made into homogeneous conditions with superposition. For example, given boundary conditions on a variable p(x, y), consider constructing a Coon s patch, G(x, y), and surplus, S(x, y), as follows. G(x, y) = (1 x)p(0, y) + xp(1, y) + (1 y)p(x, 0) + yp(x, 1), S(x, y) = (1 x)(1 y)p(0, 0) + x(1 y)p(1, 0) + (1 x)yp(0, 1) + xyp(1, 1). Then, a patch that satisfies the boundary conditions on p is p 0 (x, y) = G(x, y) S(x, y). To satisfy the boundary conditions on the primitive FOSLS variables, assume that the boundary conditions are consistent with the definitions of the derived variables, and write the

21 8 relation between primal unknowns and all unknowns as u = Dp. (1.16) For example, in a diffusion problem, u = p u v = 1 x y p Then the patch that satisfies boundary conditions for all unknowns is u 0 (x, y) = Dp 0 (x, y). Using superposition with this patch gives L(ũ) = f Lu 0 B(ũ) = 0, where ũ = u u 0 and g Bu 0 = 0, effectively burying nonzero boundary conditions in the forcing function. Thus, PDEs considered in this thesis are assumed to have been already converted to an equivalent system with homogeneous boundary conditions.

22 Chapter 2 NI-AMR-RD Algorithm Iterative Description The goal of the NI-AMR-RD algorithm is to reduce communication, especially the number of communications, at the expense of possibly incurring more computation and less accuracy. Using range decomposition only achieves a computational advantage when adaptive mesh refinement is employed to approximate u i in different finite element spaces, for each process. On a given process, nested iteration with ACE places elements globally to achieve a nearly optimal mesh for solving the decomposed problem, (1.14). For a Laplacetype operator, this nearly optimal mesh tends to focus refinement in the home turf since the solution is harmonic elsewhere. The solve phase of the algorithm requires no communication; each process simply performs NI and AMR on the PDE with the process unique portion of the residual. The RD algorithm using P processes is as follows: (1) Preprocessing step: Solve Lu = f in Ω on each process using ACE to refine the mesh from 1 element until there are approximately P elements. For simplicity, assume there are P elements; if there were more, then they could easily be approximately equally distributed among processes. Let the solution from this preprocessing step be denoted by u (0). (2) RD iterations:

23 (a) Beginning with preprocessing approximation u (0) on a coarse mesh of P elements, find the next approximate solution, 10 u (n+1) = u (n) + δu (n), (2.1) by rewriting the right hand side as a residual, f (0) = f, (2.2) f (1) = f (0) Lu (0), (2.3). (2.4) f (n+1) = f (n) Lu (n). (2.5) and solving for the correction, Lδu (n) = f (n+1). (2.6) (b) To solve for the correction, each process instead solves the range decomposed problem, Lδu (n) i = χ i f (n+1), (2.7) where 1 in Ω i χ i =. (2.8) 0 Ω j i Here, Ω has been partitioned into P subdomains, Ω i, each corresponding to a process. The correction after each process has solved (2.7) are summed to give the solution to (2.6); δu (n) = P i=1 δu (n) i. (2.9) (c) After the processes have finished independently solving their version of (1.14), the solutions, u i, must be added together in a union over the processes composite grids. This involves a global communication of each processes solution and

24 11 mesh corresponding to the area outside of its home turf. The global solution on the union grid cannot fit on a single process each process is simply responsible for storing the summed solution in its own home turf. This all-to-all communication has some very interesting details that are explained in section 2.0.3, and make the global communication less expensive than it may first appear. If ACE is used when solving each decomposed problem (1.14), the composite meshes are nearly optimal, but the union of optimal meshes may not be optimal for solving the original system, (1.11). Additionally, the solutions to the decomposed problem are iterative approximations that are interpolated onto the unioned grid, so it is reasonable to expect some corrections to be necessary. An iterative solution process applied to the entire algorithm is therefore employed, and (1.11) is viewed as a residual equation. The NI-AMR-RD algorithm is: Preprocessing: Solve Lu (0) = f on a coarse grid Set: f f Lu (n) Solve: Lδu (n) = f RD Solve: Lδu (n) i = χ i f i (Nested Iteration, AMR) RD Sum: δu (n) = P i=1 δu(n) i (All-to-All Communication) Update: u (n+1) = u (n) + δu (n) Iterate: n n + 1 Each process solves the decomposed problem using nested iteration and the ACE mesh refinement strategy. This solve is local to a process and produces an update, δu (n) i, which is an element of the finite element space, V (n) i. The individual updates must be summed and to do so the union of their finite element spaces must be formed. An example of the refined mesh corresponding to process 6 is shown in figure 2.1. All processes initially begin on identical coarse meshes, but because they each have a

25 Refinement (Processor 6) Coarse Grid Fine Grid 6 Figure 2.1: 16 process example of the initial coarse mesh from which each process would start nested iteration, and the resulting refined mesh for process 6. The coarse-mesh elements are numbered according to processes, and they need not be a uniform grid. Notice that most of the refinement happens in a process own home turf.

26 piece of the residual in their own home turf and the solutions are L-harmonic outside the home turf, adaptive refinement will tend to place elements in the home turf. To quantify this, we define a measure of the proportion of elements in the home turf to the total elements used for the subsolve: Q = elements in Ω k total elements. Thus, a value of Q = 1 would correspond to all elements being placed in the home turf, while a value of Q = 0 means that the adaptive strategy determined that the optimal grid for solving the subproblem was with all elements outside the home turf. Note that when the coarse grid is a uniform grid of P elements, there are at least P 1 elements outside the home turf. By assumption, the solution cannot be stored on a single process, and is therefore distributed among all processes. Each process is responsible for storing the final solution over their own home turf only. To assemble a final solution, a process needs to add the 13 contributions from the other processes solves to its own home turf. This means that a process home turf never needs to be communicated to another process, and the solutions communicated among the processes come from sparsely refined exterior regions Communication and Union of Meshes At the end of an RD iteration, a process either needs to store the part of the solution for which it is responsible, or compute a residual in preparation for more RD iterations. Storage of the solution is distributed among processes, so process k needs only to store the solution restricted to Ω k, denoted by u Ωk. When continuing iterations, process k solves the range-decomposed residual equation, Lδu k = χ k (f Lu), (2.10) which only requires the functional, and thus the solution, in the home turf.

27 14 Thus, process k does not need to sum the entire decomposed solution from other processes, it only needs to sum the portion of solutions from other processes corresponding to its own home turf 1, Ω k : δu Ωk = P δu Ωk i. (2.11) i=1 Writing the sum in (2.11) implicitly assumes that the solutions are summed on a union mesh over the home turf. Each process needs the solution and union mesh in its own home turf. A traditional all-gather, where the entire solution and grid are communicated would result in each process receiving the solution over the entire domain, which is neither practical or necessary. Instead, I developed a modification of the all-gather communication pattern to gather only the solution and mesh on a process home turf in log(p ) communication steps. Consider the stage of the algorithm where communication is carried out. The processes have solved the decomposed problem (1.14) through nested iteration and AMR, and each has an adapted mesh that is concentrated in its home turf and becomes sparser with distance from the home turf (see figure 2.1). The processes now need to communicate each region of their solution to the process that has a home turf in that region. This could be accomplished in P communication steps if every process passed its version of the recieving process home turf at each step. To accomplish this in log(p ) communication steps, a tree structure is used wherein processes are passed somewhat more than just their own home turf so that they can pass along the partially summed solution during a later communication step. To understand this communication pattern, consider the 6th process in an example of 16 total processes. Figure 2.2 shows where processes send their data on each step and Figure 2.3 shows what is sent at each communication step. On the first communication step processes are split into two groups. Processes in the first group carry out paired send and 1 In fact, a process would only need the solution from another process along the boundary of Ω k. From the boundary data the solution in Ω k could be recomputed locally, further reducing the bandwidth costs of the method in exchange for more computation.

28 Communication Pattern for Processor 6 Step Step Step Step Direct Communication Indirect Communication Figure 2.2: (a) The butterfly communication pattern used to sum the solutions and union the grids. The dark arrows emphasize the direct communication of another process solution to process 6, while the gray arrows denote when a process solution in region 6 is passed on to be indirectly communicated to process 6.

29 16 receives with processes in the second group. For example, figure 2.2 emphasizes with a dark arrow that process 6 receives data from process 14. Processes send the solution in regions corresponding to the opposite process group, sending their solution in half the domain. For example, figure 2.3 shows that, during the first step, process 14 s solution and mesh in regions 1-8 (grey) are sent to process 6. After a send/receive, a process current mesh must be unioned with the received mesh, and the received solution must be added to the current partially summed solution. However, this only happens in the region corresponding to the received region, which is half the domain for the first step, a quarter of the domain on the second step, etc. After communication step 1, process 6 has sent its solution in regions 9-16 to process 14. Process 14 unions and sums the solution over the regions corresponding to its group (9-16) so that it can then pass along the solution in these regions to the other processes in this group. Process 6 never communicates to any other process in 14 s group. Instead, 14 serves as the proxy for conveying 6 s solution to the rest of the group. Solutions and grids in sent regions no longer need to be stored by the sending process and can be freed from memory. On the second communication step, the groups from step 1 are each subdivided into 2 new groups, resulting in 4 groups. Processes that were previously in the same group communicate to their sister processes in the new group. For example, process 2 communicates with process 6 (see figure 2.2). Process 2 sends its current partially summed solution in region 5-8, as shown in (b) in figure 2.3. The grid and solution sent from process 2 is a result of the union and summation on the previous communication step. On step 1, process 2 received a grid and solution from process 10. The grid contained several elements, shown in red in figure 2.3, which were distinct from the elements existing on process 2 s grid, and the union resulted in a grid with more elements. This union grid is what is sent to process 6 on the second communication step.

30 Communication Step 1 Processor 6 Processor Communication Step 2 Processor 6 Processor (a) Communication Step 3 Processor 6 Processor 8 *new elements recieved by processor 2 on prior steps. (b) Communication Step 4 Processor 6 Processor 5 *no new elements recieved by processor 8 on prior steps. (c) *new elements due to processor 5 union with 7 on prior steps. (d) Figure 2.3: Example of what is sent and received by process 6. During the first step, (a), process 6 sends regions 9-16 and receives process 14 s solution in region 1-8. (b) Process 6 now receives regions 5-8 from process 2, and the elements marked in red are the additional elements process 2 created when it was unioned with process 10 on the previous step. (c) On step 3, process 6 and 8 communicate 1/8 of domain. (d) In the final communication step, process 5 sends to 6 its partially summed solution and unioned grid only from region 6.

31 18 Communication step 3 has process 8 paired with 6, and they exchange approximately 1/8th of the exterior elements. Process 8 has not accumulated additional elements from its previous unions with grids from processes 16, 4, or 12. On the final communication step, process 5 is 6 s communication partner. Process 6 sends only region 5 to process 5, but, by this point in the communication pattern, this sent region encompasses the sum of the solutions from processes 14, 2, 10, 8, 4, 12, 16, and 6. Similarly, process 6 receives only its home-turf region from process 5, but this region is the union and sum over all the previous processes that have directly or indirectly communicated to process 5. Ever process has now unioned and summed the solution in its home region, removing from memory all outside regions. At each step of the communication the number of regions to store and communicate decreases by a factor of 2. This drives the message size down on each step of the communication pattern. The number of elements in a region, however, can increases because of grid unions. This would tend to drive the message size up. These two factors will play a crucial role in future improvement of the method and are critical to understanding the performance of the algorithm.

32 Chapter 3 Error Analysis The following chapter analyzes two aspects of the error in the NI-AMR-RD method. The first analysis provides an estimate on the discretization error achieved by successive iterations of the NI-AMR-RD method. The second analysis bounds the functional reduction that the RD method achieves on the first step. 3.1 Notation NI-AMR-RD is applied in parallel using P processes, and each process has some fixed memory resources that allow it to construct and solve a finest mesh of E elements. We begin by defining the basic notation: P = Number of processes E = Number of elements each process can store EP = Total resources of the machine J = Number of iterations of NI-AMR-RD This thesis considers NI-AMR-RD in a FOSLS context. Let L = FOSLS operator Lu = f FOSLS system Bu = g Boundary operator Different decomposed problems will be solved on different processes. Let the subscript define the specific problem and process and let

33 u l = Exact solution to decomposed problem Lu l = χ l f 20 W h l = Finite element space for decomposed problem E = Wl h, Dimension of Wh l, fixed for all processes l u h l = Best finite element approximation from Wl h τ = Element index Ω l = Region where process l has nonzero right-hand side Q l = Percentage of elements placed in Ω l by process l. Let {χ k } k=1,...,p be a partition of unity such that Ω k = supp(χ k ). That is, χ k is the characteristic function of Ω k. A discontinuous partition of unity, χ k, is used in this thesis, but smoother partitions are possible, for example a C 0 partition of unity that decays linearly to zero in neighboring regions. The FOSLS method directly gives a minimization problem where the error is computable in the minimizing norm (the functional norm). Thus, the functional norm is the natural norm used in this thesis to measure error and convergence to the discrete solution. Next, define the optimal finite element spaces that the ACE adaptive refinement algorithm approximately achieves. Definition 1 (Optimal Mesh). Given a machine with the capacity to solve a FOSLS system on a finite element space of dimension N, we say that W h op,n is the finite element space that satisfies Wop,N h = arg min min Lw h f. W h =N w h W h Definition 2 (Optimal Discretization Error). The optimal discretization error is the error obtain on the optimal grid: ε op,n = min Lw h f. w h Wop,N h Definition 3 (Union Mesh). Given P optimal meshes from a solve of Lu l = χ l f on the jth RD iteration, the union mesh is the union of the optimal meshes from the decomposed solve.

34 21 That is, the union mesh is the finite element space such that P WU,j h = arg min min Lwl h χ l f. Wl h =N wl h Wh l l=1 Note that the dimension of the union mesh is l Wh l. Definition 4 (Union Mesh Discretization Error). Let the solution on the union mesh be denoted by u h U,j = arg min w h W h U,j Lw h f. Then, the union mesh discretization error is defined as ε U,j = min w h W h U,j Lw h f. Definition 5 (Full Union Mesh). The full union mesh is the union of the union mesh over J iterations, W h U = J P j=1 l=1 arg min min Lw Wl h =N wl h l h χ l f. Wh l Note that the dimension of the full union mesh is j l Wh l. Definition 6 (Dimension of Full Union Mesh). Not only are different decomposed problems solved on different processes, but the method is iterative as well, with refinement restarting at each iteration. To determine the dimension of the finite element space WU h, define Q such that QEP = WU h. An estimate is Q elements in j l Wh l where k denotes the average over the k processes. that are contained in Ω k k, (3.1) E Definition 7 (Full Union Mesh Discretization Error). Let the solution on the full union mesh be denoted by u h U = arg min Lw h f. w h WU h Then the full union mesh discretization error is defined as ε U = min Lw h f w h W h U

35 The NI-AMR-RD decomposed solve on each subproblem obtains a near-optimal grid 22 for that given subproblem. Let W h k represent the near-optimal grid for decomposed problem k and denote the optimal finite element solution by u h k = arg min Lw h χ k f. w h Wk h The discrete RD solution is denoted simply as u h = k u h k. Definition 8 (RD Discretization Error). Denote the functional-norm error associated with the jth RD solution by ε RD,j = Lu h f. We immediately have the following relationships: ε op,ep ε op,qep ε U ε RD,j, which will be useful in the next section. 3.2 NI-AMR-RD Asymptotic Discretization Error Lower Bound Adaptive refinement in the RD method devotes elements to areas outside the home turf of a process. These elements are essentially thrown away after each iteration and are a sacrifice of the method to achieve reduced communication. A natural question is: how much accuracy is sacrificed by RD placing elements outside a process home turf elements that are thrown away and do not contribute to a finer mesh in the home turf? A heuristic answer, and a lower bound that appears to be achieved for a certain class of numerical test problems, can be arrived at by the following assumptions. (1) Assume that the NI-AMR-RD iteration converges to the solution on the full union mesh,

36 (2) Assume that the full union mesh has approximately equal functional error to the optimal (ACE) mesh using QEP elements. 23 The error achieved on the ACE mesh using QEP elements of order q is a factor of 1/Q q/2 times the error achieved when using EP elements with ACE, so it follows from the assumptions that the NI-AMR-RD solution at best can be a factor of 1/Q of the error achievable using the full resources of the machine in a traditional method. Of course, this is a statement about possible discretization error, not the speed or efficiency with which the method arrives at that error. If the RD iterative solution is assumed to converge to the solution on the full union mesh, and the full union mesh is assumed to be optimal, then the asymptotic error one can expect to achieve with RD is a factor of 1/Q q/2 of the error one would achieve when applying P E total resources in a traditional method. Both the NI-AMR-RD method and the traditional method employ ACE and thus achieve near-optimal grids. There is an important difference, however, in that the traditional method employs ACE to achieve a near-optimal grid for solving Lu = f, while NI-AMR- RD achieves a near-optimal mesh for solving Lu l = χ l f. On each RD iteration, the RD correction is summed over the union of the optimal meshes from each processes previous solve of the decomposed problem. After J iterations, the RD solution lives on the full union mesh, which was defined previously. This brings us to the first assumption: Assumption Assume that the NI-AMR-RD iterations error converges to within a constant, independent of P, of the full union discretization error. In other words, given any C 1 1, there exists a J such that for all iterations j > J, ε RD C 1 ε U, (3.2)

37 Both the RD solution, u h, and the union solution, u h U, are elements of the full union mesh space, WU h, but the union solution is the minimizer by definition. Thus, given any C 1 1, we have the bound 24 ε U ε RD C 1 ε U, for j sufficiently large. Assumption Assume that the full union mesh error is within a constant, independent of P, of the optimal discretization error (of the same dimension as the full union mesh). That is, assume where C 2 1 and independent of P, and M = j ε U C 2 ε op,m, (3.3) l Wh l. The dimension of the the full union grid is M = QEP, by definition. The full union grid of dimension M is not necessarily optimal, which, with the above assumption, gives the bound Combined with the previous assumption this implies that ε op,qep ε U C 2 ε op,qep. (3.4) ε op,qep ε RD C 1 C 2 ε op,qep, (3.5) for j sufficiently large. Now consider an error estimate for ε op,n. Theorem 4.7 of [22] provides an error bound for a general FOSLS problem, using finite elements of order q, that is well posed and has a sufficiently smooth solution: Lw h f 0,Ω C q h q u 1+q,Ω. (3.6) This result can be applied to a nonuniform mesh spacing where, in 2D, h is defined as h = 1/ N, (3.7)

38 with N the number of elements (dimension of the finite element space). Applying this to 25 ε op,qep gives ( ) q 1 ε op,qep C q u 1+q,Ω. (3.8) QP E Using this in (3.5), we have a bound on the asymptotic error for the NI-AMR-RD method: ( ) q 1 ε op,qep ε RD C 1 C 2 C q u 1+q,Ω. (3.9) QP E Contrast this with the traditional method of NI with ACE, which uses communication to harness all E elements per process and achieves a nearly optimal mesh of EP total elements. The error bound for the traditional algorithm is ε Trad = ε op,ep C q ( 1 P E ) q u 1+q,Ω. (3.10) The best NI-AMR-RD could hope to do is to achieve the same error as an optimal mesh using QEP total resources. If the RD solution does better than converge to within a constant of the union solution, i.e. it in fact converges directly to the union solution, and if the union solution is identical to the optimal one, then C 1 and C 2 = 1. The error bound for RD would then be ( ) q 1 ε RD C u 1+q,Ω. (3.11) QP E In other words, if the RD solution converges to the optimal solution with the same effective number of elements, the discretization error bound one can expect from the RD method is a constant factor of ( 1/Q ) q/2 times the error bound of the traditional, fully communicating algorithm. For Laplace s equation with an oscillatory right-hand side, the numerical results section shows C 1 and C 2 are approximately equal to 1, and the asymptotic RD error closely follows the relationship ( ) q/2 1 ε RD ε Trad. (3.12) Q

39 26 When Q is close to one, which will be seen to be the case for Laplace equation, this reduction in accuracy is a small sacrifice to avoid communication. This section has provided some heuristics on the asymptotic performance of the method; however, what is of real interest for a fast method is the first iteration. This is explored in the next section. 3.3 Functional Reduction on 1st Step This section develops a heuristic bound on the functional error when applying NI- AMR-RD. Under certain assumptions, only one iteration of the NI-AMR-RD method is needed to achieve a functional error within a small factor of the functional error one would achieve by directly solving on the union mesh. When developing the bound of the error after the first RD iteration, rather than getting caught up in the iterative notation, consider f to be the functional residual resulting from the preprocessing step and u to be the correction to this residual. That is, instead of writing the PDE solved by RD iterations as a cumbersome iterative correction as in (2.6), we simplify notation and write Lu = f. (3.13) This implicitly defines f as the functional residual from the preprocessing step. Similarly, the decomposed problem solve will be written Lu i = χ i f. (3.14) The proof proceeds as follows: (1) Bound the functional in a home turf region, L(u u h ) 2 Ω k, by bounding the contributions from the host process and the external processes. (2) Bound the total functional by summing the contributions from each home turf, Ω k.

40 (3) Use an interpolation assumption to write the bound in terms of an effective mesh spacing and a semi-norm on u k. 27 (4) The sum of seminorms of u k are finite and can be bounded by a constant independent of E Functional Reduction Assumptions We begin bounding the error by making the following assumptions. Assumption Initial Error Distribution: the initial functional, f 2 Ω, is equally distributed among the coarse mesh elements, and, consequently, among processes. Assumption implies that the initial functional is also equally distributed among processes. This would be accomplished in a preprocessing step in which the solution on a coarse mesh of at least P elements has been computed separately in each process, and the residual, now called f, is orthogonal to the coarse mesh space in the functional inner product. Assumption Graded Mesh: Assume that an adaptive finite element method (ACE), starting with the inherited preprocessing mesh and f k = χ k f as right-hand side, yields a mesh in which the number of elements within Ω k is Q k E, where Q 0 Q k 1 and Q 0 is bounded away from zero. Then the effective mesh spacing in Ω k is h k = 1/ Q k P E. Now consider P L(u k u h k) 2 = L(u k u h k) 2 Ω j (3.15) j=1 We assume there is a 0 < Q 0 Q k 1 such that L(u k u h k) 2 Ω k = Q k L(u k u h k) 2 Ω, (3.16) which implies L(u k u h k) 2 Ω j = (1 Q k ) L(u k u h k) 2 Ω (3.17) j k

41 For ease of notation, define the contribution to the error from each subdomain as follows: 28 ε kj = L(u k u h k) Ωj. (3.18) We now make two assumptions on the character of these errors: they are bounded above everywhere, and in fact, in an exterior neighborhood around the home turf, the error decays with distance from the home turf. Assumption Uniformity Assumption: The functional from the solve of Lu i = χ i f is approximately equally distributed among elements on the refined mesh. That is, assume that for all τ k W k, and assume that L(u k u h k) τk η 1 (3.19) η 0 L(u k u h k) τk (3.20) for all τ k W k such that τ Ω k. We assume that η 1 /η 0 = O(1). (3.21) Further, assume (3.19)-(3.21) holds independent of k. Assumptions and are statements about the ability of the ACE refinement to equally distribute the functional among elements. Previous work has shown that given a sufficient number of refinement levels, ACE achieves a mesh in which the contribution to the functional is equal among elements [2]. The number of refinements required to reach this equadistribution of the functional is problem dependent. Here we assume that the resources are sufficient to achieve this result. Assumption is a statement that if ACE has done its job, then the error in each element outside Ω k is the same as or smaller than a typical element inside Ω k. The error in each element is essentially equal across elements in a neighborhood around a given process home turf where the mesh is actually graded. The algorithm could be imagined to be true

42 nested iteration on each process, in which case the mesh would be graded all the way to the boundary. However, in the case of our implementation, the mesh eventually becomes the coarse mesh of approximately P elements that is inherited from the preprocessing step. Because this inheritance of the approximately P element coarse mesh does not allow the mesh to continue to grade away from the home turf, the error away from the home turf must be assumed to decay. Assumption Graded Error: In the first step of the RD algorithm, each process, in parallel, established a solution on a coarse mesh containing at least P elements. This provides for equal distribution of the error among processes and a starting point for each process to solve its own problem. This original mesh is inherited and beyond a graded neighborhood around the home turf the error in the decomposed solve is not tightly bounded by (3.19). On this exterior portion of the grid, since elements cannot be removed to make the grid graded, we expect the error within coarse elements to decay with physical distance between Ω k and Ω j, which we denote d(k, j). Further, we expect this decay to be exponential, that is, we assume there exists an 0 < ρ < 1 and constant C d > 0 such that, for j k, 29 ε kj C d (max j k ε k,j)ρ d(k,j), (3.22) We assume that the distribution of error is symmetric. That is, we assume that the error u h k contributes to domain j is equal to the error that uh j contributes to domain Ω k. This is stated as ε jk ε kj. (3.23) One can visualize a P P symmetric matrix whose elements are given by ε kj. The norm of the kth row is L(u k u h k ). Lemma Let ε kj be defined in (3.18) and satisfy the bound in (3.22). Then, ( P ) 2 P ε k,j C ρ ε 2 k,j, (3.24) j k j k

43 where C ρ = 30 ( ) 1 + C2 d. (3.25) (1 ρ) 4 Proof. Using (3.22) in Assumption 3.3.4, and assuming a two-dimensional problem, the error is assumed to decay in rings around the home turf. The ring of distance 1 has 8 elements, the next ring has 16. Continuing in this fashion, e k,j Cε 1 8 lρ l 1. (3.26) j k The result now follows from application of Lemma A.0.2. Assumption Interpolation Error Bound: Let τ Ω k. The interpolation error on τ is bounded by where h τ is the diameter of element τ. This implies that where h k = max τ Ωk h τ. Here, we assume that h k = 1/ Q k EP and where 1 ĈI/C I = O(1). P l=1 L(u k I h u k ) τ C I h q τ u k q+1,τ, (3.27) L(u k I h u k ) Ωk C I h q k u k q+1,ωk, (3.28) L(u k u h k) Ωk ĈIh q k u k q+1,ωk, (3.29) Note that, in general, a least squares solution might not satisfy the bound L(u k u h k) τ L(u k I h u k ), (3.30) on each element, or, in fact the bound on any collection of elements: L(u k u h k) Ωk L(u k I h u k ) Ωk. (3.31) However, it is a good assumption that (3.29) holds because of the assumption that the elements in Ω k are a major fraction of the total number of elements and the equidistribution of error among the elements.

44 Functional Reduction Proof With the above assumptions and lemmas in hand, we can now derive a bound on L(u u h ) Ω. Consider an arbitrary home-turf region Ω k. One process, which we call the host process, has solved with a nonzero right-hand side in its home turf, and the other processes have contributions from their homogeneous solves in this region. The functional error in region Ω k can be bounded by the sum over all processes contributions to that region: P P P L(u u h ) Ωk = L( u l u h l ) Ωk = L(u l u h l ) Ωk l=1 l=1 l=1 P L(u l u h l ) Ωk. (3.32) l=1 The sum can be split into the host process contribution and the contributions from the external processes, P L(u l u h l ) Ωk = L(u k u h k) Ωk + L(u l u h l ) Ωk (3.33) l k l=1 Consider the sum in the last equation. From (3.17), Lemma 3.3.1, and the assumption that ε lk is symmetric, we have L(u l u h l ) Ωk = ε lk = ε kl (3.34) l k l k l k = ( C ρ P l k ε 2 k,l) 1/2 (3.35) C ρ (1 Q k ) L(u k u h k) Ωk (3.36) This leads to the bound L(u u h ) Ωk (1 + C ρ (1 Q k )) L(u k u h k)) Ωk. (3.37) The full functional is the sum of the functionals from each Ω k. Using the above bound on

45 32 subregion Ω k, and, finally, L(u u h ) 2 Ω P L(u u h ) 2 Ω k (3.38) l=1 P (1 + l=1 ( 1 + C ρ (1 Q k )) 2 L(u k u h k)) 2 Ω k (3.39) ) 2 P C ρ (1 Q 0 ) l=1 L(u k u h k)) 2 Ω k. (3.40) Note that the constant is now independent of k. Let us sweep all the constants into one and define C b = ( ) 1 + C ρ (1 Q 0 ). (3.41) Using the interpolation error assumption, (3.29), gives, L(u u h ) 2 Ω C 2 b C 2 b P L(u k u h k) 2 Ω k (3.42) k=1 P (ĈIh q k u k q+1,ωk ) 2, (3.43) k=1 This last bound can be rewritten as L(U U h ) Ω Ch q P U k 2 q+1,ω k, (3.44) where h = max k h k 1 Q 0 EP and C = C bĉi. Note that this bounds the error after 1 step of RD, independent of E, the number of elements that each process can resolve. A bound on this final term that depends only on f, that is, a bound independent of P still remains to be established, but numerical tests indicate this constant is, in fact, independent of P. Note that, the whole heuristic line of reasoning depends on the assumption that E > P, and works best when E >> P. This corresponds to a limited weak scaling. For comparison, the error bound for the union grid solution is k=1 L(u u h U) 0,Ω C I h q u q+1,ω. (3.45)

46 Numerical Validation Full numerical results for the NI-AMR-RD method applied to the FOSLS formulation of p = 2π 2 (65 2 ) sin(65πx) sin(65πy), are given in chapter 5. However, it is appropriate here to show the relative functional reduction on the first iteration in figure 3.1. Note that the given f has f 2 approximately equally distributed among a uniform initial grid of P elements, and the mesh generated by ACE is approximately graded. It is thus a good candidate for satisfying the assumptions, and therefore adhering to the bound. Since quadratic elements were used, q = 2 in the bound (3.44), and the relative functional reduction should be bounded by O(h) = O(1/QEP ). As seen in figure 3.1, the relative functional reduction in the first step indeed appears to behave linearly with respect to the effective mesh spacing, and in fact the constant is not dependent on P.

Figure 3.1: Plot showing the relative functional reduction, $\|f - Lu^h\|/\|f\|$, versus $1/(QEP)$ on the first iteration of NI-AMR-RD for a problem that approximately satisfies the assumptions. E is fixed at 4000 elements, Q is measured, and P is increased from 16 to 1024.

Chapter 4

Performance Model

To understand where the RD method is particularly effective, two communication cost models are developed here. The first, for comparison purposes, is a model of a traditional implementation of nested iteration with AMR, and the second is for the new NI-AMR-RD algorithm. Many implementation assumptions have to be made in order to develop such models, but every effort was made to keep the assumptions consistent across both models. The assumptions are also consistent with existing software implementations of the two methods.

When developing a performance model, communication costs can be separated into two categories: $T_{comm} = T_{latency} + T_{bandwidth}$, where $T_{latency}$ is the fixed overhead of sending a message of any size and $T_{bandwidth}$ is the cost that depends on the bandwidth and the message size. Latency tends to be the dominant communication cost for a traditional implementation, so careful accounting of the number of messages passed is crucial in the development of the traditional model, while the size of the messages passed can be more crudely approximated.

4.1 Traditional Implementation of Nested Iteration

First, for comparison purposes, consider the common implementation of nested iteration using multigrid and AMR, shown in Figure 4.1. Nested iteration begins with a random

initial guess on an initial element mesh. An approximate solution on the current mesh is obtained through multigrid V-cycles, and then this approximate solution is interpolated onto a finer mesh. The finer mesh can be generated adaptively by using a measure of the error to refine elements. The procedure repeats until the final nested iteration level is reached. The final level could be determined by a variety of conditions, but in the case of this model, consider it to be determined by a total number of allowed elements.

Figure 4.1: Model of nested iteration. Starting from nested iteration level j, a V-cycle goes from the fine grid, i = k(j), down to 1 node (i = 1) and back to nested iteration level j. Adaptive refinement then generates the next nested refinement level, and the approximate guess for the solution is interpolated (arrow) onto the new level.

To simplify modeling, consider the task of solving the PDE in 2D using linear finite elements, which gives a 9-point stencil. To determine the cost of traditional nested iteration, first consider the cost of a V-cycle, then consider the sum of those costs at each nested iteration level.

Communication Cost of a Single V(1,1) Cycle in 2D

First consider the steps in a V(1,1) cycle utilizing k grids. Let the multigrid levels be indexed by i, with the coarsest grid given by i = 1, and the finest by i = k. Later, when nested iteration is considered, a V-cycle's finest grid will be a function of the nested iteration level, that is, k = k(j).
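The procedure just described can be summarized in a short sketch (illustrative Python pseudocode only; the thesis software is written in Fortran 2003 with MPI, and the mesh object and the routines vcycle, estimate_error, refine, and interpolate are hypothetical placeholders for the corresponding multigrid and AMR components):

def nested_iteration(mesh, f, max_elements, cycles_per_level=1):
    # Traditional nested iteration with AMR: solve, estimate, refine, interpolate, repeat.
    u = mesh.zero_guess()                        # initial guess on the coarse starting mesh
    while mesh.num_elements() < max_elements:    # final level set by the total allowed elements
        for _ in range(cycles_per_level):
            u = vcycle(mesh, u, f)               # V(1,1) cycles from the current fine grid to 1 node and back
        err = estimate_error(mesh, u, f)         # element-wise error measure used to flag elements
        fine_mesh = refine(mesh, err)            # adaptive refinement generates the next nested level
        u = interpolate(mesh, fine_mesh, u)      # carry the approximation onto the finer mesh
        mesh = fine_mesh
    return mesh, u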

Communication costs on a process for a single V-cycle can be written as
$T_{latency} + T_{bandwidth} = \alpha \sum_{i=1}^{k} m_i + \beta \sum_{i=1}^{k} q_i$,
where $m_i$ is the number of messages passed on grid i, $q_i$ is the amount of passed data on grid i, $\beta$ is the inverse bandwidth of the machine, and $\alpha$ is the latency cost.

Assume that standard geometric full-coarsening is used in the V-cycles so that the multigrid operators and complexity are identical on all grids. To avoid separately modeling the coarse-grid solve, assume that the V-cycle reaches 1 node globally and that a coarse-grid solve is equivalent to the multigrid operations on other levels. This assures that the coarsest grid is sufficiently coarse and low-frequency errors are addressed.

Figure 4.2: (a) Example of nodes along a process boundary that must be passed to a neighboring process to perform relaxation, restriction, or evaluation of the functional. There are 8 neighboring processes, so 8 separate messages must be passed to perform any of the multigrid operations. If the number of nodes along a side is n, then the total number of nodes that must be sent for relaxation is 4n + 4. For restriction with a full-weighting stencil, all 4n + 4 boundary points must be communicated. Interpolation, shown in (b), only requires approximately half as many nodes be communicated.

Number of messages in a V-cycle

To compute the latency time, note that the number of messages on grid i is
$m_i = 2 m_{i,relax} + m_{i,residual} + m_{i,restrict} + m_{i,interp}, \quad 1 \le i \le k$.

For the 9-point Laplacian, messages must be sent to the 8 neighboring processes (figure 4.2), so the specific number of messages passed per grid i is
$m_i = 2(8) + 8 + 8 + 8 = 40, \quad 1 \le i \le k$.
Summing gives the number of messages for a V-cycle, starting at level k, as
$\sum_{i=1}^{k} m_i = \sum_{i=1}^{k} 40 = 40k$.

Size of messages in a V-cycle

Define $n_i$ as the number of nodes along the side of a process boundary for grid i. Then, strictly speaking, the number of nodes that need to be communicated on grid i is $4n_i + 4$ (figure 4.2). However, the inclusion of the 4 corner points in the message size is inconsequential, so the approximate number of nodes communicated is $4n_i$. The amount of data that must be passed on grid i is then $4n_i$ for the relaxation, residual, and restriction operations, and $4n_i/2$ for interpolation. Accounting for 2 relaxations, a residual computation, and a restriction and interpolation, the sum of the message sizes over the grids of a given V-cycle is
$\sum_{i=1}^{k} q_i = \sum_{i=1}^{k} \left[ 2(4n_i) + (4n_i) + (4n_i) + (4n_i/2) \right] = \sum_{i=1}^{k} 18 n_i$.
Standard coarsening (in other words, coarsening by a factor of 2 on a side) will reduce the number of nodes in a process region until there is only one node per process. This is graphically illustrated in figure 4.3. Further coarsening, which is necessary to achieve a coarse grid that adequately addresses smooth error, requires some processes to go idle. The slowest process is the one that does not go idle, and it still must have one node on every grid, so
$n_i = n_k / 2^{\,k-i}$ for $k - \log_2(n_k) \le i \le k$, and $n_i = 1$ for $1 \le i \le k - \log_2(n_k)$. (4.1)
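This per-V-cycle bookkeeping is easy to tabulate; the following minimal Python sketch (an illustration of the model, not the thesis code) computes the message count and data volume of a single V-cycle:

def vcycle_comm(k, n_k):
    # k   : number of multigrid levels in the V-cycle (coarsest grid is i = 1)
    # n_k : nodes along one side of a process region on the finest grid i = k
    messages = 40 * k                        # 8 neighbors x (2 relax + residual + restrict + interp)
    data = 0
    for i in range(1, k + 1):
        n_i = max(n_k // 2**(k - i), 1)      # equation (4.1): at least one node per process
        data += 18 * n_i                     # 2*(4n) + 4n + 4n + 4n/2 = 18n per grid
    return messages, data

# Example: a V-cycle with k = 10 levels and 64 nodes along a process-boundary side.
msgs, data = vcycle_comm(10, 64)             # data is bounded below by 18*k, consistent with (4.2)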

Figure 4.3: Depiction of a V-cycle, with the fine grid ($i = k$, $n_k^2$ nodes per process) coarsening until 1 node per process is reached. Past that point, some processes go idle, but the slowest process always has at least one node.

In other words, the number of nodes per process is at least 1, and a lower bound on the message size for a V-cycle operation is
$\sum_{i=1}^{k} 18 n_i \ge 18k$. (4.2)

Nested iteration cost

Now consider a nested iteration of the V-cycles analyzed in the preceding section. The communication costs for a nested iteration of V-cycles can be written as
$T_{latency} = \alpha \sum_{j=1}^{L} \left( \sum_{i=1}^{k(j)} m_i + m_j \right), \qquad T_{bandwidth} = \beta \sum_{j=1}^{L} \left( \sum_{i=1}^{k(j)} q_i + q_j \right)$,
where $\alpha$ is the latency cost, $m_i$ is the number of messages passed on V-cycle level i, $q_i$ is the amount of passed data on V-cycle level i, $\beta$ is the inverse bandwidth of the machine, and $m_j$ and $q_j$ are the number of messages and the amount of data passed to interpolate onto nested iteration level j. The interior sum over i is for the k(j) levels of the V-cycles, and the outer sum over j accounts for the nested iteration levels (figure 4.1).

When examining the cost of a single V-cycle, nodes are the natural quantity for determining the number of multigrid levels or the size of passed messages. For adaptive mesh refinement, working in terms of elements is natural, so it is necessary to be able to convert between the two. When considering bilinear elements, the number of unique nodes is equal to the number of elements, so element and node counts can be interchanged.

For this model, the starting grid for nested iteration is taken to be one element per process. This is based on the implementation we are comparing in the numerical results section. Our standard NI-AMG software starts from a coarse grid of one element per process, and the implementation of the RD method also starts from a coarse grid of one element per process. Given a nested iteration starting mesh of P elements, or P nodes, the number of multigrid levels for the first V-cycle will be $k = \log_4(P)$.

Assume that adaptive mesh refinement is used to achieve successive grids, but that it is done uniformly with respect to levels in order to give a level-independent factor by which the number of nodes increases. That is, denote the fraction of elements refined on a level by $\omega$ and assume for this model that $\omega = 1/3$ for every grid. Then each level of adaptive refinement doubles the number of elements, which can be seen from $4N\omega + N(1 - \omega) = 2N$. Table 4.1 relates the important quantities of the model; the short sketch following the table works through this bookkeeping.

Table 4.1: Nested iteration levels are related to the global number of fine nodes and therefore the number of V-cycle levels. This assumes refinement doubles the number of elements, the elements are linears, and V-cycle coarsening is by a factor of 4.

NI Level (j) | Global Number of Nodes | Number of MG Levels k(j)
1            | P                      | log_4(P)
2            | 2P                     | log_4(2P)
...          | ...                    | ...
j            | 2^(j-1) P              | log_4(2^(j-1) P)
...          | ...                    | ...
L            | EP                     | log_4(EP)
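A minimal Python sketch of that bookkeeping (illustrative only, not from the thesis), assuming $\omega = 1/3$ so that each refinement pass doubles the element count:

import math

def amr_growth(N, omega=1.0/3.0):
    # Elements after one AMR pass: refined elements become 4 children, the rest are kept.
    return 4 * N * omega + N * (1 - omega)   # equals 2N when omega = 1/3

def table_4_1(P, E):
    # Rows (j, global nodes, MG levels k(j)), starting from P global nodes
    # and refining until E*P global nodes are reached, as in Table 4.1.
    rows, nodes, j = [], P, 1
    while nodes <= E * P:
        rows.append((j, nodes, math.log(nodes, 4)))   # V-cycle coarsens by a factor of 4 per level
        nodes *= 2                                    # the omega = 1/3 pass doubles the count (see amr_growth)
        j += 1
    return rows

# Example: table_4_1(P=16, E=4096) runs from 16 global nodes (2 MG levels)
# up to 65536 global nodes (8 MG levels).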

Consider solving a problem with P processes and with resources capable of a multigrid hierarchy with E fine-grid elements (or nodes) per process. The number of levels, L, of nested iteration can be determined by using the assumption that refinement doubles the number of elements at each level. The coarsest grid has P elements globally, and the finest grid has PE elements. Therefore, the number of nested iteration levels is
$L = \log_2(E)$. (4.3)
The number of multigrid levels, k(j), on nested iteration level j is
$k(j) = \log_4(2^{\,j-1} P)$. (4.4)
The number of messages needed to interpolate to the next finer nested grid is 8, and the amount of interpolation data sent on level j is $4 n_{k(j)}$. Thus, using the previous results for the cost of one V-cycle, the cost of traditional nested iteration in 2D for the 9-point stencil is
$T_{latency} = \alpha \sum_{j=1}^{L} \left( 40 k(j) + 8 \right) = \alpha \left( 20 L \log_2 P + 10 L^2 - 2L \right)$,
$T_{bandwidth} \ge \beta \sum_{j=1}^{L} \left( 18 k(j) + 4 n_{k(j)} \right) = \beta \left( 9 L \log_2 P + \tfrac{9}{2}(L^2 - L) + 4\, \frac{\sqrt{2}^{\,L} - 1}{\sqrt{2} - 1} \right)$.
In terms of the number of processes, P, and the resources of each process, E, the behavior of the communication costs is
$T_{latency} = O\left( \log_2(E) \log_2(P) \right), \qquad T_{bandwidth} = O\left( \log_2(E) \log_2(P) \right)$.
Recall that L is the number of levels of nested iteration or adaptive refinement. For nested iteration, we have chosen to progress from P elements to PE elements globally,

meaning $L = \log_2(E)$. A true nested iteration approach, starting from one global element, would have resulted in $L = \log_2(PE)$, causing the bandwidth scaling for the traditional method to be $O(\log(P)^2)$. Next, we examine a model for the cost of the RD algorithm.

4.2 RD algorithm performance model

Analyzing the latency of the RD method is straightforward, while estimating the amount of data sent is more difficult and depends on the refinement in the exterior regions. Plausible good and bad cases are presented for the amount of data that must be communicated.

One RD iteration

Consider an iteration of RD in which each process has resources for a fine grid of E elements. The decomposed problem, $Lu_i = \chi_i f$, has been solved locally without communication, and nested iteration has been used with adaptive mesh refinement to yield an optimal grid for the subproblem. Now, the solution relevant to each process must be accumulated. Mathematically, this amounts to needing to accumulate
$u|_{\Omega_k} = \sum_{i=1}^{P} u_i|_{\Omega_k}$ (4.5)
on process k. Recall from the discussion of the communication pattern that $\log_2(P)$ steps are needed to distribute the information. Thus, the latency time is
$T_{latency} = \alpha \log_2(P)$. (4.6)
The amount of data passed depends on how the meshes are refined and also on how efficiently the data is passed from process to process. The worst-case scenario for how bad that data passing could be is so unrealistic that it is not useful to examine. Instead, a good

case and a bad case will be presented, and the numerical results will show that what actually happens in practice is fairly close to the good case.

Recall that the method begins from a coarse mesh of P elements, and that these coarse-mesh elements spawn children elements through the adaptive refinement process. Since Q is the fraction of elements in a process's home turf, $\Omega_k$, the number of elements that a process has outside of $\Omega_k$ is $E(1 - Q)$. On the first step, the processes are split into two groups, and a process sends the children elements corresponding to half of its coarse-mesh elements to a process in the receiving group. Since children of half of its coarse-mesh elements are sent, assume that approximately half of its exterior elements, or $E(1 - Q)/2$ elements, are sent on the first communication step. Remember that a process would also receive approximately $E(1 - Q)/2$ elements at this step, and these would be unioned with the elements still on a process. At this point, 3 things should be noted:

(1) A portion of the received elements correspond to the receiving process's own home turf, $\Omega_k$, and, therefore, do not need to be passed along in further communication steps. This would be sure to happen accidentally, and can be leveraged in future work to happen by design through a judicious pairing of communication partners.

(2) If the sets of elements to be unioned are not disjoint, then the elements common to both meshes introduce no new elements through the union. This is a good case, in that the union of meshes does not add to the message size that must later be sent.

(3) The set of elements received could be entirely disjoint from the set of elements it is being unioned into, doubling the number of exterior elements. This is unlikely, and extremely unlikely to continue to happen on subsequent communication steps.

Suppose that we neglect point 1, effectively overestimating the number of exterior elements after unioning. To address points 2-3, we introduce a parameter $Z \in [1, 2)$ that

describes how disjoint the element sets are. That is, if the number of elements on each portion of the mesh to be unioned is $E(1 - Q)/2$, then the number of elements after the union is $Z\, E(1 - Q)/2$, with Z = 1 corresponding to the meshes having all common elements and Z = 2 representing disjoint meshes being unioned together. Assuming the same Z for all communication steps, and again assuming each send halves the exterior elements, the number of elements sent on the ith communication step is
$\left(\tfrac{Z}{2}\right)^{i-1} \tfrac{E(1 - Q)}{2}$. (4.7)
Summing over all the communication steps gives
$T_{bandwidth} = \beta \sum_{i=1}^{\log_2(P)} \left(\tfrac{Z}{2}\right)^{i-1} \tfrac{E(1 - Q)}{2} = \beta\, \frac{1 - \left(\tfrac{Z}{2}\right)^{\log_2(P)}}{2 - Z}\, E(1 - Q)$.
Notice that, if Z = 1, meaning the mesh union did not increase the number of remaining exterior elements to be sent, then the total data communicated is less than the total exterior elements, $E(1 - Q)$. This is the good communication case. The bad case, parametrized by Z = 2, is where the grids are so disjoint that every union doubles the number of elements in a region. An important point to note is that both cases assume that the number of elements sent is halved at each communication step, because the number of regions that are sent is halved at each communication step. The true worst case is where not only are the unioned sets of elements completely disjoint, but the elements are also sent in a way that passes on all the elements, and they are never used in a home turf until the final process receives them. That is, the number of elements sent on each communication step is not assumed to be halved. This is not realistic in practice, but it is the only true upper bound. Note that the RD model does not depend on stencil size, whereas the traditional model costs would increase dramatically with stencil size. Also, if the problem dimension were changed to 3D, then the traditional model cost would increase, whereas the RD cost would be unchanged.
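Both closed-form models are simple enough to evaluate directly. The sketch below (illustrative Python, not the thesis code) encodes the traditional model from section 4.1 and the RD model just derived; the machine parameters alpha and beta and the mesh parameters Q and Z are inputs, for example the values estimated in the next section:

import math

def traditional_cost(P, E, alpha, beta):
    # Communication-cost model for traditional nested iteration with AMR (section 4.1).
    L = math.log2(E)                              # nested iteration levels, equation (4.3)
    lat = alpha * (20 * L * math.log2(P) + 10 * L**2 - 2 * L)
    bw = beta * (9 * L * math.log2(P) + 4.5 * (L**2 - L)
                 + 4 * (math.sqrt(2)**L - 1) / (math.sqrt(2) - 1))   # lower-bound estimate
    return lat, bw

def rd_cost(P, E, alpha, beta, Q, Z):
    # Communication-cost model for one NI-AMR-RD accumulation step (section 4.2).
    lat = alpha * math.log2(P)                    # equation (4.6)
    bw = beta * (1 - (Z / 2.0)**math.log2(P)) / (2.0 - Z) * E * (1 - Q)
    return lat, bw

# Example weak-scaling comparison at E = 1e6 elements per process, with alpha and beta
# supplied from measurements and Q = 0.7, Z = 1.9 as used in the text:
# for P in (2**p for p in range(4, 21)):
#     print(P, sum(traditional_cost(P, 1e6, alpha, beta)), sum(rd_cost(P, 1e6, alpha, beta, 0.7, 1.9)))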

Performance model comparison

Two models have been established. The first corresponds to a traditional, distributed data-type implementation of nested iteration and adaptive mesh refinement. The second is a model for the new NI-AMR-RD algorithm proposed in this thesis. Both consider P processes with fixed process resources capable of a mesh hierarchy of E fine-mesh elements and an initial coarse mesh of P elements, with E > P. To recap, the communication cost of traditional nested iteration is
$T_{latency} = \alpha \left( 20 L \log_2 P + 10 L^2 - 2L \right), \qquad T_{bandwidth} \ge \beta \left( 9 L \log_2 P + \tfrac{9}{2}(L^2 - L) + 4\, \frac{\sqrt{2}^{\,L} - 1}{\sqrt{2} - 1} \right)$,
while the RD method offers communication costs
$T_{latency} = \alpha \log_2(P), \qquad T_{bandwidth} = \beta\, \frac{1 - \left(\tfrac{Z}{2}\right)^{\log_2(P)}}{2 - Z}\, E(1 - Q)$.
Recall that $L = \log_2(E)$ for this version of nested iteration.

Model Parameters

The machine-dependent model parameters α and β can be estimated by considering that the latency cost, α, is the fixed cost of sending a message, while the bandwidth cost, β, is the rate of change of time with respect to the size of the message sent. Thus, on a plot of the time required to send a message vs. message size, α is the y-intercept and β is the slope. Code was written to vary the message size sent and collect average timings for many node-to-node communications. The code works in the following way. Every computational node is indexed by a number, called the node's rank. An outer loop determines a fixed message size, and the code uses MPI to have every node send a message of that size to

a node with index i greater than the node's index. The longest time for a node-to-node communication is recorded, and the distance i is increased from 1 to P−1. The average over these P−1 samples of the maximum node-to-node communication time is recorded as a data point for a given message size, and the outer loop then selects a larger message size. The message sizes were programmed so that small ones would grow by a power of 2, while large ones would grow by a fixed factor. In network performance estimation studies such as this one, a period of message sending prior to timing the sends is recommended to achieve reliable data [27].

Figure 4.4: Plot of data and least-squares fit of the average node-to-node communication time (seconds) versus message size (grid points) on Stampede, a Bluegene Q supercomputer. The entire range of data is shown in (a), while a zoomed view of the data points clustered near the y-intercept is shown in (b).

Because the performance model was based upon points communicated, the message sizes were measured in terms of grid points rather than bytes. A grid point consisted of an x and y coordinate and a solution value, meaning one grid point is equivalent to 24 bytes. The parameter estimation code was run on a Bluegene Q architecture to generate the data shown in figure 4.4. Bluegene Q is a common architecture produced by IBM and is in use at many national labs. For example, it is the architecture of Lawrence Livermore

National Lab's Sequoia, which achieved petaflop/s performance on 1.6 million cores and topped the list of fastest supercomputers in June 2012 [1]. A least-squares fit of the data to a linear model of
$T(x) = \beta x + \alpha$ (4.8)
yielded α = seconds and β = seconds/grid point. A value of Q = 0.7 was used based on numerical results (see figure 5.6). Z = 1.9 was chosen since it shows that, even in the bad case of unions growing the number of elements at each step of the communication, the communication cost models predict better scalability and faster communication times for the NI-AMR-RD algorithm compared to the traditional method. The difference between the good case and the bad case was not significant relative to the improvement over the traditional implementation, so the bad case is sufficient to show the improved performance of the NI-AMR-RD method.

Large P

For simplicity, these communication costs were modeled from a 9-point stencil Laplacian, and the computational time required to union the meshes was neglected. Still, the models capture the fundamental behavior of the two methods, and the RD method latency time is multiplied by a significantly smaller constant than traditional nested iteration with adaptive mesh refinement. The bandwidth-related cost for the RD method does not significantly depend on P; instead, the main factors affecting the bandwidth cost are E, the resources of a process, and (1 − Q), the portion of resources devoted to refinement outside of a process's home turf. Figure 4.5 shows the weak scaling of the models in the regime of a large number of processes, each representing a cluster of processors or nodes.
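The fit in (4.8) is an ordinary least-squares line, and can be reproduced with a few lines of Python (an illustrative sketch; it assumes the timing data have already been collected into arrays of message sizes, in grid points, and measured times, in seconds):

import numpy as np

def fit_latency_bandwidth(msg_sizes, times):
    # Least-squares fit of T(x) = beta*x + alpha, equation (4.8).
    # msg_sizes : message sizes in grid points (1 grid point = 24 bytes here)
    # times     : averaged maximum node-to-node communication times in seconds
    beta, alpha = np.polyfit(np.asarray(msg_sizes, dtype=float),
                             np.asarray(times, dtype=float), deg=1)
    return alpha, beta   # alpha: latency (s, the y-intercept); beta: inverse bandwidth (s per grid point)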

Figure 4.5: Plot of the performance models for traditional nested iteration (a) and the RD method (b), with E = 1 million, showing how the communication times ($T_{latency}$, $T_{bandwidth}$, and their sum) of the methods scale into the regime of millions of MPI tasks. The traditional method is latency bound, while RD scales much better and is bandwidth bound.

Small P for Comparison to Numerical Results

In addition to scaling much better, the RD method is dominated by bandwidth, whereas the traditional method is dominated by latency. This is significant because bandwidth is

expected to improve on new architectures much more quickly than latency. Increasing the throughput of communication pipelines is easier than increasing the speed at which those pipelines are opened.

Figure 4.6: Plot of the communication costs (time in seconds versus number of processors) for (a) traditional nested iteration and (b) the RD method, for 16 to 1024 processes. The enhanced scaling of the RD method is not as obvious on this scale, but numerical tests were conducted for this range of processes. The traditional method is latency bound, while RD scales much better and is bandwidth bound.

Chapter 5

Numerical Tests

The Range Decomposition algorithm was implemented in parallel using Fortran 2003 and MPI, and tests were conducted on Stampede, an IBM Bluegene Q supercomputer. Because of expedient software limitations, the coarsest mesh, from which nested iteration is started, is required to be uniform and to have the same number of elements as processes. This was done purely for ease of implementation; neither is essential to the efficacy of the NI-AMR-RD algorithm. In principle, the mesh would not need to be uniform because each process would start by independently solving with AMR to compute an approximately optimal coarse grid (the preprocessing step discussed previously). The home elements belonging to a process would then be determined to assure the residual is equally distributed among processes [14]. This preliminary generation of a coarse grid and assignment of processes would require no communication because each process would compute the same coarse grid and element distribution locally. Each process would then begin the NI-AMR-RD algorithm as described in this thesis, using the solution computed during the start-up stage as the initial guess. Given a reasonably large P, the AMR strategy ACE would have equally distributed the functional error, meaning that the processes would have equal portions of the error and be essentially load balanced, without needing any communication.

Since the current implementation requires a uniform coarse grid with the number of elements equal to the number of processes, a test problem was chosen in which the functional

error is fairly equally distributed throughout the domain and, thus, among the processes. This simulates the uniform distribution of the residual.

Graded Mesh Convergence

Heuristics suggest that the RD solution will converge to a solution on the union mesh in just a few iterations. The following computational test demonstrates this for problem sizes where the union mesh solution can be computed by a single process. For this test, instead of adaptive mesh refinement, a graded mesh was implemented such that on the first refinement step, only the home turf would be refined, and on subsequent refinement steps the home turf would be refined as well as any elements touching a previously refined element. This was done so that the union mesh would be known a priori, and the union mesh would be equal to the full union mesh (as defined in section 3.1). In fact, the union mesh in this case is a uniform mesh. This avoids many complications and focuses on whether RD iterates converge to the solution on the union mesh.

Consider a simple test problem, the Laplace equation written as a first-order system:
$U - \nabla p = 0, \quad \nabla\cdot U = \sin(\pi x)\sin(\pi y), \quad \nabla\times U = 0$,
with Dirichlet boundary conditions
$p = 0, \quad U\cdot\tau = 0$ on the N, S, E, W boundary.
Rather than limiting element resources, as is done later with the adaptive refinement tests, this problem was solved using a fixed number of refinement levels, namely 6. The first-order system, Lu = f, was first solved on a uniform grid down to 6 levels of refinement. Denote this solution on the union grid by $u_U$. Next, Lu = f was solved iteratively

using the RD algorithm with the graded mesh specified above. Denote this solution on iteration l by $u^h_l$. Define the normalized functional norm of the RD solution as
$\varepsilon_l = \dfrac{\|f - Lu^h_l\|_0}{\|f\|_0}$ (5.1)
and the normalized functional norm between the iterative RD solution and the union solution as
$\varepsilon_{l,op} = \dfrac{\|L(u^h_l - u_U)\|_0}{\|f\|_0}$. (5.2)
Tables 5.1 and 5.2 show numerical results for 16 and 64 processes, respectively. The distance between the true solution of the PDE and the RD solution quickly reaches discretization error, and the distance between the union grid solution and the RD iterate shrinks to much smaller than discretization error, almost to round-off error.

Table 5.1: Graded mesh convergence to the union grid solution using 16 processes (columns: iteration, $\varepsilon_l$, $\varepsilon_{l,op}$).

Table 5.2: Graded mesh convergence to the union grid solution using 64 processes (columns: iteration, $\varepsilon_l$, $\varepsilon_{l,op}$).
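The diagnostics (5.1) and (5.2) are simple ratios of discrete functional norms. A minimal Python sketch (illustrative only; it assumes a discrete operator L stored as a matrix with a matching @ product and coefficient vectors on a common grid, whereas the thesis evaluates the continuous FOSLS functional with quadrature):

import numpy as np

def normalized_functional_norms(L, f, u_h, u_union):
    # eps_l    : relative functional norm of the RD iterate, equation (5.1)
    # eps_l_op : relative functional distance between the RD iterate and the union-grid solution, (5.2)
    f_norm = np.linalg.norm(f)
    eps_l = np.linalg.norm(f - L @ u_h) / f_norm
    eps_l_op = np.linalg.norm(L @ (u_h - u_union)) / f_norm
    return eps_l, eps_l_op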

Laplace Equation

For a simple test that elucidates the characteristics of the NI-AMR-RD method, consider solving the Laplace equation, $\Delta p = f$. The first-order system for the Laplace equation with Dirichlet boundary conditions is
$U - \nabla p = 0, \quad \nabla\cdot U = f, \quad \nabla\times U = 0 \quad$ in $\Omega$, (5.3)
$p = 0, \quad U\cdot\tau = 0 \quad$ on $\partial\Omega$. (5.4)
Recall that an ideal implementation of the algorithm would begin with a preprocessing step of having every process solve the non-decomposed problem, Lu = f, up to a set number of elements, generating an adapted mesh that would be identical for all processes. If such a mesh were generated from the ACE algorithm, the residual would be approximately equally distributed among elements. Then, each process would equally assign home turf elements among processes. This would require no communication since each process would generate the same mesh on the preprocessing step and would know which elements would be assigned as home turf elements to each process. This preprocessing step has two main effects:

(1) The residual error is approximately equally distributed among processes because ACE tends to equally distribute residual error among elements and the elements are equally distributed among processes.

(2) The NI-AMR-RD algorithm tends to be load balanced without needing communication. To the extent that the preprocessing mesh is sufficiently resolved to approximately equally distribute the residual, each process has a similar amount of functional error to correct and, thus, will require similar amounts of work.
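For orientation, one iteration of the method with this preprocessing in place can be sketched as follows (illustrative Python pseudocode only; the actual implementation is Fortran 2003 with MPI, and partition_of_unity, apply_L, solve_with_ni_amr, and accumulate_sum are hypothetical placeholders for the corresponding routines):

def rd_iteration(u, f, coarse_mesh, my_rank, P, E):
    # One NI-AMR-RD iteration on process my_rank: solve a decomposed correction
    # problem locally, then sum the corrections across processes.
    chi = partition_of_unity(coarse_mesh, my_rank, P)   # chi_i supported on the home turf Omega_i
    residual = f - apply_L(u)                           # on the first iteration u = 0, so this is just f
    du_local = solve_with_ni_amr(chi * residual, E)     # local NI + AMR solve, no communication
    du = accumulate_sum(du_local, my_rank, P)           # the sum in (4.5), done in log2(P) pairwise exchanges
    return u + du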

Unfortunately, the current implementation of NI-AMR-RD was not coded with the preprocessing step. Instead, when using P processes, the implementation requires starting NI-AMR-RD on a coarse grid of P uniformly distributed elements. This restriction is purely a result of coding convenience, not a restriction on the algorithm. The method can still be tested as if the preprocessing step had been implemented by making some assumptions on the f that NI-AMR-RD receives as input. In what follows, f is considered to be a residual from the preprocessing step and is therefore assumed to be approximately equally oscillatory in every element of the coarse mesh. Thus, every process has an equal portion of the residual, f.

In the tests on the Laplace equation, we hold the resources per process fixed and consider the performance of the algorithm as we increase the number of processes, and thus the total resources. This is known as weak scaling. Two weak scaling studies are considered. In the first, the difficulty of the problem, as determined by f, remains fixed while the number of processes increases. This, of course, allows a smaller error with increasing processes. The second study scales the frequency of f proportionally to the number of processes used to solve the problem, effectively maintaining the same difficulty per process as more processes are used.

Weak Scaling with Fixed Right Hand Side

To test the algorithm with an oscillatory but uniformly distributed initial residual, consider using the following manufactured solution:
$p(x, y) = \sin(65\pi x)\sin(65\pi y), \quad U(x, y) = \nabla p(x, y), \quad f(x, y) = \nabla\cdot U(x, y)$.
This solution is chosen as an example because it is sufficiently oscillatory to reveal the benefits of adaptive mesh refinement, but still results in an f that is equally distributed

among elements, as desired by our assumption that, in practice, f would be the residual from a preprocessing step. The p component of the solution to the decomposed problem for the first two iterations is shown in figure 5.1.

Figure 5.1: An example process's p component of the solution to the decomposed problem: (a) 16 processes, iteration 1; (b) 1024 processes, iteration 1; (c) 16 processes, iteration 2; (d) 1024 processes, iteration 2.

Recall that each RD iteration solves for a correction to the previous iteration's approximate solution. On the first iteration, the initial approximate solution is $u^{(0)} = 0$, so the first correction is equivalent to a solve of $Lu_i = \chi_i f$. Thus, on the first iteration the correction on a process will resemble the solution within the process's home turf, and is L-harmonic in the exterior region. It is immediately clear from looking at (a) and (c) in figure 5.1 that the mesh generated when P = 16 is not strongly influenced by the choice of a uniform coarse mesh of 4x4 elements. This is not the case for P = 1024, where the coarse mesh has forced the exterior

region to be more fine than necessary. Starting from such a fine coarse mesh effectively wastes resources where they are not needed and is a result of programming convenience. A production implementation of the NI-AMR-RD method would not require a coarse grid of P elements. Still, the tests show that the algorithm performs very well, and artifacts of the coarse-mesh restriction are pointed out below.

Weak scaling results are shown in figure 5.2. The stopping criteria for both RD and the traditional implementation are the same; each method stops when it has adapted to a fine grid of E elements. Thus, the weak scaling shows that RD applies its resources faster, and that the time to do so remains fairly constant with increasing P, whereas the time to complete the traditional implementation is growing with P.

Figure 5.2: Weak scaling results (time in seconds versus processes, or MPI tasks) for solving the Laplace equation using 4000 elements per process, comparing a traditional solve with 1 and 2 RD iterations. RD iterations scale very well, while the time for the traditional implementation begins growing.

The error achieved by the two methods is of course different, and is shown in later plots. It should be noted that when the ratio of computation to communication is analyzed for RD, the communication costs are less than 1% of the computational costs. The regime

in which current software allows testing of the method is not significantly limited by the RD communication time. The performance model predicts better communication scalability for much larger process counts, but, even in this somewhat low process regime, RD iterations are scaling significantly better than the traditional implementation.

The weak scaling results show that the time it takes to perform an iteration of the RD algorithm scales very well, which is what was predicted in the performance model. However, a remaining and significant question is how much accuracy is sacrificed by the RD method. Figure 5.3 shows the error reduction and the dependence on the effective mesh spacing.

Figure 5.3: The relative error reduction, $\|f - Lu^h\|/\|f\|$, with respect to iterations is shown in (a) for 16 to 1024 processors, while (b) plots it against the effective number of elements, QEP, and shows that the asymptotic error reduction behaves like 1/QEP, as expected when using quadratic elements.

From the error analysis developed in section 3.2, the error reached on an RD iteration should depend on the effective mesh spacing achieved on that iteration. For quadratics, the error should decrease according to 1/QEP, and this is exactly what RD achieves, as shown in figure 5.3(b). For comparison, the traditional implementation has error decreasing as O(1/EP), without the factor of 1/Q, since it is using communication instead of extra external computations.

For an understanding of the error reduction achieved in relation to the time required to achieve it, figure 5.4 shows the time required to reach a given accuracy compared to the time and accuracy achieved by a traditional solve.

Figure 5.4: Comparison of RD accuracy to traditional accuracy, with the time required to achieve it, for (a) 16 processes (traditional solve: 9.9 s), (b) 64 processes (11 s), (c) 256 processes (13 s), and (d) 1024 processes (21 s). The blue dots correspond to RD iterations (annotated with their cumulative times), the green line corresponds to the accuracy achieved by a traditional solve, and the red line is the accuracy of the traditional solve multiplied by 1/Q.

The RD iterations are shown in blue, while the error and timing of a traditional solve are given in green. The red line is the traditional solve error multiplied by 1/Q, which,

according to theory in section 3.2, is the best error RD could hope to achieve. For 16 processes, as shown in figure 5.4(a), two RD iterations take 6.5 seconds and achieve an error very close to the error achieved using the traditional implementation, which took 9.9 seconds. The picture is similar for 64 and 256 processes (b-c), with 2 iterations of RD achieving an error very close to the theoretically best error, and also very close to the error of a traditional solve, but in half the time. As the number of processes grows to 1024 (panel d), two things happen. First, the theoretical asymptote (red line) is no longer as close to the error one can achieve through a traditional solve, and second, the rate of convergence to the asymptote decreases. Both of these facts are related to Q, as explained below.

Recall that software limitations caused us to start RD on a coarse grid of P elements. The resources of a process are fixed at E = 4000. When $E \gg P$, the fact that we are starting from a coarse grid of P elements does not affect the final distribution of elements for the decomposed solve. However, when E/P is not large, our choice of forcing a uniform coarse grid affects the final distribution and, thus, Q. This is seen in figure 5.5, where Q was measured for different numbers of processes and resources. The plot shows that Q ≈ 0.9 when E is much greater than P and the coarse mesh choice does not affect the element distribution on the final mesh. The red line in figure 5.4(a) for the 16-process case is close to the traditional accuracy because Q ≈ 0.9, while Q ≈ 0.7 for the 1024-process case shown in (d).

Each iteration of RD starts from the coarse grid of P elements and re-executes adaptive mesh refinement. A process's grid thus changes from iteration to iteration, so Q and the elements on the union grid change with each iteration. A plot of how Q changes with each iteration is given in figure 5.6. In the case of 16 or 64 processes, Q is closer to one and very quickly reaches an asymptotic value. As P increases, the asymptotic value that Q reaches drops further below 1, as explained previously as an implementation artifact of the initial coarse mesh in fact

Figure 5.5: Asymptotic values of Q, plotted against E/P for 4000 and 8000 elements per processor, taken from runs on 16 to 1024 processes, showing that if the resources, E, are sufficiently greater than the number of MPI tasks, P, then Q is relatively constant and close to 1.

Figure 5.6: Computed values of Q at each RD iteration for various numbers of processes (a). The asymptotic value of Q decreases as P increases because the coarse grid influences the fine grid when E is not much greater than P. When the resources are increased (b), the asymptotic value of Q returns to 0.9.

being too fine and influencing the final distribution of elements on the fine mesh. The rate at which Q approaches the asymptote also degrades with increasing P. This is because the tests were conducted with a fixed frequency f for all P values, resulting in an

f that appears less oscillatory relative to the resolution of the mesh for large P. The local L-harmonic solution in the exterior region matches the solution on the boundary of the home turf. Any oscillations along the home turf boundary are smoothed into the exterior region. The more oscillatory the solution along the home turf boundary, the more quickly the effects of the oscillation are dissipated in the exterior region. This explains why, in the case of a fixed-frequency f, Q converges more slowly with increasing P. Since f was fixed, as the number of processes grows and the home turf size shrinks, the solution in a home turf becomes less oscillatory relative to the mesh size. This results in the solution in the home turf bleeding much further into the exterior region, as shown in figure 5.7.

Figure 5.7: Contour plots of the solution using 1024 processes for the case where (a) the frequency of f is fixed, and (b) the frequency of f is scaled with increasing number of processes.

Weak Scaling When Right Hand Side Scales

A fixed frequency in the solution, p and U, becomes lower relative to the maximum mesh frequency as P grows. This, in turn, affects the rate of convergence of Q, which is a crucial quantity of interest to the method. A weak scaling study was therefore devised in which the frequency of f scales with P. For this study, consider the manufactured solution

$p(x, y) = \sin(16\sqrt{P}\,\pi x)\sin(16\sqrt{P}\,\pi y), \quad U(x, y) = \nabla p(x, y)$,
which generates
$f(x, y) = 2\pi^2(16^2 P)\sin(16\sqrt{P}\,\pi x)\sin(16\sqrt{P}\,\pi y)$.
When P = 16, this matches the previously studied problem but, as P increases, the frequency increases proportionally.

Figure 5.8: Weak scaling results (time in seconds versus processes, or MPI tasks) using 4000 elements per process to solve the Laplace equation with a solution that increases frequency with increasing P, comparing a traditional solve with 1 and 2 RD iterations.

When the frequency of the solution scales with P, the convergence to an asymptotic error is improved for P = 1024. The asymptotic error achieved is still worse than when P = 16, but this is because the code requirement of beginning on a coarse grid of P elements is wasting too many elements outside a process's home turf. Note that convergence is observed to still be slightly slower for P = 1024, requiring approximately 3 iterations to

Figure 5.9: Error and timings when the frequency of the solution is varied proportionally to P, for (a) 16 processes (traditional solve: 10 s), (b) 64 processes (12 s), (c) 256 processes (14 s), and (d) 1024 processes (22 s). Note that the convergence rates are almost identical.

reach the red asymptote instead of the 2 required when P = 16, but this is believed to also be a result of the inefficient use of resources caused by an overly fine coarse grid. The convergence and asymptotic value of the error are tied to the convergence and asymptotic value of Q, which is plotted in figure 5.10. Now, when the frequency of the solution scales with P, the value of Q is approximately constant by the second iteration. This is a big contrast to the previous section, where the frequency was fixed and would therefore

Figure 5.10: The slow convergence of Q for the study with a fixed-frequency solution, shown in (a), is no longer present when the frequency scales with P, as shown in (b). Both panels plot Q versus iteration with E = 4K elements per process for 16 to 1024 processes.

appear more smooth for large values of P and lead to a relatively smoother solution that bled farther out of the home turf and into the exterior region.

What If The Solution is Smooth?

All the manufactured solutions considered in the previous sections were oscillatory. A natural question is how the refinement during an iteration is affected when the solution (or residual, f) is smooth. This question was tested using 16 processes with resources of 5280 elements per process. Consider the smooth manufactured solution:
$p(x, y) = \sin(\pi x)\sin(\pi y), \quad U(x, y) = \nabla p(x, y)$,
which generates the smooth forcing function
$f(x, y) = 2\pi^2 \sin(\pi x)\sin(\pi y)$.
The correction computed on a process on the first iteration is smooth, as shown in figure 5.11(a). Because f, and therefore the solution, is less oscillatory, fewer elements are needed

Figure 5.11: An example process's solution to the decomposed problem with smooth input f on the first (a) and second (b) iterations. The respective adapted meshes are shown in (c) and (d).

by refinement in a process's home turf, and more broad refinement occurs. This results in a much smaller value of Q on the first iteration than in either oscillatory case. However, after the first iteration, the right-hand side f becomes the residual of the previous iteration. This residual is oscillatory, as demonstrated by the oscillatory solution shown in figure 5.11(b), and results in refinement mostly taking place in the home turf and a value of Q close to 1, as seen in table 5.3.


More information

Adaptive Smoothed Aggregation (αsa) Multigrid

Adaptive Smoothed Aggregation (αsa) Multigrid SIAM REVIEW Vol. 47,No. 2,pp. 317 346 c 2005 Society for Industrial and Applied Mathematics Adaptive Smoothed Aggregation (αsa) Multigrid M. Brezina R. Falgout S. MacLachlan T. Manteuffel S. McCormick

More information

The Immersed Interface Method

The Immersed Interface Method The Immersed Interface Method Numerical Solutions of PDEs Involving Interfaces and Irregular Domains Zhiiin Li Kazufumi Ito North Carolina State University Raleigh, North Carolina Society for Industrial

More information

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation lgebraic Multigrid (MG) for Ground Water Flow and Oil Reservoir Simulation Klaus Stüben, Patrick Delaney 2, Serguei Chmakov 3 Fraunhofer Institute SCI, Klaus.Stueben@scai.fhg.de, St. ugustin, Germany 2

More information

Handling Parallelisation in OpenFOAM

Handling Parallelisation in OpenFOAM Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation

More information

Hexahedral Mesh Refinement Using an Error Sizing Function

Hexahedral Mesh Refinement Using an Error Sizing Function Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-06-01 Hexahedral Mesh Refinement Using an Error Sizing Function Gaurab Paudel Brigham Young University - Provo Follow this

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Image resizing and image quality

Image resizing and image quality Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 2001 Image resizing and image quality Michael Godlewski Follow this and additional works at: http://scholarworks.rit.edu/theses

More information

f xx + f yy = F (x, y)

f xx + f yy = F (x, y) Application of the 2D finite element method to Laplace (Poisson) equation; f xx + f yy = F (x, y) M. R. Hadizadeh Computer Club, Department of Physics and Astronomy, Ohio University 4 Nov. 2013 Domain

More information

Star Decompositions of the Complete Split Graph

Star Decompositions of the Complete Split Graph University of Dayton ecommons Honors Theses University Honors Program 4-016 Star Decompositions of the Complete Split Graph Adam C. Volk Follow this and additional works at: https://ecommons.udayton.edu/uhp_theses

More information

ADAPTIVE FINITE ELEMENT

ADAPTIVE FINITE ELEMENT Finite Element Methods In Linear Structural Mechanics Univ. Prof. Dr. Techn. G. MESCHKE SHORT PRESENTATION IN ADAPTIVE FINITE ELEMENT Abdullah ALSAHLY By Shorash MIRO Computational Engineering Ruhr Universität

More information

6 Randomized rounding of semidefinite programs

6 Randomized rounding of semidefinite programs 6 Randomized rounding of semidefinite programs We now turn to a new tool which gives substantially improved performance guarantees for some problems We now show how nonlinear programming relaxations can

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

Space-filling curves for 2-simplicial meshes created with bisections and reflections

Space-filling curves for 2-simplicial meshes created with bisections and reflections Space-filling curves for 2-simplicial meshes created with bisections and reflections Dr. Joseph M. Maubach Department of Mathematics Eindhoven University of Technology Eindhoven, The Netherlands j.m.l.maubach@tue.nl

More information

Graph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen

Graph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen Graph Partitioning for High-Performance Scientific Simulations Advanced Topics Spring 2008 Prof. Robert van Engelen Overview Challenges for irregular meshes Modeling mesh-based computations as graphs Static

More information

TRAVELTIME TOMOGRAPHY (CONT)

TRAVELTIME TOMOGRAPHY (CONT) 30 March 005 MODE UNIQUENESS The forward model TRAVETIME TOMOGRAPHY (CONT di = A ik m k d = Am (1 Data vecto r Sensitivit y matrix Model vector states the linearized relationship between data and model

More information

Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2

Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2 Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2 Jingwei Zhu March 19, 2014 Instructor: Surya Pratap Vanka 1 Project Description The purpose of this

More information

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea. Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences

More information

An introduction to mesh generation Part IV : elliptic meshing

An introduction to mesh generation Part IV : elliptic meshing Elliptic An introduction to mesh generation Part IV : elliptic meshing Department of Civil Engineering, Université catholique de Louvain, Belgium Elliptic Curvilinear Meshes Basic concept A curvilinear

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs In Proceedings of ASIM 2005-18th Symposium on Simulation Technique, Sept. 2005. Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke dominik.goeddeke@math.uni-dortmund.de Universität

More information

MODELING MIXED BOUNDARY PROBLEMS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD (CVBEM) USING MATLAB AND MATHEMATICA

MODELING MIXED BOUNDARY PROBLEMS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD (CVBEM) USING MATLAB AND MATHEMATICA A. N. Johnson et al., Int. J. Comp. Meth. and Exp. Meas., Vol. 3, No. 3 (2015) 269 278 MODELING MIXED BOUNDARY PROBLEMS WITH THE COMPLEX VARIABLE BOUNDARY ELEMENT METHOD (CVBEM) USING MATLAB AND MATHEMATICA

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26

More information

Let denote the number of partitions of with at most parts each less than or equal to. By comparing the definitions of and it is clear that ( ) ( )

Let denote the number of partitions of with at most parts each less than or equal to. By comparing the definitions of and it is clear that ( ) ( ) Calculating exact values of without using recurrence relations This note describes an algorithm for calculating exact values of, the number of partitions of into distinct positive integers each less than

More information

PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS. Ioana Chiorean

PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS. Ioana Chiorean 5 Kragujevac J. Math. 25 (2003) 5 18. PARALLEL METHODS FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS Ioana Chiorean Babeş-Bolyai University, Department of Mathematics, Cluj-Napoca, Romania (Received May 28,

More information

PARALLEL MULTILEVEL TETRAHEDRAL GRID REFINEMENT

PARALLEL MULTILEVEL TETRAHEDRAL GRID REFINEMENT PARALLEL MULTILEVEL TETRAHEDRAL GRID REFINEMENT SVEN GROSS AND ARNOLD REUSKEN Abstract. In this paper we analyze a parallel version of a multilevel red/green local refinement algorithm for tetrahedral

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

c 2006 Society for Industrial and Applied Mathematics

c 2006 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 27, No. 4, pp. 1019 1039 c 2006 Society for Industrial and Applied Mathematics REDUCING COMPLEXITY IN PARALLEL ALGEBRAIC MULTIGRID PRECONDITIONERS HANS DE STERCK, ULRIKE

More information

D036 Accelerating Reservoir Simulation with GPUs

D036 Accelerating Reservoir Simulation with GPUs D036 Accelerating Reservoir Simulation with GPUs K.P. Esler* (Stone Ridge Technology), S. Atan (Marathon Oil Corp.), B. Ramirez (Marathon Oil Corp.) & V. Natoli (Stone Ridge Technology) SUMMARY Over the

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

THE MORTAR FINITE ELEMENT METHOD IN 2D: IMPLEMENTATION IN MATLAB

THE MORTAR FINITE ELEMENT METHOD IN 2D: IMPLEMENTATION IN MATLAB THE MORTAR FINITE ELEMENT METHOD IN D: IMPLEMENTATION IN MATLAB J. Daněk, H. Kutáková Department of Mathematics, University of West Bohemia, Pilsen MECAS ESI s.r.o., Pilsen Abstract The paper is focused

More information

Revised Sheet Metal Simulation, J.E. Akin, Rice University

Revised Sheet Metal Simulation, J.E. Akin, Rice University Revised Sheet Metal Simulation, J.E. Akin, Rice University A SolidWorks simulation tutorial is just intended to illustrate where to find various icons that you would need in a real engineering analysis.

More information

Resilient geometric finite-element multigrid algorithms using minimised checkpointing

Resilient geometric finite-element multigrid algorithms using minimised checkpointing Resilient geometric finite-element multigrid algorithms using minimised checkpointing Dominik Göddeke, Mirco Altenbernd, Dirk Ribbrock Institut für Angewandte Mathematik (LS3) Fakultät für Mathematik TU

More information

Coloring 3-Colorable Graphs

Coloring 3-Colorable Graphs Coloring -Colorable Graphs Charles Jin April, 015 1 Introduction Graph coloring in general is an etremely easy-to-understand yet powerful tool. It has wide-ranging applications from register allocation

More information

The Structure and Properties of Clique Graphs of Regular Graphs

The Structure and Properties of Clique Graphs of Regular Graphs The University of Southern Mississippi The Aquila Digital Community Master's Theses 1-014 The Structure and Properties of Clique Graphs of Regular Graphs Jan Burmeister University of Southern Mississippi

More information

2 The Elliptic Test Problem

2 The Elliptic Test Problem A Comparative Study of the Parallel Performance of the Blocking and Non-Blocking MPI Communication Commands on an Elliptic Test Problem on the Cluster tara Hafez Tari and Matthias K. Gobbert Department

More information

Numerical Experiments

Numerical Experiments 77 Chapter 4 Numerical Experiments 4.1 Error estimators and adaptive refinement Due to singularities the convergence of finite element solutions on uniform grids can be arbitrarily low. Adaptivity based

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Numerical schemes for Hamilton-Jacobi equations, control problems and games

Numerical schemes for Hamilton-Jacobi equations, control problems and games Numerical schemes for Hamilton-Jacobi equations, control problems and games M. Falcone H. Zidani SADCO Spring School Applied and Numerical Optimal Control April 23-27, 2012, Paris Lecture 2/3 M. Falcone

More information

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester.

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester. AM205: lecture 2 Luna and Gary will hold a Python tutorial on Wednesday in 60 Oxford Street, Room 330 Assignment 1 will be posted this week Chris will hold office hours on Thursday (1:30pm 3:30pm, Pierce

More information

Multigrid Third-Order Least-Squares Solution of Cauchy-Riemann Equations on Unstructured Triangular Grids

Multigrid Third-Order Least-Squares Solution of Cauchy-Riemann Equations on Unstructured Triangular Grids INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS Int. J. Numer. Meth. Fluids ; : 6 Prepared using fldauth.cls [Version: /9/8 v.] Multigrid Third-Order Least-Squares Solution of Cauchy-Riemann Equations

More information

Outline. Level Set Methods. For Inverse Obstacle Problems 4. Introduction. Introduction. Martin Burger

Outline. Level Set Methods. For Inverse Obstacle Problems 4. Introduction. Introduction. Martin Burger For Inverse Obstacle Problems Martin Burger Outline Introduction Optimal Geometries Inverse Obstacle Problems & Shape Optimization Sensitivity Analysis based on Gradient Flows Numerical Methods University

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees

Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees Rahul S. Sampath, Santi S. Adavani, Hari Sundar, Ilya Lashuk, and George Biros University of Pennsylvania Abstract In this

More information

Conforming Vector Interpolation Functions for Polyhedral Meshes

Conforming Vector Interpolation Functions for Polyhedral Meshes Conforming Vector Interpolation Functions for Polyhedral Meshes Andrew Gillette joint work with Chandrajit Bajaj and Alexander Rand Department of Mathematics Institute of Computational Engineering and

More information

The goal is the definition of points with numbers and primitives with equations or functions. The definition of points with numbers requires a

The goal is the definition of points with numbers and primitives with equations or functions. The definition of points with numbers requires a The goal is the definition of points with numbers and primitives with equations or functions. The definition of points with numbers requires a coordinate system and then the measuring of the point with

More information

1 Exercise: 1-D heat conduction with finite elements

1 Exercise: 1-D heat conduction with finite elements 1 Exercise: 1-D heat conduction with finite elements Reading This finite element example is based on Hughes (2000, sec. 1.1-1.15. 1.1 Implementation of the 1-D heat equation example In the previous two

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

Approaches to Parallel Implementation of the BDDC Method

Approaches to Parallel Implementation of the BDDC Method Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague

More information

An Adaptive Stencil Linear Deviation Method for Wave Equations

An Adaptive Stencil Linear Deviation Method for Wave Equations 211 An Adaptive Stencil Linear Deviation Method for Wave Equations Kelly Hasler Faculty Sponsor: Robert H. Hoar, Department of Mathematics ABSTRACT Wave Equations are partial differential equations (PDEs)

More information

Online algorithms for clustering problems

Online algorithms for clustering problems University of Szeged Department of Computer Algorithms and Artificial Intelligence Online algorithms for clustering problems Summary of the Ph.D. thesis by Gabriella Divéki Supervisor Dr. Csanád Imreh

More information