IMPLEMENTATION OF IMPLICIT FINITE ELEMENT METHODS FOR INCOMPRESSIBLE FLOWS ON THE CM-5


Computer Methods in Applied Mechanics and Engineering, (1994)

J.G. Kennedy
Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142, USA

M. Behr, V. Kalro, and T.E. Tezduyar
AEM/AHPCRC Supercomputer Institute, University of Minnesota, 1200 Washington Avenue South, Minneapolis, MN 55415, USA

March 20, 1994
Revised: March 27, 1994

Abstract

A parallel implementation of an implicit finite element formulation for incompressible fluids on a distributed-memory massively parallel computer is presented. The dominant issue that distinguishes the implementation of finite element problems on distributed-memory computers from that on traditional shared-memory scalar or vector computers is the distribution of data (and hence workload) to the processors and the nonuniform memory hierarchy associated with the processors, particularly the nonuniform costs associated with on-processor and off-processor memory references. Accessing data stored in a remote processor requires computing resources an order of magnitude greater than accessing data locally in a processor. This distribution of data motivates the development of alternatives to traditional algorithms and data structures designed for shared-memory computers, which must now account for distributed-memory architectures. Data structures as well as data decomposition and data communication algorithms designed for distributed-memory computers are presented in the context of high-level language constructs from High Performance Fortran. The discussion relies primarily on abstract features of the hardware and software environment and should be applicable, in principle, to a variety of distributed-memory systems. The actual implementation is carried out on a Connection Machine CM-5 system with high performance communication functions.

1. Introduction

Distributed-memory, massively parallel computers are emerging as significant competitors to traditional vector supercomputers in the area of large-scale computational fluid dynamics. Fluid problems are particularly well suited to these parallel computers because parallelization occurs over large, uniform data structures. The continuing trend in fluid simulations toward significantly larger data sets as well as the need for shorter solution times leads naturally to distributed-memory massively parallel computers. Parallel

computers offer the potential for both higher sustained computational performance and substantially larger memory capacities than traditional vector computers.

This study focuses on a finite element formulation for the problem of an incompressible, viscous fluid. In spite of offering highly parallel data structures, finite element methods pose a challenge for distributed-memory parallel machines as a result of the irregular data communication patterns that arise in the context of unstructured meshes. The earliest implementations of finite element methods on parallel computers relied on a message passing programming model coupled with domain decomposition constructs [1-3]. Domain decomposition constructs subdivide the physical domain into $N_p$ subdomains, one for each of the $N_p$ processors, each subdomain typically consisting of a spatially contiguous set of elements. The subdomains then communicate data only through elements on the subdomain boundaries. More recently, data parallel implementations of finite element methods emerged [4-9]. Johan et al. [10] coupled traditional notions of domain decomposition with a data parallel finite element implementation.

A variety of methods have been used to construct the element subdomains to maintain favorable load balancing characteristics and low network communication requirements. The recursive spectral bisection (RSB) algorithm due to Pothen et al. [11] and Simon [12] provides a systematic and robust methodology for domain partitioning. Johan et al. [10] provided the first parallel implementation of RSB for unstructured meshes. Behr et al. [9] provide parallel implementation constructs for two incompressible flow formulations (based on velocity-pressure and stress-velocity-pressure as primary variables) along with two-dimensional flow simulations.
These parallel implementation constructs are extended here to include a detailed discussion of the GMRES solver, including a comparison between a traditional matrix-based algorithm and a matrix-free algorithm, domain decomposition and three-dimensional flow simulations. In addition, the issues associated with the coupling between domain decomposition and gather-scatter communication performance are discussed. The domain decomposition strategy is based on the parallel implementation of RSB provided in Johan et al. [10].

The paper is organized as follows. An abstraction of the hardware and software characteristics of distributed-memory massively parallel computers is provided in Section 2. A statement of the finite element problem is provided in Section 3. The parallel implementation of this problem is presented in Section 4. Data decomposition and associated communication issues are discussed in Section 5. Numerical simulations are presented in Section 6. Finally, conclusions are provided in Section 7.

2. Parallel Computer Model

Implementation of the finite element method discussed here is carried out on a Connection Machine CM-5 system in the data parallel language Connection Machine Fortran (CMF). The discussion here relies primarily on general features of both the CM-5 and CMF. The CM-5 is a distributed-memory, massively parallel computer system. Like the emerging High Performance Fortran (HPF) standard, CMF is a language based on Fortran 90 with additional data layout compiler directives. In principle, the discussion here applies to other distributed-memory, massively parallel computer systems using other programming models and languages. Programming language specifics discussed here are immediately accessible

from both CMF and HPF. The primary programming language constructs used here are data distribution or data layout constructs used to distribute Fortran array elements to memory within the distributed processors (see footnote 2). The syntax :SERIAL and :PARALLEL is used here to denote serial (in local processor memory) and parallel (across processor memory) array dimensions. For example, consider the arrays A, B, and C on an $N_p$ processor machine with cyclic data layout as follows:

    REAL A(N_p), B(3*N_p), C(5, N_p)
    CMPLR$ LAYOUT A(:PARALLEL), B(:PARALLEL)
    CMPLR$ LAYOUT C(:SERIAL, :PARALLEL)

Array A has a single parallel dimension whose number of entries matches the number of processors. Cyclic layout of the parallel dimension places a single entry of A in each processor. Array B, on the other hand, has three times as many entries as there are processors. Cyclic layout of B places B(1 : N_p) one per processor. Similarly, B(N_p+1 : 2*N_p) and B(2*N_p+1 : 3*N_p) are distributed one per processor such that each processor is assigned three entries of B. Array C has both a serial and a parallel axis. The parallel axis of C is distributed identically to that in A. The serial axis of C is distributed such that, for the $k$-th parallel entry of C, a serial vector of length 5 is placed in the processor associated with the $k$-th parallel axis entry of C. Further discussion of these constructs may be found in [13, 14].

For the case in which the syntax CMPLR$ LAYOUT C(:SERIAL, :PARALLEL) is assumed to imply cyclic distribution of data along parallel axes, the equivalent syntax in HPF is

    !HPF$ DISTRIBUTE C(*, CYCLIC)

Currently CMF supports only block layout (described in [13]).
In the case in which the syntax CMPLR$ LAYOUT C(:SERIAL, :PARALLEL) is assumed to imply block distribution of data, the equivalent syntax in HPF is

    !HPF$ DISTRIBUTE C(*, BLOCK)

whereas the equivalent case in CMF is

    CMF$ LAYOUT C(:SERIAL, :NEWS)

For either block or cyclic data distribution, parallel array operations may be invoked using simple array syntax in Fortran 90. For example, in the expression

    A(:) = A(:) + 3/C(4, :)

the ":" denotes "do for all entries of the axis" and invokes the assignment statement in parallel, for all parallel entries in A and C simultaneously.

Footnote 2: In the case of the CM-5, the term processor is used to mean a single vector unit. There are four vector units per processing node on a CM-5. On parallel architectures composed of processing nodes containing only one vector or superscalar processor, the term processor is unambiguous.
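The difference between the cyclic and block layouts of a parallel axis can be illustrated with a small ownership map. The following is a minimal Python sketch, not CMF or HPF code: the helper names `cyclic_owner` and `block_owner` are hypothetical, indices are 0-based, and the block size is assumed to be the ceiling of entries per processor.

```python
def cyclic_owner(i, n_procs):
    """Processor owning 0-based entry i of a parallel axis under a cyclic layout."""
    return i % n_procs


def block_owner(i, n_entries, n_procs):
    """Processor owning 0-based entry i under a block layout.

    Assumes each processor holds one contiguous block of ceil(n_entries / n_procs) entries.
    """
    block_size = -(-n_entries // n_procs)  # ceiling division
    return i // block_size


n_procs = 4                # stands in for N_p
n_entries = 3 * n_procs    # array B(3*N_p) from the text

cyclic = [cyclic_owner(i, n_procs) for i in range(n_entries)]
block = [block_owner(i, n_entries, n_procs) for i in range(n_entries)]

print("cyclic:", cyclic)
print("block: ", block)
```

In both layouts each processor ends up with three entries of B; only the assignment of entries to processors differs (interleaved for cyclic, contiguous for block).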

3. Finite Element Formulation

Here we consider the isothermal transient response of an incompressible fluid. The initial/boundary-value problem is represented in Box 1, where $\mathbf{u}$ is the velocity, $p$ denotes pressure, $\rho$ is the density, $\boldsymbol{\sigma}$ is the Cauchy stress, $\mathbf{f}$ is the body force, and $\mathbf{g}$ and $\mathbf{h}$ are the Dirichlet and Neumann boundary condition values, respectively, enforced on the subsets $(\Gamma_t)_g$ and $(\Gamma_t)_h$ of the boundary $\Gamma_t$ of the possibly evolving domain $\Omega_t$. In the case of a fixed domain, the subscript $t$ denoting time on the domains is dropped. The stress response is assumed to be Newtonian, characterized by the fluid viscosity $\mu$.

1. Momentum Balance on $\Omega_t$

$$\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} - \mathbf{f} \right) - \nabla \cdot \boldsymbol{\sigma} = \mathbf{0}$$

2. Mass Balance (Incompressible) on $\Omega_t$

$$\nabla \cdot \mathbf{u} = 0$$

3. Initial and Boundary Conditions

$$\mathbf{u} = \mathbf{g} \ \text{on} \ (\Gamma_t)_g, \qquad \boldsymbol{\sigma} \cdot \mathbf{n} = \mathbf{h} \ \text{on} \ (\Gamma_t)_h, \qquad \mathbf{u}(\mathbf{x}, 0) = \mathbf{u}_0 \ \text{on} \ \Omega_0$$

4. Stress Response (Newtonian)

$$\boldsymbol{\sigma} = -p\mathbf{I} + \mathbf{T}, \qquad \mathbf{T} = 2\mu\,\boldsymbol{\varepsilon}(\mathbf{u}), \qquad \boldsymbol{\varepsilon}(\mathbf{u}) = \frac{1}{2} \left( \nabla \mathbf{u} + (\nabla \mathbf{u})^T \right)$$

Box 1: Initial/Boundary-Value Problem.

A stabilized, space-time, velocity-pressure formulation is then summarized in Box 2. Here, $(\cdot, \cdot)_{Q_n}$, $(\cdot, \cdot)_{Q_n^e}$ and $(\cdot, \cdot)_{\Omega_n}$ denote $L_2$ inner products over the space-time slab $Q_n$, the single space-time slab element $Q_n^e$ and the spatial domain $\Omega_n$, respectively. The surface $P_n$ is traced by $\Gamma_t$ as $t$ traverses the time interval associated with slab $n$, and $(P_n)_h$ is the subset of $P_n$ corresponding to $(\Gamma_t)_h$. The $(\cdot)_n^+$ and $(\cdot)_n^-$ denote the values of a variable at level $n$ as it is approached from the top and the bottom, respectively. The $\mathcal{Q}^h$ and $\mathcal{V}^h$ are suitable spaces for pressure and velocity functions, and $\tau_{MOM}$ and $\tau_{CONT}$ are stabilization parameters. Further details on this formulation, although not central to the discussions here, may be found in Tezduyar et al. [15, 16].
The space-time formulation is used in the next section because of its notational simplicity, but the parallel implementation issues are the same for a semi-discrete formulation, in which the jump term (Box 2, item 2, term 4 on right-hand side) is dropped, the integration takes place over the spatial domain only, and

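the time derivatives are replaced by appropriate expansions.

1. Finite Element Form

$$B(p^h, \mathbf{u}^h; q^h, \mathbf{v}^h) = F(q^h, \mathbf{v}^h) \qquad \forall\, (q^h, \mathbf{v}^h) \in \mathcal{Q}^h \times \mathcal{V}^h$$

2. $B(p^h, \mathbf{u}^h; q^h, \mathbf{v}^h)$

$$\begin{aligned}
B(p^h, \mathbf{u}^h; q^h, \mathbf{v}^h) ={}& \left( \frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h \cdot \nabla \mathbf{u}^h,\ \rho \mathbf{v}^h \right)_{Q_n} + \left( \boldsymbol{\sigma}(p^h, \mathbf{u}^h),\ \boldsymbol{\varepsilon}(\mathbf{v}^h) \right)_{Q_n} + \left( \rho\, \nabla \cdot \mathbf{u}^h,\ q^h \right)_{Q_n} \\
&+ \left( (\mathbf{u}^h)_n^+ - (\mathbf{u}^h)_n^-,\ \rho (\mathbf{v}^h)_n^+ \right)_{\Omega_n} \\
&+ \sum_{e=1}^{(n_{el})_n} \left( \rho \left( \frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h \cdot \nabla \mathbf{u}^h \right) - \nabla \cdot \boldsymbol{\sigma}(p^h, \mathbf{u}^h),\ \tau_{MOM} \frac{1}{\rho} \left[ \rho \left( \frac{\partial \mathbf{v}^h}{\partial t} + \mathbf{u}^h \cdot \nabla \mathbf{v}^h \right) - \nabla \cdot \boldsymbol{\sigma}(q^h, \mathbf{v}^h) \right] \right)_{Q_n^e} \\
&+ \sum_{e=1}^{(n_{el})_n} \left( \tau_{CONT}\, \nabla \cdot \mathbf{u}^h,\ \rho\, \nabla \cdot \mathbf{v}^h \right)_{Q_n^e}
\end{aligned}$$

3. $F(q^h, \mathbf{v}^h)$

$$F(q^h, \mathbf{v}^h) = \left( \mathbf{f},\ \rho \mathbf{v}^h \right)_{Q_n} + \left( \mathbf{h},\ \mathbf{v}^h \right)_{(P_n)_h} + \sum_{e=1}^{(n_{el})_n} \left( \mathbf{f},\ \tau_{MOM} \left[ \rho \left( \frac{\partial \mathbf{v}^h}{\partial t} + \mathbf{u}^h \cdot \nabla \mathbf{v}^h \right) - \nabla \cdot \boldsymbol{\sigma}(q^h, \mathbf{v}^h) \right] \right)_{Q_n^e}$$

Box 2: Stabilized u-p Space-Time Finite Element Formulation.

A Galerkin/least-squares problem like the one shown in Box 2 leads to a nonlinear coupled system of equations:

$$N(d_n) = F, \tag{1}$$

where $d_n$ is the vector of unknowns associated with marching from time step $n-1$ to $n$ in a semi-discrete formulation, or associated with time slab $n$ in a space-time formulation. For the nonlinear system of equations (1), the Newton-Raphson iterations each require the solution of a linear equation system

$$\left. \frac{\partial N}{\partial d} \right|_{d_n^k} \left( \Delta d_n^k \right) = F - N\left( d_n^k \right), \tag{2}$$

$$A_n^k x_n^k = R_n^k, \tag{3}$$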

where $k$ is the nonlinear iteration counter, $A_n^k = \partial N / \partial d\,|_{d_n^k}$ is the nonsymmetric Jacobian operator, $x_n^k = \Delta d_n^k$ is the vector of increments for unknown solution values and $R_n^k = F - N(d_n^k)$ is the vector of residuals. When discussing the process of the solution of the linear equation system (3), the sub- and superscripts identifying the time step and nonlinear iteration will be dropped, as only one such system is solved at a given time. An outline of the implicit solution to the finite element problem is shown in Box 3.

1. Preprocessing and initial conditions
2. PARTITION data to processors
3. a. Time step loop ($n_{start} = 0$)
   b. Nonlinear iteration loop ($k_{start} = 0$)
4. GATHER nodal $x_n^k$ to elements
5. FORM element matrices and residuals $A_n^{e,k}$, $R_n^{e,k}$
6. SCATTER $R_n^{e,k}$ to assembled $R_n^k$
7. SOLVE $A_n^k x_n^k = R_n^k$
8. a. End $k$ loop ($k \leftarrow k + 1$, go to 3b)
   b. End $n$ loop ($n \leftarrow n + 1$, go to 3a)
9. Postprocessing and visualization

Box 3: Outline of Finite Element Solution.

4. Parallel Implementation

Here, the global programming model described in Section 2 is used to implement the finite element method. The key features of the current finite element implementation on a distributed-memory massively parallel computer are (1) constructing data structures which circumvent unneeded communication of data between processors, (2) mapping the data associated with these data structures to the processors in a manner which efficiently exploits data locality, (3) using efficient gather and scatter algorithms which distinguish on-processor and off-processor data transfers and (4) maintaining favorable load balancing and scaling properties.
Two naturally parallel data structures emerge from the finite element problem: the first associated with the FORM phase (element-ordered data set corresponding to Step 5, the FORM step, in Box 3), the second associated with the SOLVE phase (node-ordered data structure corresponding to Step 7, the SOLVE step, in Box 3). Using the serial and parallel layout constructs described in Section 2, the element-level residual vector $R^e$ and its global counterpart $R$ are represented in these two data structures as shown in Box 4, where $n_{dof}$ is the number of degrees of freedom per node, $n_{en}$ is the number of local nodes residing in an element, $n_{nodes}$ is the number of global nodes and $n_{el}$ is the number of elements. The idea is to construct a parallel array axis of length $n_{el}$ for the FORM data structure and $n_{nodes}$ for the SOLVE data structure. Element information in a FORM array associated with each element is then accumulated by indexing along the serial dimension(s) of the array. Similarly, node information in a SOLVE array associated with each node is also accumulated by indexing along the serial dimension(s).

1. FORM: Element Based

    REAL R_e(n_dof, n_en, n_el)
    CMPLR$ LAYOUT R_e(:SERIAL, :SERIAL, :PARALLEL)

2. SOLVE: Node Based

    REAL R(n_dof, n_nodes)
    CMPLR$ LAYOUT R(:SERIAL, :PARALLEL)

3. Communication

    R_e(n_dof, n_en, n_el)  <-- gather / scatter -->  R(n_dof, n_nodes)

Box 4: FORM and SOLVE Data Structures.

These FORM and SOLVE data structures exhibit natural parallelism in that they enable the FORM step and a number of phases of the SOLVE step of the solution outlined in Box 3 to take place in parallel without communication between processors. With these two data structures, communication between processors within the time step loop occurs predominantly due to communication between the FORM and SOLVE data structures. That is, communication occurs predominantly in the GATHER and SCATTER steps.

Pseudo-code evaluating the boxed terms in Box 2 (the stress term and the time-derivative term) for the FORM phase is shown in Box 5. Note that the repeated indices imply summation, $j_\sigma$ is an index of the stress tensor component, and $i_{sd}$ identifies the space dimension. Here, integration over the space-time slabs $Q_n^e$ is taken as the usual sum over quadrature points. That is,

$$\int_{Q_n^e} \chi(:)\, dQ = \sum_{l=1}^{n_{int}} [\chi(:)]_l\, J_l(:)\, w_l, \tag{4}$$

where $n_{int}$ is the number of integration points, $J_l$ is the determinant of the Jacobian of the finite element mapping, and $w_l$ is the weight. Here, and in the Box 5 pseudocode, ":" implies "do for all elements $i_{el} = 1 : n_{el}$" in a FORM-based array. By definition of the CMPLR$ LAYOUT constructs, for a given element $i_{el}$, the element-level vector $R^e(1 : n_{dof}, 1 : n_{en}, i_{el})$ of $n_{dof} \times n_{en}$ components resides in the memory of processor $p(i_{el})$, where $p(i_{el})$ is a mapping provided by the compiler. Furthermore, an element-level vector $v^e(1 : m, i_{el})$, for any $m$ along a serial dimension and $i_{el}$ along a parallel dimension with the same extent $1 : n_{el}$ as the one in $R^e$, resides in the memory of the same processor $p(i_{el})$.
Consequently, $R^e(1 : n_{dof}, 1 : n_{en}, i_{el})$ and $v^e(1 : m, i_{el})$ reside in the same (virtual) processor for each $i_{el} \in [1, n_{el}]$. With this in mind, it is evident from Box 5 that no inter-processor communication occurs in the FORM phase.

The SOLVE phase, on the other hand, does require communication. A summary of the GMRES algorithm used in the SOLVE phase is shown in Box 6. All quantities in the SOLVE phase are stored in the SOLVE data structure with the exception of the element-level Jacobian matrices $a^e$ (and, as a result, two element-level vectors required to interact with $a^e$), which are stored on the element level for performance reasons. From the

perspective of a parallel implementation, the SOLVE phase consists primarily of dot products ($\alpha = p \cdot q$), SAXPY operations ($p = p + \alpha q$), matrix-vector products ($q = Ap$) and a preconditioning step. Here, only diagonal preconditioning is considered, so that the preconditioning step requires strictly inexpensive on-processor operations, with computing costs not significantly beyond that of a dot product or a SAXPY operation. Pseudo code for such steps of the SOLVE phase is shown in Box 7.

1. $B_\sigma$ Formation

$$B_\sigma(p^h, \mathbf{u}^h; q^h, \mathbf{v}^h) := \int_{Q_n} \boldsymbol{\sigma}(p^h, \mathbf{u}^h) : \boldsymbol{\varepsilon}(\mathbf{v}^h)\, dQ$$

$B_\sigma$ comprises element-level contributions:

$$B_\sigma^e(:) = \int_{Q_n^e} \sigma(j_\sigma, :)\, \varepsilon(j_\sigma, :)\, dQ$$

2. $B_t$ Formation

$$B_t(p^h, \mathbf{u}^h; q^h, \mathbf{v}^h) := \int_{Q_n} \rho\, \frac{\partial \mathbf{u}^h}{\partial t} \cdot \mathbf{v}^h\, dQ$$

$B_t$ comprises element-level contributions:

$$B_t^e(:) = \int_{Q_n^e} \rho(:)\, \dot{u}(i_{sd}, :)\, v(i_{sd}, :)\, dQ$$

Box 5: Pseudo Code: FORM Phase.

The dominant computational portion of the GMRES algorithm is the matrix-vector product. Item 3 in Box 7 highlights a matrix-vector product ($q = Ap$) scheme which consists of three steps: (1) a gather of $p$ to $p^e$ on the element level, (2) an on-processor matrix-vector product ($q^e = a^e p^e$) involving no inter-processor communication and (3) a scatter of $q^e$ to $q$ on the global assembled level. In the gather and scatter steps, iconn(1 : n_en, 1 : n_el) is the nodal connectivity array. This matrix-vector product scheme was initially proposed by Johnsson and Mathur [17] and demonstrates favorable performance characteristics on Connection Machine systems. Note that communication in the SOLVE phase occurs in the global sums within dot products and in the gather/scatter steps of the matrix-vector product, the latter being the dominant communication steps.

In terms of the High Performance Fortran FORALL construct, the gather step may be expressed in the form

    FORALL (i_dof = 1:n_dof, i_en = 1:n_en, i_el = 1:n_el)
      v_e(i_dof, i_en, i_el) = v(i_dof, iconn(i_en, i_el))                    (5)

A scatter, on the other hand, must account for collisions of data at the destination and hence takes the form

    DO i_dof = 1, n_dof
      FORALL (i_node = 1:n_nodes)
        v(i_dof, i_node) = v(i_dof, i_node) +
          SUM(v_e(i_dof, 1:n_en, :), MASK = iconn(1:n_en, :) .EQ. i_node)
    END DO                                                                    (6)
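The semantics of the gather (5) and the collision-summing scatter (6) can be illustrated serially. The following is a minimal Python sketch under stated assumptions: the function names `gather` and `scatter_add` are hypothetical, indices are 0-based, arrays are nested lists ordered element-first (the reverse of the Fortran index order above), and the scatter accumulates into a zero-initialized nodal array rather than into an existing `v`. It says nothing about how the data are distributed or communicated on a parallel machine.

```python
def gather(v, iconn):
    """Copy nodal values to elements: v_e[el][local_node][dof] = v[iconn[el][local_node]][dof]."""
    return [[list(v[node]) for node in element] for element in iconn]


def scatter_add(v_e, iconn, n_nodes):
    """Accumulate element values into nodes, summing where several elements hit one node."""
    n_dof = len(v_e[0][0])
    v = [[0.0] * n_dof for _ in range(n_nodes)]
    for element, values in zip(iconn, v_e):
        for node, node_values in zip(element, values):
            for i_dof in range(n_dof):
                v[node][i_dof] += node_values[i_dof]  # collision: add, do not overwrite
    return v


# Two 1-D two-node elements sharing node 1, with one degree of freedom per node.
iconn = [[0, 1], [1, 2]]
v = [[1.0], [2.0], [3.0]]

v_e = gather(v, iconn)
assembled = scatter_add(v_e, iconn, n_nodes=3)

print(v_e)        # element-level copies; node 1 appears in both elements
print(assembled)  # node 1 receives the sum of both element contributions
```

The shared node shows why a plain indexed assignment is not enough for the scatter: both elements write to node 1, so their contributions have to be summed.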

    DO l = 1, n_outer                          GMRES outer iterations
      r_0 := R - A x_0                         compute initial residual
      β := ||r_0||_2                           compute initial residual norm
      v_1 = r_0 / β                            define first Krylov vector
      DO j = 1, m                              GMRES inner iterations
        z_j := M_j^{-1} v_j                    preconditioning step
        w := A z_j                             matrix-vector product
        DO i = 1, j                            Gram-Schmidt orthogonalization
          h_{i,j} := (w, v_i)
          w := w - h_{i,j} v_i
        END DO
        h_{j+1,j} := ||w||_2
        v_{j+1} := w / h_{j+1,j}               define next Krylov vector
      END DO
      H := {h_{i,j}}                           define reduced system matrix
      y := argmin_ŷ ||β e_1 - H ŷ||_2          solve reduced system
      x := x_0 + Σ_{i=1}^{m} y_i z_i           form approximate solution
      IF ||β e_1 - H y||_2 ≤ ε EXIT            convergence check
      x_0 := x                                 restart
    END DO

Box 6: GMRES algorithm: Algorithm Summary.

1. Dot Product: α = p · q

    α = SUM(p(i_dof, :) * q(i_dof, :))

2. SAXPY: p = p + αq

    p(i_dof, :) = p(i_dof, :) + α q(i_dof, :)

3. Matrix-Vector Multiply: q = A p

    p_e(i_dof, i_en, :) = p(i_dof, iconn(i_en, :))                                 (Gather)
    q_e(i_dof, i_en, :) = a_e(i_dof, i_en, j_dof, j_en, :) p_e(j_dof, j_en, :)     (Local Mult)
    q(i_dof, iconn(i_en, :)) = q_e(i_dof, i_en, :)                                 (Scatter) [Add Collisions]

Box 7: Pseudo Code: SOLVE Phase.

In the numerical implementation, for performance reasons, the gather/scatter steps are implemented on the CM-5 using high performance communication algorithms which replace

the FORALL statements above with single function calls. The gather and scatter steps are discussed further in the following section.

$$Ax = \frac{\partial N}{\partial d}\, x$$

$$\frac{\partial N}{\partial d}\, x \approx \frac{N(d + \varepsilon x) - N(d)}{\varepsilon}$$

$$R_\varepsilon = R(d + \varepsilon x) = F - N(d + \varepsilon x), \qquad R = R(d) = F - N(d)$$

$$Ax \approx -\frac{R_\varepsilon - R}{\varepsilon}$$

Box 8: Matrix-Free Linearization.

An alternative matrix-free GMRES solution scheme may be used, based on a matrix-free linearization of the residual as represented in Box 8. In the matrix-free linearization, which is due to Johan [18], the linear part of the residual, represented as a matrix-vector product $Ax$, is approximated by $-(R_\varepsilon - R)/\varepsilon$, where $R_\varepsilon = R(d + \varepsilon x)$, $R = R(d)$, and $\varepsilon$ is a suitably small number [18]. As a result, in the parallel implementation, the GMRES algorithm differs only in replacing the matrix-vector product in the above algorithm with this simple difference formula between the residuals. In particular, the above three-step matrix-vector product is replaced by the steps (1) gather the current solution vector $d$ to the element level $d^e$, (2) FORM the updated element-level residual $R_\varepsilon^e$ on the element level (without inter-processor communication) based upon $d + \varepsilon x$ and (3) scatter $R_\varepsilon^e$ to the global assembled level and perform the difference formula. That is, from a computational perspective, this scheme differs from the matrix-vector product scheme primarily in that the on-processor matrix-vector product is replaced with formation of the element-level residual $R_\varepsilon^e$. Notice that the element-level Jacobian matrices $a^e$ from Box 7 need not be stored in the matrix-free case, resulting in substantial memory savings, since storage of these matrices dominates the memory requirements in the original formulation.
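The difference formula of Box 8 can be checked on a small example. The following is a minimal Python sketch, not the paper's implementation: the two-unknown operator `N`, its hand-coded Jacobian, the right-hand side `F`, and the value of `eps` are all hypothetical choices made only to compare $-(R_\varepsilon - R)/\varepsilon$ with an explicit Jacobian-vector product.

```python
def N(d):
    """Hypothetical nonlinear operator with two unknowns."""
    return [d[0] ** 2 + d[1], d[0] + 3.0 * d[1] ** 3]


def jacobian(d):
    """Exact Jacobian dN/dd of the operator above, used only for comparison."""
    return [[2.0 * d[0], 1.0], [1.0, 9.0 * d[1] ** 2]]


def residual(d, F):
    """R(d) = F - N(d), as in Box 8."""
    return [f - n for f, n in zip(F, N(d))]


def matrix_free_matvec(d, x, F, eps=1.0e-6):
    """Approximate A x by -(R(d + eps*x) - R(d)) / eps without forming A."""
    shifted = [di + eps * xi for di, xi in zip(d, x)]
    R_eps = residual(shifted, F)
    R = residual(d, F)
    return [-(re - r) / eps for re, r in zip(R_eps, R)]


F = [1.0, 1.0]    # arbitrary right-hand side; it cancels in the difference
d = [1.0, 2.0]    # current solution estimate
x = [0.5, -1.0]   # direction vector, e.g. a Krylov vector

approx = matrix_free_matvec(d, x, F)
A = jacobian(d)
exact = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

print("matrix-free:", approx)
print("explicit:   ", exact)
```

The two results agree up to the truncation error of the one-sided difference, which grows with `eps`, and round-off error, which grows as `eps` shrinks; the choice of `eps` is therefore a trade-off.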
These memory savings are accompanied by additional on-processor computational requirements, however, since computing the residual $R_\varepsilon^e$ on subsequent GMRES iterations typically requires more on-processor computation than does the on-processor matrix-vector multiply discussed above. A comparison of the matrix-free and original GMRES solvers is provided in Section 6.

5. Data Decomposition and Communication

Partitioning of the data associated with the FORM and SOLVE data structures into groups, each group being associated with a single processor of the parallel computer, is used to increase the efficiency of the GATHER and SCATTER steps by attempting to

minimize the off-processor communication in these communication steps, taking maximum advantage of data locality. A parallel implementation of the RSB algorithm is used to decompose and distribute the element data (FORM data structure) to the processors based on the modal analysis of the graph of the connectivity array describing the connectivity between the elements (dual connectivity). The RSB algorithm, with origins due to Pothen et al. [11] and Simon [12], provides a robust, systematic tool for generating efficient data decompositions in parallel. The parallel implementation of the RSB algorithm used here is due to Johan [18], and is available in the Connection Machine Scientific Software Library.

The data decomposition generated by the bisection algorithm is exploited in efficient gather and scatter communication algorithms which account for locality of data residing in a given processor by breaking each communication step (gather or scatter) into two distinct phases: an on-processor communication step (with communication speeds on the order of the memory bandwidth) and an off-processor communication step (with communication rates on the order of the network bandwidth). Such a two-step algorithm is natural within a message passing programming model. The data parallel implementation of the two-step algorithm is more subtle due to the high-level language constructs. The data parallel two-step algorithm used here is due to Johan et al. [10] and exhibits favorable load balancing and scaling properties for large classes of problems.
The performance advantages of the data decomposition and two-step communication strategies depend on the amount of data gathered (or scattered) from the surface elements in one partition to those in another partition (hereafter referred to as surface data) relative to the amount of data gathered (or scattered) within the internal volume of an element partition (hereafter referred to as volume data). Provided the mesh partitioning algorithm produces suitably compact, contiguous element groups, the ratio of surface data to volume data becomes small as the number of elements in typical partitions becomes large. Hence, the amount of surface data communicated at network bandwidth speeds is small relative to the amount of volume data communicated at memory bandwidth speeds.

The relative amounts of surface data and volume data in a mesh, and hence the performance improvements available from the mesh partitioning and communication schemes, depend on mesh geometry. Three-dimensional meshes typically exhibit more favorable surface data to volume data ratios than do two-dimensional meshes and hence experience more pronounced speed-ups from the data decomposition strategies. To illustrate this, it is useful to look in detail at the amount of surface and volume data which exists first within a general finite element mesh and next within simple illustrative meshes.

To begin, assume that the distribution of the global nodes to the processors in the SOLVE data structure is such that (1) nodes internal to the element partition (nodes not on the element partition boundary) are assigned to the processor associated with that element partition and (2) nodes on the element partition boundary (nodes associated with surface data) are assigned such that two element partitions sharing a set of nodes each receive a random subset of those shared nodes. Such a node distribution is in fact the one used in the mesh partitioning scheme used here.
With this in mind, for this discussion, it is reasonable to characterize the amount of data sent off-processor from an element partition in the gather or scatter operation (equations (5) and (6)) as roughly half the data associated with the partition boundary nodes. Consequently, the number of array elements of v(1 : n_dof, 1 : n_nodes) sent off-processor from a single element partition is roughly half the number of partition boundary nodes times $n_{dof}$. The two-step scatter described in [10] is composed of (1) a scatter of element data within a partition to an intermediate set of (pseudo-global) nodes local to that partition and (2) a scatter of the data associated with this intermediate set of nodes to the global nodes. The first step involves strictly on-processor data motion ($n_{dof}\, n_{en}\, n_{el}^{partition}$ words per partition, where $n_{el}^{partition}$ is the number of elements in the partition). The second step involves both on-processor data motion ($n_{dof}\, n_{np}^{V partition} + \frac{1}{2} n_{dof}\, n_{np}^{\partial partition}$ words per partition, where $n_{np}^{V partition} := n_{np}^{partition} - n_{np}^{\partial partition}$) and off-processor data motion ($\frac{1}{2} n_{dof}\, n_{np}^{\partial partition}$ words per partition). Here, $n_{np}^{partition}$ is the number of nodes in the partition and $n_{np}^{\partial partition}$ is the number of nodes on the partition boundary.

Footnote: Obtaining a true minimum is an NP-complete (i.e. intractable) problem.

Next, consider the four simple meshes shown in Figure 1: (1) a two-dimensional square mesh of quadrilateral elements (4 nodes per element, $N \times N$ elements), (2) a two-dimensional square mesh of triangular elements (3 nodes per element, $2 \times N \times N$ elements), (3) a three-dimensional cubic mesh of brick elements (8 nodes per element, $N \times N \times N$ elements) and (4) a three-dimensional cubic mesh of tetrahedral elements (4 nodes per element, $5 \times N \times N \times N$ elements). We constrain the tetrahedral mesh to be that generated from the brick mesh by decomposing each brick into 5 tetrahedral elements containing only those nodes which exist in the brick elements (see Figure 1), and place a similar constraint on the pair of two-dimensional meshes.
We also require that the elements of each mesh are partitioned into identical rectangular (two-dimensional) or rectangular parallelepiped (three-dimensional) partitions of elements on each processor, so that identical nodes comprise corresponding partitions in each mesh.

Figure 1. Meshes for communication bandwidth tests.

In the event that these simple meshes are partitioned into $N_p$ partitions of $n \times n$ quadrilateral elements in the two-dimensional case (each quadrilateral subdivided into 2 triangles for the triangular mesh) and $n \times n \times n$ hexahedral elements in the three-dimensional case (each hexahedron subdivided into 5 tetrahedra for the tetrahedral mesh), the number of array elements associated with on-processor (volume data) and off-processor (surface data) data motion is shown as a function of $n$ in Table 1. Steps 1 and 2 in Table 1 refer to the steps in the two-step gather-scatter. Noting that $n$ is a linear function of $N$ for each mesh, it is evident from Table 1 that the amount of surface data is $O(N^{d-1})$ whereas the amount of volume data is $O(N^d)$, where $d$ is the dimensionality of the mesh. As the size of the mesh, and hence $N$, increases, the amount of volume data quickly dominates the surface data.

Footnote: As is described in Johan et al. [10], the off-processor communication occurs on a node-to-node basis as opposed to an element-to-node basis.

                 Step 1                        Step 2
    Mesh         On-PN                Off-PN   On-PN                          Off-PN
    Quads        n_dof n_en n^2       0        n_dof ((n-1)^2 + 2n)           n_dof 2n
    Triangles    n_dof n_en 2n^2      0        n_dof ((n-1)^2 + 2n)           n_dof 2n
    Hexahedra    n_dof n_en n^3       0        n_dof ((n-1)^3 + 3n^2 + 1)     n_dof (3n^2 + 1)
    Tetrahedra   n_dof n_en 5n^3      0        n_dof ((n-1)^3 + 3n^2 + 1)     n_dof (3n^2 + 1)

Table 1. Communication load for square and cubic partitions ($n_x = n_y = n_z = n$).

Ratios of surface data to volume data for meshes associated with specific values of $N$ are shown in Tables 2 and 3, where $n_{dof}$ is assumed to be 4. Here the partitions are not square as they are in Table 1, however. Table 2 corresponds to two-dimensional quadrilateral meshes: (1) an $N = 128$ mesh of $16 \times 8$ partitions of $8 \times 16$ elements and (2) an $N = 256$ mesh of $16 \times 8$ partitions. Table 3 corresponds to three-dimensional hexahedral meshes: (1) an $N = 24$ mesh with partitions of $3 \times 6 \times 6$ elements and (2) an $N = 32$ mesh. The triangular and tetrahedral meshes are generated from the quadrilateral and hexahedral meshes as described above. From Tables 2 and 3, one can see that the ratio of surface data sent off-processor (at network bandwidth rates) to volume data sent on-processor (at memory bandwidth rates) is quite small, even for these moderately sized meshes.

The degree to which a two-step gather or scatter will experience speed-ups due to a particular data partitioning strategy on a particular computer system is a function of both the memory and the network bandwidths of the computer system. In the case of the CM-5, the speed-ups for the two-dimensional meshes considered in Table 2 are shown in Table 4, while the speed-ups for the three-dimensional meshes considered in Table 3 are shown in Table 5.
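The Step 2 entries of Table 1 follow from counting interior and boundary nodes of a partition, with half of the boundary nodes assumed to be owned by neighboring partitions. The following is a minimal Python sketch of that count for quadrilateral and hexahedral partitions; the function name `step2_counts` and the sampled values of `n` are hypothetical, and only the Step 2 columns are reproduced.

```python
def step2_counts(n, dim, n_dof):
    """Step 2 (on-PN, off-PN) word counts for a partition of n^dim quads or hexahedra."""
    total_nodes = (n + 1) ** dim
    interior_nodes = (n - 1) ** dim
    boundary_nodes = total_nodes - interior_nodes  # 4n in 2-D, 6n^2 + 2 in 3-D
    on_pn = n_dof * (interior_nodes + boundary_nodes // 2)
    off_pn = n_dof * (boundary_nodes // 2)
    return on_pn, off_pn


def off_to_on_ratio(n, dim, n_dof=4):
    """Ratio of off-processor to on-processor words in Step 2."""
    on_pn, off_pn = step2_counts(n, dim, n_dof)
    return off_pn / on_pn


sizes = (4, 8, 16, 32)
ratios_2d = [off_to_on_ratio(n, dim=2) for n in sizes]
ratios_3d = [off_to_on_ratio(n, dim=3) for n in sizes]

for n, r2, r3 in zip(sizes, ratios_2d, ratios_3d):
    print(f"n = {n:2d}: 2-D ratio = {r2:.3f}, 3-D ratio = {r3:.3f}")
```

In both two and three dimensions the Step 2 ratio falls as the partition edge length `n` grows, which is the $O(N^{d-1})$ versus $O(N^d)$ behavior noted above.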
In Tables 4 and 5 the non-partitioned results are based on the communication strategy outlined in [19], with random distribution of the nodal points to the processors. Notice that the two-step scheme with partitioning offers dramatic speed-ups and that the speed-ups for the three-dimensional problems exceed those for the two-dimensional problems.

6. Numerical Simulations

3D flow around a cylinder: matrix-free vs. non-matrix-free performance

In this section we consider a three-dimensional simulation of the flow past a circular cylinder. The simple problem geometry allows us to generate meshes of hexahedral and tetrahedral elements with relative ease. Here we use the two meshes shown in Figure 2. The mesh shown in Figure 2 a) consists of 100,907 tetrahedral elements and 21,188 nodes, while the mesh in Figure 2 b) consists of 18,396 hexahedral elements and 21,460 nodes. The steady flow field at Re = 100 is obtained on both meshes with each technique (matrix-free and non-matrix-free). Figure 3 shows the steady-state pressure field around the cylinder obtained with the tetrahedral mesh.
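For orientation, the difference between the two techniques can be sketched in a few lines of Python. In a matrix-free Krylov solver the product of the Jacobian with a vector is not formed from a stored matrix but approximated from residual evaluations. The sketch below assumes the common one-sided finite-difference form $J(u)\,v \approx (R(u + \epsilon v) - R(u))/\epsilon$; the formula actually used in the implementation is not stated here. The model residual, the function names and the value of `eps` are all hypothetical.

```python
def residual(u):
    """Hypothetical nonlinear residual R(u) standing in for the assembled
    finite element residual (1D diffusion stencil plus a cubic term)."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(2.0 * u[i] - left - right + u[i] ** 3 - 1.0)
    return r


def jacobian_matvec(u, v):
    """Exact J(u) v for the model residual (the stored-matrix reference)."""
    n = len(u)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append((2.0 + 3.0 * u[i] ** 2) * v[i] - left - right)
    return out


def matrix_free_matvec(residual_fn, u, v, r_u=None, eps=1e-6):
    """Approximate J(u) v with one extra residual evaluation."""
    if r_u is None:
        r_u = residual_fn(u)  # can be reused for every Krylov vector
    perturbed = residual_fn([ui + eps * vi for ui, vi in zip(u, v)])
    return [(rp - r0) / eps for rp, r0 in zip(perturbed, r_u)]


if __name__ == "__main__":
    u = [0.1, 0.2, 0.3, 0.4]
    v = [1.0, -1.0, 0.5, 2.0]
    print(matrix_free_matvec(residual, u, v))
    print(jacobian_matvec(u, v))
```

Each Krylov vector costs one additional residual evaluation in this form, so the work of the matrix-free variant grows with the number of inner iterations, while a stored-matrix variant pays the formation cost once per nonlinear iteration.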

| Mesh | Step 1 On-PN | Step 1 Off-PN | Step 2 On-PN | Step 2 Off-PN | Off-PN to On-PN Ratio |
|---|---|---|---|---|---|
| Quads (N = 128) | | | | | |
| Triangles (N = 128) | | | | | |
| Quads (N = 256) | | | | | |
| Triangles (N = 256) | | | | | |

Table 2. Surface to volume data ratios for the square meshes.

| Mesh | Step 1 On-PN | Step 1 Off-PN | Step 2 On-PN | Step 2 Off-PN | Off-PN to On-PN Ratio |
|---|---|---|---|---|---|
| Hexahedra (N = 24) | | | | | |
| Tetrahedra (N = 24) | | | | | |
| Hexahedra (N = 32) | | | | | |
| Tetrahedra (N = 32) | | | | | |

Table 3. Surface to volume data ratios for the cubic meshes.

| Mesh | Gather, non-partitioned | Gather, partitioned | Scatter, non-partitioned | Scatter, partitioned |
|---|---|---|---|---|
| Quads (N = 128) | 2.1 (30.8 ms) | 7.9 (8.4 ms) | 2.1 (31.3 ms) | 6.4 (10.2 ms) |
| Triangles (N = 128) | 2.0 (48.9 ms) | 12.4 (7.9 ms) | 2.0 (50.7 ms) | 9.5 (10.4 ms) |
| Quads (N = 256) | 1.8 (147.0 ms) | 10.6 (24.7 ms) | 1.7 (154.6 ms) | 8.9 (29.9 ms) |
| Triangles (N = 256) | 1.6 (239.7 ms) | 13.5 (29.4 ms) | 1.6 (247.3 ms) | 12.6 (32.1 ms) |

Table 4. Bandwidth comparison for the square mesh in MB/s/PN.

| Mesh | Gather, non-partitioned | Gather, partitioned | Scatter, non-partitioned | Scatter, partitioned |
|---|---|---|---|---|
| Hexahedra (N = 24) | 1.8 (61.3 ms) | 8.3 (13.3 ms) | 2.1 (51.9 ms) | 8.0 (13.8 ms) |
| Tetrahedra (N = 24) | 1.8 (155.8 ms) | 18.8 (14.7 ms) | 1.7 (159.1 ms) | 14.2 (19.5 ms) |
| Hexahedra (N = 32) | 2.0 (132.6 ms) | 9.3 (28.3 ms) | 1.7 (151.4 ms) | 8.1 (32.3 ms) |
| Tetrahedra (N = 32) | 1.5 (438.1 ms) | 20.2 (32.5 ms) | 1.5 (450.2 ms) | 19.7 (33.3 ms) |

Table 5. Bandwidth comparison for the cubic mesh in MB/s/PN.
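The speed-ups implied by the timings in Tables 4 and 5 can be checked with a few lines of Python. The sketch below only divides the non-partitioned time by the partitioned time for each mesh and averages the result per dimension; the timings are copied from the tables, while the dictionary layout and the function names are hypothetical.

```python
# (non-partitioned ms, partitioned ms), copied from Tables 4 and 5.
GATHER_MS = {
    "Quads (N = 128)": (30.8, 8.4),
    "Triangles (N = 128)": (48.9, 7.9),
    "Quads (N = 256)": (147.0, 24.7),
    "Triangles (N = 256)": (239.7, 29.4),
    "Hexahedra (N = 24)": (61.3, 13.3),
    "Tetrahedra (N = 24)": (155.8, 14.7),
    "Hexahedra (N = 32)": (132.6, 28.3),
    "Tetrahedra (N = 32)": (438.1, 32.5),
}
SCATTER_MS = {
    "Quads (N = 128)": (31.3, 10.2),
    "Triangles (N = 128)": (50.7, 10.4),
    "Quads (N = 256)": (154.6, 29.9),
    "Triangles (N = 256)": (247.3, 32.1),
    "Hexahedra (N = 24)": (51.9, 13.8),
    "Tetrahedra (N = 24)": (159.1, 19.5),
    "Hexahedra (N = 32)": (151.4, 32.3),
    "Tetrahedra (N = 32)": (450.2, 33.3),
}
TWO_D = [k for k in GATHER_MS if k.startswith(("Quads", "Triangles"))]
THREE_D = [k for k in GATHER_MS if k.startswith(("Hexahedra", "Tetrahedra"))]


def speedup(times):
    """Time without partitioning divided by time with partitioning."""
    non_partitioned, partitioned = times
    return non_partitioned / partitioned


def mean_speedup(table, meshes):
    """Arithmetic mean of the speed-ups over the given meshes."""
    return sum(speedup(table[m]) for m in meshes) / len(meshes)


if __name__ == "__main__":
    for name, table in (("gather", GATHER_MS), ("scatter", SCATTER_MS)):
        for mesh, times in table.items():
            print(f"{name:8s}{mesh:22s}{speedup(times):6.2f}")
        print(name, "mean 2D:", round(mean_speedup(table, TWO_D), 2),
              "mean 3D:", round(mean_speedup(table, THREE_D), 2))
```

On these numbers the mean speed-up is larger for the three-dimensional meshes than for the two-dimensional ones, for both gather and scatter, although individual hexahedral cases fall below some of the two-dimensional cases.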

Figure 2. a) Surface of the tetrahedral and b) hexahedral cylinder mesh.

Figure 3. Surface steady pressure field at Reynolds number 100.

The parameter which influences the relative performance of the two techniques is the size of the Krylov space. Since in the matrix-free technique the matrix-vector products are replaced by residual evaluations, it is computation dominated; hence increasing the size of the Krylov space results in a relative slow-down of the matrix-free technique. Table 6 indicates the time required per nonlinear iteration, as well as the overall communication percentage, for different numbers of inner iterations (i.e. Krylov space sizes) for the tetrahedral mesh. Table 7 shows the same data for the hexahedral mesh. The measurements were taken on a CM-5 with 128 processing nodes for the tetrahedral mesh, and on a CM-5 with 32 processing nodes for the hexahedral mesh, resulting in similar subgrid lengths for the two problems. We observe that for smaller Krylov spaces the matrix-free technique is faster, with a break-even point at around 8 inner iterations in the case of the tetrahedral mesh, and around 30 inner iterations in the case of the hexahedral mesh. The tetrahedral result is similar to findings by Johan [18] for compressible flows. We use 4 Gauss points for the tetrahedral mesh and 8 for the hexahedral mesh. Note that in the current matrix-free implementation we store the values of the shape functions and Jacobians of the element domain transformation. At most (in the case of deforming meshes) they are computed once every nonlinear iteration.

| Krylov space size | Matrix-free iteration cost (sec) | Matrix-free comm. percentage | Non-matrix-free iteration cost (sec) | Non-matrix-free comm. percentage |
|---|---|---|---|---|
| | | 18.2% | 1.15 | 18.7% |
| | | 21.8% | 1.19 | 21.3% |
| | | 19.3% | 1.31 | 20.5% |
| | | 21.8% | 1.87 | 27.6% |
| | | 22.8% | 2.51 | 32.1% |
| | | 22.2% | 3.24 | 32.1% |

Table 6. Matrix-free vs. non-matrix-free comparison for the tetrahedral mesh.

| Krylov space size | Matrix-free iteration cost (sec) | Matrix-free comm. percentage | Non-matrix-free iteration cost (sec) | Non-matrix-free comm. percentage |
|---|---|---|---|---|
| | | 17.2% | 4.24 | 9.2% |
| | | 20.5% | 4.62 | 12.2% |
| | | 21.0% | 4.84 | 13.8% |
| | | 21.1% | 6.06 | 21.0% |
| | | 22.9% | 7.68 | 23.1% |
| | | 22.1% | 9.08 | 25.8% |

Table 7. Matrix-free vs. non-matrix-free comparison for the hexahedral mesh.

Flow around a submarine: partitioning benefits

This simulation involves three-dimensional flow around a Los Angeles-class submarine. The ability to handle completely unstructured meshes is important when studying flows around complex shapes, since it is difficult to construct a structured mesh around a complex three-dimensional object. The semi-automatic structured-mesh generators are generally less flexible and require more user intervention than fully automatic mesh generators designed for unstructured meshes.
An example of the latter is the finite octree tetrahedral mesh generator developed by Shephard [20]. Here, this mesh generator was used to create a mesh around a Los Angeles-class submarine. The input to the mesh generator consisted of a geometric definition of the bounding surfaces of the mesh, including the outer rectangular box, and a surface model of the submarine hull. The hull geometric model was digitized from commercially available data and was composed of a number of triangular and rectangular Bézier patches. The mesh used for the current computations consisted of 86,111 nodes and 428,157 tetrahedral elements. Selected surfaces of that mesh are shown in Figure 4.

Figure 4. Surface of the submarine mesh.

In these initial computations the domain was stationary and therefore a more computationally efficient semi-discrete implementation (Tezduyar et al. [21]) was used in place of the space-time formulation. The boundary conditions consisted of a specified uniform inflow velocity, zero-normal-velocity/zero-shear-stress boundary conditions at the external lateral boundaries, a traction-free outflow boundary, and a no-slip condition on the submarine hull. The Reynolds number is based on the free-stream fluid velocity and submarine length. The computations were restarted from a steady-state solution at Reynolds number [...]. The Reynolds stress was modeled using a Smagorinsky turbulence model after Kato [22]. In this model, the kinematic viscosity $\nu$ is augmented by an eddy viscosity

$$\nu_T = (C h)^2 \left( 2\, \boldsymbol{\varepsilon}(\mathbf{u}) : \boldsymbol{\varepsilon}(\mathbf{u}) \right)^{1/2}, \qquad (7)$$

where $C = 0.15$ is the model constant and $h$ is the element length. In the transient phase of the solution, a Krylov space of size 50 was used in the GMRES solver with no restarts. At each time step 4 nonlinear iterations were performed. A representative result from this preliminary computation is presented in Figure 5, which shows the pressure field on the submarine hull. At this point in the computation, the drag coefficient remained at [...].

Figure 5. Pressure distribution on the submarine hull.

The overall sustained performance and communication performance for this simulation are shown in Table 8. The communication performance is shown both for the case of the two-step communication of partitioned data (see Section 5) and for the case of a single-step communication (see Mathur and Johnsson [19]) with random distribution of the nodes. Figure 6 shows the partitioning for 2048 vector units on the surface mesh of the submarine hull. Table 8 shows, with and without partitioning, the overall speed in GigaFLOPS, the time taken per nonlinear iteration, and the gather and scatter bandwidths attained in the GMRES solver. All measurements were taken on a CM-5 computer with 512 processing nodes and 2048 vector execution units. Note that the difference in the FORM phase speed between the partitioned and non-partitioned cases is statistical and/or possibly due to the load on the front end. The partitioning is observed to more than double the overall speed, by decreasing the gather cost by a factor of 7 and the scatter cost by a factor of 3.5.

Figure 6. Partitioning of the submarine mesh for 2048 vector units.

| | Non-partitioned | Partitioned |
|---|---|---|
| FORM phase speed | 11.5 GigaFLOPS | 12.3 GigaFLOPS |
| Overall speed | 2.4 GigaFLOPS | 5.4 GigaFLOPS |
| Time per iteration | 9.9 sec | 4.4 sec |
| Gather bandwidth | 1.5 MB/s/PN | 10.4 MB/s/PN |
| Scatter bandwidth | 1.8 MB/s/PN | 6.4 MB/s/PN |

Table 8. Performance with and without mesh partitioning.
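Returning to the eddy viscosity of equation (7), its evaluation at a single point can be sketched in Python as follows. This is an illustration only: it assumes that $\varepsilon(u)$ denotes the symmetric part of the velocity gradient and that the gradient is available as a 3 × 3 nested list; the function names `strain_rate` and `eddy_viscosity` are hypothetical and do not come from the implementation described here.

```python
import math


def strain_rate(grad_u):
    """Symmetric part of the 3x3 velocity gradient: 0.5 * (grad_u + grad_u^T)."""
    return [[0.5 * (grad_u[i][j] + grad_u[j][i]) for j in range(3)]
            for i in range(3)]


def eddy_viscosity(grad_u, h, c=0.15):
    """Smagorinsky eddy viscosity (C h)^2 * sqrt(2 eps:eps), as in equation (7)."""
    eps = strain_rate(grad_u)
    # Double contraction eps:eps = sum over i, j of eps_ij * eps_ij.
    contraction = sum(eps[i][j] * eps[i][j] for i in range(3) for j in range(3))
    return (c * h) ** 2 * math.sqrt(2.0 * contraction)


if __name__ == "__main__":
    # Simple shear du_x/dy = 2 with element length h = 0.1.
    shear = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(eddy_viscosity(shear, h=0.1))
    # Rigid-body rotation has zero strain rate, hence no eddy viscosity.
    rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(eddy_viscosity(rotation, h=0.1))
```

For simple shear with rate $\gamma$ the expression reduces to $(C h)^2 |\gamma|$, which gives a quick sanity check of any element-level implementation.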

7. Concluding Remarks

We have discussed various aspects of a data parallel implementation of finite element methods for computational fluid dynamics. The foundation for such an implementation is the existence of high-level data parallel programming languages such as Connection Machine Fortran or High Performance Fortran. These languages are ideal for exploiting the fine-grain parallelism occurring naturally in finite element problems on large meshes. We based the implementation discussion on a space-time velocity-pressure formulation of the incompressible Navier-Stokes equations, and noted that this discussion is equally relevant to many other formulations, including those that employ conventional time-stepping methods. The issues covered include the selection of two principal data storage modes, the formation of element-level residual vectors, and the iterative solution process used to solve the linear system of equations arising at each nonlinear iteration step.

Subsequently we investigated how additional control over the distribution of the data elements in the two storage modes can be used to significantly reduce the cost of communication between these storage sets. Here we used the two-step gather and scatter routines from the Connection Machine Scientific Software Library. Using a 3D flow past a cylinder as an example, we compared the performance of the aforementioned implementation with a standard GMRES solver against that of its matrix-free version. Finally we presented some results from a 3D simulation of a flow past a complex submarine model, and compared the throughput of both the standard and the two-step communication routines on this practical problem.
The preconditioning of the linear system arising from the finite element formulation is still an open issue, especially significant in the incompressible case, where some degree of global (i.e., not local to element or node) preconditioning can dramatically improve convergence. In the examples presented here, only a diagonal preconditioning/scaling has been used.

8. Acknowledgments

This research was sponsored by NASA-JSC under grant NAG 9-449, by NSF under grants CTS and ASC, by ARPA under NIST contract 60NANB2D1272, and by ARO under grant DAAH04-93-G. Partial support for this work has also come from the ARO contract number DAAL03-89-C-0038 with the AHPCRC at the University of Minnesota. We are indebted to Zdenek Johan for helpful comments and for providing access to his CM-5 implementations of both the RSB algorithm for data decomposition and the two-step gather and scatter algorithms. We are also indebted to Kapil Mathur for helpful comments and his contributions to the two-step gather and scatter algorithms.

References

[1] J.G. Malone, Automatic mesh decomposition and concurrent finite element analysis for hypercube multiprocessor computers, Computer Methods in Applied Mechanics and Engineering, 70 (1988)

[2] C. Farhat and E. Wilson, A new finite element concurrent computer program architecture, International Journal for Numerical Methods in Engineering, 24 (1987)

[3] G.A. Lyzenga, A. Raefsky, and B.H. Hager, Finite elements and the method of conjugate gradients on concurrent processors, Report C3P-119, California Institute of Technology, Pasadena, CA,

[4] K.K. Mathur and S.L. Johnsson, The finite element method on a data parallel computing system, International Journal of High Speed Computing, 1 (1989)

[5] T. Belytschko, E.J. Plaskacz, J.M. Kennedy, and D.L. Greenwell, Finite element analysis on the Connection Machine, Computer Methods in Applied Mechanics and Engineering, 81 (1990)

[6] R.A. Shapiro, Implementation of an Euler/Navier-Stokes finite element algorithm on the Connection Machine, in AIAA , AIAA 29th Aerospace Sciences Meeting, (1991).

[7] C. Farhat, N. Sobh, and K.C. Park, Transient finite element computations on 65,536 processors: The Connection Machine, International Journal for Numerical Methods in Engineering, 30 (1990)

[8] Z. Johan, T.J.R. Hughes, K.K. Mathur, and S.L. Johnsson, A data parallel finite element method for computational fluid dynamics on the Connection Machine system, Computer Methods in Applied Mechanics and Engineering, 99 (1992)

[9] M. Behr, A. Johnson, J. Kennedy, S. Mittal, and T.E. Tezduyar, Computation of incompressible flows with implicit finite element implementations on the Connection Machine, Computer Methods in Applied Mechanics and Engineering, 108 (1993)

[10] Z. Johan, K.K. Mathur, S.L. Johnsson, and T.J.R. Hughes, An efficient communications strategy for finite element methods on the Connection Machine CM-5 system, Computer Methods in Applied Mechanics and Engineering, 113 (1994)

[11] A. Pothen, H.D. Simon, and K.P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM Journal on Matrix Analysis and Applications, 11 (1990)

[12] H.D. Simon, Partitioning of unstructured problems for parallel processing, Computing Systems in Engineering, 2 (1991)

[13] C.H. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele Jr., and M.E. Zosel, The High Performance Fortran Handbook, MIT Press, Cambridge, MA, 1994, ISBN

[14] Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142, CM Fortran Reference Manual, Versions 1.0 and 1.1, 1991.

[15] T.E. Tezduyar, M. Behr, and J. Liou, A new strategy for finite element computations involving moving boundaries and interfaces - the deforming-spatial-domain/space-time procedure: I. The concept and the preliminary tests, Computer Methods in Applied Mechanics and Engineering, 94 (1992)

[16] T.E. Tezduyar, M. Behr, S. Mittal, and J. Liou, A new strategy for finite element computations involving moving boundaries and interfaces - the deforming-spatial-domain/space-time procedure: II. Computation of free-surface flows, two-liquid flows, and flows with drifting cylinders, Computer Methods in Applied Mechanics and Engineering, 94 (1992)

[17] S.L. Johnsson and K.K. Mathur, Experience with the conjugate gradient method for stress analysis on a data parallel supercomputer, International Journal for Numerical Methods in Engineering, 27 (1989)

[18] Z. Johan, Data Parallel Finite Element Techniques for Large-Scale Computational Fluid Dynamics, Ph.D. thesis, Department of Mechanical Engineering, Stanford University,

[19] K.K. Mathur and S.L. Johnsson, Communication primitives for unstructured finite element simulations on data parallel architectures, Computing Systems in Engineering, 3 (1992)

[20] M.S. Shephard and M.K. Georges, Automatic three-dimensional mesh generation by the finite octree technique, International Journal for Numerical Methods in Engineering, 32 (1991)

[21] T.E. Tezduyar, S. Mittal, S.E. Ray, and R. Shih, Incompressible flow computations with stabilized bilinear and linear equal-order-interpolation velocity-pressure elements, Computer Methods in Applied Mechanics and Engineering, 95 (1992)

[22] C. Kato and M. Ikegawa, Large eddy simulation of unsteady turbulent wake of a circular cylinder using the finite element method, in I. Celik, T. Kobayashi, K.N. Ghia, and J. Kurokawa, editors, Advances in Numerical Simulation of Turbulent Flows, FED-Vol. 117, ASME, New York, (1991)
