Massively Parallel Computing: Unstructured Finite Element Simulations


Citation: Mathur, Kapil K., Zdenek Johan, S. Lennart Johnsson, and Thomas J.R. Hughes. Massively Parallel Computing: Unstructured Finite Element Simulations. Harvard Computer Science Group Technical Report TR. Made available through Harvard University's DASH repository under the terms and conditions applicable to Other Posted Material.

Technical Report TR, March 1993. Parallel Computing Research Group, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts.

To appear in Proceedings of NAFEM 4th International Conference on Quality Assurance and Standards in Finite Element and Associated Technologies, May 26-28, Brighton, England.

Kapil K. Mathur, Zdenek Johan and S. Lennart Johnsson¹
Thinking Machines Corporation, 245 First Street, Cambridge, MA

Thomas J.R. Hughes
Division of Applied Mechanics, Stanford University, Durand Building, Stanford, CA

Abstract

Massively parallel computing holds the promise of extreme performance. Critical for achieving high performance is the ability to exploit locality of reference and effective management of the communication resources. This article describes two communication primitives and associated mapping strategies that have been used for several different unstructured, three-dimensional, finite element applications in computational fluid dynamics and structural mechanics.

1 Introduction

Most modern high-performance computing systems have memory distributed among the processors. These processors, along with their memory and communication hardware, are often referred to as processing nodes. In turn, processing nodes are interconnected by a network such as a mesh, a binary cube or a fat-tree. These computing systems hold a promise for extreme performance. However, careful attention to the data allocation, data motion in the distributed data structures, memory hierarchies and load balancing is required to achieve such extreme performance. Fundamental changes to classical algorithms are therefore necessary.

With the advent of such computing systems, it is now possible to simulate significantly more complex problems found in science and engineering. The increased complexity may be a result of advancing from coarse, structured, two-dimensional geometries to fine, unstructured, three-dimensional geometries or may be due to the more detailed modeling techniques being used. This paper focuses attention on the finite element method.
The inherent parallelism in the finite element method is studied in the context of unstructured, three-dimensional simulations that typically arise in structural mechanics and computational fluid dynamics. As stated before, an efficient implementation of finite element techniques on distributed-memory high-performance computing systems must address issues related to load balance and to the efficient use of the network interconnecting the processing nodes. The class of high-performance computing systems studied in this paper is programmed with a shared address space in single-program multiple-data mode. Programming languages that are based on a shared address space include High Performance Fortran (HPF) [11], Connection Machine Fortran (CMF) [4], Fortran-90 [21], Fortran-D [10], and Fortran-Y [3]. In this paper, the Connection Machine systems CM-200 and CM-5 are used as the model architectures and Connection Machine Fortran is used as the model programming language.

¹ Also affiliated with the Division of Applied Sciences, Harvard University, Cambridge, MA.

Several researchers have studied finite element implementations on these model architectures for a variety of different applications. Johnsson and Mathur [15 and 16] discuss three-dimensional linear elastic structural applications. Belytschko et al. [2] investigate explicit crash simulations. Farhat et al. [5 and 6] and Shapiro [25] have studied explicit algorithms for computational fluid dynamics. Johan et al. [12] report a fully implicit matrix-free implementation for solving the compressible Euler and Navier-Stokes equations. Mathur et al. [19 and 20] discuss explicit and fully dynamic finite element simulations of ductile fracture in metals. Beaudoin et al. [1] describe implicit metal forming simulations with explicit use of polycrystalline plasticity.

The outline of this article is as follows. The next section describes a set of communication primitives that are common to all finite element simulations. These primitives have been identified by studying several applications that span the entire range of solution techniques from explicit to fully implicit solvers. Then, the issue of load balance with respect to the communication system is reported for unstructured finite element meshes. Both the mapping of the finite elements and the influence of node numbering on the communication costs are reported.
Two three-dimensional unstructured meshes, one that has been used for simulating airflow around a complete airplane [12] and another that typically arises in crash simulations of automobiles, are used to demonstrate the importance of locality of reference and optimal selection of paths for the data motion.

2 Communication Primitives

All unstructured finite element simulations are best based on a formulation that views the entire data set in one of the following two representations.

The first representation is called the element-by-element approach. In this data representation, the entire data set is partitioned into two groups: a group of unassembled finite elements and a group of assembled nodal points (or sometimes of assembled nodal degrees of freedom). Here the unassembled finite elements and the assembled nodal points are mapped onto the virtual processing nodes of the architecture. All computations at the element level are performed in the first group. Computations that must be performed on the assembled nodes are performed in the second group. Any interaction between data stored in the two different groups involves data motion between virtual processing nodes.

The second data representation is called the assembled stiffness matrix approach. Here the entire data set is divided into three groups. In addition to the unassembled finite elements and assembled nodal points, the third group represents the assembled global stiffness matrix. As before, computations at the element level are performed in the group representing unassembled finite elements. After the element matrices have been evaluated, a global stiffness matrix is assembled. This involves data motion between the group representing unassembled finite elements and the group representing the assembled stiffness matrix. The assembled stiffness matrix data representation is particularly useful for certain implicit calculations which involve the solution of sparse linear systems by direct methods.

In a previous article, Mathur and Johnsson [18] identify a set of communication primitives for unstructured finite element simulations on high-performance computing architectures that are programmed with a shared address space. The model architecture used for that study was a Connection Machine system CM-200. That article reported on applications based on element-by-element algorithms. Four communication primitives were identified: global gather, global scatter, all-to-all broadcast, and all-to-all reduce. The first two primitives are described here very briefly. The reader is referred to the above article for a detailed description.

The gather operation is a many-from-one mapping between the destination and source arrays. Every destination array element fetches its data value based on a pointer array which is of the same shape as the destination array. In the context of finite element simulations, one example of the gather operation is the accumulation of the assembled nodal data values to local element vectors. Since many finite elements share the same nodal point, this is indeed a many-from-one mapping.

The scatter operation is the reverse of the gather operation. Here, many source array elements combine their data values based on a pointer array which is of the same shape as the source array. This is a many-to-one mapping. Data collisions may occur at the destination, as several source array elements may be associated with the same destination array element. In this case, the colliding data values must be added to achieve the effect of the assembly operation.
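The two primitives can be sketched in a few lines of Python with NumPy. This is only an illustration of the pointer-array semantics on a single address space, not the Connection Machine implementation: the function names `gather` and `scatter_add` are hypothetical, and the three-element, seven-node connectivity is an assumption chosen to match the labels in Figure 2 (A = 1, 5, 6, 3; B = 3, 6, 7, 4; C = 1, 3, 4, 2), with one degree of freedom per nodal point.

```python
import numpy as np

# Pointer array: one row per element, one column per local node.
# Node ids are 0-based, so nodal point 1 in the figure is index 0.
# (Assumed connectivity, read off the labels in Figure 2.)
conn = np.array([[0, 4, 5, 2],   # element A
                 [2, 5, 6, 3],   # element B
                 [0, 2, 3, 1]])  # element C


def gather(nodal, conn):
    """Many-from-one: each element slot reads the nodal value it points to."""
    return nodal[conn]


def scatter_add(elem, conn, n_nodes):
    """Many-to-one: element values are summed into the nodes they point to.

    np.add.at is unbuffered, so colliding contributions are all added.
    A plain `out[conn] += elem` would keep only one contribution per node.
    """
    out = np.zeros(n_nodes, dtype=elem.dtype)
    np.add.at(out, conn, elem)
    return out


nodal_values = np.arange(1.0, 8.0)          # value k stored at nodal point k
element_values = gather(nodal_values, conn)  # shape (3, 4), same as conn
assembled = scatter_add(np.ones_like(element_values), conn, 7)
```

Scattering a field of ones, as in the last line, returns the number of elements attached to each nodal point, which is a convenient check that collisions are summed rather than overwritten.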
The assembled stiffness matrix data representation requires no additional communication primitives. The data interaction between the group of unassembled finite elements and the assembled stiffness matrix is a scatter operation. The data interaction between the group representing the assembled stiffness matrix and the group representing the assembled nodal points can be either a gather or a scatter operation. In particular, when iterative methods are used to solve the sparse linear system, a sparse matrix-vector multiply requiring both gather and scatter operations forms the computational kernel.

Figure 1 shows a simple finite element mesh with three elements labeled A, B, and C, and seven nodal points labeled 1 through 7. This mesh is used to outline the algorithms used to formulate the gather/scatter operations (Figure 2). For simplicity, it is assumed that there is only one degree of freedom per nodal point. During the gather operation, the unassembled finite elements accumulate the nodal values. In a preprocessing phase, the group of unassembled finite elements is associated with the group labeled "Nodal - II" in Figure 2 through a one-to-one mapping. Since this one-to-one mapping is solely a function of the mesh connectivity, the preprocessing time can be amortized over several calls to the gather operation. The actual gather operation is performed by first making local copies of the data values that are requested by more than one unassembled finite element ("Nodal - I" to "Nodal - II") and then by performing the one-to-one data motion step. The scatter operation is done in the reverse order. First, the one-to-one data motion step is performed. Then, the local data values are added together by a reduce operation.

Figure 1: The two groups of the processing nodes used to map unstructured discretizations for element-by-element algorithms (group of unassembled finite elements; network interconnecting the processing nodes; group of assembled nodal points). The arrows represent the direction of data motion between the two groups for the gather and scatter operations. For the simple mesh shown, the group of unassembled finite elements is mapped onto the processing nodes as a linear array three long (representing the three unassembled elements labeled A, B, and C). Similarly, the group of nodal points is mapped onto the processing nodes as a linear array seven long (representing the seven assembled nodal points labeled 1 through 7).

3 Mesh Partitioning and Node Numbering

To make efficient use of the network interconnecting the processing nodes, the finite elements of the unstructured mesh have to be mapped onto the processing nodes of the architecture so that locality of reference is maximized and the number of routing conflicts is kept at a minimum. One useful mapping technique that has been studied extensively is the recursive spectral bisection algorithm proposed by Pothen et al. [22]. This algorithm has been used successfully by Simon [24] for mesh decomposition. Johan [13] and Johan and Hughes [14] report an efficient data-parallel implementation of the recursive spectral bisection algorithm.
It is important that the implementation of the partitioning algorithm be as efficient as possible, because the mapping of the finite elements, for an optimal selection of paths for moving data, requires knowledge of the configuration of the computing platform, which may only be known at runtime. Moreover, adaptive simulations may require a new mapping whenever the mesh is refined.

Figure 2: Data structures used in the gather/scatter operations (finite element mesh with elements A, B, C and nodal points 1 through 7; unassembled finite elements 1a 5a 6a 3a, 3b 6b 7b 4b, 1c 3c 4c 2c; Nodal points - II; Nodal points - I). For the simple mesh shown, the gather/scatter primitives generate an internal one-to-one mapping between the group of unassembled finite elements and the group of assembled nodal points. The unassembled nodal values (for example 3a, 3b, and 3c) are queued in the local memory of the processing node representing the assembled nodal point (3 in the example). A local copy or reduce operation completes the gather and scatter operations, respectively.

Briefly, the spectral partitioning algorithm is based on the eigenpair with the smallest non-zero eigenvalue of the Laplacian matrix associated with the dual mesh connectivity. The Laplacian matrix is constructed such that the smallest eigenvalue is zero and its corresponding eigenvector consists of all ones. (Note that the Laplacian matrix is defined by some authors to be negative semi-definite, in which case the partitioning is based on the second largest eigenvalue.) The eigenvector associated with the smallest non-zero eigenvalue is frequently called the Fiedler vector [7, 8, and 9] and can be used to decompose the finite element mesh.

The dual mesh connectivity is an alternate method of representing a finite element mesh. For each finite element, it is simply the list of elements that share a face with it. This is in contrast with the popularly used nodal connectivity representation, which is a list of the nodal points making up a finite element.

The partitioning algorithm provides an efficient method for mapping the unassembled finite elements. The contention for the communication links in the network can be reduced further by an appropriate mapping of the assembled nodal points of the finite element mesh. Two different nodal renumbering schemes have been studied. The first technique is based on the results of the mapping algorithm used for the unassembled finite elements. After the finite elements have been mapped onto the processing nodes, the nodal connectivity of the finite elements on each processing node is examined to map the nodal points (or the nodal degrees of freedom) onto the processing nodes so as to further improve the locality of the gather/scatter operations. This node renumbering algorithm works very well when the computational domain is discretized by only one type of finite element.
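A single bisection step of the spectral partitioning described earlier in this section can be sketched as follows. This is a minimal serial illustration, assuming a dense eigen-solver and a split at the median of the Fiedler vector; it is not the data-parallel implementation reported in [13] and [14]. The function names, the `face_nodes` parameter and the six-element strip of quadrilaterals are all hypothetical choices made for the example.

```python
import numpy as np


def dual_adjacency(conn, face_nodes):
    """Dual mesh connectivity as a 0/1 matrix.

    Two elements are neighbors when they share at least `face_nodes`
    nodal points (2 for an edge in 2D, 3 for a triangular face in 3D).
    The O(n^2) double loop is for clarity only.
    """
    n = len(conn)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if len(set(conn[i]) & set(conn[j])) >= face_nodes:
                adj[i, j] = adj[j, i] = 1
    return adj


def spectral_bisect(adj):
    """Boolean mask splitting the elements into two halves.

    Uses the positive semi-definite Laplacian L = D - A, whose smallest
    eigenvalue is zero with a constant eigenvector. For a connected dual
    graph the second eigenvector returned by eigh (eigenvalues ascending)
    is the Fiedler vector.
    """
    laplacian = np.diag(adj.sum(axis=1)) - adj
    _, vectors = np.linalg.eigh(laplacian)
    fiedler = vectors[:, 1]
    return fiedler >= np.median(fiedler)


# Assumed test mesh: a strip of six quadrilaterals on a 7 x 2 grid of nodes.
strip = [(e, e + 1, e + 8, e + 7) for e in range(6)]
mask = spectral_bisect(dual_adjacency(strip, face_nodes=2))
```

For the strip, the mask separates elements 0 to 2 from elements 3 to 5; which half is marked `True` depends on the arbitrary sign of the eigenvector. Applying the same step recursively to each half gives a recursive bisection.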
When the computational domain is discretized by more than one element type, each element type requires a different mapping. In this case the nodal points are mapped randomly onto the processing nodes. The random mapping is quite effective in minimizing the contention for the communication channels during the gather/scatter operations [23, 26, and 17].

4 Applications

Two three-dimensional finite element meshes are used to illustrate the use of the communication primitives discussed above. The first mesh is that of a generic Falcon Jet (Figure 3). It has been used in CFD calculations to simulate the inviscid flow over an airplane [12]. The second mesh is that of a complete automobile and represents a typical mesh used in crash simulations (Figure 4).

Figure 3: Generic Falcon Jet. The complete mesh has 109,914 tetrahedral elements and 97,085 nodal degrees of freedom.

Figure 4: Finite element mesh used in the crash analysis of an automobile. The complete mesh has 33,590 quadrilateral shell elements, 14,678 triangular shell elements and 270,522 degrees of freedom.

The two meshes shown in Figures 3 and 4 were used to measure the effective communication bandwidth for the primitives described above. All bandwidth data reported here were measured on a 32-processing-node CM-5 equipped with 128 vector units.

The CFD simulation [12] uses the global gather-scatter primitives only. After accumulating data in the group representing unassembled finite elements, all computations are done locally. The result of the local computations is then scattered back to the group representing assembled nodal points. The finite element mesh is made up of 109,914 tetrahedral elements and 97,085 degrees of freedom. A one-point quadrature rule was used in the elements. The mapping phase, consisting of the recursive spectral bisection algorithm and the nodal reordering scheme, took 66 seconds. In this example, the nodal reordering algorithm renumbers the nodal points based on the outcome of the partitioning algorithm. Since the mesh connectivity does not change during the course of the simulation, this mapping is done once for the entire simulation.

For this mesh, the effective data motion rates, normalized to one processing node, are 14 Mbytes/s and 9.4 Mbytes/s for the gather and scatter operations, respectively. The normalized data motion rate can also be separated into two components: the local (or on-processing-node) gather/scatter rate and the off-processing-node gather/scatter rate. For this finite element mesh, the normalized local data motion rates for the gather and scatter operations were 94 Mbytes/s and 15 Mbytes/s, respectively. The off-processing-node data motion rates were 1.5 Mbytes/s for the gather operation and 2.0 Mbytes/s for the scatter operation. It should be noted that the scatter data motion rate includes the time required to perform the addition operation. On the 32-processing-node CM-5, the overall data motion bandwidth is 0.45 Gbytes/s for the gather operation and 0.30 Gbytes/s for the scatter operation. Approximately 27% of the total time is spent on data motion (9% for the gather operation and 18% for the scatter operation).

Most explicit dynamic simulations have a structure similar to the one used by the implicit CFD simulation reported above. The time-step loop of such algorithms also involves a gather-compute-scatter cycle. The typical structure of dynamic explicit element-by-element algorithms using the global gather and scatter primitives is reported in detail in Mathur et al. [19 and 20].

The automobile mesh has 33,590 quadrilateral shells, 14,678 triangular shells and 45,087 nodal points.
The spectral partitioning algorithm was unable to produce a mapping that was any better than randomly mapping the nodal points and finite elements. A detailed study of the finite element mesh reveals that there is more than one finite element connected to a face of a neighboring finite element. For this mesh, the maximum number of finite elements connected to a face of a neighboring finite element is nineteen. This property of the finite element mesh seems unique to shell elements and requires special care before the spectral partitioning algorithm can be used. This aspect is under investigation and will be reported elsewhere.

The data motion bandwidth for the automobile mesh was measured assuming that there are six degrees of freedom per nodal point (the mesh has a total of 270,522 degrees of freedom). For the stochastic mapping scheme, the data motion rate normalized to one processing node is 2.1 Mbytes/s for the gather operation and 1.9 Mbytes/s for the scatter operation. The local data motion rates normalized to a processing node are 157 Mbytes/s and 22 Mbytes/s for the gather and scatter operations, respectively. The corresponding off-processing-node data motion rates normalized to one processing node are 2.0 Mbytes/s for both the gather and scatter operations. From these data motion rates, it is clear that the stochastic mapping technique significantly reduces conflicts for the communication channels in the network interconnecting the processing nodes (the off-processing-node bandwidth is at least as high as that measured for the generic Falcon Jet mesh). However, locality is not maximized. Consequently, the overall data motion rates are close to the off-processing-node rates.
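The figures quoted in this section can be cross-checked with a little arithmetic, sketched below in Python. The aggregate bandwidth is the per-node rate times the 32 processing nodes (decimal prefixes assumed), and the automobile degree-of-freedom count is six times the number of nodal points. The `local_fraction` helper goes further and rests on an assumption that is not stated in the text: that the overall rate is a time-weighted combination of the local and off-processing-node rates, 1/R = f/R_local + (1 - f)/R_off, where f is the fraction of bytes moved locally. All names here are hypothetical.

```python
PROCESSING_NODES = 32  # CM-5 configuration quoted in the text


def aggregate_gbytes(per_node_mbytes):
    """Machine-wide rate in Gbytes/s from a per-node rate in Mbytes/s."""
    return per_node_mbytes * PROCESSING_NODES / 1000.0


def local_fraction(overall, local, off_node):
    """Fraction of bytes moved on-node under the ASSUMED additive-time model.

    Solves 1/overall = f/local + (1 - f)/off_node for f. Because the
    published rates are rounded, the result is only indicative.
    """
    return (1.0 / off_node - 1.0 / overall) / (1.0 / off_node - 1.0 / local)


# Falcon Jet mesh: per-node rates of 14 and 9.4 Mbytes/s.
falcon_gather = aggregate_gbytes(14.0)    # about 0.45 Gbytes/s
falcon_scatter = aggregate_gbytes(9.4)    # about 0.30 Gbytes/s

# Automobile mesh: 45,087 nodal points with six degrees of freedom each.
automobile_dofs = 45_087 * 6

# Indicative locality of the Falcon Jet mapping under the assumed model.
falcon_gather_local = local_fraction(14.0, 94.0, 1.5)
falcon_scatter_local = local_fraction(9.4, 15.0, 2.0)
```

Under the assumed model, both Falcon Jet operations correspond to roughly nine tenths of the data being moved within a processing node, which would be consistent with the locality achieved by the spectral mapping. For the automobile mesh the overall and off-processing-node rates are nearly equal, so the same estimate is dominated by rounding and is not attempted here.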

5 Conclusions

By using appropriate mapping strategies, it is possible to achieve very high data motion bandwidths for unstructured meshes. This article describes two different mapping ideas that improve the locality of reference and minimize contention for the communication channels. The first mapping algorithm is based on the spectral properties of the Laplacian matrix associated with the dual connectivity of a finite element mesh. Locality is further improved by using a nodal renumbering scheme that maps the nodal points based on the finite element mapping. The second algorithm uses a stochastic mapping strategy by randomly assigning the nodal points of the mesh to the processing nodes. These two strategies have yielded efficient implementations of the communication primitives that work well for a variety of remarkably different finite element meshes.

It should be noted that the two communication primitives described here are completely general. They are not specific to finite element simulations. Moreover, the mapping strategies and the gather and scatter algorithms are valid for any distributed-memory computing architecture.

Acknowledgements

The mesh for the generic Falcon Jet was provided by Dassault Aviation, France. The mesh for the automobile was provided by Centric Engineering Systems, Inc., Palo Alto.

References

1. BEAUDOIN A. J., MATHUR K. K., DAWSON P. R. AND JOHNSON G. C. - Three-dimensional deformation process simulation with explicit use of polycrystalline plasticity models, Int. J. Plas., in press.
2. BELYTSCHKO T., PLASKACZ E. J., KENNEDY J. M. AND GREENWELL D. L. - Finite element analysis on the Connection Machine, Comp. Meth. Appl. Mech. and Engr., Vol. 81, 229-254.
3. CHEN M. AND WU J. J. - Optimizing Fortran-90 programs for data motion on massively parallel systems, Yale U., Tech. Rep.
4. CM Fortran reference manual, versions 2.0, Thinking Machines Corporation.
5. FARHAT C., SOBH N. AND PARK K. C. - Transient finite element computations on 65,536 processors: The Connection Machine, Int. J. Num. Meth. Engr., Vol. 30, 27-55.
6. FARHAT C., FEZOUI L. AND LANTERI S. - Two-dimensional viscous flow computations on the Connection Machine: Unstructured meshes, upwind schemes, and massively parallel computations, Comp. Meth. Appl. Mech. Engr., Vol. 102, 61-88.
7. FIEDLER M. - Algebraic connectivity of graphs, Czech. Math. J., 23, 298-305.
8. FIEDLER M. - Eigenvectors of acyclic matrices, Czech. Math. J., 25, 607-618.
9. FIEDLER M. - A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, Czech. Math. J., 25, 619-633.
10. FOX G., HIRANANDANI S., KENNEDY K., KOELBEL C., KREMER U., TSENG C. AND WU M. - Fortran D Language Specification, Rice U., TR90-141.
11. High Performance Fortran Language Specification, version 0.4, Dept. Comp. Sci., Rice Univ.
12. JOHAN Z., HUGHES T. J. R., MATHUR K. K. AND JOHNSSON S. L. - A data parallel finite element method for computational fluid dynamics on the Connection Machine system, Comp. Meth. Appl. Mech. and Engr., 99, No. 1, 113-134.
13. JOHAN Z. - Data parallel finite element techniques for large-scale computational fluid dynamics, Ph.D. Thesis, Stanford University.
14. JOHAN Z. AND HUGHES T. J. R. - An efficient implementation of the spectral partitioning algorithm on Connection Machine systems, Int. Conf. Comp. Sci. Cont., INRIA.
15. JOHNSSON S. L. AND MATHUR K. K. - Experience with the conjugate gradient method for stress analysis on a data parallel computer, Int. J. Num. Meth. Engr., Vol. 27, 523-546.
16. JOHNSSON S. L. AND MATHUR K. K. - Data structures and algorithms for the finite element method on a data parallel supercomputer, Int. J. Numer. Meth. Engr., Vol. 29, 881-908.
17. MATHUR K. K. - On the use of randomized address maps in unstructured three-dimensional finite element simulations, Tech. Rep. Thinking Machines Corporation 37/CS90-4.
18. MATHUR K. K. AND JOHNSSON S. L. - Communication primitives for unstructured finite element simulations on data parallel architectures, Comp. Syst. Engr., 3, No. 1-4, 63-72.
19. MATHUR K. K., NEEDLEMAN A. AND TVERGAARD V. - Dynamic 3D analysis of the Charpy V-notch, Model. Sim. Mater. Sci. Engr., in press.
20. MATHUR K. K., NEEDLEMAN A. AND TVERGAARD V. - Ductile failure analyses on massively parallel computers, in preparation.
21. METCALF M. AND REID J. - Fortran 90 explained, Oxford Univ. Press.
22. POTHEN A., SIMON H. D. AND LIOU K.-P. - Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Mat. Anal. Appl., 11, 430-452.
23. RANADE A. G. - How to emulate shared memory, Proc. 28th Symp. Found. Comp. Sci., IEEE, 185-194.
24. SIMON H. D. - Partitioning of unstructured problems for parallel processing, Comp. Sys. Engr., 2, 135-148.
25. SHAPIRO R. A. - Implementation of an Euler/Navier-Stokes finite element algorithm on the Connection Machine, Proc. AIAA 29th Aero. Sci., AIAA-91-0433.
26. VALIANT L. - A scheme for fast parallel communication, SIAM J. Comp., Vol. 11, 350-361.
