Optimizing TELEMAC-2D for Large-scale Flood Simulations

Available on-line at www.prace-ri.eu

Partnership for Advanced Computing in Europe

Charles Moulinec a,*, Yoann Audouin a, Andrew Sunderland a

a STFC Daresbury Laboratory, UK
* Corresponding author. E-mail: charles.moulinec@stfc.ac.uk

Abstract

This report details optimization undertaken on the Computational Fluid Dynamics (CFD) software suite TELEMAC, a modelling system for free-surface waters with over 200 installations worldwide. The main focus of the work has been the elimination of memory bottlenecks occurring at the pre-processing stage, which have historically limited the size of the simulations that can be processed. This has been achieved by localizing global arrays in the pre-processing tool, known as PARTEL. Parallelism in the partitioning stage has also been improved by replacing the serial partitioning tool with a new parallel implementation. These optimizations have enabled massively parallel runs of TELEMAC-2D, a Shallow Water Equations based code, involving over 200 million elements to be undertaken on Tier-0 systems. Such runs simulate extreme flooding events on very fine meshes (locally less than one metre). Simulations at this scale are crucial for predicting and understanding flooding events occurring, e.g., in the region of the Rhine river.

Project ID: PRA4IC

1. Introduction

An increasing number of the world's population inhabit areas, such as river basins, that are at significant risk of serious flooding. It is therefore essential that tools are developed to assess the impact of flooding on wetted regions, and ultimately to better warn people of serious events. Numerical tools are of vital importance in aiding a better understanding of flooding impact. TELEMAC [1, 2] enables, among other applications, the simulation of river systems, and can model free-surface flows, including flooding, wetting and drying. The system is highly portable and has been under development for over 20 years by EDF R&D. The whole system will go Open-Source in August 2011, with TELEMAC-2D (GPL licensing), the BIEF (Bibliothèque d'Éléments Finis) library (LGPL licensing) and the pre-processing libraries already available to any user. TELEMAC-2D is based on the depth-integrated Shallow Water (hydrostatic) Equations, which are valid when the horizontal length scale of the flow is much greater than the vertical scale.

A research project between the Bundesanstalt für Wasserbau (BAW, Karlsruhe, Germany) [3] and the Science and Technology Facilities Council (STFC, Daresbury, UK) [4] has recently been agreed to investigate flooding of the Rhine river from Bonn to the North Sea. The originality of this work resides in the fact that the flooding of this long section of river (about 250 km) will be treated in a single simulation by a 2D approach (TELEMAC-2D), with a fine resolution of less than a metre in some parts of the mesh. It is expected that simulations involving grids of finer resolution will produce more accurate results, thus enabling a better understanding of the likelihood and extent of flooding effects at any place and time. This geometry has been meshed with 5M elements. Some results already exist for portions of the Rhine river between Bonn and the North Sea, which have been studied by BAW, but the complete mesh has not yet been run. These intermediate data will be used for comparison. Two larger meshes have been identified to investigate the quality of the results and their sensitivity to the grid size.
The first mesh (20M elements) will be built by applying one level of refinement to the 5M element mesh, and the second by refining it twice (80M elements). Tier-1 systems can be used for the smaller cases, but simulations on Tier-0 systems are required to run calculations involving meshes of 80M elements and beyond. Introducing the capability to prepare and compute these larger problem sizes with TELEMAC-2D is the focus of this project.

TELEMAC-2D has already been run successfully on the Argonne BlueGene/P (BG/P) Intrepid [5] (a machine with the same architecture as Jugene [6]) on up to 16,384 cores for a straight-channel demonstration case of 25M elements. Here, performance was shown to scale very well up to 4,096 cores (VN mode). To generate the input data for these cases, a parallel prototype of the pre-processor was designed, which performs the pre-processing for the 25M case much more quickly, but whose memory overheads, due to replicated data structures, limit the problem sizes that can be addressed.

Another limitation in using this prototype for pre-processing runs is that the number of parallel tasks must match the number used in the subsequent target calculation with TELEMAC-2D (the 25M case required 16K cores of BG/P, but in SMP mode, to pre-process, whereas TELEMAC-2D is run in VN mode). The pre-processing of any grid with more than 25M elements for subsequent Tier-0 simulations therefore requires a new memory-optimized parallel version based on the original serial pre-processor, PARTEL. The number of parallel tasks used by this new tool, called PARTEL_P, is independent of the number of sub-domains used by TELEMAC-2D.

This paper is arranged as follows: Section 2 describes the computational approach used by TELEMAC-2D, with a special focus on the formats of the I/O; Sections 3 and 4 explain the strategy used to redesign the pre-processor; Section 5 presents the timings obtained with the new pre-processor; Section 6 measures TELEMAC-2D performance for a 200-million element grid test case.

2. TELEMAC-2D Computational Approach

The TELEMAC system is a multi-scale free-surface hydrodynamics suite able to solve the Shallow Water Equations (TELEMAC-2D) and the Navier-Stokes Equations (TELEMAC-3D), depending on the topology of the configuration and the approximation used in the calculation of the vertical velocity. The system relies on the BIEF Finite Element library, which contains basic operations, a few linear solvers, and some of the discretisation schemes used in the hydrodynamics solvers. As the scientific project aims at solving the Shallow Water Equations, the following description is restricted to the computational properties of TELEMAC-2D. The steps to perform a simulation with the TELEMAC system proceed as follows:

- Generation of the grid (triangular elements) with a mesh generator, taking into account the bathymetry. This step is performed in serial with the currently existing tools. It is also possible to globally refine an existing mesh to increase resolution; this too is performed in serial, with a tool that has recently been optimised.
- Pre-processing, including mesh partitioning by METIS 5.0pre2 (serial version) [7] and calculation of the connectivities, boundary conditions, halo cells, and the pre-processing required by the method of characteristics for advection (if used). The mesh partitioning and all other pre-processing tasks are performed by the same tool, i.e. PARTEL. Serial mesh partitioning is limited by memory availability, whereas the remaining pre-processing tasks are limited by time constraints. Two versions of PARTEL exist: a fully serial one, and a partially parallel prototype which runs the partitioning serially but performs the rest of the pre-processing in parallel, on the same number of processors as the number of sub-domains. This prototype uses global arrays, because it was designed to speed up the pre-processing; to date, no optimization in terms of memory had been undertaken.
- Solution of the shallow water equations (recalled below) using TELEMAC-2D. The equations may be solved coupled or with the help of a wave equation, depending on the option chosen. The space discretisation is, in general, linear. Several advection schemes are available and used depending on the flow, namely the method of characteristics, the Streamline-Upwind Petrov-Galerkin (SUPG) scheme, and Residual Distributive Schemes (N-scheme and PSI-scheme). Matrix storage is edge-based. Several linear solvers are available in the BIEF library, e.g. Conjugate Gradient, Conjugate Residual, CGSTAB and GMRES.
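For reference, the depth-averaged equations referred to in the last step above can be written in the non-conservative form commonly used for TELEMAC-2D (see [2]); the exact treatment of the diffusion and source terms depends on the options selected:

\[
\frac{\partial h}{\partial t} + \mathbf{u}\cdot\nabla h + h\,\nabla\cdot\mathbf{u} = S_h,
\]
\[
\frac{\partial u}{\partial t} + \mathbf{u}\cdot\nabla u = -g\,\frac{\partial Z}{\partial x} + S_x + \frac{1}{h}\,\nabla\cdot\!\left(h\,\nu_t\,\nabla u\right),
\]
\[
\frac{\partial v}{\partial t} + \mathbf{u}\cdot\nabla v = -g\,\frac{\partial Z}{\partial y} + S_y + \frac{1}{h}\,\nabla\cdot\!\left(h\,\nu_t\,\nabla v\right),
\]

where h is the water depth, Z the free-surface elevation, u = (u, v) the depth-averaged velocity, g the gravitational acceleration, \nu_t an effective diffusion coefficient, and S_h, S_x, S_y source terms (e.g. friction, wind, rain).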
TELEMAC-2D is fully parallelised with MPI. The input files consist of a parameter file (ASCII) read by all the processors, plus a geometry file (binary, SELAFIN format) and a boundary file (ASCII) per MPI task, each read by its own processor. Those files are generated by one of the PARTEL tools (either serial or parallel), which prepares the initial geometry (binary, SELAFIN format) and boundary (ASCII) files for each MPI task. Output files are handled in the same way, with a result file (binary, SELAFIN format) per processor, as well as another output file (ASCII) per processor showing the evolution of the simulation. The result file can also be used to restart a simulation.

3. Outline of PARTEL_P

To overcome the 2-10 million element grid limit of the serial pre-processor, PARTEL, a parallel version called PARTEL_P has been developed within the PRACE-1IP project; it runs on NPROCS cores and partitions grids into NSUBS sub-domains. It should be noted that the current version of PARTEL_P supports neither parallel I/O nor the method of characteristics at the pre-processing stage.

3.1. Description of files output by PARTEL

Two files per sub-domain are output by PARTEL: a geometry file in SELAFIN format and a boundary file in ASCII format following the TELEMAC-2D standard. The geometry file contains a header, the numbers of elements, nodes, physical boundaries and interfaces for a given sub-domain, the local connectivities of the nodes, the local-to-global node table and, finally, the coordinates and/or other quantities known at each node. The boundary file contains the information for the physical boundaries, the number of interfaces with other sub-domains, and the information required to handle the interfaces. Each physical boundary requires knowledge of the neighbouring nodes located in a different sub-domain. The treatment of the interfaces is more complex, as the number of contiguous sub-domains has to be known, as well as their partition index. The interfaces also need to be sorted into ascending order to comply with the TELEMAC-2D standard.
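As a purely illustrative sketch, the per-sub-domain information listed above can be pictured as the following Fortran derived type. The field names are invented for this sketch, are not identifiers from the TELEMAC or PARTEL sources, and the real SELAFIN records are laid out differently.

module subdomain_sketch
  implicit none
  ! Hypothetical container mirroring what is written per sub-domain.
  type :: subdomain_files
     ! --- geometry file (binary, SELAFIN format) ---
     integer :: nelem                              ! number of elements in the sub-domain
     integer :: npoin                              ! number of local nodes
     integer :: nbnd                               ! number of physical boundary nodes
     integer :: nitf                               ! number of interface nodes shared with other sub-domains
     integer,          allocatable :: connec(:,:)  ! local connectivity (3 nodes per triangle)
     integer,          allocatable :: loc2glob(:)  ! local-to-global node table
     double precision, allocatable :: x(:), y(:)   ! coordinates and/or other nodal quantities
     ! --- boundary file (ASCII, TELEMAC-2D standard) ---
     integer, allocatable :: bnd_code(:)           ! boundary-condition type per physical boundary node
     integer, allocatable :: itf_glob(:)           ! global index of each interface node
     integer, allocatable :: itf_ndom(:)           ! number of sub-domains sharing each interface node
  end type subdomain_files
end module subdomain_sketch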

4. Description of PARTEL_P

PARTEL_P is actually split into two parts: PARTEL_P1 is used to generate NPROCS files to be read by PARTEL_P2, so as to reduce memory consumption. These two programs will be merged in the future, since both PARTEL_P1 and PARTEL_P2 are run on the same number of cores.

PARTEL_P1 is used to distribute the information of NSUBS/NPROCS sub-domains over the NPROCS cores. The initial stage is mainly serial, as no attempt has yet been made to improve the I/O operations. The input parameters are read, i.e. the name of the geometry file, the name of the boundary file, NSUBS and the library used to partition the grid. Each processor reads the geometry and the boundary files, and calls the subroutines VOISIN_PARTEL, ELEBD_PARTEL and FRONT2_PARTEL. The partitioning is performed using either METIS or SCOTCH [8] on the master node (ParMETIS and PT-SCOTCH have yet to be tested), and its output is broadcast to the other cores. NSUBS/NPROCS sub-domains are gathered over the NPROCS cores in order to reduce the array sizes in PARTEL_P2. Two files per core are written: one for the geometry, and the other for the boundary conditions, with some additional information compared to a regular boundary condition file. PARTEL_P1 transmits to PARTEL_P2 the information about the adjacent neighbouring nodes located in a different sub-domain that resides on a different core. Interfaces are not dealt with at this stage.

PARTEL_P2 is run on NPROCS cores. It first reads the input parameters, i.e. the name of the original geometry file, the name of the original boundary file, NSUBS and NPROCS. Each core then reads the files output by PARTEL_P1, which contain information for NSUBS/NPROCS sub-domains. The number of elements, nodes, physical boundaries and interfaces per sub-domain is easily computed. This information, together with the knowledge of the sub-domain local connectivity and coordinates, helps build the NSUBS geometry files that are read by TELEMAC-2D. The local-to-global node table is also easily accessible.

The information relating to the interfaces of all NSUBS sub-domains then has to be computed. First, the neighbouring nodes located on the same core but in a different sub-domain have to be identified. Working on a given core, all the physical boundaries are gathered in an array containing their global index. This array is sorted in ascending order, and global indices that occur twice or more indicate that the corresponding nodes belong to several sub-domains. Their neighbours are easily identified, and the array is sorted back to its original structure to comply with the TELEMAC-2D standard. The interfaces are then treated globally. A loop over all the NSUBS/NPROCS sub-domains allows the code to gather the interfaces of all the NPROCS cores before using MPI_Allgatherv to obtain their global index, as well as the index of the sub-domain they belong to. This array is sorted by global index in ascending order. The number of consecutive occurrences, NINTERF, of a given global index indicates that the same interface belongs to NINTERF sub-domains, and these partition indices have to be saved. To comply with the TELEMAC-2D standard, the information per interface also has to be sorted. All this information is then distributed in two stages, first onto the NPROCS cores, using an MPI_Scatterv call, and then to the NSUBS sub-domains. The information relating to the physical boundaries and the interfaces is finally copied into the boundary files that are read by TELEMAC-2D.
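The gather-and-detect step just described can be illustrated with the following minimal Fortran/MPI sketch. It assumes, for simplicity, one sub-domain per MPI rank and uses toy data; the variable names and the insertion sort are illustrative and are not taken from PARTEL_P.

program interface_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i, j, nloc, ntot
  integer :: sendn(1)
  integer, allocatable :: loc_idx(:), counts(:), displs(:), all_idx(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Toy data: each rank (sub-domain) holds a few global node indices,
  ! chosen so that neighbouring ranks overlap.
  nloc = 3
  allocate(loc_idx(nloc))
  loc_idx = [rank + 1, rank + 2, rank + 3]

  ! Gather the per-rank list sizes, then the lists themselves, on every rank.
  allocate(counts(nprocs), displs(nprocs))
  sendn(1) = nloc
  call MPI_Allgather(sendn, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
  displs(1) = 0
  do i = 2, nprocs
     displs(i) = displs(i-1) + counts(i-1)
  end do
  ntot = sum(counts)
  allocate(all_idx(ntot))
  call MPI_Allgatherv(loc_idx, nloc, MPI_INTEGER, all_idx, counts, displs, &
                      MPI_INTEGER, MPI_COMM_WORLD, ierr)

  ! Sort the gathered global indices in ascending order.
  call sort_ascending(all_idx)

  ! A global index occurring more than once is an interface node shared
  ! by that many sub-domains (NINTERF in the text above).
  if (rank == 0) then
     i = 1
     do while (i <= ntot)
        j = i
        do while (j < ntot)
           if (all_idx(j+1) /= all_idx(i)) exit
           j = j + 1
        end do
        if (j > i) write(*,'(a,i0,a,i0,a)') 'global node ', all_idx(i), &
             ' shared by ', j - i + 1, ' sub-domains'
        i = j + 1
     end do
  end if

  call MPI_Finalize(ierr)

contains

  subroutine sort_ascending(a)   ! simple insertion sort, sufficient for a sketch
    integer, intent(inout) :: a(:)
    integer :: k, m, tmp
    do k = 2, size(a)
       tmp = a(k)
       m = k - 1
       do while (m >= 1)
          if (a(m) <= tmp) exit
          a(m+1) = a(m)
          m = m - 1
       end do
       a(m+1) = tmp
    end do
  end subroutine sort_ascending

end program interface_sketch

In PARTEL_P itself, the gathered entries also carry the index of the sub-domain they belong to, so that the partition indices of the NINTERF owners can be saved at this point before the information is scattered back.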
5. Timings for PARTEL_P1 and PARTEL_P2

PARTEL_P1 and PARTEL_P2 have been run to pre-process a 200-million element grid, and the output is used in the next section to test TELEMAC-2D. Tables 1 and 2 indicate the total time spent by PARTEL_P to pre-process the grid into 4096, 8192, 16384 and 32768 sub-domains, using METIS and SCOTCH respectively as the partitioner. Overall, partitioning with METIS allows a faster pre-processing. All PARTEL_P1 runs are faster when METIS rather than SCOTCH is used as the partitioner; however, PARTEL_P2 is normally faster when SCOTCH is used. A more thorough study should be able to confirm whether this is because the edge-cut is smaller with SCOTCH, which has a direct impact on the global communications used in PARTEL_P2.

Table 1. CPU time (s) (IBM POWER7) for PARTEL_P1 and PARTEL_P2 to pre-process the 200-million element grid, using METIS as the partitioner (columns: PARTEL_P1, PARTEL_P2, PARTEL_P).

Table 2. CPU time (s) (IBM POWER7) for PARTEL_P1 and PARTEL_P2 to pre-process the 200-million element grid, using SCOTCH as the partitioner (columns: PARTEL_P1, PARTEL_P2, PARTEL_P).

6. Scaling Performance of TELEMAC-2D

The 200-million element grid has been used to evaluate the performance of TELEMAC-2D on up to 32,768 cores of Argonne's IBM Blue Gene/P [5]. PARTEL_P was used to perform the pre-processing, with both METIS and SCOTCH being used as partitioners. The positive stream-wise implicit (PSI) advection scheme was selected, since PARTEL_P does not yet support the method of characteristics. The scaling performance of TELEMAC-2D was evaluated using simulations of 60 seconds of simulated time (1200 time steps). The CPU time is reported both as the time for the executable to complete (T_TOTAL) and as the time difference between the end and the beginning of the main program, homere_telemac2d.f (T_SOLVER).

Fig. 1. Scaling performance of TELEMAC-2D for the 200-million element grid on the IBM BG/P.

Figure 1 shows that T_SOLVER decreases linearly as a function of the number of cores, whether METIS or SCOTCH is used as the partitioner. Good performance is observed with about 6,100 elements per core. A 65,536 sub-domain simulation would help assess the performance of TELEMAC-2D with half this number (about 3,000 elements) assigned to each core. However, T_TOTAL shows a different behaviour, with no real speedup for the 32,768-core simulations. This non-scaling behaviour is probably due to the time spent opening and closing files, along with aspects of the way the system manages the simulations.
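Both timings above are wall-clock measurements; a solver-only figure such as T_SOLVER can be obtained with a pattern like the minimal sketch below, which wraps the main program body with MPI_Wtime. This is only an illustration of the measurement, not the actual instrumentation of homere_telemac2d.f.

program timing_sketch
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: t_begin, t_end

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  t_begin = MPI_Wtime()
  ! ... the time-stepping loop of the solver would run here ...
  t_end = MPI_Wtime()

  if (rank == 0) write(*,'(a,f12.3,a)') 'solver time: ', t_end - t_begin, ' s'

  call MPI_Finalize(ierr)
end program timing_sketch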

7. Summary of Results

The pre-processing stage of PARTEL has been re-written and optimized into PARTEL_P, which runs on NPROCS MPI tasks (to date typically up to three 256 GB RAM multi-core nodes of an IBM POWER7 cluster) and can deal with up to 100K NSUBS sub-domains. This optimized pre-processing stage now enables very large scale TELEMAC-2D simulations on the Jugene IBM BG/P. The pre-processing still takes place in two stages, with Fortran data files in between, and has been optimized as follows:

1. The first stage, run on NPROCS cores, reads the original mesh, partitions it into NSUBS sub-domains and writes two files per core. These two files contain information for NSUBS/NPROCS sub-domains: the first contains the geometry quantities (positions and connectivity between elements and nodes), and the second the information concerning boundary conditions and interfaces between sub-domains.
2. The second stage is also run on NPROCS cores and reads the output of the first stage. Improvements made to this stage mean that data are now distributed, rather than replicated, thereby markedly reducing the local memory consumption. The outputs of this stage are the geometry and boundary files readable by TELEMAC-2D.

First results of the new pre-processing stage applied to a 200M element grid partitioned into 32,768 sub-domains on 8 and 24 MPI tasks on the IBM POWER7 cluster are now obtained in close to three hours (9524 s), rather than the previous run-times of several days with the serial PARTEL. The results from the new tool have been verified using different parallel runs with NPROCS=8 and NPROCS=24. METIS 5.0, ParMETIS, SCOTCH and PT-SCOTCH have all been implemented in the pre-processing stage. A demonstration run on the IBM POWER7 cluster has shown that serial SCOTCH is able to partition a 400M element demonstration case into 294,912 sub-domains. This has fully prepared suitable partitioned grids for future parallel runs using large numbers of cores (up to the largest available job size) on Jugene for the large datasets described in this project.

Acknowledgements

This work was financially supported by the PRACE project, funded in part by the EU's 7th Framework Programme (FP7) under grant agreements no. RI and FP. The work was carried out using the PRACE Research Infrastructure resource Jugene at Jülich, Germany. This research also used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH. The authors would also like to thank the UK Engineering and Physical Sciences Research Council (EPSRC) for their support of Collaborative Computational Project 12 (CCP12) and the Distributed Computing Group at STFC Daresbury Laboratory.

References

1. The TELEMAC system.
2. J.-M. Hervouet, Hydrodynamics of Free Surface Flows: Modelling with the Finite Element Method, Wiley.
3. Bundesanstalt für Wasserbau (BAW).
4. STFC Computational Engineering Group.
5. Argonne Blue Gene/P Intrepid.
6. Jülich Blue Gene/P Jugene.
7. METIS 5.0.
8. SCOTCH.
