EDF's Code_Saturne and parallel IO. Toolchain evolution and roadmap


1 EDF's Code_Saturne and parallel IO. Toolchain evolution and roadmap

2 Code_Saturne main capabilities. Physical modelling: single-phase laminar and turbulent flows (k-ε, k-ω SST, v2f, RSM, LES); radiative heat transfer (DOM, P-1); combustion of coal, fuel and gas (EBU, pdf, LWP); electric arc and Joule effect; Lagrangian module for dispersed particle tracking; compressible flow; ALE method for deformable meshes; conjugate heat transfer (Syrthes & 1D); specific engineering modules for nuclear waste surface storage and cooling towers; derived version for atmospheric flows (Mercure_Saturne); derived version for Eulerian multiphase flows.

3 Code_Saturne main capabilities. Version milestones: prototype; 1.0: basic modelling, wide range of meshes, qualification for nuclear applications; open source since March; 1.1: parallelism, LES; 1.2: state of the art in turbulence, open source; 1.3: massively parallel, ALE, code coupling (source code and manuals) (support). Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...).

4 Code_Saturne main capabilities. Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety. But also applications in: combustion (gas and coal); electric arc and Joule effect; atmospheric flows; radiative heat transfer. Other functionalities: fluid-structure interaction; deformable meshes (Arbitrary Lagrangian Eulerian method, ALE); dispersed particle tracking (Lagrangian approach).

5 Code_Saturne main capabilities. Flexibility: portability (UNIX and Linux); GUI (Python TkTix, XML format); parallel on distributed memory machines; periodic boundaries (parallel, arbitrary interfaces); wide range of unstructured meshes with arbitrary interfaces; code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...).

6 Code_Saturne general features. Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; code base 49% FORTRAN, 41% C, 10% Python. Development: 1998: prototype (long-time EDF in-house experience, ESTET-ASTRID, N3S, ...); 2000: version 1.0 (basic modelling, wide range of meshes); 2001: qualification for single-phase nuclear thermal-hydraulic applications; 2004: version 1.1 (complex physics, LES, parallel computing); 2006: version 1.2 (state of the art turbulence models, GUI); 2008: version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence).

7 Code_Saturne general features. Broad validation range for each version: ~30 cases, 1 to 15 simulations per case; academic to industrial cases (4 to … cells, 0.04 s to 12 days CPU time). Runs or has run on Linux (workstations, clusters), Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and P, PowerPC, BULL NovaScale, Cray XT. Qualification for single-phase nuclear applications: best practice guidelines in a specific and critical domain. Usual real-life industrial studies (… to … cells).

8 Code_Saturne subsystems. Pre-processor (reads meshes, serial): mesh import, mesh joining, periodicity, domain partitioning. Parallel kernel (CFD solver): ghost cells creation; FVM library: parallel mesh management, code coupling, parallel treatment; BFT library: serial I/O, memory management. Couples with Code_Saturne, Syrthes, Code_Aster, the Salome platform, ... Inputs and outputs: restart files, XML data file (GUI), postprocessing output.

9 Code_Saturne mesh examples. Example of mesh with stretched cells and hanging nodes (PWR lower plenum); example of composite mesh with 3D polyhedral cells.

10 Code_Saturne: features of note for HPC. Segregated solver: all variables are solved for independently, coupling terms are explicit. Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGStab) for the other variables. More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra. Indirect addressing plus the absence of dense blocks means fewer opportunities for MatVec optimization, as memory bandwidth matters as much as peak flops. Linear equation solvers usually amount to 80% of CPU cost (dominated by the pressure step), gradient reconstruction to about 20%. The larger the mesh, the higher the relative cost of the pressure step.
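
As an illustration of the point above, here is a minimal sketch (in C, not the actual Code_Saturne kernel; all names are illustrative) of a matrix-vector product for such a co-located finite-volume matrix, stored as one diagonal term per cell plus one extra-diagonal coefficient per interior face. The face loop goes through the face-to-cells connectivity, and this indirect, mostly non-contiguous access pattern is what makes the operation memory-bandwidth bound.

```c
/* Minimal sketch (illustrative, not the actual Code_Saturne kernel):
 * y = A.x for a co-located finite-volume matrix stored as a diagonal
 * term per cell plus one extra-diagonal term per interior face. */
void mat_vec_fv(int           n_cells,
                int           n_faces,
                const int    *face_cells,  /* size 2*n_faces: the 2 cells adjacent to each face */
                const double *da,          /* diagonal terms, size n_cells */
                const double *xa,          /* extra-diagonal terms (symmetric case), size n_faces */
                const double *x,
                double       *y)
{
    /* Diagonal contribution: contiguous, streaming access. */
    for (int i = 0; i < n_cells; i++)
        y[i] = da[i] * x[i];

    /* Face contributions: indirect addressing through face_cells,
       so accesses to x and y are mostly non-contiguous. */
    for (int f = 0; f < n_faces; f++) {
        int c0 = face_cells[2*f];
        int c1 = face_cells[2*f + 1];
        y[c0] += xa[f] * x[c1];
        y[c1] += xa[f] * x[c0];
    }
}
```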

11 Base parallel operations (1/6). Distributed memory parallelism using domain partitioning. The classical ghost cell method is used for both parallelism and periodicity. Most operations require only ghost cells sharing faces; extended neighborhoods for gradients also require ghost cells sharing vertices. Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm.
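
A minimal sketch of the ghost cell update pattern described above, assuming illustrative data structures (per-neighbor send lists and a ghost section appended to the local array) rather than the actual halo API:

```c
/* Minimal sketch (illustrative data structures): update ghost cell
 * values from neighboring ranks. Each rank packs the cells its
 * neighbors need, posts non-blocking receives into the ghost section
 * of the array, then sends the packed values. */
#include <mpi.h>
#include <stdlib.h>

void halo_sync(int        n_neighbors,
               const int *neighbor_rank,  /* size n_neighbors */
               const int *send_index,     /* size n_neighbors+1, into send_cells */
               const int *send_cells,     /* local ids of cells to send */
               const int *recv_index,     /* size n_neighbors+1, into the ghost zone */
               int        n_local_cells,
               double    *val,            /* size n_local_cells + n_ghost_cells */
               MPI_Comm   comm)
{
    MPI_Request *req = malloc(2 * n_neighbors * sizeof(MPI_Request));
    double *send_buf = malloc(send_index[n_neighbors] * sizeof(double));

    /* Post receives directly into the ghost part of the array. */
    for (int n = 0; n < n_neighbors; n++)
        MPI_Irecv(val + n_local_cells + recv_index[n],
                  recv_index[n+1] - recv_index[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, comm, &req[n]);

    /* Pack and send the cells needed by each neighbor's ghost zone. */
    for (int n = 0; n < n_neighbors; n++) {
        for (int i = send_index[n]; i < send_index[n+1]; i++)
            send_buf[i] = val[send_cells[i]];
        MPI_Isend(send_buf + send_index[n],
                  send_index[n+1] - send_index[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, comm, &req[n_neighbors + n]);
    }

    MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);

    free(send_buf);
    free(req);
}
```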

12 Base parallel operations (2/6). Use of global numbering: we associate a global number with each mesh entity. A specific C type (fvm_gnum_t) is used for this; it is currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will become necessary. For hexahedral cells, the face-cell connectivity has size 4·n_faces, with n_faces about 3·n_cells, so its size is around 12·n_cells; global numbers therefore exceed the 32-bit range (2^32 / 12) at around 350 million cells. The global number is currently equal to the initial (pre-partitioning) number. This allows for partition-independent single-image files: essential for restart files, also used for postprocessor output, and for legacy coupling where matches can be saved.

13 Base parallel operations (3/6). Use of global numbering: redistribution on n_blocks blocks, with n_blocks ≤ n_cores. A minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force a single block (for I/O with non-parallel libraries).
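
A minimal sketch of such a block distribution, under the assumption (illustrative, not the actual FVM code) that blocks are defined by an even split of the global numbering rounded up to a minimum block size; the owning rank of any global number is then a simple division:

```c
/* Minimal sketch (illustrative): block layout and block ownership
 * derived from global numbers. A very large min_block_size forces a
 * single block on rank 0, which is what non-parallel I/O libraries need. */
#include <stdint.h>

typedef struct {
    int      n_blocks;    /* number of blocks actually used (<= n_ranks) */
    uint64_t block_size;  /* global numbers per block (last block may be smaller) */
} block_dist_t;

block_dist_t block_dist_create(uint64_t n_global, int n_ranks,
                               uint64_t min_block_size)
{
    block_dist_t bd;
    bd.block_size = (n_global + n_ranks - 1) / n_ranks;  /* even split, rounded up */
    if (bd.block_size < min_block_size)
        bd.block_size = min_block_size;
    if (bd.block_size == 0)
        bd.block_size = 1;
    bd.n_blocks = (int)((n_global + bd.block_size - 1) / bd.block_size);
    return bd;
}

/* Owning rank of a (1-based) global number: a simple division, so no
   lookup structure or communication is needed. */
int block_dist_rank(const block_dist_t *bd, uint64_t gnum)
{
    return (int)((gnum - 1) / bd->block_size);
}
```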

14 Base parallel operations (4/6). Conversely, simply using global numbers allows reconstructing the mapping of entities to their equivalents on neighboring partitions. This is used for parallel ghost cell construction, starting from an initially partitioned mesh with no ghost data. The arbitrary (block) distribution is inefficient for halo exchange, but allows simpler data structures and algorithms with deterministic performance bounds: the owning processor is determined simply from the global number, and messages are aggregated.

15 Base parallel operations (5/6). Use of global numbering: using a block distribution for I/O avoids the need for indexed MPI datatypes. We layer our raw I/O subsystem so that we may use either pre-distribution or direct MPI-IO with indexed datatypes in the future (and compare performance), but I/O using libraries with no indexed-datatype equivalent may require using the block distribution. Switching between partition-based and block-based distributions is currently done using MPI_Alltoall (for counts) and MPI_Alltoallv (for data). On high-performance machines with well-optimized MPI libraries, this does not seem to lead to scaling issues.
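
A minimal sketch of the partition-to-block exchange described above, assuming values have already been sorted by destination rank (names and signatures are illustrative, not the actual FVM functions):

```c
/* Minimal sketch (illustrative): move values from a partition-based
 * distribution to a block distribution. The destination rank of each
 * local entity is known from its global number; counts are exchanged
 * with MPI_Alltoall, then the values with MPI_Alltoallv. */
#include <mpi.h>
#include <stdlib.h>

double *part_to_block(int           n_send,     /* local entities */
                      const int    *dest_rank,  /* size n_send, non-decreasing */
                      const double *send_val,   /* size n_send, sorted like dest_rank */
                      int          *n_recv_out,
                      MPI_Comm      comm)
{
    int n_ranks;
    MPI_Comm_size(comm, &n_ranks);

    int *send_count = calloc(n_ranks, sizeof(int));
    int *recv_count = malloc(n_ranks * sizeof(int));
    int *send_disp  = malloc(n_ranks * sizeof(int));
    int *recv_disp  = malloc(n_ranks * sizeof(int));

    for (int i = 0; i < n_send; i++)
        send_count[dest_rank[i]] += 1;

    /* Exchange counts, then build displacements. */
    MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

    send_disp[0] = 0;
    recv_disp[0] = 0;
    for (int r = 1; r < n_ranks; r++) {
        send_disp[r] = send_disp[r-1] + send_count[r-1];
        recv_disp[r] = recv_disp[r-1] + recv_count[r-1];
    }

    int n_recv = recv_disp[n_ranks-1] + recv_count[n_ranks-1];
    double *recv_val = malloc((n_recv > 0 ? n_recv : 1) * sizeof(double));

    MPI_Alltoallv((void *)send_val, send_count, send_disp, MPI_DOUBLE,
                  recv_val, recv_count, recv_disp, MPI_DOUBLE, comm);

    free(send_count); free(recv_count); free(send_disp); free(recv_disp);

    *n_recv_out = n_recv;
    return recv_val;
}
```

The reverse (block-to-partition) move, used when reading, follows the same pattern with the roles of send and receive exchanged.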

16 Base parallel operations (6/6). Such all-to-all communication is only used for specific operations, such as parallel sort, identification of equivalent entities on processor boundaries, and I/O. Algorithms such as the crystal router exist in which the message count (and thus latency) is of complexity O(log(p)); we trust MPI implementors to use the best routing algorithms and heuristics in most cases, as they are the specialists. We avoid explicit O(p) communication loops in new code, and plan to remove a few such loops still residing in the post-processor file output. There is often less than one such call per time step on average, versus hundreds or thousands of calls to MPI_Isend / MPI_Irecv and MPI_Allreduce.

17 Benchmark test cases. FATHER test case: 1 M cells, 200 iterations, LES turbulence. HYPI test case: 10 M cells, 200 iterations, LES turbulence. GRILLE test case: 100 M cells, 50 iterations, k-ε turbulence. Mesh sizes range from usual industrial studies to exploratory studies.

18 Current performance (1/6). Two LES test cases (most I/O factored out). 1 M cells: (n_cells_min + n_cells_max)/2 = 880 cells per core at 1024 cores, 109 at … cores; 10 M cells: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at … cores. Figures: elapsed time vs. number of cores for the FATHER (1 M hexahedra, LES) and HYPI (10 M hexahedra, LES) test cases on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L.

19 Current performance (2/6). RANS, 100 M tetrahedra + polyhedra (most I/O factored out). Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts: 96286/… min/max cells per core at 1024 cores, 11344/12781 min/max cells per core at 8192 cores. Figure: elapsed time per iteration vs. number of cores for the GRILLE (fuel assembly grid) RANS test case on NovaScale, Blue Gene/L (CO) and Blue Gene/L (VN).

20 Current performance (3/6). Figure: speed-up vs. number of processors on the FATHER test case (HP workstation, Chatou cluster, Tantale, Platine, Blue Gene/L).

21 Current performance (4/6). Figure: speed-up vs. number of processors on the HYPI test case (cluster, Tantale, Platine, Blue Gene).

22 Current performance (5/6). Serialized I/O: elapsed time for post-processor output on the FATHER test case. Start / restart is up to 4 times faster than postprocessor output, due to communication scheme differences. Figure: elapsed time (s) for a typical post-processing output with FATHER (1 M) vs. number of processors (HP workstation, Tantale cluster, Platine, Blue Gene).

23 Current performance (6/6). Serialized I/O: elapsed time for post-processor output on the HYPI test case. Figure: elapsed time (s) for a typical post-processing output on HYPI (10 M) vs. number of processors (Blue Gene, Platine, Tantale, Chatou cluster).

24 Toolchain evolution (1). Code_Saturne V1.3 added many HPC-oriented improvements compared to prior versions: post-processor output handled by FVM / kernel; ghost cell construction handled by FVM / kernel, with up to 40% gain in preprocessor memory peak compared to V1.2 (parallelized and scales, though complex, as we may have 2 ghost cell sets and multiple periodicities). Well adapted for models up to 100 million cells (with a 64 GB preprocessing machine), or less. All fundamental limitations are pre-processing related. Toolchain: meshes → Pre-Processor (serial run) → Kernel + FVM (distributed run) → postprocessing output.

25 Toolchain evolution (2). Basic pre-processing is separated from partitioning in version 1.4. This allows running the two stages on different machines, and reusing the same pre-processing run for different partitionings. Joining of non-conforming meshes usually requires only a few minutes under 10 million cells, but can take close to a day for 50 million+ cell meshes built of many sub-parts. Partitioning requires slightly less memory than pre-processing. Toolchain: meshes → Pre-Processor (serial run) → Partitioner (serial run) → Kernel + FVM (distributed run) → postprocessing output.

26 Toolchain evolution (3). Mesh joining is parallelized for V2.0: joining of conforming/non-conforming meshes, periodicity. Parallel edge intersection is implemented but not merged yet; sub-face reconstruction is in progress. The current (serial) algorithm will probably remain available as a backup option for V2.0 (and be removed no sooner than 6 months after the full parallel algorithm). Allow restart in case of memory fragmentation due to a complex mesh joining step. Toolchain: meshes → Kernel + FVM (distributed init) → Partitioner (serial run) → Kernel + FVM (distributed run) → postprocessing output.

27 Toolchain evolution (4). Domain partitioning is parallelized for V2.0: replace serial METIS or SCOTCH with parallel versions. SCOTCH uses the free CeCILL-C licence, whereas METIS is more restrictive, but SCOTCH does not yet seem as stable (possibly due to the METIS wrapper, though feedback from Code_Aster seems to indicate the same). ParMetis partitioning is reportedly of lower quality than serial METIS: is it that bad? PT-SCOTCH does not suffer from this, but its functionality is not complete yet.

28 Large case example (1/2). 125 million hexahedra channel at Daresbury; 100 million tetrahedra calculation at EDF (simplified PWR fuel assembly mixing grid). Run on … to … Blue Gene/L cores, tested also on … BULL NovaScale cores. Biggest issue: mesh generation.

29 Large case example (2/2).

30 Parallelization of partitioning. Version 1.4 is already prepared for parallel mesh partitioning: the mesh is read by blocks in «canonical / global» numbering, then redistributed using the cell domain number mapping. All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell domain mapping; the redistribution infrastructure is already in place and already being used. Possible choice: use Sandia's ZOLTAN library, which may call METIS, SCOTCH, its own high-quality hypergraph partitioning, recursive bisection or space-filling curves (Z-curve or Hilbert), and has an LGPL license.

31 Parallel I/O. Version 1.4 introduces parallel I/O: it uses block-to-partition redistribution when reading, and partition-to-block redistribution when writing. Use of indexed datatypes may be tested in the future, but will not be possible everywhere. Parallel I/O is fully implemented for reading of preprocessor and partitioner output, as well as for restart files; the infrastructure is in progress for postprocessor output. A layered approach is used, as we allow for multiple formats.

32 Standard I/O (1/2). By default, all stdout-type messages go to a log file, using Fortran write. Even in C, we print to a buffer, which is written as a character string by Fortran, to ensure messages appear in the correct order. It is easier to define (and possibly overload) a bft_printf function in C than to modify Fortran I/O in a portable manner, so C adapts to Fortran. The log is opened as "listing" on rank 0, or unit 6 if we want to send it to the screen; we may use "listing_rank_num" on other ranks, but /dev/null is the default.
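
A minimal sketch of this buffered printing scheme, assuming a hypothetical Fortran routine (here called csprnt) that writes a character string to the listing unit; the actual bft_printf interface and Fortran binding may differ:

```c
/* Minimal sketch (illustrative names, not the actual BFT API):
 * C code formats the message into a buffer, then hands the string to
 * a Fortran routine which performs the write, so that C and Fortran
 * output appear in order in the same "listing" file. */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Fortran subroutine writing a character string to the log unit
   (hypothetical name and binding). */
extern void csprnt(const char *msg, const int *len);

int log_printf(const char *format, ...)
{
    char buffer[16384];
    va_list args;

    va_start(args, format);
    int n = vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);

    if (n < 0)
        return n;

    int len = (int)strlen(buffer);
    csprnt(buffer, &len);   /* Fortran does the actual write */

    return n;
}
```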

33 Standard I/O (2/2). In case of failure (an error detected by the code, such as a consistency or convergence error) or when a signal is caught: open a special error_rank_num file, and print details and possibly a backtrace to this file. In the case of a signal handler, do this only on rank 0 for signals such as SIGTERM, ..., but on all failing ranks for SIGFPE or SIGSEGV. Also copy the message to the regular output if possible (i.e. if va_copy is available).
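
A minimal sketch of such a signal handler, with illustrative names (the error-file naming and rank handling are assumptions, not the actual BFT error handling):

```c
/* Minimal sketch (illustrative): on a fatal signal such as SIGFPE or
 * SIGSEGV, the failing rank opens its own error file and writes a
 * short diagnostic before re-raising the signal. stdio is not
 * async-signal-safe, but the process is about to abort anyway. */
#include <signal.h>
#include <stdio.h>

static int log_rank = 0;   /* set from MPI_Comm_rank at startup */

static void fatal_signal_handler(int sig)
{
    char file_name[64];
    snprintf(file_name, sizeof(file_name), "error_r%04d", log_rank);

    FILE *f = fopen(file_name, "w");
    if (f != NULL) {
        fprintf(f, "rank %d: caught signal %d\n", log_rank, sig);
        /* a backtrace could also be written here */
        fclose(f);
    }

    signal(sig, SIG_DFL);   /* fall back to default handling */
    raise(sig);
}

void install_fatal_signal_handlers(int rank)
{
    log_rank = rank;
    signal(SIGFPE, fatal_signal_handler);
    signal(SIGSEGV, fatal_signal_handler);
}
```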

34 Parallel I/O (1/5). Parallel I/O is only of benefit when using parallel filesystems. Use of MPI IO may be disabled either when building the FVM library, or for a given file using specific hints. Without MPI IO, data for each block is written or read successively by rank 0, using the same FVM file functions.

35 Parallel I/O (2/5). Prior to using parallel I/O, we would use a similar mapping of partitions to blocks, but blocks would be assembled in succession on rank 0, writing each block before assembling the next to avoid requiring a very large buffer, and enforcing a minimum buffer size so as to limit the number of blocks when data is small. Otherwise, we would be latency-bound and exhibit inverse scalability.

36 Parallel I/O (3/5). Uses an fvm_file intermediary layer, which only provides a subset of MPI IO possibilities, privileging collectives. It handles offsets and other metadata, and provides fvm_file_write_global, fvm_file_write_block, and the corresponding read functions. We will try fvm_file_read/write_list in the future, using indexed datatypes to move all data shuffling inside MPI IO instead of outside. The layer allows using (synchronous) explicit-offset and individual-file-pointer implementations; shared pointers were tried, but removed due to issues on some filesystems and limited interest.
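
A minimal sketch (not the actual fvm_file API) of what a collective block write with explicit offsets can look like in plain MPI-IO: each rank writes its contiguous block of a global array with MPI_File_write_at_all.

```c
/* Minimal sketch (illustrative): collective, explicit-offset block
 * write, the kind of pattern block write functions map to when
 * MPI IO is enabled. */
#include <mpi.h>
#include <stdlib.h>

/* Write n_local doubles forming a contiguous block of a global array;
   global_shift is this rank's starting index in the global array,
   file_offset the byte offset of the array in the file. */
static void write_block(MPI_File fh, MPI_Offset file_offset,
                        long long global_shift, long long n_local,
                        const double *values)
{
    MPI_Offset offset = file_offset
                      + (MPI_Offset)global_shift * (MPI_Offset)sizeof(double);

    /* Collective variant: all ranks of the communicator attached to fh
       must call this, even those with n_local == 0. */
    MPI_File_write_at_all(fh, offset, values, (int)n_local, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long n_local = 1000;                 /* block size on this rank */
    long long shift = (long long)rank * n_local;

    double *values = malloc(n_local * sizeof(double));
    for (long long i = 0; i < n_local; i++)
        values[i] = (double)(shift + i);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "block_data.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    write_block(fh, 0, shift, n_local, values);

    MPI_File_close(&fh);
    free(values);
    MPI_Finalize();
    return 0;
}
```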

37 Parallel I/O (4/5). Recently added for mesh input and checkpoint/restart files. A specific format is used for these files, based on successive header + data sections, with possible padding for alignment. The initial header indicates the minimum header size, header alignment, and data alignment, so the file is easy to analyze and index for restart purposes by reading headers and seeking to the next header.
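
A minimal sketch of the alignment padding implied by this format; the helper below simply rounds a file offset up to the alignment announced in the initial header (names are illustrative):

```c
/* Minimal sketch (illustrative): round the current file offset up to
 * the next multiple of the alignment given in the initial header, so
 * that each data section starts on an aligned boundary. */
#include <stdint.h>

static int64_t align_offset(int64_t offset, int64_t alignment)
{
    if (alignment <= 1)
        return offset;
    return ((offset + alignment - 1) / alignment) * alignment;
}

/* Example: with a 56-byte header and data aligned on 512 bytes, the
   data section starts at align_offset(56, 512) = 512; bytes 56 to 511
   are padding. */
```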

38 Parallel I/O (5/5). This is a recent development, with not much feedback yet. It would be most useful for postprocessor output, but that will require more work: indexed output data must be distributed to blocks, preferably balanced by data size rather than by index size, for good load balance. We will replace the use of fvm_gather_... type functions by fvm_part_to_block_... type functions in the file output and output helper code. Multiple layers are involved due to the handling of several file formats.

39 Possible future changes. Base the global (canonical) numbering not on the initial number, but on a space-filling curve. This is a domain partitioning method in its own right, independent of processor count. It ensures the block-to-partition mapping is «sparse» (i.e. only a few blocks are referenced by a partition, and vice-versa), and the block-to-partition mapping is 1-to-1 if no additional partitioning method is used.
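
A minimal sketch of a space-filling-curve numbering of this kind, using a 3D Morton (Z-order) code obtained by interleaving the bits of quantized cell-center coordinates; sorting cells by this code and numbering them in that order gives a global numbering independent of the processor count (an illustration, not the actual implementation):

```c
/* Minimal sketch (illustrative): 3D Morton (Z-curve) code built by
 * interleaving the bits of quantized coordinates. */
#include <stdint.h>

/* Spread the lower 21 bits of x so they occupy every third bit. */
static uint64_t spread_bits_3d(uint64_t x)
{
    x &= 0x1fffffULL;                      /* keep 21 bits */
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x << 8)  & 0x100f00f00f00f00fULL;
    x = (x | x << 4)  & 0x10c30c30c30c30c3ULL;
    x = (x | x << 2)  & 0x1249249249249249ULL;
    return x;
}

/* Morton code from integer coordinates (each < 2^21), e.g. obtained by
   quantizing cell centers onto a regular grid over the bounding box. */
static uint64_t morton3d(uint32_t ix, uint32_t iy, uint32_t iz)
{
    return spread_bits_3d(ix)
         | (spread_bits_3d(iy) << 1)
         | (spread_bits_3d(iz) << 2);
}
```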

40 Load imbalance (1/3). In this example, using 8 partitions (with METIS), we have the following local minima and maxima: cells: 416 / 440 (6% imbalance); cells + ghost cells: 469 / 519 (11% imbalance); interior faces: 852 / 946 (11% imbalance). Most loops are on cells, but some are on cells + ghosts, and MatVec is on cells + faces.

41 Load imbalance (2/3). If load imbalance increases with processor count, scalability decreases. If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted. Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops (PCG uses MatVec and dot products). Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces).

42 Load imbalance (3/3). Another possible source of load imbalance is different cache miss rates on different ranks, which are difficult to estimate in advance. With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, and considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance can reach 20%.

43 Multigrid (1/2). Currently, multigrid coarsening does not cross processor boundaries. This implies that on p processors, the coarsest matrix may not contain fewer than p cells. With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count. This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file. Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small.

44 Multigrid (2/2). Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small; see the sketch below. Most ranks will then have empty grids, but latency dominates anyway at this stage. The communication pattern is not expected to change too much, as partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves), and should already exhibit some sort of multigrid nature. This may be less optimal than some methods using a different partitioning for each grid level, but setup time should also remain much cheaper; this is important, as grids may be rebuilt at each time step.
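
A minimal sketch of the rank-grouping rule mentioned above; the merge stride and the threshold test are illustrative assumptions:

```c
/* Minimal sketch (illustrative): when the mean number of coarse-grid
 * cells per rank drops below a threshold, each group of `stride` ranks
 * gathers its coarse grid onto the lowest rank of the group; the other
 * ranks of the group keep an empty grid for that level and below. */

static int coarse_grid_owner(int rank, int stride)
{
    /* stride = 4: ranks 0..3 -> 0, 4..7 -> 4, ... */
    return (rank / stride) * stride;
}

static int need_grid_merge(long long n_global_cells, int n_ranks,
                           long long min_cells_per_rank)
{
    return (n_global_cells / n_ranks) < min_cells_per_rank;
}
```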

45 Parallelization of mesh joining (1/2). Though mesh joining may not be a long-term solution (as it tends to be used excessively, in ways detrimental to mesh quality), it is a functionality we cannot do without as long as we do not have a proven alternative. Periodicity is also constructed as an extension of mesh joining. In addition, joining of meshes built in several pieces to circumvent meshing-tool memory limitations is very useful.

46 Parallelization of mesh joining (2/2). Parallelizing this algorithm requires the same main steps as the serial algorithm: (1) detect intersections (within a given tolerance) between edges of overlapping faces (done); this uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required here), whereas the preprocessor version used a lexicographical binary search, whose best case was O(n log n) and worst case O(n²); (2) subdivide edges according to inserted intersection vertices; (3) merge coincident or nearly-coincident vertices/intersections, which requires recursive merge accept/refuse (done); (4) re-build sub-faces.

47 Parallel point location. Parallel point location is used by the cooling tower and overset grid functionalities, as well as by coupling with SYRTHES 4, none of which is fully standard yet. The algorithm is parallel, but the communication pattern could lead to some serialization. A simple improvement would be to order communication by communicating first among ranks in the same half of the rank set, then among ranks in the other half, ordering these communications recursively in the same manner; load imbalance may remain significant, but serialization would be limited. A rendez-vous approach using an octree or a recursive-bisection intermediate mesh may be more efficient, but its setup is more complex and expensive.

48 Hybrid MPI / OpenMP (1/3). Currently, only the MPI model is used: by default, everything is parallel, and synchronization is explicit when required. On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives). Parallel sections must be marked, and parallel loops must avoid modifying the same values. Specific numberings must be used, similar to those used for vectorization, but with different constraints: avoid false sharing, and keep locality to limit cache misses. A sketch is given below.
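
A minimal sketch of the two situations, with illustrative names: a cell loop can be parallelized directly with OpenMP, whereas a face loop scattering to adjacent cells is assumed here to have been renumbered into groups in which no two faces share a cell, so that updates within a group are race-free.

```c
/* Minimal sketch (illustrative, not Code_Saturne source). */
#include <omp.h>

void cell_axpy(int n_cells, double alpha,
               const double *restrict x, double *restrict y)
{
    /* Cell loop: each iteration writes a distinct y[i], so a simple
       "parallel for" is enough. */
    #pragma omp parallel for
    for (int i = 0; i < n_cells; i++)
        y[i] += alpha * x[i];
}

void face_scatter(int n_groups, const int *group_index,
                  const int *face_cells, const double *face_flux,
                  double *cell_sum)
{
    /* Face loop: faces are processed group by group; each group is
       assumed to be built so that no two of its faces touch the same
       cell, so the scatter to cell_sum has no write conflicts. */
    for (int g = 0; g < n_groups; g++) {
        #pragma omp parallel for
        for (int f = group_index[g]; f < group_index[g + 1]; f++) {
            int c0 = face_cells[2*f];
            int c1 = face_cells[2*f + 1];
            cell_sum[c0] += face_flux[f];
            cell_sum[c1] -= face_flux[f];
        }
    }
}
```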

49 Hybrid MPI / OpenMP (3/3). Hybrid MPI / OpenMP will be tested; IBM will do testing on Blue Gene/P. We also hope to provide OpenMP parallelism for ease of packaging / installation on Linux distributions: no dependencies on multiple MPI library choices, only on the GCC runtime, which is good enough for current multicore workstations. Coupling with SYRTHES 4 will still require MPI.

50 HPC roadmap: application examples
