PRACE Workshop: Application Case Study: Code_Saturne. Andrew Sunderland, Charles Moulinec, Zhi Shang. Daresbury Laboratory, UK


1 PRACE Workshop: Application Case Study: Code_Saturne
Andrew Sunderland, Charles Moulinec, Zhi Shang (Science and Technology Facilities Council, Daresbury Laboratory, UK)
Yvan Fournier (Electricité de France, Paris, France)
Kevin Roy (Cray Centre of Excellence, UK)
Juan Uribe (University of Manchester, UK)

2 Summary
- Background
- STFC Daresbury Laboratory
- Evolution of Code_Saturne
- Petascaling and optimization
- Datasets
- Initial performance analysis
- Optimization: multigrid solver
- Petascaling: partitioning, MPI/IO, load imbalance, hybrid model


3 STFC Daresbury Laboratory
- HPC service provider to the UK academic community for more than 25 years
- Jointly runs the UK national HPC services: HPCx (IBM Power5), HECToR (XT4)
- Also an STFC machine: one-rack IBM BG/P
- Research, development and support centre for leading-edge academic engineering and physical science simulation codes, e.g. DL_POLY, GAMESS-UK, MPP-CRYSTAL, PFARM

4 Towards the Petascale
- The increase in TOP500 performance is now driven by increasing core count, not processor speed
- Memory subsystems may continue to improve
- Terascaling issues (many hundreds or a few thousand cores): parallel scalability of diagonalizations, FFTs, preconditioned sparse solvers
- Petascaling issues (many thousands of cores): diagonalization-free, FFT-free methods? Different approach to sparse solvers? Efficient I/O, load balancing, sensitivity to partitioning, MPI vs hybrid

5 Code_Saturne main capabilities
- Chosen as one of the core application benchmarks for PRACE WP6
- General-purpose computational fluid dynamics code, to be run on the PWR6, BG/P, Cray XT5 and NEC SX-9 prototypes
Physical modelling:
- Single-phase laminar and turbulent flows: k-ε, SST, v2f, RSM, LES, RANS
- Radiative heat transfer (DOM, P-1)
- Combustion: coal, fuel, gas (EBU, pdf, LWP)
- Electric arc and Joule effect
- Lagrangian module for dispersed particle tracking
- Compressible and incompressible flow
- Conjugate heat transfer (Syrthes and 1D)
- Specific engineering modules for nuclear waste surface storage and cooling towers
- Derived version for atmospheric flows (Mercure_Saturne)
- Derived version for Eulerian multiphase flows

6 Code_Saturne main capabilities
- Timeline: prototype; 1.0 (basic modelling, wide range of meshes); qualification for nuclear applications; parallelism, LES; 1.2 (state of the art in turbulence); 1.3 (massively parallel, ALE, code coupling)
- Open source since March (source code and manuals) (support)
- Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...)


7 Code_Saturne main capabilities
- Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety
- But also applications in: combustion (gas and coal), electric arc and Joule effect, atmospheric flows, radiative heat transfer
- Other functionalities: fluid-structure interaction; deformable meshes with the Arbitrary Lagrangian Eulerian method (ALE); dispersed particle tracking (Lagrangian approach)

8 Code_Saturne main capabilities
Flexibility:
- Portability (UNIX and Linux); no major porting issues for BG/P, PWR6 or XT series in PRACE
- GUI (Python TkTix, XML format)
- Parallel on distributed-memory machines
- Periodic boundaries (parallel, arbitrary interfaces)
- Wide range of unstructured meshes with arbitrary interfaces
- Code coupling capabilities (Code_Saturne/Code_Aster, ...)

9 Code_Saturne general features
Technology:
- Co-located finite volume, arbitrary unstructured meshes, predictor-corrector method
- Code base: 49% FORTRAN, 41% C, 10% Python
Development:
- 1998: Prototype (long-time EDF in-house experience: ESTET-ASTRID, N3S, ...)
- 2000: Version 1.0 (basic modelling, wide range of meshes)
- 2001: Qualification for single-phase nuclear thermal-hydraulic applications
- 2004: Version 1.1 (complex physics, LES, parallel computing)
- 2006: Version 1.2 (state-of-the-art turbulence models, GUI)
- 2008: Version 1.3 (more parallelism, ALE, code coupling, ...) released as open source (GPL licence)

10 Code_Saturne general features
- Broad validation range for each version: about 30 cases, 1 to 15 simulations per case
- Academic to industrial cases (from 4 cells upward; 0.04 s to 12 days of CPU time)
- Runs or has run on Linux (workstations, clusters), AIX, Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and P, PowerPC, BULL Novascale, Cray XT
- Qualification for single-phase nuclear applications
- Best practice guidelines in specific and critical domains
- Usual real-life industrial studies


11 Code_Saturne subsystems
[Diagram: components and data flow]
- Inputs: meshes, restart files, XML data file (from the GUI)
- Pre-processor: mesh import, mesh joining, periodicity, domain partitioning
- Parallel Kernel: ghost cells creation, CFD solver
- FVM library: parallel mesh management, code coupling, parallel treatment
- BFT library: serial I/O, memory management
- Coupled codes: Code_Saturne, Syrthes, Code_Aster, Salome platform, ...
- Output: postprocessing output

12 Code_Saturne subsystems
[Diagram: same components as slide 11 (pre-processor, Parallel Kernel, FVM library, BFT library, coupled codes, restart files, XML data file, GUI, postprocessing output), with a "PRACE WP6" label placed near the FVM library and the Parallel Kernel]

13 Code_Saturne: features of note to HPC
- Segregated solver
- Diagonal-preconditioned CG is used for the pressure equation; Jacobi (or BiCGStab) is used for other variables
- Matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
- Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops
- Linear equation solvers usually amount to 80% of CPU cost (dominated by pressure), gradient reconstruction about 20%
- The larger the mesh, the higher the relative cost of the pressure step
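The diagonal-preconditioned conjugate gradient named on this slide can be sketched in a few lines. This is a minimal pure-Python illustration and not Code_Saturne's implementation: the row-list sparse storage, the function names `matvec`, `pcg_diag` and `laplacian_1d`, and the small 1D test matrix are all assumptions made for the example.

```python
def matvec(rows, x):
    """Sparse mat-vec; rows[i] is a list of (column, value) pairs (assumed storage)."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def dot(a, b):
    """Local dot product; in a distributed run this needs a global reduction."""
    return sum(ai * bi for ai, bi in zip(a, b))


def pcg_diag(rows, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient with diagonal (Jacobi) preconditioning, for SPD systems."""
    n = len(b)
    diag = [next(v for j, v in rows[i] if j == i) for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]      # apply the preconditioner D^-1
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for it in range(1, max_iter + 1):
        ap = matvec(rows, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def laplacian_1d(n):
    """Tridiagonal SPD test matrix (hypothetical stand-in for a pressure matrix)."""
    rows = []
    for i in range(n):
        row = [(i, 2.0)]
        if i > 0:
            row.append((i - 1, -1.0))
        if i < n - 1:
            row.append((i + 1, -1.0))
        rows.append(row)
    return rows


rows = laplacian_1d(8)
x_exact = [float(i + 1) for i in range(8)]
b = matvec(rows, x_exact)
x, iterations = pcg_diag(rows, b)
```

Each iteration performs one sparse mat-vec through indirect addressing and several dot products, which is where the memory-bandwidth and global-reduction costs discussed in these slides come from.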

14 Base parallel operations
- Distributed-memory parallelism using domain partitioning
- The classical ghost cell method is used for both parallelism and periodicity
- Most operations require only ghost cells sharing faces
- Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
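The ghost cell method can be illustrated on a 1D array split across a few simulated ranks. This is a sketch under stated assumptions: plain Python lists stand in for per-rank memory, list copies stand in for MPI messages, and the names `split`, `halo_sync` and `smooth` are hypothetical. It shows the same ghost mechanism serving both the parallel interfaces and the periodic boundary.

```python
def split(values, n_ranks):
    """Partition a 1D array into contiguous chunks, each padded with two ghost cells."""
    n = len(values)
    bounds = [n * r // n_ranks for r in range(n_ranks + 1)]
    return [[0.0] + values[bounds[r]:bounds[r + 1]] + [0.0] for r in range(n_ranks)]


def halo_sync(parts, periodic=True):
    """Fill ghost cells from the neighbouring partitions (stand-in for messages)."""
    n_ranks = len(parts)
    for r, local in enumerate(parts):
        left, right = r - 1, r + 1
        if periodic:
            left, right = left % n_ranks, right % n_ranks
        if 0 <= left < n_ranks:
            local[0] = parts[left][-2]     # last owned cell of the left neighbour
        if 0 <= right < n_ranks:
            local[-1] = parts[right][1]    # first owned cell of the right neighbour


def smooth(parts):
    """Three-point average on owned cells only; ghost cells supply the neighbours."""
    return [[(p[i - 1] + p[i] + p[i + 1]) / 3.0 for i in range(1, len(p) - 1)]
            for p in parts]


values = [float(i * i % 7) for i in range(12)]
parts = split(values, 3)
halo_sync(parts, periodic=True)
parallel = [v for chunk in smooth(parts) for v in chunk]

n = len(values)
serial = [(values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0 for i in range(n)]

# A global reduction combines per-rank partial results over owned cells only.
global_sum = sum(sum(p[1:-1]) for p in parts)
```

After the synchronisation, the stencil applied rank by rank gives the same values as the serial periodic stencil, and ghost cells are excluded from the reduction so that no cell is counted twice.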

15 Benchmark test cases
[Figure: test cases placed by number of cells in the mesh, ranging from industrial studies to exploratory studies]
- Father test case: 1 M cells, turbulence = LES
- Hypi test case: 10 M cells, turbulence = LES
- GRILLE test case: 100 M cells, turbulence = k-ε

16 Code_Saturne Initial Performance (i): 10M cell dataset
[Figure: performance per timestep (arbitrary units) against number of cores for Louhi (Cray XT), Huygens (IBM PWR6) and Jugene (IBM BG/P)]

17 Code_Saturne Initial Performance (ii): 10M cell dataset
[Figure: performance per timestep (arbitrary units) against number of cores for Louhi (Cray XT), Huygens (IBM PWR6) and Jugene (IBM BG/P)]

18 Code_Saturne Initial Performance (ii): 100M cell dataset
[Figure: "Mixer Grid 100M Cells - Solver"; relative performance (arbitrary units) against number of cores for Cray XT4 and IBM BG/P, compared with ideal scaling]

19 Optimization: Multigrid
- V1.4 replaces the standard conjugate gradient solver with a multigrid solver
- Two potential performance gains: the solver may converge in fewer iterations, and it requires fewer operations in a coarse grid iteration
- Otherwise improves the robustness of the code

20 Multigrid Performance: Cray XT4, 10M cell dataset
[Figure: performance comparison, conjugate gradient vs multigrid; relative performance (arbitrary units) against number of cores]

21 Multigrid Future Optimizations (1/2)
- Currently, multigrid coarsening does not cross processor boundaries
- This implies that on p processors, the coarsest matrix cannot contain fewer than p cells
- With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
- This reduces scalability
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small
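The effect of rank-local coarsening on the number of grid levels can be estimated with simple arithmetic. The sketch below assumes a perfectly balanced partition and a fixed coarsening ratio of 8 cells per coarse cell; both are illustrative assumptions, and `local_levels` is a hypothetical helper, not a Code_Saturne function.

```python
def local_levels(n_cells, n_ranks, ratio=8, min_local=1):
    """Count grid levels when each rank coarsens only its own cells.

    ratio (fine cells merged per coarse cell) and min_local are assumed values.
    Returns (number of levels, global size of the coarsest grid).
    """
    local = n_cells // n_ranks          # assume a perfectly balanced partition
    levels = 1
    while local // ratio >= min_local:
        local //= ratio
        levels += 1
    return levels, local * n_ranks      # every rank keeps at least min_local cells


results = {p: local_levels(10_000_000, p) for p in (16, 1024, 8192)}
```

Under these assumptions a 10M cell mesh keeps 7 levels on 16 ranks but only 4 levels on 8192 ranks, and the coarsest grid grows from 32 to 16384 cells, which is the loss of scalability described on the slide.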

22 Multigrid Future Optimizations (2/2)
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small
- Map onto the underlying multicore architecture
- Most ranks will then have empty grids, but latency dominates anyway at this stage
- The communication pattern is not expected to change too much, as partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves) and should already exhibit some sort of multigrid nature
- This may be less optimal than some methods using a different partitioning for each rank, but setup time should also remain much cheaper
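One possible reading of the planned grid merging is sketched below. It assumes that "nearest rank multiple" means the nearest lower multiple of the stride and that only cell counts matter; the names `merge_target` and `merge_grids` and the threshold value are hypothetical.

```python
def merge_target(rank, stride):
    """Rank receiving this rank's coarse grid: nearest lower multiple of stride."""
    return rank - rank % stride


def merge_grids(local_sizes, stride, threshold):
    """Gather coarse grids on every stride-th rank when the mean local size is small."""
    mean = sum(local_sizes) / len(local_sizes)
    if mean >= threshold:
        return list(local_sizes)            # grids are large enough: leave them alone
    merged = [0] * len(local_sizes)
    for rank, size in enumerate(local_sizes):
        merged[merge_target(rank, stride)] += size
    return merged


small = merge_grids([2] * 8, stride=4, threshold=10)
large = merge_grids([20] * 8, stride=4, threshold=10)
```

With a stride of 4, ranks 0 and 4 end up holding the merged grids while the other ranks hold empty grids, matching the slide's remark that most ranks will then have empty grids. Choosing the stride to match the cores per node would keep the merged ranks on one node.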

23 CrayPat: Multigrid Serial Performance
Profile at 128 cores shows 65% of the runtime in scalar numerical routines:
  Time%    Group / Function
  100.0%   Total
           USER
           _mat_vec_p_l_native
   10.9%   _conjugate_gradient_mp
    7.9%   _alpha_a_x_p_beta_y_native
    7.5%   cblas_daxpy
    4.8%   _polynomial_preconditionning
    4.6%   gradrc_
    2.5%   _jacobi
- All of these are targets for optimization
- Most have been optimised for IBM and Bull platforms:
    #if defined(__xlc__)
    #pragma disjoint(*x, *y, *da, *xa1, *xa2)
    #endif
    #if defined(ia64_optim)
- This shows there is a way forward for increased performance targeting AMD quad-core

24 CrayPat: Multigrid Parallel Performance
- Timing results show that it is not scaling
- Profile at 128 cores shows 35% in non-compute operations
- Profile for 512 cores shows only 11% in serial operations
- Either there is a load-balancing problem, or the code is written with communication operations that are inefficient on an XT
- Early profiles suggest collectives are dominating:
  Time%    Group / Function
  100.0%   Total
           MPI
           MPI_Allreduce
    9.3%   MPI_Waitall
    4.7%   MPI_Barrier
    4.6%   MPI_Recv
    3.8%   MPI_Isend
    1.0%   MPI_Irecv
- Significant time is spent in MPI_Waitall, which is also imbalanced (3rd and 4th columns of the profile)

25 CrayPat Analysis: Message Exchange
- The predominant message exchanging routine is cs_halo_sync_var; it is called 180,000 times in 10 iterations
- The exchange is implemented with isend and irecv
- It has a global barrier between the irecvs and the isends to ensure each irecv is posted before the matching send
- For better performance we will re-order the isends
- It would also be better not to issue isends and irecvs if the length is zero
- It would be better if there were some work to do between the irecvs and the isends, as this would allow the communication to happen asynchronously
- In many cases the calls to cs_halo_sync_var can be combined to send one message rather than 4 (in some cases)
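The benefit of skipping zero-length exchanges and of packing several variables into one message can be estimated with a small message-count model. The sketch below does not call MPI; the function names, the halo sizes, and the latency and per-value cost constants are all illustrative assumptions.

```python
def plan_messages(halo_sizes, n_vars, combine):
    """List (neighbour, n_values) messages for one halo synchronisation.

    halo_sizes maps neighbour rank -> number of halo cells (hypothetical data).
    Zero-length exchanges are skipped; with combine=True all variables
    travel in one packed message per neighbour.
    """
    messages = []
    for neighbour, n_cells in sorted(halo_sizes.items()):
        if n_cells == 0:
            continue                        # nothing to exchange: post no request
        if combine:
            messages.append((neighbour, n_cells * n_vars))
        else:
            messages.extend((neighbour, n_cells) for _ in range(n_vars))
    return messages


def cost(messages, latency=5e-6, seconds_per_value=4e-9):
    """Latency plus bandwidth estimate; both constants are assumptions."""
    return sum(latency + n * seconds_per_value for _, n in messages)


halo = {1: 200, 2: 0, 5: 150}
separate = plan_messages(halo, n_vars=4, combine=False)
packed = plan_messages(halo, n_vars=4, combine=True)
```

Both plans move the same number of values, but the packed plan pays the per-message latency 2 times instead of 8, which matters most when halos are small and latency dominates.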

26 CrayPat Analysis: Global Comms
- The synchronization time for the collectives is more significant than the routines themselves
- We should be able to reduce the number of collectives: the barrier within cs_halo_sync_var can be removed, and consecutive global collectives can be collated
- This should allow us to save time spent in the collectives, and also give an opportunity for overlapping
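Collating consecutive collectives amounts to packing several scalars into one reduction. The sketch below uses a hypothetical `FakeComm` class that only counts reductions over per-rank lists; it is a stand-in for an MPI communicator, not an MPI binding, and the vector values are made up.

```python
class FakeComm:
    """Stand-in for an MPI communicator: counts reductions over per-rank data."""

    def __init__(self):
        self.n_allreduce = 0

    def allreduce_sum(self, per_rank_vectors):
        """Element-wise sum of one equal-length list per rank."""
        self.n_allreduce += 1
        return [sum(column) for column in zip(*per_rank_vectors)]


def local_dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Two simulated ranks, each holding its share of vectors r and z.
ranks = [
    {"r": [1.0, 2.0], "z": [0.5, 1.0]},
    {"r": [3.0], "z": [1.5]},
]

# Separate collectives: one reduction per dot product.
comm_a = FakeComm()
rz = comm_a.allreduce_sum([[local_dot(d["r"], d["z"])] for d in ranks])[0]
rr = comm_a.allreduce_sum([[local_dot(d["r"], d["r"])] for d in ranks])[0]

# Collated collective: both partial sums travel in one reduction.
comm_b = FakeComm()
rz2, rr2 = comm_b.allreduce_sum(
    [[local_dot(d["r"], d["z"]), local_dot(d["r"], d["r"])] for d in ranks]
)
```

The collated version returns the same two values with half the number of synchronisation points, which is the saving the slide is aiming for.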

27 Parallelization of partitioning
- Version 1.4 is already prepared for parallel mesh partitioning
- The mesh is read by blocks in "canonical / global" numbering, then redistributed using a cell domain number mapping
- All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell domain mapping
- The redistribution infrastructure is already in place, and already being used
- Possible choices: ParMETIS, PT-SCOTCH
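The block-to-partition redistribution can be sketched as follows. Lists stand in for per-rank memory, and the cell domain mapping is a made-up array where a real run would use partitioner output; `block_ranges` and `redistribute` are hypothetical names.

```python
def block_ranges(n_cells, n_ranks):
    """Contiguous global-number ranges read by each rank ("block" distribution)."""
    return [(n_cells * r // n_ranks, n_cells * (r + 1) // n_ranks)
            for r in range(n_ranks)]


def redistribute(block_data, ranges, cell_domain, n_ranks):
    """Move each cell's value from its block owner to its partition owner.

    cell_domain[g] is the domain (rank) assigned to global cell g.
    Returns, per rank, a list of (global_number, value) pairs.
    """
    partition = [[] for _ in range(n_ranks)]
    for rank, (start, _end) in enumerate(ranges):
        for offset, value in enumerate(block_data[rank]):
            g = start + offset
            partition[cell_domain[g]].append((g, value))
    return partition


n_cells, n_ranks = 10, 3
ranges = block_ranges(n_cells, n_ranks)
values = [100 + g for g in range(n_cells)]
block_data = [values[s:e] for s, e in ranges]
cell_domain = [0, 2, 2, 1, 0, 1, 1, 2, 0, 0]    # hypothetical partitioner output
partition = redistribute(block_data, ranges, cell_domain, n_ranks)
```

Swapping the partitioner only changes `cell_domain`; the reading by blocks and the redistribution step stay the same, which is why the slide describes the parallel partitioner as something to plug in.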

28 Parallel Partitioner Performance
[Figure: time taken (secs) against number of cores for PT-SCOTCH and PARMETIS]

29 I/O Overheads
[Figure: performance (arbitrary units) against number of processors: Cray XT4 per iteration, Cray XT4 total time, ideal]

30 I/O Overheads
[Figure: performance (arbitrary units) against number of processors: Cray XT4 per iteration, Cray XT4 total time, ideal; annotated "I/O Overheads"]

31 Parallel I/O (i)
- Version 1.4 introduces parallel I/O
- Uses block-to-partition redistribution when reading, partition-to-block when writing
- Fully implemented for reading of preprocessor and partitioner output, as well as for restart files
- Infrastructure in progress for postprocessor output

32 Parallel I/O (ii)
- Parallel I/O is only of benefit when using parallel filesystems
- Use of MPI IO may be disabled either when building the FVM library or, for a given file, using specific hints
- Without MPI IO, data for each block is written or read successively by rank 0, using the same FVM file functions as the MPI I/O subsystem

33 Parallel I/O (iii)
- Prior to using parallel I/O, we would use a similar mapping of partitions to blocks, but blocks would be assembled in succession on rank 0:
  - writing each block before assembling the next, to avoid requiring a very large buffer
  - enforcing a minimum buffer size so as to limit the number of blocks when data is small
- Otherwise, we would be latency-bound and exhibit inverse scalability
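The rank 0 scheme described on this slide can be sketched as below. Nested lists stand in for per-rank data and list appends stand in for gathers and file writes; `block_size`, `write_via_rank0` and the sample data are assumptions made for the example.

```python
def block_size(n_global, n_ranks, min_block):
    """Values per block: an even share per rank, but never below min_block."""
    even = -(-n_global // n_ranks)              # ceiling division
    return max(even, min_block)


def write_via_rank0(per_rank_items, n_global, min_block):
    """Assemble and "write" one block at a time on rank 0.

    per_rank_items[r] is a list of (global_index, value) pairs owned by rank r.
    Returns the written values in global order and the number of blocks used.
    """
    size = block_size(n_global, len(per_rank_items), min_block)
    out, n_blocks = [], 0
    for start in range(0, n_global, size):
        end = min(start + size, n_global)
        buffer = [None] * (end - start)         # only one block is held at a time
        for items in per_rank_items:            # stand-in for gathers onto rank 0
            for g, value in items:
                if start <= g < end:
                    buffer[g - start] = value
        out.extend(buffer)                      # stand-in for the file write
        n_blocks += 1
    return out, n_blocks


per_rank = [[(0, "a"), (3, "d")], [(1, "b"), (4, "e")], [(2, "c"), (5, "f")]]
written_small, blocks_small = write_via_rank0(per_rank, 6, min_block=1)
written_large, blocks_large = write_via_rank0(per_rank, 6, min_block=4)
```

Raising the minimum block size reduces the number of assemble-and-write rounds (3 blocks versus 2 here) without changing the file contents, which is how the minimum buffer size limits latency-bound behaviour for small data.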

34 Load Imbalance (1/3)
- RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
- Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec for large core counts
- 96286/ min/max cells at 1024 cores: 5.8% imbalance
- 11344/12781 min/max cells at 8192 cores: 8.9% imbalance

35 Load imbalance (2/3)
- If load imbalance increases with processor count, scalability decreases
- If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted
- Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces)
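The weighting idea can be tried on a toy mesh. The slide leaves f unspecified, so the linear form used here, its coefficient, the greedy assignment (a stand-in for a real graph partitioner), the imbalance definition (maximum load over mean load, minus one) and the face counts are all assumptions.

```python
def cell_weight(n_faces, alpha=0.1, ref_faces=4):
    """Cell weight = 1 + f(n_faces), with an assumed linear f."""
    return 1.0 + alpha * max(n_faces - ref_faces, 0)


def imbalance(loads):
    """Relative excess of the most loaded rank over the mean load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean - 1.0


def greedy_partition(weights, n_ranks):
    """Assign cells (heaviest first) to the currently least loaded rank."""
    loads = [0.0] * n_ranks
    owner = [0] * len(weights)
    for cell in sorted(range(len(weights)), key=lambda c: -weights[c]):
        rank = loads.index(min(loads))
        owner[cell] = rank
        loads[rank] += weights[cell]
    return owner, loads


# Toy mesh: 12 tetrahedra (4 faces) followed by 4 polyhedra (12 faces).
n_faces = [4] * 12 + [12] * 4
weights = [cell_weight(n) for n in n_faces]

# Equal cell counts per rank, ignoring weights: rank 1 gets all the polyhedra.
count_loads = [sum(weights[:8]), sum(weights[8:])]

# Weighted assignment spreads the expensive cells across ranks.
_, weighted_loads = greedy_partition(weights, 2)
```

With equal cell counts the rank holding the polyhedra carries about 17% more work than the mean, while the weighted assignment brings the two loads close together, even though the cell counts per rank are then no longer equal.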

36 Load imbalance (3/3)
- Another possible source of load imbalance is different cache miss rates on different ranks
- This is difficult to estimate in advance
- With otherwise balanced loops, if a processor has a cache miss every 300 instructions, and another a cache miss every 400 instructions, considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%
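The sensitivity of this estimate to the miss cost can be explored with a simple model. The model below is an assumption (the slide does not state how its figure was derived): each instruction costs one unit, and each miss adds a fixed penalty expressed in instruction-equivalents.

```python
def time_per_instruction(miss_interval, miss_cost):
    """Average cost of one instruction, in instruction-equivalents."""
    return 1.0 + miss_cost / miss_interval


def cache_imbalance(interval_a, interval_b, miss_cost):
    """Relative extra time of the rank with the more frequent misses."""
    slow = time_per_instruction(min(interval_a, interval_b), miss_cost)
    fast = time_per_instruction(max(interval_a, interval_b), miss_cost)
    return slow / fast - 1.0


at_100 = cache_imbalance(300, 400, miss_cost=100)
at_600 = cache_imbalance(300, 400, miss_cost=600)
```

Under this model the imbalance grows with the miss penalty: a penalty of exactly 100 instructions gives about 6.7%, and 20% is reached for a penalty of about 600 instructions, which is compatible with the slide's "at least 100 instructions".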

37 Hybrid MPI / OpenMP (1/2)
- Currently, only the MPI model is used: by default, everything is parallel, and synchronization is explicit when required
- On multiprocessor / multicore nodes, shared-memory parallelism could also be used (using OpenMP directives)
- Parallel sections must be marked, and parallel loops must avoid modifying the same values
- Specific numberings must be used, similar to those used for vectorization, but with different constraints: avoid false sharing, keep locality to limit cache misses
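One way to obtain loops that never modify the same values is to group faces so that no two faces in a group touch the same cell. The greedy grouping below is an illustrative assumption about what such a numbering could look like; it is not taken from Code_Saturne, and `colour_faces` is a hypothetical name.

```python
def colour_faces(face_cells):
    """Greedy grouping of interior faces so that no two faces in a group share a cell.

    face_cells[f] = (cell_i, cell_j). Faces within one group can update
    their two cells concurrently without write conflicts.
    """
    groups = []                     # list of (set of touched cells, list of face ids)
    for f, (i, j) in enumerate(face_cells):
        for touched, faces in groups:
            if i not in touched and j not in touched:
                touched.update((i, j))
                faces.append(f)
                break
        else:
            groups.append(({i, j}, [f]))
    return [faces for _, faces in groups]


# 1D chain of 5 cells: face f joins cells f and f + 1.
face_cells = [(c, c + 1) for c in range(4)]
groups = colour_faces(face_cells)
```

A threaded face loop would then process one group at a time, with the faces of a group shared among threads. A production numbering would also need to keep faces of a group close in memory, which this sketch ignores.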

38 Hybrid MPI / OpenMP (2/2)
- Hybrid MPI / OpenMP: EDF plans to test on Blue Gene
- Also pure OpenMP parallelism, for ease of packaging / installation on Linux distributions:
  - No dependencies on multiple MPI library choices, only on the gcc runtime
  - Good enough for current multicore workstations
  - Coupling with SYRTHES 4 will still require MPI

39 Code_Saturne: Summary
- Several projects exist (in addition to PRACE) to improve the performance of the code:
  - Pre-processing, mesh generation, mesh partitioning
  - Improvements to the CFD solver
  - Code optimization, particularly on the Cray XT4 / XT5 (and Jaguar)
  - Parallel I/O, hybrid MPI / OpenMP
- Parallel performance of the existing code is very good, particularly for large problem sizes
- We hope to benchmark the 100M cell mixing grid for PRACE as soon as possible
- Introduction of multigrid has reduced scalability but improved performance
- Load balancing is difficult to perfect on large processor counts
