PRACE Workshop: Application Case Study: Code_Saturne
Andrew Sunderland, Charles Moulinec, Zhi Shang (Science and Technology Facilities Council, Daresbury Laboratory, UK)
Yvan Fournier (Électricité de France, Paris, France)
Kevin Roy (Cray Centre of Excellence, UK)
Juan Uribe (University of Manchester, UK)
Summary
- Background: STFC Daresbury Laboratory; evolution of Code_Saturne
- Petascaling and optimization: datasets; initial performance analysis; optimization (multigrid solver)
- Petascaling: partitioning; MPI I/O; load imbalance; hybrid model
STFC Daresbury Laboratory
- HPC service provider to the UK academic community for > 25 years
- Jointly runs the UK national HPC services HPCx (IBM Power5) and HECToR (Cray XT4); also hosts an STFC machine, a 1-rack IBM Blue Gene/P
- Research, development and support centre for leading-edge academic engineering and physical science simulation codes, e.g. DL_POLY, GAMESS-UK, MPP-CRYSTAL, PFARM
Towards the Petascale
- Increase in TOP500 performance is now driven by increasing core count, not processor speed; memory subsystems may continue to improve
- Terascaling issues (many hundreds to a few thousand cores): parallel scalability of diagonalizations, FFTs, preconditioned sparse solvers
- Petascaling issues (tens to hundreds of thousands of cores): diagonalization-free, FFT-free methods? A different approach to sparse solvers? Efficient I/O, load balancing, sensitivity to partitioning, MPI vs. hybrid
Code_Saturne main capabilities
- Chosen as one of the core application benchmarks for PRACE WP6; to be run on the Power6, BG/P, Cray XT5 and NEC SX-9 prototypes
- General-purpose computational fluid dynamics code
- Physical modelling:
  - Single-phase laminar and turbulent flows: k-ε, SST, v2f, RSM, LES, RANS
  - Radiative heat transfer (DOM, P-1)
  - Combustion: coal, fuel, gas (EBU, pdf, LWP)
  - Electric arc and Joule effect
  - Lagrangian module for dispersed particle tracking
  - Compressible and incompressible flow
  - Conjugate heat transfer (Syrthes and 1D)
  - Specific engineering modules for nuclear waste surface storage and cooling towers
  - Derived version for atmospheric flows (Mercure_Saturne)
  - Derived version for Eulerian multiphase flows
Code_Saturne main capabilities
- 1998: prototype
- 2000: version 1.0 (basic modelling, wide range of meshes)
- 2001: qualification for nuclear applications
- 2004: version 1.1 (parallelism, LES)
- 2006: version 1.2 (state of the art in turbulence)
- 2007: open source since March 2007
- 2008: version 1.3 (massively parallel, ALE, code coupling)
- Basic capabilities: simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, two-equation models, v2f, RSM, LES, ...)
- http://www.code-saturne.org (source code and manuals); saturne-support@edf.fr (support)
Code_Saturne main capabilities
- Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety
- But also applications in: combustion (gas and coal); electric arc and Joule effect; atmospheric flows; radiative heat transfer
- Other functionalities: fluid-structure interaction; deformable meshes (Arbitrary Lagrangian-Eulerian method, ALE); dispersed particle tracking (Lagrangian approach)
Code_Saturne main capabilities
- Flexibility and portability (UNIX and Linux); no major porting issues for BG/P, Power6 or the XT series in PRACE
- GUI (Python TkTix, XML format)
- Parallel on distributed-memory machines
- Periodic boundaries (parallel, arbitrary interfaces)
- Wide range of unstructured meshes with arbitrary interfaces
- Code coupling capabilities (Code_Saturne/Code_Aster, ...)
Code_Saturne general features
- Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; 500 000 lines of code: 49% Fortran, 41% C, 10% Python
- Development:
  - 1998: prototype (long EDF in-house experience: ESTET-ASTRID, N3S, ...)
  - 2000: version 1.0 (basic modelling, wide range of meshes)
  - 2001: qualification for single-phase nuclear thermal-hydraulic applications
  - 2004: version 1.1 (complex physics, LES, parallel computing)
  - 2006: version 1.2 (state-of-the-art turbulence models, GUI)
  - 2008: version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence)
Code_Saturne general features
- Broad validation range for each version: ~30 cases, 1 to 15 simulations per case; academic to industrial cases (4 to 2 000 000 cells, 0.04 s to 12 days of CPU time)
- Runs or has run on Linux (workstations, clusters), AIX, Solaris, SGI IRIX64, Fujitsu VPP5000, HP AlphaServer, Blue Gene/L and P, PowerPC, Bull NovaScale, Cray XT
- Qualification for single-phase nuclear applications: best-practice guidelines in a specific and critical domain
- Usual real-life industrial studies: 500 000 to 3 000 000 cells
Code_Saturne subsystems
- Pre-processor: mesh import, mesh joining, periodicity, domain partitioning (input: meshes)
- Parallel kernel: CFD solver, ghost cell creation; reads the XML data file produced by the GUI; writes restart files and postprocessing output
- FVM library: parallel mesh management, code coupling, parallel treatment (coupling with Syrthes, Code_Aster, the Salome platform, ...)
- BFT library: serial I/O, memory management
Code_Saturne subsystems
- Same structure as on the previous slide; the FVM library is the focus of the PRACE WP6 work
Code_Saturne: features of note to HPC
- Segregated solver: diagonal-preconditioned CG used for the pressure equation, Jacobi (or BiCGStab) used for the other variables
- Matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
- Indirect addressing and the absence of dense blocks mean fewer opportunities for mat-vec optimization, as memory bandwidth is as important as peak flops
- Linear equation solvers usually amount to 80% of CPU cost (dominated by the pressure step); gradient reconstruction about 20%
- The larger the mesh, the higher the relative cost of the pressure step
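The mat-vec pattern behind these numbers can be sketched with a face-based storage scheme: one diagonal coefficient per cell and one extra-diagonal coefficient per interior face (the layout and names here are illustrative, not Code_Saturne's exact data structures). The scattered, indirectly addressed updates through `face_cell` are why memory bandwidth, not peak flops, bounds this kernel:

```python
def native_mat_vec(da, xa, face_cell, x):
    """y = A.x for a symmetric face-based sparse matrix:
    da[c]        -- diagonal coefficient of cell c
    xa[f]        -- extra-diagonal coefficient of interior face f
    face_cell[f] -- the pair (i, j) of cells sharing face f
    """
    y = [da[c] * x[c] for c in range(len(da))]
    for f, (i, j) in enumerate(face_cell):
        # Indirect addressing: gather/scatter through face_cell, so the
        # loop streams coefficients but jumps around in x and y.
        y[i] += xa[f] * x[j]
        y[j] += xa[f] * x[i]
    return y
```

On a hexahedral mesh each cell has up to 6 faces, giving the 7 non-zeroes per row quoted above (diagonal + 6 face terms).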
Base parallel operations
- Distributed-memory parallelism using domain partitioning
- Classical ghost cell method used for both parallelism and periodicity
- Most operations require only ghost cells sharing faces
- Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
Benchmark test cases (number of cells in the mesh, from industrial to exploratory studies)
- Father test case: 1M cells, LES turbulence
- Hypi test case: 10M cells, LES turbulence
- GRILLE test case: 100M cells, k-ε turbulence
Code_Saturne Initial Performance (i): 10M Cell Dataset
[Chart: performance per timestep (arbitrary units) vs. number of cores, 0 to 5120, for Louhi (Cray XT), Huygens (IBM Power6) and Jugene (IBM BG/P)]
Code_Saturne Initial Performance (ii): 10M Cell Dataset
[Chart: performance per timestep (arbitrary units) vs. number of cores, up to 32 768, for Louhi (Cray XT), Huygens (IBM Power6) and Jugene (IBM BG/P)]
Code_Saturne Initial Performance (iii): 100M Cell Dataset
[Chart: "Mixer Grid 100M Cells - Solver": relative performance (arbitrary units) vs. number of cores, 0 to 9000, for the Cray XT4 and IBM BG/P against ideal scaling]
Optimization - Multigrid
- Version 1.4 replaces the standard conjugate gradient solver with a multigrid solver
- Two potential performance gains: the solver may converge in fewer iterations, and a coarse-grid iteration requires fewer operations
- Also improves the robustness of the code
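The two gains can be seen in a minimal two-grid V-cycle for a 1D Poisson model problem (-u'' = f on (0,1), zero Dirichlet boundaries). This is a toy geometric sketch, not Code_Saturne's algebraic multigrid on unstructured meshes, but the structure is the same: smooth, restrict the residual, solve cheaply on a grid with half the points, interpolate the correction back, smooth again:

```python
def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

def two_grid_poisson(b, n_cycles=2, nu=3):
    """Two-grid V-cycle for -u'' = f, zero Dirichlet BCs.
    len(b) must be odd so the fine grid coarsens cleanly."""
    n = len(b)
    h = 1.0 / (n + 1)
    nc = (n - 1) // 2           # coarse grid: half the unknowns
    H = 2.0 * h
    x = [0.0] * n

    def resid(x):
        r = []
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            r.append(b[i] - (2.0 * x[i] - left - right) / (h * h))
        return r

    def smooth(x, sweeps):      # damped Jacobi, weight 2/3
        for _ in range(sweeps):
            r = resid(x)
            for i in range(n):
                x[i] += (2.0 / 3.0) * r[i] * h * h / 2.0

    def coarse_solve(rc):       # Thomas algorithm for tridiag(-1,2,-1)/H^2
        m = len(rc)
        a, d = -1.0 / (H * H), 2.0 / (H * H)
        cp, dp = [0.0] * m, [0.0] * m
        cp[0], dp[0] = a / d, rc[0] / d
        for i in range(1, m):
            den = d - a * cp[i - 1]
            cp[i] = a / den
            dp[i] = (rc[i] - a * dp[i - 1]) / den
        ec = [0.0] * m
        ec[m - 1] = dp[m - 1]
        for i in range(m - 2, -1, -1):
            ec[i] = dp[i] - cp[i] * ec[i + 1]
        return ec

    for _ in range(n_cycles):
        smooth(x, nu)           # pre-smoothing kills high frequencies
        r = resid(x)
        # full-weighting restriction of the residual to the coarse grid
        rc = [(r[2*I] + 2.0 * r[2*I+1] + r[2*I+2]) / 4.0 for I in range(nc)]
        ec = coarse_solve(rc)   # cheap: only nc unknowns
        e = [0.0] * n           # linear interpolation back to the fine grid
        for I in range(nc):
            e[2*I+1] = ec[I]
        for i in range(0, n, 2):
            left = ec[i // 2 - 1] if i >= 2 else 0.0
            right = ec[i // 2] if i // 2 < nc else 0.0
            e[i] = 0.5 * (left + right)
        for i in range(n):
            x[i] += e[i]
        smooth(x, nu)           # post-smoothing
    return x
```

A couple of cycles reduce the residual by orders of magnitude, where plain Jacobi would need hundreds of sweeps; the coarse solve touches only half the unknowns, which is the "fewer operations per coarse iteration" gain.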
Multigrid Performance: Cray XT4, 10M Cell Dataset
[Chart: "Performance Comparison: Conjugate Gradient vs Multigrid": relative performance (arbitrary units) vs. number of cores, 0 to 4096]
Multigrid Future Optimizations (1/2)
- Currently, multigrid coarsening does not cross processor boundaries, so on p processors the coarsest matrix may not contain fewer than p cells
- With a high processor count, fewer grid levels can be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count; this reduces scalability
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small
Multigrid Future Optimizations (2/2)
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small, mapping onto the underlying multicore architecture
- Most ranks will then have empty grids, but latency dominates anyway at this stage
- The communication pattern is not expected to change much: partitioning is of a recursive nature (whether using recursive graph partitioning or space-filling curves) and should already exhibit some sort of multigrid structure
- This may be less optimal than some methods using a different partitioning for each rank, but setup time should also remain much cheaper
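The planned rank-gathering rule reduces to a tiny mapping function. Everything here (names, the threshold parameter) is an assumption for illustration, not the actual implementation:

```python
def gather_rank(rank, mean_cells, min_cells, stride=4):
    """Decide which rank should own this rank's coarse-grid data.
    While the mean local grid size stays above min_cells, data stays in
    place; once it drops below, it moves to the nearest lower multiple
    of `stride` (4 or 8, matching the multicore node layout), leaving
    the other ranks of the group with empty grids."""
    if mean_cells >= min_cells:
        return rank
    return (rank // stride) * stride
```

Applied recursively at successive coarsening levels (with a growing stride), this concentrates the coarsest grids on a handful of ranks while keeping the recursive partitioning structure intact.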
CrayPat: Multigrid Serial Performance
Profile at 128 cores shows 65% of the runtime in scalar numerical routines:

  100.0% | 76111 |         |       | Total
   64.6% | 49177 |         |       | USER
   21.7% | 16544 | 2988.72 | 15.4% | _mat_vec_p_l_native
   10.9% |  8284 | 1515.80 | 15.6% | _conjugate_gradient_mp
    7.9% |  6034 | 2619.52 | 30.5% | _alpha_a_x_p_beta_y_native
    7.5% |  5686 | 1357.38 | 19.4% | cblas_daxpy
    4.8% |  3620 | 1044.47 | 22.6% | _polynomial_preconditionning
    4.6% |  3477 | 2077.88 | 37.7% | gradrc_
    2.5% |  1936 |  323.73 | 14.4% | _jacobi

All of these are targets for optimization. Most have already been optimised for the IBM and Bull platforms:

  #if defined(__xlc__)
  #pragma disjoint(*x, *y, *da, *xa1, *xa2)
  #endif
  #if defined(ia64_optim)

This shows there is a way forward for increased performance targeting the AMD quad-core.
CrayPat: Multigrid Parallel Performance
Timing results show that it is not scaling. The profile at 128 cores shows 35% in non-compute operations; the profile at 512 cores shows only 11% in serial operations. Either a load-balancing problem, or communication written with operations that are inefficient on an XT. Early profiles suggest collectives are dominating:

  100.0% | 37132 |         |       | Total
   35.7% | 13272 |         |       | MPI
   10.7% |  3987 |  437.43 |  9.9% | MPI_Allreduce
    9.3% |  3471 | 2243.79 | 39.3% | MPI_Waitall
    4.7% |  1750 |  282.24 | 13.9% | MPI_Barrier
    4.6% |  1714 |   12.16 |  0.7% | MPI_Recv
    3.8% |  1418 | 1124.28 | 44.3% | MPI_Isend
    1.0% |   356 |  215.20 | 37.8% | MPI_Irecv

Significant time is spent in Waitall, and it is also imbalanced (3rd and 4th columns).
CrayPat Analysis: Message Exchange
- The predominant message-exchanging routine is cs_halo_sync_var; it is called 180 000 times in 10 iterations
- The exchange is implemented with MPI_Isend and MPI_Irecv, with a global barrier between the Isends and Irecvs to ensure each receive is posted before the matching send
- For better performance we will re-order the Isends; it would also be better not to issue Isends and Irecvs of zero length
- It would be better still to have some work to do between the Irecvs and Isends, as this would allow the communication to happen asynchronously
- In many cases the calls to cs_halo_sync_var can be combined to send one message rather than four
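The suggested restructuring — post all non-empty receives first, then the re-ordered non-empty sends, with no barrier in between — can be sketched with a stub that returns the planned operation list instead of calling MPI. The send ordering by distance from the local rank is one plausible choice to avoid every rank targeting the same peer first; it is an assumption, not necessarily the ordering adopted in cs_halo_sync_var:

```python
def plan_halo_exchange(rank, n_ranks, send_counts, recv_counts):
    """Return the ordered list of operations for one halo exchange.
    send_counts / recv_counts map peer rank -> number of values;
    zero-length messages are skipped entirely, and receives are posted
    before any send so no barrier is needed."""
    ops = []
    for peer in sorted(recv_counts):
        if recv_counts[peer] > 0:
            ops.append(("irecv", peer, recv_counts[peer]))
    # stagger sends: start with the peer just after us, wrapping around
    for peer in sorted(send_counts, key=lambda p: (p - rank) % n_ranks):
        if send_counts[peer] > 0:
            ops.append(("isend", peer, send_counts[peer]))
    ops.append(("waitall", None, None))
    return ops
```

Because every receive is pre-posted, the barrier that showed up in the CrayPat profile disappears, and zero-length traffic no longer pays latency.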
CrayPat Analysis: Global Comms
- The synchronization time for the collectives is more significant than the routines themselves
- We should be able to reduce the number of collectives: remove the barrier within cs_halo_sync_var, and collate consecutive global collectives
- This should save time spent in the collectives and also give opportunities for overlapping
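Collating consecutive collectives can be demonstrated serially with a stub reduction that just counts calls; real code would pack the partial sums into one buffer and call MPI_Allreduce once (all names here are illustrative):

```python
class ReduceCounter:
    """Stand-in for MPI_Allreduce: counts calls; on one 'rank' the
    global sum equals the local one, so it returns its input."""
    def __init__(self):
        self.calls = 0

    def allreduce(self, values):
        self.calls += 1
        return values

def dots_separate(comm, pairs):
    # one collective per dot product: latency paid len(pairs) times
    return [comm.allreduce([sum(x * y for x, y in zip(u, v))])[0]
            for u, v in pairs]

def dots_collated(comm, pairs):
    # pack all local partial sums into one buffer, reduce once
    local = [sum(x * y for x, y in zip(u, v)) for u, v in pairs]
    return comm.allreduce(local)
```

Both variants return the same dot products, but the collated one issues a single collective, so only one synchronization cost is paid regardless of how many reductions are batched.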
Parallelization of partitioning
- Version 1.4 is already prepared for parallel mesh partitioning
- The mesh is read by blocks in "canonical / global" numbering, then redistributed using a cell-to-domain number mapping
- All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell-to-domain mapping; the redistribution infrastructure is already in place and in use
- Possible choices: ParMETIS, PT-SCOTCH
Parallel Partitioner Performance
[Chart: time taken (seconds) vs. number of cores, 0 to 512, for PT-SCOTCH and ParMETIS]
I/O Overheads
[Chart: performance (arbitrary units) vs. number of processors, 0 to 8192, on the Cray XT4: per-iteration performance, total time and ideal scaling]
I/O Overheads
[Chart: as on the previous slide, with the gap between per-iteration and total-time performance highlighted as the I/O overhead]
Parallel I/O (i)
- Version 1.4 introduces parallel I/O
- Uses block-to-partition redistribution when reading, partition-to-block when writing
- Fully implemented for reading the preprocessor and partitioner output, as well as for restart files
- Infrastructure in progress for postprocessor output
Parallel I/O (ii)
- Parallel I/O is only of benefit on parallel filesystems
- Use of MPI I/O may be disabled either when building the FVM library or for a given file, using specific hints
- Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file functions
Parallel I/O (iii)
- Prior to parallel I/O, we used a similar mapping of partitions to blocks, but blocks were assembled in succession on rank 0:
  - writing each block before assembling the next, to avoid requiring a very large buffer
  - enforcing a minimum buffer size so as to limit the number of blocks when data is small
- Otherwise, we would be latency-bound and exhibit inverse scalability
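The minimum-buffer-size rule amounts to a small sizing function: cap the number of blocks so that each one carries at least a minimum amount of data. This is an illustrative reconstruction of the idea, not FVM's actual code:

```python
def block_ranges(n_items, n_ranks, min_block_size):
    """Split n_items into at most n_ranks contiguous blocks, each of at
    least min_block_size items (except possibly the last), so that small
    datasets do not generate many tiny, latency-bound transfers."""
    n_blocks = max(1, min(n_ranks, n_items // min_block_size))
    step = -(-n_items // n_blocks)   # ceiling division
    return [(b * step, min((b + 1) * step, n_items))
            for b in range(n_blocks)]
```

With plenty of data there is one block per rank; when the data shrinks below n_ranks * min_block_size, the block count drops, and in the limit a single block goes through rank 0 in one transfer instead of thousands.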
Load Imbalance (1/3)
- RANS, 100M tetrahedra + polyhedra (most I/O factored out)
- Polyhedra due to mesh joinings may lead to higher load imbalance in the local mat-vec at large core counts:
  - 96 286 / 102 242 min/max cells at 1024 cores: 5.8% imbalance
  - 11 344 / 12 781 min/max cells at 8192 cores: 8.9% imbalance
Load Imbalance (2/3)
- If load imbalance increases with processor count, scalability decreases
- If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted
- Load imbalance might be reduced by using weights for domain partitioning, with cell weight = 1 + f(n_faces)
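A weight of the form 1 + f(n_faces) could look like the following sketch, where the linear f, its coefficient and the 6-face reference are purely hypothetical choices, not Code_Saturne's actual ones:

```python
def cell_weight(n_faces, n_faces_ref=6, alpha=0.5):
    """Illustrative partitioning weight: a regular hexahedron (6 faces)
    gets weight 1; polyhedra produced by mesh joining, which cost more
    in the face-based mat-vec, get proportionally more, so the
    partitioner balances work rather than raw cell counts."""
    return 1.0 + alpha * max(0, n_faces - n_faces_ref) / n_faces_ref
```

Feeding such weights to a graph partitioner (ParMETIS and PT-SCOTCH both accept vertex weights) would equalize the weighted sum per domain instead of the cell count.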
Load Imbalance (3/3)
- Another possible source of load imbalance is different cache miss rates on different ranks; this is difficult to estimate in advance
- With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, and a cache miss costs at least 100 instruction slots (several hundred for a main-memory miss), the corresponding imbalance can reach 20%
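The arithmetic can be checked with a one-line model: a rank taking one miss every n instructions costs 1 + c/n instruction slots per instruction, for a miss cost of c slots. With the 300-vs-400 figures above, the imbalance is about 6.7% at c = 100, reaches 20% at c = 600, and approaches (400 - 300)/300 ≈ 33% as c grows:

```python
def miss_imbalance(instr_per_miss_a, instr_per_miss_b, miss_cost):
    """Relative slowdown of rank A vs. rank B when A suffers one cache
    miss every instr_per_miss_a instructions (B likewise) and each miss
    stalls for miss_cost instruction slots."""
    cost_a = 1.0 + miss_cost / instr_per_miss_a   # slots per instruction, A
    cost_b = 1.0 + miss_cost / instr_per_miss_b   # slots per instruction, B
    return cost_a / cost_b - 1.0
```

The strong dependence on the miss cost is exactly why this imbalance is hard to estimate in advance: it varies with which cache level the misses hit.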
Hybrid MPI / OpenMP (1/2)
- Currently, only the MPI model is used: by default everything is parallel, and synchronization is explicit where required
- On multiprocessor / multicore nodes, shared-memory parallelism could also be used (via OpenMP directives): parallel sections must be marked, and parallel loops must avoid modifying the same values
- Specific numberings must be used, similar to those used for vectorization but with different constraints: avoid false sharing, and keep locality to limit cache misses
Hybrid MPI / OpenMP (2/2)
- EDF plans to test hybrid MPI / OpenMP on Blue Gene
- Also pure OpenMP parallelism, for ease of packaging / installation on Linux distributions: no dependencies on multiple MPI library choices, only on the gcc runtime; good enough for current multicore workstations
- Coupling with SYRTHES 4 will still require MPI
Code_Saturne - Summary
- Several projects exist (in addition to PRACE) to improve the performance of the code: pre-processing, mesh generation, mesh partitioning; improvements to the CFD solver; code optimization, particularly on the Cray XT4 / XT5 (and Jaguar); parallel I/O; hybrid MPI / OpenMP
- Parallel performance of the existing code is very good, particularly for large problem sizes; we hope to benchmark the 100M-cell mixing grid for PRACE as soon as possible
- Introduction of multigrid has reduced scalability but improved performance
- Load balancing is difficult to perfect at large processor counts