PRACE Workshop: Application Case Study: Code_Saturne. Andrew Sunderland, Charles Moulinec, Zhi Shang. Daresbury Laboratory, UK

PRACE Workshop: Application Case Study: Code_Saturne. Andrew Sunderland, Charles Moulinec, Zhi Shang, Science and Technology Facilities Council, Daresbury Laboratory, UK; Yvan Fournier, Électricité de France, Paris, France; Kevin Roy, Cray Centre of Excellence, UK; Juan Uribe, University of Manchester, UK

Summary
- Background: STFC Daresbury Laboratory; evolution of Code_Saturne
- Petascaling and optimization: datasets; initial performance analysis
- Optimization: multigrid solver
- Petascaling: partitioning; MPI I/O; load imbalance; hybrid model

STFC Daresbury Laboratory
- HPC service provider to the UK academic community for more than 25 years
- Jointly runs the UK national HPC services HPCx (IBM Power5) and HECToR (Cray XT4); also hosts an STFC machine, a 1-rack IBM BG/P
- Research, development and support centre for leading-edge academic engineering and physical-science simulation codes, e.g. DL_POLY, GAMESS-UK, MPP-CRYSTAL, PFARM

Towards The Petascale
- The increase in TOP500 performance is now driven by rising core counts, not processor speed; memory subsystems may continue to improve
- Terascaling issues (many hundreds to a few thousand cores): parallel scalability of diagonalisations, FFTs and preconditioned sparse solvers
- Petascaling issues (tens to hundreds of thousands of cores): diagonalisation-free and FFT-free methods? A different approach to sparse solvers? Efficient I/O, load balancing, sensitivity to partitioning, MPI vs hybrid

Code_Saturne main capabilities
- Chosen as one of the core application benchmarks for PRACE WP6
- General-purpose Computational Fluid Dynamics code, to be run on the Power6, BG/P, Cray XT5 and NEC SX-9 prototypes
- Physical modelling:
  - Single-phase laminar and turbulent flows: k-ε, SST, v2f, RSM, LES, RANS
  - Radiative heat transfer (DOM, P-1)
  - Combustion: coal, fuel, gas (EBU, pdf, LWP)
  - Electric arc and Joule effect
  - Lagrangian module for dispersed particle tracking
  - Compressible and incompressible flow
  - Conjugate heat transfer (Syrthes & 1D)
  - Specific engineering modules for nuclear waste surface storage and cooling towers
  - Derived version for atmospheric flows (Mercure_Saturne)
  - Derived version for Eulerian multiphase flows

Code_Saturne main capabilities
- Timeline: 1998 prototype; 2000 version 1.0 (basic modelling, wide range of meshes); 2001 qualification for nuclear applications; 2004 version 1.1 (parallelism, LES); 2006 version 1.2 (state of the art in turbulence); open source since March 2007; 2008 version 1.3 (massively parallel, ALE, code coupling)
- http://www.code-saturne.org (source code and manuals); saturne-support@edf.fr (support)
- Basic capabilities: simulation of incompressible or expandable flows with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...)
- Open source

Code_Saturne main capabilities
- Main application area: nuclear power plant optimisation in terms of lifespan, productivity and safety
- But also applications in: combustion (gas and coal), electric arc and Joule effect, atmospheric flows, radiative heat transfer
- Other functionalities: fluid-structure interaction, deformable meshes (Arbitrary Lagrangian-Eulerian method, ALE), dispersed particle tracking (Lagrangian approach)

Code_Saturne main capabilities
- Flexibility:
  - Portability (UNIX and Linux); no major porting issues for BG/P, Power6 or the XT series in PRACE
  - GUI (Python TkTix, XML format)
  - Parallel on distributed-memory machines
  - Periodic boundaries (parallel, arbitrary interfaces)
  - Wide range of unstructured meshes with arbitrary interfaces
  - Code coupling capabilities (Code_Saturne/Code_Aster, ...)

Code_Saturne general features
- Technology: co-located finite volume, arbitrary unstructured meshes, predictor-corrector method; ~500,000 lines of code (49% Fortran, 41% C, 10% Python)
- Development:
  - 1998: prototype (long-standing EDF in-house experience: ESTET-ASTRID, N3S, ...)
  - 2000: version 1.0 (basic modelling, wide range of meshes)
  - 2001: qualification for single-phase nuclear thermal-hydraulic applications
  - 2004: version 1.1 (complex physics, LES, parallel computing)
  - 2006: version 1.2 (state-of-the-art turbulence models, GUI)
  - 2008: version 1.3 (more parallelism, ALE, code coupling, ...), released as open source (GPL licence)

Code_Saturne general features
- Broad validation range for each version: ~30 cases, 1 to 15 simulations per case, from academic to industrial (4 to 2,000,000 cells, 0.04 s to 12 days of CPU time)
- Runs or has run on Linux (workstations, clusters), AIX, Solaris, SGI Irix64, Fujitsu VPP 5000, HP AlphaServer, Blue Gene/L and /P, PowerPC, Bull NovaScale, Cray XT
- Qualification for single-phase nuclear applications: best-practice guidelines in a specific and critical domain
- Usual real-life industrial studies: 500,000 to 3,000,000 cells

Code_Saturne subsystems
[Diagram: the Code_Saturne tool chain. Meshes are read by the Pre-processor (mesh import, mesh joining, periodicity, domain partitioning), which feeds the Parallel Kernel / CFD Solver (ghost-cell creation). The solver is built on the FVM library (parallel mesh management, code coupling, parallel treatment), the component targeted by PRACE WP6, and on the BFT library (serial I/O, memory management). It couples to Syrthes, Code_Aster, the Salome platform and other codes, reads the XML data file produced by the GUI, and writes restart files and postprocessing output.]

Code_Saturne: features of note to HPC
- Segregated solver: diagonal-preconditioned CG is used for the pressure equation, Jacobi (or Bi-CGSTAB) for the other variables
- Matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
- Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops (see the sketch below)
- Linear equation solvers usually amount to 80% of the CPU cost (dominated by pressure); gradient reconstruction accounts for about 20%
- The larger the mesh, the higher the relative cost of the pressure step
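
To make the MatVec point concrete, here is a minimal sketch of a face-based ("native") sparse matrix-vector product of the kind a co-located finite-volume solver uses, assuming a symmetric matrix with one extra-diagonal coefficient per interior face. The function name, array layout and types are illustrative assumptions, not Code_Saturne's actual API.

    /* y = D*x plus extra-diagonal contributions accumulated face by face.
     * Illustrative sketch only: names and layout are assumptions. */
    #include <stddef.h>

    void
    mat_vec_native(size_t        n_cells,      /* local cells                         */
                   size_t        n_faces,      /* interior faces                      */
                   const size_t  face_cell[],  /* 2*n_faces adjacent cell indices     */
                   const double  da[],         /* diagonal coefficient per cell       */
                   const double  xa[],         /* extra-diagonal coefficient per face */
                   const double  x[],
                   double        y[])
    {
      /* Diagonal part: contiguous and purely memory-bandwidth bound. */
      for (size_t i = 0; i < n_cells; i++)
        y[i] = da[i] * x[i];

      /* Extra-diagonal part: indirect addressing through the face-cell
       * connectivity, which is why dense-block MatVec optimizations do not apply. */
      for (size_t f = 0; f < n_faces; f++) {
        size_t i = face_cell[2*f];
        size_t j = face_cell[2*f + 1];
        y[i] += xa[f] * x[j];
        y[j] += xa[f] * x[i];
      }
    }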

Base parallel operations
- Distributed-memory parallelism using domain partitioning
- The classical ghost-cell method is used for both parallelism and periodicity; most operations require only ghost cells sharing faces
- Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm (see the sketch below)
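
A minimal sketch of the global reduction such a preconditioned CG iteration performs: each rank sums over the cells it owns (ghost cells excluded), then a single MPI_Allreduce yields the global dot product. Names are illustrative.

    #include <mpi.h>

    double
    parallel_dot(int n_local, const double x[], const double y[], MPI_Comm comm)
    {
      double local = 0.0, global = 0.0;

      for (int i = 0; i < n_local; i++)   /* owned cells only, not ghosts */
        local += x[i] * y[i];

      /* One collective per dot product: at scale these reductions, rather than
       * the local arithmetic, tend to limit the scalability of CG-type solvers. */
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);

      return global;
    }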

Benchmark test cases
The benchmark meshes span the range from current industrial studies to exploratory studies:

    Test case   Mesh size     Turbulence model
    Father      1 M cells     LES
    Hypi        10 M cells    LES
    GRILLE      100 M cells   k-ε

Code_Saturne Initial Performance (i): 10M Cell Dataset
[Figure: performance per timestep (arbitrary units) vs number of cores, up to ~5120 cores, on Louhi (Cray XT), Huygens (IBM Power6) and Jugene (IBM BG/P).]

Code_Saturne Initial Performance (ii): 10M Cell Dataset
[Figure: performance per timestep (arbitrary units) vs number of cores, up to ~34,816 cores, on Louhi (Cray XT), Huygens (IBM Power6) and Jugene (IBM BG/P).]

Code_Saturne Initial Performance (iii): 100M Cell Dataset
[Figure: "Mixer Grid 100M Cells - Solver"; relative performance (arbitrary units) vs number of cores, up to ~9000 cores, for the Cray XT4 and IBM BG/P against ideal scaling.]

Optimization - Multigrid
- V1.4 replaces the standard conjugate gradient solver with a multigrid solver
- Two potential performance gains: the solver may converge in fewer iterations, and a coarse-grid iteration requires fewer operations
- It also improves the robustness of the code

Multigrid Performance: Cray XT4, 10M Cell Dataset
[Figure: "Performance Comparison: Conjugate Gradient vs Multigrid"; relative performance (arbitrary units) vs number of cores, up to 4096 cores, for the multigrid and conjugate gradient solvers.]

Multigrid Future Optimizations (1/2)
- Currently, multigrid coarsening does not cross processor boundaries, so on p processors the coarsest matrix cannot contain fewer than p cells
- At high processor counts, fewer grid levels can be used, and solving for the coarsest matrix may be significantly more expensive than at low processor counts; this reduces scalability
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small

Multigrid Future Optimizations (2/2)
- Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small, mapping onto the underlying multicore architecture (see the sketch below)
- Most ranks will then have empty grids, but latency dominates anyway at this stage
- The communication pattern is not expected to change much, since the partitioning is recursive in nature (whether using recursive graph partitioning or space-filling curves) and should already exhibit some multigrid-like structure
- This may be less optimal than some methods using a different partitioning for each rank, but setup time should also remain much cheaper
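
A hedged sketch of the planned coarse-grid gathering, assuming "nearest rank multiple of 4 or 8" means gathering each too-small coarse grid onto the nearest lower rank that is a multiple of a chosen stride; the function and its parameters are illustrative, not Code_Saturne's.

    /* Return the rank that should own this rank's coarse grid once the mean
     * local grid size drops below a threshold. Illustrative assumption. */
    int
    coarse_grid_target_rank(int rank, long local_cells, long min_local_cells,
                            int gather_stride /* e.g. 4 or 8, one per node/socket */)
    {
      if (local_cells >= min_local_cells)
        return rank;                              /* grid still large enough */

      /* Gather onto the nearest lower multiple of the stride; the other ranks
       * in the group end up with empty grids, as described above. */
      return (rank / gather_stride) * gather_stride;
    }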

CrayPat: Multigrid Serial Performance
The profile at 128 cores shows 65% of the runtime in scalar numerical routines:

    Samp %   Samp    Imb. Samp   Imb. %   Group / Function
    100.0%   76111      --         --     Total
     64.6%   49177      --         --     USER
     21.7%   16544   2988.72     15.4%    _mat_vec_p_l_native
     10.9%    8284   1515.80     15.6%    _conjugate_gradient_mp
      7.9%    6034   2619.52     30.5%    _alpha_a_x_p_beta_y_native
      7.5%    5686   1357.38     19.4%    cblas_daxpy
      4.8%    3620   1044.47     22.6%    _polynomial_preconditionning
      4.6%    3477   2077.88     37.7%    gradrc_
      2.5%    1936    323.73     14.4%    _jacobi

All of these routines are targets for optimization. Most have already been optimised for the IBM and Bull platforms, e.g.:

    #if defined(__xlc__)
    #pragma disjoint(*x, *y, *da, *xa1, *xa2)
    #endif
    #if defined(ia64_optim)

This shows there is a way forward for increased performance targeting the AMD quad-core processors of the Cray XT; a portable equivalent is sketched below.
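
Where the IBM-specific #pragma disjoint is not available, for example with the GNU or PGI compilers used on the Cray XT's AMD quad-core processors, the C99 restrict qualifier conveys the same no-aliasing promise. A minimal sketch under that assumption, not code taken from Code_Saturne:

    #include <stddef.h>

    /* restrict tells the compiler x and y never alias, enabling the same
     * software pipelining that #pragma disjoint enables under xlc. */
    void
    daxpy_restrict(size_t n, double alpha,
                   const double *restrict x, double *restrict y)
    {
      for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
    }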

CrayPat: Multigrid Parallel Performance
Timing results show that it is not scaling. The profile at 128 cores shows 35% of the time in non-compute operations, while the profile at 512 cores shows only 11% in serial operations: either a load-balancing problem, or communication written inefficiently for an XT. Early profiles suggest collectives are dominating:

    Samp %   Samp    Imb. Samp   Imb. %   Group / Function
    100.0%   37132      --         --     Total
     35.7%   13272      --         --     MPI
     10.7%    3987    437.43      9.9%    MPI_Allreduce
      9.3%    3471   2243.79     39.3%    MPI_Waitall
      4.7%    1750    282.24     13.9%    MPI_Barrier
      4.6%    1714     12.16      0.7%    MPI_Recv
      3.8%    1418   1124.28     44.3%    MPI_Isend
      1.0%     356    215.20     37.8%    MPI_Irecv

Significant time is spent in MPI_Waitall, which is also imbalanced (3rd and 4th columns).

CrayPat Analysis: Message Exchange
- The predominant message-exchange routine is cs_halo_sync_var, called 180,000 times in 10 iterations
- The exchange is implemented with MPI_Isend and MPI_Irecv, with a global barrier between them to ensure each receive is posted before the matching send
- For better performance we will reorder the sends; it would also be better not to issue sends and receives when the message length is zero (see the sketch below)
- It would help further to have some work to do between posting the receives and the sends, so communication can proceed asynchronously
- In many cases, several calls to cs_halo_sync_var can be combined to send one message rather than four
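
A hedged sketch of the barrier-free restructuring described above: post all non-empty receives first, then all non-empty sends, and wait on both sets, so no global barrier is needed and zero-length messages are skipped. This is not the actual cs_halo_sync_var code; names and argument layout are assumptions.

    #include <mpi.h>

    void
    halo_sync_var(int n_neighbors, const int neighbor_rank[],
                  const int send_count[], double *const send_buf[],
                  const int recv_count[], double *const recv_buf[],
                  MPI_Comm comm)
    {
      MPI_Request req[2 * n_neighbors];   /* C99 variable-length array */
      int n_req = 0;

      /* Post receives first so every matching send finds a posted receive
       * without a global barrier. */
      for (int i = 0; i < n_neighbors; i++)
        if (recv_count[i] > 0)
          MPI_Irecv(recv_buf[i], recv_count[i], MPI_DOUBLE,
                    neighbor_rank[i], 0, comm, &req[n_req++]);

      /* Then post sends, skipping empty messages entirely. */
      for (int i = 0; i < n_neighbors; i++)
        if (send_count[i] > 0)
          MPI_Isend(send_buf[i], send_count[i], MPI_DOUBLE,
                    neighbor_rank[i], 0, comm, &req[n_req++]);

      /* Independent local work could be overlapped here before the wait. */
      MPI_Waitall(n_req, req, MPI_STATUSES_IGNORE);
    }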

CrayPat Analysis: Global Comms
- The synchronization time for the collectives is more significant than the collectives themselves
- We should be able to reduce the number of collectives: remove the barrier within cs_halo_sync_var, and collate consecutive global collectives
- This should save time spent in the collectives and also give an opportunity for overlapping (see the sketch below)
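
A minimal sketch of collating consecutive global collectives: rather than one MPI_Allreduce per dot product, the partial sums are packed into a short array and reduced in a single call. Illustrative only; the function name is an assumption.

    #include <mpi.h>

    /* Compute <r,r> and <r,z> with one collective instead of two. */
    void
    dot_products_rr_rz(int n_local, const double r[], const double z[],
                       double *rr, double *rz, MPI_Comm comm)
    {
      double local[2] = {0.0, 0.0}, global[2];

      for (int i = 0; i < n_local; i++) {
        local[0] += r[i] * r[i];
        local[1] += r[i] * z[i];
      }

      /* One latency-bound collective instead of two. */
      MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);

      *rr = global[0];
      *rz = global[1];
    }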

Parallelization of partitioning
- Version 1.4 is already prepared for parallel mesh partitioning: the mesh is read by blocks in a "canonical/global" numbering and redistributed using a cell-to-domain mapping
- All that is required is to plug in a parallel mesh partitioning algorithm to obtain an alternative cell-to-domain mapping; the redistribution infrastructure is already in place and in use (a sketch of the redistribution step follows)
- Possible choices: ParMETIS, PT-SCOTCH
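
A hedged sketch of the redistribution step referred to above: given a cell-to-domain (rank) mapping produced by any parallel partitioner, each rank counts how many of its block's cells go to each destination and exchanges the counts; the cell data itself would then follow with MPI_Alltoallv. Names are illustrative, not the code's own.

    #include <mpi.h>
    #include <stdlib.h>

    void
    redistribute_counts(int n_block_cells,
                        const int cell_domain[],   /* destination rank per cell */
                        MPI_Comm comm)
    {
      int size;
      MPI_Comm_size(comm, &size);

      int *send_count = calloc(size, sizeof(int));
      int *recv_count = malloc(size * sizeof(int));

      for (int i = 0; i < n_block_cells; i++)
        send_count[cell_domain[i]] += 1;

      /* Every rank learns how many cells it will receive from every other rank;
       * displacements and the MPI_Alltoallv of the cell data follow from this. */
      MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

      free(send_count);
      free(recv_count);
    }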

Parallel Partitioner Performance
[Figure: partitioning time (seconds) vs number of cores, up to 512 cores, for PT-SCOTCH and ParMETIS.]

I/O Overheads
[Figure, shown twice: Cray XT4 performance (arbitrary units) vs number of processors, up to 8192, for per-iteration time and total time against ideal scaling; the gap between the per-iteration and total-time curves is annotated as the I/O overhead.]

Parallel I/O (i)
- Version 1.4 introduces parallel I/O
- It uses block-to-partition redistribution when reading, and partition-to-block redistribution when writing
- Fully implemented for reading preprocessor and partitioner output, as well as for restart files; the infrastructure for postprocessor output is in progress

Parallel I/O (ii)
- Parallel I/O is only of benefit on parallel filesystems
- Use of MPI I/O may be disabled either when building the FVM library or, for a given file, using specific hints
- Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file functions with the MPI I/O subsystem switched out underneath (see the sketch below)
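
A minimal sketch of the MPI I/O path, assuming each rank writes one contiguous block at a known offset in a shared file; without MPI I/O the same call would instead route blocks through rank 0. Names are illustrative, not the FVM library's API.

    #include <mpi.h>

    void
    write_block_mpi_io(const char *path, MPI_Offset my_offset,
                       const double *block, int block_size, MPI_Comm comm)
    {
      MPI_File fh;

      MPI_File_open(comm, (char *)path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);

      /* Collective write: all ranks participate, each at its own offset,
       * letting the MPI library and the parallel filesystem aggregate the I/O. */
      MPI_File_write_at_all(fh, my_offset, block, block_size,
                            MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
    }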

Parallel I/O (iii)
- Prior to parallel I/O, a similar mapping of partitions to blocks was used, but blocks were assembled in succession on rank 0:
  - writing each block before assembling the next, to avoid requiring a very large buffer
  - enforcing a minimum buffer size, so as to limit the number of blocks when the data is small
- Otherwise we would be latency-bound and exhibit inverse scalability

Load Imbalance (1/3)
- RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
- Polyhedra arising from mesh joinings may lead to higher load imbalance in the local MatVec at large core counts:
  - 96,286 / 102,242 min/max cells at 1024 cores: 5.8% imbalance
  - 11,344 / 12,781 min/max cells at 8192 cores: 8.9% imbalance

Load imbalance (2/3)
- If load imbalance increases with processor count, scalability decreases
- If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, but processor power is wasted
- Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces), as sketched below
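
A hedged sketch of such partitioning weights; the slide only specifies weight = 1 + f(n_faces), so the particular f used here (face count scaled by a nominal 6 faces per hexahedral cell) is an invented example.

    /* Fill integer vertex weights for a graph partitioner so that cells with
     * many faces (e.g. polyhedra from mesh joining) count for more. The
     * baseline of 6 faces per cell is an assumption, not from the slides. */
    void
    compute_cell_weights(int n_cells, const int n_faces_per_cell[], int weight[])
    {
      const int n_faces_ref = 6;

      for (int i = 0; i < n_cells; i++)
        weight[i] = 1 + n_faces_per_cell[i] / n_faces_ref;
    }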

Load imbalance (3/3)
- Another possible source of load imbalance is differing cache-miss rates on different ranks, which are difficult to estimate in advance
- With otherwise balanced loops, if one processor takes a cache miss every 300 instructions and another every 400 instructions, and the cost of a cache miss is at least 100 instructions, the corresponding imbalance can reach 20%

Hybrid MPI / OpenMP (1/2)
- Currently, only the MPI model is used: by default everything is parallel, and synchronization is explicit where required
- On multiprocessor/multicore nodes, shared-memory parallelism could also be used (via OpenMP directives): parallel sections must be marked, and parallel loops must avoid modifying the same values
- Specific numberings must be used, similar to those used for vectorization but with different constraints: avoid false sharing and keep locality to limit cache misses (see the sketch below)
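
A hedged sketch of what such an OpenMP-friendly numbering enables: if faces are renumbered into groups such that no two faces in a group touch the same cell, each group's loop can be threaded without two threads updating the same entry. The grouping is assumed precomputed; names are illustrative, not Code_Saturne's.

    void
    mat_vec_faces_omp(int          n_groups,
                      const long   group_start[],  /* n_groups + 1 entries */
                      const long   face_cell[],    /* 2 cells per face     */
                      const double xa[], const double x[], double y[])
    {
      for (int g = 0; g < n_groups; g++) {
        /* Faces within a group share no cells, so this loop is race-free. */
        #pragma omp parallel for schedule(static)
        for (long f = group_start[g]; f < group_start[g + 1]; f++) {
          long i = face_cell[2*f];
          long j = face_cell[2*f + 1];
          y[i] += xa[f] * x[j];
          y[j] += xa[f] * x[i];
        }
      }
    }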

Hybrid MPI / OpenMP (2/2)
- EDF plans to test hybrid MPI/OpenMP on Blue Gene
- Pure OpenMP parallelism is also of interest for ease of packaging and installation on Linux distributions: no dependencies on a choice of MPI library, only on the gcc runtime, which is good enough for current multicore workstations
- Coupling with SYRTHES 4 will still require MPI

Code_Saturne - Summary
- Several projects exist (in addition to PRACE) to improve the performance of the code: pre-processing, mesh generation and mesh partitioning; improvements to the CFD solver; code optimization, particularly on the Cray XT4/XT5 (and Jaguar); parallel I/O; hybrid MPI/OpenMP
- Parallel performance of the existing code is very good, particularly for large problem sizes; we hope to benchmark the 100M-cell mixing grid for PRACE as soon as possible
- The introduction of multigrid has reduced scalability but improved overall performance
- Load balancing is difficult to perfect at large processor counts