Performance of the hybrid MPI/OpenMP version of the HERACLES code on the Curie «Fat nodes» system

1 Performance of the hybrid MPI/OpenMP version of the HERACLES code on the Curie «Fat nodes» system. Edouard Audit, Matthias Gonzalez, Pierre Kestener and Pierre-François Lavallée

2-3 The HERACLES code: a fixed-grid finite-volume code working in 1, 2 and 3D in Cartesian, cylindrical and spherical coordinates. Fortran + MPI, domain decomposition.
- (Magneto)hydrodynamics: finite volume, 2nd-order Godunov
- Explicit or implicit multigroup radiative transfer: moment method, implicit
- Gravity, fully coupled to the hydro / split
- Thermochemistry and/or heating/cooling function (local)
- Turbulent forcing (local)
Used in astrophysics (star formation, interstellar medium studies, ...) and to interpret laser-generated plasma experiments.

4 (figure-only slide, no text)

5 Domain decomposition: the computational domain is split into sub-domains, one per MPI process (figure showing four MPI processes).

6 Domain decomposition: sub-domain edges are either physical boundaries or internal interfaces handled by communications (figure).

7 The HERACLES code, structure of a run:
Read simulation parameters, split the domain over the MPI processes, set the initial conditions.
Loop over time (steps without an OpenMP tag are not multi-threaded):
- Fill the ghost cells: boundary conditions or communications
- Compute the time step
- Hydro step: OpenMP loop over chunks, OpenMP loop over cells (slope, Riemann solver, ...)
- Compute cooling (local): OpenMP
- Stirring (local): OpenMP
- Output
End
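A minimal hybrid MPI/OpenMP sketch of the structure above, assuming a cubic sub-domain and a placeholder cell update (illustrative code, not HERACLES source; nx, ny, nz and u are made-up names): one MPI rank owns one sub-domain and the cell loop of the hydro step is shared among OpenMP threads.

```fortran
! Minimal sketch of the hybrid pattern above -- illustrative code, not HERACLES
! source. One MPI rank owns one sub-domain; the cell loop of the hydro step is
! shared among OpenMP threads. nx, ny, nz, u and the update are placeholders.
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nx = 64, ny = 64, nz = 64
  integer :: ierr, rank, i, j, k
  real(8), allocatable :: u(:,:,:)

  call MPI_Init(ierr)                        ! one MPI process per sub-domain
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate(u(nx, ny, nz))
  u = real(rank, 8)

  ! Threaded cell loop, mirroring the "OpenMP loop over chunks / loop over cells"
  ! structure of the hydro step; each thread updates its own chunk of (k, j) columns.
  !$omp parallel do private(i, j, k) collapse(2)
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           u(i, j, k) = u(i, j, k) + 1.0d0    ! stand-in for the slope/Riemann update
        end do
     end do
  end do
  !$omp end parallel do

  call MPI_Finalize(ierr)
end program hybrid_sketch
```

Compile with the MPI wrapper plus the OpenMP flag (e.g. mpif90 -fopenmp) and choose the threads-per-rank count with OMP_NUM_THREADS.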

8 Pure MPI vs MPI/OpenMP: pure MPI exchanges 16 messages of size 1, while MPI + 4 threads exchanges 4 messages of size 2, i.e. fewer but larger messages for the same exchange.
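To make the trade-off concrete, here is a hedged 1D halo-exchange sketch (illustrative only, not the HERACLES communication layer; cells_per_thread and nthreads are made-up parameters): merging the work of several former MPI ranks into one hybrid rank enlarges the exchanged face while reducing the number of messages.

```fortran
! Illustrative 1D halo exchange -- not the HERACLES communication layer.
! Merging the work of nthreads former MPI ranks into one hybrid rank multiplies
! the face size by nthreads and divides the number of messages accordingly.
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: cells_per_thread = 1   ! face size owned by one thread
  integer, parameter :: nthreads = 4           ! OpenMP threads per MPI rank
  integer :: ierr, rank, nranks, left, right
  real(8) :: face_send(cells_per_thread*nthreads), face_recv(cells_per_thread*nthreads)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  left  = mod(rank - 1 + nranks, nranks)       ! periodic 1D neighbours
  right = mod(rank + 1, nranks)
  face_send = real(rank, 8)

  ! One message per neighbour; its size grows with the threads-per-rank count,
  ! so the hybrid run exchanges fewer but larger messages than pure MPI.
  call MPI_Sendrecv(face_send, size(face_send), MPI_DOUBLE_PRECISION, right, 0, &
                    face_recv, size(face_recv), MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program halo_sketch
```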

9 The Curie system (February 2011 / October 2011 / March 2012):
- Fat nodes: 360 BullX S6010, Intel Nehalem-EX 2.26 GHz, 32 cores/node, 128 GB/node, 105 TFlops
- Hybrid nodes: 144 Bullx B, Nvidia M, TFlops
- Thin nodes: 5040 BullX B510, Intel new generation (SNB), 16 cores/node, 4 GB/core, 128 GB SSD, 1.5+ PFlops
- Interconnect: InfiniBand QDR
- Storage: 6 PB, GB/s, 1st-level Lustre

10 The Curie system

11-13 Strong scaling (900³ run): plots comparing pure MPI with 2, 4 and 8 OpenMP threads per MPI process.

14-16 Weak scaling per node (32 cores): plots comparing pure MPI with 2, 4 and 8 OpenMP threads per MPI process.

17 Scaling on BlueGene-IDRIS (strong scaling)

18 I/O, the craftsman's way: all processes write their output at the same time, which fails beyond a few 10³ processes. Remedy: write by packets with a temporization, Ncpu_write processes writing at a time and a wait T_wait ~ 2-10 seconds between packets. One output then costs about 5 time steps.
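A hedged sketch of the write-by-packet idea (not the HERACLES implementation; ncpu_write and t_wait are illustrative values): ranks are grouped into packets and each packet delays its POSIX write, so only a limited number of processes hit the filesystem at once.

```fortran
! Hedged sketch of "write by packet + temporization" -- not the HERACLES code.
! Ranks are grouped into packets of ncpu_write; packet k delays its POSIX write
! by about k * t_wait seconds so only one packet hits the filesystem at a time.
program packet_io_sketch
  use mpi
  implicit none
  integer, parameter :: ncpu_write = 64        ! processes writing simultaneously (illustrative)
  real(8), parameter :: t_wait = 5.0d0         ! seconds between packets (illustrative)
  integer :: ierr, rank, mybatch, unit
  real(8) :: t0
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  mybatch = rank / ncpu_write                  ! which packet this rank belongs to

  t0 = MPI_Wtime()                             ! temporization: stagger the packets
  do while (MPI_Wtime() - t0 < mybatch * t_wait)
  end do

  write(fname, '(a,i6.6,a)') 'output_', rank, '.dat'
  open(newunit=unit, file=fname, form='unformatted', access='stream')
  write(unit) rank                             ! stand-in for the real data dump
  close(unit)

  call MPI_Finalize(ierr)
end program packet_io_sketch
```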

19 I/O, the professional approach (P. Wautelet and P. Kestener). Four different I/O approaches were tested: POSIX (one file per MPI process), MPI-IO, HDF5 and Parallel-NetCDF.
Step 1: optimize the MPI-IO hints. MPI-IO hints can have a dramatic effect on I/O performance and the best parameters depend on the application; 7 of the 23 available hints were tested.
Step 2: strong-scaling test.
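For illustration, MPI-IO hints are passed through an MPI_Info object at file-open time. The sketch below is generic, not the tested HERACLES configuration; the hint names romio_cb_write and cb_nodes are common ROMIO examples chosen only to show the mechanism.

```fortran
! Hedged sketch: passing MPI-IO hints through an MPI_Info object at file-open time.
! The hint names/values below are common ROMIO examples, for illustration only;
! the slide's point is that the best combination is application-dependent.
program mpiio_hints_sketch
  use mpi
  implicit none
  integer :: ierr, rank, info, fh
  integer(kind=MPI_OFFSET_KIND) :: disp
  integer :: buf(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, 'romio_cb_write', 'enable', ierr)   ! collective buffering
  call MPI_Info_set(info, 'cb_nodes', '16', ierr)             ! number of aggregators

  call MPI_File_open(MPI_COMM_WORLD, 'out.dat', &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, info, fh, ierr)

  buf  = rank
  disp = int(rank, MPI_OFFSET_KIND) * 4 * 4    ! 4 integers of 4 bytes per rank (assumed)
  call MPI_File_write_at_all(fh, disp, buf, 4, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program mpiio_hints_sketch
```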

20 I/O, the professional approach (P. Wautelet and P. Kestener), continued.

21 Conclusions:
- Multi-threading is necessary for large numbers of cores.
- OpenMP is easy to implement but not always easy to understand.
- Multi-threaded communications will probably be necessary.
- Good results for a small number of threads.
