Phase inversion problem: performances on EOS. Annaïg PEDRONO IMFT Service Codes et Simulations Numériques


1 Phase inversion problem: performances on EOS Annaïg PEDRONO IMFT Service Codes et Simulations Numériques

2 IMFT and CALMIP. IMFT and CALMIP: a partnership to improve code performance since [...]: IMFT consumed 24 million hours on Hyperion. Involvement in the renewal of CALMIP supercomputers: 1 CFD benchmark with 2 codes (JADIM and Neptune), in 2009 (Hyperion) and 2013 (EOS). The benchmark allows the specific needs of CFD to be taken into account in the choice of supercomputer. JADIM: from a serial code (2005) on «Soleil» to 4096 cores on «EOS» (2014).

3 Phase inversion problem. 3 codes (PARIS, Thetis, JADIM). A mesh of 134 million cells (512x512x512). Performance tests from 64 to 2048 cores. ~36 hours for each run using ~2000 cores. About 1 TB of generated data per simulation.
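As a rough check of the figures on this slide, the sketch below recomputes the cell count and the cost of one run in core-hours. The even split of cells across cores is an assumption made here for illustration; the slide does not describe how any of the three codes decomposes the domain.

```python
# Figures quoted on the slide (the run time and core count are approximate).
NX = NY = NZ = 512
HOURS_PER_RUN = 36
CORES_PER_RUN = 2000

# Total number of cells in the 512x512x512 mesh.
cells = NX * NY * NZ

# Cost of one run, expressed in core-hours.
core_hours = HOURS_PER_RUN * CORES_PER_RUN


def cells_per_core(cores):
    """Cells handled by each core, ASSUMING a perfectly even split."""
    return cells / cores


if __name__ == "__main__":
    print(f"cells            : {cells:,}")
    print(f"core-hours / run : {core_hours:,}")
    for n in (64, 2048):
        print(f"{n:5d} cores -> {cells_per_core(n):,.0f} cells per core")
```

Under that assumption, the load per core shrinks by a factor of 32 between the smallest and largest test, which is worth keeping in mind when reading the scalability results.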

4 Code performances. How to compare the performance of different codes? Total CPU time needed to reach 25 seconds of physical time for one configuration of the phase inversion problem. Speed-up comparison from 64 to 2048 cores. Velocity Z = number of computed nodes / elapsed time for one iteration.
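The metrics listed on this slide can be sketched in a few lines. The function names are hypothetical, and measuring speed-up against the smallest run (64 cores) rather than a serial run is an assumption based on the tested range; the timings in the usage lines are made up purely to show the calls.

```python
def speed_up(t_ref, t_n):
    """Speed-up of a run taking t_n seconds relative to a reference run t_ref."""
    return t_ref / t_n


def efficiency(t_ref, n_ref, t_n, n):
    """Parallel efficiency: measured speed-up divided by the ideal one (n / n_ref)."""
    return speed_up(t_ref, t_n) / (n / n_ref)


def velocity_z(computed_nodes, elapsed_per_iteration):
    """Velocity Z as defined on the slide: computed nodes per second of one iteration."""
    return computed_nodes / elapsed_per_iteration


def velocity_z_per_core(computed_nodes, elapsed_per_iteration, cores):
    """Velocity Z divided by the number of cores."""
    return velocity_z(computed_nodes, elapsed_per_iteration) / cores


if __name__ == "__main__":
    # Hypothetical timings, NOT measurements from the presentation.
    t64, t256 = 100.0, 30.0
    print("speed-up   :", speed_up(t64, t256))
    print("efficiency :", efficiency(t64, 64, t256, 256))
    print("velocity Z :", velocity_z(512 ** 3, t256))
```

Speed-up only describes how a code scales against itself, whereas velocity Z gives an absolute throughput that can be compared across codes.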

5 Speed-up. [Figure: speed-up curves for JADIM and PARIS, with the ideal speed-up for reference.] Drawback: this compares the scalability of each code without any information on serial performance.

6 Velocity Z / number of cores. [Figure: velocity Z per core for PARIS and JADIM.] Time step: PARIS 0.5e-3 s, JADIM 1.5e-3 s. Drawback: different physical time steps, different numerical methods, different precisions, ...

7 JADIM: performances on EOS. Comparison EOS / Hyperion. Profiling. Effect of the stopping criterion value on the Poisson solver. JADIM improvement (tuning, memory, visualisation).

8 Comparison Hyperion / EOS. [Figure: elapsed time for 10 iterations versus number of cores, for EOS and Hyperion.] 20% gain between Hyperion and EOS; same scalability up to 512 cores. EOS: Ivy Bridge processors, 2.8 GHz, Intel MPI, opt = -O3 -xavx. Hyperion: Nehalem processors, 2.8 GHz, SGI MPT, opt = -O3 -xsse4.2.

9 CPU time distribution. [Figure: share of CPU time (0% to 100%) for the pressure solver and Runge-Kutta; bar labels: 89.6% for rtol = 1.e-5 and 97.5% for rtol = 1.e-8.] Most of the CPU time is spent in the Poisson solver (solved with PETSc: CG preconditioned with block Jacobi). rtol is the residual convergence stopping criterion of the CG method: $\frac{\lVert b - A x_k \rVert}{\lVert b \rVert} = \frac{\lVert r_k \rVert}{\lVert b \rVert} < \mathrm{rtol}$
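To illustrate how the stopping criterion drives the cost of the pressure solve, here is a minimal pure-Python sketch of a conjugate gradient solver preconditioned with block Jacobi and stopped on the relative residual. It is neither JADIM's code nor PETSc's implementation: the 1D model matrix tridiag(-1, 2, -1), the block count and all function names are assumptions chosen to keep the example small.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def laplacian_matvec(x):
    """y = A x for the 1D model matrix A = tridiag(-1, 2, -1) (an assumption)."""
    n = len(x)
    y = [2.0 * xi for xi in x]
    for i in range(n):
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def solve_tridiag_block(r):
    """Solve tridiag(-1, 2, -1) z = r exactly with the Thomas algorithm."""
    m = len(r)
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -0.5
    d[0] = r[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (r[i] + d[i - 1]) / denom
    z = [0.0] * m
    z[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def block_jacobi(r, nblocks):
    """Apply M^-1, where M keeps only the diagonal blocks of A."""
    n = len(r)
    z = []
    for b in range(nblocks):
        lo, hi = b * n // nblocks, (b + 1) * n // nblocks
        z.extend(solve_tridiag_block(r[lo:hi]))
    return z


def pcg(b, rtol, nblocks=4, max_iter=10000):
    """Preconditioned CG; stops when ||r_k|| / ||b|| < rtol. Returns (x, iterations)."""
    x = [0.0] * len(b)
    bnorm = norm(b)
    if bnorm == 0.0:
        return x, 0
    r = list(b)                       # residual for the zero initial guess
    z = block_jacobi(r, nblocks)
    p = list(z)
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        ap = laplacian_matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if norm(r) / bnorm < rtol:    # the stopping criterion from the slide
            return x, k
        z = block_jacobi(r, nblocks)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    rhs = [1.0] * 256
    for tol in (1e-5, 1e-8, 1e-10):
        _, iters = pcg(rhs, tol, nblocks=16)
        print(f"rtol = {tol:g}: {iters} iterations")
```

Because the iterates are identical whatever the tolerance, a tighter rtol can only stop later, never earlier; each extra iteration costs one matrix-vector product, one preconditioner application and, in a parallel setting, additional global reductions for the dot products.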

10 Effects of iterative solver stopping criteria. [Figure: elapsed time per iteration (s), starting from the initial state, for rtol = 1e-10, 1e-9, 1e-8, 1e-7, 1e-6 and 1e-5.] Elapsed time is not constant from one iteration to the next, which could explain the variability of performance tests.

11 [Figure: elapsed time per iteration (s, axis from 0 to 160) over 10 iterations; curves: from iteration 0 with rtol = 1e-10, from iteration 4000 with rtol = 1e-10, from iteration 0 with rtol = 1e-5, from iteration 4000 with rtol = 1e-5.] Performance tests give different results depending on the physics (iteration 0 versus iteration 4000).

12 Mean error between rtol 1e-5 and 1e-12. [Figure: mean cumulative error for rtol = 1.e-5 versus iteration number (axis from 0 to 6.0E-07); curves: velocity U, velocity V, velocity W, volume fraction.] rtol = 1.0e-5 => mean absolute error = 1e-4 on the volume fraction after [...] iterations.
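One plausible way to measure such an error is the mean absolute difference between the field computed with the loose tolerance and a reference field computed with the tight one. The slide does not give the exact definition used, so the sketch below is an assumption, and the function name and sample values are hypothetical.

```python
def mean_absolute_error(field, reference):
    """Mean of |field[i] - reference[i]| over all points of two flattened fields."""
    if len(field) != len(reference):
        raise ValueError("fields must have the same number of points")
    if not field:
        raise ValueError("fields must not be empty")
    return sum(abs(a - b) for a, b in zip(field, reference)) / len(field)


if __name__ == "__main__":
    # Made-up volume fraction values, NOT data from the presentation.
    loose = [0.0, 0.5001, 1.0, 0.2499]   # stands in for the rtol = 1e-5 run
    tight = [0.0, 0.5000, 1.0, 0.2500]   # stands in for the rtol = 1e-12 run
    print(mean_absolute_error(loose, tight))
```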

13 JADIM improvement thanks to the workshop. Tuning: which MPI? Effects of placement. Memory: on 512 cores on Hyperion, the mesh required more than 4 GB per core; this is no longer possible on EOS, hence work to save memory space. Visualisation: visualisation of large data with Tecplot; post-processing with Blender to get realistic pictures.

14 BullX MPI vs Intel MPI? PETSc makes heavy use of MPI_Allreduce => Intel MPI is better than BullX MPI for this operation. The gap widens as the number of cores increases. Elapsed time for 20 iterations of the phase inversion problem (20 cores/node):

Cores   Nodes   Intel MPI   BullX MPI
512     26      413 s       450 s
1024    52      227 s       268 s
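The widening gap can be read directly from these timings. The short sketch below computes how much longer BullX MPI takes relative to Intel MPI at each core count, and how efficiently each library scales when the core count doubles; the function names are hypothetical, and the numbers are the ones quoted on the slide.

```python
# Elapsed time (s) for 20 iterations, as quoted on the slide.
TIMINGS = {
    512: {"Intel MPI": 413.0, "BullX MPI": 450.0},
    1024: {"Intel MPI": 227.0, "BullX MPI": 268.0},
}


def relative_overhead(cores):
    """Extra time taken by BullX MPI, as a fraction of the Intel MPI time."""
    t = TIMINGS[cores]
    return (t["BullX MPI"] - t["Intel MPI"]) / t["Intel MPI"]


def scaling_efficiency(library):
    """Efficiency of going from 512 to 1024 cores (1.0 would be ideal)."""
    return TIMINGS[512][library] / (2.0 * TIMINGS[1024][library])


if __name__ == "__main__":
    for cores in sorted(TIMINGS):
        print(f"{cores:5d} cores: BullX MPI is {relative_overhead(cores):.1%} slower")
    for library in ("Intel MPI", "BullX MPI"):
        print(f"{library}: 512 -> 1024 efficiency {scaling_efficiency(library):.1%}")
```

With these figures, the overhead roughly doubles, from about 9% on 512 cores to about 18% on 1024 cores.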

15 Effects of placement. [Figure: efficiency versus nodes consumed (1 node = 20 cores), for [...] and 4096 MPI processes; placements: compact, 10 cores/10 and 8 cores/10.]

16 Memory / core. [Figure: memory per core in GB (axis from 0 to 4) versus number of cores; series: before improvement (Hyperion), after improvement (EOS), residual memory per core.] But we can do better: Thetis uses only ~400 MB per core on 1800 cores!

17 Visualisation. Before this workshop, visualisation was performed on a coarsened mesh (1 point out of 2). All points are now visualised thanks to a new version of Tecplot and a new binary format (F. Auguste, Tecplot webinar coming soon). STL files are exported with ParaView for use in Blender. A new remote visualisation solution on EOS (TurboVNC + VirtualGL) will be tested next week.

19 Thank you for your attention. Thanks a lot to the CALMIP team!
