1 Advanced MD performance tuning. 15/09/2017, High Performance Molecular Dynamics, Bologna
2 General strategy for improving performance
- Request CPU time from an HPC centre and investigate what resources are available. Try to understand what these resources are and how they can be applied to your simulations. Can I use GPUs? Multi-core processors?
- Choose an MD program taking into account performance on the available computer system, not just functionality or scientific relevance. Read the manual, ask for technical help, etc. before even setting up the simulation.
- Run a few sample simulations and read the program output for any hints on how to improve performance.
- When the parameters are ready, perform scaling tests to determine the optimum number of nodes to use (see the sketch below). CPU budgets are limited, so the best option may not be the one with the highest performance.
- As the project progresses, be prepared to modify and test new options according to the results. For example, the system may become inhomogeneous, or you may need to apply restraints, which could affect performance and the parallelisation options.
- Make sure the results are still correct! Be careful about modifying cut-offs or other options which may affect the correctness of the simulation.
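A minimal sketch of such a scaling test, assuming an MPI-enabled mdrun in the PATH, 36 cores per node (Broadwell) and a hypothetical input file topol.tpr; it reruns the same benchmark on increasing node counts and records the ns/day figure that GROMACS prints at the end of its log:

# Scaling-test sketch: same .tpr on 1, 2, 4, ... nodes, collect ns/day from the log.
# Adapt the launcher and core count to your batch system and hardware.
CORES_PER_NODE=36
for nodes in 1 2 4 8 16 32; do
    ntasks=$((nodes * CORES_PER_NODE))
    mpirun -n $ntasks mdrun -s topol.tpr -deffnm scaling_${nodes}
    # md.log ends with a line such as "Performance:  12.3  1.95" (ns/day, hour/ns)
    perf=$(grep -m1 "Performance:" scaling_${nodes}.log | awk '{print $2}')
    echo "${nodes} nodes: ${perf} ns/day"
done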
3 Some performance options to consider - GROMACS
GROMACS mdrun options:
- -npme: number of ranks dedicated to the PME calculation. For n total ranks, -npme = n/4 for n > 16 (but check the manual; see also tune_pme).
- -gcom: interval (in steps) between global communication of energies.
- -nstlist: neighbour list update frequency (default: 10).
- -resethway: reset the timers halfway through the run (for benchmarks).
- -noconfout: switches off output of the final configuration (benchmarks only).
- -dlb: dynamic load balancing (default: auto).
An example launch line combining these options is sketched below.
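A minimal sketch of how these flags might be combined on the command line; the rank counts and file name are illustrative assumptions, not values taken from the slides:

# 16 PME-dedicated ranks out of 64 (roughly n/4), global energy communication
# every 20 steps, timers reset halfway and no final configuration written.
export OMP_NUM_THREADS=1
mpirun -n 64 mdrun -s topol.tpr -npme 16 -gcom 20 -nstlist 20 \
       -dlb yes -resethway -noconfout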
4 Some performance options to consider - NAMD
- fullElectFrequency: number of timesteps between full electrostatic evaluations (default: the non-bonded frequency), e.g. calculate the long-range electrostatics every 4 fs.
- rigidBonds: controls how SHAKE is used (default: none). If set to water, the H-O and H-H distances in water molecules are constrained.
- outputEnergies: frequency of energy output. Very frequent output will slow the simulation (especially on GPUs).
- PMEProcessors: number of processors for the FFT and reciprocal sum.
- numinputprocs, numoutputprocs: parallel I/O options for very large simulations.
- useCompressedPsf: use compressed .psf files with the memory-optimised build of NAMD for very large simulations.
A sketch of how some of these appear in a NAMD configuration file is given below.
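As a sketch, these keywords would be added to the NAMD configuration file roughly as follows; the numerical values are illustrative assumptions for a 2 fs timestep, not recommendations from the slides:

# Append illustrative tuning keywords to a NAMD configuration file.
cat >> tuning.namd <<'EOF'
fullElectFrequency  2      ;# full electrostatics every 2 steps (4 fs at 2 fs/step)
rigidBonds          water  ;# constrain H-O and H-H distances in water molecules
outputEnergies      500    ;# print energies infrequently to avoid slowing the run
EOF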
5 Test case 1: performance and scaling of lignocellulose with GROMACS. It was reported at a conference that GROMACS on an Omni-Path network gave much worse performance than on an InfiniBand network. Can this be true? If so, why, given that Omni-Path should be very highly optimised for programs like GROMACS? Time to investigate...
6 Simulations of lignocellulose on Marconi/Broadwell. The LignoCellulose-rf benchmark used in the study and in the PRACE benchmark suite is relatively large (~3M atoms) and uses reaction-field instead of PME electrostatics. The .tpr input file is available from the PRACE website (e.g. as a tar.gz archive). It is not clear from the publication how the simulations were run, only that optimised parameters were used.
7 Simulations of lignocellulose on Marconi/Broadwell. First attempts using default GROMACS options and no OpenMP threads. The results do not look so bad, but compared with the published data they are really poor: a maximum performance of 12 ns/day instead of the reported figure. [Plot: GROMACS lignocellulose performance (ns/day) vs. number of nodes on Marconi A1, standard options.]
8 Simulations of lignocellulose on Marconi/Broadwell. Perhaps the output from GROMACS can help:

    NOTE: 74 % of the run time was spent communicating energies,
          you might want to use the -gcom option of mdrun

                   Core t (s)   Wall t (s)        (%)
           Time:        ...          ...          ...
                     (ns/day)    (hour/ns)
    Performance:        ...          ...
    Finished mdrun on rank 0 Thu Apr 27 18:04

Big clue here. The simulation is very large, so communication is going to be important. An online search reveals other possible options for large simulations:
- -resethway (reset time counters)
- -noconfout (don't output the final configuration)
- -nstlist (neighbour list update frequency)
9 Simulations of lignocellulose on Marconi/Broadwell. Try again with these options:

    mpirun -n <tasks> mdrun -v -s topol.tpr -gcom 20 -resethway -noconfout -nstlist <value>

Much better, although still slightly lower performance than reported for InfiniBand. [Plot: performance (ns/day) vs. number of nodes on Marconi A1, standard vs. optimised options.]
10 Simulations of lignocellulose on Marconi/Broadwell. We might be able to do better, but it makes sense to measure the performance with some performance tools. GROMACS has been compiled with Intel MPI on Marconi, so we can use the Intel performance tools to profile the program. Intel Trace Analyzer and Collector (ITAC) is easy to use because the original program does not need to be recompiled:

    source $INTEL_HOME/itac_2017/bin/itacvars.sh
    export OMP_NUM_THREADS=1
    mpirun -trace -n 32 mdrun -v -s topol.tpr -resethway -noconfout -gcom 20
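Once the traced run finishes, the resulting trace can be inspected with the Trace Analyzer GUI; a minimal sketch, assuming the trace file is named after the binary (the default behaviour of -trace):

# Open the trace produced by the -trace run; the flat profile and event
# timeline show how much time is spent in each MPI call as the node count grows.
traceanalyzer ./mdrun.stf &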
11 Simulations of lignocellulose on Marconi/Broadwell: performance analysis, 1 node. Very little MPI use; this is good because it means the cores are not wasting time communicating. Mainly point-to-point communication (MPI_Sendrecv), but also some collectives (MPI_Bcast).
12 Simulations of lignocellulose on Marconi/Broadwell: performance analysis, 32 nodes.
13 Simulations of lignocellulose on Marconi/Broadwell: performance analysis, 128 nodes. The program time is very heavily dominated by MPI calls, particularly collective calls (MPI_Bcast).
14 Simulations of lignocellulose on Marconi/Broadwell: performance analysis. The ITAC results show not only that communication is, as expected, important, but also which MPI calls are involved (MPI_Bcast). Intel MPI gives the possibility of changing the algorithm used for particular MPI calls; I_MPI_ADJUST_BCAST selects the MPI_Bcast algorithm:
1. Binomial
2. Recursive doubling
3. Ring
4. Topology-aware binomial
5. Topology-aware recursive doubling
6. Topology-aware ring
7. Shumilin's
8. Knomial
9. Topology-aware SHM-based flat
10. Topology-aware SHM-based Knomial
11. Topology-aware SHM-based Knary
15 GROMACS performance as a function of the MPI broadcast algorithm. [Plot: lignocellulose performance (ns/day) on 120 nodes vs. Intel MPI_Bcast algorithm.]

    export I_MPI_ADJUST_BCAST=<algorithm>   (0 = default)

In this example I_MPI_ADJUST_BCAST=3 gives a small performance boost, but the default (0) is still OK. A sweep over the algorithms is sketched below.
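A minimal sketch of such a sweep, assuming the 120-node case from the plot, 36 cores per node and the same tuned launch line as before:

# Re-run the same benchmark with each MPI_Bcast algorithm and compare ns/day.
ntasks=$((120 * 36))   # 120 Broadwell nodes, assumed 36 cores each
for alg in 0 1 2 3 4 5 6 7 8 9 10 11; do
    export I_MPI_ADJUST_BCAST=$alg
    mpirun -n $ntasks mdrun -s topol.tpr -deffnm bcast_${alg} -gcom 20 -resethway -noconfout
    perf=$(grep -m1 "Performance:" bcast_${alg}.log | awk '{print $2}')
    echo "I_MPI_ADJUST_BCAST=${alg}: ${perf} ns/day"
done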
16 Test case 2: NAMD on KNL. NAMD, as well as offering similar functionality to GROMACS, is also highly optimised and shows good parallel scalability. The programming model, though, is rather particular: it is not based directly on MPI but on a library called Charm++ (don't confuse it with the CHARMM force field/MD program). For Intel KNL processors, NAMD and Intel suggest the SMP (Symmetric Multi-Processor) build of NAMD, based on a mixed task/thread-like parallelisation (similar to, but not the same as, MPI/OpenMP). For multi-node runs it is essential to allocate cores for communication. Unfortunately, the NAMD-SMP syntax is complicated (see the sketch below):

    mpirun -n $nodes -perhost 1 namd2.smp +ppn 134 +pemap <core map> +commap 67 stmv.namd

where -n $nodes is the number of KNL nodes, -perhost 1 gives one task per node, +ppn 134 is the number of worker threads per node, +pemap is the thread-to-core mapping, and +commap 67 selects the core dedicated to communication; for many nodes the number of communication cores needs to be increased.
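A minimal sketch of a full launch, assuming a 68-core KNL node with two hardware threads used per core: 134 worker threads on physical cores 0-66 and physical core 67 reserved for the communication thread. The node count and the exact +pemap string are assumptions and must match the local logical-core numbering:

# NAMD-SMP launch sketch on KNL (one Charm++ process per node):
#   +ppn 134    -> 134 worker threads per node
#   +pemap ...  -> assumed mapping: logical CPUs 0-66 plus their second hw threads
#   +commap 67  -> dedicate core 67 to the communication thread
nodes=16
mpirun -n $nodes -perhost 1 namd2.smp \
       +ppn 134 +pemap 0-66+68 +commap 67 \
       stmv.namd > stmv_${nodes}nodes.log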
17 NAMD KNL results. [Plots: STMV with NAMD 2.12 on Marconi Broadwell vs. KNL, and ApoA1 on Broadwell vs. KNL; performance (ns/day) vs. number of nodes.] For these benchmarks we used the same options throughout, so especially for few nodes the performance may not be the optimum.
18 NAMD KNL - summary. For Intel KNL you need to use the NAMD-SMP build. Our results show that for NAMD, using KNL instead of Broadwell or similar only makes sense for very large systems (e.g. millions of atoms), although careful tuning of the Charm++ options may give better results. There is no advantage in using KNL flat mode instead of cache mode.
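For reference, testing flat mode typically amounts to preferring the MCDRAM NUMA node when launching; a minimal sketch, assuming MCDRAM is exposed as NUMA node 1 (as on KNL in quadrant/flat configuration) and the same launch parameters as above:

# In flat mode the 16 GB MCDRAM appears as a separate NUMA node (assumed node 1).
# --preferred falls back to DDR4 once MCDRAM is full; --membind would fail instead.
mpirun -n $nodes -perhost 1 numactl --preferred=1 namd2.smp \
       +ppn 134 +pemap 0-66+68 +commap 67 stmv.namd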
19 Case study 3: GROMACS DPPC on Intel Skylake. For an Omni-Path network (e.g. Marconi) Intel recommends reserving some cores per node to drive the network. On Broadwell only a slight difference was observed, but for Skylake the difference is significant! [Plot: DPPC performance (ns/day) vs. number of nodes, 46 cores/node vs. 48 cores/node.] (No data for 46 cores/node where GROMACS cannot perform the domain decomposition, as 23 is a prime factor.) A sketch of the two launch variants is given below.
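A minimal sketch of what reserving cores for the network looks like in practice on a 48-core Skylake node, using Intel MPI's -ppn (ranks per node) flag; the node count is illustrative:

nodes=8   # illustrative node count
# Full node: 48 MPI ranks per node, every core doing MD work
mpirun -ppn 48 -n $((nodes * 48)) mdrun -s topol.tpr -resethway -noconfout
# Reserved cores: 46 MPI ranks per node, leaving 2 cores free to drive Omni-Path
mpirun -ppn 46 -n $((nodes * 46)) mdrun -s topol.tpr -resethway -noconfout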
20 Potassium channel Kir3.2, Marconi A1
21 Potassium channel (Kir3.2), Galileo
22 Potassium channel (Kir3.2), Galileo
23 Summary of GROMACS benchmarks on K80 GPUs. [Table: run, MPI ranks, OMP threads, GPUs, ns/day.]
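For context, a GROMACS run on a node with two visible K80 GPUs is typically launched by pairing MPI ranks with OpenMP threads and mapping ranks to GPU ids; a minimal sketch with assumed rank and thread counts, not the values from the benchmark table:

# 4 MPI ranks x 4 OpenMP threads; the -gpu_id string assigns one GPU id per rank,
# here two ranks on GPU 0 and two on GPU 1 (GROMACS 5.x/2016 syntax).
export OMP_NUM_THREADS=4
mpirun -n 4 mdrun -s topol.tpr -ntomp 4 -gpu_id 0011 -resethway -noconfout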
24 Lipid Bilayer (DPPC)
25 Lipid Bilayer (DPPC)
26 Marconi A3
More information