Divide-and-Conquer Molecular Simulation:

Size: px

Start display at page:

Download "Divide-and-Conquer Molecular Simulation:"

Ambrose Todd
5 years ago
Views:

1 Divide-and-Conquer Molecular Simulation: GROMACS, OpenMM & CUDA ISC Erik Lindahl CBR Center for Biomembrane Research Stockholm University

2 These will soon be small computers ~2024: 1B cores 2022: ~300M cores 2020: ~100M cores 2018: ~30M cores 2016: ~10M cores 2014: ~3M cores 2012: ~1M cores 2010: ~300,000 cores How will YOU use a billion cores?

3 We re all doing Embarrassing Parallelism But not the way you think. We re investing huge efforts in parallelizing algorithms that only reach 50-75% scaling efficiency on large problems Not a chance they will scale to 1B cores Close-to-useless for smaller problems of commercial interest 100% focus on programs, forget the problem we re solving Ask taxpayers to foot the bill Pretty much the definition of embarrassing?

4 Molecular Dynamics Free Energy & Drug Design Protein Folding Understand biology GROMACS

As a comparilts from all 36 center chains. son, the performance of PME is shown. Figure 11. Weak scaling o XT5 with FFT implemen Information, A.3. The 3.3 588 128 128 FFT.

5 Figure 9. Strong scaling of 3.3 million atom biomass system on Jaguar Cray XT5 with RF. With 12ω288 cores the simulation(a) results from all 36 origin c otentials of mean force for the primary alcohol dihedral ) O6-C6-C5-C4: produces 27.5 ns/day and runs at 16.9TFlops. As a comparilts from all 36 center chains. son, the performance of PME is shown. Figure 11. Weak scaling o XT5 with FFT implemen Information, A.3. The FFT. T FFT step is represented b Scaling as an Obsession? Gromacs has scaled shorter than for NAMD w to 150kthecores onbenchma aim of this treatments a ORNL applications. Two differe because a direct compari electrostatics methods wi possible: NAMD, which i.5 ns/day and runs at 16.9TFlops. As a compariusing PME on Cray XT, formance of PME is589 shown. and GROMACS does prior to the improvement. The new implementation of theof complex-to-complex Figure 11. Weak scaling FFTnot on co 590 load balancing is now part of the GIT version of the XT5 with FFT implemented implemented, as describedwith in the Supp using GROMACS current 591 GROMACS and will also be included in GROMACS 4.1. Information, A.3. The 3.3 million atom system require for large systems (for m To highlight the computational benefit of using RF,The time the FFT. required to comput 593 the scaling of a simulation of the 3.3 million lignocellulose FFT step is represented by tf. Information). The significant differen 594 system using the PME method is also shown in Figure 9. PME However, and the RFweelectro 595 The PME simulation was run using NAMD, since this MD shorter than for NAMD with PME. stres 57 Figure 9, can be understo 596 application is known to have good parallel efficiency. To the aim of this benchmark is a comparison betwee of the parallel FFT requir 597 ensure a fair comparison between the two electrostatics electrostatic treatments and not between the differen In weak scaling, the ratio 598 methods, some of the parameters of the PME simulation were applications. Two different applications were used s of cores used in the simul 599 adjusted to improve its performance (standard 2 fs time step because a direct comparison ofa new simulations using imp dif and improved for RF atom and 6 system fs full electrostatics time step and neighbor600 million Strong scaling of 5.4 on is present which are presented in Section 2.1 formethods details). with In oneofapplication 601 list distance update for PME, seeelectrostatics Figure 10. Strong scaling of 5.4 million atom system on trong scaling of 3.3 million atomcray biomass system Jaguar XT5. With cores, 28 ns/day and 33TFlops ay XT5 with RF. With 12 are 288achieved. cores the simulation Only gigantic systems scale - limited number of applications M-100M atoms But: Small systems won t scale to large numbers of cores! How shall we break this impasse?

6 Stream computing First Gromacs GPU project in 2002 with Ian Buck & Pat Hanrahan, Stanford Promise of theoretical high FP performance on GeForce4 Severe limitations in practice... More general than mere hardware acceleration! Fermi: 448 cores TFLOP-desktops! Think in Parallel Problems! ScalaLife Life Science on next-gen hardware

7 Ensemble Simulation Membrane Protein Insertion Free Energy Vesicle Fusion (1M atoms) 20 Gln Ala 10 V (kcal/mol) His Arg 4, ns simulations Hard statistics on fusion rates Box (Å) Every dot is a simulation! Anna Johansson, Peter Kasson

0 0 32 64 96 0 32 64 96 0 32 64 96 32 64 96 All-vs-all (CUDA

8 Molecular Dynamics with CUDA Absolute performance critical, not speedup relative to a reference implementation! All-vs-all (CUDA book) Newton s 3rd law Sort atoms in tiles N^2 (N^2)/2 N log N Scott LeGrand, Peter Eastman

MD Performance in CUDA Not sufficient to accelerate nonbonded interactions - need to send to/from GPU Prefer to do entire simulation on GPU Don t rewrite 2M lines-of-code

9 MD Performance in CUDA Not sufficient to accelerate nonbonded interactions - need to send to/from GPU Prefer to do entire simulation on GPU Don t rewrite 2M lines-of-code ~5GB/s in a separate CUDA-Gromacs... OpenMM: Core MD functionality in separate library Stanford, Stockholm, Nvidia & AMD Fully public API, hardware-agnostic, use anywhere

10 Gromacs & OpenMM in practice GPUs supported in Gromacs 4.5 mdrun... -device OpenMM:Cuda Same input files, same output files: It just works Subset of features work on GPUs for now (checked) No shortcuts taken on the GPU: At least same accuracy as on the CPU (<1e-6) Potential energies calculated, free energy works Prerelease availability: NOW!

11 Fermi (C20) performance over C10 BPTI (~21k atoms) Villin (600 atoms, implicit) PME Implicit Reaction-field All-vs-all 1500 ns/day C We re conservative: 2fs time steps! s/day for small proteins! C2050 A millisecond would take ~1 year

12 GPU performance over x86 CPU PME Implicit Reaction-field All-vs-all 1500 ns/day Quad x GHz C2050

sometimes be bad Even fine cards can exhibit random errors For production

13 Caveats/Limitations Beware of Memory Errors: happens on all hardware Gromacs runs tests to check for GPU memory errors Low-end consumer cards can sometimes be bad Even fine cards can exhibit random errors For production scientific work we strongly recommend Fermi Tesla C20 cards Why? ECC memory! (C2050/C2070)

Think Massively Parallel - Scale the Problem,

freedom over 1 billion processors Parallelize in

14 Think Massively Parallel - Scale the Problem, not Runs Stream Computing is the future Everything is statistical mechanics! No algorithm will parallelize 5000 degrees of freedom over 1 billion processors Parallelize in the problem domain instead CUDA provides amazing performance on each node

15 Acknowledgments GROMACS: Berk Hess, David van der Spoel, Per Larsson OpenMM: Rossen Apostolov, Szilard Pall, Peter Eastman, Vijay Pande Nvidia: Scott LeGrand, Duncan Poole, Andrew Walsh, Chris Butler Ensemble Simulations: Peter Kasson!

Molecular Simulation with GROMACS on CUDA GPUs Erik Lindahl

Webinar 20130404 Molecular Simulation with GROMACS on CUDA GPUs Erik Lindahl GROMACS is used on a wide range of resources We re comfortably on the single-μs scale today Larger machines often mean larger