Heterogeneous CPU+GPU Molecular Dynamics Engine in CHARMM

1 Heterogeneous CPU+GPU Molecular Dynamics Engine in CHARMM
25th March, GTC 2014, San Jose, CA
Antti-Pekka Hynninen
NREL is a national laboratory of the U.S. Department of Energy, Office of Energy Efficiency and Renewable Energy, operated by the Alliance for Sustainable Energy, LLC.

2 What is CHARMM?
Chemistry at HARvard Molecular Mechanics
o Molecular simulation package with a wide range of features, force fields, and analysis methods
o Started in the late 1960s by Martin Karplus
o Network of developers all over the world
o Mostly Fortran 90 code
Recently re-written CPU Molecular Dynamics (MD) engine is now competitive in performance with other MD packages
o A.-P. Hynninen, M. F. Crowley, Journal of Computational Chemistry, 35, 406 (2014)

3 CHARMM approach to GPU MD engine
Incremental
o Started as a simple offload of the non-bonded force calculation
o Unable to rewrite the entire body of Fortran code
Limited to Particle Mesh Ewald (PME) simulations
o This is what most users want
o Unable to support all CHARMM features from the get-go
Modular
o Core routines written as a standalone C++ library
o Easy switch to a different accelerator architecture (Intel Xeon Phi, AMD GPU)
CPU code threaded with OpenMP
MPI used for multi-CPU/GPU support

4 Domain decomposition method
Simulation box is divided into sub-boxes of size bx x by x bz
A single MPI task is assigned to each sub-box
The MPI task is responsible for updating the coordinates within the sub-box (local box)
In order to calculate the forces, we need coordinates from an import volume
Eighth-shell method by D. E. Shaw Research*
Import volume extends Rcut away from the local-box boundary in the positive x, y, and z directions
[Figure: local box and import volume]
*K. J. Bowers, R. O. Dror, and D. E. Shaw, J. Comput. Phys., 221, p. 303 (2007).
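Below is a minimal C++ sketch (not taken from CHARMM; the box layout and names are assumed, and periodic wrapping is ignored) of the import test implied by this slide: a neighboring sub-box contributes coordinates only if it overlaps the region extending Rcut past the local box in the positive x, y, and z directions.

    // Hypothetical sketch: does a neighboring sub-box overlap the
    // eighth-shell import volume of the local sub-box?
    #include <cstdio>

    struct Box { double lo[3], hi[3]; };   // axis-aligned sub-box bounds

    // Import volume = local box extended by rcut in the +x, +y, +z directions.
    bool overlapsImportVolume(const Box& local, const Box& nbr, double rcut) {
      for (int d = 0; d < 3; ++d) {
        // No overlap if the neighbor lies entirely below the local box,
        // or entirely above local box + rcut, in this dimension.
        if (nbr.hi[d] <= local.lo[d]) return false;
        if (nbr.lo[d] >= local.hi[d] + rcut) return false;
      }
      return true;
    }

    int main() {
      Box local = {{0, 0, 0}, {10, 10, 10}};
      Box east  = {{10, 0, 0}, {20, 10, 10}};   // +x neighbor
      Box west  = {{-10, 0, 0}, {0, 10, 10}};   // -x neighbor
      printf("+x neighbor imported: %d\n", overlapsImportVolume(local, east, 9.0)); // 1
      printf("-x neighbor imported: %d\n", overlapsImportVolume(local, west, 9.0)); // 0
      return 0;
    }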

5 Domain decomposition method
[Figure: local box, import volume, and the surrounding MPI sub-boxes]
Coordinates are communicated from all the MPI tasks that overlap with the import volume

6 Non-bonded force calculation on GPU
Sorting atoms:
Divide the simulation box into even z-columns
Divide the z-columns into boxes such that each box contains exactly 32 atoms, except possibly the top box
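As an illustration of this sorting step (an assumed host-side sketch, not the CHARMM code; coordinates are assumed already wrapped into the box), the routine below bins atoms into (x, y) columns, orders each column along z, and cuts it into tiles of at most 32 atoms.

    // Hypothetical sketch: bin atoms into (x,y) columns, sort each column by z,
    // and slice it into tiles of (at most) 32 atoms. Names are illustrative.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Atom { float x, y, z; int index; };

    std::vector<std::vector<int>> buildTiles(std::vector<Atom>& atoms,
                                             float boxX, float boxY,
                                             int ncolX, int ncolY) {
      // 1. Assign every atom to an (x,y) column.
      std::vector<std::vector<Atom>> columns(ncolX * ncolY);
      for (const Atom& a : atoms) {
        int cx = std::min(int(a.x / boxX * ncolX), ncolX - 1);
        int cy = std::min(int(a.y / boxY * ncolY), ncolY - 1);
        columns[cy * ncolX + cx].push_back(a);
      }
      // 2. Sort each column along z and cut it into 32-atom tiles.
      std::vector<std::vector<int>> tiles;
      for (auto& col : columns) {
        std::sort(col.begin(), col.end(),
                  [](const Atom& a, const Atom& b) { return a.z < b.z; });
        for (std::size_t i = 0; i < col.size(); i += 32) {
          std::vector<int> tile;
          for (std::size_t j = i; j < std::min(i + 32, col.size()); ++j)
            tile.push_back(col[j].index);
          tiles.push_back(tile);   // only the topmost tile may hold fewer than 32 atoms
        }
      }
      return tiles;
    }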

7 Neighborlist search
Neighborlist search finds interacting pairs I-J
Neighborlist search is done on the CPU using bounding boxes
o Detailed distance exclusions are done on the GPU
o Topological exclusions are done on the CPU
The non-bonded calculation on the GPU is performed on the 32x32 tile
[Figure: Natom x Natom matrix of I-J tile pairs within Rcut; duplicate pairs are excluded to avoid double counting]
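A minimal sketch of the bounding-box test (illustrative only, assuming one axis-aligned bounding box per 32-atom tile): a tile pair enters the neighbor list if the minimum distance between the two boxes is within Rcut.

    // Hypothetical sketch: minimum distance between two axis-aligned bounding
    // boxes; a tile pair is kept only if this distance is <= rcut.
    #include <algorithm>

    struct BBox { float lo[3], hi[3]; };

    bool tilePairInRange(const BBox& a, const BBox& b, float rcut) {
      float d2 = 0.0f;
      for (int k = 0; k < 3; ++k) {
        // Gap between the boxes along axis k (0 if they overlap on this axis).
        float gap = std::max({a.lo[k] - b.hi[k], b.lo[k] - a.hi[k], 0.0f});
        d2 += gap * gap;
      }
      return d2 <= rcut * rcut;
    }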

8 Non-bonded force calculation on GPU
A single warp (32 threads) handles one 32x32 tile of i atoms and j atoms
Iterate t = 0..31 and calculate the interactions in the 32x32 tile
Thread p calculates the interaction between atoms i[(p+t)%32] and j[p]
o Index i is offset by p to avoid a race condition when writing atom i forces
Atom i coordinates are kept in shared memory, atom j coordinates are in registers
An exclusion mask (= 32 x 32-bit integers) takes care of topological and distance exclusions
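The fragment below is a much-simplified CUDA sketch of that tile loop (illustrative, not the CHARMM kernel; the force model, names, and exclusion-mask bit layout are assumptions): thread p keeps its j atom in registers, walks t = 0..31 over the i atoms held in shared memory using the (p + t) % 32 offset, and tests its 32-bit exclusion word before computing each pair.

    // Illustrative CUDA sketch of one 32x32 tile handled by a single warp.
    __device__ void computeTile(const float4* __restrict__ xyzq_i,  // 32 i atoms (view of shared memory)
                                const float4 xyzq_j,                // this thread's j atom (register)
                                unsigned int exclMask,              // 32-bit exclusion word for thread p
                                float rcut2, float3& fj)
    {
      const int p = threadIdx.x & 31;                 // lane id = thread p
      for (int t = 0; t < 32; ++t) {
        const int ii = (p + t) & 31;                  // i index offset by p -> no write races on i forces
        if (exclMask & (1u << ii)) continue;          // topological / distance exclusion (assumed bit layout)
        const float4 xi = xyzq_i[ii];                 // i coordinates live in shared memory
        const float dx = xyzq_j.x - xi.x;
        const float dy = xyzq_j.y - xi.y;
        const float dz = xyzq_j.z - xi.z;
        const float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 < rcut2) {
          const float inv_r2 = 1.0f / r2;             // toy Lennard-Jones force, parameters omitted
          const float inv_r6 = inv_r2 * inv_r2 * inv_r2;
          const float fscal  = (12.0f * inv_r6 * inv_r6 - 6.0f * inv_r6) * inv_r2;
          fj.x += fscal * dx;  fj.y += fscal * dy;  fj.z += fscal * dz;
          // The matching atom-i force would be accumulated at index ii
          // (e.g. in shared memory or via atomics); omitted for brevity.
        }
      }
    }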

9 Force and energy accumulation on GPU
Forces are calculated in single precision (SP) and accumulated in fixed precision (FP)
o FP Q24.40 model used for the direct non-bonded calculation (long long int)
o FP Q1.31 used for the reciprocal grid (int)
o Allows for fast force accumulation using hardware integer atomic operations
o No need for multiple force arrays and a reduction step
Energies are calculated in SP and accumulated in double precision (DP)
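To make the fixed-precision idea concrete, here is an assumed sketch (not the production code): a single-precision force component is scaled by 2^40, rounded to a 64-bit integer in the Q24.40 format, and accumulated with a hardware integer atomicAdd; dividing by 2^40 converts the sum back.

    // Illustrative sketch of Q24.40 fixed-point force accumulation.
    // Scale factor and helper names are assumptions for the example.
    #include <cuda_runtime.h>

    #define FIXED_SCALE 1099511627776.0   // 2^40, matching the Q24.40 format

    // Convert a single-precision force component to Q24.40 fixed precision.
    __device__ long long forceToFixed(float f) {
      return __double2ll_rn(double(f) * FIXED_SCALE);
    }

    // Accumulate into the fixed-precision force array with a hardware integer
    // atomic; two's-complement wraparound makes signed accumulation work.
    __device__ void atomicAddForce(long long* forceFP, int atom, float f) {
      atomicAdd(reinterpret_cast<unsigned long long*>(&forceFP[atom]),
                static_cast<unsigned long long>(forceToFixed(f)));
    }

    // After the kernels finish, convert the accumulated value back.
    __host__ __device__ inline double fixedToDouble(long long v) {
      return double(v) / FIXED_SCALE;
    }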

10 Two extremes of hardware setups
Single CPU, single GPU per node: typical in supercomputer setups
Single CPU, multiple GPUs per node: typical in homebrew clusters
[Diagram: nodes 1..N joined by an interconnect, each with a CPU attached over PCI to either a single GPU or to GPUs 1..M]
Job schedulers typically allocate entire nodes to users, therefore it is important that the code runs efficiently on both setups

11 CPU + GPU molecular dynamics cycle
Only the non-bonded (direct) forces are calculated on the GPU
o Easy to implement, only a few CUDA kernels
o Requires highly optimized (threaded + vectorized) CPU code
[Flow diagram, per MPI node: the CPU sends local coordinates, communicates coordinates among CPUs, sends import coordinates, computes the bonded & reciprocal forces, communicates forces among CPUs, and performs constraints and integration; the GPU computes the non-bonded force for the import and local coordinates and sends the forces back]
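One way such a step could be overlapped on the host side is sketched below (placeholder names and a stand-in kernel, not the CHARMM driver): the non-bonded work is launched asynchronously on a CUDA stream, the CPU threads handle the bonded and reciprocal forces in the meantime, and the two force sets are combined only after synchronizing.

    // Illustrative sketch of overlapping GPU non-bonded work with CPU work
    // in one MD step; all names here are placeholders.
    #include <cuda_runtime.h>

    __global__ void nonbondedKernel(const float4* xyzq, float3* force, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) force[i] = make_float3(0.f, 0.f, 0.f);   // stand-in for the real tile kernel
    }

    void computeBondedAndReciprocalCPU() { /* OpenMP-threaded CPU force work */ }

    void mdStepOffload(const float4* h_xyzq, float4* d_xyzq, float3* d_force,
                       int n, cudaStream_t stream) {
      // Send local + import coordinates and launch the non-bonded kernel; both are asynchronous.
      cudaMemcpyAsync(d_xyzq, h_xyzq, n * sizeof(float4),
                      cudaMemcpyHostToDevice, stream);
      nonbondedKernel<<<(n + 127) / 128, 128, 0, stream>>>(d_xyzq, d_force, n);

      // CPU threads do the bonded + reciprocal (PME) forces while the GPU runs.
      computeBondedAndReciprocalCPU();

      // Collect the GPU forces before constraints and integration.
      cudaStreamSynchronize(stream);
      // ...copy d_force back, sum with CPU forces, integrate...
    }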

12 When the offload approach breaks down
[Timelines of the total MD cycle showing GPU work vs. GPU idle time]
Single CPU + single GPU: 8-core Intel Xeon X5667 + C2075
Single CPU + multiple GPUs: 8-core Intel Xeon X5667 + K20s
Exactly the same total performance, because the CPU can't keep up
Shift more work to the GPU

13 Fix: Move the reciprocal force calculation onto the GPU
3D FFT requires a lot of communication
o cuFFT currently parallelizes over multiple GPUs only at around grid size 256x256x256, i.e. about 1.5 million atoms
Avoid communication by splitting into direct nodes and reciprocal nodes
o Direct nodes: direct part of the non-bonded force
o Reciprocal node: reciprocal part of the non-bonded force
For example, an 8-core CPU with 4 GPUs:
o Split into 4 MPI nodes with 2 CPU threads and 1 GPU each: MPI nodes 1-3 are direct nodes and MPI node 4 is the reciprocal node (one physical node with 8 CPU threads and 4 GPUs)
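One plausible way to express this split with MPI (a sketch under assumptions, not the actual CHARMM policy) is to mark every fourth rank as a reciprocal node, matching the 3-direct + 1-recip example above, and separate the two groups with MPI_Comm_split.

    // Illustrative sketch: split MPI ranks into direct and reciprocal groups.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // color 1 = reciprocal node, color 0 = direct node (assumed policy)
      int isRecip = ((rank % 4) == 3);
      MPI_Comm splitComm;
      MPI_Comm_split(MPI_COMM_WORLD, isRecip, rank, &splitComm);

      printf("rank %d: %s node\n", rank, isRecip ? "reciprocal" : "direct");
      // Direct ranks compute the direct non-bonded part; reciprocal ranks do
      // the PME grid part, and forces are exchanged between the groups each step.

      MPI_Comm_free(&splitComm);
      MPI_Finalize();
      return 0;
    }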

14 GPU-only molecular dynamics cycle
[Flow diagram. Direct MPI node: the CPU communicates coordinates among the direct nodes, sends coordinates to the reciprocal node and import coordinates to the GPU, communicates forces among the direct nodes, receives the reciprocal forces, sends the final forces, and performs constraints and integration; the GPU computes the bonded force and the non-bonded force (import and local). Reciprocal MPI node: the CPU gathers coordinates from all direct nodes and sends the resulting forces back to them; the GPU computes the reciprocal force.]

15 Benchmark setup
4 GPU nodes with an InfiniBand connection
o 8-core Intel Xeon X5667 on all nodes
o Nodes 1-3: 2 x C2075
o Node 4: 1 x C2075, 3 x K20
Two benchmark systems: … atoms (DHFR) and … atoms

16 CPU+GPU performance, … atoms (DHFR)
Uses 2 GPUs on each MPI node
[Performance plot; annotations: 2.4x faster and 3.7x faster]

17 CPU+GPU performance, … atoms
[Performance plot; annotations: 5.7x faster and 9.3x faster]

18 Performance, … atoms (DHFR)
[Performance plot comparing "2 direct + 1 recip" and "1 direct + 1 recip" node configurations]

19 Performance, … atoms
[Performance plot comparing "2 direct + 1 recip" and "1 direct + 1 recip" node configurations]

20 Conclusions & Future Work
Conclusions
A heterogeneous CPU + GPU molecular dynamics engine has been implemented for CHARMM
o Natively multi-GPU capable
Two ways of running the simulation
o CPU + GPU: for setups with fast CPUs
o GPU only: for setups with fast GPUs; split into direct + reciprocal nodes
Future Work
GPU only
o Better load balancing by fusing the reciprocal node with one of the direct nodes
o CUDA-aware MPI
o Move the neighbor list build onto the GPU (when more efficient)
o Single-GPU special case
CPU + GPU
o Tuning for supercomputer setups
Acknowledgements: Michael Crowley, NREL; Charles Brooks, University of Michigan; NVIDIA for hardware support; NIH for funding
