Improving NAMD Performance on Volta GPUs

Size: px

Start display at page:

Download "Improving NAMD Performance on Volta GPUs"

Beatrice Little
5 years ago
Views:

1 Improving NAMD Performance on Volta GPUs David Hardy - Research Programmer, University of Illinois at Urbana-Champaign Ke Li - HPC Developer Technology Engineer, NVIDIA John Stone - Senior Research Programmer, University of Illinois at Urbana-Champaign

2 Journey of a Legacy HPC Application NAMD has been developed for more than 20 years Parallel scaling of large systems on supercomputers - Blue Waters (NCSA), Titan (ORNL), Stampede (TACC), Summit (ORNL) First full-featured molecular dynamics code to adopt CUDA - Stone, et al. J Comput Chem, 28: , 2007 Why were certain design choices made? What lessons have we learned? Where does development need to go in the Age of Volta?

3 NAMD & VMD: Computational Microscope Enable researchers to investigate systems described at the atomic scale NAMD - molecular dynamics simulation VMD - visualization, system preparation and analysis Neuron Ribosome Virus Capsid

10 7 10 6 10 5 Lysozyme ApoA1 ATP Synthase

4 Simulated System Sizes Have Increased Exponentially 10 8 HIV capsid Number of atoms Lysozyme ApoA1 ATP Synthase Ribosome STMV

5 Parallelism in Molecular Dynamics Limited to Each Timestep Computational workflow of MD: Update coordinates about 1% of computational work Initialize coordinates forces, coordinates Force calculation about 99% of computational work Occasional output of reduced quantities (energy, temperature, pressure) Occasional output of coordinates (trajectory snapshot)

6 Work Dominated by Nonbonded Forces 90% Non-bonded forces, short-range cutoff force calculation update coordinates 5% Long-range electrostatics, gridded (e.g. PME) 2% Bonded forces (bonds, angles, etc.) 2% Correction for excluded interactions 1% Integration, constraints, thermostat, barostat Apply GPU acceleration first to the most expensive part

7 Parallelize by Spatial Decomposition of Atoms Data parallelism is common to many codes

8 NAMD Also Decomposes Compute Objects Kale et al., J. Comp. Phys. 151: , 1999 Spatially decompose data and communication Separate but related work decomposition Compute objects create much greater amount of parallelization, facilitating iterative, measurement-based load balancing system, all from use of Charm++

9 Overlap Calculations, Offload Nonbonded Forces Phillips et al., SC2002 Offload to GPU Objects are assigned to processors and queued as data arrives

10 Early Nonbonded Forces Kernel Used All Memory Systems Start with most expensive calculation: direct nonbonded interactions. Decompose work into pairs of patches, identical to NAMD structure. GPU hardware assigns patch-pairs to multiprocessors dynamically. Force computation on single multiprocessor (GeForce 8800 GTX has 16) Texture Unit Force Table Interpolation 8kB cache 16kB Shared Memory Patch A Coordinates & Parameters 32-way SIMD Multiprocessor multiplexed threads 32kB Registers Patch B Coords, Params, & Forces Constants Exclusions 8kB cache 768 MB Main Memory, no cache, 300+ cycle latency

11 NAMD Performance Improved Using Early GPUs ApoA1 Performance Full NAMD, not test harness Useful performance boost 8x speedup for nonbonded 5x speedup overall w/o PME 3.5x speedup overall w/ PME GPU = quad-core CPU Plans for better performance Overlap GPU and CPU work. Tune or port remaining work. PME, bonded, integration, etc. seconds per step CPU faster GPU Nonbond PME Other 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX

12 Reduce Communication Latency by Separating Work Units Phillips et al., SC2008 GPU x Remote Force f Local Force f CPU Remote Local f Local Update x Other Nodes/Processes f x One Timestep

13 Early GPU Fits Into Parallel NAMD as Coprocessor Offload most expensive calculation: non-bonded forces Fits into existing parallelization Extends existing code without modifying core data structures Requires work aggregation and kernel scheduling considerations to optimize remote communication GPU is treated as a coprocessor

NAMD Scales Well on Kepler Based Computers 64 21M atoms 32 Performance (ns per day) GTC 2017 (2fs timestep) 16 8

5 Edison XC30 (SC14) Blue Waters XE6 (SC14) 0.

14 NAMD Scales Well on Kepler Based Computers 64 21M atoms 32 Performance (ns per day) GTC 2017 (2fs timestep) M atoms 1 Blue Waters XK7 (GTC16) Kepler based Titan XK7 (GTC16) 0.5 Edison XC30 (SC14) Blue Waters XE6 (SC14) Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics Beckman Institute, University Number of Illinois at of Urbana-Champaign Nodes

15 Large Rate Difference Between Pascal and CPU 20x FLOP rate difference between GPU and CPU Requires full use of CPU cores and vectorization! Balance between GPU and CPU capability keeps shifting towards GPU NVIDIA plots show only through Pascal Volta widens the performance gap! Difference made worse by multiple GPUs per CPU (e.g. DGX, Summit) Past efforts to balance work between GPU and CPU are now CPU bound

parallel algorithms, regularized work units CPU gets algorithms less well suited to GPU, plus overflow work Partial model: ~10M atoms Le, et al.

16 Balancing Work Between GPU and CPU Case Study: Multilevel Summation CUDA Kernels in VMD Hardy, et al. Journal of Parallel Computing, 35: , 2009 Effort to balance computational work between GPUs and CPU GPU gets only straightforward data parallel algorithms, regularized work units CPU gets algorithms less well suited to GPU, plus overflow work Partial model: ~10M atoms Le, et al. PLoS Comput Bio, 6:e , 2010 Electrostatic field of chromatophore model from multilevel summation method: computed with 3 GPUs (G80) in ~90 seconds, 46x faster than single CPU socket

17 Time-Averaged Electrostatics Analysis on NCSA Blue Waters NCSA Blue Waters Node Type Cray XE6 Compute Node: 32 CPU cores (2xAMD 6200 CPUs) Cray XK6 GPU-accelerated Compute Node: 16 CPU cores + NVIDIA X2090 (Fermi) GPU Seconds per trajectory frame for one compute node Speedup for GPU XK6 nodes vs. CPU XE6 nodes XK6 nodes are 4.15x faster overall Tests on XK7 nodes indicate MSM is CPU-bound with the Kepler K20X GPU. Performance not much faster (yet) than Fermi X2090 Need to move spatial hashing, prolongation, interpolation onto the GPU In progress. XK7 nodes 4.3x faster overall Preliminary performance for VMD time-averaged electrostatics w/ Multilevel Summation Method on the NCSA Blue Waters Early Science System speedup ok CPU-bound

18 Multilevel Summation on the GPU Accelerate short-range cutoff and lattice cutoff parts Performance profile for 0.5 Å map of potential for 1.5 M atoms. Hardware platform is Intel QX6700 CPU and NVIDIA GTX 280. Computational steps CPU (s) w/ GPU (s) Speedup Short-range cutoff Long-range anterpolation 0.18 restriction 0.16 lattice cutoff prolongation 0.17 interpolation 3.47 Total Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35: , 2009.

19 Balancing Work Between GPU and CPU Case Study: Multilevel Summation CUDA Kernels in VMD Conclusions: Successful (at the time) of exploiting task and data level parallelism Work was reasonably well balanced between CPUs and GPUs for Tesla and Fermi For Kepler and later, approach was CPU bound

20 Reduce Latency, Offload All Force Computation Overlapped GPU communication and computation (2012) Offload atom-based work for PME (2013) - Use higher order interpolation with coarser grid Emphasis on improving communication latency - Reduce parallel FFT communication Faster nonbonded force kernels (2016) Offload entire PME using cufft (for single node use) (2016) Offload remaining force terms (2017) - Includes: bonds, angles, dihedrals, impropers, crossterms, exclusions Emphasis on using GPUs more effectively

21 Overlapped GPU Communication and Computation Allows incremental results from a single grid to be processed on CPU before grid finishes on GPU Allows merging and prioritizing of remote and local work GPU side: - Write results to host-mapped memory (also without streaming) - threadfence_system() and syncthreads() - Atomic increment for next output queue location - Write result index to output queue CPU side: - Poll end of output queue (int array) in host memory

22 Non-overlapped Kernel Communication Integration unable to start until GPU kernel finishes

23 Overlapped Kernel Communication GPU kernel communicates results while running; patches begin integration as soon as data arrives

24 Faster Non-bonded force computation in NAMD Two levels of spatial sorting Simulation box is divided into patches Within the patch, atoms are sorted spatially into groups of 32 using orthogonal recursive bisection method 7 S6623: Advances in NAMD GPU Performance

25 Faster Non-bonded force compute Patch 1 Compute 1 Compute 1 Patch 2 32 Patch 2 Compute 2 Patch 3 Compute = all pairwise interactions between two patches 32 For GPU, compute is split into tiles of size 32x32 atoms 8 S6623: Advances in NAMD GPU Performance

Faster Non-bonded force computation F i 32 31 30 32 Warp 1 32 Warp 2 Warp 3

through 32x32 tile diagonally Avoids race condition when storing forces F i

26 Faster Non-bonded force computation F i Warp 1 32 Warp 2 Warp 3 Warp 4 Atoms in patch j Atoms in patch i F j One warp per tiles Loop through 32x32 tile diagonally Avoids race condition when storing forces F i and F j Bitmask used for exclusion lookup 9 S6623: Advances in NAMD GPU Performance

27 Neighbor list construction on GPU Bounding box neighbor list Sort neighbor list R Compute forces atom-based neighbor list R Sort neighbor list 11 S6623: Advances in NAMD GPU Performance

lists executed on the same thread block should have approximately the same work load Simple

28 Neighbor list sorting Load imbalance! Thread block No load imbalance sort Warp 1 Warp 2 Warp 3 Warp 4 Warp 1 Warp 2 Warp 3 Tile lists executed on the same thread block should have approximately the same work load Simple solution is to sort according to tile list length Also minimizes tail effects at the end of kernel execution 12 S6623: Advances in NAMD GPU Performance

29 Single-Node GPU Performance Competitive on Maxwell New kernels by Antti-Pekka Hynninen, NVIDIA Stone, Hynninen, et al., International Workshop on OpenPOWER for HPC (IWOPH'16), 2016

30 More Improvement from Offloading Bonded Forces GPU offloading for bonds, angles, dihedrals, impropers, exclusions, and crossterms Computation in single precision Forces are accumulated in fixed point Virials are accumulated in fixed point Code path exists for double precision accumulation on Pascal and newer GPUs Reduces CPU workload and hence improves performance on GPU-heavy systems Speedup DGX-1 apoa1 f1atpase stmv New kernels by Antti-Pekka Hynninen, NVIDIA

31 Supercomputers Increasing GPU to CPU Ratio Blue Waters, Titan with Cray XK7 nodes 1 K20 / 16-core AMD Opteron Summit nodes 6 Volta / 42 cores IBM Power 9 Only 7 cores supporting each Volta!

32 Limited Scaling Even After Offloading All Forces Results on NVIDIA DGX-1 (Intel Haswell using 28-cores with Volta V100 GPUs) 4 Performance (ns per day) STMV 1 million atoms NAMD 2.13 NAMD Number of NVIDIA Voltas Offloading all forces Nonbonded forces only

CPU Integrator Calculation (1%) Causing Bottleneck

33 CPU Integrator Calculation (1%) Causing Bottleneck Nsight Systems profiling of NAMD running STMV (1M atoms) on 1 Volta & 28 CPU cores Too much CPU work: 2200 patches across 28 cores [nonbonded] [bonded] [PME] Too much communication! GPU is not being kept busy Patches running sequentially within each core

34 CPU integrator work is mostly data parallel, but Uses double precision for positions, velocities, forces - Data layout is array of structures (AOS), not well-suited to vectorization Each NAMD patch runs integrator in separate user-level thread to make source code more accessible - Benefit from vectorization is reduced, loop over atoms in each patch Too many exceptional cases handled within same code path - E.g. fixed atoms, pseudo-atom particles (Drude and lone pair) - Test conditionals for simulation options and rare events (e.g. trajectory output) every timestep

35 CPU integrator work is mostly data parallel, but Additional communication is required - Send reductions for kinetic energies and virial - Receive broadcast periodic cell rescaling when simulating constant pressure A few core algorithms have sequential parts, with reduced parallelism - Not suitable for CPU vectorization - Rigid bond constraints successive relaxation over each hydrogen group - Random number generator Gaussian numbers using Box Muller transform with rejection - Reductions calculated over hydrogen groups irregular data access patterns

36 Strategies for Overcoming Bottleneck Data structures for CPU vectorization - Convert atom data storage from AOS (array of structures) form into vector friendly SOA (structure of arrays) form Algorithms for CPU vectorization - Replace non-vectorizing RNG code with vectorized version - Replace rigid bond constraints sequential algorithm with one capable of fine-grained parallelism (maybe LINCS or Matrix-SHAKE) Offload integrator to GPU - Main challenge is aggregating patch data - Use vectorized algorithms, adapt curand for Gaussian random numbers

37 Goal: Developing GPU-based NAMD CPU primarily manages GPU kernel launching - CPU prepares and aggregates data structures for GPU, handles Charm++ communication Reduces overhead of host-to-device memory transfers - Data lives on the GPU, clusters of patches - Communicate edge patches for force calculation and atom migration Design new data structures capable of GPU or CPU vectorization - Major refactor of code to reveal more data parallelism Kernel-based design with consistent interfaces across GPU and CPU - Need fallback kernels for CPU (e.g. continue to support Blue Waters and Titan)

38 Some Conclusions For HPC apps in general and NAMD in particular Balance between CPU and GPU computational capability continues to shift in favor of the GPU - The 1% workload from 10 years ago is now 60% or more in wall clock time Any past code that has attempted to load balance work between CPU and GPU is today likely to be CPU bound Best utilization of GPU might require keeping all data on the GPU - Motivates turning a previously CPU-based code using GPU as an accelerator into a GPUbased code using CPU as a kernel management and communication coprocessor Volta + CUDA 9.x + CUDA Toolkit Libraries provide enough general purpose support to allow moving an application entirely to the GPU

39 Related Talks S ORNL Summit: Petascale Molecular Dynamics Simulations on the Summit POWER9/Volta Supercomputer - James Phillips S Accelerating Molecular Modeling Tasks on Desktop and Pre-Exascale Supercomputers - John Stone S Optimizing HPC Simulation and Visualization Codes Using the NVIDIA Nsight Systems - Robert Knight, Daniel Horowitz, John Stone S VMD: Biomolecular Visualization from Atoms to Cells Using Ray Tracing, Rasterization, and VR - John Stone SE OpenACC User Group Meeting - Sunita Chandrasekaran, Guido Juckeland, John Stone, Randy Allen

40 Acknowledgments Special thanks to: - NVIDIA CUDA Team - NVIDIA NSight Systems Team - James Phillips, NCSA, UIUC - In memoriam Antti-Pekka Hynninen, NVIDIA - Theoretical Biophysics Research Group, Beckman Institute, UIUC - Parallel Programming Laboratory, Dept of Computer Science, UIUC Grants: - NIH P41-GM Center for Macromolecular Modeling and Bioinformatics - Summit Center for Accelerated Application Readiness, OLCF, ORNL

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of