Accelerating Molecular Modeling Applications with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing
John Stone
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign
Research/gpu/
Supercomputing 2009, Portland, OR, November 18, 2009

See our GPU article in the CACM issue included in your SC2009 attendee bag! Phillips & Stone, Communications of the ACM, 52(10):34-41, Oct 2009

VMD Visual Molecular Dynamics
Visualization and analysis of:
- Molecular dynamics simulations
- Sequence data
- Volumetric data
- Quantum chemistry simulations, and much more
User extensible with scripting and plugins. Over 9,400 users running CUDA-enabled versions of VMD. Research/vmd/
GPU acceleration provides an opportunity to make slow or batch calculations capable of being run interactively or on demand.

CUDA Acceleration in VMD
- Electrostatic field calculation, ion placement: 20x to 44x faster
- Molecular orbital calculation and display: 100x to 120x faster
- Imaging of gas migration pathways in proteins with implicit ligand sampling: 20x to 30x faster

Electrostatic Potential Maps
Electrostatic potentials evaluated on a 3-D lattice. Applications include:
- Ion placement for structure building
- Time-averaged potentials for simulation
- Visualization and analysis
[Image: isoleucine tRNA synthetase]
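
To make the lattice evaluation concrete, the sketch below shows a minimal direct-summation CUDA kernel for one z-plane of such a map: each thread sums charge/distance over all atoms for its own lattice point. This is only an illustrative sketch, not the optimized VMD kernel; the atom layout (float4 with the charge in .w) and all names are assumptions.

  #include <cuda_runtime.h>

  // One thread per lattice point of a single z-plane; atoms are {x, y, z, charge}.
  __global__ void cenergy_slice(int numatoms, const float4 *atoms,
                                float gridspacing, float zplane,
                                int nx, int ny, float *energy) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // lattice x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // lattice y index
    if (i >= nx || j >= ny) return;

    float x = i * gridspacing;
    float y = j * gridspacing;
    float en = 0.0f;
    for (int n = 0; n < numatoms; n++) {             // sum over all atoms
      float dx = x - atoms[n].x;
      float dy = y - atoms[n].y;
      float dz = zplane - atoms[n].z;
      en += atoms[n].w * rsqrtf(dx*dx + dy*dy + dz*dz);
    }
    energy[j * nx + i] += en;                        // accumulate into the map
  }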

Multilevel Summation Main Ideas
- Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
- Smoothed potentials are interpolated from successively coarser lattices.
- The finest lattice spacing h and the smallest cutoff distance a are doubled at each successive level.
[Figure: splitting of the 1/r potential, and interpolation of the smoothed potentials at the atom, h-lattice, and 2h-lattice levels with cutoffs a and 2a]
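
One common way to write this splitting (the smoothing-function notation g_a is introduced here for illustration; it is not on the slide):

\[
  \frac{1}{r}
  = \underbrace{\Bigl(\tfrac{1}{r} - g_a(r)\Bigr)}_{\text{short-range, cutoff } a}
  + \underbrace{\bigl(g_a(r) - g_{2a}(r)\bigr)}_{h\text{-lattice, cutoff } 2a}
  + \underbrace{\bigl(g_{2a}(r) - g_{4a}(r)\bigr)}_{2h\text{-lattice, cutoff } 4a}
  + \cdots
  + \underbrace{g_{2^{k-1}a}(r)}_{\text{top level, no cutoff}}
\]

where each g_{2^l a}(r) is a smoothed function that agrees with 1/r beyond its cutoff distance, so every difference term vanishes outside its cutoff, and each term after the first varies slowly enough to be interpolated from its lattice.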

Multilevel Summation Calculation
map potential = exact short-range interactions + interpolated long-range interactions
Computational steps for the long-range part: anterpolation of atom charges to the h-lattice, restriction to the 2h- and 4h-lattices, a lattice cutoff computation at each level, prolongation back to the finer lattices, and interpolation of the map potentials. The short-range cutoff part is computed directly between atoms and map points.

Multilevel Summation on the GPU
Accelerate the short-range cutoff and lattice cutoff parts. Performance profile for a 0.5 Å map of potential for 1.5 M atoms. Hardware platform: Intel QX6700 CPU and NVIDIA GTX 280.

Computational steps       CPU (s)   w/ GPU (s)   Speedup
Short-range cutoff         480.07      14.87       32.3
Long-range:
  anterpolation              0.18
  restriction                0.16
  lattice cutoff            49.47       1.36       36.4
  prolongation               0.17
  interpolation              3.47
Total                      533.52      20.21       26.4

Photobiology of Vision and Photosynthesis
Investigations of the chromatophore, a photosynthetic organelle.
- Partial model: ~10M atoms
- Electrostatics needed to build the full structural model, place ions, and study macroscopic properties
- Electrostatic field of the chromatophore model from the multilevel summation method: computed with 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU core
- The full chromatophore model will permit structural, chemical, and kinetic investigations at a structural systems biology level

Computing Molecular Orbitals
- Visualization of MOs aids in understanding the chemistry of a molecular system
- MO spatial distribution is correlated with electron probability density
- Calculation of high-resolution MO grids can require tens to hundreds of seconds on CPUs
- >100x speedup allows interactive animation of MOs at 10 FPS
[Image: C60]

Computing Molecular Orbitals
See a live demo of VMD's GPU-accelerated MO calculation/display, showing the results of a TeraChem GPU-accelerated QM calculation, in the Stanford SLAC booth here at SC2009!
[Image: C60]

Molecular Orbital Computation and Display Process
One-time initialization:
- Initialize pool of GPU worker threads
- Read QM simulation log file and trajectory
- Preprocess MO coefficient data (eliminate duplicates, sort by type, etc.)
For each trajectory frame, for each MO shown:
- For the current frame and MO index, retrieve the MO wavefunction coefficients
- Compute the 3-D grid of MO wavefunction amplitudes (the most performance-demanding step, run on the GPU)
- Extract an isosurface mesh from the 3-D MO grid
- Apply user coloring/texturing and render the resulting surface

CUDA Block/Grid Decomposition
- The MO 3-D lattice decomposes into 2-D slices (CUDA grids); see the launch sketch below
- Small 8x8 thread blocks afford a large per-thread register count and shared memory
- Threads compute one MO lattice point each
- Padding optimizes global memory performance, guaranteeing coalescing
[Figure: grid of thread blocks (0,0), (0,1), (1,0), (1,1), ... covering a padded 2-D slice]
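
A minimal sketch of this decomposition follows (all names, including mo_kernel and the lattice dimensions, are placeholders rather than the actual VMD code): each slice is padded up to a multiple of the 8x8 block size, and one kernel launch covers each 2-D slice.

  #include <cuda_runtime.h>

  #define BLOCKSZX 8
  #define BLOCKSZY 8

  // Placeholder kernel: in VMD this would evaluate the MO wavefunction amplitude
  // at one lattice point; here it only writes to the point so the sketch compiles.
  __global__ void mo_kernel(float *slice_out, int paddedx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    slice_out[j * paddedx + i] = 0.0f;   // padded layout keeps every row coalesced
  }

  // Illustrative host-side decomposition: pad each 2-D lattice slice to a
  // multiple of the 8x8 block, then launch one CUDA grid per z-slice, with
  // each thread computing one MO lattice point.
  void launch_mo_slices(float *d_out, int numx, int numy, int numz) {
    int paddedx = (numx + BLOCKSZX - 1) / BLOCKSZX * BLOCKSZX;
    int paddedy = (numy + BLOCKSZY - 1) / BLOCKSZY * BLOCKSZY;

    dim3 threads(BLOCKSZX, BLOCKSZY, 1);
    dim3 grid(paddedx / BLOCKSZX, paddedy / BLOCKSZY, 1);

    for (int slice = 0; slice < numz; slice++)
      mo_kernel<<<grid, threads>>>(d_out + (size_t)slice * paddedx * paddedy, paddedx);
  }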

MO Kernel for One Grid Point (Naive C)

  // Loop over atoms
  for (at=0; at<numatoms; at++) {
    int prim_counter = atom_basis[at];
    calc_distances_to_atom(&atompos[at], &xdist, &ydist, &zdist, &dist2, &xdiv);

    // Loop over shells
    for (contracted_gto=0.0f, shell=0; shell < num_shells_per_atom[at]; shell++) {
      int shell_type = shell_symmetry[shell_counter];

      // Loop over primitives: largest component of runtime, due to expf()
      for (prim=0; prim < num_prim_per_shell[shell_counter]; prim++) {
        float exponent       = basis_array[prim_counter    ];
        float contract_coeff = basis_array[prim_counter + 1];
        contracted_gto += contract_coeff * expf(-exponent*dist2);
        prim_counter += 2;
      }

      // Loop over angular momenta (unrolled in real code)
      for (tmpshell=0.0f, j=0, zdp=1.0f; j<=shell_type; j++, zdp*=zdist) {
        int imax = shell_type - j;
        for (i=0, ydp=1.0f, xdp=pow(xdist, imax); i<=imax; i++, ydp*=ydist, xdp*=xdiv)
          tmpshell += wave_f[ifunc++] * xdp * ydp * zdp;
      }

      value += tmpshell * contracted_gto;
      shell_counter++;
    }
  }
  ...

Preprocessing of Atoms, Basis Set, and Wavefunction Coefficients
Must make effective use of high-bandwidth, low-latency GPU on-chip memory, or of the CPU cache:
- Overall storage requirement is reduced by eliminating duplicate basis set coefficients
- Sorting atoms by element type allows reuse of basis set coefficients for subsequent atoms of identical type
- Padding and alignment of arrays guarantee coalesced GPU global memory accesses and CPU SSE loads (see the sketch below)
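
The padding step can be as simple as the helper sketched here (a hypothetical utility, not VMD code): array lengths are rounded up to a multiple of the coalescing granularity and the tail is zero-filled so padded elements are harmless if ever read.

  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical padding helper: round a coefficient array up to a multiple of
     'align' elements (e.g. 16 floats = 64 bytes) and zero the tail, so GPU
     global memory reads and CPU SSE loads stay aligned and coalesced. */
  static float *pad_coeff_array(const float *coeffs, int n, int align, int *npadded) {
    *npadded = (n + align - 1) / align * align;
    float *buf = (float *) calloc(*npadded, sizeof(float));  /* zeroed tail */
    if (buf != NULL)
      memcpy(buf, coeffs, n * sizeof(float));
    return buf;
  }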

GPU Traversal of Atom Type, Basis Set, Shell Type, and Wavefunction Coefficients
- Atom type, basis set, and shell type data are constant for all MOs and all timesteps: monotonically increasing memory references
- Wavefunction coefficients are different at each timestep and for each MO: strictly sequential memory references
- Loop iterations always access the same or consecutive array elements for all threads in a thread block; this yields good constant memory cache performance and increases shared memory tile reuse

Use of GPU On-chip Memory
- If the total data is less than 64 kB, use only constant memory: it broadcasts data to all threads, with no global memory accesses!
- For larger data, shared memory is used as a program-managed cache, with coefficients loaded on demand (see the sketch below):
  - Tiles are sized large enough to service entire inner loop runs, and are broadcast to all 64 threads in a block
  - Complications: nested loops, multiple arrays, varying lengths
  - The key to performance is to locate the tile-loading checks outside of the two performance-critical inner loops
  - Only 27% slower than the hardware caching provided by constant memory (GT200)
- Next-generation Fermi GPUs will provide larger on-chip shared memory, L1/L2 caches, and reduced control overhead
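
The sketch below illustrates the program-managed cache pattern under simplified assumptions (a flat coefficient array and a trivial inner loop; the kernel and variable names are hypothetical, not VMD's): all 64 threads of an 8x8 block cooperatively load each tile, and the refill logic sits outside the inner loop that consumes the coefficients.

  #include <cuda_runtime.h>

  #define TILESIZE 1024   // power of two, multiple of the 64-byte coalescing size

  __global__ void tiled_consume(const float *g_coeff, int ncoeff, float *out) {
    __shared__ float s_coeff[TILESIZE];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0..63 in an 8x8 block
    float acc = 0.0f;

    for (int tilestart = 0; tilestart < ncoeff; tilestart += TILESIZE) {
      int len = min(TILESIZE, ncoeff - tilestart);
      __syncthreads();                                  // previous tile finished
      for (int i = tid; i < len; i += 64)               // cooperative, coalesced load
        s_coeff[i] = g_coeff[tilestart + i];
      __syncthreads();                                  // tile resident

      for (int i = 0; i < len; i++)                     // inner loop: shared mem only
        acc += s_coeff[i];
    }
    if (tid == 0)
      out[blockIdx.y * gridDim.x + blockIdx.x] = acc;
  }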

[Figure: a coefficient array in GPU global memory, with one tile loaded into GPU shared memory. The tile size is a power of two and a multiple of the coalescing size, allowing simple indexing in the inner loops (array indices are merely offset for reference within the loaded tile). Full tile padding and 64-byte memory coalescing block boundaries are shown; the surrounding data is unreferenced by the next batch of loop iterations.]

VMD MO Performance Results for C60
Sun Ultra 24: Intel Q6600, NVIDIA GTX 280

Kernel                  Cores/GPUs   Runtime (s)   Speedup
CPU ICC-SSE                  1          46.58        1.00
CPU ICC-SSE                  4          11.74        3.97
CPU ICC-SSE-approx**         4           3.76        12.4
CUDA-tiled-shared            1           0.46        100.
CUDA-const-cache             1           0.37        126.
CUDA-const-cache-JIT*        1           0.27        173. (JIT 40% faster)

C60 basis set 6-31Gd. We used an unusually high-resolution MO grid for accurate timings; a more typical calculation has 1/8th the grid points.
* Runtime-generated JIT kernel compiled using batch-mode CUDA tools
** Reduced-accuracy approximation of expf(); cannot be used for zero-valued MO isosurfaces

Kernel Performance Evaluation: Molekel, MacMolPlt, and VMD
Sun Ultra 24: Intel Q6600, NVIDIA GTX 280

                       Cores/GPUs   C60-A    C60-B     Thr-A    Thr-B     Kr-A     Kr-B
Atoms                                 60       60        17       17        1        1
Basis funcs (unique)              300 (5)  900 (15)  49 (16)  170 (59)  19 (19)  84 (84)

Speedup vs. Molekel on 1 CPU core:
Molekel                    1*        1.0      1.0       1.0      1.0      1.0      1.0
MacMolPlt                  4         2.4      2.6       2.1      2.4      4.3      4.5
VMD GCC-cephes             4         3.2      4.0       3.0      3.5      4.3      6.5
VMD ICC-SSE-cephes         4        16.8     17.2      13.9     12.6     17.3     21.5
VMD ICC-SSE-approx**       4        59.3     53.4      50.4     49.2     54.8     69.8
VMD CUDA-const-cache       1       552.3    533.5     355.9    421.3    193.1    571.6

VMD Orbital Dynamics Proof of Concept
One GPU can compute and animate this movie on-the-fly!
CUDA const-cache kernel, Sun Ultra 24, GeForce GTX 285

GPU MO grid calculation                               0.016 s
CPU surface gen, volume gradient, and GPU rendering   0.033 s
Total runtime                                         0.049 s
Frame rate                                            20 FPS

[Movie: threonine]
With GPU speedups over 100x, the previously insignificant CPU surface generation, gradient calculation, and rendering are now 66% of the runtime. GPU-accelerated surface generation is needed next.

Multi-GPU Load Balance
- All new NVIDIA cards support CUDA, so a typical machine may have a diversity of GPUs of varying capability
- Static decomposition works poorly for non-uniform workloads or diverse GPUs, e.g. with 2-SM, 16-SM, and 30-SM devices in the same machine
- VMD uses a multithreaded dynamic GPU work distribution and error-handling system (sketched below)
[Figure: work queue feeding GPU 1 (2 SMs) through GPU 3 (30 SMs)]
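
A minimal sketch of such a dynamic scheduler is shown here (illustrative only, not VMD's actual work distribution system): one host thread per GPU pulls work units from a shared atomic counter, so faster GPUs naturally claim more work, and a failed unit can be rerouted rather than lost.

  #include <cuda_runtime.h>
  #include <atomic>
  #include <thread>
  #include <vector>

  static std::atomic<int> next_slice(0);          // shared dynamic work counter

  static void gpu_worker(int dev, int numslices) {
    cudaSetDevice(dev);
    for (;;) {
      int slice = next_slice.fetch_add(1);        // grab the next unit of work
      if (slice >= numslices) break;
      // ... launch the per-slice kernel here; on error, hand the slice to a
      //     CPU fallback or retry queue (error handling elided) ...
    }
  }

  int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    const int numslices = 256;                    // illustrative workload size

    std::vector<std::thread> pool;                // one host thread per GPU
    for (int d = 0; d < ngpus; d++)
      pool.emplace_back(gpu_worker, d, numslices);
    for (auto &t : pool) t.join();
    return 0;
  }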

VMD Multi-GPU Molecular Orbital Performance Results for C60
Intel Q6600 CPU, 4x Tesla C1060 GPUs

Kernel             Cores/GPUs   Runtime (s)   Speedup   Parallel Efficiency
CPU-ICC-SSE             1          46.580       1.00          100%
CPU-ICC-SSE             4          11.740       3.97           99%
CUDA-const-cache        1           0.417       112           100%
CUDA-const-cache        2           0.220       212            94%
CUDA-const-cache        3           0.151       308            92%
CUDA-const-cache        4           0.113       412            92%

Uses a persistent thread pool to avoid GPU initialization overhead; a dynamic scheduler distributes work to the GPUs.

VMD Multi-GPU Molecular Orbital Performance Results for C60 Using Mapped Host Memory
Intel Q6600 CPU, 3x Tesla C1060 GPUs

Kernel                                    Cores/GPUs   Runtime (s)   Speedup
CPU-ICC-SSE                                    1          46.580       1.00
CPU-ICC-SSE                                    4          11.740       3.97
CUDA-const-cache                               3           0.151       308.
CUDA-const-cache w/ mapped host memory         3           0.137       340.

The GPU kernel writes its output directly to host memory, so no extra cudaMemcpy() calls are needed to fetch the results. See cudaHostAlloc() + cudaHostGetDevicePointer().
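
The mapped (zero-copy) output path boils down to the calls sketched below; the buffer size and the commented-out kernel launch are placeholders for the real MO kernel and map dimensions.

  #include <cuda_runtime.h>
  #include <stdio.h>
  #include <string.h>

  int main() {
    const size_t nbytes = 1024 * 1024 * sizeof(float);
    float *h_map = NULL, *d_map = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);                   // enable host mapping
    cudaHostAlloc((void **)&h_map, nbytes, cudaHostAllocMapped);
    memset(h_map, 0, nbytes);                                // host-visible output buffer
    cudaHostGetDevicePointer((void **)&d_map, h_map, 0);     // device-side alias

    // mo_kernel<<<grid, threads>>>(..., d_map);             // kernel writes host memory
    cudaDeviceSynchronize();                                 // results now visible in h_map

    printf("first output value: %f\n", h_map[0]);
    cudaFreeHost(h_map);
    return 0;
  }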

NAMD: Molecular Dynamics on GPUs
Research/gpu/
Research/namd/

NAMD Highlights
- 2002 Gordon Bell Award
- 34,000 users, 1,200 citations
- Blue Waters target application
- GPU acceleration on NVIDIA Tesla (NCSA Lincoln)
[Images: ATP synthase, PSC Lemieux, Computational Biophysics Summer School, Illinois Petascale Computing Facility]

NAMD 2.7b2 w/ CUDA Released
CUDA-enabled NAMD binaries for 64-bit Linux are available on the NAMD web site now!
Research/namd/2.7b2/announce.html
Direct download link: Development/Download/download.cgi?PackageName=NAMD

NCSA Lincoln Cluster Performance (8 Intel cores and 2 NVIDIA Tesla GPUs per node)
[Figure: STMV (1M atoms) benchmark; seconds/step vs. CPU core count (8 to 64) with 2, 4, 8, and 16 GPUs; 2 GPUs = 24 cores]
Phillips, Stone, Schulten, SC2008

GPU-Accelerated NAMD Plans
Serial performance:
- Target NVIDIA Fermi architecture
- Revisit GPU kernel design decisions made in 2007
- Improve performance of the remaining CPU code
Parallel scaling:
- Target NSF Track 2D Keeneland cluster at ORNL
- Finer-grained work units on the GPU (a feature of Fermi)
- One process per GPU, one thread per CPU core
- Dynamic load balancing of GPU work
- Wider range of simulation options and features

Acknowledgements
Additional information and references: Research/gpu/
Questions, source code requests: John Stone: johns@ks.uiuc.edu
Acknowledgements:
- J. Phillips, D. Hardy, J. Saam, UIUC Theoretical and Computational Biophysics Group
- Prof. Wen-mei Hwu, Christopher Rodrigues, UIUC IMPACT Group
- CUDA team at NVIDIA
- UIUC NVIDIA CUDA Center of Excellence
- NIH support: P41-RR05969

Publications Research/gpu/
- Probing Biomolecular Machines with Graphics Processors. J. Phillips, J. Stone. Communications of the ACM, 52(10):34-41, 2009.
- GPU Clusters for High Performance Computing. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu. Workshop on Parallel Programming on Accelerator Clusters (PPAC), IEEE Cluster 2009. In press.
- Long time-scale simulations of in vivo diffusion using GPU hardware. E. Roberts, J. Stone, L. Sepulveda, W. Hwu, Z. Luthey-Schulten. IPDPS'09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Computing, pp. 1-8, 2009.
- High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten. 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2), ACM International Conference Proceeding Series, volume 383, pp. 9-18, 2009.
- Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.

Publications (cont) Research/gpu/
- Adapting a message-driven parallel application to GPU-accelerated clusters. J. Phillips, J. Stone, K. Schulten. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.
- GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference on Computing Frontiers, pp. 273-282, 2008.
- GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96:879-899, 2008.
- Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.
- Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93:4006-4017, 2007.