Accelerating Molecular Modeling Applications with Graphics Processors

Similar documents
Accelerating Scientific Applications with GPUs

Accelerating Computational Biology by 100x Using CUDA. John Stone Theoretical and Computational Biophysics Group, University of Illinois

GPU Acceleration of Molecular Modeling Applications

Multilevel Summation of Electrostatic Potentials Using GPUs

ECE 498AL. Lecture 21-22: Application Performance Case Studies: Molecular Visualization and Analysis

Using GPUs to compute the multilevel summation of electrostatic forces

GPU Particle-Grid Methods: Electrostatics

Faster, Cheaper, Better: Biomolecular Simulation with NAMD, VMD, and CUDA

S05: High Performance Computing with CUDA. Case Study: Molecular Visualization and Analysis

GPU Accelerated Visualization and Analysis in VMD

John E. Stone. S03: High Performance Computing with CUDA. Heterogeneous GPU Computing for Molecular Modeling

In-Situ Visualization and Analysis of Petascale Molecular Dynamics Simulations with VMD

High Performance Molecular Visualization and Analysis with GPU Computing

Accelerating Molecular Modeling Applications with GPU Computing

GPU-Accelerated Visualization and Analysis of Petascale Molecular Dynamics Simulations

High Performance Molecular Simulation, Visualization, and Analysis on GPUs

Simulating Life at the Atomic Scale

VSCSE Summer School Lecture 8: Application Case Study Accelerating Molecular Dynamics Experimentation

Accelerating NAMD with Graphics Processors

Analysis and Visualization Algorithms in VMD

NAMD, CUDA, and Clusters: Taking GPU Molecular Dynamics Beyond the Deskop

Accelerating Molecular Modeling Applications with GPU Computing

Graphics Processor Acceleration and YOU

Experience with NAMD on GPU-Accelerated Clusters

GPU Histogramming: Radial Distribution Functions

GPU-Accelerated Analysis of Petascale Molecular Dynamics Simulations

Case Study: Molecular Modeling Applications John E. Stone

High Performance Molecular Visualization and Analysis on GPUs

Improving NAMD Performance on Volta GPUs

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Visualization and Analysis of Petascale Molecular Dynamics Simulations

Fast Molecular Electrostatics Algorithms on GPUs

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies

S5386 Publication-Quality Ray Tracing of Molecular Graphics with OptiX

Lecture 1: Introduction and Computational Thinking

NAMD Serial and Parallel Performance

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

GPU-Accelerated Analysis of Large Biomolecular Complexes

Early Experiences Porting the NAMD and VMD Molecular Simulation and Analysis Software to GPU-Accelerated OpenPOWER Platforms

Early Experiences Scaling VMD Molecular Visualization and Analysis Jobs on Blue Waters

S4410: Visualization and Analysis of Petascale Molecular Simulations with VMD

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO

Programming in CUDA. Malik M Khan

Lecture 5: Input Binning

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

High Performance Computing on GPUs using NVIDIA CUDA

CS 668 Parallel Computing Spring 2011

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

NAMD GPU Performance Benchmark. March 2011

Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories

High Performance Computing with Accelerators

Parallel Computing 35 (2009) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Refactoring NAMD for Petascale Machines and Graphics Processors

Using GPUs to Supercharge Visualization and Analysis of Molecular Dynamics Simulations with VMD

Mattan Erez. The University of Texas at Austin

The Parallel Revolution in Computational Science and Engineering

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

CUDA Experiences: Over-Optimization and Future HPC

Performance Diagnosis for Hybrid CPU/GPU Environments

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

General Purpose GPU Computing in Partial Wave Analysis

S6261 VMD+OptiX: Streaming Interactive Ray Tracing from Remote GPU Clusters to Your VR Headset

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Technology for a better society. hetcomp.com

GPU-Accelerated Molecular Visualization and Analysis with VMD

Benchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.

Accelerating CFD with Graphics Hardware

ENABLING NEW SCIENCE GPU SOLUTIONS

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Broadening the Use of Scalable Kernels in NAMD/VMD

NAMD at Extreme Scale. Presented by: Eric Bohm Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua, Sun, Jim Phillips, John Stone, LV Kale

Interactive Supercomputing for State-of-the-art Biomolecular Simulation

Strong Scaling for Molecular Dynamics Applications

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

GPUs and GPGPUs. Greg Blanton John T. Lubia

ECE 8823: GPU Architectures. Objectives

Accelerating Biomolecular Modeling with CUDA and GPU Clusters

Applications of Berkeley s Dwarfs on Nvidia GPUs

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

COMPILED HARDWARE ACCELERATION OF MOLECULAR DYNAMICS CODE. Jason Villarreal and Walid A. Najjar

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Recent Advances in Heterogeneous Computing using Charm++

Device Memories and Matrix Multiplication

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

Unrolling parallel loops

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

CS427 Multicore Architecture and Parallel Computing

Experiences Developing and Maintaining Scientific Applications on GPU-Accelerated Platforms

GPU Accelerated Visualization and Analysis in VMD and Recent NAMD Developments

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory

Center for Computational Science

2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Lecture 1: Gentle Introduction to GPUs

Transcription:

Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference on Parallel Processing for Scientific Computing, March 12, 2008

GPU Computing Commodity devices, omnipresent in modern computers Massively parallel hardware, hundreds of processing units, throughput oriented design: Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software 8x to 30x speedups common for data-parallel algorithms Host GPU Thread Execution Manager Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Cache Cache Cache Cache Cache Cache Cache Cache Texture Texture Texture Texture Texture Texture Texture Texture GPU Global Memory

Computational Biology s Insatiable Demand for Processing Power Simulations still fall short of biological timescales Large simulations extremely difficult to prepare, analyze Order of magnitude increase in performance would allow use of more sophisticated models

Accelerating Molecular Dynamics with Jim Phillips will present GPU results for NAMD later today: MS10 Atlanta A: 4:15-4:40pm Graphics Processors

Calculating Electrostatic Potential Maps Used in structure building, analysis, visualization, simulation Electrostatic potentials evaluated on a uniformly spaced 3-D lattice Each lattice point contains sum of electrostatic contributions of all atoms Positive potential field Negative potential field

Direct Coulomb Summation At each lattice point, sum potential contributions for all atoms in the simulated structure: Lattice point j being evaluated potential[j] += charge[i] / Rij Rij: distance from lattice[j] to Atom[i] Atom[i]

Direct Coulomb Summation on the GPU GPU can outrun a CPU core by 44x Work is decomposed into tens of thousands of independent threads, multiplexed onto hundreds of GPU processor cores Single-precision FP arithmetic is adequate for intended application Numerical accuracy can be further improved by compensated summation, spatially ordered summation groupings, etc Starting point for more sophisticated algorithms Host GPU Thread Execution Manager Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Cache Cache Cache Cache Cache Cache Cache Cache Texture Texture Texture Texture Texture Texture Texture Texture GPU Global Memory

Direct Coulomb Summation on the GPU Atomic Coordinates Charges Host Constant Memory GPU Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Parallel Data Cache Cache Cache Cache Cache Cache Cache Cache Texture Texture Texture Texture Texture Texture Texture Texture Global Memory

Direct Coulomb Summation Runtime GPU underutilized GPU fully utilized, ~40x faster than CPU Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.

Optimizing for the GPU Increase arithmetic intensity, reuse in-register data by unrolling lattice point computation into inner atom loop Each atom contributes to several lattice points, distances only differ in the X component: potentiala += charge[i] / (distancea to atom[i]) potentialb += charge[i] / (distanceb to atom[i]) Atom[i] Distances to Atom[i]

CUDA Block/Grid Decomposition Unrolling increases computational tile size Grid of thread blocks: Thread blocks: 64-256 threads 0,0 0,1 1,0 1,1 Threads compute up to 8 potentials. Skipping by half-warps optimizes global mem. perf. Padding waste

Direct Coulomb Summation Performance CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800GTX CPU CUDA-Simple: 14.8x faster, 33% of fastest GPU kernel GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.

GPU Application Performance (July 2007, current kernels are 20% faster...) CUDA ion placement lattice calculation performance: 82 times faster for virus (STMV) structure 110 times faster for ribosome Virus ion placement: 110 CPU-hours on SGI Altix Itanium2 Same calculation now takes 1.35 GPU-hours 27 minutes (wall clock) if three GPUs are used concurrently Satellite Tobacco Mosaic Virus (STMV) Ion Placement

Multi-GPU Direct Coulomb Summation Effective memory bandwidth scales with the number of GPUs utilized PCIe bus bandwidth not a bottleneck for this algorithm 117 billion evals/sec 863 GFLOPS 131x speedup vs. CPU core Power: 700 watts during benchmark Quad-core Intel QX6700 Three NVIDIA GeForce 8800GTX

Multi-GPU Direct Coulomb Summation 4-GPU (2 Quadroplex) Opteron node at NCSA 157 billion evals/sec 1.16 TFLOPS 176x speedup vs. Intel QX6700 CPU core w/ SSE NCSA GPU Cluster

Cutoff Summation At each lattice point, sum potential contributions for atoms within cutoff radius: if (distance to atom[i] < cutoff) potential += (charge[i] / r) * s(r) Smoothing function s(r) is algorithm dependent Cutoff radius Lattice point being evaluated r: distance to Atom[i] Atom[i]

Infinite vs. Cutoff Potentials Infinite range potential: All atoms contribute to all lattice points Summation algorithm has quadratic complexity Cutoff (range-limited) potential: Atoms contribute within cutoff distance to lattice points Summation algorithm has linear time complexity Has many applications in molecular modeling: Replace electrostatic potential with shifted form Short-range part for fast methods of approximating full electrostatics Used for fast decaying interactions (e.g. Lennard-Jones, Buckingham)

Cutoff Summation on the GPU Atoms spatially hashed into fixedsize bins in global memory Atoms Global memory Shared memory Atom bin Potential map regions Constant memory Bin-Region neighborlist Process atom bins for current potential map region

Cutoff Summation Runtime GPU cutoff with CPU overlap: 12x-21x faster than CPU core GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, 2008. In press.

GPU Performance Results, March 2008 GeForce 8800GTX w/ CUDA 1.1, Driver 169.09 Calculation / Algorithm Algorithm class Speedup vs. Intel QX6700 CPU core Fluorescence microphotolysis Iterative matrix / stencil 12x Pairlist calculation Particle pair distance test 10-11x Pairlist update Particle pair distance test 5-15x Molecular dynamics nonbonded force calculation N-body cutoff force calculations 10x Cutoff electron density sum Particle-grid w/ cutoff 15-23x Cutoff potential summation Particle-grid w/ cutoff 12-21x Direct Coulomb summation Particle-grid 44x 20x (w/ pairlist) Research/gpu/

Lessons Learned GPU algorithms need fine-grained parallelism and sufficient work to fully utilize hardware Much of GPU algorithm optimization revolves around efficient use of multiple memory systems Amdahl s Law can prevent applications from achieving peak speedup with shallow GPU acceleration efforts

Acknowledgements Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign Prof. Wen-mei Hwu, Chris Rodrigues, IMPACT Group, University of Illinois at Urbana-Champaign David Kirk and the CUDA team at NVIDIA NIH support: P41-RR05969

Publications Research/gpu/ Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007. Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93:4006-4017, 2007. GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press. GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, 2008. In press.