Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces
David J. Hardy
Theoretical and Computational Biophysics Group, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign
Research/gpu/
Multiscale Molecular Modelling, Edinburgh, June 30 - July 3, 2010

Outline
- Multilevel summation method (MSM)
- GPU architecture and kernel design considerations
- GPU kernel design alternatives for short-range non-bonded interactions
- GPU kernel for MSM long-range part
- Initial performance results for speeding up MD

Multilevel Summation Method
Fast algorithm for N-body electrostatics: calculates the sum of smoothed pairwise potentials interpolated from a hierarchical nesting of grids.
Advantages over PME (particle-mesh Ewald) and FMM (fast multipole method):
- Algorithm has linear time complexity
- Allows non-periodic or periodic boundaries
- Produces continuous forces for dynamics (advantage over FMM)
- Avoids 3D FFTs for better parallel scaling (advantage over PME)
- Permits polynomial splittings (no erfc() evaluation, as used by PME)
- Spatial separation allows use of multiple time steps
- Can be extended to other types of pairwise interactions
Skeel et al., J. Comp. Chem. 23:673-684, 2002.

MSM Main Ideas
- Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
- Smoothed potentials are interpolated from successively coarser grids.
- Finest grid spacing h and smallest cutoff distance a are doubled at each successive level.
[Figure: splitting of the 1/r potential into a part with cutoff a, a part with cutoff 2a, and a smooth remainder; the smoothed potentials are interpolated from the atoms onto the h-grid, the 2h-grid, and so on.]
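Written out, the splitting takes a telescoping form (a sketch of the standard MSM splitting described by Skeel et al.; the smoothing function gamma, the number of levels L, and this particular normalization are assumptions here, not taken from the slides):

\[
\frac{1}{r} \;=\; \Bigl(\frac{1}{r} - \frac{1}{a}\,\gamma\!\Bigl(\frac{r}{a}\Bigr)\Bigr)
\;+\; \sum_{k=1}^{L-1}\Bigl(\frac{1}{2^{k-1}a}\,\gamma\!\Bigl(\frac{r}{2^{k-1}a}\Bigr) - \frac{1}{2^{k}a}\,\gamma\!\Bigl(\frac{r}{2^{k}a}\Bigr)\Bigr)
\;+\; \frac{1}{2^{L-1}a}\,\gamma\!\Bigl(\frac{r}{2^{L-1}a}\Bigr)
\]

Here \(\gamma(\rho) = 1/\rho\) for \(\rho \ge 1\) and is a smooth approximation of \(1/\rho\) (e.g., a polynomial in \(\rho^2\)) for \(\rho < 1\), so the first term vanishes beyond cutoff a, the k-th smoothed term vanishes beyond cutoff \(2^k a\), and only the final top-level term extends over the whole domain.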

MSM Calculation
force = exact short-range part + interpolated long-range part
[Diagram of the computational steps: from the atom positions and charges, the short-range cutoff part is evaluated directly to give the short-range forces; anterpolation transfers charge to the h-grid, restriction moves it up to the 2h-grid and 4h-grid, a cutoff calculation at each grid level produces the long-range parts, prolongation carries potentials back down the hierarchy, and interpolation returns the long-range contribution to the atoms.]

GPU Computing Concepts
- Heterogeneous computing model: the CPU host manages GPU devices and memories and invokes kernels
- GPU hardware supports standard integer and floating point types and math operations, and is designed for fine-grained data parallelism
  - Hundreds of threads are grouped together into a block performing SIMD execution
  - Need hundreds of blocks (~10,000-30,000 threads) to fully utilize the hardware
- Great speedups attainable (with commodity hardware): 10x-20x over one CPU core is common, 100x or more possible in some cases
- Programming toolkits (CUDA and OpenCL) in C/C++ dialects, allowing GPUs to be integrated into legacy software
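A minimal sketch of this host/device model (a toy kernel with illustrative names and sizes, not code from the talk): the host allocates device memory, launches a grid of thread blocks with hundreds of threads each, and copies results back.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Toy kernel: each thread scales one array element.
    __global__ void scale(float *x, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main(void) {
        const int n = 1 << 20;                     // enough elements to keep the GPU busy
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;                                // hundreds of threads per block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // thousands of blocks
        scale<<<blocks, threadsPerBlock>>>(d, n, 2.0f);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %g\n", h[0]);
        cudaFree(d);
        free(h);
        return 0;
    }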

[GPU hardware architecture diagram. SP = Streaming Processor, SFU = Special Function Unit, Tex = Texture Unit]

NVIDIA Fermi New Features
- ECC memory support
- L1 (64 KB) and L2 (768 KB) caches
- 512 cores, 2x faster arithmetic (madd)
  - HOWEVER: memory bandwidth increases only 20-30%
- Support for additional functions in the SFU: erfc() for PME!
  - HOWEVER: no commensurate improvement in SFU performance
- Improved double precision performance and accuracy
- Improved integer performance
- Allows multiple kernels to execute concurrently
  - Could prove extremely important for keeping the GPU fully utilized

GPU Kernel Design (CUDA)
- Problem must offer substantial data parallelism
- Data access should use gather memory patterns (reads) rather than scatter (writes)
- Increase arithmetic intensity through effective use of the memory subsystems (see the sketch below):
  - registers: the fastest memory, but smallest in size
  - constant memory: near register speed when all threads together read a single location (fast broadcast)
  - shared memory: near register speed when all threads access without bank conflicts
  - texture cache: can improve performance for irregular memory access
  - global memory: large but slow; coalesced access for best performance
- May benefit from trading memory access for more arithmetic
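How these memory spaces appear in a CUDA kernel, as a small sketch (the kernel and all names are illustrative, not code from the talk):

    // Constant memory: cached, and near register speed when all threads in a warp
    // read the same location (broadcast).
    __constant__ float coeffs[64];

    // Assumed to be launched with 256 threads per block.
    __global__ void weightedCopy(const float *in, float *out, int n) {
        __shared__ float tile[256];          // shared memory: per-block scratch space
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = 0.0f;                      // registers: per-thread local variables
        if (i < n) x = in[i];                // global memory: coalesced when consecutive
                                             // threads read consecutive addresses
        tile[threadIdx.x] = x;
        __syncthreads();
        if (i < n) out[i] = coeffs[0] * tile[threadIdx.x];   // broadcast read of coeffs[0]
    }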

Additional GPU Considerations
- Coalescing reads and writes from global memory
- Avoiding bank conflicts when accessing shared memory
- Avoiding branch divergence
- Synchronizing threads within thread blocks
- Atomic memory operations can provide synchronization across thread blocks
- Stream management for invoking kernels and transferring memory asynchronously
- Page-locked host memory to speed up memory transfers between CPU and GPU (the last two points are sketched below)
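A minimal host-side sketch of streams and page-locked memory (the kernel, sizes, and names are assumptions for illustration; error checking is omitted):

    #include <cuda_runtime.h>

    __global__ void addOne(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;                   // stand-in for real work
    }

    void runAsync(int n) {
        float *h_buf, *d_buf;
        cudaMallocHost(&h_buf, n * sizeof(float));   // page-locked (pinned) host memory
        cudaMalloc(&d_buf, n * sizeof(float));
        for (int i = 0; i < n; i++) h_buf[i] = 0.0f;

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copies and the kernel launch are queued on the stream and proceed
        // asynchronously with respect to the CPU until the synchronize call.
        cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        addOne<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
    }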

Designing GPU Kernels for Short-range Non-bonded Forces
- Calculate both electrostatics and van der Waals interactions (need atom coordinates and parameters)
- Spatial hashing of atoms into bins (best done on the CPU; a sketch follows this list)
- Should we use pairlists?
  - Reduces computation, but increases and delocalizes memory access
- Should we make use of Newton's 3rd law to reduce work?
- Is single precision enough? Do we need double precision?
- How might we handle non-bonded exclusions?
  - Detect and omit excluded pairs (use bit masks)
  - Ignore them, then fix on the CPU (use force clamping)
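As an illustration of the spatial hashing step, a minimal host-side sketch (the packed coordinate layout, cubic box, and all names are assumptions, and coordinates are taken to lie in [0, boxLen)):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Assign each atom to a bin of side binLen on a uniform grid covering the box.
    // Returns, for each bin, the indices of the atoms it contains.
    std::vector<std::vector<int>> hashAtomsIntoBins(
        const std::vector<float> &xyz,    // packed as x0,y0,z0, x1,y1,z1, ...
        float boxLen, float binLen)
    {
        int nb = (int)std::ceil(boxLen / binLen);       // bins per dimension
        std::vector<std::vector<int>> bins(nb * nb * nb);
        int natoms = (int)xyz.size() / 3;
        for (int i = 0; i < natoms; i++) {
            int bx = std::min((int)(xyz[3*i  ] / binLen), nb - 1);
            int by = std::min((int)(xyz[3*i+1] / binLen), nb - 1);
            int bz = std::min((int)(xyz[3*i+2] / binLen), nb - 1);
            bins[(bz * nb + by) * nb + bx].push_back(i);
        }
        return bins;
    }

The bins would then be flattened to a fixed depth (padding with empty slots) before being copied to the GPU, as described on the later short-range kernel slides.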

Designing GPU Kernels for Short-range Non-bonded Forces
- How do we map work to the GPU threads?
  - Fine-grained: assign threads to sum forces on atoms
  - Extremely fine-grained: assign threads to pairwise interactions
- How do we decompose work into thread blocks?
  - Non-uniform: assign thread blocks to bins
  - Uniform: assign thread blocks to entries of the force matrix
- How do we compute potential energies or the virial?
- How do we calculate expensive functional forms?
  - PME requires erfc(): is it faster to use an interpolation table?
- Other issues: supporting NBFix parameters

GPU Kernel for Short-range MSM
- CPU sorts atoms into bins and copies the bins to GPU global memory
- Each bin is assigned to a thread block
- Threads are assigned to individual atoms
- Each thread block loops over the surrounding neighborhood of bins, summing forces and energies from their atoms
- The MSM calculation involves rsqrt() plus several multiplies and adds
- CPU copies forces and energies back from GPU global memory
[Diagram: the bins of atoms reside in global memory; the array of bin index offsets ("find index offset for my neighbor bin") resides in constant memory; each neighbor bin is copied into shared memory; each thread keeps its atom's forces in registers.]

GPU Kernel for Short-range MSM
- Each thread accumulates its atom's force and energies in registers
- Bin neighborhood index offsets are stored in constant memory
- Atom bin data is loaded into shared memory; the atom data and bin depth are carefully chosen to permit coalesced reads from global memory
- Check for and omit excluded pairs
- The thread block performs a sum reduction of the energies
- Coalesced writes of forces and energies (with padding) to GPU global memory
- CPU sums the energies from the bins
[Diagram as on the previous slide: constant-memory array of bin index offsets, shared-memory bin of atoms, per-thread forces in registers, global memory.]
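A heavily simplified sketch of such a kernel (electrostatics only, no exclusion checks or energy reduction; the fixed bin depth of 8, the 3x3x3 neighborhood, the float4 x/y/z/charge layout, and all names are illustrative assumptions rather than the actual NAMD-Lite code):

    #include <cuda_runtime.h>

    #define BIN_DEPTH 8      // assumed fixed number of atom slots per bin
    #define NBR_BINS  27     // assumed 3x3x3 bin neighborhood

    // Offsets (in atom slots) from a bin's first slot to each neighbor bin's first slot.
    __constant__ int binOffsets[NBR_BINS];

    // One thread block per bin, one thread per atom slot; empty slots carry zero charge.
    __global__ void shortRangeForces(const float4 *atoms,   // x, y, z, charge
                                     float4 *forces,        // fx, fy, fz, energy
                                     float cutoff2)
    {
        __shared__ float4 nbrBin[BIN_DEPTH];

        int myIndex = blockIdx.x * BIN_DEPTH + threadIdx.x;
        float4 me = atoms[myIndex];               // my atom stays in registers
        float fx = 0.f, fy = 0.f, fz = 0.f, en = 0.f;

        for (int n = 0; n < NBR_BINS; n++) {
            // Find index offset for my neighbor bin (broadcast from constant memory),
            // then cooperatively copy that bin into shared memory.
            int nbrFirst = blockIdx.x * BIN_DEPTH + binOffsets[n];
            __syncthreads();
            nbrBin[threadIdx.x] = atoms[nbrFirst + threadIdx.x];
            __syncthreads();

            for (int j = 0; j < BIN_DEPTH; j++) {
                float dx = nbrBin[j].x - me.x;
                float dy = nbrBin[j].y - me.y;
                float dz = nbrBin[j].z - me.z;
                float r2 = dx*dx + dy*dy + dz*dz;
                if (r2 > 0.f && r2 < cutoff2) {
                    float inv_r = rsqrtf(r2);     // rsqrt() plus several multiplies and adds
                    float e = me.w * nbrBin[j].w * inv_r;   // (MSM smoothing terms omitted)
                    float f = e * inv_r * inv_r;
                    fx -= f * dx;  fy -= f * dy;  fz -= f * dz;
                    en += e;
                }
            }
        }
        forces[myIndex] = make_float4(fx, fy, fz, en);
    }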

NAMD Hybrid Decomposition
- Spatially decompose data and communication
- Separate but related work decomposition
- Compute objects facilitate an iterative, measurement-based load balancing system
Kale et al., J. Comp. Phys. 151:283-312, 1999.

NAMD Non-bonded Forces on GPU
- Decompose work into pairs of patches (bins), identical to the NAMD structure
- Each patch pair is calculated by an SM (thread block)
[Diagram: force computation on a single multiprocessor (the GeForce 8800 GTX has 16). A 32-way SIMD multiprocessor runs 32-256 multiplexed threads with 32 kB of registers; 16 kB of shared memory holds Patch A coordinates and parameters; the texture unit (8 kB cache) performs force table interpolation; constants and exclusions sit behind an 8 kB cache; Patch B coords, params, and forces reside in 768 MB of main memory (no cache, 300+ cycle latency).]
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

MSM Grid Interactions
- Potential is summed from grid point charges within the cutoff
- Uniform spacing enables distance-based interactions to be precomputed as a stencil of weights
- Weights at each level are identical up to a scaling factor
- Calculate as a 3D convolution of weights
  - stencil size up to 23x23x23
[Figure: sphere of grid point charges within the cutoff radius accumulating potential at a grid point.]
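Before mapping it to the GPU, the convolution itself can be stated as a plain host-side sketch (the array names, flat indexing, and non-periodic boundary handling are illustrative assumptions):

    // Accumulate potential at each grid point from charges within the stencil.
    // q and pot are nx*ny*nz arrays; w is the (2R+1)^3 stencil of precomputed weights.
    void gridCutoff(const float *q, float *pot, int nx, int ny, int nz,
                    const float *w, int R)
    {
        int wdim = 2 * R + 1;
        for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++) {
            float sum = 0.f;
            for (int dk = -R; dk <= R; dk++)
            for (int dj = -R; dj <= R; dj++)
            for (int di = -R; di <= R; di++) {
                int ii = i + di, jj = j + dj, kk = k + dk;
                if (ii < 0 || ii >= nx || jj < 0 || jj >= ny || kk < 0 || kk >= nz)
                    continue;                    // non-periodic: skip points outside the grid
                float wt = w[((dk + R) * wdim + (dj + R)) * wdim + (di + R)];
                sum += wt * q[(kk * ny + jj) * nx + ii];
            }
            pot[(k * ny + j) * nx + i] += sum;
        }
    }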

MSM Grid Interactions on GPU
- Store weights in constant memory (padded up to the next multiple of 4)
- Each thread block calculates a 4x4x4 region of potentials (stored contiguously)
- Pack all regions over all levels into a 1D array (each level padded with a zero-charge region)
- Store a map of level array offsets in constant memory
- The kernel has each thread block loop over the surrounding regions of charge, cooperatively loading them into shared memory and multiplying by weights from constant memory
- All grid levels are calculated concurrently, scaled by the level factor (keeps the GPU from running out of work at the upper grid levels)
[Diagram: grid charge regions in global memory; a subset of charge regions staged in shared memory; the stencil of weights in constant memory; grid potential regions written back to global memory.]
Hardy et al., J. Paral. Comp. 35:164-177, 2009.

Apply Weights Using a Sliding Window
- The thread block must collectively use the same value from constant memory
- Read 8x8x8 grid charges (8 regions) into shared memory
- A window of size 4x4x4 maintains the same relative distances
- Slide the window by 4 shifts along each dimension
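A simplified single-level sketch of such a kernel (one 4x4x4 potential region per 64-thread block, an 8x8x8 shared-memory tile of charges, and weights in constant memory; the flat indexing, the omission of the packed multi-level array, and all names are illustrative assumptions, not the published implementation):

    #include <cuda_runtime.h>

    #define R    11                     // stencil radius (stencil up to 23x23x23)
    #define WDIM (2 * R + 1)

    __constant__ float wt[WDIM * WDIM * WDIM];   // stencil of weights for one level

    // Launch with dim3 grid(nx, ny, nz) blocks of 64 threads, where nx, ny, nz count
    // 4x4x4 regions; the full grid then has (4*nx) x (4*ny) x (4*nz) points.
    __global__ void latticeCutoff(const float *charge, float *potential,
                                  int nx, int ny, int nz)
    {
        // This thread's potential point within the full grid.
        int px = blockIdx.x * 4 + (threadIdx.x & 3);
        int py = blockIdx.y * 4 + ((threadIdx.x >> 2) & 3);
        int pz = blockIdx.z * 4 + (threadIdx.x >> 4);
        int gx = nx * 4, gy = ny * 4, gz = nz * 4;

        __shared__ float q[8 * 8 * 8];   // 8x8x8 tile of grid charges (8 regions)
        float sum = 0.f;

        // Loop over 8x8x8 charge tiles covering the stencil extent around this region.
        for (int tz = blockIdx.z * 4 - R - 3; tz <= blockIdx.z * 4 + R; tz += 8)
        for (int ty = blockIdx.y * 4 - R - 3; ty <= blockIdx.y * 4 + R; ty += 8)
        for (int tx = blockIdx.x * 4 - R - 3; tx <= blockIdx.x * 4 + R; tx += 8) {
            // Cooperative load: each of the 64 threads loads 8 charges.
            __syncthreads();
            for (int s = threadIdx.x; s < 8 * 8 * 8; s += 64) {
                int cx = tx + (s & 7), cy = ty + ((s >> 3) & 7), cz = tz + (s >> 6);
                bool inside = (cx >= 0 && cx < gx && cy >= 0 && cy < gy && cz >= 0 && cz < gz);
                q[s] = inside ? charge[(cz * gy + cy) * gx + cx] : 0.f;
            }
            __syncthreads();

            // Accumulate weighted charges from this tile. (The sliding-window scheme
            // above reorganizes this loop so all threads read the same weight at once;
            // here each thread indexes the stencil directly for clarity.)
            for (int s = 0; s < 8 * 8 * 8; s++) {
                int dx = tx + (s & 7) - px;
                int dy = ty + ((s >> 3) & 7) - py;
                int dz = tz + (s >> 6) - pz;
                if (dx >= -R && dx <= R && dy >= -R && dy <= R && dz >= -R && dz <= R)
                    sum += wt[((dz + R) * WDIM + (dy + R)) * WDIM + (dx + R)] * q[s];
            }
        }
        potential[(pz * gy + py) * gx + px] = sum;
    }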

Initial Results
(GPU: NVIDIA GTX-285, using CUDA 3.0; CPU: 2.4 GHz Intel Core 2 Q6600 quad core)
Box of 21950 flexible waters, 12 Å cutoff, 1 ps simulation

                      CPU only                               with GPU                           Speedup vs. NAMD/CPU
NAMD with PME         1199.8 s                               210.5 s                            5.7x
NAMD-Lite with MSM    5183.3 s (4598.6 short, 572.23 long)   176.6 s (93.9 short, 63.1 long)    6.8x (19% faster than NAMD/GPU)

Concluding Remarks
Redesign of MD algorithms for GPUs is important:
- Commodity GPUs offer great speedups for well-constructed codes
- CPU single-core performance is not improving
- Multicore CPUs also benefit from the redesign efforts (e.g., OpenCL can target multicore CPUs)
MSM offers advantages that benefit GPUs:
- Short-range splitting using a polynomial of r^2
- Calculation on uniform grids
More investigation is needed into alternative designs for MD algorithms.

Acknowledgments
- Bob Skeel (Purdue University)
- John Stone (University of Illinois at Urbana-Champaign)
- Jim Phillips (University of Illinois at Urbana-Champaign)
- Klaus Schulten (University of Illinois at Urbana-Champaign)
- Funding from NSF and NIH