GPU Histogramming: Radial Distribution Functions

GPU Histogramming: Radial Distribution Functions
John Stone
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign
Research/gpu/
Workshop on GPU Programming for Molecular Modeling, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, August 7, 2010

Radial Distribution Function
Describes how atom density varies with distance
Decays toward unity with increasing distance, for liquids
Sharp peaks appear for solids, according to crystal structure, etc.
Quadratic time complexity O(N^2)
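
The definition is not written out in the transcript; a standard form of the A-B pair distribution, given here for reference under the usual convention (V is the system volume, N_A and N_B the selection sizes, r_ij the pair distances):

\[
g_{AB}(r) = \frac{V}{4\pi r^2 N_A N_B} \left\langle \sum_{i \in A} \sum_{j \in B} \delta(r - r_{ij}) \right\rangle
\]

Dividing each bin count by the expected ideal-gas count for its spherical shell is what makes g(r) approach 1 at large r for a liquid.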

Computing RDFs
Compute distances for all pairs of atoms between groups A and B (A and B may be the same, or different)
Use nearest image convention for periodic systems
Each atom pair distance is inserted into a histogram
Histogram is normalized one of several ways depending on use, but usually according to the volume of the spherical shells associated with each histogram bin
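
The shell-volume normalization is only described in words above; a minimal host-side sketch of one common variant (ideal-gas normalization, uniform bin width; the function and variable names are illustrative, not VMD's):

#include <math.h>

/* Convert raw pair counts into g(r) by dividing each bin by the expected
 * ideal-gas pair count for the spherical shell that bin represents. */
void normalize_rdf(const unsigned long long *hist, double *gofr, int nbins,
                   double rmin, double delr, double boxvol,
                   long natoms_a, long natoms_b, long nframes) {
  double pairdensity = (double) natoms_a * natoms_b / boxvol;  /* ideal-gas pair density */
  int i;
  for (i=0; i<nbins; i++) {
    double rlo = rmin + i * delr;
    double rhi = rlo + delr;
    double shellvol = (4.0 / 3.0) * M_PI * (rhi*rhi*rhi - rlo*rlo*rlo);
    double expected = pairdensity * shellvol * nframes;        /* expected count for this bin */
    gofr[i] = (expected > 0.0) ? hist[i] / expected : 0.0;
  }
}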

Histograms
Since the RDF algorithm contains histogramming as a key component, it may be interesting for other applications
Volumetric density map processing tools for crystallography and cryo-EM often make use of histograms
Useful for rapidly estimating quantiles, eliminating outlier data from visualizations, etc.

Computing RDFs on CPUs
Atom data can be traversed in a strictly consecutive access pattern, yielding good cache utilization
Since RDF histograms are usually small to moderate in size, they normally fit entirely in L2 cache
Performance tends to be limited primarily by the performance of the histogram update step

Histogramming
Partition a population of data values into discrete bins
Compute by traversing the input population and incrementing bin counters
[Figure: atom pair distance histogram (normalized)]

Histogramming on the CPU (slow-and-simple C)

memset(histogram, 0, sizeof(histogram));
for (i=0; i<numdata; i++) {
  float val = data[i];
  if (val >= minval && val <= maxval) {
    int bin = (val - minval) / bindelta;
    histogram[bin]++;     /* random access updates to histogram array */
  }
}

What About x86 SSE for RDF Histogramming?
Atom pair distances can be computed four at a time without too much difficulty
Current generation x86 CPUs don't provide SSE instructions to allow individual SIMD units to scatter results to arbitrary memory locations
Since the scatter operation must be done with scalar code, the histogram updates become the performance bottleneck for the CPU

Parallel Histogramming on Multi-core CPUs
Parallel updates to a single histogram create a potential output conflict
CPUs have atomic increment instructions, but they often take hundreds of clock cycles and are not intended for this purpose
For small numbers of CPU cores, it is best to replicate the histogram for each CPU thread, compute them independently, and combine the separate histograms in a final reduction step
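
The replicate-then-reduce scheme described above is easy to express with any threading layer; a minimal sketch using OpenMP for illustration (VMD uses its own thread pool, and all names here are illustrative assumptions):

#include <string.h>
#include <omp.h>

#define NBINS 1024   /* assumed histogram size */

void histogram_parallel(const float *data, long numdata,
                        float minval, float maxval, float bindelta,
                        unsigned long long *histogram /* NBINS entries */) {
  memset(histogram, 0, NBINS * sizeof(unsigned long long));

  #pragma omp parallel
  {
    unsigned long long local[NBINS];           /* per-thread private histogram */
    memset(local, 0, sizeof(local));
    long i;

    #pragma omp for
    for (i=0; i<numdata; i++) {
      float val = data[i];
      if (val >= minval && val <= maxval) {
        int bin = (int)((val - minval) / bindelta);
        local[bin]++;                          /* no contention: private copy */
      }
    }

    #pragma omp critical                       /* final reduction: combine private copies */
    {
      int b;
      for (b=0; b<NBINS; b++)
        histogram[b] += local[b];
    }
  }
}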

VMD Multi-core CPU RDF Implementation
Each CPU worker thread processes a subset of atom pair distances, maintaining its own histogram
Threads acquire tiles of work from a dynamic work scheduler built into VMD
When all threads have completed their histograms, the main thread combines the independently computed histograms into a final result histogram
CPUs compute the entire histogram in a single pass, regardless of the problem size or number of histogram bins

Computing RDFs on the GPU
Need tens of thousands of independent threads
Each GPU thread computes one or more atom pair distances
Histograms are best stored in fast on-chip shared memory
Size of shared memory severely constrains the range of viable histogram update techniques
Performance is limited by the speed of histogramming
Fast CUDA implementation ~70x faster than CPU

GPU Work Decomposition
Process atoms in group A by loading blocks of atoms into shared memory and processing them
Each thread computes all atom pair distances between its current atom and all atoms in constant memory, inserting each distance into the histogram

Computing Atom Pair Distances on the GPU
Distances are computed using the nearest image convention in the case of periodic boundary conditions
Since all atom pair combinations will ultimately be computed, the memory access pattern is relatively simple
The primary consideration is amplification of effective memory bandwidth, through use of GPU on-chip shared memory, caches, and broadcast of data to multiple or all threads in a thread block

GPU Atom Pair Distance Calculation
Divide A and B atom selections into fixed-size blocks
Load a large block of A into constant memory
Load a small block of B into the thread block's registers
Each thread in a thread block computes atom pair distances between its atom and all atoms in constant memory, incrementing appropriate histogram bins until all A/B block atom pairs are processed
Next block(s) are loaded, repeating until done

GPU Histogramming
Tens of thousands of threads concurrently computing atom pair distances
Far too many threads for a simple per-thread histogram approach like the CPU's
Viable approach: per-warp histograms
Fixed-size shared memory limits the maximum histogram size that can be computed in a single pass
Large histograms require multiple passes

Intra-warp Atomic Update for Old NVIDIA GPUs (Compute 1.0, 1.1)

__device__ void adddata(volatile unsigned int *s_warphist,
                        unsigned int data, unsigned int threadtag) {
  unsigned int count;
  do {
    /* read the current count, masking off the tag bits */
    count = s_warphist[data] & 0x07FFFFFFUL;
    /* store a 5-bit thread ID tag in the upper bits of the counter */
    count = threadtag | (count + 1);
    s_warphist[data] = count;
    /* loop until we were the winning thread to update the value */
  } while (s_warphist[data] != count);
}

Per-warp Histogram Approach
Each warp maintains its own independent histogram in on-chip shared memory
Each thread in the warp computes an atom pair distance and updates a histogram bin in parallel
Conflicting histogram bin updates are resolved using one of two schemes:
- Shared memory write combining with the thread-tagging technique (older hardware)
- atomicAdd() to shared memory (newer hardware)
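
The transcript does not show the shared-memory layout for the per-warp scheme; a simplified CUDA sketch of the atomicAdd() variant, operating on a precomputed array of pair distances rather than computing them in-kernel, with the bin count, block size, and all names being illustrative assumptions:

#define NBINS   256                   /* assumed histogram size that fits in shared memory */
#define BLOCKSZ 256                   /* assumed threads per block */
#define WARPSZ  32
#define NWARPS  (BLOCKSZ / WARPSZ)

__global__ void perwarp_hist(const float *dist, long n, float rmin, float delr_inv,
                             unsigned int *blockhists /* gridDim.x * NBINS entries */) {
  __shared__ unsigned int warphist[NWARPS * NBINS];   /* one private histogram per warp */

  for (int i = threadIdx.x; i < NWARPS * NBINS; i += BLOCKSZ)
    warphist[i] = 0;                                  /* clear all per-warp histograms */
  __syncthreads();

  unsigned int *myhist = warphist + (threadIdx.x / WARPSZ) * NBINS;
  for (long i = blockIdx.x * BLOCKSZ + threadIdx.x; i < n; i += (long) gridDim.x * BLOCKSZ) {
    int bin = (int)((dist[i] - rmin) * delr_inv);     /* map distance to a histogram bin */
    if (bin >= 0 && bin < NBINS)
      atomicAdd(&myhist[bin], 1U);                    /* shared-memory atomic update */
  }
  __syncthreads();

  /* sum this block's per-warp copies and write its per-thread-block histogram */
  for (int b = threadIdx.x; b < NBINS; b += BLOCKSZ) {
    unsigned int sum = 0;
    for (int w = 0; w < NWARPS; w++)
      sum += warphist[w * NBINS + b];
    blockhists[blockIdx.x * NBINS + b] = sum;
  }
}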

RDF Inner Loops (Abbrev.)

// loop over all atoms in constant memory
for (iblock=0; iblock<loopmax2; iblock+=3*ncudablocks*nblock) {
  __syncthreads();
  for (i=0; i<3; i++)
    xyzi[threadIdx.x + i*nblock] = pxi[iblock + i*nblock];   // load coords
  __syncthreads();

  for (joffset=0; joffset<loopmax; joffset+=3) {
    // compute distance, PBC min image convention
    rxij = fabsf(xyzi[idxt3] - xyzj[joffset]);
    rxij2 = celld.x - rxij;
    rxij = fminf(rxij, rxij2);
    rij = rxij*rxij;
    // [ other distance components ]
    rij = sqrtf(rij + rxij*rxij);

    ibin = __float2int_rd((rij-rmin)*delr_inv);
    if (ibin<nbins && ibin>=0 && rij>rmin2) {
      atomicAdd(llhists1+ibin, 1U);
    }
  } //joffset
} //iblock

Writing/Updating Histogram in Global Memory
When a thread block completes, add the independent per-warp histograms together, and write to the per-thread-block histogram in global memory
Final reduction of all of the per-thread-block histograms stored in global memory
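
The final reduction step is only named above; a minimal sketch of one way to do it, as a second kernel that sums the per-thread-block histograms left in global memory (NBINS as in the earlier sketch; all names are illustrative assumptions):

__global__ void reduce_blockhists(const unsigned int *blockhists, int nblocks,
                                  unsigned int *finalhist /* NBINS entries */) {
  int b = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per histogram bin */
  if (b < NBINS) {
    unsigned int sum = 0;
    for (int blk = 0; blk < nblocks; blk++)
      sum += blockhists[blk * NBINS + b];          /* accumulate this bin across blocks */
    finalhist[b] = sum;
  }
}

This could equally well be done on the host after copying the per-block histograms back, which is close to what the overflow-prevention slide below describes.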

Preventing Integer Overflows
Since the all-pairs RDF calculation computes many billions of pair distances, we have to prevent integer overflow of the 32-bit histogram bin counters (the type supported by the atomicAdd() routine)
We compute the full RDF calculation in multiple kernel launches, so each kernel launch computes a partial histogram
Host routines read back the GPU results and increment 64-bit (e.g. long, or double) histogram counters in host memory after each kernel completes
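
A minimal host-side sketch of that accumulation step (function and variable names are illustrative assumptions, not VMD's):

#include <stdlib.h>
#include <cuda_runtime.h>

/* After a kernel launch fills a partial 32-bit histogram on the device, copy it
 * back and fold it into 64-bit counters so the overall tally cannot overflow. */
void accumulate_partial(const unsigned int *d_partial, /* device pointer */
                        unsigned long long *hosthist,  /* 64-bit host counters */
                        int nbins) {
  unsigned int *partial = (unsigned int *) malloc(nbins * sizeof(unsigned int));
  cudaMemcpy(partial, d_partial, nbins * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  for (int b=0; b<nbins; b++)
    hosthist[b] += partial[b];                     /* 64-bit accumulation in host memory */
  free(partial);
}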

Acknowledgements
Ben Levine and Axel Kohlmeyer at Temple University
Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign
NVIDIA CUDA Center of Excellence, University of Illinois at Urbana-Champaign
NCSA Innovative Systems Lab
The CUDA team at NVIDIA
NIH support: P41-RR05969