Accelerating Scientific Applications with GPUs

Accelerating Scientific Applications with GPUs. John Stone, Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign (Research/gpu/). Workshop on Programming Massively Parallel Processors, July 10, 2008.

GPU Computing: Commodity devices, omnipresent in modern computers. Massively parallel hardware with hundreds of processing units and a throughput-oriented design. Support for standard integer and floating-point data types, most recently double-precision floating point. Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software. 8x to 30x speedups are common for data-parallel algorithms implemented as GPU kernels. GPU algorithms are often multicore-friendly because of the attention paid to data locality and work decomposition; tools like MCUDA have already demonstrated the feasibility of retargeting GPU kernels to multicore CPUs with appropriate compiler/runtime toolkits.

Computational Biology's Insatiable Demand for Processing Power: Simulations still fall short of biological timescales. Large simulations are extremely difficult to prepare and analyze. An order-of-magnitude increase in performance would allow the use of more sophisticated models.

Calculating Electrostatic Potential Maps: Used in molecular structure building, analysis, visualization, and simulation. Electrostatic potentials are evaluated on a uniformly spaced 3-D lattice; each lattice point contains the sum of the electrostatic contributions of all atoms. (Figure: positive and negative potential fields.)

Direct Coulomb Summation: At each lattice point, sum the potential contributions of all atoms in the simulated structure: potential[j] += charge[i] / Rij, where Rij is the distance from lattice point j to atom[i].
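
As a reference for the GPU versions that follow, here is a minimal sequential sketch of this summation. The function name and data layout (flat potential lattice with spacing h, atoms packed as x, y, z, charge) are illustrative assumptions, not code from VMD or NAMD.

#include <math.h>

// Direct Coulomb summation, sequential reference sketch (illustrative names).
// lattice: nx*ny*nz potentials with spacing h; atoms: x, y, z, charge packed as 4 floats per atom.
void direct_coulomb_cpu(float *lattice, int nx, int ny, int nz, float h,
                        const float *atoms, int natoms) {
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++) {
        float x = i * h, y = j * h, z = k * h;
        float pot = 0.0f;
        for (int n = 0; n < natoms; n++) {          // every atom contributes to every point
          float dx = x - atoms[4*n];
          float dy = y - atoms[4*n+1];
          float dz = z - atoms[4*n+2];
          pot += atoms[4*n+3] / sqrtf(dx*dx + dy*dy + dz*dz);   // charge[i] / Rij
        }
        lattice[(k*ny + j)*nx + i] += pot;
      }
}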

Direct Coulomb Summation on the GPU: The GPU can outrun a CPU core by 44x. Work is decomposed into tens of thousands of independent threads, multiplexed onto hundreds of GPU processor cores. Single-precision floating-point arithmetic is adequate for the intended application; numerical accuracy can be further improved by compensated summation, spatially ordered summation groupings, or accumulation of the potential in double precision. This kernel is the starting point for more sophisticated algorithms. (Figure: GPU block diagram showing the host, the thread execution manager, the parallel data caches, the texture units, and GPU global memory.)
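
A simplified CUDA kernel in the spirit of this decomposition: one thread evaluates one lattice point of a 2-D potential slice, with atom data held in constant memory as on the next slide. The kernel name, the 4,000-atom constant-memory capacity, and the slice-at-a-time organization are assumptions made for this sketch, not the published kernel.

// Sketch: each thread sums the potential at one lattice point of a 2-D slice.
#define MAXATOMS 4000                    // 4000 float4 values fit in 64 kB of constant memory
__constant__ float4 atominfo[MAXATOMS];  // x, y, z, charge for each atom

__global__ void dcs_simple(float *slice, int nx, int ny,
                           float gridspacing, float gridz, int natoms) {
  int xi = blockIdx.x * blockDim.x + threadIdx.x;
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  if (xi >= nx || yi >= ny) return;      // guard the padded fringe
  float x = xi * gridspacing;
  float y = yi * gridspacing;
  float energy = 0.0f;
  for (int n = 0; n < natoms; n++) {     // all atoms contribute to every lattice point
    float dx = x - atominfo[n].x;
    float dy = y - atominfo[n].y;
    float dz = gridz - atominfo[n].z;
    energy += atominfo[n].w * rsqrtf(dx*dx + dy*dy + dz*dz);   // q / r
  }
  slice[yi * nx + xi] += energy;
}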

Direct Coulomb Summation on the GPU: (Figure: data flow for the kernel; atomic coordinates and charges are staged by the host into GPU constant memory, the parallel data caches and texture units feed the processing cores, and the potential map resides in GPU global memory.)

Direct Coulomb Summation Runtime (lower is better): when the GPU is underutilized the gains are modest; fully utilized, it is ~40x faster than the CPU. Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.

Optimizing for the GPU: Increase arithmetic intensity and reuse in-register data by unrolling the lattice-point computation into the inner atom loop. Each atom contributes to several lattice points, and the distances differ only in the X component: potentialA += charge[i] / (distance from lattice point A to atom[i]); potentialB += charge[i] / (distance from lattice point B to atom[i]).
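
A sketch of the unrolled inner loop: each thread accumulates several lattice points spaced along X, so dy*dy + dz*dz is computed once per atom and reused for every unrolled distance. For clarity the four points here are adjacent; the production kernels instead stride the unrolled points by a half-warp so that the final global-memory writes coalesce, which is the "skipping by half-warps" noted on the next slide. atominfo is the constant-memory array from the previous sketch.

// Sketch: unroll 4 lattice points along X per thread to raise arithmetic intensity.
__global__ void dcs_unroll4(float *slice, int nx, int ny,
                            float gridspacing, float gridz, int natoms) {
  int xi = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // first of 4 points for this thread
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  float y = yi * gridspacing;
  float en0 = 0.0f, en1 = 0.0f, en2 = 0.0f, en3 = 0.0f;
  for (int n = 0; n < natoms; n++) {
    float dy = y - atominfo[n].y;
    float dz = gridz - atominfo[n].z;
    float dyz2 = dy*dy + dz*dz;              // shared by all 4 unrolled points
    float dx = xi * gridspacing - atominfo[n].x;
    float q  = atominfo[n].w;
    en0 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    en1 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    en2 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    en3 += q * rsqrtf(dx*dx + dyz2);
  }
  slice[yi*nx + xi    ] += en0;              // lattice padded so all 4 stores stay in range
  slice[yi*nx + xi + 1] += en1;
  slice[yi*nx + xi + 2] += en2;
  slice[yi*nx + xi + 3] += en3;
}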

CUDA Block/Grid Decomposition: Unrolling increases the computational tile size. The potential map is covered by a grid of thread blocks of 64-256 threads each, and each thread computes up to 8 potentials. Skipping lattice points by half-warps optimizes global memory performance. Tiles extending past the edge of the map are padding waste.
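
A host-side sketch of the corresponding launch configuration, assuming the dcs_unroll4 kernel above and that d_slice, nx, ny, gridspacing, gridz, and natoms have already been set up by the surrounding host code. The block shape is one arbitrary choice within the 64-256 thread range.

// Sketch: grid/block decomposition for the unrolled kernel.
const int UNROLL = 4;                          // lattice points per thread along X
dim3 threads(16, 8);                           // 128 threads per block, within the 64-256 range
int tileX = threads.x * UNROLL;                // lattice points covered by one block in X
dim3 blocks((nx + tileX - 1) / tileX,          // rounding up creates the padding waste
            (ny + threads.y - 1) / threads.y);
dcs_unroll4<<<blocks, threads>>>(d_slice, nx, ny, gridspacing, gridz, natoms);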

Direct Coulomb Summation Performance: CUDA-Unroll8clx, the fastest GPU kernel, is 44x faster than the CPU and reaches 291 GFLOPS on a GeForce 8800 GTX. CUDA-Simple is 14.8x faster than the CPU, about 33% of the fastest GPU kernel. GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.

GPU Application Performance (July 2007; current code is 20% faster): CUDA ion placement lattice calculation performance is 82 times faster for the virus (STMV) structure and 110 times faster for the ribosome. Virus ion placement took 110 CPU-hours on an SGI Altix Itanium2; the same calculation now takes 1.35 GPU-hours, or 27 minutes of wall-clock time if three GPUs are used concurrently. (Figure: Satellite Tobacco Mosaic Virus (STMV) ion placement.)

Multi-GPU Direct Coulomb Summation: Effective memory bandwidth scales with the number of GPUs utilized, and PCIe bus bandwidth is not a bottleneck for this algorithm. Results: 117 billion evaluations/sec, 863 GFLOPS, a 131x speedup vs. a CPU core, drawing 700 watts during the benchmark. Hardware: quad-core Intel QX6700 with three NVIDIA GeForce 8800 GTX GPUs.

Multi-GPU Direct Coulomb Summation: A 4-GPU (2 Quadroplex) Opteron node at NCSA delivers 157 billion evaluations/sec and 1.16 TFLOPS, a 176x speedup vs. an Intel QX6700 CPU core with SSE. (Figure: NCSA GPU cluster.)

Cutoff Summation: At each lattice point, sum the potential contributions of the atoms within the cutoff radius: if (distance to atom[i] < cutoff) potential += (charge[i] / r) * s(r), where r is the distance to atom[i] and the smoothing function s(r) is algorithm dependent.
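
Made concrete in C, the loop for one lattice point might look like the sketch below. The particular smoothing function used here, s(r) = (1 - r^2/rc^2)^2, is only an illustrative choice, since s(r) depends on the algorithm.

#include <math.h>

// Sketch: cutoff pair potential at one lattice point (px, py, pz).
// Atoms are packed as x, y, z, charge; s(r) is an illustrative smoothing function.
float cutoff_potential(float px, float py, float pz,
                       const float *atoms, int natoms, float cutoff) {
  float cutoff2 = cutoff * cutoff;
  float pot = 0.0f;
  for (int n = 0; n < natoms; n++) {
    float dx = px - atoms[4*n];
    float dy = py - atoms[4*n+1];
    float dz = pz - atoms[4*n+2];
    float r2 = dx*dx + dy*dy + dz*dz;
    if (r2 < cutoff2) {                        // only atoms inside the cutoff contribute
      float s = 1.0f - r2 / cutoff2;           // s(r) = (1 - r^2/rc^2)^2, one possible choice
      pot += (atoms[4*n+3] / sqrtf(r2)) * s * s;
    }
  }
  return pot;
}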

Infinite vs. Cutoff Potentials: With an infinite-range potential, all atoms contribute to all lattice points and the summation algorithm has quadratic complexity. With a cutoff (range-limited) potential, atoms contribute only to lattice points within the cutoff distance and the summation algorithm has linear time complexity. Cutoff potentials have many applications in molecular modeling: replacing the electrostatic potential with a shifted form, computing the short-range part of fast methods that approximate full electrostatics, and evaluating fast-decaying interactions (e.g. Lennard-Jones, Buckingham).

Cutoff Summation on the GPU: Atoms are spatially hashed into fixed-size bins in global memory. The kernel processes the atom bins relevant to the current potential map region, staging atom bins through shared memory and reading the bin-region neighbor list from constant memory. (Figure: data flow between atoms in global memory, atom bins in shared memory, and the bin-region neighbor list in constant memory.)
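
A host-side sketch of the spatial hashing step described above; the fixed bin capacity, the bin layout, and the handling of overflow atoms are illustrative assumptions rather than the published implementation.

#include <cuda_runtime.h>   // for float4

// Sketch: hash atoms into fixed-size spatial bins before copying them to the GPU.
#define BIN_CAPACITY 8
typedef struct { int count; float4 atom[BIN_CAPACITY]; } AtomBin;

void bin_atoms(AtomBin *bins, int bx, int by, int bz, float binlen,
               const float4 *atoms, int natoms) {
  for (int b = 0; b < bx*by*bz; b++) bins[b].count = 0;
  for (int n = 0; n < natoms; n++) {
    int ix = (int)(atoms[n].x / binlen);       // assumes coordinates already shifted to be >= 0
    int iy = (int)(atoms[n].y / binlen);
    int iz = (int)(atoms[n].z / binlen);
    AtomBin *bin = &bins[(iz*by + iy)*bx + ix];
    if (bin->count < BIN_CAPACITY)
      bin->atom[bin->count++] = atoms[n];      // overflowing atoms would be handled on the CPU
  }
}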

Cutoff Summation Runtime: GPU cutoff summation with CPU overlap is 12x-21x faster than a CPU core. GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, 2008. In press.

NAMD Parallel Molecular Dynamics (Kale et al., J. Comp. Phys. 151:283-312, 1999): Designed from the beginning as a parallel program. Uses the Charm++ idea: decompose the computation into a large number of objects and let the intelligent Charm++ runtime system assign objects to processors for dynamic load balancing with minimal communication. NAMD is a hybrid of spatial and force decomposition: atoms are spatially decomposed into cubes (called patches), and for every pair of interacting patches one object is created to calculate their electrostatic interactions. More recently, Blue Matter, Desmond, and others have used this idea in some form.

NAMD Overlapping Execution (Phillips et al., SC2002): Objects are assigned to processors and queued as data arrives. (Figure: an example configuration of 847 objects, with part of the work offloaded to the GPU.)

Nonbonded Forces on the G80 GPU: Start with the most expensive calculation: direct nonbonded interactions. Decompose the work into pairs of patches, identical to the NAMD structure. The GPU hardware assigns patch-pairs to multiprocessors dynamically. (Figure: force computation on a single multiprocessor; the GeForce 8800 GTX has 16. Each 32-way SIMD multiprocessor runs 32-256 multiplexed threads with 32 kB of registers; the texture unit performs force-table interpolation through an 8 kB cache; 16 kB of shared memory holds patch A coordinates and parameters; constants and exclusions are read through an 8 kB cache; patch B coordinates, parameters, and forces reside in 768 MB of main memory, which has no cache and 300+ cycle latency.) Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

Nonbonded Forces CUDA Code (Stone et al., J. Comp. Chem. 28:2618-2640, 2007):

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;    // per-thread atom, stored in registers
float4 iforce; // per-thread force, stored in registers
for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    // Force interpolation
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));
    // Exclusions
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }
    // Parameters
    float f = iatom.half_sigma + jatom[j].half_sigma;  // sigma
    f *= f*f;                                          // sigma^3
    f *= f;                                            // sigma^6
    f *= ( f * ft.x + ft.y );                          // sigma^12 * ft.x - sigma^6 * ft.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; }                 // PME correction
    else { f += qq * ft.z; }                           // Coulomb
    // Accumulation
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                                   // interaction count or energy
  }
}

NAMD Overlapping Execution with Asynchronous CUDA Kernels: GPU kernels are launched asynchronously; the CPU continues with its own work, polling periodically for GPU completion. Forces needed by remote nodes are explicitly scheduled to be computed as soon as possible to improve overall performance.
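
A minimal sketch of that launch-and-poll pattern using the standard CUDA stream API; the kernel, its arguments, and do_cpu_work() are placeholders standing in for NAMD's actual force kernel and CPU-side work.

// Sketch: launch GPU work asynchronously, then poll while the CPU keeps working.
cudaStream_t stream;
cudaStreamCreate(&stream);

nonbonded_kernel<<<grid, block, 0, stream>>>(d_patch_pairs, d_forces);   // returns immediately
cudaMemcpyAsync(h_forces, d_forces, nbytes, cudaMemcpyDeviceToHost, stream);

while (cudaStreamQuery(stream) == cudaErrorNotReady) {
  do_cpu_work();        // bonded forces, integration, communication, ...
}
cudaStreamDestroy(stream);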

NAMD Performance on NCSA GPU Cluster, April 2008: STMV benchmark, 1M atoms, 12 Å cutoff, PME every 4 steps, running on 2.4 GHz AMD Opteron + NVIDIA Quadro FX 5600.

NAMD Performance on NCSA GPU Cluster, April 2008: 5.5-7x overall application speedup with G80-based GPUs on the STMV virus benchmark (1M atoms). GPU work is overlapped with the CPU, off-node results are computed first, and InfiniBand scales well. Plans for better performance: tune or port the remaining work and balance the GPU load (?). (Chart: STMV performance in seconds per step on 1-48 nodes of 2.4 GHz Opteron + Quadro FX 5600, CPU only vs. with GPU; values shown include 25.7, 13.8, and 7.8 seconds per step.) Thanks to NCSA and NVIDIA.

GPU Kernel Performance, May 2008 (GeForce 8800 GTX with CUDA 1.1, driver 169.09; Research/gpu/). Speedups are vs. one Intel QX6700 CPU core:

Calculation / Algorithm                          | Algorithm class                  | Speedup
Fluorescence microphotolysis                     | Iterative matrix / stencil       | 12x
Pairlist calculation                             | Particle pair distance test      | 10-11x
Pairlist update                                  | Particle pair distance test      | 5-15x
Molecular dynamics non-bonded force calculation  | N-body cutoff force calculations | 10x, 20x (w/ pairlist)
Cutoff electron density sum                      | Particle-grid w/ cutoff          | 15-23x
MSM short-range cutoff                           | Particle-grid w/ cutoff          | 24x
MSM long-range lattice cutoff                    | Grid-grid w/ cutoff              | 22x
Direct Coulomb summation                         | Particle-grid                    | 44x

Lessons Learned: GPU algorithms need fine-grained parallelism and sufficient work to fully utilize the hardware. Fine-grained GPU work decompositions often compose well with the comparatively coarse-grained decompositions used for multicore or distributed-memory programming. Much of GPU algorithm optimization revolves around efficient use of the multiple memory systems and latency hiding.

Lessons Learned (2): The host CPU can potentially be used to regularize the computation for the GPU, yielding better overall performance. Amdahl's Law can prevent applications from achieving peak speedup with shallow GPU acceleration efforts. Overlapping CPU work with the GPU can hide some communication and unaccelerated computation. Concurrent use of GPUs, InfiniBand cards, or other hardware performing DMA to the same region of memory can be problematic, because there is no OS-managed interface for pinning pages of physical memory when they are accessed through high-performance OS-kernel-bypass software interfaces (e.g. MPI, CUDA).

Acknowledgements: Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign. Prof. Wen-mei Hwu, Chris Rodrigues, and the IMPACT Group, University of Illinois at Urbana-Champaign. David Kirk and the CUDA team at NVIDIA. NVIDIA, IACAT, and NCSA (GPU cluster). NIH support: P41-RR05969.

Publications (Research/gpu/):

Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.

Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93:4006-4017, 2007.

GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.

GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, 2008. In press.