Accelerating NAMD with Graphics Processors

Accelerating NAMD with Graphics Processors. James Phillips, John Stone, Klaus Schulten. Research/namd/

NAMD: Practical Supercomputing. 24,000 users can't all be computer experts. 18% are NIH-funded; many are in other countries. 4,900 have downloaded more than one version. User experience is the same on all platforms: no change in input, output, or configuration files; run any simulation on any number of processors; precompiled binaries available when possible. Desktops and laptops for setup and testing: x86 and x86-64 Windows, and Macintosh; allow both shared-memory and network-based parallelism. Linux clusters are affordable workhorses: x86, x86-64, and Itanium processors; Gigabit Ethernet, Myrinet, InfiniBand, Quadrics, Altix, etc. Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.

Our Goal: Practical Acceleration. Broadly applicable to scientific computing: programmable by domain scientists, scalable from small to large machines. Broadly available to researchers: price driven by the commodity market, low burden on system administration. Sustainable performance advantage: performance driven by Moore's law, stable market and supply chain.

Acceleration Options for NAMD. Outlook in 2005-2006: FPGA reconfigurable computing (with NCSA): difficult to program, slow floating point, expensive. Cell processor (NCSA hardware): relatively easy to program, expensive. ClearSpeed (direct contact with the company): limited memory and memory bandwidth, expensive. MDGRAPE: inflexible and expensive. Graphics processor (GPU): program must be expressed as graphics operations.

GPU vs CPU: Raw Performance. Calculation: 450 GFLOPS vs 32 GFLOPS. Memory bandwidth: 80 GB/s vs 8.4 GB/s. (Chart: GFLOPS over successive GPU generations; legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.)

CUDA: Practical Performance. November 2006: NVIDIA announces CUDA for the G80 GPU. CUDA makes GPU acceleration usable: developed and supported by NVIDIA; no masquerading as graphics rendering; new shared memory and synchronization; no OpenGL or display device hassles; multiple processes per card (or vice versa). Resources and collaborators make it useful: experience from VMD development, David Kirk (Chief Scientist, NVIDIA), Wen-mei Hwu (ECE Professor, UIUC). Fun to program (and drive). Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

GeForce 8800 Graphics Mode. (Block diagram of the graphics pipeline: host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, texture fetch units with L1 caches, thread processors, L2 caches, and frame buffer partitions.)

GeForce 8800 General Computing. 12,288 threads, 128 cores, 450 GFLOPS. (Block diagram: host, input assembler, thread execution manager, texture units, load/store units, global memory.) Global memory: 768 MB DRAM, 4 GB/s bandwidth to the CPU.

Typical CPU Architecture. (Block diagram: L1 instruction and data caches, L2 and L3 caches, dispatch/retire logic, FPUs, ALU, and memory controller.)

Minimize the Processor. No large caches or multiple execution units; do integer arithmetic on the FPU. (Diagram: L1 instruction and data caches, dispatch/retire, a single FPU, and memory controller.)

Maximize Floating Point. 8 FP pipelines per SIMD unit, with a shared data cache and a single instruction stream. One thread per FPU allows branches and gather/scatter. (Diagram: L1 instruction and data caches, dispatch/retire, eight FPUs, and memory controller.)

Add More Threads. Pipeline 4 threads per FPU to hide the 4-cycle instruction latency. All 32 threads in a warp execute the same instruction. Divergent branches are allowed through predication.
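To make the warp model concrete, here is a minimal sketch of my own (not from the talk; the kernel name and data are hypothetical) of a data-dependent branch inside a warp. Threads that take different sides of the if are handled by predication: the hardware masks off the inactive threads, runs both paths in turn, and reconverges.

    // Hypothetical kernel: threads in the same 32-thread warp may diverge at the
    // branch below; the hardware predicates the two paths and then reconverges.
    __global__ void clamp_negatives(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (data[i] < 0.f) {       // divergent branch within a warp
                data[i] = 0.f;         // taken only by threads holding negative values
            } else {
                data[i] *= 2.f;        // taken by the remaining threads of the warp
            }
        }
    }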

Add Even More Threads. Multiple warps in a block hide main memory latency and can synchronize to share data.
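As an illustration of warps cooperating within a block (my sketch, not the speaker's; the kernel and buffer names are hypothetical), a block-wide sum stages data in shared memory and uses a barrier between each write and read phase:

    // Hypothetical block-wide sum: 256 threads (8 warps) per block share one
    // on-chip buffer and synchronize at every reduction step.
    __global__ void block_sum(const float *in, float *block_totals, int n) {
        __shared__ float buf[256];                 // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.f;
        __syncthreads();                           // all warps finish loading before any warp reads

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();                       // barrier between reduction steps
        }
        if (threadIdx.x == 0)
            block_totals[blockIdx.x] = buf[0];     // one partial sum per block
    }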

Add More Threads Again. Multiple blocks on a single multiprocessor hide both memory and synchronization latency. All blocks execute a kernel function independently, with no synchronization or memory coherency between blocks.

Add Cores to Suit Customer Kernel is invoked on a grid of uniform blocks. Blocks are dynamically assigned to available multiprocessors and run to completion. Synchronization occurs when all blocks complete.
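A minimal host-side sketch (my illustration, reusing the hypothetical block_sum kernel above) of what this slide describes: the kernel is launched on a grid of uniform blocks, the hardware schedules blocks onto whatever multiprocessors are present, and the host sees completion only once every block has run.

    #include <cuda_runtime.h>

    // Launch block_sum on a grid sized to the problem; the hardware decides
    // which multiprocessor runs each block, and in what order.
    void run_block_sum(const float *d_in, float *d_block_totals, int n) {
        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;   // one block per 256 elements

        block_sum<<<blocks, threads>>>(d_in, d_block_totals, n);

        // Returns only after all blocks of the kernel have run to completion,
        // whether the GPU has 2 multiprocessors or 16.
        cudaDeviceSynchronize();
    }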

GeForce 8800 General Computing. 12,288 threads, 128 cores, 450 GFLOPS. (Block diagram: host, input assembler, thread execution manager, texture units, load/store units, global memory.) Global memory: 768 MB DRAM, 4 GB/s bandwidth to the CPU.

Support Fine-Grained Parallelism. Threads are cheap but desperately needed; how many can you give? 512 threads will keep all 128 FPUs busy. 1,024 threads will hide some memory latency. 12,288 threads can run simultaneously. Up to 2x10^12 threads per kernel invocation.
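As a quick worked check of that last figure (my arithmetic, using the grid and block limits I believe applied to G80-era, compute 1.x CUDA), the numbers multiply out as follows:

    #include <cstdio>

    // The slide's "up to 2x10^12 threads per kernel invocation" follows from a
    // 65,535 x 65,535 grid of blocks with up to 512 threads per block
    // (the limits I recall for compute capability 1.x hardware).
    int main() {
        double max_threads = 65535.0 * 65535.0 * 512.0;
        printf("max threads per launch ~= %.2e\n", max_threads);   // ~2.2e12
        return 0;
    }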

NAMD Parallel Design. Kale et al., J. Comp. Phys. 151:283-312, 1999. Designed from the beginning as a parallel program. Uses the Charm++ idea: decompose the computation into a large number of objects, and have an intelligent run-time system (Charm++) assign objects to processors for dynamic load balancing with minimal communication. Hybrid of spatial and force decomposition: spatial decomposition of atoms into cubes (called patches); for every pair of interacting patches, create one object for calculating electrostatic interactions. Recently, Blue Matter, Desmond, etc. use this idea in some form.
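A rough sketch of the decomposition idea (my own code, not NAMD's; all type and function names are hypothetical, and periodic boundaries are ignored): bin atoms into cubic patches at least one cutoff wide, then create one compute object per patch paired with itself and with each neighboring patch.

    #include <vector>
    #include <cmath>
    #include <algorithm>

    struct Atom  { float x, y, z; };
    struct Patch { std::vector<int> atom_ids; };         // atoms in one cube of space
    struct PatchPairCompute { int patch_a, patch_b; };   // one work object per interacting pair

    void decompose(const std::vector<Atom>& atoms, float cutoff, float box,
                   std::vector<Patch>& patches, std::vector<PatchPairCompute>& computes) {
        int cells = std::max(1, (int)std::floor(box / cutoff));
        auto id = [&](int x, int y, int z) { return (z * cells + y) * cells + x; };
        patches.assign(cells * cells * cells, Patch{});

        // Spatial decomposition: assign each atom to the cube containing it.
        for (int i = 0; i < (int)atoms.size(); ++i) {
            int cx = std::min(cells - 1, (int)(atoms[i].x / box * cells));
            int cy = std::min(cells - 1, (int)(atoms[i].y / box * cells));
            int cz = std::min(cells - 1, (int)(atoms[i].z / box * cells));
            patches[id(cx, cy, cz)].atom_ids.push_back(i);
        }

        // One compute object per patch-with-itself and per half-shell neighbor pair,
        // so every within-cutoff interaction lands in exactly one object.
        for (int z = 0; z < cells; ++z)
          for (int y = 0; y < cells; ++y)
            for (int x = 0; x < cells; ++x)
              for (int dz = 0; dz <= 1; ++dz)
                for (int dy = (dz ? -1 : 0); dy <= 1; ++dy)
                  for (int dx = ((dz || dy) ? -1 : 0); dx <= 1; ++dx) {
                      int nx = x + dx, ny = y + dy, nz = z + dz;
                      if (nx < 0 || ny < 0 || nz < 0 || nx >= cells || ny >= cells || nz >= cells)
                          continue;
                      computes.push_back({ id(x, y, z), id(nx, ny, nz) });
                  }
    }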

NAMD Overlapping Execution. Phillips et al., SC2002. Example configuration: 847 objects, 100,000 atoms. Offload to GPU. Objects are assigned to processors and queued as data arrives.

GPU Hardware Special Features. (Diagram of the streaming processor array: eight texture processor clusters; each texture processor cluster contains a read-only texture unit with interpolation and two streaming multiprocessors; each streaming multiprocessor has an instruction L1 cache, instruction fetch/dispatch, 64 kB of read-only constant memory, shared memory, a data L1 cache, streaming processors (ADD, SUB, MAD, etc.), and super function units (SIN, RSQRT, EXP, etc.).)

Nonbonded Forces on CUDA GPU. Start with the most expensive calculation: direct nonbonded interactions. Decompose work into pairs of patches, identical to the NAMD structure. GPU hardware assigns patch pairs to multiprocessors dynamically. Force computation runs on a single multiprocessor (the GeForce 8800 GTX has 16): a 32-way SIMD multiprocessor running 32-256 multiplexed threads; 16 kB of shared memory holds Patch A coordinates and parameters; 32 kB of registers hold Patch B coordinates, parameters, and forces; the texture unit (8 kB cache) interpolates the force table; constant memory (8 kB cache) holds exclusions; 768 MB of main memory is uncached with 300+ cycle latency. Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
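Before the actual inner loop on the next slide, here is a skeleton of my own (not NAMD's kernel; the names, the fixed 128-atom patch size, and the bare-Coulomb force are illustrative assumptions) showing how one block can map to one patch pair: one patch's atoms are staged in shared memory, and each thread keeps its own atom and force accumulator in registers while looping over the staged atoms.

    struct atom4 { float x, y, z, q; };   // hypothetical packed atom record

    // Sketch: one thread block per patch pair. Assumes na <= 128 and nb <= blockDim.x.
    __global__ void patch_pair_forces(const atom4 *patch_a, int na,
                                      const atom4 *patch_b, int nb,
                                      float cutoff2, float4 *forces_b) {
        __shared__ atom4 a_cache[128];                  // one patch's atoms on chip
        for (int t = threadIdx.x; t < na; t += blockDim.x)
            a_cache[t] = patch_a[t];                    // cooperative load
        __syncthreads();

        if (threadIdx.x >= nb) return;
        atom4  bi = patch_b[threadIdx.x];               // my atom, in registers
        float4 fi = make_float4(0.f, 0.f, 0.f, 0.f);    // my force accumulator, in registers

        for (int j = 0; j < na; ++j) {
            float dx = bi.x - a_cache[j].x;
            float dy = bi.y - a_cache[j].y;
            float dz = bi.z - a_cache[j].z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < cutoff2) {
                float f = bi.q * a_cache[j].q * rsqrtf(r2) / r2;   // bare Coulomb (k = 1), illustration only
                fi.x += dx * f;  fi.y += dy * f;  fi.z += dz * f;
            }
        }
        forces_b[threadIdx.x] = fi;
    }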

Nonbonded Forces CUDA Code. Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;     // per-thread atom, stored in registers
float4 iforce;  // per-thread force, stored in registers
for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    // Force interpolation
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));
    // Exclusions
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }
    // Parameters
    float f = iatom.half_sigma + jatom[j].half_sigma;  // sigma
    f *= f*f;                                          // sigma^3
    f *= f;                                            // sigma^6
    f *= ( f * ft.x + ft.y );                          // sigma^12 * ft.x - sigma^6 * ft.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; }  // PME correction
    else { f += qq * ft.z; }            // Coulomb
    // Accumulation
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;  // interaction count or energy
  }
}

Why Calculate Each Force Twice? Newton's 3rd Law of Motion: F_ij = -F_ji, so we could calculate each force once and apply it to both atoms. But floating point operations are cheap: this would save at most a factor of two, and almost everything else about it hurts performance: warp divergence, memory access, synchronization, extra registers, integer logic.

What About Pairlists? Generation works well under CUDA: assign atoms to cells, search neighboring cells, and write neighbors to lists as they are found. Scatter capability is essential. 10x speedup relative to the CPU. Potential for a significant performance boost: eliminates 90% of distance-test calculations.
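A sketch of that generation step (my own, not NAMD's implementation; the cell arrays and all names are hypothetical and assumed to come from a prior binning pass): one thread per atom walks the 27 surrounding cells and scatters accepted neighbor indices into its own slot of the pairlist.

    // Hypothetical pairlist builder: one thread per atom, scattered writes.
    __global__ void build_pairlists(const float3 *pos, int n,
                                    const int *cell_of_atom,     // cell id of each atom
                                    const int *cell_start,       // first entry of each cell in sorted_atom_ids
                                    const int *cell_count,       // atoms per cell
                                    const int *sorted_atom_ids,  // atom ids grouped by cell
                                    int cells_per_dim, float cutoff2,
                                    int *pairlist, int *pair_count, int max_pairs_per_atom) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float3 pi = pos[i];
        int c  = cell_of_atom[i];
        int cx = c % cells_per_dim;
        int cy = (c / cells_per_dim) % cells_per_dim;
        int cz = c / (cells_per_dim * cells_per_dim);
        int count = 0;

        for (int dz = -1; dz <= 1; ++dz)
          for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                if (nx < 0 || ny < 0 || nz < 0 ||
                    nx >= cells_per_dim || ny >= cells_per_dim || nz >= cells_per_dim)
                    continue;                    // no periodic wraparound in this sketch
                int nc = (nz * cells_per_dim + ny) * cells_per_dim + nx;
                for (int k = 0; k < cell_count[nc]; ++k) {
                    int j = sorted_atom_ids[cell_start[nc] + k];
                    if (j == i) continue;
                    float ddx = pos[j].x - pi.x, ddy = pos[j].y - pi.y, ddz = pos[j].z - pi.z;
                    if (ddx*ddx + ddy*ddy + ddz*ddz < cutoff2 && count < max_pairs_per_atom)
                        pairlist[i * max_pairs_per_atom + count++] = j;   // scattered write
                }
            }
        pair_count[i] = count;                   // neighbors beyond the cap are silently dropped
    }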

Why Not Pairlists? They change the kernel from FP-limited to memory-limited: limited memory to hold pairlists, limited bandwidth to load pairlists, random access to coordinates, etc., and FP performance grows faster than memory. They are also a poor fit to the NAMD parallel decomposition: the number of pairs in a single object varies greatly.

Initial GPU Performance. Full NAMD, not a test harness. Useful performance boost: 8x speedup for nonbonded, 5x speedup overall without PME, 3.5x speedup overall with PME; GPU = quad-core CPU. Plans for better performance: overlap GPU and CPU work; tune or port the remaining work (PME, bonded, integration, etc.). (Chart: ApoA1 performance in seconds per step, CPU vs. GPU, broken down into nonbonded, PME, and other; lower is faster. 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX.)

New GPU Cluster Performance. 7x speedup; 1M atoms for more work; overlap with CPU; InfiniBand helps scaling; load balancer still disabled. Plans for better scaling: better initial load balance, balance the GPU load. (Chart: STMV performance in seconds per step, CPU-only vs. with GPU, on 1 to 48 processors; lower is faster. 2.4 GHz Opteron + Quadro FX 5600.) Thanks to NCSA and NVIDIA.

Next Goal: Interactive MD on GPU. Definite need for faster serial IMD: a useful method for tweaking structures; 10x performance yields 100x sensitivity; needed on demand, and clusters are rare. AutoIMD is available in VMD already: it isolates a small subsystem; the user specifies molten and fixed atoms; fixed atoms reduce GPU work; pairlist-based algorithms start to win. Limited variety of simulations; few users have multiple GPUs; move the entire MD algorithm to the GPU. (Diagram: NAMD - VMD - User; photo of former HHS Secretary Thompson.)

Conclusion and Outlook. Low-end GPU impact: usable performance from a single machine; faster, cheaper, smaller clusters. High-end GPU impact: fewer, faster nodes reduce communication; faster iteration for longer simulated timescales. This is first-generation CUDA hardware.