Refactoring NAMD for Petascale Machines and Graphics Processors

Refactoring NAMD for Petascale Machines and Graphics Processors
James Phillips
Research/namd/

NAMD Design
- Designed from the beginning as a parallel program.
- Uses the Charm++ idea:
  - Decompose the computation into a large number of objects.
  - Have an intelligent run-time system (Charm++) assign objects to processors for dynamic load balancing.
- Hybrid of spatial and force decomposition:
  - Spatial decomposition of atoms into cubes (called patches).
  - For every pair of interacting patches, create one object for calculating electrostatic interactions.
- Recently: Blue Matter, Desmond, etc. use this idea in some form.
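To make the decomposition concrete, here is a minimal, self-contained C++ sketch (my own illustration with hypothetical names such as PatchKey; it is not NAMD's actual class structure): atoms are binned into cubes roughly one cutoff distance on a side, and one compute object is created for every pair of neighboring patches.

#include <cstdio>
#include <cstdlib>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical, simplified stand-ins for NAMD's patch/compute objects.
struct Atom { float x, y, z; };
using PatchKey = std::tuple<int, int, int>;   // integer cube coordinates

int main() {
    const float cutoff = 12.0f;               // patch edge ~ cutoff distance
    // Toy system with non-negative coordinates (keeps the binning simple).
    std::vector<Atom> atoms = { {0,0,0}, {5,3,1}, {13,2,0}, {25,14,7} };

    // Spatial decomposition: bin atoms into cubes ("patches").
    std::map<PatchKey, std::vector<int>> patches;
    for (int i = 0; i < (int)atoms.size(); ++i) {
        PatchKey k{ (int)(atoms[i].x / cutoff),
                    (int)(atoms[i].y / cutoff),
                    (int)(atoms[i].z / cutoff) };
        patches[k].push_back(i);
    }

    // Force decomposition: one compute object per pair of neighboring patches
    // (including a patch with itself). The Charm++ runtime would then map
    // these objects onto processors for dynamic load balancing.
    std::vector<std::pair<PatchKey, PatchKey>> computes;
    for (const auto& a : patches)
        for (const auto& b : patches) {
            auto [ax, ay, az] = a.first;
            auto [bx, by, bz] = b.first;
            bool neighbors = std::abs(ax - bx) <= 1 && std::abs(ay - by) <= 1
                             && std::abs(az - bz) <= 1;
            if (neighbors && a.first <= b.first)   // count each pair once
                computes.push_back({a.first, b.first});
        }

    std::printf("%zu patches, %zu pair computes\n", patches.size(), computes.size());
    return 0;
}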

NAMD Parallelization using Charm++: Example Configuration
[Diagram: object counts of 847 VPs, 100,000 VPs, and 108 VPs for the different object types.]
These 100,000 objects (virtual processors, or VPs) are assigned to real processors by the Charm++ runtime system.

Load Balancing Steps
[Timeline: regular timesteps and instrumented timesteps, with a detailed, aggressive load balancing step and later refinement load balancing steps.]
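As a rough illustration of measurement-based load balancing (my own simplified sketch, not Charm++'s actual strategy code), the object loads recorded during instrumented timesteps can be redistributed greedily, heaviest objects first, onto the least loaded processors:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy measurement-based assignment: place each object, heaviest first,
// on the currently least loaded processor. (Illustrative only; Charm++
// load balancing strategies are more sophisticated and also refine
// incrementally to limit migrations.)
std::vector<int> greedyAssign(const std::vector<double>& loads, int numProcs) {
    std::vector<int> assignment(loads.size());

    // Min-heap of (current processor load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    // Object indices sorted by descending measured load.
    std::vector<int> order(loads.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return loads[a] > loads[b]; });

    for (int obj : order) {
        Entry e = procs.top(); procs.pop();
        assignment[obj] = e.second;
        procs.push({e.first + loads[obj], e.second});
    }
    return assignment;
}

int main() {
    std::vector<double> measured = {3.0, 1.5, 2.2, 0.7, 4.1, 1.1};  // ms per object
    std::vector<int> where = greedyAssign(measured, 2);
    for (std::size_t i = 0; i < where.size(); ++i)
        std::printf("object %zu -> processor %d\n", i, where[i]);
    return 0;
}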

Parallelization on BlueGene/L
- Sequential optimizations, messaging-layer optimizations, and NAMD parallel tuning; illustrates the porting effort.
- Optimizations: NAMD v2.5, NAMD v2.6, blocking, fine-grained, congestion control, topology load balancer, chessboard, dynamic FIFO mapping, fast memcpy, non-blocking, 2AwayXY + spanning tree.
- Performance: time per step improved from 40 ms (NAMD v2.5) through 25.2, 24.3, 20.5, 14, 13.5, 13.3, and 11.9 ms down to 8.6 ms (10 ns/day).
- Inside help: Sameer Kumar, former CS/TCB student, now in IBM's BlueGene group, tasked by IBM to support NAMD; Chao Huang spent a summer at IBM on the messaging layer.

Fine-Grained Decomposition on BlueGene
[Diagram: force evaluation and integration phases.]
Decomposing atoms into smaller bricks gives finer-grained parallelism.

Recent Large-Scale Parallelization (since the proposal was submitted)
- PME parallelization needs to be fine-grained. We recently did a 2-D (pencil-based) parallelization, which will be tuned further for efficient data exchange between atoms and the grid (see the sketch below).
  - Improvement with pencil PME: 0.65 ns/day to 1.2 ns/day on a fibrinogen system of 1 million atoms running on 1024 processors of the PSC XT3.
- Memory issues: new machines will stress memory per node (256 MB per processor on BlueGene/L), as will NSF's selection of NAMD and the BAR-domain benchmark.
  - Plan: partition all static data. Preliminary work done: we can now simulate the ribosome on BlueGene/L, and much larger systems on the Cray XT3.
- Interconnection topology (bandwidth) is becoming a strong factor: topology-aware load balancers in Charm++, some specialized to NAMD.
[Diagram: PME grid decomposed into pencils over a 2-D processor grid (X, Y, Z axes labeled).]
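To illustrate the 2-D pencil idea, here is a minimal sketch (my own illustration, with a hypothetical helper pencilOwner; it is not NAMD's PME code): the K1 x K2 x K3 charge grid is split into Z-pencils, each owned by one rank in a P1 x P2 processor grid, so a grid point's owner is found from its (x, y) indices alone.

#include <cstdio>

// Hypothetical helper: map a PME grid point (gx, gy, gz) to the rank that
// owns its Z-pencil in a 2-D (P1 x P2) pencil decomposition of a
// K1 x K2 x K3 grid. Each pencil spans the full Z dimension.
int pencilOwner(int gx, int gy, int K1, int K2, int P1, int P2) {
    int px = gx * P1 / K1;   // which block of x-planes
    int py = gy * P2 / K2;   // which block of y-planes
    return px * P2 + py;     // rank in the P1 x P2 processor grid
}

int main() {
    const int K1 = 64, K2 = 64;            // PME grid dimensions in x and y
    const int P1 = 8,  P2 = 8;             // 64 Z-pencils
    // An atom's charge is spread onto nearby grid points; each contribution
    // is sent to the pencil that owns that (x, y) column.
    std::printf("grid point (10, 50, z) -> rank %d\n",
                pencilOwner(10, 50, K1, K2, P1, P2));
    std::printf("grid point (63, 0, z)  -> rank %d\n",
                pencilOwner(63, 0, K1, K2, P1, P2));
    return 0;
}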

Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency.
[Timeline from Charm++'s Projections analysis tool: time intervals on the x axis, activity summed across processors on the y axis. Red: integration; blue/purple: electrostatics; orange: PME; green: communication; turquoise: angle/dihedral.]
Shallow valleys, high peaks, nicely overlapped PME.

Cray XT3, 512 processors, initial runs: 76% efficiency.
Clearly needed further tuning, especially of PME, but had more potential (much faster processors).

On Cray XT3, 512 processors, after optimizations: 96% efficiency.

Performance on BlueGene/L
[Plot: simulation rate in nanoseconds per day vs. number of processors (1 to 100,000) for IAPP (5.5K atoms), lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), and BAR domain (1.3M atoms).]
- IAPP simulation (Rivera, Straub, BU) at 20 ns per day on 256 processors: 1 microsecond in 50 days.
- STMV simulation at 6.65 ns per day on 20,000 processors.

Comparison with Blue Matter: ApoLipoprotein-A1 (92K atoms), ms/step

Nodes                 512    1024   2048   4096   8192   16384
Blue Matter (SC'06)   38.42  18.95  9.97   5.39   3.14   2.09
NAMD                  18.6   10.5   6.85   4.67   3.2    2.33
NAMD (Virtual Node)   11.3   7.6    5.1    3.7    3.0    2.33 (CP)

NAMD is about 1.8 times faster than Blue Matter on 1024 processors (and 3.4 times faster with VN mode, where NAMD can use both processors on a node effectively). However, note that NAMD does PME every 4 steps.

Performance on Cray XT3
[Plot: simulation rate in nanoseconds per day vs. number of processors (1 to 10,000) for IAPP (5.5K atoms), lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), BAR domain (1.3M atoms), and ribosome (2.8M atoms).]

NAMD: Practical Supercomputing
- 20,000 users can't all be computer experts.
  - 18% are NIH-funded; many in other countries.
  - 4200 have downloaded more than one version.
- User experience is the same on all platforms.
  - No change in input, output, or configuration files.
  - Run any simulation on any number of processors.
  - Automatically split patches and enable pencil PME.
  - Precompiled binaries available when possible.
- Desktops and laptops: setup and testing.
  - x86 and x86-64 Windows, PowerPC and x86 Macintosh.
  - Allow both shared-memory and network-based parallelism.
- Linux clusters: affordable workhorses.
  - x86, x86-64, and Itanium processors.
  - Gigabit ethernet, Myrinet, InfiniBand, Quadrics, Altix, etc.

NAMD Shines on InfiniBand
- TACC Lonestar is based on Dell servers and InfiniBand.
- Commodity cluster with 5200 cores! (Everything's bigger in Texas.)
[Plot: ns per day vs. cores (4 to 1024) for JAC/DHFR (24k atoms), ApoA1 (92k atoms), and STMV (1M atoms). Callouts: 32 ns/day at 2.7 ms/step; 15 ns/day at 5.6 ms/step; auto-switch to pencil PME.]

Hardware Acceleration for NAMD
Can NAMD offload work to a special-purpose processor? The Resource studied all the options in 2005-2006:
- FPGA reconfigurable computing (with NCSA): difficult to program, slow floating point, expensive.
- Cell processor (NCSA hardware): relatively easy to program, expensive.
- ClearSpeed (direct contact with company): limited memory and memory bandwidth, expensive.
- MDGRAPE: inflexible and expensive.
- Graphics processor (GPU): program must be expressed as graphics operations.

GPU Performance Far Exceeds CPU
- A quiet revolution in the games world so far.
- Calculation: 450 GFLOPS vs. 32 GFLOPS.
- Memory bandwidth: 80 GB/s vs. 8.4 GB/s.
[Plot: GFLOPS over time for successive GPUs vs. CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.]

CUDA: Practical Performance
- November 2006: NVIDIA announces CUDA for the G80 GPU.
- CUDA makes GPU acceleration usable:
  - Developed and supported by NVIDIA.
  - No masquerading as graphics rendering.
  - New shared memory and synchronization.
  - No OpenGL or display device hassles.
  - Multiple processes per card (or vice versa).
- The Resource and collaborators make it useful:
  - Experience from VMD development.
  - David Kirk (Chief Scientist, NVIDIA).
  - Wen-mei Hwu (ECE Professor, UIUC).
- Fun to program (and drive).

GeForce 8800 Graphics Mode
New GPUs are built around threaded cores.
[Block diagram: host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, thread processor array with texture fetch (TF) and L1 caches, L2 caches, and framebuffer (FB) partitions.]

GeForce 8800 General Computing
Up to 65,535 threads, 128 cores, 450 GFLOPS, 768 MB DRAM, 4 GB/s bandwidth to CPU.
[Block diagram: host, input assembler, and thread execution manager feeding an array of streaming processors, each group with a parallel data cache and texture unit; load/store units connect to global memory.]

NVIDIA G80 GPU Hardware
[Block diagram:]
- Streaming Processor Array built from Texture Processor Clusters (TPCs).
- Each TPC: one texture unit plus two Streaming Multiprocessors (SMs).
- Each SM: instruction L1 and data L1 caches, instruction fetch/dispatch, shared memory, Streaming Processors (ADD, SUB, MAD, etc.), and Super Function Units (SIN, RSQRT, EXP, etc.).
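CUDA exposes this hardware through a grid/block/thread hierarchy: each thread block runs on one streaming multiprocessor and can stage data in that SM's shared memory. Below is a minimal, self-contained kernel of my own (illustrative only, not NAMD code) showing the pattern.

#include <cstdio>
#include <cuda_runtime.h>

// Minimal CUDA kernel: each block stages a tile of the input in shared
// memory (the SM's "parallel data cache"), then each thread scales one
// element.
__global__ void scaleKernel(const float* in, float* out, float s, int n) {
    __shared__ float tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                               // make loads visible to the block
    if (i < n) out[i] = s * tile[threadIdx.x];
}

int main() {
    const int n = 1024;
    float h_in[1024], h_out[1024];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 4 blocks of 256 threads; the hardware schedules blocks across SMs.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("out[10] = %f\n", h_out[10]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}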

Nonbonded Forces on CUDA GPU
- Start with the most expensive calculation: direct nonbonded interactions.
- Decompose work into pairs of patches, identical to the NAMD structure.
- GPU hardware assigns patch pairs to multiprocessors dynamically.
Force computation on a single multiprocessor (the GeForce 8800 GTX has 16):
- Texture unit: force table interpolation, 8 kB cache.
- 16 kB shared memory: Patch A coordinates and parameters.
- 32-way SIMD multiprocessor, 32-256 multiplexed threads, 32 kB registers: Patch B coordinates, parameters, and forces.
- Constants: exclusions, 8 kB cache.
- 768 MB main memory: no cache, 300+ cycle latency.

// Inner loop of the CUDA nonbonded force kernel (fragment).
// struct atom, MAX_EXCLUSIONS, jatom_count, and cutoff2 are defined elsewhere.
// Each thread holds one Patch A atom (iatom) and its accumulated force (iforce)
// in registers, and loops over the Patch B atoms staged in shared memory.

texture<float4, 1, cudaReadModeElementType> force_table;
__constant__ unsigned int exclusions[MAX_EXCLUSIONS];
extern __shared__ atom jatom[];

atom iatom;     // per-thread atom, stored in registers
float4 iforce;  // per-thread force, stored in registers

for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    // Look up interpolated force terms, indexed by 1/sqrt(r2).
    float4 ft = tex1D(force_table, rsqrtf(r2));
    // Check the packed exclusion bitmap for bonded/excluded pairs.
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }
    // Lennard-Jones term from combined per-atom parameters.
    float f = iatom.half_sigma + jatom[j].half_sigma;  // sigma
    f *= f*f;                                          // sigma^3
    f *= f;                                            // sigma^6
    f *= ( f * ft.x + ft.y );                          // sigma^12 * ft.x + sigma^6 * ft.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    // Electrostatics: PME correction for excluded pairs, Coulomb otherwise.
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; }   // PME correction
    else { f += qq * ft.z; }             // Coulomb
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                     // interaction count or energy
  }
}
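For context, launching this kind of kernel follows the pattern sketched below: one thread block per patch pair, with Patch B staged in shared memory and one Patch A atom per thread. This is my own simplified, hypothetical example (plain cutoff pairs only, no force table or exclusions, and arbitrary units/sign conventions), not NAMD's actual kernel or data layout.

#include <cstdio>
#include <cuda_runtime.h>

// Simplified, hypothetical stand-ins: a patch is a fixed-size slice of the
// atom array, and one thread block handles one pair of patches.
struct Atom4 { float x, y, z, q; };
struct PatchPair { int a_start, a_count, b_start, b_count; };

__global__ void pairKernel(const Atom4* atoms, const PatchPair* pairs,
                           float4* forces, float cutoff2) {
    const PatchPair p = pairs[blockIdx.x];          // one block per patch pair
    __shared__ Atom4 jatom[64];                     // stage Patch B in shared memory
    if (threadIdx.x < p.b_count)
        jatom[threadIdx.x] = atoms[p.b_start + threadIdx.x];
    __syncthreads();

    if (threadIdx.x >= p.a_count) return;
    Atom4 iatom = atoms[p.a_start + threadIdx.x];   // one Patch A atom per thread
    float4 f = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int j = 0; j < p.b_count; ++j) {
        float dx = jatom[j].x - iatom.x;
        float dy = jatom[j].y - iatom.y;
        float dz = jatom[j].z - iatom.z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 > 0.f && r2 < cutoff2) {
            float inv_r = rsqrtf(r2);
            // Simplified Coulomb-like pair term (units and signs omitted).
            float s = iatom.q * jatom[j].q * inv_r * inv_r * inv_r;
            f.x += dx * s;  f.y += dy * s;  f.z += dz * s;
            f.w += 1.f;                              // interaction count
        }
    }
    forces[p.a_start + threadIdx.x] = f;             // one writer per Patch A atom
}

int main() {
    // Two tiny patches of 4 atoms each, and one pair object linking them.
    Atom4 h_atoms[8] = { {0,0,0,1}, {1,0,0,-1}, {0,1,0,1}, {0,0,1,-1},
                         {2,0,0,1}, {3,0,0,-1}, {2,1,0,1}, {2,0,1,-1} };
    PatchPair h_pair = {0, 4, 4, 4};

    Atom4* d_atoms;  PatchPair* d_pairs;  float4* d_forces;
    cudaMalloc(&d_atoms, sizeof(h_atoms));
    cudaMalloc(&d_pairs, sizeof(h_pair));
    cudaMalloc(&d_forces, 8 * sizeof(float4));
    cudaMemcpy(d_atoms, h_atoms, sizeof(h_atoms), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pairs, &h_pair, sizeof(h_pair), cudaMemcpyHostToDevice);

    pairKernel<<<1, 64>>>(d_atoms, d_pairs, d_forces, 144.f);  // one block per pair
    float4 h_forces[8];
    cudaMemcpy(h_forces, d_forces, 8 * sizeof(float4), cudaMemcpyDeviceToHost);
    std::printf("force on atom 0: (%g, %g, %g)\n",
                h_forces[0].x, h_forces[0].y, h_forces[0].z);

    cudaFree(d_atoms); cudaFree(d_pairs); cudaFree(d_forces);
    return 0;
}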

Initial GPU Performance
- Full NAMD, not a test harness.
- Useful performance boost:
  - 8x speedup for nonbonded
  - 5x speedup overall without PME
  - 3.5x speedup overall with PME
  - GPU = quad-core CPU
- Plans for better performance:
  - Overlap GPU and CPU work (see the sketch after this slide).
  - Tune or port remaining work: PME, bonded, integration, etc.
[Bar chart: ApoA1 performance in seconds per step, CPU vs. faster GPU run, broken into nonbond, PME, and other; 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX.]
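Overlapping GPU and CPU work relies on asynchronous kernel launches: the host enqueues the GPU kernel and then computes other terms (e.g. bonded forces) on the CPU while the GPU runs. A minimal sketch of the pattern, with placeholder workloads of my own rather than NAMD's code, is below.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the expensive GPU work (e.g. nonbonded forces).
__global__ void busyKernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (float)i;
        for (int k = 0; k < 1000; ++k) x = x * 1.000001f + 0.5f;  // stand-in work
        out[i] = x;
    }
}

// Placeholder for CPU-side work done concurrently (e.g. bonded forces).
double cpuWork(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += 1.0 / (i + 1.0);
    return s;
}

int main() {
    const int n = 1 << 20;
    float* d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Kernel launches are asynchronous: control returns to the host
    // immediately, so the CPU can work while the GPU computes.
    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    double bonded = cpuWork(1 << 22);            // overlapped CPU work

    cudaStreamSynchronize(stream);               // wait for GPU results
    std::printf("cpu result %f, gpu done\n", bonded);

    cudaStreamDestroy(stream);
    cudaFree(d_out);
    return 0;
}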

Initial GPU Cluster Performance
- Poor scaling unsurprising:
  - 2x speedup on 4 GPUs
  - Gigabit ethernet
  - Load balancer disabled
- Plans for better scaling:
  - InfiniBand network
  - Tune parallel overhead
  - Load balancer changes: balance GPU load, minimize communication.
[Bar chart: ApoA1 performance in seconds per step, CPU vs. faster runs on 1 to 4 GPUs, broken into nonbond, PME, and other; 2.2 GHz Opteron + GeForce 8800 GTX.]

Next Goal: Interactive MD on GPU
- Definite need for faster serial IMD:
  - Useful method for tweaking structures.
  - 10x performance yields 100x sensitivity.
  - Needed since on-demand clusters are rare.
- AutoIMD available in VMD already:
  - Isolates a small subsystem.
  - Specify molten and fixed atoms.
  - Fixed atoms reduce GPU work.
  - Pairlist-based algorithms start to win.
- Limited variety of simulations:
  - Few users have multiple GPUs.
  - Move the entire MD algorithm to the GPU.
[Photo: NAMD/VMD user (former HHS Secretary Thompson).]