Strong Scaling for Molecular Dynamics Applications

Scaling and Molecular Dynamics. Strong scaling asks how the solution time varies with the number of processors for a fixed problem size. Classical molecular dynamics, at each time step, computes the pairwise forces on the atoms and then integrates the atoms forward. Forces are either bonded (the atoms are covalently bound) or non-bonded; the amount of directly computed force is limited by using a cutoff distance, and the forces from the cutoff out to infinity are estimated with an FFT. Strong scaling matters for molecular dynamics because, for a single run, performance can only be improved by speeding up the time step, time steps already take only milliseconds, and a very large number of time steps is needed to study phenomena of biological interest.
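As a reference for the efficiency numbers quoted throughout the talk, one minimal way to write the strong-scaling parallel efficiency (assuming the baseline is the smallest measured node count p0, which is how the tables below appear to be normalized) is:

```latex
% Strong-scaling parallel efficiency relative to a baseline of p_0 nodes.
% T(p): time per step on p nodes;  R(p): throughput (e.g. ns/day) on p nodes.
E(p) = \frac{p_0\,T(p_0)}{p\,T(p)} = \frac{R(p)}{(p/p_0)\,R(p_0)}
```

For example, the AMBER JAC numbers a few slides below give E(2) = 49.16 / (2 x 35.7), which is approximately 0.69, matching the reported parallel efficiency.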

Overview of talk. This talk looks at the issues we found while trying to improve the scaling performance of MD codes, mostly focusing on NAMD. I will try to make these lessons as general as possible; this is not a comprehensive list of all scaling issues an application might face.

What could cause scaling issues on GPU nodes? We will focus on issues that differ from what you would face on the CPU path:
1. The GPU kernel(s) themselves do not scale.
2. The code that sets up work for the GPU, manages the GPU, or processes work coming back from the GPU does not scale.
3. Something else, such as communication, becomes the bottleneck; the GPU may have sped up the computation to the point where computation is no longer the bottleneck.

Process: profile! Over multiple processor counts, observe how the following vary: time spent in GPU functions, time spent in CPU functions, and communication cost. Look at timeline traces to help understand bottlenecks and validate hypotheses.

Example - AMBER JAC NVE on M2090

                      1 node   2 nodes   4 nodes   6 nodes   8 nodes
ns/day                35.7     49.16     69.68     79.73     85.21
Parallel efficiency   1.00     0.69      0.49      0.37      0.30

Example - AMBER profile

                         Time (ms)                     Parallel efficiency
                         2 nodes   4 nodes   8 nodes   2 nodes   4 nodes   8 nodes
Build neighbor list      0.171     0.087     0.051     1.000     0.982     0.844
Non-bond forces          1.030     0.548     0.347     1.000     0.939     0.742
MPI win fence            0.291     0.205     0.280     1.000     0.710     0.259
MPI allgather            0.193     0.324     0.404     1.000     0.298     0.119
PME fill charge buffer   0.339     0.346     0.386     1.000     0.490     0.220

Example - AMBER timelines at 2, 4, and 8 nodes. [Timeline screenshots; legend: ChargeGrid and PME kernels, MPI fence and gather, non-bond force kernel, other kernels, host-side CUDA functions, CUDA memcopies.]

Kernel Scaling

Possible reasons for the kernel timing not scaling: Is the work given to the kernel scaling, i.e. is it reduced to 1/n per GPU when running on n GPUs? Are you running fewer threads than needed to saturate the machine? Are we running into the tail effect?

Work executed by the GPU not scaling? In AMBER, some of the functions execute on only one GPU. [Plot: time taken for serial PME vs. node count (2, 4, 6, 8), an instance of Amdahl's law.]

Not enough work to saturate the GPU. Try to increase occupancy: is occupancy limited by register or shared memory usage? Can you reduce the amount of work each thread does in order to increase the total number of threads on the GPU? Are there multiple kernels that you could run concurrently? Is there more work that you can port to the GPU?
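As a concrete illustration of the occupancy questions above, here is a minimal sketch (not from the talk; the kernel and launch parameters are placeholders) that uses the occupancy API available in later CUDA releases to see how many blocks of a kernel can be resident per SM, and hence whether a given launch can saturate the device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for a real force kernel.
__global__ void forceKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 128;        // illustrative launch configuration
    const size_t dynamicSmem = 0;     // dynamic shared memory per block, bytes

    // How many blocks of this kernel can be resident on one SM? The limit may
    // come from registers, shared memory, or the thread/block caps.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, forceKernel, blockSize, dynamicSmem);

    int residentThreads = blocksPerSM * blockSize;
    printf("blocks/SM: %d, occupancy: %.0f%%, SMs: %d\n",
           blocksPerSM,
           100.0 * residentThreads / prop.maxThreadsPerMultiProcessor,
           prop.multiProcessorCount);

    // If the grid has fewer than blocksPerSM * multiProcessorCount blocks,
    // the launch cannot saturate the GPU no matter how efficient each block is.
    return 0;
}
```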

Kernel not scaling - NAMD

NAMD kernel scaling   1 node   8 nodes   16 nodes   32 nodes
ApoA1                 27.3     3.9       2.3        1.5
Parallel efficiency   1.00     0.88      0.74       0.57

In the case of NAMD, the amount of work given to the kernel was scaling perfectly and there was enough work. Hypothesis: some blocks take a very long time, causing low SM occupancy towards the tail end of the kernel and resulting in the scaling issues.

Tail effect. [Diagrams: per-SM timelines (SM 0, ...) for GPU 0 and GPU 1 over time, showing SMs going idle near the end of the kernel while the last long-running blocks finish.]

Block timing, ApoA1 at 32 nodes (twoAwayY, SMs 0-10). [Plot: blocks, sorted by start time, vs. time, over timesteps 1 through 5.]

Number of blocks running on SM 0. [Plots: number of resident blocks (2 to 8) vs. SM clock, at 2, 16, and 32 nodes.]

Performance model. We can build a performance model to estimate the improvement in kernel time that would result from perfect load balancing: distribute blocks to SMs in round-robin fashion; inside each SM keep n queues, where n is the number of concurrent blocks; model actual kernel behavior by sampling from the measured block runtimes; model kernel behavior under perfect load balancing by setting all block runtimes to the average value.
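A minimal sketch of such a model in host-side C++ (the SM count, concurrency, and sample runtimes are illustrative; the real model samples from the measured per-block runtime distribution rather than using a hard-coded list):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Estimate kernel time by handing blocks out round-robin to per-SM slots,
// with `concurrentBlocks` slots (queues) per SM, as described on the slide.
double modelKernelTime(const std::vector<double>& blockTimes,
                       int numSMs, int concurrentBlocks) {
    // Each entry holds the time at which that SM slot becomes free again.
    std::vector<double> slotFree(numSMs * concurrentBlocks, 0.0);
    for (std::size_t b = 0; b < blockTimes.size(); ++b) {
        std::size_t slot = b % slotFree.size();   // round-robin placement
        slotFree[slot] += blockTimes[b];          // block runs once its slot frees up
    }
    // The kernel finishes when the last slot drains.
    return *std::max_element(slotFree.begin(), slotFree.end());
}

int main() {
    // Stand-in for runtimes sampled from the measured per-block distribution.
    std::vector<double> blockTimes = {0.8, 1.2, 5.0, 0.9, 1.1, 0.7, 4.8, 1.0};

    const int numSMs = 16, concurrentBlocks = 4;  // illustrative GPU shape
    double actual = modelKernelTime(blockTimes, numSMs, concurrentBlocks);

    // Perfectly load-balanced variant: every block takes the average time.
    double avg = std::accumulate(blockTimes.begin(), blockTimes.end(), 0.0)
               / blockTimes.size();
    std::vector<double> balanced(blockTimes.size(), avg);
    double ideal = modelKernelTime(balanced, numSMs, concurrentBlocks);

    std::printf("modeled %.2f, balanced %.2f, predicted improvement %.0f%%\n",
                actual, ideal, 100.0 * (actual - ideal) / actual);
    return 0;
}
```

The gap between the modeled time and the balanced time is the predicted improvement from perfect load balancing.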

Estimates from the performance model: a 52% predicted improvement at 32 nodes if the blocks are perfectly load balanced.

Managing the GPU

Collecting work from multiple cores. There are potential performance issues when running multiple MPI ranks with each rank submitting work to the GPU: each rank has to submit to its own GPU context, and kernels from different contexts cannot run concurrently. You can do better by using threads instead of MPI ranks, so that all threads can submit to the same context, or by using the CUDA proxy (introduced in CUDA 5.0, later known as the Multi-Process Service), which collects work from multiple MPI ranks to be submitted through one context.
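A minimal sketch of the threads-instead-of-ranks option (OpenMP is used here purely for illustration, and the kernel is a placeholder): because all threads live in one process, they share a single CUDA context, so kernels submitted to different streams are at least eligible to run concurrently.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for per-worker force work.
__global__ void workKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int nWorkers = 4;
    const int n = 1 << 20;
    float* d_data[nWorkers];
    cudaStream_t streams[nWorkers];

    // All threads live in one process and therefore share one CUDA context.
    #pragma omp parallel num_threads(nWorkers)
    {
        int w = omp_get_thread_num();
        cudaMalloc(&d_data[w], n * sizeof(float));
        cudaStreamCreate(&streams[w]);

        // Each worker submits into its own stream; kernels from the shared
        // context may overlap, unlike kernels launched from separate MPI
        // ranks with separate contexts (without the CUDA proxy / MPS).
        workKernel<<<(n + 255) / 256, 256, 0, streams[w]>>>(d_data[w], n);
        cudaStreamSynchronize(streams[w]);

        cudaStreamDestroy(streams[w]);
        cudaFree(d_data[w]);
    }
    return 0;
}
```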

NAMD: collecting work from multiple ranks.

Waiting for all the data to arrive. [Timeline diagrams comparing time steps at 1 node and at 16 nodes: at 1 node the GPU kernel fills most of the time step, while at 16 nodes a noticeable part of each time step is spent waiting for data to arrive before the kernel can run.]

Packaging work for the GPU. Is the function that packages work for the GPU and processes work coming off the GPU scaling, both across nodes and across cores? In NAMD, pushing the enqueuecuda work to the sender nodes improved scaling both across nodes and across cores.

Figuring out kernel completion. Reducing the time to query the event that signals kernel completion gave a 10% improvement at 16 nodes and a 57% improvement at 52 nodes for ApoA1. [Before/after timeline diagrams of GPU and CPU activity.]
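One common way to keep that completion check cheap (a sketch under generic assumptions, not NAMD's actual code; the helper functions are placeholders): record an event created with cudaEventDisableTiming after the kernel, then poll it with the non-blocking cudaEventQuery in between other useful work instead of blocking in a synchronize call.

```cuda
#include <cuda_runtime.h>

// Placeholders standing in for the application's real work.
__global__ void forceKernel() {}
void processOtherMessages() {}
void processKernelResults() {}

void timestep(cudaStream_t stream) {
    // An event created without timing is cheaper to record and to query.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    forceKernel<<<256, 128, 0, stream>>>();
    cudaEventRecord(done, stream);

    // cudaEventQuery returns cudaErrorNotReady until the kernel completes,
    // so the CPU can keep doing other work instead of blocking.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        processOtherMessages();
    }
    processKernelResults();
    cudaEventDestroy(done);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    timestep(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```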

Communication Issues

Communication issues. If all your CPU functions and GPU kernels are scaling well but the overall application is not, communication could be the bottleneck. Communication issues are more critical to address in the GPU path: because the GPU speeds up the computation, communication can more easily become the bottleneck.

Improving communication performance. Use native uGNI messaging rather than generic MPI on a Cray cluster. Improving communication performance helps the scaling of the GPU path more than that of the CPU path.

NAMD scaling of the 100 M atom system. Performance results generated on Titan at Oak Ridge National Laboratory.

Analysis (average times in seconds; "strong scaling" is the parallel efficiency going from 32 to 768 nodes)

                         GPU, 100 stmv,   GPU, 100 stmv,   Strong
                         32 nodes         768 nodes        scaling
Average time step        1.249            0.085            0.616
Non-PME step             1.009            0.046            0.905
PME step                 1.833            0.176            0.435
Sum of PME computation   0.796            0.034            0.988
Kernel                   0.964            0.042            0.966

Example NAMD timelines, showing 11 CPU threads and one communication thread. Legend: magenta = enqueuecuda(work), white = idle time, green = PME functions, pink superscript = GPU kernel, blue/purple = CPU functions.

NAMD, strong scaling, 100 M atoms, GPU. At 32 nodes (100 stmv): 4 timesteps = 4982 ms, i.e. 1.24 s/step, with 1.01 s for a non-PME step and 1.83 s for a PME step. At 768 nodes (100 stmv): 4 timesteps = 336 ms, i.e. 0.084 s/step, with 0.046 s for a non-PME step and 0.185 s for a PME step.

Communication delays in PME: tracing the data needed for one ungrid calculation (at 768 nodes: 4 timesteps = 336 ms = 0.084 s/step; 0.046 s for a non-PME step, 0.185 s for a PME step). Observed delays: 0.046 s for a 100 KB message and 0.042 s for a 10 KB message.

How to improve performance: make algorithmic changes to reduce the communication needed, for example reducing the amount of PME data by doing more non-bonded work, and improve the communication itself.

Topology. Do you have the ideal topology of nodes to maximize bandwidth? Do the CPU-only nodes have the same or better topology? Consider topology-aware mapping. Randomization: as an experiment, try randomizing the ordering of the nodes you receive.

Reducing communication delays. At 2 nodes: MPI_Win_fence = 338 us, MPI_Allgatherv = 183 us; at 8 nodes: MPI_Win_fence = 537 us, MPI_Allgatherv = 413 us. Reduce or remove global syncs and pairwise syncs; try one-sided RMA operations. If MPI is slow, try alternatives such as the Global Arrays toolkit, UPC, Coarray Fortran, or DDI. Use peer-to-peer / interprocess P2P for GPUs on the same node.
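A minimal sketch of the one-sided direction (MPI-3 passive-target RMA; the buffer size, neighbor choice, and displacement are illustrative): it avoids the collective MPI_Win_fence entirely and completes transfers per target with MPI_Win_flush.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;                           // illustrative buffer size
    std::vector<double> gridSlab(N, 0.0);
    MPI_Win win;
    MPI_Win_create(gridSlab.data(), N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Passive target: no MPI_Win_fence, hence no global synchronization.
    MPI_Win_lock_all(0, win);

    double contribution = static_cast<double>(rank);
    int target = (rank + 1) % size;               // illustrative neighbor
    MPI_Put(&contribution, 1, MPI_DOUBLE, target, /*target_disp=*/0,
            1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);                   // complete only this target

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```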

Contention. If you are running into network contention, try rendezvous messaging (the sender sends a small control message and the receiver initiates the get) instead of greedy, fire-and-forget sends; this lets you throttle and control congestion on both the sender and the receiver side.
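A sketch of a hand-rolled rendezvous exchange in plain MPI terms (tags, sizes, and function names are illustrative, not from the talk): only a small control message travels eagerly, and the receiver decides when the bulk payload moves, which gives it a natural place to limit how many large transfers are in flight.

```cpp
#include <mpi.h>
#include <vector>

// Illustrative tags, not from the talk.
enum { TAG_CTRL = 100, TAG_READY = 101, TAG_DATA = 102 };

// Sender: announce the payload, wait for the receiver's pull request, then
// send the bulk data (rendezvous rather than fire-and-forget).
void rendezvousSend(const std::vector<double>& payload, int dest) {
    int count = static_cast<int>(payload.size());
    MPI_Send(&count, 1, MPI_INT, dest, TAG_CTRL, MPI_COMM_WORLD);

    int ready = 0;
    MPI_Recv(&ready, 1, MPI_INT, dest, TAG_READY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(payload.data(), count, MPI_DOUBLE, dest, TAG_DATA, MPI_COMM_WORLD);
}

// Receiver: it chooses when to ask for the data, so it can throttle how many
// large transfers are in flight and avoid flooding the network.
std::vector<double> rendezvousRecv(int src) {
    int count = 0;
    MPI_Recv(&count, 1, MPI_INT, src, TAG_CTRL, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    int ready = 1;                        // receiver-side throttling goes here
    MPI_Send(&ready, 1, MPI_INT, src, TAG_READY, MPI_COMM_WORLD);

    std::vector<double> payload(count);
    MPI_Recv(payload.data(), count, MPI_DOUBLE, src, TAG_DATA,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return payload;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)      rendezvousSend(std::vector<double>(1000, 1.0), 1);
    else if (rank == 1) rendezvousRecv(0);
    MPI_Finalize();
    return 0;
}
```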

Overlap kernel computation and network activity, if your CPU core is also responsible for the communication. [Before/after timelines. Before: kernel, cudaSynchronize, poll network, kernel, cuda memcpy, poll network. After: kernel, poll network, kernel, poll network, async copy, with the CPU checking the status of an event query between polls.]
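A sketch of the "after" pattern on this slide, building on the event-query idea shown earlier (generic assumptions: placeholder kernel and network-poll helper; the host buffer is pinned with cudaMallocHost so the copy can be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void forceKernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.0f;           // placeholder work
}

void pollNetwork() { /* placeholder: progress MPI / uGNI messages */ }

void runStep(float* d_forces, float* h_forces, int n, cudaStream_t stream) {
    cudaEvent_t copied;
    cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);

    // Kernel and device-to-host copy are enqueued back to back in one stream;
    // the copy starts as soon as the kernel finishes, with no CPU involvement.
    forceKernel<<<(n + 127) / 128, 128, 0, stream>>>(d_forces, n);
    cudaMemcpyAsync(h_forces, d_forces, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(copied, stream);

    // Instead of cudaStreamSynchronize, keep making network progress and
    // check the event status between polls.
    while (cudaEventQuery(copied) == cudaErrorNotReady) {
        pollNetwork();
    }
    cudaEventDestroy(copied);
}

int main() {
    const int n = 1 << 16;
    float *d_forces, *h_forces;
    cudaMalloc(&d_forces, n * sizeof(float));
    cudaMallocHost(&h_forces, n * sizeof(float));   // pinned host buffer
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    runStep(d_forces, h_forces, n, stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_forces);
    cudaFree(d_forces);
    return 0;
}
```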

Acknowledgements: Oak Ridge National Lab; Scott Le Grand and Ross Walker (AMBER); Jim Phillips, Sanjay Kale, Yanhua Sun, Gengbin Zheng and Eric Bohm (NAMD); Sarah Anderson and Ryan Olson (Cray); Sky Wu, Nikolai Sakharnykh, Mike Parker and Justin Luitjens (NVIDIA).

Questions? Thank You!