NAMD Serial and Parallel Performance

Jim Phillips, Theoretical Biophysics Group

Serial performance basics
Main factors affecting serial performance:
- Molecular system size and composition.
- Cutoff distance and cycle length.
- Full electrostatics (PME) parameters.
- Processor architecture and speed.

System size and density
Time per step scales:
- linearly with the number of atoms,
- linearly with density (atoms per volume).
Example: explicit hydrogens vs. united atoms gives 1/3 the number of atoms and 1/3 the density, so expect 9x the performance (for protein).

Cutoffs and frequencies
- Time per step scales cubically with the cutoff.
- Steps per cycle has a much smaller effect.
Example: a 10 Å cutoff vs. a 14 Å cutoff gives 10^3 = 1000 vs. 14^3 = 2744, so expect about 2.7x the performance (a rough sketch of these scaling rules follows below).
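A rough back-of-the-envelope sketch of those scaling rules, in Python. The reference system (atom count, density, cutoff) and the function name relative_step_time are arbitrary illustrations, not NAMD quantities or NAMD code.

```python
def relative_step_time(n_atoms, density, cutoff,
                       ref_atoms=100_000, ref_density=0.1, ref_cutoff=12.0):
    """Time per step relative to an arbitrary reference system:
    linear in atom count, linear in density, cubic in the cutoff."""
    return (n_atoms / ref_atoms) * (density / ref_density) * (cutoff / ref_cutoff) ** 3

# The slide's cutoff example: 10 A vs. 14 A for the same system.
t_14 = relative_step_time(100_000, 0.1, 14.0)
t_10 = relative_step_time(100_000, 0.1, 10.0)
print(t_14 / t_10)   # ~2.74, i.e. roughly 2.7x faster with the shorter cutoff
```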

PME and grid sizes
- PME with a shorter cutoff breaks even.
- Grid spacing should be about 1 Å.
- Grid sizes should be of the form 2^i 3^j 5^k (powers of 2 are best!).
- Only affects the FFT (should be minor).
Example: for a 47 Å x 31 Å x 39 Å periodic cell, use grid sizes 48 x 32 x 40 (a small helper for this is sketched below).

Processors
Limited dependence on the memory system:
- Small amount of data.
- Cache-friendly design.
Performance scales:
- linearly with clock speed (for one architecture),
- unpredictably with SPEC benchmarks.
Example: a 1333 MHz Athlon equals a 667 MHz Alpha.
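A minimal sketch of that grid-size rule: round each cell dimension up to the next integer whose only prime factors are 2, 3, and 5, assuming a ~1 Å target spacing. The function names here are made up for illustration; in NAMD the sizes are set with the PMEGridSizeX/Y/Z configuration options.

```python
import math

def nice_fft_size(n):
    """Smallest integer >= n whose only prime factors are 2, 3, and 5."""
    size = max(1, n)
    while True:
        m = size
        for p in (2, 3, 5):
            while m % p == 0:
                m //= p
        if m == 1:
            return size
        size += 1

def pme_grid(cell_lengths, spacing=1.0):
    """Suggested grid sizes for a periodic cell at roughly the given spacing (Angstroms)."""
    return [nice_fft_size(math.ceil(length / spacing)) for length in cell_lengths]

print(pme_grid([47.0, 31.0, 39.0]))   # -> [48, 32, 40], the slide's example
```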

Parallel performance basics
- Speedup = (serial time) / (parallel time).
- Ideal (linear) speedup is the number of CPUs.
- Amdahl's law says the best speedup possible is (total work) / (non-parallelizable work); a small worked example follows after this list.
- In MD, each timestep must be parallelized.

NAMD parallel design
- Atoms are spatially decomposed into cubes, which are distributed among CPUs. This generates 25-400 cubes to distribute.
- Interactions between pairs of cubes are also distributed among CPUs. This generates 300-5000 interaction groups.
- Interactions are redistributed among CPUs based on measurements of their run time.
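A quick numerical illustration of those speedup bounds. The 1% non-parallelizable fraction is an assumed number chosen for illustration, not a NAMD measurement.

```python
def speedup(serial_time, parallel_time):
    """Speedup as defined above."""
    return serial_time / parallel_time

def amdahl_bound(serial_fraction, n_cpus):
    """Best possible speedup if serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

for p in (8, 32, 128, 1024):
    print(f"{p:4d} CPUs: at most {amdahl_bound(0.01, p):.1f}x")
# Even with only 1% serial work the speedup saturates near 100x, which is why
# every part of an MD timestep must be parallelized.
```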

Calculating speedup
- Run 500 steps on 1, 2, 4, 8, ... CPUs.
- Use the final performance and ignore startup time.
- Be aware of dynamic load balancing:
  - Don't use timing from the start of the simulation.
  - Don't use timing that includes a load balancing step.
  - Don't use the "Initial timing:" value.
  - The best estimate is the "Benchmark timing:" value (a short script below turns such timings into speedups).

Cutoff and system size
- A smaller cutoff may improve efficiency: data is distributed to more CPUs, but there is less work to distribute.
- A larger simulation can employ more CPUs: more work to divide among the CPUs, but more data to distribute as well.
- The length of the simulation doesn't matter.
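A sketch of turning a set of per-step benchmark timings into speedup and efficiency figures, following the procedure above. The seconds-per-step values are invented placeholders; substitute the benchmark timings reported by your own NAMD runs.

```python
# CPUs -> seconds per step (made-up example values)
benchmark = {1: 4.00, 2: 2.05, 4: 1.06, 8: 0.56, 16: 0.31, 32: 0.19}

serial_s_per_step = benchmark[1]
for cpus in sorted(benchmark):
    speedup = serial_s_per_step / benchmark[cpus]
    efficiency = speedup / cpus
    print(f"{cpus:3d} CPUs  speedup {speedup:5.1f}  efficiency {efficiency:5.0%}")
```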

Best recorded scaling (cutoff)
[Figure: speedup vs. number of processors (up to 2500) on ASCI Red. Published in SC2000: Gordon Bell Award Finalist.]

PME and parallel scaling
PME is harder to parallelize:
- More communication.
- More stages of communication.
- Only uses a number of CPUs equal to the grid size.
- But superlinear speedup was observed on the T3E.

PME performance increase

Parallel machines
Three dimensions to machine performance:
- Performance of the individual processors.
- Bandwidth of the communication network.
- Latency of the communication network.
CPUs are improving faster than networks. Slow CPUs and fast networks give the best speedups, but not the best price/performance (a toy cost model illustrating this is sketched below).
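A toy model, not from the slides, to make the CPU-versus-network tradeoff concrete: per-step time is a compute term that shrinks with more CPUs plus a communication term that does not. All of the numbers (work per step, message counts, hardware figures) are assumptions chosen only for illustration.

```python
def parallel_step_time(cpus, cpu_gflops, latency_s, bandwidth_gb_per_s,
                       work_gflop=10.0, msgs=50, mbytes=1.0):
    """Compute time shrinks with more CPUs; per-CPU communication cost does not."""
    compute = work_gflop / (cpus * cpu_gflops)
    comm = 0.0 if cpus == 1 else msgs * latency_s + (mbytes / 1000.0) / bandwidth_gb_per_s
    return compute + comm

for label, gflops, lat, bw in [("fast CPUs, slow network", 2.0, 50e-6, 0.1),
                               ("slow CPUs, fast network", 0.5, 5e-6, 1.0)]:
    t1 = parallel_step_time(1, gflops, lat, bw)
    t64 = parallel_step_time(64, gflops, lat, bw)
    print(f"{label}: speedup on 64 CPUs = {t1 / t64:.1f}, step time = {t64 * 1000:.0f} ms")
# The slow-CPU/fast-network cluster shows the better speedup, but the fast-CPU
# cluster still finishes each step sooner -- speedup alone is not price/performance.
```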

Latency tolerant design
- Some communication is necessary, but we don't want to sit idle waiting for it.
- Work is broken down into interaction sets which require different parts of the incoming data.
- Interactions are calculated as soon as the required data is received, not in any particular order.
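A minimal sketch of that message-driven idea, in plain Python rather than NAMD's actual Charm++ machinery: each interaction set waits only on the cubes it needs and is computed the moment the last of them arrives, in whatever order the messages come in. All names here (InteractionScheduler, cube IDs, etc.) are invented for illustration.

```python
from collections import defaultdict

class InteractionScheduler:
    def __init__(self, interaction_sets):
        # interaction_sets: {set_id: cubes whose coordinates the set requires}
        self.waiting = {sid: set(cubes) for sid, cubes in interaction_sets.items()}
        self.subscribers = defaultdict(list)
        for sid, cubes in self.waiting.items():
            for cube in cubes:
                self.subscribers[cube].append(sid)

    def on_cube_data(self, cube_id, coords):
        """Called whenever a cube's coordinates arrive, in any order."""
        for sid in self.subscribers[cube_id]:
            deps = self.waiting[sid]
            deps.discard(cube_id)
            if not deps:                 # all required data has arrived:
                self.compute(sid)        # compute immediately, don't wait for the rest

    def compute(self, set_id):
        print(f"computing interaction set {set_id}")

sched = InteractionScheduler({"A": ["cube1", "cube2"], "B": ["cube2", "cube3"]})
sched.on_cube_data("cube2", coords=None)   # nothing ready yet
sched.on_cube_data("cube3", coords=None)   # set B is complete -> computed now
sched.on_cube_data("cube1", coords=None)   # set A is complete -> computed now
```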

Cost of large simulations
- 100,000-atom simulation with PME.
- 20 s per timestep on one 1 GHz CPU.
- One nanosecond of simulated time requires 5000 CPU-hours: 8 CPU-months, or 1 week on 32 CPUs (the arithmetic is worked through below).
- You can buy 32-processor machines, but they cost $300K (or more, we don't ask).

NAMD 2.3 on an Athlon cluster
- 67% efficiency on 32 CPUs of commodity hardware.
- The NAMD design is latency tolerant and cache friendly.
- Can simulate 100K atoms at 50 ns/year on 32 CPUs.
- Equivalent to owning a 100-CPU Cray T3E for only $32K.
[Figure: performance on ApoA1 (ns simulated per week) vs. number of CPUs, 0 to 32.]
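A quick check of that arithmetic. The 1 fs integration timestep is an assumption (the slide gives only the cost per step); the other inputs come straight from the slide.

```python
s_per_step = 20.0            # seconds per timestep on one 1 GHz CPU
timestep_fs = 1.0            # assumed integration timestep
steps_per_ns = 1e6 / timestep_fs

cpu_hours_per_ns = steps_per_ns * s_per_step / 3600.0
print(f"{cpu_hours_per_ns:,.0f} CPU-hours per ns")            # ~5,500, which the slide rounds to 5000
print(f"{cpu_hours_per_ns / 24 / 30:.1f} CPU-months per ns")  # ~7.7, i.e. about 8 CPU-months
print(f"{cpu_hours_per_ns / 32 / 24:.1f} days on 32 CPUs")    # ~7.2, i.e. about 1 week

# The Athlon-cluster figure of 50 ns/year corresponds to roughly 1 ns per week,
# which is what the ApoA1 performance plot shows at 32 CPUs.
print(f"50 ns/year = {50 / 52:.2f} ns/week")
```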

NAMD: Scalable Molecular Dynamics
- Parallel simulation of biomolecular systems.
- Spatial decomposition for distributed memory.
- Message driven for latency tolerance.
- Object-oriented and implemented in C++.
Kalé et al., J. Comp. Phys., 151, 283 (1999).
[Figure: scaling plot with both axes running from 0 to 1536.]