GROMACS (GPU) Performance Benchmark and Profiling. February 2016


2 Note
The following research was performed under the HPC Advisory Council activities
Participating vendors: Dell, Mellanox, NVIDIA
Compute resource: HPC Advisory Council Cluster Center
The following was done to provide best practices:
GROMACS performance overview
Understanding GROMACS communication patterns
Ways to increase GROMACS productivity
For more info please refer to:
http://www.dell.com
http://www.mellanox.com
http://www.nvidia.com
http://www.gromacs.org

3 GROMACS
GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics simulation package
Primarily designed for biochemical molecules such as proteins, lipids and nucleic acids
Many algorithmic optimizations have been introduced in the code; it is extremely fast at calculating nonbonded interactions
Ongoing development extends GROMACS with interfaces to both quantum chemistry and bioinformatics/databases
Open-source software released under the GPL

4 Objectives
The presented research was done to provide best practices:
GROMACS performance benchmarking
MPI library performance comparison
Interconnect performance comparison
CPU comparison
Optimization tuning
The presented results will demonstrate:
The scalability of the compute environment/application
Considerations for higher productivity and efficiency

5 Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) "Thor" cluster
Dual-socket 14-core Intel Xeon E5-2697 v3 @ 2.60 GHz CPUs (BIOS: Maximum Performance, Home Snoop)
Memory: 64GB per node, DDR4 2133 MHz; Memory Snoop Mode in BIOS set to Home Snoop
OS: RHEL 6.5, MLNX_OFED_LINUX-3.2-1.0.1.1 InfiniBand SW stack
Hard drives: 2x 1TB 7.2K RPM 2.5" SATA in RAID 1
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters
Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand switch
Mellanox ConnectX-3 FDR VPI InfiniBand and 40Gb/s Ethernet adapters
Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet switch
Dell InfiniBand-based Lustre storage built on Dell PowerVault MD3460 and MD3420
GPUs: NVIDIA Tesla K40 and K80; 1 GPU per node
MPI: Mellanox HPC-X v1.4.356 (based on Open MPI 1.8.8) with CUDA 7.0 support
Application: GROMACS 5.0.4 (single precision)
Benchmark datasets: Alcohol dehydrogenase protein (ADH) solvated in a rectangular box (134,000 atoms), simulated with a 2 fs time step (http://www.gromacs.org/gpu_acceleration)
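The slides list the software stack but not the build options. As a point of reference only, a minimal sketch of how a single-precision, MPI- and CUDA-enabled GROMACS 5.0.4 mdrun_mpi binary could be configured is shown below; the CUDA path and parallel build width are assumptions, not details taken from the study.

    # Hedged sketch: one way to build GROMACS 5.0.4 with MPI and CUDA support
    # (HPC-X/Open MPI and CUDA 7.0 assumed to be in the environment)
    tar xzf gromacs-5.0.4.tar.gz
    cd gromacs-5.0.4 && mkdir build && cd build
    # -DGMX_MPI=ON with -DGMX_BUILD_MDRUN_ONLY=ON produces an mdrun_mpi binary
    # -DGMX_GPU=ON enables CUDA offload of short-range non-bonded interactions
    # -DGMX_DOUBLE=OFF selects single precision, as used in this benchmark
    cmake .. -DGMX_MPI=ON -DGMX_BUILD_MDRUN_ONLY=ON -DGMX_GPU=ON -DGMX_DOUBLE=OFF \
             -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-7.0
    make -j 28 && make install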

6 PowerEdge R730: Massive flexibility for data-intensive operations
Performance and efficiency
Intelligent hardware-driven systems management with extensive power management features
Innovative tools including automation for parts replacement and lifecycle manageability
Broad choice of networking technologies from Ethernet to InfiniBand
Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
Benefits
Designed for performance workloads
High-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities
Flexible compute platform with dense storage capacity: 2S/2U server, 6 PCIe slots
Large memory footprint (up to 768GB / 24 DIMMs)
High I/O performance and optional storage configurations
HDD options: 12x 3.5" or 24x 2.5", plus 2x 2.5" HDDs in the rear of the server
Up to 26 HDDs, with 2 hot-plug drives in the rear of the server for boot or scratch

7 GROMACS Performance: adh_cubic
For adh_cubic, the Tesla K80 generally outperforms its predecessor, the Tesla K40
The K80 delivers up to 71% higher performance on the adh_cubic data
GROMACS parameters used to control the GPUs: mdrun_mpi -gpu_id 01 -nb gpu_cpu (for the K80, two MPI ranks per node are used, one per GPU device; see the launch sketch below)
(Chart: adh_cubic performance for K40 vs. K80 across node counts; higher is better)
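The slide gives only the mdrun flags, not a full command line. A minimal launch sketch with the HPC-X (Open MPI 1.8.8) mpirun is shown below; the hostfile name, node count, input file name and thread count are assumptions.

    # Hedged example: 16 nodes, 2 MPI ranks per node (one per K80 GPU device)
    mpirun -np 32 --map-by ppr:2:node -hostfile hosts \
        mdrun_mpi -gpu_id 01 -nb gpu_cpu -ntomp 14 -s adh_cubic.tpr
    # -gpu_id 01 maps the two ranks on each node to GPU IDs 0 and 1
    # -nb gpu_cpu splits non-bonded work between GPU and CPU
    # -ntomp 14 fills the 28 cores per node with OpenMP threads (assumed)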

8 GROMACS Performance: adh_dodec
For adh_dodec, the K80 performs 47% higher than the Tesla K40
Environment variable used to control which GPUs are visible: -x CUDA_VISIBLE_DEVICES=0,1 for the K80 (see the sketch below)
(Chart: adh_dodec performance for K40 vs. K80 across node counts; higher is better)
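The -x option is the Open MPI/HPC-X mpirun mechanism for exporting an environment variable to every rank. A short sketch, with hostfile and input file names assumed:

    mpirun -np 32 --map-by ppr:2:node -hostfile hosts \
        -x CUDA_VISIBLE_DEVICES=0,1 \
        mdrun_mpi -nb gpu_cpu -ntomp 14 -s adh_dodec.tpr
    # CUDA_VISIBLE_DEVICES=0,1 exposes both GPU devices of the K80 to each rank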

9 GROMACS Performance: adh_dodec_vsites
For adh_dodec_vsites, the K80 performs 36% higher than the Tesla K40
(Chart: adh_dodec_vsites performance for K40 vs. K80 across node counts; higher is better)

10 GROMACS Performance: Network Interconnects
EDR InfiniBand provides higher scalability for GROMACS
InfiniBand delivers 465% higher performance than 10GbE on 8 nodes
The benefit of InfiniBand over Ethernet is expected to increase as the cluster scales
Ethernet does not scale, while InfiniBand continues to scale (see the transport-selection sketch below)
(Chart: performance by interconnect and node count, 1 GPU per node; higher is better)
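The slides do not state how each interconnect was selected at the MPI level. With an Open MPI 1.8-based HPC-X, one common approach is to pick the transport via MCA parameters; the sketch below is illustrative only, and the Ethernet interface name is an assumption.

    # Force TCP over the Ethernet interface (interface name eth2 is assumed)
    mpirun -np 16 --map-by ppr:2:node -hostfile hosts \
        -mca btl tcp,self,sm -mca btl_tcp_if_include eth2 \
        mdrun_mpi -gpu_id 01 -nb gpu_cpu -s adh_cubic.tpr
    # InfiniBand is used by default with HPC-X (MXM-accelerated transport),
    # so no extra transport flags are needed for the EDR/FDR runs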

11 GROMACS Performance: CPU & GPU performance
The GPU provides a performance advantage over CPU cores alone on the same node
The GPU outperforms the CPU-only configuration by 22%-55% for adh_cubic on a single node
CPU scalability improves as the node count increases
At 16 nodes (448 cores), the CPU-only cluster delivers around 48% higher performance than the GPU-accelerated cluster
(Chart: adh_cubic performance, CPU-only vs. CPU+GPU, 1 GPU per node; higher is better)

12 GROMACS Performance: CPU & GPU performance
The GPU provides a performance advantage over CPU cores alone on the same node
The GPU outperforms the CPU-only configuration by 32%-44% for adh_dodec on a single node
CPU scalability improves as the node count increases
At 16 nodes (448 cores), the CPU-only cluster delivers around 68% higher performance than the GPU-accelerated cluster
(Chart: adh_dodec performance, CPU-only vs. CPU+GPU, 1 GPU per node; higher is better)

13 GROMACS Profiling: % of MPI Calls
The communication time for the GPU runs stays roughly the same as the cluster scales,
while the compute time decreases as the number of nodes increases

14 GROMACS Profiling: % of MPI Calls
The most time-consuming MPI calls for GROMACS (CUDA build); see the profiling sketch below:
MPI_Sendrecv: 40% of MPI time / 19% of wall time
MPI_Bcast: 20% of MPI time / 9% of wall time
MPI_Comm_split: 16% of MPI time / 7% of wall time
(Charts: 16 nodes, adh_cubic, RF; 8 nodes, adh_cubic, PME)
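The profiling tool used to produce these per-call statistics is not named on the slides. One way to collect a comparable breakdown is to preload a lightweight MPI profiler such as mpiP at run time; the sketch below assumes mpiP is installed under /opt/mpiP and is not necessarily the method used in the original study.

    # Hedged example: gather per-call MPI time and message-size statistics
    mpirun -np 32 --map-by ppr:2:node -hostfile hosts \
        -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so \
        mdrun_mpi -gpu_id 01 -nb gpu_cpu -s adh_cubic.tpr
    # mpiP writes a *.mpiP text report listing time per MPI call and message sizes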

15 GROMACS Profiling: MPI Message Size Distribution
For the most time-consuming MPI calls:
MPI_Comm_split: 0B messages (14% of MPI time)
MPI_Sendrecv: 16KB messages (13% of MPI time)
MPI_Bcast: 4B messages (11% of MPI time)
(Charts: 16 nodes, adh_cubic, RF; 8 nodes, adh_cubic, PME)

16 GROMACS Profiling: MPI Memory Consumption
Some load imbalance is seen in the MPI_Sendrecv workload
Memory consumption: about 400MB of memory is used on each compute node for this input data
(Chart: 8 nodes)

17 GROMACS Summary
GROMACS demonstrates good scalability on both CPU and GPU clusters
The Tesla K80 outperforms the Tesla K40 by up to 71%
The GPU outperforms the CPU on a per-node basis: up to 55% against the 28 CPU cores per node
InfiniBand enables scalable performance for GROMACS
InfiniBand delivers 465% higher performance than 10GbE on 8 nodes
The benefit of InfiniBand over Ethernet is expected to increase as the cluster scales
Ethernet does not scale, while InfiniBand continues to scale
Scalability on the CPU cluster is shown to be better than on the GPU cluster
The most time-consuming MPI calls for GROMACS (CUDA build):
MPI_Sendrecv: 40% of MPI time / 19% of wall time
MPI_Bcast: 20% of MPI time / 9% of wall time
MPI_Comm_split: 16% of MPI time / 7% of wall time

18 Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.