GROMACS Performance Benchmark and Profiling. August 2011


Note

The following research was performed under the HPC Advisory Council activities
- Participating vendors: Intel, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center

The following was done to provide best practices
- GROMACS performance overview
- Understanding GROMACS communication patterns
- Ways to increase GROMACS productivity
- MPI libraries comparisons

For more info please refer to
- http://www.dell.com
- http://www.intel.com
- http://www.mellanox.com
- http://www.gromacs.org

GROMACS

GROMACS (GROningen MAchine for Chemical Simulations)
- A molecular dynamics simulation package
- Primarily designed for biochemical molecules like proteins, lipids and nucleic acids
- Many algorithmic optimizations have been introduced in the code
  - Extremely fast at calculating the nonbonded interactions
- Ongoing development to extend GROMACS with interfaces to both Quantum Chemistry and Bioinformatics/databases
- Open source software released under the GPL

Test Cluster Configuration

- Dell PowerEdge M610 38-node (456-core) cluster
  - Six-Core Intel X5670 @ 2.93 GHz CPUs, Six-Core Intel X5675 @ 3.06 GHz CPUs
  - Memory: 24GB DDR3 1333 MHz
  - OS: RHEL 5.5, OFED 1.5.2 InfiniBand SW stack
- Intel Cluster Ready certified cluster
- Mellanox ConnectX-2 InfiniBand adapters and non-blocking switches
- Compiler: Intel Composer XE 2011 for Linux
- MPI: Intel MPI 4 Update 2, Open MPI 1.5.4 with KNEM 0.9.7, Platform MPI 8.1.1
- Math libraries: Intel MKL 10.3 Update 5
- Application: GROMACS 4.5.4
- Benchmark dataset: Kinesin Dimer docked to the MicroTubule (309,453 atoms, 20,000 steps, 80.0 ps)
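For reference, a minimal sketch of how an MPI benchmark run on a configuration like this might be launched. The binary name (mdrun_mpi), input file name, and rank count are illustrative assumptions and are not taken from the slides.

#!/usr/bin/env python3
"""Sketch of launching a GROMACS 4.5.x MPI benchmark run; names below are assumptions."""
import subprocess

NODES = 16           # number of blades used for the run (assumption)
CORES_PER_NODE = 12  # two six-core Xeon sockets per M610 blade

cmd = [
    "mpirun", "-np", str(NODES * CORES_PER_NODE),
    "mdrun_mpi",                 # MPI-enabled GROMACS 4.5.4 mdrun binary (assumed name)
    "-s", "kinesin.tpr",         # pre-built run input for the Kinesin/MicroTubule dataset (assumed name)
    "-deffnm", "kinesin_bench",  # default prefix for all output files
]
subprocess.run(cmd, check=True)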

About Intel Cluster Ready

- Intel Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
  - Simplifies selection, deployment, and operation of a cluster
  - A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
  - Focus on your work productivity, spend less management time on the cluster
- Select Intel Cluster Ready
  - Where the cluster is delivered ready to run
  - Hardware and software are integrated and configured together
  - Applications are registered, validating execution on the Intel Cluster Ready architecture
  - Includes the Intel Cluster Checker tool, to verify functionality and periodically check cluster health

About Dell PowerEdge Servers

- System Structure and Sizing Guidelines
  - 38-node cluster built with Dell PowerEdge M610 blade servers
  - Servers optimized for High Performance Computing environments
  - Building block foundations for best price/performance and performance/watt
- Dell HPC Solutions
  - Scalable architectures for high performance and productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, tested and validated architectures
- Workload Modeling
  - Optimized system size, configuration and workloads
  - Test-bed benchmarks
  - ISV applications characterization
  - Best practices & usage analysis

GROMACS Performance - Interconnects

- InfiniBand enables higher cluster productivity
  - Increases performance by up to 444% over 1GigE
  - Performance benefits begin at 2 nodes
- 1GigE shows little gain in performance even as more systems are used

[Chart: GROMACS performance by node count, InfiniBand QDR versus 1GigE (gains of 86%, 202%, 406%, 444%); higher is better]
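The percentage gains quoted above are relative throughput ratios. A minimal sketch of the arithmetic, using placeholder throughput numbers (the measured values appear only in the original chart):

# How an interconnect gain such as "444% over 1GigE" is computed.
# The ns/day figures below are placeholders, NOT the measured values.
perf_1gige_ns_day = 1.0    # hypothetical GROMACS throughput over 1GigE
perf_ib_qdr_ns_day = 5.44  # hypothetical throughput over InfiniBand QDR

gain_pct = (perf_ib_qdr_ns_day / perf_1gige_ns_day - 1.0) * 100.0
print(f"InfiniBand QDR gain over 1GigE: {gain_pct:.0f}%")  # -> 444%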

GROMACS Performance - MPI Implementations

- All MPI libraries perform similarly
  - Reflects that each MPI implementation handles the GROMACS data transfers efficiently

[Chart: GROMACS performance with Intel MPI, Open MPI and Platform MPI over InfiniBand QDR; higher is better]

GROMACS Performance - Turbo Mode

- Setting Turbo Mode allows the best performance
  - Around a 2% advantage over the static Maximum Performance setting
  - The OS controls CPU frequencies through ACPI power management

[Chart: GROMACS performance with Turbo Mode versus static Maximum Performance BIOS setting, InfiniBand QDR; higher is better]
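A minimal sketch for checking whether the OS is driving the CPU frequencies (as it does when OS-controlled ACPI power management is enabled in the BIOS). It assumes the Linux cpufreq sysfs interface is exposed; paths and availability vary by kernel and driver, so treat this as illustrative rather than as the method used in the study.

# Check the cpufreq governor and current/max frequency of CPU 0.
from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")
if cpu0.exists():
    governor = (cpu0 / "scaling_governor").read_text().strip()
    cur_khz = int((cpu0 / "scaling_cur_freq").read_text())
    max_khz = int((cpu0 / "scaling_max_freq").read_text())
    print(f"governor={governor}, current={cur_khz/1e6:.2f} GHz, max={max_khz/1e6:.2f} GHz")
else:
    print("cpufreq not exposed: frequencies are likely fixed by the BIOS")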

GROMACS Performance - MPI Tuning

- Tuning MPI provides a marginally better speedup
  - Seen a speedup gain of 3% to 5%
- Tuning parameters used (applied as shown in the sketch below):
  - I_MPI_SPIN_COUNT=1
  - I_MPI_RDMA_TRANSLATION_CACHE=1
  - I_MPI_RDMA_RNDV_BUF_ALIGN=65536
  - I_MPI_SPIN_COUNT=121
  - I_MPI_DAPL_DIRECT_COPY_THRESHOLD=65536

[Chart: GROMACS performance with default versus tuned Intel MPI parameters (3% to 5% gain), InfiniBand QDR; higher is better]
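A minimal sketch of applying these Intel MPI environment variables before launching the job. Only the I_MPI_* settings come from the slide; the rank count, binary and input names are illustrative assumptions.

import os
import subprocess

# Export the tuning parameters listed above for the launched MPI job.
tuned_env = dict(os.environ,
                 I_MPI_SPIN_COUNT="121",   # one of the two spin-count values tested
                 I_MPI_RDMA_TRANSLATION_CACHE="1",
                 I_MPI_RDMA_RNDV_BUF_ALIGN="65536",
                 I_MPI_DAPL_DIRECT_COPY_THRESHOLD="65536")

subprocess.run(["mpirun", "-np", "192",
                "mdrun_mpi", "-s", "kinesin.tpr", "-deffnm", "kinesin_tuned"],
               env=tuned_env, check=True)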

GROMACS Performance - CPU Frequencies

- MPI communication percentage increases as the cluster scales
- Seen a 5% gain from using the faster CPU at 6 nodes, comparing the X5670 versus the X5675
  - Corresponds to the difference in CPU clock speeds (2.93GHz versus 3.06GHz)

[Chart: GROMACS performance with Intel X5670 (2.93GHz) versus X5675 (3.06GHz), InfiniBand QDR; higher is better]
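A quick arithmetic check that the observed ~5% gain roughly tracks the clock-speed difference between the two CPUs:

x5670_ghz = 2.93
x5675_ghz = 3.06

clock_gain_pct = (x5675_ghz / x5670_ghz - 1.0) * 100.0
print(f"Clock speed difference: {clock_gain_pct:.1f}%")  # ~4.4%, close to the ~5% measured gain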

GROMACS Profiling - MPI/User Time Ratio

- Computation time is dominant compared to MPI communication time
  - MPI time only accounts for around 30% at 16 nodes
- The MPI communication ratio increases as the cluster scales
- Means tuning for CPU or computation performance could yield better results

[Chart: MPI time versus user (computation) time ratio by node count, InfiniBand QDR]
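For illustration, a minimal mpi4py sketch of how an MPI/compute time ratio like this can be measured with wall-clock timers. This is a generic timing approach, not the profiling tool used in the original study, and the "work" kernel is a placeholder.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

compute_t = 0.0
mpi_t = 0.0
local = np.random.rand(1_000_000)

for _ in range(10):
    t0 = MPI.Wtime()
    local = np.sqrt(local) + 0.5                     # placeholder for the MD computation
    compute_t += MPI.Wtime() - t0

    t0 = MPI.Wtime()
    total = comm.allreduce(local.sum(), op=MPI.SUM)  # placeholder communication step
    mpi_t += MPI.Wtime() - t0

if rank == 0:
    print(f"MPI time fraction: {mpi_t / (mpi_t + compute_t):.1%}")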

GROMACS Profiling - MPI Calls

- MPI_Send and MPI_Sendrecv are the most used MPI calls
  - Each accounts for around 38% of the MPI function calls on a 16-node job
- GROMACS uses MPI functions for data transfers
  - A large volume of MPI calls is seen for transferring data
  - Suggests the application is network bandwidth-bound
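The MPI_Sendrecv-heavy call profile is characteristic of neighbor (halo) exchange in a domain-decomposed code. A minimal mpi4py illustration of that pattern follows; it mimics a ring exchange and is not GROMACS source code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

right = (rank + 1) % size
left = (rank - 1) % size

send_halo = np.full(4096, rank, dtype=np.float64)  # local boundary data to ship to a neighbor
recv_halo = np.empty_like(send_halo)

# Send boundary data to the right-hand neighbor while receiving from the left
comm.Sendrecv(send_halo, dest=right, sendtag=0,
              recvbuf=recv_halo, source=left, recvtag=0)

print(f"rank {rank} received halo data from rank {int(recv_halo[0])}")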

GROMACS Profiling - Time Spent by MPI Calls

- The majority of the MPI time is spent on MPI_Alltoall
  - MPI_Alltoall (31%), MPI_Recv (25%), MPI_Sendrecv (16%), MPI_Send (9%)
- MPI communication time drops gradually as the cluster scales
  - Due to the faster total runtime, as more CPUs work on completing the job faster
  - Reducing the communication time for each of the MPI calls
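For reference, a minimal mpi4py sketch of an MPI_Alltoall exchange like the one dominating MPI time here; in GROMACS this kind of collective pattern is commonly associated with the parallel FFT/PME data redistribution. This is an illustration, not GROMACS source code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

chunk = 1024  # elements sent to each peer rank
sendbuf = np.full(size * chunk, rank, dtype=np.float64)  # one slab per destination rank
recvbuf = np.empty_like(sendbuf)

comm.Alltoall(sendbuf, recvbuf)  # every rank exchanges one slab with every other rank

if rank == 0:
    print("rank 0 received slabs from ranks:", np.unique(recvbuf).astype(int))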

GROMACS Profiling - MPI Message Sizes

- Most of the MPI messages are medium sized
  - Most message sizes fall in the 4KB to 16KB and 16KB to 64KB ranges

GROMACS Profiling - MPI Data Transfer

- As the cluster grows, substantially less data is transferred between MPI processes
  - Data communication drops from 115GB for a single-node simulation to around 15GB for a 16-node simulation

GROMACS Profiling - Aggregated Transfer

- Aggregated data transfer refers to:
  - The total amount of data transferred over the network between all MPI ranks collectively
- Large data transfers take place in GROMACS
  - Around 750GB to 1.5TB of data is exchanged between the nodes

[Chart: Aggregated data transfer by node count, InfiniBand QDR]

GROMACS Summary

Performance
- InfiniBand allows GROMACS to run at the most efficient rate
  - InfiniBand enables the highest network throughput, allowing GROMACS to scale
  - Ethernet does not allow scaling and ends up wasting valuable system resources
- Tuning MPI parameters can provide some benefit, around 3-5%
- Tuning for CPU or computation can deliver higher performance
  - The MPI/user time ratio shows that significantly more computation than communication is taking place
  - Using a faster CPU clock speed yields 5% higher performance (2.93GHz vs 3.06GHz)
  - Enabling Turbo Mode provides an additional 2% over the static Maximum Performance setting
- Spreading the computational workload across more nodes gets the job done faster

Profiling
- MPI_Send and MPI_Sendrecv are the most used MPI functions
- MPI_Alltoall is the most dominant MPI function call in time spent

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.