GROMACS Performance Benchmark and Profiling. September 2012

Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: AMD, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- For more information please refer to:
  - http://www.amd.com
  - http://www.dell.com/hpc
  - http://www.mellanox.com
  - http://www.gromacs.org

GROMACS
- GROMACS (GROningen MAchine for Chemical Simulation)
  - A molecular dynamics simulation package
  - Primarily designed for biochemical molecules like proteins, lipids and nucleic acids
- Many algorithmic optimizations have been introduced in the code
  - Extremely fast at calculating the nonbonded interactions
- Ongoing development to extend GROMACS with interfaces to both quantum chemistry and bioinformatics/databases
- Open-source software released under the GPL

Objectives
- The following was done to provide best practices
  - GROMACS performance benchmarking
  - Understanding GROMACS communication patterns
  - Ways to increase GROMACS productivity
  - Compiler and network interconnect comparisons
- The presented results will demonstrate
  - The scalability of the compute environment
  - The capability of GROMACS to achieve scalable productivity
  - Considerations for performance optimizations

Test Cluster Configuration
- Dell PowerEdge C6145 6-node (384-core) cluster
  - 4 CPU sockets per server node: AMD Opteron 6276 (code name "Interlagos") 16-core @ 2.3 GHz CPUs
  - Memory: 128GB per node, DDR3 1600MHz; BIOS version 2.6.0
- Mellanox ConnectX-3 VPI adapters and IS5030 36-port InfiniBand switch
  - MLNX-OFED 1.5.3 InfiniBand SW stack
- OS: RHEL 6 Update 2, SLES 11 SP2
- MPI: Intel MPI 4 Update 3, Open MPI 1.5.5, Platform MPI 8.2.1
- Compilers: GNU 4.7
- Application: GROMACS 4.5.5
- Benchmark workload: DPPC in Water (d.dppc), 5000 steps, 10.0 ps
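
The exact launch commands are not shown in the deck; the sketch below is a hypothetical Python driver illustrating how such a parameter scan (node count, processes per node) could be run with Open MPI's mpirun and an MPI-enabled GROMACS 4.5 mdrun. The binary name, host file path and input file name (topol.tpr) are assumptions, not taken from the slides.

    # Hypothetical driver for the d.dppc benchmark scan described above.
    # Assumes: Open MPI's mpirun and an MPI-enabled GROMACS 4.5 "mdrun" are on PATH,
    # and the benchmark input has already been preprocessed into "topol.tpr".
    import subprocess

    def run_case(nodes, ppn, hostfile="hosts.txt", tpr="topol.tpr"):
        """Launch one benchmark point: `nodes` servers with `ppn` MPI ranks each."""
        nprocs = nodes * ppn
        cmd = [
            "mpirun",
            "-np", str(nprocs),           # total number of MPI ranks
            "--npernode", str(ppn),       # ranks per node (Open MPI option)
            "--hostfile", hostfile,       # list of cluster nodes (assumed file)
            "mdrun",
            "-s", tpr,                    # run input file
            "-deffnm", "bench_%dn_%dppn" % (nodes, ppn),  # output file prefix
        ]
        print("Running:", " ".join(cmd))
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        # Scan the node counts and processes-per-node settings used in the study
        for nodes in (1, 2, 3, 4, 5, 6):
            for ppn in (32, 64):
                run_case(nodes, ppn)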

Dell PowerEdge C6145 6-node cluster
- HPC Advisory Council Test-bed System
- New 6-node, 384-core cluster featuring Dell PowerEdge C6145 servers
  - Replacement for the Dell PowerEdge SC1435 (192-core) cluster system following 2 years of rigorous benchmarking and product EOL
  - System to be redirected to explore HPC-in-the-Cloud applications
- Workload profiling and benchmarking
  - Characterization for HPC and compute-intensive environments
  - Optimization for scale, sizing, configuration and workload performance
  - Test-bed benchmarks: RFPs, customers/prospects, etc.
- ISV & industry-standard application characterization
- Best practices & usage analysis

About Dell PowerEdge Platform Advantages
- Best-of-breed technologies and partners
  - The combination of the AMD Opteron 6200 series platform and Mellanox ConnectX-3 InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
  - The Dell PowerEdge C6145 system delivers 8-socket performance in a dense 2U form factor
  - Up to 64 cores / 32 DIMMs per server; 2688 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy-optimized low-flow fans, improved power supplies and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases

GROMACS Performance - Interconnect
- InfiniBand QDR delivers the best performance for GROMACS
  - Up to 152% better performance than 10GbE at 6 nodes
  - Up to 59% better performance than 1GbE at 2 nodes
- Scalability limitation seen with Ethernet networks
  - 10GigE performance starts to drop after 3 nodes
  - 1GigE performance drop takes place after 2 nodes
(Chart: Higher is better; 64 Cores/Node)

GROMACS Performance - MPI
- Intel MPI delivers better scalability for GROMACS
  - 12% higher performance than Open MPI at 6 nodes
  - 4% higher performance than Platform MPI at 6 nodes
(Chart: Higher is better; 64 Cores/Node)

GROMACS Performance - Processes Per Node
- Allocating more processes per node can yield higher system utilization
  - 59% performance gain with 4P servers versus 2P servers when comparing at 6 nodes
- Using 64 PPN delivers higher performance than 32 PPN using 1 active core
  - 26% performance gain with 64 PPN versus 32 PPN (with 1 active core) at 6 nodes
- GROMACS can fully utilize all CPU cores available in a system
(Chart: Higher is better; InfiniBand QDR)

GROMACS Performance - Operating Systems
- Both SLES 11 SP2 and RHEL 6.2 deliver the same level of performance
  - SLES performs slightly better on a single node, while RHEL performs better at scale
(Chart: Higher is better; InfiniBand QDR)

GROMACS Profiling - MPI/User Time Ratio
- InfiniBand QDR reduces the amount of time spent in MPI communications
- MPI communication time stays flat while the compute time decreases
(Chart: 64 Cores/Node)
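
The deck does not name the profiling tool used to obtain this ratio; purely as an illustration of the metric, the sketch below (assuming mpi4py and NumPy are installed) times communication calls separately from the rest of a toy loop and reports the MPI time fraction. The message size and "compute" phase are placeholders, not representative of GROMACS itself.

    # Minimal sketch of estimating an MPI vs. compute time ratio per rank by
    # timing communication calls with MPI.Wtime(). Not the tool used in the study.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    mpi_time = 0.0
    total_start = MPI.Wtime()

    sendbuf = np.ones(1024, dtype='d')      # placeholder 8 KB message
    recvbuf = np.empty_like(sendbuf)

    for step in range(100):
        # "Compute" phase: stand-in for the force/integration work of an MD step
        local = np.sin(sendbuf).sum()

        # Communication phase: ring exchange with neighboring ranks, timed separately
        t0 = MPI.Wtime()
        comm.Sendrecv(sendbuf, dest=(rank + 1) % size,
                      recvbuf=recvbuf, source=(rank - 1) % size)
        mpi_time += MPI.Wtime() - t0

    total_time = MPI.Wtime() - total_start
    if rank == 0:
        print("MPI time fraction: %.1f%%" % (100.0 * mpi_time / total_time))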

GROMACS Profiling - Number of MPI Calls
- The most used MPI functions are for data transfers
  - MPI_Sendrecv (35%), MPI_Isend (23%), MPI_Irecv (23%), MPI_Waitall (14%)
  - Reflects that GROMACS requires good network throughput
- The number of calls increases proportionally as the cluster scales
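
GROMACS implements this exchange in its own C code; purely to illustrate why these particular calls dominate, the mpi4py sketch below mimics a halo exchange in which each rank posts nonblocking receives and sends to its two ring neighbors and then completes them with Waitall. The ring topology and the roughly 1 KB message size are assumptions chosen to match the reported message-size profile, not code from the application.

    # Illustrative halo-exchange pattern (mpi4py) mirroring the reported call mix:
    # nonblocking Isend/Irecv to neighboring ranks completed with Waitall.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    left, right = (rank - 1) % size, (rank + 1) % size
    send_l = np.full(128, rank, dtype='d')   # ~1 KB of halo data for the left neighbor
    send_r = np.full(128, rank, dtype='d')   # ~1 KB of halo data for the right neighbor
    recv_l = np.empty(128, dtype='d')
    recv_r = np.empty(128, dtype='d')

    for step in range(10):
        reqs = [
            comm.Irecv(recv_l, source=left),   # post receives first
            comm.Irecv(recv_r, source=right),
            comm.Isend(send_l, dest=left),     # then the matching sends
            comm.Isend(send_r, dest=right),
        ]
        MPI.Request.Waitall(reqs)              # complete the whole exchange

    if rank == 0:
        print("halo exchange completed on", size, "ranks")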

GROMACS Profiling - Time Spent in MPI Calls
- The communication time is spent mostly in the following MPI functions:
  - MPI_Sendrecv (54%), MPI_Waitall (28%), MPI_Bcast (12%)

GROMACS Profiling - MPI Message Sizes
- The majority of MPI messages are of small to medium sizes
  - Mostly in the range between 257B and 1KB
  - All MPI messages are smaller than 256KB
- Low network latency is required for good small-message MPI performance

GROMACS Profiling - MPI Message Sizes
- A large concentration of MPI calls use small to medium message sizes
  - MPI_Irecv: in the range between 257B and 1KB
(Chart: 6 Nodes, 384 Processes)

GROMACS Profiling - Data Transfer By Process
- Data transferred to each MPI rank is generally constant across all MPI processes
- The amount of data transferred to each rank is reduced as more nodes participate in the job
  - From around 600MB per rank at 1 node down to around 300MB per rank at 6 nodes

GROMACS Profiling - Aggregated Data Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- The total data transfer increases steadily as the cluster scales
  - For this dataset, a good amount of data is sent and received across the network
  - As compute nodes are added, more data communication takes place
(Chart: InfiniBand QDR)
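
The aggregated number in the chart comes from the profiler; as a minimal sketch of what the metric means, the code below (mpi4py assumed) has each rank tally the bytes it sends and then sums the tallies onto rank 0. The single Send/Recv pair is a stand-in for an application's real traffic.

    # Sketch of deriving an "aggregated data transfer" figure: each rank counts
    # the bytes it sends, and the counts are summed across all ranks on rank 0.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    bytes_sent = 0
    payload = np.zeros(256, dtype='d')       # 2 KB placeholder message

    if size > 1:
        if rank == 0:
            comm.Send(payload, dest=1)
            bytes_sent += payload.nbytes     # manual tally of outgoing bytes
        elif rank == 1:
            comm.Recv(payload, source=0)

    total = comm.reduce(bytes_sent, op=MPI.SUM, root=0)
    if rank == 0:
        print("Aggregated data sent by all ranks: %d bytes" % total)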

GROMACS Profiling - Point-to-Point Data Flow
- The point-to-point data flow shows the communication pattern of GROMACS
  - GROMACS communicates mainly with its neighboring and nearby ranks
  - The pattern stays the same as the cluster scales
(Charts: 1 Node, 64 Processes; 6 Nodes, 384 Processes; InfiniBand QDR)

Summary
- GROMACS is a memory- and network-latency-sensitive application
- CPU:
  - Using 4P systems delivers 59% higher performance than 2P systems (at 6 nodes)
  - Using 64 PPN delivers 26% higher performance than 32 PPN using 1 active core
- Interconnects: InfiniBand QDR can deliver good scalability for GROMACS
  - Provides up to 142% better performance than 10GbE on 6 nodes
  - Provides up to 52% better performance than 1GbE on 2 nodes
  - 10GigE and 1GigE do not scale and become inefficient to run beyond 2-3 nodes
- MPI: Intel MPI achieves higher scalability than Open MPI and Platform MPI for GROMACS
- OS: Both SLES 11 SP2 and RHEL 6 Update 2 provide a similar level of performance

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council / Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein.