AMBER 11 Performance Benchmark and Profiling. July 2011


Note
The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
For more information please refer to http://www.amd.com, http://www.dell.com/hpc, http://www.mellanox.com, http://ambermd.org/

AMBER Application
- AMBER: software for analyzing large-scale molecular dynamics (MD) simulation trajectory data
- Reads either CHARMM- or AMBER-style topology/trajectory files as input, and its analysis routines can scale up to thousands of compute cores or hundreds of GPU nodes with either parallel or UNIX file I/O
- AMBER has dynamic memory management, and each code execution can perform a variety of structural, energetic, and file-manipulation operations on a single MD trajectory at once
- The code is written in a combination of Fortran90 and C, and its GPU kernels are written with NVIDIA's CUDA API to achieve maximum GPU performance

Objectives
The following was done to provide best practices:
- AMBER performance benchmarking
- Understanding AMBER communication patterns
- Ways to increase AMBER productivity
- Comparisons of compilers and MPI libraries
The presented results will demonstrate:
- The scalability of the compute environment
- The capability of AMBER to achieve scalable productivity
- Considerations for performance optimizations

Test Cluster Configuration
- Dell PowerEdge R815 11-node (528-core) cluster
- AMD Opteron 6174 series processors (codenamed "Magny-Cours"): 12 cores @ 2.2 GHz, 4 CPU sockets per server node
- Mellanox ConnectX-2 VPI adapters for 40Gb/s QDR InfiniBand and 10Gb/s Ethernet
- Mellanox MTS3600Q 36-port 40Gb/s QDR InfiniBand switch
- Memory: 128GB DDR3 1333MHz per node
- OS: RHEL 5.5, MLNX-OFED 1.5.2 InfiniBand SW stack
- MPI: Open MPI 1.5.3 with KNEM 0.9.6, Platform MPI 8.1.1
- Compilers: PGI 10.9, GNU Compilers 4.1.2
- Application: AMBER 11 (PMEMD), AmberTools 1.5
- Benchmark workload: Cellulose_production_NVE_256_128_128 (408,609 atoms)

About Dell PowerEdge R815 11-node Cluster
HPC Advisory Council test-bed system:
- New 11-node, 528-core cluster featuring Dell PowerEdge R815 servers
- Replacement for the Dell PowerEdge SC1435 (192-core) cluster system following 2 years of rigorous benchmarking and product EOL; that system will be redirected to explore HPC-in-the-Cloud applications
Workload profiling and benchmarking:
- Characterization for HPC and compute-intensive environments
- Optimization for scale, sizing, configuration, and workload performance
- Test-bed benchmarks for RFPs, customers/prospects, etc.
- ISV and industry-standard application characterization
- Best practices and usage analysis

About Dell PowerEdge Platform Advantages
Best-of-breed technologies and partners:
- The combination of the AMD Opteron 6000 Series platform and Mellanox ConnectX InfiniBand on Dell HPC solutions provides the ultimate platform for speed and scale
- The Dell PowerEdge R815 system delivers 4-socket performance in a dense 2U form factor: up to 48 cores/32 DIMMs per server, 1,008 cores in a 42U enclosure
Integrated stacks designed to deliver the best price/performance/watt:
- 2x more memory and processing power in half the space
- Energy-optimized low-flow fans, improved power supplies, and dual SD modules
Optimized for long-term capital and operating investment protection:
- System expansion
- Component upgrades and feature releases

AMBER Performance - MPI
- Platform MPI enables better scalability than Open MPI: up to 86% better performance seen with Platform MPI
- Tuned Platform MPI provides better performance over InfiniBand: a speed improvement of up to 28% seen at 8 nodes
- Tuning parameters used: -cpu_bind -e MPI_RDMA_MSGSIZE=65536,65536,4194304 -e MPI_RDMA_NSRQRECV=2048 -e MPI_RDMA_NFRAGMENT=128 (see the launch sketch below)
(Chart: higher is better; 48 cores per node)
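For reference, a minimal launch sketch that applies the tuning options above to an 8-node run (8 nodes x 48 cores/node = 384 ranks). The hostfile and the PMEMD input/output file names are illustrative assumptions rather than part of the published setup:

  # hypothetical Platform MPI launch of the PMEMD benchmark with the tuned RDMA settings
  mpirun -hostfile ./hosts -np 384 -cpu_bind \
    -e MPI_RDMA_MSGSIZE=65536,65536,4194304 \
    -e MPI_RDMA_NSRQRECV=2048 \
    -e MPI_RDMA_NFRAGMENT=128 \
    pmemd.MPI -O -i mdin -p prmtop -c inpcrd -o mdout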

AMBER Performance - Compilers
- The executable generated by the GNU compilers runs slightly faster: up to 3% faster than with PGI
- Default optimization and linker flags were used: PGI: -fast -O3 -fastsse; GNU: -O3 (a build sketch follows below)
(Chart: higher is better; Platform MPI; 48 cores per node)
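As an illustration of where these flags enter the build, a sketch of compiling AMBER 11 in parallel with each compiler family; the configure keywords and make target are assumptions about the AMBER 11/AmberTools 1.5 build procedure, not steps documented in this report:

  # hypothetical GNU build (default optimization flag: -O3)
  cd $AMBERHOME/AmberTools/src && ./configure -mpi gnu
  cd $AMBERHOME/src && make parallel

  # hypothetical PGI build (default flags: -fast -O3 -fastsse)
  cd $AMBERHOME/AmberTools/src && ./configure -mpi pgi
  cd $AMBERHOME/src && make parallel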

AMBER Performance - CPU Frequency
- Higher CPU core frequency enables higher job performance: up to 11-14% better job performance at 2200MHz versus 1800MHz
- A roughly 22% higher clock rate yields only 11-14% more performance, showing that performance is only partly governed by CPU frequency; the remainder reflects the ratio of computation to MPI communication
(Chart: higher is better; 48 cores per node)

AMBER Profiling - MPI/User Time Ratio
- Large data communications are seen, indicating heavy communication between parallel tasks
- More time is spent in communication than in computation between 4 and 8 nodes
- MPI time stays roughly constant while CPU time decreases as more nodes are added to the cluster
- Demonstrates the importance of the interconnect in handling the additional network throughput
(Chart: InfiniBand QDR; 48 cores per node)

AMBER Profiling - Number of MPI Calls
- The most used MPI functions are MPI_Isend, MPI_Irecv, and MPI_Waitany
- These non-blocking MPI calls allow computation to proceed while communication takes place
- Each of these accounts for roughly a third of the MPI calls in an 8-node job
- The number of calls grows rapidly as the cluster scales

AMBER Profiling - Time Spent in MPI Calls
Communication time is spent mainly in the following MPI functions:
- MPI_Waitany (36%)
- MPI_Allreduce (25%)
- MPI_Allgather (22%)
- MPI_Bcast (6%)

AMBER Profiling - MPI Message Sizes
- The majority of MPI messages are of small and medium sizes: in the ranges below 64 bytes and between 4KB and 16KB

AMBER Profiling - Data Transfer by Process
- Data transferred to each MPI rank is consistent across ranks for any number of processes
- Shows that a large amount of data is transferred overall
- The amount of data transferred to each rank decreases as more nodes are added to the job

AMBER Profiling - Aggregated Data Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- The total data transfer increases steadily as the cluster scales: a huge amount of data is sent and received across the network, and as compute nodes are added, more data communication takes place
(Chart: InfiniBand QDR)

Summary
- AMBER is a compute- and communication-intensive application with a high demand for CPU power and good network throughput
- Using InfiniBand enables good scalability by spreading the workload across compute nodes
- MPI: better scalability is seen with Platform MPI than with Open MPI
- CPU: higher job productivity is seen with higher CPU core frequency
- MPI communications: large data transfers take place between MPI processes

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.