OCTOPUS Performance Benchmark and Profiling June 2015

2 Note
- The following research was performed under the HPC Advisory Council activities
- Special thanks to HP and Mellanox
- For more information on the supporting vendors' solutions, please refer to: www.mellanox.com, http://www.hp.com/go/hpc
- For more information on the application: http://www.tddft.org/programs/octopus

3 Octopus
- Octopus is designed for density-functional theory (DFT) and time-dependent density-functional theory (TDDFT)
- Octopus is aimed at the simulation of the electron-ion dynamics of 1-, 2-, 3-, and 4-dimensional finite systems
- Octopus is one of the 22 applications selected for the PRACE application benchmark suite
- Octopus is freely available software (GPL)

4 Objectives
- The presented research was done to provide best practices for:
  - OCTOPUS performance benchmarking
  - Interconnect performance comparisons
  - MPI performance comparisons
  - Understanding OCTOPUS communication patterns
- The presented results demonstrate the scalability of the compute environment, providing nearly linear application scalability
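To make the scalability objective concrete, the sketch below shows how speedup and parallel efficiency are typically derived from wall-clock times at increasing node counts. The runtimes in the example are hypothetical placeholders, not measurements from this report.

```python
# Minimal sketch: quantifying "nearly linear scalability" from wall-clock
# times. The times below are placeholders, not results from this study.

def scaling_report(times_by_nodes, base_nodes=1):
    """Print speedup and parallel efficiency relative to the smallest run."""
    t_base = times_by_nodes[base_nodes]
    for nodes, t in sorted(times_by_nodes.items()):
        speedup = t_base / t
        ideal = nodes / base_nodes
        efficiency = speedup / ideal
        print(f"{nodes:2d} nodes: speedup {speedup:5.2f}x "
              f"(ideal {ideal:.0f}x), efficiency {efficiency:5.1%}")

if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds for 1/2/4/8 nodes.
    scaling_report({1: 1000.0, 2: 520.0, 4: 270.0, 8: 145.0})
```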

5 Test Cluster Configuration
- HP Apollo 6000 "Heimdall" cluster: 10-node HP ProLiant XL230a Gen9
- Processors: Dual-Socket 14-core Intel Xeon E5-2697 v3 @ 2.60 GHz (Turbo Mode off; Home Snoop as default)
- Memory: 64 GB per node, 2,133 MHz DDR4 DIMMs
- OS: RHEL 6 Update 6, OFED 2.4-1.0.0 InfiniBand SW stack
- Mellanox Connect-IB FDR InfiniBand adapters
- Mellanox ConnectX-3 Pro Ethernet adapters
- Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand and Ethernet VPI switch
- MPI: Intel MPI 5.0.2
- Compiler and Libraries: Intel Composer and MKL 2015.1.133, FFTW 3.3.4, GSL 1.16, libxc 2.0.3, PFFT 1.0.8 alpha
- Application: OCTOPUS 4.1.2
- Benchmark Workload: Magnesium (Ground State, maximum iteration set to 1)
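The report does not include the actual run script, but a ground-state run with the iteration cap described above could be launched along the following lines. The input file contents, the coordinates block, the binary name octopus_mpi, and the mpirun options shown here are illustrative assumptions, not taken from the study.

```python
# Illustrative sketch (not the study's actual script): prepare a minimal
# Octopus ground-state input with the SCF iteration cap set to 1 and launch
# it with Intel MPI across 8 nodes x 28 ranks per node (224 processes).
import subprocess

INP = """\
CalculationMode = gs
FromScratch = yes
MaximumIter = 1

%Coordinates
 'Mg' | 0.0 | 0.0 | 0.0
%
"""

with open("inp", "w") as f:        # Octopus reads its input from ./inp
    f.write(INP)

subprocess.run(
    ["mpirun", "-np", "224", "-ppn", "28", "octopus_mpi"],
    check=True,
)
```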

6 HP ProLiant XL230a Gen9 Server
- Processor: Two Intel Xeon E5-2600 v3 Series, 6/8/10/12/14/16 cores
- Chipset: Intel Xeon E5-2600 v3 series
- Memory: 512 GB (16 x 32 GB); 16 DIMM slots; DDR4 R-DIMM/LR-DIMM; 2,133 MHz
- Max Memory: 512 GB
- Internal Storage: 1 HP Dynamic Smart Array B140i SATA controller; HP H240 Host Bus Adapter
- Networking: Network module supporting various FlexibleLOMs: 1GbE, 10GbE, and/or InfiniBand
- Expansion Slots: 1 internal PCIe: 1 PCIe x16 Gen3, half-height
- Ports: Front: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V port, (2) PCIe; internal Micro SD card & Active Health
- Power Supplies: HP 2,400 W or 2,650 W Platinum hot-plug power supplies delivered by HP Apollo 6000 Power Shelf
- Integrated Management: HP iLO (Firmware: HP iLO 4); Option: HP Advanced Power Manager
- Additional Features: Shared power & cooling, up to 8 nodes per 4U chassis, single GPU support, Fusion I/O support
- Form Factor: 10 servers in 5U chassis

7 OCTOPUS Performance - Interconnects
- FDR InfiniBand is the most efficient network interconnect for OCTOPUS
- FDR IB outperforms 10GbE by 600% at 8 nodes (224 MPI processes)
- FDR IB outperforms 40GbE by 80% at 8 nodes (224 MPI processes)
(Chart: performance by node count and interconnect; higher is better)
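The "outperforms by X%" figures above are relative gains in the performance rating (e.g., jobs per day, or the reciprocal of runtime). The snippet below shows that arithmetic with made-up ratings chosen only so the ratios match the reported percentages; they are not the measured data.

```python
# Sketch of the arithmetic behind "FDR IB outperforms 10GbE by 600%":
# the gain is the ratio of performance ratings minus one.
# The ratings below are hypothetical placeholders, not measured values.

def percent_gain(perf_a, perf_b):
    """Percentage by which interconnect A outperforms interconnect B."""
    return (perf_a / perf_b - 1.0) * 100.0

fdr_ib, geth40, geth10 = 70.0, 38.9, 10.0   # hypothetical jobs/day at 8 nodes
print(f"FDR IB vs 10GbE: {percent_gain(fdr_ib, geth10):.0f}% faster")
print(f"FDR IB vs 40GbE: {percent_gain(fdr_ib, geth40):.0f}% faster")
```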

8 OCTOPUS Profiling - Interconnects
- FDR InfiniBand reduces the communication time at scale
- With FDR InfiniBand, communication consumes about 40% of total runtime at 8 nodes (224 processes)
- 10GbE spends 54% of total time in communication, while 40GbE spends 46%
(Chart: 28 processes per node)

9 OCTOPUS Profiling - Time Spent in MPI
- The most time-consuming MPI functions for OCTOPUS: MPI_Reduce (50%) and MPI_Bcast (40%) among all MPI calls
- Time spent on network bandwidth differentiates the interconnects: 10GbE/40GbE spend more time in MPI_Reduce/MPI_Bcast
- Demonstrates that InfiniBand performs better than Ethernet networks for MPI collective operations

10 OCTOPUS Profiling - Number of MPI Calls
- Collective communication dominates the MPI calls in OCTOPUS
- MPI_Allreduce accounts for 63% of calls; MPI_Irecv and MPI_Isend account for 18%, on 8 nodes (224 processes)

11 OCTOPUS Profiling - Time Spent in MPI
- More time is spent in MPI collective operations: MPI_Allreduce (50%), MPI_Bcast (40%)
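Since the profile attributes most MPI time to MPI_Allreduce and MPI_Bcast, a small collective microbenchmark is a natural way to compare interconnects directly. The mpi4py sketch below is an illustrative probe, not part of the original study; the message sizes and iteration count are arbitrary.

```python
# Sketch (not from the report): time the two collectives that dominate
# OCTOPUS MPI time, once with a small and once with a large buffer, to
# expose latency- vs bandwidth-bound behaviour of the interconnect.
# Launch with e.g.: mpirun -np 224 python collectives_probe.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def time_collective(label, buf, iters=100):
    recv = np.empty_like(buf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        comm.Allreduce(buf, recv, op=MPI.SUM)   # dominant by call count and time
        comm.Bcast(recv, root=0)                # second-largest time consumer
    elapsed = (MPI.Wtime() - t0) / iters
    if rank == 0:
        print(f"{label}: {elapsed * 1e6:.1f} us per Allreduce+Bcast pair")

time_collective("small (8 B)", np.zeros(1, dtype=np.float64))
time_collective("large (8 MB)", np.zeros(1 << 20, dtype=np.float64))
```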

12 OCTOPUS Summary
- InfiniBand FDR is the most efficient cluster interconnect for OCTOPUS
  - FDR IB outperforms 10GbE by 600% at 8 nodes (224 MPI processes)
  - FDR IB outperforms 40GbE by 80% at 8 nodes (224 MPI processes)
- FDR InfiniBand reduces the communication time at scale
  - With FDR InfiniBand, communication consumes about 40% of total runtime at 8 nodes (224 processes)
  - 10GbE spends 54% of total time in communication, while 40GbE spends 46%
- OCTOPUS MPI profiling
  - MPI collectives create significant communication overhead
  - Both large and small messages are used by OCTOPUS
  - Interconnect latency and bandwidth are both critical to OCTOPUS performance

13 Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.