SNAP Performance Benchmark and Profiling. April 2014


Note
- The following research was performed under the HPC Advisory Council activities.
- Participating vendors: HP, Mellanox
- For more information on the supporting vendors' solutions, please refer to: www.mellanox.com, http://www.hp.com/go/hpc
- For more information on the application: https://asc.llnl.gov/coral-benchmarks/#snap

SNAP
- SNAP stands for: SN (Discrete Ordinates) Application Proxy
- Serves as a proxy application to model the performance of a modern discrete-ordinates neutral-particle transport application
- May be considered an update to Sweep3D
- Is intended for hybrid computing architectures
- Mimics the computational workload, memory requirements, and communication patterns of PARTISN
- SNAP is modeled on the LANL code PARTISN
  - PARTISN solves the linear Boltzmann transport equation (TE), the governing equation for determining the number of neutral particles in a multidimensional phase space (a schematic form is sketched below)
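For context on the transport-equation reference above, the block below sketches the textbook time-dependent form of the linear Boltzmann transport equation for the angular flux; the SN (discrete ordinates) method approximates the angular integral over a finite set of directions. This is general background written in LaTeX, not an equation reproduced from the SNAP or PARTISN documentation, so the notation is illustrative only.

```latex
% Schematic linear Boltzmann transport equation for the angular flux
% \psi(\vec{r},\hat{\Omega},E,t); textbook form, for illustration only.
\frac{1}{v}\frac{\partial \psi}{\partial t}
  + \hat{\Omega}\cdot\nabla\psi
  + \sigma_t(\vec{r},E)\,\psi
  = \int_{4\pi}\! d\hat{\Omega}' \int_0^{\infty}\! dE'\;
      \sigma_s\!\left(\vec{r},\,E'\!\to\!E,\,\hat{\Omega}'\!\cdot\!\hat{\Omega}\right)
      \psi(\vec{r},\hat{\Omega}',E',t)
  + Q(\vec{r},\hat{\Omega},E,t)
```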

Objectives
- The presented research was done to provide best practices for:
  - SNAP performance benchmarking
  - Interconnect performance comparisons
  - MPI performance comparison
  - Understanding SNAP communication patterns
- The presented results will demonstrate the scalability of the compute environment to provide nearly linear application scalability

Test Cluster Configuration
- HP ProLiant SL230s Gen8 4-node "Athena" cluster
- Processors: dual-socket 10-core Intel Xeon E5-2680 v2 CPUs @ 2.8 GHz
- Memory: 32 GB per node, 1600 MHz DDR3 dual-ranked DIMMs
- OS: RHEL 6 Update 2, OFED 2.0-3.0.0 InfiniBand SW stack
- Mellanox Connect-IB FDR InfiniBand adapters
- Mellanox ConnectX-3 VPI Ethernet adapters
- Mellanox SwitchX SX6036 56 Gb/s FDR InfiniBand and Ethernet VPI switch
- MPI: Platform MPI 8.3, Open MPI 1.6.5
- Compiler: GNU compilers
- Application: SNAP 1.03
- Benchmark workload: input dataset with 512 cells per rank

About HP ProLiant SL230s Gen8 (HP ProLiant SL230s Gen8 Server)
- Processor: Two Intel Xeon E5-2600 v2 Series (4/6/8/10/12 cores)
- Chipset: Intel Xeon E5-2600 v2 product family
- Memory: 16 DIMM slots, DDR3 up to 1600 MHz, ECC (up to 256 GB)
- Max Memory: 256 GB
- Internal Storage: Two LFF non-hot-plug SAS/SATA bays, or four SFF non-hot-plug SAS/SATA/SSD bays; two hot-plug SFF drives (option)
- Max Internal Storage: 8 TB
- Networking: Dual-port 1GbE NIC / single 10GbE NIC
- I/O Slots: One PCIe Gen3 x16 LP slot; 1Gb and 10Gb Ethernet, IB, and FlexFabric options
- Ports: Front: (1) management, (2) 1GbE, (1) serial, (1) S.U.V. port, (2) PCIe; internal Micro SD card
- Power Supplies: 750 W, 1200 W (92% or 94% efficiency), high-power chassis
- Integrated Management: iLO 4 with Active Health; hardware-based power capping via SL Advanced Power Manager
- Additional Features: Shared power and cooling, up to 8 nodes per 4U chassis; single GPU support; Fusion-io support
- Form Factor: 16 processors / 8 GPUs per 4U chassis

SNAP Performance - Interconnect
- FDR InfiniBand provides the most efficient inter-node communication for SNAP:
  - Outperforms 10GbE by 109% at 80 MPI processes
  - Outperforms 1GbE by 625% at 80 MPI processes
- The performance benefit of InfiniBand is expected to grow at larger CPU core counts (see the worked ratios below)
(Chart: higher is better; 20 processes per node)
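Reading "outperforms by X%" as a ratio of the higher-is-better performance metric shown in the chart, the gaps work out as follows; this is only one interpretation of the rounded percentages, not additional measured data.

```latex
% "Outperforms by X%" expressed as a performance ratio (illustrative arithmetic)
\frac{P_{\mathrm{FDR\,IB}}}{P_{\mathrm{10GbE}}} \approx 1 + 1.09 = 2.09, \qquad
\frac{P_{\mathrm{FDR\,IB}}}{P_{\mathrm{1GbE}}} \approx 1 + 6.25 = 7.25
```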

SNAP Performance - MPI
- Platform MPI shows higher performance at small node counts
- The gap between Open MPI and Platform MPI closes at higher core counts
- Processor binding and the same compiler flags are used for both cases
(Chart: higher is better; 20 processes per node)

SNAP Profiling - MPI Time Ratio
- FDR InfiniBand reduces the communication time at scale (the implied compute-time share is worked out below):
  - FDR InfiniBand consumes about 58% of total runtime
  - 10GbE consumes about 79% of total runtime
  - 1GbE consumes about 94% of total runtime
(Chart: 20 processes per node)
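Treating the reported MPI-time percentages as fractions of total wall time, the share left for computation follows directly; the figures are approximate since the percentages above are rounded.

```latex
% Compute share = 1 - MPI share, using the rounded percentages above
\text{FDR IB: } 1 - 0.58 = 0.42, \qquad
\text{10GbE: } 1 - 0.79 = 0.21, \qquad
\text{1GbE: } 1 - 0.94 = 0.06
```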

SNAP Profiling - MPI Functions
- The most time-consuming MPI functions are MPI_Recv (61%) and MPI_Send (14%)
- Time spent on network bandwidth differentiates the interconnects:
  - 1GbE and 10GbE spend more time in MPI_Recv/MPI_Send
  - Demonstrates that InfiniBand performs better than the Ethernet networks

SNAP Profiling - MPI Message Sizes
- Messages for MPI_Recv are concentrated around 4 KB
  - Most of the blocking sends and receives happen at that message size; the remaining sizes are minor by comparison
- The MPI collective operations appear at smaller sizes, below 1 KB:
  - MPI_Bcast at ~4 B to 64 B
  - MPI_Allreduce at < 256 B
- The messaging behavior stays the same as the job scales
(A minimal sketch of this communication pattern follows below.)
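SNAP itself is a Fortran code, but the pattern the profile describes (blocking ~4 KB MPI_Send/MPI_Recv point-to-point traffic plus small MPI_Bcast/MPI_Allreduce collectives) can be illustrated with a minimal C/MPI sketch. The pipeline-style neighbor exchange below is a hypothetical stand-in for illustration only, not SNAP source code.

```c
/* Minimal, hypothetical C/MPI sketch of the communication pattern reported by
 * the profile: blocking ~4 KB point-to-point messages passed down a chain of
 * ranks, plus small collectives (MPI_Bcast at 64 B, MPI_Allreduce < 256 B).
 * This is NOT SNAP source code. */
#include <mpi.h>
#include <string.h>

#define MSG_BYTES 4096          /* point-to-point message size seen in the profile */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char buf[MSG_BYTES];
    memset(buf, 0, sizeof(buf));

    /* Pipeline-style exchange: each rank blocks in MPI_Recv until its upstream
     * neighbor's data arrives, then forwards it downstream, so much of the MPI
     * time accumulates in MPI_Recv. */
    if (rank > 0)
        MPI_Recv(buf, MSG_BYTES, MPI_BYTE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank < size - 1)
        MPI_Send(buf, MSG_BYTES, MPI_BYTE, rank + 1, 0, MPI_COMM_WORLD);

    /* Small collectives, matching the sub-1 KB sizes in the profile. */
    MPI_Bcast(buf, 64, MPI_BYTE, 0, MPI_COMM_WORLD);

    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with 20 ranks per node, such a sketch would mirror the test setup's process layout, though it says nothing about SNAP's actual sweep ordering.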

SNAP Profiling - MPI Functions
- The number of MPI calls is split roughly evenly between MPI_Recv (~50%) and MPI_Send (~50%)

SNAP Profiling - Network Transfers
- The application shows some imbalance in communication per rank:
  - Some ranks conduct more communication than others
- The level/size of messages stays the same as the node count increases
(Chart: 20 processes per node)

SNAP Profiling - Aggregated Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- Data transfer grows as more ranks are involved:
  - For this weak-scaling problem, the problem size grows together with the cluster size
  - As the problem size grows, the amount of data transferred grows with it (see the worked example below)
(Chart: 20 MPI processes)
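Assuming the 512-cells-per-rank input listed in the test configuration, weak scaling means the global problem size grows linearly with the rank count, which is why the aggregated transfer keeps rising; for example:

```latex
% Global cells = 512 cells/rank \times number of ranks (weak scaling)
20 \text{ ranks: } 512 \times 20 = 10{,}240 \text{ cells}, \qquad
80 \text{ ranks: } 512 \times 80 = 40{,}960 \text{ cells}
```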

SNAP Summary
- Performance of SNAP is directly affected by network throughput
- FDR InfiniBand delivers the most efficient network communication for SNAP:
  - Outperforms 10GbE by up to 109% at 4 nodes (80 MPI processes)
  - Outperforms 1GbE by up to 625% at 4 nodes (80 MPI processes)
- MPI profiling:
  - FDR InfiniBand reduces communication time, leaving more time for computation
  - FDR InfiniBand consumes 58% of total runtime, versus 79% for 10GbE and 94% for 1GbE
  - Point-to-point communication is the most time-consuming: blocking MPI_Recv accounts for 61% of MPI time
  - The most frequently called functions are MPI_Send and MPI_Recv, each accounting for ~50% of calls

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.