LS-DYNA Performance Benchmark and Profiling. April 2015


Slide 2: Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: Intel, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- The following was done to provide best practices:
  - LS-DYNA performance overview
  - Understanding LS-DYNA communication patterns
  - Ways to increase LS-DYNA productivity
  - MPI libraries comparisons
- For more information please refer to http://www.dell.com, http://www.intel.com, http://www.mellanox.com, http://www.lstc.com

Slide 3: LS-DYNA
- LS-DYNA
  - A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
  - Developed by the Livermore Software Technology Corporation (LSTC)
- LS-DYNA is used in the automobile, aerospace, construction, military, manufacturing, and bioengineering industries

Slide 4: Objectives
- The presented research was done to provide best practices:
  - LS-DYNA performance benchmarking
  - MPI library performance comparison
  - Interconnect performance comparison
  - CPU comparison
  - Optimization tuning
- The presented results will demonstrate:
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Slide 5: Test Cluster Configuration
- Dell PowerEdge R730 32-node (896-core) Thor cluster
  - Dual-socket 14-core Intel Xeon E5-2697 v3 @ 2.60 GHz CPUs (Power Management in BIOS set to Maximum Performance)
  - Memory: 64GB DDR4 2133 MHz per node (Memory Snoop Mode in BIOS set to Home Snoop)
  - OS: RHEL 6.5, MLNX_OFED_LINUX-2.4-1.0.5.1_20150408_1555 InfiniBand SW stack
  - Hard drives: 2x 1TB 7.2K RPM 2.5" SATA in RAID 1
- Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters
- Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand switch
- Mellanox ConnectX-3 FDR VPI InfiniBand and 40Gb/s Ethernet adapters
- Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet switch
- MPI: Open MPI 1.8.4, Mellanox HPC-X v1.2.0-326, Intel MPI 5.0.2.044, IBM Platform MPI 9.1
- Application: LS-DYNA 8.0.0 (builds 95359, 95610), single precision
- Benchmarks: 3 Vehicle Collision, Neon refined revised

Slide 6: PowerEdge R730 - Massive flexibility for data-intensive operations
- Performance and efficiency
  - Intelligent hardware-driven systems management with extensive power management features
  - Innovative tools including automation for parts replacement and lifecycle manageability
  - Broad choice of networking technologies from GigE to IB
  - Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
- Benefits
  - Designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
  - High-performance scale-out compute and low-cost dense storage in one package
- Hardware capabilities
  - Flexible compute platform with dense storage capacity
  - 2S/2U server, 6 PCIe slots
  - Large memory footprint (up to 768GB / 24 DIMMs)
  - High I/O performance and optional storage configurations
  - HDD options: 12 x 3.5" or 24 x 2.5" plus 2 x 2.5" HDDs in the rear of the server
  - Up to 26 HDDs, with 2 hot-plug drives in the rear of the server for boot or scratch

Slide 7: LS-DYNA Performance - Network Interconnects
- EDR InfiniBand delivers superior scalability in application performance
  - Provides 4-5 times higher performance than 1GbE, 10GbE and 40GbE
  - 1GbE stops scaling beyond 4 nodes, and 10GbE stops scaling beyond 8 nodes
  - InfiniBand demonstrates continuous performance gains at scale
(Chart annotations: 444%, 505%, 572%; higher is better; 28 MPI processes per node)

Slide 8: LS-DYNA Performance - EDR vs FDR InfiniBand
- EDR InfiniBand delivers superior scalability in application performance
  - As the cluster scales, the performance gap in favor of EDR IB widens
  - The performance advantage of EDR InfiniBand increases at larger core counts
  - EDR IB provides a 15% advantage over FDR IB at 32 nodes (896 cores)
(Chart: higher is better; 28 MPI processes per node)

Slide 9: LS-DYNA Performance - Cores Per Node
- Better performance is seen at scale with fewer CPU cores per node
  - At low node counts, higher performance can be achieved with more cores per node
  - At high node counts, slightly better performance is achieved by using fewer cores per node
  - Memory bandwidth may become the limiting factor when more CPU cores are used
(Chart annotations: 2%, 3%, 5%, 5%; higher is better; CPU @ 2.6GHz)
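
A minimal sketch of how a reduced cores-per-node run can be launched with Open MPI, keeping the total rank count fixed while placing fewer ranks on each node. The host file, rank counts, and executable/input names are hypothetical, not taken from the study.

    # Hypothetical launch using 24 of the 28 available cores per node across 32 nodes
    mpirun -np 768 -npernode 24 -hostfile hosts.txt \
        -bind-to-core \
        ./ls-dyna_mpp i=3cars/main.k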

Slide 10: LS-DYNA Performance - AVX2/SSE2 CPU Instructions
- LS-DYNA provides executables built with support for different CPU instruction sets
  - AVX2 is supported on Haswell, while SSE2 is supported on previous generations
  - Due to a runtime issue, AVX2 executable build 95610 is used instead of the public build 95359
- Slight improvement of ~2-4% by using the executable with AVX2 instructions
  - AVX2 instructions run at a lower clock speed (2.2GHz) than the normal CPU clock (2.6GHz)
(Chart annotation: 4%; higher is better; 24 MPI processes per node)
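
One way to select between the AVX2 and SSE2 builds at submission time is to test the CPU flags reported by the node, as sketched below. The executable names are hypothetical; the assumption is only that separate AVX2 and SSE2 binaries are installed, as described on this slide.

    # Use the AVX2 build only when the node actually reports AVX2 support
    if grep -qm1 avx2 /proc/cpuinfo; then
        LSDYNA_EXE=./ls-dyna_mpp_avx2    # hypothetical AVX2 binary (e.g. build 95610)
    else
        LSDYNA_EXE=./ls-dyna_mpp_sse2    # hypothetical SSE2 binary
    fi
    mpirun -np 768 -npernode 24 -hostfile hosts.txt "$LSDYNA_EXE" i=input.k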

Slide 11: LS-DYNA Performance - Turbo Mode
- Turbo Boost enables the processor to run above its base frequency
  - CPU cores can run dynamically above the base clock when thermal headroom allows
  - The 2.6GHz base clock can boost up to a maximum Turbo frequency of 3.3GHz
- Running with Turbo Boost translates to a ~25% performance boost
(Chart annotations: 25%, 40%; higher is better; 28 MPI processes per node)
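
A quick sketch of verifying Turbo behavior on a compute node, assuming the cpupower and turbostat utilities are available; on this cluster Turbo itself is toggled in the BIOS, and the wrapper script name is hypothetical.

    # Report whether the CPU advertises Turbo Boost support
    cpupower frequency-info | grep -i boost
    # Observe effective per-core frequencies while the job runs (usually needs root for MSR access)
    turbostat ./run_lsdyna.sh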

Slide 12: LS-DYNA Performance - Memory Optimization
- Setting environment variables for the memory allocator improves performance
  - Modifying the memory allocator behavior allows faster memory registration for communications
  - Environment variables used:
    export MALLOC_MMAP_MAX_=0
    export MALLOC_TRIM_THRESHOLD_=-1
(Chart annotations: 19%, 176%; higher is better; 28 MPI processes per node)
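
A minimal job-script sketch applying the two glibc malloc tunables above and propagating them to all ranks with Open MPI's -x option. The rank count, host file, and executable/input names are illustrative.

    # Keep freed memory inside the heap so registered communication buffers can be reused
    export MALLOC_MMAP_MAX_=0          # never satisfy allocations with mmap
    export MALLOC_TRIM_THRESHOLD_=-1   # never return heap memory to the OS
    mpirun -np 896 -npernode 28 -hostfile hosts.txt \
        -x MALLOC_MMAP_MAX_ -x MALLOC_TRIM_THRESHOLD_ \
        ./ls-dyna_mpp i=neon_refined_revised.k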

Slide 13: LS-DYNA Performance - MPI Optimization
- FCA and MXM enhance LS-DYNA performance at scale for HPC-X
  - HPC-X is based on the Open MPI distribution
  - The yalla PML, UD transport and memory optimizations in HPC-X reduce overhead
  - MXM provides a speedup of 38% over the un-tuned baseline run at 32 nodes (768 cores)
- MCA parameters used for enabling MXM:
  -mca btl_sm_use_knem 1 -mca pml yalla -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 -x MXM_RDMA_PORTS=mlx5_0:1
(Chart annotation: 38%; higher is better; 24 MPI processes per node)
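
Assembled into a single launch line, the MCA parameters above might be used as in the sketch below. Only the flags shown on the slide come from the study; the host file, rank count, and executable/input names are illustrative.

    # HPC-X (Open MPI based) launch with the yalla PML, MXM UD transport and KNEM
    mpirun -np 768 -npernode 24 -hostfile hosts.txt \
        -mca pml yalla -mca btl_sm_use_knem 1 \
        -x MXM_TLS=ud,shm,self \
        -x MXM_SHM_RNDV_THRESH=32768 \
        -x MXM_RDMA_PORTS=mlx5_0:1 \
        ./ls-dyna_mpp i=3cars/main.k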

Slide 14: LS-DYNA Performance - Intel MPI Optimization
- The DAPL provider performs better than the OFA provider for Intel MPI
  - DAPL provides better scalability for Intel MPI on LS-DYNA
- Tuning parameters used:
  - Common to both tests: I_MPI_DAPL_SCALABLE_PROGRESS 1, I_MPI_RDMA_TRANSLATION_CACHE 1, I_MPI_FAIR_CONN_SPIN_COUNT 2147483647, I_MPI_FAIR_READ_SPIN_COUNT 2147483647, I_MPI_ADJUST_REDUCE 2, I_MPI_ADJUST_BCAST 0, I_MPI_RDMA_RNDV_BUF_ALIGN 65536, I_MPI_SPIN_COUNT 121
  - For OFA: -IB, MV2_USE_APM 0, I_MPI_OFA_USE_XRC 1
  - For DAPL: -DAPL, I_MPI_DAPL_DIRECT_COPY_THRESHOLD 65536, I_MPI_DAPL_UD enable, I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u
(Chart annotation: 20%; higher is better; 24 MPI processes per node)
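
The DAPL settings above can be expressed as plain environment variables; the sketch below shows one possible launch, using I_MPI_FABRICS in place of the slide's -DAPL shortcut. The host file, rank count, and executable name are hypothetical.

    # Intel MPI run over the DAPL UD provider on the ConnectX-4 (mlx5_0) HCA
    export I_MPI_FABRICS=shm:dapl
    export I_MPI_DAPL_UD=enable
    export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
    export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=65536
    export I_MPI_ADJUST_REDUCE=2
    export I_MPI_ADJUST_BCAST=0
    mpirun -np 768 -ppn 24 -hostfile hosts.txt ./ls-dyna_mpp i=3cars/main.k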

Slide 15: LS-DYNA Performance - MPI Libraries
- HPC-X outperforms Platform MPI and Open MPI in scalability performance
  - HPC-X delivers higher performance than Intel MPI (OFA) by 33%, Intel MPI (DAPL) by 11%, and Platform MPI by 27% on neon_refined_revised
  - Performance is 20% higher than Intel MPI (OFA) and 8% better than Platform MPI on 3cars
- Tuning parameters used:
  - For Open MPI: -bind-to-core and KNEM
  - For Platform MPI: -cpu_bind, -xrc
  - For Intel MPI: see previous slide
(Chart annotations: 7%, 11%, 20%, 27%, 33%; higher is better; 24 MPI processes per node)
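
For reference, the binding-related flags named above could be applied as in the sketches below; only those flags come from the study, while the host file, rank counts, and executable names are illustrative.

    # Open MPI / HPC-X: bind each rank to a core and use KNEM for shared-memory copies
    mpirun -np 768 -npernode 24 -hostfile hosts.txt \
        -bind-to-core -mca btl_sm_use_knem 1 ./ls-dyna_mpp i=3cars/main.k

    # IBM Platform MPI: CPU binding and XRC transport
    mpirun -np 768 -hostfile hosts.txt -cpu_bind -xrc ./ls-dyna_mpp i=3cars/main.k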

Slide 16: LS-DYNA Performance - System Generations
- The current Haswell system configuration outperforms prior system generations
  - Outperforms Ivy Bridge by 47%, Sandy Bridge by 75%, Westmere by 148%, and Nehalem by 290%
  - Scalability support from EDR InfiniBand and HPC-X provides a large boost in performance at scale for LS-DYNA
- System components used:
  - Haswell: 2-socket 14-core E5-2697 v3 @ 2.6GHz, 2133MHz DIMMs, ConnectX-4 EDR InfiniBand
  - Ivy Bridge: 2-socket 10-core E5-2680 v2 @ 2.8GHz, 1600MHz DIMMs, Connect-IB FDR InfiniBand
  - Sandy Bridge: 2-socket 8-core E5-2680 @ 2.7GHz, 1600MHz DIMMs, ConnectX-3 FDR InfiniBand
  - Westmere: 2-socket 6-core X5670 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR InfiniBand
  - Nehalem: 2-socket 4-core X5570 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR InfiniBand
(Chart annotations: 47%, 75%, 148%, 290%; higher is better)

Slide 17: LS-DYNA Profiling - Time Spent in MPI
- Most of the MPI messages are in the small to medium size range
  - Most message sizes are between 0 and 64B
- For the most time-consuming MPI calls:
  - MPI_Recv: most messages are under 4KB
  - MPI_Bcast: the majority are less than 16B, but larger messages exist
  - MPI_Allreduce: most messages are less than 256B

Slide 18: LS-DYNA Profiling - Time Spent in MPI
- The majority of the MPI time is spent in MPI_Recv and MPI collective operations
  - MPI_Recv (36%), MPI_Allreduce (27%), MPI_Bcast (24%)
- Similar communication characteristics are seen on both input datasets
  - Both exhibit similar communication patterns
(Charts: Neon_refined_revised at 32 nodes; 3 Vehicle Collision at 32 nodes)
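
A profile of this kind can be collected with any PMPI-based tool. The sketch below uses mpiP via LD_PRELOAD as one option; the original study may have used a different profiler, and the install path and launch details are hypothetical.

    # Collect per-call MPI time and message-size statistics with mpiP
    export LD_PRELOAD=/opt/mpiP/lib/libmpiP.so   # hypothetical install path
    mpirun -np 896 -npernode 28 -hostfile hosts.txt ./ls-dyna_mpp i=3cars/main.k
    # The resulting *.mpiP report lists time per MPI call (e.g. MPI_Recv, MPI_Allreduce,
    # MPI_Bcast) and message-size histograms comparable to the charts on these slides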

Slide 19: LS-DYNA Summary
- Performance
  - Compute: the Intel Haswell cluster outperforms previous system generations
    - Outperforms Ivy Bridge by 47%, Sandy Bridge by 75%, Westmere by 148%, and Nehalem by 290%
  - Using the executable with AVX2 instructions provides a slight advantage
    - Slight improvement of ~2-4% by using the executable with AVX2 instructions
  - Turbo Mode: running with Turbo Boost provides a ~25% performance boost in some cases
    - Turbo Boost enables the processor to run above its base frequency
  - Network: EDR InfiniBand and the HPC-X MPI library deliver superior scalability in application performance
    - EDR IB provides 4-5 times higher performance than 1GbE, 10GbE and 40GbE, and 15% higher than FDR IB at 32 nodes
- MPI tuning
  - HPC-X enhances LS-DYNA performance at scale
    - MXM UD provides a speedup of 38% over the un-tuned baseline run at 32 nodes
  - HPC-X outperforms Platform MPI and Open MPI in scalability performance
    - Up to 27% better than Platform MPI on neon_refined_revised, and 8% better than Platform MPI on 3cars

Slide 20: Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.