Altair OptiStruct 13.0 Performance Benchmark and Profiling. May 2015


Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: Intel, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- The following was done to provide best practices:
  - OptiStruct performance overview
  - Understanding OptiStruct communication patterns
  - Ways to increase OptiStruct productivity
  - MPI libraries comparisons
- For more information please refer to http://www.altair.com, http://www.dell.com, http://www.intel.com, http://www.mellanox.com

Objectives
- The following was done to provide best practices:
  - OptiStruct performance benchmarking
  - Interconnect performance comparisons
  - MPI performance comparison
  - Understanding OptiStruct communication patterns
- The presented results demonstrate:
  - The scalability of the compute environment to provide nearly linear application scalability
  - The capability of OptiStruct to achieve scalable productivity

OptiStruct by Altair
- OptiStruct is an industry-proven, modern structural analysis solver
  - Solves linear and non-linear structural problems under static and dynamic loadings
  - Market-leading solution for structural design and optimization
- Helps designers and engineers analyze and optimize structures
  - Optimizes for strength, durability and NVH (Noise, Vibration, Harshness) characteristics
  - Helps rapidly develop innovative, lightweight and structurally efficient designs
- Based on finite-element and multi-body dynamics technology

Test Cluster Configuration
- Dell PowerEdge R730 32-node (896-core) "Thor" cluster
  - Dual-socket 14-core Intel Xeon E5-2697 v3 CPUs @ 2.60 GHz (Turbo on, Early Snoop, Max Performance set in BIOS)
  - OS: RHEL 6.5, MLNX_OFED_LINUX 2.4-1.0.5 InfiniBand SW stack
  - Memory: 64GB per node, DDR4 2133 MHz
  - Hard drives: 1TB 7.2K RPM 2.5" SATA
- Mellanox Switch-IB SB7700 100Gb/s EDR InfiniBand switch
- Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
- Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
- Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
- MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
- Application: Altair OptiStruct 13.0
- Benchmark dataset: Engine Assembly

PowerEdge R730
Massive flexibility for data-intensive operations; performance and efficiency
- Intelligent hardware-driven systems management with extensive power management features
- Innovative tools including automation for parts replacement and lifecycle manageability
- Broad choice of networking technologies from GbE to InfiniBand
- Built-in redundancy with hot-plug, swappable PSUs, HDDs and fans
Benefits
- Designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
- High-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities
- Flexible compute platform with dense storage capacity
- 2S/2U server, 6 PCIe slots
- Large memory footprint (up to 768GB / 24 DIMMs)
- High I/O performance and optional storage configurations
- HDD options: 12 x 3.5" or 24 x 2.5", plus 2 x 2.5" HDDs in rear of server
- Up to 26 HDDs, with 2 hot-plug drives in rear of server for boot or scratch

OptiStruct Performance: CPU Cores
- Running more cores per node generally improves overall performance
- The -nproc parameter specifies the number of threads spawned per MPI process
- Guideline: 6 threads per MPI process yields the best performance (with either 2 or 4 PPN); this configuration performed best among all combinations tested (a hypothetical launch sketch follows below)
(Higher is better)
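
To make the layout arithmetic concrete, here is a minimal, hypothetical launch sketch in Python that applies the 6-threads-per-MPI-process guideline on the 28-core nodes used in this study. Only -nproc (threads per MPI process) comes from the slide; the wrapper name, input deck and mpirun options are assumptions for illustration, so consult the OptiStruct run scripts for the exact invocation.

```python
# Minimal, hypothetical launch sketch for the hybrid layout discussed above:
# 28-core nodes, 4 MPI ranks per node, 6 threads per rank (the guideline from
# the benchmark). Only -nproc is taken from the slides; the wrapper name,
# input deck and mpirun options are illustrative assumptions.
import subprocess

NODES = 24             # compute nodes used for the job
PPN = 4                # MPI ranks per node (2 or 4 PPN both worked well here)
THREADS_PER_RANK = 6   # guideline: 6 threads spawned per MPI process

nranks = NODES * PPN   # total MPI ranks

cmd = [
    "mpirun",                         # Intel MPI (Hydra) launcher
    "-np", str(nranks),               # total number of MPI ranks
    "-ppn", str(PPN),                 # ranks per node
    "optistruct",                     # hypothetical solver wrapper
    "engine_assembly.fem",            # hypothetical input deck
    "-nproc", str(THREADS_PER_RANK),  # threads per MPI process (from the slide)
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)     # uncomment to launch on a real cluster
```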

OptiStruct Performance: Interconnect
- EDR InfiniBand provides superior scalability over Ethernet:
  - 11 times better performance than 1GbE at 24 nodes
  - 90% better performance than 10GbE at 24 nodes
  - Ethernet solutions do not scale beyond 4 nodes
(Higher is better; 2 PPN / 6 threads)

OptiStruct Profiling: Number of MPI Calls
- For 1GbE, communication time is mostly spent on point-to-point transfers:
  - MPI_Iprobe and MPI_Test are the calls that test for completion of non-blocking transfers
  - Overall runtime is significantly longer than with the faster interconnects
- For 10GbE, communication time is dominated by data transfer:
  - The time spent in non-blocking transfers is still significant
  - Overall runtime is reduced compared to 1GbE
  - As data-transfer time drops, collective operations account for a larger share of the overall time
- For InfiniBand, overall runtime is reduced further:
  - Time consumed by MPI_Allreduce becomes more significant than data transfer
  - Overall runtime is reduced significantly compared to Ethernet
(Charts: 1GbE, 10GbE, EDR InfiniBand)

OptiStruct Profiling: MPI Message Sizes
- The most time-consuming MPI communications are:
  - MPI_Allreduce, with messages concentrated at 8B (see the sketch below)
  - MPI_Iprobe and MPI_Test, with a high volume of calls that test for completion of messages
(2 PPN / 6 threads)
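
As a side illustration (not part of the original study), the mpi4py sketch below times an 8-byte Allreduce, the message size that dominates the profile above; at this size the operation is governed almost entirely by interconnect latency rather than bandwidth.

```python
# Minimal mpi4py sketch (not from the original study): time an 8-byte
# MPI_Allreduce, the message size that dominates the OptiStruct profile.
# At 8 bytes the cost is set by interconnect latency, not bandwidth.
# Run with, e.g.: mpirun -np 96 python allreduce_8b.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

send = np.array([float(rank)])   # one double = 8 bytes of payload
recv = np.empty(1)
iters = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(send, recv, op=MPI.SUM)
elapsed = MPI.Wtime() - t0

if rank == 0:
    print(f"8B Allreduce average latency: {elapsed / iters * 1e6:.2f} us")
```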

OptiStruct Performance: Interconnect
- EDR InfiniBand delivers superior scalability over the previous InfiniBand generation:
  - EDR InfiniBand improves over FDR InfiniBand by 40% at 24 nodes
  - EDR InfiniBand outperforms FDR InfiniBand by 9% at 16 nodes
- The new EDR InfiniBand architecture supersedes the previous FDR InfiniBand generation in scalability
(Higher is better; 4 PPN / 6 threads)

OptiStruct Performance: Processes Per Node
- OptiStruct reduces communication by deploying a hybrid MPI mode:
  - Each hybrid MPI process can spawn threads, which helps reduce communication on the network
  - Enabling more MPI processes per node helps to unlock additional performance
- The following environment settings and tuned flags were used (applied in the sketch below): I_MPI_PIN_DOMAIN auto, I_MPI_ADJUST_ALLREDUCE 2, I_MPI_ADJUST_BCAST 1, I_MPI_ADJUST_REDUCE 2, ulimit -s unlimited
(Chart callouts: 10%, 4%. Higher is better)
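
A minimal Python driver, sketched below, applies these settings before launching a job. The mpirun command is a placeholder; the real OptiStruct launch line should be substituted.

```python
# Minimal sketch: apply the tuned Intel MPI settings listed above, raise the
# stack limit (the "ulimit -s unlimited" equivalent), then launch the job.
# The mpirun command is a placeholder -- substitute the real OptiStruct line.
import os
import resource
import subprocess

env = os.environ.copy()
env.update({
    "I_MPI_PIN_DOMAIN": "auto",      # process pinning
    "I_MPI_ADJUST_ALLREDUCE": "2",   # Rabenseifner's Allreduce algorithm
    "I_MPI_ADJUST_BCAST": "1",
    "I_MPI_ADJUST_REDUCE": "2",
})

# "ulimit -s unlimited" for this process and its children; may require the
# hard limit to already be unlimited on the node.
resource.setrlimit(resource.RLIMIT_STACK,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

subprocess.run(["mpirun", "-np", "96", "-ppn", "4", "hostname"],  # placeholder
               env=env, check=True)
```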

OptiStruct Performance: Intel MPI Tuning
- Tuning the Intel MPI collective algorithm can improve performance:
  - The MPI profile shows ~30% of runtime is spent in MPI_Allreduce over InfiniBand
  - The default algorithm in Intel MPI is recursive doubling (I_MPI_ADJUST_ALLREDUCE=1)
  - Rabenseifner's algorithm for Allreduce appears to be the best on 24 nodes (a sweep sketch follows below)
(Higher is better; Intel MPI, 4 PPN / 6 threads)
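
One simple way to reproduce this kind of comparison is to run the same job once per algorithm setting and compare wall times; the sketch below does this with placeholder job arguments and the two algorithm values named above.

```python
# Minimal sketch: run the same job once per Allreduce algorithm setting and
# compare wall times. Values follow the Intel MPI numbering referenced above
# (1 = recursive doubling, the default; 2 = Rabenseifner's). The job command
# is a placeholder.
import os
import subprocess
import time

ALGORITHMS = {1: "recursive doubling (default)", 2: "Rabenseifner"}
JOB = ["mpirun", "-np", "96", "-ppn", "4", "hostname"]  # substitute real job

for alg, name in ALGORITHMS.items():
    env = os.environ.copy()
    env["I_MPI_ADJUST_ALLREDUCE"] = str(alg)
    t0 = time.time()
    subprocess.run(JOB, env=env, check=True)
    print(f"I_MPI_ADJUST_ALLREDUCE={alg} ({name}): {time.time() - t0:.1f} s")
```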

OptiStruct Performance: CPU Frequency
- Increasing the CPU clock speed improves job efficiency:
  - Up to 11% higher productivity when raising the clock speed from 2300MHz to 2600MHz, roughly in line with the ~13% clock increase
- Turbo Mode boosts job efficiency beyond what the clock-speed increase alone provides:
  - Up to 31% performance gain from enabling Turbo Mode at 2600MHz
  - The gain from Turbo Mode depends on environmental factors, e.g. temperature
(Chart callouts: 11%, 4%, 8%, 10%, 17%, 31%. Higher is better; 4 PPN / 6 threads)

OptiStruct Profiling: Disk I/O
- OptiStruct makes use of distributed I/O on the local scratch disks of the compute nodes:
  - Heavy disk I/O takes place throughout the run on each compute node
  - The high I/O usage causes system memory to also be used for I/O caching
- Because disk I/O is distributed across all compute nodes, aggregate I/O performance is higher:
  - The workload completes faster as more nodes take part in the distributed I/O (a monitoring sketch follows below)
(Higher is better; 4 PPN / 6 threads)
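
To observe the node-local scratch traffic described above, one could sample the kernel's disk counters during the run; the sketch below uses the psutil package for this (an illustrative approach, not the monitoring tool used in the original profiling).

```python
# Minimal sketch: sample node-local disk throughput while the solver runs,
# using the psutil package (an illustrative approach, not the tool used in
# the original profiling). Run it alongside the job on each compute node.
import time
import psutil

INTERVAL = 5  # seconds between samples

prev = psutil.disk_io_counters()
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters()
    read_mbs = (cur.read_bytes - prev.read_bytes) / 1e6 / INTERVAL
    write_mbs = (cur.write_bytes - prev.write_bytes) / 1e6 / INTERVAL
    print(f"scratch I/O: read {read_mbs:.1f} MB/s, write {write_mbs:.1f} MB/s")
    prev = cur
```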

OptiStruct Profiling: MPI Message Sizes
- The majority of data transfer takes place from rank 0 to the rest of the ranks
- The non-blocking communication suggests data transfers are overlapped to hide network latency (illustrated in the sketch below)
- The collective operations are much smaller in size
(Charts: 16 nodes, 32 nodes; 2 PPN / 6 threads)
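
The rank-0-centric pattern can be sketched with mpi4py as below: rank 0 posts non-blocking sends to every other rank while the receivers poll with Iprobe and Test, mirroring the MPI_Iprobe/MPI_Test activity in the profile. This is an illustration of the observed pattern, not OptiStruct's actual implementation.

```python
# Minimal mpi4py illustration of the observed pattern (not OptiStruct's
# actual code): rank 0 distributes data to all other ranks with non-blocking
# sends, while the receivers poll with Iprobe/Test, mirroring the
# MPI_Iprobe/MPI_Test counts seen in the profile.
# Run with, e.g.: mpirun -np 8 python fanout.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1 << 17  # ~1 MB of doubles per rank, illustrative only

if rank == 0:
    payload = np.random.rand(N)
    reqs = [comm.Isend(payload, dest=d, tag=7) for d in range(1, size)]
    MPI.Request.Waitall(reqs)        # rank 0 could overlap other work here
else:
    buf = np.empty(N)
    while not comm.Iprobe(source=0, tag=7):
        pass                         # polling, as MPI_Iprobe in the profile
    req = comm.Irecv(buf, source=0, tag=7)
    while not req.Test():
        pass                         # polling, as MPI_Test in the profile

comm.Barrier()
if rank == 0:
    print("fan-out complete")
```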

OptiStruct Summary
- OptiStruct is designed to perform structural analysis at large scale, using a hybrid MPI mode to perform at scale
- EDR InfiniBand outperforms Ethernet in scalability:
  - ~70 times better performance than 1GbE at 24 nodes
  - 4.8x better performance than 10GbE at 24 nodes
  - EDR InfiniBand improves over FDR InfiniBand by 40% at 24 nodes
- Each hybrid MPI process can spawn threads, which helps reduce communication on the network
  - Enabling more MPI processes per node helps to unlock additional performance
  - The hybrid MPP version enhances OptiStruct scalability
- Profiling and tuning: CPU, I/O, network
  - MPI_Allreduce accounts for ~30% of runtime at scale; tuning MPI_Allreduce should allow better performance at high core counts
  - Guideline: 6 threads per MPI process yields the best performance
  - Turbo Mode boosts job efficiency beyond what a clock-speed increase alone provides
  - OptiStruct makes use of distributed I/O on the local scratch disks; heavy disk I/O takes place throughout the run on each compute node

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.