LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

11th International LS-DYNA Users Conference, Computing Technology

Gilad Shainer (1), Tong Liu (2), Jeff Layton (3), Onur Celebioglu (3)
(1) HPC Advisory Council, (2) Mellanox Technologies, (3) Dell, Inc.

Abstract

From concept to engineering, and from design to test and manufacturing, the automotive industry relies on powerful virtual development solutions. CFD and crash simulations are performed in an effort to secure quality and accelerate the development process. Recent trends in cluster environments, such as multi-core CPUs, GPUs, cluster file systems, and new interconnect speeds and offloading capabilities, are changing the dynamics of cluster-based simulations. Software applications are being reshaped for higher parallelism and multi-threading, and hardware configurations are being adapted to address the newly emerging bottlenecks, in order to maintain high scalability and efficiency. In this paper we cover best practices for achieving maximum productivity through MPI optimizations, efficient network utilization and the use of parallel file systems.

Introduction

High-performance computing (HPC) is a crucial tool for automotive design and manufacturing. It is used for computer-aided engineering (CAE) from component-level to full-vehicle analyses: crash simulations, structural integrity, thermal management, climate control, engine modeling, exhaust, acoustics and much more. HPC helps drive faster time to market, significant cost reductions, and tremendous flexibility. The strength of HPC is the ability to achieve the best sustained performance by driving CPU performance toward its limits. The motivation for high-performance computing in the automotive industry has long been its tremendous cost savings and product improvements: the cost of a high-performance compute cluster can be just a fraction of the price of a single crash test, while providing a system that can be used for every test simulation going forward.

LS-DYNA software from Livermore Software Technology Corporation is a general-purpose structural and fluid analysis simulation package capable of simulating complex real-world problems. It is widely used in the automotive industry for crashworthiness analysis, occupant safety analysis, metal forming and much more. In most cases, LS-DYNA is used in cluster environments, as these provide better flexibility, scalability and efficiency for such simulations. Cluster productivity, which is not necessarily measured by how fast a single application run completes, is the most important factor for cluster hardware and software configuration: achieving the maximum number of jobs executed per day matters more than the wall-clock time of a single job. Maximizing productivity on today's cluster platforms requires enhanced messaging techniques even on a single server platform. These techniques also benefit parallel simulations by making efficient use of the cluster interconnect.

For storage, LS-DYNA users can either use local disks or take advantage of a parallel file system. With a parallel file system, all nodes may access the same files at the same time, concurrently reading and writing. One example of a parallel file system is Lustre, an object-based, distributed file system generally used for large-scale cluster computing. Efficient use of Lustre can increase LS-DYNA performance, in particular when the compute system is connected via a high-speed network such as InfiniBand.

HPC Clusters

LS-DYNA simulations are typically carried out on high-performance computing (HPC) clusters based on industry-standard hardware connected by a private high-speed network. The main benefits of clusters are affordability, flexibility, availability, high performance and scalability. A cluster uses the aggregated power of compute server nodes to form a high-performance solution for parallel applications such as LS-DYNA. When more compute power is needed, it can often be obtained simply by adding more server nodes to the cluster. The manner in which an HPC cluster is architected has a large influence on overall application performance and productivity: the number of CPUs, the use of GPUs, the storage solution and the cluster interconnect. By providing low latency, high bandwidth and extremely low CPU overhead, InfiniBand has become the most widely deployed high-speed interconnect for HPC clusters, replacing proprietary or low-performance solutions. The InfiniBand Architecture (IBA) is an industry-standard fabric designed to provide high-bandwidth, low-latency computing, scalability to ten-thousand-node clusters with multiple CPU cores per server platform, and efficient utilization of compute processing resources.

This study was conducted at the HPC Advisory Council systems center (www.hpcadvisorycouncil.com) on an Intel Cluster Ready certified, 16-node Dell PowerEdge M610 cluster. Each node had two quad-core Intel Xeon X5570 processors at 2.93 GHz (two CPU sockets, eight cores per node), 24 GB of memory, and a Mellanox ConnectX-2 40 Gb/s InfiniBand mezzanine card; the nodes were connected by a Mellanox M3601Q 36-port 40 Gb/s InfiniBand switch. The operating system was RHEL 5U3, the InfiniBand driver was OFED 1.5, and the file system was Lustre 1.8.2. The MPI libraries used were Open MPI 1.3.3, HP-MPI 2.7.1, Platform MPI 5.6.7 and Intel MPI 4.0. The LS-DYNA version was MPP971_s_R4.2.1, and the benchmark workload was the Three Vehicle Collision simulation.

The Importance of the Cluster Interconnect

The cluster interconnect is critical to application efficiency and performance in the multi-core era. When more CPU cores are present, overall cluster productivity increases only in the presence of a high-speed interconnect. We compared the elapsed time of LS-DYNA using 40 Gb/s InfiniBand and Gigabit Ethernet. Figure 1 below shows the elapsed time for these interconnects over a range of core/node counts for the Three Vehicle Collision case.

Figure 1. Interconnect comparison with the Three Vehicle Collision case

InfiniBand delivered superior scalability and faster run times, providing the ability to run more jobs per day. The performance of the 40 Gb/s InfiniBand-based simulation, measured as ranking (number of jobs per day), was 365% higher than Gigabit Ethernet at 16 nodes. While Ethernet showed a loss of performance (an increase in run time) beyond 8 nodes, InfiniBand demonstrated good scalability across all tested configurations. LS-DYNA uses MPI as the interface between the application and the networking layer, and as such requires scalable, efficient send-receive semantics as well as scalable collective operations. While InfiniBand handles those operations effectively, the Ethernet TCP stack incurs CPU overhead that translates into higher network latency, reducing cluster efficiency and scalability.
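The ranking metric used above is simply the number of jobs a system can complete per day for a given elapsed time per job. As a minimal illustration (not part of the original study), the short C program below computes the ranking and the relative gain between two interconnects; the elapsed-time values are placeholders, not the measured results behind Figure 1.

```c
/* Minimal sketch: the "ranking" productivity metric (jobs per day) and the
 * relative gain between two interconnects. The elapsed times below are
 * placeholder values for illustration, not measured results from this study. */
#include <stdio.h>

/* Jobs per day for a given elapsed wall-clock time per job, in seconds. */
static double ranking(double elapsed_seconds)
{
    return 86400.0 / elapsed_seconds;
}

int main(void)
{
    double elapsed_gige = 4000.0;  /* placeholder: Gigabit Ethernet run time [s] */
    double elapsed_ib   = 1000.0;  /* placeholder: InfiniBand run time [s]       */

    double gain = (ranking(elapsed_ib) / ranking(elapsed_gige) - 1.0) * 100.0;

    printf("GigE ranking:       %.1f jobs/day\n", ranking(elapsed_gige));
    printf("InfiniBand ranking: %.1f jobs/day\n", ranking(elapsed_ib));
    printf("Relative gain:      %.0f%%\n", gain);
    return 0;
}
```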

LS-DYNA MPI Profiling

Profiling the application is essential for understanding its performance dependency on the various cluster subsystems. In particular, communication profiling can help in choosing the most efficient interconnect and MPI library, and in identifying the critical communication sensitivity points that strongly influence the application's performance, scalability and productivity. MPI profiling data for the LS-DYNA Three Vehicle Collision case is presented in Figures 2 and 3, which show the run-time distribution between computation and communication and the usage of the different MPI operations for several cluster configurations (32 cores or 4 nodes, 64 cores or 8 nodes, and 128 cores or 16 nodes).

Figure 2. Run-time distribution between computation and communication
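Run-time breakdowns such as the one in Figure 2 are typically collected through the MPI profiling (PMPI) interface, which lets a wrapper library time MPI calls without modifying the application. The sketch below illustrates the technique for MPI_Allreduce only; it is a simplified example, not the tooling used in this study, and the variable names and report format are our own.

```c
/* Sketch of a PMPI interposition wrapper that accumulates the time an MPI
 * application spends inside MPI_Allreduce. Built as a wrapper library, it
 * requires no changes to the application itself. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time  = 0.0;  /* accumulated seconds in MPI_Allreduce */
static long   allreduce_calls = 0;    /* number of calls intercepted          */

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Each rank reports its own communication cost before shutting down. */
    printf("rank %d: %ld MPI_Allreduce calls, %.2f s total\n",
           rank, allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}
```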

Figure 3. Distribution of the different MPI communication operations

As seen in Figure 2, the fraction of run time spent in communication rather than computation grows with cluster size, from 35% at 32 processes (4 nodes) to 55% at 128 processes (16 nodes), demonstrating the need for a low-latency interconnect that also delivers the throughput required for node- and core-level communications. Figure 3 shows that two MPI collectives, MPI_Allreduce and MPI_Bcast, consume most of the total MPI time and are therefore critical to LS-DYNA performance. MPI libraries and offloading capabilities related to these two collective operations will greatly influence system performance. In this paper we review the differences between MPI libraries; the MPI collective offloading capabilities of the Mellanox ConnectX-2 adapters will be reviewed in a follow-up paper.
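The dominance of these two collectives is consistent with the per-cycle communication pattern of an explicit solver, where every rank must agree on the global stable time step and small control data is distributed from the root rank. The fragment below is an illustrative sketch of that pattern, not LS-DYNA source code; the function names, message sizes and loop count are assumptions made for illustration.

```c
/* Illustrative sketch of why small, frequent collectives dominate MPI time in an
 * explicit solver: every cycle, all ranks agree on the global minimum stable time
 * step and receive a small broadcast of control data. Not LS-DYNA source code. */
#include <mpi.h>

static void advance_one_cycle(double local_dt, MPI_Comm comm)
{
    double global_dt;
    int    control[4] = {0, 0, 0, 0};   /* e.g. termination and output flags */

    /* Latency-bound collective: one double per rank, repeated every cycle. */
    MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN, comm);

    /* Small broadcast of solver control information from the root rank. */
    MPI_Bcast(control, 4, MPI_INT, 0, comm);

    /* ... element processing and contact search would follow here ... */
    (void)global_dt;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int cycle = 0; cycle < 1000; cycle++)
        advance_one_cycle(1.0e-6, MPI_COMM_WORLD);   /* placeholder local dt */
    MPI_Finalize();
    return 0;
}
```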

MPI Library Comparisons

We compared the performance of three MPI libraries (Open MPI, HP MPI and Platform MPI) and two compilers (the Intel and PGI compilers). The results are presented in Figure 4.

Figure 4. MPI libraries and compilers comparison

The combination of HP MPI with the Intel compiler demonstrated the highest performance, but in general the differences between the configurations were not dramatic; all MPI options showed comparable performance at the four cluster sizes tested.

Parallel File System

There are three main categories of file systems for HPC I/O: the Network File System (NFS), storage area network (SAN) file systems, and parallel file systems. The most scalable and highest-performing option is the parallel file system. In this approach, a few nodes connected to the storage (I/O nodes) serve data to the rest of the cluster. The main advantages a parallel file system provides include a global name space, scalability, and the ability to distribute large files across multiple nodes. Generally, a parallel file system includes a metadata server (MDS), which holds information about the data on the I/O nodes. Metadata is the information about a file, such as its name, location and owner. Some parallel file systems use a dedicated server for the MDS, while others distribute the MDS functionality across the I/O nodes.

Lustre is an open-source parallel file system for Linux clusters. It appears to the application like a traditional UNIX file system (similar to GPFS): distributed Object Storage Targets are responsible for the actual file-to-disk transactions, and a user-level library allows application I/O requests to be translated into Lustre calls. As with other parallel file systems, data striping across concurrently running nodes is the main performance-enhancing factor, while metadata is provided by a separate server. Lustre stores file system metadata on a cluster of MDSs and stores file data as objects on object storage targets (OSTs), which directly interface with object-based disks (OBDs). The MDSs maintain a transactional record of high-level file and file system changes and support all file system namespace operations such as file lookups, file creation, and file and directory attribute manipulation.
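As a concrete example of how an MPI application can request striping when creating a file on Lustre, the sketch below passes ROMIO striping hints through an MPI_Info object. This is a generic MPI-IO illustration under the assumption that the MPI library's ROMIO layer handles the I/O; the file path and hint values are placeholders, and this is not how LS-DYNA itself performs its I/O.

```c
/* A minimal sketch, assuming the MPI library's ROMIO layer is used for I/O on
 * Lustre: striping hints spread a shared file across several OSTs so that
 * concurrently writing ranks do not serialize on a single storage target.
 * The file name and hint values are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");      /* stripe over 8 OSTs  */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MiB stripe size   */

    /* Open a shared result file on the Lustre mount with the hints applied. */
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/demo_output",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```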

In our test environment, native InfiniBand-based storage was used for Lustre, as described in Figure 5.

Figure 5. Test system environment with the native InfiniBand Lustre parallel file system

With a native InfiniBand Lustre parallel file system available, we compared LS-DYNA performance with Lustre against using local disks to store the data files. The results of the two cases are presented in Figure 6.

Figure 6. LS-DYNA performance with Lustre versus local disk

Due to the parallel file system capabilities and the use of InfiniBand as the network for Lustre, using Lustre instead of local disks increased LS-DYNA performance by 20% on average for both HP MPI and Intel MPI. Intel MPI has native Lustre support, enabled on the command line with mpiexec -genv I_MPI_ADJUST_BCAST 5 -genv I_MPI_EXTRA_FILESYSTEM on -genv I_MPI_EXTRA_FILESYSTEM_LIST lustre.

Future Work

Further investigation is required into the benefits of the MPI collective communication offloads supported by the Mellanox ConnectX-2 adapters. Figure 3 makes clear that the MPI collectives MPI_Allreduce and MPI_Bcast consume most of the total MPI time, and offloading them to the network is expected to increase LS-DYNA performance.

Conclusions

From concept to engineering and from design to test and manufacturing, engineering relies on powerful virtual development solutions. Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) are used to secure quality and speed up the development process. Cluster solutions maximize the total value of ownership for FEA and CFD environments and extend innovation in virtual product development.

HPC cluster environments impose high demands on cluster connectivity: throughput, low latency, low CPU overhead, network flexibility and high efficiency are all needed to maintain a balanced system and to achieve high application performance and scaling. Low-performance interconnect solutions, or missing interconnect hardware capabilities, result in degraded system and application performance.

LS-DYNA software from Livermore Software Technology Corporation (LSTC) was investigated. In all InfiniBand-based cases, LS-DYNA demonstrated high parallelism and scalability, which enabled it to take full advantage of multi-core HPC clusters. Moreover, according to the results, a lower-speed interconnect such as Ethernet is ineffective at mid to large cluster sizes and can cause a dramatic reduction in performance beyond 8 server nodes (the application run time actually gets longer).

We profiled the network communication of LS-DYNA to determine its sensitivity points, which is essential for estimating the influence of the various cluster components, both hardware and software. We demonstrated the importance of high-performance MPI_Allreduce and MPI_Bcast collective operations and the performance differences between MPI libraries and compilers. We also investigated the benefits of using a parallel file system instead of local disks and showed that using InfiniBand for both MPI and storage can provide performance benefits as well as cost savings by consolidating all cluster traffic onto a single network.