Optimizing LS-DYNA Productivity in Cluster Environments


Gilad Shainer and Swati Kher, Mellanox Technologies
10th International LS-DYNA Users Conference (Computing Technology)

Abstract

Increasing demand for computing power in scientific and engineering applications has spurred the deployment of high-performance computing (HPC) clusters. Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) are computational technologies that can take advantage of HPC clusters to increase engineering design productivity, reduce development cost and shorten time to market. The end-user benefits are far more sophisticated, enhanced, safer and robust products. With the increased use of multi-core processors in HPC clusters, FEA and CFD applications need to be highly parallel and scalable in order to fully utilize the cluster's computing ability. Moreover, multi-core based clusters impose higher demands on cluster components, in particular the cluster interconnect. In this paper we investigate the optimal use of high-performance clusters for maximum efficiency and productivity in CAE applications, and in automotive design in particular.

Introduction

High-performance computing is a crucial tool for automotive design and manufacturing. It is used for computer-aided engineering (CAE) from component-level to full-vehicle analyses: crash simulations, structural integrity, thermal management, climate control, engine modeling, exhaust, acoustics and much more. HPC helps drive accelerated speed to market, significant cost reductions, and tremendous flexibility. The strength of HPC is the ability to achieve the best sustained performance by driving the CPU toward its performance limits.

The motivation for high-performance computing in the automotive industry has long been its tremendous cost savings and product improvements. The total cost of a real vehicle crash test to determine its safety characteristics is in the range of $250,000 or more. The cost of a high-performance compute cluster, on the other hand, can be just a fraction of the price of a single crash test, while providing a system that can be used for every test simulation going forward. HPC is used for many aspects beyond crash simulations. Compute-intensive systems and applications are used to simulate everything from airbag deployment to brake cooling, exhaust systems, thermal comfort and windshield washer nozzles. HPC-based simulations and analyses empower engineers and designers to create vehicles that are better prepared and safer for real-life environments.

High-Performance and Scalable Simulations

One of the most demanding applications in automotive design is crash simulation (full-frontal, offset-frontal, angled-frontal, side-impact, rear-impact and more).

Crash simulations, while performed very early in the development process, are validated only very late in the process, once the vehicle is completely built. The more sophisticated and complex the simulation, the more parts and details can be analyzed. Computer-based analyses provide early insight into phenomena that are difficult to gather experimentally and, when they can be gathered at all, only at a later stage and at substantial cost. Time and money are saved without having to build costly prototypes.

High-performance clusters are scalable compute solutions built from commodity hardware connected over a private system network. The main benefits of clusters are scalability, availability, flexibility and high performance. A cluster uses the combined compute power of its server nodes to form a high-performance solution for cluster-based applications. When more compute power is needed, it can be achieved simply by adding more server nodes. Depending on the design complexity, an FEA or CFD simulation can be performed across all of the cluster's server nodes, utilizing every CPU core in the cluster, or on a segment of the cluster, enabling multiple parallel simulations. In addition, the cluster environment provides a flexible, dynamic solution in which each simulation can use a different number of CPU cores.

The way the cluster nodes are connected together has a great influence on overall application performance, especially when multi-core servers are used. In multi-core environments, it is essential to have I/O that provides the same low latency for each process/core, regardless of how many cores operate simultaneously, in order to guarantee linear application scalability. As each CPU core generates and receives data to and from other server nodes, the bandwidth capability must be able to handle all of the data streams. Furthermore, the interconnect should handle all of the communication and offload the CPU cores from I/O-related tasks in order to maximize the CPU cycles dedicated to the applications.

The InfiniBand Architecture (IBA) is an industry-standard fabric designed to connect clustered servers to each other and to storage. InfiniBand is designed to provide the highest bandwidth, low-latency computing, scalability to tens of thousands of nodes with multiple CPU cores per server platform, and efficient utilization of compute processing resources. InfiniBand server-to-server and server-to-storage connections deliver up to 40Gb/s of bandwidth; switch-to-switch connections deliver up to 120Gb/s. This high bandwidth is matched with ultra-low application latency of 1μs and switch latencies under 140ns per hop, enabling efficient scale-out of compute systems.

Figure 1: MPI multi-core latency with Mellanox ConnectX InfiniBand HCA (latency in microseconds versus number of CPU cores, 1 to 8).
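The data behind Figure 1 was gathered at the MPI level with the IMB (Pallas) multi-core latency benchmark, as described below. For readers unfamiliar with that style of measurement, the following minimal C sketch times small-message ping-pong exchanges on every rank pair simultaneously. It is an illustrative approximation under an assumed rank placement (first half of the ranks on one node, second half on the other); it is not the IMB code or the exact configuration used for Figure 1.

/* Minimal multi-pair MPI ping-pong latency sketch (illustrative only;
 * the paper's Figure 1 was produced with the IMB (Pallas) multi-core
 * benchmark, not this code). Assumes an even number of ranks, with
 * ranks 0..size/2-1 on one node and their peers on a second node. */
#include <mpi.h>
#include <stdio.h>

#define ITERS     1000
#define MSG_BYTES 8        /* small message, latency-dominated */

int main(int argc, char **argv)
{
    int rank, size;
    char buf[MSG_BYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);           /* start all pairs together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < half) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    /* one-way latency = total time / (2 * round trips) */
    double lat_us = (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6;

    if (rank < half)
        printf("pair %d: one-way latency %.2f usec\n", rank, lat_us);
    MPI_Finalize();
    return 0;
}

Running this with 2, 4, ... 16 ranks (all pairs active at once) shows whether the measured per-pair latency stays flat as more cores drive the adapter simultaneously, which is the behavior Figure 1 reports for InfiniBand.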

Figure 1 shows InfiniBand latency scaling on a multi-core cluster in which each server contains 8 cores. Latency is measured with the IMB (Pallas) multi-core benchmark for the Message Passing Interface (MPI), which is the messaging interface for most FEA and CFD applications, including LS-DYNA. MPI is the interface between the applications and the cluster network. As shown, InfiniBand provides the same low latency regardless of the number of cores simultaneously carrying out the simulations.

Single-Server Dilemma: SMP or MPI

A common multi-core environment consists of 8 to 16 CPU cores in a single server. In a typical single-server environment, application jobs can be executed in a Shared Memory Processing (SMP) fashion or with the Message Passing Interface (MPI). To compare the two options, we used the LS-DYNA neon_refined_revised benchmark. LS-DYNA is a general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems. It is widely used in the automotive industry for crashworthiness, occupant safety and metal forming, as well as in aerospace, military and defense, and consumer products.

To compare the SMP and MPI approaches, a single server was used. The comparison metric was the number of jobs that can be completed per 24 hours. As Figure 2 shows, MPI improves the system's efficiency and parallel scalability, and as more cores were used, the MPI approach performed increasingly better than traditional SMP.

Figure 2: SMP versus MPI (Scali MPI Connect) on the LS-DYNA neon_refined_revised benchmark; jobs per 24 hours versus number of cores (1 to 8).

Using MPI (in this case, Scali MPI Connect) on a single server not only provides better performance and efficiency, it also enables a smooth integration of the single server into a cluster environment. In addition, when more compute power is needed, no software changes are required.
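The jobs-per-24-hours figure used in Figure 2 (and again in Figure 3) is simply the number of complete benchmark runs that fit into one day of wall-clock time. The short helper below illustrates the arithmetic; the elapsed times are hypothetical placeholders rather than measured values from this study.

/* Illustrative jobs-per-day calculation for the productivity metric used
 * in Figures 2 and 3. The elapsed times below are hypothetical examples,
 * not measurements from the paper. */
#include <stdio.h>

static double jobs_per_day(double elapsed_seconds)
{
    return 86400.0 / elapsed_seconds;   /* 24 h = 86,400 s */
}

int main(void)
{
    double smp_elapsed = 5200.0;  /* hypothetical 8-core SMP run, seconds */
    double mpi_elapsed = 3600.0;  /* hypothetical 8-core MPI run, seconds */

    double smp_jobs = jobs_per_day(smp_elapsed);
    double mpi_jobs = jobs_per_day(mpi_elapsed);

    printf("SMP: %.1f jobs/day, MPI: %.1f jobs/day (%.0f%% more with MPI)\n",
           smp_jobs, mpi_jobs, (mpi_jobs - smp_jobs) / smp_jobs * 100.0);
    return 0;
}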

LS-DYNA Scalability in a Multi-Core Cluster Environment: The Importance of the Interconnect

The cluster interconnect is critical to efficiency and productivity on multi-core platforms. Productivity is measured by the number of jobs, or simulations, that can be performed in a given time frame. When more CPU cores are present, overall cluster productivity is expected to increase. The productivity and scalability analysis was performed at the Mellanox Cluster Center on the Neptune cluster. Neptune consists of 16 server nodes, each with dual dual-core AMD Opteron CPUs and 8GB of DDR2 memory. The servers were connected with gigabit Ethernet and with Mellanox InfiniHost III Ex and ConnectX InfiniBand adapters. LS-DYNA version mpp971_s_7600 was run with HP MPI version 2.2.5.1, and the InfiniBand drivers were OFED 1.3 from OpenFabrics.

Figure 3 compares InfiniBand and gigabit Ethernet (GigE) on the LS-DYNA neon_refined_revised benchmark. The cluster consisted of dual-socket, dual-core AMD server nodes. For any number of cores, InfiniBand shows better efficiency than GigE, enabling up to 39% more LS-DYNA jobs per day with 16 cores and 1034% higher productivity on a 64-core cluster. Moreover, when scaling beyond 16 cores, GigE failed to deliver an increased number of jobs and even diminished overall cluster productivity dramatically. While InfiniBand continued to provide nearly linear scalability and high efficiency, adding CPU cores beyond 16 on the GigE-connected cluster reduced the compute cluster's productivity below the level achieved with only 16, or even 8, cores.

Figure 3: InfiniBand versus GigE on the LS-DYNA neon_refined_revised benchmark; jobs per day and the percentage difference between InfiniBand and GigE at 8, 16, 20, 32 and 64 cores.

As more jobs can be performed, product quality increases and time to market shrinks. The results show that InfiniBand-based clusters pay off even at 16 cores and are a must for clusters beyond 16 cores; any additional servers beyond 16 cores connected with GigE only reduce cluster productivity. Moreover, an InfiniBand-based cluster with only 16 cores achieves higher productivity than a GigE-connected cluster of any size.

Profiling of LS-DYNA Network Activity

To understand the I/O requirements of LS-DYNA, we analyzed the network activity during the neon_refined_revised benchmark for two cases: 16-core and 32-core clusters. The results, showing the total data sent for each MPI message size (or I/O message size), appear in Figure 4. Note that the Y axis uses a logarithmic scale.

Figure 4: neon_refined_revised network profiling; total data transferred (bytes, logarithmic scale) per MPI message-size range, from [0..64] bytes up to [4194305..infinity], for the 16-core and 32-core cases.

The peaks of I/O throughput during neon_refined_revised execution occur in the 1KB to 256KB message-size range, which reflects the need for a high-bandwidth interconnect. Moreover, LS-DYNA uses very large messages, of 1MB and above, which explains why GigE is not adequate for car crash simulations. As the number of cores (and cluster server nodes) increases from 16 to 32, a larger number of small messages (2B-64B) is seen, a result of the need to synchronize more servers. With regard to the data messages, increasing the number of cores and servers increased the total data sent via the 4KB-16KB and 64KB-256KB message ranges. As clusters grow in size, there are both more control messages (small sizes) and more large data messages.
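The paper does not name the tool used to collect the Figure 4 statistics. One common way to gather this kind of per-message-size data is to interpose on MPI calls through the standard PMPI profiling interface, as in the hedged sketch below; only MPI_Send is wrapped for brevity, whereas a complete profiler would also cover the non-blocking and collective operations LS-DYNA uses, and the bin boundaries simply mirror the ranges shown in Figure 4.

/* Hedged sketch of message-size profiling via the standard PMPI interface
 * (MPI-3 prototypes). Linking this wrapper ahead of the MPI library counts
 * bytes sent per size bin, similar in spirit to the data behind Figure 4.
 * Only MPI_Send is intercepted; a real profiler would also wrap Isend,
 * collectives, etc., and would reduce counts across all ranks. */
#include <mpi.h>
#include <stdio.h>

#define NBINS 10
static long long bin_bytes[NBINS];            /* total bytes per size bin */
static const long long bin_upper[NBINS] = {   /* bin upper bounds (bytes) */
    64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, -1 };

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int ts;
    MPI_Type_size(type, &ts);
    long long bytes = (long long)count * ts;

    int b = 0;
    while (b < NBINS - 1 && bytes > bin_upper[b]) b++;
    bin_bytes[b] += bytes;

    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)                      /* per-rank counts; rank 0 only here */
        for (int b = 0; b < NBINS; b++)
            printf("bin %d: %lld bytes\n", b, bin_bytes[b]);
    return PMPI_Finalize();
}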

The network profiling results show that as cluster size increases there are more control messages (small sizes), hence the need for the lowest latency, and larger data messages, hence the need for the highest throughput. We can conclude that higher-throughput (such as InfiniBand QDR 40Gb/s) and lower-latency interconnects will continue to improve LS-DYNA performance and productivity.

Conclusions

From concept to engineering and from design to test and manufacturing, engineering relies on powerful virtual development solutions. Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) are used to secure quality and speed up the development process. Cluster solutions maximize the total value of ownership for FEA and CFD environments and extend innovation in virtual product development.

Multi-core cluster environments impose high demands on cluster connectivity: throughput, low latency, low CPU overhead, network flexibility and high efficiency are all required to maintain a balanced system and to achieve high application performance and scaling. Low-performance interconnect solutions, or a lack of interconnect hardware capabilities, will result in degraded system and application performance.

Livermore Software Technology Corporation (LSTC) LS-DYNA software and the new LS-DYNA benchmark case were investigated. In all InfiniBand-based cases, LS-DYNA demonstrated high parallelism and scalability, enabling it to take full advantage of multi-core HPC clusters. Moreover, according to the results, a low-speed interconnect such as GigE is ineffective at any cluster size from 16 cores (4 nodes) and above.

We profiled the network usage of LS-DYNA to determine its requirements of the interconnect, which led to the conclusions above. The results indicated that large data messages are used heavily and that the amount of data sent via the large message sizes increases with cluster size. We also observed a large number of small messages, which are mostly used for synchronization between the cluster server nodes. From these results we concluded that an interconnect combining very high bandwidth, extremely low latency and low CPU overhead is required in order to sustain high scalability and efficiency. From all of the above results, it is clear that GigE can no longer serve as a cluster interconnect for FEA or CFD applications such as LSTC LS-DYNA.

To achieve the desired virtual design capability, CAE users rely on HPC clusters for the needed compute power. InfiniBand-based multi-core cluster environments provide the required high-performance compute system for CAE simulations, and the CAE software is ready to take full advantage of that hardware.