Determining the MPP LS-DYNA Communication and Computation Costs with the 3-Vehicle Collision Model and the Infiniband Interconnect

8th International LS-DYNA Users Conference, Computing / Code Tech (1)

Yih-Yih Lin
Hewlett-Packard Company, MR01-3, 200 Forest Street, Marlborough, MA 01752
yih-yih.lin@hp.com

Abstract

The least square error approach to the LS-DYNA communication and computation costs of the Neon model was shown previously by this author to be useful in predicting performance on a given interconnect of known ping-pong latency and bandwidth; results for both Gigabit Ethernet and HyperFabric 2 were presented. In this paper, the same prediction method is applied to a much larger public-domain crash model, the 3-vehicle collision model, to determine communication and computation costs for a model representative of today's most demanding requirements. The result is then verified against the new, high-speed, low-latency Infiniband interconnect. Users of this method can perform trade-off analyses and choose an optimum hardware configuration without extensive benchmark testing.

Introduction

In a previous paper [1], the author presented the least square error approach to determining the MPP LS-DYNA communication and computation costs for the Neon model. As described in that paper, the elapsed time of an MPP LS-DYNA job comprises two parts, the computation cost and the communication cost:

    T_elapsed = T_comput + T_comm                                  (1)

The communication cost can be further approximated by the formula

    T_comm = M (α t_lat + β s / t_bw)                              (2)

where M is the average number of messages per processor, s is the average message size, t_lat and t_bw are the ping-pong latency and bandwidth, α is the latency constant (so called because α t_lat represents the aggregate latency), and β is the bandwidth constant (so called because t_bw / β represents the aggregate bandwidth). For a given MPP LS-DYNA job, all of these quantities except the two constants α and β can be measured and assumed known.
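
For readers who wish to experiment with formulas (1) and (2), a minimal Python sketch follows. It is not part of the original paper; the function and argument names are illustrative only.

    def communication_cost(msgs_per_proc, avg_msg_bytes, latency_s, bandwidth_Bps,
                           alpha, beta):
        """Formula (2): T_comm = M * (alpha * t_lat + beta * s / t_bw)."""
        return msgs_per_proc * (alpha * latency_s
                                + beta * avg_msg_bytes / bandwidth_Bps)

    def elapsed_time(computation_cost_s, msgs_per_proc, avg_msg_bytes,
                     latency_s, bandwidth_Bps, alpha, beta):
        """Formula (1): T_elapsed = T_comput + T_comm."""
        return computation_cost_s + communication_cost(
            msgs_per_proc, avg_msg_bytes, latency_s, bandwidth_Bps, alpha, beta)

    # Usage: t = elapsed_time(T_comput, M, s, t_lat, t_bw, alpha, beta)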

To determine the quantities α and β, the least square error approach is used. In this approach, two clusters with interconnects a and b are used, and a set of elapsed times on the two clusters, T^a_elapsed and T^b_elapsed, is measured for various numbers of processors. From formulas (1) and (2), α and β are then determined as the values that minimize the following sum of squares of errors:

    S = Σ_i [ M_i (t_lat^a − t_lat^b) α + M_i s_i (1/t_bw^a − 1/t_bw^b) β − (T^a_elapsed,i − T^b_elapsed,i) ]²    (3)

where the index i ranges over the varying numbers of processors, and M_i and s_i are the corresponding average message count and average message size per processor. Once the two constants are determined, formulas (1) and (2) can be applied to predict the elapsed time, i.e. the performance, of MPP LS-DYNA for the given model on any interconnect of known ping-pong latency and bandwidth.

In this paper, the same least square error approach is applied to a much larger public-domain model, the 3-vehicle collision model, to determine its communication and computation costs and thus to establish a quantitative relationship between the interconnect and the performance of this representative, highly demanding model. The quantitative relationship is then verified against the new, high-speed, low-latency Infiniband interconnect.

Model, Machine, Interconnects, and Measured Data

Model, Machine, and OS
The 3-vehicle collision model from NCAC (http://www.ncac.gwu.edu), with 971 thousand elements and a simulation time of 150 milliseconds, is used; the domain decompositions are the same as those described at http://www.topcrunch.org. The single-precision 970.3858 version of MPP LS-DYNA is used. The 32-processor cluster consists of 16 of HP's 1.5 GHz Rx2600 machines running HP-UX 11.23; the Rx2600 is a 2-CPU Itanium2 machine.

Interconnects and Their Characteristics
Two interconnects are used to determine the communication and computation costs: Gigabit Ethernet (GigE) and HP's HyperFabric 2 (HF2); their measured ping-pong latencies and bandwidths are shown in Table 1. Once the communication and computation costs are determined, the performance with a different interconnect can be predicted. The new, high-speed, low-latency Infiniband interconnect, whose measured ping-pong latency and bandwidth are also shown in Table 1, is used to verify the accuracy of the resulting prediction and thus to validate the approach.

Elapsed Times
Table 2 and Figure 1 show the measured elapsed times for jobs with 2, 4, 8, 12, 16, 24, and 32 processors, each run with the three interconnects: GigE, HF2, and Infiniband.

Message Patterns
The numbers of processors used in the least square error approach are 4, 8, 12, 16, 24, and 32. Table 3 shows the average number of messages and the average message size per processor for the MPP LS-DYNA jobs with this set of processor counts.

                GigE        HF2         Infiniband
    Latency     43 µs       22 µs       6.5 µs
    Bandwidth   112 MB/s    216 MB/s    780 MB/s

Table 1. Ping-pong latencies and bandwidths of Gigabit Ethernet (GigE), HF2, and Infiniband
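
The figures in Table 1 are of the kind produced by a standard two-rank MPI ping-pong test. The sketch below is not from the paper; it uses the mpi4py bindings with illustrative message sizes and repetition counts, and merely shows one way such latency and bandwidth numbers can be obtained.

    # Minimal MPI ping-pong sketch for estimating latency and bandwidth.
    # Run with exactly two ranks, e.g.: mpirun -np 2 python pingpong.py
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    reps = 1000

    def pingpong(nbytes):
        buf = bytearray(nbytes)
        comm.Barrier()
        t0 = time.perf_counter()
        for _ in range(reps):
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            else:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
        t1 = time.perf_counter()
        return (t1 - t0) / (2 * reps)      # one-way time per message

    if rank == 0:
        t_small = pingpong(8)              # small message: ~ping-pong latency
        t_large = pingpong(1 << 20)        # 1 MB message: ~bandwidth-dominated
        print("latency   ~ %.1f us" % (t_small * 1e6))
        print("bandwidth ~ %.0f MB/s" % ((1 << 20) / t_large / 1e6))
    else:
        pingpong(8)
        pingpong(1 << 20)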

    Number of Processors       2       4       8      12      16      24      32
    Infiniband            197422  100938   51250   35872   26778   18210   14182
    HF2                   197422  100941   51429   35943   27461   19010   15211
    GigE                  197422  101257   52921   37811   28076   20795   17100

Table 2. Measured elapsed times in seconds

Figure 1. Measured elapsed times (seconds) versus number of processors for the Infiniband, HF2, and GigE interconnects (graph of the data in Table 2).

    Number of Processors              4         8        12        16        24        32
    Ave. No. of Messages        9924591  12787074  17521630  19141076  23910066  29620774
    Ave. Message Size (bytes)      3177      2862      2338      2095      1830      1438

Table 3. Average number of messages and average message size per processor

Estimation of Communication Costs

Aggregated Latency and Aggregated Bandwidth
To estimate α and β, call the cluster with GigE cluster a and the cluster with HF2 cluster b. The sum of the squares of the six errors, for the processor counts 4, 8, 12, 16, 24, and 32 as in formula (3), can then be formed from the ping-pong latencies and bandwidths in Table 1, the elapsed-time data in Table 2, and the message data in Table 3. This sum of squared errors is a quadratic function of α and β. Its minimum occurs where its partial derivatives with respect to α and β vanish, which in turn yields two linear equations in the two unknowns α and β.
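
The normal equations can equally well be solved with a linear least-squares routine. The following Python sketch is not from the paper; it assumes numpy and sets up formula (3) from the published data in Tables 1-3, so small numerical differences from the constants reported below are possible due to rounding of the published figures.

    import numpy as np

    # Ping-pong characteristics from Table 1 (seconds, bytes/second).
    lat = {"gige": 43e-6, "hf2": 22e-6}
    bw  = {"gige": 112e6, "hf2": 216e6}

    # Per-processor message statistics from Table 3 (4, 8, 12, 16, 24, 32 procs).
    M = np.array([9924591, 12787074, 17521630, 19141076, 23910066, 29620774], float)
    s = np.array([3177, 2862, 2338, 2095, 1830, 1438], float)

    # Measured elapsed times in seconds from Table 2 for the same processor counts.
    T_gige = np.array([101257, 52921, 37811, 28076, 20795, 17100], float)
    T_hf2  = np.array([100941, 51429, 35943, 27461, 19010, 15211], float)

    # Formula (3): residual_i = M_i*dlat*alpha + M_i*s_i*dinvbw*beta - (T^a_i - T^b_i)
    A = np.column_stack([M * (lat["gige"] - lat["hf2"]),
                         M * s * (1.0 / bw["gige"] - 1.0 / bw["hf2"])])
    y = T_gige - T_hf2

    alpha, beta = np.linalg.lstsq(A, y, rcond=None)[0]
    print("alpha = %.2f, beta = %.2f" % (alpha, beta))   # paper reports 2.17 and 2.89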

A computer program based on this approach was written; solving the two equations gives

    α = 2.17 and β = 2.89.

This means that, for the 3-vehicle collision model, the aggregated latency of a given interconnect is 2.17 times its ping-pong latency, and its aggregated bandwidth is 0.346 (that is, 1/2.89) times its ping-pong bandwidth.

Accuracy of the Approach
In the section above, the latency constant α and the bandwidth constant β were determined using only the elapsed times measured with the GigE and HF2 interconnects, not with Infiniband. With these two interconnect constants in hand, formula (2) gives, for a given number of processors, the estimated communication cost with the GigE, HF2, or Infiniband interconnect. The computation cost, which is independent of the interconnect, is then simply the difference between the measured elapsed time and the estimated communication cost with GigE or HF2, as indicated by formula (1). The predicted elapsed time with the Infiniband interconnect is the sum of the estimated computation and communication costs. Table 4 compares such predicted elapsed times with the measured elapsed times on the Infiniband interconnect; the maximum error is only 3 percent.

    Number of Processors          4       8      12      16      24      32
    Estimated Elapsed Time   100039   50873   35239   26053   17860   13810
    Measured Elapsed Time    100938   51250   35872   26778   18210   14182
    Percent Error                1%      1%      2%      3%      2%      3%

Table 4. Comparison between predicted and measured elapsed times (in seconds) with the Infiniband interconnect
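
The prediction step described in this section can be written in a few lines. The sketch below is not from the paper; it continues the previous sketch and reuses the names M, s, lat, bw, T_gige, alpha, and beta defined there, together with the Infiniband ping-pong figures from Table 1.

    # Predict Infiniband elapsed times from the fitted constants, formulas (1) and (2).
    lat_ib, bw_ib = 6.5e-6, 780e6      # Table 1: Infiniband ping-pong latency, bandwidth

    T_comm_gige = M * (alpha * lat["gige"] + beta * s / bw["gige"])   # formula (2)
    T_comput    = T_gige - T_comm_gige                                # formula (1)
    T_comm_ib   = M * (alpha * lat_ib + beta * s / bw_ib)
    T_pred_ib   = T_comput + T_comm_ib

    for n, t in zip([4, 8, 12, 16, 24, 32], T_pred_ib):
        print("%2d processors: predicted %.0f s" % (n, t))   # cf. Table 4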

Cost Percentages
As noted in the previous section, the approach quantifies the two components of the elapsed time, the communication and computation costs, and the two components of the communication cost, the latency and bandwidth costs. Table 5 shows the percentages of communication and computation costs in the elapsed times with the GigE and Infiniband interconnects. It is worth noting that the slower GigE loses efficiency much faster than the fast Infiniband as the number of processors increases. Table 6 shows the percentages of latency and bandwidth costs in the communication costs for the same two interconnects. It is also worth noting that, for a given number of processors, these percentages are almost the same for the faster Infiniband and the slower GigE. From these results it can be concluded that Infiniband is well matched to the processor speed of the current 1.5 GHz Itanium2 Rx2600.

    Number of Processors               4      8     12     16     24     32
    GigE        Communication Cost    1%     4%     8%     9%    17%    23%
    GigE        Computation Cost     99%    96%    92%    91%    83%    77%
    Infiniband  Communication Cost    0%     1%     1%     2%     3%     4%
    Infiniband  Computation Cost    100%    99%    99%    98%    97%    96%

Table 5. Percentages of communication and computation costs in the elapsed times with the GigE and Infiniband interconnects

    Number of Processors               4      8     12     16     24     32
    GigE        Latency Cost         53%    56%    61%    63%    66%    72%
    GigE        Bandwidth Cost       47%    44%    39%    37%    34%    28%
    Infiniband  Latency Cost         55%    57%    62%    64%    68%    73%
    Infiniband  Bandwidth Cost       45%    43%    38%    36%    32%    27%

Table 6. Percentages of latency and bandwidth costs in the communication costs with the GigE and Infiniband interconnects

Summary
In this paper, a previously proposed least square error approach for determining the MPP LS-DYNA communication and computation costs is applied to the very large public-domain 3-vehicle collision crash model, a model representative of today's most demanding requirements. The result is verified against the elapsed times measured with the new, high-speed, low-latency Infiniband interconnect and is shown to be accurate to within 3 percent. The approach provides a quantitative breakdown of the MPP LS-DYNA simulation cost and can therefore be used to perform trade-off analyses and choose an optimum hardware configuration without extensive benchmark testing.

References
[1] Yih-Yih Lin, "A Quantitative Approach for Determining the Communication and Computation Costs in MPP LS-DYNA Simulations," FEA News, February 2004.
