A Comprehensive Study on the Performance of Implicit LS-DYNA

12th International LS-DYNA Users Conference, Computing Technologies (4)

Yih-Yih Lin
Hewlett-Packard Company

Abstract

This work addresses four aspects of Implicit LS-DYNA's performance. First, Implicit LS-DYNA's scalability is characterized on multicore platforms. Second, the effectiveness of the GPU implementation is examined. Third, the effect of memory configuration on performance is presented, along with information on how to configure optimal memory size for a system. Fourth, the performance of out-of-core solutions is discussed.

Introduction

The cost per time step in Implicit LS-DYNA is much higher than in Explicit LS-DYNA. The major cost of an implicit solution is the matrix factorization performed when solving a linear algebraic system or an eigenproblem; compared with an explicit calculation, matrix factorization is both very CPU-intensive and very memory-intensive.

This paper investigates four aspects of the LS-DYNA implicit solution. First, it establishes, by characterizing its scalability, that LS-DYNA delivers unsurpassed speed for implicit solutions on modern multicore servers and clusters. Second, it investigates how the recent GPU implementation helps the performance of implicit solutions. Third, it illustrates the effect of memory configuration on performance by comparing a 1 DPC (DIMM per channel) system with a 2 DPC system, and presents information on how to configure a system with optimal memory sizes for implicit solutions. Fourth, it discusses the performance of out-of-core solutions.

In this investigation, the Hybrid LS-DYNA 971 R6.0 version and the GPU-enabled Hybrid LS-DYNA R6.0.1 beta version are used. Four cylinder models of 0.5, 1, 2 and 4 million solid elements, known as the Cyl0p5e6, Cyl1e6, Cyl2e6 and Cyl4e6 models, are studied. The following three HP multicore platforms are used:

- HP ProLiant DL980: a 1-TB shared-memory server comprising eight ten-core Intel Xeon E7-4870 processors at 2.4 GHz.
- HP ProLiant SL390 cluster: a 32-node cluster interconnected by InfiniBand QDR, each node comprising two six-core 2.93 GHz Intel Xeon X5670 processors and 48 GB of memory; 16 of the nodes are also configured with three NVIDIA Tesla M2090 6 GB GPUs.
- HP ProLiant SL230 cluster: an 8-node cluster interconnected by InfiniBand FDR, each node comprising two eight-core 2.6 GHz Intel Xeon E5-2670 processors and 64 GB of memory.

Scalabilities

As discussed later, an out-of-core solution is always slower than its corresponding in-core solution; moreover, the performance of the former depends on several factors and varies greatly. For this reason, in-core solutions are used in the following discussion unless otherwise noted.

Figure 1: Elapsed times for the Cyl4e6 model

Shown in Figure 1 are the elapsed times measured with 8, 16, 32, 48 and 64 cores on the DL980, and with 192 and 384 cores on the SL390. The corresponding scalabilities, relative to 8 cores on the DL980, are shown in Figure 2. The DL980 achieves slightly sublinear scalability: increasing the number of cores 8 times yields a speedup of 7 times. The SL390 achieves similar scalability at a greater number of cores: doubling the number of cores yields a speedup of 1.7 times. It is noteworthy that increasing the number of cores 48 times, from the 8 cores on the DL980 to the 384 cores on the SL390, yields a speedup of 23 times. (The Intel processors used in the DL980 and the SL390 are of the same generation.) This scalability is the best among all commercial implicit mechanics software.

Figure 2: Scalability for the Cyl4e6 model (from Figure 1)
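The speedup and scalability figures quoted above are simply ratios of elapsed times against the 8-core DL980 baseline. The short Python sketch below shows that arithmetic; the elapsed-time values are placeholders chosen only to reproduce the ratios reported in the text (about 7x at 64 cores and 23x at 384 cores, with the 192-core point derived from the 1.7x doubling figure), not the measured data plotted in Figure 1.

```python
# Speedup and parallel efficiency relative to an 8-core baseline.
# Elapsed times below are illustrative placeholders, not measured values.
baseline_cores = 8
baseline_time = 100_000.0          # hypothetical elapsed seconds on 8 cores

runs = {                           # cores -> hypothetical elapsed seconds
    64:  baseline_time / 7.0,      # ~7x speedup reported for 8 -> 64 cores
    192: baseline_time / 13.5,     # derived: 23x at 384 cores / 1.7x doubling
    384: baseline_time / 23.0,     # ~23x speedup reported for 8 -> 384 cores
}

for cores, elapsed in sorted(runs.items()):
    speedup = baseline_time / elapsed
    efficiency = speedup / (cores / baseline_cores)
    print(f"{cores:4d} cores: speedup {speedup:5.1f}x, "
          f"parallel efficiency {efficiency:5.1%}")
```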

Figure 3: Comparison of elapsed times with and without the GPU implementation. The number of cores increases as the model size increases.

Effectiveness of the GPU Implementation

As mentioned earlier, matrix factorization represents a major cost in an implicit solution. Because matrix factorization involves arithmetic operations on a great number of one-dimensional arrays of data, a GPU can accelerate it substantially. Figure 3 compares the elapsed times with and without the GPU implementation for three cases: the Cyl0p5e6 model with 12 cores, the Cyl1e6 model with 48 cores, and the Cyl2e6 model with 96 cores. The GPU implementation achieves a speedup of 2 to 2.5 times.

Memory Configuration

Matrix factorization is a process of continuously reading, updating and writing matrix data, so it exercises both the CPU and memory. Memory is much slower than the CPU, and it follows from Amdahl's law that improving the speed of the slower component, memory, is an effective way to improve the speed of an implicit solution. The computer industry offers an array of memory modules and configurations with varying speeds. Figure 4 shows that the SL230 configured with 2 DPC (DIMMs per channel) outperforms the same system configured with 1 DPC by a factor of 1.13 for the Cyl0p5e6 model and 1.21 for the Cyl1e6 model. This is a consequence of Amdahl's law, as a system configured with 2 DPC has higher memory bandwidth than one configured with 1 DPC.

Configuring systems with optimal memory size within a budget requires knowledge of the memory requirements of in-core solutions on typical models. Hybrid LS-DYNA is a distributed program in which SMP threads are first distributed across MPI ranks, which in turn are distributed across nodes; the memory requirement is therefore distributed per rank. Figure 5 shows the in-core memory requirement for the Cyl2e6 model with the number of ranks ranging from 1 to 8, and Figure 6 shows the same for the Cyl4e6 model with the number of ranks ranging from 4 to 64.
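The Amdahl's-law argument above can be made concrete with a small calculation. In the sketch below, the fraction of runtime limited by memory bandwidth and the effective bandwidth gain of 2 DPC over 1 DPC are assumptions chosen only for illustration; they are not measured values from this study.

```python
# Amdahl's-law estimate of the whole-solution speedup obtained by speeding up
# only the memory-bound portion of an implicit run.
def amdahl_speedup(bound_fraction: float, component_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - bound_fraction) + bound_fraction / component_speedup)

# Assumed values for illustration only:
memory_bound_fraction = 0.5   # assumed share of runtime limited by memory bandwidth
bandwidth_gain = 1.5          # assumed effective bandwidth gain of 2 DPC vs 1 DPC

estimate = amdahl_speedup(memory_bound_fraction, bandwidth_gain)
print(f"Estimated overall speedup: {estimate:.2f}x")
# With these assumptions the estimate is ~1.2x, the same order as the
# 1.13x and 1.21x gains reported for the SL230 in Figure 4.
```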

Figure 4: The 2 DPC configuration outperforms the 1 DPC configuration

It is noteworthy that the total memory requirement varies little with the number of ranks. If the solution is run on a single shared-memory server, like the DL980, the required memory size is simply this total. However, if the solution is run on a cluster of nodes with a uniform memory size, like the SL390, then the required memory size for each node is the maximum per-rank memory requirement times the number of ranks used on each node. Figure 7 shows the required memory per node as a function of the number of ranks for the Cyl4e6 model. Note that the required memory per node decreases almost linearly as the number of ranks increases. As a result, on a cluster with a small memory size per node, one can still solve a large implicit problem in-core by increasing the number of ranks, or equivalently the number of nodes.

Figure 5: Memory requirement for the Cyl2e6 model
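The per-node sizing rule above can be written as a short calculation. The sketch below is a minimal illustration of that rule; the per-rank memory requirements are hypothetical placeholders, not the values plotted in Figures 5 through 7, and the note on the memory=/memory2= command-line keywords reflects the common LS-DYNA convention (values given in words), which should be checked against the documentation for the version in use.

```python
# Per-node in-core memory sizing for a distributed (MPP/Hybrid) implicit run.
# Rule from the text: node requirement = max per-rank requirement x ranks per node.
def required_memory_per_node(per_rank_gb, ranks_per_node):
    """Memory (GB) each uniform cluster node must provide for an in-core run."""
    return max(per_rank_gb) * ranks_per_node

# Hypothetical per-rank requirements (GB) for an 8-rank decomposition;
# placeholders only, not the measured values of Figures 5-7.
per_rank_gb = [52, 50, 49, 51, 48, 50, 53, 49]

ranks_per_node = 2
print(f"Total in-core requirement: {sum(per_rank_gb)} GB")
print(f"Required per node at {ranks_per_node} ranks/node: "
      f"{required_memory_per_node(per_rank_gb, ranks_per_node)} GB")

# Note (assumed convention): LS-DYNA takes its allocation through the
# command-line keywords memory= (first rank) and memory2= (remaining ranks),
# specified in words rather than bytes; verify against the release notes.
```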

Figure 6: Memory requirement for the Cyl4e6 model

Figure 7: Memory requirement per node in a distributed cluster, using 2 ranks per node

Out-of-Core Solutions

An out-of-core solution uses both main memory and I/O to perform its memory operations, while an in-core solution uses only main memory. Since I/O is slower than main memory, an out-of-core solution is always slower than its corresponding in-core solution. However, the factors that determine the performance penalty of out-of-core solutions are many and complex. The following are some examples:

1. I/O systems with a wide range of speeds are offered by the computer industry.
2. In a shared I/O environment, the heavier the load, the slower the I/O.
3. The amount of free memory available for buffering affects the I/O speed.
4. The amount of free memory limits the main memory that can be allocated to the out-of-core solution, and thus limits how much of its work is done as in-core memory operations.

Figure 8 shows the relative performance of the out-of-core solution for the Cyl2e6 model on the DL980 with three different amounts of free memory. As shown, the case with the least free memory (326 GB) incurs a 60% speed penalty, a consequence of factor 3.

Figure 8: The amount of free memory can affect the speed of out-of-core solutions. The command-line option memory is set to the same size for the three out-of-core solutions.

Conclusion

In summary, this paper establishes the unsurpassed scalability of Implicit LS-DYNA on multicore platforms. It shows that a platform configured with the GPU implementation outperforms one without it by more than a factor of 2. Furthermore, it demonstrates the importance of memory configuration by showing that a platform configured with 2 DPC outperforms one with 1 DPC by a factor of 1.13 or more. Finally, the nature of out-of-core solutions is discussed.