MASSIVELY-PARALLEL MULTI-GPU SIMULATIONS FOR FAST AND ACCURATE AUTOMOTIVE AERODYNAMICS

6th European Conference on Computational Mechanics (ECCM 6)
7th European Conference on Computational Fluid Dynamics (ECFD 7)
11-15 June 2018, Glasgow, UK

CHRISTOPH A. NIEDERMEIER 1, CHRISTIAN F. JANSSEN 2 AND THOMAS INDINGER 1

1 FluiDyna GmbH, Edisonstraße 3, 85716 Unterschleißheim, Germany
{christoph.niedermeier,thomas.indinger}@fluidyna.de

2 Altair Engineering GmbH, Friedensallee 27, 22765 Hamburg, Germany
christian.janssen@altair.com

Key words: LBM, GPU, Automotive Aerodynamics, Multi-GPU

Abstract. In this paper, the scaling properties of the commercial Computational Fluid Dynamics (CFD) solver ultraFluidX are analyzed. The solver is based on the Lattice Boltzmann Method (LBM), is explicit in time, inherently transient and requires only next-neighbor information on the computational grid. ultraFluidX therefore benefits greatly from the computational power of Graphics Processing Units (GPUs). Additionally, the solver can make use of multiple GPUs on multiple compute nodes through an efficient implementation based on CUDA-aware MPI. The capabilities of the solver have already been demonstrated to various automotive OEMs around the world through several validation projects. This paper focuses on the scalability of ultraFluidX on single- and multi-node configurations in large-scale multi-GPU environments. Weak and strong scaling results for two selected test cases are reported and demonstrate that multi-GPU flow solvers have the potential to deliver high-fidelity results overnight.

1 INTRODUCTION

Aerodynamic properties of vehicles play a crucial role in the automotive design process. The required sophisticated level of knowledge about the flow field around the vehicle under consideration can only be achieved through wind tunnel experiments or highly resolved numerical simulations. The flow field in such a case is very complex and unsteady, and the corresponding CFD simulations have to be fully transient and highly resolved to capture all relevant physical effects. GPU-based CFD paves the way for a wide use of such high-fidelity simulations for automotive aerodynamics and gives a boost to green computing by using more energy-efficient hardware. ultraFluidX achieves unprecedented turnaround times of just a few hours for transient flow simulations of fully detailed production-level passenger and heavy-duty vehicles.

2 ULTRAFLUIDX

The current contribution is based on the commercial GPU-based CFD solver ultraFluidX. ultraFluidX is developed by FluiDyna GmbH and is exclusively distributed by Altair Engineering. The main motivation for the code development is to bring higher-fidelity models, fully transient analyses, short turnaround times and low total cost to automotive technical and business drivers.

ultraFluidX is based on the Lattice Boltzmann Method and a recent, highly accurate LBM collision operator based on cumulants [1, 2, 3]. The LBM collision model is accompanied by an LBM-consistent Smagorinsky LES model [4] and a wall modeling technique based on the work of [5]. The Cartesian LBM base mesh is combined with an octree-based grid refinement. Starting from a relatively coarse Cartesian background grid, the mesh is consecutively refined towards the obstacle surface. Between each refinement level, the mesh spacing is reduced by a factor of 2. The first refinement levels are typically defined via axis-aligned bounding boxes, as can be seen in Figure 1 (left). Higher refinement levels closer to the geometry are either based on custom user-defined volumes or are defined via geometry offsets. Kinks and thin parts are fully resolved in the volumetric preprocessing. Moreover, the appropriate physical modeling is selected automatically inside tight gaps and cavities. Representative production-level volume meshes are generated within the order of one hour.

Figure 1: Details of the ultraFluidX grids.

In addition, the code contains several enhanced features for automotive applications. These features typically come with additional computational overhead, which has to be considered in the performance evaluation and in comparisons to other available software packages. The features include, but are not limited to, porous media zones, support for moving floors (single- and five-belt systems), static floors with boundary layer suction, rotating wheel models (straightforward wall-velocity boundary conditions or more advanced MRF and overset grid approaches) and automatic convergence detection. The code also features advanced post-processing capabilities, such as window averaging, spatial drag/lift contributions, probe output and section cut output.
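To make the algorithmic structure behind these properties concrete, the following minimal Python/NumPy sketch implements a plain single-relaxation-time (BGK) D2Q9 lattice Boltzmann step on a periodic box. It is purely illustrative and deliberately simplified: ultraFluidX uses a cumulant collision operator, an LES model, wall functions and octree refinement on GPUs, none of which are shown here. The sketch only demonstrates why the method maps so well onto GPU hardware: the collision is cell-local and the streaming step needs next-neighbor data only.

# Minimal D2Q9 lattice Boltzmann sketch (single-relaxation-time BGK, periodic box).
# Illustrative only; not the cumulant/LES/wall-model algorithm of ultraFluidX.
import numpy as np

# D2Q9 lattice velocities and weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, ux, uy):
    """Second-order equilibrium distributions for all nine directions."""
    cu = c[:, 0, None, None]*ux + c[:, 1, None, None]*uy      # (9, ny, nx)
    usq = ux**2 + uy**2
    return w[:, None, None]*rho*(1.0 + 3.0*cu + 4.5*cu**2 - 1.5*usq)

def stream_and_collide(f, tau):
    """One explicit LBM time step: cell-local collision, next-neighbor streaming."""
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    f = f - (f - equilibrium(rho, ux, uy)) / tau              # collision (per cell)
    for i in range(9):                                        # streaming (next neighbor)
        f[i] = np.roll(f[i], shift=(c[i, 1], c[i, 0]), axis=(0, 1))
    return f

if __name__ == "__main__":
    nx, ny, tau = 128, 64, 0.6
    ux0 = 0.05*np.sin(2*np.pi*np.arange(ny)/ny)[:, None]*np.ones((ny, nx))
    f = equilibrium(np.ones((ny, nx)), ux0, np.zeros((ny, nx)))
    for _ in range(200):
        f = stream_and_collide(f, tau)
    print("mean density after 200 steps:", f.sum(axis=0).mean())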

3 PERFORMANCE ANALYSIS

In this section, several weak and strong scaling results from single- and multi-node configurations are presented. First, a rather academic test case is analyzed. Then, a realistic production case including extensive grid refinement and the additional automotive features discussed in the previous section is investigated. In particular, the case contains porous media, rotating wheels modeled with a wall-velocity boundary condition and a belt system.

The cases were run on three different test beds: two local machines with NVIDIA Tesla P100/V100 GPUs and a massively parallel supercomputer with NVIDIA Tesla P100s distributed across multiple nodes, as detailed in Table 1. The local P100 machine from Supermicro is equipped with ten P100 PCIe GPUs. The machine features a single PCIe root complex and hence allows for very efficient peer-to-peer communication between the GPUs. The second, massively parallel test bed consists of four nodes of the TSUBAME 3.0 supercomputer. Each of these nodes features four P100 SXM2 GPUs with first-generation NVLink, while the nodes are connected to each other via Intel Omni-Path. Moreover, the NVIDIA DGX-1 machine based on the Volta GPU architecture is tested. This machine features eight V100 GPUs and an even faster second-generation NVLink GPU interconnect.

Table 1: Details of the three test beds.

                     P100 PCIe (local)    P100 SXM2 (non-local)   V100 (local)
Model/Make           SYS-4028GR-TR2       TSUBAME 3.0             DGX-1
GPU Type             Tesla P100           Tesla P100              Tesla V100
GPU Architecture     NVIDIA Pascal        NVIDIA Pascal           NVIDIA Volta
SP Performance       9.3 TFLOPS           10.6 TFLOPS             15.7 TFLOPS
GPU Memory           16 GB HBM2           16 GB HBM2              16 GB HBM2
Memory Bandwidth     732 GB/s             732 GB/s                900 GB/s
Interconnect         PCIe 3.0 x16         NVLink 1.0              NVLink 2.0
GPUs per Node        10                   4                       8
Node Count           1                    4                       1

3.1 Empty wind tunnel (uniform grid, no refinement)

At first, an empty wind tunnel case is used to evaluate the single-GPU performance of ultraFluidX on the different hardware configurations, before turning to multi-GPU simulations. The rectangular domain has an aspect ratio of 6:3:2 with a constant-velocity inlet, a non-reflecting outlet and slip walls in between. The case was run on a single GPU, using eight successively refined, uniform grids with a total of 0.3 to 36 million voxels (refined in steps of two). The resulting single-GPU performance, averaged over a large number of iterations, is plotted in Figure 2. Significant differences between the Pascal and the Volta GPU architectures can be observed: the maximum performance of the V100 is approximately 1.9 times that of the P100 (PCIe). Among the GPUs with Pascal architecture, the slightly higher theoretical peak performance of the SXM2 model can also clearly be observed. Moreover, and more importantly, the single-GPU performance strongly depends on the GPU utilization. For less than one million voxels per GPU, the single-GPU performance drops below 80 % of the maximum performance. A similar trend can be seen for increasing voxel counts per GPU: again, the performance is reduced. For, e.g., 36 M voxels per GPU, a performance of 85 % to 90 % of the maximum performance is found.

Figure 2: Single-GPU performance for different test case sizes.
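The performance metric used in Figures 2 to 8 is MNUPS (million node, i.e. voxel, updates per second). The paper does not spell out the arithmetic, so the short sketch below shows how such a number is typically obtained from a timed run of a uniform, single-level grid; the voxel count, iteration count and wall-clock time are made-up placeholder values, not measured data.

def mnups(num_voxels: int, num_iterations: int, wall_clock_seconds: float) -> float:
    """Million node (voxel) updates per second for a uniform, single-level grid."""
    return num_voxels * num_iterations / wall_clock_seconds / 1.0e6

# Example with made-up values: a 36 million voxel grid advanced 1,000 iterations in 12 s.
# On multi-level grids, each voxel would additionally be weighted by its update frequency.
print(f"{mnups(36_000_000, 1_000, 12.0):.0f} MNUPS")  # -> 3000 MNUPS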

Then, weak scaling tests were conducted on all three hardware configurations using three of these grids (9 M, 18 M and 36 M voxels). To do so, the initial rectangular domain was extended downstream depending on the number of GPUs in order to keep the aspect ratio of 6:3:2 constant per GPU device. Excellent parallel efficiency is found for these tests (see Figure 3a): above 98 % for one to eight P100 GPUs in the local and non-local setups and above 95 % for up to eight V100s in the DGX-1. Moreover, no significant differences between the fully local P100 setups and the distributed multi-node setup are found (Figure 3b). Even for the simulations with 16 P100s spread across four different compute nodes, the parallel efficiency is above 97 %.

Figure 3: Weak scaling results for the empty channel case: parallel efficiency. (a) Overview over all three test beds. (b) Close-up: local vs. non-local P100s.
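As a reading aid for Figure 3, the following hedged sketch shows how a weak-scaling parallel efficiency of this kind is computed: since the per-GPU workload is kept constant, the aggregate performance on N GPUs is divided by N times the single-GPU performance. The single-GPU throughput of 2,800 MNUPS used in the example is an arbitrary placeholder; only the resulting 97.25 % efficiency corresponds to the multi-node value reported above.

def weak_scaling_efficiency(perf_n_gpus: float, perf_1_gpu: float, n_gpus: int) -> float:
    """Parallel efficiency when the total problem size grows with the GPU count."""
    return perf_n_gpus / (n_gpus * perf_1_gpu)

# Example: 16 P100s delivering 15.56x the (placeholder) single-GPU throughput of 2,800 MNUPS
print(f"{weak_scaling_efficiency(15.56 * 2800.0, 2800.0, 16):.2%}")  # -> 97.25%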

After these initial weak scaling runs, a set of strong scaling tests is run. Such tests are much more relevant for engineering applications, as they directly reflect the impact of the parallelization on the time-to-solution of the problem under consideration. By adding more and more GPUs, the individual GPU workload is successively reduced in order to evaluate if and how the overall simulation times can be reduced by using more computational resources. The resulting parallel efficiency for the empty wind tunnel case is depicted in Figure 4. The parallel efficiency for the 9 M case drops to 45 % for eight GPUs. On the other hand, the parallel efficiency for the 36 M case is above 100 % for the simulations on two and four GPUs. This is caused by the fact that the individual GPU performance changes as well in strong scaling tests, as it scales with the individual GPU workload, independently of the additional communication and parallelization overheads.

Figure 4: Strong scaling results for the empty channel for three selected grid sizes: parallel efficiency in relation to the single-GPU run (raw efficiency).

In order to assess the direct impact of the parallelization and communication overhead, the GPU performance has to be examined in relation to the raw GPU performance for the corresponding GPU utilization (voxels per GPU device). This effective efficiency is depicted in Figure 5. No overshoots above 100 % can be found, the performance degradation for high GPU counts is less pronounced, and, most importantly, the scaling from one to four GPUs shows a parallel efficiency of above 95 %.

Figure 5: Strong scaling results for the empty channel for three selected grid sizes: parallel efficiency in relation to the single-GPU run with the same workload (voxels per GPU, effective efficiency).
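The difference between the two metrics in Figures 4 and 5 can be summarized in a few lines of code. The sketch below is an illustration under assumptions: perf_single_gpu() is a hypothetical lookup of single-GPU performance as a function of utilization (the kind of curve Figure 2 provides), and the 10,400 MNUPS example value is made up. Raw efficiency normalizes by the single-GPU run of the full problem and can therefore exceed 100 %; effective efficiency normalizes by the single-GPU performance at the same voxels-per-GPU workload.

def perf_single_gpu(voxels: float) -> float:
    """Hypothetical single-GPU MNUPS as a function of utilization (placeholder data)."""
    table = [(1e6, 2200.0), (4.5e6, 2700.0), (9e6, 2850.0), (18e6, 2900.0), (36e6, 2600.0)]
    return min(table, key=lambda entry: abs(entry[0] - voxels))[1]

def raw_efficiency(total_perf_n: float, n_gpus: int, total_voxels: int) -> float:
    """Normalize by the single-GPU run of the full problem (can exceed 100 %)."""
    return total_perf_n / (n_gpus * perf_single_gpu(total_voxels))

def effective_efficiency(total_perf_n: float, n_gpus: int, total_voxels: int) -> float:
    """Normalize by the single-GPU performance at the same voxels-per-GPU workload."""
    return total_perf_n / (n_gpus * perf_single_gpu(total_voxels // n_gpus))

# Example: a 36 M voxel case on 4 GPUs delivering 10,400 MNUPS (made-up numbers)
print(f"raw:       {raw_efficiency(10_400.0, 4, 36_000_000):.1%}")
print(f"effective: {effective_efficiency(10_400.0, 4, 36_000_000):.1%}")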

Apart from efficiency considerations, strong scaling tests immediately reveal the potential for further accelerating the simulation under consideration, i.e., reducing the time-to-solution. In Figure 6, the total performance of the solver for the empty channel case is plotted. Again, good strong scaling properties can be observed. Moreover, it can be observed that the scaling strongly depends on the test case configuration and, more specifically, on the ratio of computational load to communication overhead. Exemplarily, we compare the baseline 36 M case with an aspect ratio of 6:3:2 to a stretched channel case that also consists of 36 M voxels, but with an aspect ratio of 24:3:2. The latter has a significantly reduced communication overhead because of the lower surface-to-volume ratio per subdomain. Therefore, while the compact channel case shows a clear performance saturation on the Pascal GPUs above four GPUs, the stretched channel case shows much better scaling (dashed line in Figures 4 to 6). On the DGX-1 with its Volta GPU architecture, the difference between the two setups is significantly less pronounced due to the higher bandwidth of the second-generation NVLink GPU interconnect.

Figure 6: Strong scaling results for the empty channel case: total performance for two different test case configurations with an aspect ratio of 6:3:2 (solid lines) and, exemplarily, an aspect ratio of 24:3:2 (dashed line).
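The surface-to-volume argument above can be quantified with a small sketch. It assumes, purely for illustration, a one-dimensional slab decomposition along the streamwise axis with one-voxel-wide halo layers; the actual domain decomposition and communication scheme of ultraFluidX are not described here. The point is only that stretching the domain from 6:3:2 to 24:3:2 at a constant 36 M voxels reduces the halo-to-interior ratio per subdomain by roughly a factor of 2.5.

def halo_fraction(aspect, total_voxels, n_gpus):
    """Fraction of a subdomain's voxels sitting in its two streamwise halo layers."""
    ax, ay, az = aspect
    s = (total_voxels / (ax * ay * az)) ** (1.0 / 3.0)   # scale so that ax*ay*az*s^3 == total_voxels
    ny, nz = ay * s, az * s                              # cross-section resolution
    voxels_per_gpu = total_voxels / n_gpus
    halo_voxels = 2.0 * ny * nz                          # one layer towards each neighbor
    return halo_voxels / voxels_per_gpu

for aspect in [(6, 3, 2), (24, 3, 2)]:
    print(aspect, f"{halo_fraction(aspect, 36e6, 8):.1%}")   # -> 2.7% vs. 1.1%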

3.2 Realistic wind tunnel with a generic car geometry

The scaling analysis of the channel flow simulations showed excellent scaling and a high code performance. For production cases, several computationally intensive features are added to the simulation setup that, in addition, introduce load imbalances. To assess the impact of the advanced automotive features on the scaling behavior of ultraFluidX, a generic roadster geometry (Figure 7) is used in the following. Four grid resolutions with 36 M, 71 M, 143 M and 286 M voxels were selected for the analysis. All cases have six grid refinement levels in addition to the coarse background mesh, with the vast majority of the voxels residing in levels three to six. On the highest refinement level, the chosen resolutions result in voxel sizes of 3.0 mm, 2.5 mm, 2.0 mm and 1.6 mm, with corresponding time step sizes of 5.3 µs, 4.2 µs, 3.3 µs and 2.5 µs. For these cases, the performance is averaged over fewer iterations, which is sufficient because of the larger compute time per iteration compared to the empty wind tunnel case.

Figure 7: Generic roadster geometry in the numerical wind tunnel. The section cut shows the instantaneous velocity magnitude in the y-normal symmetry plane of the wind tunnel.

The strong scaling results are depicted in Figure 8. This time, again the total performance is shown, as it corresponds to the most relevant property of such a simulation: the acceleration and the reduction in time-to-solution. Exemplarily, the 143 M voxel grid is discussed here. More than 80 % strong scaling efficiency is observed for all configurations when switching from four to eight GPUs. This is independent of the GPU architecture (Pascal vs. Volta) and the parallelization (single-node vs. multi-node). Overall, only approx. 2.6 hours of compute time would be necessary to simulate one second of physical time at the real Mach number of the case (e.g., for aeroacoustic analyses). If only the aerodynamics are of interest, a further reduction of the time-to-solution by a factor of two or more is possible through the use of an elevated Mach number, a common technique in LBM simulations.

Figure 8: Strong scaling results for the roadster case: total performance and parallel efficiency (for the scaling from four to eight GPUs).
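The turnaround-time statement above can be traced with a short back-of-the-envelope sketch. It assumes, for simplicity, a single-level grid in which every voxel is updated at the finest time step; the real setup has six refinement levels whose coarser cells are updated less frequently, and all input values below (voxel count, time step, sustained MNUPS) are placeholders rather than measured data. The result merely shows that a few hours per second of physical time is the right order of magnitude.

def hours_per_physical_second(num_voxels: float, dt_seconds: float, mnups: float) -> float:
    """Wall-clock hours to advance the whole grid through one second of physical time."""
    iterations = 1.0 / dt_seconds                 # time steps per physical second
    node_updates = num_voxels * iterations
    return node_updates / (mnups * 1.0e6) / 3600.0

# Example with placeholder values: 143 M voxels, 3.3 microsecond time step, 6,000 MNUPS sustained
print(f"{hours_per_physical_second(143e6, 3.3e-6, 6000.0):.1f} h")  # -> about 2.0 h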

4 SUMMARY AND CONCLUSIONS

This paper is concerned with the scaling analysis of the massively-parallel multi-GPU solver ultraFluidX. ultraFluidX can achieve turnaround times of just a few hours for simulations of fully detailed production-level passenger and heavy-duty vehicles. As far as scaling efficiency is concerned, ultraFluidX shows very good weak scaling behavior with a parallel efficiency above 95 % for an empty wind tunnel case. Even for a realistic automotive production case, ultraFluidX scales very well up to eight GPUs on all tested hardware configurations, no matter whether in a shared-memory system or a massively parallel multi-node configuration. The strong scaling efficiency is found to be above 80 % for realistic grid resolutions and hardware configurations. Selected applications of ultraFluidX will be presented at the ECFD conference, including passenger vehicles and full truck geometries [6]. The resulting turnaround times are very competitive and allow even highly resolved, high-fidelity CFD simulations to be embedded into a simulation-based design cycle.

REFERENCES

[1] Martin Geier, Martin Schönherr, Andrea Pasquali, and Manfred Krafczyk. The cumulant lattice Boltzmann equation in three dimensions: Theory and validation. Computers & Mathematics with Applications, 70(4):507-547, 2015.

[2] Martin Geier, Andrea Pasquali, and Martin Schönherr. Parametrization of the cumulant lattice Boltzmann method for fourth order accurate diffusion. Part I: Derivation and validation. Journal of Computational Physics, 348:862-888, 2017.

[3] Martin Geier, Andrea Pasquali, and Martin Schönherr. Parametrization of the cumulant lattice Boltzmann method for fourth order accurate diffusion. Part II: Application to flow around a sphere at drag crisis. Journal of Computational Physics, 348:889-898, 2017.

[4] Orestis Malaspinas and Pierre Sagaut. Consistent subgrid scale modelling for lattice Boltzmann methods. Journal of Fluid Mechanics, 700:514-542, 2012.

[5] Orestis Malaspinas and Pierre Sagaut. Wall model for large-eddy simulation based on the lattice Boltzmann method. Journal of Computational Physics, 275:25-40, 2014.

[6] Christoph A. Niedermeier, Hanna Ketterle, and Thomas Indinger. Evaluation of the Aerodynamic Drag Prediction Capability of different CFD Tools for Long-Distance Haulage Vehicles with Comparison to On-Road Tests. In SAE 2017 Commercial Vehicle Engineering Congress, 2017.