Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters


Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Manfred Liebmann
Technische Universität München
Chair of Optimal Control, Center for Mathematical Sciences, M17
manfred.liebmann@tum.de
January 26, 2016

(1) The Vijayasundaram Method for Multi-Physics Euler Equations

The Euler equations are given by a system of conservation laws. We consider two gas species with densities ρ_1 and ρ_2 for the simulations and ideal gas state equations. More complicated and realistic state equations can also be handled by the ARMO simulation code. Let ρ_1, ρ_2 be the densities of the gas species and ρ = ρ_1 + ρ_2 the density of the gas, p the pressure, p_1, p_2, p_3 the components of the gas momentum density, and E the total energy density. Let x = (x_1, x_2, x_3) ∈ Ω ⊂ R^3 and t ∈ (0, T) ⊂ R be the space-time coordinates. Then the conserved quantity w(x, t) is given by

$$w = (\rho_1, \rho_2, p_1, p_2, p_3, E)^T \qquad (1)$$
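As a minimal sketch of how the six conserved quantities in (1) could be laid out per cell in code, assuming a simple array-of-structs layout (the type and field names below are illustrative, not the ARMO data structures):

/* Hypothetical per-cell layout of the conserved state vector w from Eq. (1). */
typedef struct {
    double rho1;          /* density of gas species 1 */
    double rho2;          /* density of gas species 2 */
    double p1, p2, p3;    /* momentum density components */
    double E;             /* total energy density */
} State;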

The flux vectors are defined as

$$f_k(w) = \begin{pmatrix} \rho_1 p_k/\rho \\ \rho_2 p_k/\rho \\ p_1 p_k/\rho + \delta_{1k}\, p \\ p_2 p_k/\rho + \delta_{2k}\, p \\ p_3 p_k/\rho + \delta_{3k}\, p \\ (E + p)\, p_k/\rho \end{pmatrix}, \qquad k \in \{1, 2, 3\} \qquad (2)$$

The Euler equations on the domain Ω × (0, T) can then be expressed as

$$\frac{\partial}{\partial t} w(x, t) + \frac{\partial}{\partial x_1} f_1(w(x, t)) + \frac{\partial}{\partial x_2} f_2(w(x, t)) + \frac{\partial}{\partial x_3} f_3(w(x, t)) = 0 \qquad (3)$$

and together with suitable boundary conditions the system can be solved with the finite volume approach. The finite volume method is obtained by applying Green's theorem:

$$\frac{d}{dt} \int_\Omega w(x, t)\, dx = -\int_{\partial\Omega} \left( f_1 n_1 + f_2 n_2 + f_3 n_3 \right) ds \qquad (4)$$

where n = (n_1, n_2, n_3) denotes the outer normal to the boundary ∂Ω.
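Equation (2) translates almost directly into code. The sketch below assumes the state is stored as a 6-component array in the ordering of Eq. (1) and that the pressure p from the equation of state is supplied by the caller; the function name and interface are illustrative, not the ARMO API.

/* Flux vector f_k(w) from Eq. (2) for k = 0, 1, 2 (0-based direction index).
   w = {rho1, rho2, p1, p2, p3, E}; p is the pressure from the equation of state. */
void flux(int k, const double w[6], double p, double f[6])
{
    const double rho = w[0] + w[1];     /* total density rho = rho1 + rho2 */
    const double vk  = w[2 + k] / rho;  /* velocity component p_k / rho */
    f[0] = w[0] * vk;                   /* rho1 * p_k / rho */
    f[1] = w[1] * vk;                   /* rho2 * p_k / rho */
    f[2] = w[2] * vk;                   /* momentum fluxes ... */
    f[3] = w[3] * vk;
    f[4] = w[4] * vk;
    f[2 + k] += p;                      /* ... plus the pressure term delta_{jk} p */
    f[5] = (w[5] + p) * vk;             /* (E + p) * p_k / rho */
}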

The discrete version is then derived by integration over a time interval [t_n, t_n + Δt] and averaging over the cells K_i:

$$w_{K_i}^{(n+1)} = w_{K_i}^{(n)} - \Delta t \sum_{j \in S(i)} \frac{|\Gamma_{ij}|}{|K_i|} \sum_{k=1}^{3} F_{k,\Gamma_{ij}}\!\left(w_{K_i}^{(n)}, w_{K_j}^{(n)}\right) n_k \qquad (5)$$

Here {K_i}_{i ∈ I} is a tetrahedral approximation of Ω, Γ_{ij} are the interfaces between the cells K_i and K_j, and the set S(i) stores the indices of the neighboring cells of K_i.
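A minimal sketch of the explicit update (5) for a single cell is shown below. It assumes per-cell neighbor lists, interface areas, and unit normals are available, and that a helper numerical_flux() returns the sum over k of F_{k,Γij} n_k for one interface; all names and array layouts are hypothetical, not the ARMO mesh data structures.

/* Assumed helper: fills flx with sum_k F_{k,Gamma_ij}(u, v) n_k for one interface. */
void numerical_flux(const double u[6], const double v[6], const double n[3], double flx[6]);

/* One explicit finite volume step, Eq. (5), for cell i.
   nei[m] lists the neighbor indices, area[m] the interface areas |Gamma_ij|,
   nrm[m] the unit normals, vol the cell volume |K_i|, dt the time step. */
void update_cell(int i, int nnb, const int *nei, const double *area,
                 const double (*nrm)[3], double vol, double dt,
                 const double (*w_old)[6], double w_new[6])
{
    double flx[6];
    for (int c = 0; c < 6; ++c) w_new[c] = w_old[i][c];
    for (int m = 0; m < nnb; ++m) {
        int j = nei[m];
        numerical_flux(w_old[i], w_old[j], nrm[m], flx);  /* sum over k of F_k n_k */
        for (int c = 0; c < 6; ++c)
            w_new[c] -= dt * area[m] / vol * flx[c];      /* minus sign of Eq. (5) */
    }
}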

The Vijayasundaram method defines the fluxes as

$$F_{k,\Gamma_{ij}}(u, v) = A_k^+\!\left(\frac{u+v}{2}\right) u + A_k^-\!\left(\frac{u+v}{2}\right) v, \qquad k = 1, 2, 3 \qquad (6)$$

The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of A_k = df_k/dw, k = 1, 2, 3, into positive and negative subspaces. Thus the matrices A_k^+, A_k^- are constructed from the positive and negative eigenvalues of

$$A_k = R_k \Lambda_k L_k, \qquad \Lambda_k = \mathrm{diag}(\lambda_{k,1}, \ldots, \lambda_{k,6}), \qquad k = 1, 2, 3 \qquad (7)$$

$$A_k^\pm = R_k \Lambda_k^\pm L_k, \qquad \Lambda_k^\pm = \mathrm{diag}(\lambda_{k,1}^\pm, \ldots, \lambda_{k,6}^\pm) \qquad (8)$$

$$\lambda_{k,i}^+ = \max(\lambda_{k,i}, 0), \qquad \lambda_{k,i}^- = \min(\lambda_{k,i}, 0), \qquad i = 1, \ldots, 6 \qquad (9)$$
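Given the eigendecomposition A_k = R Λ L evaluated at the interface average state (u+v)/2, the flux (6) can be assembled without forming A_k^± explicitly, since A_k^+ u + A_k^- v = R(Λ^+ L u + Λ^- L v). The sketch below illustrates this for the 6x6 system; computing R, Λ, L themselves is assumed to happen elsewhere.

/* Vijayasundaram flux, Eqs. (6)-(9): apply A_k^+ to u and A_k^- to v using the
   eigendecomposition A_k = R diag(lam) L at the average state (u+v)/2. */
void vijaya_flux(const double R[6][6], const double lam[6], const double L[6][6],
                 const double u[6], const double v[6], double F[6])
{
    double Lu[6], Lv[6];
    for (int i = 0; i < 6; ++i) {           /* project onto characteristic variables */
        Lu[i] = Lv[i] = 0.0;
        for (int j = 0; j < 6; ++j) {
            Lu[i] += L[i][j] * u[j];
            Lv[i] += L[i][j] * v[j];
        }
    }
    for (int i = 0; i < 6; ++i) {           /* F = R (lam^+ L u + lam^- L v) */
        F[i] = 0.0;
        for (int j = 0; j < 6; ++j) {
            double lp = lam[j] > 0.0 ? lam[j] : 0.0;  /* lambda^+ = max(lambda, 0) */
            double lm = lam[j] < 0.0 ? lam[j] : 0.0;  /* lambda^- = min(lambda, 0) */
            F[i] += R[i][j] * (lp * Lu[j] + lm * Lv[j]);
        }
    }
}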

(2) ARMO CPU/GPU Algorithms

High level parallel CPU algorithm:

Require: f, g, com, nei, geo, pio
Require: t_max, i_max, C, σ, m, n
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange(m, n, f, g, com)
    mpi_alltoall(m, n, g, f)
    vijaya(n, nei, geo, pio, f, g, σ)
    mpi_allreduce_max(σ)
    update(n, f, g, σ, C)
    i ← i + 1
    t ← t + C/σ
  end while
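In C with MPI, a loop of this shape might look like the following sketch. The routines exchange(), vijaya() and update() stand for the corresponding ARMO kernels; their signatures here are assumptions for illustration only.

#include <mpi.h>

/* Hypothetical prototypes standing in for the ARMO routines of the pseudocode. */
void   exchange(int m, int n, double *f, double *g, const int *com);  /* pack + all-to-all halo exchange */
double vijaya(int n, const int *nei, const double *geo, const double *pio,
              const double *f, double *g);                            /* fluxes; returns local sigma */
void   update(int n, double *f, const double *g, double sigma, double C);

void cpu_time_loop(int m, int n, double *f, double *g, const int *com,
                   const int *nei, const double *geo, const double *pio,
                   double t_max, int i_max, double C)
{
    double t = 0.0;
    int i = 0;
    while (t < t_max && i < i_max) {
        exchange(m, n, f, g, com);                      /* exchange(...) and mpi_alltoall(m, n, g, f) */
        double sigma = vijaya(n, nei, geo, pio, f, g);  /* Vijayasundaram fluxes + wave-speed bound */
        MPI_Allreduce(MPI_IN_PLACE, &sigma, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        update(n, f, g, sigma, C);                      /* explicit update with dt = C / sigma */
        i += 1;
        t += C / sigma;
    }
}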

High level parallel GPU algorithm:

Require: f_D, g_D, com_D, nei_D, geo_D, pio_D, σ_D
Require: t_max, i_max, C, σ, m, n, snd, rcv
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange_D(m, n, f_D, g_D, com_D)
    device_to_host(n, g_D, snd)
    mpi_alltoall(snd, rcv)
    host_to_device(n, f_D, rcv)
    vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, σ_D)
    device_to_host(σ_D, σ)
    mpi_allreduce_max(σ)
    host_to_device(σ_D, σ)
    update_D(n, f_D, g_D, σ_D, C)
    i ← i + 1
    t ← t + C/σ
  end while
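The GPU variant differs mainly in the explicit device/host transfers around the MPI calls: only the packed halo buffers and the reduced wave speed σ travel over PCIe each iteration. A host-side sketch using the CUDA runtime API is shown below; the kernel wrappers and buffer layout are assumptions mirroring the pseudocode, not the ARMO implementation.

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical wrappers for the device kernels named in the pseudocode. */
void exchange_D(int m, int n, double *f_D, double *g_D, const int *com_D);
void vijaya_D(int n, const int *nei_D, const double *geo_D, const double *pio_D,
              double *f_D, double *g_D, double *sigma_D);  /* reduces sigma on the device */
void update_D(int n, double *f_D, double *g_D, const double *sigma_D, double C);

void gpu_time_loop(int m, int n, double *f_D, double *g_D, const int *com_D,
                   const int *nei_D, const double *geo_D, const double *pio_D,
                   double *sigma_D, double *snd, double *rcv, int halo_count,
                   double t_max, int i_max, double C)
{
    const size_t halo_bytes = (size_t)halo_count * sizeof(double);  /* doubles per rank, assumed */
    double t = 0.0, sigma;
    int i = 0;
    while (t < t_max && i < i_max) {
        exchange_D(m, n, f_D, g_D, com_D);                         /* pack halo data on the GPU */
        cudaMemcpy(snd, g_D, halo_bytes, cudaMemcpyDeviceToHost);  /* device_to_host(n, g_D, snd) */
        MPI_Alltoall(snd, halo_count, MPI_DOUBLE,
                     rcv, halo_count, MPI_DOUBLE, MPI_COMM_WORLD);
        cudaMemcpy(f_D, rcv, halo_bytes, cudaMemcpyHostToDevice);  /* host_to_device; ghost offsets omitted */
        vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, sigma_D);
        cudaMemcpy(&sigma, sigma_D, sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Allreduce(MPI_IN_PLACE, &sigma, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        cudaMemcpy(sigma_D, &sigma, sizeof(double), cudaMemcpyHostToDevice);
        update_D(n, f_D, g_D, sigma_D, C);
        i += 1;
        t += C / sigma;
    }
}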

(3) ARMO CPU/GPU Benchmarks

Figure 1: GPU Cluster: mephisto.uni-graz.at

GPU Computing Hardware

kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM)
mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM)
iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM)
gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM)
fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM)

GPU Clusters and Servers

kepler: 2x Intel Xeon E5-2650 @ 2.0 GHz with 256 GB RAM (4x Tesla K20)
mephisto: 12x Intel Xeon X5650 @ 2.67 GHz with 520 GB RAM (20x Tesla C2070)
iscsergpu: 8x Intel Core i7 965 @ 3.2 GHz with 12 GB RAM (32x GTX 295)
gtx: AMD Phenom 9950 @ 2.6 GHz with 8 GB RAM (4x GTX 280)
fermi: Intel Core i7 920 @ 2.66 GHz with 12 GB RAM (2x GTX 480)

CPU Clusters and Servers

memo: 8x Intel Xeon X7560 @ 2.27 GHz with 1024 GB RAM
penge: 12x Dual Intel Xeon E5450 @ 3.0 GHz with 16 GB RAM
quad2: 4x AMD Opteron 8347 @ 1.9 GHz with 32 GB RAM

Benchmark example: Intake port of a diesel engine with 155,325 elements.

Four pieces of the intake port for parallel processing using domain decomposition.

CPU cores  | memo  | quad2 | gtx   | iscsergpu | penge    | fermi | kepler | mephisto
1          | 12.35 | 33.58 | 19.37 | 9.32      | 11.74    | 10.37 | 12.13  | 10.84
2          | 5.94  | 16.07 | 9.26  | 4.55      | 5.08     | 5.02  | 6.27   | 5.25
4          | 2.96  | 7.59  | 4.47  | 2.29      | 2.47     | 2.54  | 3.07   | 2.63
8 (6)      | 1.44  | 3.13  |       | 1.81 [1]  | 1.27 [1] | 2.11  | 1.50   | (1.76)
16 (12)    | 0.68  | 1.38  |       | 1.09 [2]  | 0.64 [2] |       | 0.72   | (0.84) [1]
32 (24)    | 0.35  |       |       | 0.65 [4]  | 0.33 [4] |       |        | (0.41) [2]
64 (48)    | 0.18  |       |       |           | 0.17 [8] |       |        | (0.21) [4]
Speedup    | 68.22 | 24.21 | 4.33  | 14.34     | 67.47    | 4.91  | 16.85  | 51.62
Efficiency | 1.07  | 1.51  | 1.08  | 0.45      | 1.05     | 0.61  | 1.05   | 1.07

GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off)
1          | 0.284 | 0.380     | 0.156 | 0.120  | 0.245 / 0.184
2          | 0.141 | 0.175     | 0.090 | 0.070  | 0.168 / 0.108
4          | 0.086 | 0.098     |       | 0.047  | 0.142 / 0.063 [1]
8          |       | 0.069 [1] |       |        | 0.120 / 0.045 [2]
16         |       |           |       |        | 0.128 / 0.039 [4]
Speedup    | 3.30  | 5.51      | 1.73  | 2.55   | 1.91 / 4.72
Efficiency | 0.82  | 0.69      | 0.86  | 0.64   | 0.11 / 0.29

Table 1: Parallel scalability benchmark for the intake port with 155,325 elements. Core counts in parentheses refer to the parenthesized mephisto entries; bracketed numbers [n] give the number of cluster nodes used.
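The Speedup and Efficiency rows in Tables 1-4 are consistent with the usual definitions below (a reading of the reported numbers, not stated explicitly on the slides), where T_1 is the single-core or single-GPU run time and T_p the run time on p cores or GPUs:

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}$$

For example, for memo in Table 1: S_64 = 12.35 / 0.18 ≈ 68.2 and E_64 ≈ 68.2 / 64 ≈ 1.07.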

Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements.

CPU cores  | quad2  | gtx   | iscsergpu | fermi | kepler | mephisto
1          | 135.80 | 79.65 | 40.62     | 47.41 | 55.65  | 48.28
2          | 65.85  | 38.55 | 20.13     | 23.50 | 27.11  | 23.68
4          | 32.73  | 19.06 | 10.23     | 11.89 | 13.68  | 11.85
8 (6)      | 15.67  |       | 7.86 [1]  | 9.41  | 6.89   | (7.92)
16 (12)    | 7.61   |       | 4.22 [2]  |       | 3.26   | (3.75) [1]
32 (24)    |        |       | 2.42 [4]  |       |        | (1.74) [2]
64 (48)    |        |       |           |       |        | (0.84) [4]
Speedup    | 19.06  | 4.13  | 17.27     | 5.04  | 17.07  | 57.48
Efficiency | 1.19   | 1.03  | 0.54      | 0.63  | 1.07   | 1.20

GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off)
1          | 1.186 | 1.561     | 0.617 | 0.459  | 1.011 / 0.740
2          | 0.540 | 0.702     | 0.312 | 0.211  | 0.523 / 0.369
4          | 0.275 | 0.337     |       | 0.116  | 0.307 / 0.199 [1]
8          |       | 0.185 [1] |       |        | 0.203 / 0.132 [2]
16         |       |           |       |        | 0.155 / 0.100 [4]
Speedup    | 5.00  | 11.60     | 1.98  | 3.96   | 6.52 / 7.40
Efficiency | 1.25  | 1.45      | 0.99  | 0.99   | 0.41 / 0.46

Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements.

CPU cores  | quad2  | gtx    | iscsergpu | fermi  | kepler | mephisto
1          | 415.00 | 259.89 | 142.83    | 174.55 | 209.01 | 172.26
2          | 203.15 | 128.70 | 72.03     | 85.06  | 103.96 | 86.39
4          | 105.69 | 65.90  | 37.27     | 43.64  | 52.60  | 43.78
8 (6)      | 55.34  |        | 29.47 [1] | 35.17  | 27.03  | (29.74)
16 (12)    | 29.16  |        | 14.77 [2] |        | 12.95  | (14.58) [1]
32 (24)    |        |        | 7.40 [4]  |        |        | (7.16) [2]
64 (48)    |        |        | 3.75 [8]  |        |        | (3.49) [4]
Speedup    | 14.23  | 3.94   | 38.09     | 4.96   | 16.14  | 49.36
Efficiency | 0.89   | 0.99   | 0.60      | 0.62   | 1.01   | 1.03

GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off)
1          | 3.955 | 4.683     | 2.160 | 1.247  | 2.534 / 2.406
2          | 1.694 | 2.052     | 1.082 | 0.635  | 1.307 / 1.212
4          | 0.841 | 1.002     |       | 0.330  | 0.721 / 0.671 [1]
8          |       | 0.514 [1] |       |        | 0.423 / 0.342 [2]
16         |       | 0.320 [2] |       |        | 0.265 / 0.206 [4]
Speedup    | 4.70  | 14.63     | 2.00  | 3.78   | 9.56 / 11.70
Efficiency | 1.18  | 0.91      | 1.00  | 0.94   | 0.60 / 0.73

Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements.

CPU cores  | quad2  | gtx    | iscsergpu  | fermi  | kepler | mephisto
1          | 1384.5 | 916.89 | 508.74     | 603.83 | 752.71 | 630.41
2          | 693.25 | 462.34 | 257.83     | 305.15 | 374.63 | 315.16
4          | 361.81 | 238.70 | 132.20     | 156.57 | 189.02 | 160.26
8 (6)      | 200.29 |        | 110.17 [1] | 128.98 | 97.01  | (109.45)
16 (12)    | 108.48 |        | 55.93 [2]  |        | 48.44  | (54.44) [1]
32 (24)    |        |        | 28.20 [4]  |        |        | (27.16) [2]
64 (48)    |        |        | 14.11 [8]  |        |        | (13.66) [4]
Speedup    | 12.76  | 3.84   | 36.05      | 4.68   | 15.54  | 46.15
Efficiency | 0.80   | 0.96   | 0.56       | 0.59   | 0.97   | 0.96

GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off)
1          | *     | *         | 7.896 | 4.071  | 9.405 / 9.316
2          | 6.602 | 7.619     | 3.964 | 2.038  | 4.721 / 4.686
4          | 3.088 | 3.529     |       | 1.027  | 2.403 / 2.365 [1]
8          |       | 1.725 [1] |       |        | 1.264 / 1.184 [2]
16         |       | 0.935 [2] |       |        | 0.686 / 0.618 [4]
32 (24)    |       | 0.701 [4] |       |        | (0.495) / * [6]
64         |       | 0.495 [8] |       |        |
Speedup    | 4.28  | 30.78     | 1.99  | 3.96   | 13.71 / 15.07
Efficiency | 1.07  | 0.48      | 1.00  | 0.99   | 0.86 / 0.94

Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements.

Effective GFLOPS for ARMO Simulator

CPU / GPU Hardware                | Intake-port 155,325 | Nozzle 642,700 | Nozzle 2,570,800 | Nozzle 10,283,200
kepler: 2x Intel Xeon E5-2650     | 29.68 [2]           | 27.12 [2]      | 27.32 [2]        | 29.21 [2]
kepler: 4x Nvidia Tesla K20       | 454.74 [4]          | 762.38 [4]     | 1071.95 [4]      | 1377.78 [4]
mephisto: 16x Nvidia Tesla C2070  | 548.02 [16]         | 884.36 [16]    | 1717.19 [16]     | 2289.59 [16]
iscsergpu: 32x Nvidia GTX 295     | 309.75 [8]          | 478.03 [8]     | 1105.44 [16]     | 2858.52 [64]

Table 5: Effective GFLOPS for the ARMO simulator. GPU cluster performance is equivalent to 800–1600 CPU cores!

Conclusions

GPUs deliver excellent performance for CFD problems!
800–1600x speedup on a GPU cluster with 4–64 GPUs compared with a single modern CPU core
New GPU hardware: the Maxwell architecture brings even more performance
The CUDA programming model fits well
Essential software design decision: element-based loops!