Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters


Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Manfred Liebmann
Technische Universität München, Chair of Optimal Control
Center for Mathematical Sciences, M17
manfred.liebmann@tum.de

October 26, 2015

Projects and Collaborations

Collaboration Partners
- Zoltán Horváth, Széchenyi István University, Hungary (TAMOP Project)
- Craig C. Douglas, University of Wyoming, USA (GPU Cluster)
- Gundolf Haase, University of Graz, Austria (SFB MOBIS)
- Charles Hirsch, NUMECA International S.A., Belgium (E-CFD-GPU Project)

Overview

- High Performance Computing with GPUs
- The Vijayasundaram Method for Multi-Physics Euler Equations
- ARMO CPU/GPU Algorithms
- ARMO CPU/GPU Benchmarks

(1) High Performance Computing with GPUs

GPU: Graphics Processing Unit
- Many-core / many-thread architecture: thousands of compute cores
- Double precision: 1.5 TFLOPS; single precision: 4 TFLOPS
- Shared memory architecture with very high memory bandwidth: 300 GB/s
- Free Nvidia CUDA compiler and development tools
- Big GPU players: Nvidia, AMD, Intel

Nvidia Tesla K40 GPU architecture

- 1.43 TFLOPS double precision / 4.29 TFLOPS single precision
- 2880 CUDA cores per GPU / 12 GB on-board ECC RAM
- 288 GB/s memory bandwidth to on-board ECC RAM
- L1/L2 cache hierarchy / 64-bit memory address space

Floating-Point Operations per Second for the CPU and GPU

Memory Bandwidth for the CPU and GPU

GPU Software Design Essentials

- Extreme multithreading: schedule millions of threads
- Shared memory architecture: think OpenMP
- Memory access coalescing: still important (see the sketch after this list)
- Utilize L1/L2/texture caches and shared memory

Pitfalls
- Noncoalesced random read/write to memory is slow
- Atomic memory operations are very expensive
- Big, branchy code blocks are bad for GPUs: code serialization
- Heavy register use in GPU kernels limits device utilization
- FLOPS are free, memory access is expensive!
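To make the coalescing point concrete, here is a minimal CUDA sketch (hypothetical kernels, not part of the ARMO code) contrasting a coalesced access pattern with a strided one:

```cuda
// Coalesced: adjacent threads read adjacent addresses, so the hardware can
// merge the loads of a warp into a few wide memory transactions.
__global__ void scale_coalesced(const double* in, double* out, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Strided: each thread of a warp touches a different cache line, so the same
// arithmetic needs many more memory transactions and effective bandwidth drops.
__global__ void scale_strided(const double* in, double* out, int n, int stride, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = a * in[i * stride];
}
```

On bandwidth-bound kernels such as the flux loops discussed below, the second pattern can reduce effective memory throughput by a large factor.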

(2) The Vijayasundaram Method for Multi-Physics Euler Equations

The Euler equations are given by a system of differential equations. We consider two gas species with densities ρ_1 and ρ_2 for the simulations and ideal gas state equations. More complicated and realistic state equations can also be handled by the ARMO simulation code.

Let ρ_1, ρ_2 be the densities of the gas species and ρ = ρ_1 + ρ_2 the density of the gas, p the pressure, p_1, p_2, p_3 the components of the gas momentum density, and E the total energy density. Let x = (x_1, x_2, x_3) ∈ Ω ⊂ R^3 and t ∈ (0, T) ⊂ R be the space-time coordinates. Then the conserved quantity w(x, t) is given by

w = (ρ_1, ρ_2, p_1, p_2, p_3, E)^T   (1)

and the flux vectors are defined as

f_k(w) = ( ρ_1 p_k/ρ, ρ_2 p_k/ρ, p_1 p_k/ρ + δ_1k p, p_2 p_k/ρ + δ_2k p, p_3 p_k/ρ + δ_3k p, (E + p) p_k/ρ )^T,  k ∈ {1, 2, 3}   (2)

The Euler equations on the domain Ω × (0, T) can then be expressed as

∂w(x, t)/∂t + ∂f_1(w(x, t))/∂x_1 + ∂f_2(w(x, t))/∂x_2 + ∂f_3(w(x, t))/∂x_3 = 0   (3)

and together with suitable boundary conditions the system can be solved with the finite volume approach.
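For illustration, the flux vector (2) maps directly onto a few lines of code. A minimal sketch (illustrative names, 0-based component index k, pressure p supplied by the state equation; not the actual ARMO routine):

```cuda
// Conserved state w = (rho1, rho2, p1, p2, p3, E) and flux f_k(w) as in (1)-(2).
// __host__ __device__ so the same routine can be used on CPU and GPU.
// k is 0-based here: k = 0, 1, 2 corresponds to k = 1, 2, 3 in the slides.
__host__ __device__ void flux(const double w[6], double p, int k, double f[6])
{
    double rho = w[0] + w[1];       // total density rho = rho1 + rho2
    double vk  = w[2 + k] / rho;    // k-th velocity component p_k / rho
    f[0] = w[0] * vk;               // rho1 * p_k / rho
    f[1] = w[1] * vk;               // rho2 * p_k / rho
    f[2] = w[2] * vk;               // momentum components p_i * p_k / rho
    f[3] = w[3] * vk;
    f[4] = w[4] * vk;
    f[5] = (w[5] + p) * vk;         // (E + p) * p_k / rho
    f[2 + k] += p;                  // add delta_{ik} * p on the momentum rows
}
```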

The finite volume method can be formulated by applying Green's theorem,

d/dt ∫_Ω w(x, t) dx = − ∫_∂Ω ( f_1 n_1 + f_2 n_2 + f_3 n_3 ) ds   (4)

where n = (n_1, n_2, n_3) denotes the outer normal to the boundary ∂Ω. The discrete version is then derived by integration over a time interval [t_n, t_n + Δt] and averaging over the cells K_i:

w_{K_i}^(n+1) = w_{K_i}^(n) − (Δt / |K_i|) Σ_{j ∈ S(i)} |Γ_ij| Σ_{k=1}^3 F_{k,Γ_ij}(w_{K_i}^(n), w_{K_j}^(n)) n_k   (5)

Here {K_i}_{i ∈ I} is a tetrahedral approximation of Ω, Γ_ij is the interface between the cells K_i and K_j, and the set S(i) stores the indices of the neighboring cells of K_i.
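A serial reference sketch of the explicit update (5), assuming an illustrative face-based storage of neighbours, face areas/normals, and cell volumes (the actual ARMO data layout in nei and geo differs):

```cuda
// Numerical face flux F_{k,Gamma_ij}(w_i, w_j) n_k summed over k, e.g. the
// Vijayasundaram flux of the next slide; declared here, defined elsewhere.
void numerical_flux(const double* wi, const double* wj, const double* n, double* F);

// Explicit finite volume update (5), serial reference version.
// w_old/w_new hold the 6 conserved variables per cell; nbr[i][f] is the
// neighbour across face f of cell i, area/normal describe that face, vol[i] = |K_i|.
void fv_update(int ncells, const int* nface, const int* const* nbr,
               const double* const* area, const double* const* normal,
               const double* vol, const double* w_old, double* w_new, double dt)
{
    for (int i = 0; i < ncells; ++i) {
        double sum[6] = {0, 0, 0, 0, 0, 0};
        for (int f = 0; f < nface[i]; ++f) {
            int j = nbr[i][f];
            double F[6];
            numerical_flux(&w_old[6 * i], &w_old[6 * j], &normal[i][3 * f], F);
            for (int c = 0; c < 6; ++c) sum[c] += area[i][f] * F[c];  // |Gamma_ij| * F
        }
        for (int c = 0; c < 6; ++c)
            w_new[6 * i + c] = w_old[6 * i + c] - dt / vol[i] * sum[c];
    }
}
```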

The Vijayasundaram method defines the fluxes as

F_{k,Γ_ij}(u, v) = A_k^+((u + v)/2) u + A_k^−((u + v)/2) v,  k = 1, 2, 3   (6)

The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of A_k = df_k/dw, k = 1, 2, 3, into positive and negative subspaces. Thus the matrices A_k^+, A_k^− are constructed from the positive and negative eigenvalues of A_k:

A_k = R_k Λ_k L_k,  Λ_k = diag(λ_{k,1}, ..., λ_{k,6}),  k = 1, 2, 3   (7)

A_k^± = R_k Λ_k^± L_k,  Λ_k^± = diag(λ_{k,1}^±, ..., λ_{k,6}^±)   (8)

λ_{k,i}^+ = max(λ_{k,i}, 0),  λ_{k,i}^− = min(λ_{k,i}, 0),  i = 1, ..., 6   (9)
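The method boils down to a small dense eigendecomposition per face. A hedged sketch of (6)-(9) for one coordinate direction k, assuming an eigensystem routine for A_k that is not reproduced here:

```cuda
#include <cmath>

// Assumed to provide the eigenvalues lam and the right/left eigenvector
// matrices R, L of A_k = df_k/dw at the averaged state (u+v)/2; its body is
// part of the ARMO code and not shown here.
void eigensystem_k(int k, const double wavg[6], double lam[6],
                   double R[6][6], double L[6][6]);

// Vijayasundaram flux (6): F_k(u,v) = A_k^+((u+v)/2) u + A_k^-((u+v)/2) v with
// A_k^± = R_k Lam_k^± L_k, lam^+ = max(lam,0), lam^- = min(lam,0) as in (7)-(9).
// The split matrices are never formed explicitly: apply L, scale, apply R.
void vijayasundaram_flux_k(int k, const double u[6], const double v[6], double F[6])
{
    double wavg[6], lam[6], R[6][6], L[6][6], Lu[6], Lv[6], t[6];
    for (int c = 0; c < 6; ++c) wavg[c] = 0.5 * (u[c] + v[c]);
    eigensystem_k(k, wavg, lam, R, L);
    for (int i = 0; i < 6; ++i) {
        Lu[i] = Lv[i] = 0.0;
        for (int j = 0; j < 6; ++j) {
            Lu[i] += L[i][j] * u[j];
            Lv[i] += L[i][j] * v[j];
        }
        t[i] = std::fmax(lam[i], 0.0) * Lu[i] + std::fmin(lam[i], 0.0) * Lv[i];
    }
    for (int i = 0; i < 6; ++i) {
        F[i] = 0.0;
        for (int j = 0; j < 6; ++j) F[i] += R[i][j] * t[j];
    }
}
```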

(3) ARMO CPU/GPU Algorithms

High-level parallel CPU algorithm:

Require: f, g, com, nei, geo, pio
Require: t_max, i_max, C, σ, m, n
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange(m, n, f, g, com)
    mpi_alltoall(m, n, g, f)
    vijaya(n, nei, geo, pio, f, g, σ)
    mpi_allreduce_max(σ)
    update(n, f, g, σ, C)
    i ← i + 1
    t ← t + C/σ
  end while
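A compact C/MPI sketch of this loop, with illustrative signatures for the ARMO routines exchange, vijaya and update (declared but not shown) and standard MPI collectives standing in for mpi_alltoall and mpi_allreduce_max:

```cuda
#include <mpi.h>

// Illustrative declarations of the ARMO kernels used below (bodies not shown).
void exchange(int m, int n, const double* f, double* g, const int* com);
void vijaya(int n, const int* nei, const double* geo, const double* pio,
            const double* f, double* g, double* sigma);
void update(int n, double* f, const double* g, double sigma, double C);

// High-level CPU time loop: halo exchange, flux evaluation, global CFL
// reduction, explicit update, as in the pseudocode above.
void armo_cpu_loop(int m, int n, double* f, double* g, const int* com,
                   const int* nei, const double* geo, const double* pio,
                   double t_max, int i_max, double C)
{
    double t = 0.0;
    for (int i = 0; t < t_max && i < i_max; ++i) {
        double sigma = 0.0;
        exchange(m, n, f, g, com);                 // pack interface data into g
        MPI_Alltoall(g, m, MPI_DOUBLE, f, m, MPI_DOUBLE, MPI_COMM_WORLD);
        vijaya(n, nei, geo, pio, f, g, &sigma);    // fluxes + local sigma
        MPI_Allreduce(MPI_IN_PLACE, &sigma, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        update(n, f, g, sigma, C);                 // explicit update of f
        t += C / sigma;                            // CFL-limited time step
    }
}
```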

High-level parallel GPU algorithm:

Require: f_D, g_D, com_D, nei_D, geo_D, pio_D, σ_D
Require: t_max, i_max, C, σ, m, n, snd, rcv
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange_D(m, n, f_D, g_D, com_D)
    device_to_host(n, g_D, snd)
    mpi_alltoall(snd, rcv)
    host_to_device(n, f_D, rcv)
    vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, σ_D)
    device_to_host(σ_D, σ)
    mpi_allreduce_max(σ)
    host_to_device(σ_D, σ)
    update_D(n, f_D, g_D, σ_D, C)
    i ← i + 1
    t ← t + C/σ
  end while
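The GPU variant has the same structure and only adds device/host staging around the MPI calls. A hedged CUDA/MPI sketch with illustrative wrapper signatures for the _D kernels and a single buffer size nbytes for the interface data:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative device-side wrappers around the ARMO kernels (launches inside).
void exchange_d(int m, int n, const double* f_d, double* g_d, const int* com_d);
void vijaya_d(int n, const int* nei_d, const double* geo_d, const double* pio_d,
              const double* f_d, double* g_d, double* sigma_d);
void update_d(int n, double* f_d, const double* g_d, const double* sigma_d, double C);

// High-level GPU time loop: same steps as the CPU loop, plus device<->host
// copies to stage the MPI interface data and the global CFL value sigma.
void armo_gpu_loop(int m, int n, double* f_d, double* g_d, const int* com_d,
                   const int* nei_d, const double* geo_d, const double* pio_d,
                   double* sigma_d, double* snd, double* rcv, size_t nbytes,
                   double t_max, int i_max, double C)
{
    double t = 0.0, sigma = 0.0;
    for (int i = 0; t < t_max && i < i_max; ++i) {
        exchange_d(m, n, f_d, g_d, com_d);                        // pack on device
        cudaMemcpy(snd, g_d, nbytes, cudaMemcpyDeviceToHost);     // device_to_host
        MPI_Alltoall(snd, m, MPI_DOUBLE, rcv, m, MPI_DOUBLE, MPI_COMM_WORLD);
        cudaMemcpy(f_d, rcv, nbytes, cudaMemcpyHostToDevice);     // host_to_device
        vijaya_d(n, nei_d, geo_d, pio_d, f_d, g_d, sigma_d);      // fluxes + sigma
        cudaMemcpy(&sigma, sigma_d, sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Allreduce(MPI_IN_PLACE, &sigma, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        cudaMemcpy(sigma_d, &sigma, sizeof(double), cudaMemcpyHostToDevice);
        update_d(n, f_d, g_d, sigma_d, C);                        // explicit update
        t += C / sigma;                                           // CFL time step
    }
}
```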

(4) ARMO CPU/GPU Benchmarks

Figure 1: GPU Cluster: mephisto.uni-graz.at

GPU Computing Hardware

- kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM)
- mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM)
- iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM)
- gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM)
- fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM)

GPU Clusters and Servers

- kepler: 2x Intel Xeon E5-2650 @ 2.0 GHz with 256 GB RAM (4x Tesla K20)
- mephisto: 12x Intel Xeon X5650 @ 2.67 GHz with 520 GB RAM (20x Tesla C2070)
- iscsergpu: 8x Intel Core i7 965 @ 3.2 GHz with 12 GB RAM (32x GTX 295)
- gtx: AMD Phenom 9950 @ 2.6 GHz with 8 GB RAM (4x GTX 280)
- fermi: Intel Core i7 920 @ 2.66 GHz with 12 GB RAM (2x GTX 480)

CPU Clusters and Servers

- memo: 8x Intel Xeon X7560 @ 2.27 GHz with 1024 GB RAM
- penge: 12x Dual Intel Xeon E5450 @ 3.0 GHz with 16 GB RAM
- quad2: 4x AMD Opteron 8347 @ 1.9 GHz with 32 GB RAM

Benchmark example: Intake port of a diesel engine with 155,325 elements.

Four pieces of the intake port for parallel processing using domain decomposition.

CPU cores | memo quad2 gtx iscsergpu penge fermi kepler mephisto
1 | 12.35 33.58 19.37 9.32 11.74 10.37 12.13 10.84
2 | 5.94 16.07 9.26 4.55 5.08 5.02 6.27 5.25
4 | 2.96 7.59 4.47 2.29 2.47 2.54 3.07 2.63
8 (6) | 1.44 3.13 1.81 [1] 1.27 [1] 2.11 1.50 (1.76)
16 (12) | 0.68 1.38 1.09 [2] 0.64 [2] 0.72 (0.84) [1]
32 (24) | 0.35 0.65 [4] 0.33 [4] (0.41) [2]
64 (48) | 0.18 0.17 [8] (0.21) [4]
Speedup | 68.22 24.21 4.33 14.34 67.47 4.91 16.85 51.62
Efficiency | 1.07 1.51 1.08 0.45 1.05 0.61 1.05 1.07

GPUs | memo quad2 gtx iscsergpu penge fermi kepler mephisto (ECC: on/off)
1 | 0.284 0.380 0.156 0.120 0.245 / 0.184
2 | 0.141 0.175 0.090 0.070 0.168 / 0.108
4 | 0.086 0.098 0.047 0.142 / 0.063 [1]
8 | 0.069 [1] 0.120 / 0.045 [2]
16 | 0.128 / 0.039 [4]
Speedup | 3.30 5.51 1.73 2.55 1.91 / 4.72
Efficiency | 0.82 0.69 0.86 0.64 0.11 / 0.29

Table 1: Parallel scalability benchmark for an intake-port with 155,325 elements.

Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements.

CPU cores | quad2 gtx iscsergpu fermi kepler mephisto
1 | 135.80 79.65 40.62 47.41 55.65 48.28
2 | 65.85 38.55 20.13 23.50 27.11 23.68
4 | 32.73 19.06 10.23 11.89 13.68 11.85
8 (6) | 15.67 7.86 [1] 9.41 6.89 (7.92)
16 (12) | 7.61 4.22 [2] 3.26 (3.75) [1]
32 (24) | 2.42 [4] (1.74) [2]
64 (48) | (0.84) [4]
Speedup | 19.06 4.13 17.27 5.04 17.07 57.48
Efficiency | 1.19 1.03 0.54 0.63 1.07 1.20

GPUs | quad2 gtx iscsergpu fermi kepler mephisto (ECC: on/off)
1 | 1.186 1.561 0.617 0.459 1.011 / 0.740
2 | 0.540 0.702 0.312 0.211 0.523 / 0.369
4 | 0.275 0.337 0.116 0.307 / 0.199 [1]
8 | 0.185 [1] 0.203 / 0.132 [2]
16 | 0.155 / 0.100 [4]
Speedup | 5.00 11.60 1.98 3.96 6.52 / 7.40
Efficiency | 1.25 1.45 0.99 0.99 0.41 / 0.46

Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements.

CPU cores | quad2 gtx iscsergpu fermi kepler mephisto
1 | 415.00 259.89 142.83 174.55 209.01 172.26
2 | 203.15 128.70 72.03 85.06 103.96 86.39
4 | 105.69 65.90 37.27 43.64 52.60 43.78
8 (6) | 55.34 29.47 [1] 35.17 27.03 (29.74)
16 (12) | 29.16 14.77 [2] 12.95 (14.58) [1]
32 (24) | 7.40 [4] (7.16) [2]
64 (48) | 3.75 [8] (3.49) [4]
Speedup | 14.23 3.94 38.09 4.96 16.14 49.36
Efficiency | 0.89 0.99 0.60 0.62 1.01 1.03

GPUs | quad2 gtx iscsergpu fermi kepler mephisto (ECC: on/off)
1 | 3.955 4.683 2.160 1.247 2.534 / 2.406
2 | 1.694 2.052 1.082 0.635 1.307 / 1.212
4 | 0.841 1.002 0.330 0.721 / 0.671 [1]
8 | 0.514 [1] 0.423 / 0.342 [2]
16 | 0.320 [2] 0.265 / 0.206 [4]
Speedup | 4.70 14.63 2.00 3.78 9.56 / 11.70
Efficiency | 1.18 0.91 1.00 0.94 0.60 / 0.73

Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements.

CPU cores | quad2 gtx iscsergpu fermi kepler mephisto
1 | 1384.5 916.89 508.74 603.83 752.71 630.41
2 | 693.25 462.34 257.83 305.15 374.63 315.16
4 | 361.81 238.70 132.20 156.57 189.02 160.26
8 (6) | 200.29 110.17 [1] 128.98 97.01 (109.45)
16 (12) | 108.48 55.93 [2] 48.44 (54.44) [1]
32 (24) | 28.20 [4] (27.16) [2]
64 (48) | 14.11 [8] (13.66) [4]
Speedup | 12.76 3.84 36.05 4.68 15.54 46.15
Efficiency | 0.80 0.96 0.56 0.59 0.97 0.96

GPUs | quad2 gtx iscsergpu fermi kepler mephisto (ECC: on/off)
1 | * * 7.896 4.071 9.405 / 9.316
2 | 6.602 7.619 3.964 2.038 4.721 / 4.686
4 | 3.088 3.529 1.027 2.403 / 2.365 [1]
8 | 1.725 [1] 1.264 / 1.184 [2]
16 | 0.935 [2] 0.686 / 0.618 [4]
32 (24) | 0.701 [4] (0.495) / * [6]
64 | 0.495 [8]
Speedup | 4.28 30.78 1.99 3.96 13.71 / 15.07
Efficiency | 1.07 0.48 1.00 0.99 0.86 / 0.94

Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements.

Effective GFLOPS for ARMO Simulator

CPU / GPU Hardware | Intake-port 155,325 | Nozzle 642,700 | Nozzle 2,570,800 | Nozzle 10,283,200
kepler: 2x Intel Xeon E5-2650 | 29.68 [2] | 27.12 [2] | 27.32 [2] | 29.21 [2]
kepler: 4x Nvidia Tesla K20 | 454.74 [4] | 762.38 [4] | 1071.95 [4] | 1377.78 [4]
mephisto: 16x Nvidia Tesla C2070 | 548.02 [16] | 884.36 [16] | 1717.19 [16] | 2289.59 [16]
iscsergpu: 32x Nvidia GTX 295 | 309.75 [8] | 478.03 [8] | 1105.44 [16] | 2858.52 [64]

Table 5: Effective GFLOPS for the ARMO simulator.

GPU cluster performance is equivalent to 800 to 1600 CPU cores!

Conclusions

- GPUs deliver excellent performance for CFD problems!
- 800x to 1600x speedup on a GPU cluster with 4 to 64 GPUs compared with a modern CPU core
- New GPU hardware: the Maxwell architecture brings even more performance
- The CUDA programming model fits well
- Essential software design decision: element-based loops!