Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Manfred Liebmann
Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences, M17
October 26, 2015

Projects and Collaborations

Collaboration Partners
Zoltán Horváth, Széchenyi István University, Hungary (TAMOP Project)
Craig C. Douglas, University of Wyoming, USA (GPU Cluster)
Gundolf Haase, University of Graz, Austria (SFB MOBIS)
Charles Hirsch, NUMECA International S.A, Belgium (E-CFD-GPU Project)

Overview

High Performance Computing with GPUs
The Vijayasundaram Method for Multi-Physics Euler Equations
ARMO CPU/GPU Algorithms
ARMO CPU/GPU Benchmarks

(1) High Performance Computing with GPUs

GPU: Graphics Processing Unit
Many-core / many-thread architecture: Thousands of compute cores
Double precision: 1.5 TFLOPS
Single precision: 4 TFLOPS
Shared memory architecture: Very high memory bandwidth: 300 GB/s
Free Nvidia CUDA compiler and development tools
Big GPU players: Nvidia, AMD, Intel

Nvidia Tesla K40 GPU architecture

1.43 TFLOPS double precision / 4.29 TFLOPS single precision
2880 CUDA cores per GPU / 12 GB on-board ECC RAM
288 GB/s memory bandwidth to on-board ECC RAM
L1/L2 cache hierarchy / 64 bit memory address space

Figure: Floating-Point Operations per Second for the CPU and GPU

Figure: Memory Bandwidth for the CPU and GPU

GPU Software Design Essentials

Extreme multithreading: Schedule millions of threads
Shared memory architecture: Think OpenMP
Memory access coalescing: Still important
Utilize L1/L2/Texture caches and shared memory

Pitfalls
Noncoalesced random read/write to memory is slow
Atomic memory operations are very expensive
Big, branchy code blocks are bad for GPUs: Code serialization
Heavy register use in GPU kernels limits the device utilization

FLOPS are free, memory access is expensive!
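The remark that FLOPS are free while memory access is expensive can be made concrete with the Tesla K40 figures quoted earlier (1.43 / 4.29 TFLOPS peak, 288 GB/s bandwidth). The following is a minimal sketch of that arithmetic; the function name `flops_per_value` is made up for illustration, and the peak numbers are theoretical limits rather than measured rates.

```python
# Machine balance: how many floating-point operations a device can perform
# in the time it takes to stream one value from on-board memory.

def flops_per_value(peak_flops, bandwidth_bytes_per_s, bytes_per_value):
    """Return peak FLOPs available per value loaded from memory."""
    values_per_second = bandwidth_bytes_per_s / bytes_per_value
    return peak_flops / values_per_second

# Tesla K40 figures as quoted on the architecture slide (theoretical peaks).
K40_BANDWIDTH = 288e9      # bytes per second
K40_PEAK_DOUBLE = 1.43e12  # FLOPS, double precision (8-byte values)
K40_PEAK_SINGLE = 4.29e12  # FLOPS, single precision (4-byte values)

balance_double = flops_per_value(K40_PEAK_DOUBLE, K40_BANDWIDTH, 8)
balance_single = flops_per_value(K40_PEAK_SINGLE, K40_BANDWIDTH, 4)

print(f"double precision: {balance_double:.1f} FLOPs per value loaded")
print(f"single precision: {balance_single:.1f} FLOPs per value loaded")
```

A kernel that performs far fewer operations per loaded value than these ratios is limited by memory bandwidth, not by arithmetic throughput.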

(2) The Vijayasundaram Method for Multi-Physics Euler Equations

The Euler equations are given by a system of differential equations. We consider two gas species with densities $\rho_1$ and $\rho_2$ for the simulations and ideal gas state equations. More complicated and realistic state equations can also be handled by the ARMO simulation code.

Let $\rho_1, \rho_2$ be the densities of the gas species and $\rho = \rho_1 + \rho_2$ the density of the gas, $p$ the pressure, $p_1, p_2, p_3$ the components of the gas momentum density, and $E$ the total energy density. Let $x = (x_1, x_2, x_3) \in \Omega \subset \mathbb{R}^3$ and $t \in (0, T) \subset \mathbb{R}$ be the space-time coordinates. Then the conserved quantity $w(x, t)$ is given by

$$ w = (\rho_1, \rho_2, p_1, p_2, p_3, E)^T \tag{1} $$

and the flux vectors are defined as

$$ f_k(w) = \begin{pmatrix} \rho_1 p_k / \rho \\ \rho_2 p_k / \rho \\ p_1 p_k / \rho + \delta_{1k}\, p \\ p_2 p_k / \rho + \delta_{2k}\, p \\ p_3 p_k / \rho + \delta_{3k}\, p \\ (E + p)\, p_k / \rho \end{pmatrix}, \quad k \in \{1, 2, 3\} \tag{2} $$

The Euler equations on the domain $\Omega \times (0, T)$ can then be expressed as

$$ \frac{\partial}{\partial t} w(x, t) + \frac{\partial}{\partial x_1} f_1(w(x, t)) + \frac{\partial}{\partial x_2} f_2(w(x, t)) + \frac{\partial}{\partial x_3} f_3(w(x, t)) = 0 \tag{3} $$

and together with suitable boundary conditions the system can be solved with the finite volume approach.
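As a minimal sketch, the flux vector of equation (2) can be evaluated as follows. The function name `flux_vector` is hypothetical, and the pressure is passed in by the caller as an assumption of this example, because the state equation that yields it is not spelled out in the text.

```python
def flux_vector(w, pressure, k):
    """Evaluate f_k(w) from equation (2) for k in {1, 2, 3}.

    w is the conserved state (rho1, rho2, p1, p2, p3, E). The pressure is
    supplied by the caller (an assumption of this sketch); the state
    equation that produces it is not part of this example.
    """
    rho1, rho2, p1, p2, p3, energy = w
    rho = rho1 + rho2                   # total gas density
    momentum = (p1, p2, p3)
    velocity_k = momentum[k - 1] / rho  # p_k / rho

    flux = [rho1 * velocity_k, rho2 * velocity_k]
    for i in (1, 2, 3):
        delta_ik = 1.0 if i == k else 0.0  # Kronecker delta
        flux.append(momentum[i - 1] * velocity_k + delta_ik * pressure)
    flux.append((energy + pressure) * velocity_k)
    return flux


# Illustrative state: equal species densities, momentum along x_1 only.
state = (0.5, 0.5, 1.0, 0.0, 0.0, 2.5)
flux_1 = flux_vector(state, pressure=1.0, k=1)
flux_2 = flux_vector(state, pressure=1.0, k=2)
print(flux_1)
print(flux_2)
```

In the direction with zero momentum only the pressure term of the matching momentum component survives, which is the role of the Kronecker delta in (2).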

The finite volume method can be formulated by applying Green's theorem

$$ \frac{d}{dt} \int_{\Omega} w(x, t)\, dx = -\int_{\partial\Omega} \left( f_1 n_1 + f_2 n_2 + f_3 n_3 \right) dS \tag{4} $$

where $n = (n_1, n_2, n_3)$ denotes the outer normal to the boundary $\partial\Omega$. The discrete version is then derived by integration over a time interval $[t_n, t_n + \Delta t]$ and averaging over the cells $K_i$:

$$ w^{(n+1)}_{K_i} = w^{(n)}_{K_i} - \frac{\Delta t}{|K_i|} \sum_{j \in S(i)} |\Gamma_{ij}| \sum_{k=1}^{3} F_{k,\Gamma_{ij}}\big(w^{(n)}_{K_i}, w^{(n)}_{K_j}\big)\, n_k \tag{5} $$

Here $\{K_i\}_{i \in I}$ is a tetrahedral approximation to $\Omega$, $\Gamma_{ij}$ are the interfaces between the cells $K_i$ and $K_j$, and the set $S(i)$ stores the indices of the neighboring cells of $K_i$.
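The structure of update (5) can be sketched with an element-based loop over cells and their neighbors. This is an illustration under simplifying assumptions, not the ARMO implementation: the conserved quantity is a scalar, the mesh is three unit cells on a periodic line, and `fv_step` and `upwind_flux` are hypothetical names, with a scalar upwind flux standing in for $\sum_k F_k n_k$.

```python
def fv_step(w, volumes, faces, dt, numerical_flux):
    """One explicit update of equation (5) with an element-based loop.

    faces[i] is a list of (j, area, normal) tuples, one per neighbor j in
    S(i); numerical_flux(w_i, w_j, normal) returns sum_k F_k * n_k.
    """
    w_new = []
    for i, w_i in enumerate(w):
        balance = 0.0
        for j, area, normal in faces[i]:
            balance += area * numerical_flux(w_i, w[j], normal)
        w_new.append(w_i - dt / volumes[i] * balance)
    return w_new


def upwind_flux(w_i, w_j, normal, speed=1.0):
    """Upwind flux for scalar advection (a stand-in for F_k n_k)."""
    normal_speed = speed * normal
    return normal_speed * (w_i if normal_speed > 0.0 else w_j)


# Three unit cells on a periodic line; normals are +1 (right) or -1 (left).
faces = [
    [(1, 1.0, +1.0), (2, 1.0, -1.0)],
    [(2, 1.0, +1.0), (0, 1.0, -1.0)],
    [(0, 1.0, +1.0), (1, 1.0, -1.0)],
]
w_old = [1.0, 0.0, 0.0]
w_next = fv_step(w_old, [1.0, 1.0, 1.0], faces, dt=0.5,
                 numerical_flux=upwind_flux)
print(w_next, sum(w_next))
```

Each cell writes only its own entry of `w_new`, so the outer loop has no write conflicts between cells; the price is that every interface flux is evaluated twice, once from each side.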

The Vijayasundaram method defines the fluxes as

$$ F_{k,\Gamma_{ij}}(u, v) = A_k^{+}\!\left(\frac{u + v}{2}\right) u + A_k^{-}\!\left(\frac{u + v}{2}\right) v, \quad k = 1, 2, 3 \tag{6} $$

The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of $A_k = df_k/dw$, $k = 1, 2, 3$, into positive and negative subspaces. Thus the matrices $A_k^{+}$, $A_k^{-}$ are constructed from the positive and negative eigenvalues of

$$ A_k = R_k \Lambda_k L_k \quad \text{with} \quad \Lambda_k = \operatorname{diag}(\lambda_{k,1}, \ldots, \lambda_{k,6}), \quad k = 1, 2, 3 \tag{7} $$

$$ A_k^{\pm} = R_k \Lambda_k^{\pm} L_k, \quad \Lambda_k^{\pm} = \operatorname{diag}(\lambda_{k,1}^{\pm}, \ldots, \lambda_{k,6}^{\pm}) \tag{8} $$

$$ \lambda_{k,i}^{+} = \max(\lambda_{k,i}, 0), \quad \lambda_{k,i}^{-} = \min(\lambda_{k,i}, 0), \quad i = 1, \ldots, 6 \tag{9} $$
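The eigenvalue splitting in (7) to (9) can be illustrated on a toy problem. The sketch below assumes a constant 2x2 matrix $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ with a known decomposition instead of the 6x6 Euler Jacobians, so $A^{\pm}$ does not depend on the averaged state $(u+v)/2$ as it does in (6). All function names are made up for the example.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diag(values):
    """Diagonal matrix with the given entries."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def split_matrices(right, eigenvalues, left):
    """Build A+ and A- from A = R diag(lambda) L, as in (8) and (9)."""
    lam_plus = [max(lam, 0.0) for lam in eigenvalues]
    lam_minus = [min(lam, 0.0) for lam in eigenvalues]
    a_plus = matmul(matmul(right, diag(lam_plus)), left)
    a_minus = matmul(matmul(right, diag(lam_minus)), left)
    return a_plus, a_minus


def matvec(a, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def split_flux(a_plus, a_minus, u, v):
    """Flux of the form (6): A+ acts on the left state, A- on the right."""
    return [p + m for p, m in zip(matvec(a_plus, u), matvec(a_minus, v))]


# Toy decomposition of A = [[0, 1], [1, 0]] with eigenvalues +1 and -1.
R = [[1.0, 1.0], [1.0, -1.0]]
EIGENVALUES = [1.0, -1.0]
L = [[0.5, 0.5], [0.5, -0.5]]

A_PLUS, A_MINUS = split_matrices(R, EIGENVALUES, L)
print(A_PLUS, A_MINUS)
print(split_flux(A_PLUS, A_MINUS, [1.0, 0.0], [0.0, 0.0]))
print(split_flux(A_PLUS, A_MINUS, [1.0, 2.0], [1.0, 2.0]))
```

Two properties are easy to check in this setting: $A^{+} + A^{-} = A$, and for equal left and right states the split flux reduces to $A u$.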

(3) ARMO CPU/GPU Algorithms

High level parallel CPU algorithm:

Require: f, g, com, nei, geo, pio
Require: t_max, i_max, C, σ, m, n
t ← 0, i ← 0
while t < t_max and i < i_max do
    exchange(m, n, f, g, com)
    mpi_alltoall(m, n, g, f)
    vijaya(n, nei, geo, pio, f, g, σ)
    mpi_allreduce_max(σ)
    update(n, f, g, σ, C)
    i ← i + 1
    t ← t + C/σ
end while
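The time-step control in this loop can be sketched serially. The example below is an illustration only: the callback `sigma_per_rank` is a hypothetical stand-in that returns one local σ per MPI rank, and Python's `max` plays the role of the global maximum reduction; the exchange, flux, and update steps are left out.

```python
def time_loop(t_max, i_max, cfl, sigma_per_rank):
    """Serial sketch of the time-step control in the high-level loop.

    sigma_per_rank(i) is a hypothetical callback returning one local
    sigma per rank for iteration i; max() stands in for the global
    maximum reduction across ranks.
    """
    t, i = 0.0, 0
    while t < t_max and i < i_max:
        sigma = max(sigma_per_rank(i))  # global reduction over all ranks
        i += 1
        t += cfl / sigma                # every rank advances by the same step
    return t, i


# Two pretend ranks with constant local values; the larger one sets the step.
t_end, steps = time_loop(t_max=1.0, i_max=100, cfl=0.5,
                         sigma_per_rank=lambda i: [1.0, 2.0])
print(t_end, steps)
```

Reducing σ to its global maximum before the update keeps all subdomains on a common time step, which is why the reduction sits between the flux computation and the update.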

High level parallel GPU algorithm:

Require: f_D, g_D, com_D, nei_D, geo_D, pio_D, σ_D
Require: t_max, i_max, C, σ, m, n, snd, rcv
t ← 0, i ← 0
while t < t_max and i < i_max do
    exchange_D(m, n, f_D, g_D, com_D)
    device_to_host(n, g_D, snd)
    mpi_alltoall(snd, rcv)
    host_to_device(n, f_D, rcv)
    vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, σ_D)
    device_to_host(σ_D, σ)
    mpi_allreduce_max(σ)
    host_to_device(σ_D, σ)
    update_D(n, f_D, g_D, σ_D, C)
    i ← i + 1
    t ← t + C/σ
end while

(4) ARMO CPU/GPU Benchmarks

Figure 1: GPU Cluster: mephisto.uni-graz.at

GPU Computing Hardware

kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM)
mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM)
iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM)
gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM)
fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM)

GPU Clusters and Servers

kepler: 2x Intel Xeon 2.0 GHz with 256 GB RAM (4x Tesla K20)
mephisto: 12x Intel Xeon 2.67 GHz with 520 GB RAM (20x Tesla C2070)
iscsergpu: 8x Intel Core i7 3.2 GHz with 12 GB RAM (32x GTX 295)
gtx: AMD Phenom 2.6 GHz with 8 GB RAM (4x GTX 280)
fermi: Intel Core i GHz with 12 GB RAM (2x GTX 480)

CPU Clusters and Servers

memo: 8x Intel Xeon 2.27 GHz with 1024 GB RAM
penge: 12x Dual Intel Xeon 3.0 GHz with 16 GB RAM
quad2: 4x AMD Opteron 1.9 GHz with 32 GB RAM

Benchmark example: Intake port of a diesel engine with 155,325 elements.

Four pieces of the intake port for parallel processing using domain decomposition.

CPU cores memo quad2 gtx iscsergpu penge fermi kepler mephisto
(6) [1] 1.27 [1] (1.76)
16 (12) [2] 0.64 [2] 0.72 (0.84) [1]
32 (24) [4] 0.33 [4] (0.41) [2]
64 (48) [8] (0.21) [4]
Speedup , Efficiency
GPUs memo quad2 gtx iscsergpu penge fermi kepler mephisto
ECC: on/off / / / [1] [1] / [2] / [4]
Speedup / 4.72
Efficiency / 0.29
Table 1: Parallel scalability benchmark for an intake-port with 155,325 elements.

Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements.

CPU cores quad2 gtx iscsergpu fermi kepler mephisto
(6) [1] (7.92)
16 (12) [2] 3.26 (3.75) [1]
32 (24) 2.42 [4] (1.74) [2]
64 (48) (0.84) [4]
Speedup Efficiency
GPUs quad2 gtx iscsergpu fermi kepler mephisto
ECC: on/off / / / [1] [1] / [2] / [4]
Speedup / 7.40
Efficiency / 0.46
Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements.

CPU cores quad2 gtx iscsergpu fermi kepler mephisto
(6) [1] (29.74)
16 (12) [2] (14.58) [1]
32 (24) 7.40 [4] (7.16) [2]
64 (48) 3.75 [8] (3.49) [4]
Speedup Efficiency
GPUs quad2 gtx iscsergpu fermi kepler mephisto
ECC: on/off / / / [1] [1] / [2] [2] / [4]
Speedup /
Efficiency / 0.73
Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements.

CPU cores quad2 gtx iscsergpu fermi kepler mephisto
(6) [1] (109.45)
16 (12) [2] (54.44) [1]
32 (24) [4] (27.16) [2]
64 (48) [8] (13.66) [4]
Speedup Efficiency
GPUs quad2 gtx iscsergpu fermi kepler mephisto
ECC: on/off 1 * * / / / [1] [1] / [2] [2] / [4]
32 (24) [4] (0.495) / * [6] [8]
Speedup /
Efficiency / 0.94
Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements.
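The speedup and efficiency rows of these tables are linked by the usual definition, efficiency = speedup / number of processing units. The sketch below applies it to the two speedups that are legible in Tables 1 and 2; the unit count of 16 GPUs is an assumption of this example, chosen because it matches the quoted efficiencies of 0.29 and 0.46 to within rounding.

```python
def parallel_efficiency(speedup, units):
    """Parallel efficiency: achieved speedup divided by the unit count."""
    return speedup / units


# Assumption of this sketch: the GPU speedups refer to 16 GPUs.
ASSUMED_GPUS = 16

# Speedups as they appear in Tables 1 and 2.
efficiency_intake = parallel_efficiency(4.72, ASSUMED_GPUS)
efficiency_nozzle = parallel_efficiency(7.40, ASSUMED_GPUS)

print(f"intake port, 155,325 elements: {efficiency_intake:.4f}")
print(f"nozzle, 642,700 elements:      {efficiency_nozzle:.4f}")
```

Under the same assumption, the efficiencies of 0.73 and 0.94 listed for the two larger nozzle meshes show the expected trend: the more elements each GPU holds, the smaller the relative cost of communication.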

Effective GFLOPS for ARMO Simulator

Columns: CPU / GPU Hardware; Intake-port 155,325; Nozzle 642,700; Nozzle 2,570,800; Nozzle 10,283,200
kepler 2x Intel Xeon E [2] [2] [2] [2]
kepler 4x Nvidia Tesla K [4] [4] [4] [4]
mephisto 16x Nvidia Tesla C [16] [16] [16] [16]
iscsergpu 32x Nvidia GTX [8] [8] [16] [64]
Table 5: Effective GFLOPS for ARMO simulator.

GPU cluster performance is equivalent to CPU cores!

Conclusions

GPUs deliver excellent performance for CFD problems!
speedup on GPU cluster with 4-64 GPUs compared with modern CPU core
New GPU hardware: Maxwell architecture brings even more performance
CUDA programming model fits well
Essential software design decision: Element-based loops!
