Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Size: px

Start display at page:

Download "Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters"

Bertram Melvin Dennis
5 years ago
Views:

1 Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 January 26, 2016

2 (1) The Vijayasundaram Method for Multi-Physics Euler Equations The Euler equations are given by a system of differential equations. We consider two gas species with densities ρ 1 and ρ 2 for the simulations and ideal gas state equations. More complicated and realistic state equation can also be handled by the ARMO simulation code. Let ρ 1, ρ 2 be the densities of the gas species and ρ = ρ 1 + ρ 2 the density of the gas, p the pressure, and p 1, p 2, p 3 the components of the gas momentum density, and E the total energy density. Let x = {x 1, x 2, x 3 } Ω R 3 and t (0, T ) R be the space time coordinates. Then the conserved quantity w(x, t) is given by w = ρ 1 ρ 2 p 1 p 2 p 3 E (1) Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 1

3 and the flux vectors are defined as f k (w) = ρ 1 p k /ρ ρ 2 p k /ρ p 1 p k /ρ + δ 1k p p 2 p k /ρ + δ 2k p p 3 p k /ρ + δ 3k p (E + p)p k /ρ, k {1, 2, 3} (2) The Euler equations on the domain Ω (0, T ) can then be expressed as w(x, t) + t x 1 f 1 (w(x, t)) + x 2 f 2 (w(x, t)) + x 3 f 3 (w(x, t)) = 0 (3) and together with suitable boundary conditions the system can be solved with the finite volume approach. The finite volume method can be formulated by applying Green s theorem d dt Ω w(x, t)dx = Ω f 1 n 1 + f 2 n 2 + f 3 n 3 ds (4) Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 2

4 where n = (n 1, n 2, n 3 ) denotes the outer normal to the boundary Ω. The discrete version is then derived by integration over a time intervall [t n, t n + t] and averaging over the cells K i. w (n+1) Ki = w (n) Ki t j S(i) Γ ij K i 3 F k,γij (w (n) Ki, w (n) Kj )n k (5) k=1 With a tetrahedral approximation to Ω {K i } i I and Γ ij are the interfaces between the cells K i, K j and the set S(i) stores the indices of the neighboring cells of K i Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 3

5 The Vijayasundaram method defines the fluxes as ( ) ( ) u + v u + v F k,γij (u, v) = A + k u + A k v, 2 2 k = 1, 2, 3 (6) The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of A k = df k /dw, k = 1, 2, 3 into positive and negative subspaces. Thus the matrices A + k, A k are constructed from the positive and negative eigenvalues of A k = R k Λ k L k with Λ k = diag(λ k,1,..., λ k,6 ) and k = 1, 2, 3. A ± k = R kλ ± k L k, Λ ± k = diag(λ± k,1,..., λ± k,m ), (8) λ + k,i = max(λ k,i, 0), λ k,i = min(λ k,i, 0), i = 1,..., 6 (9) (7) Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 4

6 (2) ARMO CPU/GPU Algorithms High level parallel CPU algorithm: Require: f, g, com, nei, geo, pio Require: t max, i max, C, σ, m, n t 0, i 0 while t < t max and i < i max do exchange(m, n, f, g, com) mpi alltoall(m, n, g, f) vijaya(n, nei, geo, pio, f, g, σ) mpi allreduce max(σ) update(n, f, g, σ, C) i i + 1 t t + C/σ end while Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 5

7 High level parallel GPU algorithm: Require: f D, g D, com D, nei D, geo D, pio D, σ D Require: t max, i max, C, σ, m, n, snd, rcv t 0, i 0 while t < t max and i < i max do exchange D (m, n, f D, g D, com D ) device to host(n, g D, snd) mpi alltoall(snd, rcv) host to device(n, f D, rcv) vijaya D (n, nei D, geo D, pio D, f D, g D, σ D ) device to host(σ D, σ) mpi allreduce max(σ) host to device(σ D, σ) update D (n, f D, g D, σ D, C) i i + 1 t t + C/σ end while Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 6

8 (3) ARMO CPU/GPU Benchmarks Figure 1: GPU Cluster: mephisto.uni-graz.at Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 7

RAM) gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM) fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3

9 GPU Computing Hardware kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM) mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM) iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM) gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM) fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM) Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 8

10 GPU Clusters and Servers kepler: 2x Intel Xeon 2.0 GHz with 256 GB RAM (4x Tesla K20) mephisto: 12x Intel Xeon 2.67 GHz with 520 GB RAM (20x Tesla C2070) iscsergpu: 8x Intel Core i7 3.2 GHz with 12 GB RAM (32x GTX 295) gtx: AMD Phenom 2.6 GHz with 8 GB RAM (4x GTX 280) fermi: Intel Core i GHz with 12 GB RAM (2x GTX 480) CPU Clusters and Servers memo: 8x Intel Xeon 2.27 GHz with 1024 GB RAM penge: 12x Dual Intel Xeon 3.0 GHz with 16 GB RAM quad2: 4x AMD Opteron 1.9 GHz with 32 GB RAM Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 9

11 Benchmark example: Intake port of a diesel engine with 155,325 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 10

12 Four pieces of the intake port for parallel processing using domain decomposition. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 11

13 CPU cores memo quad2 gtx iscsergpu penge fermi kepler mephisto (6) [1] 1.27 [1] (1.76) 16 (12) [2] 0.64 [2] 0.72 (0.84) [1] 32 (24) [4] 0.33 [4] (0.41) [2] 64 (48) [8] (0.21) [4] Speedup , Efficiency GPUs memo quad2 gtx iscsergpu penge fermi kepler mephisto ECC: on/off / / / [1] [1] / [2] / [4] Speedup / 4.72 Efficiency / 0.29 Table 1: Parallel scalability benchmark for an intake-port with 155,325 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 12

14 Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 13

15 CPU cores quad2 gtx iscsergpu fermi kepler mephisto (6) [1] (7.92) 16 (12) [2] 3.26 (3.75) [1] 32 (24) 2.42 [4] (1.74) [2] 64 (48) (0.84) [4] Speedup Efficiency GPUs quad2 gtx iscsergpu fermi kepler mephisto ECC: on/off / / / [1] [1] / [2] / [4] Speedup / 7.40 Efficiency / 0.46 Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 14

16 CPU cores quad2 gtx iscsergpu fermi kepler mephisto (6) [1] (29.74) 16 (12) [2] (14.58) [1] 32 (24) 7.40 [4] (7.16) [2] 64 (48) 3.75 [8] (3.49) [4] Speedup Efficiency GPUs quad2 gtx iscsergpu fermi kepler mephisto ECC: on/off / / / [1] [1] / [2] [2] / [4] Speedup / Efficiency / 0.73 Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 15

17 CPU cores quad2 gtx iscsergpu fermi kepler mephisto (6) [1] (109.45) 16 (12) [2] (54.44) [1] 32 (24) [4] (27.16) [2] 64 (48) [8] (13.66) [4] Speedup Efficiency GPUs quad2 gtx iscsergpu fermi kepler mephisto ECC: on/off 1 * * / / / [1] [1] / [2] [2] / [4] 32 (24) [4] (0.495) / * [6] [8] Speedup / Efficiency / 0.94 Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements. Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 16

18 Effective GFLOPS for ARMO Simulator Intake-port Nozzle Nozzle Nozzle CPU / GPU Hardware 155, ,700 2,570,800 10,283,200 kepler 2x Intel Xeon E [2] [2] [2] [2] kepler 4x Nvidia Tesla K [4] [4] [4] [4] mephisto 16x Nvidia Tesla C [16] [16] [16] [16] iscsergpu 32x Nvidia GTX [8] [8] [16] [64] Table 5: Effective GFLOPS for ARMO simulator. GPU cluster performance is equivalent to CPU cores! Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 17

19 Conclusions GPUs deliver excellent performance for CFD problems! speedup on GPU cluster with 4 64 GPUs compared with modern CPU core New GPU hardware: Maxwell architecture brings even more performance CUDA programming model fits well Essential software design decision: Element-based loops! Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters 18

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,