Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Manfred Liebmann
Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences (M17)
manfred.liebmann@tum.de
January 26, 2016
(1) The Vijayasundaram Method for Multi-Physics Euler Equations

The Euler equations form a system of conservation laws. For the simulations we consider two gas species with densities ρ_1 and ρ_2 and ideal-gas state equations; more complicated and realistic state equations can also be handled by the ARMO simulation code. Let ρ_1, ρ_2 be the densities of the gas species, ρ = ρ_1 + ρ_2 the density of the gas, p the pressure, p_1, p_2, p_3 the components of the gas momentum density, and E the total energy density. Let x = (x_1, x_2, x_3) ∈ Ω ⊂ R^3 and t ∈ (0, T) ⊂ R be the space-time coordinates. Then the vector of conserved quantities w(x, t) is given by

w = \begin{pmatrix} \rho_1 \\ \rho_2 \\ p_1 \\ p_2 \\ p_3 \\ E \end{pmatrix} \quad (1)
and the flux vectors are defined as

f_k(w) = \begin{pmatrix} \rho_1 p_k/\rho \\ \rho_2 p_k/\rho \\ p_1 p_k/\rho + \delta_{1k}\,p \\ p_2 p_k/\rho + \delta_{2k}\,p \\ p_3 p_k/\rho + \delta_{3k}\,p \\ (E + p)\,p_k/\rho \end{pmatrix}, \quad k \in \{1, 2, 3\} \quad (2)

The Euler equations on the domain Ω × (0, T) can then be expressed as

\frac{\partial w(x,t)}{\partial t} + \frac{\partial}{\partial x_1} f_1(w(x,t)) + \frac{\partial}{\partial x_2} f_2(w(x,t)) + \frac{\partial}{\partial x_3} f_3(w(x,t)) = 0 \quad (3)

and together with suitable boundary conditions the system can be solved with the finite volume approach. The finite volume method is obtained by applying Green's theorem:

\frac{d}{dt} \int_\Omega w(x,t)\,dx = -\int_{\partial\Omega} \left( f_1 n_1 + f_2 n_2 + f_3 n_3 \right) ds \quad (4)
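The flux vectors (2) can be evaluated directly once the pressure is known. The following sketch assumes a single ratio of specific heats γ for the mixture, so that p = (γ − 1)(E − |m|²/2ρ); this closure is an illustrative assumption, since the two-species case in the ARMO code uses more general state equations.

```python
import numpy as np

def flux(w, k, gamma=1.4):
    """Flux vector f_k(w) from Eq. (2) for w = (rho1, rho2, p1, p2, p3, E).

    The single-gamma ideal-gas pressure closure used here is an
    illustrative assumption, not the ARMO mixture state equation.
    """
    rho1, rho2, m1, m2, m3, E = w
    rho = rho1 + rho2
    m = np.array([m1, m2, m3])
    p = (gamma - 1.0) * (E - 0.5 * np.dot(m, m) / rho)  # pressure from E
    u_k = m[k - 1] / rho                  # k-th velocity component, k in {1,2,3}
    f = np.array([rho1 * u_k, rho2 * u_k,
                  m1 * u_k, m2 * u_k, m3 * u_k,
                  (E + p) * u_k])
    f[1 + k] += p                         # add delta_{jk} p to momentum row j = k
    return f
```

Note that only the k-th momentum row picks up the pressure term, mirroring the Kronecker delta δ_{jk} in Eq. (2).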
where n = (n_1, n_2, n_3) denotes the outer normal to the boundary ∂Ω. The discrete version is then derived by integration over a time interval [t_n, t_n + Δt] and averaging over the cells K_i:

w_{K_i}^{(n+1)} = w_{K_i}^{(n)} - \frac{\Delta t}{|K_i|} \sum_{j \in S(i)} |\Gamma_{ij}| \sum_{k=1}^{3} F_{k,\Gamma_{ij}}\!\left( w_{K_i}^{(n)}, w_{K_j}^{(n)} \right) n_k \quad (5)

Here {K_i}_{i ∈ I} is a tetrahedral approximation of Ω, Γ_{ij} is the interface between the cells K_i and K_j, and the set S(i) stores the indices of the neighboring cells of K_i.
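The explicit update (5) can be sketched as a loop over cell interfaces. The data structures below (a volume array, a face list, a generic numerical flux callback) are assumptions for illustration and do not reflect the ARMO data layout, which is element-based.

```python
import numpy as np

def fv_step(w, dt, volumes, faces, num_flux):
    """One explicit finite-volume update, Eq. (5) (illustrative sketch).

    w        : array (n_cells, 6) of cell averages w_{K_i}
    volumes  : array (n_cells,) of cell volumes |K_i|
    faces    : list of (i, j, area, normal) for the interfaces Gamma_ij
    num_flux : numerical flux F(u, v, normal) -> 6-vector
    """
    w_new = w.copy()
    for i, j, area, n in faces:
        F = num_flux(w[i], w[j], n)             # flux from K_i into K_j
        w_new[i] -= dt / volumes[i] * area * F
        w_new[j] += dt / volumes[j] * area * F  # equal and opposite: conservative
    return w_new
```

Because each face contributes equal and opposite amounts to its two cells, the scheme conserves the total of each component of w exactly, up to boundary fluxes.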
The Vijayasundaram method defines the fluxes as

F_{k,\Gamma_{ij}}(u, v) = A_k^+\!\left( \frac{u+v}{2} \right) u + A_k^-\!\left( \frac{u+v}{2} \right) v, \quad k = 1, 2, 3 \quad (6)

The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of the flux Jacobians A_k = df_k/dw, k = 1, 2, 3, into positive and negative subspaces. The matrices A_k^+, A_k^- are constructed from the positive and negative eigenvalues of

A_k = R_k \Lambda_k L_k, \quad \Lambda_k = \mathrm{diag}(\lambda_{k,1}, \ldots, \lambda_{k,6}), \quad k = 1, 2, 3 \quad (7)

A_k^\pm = R_k \Lambda_k^\pm L_k, \quad \Lambda_k^\pm = \mathrm{diag}(\lambda_{k,1}^\pm, \ldots, \lambda_{k,6}^\pm) \quad (8)

\lambda_{k,i}^+ = \max(\lambda_{k,i}, 0), \quad \lambda_{k,i}^- = \min(\lambda_{k,i}, 0), \quad i = 1, \ldots, 6 \quad (9)
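The splitting (7)-(9) can be illustrated numerically for any diagonalizable matrix with real eigenvalues, which is the case for the Euler flux Jacobians. This generic sketch uses numpy's eigendecomposition rather than the closed-form eigenvectors a production code would employ.

```python
import numpy as np

def split_matrix(A):
    """Split A = R Lambda L into A^+ + A^-, following Eqs. (7)-(9).

    Assumes A is diagonalizable with real eigenvalues. A production
    flux solver would use analytic eigenvectors instead of np.linalg.eig.
    """
    lam, R = np.linalg.eig(A)
    L = np.linalg.inv(R)                               # left eigenvectors
    A_plus = (R @ np.diag(np.maximum(lam, 0.0)) @ L).real
    A_minus = (R @ np.diag(np.minimum(lam, 0.0)) @ L).real
    return A_plus, A_minus
```

By construction A^+ + A^- = A, A^+ has only nonnegative eigenvalues, and A^- only nonpositive ones, so upwinding each state with the matching sign in Eq. (6) is well defined.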
(2) ARMO CPU/GPU Algorithms

High-level parallel CPU algorithm:

Require: f, g, com, nei, geo, pio
Require: t_max, i_max, C, σ, m, n
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange(m, n, f, g, com)
    mpi_alltoall(m, n, g, f)
    vijaya(n, nei, geo, pio, f, g, σ)
    mpi_allreduce_max(σ)
    update(n, f, g, σ, C)
    i ← i + 1
    t ← t + C/σ
  end while
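The driver loop above can be sketched with the MPI calls abstracted away: `exchange` stands for the halo exchange plus `mpi_alltoall`, `vijaya` evaluates the fluxes and returns the maximum signal speed σ (globally reduced via `mpi_allreduce_max` in the parallel code), and `update` advances the cell averages with the CFL time step Δt = C/σ. Function names and signatures here are illustrative stand-ins, not the ARMO API.

```python
def time_loop(t_max, i_max, C, exchange, vijaya, update):
    """High-level driver mirroring the ARMO CPU algorithm (sketch).

    C is the Courant number; vijaya() returns the global maximum
    signal speed sigma, so dt = C / sigma satisfies the CFL condition.
    """
    t, i = 0.0, 0
    while t < t_max and i < i_max:
        exchange()               # ghost-cell data exchange between ranks
        sigma = vijaya()         # flux evaluation; max signal speed
        update(C / sigma)        # explicit update with CFL time step
        i += 1
        t += C / sigma
    return t, i
```

Note that the time step is adaptive: σ is recomputed every iteration, so the loop takes as many steps as the stiffest wave speed in the current solution requires.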
High-level parallel GPU algorithm:

Require: f_D, g_D, com_D, nei_D, geo_D, pio_D, σ_D
Require: t_max, i_max, C, σ, m, n, snd, rcv
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange_D(m, n, f_D, g_D, com_D)
    device_to_host(n, g_D, snd)
    mpi_alltoall(snd, rcv)
    host_to_device(n, f_D, rcv)
    vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, σ_D)
    device_to_host(σ_D, σ)
    mpi_allreduce_max(σ)
    host_to_device(σ_D, σ)
    update_D(n, f_D, g_D, σ_D, C)
    i ← i + 1
    t ← t + C/σ
  end while
(3) ARMO CPU/GPU Benchmarks

Figure 1: GPU Cluster: mephisto.uni-graz.at
GPU Computing Hardware

- kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM)
- mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM)
- iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM)
- gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM)
- fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM)
GPU Clusters and Servers

- kepler: 2x Intel Xeon E5-2650 @ 2.0 GHz with 256 GB RAM (4x Tesla K20)
- mephisto: 12x Intel Xeon X5650 @ 2.67 GHz with 520 GB RAM (20x Tesla C2070)
- iscsergpu: 8x Intel Core i7 965 @ 3.2 GHz with 12 GB RAM (32x GTX 295)
- gtx: AMD Phenom 9950 @ 2.6 GHz with 8 GB RAM (4x GTX 280)
- fermi: Intel Core i7 920 @ 2.66 GHz with 12 GB RAM (2x GTX 480)

CPU Clusters and Servers

- memo: 8x Intel Xeon X7560 @ 2.27 GHz with 1024 GB RAM
- penge: 12x Dual Intel Xeon E5450 @ 3.0 GHz with 16 GB RAM
- quad2: 4x AMD Opteron 8347 @ 1.9 GHz with 32 GB RAM
Benchmark example: Intake port of a diesel engine with 155,325 elements.
Four pieces of the intake port for parallel processing using domain decomposition.
| CPU cores  | memo  | quad2 | gtx   | iscsergpu | penge    | fermi | kepler | mephisto   |
|------------|-------|-------|-------|-----------|----------|-------|--------|------------|
| 1          | 12.35 | 33.58 | 19.37 | 9.32      | 11.74    | 10.37 | 12.13  | 10.84      |
| 2          | 5.94  | 16.07 | 9.26  | 4.55      | 5.08     | 5.02  | 6.27   | 5.25       |
| 4          | 2.96  | 7.59  | 4.47  | 2.29      | 2.47     | 2.54  | 3.07   | 2.63       |
| 8 (6)      | 1.44  | 3.13  |       | 1.81 [1]  | 1.27 [1] | 2.11  | 1.50   | (1.76)     |
| 16 (12)    | 0.68  | 1.38  |       | 1.09 [2]  | 0.64 [2] |       | 0.72   | (0.84) [1] |
| 32 (24)    | 0.35  |       |       | 0.65 [4]  | 0.33 [4] |       |        | (0.41) [2] |
| 64 (48)    | 0.18  |       |       |           | 0.17 [8] |       |        | (0.21) [4] |
| Speedup    | 68.22 | 24.21 | 4.33  | 14.34     | 67.47    | 4.91  | 16.85  | 51.62      |
| Efficiency | 1.07  | 1.51  | 1.08  | 0.45      | 1.05     | 0.61  | 1.05   | 1.07       |

| GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off) |
|------------|-------|-----------|-------|--------|-------------------------|
| 1          | 0.284 | 0.380     | 0.156 | 0.120  | 0.245 / 0.184           |
| 2          | 0.141 | 0.175     | 0.090 | 0.070  | 0.168 / 0.108           |
| 4          | 0.086 | 0.098     |       | 0.047  | 0.142 / 0.063 [1]       |
| 8          |       | 0.069 [1] |       |        | 0.120 / 0.045 [2]       |
| 16         |       |           |       |        | 0.128 / 0.039 [4]       |
| Speedup    | 3.30  | 5.51      | 1.73  | 2.55   | 1.91 / 4.72             |
| Efficiency | 0.82  | 0.69      | 0.86  | 0.64   | 0.11 / 0.29             |

Table 1: Parallel scalability benchmark for an intake port with 155,325 elements.
Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements.
| CPU cores  | quad2  | gtx   | iscsergpu | fermi | kepler | mephisto   |
|------------|--------|-------|-----------|-------|--------|------------|
| 1          | 135.80 | 79.65 | 40.62     | 47.41 | 55.65  | 48.28      |
| 2          | 65.85  | 38.55 | 20.13     | 23.50 | 27.11  | 23.68      |
| 4          | 32.73  | 19.06 | 10.23     | 11.89 | 13.68  | 11.85      |
| 8 (6)      | 15.67  |       | 7.86 [1]  | 9.41  | 6.89   | (7.92)     |
| 16 (12)    | 7.61   |       | 4.22 [2]  |       | 3.26   | (3.75) [1] |
| 32 (24)    |        |       | 2.42 [4]  |       |        | (1.74) [2] |
| 64 (48)    |        |       |           |       |        | (0.84) [4] |
| Speedup    | 19.06  | 4.13  | 17.27     | 5.04  | 17.07  | 57.48      |
| Efficiency | 1.19   | 1.03  | 0.54      | 0.63  | 1.07   | 1.20       |

| GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off) |
|------------|-------|-----------|-------|--------|-------------------------|
| 1          | 1.186 | 1.561     | 0.617 | 0.459  | 1.011 / 0.740           |
| 2          | 0.540 | 0.702     | 0.312 | 0.211  | 0.523 / 0.369           |
| 4          | 0.275 | 0.337     |       | 0.116  | 0.307 / 0.199 [1]       |
| 8          |       | 0.185 [1] |       |        | 0.203 / 0.132 [2]       |
| 16         |       |           |       |        | 0.155 / 0.100 [4]       |
| Speedup    | 5.00  | 11.60     | 1.98  | 3.96   | 6.52 / 7.40             |
| Efficiency | 1.25  | 1.45      | 0.99  | 0.99   | 0.41 / 0.46             |

Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements.
| CPU cores  | quad2  | gtx    | iscsergpu | fermi  | kepler | mephisto    |
|------------|--------|--------|-----------|--------|--------|-------------|
| 1          | 415.00 | 259.89 | 142.83    | 174.55 | 209.01 | 172.26      |
| 2          | 203.15 | 128.70 | 72.03     | 85.06  | 103.96 | 86.39       |
| 4          | 105.69 | 65.90  | 37.27     | 43.64  | 52.60  | 43.78       |
| 8 (6)      | 55.34  |        | 29.47 [1] | 35.17  | 27.03  | (29.74)     |
| 16 (12)    | 29.16  |        | 14.77 [2] |        | 12.95  | (14.58) [1] |
| 32 (24)    |        |        | 7.40 [4]  |        |        | (7.16) [2]  |
| 64 (48)    |        |        | 3.75 [8]  |        |        | (3.49) [4]  |
| Speedup    | 14.23  | 3.94   | 38.09     | 4.96   | 16.14  | 49.36       |
| Efficiency | 0.89   | 0.99   | 0.60      | 0.62   | 1.01   | 1.03        |

| GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off) |
|------------|-------|-----------|-------|--------|-------------------------|
| 1          | 3.955 | 4.683     | 2.160 | 1.247  | 2.534 / 2.406           |
| 2          | 1.694 | 2.052     | 1.082 | 0.635  | 1.307 / 1.212           |
| 4          | 0.841 | 1.002     |       | 0.330  | 0.721 / 0.671 [1]       |
| 8          |       | 0.514 [1] |       |        | 0.423 / 0.342 [2]       |
| 16         |       | 0.320 [2] |       |        | 0.265 / 0.206 [4]       |
| Speedup    | 4.70  | 14.63     | 2.00  | 3.78   | 9.56 / 11.70            |
| Efficiency | 1.18  | 0.91      | 1.00  | 0.94   | 0.60 / 0.73             |

Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements.
| CPU cores  | quad2  | gtx    | iscsergpu  | fermi  | kepler | mephisto     |
|------------|--------|--------|------------|--------|--------|--------------|
| 1          | 1384.5 | 916.89 | 508.74     | 603.83 | 752.71 | 630.41       |
| 2          | 693.25 | 462.34 | 257.83     | 305.15 | 374.63 | 315.16       |
| 4          | 361.81 | 238.70 | 132.20     | 156.57 | 189.02 | 160.26       |
| 8 (6)      | 200.29 |        | 110.17 [1] | 128.98 | 97.01  | (109.45)     |
| 16 (12)    | 108.48 |        | 55.93 [2]  |        | 48.44  | (54.44) [1]  |
| 32 (24)    |        |        | 28.20 [4]  |        |        | (27.16) [2]  |
| 64 (48)    |        |        | 14.11 [8]  |        |        | (13.66) [4]  |
| Speedup    | 12.76  | 3.84   | 36.05      | 4.68   | 15.54  | 46.15        |
| Efficiency | 0.80   | 0.96   | 0.56       | 0.59   | 0.97   | 0.96         |

| GPUs       | gtx   | iscsergpu | fermi | kepler | mephisto (ECC on / off) |
|------------|-------|-----------|-------|--------|-------------------------|
| 1          | *     | *         | 7.896 | 4.071  | 9.405 / 9.316           |
| 2          | 6.602 | 7.619     | 3.964 | 2.038  | 4.721 / 4.686           |
| 4          | 3.088 | 3.529     |       | 1.027  | 2.403 / 2.365 [1]       |
| 8          |       | 1.725 [1] |       |        | 1.264 / 1.184 [2]       |
| 16         |       | 0.935 [2] |       |        | 0.686 / 0.618 [4]       |
| 32 (24)    |       | 0.701 [4] |       |        | (0.495) / * [6]         |
| 64         |       | 0.495 [8] |       |        |                         |
| Speedup    | 4.28  | 30.78     | 1.99  | 3.96   | 13.71 / 15.07           |
| Efficiency | 1.07  | 0.48      | 1.00  | 0.99   | 0.86 / 0.94             |

Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements.
Effective GFLOPS for ARMO Simulator

| CPU / GPU Hardware                 | Intake-port 155,325 | Nozzle 642,700 | Nozzle 2,570,800 | Nozzle 10,283,200 |
|------------------------------------|---------------------|----------------|------------------|-------------------|
| kepler: 2x Intel Xeon E5-2650      | 29.68 [2]           | 27.12 [2]      | 27.32 [2]        | 29.21 [2]         |
| kepler: 4x Nvidia Tesla K20        | 454.74 [4]          | 762.38 [4]     | 1071.95 [4]      | 1377.78 [4]       |
| mephisto: 16x Nvidia Tesla C2070   | 548.02 [16]         | 884.36 [16]    | 1717.19 [16]     | 2289.59 [16]      |
| iscsergpu: 32x Nvidia GTX 295      | 309.75 [8]          | 478.03 [8]     | 1105.44 [16]     | 2858.52 [64]      |

Table 5: Effective GFLOPS for the ARMO simulator.

GPU cluster performance is equivalent to 800-1600 CPU cores!
Conclusions

- GPUs deliver excellent performance for CFD problems!
- 800-1600x speedup on a GPU cluster with 4-64 GPUs compared with a single modern CPU core
- New GPU hardware: the Maxwell architecture brings even more performance
- The CUDA programming model fits well
- Essential software design decision: element-based loops!