Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Manfred Liebmann
Technische Universität München
Chair of Optimal Control, Center for Mathematical Sciences, M17
manfred.liebmann@tum.de

October 26, 2015
Projects and Collaborations

Collaboration Partners
- Zoltán Horváth, Széchenyi István University, Hungary (TAMOP Project)
- Craig C. Douglas, University of Wyoming, USA (GPU Cluster)
- Gundolf Haase, University of Graz, Austria (SFB MOBIS)
- Charles Hirsch, NUMECA International S.A., Belgium (E-CFD-GPU Project)
Overview

(1) High Performance Computing with GPUs
(2) The Vijayasundaram Method for Multi-Physics Euler Equations
(3) ARMO CPU/GPU Algorithms
(4) ARMO CPU/GPU Benchmarks
(1) High Performance Computing with GPUs

GPU: Graphics Processing Unit
- Many-core / many-thread architecture: thousands of compute cores
- Double precision: 1.5 TFLOPS
- Single precision: 4 TFLOPS
- Shared memory architecture: very high memory bandwidth of 300 GB/s
- Free Nvidia CUDA compiler and development tools
- Big GPU players: Nvidia, AMD, Intel
Nvidia Tesla K40 GPU architecture
- 1.43 TFLOPS double precision / 4.29 TFLOPS single precision
- 2880 CUDA cores per GPU / 12 GB on-board ECC RAM
- 288 GB/s memory bandwidth to on-board ECC RAM
- L1/L2 cache hierarchy / 64-bit memory address space
[Figure: Floating-Point Operations per Second for the CPU and GPU]
[Figure: Memory Bandwidth for the CPU and GPU]
GPU Software Design Essentials
- Extreme multithreading: schedule millions of threads
- Shared memory architecture: think OpenMP
- Memory access coalescing: still important (see the sketch after this list)
- Utilize L1/L2/texture caches and shared memory

Pitfalls
- Noncoalesced random read/write to memory is slow
- Atomic memory operations are very expensive
- Big, branchy code blocks are bad for GPUs: code serialization
- Heavy register use in GPU kernels limits device utilization

FLOPS are free, memory access is expensive!
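To illustrate the coalescing point, here is a minimal CUDA sketch (illustrative only, not ARMO code): with structure-of-arrays storage, consecutive threads touch consecutive addresses and the hardware merges the accesses into few memory transactions, while array-of-structures storage produces strided, slow access.

    // Array-of-structures: thread i touches w[i].E -> stride of 6 doubles,
    // each warp access is split into many memory transactions (slow).
    struct State { double rho1, rho2, p1, p2, p3, E; };

    __global__ void scale_aos(State* w, double a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i].E *= a;        // strided access
    }

    // Structure-of-arrays: component c of cell i lives at w[c*n + i], so
    // consecutive threads read consecutive addresses (coalesced, fast).
    __global__ void scale_soa(double* w, double a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[5*n + i] *= a;    // contiguous access within a warp
    }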
(2) The Vijayasundaram Method for Multi-Physics Euler Equations

The Euler equations are given by a system of differential equations. We consider two gas species with densities $\rho_1$ and $\rho_2$ for the simulations and ideal gas state equations. More complicated and realistic state equations can also be handled by the ARMO simulation code. Let $\rho_1, \rho_2$ be the densities of the gas species and $\rho = \rho_1 + \rho_2$ the density of the gas, $p$ the pressure, $p_1, p_2, p_3$ the components of the gas momentum density, and $E$ the total energy density. Let $x = (x_1, x_2, x_3) \in \Omega \subset \mathbb{R}^3$ and $t \in (0, T) \subset \mathbb{R}$ be the space-time coordinates. Then the conserved quantity $w(x, t)$ is given by

$$w = (\rho_1, \rho_2, p_1, p_2, p_3, E)^T \tag{1}$$
and the flux vectors are defined as

$$f_k(w) = \begin{pmatrix} \rho_1 p_k/\rho \\ \rho_2 p_k/\rho \\ p_1 p_k/\rho + \delta_{1k}\, p \\ p_2 p_k/\rho + \delta_{2k}\, p \\ p_3 p_k/\rho + \delta_{3k}\, p \\ (E + p)\, p_k/\rho \end{pmatrix}, \quad k \in \{1, 2, 3\} \tag{2}$$

The Euler equations on the domain $\Omega \times (0, T)$ can then be expressed as

$$\frac{\partial}{\partial t} w(x, t) + \frac{\partial}{\partial x_1} f_1(w(x, t)) + \frac{\partial}{\partial x_2} f_2(w(x, t)) + \frac{\partial}{\partial x_3} f_3(w(x, t)) = 0 \tag{3}$$

and together with suitable boundary conditions the system can be solved with the finite volume approach.
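A minimal sketch of evaluating the flux vector (2) for one cell (illustrative, not ARMO source; the state layout and function name are assumptions, and the pressure p is supplied by the equation of state):

    // Evaluate f_k(w) from Eq. (2); w = (rho1, rho2, p1, p2, p3, E).
    // Note: k is 0-based here (k = 0, 1, 2 for the three directions).
    __host__ __device__ void flux(const double w[6], double p, int k,
                                  double f[6]) {
        double rho = w[0] + w[1];     // total density rho = rho1 + rho2
        double vk  = w[2 + k] / rho;  // velocity component p_k / rho
        f[0] = w[0] * vk;             // species mass fluxes
        f[1] = w[1] * vk;
        f[2] = w[2] * vk;             // momentum advection ...
        f[3] = w[3] * vk;
        f[4] = w[4] * vk;
        f[2 + k] += p;                // ... plus pressure in direction k
        f[5] = (w[5] + p) * vk;       // energy flux (E + p) p_k / rho
    }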
The finite volume method can be formulated by applying Green's theorem

$$\frac{d}{dt} \int_\Omega w(x, t)\, dx = -\int_{\partial\Omega} f_1 n_1 + f_2 n_2 + f_3 n_3 \, dS \tag{4}$$

where $n = (n_1, n_2, n_3)$ denotes the outer normal to the boundary $\partial\Omega$. The discrete version is then derived by integration over a time interval $[t_n, t_n + \Delta t]$ and averaging over the cells $K_i$:

$$w^{(n+1)}_{K_i} = w^{(n)}_{K_i} - \frac{\Delta t}{|K_i|} \sum_{j \in S(i)} |\Gamma_{ij}| \sum_{k=1}^{3} F_{k,\Gamma_{ij}}\big(w^{(n)}_{K_i}, w^{(n)}_{K_j}\big)\, n_k \tag{5}$$

with a tetrahedral approximation $\Omega \approx \bigcup_{i \in I} K_i$, where $\Gamma_{ij}$ are the interfaces between the cells $K_i, K_j$ and the set $S(i)$ stores the indices of the neighboring cells of $K_i$.
The Vijayasundaram method defines the fluxes as

$$F_{k,\Gamma_{ij}}(u, v) = A^+_k\!\left(\frac{u + v}{2}\right) u + A^-_k\!\left(\frac{u + v}{2}\right) v, \quad k = 1, 2, 3 \tag{6}$$

The essence of the Vijayasundaram method is the calculation of an eigenspace decomposition of $A_k = df_k/dw$, $k = 1, 2, 3$, into positive and negative subspaces. Thus the matrices $A^+_k, A^-_k$ are constructed from the positive and negative eigenvalues of

$$A_k = R_k \Lambda_k L_k, \quad \Lambda_k = \operatorname{diag}(\lambda_{k,1}, \ldots, \lambda_{k,6}), \quad k = 1, 2, 3 \tag{7}$$

$$A^\pm_k = R_k \Lambda^\pm_k L_k, \quad \Lambda^\pm_k = \operatorname{diag}(\lambda^\pm_{k,1}, \ldots, \lambda^\pm_{k,6}) \tag{8}$$

$$\lambda^+_{k,i} = \max(\lambda_{k,i}, 0), \quad \lambda^-_{k,i} = \min(\lambda_{k,i}, 0), \quad i = 1, \ldots, 6 \tag{9}$$
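A minimal sketch of the eigenvalue splitting (9) (illustrative, not ARMO source; it assumes an ideal gas with ratio of specific heats gamma, for which the 6x6 Jacobian A_k has the eigenvalue v_k with multiplicity four plus v_k - c and v_k + c with the sound speed c):

    #include <math.h>

    // Split the eigenvalues of A_k into positive/negative parts, Eq. (9).
    // k is 0-based; p is the pressure from the equation of state.
    void eigen_split(const double w[6], double p, double gamma, int k,
                     double lam_plus[6], double lam_minus[6]) {
        double rho = w[0] + w[1];
        double vk  = w[2 + k] / rho;         // normal velocity
        double c   = sqrt(gamma * p / rho);  // speed of sound
        double lam[6] = { vk, vk, vk, vk, vk - c, vk + c };
        for (int i = 0; i < 6; ++i) {
            lam_plus[i]  = fmax(lam[i], 0.0);  // lambda^+ in Eq. (9)
            lam_minus[i] = fmin(lam[i], 0.0);  // lambda^- in Eq. (9)
        }
    }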
(3) ARMO CPU/GPU Algorithms

High-level parallel CPU algorithm (a C/MPI sketch follows below):

Require: f, g, com, nei, geo, pio
Require: t_max, i_max, C, σ, m, n
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange(m, n, f, g, com)
    mpi_alltoall(m, n, g, f)
    vijaya(n, nei, geo, pio, f, g, σ)
    mpi_allreduce_max(σ)
    update(n, f, g, σ, C)
    i ← i + 1
    t ← t + C/σ
  end while
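A C/MPI sketch of this time loop (illustrative: exchange, vijaya, and update stand for the corresponding ARMO routines, and their signatures here are assumptions; σ collects the maximal signal speed, from which the CFL constant C yields the time step C/σ):

    #include <mpi.h>

    /* Assumed ARMO routine prototypes for the sketch. */
    void exchange(int m, int n, double *f, double *g, const int *com);
    void vijaya(int n, const int *nei, const double *geo, const double *pio,
                double *f, double *g, double *sigma);
    void update(int n, double *f, double *g, double sigma, double C);

    void time_loop(double t_max, int i_max, double C, int m, int n,
                   double *f, double *g, const int *com, const int *nei,
                   const double *geo, const double *pio) {
        double t = 0.0, sigma = 0.0;
        for (int i = 0; t < t_max && i < i_max; ++i) {
            exchange(m, n, f, g, com);              /* pack interface cells   */
            MPI_Alltoall(g, m, MPI_DOUBLE,          /* swap interface data    */
                         f, m, MPI_DOUBLE, MPI_COMM_WORLD);
            vijaya(n, nei, geo, pio, f, g, &sigma); /* Vijayasundaram fluxes  */
            MPI_Allreduce(MPI_IN_PLACE, &sigma, 1,  /* global max wave speed  */
                          MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
            update(n, f, g, sigma, C);              /* explicit FV update (5) */
            t += C / sigma;                         /* CFL-limited time step  */
        }
    }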
High-level parallel GPU algorithm (a CUDA/MPI sketch of the halo staging follows below):

Require: f_D, g_D, com_D, nei_D, geo_D, pio_D, σ_D
Require: t_max, i_max, C, σ, m, n, snd, rcv
  t ← 0, i ← 0
  while t < t_max and i < i_max do
    exchange_D(m, n, f_D, g_D, com_D)
    device_to_host(n, g_D, snd)
    mpi_alltoall(snd, rcv)
    host_to_device(n, f_D, rcv)
    vijaya_D(n, nei_D, geo_D, pio_D, f_D, g_D, σ_D)
    device_to_host(σ_D, σ)
    mpi_allreduce_max(σ)
    host_to_device(σ_D, σ)
    update_D(n, f_D, g_D, σ_D, C)
    i ← i + 1
    t ← t + C/σ
  end while
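The GPU loop differs from the CPU loop mainly in the staging of interface data: MPI operates on host buffers, so the packed interface cells travel through cudaMemcpy before and after the all-to-all. A minimal sketch (illustrative; buffer layout and names are assumptions):

    #include <cuda_runtime.h>
    #include <mpi.h>

    /* Stage the packed interface cells through host buffers snd/rcv,
       m doubles per partner process. */
    void halo_exchange(int m, int nprocs, const double *g_D, double *f_D,
                       double *snd, double *rcv) {
        size_t bytes = (size_t)m * nprocs * sizeof(double);
        cudaMemcpy(snd, g_D, bytes, cudaMemcpyDeviceToHost); /* device -> host */
        MPI_Alltoall(snd, m, MPI_DOUBLE,                     /* swap interfaces */
                     rcv, m, MPI_DOUBLE, MPI_COMM_WORLD);
        cudaMemcpy(f_D, rcv, bytes, cudaMemcpyHostToDevice); /* host -> device */
    }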
(4) ARMO CPU/GPU Benchmarks

[Figure 1: GPU Cluster: mephisto.uni-graz.at]
GPU Computing Hardware
- kepler: 4x Nvidia Tesla K20 GPU (9,984 cores / 24 GB on-board RAM)
- mephisto: 20x Nvidia Tesla C2070 GPU (8,960 cores / 120 GB on-board RAM)
- iscsergpu: 32x Nvidia Geforce GTX 295 (15,360 cores / 56 GB on-board RAM)
- gtx: 4x Nvidia Geforce GTX 280 (960 cores / 4 GB on-board RAM)
- fermi: 2x Nvidia Geforce GTX 480 (960 cores / 3 GB on-board RAM)
GPU Clusters and Servers
- kepler: 2x Intel Xeon E5-2650 @ 2.0 GHz with 256 GB RAM (4x Tesla K20)
- mephisto: 12x Intel Xeon X5650 @ 2.67 GHz with 520 GB RAM (20x Tesla C2070)
- iscsergpu: 8x Intel Core i7 965 @ 3.2 GHz with 12 GB RAM (32x GTX 295)
- gtx: AMD Phenom 9950 @ 2.6 GHz with 8 GB RAM (4x GTX 280)
- fermi: Intel Core i7 920 @ 2.66 GHz with 12 GB RAM (2x GTX 480)

CPU Clusters and Servers
- memo: 8x Intel Xeon X7560 @ 2.27 GHz with 1024 GB RAM
- penge: 12x Dual Intel Xeon E5450 @ 3.0 GHz with 16 GB RAM
- quad2: 4x AMD Opteron 8347 @ 1.9 GHz with 32 GB RAM
Benchmark example: Intake port of a diesel engine with 155,325 elements.
Four pieces of the intake port for parallel processing using domain decomposition.
CPU cores    memo   quad2    gtx  iscsergpu     penge  fermi  kepler    mephisto
1           12.35   33.58  19.37       9.32     11.74  10.37   12.13       10.84
2            5.94   16.07   9.26       4.55      5.08   5.02    6.27        5.25
4            2.96    7.59   4.47       2.29      2.47   2.54    3.07        2.63
8 (6)        1.44    3.13      -   1.81 [1]  1.27 [1]   2.11    1.50      (1.76)
16 (12)      0.68    1.38      -   1.09 [2]  0.64 [2]      -    0.72  (0.84) [1]
32 (24)      0.35       -      -   0.65 [4]  0.33 [4]      -       -  (0.41) [2]
64 (48)      0.18       -      -          -  0.17 [8]      -       -  (0.21) [4]
Speedup     68.22   24.21   4.33      14.34     67.47   4.91   16.85       51.62
Efficiency   1.07    1.51   1.08       0.45      1.05   0.61    1.05        1.07

GPUs          gtx  iscsergpu  fermi  kepler  mephisto (ECC on/off)
1           0.284      0.380  0.156   0.120        0.245 / 0.184
2           0.141      0.175  0.090   0.070        0.168 / 0.108
4           0.086      0.098      -   0.047    0.142 / 0.063 [1]
8               -  0.069 [1]      -       -    0.120 / 0.045 [2]
16              -          -      -       -    0.128 / 0.039 [4]
Speedup      3.30       5.51   1.73    2.55         1.91 / 4.72
Efficiency   0.82       0.69   0.86    0.64         0.11 / 0.29

Table 1: Parallel scalability benchmark for an intake port with 155,325 elements. Here and in the following tables, values in parentheses belong to the parenthesized core counts, [n] gives the number of cluster nodes used, and - marks configurations that were not run.
Benchmark example: Nozzle with 642,700, 2,570,800, and 10,283,200 elements.
CPU cores   quad2     gtx  iscsergpu  fermi  kepler    mephisto
1          135.80   79.65      40.62  47.41   55.65       48.28
2           65.85   38.55      20.13  23.50   27.11       23.68
4           32.73   19.06      10.23  11.89   13.68       11.85
8 (6)       15.67       -   7.86 [1]   9.41    6.89      (7.92)
16 (12)      7.61       -   4.22 [2]      -    3.26  (3.75) [1]
32 (24)         -       -   2.42 [4]      -       -  (1.74) [2]
64 (48)         -       -          -      -       -  (0.84) [4]
Speedup     19.06    4.13      17.27   5.04   17.07       57.48
Efficiency   1.19    1.03       0.54   0.63    1.07        1.20

GPUs          gtx  iscsergpu  fermi  kepler  mephisto (ECC on/off)
1           1.186      1.561  0.617   0.459        1.011 / 0.740
2           0.540      0.702  0.312   0.211        0.523 / 0.369
4           0.275      0.337      -   0.116    0.307 / 0.199 [1]
8               -  0.185 [1]      -       -    0.203 / 0.132 [2]
16              -          -      -       -    0.155 / 0.100 [4]
Speedup      5.00      11.60   1.98    3.96         6.52 / 7.40
Efficiency   1.25       1.45   0.99    0.99         0.41 / 0.46

Table 2: Parallel scalability benchmark for a nozzle with 642,700 elements.
CPU cores   quad2     gtx   iscsergpu   fermi  kepler     mephisto
1          415.00  259.89      142.83  174.55  209.01       172.26
2          203.15  128.70       72.03   85.06  103.96        86.39
4          105.69   65.90       37.27   43.64   52.60        43.78
8 (6)       55.34       -   29.47 [1]   35.17   27.03      (29.74)
16 (12)     29.16       -   14.77 [2]       -   12.95  (14.58) [1]
32 (24)         -       -    7.40 [4]       -       -   (7.16) [2]
64 (48)         -       -    3.75 [8]       -       -   (3.49) [4]
Speedup     14.23    3.94       38.09    4.96   16.14        49.36
Efficiency   0.89    0.99        0.60    0.62    1.01         1.03

GPUs          gtx  iscsergpu  fermi  kepler  mephisto (ECC on/off)
1           3.955      4.683  2.160   1.247        2.534 / 2.406
2           1.694      2.052  1.082   0.635        1.307 / 1.212
4           0.841      1.002      -   0.330    0.721 / 0.671 [1]
8               -  0.514 [1]      -       -    0.423 / 0.342 [2]
16              -  0.320 [2]      -       -    0.265 / 0.206 [4]
Speedup      4.70      14.63   2.00    3.78        9.56 / 11.70
Efficiency   1.18       0.91   1.00    0.94         0.60 / 0.73

Table 3: Parallel scalability benchmark for a nozzle with 2,570,800 elements.
CPU cores    quad2     gtx    iscsergpu   fermi  kepler      mephisto
1           1384.5  916.89       508.74  603.83  752.71        630.41
2           693.25  462.34       257.83  305.15  374.63        315.16
4           361.81  238.70       132.20  156.57  189.02        160.26
8 (6)       200.29       -   110.17 [1]  128.98   97.01      (109.45)
16 (12)     108.48       -    55.93 [2]       -   48.44   (54.44) [1]
32 (24)          -       -    28.20 [4]       -       -   (27.16) [2]
64 (48)          -       -    14.11 [8]       -       -   (13.66) [4]
Speedup      12.76    3.84        36.05    4.68   15.54         46.15
Efficiency    0.80    0.96         0.56    0.59    0.97          0.96

GPUs          gtx  iscsergpu  fermi  kepler  mephisto (ECC on/off)
1               *          *  7.896   4.071        9.405 / 9.316
2           6.602      7.619  3.964   2.038        4.721 / 4.686
4           3.088      3.529      -   1.027    2.403 / 2.365 [1]
8               -  1.725 [1]      -       -    1.264 / 1.184 [2]
16              -  0.935 [2]      -       -    0.686 / 0.618 [4]
32 (24)         -  0.701 [4]      -       -    (0.495) / * [6]
64              -  0.495 [8]      -       -                   -
Speedup      4.28      30.78   1.99    3.96       13.71 / 15.07
Efficiency   1.07       0.48   1.00    0.99        0.86 / 0.94

Table 4: Parallel scalability benchmark for a nozzle with 10,283,200 elements. A * marks runs without a result.
Effective GFLOPS for ARMO Simulator

                                  Intake port      Nozzle       Nozzle        Nozzle
CPU / GPU Hardware                    155,325     642,700    2,570,800    10,283,200
kepler: 2x Intel Xeon E5-2650       29.68 [2]   27.12 [2]    27.32 [2]     29.21 [2]
kepler: 4x Nvidia Tesla K20        454.74 [4]  762.38 [4]  1071.95 [4]   1377.78 [4]
mephisto: 16x Nvidia Tesla C2070  548.02 [16] 884.36 [16] 1717.19 [16]  2289.59 [16]
iscsergpu: 32x Nvidia GTX 295      309.75 [8]  478.03 [8] 1105.44 [16]  2858.52 [64]

Table 5: Effective GFLOPS for the ARMO simulator.

GPU cluster performance is equivalent to 800–1600 CPU cores!
Conclusions
- GPUs deliver excellent performance for CFD problems!
- 800–1600x speedup on a GPU cluster with 4–64 GPUs compared with a single modern CPU core
- New GPU hardware: the Maxwell architecture brings even more performance
- The CUDA programming model fits well
- Essential software design decision: element-based loops! (see the sketch below)
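A minimal sketch of the element-based loop design (illustrative, not ARMO source): one thread owns one cell and gathers the fluxes over its four tetrahedron faces, so no two threads ever write the same cell and the expensive atomic operations of a face-based scatter loop are avoided.

    // Gather-style update: thread i accumulates the precomputed face
    // fluxes of cell i and writes only data it owns (no atomics needed).
    __global__ void update_cells(int n, const double* face_flux,
                                 double* w, double dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int c = 0; c < 6; ++c) {           // six conserved variables
            double acc = 0.0;
            for (int f = 0; f < 4; ++f)         // four faces per tetrahedron
                acc += face_flux[(4 * i + f) * 6 + c];
            w[c * n + i] -= dt * acc;           // structure-of-arrays layout
        }
    }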