3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires lots of memory and computational power GPUs are very suitable for direct methods Have great instruction throughput and high memory bandwidth How will it scale on multiple GPUs?
cmc-fluid-solver Open source project on Google Code Started at CMC faculty of MSU, Russia CPU: OpenMP, GPU: CUDA 3D fluid simulation using ADI solver Key people: MSU: Vilen Paskonov, Sergey Berezin NVIDIA: Nikolay Sakharnykh, Nikolay Markovskiy
Outline Fluid Simulation in 3D domain Problem statement, applications ADI numerical method GPU implementation details, optimizations Performance analysis Multi-GPU implementation
Problem Statement Viscous incompressible fluid in a 3D domain Arbitrary closed geometry for boundaries; boundary conditions: no-slip, free, injection Euler coordinates: velocity and temperature
Applications Sea and ocean simulation Additional parameters: salinity, etc. Low-speed gas flow Inside 3D channel Around objects
Definitions Density, velocity, temperature, pressure Equation of state: describes the relation between pressure, density, and temperature; example: the ideal gas law with the gas constant for air
Governing equations Continuity equation $\partial\rho/\partial t + \nabla\cdot(\rho\mathbf{u}) = 0$; for incompressible fluids it reduces to $\nabla\cdot\mathbf{u} = 0$ Navier-Stokes equations in dimensionless form, using the equation of state; the Reynolds number $Re$ enters as the inertia/viscosity ratio
Governing equations Energy equation in dimensionless form, using the equation of state; it involves the heat capacity ratio, the Prandtl number, and a dissipative function
ADI numerical method Split each time step into three one-dimensional sweeps: along X (Y, Z fixed), along Y (X, Z fixed), along Z (X, Y fixed)
ADI numerical method Benefits: Doesn't impose hard requirements on the time step Domain decomposition: each step can be well parallelized Many applications: Computational Fluid Dynamics, Computational Finance, linear 3D PDEs
ADI method iterations Use global iterations for the whole system of equations: previous time step -> solve X-dir equations -> solve Y-dir equations -> solve Z-dir equations (repeated over the global iterations) -> update all variables -> next time step Some equations are non-linear: use local iterations to approximate the non-linear term
Discretization Use a regular grid and an implicit finite difference scheme Second order in space, first order in time Leads to a tridiagonal system for each unknown along the sweep direction: an independent system for each fixed pair (j, k)
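For reference, each such system has the standard tridiagonal form (the coefficient names match the a, b, c, d arrays built later; boundary rows are assumed to have a zero off-diagonal):

\[
a_i \, x_{i-1} + b_i \, x_i + c_i \, x_{i+1} = d_i, \qquad i = 1, \dots, N, \qquad a_1 = c_N = 0
\]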
Tridiagonal systems Need to solve lots of tridiagonal systems Sizes of systems may vary across the grid, since grid cells are classified as outside, boundary, or inside cells
Implementation details

for each direction X, Y, Z {
    for each local iteration {
        for each equation u, v, w, T {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}
GPU implementation
Store all data arrays entirely in GPU memory; reduce the number of PCI-E transfers to a minimum
Map 3D arrays to linear memory (X, Y, Z): index = Z + Y * dimz + X * dimy * dimz, so Z is the fastest-changing dimension
Main kernels: build matrix coefficients, solve tridiagonal systems
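A minimal sketch of this indexing, assuming the layout above (the helper name idx3d is illustrative, not taken from the project):

// Map (x, y, z) to a linear offset; Z is the fastest-changing dimension.
__host__ __device__ inline int idx3d(int x, int y, int z, int dimy, int dimz)
{
    return z + y * dimz + x * dimy * dimz;
}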
Building matrices
Input data: previous and non-linear 3D layers
Each thread computes the coefficients a, b, c of one tridiagonal matrix and the right-hand side vector d
Use C++ templates for direction and equation
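A hedged sketch of what such a templated build kernel could look like; the enum names, kernel name, and the trivial placeholder coefficients are assumptions, and only the one-thread-per-system structure and the template parameters for direction/equation follow the slides:

enum Dir { DIR_X, DIR_Y, DIR_Z };
enum Eq  { EQ_U, EQ_V, EQ_W, EQ_T };

// One thread builds the coefficients and rhs of one tridiagonal system.
template<Dir dir, Eq eq>
__global__ void buildMatrices(const float* prev, const float* nonlin,
                              float* a, float* b, float* c, float* d,
                              int numSystems, int sysLen)
{
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= numSystems) return;

    for (int p = 0; p < sysLen; p++) {
        int i = p * numSystems + sys;   // interleaved matrix layout
        // Placeholder values: the real coefficients come from the implicit
        // finite difference scheme for the chosen direction and equation.
        a[i] = 0.0f;
        b[i] = 1.0f;
        c[i] = 0.0f;
        d[i] = prev[i];
        (void)nonlin;   // the non-linear layer enters the real coefficients
    }
}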
Tesla C2050 (SP) Building matrices performance
(Chart: total time in seconds, build kernels vs build + solve, for the X, Y, and Z directions)

Dir   Requests per warp   L1 global load hit %   IPC
X     2-3                 25-45                  1.4
Y     2-3                 33-44                  1.4
Z     32                  0-15                   0.2

Poor Z direction performance compared to X/Y: each thread accesses its own contiguous memory region, so the accesses within a warp are uncoalesced and cause many cache misses
Building matrices optimization
Run the Z phase in transposed XZY space: better locality for memory accesses, at the cost of additional overhead for the transpose
Per time step: X and Y local iterations in XYZ space, transpose the input arrays to XZY, Z local iterations in XZY space, transpose the output arrays back to XYZ
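A minimal, unoptimized sketch of such an XYZ-to-XZY transpose (a production version would use shared-memory tiling; the kernel name and launch mapping are assumptions):

// Swap the two innermost dimensions: src is (X, Y, Z) with Z fastest,
// dst is (X, Z, Y) with Y fastest.
__global__ void transposeXYZtoXZY(const float* src, float* dst,
                                  int dimx, int dimy, int dimz)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.z;
    if (x < dimx && y < dimy && z < dimz)
        dst[y + z * dimy + x * dimy * dimz] =
            src[z + y * dimz + x * dimy * dimz];
}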
Tesla C2050 (SP) Building matrices - optimization
(Chart: total time in seconds, build + solve plus transpose, for X dir, Y dir, Z dir, and the optimized Z dir; the optimized Z direction is 2.5x faster)

Z dir build kernel   Requests per warp   L1 global load hit %   IPC
Original             32                  0-15                   0.2
Transposed           2-3                 30-38                  1.3

Tridiagonal solver time dominates over the transpose; the transpose will take a smaller share with more local iterations
Solving tridiagonal systems
Number of tridiagonal systems ~ grid size squared
The sweep (Thomas) algorithm is the most efficient in this case: 1 thread solves 1 system

// forward sweep: eliminate the sub-diagonal and normalize
for (int p = 1; p < end; p++) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// back substitution (the value at position 'end' holds the boundary value)
for (int i = end-1; i >= 0; i--)
    get(x,i) = get(d,i) - get(c,i) * get(x,i+1);
Solving tridiagonal systems
Matrix layout is crucial for performance
Interleaved layout: element 0 of every system is stored contiguously, then element 1, and so on (similar to ELLPACK for sparse matrices); with one thread per system this is sweep friendly
X and Y direction matrices are interleaved by default; Z is interleaved as well when computed in transposed space
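A sketch of the accessor the sweep code above relies on, assuming this interleaved layout (the get macro and the names numSystems/sys are illustrative, not the project's code):

// Element p of all systems is stored contiguously, so consecutive threads
// (one system each) make coalesced accesses at every step of the sweep.
#define get(arr, p) arr[(p) * numSystems + sys]
// arr : base pointer of a coefficient array (a, b, c, d or x)
// p   : position inside the system (0 .. sysLen - 1)
// sys : index of the system handled by this thread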
Solving tridiagonal systems L1/L2 effect on performance: using 48K L1 instead of 16K gives a 10-15% speed-up; turning L1 off reduces performance by 10%; L1 really helps with misaligned accesses and spatial reuse Occupancy >= 50%: running 128 threads per block, 26-42 registers per thread (different for u, v, w, T), no shared memory
Performance benchmark CPU configuration: Intel Core i7-3930K @ 3.2 GHz, 6 cores / 12 threads Use OpenMP for CPU parallelization; mostly memory bandwidth bound, some parts achieve ~4x speed-up vs 1 core GPU configuration: NVIDIA Tesla C2070
Test cases Box Pipe: simple geometry, systems of the same size, need to compute at all points of the rectangular grid
Test cases White Sea: complex geometry, big divergence in system sizes, need to compute only inside the area
Performance results Box Pipe, grid 128x128x128 (charts: segments/ms for Solve X, Solve Y, Solve Z, and Total, CPU vs GPU): 9.3x GPU speed-up in single precision, 8.4x in double precision
Performance results White Sea, grid 256x192x160 (charts: segments/ms for Solve X, Solve Y, Solve Z, and Total, CPU vs GPU): 9.5x GPU speed-up in single precision, 10.3x in double precision
Outline Fluid Simulation in 3D domain Multi-GPU implementation General splitting algorithm Running computations using CUDA Benchmarking and performance analysis Improving weak scaling
Multi-GPU motivation Limited amount of available memory: 3D arrays for the grid, temporary arrays, matrices; the maximum grid that fits into a Tesla M2050 is ~224^3 Distribute the computations between multiple GPUs and multiple nodes: compute larger grids and speed up the computations
Main Idea of mgpu Systems along Y/Z are solved independently in parallel on each GPU, with no data transfer Along X the data must be synchronized The grid is split along X across GPU 0, GPU 1, GPU 2, ... while computing the alternating X, Y, Z directions
CUDA - parallelization
Split the grid along X (the longest stride in Z + Y * dimz + X * dimy * dimz)
Launch kernels on several GPUs from one host thread:

for (int i = 0; i < numDev; i++) {
    cudaSetDevice(i);                   // switch device
    kernel<<<...>>>(devArray[i], ...);  // computation on GPU i
}

CUDA 4.x data transfer: async P2P through PCI-E (cudaMemcpyPeerAsync)
Synchronization of the Nonlinear Layer

// send the right boundary of GPU i to the left halo of GPU i+1
for (int i = 0; i < numDev - 1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i,
                        num_bytes, devStream[i]);
// might need multi-device synchronization here
// send the left boundary of GPU i to the right halo of GPU i-1
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i,
                        num_bytes, devStream[i]);

High aggregate throughput on an 8-GPU system; the communication impact is not significant
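Before cudaMemcpyPeerAsync can use direct P2P transfers over PCI-E, peer access has to be enabled between the devices; a minimal sketch, with error handling omitted:

// Enable peer access between every pair of devices that supports it.
for (int i = 0; i < numDev; i++) {
    cudaSetDevice(i);
    for (int j = 0; j < numDev; j++) {
        if (i == j) continue;
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, i, j);
        if (canAccess)
            cudaDeviceEnablePeerAccess(j, 0);  // flags must be 0
    }
}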
Solve X (tridiagonal solver) The split along X over GPU 0, GPU 1, GPU 2 divides the X-direction segments into bound (entirely on one GPU), partially bound, and unbound segments, with halo layers at the GPU boundaries
Solve X (tridiagonal solver) Process bound segments without inter-GPU communication Interleave segments for better memory access, one segment per thread, aligned to the left Gauss elimination (forward and backward sweeps) requires communication at the GPU boundaries
Solve X Split the grid along the long X stride (Array[i*dimz*dimy + ...]), allocating layers of the grid to the GPUs
3D segment analysis, then the forward sweep along X proceeds GPU by GPU (only one GPU active at a time), followed by the back sweep along X in the reverse order
Result: no speed-up along X
Benchmarks Multiple GPUs: 8 Tesla M2050 with P2P Multiple nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each Sample tests: Box Pipe and White Sea
Results: 8 GPUs, 1 MPI node (Tesla M2050, grid 224^3). Charts of millions of points per second vs number of GPUs (1, 2, 4, 8): Box Pipe scales by 2.9x and up to 4.5x, White Sea only by about 1.35-1.4x
1 GPU Efficiency (Tesla M2090). Charts of points/ms vs grid size (up to 256) for Box Pipe and White Sea. Estimate the amount of work per GPU in an 8-GPU system using a single GPU: 256^3 / 8 = 128^3. Box Pipe: enough work for a single GPU. White Sea occupies only about 5% of the grid volume, so a grid size of 128^3 is not enough.
Results: 1 GPU per node, 4 MPI nodes (Tesla M2090). Charts of millions of points per second vs number of nodes (1, 2, 4): Box Pipe total scaling x2.8, White Sea x1.2
Load Balancing X splitting criteria: equal volumes or equal number of segments Performance benefit observed: up to 15.5% (Chart: number of segments as a function of x, 0-288, Tesla M2090)
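A hedged host-side sketch of the equal-segments criterion (function and variable names are assumptions, not the project's API): walk the per-slice segment counts along X and cut whenever a GPU has accumulated its fair share.

#include <vector>

// Given the number of tridiagonal segments in each X slice, return the first
// X index owned by each GPU (split has numDev + 1 entries, split[numDev] == dimx),
// balancing the segment count instead of the volume.
std::vector<int> splitEqualSegments(const std::vector<int>& segPerSlice, int numDev)
{
    long long total = 0;
    for (int s : segPerSlice) total += s;

    std::vector<int> split(numDev + 1, (int)segPerSlice.size());
    split[0] = 0;
    long long acc = 0;
    int gpu = 1;
    for (int x = 0; x < (int)segPerSlice.size() && gpu < numDev; x++) {
        acc += segPerSlice[x];
        if (acc * numDev >= total * gpu)   // the previous GPU has its share
            split[gpu++] = x + 1;
    }
    return split;
}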
Load Balancing, White Sea (288x320x320), Tesla M2090. Per-GPU time breakdown (SweepX, SweepY, SweepZ, Transpose) for GPUs 0-3 under three splittings: Even X: t_total = 47.3 ms; Even segments: t_total = 44.3 ms; Even volumes: t_total = 44.4 ms
Analysis All parts of the solver but one (Gauss elimination along X) are fully parallel Communication (using P2P + InfiniBand) is not a big issue for the given problem size Weak scaling is bad: use blocks to hide the latency of the X sweeps
Improved Solve X
Split the grid along X across GPU 0, GPU 1, GPU 2, ... as before, and additionally split it into XY blocks B0, B1, ... along the Z direction
Sort the segments and sweep through all scalar fields at once
Forward sweep along X on the current block, with an async halo send forward to the next GPU; move to the next block group; backward sweep along X, with an async halo send backward
Equal work per node!
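A hedged host-side sketch of the pipelined forward pass under these assumptions (the kernel, stream, event, and buffer names are all placeholders): GPU d can start block b as soon as GPU d-1 has delivered that block's halo, so after a short ramp-up all GPUs are busy.

#include <cuda_runtime.h>

// Placeholder: forward sweep of one XY block of the local sub-domain.
__global__ void forwardSweepBlock(float* data, const float* haloIn,
                                  float* haloOut, int block) { /* ... */ }

void pipelinedForwardSweep(float** dData, float** dHaloIn, float** dHaloOut,
                           cudaStream_t* stream, cudaEvent_t* haloReady,
                           int numDev, int numBlocks, size_t haloBytes)
{
    // Wavefront schedule: at step s, GPU d works on block b = s - d.
    for (int s = 0; s < numBlocks + numDev - 1; s++) {
        for (int d = 0; d < numDev; d++) {
            int b = s - d;
            if (b < 0 || b >= numBlocks) continue;
            cudaSetDevice(d);
            if (d > 0)  // wait until GPU d-1 has delivered block b's halo
                cudaStreamWaitEvent(stream[d], haloReady[(d - 1) * numBlocks + b], 0);
            forwardSweepBlock<<<64, 128, 0, stream[d]>>>(dData[d], dHaloIn[d],
                                                         dHaloOut[d], b);
            if (d + 1 < numDev) {
                cudaMemcpyPeerAsync(dHaloIn[d + 1], d + 1, dHaloOut[d], d,
                                    haloBytes, stream[d]);
                cudaEventRecord(haloReady[d * numBlocks + b], stream[d]);
            }
        }
    }
}

The haloReady events are assumed to be created beforehand (for example with cudaEventCreateWithFlags and cudaEventDisableTiming); the backward pass mirrors this loop with the GPU order reversed.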
Algorithm (two CUDA streams per GPU) Blocks are indexed along Z (0 .. N_blocks - 1), nodes along X (0 .. N_nodes - 1). cudaStream 1 runs the forward sweep on block i_block while cudaStream 2 runs the backward sweep on block i_block - 2(N_nodes - i_node - 1). Each node receives the halo data X from its neighbors i_node - 1 and i_node + 1 and sends its own X_(i_node) after sweeping a block.
Improved Solve XY Separate buffer for the Y sweeps; the grid is split into Y blocks B0, B1, ... Block Y sweeps are performed independently in separate CUDA streams, which helps overlap data transfer with compute
Weak Scaling (Tesla M2050). Chart: average time for Solve XYZ (ms) vs number of GPUs (up to 8) for Box Pipe, grids 224^3, 288^3, 352^3, 448^3
Big Systems Limit (8 Tesla M2050 GPUs). Chart: average time for Solve XYZ (ms) vs number of blocks (1-32). Consider one scalar field: no physics, more available RAM; grid 768^3. With larger grid sizes, the curve minimum shifts down/right
Conclusions The GPU outperforms a multi-core CPU by a factor of over 10x The GPU works well with complex input domains Performance and scaling factors heavily depend on the input geometry and the size of the grid Efficient work distribution methods are essential for performance Using block-splitting for ADI improves the scaling factor by hiding the dependency of sweep processing
Future work Test on large-scale systems, potentially on the Lomonosov supercomputer at MSU (GPU partition with a peak performance of 863 TFlops) Memory usage optimizations Explore different tridiagonal solver approaches
Questions? Thank You!