3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires lots of memory and computational power GPUs are very suitable for direct methods Have great instruction throughput and high memory bandwidth How will it scale on multiple GPUs?
cmc-fluid-solver Open source project on Google Code Started at CMC faculty of MSU, Russia CPU: OpenMP, GPU: CUDA 3D fluid simulation using ADI solver Key people: MSU: Vilen Paskonov, Sergey Berezin NVIDIA: Nikolay Sakharnykh, Nikolay Markovskiy
Outline Fluid Simulation in 3D domain Problem statement, applications ADI numerical method GPU implementation details, optimizations Performance analysis Multi-GPU implementation
Problem Statement Viscous incompressible fluid in a 3D domain Arbitrary closed geometry for boundaries; boundary conditions: no-slip, free, injection Euler coordinates: velocity and temperature
Applications Sea and ocean simulation Additional parameters: salinity, etc. Low-speed gas flow Inside 3D channel Around objects
Definitions Density, velocity, temperature, pressure Equation of state: describes the relation between pressure, density, and temperature; example: the ideal gas law with the gas constant for air
Governing equations Continuity equation $\partial\rho/\partial t + \nabla\cdot(\rho\mathbf{u}) = 0$; for incompressible fluids it reduces to $\nabla\cdot\mathbf{u} = 0$ Navier-Stokes equations in dimensionless form, using the equation of state; the Reynolds number $Re$ enters as the inertia/viscosity ratio
Governing equations Energy equation in dimensionless form, using the equation of state; it involves the heat capacity ratio, the Prandtl number, and a dissipative function
ADI numerical method Split each time step into three one-dimensional sweeps: along X (Y, Z fixed), along Y (X, Z fixed), along Z (X, Y fixed)
ADI numerical method Benefits: Doesn't impose hard requirements on the time step Domain decomposition: each step can be well parallelized Many applications: Computational Fluid Dynamics, Computational Finance, linear 3D PDEs
ADI method iterations Use global iterations for the whole system of equations: previous time step -> solve X-dir equations -> solve Y-dir equations -> solve Z-dir equations (repeated over the global iterations) -> update all variables -> next time step Some equations are non-linear: use local iterations to approximate the non-linear term
Discretization Use a regular grid and an implicit finite difference scheme Second order in space, first order in time Leads to a tridiagonal system for each unknown along the sweep direction: an independent system for each fixed pair (j, k)
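For reference, each such system has the standard tridiagonal form (the coefficient names match the a, b, c, d arrays built later; boundary rows are assumed to have a zero off-diagonal):

\[
a_i \, x_{i-1} + b_i \, x_i + c_i \, x_{i+1} = d_i, \qquad i = 1, \dots, N, \qquad a_1 = c_N = 0
\]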
Tridiagonal systems Need to solve lots of tridiagonal systems Sizes of systems may vary across the grid, since grid cells are classified as outside, boundary, or inside cells
Implementation details

for each direction X, Y, Z {
    for each local iteration {
        for each equation u, v, w, T {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}
GPU implementation
Store all data arrays entirely in GPU memory; reduce the number of PCI-E transfers to a minimum
Map 3D arrays to linear memory (X, Y, Z): index = Z + Y * dimz + X * dimy * dimz, so Z is the fastest-changing dimension
Main kernels: build matrix coefficients, solve tridiagonal systems
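A minimal sketch of this indexing, assuming the layout above (the helper name idx3d is illustrative, not taken from the project):

// Map (x, y, z) to a linear offset; Z is the fastest-changing dimension.
__host__ __device__ inline int idx3d(int x, int y, int z, int dimy, int dimz)
{
    return z + y * dimz + x * dimy * dimz;
}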
Building matrices
Input data: previous and non-linear 3D layers
Each thread computes the coefficients a, b, c of one tridiagonal matrix and the right-hand side vector d
Use C++ templates for direction and equation
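A hedged sketch of what such a templated build kernel could look like; the enum names, kernel name, and the trivial placeholder coefficients are assumptions, and only the one-thread-per-system structure and the template parameters for direction/equation follow the slides:

enum Dir { DIR_X, DIR_Y, DIR_Z };
enum Eq  { EQ_U, EQ_V, EQ_W, EQ_T };

// One thread builds the coefficients and rhs of one tridiagonal system.
template<Dir dir, Eq eq>
__global__ void buildMatrices(const float* prev, const float* nonlin,
                              float* a, float* b, float* c, float* d,
                              int numSystems, int sysLen)
{
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= numSystems) return;

    for (int p = 0; p < sysLen; p++) {
        int i = p * numSystems + sys;   // interleaved matrix layout
        // Placeholder values: the real coefficients come from the implicit
        // finite difference scheme for the chosen direction and equation.
        a[i] = 0.0f;
        b[i] = 1.0f;
        c[i] = 0.0f;
        d[i] = prev[i];
        (void)nonlin;   // the non-linear layer enters the real coefficients
    }
}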
Tesla C2050 (SP) Building matrices performance
(Chart: total time in seconds, build kernels vs build + solve, for the X, Y, and Z directions)

Dir   Requests per warp   L1 global load hit %   IPC
X     2-3                 25-45                  1.4
Y     2-3                 33-44                  1.4
Z     32                  0-15                   0.2

Poor Z direction performance compared to X/Y: each thread accesses its own contiguous memory region, so the accesses within a warp are uncoalesced and cause many cache misses
Building matrices optimization
Run the Z phase in transposed XZY space: better locality for memory accesses, at the cost of additional overhead for the transpose
Per time step: X and Y local iterations in XYZ space, transpose the input arrays to XZY, Z local iterations in XZY space, transpose the output arrays back to XYZ
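A minimal, unoptimized sketch of such an XYZ-to-XZY transpose (a production version would use shared-memory tiling; the kernel name and launch mapping are assumptions):

// Swap the two innermost dimensions: src is (X, Y, Z) with Z fastest,
// dst is (X, Z, Y) with Y fastest.
__global__ void transposeXYZtoXZY(const float* src, float* dst,
                                  int dimx, int dimy, int dimz)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.z;
    if (x < dimx && y < dimy && z < dimz)
        dst[y + z * dimy + x * dimy * dimz] =
            src[z + y * dimz + x * dimy * dimz];
}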
Tesla C2050 (SP) Building matrices - optimization
(Chart: total time in seconds, build + solve plus transpose, for X dir, Y dir, Z dir, and the optimized Z dir; the optimized Z direction is 2.5x faster)

Z dir build kernel   Requests per warp   L1 global load hit %   IPC
Original             32                  0-15                   0.2
Transposed           2-3                 30-38                  1.3

Tridiagonal solver time dominates over the transpose; the transpose will take a smaller share with more local iterations
Solving tridiagonal systems
Number of tridiagonal systems ~ grid size squared
The sweep (Thomas) algorithm is the most efficient in this case: 1 thread solves 1 system

// forward sweep: eliminate the sub-diagonal and normalize
for (int p = 1; p < end; p++) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// back substitution (the value at position 'end' holds the boundary value)
for (int i = end-1; i >= 0; i--)
    get(x,i) = get(d,i) - get(c,i) * get(x,i+1);
Solving tridiagonal systems
Matrix layout is crucial for performance
Interleaved layout: element 0 of every system is stored contiguously, then element 1, and so on (similar to ELLPACK for sparse matrices); with one thread per system this is sweep friendly
X and Y direction matrices are interleaved by default; Z is interleaved as well when computed in transposed space
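A sketch of the accessor the sweep code above relies on, assuming this interleaved layout (the get macro and the names numSystems/sys are illustrative, not the project's code):

// Element p of all systems is stored contiguously, so consecutive threads
// (one system each) make coalesced accesses at every step of the sweep.
#define get(arr, p) arr[(p) * numSystems + sys]
// arr : base pointer of a coefficient array (a, b, c, d or x)
// p   : position inside the system (0 .. sysLen - 1)
// sys : index of the system handled by this thread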
Solving tridiagonal systems L1/L2 effect on performance: using 48K L1 instead of 16K gives a 10-15% speed-up; turning L1 off reduces performance by 10%; L1 really helps with misaligned accesses and spatial reuse Occupancy >= 50%: running 128 threads per block, 26-42 registers per thread (different for u, v, w, T), no shared memory
Performance benchmark CPU configuration: Intel Core i7-3930K @ 3.2 GHz, 6 cores / 12 threads Use OpenMP for CPU parallelization; mostly memory bandwidth bound, some parts achieve ~4x speed-up vs 1 core GPU configuration: NVIDIA Tesla C2070
Test cases Box Pipe: simple geometry, systems of the same size, need to compute at all points of the rectangular grid
Test cases White Sea: complex geometry, big divergence in system sizes, need to compute only inside the area
Performance results Box Pipe, grid 128x128x128 (charts: segments/ms for Solve X, Solve Y, Solve Z, and Total, CPU vs GPU): 9.3x GPU speed-up in single precision, 8.4x in double precision
Performance results White Sea, grid 256x192x160 (charts: segments/ms for Solve X, Solve Y, Solve Z, and Total, CPU vs GPU): 9.5x GPU speed-up in single precision, 10.3x in double precision
Outline Fluid Simulation in 3D domain Multi-GPU implementation General splitting algorithm Running computations using CUDA Benchmarking and performance analysis Improving weak scaling
Multi-GPU motivation Limited amount of available memory: 3D arrays for the grid, temporary arrays, matrices; the maximum grid that fits into a Tesla M2050 is ~224^3 Distribute the computations between multiple GPUs and multiple nodes: compute larger grids and speed up the computations
Main Idea of mgpu Systems along Y/Z are solved independently in parallel on each GPU, with no data transfer Along X the data must be synchronized The grid is split along X across GPU 0, GPU 1, GPU 2, ... while computing the alternating X, Y, Z directions
CUDA - parallelization
Split the grid along X (the longest stride in Z + Y * dimz + X * dimy * dimz)
Launch kernels on several GPUs from one host thread:

for (int i = 0; i < numDev; i++) {
    cudaSetDevice(i);                   // switch device
    kernel<<<...>>>(devArray[i], ...);  // computation on GPU i
}

CUDA 4.x data transfer: async P2P through PCI-E (cudaMemcpyPeerAsync)
Synchronization of the Nonlinear Layer

// send the right boundary of GPU i to the left halo of GPU i+1
for (int i = 0; i < numDev - 1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i,
                        num_bytes, devStream[i]);
// might need multi-device synchronization here
// send the left boundary of GPU i to the right halo of GPU i-1
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i,
                        num_bytes, devStream[i]);

High aggregate throughput on an 8-GPU system; the communication impact is not significant
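Before cudaMemcpyPeerAsync can use direct P2P transfers over PCI-E, peer access has to be enabled between the devices; a minimal sketch, with error handling omitted:

// Enable peer access between every pair of devices that supports it.
for (int i = 0; i < numDev; i++) {
    cudaSetDevice(i);
    for (int j = 0; j < numDev; j++) {
        if (i == j) continue;
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, i, j);
        if (canAccess)
            cudaDeviceEnablePeerAccess(j, 0);  // flags must be 0
    }
}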
Solve X (tridiagonal solver) The split along X over GPU 0, GPU 1, GPU 2 divides the X-direction segments into bound (entirely on one GPU), partially bound, and unbound segments, with halo layers at the GPU boundaries
Solve X (tridiagonal solver) Process bound segments without inter-GPU communication Interleave segments for better memory access, one segment per thread, aligned to the left Gauss elimination (forward and backward sweeps) requires communication at the GPU boundaries
Solve X Split the grid along the long X stride (Array[i*dimz*dimy + ...]), allocating layers of the grid to the GPUs
3D segment analysis, then the forward sweep along X proceeds GPU by GPU (only one GPU active at a time), followed by the back sweep along X in the reverse order
Result: no speed-up along X
Benchmarks Multiple GPUs: 8 Tesla M2050 with P2P Multiple nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each Sample tests: Box Pipe and White Sea
Results: 8 GPUs, 1 MPI node (Tesla M2050, grid 224^3). Charts of millions of points per second vs number of GPUs (1, 2, 4, 8): Box Pipe scales by 2.9x and up to 4.5x, White Sea only by about 1.35-1.4x
1 GPU Efficiency (Tesla M2090). Charts of points/ms vs grid size (up to 256) for Box Pipe and White Sea. Estimate the amount of work per GPU in an 8-GPU system using a single GPU: 256^3 / 8 = 128^3. Box Pipe: enough work for a single GPU. White Sea occupies only about 5% of the grid volume, so a grid size of 128^3 is not enough.
Results: 1 GPU per node, 4 MPI nodes (Tesla M2090). Charts of millions of points per second vs number of nodes (1, 2, 4): Box Pipe total scaling x2.8, White Sea x1.2
Load Balancing X splitting criteria: equal volumes or equal number of segments Performance benefit observed: up to 15.5% (Chart: number of segments as a function of x, 0-288, Tesla M2090)
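A hedged host-side sketch of the equal-segments criterion (function and variable names are assumptions, not the project's API): walk the per-slice segment counts along X and cut whenever a GPU has accumulated its fair share.

#include <vector>

// Given the number of tridiagonal segments in each X slice, return the first
// X index owned by each GPU (split has numDev + 1 entries, split[numDev] == dimx),
// balancing the segment count instead of the volume.
std::vector<int> splitEqualSegments(const std::vector<int>& segPerSlice, int numDev)
{
    long long total = 0;
    for (int s : segPerSlice) total += s;

    std::vector<int> split(numDev + 1, (int)segPerSlice.size());
    split[0] = 0;
    long long acc = 0;
    int gpu = 1;
    for (int x = 0; x < (int)segPerSlice.size() && gpu < numDev; x++) {
        acc += segPerSlice[x];
        if (acc * numDev >= total * gpu)   // the previous GPU has its share
            split[gpu++] = x + 1;
    }
    return split;
}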
Load Balancing, White Sea (288x320x320), Tesla M2090. Per-GPU time breakdown (SweepX, SweepY, SweepZ, Transpose) for GPUs 0-3 under three splittings: Even X: t_total = 47.3 ms; Even segments: t_total = 44.3 ms; Even volumes: t_total = 44.4 ms
Analysis All parts of the solver but one (Gauss elimination along X) are fully parallel Communication (using P2P + InfiniBand) is not a big issue for the given problem size Weak scaling is bad: use blocks to hide the latency of the X sweeps
Improved Solve X
Split the grid along X across GPU 0, GPU 1, GPU 2, ... as before, and additionally split it into XY blocks B0, B1, ... along the Z direction
Sort the segments and sweep through all scalar fields at once
Forward sweep along X on the current block, with an async halo send forward to the next GPU; move to the next block group; backward sweep along X, with an async halo send backward
Equal work per node!
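A hedged host-side sketch of the pipelined forward pass under these assumptions (the kernel, stream, event, and buffer names are all placeholders): GPU d can start block b as soon as GPU d-1 has delivered that block's halo, so after a short ramp-up all GPUs are busy.

#include <cuda_runtime.h>

// Placeholder: forward sweep of one XY block of the local sub-domain.
__global__ void forwardSweepBlock(float* data, const float* haloIn,
                                  float* haloOut, int block) { /* ... */ }

void pipelinedForwardSweep(float** dData, float** dHaloIn, float** dHaloOut,
                           cudaStream_t* stream, cudaEvent_t* haloReady,
                           int numDev, int numBlocks, size_t haloBytes)
{
    // Wavefront schedule: at step s, GPU d works on block b = s - d.
    for (int s = 0; s < numBlocks + numDev - 1; s++) {
        for (int d = 0; d < numDev; d++) {
            int b = s - d;
            if (b < 0 || b >= numBlocks) continue;
            cudaSetDevice(d);
            if (d > 0)  // wait until GPU d-1 has delivered block b's halo
                cudaStreamWaitEvent(stream[d], haloReady[(d - 1) * numBlocks + b], 0);
            forwardSweepBlock<<<64, 128, 0, stream[d]>>>(dData[d], dHaloIn[d],
                                                         dHaloOut[d], b);
            if (d + 1 < numDev) {
                cudaMemcpyPeerAsync(dHaloIn[d + 1], d + 1, dHaloOut[d], d,
                                    haloBytes, stream[d]);
                cudaEventRecord(haloReady[d * numBlocks + b], stream[d]);
            }
        }
    }
}

The haloReady events are assumed to be created beforehand (for example with cudaEventCreateWithFlags and cudaEventDisableTiming); the backward pass mirrors this loop with the GPU order reversed.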
Algorithm (two CUDA streams per GPU) Blocks are indexed along Z (0 .. N_blocks - 1), nodes along X (0 .. N_nodes - 1). cudaStream 1 runs the forward sweep on block i_block while cudaStream 2 runs the backward sweep on block i_block - 2(N_nodes - i_node - 1). Each node receives the halo data X from its neighbors i_node - 1 and i_node + 1 and sends its own X_(i_node) after sweeping a block.
Improved Solve XY Separate buffer for the Y sweeps; the grid is split into Y blocks B0, B1, ... Block Y sweeps are performed independently in separate CUDA streams, which helps overlap data transfer with compute
Weak Scaling (Tesla M2050). Chart: average time for Solve XYZ (ms) vs number of GPUs (up to 8) for Box Pipe, grids 224^3, 288^3, 352^3, 448^3
Big Systems Limit (8 Tesla M2050 GPUs). Chart: average time for Solve XYZ (ms) vs number of blocks (1-32). Consider one scalar field: no physics, more available RAM; grid 768^3. With larger grid sizes, the curve minimum shifts down/right
Conclusions The GPU outperforms a multi-core CPU by a factor of over 10x The GPU works well with complex input domains Performance and scaling factors heavily depend on the input geometry and the size of the grid Efficient work distribution methods are essential for performance Using block-splitting for ADI improves the scaling factor by hiding the dependency of sweep processing
Future work Test on large-scale systems, potentially on the Lomonosov supercomputer at MSU (GPU partition with a peak performance of 863 TFlops) Memory usage optimizations Explore different tridiagonal solver approaches
Questions? Thank You!