3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA


1 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

2 Introduction. Fluid simulation using direct numerical methods gives the most accurate result but requires lots of memory and computational power. GPUs are very suitable for direct methods: they have great instruction throughput and high memory bandwidth. How will it scale on multiple GPUs?

3 cmc-fluid-solver. Open source project on Google Code, started at the CMC faculty of MSU, Russia. CPU: OpenMP; GPU: CUDA. 3D fluid simulation using an ADI solver. Key people: MSU: Vilen Paskonov, Sergey Berezin; NVIDIA: Nikolai Sakharnykh, Nikolay Markovskiy.

4 Outline. Fluid simulation in 3D domain: problem statement, applications; ADI numerical method; GPU implementation details, optimizations; performance analysis. Multi-GPU implementation.

5 Problem Statement. Viscous incompressible fluid in a 3D domain. Arbitrary closed geometry for boundaries; boundary conditions: no-slip, free, injection. Eulerian coordinates: velocity and temperature.

6 Applications. Sea and ocean simulation (additional parameters: salinity, etc.). Low-speed gas flow: inside a 3D channel, around objects.

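7 Definitions. Density $\rho$, velocity $\mathbf{u} = (u, v, w)$, temperature $T$, pressure $p$. The equation of state describes the relation between $p$, $\rho$ and $T$. Example: $p = \rho R T$, where $R$ is the gas constant for air.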

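8 Governing equations. Continuity equation for density $\rho$ and velocity $\mathbf{u}$: $\partial \rho / \partial t + \nabla \cdot (\rho \mathbf{u}) = 0$; for incompressible fluids: $\nabla \cdot \mathbf{u} = 0$. Navier-Stokes equations: dimensionless form, using the equation of state; parameter: Reynolds number (= inertia/viscosity ratio).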

9 Governing equations. Energy equation: dimensionless form, using the equation of state; parameters: heat capacity ratio, Prandtl number, dissipative function.

10 ADI numerical method. (Figure: alternating sweeps over the X, Y, Z grid: along X with fixed Y, Z; along Y with fixed X, Z; along Z with fixed X, Y.)

11 ADI numerical method. Benefits: doesn't have hard requirements on the time step; domain decomposition: each step can be well parallelized. Many applications: computational fluid dynamics, computational finance, linear 3D PDEs.


12 ADI method iterations. Use global iterations for the whole system of equations. Starting from the previous time step, each global iteration solves the X-direction equations, then the Y-direction equations, then the Z-direction equations, and updates all variables; the result is the next time step. Some equations are not linear: use local iterations to approximate the non-linear term.

13 Discretization. Use a regular grid and an implicit finite difference scheme: second order in space, first order in time. This leads to a tridiagonal system along the sweep direction, with an independent system for each fixed pair (j, k).

14 Tridiagonal systems. Need to solve lots of tridiagonal systems. Sizes of systems may vary across the grid. (Figure: systems 1, 2 and 3 of different lengths on a grid; legend: outside cell, boundary cell, inside cell.)

15 Implementation details
    <for each direction X, Y, Z> {
        <for each local iteration> {
            <for each equation u, v, w, T> {
                build tridiagonal matrices and rhs
                solve tridiagonal systems
            }
            update non-linear terms
        }
    }

16 GPU implementation. Store all data arrays entirely in GPU memory: reduce the number of PCI-E transfers to a minimum. Map 3D arrays to linear memory: (X, Y, Z) maps to Z + Y * dimz + X * dimy * dimz, so Z is the fastest-changing dimension. Main kernel: build matrix coefficients, solve tridiagonal systems.
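The index mapping on this slide can be written as a small helper. This is a minimal sketch; the function name `linearIndex` is an assumption made for illustration and is not taken from cmc-fluid-solver.

```cpp
#include <cstddef>

// Hypothetical helper (name not from the original project): linear offset of
// grid point (x, y, z) in an array stored with Z as the fastest-changing
// dimension, i.e. Z + Y * dimz + X * dimy * dimz.
inline std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t z,
                               std::size_t dimy, std::size_t dimz) {
    return z + y * dimz + x * dimy * dimz;
}
```

With this layout a step in Z moves one element, a step in Y moves dimz elements, and a step in X moves dimy * dimz elements, which is why X has the longest stride.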

17 Building matrices. Input data: previous/non-linear 3D layers. Each thread computes the coefficients of a tridiagonal matrix (a, b, c) and the right-hand side vector (d). Use C++ templates for direction and equation.


18 Building matrices performance (Tesla C2050, SP). (Chart: total time in seconds of the build kernels and of build + solve for X dir, Y dir and Z dir. Table columns: Dir, Requests per warp, L1 global load hit %, IPC; rows: X, Y, Z.) Poor Z direction performance compared to X/Y: each thread accesses its own contiguous memory region, so memory access across a warp is uncoalesced, with lots of cache misses.


19 Building matrices optimization. Run the Z phase in transposed XZY space: better locality for memory accesses, at the cost of additional overhead for the transpose. (Figure: X local iterations and Y local iterations in XYZ space; transpose input arrays from XYZ to XZY; Z local iterations in XZY space; transpose output arrays back to XYZ.)
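The transposition between XYZ and XZY space can be illustrated with a plain host-side reference loop. This is only a sketch under assumptions: the function name `transposeXYZtoXZY` is hypothetical, and the GPU version in the original project is a CUDA kernel whose details are not shown in the slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not the project's CUDA kernel): copies an
// array stored in XYZ order (Z fastest) into XZY order (Y fastest), so that Z
// becomes the middle dimension.
std::vector<float> transposeXYZtoXZY(const std::vector<float>& in,
                                     std::size_t dimx, std::size_t dimy,
                                     std::size_t dimz) {
    std::vector<float> out(in.size());
    for (std::size_t x = 0; x < dimx; ++x)
        for (std::size_t y = 0; y < dimy; ++y)
            for (std::size_t z = 0; z < dimz; ++z)
                out[y + z * dimy + x * dimz * dimy] =
                    in[z + y * dimz + x * dimy * dimz];
    return out;
}
```

In the transposed array, threads that handle neighbouring Y values read neighbouring addresses at every step of a Z sweep, which is the access pattern the X and Y sweeps already have in the original layout.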

20 Building matrices optimization (Tesla C2050, SP). (Chart: total time in seconds for X dir, Y dir, Z dir and Z dir OPT, split into transpose and build + solve; annotation: 2.5x. Table columns: Requests per warp, L1 global load hit %, IPC for the original and transposed build kernels.) Tridiagonal solver time dominates over the transpose. The transpose takes a smaller share of the time with more local iterations.

21 Solving tridiagonal systems. Number of tridiagonal systems ~ grid size squared. The sweep algorithm is the most efficient in this case. 1 thread solves 1 system:
    for( int p = 1; p < end; p++ )
    {
        //.. compute tridiagonal coefficients a_val, b_val, c_val, d_val..
        get(c,p) = c_val / (b_val - a_val * get(c,p-1));
        get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
    }
    for( int i = end-1; i >= 0; i-- )
        get(x,i) = get(d,i) - get(c,i) * get(x, i+1);
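For reference, the sweep (Thomas) algorithm for a single system can be sketched as a self-contained host function. This is a minimal sketch, not the project's kernel: the function name `solveTridiagonal` and the explicit handling of the first and last rows are assumptions, since the slide elides the boundary treatment.

```cpp
#include <cstddef>
#include <vector>

// Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
// a[0] and c[n-1] are ignored. Assumes the system can be solved without
// pivoting (for example, it is diagonally dominant).
std::vector<double> solveTridiagonal(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     const std::vector<double>& d) {
    const std::size_t n = d.size();
    std::vector<double> cp(n), dp(n), x(n);
    if (n == 0) return x;

    // Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (std::size_t p = 1; p < n; ++p) {
        const double denom = b[p] - a[p] * cp[p - 1];
        cp[p] = c[p] / denom;
        dp[p] = (d[p] - a[p] * dp[p - 1]) / denom;
    }

    // Backward sweep: back substitution from the last row.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        x[i] = dp[i] - cp[i] * x[i + 1];
    return x;
}
```

Both loops carry a dependency from one row to the next, so a single system is solved sequentially; the parallelism comes from solving many independent systems at once, one per thread.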

22 Solving tridiagonal systems. Matrix layout is crucial for performance. Interleaved layout: a0 a0 a0 a0 a1 a1 a1 a1, where consecutive elements belong to thread 1, thread 2, thread 3 and so on; similar to ELLPACK for sparse matrices. Sweep friendly. X and Y direction matrices are interleaved by default; Z is interleaved as well if done in transposed space.
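The difference between the interleaved layout and a system-by-system layout comes down to the index formula. The two helpers below are a sketch with hypothetical names (`interleavedIndex`, `sequentialIndex`); they only illustrate the addressing and are not taken from the project.

```cpp
#include <cstddef>

// Interleaved layout: element p of every system is stored contiguously
// (a0 a0 a0 a0 a1 a1 a1 a1), so threads working on neighbouring systems
// touch neighbouring addresses at the same sweep step p.
inline std::size_t interleavedIndex(std::size_t system, std::size_t p,
                                    std::size_t numSystems) {
    return p * numSystems + system;
}

// Sequential layout, shown for comparison: each system is stored as one
// contiguous run, so neighbouring threads are systemSize elements apart.
inline std::size_t sequentialIndex(std::size_t system, std::size_t p,
                                   std::size_t systemSize) {
    return system * systemSize + p;
}
```

With one thread per system, the interleaved formula gives a stride of 1 between adjacent threads at each step of the sweep, which is what allows coalesced loads.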

23 Solving tridiagonal systems. L1/L2 effect on performance: using 48K L1 instead of 16K gives a 10-15% speed-up; turning L1 off reduces performance by 10%. The caches really help with misaligned accesses and spatial reuse. Occupancy >= 50%: running 128 threads per block; registers per thread (different for u, v, w, T). No shared memory.

24 Performance benchmark. CPU configuration: Intel Core i7-3930K 3.2 GHz, 6 cores (12 threads). Use OpenMP for CPU parallelization. Mostly memory bandwidth bound; some parts achieve ~4x speed-up vs 1 core. GPU configuration: NVIDIA Tesla C2070.

25 Test cases: Box Pipe. (Figure: box pipe with axes X, Y, Z and dimensions L, 1, 1.) Simple geometry. Systems of the same size. Need to compute in all rectangular grid points.

26 Test cases: White Sea. (Figure: domain outline in the X-Y plane.) Complex geometry. Big divergence of system sizes. Need to compute only inside the area.

27 Performance results: Box Pipe, grid 128x128x128. (Charts: throughput in segments/ms for single and double precision, CPU vs GPU, for Solve X, Solve Y, Solve Z and Total.)

28 Performance results: White Sea, grid 256x192x160. (Charts: throughput in segments/ms for single and double precision, CPU vs GPU, for Solve X, Solve Y, Solve Z and Total.)

29 Outline. Fluid simulation in 3D domain. Multi-GPU implementation: general splitting algorithm; running computations using CUDA; benchmarking and performance analysis; improving weak scaling.

30 Multi-GPU motivation. Limited amount of available memory for 3D arrays: grid, temporary arrays, matrices. Max size of grid that can fit into Tesla M2050 ~ Distribute the computations between multiple GPUs and multiple nodes: can compute large grids; speeds up computations.

31 Main idea of mgpu. Systems along Y/Z are solved independently in parallel on each GPU: no data transfer. Along X, data must be synchronized. (Figure: grid split along X between GPU 0, GPU 1 and GPU 2; computing alternating directions X, Y, Z.)

32 CUDA parallelization. Split the grid along X (the longest stride): Z + Y * dimz + X * dimy * dimz. Launch kernels on several GPUs from one host thread:
    for (int i = 0; i < numdev; i++) {
        cudaSetDevice(i);                 // Switch device
        kernel<<< >>>(devarray[i],..);    // Computation
    }
CUDA 4.x data transfer: async P2P through PCI-E (cudaMemcpyPeerAsync).

33 Synchronization of nonlinear layer.
    for (int i = 0; i < numdev-1; i++)
        cudaMemcpyPeerAsync(dhaloleft[i+1], i+1, ddataright[i], i, num_bytes, devstream[i]);
    // might need multidev synchronization here
    for (int i = 1; i < numdev; i++)
        cudaMemcpyPeerAsync(dhaloright[i-1], i-1, ddataleft[i], i, num_bytes, devstream[i]);
High aggregate throughput on an 8 GPU system. Communication impact is not significant.

34 Solve X (tridiagonal solver). (Figure: segments spanning GPU 0 and GPU 1, classified as bound, partially bound or unbound, with halo cells.)

35 Solve X (tridiagonal solver). Process bound segments without intercommunication. Interleave segments for better memory access: one segment per thread, aligned to the left. Gauss elimination: forward sweep, communicate, backward sweep.

36 Solve X. Split the grid along X ( long X ): Array[i*dimz*dimy+ ]. Allocation of layers in mgpu. 3D segment analysis.

37 Solve X. Forward sweep along X. (Figure: active GPU highlighted.)

38 Solve X. Forward sweep along X, continued. (Figure: active GPU highlighted.)

39 Solve X. Back sweep along X. (Figure: active GPU highlighted.)

40 Solve X. Back sweep along X, continued. (Figure: active GPU highlighted.)

41 Solve X. Back sweep along X. Result: no speedup along X. (Figure: active GPU highlighted.)
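Why the split X sweep gives no speedup can be seen from a host-side model of the algorithm, in which each partition stands in for one GPU. This is a sketch under assumptions: the names `Halo`, `forwardSweep`, `backwardSweep` and `solveSplit` are hypothetical, the partitions are plain index ranges of one array, and the halo exchange is modelled by passing values between function calls instead of P2P copies.

```cpp
#include <cstddef>
#include <vector>

// Values carried from the last row of one partition to the first row of the next.
struct Halo { double c; double d; };

// Forward elimination on rows [begin, end), continuing from the left neighbour's halo.
Halo forwardSweep(const std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, const std::vector<double>& d,
                  std::vector<double>& cp, std::vector<double>& dp,
                  std::size_t begin, std::size_t end, Halo left) {
    for (std::size_t p = begin; p < end; ++p) {
        const double denom = b[p] - a[p] * left.c;
        cp[p] = c[p] / denom;
        dp[p] = (d[p] - a[p] * left.d) / denom;
        left = Halo{cp[p], dp[p]};
    }
    return left;  // sent forward to the right neighbour
}

// Back substitution on rows [begin, end), given the solution value just right of the range.
double backwardSweep(const std::vector<double>& cp, const std::vector<double>& dp,
                     std::vector<double>& x, std::size_t begin, std::size_t end,
                     double xRight) {
    for (std::size_t p = end; p-- > begin;) {
        x[p] = dp[p] - cp[p] * xRight;
        xRight = x[p];
    }
    return xRight;  // sent backward to the left neighbour
}

// Solves one tridiagonal system split into numParts (>= 1) contiguous partitions.
std::vector<double> solveSplit(const std::vector<double>& a, const std::vector<double>& b,
                               const std::vector<double>& c, const std::vector<double>& d,
                               std::size_t numParts) {
    const std::size_t n = d.size();
    std::vector<double> cp(n), dp(n), x(n);
    std::vector<std::size_t> bounds(numParts + 1);
    for (std::size_t k = 0; k <= numParts; ++k) bounds[k] = k * n / numParts;

    Halo halo{0.0, 0.0};  // nothing to the left of the first partition
    for (std::size_t k = 0; k < numParts; ++k)
        halo = forwardSweep(a, b, c, d, cp, dp, bounds[k], bounds[k + 1], halo);

    double xRight = 0.0;  // nothing to the right of the last partition
    for (std::size_t k = numParts; k-- > 0;)
        xRight = backwardSweep(cp, dp, x, bounds[k], bounds[k + 1], xRight);
    return x;
}
```

Each call needs the halo produced by the previous one, so the partitions run strictly one after another; the result matches the single-domain sweep, but only one partition is active at any time.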

42 Benchmarks. Multiple GPUs: 8 Tesla M2050 with P2P. Multiple nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each. Sample tests: Box Pipe, White Sea.

43 Results: 8 GPUs, 1 MPI node (Tesla M2050). (Charts: total throughput in millions of points per second for Box Pipe and White Sea; speed-up annotations include x2.9 for Box Pipe and x1.35 for White Sea.)

44 1 GPU efficiency (Tesla M2090). (Charts: throughput in points/ms versus grid size for Box Pipe and White Sea.) Estimate amount of work per GPU in 8xGPU system using single GPU: /8 = Box Pipe: enough work for a single GPU. White Sea takes about 5% of the volume of the grid. Grid size of is not enough.


45 Results: 1 GPU, 4 MPI nodes (Tesla M2090). (Charts: total throughput in millions of points per second for Box Pipe and White Sea, with speed-up annotations.)

46 Segments load balancing (Tesla M2090). X splitting criteria: equal volumes; equal number of segments. Performance benefit observed: up to 15.5%. Y(x) + Z(X) + X(x)dX x
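One way to implement a splitting criterion of this kind is to cut the X range where the accumulated per-layer work reaches each GPU's share. The sketch below is an assumption, not the project's method: the name `splitByWork` is hypothetical, and the per-layer work measure (volume or number of segments in each X layer) is supplied by the caller.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: splits X layers [0, n) into numParts contiguous ranges
// so that each range carries roughly the same share of the total work.
// layerWork[x] is the work estimate for layer x (e.g. volume or segment count).
// Returns numParts + 1 boundaries; part k covers [bounds[k], bounds[k + 1]).
std::vector<std::size_t> splitByWork(const std::vector<double>& layerWork,
                                     std::size_t numParts) {
    std::vector<std::size_t> bounds(numParts + 1, layerWork.size());
    bounds[0] = 0;
    const double total = std::accumulate(layerWork.begin(), layerWork.end(), 0.0);
    double acc = 0.0;
    std::size_t part = 1;
    for (std::size_t x = 0; x < layerWork.size() && part < numParts; ++x) {
        acc += layerWork[x];
        // Close the current part once its cumulative share of the total is reached.
        while (part < numParts && acc >= total * part / numParts)
            bounds[part++] = x + 1;
    }
    return bounds;
}
```

For per-layer work {1, 1, 1, 1, 2, 6} and two GPUs, an even split in X gives work 3 and 9, while this criterion places the cut after the fifth layer and gives 6 and 6.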


47 Load balancing: White Sea (288x320x320), Tesla M2090. (Charts: time in ms per GPU, GPU 0 to GPU 3, split into SweepX, SweepY, SweepZ and Transpose, for three splittings: Even X, Even Segments, Even Volumes.) t total = 47.3 for Even X; t total = 44.3 and 44.4 for the Even Segments and Even Volumes panels.

48 Analysis. All parts of the solver but one (Gauss elimination along X) are fully parallel. Communication (using P2P + InfiniBand) is not a big issue for the given problem size. Bad weak scaling. Use blocks to hide latency for X sweeps.

49 Improved Solve X. Grid split along X between GPU0, GPU1 and GPU2 ( long X ): Array[i*dimz*dimy+ ]. Allocation of layers in mgpu. 3D segment analysis.

50 Improved Solve X. Splitting the grid into XY blocks (B0 to B4) along the Z direction. Segments sorting. Sweep through all scalar fields at once.

51 Improved Solve X. Forward sweep along X, async halo send forward.

52 Improved Solve X. Move to the next block group.

53 Improved Solve X. Backward sweep along X, async halo send backward.

54 Improved Solve X. Backward sweep along X, async halo send backward, continued.

55 Improved Solve X. Equal work per node!
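The benefit of block splitting can be estimated with a simple pipeline model. This is an idealized sketch and not a result from the slides: it assumes equal work per block and per GPU, ignores communication and per-block launch overhead, and the function name `pipelinedSweepTime` is hypothetical.

```cpp
#include <cstddef>

// Idealized time of one sweep along X, in units where the whole sweep on a
// single GPU costs 1. With numBlocks blocks pipelined over numGpus GPUs, each
// stage processes 1 / (numGpus * numBlocks) of the work and the pipeline needs
// numGpus + numBlocks - 1 stages to drain. Assumes numGpus, numBlocks >= 1.
inline double pipelinedSweepTime(std::size_t numGpus, std::size_t numBlocks) {
    return static_cast<double>(numGpus + numBlocks - 1) /
           static_cast<double>(numGpus * numBlocks);
}
```

With a single block the model gives 1 regardless of the number of GPUs, matching the "no speedup along X" result; with 4 GPUs and 4 blocks it gives 7/16. In practice each extra block adds overhead that this model leaves out, so the number of blocks cannot be increased indefinitely.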

56 Algorithm. (Diagram: grid split along X into N nodes, node index i node starting at node 0, and along Z into N blocks, block index i block starting at block 0. cudaStream 1 receives X from node i node - 1; cudaStream 2 receives X from node i node + 1; annotation: 2(N nodes - i node - 1).)

57 Algorithm. (Same diagram: forward sweep in cudaStream 1, backward sweep in cudaStream 2.)

58 Algorithm. (Same diagram: send X from node i node in cudaStream 1 and in cudaStream 2.)

59 Improved Solve XY. Separate buffer for Y sweeps. Block Y sweeps are performed independently in separate CUDA streams, which helps with data transfer/compute overlap. (Figure: Y blocks B0 to B4 along Z.)

60 Weak scaling (Tesla M2050). (Chart: average time in ms for Solve XYZ versus number of GPUs.) Box Pipe grids: 224^3, 288^3, 352^3,

61 Big systems limit (8 M2050 GPUs). (Chart: average time in ms for Solve XYZ versus number of blocks.) Consider one scalar field: no physics, more available RAM. With larger grid sizes, the curve minimum shifts down/right.

62 Conclusions. GPU outperforms multi-core CPU by a factor of over 10. GPU works well with complex input domains. Performance and scaling factors heavily depend on input geometry and size of grid. Efficient work distribution methods are essential for performance. Using block-splitting for ADI improves the scaling factor by hiding the dependency of sweep processing.

63 Future work. Test on large scale systems, potentially on the Lomonosov supercomputer at MSU (GPU part with peak performance of 863 TFlops). Memory usage optimizations. Explore different tridiagonal approaches.

64 Questions? Thank You!
