3D ADI Method for Fluid Simulation on Multiple GPUs
Nikolai Sakharnykh, NVIDIA; Nikolay Markovskiy, NVIDIA


Introduction
- Fluid simulation using direct numerical methods
  - Gives the most accurate results
  - Requires lots of memory and computational power
- GPUs are very suitable for direct methods
  - High instruction throughput and high memory bandwidth
- How will it scale on multiple GPUs?

cmc-fluid-solver
- Open-source project on Google Code
- Started at the CMC faculty of MSU, Russia
- CPU: OpenMP; GPU: CUDA
- 3D fluid simulation using an ADI solver
- Key people:
  - MSU: Vilen Paskonov, Sergey Berezin
  - NVIDIA: Nikolay Sakharnykh, Nikolay Markovskiy

Outline
- Fluid simulation in a 3D domain
  - Problem statement, applications
  - ADI numerical method
  - GPU implementation details, optimizations
  - Performance analysis
- Multi-GPU implementation

Problem Statement
- Viscous incompressible fluid in a 3D domain
- Arbitrary closed geometry for the boundaries
  - No-slip, free injection
- Euler coordinates: velocity and temperature

Applications
- Sea and ocean simulation
  - Additional parameters: salinity, etc.
- Low-speed gas flow
  - Inside a 3D channel
  - Around objects

Definitions
- Density ρ, velocity u, temperature T, pressure p
- Equation of state: describes the relation between p, ρ, and T
  - Example: ideal gas, p = ρRT with the gas constant R for air

Governing equations
- Continuity equation; for incompressible fluids: ∇·u = 0
- Navier-Stokes equations (a standard dimensionless form is shown below)
  - Dimensionless form, using the equation of state
  - Reynolds number Re (= ratio of inertial to viscous forces)
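
The slide's formulas did not survive transcription; for reference only, a standard dimensionless form of the incompressible continuity and Navier-Stokes equations (not necessarily the exact notation used in the talk):

    \nabla \cdot \mathbf{u} = 0, \qquad
    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u}
        = -\nabla p + \frac{1}{\mathrm{Re}}\, \nabla^{2} \mathbf{u}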

Governing equations (continued)
- Energy equation (one common dimensionless form is shown below)
  - Dimensionless form, using the equation of state
  - Heat capacity ratio γ, Prandtl number Pr, dissipative function Φ
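
Again for reference only, one common dimensionless form of the energy (temperature) equation involving these parameters; the exact coefficients depend on the chosen non-dimensionalization and equation of state, which were lost in transcription:

    \frac{\partial T}{\partial t} + (\mathbf{u} \cdot \nabla)\, T
        = \frac{\gamma}{\mathrm{Re}\,\mathrm{Pr}}\, \nabla^{2} T
        + \frac{\gamma(\gamma - 1)}{\mathrm{Re}}\, \Phi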

ADI numerical method
- Alternating sweeps along the three directions:
  - X sweep: Y, Z fixed
  - Y sweep: X, Z fixed
  - Z sweep: X, Y fixed

ADI numerical method: benefits
- No hard restrictions on the time step
- Domain decomposition: each step can be well parallelized
- Many applications
  - Computational fluid dynamics
  - Computational finance
  - Linear 3D PDEs

ADI method iterations
- Use global iterations for the whole system of equations:
  previous time step → solve X-direction equations → solve Y-direction equations → solve Z-direction equations → update all variables → next time step (the X/Y/Z solves are repeated as global iterations)
- Some equations are not linear: use local iterations to approximate the non-linear term

Discretization
- Regular grid, implicit finite-difference scheme
  - Second order in space
  - First order in time
- Leads to a tridiagonal system along the sweep direction (see the generic form below)
  - Independent system for each fixed pair (j, k)
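
For an X sweep with a fixed (j, k) pair, each interior unknown couples only to its two neighbors along i, giving the familiar tridiagonal form (standard notation, not copied from the slide):

    a_i \, x_{i-1} + b_i \, x_i + c_i \, x_{i+1} = d_i,
    \qquad i = 1, \dots, N, \quad a_1 = c_N = 0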

Tridiagonal systems
- Need to solve lots of tridiagonal systems
- Sizes of systems may vary across the grid
(diagram: a grid row split into segments "system 1", "system 2", "system 3" by cell type: outside cell, boundary cell, inside cell)

Implementation details

    <for each direction X, Y, Z> {
        <for each local iteration> {
            <for each equation u, v, w, T> {
                build tridiagonal matrices and rhs
                solve tridiagonal systems
            }
            update non-linear terms
        }
    }

GPU implementation
- Store all data arrays entirely in GPU memory
  - Reduce the number of PCI-E transfers to a minimum
- Map 3D arrays (X, Y, Z) to linear memory: index = Z + Y * dimz + X * dimy * dimz, with Z the fastest-changing dimension (a minimal index helper is sketched below)
- Main kernels
  - Build matrix coefficients
  - Solve tridiagonal systems
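
A minimal sketch of the layout described above (the helper name is an assumption, not from the talk):

    // 3D array stored linearly with Z as the fastest-changing dimension
    __host__ __device__ inline int idx3d(int x, int y, int z, int dimy, int dimz)
    {
        return z + y * dimz + x * dimy * dimz;
    }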

Building matrices
- Input data: previous/non-linear 3D layers
- Each thread computes:
  - Coefficients a, b, c of a tridiagonal matrix
  - Right-hand side vector d
- Use C++ templates for direction and equation (see the sketch below)
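
A hedged sketch of the template idea (all names and the trivial kernel body are assumptions, not the authors' code): the sweep direction and equation are compile-time parameters, so strides and per-equation constants are resolved by the compiler instead of being branched on at run time.

    #include <cuda_runtime.h>

    enum Dir { DIR_X, DIR_Y, DIR_Z };
    enum Eq  { EQ_U, EQ_V, EQ_W, EQ_T };

    // Stride between neighboring cells along the sweep direction,
    // for the Z-fastest layout idx = z + y*dimz + x*dimy*dimz.
    template <Dir dir>
    __host__ __device__ inline int sweepStride(int dimy, int dimz)
    {
        return dir == DIR_X ? dimy * dimz : (dir == DIR_Y ? dimz : 1);
    }

    // One thread per tridiagonal system: walk the segment and fill a, b, c, d
    // in the interleaved layout. The real finite-difference coefficients are
    // replaced by placeholders here.
    template <Dir dir, Eq eq>
    __global__ void buildMatrices(const float* prev, const int* sysOrigin,
                                  float* a, float* b, float* c, float* d,
                                  int numSystems, int len, int dimy, int dimz)
    {
        int sys = blockIdx.x * blockDim.x + threadIdx.x;
        if (sys >= numSystems) return;

        int stride = sweepStride<dir>(dimy, dimz);
        int cell   = sysOrigin[sys];                  // first grid cell of this segment

        for (int p = 0; p < len; p++, cell += stride)
        {
            int m = p * numSystems + sys;             // interleaved matrix layout
            a[m] = 0.0f;  b[m] = 1.0f;  c[m] = 0.0f;  // placeholder coefficients
            d[m] = prev[cell];                        // placeholder right-hand side
        }
    }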

Tesla C2050 (SP): building matrices performance
(chart: time in seconds for the X, Y, Z directions, split into build kernels and build + solve)

    Dir   Requests per warp   L1 global load hit %   IPC
    X     2-3                 25-45                  1.4
    Y     2-3                 33-44                  1.4
    Z     32                  0-15                   0.2

- Poor Z-direction performance compared to X/Y
- Each thread accesses a contiguous memory region
- Memory access is uncoalesced, lots of cache misses

Building matrices: optimization
- Run the Z phase in transposed XZY space (a possible transpose kernel is sketched below)
  - Better locality for memory accesses
  - Additional overhead on the transposes
- Pipeline: X local iterations (XYZ) → Y local iterations (XYZ) → transpose input arrays (XYZ → XZY) → Z local iterations (XZY) → transpose output arrays back (XZY → XYZ)
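
A hedged sketch of one way to do the XYZ → XZY transpose: the standard tiled shared-memory transpose applied per fixed-X slab, so both reads and writes stay coalesced along the fastest-changing dimension. This is not necessarily the authors' kernel; all names are assumptions.

    #define TILE 32

    __global__ void transposeXYZtoXZY(const float* in, float* out,
                                      int dimx, int dimy, int dimz)
    {
        __shared__ float tile[TILE][TILE + 1];       // +1 column avoids bank conflicts

        int x = blockIdx.z;                          // one X slab per blockIdx.z
        int z = blockIdx.x * TILE + threadIdx.x;     // fastest dimension of the input
        int y = blockIdx.y * TILE + threadIdx.y;

        if (y < dimy && z < dimz)
            tile[threadIdx.y][threadIdx.x] = in[z + y * dimz + x * dimy * dimz];
        __syncthreads();

        int yt = blockIdx.y * TILE + threadIdx.x;    // fastest dimension of the output
        int zt = blockIdx.x * TILE + threadIdx.y;
        if (yt < dimy && zt < dimz)
            out[yt + zt * dimy + x * dimz * dimy] = tile[threadIdx.x][threadIdx.y];
    }

    // launch: dim3 block(TILE, TILE);
    //         dim3 grid((dimz + TILE - 1) / TILE, (dimy + TILE - 1) / TILE, dimx);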

Tesla C2050 (SP): building matrices, optimization results
(chart: time in seconds for the X, Y, Z, and optimized Z directions, split into build kernels, transpose, and build + solve; the optimized Z phase is about 2.5x faster)

    Z build kernel   Requests per warp   L1 global load hit %   IPC
    Original         32                  0-15                   0.2
    Transposed       2-3                 30-38                  1.3

- The tridiagonal solver time dominates over the transpose
- The transpose takes a smaller share of the time with more local iterations

Solving tridiagonal systems
- Number of tridiagonal systems ~ grid size squared
- The sweep (Thomas) algorithm is the most efficient in this case: 1 thread solves 1 system

    // forward sweep: eliminate the sub-diagonal, normalize the diagonal
    for( int p = 1; p < end; p++ )
    {
        // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
        get(c,p) = c_val / (b_val - a_val * get(c,p-1));
        get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
    }
    // back substitution (x at index end is assumed to hold the boundary value)
    for( int i = end-1; i >= 0; i-- )
        get(x,i) = get(d,i) - get(c,i) * get(x,i+1);

Solving tridiagonal systems: matrix layout
- Matrix layout is crucial for performance
- Interleaved layout: element 0 of every system is stored first, then element 1 of every system, and so on (a0 a0 a0 a0 | a1 a1 a1 a1 | ...)
  - Similar to the ELLPACK layout for sparse matrices
  - Sweep-friendly: at each step, consecutive threads read consecutive addresses (see the accessor sketch below)
- X and Y direction matrices are interleaved by default; Z is interleaved as well when working in the transposed space
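
A hedged sketch of what the get() accessor used in the sweep code might look like under this layout (numSystems and sys are assumed names):

    // element p of tridiagonal system 'sys' in the interleaved layout:
    // consecutive threads (consecutive sys values) touch consecutive addresses
    #define get(arr, p) ((arr)[(p) * numSystems + sys])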

Solving tridiagonal systems: L1/L2 effect on performance
- Using 48 KB L1 instead of 16 KB gives a 10-15% speed-up (see the configuration sketch below)
- Turning L1 off reduces performance by 10%
- Caches really help with misaligned accesses and spatial reuse
- Occupancy >= 50%
  - 128 threads per block
  - 26-42 registers per thread (different for u, v, w, T)
  - No shared memory
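
For reference, one way to request the larger L1 split on Fermi-class GPUs (a sketch; the kernel name is a placeholder, not from the talk):

    #include <cuda_runtime.h>

    __global__ void solveTridiagonalKernel() { /* placeholder for the real solver */ }

    void preferL1()
    {
        // ask for the 48 KB L1 / 16 KB shared-memory split for this kernel
        cudaFuncSetCacheConfig(solveTridiagonalKernel, cudaFuncCachePreferL1);
    }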

Performance benchmark
- CPU configuration: Intel Core i7-3930K @ 3.2 GHz (6 cores / 12 threads)
  - OpenMP for CPU parallelization
  - Mostly memory-bandwidth bound: some parts achieve ~4x speed-up vs. 1 core
- GPU configuration: NVIDIA Tesla C2070

Test cases: Box Pipe
(diagram: a rectangular box of size L x 1 x 1)
- Simple geometry
- Systems of the same size
- Need to compute at all points of the rectangular grid

Test cases: White Sea
(diagram: map of the White Sea domain)
- Complex geometry
- Big divergence in system sizes
- Need to compute only inside the area

Performance results: Box Pipe (grid 128x128x128)
(charts: segments/ms for Solve X, Solve Y, Solve Z, and total, CPU vs. GPU; overall the GPU is 9.3x faster in single precision and 8.4x faster in double precision)

Performance results: White Sea (grid 256x192x160)
(charts: segments/ms for Solve X, Solve Y, Solve Z, and total, CPU vs. GPU; overall the GPU is 9.5x faster in single precision and 10.3x faster in double precision)

Outline
- Fluid simulation in a 3D domain
- Multi-GPU implementation
  - General splitting algorithm
  - Running computations using CUDA
  - Benchmarking and performance analysis
  - Improving weak scaling

Multi-GPU motivation
- Limited amount of available memory
  - 3D arrays: grid, temporary arrays, matrices
  - Max grid size that fits into a Tesla M2050: ~224³ (a rough estimate follows below)
- Distribute the computations between multiple GPUs and multiple nodes
  - Can compute larger grids
  - Speed up computations
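
A rough back-of-envelope consistent with that limit (the array count is an illustrative assumption, not a number from the talk): a 224³ single-precision array occupies about 45 MB, so a few dozen such arrays already approach the 3 GB of a Tesla M2050:

    224^3 \times 4\ \mathrm{B} \approx 45\ \mathrm{MB\ per\ array}, \qquad
    60 \times 45\ \mathrm{MB} \approx 2.7\ \mathrm{GB} \approx \text{Tesla M2050 memory (3 GB)}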

Main idea of mgpu (multi-GPU decomposition)
- Systems along Y/Z are solved independently in parallel on each GPU: no data transfer
- Along X, data must be synchronized
(diagram: the domain is split along X into slabs assigned to GPU 0, GPU 1, GPU 2; the alternating directions X, Y, Z are computed on this decomposition)

CUDA parallelization
- Split the grid along X (the longest stride): Z + Y * dimz + X * dimy * dimz
- CUDA 4.x: launch kernels on several GPUs from one host thread

    for (int i = 0; i < numDev; i++)
    {
        cudaSetDevice(i);                          // switch device
        kernel<<<...>>>(devArray[i], ...);         // computation on GPU i
    }

- Data transfer: asynchronous P2P through PCI-E (cudaMemcpyPeerAsync); peer-access setup is sketched below
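
Direct P2P transfers over PCI-E require peer access to be enabled once per device pair (otherwise the copy is staged through host memory). A hedged sketch of that setup, not shown in the talk (numDev follows the slide's naming):

    #include <cuda_runtime.h>

    void enablePeerAccess(int numDev)
    {
        for (int i = 0; i < numDev; i++)
        {
            cudaSetDevice(i);
            for (int j = 0; j < numDev; j++)
            {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                if (canAccess)
                    cudaDeviceEnablePeerAccess(j, 0);   // flags must be 0
            }
        }
    }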

Synchronization of the non-linear layer

    // send each GPU's right boundary layer to the left halo of its right neighbor
    for (int i = 0; i < numDev-1; i++)
        cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i, num_bytes, devStream[i]);

    // might need multi-device synchronization here

    // send each GPU's left boundary layer to the right halo of its left neighbor
    for (int i = 1; i < numDev; i++)
        cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i, num_bytes, devStream[i]);

- High aggregate throughput on the 8-GPU system
- Communication impact is not significant

Solve X (tridiagonal solver)
(diagram: segments along X spanning the GPU slabs, classified as bound, partially bound, or unbound, with halo cells at the slab boundaries)

Solve X (tridiagonal solver)
- Process bound segments without inter-GPU communication
- Interleave segments for better memory access; one segment per thread
- Align segments to the left
- Gauss elimination: forward and backward sweeps, communicating halos between GPUs

Solve X: allocation of layers in mgpu
- Split the grid along X ("long X"); layers stored as Array[i*dimz*dimy + ...]
- 3D segment analysis
- Forward sweep along X proceeds slab by slab, so only one GPU is active at a time
- Back sweep along X again proceeds slab by slab with a single active GPU
- Result: no speedup along X

Benchmarks
- Multiple GPUs: 8x Tesla M2050 with P2P
- Multiple nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each
- Sample tests: Box Pipe and White Sea

Results: 8 GPUs, 1 MPI node (grid 224³, Tesla M2050)
(charts: total throughput in millions of points per second for 1, 2, 4, 8 GPUs; annotated speedups of x2.9 and x4.5 for Box Pipe and x1.35 and x1.4 for White Sea)

1 GPU efficiency (Tesla M2090)
(charts: points/ms vs. grid size, 0-256, for Box Pipe and White Sea)
- Estimate the amount of work per GPU in an 8-GPU system using a single GPU: 256³ / 8 = 128³
- Box Pipe: enough work for a single GPU
- White Sea: the domain occupies only about 5% of the grid volume, so a 128³ grid is not enough work

Results: 1 GPU per node, 4 MPI nodes (Tesla M2090)
(charts: total throughput in millions of points per second for 1, 2, 4 nodes; annotated speedups of x2.8 for Box Pipe and x1.2 for White Sea)

Load balancing (Tesla M2090)
- X splitting criteria (a sketch of the equal-segments split follows below):
  - Equal volumes
  - Equal number of segments
- Performance benefit observed: up to 15.5%
(chart: number of segments as a function of the X coordinate, 0-288)
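
A hedged sketch of the "equal number of segments" criterion (plain C++, all names are assumptions): choose the X split points so that every GPU owns roughly the same share of the total segment count.

    #include <vector>

    // segsPerSlice[x] = number of tridiagonal segments in X slice x
    std::vector<int> splitBySegments(const std::vector<long long>& segsPerSlice, int numDev)
    {
        long long total = 0;
        for (long long s : segsPerSlice) total += s;

        std::vector<int> bounds(numDev + 1, 0);
        bounds[numDev] = (int)segsPerSlice.size();

        long long acc = 0;
        int dev = 1;
        for (int x = 0; x < (int)segsPerSlice.size() && dev < numDev; x++)
        {
            acc += segsPerSlice[x];
            if (acc * numDev >= total * dev)   // crossed the next 1/numDev quantile
                bounds[dev++] = x + 1;
        }
        return bounds;                         // GPU d owns slices [bounds[d], bounds[d+1])
    }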

Load balancing: White Sea, 288x320x320 (Tesla M2090)
(charts: per-GPU time in ms for SweepX, SweepY, SweepZ, and Transpose on GPUs 0-3 under three splitting strategies: even X, even segments, even volumes; the balanced splits reduce the total time from 47.3 ms to about 44.3-44.4 ms)

Analysis
- All parts of the solver but one (Gauss elimination along X) are fully parallel
- Communication (using P2P + InfiniBand) is not a big issue for the given problem size
- Weak scaling is poor
- Idea: use blocks to hide the latency of the X sweeps

Improved Solve X
- Split the grid into XY blocks along the Z direction (in addition to the X split across GPUs 0-2)
- Sort the segments
- Sweep through all scalar fields at once
- Forward sweep along X on a block, asynchronously send its halo forward, then move to the next block group
- Backward sweep along X on a block, asynchronously send its halo backward
- Result: equal work per node, and the block pipeline hides the X-sweep dependency

Algorithm (per-block pipeline across nodes)
(diagram only: on node i_node of N_nodes, blocks i_block = 0 .. N_blocks-1 along Z are processed in one CUDA stream, which runs the forward and then the backward X sweeps and sends the X halos on; a second CUDA stream receives the X halos from the neighboring nodes i_node-1 / i_node+1; the annotation 2(N_nodes - i_node - 1) marks the block offset between the forward and backward phases on node i_node)

Improved Solve XY
- Separate buffer for the Y sweeps (Y blocks)
- Block Y sweeps are performed independently in separate CUDA streams
- Helps overlap data transfer with compute

Weak scaling (Tesla M2050)
(chart: average time in ms for Solve XYZ vs. number of GPUs, up to 8, for Box Pipe on grids 224³, 288³, 352³, 448³)

Big systems limit (8 Tesla M2050 GPUs, grid 768³)
(chart: average time in ms for Solve XYZ vs. number of blocks, 1-32)
- Consider one scalar field only: no physics, more available RAM
- With larger grid sizes, the curve's minimum shifts down and to the right

Conclusions
- The GPU outperforms a multi-core CPU by more than a factor of 10
- The GPU works well with complex input domains
- Performance and scaling factors heavily depend on the input geometry and grid size
- Efficient work-distribution methods are essential for performance
- Block-splitting for ADI improves the scaling factor by hiding the dependency of sweep processing

Future work
- Test on large-scale systems
  - Potentially on the Lomonosov supercomputer at MSU (GPU partition with a peak performance of 863 TFlops)
- Memory usage optimizations
- Explore different tridiagonal solver approaches

Questions? Thank You!