Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010

Introduction. Tridiagonal solvers are a very popular technique in both compute and graphics applications, with applications in Alternating Direction Implicit (ADI) methods. Two different examples will be covered in this talk: 3D fluid simulation for research and science, and a 2D depth-of-field effect for graphics and games.

Outline Tridiagonal Solvers Overview Introduction Gauss Elimination Cyclic Reduction Fluid Simulation in 3D domain Depth-of-Field Effect in 2D domain

Tridiagonal Solvers. Need to solve many independent tridiagonal systems; matrix sizes and configurations are problem-specific.

Gauss Elimination (Sweep). A forward sweep eliminates the lower diagonal, then backward substitution recovers the unknowns. The fastest serial approach: O(N) complexity and an optimal number of operations.
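
A minimal serial sketch of this sweep (the Thomas algorithm) for reference; the array names a, b, c, d (lower, main, upper diagonal and right-hand side) and the scratch arrays cp, dp are illustrative, not the talk's actual code.

    /* Solve one tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
       i = 0..n-1, with a[0] = c[n-1] = 0. O(N) operations. */
    void thomas_solve(int n, const float *a, const float *b, const float *c,
                      const float *d, float *x, float *cp, float *dp)
    {
        /* Forward sweep: eliminate the lower diagonal. */
        cp[0] = c[0] / b[0];
        dp[0] = d[0] / b[0];
        for (int i = 1; i < n; ++i) {
            float m = 1.0f / (b[i] - a[i] * cp[i - 1]);
            cp[i] = c[i] * m;
            dp[i] = (d[i] - a[i] * dp[i - 1]) * m;
        }
        /* Backward substitution. */
        x[n - 1] = dp[n - 1];
        for (int i = n - 2; i >= 0; --i)
            x[i] = dp[i] - cp[i] * x[i + 1];
    }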

Cyclic Reduction. Eliminate unknowns using linear combinations of neighbouring equations; after one reduction step the system decomposes into 2 independent systems (even and odd unknowns). The reduction step can be done by N threads in parallel.

Cyclic Reduction. Parallel Cyclic Reduction (PCR): apply the reduction to the new systems and repeat, O(log N) steps; for some special cases fewer steps can be used. Cyclic Reduction (CR): forward reduction, then backward substitution; complexity O(N), but it requires more operations than Sweep.
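
A minimal sketch of one reduction step as it might look in CUDA, assuming each of the n threads of a block updates its own equation of a system held in shared memory; the function name and the in-place a/b/c/d storage are assumptions for illustration, not the talk's code.

    // One PCR reduction step applied by thread i to equation i of a system of
    // size n (arrays a, b, c, d in shared memory). After the step, equation i
    // couples unknowns i - 2*stride and i + 2*stride, so the system splits in
    // two. All threads of the block must call this together (it synchronizes).
    __device__ void pcr_step(int i, int n, int stride,
                             float *a, float *b, float *c, float *d)
    {
        int im = i - stride, ip = i + stride;
        float k1 = (im >= 0) ? a[i] / b[im] : 0.0f;   // eliminate x[i - stride]
        float k2 = (ip <  n) ? c[i] / b[ip] : 0.0f;   // eliminate x[i + stride]

        float na = (im >= 0) ? -a[im] * k1 : 0.0f;
        float nc = (ip <  n) ? -c[ip] * k2 : 0.0f;
        float nb = b[i] - ((im >= 0) ? c[im] * k1 : 0.0f)
                        - ((ip <  n) ? a[ip] * k2 : 0.0f);
        float nd = d[i] - ((im >= 0) ? d[im] * k1 : 0.0f)
                        - ((ip <  n) ? d[ip] * k2 : 0.0f);

        __syncthreads();   // everyone has read the old coefficients
        a[i] = na; b[i] = nb; c[i] = nc; d[i] = nd;
        __syncthreads();   // new coefficients visible to all threads
    }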

Outline Tridiagonal Solvers Overview Fluid Simulation in 3D domain Problem statement, applications ADI numerical method GPU implementation details, optimizations Performance analysis and results comparisons Depth-of-Field Effect in 2D domain

Problem Statement. Viscous incompressible fluid in a 3D domain. New challenge: complex dynamic boundaries (no-slip, free, injection). Euler coordinates (fixed grid): the unknowns are velocity and temperature.

Applications. Blood flow simulation in the heart: capture boundary movement from MR or ultrasound imaging, then simulate blood flow inside the heart volume.

Applications. Sea and ocean simulations: static boundaries, additional simulation parameters such as salinity.

Definitions: density, velocity, temperature, pressure. The equation of state describes the relation between pressure, density, and temperature; for example, the ideal-gas law with the gas constant for air.

Governing equations. The continuity equation (for incompressible fluids the velocity field is divergence-free) and the Navier-Stokes equations, written in dimensionless form using the equation of state; the Reynolds number is the ratio of inertial to viscous forces.

Governing equations. The energy equation, also in dimensionless form using the equation of state; it involves the heat capacity ratio, the Prandtl number, and the dissipative function.

ADI numerical method: Alternating Direction Implicit. X pass: sweep along X with Y, Z fixed; Y pass: sweep along Y with X, Z fixed; Z pass: sweep along Z with X, Y fixed.

ADI method iterations. Use global iterations for the whole system of equations: starting from the previous time step, solve the X-direction equations, then the Y-direction equations, then the Z-direction equations, update all variables, and advance to the next time step. Some equations are not linear: use local iterations to approximate the non-linear term.

Discretization. Use a regular grid and an implicit finite difference scheme; along the current direction the stencil couples points i-1, i, i+1. This yields a tridiagonal system, independent for each fixed pair (j, k).

Tridiagonal systems. Need to solve lots of tridiagonal systems; sizes of systems may vary across the grid. [Diagram: one grid line containing systems 1, 2, 3 of different lengths, made of outside, boundary, and inside cells.]

Implementation details:

    <for each direction X, Y, Z> {
        <for each local iteration> {
            <for each equation u, v, w, T> {
                build tridiagonal matrices and rhs
                solve tridiagonal systems
            }
            update non-linear terms
        }
    }

GPU implementation. Store all data arrays entirely on the GPU in linear memory and reduce memory transfers to a minimum. Map 3D arrays (X, Y, Z) to linear memory as Z + Y * dimz + X * dimy * dimz, with Z the fastest-changing dimension. Main tasks: build matrices and solve systems, plus additional routines for non-linear updates (merge and copy 3D arrays with masks).
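
A one-line helper expressing this mapping, as a sketch (the function name idx3d is illustrative):

    // Linear index of element (x, y, z) with Z as the fastest-changing
    // dimension, matching the slide: index = Z + Y * dimz + X * dimy * dimz.
    __host__ __device__ inline int idx3d(int x, int y, int z, int dimy, int dimz)
    {
        return z + y * dimz + x * dimy * dimz;
    }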

Building matrices. Input data: previous and non-linear 3D layers. Each thread computes the coefficients of a tridiagonal matrix (a, b, c) and the right-hand-side vector (d). Use C++ templates for direction and equation.
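
A skeleton of such a build kernel, assuming one thread per (j, k) system along X and the interleaved matrix layout described later in the talk; the template parameters echo the "C++ templates for direction and equation" remark, and the coefficient values here are placeholders, not the actual finite-difference formulas.

    // Sketch: one thread builds the tridiagonal coefficients and RHS for one
    // system along X (fixed j, k). DIR/EQ mirror the template idea.
    template <int DIR, int EQ>
    __global__ void build_matrices(const float *prev, float *a, float *b,
                                   float *c, float *d,
                                   int dimx, int dimy, int dimz)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // one (j, k) pair per thread
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j >= dimy || k >= dimz) return;

        int nsys = dimy * dimz;       // number of independent systems
        int sys  = j * dimz + k;      // this thread's system id

        for (int i = 0; i < dimx; ++i) {
            int row = i * nsys + sys;                  // interleaved matrix layout
            int c3d = k + j * dimz + i * dimy * dimz;  // center point in the 3D array
            // Placeholder coefficients: the real values come from the implicit
            // finite-difference scheme for equation EQ along direction DIR.
            a[row] = -1.0f;
            b[row] =  4.0f;
            c[row] = -1.0f;
            d[row] = prev[c3d];
        }
    }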

Building matrices performance on Tesla C2050 (SP). [Chart: Build and Build + Solve time in seconds for the X, Y, and Z directions.] Profiler data for the build kernels (requests per load, L1 global load hit %, IPC): X: 2, 3, 25, 45, 1.4; Y: 2, 3, 33, 44, 1.4; Z: 32, 0, 15, 0.2. Poor Z direction performance compared to X/Y: each thread accesses a contiguous memory region, so accesses within a warp are uncoalesced and cause lots of cache misses.

Building matrices optimization: run the Z phase in transposed XZY space. Better locality for memory accesses, at the cost of additional overhead for the transposes. [Diagram: X and Y local iterations run in XYZ space; input arrays are transposed XYZ to XZY, Z local iterations run in XZY space, and output arrays are transposed back.]
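
A naive sketch of the XYZ-to-XZY permutation kernel (a real implementation would stage tiles through shared memory so both the read and the write are coalesced); the kernel name and launch mapping are illustrative assumptions.

    // src is stored as (X, Y, Z) with Z fastest; dst as (X, Z, Y) with Y fastest.
    // Launch with gridDim.z == dimx, one thread per (y, z) element.
    __global__ void permute_xyz_to_xzy(const float *src, float *dst,
                                       int dimx, int dimy, int dimz)
    {
        int z = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int x = blockIdx.z;
        if (z >= dimz || y >= dimy) return;
        dst[y + z * dimy + x * dimz * dimy] = src[z + y * dimz + x * dimy * dimz];
    }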

Building matrices optimization on Tesla C2050 (SP). [Chart: total time in seconds (Transpose, Build + Solve) for the X, Y, Z, and optimized Z directions; the optimized Z direction is about 2.5x faster.] Build kernel profiler data for the Z direction (requests per load, L1 global load hit %, IPC): original: 32, 0, 15, 0.2; transposed: 2, 3, 30, 38, 1.3. Tridiagonal solver time dominates over the transpose, and the transpose will take an even smaller fraction of the time with more local iterations.

Problem analysis. A big number of tridiagonal systems (grid size squared on each step: 16K for 128^3) and relatively small system sizes (at most the grid size: 128 for 128^3). One thread per system is a suitable approach for this case: there are enough systems to utilize the GPU.

Solving tridiagonal systems: matrix layout is crucial for performance. Sequential layout: each system is stored contiguously (a0 a1 a2 a3 of system 1, then system 2); PCR and CR friendly; this is how the ADI Z direction is laid out. Interleaved layout: element i of every system is stored contiguously (a0 a0 a0 a0, then a1 a1 a1 a1); Sweep friendly; this is how the ADI X and Y directions are laid out. [Diagram: threads 1-3 reading the two layouts.]
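
To make the layout point concrete, here is a minimal one-thread-per-system sweep over the interleaved layout (element i of system s at index i * nsys + s), so that at each loop iteration a warp touches a contiguous block of memory; a sketch under those assumptions, not the talk's actual kernel.

    // Gauss elimination (sweep), one thread per tridiagonal system,
    // interleaved layout; cp and dp are global scratch arrays of size n * nsys.
    __global__ void sweep_interleaved(const float *a, const float *b,
                                      const float *c, const float *d,
                                      float *x, float *cp, float *dp,
                                      int n, int nsys)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nsys) return;

        cp[s] = c[s] / b[s];
        dp[s] = d[s] / b[s];
        for (int i = 1; i < n; ++i) {               // forward sweep
            int r = i * nsys + s, rp = (i - 1) * nsys + s;
            float m = 1.0f / (b[r] - a[r] * cp[rp]);
            cp[r] = c[r] * m;
            dp[r] = (d[r] - a[r] * dp[rp]) * m;
        }
        x[(n - 1) * nsys + s] = dp[(n - 1) * nsys + s];
        for (int i = n - 2; i >= 0; --i) {          // backward substitution
            int r = i * nsys + s;
            x[r] = dp[r] - cp[r] * x[r + nsys];
        }
    }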

Tridiagonal solvers performance. [Chart: solve time in ms on Tesla C2050 versus system size N = 64 to 256 for PCR, CR, SWEEP, and SWEEP_T.] 3D ADI test setup: system size = N, number of systems = N^2. Matrix layout: PCR, CR, SWEEP_T use the sequential layout; SWEEP uses the interleaved layout. PCR/CR implementations from: Fast Tridiagonal Solvers on the GPU, Y. Zhang, J. Cohen, J. Owens, 2010.

Choosing a tridiagonal solver for the 3D ADI method. For the X and Y directions, matrices are interleaved by default, and Z is interleaved as well when working in transposed space. The Sweep solver seems like the best choice, although one can experiment with PCR for the Z direction.

Performance analysis: Fermi L1/L2 effect on performance. Using 48K of L1 instead of 16K gives a 10-15% speed-up; turning L1 off reduces performance by 10%. The caches really help on misaligned accesses and spatial reuse. Sweep effective bandwidth on Tesla C2050: single precision = 119 GB/s (82% of peak), double precision = 135 GB/s (93% of peak).

Performance benchmark. CPU configuration: Intel Core i7, 4 cores @ 2.8 GHz; all 4 cores used through OpenMP (3-4x speed-up vs 1 core). GPU configurations: NVIDIA Tesla C1060 and NVIDIA Tesla C2050. Measure Build + Solve time for all iterations.

Test cases: Cube, Heart Volume, Sea Area.
Cube: grid 128 x 128 x 128; X dir 10K, Y dir 10K, Z dir 10K.
Heart Volume: grid 96 x 160 x 128; X dir 13K, Y dir 8K, Z dir 8K.
Sea Area: grid 160 x 128 x 128; X dir 18K, Y dir 20K, Z dir 6K.

Performance results, Cube: same number of systems for X/Y/Z, and all sizes are equal. [Charts: systems/ms per direction (X, Y, Z, all) for Core i7, Tesla C1060, and Tesla C2050. Single precision: 5.9x (C1060) and 9.5x (C2050) speed-up over Core i7; double precision: 2.7x (C1060) and 8.2x (C2050).]

Performance results, Heart Volume: more systems for X, and sizes vary smoothly. [Charts: systems/ms per direction for Core i7, Tesla C1060, and Tesla C2050. Single precision: 8.1x (C1060) and 11.4x (C2050) speed-up over Core i7; double precision: 3.3x (C1060) and 7.8x (C2050).]

Performance results, Sea Area: lots of small systems of different size for X/Y. [Charts: systems/ms per direction for Core i7, Tesla C1060, and Tesla C2050. Single precision: 6.5x (C1060) and 9.6x (C2050) speed-up over Core i7; double precision: 3.1x (C1060) and 7.2x (C2050).]

Future work. The current main limitation is memory capacity: lots of 3D arrays (grid data, time layers, tridiagonal matrices). Multi-GPU and clusters enable high-resolution grids: split the grid along the current ADI step direction (X, Y, or Z), assign each partition to one GPU, solve the systems, and update the common 3D layers. [Diagram: grid partitioned across GPU 1-4 along the ADI step direction.]

References:
http://code.google.com/p/cmc-fluid-solver/
N. Sakharnykh. Tridiagonal Solvers on the GPU and Applications to Fluid Simulation. GPU Technology Conference, 2009.
V. M. Paskonov, S. B. Berezin, E. S. Korukhova. A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids. Moscow University Computational Mathematics and Cybernetics, 2007.

Outline Tridiagonal Solvers Overview Fluid Simulation in 3D domain Depth-of-Field Effect in 2D domain Problem statement ADI numerical method Analysis and hybrid approach Shared memory optimization Visual results

Problem statement. ADI method application for 2D problems: real-time depth-of-field simulation, using the diffusion equation to blur the image. Now we need to solve tridiagonal systems in a 2D domain: different setup, different methods for the GPU.

Numerical method: Alternating Direction Implicit. X pass: solve along x for all fixed y; Y pass: solve along y for all fixed x.

Implementing the ADI method. The implicit scheme is always stable. After discretization on a regular grid with dx = dt = 1, the stencil couples points i-1, i, i+1, giving a tridiagonal system for each fixed j.
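
As a hedged illustration (a standard implicit discretization of the diffusion equation with spatially varying conductivity beta, not necessarily the exact scheme of the talk), with dx = dt = 1 each row of the tridiagonal system looks like:

    % Implicit step of u_t = (beta u_x)_x along x, dx = dt = 1, fixed j:
    -\beta_{i-\frac12}\, u_{i-1}^{n+1}
    + \bigl(1 + \beta_{i-\frac12} + \beta_{i+\frac12}\bigr)\, u_{i}^{n+1}
    - \beta_{i+\frac12}\, u_{i+1}^{n+1} = u_{i}^{n}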

Problem analysis. A small number of systems (at most 2K for high resolutions) and large matrix dimensions (up to 2K for high resolutions). Using 1 thread per system wouldn't be efficient: low GPU utilization.

GPU implementation. Matrix layout: sequential for the X direction, interleaved for the Y direction. Use a hybrid algorithm: start with Parallel Cyclic Reduction (PCR) to subdivide the systems into smaller ones, then finish with Gauss Elimination (Sweep), solving each new system with 1 thread.

PCR step: doubles the number of systems, halves the system size. [Diagram: starting from H systems of size W, after step 0 there are H*2 systems of size W/2, after step 1 there are H*4 systems of size W/4.]

Hybrid approach. Benefits of each additional PCR step: it improves the memory layout for Sweep in the X direction and subdivides systems for a more efficient Sweep. But each step adds overhead, increasing the total number of operations for solving a system. Need to choose an optimal number of PCR steps; for the DOF effect at high resolution: 3 steps for the X direction.
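
A host-side sketch of this control flow under the assumptions above: each PCR step doubles the number of systems and halves their size, and the remaining small systems are handed to the one-thread-per-system sweep. The kernel names in the comments are illustrative, not real APIs.

    // Hybrid solver sketch: a few PCR steps to subdivide the systems,
    // then a one-thread-per-system sweep on the smaller systems.
    void solve_hybrid(float *a, float *b, float *c, float *d, float *x,
                      int sysSize, int numSystems, int pcrSteps /* e.g. 3 */)
    {
        int stride = 1;
        for (int s = 0; s < pcrSteps; ++s) {
            // pcr_step_kernel<<<grid, block>>>(a, b, c, d, sysSize, numSystems, stride);
            stride     *= 2;   // even/odd unknowns decouple:
            numSystems *= 2;   // twice as many systems,
            sysSize    /= 2;   // each half the size
        }
        // sweep_kernel<<<grid, block>>>(a, b, c, d, x, sysSize, numSystems);
    }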

Shared memory optimization. Shared memory is used as a temporary transpose buffer: load a 32x16 area into shared memory, compute 4 iterations inside shared memory, and store the 32x16 area back to global memory. Gives a 20% speed-up. [Diagram: 32x16 tile staged through shared memory.]

Diffusion simulation idea: diffuse a temperature series (= color) on a 2D plate (= the image) with non-uniform heat conductivity (= circles of confusion). Key features: no color bleeding; support for spatially varying, arbitrary circles of confusion. [Image: in focus, radius = 0; out of focus, large radius; sharp boundary.]

Depth-of-Field in games From Metro2033, THQ and 4A Games

Depth-of-Field in games From Metro2033, THQ and 4A Games

References Pixar article about Diffusion DOF: http://graphics.pixar.com/library/depthoffield/paper.pdf Cyclic reduction methods on CUDA: http://www.idav.ucdavis.edu/func/return_pdf?pub_id=978 Upcoming sample in NVIDIA SDK

Summary. 10x speed-ups for 3D ADI methods in complex areas; great Fermi performance in double precision. Real-time performance achieved in graphics and games with a realistic depth-of-field effect. Tridiagonal solver performance is problem-specific.

Questions?