Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs NVIDIA GPU Technology Conference San Jose, California, 2010 André R. Brodtkorb
Talk Outline Learn how to simulate a half-hour dam break in 27 seconds Introduction Why Shallow Water Simulations? The Shallow Water Equations Numerical scheme Our contribution Simulator Implementation Results including screen capture video Live Demo on a standard Laptop Summary 2
The Shallow Water Equations First described by de Saint-Venant (1797-1886) Gravity-induced fluid motion 2D free surface Negligible vertical acceleration Wave length much larger than depth Conservation of mass and momentum Not only for water: Atmospheric flow Avalanches... Water image from http://freephoto.com / Ian Britton 3
Target application areas Tsunamis Floods 2004 Indian Ocean (230000) 2010: Pakistan (2000+) Storm Surges Dam breaks 2005 Hurricane Katrina (1836) 1959 Malpasset (423) Images from wikipedia.org 4
Mathematical Formulation Vector of conserved variables Flux functions Bed slope source term Bed friction source term 5
The Shallow Water Equations Water depth (h), discharge along x (hu), and discharge along y (hv) 6
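For reference (the slide's equation images are not reproduced here), the system described above can be written in standard conservative form; the Manning-type friction term is an assumption based on the referenced papers:

\[
\frac{\partial}{\partial t}\!\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}
+ \frac{\partial}{\partial x}\!\begin{bmatrix} hu \\ hu^2 + \tfrac12 g h^2 \\ huv \end{bmatrix}
+ \frac{\partial}{\partial y}\!\begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac12 g h^2 \end{bmatrix}
=
\underbrace{\begin{bmatrix} 0 \\ -g h B_x \\ -g h B_y \end{bmatrix}}_{\text{bed slope}}
+
\underbrace{\begin{bmatrix} 0 \\ -g n^2 u \sqrt{u^2+v^2}\,/\,h^{1/3} \\ -g n^2 v \sqrt{u^2+v^2}\,/\,h^{1/3} \end{bmatrix}}_{\text{bed friction}}
\]

Here \(h\) is the water depth, \(u, v\) the depth-averaged velocities (so \(hu, hv\) are the discharges), \(B\) the bottom topography, \(g\) the gravitational acceleration, and \(n\) the Manning friction coefficient.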
Explicit Numerical Schemes Hyperbolic partial differential equation Enables explicit schemes Accurate modeling of discontinuities / shocks High accuracy in smooth parts without oscillations near discontinuities Capable of representing dry states Negative water depths ruin simulations Images from wikipedia.org, James Kilfiger 7
Explicit Numerical Schemes Additional wanted properties: Second-order accurate fluxes Total variation diminishing Well-balancedness Scheme of choice: A. Kurganov and G. Petrova, A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System, Communications in Mathematical Sciences, 5 (2007), 133-160 9
Kurganov-Petrova Spatial discretization Rewrite in terms of w = B + h Write in vector form Impose finite-volume grid 10
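A short sketch of that reformulation, as given in the Kurganov-Petrova paper cited earlier: with \(w = B + h\) (the water surface elevation), the vector form becomes \(Q_t + F(Q,B)_x + G(Q,B)_y = H(Q,B)\) with

\[
Q = \begin{bmatrix} w \\ hu \\ hv \end{bmatrix}, \qquad
F(Q,B) = \begin{bmatrix} hu \\ \dfrac{(hu)^2}{w-B} + \tfrac12 g (w-B)^2 \\ \dfrac{(hu)(hv)}{w-B} \end{bmatrix}, \qquad
G(Q,B) = \begin{bmatrix} hv \\ \dfrac{(hu)(hv)}{w-B} \\ \dfrac{(hv)^2}{w-B} + \tfrac12 g (w-B)^2 \end{bmatrix},
\]

and the depth is recovered as \(h = w - B\) wherever it is needed.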
Kurganov-Petrova Finite Volume Grid Q defined as cell averages B defined as piecewise bilinear F and G calculated across cell interfaces Source terms, H, calculated as cell averages 11
Kurganov-Petrova Flux calculations Continuous variables Discrete variables Slope reconstruction Flux calculation Integration points Dry states fix 12
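As a reference for the flux-calculation step (standard central-upwind form, not copied from the slide), the flux across the interface at \(x_{i+1/2}\) is

\[
F_{i+\frac12,j} =
\frac{a^{+} F\bigl(Q^{-}_{i+\frac12,j}\bigr) - a^{-} F\bigl(Q^{+}_{i+\frac12,j}\bigr)}{a^{+} - a^{-}}
+ \frac{a^{+} a^{-}}{a^{+} - a^{-}} \bigl(Q^{+}_{i+\frac12,j} - Q^{-}_{i+\frac12,j}\bigr),
\]

where \(Q^{\mp}\) are the slope-reconstructed values at the integration point seen from the left and right cells, and \(a^{\pm}\) are the largest and smallest local wave speeds (eigenvalues \(u \pm \sqrt{gh}\)) at the interface. The dry-states fix modifies the reconstruction so that negative water depths cannot occur.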
Kurganov-Petrova Temporal discretization Gather all explicit terms One ordinary differential equation in time per cell 13
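In standard notation, gathering the flux and source terms yields one ordinary differential equation per cell:

\[
\frac{dQ_{ij}}{dt} = H_{ij} - \frac{F_{i+\frac12,j} - F_{i-\frac12,j}}{\Delta x} - \frac{G_{i,j+\frac12} - G_{i,j-\frac12}}{\Delta y} \;=:\; R(Q)_{ij}.
\]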
Kurganov-Petrova Temporal discretization Discretize in time using a second-order Runge-Kutta method Total variation diminishing Semi-implicit friction source term 14
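The second-order total variation diminishing Runge-Kutta update referred to above is, in standard form,

\[
Q^{*}_{ij} = Q^{n}_{ij} + \Delta t\, R(Q^{n})_{ij}, \qquad
Q^{n+1}_{ij} = \tfrac12 Q^{n}_{ij} + \tfrac12 \bigl( Q^{*}_{ij} + \Delta t\, R(Q^{*})_{ij} \bigr).
\]

The friction source term is then treated semi-implicitly within each stage (for instance by dividing the updated momentum by a factor of the form \(1 + \Delta t\,\alpha\), with \(\alpha\) collecting the friction terms) rather than being added explicitly; the exact factor is not given on the slide.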
Kurganov-Petrova CFL condition Explicit scheme, time step restriction: Time step size restricted by a Courant-Friedrichs-Lewy condition The numerical domain of dependence must include the domain of dependence of the equation Each wave is allowed to travel at most one quarter grid cell per time step [Figure: space-time diagram of the mathematical propagation speed, illustrating stable vs. unstable time step choices] 15
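Written out, the restriction corresponding to the quarter-cell rule above is

\[
\Delta t \le \frac{1}{4}\,\min\!\left(\frac{\Delta x}{\max_{ij}\bigl(|u_{ij}|+\sqrt{g h_{ij}}\bigr)},\;\frac{\Delta y}{\max_{ij}\bigl(|v_{ij}|+\sqrt{g h_{ij}}\bigr)}\right),
\]

i.e. the fastest wave in the domain determines the global time step.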
Kurganov-Petrova Simulation cycle 1. Calculate fluxes 2. Calculate Δt 3. Halfstep 4. Calculate fluxes 5. Evolve in time 6. Boundary conditions 16
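A minimal sketch of what one iteration of this cycle looks like on the host; kernel and helper names are illustrative assumptions, not the simulator's actual identifiers, and the buffers (Q, Qstar, R, B) are assumed to be allocated elsewhere:

    // One second-order time step of the simulation cycle (names are assumptions).
    while (t < t_end) {
        fluxKernel<<<grid, block>>>(Q, B, R);               // 1. calculate fluxes (and max wave speeds)
        float dt = computeDt();                             // 2. reduce wave speeds, apply CFL condition
        rkStepKernel<<<grid, block>>>(Q, Q, R, Qstar, dt);  // 3. half step: Q* from Q^n and R(Q^n)
        fluxKernel<<<grid, block>>>(Qstar, B, R);           // 4. calculate fluxes from Q*
        rkStepKernel<<<grid, block>>>(Q, Qstar, R, Q, dt);  // 5. evolve in time: Q^{n+1}
        boundaryKernel<<<bcGrid, bcBlock>>>(Q);             // 6. set global boundary conditions
        t += dt;
    }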
Implementation GPU code Four CUDA kernels (share of runtime per step): Flux 87%, Timestep size (CFL condition) <1%, Forward Euler step 12%, Set boundary conditions <1% 17
Flux kernel Domain decomposition A nine-point nonlinear stencil Comprised of simpler stencils Heavy use of shared memory Computationally demanding Traditional block decomposition Overlapping ghost cells (a.k.a. the apron) Global ghost cells for boundary conditions Domain padding 18
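A hedged sketch of the apron pattern described above: each block cooperatively loads its cells plus a ring of ghost cells into shared memory before computing. Block dimensions, ghost-cell width and names are assumptions for illustration, not the original kernel:

    #define BLOCK_W 16
    #define BLOCK_H 14
    #define GC 2   // ghost cells on each side of the block (the "apron")

    // Cooperative load of one block plus its apron into shared memory.
    __global__ void fluxKernelSketch(const float* w, int nx, int ny, int pitch) {
        __shared__ float s_w[BLOCK_H + 2*GC][BLOCK_W + 2*GC];
        const int bx = blockIdx.x * BLOCK_W;
        const int by = blockIdx.y * BLOCK_H;

        // Each thread loads several values so that the whole apron gets filled.
        for (int j = threadIdx.y; j < BLOCK_H + 2*GC; j += BLOCK_H) {
            for (int i = threadIdx.x; i < BLOCK_W + 2*GC; i += BLOCK_W) {
                int gx = min(max(bx + i - GC, 0), nx - 1);   // clamp at the global ghost cells
                int gy = min(max(by + j - GC, 0), ny - 1);
                s_w[j][i] = w[gy * pitch + gx];
            }
        }
        __syncthreads();
        // ... slope reconstruction and flux computation on s_w would follow ...
    }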
Flux kernel Block size Block size is 16x14 (224 threads, a multiple of the 32-thread warp size) Shared memory use: 16 shmem buffers use ~16 KB Occupancy: configure Fermi for 48 KB shared memory and 16 KB L1 cache Three resident blocks Trades cache for occupancy Fermi cache used for global memory access 19
Flux kernel - computations Input Slopes Integration points Flux calculations Flux across north and east interface Bed slope source term for the cell Collective stencil operations: n threads and n+1 interfaces, so one warp performs extra calculations! The alternative is one thread per stencil operation: many idle threads, and extra register pressure 20
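A small sketch of the collective-stencil idea: n threads produce the n+1 interface values they need by letting designated threads do one extra computation, rather than launching n+1 threads. The interface formula and names are placeholders, not the real flux calculation:

    #define BLOCK_W 16

    // n threads cooperatively fill n+1 interface values in shared memory.
    __device__ void interfaceSketch(float out[BLOCK_W + 1], const float in[BLOCK_W + 2]) {
        const int i = threadIdx.x;                     // 0 .. BLOCK_W-1
        out[i] = 0.5f * (in[i] + in[i + 1]);           // stand-in for the real per-interface work
        if (i == 0)                                    // one warp performs the extra calculation
            out[BLOCK_W] = 0.5f * (in[BLOCK_W] + in[BLOCK_W + 1]);
    }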
Flux kernel flux limiter Limits the fluxes to obtain a non-oscillatory solution Generalized minmod limiter Least steep slope, or Zero if signs differ Creates divergent code paths Use branchless implementation (Hagen et al., 2007) Requires special sign function Much faster than naïve approach

    float minmod(float a, float b, float c) {
        return 0.25f
             * sign(a)
             * (sign(a) + sign(b))
             * (sign(b) + sign(c))
             * min(min(abs(a), abs(b)), abs(c));
    }

T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. In Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF, pp. 211-264. Springer-Verlag, 2007. 21
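The special sign function mentioned above must return -1, 0 or +1 as a float without introducing divergent branches; a possible device implementation (an assumption for illustration, not necessarily the one used in the talk) is shown below. In device code, min and abs on floats would map to fminf and fabsf.

    // Branchless sign: returns 1.0f, 0.0f or -1.0f.
    // Comparisons compile to predicated selects, so no divergent branches arise.
    __device__ float sign(float x) {
        return (float)((x > 0.0f) - (x < 0.0f));
    }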
Timestep size kernel Flux kernel calculates wave speed per cell Find global maximum Calculate timestep using the CFL condition Parallel reduction: Modelled on the CUDA SDK sample Template code Fully coalesced reads Without bank conflicts Optimization: Perform partial reduction in the flux kernel Reduces memory and bandwidth use by a factor of 192 (per-block reduction: 16x14 block → 1 value) Image from Optimizing Parallel Reduction in CUDA, Mark Harris 22
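Once the reduction has produced the global maxima, turning them into a time step is a one-liner on the host; a sketch under assumed names:

    // r_x, r_y: global maxima of |u| + sqrt(g*h) and |v| + sqrt(g*h),
    // found by the parallel reduction (names are illustrative).
    float computeDt(float r_x, float r_y, float dx, float dy) {
        return 0.25f * fminf(dx / r_x, dy / r_y);   // quarter-cell CFL restriction
    }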
Time integration kernel Computes Q* or Q^(n+1) Solves the time ODE per cell Trivial to implement Fully coalesced memory access Memory bound 23
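Since this kernel is described as trivial and memory bound, a hedged sketch of a forward Euler variant (one Runge-Kutta stage) may be helpful; the array layout and names are assumptions:

    // Q^{new}_i = Q_i + dt * R_i for each of the three conserved variables.
    // One thread per cell; neighbouring threads access neighbouring addresses,
    // so all reads and writes are fully coalesced.
    __global__ void eulerStepKernel(const float* Q, const float* R,
                                    float* Qout, float dt, int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            Qout[i        ] = Q[i        ] + dt * R[i        ];  // w  = B + h
            Qout[i +     n] = Q[i +     n] + dt * R[i +     n];  // hu
            Qout[i + 2 * n] = Q[i + 2 * n] + dt * R[i + 2 * n];  // hv
        }
    }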
Boundary conditions kernel Global boundary uses ghost cells Fixed inlet / outlet discharge Fixed depth Reflecting Outflow/Absorbing [Figure: global boundary and local ghost cells] Currently no mixed boundaries Can also supply a hydrograph Tsunamis Storm surges Tidal waves [Example hydrographs: 3.5 m tsunami over 1 h; 10 m storm surge over 4 d] 24
Boundary conditions kernel Similar to the CUDA SDK reduction sample, using templates: One block sets all four boundaries Boundary length (>64, >128, >256, >512) Boundary type (none, reflecting, fixed depth, fixed discharge, absorbing outlet) In total: 4*5*5*5*5 = 2500 realizations

    switch(block.x) {
        case 512: BCKernelLauncher<512, N, S, E, W>(grid, block, stream); break;
        case 256: BCKernelLauncher<256, N, S, E, W>(grid, block, stream); break;
        case 128: BCKernelLauncher<128, N, S, E, W>(grid, block, stream); break;
        case  64: BCKernelLauncher< 64, N, S, E, W>(grid, block, stream); break;
    }
    25
Optimization: Early exit Observation: Many dry areas do not require computation Use a small buffer to store wet blocks Exit the flux kernel if the block and its nearest neighbours are dry Up to 6x speedup Blocks still have to be scheduled Blocks read the auxiliary buffer One wet cell marks the whole block as wet 26
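A hedged sketch of the early-exit test at the top of the flux kernel, assuming a small per-block "wet map" in global memory; names and layout are illustrative, not the original implementation:

    // wet[] holds one flag per block from the previous step: 1 if any cell in
    // the block was wet, 0 otherwise. If this block and its four neighbours
    // are all dry, there is nothing to compute and the block exits at once.
    __device__ bool blockIsDry(const int* wet, int bx, int by, int nbx, int nby) {
        bool dry = (wet[by * nbx + bx] == 0);
        if (bx > 0)       dry = dry && (wet[by * nbx + bx - 1] == 0);
        if (bx < nbx - 1) dry = dry && (wet[by * nbx + bx + 1] == 0);
        if (by > 0)       dry = dry && (wet[(by - 1) * nbx + bx] == 0);
        if (by < nby - 1) dry = dry && (wet[(by + 1) * nbx + bx] == 0);
        return dry;
    }
    // At the top of the flux kernel:
    //   if (blockIsDry(wet, blockIdx.x, blockIdx.y, gridDim.x, gridDim.y)) return;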
Results - Performance Circular Dam break 1st order Euler 30% wet cells: 1200 megacells / s 50% wet cells: 900 megacells / s 100% wet cells: 300 megacells / s 2nd order Runge-Kutta 30% wet cells: 600 megacells / s 50% wet cells: 450 megacells / s 100% wet cells: 150 megacells / s 27
Results Multiple GPUs Single-node multi-GPU Four Tesla GPUs Threading Near-perfect weak scaling Near-perfect strong scaling Up to 380 million cells (16 GB) 19 000 x 19 000 cells 28
Verification 2D parabolic basin Planar water surface oscillates 100 x 100 cells Horizontal scale: 8 km Vertical scale: 3.3 m Simulation and analytical solution match well But, as with most schemes, errors grow along the wet-dry interface 29
Validation Barrage de Malpasset South-east France, near Fréjus Bursts at 21:13, December 2nd, 1959 40 meter high wall of water at 70 km/h (43 mi/h) Reaches the Mediterranean in 30 minutes 423 casualties, $68 million in damages Double-curvature dam 66.5 m high 220 m crest length 55 million cubic metres of water Images from Google maps, TeraMetrics 30
Validation Experimental data from 1:400 model 482 000 cells 1100 x 440 bathymetry values 15 meter resolution Accurately predicts maximum elevation and front arrival time Largest discrepancy at gauges 14 (arrival time) and 9 (elevation) Compares well with published results 31
Implementation CPU framework Simulation loop executed by the CPU Output to NetCDF Direct visualization via OpenGL 32
Video: http://www.youtube.com/watch?v=fbzbr-fjrwy 33
Live Demo Dell XPS M1330, Flamingo Pink Purchased 09-2008, price ~$1850 Intel Core 2 Duo T9300 @ 2.5 GHz 4.0 GB RAM NVIDIA GeForce 8400M GS 128 MB graphics RAM Only 16 CUDA cores (a GTX 480 has 480) Windows Vista Ultimate SP2 32-bit CUDA toolkit/SDK 3.1 32-bit CUDA driver 257.21 Microsoft Visual Studio 2008 Images from dell.com 34
Summary Learn how to simulate a half-hour dam break in seconds Faster than real-time performance 150-1200 megacells per second Verified and validated results Can accurately predict real-world events using single precision Direct visualization Interactive exploration of simulation results 35
References A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig, Simulation and Visualization of the Saint-Venant System using GPUs, Computing and Visualization in Science, special issue on Hot Topics in Computational Engineering, 2010 [forthcoming]. A. R. Brodtkorb, M. L. Sætra, and M. Altinakar, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, in review, 2010. A. R. Brodtkorb, Scientific Computing on Heterogeneous Architectures, Ph.D. Thesis, University of Oslo, submitted, 2010. 36
Thank you for your attention. Questions? http://babrodtk.at.ifi.uio.no http://hetcomp.com Andre.Brodtkorb@sintef.no 37