Efficient Simulations of the Shallow Water Equations on GPUs

1 Efficient Simulations of the Shallow Water Equations on GPUs
André R. Brodtkorb, Ph.D., Research Scientist, SINTEF ICT, Department of Applied Mathematics, Norway
Challenges in Tsunami Modeling and Risk Assessment (Desafíos del Modelado de Tsunamis y la Evaluación de Riesgo)
Universidad Técnica Federico Santa María, Valparaíso, Chile

2 Brief Outline
Introduction
GPU Computing
Programming GPUs for Water Resources
Efficient Simulation of the Shallow Water Equations on GPUs
Summary

3 Development of the Microprocessor
1942: Digital Electric Computer (Atanasoff and Berry)
1947: Transistor (Shockley, Bardeen, and Brattain)
Integrated Circuit (Kilby)
1971: Microprocessor (Hoff, Faggin, Mazor)
More transistors (Moore, 1965)

4 Development of the Microprocessor (Moore's law)
1971: 4004, 2300 transistors, 740 kHz
1982: 80286, 134 thousand transistors, 8 MHz
1993: Pentium P5, 1.18 million transistors, 66 MHz
2000: Pentium 4, 42 million transistors, 1.5 GHz
2010: Nehalem, 2.3 billion transistors, 2.66 GHz

5 The End of Frequency Scaling (2004)
The power density of microprocessors is proportional to the clock frequency cubed:
  1971-2004: 29% increase in frequency
  2004-2011: Frequency constant
  1999-2011: 25% increase in parallelism
Parallelism technologies:
  Multi-core (8x)
  Hyper-threading (2x)
  AVX/SSE/MMX/etc. (8x)
A serial program uses <2% of available resources!
[1] Asanovic et al., A View From Berkeley,

6 Overcoming the Power Wall
              Single-core   Dual-core
Frequency     100%          85%
Power         100%          100%
Performance   100%          170%
By lowering the frequency, the power consumption drops dramatically
By using multiple cores, we can get higher performance with the same power budget!

7 Massive Parallelism: The Graphics Processing Unit
                    CPU       GPU
Cores               4         16
Float ops / clock
Frequency (MHz)
GigaFLOPS
Power consumption   ~130 W    ~250 W
Memory (GiB)
[Figure: CPU versus GPU performance and memory bandwidth]

8 Early Programming of GPUs
GPUs were first programmed using OpenGL and other graphics languages
  Mathematics was written as operations on graphical primitives
  Extremely cumbersome and error prone
[Figure: element-wise matrix multiplication and matrix multiplication expressed as rendering operations; labels: Input A, Input B, Geometry, Output]
[1] Fast matrix multiplies using graphics hardware, Larsen and McAllister,

9 Examples of Early GPU Research at SINTEF
Preparation for FEM (~5x)
Self-intersection (~10x)
Registration of medical data (~20x)
Fluid dynamics and FSI (Navier-Stokes)
Inpainting (~400x compared with MATLAB code)
Euler Equations (~25x)
SW Equations (~25x)
Marine acoustics (~20x)
MATLAB Interface
Linear algebra
Water injection in a fluvial reservoir (20x)

10 Today's GPU Programming Languages
Languages shown: OpenCL, DirectX, DirectCompute, BrookGPU, AMD Brook+, OpenACC, C++ AMP, AMD CTM / CAL, PGI Accelerator, NVIDIA CUDA
Categories: Graphics APIs, "Academic" abstractions, C- and pragma-based languages

11 Examples of GPU Use Today
Thousands of academic papers
Big investment by large software companies
Growing use in supercomputers
[Chart: GPU Supercomputers on the Top 500 List; share axis from 0% to 14%; time axis starting at Aug. 2007 and continuing in yearly steps from Jul. 2008 through Jul. 2011 and beyond]

12 Programming GPUs
For efficient use of CPUs you need to know a lot about the hardware constraints:
  Threading, hyperthreading, etc.
  NUMA memory, memory alignment, etc.
  SSE/AVX instructions
  Cache size, cache prefetching, etc.
  Instruction latencies
For GPUs, it is exactly the same, but it is a "simpler" architecture:
  Less "magic" hardware to help you means it's easier to reach peak performance
  Less "magic" hardware means you need to consider the hardware for all programs

13 GPU Execution Model
[Figure: a grid of 3x2 blocks, each block holding 8x8 threads]
The same program is launched for all threads "in parallel"
The thread identifiers are used to calculate each thread's global position
The thread position is used to load and store data, and execute code
The parallel execution means that synchronization can be very expensive
Example: the thread in position (21, 11) has
  threadIdx.x = 5
  threadIdx.y = 3
  blockIdx.x = 2
  blockIdx.y = 1
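The index arithmetic on this slide can be checked on the host. The following is a minimal C++ sketch, not CUDA device code: the struct `Dim2` and the function `globalPosition` are hypothetical names that stand in for CUDA's built-in `blockIdx`, `blockDim`, and `threadIdx` variables.

```cpp
// Host-side sketch of the CUDA index calculation
//   global = blockIdx * blockDim + threadIdx
// Dim2 and globalPosition are illustrative names, not part of CUDA.
struct Dim2 {
    int x;
    int y;
};

// Returns the global (x, y) position of a thread within the grid.
Dim2 globalPosition(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return Dim2{blockIdx.x * blockDim.x + threadIdx.x,
                blockIdx.y * blockDim.y + threadIdx.y};
}
```

With 8x8 blocks, block (2, 1) and thread (5, 3) give position (2*8 + 5, 1*8 + 3) = (21, 11), which matches the example thread on the slide.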

14 GPU Execution Model
CPU scalar op: 1 thread, 1 operand on 1 data element
CPU SSE/AVX op: 1 thread, 1 operand on 2-8 data elements
GPU Warp op: 1 warp = 32 threads, 32 operands on 32 data elements
  Exposed as individual threads
  Actually runs the same instruction
  Divergence implies serialization and masking

15 Warp Serialization and Masking
Hardware serializes and masks divergent code flow:
  Programmer is relieved of fiddling with element masks (which is necessary for SSE)
  Execution time is still the sum of all branches taken
  Worst case: 1/32 performance
Important to minimize divergent code flow!
  Move conditionals into data, use min, max, conditional moves.
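As a small illustration of "move conditionals into data", the sketch below contrasts a branching and a branchless form of the same operation. It is plain host-side C++ with hypothetical function names; on a GPU the same expressions would sit inside a kernel, and a compiler may already turn the simple branch into a select on its own.

```cpp
#include <algorithm>

// Branching version: threads of a warp that disagree on the test diverge.
float clampDepthBranching(float h) {
    if (h < 0.0f) {
        return 0.0f;
    }
    return h;
}

// Branchless version: every thread executes the same max operation.
float clampDepthBranchless(float h) {
    return std::max(h, 0.0f);
}

// Conditional expressed as data: mask is 0.0f or 1.0f, so the update is
// applied or skipped by multiplication rather than by a branch.
float maskedUpdate(float value, float delta, float mask) {
    return value + mask * delta;
}
```

Both clamp functions return the same values; the masked update trades a possible branch for one extra multiplication.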

16 Example: Warp Serialization in Newton's Method
First if-statement
  Masks out superfluous threads
  Not significant
Iteration loop
  Identical for all threads
Early exit
  Possible divergence
  Only beneficial when all threads in a warp can exit
Removing the early exit increases performance from 0.84 ms to 0.69 ms (kernel only)

  __global__ void newton(float* x, const float* a, const float* b, const float* c, int N) {
      // maxit (the iteration limit) is assumed to be defined elsewhere
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) {                          // mask out superfluous threads
          const float la = a[i];
          const float lb = b[i];
          const float lc = c[i];
          float lx = 0.f;
          for (int it = 0; it < maxit; it++) {
              float f = la*lx*lx + lb*lx + lc;
              if (fabsf(f) < 1e-7f) {       // early exit: possible divergence
                  break;
              }
              float df = 2.f*la*lx + lb;
              lx = lx - f/df;
          }
          x[i] = lx;
      }
  }

(But fails 7 of times since multiple zeros aren't handled properly, but that is a different story...)

17 Algorithm Design Example: Solving the Heat Equation
The heat equation describes diffusive heat conduction in a medium
Prototypical partial differential equation
u is the temperature, kappa is the diffusion coefficient, t is time, and x is space.
We want to design an algorithm that suits the GPU execution model
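With the symbols defined on this slide, the heat equation in its standard one-dimensional form reads as follows. The restriction to one spatial dimension and a constant kappa is an assumption made here for brevity.

```latex
\frac{\partial u}{\partial t} = \kappa \, \frac{\partial^2 u}{\partial x^2}
```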

18 Finding a solution to the heat equation
Solving such partial differential equations analytically is nontrivial in all but a few very special cases
Solution strategy: replace the continuous derivatives with approximations at a set of grid points
Solve for each grid point numerically on a computer
"Use many grid points, and high order of approximation to get good results"

19 The Heat Equation with an implicit scheme
1. We can construct an implicit scheme by carefully choosing the "correct" approximation of derivatives
2. This ends up in a system of linear equations
3. Solve Ax=b using standard GPU methods to evolve the solution in time

20 The Heat Equation with an implicit scheme
Such implicit schemes are often sought after:
  They allow for large time steps
  They can be solved using standard tools
  They allow complex geometries
  They can be very accurate
However
  Linear algebra solvers can be slow and memory hungry, especially on the GPU
  Many sparse solvers are inherently serial and unsuited for the GPU
  For many time-varying phenomena, we are also interested in the temporal dynamics of the problem

21 Algorithmic and numerical performance
Total performance is the product of algorithmic and numerical performance
  Your mileage may vary: algorithmic performance is highly problem dependent
Sparse linear algebra solvers have low numerical performance
  Only able to utilize a fraction of the capabilities of CPUs, and worse on GPUs
Explicit schemes with compact stencils can give near-peak numerical performance
  May give the overall highest performance
[Figure: methods placed along two axes, numerical performance and algorithmic performance: explicit stencils, tridiag, PLU, QR, red-black, multigrid, Krylov]

22 Explicit schemes with compact stencils
Explicit schemes can give rise to compact stencils
  Embarrassingly parallel
  Perfect for the GPU!
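A compact explicit stencil can be sketched in a few lines. The C++ function below is a host-side illustration for the 1D heat equation, assuming forward Euler in time, central differences in space, and fixed end values; the name `explicitHeatStep` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One forward-Euler step for u_t = kappa * u_xx on a uniform 1D grid.
// r = kappa * dt / (dx * dx); this scheme is stable only for r <= 0.5.
// Each output value depends on just three neighbouring input values, so
// on a GPU every loop iteration could be handled by its own thread.
std::vector<float> explicitHeatStep(const std::vector<float>& u, float r) {
    std::vector<float> next(u);  // end values are kept fixed
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        next[i] = u[i] + r * (u[i - 1] - 2.0f * u[i] + u[i + 1]);
    }
    return next;
}
```

Because no iteration reads a value written by another, there is nothing to synchronize within a step, which is what makes the scheme embarrassingly parallel.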

23 The Shallow Water Equations
A hyperbolic partial differential equation
  First described by de Saint-Venant
  Conservation of mass and momentum
Gravity waves in 2D free surface
  Gravity-induced fluid motion
  Governing flow is horizontal
Not only used to describe physics of water:
  Simplification of atmospheric flow
  Avalanches
  ...
Water image credit: Ian Britton

24 Target Application Areas
Tsunamis
  2011: Japan (5321+)
  2004: Indian Ocean
Floods
  2010: Pakistan (2000+)
  1931: China floods
Storm Surges
  2005: Hurricane Katrina (1836)
  1530: Netherlands
Dam breaks
  1975: Banqiao Dam
  1959: Malpasset (423)
Images from wikipedia.org,

25 Using GPUs for Shallow Water Simulations
In preparation for events: evaluate possible scenarios
  Simulation of many ensemble members
  Creation of inundation maps and emergency action plans
In response to ongoing events
  Simulate possible scenarios in real time
  Simulate strategies for action (deployment of barriers, evacuation of affected areas, etc.)
High performance requirements => use the GPU
Simulation result from NOAA
Inundation map from Los Angeles County Tsunami Inundation Maps,

26 The Shallow Water Equations
Vector of conserved variables
Flux functions
Bed slope source term
Bed friction source term
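The four labelled terms correspond to the shallow water equations in conservative form. A standard way of writing them is given below, where h is the water depth, hu and hv are the discharges, g is the gravitational acceleration, and B is the bathymetry. The friction term is written with a Chézy coefficient C_z, which is one common choice and an assumption here; other friction laws fit the same structure.

```latex
\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}_t
+ \begin{bmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \\ huv \end{bmatrix}_x
+ \begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2} g h^2 \end{bmatrix}_y
= \begin{bmatrix} 0 \\ -g h B_x \\ -g h B_y \end{bmatrix}
+ \begin{bmatrix} 0 \\ -g u \sqrt{u^2 + v^2} / C_z^2 \\ -g v \sqrt{u^2 + v^2} / C_z^2 \end{bmatrix}
```

From left to right: the time derivative of the vector of conserved variables, the flux functions in x and y, the bed slope source term, and the bed friction source term.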

27 The Shallow Water Equations
A hyperbolic partial differential equation
  Enables explicit schemes
Solutions form discontinuities / shocks
  Require high accuracy in smooth parts without oscillations near discontinuities
Solutions include dry areas
  Negative water depths ruin simulations
Often high accuracy requirements
  Order of spatial/temporal discretization
  Floating point rounding errors
Can be difficult to capture "lake at rest"
[Figure: a standing wave or shock]

28 Finding the perfect numerical scheme
We want to find a numerical scheme that
  Works well for our target scenarios
  Handles dry zones (land)
  Handles shocks gracefully (without smearing or causing oscillations)
  Preserves "lake at rest"
  Has the accuracy for capturing the required physics
  Preserves the physical quantities
  Fits GPUs well
    Works well with single precision
    Is embarrassingly parallel
    Has a compact stencil

29 The Finite Volume Scheme of Choice*
Scheme of choice: A. Kurganov and G. Petrova, A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System, Communications in Mathematical Sciences, 5 (2007)
  Second order accurate fluxes
  Total Variation Diminishing
  Well-balanced (captures lake-at-rest)
  Compact stencil
(Good, but not perfect, match with the GPU)
* With all possible disclaimers

30 Discretization
Our grid consists of a set of cells or volumes
  The bathymetry is a piecewise bilinear function
  The physical variables (h, hu, hv) are piecewise constants per volume
Physical quantities are transported across the cell interfaces
Algorithm:
  1. Reconstruct physical variables
  2. Evolve the solution
  3. Average over grid cells

31 Kurganov-Petrova Spatial Discretization (Computing fluxes)
[Figure panels: continuous variables, discrete variables, reconstruction, dry states fix, slope evaluation, flux calculation]

32 Temporal Discretization (Evolving in time)
Gather all known terms
Use second order Runge-Kutta to solve the ODE

33 Overview of a Full Simulation Cycle
1. Calculate fluxes
2. Calculate Dt
3. ODE halfstep
4. Calculate fluxes
5. Evolve in time
6. Apply boundary conditions

34 Implementation: GPU code
Four CUDA kernels:
  87% Flux calculation
  <1% Timestep size (CFL condition)
  12% Forward Euler step
  <1% Set boundary conditions
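The time step kernel enforces a CFL condition. The following host-side C++ sketch shows the underlying calculation under stated assumptions: cell-centred arrays for depth and velocity, wave speeds of the form |u| + sqrt(g h), and a Courant number `cfl` supplied by the caller. The function name and signature are hypothetical. On the GPU the maximum would be computed with a parallel reduction rather than a loop.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest stable time step for an explicit shallow water scheme.
// h: water depth, u and v: velocities, one entry per cell.
// cfl is the scheme-dependent Courant number chosen by the caller.
// Returns +infinity if no cell has a nonzero wave speed.
float computeTimestep(const std::vector<float>& h, const std::vector<float>& u,
                      const std::vector<float>& v, float dx, float dy,
                      float cfl, float g = 9.81f) {
    float maxSpeedX = 0.0f;
    float maxSpeedY = 0.0f;
    for (std::size_t i = 0; i < h.size(); ++i) {
        const float c = std::sqrt(g * std::max(h[i], 0.0f));  // gravity wave speed
        maxSpeedX = std::max(maxSpeedX, std::fabs(u[i]) + c);
        maxSpeedY = std::max(maxSpeedY, std::fabs(v[i]) + c);
    }
    const float inf = std::numeric_limits<float>::infinity();
    const float dtX = maxSpeedX > 0.0f ? dx / maxSpeedX : inf;
    const float dtY = maxSpeedY > 0.0f ? dy / maxSpeedY : inf;
    return cfl * std::min(dtX, dtY);
}
```

The result is a single number per time step, which is consistent with this kernel taking under 1% of the run time.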

35 Flux kernel: Domain decomposition
A nine-point nonlinear stencil
  Composed of simpler stencils
  Heavy use of shared memory
  Computationally demanding
Traditional block decomposition
  Overlapping ghost cells (a.k.a. apron)
  Global ghost cells for boundary conditions
  Domain padding

36 Flux kernel: Block size
Block size is 16x14, from trying to optimize many parameters:
  Warp size: multiple of 32
  Shared memory use: 16 shmem buffers use ~16 KB
  Occupancy
    Use 48 KB shared mem, 16 KB cache
    Three resident blocks
    Trades cache for occupancy
  Fermi cache
  Global memory access

37 Optimization of Flux Kernel: The Flux Limiter
Limits the fluxes to obtain a non-oscillatory solution
Generalized minmod limiter
  Least steep slope, or
  Zero if signs differ
  Creates divergent code paths
  Is executed a large number of times
Use branchless implementation (2007)
  Requires special sign function
  Significantly faster than if-test approach

  // sign(x) must return -1, 0, or +1
  float minmod(float a, float b, float c) {
      return 0.25f
          * sign(a)
          * (sign(a) + sign(b))
          * (sign(b) + sign(c))
          * min( min(abs(a), abs(b)), abs(c) );
  }

(2007) T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF. Springer Verlag,
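The branchless formula can be checked against a straightforward if-test version. The sketch below is host-side C++; `signOf` is one possible form of the "special sign function" and, like the other names, is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cmath>

// Sign function returning -1, 0, or +1, computed without branches.
float signOf(float x) {
    return static_cast<float>((x > 0.0f) - (x < 0.0f));
}

// Branchless minmod of three arguments (same formula as on the slide).
float minmodBranchless(float a, float b, float c) {
    return 0.25f * signOf(a) * (signOf(a) + signOf(b)) * (signOf(b) + signOf(c)) *
           std::min(std::min(std::fabs(a), std::fabs(b)), std::fabs(c));
}

// Reference version with explicit tests, for comparison.
float minmodBranching(float a, float b, float c) {
    if (a > 0.0f && b > 0.0f && c > 0.0f) return std::min(std::min(a, b), c);
    if (a < 0.0f && b < 0.0f && c < 0.0f) return std::max(std::max(a, b), c);
    return 0.0f;
}
```

When all three signs agree, the sign factors multiply to plus or minus 4, which the 0.25 cancels; when any two adjacent signs differ, one of the sums is zero and the whole product vanishes.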

38 Assessing performance
Different ways of assessing performance
  Speedups can be dishonest
  Numerical performance does not tell all
    Number of iterations required, size of time step, and other algorithmic parameters are just as important
Profile your code, and see what percentage of peak performance you attain
  You should reach "near-peak" GFLOPS or GB/s, or explain why not
  Gives an impression of scalability
Our code reaches a high level of resource utilization
Our code is significantly faster than the CPU

39 Accuracy and Error
Garbage in, garbage out
Simulations have many sources of errors
  Humans!
  Model and parameters
    Friction coefficient estimation
    "Magic" numerical parameters
    Choice of boundary conditions
  Numerical dissipation
  Handling of wetting and drying
  Measurement
    Radar / Lidar / Stereoscopy
    Low spatial resolution
    Low vertical accuracy
  Gridding
    Can require expert knowledge
  Computer precision
Recycle image from recyclereminders.com
Cray computer image from Wikipedia, user David.Monniaux

40 Single Versus Double Precision
Given erroneous data, double precision calculates a more accurate (but still wrong) answer
Single precision benefits:
  Uses half the storage space
  Uses half the bandwidth
  Executes (at least) twice as fast

41 Single Versus Double Precision Example
Three different test cases
  Low water depth (wet-wet)
  High water depth (wet-wet)
  Synthetic terrain with dam break (wet-dry)
Conclusions:
  Loss in conservation on the order of machine epsilon
  Single precision gives larger error
  Errors related to the wet-dry front are more than an order of magnitude larger (model error)
  Single precision is sufficiently accurate for this scheme

42 More on Accuracy
We were experiencing large errors in conservation of mass for special cases
The equations are written in terms of w = B+h to preserve "lake at rest"
  Large B, and small h
  The scale difference gives major floating point errors (h flushed to zero)
  Even double precision is insufficient
Solve by storing only h, and reconstruct w only when required!
  Single precision sufficient for most real-world cases
Always store the quantity of interest!
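The loss of a small h next to a large B is easy to reproduce. The C++ sketch below uses illustrative values (B = 1000, h = 1e-5) that are assumptions chosen to trigger the effect in single precision; `depthFromSurface` is a hypothetical name. The same mechanism applies in double precision at a larger scale difference.

```cpp
// Stores w = B + h in single precision and recovers h as w - B.
// 'volatile' forces the sum to be rounded to float before the subtraction.
float depthFromSurface(float B, float h) {
    volatile float w = B + h;
    return w - B;
}
```

Near 1000, adjacent single precision values are about 6.1e-5 apart, so adding 1e-5 rounds back to exactly 1000 and the recovered depth is zero. Storing h directly keeps the small value intact.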

43 1D Validation: Flow over Triangular bump (90s)
[Figure: simulated versus measured results at gauges including G2, G4, G8, G10, G11, and G13]

44 2D Verification: Parabolic basin
Analytical 2D parabolic basin (Thacker)
  Planar water surface oscillates
  100 x 100 cells
  Horizontal scale: 8 km
  Vertical scale: 3.3 m
Simulation and analytical solution match well
  But, as with most schemes, errors grow along the wet-dry interface (model error)

45 2D Validation: Barrage du Malpasset
We model the equations correctly, but can we model real events?
South-east France near Fréjus: Barrage du Malpasset
  Double curvature dam, 66.5 m high, 220 m crest length, 55 million m^3
  Bursts at 21:13 on December 2nd, 1959
  Reaches the Mediterranean in 30 minutes (speeds up to 70 km/h)
  423 casualties, $68 million in damages
Validate against experimental data from a 1:400 model
  482 000 cells (1099 x 439 cells)
  15 meter resolution
Our results match experimental data very well
  Discrepancies at gauges 14 and 9 present in most (all?) published results
Image from google earth, mes-ballades.com

46 Bonus material: Achieving Even Higher Performance

47 Multi-GPU simulations
Because we have a finite domain of dependence, we can create independent partitions of the domain and distribute them to multiple GPUs
  Modern PCs have up to four GPUs
  Near-perfect weak and strong scaling
Collaboration with Martin L. Sætra
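One way to form such partitions is to split the grid into bands of rows, each extended by ghost rows that cover the stencil's reach. The C++ sketch below makes several assumptions that the slide does not state: a row-wise split, a caller-supplied ghost width, and the hypothetical names `Partition` and `partitionRows`.

```cpp
#include <vector>

// Row range [begin, end) owned by one GPU, plus the wider range that
// includes ghost rows read from neighbouring partitions.
struct Partition {
    int begin, end;          // rows owned by this partition
    int haloBegin, haloEnd;  // rows read by this partition
};

std::vector<Partition> partitionRows(int rows, int parts, int ghostRows) {
    std::vector<Partition> result;
    for (int p = 0; p < parts; ++p) {
        // Distribute rows as evenly as possible.
        const int begin = static_cast<int>(static_cast<long long>(rows) * p / parts);
        const int end = static_cast<int>(static_cast<long long>(rows) * (p + 1) / parts);
        Partition part;
        part.begin = begin;
        part.end = end;
        part.haloBegin = begin - ghostRows < 0 ? 0 : begin - ghostRows;
        part.haloEnd = end + ghostRows > rows ? rows : end + ghostRows;
        result.push_back(part);
    }
    return result;
}
```

Only the ghost rows need to be exchanged between GPUs after each step, so communication grows with the partition boundary rather than with the partition area.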

48 Early exit optimization
Observation: Many dry areas do not require computation
  Use a small buffer to store wet blocks
  Exit flux kernel if nearest neighbors are dry
Up to 6x speedup (mileage may vary)
  Blocks still have to be scheduled
  Blocks read the auxiliary buffer
  One wet cell marks the whole block as wet

49 Sparse domain optimization
The early exit strategy launches too many blocks
Dry blocks should not need to check that they are dry!
Sparse Compute: Do not perform any computations on dry parts of the domain
Sparse Memory: Do not save any values in the dry parts of the domain
Ph.D. work of Martin L. Sætra

50 Sparse domain optimization
1. Find all wet blocks
2. Grow to include dependencies
3. Sort block indices and launch the required number of blocks
Similarly for memory, but it gets quite complicated
2x improvement over early exit (mileage may vary)!
Comparison using an average of 26% wet cells
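The three steps can be sketched on the host as follows. This C++ example assumes a per-block wet flag as input and treats the four direct neighbours of a wet block as its dependencies; both choices, and the name `blocksToLaunch`, are assumptions for illustration rather than the simulator's actual data layout.

```cpp
#include <vector>

// wet[by * nx + bx] is true if block (bx, by) contains at least one wet cell.
// Returns the linear indices of all blocks that must be computed, in
// ascending order: wet blocks plus their four direct neighbours.
std::vector<int> blocksToLaunch(const std::vector<bool>& wet, int nx, int ny) {
    std::vector<bool> needed(wet.size(), false);
    for (int by = 0; by < ny; ++by) {
        for (int bx = 0; bx < nx; ++bx) {
            if (!wet[by * nx + bx]) continue;
            needed[by * nx + bx] = true;                       // step 1: wet block
            if (bx > 0)      needed[by * nx + bx - 1] = true;  // step 2: grow
            if (bx + 1 < nx) needed[by * nx + bx + 1] = true;
            if (by > 0)      needed[(by - 1) * nx + bx] = true;
            if (by + 1 < ny) needed[(by + 1) * nx + bx] = true;
        }
    }
    std::vector<int> indices;                                  // step 3: compact
    for (int i = 0; i < nx * ny; ++i) {
        if (needed[i]) indices.push_back(i);
    }
    return indices;
}
```

The kernel would then be launched with `indices.size()` blocks, each looking up its real block position in the list, so dry blocks are never scheduled at all.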

51 Video

52 Summary

53 Summary
GPUs are powerful
  7x theoretical difference between CPU and GPU
  Forces you to think about hardware (in a good way)
GPUs have never been easier to program
  Modern languages and toolkits help you get a flying start
  Easy to achieve speed-ups
  Expert knowledge still required to reach peak performance
Shallow water simulations map very well to GPUs
  Able to reach near-peak performance
  Physical correctness can be ensured, even using single precision
  Multi-GPU and sparse domain optimizations give even higher performance

54 Thank you for your attention
Talk material based on work on our simulator engine. Some references:
  A. Brodtkorb, M. L. Sætra, Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices, CMWR Proceedings, 2012
  A. Brodtkorb, M. L. Sætra, M. Altinakar, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, Computers & Fluids, 55, (2011)
  A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig, Simulation and Visualization of the Saint-Venant System using GPUs, Computing and Visualization in Science, 13(7), (2011)
Contact: André R. Brodtkorb

55 "This slide is intentionally left blank"

Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs NVIDIA GPU Technology Conference San Jose, California, 2010 André R. Brodtkorb Talk Outline Learn how to simulate a half an hour dam