Efficient Simulations of the Shallow Water Equations on GPUs
1 Efficient Simulations of the Shallow Water Equations on GPUs
André R. Brodtkorb, Ph.D., Research Scientist, SINTEF ICT, Department of Applied Mathematics, Norway
Challenges of Tsunami Modeling and Risk Assessment (Desafios del Modelado de Tsunamis y la Evaluación de Riesgo), Universidad Tecnica Federico Santa Maria, Valparaíso, Chile
2 Brief Outline
- Introduction
- GPU Computing
- Programming GPUs for Water Resources
- Efficient Simulation of the Shallow Water Equations on GPUs
- Summary
3 Development of the Microprocessor
- 1942: Digital Electric Computer (Atanasoff and Berry)
- 1947: Transistor (Shockley, Bardeen, and Brattain)
- Integrated Circuit (Kilby)
- Microprocessor (Hoff, Faggin, Mazor)
- More transistors (Moore, 1965)
4 Development of the Microprocessor (Moore's law)
- 1971: 4004, 2300 transistors, 740 kHz
- 1982: 80286, 134 thousand transistors, 8 MHz
- 1993: Pentium P5, 1.18 million transistors, 66 MHz
- 2000: Pentium 4, 42 million transistors, 1.5 GHz
- 2010: Nehalem, 2.3 billion transistors, 2.66 GHz
5 The end of frequency scaling (2004)
- The power density of microprocessors is proportional to the clock frequency cubed [1]
- Trends shown: 29% increase in frequency; frequency constant; 25% increase in parallelism
- Parallelism technologies:
  - Multi-core (8x)
  - Hyper-threading (2x)
  - AVX/SSE/MMX/etc. (8x)
- A serial program uses <2% of available resources!
[1] Asanović et al., A View From Berkeley
6 Overcoming the Power Wall
[Figure: performance, power, and frequency of a single-core versus a dual-core processor. Single-core: 100% on all three. Dual-core: 85% frequency, 170% performance, 100% power.]
- By lowering the frequency, the power consumption drops dramatically
- By using multiple cores, we can get higher performance with the same power budget!
7 Massive Parallelism: The Graphics Processing Unit
- Cores: 4 (CPU), 16 (GPU)
- Power consumption: ~130 W (CPU), ~250 W (GPU)
- Also compared: float ops per clock, frequency (MHz), GigaFLOPS, memory (GiB)
[Figure: performance and memory bandwidth of CPU versus GPU]
8 Early Programming of GPUs
- GPUs were first programmed using OpenGL and other graphics languages
- Mathematics was written as operations on graphical primitives
- Extremely cumbersome and error prone
[Figure: matrix multiplication expressed as graphics operations: inputs A and B, geometry, element-wise multiplication, output]
[1] Larsen and McAllister, Fast matrix multiplies using graphics hardware
9 Examples of Early GPU Research at SINTEF
- Preparation for FEM (~5x)
- Self-intersection (~10x)
- Registration of medical data (~20x)
- Fluid dynamics and FSI (Navier-Stokes)
- Inpainting (~400x MATLAB code)
- Euler equations (~25x)
- SW equations (~25x)
- Marine acoustics (~20x)
- MATLAB interface
- Linear algebra
- Water injection in a fluvial reservoir (20x)
10 Today's GPU Programming Languages
- Languages shown: OpenCL, DirectX, DirectCompute, BrookGPU, AMD Brook+, OpenACC, C++ AMP, AMD CTM / CAL, PGI Accelerator, NVIDIA CUDA
- Grouped as: graphics APIs, "academic" abstractions, and C- and pragma-based languages
11 Examples of GPU Use Today
- Thousands of academic papers
- Big investment by large software companies
- Growing use in supercomputers
[Figure: share of GPU supercomputers on the Top 500 list, axis from 0% to 14%, from August 2007 onwards]
12 Programming GPUs
- For efficient use of CPUs you need to know a lot about the hardware constraints:
  - Threading, hyperthreading, etc.
  - NUMA memory, memory alignment, etc.
  - SSE/AVX instructions
  - Cache size, cache prefetching, etc.
  - Instruction latencies
- For GPUs it is exactly the same, but it is a "simpler" architecture:
  - Less "magic" hardware to help you means it's easier to reach peak performance
  - Less "magic" hardware means you need to consider the hardware for all programs
13 GPU Execution Model
- The same program is launched for all threads "in parallel"
- The thread identifiers are used to calculate each thread's global position
- The thread position is used to load and store data, and to execute code
- The parallel execution means that synchronization can be very expensive
[Figure: a grid of 3x2 blocks, each block of 8x8 threads. The thread in position (21, 11) has threadIdx.x = 5, threadIdx.y = 3, blockIdx.x = 2, blockIdx.y = 1.]
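The position calculation on this slide can be sketched as follows. This is plain C++ on the host rather than CUDA, so it compiles anywhere; the struct `Index2D` and the function `globalPosition` are hypothetical names standing in for CUDA's built-in `blockIdx`, `blockDim`, and `threadIdx`.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in 2D index types.
struct Index2D { int x; int y; };

// Global position of a thread: offset of its block plus its position inside the block.
Index2D globalPosition(Index2D blockIdx, Index2D blockDim, Index2D threadIdx) {
    assert(threadIdx.x >= 0 && threadIdx.x < blockDim.x);
    assert(threadIdx.y >= 0 && threadIdx.y < blockDim.y);
    return { blockIdx.x * blockDim.x + threadIdx.x,
             blockIdx.y * blockDim.y + threadIdx.y };
}
```

With 8x8 blocks, block (2, 1) and thread (5, 3) give position (2*8 + 5, 1*8 + 3) = (21, 11), matching the figure.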
14 GPU Execution Model
- CPU scalar op: 1 thread, 1 operand on 1 data element
- CPU SSE/AVX op: 1 thread, 1 operand on 2-8 data elements
- GPU warp op: 1 warp = 32 threads, 32 operands on 32 data elements
  - Exposed as individual threads
  - Actually runs the same instruction
  - Divergence implies serialization and masking
15 Warp Serialization and Masking
- Hardware serializes and masks divergent code flow:
  - The programmer is relieved of fiddling with element masks (which is necessary for SSE)
  - Execution time is still the sum of all branches taken
  - Worst case: 1/32 performance
- Important to minimize divergent code flow!
  - Move conditionals into data; use min, max, and conditional moves
16 Example: Warp Serialization in Newton's Method
- First if-statement: masks out superfluous threads; not significant
- Iteration loop: identical for all threads
- Early exit: possible divergence; only beneficial when all threads in a warp can exit
- Removing the early exit reduces the run time from 0.84 ms to 0.69 ms (kernel only)

    // maxit is the iteration cap; the slide does not show its definition,
    // so it is assumed to be a constant defined elsewhere.
    __global__ void newton(float* x, const float* a, const float* b, const float* c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {                        // mask out superfluous threads
            const float la = a[i];
            const float lb = b[i];
            const float lc = c[i];
            float lx = 0.f;
            for (int it = 0; it < maxit; it++) {
                float f = la*lx*lx + lb*lx + lc;
                if (fabsf(f) < 1e-7f) {     // early exit: possible divergence
                    break;
                }
                float df = 2.f*la*lx + lb;
                lx = lx - f/df;
            }
            x[i] = lx;
        }
    }

(But it fails 7 of [...] times, since multiple zeros aren't handled properly; that is a different story.)
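A minimal sketch of the variant without the early exit, written as plain host-side C++ for a single element so it can be compiled without CUDA. The function name `newtonFixed` and passing `maxit` as a parameter are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>

// Newton iteration for a*x^2 + b*x + c = 0 with a fixed iteration count.
// There is no early exit, so every thread of a warp would execute the same
// number of iterations and no divergence can occur inside the loop.
float newtonFixed(float a, float b, float c, int maxit) {
    assert(maxit >= 0);
    float x = 0.f;
    for (int it = 0; it < maxit; ++it) {
        float f  = a * x * x + b * x + c;   // function value
        float df = 2.f * a * x + b;         // derivative
        x = x - f / df;                     // once f is 0, x no longer changes
    }
    return x;
}
```

For example, `newtonFixed(1.f, -3.f, 2.f, 20)` converges towards the root x = 1 of x^2 - 3x + 2. Like the kernel on the slide, the sketch does not guard against a zero derivative.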
17 Algorithm Design Example: Solving the Heat Equation
- The heat equation describes diffusive heat conduction in a medium
- Prototypical partial differential equation
- u is the temperature, kappa is the diffusion coefficient, t is time, and x is space
- We want to design an algorithm that suits the GPU execution model
18 Finding a solution to the heat equation
- Solving such partial differential equations analytically is nontrivial in all but a few very special cases
- Solution strategy: replace the continuous derivatives with approximations at a set of grid points
- Solve for each grid point numerically on a computer
- "Use many grid points, and high order of approximation to get good results"
19 The Heat Equation with an implicit scheme
1. We can construct an implicit scheme by carefully choosing the "correct" approximation of derivatives
2. This ends up in a system of linear equations
3. Solve Ax=b using standard GPU methods to evolve the solution in time
20 The Heat Equation with an implicit scheme
- Such implicit schemes are often sought after:
  - They allow for large time steps
  - They can be solved using standard tools
  - They allow complex geometries
  - They can be very accurate
- However:
  - Linear algebra solvers can be slow and memory hungry, especially on the GPU
  - Many sparse solvers are inherently serial and unsuited for the GPU
  - For many time-varying phenomena, we are also interested in the temporal dynamics of the problem
21 Algorithmic and numerical performance
- Total performance is the product of algorithmic and numerical performance
  - Your mileage may vary: algorithmic performance is highly problem dependent
- Sparse linear algebra solvers have low numerical performance
  - Only able to utilize a fraction of the capabilities of CPUs, and worse on GPUs
- Explicit schemes with compact stencils can give near-peak numerical performance
  - May give the overall highest performance
[Figure: numerical performance versus algorithmic performance for explicit stencils, tridiagonal solvers, PLU, QR, red-black, multigrid, and Krylov methods]
22 Explicit schemes with compact stencils
- Explicit schemes can give rise to compact stencils
- Embarrassingly parallel
- Perfect for the GPU!
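To make the compact-stencil idea concrete, here is a sketch of one explicit time step for the 1D heat equation u_t = kappa u_xx, using forward Euler in time and central differences in space. The discretization, the fixed boundary values, and the function name `heatStep` are assumptions for illustration; the slides do not show their scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit step: u_i^{n+1} = u_i^n + r * (u_{i-1}^n - 2 u_i^n + u_{i+1}^n),
// with r = kappa * dt / dx^2. The scheme is stable for r <= 0.5.
// Boundary cells are kept fixed (an assumed Dirichlet condition).
std::vector<float> heatStep(const std::vector<float>& u, float r) {
    assert(r > 0.f && r <= 0.5f);
    std::vector<float> next(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        // Each new value depends on only three old values, so all cells
        // can be updated independently: one GPU thread per cell.
        next[i] = u[i] + r * (u[i - 1] - 2.f * u[i] + u[i + 1]);
    }
    return next;
}
```

Because the loop body reads only old values and writes one new value, the iterations have no dependencies on each other, which is what makes the scheme embarrassingly parallel.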
23 The Shallow Water Equations
- A hyperbolic partial differential equation
  - First described by de Saint-Venant
  - Conservation of mass and momentum
- Gravity waves in 2D free surface
  - Gravity-induced fluid motion
  - Governing flow is horizontal
- Not only used to describe the physics of water:
  - Simplification of atmospheric flow
  - Avalanches
  - ...
Water image: Ian Britton
24 Target Application Areas
- Tsunamis: 2011 Japan (5321+); 2004 Indian Ocean
- Floods: 2010 Pakistan (2000+); 1931 China floods
- Storm surges: 2005 Hurricane Katrina (1836); 1530 Netherlands
- Dam breaks: 1975 Banqiao Dam; 1959 Malpasset (423)
Images from wikipedia.org
25 Using GPUs for Shallow Water Simulations
- In preparation for events: evaluate possible scenarios
  - Simulation of many ensemble members
  - Creation of inundation maps and emergency action plans
- In response to ongoing events:
  - Simulate possible scenarios in real time
  - Simulate strategies for action (deployment of barriers, evacuation of affected areas, etc.)
- High performance requirements => use the GPU
Simulation result from NOAA. Inundation map from Los Angeles County Tsunami Inundation Maps.
26 The Shallow Water Equations
[Equation with labelled terms: vector of conserved variables, flux functions, bed slope source term, bed friction source term]
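The equation itself did not survive in this transcript. For orientation, the four labelled terms correspond to the shallow water equations in conservative form with bed slope and bed friction source terms, commonly written as below. Here h is water depth, hu and hv are the momenta, B is the bathymetry, and g is the gravitational constant. The friction term is shown in a Chézy-type form with coefficient C_z, which is an assumption, since the slide's exact friction model is not recoverable.

```latex
\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}_t
+ \begin{bmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \\ huv \end{bmatrix}_x
+ \begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2} g h^2 \end{bmatrix}_y
= \begin{bmatrix} 0 \\ -g h B_x \\ -g h B_y \end{bmatrix}
+ \begin{bmatrix} 0 \\ -g u \sqrt{u^2 + v^2} / C_z^2 \\ -g v \sqrt{u^2 + v^2} / C_z^2 \end{bmatrix}
```

Read left to right: the time derivative of the conserved variables, the flux functions in x and y, the bed slope source term, and the bed friction source term.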
27 The Shallow Water Equations
- A hyperbolic partial differential equation
  - Enables explicit schemes
- Solutions form discontinuities / shocks
  - Require high accuracy in smooth parts without oscillations near discontinuities
- Solutions include dry areas
  - Negative water depths ruin simulations
- Often high requirements to accuracy
  - Order of spatial/temporal discretization
  - Floating point rounding errors
- Can be difficult to capture "lake at rest"
[Figure: a standing wave or shock]
28 Finding the perfect numerical scheme
We want to find a numerical scheme that
- Works well for our target scenarios
  - Handles dry zones (land)
  - Handles shocks gracefully (without smearing or causing oscillations)
  - Preserves "lake at rest"
  - Has the accuracy for capturing the required physics
  - Preserves the physical quantities
- Fits GPUs well
  - Works well with single precision
  - Is embarrassingly parallel
  - Has a compact stencil
29 The Finite Volume Scheme of Choice*
- Scheme of choice: A. Kurganov and G. Petrova, A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System, Communications in Mathematical Sciences, 5 (2007)
- Second order accurate fluxes
- Total Variation Diminishing
- Well-balanced (captures lake-at-rest)
- Compact stencil (good, but not perfect, match with the GPU)
* With all possible disclaimers
30 Discretization
- Our grid consists of a set of cells or volumes
  - The bathymetry is a piecewise bilinear function
  - The physical variables (h, hu, hv) are piecewise constants per volume
- Physical quantities are transported across the cell interfaces
- Algorithm:
  1. Reconstruct physical variables
  2. Evolve the solution
  3. Average over grid cells
31 Kurganov-Petrova Spatial Discretization (Computing fluxes)
[Figure: continuous variables, discrete variables, reconstruction, dry states fix, slope evaluation, flux calculation]
32 Temporal Discretization (Evolving in time)
- Gather all known terms
- Use second order Runge-Kutta to solve the ODE
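A minimal sketch of a second-order Runge-Kutta step built from two forward Euler substeps, shown for a scalar ODE dQ/dt = R(Q). The strong-stability-preserving form used here and the names `rk2Step` and `R` are assumptions for illustration; the slide does not state which second-order variant is used.

```cpp
#include <cassert>
#include <functional>

// One step of a two-stage, second-order Runge-Kutta scheme:
//   Q*      = Q^n + dt * R(Q^n)                      (forward Euler substep)
//   Q^{n+1} = 0.5 * Q^n + 0.5 * (Q* + dt * R(Q*))    (second Euler substep, averaged)
double rk2Step(double q, double dt, const std::function<double(double)>& R) {
    assert(dt > 0.0);
    double qStar = q + dt * R(q);
    return 0.5 * q + 0.5 * (qStar + dt * R(qStar));
}
```

For R(Q) = -Q, Q = 1, and dt = 0.1, the substep gives Q* = 0.9 and the step returns 0.5 + 0.5 * (0.9 - 0.09) = 0.905, which agrees with the Taylor expansion 1 - dt + dt^2/2. In a simulator, R would be the sum of the flux differences and source terms, so each substep needs a fresh flux evaluation.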
33 Overview of a Full Simulation Cycle
1. Calculate fluxes
2. Calculate Dt
3. ODE halfstep
4. Calculate fluxes
5. Evolve in time
6. Apply boundary conditions
34 Implementation GPU code
Four CUDA kernels:
- 87%: Flux calculation
- <1%: Timestep size (CFL condition)
- 12%: Forward Euler step
- <1%: Set boundary conditions
35 Flux kernel: Domain decomposition
- A nine-point nonlinear stencil
  - Composed of simpler stencils
  - Heavy use of shared memory
  - Computationally demanding
- Traditional block decomposition
  - Overlapping ghost cells (a.k.a. apron)
  - Global ghost cells for boundary conditions
  - Domain padding
36 Flux kernel: Block size
- Block size is 16x14, from trying to optimize many parameters:
  - Warp size: multiple of 32
  - Shared memory use: 16 shared memory buffers use ~16 KB
  - Occupancy
    - Use 48 KB shared memory, 16 KB cache
    - Three resident blocks
    - Trades cache for occupancy
  - Fermi cache
  - Global memory access
37 Optimization of Flux Kernel
- The flux limiter
  - Limits the fluxes to obtain a non-oscillatory solution
  - Generalized minmod limiter: the least steep slope, or zero if the signs differ
  - Creates divergent code paths
  - Is executed a large number of times
- Use a branchless implementation (2007)
  - Requires a special sign function
  - Significantly faster than the if-test approach

    // sign(x) must return -1, 0, or +1; it is not a CUDA built-in and has to be supplied.
    __device__ float minmod(float a, float b, float c) {
        return 0.25f
            * sign(a)
            * (sign(a) + sign(b))
            * (sign(b) + sign(c))
            * fminf(fminf(fabsf(a), fabsf(b)), fabsf(c));
    }

(2007) T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF. Springer Verlag.
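The "special sign function" is not shown on the slide. One possible branch-free formulation, given here as a host-side C++ sketch, relies on comparisons evaluating to 0 or 1; the implementation of `sign` is an assumption, while `minmod` follows the slide's formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Branch-free sign: (x > 0) and (x < 0) each evaluate to 0 or 1,
// so the difference is -1, 0, or +1.
inline float sign(float x) {
    return static_cast<float>((x > 0.f) - (x < 0.f));
}

// Branchless three-argument minmod, following the formula on the slide.
// If all signs agree, the sign factors multiply to +/-4, which the 0.25
// normalizes; if any two adjacent signs differ, one factor becomes zero.
inline float minmod(float a, float b, float c) {
    assert(!std::isnan(a) && !std::isnan(b) && !std::isnan(c));
    return 0.25f
        * sign(a)
        * (sign(a) + sign(b))
        * (sign(b) + sign(c))
        * std::min(std::min(std::fabs(a), std::fabs(b)), std::fabs(c));
}
```

For instance, `minmod(1.f, 2.f, 3.f)` yields 1, `minmod(-3.f, -2.f, -4.f)` yields -2, and `minmod(1.f, -2.f, 3.f)` yields 0. Every thread executes the same instructions regardless of the input signs, so a warp never diverges here.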
38 Assessing performance
- Different ways of assessing performance:
  - Speedups can be dishonest
  - Numerical performance does not tell all: the number of iterations required, the size of the time step, and other algorithmic parameters are just as important
- Profile your code, and see what percentage of peak performance you attain
  - You should reach "near-peak" GFLOPS or GB/s, or explain why not
  - Gives an impression of scalability
- Our code reaches a high level of resource utilization
- Our code is significantly faster than the CPU
39 Accuracy and Error
- Garbage in, garbage out
- Simulations have many sources of error:
  - Humans!
  - Model and parameters
    - Friction coefficient estimation
    - "Magic" numerical parameters
    - Choice of boundary conditions
    - Numerical dissipation
    - Handling of wetting and drying
  - Measurement
    - Radar / Lidar / Stereoscopy
    - Low spatial resolution
    - Low vertical accuracy
  - Gridding
    - Can require expert knowledge
  - Computer precision
Recycle image from recyclereminders.com. Cray computer image from Wikipedia, user David.Monniaux.
40 Single Versus Double Precision
- Given erroneous data, double precision calculates a more accurate (but still wrong) answer
- Single precision benefits:
  - Uses half the storage space
  - Uses half the bandwidth
  - Executes (at least) twice as fast
41 Single Versus Double Precision Example
- Three different test cases:
  - Low water depth (wet-wet)
  - High water depth (wet-wet)
  - Synthetic terrain with dam break (wet-dry)
- Conclusions:
  - Loss in conservation is on the order of machine epsilon
  - Single precision gives larger error
  - Errors related to the wet-dry front are more than an order of magnitude larger (model error)
  - Single precision is sufficiently accurate for this scheme
42 More on Accuracy
- We were experiencing large errors in conservation of mass for special cases
- The equations are written in terms of w = B + h to preserve "lake at rest"
  - Large B and small h
  - The scale difference gives major floating point errors (h flushed to zero)
  - Even double precision is insufficient
- Solve by storing only h, and reconstructing w only when required!
  - Single precision is sufficient for most real-world cases
- Always store the quantity of interest!
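The "h flushed to zero" effect is easy to reproduce. The sketch below stores w = B + h in single precision and then recovers the depth as w - B; the function name `depthViaElevation` and the sample values are assumptions for illustration.

```cpp
#include <cassert>

// Depth recovered after storing the surface elevation w = B + h as a float.
// A float carries about 7 significant decimal digits, so near B = 1000 the
// spacing between representable values is roughly 6e-5.
float depthViaElevation(float B, float h) {
    assert(h >= 0.f);
    float w = B + h;   // rounding to the nearest representable float happens here
    return w - B;      // what is left of h after the round trip
}
```

With a bed elevation of B = 1000 and a depth of h = 1e-5, the sum rounds back to exactly 1000, so the recovered depth is 0 and the water has vanished. Storing h directly keeps its full relative precision, which is the point of the slide's advice.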
43 1D Validation: Flow over Triangular bump (90s)
[Figure: simulated versus measured values at gauges including G2, G4, G8, G10, G11, and G13]
44 2D Verification: Parabolic basin
- Analytical 2D parabolic basin (Thacker)
  - Planar water surface oscillates
  - 100 x 100 cells
  - Horizontal scale: 8 km
  - Vertical scale: 3.3 m
- Simulation and analytical solution match well
  - But, as with most schemes, errors grow along the wet-dry interface (model error)
45 2D Validation: Barrage du Malpasset
- We model the equations correctly, but can we model real events?
- South-east France near Fréjus: Barrage du Malpasset
  - Double curvature dam, 66.5 m high, 220 m crest length, 55 million m³
  - Bursts at 21:13 on December 2nd, 1959
  - Reaches the Mediterranean in 30 minutes (speeds up to 70 km/h)
  - 423 casualties, $68 million in damages
- Validate against experimental data from a 1:400 model
  - 1099 x 439 cells, 15 meter resolution
- Our results match experimental data very well
  - Discrepancies at gauges 14 and 9 are present in most (all?) published results
Image from Google Earth, mes-ballades.com
46 Bonus material: Achieving Even Higher Performance
47 Multi-GPU simulations
- Because we have a finite domain of dependence, we can create independent partitions of the domain and distribute them to multiple GPUs
- Modern PCs have up to four GPUs
- Near-perfect weak and strong scaling
Collaboration with Martin L. Sætra
48 Early exit optimization
- Observation: many dry areas do not require computation
  - Use a small buffer to store wet blocks
  - Exit the flux kernel if nearest neighbors are dry
- Up to 6x speedup (mileage may vary)
  - Blocks still have to be scheduled
  - Blocks read the auxiliary buffer
  - One wet cell marks the whole block as wet
49 Sparse domain optimization
- The early exit strategy launches too many blocks
- Dry blocks should not need to check that they are dry!
- Sparse Compute: do not perform any computations on dry parts of the domain
- Sparse Memory: do not save any values in the dry parts of the domain
Ph.D. work of Martin L. Sætra
50 Sparse domain optimization
1. Find all wet blocks
2. Grow to include dependencies
3. Sort block indices and launch the required number of blocks
- Similarly for memory, but it gets quite complicated
- 2x improvement over early exit (mileage may vary)!
- Comparison using an average of 26% wet cells
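The three steps can be sketched on the host as follows. This is an illustration under stated assumptions, not the implementation from the Ph.D. work: the function name `activeBlocks`, the row-major `wet` flag array, and growing by exactly one block in each direction are all hypothetical choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the sorted linear indices of all blocks that must be launched:
// every wet block plus its nearest neighbours (assumed one-block dependency).
std::vector<int> activeBlocks(const std::vector<bool>& wet, int nx, int ny) {
    assert(nx > 0 && ny > 0);
    assert(wet.size() == static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny));
    std::vector<bool> active(wet.size(), false);
    for (int j = 0; j < ny; ++j) {
        for (int i = 0; i < nx; ++i) {
            if (!wet[j * nx + i]) continue;          // step 1: find wet blocks
            for (int dj = -1; dj <= 1; ++dj) {       // step 2: grow to neighbours
                for (int di = -1; di <= 1; ++di) {
                    int ii = i + di, jj = j + dj;
                    if (ii >= 0 && ii < nx && jj >= 0 && jj < ny)
                        active[jj * nx + ii] = true;
                }
            }
        }
    }
    std::vector<int> indices;                        // step 3: compact; the scan
    for (int k = 0; k < nx * ny; ++k)                // order leaves them sorted
        if (active[k]) indices.push_back(k);
    return indices;
}
```

The kernel would then be launched with `indices.size()` blocks, each looking up its real block position in the list, so dry blocks are never scheduled at all.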
51 Video
53 Summary
- GPUs are powerful
  - 7x theoretical difference between CPU and GPU
  - Forces you to think about hardware (in a good way)
- GPUs have never been easier to program
  - Modern languages and toolkits help you get a flying start
  - Easy to achieve speed-ups
  - Expert knowledge still required to reach peak performance
- Shallow water simulations map very well to GPUs
  - Able to reach near-peak performance
  - Physical correctness can be ensured, even using single precision
  - Multi-GPU and sparse domain optimizations give even higher performance
54 Thank you for your attention
Talk material based on work on our simulator engine. Some references:
- A. R. Brodtkorb, M. L. Sætra, Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices, CMWR Proceedings, 2012
- A. R. Brodtkorb, M. L. Sætra, M. Altinakar, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, Computers & Fluids, 55, (2011)
- A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig, Simulation and Visualization of the Saint-Venant System using GPUs, Computing and Visualization in Science, 13(7), (2011)
Contact: André R. Brodtkorb
Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs. NVIDIA GPU Technology Conference San Jose, California, 2010 André R.
Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs NVIDIA GPU Technology Conference San Jose, California, 2010 André R. Brodtkorb Talk Outline Learn how to simulate a half an hour dam
More informationShallow Water Simulations on Graphics Hardware
Shallow Water Simulations on Graphics Hardware Ph.D. Thesis Presentation 2014-06-27 Martin Lilleeng Sætra Outline Introduction Parallel Computing and the GPU Simulating Shallow Water Flow Topics of Thesis
More informationEXPLICIT SHALLOW WATER SIMULATIONS ON GPUS: GUIDELINES AND BEST PRACTICES
XIX International Conference on Water Resources CMWR University of Illinois at Urbana-Champaign June 7-, EXPLICIT SHALLOW WATER SIMULATIONS ON GPUS: GUIDELINES AND BEST PRACTICES André R. Brodtkorb, Martin
More informationLoad-balancing multi-gpu shallow water simulations on small clusters
Load-balancing multi-gpu shallow water simulations on small clusters Gorm Skevik master thesis autumn 2014 Load-balancing multi-gpu shallow water simulations on small clusters Gorm Skevik 1st August 2014
More informationEfficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation
1 Revised personal version of final journal article : Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation André R. Brodtkorb a,, Martin L. Sætra b,
More informationEfficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation
Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation André R. Brodtkorb a,, Martin L. Sætra b, Mustafa Altinakar c a SINTEF ICT, Department of Applied
More informationState-of-the-art in Heterogeneous Computing
State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous
More informationThis is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs
SIMULATION AND VISUALIZATION OF THE SAINT-VENANT SYSTEM USING GPUS ANDRÉ R. BRODTKORB, TROND R. HAGEN, KNUT-ANDREAS LIE, AND JOSTEIN R. NATVIG This is a draft of the paper entitled Simulation and Visualization
More informationThis is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs
SIMULATION AND VISUALIZATION OF THE SAINT-VENANT SYSTEM USING GPUS ANDRÉ R. BRODTKORB, TROND R. HAGEN, KNUT-ANDREAS LIE, AND JOSTEIN R. NATVIG This is a draft of the paper entitled Simulation and Visualization
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationThis is a draft. The full paper can be found in Journal of Scientific Computing xx(x):xx xx:
EFFICIENT GPU-IMPLEMENTATION OF ADAPTIVE MESH REFINEMENT FOR THE SHALLOW-WATER EQUATIONS MARTIN L. SÆTRA 1,2, ANDRÉ R. BRODTKORB 3, AND KNUT-ANDREAS LIE 1,3 This is a draft. The full paper can be found
More informationAuto-tuning Shallow water simulations on GPUs
Auto-tuning Shallow water simulations on GPUs André B. Amundsen Master s Thesis Spring 2014 Auto-tuning Shallow water simulations on GPUs André B. Amundsen 15th May 2014 ii Abstract Graphic processing
More informationShort introduction to GPU and Heterogeneous Computing
Short introduction to GPU and Heterogeneous Computing University of Málaga, 2016-04-11 André R. Brodtkorb, SINTEF, Norway Technology for a better society 1 Established 1950 by the Norwegian Institute of
More informationPartial Differential Equations
Simulation in Computer Graphics Partial Differential Equations Matthias Teschner Computer Science Department University of Freiburg Motivation various dynamic effects and physical processes are described
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationParallel Adaptive Tsunami Modelling with Triangular Discontinuous Galerkin Schemes
Parallel Adaptive Tsunami Modelling with Triangular Discontinuous Galerkin Schemes Stefan Vater 1 Kaveh Rahnema 2 Jörn Behrens 1 Michael Bader 2 1 Universität Hamburg 2014 PDES Workshop 2 TU München Partial
More informationTutorial: GPU and Heterogeneous Computing in Discrete Optimization
Tutorial: GPU and Heterogeneous Computing in Discrete Optimization Established 1950 by the Norwegian Institute of Technology. The largest independent research organisation in Scandinavia. A non-profit
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationMid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr.
Mid-Year Report Discontinuous Galerkin Euler Equation Solver Friday, December 14, 2012 Andrey Andreyev Advisor: Dr. James Baeder Abstract: The focus of this effort is to produce a two dimensional inviscid,
More informationComputational Fluid Dynamics (CFD) using Graphics Processing Units
Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores
More informationTechnology for a better society. SINTEF ICT, Applied Mathematics, Heterogeneous Computing Group
Technology for a better society SINTEF, Applied Mathematics, Heterogeneous Computing Group Trond Hagen GPU Computing Seminar, SINTEF Oslo, October 23, 2009 1 Agenda 12:30 Introduction and welcoming Trond
More informationNIA CFD Seminar, October 4, 2011 Hyperbolic Seminar, NASA Langley, October 17, 2011
NIA CFD Seminar, October 4, 2011 Hyperbolic Seminar, NASA Langley, October 17, 2011 First-Order Hyperbolic System Method If you have a CFD book for hyperbolic problems, you have a CFD book for all problems.
More informationComputational Fluid Dynamics using OpenCL a Practical Introduction
19th International Congress on Modelling and Simulation, Perth, Australia, 12 16 December 2011 http://mssanz.org.au/modsim2011 Computational Fluid Dynamics using OpenCL a Practical Introduction T Bednarz
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationCUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata
CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids
More informationA GPU Implementation for Two-Dimensional Shallow Water Modeling arxiv: v1 [cs.dc] 5 Sep 2013
A GPU Implementation for Two-Dimensional Shallow Water Modeling arxiv:1309.1230v1 [cs.dc] 5 Sep 2013 Kerry A. Seitz, Jr. 1, Alex Kennedy 1, Owen Ransom 2, Bassam A. Younis 2, and John D. Owens 3 1 Department
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationAccelerating CFD with Graphics Hardware
Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery
More informationCS205b/CME306. Lecture 9
CS205b/CME306 Lecture 9 1 Convection Supplementary Reading: Osher and Fedkiw, Sections 3.3 and 3.5; Leveque, Sections 6.7, 8.3, 10.2, 10.4. For a reference on Newton polynomial interpolation via divided
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationSimulation of one-layer shallow water systems on multicore and CUDA architectures
Noname manuscript No. (will be inserted by the editor) Simulation of one-layer shallow water systems on multicore and CUDA architectures Marc de la Asunción José M. Mantas Manuel J. Castro Received: date
More informationMET report. One-Layer Shallow Water Models on the GPU
MET report no. 27/2013 Oceanography One-Layer Shallow Water Models on the GPU André R. Brodtkorb 1, Trond R. Hagen 2, Lars Petter Røed 3 1 SINTEF IKT, Avd. for Anvendt Matematikk 2 SINTEF IKT, Avd. for
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationINTERNATIONAL JOURNAL OF CIVIL AND STRUCTURAL ENGINEERING Volume 2, No 3, 2012
INTERNATIONAL JOURNAL OF CIVIL AND STRUCTURAL ENGINEERING Volume 2, No 3, 2012 Copyright 2010 All rights reserved Integrated Publishing services Research article ISSN 0976 4399 Efficiency and performances
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016
ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2
More informationThe Shallow Water Equations and CUDA
The Shallow Water Equations and CUDA Alexander Pöppl December 9 th 2015 Tutorial: High Performance Computing - Algorithms and Applications, December 9 th 2015 1 Last Tutorial Discretized Heat Equation
More informationThe Shallow Water Equations and CUDA
The Shallow Water Equations and CUDA Oliver Meister December 17 th 2014 Tutorial Parallel Programming and High Performance Computing, December 17 th 2014 1 Last Tutorial Discretized Heat Equation System
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationSENSEI / SENSEI-Lite / SENEI-LDC Updates
SENSEI / SENSEI-Lite / SENEI-LDC Updates Chris Roy and Brent Pickering Aerospace and Ocean Engineering Dept. Virginia Tech July 23, 2014 Collaborations with Math Collaboration on the implicit SENSEI-LDC
More informationFinal Report. Discontinuous Galerkin Compressible Euler Equation Solver. May 14, Andrey Andreyev. Adviser: Dr. James Baeder
Final Report Discontinuous Galerkin Compressible Euler Equation Solver May 14, 2013 Andrey Andreyev Adviser: Dr. James Baeder Abstract: In this work a Discontinuous Galerkin Method is developed for compressible
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models. C. Aberle, A. Hakim, and U. Shumlak
Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models C. Aberle, A. Hakim, and U. Shumlak Aerospace and Astronautics University of Washington, Seattle American Physical Society
Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
GPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
Two-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
CUDA/OpenGL Fluid Simulation. Nolan Goodnight
CUDA/OpenGL Fluid Simulation Nolan Goodnight ngoodnight@nvidia.com Document Change History Version Date Responsible Reason for Change 0.1 2/22/07 Nolan Goodnight Initial draft 1.0 4/02/07 Nolan Goodnight
Complexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
Numerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
Computational Fluid Dynamics - Prof. V. Esfahanian
Three broad categories: Experimental, Theoretical, Computational. Crucial to know all three: each has its advantages and disadvantages. Require validation and verification. School of Mechanical Engineering
Profiling & Tuning Applications. CUDA Course István Reguly
Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs
Algorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
GPU Implementation of Implicit Runge-Kutta Methods
GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011
PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.
Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences
Introduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
A Toolbox of Level Set Methods
A Toolbox of Level Set Methods Ian Mitchell Department of Computer Science University of British Columbia http://www.cs.ubc.ca/~mitchell mitchell@cs.ubc.ca research supported by the Natural Science and
Homework 4A Due November 7th IN CLASS
CS207, Fall 2014 Systems Development for Computational Science Cris Cecka, Ray Jones Homework 4A Due November 7th IN CLASS Previously, we've developed a quite robust Graph class to let us use Node and
Simulating Shallow Water on GPUs Programming of Heterogeneous Systems in Physics
Simulating Shallow Water on GPUs Programming of Heterogeneous Systems in Physics Martin Pfeiffer (m.pfeiffer@uni-jena.de) Friedrich Schiller University Jena 06.10.2011 Simulating Shallow Water on GPUs
The Shallow Water Equations and CUDA
The Shallow Water Equations and CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing January 11 th 2017 Last Tutorial Discretized Heat Equation
A MATLAB Interface to the GPU
A MATLAB Interface to the GPU Second Winter School Geilo, Norway André Rigland Brodtkorb SINTEF ICT Department of Applied Mathematics 2007-01-24 Outline 1 Motivation and previous
Accelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
Dense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
CS/EE 217 Midterm. Question Possible Points Points Scored Total 100
CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor
Numerical Methods for (Time-Dependent) HJ PDEs
Numerical Methods for (Time-Dependent) HJ PDEs Ian Mitchell Department of Computer Science The University of British Columbia research supported by National Science and Engineering Research Council of
Parallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far: We've been talking about algorithms. We've been talking about ways to optimize their parameters. But we haven't talked about the underlying hardware. How does
Report of Linear Solver Implementation on GPU
Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,
Trends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
Adaptive Mesh Astrophysical Fluid Simulations on GPU. San Jose 10/2/2009 Peng Wang, NVIDIA
Adaptive Mesh Astrophysical Fluid Simulations on GPU San Jose 10/2/2009 Peng Wang, NVIDIA Overview Astrophysical motivation & the Enzo code Finite volume method and adaptive mesh refinement (AMR) CUDA
Debojyoti Ghosh. Adviser: Dr. James Baeder Alfred Gessow Rotorcraft Center Department of Aerospace Engineering
Debojyoti Ghosh Adviser: Dr. James Baeder Alfred Gessow Rotorcraft Center Department of Aerospace Engineering To study the Dynamic Stalling of rotor blade cross-sections Unsteady Aerodynamics: Time varying
Unrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
High Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
Lax-Wendroff and McCormack Schemes for Numerical Simulation of Unsteady Gradually and Rapidly Varied Open Channel Flow
Archives of Hydro-Engineering and Environmental Mechanics Vol. 60 (2013), No. 1-4, pp. 51-62 DOI: 10.2478/heem-2013-0008 IBW PAN, ISSN 1231-3726 Lax-Wendroff and McCormack Schemes for Numerical Simulation
Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing
Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing Natasha Flyer National Center for Atmospheric Research Boulder, CO Meshes vs. Mesh-free
Optical Flow Estimation with CUDA. Mikhail Smirnov
Optical Flow Estimation with CUDA Mikhail Smirnov msmirnov@nvidia.com Document Change History Version Date Responsible Reason for Change Mikhail Smirnov Initial release Abstract Optical flow is the apparent
Simulation in Computer Graphics. Particles. Matthias Teschner. Computer Science Department University of Freiburg
Simulation in Computer Graphics Particles Matthias Teschner Computer Science Department University of Freiburg Outline introduction particle motion finite differences system of first order ODEs second
CS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
Performance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
S7260: Microswimmers on Speed: Simulating Spheroidal Squirmers on GPUs
S7260: Microswimmers on Speed: Simulating Spheroidal Squirmers on GPUs Elmar Westphal - Forschungszentrum Jülich GmbH Spheroids Spheroid: A volume formed by rotating an ellipse around one of its axes Two
Center for Computational Science
Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,
Software and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
CS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization: Let's make a tour of the course website. Main pages: Home, front page. Syllabus.
The Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
Tesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
1.2 Numerical Solutions of Flow Problems
1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian
CGT 581 G Fluids. Overview. Some terms. Some terms
CGT 581 G Fluids Bedřich Beneš, Ph.D. Purdue University Department of Computer Graphics Technology Overview Some terms Incompressible Navier-Stokes Boundary conditions Lagrange vs. Euler Eulerian approaches
PROGRAMACIÓN GRÁFICA DE ALTAS PRESTACIONES INTRODUCTION TO GPUS. André R. Brodtkorb
PROGRAMACIÓN GRÁFICA DE ALTAS PRESTACIONES INTRODUCTION TO GPUS André R. Brodtkorb Programación Gráfica de Altas Prestaciones Short course on High-performance simulation with high-level languages Part
Towards Exascale Computing with the Atmospheric Model NUMA
Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey
Fundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline: Fermi Architecture; Kernel optimizations; Launch configuration; Global memory throughput; Shared memory access; Instruction throughput / control