Efficient Simulations of the Shallow Water Equations on GPUs

1 Efficient Simulations of the Shallow Water Equations on GPUs
André R. Brodtkorb, Ph.D., Research Scientist, SINTEF ICT, Department of Applied Mathematics, Norway
Desafios del Modelado de Tsunamis y la Evaluación de Riesgo, Universidad Tecnica Federico Santa Maria, Valparaíso, Chile

2 Brief Outline
- Introduction
- GPU Computing
- Programming GPUs for Water Resources
- Efficient Simulation of the Shallow Water Equations on GPUs
- Summary

3 Development of the Microprocessor
- 1942: Digital Electric Computer (Atanasoff and Berry)
- 1947: Transistor (Shockley, Bardeen, and Brattain)
- Integrated Circuit (Kilby)
- Microprocessor (Hoff, Faggin, Mazor)
- More transistors (Moore, 1965)

4 Development of the Microprocessor (Moore's law)
- 1971: 4004, 2300 transistors, 740 kHz
- 1982: 80286, 134 thousand transistors, 8 MHz
- 1993: Pentium P5, 1.18 million transistors, 66 MHz
- 2000: Pentium 4, 42 million transistors, 1.5 GHz
- 2010: Nehalem, 2.3 billion transistors, 2.66 GHz

5 The end of frequency scaling (2004)
- The power density of microprocessors is proportional to the clock frequency cubed [1]
- Successive periods: 29% increase in frequency; frequency constant; 25% increase in parallelism
- Parallelism technologies:
  - Multi-core (8x)
  - Hyper-threading (2x)
  - AVX/SSE/MMX/etc. (8x)
- A serial program uses <2% of available resources!
[1] Asanovic et al., A View From Berkeley

6 Overcoming the Power Wall
[Figure: performance, power, and frequency of a single-core versus a dual-core processor; bar labels 100%, 100%, 100%, 100%, 85%, 170%]
- By lowering the frequency, the power consumption drops dramatically
- By using multiple cores, we can get higher performance with the same power budget!

7 Massive Parallelism: The Graphics Processing Unit
                      CPU       GPU
Cores                 4         16
Float ops / clock
Frequency (MHz)
GigaFLOPS
Power consumption     ~130 W    ~250 W
Memory (GiB)
[Figure: performance and memory bandwidth, CPU versus GPU]

8 Early Programming of GPUs
- GPUs were first programmed using OpenGL and other graphics languages
- Mathematics was written as operations on graphical primitives
- Extremely cumbersome and error prone
[Figure: element-wise matrix multiplication and matrix multiplication expressed as geometry, with Input A and Input B producing Output]
[1] Fast matrix multiplies using graphics hardware, Larsen and McAllister

9 Examples of Early GPU Research at SINTEF
- Preparation for FEM (~5x)
- Self-intersection (~10x)
- Registration of medical data (~20x)
- Fluid dynamics and FSI (Navier-Stokes)
- Inpainting (~400x MATLAB code)
- Euler Equations (~25x)
- SW Equations (~25x)
- Marine acoustics (~20x)
- MATLAB Interface
- Linear algebra
- Water injection in a fluvial reservoir (20x)

10 Today's GPU Programming Languages
- Languages and APIs: OpenCL, DirectX, DirectCompute, BrookGPU, AMD Brook+, OpenACC, C++ AMP, AMD CTM / CAL, PGI Accelerator, NVIDIA CUDA
- Categories: graphics APIs, "academic" abstractions, C- and pragma-based languages

11 Examples of GPU Use Today
- Thousands of academic papers
- Big investment by large software companies
- Growing use in supercomputers
[Figure: GPU Supercomputers on the Top 500 List; vertical axis from 0% to 14%, horizontal axis from Aug. 2007 onward]

12 Programming GPUs
- For efficient use of CPUs you need to know a lot about the hardware constraints:
  - Threading, hyper-threading, etc.
  - NUMA memory, memory alignment, etc.
  - SSE/AVX instructions
  - Cache size, cache prefetching, etc.
  - Instruction latencies
- For GPUs, it is exactly the same, but it is a "simpler" architecture:
  - Less "magic" hardware to help you means it's easier to reach peak performance
  - Less "magic" hardware means you need to consider the hardware for all programs

13 GPU Execution Model
- Grid (3x2 blocks), block (8x8 threads)
- The same program is launched for all threads "in parallel"
- The thread identifiers are used to calculate each thread's global position
- The thread position is used to load and store data, and execute code
- The parallel execution means that synchronization can be very expensive
- Example: the thread in position (21, 11) has threadIdx.x = 5, threadIdx.y = 3, blockIdx.x = 2, blockIdx.y = 1
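The position arithmetic behind the example thread can be sketched in plain host-side C++. In a CUDA kernel the three arguments would be `blockIdx.x`, `blockDim.x`, and `threadIdx.x` (or their `.y` counterparts); the function names and the `pitch` parameter are illustrative assumptions, not part of the talk.

```cpp
// Global coordinate of a thread along one axis: the offset of its block
// plus the thread's offset inside that block.
int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Row-major offset into a 2D array with `pitch` elements per row
// (`pitch` is an assumed name for the row length).
int linearOffset(int x, int y, int pitch) {
    return y * pitch + x;
}
```

With 8x8 blocks, `globalIndex(2, 8, 5)` gives 21 and `globalIndex(1, 8, 3)` gives 11, matching the thread position (21, 11) on the slide.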

14 GPU Execution Model
- CPU scalar op: 1 thread, 1 operand on 1 data element
- CPU SSE/AVX op: 1 thread, 1 operand on 2-8 data elements
- GPU warp op: 1 warp = 32 threads, 32 operands on 32 data elements
  - Exposed as individual threads
  - Actually runs the same instruction
  - Divergence implies serialization and masking

15 Warp Serialization and Masking
- Hardware serializes and masks divergent code flow:
  - The programmer is relieved of fiddling with element masks (which is necessary for SSE)
  - Execution time is still the sum of all branches taken
  - Worst case: 1/32 performance
- Important to minimize divergent code flow!
  - Move conditionals into data; use min, max, and conditional moves
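As a small illustration of moving a conditional into data, the sketch below writes the same clamp operation twice: once with if-tests and once with min/max only. It is host-side C++ with hypothetical function names; whether the second form is actually faster depends on the compiler and hardware, since simple branches are often converted to conditional moves automatically.

```cpp
#include <algorithm>

// Branching version: threads whose values fall on different sides of the
// limits follow different code paths.
float clampBranching(float v, float lo, float hi) {
    if (v < lo) { return lo; }
    if (v > hi) { return hi; }
    return v;
}

// Branch-free formulation: the condition is folded into min/max, so every
// thread executes the same instruction sequence.
float clampBranchless(float v, float lo, float hi) {
    return std::min(std::max(v, lo), hi);
}
```

Both functions return the same result whenever `lo <= hi`.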

16 Example: Warp Serialization in Newton's Method
- First if-statement
  - Masks out superfluous threads
  - Not significant
- Iteration loop
  - Identical for all threads
- Early exit
  - Possible divergence
  - Only beneficial when all threads in a warp can exit
  - Removing the early exit reduces the run time from 0.84 ms to 0.69 ms (kernel only)

    // Maximum number of Newton iterations. Assumed value: the slide uses
    // maxit without showing its definition.
    constexpr int maxit = 50;

    // Finds a root of a*x^2 + b*x + c for N independent coefficient sets.
    __global__ void newton(float* x, const float* a, const float* b, const float* c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) { // mask out superfluous threads
            const float la = a[i];
            const float lb = b[i];
            const float lc = c[i];
            float lx = 0.f;
            for (int it = 0; it < maxit; it++) {
                float f = la*lx*lx + lb*lx + lc;
                if (fabsf(f) < 1e-7f) { break; } // early exit: possible divergence
                float df = 2.f*la*lx + lb;
                lx = lx - f/df;
            }
            x[i] = lx;
        }
    }

(But fails 7 of times since multiple zeros aren't handled properly, but that is a different story)

17 Algorithm Design Example: Solving the Heat Equation
- The heat equation describes diffusive heat conduction in a medium
- Prototypical partial differential equation
- u is the temperature, kappa is the diffusion coefficient, t is time, and x is space
- We want to design an algorithm that suits the GPU execution model
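For reference, with the symbols named above and assuming one space dimension and a constant diffusion coefficient, the heat equation is commonly written as:

```latex
\frac{\partial u}{\partial t} = \kappa \, \frac{\partial^2 u}{\partial x^2}
```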

18 Finding a solution to the heat equation
- Solving such partial differential equations analytically is nontrivial in all but a few very special cases
- Solution strategy: replace the continuous derivatives with approximations at a set of grid points
- Solve for each grid point numerically on a computer
- "Use many grid points, and high order of approximation to get good results"

19 The Heat Equation with an implicit scheme
1. We can construct an implicit scheme by carefully choosing the "correct" approximation of derivatives
2. This ends up in a system of linear equations
3. Solve Ax=b using standard GPU methods to evolve the solution in time
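One possible instance of such a scheme is sketched below, assuming backward Euler in time and central differences in space on a uniform grid; the slide does not say which approximation is used. Here $u_i^n$ is the value at grid point $i$ and time level $n$, and $r$ is an abbreviation introduced only for this sketch.

```latex
\frac{u_i^{n+1} - u_i^n}{\Delta t}
  = \kappa \, \frac{u_{i-1}^{n+1} - 2 u_i^{n+1} + u_{i+1}^{n+1}}{\Delta x^2}
\quad\Longrightarrow\quad
-r \, u_{i-1}^{n+1} + (1 + 2r) \, u_i^{n+1} - r \, u_{i+1}^{n+1} = u_i^n,
\qquad r = \frac{\kappa \, \Delta t}{\Delta x^2}
```

Collecting one such equation per grid point gives a tridiagonal system Ax=b, with x holding the unknown values at time level n+1 and b the known values at time level n.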

20 The Heat Equation with an implicit scheme
- Such implicit schemes are often sought after:
  - They allow for large time steps
  - They can be solved using standard tools
  - They allow complex geometries
  - They can be very accurate
- However:
  - Linear algebra solvers can be slow and memory hungry, especially on the GPU
  - Many sparse solvers are inherently serial and unsuited for the GPU
  - For many time-varying phenomena, we are also interested in the temporal dynamics of the problem

21 Algorithmic and numerical performance
- Total performance is the product of algorithmic and numerical performance
  - Your mileage may vary: algorithmic performance is highly problem dependent
- Sparse linear algebra solvers have low numerical performance
  - Only able to utilize a fraction of the capabilities of CPUs, and worse on GPUs
- Explicit schemes with compact stencils can give near-peak numerical performance
  - May give the overall highest performance
[Figure: solvers placed on axes of numerical performance and algorithmic performance: explicit stencils, tridiag, PLU, QR, red-black, multigrid, Krylov]

22 Explicit schemes with compact stencils
- Explicit schemes can give rise to compact stencils
- Embarrassingly parallel
- Perfect for the GPU!
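A minimal sketch of such a scheme for the 1D heat equation, assuming forward Euler in time and central differences in space (the slide does not name a specific discretization). It is written as host-side C++; the function name and the parameter `r` = kappa * dt / dx^2 are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// One explicit time step of the 1D heat equation.
// r = kappa * dt / (dx * dx); this discretization is stable for r <= 0.5.
// The two boundary values are held fixed.
std::vector<float> heatStepExplicit(const std::vector<float>& u, float r) {
    std::vector<float> next(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        // Each cell reads only itself and its two neighbours from the old
        // time level, so every iteration is independent of the others and
        // could be executed by its own GPU thread.
        next[i] = u[i] + r * (u[i - 1] - 2.0f * u[i] + u[i + 1]);
    }
    return next;
}
```

The three-point read pattern is the compact stencil, and the absence of dependencies between loop iterations is what makes the update embarrassingly parallel.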

23 The Shallow Water Equations
- A hyperbolic partial differential equation
  - First described by de Saint-Venant
  - Conservation of mass and momentum
- Gravity waves in 2D free surface
  - Gravity-induced fluid motion
  - Governing flow is horizontal
- Not only used to describe physics of water:
  - Simplification of atmospheric flow
  - Avalanches
  - ...
Water image: Ian Britton

24 Target Application Areas
- Tsunamis: 2011 Japan (5321+); 2004 Indian Ocean
- Floods: 2010 Pakistan (2000+); 1931 China floods
- Storm surges: 2005 Hurricane Katrina (1836); 1530 Netherlands
- Dam breaks: 1975 Banqiao Dam; 1959 Malpasset (423)
Images from wikipedia.org

25 Using GPUs for Shallow Water Simulations
- In preparation for events: evaluate possible scenarios
  - Simulation of many ensemble members
  - Creation of inundation maps and emergency action plans
- In response to ongoing events:
  - Simulate possible scenarios in real time
  - Simulate strategies for action (deployment of barriers, evacuation of affected areas, etc.)
- High performance requirements => Use the GPU
Images: simulation result from NOAA; inundation map from Los Angeles County Tsunami Inundation Maps

26 The Shallow Water Equations
[Equation with labeled terms: vector of conserved variables, flux functions, bed slope source term, bed friction source term]
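A common conservative form of the system with these four labeled terms is given below as a hedged reconstruction. It assumes water depth $h$, discharges $hu$ and $hv$, bathymetry $B$, and gravitational acceleration $g$. The friction term is written with a Chézy-type coefficient $C_z$, which is an assumption; other friction laws, such as Manning's, are also in common use.

```latex
\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}_t
+ \begin{bmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \\ huv \end{bmatrix}_x
+ \begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2} g h^2 \end{bmatrix}_y
= \begin{bmatrix} 0 \\ -g h B_x \\ -g h B_y \end{bmatrix}
+ \begin{bmatrix} 0 \\ -g u \sqrt{u^2 + v^2} / C_z^2 \\ -g v \sqrt{u^2 + v^2} / C_z^2 \end{bmatrix}
```

From left to right: the time derivative of the vector of conserved variables, the flux functions in x and y, the bed slope source term, and the bed friction source term.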

27 The Shallow Water Equations
- A hyperbolic partial differential equation
  - Enables explicit schemes
- Solutions form discontinuities / shocks
  - Require high accuracy in smooth parts without oscillations near discontinuities
- Solutions include dry areas
  - Negative water depths ruin simulations
- Often high requirements to accuracy
  - Order of spatial/temporal discretization
  - Floating point rounding errors
- Can be difficult to capture "lake at rest"
[Figure: a standing wave or shock]

28 Finding the perfect numerical scheme
We want to find a numerical scheme that:
- Works well for our target scenarios
  - Handles dry zones (land)
  - Handles shocks gracefully (without smearing or causing oscillations)
  - Preserves "lake at rest"
  - Has the accuracy for capturing the required physics
  - Preserves the physical quantities
- Fits GPUs well
  - Works well with single precision
  - Is embarrassingly parallel
  - Has a compact stencil

29 The Finite Volume Scheme of Choice*
- Scheme of choice: A. Kurganov and G. Petrova, A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System, Communications in Mathematical Sciences, 5 (2007)
- Second order accurate fluxes
- Total Variation Diminishing
- Well-balanced (captures lake-at-rest)
- Compact stencil
- (Good, but not perfect, match with the GPU)
* With all possible disclaimers

30 Discretization
- Our grid consists of a set of cells or volumes
  - The bathymetry is a piecewise bilinear function
  - The physical variables (h, hu, hv) are piecewise constants per volume
- Physical quantities are transported across the cell interfaces
- Algorithm:
  1. Reconstruct physical variables
  2. Evolve the solution
  3. Average over grid cells

31 Kurganov-Petrova Spatial Discretization (Computing fluxes)
[Figure: stages of the flux computation: continuous variables, discrete variables, reconstruction, dry states fix, slope evaluation, flux calculation]

32 Temporal Discretization (Evolving in time)
- Gather all known terms
- Use second order Runge-Kutta to solve the ODE

33 Overview of a Full Simulation Cycle
1. Calculate fluxes
2. Calculate Dt
3. ODE Halfstep
4. Calculate fluxes
5. Evolve in time
6. Apply boundary conditions

34 Implementation: GPU code
Four CUDA kernels:
- 87%: Flux calculation
- <1%: Timestep size (CFL condition)
- 12%: Forward Euler step
- <1%: Set boundary conditions
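The time step kernel can be illustrated with a one-dimensional host-side sketch of the CFL condition: the step is limited by the cell size divided by the fastest wave speed in the domain. The function name, the 1D simplification, and passing `g` and the CFL number `cfl` as parameters are assumptions for illustration; the actual kernel is not shown in the talk.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest time step allowed by dt = cfl * dx / max_i(|u_i| + sqrt(g * h_i)),
// where u_i = hu_i / h_i. Dry cells (h <= 0) are skipped.
double cflTimestep(const std::vector<double>& h, const std::vector<double>& hu,
                   double dx, double g, double cfl) {
    double maxSpeed = 0.0;
    for (std::size_t i = 0; i < h.size(); ++i) {
        if (h[i] <= 0.0) { continue; }  // dry cells carry no waves
        const double u = hu[i] / h[i];
        maxSpeed = std::max(maxSpeed, std::fabs(u) + std::sqrt(g * h[i]));
    }
    if (maxSpeed == 0.0) {
        // Nothing moves anywhere: no restriction on the time step.
        return std::numeric_limits<double>::infinity();
    }
    return cfl * dx / maxSpeed;
}
```

On a GPU the loop becomes a maximum reduction over all cells, which is cheap compared to the flux computation and is consistent with the small share listed for this kernel.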

35 Flux kernel: Domain decomposition
- A nine-point nonlinear stencil
  - Comprised of simpler stencils
  - Heavy use of shared memory
  - Computationally demanding
- Traditional block decomposition
  - Overlapping ghost cells (a.k.a. apron)
  - Global ghost cells for boundary conditions
  - Domain padding

36 Flux kernel: Block size
- Block size is 16x14, from trying to optimize many parameters:
  - Warp size: multiple of 32
  - Shared memory use: 16 shmem buffers use ~16 KB
  - Occupancy
    - Use 48 KB shared mem, 16 KB cache
    - Three resident blocks
    - Trades cache for occupancy
  - Fermi cache
  - Global memory access

37 Optimization of Flux Kernel: The Flux Limiter
- Limits the fluxes to obtain a non-oscillatory solution
  - Generalized minmod limiter
  - Least steep slope, or
  - Zero if signs differ
- Creates divergent code paths
- Is executed a large number of times
- Use branchless implementation (2007)
  - Requires special sign function
  - Significantly faster than if-test approach

    // sign(x) is the special sign function mentioned above, returning
    // -1, 0, or +1; its definition is not shown on the slide.
    float minmod(float a, float b, float c) {
        return 0.25f
            * sign(a)
            * (sign(a) + sign(b))
            * (sign(b) + sign(c))
            * fminf( fminf(fabsf(a), fabsf(b)), fabsf(c) );
    }

(2007) T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF. Springer Verlag.
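To make the limiter self-contained, the sketch below adds one possible branch-free sign function and an if-test version for comparison. It is host-side C++; the `sign` implementation and the name `minmodBranching` are assumptions, since the talk does not show how the special sign function is written.

```cpp
#include <algorithm>
#include <cmath>

// Branch-free sign: each comparison evaluates to 0 or 1, so the
// difference is -1, 0, or +1.
inline float sign(float x) {
    return static_cast<float>((x > 0.0f) - (x < 0.0f));
}

// Branch-free minmod of three slopes, using the same expression as the slide.
// The sign factors multiply to +/-4 when all signs agree and to 0 otherwise.
inline float minmod(float a, float b, float c) {
    return 0.25f * sign(a) * (sign(a) + sign(b)) * (sign(b) + sign(c))
         * std::min(std::min(std::fabs(a), std::fabs(b)), std::fabs(c));
}

// If-test reference: the least steep slope when all signs agree, else zero.
inline float minmodBranching(float a, float b, float c) {
    if (a > 0.0f && b > 0.0f && c > 0.0f) { return std::min(std::min(a, b), c); }
    if (a < 0.0f && b < 0.0f && c < 0.0f) { return std::max(std::max(a, b), c); }
    return 0.0f;
}
```

For slopes (1, 2, 3) both versions return 1, for (-3, -1, -2) both return -1, and for (1, -2, 3) both return zero.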

38 Assessing performance
- Different ways of assessing performance
  - Speedups can be dishonest
  - Numerical performance does not tell all
    - Number of iterations required, size of time step, and other algorithmic parameters are just as important
- Profile your code, and see what percentage of peak performance you attain
  - You should reach "near-peak" GFLOPS or GB/s, or explain why not
  - Gives an impression of scalability
- Our code reaches a high level of resource utilization
- Our code is significantly faster than the CPU

39 Accuracy and Error
- Garbage in, garbage out
- Simulations have many sources for errors:
  - Humans!
  - Model and parameters
    - Friction coefficient estimation
    - "Magic" numerical parameters
    - Choice of boundary conditions
  - Numerical dissipation
  - Handling of wetting and drying
  - Measurement
    - Radar / Lidar / Stereoscopy
    - Low spatial resolution
    - Low vertical accuracy
  - Gridding
    - Can require expert knowledge
  - Computer precision
Recycle image from recyclereminders.com; Cray computer image from Wikipedia, user David.Monniaux

40 Single Versus Double Precision
- Given erroneous data, double precision calculates a more accurate (but still wrong) answer
- Single precision benefits:
  - Uses half the storage space
  - Uses half the bandwidth
  - Executes (at least) twice as fast

41 Single Versus Double Precision Example
- Three different test cases:
  - Low water depth (wet-wet)
  - High water depth (wet-wet)
  - Synthetic terrain with dam break (wet-dry)
- Conclusions:
  - Loss in conservation on the order of machine epsilon
  - Single precision gives larger error
  - Errors related to the wet-dry front are more than an order of magnitude larger (model error)
  - Single precision is sufficiently accurate for this scheme

42 More on Accuracy
- We were experiencing large errors in conservation of mass for special cases
- The equations are written in terms of w = B+h to preserve "lake at rest"
  - Large B, and small h
  - The scale difference gives major floating point errors (h flushed to zero)
  - Even double precision is insufficient
- Solve by storing only h, and reconstructing w only when required!
  - Single precision sufficient for most real-world cases
- Always store the quantity of interest!
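The loss of a small h next to a large B can be reproduced in a few lines. The sketch below stores w = B + h in single precision and then recovers the depth by subtracting B again; the function name and the sample values are illustrative assumptions.

```cpp
// Depth recovered after storing the surface elevation w = B + h in single
// precision and subtracting the bathymetry B again.
float depthFromStoredElevation(float B, float h) {
    volatile float w = B + h;  // volatile forces the sum to be rounded to float
    return w - B;
}
```

With B = 10000 and h = 0.0001, neighbouring single-precision values near 10000 are about 0.001 apart, so the sum rounds back to exactly 10000 and the recovered depth is zero. Storing h directly keeps the value, which is the point of the last bullet on the slide.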

43 1D Validation: Flow over Triangular bump (90s)
[Figure: simulated versus measured values at gauges including G2, G4, G8, G10, G11, and G13]

44 2D Verification: Parabolic basin
- Analytical 2D parabolic basin (Thacker)
  - Planar water surface oscillates
  - 100 x 100 cells
  - Horizontal scale: 8 km
  - Vertical scale: 3.3 m
- Simulation and analytical solution match well
  - But, as with most schemes, growing errors along the wet-dry interface (model error)

45 2D Validation: Barrage du Malpasset
- We model the equations correctly, but can we model real events?
- South-east France near Fréjus: Barrage du Malpasset
  - Double curvature dam, 66.5 m high, 220 m crest length, 55 million m^3
  - Bursts at 21:13, December 2nd, 1959
  - Reaches the Mediterranean in 30 minutes (speeds up to 70 km/h)
  - 423 casualties, $68 million in damages
- Validate against experimental data from a 1:400 model
  - 1099 x 439 cells
  - 15 meter resolution
- Our results match experimental data very well
  - Discrepancies at gauges 14 and 9 present in most (all?) published results
Image from google earth, mes-ballades.com

46 Bonus material: Achieving Even Higher Performance

47 Multi-GPU simulations
- Because we have a finite domain of dependence, we can create independent partitions of the domain and distribute them to multiple GPUs
- Modern PCs have up to four GPUs
- Near-perfect weak and strong scaling
Collaboration with Martin L. Sætra

48 Early exit optimization
- Observation: Many dry areas do not require computation
  - Use a small buffer to store wet blocks
  - Exit flux kernel if nearest neighbors are dry
- Up to 6x speedup (mileage may vary)
  - Blocks still have to be scheduled
  - Blocks read the auxiliary buffer
  - One wet cell marks the whole block as wet

49 Sparse domain optimization
- The early exit strategy launches too many blocks
- Dry blocks should not need to check that they are dry!
- Sparse Compute: Do not perform any computations on dry parts of the domain
- Sparse Memory: Do not save any values in the dry parts of the domain
Ph.D. work of Martin L. Sætra

50 Sparse domain optimization
1. Find all wet blocks
2. Grow to include dependencies
3. Sort block indices and launch the required number of blocks
- Similarly for memory, but it gets quite complicated
- 2x improvement over early exit (mileage may vary)!
- Comparison using an average of 26% wet cells
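The three steps can be sketched on the host as follows. The sketch assumes a per-block wet flag as input and a dependency range of one block in every direction; the function name, the flag layout, and the one-block range are illustrative assumptions rather than details taken from the talk.

```cpp
#include <cstddef>
#include <vector>

// Returns the sorted linear indices of all blocks that need to be launched:
// every wet block plus its direct neighbours.
// wet[by * nbx + bx] is nonzero when block (bx, by) holds at least one wet cell.
std::vector<int> activeBlocks(const std::vector<char>& wet, int nbx, int nby) {
    std::vector<char> active(wet.size(), 0);
    for (int by = 0; by < nby; ++by) {
        for (int bx = 0; bx < nbx; ++bx) {
            if (!wet[by * nbx + bx]) { continue; }   // step 1: find wet blocks
            for (int dy = -1; dy <= 1; ++dy) {       // step 2: grow by one block
                for (int dx = -1; dx <= 1; ++dx) {
                    const int x = bx + dx;
                    const int y = by + dy;
                    if (x >= 0 && x < nbx && y >= 0 && y < nby) {
                        active[y * nbx + x] = 1;
                    }
                }
            }
        }
    }
    // Step 3: compact the flags into a list of block indices. Scanning in
    // index order means the list comes out already sorted.
    std::vector<int> indices;
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (active[i]) { indices.push_back(static_cast<int>(i)); }
    }
    return indices;
}
```

A kernel launched with `indices.size()` blocks can then look up its real block position in this list instead of using its own block index, so dry regions are never scheduled at all.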

51 Video

52 Summary

53 Summary
- GPUs are powerful
  - 7x theoretical difference between CPU and GPU
  - Forces you to think about hardware (in a good way)
- GPUs have never been easier to program
  - Modern languages and toolkits help you get a flying start
  - Easy to achieve speed-ups
  - Expert knowledge still required to reach peak performance
- Shallow water simulations map very well to GPUs
  - Able to reach near-peak performance
  - Physical correctness can be ensured, even using single precision
  - Multi-GPU and sparse domain optimizations give even higher performance

54 Thank you for your attention
Talk material based on work on our simulator engine. Some references:
- A. Brodtkorb, M. L. Sætra, Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices, CMWR Proceedings, 2012
- A. Brodtkorb, M. L. Sætra, M. Altinakar, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, Computers & Fluids, 55 (2011)
- A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig, Simulation and Visualization of the Saint-Venant System using GPUs, Computing and Visualization in Science, 13(7) (2011)
Contact: André R. Brodtkorb

55 "This slide is intentionally left blank"

Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs. NVIDIA GPU Technology Conference San Jose, California, 2010 André R.

Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs. NVIDIA GPU Technology Conference San Jose, California, 2010 André R. Evacuate Now? Faster-than-real-time Shallow Water Simulations on GPUs NVIDIA GPU Technology Conference San Jose, California, 2010 André R. Brodtkorb Talk Outline Learn how to simulate a half an hour dam

More information

Shallow Water Simulations on Graphics Hardware

Shallow Water Simulations on Graphics Hardware Shallow Water Simulations on Graphics Hardware Ph.D. Thesis Presentation 2014-06-27 Martin Lilleeng Sætra Outline Introduction Parallel Computing and the GPU Simulating Shallow Water Flow Topics of Thesis

More information

EXPLICIT SHALLOW WATER SIMULATIONS ON GPUS: GUIDELINES AND BEST PRACTICES

EXPLICIT SHALLOW WATER SIMULATIONS ON GPUS: GUIDELINES AND BEST PRACTICES XIX International Conference on Water Resources CMWR University of Illinois at Urbana-Champaign June 7-, EXPLICIT SHALLOW WATER SIMULATIONS ON GPUS: GUIDELINES AND BEST PRACTICES André R. Brodtkorb, Martin

More information

Load-balancing multi-gpu shallow water simulations on small clusters

Load-balancing multi-gpu shallow water simulations on small clusters Load-balancing multi-gpu shallow water simulations on small clusters Gorm Skevik master thesis autumn 2014 Load-balancing multi-gpu shallow water simulations on small clusters Gorm Skevik 1st August 2014

More information

Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation

Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation 1 Revised personal version of final journal article : Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation André R. Brodtkorb a,, Martin L. Sætra b,

More information

Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation

Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation André R. Brodtkorb a,, Martin L. Sætra b, Mustafa Altinakar c a SINTEF ICT, Department of Applied

More information

State-of-the-art in Heterogeneous Computing

State-of-the-art in Heterogeneous Computing State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous

More information

This is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs

This is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs SIMULATION AND VISUALIZATION OF THE SAINT-VENANT SYSTEM USING GPUS ANDRÉ R. BRODTKORB, TROND R. HAGEN, KNUT-ANDREAS LIE, AND JOSTEIN R. NATVIG This is a draft of the paper entitled Simulation and Visualization

More information

This is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs

This is a draft of the paper entitled Simulation and Visualization of the Saint-Venant System using GPUs SIMULATION AND VISUALIZATION OF THE SAINT-VENANT SYSTEM USING GPUS ANDRÉ R. BRODTKORB, TROND R. HAGEN, KNUT-ANDREAS LIE, AND JOSTEIN R. NATVIG This is a draft of the paper entitled Simulation and Visualization

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

This is a draft. The full paper can be found in Journal of Scientific Computing xx(x):xx xx:

This is a draft. The full paper can be found in Journal of Scientific Computing xx(x):xx xx: EFFICIENT GPU-IMPLEMENTATION OF ADAPTIVE MESH REFINEMENT FOR THE SHALLOW-WATER EQUATIONS MARTIN L. SÆTRA 1,2, ANDRÉ R. BRODTKORB 3, AND KNUT-ANDREAS LIE 1,3 This is a draft. The full paper can be found

More information

Auto-tuning Shallow water simulations on GPUs

Auto-tuning Shallow water simulations on GPUs Auto-tuning Shallow water simulations on GPUs André B. Amundsen Master s Thesis Spring 2014 Auto-tuning Shallow water simulations on GPUs André B. Amundsen 15th May 2014 ii Abstract Graphic processing

More information

Short introduction to GPU and Heterogeneous Computing

Short introduction to GPU and Heterogeneous Computing Short introduction to GPU and Heterogeneous Computing University of Málaga, 2016-04-11 André R. Brodtkorb, SINTEF, Norway Technology for a better society 1 Established 1950 by the Norwegian Institute of

More information

Partial Differential Equations

Partial Differential Equations Simulation in Computer Graphics Partial Differential Equations Matthias Teschner Computer Science Department University of Freiburg Motivation various dynamic effects and physical processes are described

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Parallel Adaptive Tsunami Modelling with Triangular Discontinuous Galerkin Schemes

Parallel Adaptive Tsunami Modelling with Triangular Discontinuous Galerkin Schemes Parallel Adaptive Tsunami Modelling with Triangular Discontinuous Galerkin Schemes Stefan Vater 1 Kaveh Rahnema 2 Jörn Behrens 1 Michael Bader 2 1 Universität Hamburg 2014 PDES Workshop 2 TU München Partial

More information

Tutorial: GPU and Heterogeneous Computing in Discrete Optimization

Tutorial: GPU and Heterogeneous Computing in Discrete Optimization Tutorial: GPU and Heterogeneous Computing in Discrete Optimization Established 1950 by the Norwegian Institute of Technology. The largest independent research organisation in Scandinavia. A non-profit

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr.

Mid-Year Report. Discontinuous Galerkin Euler Equation Solver. Friday, December 14, Andrey Andreyev. Advisor: Dr. Mid-Year Report Discontinuous Galerkin Euler Equation Solver Friday, December 14, 2012 Andrey Andreyev Advisor: Dr. James Baeder Abstract: The focus of this effort is to produce a two dimensional inviscid,

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Technology for a better society. SINTEF ICT, Applied Mathematics, Heterogeneous Computing Group

Technology for a better society. SINTEF ICT, Applied Mathematics, Heterogeneous Computing Group Technology for a better society SINTEF, Applied Mathematics, Heterogeneous Computing Group Trond Hagen GPU Computing Seminar, SINTEF Oslo, October 23, 2009 1 Agenda 12:30 Introduction and welcoming Trond

More information

NIA CFD Seminar, October 4, 2011 Hyperbolic Seminar, NASA Langley, October 17, 2011

NIA CFD Seminar, October 4, 2011 Hyperbolic Seminar, NASA Langley, October 17, 2011 NIA CFD Seminar, October 4, 2011 Hyperbolic Seminar, NASA Langley, October 17, 2011 First-Order Hyperbolic System Method If you have a CFD book for hyperbolic problems, you have a CFD book for all problems.

More information

Computational Fluid Dynamics using OpenCL a Practical Introduction

Computational Fluid Dynamics using OpenCL a Practical Introduction 19th International Congress on Modelling and Simulation, Perth, Australia, 12 16 December 2011 http://mssanz.org.au/modsim2011 Computational Fluid Dynamics using OpenCL a Practical Introduction T Bednarz

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

A GPU Implementation for Two-Dimensional Shallow Water Modeling arxiv: v1 [cs.dc] 5 Sep 2013

A GPU Implementation for Two-Dimensional Shallow Water Modeling arxiv: v1 [cs.dc] 5 Sep 2013 A GPU Implementation for Two-Dimensional Shallow Water Modeling arxiv:1309.1230v1 [cs.dc] 5 Sep 2013 Kerry A. Seitz, Jr. 1, Alex Kennedy 1, Owen Ransom 2, Bassam A. Younis 2, and John D. Owens 3 1 Department

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

CS205b/CME306. Lecture 9

CS205b/CME306. Lecture 9 CS205b/CME306 Lecture 9 1 Convection Supplementary Reading: Osher and Fedkiw, Sections 3.3 and 3.5; Leveque, Sections 6.7, 8.3, 10.2, 10.4. For a reference on Newton polynomial interpolation via divided

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Simulation of one-layer shallow water systems on multicore and CUDA architectures

Simulation of one-layer shallow water systems on multicore and CUDA architectures Noname manuscript No. (will be inserted by the editor) Simulation of one-layer shallow water systems on multicore and CUDA architectures Marc de la Asunción José M. Mantas Manuel J. Castro Received: date

More information

MET report. One-Layer Shallow Water Models on the GPU

MET report. One-Layer Shallow Water Models on the GPU MET report no. 27/2013 Oceanography One-Layer Shallow Water Models on the GPU André R. Brodtkorb 1, Trond R. Hagen 2, Lars Petter Røed 3 1 SINTEF IKT, Avd. for Anvendt Matematikk 2 SINTEF IKT, Avd. for

More information
