Stokes Preconditioning on a GPU

Size: px
Start display at page:

Download "Stokes Preconditioning on a GPU"

Transcription

1 Stokes Preconditioning on a GPU Matthew Knepley 1,2, Dave A. Yuen, and Dave A. May 1 Computation Institute University of Chicago 2 Department of Molecular Biology and Physiology Rush University Medical Center AGU 09 San Francisco, CA December 15, 2009 M. Knepley (UC) GPU AGU09 1 / 44

2 Collaborators Prof. Dave Yuen Dept. of Geology and Geophysics, University of Minnesota Dr. David May, developer of BFBT (in PETSc) Dept. of Earth Sciences, ETHZ Felipe Cruz, developer of FMM-GPU Dept. of Applied Mathematics, University of Bristol Prof. Lorena Barba Dept. of Mechanical Engineering, Boston University M. Knepley (UC) GPU AGU09 3 / 44

3 Outline 5 Slide Talk 1 5 Slide Talk 2 What are the Problems? 3 Can we do Better? 4 Advantages and Disadvantages 5 What is Next? M. Knepley (UC) GPU AGU09 4 / 44

4 BFBT 5 Slide Talk BFBT preconditions the Schur complement using S 1 b = L 1 p G T KGL 1 p (1) where L p is the Laplacian in the pressure space. M. Knepley (UC) GPU AGU09 5 / 44

5 Current Problems 5 Slide Talk The current BFBT code is limited by Bandwidth constraints Sparse matrix-vector product Achieves at most 10% of peak performance Synchronization GMRES orthogonalization Coarse problem Convergence Viscosity variation Mesh dependence M. Knepley (UC) GPU AGU09 6 / 44

6 5 Slide Talk Alternative Proposal Use a Boundary Element Method, for the Laplace solves in BFBT, accelerated by FMM. M. Knepley (UC) GPU AGU09 7 / 44

7 Missing Pieces 5 Slide Talk BEM discretization and assembly Matrix-free operator application using the Fast Multipole Method Overcomes bandwidth limit, 480 GF on an NVIDIA 1060C GPU Overcomes coarse bottleneck by overlapping direct work Solver for BEM system Same total work as FEM due to well-conditioned operator Possibility of multilevel preconditioner (even better) Interpolation between FEM and BEM Boundary interpolation just averages Can again use FMM for interior M. Knepley (UC) GPU AGU09 8 / 44

8 5 Slide Talk Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 9 / 44

9 5 Slide Talk Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 9 / 44

10 5 Slide Talk Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 9 / 44

11 Outline What are the Problems? 1 5 Slide Talk 2 What are the Problems? Bandwidth Synchronization Convergence 3 Can we do Better? 4 Advantages and Disadvantages 5 What is Next? M. Knepley (UC) GPU AGU09 10 / 44

12 Problems What are the Problems? The current BFBT code is limited by Bandwidth constraints Synchronization Convergence M. Knepley (UC) GPU AGU09 11 / 44

13 Outline What are the Problems? Bandwidth 2 What are the Problems? Bandwidth Synchronization Convergence M. Knepley (UC) GPU AGU09 12 / 44

14 Bandwidth What are the Problems? Bandwidth Small bandwidth to main memory can limit performance Sparse matrix-vector product Operator application AMG restriction and interpolation M. Knepley (UC) GPU AGU09 13 / 44

15 What are the Problems? STREAM Benchmark Bandwidth Simple benchmark program measuring sustainable memory bandwidth Protoypical operation is Triad (WAXPY): w = y + αx Measures the memory bandwidth bottleneck (much below peak) Datasets outstrip cache Machine Peak (MF/s) Triad (MB/s) MF/MW Eq. MF/s Matt s Laptop (5.5%) Intel Core2 Quad (1.2%) Tesla 1060C * (0.8%) Table: Bandwidth limited machine performance M. Knepley (UC) GPU AGU09 14 / 44

16 What are the Problems? Bandwidth Analysis of Sparse Matvec (SpMV) Assumptions No cache misses No waits on memory references Notation m Number of matrix rows nz Number of nonzero matrix elements V Number of vectors to multiply We can look at bandwidth needed for peak performance ( V ) m nz + 6 V or achieveable performance given a bandwith BW byte/flop (2) Vnz BW Mflop/s (3) (8V + 2)m + 6nz Towards Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes, and Smith. M. Knepley (UC) GPU AGU09 15 / 44

17 What are the Problems? Improving Serial Performance Bandwidth For a single matvec with 3D FD Poisson, Matt s laptop can achieve at most 1 (8 + 2) 1 bytes/flop( MB/s) = 151 MFlops/s, (4) which is a dismal 8.8% of peak. Can improve performance by Blocking Multiple vectors but operation issue limitations take over. M. Knepley (UC) GPU AGU09 16 / 44

18 What are the Problems? Improving Serial Performance Bandwidth For a single matvec with 3D FD Poisson, Matt s laptop can achieve at most 1 (8 + 2) 1 bytes/flop( MB/s) = 151 MFlops/s, (4) which is a dismal 8.8% of peak. Better approaches: Unassembled operator application (Spectral elements, FMM) N data, N 2 computation Nonlinear evaluation (Picard, FAS, Exact Polynomial Solvers) N data, N k computation M. Knepley (UC) GPU AGU09 16 / 44

19 Outline What are the Problems? Synchronization 2 What are the Problems? Bandwidth Synchronization Convergence M. Knepley (UC) GPU AGU09 17 / 44

20 Synchronization What are the Problems? Synchronization Synchronization penalties can come from Reductions GMRES orthogonalization More than 20% penalty for PFLOTRAN on Cray XT5 Small subproblems Multigrid coarse problem Lower levels of Fast Multipole Method tree M. Knepley (UC) GPU AGU09 18 / 44

21 Outline What are the Problems? Convergence 2 What are the Problems? Bandwidth Synchronization Convergence M. Knepley (UC) GPU AGU09 19 / 44

22 Convergence What are the Problems? Convergence Convergence of the BFBT solve depends on Viscosity constrast (slightly) Viscosity topology Mesh Convergence of the AMG Poisson solve depends on Mesh M. Knepley (UC) GPU AGU09 20 / 44

23 Outline Can we do Better? 1 5 Slide Talk 2 What are the Problems? 3 Can we do Better? BEM Formulation BEM Solver Interpolation 4 Advantages and Disadvantages 5 What is Next? M. Knepley (UC) GPU AGU09 21 / 44

24 Can we do Better? Alternative Proposal Use a Boundary Element Method, for the Laplace solves in BFBT, accelerated by FMM. M. Knepley (UC) GPU AGU09 22 / 44

25 Missing Pieces Can we do Better? BEM discretization and assembly Solver for BEM system Interpolation between FEM and BEM M. Knepley (UC) GPU AGU09 23 / 44

26 Outline Can we do Better? BEM Formulation 3 Can we do Better? BEM Formulation BEM Solver Interpolation M. Knepley (UC) GPU AGU09 24 / 44

27 Can we do Better? Boundary Element Method BEM Formulation The Poisson problem u(x) = f (x) on Ω (5) u(x) Ω = g(x) (6) M. Knepley (UC) GPU AGU09 25 / 44

28 Can we do Better? Boundary Element Method BEM Formulation The Poisson problem (Boundary Integral Equation formulation) C(x)u(x) = F(x, y)g(y) G(x, y) u(y) ds(y) Ω n (5) G(x, y) = 1 log r 2π (6) F(x, y) = 1 r 2πr n (7) M. Knepley (UC) GPU AGU09 25 / 44

29 Can we do Better? Boundary Element Method BEM Formulation Restricting to the boundary, we see that 1 2 g(x) = Ω F(x, y)g(y) G(x, y) u(y) ds(y) (5) n M. Knepley (UC) GPU AGU09 25 / 44

30 Can we do Better? Boundary Element Method BEM Formulation Discretizing, we have Gq = ( ) 1 2 I F g (5) M. Knepley (UC) GPU AGU09 25 / 44

31 Can we do Better? Boundary Element Method BEM Formulation Now we can evaluate u in the interior u(x) = Ω F(x, y)g(y) G(x, y) u(y) ds(y) (5) n M. Knepley (UC) GPU AGU09 25 / 44

32 Can we do Better? Boundary Element Method BEM Formulation Or in discrete form u = Fg Gq (5) M. Knepley (UC) GPU AGU09 25 / 44

33 Can we do Better? Boundary Element Method BEM Formulation The sources in the interior may be added in using superposition ( ) 1 u(y) 2 g(x) = F(x, y)g(y) G(x, y) Ω n f ds(y) (5) M. Knepley (UC) GPU AGU09 25 / 44

34 Outline Can we do Better? BEM Solver 3 Can we do Better? BEM Formulation BEM Solver Interpolation M. Knepley (UC) GPU AGU09 26 / 44

35 BEM Solver Can we do Better? BEM Solver The solve has two pieces: Operator application Boundary solve Interior evaluation Accomplished using the Fast Multipole Method Iterative solver Usually GMRES We use PETSc M. Knepley (UC) GPU AGU09 27 / 44

36 Operator Application Can we do Better? BEM Solver Using the Fast Multiple Method, the Green s functions (F and G) can be applied: in O(N) time using small memory bandwidth in the interior and on the boundary with much higher serial and parallel performance M. Knepley (UC) GPU AGU09 28 / 44

37 Can we do Better? Fast Multipole Method BEM Solver FMM accelerates the calculation of the function: Φ(x i ) = j K (x i, x j )q(x j ) (6) Accelerates O(N 2 ) to O(N) time The kernel K (x i, x j ) must decay quickly from (x i, x i ) Can be singular on the diagonal (Calderón-Zygmund operator) Discovered by Leslie Greengard and Vladimir Rohklin in 1987 Very similar to recent wavelet techniques M. Knepley (UC) GPU AGU09 29 / 44

38 Can we do Better? Fast Multipole Method BEM Solver FMM accelerates the calculation of the function: Φ(x i ) = j q j x i x j (6) Accelerates O(N 2 ) to O(N) time The kernel K (x i, x j ) must decay quickly from (x i, x i ) Can be singular on the diagonal (Calderón-Zygmund operator) Discovered by Leslie Greengard and Vladimir Rohklin in 1987 Very similar to recent wavelet techniques M. Knepley (UC) GPU AGU09 29 / 44

39 Can we do Better? PetFMM CPU Performance Strong Scaling BEM Solver Speedup uniform 4ML8R5 uniform 10ML9R5 2 spiral 1ML8R5 spiral w/ space-filling 1ML8R5 Perfect Speedup M. Knepley (UC) Number GPUof processors AGU09 30 / 44

40 Can we do Better? PetFMM CPU Performance Strong Scaling BEM Solver ME Initialization Upward Sweep Downward Sweep Evaluation Load balancing stage Total time Time [sec] M. Knepley (UC) Number GPUof processors AGU09 30 / 44

41 Can we do Better? PetFMM Load Balance BEM Solver uniform 4ML8R5 uniform 10ML9R5 spiral 1ML8R5 spiral w/space-filling 1ML8R Number of processors M. Knepley (UC) GPU AGU09 31 / 44

42 GPU Performance Can we do Better? BEM Solver In our C++ code on a CPU, M2L transforms take 85% of the time This does vary depending on N New M2L design was implemented using PyCUDA Port to C++ is underway We can now achieve 500 GF on the NVIDIA Tesla Previous best performance we found was 100 GF We will release PetFMM-GPU in the new year M. Knepley (UC) GPU AGU09 32 / 44

43 GPU Performance Can we do Better? BEM Solver In our C++ code on a CPU, M2L transforms take 85% of the time This does vary depending on N New M2L design was implemented using PyCUDA Port to C++ is underway We can now achieve 500 GF on the NVIDIA Tesla Previous best performance we found was 100 GF We will release PetFMM-GPU in the new year M. Knepley (UC) GPU AGU09 32 / 44

44 GPU Performance Can we do Better? BEM Solver In our C++ code on a CPU, M2L transforms take 85% of the time This does vary depending on N New M2L design was implemented using PyCUDA Port to C++ is underway We can now achieve 500 GF on the NVIDIA Tesla Previous best performance we found was 100 GF We will release PetFMM-GPU in the new year M. Knepley (UC) GPU AGU09 32 / 44

45 GPU Performance Can we do Better? BEM Solver In our C++ code on a CPU, M2L transforms take 85% of the time This does vary depending on N New M2L design was implemented using PyCUDA Port to C++ is underway We can now achieve 500 GF on the NVIDIA Tesla Previous best performance we found was 100 GF We will release PetFMM-GPU in the new year M. Knepley (UC) GPU AGU09 32 / 44

46 PetFMM Can we do Better? BEM Solver PetFMM is an freely available implementation of the Fast Multipole Method Leverages PETSc Same open source license Uses Sieve for parallelism Extensible design in C++ Templated over the kernel Templated over traversal for evaluation MPI implementation Novel parallel strategy for anisotropic/sparse particle distributions PetFMM A dynamically load-balancing parallel fast multipole library 86% efficient strong scaling on 64 procs Example application using the Vortex Method for fluids (coming soon) GPU implementation M. Knepley (UC) GPU AGU09 33 / 44

47 Convergence Can we do Better? BEM Solver BEM Laplace operator is well-conditioned κ = O(N B ) = O( N) Dijkstra and Mattheij Thus the total work is in O(N 2 B ) = O(N) Same as MG Regular integral operators require only two multigrid cycles Multigrid of the 2nd kind by Hackbush M. Knepley (UC) GPU AGU09 34 / 44

48 Outline Can we do Better? Interpolation 3 Can we do Better? BEM Formulation BEM Solver Interpolation M. Knepley (UC) GPU AGU09 35 / 44

49 FEM BEM Can we do Better? Interpolation FEM BEM FEM boundary conditions can be directly used in BEM May require a VecScatter FEM BEM BEM can evaluate the field at any domain point Cost is linear in the number of evaluations using FMM Can accomodate both pointwise values, and moments by quadrature M. Knepley (UC) GPU AGU09 36 / 44

50 Outline Advantages and Disadvantages 1 5 Slide Talk 2 What are the Problems? 3 Can we do Better? 4 Advantages and Disadvantages Bandwidth Convergence 5 What is Next? M. Knepley (UC) GPU AGU09 37 / 44

51 Outline Advantages and Disadvantages Bandwidth 4 Advantages and Disadvantages Bandwidth Convergence M. Knepley (UC) GPU AGU09 38 / 44

52 Advantages and Disadvantages Bandwidth Bandwidth and Serial Performance Provably low bandwidth Shang-Hua Teng, SISC, 19(2), , 1998 Key advantage over algebraic methods like FFT Similar to wavelet transform Amenable to GPU implementation Also highly concurrent M. Knepley (UC) GPU AGU09 39 / 44

53 Outline Advantages and Disadvantages Convergence 4 Advantages and Disadvantages Bandwidth Convergence M. Knepley (UC) GPU AGU09 40 / 44

54 Advantages and Disadvantages Convergence Convergence and Synchronization BEM matrices are better conditioned However, FEM has better preconditioners Without better preconditioners, might see more synchronization Underexplored FMM can avoid bottleneck at lower levels Overlap direct work with lower tree levels Can provably eliminate bottleneck M. Knepley (UC) GPU AGU09 41 / 44

55 Advantages and Disadvantages Debatable Advantages Convergence Small memory FEM can be done matrix-free Opens door to using Stokes operator for PC We currently do not know what to do here M. Knepley (UC) GPU AGU09 42 / 44

56 Outline What is Next? 1 5 Slide Talk 2 What are the Problems? 3 Can we do Better? 4 Advantages and Disadvantages 5 What is Next? M. Knepley (UC) GPU AGU09 43 / 44

57 What is Next? Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 44 / 44

58 What is Next? Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 44 / 44

59 What is Next? Direct Fast Method for Variable-Viscosity Stokes Complexity not currently precisely quantified We would like a given number of flops/digit of accuracy Brute Force Use BEM to compute layers between regions of constant viscosity Better conditioned, but not direct Elegant method should be possible The operator is pseudo-differential Kernel-independent FMM exists M. Knepley (UC) GPU AGU09 44 / 44

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

A Kernel-independent Adaptive Fast Multipole Method

A Kernel-independent Adaptive Fast Multipole Method A Kernel-independent Adaptive Fast Multipole Method Lexing Ying Caltech Joint work with George Biros and Denis Zorin Problem Statement Given G an elliptic PDE kernel, e.g. {x i } points in {φ i } charges

More information

Fast Multipole Method on the GPU

Fast Multipole Method on the GPU Fast Multipole Method on the GPU with application to the Adaptive Vortex Method University of Bristol, Bristol, United Kingdom. 1 Introduction Particle methods Highly parallel Computational intensive Numerical

More information

Implementation for Scientific Computing: FEM and FMM

Implementation for Scientific Computing: FEM and FMM Implementation for Scientific Computing: FEM and FMM Matthew Knepley Computation Institute University of Chicago Department of Mathematics Tufts University March 12, 2010 M. Knepley (UC) SC Tufts 1 / 121

More information

Tree-based methods on GPUs

Tree-based methods on GPUs Tree-based methods on GPUs Felipe Cruz 1 and Matthew Knepley 2,3 1 Department of Mathematics University of Bristol 2 Computation Institute University of Chicago 3 Department of Molecular Biology and Physiology

More information

Fast Methods with Sieve

Fast Methods with Sieve Fast Methods with Sieve Matthew G Knepley Mathematics and Computer Science Division Argonne National Laboratory August 12, 2008 Workshop on Scientific Computing Simula Research, Oslo, Norway M. Knepley

More information

Developing GPU-Enabled Scientific Libraries

Developing GPU-Enabled Scientific Libraries Developing GPU-Enabled Scientific Libraries Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology Rush University Medical Center CBC Workshop on CBC

More information

Parallel High-Order Geometric Multigrid Methods on Adaptive Meshes for Highly Heterogeneous Nonlinear Stokes Flow Simulations of Earth s Mantle

Parallel High-Order Geometric Multigrid Methods on Adaptive Meshes for Highly Heterogeneous Nonlinear Stokes Flow Simulations of Earth s Mantle ICES Student Forum The University of Texas at Austin, USA November 4, 204 Parallel High-Order Geometric Multigrid Methods on Adaptive Meshes for Highly Heterogeneous Nonlinear Stokes Flow Simulations of

More information

Incorporation of Multicore FEM Integration Routines into Scientific Libraries

Incorporation of Multicore FEM Integration Routines into Scientific Libraries Incorporation of Multicore FEM Integration Routines into Scientific Libraries Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology Rush University

More information

Adaptive boundary element methods in industrial applications

Adaptive boundary element methods in industrial applications Adaptive boundary element methods in industrial applications Günther Of, Olaf Steinbach, Wolfgang Wendland Adaptive Fast Boundary Element Methods in Industrial Applications, Hirschegg, September 30th,

More information

Automated Finite Element Computations in the FEniCS Framework using GPUs

Automated Finite Element Computations in the FEniCS Framework using GPUs Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

Performance of Implicit Solver Strategies on GPUs

Performance of Implicit Solver Strategies on GPUs 9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used

More information

Theoretical Foundations

Theoretical Foundations Theoretical Foundations Dmitry Karpeev 1,2, Matthew Knepley 1,2, and Robert Kirby 3 1 Mathematics and Computer Science Division 2 Computation Institute Argonne National Laboratory University of Chicago

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

Developing GPU-Enabled Scientific Libraries

Developing GPU-Enabled Scientific Libraries Developing GPU-Enabled Scientific Libraries Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology Rush University Medical Center CBC Workshop on CBC

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards By Allan P. Engsig-Karup, Morten Gorm Madsen and Stefan L. Glimberg DTU Informatics Workshop

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011 Dan Negrut, 2011 ME964

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

The Immersed Interface Method

The Immersed Interface Method The Immersed Interface Method Numerical Solutions of PDEs Involving Interfaces and Irregular Domains Zhiiin Li Kazufumi Ito North Carolina State University Raleigh, North Carolina Society for Industrial

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations

More information

The Fast Multipole Method on NVIDIA GPUs and Multicore Processors

The Fast Multipole Method on NVIDIA GPUs and Multicore Processors The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied

More information

Getting Modern Algorithms into the Hands of Working Scientists on Modern Hardware

Getting Modern Algorithms into the Hands of Working Scientists on Modern Hardware Getting Modern Algorithms into the Hands of Working Scientists on Modern Hardware Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology Rush University

More information

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction FOR 493 - P3: A monolithic multigrid FEM solver for fluid structure interaction Stefan Turek 1 Jaroslav Hron 1,2 Hilmar Wobker 1 Mudassar Razzaq 1 1 Institute of Applied Mathematics, TU Dortmund, Germany

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

Accelerating the Iterative Linear Solver for Reservoir Simulation

Accelerating the Iterative Linear Solver for Reservoir Simulation Accelerating the Iterative Linear Solver for Reservoir Simulation Wei Wu 1, Xiang Li 2, Lei He 1, Dongxiao Zhang 2 1 Electrical Engineering Department, UCLA 2 Department of Energy and Resources Engineering,

More information

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation

More information

AMS527: Numerical Analysis II

AMS527: Numerical Analysis II AMS527: Numerical Analysis II A Brief Overview of Finite Element Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao SUNY Stony Brook AMS527: Numerical Analysis II 1 / 25 Overview Basic concepts Mathematical

More information

Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications. Jingfang Huang Department of Mathematics UNC at Chapel Hill

Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications. Jingfang Huang Department of Mathematics UNC at Chapel Hill Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications Jingfang Huang Department of Mathematics UNC at Chapel Hill Fundamentals of Fast Multipole (type) Method Fundamentals: Low

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Transactions on Modelling and Simulation vol 20, 1998 WIT Press, ISSN X

Transactions on Modelling and Simulation vol 20, 1998 WIT Press,   ISSN X Parallel indirect multipole BEM analysis of Stokes flow in a multiply connected domain M.S. Ingber*, A.A. Mammoli* & J.S. Warsa* "Department of Mechanical Engineering, University of New Mexico, Albuquerque,

More information

A comparison of parallel rank-structured solvers

A comparison of parallel rank-structured solvers A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.

More information

Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters

Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik www.mpi-inf.mpg.de/ ~strzodka Dominik Göddeke

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata

CUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids

More information

cuibm A GPU Accelerated Immersed Boundary Method

cuibm A GPU Accelerated Immersed Boundary Method cuibm A GPU Accelerated Immersed Boundary Method S. K. Layton, A. Krishnan and L. A. Barba Corresponding author: labarba@bu.edu Department of Mechanical Engineering, Boston University, Boston, MA, 225,

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Interdisciplinary practical course on parallel finite element method using HiFlow 3

Interdisciplinary practical course on parallel finite element method using HiFlow 3 Interdisciplinary practical course on parallel finite element method using HiFlow 3 E. Treiber, S. Gawlok, M. Hoffmann, V. Heuveline, W. Karl EuroEDUPAR, 2015/08/24 KARLSRUHE INSTITUTE OF TECHNOLOGY -

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Parallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

Parallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney Problems minimize g(x)+f(x; A, b) Sparse regression g(x) =kxk 1 f(x) =kax bk 2 2 mx Sparse SVM g(x) =kxk

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs

Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Iterative Solvers Numerical Results Conclusion and outlook 1/18 Matrix-free multi-gpu Implementation of Elliptic Solvers for strongly anisotropic PDEs Eike Hermann Müller, Robert Scheichl, Eero Vainikko

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

On Robust Parallel Preconditioning for Incompressible Flow Problems

On Robust Parallel Preconditioning for Incompressible Flow Problems On Robust Parallel Preconditioning for Incompressible Flow Problems Timo Heister, Gert Lube, and Gerd Rapin Abstract We consider time-dependent flow problems discretized with higher order finite element

More information

D036 Accelerating Reservoir Simulation with GPUs

D036 Accelerating Reservoir Simulation with GPUs D036 Accelerating Reservoir Simulation with GPUs K.P. Esler* (Stone Ridge Technology), S. Atan (Marathon Oil Corp.), B. Ramirez (Marathon Oil Corp.) & V. Natoli (Stone Ridge Technology) SUMMARY Over the

More information

Driven Cavity Example

Driven Cavity Example BMAppendixI.qxd 11/14/12 6:55 PM Page I-1 I CFD Driven Cavity Example I.1 Problem One of the classic benchmarks in CFD is the driven cavity problem. Consider steady, incompressible, viscous flow in a square

More information

Adaptable benchmarks for register blocked sparse matrix-vector multiplication

Adaptable benchmarks for register blocked sparse matrix-vector multiplication Adaptable benchmarks for register blocked sparse matrix-vector multiplication Berkeley Benchmarking and Optimization group (BeBOP) Hormozd Gahvari and Mark Hoemmen Based on research of: Eun-Jin Im Rich

More information

Computational Fluid Dynamics - Incompressible Flows

Computational Fluid Dynamics - Incompressible Flows Computational Fluid Dynamics - Incompressible Flows March 25, 2008 Incompressible Flows Basis Functions Discrete Equations CFD - Incompressible Flows CFD is a Huge field Numerical Techniques for solving

More information

Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing

Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing Natasha Flyer National Center for Atmospheric Research Boulder, CO Meshes vs. Mesh-free

More information

Available online at ScienceDirect. Parallel Computational Fluid Dynamics Conference (ParCFD2013)

Available online at  ScienceDirect. Parallel Computational Fluid Dynamics Conference (ParCFD2013) Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 61 ( 2013 ) 81 86 Parallel Computational Fluid Dynamics Conference (ParCFD2013) An OpenCL-based parallel CFD code for simulations

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Iterative methods for use with the Fast Multipole Method

Iterative methods for use with the Fast Multipole Method Iterative methods for use with the Fast Multipole Method Ramani Duraiswami Perceptual Interfaces and Reality Lab. Computer Science & UMIACS University of Maryland, College Park, MD Joint work with Nail

More information

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea. Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences

More information

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke. The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund

More information

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC 2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy

More information

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling

Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike

More information

GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA

GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA ANSYS Fluent 2 Fluent control flow Accelerate this first Non-linear iterations Assemble

More information

Parallel and Distributed Systems Lab.

Parallel and Distributed Systems Lab. Parallel and Distributed Systems Lab. Department of Computer Sciences Purdue University. Jie Chi, Ronaldo Ferreira, Ananth Grama, Tzvetan Horozov, Ioannis Ioannidis, Mehmet Koyuturk, Shan Lei, Robert Light,

More information

GPU Cluster Computing for FEM

GPU Cluster Computing for FEM GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing

More information

Exploring unstructured Poisson solvers for FDS

Exploring unstructured Poisson solvers for FDS Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests

More information

Efficient O(N log N) algorithms for scattered data interpolation

Efficient O(N log N) algorithms for scattered data interpolation Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007

More information

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Parallel Computations

Parallel Computations Parallel Computations Timo Heister, Clemson University heister@clemson.edu 2015-08-05 deal.ii workshop 2015 2 Introduction Parallel computations with deal.ii: Introduction Applications Parallel, adaptive,

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead

Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU Robert Strzodka NVAMG Project Lead A Parallel Success Story in Five Steps 2 Step 1: Understand Application ANSYS Fluent Computational Fluid Dynamics

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

The 3D DSC in Fluid Simulation

The 3D DSC in Fluid Simulation The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations

More information

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK

Multigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Towards Realistic Performance Bounds for Implicit CFD Codes

Towards Realistic Performance Bounds for Implicit CFD Codes Towards Realistic Performance Bounds for Implicit CFD Codes W. D. Gropp, a D. K. Kaushik, b D. E. Keyes, c and B. F. Smith d a Mathematics and Computer Science Division, Argonne National Laboratory, Argonne,

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

Approaches to Parallel Implementation of the BDDC Method

Approaches to Parallel Implementation of the BDDC Method Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,

More information

Numerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes

Numerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes Numerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes Jung-Han Kimn 1 and Blaise Bourdin 2 1 Department of Mathematics and The Center for Computation and

More information

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011

More information

Asynchronous OpenCL/MPI numerical simulations of conservation laws

Asynchronous OpenCL/MPI numerical simulations of conservation laws Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation

More information

Introduction to Multigrid and its Parallelization

Introduction to Multigrid and its Parallelization Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are

More information

A fast solver for the Stokes equations with distributed forces in complex geometries 1

A fast solver for the Stokes equations with distributed forces in complex geometries 1 A fast solver for the Stokes equations with distributed forces in complex geometries George Biros, Lexing Ying, and Denis Zorin Courant Institute of Mathematical Sciences, New York University, New York

More information

Large-scale Gas Turbine Simulations on GPU clusters

Large-scale Gas Turbine Simulations on GPU clusters Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based

More information

Krishnan Suresh Associate Professor Mechanical Engineering

Krishnan Suresh Associate Professor Mechanical Engineering Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick

More information

Handling Parallelisation in OpenFOAM

Handling Parallelisation in OpenFOAM Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Andreas Sandberg 1 Introduction The purpose of this lab is to: apply what you have learned so

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26

More information

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications

More information

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2

More information

Fast Multipole and Related Algorithms

Fast Multipole and Related Algorithms Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general

More information

An algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM

An algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM An algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM Axel Gerstenberger, Garth N. Wells, Chris Richardson, David Bernstein, and Ray Tuminaro FEniCS 14, Paris, France

More information

Numerical Algorithms on Multi-GPU Architectures

Numerical Algorithms on Multi-GPU Architectures Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications

More information