Stokes Preconditioning on a GPU
1 Stokes Preconditioning on a GPU
Matthew Knepley 1,2, Dave A. Yuen, and Dave A. May
1 Computation Institute, University of Chicago
2 Department of Molecular Biology and Physiology, Rush University Medical Center
AGU 09, San Francisco, CA, December 15, 2009
M. Knepley (UC) GPU AGU09 1 / 44
2 Collaborators
- Prof. Dave Yuen, Dept. of Geology and Geophysics, University of Minnesota
- Dr. David May, developer of BFBT (in PETSc), Dept. of Earth Sciences, ETHZ
- Felipe Cruz, developer of FMM-GPU, Dept. of Applied Mathematics, University of Bristol
- Prof. Lorena Barba, Dept. of Mechanical Engineering, Boston University
3 Outline
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
4 BFBT (5 Slide Talk)
BFBT preconditions the Schur complement using
$S_b^{-1} = L_p^{-1} G^T K G L_p^{-1}$
where $L_p$ is the Laplacian in the pressure space.
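A minimal dense-matrix sketch of applying this preconditioner. All names here are illustrative: G stands for a hypothetical discrete pressure gradient, K for the velocity operator, and the callable applying $L_p^{-1}$ would in practice be an inexact solve (e.g. one AMG V-cycle), not a dense direct solve.

```python
import numpy as np

def apply_bfbt_inv(lp_inv, G, K, p):
    """Apply S_b^{-1} p = L_p^{-1} G^T K G L_p^{-1} p.
       lp_inv: callable applying L_p^{-1} to a pressure vector."""
    y = lp_inv(p)       # L_p^{-1} p
    y = G @ y           # pressure -> velocity space
    y = K @ y           # velocity operator
    y = G.T @ y         # velocity -> pressure space
    return lp_inv(y)    # final L_p^{-1}

# Toy sizes: n velocity dofs, m pressure dofs.
rng = np.random.default_rng(0)
n, m = 8, 4
G = rng.standard_normal((n, m))
K = np.eye(n)                    # identity viscosity block for the sanity check
Lp = G.T @ G                     # surrogate pressure Laplacian
lp_inv = lambda v: np.linalg.solve(Lp, v)
p = rng.standard_normal(m)
out = apply_bfbt_inv(lp_inv, G, K, p)
```

With K = I the middle factor $G^T K G$ equals $L_p$ itself, so the whole application collapses to a single $L_p^{-1} p$, which makes a convenient correctness check.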
5 Current Problems (5 Slide Talk)
The current BFBT code is limited by:
- Bandwidth constraints: the sparse matrix-vector product achieves at most 10% of peak performance
- Synchronization: GMRES orthogonalization, the coarse problem
- Convergence: viscosity variation, mesh dependence
6 Alternative Proposal (5 Slide Talk)
Use a Boundary Element Method for the Laplace solves in BFBT, accelerated by FMM.
7 Missing Pieces (5 Slide Talk)
- BEM discretization and assembly
  - Matrix-free operator application using the Fast Multipole Method
  - Overcomes the bandwidth limit: 480 GF on an NVIDIA 1060C GPU
  - Overcomes the coarse bottleneck by overlapping direct work
- Solver for the BEM system
  - Same total work as FEM due to the well-conditioned operator
  - Possibility of a multilevel preconditioner (even better)
- Interpolation between FEM and BEM
  - Boundary interpolation just averages
  - Can again use FMM for the interior
8 Direct Fast Method for Variable-Viscosity Stokes (5 Slide Talk)
- Complexity not currently precisely quantified; we would like a given number of flops per digit of accuracy
- Brute force: use BEM to compute layers between regions of constant viscosity (better conditioned, but not direct)
- An elegant method should be possible: the operator is pseudo-differential, and kernel-independent FMM exists
11 Outline (What are the Problems?)
1. 5 Slide Talk
2. What are the Problems?: Bandwidth, Synchronization, Convergence
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
12 Problems (What are the Problems?)
The current BFBT code is limited by:
- Bandwidth constraints
- Synchronization
- Convergence
13 Outline (What are the Problems?, Bandwidth)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
14 Bandwidth (What are the Problems?)
Small bandwidth to main memory can limit the performance of:
- Sparse matrix-vector product
- Operator application
- AMG restriction and interpolation
15 STREAM Benchmark (What are the Problems?, Bandwidth)
- Simple benchmark program measuring sustainable memory bandwidth
- Prototypical operation is Triad (WAXPY): $w = y + \alpha x$
- Measures the memory bandwidth bottleneck (much below peak)
- Datasets outstrip cache

Table: Bandwidth-limited machine performance
Machine | Peak (MF/s) | Triad (MB/s) | MF/MW | Eq. MF/s (% of peak)
Matt's Laptop | | | | (5.5%)
Intel Core2 Quad | | | | (1.2%)
Tesla 1060C* | | | | (0.8%)
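The Triad kernel itself is a one-liner; the rough NumPy sketch below shows what is being measured (the array size and the bandwidth estimate are illustrative, and a real STREAM run uses compiled loops and follows the benchmark's run rules):

```python
import time
import numpy as np

# STREAM Triad (WAXPY): w = y + alpha * x.
# Three memory streams (read x, read y, write w) but only 2 flops per
# element, so the achievable rate is set by memory bandwidth, not peak flops.
N = 2_000_000
x = np.full(N, 1.0)
y = np.full(N, 1.0)
alpha = 2.0

t0 = time.perf_counter()
w = y + alpha * x
dt = time.perf_counter() - t0

bytes_moved = 3 * N * 8  # float64 traffic, ignoring NumPy temporaries
print(f"approx. bandwidth: {bytes_moved / dt / 1e6:.0f} MB/s")
```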
16 Analysis of Sparse Matvec (SpMV) (What are the Problems?, Bandwidth)
Assumptions: no cache misses, no waits on memory references.
Notation:
- m: number of matrix rows
- nz: number of nonzero matrix elements
- V: number of vectors to multiply
We can look at the bandwidth needed for peak performance,
$\frac{8V + 2}{V} \frac{m}{nz} + \frac{6}{V}$ byte/flop
or the achievable performance given a bandwidth BW,
$\frac{V \cdot nz \cdot BW}{(8V + 2)m + 6\,nz}$ Mflop/s
Towards Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes, and Smith.
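These two bounds are easy to evaluate numerically; a small sketch (the formulas are as stated on the slide, the helper names are mine):

```python
def spmv_bytes_per_flop(m, nz, V=1):
    """Bandwidth needed for peak SpMV performance:
       (8V+2)/V * m/nz + 6/V bytes per flop."""
    return (8 * V + 2) / V * m / nz + 6 / V

def spmv_mflops(m, nz, bw_mbs, V=1):
    """Achievable SpMV rate given sustainable bandwidth bw_mbs (MB/s):
       V*nz*BW / ((8V+2)m + 6nz) Mflop/s."""
    return V * nz * bw_mbs / ((8 * V + 2) * m + 6 * nz)

# 3D FD Poisson has nz ~ 7m (7-point stencil); per-row units m=1, nz=7.
bf = spmv_bytes_per_flop(m=1, nz=7)  # ~ 7.43 byte/flop required for peak
```

Multiplying several vectors at once (V > 1) amortizes the matrix traffic, which is the "multiple vectors" improvement mentioned on the next slide.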
17 Improving Serial Performance (What are the Problems?, Bandwidth)
For a single matvec with 3D FD Poisson (nz ~ 7m, V = 1), Matt's laptop can achieve at most
$\frac{1}{(8 + 2)\frac{1}{7} + 6}$ flop/byte × (Triad bandwidth in MB/s) = 151 Mflop/s,
which is a dismal 8.8% of peak.
Can improve performance by blocking or multiplying multiple vectors, but operation issue limitations take over.
18 Improving Serial Performance (What are the Problems?, Bandwidth)
For a single matvec with 3D FD Poisson, Matt's laptop can achieve at most 151 Mflop/s, a dismal 8.8% of peak.
Better approaches:
- Unassembled operator application (spectral elements, FMM): $N$ data, $N^2$ computation
- Nonlinear evaluation (Picard, FAS, exact polynomial solvers): $N$ data, $N^k$ computation
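A matrix-free application of the 7-point 3D Poisson stencil illustrates the unassembled-operator idea: the only data stream is the vector itself, so the matrix bytes that dominate assembled SpMV disappear. This sketch uses NumPy slicing, with Dirichlet-style truncation at the boundary, purely for illustration:

```python
import numpy as np

def poisson3d_apply(u):
    """Matrix-free 7-point Laplacian on a 3D grid: v = A u with no stored A.
       Each point gets 6*u minus its six neighbors (missing neighbors at
       the boundary are treated as zero)."""
    v = 6.0 * u
    v[1:, :, :]  -= u[:-1, :, :]
    v[:-1, :, :] -= u[1:, :, :]
    v[:, 1:, :]  -= u[:, :-1, :]
    v[:, :-1, :] -= u[:, 1:, :]
    v[:, :, 1:]  -= u[:, :, :-1]
    v[:, :, :-1] -= u[:, :, 1:]
    return v

u = np.ones((5, 5, 5))
v = poisson3d_apply(u)
```

On a constant vector the interior result is zero (6 minus six neighbors), while a corner point keeps 6 minus its three existing neighbors.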
19 Outline (What are the Problems?, Synchronization)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
20 Synchronization (What are the Problems?)
Synchronization penalties can come from:
- Reductions: GMRES orthogonalization (more than 20% penalty for PFLOTRAN on a Cray XT5)
- Small subproblems: the multigrid coarse problem, lower levels of the Fast Multipole Method tree
21 Outline (What are the Problems?, Convergence)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
22 Convergence (What are the Problems?)
Convergence of the BFBT solve depends on:
- Viscosity contrast (slightly)
- Viscosity topology
- Mesh
Convergence of the AMG Poisson solve depends on:
- Mesh
23 Outline (Can we do Better?)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
4. Advantages and Disadvantages
5. What is Next?
24 Alternative Proposal (Can we do Better?)
Use a Boundary Element Method for the Laplace solves in BFBT, accelerated by FMM.
25 Missing Pieces (Can we do Better?)
- BEM discretization and assembly
- Solver for the BEM system
- Interpolation between FEM and BEM
26 Outline (Can we do Better?, BEM Formulation)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
27 Boundary Element Method (Can we do Better?, BEM Formulation)
The Poisson problem:
$\Delta u(x) = f(x)$ in $\Omega$
$u(x)|_{\partial\Omega} = g(x)$
28 Boundary Element Method (Can we do Better?, BEM Formulation)
The Poisson problem (Boundary Integral Equation formulation):
$C(x)\, u(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
$G(x,y) = \frac{1}{2\pi} \log r$
$F(x,y) = \frac{1}{2\pi r} \frac{\partial r}{\partial n}$
29 Boundary Element Method (Can we do Better?, BEM Formulation)
Restricting to the boundary, we see that
$\frac{1}{2} g(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
30 Boundary Element Method (Can we do Better?, BEM Formulation)
Discretizing, we have
$G q = \left( \tfrac{1}{2} I - F \right) g$
31 Boundary Element Method (Can we do Better?, BEM Formulation)
Now we can evaluate $u$ in the interior:
$u(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
32 Boundary Element Method (Can we do Better?, BEM Formulation)
Or in discrete form:
$u = F g - G q$
33 Boundary Element Method (Can we do Better?, BEM Formulation)
The sources in the interior may be added in using superposition:
$\frac{1}{2} g(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y) + \int_{\Omega} G(x,y)\, f(y)\, dy$
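The two discrete steps on the preceding slides (boundary solve for the flux, then interior evaluation) can be sketched with dense placeholder matrices. This is only a shape-level illustration: in the actual method the layer matrices G and F would be assembled from the Laplace-kernel integrals and applied matrix-free via FMM, and the boundary system would be solved with GMRES rather than a dense factorization.

```python
import numpy as np

def bem_solve_and_eval(Gb, Fb, Fi, Gi, g):
    """Discrete BEM for the Poisson problem:
       1) boundary solve   G q = (1/2 I - F) g   for the flux q = du/dn
       2) interior eval    u = F_i g - G_i q
       Gb, Fb: single/double layer matrices collocated on the boundary;
       Fi, Gi: the same kernels evaluated at interior points."""
    nb = len(g)
    rhs = (0.5 * np.eye(nb) - Fb) @ g
    q = np.linalg.solve(Gb, rhs)      # in practice: GMRES with FMM matvecs
    return Fi @ g - Gi @ q

# Placeholder matrices standing in for assembled kernel integrals.
rng = np.random.default_rng(1)
nb, ni = 6, 3
Gb = np.eye(nb) + 0.1 * rng.standard_normal((nb, nb))
Fb = 0.1 * rng.standard_normal((nb, nb))
Fi = rng.standard_normal((ni, nb))
Gi = rng.standard_normal((ni, nb))
g = rng.standard_normal(nb)
u = bem_solve_and_eval(Gb, Fb, Fi, Gi, g)
```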
34 Outline (Can we do Better?, BEM Solver)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
35 BEM Solver (Can we do Better?)
The solve has two pieces:
- Operator application (boundary solve and interior evaluation), accomplished using the Fast Multipole Method
- Iterative solver, usually GMRES; we use PETSc
36 Operator Application (Can we do Better?, BEM Solver)
Using the Fast Multipole Method, the Green's functions ($F$ and $G$) can be applied:
- in O(N) time
- using small memory bandwidth
- in the interior and on the boundary
- with much higher serial and parallel performance
37 Fast Multipole Method (Can we do Better?, BEM Solver)
FMM accelerates the calculation of the sum
$\Phi(x_i) = \sum_j K(x_i, x_j)\, q(x_j)$
- Accelerates O(N^2) to O(N) time
- The kernel $K(x_i, x_j)$ must decay quickly away from $(x_i, x_i)$, and can be singular on the diagonal (Calderón-Zygmund operator)
- Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
- Very similar to recent wavelet techniques
38 Fast Multipole Method (Can we do Better?, BEM Solver)
FMM accelerates the calculation of the sum
$\Phi(x_i) = \sum_j \frac{q_j}{|x_i - x_j|}$
- Accelerates O(N^2) to O(N) time
- The kernel $K(x_i, x_j)$ must decay quickly away from $(x_i, x_i)$, and can be singular on the diagonal (Calderón-Zygmund operator)
- Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
- Very similar to recent wavelet techniques
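For reference, the O(N^2) direct summation that FMM replaces is straightforward; a minimal sketch for this kernel (skipping the singular self-interaction term):

```python
import numpy as np

def direct_sum(x, q):
    """O(N^2) evaluation of Phi(x_i) = sum_{j != i} q_j / |x_i - x_j|,
       the all-pairs sum that FMM reduces to O(N)."""
    n = len(q)
    phi = np.zeros(n)
    for i in range(n):
        r = np.linalg.norm(x - x[i], axis=1)  # distances to all sources
        mask = r > 0                          # exclude the self term
        phi[i] = np.sum(q[mask] / r[mask])
    return phi

# Two unit charges one unit apart: each sees potential 1 from the other.
phi = direct_sum(np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]), np.ones(2))
```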
39 PetFMM CPU Performance: Strong Scaling (Can we do Better?, BEM Solver)
[Figure: speedup vs. number of processors for uniform 4ML8R5, uniform 10ML9R5, spiral 1ML8R5, and spiral w/ space-filling 1ML8R5, against perfect speedup]
40 PetFMM CPU Performance: Strong Scaling (Can we do Better?, BEM Solver)
[Figure: time (sec) vs. number of processors for ME initialization, upward sweep, downward sweep, evaluation, load-balancing stage, and total time]
41 PetFMM Load Balance (Can we do Better?, BEM Solver)
[Figure: load balance vs. number of processors for uniform 4ML8R5, uniform 10ML9R5, spiral 1ML8R5, and spiral w/ space-filling 1ML8R5]
42 GPU Performance (Can we do Better?, BEM Solver)
- In our C++ code on a CPU, M2L transforms take 85% of the time (this varies with N)
- The new M2L design was implemented using PyCUDA; a port to C++ is underway
- We can now achieve 500 GF on the NVIDIA Tesla; the previous best performance we found was 100 GF
- We will release PetFMM-GPU in the new year
46 PetFMM (Can we do Better?, BEM Solver)
PetFMM is a freely available implementation of the Fast Multipole Method:
- Leverages PETSc: same open source license, uses Sieve for parallelism
- Extensible design in C++: templated over the kernel, and over traversal for evaluation
- MPI implementation: novel parallel strategy for anisotropic/sparse particle distributions (PetFMM: a dynamically load-balancing parallel fast multipole library); 86% efficient strong scaling on 64 procs; example application using the Vortex Method for fluids (coming soon)
- GPU implementation
47 Convergence (Can we do Better?, BEM Solver)
- The BEM Laplace operator is well-conditioned: $\kappa = O(N_B) = O(\sqrt{N})$ (Dijkstra and Mattheij)
- Thus the total work is $O(N_B^2) = O(N)$, the same as MG
- Regular integral operators require only two multigrid cycles (multigrid of the 2nd kind, Hackbusch)
48 Outline (Can we do Better?, Interpolation)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
49 FEM and BEM (Can we do Better?, Interpolation)
FEM to BEM:
- FEM boundary conditions can be directly used in BEM
- May require a VecScatter
BEM to FEM:
- BEM can evaluate the field at any domain point
- Cost is linear in the number of evaluations using FMM
- Can accommodate both pointwise values and moments by quadrature
50 Outline (Advantages and Disadvantages)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages: Bandwidth, Convergence
5. What is Next?
51 Outline (Advantages and Disadvantages, Bandwidth)
4. Advantages and Disadvantages: Bandwidth, Convergence
52 Bandwidth and Serial Performance (Advantages and Disadvantages)
- Provably low bandwidth (Shang-Hua Teng, SISC 19(2), 1998)
- Key advantage over algebraic methods like FFT; similar to the wavelet transform
- Amenable to GPU implementation; also highly concurrent
53 Outline (Advantages and Disadvantages, Convergence)
4. Advantages and Disadvantages: Bandwidth, Convergence
54 Convergence and Synchronization (Advantages and Disadvantages)
- BEM matrices are better conditioned; however, FEM has better preconditioners
- Without better preconditioners, we might see more synchronization (underexplored)
- FMM can avoid the bottleneck at lower levels: overlap direct work with the lower tree levels; can provably eliminate the bottleneck
55 Debatable Advantages (Advantages and Disadvantages, Convergence)
- Small memory: FEM can be done matrix-free
- Opens the door to using the Stokes operator for the PC; we currently do not know what to do here
56 Outline (What is Next?)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
57 Direct Fast Method for Variable-Viscosity Stokes (What is Next?)
- Complexity not currently precisely quantified; we would like a given number of flops per digit of accuracy
- Brute force: use BEM to compute layers between regions of constant viscosity (better conditioned, but not direct)
- An elegant method should be possible: the operator is pseudo-differential, and kernel-independent FMM exists
2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy
More informationEfficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike
More informationGPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA
GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA ANSYS Fluent 2 Fluent control flow Accelerate this first Non-linear iterations Assemble
More informationParallel and Distributed Systems Lab.
Parallel and Distributed Systems Lab. Department of Computer Sciences Purdue University. Jie Chi, Ronaldo Ferreira, Ananth Grama, Tzvetan Horozov, Ioannis Ioannidis, Mehmet Koyuturk, Shan Lei, Robert Light,
More informationGPU Cluster Computing for FEM
GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing
More informationExploring unstructured Poisson solvers for FDS
Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests
More informationEfficient O(N log N) algorithms for scattered data interpolation
Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007
More informationGPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging
GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationAccelerating Double Precision FEM Simulations with GPUs
Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University
More information3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs
3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional
More informationParallel Computations
Parallel Computations Timo Heister, Clemson University heister@clemson.edu 2015-08-05 deal.ii workshop 2015 2 Introduction Parallel computations with deal.ii: Introduction Applications Parallel, adaptive,
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationAccelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead
Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU Robert Strzodka NVAMG Project Lead A Parallel Success Story in Five Steps 2 Step 1: Understand Application ANSYS Fluent Computational Fluid Dynamics
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationThe 3D DSC in Fluid Simulation
The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations
More informationMultigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationTowards Realistic Performance Bounds for Implicit CFD Codes
Towards Realistic Performance Bounds for Implicit CFD Codes W. D. Gropp, a D. K. Kaushik, b D. E. Keyes, c and B. F. Smith d a Mathematics and Computer Science Division, Argonne National Laboratory, Argonne,
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationApproaches to Parallel Implementation of the BDDC Method
Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More informationNumerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes
Numerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes Jung-Han Kimn 1 and Blaise Bourdin 2 1 Department of Mathematics and The Center for Computation and
More informationA Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationIntroduction to Multigrid and its Parallelization
Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are
More informationA fast solver for the Stokes equations with distributed forces in complex geometries 1
A fast solver for the Stokes equations with distributed forces in complex geometries George Biros, Lexing Ying, and Denis Zorin Courant Institute of Mathematical Sciences, New York University, New York
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationKrishnan Suresh Associate Professor Mechanical Engineering
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationHandling Parallelisation in OpenFOAM
Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationAdvanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm
Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Andreas Sandberg 1 Introduction The purpose of this lab is to: apply what you have learned so
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26
More informationOvercoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science
Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications
More informationACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016
ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2
More informationFast Multipole and Related Algorithms
Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general
More informationAn algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM
An algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM Axel Gerstenberger, Garth N. Wells, Chris Richardson, David Bernstein, and Ray Tuminaro FEniCS 14, Paris, France
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More information