Stokes Preconditioning on a GPU
1 Stokes Preconditioning on a GPU
Matthew Knepley 1,2, Dave A. Yuen, and Dave A. May
1 Computation Institute, University of Chicago
2 Department of Molecular Biology and Physiology, Rush University Medical Center
AGU 09, San Francisco, CA, December 15, 2009
M. Knepley (UC) GPU AGU09 1 / 44
2 Collaborators
- Prof. Dave Yuen, Dept. of Geology and Geophysics, University of Minnesota
- Dr. David May, developer of BFBT (in PETSc), Dept. of Earth Sciences, ETHZ
- Felipe Cruz, developer of FMM-GPU, Dept. of Applied Mathematics, University of Bristol
- Prof. Lorena Barba, Dept. of Mechanical Engineering, Boston University
3 Outline
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
4 BFBT (5 Slide Talk)
BFBT preconditions the Schur complement using
$S_b^{-1} = L_p^{-1} G^T K G L_p^{-1}$
where $L_p$ is the Laplacian in the pressure space.
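A minimal dense-matrix sketch of applying this preconditioner. All names here are illustrative: G stands for a hypothetical discrete pressure gradient, K for the velocity operator, and the callable applying $L_p^{-1}$ would in practice be an inexact solve (e.g. one AMG V-cycle), not a dense direct solve.

```python
import numpy as np

def apply_bfbt_inv(lp_inv, G, K, p):
    """Apply S_b^{-1} p = L_p^{-1} G^T K G L_p^{-1} p.
       lp_inv: callable applying L_p^{-1} to a pressure vector."""
    y = lp_inv(p)       # L_p^{-1} p
    y = G @ y           # pressure -> velocity space
    y = K @ y           # velocity operator
    y = G.T @ y         # velocity -> pressure space
    return lp_inv(y)    # final L_p^{-1}

# Toy sizes: n velocity dofs, m pressure dofs.
rng = np.random.default_rng(0)
n, m = 8, 4
G = rng.standard_normal((n, m))
K = np.eye(n)                    # identity viscosity block for the sanity check
Lp = G.T @ G                     # surrogate pressure Laplacian
lp_inv = lambda v: np.linalg.solve(Lp, v)
p = rng.standard_normal(m)
out = apply_bfbt_inv(lp_inv, G, K, p)
```

With K = I the middle factor $G^T K G$ equals $L_p$ itself, so the whole application collapses to a single $L_p^{-1} p$, which makes a convenient correctness check.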
5 Current Problems (5 Slide Talk)
The current BFBT code is limited by:
- Bandwidth constraints: the sparse matrix-vector product achieves at most 10% of peak performance
- Synchronization: GMRES orthogonalization, the coarse problem
- Convergence: viscosity variation, mesh dependence
6 Alternative Proposal (5 Slide Talk)
Use a Boundary Element Method for the Laplace solves in BFBT, accelerated by FMM.
7 Missing Pieces (5 Slide Talk)
- BEM discretization and assembly
  - Matrix-free operator application using the Fast Multipole Method
  - Overcomes the bandwidth limit: 480 GF on an NVIDIA 1060C GPU
  - Overcomes the coarse bottleneck by overlapping direct work
- Solver for the BEM system
  - Same total work as FEM due to the well-conditioned operator
  - Possibility of a multilevel preconditioner (even better)
- Interpolation between FEM and BEM
  - Boundary interpolation just averages
  - Can again use FMM for the interior
8 Direct Fast Method for Variable-Viscosity Stokes (5 Slide Talk)
- Complexity not currently precisely quantified; we would like a given number of flops per digit of accuracy
- Brute force: use BEM to compute layers between regions of constant viscosity (better conditioned, but not direct)
- An elegant method should be possible: the operator is pseudo-differential, and kernel-independent FMM exists
11 Outline (What are the Problems?)
1. 5 Slide Talk
2. What are the Problems?: Bandwidth, Synchronization, Convergence
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
12 Problems (What are the Problems?)
The current BFBT code is limited by:
- Bandwidth constraints
- Synchronization
- Convergence
13 Outline (What are the Problems?, Bandwidth)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
14 Bandwidth (What are the Problems?)
Small bandwidth to main memory can limit the performance of:
- Sparse matrix-vector product
- Operator application
- AMG restriction and interpolation
15 STREAM Benchmark (What are the Problems?, Bandwidth)
- Simple benchmark program measuring sustainable memory bandwidth
- Prototypical operation is Triad (WAXPY): $w = y + \alpha x$
- Measures the memory bandwidth bottleneck (much below peak)
- Datasets outstrip cache

Table: Bandwidth-limited machine performance
Machine | Peak (MF/s) | Triad (MB/s) | MF/MW | Eq. MF/s (% of peak)
Matt's Laptop | | | | (5.5%)
Intel Core2 Quad | | | | (1.2%)
Tesla 1060C* | | | | (0.8%)
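The Triad kernel itself is a one-liner; the rough NumPy sketch below shows what is being measured (the array size and the bandwidth estimate are illustrative, and a real STREAM run uses compiled loops and follows the benchmark's run rules):

```python
import time
import numpy as np

# STREAM Triad (WAXPY): w = y + alpha * x.
# Three memory streams (read x, read y, write w) but only 2 flops per
# element, so the achievable rate is set by memory bandwidth, not peak flops.
N = 2_000_000
x = np.full(N, 1.0)
y = np.full(N, 1.0)
alpha = 2.0

t0 = time.perf_counter()
w = y + alpha * x
dt = time.perf_counter() - t0

bytes_moved = 3 * N * 8  # float64 traffic, ignoring NumPy temporaries
print(f"approx. bandwidth: {bytes_moved / dt / 1e6:.0f} MB/s")
```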
16 Analysis of Sparse Matvec (SpMV) (What are the Problems?, Bandwidth)
Assumptions: no cache misses, no waits on memory references.
Notation:
- m: number of matrix rows
- nz: number of nonzero matrix elements
- V: number of vectors to multiply
We can look at the bandwidth needed for peak performance,
$\frac{8V + 2}{V} \frac{m}{nz} + \frac{6}{V}$ byte/flop
or the achievable performance given a bandwidth BW,
$\frac{V \cdot nz \cdot BW}{(8V + 2)m + 6\,nz}$ Mflop/s
Towards Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes, and Smith.
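These two bounds are easy to evaluate numerically; a small sketch (the formulas are as stated on the slide, the helper names are mine):

```python
def spmv_bytes_per_flop(m, nz, V=1):
    """Bandwidth needed for peak SpMV performance:
       (8V+2)/V * m/nz + 6/V bytes per flop."""
    return (8 * V + 2) / V * m / nz + 6 / V

def spmv_mflops(m, nz, bw_mbs, V=1):
    """Achievable SpMV rate given sustainable bandwidth bw_mbs (MB/s):
       V*nz*BW / ((8V+2)m + 6nz) Mflop/s."""
    return V * nz * bw_mbs / ((8 * V + 2) * m + 6 * nz)

# 3D FD Poisson has nz ~ 7m (7-point stencil); per-row units m=1, nz=7.
bf = spmv_bytes_per_flop(m=1, nz=7)  # ~ 7.43 byte/flop required for peak
```

Multiplying several vectors at once (V > 1) amortizes the matrix traffic, which is the "multiple vectors" improvement mentioned on the next slide.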
17 Improving Serial Performance (What are the Problems?, Bandwidth)
For a single matvec with 3D FD Poisson (nz ~ 7m, V = 1), Matt's laptop can achieve at most
$\frac{1}{(8 + 2)\frac{1}{7} + 6}$ flop/byte × (Triad bandwidth in MB/s) = 151 Mflop/s,
which is a dismal 8.8% of peak.
Can improve performance by blocking or multiplying multiple vectors, but operation issue limitations take over.
18 Improving Serial Performance (What are the Problems?, Bandwidth)
For a single matvec with 3D FD Poisson, Matt's laptop can achieve at most 151 Mflop/s, a dismal 8.8% of peak.
Better approaches:
- Unassembled operator application (spectral elements, FMM): $N$ data, $N^2$ computation
- Nonlinear evaluation (Picard, FAS, exact polynomial solvers): $N$ data, $N^k$ computation
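A matrix-free application of the 7-point 3D Poisson stencil illustrates the unassembled-operator idea: the only data stream is the vector itself, so the matrix bytes that dominate assembled SpMV disappear. This sketch uses NumPy slicing, with Dirichlet-style truncation at the boundary, purely for illustration:

```python
import numpy as np

def poisson3d_apply(u):
    """Matrix-free 7-point Laplacian on a 3D grid: v = A u with no stored A.
       Each point gets 6*u minus its six neighbors (missing neighbors at
       the boundary are treated as zero)."""
    v = 6.0 * u
    v[1:, :, :]  -= u[:-1, :, :]
    v[:-1, :, :] -= u[1:, :, :]
    v[:, 1:, :]  -= u[:, :-1, :]
    v[:, :-1, :] -= u[:, 1:, :]
    v[:, :, 1:]  -= u[:, :, :-1]
    v[:, :, :-1] -= u[:, :, 1:]
    return v

u = np.ones((5, 5, 5))
v = poisson3d_apply(u)
```

On a constant vector the interior result is zero (6 minus six neighbors), while a corner point keeps 6 minus its three existing neighbors.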
19 Outline (What are the Problems?, Synchronization)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
20 Synchronization (What are the Problems?)
Synchronization penalties can come from:
- Reductions: GMRES orthogonalization (more than 20% penalty for PFLOTRAN on a Cray XT5)
- Small subproblems: the multigrid coarse problem, lower levels of the Fast Multipole Method tree
21 Outline (What are the Problems?, Convergence)
2. What are the Problems?: Bandwidth, Synchronization, Convergence
22 Convergence (What are the Problems?)
Convergence of the BFBT solve depends on:
- Viscosity contrast (slightly)
- Viscosity topology
- Mesh
Convergence of the AMG Poisson solve depends on:
- Mesh
23 Outline (Can we do Better?)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
4. Advantages and Disadvantages
5. What is Next?
24 Alternative Proposal (Can we do Better?)
Use a Boundary Element Method for the Laplace solves in BFBT, accelerated by FMM.
25 Missing Pieces (Can we do Better?)
- BEM discretization and assembly
- Solver for the BEM system
- Interpolation between FEM and BEM
26 Outline (Can we do Better?, BEM Formulation)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
27 Boundary Element Method (Can we do Better?, BEM Formulation)
The Poisson problem:
$\Delta u(x) = f(x)$ in $\Omega$
$u(x)|_{\partial\Omega} = g(x)$
28 Boundary Element Method (Can we do Better?, BEM Formulation)
The Poisson problem (Boundary Integral Equation formulation):
$C(x)\, u(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
$G(x,y) = \frac{1}{2\pi} \log r$
$F(x,y) = \frac{1}{2\pi r} \frac{\partial r}{\partial n}$
29 Boundary Element Method (Can we do Better?, BEM Formulation)
Restricting to the boundary, we see that
$\frac{1}{2} g(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
30 Boundary Element Method (Can we do Better?, BEM Formulation)
Discretizing, we have
$G q = \left( \tfrac{1}{2} I - F \right) g$
31 Boundary Element Method (Can we do Better?, BEM Formulation)
Now we can evaluate $u$ in the interior:
$u(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y)$
32 Boundary Element Method (Can we do Better?, BEM Formulation)
Or in discrete form:
$u = F g - G q$
33 Boundary Element Method (Can we do Better?, BEM Formulation)
The sources in the interior may be added in using superposition:
$\frac{1}{2} g(x) = \int_{\partial\Omega} \left[ F(x,y)\, g(y) - G(x,y)\, \frac{\partial u(y)}{\partial n} \right] ds(y) + \int_{\Omega} G(x,y)\, f(y)\, dy$
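The two discrete steps on the preceding slides (boundary solve for the flux, then interior evaluation) can be sketched with dense placeholder matrices. This is only a shape-level illustration: in the actual method the layer matrices G and F would be assembled from the Laplace-kernel integrals and applied matrix-free via FMM, and the boundary system would be solved with GMRES rather than a dense factorization.

```python
import numpy as np

def bem_solve_and_eval(Gb, Fb, Fi, Gi, g):
    """Discrete BEM for the Poisson problem:
       1) boundary solve   G q = (1/2 I - F) g   for the flux q = du/dn
       2) interior eval    u = F_i g - G_i q
       Gb, Fb: single/double layer matrices collocated on the boundary;
       Fi, Gi: the same kernels evaluated at interior points."""
    nb = len(g)
    rhs = (0.5 * np.eye(nb) - Fb) @ g
    q = np.linalg.solve(Gb, rhs)      # in practice: GMRES with FMM matvecs
    return Fi @ g - Gi @ q

# Placeholder matrices standing in for assembled kernel integrals.
rng = np.random.default_rng(1)
nb, ni = 6, 3
Gb = np.eye(nb) + 0.1 * rng.standard_normal((nb, nb))
Fb = 0.1 * rng.standard_normal((nb, nb))
Fi = rng.standard_normal((ni, nb))
Gi = rng.standard_normal((ni, nb))
g = rng.standard_normal(nb)
u = bem_solve_and_eval(Gb, Fb, Fi, Gi, g)
```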
34 Outline (Can we do Better?, BEM Solver)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
35 BEM Solver (Can we do Better?)
The solve has two pieces:
- Operator application (boundary solve and interior evaluation), accomplished using the Fast Multipole Method
- Iterative solver, usually GMRES; we use PETSc
36 Operator Application (Can we do Better?, BEM Solver)
Using the Fast Multipole Method, the Green's functions ($F$ and $G$) can be applied:
- in O(N) time
- using small memory bandwidth
- in the interior and on the boundary
- with much higher serial and parallel performance
37 Fast Multipole Method (Can we do Better?, BEM Solver)
FMM accelerates the calculation of the sum
$\Phi(x_i) = \sum_j K(x_i, x_j)\, q(x_j)$
- Accelerates O(N^2) to O(N) time
- The kernel $K(x_i, x_j)$ must decay quickly away from $(x_i, x_i)$, and can be singular on the diagonal (Calderón-Zygmund operator)
- Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
- Very similar to recent wavelet techniques
38 Fast Multipole Method (Can we do Better?, BEM Solver)
FMM accelerates the calculation of the sum
$\Phi(x_i) = \sum_j \frac{q_j}{|x_i - x_j|}$
- Accelerates O(N^2) to O(N) time
- The kernel $K(x_i, x_j)$ must decay quickly away from $(x_i, x_i)$, and can be singular on the diagonal (Calderón-Zygmund operator)
- Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
- Very similar to recent wavelet techniques
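For reference, the O(N^2) direct summation that FMM replaces is straightforward; a minimal sketch for this kernel (skipping the singular self-interaction term):

```python
import numpy as np

def direct_sum(x, q):
    """O(N^2) evaluation of Phi(x_i) = sum_{j != i} q_j / |x_i - x_j|,
       the all-pairs sum that FMM reduces to O(N)."""
    n = len(q)
    phi = np.zeros(n)
    for i in range(n):
        r = np.linalg.norm(x - x[i], axis=1)  # distances to all sources
        mask = r > 0                          # exclude the self term
        phi[i] = np.sum(q[mask] / r[mask])
    return phi

# Two unit charges one unit apart: each sees potential 1 from the other.
phi = direct_sum(np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]), np.ones(2))
```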
39 PetFMM CPU Performance: Strong Scaling (Can we do Better?, BEM Solver)
[Figure: speedup vs. number of processors for uniform 4ML8R5, uniform 10ML9R5, spiral 1ML8R5, and spiral w/ space-filling 1ML8R5, against perfect speedup]
40 PetFMM CPU Performance: Strong Scaling (Can we do Better?, BEM Solver)
[Figure: time (sec) vs. number of processors for ME initialization, upward sweep, downward sweep, evaluation, load-balancing stage, and total time]
41 PetFMM Load Balance (Can we do Better?, BEM Solver)
[Figure: load balance vs. number of processors for uniform 4ML8R5, uniform 10ML9R5, spiral 1ML8R5, and spiral w/ space-filling 1ML8R5]
42 GPU Performance (Can we do Better?, BEM Solver)
- In our C++ code on a CPU, M2L transforms take 85% of the time (this varies with N)
- The new M2L design was implemented using PyCUDA; a port to C++ is underway
- We can now achieve 500 GF on the NVIDIA Tesla; the previous best performance we found was 100 GF
- We will release PetFMM-GPU in the new year
46 PetFMM (Can we do Better?, BEM Solver)
PetFMM is a freely available implementation of the Fast Multipole Method:
- Leverages PETSc: same open source license, uses Sieve for parallelism
- Extensible design in C++: templated over the kernel, and over traversal for evaluation
- MPI implementation: novel parallel strategy for anisotropic/sparse particle distributions (PetFMM: a dynamically load-balancing parallel fast multipole library); 86% efficient strong scaling on 64 procs; example application using the Vortex Method for fluids (coming soon)
- GPU implementation
47 Convergence (Can we do Better?, BEM Solver)
- The BEM Laplace operator is well-conditioned: $\kappa = O(N_B) = O(\sqrt{N})$ (Dijkstra and Mattheij)
- Thus the total work is $O(N_B^2) = O(N)$, the same as MG
- Regular integral operators require only two multigrid cycles (multigrid of the 2nd kind, Hackbusch)
48 Outline (Can we do Better?, Interpolation)
3. Can we do Better?: BEM Formulation, BEM Solver, Interpolation
49 FEM and BEM (Can we do Better?, Interpolation)
FEM to BEM:
- FEM boundary conditions can be directly used in BEM
- May require a VecScatter
BEM to FEM:
- BEM can evaluate the field at any domain point
- Cost is linear in the number of evaluations using FMM
- Can accommodate both pointwise values and moments by quadrature
50 Outline (Advantages and Disadvantages)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages: Bandwidth, Convergence
5. What is Next?
51 Outline (Advantages and Disadvantages, Bandwidth)
4. Advantages and Disadvantages: Bandwidth, Convergence
52 Bandwidth and Serial Performance (Advantages and Disadvantages)
- Provably low bandwidth (Shang-Hua Teng, SISC 19(2), 1998)
- Key advantage over algebraic methods like FFT; similar to the wavelet transform
- Amenable to GPU implementation; also highly concurrent
53 Outline (Advantages and Disadvantages, Convergence)
4. Advantages and Disadvantages: Bandwidth, Convergence
54 Convergence and Synchronization (Advantages and Disadvantages)
- BEM matrices are better conditioned; however, FEM has better preconditioners
- Without better preconditioners, we might see more synchronization (underexplored)
- FMM can avoid the bottleneck at lower levels: overlap direct work with the lower tree levels; can provably eliminate the bottleneck
55 Debatable Advantages (Advantages and Disadvantages, Convergence)
- Small memory: FEM can be done matrix-free
- Opens the door to using the Stokes operator for the PC; we currently do not know what to do here
56 Outline (What is Next?)
1. 5 Slide Talk
2. What are the Problems?
3. Can we do Better?
4. Advantages and Disadvantages
5. What is Next?
57 Direct Fast Method for Variable-Viscosity Stokes (What is Next?)
- Complexity not currently precisely quantified; we would like a given number of flops per digit of accuracy
- Brute force: use BEM to compute layers between regions of constant viscosity (better conditioned, but not direct)
- An elegant method should be possible: the operator is pseudo-differential, and kernel-independent FMM exists
2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy
More informationEfficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike
More informationGPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA
GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA ANSYS Fluent 2 Fluent control flow Accelerate this first Non-linear iterations Assemble
More informationParallel and Distributed Systems Lab.
Parallel and Distributed Systems Lab. Department of Computer Sciences Purdue University. Jie Chi, Ronaldo Ferreira, Ananth Grama, Tzvetan Horozov, Ioannis Ioannidis, Mehmet Koyuturk, Shan Lei, Robert Light,
More informationGPU Cluster Computing for FEM
GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing
More informationExploring unstructured Poisson solvers for FDS
Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests
More informationEfficient O(N log N) algorithms for scattered data interpolation
Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007
More informationGPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging
GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationAccelerating Double Precision FEM Simulations with GPUs
Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University
More information3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs
3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional
More informationParallel Computations
Parallel Computations Timo Heister, Clemson University heister@clemson.edu 2015-08-05 deal.ii workshop 2015 2 Introduction Parallel computations with deal.ii: Introduction Applications Parallel, adaptive,
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationAccelerated ANSYS Fluent: Algebraic Multigrid on a GPU. Robert Strzodka NVAMG Project Lead
Accelerated ANSYS Fluent: Algebraic Multigrid on a GPU Robert Strzodka NVAMG Project Lead A Parallel Success Story in Five Steps 2 Step 1: Understand Application ANSYS Fluent Computational Fluid Dynamics
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationThe 3D DSC in Fluid Simulation
The 3D DSC in Fluid Simulation Marek K. Misztal Informatics and Mathematical Modelling, Technical University of Denmark mkm@imm.dtu.dk DSC 2011 Workshop Kgs. Lyngby, 26th August 2011 Governing Equations
More informationMultigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationTowards Realistic Performance Bounds for Implicit CFD Codes
Towards Realistic Performance Bounds for Implicit CFD Codes W. D. Gropp, a D. K. Kaushik, b D. E. Keyes, c and B. F. Smith d a Mathematics and Computer Science Division, Argonne National Laboratory, Argonne,
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationApproaches to Parallel Implementation of the BDDC Method
Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More informationNumerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes
Numerical Implementation of Overlapping Balancing Domain Decomposition Methods on Unstructured Meshes Jung-Han Kimn 1 and Blaise Bourdin 2 1 Department of Mathematics and The Center for Computation and
More informationA Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationIntroduction to Multigrid and its Parallelization
Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are
More informationA fast solver for the Stokes equations with distributed forces in complex geometries 1
A fast solver for the Stokes equations with distributed forces in complex geometries George Biros, Lexing Ying, and Denis Zorin Courant Institute of Mathematical Sciences, New York University, New York
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationKrishnan Suresh Associate Professor Mechanical Engineering
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationHandling Parallelisation in OpenFOAM
Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationAdvanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm
Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Andreas Sandberg 1 Introduction The purpose of this lab is to: apply what you have learned so
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26
More informationOvercoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science
Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications
More informationACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016
ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2
More informationFast Multipole and Related Algorithms
Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general
More informationAn algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM
An algebraic multi-grid implementation in FEniCS for solving 3d fracture problems using XFEM Axel Gerstenberger, Garth N. Wells, Chris Richardson, David Bernstein, and Ray Tuminaro FEniCS 14, Paris, France
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More information