How to Optimize Geometric Multigrid Methods on GPUs

1 How to Optimize Geometric Multigrid Methods on GPUs
Markus Stürmer, Harald Köstler, Ulrich Rüde
System Simulation Group, University Erlangen
March 31st 2011 at Copper

2 Schedule
- motivation
- imaging in gradient space
- HDR compression
- evading the memory wall
- tuning an MG for Poisson's equation
- wavefront parallelization
- implementation details
- performance results
- conclusions

4 Motivation
- image processing in gradient space
  - efficient algorithms for various applications
  - simple transformation (finite difference stencil)
  - retransformation requires solution of Poisson's equation: $\Delta u = f$ in $\Omega$, $u = 0$ on $\partial\Omega$
- »A Multigrid Tutorial« [Briggs/Henson/McCormick]: Chapter 1 only!
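
The link between the finite difference transformation and Poisson's equation can be sketched as follows. This is only an illustration under assumptions not stated on the slide: unit grid spacing, forward differences for the gradient, backward differences for the divergence, and Poisson's equation written as $\Delta u = f$. With these choices the divergence of the gradient field is exactly the 5-point Laplacian, so recovering the image from a (possibly modified) gradient field means solving a discrete Poisson problem. All function names are hypothetical.

```python
def gradient(u):
    """Forward differences; entries without a forward neighbor are set to 0."""
    n, m = len(u), len(u[0])
    gx = [[u[i][j + 1] - u[i][j] if j + 1 < m else 0.0 for j in range(m)] for i in range(n)]
    gy = [[u[i + 1][j] - u[i][j] if i + 1 < n else 0.0 for j in range(m)] for i in range(n)]
    return gx, gy

def divergence(gx, gy):
    """Backward differences of the gradient field at interior points."""
    n, m = len(gx), len(gx[0])
    f = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            f[i][j] = (gx[i][j] - gx[i][j - 1]) + (gy[i][j] - gy[i - 1][j])
    return f

def laplacian(u):
    """5-point stencil at interior points (unit grid spacing)."""
    n, m = len(u), len(u[0])
    f = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            f[i][j] = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j]
    return f

# Small test image: u = i^2 + 2 j, whose discrete Laplacian is 2 everywhere inside.
image = [[float(i * i + 2 * j) for j in range(6)] for i in range(5)]
rhs = divergence(*gradient(image))
```

In an application such as HDR compression the gradient field would be modified before the divergence is taken, which is why the image has to be recovered by a Poisson solve rather than by simple integration.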

5 High dynamic range compression
- gradient is locally adjusted, based on »Gradient Domain High Dynamic Range Compression« [Fattal/Lischinski/Werman]
- first implementation by Harald Köstler
- C for CUDA on NVIDIA GPUs (addressing compute capability 1.2/1.3)

6 Multigrid for Poisson's equation
- memory bound
  - RBGS smoother has 2 Flops per main memory access in the ideal case
  - modern hardware has a ratio of around 25:1
- save memory transfer at all cost!
  - use single precision (sufficient for image processing)
  - use hardware that provides high bandwidth (GPGPUs)
- many of the optimizations used were described long ago: »Multi-Level Adaptive Solutions to Boundary-Value Problems« [Brandt], 1977
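
As a rough worked example of what "memory bound" means here, assume the 25:1 ratio is the hardware's peak Flops per main memory access. The smoother then cannot use more than the following fraction of the peak arithmetic rate, however well the arithmetic itself is tuned:

```latex
\frac{\text{Flops per access (RBGS, ideal)}}{\text{Flops per access (hardware)}}
  = \frac{2}{25} = 0.08
```

Under this assumption at most about 8 % of peak compute is reachable, so every saved memory transfer translates almost directly into run time.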

7 Exploit color splitting
- split red and black grid data into separate arrays
  - an RBGS update will cause only half the data transfer (2/3 for allocate-on-write caches)
- use half-weighting for restriction
  - becomes weighted injection for RBGS without over-relaxation
  - residual is not stored
  - only half of fine grid data needs to be read
- correct only black unknowns
  - red unknowns will not be used, but overwritten by post-smoothing RBGS
- not even half as much data is accessed as with mixed arrays, stored residual and full-weighting restriction
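
The claim that half-weighting becomes weighted injection can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes the 5-point discretization of $\Delta u = f$, a hypothetical 9x9 grid, red points where i + j is even, and a mixed (not color-split) array layout for brevity. After the black half-sweep all black residuals vanish, so the half-weighting stencil (center weight 4/8, four neighbors 1/8 each) reduces to 1/2 times the residual at the coarse point.

```python
import random

N = 9                       # grid points per dimension including boundary (hypothetical)
H2 = (1.0 / (N - 1)) ** 2   # squared mesh width

def residual(u, f, i, j):
    """r = f - Laplace_h(u) at interior point (i, j), 5-point stencil."""
    lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j]) / H2
    return f[i][j] - lap

def half_sweep(u, f, color):
    """Gauss-Seidel update of all interior points with (i + j) % 2 == color."""
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            if (i + j) % 2 == color:
                u[i][j] = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                           - H2 * f[i][j]) / 4.0

random.seed(0)
f = [[random.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(N)]
u = [[0.0] * N for _ in range(N)]   # zero initial guess, homogeneous Dirichlet boundary

half_sweep(u, f, 0)   # red:   i + j even
half_sweep(u, f, 1)   # black: i + j odd

# Right after the black half-sweep the residual at every black point is zero (up to rounding).
max_black = max(abs(residual(u, f, i, j))
                for i in range(1, N - 1) for j in range(1, N - 1) if (i + j) % 2 == 1)

def half_weighting(i, j):
    r = lambda a, b: residual(u, f, a, b)
    return (4.0 * r(i, j) + r(i - 1, j) + r(i + 1, j) + r(i, j - 1) + r(i, j + 1)) / 8.0

def weighted_injection(i, j):
    return 0.5 * residual(u, f, i, j)

# Coarse points have even i and j, so they are red and all four neighbors are black.
max_diff = max(abs(half_weighting(i, j) - weighted_injection(i, j))
               for i in range(2, N - 2, 2) for j in range(2, N - 2, 2))
```

Because only the residual at the coarse points is needed, it can be computed on the fly during restriction and never has to be stored.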

8 Performance of first version
- half of an NVIDIA GTX 295 (2009), compute capability 1.3
  - 112 GB/s peak bandwidth
  - approx. 8.8 ms for a V(2,2) cycle
- NVIDIA GTX 480 (2010), compute capability 2.0 (Fermi)
  - 177 GB/s peak bandwidth
  - approx. 4.7 ms for a V(2,2) cycle
  - relative to the GTX 295 half: bandwidth 1.6x, performance [...]
- Are there additional optimizations?
- Is there a way to get around the bandwidth limitations?

9 Traditional optimization technique: cache blocking
- on standard CPUs: reuse data that resides in the cache hierarchy as often as possible
  - spatial cache blocking: traverse the grid to increase cache reuse (tiling, space-filling curves)
  - temporal cache blocking: perform additional operations on available data
    - simple example: black update at (i,j-1) immediately after red update at (i,j)
    - fusion of a single red and black GS update is only as efficient as color splitting
- difficulties in transferring this concept to GPUs
  - GPUs have no coherent caches
  - thread blocks / workgroups naturally perform cache tiling
  - low amount of storage for a necessarily high number of threads
  - traditional cache blocking techniques are horrible to parallelize
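
The "simple example" of temporal blocking can be sketched as a single fused sweep. The code is a minimal illustration under assumptions: the 5-point discretization of $\Delta u = f$, red points where i + j is even, and j as the slow loop index. When the red point (i, j) has just been updated, all four red neighbors of the black point (i, j-1) are final, so it can be updated immediately while the data is still close to the processor. The function names are hypothetical.

```python
import copy
import random

def update(u, f, h2, i, j):
    """Gauss-Seidel update of one point for the 5-point stencil."""
    u[i][j] = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] - h2 * f[i][j]) / 4.0

def rbgs_two_pass(u, f, h2):
    """Classic RBGS: one pass over all red points, then one over all black points."""
    n = len(u)
    for color in (0, 1):              # 0 = red, 1 = black
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                if (i + j) % 2 == color:
                    update(u, f, h2, i, j)

def rbgs_fused(u, f, h2):
    """Single pass: black update at (i, j-1) directly after red update at (i, j)."""
    n = len(u)
    for j in range(1, n - 1):         # j is the slow index
        for i in range(1, n - 1):
            if (i + j) % 2 == 0:      # red point (i, j)
                update(u, f, h2, i, j)
                if j - 1 >= 1:        # black point one line behind
                    update(u, f, h2, i, j - 1)
    j = n - 2                         # black points of the last interior line
    for i in range(1, n - 1):
        if (i + j) % 2 == 1:
            update(u, f, h2, i, j)

random.seed(1)
n, h2 = 8, (1.0 / 7) ** 2
u0 = [[random.random() for _ in range(n)] for _ in range(n)]
f = [[random.random() for _ in range(n)] for _ in range(n)]
a, b = copy.deepcopy(u0), copy.deepcopy(u0)
rbgs_two_pass(a, f, h2)
rbgs_fused(b, f, h2)
max_diff = max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))
```

Both variants perform the same updates with the same inputs, only in a different order, so the results agree.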

10 Wavefront example: 1D 3-point stencil [figure: target array, (volatile) buffer, source array]
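
The animation on these slides can be approximated in code. The following is a minimal sketch, assuming a simple averaging 3-point stencil, fixed boundary values and exactly two fused sweeps; the names are hypothetical. The first sweep runs ahead through the source array, its results are kept only in a two-element volatile buffer, and the second sweep trails one position behind and writes the target array. Intermediate values never reach main memory.

```python
def stencil(a, b, c):
    """Hypothetical 3-point stencil: plain average of the three values."""
    return (a + b + c) / 3.0

def two_sweeps_reference(src):
    """Two separate sweeps, each reading and writing a full array."""
    cur = list(src)
    for _ in range(2):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = stencil(cur[i - 1], cur[i], cur[i + 1])
        cur = nxt
    return cur

def two_sweeps_wavefront(src):
    """Both sweeps in one pass; first-sweep values exist only in a tiny buffer."""
    n = len(src)
    dst = [0.0] * n
    dst[0], dst[n - 1] = src[0], src[n - 1]   # boundary values stay fixed
    older, newer = None, src[0]               # volatile buffer: first-sweep values at i-2, i-1
    for i in range(1, n):
        # first sweep at position i (the right boundary value is passed through)
        first = stencil(src[i - 1], src[i], src[i + 1]) if i < n - 1 else src[i]
        # second sweep trails one position behind the first
        if i - 1 >= 1:
            dst[i - 1] = stencil(older, newer, first)
        older, newer = newer, first
    return dst

source = [float(k * k) for k in range(7)]
reference = two_sweeps_reference(source)
fused = two_sweeps_wavefront(source)
```

The source array is read once and the target array written once, regardless of how many sweeps are fused; only the size of the volatile buffer grows with the number of sweeps.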

18 Wavefront++

19 Wavefront parallelization [figure: wavefronts of Thread 1 and Thread 2]

20 Multidimensional wavefronts [figure: wavefronts of Thread 1, Thread 2 and Thread 3; annotation: about 380x22]

21 Portable and flexible approach
- applicable for various architectures
  - natural approach for Cell
  - GPUs use shared memory for intermediate buffers
  - standard CPUs have no special resources for volatile data; memory is held in cache by regular touching
- »Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization« [Wellein/Hager/Zeiser/Wittmann/Fehske]
- »High Performance Stencil Code Algorithms for GPGPUs« [Schäfer/Fey] (submitted)

22 A wavefront multigrid without red unknowns
- implemented in OpenCL
- fusion of operations
  - at least whole RBGS iterations are performed in a single wave / sweep
  - red unknowns are computed in volatile storage only
  - whole RBGS iterations can also be fused with successive operations
    - RBGS + computation of residual + half-weighting restriction to lower level
    - RBGS + interpolation to / correction of higher level
- further optimizations
  - zero as initial guess: first Jacobi iteration or red GS update yields scaled RHS
  - specialized kernel: hard-coded V(1,1) for 33x33
  - homogeneous Dirichlet boundary conditions: never stored
  - don't write RBGS results correction data
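
Why a zero initial guess "yields scaled RHS" follows directly from the update formula. As a short derivation, assume the 5-point discretization of $\Delta u = f$ with mesh width $h$ (the sign convention is an assumption; with $-\Delta u = f$ the sign flips):

```latex
u_{i,j} \leftarrow \frac{1}{4}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - h^2 f_{i,j}\right)
\qquad\xrightarrow{\;u \equiv 0\;}\qquad
u_{i,j} \leftarrow -\frac{h^2}{4}\, f_{i,j}
```

The first red update on a zero guess therefore needs no read of the solution array at all, only of the right-hand side.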

24 Potential of fused kernels
- maximal reduction in memory transfer compared to separate functions/kernels using color splitting:
  kernel                update    initial
  RBGS                  2/3       1/2
  RBGS + restriction    9/17      7/17
  RBGS + correction     11/16     10/16
- fewer kernel calls
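
Some of these ratios can be reproduced with a simple count of array reads and writes. The counting model below is an assumption, not taken from the slides: traffic is measured in units of one coarse-grid array (a quarter of a fine-grid array), each color holds half of the fine-grid points, cache effects are ignored, and a zero initial guess ("initial") saves the read of the black unknowns. It matches the slide's values for RBGS and RBGS + restriction; the correction case is not modeled here.

```python
from fractions import Fraction

# Array sizes in units of one coarse-grid array (1/4 of a fine-grid array).
RED = BLACK = F_RED = F_BLACK = 2
COARSE = 1

# Separate kernels with color splitting
red_sweep = BLACK + F_RED + RED          # read black, read rhs, write red
black_sweep = RED + F_BLACK + BLACK      # read red, read rhs, write black
rbgs_separate = red_sweep + black_sweep  # 12 units
# Weighted-injection restriction: read red values and rhs at the coarse points,
# read the black neighbors, write the coarse right-hand side.
restrict_separate = COARSE + COARSE + BLACK + COARSE   # 5 units

# Fused kernel: red unknowns live in volatile storage only
rbgs_fused = BLACK + F_RED + F_BLACK + BLACK   # read black, read rhs, write black
rbgs_fused_initial = rbgs_fused - BLACK        # zero guess: black values need not be read

ratio_rbgs = Fraction(rbgs_fused, rbgs_separate)
ratio_rbgs_initial = Fraction(rbgs_fused_initial, rbgs_separate)
ratio_restrict = Fraction(rbgs_fused + COARSE, rbgs_separate + restrict_separate)
ratio_restrict_initial = Fraction(rbgs_fused_initial + COARSE,
                                  rbgs_separate + restrict_separate)
```

In the fused restriction kernel only the coarse right-hand side is written in addition to the RBGS traffic, which is where the numerator 9 = 8 + 1 comes from in this model.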

25 Efficiency of wavefront implementation
- How much bandwidth would an optimal non-blocking implementation require?
  size: [...]
  kernel                       GPU time    equivalent BW
  RBGS update                  [...] µs    165 GB/s
  RBGS update + restriction    [...] µs    172 GB/s
  RBGS update + correction     [...] µs    133 GB/s
- peak: 177 GB/s, typical: 110 to 150 GB/s, max. 160 GB/s

27 Efficiency of wavefront implementation
- How much more memory was transferred than absolutely necessary?
  size: [...]
  kernel                       measured BW    BW overhead
  RBGS update                  [...] GB/s     13 %
  RBGS update + restriction    [...] GB/s     19 %
  RBGS update + correction     [...] GB/s     14 %
  separate RGS / BGS           [...] GB/s     < 2 %
- separate RGS / BGS: 2 x 193 µs = 386 µs vs. 305 µs => speedup > 25 % (33 % for CPU time)
- peak: 177 GB/s, typical: 110 to 150 GB/s, max. 160 GB/s
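
The notion of "equivalent bandwidth" can be made concrete with a short calculation. The inputs are assumptions, since the size and time columns did not survive: a 2049x2049 grid (the largest size on the timing slide), single precision, 305 µs taken as the GPU time of the fused RBGS update, and an optimal non-blocking RBGS that reads two half arrays and writes one in each half-sweep. Under these assumptions the result comes out close to the 165 GB/s quoted for the RBGS update.

```python
n = 2049                 # assumed grid points per dimension
bytes_per_value = 4      # single precision
fused_time = 305e-6      # seconds; assumed GPU time of the fused RBGS update
separate_time = 2 * 193e-6   # seconds; separate red and black kernels

fine_array_bytes = n * n * bytes_per_value

# Non-blocking RBGS with color splitting: 2 half-sweeps x 3 half arrays = 3 full arrays
nonblocking_bytes = 3 * fine_array_bytes

# Bandwidth a non-blocking kernel would need to finish in the same time
equivalent_bw_gbs = nonblocking_bytes / fused_time / 1e9

# Speedup of the fused kernel over the two separate kernels
speedup = separate_time / fused_time - 1.0
```

An equivalent bandwidth above the typically achievable 110 to 150 GB/s indicates that the fused kernel is faster than any implementation that streams all arrays through main memory could be.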

29 Wall clock time / unknowns per second
  size         V(2,2)               FMG(V,2,2)
  2049x2049    2.6 ms / 1.6 Gu/s    6.0 ms / 700 Mu/s
  1025x1025    1.2 ms / 880 Mu/s    3.0 ms / 350 Mu/s
  513x513      0.6 ms / 390 Mu/s    1.8 ms / 150 Mu/s
  257x257      0.4 ms / 145 Mu/s    1.1 ms / 60 Mu/s
- V(2,2): 80 % speedup
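
The throughput figures follow from grid size and wall clock time. The sketch below assumes a 2049x2049 grid and that all grid points, including boundary points, are counted as unknowns. It also assumes that the 80 % speedup refers to the 4.7 ms V(2,2) cycle of the first version on the GTX 480; that reading is an inference, not something the slide states.

```python
def unknowns_per_second(n, milliseconds):
    """Throughput for an n x n grid solved in the given wall clock time."""
    return n * n / (milliseconds * 1e-3)

v_cycle_rate = unknowns_per_second(2049, 2.6)   # V(2,2)
fmg_rate = unknowns_per_second(2049, 6.0)       # FMG(V,2,2)

# Assumed baseline: 4.7 ms per V(2,2) cycle for the first version
speedup_vs_first_version = 4.7 / 2.6 - 1.0
```

With these inputs the V-cycle rate rounds to 1.6 Gu/s, the FMG rate to 700 Mu/s, and the speedup to roughly 80 %.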

30 Profiler data: V(2,2), 2049x2049 [chart: gpu time, cpu time, runtime; surviving labels: 5 %, 77 %]

31 Conclusions
- wavefront implementations
  - doable on GPGPUs, also for non-trivial (but simple) algorithms
  - can defer the memory wall to some extent
  - tedious and error-prone
- but at least as important are tuning of algorithm and memory layout to meet the architectural requirements
- future work
  - back-port to CPU in OpenCL
  - finding optimal FMG/V-cycle tuning parameters for lower levels (currently optimized for finest level)

33 Thank you very much for your attention! Questions
