Analyzing and improving performance portability of OpenCL applications via auto-tuning


1 Analyzing and improving performance portability of OpenCL applications via auto-tuning. James Price & Simon McIntosh-Smith, University of Bristol, High Performance Computing Group. Funded in part by Imagination Technologies.

2 Overview
Performance portability is one of the key concerns for developers targeting many different architectures. Current work in this area has provided mixed results. Here we propose a method of rigorously analyzing performance portability by exploiting the black-box nature of auto-tuning. We also present a simple technique that can improve performance portability when using auto-tuning.

3 Analysis approach
Expose the set of possible implementation decisions as tuning options. Dynamically generate kernels to provide greater flexibility. Run the auto-tuner to optimize the kernel for each device individually; this produces a set of architecture-specific kernels. We then run each architecture-specific kernel on every other device and measure efficiency.
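
The cross-device measurement step above can be sketched in a few lines. This is only an illustration: the slides do not define efficiency precisely, so the sketch assumes it is the best runtime observed on a device divided by the runtime of the kernel under test. The device names and timings are hypothetical.

```python
def efficiency_matrix(runtimes):
    """Return eff[tuned_for][running_on] in the range 0..1.

    runtimes[tuned_for][running_on] is a runtime in seconds, or None if the
    kernel failed to run on that device.
    Assumption: efficiency = best runtime seen on the device / this runtime.
    """
    devices = sorted(runtimes)
    best = {}
    for dev in devices:
        times = [runtimes[k][dev] for k in devices
                 if runtimes[k].get(dev) is not None]
        best[dev] = min(times)
    eff = {}
    for k in devices:
        eff[k] = {}
        for dev in devices:
            t = runtimes[k].get(dev)
            # A kernel that cannot run on a device is given zero efficiency.
            eff[k][dev] = best[dev] / t if t is not None else 0.0
    return eff


# Hypothetical timings in seconds, invented for illustration only.
runtimes = {
    "gpu_a": {"gpu_a": 1.0, "gpu_b": 4.0},
    "gpu_b": {"gpu_a": 2.0, "gpu_b": 2.0},
}
eff = efficiency_matrix(runtimes)
```

Under this definition each device's own tuned kernel scores 1.0 when it is the fastest one measured there, and every other cell shows how much is lost by reusing a kernel tuned elsewhere.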

4 Benchmark #1: Jacobi method
Solve $Ax = b$: split matrix $A$ into its diagonal $D$ and remainder $R$, then iterate $x_{i+1} = D^{-1}(b - R x_i)$.
Tuning options: work-group size / parallel decomposition; memory layout / memory access pattern; loop unrolling; *+ vs mad vs fma; branching vs masks; division by diagonal (inline or precompute); address spaces of input vectors; embedding values into kernel as constants.
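
For reference, the Jacobi update can be written as a short serial sketch. This is not the benchmark's OpenCL kernel; it is plain Python, with a fixed iteration count and a zero starting vector chosen as assumptions. It precomputes the reciprocal of the diagonal, which corresponds to the "precompute" choice among the tuning options.

```python
def jacobi(A, b, iterations=50):
    """Approximate the solution of A x = b with Jacobi iterations.

    A is a list of rows and is assumed to be diagonally dominant so that
    the iteration converges.
    """
    n = len(b)
    # Precompute 1/D once instead of dividing inside the loop.
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for row in range(n):
            # (R x)_row: sum over the off-diagonal entries only.
            acc = 0.0
            for col in range(n):
                if col != row:
                    acc += A[row][col] * x[col]
            x_new[row] = (b[row] - acc) * inv_diag[row]
        x = x_new
    return x


# Small diagonally dominant system: 4x + y = 9, 2x + 5y = 19.
A = [[4.0, 1.0], [2.0, 5.0]]
b = [9.0, 19.0]
x = jacobi(A, b)
```

The `col != row` test is the branching form; a masked variant would multiply the diagonal term by zero instead of skipping it.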

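5 Benchmark #2: Bilateral filter
$O(x, y) = \frac{\sum_{i=-r}^{r} \sum_{j=-r}^{r} I(x+i, y+j)\, W(x, y, i, j)}{\sum_{i=-r}^{r} \sum_{j=-r}^{r} W(x, y, i, j)}$
where $W(x, y, i, j)$ is an exponential weight that decays with the spatial distance $\sqrt{i^2 + j^2}$ and with the intensity difference $I(x, y) - I(x+i, y+j)$, each scaled by its own parameter (subscripted $d$ and $r$ respectively).
Tuning options: work-group size; tile size / tile layout; prefetch pixels into private/local memory; buffers vs images; *+ vs mad vs fma; loop interchange; native math functions; embedding values into kernel as constants.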
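
A serial sketch of the filter for a single output pixel may help make the normalized weighted sum concrete. The exact weight function used in the benchmark is not reproduced here: the Gaussian form below, the parameter names `sigma_d` and `sigma_r`, and the clamp-to-edge boundary handling are all assumptions based on the common textbook bilateral filter.

```python
import math


def bilateral_pixel(image, x, y, r, sigma_d, sigma_r):
    """Filter one pixel of a greyscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    centre = image[y][x]
    num = 0.0
    den = 0.0
    for j in range(-r, r + 1):
        for i in range(-r, r + 1):
            # Assumption: out-of-range neighbours are clamped to the edge.
            xi = min(max(x + i, 0), width - 1)
            yj = min(max(y + j, 0), height - 1)
            diff = centre - image[yj][xi]
            # Assumed Gaussian weight on spatial distance and intensity difference.
            w = math.exp(-(i * i + j * j) / (2.0 * sigma_d ** 2)
                         - (diff * diff) / (2.0 * sigma_r ** 2))
            num += image[yj][xi] * w
            den += w
    return num / den


# A vertical step edge: the filter should not blur across it.
step = [[0.0, 0.0, 100.0, 100.0] for _ in range(4)]
left_of_edge = bilateral_pixel(step, 1, 1, r=1, sigma_d=1.0, sigma_r=1.0)
```

The two nested loops over `i` and `j` are the ones that the loop interchange option reorders, and `math.exp` stands in for the call that a native math function would replace in an OpenCL kernel.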


6 Benchmark #3: BUDE
Tuning options: parallel decomposition / work-group size; data layout / memory access patterns; loop interchange; address space of molecules / forcefield; precomputing forcefield coefficients; native math functions; embedding values into kernel as constants.
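
Two of the options that recur across the benchmarks, the arithmetic form (*+ vs mad vs fma) and embedding values into the kernel as constants, lend themselves to a small kernel-generation sketch. The generator below is hypothetical and is not the authors' tool: the function name, the option names and the toy `axpy` kernel are invented for illustration. Only `mad`, `fma` and `get_global_id` are real OpenCL C built-ins.

```python
def generate_axpy_kernel(arith="mad", embedded_a=None):
    """Return OpenCL C source for a toy kernel computing y[i] = a * x[i] + y[i].

    arith selects how the multiply-add is written; embedded_a, if given, is
    baked into the source as a constant instead of being a kernel argument.
    """
    forms = {
        "mul_add": "a * x[i] + y[i]",
        "mad": "mad(a, x[i], y[i])",
        "fma": "fma(a, x[i], y[i])",
    }
    if arith not in forms:
        raise ValueError(f"unknown arithmetic form: {arith}")
    if embedded_a is None:
        params = "const float a, __global const float *x, __global float *y"
        prologue = []
    else:
        params = "__global const float *x, __global float *y"
        prologue = [f"    const float a = {float(embedded_a)!r}f;"]
    lines = [f"__kernel void axpy({params})", "{",
             "    const size_t i = get_global_id(0);"]
    lines += prologue
    lines += [f"    y[i] = {forms[arith]};", "}"]
    return "\n".join(lines)


src_runtime_arg = generate_axpy_kernel(arith="fma")
src_embedded = generate_axpy_kernel(arith="mul_add", embedded_a=2)
```

An auto-tuner would treat `arith` and the embed-or-not decision as two dimensions of its search space and compile each generated source for the device under test.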

7 Devices
Vendor | Product | Architecture | Compute units
NVIDIA | GeForce GTX 580 | Fermi (GF110) | 16
NVIDIA | GeForce GTX 680 | Kepler (GK104) | 8
NVIDIA | GeForce GTX 780 Ti | Kepler (GK110) | 15
NVIDIA | GeForce GTX 980 Ti | Maxwell (GM200) | 22
NVIDIA | GeForce GTX 1080 Ti | Pascal (GP102) | 28
AMD | Radeon HD 7970 | Tahiti | 32
AMD | Radeon R9 290 | Hawaii | 44
AMD | Radeon R9 Fury X | Fiji | 64
AMD | Radeon RX 480 | Ellesmere | 36
Intel | Core i | Ivy Bridge | 4
Intel | Core i | Haswell | 4
Intel | Core i | Skylake | 4

8 Jacobi efficiencies
[Figure: matrix of efficiencies (%) for each kernel ("tuned for") on each device ("running on"). Axis labels: GTX 580, GTX 680, GTX 780 Ti, GTX 980 Ti, GTX 1080 Ti, HD 7970, R9 290, R9 Fury, RX 480.]

9 Jacobi - NVIDIA
Work-group size differs between devices. Other drops are due to address spaces and memory access pattern.
[Figure: NVIDIA subset of the efficiency matrix, "tuned for" vs "running on": GTX 580, GTX 680, GTX 780 Ti, GTX 980 Ti, GTX 1080 Ti.]

10 Jacobi - AMD
Most parameters are uniform between all AMD devices. Fury suffers a little with expensive division.
[Figure: AMD subset of the efficiency matrix, "tuned for" vs "running on": HD 7970, R9 290, R9 Fury, RX 480.]

11 Jacobi - Intel
Ivy Bridge performs poorly with the fma builtin. Vectorisation width differs.
[Figure: Intel subset of the efficiency matrix, "tuned for" vs "running on".]

12 Jacobi efficiencies
[Figure: the Jacobi efficiency matrix with an added WCE (worst-case efficiency) column, "tuned for" vs "running on". Axis labels: GTX 580, GTX 680, GTX 780 Ti, GTX 980 Ti, GTX 1080 Ti, HD 7970, R9 290, R9 Fury, RX 480.]

13 Bilateral efficiencies
[Figure: matrix of efficiencies (%) with a WCE (worst-case efficiency) column, "tuned for" vs "running on". Axis labels: GTX 580, GTX 680, GTX 780 Ti, GTX 980 Ti, GTX 1080 Ti, R9 290, R9 Fury, RX 480.]

14 BUDE efficiencies
[Figure: matrix of efficiencies (%) with a WCE (worst-case efficiency) column, "tuned for" vs "running on". Axis labels: GTX 580, GTX 680, GTX 780 Ti, GTX 980 Ti, GTX 1080 Ti, HD 7970, R9 290, R9 Fury, RX 480.]

15 Multi-objective auto-tuning
Extend the tuning process to consider multiple devices at once. Each time a kernel is generated, the auto-tuner evaluates it on every target device. The performance values are then reduced into a single number representing the overall fitness. We use worst-case efficiency for this fitness function.
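
The reduction to a single fitness value can be sketched as follows. As before, the precise efficiency definition is an assumption (best known runtime on a device divided by the candidate's runtime), and the candidate names and timings are invented for illustration.

```python
def worst_case_efficiency(runtimes, best_known):
    """Reduce per-device runtimes to one fitness value in the range 0..1.

    runtimes[device] is the candidate's runtime in seconds;
    best_known[device] is the fastest runtime known for that device.
    """
    return min(best_known[dev] / runtimes[dev] for dev in runtimes)


def pick_most_portable(candidates, best_known):
    """Return the candidate name with the highest worst-case efficiency."""
    return max(candidates,
               key=lambda name: worst_case_efficiency(candidates[name], best_known))


# Hypothetical timings in seconds, invented for illustration only.
best_known = {"gpu_a": 1.0, "gpu_b": 2.0}
candidates = {
    "tuned_for_a": {"gpu_a": 1.0, "gpu_b": 8.0},   # fast on A, slow on B
    "tuned_for_b": {"gpu_a": 4.0, "gpu_b": 2.0},   # fast on B, slow on A
    "compromise": {"gpu_a": 1.25, "gpu_b": 2.5},   # slightly slower on both
}
winner = pick_most_portable(candidates, best_known)
```

Taking the minimum rather than the mean means a candidate cannot hide a very poor result on one device behind good results on the others, which is why the compromise kernel wins here.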

16 Multi-objective tuning results

17 Summary
Over-optimisation hurts performance portability. Auto-tuning can be a great way to expose these issues. It can also help generate performance-portable kernels. Future work: looking at tuning across different input/problem configurations.
