Fast Bilateral Filter GPU implementation

1 Fast Bilateral Filter GPU implementation
Multi-Core Architectures and Programming
Gerhard Mlady, Rafael Bernardelli
Hardware/Software Co-Design, University of Erlangen-Nuremberg
July 21, 2016

2 Overview: Fast Bilateral Filter, Implementation, Benchmark

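3 Introduction to Bilateral Filter
First, let's take a look at 2D convolution, where $N(x)$ is the neighborhood of $x$:
$f(x) = g(x) * k(x)$
$f(x) = \sum_{x' \in N(x)} g(x')\, k(x - x')$
$f(x) = \mathcal{F}^{-1}\{\mathcal{F}\{g(x)\} \cdot \mathcal{F}\{k(x)\}\}$
[figure: Gaussian kernel]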
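The neighborhood-sum form of the convolution can be sketched on the CPU as follows. This is a minimal illustration and not code from the deck: the `Image` type, the function name `convolve2d`, and the clamp-to-edge border handling are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major grayscale image; a hypothetical helper type for this sketch.
struct Image {
    std::size_t width, height;
    std::vector<float> data;

    // Clamp-to-edge border handling (an assumption; the slides do not specify one).
    float at(long x, long y) const {
        if (x < 0) x = 0;
        if (y < 0) y = 0;
        if (x >= (long)width) x = (long)width - 1;
        if (y >= (long)height) y = (long)height - 1;
        return data[(std::size_t)y * width + (std::size_t)x];
    }
};

// f(x) = sum over x' in N(x) of g(x') * k(x - x'),
// with a square (2r+1) x (2r+1) kernel stored row-major in k.
Image convolve2d(const Image& g, const std::vector<float>& k, int r) {
    assert(k.size() == (std::size_t)(2 * r + 1) * (2 * r + 1));
    Image f{g.width, g.height, std::vector<float>(g.width * g.height, 0.0f)};
    for (long y = 0; y < (long)g.height; ++y) {
        for (long x = 0; x < (long)g.width; ++x) {
            float acc = 0.0f;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    // With x' = x - d, the kernel argument x - x' is simply d.
                    const float kv = k[(std::size_t)(dy + r) * (2 * r + 1) + (std::size_t)(dx + r)];
                    acc += g.at(x - dx, y - dy) * kv;
                }
            }
            f.data[(std::size_t)y * g.width + (std::size_t)x] = acc;
        }
    }
    return f;
}
```

Convolving an image that contains a single 1 at its center with a 3x3 kernel reproduces the kernel itself, which is a quick sanity check for the index convention. The direct form costs $(2r+1)^2$ multiply-adds per pixel.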

4 Figure: (a) Original image, (b) Gaussian-filtered image

5 Figure: (a) Original image, (b) Gaussian-filtered image, (c) Bilateral-filtered image*, (d) Bilateral-filtered image**

6 Figure: (a) Original image, (b) Bilateral kernel, (c) Result

7 Bilateral filter equations
$f(x) = k^{-1}(x) \sum_{x' \in N(x)} g(x')\, c(x, x')\, s(g(x), g(x'))$
$f(x)$: output image
$g(x)$: input image

8 Bilateral filter equations
$f(x) = k^{-1}(x) \sum_{x' \in N(x)} g(x')\, c(x, x')\, s(g(x), g(x'))$
$c(x, x') = e^{-\frac{\lVert x - x' \rVert^2}{2\sigma_d^2}}$: spatial similarity function
$s(g(x), g(x')) = e^{-\frac{(g(x) - g(x'))^2}{2\sigma_r^2}}$: range similarity function
$k(x) = \sum_{x' \in N(x)} c(x, x')\, s(g(x), g(x'))$: normalizing factor
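A direct evaluation of these equations can serve as a slow reference against which a fast implementation is compared. The sketch below is a brute-force CPU version and not the deck's GPU code; the function name, the square window of half-width `radius`, and the truncation of the neighborhood at the image border are assumptions of the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Brute-force bilateral filter on a row-major grayscale image.
// sigma_d controls the spatial term c, sigma_r the range term s.
std::vector<float> bilateral_bruteforce(const std::vector<float>& g,
                                        int width, int height, int radius,
                                        float sigma_d, float sigma_r) {
    assert(g.size() == (std::size_t)width * height);
    std::vector<float> f(g.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float center = g[(std::size_t)y * width + x];
            float sum = 0.0f;   // numerator: sum of g(x') * c * s
            float norm = 0.0f;  // normalizing factor k(x): sum of c * s
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int xx = x + dx;
                    const int yy = y + dy;
                    // Truncate N(x) at the image border.
                    if (xx < 0 || xx >= width || yy < 0 || yy >= height) continue;
                    const float v = g[(std::size_t)yy * width + xx];
                    const float c = std::exp(-(float)(dx * dx + dy * dy) / (2.0f * sigma_d * sigma_d));
                    const float s = std::exp(-(v - center) * (v - center) / (2.0f * sigma_r * sigma_r));
                    sum += v * c * s;
                    norm += c * s;
                }
            }
            // norm >= 1 because the center pixel always contributes c = s = 1.
            f[(std::size_t)y * width + x] = sum / norm;
        }
    }
    return f;
}
```

With a small `sigma_r`, pixels across a strong edge receive a range weight close to zero, so the edge survives the smoothing. With a very large `sigma_r` the filter degenerates to a plain Gaussian blur.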

10-12 Fast Bilateral Filter [figures]

13 Overview: Fast Bilateral Filter, Implementation, Benchmark

14 Implementation
- Load image into GPU
- Fill cubes
- Perform separable convolution
- Slicing & nonlinearity
- Copy image back

15 Cube filling [figure: WI, W]

16 Perform separable convolution
Complexity drops from $O(NMk^2)$ to $O(2NMk)$.
$N, M$: image dimensions
$k$: convolution kernel length
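The saving comes from replacing one 2D pass with two 1D passes, which is valid when the 2D kernel is the outer product of a 1D kernel with itself (as a Gaussian is). For a kernel length of $k = 11$ this is 22 multiply-adds per pixel instead of 121. The following CPU sketch shows the two-pass structure on a 2D image; the function names and the zero-padding at the border are assumptions of the example, and the deck's cubes would presumably need a further pass along the intensity axis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D convolution pass over a row-major image, along x (horizontal) or y (vertical).
// Samples outside the image are treated as zero (an assumption of this sketch).
std::vector<float> convolve_pass(const std::vector<float>& in, int width, int height,
                                 const std::vector<float>& k, bool horizontal) {
    assert(k.size() % 2 == 1);
    assert(in.size() == (std::size_t)width * height);
    const int r = (int)k.size() / 2;
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int d = -r; d <= r; ++d) {
                const int xx = horizontal ? x - d : x;
                const int yy = horizontal ? y : y - d;
                if (xx < 0 || xx >= width || yy < 0 || yy >= height) continue;
                acc += in[(std::size_t)yy * width + xx] * k[(std::size_t)(d + r)];
            }
            out[(std::size_t)y * width + x] = acc;
        }
    }
    return out;
}

// Separable 2D convolution: k multiply-adds per pixel per pass, 2k in total.
std::vector<float> convolve_separable(const std::vector<float>& in, int width, int height,
                                      const std::vector<float>& k) {
    const std::vector<float> rows = convolve_pass(in, width, height, k, true);
    return convolve_pass(rows, width, height, k, false);
}
```

Applying `convolve_separable` with the 1D kernel {1, 2, 1} to an image holding a single 1 yields the 3x3 outer product of the kernel with itself around that pixel, with the value 4 at the center.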

17 Perform Slicing & Nonlinearity [figure: WI, W, output]
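In the usual formulation of the fast bilateral filter, slicing reads the filtered cubes back at each pixel's position and intensity, and the nonlinearity is the division of WI by W. The sketch below illustrates that idea on the CPU with a nearest-neighbor lookup. It is not the deck's implementation: the function name is hypothetical, the cube layout is taken from the cube-filling kernels shown later in the deck, and interpolation between cells (for which the deck appears to use texture fetching) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slicing with nearest-neighbor lookup: read the (already filtered) cubes WI and W
// at the cell each pixel falls into, then divide.
// Cube layout: idx = cx + dim_x * cy + dim_x * dim_y * cz.
std::vector<float> slice_nearest(const std::vector<float>& image, int width, int height,
                                 const std::vector<float>& cube_wi,
                                 const std::vector<float>& cube_w,
                                 int dim_x, int dim_y, int scale_xy, int scale_eps) {
    assert(image.size() == (std::size_t)width * height);
    assert(cube_wi.size() == cube_w.size());
    std::vector<float> out(image.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float v = image[(std::size_t)y * width + x];
            const std::size_t idx = (std::size_t)(x / scale_xy)
                                  + (std::size_t)dim_x * (std::size_t)(y / scale_xy)
                                  + (std::size_t)dim_x * dim_y * (std::size_t)((int)v / scale_eps);
            assert(idx < cube_w.size());
            const float w = cube_w[idx];
            // Nonlinearity: accumulated intensity divided by accumulated weight.
            // Fall back to the input value if the cell is empty.
            out[(std::size_t)y * width + x] = (w > 0.0f) ? cube_wi[idx] / w : v;
        }
    }
    return out;
}
```

A nearest-neighbor lookup produces visible blocks at cell boundaries, which is why an interpolated fetch is the more common choice in practice.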

18 Texture fetching Mlady, Bernardelli Multi-Core Architectures and Programming Fast Bilateral Filter GPU implementation 21/07/

19 Overview: Fast Bilateral Filter, Implementation, Benchmark

20-27 Benchmark for intensity kernel length, intensity and spatial scaling 11 [plots]

28 Benchmark: best method for filling WI and W

29
// Loop version: one thread per downsampled (i, j) cell. Each thread writes only
// to its own column of the cubes, so no atomic operations are needed.
__global__ void cubefilling_loop(const float* image, float *dev_cube_wi, float *dev_cube_w,
                                 const dim3 image_size, int scale_xy, int scale_eps,
                                 dim3 dimensions_down)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < dimensions_down.x && j < dimensions_down.y) {
        size_t cube_idx_1 = i + dimensions_down.x*j;
        #pragma unroll
        for (int ii = 0; ii < scale_xy; ii++) {
            #pragma unroll
            for (int jj = 0; jj < scale_xy; jj++) {
                size_t i_idx = scale_xy*i + ii;
                size_t j_idx = scale_xy*j + jj;
                if (i_idx < image_size.x && j_idx < image_size.y) {
                    float k = image[i_idx + image_size.x*j_idx];
                    // Cast to an integer before indexing so the offset is not computed in float.
                    size_t cube_idx_2 = cube_idx_1 +
                        dimensions_down.x*dimensions_down.y*(size_t)floorf(k / scale_eps);
                    dev_cube_wi[cube_idx_2] += k;
                    dev_cube_w[cube_idx_2] += 1.0f;
                }
            }
        }
    }
}

30
// Atomic version: one thread per input pixel. Several threads can hit the same
// cube cell, so the accumulation uses atomicAdd.
__global__ void cubefilling_atomic(const float* image, float *dev_cube_wi, float *dev_cube_w,
                                   const dim3 image_size, int scale_xy, int scale_eps,
                                   dim3 dimensions_down)
{
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    const size_t j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < image_size.x && j < image_size.y) {
        const float k = image[i + image_size.x*j];
        const size_t cube_idx = (i / scale_xy) + dimensions_down.x*(j / scale_xy) +
            dimensions_down.x*dimensions_down.y*((int)k / scale_eps);
        atomicAdd(&dev_cube_wi[cube_idx], k);
        atomicAdd(&dev_cube_w[cube_idx], 1.0f);
    }
}
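A sequential CPU reference with the same indexing is a convenient way to check either kernel's output. The sketch below mirrors the index computation of `cubefilling_atomic`. The `Cubes` type, the function name, the `max_intensity` parameter, and the rounding-up of the downsampled dimensions are assumptions of the example, since the slides do not show how `dimensions_down` is computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two cubes of this sketch.
struct Cubes {
    int dim_x, dim_y, dim_z;
    std::vector<float> wi;  // sum of intensities per cell
    std::vector<float> w;   // number of pixels per cell
};

// Sequential reference for cube filling, using the indexing of cubefilling_atomic.
// max_intensity (for example 255) sizes the intensity axis; an assumption of this sketch.
Cubes fill_cubes_reference(const std::vector<float>& image, int width, int height,
                           int scale_xy, int scale_eps, int max_intensity) {
    assert(image.size() == (std::size_t)width * height);
    Cubes c;
    c.dim_x = (width + scale_xy - 1) / scale_xy;   // round up so border pixels get a cell
    c.dim_y = (height + scale_xy - 1) / scale_xy;
    c.dim_z = max_intensity / scale_eps + 1;
    const std::size_t n = (std::size_t)c.dim_x * c.dim_y * c.dim_z;
    c.wi.assign(n, 0.0f);
    c.w.assign(n, 0.0f);
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            const float k = image[(std::size_t)i + (std::size_t)width * j];
            const std::size_t idx = (std::size_t)(i / scale_xy)
                                  + (std::size_t)c.dim_x * (std::size_t)(j / scale_xy)
                                  + (std::size_t)c.dim_x * c.dim_y * (std::size_t)((int)k / scale_eps);
            assert(idx < n);
            c.wi[idx] += k;
            c.w[idx] += 1.0f;
        }
    }
    return c;
}
```

After copying the GPU cubes back to the host, comparing them cell by cell against this reference (with a small tolerance for WI, because the order of floating-point additions differs) would catch indexing mistakes in either kernel.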

31-34 Benchmark: best method for filling [plots]

35 Benchmark [chart: relative runtime on Tesla K20c, 0% to 100%, versus image size, split into convolution, cubefilling, and slicing]

36 Benchmark - CPU for intensity kernel length 21 and spatial k.l

37 Issues with the Implementation
On Windows, a registry key must be set to allow the GPU to run a kernel for more than 2 seconds (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers > TdrDelay=10).
On (almost) every CUDA API call, retrieve the returned status (cudaStatus) and check it for errors. The CUDA API does not raise exceptions.
On Linux, add the following lines to .bashrc:
export PKG_CONFIG_PATH=/scratch-local/usr/lib64/pkgconfig/
export PATH=$PATH:/opt/cuda/bin
export LD_LIBRARY_PATH=/scratch-local/usr/lib/:$LD_LIBRARY_PATH

38 If you are interested in the code and more plots

39 Thank You For Your Attention!

