Image convolution with CUDA


Lecture by Alexey Abramov (abramov _at_ physik3.gwdg.de)
Georg-August University, Bernstein Center for Computational Neuroscience, III. Physikalisches Institut, Göttingen, Germany

Convolution (definition)

A convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

In discrete terms it can be written as:

(f * g)(n) = \sum_{m} f(m)\, g(n - m)

For two dimensions:

(f * g)(x, y) = \sum_{i} \sum_{j} f(i, j)\, g(x - i, y - j)

Convolution in image processing tasks

Convolution filtering can be used for a wide range of image processing tasks. Many types of blur filters or edge detectors use convolutions.

[Figure: original image and the result of applying a blur convolution to it]

In the context of image processing, a convolution filter is just the scalar product of the filter weights with the input pixels within a window surrounding each of the output pixels.

A naive convolution algorithm

The scalar product is a parallel operation that is well suited to computation on highly parallel hardware such as the GPU. To compute an output pixel (red), a region of the input image (orange) is multiplied element-wise with the filter kernel (purple) and the results are summed.

A naive convolution algorithm (CPU version)

#include <cutil.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <QtGui/QImage>
#include <QtGui/QColor>
#include <math.h>

#define KERNEL_RADIUS 8
#define KERNEL_LENGTH (2 * KERNEL_RADIUS + 1)
#define BLOCK_W 16
#define BLOCK_H 16

A naive convolution algorithm (CPU version)

int main(int argc, char **argv){
    // read an input image
    QString filename("./Lena.png");
    QImage img(filename);

    int width = img.width();
    int height = img.height();

    // separate color components
    float *pr = new float [width * height];
    float *pg = new float [width * height];
    float *pb = new float [width * height];
    float *pbuffer = new float [width * height];

    bzero(pr, width * height * sizeof(float));
    bzero(pg, width * height * sizeof(float));
    bzero(pb, width * height * sizeof(float));
    bzero(pbuffer, width * height * sizeof(float));

[Figure: original image split into the pr, pg and pb color-channel buffers]

    for(int i = 0; i < height; ++i){
        for(int j = 0; j < width; ++j){
            QRgb rgb = img.pixel(j, i);
            *(pr + i*width + j) = (float) qRed(rgb);
            *(pg + i*width + j) = (float) qGreen(rgb);
            *(pb + i*width + j) = (float) qBlue(rgb);
        }
    }

[Figure: the original image and the image after convolution with a Gaussian kernel]

The one-dimensional Gaussian function is

G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / (2\sigma^2)}

where x is the distance from the origin along the horizontal axis and σ is the standard deviation of the Gaussian distribution. Values from the distribution are used to build a convolution matrix which is applied to the original image. Each pixel's new value is set to a weighted average of that pixel's neighborhood. The original pixel's value receives the heaviest weight (having the highest Gaussian value) and neighboring pixels receive smaller weights as their distance to the original pixel increases.

A Gaussian convolution kernel is a symmetric function, so the row and column filters are identical. Applying a Gaussian blur has the effect of reducing the image's high-frequency components: a Gaussian blur is thus a low-pass filter.

    // convolution kernel
    float *pkernel = new float [KERNEL_LENGTH];
    bzero(pkernel, KERNEL_LENGTH * sizeof(float));

    // kernel initialization
    float kernelsum = 0;
    for(int i = 0; i < KERNEL_LENGTH; ++i){
        float dist = (float)(i - KERNEL_RADIUS) / (float) KERNEL_RADIUS;
        pkernel[i] = expf(- dist * dist / 2);
        kernelsum += pkernel[i];
    }

    // normalize the kernel weights
    for(int i = 0; i < KERNEL_LENGTH; ++i)
        pkernel[i] /= kernelsum;

    // run very simple convolution on CPU
    convolutionCPU(pbuffer, pr, pkernel, width, height);
    memcpy(pr, pbuffer, width*height*sizeof(float));
    convolutionCPU(pbuffer, pg, pkernel, width, height);
    memcpy(pg, pbuffer, width*height*sizeof(float));
    convolutionCPU(pbuffer, pb, pkernel, width, height);
    memcpy(pb, pbuffer, width*height*sizeof(float));

    // build an output image to see the final result
    QImage out_img(width, height, QImage::Format_RGB32);
    for(int i = 0; i < height; ++i){
        for(int j = 0; j < width; ++j){
            QColor clr((int)*(pr + i*width + j),
                       (int)*(pg + i*width + j),
                       (int)*(pb + i*width + j));
            out_img.setPixel(j, i, clr.rgb());
        }
    }

    // Free the memory

void convolutionCPU(float *dst, float *src, float *kernel, int width, int height){
    for(int y = 0; y < height; y++){
        for(int x = 0; x < width; x++){
            float sum = 0;
            float value = 0;
            for(int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; ++i){
                for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; ++j){
                    int c_y = y + i;
                    int c_x = x + j;
                    // zero padding outside the image boundaries
                    if (c_x < 0 || c_x > (width-1) || c_y < 0 || c_y > (height-1))
                        value = 0;
                    else
                        value = *(src + c_y*width + c_x);
                    sum += value * kernel[KERNEL_RADIUS + i] * kernel[KERNEL_RADIUS + j];
                }
            }
            *(dst + y*width + x) = sum;
        }
    }
}

Device memory spaces

CUDA devices use several memory spaces, which have different characteristics that reflect their distinct usage in CUDA applications.

[Figure: CUDA device memory spaces]

A convolution algorithm (the most naive approach)

The most naive approach is to keep the image in global memory and let each thread access it directly to compute its output pixel. There are no idle threads, since the total number of threads launched equals the total number of pixels.

    // run very simple convolution on GPU
    convolutionGPU(pbuffer, pr, pkernel, width, height);
    memcpy(pr, pbuffer, width*height*sizeof(float));
    convolutionGPU(pbuffer, pg, pkernel, width, height);
    memcpy(pg, pbuffer, width*height*sizeof(float));
    convolutionGPU(pbuffer, pb, pkernel, width, height);
    memcpy(pb, pbuffer, width*height*sizeof(float));

On the GPU now!
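
The filter weights are read from constant memory inside the kernels. The __constant__ declaration itself does not appear in the transcribed slides; a plausible version, consistent with the cudaMemcpyToSymbol call and the d_KernelDev accesses below, is:

    // Convolution kernel weights in constant memory, visible to all threads.
    __constant__ float d_KernelDev[KERNEL_LENGTH];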

void convolutionGPU(float *h_data_out, float *h_data_in, float *d_kernel, int w, int h){
    float *d_buffer_in = 0;
    float *d_buffer_out = 0;

    cudaMalloc((void **)&d_buffer_in, w*h*sizeof(float));
    cudaMalloc((void **)&d_buffer_out, w*h*sizeof(float));

    // copy convolution kernel to the constant GPU memory
    cudaMemcpyToSymbol((const char*)d_KernelDev, d_kernel, KERNEL_LENGTH*sizeof(float));

    int gridy = h / BLOCK_H;
    int gridx = w / BLOCK_W;
    dim3 blocks(BLOCK_W, BLOCK_H);
    dim3 grids(gridx, gridy);

    cudaMemcpy(d_buffer_in, h_data_in, w*h*sizeof(float), cudaMemcpyHostToDevice);

    unsigned int hTimer;
    cutCreateTimer(&hTimer);
    cutResetTimer(hTimer);
    cutStartTimer(hTimer);

    convolutionGPU_kernel<<<grids, blocks>>>(d_buffer_out, d_buffer_in, w, h);
    cudaThreadSynchronize();

    cutStopTimer(hTimer);
    double gpuTime = cutGetTimerValue(hTimer);
    std::cout << "Simple convolution on GPU, time = " << gpuTime << " ms" << std::endl;

    cudaMemcpy(h_data_out, d_buffer_out, w*h*sizeof(float), cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_buffer_in);
    cudaFree(d_buffer_out);
}

__global__ void convolutionGPU_kernel(float *d_result, float *d_data, int dataW, int dataH){
    // global memory address of the current thread in the whole grid
    const int gLoc = threadIdx.x + blockIdx.x * blockDim.x +
                     threadIdx.y * dataW + blockIdx.y * blockDim.y * dataW;

    // global memory address of the current thread in the whole grid
    const int gLoc = threadIdx.x + blockIdx.x * blockDim.x +
                     threadIdx.y * dataW + blockIdx.y * blockDim.y * dataW;

[Figure: layout of blockIdx, blockDim and threadIdx over the image, showing how gLoc addresses a pixel within a row of width dataW]

    float sum = 0;
    float value = 0;

    // image coordinates of the current thread
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    for(int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; ++i){
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; ++j){
            int c_y = y + i;
            int c_x = x + j;

            // check boundaries
            if( c_x < 0 || c_x > (dataW-1) || c_y < 0 || c_y > (dataH-1) )
                value = 0;
            else
                value = *(d_data + c_y*dataW + c_x);

            sum += value * d_KernelDev[KERNEL_RADIUS + i] * d_KernelDev[KERNEL_RADIUS + j];
        }
    }

    d_result[gLoc] = sum;
}

Shared memory and the apron

For any reasonable kernel size, the pixels at the edge of the shared memory array will depend on pixels not in shared memory. Around the image block within a thread block, there is an apron of pixels of the width of the kernel radius that is required in order to filter the image block. Thus, each thread block must load into shared memory both the pixels to be filtered and the apron pixels. The apron of one block overlaps with adjacent blocks.

Avoid idle threads

Example: image block of 16 x 16 pixels, kernel radius = 16.

If one thread is used for each pixel loaded into shared memory, then the threads loading the apron pixels will be idle during the filter computation. As the radius of the filter increases, the percentage of idle threads increases. This wastes much of the available parallelism, and with the limited amount of shared memory available, the waste for large-radius kernels can be quite high.

1 pixel = 4 bytes, so 1 block (including the apron) = 9216 bytes. This is more than half of the available 16 KB of shared memory per multiprocessor on the G80 GPU.
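
The 9216-byte figure follows directly from the tile dimensions; a quick check, assuming a 16 x 16 image block with a 16-pixel apron on every side and 4-byte floats:

(16 + 2 \cdot 16) \times (16 + 2 \cdot 16) \times 4 \ \text{bytes} = 48 \times 48 \times 4 = 9216 \ \text{bytes}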

Shared memory

Shared memory model for the naive approach: each thread in the block loads 4 values from global memory. Therefore, the total shared-memory tile is 4 times larger than the area of active convolution pixels.

__global__ void convolutionGPU_shmem_kernel(float *d_result, float *d_data, int dataW, int dataH){
    // shared memory for the thread block (active pixels + apron)
    __shared__ float s_data[BLOCK_W + KERNEL_RADIUS * 2][BLOCK_W + KERNEL_RADIUS * 2];

    // thread indices; every thread loads four pixels
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // global memory address of the current thread in the whole grid
    const int gLoc = tx + blockIdx.x * blockDim.x +
                     ty * dataW + blockIdx.y * blockDim.y * dataW;

    // original image coordinates of the current thread
    int x0 = tx + blockIdx.x * blockDim.x;
    int y0 = ty + blockIdx.y * blockDim.y;

    // upper left
    int x = x0 - KERNEL_RADIUS;
    int y = y0 - KERNEL_RADIUS;

    if( x < 0 || y < 0 )
        s_data[ty][tx] = 0;
    else
        s_data[ty][tx] = *(d_data + gLoc - KERNEL_RADIUS - dataW * KERNEL_RADIUS);

    // upper right
    x = x0 + KERNEL_RADIUS;
    y = y0 - KERNEL_RADIUS;
    if( x > (dataW-1) || y < 0 )
        s_data[ty][tx + blockDim.x] = 0;
    else
        s_data[ty][tx + blockDim.x] = *(d_data + gLoc + KERNEL_RADIUS - dataW * KERNEL_RADIUS);

    // lower left
    x = x0 - KERNEL_RADIUS;
    y = y0 + KERNEL_RADIUS;
    if( x < 0 || y > (dataH-1) )
        s_data[ty + blockDim.y][tx] = 0;
    else
        s_data[ty + blockDim.y][tx] = *(d_data + gLoc - KERNEL_RADIUS + dataW * KERNEL_RADIUS);

    // lower right
    x = x0 + KERNEL_RADIUS;
    y = y0 + KERNEL_RADIUS;
    if( x > (dataW-1) || y > (dataH-1) )
        s_data[ty + blockDim.y][tx + blockDim.x] = 0;
    else
        s_data[ty + blockDim.y][tx + blockDim.x] = *(d_data + gLoc + KERNEL_RADIUS + dataW * KERNEL_RADIUS);

    __syncthreads();

    // convolution itself
    float sum = 0;

    // index of the current thread in the active pixels area
    x = KERNEL_RADIUS + tx;
    y = KERNEL_RADIUS + ty;

    for(int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; ++i)
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; ++j)
            sum += s_data[y + i][x + j] * d_KernelDev[KERNEL_RADIUS + i] * d_KernelDev[KERNEL_RADIUS + j];

    d_result[gLoc] = sum;
}

Separable filters

Generally, a two-dimensional convolution filter requires n*m multiplications for each output pixel, where n and m are the width and height of the filter kernel. Separable filters are a special type of filter that can be expressed as the composition of two one-dimensional filters, one on the rows of the image and one on the columns. Applying the two-dimensional kernel to the data is the same as applying the row filter and then the column filter. A separable filter requires only n+m multiplications for each output pixel.

Separable filters have the benefit of offering more flexibility in the implementation, and in addition they reduce the arithmetic complexity and bandwidth usage of the computation for each data point.
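
As an illustration (not on the original slide), the Gaussian filter used in this lecture is separable: the two-dimensional kernel factors into the product of two identical one-dimensional filters. With KERNEL_RADIUS = 8, so n = m = 17, the per-pixel operation count drops accordingly:

G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)} = G(x)\,G(y), \qquad n \cdot m = 289 \ \text{multiplications} \;\rightarrow\; n + m = 34.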

Filter separation

Basically, two separate convolutions are applied. The first one is row-wise and the second one is column-wise, applied to the result of the first pass (the column convolution runs over the row-filtered data). This also reduces the number of conditional statements and the total number of apron pixels, since the vertical apron in the row-convolution kernel and the horizontal apron in the column-convolution kernel do not need to be considered.
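
A minimal host-side sketch of the two-pass launch, assuming the row and column kernels shown on the next slides and an intermediate device buffer d_buffer_tmp (the buffer name and the surrounding setup are illustrative, not part of the original code):

    // Two-pass separable convolution: rows first, then columns.
    // d_buffer_in, d_buffer_tmp and d_buffer_out hold w*h floats in global memory.
    dim3 blocks(BLOCK_W, BLOCK_H);
    dim3 grids(w / BLOCK_W, h / BLOCK_H);

    // Pass 1: horizontal (row) convolution writes the intermediate result.
    convolutionRowGPU_kernel<<<grids, blocks>>>(d_buffer_tmp, d_buffer_in, w, h);

    // Pass 2: vertical (column) convolution over the row-filtered data.
    convolutionColGPU_kernel<<<grids, blocks>>>(d_buffer_out, d_buffer_tmp, w, h);

    // Wait for both passes to finish before copying the result back.
    cudaThreadSynchronize();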

__global__ void convolutionRowGPU_kernel(float *d_result, float *d_data, int dataW, int dataH){
    int ty = threadIdx.y;
    int tx = threadIdx.x;

    // each thread loads two values from global memory into shared mem
    __shared__ float s_data[BLOCK_H][BLOCK_W + 2 * KERNEL_RADIUS];

    // global memory address of the current thread in the whole grid
    const int gLoc = tx + blockIdx.x * blockDim.x +
                     ty * dataW + blockIdx.y * blockDim.y * dataW;

    // original image based coordinate
    const int x0 = tx + blockIdx.x * blockDim.x;

    // case 1: left
    int x = x0 - KERNEL_RADIUS;
    if ( x < 0 )
        s_data[ty][tx] = 0;
    else
        s_data[ty][tx] = d_data[gLoc - KERNEL_RADIUS];

    // case 2: right
    x = x0 + KERNEL_RADIUS;
    if ( x > dataW-1 )
        s_data[ty][tx + blockDim.x] = 0;
    else
        s_data[ty][tx + blockDim.x] = d_data[gLoc + KERNEL_RADIUS];

    __syncthreads();

    // convolution
    float sum = 0;
    x = KERNEL_RADIUS + tx;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += s_data[ty][x+i] * d_KernelDev[KERNEL_RADIUS + i];

    d_result[gLoc] = sum;
}

__global__ void convolutionColGPU_kernel(float *d_result, float *d_data, int dataW, int dataH){
    int ty = threadIdx.y;
    int tx = threadIdx.x;

    // each thread loads two values from global memory into shared mem
    __shared__ float s_data[BLOCK_H + KERNEL_RADIUS * 2][BLOCK_W];

    // global memory address of the current thread in the whole grid
    const int gLoc = tx + blockIdx.x * blockDim.x +
                     ty * dataW + blockIdx.y * blockDim.y * dataW;

    // original image based coordinate
    const int y0 = ty + blockIdx.y * blockDim.y;

    // case 1: upper
    int y = y0 - KERNEL_RADIUS;
    if ( y < 0 )
        s_data[ty][tx] = 0;
    else
        s_data[ty][tx] = d_data[gLoc - dataW * KERNEL_RADIUS];

    // case 2: lower
    y = y0 + KERNEL_RADIUS;
    if ( y > dataH-1 )
        s_data[ty + blockDim.y][tx] = 0;
    else
        s_data[ty + blockDim.y][tx] = d_data[gLoc + dataW * KERNEL_RADIUS];

    __syncthreads();

    // convolution
    float sum = 0;
    y = KERNEL_RADIUS + ty;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += s_data[y+i][tx] * d_KernelDev[KERNEL_RADIUS + i];

    d_result[gLoc] = sum;
}

Shared memory and bank conflicts

Shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans n distinct memory banks can be serviced simultaneously. However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized. Access to shared memory should be designed to avoid serializing requests due to bank conflicts.

Shared memory and bank conflicts

- 16 banks.
- Successive 32-bit words belong to different banks.
- Shared memory accesses are issued per half-warp (16 threads).
- If n threads (out of 16) access the same bank, the n accesses are executed serially.
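
For reference, on 16-bank devices such as the G80 the bank holding a 32-bit word can be computed as below (a standard formula for compute capability 1.x hardware, not from the original slide). Two threads of the same half-warp conflict whenever this value coincides for different words they access:

\text{bank} = \left( \frac{\text{byte address}}{4} \right) \bmod 16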

Reorganize shared memory

[Figure: 1D shared-memory access pattern for a row filter, showing the first four iterations of the convolution computation; the red area indicates the values accessed by the half-warp threads.]

__global__ void convolutionRowGPU_optimized_kernel(float *d_result, float *d_data, int dataW, int dataH){
    int ty = threadIdx.y;
    int tx = threadIdx.x;

    // shared memory represented here by a 1D array
    // each thread loads two values from global memory into shared mem
    __shared__ float s_data[BLOCK_H * (BLOCK_W + 2 * KERNEL_RADIUS)];

    // global memory address of the current thread in the whole grid
    const int gLoc = tx + blockIdx.x * blockDim.x +
                     ty * dataW + blockIdx.y * blockDim.y * dataW;

    // original image based coordinate
    const int x0 = tx + blockIdx.x * blockDim.x;

    const int shift = ty * (BLOCK_W + 2 * KERNEL_RADIUS);

    // case 1: left
    int x = x0 - KERNEL_RADIUS;
    if ( x < 0 )
        s_data[tx + shift] = 0;
    else
        s_data[tx + shift] = d_data[gLoc - KERNEL_RADIUS];

    // case 2: right
    x = x0 + KERNEL_RADIUS;
    if ( x > dataW-1 )
        s_data[tx + blockDim.x + shift] = 0;
    else
        s_data[tx + blockDim.x + shift] = d_data[gLoc + KERNEL_RADIUS];

    __syncthreads();

    // convolution itself
    float sum = 0;
    x = KERNEL_RADIUS + tx;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += s_data[x + i + shift] * d_KernelDev[KERNEL_RADIUS + i];

    d_result[gLoc] = sum;
}

__global__ void convolutionColGPU_optimized_kernel(float *d_result, float *d_data, int dataW, int dataH){
    int ty = threadIdx.y;
    int tx = threadIdx.x;

    // shared memory represented here by a 1D array
    // each thread loads two values from global memory into shared mem
    __shared__ float s_data[BLOCK_W * (BLOCK_H + KERNEL_RADIUS * 2)];

    // global memory address of this thread
    const int gLoc = tx + blockIdx.x * blockDim.x +
                     ty * dataW + blockIdx.y * blockDim.y * dataW;

    // original image based coordinate
    const int y0 = ty + blockIdx.y * blockDim.y;

    const int shift = ty * BLOCK_W;

    // case 1: upper
    int y = y0 - KERNEL_RADIUS;
    if ( y < 0 )
        s_data[tx + shift] = 0;
    else
        s_data[tx + shift] = d_data[gLoc - dataW * KERNEL_RADIUS];

    // case 2: lower
    y = y0 + KERNEL_RADIUS;
    const int shift1 = shift + blockDim.y * BLOCK_W;
    if ( y > dataH-1 )
        s_data[tx + shift1] = 0;
    else
        s_data[tx + shift1] = d_data[gLoc + dataW * KERNEL_RADIUS];

    __syncthreads();

    // convolution
    float sum = 0;
    for (int i = 0; i <= KERNEL_RADIUS*2; i++)
        sum += s_data[tx + (ty + i) * BLOCK_W] * d_KernelDev[i];

    d_result[gLoc] = sum;
}

Runtime and speedups

                                        400 x 400 pixels       2000 x 2000 pixels
Naive version (CPU)                     91 ms                  2330 ms
Naive version (GPU)                     27 ms   (3.3x)         693 ms   (3.3x)
Shared memory                           26.9 ms (1.003x)       663 ms   (1.05x)
Separable convolution                   3 ms    (8.9x)         72 ms    (9.2x)
Optimized separable convolution         1.6 ms  (1.8x)         38 ms    (1.9x)

(Numbers in parentheses give the speedup over the previous row.)

Time performance

[Figure: runtime comparison of the different implementations]

Coalesced access to global memory

Global memory accesses by all threads in a half-warp of a block can be coalesced into efficient memory transactions on the G80 architecture when:

- Threads access 32-, 64-, or 128-bit data types.
- All 16 words of the transaction lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words). This implies that the starting address and alignment are important.
- Threads access words in sequence.

[Figure: coalescing of a half-warp of 32-bit words, such as floats (compute capability 1.x)]


Coalesced access to global memory

The simplest case of coalescing: the k-th thread accesses the k-th word in a segment. Not all threads need to participate. This access results in a single 64B transaction.

[Figure: unaligned sequential addresses that fit within a single 128-byte segment]

Coalesced access to global memory

If a half-warp accesses memory that is sequential but split across two 128B segments, then two transactions are performed: one 64B transaction and one 32B transaction.

[Figure: a half-warp accessing memory with a stride of 2]
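
As a minimal illustration (not from the slides), the two access patterns discussed above look like this inside a kernel; the kernel name and buffers are hypothetical:

    __global__ void coalescing_example(float *d_out, const float *d_data){
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced: thread k of a half-warp reads word k of a contiguous,
        // aligned segment (the same pattern gLoc produces in the kernels above).
        float v_coalesced = d_data[idx];

        // Stride of 2: the half-warp spreads over twice as many segments for
        // the same amount of useful data, so extra transactions are issued.
        float v_strided = d_data[2 * idx];

        d_out[idx] = v_coalesced + v_strided;
    }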

Unrolling loops

By default, the compiler unrolls small loops with a known trip count. The #pragma unroll directive, however, can be used to control unrolling of any given loop. It must be placed immediately before the loop and only applies to that loop. It is optionally followed by a number that specifies how many times the loop must be unrolled.

    #pragma unroll 5
    for(int i = 0; i < n; ++i)

Fast multiplication

__mul24(x, y) multiplies two 24-bit integer values x and y. x and y are 32-bit integers, but only the low 24 bits are used to perform the multiplication. __mul24 should only be used when the values of x and y are in the range [-2^23, 2^23 - 1] if x and y are signed integers, and in the range [0, 2^24 - 1] if x and y are unsigned integers. If x and y are not in this range, the multiplication result is implementation-defined. Fast integer functions can be used for optimizing the performance of kernels.
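
A hedged sketch of how these two features could be applied to the row-convolution loop shown earlier (illustrative only; the timings reported above do not include these changes):

    // Unroll the fixed-trip-count filter loop (KERNEL_LENGTH = 17 iterations).
    float sum = 0;
    #pragma unroll
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += s_data[KERNEL_RADIUS + tx + i + shift] * d_KernelDev[KERNEL_RADIUS + i];

    // 24-bit multiplies for the index arithmetic: all factors stay far below 2^23.
    const int gLoc24 = threadIdx.x + __mul24(blockIdx.x, blockDim.x)
                     + __mul24(threadIdx.y, dataW)
                     + __mul24(__mul24(blockIdx.y, blockDim.y), dataW);
    d_result[gLoc24] = sum;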

Bibliography

- Wolfram MathWorld, Convolution, http://mathworld.wolfram.com/convolution.html
- Wolfram MathWorld, Normal Distribution, http://mathworld.wolfram.com/normaldistribution.html
- Dr. Dobb's, http://drdobbs.com/
- Victor Podlozhnyuk, Image Convolution with CUDA, NVIDIA CUDA SDK, convolutionSeparable document
- NVIDIA CUDA Programming Guide
- CUDA C Best Practices Guide

Thank you for your attention! QUESTIONS? Göttingen, 9.03.2011