Motion tracking on CUDA architecture


Marek Blazewicz, Krzysztof Kurowski, Bogdan Ludwiczak, Krystyna Woloszczuk

Poznan Supercomputing and Networking Center, Applications Department, ul. Noskowskiego, Poznan, Poland, {marqs, krzysztof.kurowski, bogdanl, krysia}@man.poznan.pl

Abstract. This technical report describes an implementation of a simple motion tracking algorithm on a GPU using the CUDA architecture. It discusses the details of the algorithm and the way it was ported to the GPU. The measured speed-ups in comparison to serial code on a CPU are presented and discussed. Finally, possible applications that were tested are shown.

Keywords: CUDA, Nvidia, GPU, HPC, motion tracking, feature finding.

1. Introduction

Aiming at more and more advanced problems, the computing industry is rapidly moving toward parallel processing and many-core architectures. The increasing programmability of graphics processing units (GPUs) allows us to use these chips not only for the specific graphics computations for which they were designed, but also for general purpose computing problems. This field is called GPGPU (General Purpose computation on GPUs).

The main aim of this work was to get acquainted with a software development kit called CUDA (Compute Unified Device Architecture), an extension to the C programming language designed for the development of GPU code. In this work we try to efficiently port a simple motion tracking algorithm to a GPU, keeping in mind all the requirements and limitations of the architecture. The graphics card used for the experiments was an NVIDIA GTX 280 series card. We used the Eclipse environment to develop the code, the OpenCV library to handle video input/output and the CUDA Visual Profiler to optimize the performance.

Section 2 briefly describes the characteristics of the GTX 280 hardware, the CUDA architecture and its performance optimization issues; for more details see [1]. Section 3 approaches general issues of the motion tracking problem, distinguishing between the problem of feature finding and feature tracking. Section 4 details the algorithms that were chosen for both problems. In section 5 we describe how the algorithm was ported to the GPU. Section 6 details the performance results depending on the chosen launching parameters and section 7 discusses the results. In section 8 we show a few possible applications that were implemented. Finally, section 9 lists the references used in the article.

2. GPU and CUDA architecture

The programmable GPU is a highly parallel, multi-threaded, many-core processor with very high computational power and memory bandwidth. The main difference between a CPU and a GPU is that graphics cards, specialized for graphics rendering (i.e. compute-intensive, highly parallel computation), have more transistors devoted to data processing than to data caching and flow control.

The GTX 280 graphics card consists of 30 multiprocessors (MP) with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache. Each multiprocessor has eight stream processors (SP). A multiprocessor is responsible for creating, managing, and executing concurrent threads. Lightweight thread creation, zero-overhead thread scheduling and fast barrier synchronization efficiently support programs that can be decomposed into parallel subproblems according to the SIMD (single instruction, multiple data) model. For example, an image processing algorithm that performs the same operation on every pixel of an image can be ported to the GPU architecture very efficiently: each thread can be assigned one pixel and perform the calculations in a fast and highly parallel manner.

Threads are arranged in blocks, each block being executed on one multiprocessor. To manage hundreds of threads, a multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. As soon as a block of threads finishes its calculations, another block is launched on the multiprocessor. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads converge back to the same execution path. As a consequence, to improve performance a programmer should keep the number of divergent instructions in a warp as small as possible.
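As a minimal illustration of this point (our sketch, not code from the report), compare a kernel that branches on the thread index, which forces each warp to execute both paths serially, with one that branches on a condition that is uniform across a warp:

```
// Divergent: even and odd threads of the same warp take different paths,
// so the warp executes both branches one after the other.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}

// Uniform: all 32 threads of a warp evaluate the condition identically,
// so no serialization occurs.
__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}
```

Both kernels do the same amount of arithmetic; only the shape of the branch relative to warp boundaries differs.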

Each multiprocessor has four types of on-chip memory:

- one set of local 32-bit registers per processor,
- a parallel data cache (called shared memory) that is shared by all scalar processor cores,
- a read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory,
- a read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory.

Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering. The local and global memory spaces are read-write regions of device memory and are not cached.

A multiprocessor can execute at most eight thread blocks at a time. How many of them will actually be executed depends on how many registers per thread and how much shared memory per block are required for a given kernel (by a kernel we mean a function that is executed by each thread), since the multiprocessor's registers and shared memory are split among all the threads of the batch of blocks. So, to ensure that at least two blocks are executed concurrently, a programmer should assign at most half of the registers and half of the shared memory to one kernel. In the GTX 280, the amount of shared memory available per multiprocessor is 16 KB, and so is the amount of registers. Shared memory is organized in 16 banks.

To achieve high performance, a programmer should follow a number of rules concerning memory bandwidth, instruction throughput and control flow (a short sketch illustrating rules 5, 6 and 8 follows after this list):

1. Maximize the use of all available memory types.
2. Allow the thread scheduler to overlap memory transactions with mathematical computations as much as possible, which requires that each thread executes intensive arithmetic operations and that there are many active threads per multiprocessor.
3. For floating point operations, if double precision is not crucial for the algorithm, use single precision operations, which are two to four times faster than double precision operations.
4. As mentioned earlier, minimize the number of divergent warps to avoid serialization. To reduce the number of serializations in loops, unroll them as much as possible.
5. Device memory has much higher latency and lower bandwidth than on-chip memory. Therefore minimize device memory accesses by staging data coming from device memory in shared memory, processing the data in shared memory and writing the results back to device memory. It is even better to avoid the data transfer entirely at the cost of recomputing the data whenever it is needed.
6. As for global memory, the device is capable of reading 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction. Therefore ensure that each thread in a warp accesses global memory in a way that can be compiled to a single load instruction (called coalesced memory access). Otherwise the accesses will be serialized.
7. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve the best performance.
8. The shared memory is organized in 16 banks such that successive 32-bit words are assigned to successive banks. Banks can be accessed simultaneously, so to achieve high memory bandwidth, memory read or write requests should fall into distinct memory banks.
9. To maximize the use of the available computing resources, there should be at least as many blocks as there are multiprocessors in the device (in the case of the GTX 280: 30). Running only one block per multiprocessor will force the multiprocessor to idle during thread synchronization and during device memory reads if there are not enough threads per block to cover the load latency. According to the CUDA Programming Guide, the number of blocks should therefore be at least 100.
10. The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps.
11. The bandwidth between host memory and device memory is very low. Therefore try to minimize the number of transfers between these two memories, for example by moving more code to the device and storing intermediate results in device memory. Also, because of the overhead associated with each transfer, batching many small transfers into one big transfer always performs much better than making each transfer separately.
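The following minimal sketch (ours, under the assumption that the block size equals the tile size) shows rules 5, 6 and 8 at work in a trivial 3-point averaging kernel: each thread performs one coalesced global load, designated threads fetch the one-element apron, and neighboring values are then reused from shared memory instead of being re-read from device memory.

```
#define TILE 256  // launch with blockDim.x == TILE

// Global memory is read once per element (coalesced, rule 6), reused from
// shared memory (rule 5), and consecutive threads reading consecutive
// floats touch distinct 32-bit banks (rule 8).
__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2];             // tile plus a one-element apron on each side
    int g = blockIdx.x * TILE + threadIdx.x;  // consecutive threads -> consecutive addresses
    s[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)                     // designated threads load the apron
        s[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                          // all loads finished before any reuse
    if (g < n)
        out[g] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}
```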

3. Motion Tracking

Motion tracking is the process of locating a moving object (or several of them) over time using a camera. The process can be divided into two challenging tasks:

1. How to choose good targets (features) that will be easy to track. Good features to track are generally those parts of an image which have high contrast. Therefore neither plain regions nor edges are of interest (the former would be difficult to track in any direction, the latter can be tracked in only one dimension). The points we are interested in are called corners: points that are traceable in two dimensions.

2. How to associate target locations in consecutive video frames. This is especially difficult when the objects are moving fast relative to the frame rate, the image is blurry, the objects rotate in the third dimension, there are rapid lighting changes or another object obscures the tracked object. As the main purpose of this work is mostly to get acquainted with CUDA, the proposed algorithm does not address those problems.

There are two main approaches to feature tracking. The first one chooses features in the first frame only and tries to find the best match for them in the consecutive frames within a chosen search region. This technique requires that the same feature can be detected reliably and consistently across many frames. A big disadvantage is that correspondence errors tend to be very large, for example when rotation, zoom, skew or lighting changes occur. As a consequence, we lose track of the feature rather quickly.

The second approach compares features between consecutive frames, not with the feature description from the first frame. This approach is more robust to rotation, lighting changes etc., so we can track the feature longer, but it sometimes causes the features to drift: we end up tracking a different feature than the one we started with. After comparing the advantages and disadvantages of both approaches, the second approach was chosen and implemented.

4. Algorithms

4.1. Feature finding

To find interesting features to track in an image, we decided to use the Harris corner detector, a popular point detector due to its strong invariance to rotation, scale, illumination variation and image noise. The Harris corner detector is based on the local auto-correlation function of a signal, which measures local changes of the signal with patches shifted by a small amount in different directions. If the changes of the signal are significant no matter what the direction of the shift is, we have found a corner.

Given a shift $(\Delta x, \Delta y)$ and a point $(x, y)$, the auto-correlation function is defined as

$$c(x,y) = \sum_{(x_i, y_i) \in w} \left[ I(x_i, y_i) - I(x_i + \Delta x, y_i + \Delta y) \right]^2, \qquad (1)$$

where $I$ denotes the brightness of a gray-scaled image and $(x_i, y_i)$ are the points in the window $w$ centered on $(x, y)$. Approximating $I(x_i + \Delta x, y_i + \Delta y)$ by its Taylor expansion,

$$c(x,y) = [\Delta x \ \ \Delta y] \; C(x,y) \; [\Delta x \ \ \Delta y]^T. \qquad (2)$$

$C(x,y)$ is a matrix that captures the intensity structure of the local neighborhood:

$$C = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}, \qquad (3)$$

where $I_x$ and $I_y$ are the partial derivatives of $I$ in the x and y direction respectively.

Let $\lambda_1, \lambda_2$ be the eigenvalues of the matrix $C(x,y)$. By analyzing the eigenvalues we can describe the characteristics of a point. Three situations can be distinguished:

1. If both $\lambda_1$ and $\lambda_2$ are small, the analyzed point is part of a plain region.
2. If one eigenvalue is high and the other low, we have found an edge.
3. If both eigenvalues are high, it indicates a corner.

Harris proposed a measure of cornerness as

$$H = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2. \qquad (4)$$

Technically, to calculate the cornerness of a point, we have to calculate the $I_x$ and $I_y$ derivatives of the image under the window, calculate the $C$ matrix and determine the eigenvalues. The $I_x$ and $I_y$ derivatives are calculated by convolving the image with the two separable Sobel filters:

$$\text{x derivative mask} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad (5)$$

$$\text{y derivative mask} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}. \qquad (6)$$
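A minimal serial sketch (our reconstruction, not the report's code; the names img, ix and iy are ours) of the convolution in Eqs. (5) and (6):

```
// Convolve a grayscale image with the 3x3 Sobel masks of Eqs. (5)-(6).
// Border pixels are skipped here; the GPU version instead reads them
// safely through the texture unit.
void sobel(const unsigned char *img, float *ix, float *iy, int w, int h)
{
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int i = y * w + x;
            // x derivative mask: [-1 0 1; -2 0 2; -1 0 1]
            ix[i] = -1.0f * img[i - w - 1] + 1.0f * img[i - w + 1]
                    -2.0f * img[i - 1]     + 2.0f * img[i + 1]
                    -1.0f * img[i + w - 1] + 1.0f * img[i + w + 1];
            // y derivative mask: [-1 -2 -1; 0 0 0; 1 2 1]
            iy[i] = -1.0f * img[i - w - 1] - 2.0f * img[i - w] - 1.0f * img[i - w + 1]
                    +1.0f * img[i + w - 1] + 2.0f * img[i + w] + 1.0f * img[i + w + 1];
        }
}
```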

$\lambda_1$ and $\lambda_2$ are calculated as:

$$\lambda_1 = 0.5 \left( C_{00} + C_{11} + \sqrt{(C_{00} - C_{11})^2 + 4\,C_{01}C_{10}} \right), \qquad (7)$$

$$\lambda_2 = 0.5 \left( C_{00} + C_{11} - \sqrt{(C_{00} - C_{11})^2 + 4\,C_{01}C_{10}} \right). \qquad (8)$$

In the algorithm we use a window with radius 1 (experiments showed that a larger radius is not necessary). Having calculated the cornerness value for each point in the image, we set a threshold above which we consider a point a good feature to track. To avoid having features located very close to each other, we divide the image into small parts and choose one representative (best) corner for each region.

To calculate the cornerness value for each point in the image, the algorithm requires 6 multiplications and additions to calculate each point of the x and y derivatives, 9x4 multiplications and additions to calculate the C matrix and around 6x2 multiplications and additions to calculate the lambdas and the cornerness. To choose the representatives for the corners, we additionally need corners_distance^2 comparisons.
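As a compact illustration, a sketch of Eqs. (7), (8) and (4) for one pixel, given the four entries of C (our reconstruction; the Harris constant k = 0.04 is a typical value that we assume, the report does not state the one it uses):

```
#include <math.h>

// Cornerness of a single pixel from its 2x2 structure matrix C.
// Usable both on the host and inside a kernel.
__host__ __device__ float cornerness(float c00, float c01, float c10, float c11)
{
    float tr   = c00 + c11;
    float disc = sqrtf((c00 - c11) * (c00 - c11) + 4.0f * c01 * c10);
    float l1   = 0.5f * (tr + disc);             // Eq. (7)
    float l2   = 0.5f * (tr - disc);             // Eq. (8)
    const float k = 0.04f;                       // assumed Harris constant
    return l1 * l2 - k * (l1 + l2) * (l1 + l2);  // Eq. (4)
}
```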

4.2. Feature tracking

The standard approach to feature tracking is to measure the correlation between the feature in one image and a point shifted by some distance in the consecutive frame. By comparing the correlations in the search region, we can choose the best match as the new position of the tracked feature. Among the strategies to measure the correspondence between two features under a mask, the SSD (sum of squared differences) is considered to be of good precision, but very time consuming and therefore applicable only in non-real-time programs.

The cost of the computations depends mostly on the range of the search region. Confining the search to a small region saves search time and makes the search process easier, but runs the risk of the tracked feature leaving the search region entirely between frames. The other factor that influences the tracking quality, but also the computation time, is the size of the patch which we treat as a feature and correlate between the images. Again, a small patch saves time, but is less accurate. We have established in experiments that, considering the tracking performance, a reasonable choice for most applications is a patch radius of 3 and a search distance radius of 7. Of course these values greatly depend on the particular application (how sharp and detailed the image is, i.e. how small the feature patch can be, how fast the features shift between frames etc.).

The chosen parameters require calculating 15x15 SSDs, each consisting of 7x7 multiplications and additions, for each tracked feature.

5. Implementation in CUDA

5.1. Feature finding

As the calculations for pixels are mostly local, the algorithm can be effectively parallelized by dividing the image into smaller regions and assigning each region to a different block of threads. The block calculates the cornerness for each pixel in its region and sends the position of the best pixel (the representative of the region) back to the host. The image data is saved in a texture, and the parts of it needed by a particular block are sent to and processed in the block's shared memory. Using a texture provides fast and cached access to the data and is especially well suited for image processing. For example, the methods for accessing the texture provide safe access to points outside the image border, whereas on a CPU safe access to points beyond the picture border must be explicitly ensured by the programmer; a sketch of such a texture setup follows below.
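A sketch of that texture setup, using the CUDA 2.x-era texture reference API that the report's timeframe implies (the names texImg and hostImg are ours); the clamped addressing mode is what makes out-of-border reads safe:

```
// Texture reference for the grayscale frame (module scope, legacy CUDA API).
texture<unsigned char, 2, cudaReadModeElementType> texImg;

void uploadFrame(const unsigned char *hostImg, int width, int height)
{
    static cudaArray *arr = 0;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    if (!arr)
        cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpyToArray(arr, 0, 0, hostImg, width * height,
                      cudaMemcpyHostToDevice);
    texImg.addressMode[0] = cudaAddressModeClamp;  // reads outside the image
    texImg.addressMode[1] = cudaAddressModeClamp;  // return the border pixel
    cudaBindTextureToArray(texImg, arr);
}
// Inside a kernel, tex2D(texImg, x - 1, y - 1) is safe even at (0, 0).
```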

According to the algorithm, each block has to allocate memory for an array with its part of the picture, two additional arrays to store the intermediate results of the x and y derivatives, and a table for the cornerness values. To save the memory needed for the last array, the cornerness values overwrite the original values in the first array. This means in particular that the cornerness values, calculated as floats, are rounded to unsigned chars, so the precision of the calculations is slightly reduced (but only by the non-integer part, because the cornerness values are always scaled to be smaller than 200). The experiments indeed showed that, compared to the CPU results, sometimes different corners were chosen. As we have not noticed that the GPU points were in any way more difficult to track than the CPU points, the sacrifice of precision for speed seems profitable.

Following the rules for optimizing the performance listed in section 2, there should be at least 100 blocks, each occupying at most half of the available shared memory. Having large blocks minimizes the number of operations, but requires a lot of memory, which degrades the performance. After careful calculations the size of the region was established at 32x32; for that size the memory needed is ~7 kB, which is slightly less than half of the available memory. With this size we also meet the requirement for the number of blocks: for a standard camera image of size 640x480 there will be 300 blocks.

The number of threads in a block should be divisible by 32, but not too large, because then the computations done by one thread would be too small to overlap the memory transactions. Following those rules, the number of threads was chosen to be 4x16: each thread processes four consecutive pixels in each of its rows.

Fig. 1 Way of accessing an array of 1-byte data in four iterations to avoid shared memory bank conflicts.

Fig. 2 Way of using previously calculated columns for consecutive pixels.

There are a few reasons behind the idea of processing four consecutive pixels in a row by one thread: it minimizes the bank conflicts, the number of memory operations and the amount of computation. The details are described below.
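A sketch of the Fig. 1 access pattern as we read it (our code, assumed names): with 1-byte pixels, thread t handles bytes 4t through 4t+3 of a 64-byte row, and in iteration k the 16 threads of a half-warp touch bytes 4t+k, which fall into 16 distinct 32-bit banks, so the accesses are conflict-free.

```
texture<unsigned char, 2, cudaReadModeElementType> texImg;  // as in the setup sketch

__global__ void loadRowDemo(int x0, int y0)
{
    __shared__ unsigned char row[64];
    // In iteration k, thread t writes byte 4*t + k. Since banks are 32 bits
    // wide, that byte lies in bank t, so the 16 threads of a half-warp hit
    // 16 distinct banks and the access is conflict-free.
    for (int k = 0; k < 4; ++k) {
        int col = 4 * threadIdx.x + k;
        row[col] = tex2D(texImg, x0 + col, y0);
    }
    __syncthreads();
    // ... gradient and cornerness computation would follow here ...
}
```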

The algorithm works as follows:

1. Copy the image data from the host to a CUDA array on the device and bind it to the texture.
2. Run the kernel in each block; in each thread:
   2.1. Load your 8 pixels (4 pixels in each of two rows) into shared memory. Designate some threads to load the remaining apron pixels. Having each thread access every 4th pixel minimizes the bank conflicts and reduces the time of the memory transactions; the idea is shown in Figure 1. To avoid more of the bank conflicts, the shared memory was aligned to 64 bytes (or 32 in another benchmark).
   2.2. For your first pixel, get the 9 surrounding pixels and calculate the x and y gradients.
   2.3. For the next pixels in your row, get only the missing right column, replace the no-longer-needed first column and calculate the gradients. This saves memory transactions; the idea is shown in Figure 2.
   2.4. Send all four pixels to the arrays with the x and y gradients at once, so that the write takes only one transaction.
   2.5. Do the same for your second row.
   2.6. For your first pixel, get the 9 surrounding gradient values and calculate the sums of Ix^2, Iy^2 and IxIy for the C matrix, saving the summing results for each column of pixels under the mask separately. Calculate the cornerness on this basis.
   2.7. For the next pixels in your row, load only the 3 missing values from both gradient arrays, calculate the sum of this column and replace the summing result for the no-longer-needed first column. This saves both memory operations and a number of calculations (again Figure 2). Then recalculate your C matrix and the cornerness.
   2.8. Send all four cornerness values to the image array (overwriting it) at once, so that the write takes only one transaction.
   2.9. Designate one thread to find the best corner in the region and write it to device memory.
3. Copy the results from the device to the host memory at once to avoid repetitive overhead.
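Step 2.9 might look like the following sketch (array and result names are assumed; the cornerness values are presumed to sit in a 32x32 shared array after step 2.8):

```
// One designated thread scans the block's 32x32 region for the highest
// cornerness value and writes the winning position and score to global
// memory, one entry per block.
__global__ void findFeaturesEpilogue(int3 *results)
{
    __shared__ unsigned char corn[32][32];  // filled by steps 2.1-2.8
    // ... steps 2.1-2.8 ...
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        int best = -1, bx = 0, by = 0;
        for (int y = 0; y < 32; ++y)
            for (int x = 0; x < 32; ++x)
                if (corn[y][x] > best) { best = corn[y][x]; bx = x; by = y; }
        results[blockIdx.y * gridDim.x + blockIdx.x] =
            make_int3(blockIdx.x * 32 + bx, blockIdx.y * 32 + by, best);
    }
}
```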

5.2. Feature tracking

In feature tracking, the most natural way to parallelize the algorithm is to let each block calculate the best shift for one feature. This is risky if we have only a few features to track (the number of blocks would be too small); the best performance will be achieved if we have more than 30 features, which is usually the case.

In each block, there are 15x15 shifts to evaluate. To meet the requirement of having the number of threads divisible by 32, the number of threads was chosen to be 4x16. Each thread processes 4 consecutive pixels (shifts) in its row. This time each thread processes pixels from just one row, because otherwise the number of threads would be too small (32) to get good performance. As can be noticed, 16x16 shifts will be evaluated instead of the required 15x15. We do these excess calculations because they do not increase the processing time in any way and, moreover, our search distance is expanded.

This time there is no problem with the available amount of shared memory, as there are no intermediate results to store: each block needs only around 1 kB of memory (for the tracked feature description, the part of the image needed and the result), which is far less than half of the memory available.

The algorithm structure is analogous to the feature finding algorithm:

1. Copy the new image data from the host to a CUDA array on the device and bind it to the texture. Copy also the description of the feature and its tracked surrounding (7x7 pixels) and the position of the central pixel to an array on the device. A feature's description consists of 49x1 bytes for the pixel values + 2x2 bytes for the central pixel's position = 53 bytes. To reduce the number of device memory operations, which are rather costly, we ensure the alignment of the data to 64 bytes (for details see [1] about global memory access patterns).
2. Run the kernel in each block; in each thread:
   2.1. Load your 4 pixels from the texture into shared memory aligned to 32 bytes. Designate some threads to load the remaining apron pixels. Having each thread access every 4th pixel minimizes the bank conflicts and reduces the time of the memory transactions.
   2.2. Designate the first 16 threads to load the feature's description array into shared memory, each loading 4 bytes at once to avoid bank conflicts.
   2.3. For your first pixel (evaluated shift), get the 7x7 surrounding pixels and calculate the SSD.
   2.4. For the next pixels in your row, get only the missing right column, replace the no-longer-needed first column and calculate the SSD. This saves memory transactions.
   2.5. Choose only the best pixel and send its value and coordinates back to shared memory. In that way there are no bank conflicts, so all threads store their result in only one memory transaction.
   2.6. Designate one thread to find the best shift for the feature and write it to device memory.
3. Copy the results from the device to the host memory at once to avoid repetitive overhead.
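The SSD of step 2.3 reduces to the following sketch (assumed names; on the GPU, patch holds the 7x7 feature description loaded in step 2.2 and the candidate pixels come from the image tile in shared memory):

```
// Sum of squared differences between the stored 7x7 feature patch and a
// candidate patch centered at (cx, cy) in the current image tile.
__device__ int ssd7x7(const unsigned char patch[7][7],
                      const unsigned char *tile, int tileW, int cx, int cy)
{
    int s = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx) {
            int d = (int)patch[dy + 3][dx + 3]
                  - (int)tile[(cy + dy) * tileW + (cx + dx)];
            s += d * d;  // 7x7 = 49 multiply-adds per evaluated shift
        }
    return s;
}
```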

6. Performance results

The performance of the two functions (feature finding and feature tracking) was analyzed independently. Results achieved on the GPU were compared to the results of running analogous serial code on one CPU core. The measured times covered the whole process from the point where the GPU and CPU versions diverged to the point where both versions converge back to the same code. As a result, if for example the GPU version included some additional data copying and preprocessing, or the CPU version required some extra memory allocation, these operations were also included in the time measurement. This method of measurement was chosen because only after analyzing the process as a whole can we estimate the actual speed-up achieved by porting some code to the GPU. For example, even if the key GPU algorithm were extremely fast, but the required overheads and memory allocations outweighed the time savings, it should be concluded that there was no actual profit from porting the code to the GPU.

6.1. Feature finding

The performance of the feature_finding function was evaluated for several launching parameters. In particular, the number of threads in a block and the shared memory alignments were varied.

First we tried to change the shared memory alignment. Alignment to 64 bytes, implemented in the original version, ensures that each row starts in the first bank. This should minimize the number of bank conflicts, but as it turned out, in our case it consumed a lot of memory (one row needed only 36 bytes, so over half was wasted) and therefore fewer blocks could be launched on one multiprocessor at once. Also, alignment to 64 is profitable mostly when all 16 threads in a half-warp access the same row. In our case 16 consecutive threads access 4 rows, so there is a 4-way bank conflict anyway. Therefore the other run was made with alignment to 4, which does not reduce the bank conflicts, but at least saves the memory.

The other parameter tested was the number of threads in a block. The original number of 8x16 was increased to 8x32. In this case there are more threads to overlap the memory accesses, but on the other hand, less computation is done by one thread to overlap these accesses. Table 1 summarizes the results obtained for the different launching parameters. The times were averaged over 10 runs.

Table 1. Speed-ups in the feature_finding function, no warm-up made.

  Launching options        Speed-up
  (8x16), aligned to 64    3.66
  (8x16), aligned to 4     3.66
  (8x32), aligned to 4     3.66

As can be seen, in all cases the speed-up compared to the CPU does not change and is equal to 3.66. On the CPU the time was around 500 ms and on the GPU around 140 ms.

In the CUDA examples provided by NVIDIA, the time of a kernel function is always measured after a warm-up, that is, after running the function once before the actual run to warm up the graphics card. A warm-up is done because the first call of the function needs to initialize the driver, allocate some memory for CUDA on the GPU, initialize the context, copy the kernel code over to the GPU etc., which gives a big overhead. Performing one warm-up before the first call to the function didn't change the result. This is probably due to the fact that we use two kernels, so all the initializations done by our warm-up are lost during the load of the other kernel.
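Our reconstruction of the warm-up measurement (the kernel name findFeatures is a placeholder; the launch configuration matches the 300 blocks discussed in section 5.1, and the event API calls are standard CUDA):

```
#include <cstdio>

__global__ void findFeatures(int3 *results);      // hypothetical kernel (declaration only)

void benchmark(int3 *devResults)
{
    dim3 grid(20, 15);                            // 300 blocks: 640x480 image, 32x32 regions
    dim3 threads(4, 16);

    // Launch once to absorb the one-time overhead (driver initialization,
    // context setup, kernel code upload), then time a second launch.
    findFeatures<<<grid, threads>>>(devResults);  // warm-up, result ignored
    cudaThreadSynchronize();                      // CUDA 2.x-era synchronization call

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    findFeatures<<<grid, threads>>>(devResults);  // measured launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);
}
```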

Performing a warm-up before each call to the function, however, gave impressive results, summarized in Table 2.

Table 2. Speed-ups in the feature_finding function, warm-up made.

  Launching options        Speed-up
  (8x16), aligned to 64
  (8x16), aligned to 4
  (8x32), aligned to 4

This time the function was computed in only 1.3 ms on the GPU. This benchmark showed how huge the overhead can be for complex kernels. Only after a warm-up are the changes between different launching options visible. Alignment to 4 accelerates the computations, as much less of the shared memory is used by one thread, so more warps can be executed at a time. On the other hand, increasing the number of threads didn't improve the performance, probably because too few computations are done by one thread to overlap the memory transactions.

6.2. Feature tracking

For the feature_tracking function, no benchmarking seemed worth trying. First, the number of blocks is determined by the number of features. Secondly, as a block analyses a part of an image of size 15x15, there is not much freedom in manipulating the number of threads: a lower number than the chosen one would fall below the claimed minimum of 32 threads in a warp, and more threads would reduce the already small amount of calculation done by each thread. The alignment of the shared memory was optimally set to 32.

The experiments were run on two video streams, one producing around 40 corners in the feature_finding function, and the other producing around 70 corners. A comparison was done to find out how the number of tracked features influences the times on the CPU and the GPU. The results show that the time on the CPU increases linearly with the number of features, while the time on the GPU doesn't change. The results are summarized in Table 3.

Table 3. Speed-ups in the feature_tracking function.

  Launching options           Speed-up
  ~40 features, warm-up
  ~40 features, no warm-up
  ~70 features, warm-up
  ~70 features, no warm-up    36.90

Another thing that should be noticed is that for the feature_tracking function the overhead of the first function call is much smaller, so even without a warm-up the results are very good. The overhead is smaller because the function is much simpler (less code has to be copied to the device) and less memory needs to be allocated.

7. Discussion of results

Before discussing the results, it should be stressed that the CPU algorithm was implemented as is, without optimizations. For example, the reuse of previously calculated results for consecutive computations in the find_features function could also be implemented in the CPU version. When programming for a GPU, the memory limitations force one to look for various ways of optimization, which then turn out to be useful also for the CPU.

The speed-ups observed for both functions originate from several sources: thread parallelization designed with respect to the optimal number of blocks and threads; evenly distributed use of the different types of memory available on the GPU with respect to their bandwidth and availability; fewer computations, made with faster single precision operators; fewer memory accesses, obtained by coalescing and the avoidance of bank conflicts; and the use of textures with built-in handling of out-of-border pixels.

For the find_features function, particularly impressive results are obtained with the warm-up technique, which is not very practical in real-world applications: there is no sense in running each kernel twice just to have a very fast second run. So as the actual speed-up of that function we should consider 3.66, which is also a good result. The huge overhead implies that there is not much sense in further optimization of the algorithm (e.g. further reduction of bank conflicts), as the gains would be swamped by the overhead. The only solution would be to reduce the overhead by dividing the kernel into smaller kernels, but then we would lose the advantage of not copying the intermediate results back to the CPU, so it probably wouldn't improve the results.

The speed-up is sometimes obtained at the cost of accuracy: for the find_features function, storing the cornerness values as floats would overload the memory, so they were rounded to integer numbers. As a result, the CPU version sometimes finds slightly different corners, but the experiments showed that the CPU's corners are not significantly easier to track.

The advantage of the track_features function on the GPU is that it is more scalable than the CPU version. Increasing the number of tracked features didn't change the time, while on the CPU the dependence was linear. As a result, for larger sets of features an even bigger speed-up is observed.

For both functions, further optimizations are always possible in CUDA. For example, all the loops could be unrolled, or some kind of hash function could be used to load pixels into shared memory to avoid bank conflicts completely.

But those optimizations would largely reduce the readability of the code. We claim that one should always try to find the golden mean and not over-optimize code that already serves its purpose.

The drawback of programming for GPUs is that the performance of the code highly depends on the application and even on the particular data set. Therefore there are no golden rules (for example for the optimal number of blocks) applicable in all situations. The number of blocks can't even be estimated or precalculated, as it depends on many factors like the structure of the code, the number of divergent operations, loops, the amount of memory needed etc. The only way to optimize the performance is to test various settings empirically. What is more, the code is not very portable and can rarely be reused in other applications. For example, in the feature_tracking function, a change of the search distance would require a major reorganization of the code.

Finally, it is worth mentioning the tool that helps to analyze the performance of the code. The CUDA Visual Profiler is an application that, by running the code several times, can analyze the actual use of the different types of memory, the divergence of the code, the number of coalesced and uncoalesced reads etc. We found it a very useful tool for locating the weak points in the code and for giving directions in which we should try to optimize. For example, the analysis of the find_features function showed that with alignment to 64 the occupancy of the GPU processors was only 25%. Reducing the alignment to 4 increased that number to 50%. Increasing the number of threads gave 75% occupancy, but the actual running time of the kernel increased. After analyzing other Visual Profiler statistics, it could be noticed that the use of the registers fell, which showed that although the occupancy was high (many threads were loaded and unloaded all the time), the number of calculations per thread decreased so much that it could not overlap the memory transactions. That kind of analysis helps to understand the characteristics and requirements of a particular kernel function.

To sum up our experience with CUDA so far, it is a very promising tool for accelerating computations. It makes it possible to implement solutions for real-time applications that are out of reach for serial computations on a CPU. The GPU architecture and CUDA are especially well suited for image processing algorithms; the design of the texture structures, for example, even simplifies the code in comparison to the CPU version, as the programmer doesn't have to bother about border cases or interpolation when reading between the pixels. The disadvantage of programming for GPUs is that the performance doesn't come for free: there are lots of rules to follow, which are sometimes difficult to satisfy all at once. The code can hardly be optimized analytically; the only good way is to test different settings empirically. The CUDA Visual Profiler is a good tool to analyze and compare the benchmarks. Furthermore, the code has to be tied to a particular algorithm, and even to a data set, to perform well. The code for a GPU is usually a few times longer and more complicated than for a CPU. Also, one should keep in mind that the results reported by CUDA programmers usually involve a previous warm-up, which will in most cases not be done in real-world applications.

So, in CUDA terms, there is a big overhead for writing a good GPU program, but the speed-ups obtained can be impressive.

8. Possible applications

The information about the shifts of corners between frames in a video stream can be used for various purposes. First, the speeds of objects in the scene can be tracked. Figure 3 shows a possible application.

Fig. 3 Feature tracking: visualization of motion speed

Fig. 4 Feature tracking: visualization of movement direction

We tried to track the movement of birds in a water basin. Each bird is marked with a couple of corners. The color of a marker indicates the speed: blue corners indicate the lowest speed and red ones the fastest.

In a second application we marked the directions and velocities of the moving features. Figure 4 shows a screenshot of a video with a tracked lorry. By analyzing the direction of movement, features belonging to the static landscape can be distinguished from the moving object.

Another piece of information that can be retrieved from corner tracking is the depth of an image. If the video shows static scenery and the camera moves steadily along it, the depth of the picture can be estimated, as objects close to the camera will move faster than objects in the distance.

9. References

1. NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, Version 2.0.
2. Konstantinos G. Derpanis, The Harris Corner Detector.
3. Joe Stam, Stereo Imaging with CUDA.
4. Victor Podlozhnyuk, Image Convolution with CUDA.
5. J. P. Lewis, Fast Normalized Cross-correlation, Industrial Light & Magic.
6. Sobel filter example, NVIDIA CUDA SDK (SobelFilter).


More information

Advanced CUDA Optimizations

Advanced CUDA Optimizations Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method Intro to Template Matching and the Lucas-Kanade Method Appearance-Based Tracking current frame + previous location likelihood over object location current location appearance model (e.g. image template,

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU

More information

Accelerating the acceleration search a case study. By Chris Laidler

Accelerating the acceleration search a case study. By Chris Laidler Accelerating the acceleration search a case study By Chris Laidler Optimization cycle Assess Test Parallelise Optimise Profile Identify the function or functions in which the application is spending most

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its

More information

Optimising for the p690 memory system

Optimising for the p690 memory system Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

CMPSCI 691AD General Purpose Computation on the GPU

CMPSCI 691AD General Purpose Computation on the GPU CMPSCI 691AD General Purpose Computation on the GPU Spring 2009 Lecture 5: Quantitative Analysis of Parallel Algorithms Rui Wang (cont. from last lecture) Device Management Context Management Module Management

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Local Feature Detectors

Local Feature Detectors Local Feature Detectors Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Slides adapted from Cordelia Schmid and David Lowe, CVPR 2003 Tutorial, Matthew Brown,

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Real-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov

Real-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov Real-Time Scene Reconstruction Remington Gong Benjamin Harris Iuri Prilepov June 10, 2010 Abstract This report discusses the implementation of a real-time system for scene reconstruction. Algorithms for

More information

Histogram calculation in CUDA. Victor Podlozhnyuk

Histogram calculation in CUDA. Victor Podlozhnyuk Histogram calculation in CUDA Victor Podlozhnyuk vpodlozhnyuk@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 06/15/2007 vpodlozhnyuk First draft of histogram256 whitepaper

More information

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100 CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor

More information

Implementation Of Harris Corner Matching Based On FPGA

Implementation Of Harris Corner Matching Based On FPGA 6th International Conference on Energy and Environmental Protection (ICEEP 2017) Implementation Of Harris Corner Matching Based On FPGA Xu Chengdaa, Bai Yunshanb Transportion Service Department, Bengbu

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information