Motion tracking on CUDA architecture


Marek Blazewicz, Krzysztof Kurowski, Bogdan Ludwiczak, Krystyna Woloszczuk

Poznan Supercomputing and Networking Center, Applications Department, ul. Noskowskiego, Poznan, Poland, {marqs, krzysztof.kurowski, bogdanl, krysia}@man.poznan.pl

Abstract. This technical report describes an implementation of a simple motion tracking algorithm on a GPU using the CUDA architecture. It discusses the details of the algorithm and the way it was ported to the GPU. The measured speed-ups in comparison to serial code on a CPU are presented and discussed. Finally, possible applications that were tested are shown.

Keywords: CUDA, Nvidia, GPU, HPC, motion tracking, feature finding.

1. Introduction

Aiming at more and more advanced problems, the computing industry is rapidly moving toward parallel processing and many-core architectures. The increasing programmability of graphics processing units (GPUs) allows us to use these chips not only for the specific graphics computations for which they were designed, but also for general purpose computing problems. This field is called GPGPU (General Purpose computation on GPUs).

The main aim of this work was to get acquainted with a software development kit called CUDA (Compute Unified Device Architecture), an extension to the C programming language designed for the development of GPU code. In this work we try to efficiently port a simple motion tracking algorithm to a GPU, keeping in mind all the requirements and limitations of the architecture. The graphics card used for the experiments was an NVIDIA GTX 280 series card. We used the Eclipse environment to develop the code, the OpenCV library to handle video input/output and the CUDA Visual Profiler to optimize the performance.

Section 2 briefly describes the characteristics of the GTX 280 hardware, the CUDA architecture and its performance optimization issues; for more details see [1]. Section 3 approaches general issues of the motion tracking problem, distinguishing between the problem of feature finding and feature tracking. Section 4 details the algorithms that were chosen for both problems. In section 5 we describe how the algorithm was ported to the GPU. Section 6 details the performance results depending on the chosen launching parameters and section 7 discusses the results. In section 8 we show a few possible applications that were implemented. Finally, section 9 lists the references used in the article.

2. GPU and CUDA architecture

The programmable GPU is a highly parallel, multi-threaded, many-core processor with very high computational power and memory bandwidth. The main difference between a CPU and a GPU is that graphics cards, specialized for graphics rendering (i.e. compute-intensive, highly parallel computation), have more transistors devoted to data processing than to data caching and flow control.

The GTX 280 graphics card consists of 30 multiprocessors (MP) with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache. Each multiprocessor has eight stream processors (SP). A multiprocessor is responsible for creating, managing, and executing concurrent threads. Lightweight thread creation, zero-overhead thread scheduling and fast barrier synchronization efficiently support programs that can be decomposed into parallel subproblems according to the SIMD (single instruction, multiple data) model. For example, an image processing algorithm that performs the same operation on every pixel of an image can be ported to the GPU architecture very efficiently: each thread can be assigned one pixel and perform the calculations in a fast and highly parallel manner.

Threads are arranged in blocks, each block being executed on one multiprocessor. To manage hundreds of threads, a multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. As soon as a block of threads finishes its calculations, another block is launched on the multiprocessor. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads converge back to the same execution path. As a consequence, to improve performance a programmer should keep the number of divergent instructions in a warp as small as possible.
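As a minimal illustration of this point (our sketch, not code from the report), compare a kernel that branches on the thread index, which forces each warp to execute both paths serially, with one that branches on a condition that is uniform across a warp:

```
// Divergent: even and odd threads of the same warp take different paths,
// so the warp executes both branches one after the other.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}

// Uniform: all 32 threads of a warp evaluate the condition identically,
// so no serialization occurs.
__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}
```

Both kernels do the same amount of arithmetic; only the shape of the branch relative to warp boundaries differs.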

Each multiprocessor has four types of on-chip memory:

- one set of local 32-bit registers per processor,
- a parallel data cache (called shared memory) that is shared by all scalar processor cores,
- a read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory,
- a read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory.

Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering. The local and global memory spaces are read-write regions of device memory and are not cached.

A multiprocessor can execute at most eight thread blocks at a time. How many of them will actually be executed depends on how many registers per thread and how much shared memory per block are required for a given kernel (by a kernel we mean a function that is executed by each thread), since the multiprocessor's registers and shared memory are split among all the threads of the batch of blocks. So, to ensure that at least two blocks are executed concurrently, a programmer should assign at most half of the registers and half of the shared memory to one kernel. In the GTX 280, the amount of shared memory available per multiprocessor is 16 KB, and so is the amount of registers. Shared memory is organized in 16 banks.

To achieve high performance, a programmer should follow a number of rules concerning memory bandwidth, instruction throughput and control flow (a short sketch illustrating rules 5, 6 and 8 follows after this list):

1. Maximize the use of all available memory types.
2. Allow the thread scheduler to overlap memory transactions with mathematical computations as much as possible, which requires that each thread executes intensive arithmetic operations and that there are many active threads per multiprocessor.
3. For floating point operations, if double precision is not crucial for the algorithm, use single precision operations, which are two to four times faster than double precision operations.
4. As mentioned earlier, minimize the number of divergent warps to avoid serialization. To reduce the number of serializations in loops, unroll them as much as possible.
5. Device memory has much higher latency and lower bandwidth than on-chip memory. Therefore minimize device memory accesses by staging data coming from device memory in shared memory, processing the data in shared memory and writing the results back to device memory. It is even better to avoid the data transfer entirely at the cost of recomputing the data whenever it is needed.
6. As for global memory, the device is capable of reading 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction. Therefore ensure that each thread in a warp accesses global memory in a way that can be compiled to a single load instruction (called coalesced memory access). Otherwise the accesses will be serialized.
7. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve the best performance.
8. The shared memory is organized in 16 banks such that successive 32-bit words are assigned to successive banks. Banks can be accessed simultaneously, so to achieve high memory bandwidth, memory read or write requests should fall into distinct memory banks.
9. To maximize the use of the available computing resources, there should be at least as many blocks as there are multiprocessors in the device (in the case of the GTX 280: 30). Running only one block per multiprocessor will force the multiprocessor to idle during thread synchronization and during device memory reads if there are not enough threads per block to cover the load latency. According to the CUDA Programming Guide, the number of blocks should therefore be at least 100.
10. The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps.
11. The bandwidth between host memory and device memory is very low. Therefore try to minimize the number of transfers between these two memories, for example by moving more code to the device and storing intermediate results in device memory. Also, because of the overhead associated with each transfer, batching many small transfers into one big transfer always performs much better than making each transfer separately.
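The following minimal sketch (ours, under the assumption that the block size equals the tile size) shows rules 5, 6 and 8 at work in a trivial 3-point averaging kernel: each thread performs one coalesced global load, designated threads fetch the one-element apron, and neighboring values are then reused from shared memory instead of being re-read from device memory.

```
#define TILE 256  // launch with blockDim.x == TILE

// Global memory is read once per element (coalesced, rule 6), reused from
// shared memory (rule 5), and consecutive threads reading consecutive
// floats touch distinct 32-bit banks (rule 8).
__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2];             // tile plus a one-element apron on each side
    int g = blockIdx.x * TILE + threadIdx.x;  // consecutive threads -> consecutive addresses
    s[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)                     // designated threads load the apron
        s[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                          // all loads finished before any reuse
    if (g < n)
        out[g] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}
```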

3. Motion Tracking

Motion tracking is the process of locating a moving object (or several of them) over time using a camera. The process can be divided into two challenging tasks:

1. How to choose good targets (features) that will be easy to track. Good features to track are generally those parts of an image which have high contrast. Therefore neither plain regions nor edges are of interest (the former would be difficult to track in any direction, the latter can be tracked in only one dimension). The points we are interested in are called corners: points that are traceable in two dimensions.

2. How to associate target locations in consecutive video frames. This is especially difficult when the objects are moving fast relative to the frame rate, the image is blurry, the objects rotate in the third dimension, there are rapid lighting changes or another object obscures the tracked object. As the main purpose of this work is mostly to get acquainted with CUDA, the proposed algorithm does not address those problems.

There are two main approaches to feature tracking. The first one chooses features in the first frame only and tries to find the best match for them in the consecutive frames within a chosen search region. This technique requires that the same feature can be detected reliably and consistently across many frames. A big disadvantage is that correspondence errors tend to be very large, for example when rotation, zoom, skew or lighting changes occur. As a consequence, we lose track of the feature rather quickly.

The second approach compares features between consecutive frames, not with the feature description from the first frame. This approach is more robust to rotation, lighting changes etc., so we can track the feature longer, but it sometimes causes the features to drift: we end up tracking a different feature than the one we started with. After comparing the advantages and disadvantages of both approaches, the second approach was chosen and implemented.

4. Algorithms

4.1. Feature finding

To find interesting features to track in an image, we decided to use the Harris corner detector, a popular point detector due to its strong invariance to rotation, scale, illumination variation and image noise. The Harris corner detector is based on the local auto-correlation function of a signal, which measures local changes of the signal with patches shifted by a small amount in different directions. If the changes of the signal are significant no matter what the direction of the shift is, we have found a corner.

Given a shift $(\Delta x, \Delta y)$ and a point $(x, y)$, the auto-correlation function is defined as

$$c(x,y) = \sum_{(x_i, y_i) \in w} \left[ I(x_i, y_i) - I(x_i + \Delta x, y_i + \Delta y) \right]^2, \qquad (1)$$

where $I$ denotes the brightness of a gray-scaled image and $(x_i, y_i)$ are the points in the window $w$ centered on $(x, y)$. Approximating $I(x_i + \Delta x, y_i + \Delta y)$ by its Taylor expansion,

$$c(x,y) = [\Delta x \ \ \Delta y] \; C(x,y) \; [\Delta x \ \ \Delta y]^T. \qquad (2)$$

$C(x,y)$ is a matrix that captures the intensity structure of the local neighborhood:

$$C = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}, \qquad (3)$$

where $I_x$ and $I_y$ are the partial derivatives of $I$ in the x and y direction respectively.

Let $\lambda_1, \lambda_2$ be the eigenvalues of the matrix $C(x,y)$. By analyzing the eigenvalues we can describe the characteristics of a point. Three situations can be distinguished:

1. If both $\lambda_1$ and $\lambda_2$ are small, the analyzed point is part of a plain region.
2. If one eigenvalue is high and the other low, we have found an edge.
3. If both eigenvalues are high, it indicates a corner.

Harris proposed a measure of cornerness as

$$H = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2. \qquad (4)$$

Technically, to calculate the cornerness of a point, we have to calculate the $I_x$ and $I_y$ derivatives of the image under the window, calculate the $C$ matrix and determine the eigenvalues. The $I_x$ and $I_y$ derivatives are calculated by convolving the image with the two separable Sobel filters:

$$\text{x derivative mask} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad (5)$$

$$\text{y derivative mask} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}. \qquad (6)$$
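A minimal serial sketch (our reconstruction, not the report's code; the names img, ix and iy are ours) of the convolution in Eqs. (5) and (6):

```
// Convolve a grayscale image with the 3x3 Sobel masks of Eqs. (5)-(6).
// Border pixels are skipped here; the GPU version instead reads them
// safely through the texture unit.
void sobel(const unsigned char *img, float *ix, float *iy, int w, int h)
{
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int i = y * w + x;
            // x derivative mask: [-1 0 1; -2 0 2; -1 0 1]
            ix[i] = -1.0f * img[i - w - 1] + 1.0f * img[i - w + 1]
                    -2.0f * img[i - 1]     + 2.0f * img[i + 1]
                    -1.0f * img[i + w - 1] + 1.0f * img[i + w + 1];
            // y derivative mask: [-1 -2 -1; 0 0 0; 1 2 1]
            iy[i] = -1.0f * img[i - w - 1] - 2.0f * img[i - w] - 1.0f * img[i - w + 1]
                    +1.0f * img[i + w - 1] + 2.0f * img[i + w] + 1.0f * img[i + w + 1];
        }
}
```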

$\lambda_1$ and $\lambda_2$ are calculated as:

$$\lambda_1 = 0.5 \left( C_{00} + C_{11} + \sqrt{(C_{00} - C_{11})^2 + 4\,C_{01}C_{10}} \right), \qquad (7)$$

$$\lambda_2 = 0.5 \left( C_{00} + C_{11} - \sqrt{(C_{00} - C_{11})^2 + 4\,C_{01}C_{10}} \right). \qquad (8)$$

In the algorithm we use a window with radius 1 (experiments showed that a larger radius is not necessary). Having calculated the cornerness value for each point in the image, we set a threshold above which we consider a point a good feature to track. To avoid having features located very close to each other, we divide the image into small parts and choose one representative (best) corner for each region.

To calculate the cornerness value for each point in the image, the algorithm requires 6 multiplications and additions to calculate each point of the x and y derivatives, 9x4 multiplications and additions to calculate the C matrix and around 6x2 multiplications and additions to calculate the lambdas and the cornerness. To choose the representatives for the corners, we additionally need corners_distance^2 comparisons.
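As a compact illustration, a sketch of Eqs. (7), (8) and (4) for one pixel, given the four entries of C (our reconstruction; the Harris constant k = 0.04 is a typical value that we assume, the report does not state the one it uses):

```
#include <math.h>

// Cornerness of a single pixel from its 2x2 structure matrix C.
// Usable both on the host and inside a kernel.
__host__ __device__ float cornerness(float c00, float c01, float c10, float c11)
{
    float tr   = c00 + c11;
    float disc = sqrtf((c00 - c11) * (c00 - c11) + 4.0f * c01 * c10);
    float l1   = 0.5f * (tr + disc);             // Eq. (7)
    float l2   = 0.5f * (tr - disc);             // Eq. (8)
    const float k = 0.04f;                       // assumed Harris constant
    return l1 * l2 - k * (l1 + l2) * (l1 + l2);  // Eq. (4)
}
```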

4.2. Feature tracking

The standard approach to feature tracking is to measure the correlation between the feature in one image and a point shifted by some distance in the consecutive frame. By comparing the correlations in the search region, we can choose the best match as the new position of the tracked feature. Among the strategies to measure the correspondence between two features under a mask, the SSD (sum of squared differences) is considered to be of good precision, but very time consuming and therefore applicable only in non-real-time programs.

The cost of the computations depends mostly on the range of the search region. Confining the search to a small region saves search time and makes the search process easier, but runs the risk of the tracked feature leaving the search region entirely between frames. The other factor that influences the tracking quality, but also the computation time, is the size of the patch which we treat as a feature and correlate between the images. Again, a small patch saves time, but is less accurate. We have established in experiments that, considering the tracking performance, a reasonable choice for most applications is a patch radius of 3 and a search distance radius of 7. Of course these values greatly depend on the particular application (how sharp and detailed the image is, i.e. how small the feature patch can be, how fast the features shift between frames etc.).

The chosen parameters require calculating 15x15 SSDs, each consisting of 7x7 multiplications and additions, for each tracked feature.

5. Implementation in CUDA

5.1. Feature finding

As the calculations for pixels are mostly local, the algorithm can be effectively parallelized by dividing the image into smaller regions and assigning each region to a different block of threads. The block calculates the cornerness for each pixel in its region and sends the position of the best pixel (the representative of the region) back to the host. The image data is saved in a texture, and the parts of it needed by a particular block are sent to and processed in the block's shared memory. Using a texture provides fast and cached access to the data and is especially well suited for image processing. For example, the methods for accessing the texture provide safe access to points outside the image border, whereas on a CPU safe access to points beyond the picture border must be explicitly ensured by the programmer; a sketch of such a texture setup follows below.
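A sketch of that texture setup, using the CUDA 2.x-era texture reference API that the report's timeframe implies (the names texImg and hostImg are ours); the clamped addressing mode is what makes out-of-border reads safe:

```
// Texture reference for the grayscale frame (module scope, legacy CUDA API).
texture<unsigned char, 2, cudaReadModeElementType> texImg;

void uploadFrame(const unsigned char *hostImg, int width, int height)
{
    static cudaArray *arr = 0;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    if (!arr)
        cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpyToArray(arr, 0, 0, hostImg, width * height,
                      cudaMemcpyHostToDevice);
    texImg.addressMode[0] = cudaAddressModeClamp;  // reads outside the image
    texImg.addressMode[1] = cudaAddressModeClamp;  // return the border pixel
    cudaBindTextureToArray(texImg, arr);
}
// Inside a kernel, tex2D(texImg, x - 1, y - 1) is safe even at (0, 0).
```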

According to the algorithm, each block has to allocate memory for an array with its part of the picture, two additional arrays to store the intermediate results of the x and y derivatives, and a table for the cornerness values. To save the memory needed for the last array, the cornerness values overwrite the original values in the first array. This means in particular that the cornerness values, calculated as floats, are rounded to unsigned chars, so the precision of the calculations is slightly reduced (but only by the non-integer part, because the cornerness values are always scaled to be smaller than 200). The experiments indeed showed that, compared to the CPU results, sometimes different corners were chosen. As we have not noticed that the GPU points were in any way more difficult to track than the CPU points, the sacrifice of precision for speed seems profitable.

Following the rules for optimizing the performance listed in section 2, there should be at least 100 blocks, each occupying at most half of the available shared memory. Having large blocks minimizes the number of operations, but requires a lot of memory, which degrades the performance. After careful calculations the size of the region was established at 32x32; for that size the memory needed is ~7 kB, which is slightly less than half of the available memory. With this size we also meet the requirement for the number of blocks: for a standard camera image of size 640x480 there will be 300 blocks.

The number of threads in a block should be divisible by 32, but not too large, because then the computations done by one thread would be too small to overlap the memory transactions. Following those rules, the number of threads was chosen to be 4x16: each thread processes four consecutive pixels in each of its rows.

Fig. 1 Way of accessing an array of 1-byte data in four iterations to avoid shared memory bank conflicts.

Fig. 2 Way of using previously calculated columns for consecutive pixels.

There are a few reasons behind the idea of processing four consecutive pixels in a row by one thread: it minimizes the bank conflicts, the number of memory operations and the amount of computation. The details are described below.
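A sketch of the Fig. 1 access pattern as we read it (our code, assumed names): with 1-byte pixels, thread t handles bytes 4t through 4t+3 of a 64-byte row, and in iteration k the 16 threads of a half-warp touch bytes 4t+k, which fall into 16 distinct 32-bit banks, so the accesses are conflict-free.

```
texture<unsigned char, 2, cudaReadModeElementType> texImg;  // as in the setup sketch

__global__ void loadRowDemo(int x0, int y0)
{
    __shared__ unsigned char row[64];
    // In iteration k, thread t writes byte 4*t + k. Since banks are 32 bits
    // wide, that byte lies in bank t, so the 16 threads of a half-warp hit
    // 16 distinct banks and the access is conflict-free.
    for (int k = 0; k < 4; ++k) {
        int col = 4 * threadIdx.x + k;
        row[col] = tex2D(texImg, x0 + col, y0);
    }
    __syncthreads();
    // ... gradient and cornerness computation would follow here ...
}
```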

The algorithm works as follows:

1. Copy the image data from the host to a CUDA array on the device and bind it to the texture.
2. Run the kernel in each block; in each thread:
   2.1. Load your 8 pixels (4 pixels in each of two rows) into shared memory. Designate some threads to load the remaining apron pixels. Having each thread access every 4th pixel minimizes the bank conflicts and reduces the time of the memory transactions; the idea is shown in Figure 1. To avoid more of the bank conflicts, the shared memory was aligned to 64 bytes (or 32 in another benchmark).
   2.2. For your first pixel, get the 9 surrounding pixels and calculate the x and y gradients.
   2.3. For the next pixels in your row, get only the missing right column, replace the no-longer-needed first column and calculate the gradients. This saves memory transactions; the idea is shown in Figure 2.
   2.4. Send all four pixels to the arrays with the x and y gradients at once, so that the write takes only one transaction.
   2.5. Do the same for your second row.
   2.6. For your first pixel, get the 9 surrounding gradient values and calculate the sums of Ix^2, Iy^2 and IxIy for the C matrix, saving the summing results for each column of pixels under the mask separately. Calculate the cornerness on this basis.
   2.7. For the next pixels in your row, load only the 3 missing values from both gradient arrays, calculate the sum of this column and replace the summing result for the no-longer-needed first column. This saves both memory operations and a number of calculations (again Figure 2). Then recalculate your C matrix and the cornerness.
   2.8. Send all four cornerness values to the image array (overwriting it) at once, so that the write takes only one transaction.
   2.9. Designate one thread to find the best corner in the region and write it to device memory.
3. Copy the results from the device to the host memory at once to avoid repetitive overhead.
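Step 2.9 might look like the following sketch (array and result names are assumed; the cornerness values are presumed to sit in a 32x32 shared array after step 2.8):

```
// One designated thread scans the block's 32x32 region for the highest
// cornerness value and writes the winning position and score to global
// memory, one entry per block.
__global__ void findFeaturesEpilogue(int3 *results)
{
    __shared__ unsigned char corn[32][32];  // filled by steps 2.1-2.8
    // ... steps 2.1-2.8 ...
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        int best = -1, bx = 0, by = 0;
        for (int y = 0; y < 32; ++y)
            for (int x = 0; x < 32; ++x)
                if (corn[y][x] > best) { best = corn[y][x]; bx = x; by = y; }
        results[blockIdx.y * gridDim.x + blockIdx.x] =
            make_int3(blockIdx.x * 32 + bx, blockIdx.y * 32 + by, best);
    }
}
```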

5.2. Feature tracking

In feature tracking, the most natural way to parallelize the algorithm is to let each block calculate the best shift for one feature. This is risky if we have only a few features to track (the number of blocks would be too small); the best performance will be achieved if we have more than 30 features, which is usually the case.

In each block, there are 15x15 shifts to evaluate. To meet the requirement of having the number of threads divisible by 32, the number of threads was chosen to be 4x16. Each thread processes 4 consecutive pixels (shifts) in its row. This time each thread processes pixels from just one row, because otherwise the number of threads would be too small (32) to get good performance. As can be noticed, 16x16 shifts will be evaluated instead of the required 15x15. We do these excess calculations because they do not increase the processing time in any way and, moreover, our search distance is expanded.

This time there is no problem with the available amount of shared memory, as there are no intermediate results to store: each block needs only around 1 kB of memory (for the tracked feature description, the part of the image needed and the result), which is far less than half of the memory available.

The algorithm structure is analogous to the feature finding algorithm:

1. Copy the new image data from the host to a CUDA array on the device and bind it to the texture. Copy also the description of the feature and its tracked surrounding (7x7 pixels) and the position of the central pixel to an array on the device. A feature's description consists of 49x1 bytes for the pixel values + 2x2 bytes for the central pixel's position = 53 bytes. To reduce the number of device memory operations, which are rather costly, we ensure the alignment of the data to 64 bytes (for details see [1] about global memory access patterns).
2. Run the kernel in each block; in each thread:
   2.1. Load your 4 pixels from the texture into shared memory aligned to 32 bytes. Designate some threads to load the remaining apron pixels. Having each thread access every 4th pixel minimizes the bank conflicts and reduces the time of the memory transactions.
   2.2. Designate the first 16 threads to load the feature's description array into shared memory, each loading 4 bytes at once to avoid bank conflicts.
   2.3. For your first pixel (evaluated shift), get the 7x7 surrounding pixels and calculate the SSD.
   2.4. For the next pixels in your row, get only the missing right column, replace the no-longer-needed first column and calculate the SSD. This saves memory transactions.
   2.5. Choose only the best pixel and send its value and coordinates back to shared memory. In that way there are no bank conflicts, so all threads store their result in only one memory transaction.
   2.6. Designate one thread to find the best shift for the feature and write it to device memory.
3. Copy the results from the device to the host memory at once to avoid repetitive overhead.
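The SSD of step 2.3 reduces to the following sketch (assumed names; on the GPU, patch holds the 7x7 feature description loaded in step 2.2 and the candidate pixels come from the image tile in shared memory):

```
// Sum of squared differences between the stored 7x7 feature patch and a
// candidate patch centered at (cx, cy) in the current image tile.
__device__ int ssd7x7(const unsigned char patch[7][7],
                      const unsigned char *tile, int tileW, int cx, int cy)
{
    int s = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx) {
            int d = (int)patch[dy + 3][dx + 3]
                  - (int)tile[(cy + dy) * tileW + (cx + dx)];
            s += d * d;  // 7x7 = 49 multiply-adds per evaluated shift
        }
    return s;
}
```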

6. Performance results

The performance of the two functions (feature finding and feature tracking) was analyzed independently. Results achieved on the GPU were compared to the results of running analogous serial code on one CPU core. The measured times covered the whole process from the point where the GPU and CPU versions diverged to the point where both versions converge back to the same code. As a result, if for example the GPU version included some additional data copying and preprocessing, or the CPU version required some extra memory allocation, these operations were also included in the time measurement. This method of measurement was chosen because only after analyzing the process as a whole can we estimate the actual speed-up achieved by porting some code to the GPU. For example, even if the key GPU algorithm were extremely fast, but the required overheads and memory allocations outweighed the time savings, it should be concluded that there was no actual profit from porting the code to the GPU.

6.1. Feature finding

The performance of the feature_finding function was evaluated for several launching parameters. In particular, the number of threads in a block and the shared memory alignments were varied.

First we tried to change the shared memory alignment. Alignment to 64 bytes, implemented in the original version, ensures that each row starts in the first bank. This should minimize the number of bank conflicts, but as it turned out, in our case it consumed a lot of memory (one row needed only 36 bytes, so over half was wasted) and therefore fewer blocks could be launched on one multiprocessor at once. Also, alignment to 64 is profitable mostly when all 16 threads in a half-warp access the same row. In our case 16 consecutive threads access 4 rows, so there is a 4-way bank conflict anyway. Therefore the other run was made with alignment to 4, which does not reduce the bank conflicts, but at least saves the memory.

The other parameter tested was the number of threads in a block. The original number of 8x16 was increased to 8x32. In this case there are more threads to overlap the memory accesses, but on the other hand, less computation is done by one thread to overlap these accesses. Table 1 summarizes the results obtained for the different launching parameters. The times were averaged over 10 runs.

Table 1. Speed-ups in the feature_finding function, no warm-up made.

  Launching options        Speed-up
  (8x16), aligned to 64    3.66
  (8x16), aligned to 4     3.66
  (8x32), aligned to 4     3.66

As can be seen, in all cases the speed-up compared to the CPU does not change and is equal to 3.66. On the CPU the time was around 500 ms and on the GPU around 140 ms.

In the CUDA examples provided by NVIDIA, the time of a kernel function is always measured after a warm-up, that is, after running the function once before the actual run to warm up the graphics card. A warm-up is done because the first call of the function needs to initialize the driver, allocate some memory for CUDA on the GPU, initialize the context, copy the kernel code over to the GPU etc., which gives a big overhead. Performing one warm-up before the first call to the function didn't change the result. This is probably due to the fact that we use two kernels, so all the initializations done by our warm-up are lost during the load of the other kernel.
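Our reconstruction of the warm-up measurement (the kernel name findFeatures is a placeholder; the launch configuration matches the 300 blocks discussed in section 5.1, and the event API calls are standard CUDA):

```
#include <cstdio>

__global__ void findFeatures(int3 *results);      // hypothetical kernel (declaration only)

void benchmark(int3 *devResults)
{
    dim3 grid(20, 15);                            // 300 blocks: 640x480 image, 32x32 regions
    dim3 threads(4, 16);

    // Launch once to absorb the one-time overhead (driver initialization,
    // context setup, kernel code upload), then time a second launch.
    findFeatures<<<grid, threads>>>(devResults);  // warm-up, result ignored
    cudaThreadSynchronize();                      // CUDA 2.x-era synchronization call

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    findFeatures<<<grid, threads>>>(devResults);  // measured launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);
}
```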

Performing a warm-up before each call to the function, however, gave impressive results, summarized in Table 2.

Table 2. Speed-ups in the feature_finding function, warm-up made.

  Launching options        Speed-up
  (8x16), aligned to 64
  (8x16), aligned to 4
  (8x32), aligned to 4

This time the function was computed in only 1.3 ms on the GPU. This benchmark showed how huge the overhead can be for complex kernels. Only after a warm-up are the changes between different launching options visible. Alignment to 4 accelerates the computations, as much less of the shared memory is used by one thread, so more warps can be executed at a time. On the other hand, increasing the number of threads didn't improve the performance, probably because too few computations are done by one thread to overlap the memory transactions.

6.2. Feature tracking

For the feature_tracking function, no benchmarking seemed worth trying. First, the number of blocks is determined by the number of features. Secondly, as a block analyses a part of an image of size 15x15, there is not much freedom in manipulating the number of threads: a lower number than the chosen one would fall below the claimed minimum of 32 threads in a warp, and more threads would reduce the already small amount of calculation done by each thread. The alignment of the shared memory was optimally set to 32.

The experiments were run on two video streams, one producing around 40 corners in the feature_finding function, and the other producing around 70 corners. A comparison was done to find out how the number of tracked features influences the times on the CPU and the GPU. The results show that the time on the CPU increases linearly with the number of features, while the time on the GPU doesn't change. The results are summarized in Table 3.

Table 3. Speed-ups in the feature_tracking function.

  Launching options           Speed-up
  ~40 features, warm-up
  ~40 features, no warm-up
  ~70 features, warm-up
  ~70 features, no warm-up    36.90

Another thing that should be noticed is that for the feature_tracking function the overhead of the first function call is much smaller, so even without a warm-up the results are very good. The overhead is smaller because the function is much simpler (less code has to be copied to the device) and less memory needs to be allocated.

7. Discussion of results

Before discussing the results, it should be stressed that the CPU algorithm was implemented as is, without optimizations. For example, the reuse of previously calculated results for consecutive computations in the find_features function could also be implemented in the CPU version. When programming for a GPU, the memory limitations force one to look for various ways of optimization, which then turn out to be useful also for the CPU.

The speed-ups observed for both functions originate from several sources: thread parallelization designed with respect to the optimal number of blocks and threads; evenly distributed use of the different types of memory available on the GPU with respect to their bandwidth and availability; fewer computations, made with faster single precision operators; fewer memory accesses, obtained by coalescing and the avoidance of bank conflicts; and the use of textures with built-in handling of out-of-border pixels.

For the find_features function, particularly impressive results are obtained with the warm-up technique, which is not very practical in real-world applications: there is no sense in running each kernel twice just to have a very fast second run. So as the actual speed-up of that function we should consider 3.66, which is also a good result. The huge overhead implies that there is not much sense in further optimization of the algorithm (e.g. further reduction of bank conflicts), as the gains would be swamped by the overhead. The only solution would be to reduce the overhead by dividing the kernel into smaller kernels, but then we would lose the advantage of not copying the intermediate results back to the CPU, so it probably wouldn't improve the results.

The speed-up is sometimes obtained at the cost of accuracy: for the find_features function, storing the cornerness values as floats would overload the memory, so they were rounded to integer numbers. As a result, the CPU version sometimes finds slightly different corners, but the experiments showed that the CPU's corners are not significantly easier to track.

The advantage of the track_features function on the GPU is that it is more scalable than the CPU version. Increasing the number of tracked features didn't change the time, while on the CPU the dependence was linear. As a result, for larger sets of features an even bigger speed-up is observed.

For both functions, further optimizations are always possible in CUDA. For example, all the loops could be unrolled, or some kind of hash function could be used to load pixels into shared memory to avoid bank conflicts completely.

But those optimizations would largely reduce the readability of the code. We claim that one should always try to find the golden mean and not over-optimize code that already serves its purpose.

The drawback of programming for GPUs is that the performance of the code highly depends on the application and even on the particular data set. Therefore there are no golden rules (for example for the optimal number of blocks) applicable in all situations. The number of blocks can't even be estimated or precalculated, as it depends on many factors like the structure of the code, the number of divergent operations, loops, the amount of memory needed etc. The only way to optimize the performance is to test various settings empirically. What is more, the code is not very portable and can rarely be reused in other applications. For example, in the feature_tracking function, a change of the search distance would require a major reorganization of the code.

Finally, it is worth mentioning the tool that helps to analyze the performance of the code. The CUDA Visual Profiler is an application that, by running the code several times, can analyze the actual use of the different types of memory, the divergence of the code, the number of coalesced and uncoalesced reads etc. We found it a very useful tool for locating the weak points in the code and for giving directions in which we should try to optimize. For example, the analysis of the find_features function showed that with alignment to 64 the occupancy of the GPU processors was only 25%. Reducing the alignment to 4 increased that number to 50%. Increasing the number of threads gave 75% occupancy, but the actual running time of the kernel increased. After analyzing other Visual Profiler statistics, it could be noticed that the use of the registers fell, which showed that although the occupancy was high (many threads were loaded and unloaded all the time), the number of calculations per thread decreased so much that it could not overlap the memory transactions. That kind of analysis helps to understand the characteristics and requirements of a particular kernel function.

To sum up our experience with CUDA so far, it is a very promising tool for accelerating computations. It makes it possible to implement solutions for real-time applications that are out of reach for serial computations on a CPU. The GPU architecture and CUDA are especially well suited for image processing algorithms; the design of the texture structures, for example, even simplifies the code in comparison to the CPU version, as the programmer doesn't have to bother about border cases or interpolation when reading between the pixels. The disadvantage of programming for GPUs is that the performance doesn't come for free: there are lots of rules to follow, which are sometimes difficult to satisfy all at once. The code can hardly be optimized analytically; the only good way is to test different settings empirically. The CUDA Visual Profiler is a good tool to analyze and compare the benchmarks. Furthermore, the code has to be tied to a particular algorithm, and even to a data set, to perform well. The code for a GPU is usually a few times longer and more complicated than for a CPU. Also, one should keep in mind that the results reported by CUDA programmers usually involve a previous warm-up, which will in most cases not be done in real-world applications.

So, in CUDA terms, there is a big overhead for writing a good GPU program, but the speed-ups obtained can be impressive.

8. Possible applications

The information about the shifts of corners between frames in a video stream can be used for various purposes. First, the speeds of objects in the scene can be tracked. Figure 3 shows a possible application.

Fig. 3 Feature tracking: visualization of motion speed

Fig. 4 Feature tracking: visualization of movement direction

We tried to track the movement of birds in a water basin. Each bird is marked with a couple of corners. The color of a marker indicates the speed: blue corners indicate the lowest speed and red ones the fastest.

In a second application we marked the directions and velocities of the moving features. Figure 4 shows a screenshot of a video with a tracked lorry. By analyzing the direction of movement, features belonging to the static landscape can be distinguished from the moving object.

Another piece of information that can be retrieved from corner tracking is the depth of an image. If the video shows static scenery and the camera moves steadily along it, the depth of the picture can be estimated, as objects close to the camera will move faster than objects in the distance.

9. References

1. NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, Version 2.0.
2. Konstantinos G. Derpanis, The Harris Corner Detector.
3. Joe Stam, Stereo Imaging with CUDA.
4. Victor Podlozhnyuk, Image Convolution with CUDA.
5. J. P. Lewis, Fast Normalized Cross-correlation, Industrial Light & Magic.
6. Sobel filter example, NVIDIA CUDA SDK (SobelFilter).


More information

Advanced CUDA Optimizations

Advanced CUDA Optimizations Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method Intro to Template Matching and the Lucas-Kanade Method Appearance-Based Tracking current frame + previous location likelihood over object location current location appearance model (e.g. image template,

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU

More information

Accelerating the acceleration search a case study. By Chris Laidler

Accelerating the acceleration search a case study. By Chris Laidler Accelerating the acceleration search a case study By Chris Laidler Optimization cycle Assess Test Parallelise Optimise Profile Identify the function or functions in which the application is spending most

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its

More information

Optimising for the p690 memory system

Optimising for the p690 memory system Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

CMPSCI 691AD General Purpose Computation on the GPU

CMPSCI 691AD General Purpose Computation on the GPU CMPSCI 691AD General Purpose Computation on the GPU Spring 2009 Lecture 5: Quantitative Analysis of Parallel Algorithms Rui Wang (cont. from last lecture) Device Management Context Management Module Management

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Local Feature Detectors

Local Feature Detectors Local Feature Detectors Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Slides adapted from Cordelia Schmid and David Lowe, CVPR 2003 Tutorial, Matthew Brown,

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Real-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov

Real-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov Real-Time Scene Reconstruction Remington Gong Benjamin Harris Iuri Prilepov June 10, 2010 Abstract This report discusses the implementation of a real-time system for scene reconstruction. Algorithms for

More information

Histogram calculation in CUDA. Victor Podlozhnyuk

Histogram calculation in CUDA. Victor Podlozhnyuk Histogram calculation in CUDA Victor Podlozhnyuk vpodlozhnyuk@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 06/15/2007 vpodlozhnyuk First draft of histogram256 whitepaper

More information

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100 CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor

More information

Implementation Of Harris Corner Matching Based On FPGA

Implementation Of Harris Corner Matching Based On FPGA 6th International Conference on Energy and Environmental Protection (ICEEP 2017) Implementation Of Harris Corner Matching Based On FPGA Xu Chengdaa, Bai Yunshanb Transportion Service Department, Bengbu

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information