B. Tech Project First Stage Report on GPU Based Image Processing

Submitted by Sumit Shekhar (05007028)

Under the guidance of Prof. Subhasis Chaudhari
1. Introduction

1.1 Graphic Processor Units

A graphics processor unit (GPU) is simply a processor attached to the graphics card used in video-game consoles, PlayStations and computers. GPUs differ from CPUs, the central processing units, in that they are massively threaded and parallel in their operations. This follows from the nature of their work: they are used for fast rendering, where the same operation is carried out for each pixel in the image. Thus, they have more transistors devoted to data processing rather than to flow control and data caching. Today, graphics processor units have outgrown their primary purpose. They are being used and promoted for scientific computation all over the world under the name of GPGPUs, or general-purpose GPUs; engineers are achieving several-fold speed-ups by running their programs on GPUs. Application fields are many: image processing, general signal processing, physics simulation, computational biology, etc.

1.2 Problem Statement and Motivation

Graphics processor units can speed up computation manifold over a traditional CPU implementation. Image processing, being inherently parallel, can be implemented quite effectively on a GPU. Thus, many applications which otherwise run slowly can be accelerated and put to useful real-time applications. This was the motivation behind my final-year project. NVIDIA CUDA is a parallel programming model and software environment which has been developed specifically to address the problem of programming the GPU efficiently, while being compatible with the wide variety of GPU cores available in the market. Further, being an extension of the standard C language, it presents a low learning curve to programmers, while giving them the flexibility to exercise their creativity in the parallel code they write. My task was to implement object-tracking algorithms using CUDA and optimize them. As part of the first stage, I implemented the bilateral filtering method on the GPU.
Traditionally, brute-force bilateral filters take a long time to run because (i) they cannot be implemented using FFT algorithms, as the calculation involves both spatial and range filtering, and (ii) they are not separable, and hence take O(n^2) computations. Using the GPU, I found them to run much faster than even on a high-end CPU.
2. CUDA Programming Model

2.1 Execution

CUDA extends the standard C language to make it applicable to parallel programming. Its various features include:

C functions are implemented on the GPU device as kernels, which are executed in parallel across several CUDA threads, as opposed to ordinary C functions, which are executed only once. Kernels are defined using the __global__ qualifier, which is again an extension added by CUDA. A kernel function is called using a special <<< >>> syntax, which specifies the number of threads with which the kernel has to execute. The <<< >>> syntax determines the organization of the threads and how they are executed. A typical syntax for calling a kernel function is shown below:

funcAdd<<<Grid, Block, 1>>>(A, B, C)

This illustrates how a general function is executed on the GPU in CUDA. The Block variable defines a block of threads, which can be one-dimensional, two-dimensional or three-dimensional. Each thread in the block is identified by a 3-component vector called threadIdx, whose x, y and z components respectively give a unique index to each thread within a block. Similarly, Grid defines the layout of the blocks, which can be either one-dimensional or two-dimensional. The blocks are likewise identified by a vector called blockIdx. Each block has a limit on the maximum number of threads it can contain, which is determined by the architecture of the unit. The threads of a block can be synchronized with each other using the __syncthreads() function and can also be made to access memory in synchronization. The grid and blocks can be shown by the following diagram:

Figure 1: Grid and Thread Blocks
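The launch syntax above can be made concrete with a minimal sketch. The kernel name funcAdd is taken from the example call; the element count n and the block size of 256 are illustrative assumptions, not values from the report:

```cuda
#include <cuda_runtime.h>

// Minimal kernel: each CUDA thread adds one element of A and B into C.
__global__ void funcAdd(const float *A, const float *B, float *C, int n)
{
    // Unique global index built from the block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may cover more threads than elements
        C[i] = A[i] + B[i];
}

// Host-side launch: a 1-D grid of 1-D blocks, 256 threads per block.
void launchAdd(const float *A, const float *B, float *C, int n)
{
    dim3 block(256, 1, 1);
    dim3 grid((n + block.x - 1) / block.x, 1, 1);
    funcAdd<<<grid, block>>>(A, B, C, n);
}
```

Unlike an ordinary C call, this single call runs the function body once per thread; each thread picks its own element via threadIdx and blockIdx.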
2.2 Memory Hierarchy

There are multiple memory spaces which a CUDA thread can use to access data. The different kinds available are shown below in the figure:

Figure 2: GPU Memory Model

Host Memory: This is the main memory of the computer, from/to which data can be loaded/written back from the device memory.

Local Memory: This memory is private to each thread running on the device.

Shared Memory: This is shared between the various threads of a block. It is on-chip memory, and hence access is very fast. It is divided into various banks, which are equally sized memory modules.

Global Memory: This memory is accessible to all threads and blocks, and is usually used to load the host data into the GPU memory. As this memory is not cached, access to it is not as fast, but the right access pattern can be used to maximize memory bandwidth.

Texture Memory: This is a cached memory, and hence faster than global memory. Texture memory can be loaded by the host, but can only be read by the device kernel. It also provides normalized access to the data, which is useful for reading images in the kernel.

Constant Memory: This is also a cached memory, for fast access to constant data.

2.3 Hardware Implementation

The GPU consists of an array of multiprocessors, such that the threads of a thread block run concurrently on one multiprocessor. As blocks finish, new blocks are launched on the vacated multiprocessors. The overall device architecture can be shown as:
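As an illustration of how on-chip shared memory is used (a sketch; the kernel name and the fixed block size of 256 are my own assumptions), a block can stage data into shared memory and then let threads read elements loaded by other threads after a barrier:

```cuda
// Reverses each 256-element segment of d in place, one segment per block.
__global__ void reverseBlock(float *d)
{
    __shared__ float tile[256];          // fast on-chip memory, shared by the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = d[i];                      // each thread loads one element
    __syncthreads();                     // wait until the whole tile is loaded

    d[i] = tile[blockDim.x - 1 - t];     // read an element loaded by another thread
}
```

Without the __syncthreads() barrier a thread could read a tile entry before the owning thread had written it, so barriers and shared memory go hand in hand.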
Figure 3: Multiprocessors on GPU

Each multiprocessor executes the threads in warps, which are groups of 32 parallel threads. Each multiprocessor contains local registers, shared memory that is shared by all its scalar processors, and a constant cache. A texture cache is available through a texture unit. The size of a block is limited by the number of registers required per thread and the amount of shared memory used. A kernel fails to launch if not even a single block can be launched on a multiprocessor.

2.4 A few important extensions in CUDA

Function type qualifiers:

__global__ declares a function as a kernel. The function is executed on the device and can be called from the host.

__device__ declares a function which is executed on the device and can be called only from the device.

__host__ identifies a function which is executed on, and called from, the host only.

Variable type qualifiers:

__device__ defines a variable stored in device memory. It resides in the global memory space and is accessible from all threads.

__constant__ declares a variable in the constant memory space.

__shared__ declares a variable in the shared memory space of a thread block.
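The qualifiers above can be seen together in one short sketch (function and variable names here are illustrative, not from the report):

```cuda
#include <cuda_runtime.h>

__constant__ float gain;                 // constant memory: cached, read-only in kernels

__device__ float square(float x)         // __device__: runs on, and is called from, the device
{
    return x * x;
}

__global__ void apply(float *d, int n)   // __global__: a kernel, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = gain * square(d[i]);
}

__host__ void uploadGain(float g)        // __host__: ordinary CPU-side function
{
    // The host writes the constant-memory symbol before launching the kernel.
    cudaMemcpyToSymbol(gain, &g, sizeof(float));
}
```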
Built-in variables:

gridDim: stores the dimensions of the grid as a 3-component vector.

blockIdx: stores the block index within the grid as a 3-component vector.

blockDim: stores the dimensions of a block.

threadIdx: stores the thread index within the block as a 3-component vector.

Run-time APIs:

cudaMalloc: allocates memory of the size given as input in the global memory space of the device; similar to malloc in C.

cudaFree: frees the memory allocated by cudaMalloc.

cudaMemcpy: copies data to/from device memory from/to host memory.
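The three run-time calls above typically appear together in a fixed round-trip pattern; a minimal sketch (the kernel, its name, and the array size are illustrative assumptions):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h[1024];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, bytes);                  // allocate device global memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // host -> device
    scale<<<(n + 255) / 256, 256>>>(d, n);           // launch the kernel
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d);                                     // release device memory

    printf("h[1] = %f\n", h[1]);
    return 0;
}
```

The direction flag of cudaMemcpy (cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost) determines which way the data moves.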
3. Bilateral Filters on GPU

3.1 Introduction

Bilateral filters were first introduced by Tomasi and Manduchi [1]. These filters smooth the image while keeping the edges intact, by means of a non-linear combination of the image values of nearby pixels. This is achieved by a combination of range filtering and spatial filtering. Range filters operate on the values of the image pixels rather than their locations; spatial filters take the location into account. By combining both of them, the paper achieves an edge-sensitive smoothing filter, which varies according to the image pixel value as well as its location. A general form of the filter is given by:

h(x) = k^{-1}(x) \int f(\xi) \, c(\xi, x) \, s(f(\xi), f(x)) \, d\xi

where the normalization term is

k(x) = \int c(\xi, x) \, s(f(\xi), f(x)) \, d\xi

Here, c(\xi, x) measures the geometric closeness between the pixels x and \xi, and s(f(\xi), f(x)) is a similarity function which measures how close the value of the image pixel f(\xi) is to the given value f(x). For the special case of Gaussian c(.) and s(.), the equations become:

c(\xi, x) = \exp\left(-\tfrac{1}{2} \left(\|\xi - x\| / \sigma_d\right)^2\right)

s(f(\xi), f(x)) = \exp\left(-\tfrac{1}{2} \left(|f(\xi) - f(x)| / \sigma_r\right)^2\right)

The functioning of the bilateral filter can be seen in the following figures:

Figure 4: (a) A step function perturbed by random noise (b) combined similarity weights for a pixel to the right of the step (c) final smoothed output [1]
3.2 Implementation

The bilateral filter cannot be implemented using FFT algorithms in this form, because the filter values change with image pixel location, depending on the image values of the neighbouring pixels. Moreover, it is also not separable in its current form. A brute-force algorithm was therefore used to implement the filter on both the GPU and the CPU. The pseudo-code for the algorithm is given as:

For input image I, Gaussian parameters σd and σr, output image Ib, and weight coefficients Wb:

1. Initialize all values of Ib and Wb to zero.
2. For each pixel (x, y) with intensity I(x, y):
   a. For each pixel (x', y') in the image with value I(x', y'), compute the associated weight:
      weight = exp( -(I(x', y') - I(x, y))^2 / 2σr^2 - ((x - x')^2 + (y - y')^2) / 2σd^2 )
   b. Update the weight sum: Wb(x, y) = Wb(x, y) + weight
   c. Update Ib(x, y) = Ib(x, y) + weight × I(x', y')
3. Normalize the result: Ib(x, y) = Ib(x, y) / Wb(x, y)

For the actual implementation, the filter radius was taken to be twice the value of the spatial sigma, as the Gaussian tail dies off quickly. This truncated filter was used as an approximation to the full kernel.

3.3 GPU Implementation

For the GPU implementation, the following template was followed:

{
    // load image from disk
    // load reference image from image (output)
    // allocate device memory for result
    // allocate array and copy image data
    // set texture parameters
    // access with normalized texture coordinates
    // bind the array to the texture

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // execute the kernel
    BilateralFilter<<<dimGrid, dimBlock, 0>>>(image, spat_filter, width, height, sigmar, sigmad);

    // check if kernel execution generated an error
    // allocate memory for the result on the host side
    // copy result from device to host
    // write result to file
    // cleanup memory
}
Some of the optimizations used in the code are:

- Texture memory was used for accessing the image values. Texture memory, being cached, provides fast access to the image data.
- The spatial filter was calculated in the host code and passed to the kernel as a constant matrix. This avoids recomputing the values for every pixel.

An NVIDIA 8600 graphics card was used to run the codes.

3.4 CPU Implementation

The CPU code was similar to the GPU code, except that the bilateral filter function was executed using for loops over all the pixels of the image, which run in parallel threads on the GPU. This was done to get a fair comparison of the CPU and GPU timings, as both run the same algorithm. The CPU under test was an Intel Quad Core processor running at 2.4 GHz.

3.5 Speed Comparison

A 512 x 512 grayscale Lena image was given as input to the program. The speed comparisons were made in two cases:

Varying σd, keeping σr constant

Results for various sigma values are tabulated below for σr = 0.1:

Spatial sigma (σd)   GPU Time (ms)   CPU Time (ms)   Speed GPU (Mpix/s)   Speed CPU (Mpix/s)   Ratio
1                    230             1880            1.14                 0.14                 8.2
2                    290             7310            0.90                 0.036                25.2
3                    330             16390           0.79                 0.016                50
4                    400             29010           0.66                 0.009                72.5
5                    520             45130           0.50                 0.005                87
6                    660             65200           0.40                 0.004                99

Thus, we can see that the CPU is much slower than the GPU in executing the same task. Further, the time taken by the CPU increases approximately as n^2 with the filter length. Hence, the ratio of speeds increases with the filter length, reaching about 100x in the last case.
Varying σr, keeping σd constant

The range sigma was also varied, keeping the spatial sigma constant. The execution time for both GPU and CPU was found to be almost constant across different values of σr.

Range sigma (σr)   GPU Time (ms) for σd = 5   CPU Time (ms) for σd = 3
0.1                518                        16390
0.2                516                        16360
0.3                512                        16370
0.4                507                        16320

Output Images:

Comparison of CPU and GPU output images: Original Image; GPU output for σd = 3, σr = 1; CPU output for σd = 3, σr = 1; Difference Image.

Variation with σr, keeping the spatial sigma σd constant (GPU outputs): σd = 3, σr = 0.1; σd = 3, σr = 0.3; σd = 5, σr = 0.6.
Variation with σd, keeping the range sigma σr constant (GPU outputs): σd = 1, σr = 0.1; σd = 5, σr = 0.3; σd = 10, σr = 0.6.

4. Conclusions

The GPU achieved a much better time response than the CPU in all cases of the filter implementation. The ratio of speeds increased with the filter length. The error between the GPU and CPU outputs was very low; thus the GPU performs the calculations quite accurately. Variation of the spatial sigma and range sigma showed the desired changes in the output image. Increasing the range sigma, keeping the other constant, increased the blurring across the edges, as expected. Similarly, keeping the range sigma constant and increasing the spatial sigma resulted in better smoothing of the images without disturbing the edges.

5. Future Work

The majority of the first-stage work was exploratory: learning about the architecture of the GPU and learning to program in CUDA. I also gave a basic demonstration of bilateral filtering on the GPU. Many fast approaches have been developed to implement the bilateral filter; these can be implemented on the GPU to improve the performance further. Further, more complex problems can be taken up, which would require exploring the capabilities of the GPU in greater depth. A comparative study of different GPU platforms in running these algorithms can also be made.

6. References

1. C. Tomasi, R. Manduchi: Bilateral Filtering for Gray and Color Images, IEEE International Conference on Computer Vision, 1998.
2. CUDA Programming Guide, NVIDIA.