
B. Tech Project First Stage Report on GPU Based Image Processing

Submitted by Sumit Shekhar (05007028)

Under the guidance of Prof. Subhasis Chaudhari

1. Introduction

1.1 Graphic Processor Units

A graphic processor unit (GPU) is simply a processor attached to the graphics card used in video games, play stations and computers. GPUs differ from CPUs, the central processing units, in that they are massively threaded and parallel in their operation. This follows from the nature of their work: they are used for fast rendering, where the same operation is carried out for each pixel of the image. Thus, they have more transistors devoted to data processing than to flow control and data caching. Today, graphic processor units have outgrown their primary purpose. Under the name of GPGPUs, or general-purpose GPUs, they are being used and promoted for scientific computation all over the world; engineers are achieving several-fold speed-ups by running their programs on GPUs. The application fields are many: image processing, general signal processing, physics simulation, computational biology, etc.

1.2 Problem Statement and Motivation

Graphic processor units can speed up computation manifold over a traditional CPU implementation. Image processing, being inherently parallel, can be implemented quite effectively on a GPU. Thus, many applications which otherwise run slowly can be made fast enough for useful real-time applications. This was the motivation behind my final year project. NVIDIA CUDA is a parallel programming model and software environment, developed specifically to make programming the GPU efficient while remaining compatible with the wide variety of GPU cores available in the market. Further, being an extension of the standard C language, it presents a low learning curve to programmers, while giving them the flexibility to put their creativity into the parallel codes they write. My task was to implement object-tracking algorithms using CUDA and optimize them. As a part of the first stage, I implemented the bilateral filtering method on the GPU.
Traditionally, brute-force bilateral filters take a long time to run because (i) they cannot be implemented using FFT algorithms, as the calculation involves both spatial and range filtering, and (ii) they are not separable, and hence take O(n^2) computations. Using the GPU, I found them to run much faster than even on a high-end CPU.

2. CUDA Programming Model

2.1 Execution

CUDA extends the standard C language to make it applicable for parallel programming. Its main features include the following. C functions are implemented on the GPU device as kernels, which are executed in parallel by several CUDA threads, as opposed to ordinary C functions, which are executed only once. Kernels are defined using the __global__ identifier, which is again an extension introduced by CUDA. A kernel function is called using a special <<< >>> syntax, which specifies the number of threads with which the kernel has to execute. The <<< >>> syntax determines the organization of the threads and how they are executed. A typical syntax for calling a kernel function is shown below:

funcAdd<<<Grid, Block, 1>>>(A, B, C)

This illustrates how a general function is executed on the GPU in CUDA. The Block variable defines a block of threads, which can be one-dimensional, two-dimensional or three-dimensional. Each thread in the block is identified by a 3-component vector called threadIdx, whose x, y and z components respectively give a unique index to each thread in a block. Similarly, Grid defines the layout of the blocks, which can be either one-dimensional or two-dimensional. The blocks are also identified by a vector called blockIdx. Each block has a limit on the maximum number of threads it can contain, which is determined by the architecture of the unit. The threads of a block can be synchronized with each other using the __syncthreads() function and can also be made to access memory in synchronization. The grid and thread blocks can be shown by the following diagram:

Figure 1: Grid and Thread Blocks
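As a concrete illustration of the execution model above, a minimal element-wise addition kernel and its launch might look as follows (a sketch, not code from the project; the names and sizes are assumptions):

```cuda
// Kernel: each thread adds one element. __global__ marks device code
// that is callable from the host.
__global__ void funcAdd(const float *A, const float *B, float *C, int n)
{
    // Unique global index built from the block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads past the end of the array
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 1 << 20;     // hypothetical problem size
    float *A, *B, *C;
    cudaMalloc(&A, n * sizeof(float));
    cudaMalloc(&B, n * sizeof(float));
    cudaMalloc(&C, n * sizeof(float));

    // One-dimensional grid of one-dimensional blocks.
    dim3 block(256);                        // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    funcAdd<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();   // wait for the kernel to finish

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Note how the grid size is rounded up so that every element is covered, which is why the kernel needs the `i < n` bounds check.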

2.2 Memory Hierarchy

There are multiple memory spaces which a CUDA thread can use to access data. The different kinds available are shown below in the figure:

Figure 2: GPU Memory Model

Host Memory: This is the main memory of the computer, from/to which data can be loaded/written back from the device memory.

Local Memory: This memory is private to each thread running on the device.

Shared Memory: This is shared between the various threads of a block. It is on-chip memory and hence access to it is very fast. It is divided into banks, which are equally sized memory modules.

Global Memory: This memory is accessible to all threads and blocks, and is usually used to load the host data into the GPU memory. As this memory is not cached, access to it is not as fast, but the right access pattern can be used to maximize memory bandwidth.

Texture Memory: This is a cached memory, and hence faster than global memory. Texture memory can be loaded by the host, but can only be read by the device kernel. It also provides normalized access to the data, which is useful for reading images in the kernel.

Constant Memory: This is also a cached memory, for fast access to constant data.

2.3 Hardware Implementation

The GPU consists of an array of multiprocessors, such that the threads of a thread block run concurrently on one multiprocessor. As blocks finish, new blocks are launched in the vacated slots. The overall device architecture can be shown as:
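To make the shared-memory space above concrete, the following sketch (not from the report) sums 256 elements per block in on-chip shared memory, showing the global-to-shared traffic and the __syncthreads() barriers:

```cuda
// Each block sums 256 consecutive input elements; one partial sum per
// block is written back to global memory. Assumes blockDim.x == 256.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];      // shared: visible to the whole block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];  // global -> shared
    __syncthreads();                // wait until all loads complete

    // Tree reduction within the block; all traffic stays in fast
    // on-chip shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];   // one result per block, to global
}
```

Each element of global memory is read only once; all the repeated accesses of the reduction hit shared memory instead, which is the usual motivation for staging data there.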

Figure 3: Multiprocessors on the GPU

Each multiprocessor executes threads in warps, which are groups of 32 parallel threads. Each multiprocessor consists of local registers, shared memory that is shared by all its scalar processors, and a constant cache. A texture cache is available through a texture unit. The size of a block is limited by the number of registers required per thread and the amount of shared memory. A kernel fails to launch if not even a single block can be launched on a multiprocessor.

2.4 A Few Important Extensions in CUDA

Function type qualifiers:

__global__ declares a function as a kernel. The function is executed on the device and can be called from the host.

__device__ declares a function which is executed on the device and can be called only from the device.

__host__ identifies a function executed on, and callable from, the host only.

Variable type qualifiers:

__device__ defines a variable stored in device memory. It resides in the global memory space and is accessible from all threads.

__constant__ declares a variable in the constant memory space.

__shared__ declares a variable in the shared memory space of a thread block.

Built-in variables:

gridDim: stores the dimensions of the grid as a 3-component vector.

blockIdx: stores the block index within the grid as a 3-component vector.

blockDim: stores the dimensions of a block.

threadIdx: stores the thread index within the block as a 3-component vector.

Run-time APIs:

cudaMalloc: allocates memory of the size given as input in the global memory space of the device; similar to malloc in C.

cudaFree: frees the memory allocated by cudaMalloc.

cudaMemcpy: copies data to/from device memory from/to host memory.
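The run-time APIs listed above are typically combined into the following host-side pattern (a sketch with hypothetical sizes, not the project code):

```cuda
#include <cstdlib>
#include <cstring>

int main()
{
    const int n = 1024;                  // hypothetical element count
    size_t bytes = n * sizeof(float);

    float *hostData = (float *)malloc(bytes);  // host memory (plain C)
    memset(hostData, 0, bytes);

    float *devData;
    cudaMalloc(&devData, bytes);         // global device memory

    // Host -> device before the kernel runs, device -> host after.
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);
    // ... launch kernel(s) operating on devData here ...
    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devData);                   // mirrors cudaMalloc
    free(hostData);
    return 0;
}
```

The direction of each cudaMemcpy is given explicitly by the last argument, so forgetting to copy results back is an easy mistake this template avoids.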

3. Bilateral Filters on the GPU

3.1 Introduction

Bilateral filters were introduced by Tomasi and Manduchi [1]. These filters smooth the image but preserve the edges by means of a non-linear combination of the image values of nearby pixels. This is achieved through a combination of range filtering and spatial filtering. Range filters operate on the values of the image pixels rather than their location; spatial filters take the location into account. By combining both, the paper obtains an edge-preserving smoothing filter, which varies according to both the image pixel value and the location. A general form of the filter is given by:

h(x) = k^{-1}(x) \int f(\xi)\, c(\xi, x)\, s(f(\xi), f(x))\, d\xi

where the normalization factor is

k(x) = \int c(\xi, x)\, s(f(\xi), f(x))\, d\xi

Here, c(.) measures the geometric closeness between the pixels x and \xi, and s(.) is a similarity function which measures how close the value of the image pixel f(\xi) is to the given value f(x). For the special case of Gaussian c(.) and s(.), the equations become:

c(\xi, x) = \exp\left( -\tfrac{1}{2} \left( \tfrac{\|\xi - x\|}{\sigma_d} \right)^2 \right), \quad s(\xi, x) = \exp\left( -\tfrac{1}{2} \left( \tfrac{|f(\xi) - f(x)|}{\sigma_r} \right)^2 \right)

The functioning of the bilateral filter can be seen in the following figures:

Figure 4: (a) A step function perturbed by random noise (b) combined similarity weights for a pixel to the right of the step (c) final smoothed output [1]

3.2 Implementation

The bilateral filter cannot be implemented using FFT algorithms in this form, because the filter values change with image pixel location, depending on the image values of the neighbouring pixels. Moreover, it is also not separable in its current form. A brute-force algorithm was therefore used to implement the filter on both the GPU and the CPU. The pseudo-code for the algorithm is as follows.

For input image I, Gaussian parameters σ_d and σ_r, output image I_b, and weight coefficients W_b:

1. Initialize all values of I_b and W_b to zero.
2. For each pixel (x, y) with intensity I(x, y):
   a. For each pixel (x', y') in the image with value I(x', y'), compute the associated weight:
      weight = exp( -(I(x', y') - I(x, y))^2 / (2σ_r^2) - ((x - x')^2 + (y - y')^2) / (2σ_d^2) )
   b. Update the weight sum: W_b(x, y) = W_b(x, y) + weight
   c. Update I_b(x, y) = I_b(x, y) + weight × I(x', y')
3. Normalize the result: I_b(x, y) = I_b(x, y) / W_b(x, y)

For the actual implementation, the filter radius was taken to be twice the value of the spatial sigma, as the Gaussian tail dies off quickly. This truncated filter was used as an approximation for the full kernel.

3.3 GPU Implementation

For the GPU implementation, the following template was followed:

{
    // load image from disk
    // load reference image from image (output)
    // allocate device memory for result
    // allocate array and copy image data
    // set texture parameters
    // access with normalized texture coordinates
    // bind the array to the texture
    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    // execute the kernel
    BilateralFilter<<<dimGrid, dimBlock, 0>>>(image, spat_filter, width, height, sigmaR, sigmaD);
    // check if kernel execution generated an error
    // allocate memory for the result on the host side
    // copy result from device to host
    // write result to file
    // clean up memory
}
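The pseudo-code above maps naturally onto one CUDA thread per output pixel. A simplified kernel along those lines might look as follows; this is a sketch under stated assumptions (global-memory reads rather than the texture path used in the project, clamped borders, intensities in [0, 1]), not the exact project code:

```cuda
#include <cmath>

// Brute-force bilateral filter, one thread per output pixel.
// img: input image, out: output, w/h: dimensions,
// sigmaD: spatial sigma, sigmaR: range sigma.
__global__ void bilateralFilter(const float *img, float *out,
                                int w, int h, float sigmaD, float sigmaR)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int r = (int)(2.0f * sigmaD);   // truncate at 2*sigmaD (Gaussian tail)
    float center = img[y * w + x];
    float sum = 0.0f, wsum = 0.0f;

    for (int dy = -r; dy <= r; ++dy) {
        for (int dx = -r; dx <= r; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at the borders
            int yy = min(max(y + dy, 0), h - 1);
            float v = img[yy * w + xx];
            // Spatial term uses sigmaD, range term uses sigmaR,
            // matching steps 2a-2c of the pseudo-code.
            float spatial = (dx * dx + dy * dy) / (2.0f * sigmaD * sigmaD);
            float range   = (v - center) * (v - center)
                            / (2.0f * sigmaR * sigmaR);
            float weight = expf(-spatial - range);
            sum  += weight * v;     // accumulate I_b
            wsum += weight;         // accumulate W_b
        }
    }
    out[y * w + x] = sum / wsum;    // step 3: normalize by the weight sum
}
```

Because every pixel's weights depend on its own neighbourhood, the threads are fully independent, which is why the filter parallelizes so cleanly.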

Some of the optimizations used in the code are:

Texture memory was used for accessing the image values. Texture memory, being cached, provides fast access to the image data.

The spatial filter was calculated in the host code and passed to the kernel as a constant matrix. This avoided recomputing the values for every pixel.

An NVIDIA 8600 graphics card was used to run the codes.

3.4 CPU Implementation

The CPU code was similar to the GPU code, except that the bilateral filter function was executed using for loops over all the pixels of the image, which run in parallel threads on the GPU. This was done to get a better estimate of the CPU and GPU timings, as both run the same algorithm. The CPU under test was an Intel Quad Core processor running at 2.4 GHz.

3.5 Speed Comparison

A 512 x 512 grayscale Lena image was given as input to the program. The speed comparisons were made in two cases.

Varying σ_d keeping σ_r constant. Results for various sigma values are tabulated below for σ_r = 0.1:

Spatial sigma (σ_d) | GPU Time (ms) | CPU Time (ms) | Speed GPU (Mpix/s) | Speed CPU (Mpix/s) | Ratio
1 | 230  | 1880  | 1.14 | 0.14  | 8.2
2 | 290  | 7310  | 0.90 | 0.036 | 25.2
3 | 330  | 16390 | 0.79 | 0.016 | 50
4 | 400  | 29010 | 0.66 | 0.009 | 72.5
5 | 520  | 45130 | 0.50 | 0.005 | 87
6 | 660  | 65200 | 0.40 | 0.004 | 99

Thus, we can see that the CPU is much slower than the GPU in executing the same task. Further, the time taken by the CPU increases approximately as n^2 with the filter length. Hence, the ratio of speeds increases with filter length and reaches about 100x in the last case.

Varying σ_r keeping σ_d constant. The range sigma was also varied keeping the spatial sigma constant. The execution time for both GPU and CPU was found to be almost constant across different values of σ_r:

Range sigma (σ_r) | GPU Time (ms) for σ_d = 5 | CPU Time (ms) for σ_d = 3
0.1 | 518 | 16390
0.2 | 516 | 16360
0.3 | 512 | 16370
0.4 | 507 | 16320

Output Images. Comparison of CPU and GPU output images: Original Image; GPU output for σ_d = 3, σ_r = 1; CPU output for σ_d = 3, σ_r = 1; Difference Image.

Variation with σ_r, keeping spatial sigma σ_d constant (GPU outputs): σ_d = 3, σ_r = 0.1; σ_d = 3, σ_r = 0.3; σ_d = 5, σ_r = 0.6.

Variation with σ_d, keeping range sigma σ_r constant (GPU outputs): σ_d = 1, σ_r = 0.1; σ_d = 5, σ_r = 0.3; σ_d = 10, σ_r = 0.6.

4. Conclusions

The GPU achieved a much better time response than the CPU in all the cases of filter implementation. The ratio of speeds increased with increase in filter length. The error between the GPU and CPU outputs was very low; thus the GPU performs the calculations quite accurately. Variation in the spatial sigma and the range sigma showed the desired changes in the output image. Increasing the range sigma, keeping the other constant, increased the blurring across the edges, as expected. Similarly, keeping the range sigma constant and increasing the spatial sigma resulted in better smoothing of the images without disturbing the edges.

5. Future Work

The majority of the first stage work was exploratory: learning about the architecture of the GPU and learning to program in CUDA. I also gave a basic demonstration of bilateral filtering on the GPU. Many fast approaches have been developed to implement the bilateral filter. These can be implemented on the GPU and the performance improved further. In addition, more complex problems can be implemented, which would require exploring the capabilities of the GPU further. A comparative study of different GPU platforms can also be made while testing the algorithms.

6. References

1. C. Tomasi, R. Manduchi: Bilateral Filtering for Gray and Color Images, IEEE International Conference on Computer Vision, 1998.
2. CUDA Programming Guide, NVIDIA.