Stevina Dias* Sherrin Benjamin* Mitchell D silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle

Size: px

Start display at page:

Download "Stevina Dias* Sherrin Benjamin* Mitchell D silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle"

Jasmin West
6 years ago
Views:

1 GPU Programmig Models Stevia Dias* Sherri Bejami* Mitchell D silva* Lyette Lopes* *Assistat Professor Dwarkadas J Saghavi College of Egieerig, Vile Parle Abstract The CPU, the brais of the computer is beig ehaced by aother part of the computer the Graphics Processig Uit (GPU), which is its soul. GPUs are optimized for takig huge batches of data ad performig the same operatio repeatedly ad quickly, which is ot the case with PC microprocessors. A CPU combied with GPU, ca deliver better system performace, price, ad power. The versio 2 of Microsoft s Accelerator offers a efficiet way for applicatios to implemet array-processig operatios. These operatios use the parallel processig capabilities of multi-processor computers. CUDA is a parallel computig platform ad programmig model created by NVIDIA. CUDA provides both lower ad a higher level APIs. Keywords GPU, GPU Programmig CUDA, CUDA API, CGMA ratio, CUDA Memory Model, Accelerator, Data Parallel Arrays I. INTRODUCTION Computers have chips that facilitate the display of images to moitors. Each chip has differet fuctioality. Itel s itegrated graphics cotroller provides basic graphics that ca display applicatios like Microsoft PowerPoit, lowresolutio video ad basic games. The GPU is a programmable ad powerful computatioal device that goes far beyod basic graphics cotroller fuctios [1]. GPUs are special-purpose processors used for the display of three dimesioal scees. The capabilities of GPU are beig haressed to accelerate computatioal workloads i areas such as fiacial modellig, cuttig-edge scietific research ad oil ad gas exploratio. The GPU ca ow take o may multimedia tasks, such as speed up Adobe Flash video, traslatig video to differet formats, image recogitio, virus patter matchig ad may more. But the challege is to solve those problems that have a iheret parallel ature video processig, image aalysis, sigal processig. The CPU is composed of few cores ad a huge cache that ca hadle a few software s at a time. Coversely, a GPU is composed of hudreds of cores that ca hadle thousads of s simultaeously [2]. This ability of a GPU ca accelerate some software by 100x over a CPU aloe. The GPU achieves this acceleratio while beig more powerful ad cost-efficiet as compared to a CPU. This paper describes 2 differet ways of programmig a GPU amely Accelerator ad CUDA. II. ACCELERATOR Accelerator is a high-level data parallel library which uses parallel processors such as the GPU or multi core CPU to speed up executio. The Accelerator applicatio programmig iterface (API) helps i implemetig a wide variety of arrayprocessig operatios by supportig a fuctioal programmig model. the details of parallelizig ad ruig the computatio o the selected target processor, icludig GPUs ad multi core CPUs are hadled by the accelerator. The Accelerator API is almost processor idepedet, so the same arrayprocessig code rus o ay supported processor with oly mior chages [3]. The basic steps i Accelerator programmig are: i. Create data arrays. ii. Load the data arrays ito Accelerator dataparallel array objects. iii. Process the data-parallel array objects usig a series of Accelerator operatios. iv. Create a result object. v. Evaluate the result object o a target. A. Data Parallel Array Objects Applicatios do ot have direct access to the array data ad caot maipulate the array by idex [3]. Applicatios implemet array processig schemes by applyig Accelerator operatios to dataparallel array objects which represet a etire array as a uit rather tha workig with idividual array elemets. 5 data-parallel array objects supported by Accelerator v2 are: i. BoolParallelArray ii. DoubleParallelArray iii. Float4ParallelArray iv. FloatParallelArray v. ItParallelArray A example for Float4ParallelArray is as follows: typedef struct float a; float b; float c; float d; } float4; 225 P a g e

2 Float4 is a Accelerator structure that cotais a quadruplet of float values. It is used primarily i graphics programmig. float4 xyz (0.7f, 5.0f, 4.2f, 9.0f). The API icludes a large collectio of operatios that applicatios ca use to maipulate the cotets of arrays ad to combie the arrays i various ways. Most operatios take oe or more data-parallel array objects as iput, ad retur the processed data as a ew data-parallel object. Accelerator operatios work with copies of the iput objects; they do ot modify the origial objects. Operatio iputs are typically data-parallel objects, but some operatios ca also take costat values. Table 1 Type of Accelerator Operatios Operatio Explaatio Creatio ad coversio Elemetwise Reductio Trasform Liear algebra Create ew data-parallel array objects ad covert them from oe type to aother. Operate o each elemet of oe or more dataparallel array objects ad retur a ew object with the same dimesios as the origials. Add. Abs Subtract Divide, etc Reduce the rak of a data-parallel array object by applyig a fuctio across 1D or more dimesios. Trasform the orgaizatio of the elemets i a data-parallel array object. These operatios reorgaize the data i the object, but do ot require computatio. For example: Traspose Pad Perform stadard matrix operatios o dataparallel array objects, icludig matrix multiplicatio, scalar product, ad outer product. B. Sample Code usig System; usig Microsoft.ParallelArrays; usig FloatArray = Microsoft.ParallelArrays.FloatParallelArray; usig PArray = Microsoft.ParallelArrays.ParallelArrays; amespace SubArrays class Calculatio staticvoid Mai(strig[] args) it arraysize = 100; Radom raf = ewradom(); float[] Array1 = ewfloat[arraysize]; float[] Array2 = ewfloat[arraysize]; float[] Array3 = ewfloat[arraysize]; DX9Target result = ewdx9target(); // (1) for (it i = 0; i < arraysize; i++) // (2) Array1[i] = (float)(math.si((double)i / 10.0) + raf.nextdouble() / 5.0); Array2[i] = (float)(math.si((double)i / 10.0) + raf.nextdouble() / 5.0); } FloatArray fpiput1 = ew FloatArray (Array1); // (3) FloatArray fpiput2 = ew FloatArray (Array2); FloatArray fpstacked = PArray.Subtract(fpIput1, fpiput2); // (4) FloatArray fpoutput = PArray.Divide(fpStacked, 2); Array3 = result.toarray1d(fpoutput); // (5) for (it i = 0; i < arraylegth; i++) // (6) Cosole.WriteLie(Array3[i].ToStrig()); //(7) } } } } The above code uses two commoly used types: ParallelArrays ad FloatParallelArray. ParallelArrays represeted by PArray, cotais the Accelerator operatio methods [3]. FloatParallelArray represeted by FloatArray, cotais floatig poit arrays i the Accelerator eviromet. 1. Create a target object: Each processor that supports Accelerator has oe or more target objects. Target Objects covert Accelerator operatios to processor-specific code ad ru them o the processor. StackArrays uses DX9Target, which rus Accelerator operatios o ay DirectX 9-compatible GPU, usig the DirectX 9 API. 2. Create iput data arrays: the iput arrays for StackArrays are geerated by the applicatio, but they ca also be created from stored data. 3. Load each iput array: Each iput array is loaded ito a Accelerator data parallel array object. 4. Array Processig: Use oe or more Accelerator operatios to process the arrays. 5. Evaluatio: evaluate the results of the series of operatios by passig the result object to the target s ToArray1D method. ToArray1D evaluates the 1D arrays ad ToArray2D, evaluates 2-D arrays. 6. Process the result: Stack Arrays displays the elemets of the processed array i the cosole widow. 226 P a g e

3 C. Accelerator Architecture Applicatios Umaaged Applicatios Maaged Applicatios Accelerator Library Target Rutimes DirectX Maaged API GPU- Specific Other Processors Accelerator Resource Maager Native API Multi-Core CPU Other Resource Maagers CUDA is a high level laguage which stads for Compute Uified Device Architecture. It is a parallel computig platform ad programmig model created by NVIDIA. The CUDA platform is accessible to software developers through CUDAaccelerated libraries, compiler directives ad extesios to idustry-stadard programmig laguages, icludig C, C++ ad FORTRAN. Programmers use C/C++ with CUDA extesios to express parallelism, data locality, ad cooperatio, which is compiled with vcc, NVIDIA s LLVM-based C/C++ compiler, to code algorithms for executio o the GPU.CUDA eables heterogeeous systems(i.e., CPU+GPU). CPU & GPU are separate devices with separate DRAMs Scale to 100 s of cores, 1000 s of parallel s. CUDA is a Extesio to the C Programmig Laguage [4]. Processors GPUs Multi-core CPU C/C++ CUDA Applicatio GENERIC Figure 1: Accelerator Architecture 1) The Accelerator Library ad APIs The Accelerator library is used by the applicatios to implemet the processor-idepedet aspects of Accelerator programmig. The library exposes two APIs amely Maaged ad Native. C++ applicatios use the ative API, which is implemeted as umaaged C++ classes. Maaged applicatios use the maaged API. NVCC PTX Code CPU Code 2) Targets Accelerator eeds a set of target objects. Each target traslates Accelerator operatios ad data objects ito a suitable form for its associated processor ad rus the computatio o the processor. A processor ca have multiple targets, each of which accesses the processor i a differet way. For example, most GPUs will probably have DirectX 9 ad DirectX 11 targets, ad possibly a vedorimplemeted target to support processor-specific techologies. Each target cosists of a small API, which are called by applicatios to iteract with the target. The most commoly used methods aretoarray1d ad ToArray2D. 3) Processor Hardware The versio 2 of Accelerator ca ru operatios o differet targets. Curretly it icludes targets for: Multicore x64 CPUs, GPUs, by usig DirectX 9 III. CUDA PTX to Target Compiler G80. GPU Figure 2: Compilatio Procedure of CUDA SPECIALIZED CUDA Programmig Basics cosists of CUDA API basics, Programmig model ad its Memory model A. CUDA API Basics: There are three basic APIs i CUDA amely, Fuctio type qualifiers, Variable type qualifiers ad Built-i variables. Fuctio type qualifiers are used to specify executio o host or device. Variable type qualifiers are used to specify 227 P a g e

4 the locatio o the device. Four built-i variables are used to specify the grid ad block dimesios ad the block ad idices. blockdim dim3 dimesios of the block Idx uit3 idex withi the block 1) Fuctio Type Qualifiers Table 2 Fuctio type Qualifier Fuctio Type Qualifier Executed o Callable from device Device Device global Device Host host Host Host 2) Variable Type Qualifier Table 3 Variable Type Qualifier Variable Type Qualifier Locatio Lifetime device costat shared Global space Costat space Shared space lifetime of a applicatio lifetime of a applicatio lifetime of a block Accessible from all the s withi the grid ad from the host through the rutime library all the s withi the grid ad from the host through the rutime library (optioally used together with device ) all the s withi the block (optioally used together with device ) 3) Built i Variables Table 4 Built i Variables Built i Type Explaatio variables griddim dim3 dimesios of the grid blockidx uit3 block idex withi the grid 4) Executio Cofiguratio (EC): EC must be specified for ay call to a global fuctio. Defies the dimesio of the grid ad blocks specified by isertig a expressio betwee fuctio ame ad argumet list: Fuctio Declaratio: global void fuctio_ame(float* parameter) Fuctio Call: fuctio_ame <<< Dg, Db, Ns >>> (parameter) Where Dg, Db, Ns are: Dg is of type dim3, dimesio ad size of the grid Dg.x * Dg.y = umber of blocks beig lauched; Db is of type dim3, dimesio ad size of each block Db.x * Db.y * Db.z = umber of s per block. B. Programmig Model The GPU is see as a computig device to execute a portio of a applicatio that has to be executed several times, ca be isolated as a fuctio ad works idepedetly o differet data. Such a fuctio ca be compiled to ru o the device. The resultig program is called a Kerel. The batch of s that executes a kerel is orgaized as a grid of blocks. Followig are steps i CUDA code Step 1: Iitialize the device (GPU) Step 2: ocate o the device Step 3: Copy the data from host array to device array Step 4: Execute kerel o device Step 5: Copy data from device (GPU) to host Step 6: Free allocated memories o the device ad host Kerel Code (xx_kerel.cu): A kerel is a fuctio callable from the host ad executed o the CUDA device -- simultaeously by may s i parallel. Callig a kerel ivolves specifyig the ame of the kerel plus a executio cofiguratio. C. The Memory Model Each CUDA device has several memories as show i Figure 3. These memories ca be used by programmers to achieve high CGMA (Compute to Global Memory Access) ratio ad thus high executio speed i their kerels. CGMA ratio is the umber of floatig-poit calculatios performed for each access to the global withi a regio of a CUDA program. Variables that reside i shared memories ad registers ca be accessed at very high speed i a parallel maer. Each ca oly access registers that are allocated to them. A kerel fuctio uses registers to store frequetly accessed variables. These variables are private to each. 228 P a g e

Shared memories are allocated to blocks. s i a block ca access variables i the shared locatios of the block. Threads ca share their results via Shared memories.

Texture is read from kerels usig device fuctios called texture fetches. The first parameter of a texture fetch specifies a object called a texture referece.

5 Shared memories are allocated to blocks. s i a block ca access variables i the shared locatios of the block. Threads ca share their results via Shared memories. The global ca be accessed by all the s at aytime of program executio. The costat allows faster ad more parallel data access paths for CUDA kerel executio tha the global. Texture is read from kerels usig device fuctios called texture fetches. The first parameter of a texture fetch specifies a object called a texture referece. A texture referece defies which part of texture is fetched. capacities are implemetatio depedet. Oce their capacities are exceeded, they become limitig factors for the umber of s that ca be assiged to each SM [5]. Table 2 Differet types of CUDA Memory Memor y Locatio Local Off-chip No Shared O-chip Cache d N/A Global Off-chip No Costa t Texture Access Read/Writ e Read/writ e Read/writ e Off-chip Yes Read Off-chip Yes Read Who Oe Threa d s i a block s + CPU s + CPU s + CPU D. A example of a CUDA program The followig program calculates ad prits the square root of first 1000 itegers. #iclude <stdio.h> // (1) #iclude <cuda.h> #iclude <coio.h> global void sq_rt(float*a,it N) // (2) itidx=blockidx.x*blockdim.x+idx.x; if(idx<n) a[idx]=sqrt(a[idx]);} it mai(void) // (3) float*arr_host,*arr_dev; // (4) cost it C=100; // (5) size_t legth=c*sizeof(float); arr_host= (float*)malloc(legth); // (6) Figure 3: CUDA Memory Model CUDA defies registers, shared, ad costat that ca be accessed at higher speed ad i a more parallel maer tha the global. Usig these memories effectively will likely require re-desig of the algorithm. It is importat for CUDA programmers to be aware of the limited sizes of these special memories. Their cudamalloc((void**)&arr_dev,legth);// (7) for(it i=0;i<c;i++) arr_host[i]=(float)i; cudamemcpy(arr_dev,arr_host, legth,cudamemcpyhosttodevice);// (8) 229 P a g e

$pritf("%d\t%f\",i,arr_host[i]); free (arr_host); // (12) cudafree(arr_dev); getch();} 1. Iclude header files 2. Kerel fuctio that executes o the CUDA device 3. mai ( ) routie, that the CPU must fid 4.$ Defies poiters to host ad device arrays 5. Defies other variables used i the program, size_t is a usiged iteger type of at least 16 bit 6. ocate array o the host 7.

Defies poiters to host ad device arrays 5. Defies other variables used i the program, size_t is a usiged iteger type of at least 16 bit 6. ocate array o the host 7.

6 it blk_size=4; // (9) it _blk=c/blk_size+(c%blk_size==0); sq_rt<<<_blk,blk_size>>>(arr_dev,c); cudamemcpy(arr_host,arr_dev,legth,cudamemcpy DeviceToHost); // (10) for (it i=0;i<c;i++) //(11) pritf("%d\t%f\",i,arr_host[i]); free (arr_host); // (12) cudafree(arr_dev); getch();} 1. Iclude header files 2. Kerel fuctio that executes o the CUDA device 3. mai ( ) routie, that the CPU must fid 4. Defies poiters to host ad device arrays 5. Defies other variables used i the program, size_t is a usiged iteger type of at least 16 bit 6. ocate array o the host 7. ocate array o device (DRAM of the GPU) 8. Copy the data from host array to device array. 9. Kerel Call, Executio Cofiguratio 10. Retrieve result from device to host i the host 11. Prit result 12. Free allocated memories o the device ad host egieerig.a GPU allows you to ru the sigle processor applicatios i parallel processig eviromet. I this paper, we itroduced GPU computig, CUDA ad Accelerator programmig eviromet by explaiig, step-by-step, how to implemet a efficiet GPU-eabled applicatio. The above metioed programmig models help to achieve the goal of Parallel Processig. Due to the presece of these programmig models ad may more models, the gamig applicatios have become more lively ad creative. REFERENCES [1] [2] files/cruz _gpucomputig09.pdf [3] research.microsoft.com/e.../accelerator/acc elerator_i tro.docx 11/12/2012 4:06 PM [4] 11/12/2012 3:52 PM [5] uda_docs/nvidia_cuda_programmigg uide_2.3.pdf [6] De Doo, D.; Esposito, A.; Tarricoe, L.; Catariucci, L. Ateas ad Propagatio Magazie, IEEE Volume: 52, Issue: 3, Itroductio to GPU Computig ad CUDA Programmig: A Case Study o FDTD [EM Programmer's Notebook] Figure 4: Illustratio of code CONCLUSION The stagatio of CPU clock speeds has brought parallel computig, multi-core architectures, ad hardware-acceleratio techiques back to the forefrot of computatioal sciece ad 230 P a g e

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies