The Architecture of the Graphics Processing Unit (GPU) and CUDA Programming


P. Bakowski

Evolution of parallel architectures
We can distinguish three generations of massively parallel architectures for scientific computation.
(1976) Supercomputers with special processors for vector calculation (Single Instruction, Multiple Data). The Cray-1 contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS).
- Price: $5M to $8.8M
- Number of units sold: 85


(2010) Supercomputers built from standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction, Multiple Data (MIMD) computers.
IBM Roadrunner:
- Processors: PowerXCell 8i CPUs and 6480 dual-core AMD Opteron CPUs, running Linux
- Power consumption: 2.35 MW
- Footprint: 296 racks, 560 m2
- Memory: 103.6 TiB
- Performance: 1.042 petaflops
- Price: USD $125M

Evolution of GPU architectures
(2012) General-purpose computing on graphics processing units (GPGPU): a technology based on the circuits integrated into graphics cards. ~$

GPGPU on embedded GPU architectures
(2014) Embedded GPGPU is based on advanced SoCs:
- Nvidia Tegra K1: ARM Cortex-A15 (4 cores) + Kepler GPU
- Exynos 5422: ARM Cortex-A15 (4 cores) + Mali-T6xx, T7xx, T8xx GPUs
$100 ~8 W, $200 ~16 W

GPU-based processing
Tegra K1: a Kepler-class GPU with 192 processing cores, 2 signal processors, video processing units for high-definition (2K) video encoding and decoding, an audio processing unit, and a set of data, video, and audio interfaces.

Tegra K1: streaming multiprocessor
- In the streaming multiprocessor (SMX), each core contains one FP unit and one INT unit
- 32/48/96/192 cores per SMX
- GPGPU programming with CUDA (or OpenCL)

ARM: Mali-T624/T628/T678 GPU
- Mali processing units are 128 bits wide
- GPGPU programming with OpenCL

NVIDIA and CUDA
- CUDA: a software architecture on NVIDIA hardware
- CUDA language: an extension of the C language

The CUDA Toolkit contains:
- compiler: nvcc
- libraries: FFT and BLAS
- profiler
- debugger: gdb for GPU
- runtime driver for CUDA, included in the NVIDIA drivers
- programming guide
- SDK for CUDA developers
- source code (examples) and documentation

CUDA: compilation phases
CUDA C code is compiled with nvcc, a driver script that invokes other programs: cudacc, g++, cl, etc.

nvcc generates:
- the CPU code, which is compiled together with the other parts of the application written in pure C, and
- the PTX object code for the GPU

Executable files containing CUDA code require:
- the CUDA runtime library (cudart), and
- the base CUDA library

CUDA: programming model
- extended C, mapped onto multiple threads
- the threads are organized into blocks
- a set of blocks with their threads forms a grid
Example: a two-dimensional grid (3 columns, 2 rows) of 6 three-dimensional blocks, each made of 4*4*4 threads.
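The grid described on this slide can be written down with the dim3 type. The following is only a minimal sketch: the kernel countThreads and the helper launchSlideGrid are hypothetical names introduced for illustration, and the atomic counter is just one way to confirm how many threads were started.

```cuda
#include <cuda_runtime.h>

// Every thread adds 1 to a shared counter, so the final value equals the
// total number of threads in the grid. (Hypothetical kernel, illustration only.)
__global__ void countThreads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical helper: launches a 3 x 2 grid of 4 x 4 x 4 blocks and
// returns the number of threads that ran (0 if a CUDA call failed).
unsigned int launchSlideGrid(void)
{
    unsigned int *d_counter = NULL;
    unsigned int h_counter = 0;

    if (cudaMalloc((void **)&d_counter, sizeof(unsigned int)) != cudaSuccess)
        return 0;
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    dim3 grid(3, 2);       // 3 columns, 2 rows of blocks (6 blocks)
    dim3 block(4, 4, 4);   // 64 threads per block
    countThreads<<<grid, block>>>(d_counter);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

With this layout the helper is expected to report 6 blocks of 64 threads, that is 384 threads in total.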

CUDA: memory model
- Global memory: accessible to all SMXs and to the CPU
- Shared memory: accessible to the threads running in the same block
- Constant memory and texture memory: accessible to all threads, in read-only mode
- Each thread has its own local memory and a set of registers
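To make the memory spaces more concrete, here is a small sketch that touches global, shared and constant memory as well as registers. All names (scale, reverseScaled, reverseBlocks, BLOCK) are assumptions made for this illustration and do not come from the slides; texture memory is left out.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64                 // assumed block size for this sketch

__constant__ float scale;        // constant memory: read-only for all threads

// in and out point to global memory, visible to all blocks and to the CPU.
__global__ void reverseScaled(const float *in, float *out)
{
    __shared__ float tile[BLOCK];          // shared by the threads of one block
    int t = threadIdx.x;                   // held in a register
    int base = blockIdx.x * blockDim.x;    // first element of this block

    tile[t] = in[base + t];
    __syncthreads();                       // wait until the tile is complete

    // Each block writes its own segment in reverse order, scaled.
    out[base + t] = scale * tile[blockDim.x - 1 - t];
}

// Hypothetical host helper; n must be a multiple of BLOCK.
// Returns 0 on success, -1 on error.
int reverseBlocks(const float *h_in, float *h_out, int n, float s)
{
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (n % BLOCK != 0)
        return -1;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);                    // cudaFree(NULL) is a no-op
        cudaFree(d_out);
        return -1;
    }
    cudaMemcpyToSymbol(scale, &s, sizeof(float));   // fill constant memory
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    reverseScaled<<<n / BLOCK, BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

The shared tile is what lets a thread read a value loaded by another thread of the same block; threads of different blocks can only exchange data through global memory.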

Basic CUDA programming
CUDA programs consist of:
- pure C code for execution on the CPU, and
- extended C code for execution on the GPU
In this context three types of functions are defined:
- __host__ (optional): running only on the CPU
- __global__: running on the GPU, called by the CPU
- __device__: running on the GPU, called by the GPU

A function marked with the __global__ prefix is also called a kernel.
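The three function types, written __host__, __global__ and __device__ in CUDA C, can be seen side by side in the sketch below. The function names square, squareAll and squareOnGpu are hypothetical and chosen only for this example.

```cuda
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float x)
{
    return x * x;
}

// __global__: a kernel; runs on the GPU and is launched from the CPU.
__global__ void squareAll(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                     // ignore threads beyond the end of the array
        data[i] = square(data[i]);
}

// __host__: runs on the CPU (the qualifier is optional for host code).
// Squares h_data in place using the GPU; returns 0 on success, -1 on error.
__host__ int squareOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess)
        return -1;
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    squareAll<<<(n + 63) / 64, 64>>>(d_data, n);
    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : -1;
}
```

Calling squareOnGpu on an array holding 1, 2, 3 should leave 1, 4, 9 in it.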

The call of a __global__ function specifies the set of blocks and threads to be activated. This is defined by an entry of the type:
    kernel<<<blocks, threads>>>(arguments)

The simplest example is:
    kernel<<<1,10>>>(arguments);   // 1 block of 10 threads
Another example is:
    kernel<<<2,5>>>(arguments);    // 2 blocks of 5 threads each
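Both launch configurations start ten threads; what differs is how the threads are divided into blocks. The sketch below records, for each global thread index, the block the thread belongs to. The names recordBlock and recordLayout are hypothetical, introduced only to illustrate the two configurations.

```cuda
#include <cuda_runtime.h>

// Each thread stores the index of its block at its global position.
__global__ void recordBlock(int *blockOf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
    blockOf[i] = blockIdx.x;
}

// Hypothetical helper: launches blocks x threads threads and copies the
// blocks * threads results into h_blockOf. Returns 0 on success, -1 on error.
int recordLayout(int blocks, int threads, int *h_blockOf)
{
    int *d_blockOf = NULL;
    size_t bytes = (size_t)blocks * threads * sizeof(int);

    if (cudaMalloc((void **)&d_blockOf, bytes) != cudaSuccess)
        return -1;
    recordBlock<<<blocks, threads>>>(d_blockOf);
    cudaError_t err = cudaMemcpy(h_blockOf, d_blockOf, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_blockOf);
    return err == cudaSuccess ? 0 : -1;
}
```

With recordLayout(1, 10, buf) every entry should be 0; with recordLayout(2, 5, buf) the first five entries should be 0 and the last five should be 1.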

CUDA: kernel structure
The automatic variables are: threadIdx, blockIdx, blockDim, gridDim.
For a one-dimensional organization: threadIdx.x, blockIdx.x, blockDim.x, and gridDim.x
    // GPU kernel for AddVect.Float.cu
    __global__ void addvect(float* in1, float* in2, float* out)
    {
        // global index of this thread across all blocks
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        out[i] = in1[i] + in2[i];
    }

CUDA: example, CPU side
    #include <stdio.h>   // needed for printf

    int main()
    {
        int i=0;
        float v1[]={1,2,3,4,5,6,7,8,9,10};
        float v2[]={1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9};
        int memsize = sizeof(v1);
        int vsize = memsize/sizeof(float);
        float res[vsize];
        // allocate the three vectors in GPU global memory
        float* Cv1; cudaMalloc((void **)&Cv1,memsize);
        float* Cv2; cudaMalloc((void **)&Cv2,memsize);
        float* Cres; cudaMalloc((void **)&Cres,memsize);
        // copy the input vectors from the host to the GPU
        cudaMemcpy(Cv1,v1,memsize,cudaMemcpyHostToDevice);
        cudaMemcpy(Cv2,v2,memsize,cudaMemcpyHostToDevice);
        ..

CUDA: example, CPU side (continued)
        ..
        addvect<<<2,vsize/2>>>(Cv1,Cv2,Cres); // 2 blocks of vsize/2 threads
        // copy the result back to the host
        cudaMemcpy(res,Cres,memsize,cudaMemcpyDeviceToHost);
        printf("res= { ");
        for(i=0;i<vsize;i++){printf("%2.2f ",res[i]);}
        printf("}\n");
        // release the GPU memory
        cudaFree(Cv1); cudaFree(Cv2); cudaFree(Cres);
        return 0;
    }
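The launch with 2 blocks of vsize/2 threads covers the vector exactly only when vsize is even. A common variation, sketched here under the assumption that the kernel may receive the vector length as an extra argument, rounds the number of blocks up and lets surplus threads do nothing. The names addVectGuarded and addOnGpu are hypothetical.

```cuda
#include <cuda_runtime.h>

// Variant of the vector addition kernel with a bounds check.
__global__ void addVectGuarded(const float *in1, const float *in2,
                               float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                        // surplus threads in the last block skip
        out[i] = in1[i] + in2[i];
}

// Hypothetical host helper: adds two vectors of length n on the GPU using
// `threads` threads per block. Returns 0 on success, -1 on error.
int addOnGpu(const float *a, const float *b, float *res, int n, int threads)
{
    float *d_a = NULL, *d_b = NULL, *d_res = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    int blocks = (n + threads - 1) / threads;   // round up

    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_b, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_res, bytes) != cudaSuccess) {
        cudaFree(d_a);                // cudaFree(NULL) is a no-op
        cudaFree(d_b);
        cudaFree(d_res);
        return -1;
    }
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    addVectGuarded<<<blocks, threads>>>(d_a, d_b, d_res, n);
    cudaError_t err = cudaMemcpy(res, d_res, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_res);
    return err == cudaSuccess ? 0 : -1;
}
```

For a vector of 10 elements and 4 threads per block, this launches 3 blocks (12 threads), and the last two threads fall outside the vector and are masked by the test on i.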

CUDA: Tegra K1 zero copy
    // Set flag to enable zero copy access
    cudaSetDeviceFlags(cudaDeviceMapHost);
    // Host arrays
    float* h_in = NULL;
    float* h_out = NULL;
    // Allocate host memory using CUDA allocation calls
    cudaHostAlloc((void **)&h_in, sizeIn, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, sizeOut, cudaHostAllocMapped);
    // Device arrays
    float *d_out, *d_in;
    // Get device pointers from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *)h_out, 0);
    // Launch the GPU kernel
    kernel<<<blocks, threads>>>(d_out, d_in);
    // Kernel launches are asynchronous: wait before reading h_out
    cudaDeviceSynchronize();
    // No need to copy d_out back
    // Continue processing on host using h_out

CUDA: analysis of a device
    struct cudaDeviceProp {
        char   name[256];
        size_t totalGlobalMem;            // possible value 2 GB
        size_t sharedMemPerBlock;         // possible value 128 KB
        int    regsPerBlock;              // possible value 64
        int    warpSize;                  // possible value 32
        size_t memPitch;
        int    maxThreadsPerBlock;        // possible value 1024
        int    maxThreadsDim[3];
        int    maxGridSize[3];
        size_t totalConstMem;
        int    major;                     // possible value 1, 2 or 3
        int    minor;                     // possible value 1, 2, 3
        int    clockRate;                 // possible value 1.2 GHz (reported in kHz)
        size_t textureAlignment;
        int    deviceOverlap;
        int    multiProcessorCount;       // possible value 1, 2, 4
        int    kernelExecTimeoutEnabled;
    };

    // DeviceStat.cu
    #include <stdio.h>
    #include <cuda.h>
    #include <cuda_runtime.h>
    int main ()
    {
        cudaDeviceProp dp; // dp short for device properties
        int device = 0;
        cudaGetDeviceProperties(&dp,device);
        printf ("Name:%s\n", dp.name);
        // size_t fields are printed with %zu
        printf ("Memory total:%zu MB\n", dp.totalGlobalMem/(1024*1024));
        printf ("Shared memory per block:%zu in B\n", dp.sharedMemPerBlock);
        printf ("MaxThreads block:%d \n", dp.maxThreadsPerBlock);
        printf ("warpsize:%d \n", dp.warpSize);
        printf ("major:%d \n", dp.major);
        printf ("minor:%d \n", dp.minor);
        printf ("number of SM:%d \n", dp.multiProcessorCount);
        // clockRate is reported in kHz, so divide by 10^6 to obtain GHz
        printf ("Clock frequency:%1.3f in GHz \n", dp.clockRate/1.0e6);
        return 0;
    }

For Tegra K1 we obtain:
    ubuntu@tegra-ubuntu:~/cuda$ ./devicestat
    name: GK20A
    totalglobalmem: 1746 in MB
    shared memory per block: 48 in KBytes
    max threads per block: 1024
    warpsize: 32
    major: 3
    minor: 2
    multi processor count: 1
    clock rate: in GHz

CUDA: matrix multiplication
    // CPU reference version; c must be set to zero before the call
    void CPU_matrix_mul(float* a, float* b, float* c)
    {
        for(int i=0;i<DIM;i++)            // row of c
            for(int j=0;j<DIM;j++)        // column of c
                for(int k=0;k<DIM;k++)
                    c[j+i*DIM] += a[k+i*DIM]*b[j+k*DIM];
    }


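Each of the 512*512 elements of the result is calculated by a separate thread, which accumulates the products in 512 steps.
    #define Width 512 // corresponds to DIM
    // The parameter is named width so that it does not clash with the macro
    __global__ void matrix_mul(float* dev_a,float* dev_b,float* dev_c,int width)
    {
        // 2D thread ID; the block offset is needed because 512*512 threads
        // do not fit into a single block
        int tx = threadIdx.x + blockIdx.x*blockDim.x;   // column
        int ty = threadIdx.y + blockIdx.y*blockDim.y;   // row
        float Pvalue = 0;
        for(int k=0;k<width;++k)
        {
            float Ael = dev_a[ty*width + k];
            float Bel = dev_b[k*width + tx];
            Pvalue += Ael*Bel;
        }
        dev_c[ty*width + tx] = Pvalue;
    }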
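The slides do not show how the matrix kernel is launched. One possible host side is sketched below, assuming square blocks of 16 x 16 threads and a kernel that guards against indices beyond the matrix. TILE, matrixMul and matrixMulOnGpu are hypothetical names used only in this sketch.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // assumed block edge: 16 x 16 = 256 threads per block

// One thread per element of c; a, b and c are width x width, row-major.
__global__ void matrixMul(const float *a, const float *b, float *c, int width)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;   // column
    int ty = threadIdx.y + blockIdx.y * blockDim.y;   // row
    if (tx >= width || ty >= width)
        return;                                       // outside the matrix

    float p = 0.0f;
    for (int k = 0; k < width; ++k)
        p += a[ty * width + k] * b[k * width + tx];
    c[ty * width + tx] = p;
}

// Hypothetical host helper: computes h_c = h_a * h_b on the GPU.
// Returns 0 on success, -1 on error.
int matrixMulOnGpu(const float *h_a, const float *h_b, float *h_c, int width)
{
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    size_t bytes = (size_t)width * width * sizeof(float);

    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_b, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_c, bytes) != cudaSuccess) {
        cudaFree(d_a);               // cudaFree(NULL) is a no-op
        cudaFree(d_b);
        cudaFree(d_c);
        return -1;
    }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (width + TILE - 1) / TILE);
    matrixMul<<<grid, block>>>(d_a, d_b, d_c, width);

    cudaError_t err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return err == cudaSuccess ? 0 : -1;
}
```

For a width of 512 this gives a grid of 32 x 32 blocks. A simple sanity check is to multiply by the identity matrix and compare the result with the other operand.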

CUDA: performance evaluation
    float et; // short for elapsed time, in milliseconds
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // here we call the GPU kernel (or CPU function)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the stop event has been reached
    cudaEventElapsedTime(&et, start, stop);
The result of this assessment can be displayed as follows:
    printf("gpu.time:%d*%d:%3.2fms\n",Width,Width,et);
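Put together, the event calls can be wrapped around any kernel. The sketch below times a trivial kernel; busyKernel and timeKernelMs are hypothetical names, and the kernel body is only a placeholder for real work.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: one multiply-add per element.
__global__ void busyKernel(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: returns the GPU time of one kernel launch in
// milliseconds, or -1.0f if a CUDA call failed.
float timeKernelMs(int n)
{
    float *d_data = NULL;
    float et = -1.0f;                 // elapsed time
    cudaEvent_t start, stop;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, bytes);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // block the host until stop is reached
    if (cudaEventElapsedTime(&et, start, stop) != cudaSuccess)
        et = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return et;
}
```

Because the events are recorded on the GPU, the measured interval excludes the host-side allocation and cleanup around the launch.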


CUDA: shared memory & synchronization
Shared memory is visible to the threads running in the same block.
Example: the scalar product (dot product) of two vectors. Corresponding elements of the two vectors are multiplied, and the products are added into a sum that ultimately represents the dot product.

    // N (the number of vector elements) and threadsPerBlock are constants
    // assumed to be defined elsewhere
    __global__ void dot(float *a, float *b, float *c)
    {
        __shared__ float cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x*blockDim.x;
        int cidx = threadIdx.x; // cidx short for cacheIndex
        float temp = 0;
        // each thread adds up the products of the elements assigned to it
        while(tid < N)
        {
            temp += a[tid] * b[tid];
            tid += blockDim.x*gridDim.x; // step by the total number of threads
        }
        cache[cidx] = temp;
        __syncthreads(); // all partial sums must be stored before the reduction

        // vector reduction in several steps (blockDim.x must be a power of two)
        int i = blockDim.x/2;
        while (i != 0)
        {
            if(cidx < i) cache[cidx] += cache[cidx+i];
            __syncthreads();
            i /= 2;
        }
        // thread 0 writes the partial product of its block
        if (cidx == 0) c[blockIdx.x] = cache[0];
    }
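The kernel leaves one partial sum per block in c, so the host still has to add these values to obtain the dot product. A self-contained sketch of the whole computation follows. It assumes the vector length is passed as a kernel argument instead of a constant, and the names dotPartial, dotOnGpu, THREADS_PER_BLOCK and BLOCKS are hypothetical.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // must be a power of two for the halving loop
#define BLOCKS 32               // assumed grid size for this sketch

// Each block reduces its share of the products to one partial sum in c.
__global__ void dotPartial(const float *a, const float *b, float *c, int n)
{
    __shared__ float cache[THREADS_PER_BLOCK];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cidx = threadIdx.x;
    float temp = 0.0f;

    while (tid < n) {                     // grid-stride loop over the vectors
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cidx] = temp;
    __syncthreads();

    for (int i = blockDim.x / 2; i != 0; i /= 2) {   // tree reduction
        if (cidx < i)
            cache[cidx] += cache[cidx + i];
        __syncthreads();
    }
    if (cidx == 0)
        c[blockIdx.x] = cache[0];
}

// Hypothetical host helper: stores the dot product of h_a and h_b in
// *result. Returns 0 on success, -1 on error.
int dotOnGpu(const float *h_a, const float *h_b, int n, float *result)
{
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    float h_c[BLOCKS];
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_b, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_c, sizeof(h_c)) != cudaSuccess) {
        cudaFree(d_a);                    // cudaFree(NULL) is a no-op
        cudaFree(d_b);
        cudaFree(d_c);
        return -1;
    }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    dotPartial<<<BLOCKS, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
    cudaError_t err = cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess)
        return -1;

    float sum = 0.0f;                     // final addition on the CPU
    for (int i = 0; i < BLOCKS; i++)
        sum += h_c[i];
    *result = sum;
    return 0;
}
```

Finishing the sum on the CPU keeps the kernel simple: only BLOCKS values are copied back, and threads of different blocks never need to synchronize with each other.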

CUDA and graphics APIs
CUDA programs may exploit the functions provided by graphics libraries and APIs (OpenCV, OpenGL). These functions provide the image processing and image generation operations needed for rasterizing, shading, and rendering images on the screen. We use only a few OpenCV and OpenGL operations: reading and writing images from/to files (OpenCV), and displaying images directly from GPU memory (OpenGL).

CUDA and OpenCV
    // NegImage.CV.cu
    #include <opencv/highgui.h>
    #define uchar unsigned char
    #define DtoH cudaMemcpyDeviceToHost
    #define HtoD cudaMemcpyHostToDevice
    __global__ void negimage (uchar * array)
    {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        array[i] = 255 - array[i]; // byte complement
    }

    int main ()
    {
        int no = 192 * 4800;                  // number of elements
        int nb = no * sizeof (uchar);         // size in bytes
        IplImage * img = 0;                   // image in CV
        uchar * data;                         // space for the bitmap
        uchar * d_a = 0;                      // pointer to GPU global memory
        img = cvLoadImage("clipvga.jpg", 1);  // loading and decompressing
        data = (uchar *) img->imageData;
        cudaMalloc ((void **) &d_a, nb);
        int bs = 192;                         // block size
        int gs = no / bs;                     // grid size
        cudaMemcpy (d_a, data, nb, HtoD);
        negimage<<<gs,bs>>>(d_a);             // kernel call
        cudaMemcpy (data, d_a, nb, DtoH);
        cvNamedWindow ("Win1", CV_WINDOW_AUTOSIZE);
        cvShowImage ("Win1", img);
        cvWaitKey(0);
        cudaFree(d_a);
    }

CUDA and OpenGL
The mapping of a CUDA buffer onto the OpenGL framebuffer is handled by the class GPUAnimBitmap and its functions display_and_exit() and anim_and_exit().

Display of a single frame:
    int main( int argc, char **argv )
    {
        GPUAnimBitmap bitmap( DIMX, DIMY, NULL );
        bitmap.display_and_exit( (void (*)(uchar4*,void*))generate_frame, NULL );
    }
Animation (the additional int argument of generate_frame is the clock tick):
    int main( void )
    {
        GPUAnimBitmap bitmap( DIMX, DIMY, NULL );
        bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))generate_frame, NULL );
    }

Summary
- Evolution of massive multiprocessing (multi-core)
- GPUs: independent and integrated (embedded)
- NVIDIA Tegra K1 architecture
- NVIDIA and CUDA
- CUDA processing and memory model
- a few simple examples
- CUDA with OpenCV and OpenGL
