GPU programming, II: CUDA C. COMP528 Multi-Core Programming.


COMP528 Multi-Core Programming
GPU programming, II
Alexei Lisitsa, Dept of Computer Science, University of Liverpool

Different ways of GPU programming
- specialized libraries;
- compiler directives (OpenACC);
- specialized languages and language extensions (CUDA, OpenCL, OpenGL).

CUDA Platform and Programming Model
- CUDA is a parallel computing platform and programming model developed by NVIDIA;
- The NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications;
- The CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing.

CUDA C
- An extension of the C language;
- Adds specific constructions to deal with GPU parallelism.

SAXPY example
SAXPY = Single-Precision A*X + Y, where A is a scalar (number), X and Y are vectors, * is scalar multiplication, and + is vector addition. It is a simple test task for parallelization that can be used to illustrate the features of a particular parallelization approach.

SAXPY in (plain) C

    void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    //after M. Harris

OpenACC SAXPY

    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    //after M. Harris

CUDA model
The CUDA programming model:
- both the CPU and the GPU are used;
- host refers to the CPU and its memory;
- device refers to the GPU and its memory;
- code run on the host can manage memory on both the host and the device;
- code run on the host launches kernels, which are functions executed on the device;
- these kernels are executed by many GPU threads in parallel.
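To make the definition concrete, here is a small worked example of the plain C loop, written as host-only code that also compiles with nvcc. The function name saxpy_host and the input values are illustrative assumptions, not part of the slides; a host version like this is also a convenient reference for checking GPU results later.

```cuda
#include <stdio.h>

// Reference SAXPY on the host (hypothetical helper name):
// y[i] <- a*x[i] + y[i] for 0 <= i < n.
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    // Illustrative inputs, chosen so the result is easy to check by hand.
    float x[3] = {1.0f, 2.0f, 3.0f};
    float y[3] = {10.0f, 20.0f, 30.0f};

    saxpy_host(3, 2.0f, x, y);   // y becomes {2*1+10, 2*2+20, 2*3+30}

    for (int i = 0; i < 3; ++i)
        printf("y[%d] = %.1f\n", i, y[i]);   // 12.0, 24.0, 36.0
    return 0;
}
```

Note that y is both an input and the output: SAXPY updates y in place.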

Typical sequence of operations for a CUDA C program
- Declare and allocate host and device memory;
- Initialize host data;
- Transfer data from the host to the device;
- Execute one or more kernels;
- Transfer results from the device to the host.

CUDA C SAXPY, Stage I
Declare and allocate memory:

    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));    //allocate memory
    y = (float*)malloc(N*sizeof(float));    //on the host
    cudaMalloc(&d_x, N*sizeof(float));      //allocate memory
    cudaMalloc(&d_y, N*sizeof(float));      //on the device

CUDA C SAXPY, Stage II
Initialize the host arrays and copy them to the device:

    for (int i = 0; i < N; i++)             //initialization of host arrays
    {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    //copying the content of host arrays to device arrays

CUDA C SAXPY, Stage III
Execute a kernel = call a function running on the GPU:

    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    //saxpy() is a user-defined kernel, see next slides
    //<<<...>>> defines an execution configuration
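The allocation and copy calls in Stages I and II each return a cudaError_t status, and cudaMalloc takes the address of the device pointer so that it can fill it in. The sketch below shows one common way to check those statuses. The macro CUDA_CHECK and the function round_trip_ok are hypothetical names introduced here for illustration; only the runtime calls (cudaMalloc, cudaMemcpy, cudaFree, cudaGetErrorString) belong to the CUDA runtime API.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the slides): abort with a message
// if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err_));  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Copies n floats host -> device -> host and returns 1 if the data
// comes back unchanged, 0 otherwise.
int round_trip_ok(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *x    = (float*)malloc(bytes);   // source array on the host
    float *back = (float*)calloc(n, sizeof(float));   // destination on the host
    if (x == NULL || back == NULL) {
        free(x);
        free(back);
        return 0;
    }
    for (int i = 0; i < n; i++)
        x[i] = 1.0f;

    float *d_x = NULL;
    CUDA_CHECK(cudaMalloc((void**)&d_x, bytes));   // note the address-of
    CUDA_CHECK(cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(back, d_x, bytes, cudaMemcpyDeviceToHost));

    int ok = 1;
    for (int i = 0; i < n; i++)
        if (back[i] != x[i])
            ok = 0;

    CUDA_CHECK(cudaFree(d_x));   // release device memory
    free(x);
    free(back);
    return ok;
}

int main(void)
{
    printf("round trip %s\n", round_trip_ok(1 << 20) ? "ok" : "FAILED");
    return 0;
}
```

Aborting on the first error is a deliberately simple policy for a teaching example; a larger program would more likely return the error to its caller.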

Execution configuration

    <<<(N+255)/256, 256>>>

The first argument specifies the number of thread blocks in the grid. The second specifies the number of threads in a thread block. The expression (N+255)/256 rounds N/256 up, so the grid contains at least N threads even when N is not a multiple of 256.

CUDA C SAXPY, Stage IV
Transfer the results from the device to the host:

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

Kernel code

    __global__    //declaration specifier for kernels
    void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    //CUDA built-in variables, available within the threads:
    //blockIdx.x   index of a block within a grid
    //blockDim.x   dimension of a block
    //threadIdx.x  index of a thread within a block

Kernel code running
Within each thread a unique index is computed:

    int i = blockIdx.x*blockDim.x + threadIdx.x;

Then the required operation on the arrays is computed for this particular index value, after a boundary check (i < n) that keeps the surplus threads of the last block from writing past the end of the arrays.
This was a 1-D grid partition.
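The four stages and the kernel above can be assembled into one file. The following is a minimal sketch, not the lecturer's original program: the wrapper saxpy_max_error, the choice N = 1 << 20 and the final check are assumptions added for illustration, and error checking is omitted to keep the stages visible. With x[i] = 1, y[i] = 2 and a = 2, every element of the result should equal 4.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Kernel: each thread handles at most one array element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n)                                       // boundary check
        y[i] = a * x[i] + y[i];
}

// Hypothetical wrapper: runs Stages I-IV for N elements and returns the
// largest deviation of the result from the expected value 4.0.
float saxpy_max_error(int N)
{
    size_t bytes = (size_t)N * sizeof(float);

    // Stage I: allocate host and device memory.
    float *x = (float*)malloc(bytes);
    float *y = (float*)malloc(bytes);
    float *d_x = NULL, *d_y = NULL;
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);

    // Stage II: initialize host arrays and copy them to the device.
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // Stage III: launch enough 256-thread blocks to cover all N elements.
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(N, 2.0f, d_x, d_y);

    // Stage IV: copy the result back; this copy waits for the kernel.
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

    float max_err = 0.0f;
    for (int i = 0; i < N; i++)
        max_err = fmaxf(max_err, fabsf(y[i] - 4.0f));

    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
    return max_err;
}

int main(void)
{
    int N = 1 << 20;
    printf("blocks = %d\n", (N + 255) / 256);   // 4096 for N = 1048576
    printf("max error = %f\n", saxpy_max_error(N));
    return 0;
}
```

On a machine with a working CUDA device the reported maximum error should be zero, since 2*1 + 2 is exact in single precision. For an N that is not a multiple of 256, for example N = 1000, the launch uses (1000+255)/256 = 4 blocks, that is 1024 threads, and the boundary check idles the last 24.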

2D and 3D partitions
CUDA also supports 2D and 3D partitions:
- the .y and .z analogues of the built-in variables above are used (blockIdx.y, blockDim.y, threadIdx.y, and likewise for .z);
- gridDim.x, gridDim.y, gridDim.z give the dimensions of the grid.

Compilation and execution
Compilation: nvcc -o saxpy saxpy.cu
Execution: ./saxpy

Further reading
- CUDA Zone: w.html
- Mark Harris, Six Ways to SAXPY, devblogs.nvidia.com/parallelforall/six-ways-saxpy
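Returning to the 2D partitions mentioned above, the sketch below adds two matrices stored row by row, with one thread per matrix element. It is an illustrative assumption rather than material from the slides: the kernel name add2d, the wrapper add2d_mismatches, the 16x16 block shape and the test values are all chosen here. The built-in dim3 type carries the .x, .y and .z components of the execution configuration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// 2D kernel: the .x components index columns, the .y components index rows.
__global__ void add2d(int width, int height,
                      const float *a, const float *b, float *c)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {      // boundary check in both dimensions
        int idx = row * width + col;        // row-major linear index
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical wrapper: adds two width x height matrices on the GPU and
// returns the number of elements that differ from the expected sum.
int add2d_mismatches(int width, int height)
{
    int n = width * height;
    size_t bytes = (size_t)n * sizeof(float);
    float *a = (float*)malloc(bytes);
    float *b = (float*)malloc(bytes);
    float *c = (float*)calloc(n, sizeof(float));
    for (int i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                     // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);   // round up in each dimension
    add2d<<<grid, block>>>(width, height, d_a, d_b, d_c);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (c[i] != 3.0f * (float)i)
            ++mismatches;

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);
    return mismatches;
}

int main(void)
{
    printf("mismatches = %d\n", add2d_mismatches(100, 70));
    return 0;
}
```

For a 100 x 70 matrix this launches a 7 x 5 grid of 16 x 16 blocks, so the boundary check is needed along both axes.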


Introduction to CUDA. CME343 / ME339, 18 May 2011. James Balfour [jbalfour@nvidia.com], NVIDIA Research. CUDA: a programming system for machines with GPUs, comprising a programming language, compilers, runtime environments, and drivers.