Programming with CUDA and OpenCL. Dana Schaa and Byunghyun Jang Northeastern University


1 Programming with CUDA and OpenCL Dana Schaa and Byunghyun Jang Northeastern University

2 Tutorial Overview CUDA - Architecture and programming model - Strengths and limitations of the GPU - Example applications OpenCL - Architecture and programming model - Comparison with CUDA - Example applications

3 References
- CUDA Programming Guide: NVIDIA_CUDA_Programming_Guide_2.3.pdf
- CUDA SDK (example application)
- OpenCL Specification
- Introduction to GPU Programming, Volodymyr Kindratenko, Innovative Systems, NCSA, Institute for Advanced Computing Applications and Technologies (IACAT)

4 CUDA

5 Installing CUDA CUDA Driver - Software to communicate with the GPU CUDA Toolkit - Compiler, libraries, emulator, development tools CUDA SDK - Example programs


6 Hardware Architecture Scalable array of Streaming Multiprocessors (SMs) - 8 scalar processors (SIMD) Multiple memory spaces - On-chip memory (shared memory, registers, some caches) - Off-chip memory (global/device, texture, constant) High-latency, high-bandwidth PCIe interface with CPU - Transfer ~GB/s

7 GPU vs. CPU Architectures

8 Programming Model meets HW Massively multithreaded programs - Program will run correctly without considering underlying hardware, but will be very slow Programmer must divide threads between SMs (discussed in following slides) Divergence in control flow serializes SIMD execution No global synchronization*

9 Thread Structure A CUDA kernel is executed by a grid of threads. Due to GPU architecture, threads are grouped into blocks which execute together on an SM. Each block has a unique ID within a grid (block ID), and each thread has a unique ID within a block (thread ID) - Used to compute global ID
[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Grid 1 contains Block (0, 0), Block (1, 0), Block (2, 0), Block (0, 1), Block (1, 1), Block (2, 1). Block (1, 1) is expanded to show its threads, Thread (0,0,0) through Thread (3,1,0), with a second layer (0,0,1) through (3,0,1).]
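
The global ID mentioned above can be sketched in plain C so the arithmetic is checkable on a CPU. This is only an illustration: `idx2`, `global_id_1d`, and `global_id_2d` are hypothetical names standing in for CUDA's built-in `blockIdx`, `blockDim`, and `threadIdx`, and only the x and y dimensions are modeled.

```c
#include <assert.h>

/* Hypothetical stand-in for the x/y parts of blockIdx, blockDim, threadIdx. */
typedef struct { int x, y; } idx2;

/* Global 1-D ID: blocks are laid end to end, block_dim threads each. */
int global_id_1d(int block_id, int block_dim, int thread_id) {
    assert(thread_id >= 0 && thread_id < block_dim);
    return block_id * block_dim + thread_id;
}

/* Global 2-D ID: the same formula applied independently per dimension. */
idx2 global_id_2d(idx2 block_id, idx2 block_dim, idx2 thread_id) {
    idx2 g;
    g.x = global_id_1d(block_id.x, block_dim.x, thread_id.x);
    g.y = global_id_1d(block_id.y, block_dim.y, thread_id.y);
    return g;
}
```

For example, thread 5 of block 1 with 32 threads per block gets global ID 1 * 32 + 5 = 37.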

10 Thread Blocks Threads within a block: - Can perform local barriers - Have access to the same shared memory (SW cache) - Are scheduled in SIMD groups called warps Threads within a warp execute the same instruction simultaneously with different data (here is where divergence impacts performance)
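
To make the cost of divergence concrete, here is a deliberately simplified CPU model in C. It assumes a warp of 32 threads and a single if/else, and it counts one pass per branch side that at least one thread in the warp takes. The names `warp_passes` and `total_passes` are hypothetical, and real hardware scheduling is more involved than this.

```c
#include <assert.h>

#define WARP_SIZE 32  /* warp width assumed for this sketch */

/* Simplified model: for one if/else, a warp issues one pass per branch side
   that at least one of its threads takes. taken[i] != 0 means thread i takes
   the "if" side. Returns 1 (no divergence) or 2 (both sides serialized). */
int warp_passes(const int taken[WARP_SIZE]) {
    int any_if = 0, any_else = 0;
    for (int i = 0; i < WARP_SIZE; i++) {
        if (taken[i]) any_if = 1; else any_else = 1;
    }
    return any_if + any_else;
}

/* Total passes over all warps when the branch predicate is
   (tid % period) < period / 2. n_threads must be a multiple of WARP_SIZE. */
int total_passes(int n_threads, int period) {
    assert(n_threads % WARP_SIZE == 0 && period >= 2);
    int passes = 0;
    for (int w = 0; w < n_threads / WARP_SIZE; w++) {
        int taken[WARP_SIZE];
        for (int i = 0; i < WARP_SIZE; i++) {
            int tid = w * WARP_SIZE + i;
            taken[i] = (tid % period) < period / 2;
        }
        passes += warp_passes(taken);
    }
    return passes;
}
```

With 128 threads, branching on even/odd thread ID (`period = 2`) splits every warp and costs 8 passes in this model, while branching in warp-sized runs (`period = 64`) keeps each warp uniform and costs 4.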

11 Porting Applications Porting application to GPU - Create standalone C version (remove classes, library calls) - Write multi-threaded CPU version (debugging, partitioning) - Create simple CUDA version - Optimize CUDA version for underlying hardware Learning curve similar to threaded C programming - Large performance gains require mapping program to specific underlying architecture
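
The partitioning step in the multi-threaded CPU version can be sketched as follows. This is one possible scheme (contiguous chunks, with the remainder spread over the first workers), not necessarily the one the authors used; `range`, `chunk_for`, and `vecadd_chunk` are hypothetical names. Thread creation is left out so the sketch stays portable.

```c
#include <assert.h>

typedef struct { int begin; int end; } range;  /* half-open [begin, end) */

/* Contiguous chunk of n elements owned by worker w out of n_workers.
   The first (n % n_workers) workers each get one extra element. */
range chunk_for(int n, int n_workers, int w) {
    assert(n >= 0 && n_workers > 0 && w >= 0 && w < n_workers);
    int base = n / n_workers;
    int extra = n % n_workers;
    range r;
    r.begin = w * base + (w < extra ? w : extra);
    r.end = r.begin + base + (w < extra ? 1 : 0);
    return r;
}

/* Vector addition restricted to one worker's chunk. */
void vecadd_chunk(const float *A, const float *B, float *C, range r) {
    for (int i = r.begin; i < r.end; i++)
        C[i] = A[i] + B[i];
}
```

Each worker thread would call `vecadd_chunk` with its own `chunk_for` result. Because the chunks do not overlap, no locking is needed for this kernel.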

12 Vector Addition (CPU)

    #include <stdlib.h>

    // Computational kernel
    void vecAdd(float *A, float *B, float *C, int N) {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    int main() {
        int N = 4096;

        // Allocate memory
        float *A = (float *)malloc(sizeof(float) * N);
        float *B = (float *)malloc(sizeof(float) * N);
        float *C = (float *)malloc(sizeof(float) * N);

        // Initialize memory (init is a helper not shown on the slide)
        init(A);
        init(B);

        vecAdd(A, B, C, N);

        // Deallocate memory
        free(A);
        free(B);
        free(C);
    }

13 Vector Addition (GPU)

    // GPU computational kernel
    __global__ void gpuVecAdd(float *A, float *B, float *C) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        C[tid] = A[tid] + B[tid];
    }

[Figure: a grid of blocks with blockDim.x = 32. Block (0,0) and Block (1,0) each contain threads (0,0), (1,0), (2,0), ... (31,0). blockIdx.x selects the block, threadIdx.x selects the thread within it, and tid = blockIdx.x * blockDim.x + threadIdx.x.]

14 Vector Addition (GPU)

    int main() {
        int N = 4096;
        float *A = (float *)malloc(sizeof(float) * N);
        float *B = (float *)malloc(sizeof(float) * N);
        float *C = (float *)malloc(sizeof(float) * N);
        init(A);
        init(B);

        // Allocate memory on GPU
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, sizeof(float) * N);
        cudaMalloc(&d_B, sizeof(float) * N);
        cudaMalloc(&d_C, sizeof(float) * N);

        // Initialize memory on GPU
        cudaMemcpy(d_A, A, sizeof(float) * N, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, sizeof(float) * N, cudaMemcpyHostToDevice);

        // Configure threads
        dim3 dimBlock(32, 1);
        dim3 dimGrid(N / 32, 1);

        // Run kernel (on GPU); the grid dimensions come first, then the block dimensions
        gpuVecAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

        // Copy results back to CPU
        cudaMemcpy(C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);

        // Deallocate memory on GPU
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

        free(A);
        free(B);
        free(C);
    }
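
The launch above uses N = 4096 with 32 threads per block, which divides evenly. When N is not a multiple of the block size, a common approach is to round the block count up and guard the kernel body with a bounds check. The plain C sketch below emulates that on the CPU; `blocks_for` and `emulate_guarded_vecadd` are hypothetical names, not CUDA API, and the nested loops stand in for the grid and block.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements (ceiling division). */
int blocks_for(int n, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* CPU emulation of a guarded vector-add launch: every (block, thread) pair
   computes its tid, and threads with tid >= n do nothing.
   Returns the number of threads that were launched. */
int emulate_guarded_vecadd(const float *A, const float *B, float *C,
                           int n, int block_dim) {
    int grid_dim = blocks_for(n, block_dim);
    for (int b = 0; b < grid_dim; b++) {
        for (int t = 0; t < block_dim; t++) {
            int tid = b * block_dim + t;
            if (tid < n)  /* the guard the CUDA kernel would also need */
                C[tid] = A[tid] + B[tid];
        }
    }
    return grid_dim * block_dim;
}
```

In a real kernel this means passing N as an extra argument and wrapping the addition in `if (tid < N)`, so the surplus threads in the last block do not write past the end of the arrays.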

15 Example: Image Flip Original Input Image Rotated Output Image

16 Image Flip (GPU)

    int main() {
        int width, height;
        float *inImage, *outImage;

        // readImage and writeImage are helpers not shown on the slide
        readImage(inImage, &width, &height);
        int size = width * height;
        outImage = (float *)malloc(sizeof(float) * size);

        float *d_inImage, *d_outImage;
        cudaMalloc(&d_inImage, sizeof(float) * size);
        cudaMalloc(&d_outImage, sizeof(float) * size);
        cudaMemcpy(d_inImage, inImage, sizeof(float) * size, cudaMemcpyHostToDevice);

        // One thread per pixel; width and height must be multiples of the 8x8 block
        dim3 dimBlock(8, 8);
        dim3 dimGrid(width / dimBlock.x, height / dimBlock.y);
        flipImage<<<dimGrid, dimBlock>>>(d_inImage, d_outImage, width, height);

        cudaMemcpy(outImage, d_outImage, sizeof(float) * size, cudaMemcpyDeviceToHost);
        cudaFree(d_inImage);
        cudaFree(d_outImage);

        writeImage(outImage);
        free(inImage);
        free(outImage);
    }

17 Image Flip (GPU)

    __global__ void flipImage(float *inImage, float *outImage, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Row y of the input becomes row (height-1)-y of the output
        outImage[((height - 1) - y) * width + x] = inImage[y * width + x];
    }

[Figure: the image, labeled 512, covered by Thread Block (0, 0), Thread Block (1, 0), ... Thread Block (63, 0) across the top and Thread Block (0, 63) ... Thread Block (63, 63) across the bottom.]
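
When porting a kernel like this, it helps to keep a CPU reference to compare against the GPU output. The sketch below applies the same index formula in plain C and counts differing pixels; `flip_image_cpu` and `count_mismatches` are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>

/* CPU reference for the vertical flip: row y of the input becomes row
   (height-1)-y of the output, using the same index formula as the kernel. */
void flip_image_cpu(const float *in, float *out, int width, int height) {
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            out[((height - 1) - y) * width + x] = in[y * width + x];
}

/* Returns the number of pixels that differ between two images. */
int count_mismatches(const float *a, const float *b, int size) {
    int bad = 0;
    for (int i = 0; i < size; i++)
        if (a[i] != b[i]) bad++;
    return bad;
}
```

After copying the GPU result back to the host, calling `count_mismatches` on it and the CPU reference should return 0. Exact comparison is reasonable here because the kernel only moves values and does no arithmetic on them.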

18 Checking GPU Capabilities Run the deviceQuery program in the CUDA SDK

19 OpenCL

20 OpenCL Architecture Parallel computing for heterogeneous devices - CPUs, GPUs, other processors (Cell, DSPs, etc.) - Portable accelerated code Defined in four parts - Platform Model - Execution Model - Memory Model - Programming Model

21 Platform Model The model consists of a host connected to one or more OpenCL devices A device is divided into one or more compute units Compute units are divided into one or more processing elements

22 Execution Model

    CUDA Terminology    OpenCL Terminology
    Grid                Index space
    Block               Work-group
    Thread              Work-item

23 Execution Model 2 main parts: - Host programs execute on the host - Kernels execute on one or more OpenCL devices Each instance of a kernel is called a work-item Work-items are organized as work-groups When a kernel is submitted, an index space of work-groups and work-items is defined Work-items can identify themselves based on their work-group ID and their local ID within the work-group (sound familiar?)
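
The relationship between a work-item's global ID, its work-group ID, and its local ID can be sketched in plain C. The sketch assumes a one-dimensional index space with a zero global offset, in which case the OpenCL built-ins satisfy get_global_id(0) = get_group_id(0) * get_local_size(0) + get_local_id(0). The names `work_item_ids`, `split_global_id`, and `join_ids` are hypothetical.

```c
#include <assert.h>

/* Plain-C stand-ins for what get_group_id(0) and get_local_id(0) would
   report for a given get_global_id(0), assuming a 1-D index space with a
   zero global offset. */
typedef struct { int group_id; int local_id; } work_item_ids;

work_item_ids split_global_id(int global_id, int local_size) {
    assert(global_id >= 0 && local_size > 0);
    work_item_ids ids;
    ids.group_id = global_id / local_size;
    ids.local_id = global_id % local_size;
    return ids;
}

/* The inverse: rebuild the global ID from work-group ID and local ID. */
int join_ids(work_item_ids ids, int local_size) {
    return ids.group_id * local_size + ids.local_id;
}
```

This is the same arithmetic as the CUDA expression blockIdx.x * blockDim.x + threadIdx.x, with the work-group playing the role of the block.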


24 Execution Model
[Figure: an OpenCL context spanning a CPU and a GPU, containing Programs, Kernels, Memory Objects, and Command Queues.
Programs: the source __kernel void dp_mul(global const float *a, global const float *b, global float *c) { int id = get_global_id(0); c[id] = a[id] * b[id]; } is compiled into a dp_mul CPU program binary and a dp_mul GPU program binary.
Kernels: dp_mul with arg[0] value, arg[1] value, arg[2] value.
Memory Objects: Images, Buffers.
Command Queues: In-Order Queue and Out-of-Order Queue, attached to the GPU.
Copyright Khronos Group.]

25 Execution Model A context refers to the environment in which kernels execute - Devices - Kernels (OpenCL functions that run on OpenCL devices) - Program objects (The program source that implements the kernel) - Memory objects (Data that can be operated on by the device) - Command queues are used to coordinate execution of the kernels on the devices Memory commands (data transfers) Kernel synchronization commands Synchronization Execution between host and device(s) is asynchronous Commands can execute in-order or out-of-order

26 Memory Model Defines the various types of supported memories. No guarantees of consistency between different work-groups.

    Memory      Description
    Global      Accessible by all work-items
    Constant    Read-only (RO), global
    Local       Local to a work-group
    Private     Private to a work-item

27 Programming Model Data parallel - One-to-one mapping between work-items and elements in a memory object - Work-groups can be defined explicitly (like CUDA) or implicitly (specify the number of work-items and OpenCL creates the work-groups) Task parallel - Kernel is executed independent of an index space - Other ways to express parallelism: enqueueing multiple tasks, using device-specific vector types, etc. Synchronization - Possible between items in a work-group - Possible between commands in a context command queue

28 OpenCL Program Flow Typical OpenCL program: - Select the desired devices (ex: all GPUs) - Create a context - Create command queues (per device) - Compile programs - Create kernels - Allocate memory on devices - Transfer data to devices - Execute - Transfer results back - Free memory on devices

29 Vector Addition (OpenCL)

    __kernel void VectorAdd(__global const float* A,
                            __global const float* B,
                            __global float* C)
    {
        // get index into global data array
        int iGID = get_global_id(0);

        // add the vector elements
        C[iGID] = A[iGID] + B[iGID];
    }

30 Vector Addition (OpenCL)

    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // get the list of GPU devices associated with context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects (n floats each, matching the float* kernel arguments)
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * n, NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, (const char **)&program_source, NULL, NULL);

    // build the program
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel (the name must match the __kernel function, VectorAdd)
    kernel = clCreateKernel(program, "VectorAdd", NULL);
    ...

31 Vector Addition (OpenCL)

    // set the args values
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 1;

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    // read output array (blocking read)
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n * sizeof(cl_float), dst, 0, NULL, NULL);

    // clean up
    clReleaseMemObject(memobjs[0]);
    clReleaseMemObject(memobjs[1]);
    clReleaseMemObject(memobjs[2]);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
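
The slide uses a work-group size of 1, which always divides n. With a larger explicit work-group size, OpenCL 1.x requires the global work size to be a multiple of the local work size, so host code commonly pads the global size upward. The C sketch below shows that rounding; `round_up_global` and `num_work_groups` are hypothetical helper names, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size. */
size_t round_up_global(size_t global_size, size_t local_size) {
    assert(local_size > 0);
    size_t rem = global_size % local_size;
    return rem == 0 ? global_size : global_size + (local_size - rem);
}

/* Number of work-groups that the padded index space contains. */
size_t num_work_groups(size_t global_size, size_t local_size) {
    return round_up_global(global_size, local_size) / local_size;
}
```

If the global size is padded this way, the kernel also needs the true element count as an argument and a check such as `if (iGID < n)`, so the extra work-items do not touch memory past the end of the buffers.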

32 GPU projects at NU Tomosynthesis mammography 3D Cardiac CT Vascular segmentation Physics simulation (surgical simulator) Hyperspectral imaging Image manipulation (convolution, filtering) Phase unwrapping Ray tracing Memory hierarchy analysis Compiler optimizations Clustering algorithms (kmeans)

33 Thank you!
