Accelerate with GPUs: Harnessing GPGPUs with Trending Technologies
1 Accelerate with GPUs: Harnessing GPGPUs with Trending Technologies. Anubhav Jain and Amit Kalele, Parallelization and Optimization CoE, Tata Consultancy Services Ltd. Copyright 2016 Tata Consultancy Services Limited.
2 Outline. Overview of GPUs: HW architecture; parallelism; CPU/GPU model. Programming technologies: introduction to GPU programming, tools, SDKs, etc.; OpenACC (directive-based approach); introduction to CUDA C (Nvidia-specific approach); OpenCL (generic approach); HIP (AMD-specific approach). Advanced programming example (CUDA approach).
3 Basic Terminology. Host = CPU, with CPU memory (host memory). Device = GPU, with GPU memory (device memory).
4 CPU & GPU Architecture [block diagram: CPU with control logic, cache, a few ALUs, and DRAM; GPU with many ALUs and DRAM]. CPU: low latency, low throughput; fewer but more powerful ALUs. GPU: high latency, high throughput; many but simpler ALUs.
5 CPU vs GPU Parallelism. CPUs use task/data parallelism: multiple tasks map to multiple threads; tasks run different instructions; tens of relatively heavyweight threads run on tens of cores; each thread has to be individually programmed. GPUs use data parallelism: the SIMT model (Single Instruction, Multiple Threads); the same instruction runs on different data; tens of thousands of lightweight threads run on hundreds of cores; programming is done for batches of threads (e.g. a block of threads).
6 CPU-GPU Model [diagram: CPU and its memory connected to GPU and its memory over PCIe x16]. CPU: host or master, offloads work to the GPU. GPU: device or co-processor, the worker.
7 GPGPU Programming Technologies: AMD's HIP, OpenACC, OpenCL, CUDA. GPGPU terminology: GPGPU.org.
8 GPU Streaming Multiprocessor. Nvidia Tesla K40: 15 SMX units, 192 cores per SMX, 2880 cores in total.
9 GPU Memory Model (CUDA). Local: private to each thread. Shared: visible to all threads within a block; fast on-chip memory; like a user-managed cache. Global: visible to all threads within the grid. Texture/Constant: read-only, visible to all threads within the grid. A short sketch of these spaces in code follows below.
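To make these spaces concrete, here is a minimal sketch (not from the slides; the kernel and array names are illustrative) of how each space is declared in CUDA C:

__constant__ float coeff[16];            // constant memory: read-only, visible to all threads

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];          // shared memory: one copy per block, on-chip
                                         // (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in registers (thread-local)
    if (i < n) {
        tile[threadIdx.x] = in[i];       // read from global memory into shared memory
        __syncthreads();
        out[i] = tile[threadIdx.x] * coeff[0];       // write result back to global memory
    }
}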
10 Simple Program Workflow [diagram: host CPU and CPU memory connected over the PCI bus to the device, a grid of GPU cores with GPU memory]. Step 1: copy input data from host memory to device memory.
11 Simple Program Workflow (continued). Step 2: load the device code onto the GPU and execute it.
12 Simple Program Workflow (continued). Step 3: copy the results back from device memory to host memory.
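As a minimal, self-contained sketch of these three steps using the CUDA runtime API (introduced later in these slides; the kernel and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];                                  // device code, runs on the GPU
}

int main(void) {
    const int N = 1024;
    size_t size = N * sizeof(float);
    float h_in[N], h_out[N];                                // host buffers (initialization omitted)
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);

    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);   // 1. copy input host -> device
    kernel<<<N / 256, 256>>>(d_in, d_out);                  // 2. load device code and execute
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost); // 3. copy results device -> host

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}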
13 OpenACC (Open Accelerators): developed by Cray, CAPS, Nvidia and PGI. PGI's OpenACC supports Nvidia Tesla, AMD Radeon, and Intel & AMD multicore processors. Supports C, C++ and Fortran. PGI compilers: pgcc, pgc++ and pgfortran. Directive-based approach.
14 OpenACC: vectorAdd [diagram: parallel vector addition C = A + B across elements [0]..[101] of A, B and C].

#pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

The data clauses allocate a, b, c on the device, copy a and b from host to device, and copy c from device to host. An alternative form, leaving data movement to the compiler:

#pragma acc parallel loop
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
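A complete, compilable version of this example, as a minimal sketch (the array size and initialization are illustrative):

#include <stdlib.h>

#define N 102

int main(void) {
    float *a = (float *)malloc(N * sizeof(float));
    float *b = (float *)malloc(N * sizeof(float));
    float *c = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

    // The compiler generates device code for the loop and moves a, b, c as directed
    #pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    free(a); free(b); free(c);
    return 0;
}

Built with, e.g., pgcc -acc vectoradd.c; target selection flags appear on slide 16.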
15 OpenACC: vectorAdd. The OpenACC execution model has three levels: gang, worker, and vector. On Nvidia GPUs, roughly gang == block, worker == warp, and vector == threads.

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

This spawns 100 CUDA blocks with 128 threads per block, and helps in fine-tuning.
16 OpenACC: vectorAdd. Compile with pgcc, selecting the target accelerator with the -ta flag:

pgcc -ta=tesla vectoradd.c

Targets include -ta=tesla[:cuda7.0|cuda7.5|cuda8.0], -ta=radeon, -ta=host and -ta=multicore. OpenACC learning reference:
17 CUDA: uses CUDA-enabled Nvidia GPUs for GPGPU processing. Developed by Nvidia. Supports C, C++ and PGI CUDA Fortran; third-party wrappers are also available for Java, Perl, Python, MATLAB, etc. CUDA compiler: nvcc. It splits each source file into host and device components; the CUDA compiler handles the device code, while a standard host compiler handles the host code.
18 CUDA

Device code:

/* Device kernel */
__global__ void kernel(...) { ... }

Host code:

int main() {
    /* Allocate space on device */
    cudaMalloc(...);
    /* Copy data from host to device */
    cudaMemcpy(..., cudaMemcpyHostToDevice);
    /* Launch GPU kernel */
    kernel<<<gridSize, blockSize>>>(...);
    /* Copy data from device to host */
    cudaMemcpy(..., cudaMemcpyDeviceToHost);
    /* Cleanup */
    cudaFree(...);
    return 0;
}

Execution alternates between a serial region (host), a parallel region (device), and a serial region (host).
19 VectorAdd Example using CUDA (full listing; broken down on the next slides)

#include <stdlib.h>
#include <cuda_runtime.h>

// Device kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate the device vectors
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Launch the vector add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the device result vector to host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    // Free host memory
    free(A); free(B); free(C);

    cudaDeviceReset();
    return 0;
}
20 VectorAdd Example using CUDA: host-side data allocation and initialization (device calls on the next slide)

#include <cuda_runtime.h>

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Device calls here: NEXT SLIDE

    // Free host memory
    free(A); free(B); free(C);
    cudaDeviceReset();
    return 0;
}
21 VectorAdd Example using CUDA: device-side allocation, transfers and kernel launch

// Allocate the device vectors
float *d_A, *d_B, *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// Copy input vectors from host to device
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

// Launch the vector add CUDA kernel (kernel on the next slide)
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// Copy the device result vector from device back to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

// Free device global memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
22 VectorAdd Example using CUDA: the device kernel

// Device kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
23 Thread Hierarchy [diagram: thread -> block -> grid]. A kernel is executed as a grid of blocks of threads; pictured on the Nvidia Tesla K20 (Kepler) architecture. A launch-configuration sketch follows below.
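A small sketch (not from the slides) of how this hierarchy is specified at launch time, reusing the names from the vectorAdd example; dim3 generalizes the same launch to 2D and 3D grids and blocks:

// 1D launch configuration for vectorAdd
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
dim3 block(threadsPerBlock);          // threads per block (x dimension only)
dim3 grid(blocksPerGrid);             // blocks per grid (x dimension only)
vectorAdd<<<grid, block>>>(d_A, d_B, d_C, numElements);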
24 VectorAdd Example using CUDA: computing the global index [diagram: threadIdx.x runs 0..31 within each block; blockIdx.x = 0, 1, 2, ...; arrays A, B and C with elements [0]..[101] in device memory].

int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
    C[i] = A[i] + B[i];

Each thread computes one element: blockDim.x * blockIdx.x offsets into the array by whole blocks, and threadIdx.x selects the element within the block.
25 VectorAdd Example using CUDA. Compile: nvcc vectoradd.cu -arch=sm_30. Compute capabilities of Nvidia cards: 1.0, 2.0, 3.0, 3.5, etc.; cudaGetDeviceProperties() returns the major and minor numbers (the compute capability). CUDA SDK tools: nvvp (Nvidia Visual Profiler), nvprof (command-line profiler), cuda-gdb (Nvidia CUDA debugger). CUDA learning reference:
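A small sketch (not from the slides) of querying the compute capability with cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

The major.minor pair printed here is what the -arch=sm_XY flag should match (e.g. 3.0 -> sm_30).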
26 Nvidia Visual Profiler (nvvp) [screenshot]: provides Analysis, Timeline, GPU Details and PC Sampling views.
27 OpenCL: targets CPUs, GPUs, DSPs, FPGAs and other processors and hardware accelerators. Implementations are available from AMD, Apple, IBM, Altera, Intel, Nvidia, Qualcomm, Samsung, etc. Supports C and C++; third-party APIs for Python, Java, .NET, etc.
28 Program Workflow: CUDA vs OpenCL.
CUDA: create memory buffers on the device; copy data from host to device; execute the kernel; copy results back to the host; clean up device memory.
OpenCL: read the kernel (offline/online); create an OpenCL context; create a command queue; create memory buffers on the device; copy data from host to device; create a program from the kernel source; build the program; create the OpenCL kernel; set the kernel arguments; execute the kernel; copy results back to the host; clean up device memory.
29 VectorAdd Example using OpenCL (full listing; the kernel is shown on the next slide)

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define MAX_SOURCE_SIZE (0x100000)

int main(void) {
    const int LIST_SIZE = 1024 * 1024;
    int *A = (int *)malloc(sizeof(int) * LIST_SIZE);
    int *B = (int *)malloc(sizeof(int) * LIST_SIZE);
    for (int i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

    // Load the kernel source code into the array source_str
    FILE *fp = fopen("vector_add_kernel.cl", "r");
    char *source_str = (char *)malloc(MAX_SOURCE_SIZE);
    size_t source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);

    // Get platform and device information (steps hidden on the slide)
    cl_device_id device_id;
    cl_int ret;
    ...

    // Create an OpenCL context
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

    // Create memory buffers on the device for each vector
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                      LIST_SIZE * sizeof(int), NULL, &ret);
    // (b_mem_obj and c_mem_obj are created likewise)

    // Copy the lists A and B to their respective memory buffers
    ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                               LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
    // (B is copied likewise)

    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1,
            (const char **)&source_str, (const size_t *)&source_size, &ret);

    // Build the program
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

    // Set the arguments of the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

    // Execute the OpenCL kernel on the list
    size_t global_item_size = LIST_SIZE;  // process the entire lists
    size_t local_item_size = 64;          // divide work-items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, &local_item_size, 0, NULL, NULL);

    // Read the memory buffer C on the device to the local variable C
    int *C = (int *)malloc(sizeof(int) * LIST_SIZE);
    ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
                              LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

    // Clean up
    ret = clFlush(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(a_mem_obj);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);

    // Free host memory
    free(A); free(B); free(C);
    return 0;
}
30 VectorAdd Example using OpenCL: the device kernel (vector_add_kernel.cl; host code as on the previous slide)

// Device kernel
__kernel void vector_add(__global const int *A,
                         __global const int *B,
                         __global int *C) {
    // Get the work-item (thread) id
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
31 OpenCL Online/Offline kernel compilation. Offline: a precompiled kernel binary is read in by the host code and loaded with clCreateProgramWithBinary(). Online: the kernel source file is read in by the host code and compiled at run time with clCreateProgramWithSource().
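A hedged sketch of the offline path, assuming context and device_id were obtained as in the example above and that vector_add_kernel.bin (an illustrative file name) holds a binary previously compiled for that device; clBuildProgram() is still called to finalize the program:

FILE *fp = fopen("vector_add_kernel.bin", "rb");       // precompiled kernel binary
fseek(fp, 0, SEEK_END);
size_t binary_size = (size_t)ftell(fp);
rewind(fp);
unsigned char *binary = (unsigned char *)malloc(binary_size);
fread(binary, 1, binary_size, fp);
fclose(fp);

cl_int binary_status, ret;
cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
        &binary_size, (const unsigned char **)&binary, &binary_status, &ret);
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);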
32 OpenCL Tools. Compile: gcc vectoradd.c -I /usr/local/cuda-7.0/include -L /usr/local/cuda-7.0/lib64 -lOpenCL. Tools: CodeXL (powerful debugging, profiling & analysis), LTPV (Light Temporal Performance Viewer), AMD gDEBugger (debugger & memory analyzer). OpenCL learning reference:
33 Terminology Comparison

CUDA           OpenCL              |  CUDA          OpenCL
thread         work-item           |  threadIdx.x   get_local_id(0)
block          work-group          |  blockIdx.x    get_group_id(0)
grid           NDRange             |  blockDim.x    get_local_size(0)
__global__     __kernel            |  <<< >>>       clEnqueueNDRangeKernel
34 Memory Model Comparison [side-by-side diagrams of the OpenCL and CUDA memory models; summarized on the next slide].
35 Memory Model Comparison.
CUDA memory spaces: global, local, texture, constant and shared memory, plus registers.
OpenCL memory spaces: global, constant, local and private memory.
Note that the names clash: CUDA shared memory corresponds to OpenCL local memory, while CUDA local memory and registers correspond to OpenCL private memory.
36 HIP (Heterogeneous-compute Interface for Portability): developed by AMD; allows developers to convert CUDA code to common C++. Supports C and C++. The resulting code can be compiled with either Nvidia's NVCC or AMD's HCC compiler, providing more choice in hardware and development tools.
37 HIP Code Conversion Workflow. hipify: a tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins. For AMD platforms: AMD Kaveri, Carrizo and Fiji. For Nvidia platforms: requires Unified Memory and CUDA SDK 6.0 or newer.
38 HIP Tools.
hipify: tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins.
hipcc: compiler driver that can replace nvcc in existing CUDA code; calls nvcc or hcc depending on the platform, and includes the appropriate platform-specific headers and libraries.
hipconfig: prints the HIP configuration (HIP_PATH, HIP_PLATFORM, CXX config flags, etc.).
hipexamine.sh: script to scan a directory, find all code, and report statistics on how much can be ported with HIP (and identify likely features not yet supported).
hipconvertinplace.sh: hipifies all code files in the CUDA source directory in place.
39 VectorAdd Example using HIP

Convert: hipify vectoradd.cu > vectoradd.cpp
Compile: hipcc vectoradd.cpp

#include <hip/hip_runtime.h>

// Device kernel
__global__ void vectorAdd(hipLaunchParm lp, const float *A, const float *B,
                          float *C, int numElements) {
    int i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate and initialize the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate the device vectors
    float *d_A, *d_B, *d_C;
    hipMalloc((void **)&d_A, size);
    hipMalloc((void **)&d_B, size);
    hipMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    hipMemcpy(d_A, A, size, hipMemcpyHostToDevice);
    hipMemcpy(d_B, B, size, hipMemcpyHostToDevice);

    // Launch the vector add kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid),
                    dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);

    // Copy the device result vector to host
    hipMemcpy(C, d_C, size, hipMemcpyDeviceToHost);

    // Free device and host memory
    hipFree(d_A); hipFree(d_B); hipFree(d_C);
    free(A); free(B); free(C);

    hipDeviceReset();
    return 0;
}
40 HIP Support.
Supported: devices (hipSetDevice(), hipGetDeviceProperties(), etc.); memory management (hipMalloc(), hipMemcpy(), hipFree()); streams (hipStreamCreate(), etc., under development); events (hipEventRecord(), hipEventElapsedTime(), etc.); kernel launching; CUDA-style kernel indexing; device-side math built-ins; error reporting (hipGetLastError(), hipGetErrorString()).
Partially supported or not supported (HIP is evolving; check for regular updates at ProfessionalCompute-Tools/HIP): textures; constant memory; dynamic parallelism; managed memory; CUDA libraries; graphics interoperation with OpenGL or Direct3D; CUDA arrays, mipmapped arrays and pitched memory; the CUDA driver API.
41 Methods of Kernel Launch.

CUDA:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

OpenCL:
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

HIP:
hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);

HIP Porting Guide:
42 Terminology Comparison

CUDA          OpenCL                   HIP
thread        work-item                thread
block         work-group               block
grid          NDRange                  grid
__global__    __kernel                 __global__
threadIdx.x   get_local_id(0)          hipThreadIdx_x
blockIdx.x    get_group_id(0)          hipBlockIdx_x
blockDim.x    get_local_size(0)        hipBlockDim_x
<<< >>>       clEnqueueNDRangeKernel   hipLaunchKernel
43 Programming with Shared Memory. GPU shared memory is visible to all threads within a block; it is fast on-chip memory, like a user-managed cache.
44 Parallel Reduction [diagram: each thread loads one element of g_idata into sdata, indexed by threadIdx.x within blocks blockIdx.x = 0, 1, 2, ...]

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    // Each thread loads one element from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // Do the reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write the result for this block to global memory
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
45 Parallel Reduction (continued; code as on the previous slide) [diagram: successive passes over sdata; at each step s, pairs of elements 2*s apart are summed in shared memory until sdata[0] holds the block's total].
46 Parallel Reduction (continued; code as on slide 44) [diagram: thread 0 of each block writes its block's partial sum sdata[0] to g_odata[blockIdx.x]]. A host-side launch sketch follows below.
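A hedged host-side sketch (not from the slides) of launching reduce0; d_idata and d_odata are device arrays as in the kernel above, while h_odata and n are illustrative names. Note the third launch parameter, which sizes the extern __shared__ array, and the final pass combining the per-block partial sums:

int n = 1 << 20;                                   // number of input elements in d_idata
int threads = 256;
int blocks = (n + threads - 1) / threads;
int *d_odata, *h_odata = (int *)malloc(blocks * sizeof(int));
cudaMalloc(&d_odata, blocks * sizeof(int));

// The third launch parameter allocates threads * sizeof(int) bytes of
// dynamic shared memory per block for the extern __shared__ sdata[] array
reduce0<<<blocks, threads, threads * sizeof(int)>>>(d_idata, d_odata);

// Combine the per-block partial sums on the host
// (a second reduce0 pass over d_odata would also work)
cudaMemcpy(h_odata, d_odata, blocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < blocks; ++b) total += h_odata[b];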
47 GPU Accelerated Libraries

Application                                CUDA       OpenCL
Fast Fourier transforms                    cuFFT      clFFT
Basic linear algebra                       cuBLAS     clBLAS, CLBlast
Sparse matrix                              cuSPARSE   clSpMV
Dense and sparse direct solvers            cuSOLVER   SuperLU
Parallel algorithms and data structures    Thrust     clpp
Random number generation                   cuRAND
Maths library                              ArrayFire
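As an illustration (not from the slides) of what such libraries buy you, the whole vectorAdd example collapses to a few lines with Thrust; a minimal sketch, compiled with nvcc:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main(void) {
    int n = 50000;
    thrust::device_vector<float> A(n, 1.0f);   // allocated in GPU global memory
    thrust::device_vector<float> B(n, 2.0f);
    thrust::device_vector<float> C(n);

    // C[i] = A[i] + B[i], executed on the device; Thrust picks the launch configuration
    thrust::transform(A.begin(), A.end(), B.begin(), C.begin(), thrust::plus<float>());
    return 0;
}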
48 Success Stories [bar chart: speedup achieved with GPUs by various applications, relative to their original performance]. Applications include: Incremental Risk Charge (IRC), with a 50X hardware footprint reduction; Value at Risk (VaR); digital watermarking; Principal Component Analysis (PCA); mark-to-market computation for an IRS portfolio; and 2D LBM for fluid flows.
49 Concluding Remarks. There are many technologies for developing applications on GPUs. Some are easy to use (OpenACC) but may yield limited performance gains. CUDA/Nvidia has a good overall presence, but a proprietary technology may not be acceptable to some. OpenCL offers the flexibility of porting between hardware platforms, at the cost of a somewhat more tedious programming style than the others. HIP is a good tool for converting a CUDA application and running it on an AMD device. In general, GPUs offer a good performance benefit for massively data-parallel applications.
50 Thank You