Introduction to Parallel Computing with CUDA. Oswald Haan



2 Schedule
  Introduction to Parallel Computing with CUDA
  Using CUDA
  CUDA Application Examples
  Using Multiple GPUs
  CUDA Application Libraries

3 Course Material
  All presentations available under
  All code examples can be copied with
    cp -r ~ohaan/cuda_kurs/* .
    cp -r ~ohaan/cuda_kurs_f/* .
  Detailed documentation from NVIDIA's CUDA Toolkit Documentation
  Introductory book: CUDA by Example, authors: J. Sanders and E. Kandrot

4 Course Material
  CUDA for Fortran
    PGI's CUDA Fortran Programming Guide and Reference
    A series of introductory articles

5 GPGPU: General Purpose Computation on Graphics Processing Units
  [Figure: host (multicore CPU with host memory) connected to a coprocessor (manycore GPU with coprocessor memory)]

6 General Purpose Processor vs Graphics Processor
  [Figure: size of chip area used for different purposes, CPU vs. GPU]

7 NVIDIA GPU Tesla K40
  Tesla K40 provides:
    15 Streaming Multiprocessors (SMX)
    12 GB Main Memory
    288 GB/s Memory Bandwidth
    1.5 MB L2 Cache


8 Tesla K40 SMX
  192 SP Floating Point Units
  64 DP Floating Point Units
  32 Special Functional Units
  32 Load/Store Units
  Clock rate 745 MHz
  64K 32-bit registers per SMX
  64 KB shared memory / L1 cache per SMX
  48 KB read-only / texture cache per SMX
  Nominal maximal speed for floating point operations:
    SP: 4.29 TeraFlop/s
    DP: 1.43 TeraFlop/s
  Nominal maximal speed for main memory accesses:
    SP: 72 GigaWords/s
    DP: 36 GigaWords/s

9 GeForce GTX 980
  GeForce GTX 980 provides:
    16 Streaming Multiprocessors
    4.3 GB Main Memory
    224 GB/s Memory Bandwidth
    2 MB L2 Cache


10 GeForce GTX 980 Streaming Multiprocessor
  128 SP Floating Point Units
  4 DP Floating Point Units
  32 Special Functional Units
  32 Load/Store Units
  Clock rate 1240 MHz
  64K 32-bit registers per SMX
  96 KB shared memory per SMX
  24 KB texture / L1 cache per SMX
  Nominal maximal speed for floating point operations:
    SP: 5.08 TeraFlop/s
    DP: 0.16 TeraFlop/s
  Nominal maximal speed for main memory accesses:
    SP: 56 GigaWords/s
    DP: 28 GigaWords/s
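The nominal peak rates on the hardware slides follow from the unit counts, clock rates and memory bandwidths listed there. The following host-side C++ sketch (no GPU required) reproduces that arithmetic. It assumes that every floating point unit completes one fused multiply-add per clock cycle, counted as 2 operations; the struct and function names are illustrative and not part of CUDA.

```cpp
#include <cmath>

// Illustrative hardware description; field names are hypothetical,
// the values below are the ones quoted on the slides.
struct GpuSpec {
    int    num_sm;           // number of streaming multiprocessors
    int    sp_units_per_sm;  // single precision units per multiprocessor
    int    dp_units_per_sm;  // double precision units per multiprocessor
    double clock_ghz;        // clock rate in GHz
    double mem_bw_gbytes;    // main memory bandwidth in GB/s
};

const GpuSpec kTeslaK40 = {15, 192, 64, 0.745, 288.0};
const GpuSpec kGtx980   = {16, 128,  4, 1.240, 224.0};

// Assumption of this sketch: 2 operations (one fused multiply-add)
// per unit and clock cycle. Result in GigaFlop/s.
inline double peak_sp_gflops(const GpuSpec& g) {
    return 2.0 * g.num_sm * g.sp_units_per_sm * g.clock_ghz;
}

inline double peak_dp_gflops(const GpuSpec& g) {
    return 2.0 * g.num_sm * g.dp_units_per_sm * g.clock_ghz;
}

// Memory rate in GigaWords/s for words of the given size
// (4 bytes for SP, 8 bytes for DP).
inline double peak_gwords(const GpuSpec& g, int bytes_per_word) {
    return g.mem_bw_gbytes / bytes_per_word;
}
```

For example, peak_dp_gflops(kTeslaK40) evaluates 2 x 15 x 64 x 0.745 = 1430.4 GigaFlop/s, i.e. the 1.43 TeraFlop/s quoted for the Tesla K40, and peak_gwords(kGtx980, 4) gives the 56 GigaWords/s quoted for the GeForce GTX 980.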

11 GPGPU-Programming Model: Offloading of kernels
  Within a conventional sequential program, special subroutines are defined as kernels, which can be offloaded to the GPU.
  Memory areas for data accessed by the kernel subroutine must be provided on the host processor and on the GPU
  Input data used by the kernel have to be copied from host to GPU before the kernel is invoked
  Result data produced by the kernel must be copied from GPU to host after the kernel execution is completed

12 [Figure: sequence of host and GPU activities: memory allocation on host; memory allocation on GPU; copy of kernel input data; invocation of kernel execution on GPU; kernel execution on GPU; copy of kernel output data]

13 GPGPU-Programming Model: Parallel execution of kernels
  SPMD (Single Program, Multiple Data): multiple threads execute the same kernel program
  Accessed data and control flow within each thread can be differentiated by using a thread identification number tid, which has a different value in each thread
  All threads have access to the common global memory space
  The order of write accesses to the same memory object is not prescribed; a required write order must be enforced by means of explicit synchronization mechanisms

14 GPGPU with CUDA
  CUDA (Compute Unified Device Architecture) is NVIDIA's program development environment. It
    contains extensions to C/C++ and library routines implementing the GPGPU programming model
    provides the compiler nvcc for CUDA programs
    provides profiling and debugging tools
    provides numerical libraries (e.g. CUDA Math Library, cuBLAS)
    contains low-level drivers for NVIDIA's graphics cards

15 CUDA Fortran
  Developed jointly by PGI (Portland Group Inc.) and NVIDIA
  CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran.
  Available in PGI 2010 and later releases.
  CUDA Fortran is supported on Linux, macOS and Windows.

16 CUDA Extensions to C/C++
  Qualifiers for function declarations
    __global__ void kernel()                called from host, executed on device
    __device__ float function()             called from device, executed on device
    __host__ float function()               called from host, executed on host
    __device__ __host__ float function()    executed on device or on host, depending on where it is called from
  Qualifiers for variables specify the location of a variable in device memory
    __device__, __constant__, __shared__
  Defining the execution configuration for kernels to be executed on device
    kernel<<<grid, block,...>>>(arg1,arg2,...)
  Thread-local variables for thread identification
    threadIdx.x, blockDim.x, blockIdx.x, gridDim.x

17 CUDA functions for managing device memory
  Allocating size bytes at address A_d in device memory
    cudaMalloc(&A_d, size)
  Copying size bytes from address A_h in host memory to address A_d in device memory
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice)
  Copying size bytes from address A_d in device memory to address A_h in host memory
    cudaMemcpy(A_h, A_d, size, cudaMemcpyDeviceToHost)
  Copying size bytes from address A_d in device memory to address B_d in device memory
    cudaMemcpy(B_d, A_d, size, cudaMemcpyDeviceToDevice)
  Deallocating device memory
    cudaFree(A_d)

18 Unified Memory
  Introduced in CUDA 6
  [Figure: left, host (multicore CPU) and coprocessor (manycore GPU) with separate host memory and coprocessor memory; right, host and coprocessor sharing unified memory]

19 Managing CUDA Unified Memory
  Static allocation of UM:
    __device__ __managed__ int A[1000];
  Dynamic allocation of UM:
    cudaMallocManaged(&A, 1000 * sizeof(int));

20 CUDA Fortran Extension to Fortran90
  Qualifiers for subroutines and functions:
    attributes(host): to be executed on host, to be called from subprograms with host attribute
    attributes(global): only for subroutines; declares a kernel to be called from host and to be executed on device
    attributes(device): to be executed on device, to be called from subprograms with global or device attribute
  host is the default attribute

21 CUDA Fortran Extension to Fortran90
  Qualifiers for variables determine in which memory the space for the variables will be allocated:
    By default, variables declared in modules or host subprograms are allocated in the host main memory.
    device: variable is allocated in device main memory
    managed: variable migrates between host and device, depending on where it is accessed from (Unified Memory)
    constant: variable is allocated in device constant memory space
    shared: variable is allocated in the device shared memory
    texture: variable is allocated in device texture memory space; accesses to texture data go through a separate cache on the device
    pinned: variable is allocated in host page-locked memory; copies from page-locked memory to device memory are faster

22 CUDA Fortran Extension to Fortran90
  Predefined variables in device subprograms
    Thread-local variables for thread identification:
      threadidx%x, blockdim%x, blockidx%x, griddim%x
  Starting execution on device
    Execution configuration for kernels to be executed on device:
      call kernel<<<grid,block,...>>>(arg1,arg2,...)
    Kernel loop directive:
      !$cuf kernel do[(n)] <<< grid, block,... >>>
      Automatically generates device code for a nested loop with nesting > n


23 Thread Hierarchy
  The threads created by executing a kernel are organized in a two-level hierarchy: a 1-, 2- or 3-dim grid of 1-, 2- or 3-dim blocks of threads.
  Each thread is unambiguously numbered by four index vectors:
    gridDim.j, j=x,y,z
    blockIdx.j = 0,...,gridDim.j-1
    blockDim.j, j=x,y,z
    threadIdx.j = 0,...,blockDim.j-1

24 Calculating the Thread ID
  The grid contains gridsize blocks, each block containing blocksize threads, where
    gridsize = gridDim.x * gridDim.y * gridDim.z
    blocksize = blockDim.x * blockDim.y * blockDim.z
  Threads are numbered from 0 to blocksize*gridsize - 1:
    tid = id_thr + blocksize * id_blk
    id_thr = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y
    id_blk = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y
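The thread ID formulas can be checked on the host without a GPU. The sketch below mirrors them in plain C++; Dim3 is a hypothetical stand-in for CUDA's built-in index types, and the function names simply follow the symbols used on the slide.

```cpp
// Hypothetical stand-in for CUDA's built-in dim3 / uint3 types.
struct Dim3 {
    unsigned x, y, z;
};

// Number of elements spanned by a dimension vector
// (blocksize for blockDim, gridsize for gridDim).
inline unsigned volume(Dim3 d) {
    return d.x * d.y * d.z;
}

// Linear index of a thread within its block.
inline unsigned id_thr(Dim3 thread_idx, Dim3 block_dim) {
    return thread_idx.x
         + thread_idx.y * block_dim.x
         + thread_idx.z * block_dim.x * block_dim.y;
}

// Linear index of a block within the grid.
inline unsigned id_blk(Dim3 block_idx, Dim3 grid_dim) {
    return block_idx.x
         + block_idx.y * grid_dim.x
         + block_idx.z * grid_dim.x * grid_dim.y;
}

// Global thread number: tid = id_thr + blocksize * id_blk.
inline unsigned global_tid(Dim3 thread_idx, Dim3 block_dim,
                           Dim3 block_idx, Dim3 grid_dim) {
    return id_thr(thread_idx, block_dim)
         + volume(block_dim) * id_blk(block_idx, grid_dim);
}
```

With blocks of 4 x 2 x 1 threads in a grid of 3 x 1 x 1 blocks, the last thread (threadIdx = (3,1,0) in blockIdx = (2,0,0)) gets id_thr = 7, id_blk = 2 and tid = 7 + 8 * 2 = 23, which is blocksize*gridsize - 1.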

25 Thread-Parallel Execution of Kernels
  Host invokes a kernel execution on the device:
    kernel<<<nb,nt>>>(a,b,c);
  This kernel will be executed by nb blocks, each with nt threads, i.e. by a total of nb*nt threads.
  Device code specifies the stream of instructions that every thread of the execution configuration will execute:
    __global__ void kernel(float *a, float *b, float *c) {
      int tid = threadIdx.x + blockDim.x * blockIdx.x;
      c[tid] = a[tid] + b[tid];
      ...
    }
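To see how nb blocks of nt threads cover the elements of the vectors, the launch can be emulated sequentially on the host. The following C++ sketch is only an emulation under stated assumptions: the two loops stand in for the hardware's parallel execution, the index values that CUDA provides as built-in variables are passed as ordinary arguments, and the bounds check tid < n as well as all function names are additions of this sketch.

```cpp
#include <vector>

// Work of a single thread. In CUDA the three index values would come from
// the built-in variables threadIdx.x, blockDim.x and blockIdx.x.
inline void kernel_body(int thread_idx_x, int block_dim_x, int block_idx_x,
                        const float* a, const float* b, float* c, int n) {
    // The block size is the stride between the first threads of two blocks.
    int tid = thread_idx_x + block_dim_x * block_idx_x;
    if (tid < n) {  // guard added in this sketch for n not divisible by nt
        c[tid] = a[tid] + b[tid];
    }
}

// Sequential emulation of kernel<<<nb, nt>>>(a, b, c).
inline void launch_emulated(int nb, int nt,
                            const float* a, const float* b, float* c, int n) {
    for (int blk = 0; blk < nb; ++blk) {
        for (int thr = 0; thr < nt; ++thr) {
            kernel_body(thr, nt, blk, a, b, c, n);
        }
    }
}

// Adds two vectors of equal length using blocks of nt emulated threads.
inline std::vector<float> vector_add(const std::vector<float>& a,
                                     const std::vector<float>& b, int nt) {
    int n = static_cast<int>(a.size());
    int nb = (n + nt - 1) / nt;  // enough blocks to cover all n elements
    std::vector<float> c(a.size(), 0.0f);
    launch_emulated(nb, nt, a.data(), b.data(), c.data(), n);
    return c;
}
```

For 5 elements and nt = 2 the emulation starts nb = 3 blocks, i.e. 6 threads; the thread with tid = 5 finds no element to work on and is stopped by the guard.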

26 Hardware Hierarchy
  [Figure: the graphics device contains main memory, an L2 cache and the streaming multiprocessors SMX 1 ... SMX n; each SMX contains a control unit (schedule, dispatch), shared 32-bit registers, shared memory / L1 cache and CUDA cores (functional units SP / DP / SF / LS)]

27 Mapping of Threads to Hardware
  All threads of a thread-block are executed simultaneously on the same SMX
  Threads within a block can communicate via shared memory
  All threads of a thread-block must complete the kernel execution before the SMX resources used for this thread-block are freed to be used for the next block of threads
  Different thread-blocks are distributed to the same or to different SMXs according to the availability of resources on the SMXs.
  Execution order of threads in different thread-blocks is not prescribed
  No communication between threads in different thread-blocks

28 Maximal Number of Active Threads in an SMX
  Upper limit of the number of threads per SMX:
    2048 threads for GeForce GTX 980 and Tesla K40
  Upper limit of the number of threads per block:
    1024 threads per block for GeForce GTX 980 and Tesla K40
  Upper limit of the number of blocks per SMX:
    32 blocks for GeForce GTX 980, 16 blocks for Tesla K40

29 Actual Number of Active Threads in an SMX
  The actual number of active threads in an SMX can be smaller than the maximal number, depending on:
  1. Number of registers used per thread
    2^16 = 65536 registers = 262 kB per SMX for GeForce GTX 980 and Tesla K40
    ... B maximal register size per thread
    128 B register size per thread for the maximal number of 2048 threads in an SMX
  2. Shared memory used per thread
    96 kB shared memory per SMX for GeForce GTX 980, 48 kB for Tesla K40
  The number of registers and the amount of shared memory needed for a single thread in a thread-block of a given kernel are determined at compile time and can be queried with the compile flag --ptxas-options=-v
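The effect of the two resource limits can be estimated with a few integer divisions. The following host-side C++ sketch is a simplification: it assumes that registers and shared memory can be handed out to single threads, and it ignores the allocation per block and any hardware allocation granularity. The struct, constants and function names are illustrative.

```cpp
#include <algorithm>

// Illustrative per-SMX resources; names are hypothetical,
// values are the ones quoted on the slides.
struct SmxLimits {
    int max_threads;     // upper limit of active threads per SMX
    int register_bytes;  // register file size per SMX in bytes
    int shared_bytes;    // shared memory per SMX in bytes
};

const SmxLimits kTeslaK40Smx = {2048, 65536 * 4, 48 * 1024};
const SmxLimits kGtx980Smx   = {2048, 65536 * 4, 96 * 1024};

// Estimated number of active threads per SMX for a kernel that needs the
// given amount of register space and shared memory per thread.
inline int active_threads(const SmxLimits& smx,
                          int reg_bytes_per_thread,
                          int shared_bytes_per_thread) {
    int limit = smx.max_threads;
    if (reg_bytes_per_thread > 0) {
        limit = std::min(limit, smx.register_bytes / reg_bytes_per_thread);
    }
    if (shared_bytes_per_thread > 0) {
        limit = std::min(limit, smx.shared_bytes / shared_bytes_per_thread);
    }
    return limit;
}
```

A kernel using 128 B of registers and 24 B of shared memory per thread still reaches 2048 active threads on a Tesla K40 SMX in this model, while doubling the register use to 256 B per thread halves the estimate to 1024.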

30 SIMT Thread Scheduling
  SIMT: single instruction, multiple threads
  32 threads (called a warp of threads) are scheduled together, always executing the same instruction simultaneously on groups of 32 CUDA cores on the SMX
  A warp is active until all of its threads have completed the kernel
  The number of threads of active warps on an SMX can exceed its number of CUDA cores by far.
  Execution is switched to warps in which all threads are ready to execute the next instruction. This hides the latency of memory accesses.

31 Occupancy
  Occupancy = number of active warps per SMX / maximal number of warps per SMX
  High occupancy helps to hide memory latency
  Conditions for 100% occupancy:
    threads per block is a multiple of 32 (size of a warp)
    and threads per block is a divisor of 2048 (max. number of threads in an SMX)
    and threads per block >= 64 (GeForce GTX 980), >= 128 (Tesla K40)
    and register size per thread <= 128 B (GeForce GTX 980, Tesla K40)
    and shared memory per thread <= 48 B (GeForce GTX 980), <= 24 B (Tesla K40)
  Trade-off: high occupancy vs. large amount of fast SMX memory per thread
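How the block size alone influences occupancy can be sketched in a few lines of host-side C++. The sketch considers only the limits on threads per block, blocks per SMX and warps per SMX (2048 / 32 = 64); register and shared memory limits are deliberately left out, and a partially filled warp is counted as a full warp. The constants and the function name are illustrative.

```cpp
#include <algorithm>

const int kWarpSize           = 32;    // threads per warp
const int kMaxWarpsPerSmx     = 64;    // 2048 threads / 32 threads per warp
const int kMaxThreadsPerBlock = 1024;  // GeForce GTX 980 and Tesla K40

// Fraction of the 64 warp slots of an SMX that can be filled with blocks of
// the given size. max_blocks_per_smx is 32 for the GeForce GTX 980 and 16
// for the Tesla K40. Returns 0 for an invalid block size.
inline double occupancy(int threads_per_block, int max_blocks_per_smx) {
    if (threads_per_block < 1 || threads_per_block > kMaxThreadsPerBlock) {
        return 0.0;
    }
    // A block always occupies a whole number of warps.
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    // Resident blocks are limited by the block count and by the warp slots.
    int resident_blocks = std::min(max_blocks_per_smx,
                                   kMaxWarpsPerSmx / warps_per_block);
    return static_cast<double>(resident_blocks * warps_per_block)
         / kMaxWarpsPerSmx;
}
```

With 64 threads per block the GeForce GTX 980 limit of 32 blocks yields 64 active warps (occupancy 1.0), whereas the Tesla K40 limit of 16 blocks yields only 32 warps (occupancy 0.5). A block size of 768 threads, which does not divide 2048, leaves room for only 2 blocks and gives 48 / 64 = 0.75.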

32 Diverging Threads
  Conditional execution depending on the thread number
  Groups of threads in a warp with different execution paths will be scheduled separately
  This leads to longer execution times
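The cost of divergence can be illustrated with a very simple host-side model in C++. It assumes a single two-way branch and charges one pass for a warp whose 32 threads all take the same path and two passes for a warp whose threads take both paths; real hardware behaviour is more detailed than this. All names are illustrative.

```cpp
#include <functional>

const int kThreadsPerWarp = 32;

// Passes needed by one warp for a two-way branch: 1 if all of its threads
// take the same path, 2 if both paths occur within the warp.
inline int passes_for_warp(int warp_id,
                           const std::function<bool(int)>& takes_branch) {
    bool any_taken = false;
    bool any_not_taken = false;
    for (int lane = 0; lane < kThreadsPerWarp; ++lane) {
        int tid = warp_id * kThreadsPerWarp + lane;
        if (takes_branch(tid)) {
            any_taken = true;
        } else {
            any_not_taken = true;
        }
    }
    return (any_taken && any_not_taken) ? 2 : 1;
}

// Total number of passes over all warps of a launch.
inline int total_passes(int num_warps,
                        const std::function<bool(int)>& takes_branch) {
    int passes = 0;
    for (int w = 0; w < num_warps; ++w) {
        passes += passes_for_warp(w, takes_branch);
    }
    return passes;
}

// Condition alternating between neighbouring threads: every warp diverges.
inline bool even_thread(int tid) {
    return tid % 2 == 0;
}

// Condition alternating between whole warps: no warp diverges.
inline bool even_warp(int tid) {
    return (tid / kThreadsPerWarp) % 2 == 0;
}
```

For 4 warps, total_passes(4, even_thread) is 8, while total_passes(4, even_warp) is 4: in both cases half of the threads take each path, but only the second condition keeps all threads of a warp on the same path.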

33 Memory Organization, Hardware View
  [Figure: the host (CPU with cache and main memory) is connected to the graphics device (main memory, L2 cache); each streaming multiprocessor SMX 1 ... SMX n contains a control unit (schedule, dispatch), shared 32-bit registers, shared memory / L1 cache and constant + texture cache]

34 Device Memory, Software View
  [Figure: each thread has its own registers and local memory; the threads of a block share the block's shared memory; all blocks access global memory, constant memory and texture memory]

35 CUDA type qualifiers
