Introduction to GPGPUs

1 Introduction to GPGPUs
Sandra Wienke, M.Sc.
PPCES 2012, Rechen- und Kommunikationszentrum (RZ; Center for Computing and Communication)

2 Links
General
- GPGPU Community
- GPU Computing Community
CUDA
- Nvidia CUDA Zone (Toolkit, Profiler, SDK, documentation, ...)
- PGI's CUDA Fortran
- PGI's CUDA-x86

3 Links
OpenCL
- Khronos Group (Specification, Reference Pages, ...)
- OpenCL + Nvidia
- OpenCL + AMD
- OpenCL + Intel
PGI Accelerator
- Accelerator Model
- User Forum

4 Books
- David Kirk and Wen-Mei W. Hwu: Programming Massively Parallel Processors: A Hands-on Approach (2010)
- Jason Sanders and Edward Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming (2010)

5 Books
- A. Munshi, B. Gaster, T. Mattson, J. Fung, D. Ginsburg: OpenCL Programming Guide (2011)
- B. Gaster, D. Kaeli, L. Howes, P. Mistry, D. Schaa: Heterogeneous Computing with OpenCL (2011)

6 Contents
- Motivation
- GPU Architecture (Fermi)
- Programming Model
- Execution Model
- Memory Model
- Summary
- Tools & Libs

7 Overview
GPGPUs = General Purpose Graphics Processing Units
History: a very brief overview
- 1980s-1990s: Development is mainly driven by games
  - Fixed-function 3D graphics pipeline
  - Graphics APIs like OpenGL and DirectX become popular
- Since 2001: Programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX)
  - Researchers take notice of the performance growth of GPUs
  - Tasks must be cast into native graphics operations
- Since 2006: Vertex/pixel shaders are replaced by a single processor unit
  - Support for the C programming language, synchronization, ...
  - General purpose

8 Known parallelization on CPU level
- Shared-memory programming
  - OpenMP: parallel regions by pragmas (threads)
- Distributed-memory programming
  - MPI: message passing among processors
- Performance metrics
  - FLOPS: floating point operations per second
  - Memory bandwidth/throughput [GB/s]
  - Latency [cycles]
  - Speedup: S = T_serial / T_parallel
- Trend towards multicore architectures
  - Clock frequency at physical limit
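The speedup metric from this slide can be made concrete with a small calculation. The following is only a sketch: the function names are hypothetical, and parallel efficiency (speedup per worker) is a commonly used companion metric that does not appear on the slide.

```c
#include <assert.h>

/* Speedup S = T_serial / T_parallel (both given in the same time unit). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup per worker (1.0 would be ideal scaling). */
double efficiency(double t_serial, double t_parallel, int workers)
{
    assert(workers > 0);
    return speedup(t_serial, t_parallel) / workers;
}
```

For instance, a run that takes 12 s serially and 3 s in parallel has a speedup of 4; if 8 workers were used, the efficiency is 0.5.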

9 Comparison CPU vs. GPU (figure: NVIDIA Corporation 2010)
- CPU: 8 cores
- GPU: massively parallel processors, manycore architecture
- GPU threads
  - Thousands (only a few on the CPU)
  - Lightweight, little creation overhead
  - Fast switching


10 Comparison CPU vs. GPU (figure: NVIDIA Corporation 2010)
Similar number of transistors but different design
- CPU
  - Optimized for low latencies
  - Huge caches
  - Control logic for out-of-order and speculative execution
- GPU
  - Optimized for data-parallel throughput
  - Architecture tolerant of memory latency
  - More transistors dedicated to computation

11 Comparison CPU vs. GPU
Considerations for GPU parallelization
- Hardware-related programming
  - Knowledge of the hardware is essential
  - Code restructuring is usually needed (kernel, data management, data transfer, tuning)
- Very small shared memory
- Global synchronization is not possible within one kernel
- Number of suitable problems is limited
Why GPGPUs?

12 Motivation for GPUs
- Performance: a high rate of Flops is achievable!
  - Little overhead (threads), 1000s of threads
  - (Massive) data parallelism in the application: independent data, uniform operations
- Heterogeneous computer architecture (CPU + GPU)
  - Asynchronous computations, overlapping
  - OpenMP/MPI + GPU parallelization
- Relatively low cost + power consumption ("GreenIT")
  - Compared to computers/clusters having a similar performance
- GPU available in almost every computer

13 Some (programmable) GPU types
NVIDIA
- GeForce: 8800GTX, GT220, GTX 470, ...
- Quadro: 6000, FX 4800, NVS 450, ...
- Tesla: C870, C1060, C2050, ...
AMD
- Radeon: HD 3870, HD 5850, ...
- FirePro: 3D V3800, 3D V9800, ...
- FireStream: 9350, ...
Here we focus on NVIDIA GPUs. However, the fundamentals also apply to GPUs of other vendors.

14 Example SAXPY
SAXPY = Single-precision real Alpha X Plus Y: y = α · x + y

    #include <stdlib.h>

    void saxpycpu(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float* x;
        float* y;
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(n * sizeof(float));

        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0 * i - 1.0;
        }

        // Invoke serial SAXPY kernel
        saxpycpu(n, a, x, y);

        free(x);
        free(y);
        return 0;
    }

15 Example SAXPY
Outlook: SAXPY for GPUs (CUDA C)

    #include <stdlib.h>

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // Pointers to CPU memory

        // Allocate and initialize h_x and h_y (same values as in the CPU version)
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) {
            h_x[i] = i;
            h_y[i] = 5.0 * i - 1.0;
        }

        float *d_x, *d_y;   // Pointers to GPU memory
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // Invoke parallel SAXPY kernel
        // (n is a multiple of 128 here, so the integer division is exact)
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_x);
        cudaFree(d_y);
        free(h_x);
        free(h_y);
        return 0;
    }
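The grid size in this example is computed as n divided by the block size, which is exact for n = 10240 and 128 threads per block. For an n that is not a multiple of the block size, the division would truncate and the last elements would never be processed. A common remedy is to round the block count up and rely on the `if (i < n)` check in the kernel. The plain C sketch below shows that arithmetic; `blocks_for` and `idle_threads` are hypothetical helper names, not part of CUDA.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n.
   Integer ceiling division; assumes n + threads_per_block does not overflow. */
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Threads of the last block that fall outside [0, n) and therefore have to
   be masked by the bounds check inside the kernel. */
unsigned int idle_threads(unsigned int n, unsigned int threads_per_block)
{
    return blocks_for(n, threads_per_block) * threads_per_block - n;
}
```

With 128 threads per block, n = 10240 needs 80 blocks and leaves no idle threads, while n = 10241 needs 81 blocks, of which 127 threads in the last block do nothing.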


17 GPU architecture: Fermi (figure: NVIDIA Corporation 2010)
- 3 billion transistors
- 448-512 cores / streaming processors (SP)
  - Including a floating point & integer unit
- 14-16 streaming multiprocessors (SM, MP)
  - Each comprises 32 cores
- Memory hierarchy
- Peak performance
  - SP: 1.03 TFlops
  - DP: 515 GFlops
- ECC support
- Compute capability: 2.0
  - Defines features, e.g. double precision capability, memory access pattern
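The peak performance figures on this slide can be reproduced with a back-of-the-envelope calculation: cores times clock frequency times floating point operations per cycle. The sketch below assumes 448 cores at a 1.15 GHz clock (the published Tesla C2050 figures; the clock is not stated on the slide), one fused multiply-add (2 flops) per core and cycle in single precision, and half that rate in double precision. The function name is hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFlops = cores * clock [GHz] * flops per core and cycle.
   Real kernels reach only a fraction of this value. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}
```

Under these assumptions, `peak_gflops(448, 1.15, 2.0)` gives about 1030 GFlops (1.03 TFlops) for single precision and `peak_gflops(448, 1.15, 1.0)` about 515 GFlops for double precision, matching the values on the slide.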


20 Processing flow (figure: NVIDIA Corporation 2010)
Host-directed execution model; CPU and GPU are connected via the PCI bus.
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory


22 GPGPU paradigms
- CUDA C/C++ (NVIDIA): architecture + programming language, NVIDIA GPUs
- CUDA Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs
- OpenCL C (Khronos Group): open standard, portable, CPU/GPU/...
- PGI Accelerator Model C/Fortran (PGI): programming using pragmas (syntax similar to OpenMP), NVIDIA GPUs
- OpenACC C/Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming, industry standard published in Nov (NVIDIA GPUs)

23 Paradigm CUDA
CUDA = Compute Unified Device Architecture
- CUDA C/C++ (NVIDIA)
  - Based on industry-standard C/C++
  - Extensions, e.g. built-in variables, function/variable type qualifiers
  - Restrictions, e.g. recursion in kernel functions
  - Driver API (low level), Runtime API (higher level)
- CUDA Fortran (PGI)
  - Analogous to NVIDIA's CUDA C, with some additional features
  - Only available with the PGI compilers
- Brief timeline
  - Nov 06: Introduction of CUDA, G80 GPU architecture
  - Jun 07: CUDA Toolkit 1.0
  - Jun 08: GT200 GPU architecture
  - March 10: Fermi GPU architecture
  - Jan 12: CUDA Toolkit 4.1


24 Programming model (figure: NVIDIA Corporation 2010)
Definitions
- Host: CPU, executes functions
- Device: usually GPU, executes kernels
- The parallel portion of the application is executed on the device as a kernel
  - A kernel is executed as an array of threads
  - All threads execute the same code
  - Threads are identified by IDs, which are used to select input/output data and to make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

25 Programming model
- Threads are grouped into blocks
- Blocks are grouped into a grid

26 Programming model
- A kernel is executed as a grid of blocks of threads
- Dimensions of blocks and grids: up to 3
- ID tuples for threads and blocks
Figure: the host launches Kernel 1 on the device as a 1D grid (Block 0 to Block 7) and Kernel 2 as a 2D grid (Block (0,0) to Block (1,3)); Block (1,3) is expanded into its threads, Thread (0,0,0) to Thread (2,1,0).

27 Programming model (CUDA C)
Setup
- Set up the GPU (e.g. driver, environment variables)
- Download + install the CUDA Toolkit (cf. Links section)
Compiling CUDA C

    module load cuda              # on our cluster
    nvcc -arch=sm_20 saxpy.cu

- nvcc: NVIDIA's compiler for C/C++ GPU code
- -arch=sm_20: sets compute capability 2.0, which enables certain architecture features, e.g. double precision floating point operations

28 Programming model (CUDA C)
Kernel code
- Function qualifiers: __global__, __device__, __host__
- Built-in variables:
  - gridDim: contains dimensions of grid (type dim3)
  - blockDim: contains dimensions of block (type dim3)
  - blockIdx: contains block index within grid (type uint3)
  - threadIdx: contains thread index within block (type uint3)
- Compute unique IDs, e.g. global 1D index: gIdx = blockIdx.x * blockDim.x + threadIdx.x
Kernel usage
- Compiling with nvcc (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): func<<<dimGrid, dimBlock>>>(parameter)
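To see how the global 1D index covers the data, the launch can be emulated serially in plain C: one loop over the blocks of the grid and one over the threads of each block. This is only an illustration of the index mapping, not of how a GPU schedules work; the function names are hypothetical, and `block_idx`, `block_dim` and `thread_idx` stand in for the x components of the CUDA built-in variables.

```c
#include <assert.h>

/* Body of the SAXPY kernel for one (block, thread) pair, written in plain C. */
static void saxpy_one_thread(int block_idx, int block_dim, int thread_idx,
                             int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;  /* global 1D index */
    if (i < n) {                                 /* mask surplus threads */
        y[i] = a * x[i] + y[i];
    }
}

/* Serial emulation of a launch with grid_dim blocks of block_dim threads:
   visits every (block, thread) pair exactly once. On a GPU these iterations
   run in parallel and in no guaranteed order. */
void saxpy_emulated(int grid_dim, int block_dim,
                    int n, float a, const float *x, float *y)
{
    assert(grid_dim > 0 && block_dim > 0);
    for (int b = 0; b < grid_dim; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            saxpy_one_thread(b, block_dim, t, n, a, x, y);
        }
    }
}
```

Calling `saxpy_emulated(3, 4, 10, 2.0f, x, y)` emulates 3 blocks of 4 threads working on 10 elements: each element is updated exactly once, and the two surplus threads of the last block are masked by the bounds check.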

29 Programming model (CUDA Fortran)
Setup
- Set up the GPU (e.g. driver)
- Set up the PGI compiler
Compiling CUDA Fortran

    module switch intel pgi[/version]     # on our cluster
    pgf90 -Mcuda[=cc20,4.0] saxpy.cuf

- -Mcuda: enables CUDA Fortran extensions
  - cc20: generates code for devices with compute capability 2.0
  - 4.0: uses CUDA Toolkit 4.0
- .cuf: free-format CUDA Fortran program
- .CUF: program is processed by the preprocessor before being compiled

30 Programming model (CUDA Fortran)
Kernel code
- Subroutine/function qualifiers: attributes(global), attributes(device), attributes(host)
- Built-in variables:
  - griddim: contains dimensions of grid (type(dim3))
  - blockdim: contains dimensions of block (type(dim3))
  - blockidx: contains block index within grid (type(dim3))
  - threadidx: contains thread index within block (type(dim3))
- Compute unique IDs, e.g. global 1D index: gidx = (blockidx%x-1) * blockdim%x + threadidx%x
Kernel usage
- Compiling with pgf90 or pgfortran (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): call func<<<dimgrid, dimblock>>>(parameter)
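The `-1` in the Fortran index formula comes from the different index bases: in CUDA Fortran, block and thread indices start at 1, whereas in CUDA C they start at 0. The plain C sketch below places both formulas side by side; the function names are hypothetical.

```c
#include <assert.h>

/* CUDA C convention: block and thread indices start at 0,
   and the resulting global index is 0-based. */
int global_index_c(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && thread_idx >= 0);
    return block_idx * block_dim + thread_idx;
}

/* CUDA Fortran convention: block and thread indices start at 1,
   and the resulting global index is 1-based. */
int global_index_fortran(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 1 && thread_idx >= 1);
    return (block_idx - 1) * block_dim + thread_idx;
}
```

For the same logical thread, the Fortran result is always the C result plus one, which matches Fortran's default 1-based array indexing: the first thread of the first block yields 0 in C and 1 in Fortran.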

31 Example SAXPY: Kernel usage

C/C++:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        [..]
        // Invoke parallel SAXPY kernel
        // (n is a multiple of 128 here, so the integer division is exact)
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
        [..]
    }

Fortran:

    module saxpy
      use cudafor
    contains
      attributes(global) subroutine saxpy_parallel(n, a, x, y)
        integer, intent(in), value :: n
        real, intent(in), value :: a        ! a is a real scale factor, not an integer
        real, intent(in), device :: x(n)
        real, intent(inout), device :: y(n)
        integer :: i
        i = blockdim%x * (blockidx%x - 1) + threadidx%x
        if (i <= n) then
          y(i) = a*x(i) + y(i)
        end if
      end subroutine saxpy_parallel
    end module saxpy

    program main
      use saxpy
      [..]
      ! Invoke parallel SAXPY kernel
      threadsperblock = dim3(128,1,1)
      blockspergrid = dim3(n/threadsperblock%x,1,1)
      call saxpy_parallel<<<blockspergrid, threadsperblock>>>(n, a, d_x, d_y)
      [..]
    end program main

32 Programming model
Why blocks?
- Cooperation of threads within a block is possible
  - Synchronization (barrier)
  - Sharing data/results using shared memory
- Scalability
  - Fast communication between n threads is not feasible when n is large
  - But: blocks are executed independently
  - Blocks can be distributed across an arbitrary number of multiprocessors
Number of blocks (with the number of threads fixed)?
- Few blocks: many threads can communicate
- Many blocks: good scaling

33-35 Programming model: scalability (figures: NVIDIA Corporation)
- G84 (very old architecture)
- G80 (medium-old architecture)
- GT200 (last architecture); the figure marks some multiprocessors as idle


37 Execution model
- Host-directed execution model
  - Main program runs on the host
  - Certain code regions run on the device
  - Execution configuration: <<<blocksPerGrid, threadsPerBlock>>>
- Warps
  - Threads execute in groups of 32
  - Threads in a warp share the same program counter
  - Single instruction, multiple threads (SIMT)
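The grouping into warps of 32 threads reduces to integer arithmetic on a thread's linear index within its block. The sketch below is plain C with hypothetical function names; it assumes that warps are formed from consecutive thread indices, as described in NVIDIA's CUDA documentation.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as stated on the slide */

/* Warps needed for one block; a partially filled last warp still
   occupies a full warp slot. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to, given its linear index within the block. */
int warp_of_thread(int thread_in_block)
{
    assert(thread_in_block >= 0);
    return thread_in_block / WARP_SIZE;
}

/* Position of the thread inside its warp (often called the lane). */
int lane_of_thread(int thread_in_block)
{
    assert(thread_in_block >= 0);
    return thread_in_block % WARP_SIZE;
}
```

A block of 128 threads consists of 4 full warps. A block of 100 threads also needs 4 warps, but the last one is only partly filled, which is one reason why block sizes are usually chosen as multiples of 32.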


38 Execution model
- Thread -> Core: each thread is executed by a core
- Block -> Multiprocessor: each block is executed on a multiprocessor; several concurrent blocks can reside on one MP, depending on memory resources
- Grid (Kernel) -> Device: each kernel is executed on a device


40 Memory model
- Host + device memory = separate entities
  - No coherence between host + device
  - Manual data synchronization/transfer
- Host
  - (De-)allocates device memory (global, constant, texture)
  - Triggers data transfer
- Device
  - Works on device memory (hierarchy)


41 Memory model
- Thread
  - Registers
  - Local memory
- Block
  - Shared memory, on-chip
  - Tesla C1060: 16 KB
  - Fermi: 16 KB shared + 48 KB L1 or 48 KB shared + 16 KB L1
- Grid/application
  - Constant memory: 64 KB; read-only; off-chip; cached
  - Global memory: up to 6 GB; off-chip
  - Fermi: L2 cache
Figure: the device holds multiprocessors 1 to n; each multiprocessor has cores 1 to m with their own registers, plus shared memory and L1; all multiprocessors share L2 and global/constant memory; host memory sits on the host side.
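The shared memory sizes above determine how many blocks can be resident on one multiprocessor at the same time. The plain C sketch below considers shared memory as the only limiting resource, which is a simplification: on real hardware, registers and the maximum numbers of threads and blocks per multiprocessor impose further limits. The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on concurrently resident blocks per multiprocessor when
   shared memory is the only limiting resource. */
int max_blocks_by_shared_mem(int shared_bytes_per_mp, int shared_bytes_per_block)
{
    assert(shared_bytes_per_mp > 0);
    assert(shared_bytes_per_block > 0);
    return shared_bytes_per_mp / shared_bytes_per_block;
}
```

A kernel that uses 12 KB of shared memory per block fits 4 blocks per multiprocessor in Fermi's 48 KB shared memory configuration, but only 1 block in the 16 KB configuration.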

42 Memory model (CUDA C)
- Variable type qualifiers: __device__, __shared__, __constant__
- Memory management
  - cudaMalloc(pointerToGpuMem, size)
  - cudaFree(pointerToGpuMem)
- Memory transfer (synchronous)
  - cudaMemcpy(dest, src, size, direction)
  - direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

43 Memory model (CUDA Fortran)
- Variable type qualifiers: attributes(device), attributes(shared), attributes(constant), attributes(pinned), attributes(value)
- Memory management
  - cudaMalloc(pointerToGpuMem, size)
  - cudaFree(pointerToGpuMem)
- Memory transfer (synchronous)
  - By assignment statements
    - var_dev = var_host (CPU to GPU transfer)
    - var_host = var_dev (GPU to CPU transfer)
    - var1_dev = var2_dev (copy on GPU)
  - By runtime routine
    - cudaMemcpy(dest, src, size, direction)
    - direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

44 Example SAXPY: Memory

C/C++:

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // host pointers
        // Allocate and initialize h_x and h_y
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        [..]

        float *d_x, *d_y;   // device pointers
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // Invoke parallel SAXPY kernel
        [..]

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_x);
        cudaFree(d_y);
        free(h_x);
        free(h_y);
        return 0;
    }

Fortran:

    program main
      use saxpy
      real, allocatable :: h_x(:), h_y(:)            ! host arrays
      real, allocatable, device :: d_x(:), d_y(:)    ! device arrays
      allocate(h_x(n), h_y(n), d_x(n), d_y(n))
      ! Initialize h_x and h_y
      d_x = h_x      ! CPU to GPU transfer
      d_y = h_y
      ! Invoke parallel SAXPY kernel
      h_y = d_y      ! GPU to CPU transfer
      deallocate(h_x, h_y, d_x, d_y)
    end program main

    attributes(global) subroutine saxpy_parallel(n, a, x, y)
      integer, intent(in), value :: n
      real, intent(in), value :: a
      real, intent(in), device :: x(n)
      real, intent(inout), device :: y(n)
      [..]
    end subroutine saxpy_parallel


46 Summary
3 steps for a basic program with CUDA

    #include <stdlib.h>

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // Pointers to CPU memory

        // Allocate and initialize h_x and h_y (same values as in the CPU version)
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) {
            h_x[i] = i;
            h_y[i] = 5.0 * i - 1.0;
        }

        // 1. Allocate data on GPU + transfer data from CPU to GPU
        float *d_x, *d_y;   // Pointers to GPU memory
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Launch kernel
        // (n is a multiple of 128 here, so the integer division is exact)
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

        // 3. Transfer data to CPU + free data on GPU
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);

        free(h_x);
        free(h_y);
        return 0;
    }

47 Summary
- Processing flow
  - Copy data from host to device
  - Execute GPU code (kernel) in parallel
  - Copy data from device to host
- Kernel executes a grid of blocks of threads
- Memory hierarchy on GPU
  - Thread: registers, local
  - Block: shared
  - Grid: global
- Use GPUs properly!
  - Launch many, many threads
  - Uniform operations on data (selected by thread ID)
  - Use all available resources (GPU + CPU)


49 CUDA Tools
Debugger
- cuda-gdb: extended gdb (usable via ddd), NVIDIA (free of charge)
- cuda-memcheck: discovers memory access errors, NVIDIA (free of charge)
- TotalView: GUI (memory) debugger, Linux, RogueWave
- DDT: GUI (memory) debugger, Linux, Allinea
- Parallel Nsight: Windows, integrated in Visual Studio, NVIDIA (free of charge)
Profiling/tracing
- Visual Profiler: performance analysis with HW counters, NVIDIA (free of charge)
- VampirTrace: performance monitoring (tracing), TU Dresden

50 CUDA Libraries
NVIDIA
- cuBLAS: dense linear algebra (subset of BLAS)
- cuSPARSE: sparse linear algebra
- cuFFT: discrete Fourier transforms
- cuRAND: random number generation
- NPP: signal and image processing
- Thrust: STL/Boost-style template library (e.g. scan, sort, reduce, transform)
- math.h: basics, exponentials, trigonometry, ... (e.g. sin, ceil, round)
Third party
- CULA: dense/sparse linear algebra (subset of LAPACK)
- MAGMA: dense linear algebra (subset of BLAS, LAPACK)
- IMSL: Fortran numerical library, utilizes cuBLAS
- NAG: numeric libraries (e.g. RNGs)
- libjacket: math, signal processing, image processing, statistics
Open source
- cudpp: data-parallel primitives (e.g. scan, sort, reduction)
- CUSP: sparse linear algebra, graph computations
- OpenCurrent: partial differential equations
