GPU Architecture and Programming. Ozcan Ozturk Bilkent University Ankara, Turkey

1 GPU Architecture and Programming Ozcan Ozturk, Bilkent University, Ankara, Turkey 1

2 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 2

3 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 3

4 What is Parallel Computing? Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor, or with less energy. Examples of parallel machines: a computer cluster, which contains multiple PCs with local memories connected by a high-speed network; a Symmetric Multi-Processor (SMP), which contains multiple processor chips connected to a single shared memory system; a Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip, also called a multi-core computer. The main motivation for parallel execution historically came from the desire for improved performance. Computation is the third pillar of scientific endeavor, in addition to theory and experimentation. But parallel execution has also now become a ubiquitous necessity due to power constraints, as we will see. 4

5 Why Parallel Computing? The real world is massively parallel 5

6 Why Parallel Computing? Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult problems in many areas of science and engineering: Atmosphere, Earth, Environment Physics - applied, nuclear, particle, condensed matter Bioscience, Biotechnology, Genetics Chemistry, Molecular Sciences Geology, Seismology Mechanical Engineering - from prosthetics to spacecraft Electrical Engineering, Circuit Design, Microelectronics Computer Science, Mathematics 6

7 Simulation: The Third Pillar of Science Traditional scientific and engineering paradigm: do theory or paper design, then perform experiments or build the system. Limitations: too difficult -- build large wind tunnels; too expensive -- build a throw-away passenger jet; too slow -- wait for climate or galactic evolution; too dangerous -- weapons, drug design, climate experimentation. Computational science paradigm: use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods. 7

8 Why Parallel Computing? Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. 8

9 Why Parallel Computing? For example: Databases, data mining Oil exploration Web search engines, web based business services Medical imaging and diagnosis Pharmaceutical design Management of national and multi-national corporations Financial and economic modeling Advanced graphics and virtual reality, particularly in the entertainment industry Networked video and multi-media technologies Collaborative work environments 9

10 Why Massively Parallel Processing? A quiet revolution and potential build-up. Calculation: TFLOPS on a many-core GPU vs. 100 GFLOPS on a multi-core CPU; memory bandwidth: roughly 10x in favor of the GPU (Courtesy: John Owens). A GPU in every PC means massive volume and potential impact. 10

11 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 11

12 NVIDIA Tesla Family 12

13 NVIDIA Tesla K80 13

14 GeForce 8800 (2007) 16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU. [Block diagram: host, input assembler, thread execution manager; an array of SMs, each with a parallel data cache and texture unit; load/store units; global memory.] 14

15 Fermi (2010) ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP), 230 GB/s DRAM bandwidth 15

16 Future Apps Reflect a Concurrent World Exciting applications in the future mass computing market have traditionally been considered supercomputing applications: molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These super-apps represent and model the physical, concurrent world. Various granularities of parallelism exist, but the programming model must not hinder parallel implementation, and data delivery needs careful management. 16

17 Stretching Traditional Architectures Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing. The game is to grow mainstream architectures out, or domain-specific architectures in; CUDA is the latter. [Figure: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles in between.] 17

18 Speedup of Applications GeForce 8800 GTX vs. 2.2 GHz Opteron: a 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads; 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized. [Bar chart: GPU speedup relative to CPU, per kernel and per application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD.] 18

19 Classic PC architecture The northbridge connects three components that must communicate at high speed: CPU, DRAM, and video. Video also needs first-class access to DRAM; previous NVIDIA cards were connected to AGP, with up to 2 GB/s transfers. The southbridge serves as a concentrator for slower I/O devices. [Figure: core logic chipset with CPU, northbridge, and southbridge.] 19

20 (Original) PCI Bus Specification Connected to the southbridge. Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/second peak. Upstream bandwidth remains slow for devices (256 MB/s peak). Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge. 20

21 PCI as Memory Mapped I/O PCI device registers are mapped into the CPU's physical address space and accessed through loads/stores (kernel mode). Addresses are assigned to the PCI devices at boot time, and all devices listen for their addresses. 21

22 PCI Express (PCIe) Switched, point-to-point connection: each card has a dedicated link to the central switch, with no bus arbitration. Packet-switched messages form virtual channels, and packets can be prioritized for QoS (e.g., real-time video streaming). 22

23 PCIe PC Architecture PCIe forms the interconnect backbone: the northbridge and southbridge are both PCIe switches. Some southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards, and some PCIe cards are PCI cards with a PCI-PCIe bridge. Source: Jon Stokes, PCI Express: An Overview. 23

24 Intel Skylake Processor (2015) 24

25 Intel Skylake Processor (2015) 25

26 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 26

27 CUDA Overview CUDA (Compute Unified Device Architecture): programming model basic concepts and data types, CUDA application programming interface basics, and simple examples to illustrate basic concepts and functionalities. Performance features will be covered later. 27

28 CUDA C with no shader limitations! Integrated host+device application C program: serial or modestly parallel parts in host C code, highly parallel parts in device SPMD kernel C code. Serial code (host)... Parallel kernel (device): KernelA<<< nBlk, nTid >>>(args);... Serial code (host)... Parallel kernel (device): KernelB<<< nBlk, nTid >>>(args);... 28
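
A minimal sketch of such an integrated host+device program (not from the slides; the kernel, names, and sizes such as saxpy and n are illustrative): serial host code allocates and initializes the data, a __global__ kernel does the highly parallel part, and the <<<nBlk, nTid>>> launch ties them together.

  // Hypothetical example: compute y[i] = a*x[i] + y[i] on the GPU.
  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void saxpy(int n, float a, float *x, float *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
      if (i < n) y[i] = a * x[i] + y[i];
  }

  int main(void) {
      int n = 1 << 20;
      size_t bytes = n * sizeof(float);
      float *h_x = (float*)malloc(bytes), *h_y = (float*)malloc(bytes);
      for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }   // serial host code

      float *d_x, *d_y;
      cudaMalloc((void**)&d_x, bytes);
      cudaMalloc((void**)&d_y, bytes);
      cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

      int nTid = 256;                      // threads per block
      int nBlk = (n + nTid - 1) / nTid;    // blocks in the grid
      saxpy<<<nBlk, nTid>>>(n, 2.0f, d_x, d_y);   // highly parallel device part

      cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
      printf("y[0] = %f\n", h_y[0]);       // expect 4.0
      cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
      return 0;
  }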

29 CUDA Devices and Threads A compute device is a coprocessor to the CPU or host, has its own DRAM (device memory), runs many threads in parallel, and is typically a GPU but can also be another type of parallel processing device. Data-parallel portions of an application are expressed as device kernels which run on many threads. Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead; the GPU needs 1000s of threads for full efficiency, while a multi-core CPU needs only a few. 29

30 G80 CUDA mode A Device Example Processors execute computing threads; a new operating mode/HW interface for computing. [Block diagram: host, input assembler, thread execution manager; an array of SMs, each with a parallel data cache and texture unit; load/store units; global memory.] 30

31 Extended C Type qualifiers: global, device, shared, local, constant (written with leading and trailing double underscores, e.g. __global__). Keywords: threadIdx, blockIdx. Intrinsics: __syncthreads(). Runtime API: memory, symbol, and execution management; function launch. Example: __device__ float filter[N]; __global__ void convolve (float *image) { __shared__ float region[M]; ... region[threadIdx.x] = image[i]; __syncthreads(); ... image[j] = result; } // Allocate GPU memory void *myimage; cudaMalloc(&myimage, bytes); // 100 blocks, 10 threads per block convolve<<<100, 10>>> (myimage); 31

32 NVCC Compiler's Role: Code/Compile Device Source file mycode.cu: int main_data; __shared__ int sdata; main() { } __host__ void hfunc() { int hdata; gfunc<<<g, b, m>>>(); } __global__ void gfunc() { int gdata; } __device__ void dfunc() { int ddata; } The host-only part (int main_data; main() { } __host__ void hfunc() with its kernel launch) is compiled by a native compiler such as gcc, icc, or cc; the device-only part (__shared__ sdata; __global__ void gfunc(); __device__ void dfunc()) is compiled by the nvcc compiler, with the kernel launch as the interface between the two. 32

33 Extended C Compilation flow: the integrated source (foo.cu) goes to cudacc (EDG C/C++ frontend, Open64 global optimizer), which emits GPU assembly (foo.s) and CPU host code (foo.cpp); the GPU assembly goes through OCG to G80 SASS (foo.sass), and the host code is compiled with gcc / cl. Source: Mark Murphy, NVIDIA's Experience with Open64. 33

34 Arrays of Parallel Threads A CUDA kernel is executed by an array of threads (SPMD): all threads run the same code, and each thread has an ID that it uses to compute memory addresses and make control decisions: float x = input[threadID]; float y = func(x); output[threadID] = y; 34

35 Thread Blocks: Scalable Cooperation Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate. [Figure: Thread Block 0 through Thread Block N-1, each running float x = input[threadID]; float y = func(x); output[threadID] = y; on its own portion of the data.] 35

36 Block IDs and Thread IDs Each thread uses IDs to decide what data to work on: the block ID is 1D or 2D, and the thread ID is 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g., image processing and solving PDEs on volumes. [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; a grid contains blocks such as Block (0,0) through Block (1,1), and each block contains threads indexed (x, y, z). Courtesy: NVIDIA.] 36

37 CUDA Memory Model Overview Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads, but accesses have long latency. We will focus on global memory for now; constant and texture memory will come later. [Figure: host and a grid of blocks; each block has its own shared memory and per-thread registers, and all blocks access global memory.] 37

38 CUDA API Highlights: Easy and Lightweight The API is an extension to the ANSI C programming language, giving a low learning curve. The hardware is designed to enable a lightweight runtime and driver, giving high performance. 38

39 CUDA Device Memory Allocation cudaMalloc() allocates an object in device global memory and requires two parameters: the address of a pointer to the allocated object and the size of the allocated object. cudaFree() frees an object from device global memory, given a pointer to the freed object. [Figure: host, grid of blocks with shared memory and per-thread registers, global memory.] 39

40 CUDA Device Memory Allocation (cont.) Code example: allocate a 64 * 64 single precision float array and attach the allocated storage to Md (the d suffix is often used to indicate a device data structure). TILE_WIDTH = 64; float* Md; int size = TILE_WIDTH * TILE_WIDTH * sizeof(float); cudaMalloc((void**)&Md, size); cudaFree(Md); 40

41 CUDA Host-Device Data Transfer cudaMemcpy() performs memory data transfer and requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device). Asynchronous transfer is also possible. [Figure: host, grid of blocks with shared memory and per-thread registers, global memory.] 41

42 CUDA Host-Device Data Transfer Code example: transfer a 64 * 64 single precision float array; M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants. cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost); 42
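
The allocation and transfer calls above return status codes; a hedged sketch of checking them (the CHECK macro is made up for illustration, while cudaError_t and cudaGetErrorString are standard CUDA runtime API):

  // Illustrative error-checking wrapper; requires <stdio.h> and <stdlib.h>.
  #define CHECK(call) do { \
      cudaError_t err = (call); \
      if (err != cudaSuccess) { \
          fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                  cudaGetErrorString(err), __FILE__, __LINE__); \
          exit(EXIT_FAILURE); \
      } \
  } while (0)

  CHECK(cudaMalloc((void**)&Md, size));
  CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));
  CHECK(cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost));
  CHECK(cudaFree(Md));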

43 CUDA Function Declarations __device__ float DeviceFunc() executes on the device and is only callable from the device. __global__ void KernelFunc() executes on the device and is only callable from the host. __host__ float HostFunc() executes on the host and is only callable from the host. __global__ defines a kernel function and must return void; __device__ and __host__ can be used together. 43

44 CUDA Function Declarations (cont.) __device__ functions cannot have their address taken. For functions executed on the device: no recursion, no static variable declarations inside the function, and no variable number of arguments. 44
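
A small hedged illustration of combining the qualifiers (the function name is made up): a function marked both __host__ and __device__ is compiled for both sides, so the same helper can be called from host code and from a kernel.

  // Compiled for both CPU and GPU; callable from host code and from kernels.
  __host__ __device__ float clampf(float x, float lo, float hi) {
      return x < lo ? lo : (x > hi ? hi : x);
  }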

45 Calling a Kernel Function Thread Creation A kernel function must be called with an execution configuration: __global__ void KernelFunc(...); dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); 45
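
Inside a kernel launched with that configuration, the built-in variables reflect it; a hedged sketch (the parameter and kernel body are illustrative only):

  __global__ void KernelFunc(float *data) {
      // With DimGrid(100, 50) and DimBlock(4, 8, 8):
      //   gridDim.x == 100, gridDim.y == 50
      //   blockDim.x == 4, blockDim.y == 8, blockDim.z == 8
      int tid = threadIdx.z * blockDim.y * blockDim.x
              + threadIdx.y * blockDim.x + threadIdx.x;   // 0..255 within the block
      int bid = blockIdx.y * gridDim.x + blockIdx.x;      // 0..4999 within the grid
      data[bid * (blockDim.x * blockDim.y * blockDim.z) + tid] = 0.0f;
  }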

46 Simple working code example Goal for this example: really simple but illustrative of key concepts, fits in one file with a simple compile command, can be absorbed during lecture. What does it do? Scan the elements of an array of numbers (any of 0 to 9) and count how many times 6 appears. Array of 16 elements, each thread examines 4 elements, 1 block in the grid, 1 grid. threadIdx.x = 0 examines in_array elements 0, 4, 8, 12; threadIdx.x = 1 examines elements 1, 5, 9, 13; threadIdx.x = 2 examines elements 2, 6, 10, 14; threadIdx.x = 3 examines elements 3, 7, 11, 15. This is known as a cyclic data distribution. 46

47 CUDA Pseudo-Code MAIN PROGRAM: Initialization Allocate memory on host for input and output Assign random numbers to input array Call host function Calculate final output from per-thread output Print result GLOBAL FUNCTION: Thread scans subset of array elements Call device function to compare with 6 Compute local result HOST FUNCTION: Allocate memory on device for copy of input and output Copy input to device Set up grid/block Call global function Synchronize after completion Copy device output to host DEVICE FUNCTION: Compare current element and 6 Return 1 if same, else 0 47

48 Main Program: Preliminaries MAIN PROGRAM: Initialization Allocate memory on host for input and output Assign random numbers to input array Call host function Calculate final output from per-thread output Print result #include <stdio.h> #define SIZE 16 #define BLOCKSIZE 4 int main(int argc, char **argv) { int *in_array, *out_array; } 48

49 Main Program: Invoke Global Function MAIN PROGRAM: Initialization (OMIT) Allocate memory on host for input and output Assign random numbers to input array Call host function Calculate final output from per-thread output Print result #include <stdio.h> #define SIZE 16 #define BLOCKSIZE 4 __host__ void outer_compute (int *in_arr, int *out_arr); int main(int argc, char **argv) { int *in_array, *out_array; /* initialization */ outer_compute(in_array, out_array); } 49

50 Main Program: Calculate Output & Print Result MAIN PROGRAM: Initialization (OMIT) Allocate memory on host for input and output Assign random numbers to input array Call host function Calculate final output from per-thread output Print result #include <stdio.h> #define SIZE 16 #define BLOCKSIZE 4 __host__ void outer_compute (int *in_arr, int *out_arr); int main(int argc, char **argv) { int *in_array, *out_array; int sum = 0; /* initialization */ outer_compute(in_array, out_array); for (int i=0; i<BLOCKSIZE; i++) { sum += out_array[i]; } printf("Result = %d\n", sum); } 50
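
The /* initialization */ step that the slides omit covers the first two outline items (allocate host memory for input and output, assign random numbers to the input array); one possible hedged sketch:

  /* initialization (omitted on the slide): allocate host arrays and
     fill the input with random digits 0..9; requires <stdlib.h> */
  in_array  = (int *) malloc(SIZE * sizeof(int));
  out_array = (int *) malloc(BLOCKSIZE * sizeof(int));
  srand(0);
  for (int i = 0; i < SIZE; i++)
      in_array[i] = rand() % 10;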

51 Host Function: Preliminaries & Allocation HOST FUNCTION: Allocate memory on device for copy of input and output Copy input to device Set up grid/block Call global function Synchronize after completion Copy device output to host __host__ void outer_compute (int *h_in_array, int *h_out_array) { int *d_in_array, *d_out_array; cudaMalloc((void **) &d_in_array, SIZE*sizeof(int)); cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int)); } 51

52 Host Function: Copy Data To/From Host HOST FUNCTION: Allocate memory on device for copy of input and output Copy input to device Set up grid/block Call global function Synchronize after completion Copy device output to host __host__ void outer_compute (int *h_in_array, int *h_out_array) { int *d_in_array, *d_out_array; cudaMalloc((void **) &d_in_array, SIZE*sizeof(int)); cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int)); cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice); /* ...do computation... */ cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost); } 52

53 Host Function: Setup & Call Global Function HOST FUNCTION: Allocate memory on device for copy of input and output Copy input to device Set up grid/block Call global function Synchronize after completion Copy device output to host __host__ void outer_compute (int *h_in_array, int *h_out_array) { int *d_in_array, *d_out_array; cudaMalloc((void **) &d_in_array, SIZE*sizeof(int)); cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int)); cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice); compute<<<1, BLOCKSIZE>>>(d_in_array, d_out_array); cudaThreadSynchronize(); cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost); } 53

54 Global Function GLOBAL FUNCTION: Thread scans subset of array elements Call device function to compare with 6 Compute local result __global__ void compute(int *d_in, int *d_out) { d_out[threadIdx.x] = 0; for (int i=0; i<SIZE/BLOCKSIZE; i++) { int val = d_in[i*BLOCKSIZE + threadIdx.x]; d_out[threadIdx.x] += compare(val, 6); } } 54

55 Device Function DEVICE FUNCTION: Compare current element and 6 Return 1 if same, else 0 __device__ int compare(int a, int b) { if (a == b) return 1; return 0; } 55

56 A Simple Running Example Matrix Multiplication A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local, register usage Thread ID usage Memory data transfer API between host and device Assume square matrix for simplicity 56

57 Programming Model: Square Matrix Multiplication Example P = M * N, all of size WIDTH x WIDTH. Without tiling: one thread calculates one element of P, and M and N are loaded WIDTH times from global memory. [Figure: matrices M, N, and P, each WIDTH x WIDTH.] 57

58 Memory Layout of a Matrix in C [Figure: a 4x4 matrix M is stored in row-major order as the linear sequence M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3.] 58
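
With this row-major layout, element M(row, col) of a Width x Width matrix lives at offset row*Width + col; a tiny hedged illustration (the helper function is made up):

  // Row-major access: element (row, col) of a Width x Width matrix stored as a 1D array.
  float get(const float *M, int Width, int row, int col) {
      return M[row * Width + col];   // e.g. M(2,1) is at index 2*Width + 1
  }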

59 Step 1: Matrix Multiplication A Simple Host Version // Matrix multiplication on the (CPU) host in double precision void MatrixMulOnHost(float* M, float* N, float* P, int Width) { for (int i = 0; i < Width; ++i) for (int j = 0; j < Width; ++j) { double sum = 0; for (int k = 0; k < Width; ++k) { double a = M[i * Width + k]; double b = N[k * Width + j]; sum += a * b; } P[i * Width + j] = sum; } } [Figure: row i of M and column j of N combine to produce element (i, j) of P.] 59

60 Step 2: Input Matrix Data Transfer (Host-side Code) void MatrixMulOnDevice(float* M, float* N, float* P, int Width) { int size = Width * Width * sizeof(float); float *Md, *Nd, *Pd; 1. // Allocate and load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device cudaMalloc(&Pd, size); 60

61 Step 3: Output Matrix Data Transfer (Host-side Code) 2. // Kernel invocation code to be shown later 3. // Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md); cudaFree(Nd); cudaFree(Pd); } 61

62 Step 4: Kernel Function // Matrix multiplication kernel per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; 62

63 Step 4: Kernel Function (cont.) for (int k = 0; k < Width; ++k) { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue += Melement * Nelement; } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; } [Figure: thread (tx, ty) walks along row ty of Md and column tx of Nd to produce one element of Pd.] 63

64 Step 5: Kernel Invocation (Host-side Code) // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width); 64

65 Only One Thread Block Used One block of threads computes matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block. [Figure: Grid 1 contains Block 1; thread (2, 2) computes one element of Pd from Md and Nd, each WIDTH wide.] 65

66 Compiling a CUDA Program The C/C++ CUDA application is compiled by NVCC into virtual PTX code plus CPU code; a PTX-to-target compiler then produces physical target code (e.g., for G80). Parallel Thread eXecution (PTX) is a virtual machine and ISA: it defines the programming model and the execution resources and state. Example: the source float4 me = gx[gtid]; me.x += me.y * me.z; becomes the PTX ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1; 66

67 Linking Any executable with CUDA code requires two dynamic libraries: the CUDA runtime library (cudart) and the CUDA core library (cuda). 67

68 Debugging Using the Device Emulation Mode An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime; no device or CUDA driver is needed, and each device thread is emulated with a host thread. Running in device emulation mode, one can: use host native debug support (breakpoints, inspection, etc.), access any device-specific data from host code and vice-versa, call any host function from device code (e.g. printf) and vice-versa, and detect deadlock situations caused by improper usage of __syncthreads(). 68

69 Block Usage in Matrix Multiplication [Figure: with TILE_WIDTH = 2, the 4x4 result matrix P (elements P0,0 through P3,3) is divided among Block(0,0), Block(1,0), Block(0,1), and Block(1,1), each covering a 2x2 tile.] 69

70 Block Usage in Matrix Multiplication [Figure: the 4x4 output Pd (elements Pd0,0 through Pd3,3) together with the rows of Md and columns of Nd accessed by one 2x2 block of threads.] 70

71 Revised Matrix Multiplication Kernel using Multiple Blocks __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) { // Calculate the row index of the Pd element and M int Row = blockIdx.y*TILE_WIDTH + threadIdx.y; // Calculate the column index of Pd and N int Col = blockIdx.x*TILE_WIDTH + threadIdx.x; float Pvalue = 0; // each thread computes one element of the block sub-matrix for (int k = 0; k < Width; ++k) Pvalue += Md[Row*Width+k] * Nd[k*Width+Col]; Pd[Row*Width+Col] = Pvalue; } 71
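
The revised kernel expects a 2D grid of TILE_WIDTH x TILE_WIDTH blocks covering the whole matrix; a hedged host-side launch sketch, assuming Width is a multiple of TILE_WIDTH as in the slides:

  // Each block computes a TILE_WIDTH x TILE_WIDTH tile of Pd.
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);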

72 CUDA Thread Block All threads in a block execute the same kernel program (SPMD). The programmer declares the block: a block size of 1 to 512 concurrent threads, a block shape of 1D, 2D, or 3D, and block dimensions in threads. Threads have thread id numbers within the block, and the thread program uses the thread id to select work and address shared data. Threads in the same block share data and synchronize while doing their share of the work; threads in different blocks cannot cooperate. Each block can execute in any order relative to other blocks! [Figure: a CUDA thread block with thread ids 0 through m, all running the same thread program. Courtesy: John Nickolls, NVIDIA.] 72

73 Transparent Scalability Hardware is free to assign blocks to any processor at any time, so a kernel scales across any number of parallel processors. [Figure: the same kernel grid of Block 0 through Block 7 runs over time on a 2-SM device as four waves of two blocks, and on a 4-SM device as two waves of four blocks.] Each block can execute in any order relative to other blocks. 73

74 G80 Example: Executing Thread Blocks [Figure: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, holding threads t0, t1, t2, ..., tm grouped into blocks.] Threads are assigned to Streaming Multiprocessors in block granularity: up to 8 blocks to each SM as resources allow, and an SM in G80 can take up to 768 threads. That could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc. Threads run concurrently: the SM maintains thread/block id numbers and manages/schedules thread execution. 74

75 G80 Example: Thread Scheduling Each block is executed as 32-thread warps; this is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps. [Figure: warps of threads t0 through t31 from Block 1 and Block 2 feeding a Streaming Multiprocessor with instruction L1, instruction fetch/dispatch, shared memory, SPs, and SFUs.] 75

76 G80 Example: Thread Scheduling (Cont.) The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its operands ready for consumption are eligible for execution, and eligible warps are selected for execution on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. [Figure: a timeline of instructions issued from warps of thread blocks TB1, TB2, and TB3 (TB = Thread Block, W = Warp), with stalled warps giving way to others that are ready.] 76

77 Source Code Matrix Multiplication 77

78 __syncthreads() __global__ void globFunction(int *arr, int N) { __shared__ int local_array[THREADS_PER_BLOCK]; // local block memory cache int idx = blockIdx.x * blockDim.x + threadIdx.x; // ...calculate results local_array[threadIdx.x] = results; // synchronize the local threads writing to the local memory cache __syncthreads(); // read the results of another thread in the current thread int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK]; // write back the value to global memory arr[idx] = val; } 78

79 cudaThreadSynchronize() cudaThreadSynchronize() is a host function: it waits for all previous asynchronous operations (i.e. kernel calls, async memory copies) to complete. __syncthreads() is a device function: it acts as a thread barrier, and all threads in a block must reach the barrier before any can continue execution. It is only of use when you need to avoid race conditions when threads in a block access shared memory. 79
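
A hedged sketch of the two kinds of synchronization side by side (the kernel and d_arr, an assumed device array of at least 256 ints, are illustrative; cudaThreadSynchronize() is the call used on these slides, and newer CUDA releases expose the same behavior as cudaDeviceSynchronize()):

  // Device-side barrier: every thread in the block reaches this point
  // before any thread reads what the others wrote to shared memory.
  __global__ void stage(int *arr) {
      __shared__ int tmp[256];
      tmp[threadIdx.x] = arr[threadIdx.x];
      __syncthreads();
      arr[threadIdx.x] = tmp[255 - threadIdx.x];
  }

  // Host-side wait: kernel launches are asynchronous, so block the CPU
  // until the kernel (and any earlier async copies) have finished.
  stage<<<1, 256>>>(d_arr);
  cudaThreadSynchronize();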

80 Use CUDA best practices guide! 80

81 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 81

82 OpenCL - Open Computing Language A standard based upon C for portable parallel applications, covering task parallel and data parallel applications, with a focus on multi-platform support (multiple CPUs, GPUs, ...). Development was initiated by Apple; OpenCL is developed by the Khronos group, which also manages OpenGL. OpenCL was released with Mac OS 10.6 (Snow Leopard). It has many similarities with CUDA, and an implementation is available for NVIDIA GPUs. 82

83 OpenCL Programming Model Uses a data parallel programming model, similar to CUDA. The host program launches kernel routines as in CUDA, but allows for just-in-time compilation during host execution. OpenCL work items correspond to CUDA threads, and OpenCL work groups correspond to CUDA thread blocks. Work items in the same work group can be synchronized with a barrier, as in CUDA. 83

84 Structure of OpenCL main program Get information about platform and devices available on system Select devices to use Create an OpenCL command queue Create memory buffers on device Transfer data from host to device memory buffers Create kernel program object Build (compile) kernel in-line (or load precompiled binary) Create OpenCL kernel object Set kernel arguments Execute kernel Read kernel memory and copy to host memory. 84
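
A compressed, hedged C sketch of those steps using the standard OpenCL host API (error checking and resource releases are omitted; the function name run_vecadd, the buffer names, and the assumption of a single GPU device are all illustrative):

  #include <CL/cl.h>

  /* Sketch: run an element-wise "vecadd" kernel over N ints.
     src is the kernel source string passed in by the caller. */
  void run_vecadd(const char *src, const int *hA, const int *hB, int *hC, size_t N) {
      size_t bytes = N * sizeof(int);

      /* Get platform and device information, select a device */
      cl_platform_id platform;  cl_device_id device;
      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

      /* Create a context and a command queue */
      cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
      cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

      /* Create device memory buffers and transfer input data to the device */
      cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
      cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
      cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
      clEnqueueWriteBuffer(queue, dA, CL_TRUE, 0, bytes, hA, 0, NULL, NULL);
      clEnqueueWriteBuffer(queue, dB, CL_TRUE, 0, bytes, hB, 0, NULL, NULL);

      /* Create the program object, build the kernel in-line, create the kernel object */
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
      clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
      cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

      /* Set kernel arguments, execute over an NDRange of N work items, read back */
      clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
      clSetKernelArg(k, 1, sizeof(cl_mem), &dB);
      clSetKernelArg(k, 2, sizeof(cl_mem), &dC);
      clEnqueueNDRangeKernel(queue, k, 1, NULL, &N, NULL, 0, NULL, NULL);
      clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, bytes, hC, 0, NULL, NULL);
  }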

85 Includes #include <CL/cl.h> //OpenCL header for C 85

86 OpenCL to CUDA Model Mapping OpenCL parallelism concept -> CUDA equivalent: kernel -> kernel; host program -> host program; NDRange (index space) -> grid; work item -> thread; work group -> block. 86

87 Overview of OpenCL Execution Model 87

88 Mapping of OpenCL to CUDA OpenCL API call -> explanation -> CUDA equivalent: get_global_id(0) -> global index of the work item in the x dimension -> blockIdx.x*blockDim.x+threadIdx.x; get_local_id(0) -> local index of the work item within the work group in the x dimension -> threadIdx.x; get_global_size(0) -> size of the NDRange in the x dimension -> gridDim.x*blockDim.x; get_local_size(0) -> size of each work group in the x dimension -> blockDim.x. 88
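
A tiny hedged illustration of the first row of that mapping, with one kernel in OpenCL C and its CUDA C equivalent (both are made up for illustration and assume the NDRange/grid exactly covers the array):

  /* OpenCL work item */
  __kernel void scale_cl(__global float *d, float a) {
      int i = get_global_id(0);                       /* global index, x dimension */
      d[i] *= a;
  }

  /* Equivalent CUDA thread */
  __global__ void scale_cu(float *d, float a) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* same global index */
      d[i] *= a;
  }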

89 Conceptual OpenCL Device Architecture 89

90 Mapping OpenCL Memory Types to CUDA OpenCL memory type -> CUDA equivalent: global memory -> global memory; constant memory -> constant memory; local memory -> shared memory; private memory -> local memory. 90

91 OpenCL Context for Device Management 91

92 Sample Kernel // Perform an element-wise addition of A and B and store in C // N work-items will be created to execute this kernel. __kernel void vecadd(__global int *A, __global int *B, __global int *C) { // Get the work-item's unique ID int idx = get_global_id(0); // Add the corresponding locations of // A and B, and store the result in C. C[idx] = A[idx] + B[idx]; } 92

93 Overview Massively Parallel Processing CPU/GPU Architecture CUDA Programming OpenCL Programming Other Accelerators 93

94 Other Accelerators Fine Grain Reconfigurable Accelerators (FPGAs) Coarse Grain Reconfigurable Accelerators (CGRAs) Application Specific Accelerators Accelerator for graph parallel applications Accelerator for analytics General Purpose Accelerators GPUs DSPs Intel MIC 94

95 Thank You! 95


Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

Overview: Graphics Processing Units

Overview: Graphics Processing Units advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC

More information

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission) CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

GPGPU/CUDA/C Workshop 2012

GPGPU/CUDA/C Workshop 2012 GPGPU/CUDA/C Workshop 2012 Day-2: Intro to CUDA/C Programming Presenter(s): Abu Asaduzzaman Chok Yip Wichita State University July 11, 2012 GPGPU/CUDA/C Workshop 2012 Outline Review: Day-1 Brief history

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks. Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)

More information

High Performance Computing

High Performance Computing GPGPU A Current Trend in High Performance Computing Chokchai Box Leangsuksun, PhD SWEPCO Endowed Professor*, Computer Science Director, High Performance Computing Initiative Louisiana Tech University box@latech.edu

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information