ECE 8823A GPU Architectures
Module 3: CUDA Execution Model - I

Objective
- A more detailed look at kernel execution
- Data-to-thread assignment
- To understand the organization and scheduling of threads
- Resource assignment at the block level
- Scheduling at the warp level
- Basics of SIMT execution
Reading Assignment
- Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 4
- Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 6.3
- Reference: CUDA Programming Guide
  http://docs.nvidia.com/cuda/cuda-c-programmingguide/#abstract

A Multi-dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1 (2x2 thread blocks) and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is expanded to show its threads, labeled Thread (0,0,0) through Thread (1,0,3).]
Built-In Variables
- 1D-3D grid of thread blocks
- Built-in: gridDim
  gridDim.x, gridDim.y, gridDim.z
- Built-in: blockDim
  blockDim.x, blockDim.y, blockDim.z
- Example
  dim3 dimGrid(32,2,2);            // 3D grid of thread blocks
  dim3 dimGrid(2,2,1);             // 2D grid of thread blocks
  dim3 dimGrid(32,1,1);            // 1D grid of thread blocks
  dim3 dimGrid(ceil(n/256.0),1,1); // 1D grid sized to cover n elements
  my_kernel<<<dimGrid, dimBlock>>>(..);

Built-In Variables (2)
- 1D-3D arrangement of threads in a thread block
- Built-in: blockIdx
  blockIdx.x, blockIdx.y, blockIdx.z
- Built-in: threadIdx
  threadIdx.x, threadIdx.y, threadIdx.z
- All blocks have the same thread configuration
- Example
  dim3 dimBlock(4,2,2);   // 3D arrangement of threads in a block
  dim3 dimBlock(2,2,1);   // 2D arrangement of threads in a block
  dim3 dimBlock(32,1,1);  // 1D arrangement of threads in a block
  my_kernel<<<dimGrid, dimBlock>>>(..);
Built-In Variables (3)
- blockIdx and threadIdx
  blockIdx.x, blockIdx.y, blockIdx.z
  threadIdx.x, threadIdx.y, threadIdx.z
- Initialized by the runtime through a kernel call
- Range fixed by the compute capability and target devices
- You can query the device (later)

2D Examples
Processing a Picture with a 2D Grid
- 16 x 16 blocks covering a 76 x 62 pixel picture

Row-Major Layout in C/C++
[Figure: a 4x4 matrix stored linearly in a flat array of offsets 0..15; element (2,1) maps to offset Row*Width+Col = 2*4+1 = 9.]
Source Code of the Picture Kernel

__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Calculate the row # of the d_Pin and d_Pout element to process
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column # of the d_Pin and d_Pout element to process
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    // each thread computes one element of d_Pout if in range
    if ((Row < m) && (Col < n)) {
        d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
    }
}

Approach Summary
- Storage layout of data
- Assign unique ID
- Map IDs to data (access)

Figure 4.5: Covering a 76 x 62 picture with 16 x 16 blocks.
A Simple Running Example: Matrix Multiplication
- A simple illustration of the basic features of memory and thread management in CUDA programs
  - Thread index usage
  - Memory layout
  - Register usage
- Assume square matrices for simplicity
- Leave shared memory usage until later

Square Matrix-Matrix Multiplication
- P = M * N, each of size WIDTH x WIDTH
- Each thread calculates one element of P
- Each row of M is loaded WIDTH times from global memory
- Each column of N is loaded WIDTH times from global memory
Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Kernel Version: Functional Description

for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

- Which thread is at coordinate (Row, Col)?
- Threads self-allocate: each thread derives its own (Row, Col) from blockDim and its indices

Kernel Function - A Small Example
- Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
- Each block has (TILE_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
- WIDTH = 4; TILE_WIDTH = 2
  Each block has 2*2 = 4 threads
  WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks
[Figure: the 4x4 result P partitioned into four 2x2 tiles, computed by Block(0,0), Block(0,1), Block(1,0), and Block(1,1).]
David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13
A Slightly Bigger Example
- WIDTH = 8; TILE_WIDTH = 2
- Each block has 2*2 = 4 threads
- WIDTH/TILE_WIDTH = 4; use 4*4 = 16 blocks
[Figure: the 8x8 result P partitioned into sixteen 2x2 tiles.]

A Slightly Bigger Example (cont.)
- WIDTH = 8; TILE_WIDTH = 4
- Each block has 4*4 = 16 threads
- WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks
[Figure: the 8x8 result P partitioned into four 4x4 tiles.]
Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Kernel Function

// Matrix multiplication kernel - per thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
- Col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + threadIdx.x  (Col = 0, 1)
- Row = blockIdx.y * blockDim.y + threadIdx.y = 0 * 2 + threadIdx.y  (Row = 0, 1)
[Figure: Block (0,0) reads rows 0-1 of M and columns 0-1 of N to produce P(0,0), P(0,1), P(1,0), P(1,1).]

Work for Block (0,1)
- Col = 1 * 2 + threadIdx.x  (Col = 2, 3)
- Row = 0 * 2 + threadIdx.y  (Row = 0, 1)
[Figure: Block (0,1) reads rows 0-1 of M and columns 2-3 of N to produce P(0,2), P(0,3), P(1,2), P(1,3).]
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and d_M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }
}

QUESTIONS?

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011
ECE408/CS483, University of Illinois, Urbana-Champaign