Programação CUDA: um caminho verde para a computação de alto desempenho

(CUDA Programming: A Green Path to High-Performance Computing)
Attilio Cucchieri, IFSC USP, lattice/

THIRD LECTURE

OUTLINE OF THE THIRD LECTURE
CUDA architecture (II)
Two-dimensional grids
Global memory vs. shared memory
Data prefetching

CUDA Architecture (II)

Function Type Qualifiers
__global__: functions with the __global__ qualifier are executed on the device but are callable from the host only; these functions must return void.
__device__: functions with the __device__ qualifier are executed on the device and are callable from the device only.
__host__: functions with the __host__ qualifier are executed on the host and are callable from the host only (this is the default).
The __host__ and __device__ qualifiers can be combined.

Variable Type Qualifiers
__device__: these variables are stored in global memory, are accessible from all the threads within a grid and have the lifetime of the application; when using cudaMalloc, the __device__ qualifier is implied.
__constant__: these variables are stored in constant memory, are accessible from all the threads within a grid and have the lifetime of the application.
__shared__: these variables are stored in shared memory, are accessible by all threads in a block and have the lifetime of the block.
Unqualified: each thread has read/write access to its own unqualified variables; scalars and built-in vector types are stored in registers, while arrays and spilled registers go to local memory; they have the lifetime of the thread.

GPU Drawbacks
Fixed amount of memory, e.g. 4-6 GB.
The CUDA architecture is not very flexible: it is usually not useful for problems that are not compute-intensive.
Codes must be rewritten (easier with OpenACC).
The device (GPU) is attached to the host (CPU) via the relatively slow PCIe bus (16 GB/s bi-directional with x).
Not all C/C++ features can be used inside a kernel (but new features are usually included in new versions, e.g. recursion is supported by NVIDIA hardware with compute capability 2.0 or higher).

The Reduction Step Revisited (I)
In the evaluation of the scalar product we used the reduction step (EXAMPLES/dot1.cu):

    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        if (cacheIndex % (2*i) == 0)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }

The problem with this scheme is that each active warp has only a few active threads. For example, with 256 threads per block, in the fourth iteration the partial sums are saved in cache[16i], with i ∈ [0,15], by the threads with threadIdx.x equal to 16i. Thus, each warp will have only two active threads. One possible solution is to consider (EXAMPLES2/dot2.cu):

    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        __syncthreads();
    }
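The activity pattern of the two indexing schemes above can be checked on the host. The following is a minimal C++ sketch, not device code: the names StepActivity and reductionStepActivity are hypothetical, and the warp size of 32 is assumed. It counts, for one reduction step with stride i, how many threads perform an addition and over how many warps they are spread.

```cpp
#include <cassert>
#include <set>

// Host-side model (not device code): which threads of a block do useful work
// in one reduction step with stride i?
//   scheme 1 (dot1.cu): thread t is active if t % (2*i) == 0
//   scheme 2 (dot2.cu): thread t is active if 2*i*t < blockDim
struct StepActivity {
    int activeThreads;  // threads that perform an addition
    int activeWarps;    // warps containing at least one such thread
};

StepActivity reductionStepActivity(int scheme, int blockDim, int i) {
    assert(scheme == 1 || scheme == 2);
    const int warpSize = 32;  // assumed warp size
    std::set<int> warps;
    int threads = 0;
    for (int t = 0; t < blockDim; ++t) {
        const bool active = (scheme == 1) ? (t % (2 * i) == 0)
                                          : (2 * i * t < blockDim);
        if (active) {
            ++threads;
            warps.insert(t / warpSize);
        }
    }
    return {threads, static_cast<int>(warps.size())};
}
```

For 256 threads per block and the fourth iteration (i = 8), both schemes use 16 threads, but the first spreads them over 8 warps and the second packs them into a single warp.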

The Reduction Step Revisited (II)
With the new reduction scheme, for the same example considered before, in the fourth iteration the partial sums are still saved in cache[16i], with i ∈ [0,15], but this time the active threads have threadIdx.x equal to i, i.e. they belong to the same warp! Another possible improvement can be introduced by modifying the last kernel in the following way (EXAMPLES2/dot3.cu):

    int i = 1;
    while (i < blockDim.x/64) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        i *= 2;
        __syncthreads();
    }

Note: the for loop has become a while loop, which runs only while i < blockDim.x/64 and stops once i reaches blockDim.x/64. How do we include the rest of the loop?

The Reduction Step Revisited (III)
Here is the rest of the loop (you need to disable racedetector in configure.ocelot):

    if (cacheIndex < 32) {
        int ic = cacheIndex*i;
        cache[ic] += cache[ic + 32*i];
        cache[ic] += cache[ic + 16*i];
        cache[ic] += cache[ic + 8*i];
        cache[ic] += cache[ic + 4*i];
        cache[ic] += cache[ic + 2*i];
        cache[ic] += cache[ic + i];
    }

Note: there is no explicit synchronization among the threads! Since cacheIndex < 32, all the active threads in the above code belong to the same warp. Thus, synchronization is automatic and enforced by the hardware! (This relies on the warp executing in lock-step; on GPUs with independent thread scheduling, i.e. Volta and later, an explicit __syncwarp() between the steps is required.) The above code is rather complicated because the data are scattered: for example, in the fourth iteration, the partial sums are stored in cache[16i], with i ∈ [0,15]. This could also slow down memory access.

The Reduction Step Revisited (IV)
In order to solve this problem, we show below the final version of the kernel for the reduction step (EXAMPLES2/dot4.cu):

    int i = blockDim.x/2;
    while (i > 32) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex < 32) {
        cache[cacheIndex] += cache[cacheIndex + 32];
        cache[cacheIndex] += cache[cacheIndex + 16];
        cache[cacheIndex] += cache[cacheIndex + 8];
        cache[cacheIndex] += cache[cacheIndex + 4];
        cache[cacheIndex] += cache[cacheIndex + 2];
        cache[cacheIndex] += cache[cacheIndex + 1];
    }
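The addressing pattern of this final scheme can be followed step by step on the host. The sketch below is a serial C++ model under stated assumptions, not the kernel itself: the function name reduceSequential is hypothetical, the vector size plays the role of blockDim.x (a power of two), and only the threads whose result is actually used (cacheIndex < i) are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the dot4.cu reduction for one block.
// cache.size() plays the role of blockDim.x and must be a power of two.
// Only the threads whose result is used (cacheIndex < i) are modelled.
float reduceSequential(std::vector<float> cache) {
    const std::size_t n = cache.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t i = n / 2; i > 0; i /= 2) {
        // On the device these additions run in parallel, one per thread;
        // thread c reads cache[c + i], which no modelled thread writes in this step.
        for (std::size_t c = 0; c < i; ++c)
            cache[c] += cache[c + i];
    }
    return cache[0];  // the block's partial sum ends up in the first element
}
```

At every step the active threads are 0, ..., i-1 and the partial sums stay in the contiguous range cache[0], ..., cache[i-1], which is what removes the scattering of the previous versions.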

The Reduction Step Revisited (V)
[Figure: the new reduction scheme, which avoids data scattering. Source: cs.uaf.edu]

Performance
Let us compare the running time of the four different codes presented here for the scalar product (codes dot1-mod.cu, dot2.cu, dot3.cu, dot4.cu in the directory EXAMPLES2), using the Tesla S1070 GPU. Here, we used vectors with elements.
First method and GPU timing using events (dot1-mod.cu): ms.
Increasing the number of active threads in a warp (dot2.cu): ms.
Without explicit synchronization for threads in a warp (dot3.cu): ms.
Saving results in nearby memory positions (dot4.cu): ms.

Two-dimensional grids

2D Grids and Blocks
Up to now we have considered only one-dimensional grids and blocks.
Since the predefined variables threadIdx and blockIdx are usually used to implement data parallelism in an easy way, it is clear that, when dealing with matrices, a two-dimensional structure can be useful.
In order to illustrate the use of two-dimensional grids and blocks, we consider the product of two (square N × N) matrices (in C/C++ notation): $C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}$.

Matrix-Matrix Product (I)
A simple serial code can be written as (N^3 iterations, EXAMPLES2/matrix-matrix0.cu):

    #define N 1024
    int main( void ) {
        unsigned int isize = N;
        // allocate host memory for matrices A, B and C
        // initialize host memory
        // capture the start time
        for (int i=0; i < isize; i++)
            for (int j=0; j < isize; j++) {
                float Csub = 0;
                for (int k=0; k < isize; k++)
                    Csub += A[i*isize+k]*B[k*isize+j];
                C[i*isize+j] = Csub;
            }
        // capture the stop time
        // clean up memory
        return 0;
    }

Matrix-Matrix Product (II)
Recall that, in C/C++, for a matrix

    A[i][k]  // i = row index, k = column index

the column index k runs faster (row-major convention, i.e. the elements of each row are stored contiguously). (Source: intel)

Matrix-Matrix Product (III)
Thus, when using only one index, we write

    A[i*N+k]  // instead of A[i][k], with i = row index, k = column index

and

    B[k*N+j]  // instead of B[k][j], with k = row index, j = column index

For the CUDA code we have to choose a grid/block structure (EXAMPLES2/matrix-matrix1.cu):

    #define BLOCK_SIZE 16
    #define N
    unsigned int isize = N;
    ...
    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/threads.y);
    ...

Matrix-Matrix Product (IV)
Recall that threadIdx.x can be considered the local x coordinate of a given thread (inside its own block) and, similarly, threadIdx.y can be considered the local y coordinate of the same thread. Likewise, blockIdx.x and blockIdx.y are the x and y coordinates of a given block inside the grid. Then, the global x and y coordinates of a thread inside the grid can be written as

    // Global x index for threads
    int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
    // Global y index for threads
    int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;

Matrix-Matrix Product (V)
And here is the kernel:

    __global__ void matrixmul(float *C, float *A, float *B, int w) {
        int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
        int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;
        float Csub = 0;
        for (int k=0; k < w; k++)
            Csub += A[iy*w+k] * B[k*w+ix];
        C[iy*w+ix] = Csub;  // C[iy][ix]
    }

Since each thread in the grid has its specific (and unique) set of global coordinates (ix,iy), each thread evaluates one element of the matrix C, i.e. C[iy][ix] = C[iy*w+ix].
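The mapping from (blockIdx, threadIdx) to one element of C can be tried out without a GPU. The following is a minimal C++ sketch in which four nested loops stand in for the grid and the blocks; the function name matrixMulGridModel and its blockSize parameter (standing for BLOCK_SIZE) are assumptions made for illustration, and on the device all these iterations run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the kernel above: the four outer loops stand in for the
// grid of blocks and the threads of each block, which the GPU runs in parallel.
// blockSize corresponds to BLOCK_SIZE; w must be a multiple of it.
std::vector<float> matrixMulGridModel(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      int w, int blockSize) {
    assert(w % blockSize == 0);
    std::vector<float> C(static_cast<std::size_t>(w) * w, 0.0f);
    const int gridDim = w / blockSize;  // same in x and y (square grid)
    for (int by = 0; by < gridDim; ++by)
        for (int bx = 0; bx < gridDim; ++bx)
            for (int ty = 0; ty < blockSize; ++ty)
                for (int tx = 0; tx < blockSize; ++tx) {
                    // body of the kernel for one thread
                    const int ix = tx + blockSize * bx;  // global x index
                    const int iy = ty + blockSize * by;  // global y index
                    float Csub = 0;
                    for (int k = 0; k < w; ++k)
                        Csub += A[iy * w + k] * B[k * w + ix];
                    C[iy * w + ix] = Csub;
                }
    return C;
}
```

The result does not depend on blockSize, as long as it divides w: the block size only changes how the same set of (ix,iy) pairs is distributed among blocks.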

Results
Using CUDA and two-dimensional grids we can easily parallelize the matrix-matrix product! What about performance?
On my desktop the CPU serial code needed ms for a matrix.
On the Tesla S1070 GPU the parallel code needed ms for the same matrix, about 270 times faster than the serial code!
On the same GPU the parallel code needed ms for a matrix, about 6.4 times slower (but for the serial code the time increased by a factor of 8.8).

Global memory vs. shared memory

Improving the Serial Code
In the innermost loop of the serial code (with index k)

    for (int i=0; i < isize; i++)
        for (int j=0; j < isize; j++) {
            float Csub = 0;
            for (int k=0; k < isize; k++)
                Csub += A[i*isize+k]*B[k*isize+j];
            C[i*isize+j] = Csub;
        }

the elements of the matrix A, A[i*isize+k], are accessed sequentially (stride of one). For the matrix B, on the contrary, B[k*isize+j] is accessed with a stride of isize (= 1024 in our case). This is likely to cause frequent cache misses and page faults.

Transposing B
We can get better performance by transposing the matrix B (EXAMPLES2/matrix-matrix0t.cu):

    float tmpb;
    for (int i=0; i < isize; i++)
        for (int j=0; j < i; j++) {
            tmpb = B[i*isize+j];
            B[i*isize+j] = B[j*isize+i];
            B[j*isize+i] = tmpb;
        }

Then, the innermost loop of the serial code becomes

    for (int k=0; k < isize; k++)
        Csub += A[i*isize+k]*B[j*isize+k];

and both matrices can be accessed with a stride of one! On my desktop, the computing time (including the transposition) goes down from ms to ms!
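That the transposed loop computes the same product can be checked with a small self-contained test. The sketch below is in C++ with std::vector instead of raw arrays; the function names multiplyNaive and multiplyTransposed are hypothetical, and the transposition is done on a local copy of B so that the caller's matrix is left untouched (an assumption that differs from the in-place version above).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// C = A * B for square n x n row-major matrices; B is read with stride n.
std::vector<float> multiplyNaive(const std::vector<float>& A,
                                 const std::vector<float>& B, int n) {
    assert(A.size() == B.size());
    std::vector<float> C(A.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float Csub = 0;
            for (int k = 0; k < n; ++k)
                Csub += A[i * n + k] * B[k * n + j];
            C[i * n + j] = Csub;
        }
    return C;
}

// Same product, but B (passed by value) is first transposed so that the
// innermost loop reads both matrices with stride one.
std::vector<float> multiplyTransposed(const std::vector<float>& A,
                                      std::vector<float> B, int n) {
    assert(A.size() == B.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            std::swap(B[i * n + j], B[j * n + i]);
    std::vector<float> C(A.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float Csub = 0;
            for (int k = 0; k < n; ++k)
                Csub += A[i * n + k] * B[j * n + k];  // row j of the transposed B
            C[i * n + j] = Csub;
        }
    return C;
}
```

Both functions add the same terms in the same order, so they return identical results; only the memory access pattern for B differs.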

Improving the CUDA Code (I)
In order to improve the CUDA code, the easiest thing to do is to use the shared memory:
i) data stored in shared memory are visible to all threads within a block and last for the duration of the block;
ii) in the best-case scenario, shared memory performance is comparable to register performance.
This can be done in the following way: (Source: codinggorilla.domemtech)

Improving the CUDA Code (II)
The kernel becomes (EXAMPLES2/matrix-matrix2.cu):

    float Csub = 0;
    for (int a = aini, b = bini; a <= aend; a += astep, b += bstep) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

Each thread, with local (block) coordinates (threadIdx.x, threadIdx.y), copies one element of A and one of B to the matrices As and Bs, stored in shared memory. For each pair of sub-matrices As and Bs we evaluate the matrix-matrix product. After summing these partial contributions, we save the result in C[iy*w+ix] (as before).

Improving the CUDA Code (III)
We now only need to define the matrices As and Bs

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

and to set up the range values of the indices a and b. In the code we have:

    int aini = w*BLOCK_SIZE*blockIdx.y;
    int aend = aini + w - 1;
    int astep = BLOCK_SIZE;
    int bini = BLOCK_SIZE*blockIdx.x;
    int bstep = BLOCK_SIZE*w;
    int ib = threadIdx.x + w*threadIdx.y;

In the kernel we consider A[a + ib] and B[b + ib]. One can check that:

    a + ib = (w*BLOCK_SIZE*blockIdx.y) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE)
    b + ib = (BLOCK_SIZE*blockIdx.x) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE*w)

where m = 0,1,...,w/BLOCK_SIZE - 1 indicates which sub-blocks are considered.
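The index bookkeeping above (aini, aend, astep, bini, bstep, ib) can be verified on the host. The following is a minimal serial C++ sketch, not the kernel: the function name matrixMulTiledModel and the parameter BS (standing for BLOCK_SIZE) are assumptions, the vectors As and Bs model the shared-memory tiles of one block, and the loops over by, bx, ty, tx replace the parallel execution of blocks and threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the tiled kernel. BS plays the role of BLOCK_SIZE and
// w must be a multiple of it. As and Bs model the shared-memory tiles.
std::vector<float> matrixMulTiledModel(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       int w, int BS) {
    assert(w % BS == 0);
    std::vector<float> C(static_cast<std::size_t>(w) * w, 0.0f);
    std::vector<float> As(BS * BS), Bs(BS * BS), Csub(BS * BS);
    for (int by = 0; by < w / BS; ++by)
        for (int bx = 0; bx < w / BS; ++bx) {
            const int aini = w * BS * by, aend = aini + w - 1, astep = BS;
            const int bini = BS * bx, bstep = BS * w;
            std::fill(Csub.begin(), Csub.end(), 0.0f);  // one Csub per thread
            for (int a = aini, b = bini; a <= aend; a += astep, b += bstep) {
                // each thread copies one element of A and one of B
                for (int ty = 0; ty < BS; ++ty)
                    for (int tx = 0; tx < BS; ++tx) {
                        const int ib = tx + w * ty;
                        As[ty * BS + tx] = A[a + ib];
                        Bs[ty * BS + tx] = B[b + ib];
                    }
                // (first __syncthreads() on the device) product of the two tiles
                for (int ty = 0; ty < BS; ++ty)
                    for (int tx = 0; tx < BS; ++tx)
                        for (int k = 0; k < BS; ++k)
                            Csub[ty * BS + tx] += As[ty * BS + k] * Bs[k * BS + tx];
            }
            for (int ty = 0; ty < BS; ++ty)
                for (int tx = 0; tx < BS; ++tx)
                    C[(tx + BS * bx) + w * (ty + BS * by)] = Csub[ty * BS + tx];
        }
    return C;
}
```

Multiplying by the identity matrix is a quick way to see that every tile is read from the right place: any error in aini, bini or the step sizes would permute or drop elements of the result.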

Performance
For a matrix we find (on the Tesla S1070 GPU)
ms with the original CUDA code;
715 ms with the new CUDA code, about 300 times faster!
Do we get a better result if we transpose the matrix B inside the kernel (EXAMPLES2/matrix-matrix2t.cu)?
NOT AT ALL! The test gives ms!
WHY?

Shared Memory Organization
Shared memory is divided into 32 logical banks (32-bit words, 4 bytes). Since there are 32 threads in a warp and each bank services only one request per cycle, multiple simultaneous accesses to the same bank will result in what is known as a bank conflict. (Source: microway)

[Figure: No Bank Conflict. Source: 3dgep]

[Figure: Two-Way Bank Conflict. Source: 3dgep]

Bank Conflict with Bs Transposed
A bank conflict occurs when two or more threads in a warp access different words in the same bank. There is no bank conflict if different threads access the same word, or different bytes of the same word, in the same bank. In the original code we had

    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];

and for a given k the elements of Bs[k][threadIdx.x] have linear index k*BLOCK_SIZE + threadIdx.x, with threadIdx.x = 0, 1,..., BLOCK_SIZE - 1 (BLOCK_SIZE = 16), and belong to 16 different banks. After transposing the matrix Bs, we have

    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[threadIdx.x][k];

and for a given k the elements of Bs[threadIdx.x][k] have linear index threadIdx.x*BLOCK_SIZE + k. In our case this means threadIdx.x*16 + k, and the 16 elements for threadIdx.x = 0, 1,..., BLOCK_SIZE - 1 belong to only 2 banks!
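The bank counts quoted above follow from simple modular arithmetic, which can be reproduced on the host. The sketch below is in C++ and assumes that consecutive 4-byte words are assigned to consecutive banks, i.e. bank = (word index) % numBanks; the function name distinctBanks is hypothetical.

```cpp
#include <cassert>
#include <set>

// Number of distinct shared-memory banks touched when the threads with
// threadIdx.x = 0, ..., blockSize-1 each read one float of Bs at fixed k.
// Assumes 4-byte words and bank = (word index) % numBanks.
int distinctBanks(bool transposed, int blockSize, int k, int numBanks) {
    assert(numBanks > 0 && k >= 0 && k < blockSize);
    std::set<int> banks;
    for (int tx = 0; tx < blockSize; ++tx) {
        const int word = transposed ? tx * blockSize + k   // Bs[threadIdx.x][k]
                                    : k * blockSize + tx;  // Bs[k][threadIdx.x]
        banks.insert(word % numBanks);
    }
    return static_cast<int>(banks.size());
}
```

With blockSize = 16 and 32 banks, the original access pattern touches 16 different banks, while the transposed one touches only banks k and k + 16, so 8 threads compete for each of them.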

Data prefetching

Global Memory Access
Global memory access usually has limited bandwidth and data access can take a long time to complete.
The CUDA threading model tolerates long memory-access latency by allowing some warps to make progress while others wait for their memory-access results.
This strategy usually works fine if all threads have a large number of independent instructions between memory-access instructions and the use of the accessed data.
When this is not the case, a possible solution is to prefetch the next data while using the data already available.

Scheme of the Original Code
In our case (matrix-matrix product) the kernel was organized in the following way:

    Csub = 0;
    for (loop) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

Scheme of the New Code
The same code with prefetching becomes:

    Csub = 0;
    // prefetch first data into registers
    for (loop) {
        // copy data from registers to shared memory
        __syncthreads();
        // prefetch next data into registers
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

The New Kernel (I)
And here is the modified kernel (EXAMPLES2/matrix-matrix3.cu):

    float Csub = 0;
    int a = aini;
    int b = bini;
    float atmp = A[aini+ib];
    float btmp = B[bini+ib];
    for (int m=0; m<(w/BLOCK_SIZE)-1; m++) {
        As[threadIdx.y][threadIdx.x] = atmp;
        Bs[threadIdx.y][threadIdx.x] = btmp;
        __syncthreads();
        a += astep;
        b += bstep;
        atmp = A[a + ib];
        btmp = B[b + ib];
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

The New Kernel (II)
The last step is done outside the loop and without prefetching:

    As[threadIdx.y][threadIdx.x] = atmp;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
    C[ix+w*iy] = Csub;
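The ordering of the operations in the prefetching kernel (load first, then copy, prefetch, compute, with the last tile handled outside the loop) can be followed on a simpler serial problem. The following C++ sketch sums the squares of a vector tile by tile; it is only an analogy under stated assumptions: the name sumSquaresPrefetch is hypothetical, reg stands in for the registers atmp and btmp, tile stands in for the shared-memory arrays, and on a CPU there is of course no latency being hidden.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squares over data, processed in tiles of tileSize elements with the
// same loop structure as the prefetching kernel.
// data.size() must be a non-zero multiple of tileSize.
float sumSquaresPrefetch(const std::vector<float>& data, std::size_t tileSize) {
    assert(tileSize > 0 && !data.empty() && data.size() % tileSize == 0);
    const std::size_t numTiles = data.size() / tileSize;
    // prefetch the first tile into "registers"
    std::vector<float> reg(data.begin(), data.begin() + tileSize);
    std::vector<float> tile(tileSize);
    float sum = 0;
    for (std::size_t m = 0; m + 1 < numTiles; ++m) {
        tile = reg;  // copy from "registers" to "shared memory"
        const std::size_t next = (m + 1) * tileSize;
        reg.assign(data.begin() + next, data.begin() + next + tileSize);  // prefetch next tile
        for (float x : tile) sum += x * x;  // compute on the current tile
    }
    tile = reg;  // last tile: nothing left to prefetch
    for (float x : tile) sum += x * x;
    return sum;
}
```

The loop runs numTiles - 1 times, exactly like the loop over m in the kernel, because the tile prefetched in the last iteration is consumed after the loop.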

Performance and Possible Problems
With prefetching, for a matrix, we find (on the Tesla S1070 GPU) a computing time of ms, to be compared to 715 ms without prefetching.
NOTE: with prefetching we need to use more registers. If too many registers are used, we can have register spilling to local memory.
NOTE: registers do not support indexing, so local memory is used for arrays.
One can check the usage of registers and the spilling to local memory with the compiler option --ptxas-options=-v, and fix the maximum number of registers per thread with -maxrregcount=x.
More generally, the execution resources (such as registers) of the streaming multiprocessors (SMX) are dynamically partitioned and assigned at runtime, and one should check whether resources are under-utilized.

Resources for the K20 GPUs
For the K20 GPUs the available resources (at any given moment) are:
max number of threads per block: 1024;
threads per warp: 32;
max number of warps per SMX: 64;
max number of blocks per SMX: 16;
max number of threads per SMX: 2048;
max number of 32-bit registers per thread: 255.
Recall that each SMX has 64K 32-bit registers.

Occupancy: an Example
We can consider the following example:
in the matrix-matrix product, each block has 256 threads, corresponding to 8 warps;
each thread requires 21 registers, i.e. 5376 registers per block;
considering registers, we can have at most 65536/5376 = 12 blocks in an SMX, below the max number of 16;
with 12 blocks we have a total of 3072 threads, which is more than the max number of 2048;
equivalently, with 12 blocks we have a total of 96 warps, which is more than the max number of 64.
In this case the number of registers used is not a limitation (the thread limit is reached first, with 8 blocks) and we can have 100 % occupancy.

Occupancy: a Second Example
Here is another example:
in the matrix-matrix product, each block has 256 threads, corresponding to 8 warps;
each thread requires 38 registers, i.e. 9728 registers per block;
considering registers, we can have at most 65536/9728 = 6 blocks in an SMX, below the max number of 16;
with 6 blocks we have a total of 1536 threads, which is less than the max number of 2048;
equivalently, with 6 blocks we have a total of 48 warps, which is less than the max number of 64.
In this case the number of registers used is a limitation and we can have at most 75 % occupancy.
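The arithmetic of the two occupancy examples can be packaged in a few lines. The following C++ sketch hard-codes the K20 limits listed earlier and makes two simplifying assumptions: registers are the only per-thread resource considered (shared memory is ignored), and the register-allocation granularity of real hardware is neglected. The names Occupancy and estimateOccupancy are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Simplified occupancy estimate for one SMX with the K20 limits.
// Assumptions: only registers, warps and blocks limit residency; shared
// memory and register-allocation granularity are ignored.
struct Occupancy {
    int blocksPerSMX;
    int warpsPerSMX;
    double fraction;  // resident warps / max warps
};

Occupancy estimateOccupancy(int threadsPerBlock, int registersPerThread) {
    const int warpSize = 32, maxWarps = 64, maxBlocks = 16;
    const int registersPerSMX = 65536;  // 64K 32-bit registers
    assert(threadsPerBlock > 0 && threadsPerBlock <= 1024);
    assert(registersPerThread > 0 && registersPerThread <= 255);
    const int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    const int byRegisters = registersPerSMX / (threadsPerBlock * registersPerThread);
    const int byWarps = maxWarps / warpsPerBlock;
    const int blocks = std::min({byRegisters, byWarps, maxBlocks});
    const int warps = blocks * warpsPerBlock;
    return {blocks, warps, static_cast<double>(warps) / maxWarps};
}
```

For 256 threads per block, 21 registers per thread give 8 resident blocks (64 warps, 100 %), while 38 registers per thread give 6 resident blocks (48 warps, 75 %), in agreement with the two examples.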

Exercises
Consider a kernel with 256 threads, using 4 KB of shared memory. Recalling that on the K20 GPU the SMX can be configured with 16, 32 or 48 KB of shared memory, which configuration would you choose to guarantee 100 % occupancy?
In the same situation as in the previous exercise, each thread uses 20 registers. Do you need to change the shared-memory configuration to guarantee 100 % occupancy?
Now each thread uses 40 registers. What could you do in this case to guarantee 100 % occupancy?

FOURTH LECTURE

OUTLINE OF THE FOURTH LECTURE
Thread granularity
Page-locked host memory
Streams
Multiple GPUs

Thread granularity

Fewer Threads
Sometimes it is more advantageous to use fewer threads and put more work into each thread.
In particular, this is the case if some redundant work exists between threads.
In the case of the matrix-matrix product using shared memory, several blocks of threads load the same sub-matrix into shared memory.

106 The Scheme of the Algorithm We can improve the code by changing the granularity (i.e. the extent to which a system is broken down into small parts): each block pairs one column of sub-matrices of B with several rows of sub-matrices of A (or vice versa), so that each sub-matrix of B is loaded once and reused. (Source: 3dgep)

107 The New Code (I) We start by calling the kernel with 1/4 of the original number of blocks and, therefore, 1/4 of the original number of threads (EXAMPLES2/matrix-matrix3aa.cu):
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/(4*threads.y));
    xmul<<< grid, threads >>>(dev_c, dev_a, dev_b, isize);
Note that we have reduced the value of gridDim.y, i.e. the grid is no longer square. Inside the kernel we have to work with 4 sub-matrices As
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE][4];
and with 4 global coordinates iy for each thread
    int iy1 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y);
    int iy2 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+1);
    int iy3 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+2);
    int iy4 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+3);

108 The New Code (II) Every instruction involving the As sub-matrix is replicated 4 times:
    ...
    As[threadIdx.y][threadIdx.x][0] = atmp1;
    As[threadIdx.y][threadIdx.x][1] = atmp2;
    As[threadIdx.y][threadIdx.x][2] = atmp3;
    As[threadIdx.y][threadIdx.x][3] = atmp4;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    atmp1 = A[a1 + ib];
    atmp2 = A[a2 + ib];
    atmp3 = A[a3 + ib];
    atmp4 = A[a4 + ib];
    btmp = B[b + ib];
    ...

109 The New Code (III) At the end, each thread in a block evaluates four elements of the C matrix.
    ...
    C[ix+w*iy1] = Csub1;
    C[ix+w*iy2] = Csub2;
    C[ix+w*iy3] = Csub3;
    C[ix+w*iy4] = Csub4;
    ...
With the new code, for a [...] matrix, we find (on the Tesla S1070 GPU) a computing time of [...] ms, to be compared to [...] ms with the old code (the one with prefetching).
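The index arithmetic behind iy1 to iy4 can be checked on the host. The following sketch emulates the y-direction of the reduced grid and counts how often each row of C is visited; the helper names `global_row` and `row_coverage` are hypothetical and do not come from the lecture code, and it assumes that the matrix size is a multiple of 4*BLOCK_SIZE.

```cpp
#include <cassert>
#include <vector>

// Row of C handled by a thread for its k-th sub-matrix (k = 0..3),
// mirroring iy1..iy4 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y + k).
int global_row(int thread_y, int block_y, int k, int block_size) {
    assert(k >= 0 && k < 4);
    return thread_y + block_size * (4 * block_y + k);
}

// Emulate the y-direction of the launch with gridDim.y = isize/(4*block_size)
// and count how many times each of the isize rows is written.
std::vector<int> row_coverage(int isize, int block_size) {
    assert(block_size > 0 && isize % (4 * block_size) == 0);
    std::vector<int> hits(isize, 0);
    int grid_y = isize / (4 * block_size);
    for (int block_y = 0; block_y < grid_y; ++block_y)
        for (int thread_y = 0; thread_y < block_size; ++thread_y)
            for (int k = 0; k < 4; ++k)
                ++hits[global_row(thread_y, block_y, k, block_size)];
    return hits;
}
```

If every entry of the returned vector equals 1, the four coordinates per thread cover all rows exactly once, even though the grid has only a quarter of the original blocks in y.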

110 Page-locked host memory

114 The GPU Takes Over Instead of allocating host buffers with malloc(), we can use cudaHostAlloc(), which allocates host memory through the CUDA runtime. The GPU can access host memory allocated in this way directly, using direct memory access (DMA). In particular, this memory is page-locked (pinned memory), i.e. the OS cannot page it out or relocate it until it is freed (using cudaFreeHost()). NOTE: if too much host memory is page-locked, the system can run out of memory...

115 How to Use cudaHostAlloc() We consider a function that checks the speed of copying data to/from the GPU.
    ...
    cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    cudaFreeHost( a );
    ...

116 The Same with cudaMalloc() The same function without pinned memory: the host buffer comes from malloc(), and only the device buffer from cudaMalloc().
    ...
    a = (int*)malloc( size * sizeof( *a ) );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    free( a );
    ...

117 The Main Code The main code times cudaMemcpy with and without pinned memory (EXAMPLES2/copy_timed.cu).
    int main( void ) {
        float elapsedTime;
        float MB = (float)100*SIZE*sizeof(int)/1024/1024;
        // try it with cudaMalloc, HostToDevice
        elapsedTime = cuda_malloc_test( SIZE, true );
        printf( "Time using cudaMalloc: %3.1f ms\n", elapsedTime );
        printf( "\tMB/s during copy up: %3.1f\n", MB/(elapsedTime/1000) );
        // try it with cudaMalloc, DeviceToHost
        ...
        // now try it with cudaHostAlloc, HostToDevice
        ...
        // and cudaHostAlloc, DeviceToHost
        ...
    }
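The bandwidth figure printed by the main code is just the total number of megabytes moved divided by the elapsed time in seconds. As a worked version of that arithmetic, here is a small host-side sketch; the function name `copy_bandwidth_mb_per_s` is hypothetical, and the buffer size used in the usage note is an illustrative assumption, not the SIZE of the lecture code.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in MB/s for `copies` transfers of `elements` items of
// `element_bytes` bytes each, completed in `elapsed_ms` milliseconds.
double copy_bandwidth_mb_per_s(int copies, std::size_t elements,
                               std::size_t element_bytes, double elapsed_ms) {
    assert(copies > 0 && elapsed_ms > 0.0);
    // Convert to double first so that the byte count cannot overflow an integer.
    double megabytes = static_cast<double>(copies) * elements * element_bytes
                       / 1024.0 / 1024.0;
    // The timer reports milliseconds, hence the division by 1000.
    return megabytes / (elapsed_ms / 1000.0);
}
```

For example, 100 copies of a buffer of 10*1024*1024 four-byte integers amount to 4000 MB, so an elapsed time of 1000 ms would correspond to 4000 MB/s.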

118 Timing The timing results on the Tesla S1070 GPU are:
    cudaMalloc, HostToDevice: [...] ms (with [...] MB/s)
    cudaMalloc, DeviceToHost: [...] ms (with [...] MB/s)
    cudaHostAlloc, HostToDevice: [...] ms (with [...] MB/s)
    cudaHostAlloc, DeviceToHost: [...] ms (with [...] MB/s)
Thus, if one needs to cudaMemcpy the same set of data back and forth, and this does not require too much memory, it is usually better to use cudaHostAlloc.

119 Streams

123 Organizing the GPU Workload A sequence of GPU operations that should be executed in a specific order can be organized using streams. In a stream we can have memory copies, kernel launches, events (start and stop), etc. Using multiple streams, we can sometimes accelerate computing. We will see first how to organize one stream.

124 One Stream What the kernel does in this example is not very important. We concentrate on the main program (EXAMPLES2/basic_single_stream.cu).
    int main( void ) {
        cudaDeviceProp prop;
        int whichDevice;
        cudaGetDevice( &whichDevice );
        cudaGetDeviceProperties( &prop, whichDevice );
        if (!prop.deviceOverlap) {
            printf( "Device will not handle overlaps, so no speed up from streams\n" );
            return 0;
        }
        ...
        cudaStream_t stream;
        ...
        cudaStreamCreate( &stream );

125 Asynchronous cudaMemcpy To use streams effectively, we need to copy data asynchronously, which requires page-locked host memory.
        ...
        // various cudaMalloc, cudaHostAlloc
        ...
        cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
        ...
        cudaEventRecord( start, 0 );
        for (int i=0; i<FULL_DATA_SIZE; i+= N) {
            // copy the locked memory to the device, async
            cudaMemcpyAsync( dev_a, host_a+i, N * sizeof(int), cudaMemcpyHostToDevice, stream );
            ...

126 Completing the Stream Finally, we complete the main program and destroy the stream.
            ...
            kernel<<<N/256,256,0,stream>>>( dev_a, dev_b, dev_c );
            cudaMemcpyAsync( host_c+i, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost, stream );
        }
        cudaStreamSynchronize( stream );
        // evaluate elapsedTime
        // cleanup the streams and memory
        cudaStreamDestroy( stream );
        return 0;
    }
All the operations in a stream are executed in the given order, i.e. an operation starts only when the previous one (in the same stream) has completed.

127 Two Streams and Speed-Up With only one stream we can, for example, do some work on the CPU while the GPU is busy completing the stream. With two streams, however, we can also reduce the computing time, because a memory copy in one stream can overlap with a kernel running in the other. (Source: W. Xiao's talk)

128 Two-Stream Code Here, we use the two streams inside the for loop over the data. Outside the loop, we need to duplicate all the definitions. Inside the loop, the most effective way of working with two streams is (usually) to alternate between them (EXAMPLES2/basic_double_stream.cu). The computing time dropped from 83.3 ms to 72.8 ms.
    for (int i=0; i<FULL_DATA_SIZE; i+= N*2) {
        cudaMemcpyAsync( dev_a0, host_a+i, N * sizeof(int), cudaMemcpyHostToDevice, stream0 );
        cudaMemcpyAsync( dev_a1, host_a+i+N, N * sizeof(int), cudaMemcpyHostToDevice, stream1 );
        ...
        kernel<<<N/256,256,0,stream0>>>( dev_a0, dev_b0, dev_c0 );
        kernel<<<N/256,256,0,stream1>>>( dev_a1, dev_b1, dev_c1 );
        ...
    }
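The alternation between the two streams can be made explicit by listing, in issue order, which stream receives which chunk of the host buffer. This is only a host-side sketch of the loop's index arithmetic: `ChunkCopy` and `two_stream_schedule` are hypothetical names, no CUDA call is made, and the total size is assumed to be a multiple of 2*N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One queued host-to-device copy: the stream it goes to and where its chunk starts.
struct ChunkCopy {
    int stream;          // 0 for stream0, 1 for stream1
    std::size_t offset;  // element offset into the host buffer
};

// Issue order of the copies when two streams alternate over the data,
// following the loop for (i = 0; i < full_size; i += 2*n).
std::vector<ChunkCopy> two_stream_schedule(std::size_t full_size, std::size_t n) {
    assert(n > 0 && full_size % (2 * n) == 0);
    std::vector<ChunkCopy> order;
    for (std::size_t i = 0; i < full_size; i += 2 * n) {
        order.push_back({0, i});      // chunk starting at host_a + i
        order.push_back({1, i + n});  // chunk starting at host_a + i + n
    }
    return order;
}
```

With 8 elements and chunks of 2, the schedule is stream 0 at offset 0, stream 1 at offset 2, stream 0 at offset 4 and stream 1 at offset 6, so consecutive chunks never queue behind each other in the same stream.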

129 Fermi: Parallel Kernels Before compute capability 2.0, only one kernel could run at a given time. Fermi GPUs can run multiple, independent kernels on different thread grids simultaneously. (Source: techreport.com)

130 Kepler: Dynamic Parallelism Kepler GPUs can generate new work for themselves and control the scheduling of that work without involving the CPU! (Source: NVIDIA)

131 Kepler: Hyper-Q For Fermi GPUs: 16-way concurrency of kernels from separate streams, but only one hardware queue. For Kepler GPUs: 32 tasks that are independent at the hardware level! (Source: NVIDIA)

132 Multiple GPUs

133 No GPUDirect P2P Previously, to transfer data between two GPUs, the CPU had to handle the communication, generating several copies of the same set of data. (Source: NVIDIA)

134 With GPUDirect P2P Now, with CUDA 5.0 and the new Kepler hardware, third-party devices can access GPU memory directly. (Source: NVIDIA)

135 That's all Folks!
