CUDA Programming: a green path to high-performance computing
Attilio Cucchieri, IFSC USP lattice/
THIRD LECTURE
OUTLINE OF THE THIRD LECTURE
- CUDA architecture (II)
- Two-dimensional grids
- Global memory vs. shared memory
- Data prefetching
CUDA Architecture (II)
Function Type Qualifiers
- __global__: functions with the __global__ qualifier are executed on the device but are callable from the host only (devices with compute capability 3.5 or higher can also launch them from the device); these functions must return void.
- __device__: functions with the __device__ qualifier are executed on the device and are callable from the device only.
- __host__: functions with the __host__ qualifier are executed on the host and are callable from the host only (this is the default).
- The __host__ and __device__ qualifiers can be combined.
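The qualifiers listed above can be seen side by side in a minimal sketch. The names square, squarePlusOne, squareAll and runSquareAll are hypothetical and chosen only for illustration; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __host__ __device__: compiled for both the CPU and the GPU.
__host__ __device__ float square(float x) { return x * x; }

// __device__: runs on the device and is callable only from device code.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// __global__: a kernel; runs on the device, is launched from the host, returns void.
__global__ void squareAll(float *v, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) v[i] = squarePlusOne(v[i]);
}

// Plain host function (__host__ is the default qualifier).
// Returns true if the device result agrees with the host-side reference.
bool runSquareAll(int n)
{
    float *h = new float[n];
    for (int i = 0; i < n; i++) h[i] = (float)i;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    squareAll<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    bool ok = true;
    for (int i = 0; i < n; i++)
        if (h[i] != square((float)i) + 1.0f) ok = false;   // host call of the same function
    delete[] h;
    return ok;
}

int main()
{
    printf("%s\n", runSquareAll(1000) ? "ok" : "mismatch");
    return 0;
}
```

Here square is used both inside the kernel (through squarePlusOne) and in the host-side check, which is the typical reason for combining __host__ and __device__.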
Variable Type Qualifiers
- __device__: these variables are stored in global memory, they are accessible from all the threads within a grid and have the lifetime of the application; memory allocated with cudaMalloc is also global memory, so the __device__ qualifier is implied there.
- __constant__: these variables are stored in constant memory, they are accessible from all the threads within a grid and have the lifetime of the application.
- __shared__: these variables are stored in shared memory, they are accessible by all the threads in a block and have the lifetime of the block.
- Unqualified: each thread has read/write access to its own unqualified variables; scalars and built-in vector types are stored in registers, while arrays and spilled registers go to local memory; they have the lifetime of the thread.
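A minimal sketch of the four kinds of variables in one kernel. The names scale, launchCount, scaleAndReverse and runScaleAndReverse are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128

__constant__ float scale;     // constant memory: read-only in kernels, set from the host
__device__ int launchCount;   // global memory: lives as long as the application

__global__ void scaleAndReverse(float *v)
{
    __shared__ float tile[THREADS];   // shared memory: one copy per block
    int t = threadIdx.x;              // unqualified scalars: normally kept in registers
    int base = blockIdx.x * blockDim.x;
    tile[t] = scale * v[base + t];
    __syncthreads();                  // the whole tile must be written before it is read
    v[base + t] = tile[blockDim.x - 1 - t];
    if (t == 0 && blockIdx.x == 0) launchCount += 1;
}

// Returns true if both blocks were scaled and reversed as expected.
bool runScaleAndReverse()
{
    const int n = 2 * THREADS;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = (float)i;
    float s = 2.0f;
    int zero = 0, count = -1;
    cudaMemcpyToSymbol(scale, &s, sizeof(float));
    cudaMemcpyToSymbol(launchCount, &zero, sizeof(int));
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));   // cudaMalloc also returns global memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleAndReverse<<<2, THREADS>>>(d);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&count, launchCount, sizeof(int));
    cudaFree(d);
    bool ok = (count == 1);
    for (int b = 0; b < 2; b++)
        for (int t = 0; t < THREADS; t++)
            if (h[b * THREADS + t] != 2.0f * (b * THREADS + THREADS - 1 - t)) ok = false;
    return ok;
}

int main()
{
    printf("%s\n", runScaleAndReverse() ? "ok" : "mismatch");
    return 0;
}
```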
GPU Drawbacks
- Fixed amount of memory, e.g. 4-6 GB.
- The CUDA architecture is not very flexible: it is usually not useful for problems that are not compute-intensive.
- Codes must be rewritten (easier with OpenACC).
- The device (GPU) is attached to the host (CPU) via the relatively slow PCIe bus (16 GB/s bi-directional with x).
- Not all C/C++ features can be used inside a kernel (but new features are usually added in new versions, e.g. recursion is supported by NVIDIA hardware with compute capability 2.0 or higher).
The Reduction Step Revisited (I)
In the evaluation of the scalar product we used the reduction step (EXAMPLES/dot1.cu):
    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        if (cacheIndex % (2*i) == 0)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }
The problem with this scheme is that each active warp has only a few active threads. For example, with 256 threads per block, in the fourth iteration the partial sums are saved in cache[16i], with i ∈ [0,15], by the threads with threadIdx.x equal to 16i. Thus, each warp has only two active threads. One possible solution is to consider (EXAMPLES2/dot2.cu):
    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        __syncthreads();
    }
The Reduction Step Revisited (II)
With the new reduction scheme, for the same example considered before, in the fourth iteration the partial sums are still saved in cache[16i], with i ∈ [0,15], but this time the active threads have threadIdx.x equal to i, i.e. they belong to the same warp! Another possible improvement can be introduced by modifying the last kernel in the following way (EXAMPLES2/dot3.cu):
    int i = 1;
    while (i < blockDim.x/64) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        i *= 2;
        __syncthreads();
    }
Note: the for loop has become a while loop, and this loop stops as soon as i reaches blockDim.x/64. How do we include the rest of the loop?
The Reduction Step Revisited (III)
Here is the rest of the loop (you need to disable racedetector in configure.ocelot):
    if (cacheIndex < 32) {
        int ic = cacheIndex*i;
        cache[ic] += cache[ic + 32*i];
        cache[ic] += cache[ic + 16*i];
        cache[ic] += cache[ic + 8*i];
        cache[ic] += cache[ic + 4*i];
        cache[ic] += cache[ic + 2*i];
        cache[ic] += cache[ic + i];
    }
Note: there is no explicit synchronization among the threads! Since cacheIndex < 32, all the active threads in the above code belong to the same warp. Thus, on the GPUs considered here, synchronization is automatic and enforced by the hardware. On GPUs with compute capability 7.0 or higher the threads of a warp are no longer guaranteed to run in lockstep, and a __syncwarp() is needed between the steps. The above code is rather complicated because the data are scattered: for example, in the fourth iteration, the partial sums are stored in cache[16i], with i ∈ [0,15]. This could also slow down memory access.
The Reduction Step Revisited (IV)
In order to solve this problem, we show below the final version of the kernel for the reduction step (EXAMPLES2/dot4.cu):
    int i = blockDim.x/2;
    while (i > 32) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex < 32) {
        cache[cacheIndex] += cache[cacheIndex + 32];
        cache[cacheIndex] += cache[cacheIndex + 16];
        cache[cacheIndex] += cache[cacheIndex + 8];
        cache[cacheIndex] += cache[cacheIndex + 4];
        cache[cacheIndex] += cache[cacheIndex + 2];
        cache[cacheIndex] += cache[cacheIndex + 1];
    }
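For reference, here is a sketch of a complete scalar-product program built around this last reduction scheme. It is not the dot4.cu file itself: the names dot and runDot, the block count and the test vectors are assumptions. The last-warp part is also written differently, with a register accumulator and __syncwarp() (CUDA 9 or later) between the steps, a conservative variant that does not rely on lockstep execution within a warp.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // threads per block: a power of two, at least 64

__global__ void dot(const float *a, const float *b, float *partial, int n)
{
    __shared__ float cache[THREADS];
    int cacheIndex = threadIdx.x;
    float temp = 0;
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < n;
         tid += blockDim.x * gridDim.x)
        temp += a[tid] * b[tid];
    cache[cacheIndex] = temp;
    __syncthreads();

    // Halve the number of active threads until 64 partial sums are left.
    int i = blockDim.x / 2;
    while (i > 32) {
        if (cacheIndex < i) cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    // Last warp: read, synchronize the warp, write, synchronize again.
    if (cacheIndex < 32) {
        float v = cache[cacheIndex];
        for (int s = 32; s > 0; s /= 2) {
            v += cache[cacheIndex + s];
            __syncwarp();
            cache[cacheIndex] = v;
            __syncwarp();
        }
        if (cacheIndex == 0) partial[blockIdx.x] = v;
    }
}

// Scalar product of (1,1,...,1) and (2,2,...,2) of length n; the exact result is 2n.
float runDot(int n)
{
    const int blocks = 32;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), partial(blocks);
    float *da, *db, *dp;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dp, blocks * sizeof(float));
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    dot<<<blocks, THREADS>>>(da, db, dp, n);
    cudaMemcpy(partial.data(), dp, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dp);
    float sum = 0;
    for (float p : partial) sum += p;   // final sum over the blocks on the host
    return sum;
}

int main()
{
    int n = 1 << 16;
    printf("dot = %.1f, expected %.1f\n", runDot(n), 2.0f * n);
    return 0;
}
```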
The Reduction Step Revisited (V)
The new reduction scheme, which avoids data scattering, is represented below. (Figure. Source: cs.uaf.edu)
Performance
Let us compare the running times of the four codes presented here for the scalar product (codes dot1-mod.cu, dot2.cu, dot3.cu, dot4.cu in the directory EXAMPLES2), using the Tesla S1070 GPU. Here, we used vectors with elements.
- First method and GPU timing using events (dot1-mod.cu): ms.
- Increasing the number of active threads in a warp (dot2.cu): ms.
- Without explicit synchronization for threads in a warp (dot3.cu): ms.
- Saving results in nearby memory positions (dot4.cu): ms.
Two-dimensional grids
2D Grids and Blocks
- Up to now we have considered only one-dimensional grids and blocks.
- Since the predefined variables threadIdx and blockIdx are usually used to implement data parallelism in an easy way, it is clear that, when dealing with matrices, a two-dimensional structure can be useful.
- In order to illustrate the use of two-dimensional grids and blocks, we consider the product of two (square N × N) matrices (indices starting from 0, as in C/C++): $C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}$.
Matrix-Matrix Product (I)
A simple serial code can be written as (N^3 iterations, EXAMPLES2/matrix-matrix0.cu):
    #define N 1024
    int main( void ) {
        unsigned int isize = N;
        // allocate host memory for matrices A, B and C
        // initialize host memory
        // capture the start time
        for (int i=0; i < isize; i++)
            for (int j=0; j < isize; j++) {
                float Csub = 0;
                for (int k=0; k < isize; k++)
                    Csub += A[i*isize+k]*B[k*isize+j];
                C[i*isize+j] = Csub;
            }
        // capture the stop time
        // clean up memory
        return 0;
    }
Matrix-Matrix Product (II)
Recall that, in C/C++, for a matrix
    A[i][k]   // i = row index, k = column index
the column index k runs fastest (row-major convention, i.e. each row is stored contiguously in memory). (Figure. Source: intel)
Matrix-Matrix Product (III)
Thus, when using only one index, we write
    A[i*N+k]   // instead of A[i][k], with i = row index, k = column index
and
    B[k*N+j]   // instead of B[k][j], with k = row index, j = column index
For the CUDA code we have to choose a grid/block structure (EXAMPLES2/matrix-matrix1.cu):
    #define BLOCK_SIZE 16
    #define N
    unsigned int isize = N;
    ...
    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/threads.y);
    ...
Matrix-Matrix Product (IV)
Recall that threadIdx.x can be considered the local x coordinate of a given thread (inside its own block) and, similarly, threadIdx.y can be considered the local y coordinate of the same thread. Likewise, blockIdx.x and blockIdx.y are the x and y coordinates of a given block inside the grid. Then, the global x and y coordinates of a thread inside the grid can be written as
    // Global x index for threads
    int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
    // Global y index for threads
    int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;
Matrix-Matrix Product (V)
And here is the kernel:
    __global__ void matrixMul(float *C, float *A, float *B, int w)
    {
        int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
        int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;
        float Csub = 0;
        for (int k=0; k < w; k++)
            Csub += A[iy*w+k] * B[k*w+ix];
        C[iy*w+ix] = Csub;   // C[iy][ix]
    }
Since each thread in the grid has its specific (and unique) set of global coordinates (ix,iy), each thread evaluates one element of the matrix C, i.e. C[iy][ix] = C[iy*w+ix].
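To see the kernel and the two-dimensional launch together, here is a sketch of a complete test program. The host function runMatrixMul, the integer-valued test matrices and the matrix size are assumptions made for this example; error checking is omitted.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

__global__ void matrixMul(float *C, const float *A, const float *B, int w)
{
    int ix = threadIdx.x + BLOCK_SIZE * blockIdx.x;   // global column index
    int iy = threadIdx.y + BLOCK_SIZE * blockIdx.y;   // global row index
    float Csub = 0;
    for (int k = 0; k < w; k++) Csub += A[iy * w + k] * B[k * w + ix];
    C[iy * w + ix] = Csub;
}

// Multiplies two n x n matrices on the GPU (n must be a multiple of BLOCK_SIZE)
// and returns the largest absolute deviation from a serial reference.
float runMatrixMul(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> A(n * n), B(n * n), C(n * n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * n + j] = (float)((i + 2 * j) % 5);   // small integers: sums are exact
            B[i * n + j] = (float)((3 * i + j) % 7);
        }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / threads.x, n / threads.y);
    matrixMul<<<grid, threads>>>(dC, dA, dB, n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float maxErr = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float ref = 0;
            for (int k = 0; k < n; k++) ref += A[i * n + k] * B[k * n + j];
            maxErr = fmaxf(maxErr, fabsf(ref - C[i * n + j]));
        }
    return maxErr;
}

int main()
{
    printf("max deviation from the serial result: %g\n", runMatrixMul(64));
    return 0;
}
```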
Results
- Using CUDA and two-dimensional grids we can easily parallelize the matrix-matrix product!
- What about performance?
- On my desktop the CPU serial code needed ms for a matrix.
- On the Tesla S1070 GPU the parallel code needed ms for the same matrix, about 270 times faster than the serial code!
- On the same GPU the parallel code needed ms for a matrix, about 6.4 times slower (but for the serial code the time increased by a factor of 8.8).
Global memory vs. shared memory
Improving the Serial Code
In the innermost loop of the serial code (with index k)
    for (int i=0; i < isize; i++)
        for (int j=0; j < isize; j++) {
            float Csub = 0;
            for (int k=0; k < isize; k++)
                Csub += A[i*isize+k]*B[k*isize+j];
            C[i*isize+j] = Csub;
        }
the elements of the matrix A[i*isize+k] are accessed sequentially (stride of one). On the contrary, for the matrix B[k*isize+j] we have a stride of isize (= 1024 in our case). This is likely to cause frequent cache misses and page faults.
Transposing B
We can get a better performance by transposing the matrix B (EXAMPLES2/matrix-matrix0t.cu):
    float tmpb;
    for (int i=0; i < isize; i++)
        for (int j=0; j < i; j++) {
            tmpb = B[i*isize+j];
            B[i*isize+j] = B[j*isize+i];
            B[j*isize+i] = tmpb;
        }
Then, the innermost loop of the serial code becomes
    for (int k=0; k < isize; k++)
        Csub += A[i*isize+k]*B[j*isize+k];
and both matrices can be accessed with a stride of one! On my desktop, the computing time (including the transposition) goes down from ms to ms!
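The two serial variants can be compared with a small host-only sketch (plain C++, which also compiles as a .cu file). The function names and test matrices are assumptions; unlike the code above, B is transposed in a private copy so that the caller's matrix stays unchanged.

```cuda
#include <cstdio>
#include <utility>
#include <vector>

// Straightforward product: B is read with a stride of n in the innermost loop.
void matmulNaive(std::vector<float> &C, const std::vector<float> &A,
                 const std::vector<float> &B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float Csub = 0;
            for (int k = 0; k < n; k++) Csub += A[i * n + k] * B[k * n + j];
            C[i * n + j] = Csub;
        }
}

// B is passed by value and transposed first, so that both operands are
// then read with a stride of one in the innermost loop.
void matmulTransposedB(std::vector<float> &C, const std::vector<float> &A,
                       std::vector<float> B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++) std::swap(B[i * n + j], B[j * n + i]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float Csub = 0;
            for (int k = 0; k < n; k++) Csub += A[i * n + k] * B[j * n + k];
            C[i * n + j] = Csub;
        }
}

// Returns true if both variants give the same matrix for an n x n test case.
bool sameResult(int n)
{
    std::vector<float> A(n * n), B(n * n), C1(n * n), C2(n * n);
    for (int i = 0; i < n * n; i++) {
        A[i] = (float)(i % 5);   // small integers: every sum is exact in float
        B[i] = (float)(i % 7);
    }
    matmulNaive(C1, A, B, n);
    matmulTransposedB(C2, A, B, n);
    return C1 == C2;
}

int main()
{
    printf("%s\n", sameResult(128) ? "same result" : "different results");
    return 0;
}
```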
Improving the CUDA Code (I)
In order to improve the CUDA code, the easiest thing to do is to use the shared memory:
- data stored in shared memory are visible to all the threads within a block and last for the duration of the block;
- in the best-case scenario, shared-memory performance is comparable to that of registers.
This can be done in the following way: (Figure. Source: codinggorilla.domemtech)
Improving the CUDA Code (II)
The kernel becomes (EXAMPLES2/matrix-matrix2.cu):
    float Csub = 0;
    for (int a = aini, b = bini; a <= aend; a += astep, b += bstep) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[ix+w*iy] = Csub;
Each thread, with local (block) coordinates (threadIdx.x, threadIdx.y), copies one element of A and one of B to the matrices As and Bs, stored in shared memory. For each pair of sub-matrices As and Bs we evaluate the matrix-matrix product. After summing these partial contributions, we save the result in C[iy*w+ix] (as before).
Improving the CUDA Code (III)
We only need now to define the matrices As and Bs
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
and to set up the range of values of the indices a and b. In the code we have:
    int aini = w*BLOCK_SIZE*blockIdx.y;
    int aend = aini + w - 1;
    int astep = BLOCK_SIZE;
    int bini = BLOCK_SIZE*blockIdx.x;
    int bstep = BLOCK_SIZE*w;
    int ib = threadIdx.x + w*threadIdx.y;
In the kernel we consider A[a + ib] and B[b + ib]. One can check that:
    a + ib = (w*BLOCK_SIZE*blockIdx.y) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE)
    b + ib = (BLOCK_SIZE*blockIdx.x) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE*w)
where m = 0, 1, ..., w/BLOCK_SIZE - 1 indicates which pair of sub-blocks is considered.
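Putting the pieces of the last slides together gives the following sketch of the complete shared-memory kernel, with a small host-side check. The kernel name matrixMulShared, the host function runMatrixMulShared and the test matrices are assumptions; the matrix-matrix2.cu file may differ in details.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

__global__ void matrixMulShared(float *C, const float *A, const float *B, int w)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int ix = threadIdx.x + BLOCK_SIZE * blockIdx.x;
    int iy = threadIdx.y + BLOCK_SIZE * blockIdx.y;
    int aini = w * BLOCK_SIZE * blockIdx.y;   // first sub-matrix of A for this block
    int aend = aini + w - 1;                  // last element of that row of sub-matrices
    int astep = BLOCK_SIZE;                   // A: move right by one sub-matrix
    int bini = BLOCK_SIZE * blockIdx.x;       // first sub-matrix of B for this block
    int bstep = BLOCK_SIZE * w;               // B: move down by one sub-matrix
    int ib = threadIdx.x + w * threadIdx.y;   // offset of this thread inside a sub-matrix
    float Csub = 0;
    for (int a = aini, b = bini; a <= aend; a += astep, b += bstep) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();                      // both tiles are complete
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // tiles may be overwritten now
    }
    C[ix + w * iy] = Csub;
}

// n must be a multiple of BLOCK_SIZE; returns the largest deviation from a serial reference.
float runMatrixMulShared(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> A(n * n), B(n * n), C(n * n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * n + j] = (float)((i + 2 * j) % 5);
            B[i * n + j] = (float)((3 * i + j) % 7);
        }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / threads.x, n / threads.y);
    matrixMulShared<<<grid, threads>>>(dC, dA, dB, n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float maxErr = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float ref = 0;
            for (int k = 0; k < n; k++) ref += A[i * n + k] * B[k * n + j];
            maxErr = fmaxf(maxErr, fabsf(ref - C[i * n + j]));
        }
    return maxErr;
}

int main()
{
    printf("max deviation from the serial result: %g\n", runMatrixMulShared(64));
    return 0;
}
```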
Performance
For a matrix we find (on the Tesla S1070 GPU):
- ms with the original CUDA code;
- 715 ms with the new CUDA code, about 300 times faster!
Do we get a better result if we transpose the matrix B inside the kernel (EXAMPLES2/matrix-matrix2t.cu)? NOT AT ALL! The test gives ms! WHY?
Shared Memory Organization
Shared memory is divided into 32 logical banks, with successive 32-bit (4-byte) words assigned to successive banks. Since there are 32 threads in a warp and each bank services only one request per cycle, multiple simultaneous accesses to the same bank result in what is known as a bank conflict. (Figure. Source: microway)
No Bank Conflict (Figure. Source: 3dgep)
Two-Way Bank Conflict (Figure. Source: 3dgep)
Bank Conflict with Bs Transposed
A bank conflict occurs when two or more threads in a warp access different words in the same bank. There is no bank conflict if different threads access the same word, or different bytes of the same word, in the same bank. In the original code we had
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
and for a given k the elements Bs[k][threadIdx.x] have linear index k*BLOCK_SIZE + threadIdx.x inside Bs, with threadIdx.x = 0, 1, ..., BLOCK_SIZE - 1 (BLOCK_SIZE = 16), and belong to 16 different banks. After transposing the matrix Bs, we have
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[threadIdx.x][k];
and for a given k the elements Bs[threadIdx.x][k] have linear index threadIdx.x*BLOCK_SIZE + k. In our case this means threadIdx.x*16 + k, and the 16 elements for threadIdx.x = 0, 1, ..., BLOCK_SIZE - 1 belong to only 2 banks!
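The bank counting above can be checked with a few lines of host code. This is a sketch under stated assumptions: 32 banks of 4-byte words, an array that starts at a bank boundary, and the hypothetical helper distinctBanks. The third call shows a common remedy that is not part of the original code, namely padding each row by one word (a row width of 17 instead of 16).

```cuda
#include <cstdio>
#include <set>

#define BLOCK_SIZE 16
#define NUM_BANKS 32   // assumption: 32 banks of 32-bit words

// Number of distinct banks touched by the 16 threads threadIdx.x = 0..15
// reading Bs[k][tx] (transposed == false) or Bs[tx][k] (transposed == true)
// from a float array whose rows are rowWidth words long.
int distinctBanks(int rowWidth, bool transposed, int k)
{
    std::set<int> banks;
    for (int tx = 0; tx < BLOCK_SIZE; tx++) {
        int word = transposed ? tx * rowWidth + k : k * rowWidth + tx;
        banks.insert(word % NUM_BANKS);
    }
    return (int)banks.size();
}

int main()
{
    int k = 3;   // any fixed k in 0..15 gives the same counts
    printf("Bs[k][tx],  row width 16: %d banks\n", distinctBanks(16, false, k));
    printf("Bs[tx][k],  row width 16: %d banks\n", distinctBanks(16, true, k));
    printf("Bs[tx][k],  row width 17: %d banks\n", distinctBanks(17, true, k));
    return 0;
}
```

With a row width of 17 the word indices are 17*tx + k; since 17 is odd, these fall into 16 different banks, which is why padding by one column removes the conflict for the transposed access.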
Data prefetching
Global Memory Access
- Global memory access usually has limited bandwidth, and data access can take a long time to complete.
- The CUDA threading model tolerates long memory-access latency by allowing some warps to make progress while others wait for their memory-access results.
- This strategy usually works fine if all threads have a large number of independent instructions between the memory-access instructions and the use of the accessed data.
- When this is not the case, a possible solution is to prefetch the next data while using the data already available.
Scheme of the Original Code
In our case (matrix-matrix product) the kernel was organized in the following way:
    Csub = 0;
    for (loop) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;
Scheme of the New Code
The same code with prefetching becomes:
    Csub = 0;
    // prefetch first data into registers
    for (loop) {
        // copy data from registers to shared memory
        __syncthreads();
        // prefetch next data into registers
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;
The New Kernel (I)
And here is the modified kernel (EXAMPLES2/matrix-matrix3.cu):
    float Csub = 0;
    int a = aini;
    int b = bini;
    float atmp = A[aini+ib];
    float btmp = B[bini+ib];
    for (int m=0; m<(w/BLOCK_SIZE)-1; m++) {
        As[threadIdx.y][threadIdx.x] = atmp;
        Bs[threadIdx.y][threadIdx.x] = btmp;
        __syncthreads();
        a += astep;
        b += bstep;
        atmp = A[a + ib];
        btmp = B[b + ib];
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
The New Kernel (II)
The last step is done outside the loop and without prefetching:
    As[threadIdx.y][threadIdx.x] = atmp;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
    C[ix+w*iy] = Csub;
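Assembled into one piece, the prefetching kernel can be sketched as follows. The kernel name matrixMulPrefetch, the host function runMatrixMulPrefetch and the test matrices are assumptions; the index definitions are the ones used for the shared-memory kernel.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

__global__ void matrixMulPrefetch(float *C, const float *A, const float *B, int w)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int ix = threadIdx.x + BLOCK_SIZE * blockIdx.x;
    int iy = threadIdx.y + BLOCK_SIZE * blockIdx.y;
    int a = w * BLOCK_SIZE * blockIdx.y, astep = BLOCK_SIZE;
    int b = BLOCK_SIZE * blockIdx.x,     bstep = BLOCK_SIZE * w;
    int ib = threadIdx.x + w * threadIdx.y;
    float Csub = 0;
    float atmp = A[a + ib];            // prefetch the first pair of elements
    float btmp = B[b + ib];
    for (int m = 0; m < (w / BLOCK_SIZE) - 1; m++) {
        As[threadIdx.y][threadIdx.x] = atmp;   // registers -> shared memory
        Bs[threadIdx.y][threadIdx.x] = btmp;
        __syncthreads();
        a += astep;
        b += bstep;
        atmp = A[a + ib];              // prefetch the next pair; the loads can
        btmp = B[b + ib];              // overlap with the arithmetic below
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    As[threadIdx.y][threadIdx.x] = atmp;       // last pair: nothing left to prefetch
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    C[ix + w * iy] = Csub;
}

// n must be a multiple of BLOCK_SIZE; returns the largest deviation from a serial reference.
float runMatrixMulPrefetch(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> A(n * n), B(n * n), C(n * n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * n + j] = (float)((i + 2 * j) % 5);
            B[i * n + j] = (float)((3 * i + j) % 7);
        }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / threads.x, n / threads.y);
    matrixMulPrefetch<<<grid, threads>>>(dC, dA, dB, n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float maxErr = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float ref = 0;
            for (int k = 0; k < n; k++) ref += A[i * n + k] * B[k * n + j];
            maxErr = fmaxf(maxErr, fabsf(ref - C[i * n + j]));
        }
    return maxErr;
}

int main()
{
    printf("max deviation from the serial result: %g\n", runMatrixMulPrefetch(64));
    return 0;
}
```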
Performance and Possible Problems
- With prefetching, for a matrix, we find (on the Tesla S1070 GPU) a computing time of ms, to be compared to 715 ms without prefetching.
- NOTE: with prefetching we need to use more registers. If too many registers are used, we can have register spilling to local memory.
- NOTE: registers do not support indexing, so local memory is used for arrays.
- One can check the usage of registers and the spilling to local memory with the compiler option --ptxas-options=-v and fix the maximum number of registers per thread with -maxrregcount=x.
- More generally, the execution resources (such as registers) of the streaming multiprocessors (SMX) are dynamically partitioned and assigned at runtime, and one should check whether resources are under-utilized.
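Besides the compiler report, the same figures can be read at runtime with cudaFuncGetAttributes. The sketch below uses a hypothetical kernel axpy and a hypothetical helper registersPerThread; a non-zero localSizeBytes indicates that local memory is used (spilled registers or per-thread arrays).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Prints the resources used by axpy as compiled for the current device.
// Returns the number of registers per thread, or -1 if the query fails.
int registersPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, axpy) != cudaSuccess) return -1;
    printf("registers per thread      : %d\n", attr.numRegs);
    printf("local memory per thread   : %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory/block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block     : %d\n", attr.maxThreadsPerBlock);
    return attr.numRegs;
}

int main()
{
    return registersPerThread() > 0 ? 0 : 1;
}
```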
Resources for the K20 GPUs
For the K20 GPUs the available resources (at any given moment) are:
- max number of threads per block: 1024;
- threads per warp: 32;
- max number of warps per SMX: 64;
- max number of blocks per SMX: 16;
- max number of threads per SMX: 2048;
- max number of 32-bit registers per thread: 255.
Recall that each SMX has 64K 32-bit registers.
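Most of these limits can be queried for whatever GPU is installed with cudaGetDeviceProperties. A minimal sketch follows; the helper name printLimits is an assumption, and only long-standing fields of cudaDeviceProp are used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the per-block and per-multiprocessor limits of one device.
// Returns the warp size, or -1 if the device cannot be queried.
int printLimits(int device)
{
    cudaDeviceProp p;
    if (cudaGetDeviceProperties(&p, device) != cudaSuccess) return -1;
    printf("%s (compute capability %d.%d)\n", p.name, p.major, p.minor);
    printf("max threads per block       : %d\n", p.maxThreadsPerBlock);
    printf("threads per warp            : %d\n", p.warpSize);
    printf("max threads per SM          : %d\n", p.maxThreadsPerMultiProcessor);
    printf("max warps per SM            : %d\n", p.maxThreadsPerMultiProcessor / p.warpSize);
    printf("32-bit registers per SM     : %d\n", p.regsPerMultiprocessor);
    printf("32-bit registers per block  : %d\n", p.regsPerBlock);
    printf("shared memory per SM (bytes): %zu\n", p.sharedMemPerMultiprocessor);
    return p.warpSize;
}

int main()
{
    return printLimits(0) > 0 ? 0 : 1;
}
```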
Occupancy: an Example
We can consider the following example:
- in the matrix-matrix product, each block has 16 × 16 = 256 threads, corresponding to 8 warps;
- each thread requires 21 registers, i.e. 21 × 256 = 5376 registers per block;
- considering registers, we can have at most 65536/5376 = 12 blocks (rounded down) in an SMX, below the max number of 16;
- with 12 blocks we would have a total of 3072 threads, which is more than the max number of 2048;
- equivalently, with 12 blocks we would have a total of 96 warps, which is more than the max number of 64.
In this case the limit is set by the number of threads (2048/256 = 8 blocks per SMX), the number of registers used is not a limitation, and we can have 100 % occupancy.
Occupancy: a Second Example
Here is another example:
- in the matrix-matrix product, each block has 16 × 16 = 256 threads, corresponding to 8 warps;
- each thread requires 38 registers, i.e. 38 × 256 = 9728 registers per block;
- considering registers, we can have at most 65536/9728 = 6 blocks (rounded down) in an SMX, below the max number of 16;
- with 6 blocks we have a total of 1536 threads, which is less than the max number of 2048;
- equivalently, with 6 blocks we have a total of 48 warps, which is less than the max number of 64.
In this case the number of registers used is a limitation and we can have at most 48/64 = 75 % occupancy.
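The arithmetic of the two occupancy examples can be written as a short host-side sketch. The limits are the K20 values quoted above; the function names are hypothetical, and the model is deliberately simplified: it ignores shared memory and the granularity with which real hardware allocates registers, so actual values may differ.

```cuda
#include <algorithm>
#include <cstdio>

// K20 (SMX) limits quoted in the text.
const int REGS_PER_SMX = 65536;
const int MAX_THREADS_PER_SMX = 2048;
const int MAX_BLOCKS_PER_SMX = 16;

// Number of blocks that can be resident on one SMX at the same time.
int activeBlocks(int regsPerThread, int threadsPerBlock)
{
    int byRegisters = REGS_PER_SMX / (regsPerThread * threadsPerBlock);
    int byThreads = MAX_THREADS_PER_SMX / threadsPerBlock;
    return std::min(std::min(byRegisters, byThreads), MAX_BLOCKS_PER_SMX);
}

// Occupancy in percent: resident threads over the maximum number of threads.
int occupancyPercent(int regsPerThread, int threadsPerBlock)
{
    int threads = activeBlocks(regsPerThread, threadsPerBlock) * threadsPerBlock;
    return 100 * threads / MAX_THREADS_PER_SMX;
}

int main()
{
    printf("21 registers: %d blocks, %d %% occupancy\n",
           activeBlocks(21, 256), occupancyPercent(21, 256));
    printf("38 registers: %d blocks, %d %% occupancy\n",
           activeBlocks(38, 256), occupancyPercent(38, 256));
    return 0;
}
```

For a compiled kernel, the CUDA runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor performs this kind of estimate with the real hardware rules.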
Exercises
- Consider a kernel with 256 threads, using 4 KB of shared memory. Recalling that on the K20 GPU the SMX can be configured with 16, 32 or 48 KB of shared memory, which configuration would you choose in order to guarantee 100 % occupancy?
- In the same situation as in the previous exercise, each thread uses 20 registers. Do you need to change the shared-memory configuration in order to guarantee 100 % occupancy?
- Now each thread uses 40 registers. What could you do in this case in order to guarantee 100 % occupancy?
FOURTH LECTURE
OUTLINE OF THE FOURTH LECTURE
- Thread granularity
- Page-locked host memory
- Streams
- Multiple GPUs
Thread granularity
Fewer Threads
- Sometimes it is more advantageous to use fewer threads and put more work into each thread.
- In particular, this is the case if some redundant work exists between threads.
- In the case of the matrix-matrix product using shared memory, several blocks of threads load the same sub-matrix into shared memory.
106 The Scheme of the Algorithm We can improve the code by changing the granularity (i.e. the extent to which a system is broken down into small parts): each block combines one column of sub-matrices of B with several rows of sub-matrices of A (or vice versa), so that a sub-matrix of B loaded into shared memory is reused several times. (Source: 3dgep)
107 The New Code (I) We start by calling the kernel with 1/4 of the original number of blocks and, therefore, 1/4 of the original number of threads (EXAMPLES2/matrix-matrix3aa.cu):
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/(4*threads.y));
    xmul<<< grid, threads >>>(dev_c, dev_a, dev_b, isize);
Note that we have reduced the value of gridDim.y, i.e. the grid is not square anymore. Inside the kernel we have to work with 4 sub-matrices As
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE][4];
and with 4 global coordinates iy for each thread
    int iy1 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y);
    int iy2 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+1);
    int iy3 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+2);
    int iy4 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+3);
108 The New Code (II) Every instruction involving the As sub-matrix is replicated 4 times:
    ...
    As[threadIdx.y][threadIdx.x][0] = atmp1;
    As[threadIdx.y][threadIdx.x][1] = atmp2;
    As[threadIdx.y][threadIdx.x][2] = atmp3;
    As[threadIdx.y][threadIdx.x][3] = atmp4;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    atmp1 = A[a1 + ib];
    atmp2 = A[a2 + ib];
    atmp3 = A[a3 + ib];
    atmp4 = A[a4 + ib];
    btmp = B[b + ib];
    ...
109 The New Code (III) At the end, each thread in a block evaluates four elements of the C matrix.
    ...
    C[ix+w*iy1] = Csub1;
    C[ix+w*iy2] = Csub2;
    C[ix+w*iy3] = Csub3;
    C[ix+w*iy4] = Csub4;
    ...
With the new code, for a matrix, we find (on the Tesla S1070 GPU) a computing time of ms, to be compared to ms with the old code (the one with prefetching).
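The fragments above can be assembled into a complete kernel along the following lines. This is only a sketch of the coarser-granularity idea, not the code of EXAMPLES2/matrix-matrix3aa.cu: it leaves out prefetching, uses loops over a hypothetical constant TILES instead of four replicated statements, stores the tile index first in As, and the names xmul4 and max_error are made up for illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16
#define TILES 4  // C tiles per block, stacked along y (4 as in the slides)

// C = A*B for row-major w x w matrices; w must be a multiple of TILES*BLOCK_SIZE.
__global__ void xmul4(float *C, const float *A, const float *B, int w)
{
    __shared__ float As[TILES][BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int ix  = tx + BLOCK_SIZE * blockIdx.x;            // global column
    int iy0 = ty + BLOCK_SIZE * (TILES * blockIdx.y);  // first of the 4 global rows
    float Csub[TILES] = {0.0f};

    for (int k0 = 0; k0 < w; k0 += BLOCK_SIZE) {
        Bs[ty][tx] = B[(k0 + ty) * w + ix];            // loaded once, used 4 times
        for (int t = 0; t < TILES; t++)
            As[t][ty][tx] = A[(iy0 + t * BLOCK_SIZE) * w + k0 + tx];
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; k++)
            for (int t = 0; t < TILES; t++)
                Csub[t] += As[t][ty][k] * Bs[k][tx];
        __syncthreads();
    }
    for (int t = 0; t < TILES; t++)
        C[ix + w * (iy0 + t * BLOCK_SIZE)] = Csub[t];
}

// Runs the kernel on test data and returns the largest deviation from a CPU product.
float max_error(int w)
{
    size_t bytes = (size_t)w * w * sizeof(float);
    std::vector<float> a(w * w), b(w * w), c(w * w);
    for (int i = 0; i < w * w; i++) {
        a[i] = (i % 7) * 0.5f;
        b[i] = (float)(i % 5) - 2.0f;
    }

    float *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, bytes);
    cudaMalloc((void**)&dev_b, bytes);
    cudaMalloc((void**)&dev_c, bytes);
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(w / threads.x, w / (TILES * threads.y));  // 1/4 of the blocks in y
    xmul4<<<grid, threads>>>(dev_c, dev_a, dev_b, w);
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);

    float err = 0.0f;
    for (int y = 0; y < w; y++)
        for (int x = 0; x < w; x++) {
            float sum = 0.0f;
            for (int k = 0; k < w; k++) sum += a[y * w + k] * b[k * w + x];
            err = fmaxf(err, fabsf(sum - c[y * w + x]));
        }
    return err;
}

int main()
{
    printf("max |C_gpu - C_cpu| = %g\n", max_error(128));
    return 0;
}
```

Each sub-matrix of B is read from global memory once per block and then used for four sub-matrices of A, which is where the saving with respect to the one-tile-per-block version comes from.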
110 Page-locked host memory
114 The GPU Takes Over Instead of allocating host memory with malloc(), we can organize the code using cudaHostAlloc(), which allocates host memory through CUDA. Host memory allocated in this way can be accessed directly by the GPU, using direct memory access (DMA). In particular, this memory is page-locked (pinned memory), i.e. the OS cannot page it out or move it until it is freed (using cudaFreeHost). NOTE: if too much host memory is page-locked, the system can run out of memory...
115 How to Use cudaHostAlloc() We consider a function that checks the speed of copying data to/from the GPU.
    ...
    cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    cudaFreeHost( a );
    ...
116 The Same with malloc() The same function without pinned memory: the host buffer is allocated with malloc(), the device buffer still with cudaMalloc().
    ...
    a = (int*)malloc( size * sizeof( *a ) );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    free( a );
    ...
117 The Main Code The main code times cudaMemcpy with and without pinned memory (EXAMPLES2/copy_timed.cu).
    int main( void ) {
        float elapsedTime;
        float MB = (float)100*SIZE*sizeof(int)/1024/1024;
        // try it with cudaMalloc, HostToDevice
        elapsedTime = cuda_malloc_test( SIZE, true );
        printf( "Time using cudaMalloc: %3.1f ms\n", elapsedTime );
        printf( "\tMB/s during copy up: %3.1f\n", MB/(elapsedTime/1000) );
        // try it with cudaMalloc, DeviceToHost
        ...
        // now try it with cudaHostAlloc, HostToDevice
        ...
        // and cudaHostAlloc, DeviceToHost
        ...
    }
118 Timing The timing results on the Tesla S1070 GPU are:
    cudaMalloc, HostToDevice: ms (with MB/s)
    cudaMalloc, DeviceToHost: ms (with MB/s)
    cudaHostAlloc, HostToDevice: ms (with MB/s)
    cudaHostAlloc, DeviceToHost: ms (with MB/s)
Thus, if one needs to cudaMemcpy the same set of data back and forth, and this does not require too much memory, it is usually better to use cudaHostAlloc.
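To repeat the measurement on another machine, the two test functions can be folded into one, as in the sketch below. The function name time_copies, the pinned flag and the buffer size are assumptions made for this illustration; it is not the code of EXAMPLES2/copy_timed.cu, and the timings depend on the GPU and on the host system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times 100 copies of `size` ints between host and device.
// pinned = true uses cudaHostAlloc, pinned = false uses malloc.
// Returns the elapsed time in ms, or -1 if an allocation fails.
float time_copies(int size, bool up, bool pinned)
{
    int *a = NULL, *dev_a = NULL;
    size_t bytes = (size_t)size * sizeof(int);
    float elapsedTime = -1.0f;

    if (pinned) {
        if (cudaHostAlloc((void**)&a, bytes, cudaHostAllocDefault) != cudaSuccess)
            return -1.0f;
    } else {
        a = (int*)malloc(bytes);
        if (a == NULL) return -1.0f;
    }
    memset(a, 0, bytes);  // touch the host buffer before timing

    if (cudaMalloc((void**)&dev_a, bytes) == cudaSuccess) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        for (int i = 0; i < 100; i++) {
            if (up)
                cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
            else
                cudaMemcpy(a, dev_a, bytes, cudaMemcpyDeviceToHost);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev_a);
    }
    if (pinned) cudaFreeHost(a); else free(a);
    return elapsedTime;
}

int main(void)
{
    const int SIZE = 10 * 1024 * 1024;  // assumed buffer size (ints)
    float MB = (float)100 * SIZE * sizeof(int) / 1024 / 1024;
    for (int pinned = 0; pinned < 2; pinned++)
        for (int up = 1; up >= 0; up--) {
            float ms = time_copies(SIZE, up != 0, pinned != 0);
            printf("%s, %s: %3.1f ms (%3.1f MB/s)\n",
                   pinned ? "cudaHostAlloc" : "malloc",
                   up ? "HostToDevice" : "DeviceToHost",
                   ms, MB / (ms / 1000));
        }
    return 0;
}
```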
119 Streams
123 Organizing the GPU Workload A sequence of GPU operations that should be executed in a specific order can be organized using streams. In a stream we can have memory copies, kernel launches, events (start and stop), etc. Using multiple streams, we can sometimes accelerate computing. We will see first how to organize one stream.
124 One Stream What the kernel does in this example is not very important. We concentrate on the main program (EXAMPLES2/basic_single_stream.cu).
    int main( void ) {
        cudaDeviceProp prop;
        int whichDevice;
        cudaGetDevice( &whichDevice );
        cudaGetDeviceProperties( &prop, whichDevice );
        if (!prop.deviceOverlap) {
            printf( "Device will not handle overlaps, so no speed up from streams\n" );
            return 0;
        }
        ...
        cudaStream_t stream;
        ...
        cudaStreamCreate( &stream );
125 Asynchronous cudaMemcpy To use streams effectively, we need to copy data asynchronously. The host buffers must be page-locked (cudaHostAlloc) for the copies to be truly asynchronous.
    ...
    // various cudaMalloc, cudaHostAlloc
    ...
    cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
    ...
    cudaEventRecord( start, 0 );
    for (int i=0; i<FULL_DATA_SIZE; i+= N) {
        // copy the locked memory to the device, async
        cudaMemcpyAsync( dev_a, host_a+i, N * sizeof(int), cudaMemcpyHostToDevice, stream );
        ...
126 Completing the Stream Finally, we complete the main program and destroy the stream.
        ...
        kernel<<<N/256,256,0,stream>>>( dev_a, dev_b, dev_c );
        cudaMemcpyAsync( host_c+i, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost, stream );
    }
    cudaStreamSynchronize( stream );
    // evaluate elapsedTime
    // cleanup the streams and memory
    cudaStreamDestroy( stream );
    return 0;
    }
All the operations in a stream are executed in the given order, i.e. an operation starts only when the previous one (in the same stream) has been completed.
127 Two Streams and Speed-Up With only one stream we can, for example, do some work on the CPU while the GPU is busy completing the stream. With two streams, however, we can also reduce the computing time, since a memory copy in one stream can overlap with a kernel running in the other. (Source: W. Xiao's talk)
128 Two-Stream Code Here, we use the two streams inside the for loop over the data. Outside the loop, we need to duplicate all the definitions. Inside the loop, the most effective way of working with two streams is (usually) to alternate between them (EXAMPLES2/basic_double_stream.cu). The computing time changed from 83.3 ms to 72.8 ms.
    for (int i=0; i<FULL_DATA_SIZE; i+= N*2) {
        cudaMemcpyAsync( dev_a0, host_a+i, N * sizeof(int), cudaMemcpyHostToDevice, stream0 );
        cudaMemcpyAsync( dev_a1, host_a+i+N, N * sizeof(int), cudaMemcpyHostToDevice, stream1 );
        ...
        kernel<<<N/256,256,0,stream0>>>( dev_a0, dev_b0, dev_c0 );
        kernel<<<N/256,256,0,stream1>>>( dev_a1, dev_b1, dev_c1 );
        ...
    }
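For experimenting with the one-stream and two-stream variants side by side, the loop can be written for a variable number of streams, as in the following sketch. The kernel body (an element-wise sum), the values of N and FULL_DATA_SIZE and the function name run_chunks are assumptions made for this illustration; the code of EXAMPLES2/basic_double_stream.cu may differ, and whether two streams are actually faster depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N (1024 * 1024)          // chunk size (assumed)
#define FULL_DATA_SIZE (N * 16)  // total number of elements (assumed)
#define MAX_STREAMS 2

__global__ void kernel(const int *a, const int *b, int *c)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

// Processes the data in chunks of N ints using nstreams (1 or 2) streams.
// Returns the elapsed time in ms, or -1 if the result is wrong.
float run_chunks(int nstreams)
{
    cudaStream_t stream[MAX_STREAMS];
    int *dev_a[MAX_STREAMS], *dev_b[MAX_STREAMS], *dev_c[MAX_STREAMS];
    int *host_a, *host_b, *host_c;
    size_t full = FULL_DATA_SIZE * sizeof(int), chunk = N * sizeof(int);
    cudaEvent_t start, stop;
    float elapsedTime;

    // page-locked host memory, needed for asynchronous copies
    cudaHostAlloc((void**)&host_a, full, cudaHostAllocDefault);
    cudaHostAlloc((void**)&host_b, full, cudaHostAllocDefault);
    cudaHostAlloc((void**)&host_c, full, cudaHostAllocDefault);
    for (int i = 0; i < FULL_DATA_SIZE; i++) { host_a[i] = i; host_b[i] = 2 * i; }

    for (int s = 0; s < nstreams; s++) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void**)&dev_a[s], chunk);
        cudaMalloc((void**)&dev_b[s], chunk);
        cudaMalloc((void**)&dev_c[s], chunk);
    }
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    for (int i = 0; i < FULL_DATA_SIZE; i += N * nstreams) {
        // alternate between the streams: all copies in, all kernels, all copies out
        for (int s = 0; s < nstreams; s++) {
            cudaMemcpyAsync(dev_a[s], host_a + i + s * N, chunk,
                            cudaMemcpyHostToDevice, stream[s]);
            cudaMemcpyAsync(dev_b[s], host_b + i + s * N, chunk,
                            cudaMemcpyHostToDevice, stream[s]);
        }
        for (int s = 0; s < nstreams; s++)
            kernel<<<N / 256, 256, 0, stream[s]>>>(dev_a[s], dev_b[s], dev_c[s]);
        for (int s = 0; s < nstreams; s++)
            cudaMemcpyAsync(host_c + i + s * N, dev_c[s], chunk,
                            cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < nstreams; s++) cudaStreamSynchronize(stream[s]);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    bool ok = true;
    for (int i = 0; i < FULL_DATA_SIZE; i++)
        if (host_c[i] != 3 * i) { ok = false; break; }

    for (int s = 0; s < nstreams; s++) {
        cudaFree(dev_a[s]); cudaFree(dev_b[s]); cudaFree(dev_c[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFreeHost(host_a); cudaFreeHost(host_b); cudaFreeHost(host_c);
    return ok ? elapsedTime : -1.0f;
}

int main(void)
{
    printf("one stream:  %3.1f ms\n", run_chunks(1));
    printf("two streams: %3.1f ms\n", run_chunks(2));
    return 0;
}
```

Each stream has its own device buffers, so a chunk being copied in one stream never overwrites data that a kernel in the other stream is still reading.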
129 Fermi: Parallel Kernels Before compute capability 2.0, only one kernel could run at a given time. Fermi GPUs can run multiple independent kernels, on different thread grids, simultaneously. (Source: techreport.com)
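From the programmer's side, kernels become candidates for concurrent execution when they are launched in different streams. The sketch below queries the device attribute cudaDevAttrConcurrentKernels and launches two small, independent kernels in two streams; the kernel scale and the function two_kernels are hypothetical names, and whether the two kernels really overlap is decided by the hardware and the driver.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *x, float factor, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= factor;
}

// Launches two independent kernels in two streams.
// Returns 1 if both results are correct, 0 otherwise.
int two_kernels(void)
{
    const int n = 4096;
    std::vector<float> h_x(n, 1.0f), h_y(n, 1.0f);
    float *d_x, *d_y;
    cudaStream_t s0, s1;

    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // two small grids on different data: neither kernel depends on the other
    scale<<<n / 256, 256, 0, s0>>>(d_x, 2.0f, n);
    scale<<<n / 256, 256, 0, s1>>>(d_y, 3.0f, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; i++)
        if (h_x[i] != 2.0f || h_y[i] != 3.0f) ok = 0;

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d_x); cudaFree(d_y);
    return ok;
}

int main(void)
{
    int dev = 0, concurrent = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentKernels, dev);
    printf("device %d %s run kernels concurrently\n",
           dev, concurrent ? "can" : "cannot");
    printf("results %s\n", two_kernels() ? "correct" : "wrong");
    return 0;
}
```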
130 Kepler: Dynamic Parallelism Kepler GPUs can generate new work for themselves and control the scheduling of that work without involving the CPU! (Source: NVIDIA)
131 Kepler: Hyper-Q For Fermi GPUs: 16-way concurrency of kernels from separate streams, but only one hardware work queue. For Kepler GPUs: 32 independent (at the hardware level) work queues! (Source: NVIDIA)
132 Multiple GPUs
133 No GPUDirect P2P Before, in order to transfer data between two GPUs, the CPU was taking care of the communication, generating several copies of the same set of data. (Source: NVIDIA)
134 With GPUDirect P2P Now, with CUDA 5.0 and the new Kepler hardware, it is possible to have direct access to GPU memory by third-party devices. (Source: NVIDIA)
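In the CUDA runtime, direct transfers between two GPUs of the same process go through the peer-to-peer calls shown in the sketch below. It is an illustration under assumptions: the function name peer_copy and the device numbers 0 and 1 are made up, and on a machine with fewer than two GPUs the function simply returns 0.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copies n ints from device 0 to device 1 and checks them on the host.
// Returns 1 on success, 0 if fewer than two GPUs are present, -1 on error.
int peer_copy(int n)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 0;

    size_t bytes = n * sizeof(int);
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d0 = NULL, *d1 = NULL;
    cudaSetDevice(0);
    cudaMalloc((void**)&d0, bytes);
    cudaMemcpy(d0, h_in.data(), bytes, cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    cudaMalloc((void**)&d1, bytes);

    // can device 1 address the memory of device 0 directly?
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);
    printf("peer access 1 -> 0: %s\n", canAccess ? "yes" : "no");

    // device-to-device copy; without peer access the runtime stages it via the host
    cudaError_t err = cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaMemcpy(h_out.data(), d1, bytes, cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess);
    for (int i = 0; i < n && ok; i++)
        if (h_out[i] != i) ok = 0;

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return ok ? 1 : -1;
}

int main(void)
{
    int r = peer_copy(1 << 20);
    if (r == 0) printf("fewer than two GPUs, nothing to do\n");
    else printf("peer copy %s\n", r == 1 ? "succeeded" : "failed");
    return 0;
}
```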
135 That's all Folks!
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationGPU Computing: Introduction to CUDA. Dr Paul Richmond
GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming
More informationVector Addition on the Device: main()
Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationAMS 148 Chapter 8: Optimization in CUDA, and Advanced Topics
AMS 148 Chapter 8: Optimization in CUDA, and Advanced Topics Steven Reeves 1 Optimizing Data Transfers in CUDA C/C++ This section we will discuss code optimization with how to efficiently transfer data
More informationReal-time Graphics 9. GPGPU
Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing
More informationIntroduction to CUDA 5.0
Introduction to CUDA 5.0 CUDA 5 In this article, I will introduce the reader to CUDA 5.0. I will briefly talk about the architecture of the Kepler GPU (Graphics Processing Unit) and I will show you how
More informationCSE 599 I Accelerated Computing - Programming GPUS. Advanced Host / Device Interface
CSE 599 I Accelerated Computing - Programming GPUS Advanced Host / Device Interface Objective Take a slightly lower-level view of the CPU / GPU interface Learn about different CPU / GPU communication techniques
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More information