Programação CUDA: um caminho verde para a computação de alto desempenho


1 Programação CUDA: um caminho verde para a computação de alto desempenho Attilio Cucchieri IFSC USP lattice/

2 THIRD LECTURE

3 OUTLINE OF THE THIRD LECTURE
CUDA architecture (II)
Two-dimensional grids
Global memory vs. shared memory
Data prefetching

7 CUDA Architecture (II)

11 Function Type Qualifiers
__global__ : functions with the __global__ qualifier are executed on the device but are callable from the host only; these functions must return void.
__device__ : functions with the __device__ qualifier are executed on the device and are callable from the device only.
__host__ : functions with the __host__ qualifier are executed on the host and are callable from the host only (this is the default).
The __host__ and __device__ qualifiers can be combined.
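As a minimal illustration (a sketch with hypothetical function names, not taken from the course examples), the qualifiers can appear together in a single .cu file:

    // __host__ __device__: compiled for both the CPU and the GPU
    __host__ __device__ float square(float x) { return x * x; }

    // __device__: callable only from device code
    __device__ float scale(float x) { return 2.0f * x; }

    // __global__: a kernel, launched from the host, must return void
    __global__ void kernel(float *v, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) v[i] = scale(square(v[i]));
    }

    // launched from the host, e.g. kernel<<<blocks, threads>>>(dev_v, n);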

15 Variable Type Qualifiers
__device__ : these variables are stored in global memory, they are accessible from all the threads within a grid and have the lifetime of the application; when using cudaMalloc, the __device__ qualifier is implied.
__constant__ : these variables are stored in constant memory, they are accessible from all the threads within a grid and have the lifetime of the application.
__shared__ : these variables are stored in shared memory, they are accessible by all threads in a block and have the lifetime of the block.
Unqualified: each thread has read/write access to unqualified variables; scalars and built-in vector types are stored in registers, while arrays and spilled registers go to local memory; they have the lifetime of the thread.
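A minimal sketch showing the four cases side by side (variable names are illustrative, not from the course code):

    __device__   float scaleFactor;        // global memory, lifetime of the application
    __constant__ float coeff[16];          // constant memory, read-only inside kernels

    __global__ void kernel(float *v) {
        __shared__ float tile[256];        // shared memory, one copy per block
        int i = threadIdx.x;               // unqualified scalar: lives in a register
        tile[i] = v[i + blockIdx.x * blockDim.x] * scaleFactor;
        __syncthreads();
        // ... further work on tile[] ...
    }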

20 GPU Drawbacks
Fixed amount of memory, e.g. 4-6 GB.
The CUDA architecture is not very flexible; it is usually not useful for problems which are not compute-intensive.
Codes must be rewritten (easier with OpenACC).
The GPU device is attached to the host CPU via the relatively slow PCIe bus (16 GB/s bi-directional).
Not all C/C++ features can be used inside a kernel (but new features are usually included in new versions, e.g. recursion is supported by NVIDIA hardware with compute capability 2.0 or higher).

21 The Reduction Step Revisited (I)
In the evaluation of the scalar product we used the reduction step (EXAMPLES/dot1.cu):

    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        if (cacheIndex % (2*i) == 0)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }

The problem with this scheme is that each active warp has only a few active threads. For example, with 256 threads per block, in the fourth iteration the partial sums are saved in cache[16i], with i ∈ [0,15], by the threads with threadIdx.x equal to 16i. Thus, each warp will have only two active threads. One possible solution is to consider (EXAMPLES2/dot2.cu)

    for (unsigned int i = 1; i < blockDim.x; i *= 2) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        __syncthreads();
    }

22 The Reduction Step Revisited (II)
With the new reduction scheme, for the same example considered before, in the fourth iteration the partial sums are still saved in cache[16i], with i ∈ [0,15], but this time the active threads have threadIdx.x equal to i, i.e. they belong to the same warp! Another possible improvement can be introduced by modifying the last kernel in the following way (EXAMPLES2/dot3.cu):

    int i = 1;
    while (i < blockDim.x/64) {
        int index = 2 * i * cacheIndex;
        if (index < blockDim.x)
            cache[index] += cache[index + i];
        i *= 2;
        __syncthreads();
    }

Note: the for loop becomes a while loop, which stops as soon as i reaches blockDim.x/64. How do we include the rest of the loop?

23 The Reduction Step Revisited (III)
Here is the rest of the loop (you need to disable the race detector in configure.ocelot):

    if (cacheIndex < 32) {
        int ic = cacheIndex*i;
        cache[ic] += cache[ic + 32*i];
        cache[ic] += cache[ic + 16*i];
        cache[ic] += cache[ic + 8*i];
        cache[ic] += cache[ic + 4*i];
        cache[ic] += cache[ic + 2*i];
        cache[ic] += cache[ic + i];
    }

Note: there is no explicit synchronization among the threads! Since cacheIndex < 32, all the active threads in the above code belong to the same warp. Thus, synchronization is automatic and enforced by the hardware! The above code is rather complicated because data are scattered: for example, in the fourth iteration, the partial sums are stored in cache[16i], with i ∈ [0,15]. This could also slow down memory access.

24 The Reduction Step Revisited (IV)
In order to solve this problem, we show below the final version of the kernel for the reduction step (EXAMPLES2/dot4.cu):

    int i = blockDim.x/2;
    while (i > 32) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex < 32) {
        cache[cacheIndex] += cache[cacheIndex + 32];
        cache[cacheIndex] += cache[cacheIndex + 16];
        cache[cacheIndex] += cache[cacheIndex + 8];
        cache[cacheIndex] += cache[cacheIndex + 4];
        cache[cacheIndex] += cache[cacheIndex + 2];
        cache[cacheIndex] += cache[cacheIndex + 1];
    }

25 The Reduction Step Revisited (V) The new reduction scheme, avoiding data scattering, is represented below. (Source: cs.uaf.edu)

27 Performance
Let us compare the running times for the four different codes presented here for the scalar product (codes dot1-mod.cu, dot2.cu, dot3.cu, dot4.cu in the directory EXAMPLES2), using the Tesla S1070 GPU. Here, we used vectors with elements.
First method and GPU timing using Events (dot1-mod.cu): ms.
Increasing the number of active threads in a warp (dot2.cu): ms.
Without explicit synchronization for threads in a warp (dot3.cu): ms.
Saving results in nearby memory positions (dot4.cu): ms.
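For reference, a minimal sketch of the GPU timing with CUDA events mentioned above (kernel and variable names here are illustrative, not taken from dot1-mod.cu):

    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    dot<<<blocksPerGrid, threadsPerBlock>>>( dev_a, dev_b, dev_partial );  // kernel to be timed
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );

    cudaEventElapsedTime( &elapsedTime, start, stop );   // result in milliseconds
    cudaEventDestroy( start );
    cudaEventDestroy( stop );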

28 Two-dimensional grids

31 2D Grids and Blocks
Up to now we have considered only one-dimensional grids and blocks. Since the predefined variables threadIdx and blockIdx are usually used to implement data parallelism in an easy way, it is clear that, when dealing with matrices, a two-dimensional structure can be useful. In order to illustrate the use of two-dimensional grids and blocks, we consider the product of two (square N x N) matrices (in C/C++ notation): C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}.

32 Matrix-Matrix Product (I)
A simple serial code can be written as (N^3 iterations, EXAMPLES2/matrix-matrix0.cu):

    #define N 1024
    int main( void ) {
        unsigned int isize = N;
        // allocate host memory for matrices A, B and C
        // initialize host memory
        // capture the start time
        for (int i=0; i < isize; i++)
            for (int j=0; j < isize; j++) {
                float Csub = 0;
                for (int k=0; k < isize; k++)
                    Csub += A[i*isize+k]*B[k*isize+j];
                C[i*isize+j] = Csub;
            }
        // capture the stop time
        // clean up memory
        return 0;
    }

33 Matrix-Matrix Product (II)
Recall that, in C/C++, for a matrix

    A[i][k]   // i = row index, k = column index

the column index k runs faster (row-major convention, i.e. the structure of rows is preserved). (Source: intel)

34 Matrix-Matrix Product (III)
Thus, when using only one index, we write

    A[i*N+k]   // instead of A[i][k], with i = row index, k = column index

and

    B[k*N+j]   // instead of B[k][j], with k = row index, j = column index

For the CUDA code we have to choose a grid/block structure (EXAMPLES2/matrix-matrix1.cu):

    #define BLOCK_SIZE 16
    #define N
    unsigned int isize = N;
    ...
    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/threads.y);
    ...

35 Matrix-Matrix Product (IV)
Recall that threadIdx.x can be considered the local x coordinate of a given thread (inside its own block) and, similarly, threadIdx.y can be considered the local y coordinate for the same thread. Likewise, blockIdx.x and blockIdx.y are the x and y coordinates of a given block inside the grid. Then, the global x and y coordinates of a thread inside the grid can be written as

    // Global x index for threads
    int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
    // Global y index for threads
    int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;

36 Matrix-Matrix Product (V)
And here goes the kernel:

    __global__ void matrixMul(float *C, float *A, float *B, int w)
    {
        int ix = threadIdx.x + BLOCK_SIZE*blockIdx.x;
        int iy = threadIdx.y + BLOCK_SIZE*blockIdx.y;
        float Csub = 0;
        for (int k=0; k < w; k++)
            Csub += A[iy*w+k] * B[k*w+ix];
        C[iy*w+ix] = Csub;   // C[iy][ix]
    }

Since each thread in the grid has its own (unique) set of global coordinates (ix,iy), each thread evaluates one element of the matrix C, i.e. C[iy][ix] = C[iy*w+ix].

41 Results
Using CUDA and two-dimensional grids we can easily parallelize the matrix-matrix product! What about performance?
On my desktop the CPU serial code needed ms for a matrix.
On the Tesla S1070 GPU the parallel code needed ms for the same matrix, about 270 times faster than the serial code!
On the same GPU the parallel code needed ms for a matrix, about 6.4 times slower (but for the serial code the time increased by a factor 8.8).

42 Global memory vs. shared memory

43 Improving the Serial Code
In the innermost loop of the serial code (with index k)

    for (int i=0; i < isize; i++)
        for (int j=0; j < isize; j++) {
            float Csub = 0;
            for (int k=0; k < isize; k++)
                Csub += A[i*isize+k]*B[k*isize+j];
            C[i*isize+j] = Csub;
        }

the elements of the matrix A[i*isize+k] are accessed sequentially (stride of one). On the contrary, for the matrix B[k*isize+j] we have a stride of isize (= 1024 in our case). This is likely to produce frequent cache misses and page faults.

44 Transposing B
We can get better performance by transposing the matrix B (EXAMPLES2/matrix-matrix0t.cu):

    float tmpb;
    for (int i=0; i < isize; i++)
        for (int j=0; j < i; j++) {
            tmpb = B[i*isize+j];
            B[i*isize+j] = B[j*isize+i];
            B[j*isize+i] = tmpb;
        }

Then, the innermost loop of the serial code becomes

    for (int k=0; k < isize; k++)
        Csub += A[i*isize+k]*B[j*isize+k];

and both matrices can be accessed with a stride of one! On my desktop, the computing time (including the transposition) goes down from ms to ms!

45 Improving the CUDA Code (I)
In order to improve the CUDA code, the easiest thing to do is to use the shared memory: i) data stored in shared memory are visible to all threads within a block and last for the duration of the block; ii) in the best-case scenario, shared-memory performance is comparable to register memory. This can be done in the following way: (Source: codinggorilla.domemtech)

46 Improving the CUDA Code (II)
The kernel becomes (EXAMPLES2/matrix-matrix2.cu):

    float Csub = 0;
    for (int a = aini, b = bini; a <= aend; a += astep, b += bstep) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

Each thread, with local (block) coordinates (threadIdx.x, threadIdx.y), copies one element of A and B to the matrices As and Bs, stored in shared memory. For each pair of sub-matrices As and Bs we evaluate the matrix-matrix product. After summing these partial contributions, we save the result in C[iy*w+ix] (as before).

47 Improving the CUDA Code (III)
We only need now to define the matrices As and Bs

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

and to set up the range of values of the indices a and b. In the code we have:

    int aini  = w*BLOCK_SIZE*blockIdx.y;
    int aend  = aini + w - 1;
    int astep = BLOCK_SIZE;
    int bini  = BLOCK_SIZE*blockIdx.x;
    int bstep = BLOCK_SIZE*w;
    int ib    = threadIdx.x + w*threadIdx.y;

In the kernel we consider A[a + ib] and B[b + ib]. One can check that:

    a + ib = (w*BLOCK_SIZE*blockIdx.y) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE)
    b + ib = (BLOCK_SIZE*blockIdx.x) + (threadIdx.x + w*threadIdx.y) + (m*BLOCK_SIZE*w)

where m = 0, 1, ..., w/BLOCK_SIZE - 1 indicates which sub-blocks are considered.

53 Performance
For a matrix we find (on the Tesla S1070 GPU):
ms with the original CUDA code;
715 ms with the new CUDA code, about 300 times faster!
Do we get a better result if we transpose the matrix B inside the kernel (EXAMPLES2/matrix-matrix2t.cu)? NOT AT ALL! The test gives ms! WHY?

54 Shared Memory Organization Shared memory is divided into 32 logical banks (32-bit words, 4 bytes). Since there are 32 threads in a warp and each bank services only one request per cycle, multiple simultaneous accesses to the same bank will result in what is known as a bank conflict. (Source: microway)
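As a quick illustration (a sketch, not taken from the course code), the bank of a 4-byte shared-memory word can be computed from its byte offset:

    // Bank index of a 4-byte shared-memory word, assuming 32 banks of 32-bit words.
    __host__ __device__ int bankOfWord(unsigned int byteOffset)
    {
        return (byteOffset / 4) % 32;
    }

    // E.g. for __shared__ float Bs[16][16], element Bs[row][col] lies at byte offset
    // 4*(16*row + col), so bankOfWord() gives (16*row + col) % 32.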

55 No Bank Conflict (Source: 3dgep)

56 Two-Way Bank Conflict (Source: 3dgep)

57 Bank Conflict with Bs Transposed
A bank conflict occurs when two or more threads in a warp access different words in the same bank. There is no bank conflict if different threads access the same word, or different bytes of the same word, in the same bank. In the original code we had

    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];

and for a given k the elements of Bs[k][threadIdx.x] have global index k*BLOCK_SIZE + threadIdx.x, with threadIdx.x = 0, 1, ..., BLOCK_SIZE - 1 (BLOCK_SIZE = 16), and belong to 16 different banks. After transposing the matrix Bs, we have

    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[threadIdx.x][k];

and for a given k the elements of Bs[threadIdx.x][k] have global index threadIdx.x*BLOCK_SIZE + k. In our case this means threadIdx.x*16 + k, and the 16 elements for threadIdx.x = 0, 1, ..., BLOCK_SIZE - 1 belong to only 2 banks!
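A standard remedy, mentioned here only as a sketch and not used in the course code, is to pad the transposed tile by one column, so that consecutive rows start in different banks:

    // With 17 floats per row, Bs[threadIdx.x][k] sits at word offset threadIdx.x*17 + k;
    // since gcd(17,32) = 1, the 16 accesses for fixed k fall into 16 different banks.
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE + 1];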

58 Data prefetching

62 Global Memory Access
Global memory access usually has limited bandwidth and data access can take a long time to complete. The CUDA threading model tolerates long memory-access latency by allowing some warps to make progress while others wait for their memory-access results. This strategy usually works fine if all threads have a large number of independent instructions between memory-access instructions and the use of the accessed data. When this is not the case, a possible solution is to prefetch the next data while using the data already available.

63 Scheme of the Original Code
In our case (matrix-matrix product) the kernel was organized in the following way:

    Csub = 0;
    for (loop) {
        As[threadIdx.y][threadIdx.x] = A[a + ib];
        Bs[threadIdx.y][threadIdx.x] = B[b + ib];
        __syncthreads();
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

64 Scheme of the New Code
The same code with prefetching becomes:

    Csub = 0;
    // prefetch first data into registers
    for (loop) {
        // copy data from registers to shared memory
        __syncthreads();
        // prefetch next data into registers
        // compute Csub
        __syncthreads();
    }
    C[ix+w*iy] = Csub;

65 The New Kernel (I)
And here is the modified kernel (EXAMPLES2/matrix-matrix3.cu):

    float Csub = 0;
    int a = aini;
    int b = bini;
    float atmp = A[aini+ib];
    float btmp = B[bini+ib];
    for (int m=0; m<(w/BLOCK_SIZE)-1; m++) {
        As[threadIdx.y][threadIdx.x] = atmp;
        Bs[threadIdx.y][threadIdx.x] = btmp;
        __syncthreads();
        a += astep;
        b += bstep;
        atmp = A[a + ib];
        btmp = B[b + ib];

66 The New Kernel (II)
The last step is done outside the loop and without prefetching:

        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    As[threadIdx.y][threadIdx.x] = atmp;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; k++)
        Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
    C[ix+w*iy] = Csub;

71 Performance and Possible Problems
With prefetching, for a matrix, we find (on the Tesla S1070 GPU) a computing time of ms, to be compared to 715 ms without prefetching.
NOTE: with prefetching we need to use more registers. If too many registers are used, we can have register spilling to local memory.
NOTE: registers do not support indexing, so local memory is used for arrays.
One can check the usage of registers and the spilling to local memory with the compiler option --ptxas-options=-v, and fix the maximum number of registers per thread with -maxrregcount=x.
More generally, the execution resources (such as registers) of the streaming multiprocessors (SMX) are dynamically partitioned and assigned at runtime, and one should check whether resources are under-utilized.
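For example, a hypothetical compilation line using these standard nvcc options (file names are only illustrative of the course examples):

    # report per-kernel register usage and any spills to local memory
    nvcc --ptxas-options=-v -o matrix-matrix3 matrix-matrix3.cu

    # additionally cap register usage at 32 registers per thread
    nvcc --ptxas-options=-v --maxrregcount=32 -o matrix-matrix3 matrix-matrix3.cu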

78 Resources for the K20 GPUs
For the K20 GPUs the available resources (at a given moment) are:
max number of threads per block: 1024;
threads per warp: 32;
max number of warps per SMX: 64;
max number of blocks per SMX: 16;
max number of threads per SMX: 2048;
max number of 32-bit registers per thread: 255.
Recall that each SMX has 64K 32-bit registers.
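Most of these limits can also be read at run time with the standard CUDA runtime API (a minimal sketch, querying device 0):

    #include <cstdio>

    int main( void )
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, 0 );          // properties of device 0
        printf( "max threads per block:      %d\n", prop.maxThreadsPerBlock );
        printf( "warp size:                  %d\n", prop.warpSize );
        printf( "max threads per SM(X):      %d\n", prop.maxThreadsPerMultiProcessor );
        printf( "32-bit registers per block: %d\n", prop.regsPerBlock );
        printf( "shared memory per block:    %zu bytes\n", (size_t)prop.sharedMemPerBlock );
        return 0;
    }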

84 Occupancy: an Example
We can consider the following example:
in the matrix-matrix product, each block has 16 x 16 = 256 threads, corresponding to 8 warps;
each thread requires 21 registers, i.e. 256 x 21 = 5376 registers per block;
considering registers, we can have at most 65536/5376 = 12 blocks in an SMX, below the max number of 16;
with 12 blocks we would have a total of 3072 threads, which is more than the max number of 2048;
equivalently, with 12 blocks we would have a total of 96 warps, which is more than the max number of 64;
the scheduler can therefore keep 2048/256 = 8 blocks (64 warps) resident per SMX.
In this case the number of registers used is not a limitation and we can have 100% occupancy.

90 Occupancy: a Second Example
Here is another example:
in the matrix-matrix product, each block has 16 x 16 = 256 threads, corresponding to 8 warps;
each thread requires 38 registers, i.e. 256 x 38 = 9728 registers per block;
considering registers, we can have at most 65536/9728 = 6 blocks in an SMX, below the max number of 16;
with 6 blocks we have a total of 1536 threads, which is less than the max number of 2048;
equivalently, with 6 blocks we have a total of 48 warps, which is less than the max number of 64;
In this case the number of registers used is a limitation and we can have at most 75% occupancy (48 out of 64 warps).
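The same bookkeeping can be written as a small host-side helper (a sketch with hypothetical names; the K20 limits are hard-coded as assumptions):

    #include <stdio.h>

    // Resident warps per SMX for a K20-class GPU, limited by registers, threads and blocks.
    int residentWarps(int threadsPerBlock, int regsPerThread)
    {
        const int regsPerSMX = 65536, maxThreadsSMX = 2048;
        const int maxBlocksSMX = 16, warpSize = 32;

        int regsPerBlock    = threadsPerBlock * regsPerThread;
        int blocksByRegs    = regsPerSMX / regsPerBlock;        // register limit
        int blocksByThreads = maxThreadsSMX / threadsPerBlock;  // thread limit
        int blocks = blocksByRegs;
        if (blocksByThreads < blocks) blocks = blocksByThreads;
        if (maxBlocksSMX    < blocks) blocks = maxBlocksSMX;    // block-count limit
        return blocks * threadsPerBlock / warpSize;
    }

    int main(void)
    {
        printf("%d\n", residentWarps(256, 21));   // 64 warps -> 100% occupancy
        printf("%d\n", residentWarps(256, 38));   // 48 warps ->  75% occupancy
        return 0;
    }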

96 Exercises
Consider a kernel with 256 threads, using 4 KB of shared memory. Recalling that on the K20 GPU the SMX can be configured with 16, 32 or 48 KB of shared memory, which configuration would you choose to guarantee 100% occupancy?
In the same situation as the previous exercise, each thread now uses 20 registers. Do you need to change the shared-memory configuration to guarantee 100% occupancy?
Now each thread uses 40 registers. What could you do in this case to guarantee 100% occupancy?

97 FOURTH LECTURE

101 OUTLINE OF THE FOURTH LECTURE
Thread granularity
Page-locked host memory
Streams
Multiple GPUs

102 Thread granularity

105 Fewer Threads
Sometimes it is more advantageous to use fewer threads and put more work into each thread. In particular, this is the case if some redundant work exists between threads. In the case of the matrix-matrix product using shared memory, several blocks of threads load the same sub-matrix into shared memory.

106 The Scheme of the Algorithm
We can improve the code by changing the granularity (i.e. the extent to which the work is broken down into small parts), by considering one column of sub-matrices of B together with several rows of sub-matrices of A (or vice versa). (Source: 3dgep)

107 The New Code (I)
We start by calling the kernel with 1/4 of the original number of blocks and, therefore, 1/4 of the original number of threads (EXAMPLES2/matrix-matrix3aa.cu):

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(isize/threads.x, isize/(4*threads.y));
    xmul<<< grid, threads >>>(dev_c, dev_a, dev_b, isize);

Note that we have reduced the value of gridDim.y, i.e. the grid is not square anymore. Inside the kernel we will have to work with 4 sub-matrices As

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE][4];

and have 4 global y coordinates for threads

    int iy1 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y);
    int iy2 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+1);
    int iy3 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+2);
    int iy4 = threadIdx.y + BLOCK_SIZE*(4*blockIdx.y+3);

108 The New Code (II)
Every instruction involving the As sub-matrix is replicated 4 times:

    ...
    As[threadIdx.y][threadIdx.x][0] = atmp1;
    As[threadIdx.y][threadIdx.x][1] = atmp2;
    As[threadIdx.y][threadIdx.x][2] = atmp3;
    As[threadIdx.y][threadIdx.x][3] = atmp4;
    Bs[threadIdx.y][threadIdx.x] = btmp;
    __syncthreads();
    atmp1 = A[a1 + ib];
    atmp2 = A[a2 + ib];
    atmp3 = A[a3 + ib];
    atmp4 = A[a4 + ib];
    btmp = B[b + ib];
    ...

109 The New Code (III)
At the end, each thread in a block evaluates four elements of the C matrix.

    ...
    C[ix+w*iy1] = Csub1;
    C[ix+w*iy2] = Csub2;
    C[ix+w*iy3] = Csub3;
    C[ix+w*iy4] = Csub4;
    ...

With the new code, for a matrix, we find (on the Tesla S1070 GPU) a computing time of ms, to be compared to ms with the old code (the one with prefetching).

110 Page-locked host memory

114 The GPU Takes Over
Instead of using cudaMalloc() we can organize the code using cudaHostAlloc(), which allocates host memory inside CUDA. The host memory allocated in this way will be under the direct control of the GPU, using direct memory access (DMA). In particular, this memory will be page-locked (pinned memory), i.e. it will not be available to the OS until it is freed (using cudaFreeHost()).
NOTE: if too much host memory is page-locked, the system can run out of memory...

115 How to Use cudaHostAlloc()
We consider a function that checks the speed of copying data to/from the GPU.

    ...
    cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    cudaFreeHost( a );
    ...

116 The Same with cudaMalloc()
The same function without pinned memory.

    ...
    a = (int*)malloc( size * sizeof( *a ) );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );
    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *a ), cudaMemcpyDeviceToHost );
    }
    // evaluate elapsedTime
    free( a );
    ...

117 The Main Code
The main code times the use of cudaMemcpy with and without pinned memory (EXAMPLES2/copy_timed.cu).

    int main( void ) {
        float elapsedTime;
        float MB = (float)100*SIZE*sizeof(int)/1024/1024;
        // try it with cudaMalloc, HostToDevice
        elapsedTime = cuda_malloc_test( SIZE, true );
        printf( "Time using cudaMalloc: %3.1f ms\n", elapsedTime );
        printf( "\tMB/s during copy up: %3.1f\n", MB/(elapsedTime/1000) );
        // try it with cudaMalloc, DeviceToHost
        ...
        // now try it with cudaHostAlloc, HostToDevice
        ...
        // and cudaHostAlloc, DeviceToHost
        ...
    }

118 Timing
The results for the timing, using the Tesla S1070 GPU, are:
cudaMalloc, HostToDevice: ms (with MB/s)
cudaMalloc, DeviceToHost: ms (with MB/s)
cudaHostAlloc, HostToDevice: ms (with MB/s)
cudaHostAlloc, DeviceToHost: ms (with MB/s)
Thus, if one needs to copy the same set of data back and forth with cudaMemcpy, and this does not require too much memory, it is usually better to use cudaHostAlloc.

119 Streams

123 Organizing the GPU Workload
A sequence of GPU operations that should be executed in a specific order can be organized using streams. In a stream we can have memory copies, kernel launches, events (start and stop), etc. Using multiple streams, we can sometimes accelerate computing. We will see first how to organize one stream.

124 One Stream
What the kernel does in this example is not very important. We concentrate on the main program (EXAMPLES2/basic_single_stream.cu).

    int main( void ) {
        cudaDeviceProp prop;
        int whichDevice;
        cudaGetDevice( &whichDevice );
        cudaGetDeviceProperties( &prop, whichDevice );
        if (!prop.deviceOverlap) {
            printf( "Device will not handle overlaps, so no speed up from streams\n" );
            return 0;
        }
        ...
        cudaStream_t stream;
        ...
        cudaStreamCreate( &stream );

125 Asynchronous cudaMemcpy
In order to make the use of streams effective, we need to copy data asynchronously.

    ...
    // various cudaMalloc, cudaHostAlloc
    ...
    cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
    ...
    cudaEventRecord( start, 0 );
    for (int i=0; i<FULL_DATA_SIZE; i+= N) {
        // copy the locked memory to the device, async
        cudaMemcpyAsync( dev_a, host_a+i, N * sizeof(int), cudaMemcpyHostToDevice, stream );

126 Completing the Stream
Finally, we complete the main program and destroy the stream.

        ...
        kernel<<<N/256,256,0,stream>>>( dev_a, dev_b, dev_c );
        cudaMemcpyAsync( host_c+i, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost, stream );
    }
    cudaStreamSynchronize( stream );
    // evaluate elapsedTime
    // cleanup the stream and memory
    cudaStreamDestroy( stream );
    return 0;
    }

All the operations in a stream are executed in the given order, i.e. an operation starts only when the previous one (in the same stream) has been completed.

127 Two Streams and Speed-Up
With only one stream we can, for example, do some work on the CPU while the GPU is busy completing the stream. But with two streams we could reduce the computing time. (Source: W. Xiao's talk)

128 Two-Stream Code
Here, we will use the two streams inside the for loop over the data. Outside the loop, we need to duplicate all the definitions. Inside the loop, the most effective way of working with two streams is (usually) to alternate between them (EXAMPLES2/basic_double_stream.cu). The computing time changed from 83.3 ms to 72.8 ms.

    for (int i=0; i<FULL_DATA_SIZE; i+= N*2) {
        cudaMemcpyAsync( dev_a0, host_a+i,   N * sizeof(int), cudaMemcpyHostToDevice, stream0 );
        cudaMemcpyAsync( dev_a1, host_a+i+N, N * sizeof(int), cudaMemcpyHostToDevice, stream1 );
        ...
        kernel<<<N/256,256,0,stream0>>>( dev_a0, dev_b0, dev_c0 );
        kernel<<<N/256,256,0,stream1>>>( dev_a1, dev_b1, dev_c1 );
        ...
    }

129 Fermi: Parallel Kernels
Before compute capability 2.0, only one kernel could run at a given time. The Fermi GPUs have the ability to run multiple, independent kernels on different thread grids simultaneously. (Source: techreport.com)

130 Kepler: Dynamic Parallelism Kepler GPUs can generate new work for themselves and control the scheduling of that work without involving the CPU! (Source: NVIDIA)
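As a minimal sketch of dynamic parallelism (hypothetical kernel names; compiling it requires compute capability 3.5 and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

    __global__ void childKernel(float *data, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n)
    {
        // a single thread of the parent grid launches new work directly on the GPU,
        // without returning control to the CPU
        if (threadIdx.x == 0 && blockIdx.x == 0)
            childKernel<<<(n + 255)/256, 256>>>(data, n);
    }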

131 Kepler: Hyper-Q
For Fermi GPUs: 16-way concurrency of kernels, from separate streams, but only one hardware work queue. For Kepler GPUs: 32 independent hardware work queues! (Source: NVIDIA)

132 Multiple GPUs

133 No GPUDirect P2P Before, in order to transfer data between two GPUs, the CPU was taking care of the communication, generating several copies of the same set of data. (Source: NVIDIA)

134 With GPUDirect P2P Now, with CUDA 5.0 and the new Kepler hardware, it is possible to have direct access to GPU memory by third-party devices. (Source: NVIDIA)
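A minimal sketch of peer-to-peer access with the CUDA runtime API (device numbers, buffer names and sizes are illustrative, not from the course code):

    int canAccess = 0;
    cudaDeviceCanAccessPeer( &canAccess, 0, 1 );     // can device 0 access device 1's memory?
    if (canAccess) {
        cudaSetDevice( 0 );
        cudaDeviceEnablePeerAccess( 1, 0 );          // second argument: flags (must be 0)
        // direct copy between the two GPUs, without staging through host memory;
        // dev_buf0 and dev_buf1 are device pointers allocated on GPU 0 and GPU 1
        cudaMemcpyPeer( dev_buf0, 0, dev_buf1, 1, nbytes );
    }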

135 That's all, Folks!


More information

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci Outline of CUDA Basics Basic Kernels and Execution on GPU

More information

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Outline Evolution of GPU Computing Heterogeneous Computing CUDA Execution Model & Walkthrough of Hello World Walkthrough : 1D Stencil Once upon

More information

Lecture 10. Efficient Host-Device Data Transfer

Lecture 10. Efficient Host-Device Data Transfer 1 Lecture 10 Efficient Host-Device Data fer 2 Objective To learn the important concepts involved in copying (transferring) data between host and device System Interconnect Direct Memory Access Pinned memory

More information

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh Learn CUDA in an Afternoon Alan Gray EPCC The University of Edinburgh Overview Introduction to CUDA Practical Exercise 1: Getting started with CUDA GPU Optimisation Practical Exercise 2: Optimising a CUDA

More information

GPU Accelerated Application Performance Tuning. Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group

GPU Accelerated Application Performance Tuning. Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group GPU Accelerated Application Performance Tuning Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group Main Requirements for GPU Performance Expose sufficient parallelism Use

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

CUDA C/C++ BASICS. NVIDIA Corporation

CUDA C/C++ BASICS. NVIDIA Corporation CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions

More information

Fundamental Optimizations

Fundamental Optimizations Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled

More information

Efficient Data Transfers

Efficient Data Transfers Efficient Data fers Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 PCIE Review Typical Structure of a CUDA Program Global variables declaration Function prototypes global

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information

COSC 462 Parallel Programming

COSC 462 Parallel Programming November 22, 2017 1/12 COSC 462 Parallel Programming CUDA Beyond Basics Piotr Luszczek Mixing Blocks and Threads int N = 100, SN = N * sizeof(double); global void sum(double *a, double *b, double *c) {

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

GPU Computing with CUDA

GPU Computing with CUDA GPU Computing with CUDA Dan Negrut Simulation-Based Engineering Lab Wisconsin Applied Computing Center Department of Mechanical Engineering University of Wisconsin-Madison Dan Negrut, 2012 UW-Madison Milano

More information

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017 CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

CUDA C/C++ BASICS. NVIDIA Corporation

CUDA C/C++ BASICS. NVIDIA Corporation CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions

More information

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory

More information

CUDA Advanced Techniques 2 Mohamed Zahran (aka Z)

CUDA Advanced Techniques 2 Mohamed Zahran (aka Z) CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming CUDA Advanced Techniques 2 Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Alignment Memory Alignment Memory

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Basics of CADA Programming - CUDA 4.0 and newer

Basics of CADA Programming - CUDA 4.0 and newer Basics of CADA Programming - CUDA 4.0 and newer Feb 19, 2013 Outline CUDA basics Extension of C Single GPU programming Single node multi-gpus programing A brief introduction on the tools Jacket CUDA FORTRAN

More information

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory CUDA Exercises CUDA Programming Model 05.05.2015 Lukas Cavigelli ETZ E 9 / ETZ D 61.1 Integrated Systems Laboratory Exercises 1. Enumerate GPUs 2. Hello World CUDA kernel 3. Vectors Addition Threads and

More information

Advanced CUDA Programming. Dr. Timo Stich

Advanced CUDA Programming. Dr. Timo Stich Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture

More information

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

GPU Computing: Introduction to CUDA. Dr Paul Richmond

GPU Computing: Introduction to CUDA. Dr Paul Richmond GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming

More information

Vector Addition on the Device: main()

Vector Addition on the Device: main() Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space

More information

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

AMS 148 Chapter 8: Optimization in CUDA, and Advanced Topics

AMS 148 Chapter 8: Optimization in CUDA, and Advanced Topics AMS 148 Chapter 8: Optimization in CUDA, and Advanced Topics Steven Reeves 1 Optimizing Data Transfers in CUDA C/C++ This section we will discuss code optimization with how to efficiently transfer data

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

Introduction to CUDA 5.0

Introduction to CUDA 5.0 Introduction to CUDA 5.0 CUDA 5 In this article, I will introduce the reader to CUDA 5.0. I will briefly talk about the architecture of the Kepler GPU (Graphics Processing Unit) and I will show you how

More information

CSE 599 I Accelerated Computing - Programming GPUS. Advanced Host / Device Interface

CSE 599 I Accelerated Computing - Programming GPUS. Advanced Host / Device Interface CSE 599 I Accelerated Computing - Programming GPUS Advanced Host / Device Interface Objective Take a slightly lower-level view of the CPU / GPU interface Learn about different CPU / GPU communication techniques

More information

Lab 1 Part 1: Introduction to CUDA

Lab 1 Part 1: Introduction to CUDA Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using

More information