Tiled Matrix Multiplication
- Earl Alvin Melton
- 5 years ago
1 Tiled Matrix Multiplication
2 Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  int Col = blockIdx.x*blockDim.x + threadIdx.x;
  if ((Row < m) && (Col < k)) {
    float Cvalue = 0.0;
    for (int i = 0; i < n; ++i)  /* A[Row, i] and B[i, Col] */
      Cvalue += A[Row*n+i] * B[Col+i*k];
    C[Row*k+Col] = Cvalue;
  }
}
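As a sanity check on the linearized indexing, the kernel's loop can be run sequentially on the CPU. This is a sketch, not part of the slides; the name matmul_ref is ours:

```c
#include <assert.h>

/* Sequential equivalent of the kernel body: A is m x n, B is n x k, C is m x k,
 * all stored row-major in flat arrays, so A[Row][i] = A[Row*n + i] and
 * B[i][Col] = B[i*k + Col] (the kernel writes the latter as B[Col + i*k]). */
void matmul_ref(int m, int n, int k, const float *A, const float *B, float *C) {
    for (int Row = 0; Row < m; ++Row)
        for (int Col = 0; Col < k; ++Col) {
            float Cvalue = 0.0f;
            for (int i = 0; i < n; ++i)
                Cvalue += A[Row*n + i] * B[i*k + Col];
            C[Row*k + Col] = Cvalue;
        }
}
```

For a 2x3 A times a 3x2 B this reproduces the usual hand-computed dot products.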
3 Global Memory Access Pattern in MM: Work in Block (0,0)
[Figure: 4x4 A, 4x3 B, 4x3 C; rows 0 and 1 of A and columns 0 and 1 of B are read to produce the C elements of Block (0,0).]
5 Global Memory Access Pattern in MM: Consider threads (0,0) and (0,1). Both step through row 0 of A (A0,0 A0,1 A0,2 A0,3 ...), while thread (0,0) reads column 0 of B and thread (0,1) reads column 1; the same A row is fetched from global memory by each thread.
7 Performance on Fermi GPU
All threads access global memory for their input matrix elements.
Two memory accesses (one A element and one B element, 8 bytes) per floating-point multiply-add (2 ops), i.e., 4B of memory read per FLOP.
Peak floating-point rate is 1TF = 1,000 GFLOPS, so 4*1,000 = 4,000 GB/s would be required to achieve the peak FLOP rating.
Reality: Fermi supports 150 GB/s, giving an upper FLOPS limit of 37.5 GFLOPS; the actual code runs at about 25 GFLOPS.
Need to cut down global memory accesses (raise the compute-to-global-memory-access ratio) to get close to 1TF.
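The bandwidth bound above is simple arithmetic; as an illustrative sketch (the helper name is ours), it can be written as:

```c
#include <assert.h>

/* Upper bound on achievable GFLOPS when every FLOP requires
 * bytes_per_flop bytes of global-memory traffic:
 *   GFLOPS_max = bandwidth_GBps / bytes_per_flop */
double bandwidth_bound_gflops(double bandwidth_gbps, double bytes_per_flop) {
    return bandwidth_gbps / bytes_per_flop;
}
```

With Fermi's 150 GB/s and 4 bytes per FLOP this gives the 37.5 GFLOPS ceiling quoted above.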
14 Shared Memory Tiling/Blocking
Divide the global memory content into tiles. Focus the computation of threads on one or a small number of tiles at each point in time.
16 Basic Concept of Blocking/Tiling
More efficient if tiled data exhibit good spatial locality.
Shared memory (SM) should be large enough to accommodate all the tiled data.
Needs synchronization among the threads sharing a tile.
20 Tiling Technique
Identify a tile of global memory content that is accessed by multiple threads.
Load the tile from global memory into on-chip memory.
Have the multiple threads access their data from the on-chip memory.
Move on to the next tile.
24 Tiled Matrix Multiplication: loading a tile, phased execution, barrier synchronization
25 Basic Matrix Multiplication: C[Row,Col] = dot product of row Row of A and column Col of B. [Figure: A is m x n, B is n x k, C is m x k.]
27 Matrix Multiplication
Elements A[Row,...] are used in calculating elements C[Row,...]: all threads updating C[Row,...] will access A[Row], and in the basic kernel each such thread loads all of A[Row] from global memory.
Likewise, elements B[...,Col] are used in calculating elements C[...,Col].
32 Tiled Matrix Multiplication
Compromise: load elements from A[Row,...] and B[...,Col] in phases.
All the threads participate in loading the A and B elements; all the threads compute the partial product and accumulate it into C; then move to the next phase and repeat.
36 Tiled Matrix Multiplication
All threads in a block participate; each thread loads one A element and one B element in the tiled code.
37 Tiled Matrix Multiplication: rows in A and columns in B needed for threads in Block (0,0). [Figure: 4x4 A, B, and C; Block (0,0) produces C0,0 C0,1 C1,0 C1,1 and therefore needs rows 0-1 of A and columns 0-1 of B.]
38 Tiled Matrix Multiplication, Load Phase 0: threads (0,0), (0,1), (1,0), (1,1) load A0,0 A0,1 A1,0 A1,1 and B0,0 B0,1 B1,0 B1,1 into shared memory.
39 Tiled Matrix Multiplication, Computation (Phase 0):
C0,0 = A0,0*B0,0 + A0,1*B1,0
C0,1 = A0,0*B0,1 + A0,1*B1,1
C1,0 = A1,0*B0,0 + A1,1*B1,0
C1,1 = A1,0*B0,1 + A1,1*B1,1
For these 4x4 matrices with TILE_WIDTH = 2, the full dot products need 2 iterations (phases).
41 Tiled Matrix Multiplication, Load Phase 1: threads (0,0), (0,1), (1,0), (1,1) load A0,2 A0,3 A1,2 A1,3 and B2,0 B2,1 B3,0 B3,1 into shared memory.
42 Tiled Matrix Multiplication, Computation (Phase 1):
C0,0 += A0,2*B2,0 + A0,3*B3,0
C0,1 += A0,2*B2,1 + A0,3*B3,1
C1,0 += A1,2*B2,0 + A1,3*B3,0
C1,1 += A1,2*B2,1 + A1,3*B3,1
43 Matrix Multiplication Kernel: Shared Memory Variable Declaration
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
  ...
}
44 Tiled Matrix Multiplication
How many memory accesses are reduced? In the example, each value from A and B is loaded once and used twice; in the basic implementation, each value is loaded once and used once. Memory bandwidth is reduced by 50%.
48 Tiled Matrix Multiplication
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of A and one tile of B.
49 Barrier Synchronization
API call: __syncthreads()
All threads in the same block must reach the __syncthreads() before any can move on.
Best used to coordinate tiled algorithms: to ensure that all elements of a tile are loaded, and to ensure that all elements of a tile are consumed.
52 Barrier Synchronization: __syncthreads()
[Figure: threads 0 through n-1 reaching a barrier at different times; none proceeds until all arrive.]
Barriers can significantly reduce the number of active threads in a block.
53 Tiled Matrix Multiplication: thread (tx,ty) loads elements of A and B. [Figure: in phase t, the tile covers columns t*TW through (t+1)*TW-1.]
54 Load Phase 0 of a Thread
Row = by * blockDim.y + ty;
Col = bx * blockDim.x + tx;
Thread (tx,ty) loads A[Row][tx] and B[ty][Col].
55 Load Phase 1 of a Thread
Thread (tx,ty) loads A[Row][1*TILE_WIDTH+tx] and B[1*TILE_WIDTH+ty][Col].
56 Linear Address
A[Row][t*TILE_WIDTH+tx] = A[Row*n + t*TILE_WIDTH + tx]
B[t*TILE_WIDTH+ty][Col] = B[(t*TILE_WIDTH+ty)*k + Col]
t is the tile sequence number of the current phase.
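These linearizations are easy to verify in plain C; in this sketch the helper names lin_a and lin_b are ours:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* Flat offset of A[Row][t*TILE_WIDTH + tx] in a row-major m x n matrix. */
int lin_a(int Row, int t, int tx, int n) {
    return Row*n + t*TILE_WIDTH + tx;
}

/* Flat offset of B[t*TILE_WIDTH + ty][Col] in a row-major n x k matrix. */
int lin_b(int t, int ty, int Col, int k) {
    return (t*TILE_WIDTH + ty)*k + Col;
}
```

For n = k = 4, phase t = 1 and thread (tx,ty) = (0,1) in row 1, lin_a picks A[1][2] (offset 6) and lin_b picks B[3][1] (offset 13), matching the Phase 1 tiles shown earlier.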
57 Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Cvalue = 0;
  ...
61 Tiled Matrix Multiplication Kernel
  ...
  // Loop over the A and B tiles required to compute the C element
  for (int t = 0; t < n/TILE_WIDTH; ++t) {
    // Collaborative loading of A and B tiles into shared memory
    ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
    ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
      Cvalue += ds_A[ty][i] * ds_B[i][tx];
    __syncthreads();
  }
  C[Row*k + Col] = Cvalue;
}
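To convince yourself that the phased algorithm computes the same result as the basic kernel, it can be emulated on the CPU. This is a sketch under the kernel's assumption that the dimensions are multiples of TILE_WIDTH; the function name tile_matmul is ours:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* CPU emulation of one output tile's phased computation: computes the
 * TILE_WIDTH x TILE_WIDTH tile of C at block position (by, bx), assuming
 * m, n, k are all multiples of TILE_WIDTH. A is m x n, B is n x k, C is m x k. */
void tile_matmul(int m, int n, int k, const float *A, const float *B,
                 float *C, int by, int bx) {
    (void)m; /* bounds checks are unnecessary when sizes are tile multiples */
    float ds_A[TILE_WIDTH][TILE_WIDTH], ds_B[TILE_WIDTH][TILE_WIDTH];
    float Cvalue[TILE_WIDTH][TILE_WIDTH] = {{0}};
    for (int t = 0; t < n/TILE_WIDTH; ++t) {
        /* "Collaborative" load: every (ty,tx) position loads one A and one B element. */
        for (int ty = 0; ty < TILE_WIDTH; ++ty)
            for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                int Row = by*TILE_WIDTH + ty, Col = bx*TILE_WIDTH + tx;
                ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
                ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
            }
        /* Accumulate this phase's partial products. */
        for (int ty = 0; ty < TILE_WIDTH; ++ty)
            for (int tx = 0; tx < TILE_WIDTH; ++tx)
                for (int i = 0; i < TILE_WIDTH; ++i)
                    Cvalue[ty][tx] += ds_A[ty][i] * ds_B[i][tx];
    }
    for (int ty = 0; ty < TILE_WIDTH; ++ty)
        for (int tx = 0; tx < TILE_WIDTH; ++tx)
            C[(by*TILE_WIDTH + ty)*k + bx*TILE_WIDTH + tx] = Cvalue[ty][tx];
}
```

Multiplying a 4x4 matrix by the identity, tile by tile, reproduces the input matrix.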
65 First Order Considerations
Each thread block should have many threads: TILE_WIDTH of 16 gives 16*16 = 256 threads; TILE_WIDTH of 32 gives 32*32 = 1024 threads.
For 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations; memory traffic is reduced by a factor of 16.
For 32, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations; memory traffic is reduced by a factor of 32.
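The per-block accounting generalizes: a block loads 2*TW^2 floats and performs TW^2 * (2*TW) mul/add operations, so traffic shrinks by a factor of TW. A small sketch of the arithmetic (helper names are ours):

```c
#include <assert.h>

/* Global-memory float loads per block for tile width tw
 * (one A element and one B element per thread, per phase, over tw phases
 * folds to 2 loads per thread). */
int loads_per_block(int tw) { return 2 * tw * tw; }

/* Mul/add operations per block for tile width tw:
 * tw*tw threads, each doing 2*tw operations per pair of loads. */
int ops_per_block(int tw) { return tw * tw * 2 * tw; }

/* Reduction factor in global-memory traffic vs. the untiled kernel,
 * which needs one load per operation pair (2 loads per 2 ops). */
int reduction_factor(int tw) { return ops_per_block(tw) / loads_per_block(tw); }
```

This reproduces the slide's numbers: 512 loads for 8,192 ops at TW = 16, and factors of 16 and 32 respectively.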
68 Shared Memory and Threading
A Fermi SM has 16KB or 48KB of shared memory (configurable vs. L1 cache).
For TILE_WIDTH = 16, a thread block uses 2*256*4B = 2KB of shared memory. With 16KB of shared memory, up to 8 thread blocks can be executing, giving 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
For TILE_WIDTH = 32, a thread block uses 2*32*32*4B = 8KB of shared memory, so only 2 thread blocks can be active.
72 Shared Memory and Threading
TILE_WIDTH = 16 reduces accesses to global memory by a factor of 16, so the 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
74 Tiled MM: Arbitrary Matrix Dimensions
Real applications need to handle arbitrarily sized matrices.
One option is to pad (add elements to) the rows and columns out to multiples of the tile size, but that incurs significant space and data transfer time overhead!
76 Loads for Block (0,0), Phase 0: the threads load A0,0 A0,1 A1,0 A1,1 and B0,0 B0,1 B1,0 B1,1 into shared memory. (The matrices are now 3x3.)
77 Loads for Block (0,0), Phase 1: only A0,2, A1,2, B2,0, and B2,1 exist within the 3x3 matrices for this phase's tiles.
78 Loads for Block (0,0), Phase 1
Thread 0,1 attempts to load A0,3; thread 1,1 attempts to load A1,3. With linearized row-major addressing in a 3-column matrix, those offsets actually fetch A1,0 and A2,0, the wrong elements.
Thread 1,0 attempts to load B3,0; thread 1,1 attempts to load B3,1: both invalid accesses!
83 Computation after Phase 1 Loads
Iteration 0: C0,0 += A0,2*B2,0; C0,1 += A0,2*B2,1; C1,0 += A1,2*B2,0; C1,1 += A1,2*B2,1
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1, but these A and B elements do not exist in the 3x3 matrices.
88 Loads for Block (1,1), Phase 0
Thread 0,1 attempts to load B0,3; thread 1,1 attempts to load B1,3.
Thread 1,0 attempts to load A3,0; thread 1,1 attempts to load A3,1.
The valid loads are B0,2, B1,2, A2,0, and A2,1.
89 Computation after Phase 0 Loads (Block (1,1))
Iteration 0: C2,2 += A2,0*B0,2; C2,3 += A2,0*B0,3; C3,2 += A3,0*B0,2; C3,3 += A3,0*B0,3
Iteration 1: C2,2 += A2,1*B1,2; C2,3 += A2,1*B1,3; C3,2 += A3,1*B1,2; C3,3 += A3,1*B1,3
Only C2,2 is a valid element of the 3x3 output.
93 Major Cases from the Example
Threads that calculate valid C elements but can step outside valid input: Phase 1 of Block (0,0), 2nd step, all threads.
Threads that do not calculate valid C elements but still need to participate in loading the input tiles: Phase 0 of Block (1,1), Thread (1,0) is assigned to calculate the non-existent C[3,2] but needs to participate in loading tile element B[1,2].
97 Tiled MM: Arbitrary Matrix Dimensions
When a thread is to load any input element, test whether it is in the valid index range. If valid, proceed to load; else do not load, just write a 0 into shared memory.
The 0 will not affect the functional correctness of the multiply-add step.
100 Computation after Phase 1 Loads
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1. With the out-of-range A and B elements loaded as 0, these terms contribute nothing.
101 Simple Solution, contd.
If a thread does not calculate a valid C element, it can still perform the multiply-add into its register, but it shouldn't write its Cvalue to global memory at the end of the kernel. The thread still participates in the tile loading process.
102 Boundary Condition for Input A Tile
Each thread loads A[Row][t*TILE_WIDTH+tx], i.e., A[Row*n + t*TILE_WIDTH + tx].
if ((Row < m) && (t*TILE_WIDTH+tx < n)) then load the A element, else load 0.
103 Boundary Condition for Input B Tile
Each thread loads B[t*TILE_WIDTH+ty][Col], i.e., B[(t*TILE_WIDTH+ty)*k + Col].
if ((t*TILE_WIDTH+ty < n) && (Col < k)) then load the B element, else load 0.
104 Tiled Matrix Multiplication Kernel
  ...
  for (int t = 0; t < (n-1)/TILE_WIDTH + 1; ++t) {
    if (Row < m && t*TILE_WIDTH+tx < n) {
      ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
    } else {
      ds_A[ty][tx] = 0.0;
    }
    if (t*TILE_WIDTH+ty < n && Col < k) {
      ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
    } else {
      ds_B[ty][tx] = 0.0;
    }
    __syncthreads();
107 Tiled Matrix Multiplication Kernel
    for (int i = 0; i < TILE_WIDTH; ++i) {
      Cvalue += ds_A[ty][i] * ds_B[i][tx];
    }
    __syncthreads();
  }
  if (Row < m && Col < k)
    C[Row*k + Col] = Cvalue;
  ...
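The boundary-checked logic can be exercised on the CPU for an awkward size such as 3x3 with TILE_WIDTH = 2. In this sketch (the function name is ours) the shared-memory staging is collapsed into direct reads, with the same range tests replaced by loading 0:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* CPU emulation of the boundary-checked tiled kernel for one thread
 * position (by,bx,ty,tx); stores into C only when (Row,Col) is valid.
 * A is m x n, B is n x k, C is m x k, row-major. */
void tiled_mm_thread(int m, int n, int k, const float *A, const float *B,
                     float *C, int by, int bx, int ty, int tx) {
    int Row = by*TILE_WIDTH + ty, Col = bx*TILE_WIDTH + tx;
    float Cvalue = 0.0f;
    for (int t = 0; t < (n - 1)/TILE_WIDTH + 1; ++t) {
        for (int i = 0; i < TILE_WIDTH; ++i) {
            /* Same validity tests as the kernel's loads: out-of-range
             * elements become 0 and so add nothing to the dot product. */
            float a = (Row < m && t*TILE_WIDTH + i < n)
                          ? A[Row*n + t*TILE_WIDTH + i] : 0.0f;
            float b = (t*TILE_WIDTH + i < n && Col < k)
                          ? B[(t*TILE_WIDTH + i)*k + Col] : 0.0f;
            Cvalue += a * b;
        }
    }
    if (Row < m && Col < k)
        C[Row*k + Col] = Cvalue;
}
```

Running every thread position of the 2x2 grid of 2x2 blocks on a 3x3 A times the 3x3 identity should return A unchanged, exercising all the boundary cases from the example.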
108 Important Points
For each thread, the conditions are different for loading the A element, loading the B element, and storing output elements.
The effect of control divergence should be small for large matrices.
109 MM using Tiling
110 Matrix Multiplication
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];
111 DGEMM: X = Y * Z
112 Blocking / Tiling
[Figure: the currently used elements of Y and Z are in the cache.]
Idea of blocking: make full use of the elements of Z when they are brought into the cache.
114 Blocking / Tiling [Figure: the blocked traversal steps through Y and Z one block at a time.]
118 Blocking
Make full use of the elements of Z when they are brought into the cache.
[Figure: 2x2 blocks, X = Y * Z: X(0,0) = Y(0,0)*Z(0,0) + Y(0,1)*Z(1,0) and X(1,0) = Y(1,0)*Z(0,0) + Y(1,1)*Z(1,0).]
119 Blocking
double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
  for (K = 0; K < N; K += B)
    for (i = 0; i < N; i++)
      for (j = J; j < min(J+B, N); j++) {
        for (k = K, r = 0; k < min(K+B, N); k++)
          r += Y[i][k] * Z[k][j];
        X[i][j] += r;
      }
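A self-contained C version of this blocked loop nest can be checked against the naive triple loop; the fixed N, the block-size name BS (called B on the slide), and the function names here are ours:

```c
#include <assert.h>

#define N 5   /* deliberately not a multiple of the block size */
#define BS 2  /* block size (called B in the slide) */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) matrix multiply: X += Y * Z, with min() handling
 * the ragged final blocks when N is not a multiple of BS. */
void blocked_mm(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int J = 0; J < N; J += BS)
        for (int K = 0; K < N; K += BS)
            for (int i = 0; i < N; i++)
                for (int j = J; j < min_int(J + BS, N); j++) {
                    double r = 0;
                    for (int k = K; k < min_int(K + BS, N); k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] += r;
                }
}

/* Naive reference: X += Y * Z. */
void naive_mm(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                X[i][j] += Y[i][k] * Z[k][j];
}
```

Both versions accumulate exactly the same products, only in a different order, so for small integer-valued inputs the results match exactly.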
120 Extra Slides
121 Global Memory Access Pattern of the Basic MM Kernel
122 Computation after Phase 1 Loads
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationMore CUDA. Advanced programming
More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics
More informationMemory Hierarchy. Advanced Optimizations. Slides contents from:
Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.
More informationECE408/CS483 Fall Applied Parallel Programming. Objective. To learn about tiled convolution algorithms
ECE408/CS483 Fall 2016 Applied Parallel Programming Lecture 9: Tiled Convolution 1 Objective To learn about tiled convolution algorithms Some intricate aspects of tiling algorithms Output tiles versus
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationExample 1: Color-to-Grayscale Image Processing
GPU Teaching Kit Accelerated Computing Lecture 16: CUDA Parallelism Model Examples Example 1: Color-to-Grayscale Image Processing RGB Color Image Representation Each pixel in an image is an RGB value The
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationSparse Linear Algebra in CUDA
Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2
More informationGPU Computing with CUDA
GPU Computing with CUDA Hands-on: Shared Memory Use (Dot Product, Matrix Multiplication) Dan Melanz & Dan Negrut Simulation-Based Engineering Lab Wisconsin Applied Computing Center Department of Mechanical
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationInformation Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)
26(86) Information Coding / Computer Graphics, ISY, LiTH CUDA memory Coalescing Constant memory Texture memory Pinned memory 26(86) CUDA memory We already know... Global memory is slow. Shared memory is
More informationGPU programming basics. Prof. Marco Bertini
GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationModule 3: CUDA Execution Model -I. Objective
ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationIntroduction to CUDA
Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,
More informationLecture 3: Introduction to CUDA
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationComputation to Core Mapping Lessons learned from a simple application
Lessons learned from a simple application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core Mapping Block schedule, Occupancy, and thread
More informationCUDA. More on threads, shared memory, synchronization. cuprintf
CUDA More on threads, shared memory, synchronization cuprintf Library function for CUDA Developers Copy the files from /opt/cuprintf into your source code folder #include cuprintf.cu global void testkernel(int
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationShared Memory. Table of Contents. Shared Memory Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Shared Memory.
Table of Contents Shared Memory Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationGPGPUGPGPU: Multi-GPU Programming
GPGPUGPGPU: Multi-GPU Programming Fall 2012 HW4 global void cuda_transpose(const float *ad, const int n, float *atd) { } int i = threadidx.y + blockidx.y*blockdim.y; int j = threadidx.x + blockidx.x*blockdim.x;
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationImage Processing Optimization C# on GPU with Hybridizer
Image Processing Optimization C# on GPU with Hybridizer regis.portalez@altimesh.com Median Filter Denoising Noisy image (lena 1960x1960) Denoised image window = 3 2 Median Filter Denoising window Output[i,j]=
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLecture 7. Using Shared Memory Performance programming and the memory hierarchy
Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more
More informationGPU MEMORY BOOTCAMP III
April 4-7, 2016 Silicon Valley GPU MEMORY BOOTCAMP III COLLABORATIVE ACCESS PATTERNS Tony Scudiero NVIDIA Devtech Fanatical Bandwidth Evangelist The Axioms of Modern Performance #1. Parallelism is mandatory
More informationCUDA Parallelism Model
GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example
More informationLessons learned from a simple application
Computation to Core Mapping Lessons learned from a simple application A Simple Application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core
More informationCase Study: Implementing a Vector Kernel on a Vector Processor and GPU
32 Solutions to Case Studies and Exercises Chapter 4 Solutions Case Study: Implementing a Vector Kernel on a Vector Processor and GPU 4.1 MIPS code (answers may vary) li $r1,#0 # initialize k loop: l.s
More informationReductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research
Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative
More informationCUDA Performance Considerations (2 of 2)
Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus
More informationIntroduction to CUDA (1 of n*)
Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationLearn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh
Learn CUDA in an Afternoon Alan Gray EPCC The University of Edinburgh Overview Introduction to CUDA Practical Exercise 1: Getting started with CUDA GPU Optimisation Practical Exercise 2: Optimising a CUDA
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationCS : Many-core Computing with CUDA
CS4402-9535: Many-core Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 (Moreno Maza) CS4402-9535: Many-core Computing with CUDA UWO-CS4402-CS9535
More informationLecture 6. Programming with Message Passing Message Passing Interface (MPI)
Lecture 6 Programming with Message Passing Message Passing Interface (MPI) Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Finish CUDA Today s lecture Programming with message passing 2011
More informationDRAM Bank Organization
DRM andwidth DRM ank Organization Row ddr Row Decoder Memory Cell Core rray DRM Memory Cell Sense mps Column Latches Column ddr Mux Mux Off-chip Data DRM Core rrays are Slow DRM Core rrays are Slow DDR:
More informationSlide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth
Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 DRAM Bandwidth MEMORY ACCESS PERFORMANCE Objective To learn that memory bandwidth is a first-order performance factor in
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationProgramming with CUDA, WS09
Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationOutline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun
Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory
More informationAdvanced CUDA Optimizations
Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location
More information2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.
Administrative L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies Next assignment on the website Description at end of class Due Wednesday, Feb. 17, 5PM Use handin program on
More informationGPU Performance Nuggets
GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance
More informationOptimizing Parallel Reduction in CUDA
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationAn Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture
An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationLecture 7. Overlap Using Shared Memory Performance programming the memory hierarchy
Lecture 7 Overlap Using Shared Memory Performance programming the memory hierarchy Announcements Mac Mini lab (APM 2402) Starts Tuesday Every Tues and Fri for the next 2 weeks Project proposals due on
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationRegister file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.
Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More information