Tiled Matrix Multiplication
- Earl Alvin Melton
- 5 years ago
1 Tiled Matrix Multiplication
2 Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  int Col = blockIdx.x*blockDim.x + threadIdx.x;
  if ((Row < m) && (Col < k)) {
    float Cvalue = 0.0;
    for (int i = 0; i < n; ++i)  /* A[Row, i] and B[i, Col] */
      Cvalue += A[Row*n+i] * B[Col+i*k];
    C[Row*k+Col] = Cvalue;
  }
}
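As a sanity check on the linearized indexing, the kernel's loop can be run sequentially on the CPU. This is a sketch, not part of the slides; the name matmul_ref is ours:

```c
#include <assert.h>

/* Sequential equivalent of the kernel body: A is m x n, B is n x k, C is m x k,
 * all stored row-major in flat arrays, so A[Row][i] = A[Row*n + i] and
 * B[i][Col] = B[i*k + Col] (the kernel writes the latter as B[Col + i*k]). */
void matmul_ref(int m, int n, int k, const float *A, const float *B, float *C) {
    for (int Row = 0; Row < m; ++Row)
        for (int Col = 0; Col < k; ++Col) {
            float Cvalue = 0.0f;
            for (int i = 0; i < n; ++i)
                Cvalue += A[Row*n + i] * B[i*k + Col];
            C[Row*k + Col] = Cvalue;
        }
}
```

For a 2x3 A times a 3x2 B this reproduces the usual hand-computed dot products.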
3 Global Memory Access Pattern in MM: Work in Block (0,0)
[Figure: 4x4 A, 4x3 B, 4x3 C; rows 0 and 1 of A and columns 0 and 1 of B are read to produce the C elements of Block (0,0).]
5 Global Memory Access Pattern in MM: Consider threads (0,0) and (0,1). Both step through row 0 of A (A0,0 A0,1 A0,2 A0,3 ...), while thread (0,0) reads column 0 of B and thread (0,1) reads column 1; the same A row is fetched from global memory by each thread.
7 Performance on Fermi GPU
All threads access global memory for their input matrix elements.
Two memory accesses (one A element and one B element, 8 bytes) per floating-point multiply-add (2 ops), i.e., 4B of memory read per FLOP.
Peak floating-point rate is 1TF = 1,000 GFLOPS, so 4*1,000 = 4,000 GB/s would be required to achieve the peak FLOP rating.
Reality: Fermi supports 150 GB/s, giving an upper FLOPS limit of 37.5 GFLOPS; the actual code runs at about 25 GFLOPS.
Need to cut down global memory accesses (raise the compute-to-global-memory-access ratio) to get close to 1TF.
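The bandwidth bound above is simple arithmetic; as an illustrative sketch (the helper name is ours), it can be written as:

```c
#include <assert.h>

/* Upper bound on achievable GFLOPS when every FLOP requires
 * bytes_per_flop bytes of global-memory traffic:
 *   GFLOPS_max = bandwidth_GBps / bytes_per_flop */
double bandwidth_bound_gflops(double bandwidth_gbps, double bytes_per_flop) {
    return bandwidth_gbps / bytes_per_flop;
}
```

With Fermi's 150 GB/s and 4 bytes per FLOP this gives the 37.5 GFLOPS ceiling quoted above.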
14 Shared Memory Tiling/Blocking
Divide the global memory content into tiles. Focus the computation of threads on one or a small number of tiles at each point in time.
16 Basic Concept of Blocking/Tiling
More efficient if tiled data exhibit good spatial locality.
Shared memory (SM) should be large enough to accommodate all the tiled data.
Needs synchronization among the threads sharing a tile.
20 Tiling Technique
Identify a tile of global memory content that is accessed by multiple threads.
Load the tile from global memory into on-chip memory.
Have the multiple threads access their data from the on-chip memory.
Move on to the next tile.
24 Tiled Matrix Multiplication: loading a tile, phased execution, barrier synchronization
25 Basic Matrix Multiplication: C[Row,Col] = dot product of row Row of A and column Col of B. [Figure: A is m x n, B is n x k, C is m x k.]
27 Matrix Multiplication
Elements A[Row,...] are used in calculating elements C[Row,...]: all threads updating C[Row,...] will access A[Row], and in the basic kernel each such thread loads all of A[Row] from global memory.
Likewise, elements B[...,Col] are used in calculating elements C[...,Col].
32 Tiled Matrix Multiplication
Compromise: load elements from A[Row,...] and B[...,Col] in phases.
All the threads participate in loading the A and B elements; all the threads compute the partial product and accumulate it into C; then move to the next phase and repeat.
36 Tiled Matrix Multiplication
All threads in a block participate; each thread loads one A element and one B element in the tiled code.
37 Tiled Matrix Multiplication: rows in A and columns in B needed for threads in Block (0,0). [Figure: 4x4 A, B, and C; Block (0,0) produces C0,0 C0,1 C1,0 C1,1 and therefore needs rows 0-1 of A and columns 0-1 of B.]
38 Tiled Matrix Multiplication, Load Phase 0: threads (0,0), (0,1), (1,0), (1,1) load A0,0 A0,1 A1,0 A1,1 and B0,0 B0,1 B1,0 B1,1 into shared memory.
39 Tiled Matrix Multiplication, Computation (Phase 0):
C0,0 = A0,0*B0,0 + A0,1*B1,0
C0,1 = A0,0*B0,1 + A0,1*B1,1
C1,0 = A1,0*B0,0 + A1,1*B1,0
C1,1 = A1,0*B0,1 + A1,1*B1,1
For these 4x4 matrices with TILE_WIDTH = 2, the full dot products need 2 iterations (phases).
41 Tiled Matrix Multiplication, Load Phase 1: threads (0,0), (0,1), (1,0), (1,1) load A0,2 A0,3 A1,2 A1,3 and B2,0 B2,1 B3,0 B3,1 into shared memory.
42 Tiled Matrix Multiplication, Computation (Phase 1):
C0,0 += A0,2*B2,0 + A0,3*B3,0
C0,1 += A0,2*B2,1 + A0,3*B3,1
C1,0 += A1,2*B2,0 + A1,3*B3,0
C1,1 += A1,2*B2,1 + A1,3*B3,1
43 Matrix Multiplication Kernel: Shared Memory Variable Declaration
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
  ...
}
44 Tiled Matrix Multiplication
How many memory accesses are reduced? In the example, each value from A and B is loaded once and used twice; in the basic implementation, each value is loaded once and used once. Memory bandwidth is reduced by 50%.
48 Tiled Matrix Multiplication
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of A and one tile of B.
49 Barrier Synchronization
API call: __syncthreads()
All threads in the same block must reach the __syncthreads() before any can move on.
Best used to coordinate tiled algorithms: to ensure that all elements of a tile are loaded, and to ensure that all elements of a tile are consumed.
52 Barrier Synchronization: __syncthreads()
[Figure: threads 0 through n-1 reaching a barrier at different times; none proceeds until all arrive.]
Barriers can significantly reduce the number of active threads in a block.
53 Tiled Matrix Multiplication: thread (tx,ty) loads elements of A and B. [Figure: in phase t, the tile covers columns t*TW through (t+1)*TW-1.]
54 Load Phase 0 of a Thread
Row = by * blockDim.y + ty;
Col = bx * blockDim.x + tx;
Thread (tx,ty) loads A[Row][tx] and B[ty][Col].
55 Load Phase 1 of a Thread
Thread (tx,ty) loads A[Row][1*TILE_WIDTH+tx] and B[1*TILE_WIDTH+ty][Col].
56 Linear Address
A[Row][t*TILE_WIDTH+tx] = A[Row*n + t*TILE_WIDTH + tx]
B[t*TILE_WIDTH+ty][Col] = B[(t*TILE_WIDTH+ty)*k + Col]
t is the tile sequence number of the current phase.
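These linearizations are easy to verify in plain C; in this sketch the helper names lin_a and lin_b are ours:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* Flat offset of A[Row][t*TILE_WIDTH + tx] in a row-major m x n matrix. */
int lin_a(int Row, int t, int tx, int n) {
    return Row*n + t*TILE_WIDTH + tx;
}

/* Flat offset of B[t*TILE_WIDTH + ty][Col] in a row-major n x k matrix. */
int lin_b(int t, int ty, int Col, int k) {
    return (t*TILE_WIDTH + ty)*k + Col;
}
```

For n = k = 4, phase t = 1 and thread (tx,ty) = (0,1) in row 1, lin_a picks A[1][2] (offset 6) and lin_b picks B[3][1] (offset 13), matching the Phase 1 tiles shown earlier.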
57 Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C) {
  __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Cvalue = 0;
  ...
61 Tiled Matrix Multiplication Kernel
  ...
  // Loop over the A and B tiles required to compute the C element
  for (int t = 0; t < n/TILE_WIDTH; ++t) {
    // Collaborative loading of A and B tiles into shared memory
    ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
    ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
      Cvalue += ds_A[ty][i] * ds_B[i][tx];
    __syncthreads();
  }
  C[Row*k + Col] = Cvalue;
}
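To convince yourself that the phased algorithm computes the same result as the basic kernel, it can be emulated on the CPU. This is a sketch under the kernel's assumption that the dimensions are multiples of TILE_WIDTH; the function name tile_matmul is ours:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* CPU emulation of one output tile's phased computation: computes the
 * TILE_WIDTH x TILE_WIDTH tile of C at block position (by, bx), assuming
 * m, n, k are all multiples of TILE_WIDTH. A is m x n, B is n x k, C is m x k. */
void tile_matmul(int m, int n, int k, const float *A, const float *B,
                 float *C, int by, int bx) {
    (void)m; /* bounds checks are unnecessary when sizes are tile multiples */
    float ds_A[TILE_WIDTH][TILE_WIDTH], ds_B[TILE_WIDTH][TILE_WIDTH];
    float Cvalue[TILE_WIDTH][TILE_WIDTH] = {{0}};
    for (int t = 0; t < n/TILE_WIDTH; ++t) {
        /* "Collaborative" load: every (ty,tx) position loads one A and one B element. */
        for (int ty = 0; ty < TILE_WIDTH; ++ty)
            for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                int Row = by*TILE_WIDTH + ty, Col = bx*TILE_WIDTH + tx;
                ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
                ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
            }
        /* Accumulate this phase's partial products. */
        for (int ty = 0; ty < TILE_WIDTH; ++ty)
            for (int tx = 0; tx < TILE_WIDTH; ++tx)
                for (int i = 0; i < TILE_WIDTH; ++i)
                    Cvalue[ty][tx] += ds_A[ty][i] * ds_B[i][tx];
    }
    for (int ty = 0; ty < TILE_WIDTH; ++ty)
        for (int tx = 0; tx < TILE_WIDTH; ++tx)
            C[(by*TILE_WIDTH + ty)*k + bx*TILE_WIDTH + tx] = Cvalue[ty][tx];
}
```

Multiplying a 4x4 matrix by the identity, tile by tile, reproduces the input matrix.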
65 First Order Considerations
Each thread block should have many threads: TILE_WIDTH of 16 gives 16*16 = 256 threads; TILE_WIDTH of 32 gives 32*32 = 1024 threads.
For 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations; memory traffic is reduced by a factor of 16.
For 32, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations; memory traffic is reduced by a factor of 32.
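The per-block accounting generalizes: a block loads 2*TW^2 floats and performs TW^2 * (2*TW) mul/add operations, so traffic shrinks by a factor of TW. A small sketch of the arithmetic (helper names are ours):

```c
#include <assert.h>

/* Global-memory float loads per block for tile width tw
 * (one A element and one B element per thread, per phase, over tw phases
 * folds to 2 loads per thread). */
int loads_per_block(int tw) { return 2 * tw * tw; }

/* Mul/add operations per block for tile width tw:
 * tw*tw threads, each doing 2*tw operations per pair of loads. */
int ops_per_block(int tw) { return tw * tw * 2 * tw; }

/* Reduction factor in global-memory traffic vs. the untiled kernel,
 * which needs one load per operation pair (2 loads per 2 ops). */
int reduction_factor(int tw) { return ops_per_block(tw) / loads_per_block(tw); }
```

This reproduces the slide's numbers: 512 loads for 8,192 ops at TW = 16, and factors of 16 and 32 respectively.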
68 Shared Memory and Threading
A Fermi SM has 16KB or 48KB of shared memory (configurable vs. L1 cache).
For TILE_WIDTH = 16, a thread block uses 2*256*4B = 2KB of shared memory. With 16KB of shared memory, up to 8 thread blocks can be executing, giving 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
For TILE_WIDTH = 32, a thread block uses 2*32*32*4B = 8KB of shared memory, so only 2 thread blocks can be active.
72 Shared Memory and Threading
TILE_WIDTH = 16 reduces accesses to global memory by a factor of 16, so the 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
74 Tiled MM: Arbitrary Matrix Dimensions
Real applications need to handle arbitrarily sized matrices.
One option is to pad (add elements to) the rows and columns out to multiples of the tile size, but that incurs significant space and data transfer time overhead!
76 Loads for Block (0,0), Phase 0: the threads load A0,0 A0,1 A1,0 A1,1 and B0,0 B0,1 B1,0 B1,1 into shared memory. (The matrices are now 3x3.)
77 Loads for Block (0,0), Phase 1: only A0,2, A1,2, B2,0, and B2,1 exist within the 3x3 matrices for this phase's tiles.
78 Loads for Block (0,0), Phase 1
Thread 0,1 attempts to load A0,3; thread 1,1 attempts to load A1,3. With linearized row-major addressing in a 3-column matrix, those offsets actually fetch A1,0 and A2,0, the wrong elements.
Thread 1,0 attempts to load B3,0; thread 1,1 attempts to load B3,1: both invalid accesses!
83 Computation after Phase 1 Loads
Iteration 0: C0,0 += A0,2*B2,0; C0,1 += A0,2*B2,1; C1,0 += A1,2*B2,0; C1,1 += A1,2*B2,1
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1, but these A and B elements do not exist in the 3x3 matrices.
88 Loads for Block (1,1), Phase 0
Thread 0,1 attempts to load B0,3; thread 1,1 attempts to load B1,3.
Thread 1,0 attempts to load A3,0; thread 1,1 attempts to load A3,1.
The valid loads are B0,2, B1,2, A2,0, and A2,1.
89 Computation after Phase 0 Loads (Block (1,1))
Iteration 0: C2,2 += A2,0*B0,2; C2,3 += A2,0*B0,3; C3,2 += A3,0*B0,2; C3,3 += A3,0*B0,3
Iteration 1: C2,2 += A2,1*B1,2; C2,3 += A2,1*B1,3; C3,2 += A3,1*B1,2; C3,3 += A3,1*B1,3
Only C2,2 is a valid element of the 3x3 output.
93 Major Cases from the Example
Threads that calculate valid C elements but can step outside valid input: Phase 1 of Block (0,0), 2nd step, all threads.
Threads that do not calculate valid C elements but still need to participate in loading the input tiles: Phase 0 of Block (1,1), Thread (1,0) is assigned to calculate the non-existent C[3,2] but needs to participate in loading tile element B[1,2].
97 Tiled MM: Arbitrary Matrix Dimensions
When a thread is to load any input element, test whether it is in the valid index range. If valid, proceed to load; else do not load, just write a 0 into shared memory.
The 0 will not affect the functional correctness of the multiply-add step.
100 Computation after Phase 1 Loads
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1. With the out-of-range A and B elements loaded as 0, these terms contribute nothing.
101 Simple Solution, contd.
If a thread does not calculate a valid C element, it can still perform the multiply-add into its register, but it shouldn't write its Cvalue to global memory at the end of the kernel. The thread still participates in the tile loading process.
102 Boundary Condition for Input A Tile
Each thread loads A[Row][t*TILE_WIDTH+tx], i.e., A[Row*n + t*TILE_WIDTH + tx].
if ((Row < m) && (t*TILE_WIDTH+tx < n)) then load the A element, else load 0.
103 Boundary Condition for Input B Tile
Each thread loads B[t*TILE_WIDTH+ty][Col], i.e., B[(t*TILE_WIDTH+ty)*k + Col].
if ((t*TILE_WIDTH+ty < n) && (Col < k)) then load the B element, else load 0.
104 Tiled Matrix Multiplication Kernel
  ...
  for (int t = 0; t < (n-1)/TILE_WIDTH + 1; ++t) {
    if (Row < m && t*TILE_WIDTH+tx < n) {
      ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
    } else {
      ds_A[ty][tx] = 0.0;
    }
    if (t*TILE_WIDTH+ty < n && Col < k) {
      ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
    } else {
      ds_B[ty][tx] = 0.0;
    }
    __syncthreads();
107 Tiled Matrix Multiplication Kernel
    for (int i = 0; i < TILE_WIDTH; ++i) {
      Cvalue += ds_A[ty][i] * ds_B[i][tx];
    }
    __syncthreads();
  }
  if (Row < m && Col < k)
    C[Row*k + Col] = Cvalue;
  ...
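The boundary-checked logic can be exercised on the CPU for an awkward size such as 3x3 with TILE_WIDTH = 2. In this sketch (the function name is ours) the shared-memory staging is collapsed into direct reads, with the same range tests replaced by loading 0:

```c
#include <assert.h>

#define TILE_WIDTH 2

/* CPU emulation of the boundary-checked tiled kernel for one thread
 * position (by,bx,ty,tx); stores into C only when (Row,Col) is valid.
 * A is m x n, B is n x k, C is m x k, row-major. */
void tiled_mm_thread(int m, int n, int k, const float *A, const float *B,
                     float *C, int by, int bx, int ty, int tx) {
    int Row = by*TILE_WIDTH + ty, Col = bx*TILE_WIDTH + tx;
    float Cvalue = 0.0f;
    for (int t = 0; t < (n - 1)/TILE_WIDTH + 1; ++t) {
        for (int i = 0; i < TILE_WIDTH; ++i) {
            /* Same validity tests as the kernel's loads: out-of-range
             * elements become 0 and so add nothing to the dot product. */
            float a = (Row < m && t*TILE_WIDTH + i < n)
                          ? A[Row*n + t*TILE_WIDTH + i] : 0.0f;
            float b = (t*TILE_WIDTH + i < n && Col < k)
                          ? B[(t*TILE_WIDTH + i)*k + Col] : 0.0f;
            Cvalue += a * b;
        }
    }
    if (Row < m && Col < k)
        C[Row*k + Col] = Cvalue;
}
```

Running every thread position of the 2x2 grid of 2x2 blocks on a 3x3 A times the 3x3 identity should return A unchanged, exercising all the boundary cases from the example.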
108 Important Points
For each thread, the conditions are different for loading the A element, loading the B element, and storing output elements.
The effect of control divergence should be small for large matrices.
109 MM using Tiling
110 Matrix Multiplication
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];
111 DGEMM: X = Y * Z
112 Blocking / Tiling
[Figure: the currently used elements of Y and Z are in the cache.]
Idea of blocking: make full use of the elements of Z when they are brought into the cache.
114 Blocking / Tiling [Figure: the blocked traversal steps through Y and Z one block at a time.]
118 Blocking
Make full use of the elements of Z when they are brought into the cache.
[Figure: 2x2 blocks, X = Y * Z: X(0,0) = Y(0,0)*Z(0,0) + Y(0,1)*Z(1,0) and X(1,0) = Y(1,0)*Z(0,0) + Y(1,1)*Z(1,0).]
119 Blocking
double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
  for (K = 0; K < N; K += B)
    for (i = 0; i < N; i++)
      for (j = J; j < min(J+B, N); j++) {
        for (k = K, r = 0; k < min(K+B, N); k++)
          r += Y[i][k] * Z[k][j];
        X[i][j] += r;
      }
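A self-contained C version of this blocked loop nest can be checked against the naive triple loop; the fixed N, the block-size name BS (called B on the slide), and the function names here are ours:

```c
#include <assert.h>

#define N 5   /* deliberately not a multiple of the block size */
#define BS 2  /* block size (called B in the slide) */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) matrix multiply: X += Y * Z, with min() handling
 * the ragged final blocks when N is not a multiple of BS. */
void blocked_mm(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int J = 0; J < N; J += BS)
        for (int K = 0; K < N; K += BS)
            for (int i = 0; i < N; i++)
                for (int j = J; j < min_int(J + BS, N); j++) {
                    double r = 0;
                    for (int k = K; k < min_int(K + BS, N); k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] += r;
                }
}

/* Naive reference: X += Y * Z. */
void naive_mm(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                X[i][j] += Y[i][k] * Z[k][j];
}
```

Both versions accumulate exactly the same products, only in a different order, so for small integer-valued inputs the results match exactly.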
120 Extra Slides
121 Global Memory Access Pattern of the Basic MM Kernel
122 Computation after Phase 1 Loads
Iteration 1: C0,0 += A0,3*B3,0; C0,1 += A0,3*B3,1; C1,0 += A1,3*B3,0; C1,1 += A1,3*B3,1
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationMore CUDA. Advanced programming
More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics
More informationMemory Hierarchy. Advanced Optimizations. Slides contents from:
Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.
More informationECE408/CS483 Fall Applied Parallel Programming. Objective. To learn about tiled convolution algorithms
ECE408/CS483 Fall 2016 Applied Parallel Programming Lecture 9: Tiled Convolution 1 Objective To learn about tiled convolution algorithms Some intricate aspects of tiling algorithms Output tiles versus
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationExample 1: Color-to-Grayscale Image Processing
GPU Teaching Kit Accelerated Computing Lecture 16: CUDA Parallelism Model Examples Example 1: Color-to-Grayscale Image Processing RGB Color Image Representation Each pixel in an image is an RGB value The
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationSparse Linear Algebra in CUDA
Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2
More informationGPU Computing with CUDA
GPU Computing with CUDA Hands-on: Shared Memory Use (Dot Product, Matrix Multiplication) Dan Melanz & Dan Negrut Simulation-Based Engineering Lab Wisconsin Applied Computing Center Department of Mechanical
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationInformation Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)
26(86) Information Coding / Computer Graphics, ISY, LiTH CUDA memory Coalescing Constant memory Texture memory Pinned memory 26(86) CUDA memory We already know... Global memory is slow. Shared memory is
More informationGPU programming basics. Prof. Marco Bertini
GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationModule 3: CUDA Execution Model -I. Objective
ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationIntroduction to CUDA
Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,
More informationLecture 3: Introduction to CUDA
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationComputation to Core Mapping Lessons learned from a simple application
Lessons learned from a simple application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core Mapping Block schedule, Occupancy, and thread
More informationCUDA. More on threads, shared memory, synchronization. cuprintf
CUDA More on threads, shared memory, synchronization cuprintf Library function for CUDA Developers Copy the files from /opt/cuprintf into your source code folder #include cuprintf.cu global void testkernel(int
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationShared Memory. Table of Contents. Shared Memory Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Shared Memory.
Table of Contents Shared Memory Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationGPGPUGPGPU: Multi-GPU Programming
GPGPUGPGPU: Multi-GPU Programming Fall 2012 HW4 global void cuda_transpose(const float *ad, const int n, float *atd) { } int i = threadidx.y + blockidx.y*blockdim.y; int j = threadidx.x + blockidx.x*blockdim.x;
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationImage Processing Optimization C# on GPU with Hybridizer
Image Processing Optimization C# on GPU with Hybridizer regis.portalez@altimesh.com Median Filter Denoising Noisy image (lena 1960x1960) Denoised image window = 3 2 Median Filter Denoising window Output[i,j]=
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLecture 7. Using Shared Memory Performance programming and the memory hierarchy
Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more
More informationGPU MEMORY BOOTCAMP III
April 4-7, 2016 Silicon Valley GPU MEMORY BOOTCAMP III COLLABORATIVE ACCESS PATTERNS Tony Scudiero NVIDIA Devtech Fanatical Bandwidth Evangelist The Axioms of Modern Performance #1. Parallelism is mandatory
More informationCUDA Parallelism Model
GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example
More informationLessons learned from a simple application
Computation to Core Mapping Lessons learned from a simple application A Simple Application Matrix Multiplication Used as an example throughout the course Goal for today: Show the concept of Computation-to-Core
More informationCase Study: Implementing a Vector Kernel on a Vector Processor and GPU
32 Solutions to Case Studies and Exercises Chapter 4 Solutions Case Study: Implementing a Vector Kernel on a Vector Processor and GPU 4.1 MIPS code (answers may vary) li $r1,#0 # initialize k loop: l.s
More informationReductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research
Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative
More informationCUDA Performance Considerations (2 of 2)
Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus
More informationIntroduction to CUDA (1 of n*)
Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationLearn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh
Learn CUDA in an Afternoon Alan Gray EPCC The University of Edinburgh Overview Introduction to CUDA Practical Exercise 1: Getting started with CUDA GPU Optimisation Practical Exercise 2: Optimising a CUDA
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationCS : Many-core Computing with CUDA
CS4402-9535: Many-core Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 (Moreno Maza) CS4402-9535: Many-core Computing with CUDA UWO-CS4402-CS9535
More informationLecture 6. Programming with Message Passing Message Passing Interface (MPI)
Lecture 6 Programming with Message Passing Message Passing Interface (MPI) Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Finish CUDA Today s lecture Programming with message passing 2011
More informationDRAM Bank Organization
DRM andwidth DRM ank Organization Row ddr Row Decoder Memory Cell Core rray DRM Memory Cell Sense mps Column Latches Column ddr Mux Mux Off-chip Data DRM Core rrays are Slow DRM Core rrays are Slow DDR:
More informationSlide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth
Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 DRAM Bandwidth MEMORY ACCESS PERFORMANCE Objective To learn that memory bandwidth is a first-order performance factor in
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationProgramming with CUDA, WS09
Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationOutline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun
Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory
More informationAdvanced CUDA Optimizations
Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location
More information2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.
Administrative L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies Next assignment on the website Description at end of class Due Wednesday, Feb. 17, 5PM Use handin program on
More informationGPU Performance Nuggets
GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance
More informationOptimizing Parallel Reduction in CUDA
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationAn Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture
An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationLecture 7. Overlap Using Shared Memory Performance programming the memory hierarchy
Lecture 7 Overlap Using Shared Memory Performance programming the memory hierarchy Announcements Mac Mini lab (APM 2402) Starts Tuesday Every Tues and Fri for the next 2 weeks Project proposals due on
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationRegister file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.
Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More information