Tiled Matrix Multiplication


1 Tiled Matrix Multiplication

2 Basic Matrix Multiplication Kernel

__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < m) && (Col < k)) {
        float Cvalue = 0.0;
        for (int i = 0; i < n; ++i)          /* A[Row, i] and B[i, Col] */
            Cvalue += A[Row*n + i] * B[Col + i*k];
        C[Row*k + Col] = Cvalue;
    }
}
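
A minimal host-side launch sketch for this kernel (not from the slides; the 16 x 16 block size and the names d_A, d_B, d_C, dimGrid, dimBlock are illustrative assumptions):

    dim3 dimBlock(16, 16);
    dim3 dimGrid((k + dimBlock.x - 1) / dimBlock.x,   // columns of C
                 (m + dimBlock.y - 1) / dimBlock.y);  // rows of C
    MatrixMulKernel<<<dimGrid, dimBlock>>>(m, n, k, d_A, d_B, d_C);

The ceiling division lets the grid cover output sizes that are not multiples of the block size; the (Row < m) && (Col < k) test in the kernel discards the surplus threads.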

3-4 Global Memory Access Pattern in MM
Work in Block (0,0): the threads with Row = 0, 1 and Col = 0, 1 read rows 0-1 of A and columns 0-1 of B directly from global memory to produce C0,0, C0,1, C1,0 and C1,1. (Figure: element grids of A (4 x 4), B (4 x 3) and C (4 x 3).)

5-6 Global Memory Access Pattern in MM
Consider threads (0,0) and (0,1): both read row 0 of A (A0,0, A0,1, A0,2, A0,3, ...) from global memory, while each reads its own column of B (B0,0, B1,0, ... and B0,1, B1,1, ...). (Figure.)

7-13 Performance on Fermi GPU
All threads access global memory for their input matrix elements.
Two memory accesses (A0,0 and B0,0 = 8 bytes) per floating-point multiply-add (2 ops), i.e. 4 B of memory read per FLOP.
The peak floating-point rate is 1 TF = 1,000 GFLOPS, so 4*1,000 = 4,000 GB/s would be required to achieve the peak FLOP rating.
Reality: Fermi supports 150 GB/s, an upper limit of 37.5 GFLOPS; the actual code runs at about 25 GFLOPS.
Need to cut down global memory accesses to get close to 1 TF, i.e. raise the Compute-to-Global-Memory-Access Ratio.
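
To make the 37.5 GFLOPS limit explicit (a small worked step, not on the slides): the kernel performs 2 floating-point operations per 8 bytes fetched, a compute-to-global-memory-access ratio of 0.25 FLOP/B, so the attainable rate is bounded by 0.25 FLOP/B * 150 GB/s = 37.5 GFLOPS, far below the 1,000 GFLOPS peak.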

14-15 Shared Memory Tiling/Blocking
Divide the global memory content into tiles. Focus the computation of threads on one or a small number of tiles at each point in time. (Figure: threads (0,0) and (0,1) access on-chip memory instead of global memory.)

16-19 Basic Concept of Blocking/Tiling
More efficient if the tiled data exhibit good spatial locality.
The SM should be large enough to accommodate all the data.
Needs synchronization.

20-23 Tiling Technique
Identify a tile of global memory content that is accessed by multiple threads.
Load the tile from global memory into on-chip memory.
Have the multiple threads access their data from the on-chip memory.
Move on to the next tile.

24 Tiled Matrix Multiplication
Loading a tile. Phased execution. Barrier synchronization.

25 Basic Matrix Multiplication
C[Row,Col] is the dot product of row Row of A and column Col of B. (Figure: A is m x n, B is n x k, C is m x k.)

26-31 Matrix Multiplication
Elements A[Row,...] are used in calculating elements C[Row,...]: all threads updating C[Row,...] will access A[Row], so each thread in the block loads A[Row].
Likewise, elements B[...,Col] are used in calculating elements C[...,Col].

32-35 Tiled Matrix Multiplication
Compromise: load elements from A[Row,...] and B[...,Col] in phases.
All the threads participate in loading the A and B elements.
All the threads compute the partial product and store it in C.
Move to the next phase. Repeat.

36 Tiled Matrix Multiplication
All threads in a block participate: each thread loads one A element and one B element in the tiled code.

37 Tiled Matrix Multiplication
Rows in A and columns in B needed for the threads in Block (0,0). (Figure: 4 x 4 grids of A, B and C elements.)

38 Tiled Matrix Multiplication: Load Phase 0
Threads (0,0), (0,1), (1,0) and (1,1) load A0,0, A0,1, A1,0, A1,1 and B0,0, B0,1, B1,0, B1,1 into shared memory (SM).

39-40 Tiled Matrix Multiplication: SM Computation (Phase 0)
C0,0 = A0,0*B0,0 + A0,1*B1,0    C0,1 = A0,0*B0,1 + A0,1*B1,1
C1,0 = A1,0*B0,0 + A1,1*B1,0    C1,1 = A1,0*B0,1 + A1,1*B1,1
Needs 2 iterations.

41 Tiled Matrix Multiplication: Load Phase 1
Threads (0,0), (0,1), (1,0) and (1,1) load the next tiles, A0,2, A0,3, A1,2, A1,3 and B2,0, B2,1, B3,0, B3,1, into shared memory.

42 Tiled Matrix Multiplication: Computation (Phase 1)
C0,0 += A0,2*B2,0 + A0,3*B3,0    C0,1 += A0,2*B2,1 + A0,3*B3,1
C1,0 += A1,2*B2,0 + A1,3*B3,0    C1,1 += A1,2*B2,1 + A1,3*B3,1

43 Matrix Multiplication Kernel: Shared Memory Variable Declaration

__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C)
{
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    ...
}
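
For these static declarations the tile width has to be a compile-time constant. A minimal sketch of how it is typically provided (the value 16 is simply the choice discussed later in the slides):

    #define TILE_WIDTH 16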

44-47 Tiled Matrix Multiplication
How many memory accesses are reduced?
In the example, each value from A and B is loaded once and used twice.
In the basic implementation, each value from A and B is loaded once and used once.
Memory bandwidth is reduced by 50%.

48 Tiled Matrix Multiplication
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of A and one tile of B.

49-51 Barrier Synchronization
API call: __syncthreads()
All threads in the same block must reach the __syncthreads() before any can move on.
Best used to coordinate tiled algorithms: to ensure that all elements of a tile are loaded, and to ensure that all elements of a tile are consumed.
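
A minimal sketch of the pattern these bullets describe (illustrative only, not the actual matrix-multiplication kernel, which follows later; the kernel name and the row-sum computation are placeholders, and TILE_WIDTH is assumed to be #define'd as above):

    __global__ void tilePatternSketch(const float* g, float* out, int numTiles)
    {
        __shared__ float ds_tile[TILE_WIDTH][TILE_WIDTH];
        float acc = 0.0f;
        for (int t = 0; t < numTiles; ++t) {
            // every thread loads one element of tile t
            ds_tile[threadIdx.y][threadIdx.x] =
                g[(t * TILE_WIDTH + threadIdx.y) * TILE_WIDTH + threadIdx.x];
            __syncthreads();                       // barrier 1: the tile is fully loaded
            for (int i = 0; i < TILE_WIDTH; ++i)   // every thread reads shared data
                acc += ds_tile[threadIdx.y][i];
            __syncthreads();                       // barrier 2: the tile is fully consumed
        }
        out[threadIdx.y * TILE_WIDTH + threadIdx.x] = acc;
    }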

52 Barrier Synchronization
__syncthreads(): the threads of a block reach the barrier at different times, and none may proceed until all have arrived. (Figure: threads 0 through n plotted against time.) Barriers can significantly reduce the number of active threads in a block.

53 Tiled Matrix Multiplication
Thread (tx,ty) loads elements of A and B. (Figure: the shared dimension n is divided into tiles of width TILE_WIDTH; phase t covers indices starting at t*TW.)

54 Load Phase 0 of a Thread
Row = by * blockDim.y + ty;  Col = bx * blockDim.x + tx;
Thread (tx,ty) loads A[Row][tx] and B[ty][Col].

55 Load Phase 1 of a Thread
Thread (tx,ty) loads A[Row][1*TILE_WIDTH+tx] and B[1*TILE_WIDTH+ty][Col].

56 Linear Address
A[Row][t*TILE_WIDTH+tx] = A[Row*n + t*TILE_WIDTH + tx]
B[t*TILE_WIDTH+ty][Col] = B[(t*TILE_WIDTH+ty)*k + Col]
t is the tile sequence number of the current phase.
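
A quick numeric check of these formulas (using the 4 x 4 example above, with TILE_WIDTH = 2 and n = k = 4): for block (0,0), thread (tx,ty) = (1,0) in phase t = 1, Row = 0 and Col = 1, so A[Row][t*TILE_WIDTH+tx] = A[0][3], which linearizes to A[0*4 + 1*2 + 1] = A[3], and B[t*TILE_WIDTH+ty][Col] = B[2][1], which linearizes to B[(1*2 + 0)*4 + 1] = B[9].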

57-60 Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C)
{
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;

    float Cvalue = 0;
    ...

61-64 Tiled Matrix Multiplication Kernel

    ...
    // Loop over the A and B tiles required to compute the C element
    for (int t = 0; t < n/TILE_WIDTH; ++t) {

        // Collaborative loading of A and B tiles into shared memory
        ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
        ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Cvalue += ds_A[ty][i] * ds_B[i][tx];
        __syncthreads();
    }
    C[Row*k + Col] = Cvalue;
}
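
A minimal host-side launch sketch for the tiled kernel (not on the slides; d_A, d_B, d_C, dimGrid and dimBlock are illustrative names). The block dimensions must equal TILE_WIDTH in both x and y, because the shared-memory tiles are TILE_WIDTH x TILE_WIDTH and the kernel uses blockDim and TILE_WIDTH interchangeably:

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(k / TILE_WIDTH, m / TILE_WIDTH);   // assumes m and k are multiples of TILE_WIDTH
    MatrixMulKernel<<<dimGrid, dimBlock>>>(m, n, k, d_A, d_B, d_C);

This version of the kernel also assumes n is a multiple of TILE_WIDTH (the loop runs n/TILE_WIDTH times); the boundary-checked variant later in the slides removes these restrictions.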

65-67 First Order Considerations
Each thread block should have many threads: TILE_WIDTH of 16 gives 16*16 = 256 threads; TILE_WIDTH of 32 gives 32*32 = 1,024 threads.
For 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations: memory traffic is reduced by a factor of 16.
For 32, each block performs 2*1,024 = 2,048 float loads from global memory for 1,024 * (2*32) = 65,536 mul/add operations: memory traffic is reduced by a factor of 32.
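
In general (a small derivation, not on the slides): with tile width T, each phase loads 2*T*T floats per block and performs T*T threads * 2*T operations = 2*T*T*T mul/add operations, so the ratio of operations to global loads is T, i.e. global memory traffic drops by a factor of TILE_WIDTH.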

68-71 Shared Memory and Threading
A Fermi SM has 16 KB or 48 KB of shared memory (configurable vs. L1 cache).
With TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 KB of shared memory.
With 16 KB of shared memory per SM, up to 8 thread blocks can be executing, giving 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
With TILE_WIDTH = 32, each thread block uses 2*32*32*4 B = 8 KB of shared memory, so only 2 thread blocks can be active.
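
The resident-block count can also be queried at run time rather than computed by hand; a hedged sketch using the CUDA occupancy API (available since CUDA 6.5, so it post-dates these Fermi-era slides):

    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MatrixMulKernel,
                                                  TILE_WIDTH * TILE_WIDTH, 0 /* dynamic shared mem */);
    printf("Resident blocks per SM: %d\n", numBlocks);

The dynamic shared-memory argument is 0 because ds_A and ds_B are statically declared inside the kernel.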

72-73 Shared Memory and Threading
TILE_WIDTH = 16 reduces accesses to global memory by a factor of 16, so the 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!

74-75 Tiled MM: Arbitrary Matrix Dimensions
Real applications need to handle arbitrarily sized matrices.
One option is to pad the rows and columns (add elements) up to multiples of the tile size, but that carries significant space and data-transfer-time overhead!

76 Loads for Block (0,0), Phase 0
(Figure: 3 x 3 matrices A, B and C; the threads load A0,0, A0,1, A1,0, A1,1 and B0,0, B0,1, B1,0, B1,1 into shared memory.)

77 Loads for Block (0,0), Phase 1
(Figure: the shared-memory tiles now hold A0,2, A1,2 and B2,0, B2,1; the remaining tile slots correspond to elements outside the 3 x 3 matrices.)

78-80 Loads for Block (0,0), Phase 1
Thread (0,1) attempts to load A0,3 and thread (1,1) attempts to load A1,3. The matrices are only 3 x 3, so the threads actually load A1,0 and A2,0: the linearized addresses wrap into the following rows. (Figure.)

81-82 Loads for Block (0,0), Phase 1
Thread (1,0) attempts to load B3,0 and thread (1,1) attempts to load B3,1. Both are invalid accesses!

83-87 Computation after Phase 1 Loads
Iteration 0:
C0,0 += A0,2*B2,0    C0,1 += A0,2*B2,1
C1,0 += A1,2*B2,0    C1,1 += A1,2*B2,1
Iteration 1:
C0,0 += A0,3*B3,0    C0,1 += A0,3*B3,1
C1,0 += A1,3*B3,0    C1,1 += A1,3*B3,1
The iteration 1 terms involve the non-existent elements A0,3, A1,3, B3,0 and B3,1.

88 Loads for Block (1,1), Phase 0
Thread (0,1) attempts to load B0,3, thread (1,1) attempts to load B1,3, thread (1,0) attempts to load A3,0, and thread (1,1) attempts to load A3,1. (Figure: the valid elements B0,2, B1,2, A2,0 and A2,1 are loaded into shared memory.)

89-92 Computation after Phase 0 Loads (Block (1,1))
Iteration 0:
C2,2 += A2,0*B0,2    C2,3 += A2,0*B0,3
C3,2 += A3,0*B0,2    C3,3 += A3,0*B0,3
Iteration 1:
C2,2 += A2,1*B1,2    C2,3 += A2,1*B1,3
C3,2 += A3,1*B1,2    C3,3 += A3,1*B1,3

93-96 Major Cases from the Example
Threads that calculate valid C elements but can step outside the valid input: Phase 1 of Block (0,0), 2nd step, all threads.
Threads that do not calculate valid C elements but still need to participate in loading the input tiles: Phase 0 of Block (1,1), Thread (1,0) is assigned to calculate the non-existent C[3,2] but needs to participate in loading tile element B[1,2].

97-99 Tiled MM: Arbitrary Matrix Dimensions
When a thread is about to load any input element, test whether it is in the valid index range.
If valid, proceed to load; else, do not load, just write a 0.
The 0 will not affect the multiply-add step, preserving functional correctness.

100 Computation after Phase 1 Loads
Iteration 1: with the boundary checks, the shared-memory slots for the non-existent elements A0,3, A1,3, B3,0 and B3,1 hold 0, so the updates
C0,0 += A0,3*B3,0    C0,1 += A0,3*B3,1
C1,0 += A1,3*B3,0    C1,1 += A1,3*B3,1
each add 0*0 and leave the accumulated values unchanged.

101 Simple Solution (contd.)
If a thread does not calculate a valid C element, it can still perform the multiply-add into its register, but it should not write its Cvalue to global memory at the end of the kernel. The thread still participates in the tile loading process.

102 Boundary Condition for Input A Tile
Each thread loads A[Row][t*TILE_WIDTH+tx], i.e. A[Row*n + t*TILE_WIDTH+tx].
if (Row < m) && (t*TILE_WIDTH+tx < n) then load the A element, else load 0.

103 Boundary Condition for Input B Tile
Each thread loads B[t*TILE_WIDTH+ty][Col], i.e. B[(t*TILE_WIDTH+ty)*k + Col].
if (t*TILE_WIDTH+ty < n) && (Col < k) then load the B element, else load 0.

104-106 Tiled Matrix Multiplication Kernel

    ...
    for (int t = 0; t < (n-1)/TILE_WIDTH + 1; ++t) {
        if (Row < m && t*TILE_WIDTH+tx < n) {
            ds_A[ty][tx] = A[Row*n + t*TILE_WIDTH + tx];
        } else {
            ds_A[ty][tx] = 0.0;
        }
        if (t*TILE_WIDTH+ty < n && Col < k) {
            ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + Col];
        } else {
            ds_B[ty][tx] = 0.0;
        }
        __syncthreads();
        ...

107 Tiled Matrix Multiplication Kernel

        ...
        for (int i = 0; i < TILE_WIDTH; ++i) {
            Cvalue += ds_A[ty][i] * ds_B[i][tx];
        }
        __syncthreads();
    }
    if (Row < m && Col < k)
        C[Row*k + Col] = Cvalue;
}
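
A small host-side correctness check (a sketch, not from the slides; hA, hB and hC are assumed to hold the host copies of A, B and the GPU result, with <math.h> and <stdio.h> included):

    int errors = 0;
    for (int row = 0; row < m; ++row)
        for (int col = 0; col < k; ++col) {
            float ref = 0.0f;
            for (int i = 0; i < n; ++i)
                ref += hA[row*n + i] * hB[i*k + col];
            if (fabsf(hC[row*k + col] - ref) > 1e-3f) ++errors;
        }
    printf("%d mismatches\n", errors);

Because the tiled kernel substitutes explicit zeros for out-of-range tile elements, this check should report no mismatches for arbitrary m, n and k.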

108 Important Points
For each thread the conditions are different for loading an A element, loading a B element, and storing the output element.
The effect of control divergence should be small for large matrices.

109 MM using Tiling

110 Matrix Multiplication

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            X[i][j] += Y[i][k] * Z[k][j];

111 DGEMM
(Figure: matrices X, Y and Z, with X = Y * Z.)

112-113 Blocking / Tiling
These elements are in the cache. (Figure: highlighted portions of X, Y and Z.)
Idea of blocking: make full use of the elements of Z when they are brought into the cache.

114-117 Blocking / Tiling
(Figures: successive blocks of Y and Z are brought into the cache and consumed.)

118 Blocking
Make full use of the elements of Z when they are brought into the cache. (Figure: X = Y * Z partitioned into 2 x 2 blocks.)
X(0,0) = Y(0,0)*Z(0,0) + Y(0,1)*Z(1,0)
X(1,0) = Y(1,0)*Z(0,0) + Y(1,1)*Z(1,0)

119 Blocking

double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
    for (K = 0; K < N; K += B)
        for (i = 0; i < N; i++)
            for (j = J; j < min(J+B, N); j++) {
                for (k = K, r = 0; k < min(K+B, N); k++)
                    r += Y[i][k] * Z[k][j];
                X[i][j] += r;
            }
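
A rough guide for choosing the block size B (an added note, with an assumed cache size): the B x B block of Z being reused should stay resident, which needs about 8*B*B bytes of cache for doubles; with an assumed 32 KB L1 data cache, 8*B*B <= 32768 gives B <= 64, so B = 32 or B = 64 are natural starting points to tune from.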

120 Extra Slides

121 Global Memory Access Pattern of the Basic MM Kernel

122 Computation after Phase 1 Loads
Iteration 1:
C0,0 += A0,3*B3,0    C0,1 += A0,3*B3,1
C1,0 += A1,3*B3,0    C1,1 += A1,3*B3,1
