GPU Programming: Performance Considerations
Miaoqing Huang, University of Arkansas, Fall 2013

Outline
- Control Flow Divergence
- Memory Coalescing
- Shared Memory Bank Conflicts
- Occupancy
- Loop Unrolling
- Kernel Launch Overhead

Control Flow Divergence

Control Flow
- Instructions are issued per 32 threads (warp)
- Divergent branches: threads within a single warp take different paths (if-else, ...)
- Different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance

[Figure: "But what about branches?" Eight ALUs (ALU 1 to ALU 8) over time (clocks) run the shader code below. The lanes marked T, T, F, T, F, F, F, F take the if-path or the else-path accordingly. Not all ALUs do useful work; worst case: 1/8 performance. Source: Beyond Programmable Shading: Fundamentals.]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

Project the threads into a linear order
Logical 2-D organization (each entry is x,y):
    0,0  1,0  2,0  3,0
    0,1  1,1  2,1  3,1
    0,2  1,2  2,2  3,2
    0,3  1,3  2,3  3,3
Linear order:
    0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3
- Line up the rows with larger y and z coordinates after those with lower ones

Partition the threads into warps
[Figure: the linearly ordered threads are split into consecutive groups of 32: threads 0 to 31, threads 32 to 63, thread 64 onward, and so on.]
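The linearization and warp partitioning described above can be sketched as follows. This is a minimal illustration, not code from the slides: the helper names `linearThreadId` and `warpOf` and the kernel `recordWarpIndex` are hypothetical, and the warp size of 32 is taken from the Control Flow slide.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32  // warp size as stated in the slides

// Linear thread ID inside a block: x varies fastest, then y, then z,
// so rows with larger y (and z) come after those with lower ones.
__host__ __device__ inline unsigned linearThreadId(unsigned x, unsigned y, unsigned z,
                                                   unsigned dimX, unsigned dimY) {
    return x + y * dimX + z * dimX * dimY;
}

// Index of the warp that a given linear thread ID falls into.
__host__ __device__ inline unsigned warpOf(unsigned linearId) {
    return linearId / WARP_SIZE;
}

// Hypothetical kernel: every thread of a block records its own warp index.
// warpIndex must hold at least blockDim.x * blockDim.y * blockDim.z entries.
__global__ void recordWarpIndex(unsigned *warpIndex) {
    unsigned tid = linearThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                                  blockDim.x, blockDim.y);
    warpIndex[tid] = warpOf(tid);
}
```

For the 4 x 4 block on the slide, thread (1,1) gets linear ID 5; in a larger block, linear IDs 0 to 31 map to warp 0 and 32 to 63 map to warp 1.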

Avoid diverging within a warp
- Example with divergence (branch granularity < warp size):
      if (threadIdx.x > 2) {... } else {... }
- Example without divergence (branch granularity is a whole multiple of warp size):
      if ((threadIdx.x / WARP_SIZE) > 2) {... } else {... }
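A small sketch of the two branch conditions above, assuming a one-dimensional block and a warp size of 32. The names `perThreadCond`, `perWarpCond`, `warpDiverges` and `branchPerWarp` are made up for illustration; `warpDiverges` is a host-side model that only checks whether the threads of one warp would disagree on the condition.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32  // warp size as stated in the slides

// Condition evaluated per thread: splits warp 0 into threads 0-2 and 3-31.
__host__ __device__ inline bool perThreadCond(int tid) { return tid > 2; }

// Condition evaluated per warp: identical for all 32 threads of any warp.
__host__ __device__ inline bool perWarpCond(int tid) { return (tid / WARP_SIZE) > 2; }

// Host-side model: true if any thread of the given warp disagrees with the
// warp's first thread, i.e. the warp would have to serialize both paths.
inline bool warpDiverges(int warp, bool (*cond)(int)) {
    int first = warp * WARP_SIZE;
    for (int lane = 1; lane < WARP_SIZE; ++lane)
        if (cond(first + lane) != cond(first)) return true;
    return false;
}

// Hypothetical kernel using the warp-aligned condition: no warp diverges.
__global__ void branchPerWarp(float *out) {
    int tid = threadIdx.x;
    if (perWarpCond(tid)) out[tid] = 1.0f;   // warps 3 and above
    else                  out[tid] = -1.0f;  // warps 0 to 2
}
```

Under this model only warp 0 diverges for `threadIdx.x > 2`, while the warp-granular condition is uniform within every warp.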

Example: Divergent Iteration

    __global__ void per_thread_sum(int *indices, float *data, float *sums)
    {
        ...
        // number of loop iterations is data dependent
        for (int j = indices[i]; j < indices[i+1]; j++) {
            sum += data[j];
        }
        sums[i] = sum;
    }

- A single thread can drag a whole warp with it for a long time
- Know your data patterns
- If data is unpredictable, try to flatten peaks by letting threads work on multiple data items
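One possible way to let each thread work on multiple data items is a grid-stride loop over the segments, so that a thread's total work is the sum of several segment lengths rather than a single one. The sketch below is an assumption about how this could look, not the author's solution: the kernel name `per_thread_sum_strided`, the `numSegments` parameter, the host helper `runStridedSum` and the tiny test data are all hypothetical.

```cuda
#include <cuda_runtime.h>

// Each thread sums segments i, i + stride, i + 2*stride, ... so long and short
// segments are mixed within one thread instead of one segment per thread.
__global__ void per_thread_sum_strided(const int *indices, const float *data,
                                       float *sums, int numSegments) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numSegments; i += stride) {
        float sum = 0.0f;
        for (int j = indices[i]; j < indices[i + 1]; ++j) sum += data[j];
        sums[i] = sum;
    }
}

// Host helper: returns 0 if the GPU result matches, 1 on a mismatch,
// and -1 if a CUDA call failed (for example, no device is available).
int runStridedSum() {
    const int numSegments = 4;
    const int hIndices[numSegments + 1] = {0, 1, 4, 4, 6};
    const float hData[6] = {1, 2, 3, 4, 5, 6};
    const float expected[numSegments] = {1, 9, 0, 11};
    float hSums[numSegments] = {0};

    int *dIndices = nullptr;
    float *dData = nullptr, *dSums = nullptr;
    if (cudaMalloc(&dIndices, sizeof(hIndices)) != cudaSuccess) return -1;
    if (cudaMalloc(&dData, sizeof(hData)) != cudaSuccess) return -1;
    if (cudaMalloc(&dSums, sizeof(hSums)) != cudaSuccess) return -1;
    cudaMemcpy(dIndices, hIndices, sizeof(hIndices), cudaMemcpyHostToDevice);
    cudaMemcpy(dData, hData, sizeof(hData), cudaMemcpyHostToDevice);

    // Two threads for four segments: each thread handles two segments.
    per_thread_sum_strided<<<1, 2>>>(dIndices, dData, dSums, numSegments);
    cudaError_t err = cudaMemcpy(hSums, dSums, sizeof(hSums), cudaMemcpyDeviceToHost);
    cudaFree(dIndices); cudaFree(dData); cudaFree(dSums);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < numSegments; ++i)
        if (hSums[i] != expected[i]) return 1;
    return 0;
}
```

Whether this actually flattens the peaks depends on how segment lengths are distributed; it helps most when long segments are scattered rather than clustered at a fixed stride.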

Execution of the Sum Reduction Kernel
[Figure: threads 0, 2, 4, 6, 8, 10, ... plotted against the iterations of the reduction.]

Memory Coalescing

Memory Coalescing
- Off-chip memory is accessed in chunks
  - Even if you read only a single word
  - If you don't use the whole chunk, bandwidth is wasted
- Chunks are aligned to multiples of 128 bytes on Fermi GPUs
[Figure: accesses by thread i and thread j shown against the addresses 128, 160 and 256.]

Coalesced Access to Global Memory
- How is the global memory access of the threads in a warp coalesced?
  - On Fermi, global memory loads and stores by the threads of a warp (i.e., 32 threads) are coalesced
- How is the coalesced memory access aligned into segments?
  - On Fermi, the segment size is always 128 bytes
- Thread blocks are partitioned into warps based on thread indices
  - Each warp contains threads of consecutive and increasing thread IDs, with the first warp containing thread 0

Simple Memory Access Pattern
- Coalesced access in which all threads but a few access a word in the same 128-byte segment (on Fermi)
  - Not all threads in a warp need to access the memory
  - The access by threads can be permuted
[Figure: the accesses of a warp falling within one 128B segment.]

Misaligned Access Pattern
- Misaligned sequential addresses that fall within two 128-byte segments
[Figure: the accesses of a warp straddling the boundary between two 128B segments.]
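To make the segment counting concrete, the sketch below counts how many 128-byte segments one warp touches when thread t reads a 4-byte word at byte address base + t * stride. It is a simplified model under stated assumptions (4-byte-aligned words, non-negative stride, Fermi's 128-byte segments as given in the slides); the names `segmentsTouched` and `copyWithOffset` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define SEGMENT_BYTES 128  // Fermi segment size as stated in the slides
#define WARP_SIZE 32

// Number of distinct 128-byte segments touched by one warp when thread t reads
// the 4-byte word at byte address baseByte + t * strideBytes.
// Assumes 4-byte-aligned addresses, so no word straddles two segments, and a
// non-negative stride, so segment numbers never decrease from thread to thread.
__host__ __device__ inline int segmentsTouched(size_t baseByte, size_t strideBytes) {
    int count = 0;
    size_t last = (size_t)-1;
    for (int t = 0; t < WARP_SIZE; ++t) {
        size_t seg = (baseByte + t * strideBytes) / SEGMENT_BYTES;
        if (seg != last) { ++count; last = seg; }
    }
    return count;
}

// Hypothetical kernel whose reads become misaligned when offset is not a
// multiple of 32 floats. The caller must size `in` for index n - 1 + offset.
__global__ void copyWithOffset(float *out, const float *in, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i + offset];
}
```

In this model an aligned sequential warp access (base 128, stride 4) touches one segment, the same access shifted by one word (base 132) touches two, and a stride of 128 bytes touches 32 segments for 32 useful words.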

Matrix Data Access Pattern
- Directly access data from global memory
- Each thread reads one row of Md and one column of Nd


Coalesced access to a matrix
Matrix N:
    N0,0  N0,1  N0,2  N0,3
    N1,0  N1,1  N1,2  N1,3
    N2,0  N2,1  N2,2  N2,3
    N3,0  N3,1  N3,2  N3,3
Linear layout in memory:
    N0,0 N0,1 N0,2 N0,3 N1,0 N1,1 N1,2 N1,3 N2,0 N2,1 N2,2 N2,3 N3,0 N3,1 N3,2 N3,3
- Load iteration 0: threads T0, T1, T2, T3 read N0,0, N0,1, N0,2, N0,3
- Load iteration 1: threads T0, T1, T2, T3 read N1,0, N1,1, N1,2, N1,3
- In each iteration the four threads read adjacent words

Uncoalesced access to a matrix
Matrix M:
    M0,0  M0,1  M0,2  M0,3
    M1,0  M1,1  M1,2  M1,3
    M2,0  M2,1  M2,2  M2,3
    M3,0  M3,1  M3,2  M3,3
Linear layout in memory:
    M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
- Load iteration 0: threads T0, T1, T2, T3 read M0,0, M1,0, M2,0, M3,0
- Load iteration 1: threads T0, T1, T2, T3 read M0,1, M1,1, M2,1, M3,1
- In each iteration the four threads read words that are a full row apart
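The difference between the two patterns comes down to the address gap between adjacent threads in one load iteration. The sketch below assumes row-major storage, as in the linear layouts above; the helper `rowMajor` and the kernels `columnSums` and `rowSums` are illustrative names, not code from the slides.

```cuda
#include <cuda_runtime.h>

// Row-major element index, the layout assumed for the matrices above.
__host__ __device__ inline int rowMajor(int row, int col, int width) {
    return row * width + col;
}

// Coalesced: thread `col` walks down one column. In every iteration k the
// threads of a warp read adjacent words N[k][col], N[k][col + 1], ...
__global__ void columnSums(const float *N, float *out, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float sum = 0.0f;
    for (int k = 0; k < width; ++k) sum += N[rowMajor(k, col, width)];
    out[col] = sum;
}

// Uncoalesced: thread `row` walks along one row. In every iteration k the
// threads of a warp read words that are `width` elements apart.
__global__ void rowSums(const float *M, float *out, int width) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width) return;
    float sum = 0.0f;
    for (int k = 0; k < width; ++k) sum += M[rowMajor(row, k, width)];
    out[row] = sum;
}
```

For a 4 x 4 matrix, neighbouring threads in `columnSums` are 1 element apart in memory, while neighbouring threads in `rowSums` are 4 elements apart.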

Use Shared Memory to Improve Coalescing
- Original access pattern: Md and Nd (WIDTH x WIDTH) are read directly from global memory
- Tiled access pattern:
  - Copy tiles of Md and Nd into shared memory
  - Perform the multiplication with the data in shared memory

Aligned and Sequential
[Figure: threads 0 to 31 of a warp mapped onto the address axis 96, 128, 160, 192, 224, 256, 288; aligned and sequential.]

    Compute capability   Caching    Memory transactions
    1.0 and 1.1          Uncached   1 x 64B at 128; 1 x 64B at 192
    1.2 and 1.3          Uncached   1 x 64B at 128; 1 x 64B at 192
    2.0                  Cached     1 x 128B at 128

Aligned and Non-sequential
[Figure: threads 0 to 31 of a warp mapped onto the address axis 96, 128, 160, 192, 224, 256, 288; aligned and non-sequential.]

    Compute capability   Caching    Memory transactions
    1.0 and 1.1          Uncached   8 x 32B at 128; 8 x 32B at 160; 8 x 32B at 192; 8 x 32B at 224
    1.2 and 1.3          Uncached   1 x 64B at 128; 1 x 64B at 192
    2.0                  Cached     1 x 128B at 128

Misaligned and Sequential
[Figure: threads 0 to 31 of a warp mapped onto the address axis 96, 128, 160, 192, 224, 256, 288; misaligned and sequential.]

    Compute capability   Caching    Memory transactions
    1.0 and 1.1          Uncached   7 x 32B at 128; 8 x 32B at 160; 8 x 32B at 192; 8 x 32B at 224; 1 x 32B at 256
    1.2 and 1.3          Uncached   1 x 128B at 128; 1 x 64B at 192; 1 x 32B at 256
    2.0                  Cached     1 x 128B at 128; 1 x 128B at 256

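Matrix Multiplication Using Multiple Blocks with Tile
[Figure: matrices M, N and P, each WIDTH x WIDTH, partitioned into TILE_WIDTH x TILE_WIDTH tiles. Block indices bx and by (0, 1, 2, ...) select a tile of P; thread indices tx and ty (0, 1, 2, ..., TILE_WIDTH-1) select an element within the tile; m indexes the tiles of M and N along the shared dimension and k indexes the elements within a tile.]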

Matrix Multiplication Kernel Using Multiple Blocks with Tile

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k) {
                Pvalue += Mds[ty][k] * Nds[k][tx];
            }
            __syncthreads();
        }
        Pd[Row*Width + Col] = Pvalue;
    }

Shared Memory Bank Conflicts

Shared Memory
- Shared memory is banked (Fermi: 32 banks)
  - Consecutive 32-bit words are in different banks
  - Banks support simultaneous, concurrent access
- On Fermi, all the threads in the same warp share the same request
- If two or more threads access the same bank but different words, there is a bank conflict
- If multiple threads access the same word in the same bank, there is no bank conflict
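The bank rules above can be turned into a small model that reports the conflict degree of a strided access. This is a sketch under the assumptions stated on the slide (32 banks, consecutive 32-bit words in consecutive banks, accesses to the same word do not conflict); the function names `bankOf` and `conflictDegree` are hypothetical.

```cuda
#include <cuda_runtime.h>

#define NUM_BANKS 32  // Fermi, as stated in the slides
#define WARP_SIZE 32

// Bank holding a given 32-bit word of shared memory (assumes word w is in bank w mod 32).
__host__ __device__ inline int bankOf(int wordIndex) { return wordIndex % NUM_BANKS; }

// Conflict degree when thread t of a warp accesses word t * strideWords:
// the largest number of DISTINCT words that fall into a single bank.
// 1 means conflict-free; n means the request is replayed n times.
__host__ __device__ inline int conflictDegree(int strideWords) {
    int worst = 0;
    for (int bank = 0; bank < NUM_BANKS; ++bank) {
        int distinct = 0;
        for (int t = 0; t < WARP_SIZE; ++t) {
            int word = t * strideWords;
            if (bankOf(word) != bank) continue;
            bool seen = false;  // same word already requested by an earlier thread?
            for (int u = 0; u < t; ++u)
                if (u * strideWords == word) seen = true;
            if (!seen) ++distinct;
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}
```

In this model stride 1 is conflict-free, stride 2 gives a 2-way conflict, stride 32 gives a 32-way conflict, and stride 33 is conflict-free again, which is why padding a 32-wide shared array by one column is a common remedy.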

Shared Memory Access on Fermi
[Figure: six example mappings of threads 0 to 31 of a warp onto shared memory banks 0 to 31.]

Matrix Multiplication Kernel Using Multiple Blocks with Tile
- Do we have a shared memory bank conflict problem here?
- Shared memory accesses in the tiled MatrixMulKernel:
      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
      Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
      Pvalue += Mds[ty][k] * Nds[k][tx];
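One way to approach the question is to model which shared memory words the first warp of a block touches for each access in the kernel. The sketch below does that under several assumptions: 32 banks with word w in bank w mod 32, same-word accesses do not conflict, each array starts at a bank-0 boundary, and a warp is formed from the linearized (tx fastest) thread order. The enum `Access`, the functions `tileWord` and `tileConflictDegree`, and the `READ_NDS_TRANSPOSED` case (a hypothetical `Nds[tx][k]` read added only for contrast) are not from the slides.

```cuda
#include <cuda_runtime.h>

#define NUM_BANKS 32
#define WARP_SIZE 32

enum Access { WRITE_TILE, READ_MDS, READ_NDS, READ_NDS_TRANSPOSED };

// Word offset inside a TILE_WIDTH x TILE_WIDTH shared array touched by the
// thread with linear ID tid (tx varies fastest) in inner-loop iteration k.
__host__ __device__ inline int tileWord(Access a, int tid, int k, int tileWidth) {
    int tx = tid % tileWidth, ty = tid / tileWidth;
    switch (a) {
        case READ_MDS:            return ty * tileWidth + k;   // Mds[ty][k]
        case READ_NDS:            return k * tileWidth + tx;   // Nds[k][tx]
        case READ_NDS_TRANSPOSED: return tx * tileWidth + k;   // hypothetical Nds[tx][k]
        default:                  return ty * tileWidth + tx;  // Mds[ty][tx] or Nds[ty][tx]
    }
}

// Largest number of distinct words per bank over threads 0..31 of the block.
__host__ __device__ inline int tileConflictDegree(Access a, int k, int tileWidth) {
    int worst = 0;
    for (int bank = 0; bank < NUM_BANKS; ++bank) {
        int distinct = 0;
        for (int t = 0; t < WARP_SIZE; ++t) {
            int word = tileWord(a, t, k, tileWidth);
            if (word % NUM_BANKS != bank) continue;
            bool seen = false;  // same word already requested by an earlier thread?
            for (int u = 0; u < t; ++u)
                if (tileWord(a, u, k, tileWidth) == word) seen = true;
            if (!seen) ++distinct;
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}
```

Under this model, with TILE_WIDTH of 16 or 32 all three accesses in the kernel come out conflict-free: the stores and the `Nds[k][tx]` reads touch consecutive words, and the `Mds[ty][k]` reads hit at most one word per bank. The transposed read, by contrast, would give an 8-way conflict for a 16-wide tile and a 32-way conflict for a 32-wide tile.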

Occupancy

Reminder: Thread Scheduling
- The SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by the SM
  - Warps whose next instruction has its inputs ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected

[Figure: execution timeline, TB = Thread Block, W = Warp]

    Time slot   Warp      Instructions   Event at end of slot
    1           TB1, W1   1 to 6         TB1, W1 stall
    2           TB2, W1   1 to 2         TB2, W1 stall
    3           TB3, W1   1 to 2
    4           TB3, W2   1 to 2         TB3, W2 stall
    5           TB2, W1   3 to 4
    6           TB1, W1   7 to 8
    7           TB1, W2   1 to 2
    8           TB1, W3   1 to 2
    9           TB3, W2   3 to 4

Thread Scheduling
- What happens if all warps are stalled?
  - No instruction is issued, so performance is lost
- Most common reason for stalling: waiting on global memory
- If your code reads global memory every couple of instructions, try to maximize occupancy
- What determines occupancy?
  - Register usage per thread
    - Registers are dynamically partitioned across all blocks assigned to the SM
    - Once assigned to a block, a register is NOT accessible by threads in other blocks
    - Each thread in a block can only access the registers assigned to itself
  - Shared memory per thread block

Resource Limits
- Pool of registers and shared memory per SM
- Each thread block grabs registers and shared memory
- If one or the other is fully utilized, no more thread blocks can be placed on the SM
[Figure: two examples of the register pool and the shared memory pool of an SM divided among thread blocks TB 0, TB 1 and TB 2.]

How do you know what you're using?
- Use nvcc --ptxas-options=-v to get register and shared memory usage
- Plug those numbers into the CUDA Occupancy Calculator
How to influence how many registers you use?
- Pass the option -maxrregcount=X to nvcc
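As a rough illustration of what the Occupancy Calculator does with those numbers, the sketch below estimates how many thread blocks fit on one SM. It is a simplification, not the calculator's actual algorithm: it ignores register and shared memory allocation granularity, and the default limits (32768 registers, 48 KB shared memory, 1536 threads, 8 blocks per SM) are assumed Fermi-class values rather than figures from the slides. The names `SmLimits`, `blocksPerSm` and `occupancy` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Per-SM resource limits. The defaults are assumed Fermi-class values.
struct SmLimits {
    int registers   = 32768;   // 32-bit registers per SM
    int sharedBytes = 49152;   // shared memory per SM (48 KB configuration)
    int maxThreads  = 1536;    // resident threads per SM
    int maxBlocks   = 8;       // resident blocks per SM
};

inline int minInt(int a, int b) { return a < b ? a : b; }

// Resident blocks per SM: the tightest of the register, shared memory,
// thread-count and block-count limits. Allocation granularity is ignored.
inline int blocksPerSm(int regsPerThread, int threadsPerBlock, int sharedBytesPerBlock,
                       SmLimits sm = SmLimits()) {
    int byRegs    = sm.registers / (regsPerThread * threadsPerBlock);
    int byShared  = sharedBytesPerBlock > 0 ? sm.sharedBytes / sharedBytesPerBlock
                                            : sm.maxBlocks;
    int byThreads = sm.maxThreads / threadsPerBlock;
    return minInt(minInt(byRegs, byShared), minInt(byThreads, sm.maxBlocks));
}

// Occupancy as the fraction of the SM's thread slots that are filled.
inline float occupancy(int regsPerThread, int threadsPerBlock, int sharedBytesPerBlock,
                       SmLimits sm = SmLimits()) {
    int blocks = blocksPerSm(regsPerThread, threadsPerBlock, sharedBytesPerBlock, sm);
    return float(blocks * threadsPerBlock) / float(sm.maxThreads);
}
```

With 256 threads per block and 4 KB of shared memory per block, 20 registers per thread allows 6 blocks (all 1536 thread slots filled), while 32 registers per thread allows only 4 blocks. Newer CUDA toolkits also expose a runtime query, `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which accounts for the details this sketch leaves out.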

Loop Unrolling

Limited Processing Bandwidth of an SM

    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }

- How many instructions are required to be carried out in each iteration?
  - Two floating-point arithmetic instructions
  - One loop branch instruction
  - Two address arithmetic instructions
  - One loop counter increment instruction
- Only 1/3 of the instructions executed (2 of 6) are for real computation!

Loop Unrolling

    // Assume BLOCK_SIZE = 16
    Pvalue = Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] + ... + Ms[ty][15] * Ns[15][tx];

- Loop branch instructions: gone
- Loop counter increment instructions: gone
- Address arithmetic instructions: gone
  - Indices are constants
  - The compiler is able to eliminate the address arithmetic instructions
- Only the floating-point arithmetic instructions are still there
- Close to peak performance!
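Instead of writing out all 16 terms by hand, the same effect can usually be requested with `#pragma unroll` on a loop whose trip count is a compile-time constant. The sketch below assumes the tiles are stored as flat row-major arrays of BLOCK_SIZE x BLOCK_SIZE floats; the function name `tileDot` is hypothetical, and whether the compiler fully unrolls the loop is its decision.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // tile width assumed on the slide

// Dot product of row ty of tile Ms with column tx of tile Ns.
// Both tiles are flat row-major arrays of BLOCK_SIZE * BLOCK_SIZE floats.
// The trip count is a compile-time constant, so the compiler can unroll the
// loop and fold k into constant offsets, removing the branch, the counter
// increment and most of the address arithmetic.
__host__ __device__ inline float tileDot(const float *Ms, const float *Ns, int ty, int tx) {
    float Pvalue = 0.0f;
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty * BLOCK_SIZE + k] * Ns[k * BLOCK_SIZE + tx];
    return Pvalue;
}
```

Inside a tiled kernel the two pointers would refer to the shared memory tiles, for example `tileDot(&Ms[0][0], &Ns[0][0], ty, tx)`.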

Kernel Launch Overhead

Kernel Launch Overhead
- Kernel launches are not free
  - A null kernel launch will take non-trivial time
  - The actual number changes with hardware generations and driver software
- If you are launching lots of small grids, you will lose substantial performance due to this effect
- Independent kernel launches are cheaper than dependent kernel launches
  - Dependent launch: some readback to the CPU
- If you are reading back data to the CPU for control decisions, consider doing it on the GPU
  - Even though the GPU is slow at serial tasks, it can do a surprising amount of work before you have used up the kernel launch overhead
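A simple way to get a feel for the cost of a null kernel launch is to time many back-to-back launches with CUDA events. The sketch below is one possible measurement, not a rigorous benchmark: it reports the average time per launch as seen between two events on the default stream, and the names `nullKernel` and `averageLaunchMicroseconds` are hypothetical.

```cuda
#include <cuda_runtime.h>

// A kernel that does nothing: any time measured is launch overhead.
__global__ void nullKernel() {}

// Average time per launch in microseconds over `launches` back-to-back
// launches of the empty kernel. Returns a negative value if a CUDA call
// fails (for example, when no device is available).
float averageLaunchMicroseconds(int launches) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    nullKernel<<<1, 1>>>();     // warm-up launch, excluded from the timing
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) nullKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = 0.0f;
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms * 1000.0f / launches : -1.0f;
}
```

Calling `averageLaunchMicroseconds(1000)` gives a number that depends on the hardware generation and driver, as the slide notes. Inserting a device-to-host copy between launches would approximate the dependent-launch case.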