GPU Programming. Performance Considerations. Miaoqing Huang, University of Arkansas, Fall 2013 (60 slides)


1 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013

2 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

4 4 / 60 Control Flow Instructions are issued per 32 threads (warp). Divergent branches: threads within a single warp take different paths (if-else, ...). Different execution paths within a warp are serialized. Different warps can execute different code with no impact on performance. But what about branches? Example (8-wide SIMD, threads evaluate the condition as T T F T F F F F):

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

Not all ALUs do useful work! Worst case: 1/8 performance. (Source: Beyond Programmable Shading: Fundamentals, slide 27)

5 5 / 60 Project the threads into a linear order Logical 2-D organization: a grid of threads (x,y) such as (0,0) (1,0) (2,0) (3,0) / (0,1) (1,1) (2,1) (3,1) / (0,2) (1,2) (2,2) (3,2) / (0,3) (1,3) (2,3) (3,3). Linear order: the rows are concatenated with x varying fastest, i.e. (0,0) (1,0) (2,0) (3,0) (0,1) (1,1) ... (3,3). Line up the rows with larger y and z coordinates after those with lower ones.
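The linear order above is what warp partitioning operates on. A minimal sketch (not from the slides) of the row-major index, assuming only the standard CUDA built-ins threadIdx and blockDim:

    __device__ int linearThreadIndex()
    {
        // x varies fastest, then y, then z: threads with larger y and z
        // coordinates land after those with lower ones
        return threadIdx.z * (blockDim.y * blockDim.x)
             + threadIdx.y * blockDim.x
             + threadIdx.x;
    }

A thread's warp is then linearThreadIndex() / 32.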

6 6 / 60 Partition the threads into warps

8 8 / 60 Avoid diverging within a warp Example with divergence: if (threadIdx.x > 2) {... } else {... } Branch granularity < warp size

9 9 / 60 Avoid diverging within a warp Example with divergence: if (threadIdx.x > 2) {... } else {... } Branch granularity < warp size Example without divergence: if ((threadIdx.x / WARP_SIZE) > 2) {... } else {... } Branch granularity is a whole multiple of warp size

10 10 / 60 Example: Divergent Iteration

    __global__ void per_thread_sum(int *indices, float *data, float *sums)
    {
        ... // number of loop iterations is data dependent
        for (int j = indices[i]; j < indices[i+1]; j++) {
            sum += data[j];
        }
        sums[i] = sum;
    }

A single thread can drag a whole warp with it for a long time. Know your data patterns. If data is unpredictable, try to flatten peaks by letting threads work on multiple data items.
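For reference, a hedged, self-contained version of the kernel above, with the elided setup filled in under the assumption of one thread per output element:

    __global__ void per_thread_sum(int *indices, float *data, float *sums)
    {
        // Assumption: one thread per row; indices holds row boundaries,
        // with one extra trailing entry
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        // Number of loop iterations is data dependent: one thread with a
        // long row keeps its entire warp resident until it finishes
        for (int j = indices[i]; j < indices[i+1]; j++) {
            sum += data[j];
        }
        sums[i] = sum;
    }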

11 11 / 60 Execution of the Sum Reduction Kernel (figure: iterations of the reduction; only threads 0, 2, 4, 6, 8, 10, ... do work in the first iteration, and fewer threads remain active in each subsequent iteration, so warps stay partially active and divergent)
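A minimal sketch of the interleaved-addressing reduction the figure depicts, assuming the block's inputs are already staged in a shared array partialSum[] of blockDim.x elements:

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        // Only threads whose index is a multiple of 2*stride work in a
        // given iteration, so every warp stays partially active
        // (divergent) until all of its threads have dropped out
        if (threadIdx.x % (2 * stride) == 0)
            partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
    }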

12 12 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

13 13 / 60 Memory Coalescing Off-chip memory is accessed in chunks, even if you read only a single word. If you don't use the whole chunk, bandwidth is wasted. Chunks are aligned to multiples of 128 bytes on Fermi GPUs. (figure: words accessed by thread i and thread j within an aligned chunk)

15 15 / 60 Coalesced Access to Global Memory How is the global memory access of the threads in a warp coalesced? On Fermi, global memory loads and stores by the threads of a warp (i.e., 32 threads) are coalesced. How are coalesced memory accesses aligned into segments? On Fermi, the segment size is always 128 bytes. Thread blocks are partitioned into warps based on thread indices. Each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0.
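To make the rules concrete, a hedged illustration with two hypothetical kernels (not from the slides): consecutive threads touching consecutive words use every byte of a 128-byte segment, while strided access wastes most of each segment it touches:

    __global__ void copy_coalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];   // a warp touches 32 consecutive words:
                          // one 128B segment
    }

    __global__ void copy_strided(float *out, const float *in, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];   // stride > 1: the warp spans several segments
                          // and most of the fetched bytes are never used
    }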

16 16 / 60 Simple Memory Access Pattern Coalesced access in which the threads of a warp access words within a single 128B segment (on Fermi). Not all threads in a warp need to access the memory, and the accesses by threads can be permuted. (figure: a warp's accesses falling inside one 128B segment)

17 17 / 60 Misaligned Access Pattern Misaligned sequential addresses that fall within two 128-byte segments. (figure: a warp's accesses straddling two 128B segments)

20 20 / 60 Matrix Data Access Pattern Directly access data from global memory: each thread reads one row of Md and one column of Nd.

33 33 / 60 Coalesced access to a matrix N (stored row-major). In load iteration 0, threads T0 T1 T2 T3 read N0,0 N0,1 N0,2 N0,3; in load iteration 1 they read N1,0 N1,1 N1,2 N1,3. In every iteration, adjacent threads read adjacent words, so the warp's accesses coalesce.

34 34 / 60 Uncoalesced access to a matrix M (stored row-major). In load iteration 0, threads T0 T1 T2 T3 read M0,0 M1,0 M2,0 M3,0; in load iteration 1 they read M0,1 M1,1 M2,1 M3,1. Adjacent threads read words a full row apart, so the accesses do not coalesce.
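In code, both patterns come from the inner-product loop of the simple (untiled) kernel; a sketch using the Row, Col, and Width names of the tiled kernel on slide 43, in the figures' simplified view of one thread per row and column:

    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k]     // threads differ in Row: addresses
                                          // are Width words apart (uncoalesced)
                * Nd[k * Width + Col];    // threads differ in Col: consecutive
                                          // addresses (coalesced)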

35 35 / 60 Use Shared Memory to Improve Coalescing (figure: Md and Nd, original access pattern vs. tiled access pattern, WIDTH x WIDTH matrices). Copy tiles of Md and Nd into shared memory with coalesced loads, then perform the multiplication with the data in shared memory.

36 36 / 60 Aligned and Sequential (figure: threads 0-31 access a sequential, 128-byte-aligned range of addresses)

37 37 / 60 Aligned and Sequential Memory transactions by compute capability: 1.0 and 1.1 (uncached): 1 x 64B at 128, 1 x 64B at 192. 1.2 and 1.3 (uncached): 1 x 64B at 128, 1 x 64B at 192. 2.0 (cached): 1 x 128B at 128.

38 38 / 60 Aligned and Non-sequential (figure: threads 0-31 access a 128-byte-aligned range with a permuted thread-to-address mapping)

39 39 / 60 Aligned and Non-sequential Memory transactions by compute capability: 1.0 and 1.1 (uncached): 8 x 32B at 128, 8 x 32B at 160, 8 x 32B at 192, 8 x 32B at 224. 1.2 and 1.3 (uncached): 1 x 64B at 128, 1 x 64B at 192. 2.0 (cached): 1 x 128B at 128.

40 40 / 60 Misaligned and Sequential (figure: threads 0-31 access sequential addresses offset from a 128-byte boundary)

41 41 / 60 Misaligned and Sequential Memory transactions by compute capability: 1.0 and 1.1 (uncached): 7 x 32B at 128, 8 x 32B at 160, 8 x 32B at 192, 8 x 32B at 224, 1 x 32B at 256. 1.2 and 1.3 (uncached): 1 x 128B at 128, 1 x 64B at 192, 1 x 32B at 256. 2.0 (cached): 1 x 128B at 128, 1 x 128B at 256.

42 42 / 60 Matrix Multiplication Using Multiple Blocks with Tiles (figure: M, N, and P of size WIDTH x WIDTH partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by select the P tile, thread indices tx, ty in 0..TILE_WIDTH-1 select the element, and the tile index m walks across M and N)

43 43 / 60 Matrix Multiplication Kernel Using Multiple Blocks with Tiles

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;
        float Pvalue = 0;
        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        Pd[Row*Width + Col] = Pvalue;
    }
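A minimal launch sketch for the kernel above, assuming Width is a multiple of TILE_WIDTH and that Md, Nd, and Pd already point to device memory:

    #define TILE_WIDTH 16                    // assumed tile size

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);   // one thread per tile element
    dim3 dimGrid(Width / TILE_WIDTH,         // one block per tile of Pd
                 Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);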

44 44 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

45 45 / 60 Shared Memory Shared memory is banked (Fermi: 32 banks). Consecutive 32-bit words are in different banks. Simultaneous and concurrent access: on Fermi, all the threads in the same warp share the same request. If two or more threads access different words in the same bank, you get bank conflicts. Multiple threads accessing the same word in the same bank cause no bank conflict (the word is broadcast).
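A minimal sketch of the three cases, assuming a Fermi-style layout (32 banks, consecutive 32-bit words in consecutive banks) and a hypothetical shared array s[]:

    __shared__ float s[64];

    float a = s[threadIdx.x];       // stride 1: 32 threads hit 32 distinct
                                    // banks, conflict-free
    float b = s[2 * threadIdx.x];   // stride 2: threads 0 and 16 map to the
                                    // same bank with different words,
                                    // a 2-way bank conflict
    float c = s[0];                 // all threads read the same word:
                                    // broadcast, no conflict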

46 46 / 60 Shared Memory Access on Fermi (figure: several thread-to-bank mapping examples, showing which access patterns are conflict-free and which cause bank conflicts)

47 47 / 60 Matrix Multiplication Kernel Using Multiple Blocks with Tiles (the same tiled kernel as on slide 43) Do we have a shared memory bank conflict problem here?

48 48 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

49 49 / 60 Reminder: Thread Scheduling The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution. Eligible warps are selected for execution based on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. (figure: instruction issue over time interleaves warps W1-W3 from thread blocks TB1-TB3; when a warp stalls, another eligible warp issues. TB = Thread Block, W = Warp)

50 50 / 60 Thread Scheduling What happens if all warps are stalled? No instruction can be issued, and performance is lost. The most common reason for stalling is waiting on global memory. If your code reads global memory every couple of instructions, try to maximize occupancy. What determines occupancy? Register usage per thread: registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks; each thread in the same block can only access registers assigned to itself. Shared memory per thread block.

51 51 / 60 Resource Limits (figure: the per-SM register file and shared memory carved up among resident thread blocks TB0, TB1, TB2) There is a pool of registers and shared memory per SM. Each thread block grabs registers and shared memory. If one or the other is fully utilized, no more thread blocks can be scheduled.
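A worked example with hypothetical Fermi-class numbers (32,768 registers and 48 KB of shared memory per SM assumed): a kernel using 20 registers per thread and 8 KB of shared memory per 256-thread block supports floor(32768 / (20 x 256)) = 6 blocks by registers and floor(49152 / 8192) = 6 blocks by shared memory, so up to 6 blocks (1,536 threads) can be resident. Raising register use to 40 per thread cuts the register limit to 3 blocks and halves occupancy.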

52 52 / 60 How do you know what you're using? Use nvcc --ptxas-options=-v to get register and shared memory usage (e.g., nvcc --ptxas-options=-v kernel.cu). Plug those numbers into the CUDA Occupancy Calculator. How do you influence how many registers you use? Pass the option --maxrregcount=N to nvcc.

53 53 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

54 54 / 60 Limited Processing Bandwidth of an SM

    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }

How many instructions are required to be carried out in each iteration?

55 55 / 60 Limited Processing Bandwidth of an SM

    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }

How many instructions are required to be carried out in each iteration? Two floating-point arithmetic instructions. One loop branch instruction. Two address arithmetic instructions. One loop counter increment instruction.

56 56 / 60 Limited Processing Bandwidth of an SM

    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }

How many instructions are required to be carried out in each iteration? Two floating-point arithmetic instructions. One loop branch instruction. Two address arithmetic instructions. One loop counter increment instruction. Only 1/3 of the instructions executed perform real computation!!!

57 57 / 60 Loop Unrolling

    // Assume BLOCK_SIZE = 16
    Pvalue = Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] + ...
           + Ms[ty][15] * Ns[15][tx];

Loop branch instructions: gone. Loop counter increment instructions: gone. Address arithmetic instructions: gone, because the indices are constants and the compiler is able to eliminate the address arithmetic. Only floating-point arithmetic instructions are left.

58 58 / 60 Loop Unrolling

    // Assume BLOCK_SIZE = 16
    Pvalue = Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] + ...
           + Ms[ty][15] * Ns[15][tx];

Loop branch instructions: gone. Loop counter increment instructions: gone. Address arithmetic instructions: gone, because the indices are constants and the compiler is able to eliminate the address arithmetic. Only floating-point arithmetic instructions are left. Close to peak performance!!!
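Writing the unrolled expression by hand is tedious for larger tiles; a minimal alternative sketch using the CUDA compiler's unrolling directive, which has the same effect when the trip count is a compile-time constant:

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k) {
        // With BLOCK_SIZE known at compile time, the compiler replicates
        // the body with k as a constant, removing branch, counter-increment,
        // and address arithmetic just like the manual unrolling above
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }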

59 59 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling Kernel Launch Overhead

60 60 / 60 Kernel Launch Overhead Kernel launches are not free: even a null kernel launch takes non-trivial time. The actual number changes with hardware generations and driver software. If you are launching lots of small grids, you will lose substantial performance to this effect. Independent kernel launches are cheaper than dependent kernel launches (a dependent launch involves some readback to the CPU). If you are reading back data to the CPU for control decisions, consider making them on the GPU: even though the GPU is slow at serial tasks, it can do a surprising amount of work in the time one kernel launch costs.
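A minimal sketch, assuming the standard CUDA event API, for measuring per-launch overhead with a null kernel:

    __global__ void null_kernel() {}

    void measure_launch_overhead()
    {
        const int N = 1000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < N; ++i)
            null_kernel<<<1, 1>>>();   // does no work, so the time is overhead
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);    // wait for all launches to drain
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // ms / N approximates the cost of one kernel launch in milliseconds
    }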
