GPU Programming. Performance Considerations. Miaoqing Huang, University of Arkansas, Fall 2013


1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013

2 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

3 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

4 / 60 Control Flow. Instructions are issued per 32 threads (a warp). Divergent branches: threads within a single warp take different paths (if-else, ...). Different execution paths within a warp are serialized. Different warps can execute different code with no impact on performance. But what about branches? [Figure: over time, eight ALUs execute a branch under a true/false mask: <unconditional shader code> if (x > 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code>. Not all ALUs do useful work; worst case: 1/8 performance. Source: Beyond Programmable Shading: Fundamentals, slide 27.]

5 / 60 Project the threads into a linear order. [Figure: a logical 2-D organization of threads (0,0) through (3,3) is flattened into a linear order.] Rows with larger y and z coordinates are lined up after those with lower ones.

6 / 60 Partition the threads into warps. [Figure: linearized threads 0, 1, 2, ..., 30, 31 | 32, 33, ..., 62, 63 | 64, ... are grouped into consecutive warps of 32.]

7 / 60 Partition the threads into warps.
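As a sketch (not from the slides) of the linearization and warp-partitioning rules just described, expressed in CUDA; the 3-D block case and the WARP_SIZE constant are assumptions based on the usual CUDA conventions:

// Sketch: how a thread's position in a 3-D block maps to a linear index,
// and how that index determines its warp (assuming 32 threads per warp).
__device__ int linearThreadIndex()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x                 // rows with larger y come later
         + threadIdx.z * blockDim.x * blockDim.y;   // planes with larger z come last
}

__device__ int warpIndex()
{
    const int WARP_SIZE = 32;                       // warp size on current NVIDIA GPUs
    return linearThreadIndex() / WARP_SIZE;
}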

8 / 60 Avoid diverging within a warp. Example with divergence: if (threadIdx.x > 2) {... } else {... } Branch granularity < warp size

9 / 60 Avoid diverging within a warp. Example with divergence: if (threadIdx.x > 2) {... } else {... } Branch granularity < warp size. Example without divergence: if ((threadIdx.x / WARP_SIZE) > 2) {... } else {... } Branch granularity is a whole multiple of warp size.
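A small sketch contrasting the two granularities in one kernel (the kernel name and output array are illustrative, not from the slides):

#define WARP_SIZE 32   // warp size on current NVIDIA GPUs (assumption made explicit)

__global__ void branch_examples(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: threads 0-2 of every warp take one path, threads 3-31 the other,
    // so both paths are serialized within each warp.
    if (threadIdx.x > 2)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Non-divergent: all 32 threads of a warp share the same value of
    // threadIdx.x / WARP_SIZE, so each warp takes exactly one path.
    if ((threadIdx.x / WARP_SIZE) > 2)
        out[tid] += 1.0f;
    else
        out[tid] += 2.0f;
}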

10 / 60 Example: Divergent Iteration. __global__ void per_thread_sum(int *indices, float *data, float *sums) { ... // number of loop iterations is data dependent for (int j = indices[i]; j < indices[i+1]; j++) { sum += data[j]; } sums[i] = sum; } A single thread can drag a whole warp with it for a long time. Know your data patterns. If data is unpredictable, try to flatten peaks by letting threads work on multiple data items.

11 / 60 Execution of the Sum Reduction Kernel. [Figure: across iterations, only threads 0, 2, 4, 6, 8, 10, ... remain active, so each warp becomes increasingly divergent.]
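The slide's figure corresponds to a naive in-place reduction of roughly the following form (a sketch, not code from this lecture; the kernel name, BLOCK_SIZE value, and input/output arrays are assumptions):

#define BLOCK_SIZE 256   // example block size; assume blockDim.x == BLOCK_SIZE

__global__ void naive_block_sum(const float *in, float *out)
{
    __shared__ float partialSum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    partialSum[t] = in[blockIdx.x * blockDim.x + t];

    // In each iteration only threads whose index is a multiple of 2*stride do work,
    // which is exactly the divergence pattern shown on the slide.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        out[blockIdx.x] = partialSum[0];
}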

12 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

13-14 / 60 Memory Coalescing (animation sequence). Off-chip memory is accessed in chunks, even if you read only a single word; if you don't use the whole chunk, bandwidth is wasted. Chunks are aligned to multiples of 128 bytes on Fermi GPUs. [Figure: threads i and j each access a word inside the 128-256 byte address range.]


15 / 60 Coalesced Access to Global Memory. How are the global memory accesses of the threads in a warp coalesced? On Fermi, global memory loads and stores by the threads of a warp (i.e., 32 threads) are coalesced. How are coalesced memory accesses aligned into segments? On Fermi, the segment size is always 128 bytes. Thread blocks are partitioned into warps based on thread indices; each warp contains threads of consecutive and increasing thread IDs, with the first warp containing thread 0.
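A sketch of what this means in practice (kernel and array names are illustrative, not from the slides):

// Coalesced: consecutive threads of a warp read consecutive 4-byte words,
// so the warp's 32 loads fall into a single 128-byte segment (on Fermi).
__global__ void coalesced_copy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Uncoalesced: consecutive threads read words that are `stride` elements apart,
// so the warp touches many 128-byte segments and wastes bandwidth.
__global__ void strided_copy(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}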

16 / 60 Simple Memory Access Pattern. Coalesced access in which all threads but a few access their corresponding word in a segment (on Fermi). Not all threads in a warp need to access memory, and the accesses by threads can be permuted. [Figure: a warp's accesses falling within one 128B segment.]

17-19 / 60 Misaligned Access Pattern (animation sequence). Misaligned sequential addresses that fall within two 128-byte segments. [Figure: a warp's accesses spanning two adjacent 128B segments.]


20-32 / 60 Matrix Data Access Pattern (animation sequence). Directly access data from global memory: each thread reads one row of Md and one column of Nd.


33 / 60 Coalesced access to a matrix. [Figure: N is a 4x4 matrix stored in row-major order; in load iteration k, threads T0-T3 access the consecutive elements N k,0 ... N k,3 of the same row, which are adjacent in memory, so the accesses coalesce.]

34 / 60 Uncoalesced access to a matrix. [Figure: M is a 4x4 matrix stored in row-major order; in load iteration k, threads T0-T3 access M 0,k, M 1,k, M 2,k, M 3,k, i.e., elements from different rows that are Width elements apart, so the accesses cannot coalesce.]
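A sketch of the two access directions in kernel form (illustrative only; the kernels, parameter names, and row-major layout convention are assumptions matching the figures above):

// Illustrative only: each thread copies one element from a Width x Width matrix.
__global__ void coalesced_row_read(const float *N, float *out, int Width, int k)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read adjacent elements of row k: coalesced.
    out[c] = N[k * Width + c];
}

__global__ void uncoalesced_column_read(const float *M, float *out, int Width, int k)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read column k of adjacent rows: addresses are Width elements
    // apart, so the warp's reads cannot be combined into a single segment.
    out[r] = M[r * Width + k];
}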

35 / 60 Use Shared Memory to Improve Coalescing. [Figure: original access pattern on Md and Nd vs. tiled access pattern.] Copy tiles into shared memory, then perform the multiplication with the data in shared memory.

36 / 60 Aligned and Sequential. [Figure: threads 0-31 access sequential addresses within the aligned range 128-255.]

37 / 60 Aligned and Sequential: memory transactions by compute capability. 1.0 and 1.1 (uncached): 1 x 64B at 128, 1 x 64B at 192. 1.2 and 1.3 (uncached): 1 x 64B at 128, 1 x 64B at 192. 2.0 (cached): 1 x 128B at 128.

38 / 60 Aligned and Non-sequential. [Figure: threads 0-31 access permuted addresses within the aligned range 128-255.]

39 / 60 Aligned and Non-sequential: memory transactions by compute capability. 1.0 and 1.1 (uncached): 8 x 32B at 128, 8 x 32B at 160, 8 x 32B at 192, 8 x 32B at 224. 1.2 and 1.3 (uncached): 1 x 64B at 128, 1 x 64B at 192. 2.0 (cached): 1 x 128B at 128.

40 / 60 Misaligned and Sequential. [Figure: threads 0-31 access sequential addresses starting just past the 128-byte boundary, spanning the segments at 128 and 256.]

41 / 60 Misaligned and Sequential: memory transactions by compute capability. 1.0 and 1.1 (uncached): 7 x 32B at 128, 8 x 32B at 160, 8 x 32B at 192, 8 x 32B at 224, 1 x 32B at 256. 1.2 and 1.3 (uncached): 1 x 128B at 128, 1 x 64B at 192, 1 x 32B at 256. 2.0 (cached): 1 x 128B at 128, 1 x 128B at 256.

42 / 60 Matrix Multiplication Using Multiple Blocks with Tiles. [Figure: matrices M, N, and P are partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) select the output tile of P, thread indices (tx, ty) select the element within it, and the loop index m walks across the tiles of M and N.]

43 / 60 Matrix Multiplication Kernel Using Multiple Blocks with Tiles

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
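A host-side launch for this kernel might look like the following sketch (the TILE_WIDTH value and the assumption that Width is a multiple of TILE_WIDTH are illustrative, not stated on the slide):

#define TILE_WIDTH 16   // example tile size (assumption)

// Md, Nd, Pd are device pointers to Width x Width matrices already copied to the GPU.
void launchMatrixMul(float *Md, float *Nd, float *Pd, int Width)
{
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // one thread per tile element
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}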

44 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

45 / 60 Shared Memory. Shared memory is banked (Fermi: 32 banks); consecutive 32-bit words are in different banks. Simultaneous and concurrent access: on Fermi, all the threads in the same warp share the same request. If two or more threads access the same bank but different words, you get bank conflicts. Multiple threads accessing the same word in the same bank cause no bank conflict.

46 / 60 Shared Memory Access on Fermi. [Figure: several mappings of threads 0-31 onto banks 0-31, illustrating conflict-free patterns (one thread per bank, possibly permuted) and conflicting patterns (multiple threads hitting the same bank).]
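A sketch of how bank conflicts arise and are avoided in practice (the kernel, array sizes, and the assumption of a 32-thread block are illustrative, not from the slides):

__global__ void bank_conflict_examples(float *out)
{
    __shared__ float tile[32][32];    // row-major: consecutive words fall in
                                      // consecutive banks on Fermi
    int t = threadIdx.x;              // assume blockDim.x == 32 (one warp)

    // Conflict-free: consecutive threads touch consecutive 32-bit words,
    // i.e., 32 different banks.
    tile[0][t] = (float)t;

    // 32-way conflict: threads access words that are 32 apart, so every
    // access maps to bank 0 and the requests are serialized.
    tile[t][0] = (float)t;

    // Common fix: pad the row length to 33 so that column 0 of successive
    // rows falls into different banks.
    __shared__ float padded[32][33];
    padded[t][0] = (float)t;          // now conflict-free

    __syncthreads();
    out[t] = tile[0][t] + padded[t][0];
}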

47 / 60 Matrix Multiplication Kernel Using Multiple Blocks with Tiles (same kernel as slide 43). Do we have a shared memory bank conflict problem here?

48 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

49 / 60 Reminder: Thread Scheduling. The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution, and eligible warps are selected for execution according to a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. [Figure: timeline of instructions issued from warps of thread blocks TB1, TB2, TB3, interleaved as warps stall. TB = Thread Block, W = Warp.]

50 / 60 Thread Scheduling. What happens if all warps are stalled? No instruction is issued, so performance is lost. The most common reason for stalling is waiting on global memory; if your code reads global memory every couple of instructions, try to maximize occupancy. What determines occupancy? Register usage per thread and shared memory per thread block. Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is NOT accessible by threads in other blocks, and each thread in a block only accesses the registers assigned to itself.

51 / 60 Resource Limits. [Figure: per-SM pools of registers and shared memory being claimed by thread blocks TB0, TB1, TB2.] There is a pool of registers and shared memory per SM, and each thread block grabs registers and shared memory. If one or the other is fully utilized, no more thread blocks can be scheduled on that SM.

52 / 60 How do you know what you're using? Use nvcc --ptxas-options=-v to get register and shared memory usage, then plug those numbers into the CUDA Occupancy Calculator. How do you influence how many registers you use? Pass the option -maxrregcount=x to nvcc.
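For example (the source file name and register cap are illustrative; the exact report format varies by toolkit version, and -arch=sm_20 simply matches the Fermi GPUs discussed here):

# Print per-kernel register and shared memory usage at compile time
nvcc -arch=sm_20 --ptxas-options=-v matrixmul.cu -o matrixmul

# Recompile with a register cap (e.g., 32 per thread) to see whether occupancy improves
nvcc -arch=sm_20 --ptxas-options=-v -maxrregcount=32 matrixmul.cu -o matrixmul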

53 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

54 / 60 Limited Processing Bandwidth of an SM. for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } How many instructions have to be carried out in each iteration?

55 / 60 Limited Processing Bandwidth of an SM. for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } How many instructions have to be carried out in each iteration? Two floating-point arithmetic instructions, one loop branch instruction, two address arithmetic instructions, and one loop counter increment instruction.

56 / 60 Limited Processing Bandwidth of an SM. for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } How many instructions have to be carried out in each iteration? Two floating-point arithmetic instructions, one loop branch instruction, two address arithmetic instructions, and one loop counter increment instruction. Only 1/3 of the instructions executed are for real computation!

57 / 60 Loop Unrolling. // Assume BLOCK_SIZE = 16: Pvalue = Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] + ... + Ms[ty][15] * Ns[15][tx]; Loop branch instructions: gone. Loop counter increment instructions: gone. Address arithmetic instructions: gone, because the indices are constants and the compiler can eliminate the address arithmetic. Only the floating-point arithmetic instructions remain.

58 / 60 Loop Unrolling. // Assume BLOCK_SIZE = 16: Pvalue = Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] + ... + Ms[ty][15] * Ns[15][tx]; Loop branch instructions: gone. Loop counter increment instructions: gone. Address arithmetic instructions: gone, because the indices are constants and the compiler can eliminate the address arithmetic. Only the floating-point arithmetic instructions remain. Close to peak performance!
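Rather than unrolling by hand, the same effect can usually be obtained with the compiler's unrolling directive (a sketch that reuses the loop from the slide; full unrolling requires BLOCK_SIZE to be a compile-time constant):

// Ask nvcc to fully unroll the loop; with a constant trip count the generated
// code has no branch, no counter increment, and constant-folded addresses.
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k) {
    Pvalue += Ms[ty][k] * Ns[k][tx];
}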

59 / 60 Outline: Control Flow Divergence, Memory Coalescing, Shared Memory Bank Conflicts, Occupancy, Loop Unrolling, Kernel Launch Overhead

60 / 60 Kernel Launch Overhead. Kernel launches are not free: even a null kernel launch takes non-trivial time, and the actual number changes with hardware generations and driver software. If you are launching lots of small grids, you will lose substantial performance due to this effect. Independent kernel launches are cheaper than dependent kernel launches (a dependent launch involves some readback to the CPU). If you are reading back data to the CPU for control decisions, consider making those decisions on the GPU: even though the GPU is slow at serial tasks, it can do a surprising amount of work in the time a kernel launch costs.
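A sketch for measuring the launch overhead on your own hardware (the iteration count and the use of CUDA events for timing are illustrative choices, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void null_kernel() {}   // does nothing; measures pure launch cost

int main()
{
    const int N = 10000;           // number of launches to average over (arbitrary)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    null_kernel<<<1, 1>>>();       // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        null_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %f us\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}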