CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

Size: px

Start display at page:

Download "CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA"

Alvin Garrison
6 years ago
Views:

1 CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA

2 What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD Method, Session S3008, Cliff Woolley 2

3 What Does the Application Do? It does not matter!!! We care about memory accesses, instructions, latency, Companion code (with a different input file) 3

4 4

5 Our Method Trace your application Identify the hot spot and profile it Identify the performance limiter Memory Bandwidth Instruction Throughput Latency Optimize the code Iterate 5

6 Our Environment We use Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF, Microsoft Windows 7 x64, Microsoft Visual Studio 2010 SP1, CUDA 5.0, Driver , Nvidia Nsight

7 ITERATION 1 7

8 Trace the Application 8

9 CUDA Launch Summary spmv_kernel_v0 is a hot spot, let s start here!!! Kernel Time Speedup Original version 457.1ms 9

10 Profile the Most Expensive Kernel 10

11 CUDA Launches 11

12 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 12

13 Memory Bandwidth Utilization of DRAM Bandwidth: 37.67% We are not limited by the memory bandwidth 13

14 Instruction Throughput Instructions Per Clock (IPC): 0.04 We are not limited by instruction throughput 14

15 Latency First two things to check: Occupancy Memory accesses (coalesced/uncoalesced accesses) Other things to check (if needed): Control flow efficiency (branching, idle threads) Divergence Bank conflicts in shared memory 15

16 Latency Occupancy: 47.58% Achieved / 50% Theoretical Eligible Warps per Active Cycle: >4.7 on average On GK110, 4 Eligible Warps are enough: Not an issue 16

17 Latency Memory Accesses: Load: 22 Transactions per Request Store: 8 Transactions per Request We have too many uncoalesced accesses!!! 17

18 Where Do Those Accesses Happen? CUDA Source Profiler: Find where most of the uncoalesced requests happen Tip: Sort L2 Global Transactions Executed 18

19 Access Pattern Double precision numbers: 64-bit Thread 0 Thread 1 Thread 2 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Per Warp: Up to 32 L1 Transactions / Ideal case: 2 Transactions Up to 32 L2 Transactions / Ideal case: 8 Transactions 19

20 Access Pattern Next iteration: Thread 0 Thread 1 Thread 2 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Idea: Use the Read-only cache (LDG load) On Fermi: Use a texture or Use 48KB for L1 20

21 First Modification: Use ldg We change the source code: It is slower: 625.8ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x 21

22 First Modification: Use ldg Less L2 to SM traffic: 857.1MB transferred (it was 906.2MB) 22

23 First Modification: Use ldg Why does 6% less traffic lead to 37% performance loss? Instruction Efficiency (Eligible Warps per Active Cycle): The average number of Eligible Warps dropped below 1 23

24 First Modification: Use ldg There are already a lot of Active Warps per Cycle 24

25 First Modification: Use ldg Warps cannot issue because they have to wait Warps wait for Texture in 91.1% of the cases 25

26 First Modification: Use ldg The loads compete for the cache too much Low hit rate: 7.7% Texture requests introduce too much latency Things to check in those cases: Texture Hit Rate: Low means no reuse Issue Efficiency and Stall Reasons It was actually expected: GPU caches are not CPU caches!!! 26

27 First Modification: Use ldg Other accesses may benefit from LDGs Memory blocks accessed several times by several threads How can we detect it? Source code analysis There is no way to detect it from Nsight 27

28 First Modification: Use ldg We change the source code In y = Ax, we use ldg when loading x It s faster: 403.4ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x LDG to load X 403.4ms 1.13x 28

29 First Modification: Use ldg Much less L2 to SM traffic: 774MB (it was 906.2MB) Good hit rate in Texture Cache: 83% 29

30 ITERATION 2 30

31 CUDA Launch Summary spmv_kernel_v2 is still a hot spot, so we profile it 31

32 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 32

33 Identify the Main Limiter We are still limited by latency Low DRAM utilization: 36.48% Low IPC:

34 Identify the Main Limiter We are not limited by the Occupancy We have > 6 Eligible Warps per Active Cycle We are limited by uncoalesced accesses: 48.92% of Replays 34

35 Second Strategy: Change Memory Accesses 4 consecutive threads load 4 consecutive elements Threads 0, 1, 2, 3 Threads 4, 5, 6, 7 Threads 8, 9, 10, 11 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Per Warp: Up to 8 L1 Transactions / Ideal case: 2 Transactions Up to 8 L2 Transactions / Ideal case: 8 Transactions 35

36 Second Strategy: Change Memory Accesses It s much faster: 161.7ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x LDG to load X 403.4ms 1.13x Coalescing with 4 Threads 161.7ms 2.83x 36

37 Second Strategy: Change Memory Accesses We have much fewer Transactions per Request: 5.51 (LD) 37

38 Second Strategy: Change Memory Accesses Much less traffic from L2: 230.5MB (it was 774MB) Much less DRAM traffic: 210.1MB (it was 503.1MB) 38

39 ITERATION 3 39

40 CUDA Launch Summary spmv_kernel_v3 is still a hot spot, so we profile it 40

41 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 41

42 Identify the Main Limiter We are still limited by latency Low DRAM utilization: 37.67% Low IPC:

43 Latency Occupancy: 57.80% Achieved / 62.50% Theoretical Eligible Warps per Active Cycle: ~3 on average We need 4 warps on GK110, so ~3 could be an issue 43

44 Latency Occupancy is limited by the number of registers We change the number of registers with launch_bounds launch_bounds (BLOCK_SIZE, MIN_BLOCKS) It does not really help 44

45 Latency Memory Accesses: Load: 5.51 Transactions per Request Store: 2 Transactions per Request We still have too many uncoalesced accesses 45

46 Latency We still have too many uncoalesced accesses Nearly 70% of Instruction Serialization (Replays) Stall Reasons: 48.1% due to Data Requests 46

47 Where Do Those Accesses Happen? Same lines of code as before 47

48 What Can We Do? In our kernel: 4 threads per row of the matrix A Threads 0, 1, 2, 3 Threads 4, 5, 6, 7 Threads 8, 9, 10, 11 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) New approach: 1 warp of threads per row of the matrix A Threads 0, 1, 2, 3,, 31 (some possibly idle) L2 Transaction (32B) L2 Transaction (32B) L2 Transaction (32B) 48

49 One Warp Per Row It s faster: 140.4ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x LDG to load X 403.4ms 1.13x Coalescing with 4 Threads 161.7ms 2.83x 1 Warp per Row 140.4ms 3.26x 49

50 One Warp Per Row Much fewer Transactions Per Request: 1.37 (LD) / 1 (ST) 50

51 ITERATION 4 51

52 One Warp Per Row spmv_kernel_v4 is the hot spot 52

53 One Warp Per Row DRAM utilization: 40.64% IPC: 1.57 We are still limited by latency 53

54 One Warp Per Row Occupancy and memory accesses are OK (not shown) Control Flow Efficiency: 86.59% Only 72.5% threads active in the expensive loop 54

55 One Half Warp Per Row It is faster: 114.7ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x LDG to load X 403.4ms 1.13x Coalescing with 4 Threads 161.7ms 2.83x 1 Warp per Row 140.4ms 3.26x ½ Warp per Row 114.7ms 3.99x 55

56 ITERATION 5 56

57 One Half Warp Per Row DRAM utilization: 49.79% IPC: 1.34 We are still limited by latency 57

58 One Half Warp Per Row Memory accesses are good enough Occupancy could be an issue: ~3.2 Eligible Warps per Cycle Occupancy is limited by registers But forcing register count does not improve performance 58

59 One Half Warp Per Row Branch divergence induce latency We have 23.1% of divergent branches 59

60 One Half Warp Per Row We fix branch divergence It is faster: 91.2ms Kernel Time Speedup Original version 457.1ms LDG to load A 625.8ms 0.73x LDG to load X 403.4ms 1.13x Coalescing with 4 Threads 161.7ms 2.83x 1 Warp per Row 140.4ms 3.26x ½ Warp per Row 114.7ms 3.99x No divergence 91.2ms 5.01x 60

61 One Half Warp Per Row DRAM utilization: 62.57% IPC: 1.56 We achieve a much better bandwidth 61

62 So Far We have consecutively: Improved caching using ldg (use with care) Improved coalescing Improved control flow efficiency Improved branching Our new kernel is 5x faster than our first implementation Nsight helped us a lot 62

63 63

64 ITERATION 6 64

65 Next Kernel We are satisfied with the performance of spmv_kernel We move to the next kernel: jacobi_smooth 65

66 66

67 What Have You Seen? An iterative method to optimize your GPU code Trace your application Identify the hot spot and profile it Identify the performance limiter Optimize the code Iterate A way to conduct that method with Nsight VSE 67

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA