
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner, GTC 2018

BEFORE YOU START: The Five Steps to Enlightenment
1. Know your hardware: What are the target machines, how many nodes? Are machine-specific optimizations okay?
2. Know your tools: What are the strengths and weaknesses of each tool? Learn how to use them (and learn one well!)
3. Know your application: What does it compute? How is it parallelized? What final performance is expected?
4. Know your process: Performance optimization is a constant learning process.
5. Make it so!

THE APOD CYCLE
1. Assess: profile, find indicators, identify the performance limiter, analyze
2. Parallelize
3. Optimize (3b. build knowledge)
4. Deploy and Test

GUIDING OPTIMIZATION EFFORT: Drilling Down into the Metrics
Challenge: how do you know where to start? Top-down approach:
- Scope: find the hotspot kernel
- Identify the performance limiter of the hotspot
- Find performance bottleneck indicators related to the limiter
- Identify the associated regions in the source code
- Come up with a strategy to fix it and change the code
- Start again

KNOW YOUR HARDWARE: VOLTA ARCHITECTURE

VOLTA V100 FEATURES
- Volta Architecture: most productive GPU
- Improved NVLink & HBM2: efficient bandwidth
- Volta MPS: inference utilization
- Improved SIMT Model: new algorithms
- Tensor Core: 120 programmable TFLOPS for deep learning

GPU COMPARISON
                             P100 (SXM2)      V100 (SXM2)
Double/Single/Half TFlop/s   5.3/10.6/21.2    7.8/15.7/125 (Tensor Cores)
Memory Bandwidth (GB/s)      732              900
Memory Size                  16 GB            16 GB
L2 Cache Size                4096 KB          6144 KB
Base/Boost Clock (MHz)       1328/1480        1312/1530
TDP (Watts)                  300              300

VOLTA GV100 SM
FP32 units                 64
FP64 units                 32
INT32 units                64
Tensor Cores               8
Register File              256 KB
Unified L1/Shared memory   128 KB
Active Threads             2048

IMPROVED L1 CACHE
Pascal SM: load/store units -> shared memory (64 KB, low latency) and L1$ (24 KB, streaming) -> L2$ (4 MB)
Volta SM: load/store units -> unified L1$ and shared memory (128 KB, low latency) -> L2$ (6 MB)

KNOW YOUR TOOLS: PROFILERS

PROFILING TOOLS: Many Options!
From NVIDIA: nvprof, NVIDIA Visual Profiler (nvvp), NVIDIA Nsight Visual Studio Edition; coming soon: NVIDIA Nsight Systems and NVIDIA Nsight Compute.
Third party (tools using CUPTI): TAU Performance System, VampirTrace, PAPI CUDA component, HPC Toolkit.
Without loss of generality, this talk shows nvvp screenshots.

THE NVVP PROFILER WINDOW: timeline, summary, guide, and analysis results panes.

KNOW YOUR APPLICATION: HPGMG

HPGMG: High-Performance Geometric Multi-Grid, Hybrid Implementation
The V-cycle/F-cycle alternates smoother & residual steps on progressively coarser levels down to a direct solve. Fine levels are executed on throughput-optimized processors (GPU); coarse levels below a threshold are executed on latency-optimized processors (CPU). http://crd.lbl.gov/departments/computer-science/par/research/hpgmg/

MAKE IT SO: ITERATION 1 (2ND-ORDER 7-POINT STENCIL)

IDENTIFY HOTSPOT
Identify the hotspot: smooth_kernel()
Kernel              Time       Speedup
Original Version    2.079 ms   1.00x

IDENTIFY PERFORMANCE LIMITER: compare memory utilization vs compute utilization in the profiler.

PERFORMANCE LIMITER CATEGORIES
Memory utilization vs compute utilization gives four possible combinations:
- Compute bound: compute utilization high, memory low
- Bandwidth bound: memory utilization high, compute low
- Latency bound: neither utilization high
- Compute and bandwidth bound: both high

LATENCY BOUND ON P100

BANDWIDTH BOUND ON V100

DRILLING DOWN: LATENCY ANALYSIS (V100)
The profiler warns about low occupancy, limited by a block size of only 8x4 = 32 threads.

OCCUPANCY: GPU Utilization
Each SM has limited resources:
- max. 64K 32-bit registers, distributed between threads
- max. 48 KB (96 KB opt-in) of shared memory per block (96 KB per SM)
- max. 32 active blocks per SM
Full occupancy: 2048 threads per SM (64 warps). When a resource is used up, occupancy is reduced. Values vary with compute capability.
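The transcription does not show how the occupancy figure was obtained; a minimal sketch (the kernel body and names here are placeholders, not the real smoother) uses the CUDA occupancy API to see how a block size translates into active blocks and occupancy per SM:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the real smooth_kernel() is not shown in the transcription.
__global__ void dummy_smoother(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    int blockSize = 8 * 4;   // the 32-thread blocks flagged by the profiler
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, dummy_smoother, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Occupancy = resident threads / max threads per SM (2048 on Volta).
    float occupancy = (float)(blocksPerSm * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks/SM -> %.0f%% occupancy\n", blocksPerSm, 100.f * occupancy);
    return 0;
}

With 32-thread blocks and the 32-blocks-per-SM limit mentioned above, at most 1024 threads are resident, i.e. 50% occupancy at best.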

LATENCY
GPUs cover latencies by having a lot of work in flight: while one warp waits (latency), other warps issue. With enough active warps the latency is fully covered; with too few warps the latency is exposed and at times no warp can issue.

LATENCY AT HIGH OCCUPANCY
Many active warps, but with high-latency instructions latency can still be exposed: at times no warp is able to issue even at high occupancy.

LOOKING FOR MORE INDICATORS
Source code association: for line numbers, compile with nvcc -lineinfo.
The profiler reports 12 global load transactions per 1 request.

MEMORY TRANSACTIONS: BEST CASE
A warp issues a 32 x 4B aligned and consecutive load/store request; threads read different elements of the same 128B segment.
- 1x 128B load/store request per warp
- 1x 128B L1 transaction per warp
- 4x 32B L2 transactions per warp
1x L1 transaction: 128B needed / 128B transferred. 4x L2 transactions: 128B needed / 128B transferred.

MEMORY TRANSACTIONS: WORST CASE
Threads in a warp read/write 4B words with 128B between words (stride: 32 x 4B); each thread reads the first 4B of a different 128B segment.
- 1x 128B load/store request per warp
- 1x 128B L1 transaction per thread
- 1x 32B L2 transaction per thread
32x L1 transactions: 128B needed / 32 x 128B transferred. 32x L2 transactions: 128B needed / 32 x 32B transferred.
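A minimal sketch (not from the talk) that reproduces the two patterns above: copyCoalesced hits the best case, copyStrided the worst case described on this slide.

// Best case: thread i of a warp reads element i, so a warp touches one
// aligned, consecutive 128B segment (32 x 4B).
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Worst case: consecutive threads read 128B apart (stride 32 x 4B), so each
// thread of the warp touches a different 128B segment.
__global__ void copyStrided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i] = in[i * 32];
}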

TRANSACTIONS AND REPLAYS
With replays, requests take more time and use more resources: more instructions issued, more memory traffic, increased execution time. Each replayed transaction means extra work for the SM, extra latency, and extra memory traffic while the data for another subset of the warp's threads is transferred.

FIX: BETTER GPU TILING
Before vs after: block size up, transactions per access down, memory utilization up (+10%).
Kernel                   Time       Speedup
Original Version         2.079 ms   1.00x
Better Memory Accesses   1.756 ms   1.18x
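The transcription does not show the actual tiling change; a hedged sketch of the idea (the domain sizes nx/ny/nz are placeholders) is to widen the block's x dimension to a full warp so every warp loads contiguous, aligned 128B lines:

#include <cuda_runtime.h>

// Before: dim3 block(8, 4) -> a warp spans several short rows and several 128B lines.
// After: 32 threads in x -> each warp maps to one contiguous, aligned 128B line.
void pickLaunchConfig(int nx, int ny, int nz, dim3 *grid, dim3 *block) {
    *block = dim3(32, 4, 1);
    *grid  = dim3((nx + block->x - 1) / block->x,
                  (ny + block->y - 1) / block->y,
                  (nz + block->z - 1) / block->z);
}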


ITERATION 2: DATA MIGRATION

PAGE FAULTS: visible in the profiler details view.

MEMORY MANAGEMENT: Using Unified Memory
No changes to data structures, no explicit data movements. Developer view with Unified Memory: a single pointer for CPU and GPU data. Use cudaMallocManaged for allocations.
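This is not the HPGMG code; just a minimal Unified Memory sketch showing the single-pointer model the slide describes:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // first touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // pages migrate to the GPU on demand
    cudaDeviceSynchronize();
    printf("data[0] = %.1f\n", data[0]);            // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}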

GPU UNIFIED MEMORY: Eliminating page migrations and faults
In the F-cycle, the coarse levels below the threshold run on the CPU, which causes page faults. Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory).

PAGE FAULTS: almost gone.

PAGE FAULTS: significant speedup for the affected kernel.

MEM ADVICE API (not used here)
cudaMemPrefetchAsync(ptr, length, destDevice, stream): migrate data to destDevice, overlapped with compute; updates the page table with much lower overhead than a page fault in the kernel; async operation that follows CUDA stream semantics.
cudaMemAdvise(ptr, length, advice, device): specifies allocation and usage policy for a memory region; the user can set and unset it at any time.
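A hedged usage sketch of the two calls above (the buffer, its size, and the stream are placeholders, and the advice choices are just one plausible policy, not the one from the talk):

#include <cuda_runtime.h>

void tuneManagedBuffer(float *buf, size_t bytes, int gpuId, cudaStream_t stream) {
    // Data that is mostly read by kernels can be read-duplicated instead of migrated.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetReadMostly, gpuId);
    // Prefer keeping the buffer resident on the GPU.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, gpuId);
    // Migrate it ahead of time, overlapped with other work in 'stream'.
    cudaMemPrefetchAsync(buf, bytes, gpuId, stream);
}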

ITERATION 3: REGISTER OPTIMIZATION AND CACHING

LIMITER: STILL MEMORY BANDWIDTH

GPU MEMORY HIERARCHY (V100)
Bring reused data closer to the SMs:
- Registers (256 KB/SM): good for intra-thread data reuse
- Shared memory / L1$ (128 KB/SM): good for explicit intra-block data reuse
- L2$ (6144 KB): implicit data reuse
Below the L2$ sits global memory (framebuffer).

CACHING IN REGISTERS: no data loaded initially.

CACHING IN REGISTERS: load the first set of data.

CACHING IN REGISTERS: perform the stencil calculation.

CACHING IN REGISTERS: naively load the next set of data?

CACHING IN REGISTERS: reusing the already loaded data is better (keep, keep, load only the new value).

CACHING IN REGISTERS: repeat the stencil. Higher register usage may result in reduced occupancy => trade-off (run experiments!).
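A hedged sketch of the idea (a simplified 3-point stencil in z with placeholder names, not the HPGMG smoother): each thread marches along z, keeps the two values it will reuse in registers, and loads only one new value per step.

__global__ void stencilZMarch(const float *in, float *out, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    size_t plane = (size_t)nx * ny;
    size_t idx   = (size_t)y * nx + x;

    float below  = in[idx];              // z = 0
    float center = in[idx + plane];      // z = 1
    for (int z = 1; z < nz - 1; ++z) {
        float above = in[idx + (z + 1) * plane];        // the only new load per step
        out[idx + z * plane] = -2.0f * center + below + above;
        below  = center;                                // keep
        center = above;                                 // keep
    }
}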

THE EFFECT OF REGISTER CACHING
Transactions for cached loads are reduced by a factor of 8. Memory utilization is still high, but less redundant data is transferred.
Kernel                   Time       Speedup
Original Version         2.079 ms   1.00x
Better Memory Accesses   1.756 ms   1.18x
Register Caching         1.486 ms   1.40x

SHARED MEMORY
Programmer-managed cache; great for caching data reused across threads in a CTA. 128 KB are split between shared memory and L1 cache per SM; each block can use at most 96 KB of shared memory on GV100 (search for cudaFuncAttributePreferredSharedMemoryCarveout in the docs).

__global__ void sharedMemExample(float *stencil, const float *d) {
    __shared__ float s[64];
    int t = threadIdx.x;
    s[t] = d[t];                 // global -> registers -> shared
    __syncthreads();
    if (t > 0 && t < 63)
        stencil[t] = -2.0f*s[t] + s[t-1] + s[t+1];   // shared -> registers -> global
}
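If a kernel wants more of the 128 KB as shared memory, the carveout attribute mentioned above can be set per function. A hedged usage sketch based on the kernel above (the 64% hint and the device pointers are placeholders; the hardware may round the hint):

void launchSharedMemExample(float *stencil_d, const float *data_d) {
    // Hint: prefer roughly 64% of the combined 128 KB as shared memory for this kernel.
    cudaFuncSetAttribute(sharedMemExample,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 64);
    sharedMemExample<<<1, 64>>>(stencil_d, data_d);   // 64 threads, matching the 64-element tile
    cudaDeviceSynchronize();
}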


ITERATION 4: KERNELS WITH INCREASED ARITHMETIC INTENSITY

OPERATIONAL INTENSITY
Operational intensity = arithmetic operations / bytes written and read. Our stencil kernels have very low operational intensity. It might be beneficial to use a different algorithm with higher operational intensity; here this might be achieved by using higher-order stencils.
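A rough, hedged back-of-the-envelope using the V100 numbers from the comparison table earlier (7.8 FP64 TFLOP/s, 900 GB/s); the per-point flop and byte counts below are illustrative assumptions, not measurements from the talk:

    compute/bandwidth balance  = 7.8e12 flop/s / 900e9 B/s  ~ 8.7 flop/byte
    7-point stencil            ~ 10 flop per point / 16 B per point
                                 (one 8B load + one 8B store, neighbors assumed cached)
                               ~ 0.6 flop/byte

So the stencil sits far below the balance point on the bandwidth-bound side, which is why raising operational intensity (e.g. with higher-order stencils) can pay off.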

ILP VS OCCUPANCY
Earlier we looked at how occupancy helps hide latency by providing independent threads of execution. When our code requires many registers, occupancy will be limited, but we can still get instruction-level parallelism (ILP) within each thread. Occupancy is helpful for achieving performance, but not always required.
Some algorithms, such as matrix multiplication, allow increases in operational intensity by using more registers for local storage while simultaneously offering decent ILP. In these cases it can be beneficial to maximize ILP and operational intensity at the cost of occupancy.
Dependent instructions:   a = b + c; d = a + f;
Independent instructions: a = b + c; d = e + f;
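A minimal sketch (not from the talk) of trading occupancy for ILP: each thread keeps several independent partial sums in registers, so consecutive adds do not depend on each other and can be in flight simultaneously even when occupancy is low.

__global__ void sumWithIlp(const float *in, float *out, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // Four independent accumulators -> up to four adds in flight per thread.
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {
        s0 += in[i];
        s1 += in[i + stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < n; i += stride) s0 += in[i];   // remainder elements
    out[tid] = s0 + s1 + s2 + s3;             // final, dependent combine per thread
}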


SUMMARY

SUMMARY: Performance Optimization is a Constant Learning Process
1. Know your application
2. Know your hardware
3. Know your tools
4. Know your process
5. Make it so!
In each iteration: identify the hotspot, classify the performance limiter, and look for indicators.

REFERENCES
CUDA Documentation:
- Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
- Pascal Tuning Guide: http://docs.nvidia.com/cuda/pascal-tuning-guide
- Volta Tuning Guide: http://docs.nvidia.com/cuda/volta-tuning-guide/
NVIDIA Developer Blog: http://devblogs.nvidia.com/
Pointers to GTC 2018 sessions:
- S8718 - Optimizing HPC Simulation and Visualization Codes using the NVIDIA System Profiler (previous talk, check out the recording)
- S8430 - Everything You Need to Know About Unified Memory (Tue, 4:30 PM)
- S8106 - Volta: Architecture and Performance Optimization (Thu, 10:30 AM)
- S8481 - CUDA Kernel Profiling: Deep-Dive Into NVIDIA's Next-Gen Tools (Thu, 11:00 AM)

THANK YOU JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join