Profiling & Tuning Applications. CUDA Course, István Reguly

Introduction
Why is my application running slow?
- Work it out on paper
- Instrument the code
- Profile it
NVIDIA Visual Profiler: works with CUDA; needs some tweaks to work with OpenCL.
nvprof: command-line tool; can be used with MPI applications.

Identifying Performance Limiters
- CPU: setup, data movement
- GPU: bandwidth, compute, or latency limited
- Balance point: number of instructions for every byte moved, roughly 3.6:1 on Fermi and 6.4:1 on Kepler
- Algorithmic analysis gives a good estimate, but the actual code is likely different: instructions for loop control, pointer math, etc., plus memory access patterns
- How to find out? Use the profiler (quick, but approximate) or source code modification (takes more work)
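For example (a hypothetical kernel, not the one profiled later): executing roughly 10 instructions per 4-byte float loaded gives a ratio of about 2.5 instructions per byte, below the ~3.6:1 (Fermi) and ~6.4:1 (Kepler) balance points, so such a kernel is likely memory-bound; at around 30 instructions per float (~7.5:1) it would likely be compute-bound.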

Analysis with Source Code Modification
- Time memory-only and math-only versions of the kernel (not so easy for kernels with data-dependent control flow)
- Good to estimate time spent on accessing memory or executing instructions; shows whether the kernel is memory or compute bound
- Put an if statement, depending on a kernel argument, around the math/memory instructions
- Use dynamic shared memory to keep the occupancy the same

Analysis with Source Code Modification

__global__ void kernel(float *a) {
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    float my_a;
    my_a = a[idx];
    for (int i = 0; i < 100; i++)
        my_a = sinf(my_a + i*3.14f);
    a[idx] = my_a;
}

__global__ void kernel(float *a, int prof) {
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    float my_a = 0.0f;
    if (prof & 1) my_a = a[idx];             // memory: load
    if (prof & 2)                            // math
        for (int i = 0; i < 100; i++)
            my_a = sinf(my_a + i*3.14f);
    if (prof & 1) a[idx] = my_a;             // memory: store
}
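The "dynamic shared memory" trick from the previous slide is a launch-time knob: by passing extra (unused) dynamic shared memory you can pin the modified kernel's occupancy to that of the original, so the timings stay comparable. A minimal sketch; the padding amount, grid/block shape, and d_a/N are assumptions, not values from the slides:

    // Assumed setup: d_a is a device array of N floats, already allocated and initialised.
    size_t occupancyPad = 4 * 1024;   // bytes of unused dynamic shared memory (tune until the
                                      // profiler reports the same occupancy as the original kernel)
    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);

    kernel<<<grid, block, occupancyPad>>>(d_a, 1);   // prof=1: memory-only (load + store)
    kernel<<<grid, block, occupancyPad>>>(d_a, 2);   // prof=2: math-only
    kernel<<<grid, block, occupancyPad>>>(d_a, 3);   // prof=3: full kernel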

Example scenarios (comparing memory-only, math-only, and full-kernel times):
- Memory-bound: good overlap between mem and math; latency is not a problem
- Math-bound: good overlap between mem and math
- Well balanced: good overlap between mem and math
- Memory and latency bound: poor overlap; latency is a problem

NVIDIA Visual Profiler
Launch with nvvp. Collects metrics and events during execution:
- Calls to the CUDA API
- Overall application: memory transfers, kernel launches
- Kernels: occupancy, computation efficiency, memory bandwidth efficiency
Requires deterministic execution!

Meet the test setup
2D Gaussian blur with a 5x5 stencil on a 4096^2 grid:

// filter is a __constant__ 5x5 array of Gaussian coefficients (sum = 273)
__global__ void stencil_v0(float *input, float *output, int sizex, int sizey) {
    const int x = blockIdx.x*blockDim.x + threadIdx.x + 2;
    const int y = blockIdx.y*blockDim.y + threadIdx.y + 2;
    if ((x >= sizex-2) || (y >= sizey-2)) return;

    float accum = 0.0f;
    for (int i = -2; i <= 2; i++) {
        for (int j = -2; j <= 2; j++) {
            accum += filter[i+2][j+2]*input[sizey*(y+j) + (x+i)];
        }
    }
    output[sizey*y+x] = accum/273.0f;
}

Meet the test setup
NVIDIA K40 (GK110B, SM 3.5), ECC on; graphics clock 745 MHz, memory clock 3004 MHz. CUDA 7.0.
Compiled with: nvcc profiling_lecture.cu -O2 -arch=sm_35 -I. -lineinfo -DIT=0

Interactive demo of tuning process

Launch a profiling session

First look: Timeline, Summary, Guide, Analysis results

The Timeline: host-side API calls, MemCpy, Compute

Analysis: guided or unguided

Examine Individual Kernels
Lists all kernels sorted by total execution time: the higher the rank, the higher the impact of optimisation on overall performance.
Initial unoptimised (v0): 8.122 ms

Utilisation
Both compute and memory utilisation are below 60% -> latency bound. Most of it is memory ops; let's investigate.

Latency analysis
Memory throttle -> perform bandwidth analysis.

Memory bandwidth analysis: the L1 cache is not used.

Investigate further (unguided)
- 6-8 transactions per access: something is wrong with how we access memory
- Global memory load efficiency: 53.3%
- L2 hit rate: 96.7%

Iteration 1: turn on L1
Quick & easy step: turn on L1 caching of global loads with -Xptxas -dlcm=ca (e.g. nvcc profiling_lecture.cu -O2 -arch=sm_35 -I. -lineinfo -Xptxas -dlcm=ca).
The memory unit is utilised, but global load efficiency became even worse: 20.5%.
Initial unoptimised (v0): 8.122 ms -> Enable L1: 6.57 ms

Cache line utilisation: 8x8 blocks, L1 cache disabled
Unit of transaction: 32 bytes (8 floats). A warp's request needs min 4, max 8 transactions.

Cache line utilisation: 8x8 blocks, L1 cache enabled
Unit of transaction: 128 bytes (32 floats), fetched as 4x32B from L2. Each time a warp's request spans more than one 128B cache line, a transaction is re-issued: min 16, max 32 transactions.

Cache line utilisation: 32x2 blocks, L1 cache enabled
Unit of transaction: 128 bytes (32 floats), fetched as 4x32B from L2. With a 32-wide block each warp reads one contiguous row: min 4, max 8 transactions.
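The change exploited in Iteration 2 below is purely a host-side launch change: make the block 32 threads wide so each warp reads 32 consecutive floats, i.e. one aligned 128B line. A minimal sketch; the variable names and the original 8x8 shape are assumptions based on the cache-line slides above:

    // Before (assumed): 8x8 blocks; a warp spans 4 short rows and touches up to
    // 8 separate 32B segments (or several 128B lines with L1 enabled).
    // dim3 block(8, 8);

    // After: 32x2 blocks; the 32 threads of a warp read one contiguous row.
    dim3 block(32, 2);
    dim3 grid((sizex + block.x - 1) / block.x,
              (sizey + block.y - 1) / block.y);
    stencil_v0<<<grid, block>>>(d_input, d_output, sizex, sizey);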

Iteration 2: 32x2 blocks
Memory utilisation decreased 10%; performance almost doubles. Global load efficiency: 50.8%.
Initial unoptimised (v0): 8.122 ms -> Enable L1: 6.57 ms -> Block size: 3.4 ms

Key takeaway: latency/bandwidth bound
Inefficient use of the memory system and bandwidth.
Symptoms: lots of transactions per request (low load efficiency)
Goal: use the whole cache line; improve memory access patterns (coalescing)
What to do: align data, change the block size, change the data layout; use shared memory/shuffles to load efficiently (see the sketch below)
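As an illustration of the "use shared memory to load efficiently" advice (this is not one of the versions profiled in the following iterations): each block stages its input tile plus the 2-point halo into shared memory with coalesced row-wise loads, then reads the 25 stencil inputs from shared memory. BX/BY, the staging loop, and the local filter declaration are assumptions.

    #define BX 32
    #define BY 4

    // Assumed: the same __constant__ 5x5 Gaussian coefficient array (sum 273) used by stencil_v0.
    __constant__ float filter[5][5];

    __global__ void stencil_smem(const float *input, float *output, int sizex, int sizey) {
        __shared__ float tile[BY + 4][BX + 4];            // block tile plus a 2-wide halo on each side

        const int x = blockIdx.x*BX + threadIdx.x + 2;    // global coordinates (2-point border skipped)
        const int y = blockIdx.y*BY + threadIdx.y + 2;

        // Stage the (BY+4) x (BX+4) tile with coalesced row-wise loads.
        for (int ty = threadIdx.y; ty < BY + 4; ty += BY) {
            for (int tx = threadIdx.x; tx < BX + 4; tx += BX) {
                int gx = blockIdx.x*BX + tx;              // global column of this tile element
                int gy = blockIdx.y*BY + ty;              // global row of this tile element
                if (gx < sizex && gy < sizey)
                    tile[ty][tx] = input[sizey*gy + gx];
            }
        }
        __syncthreads();

        if (x >= sizex-2 || y >= sizey-2) return;

        float accum = 0.0f;
        for (int i = -2; i <= 2; i++)
            for (int j = -2; j <= 2; j++)
                accum += filter[i+2][j+2] * tile[threadIdx.y + 2 + j][threadIdx.x + 2 + i];

        output[sizey*y + x] = accum / 273.0f;
    }
    // Launch with dim3 block(BX, BY) and the same grid as before.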

Latency analysis
Increase the block size so more warps can be active at the same time. On Kepler: max 16 blocks per SM, max 2048 threads per SM.
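Concretely: with 32x2 = 64-thread blocks, the 16-blocks-per-SM limit caps an SM at 16 x 64 = 1024 resident threads, i.e. 50% of the 2048-thread maximum; with 32x4 = 128-thread blocks, 16 x 128 = 2048 threads fit, so (registers and shared memory permitting) occupancy can reach 100%.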

Occupancy: using all the issue slots
[Diagram: the scheduler issues instructions from warps 1-4 in turn, hiding each instruction's latency behind work from the other warps. Illustrative only; reality is a bit more complex.]
Increase the block size to 32x4.

Iteration 3: 32x4 blocks
Up 10%; full occupancy.
Initial unoptimised (v0): 8.122 ms -> Enable L1: 6.57 ms -> Block size: 3.4 ms -> Block size 2: 2.36 ms

Key takeaway: latency bound, low occupancy
Unused cycles, exposed latency.
Symptoms: high execution/memory dependency, low occupancy
Goal: better utilise cycles by having more warps
What to do: determine the occupancy limiter (registers, block size, shared memory) and vary it

Improving memory bandwidth
L1 is fast but a bit wasteful (128B loads): 8 transactions per request on average (the minimum would be 4), and the load/store pipe is stressed. Any way to reduce the load? The texture cache: a dedicated pipeline with 32-byte loads, used via const __restrict__ pointers or __ldg().
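A minimal sketch of routing the read-only input through the texture (read-only data) cache; essentially only the pointer qualifiers change, with __ldg() as an explicit alternative. It assumes the same __constant__ filter array as before and is modelled on stencil_v0 rather than being the exact Iteration 4 code:

    __global__ void stencil_tex(const float* __restrict__ input,   // const + __restrict__ lets
                                float* __restrict__ output,        // the compiler use LDG loads
                                int sizex, int sizey) {
        const int x = blockIdx.x*blockDim.x + threadIdx.x + 2;
        const int y = blockIdx.y*blockDim.y + threadIdx.y + 2;
        if (x >= sizex-2 || y >= sizey-2) return;

        float accum = 0.0f;
        for (int i = -2; i <= 2; i++)
            for (int j = -2; j <= 2; j++)
                // __ldg() forces the load through the 32B-granularity texture path (sm_35+)
                accum += filter[i+2][j+2] * __ldg(&input[sizey*(y+j) + (x+i)]);

        output[sizey*y + x] = accum / 273.0f;
    }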

Iteration 4: texture cache
Initial unoptimised (v0): 8.122 ms -> Block size 2: 2.36 ms -> Texture cache: 1.53 ms

Key takeaway: bandwidth bound, Load/Store Unit (LSU) overutilised
Symptoms: LSU pipe utilisation high, others low
Goal: better spread the load across pipes: use TEX
What to do: read read-only data through the texture cache, via const __restrict__ or __ldg()

Compute analysis: instruction mix
Compute utilisation could be higher (~78%). Lots of integer and memory instructions, fewer FP; integer ops have lower throughput than FP. Try to amortise the cost: increase compute per byte.

Instruction Level Parallelism
Remember, the GPU issues instructions in order. In the sequence a=b+c; d=a+e; the second instruction cannot be issued before the first; in a=b+c; d=e+f; it can be issued before the first finishes, because there is no dependency. This applies to memory instructions too, where the latency is much higher (and counts towards stall reasons).

Instruction Level Parallelism

for (j = 0; j < 2; j++)
    acc += filter[j]*input[x+j];

unrolls into a dependent chain:

    tmp = input[x+0];  acc += filter[0]*tmp;
    tmp = input[x+1];  acc += filter[1]*tmp;

With two accumulators (e.g. processing 2 points per thread):

for (j = 0; j < 2; j++) {
    acc0 += filter[j]*input[x+j];
    acc1 += filter[j]*input[x+j+1];
}

the loads and multiply-adds are independent and can be issued back to back:

    tmp0 = input[x+0];  tmp1 = input[x+0+1];
    acc0 += filter[0]*tmp0;  acc1 += filter[0]*tmp1;
    tmp0 = input[x+1];  tmp1 = input[x+1+1];
    acc0 += filter[1]*tmp0;  acc1 += filter[1]*tmp1;

#pragma unroll can help ILP. Create two accumulators, or process 2 points per thread. Bonus: data re-use (register caching).
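Applied to the blur kernel, "two points per thread" means each thread computes two horizontally adjacent outputs, so the two accumulators give ILP and most of the loaded neighbourhood is re-used from registers. A sketch under the same assumptions as above (an illustration of the idea, not necessarily the exact Iteration 5 code); note the grid's x dimension must be halved:

    __global__ void stencil_2pt(const float* __restrict__ input,
                                float* __restrict__ output,
                                int sizex, int sizey) {
        // Each thread handles two consecutive points in x.
        const int x = 2*(blockIdx.x*blockDim.x + threadIdx.x) + 2;
        const int y = blockIdx.y*blockDim.y + threadIdx.y + 2;
        if (x >= sizex-3 || y >= sizey-2) return;

        float acc0 = 0.0f, acc1 = 0.0f;                    // two independent accumulators -> ILP
        for (int j = -2; j <= 2; j++) {
            float row[6];                                  // columns x-2 .. x+3 of this input row
            #pragma unroll
            for (int i = 0; i < 6; i++)
                row[i] = __ldg(&input[sizey*(y+j) + (x-2+i)]);
            #pragma unroll
            for (int i = 0; i < 5; i++) {
                acc0 += filter[i][j+2] * row[i];           // output point x   (shares row[1..4])
                acc1 += filter[i][j+2] * row[i+1];         // output point x+1 (shares row[1..4])
            }
        }
        output[sizey*y + x]     = acc0 / 273.0f;
        output[sizey*y + x + 1] = acc1 / 273.0f;
    }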

Iteration 5: 2 points per thread
[Profiler pipe-utilisation chart: load/store]
Initial unoptimised (v0): 8.122 ms -> Texture cache: 1.53 ms -> 2 points: 1.07 ms

Key takeaway: latency bound, low instruction level parallelism
Unused cycles, exposed latency.
Symptoms: high execution dependency, one pipe saturated
Goal: better utilise cycles by increasing parallel work per thread
What to do: increase ILP by having more independent work, e.g. more than one output value per thread; #pragma unroll

Iteration 6: 4 points per thread
Device bandwidth reaches 168 GB/s.
Initial unoptimised (v0): 8.122 ms -> 2 points: 1.07 ms -> 4 points: 0.95 ms

Conclusions
An iterative approach to improving a code's performance:
- Identify the hotspot
- Find the performance limiter; understand why it is an issue
- Improve your code
- Repeat
We managed to achieve an 8.5x speedup, and have shown how NVVP guides us and helps understand what the code does. There is more it can show.
References: C. Angerer, J. Demouth, CUDA Optimization with NVIDIA Nsight Eclipse Edition, GTC 2015.

Metrics & Events