CUDA Optimizations. WS 2014-15 Intelligent Robotics Seminar, Universität Hamburg. Praveen Kulkarni


CUDA Optimizations. WS 2014-15 Intelligent Robotics Seminar

Table of contents
1. Background information
2. Optimizations
3. Summary


Background information
- Why GPUs?
- Your PC with GPU
- Understanding SMs and memory hierarchies
- Understanding CUDA kernel launch
- Questions

Why GPUs? It is all about real time:
- Motion compliance: < 1 ms
- Vision (30 fps): < 33 ms
- Vision (60 fps): < 16 ms
Neural networks also map well to the GPU: each neuron computes its own activation based on local information, and learning algorithms continuously adapt the strength of the connections between neurons. The GPU further accelerates some of the required pre-processing (e.g. vision processing).

Your PC with GPU
[Figure slides: a PC with graphics card. Courtesy: Asus ROG, NVIDIA]

Understanding SMs and memory hierarchies
- Global memory (off-chip GDDR5 RAM): constant and texture memory are also allocated here.
- SM (streaming multiprocessor): blocks of threads are scheduled onto an SM (e.g. a group of 512 threads).
- Shared memory: on-chip memory that can be shared between the threads of a block.

[Diagram: motherboard with CPU, chipset, and host RAM; a PCIe slot holds the graphics card, which carries the GPU (multiple SMs) and its GDDR5 graphics memory.]

Fastest memory, in rough order:
- On chip: shared
- Off chip: constant, texture, global

Memory   | Location on/off chip | Cached | Access | Scope                | Lifetime
Register | On                   | n/a    | R/W    | 1 thread             | Thread
Local    | Off                  | Yes*   | R/W    | 1 thread             | Thread
Shared   | On                   | n/a    | R/W    | All threads in block | Block
Global   | Off                  | Yes*   | R/W    | All threads + host   | Host allocation
Constant | Off                  | Yes    | R      | All threads + host   | Host allocation
Texture  | Off                  | Yes    | R      | All threads + host   | Host allocation
* Cached only on devices of compute capability 2.x

Understanding CUDA kernel launch [figure slide]
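Since the launch slide is a figure, here is a minimal sketch (not from the slides) of a CUDA kernel launch, using the block-of-512-threads configuration mentioned above; the vector-add kernel and sizes are illustrative:

```
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));

    // Launch configuration: a grid of blocks, 512 threads each.
    int threads = 512;
    int blocks = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```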

Quiz
- Which of these is on-chip memory for the GPU? Host memory (RAM), registers, shared memory, global memory.
- Can threads in different blocks access the same shared memory? Yes / No
- Order these memories by speed: host memory (RAM), registers, global memory (GPU memory), constant and texture memory, shared memory. (The slide's answer ranks, 1 = fastest: 5, 1, 4, 3, 2.)
- Which of these memories are persistent? Registers, global memory (GPU memory), constant and texture memory, shared memory.
- Except for constant and texture memory, all other memories are R/W. Yes / No

Optimizations

Categorized optimization strategies
- Memory optimizations: transfers between host and device; access to global memory
- Execution configuration optimizations: occupancy; concurrent kernel execution; hiding register dependencies; thread and block heuristics; effects of shared memory
- Instruction optimizations: arithmetic instructions; memory instructions
- Control flow: branching and divergence; branch predication; loop counters signed vs. unsigned; synchronizing divergent threads in a loop


Memory: goal of memory optimizations. Maximize the use of the hardware by maximizing bandwidth, which means using as much fast memory and as little slow-access memory as possible. What follows discusses the various kinds of memory on the host and device, and how best to set up data items to use each memory effectively.

Memory: why data transfer between host and device must be minimized.
[Diagram: the GPU reads its GDDR5 graphics memory at about 177.6 GB/s, while the PCIe link between host RAM and the graphics card provides only about 8 GB/s; 177.6 GB/s >> 8 GB/s.]
Because of this gap, it can even pay to run kernels on the GPU that do not demonstrate any speedup by themselves, if keeping the data on the device avoids PCIe transfers.

Memory: pinned memory improves transfers between host and device. Page-locked (pinned) host memory attains the highest bandwidth between the host and the device. Use it with care: pinned memory is a scarce resource, over-allocating it can reduce overall system performance, and pinning memory is a heavyweight operation.
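A minimal sketch of the pinned-memory API; the buffer size is an arbitrary example and error handling is omitted:

```
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;  // 1M floats, arbitrary example size
    float *h_pinned, *d_buf;

    cudaMallocHost(&h_pinned, N * sizeof(float));  // page-locked host allocation
    cudaMalloc(&d_buf, N * sizeof(float));

    // Transfers from pinned memory reach the full PCIe bandwidth
    // and can be made asynchronous with respect to the host.
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);  // pinned memory has its own free call
    return 0;
}
```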

Memory quiz: assume you are doing some processing on an image. In which scenarios would you use pinned memory?
A] You have very limited host memory (RAM).
B] The image processing algorithm running on the GPU performs many steps on the image.
C] Your application demands that the processed image always be available to the host CPU.

Memory: staged concurrent copy and execute.
[Diagram: sequential "copy, then execute" versus concurrent asynchronous copy overlapped with execution, using 4 concurrent streams.]
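A sketch of the staged pattern under the slide's setup: 4 streams, pinned host buffers (required for truly asynchronous copies), and a placeholder kernel `process`; the chunking is illustrative:

```
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void staged(float* h_pinned, float* d_buf, int n) {
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / kStreams;  // assume n divisible for simplicity
    for (int s = 0; s < kStreams; ++s) {
        int off = s * chunk;
        // The copy of chunk s overlaps with the execution of earlier chunks.
        cudaMemcpyAsync(d_buf + off, h_pinned + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_pinned + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams before cleanup
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
}
```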


Unified Virtual Memory: the runtime internally manages the address spaces and performs the necessary memory transfers. Benefits: coding simplicity and rapid prototyping, and future compatibility.
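A minimal sketch using managed memory via cudaMallocManaged (available since CUDA 6), where a single pointer is valid on both host and device; the kernel and sizes are illustrative:

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1024;
    float* x;
    cudaMallocManaged(&x, N * sizeof(float));  // one pointer, managed migration
    for (int i = 0; i < N; ++i) x[i] = 1.0f;   // touch on host

    scale<<<(N + 255) / 256, 256>>>(x, N);     // runtime migrates data to device
    cudaDeviceSynchronize();                   // required before the host reads again

    printf("%f\n", x[0]);                      // prints 2.0
    cudaFree(x);
    return 0;
}
```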

Memory: fastest memory, in rough order.
- On chip: shared
- Off chip: constant, texture, global

Memory: coalesced access. Memory loads and stores by the threads of a warp are coalesced into as few transactions as possible. DRAM is designed for batch access, and we can take advantage of that in programming. We will see what happens to coalesced access to global memory when 1] we change the offset and 2] we change the stride, as in the sketch below.
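A hedged sketch of the two access patterns benchmarked on the following slides; the kernels are illustrative, not the slides' own code, and the caller is assumed to size the buffers to cover the largest index:

```
// Each warp reads consecutive elements shifted by 'offset': for a
// non-zero offset, the warp's accesses straddle cache-line boundaries.
__global__ void offset_access(float* out, const float* in, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}

// Each thread reads element i*stride: for stride > 1 the warp touches
// many separate cache lines, multiplying the number of transactions.
__global__ void stride_access(float* out, const float* in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```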

1] With different offset: [two figure slides with measured results for access at different offsets]

Quiz: assume we are working on a float image of size 500 × 500. Will this be a problem for coalescing? If yes, what is the solution?
[Diagram: a 500-element row padded with 12 extra elements, to 512.]
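One standard solution, sketched under the assumption that row padding is the intended answer: let the runtime pad each row to an aligned pitch with cudaMallocPitch:

```
#include <cuda_runtime.h>

int main() {
    const int width = 500, height = 500;
    float* d_img;
    size_t pitch;  // row size in bytes after padding, chosen by the runtime

    // Each row then starts at an aligned address; pitch >= width * sizeof(float).
    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

    // Index row y, column x through the pitch, not the logical width:
    // float* row = (float*)((char*)d_img + y * pitch); row[x] = ...;

    cudaFree(d_img);
    return 0;
}
```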

2] With different stride: [two figure slides with measured results for access at different strides]

Quiz: assume we are working on a float image of size 512 × 512. Will the access patterns below pose a problem? What is the stride in each case?
- image[threadid][0] = some_value (accesses from a warp)
- image[0][threadid] = some_value (accesses from a warp)

Memory: global vs. shared memory transactions, for a 3 × 3 neighborhood computed per pixel (see the sketch below).
- Global memory only: total pixels to be calculated = 9, transactions per pixel = 9, hence total transactions = 9 × 9 = 81.
- How many global transactions with shared memory? 25: the block loads the required 5 × 5 tile once.
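A hedged sketch of the shared-memory tiling idea for a 3 × 3 neighborhood; the 16 × 16 block size, border clamping, and box filter are illustrative choices, not the slides':

```
#define TILE 16
#define R 1  // filter radius for a 3x3 neighborhood

// Launch with blockDim = (TILE, TILE).
__global__ void box3x3(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the tile plus halo: each global element is
    // read once per block instead of up to 9 times.
    for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += TILE) {
            int gx = min(max((int)(blockIdx.x * TILE) + dx - R, 0), w - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + dy - R, 0), h - 1);
            tile[dy][dx] = in[gy * w + gx];
        }
    __syncthreads();

    if (x < w && y < h) {
        float sum = 0.0f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                sum += tile[threadIdx.y + R + dy][threadIdx.x + R + dx];
        out[y * w + x] = sum / 9.0f;
    }
}
```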

Memory: local memory. SMs have limited register space. Automatic variables (e.g. local variables) are allocated in registers; if register space is not enough, local memory is used instead, which is called register spilling. Local memory actually resides in global memory and is therefore slow. After compilation, the nvcc compiler can report local memory usage (e.g. via -Xptxas -v); try to avoid spilling if possible.

Memory: texture memory. The read-only texture memory space is cached, and the texture cache is optimized for 2D spatial locality. In some cases it is an advantageous alternative to reading device memory from global or constant memory. The hardware provides additional capabilities when textures are fetched using tex1D(), tex2D(), or tex3D() rather than tex1Dfetch(): filtering, normalized texture coordinates, and automatic handling of boundary cases.

Memory: constant memory. There is a total of 64 KB of constant memory on a device, and the constant memory space is cached. The constant cache works best when threads in the same warp access only a few distinct locations; if all threads of a warp access the same location, constant memory can be as fast as a register access.
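A minimal sketch of declaring, filling, and reading constant memory; the coefficient array and kernel are illustrative:

```
#include <cuda_runtime.h>

__constant__ float coeffs[16];  // lives in the 64 KB constant space

__global__ void apply(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads of a warp read coeffs[0]: a broadcast from the
    // constant cache, about as fast as a register read.
    if (i < n) out[i] = coeffs[0] * in[i];
}

void setup(const float* host_coeffs) {
    // The host fills constant memory before launching kernels.
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}
```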

Memory: registers. Registers are the fastest memory space. CUDA can keep small arrays with compile-time-constant indexing in registers, and the hardware provides instruction support for sharing register values between the threads of a warp (shuffle instructions).

Memory: a shuffle instruction with a constant srcLane broadcasts the value in a register from one thread to all threads in its warp.
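A hedged sketch of such a broadcast; __shfl_sync is the modern (CUDA 9+) spelling, whereas these 2014-era slides would have used __shfl:

```
__global__ void broadcast(float* out, const float* in) {
    float v = in[threadIdx.x];
    // Every thread in the warp receives lane 0's register value
    // (constant srcLane = 0), without touching shared or global memory.
    float leader = __shfl_sync(0xffffffff, v, 0);
    out[threadIdx.x] = leader;
}
```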

Memory: shuffle up and down instructions, illustrated [figure slide], along with the advantage of the shuffle instruction for the moving average filter algorithm, sketched below.
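A hedged sketch of that idea: a 3-point moving average within a warp using the up/down shuffles; the warp-boundary handling is simplified and the filter itself is illustrative:

```
__global__ void moving_avg3(float* out, const float* in) {
    float v = in[threadIdx.x];

    // Fetch the neighbors' register values instead of re-reading memory.
    float left  = __shfl_up_sync(0xffffffff, v, 1);    // value from lane-1
    float right = __shfl_down_sync(0xffffffff, v, 1);  // value from lane+1

    // Out-of-range shuffles return the caller's own value; make the
    // clamping at the warp boundary explicit for clarity.
    int lane = threadIdx.x % 32;
    if (lane == 0)  left  = v;
    if (lane == 31) right = v;

    out[threadIdx.x] = (left + v + right) / 3.0f;
}
```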


Questions? END