Debugging and Optimization strategies

Debugging and Optimization strategies
Philip Blakely
Laboratory for Scientific Computing, Cambridge

Writing correct CUDA code
You should start with a functional CPU/serial code.
If possible, port small parts to the GPU at a time.
The CPU and GPU codes should give exactly the same results at all times.
If possible, display intermediate results to test individual kernels.
A lot of the difficulty with CUDA is thinking about parallelism in the right way, and working out which threads and blocks should do what work.

Error checking
Always make use of error-return codes.
Use cudaGetLastError().
Common errors:
Mixing up host and device pointers - use e.g. an hData[] / dData[] naming convention.
Reversing the direction of copying in cudaMemcpy(): cudaMemcpy(dest, src, numBytes, cudaMemcpyKind).
Allocating too little memory - cudaMalloc() needs the number of bytes, so use sizeof():
    float3* a;
    cudaMalloc((void**)&a, sizeof(float3) * N);
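
As a concrete illustration of this pattern, here is a minimal sketch of checking API return codes and kernel launches; the CHECK_CUDA macro name and the scale kernel are illustrative, not part of the course code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CHECK_CUDA(call)                                            \
      do {                                                              \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
          fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                  cudaGetErrorString(err), __FILE__, __LINE__);         \
          exit(EXIT_FAILURE);                                           \
        }                                                               \
      } while (0)

    __global__ void scale(float* dData, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) dData[idx] *= 2.0f;
    }

    int main()
    {
        const int N = 1024;
        float* dData = nullptr;                         // device pointer (dData naming convention)
        CHECK_CUDA(cudaMalloc((void**)&dData, sizeof(float) * N));
        scale<<<(N + 255) / 256, 256>>>(dData, N);
        CHECK_CUDA(cudaGetLastError());                 // catches launch-time errors
        CHECK_CUDA(cudaDeviceSynchronize());            // catches asynchronous execution errors
        CHECK_CUDA(cudaFree(dData));
        return 0;
    }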

Floating point mismatch
Moving from CPU code to GPU code will lead to changes in floating-point results.
Early CUDA-enabled GPUs did not conform to IEEE 754 exactly.
This is unlikely to lead to noticeable errors, except in the last decimal place.
Recall that floating-point addition is non-associative.
Upshot: if you are summing in parallel, the result may differ from the serial result.
Extra FMA (fused multiply-add) instructions on GPUs can also affect the result. See Floating Point on NVIDIA GPU.pdf.
Consider whether you need single-precision or double-precision arithmetic. The former is faster (by at least a factor of 8) but may not be accurate enough.
If any of this leads to stability problems, rethink your algorithm.
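
A minimal host-side sketch of the non-associativity point; the particular values are chosen only to make the effect obvious:

    #include <cstdio>

    int main()
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        // In exact arithmetic both expressions equal 1, but in single
        // precision c is absorbed when it is added to b first.
        printf("(a + b) + c = %f\n", (a + b) + c);   // prints 1.000000
        printf("a + (b + c) = %f\n", a + (b + c));   // prints 0.000000
        return 0;
    }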

Indexing errors
Always check the calculation of indices.
Usually blockIdx.x * blockDim.x + threadIdx.x.
You may have different expressions depending on the distribution of data between threads/thread-blocks; see the 2D sketch below.
Always check for out-of-bounds access.
It's common to have redundant threads that shouldn't access data - make sure they have an appropriate if statement so that they do nothing.
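
A sketch of a 2D index calculation with the out-of-bounds guard described above; the add2D kernel and the row-major layout are illustrative assumptions:

    __global__ void add2D(float* out, const float* a, const float* b, int nx, int ny)
    {
        // One thread per element of an nx-by-ny array, launched with a 2D grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < nx && j < ny) {          // redundant threads do nothing
            int idx = j * nx + i;        // row-major flattening
            out[idx] = a[idx] + b[idx];
        }
    }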

Race conditions
If your code occasionally gives errors, or gives different results for the same input, you've probably got a race condition, i.e. the final state of the code depends on the order of execution of threads/blocks.
Avoiding race conditions:
Never assume thread blocks are executed in a particular order.
Distinct thread blocks should not depend on each other's results.
The best approach is for a kernel to take data from one global array and compute a result into another array. Do not use the same array for input and output (see the sketch below).
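
A minimal sketch of the separate input/output arrays pattern, ping-ponging between two device buffers; the smooth kernel and the buffer handling are illustrative assumptions:

    #include <utility>
    #include <cuda_runtime.h>

    __global__ void smooth(const float* in, float* out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < N - 1)
            out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
        else if (i < N)
            out[i] = in[i];   // copy boundary values unchanged
    }

    void runSmoothing(float* dIn, float* dOut, int N, int nSteps)
    {
        int threads = 256, blocks = (N + threads - 1) / threads;
        for (int step = 0; step < nSteps; ++step) {
            smooth<<<blocks, threads>>>(dIn, dOut, N);  // never read and write the same array
            std::swap(dIn, dOut);                       // ping-pong between the two buffers
        }
        cudaDeviceSynchronize();
    }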

Global memory race conditions
These usually occur when two or more threads update a global memory value independently.
There is no guarantee in which order the updates will occur, and no guarantee which will succeed.
General rule: avoid updating the same global memory location from distinct threads.
(Unless you use atomic instructions, in which case you may have performance problems instead.)

Shared memory race conditions
These occur when distinct threads within a thread-block update a single shared memory location independently.
They can be overcome by correct use of __syncthreads(), which ensures that all threads in the block have reached the same point in the kernel.
General advice: when using shared memory, check your use of __syncthreads(); see the reduction sketch below.
It may be possible to avoid some of this because threads in the same warp advance in lock-step.
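
A minimal sketch of correct __syncthreads() placement in a block-level sum reduction; the blockSum kernel and the fixed block size of 256 are illustrative assumptions:

    __global__ void blockSum(const float* in, float* blockSums, int N)
    {
        __shared__ float cache[256];                 // assumes blockDim.x == 256 (a power of two)
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        cache[tid] = (i < N) ? in[i] : 0.0f;
        __syncthreads();                             // all loads complete before any thread reads cache

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();                         // needed on every iteration, by every thread
        }
        if (tid == 0)
            blockSums[blockIdx.x] = cache[0];
    }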

Volatile variables
In the absence of __syncthreads() etc., the compiler is free to assume that no other thread will update a memory location.
It can therefore assume that adjacent reads give the same value, and reuse a previous value without re-reading memory.
To avoid this, use volatile:
    volatile float* aVol = a;
    float tmp1 = aVol[i];
    // Some calculation...
    float tmp2 = aVol[i];
Without volatile, the compiler may/will issue only one memory-read instruction, even if in practice the location might have been updated by another thread.

Basic debugging
Often the easiest form of debugging is printf statements.
This is allowed in CUDA device code for compute capability 2.x and later.
Output for all printf statements encountered will be emitted on the host.
There is no guarantee about the order in which print statements from different threads will appear.
The output buffer is of limited size (1MB), and is dumped to the screen at kernel launch or at cudaDeviceSynchronize().
If the buffer fills up, further output will overwrite the earliest output.
The buffer size can be increased using cudaDeviceSetLimit().
This feature should therefore only be used as a quick-and-dirty debugging tool.
Upshot: printf may be helpful, but absence of output does not imply absence of a thread reaching a particular point.
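
A minimal sketch of device-side printf and of enlarging the printf buffer with cudaDeviceSetLimit(); the whereAmI kernel and the 8 MB figure are illustrative:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void whereAmI()
    {
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        // Optionally enlarge the printf FIFO from the default 1 MB to 8 MB.
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);
        whereAmI<<<2, 4>>>();
        cudaDeviceSynchronize();   // flushes the device printf buffer to the host
        return 0;
    }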

Use of a debugger
The cuda-gdb debugger is an extension of gdb.
You need to compile with the -g -G options.
It allows explicit switching between threads/blocks and checking of values.
It allows stepping through kernels line-by-line.
It is not part of the practicals, as I've had problems using it in the past, and it locks up the compute card on which a kernel is being debugged.
See /lsc/opt/cuda-8.0/doc/pdf/cuda-gdb.pdf

cuda-memcheck
In a similar way to valgrind, cuda-memcheck can check for simple out-of-bounds memory access.
Ill-formed CUDA program (no bounds check on idx):
    __global__ void f(float* a, int N)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        a[idx] = sinf(idx);
    }

Memcheck output
    ========= CUDA-MEMCHECK
    ========= Invalid __global__ write of size 4
    =========     at 0x00000498 in f
    =========     by thread (32,0,0) in block (2,0)
    =========     Address 0x00201080 is out of bounds
    =========
    ========= ERROR SUMMARY: 1 error
See /lsc/opt/cuda-8.0/doc/pdf/CUDA_Memcheck.pdf

Memcheck continued
memcheck is capable of a large number of checks:
Out-of-bounds access checks
Malloc/free error-checks
CUDA API checks (although you should be doing this anyway)
Memory leaks
Race-condition checks (--tool racecheck)

CUDA features
A miscellany of CUDA features:
Textures
Warp-vote functions
Atomic operations
Streams

Textures
Textures are 1D, 2D, or 3D constant arrays whose elements have 1, 2, or 4 components, each an 8-bit, 16-bit, or 32-bit integer or float.
They provide some performance advantages.
They can linearly interpolate floating-point values.
They can deal gracefully with out-of-range indices.
The terminology and reason for existence are obviously left over from graphics.
They can be useful for some applications.

Textures - example application
MRI scanners produce visualizations of e.g. tumours.
Images taken at different times need to be matched up.
Medical Image Registration is used to find a transformation mapping the new image onto the old one.
It automatically optimizes the transformation parameters by computing a distance function between the images and minimizing it.
Determination of the distance function is very compute-intensive.
Ansorge et al. used CUDA with texture arrays to give a speed-up of 750. (Disclaimer: the serial code was not optimized!)

Textures - allocation
At file scope:
    texture<int4, 2, cudaReadModeElementType> picTexture;
In the main program:
    cudaChannelFormatDesc chanDesc = cudaCreateChannelDesc<int4>();
    cudaArray* picArray;
    cudaMallocArray(&picArray, &chanDesc, width, height);
    // (the array must also be bound to the texture reference, e.g. with
    // cudaBindTextureToArray(picTexture, picArray, chanDesc), before use)
The CUDA array can now be accessed as a texture within a kernel:
    int4 pixel = tex2D(picTexture, xTrans, yTrans);

Textures - setting attributes
Various attributes can be set for accessing a texture:
    picTexture.addressMode[0] = cudaAddressModeWrap;
    picTexture.addressMode[1] = cudaAddressModeWrap;
    picTexture.filterMode = cudaFilterModePoint;
    picTexture.normalized = true; // access with floating-point texture coordinates
Normalized:
true - access using [0, 1] coordinates
false - access using integer [0, N-1] coordinates
Wrap/Clamp bounds:
Wrap - wrap out-of-bounds coordinates to the other side
Clamp - clamp out-of-bounds coordinates to the nearest end of the range
Filter mode:
cudaFilterModePoint - just access the exact memory value
cudaFilterModeLinear - perform (9-bit) linear interpolation
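
A sketch of reading the texture inside a kernel with normalized coordinates; the sampleKernel name and launch layout are illustrative, and the picTexture declaration and binding from the previous slides are assumed:

    __global__ void sampleKernel(int4* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // With picTexture.normalized == true, coordinates lie in [0, 1];
            // the +0.5f offset samples the centre of each texel.
            float u = (x + 0.5f) / width;
            float v = (y + 0.5f) / height;
            out[y * width + x] = tex2D(picTexture, u, v);
        }
    }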

Warp vote functions
The following functions allow communication between all threads in a single warp (32 threads):
    int __all(int predicate);
returns a non-zero value iff predicate is non-zero on all active threads in the warp.
    int __any(int predicate);
returns a non-zero value iff predicate is non-zero on at least one (active) thread in the warp.
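
A minimal sketch of __any() in use (the pre-CUDA-9 form of the intrinsic); the warpHasNegative kernel and the warpFlags array are illustrative assumptions:

    __global__ void warpHasNegative(const float* data, int* warpFlags, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (idx < N) && (data[idx] < 0.0f);
        int anyNeg = __any(pred);          // non-zero if any active thread in this warp saw a negative
        if ((threadIdx.x & 31) == 0)       // lane 0 of each warp records the result
            warpFlags[idx / 32] = anyNeg;  // warpFlags must hold one int per launched warp
    }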

Warp vote functions ctd.
    unsigned int __ballot(int predicate);
returns an integer whose Nth bit is set iff predicate is non-zero on thread N.
Note that an unsigned int has 32 bits and there are 32 threads in a warp.
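
A sketch combining __ballot() with the __popc() population-count intrinsic to count, per warp, how many threads satisfy a predicate; the kernel and the counts array are illustrative:

    __global__ void countNegativesPerWarp(const float* data, int* counts, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (idx < N) && (data[idx] < 0.0f);
        unsigned int mask = __ballot(pred);   // bit N is set iff lane N's predicate is non-zero
        if ((threadIdx.x & 31) == 0)          // lane 0 of each warp writes the warp's count
            counts[idx / 32] = __popc(mask);  // number of set bits in the ballot mask
    }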

Warp-size note
From the Kepler Tuning Guide (Kepler Tuning Guide.pdf):
"Warp-synchronous programs are unsafe and easily broken by evolutionary improvements to the optimization strategies used by the CUDA compiler toolchain. Such programs also tend to assume that the warp size is 32 threads, which may not necessarily be the case for all future CUDA-capable architectures. Therefore, programmers should avoid warp-synchronous programming to ensure future-proof correctness in CUDA applications."
On the other hand, the preprocessor definition WARP_SIZE is available, and you could raise a compile-time error if this is not 32.

Atomic functions
In machine code, an operation such as i += 1, where i is in global or shared memory, requires three instructions:
Read i into a register
Increment i
Write i back to memory
which results in a potential race condition.
atomicAdd(&i, 1); uses only one instruction for global/shared memory.
Other atomic instructions: atomicSub(), atomicExch(), ...
Example use: keep a count of how many thread-blocks have completed (see the sketch below).
This nearly avoids the global memory access issue - but you still don't know in what order the updates will occur.
Atomics are only available for integer and 32-bit float operations, except for atomicAdd() for 64-bit floats on the Pascal architecture (CC 6.0).
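
A sketch of the block-counting use of atomicAdd() mentioned above; the work kernel and the blocksDone counter are illustrative (the counter must be zeroed before launch):

    __global__ void work(float* data, int N, unsigned int* blocksDone)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            data[idx] *= 2.0f;
        __syncthreads();                 // wait until the whole block has finished its work
        if (threadIdx.x == 0)
            atomicAdd(blocksDone, 1u);   // one atomic increment per block; update order is unspecified
    }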

Streams
Kernels can be run concurrently on the latest Fermi and Kepler GPUs.
This is done via streams.
Each stream is independent - no communication between them is possible.
Any attempt at communication between streams via GPU memory is very likely to give race conditions, with no way to reason about them.

Streams ctd.
Streams example:
    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++)
        cudaStreamCreate(&stream[i]);
    for (int i = 0; i < 2; i++)
        f<<<100, 200, 0, stream[i]>>>(myArray[i]);
If possible, the two calls of f will execute concurrently, on separate data.
To synchronize all kernel calls and memory copies for a stream: cudaStreamSynchronize()
To make sure all streams have completed: cudaThreadSynchronize()
To destroy streams: cudaStreamDestroy(stream[i])
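
Putting the pieces above together, a minimal sketch of the full stream lifecycle, including asynchronous copies; the kernel f, the device/host buffers, and numBytes are illustrative assumptions, and the host buffers are assumed to be pinned (allocated with cudaMallocHost) so that cudaMemcpyAsync can overlap with kernel execution:

    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < 2; i++) {
        // Copy input, run the kernel, copy output - all queued on stream[i].
        cudaMemcpyAsync(dArray[i], hInput[i], numBytes, cudaMemcpyHostToDevice, stream[i]);
        f<<<100, 200, 0, stream[i]>>>(dArray[i]);
        cudaMemcpyAsync(hOutput[i], dArray[i], numBytes, cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(stream[i]);   // wait for everything queued on this stream
        cudaStreamDestroy(stream[i]);
    }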