ME964 High Performance Computing for Engineering Applications


ME964 High Performance Computing for Engineering Applications. Debugging & Profiling CUDA Programs. February 28, 2012. Dan Negrut, 2012, ME964 UW-Madison. "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you are as clever as you can be when you write it, how will you ever debug it?" Brian Kernighan

Before We Get Started Last time: discussed registers, local memory, and shared memory; discussed execution scheduling on the GPU. Important concept: Warp: a collection of 32 threads that are scheduled for execution as a unit. Today: quick intro to debugging and profiling CUDA applications; wrap up on Th with one debugging and profiling example. Other issues: HW5 due Th at 11:59 PM. Midterm project topic needs to be chosen by March 8. Check the online syllabus for details; we'll discuss more on Th 2

Acknowledgments Today's slides include material provided by Sanjiv Satoor of NVIDIA India. All the good material in this presentation comes from him. All the strange stuff and any mistakes belong to me 3

Terminology [Review] Kernel: a function to be executed in parallel on one CUDA device; a kernel is executed by multiple blocks of threads. Block: a 3-dimensional collection of threads. Warp: a group of 32 threads scheduled for execution as a unit. Thread: the smallest unit of work 4

Terminology [Review] Divergence: occurs when any two threads in the same warp are slated to execute different instructions. Active threads within a warp: threads currently executing device code. Divergent threads within a warp: threads that are waiting for their turn or are done with their turn. Program Counter (PC): a processor register that holds the address (in the virtual address space) of the next instruction in the instruction sequence. Each thread has a PC, which is useful when debugging with cuda-gdb to navigate the instruction sequence 5
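The divergence concept above can be made concrete with a small kernel. This is a hypothetical illustration, not code from the lecture; the kernel and file names are made up:

```cuda
// divergence.cu: threads in one warp take different branches based on
// the parity of threadIdx.x, so the warp splits into two divergent
// groups that execute the branches one after the other.
#include <cuda_runtime.h>

__global__ void divergent(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = 2 * i;       // even lanes active; odd lanes wait
    else
        out[i] = 2 * i + 1;   // odd lanes active; even lanes wait
}

int main()
{
    int *d_out = 0;
    cudaMalloc((void **)&d_out, 32 * sizeof(int));
    divergent<<<1, 32>>>(d_out);   // a single warp of 32 threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Built with nvcc -g -G, setting a breakpoint inside either branch and running info cuda threads shows the warp split into two (block,thread) ranges with different PCs.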

Debugging Solutions CUDA-GDB (Linux & Mac) Allinea DDT CUDA-MEMCHECK (Linux, Mac, & Windows) NVIDIA Parallel Nsight (Windows) Rogue Wave TotalView 6

CUDA-GDB GUI Wrappers GNU DDD GNU Emacs 7

CUDA-GDB Main Features All the standard GDB debugging features Integrated CPU and GPU debugging within a single session Breakpoints and Conditional Breakpoints Inspect memory, registers, local/shared/global variables Supports multiple GPUs, multiple contexts, multiple kernels Source and Assembly (SASS) Level Debugging Runtime Error Detection (stack overflow,...) 8

Recommended Compilation Flags Compile code for your target architecture: Tesla : -gencode arch=compute_10,code=sm_10 Fermi : -gencode arch=compute_20,code=sm_20 Compile code with the debug flags: Host code : -g Device code : -G Example: $ nvcc -g -G -gencode arch=compute_20,code=sm_20 test.cu -o test 9

Usage [On your desktop] Invoke cuda-gdb from the command line: $ cuda-gdb my_application (cuda-gdb) _ 10

Usage, Further Comments [Invoking cuda-gdb on Euler] When you request resources for running a debugging session on Euler, don't reserve the GPU in exclusive mode (use "default"): >> qsub -I -l nodes=1:gpus=1:default -X (that's a capital "eye" and a lowercase "ell" in the qsub command) >> cuda-gdb myapp You can use ddd as a front end (you need the -X in the qsub command): >> ddd --debugger cuda-gdb myapp If you don't have ddd around, use at least >> cuda-gdb -tui myapp 11

Program Execution Control 12

Execution Control Execution Control is identical to host debugging: Launch the application (cuda-gdb) run Resume the application (all host threads and device threads) (cuda-gdb) continue Kill the application (cuda-gdb) kill Interrupt the application: CTRL-C 13

Execution Control Single-Stepping
At the source level: next (over function calls), step (into function calls).
At the assembly level: nexti (over function calls), stepi (into function calls).
Behavior varies when stepping __syncthreads(). PC at a barrier?
Yes: single-stepping applies to the active and divergent threads of the warp in focus and to all the warps running the same block. Required to step over the barrier.
No: single-stepping applies to the active threads in the warp in focus only. 14

Breakpoints By name (cuda-gdb) break my_kernel (cuda-gdb) break _Z6kernelIfiEvPT_PT0 By file name and line number (cuda-gdb) break test.cu:380 By address (cuda-gdb) break *0x3e840a8 (cuda-gdb) break *$pc At every kernel launch (cuda-gdb) set cuda break_on_launch application 15

Conditional Breakpoints The breakpoint is reported as hit only if the condition is met. All breakpoints are still hit internally; the condition is evaluated every time for all the threads, which typically slows down execution. Condition: C/C++ syntax, no function calls, supports built-in variables (blockIdx, threadIdx,...) 16

Conditional Breakpoints Set at breakpoint creation time (cuda-gdb) break my_kernel if threadIdx.x == 13 Set after the breakpoint is created Breakpoint 1 was previously created (cuda-gdb) condition 1 blockIdx.x == 0 && n > 3 17

Thread Focus 18

Thread Focus There are thousands of threads to deal with; you can't display all of them. The thread focus dictates which thread you are looking at. Some commands apply only to the thread in focus: print local or shared variables, print registers, print stack contents. Attributes of the thread focus: Kernel index: unique, assigned at the kernel's launch time. Block index: the application's blockIdx. Thread index: the application's threadIdx 19

Devices
To obtain the list of devices in the system:
(cuda-gdb) info cuda devices
  Dev  Desc   Type   SMs  Wps/SM  Lns/Wp  Regs/Ln  Active SMs Mask
*   0  gf100  sm_20   14      48      32       64  0xfff
    1  gt200  sm_13   30      32      32      128  0x0
The * indicates the device of the kernel currently in focus. Provides an overview of the hardware that supports the code 20

Kernels
To obtain the list of running kernels:
(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask  GridDim    BlockDim   Name  Args
*      1    0     2  0x3fff    (240,1,1)  (128,1,1)  acos  parms=...
       2    0     3  0x4000    (240,1,1)  (128,1,1)  asin  parms=...
The * indicates the kernel currently in focus. There is a one-to-one mapping between a kernel id (unique across multiple GPUs) and a (dev,grid) tuple; the grid id is unique per GPU only. The name of the kernel is displayed, as are its size and its parameters. Provides an overview of the code running on the hardware 21

Thread Focus
To switch focus to any currently running thread:
(cuda-gdb) cuda kernel 2 block 1,0,0 thread 3,0,0
[Switching focus to CUDA kernel 2 block (1,0,0), thread (3,0,0)]
(cuda-gdb) cuda kernel 2 block 2 thread 4
[Switching focus to CUDA kernel 2 block (2,0,0), thread (4,0,0)]
(cuda-gdb) cuda thread 5
[Switching focus to CUDA kernel 2 block (2,0,0), thread (5,0,0)]
22

Thread Focus
To obtain the current focus:
(cuda-gdb) cuda kernel block thread
kernel 2 block (2,0,0), thread (5,0,0)
(cuda-gdb) cuda thread
thread (5,0,0)
23

Threads
To obtain the list of running threads for kernel 2:
(cuda-gdb) info cuda threads kernel 2
  Block    Thread   To  Block    Thread   Cnt  PC        Filename  Line
* (0,0,0)  (0,0,0)      (3,0,0)  (7,0,0)   32  0x7fae70  acos.cu   380
  (4,0,0)  (0,0,0)      (7,0,0)  (7,0,0)   32  0x7fae60  acos.cu   377
Threads are displayed in (block,thread) ranges. Divergent threads are in separate ranges. The * indicates the range where the thread in focus resides. Cnt indicates the number of threads within each range. All threads in the same range are contiguous (no holes) and share the same PC (and filename/line number) 24

Program State Inspection 25

Stack Trace Same (aliased) commands as in gdb: (cuda-gdb) where (cuda-gdb) bt (cuda-gdb) info stack Applies to the thread in focus On Tesla, all the functions are always inlined 26

Stack Trace
(cuda-gdb) info stack
#0 fibo_aux (n=6) at fibo.cu:88
#1 0x7bbda0 in fibo_aux (n=7) at fibo.cu:90
#2 0x7bbda0 in fibo_aux (n=8) at fibo.cu:90
#3 0x7bbda0 in fibo_aux (n=9) at fibo.cu:90
#4 0x7bbda0 in fibo_aux (n=10) at fibo.cu:90
#5 0x7cfdb8 in fibo_main<<<(1,1,1),(1,1,1)>>> (...) at fibo.cu:95
27

Source Variables A source variable must be live (in scope). Read a source variable: (cuda-gdb) print my_variable $1 = 3 (cuda-gdb) print &my_variable $2 = (@global int *) 0x200200020 Write a source variable: (cuda-gdb) print my_variable = 5 $3 = 5 28

Memory Memory read & written like source variables (cuda-gdb) print *my_pointer May require storage specifier when ambiguous @global, @shared, @local @generic, @texture, @parameter (cuda-gdb) print * (@global int *) my_pointer (cuda-gdb) print ((@texture float **) my_texture)[0][3] 29

Hardware Registers
CUDA registers: Virtual PC: $pc (read-only). SASS registers: $R0, $R1,...
Show a list of registers (no argument to get all of them):
(cuda-gdb) info registers R0 R1 R4
R0  0x6       6
R1  0xfffc68  16776296
R4  0x6       6
Modify one register:
(cuda-gdb) print $R3 = 3
30

Code Disassembly
Must have cuobjdump in $PATH
(cuda-gdb) x/10i $pc
0x123830a8 <_Z9my_kernel10params+8>:   MOV R0, c[0x0][0x8]
0x123830b0 <_Z9my_kernel10params+16>:  MOV R2, c[0x0][0x14]
0x123830b8 <_Z9my_kernel10params+24>:  IMUL.U32.U32 R0, R0, R2
0x123830c0 <_Z9my_kernel10params+32>:  MOV R2, R0
0x123830c8 <_Z9my_kernel10params+40>:  S2R R0, SR_CTAid_X
0x123830d0 <_Z9my_kernel10params+48>:  MOV R0, R0
0x123830d8 <_Z9my_kernel10params+56>:  MOV R3, c[0x0][0x8]
0x123830e0 <_Z9my_kernel10params+64>:  IMUL.U32.U32 R0, R0, R3
0x123830e8 <_Z9my_kernel10params+72>:  MOV R0, R0
0x123830f0 <_Z9my_kernel10params+80>:  MOV R0, R0
31

Run-Time Error Detection 32

CUDA-MEMCHECK Stand-alone run-time error checker tool. Detects memory errors like stack overflow, illegal global address,... Similar to valgrind. No need to recompile the application. Not all the error reports are precise. When used within cuda-gdb, kernel launches are blocking 33
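A sketch of the standalone workflow (the binary name myapp is hypothetical; the exact report text varies with the CUDA version):

```shell
# No recompilation is required, but building with -g -G gives
# better source-line attribution in the error report.
nvcc -g -G -gencode arch=compute_20,code=sm_20 myapp.cu -o myapp

# Run the application under the standalone checker:
cuda-memcheck ./myapp
```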

CUDA-MEMCHECK Errors: Illegal global address, Misaligned global address, Stack memory limit exceeded, Illegal shared/local address, Misaligned shared/local address, Instruction accessed wrong memory, PC set to illegal value, Illegal instruction encountered 34

CUDA-MEMCHECK Integrated in cuda-gdb. More precise errors when used from cuda-gdb. Must be activated before the application is launched: (cuda-gdb) set cuda memcheck on What does "more precise" mean? Precise: exact thread index, exact PC. Not precise: a group of threads or blocks; the PC is several instructions after the offending load/store 35

Example
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
[Launch of CUDA Kernel 0 (applystencil1d) on Device 0]
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
applystencil1d<<<(32768,1,1),(512,1,1)>>> at stencil1d.cu:60
(cuda-gdb) info line stencil1d.cu:60
out[ i ] += weights[ j + RADIUS ] * in[ i + j ];
36

Increase precision Single-stepping Every exception is automatically precise The autostep command Define a window of instructions where we think the offending load/store occurs cuda-gdb will single-step all the instructions within that window automatically and without user intervention (cuda-gdb) autostep foo.cu:25 for 20 lines (cuda-gdb) autostep *$pc for 20 instructions 37

New in CUDA 4.1 Source base upgraded to gdb 7.2 Simultaneous cuda-gdb sessions support Multiple context support Device assertions support New autostep command More info: http://sbel.wisc.edu/courses/me964/2012/documents/cuda-gdb41.pdf 38

Tips & Miscellaneous Notes 39

Best Practices 1. Determine the scope of the bug: Incorrect result, Unspecified Launch Failure (ULF), Crash, Hang, Slow execution. 2. Reproduce with a debug build: Compile your app with -g -G. Rerun; hopefully you can reproduce the problem in debug mode 40

Best Practices 3. Correctness Issues: First throw cuda-memcheck at it in stand-alone mode. Then cuda-gdb and cuda-memcheck if needed. printf is also an option, but not recommended. 4. Performance Issues (once no bugs are evident): Use a profiler (discussed next). Change the code; you might have to go back to Step 1 above 41

Tips Always check the return code of the CUDA API routines! If you use printf from the device code, make sure to synchronize so that the buffers are flushed 42
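The return-code advice is commonly wrapped in a small helper macro. This is a widespread idiom rather than code from the slides; the macro name CUDA_CHECK is illustrative:

```cuda
// checked.cu: an error-checking idiom for CUDA API calls.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main()
{
    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));
    // Kernel launches do not return an error code directly;
    // check the deferred status after synchronizing instead.
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

The do/while(0) wrapper makes the macro behave like a single statement, so it composes safely with if/else.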

Tips To hide devices, launch the application with CUDA_VISIBLE_DEVICES=0,3 where the numbers are device indexes. To increase determinism, launch the kernels synchronously: CUDA_LAUNCH_BLOCKING=1 43
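A hypothetical invocation showing both variables (myapp stands in for your binary):

```shell
# Expose only devices 0 and 3 to the application; inside the
# application they appear as devices 0 and 1.
CUDA_VISIBLE_DEVICES=0,3 ./myapp

# Force every kernel launch to be synchronous, so failures surface
# at the offending launch instead of at a later API call.
CUDA_LAUNCH_BLOCKING=1 ./myapp
```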

Tips To print multiple consecutive elements in an array, use @: (cuda-gdb) print array[3]@4 To find the mangled name of a function (cuda-gdb) set demangle-style none (cuda-gdb) info function my_function_name 44

Further Information More resources: CUDA tutorials video/slides at GTC http://www.gputechconf.com/ CUDA webinars covering many introductory to advanced topics http://developer.nvidia.com/gpu-computing-webinars CUDA Tools Download: http://www.nvidia.com/getcuda Other related topic: Performance Optimization Using the NVIDIA Visual Profiler 45

End Quick Overview, Debugging Start Quick Overview, Profiling 46

Optimization: CPU and GPU [Figure: CPU and GPU with their memories; GPU memory bandwidth on the order of 177 GB/s] CPU: a few cores, OK memory bandwidth, best at serial execution, bottom-heavy memory hierarchy. GPU: hundreds of cores, higher memory bandwidth, best at parallel execution, top-heavy memory hierarchy 47

Optimization: Maximize Performance Take advantage of the strengths of both CPU and GPU; you don't need to port the entire application to the GPU. [Figure: compute-intensive functions of the application code move to the GPU via directives, CUDA C/C++, Fortran, or libraries, while the rest of the sequential code stays on the CPU] 48

Application Optimization Process and Tools 1. Identify optimization opportunities (remember Amdahl's law?): gprof, Intel VTune. 2. Parallelize with CUDA, confirm functional correctness: cuda-gdb, cuda-memcheck, Parallel Nsight Memory Checker, Parallel Nsight Debugger; other non-NVIDIA tools: Allinea DDT, TotalView. 3. Optimize: NVIDIA Visual Profiler (nvvp), Parallel Nsight; 3rd-party: Vampir, Tau, PAPI,... 49



Usage CUDA application at a breakpoint == Frozen display Multiple Solutions: Console mode: no X server Multiple GPUs: one for display, one for compute Remote Debugging: SSH, VNC,... 55

Outline Introduction Installation & Usage Program Execution Control Thread Focus Program State Inspection Run-Time Error Detection Tips & Miscellaneous Notes Conclusions 56


Miscellaneous Notes On sm_1x architectures, device functions are always inlined: there is no stepping over a function call, and the stack trace depth is always 1 58