GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

1 GPU Debugging Made Easy
David Lecomber, CTO, Allinea Software

2 Allinea Software
- HPC development tools company
  - Leading in HPC software tools market
  - Wide customer base: blue-chip engineering, government and academic research
- Allinea DDT: the leading debugger in parallel computing
  - World's only scalable debugger
  - Record holder for debugging software on largest machines
  - Production use at extreme scale and desktop
  - First at Petascale and first for GPUs!
- Allinea OPT: profiling tool for parallel applications

3 Some Clients and Partners
- Aviation and Defence
- Climate and Weather
- Energy
- Electronic Design Automation
- Academic: over 200 universities

4 Extreme machine sizes
- Scientific progress requires more CPU hours
- Machine sizes are exploding
  - Skewed by largest machines, but a common trend
- Software changing to exploit the machines
[Figure: Growth in HPC core counts. Core count by year; series: average cores, largest, smallest.]

5 HPC's current challenge
- New rival to traditional processors
  - AMD and NVIDIA GPUs
  - OpenCL and CUDA
- New problems for HPC developers
  - Data transfer; multiple memory levels
  - Grid/block layout and thread scheduling
  - Synchronization
- New languages, compilers, standards
  - Lower level: NVIDIA CUDA C/C++; CUDA Driver API; OpenCL
  - Higher level: CAPS HMPP; PGI CUDA Fortran; PGI Accelerators; Cray OpenMP Accelerators

6 Today's parallel hybrid world
- Hardware determines the software
  - Exploit single address space within nodes: shared memory via OpenMP, pthreads, ...
  - Or exploit GPUs: CUDA, OpenCL
- Mixture of paradigms: message passing, shared memory and GPU
  - MPI + OpenMP: still not typical; benefits not worth the shift for many MPI applications
  - MPI + GPU: many HPC systems have GPUs as the grunt of the machine; cannot leave the majority of flops in a system idle!
- Extreme scale
  - Many software projects are in progress because of GPUs
  - More complex and heterogeneous than before
  - More languages...

7 How do we fix software?
- With
  - Thousands of threads
  - Millions of variables
  - Terabytes of data
- How do you figure out what's going on with your code?
  - Old tricks long dead: multiple terminals, print statements, ...
- Different from (e.g.) Google
  - Everything is inter-related, not independent
  - We need to see all threads and processors together
- Does it look like your problem?
- Does it look like your next problem?

8 Allinea DDT
- Graphical debugger designed for:
  - Multithreaded code: single address space
  - Multiprocess or parallel code: interdependent or independent processes
  - Multi-node software
  - Hybrid code
  - GPU + CPU code
  - Any mix of the above
- Strong feature set
  - Memory debugging
  - Data analysis
- Managing concurrency
  - Emphasizing differences
  - Collective control
- Make as simple as possible, no more

9 Simplifying control flow
- Typical crash scenario:
  - Threads/processes can be anywhere
  - Cannot examine each individually, but locating threads is essential
- A good overview is important
  - Leap to source for crashes
- Allinea DDT merges stacks from processes and threads into a tree
  - Presents information scalably, without overload
  - Common fault patterns evident instantly: divergence, deadlock
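The idea of merging stacks into a tree can be pictured with a small sketch. This is not DDT's implementation; it is a minimal prefix-tree illustration, and the function names (`merge_stacks`, `render`) and the sample stacks are hypothetical.

```python
def merge_stacks(stacks):
    """Merge per-process call stacks (outermost frame first) into a prefix tree.

    Each node records which ranks (hypothetical process ids) pass through it.
    """
    root = {}
    for rank, frames in stacks.items():
        level = root
        for frame in frames:
            node = level.setdefault(frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
            level = node["children"]
    return root


def render(tree, depth=0):
    """Return one indented text line per tree node, with its process count."""
    lines = []
    for frame in sorted(tree):
        node = tree[frame]
        lines.append(f"{'  ' * depth}{frame} [{len(node['ranks'])} procs]")
        lines.extend(render(node["children"], depth + 1))
    return lines


# Hypothetical stacks: three ranks wait in a receive, one has diverged.
stacks = {
    0: ["main", "solve", "MPI_Recv"],
    1: ["main", "solve", "MPI_Recv"],
    2: ["main", "solve", "MPI_Recv"],
    3: ["main", "write_output"],
}
tree = merge_stacks(stacks)
for line in render(tree):
    print(line)
```

Four stacks collapse into four tree nodes, and the single process under `write_output` stands out immediately. This is the kind of divergence the slide refers to.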

10 Controlling progress
- Bulk control is essential for parallel debugging
  - Group together processes
  - Step, breakpoint, play, based on group
  - Change interleaving order by stepping/playing selectively
- Integrated throughout Allinea DDT
  - Stack and data views for group creation
  - Morphs to scale!

11 Simplifying data divergence
- Developers need to see data
  - Too many variables to trawl manually
  - Allinea DDT compares data automatically
- Smart Highlighting
  - Subtly highlights a value if it differs on another process or has changed
  - Now with sparklines!
- More detailed analysis
  - Full cross-process comparison
  - Historical values of variables via tracepoints
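A cross-process comparison can be thought of as grouping processes by the value each holds for the same variable. The sketch below is an illustration only, not DDT's algorithm; `compare_across_processes`, `outliers` and the sample values are hypothetical.

```python
def compare_across_processes(values):
    """Group ranks by the value they hold; largest group first."""
    groups = {}
    for rank, value in values.items():
        groups.setdefault(value, []).append(rank)
    # Sort by descending group size, then by the first rank in the group.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[1][0]))


def outliers(values):
    """Ranks whose value differs from the most common value."""
    groups = compare_across_processes(values)
    return sorted(rank for _, ranks in groups[1:] for rank in ranks)


# Hypothetical loop counter read from five processes.
iteration = {0: 100, 1: 100, 2: 100, 3: 97, 4: 100}
print(compare_across_processes(iteration))
print(outliers(iteration))
```

With thousands of processes the grouped view stays short as long as most processes agree, which is why a summary of differences scales better than a per-process list.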

12 Searching haystacks
- Arrays are the building blocks of HPC
  - Largest jobs accumulate vast terabytes of data
  - ~2GB per core is typical max available and frequently used
- Allinea DDT displays and searches across whole job in parallel
  - Sometimes need to search for NaN/Inf etc.
- Export at runtime
  - Working on real visualization integration
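Searching a distributed array for NaN/Inf amounts to each process scanning its local chunk and reporting hits. A minimal serial sketch follows; the `find_non_finite` helper and the sample chunks are hypothetical, and a real tool would run the scan on every process in parallel.

```python
import math


def find_non_finite(chunks):
    """Return (rank, local index, value) for every NaN or Inf entry.

    `chunks` maps a rank to that process's local slice of the array.
    """
    hits = []
    for rank in sorted(chunks):
        for index, value in enumerate(chunks[rank]):
            if not math.isfinite(value):
                hits.append((rank, index, value))
    return hits


# Hypothetical local array slices held by three processes.
chunks = {
    0: [1.0, 2.5, 3.0],
    1: [0.0, float("nan"), 4.0],
    2: [float("inf"), 1.0, -2.0],
}
hits = find_non_finite(chunks)
for rank, index, value in hits:
    print(f"rank {rank}, element {index}: {value}")
```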

13 Debugging for Petascale
- Allinea DDT scales
  - A tree network communicates with daemons
  - Logarithmic performance
- Partnership with largest users
  - US DoE Oak Ridge National Laboratory
  - Also projects with Argonne National Lab, CEA (France), and others
- High performance debugging
  - Over 220,000 cores debugged simultaneously
  - Step all and display stacks in ~1/10 second
- Usability: scalable interface and features
  - Memory debugging
  - Array filtering
  - Data comparison, etc.
[Figure: DDT 3.0 Performance Figures, Jaguar Cray XT5. Time (seconds) against number of MPI processes; series: All Step, All Breakpoint.]
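The "logarithmic performance" of a tree network comes from the number of hops between the front end and any daemon growing with the logarithm of the process count. The slide does not state DDT's fan-out, so the fan-out values below are assumptions chosen only to show the arithmetic.

```python
def tree_depth(num_leaves, fanout):
    """Levels needed for a tree with the given fan-out to reach num_leaves."""
    depth = 0
    reach = 1
    while reach < num_leaves:
        reach *= fanout
        depth += 1
    return depth


# 220,000 is the core count quoted on the slide; the fan-outs are hypothetical.
for fanout in (2, 32):
    print(f"fan-out {fanout}: {tree_depth(220_000, fanout)} levels")
```

Under these assumptions a command crosses 18 levels with a binary tree and 4 levels with a fan-out of 32, compared with 220,000 sequential messages for a flat, one-at-a-time scheme.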

14 but what about the GPUs?

15 Heterogeneity example - GPU
- Command line tool difficult to see through
  - Fundamentals of control and inspection in place...
  - But... intractable thread lists make usage impossible
- Now consider multiple such tools for MPI!
- Support for the other compilers?

16 Life's easier with a GUI
- Allinea DDT supports NVIDIA GPUs
  - Built on NVIDIA's low level efforts: cuda-gdb, driver, compiler
- Compile, debug, fix!
  - nvcc -g -G
- Running on the GPU
  - Real chance of finding real GPU bugs
  - Not as quick as debugging on CPU
- GPU and CPU within one interface
  - Easy to switch between contexts
  - Parallel stacks, thread selectors etc.
  - Data and threads from each context are clear
  - Step warps, grids, kernels

17 Simple CUDA debugging
- Almost like debugging a CPU
  - Double click to set breakpoints
  - Automatically stop on kernel launch
  - Stop at a line of CUDA code
  - Hover the mouse for more information
- Step a warp, block or kernel
  - Follow the logic of individual threads through the kernel
  - Switch threads to see thread data
- Run through to a crash
  - CUDA Memcheck feature detects read/write errors
- Data types shown
  - Register, shared, constant, global...

18 Kernel and system overviews
- Kernel progress view
  - Shows progress through kernels
  - Click to select a thread
- Device overview shows system properties
  - Handy for optimizing grid sizes
  - Handy for bug fixing and detecting hardware failure!

19 DDT for CAPS HMPP
- Source-to-source compiler
  - Higher level than CUDA
  - F90 to C/CUDA, C to C/CUDA
- Allinea and CAPS developing debugging support for HMPP
  - Debuggable inside C kernels (codelets) on the GPU
  - F90 multi-dimensional arrays supported
  - Auto-reporting of CAPS runtime errors to the DDT GUI
  - Also able to debug codelets running on the CPU

20 DDT For Directives - Cray
- Source to PTX level compiler
  - Higher level than CUDA
  - OpenMP accelerator directives/pragmas
- Debuggable
  - Run accelerated code on the CPU by setting -O0
- Debugging of the GPU itself: work in progress with Allinea/Cray
  - C/F90 examples debuggable
  - Use -g -Gomp compiler flags to debug on the GPU

21 DDT for Portland compilers
- PGI Accelerator Model
  - Debug on CPU only: use -ta=host to set this
  - Runs as a single thread on the CPU
  - Debugging possible, but race conditions wouldn't be seen
- PGI CUDA Fortran
  - Debug on CPU only: use the -Mcuda=emu flag
  - Runs multithreaded on the CPU if GPU disabled
  - Works well with DDT
  - Easy to see missing syncthreads, for example

22 Common Errors Part I
- Kernel bounds: getting the right grids and blocks
  - Incorrect kernel thread boundaries can lead to incomplete results
- Solution:
  - Use DDT's multi-dimensional array viewer to look at the data and find the missing indexes, with 3D display and filtering support
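To make the "incomplete results" failure concrete, the sketch below emulates a 1D kernel launch in plain Python (not CUDA). The array size, block size and function names are hypothetical; the point is only that rounding the block count down leaves the tail of the array unwritten.

```python
def run_kernel(n, block_size, num_blocks):
    """Emulate a 1D launch where each thread writes out[i] = 2 * i."""
    out = [None] * n
    for block in range(num_blocks):
        for thread in range(block_size):
            # Same index arithmetic as blockIdx.x * blockDim.x + threadIdx.x.
            i = block * block_size + thread
            if i < n:
                out[i] = 2 * i
    return out


def missing_indexes(out):
    """Indexes that no thread wrote, as an array viewer filter might show."""
    return [i for i, value in enumerate(out) if value is None]


n, block_size = 1000, 256                      # hypothetical problem size
too_few = n // block_size                      # rounds down: 3 blocks
enough = (n + block_size - 1) // block_size    # rounds up: 4 blocks

gaps = missing_indexes(run_kernel(n, block_size, too_few))
print(len(gaps), gaps[0], gaps[-1])
print(missing_indexes(run_kernel(n, block_size, enough)))
```

With three blocks only 768 threads run, so elements 768 to 999 are never computed. Filtering the array for unset values points straight at that range.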

23 Common Errors Part II
- Kernel bounds: getting the right grids and blocks
  - Incorrect kernel thread boundaries can lead to crashing of the kernel
- Solution:
  - Bugs often trigger CUDA memcheck errors
  - Run with DDT and CUDA memory debugging enabled
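The crashing variant usually arises the other way round: the grid is rounded up so that every element is covered, but the kernel has no bounds check, so the surplus threads index past the end of the array. The plain-Python emulation below (hypothetical sizes and function name, not CUDA) lists the thread indexes that a memory checker would be expected to flag.

```python
def out_of_bounds_threads(n, block_size, num_blocks):
    """Global indexes of threads that fall past the end of an n-element array.

    In a kernel without an `if (i < n)` guard, each of these threads would
    read or write invalid memory.
    """
    return [
        block * block_size + thread
        for block in range(num_blocks)
        for thread in range(block_size)
        if block * block_size + thread >= n
    ]


n, block_size = 1000, 256                         # hypothetical problem size
num_blocks = (n + block_size - 1) // block_size   # rounds up to 4 blocks

bad = out_of_bounds_threads(n, block_size, num_blocks)
print(len(bad), bad[0], bad[-1])
```

Here 1024 threads are launched for 1000 elements, so threads 1000 to 1023 are the ones that need the guard.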

24 Summary
- GPU debugging can be simple with Allinea DDT
  - On-device debugging is similar to normal debugging
  - Many more threads, but still manageable
  - Existing MPI ideas work well for GPU threads
  - Also a solution for cluster systems
- Multiple language choices for GPUs
  - CUDA may not be the right level for your project
  - Debugging support is available for high and low level languages
