CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)


Member of the Helmholtz Association
CUDA Tools for Debugging and Profiling
Jiri Kraus (NVIDIA)
GPU Programming with CUDA @ Jülich Supercomputing Centre, Jülich, 25-27 April 2016

What you will learn:
- How to use cuda-memcheck to detect invalid memory accesses
- How to use Nsight EE to debug a CUDA program
- How to use the NVIDIA Visual Profiler

cuda-memcheck
cuda-memcheck is a memory correctness tool similar to Valgrind's memcheck. It provides two tools (selected via --tool):
- memcheck: memory access checking
- racecheck: shared memory hazard checking
Compile with debug information (-g -G).
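As a hedged illustration (not the course's task code), a kernel with an off-by-one guard is the kind of bug --tool memcheck flags as an invalid global write:

```cuda
// Minimal sketch of an out-of-bounds bug (illustrative, not the course's
// task0 source). Compile with: nvcc -g -G oob.cu
// Run with:                    cuda-memcheck ./a.out
#include <cstdio>

__global__ void scale(float a, float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n)              // BUG: should be i < n; the thread with i == n
        out[i] = a * in[i];  // writes one element past the end of out
}

int main() {
    const int n = 256;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    // 384 threads, so a thread with i == n exists and triggers the bug
    scale<<<3, 128>>>(2.0f, in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With -g -G, cuda-memcheck can point at the offending source line; after changing the guard to i < n, it should report no errors.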


Nsight Eclipse Edition
Nsight Eclipse Edition is an IDE for CUDA development:
- Source editor with CUDA C and C++ syntax highlighting
- Project and file management with version control integration
- Integrated build system
- GUI for debugging heterogeneous applications
- Visual Profiler integration
Nsight EE is part of the CUDA Toolkit.

Using Nsight EE to debug a CUDA Program
Start Nsight EE: nsight


Task 0: Use cuda-memcheck to identify an error
Go to CUDATools/exercises/tasks
Build task 0: make task0-cuda-memcheck
Run cuda-memcheck: cuda-memcheck ./task0-cuda-memcheck
Identify and fix the error in task0-cuda-memcheck.cu (cuda-memcheck should then run without errors)

Task 1: Use Nsight EE to debug a program
Go to CUDATools/exercises/tasks
Build task 1: make task1-cuda-gdb
Start Nsight EE: nsight
Set up a debug session in Nsight EE
Use the variable view to let thread 4 from block 1 print 4 (instead of 0)
Do not modify the source code

Why performance measurement tools? You can only improve what you measure. You need to identify:
- Hotspots: which function takes most of the run time?
- Bottlenecks: what limits the performance of the hotspots?
Manual timing is tedious and error-prone: it is possible for small applications like the Jacobi solver and matrix multiplication, but impractical for larger or more complex applications. Profiling tools also give access to hardware counters (PAPI, CUPTI).
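To see why manual timing does not scale, consider the bookkeeping needed to time a single kernel with CUDA events (a sketch; the kernel name and launch configuration are illustrative assumptions):

```cuda
// Sketch of manual kernel timing with CUDA events. Every measured region
// needs its own event pair, synchronization, and printout.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
scale<<<blocks, threads>>>(2.0f, in, out, n);  // region being measured
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the kernel and stop event complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("scale kernel: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Multiplied over dozens of kernels and API calls, this instrumentation becomes exactly the tedium that nvprof and nvvp automate.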

The command line profiler nvprof
A simple launcher to get profiles of your application. Profiles CUDA kernels and API calls.

> nvprof --unified-memory-profiling per-process-device ./scale_vector_um
==32717== NVPROF is profiling process 32717, command: ./scale_vector_um
==32717== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
Passed!
==32717== Profiling application: ./scale_vector_um
==32717== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
100.00%  6.4320us  1      6.4320us  6.4320us  6.4320us  scale(float, float*, float*, int)
[...] snip

Unified Memory excursus: zero-copy fallback

$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx5_0  CPU Affinity
GPU0    X     PIX   SOC   SOC   PHB     0-11,24-35
GPU1    PIX   X     SOC   SOC   PHB     0-11,24-35
GPU2    SOC   SOC   X     PIX   SOC     12-23,36-47
GPU3    SOC   SOC   PIX   X     SOC     12-23,36-47
mlx5_0  PHB   PHB   SOC   SOC   X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

(Diagram: GPU0 and GPU1 connected via GPUDirect P2P under CPU0; GPU2 and GPU3 connected via GPUDirect P2P under CPU1.)

CUDA_MANAGED_FORCE_DEVICE_ALLOC

> CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 nvprof --unified-memory-profiling per-process-device ./scale_vector_um
==491== NVPROF is profiling process 491, command: ./scale_vector_um
Passed!
==491== Profiling application: ./scale_vector_um
==491== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
100.00%  4.1600us  1      4.1600us  4.1600us  4.1600us  scale(float, float*, float*, int)

==491== Unified Memory profiling result:
Device "Tesla K80 (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
1      8.0000KB  8.0000KB  8.0000KB  8.000000KB  19.95500us  Host To Device
6      4.0000KB  4.0000KB  4.0000KB  24.00000KB  95.08400us  Device To Host

==491== API calls:
Time(%)  Time      Calls  Avg       Min       Max       Name
98.87%   320.30ms  2      160.15ms  50.132us  320.25ms    cudaMallocManaged
[...] snip

With zero-copy fallback, for comparison:
100.00%  6.4320us  1      6.4320us  6.4320us  6.4320us  scale(float, float*, float*, int)
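The scale_vector_um program profiled above allocates its vectors with cudaMallocManaged; a minimal sketch of that pattern (an assumed structure, not the actual course source) is:

```cuda
// Sketch of a Unified Memory vector scale (illustrative structure of a
// program like scale_vector_um; sizes and names are assumptions).
#include <cstdio>

__global__ void scale(float a, float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int n = 2048;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));  // accessible from host and device
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;    // host writes managed memory directly

    scale<<<(n + 255) / 256, 256>>>(2.0f, in, out, n);
    cudaDeviceSynchronize();                     // required before the host reads results

    printf(out[0] == 2.0f ? "Passed!\n" : "Failed!\n");
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The Host To Device and Device To Host rows in the Unified Memory profiling result above correspond to the migrations the driver performs for exactly these host-side accesses.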

nvprof interoperability with nvvp
nvprof can write the application timeline to an nvvp-compatible file:
nvprof --unified-memory-profiling per-process-device -o scale_vector.nvprof ./scale_vector_um
Then import the file in nvvp.

nvprof: important command-line options

-o, --export-profile <filename>
    Export the result file, which can be imported later or opened by the NVIDIA Visual Profiler. "%p" in the file name string is replaced with the process ID of the application being profiled. "%q{<ENV>}" is replaced with the value of the environment variable "<ENV>"; if the variable is not set, it is an error. "%h" is replaced with the hostname of the system. "%%" is replaced with "%". Any other character following "%" is illegal. By default, this option disables the summary output. Note: if the application being profiled creates child processes, or if --profile-all-processes is used, the "%p" format is needed to get correct export files for each process.

--analysis-metrics
    Collect profiling data that can be imported into the Visual Profiler's "analysis" mode. Note: use --export-profile to specify an export file.

--unified-memory-profiling <per-process-device|off>
    Options for unified memory profiling. Allowed values: per-process-device (collect counts for each process and each device); off (turn off unified memory profiling; default).

-h, --help
    Print this help information.

nvvp introduction

Task 2: Analyze scale_vector
Start scale_vector with nvprof and write the timeline to a file
Import the profile into nvvp
Create a profile with --analysis-metrics and run the guided analysis

Cheat Sheet
Generate a timeline with nvprof:
  nvprof --unified-memory-profiling per-process-device -o <output-profile> ./a.out
Collect analysis metrics with nvprof:
  nvprof --unified-memory-profiling per-process-device --analysis-metrics -o <output-profile> ./a.out
Start nvvp:
  nvvp
Profiler User's Guide: http://docs.nvidia.com/cuda/profiler-users-guide/index.html