NEW DEVELOPER TOOLS FEATURES IN CUDA 8.0. Sanjiv Satoor

Similar documents
April 4-7, 2016 Silicon Valley

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

CUDA Tools for Debugging and Profiling

April 4-7, 2016 Silicon Valley. CUDA DEBUGGING TOOLS IN CUDA8 Vyas Venkataraman, Kudbudeen Jalaludeen, April 6, 2016

PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt 11 th Parallel Tools Workshop September 11-12, 2017

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)

PROFILER USER'S GUIDE. DU _v9.1 December 2017

OpenACC Course. Office Hour #2 Q&A

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

NSIGHT ECLIPSE EDITION

NVIDIA Nsight Visual Studio Edition 4.0 A Fast-Forward of All the Greatness of the Latest Edition. Sébastien Dominé, NVIDIA

PROFILER. DU _v5.0 October User's Guide

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

S CUDA on Xavier

NSIGHT ECLIPSE EDITION

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University

CUPTI. DA _v10.1 February User's Guide

CUDA 7.5 OVERVIEW WEBINAR 7/23/15

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Profiling of Data-Parallel Processors

Tesla GPU Computing A Revolution in High Performance Computing

PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS

Debugging Your CUDA Applications With CUDA-GDB

GPU Programming. Rupesh Nasre.

Profiling & Tuning Applications. CUDA Course István Reguly

S WHAT THE PROFILER IS TELLING YOU OPTIMIZING WHOLE APPLICATION PERFORMANCE. Mathias Wagner, Jakob Progsch GTC 2017

NVIDIA CUDA TOOLKIT

NSIGHT ECLIPSE EDITION

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

GPU Computing Master Clss. Development Tools

Intel Parallel Studio XE 2017 Composer Edition BETA C++ - Debug Solutions Release Notes

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C

NVIDIA CUDA TOOLKIT

OPTIMIZING HPC SIMULATION AND VISUALIZATION CODE USING NVIDIA NSIGHT SYSTEMS

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

NVIDIA CUDA TOOLKIT

TESLA DRIVER VERSION (LINUX)/411.82(WINDOWS)

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge

HPC Tools on Windows. Christian Terboven Center for Computing and Communication RWTH Aachen University.

NVIDIA Parallel Nsight. Jeff Kiel

NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL)

Enabling the Next Generation of Computational Graphics with NVIDIA Nsight Visual Studio Edition. Jeff Kiel Director, Graphics Developer Tools

Profiling GPU Code. Jeremy Appleyard, February 2016

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL)

PROFILER USER'S GUIDE. DU _v7.5 September 2015

Graphics Performance Analyzer for Android

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

GPU Fundamentals Jeff Larkin November 14, 2016

MODELING CUDA COMPUTE APPLICATIONS BY CRITICAL PATH. PATRIC ZHAO, JIRI KRAUS, SKY WU

Heidi Poxon Cray Inc.

GPU ACCELERATORS AT JSC OF THREADS AND KERNELS

CME 213 S PRING Eric Darve

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

Eliminate Threading Errors to Improve Program Stability

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

CUDA Architecture & Programming Model

Introduction to Multicore Programming

CUDA 6.0. Manuel Ujaldón Associate Professor, Univ. of Malaga (Spain) Conjoint Senior Lecturer, Univ. of Newcastle (Australia) Nvidia CUDA Fellow

NVIDIA CUDA TOOLKIT

Performance Tools for Technical Computing

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

System Wide Tracing User Need

ME964 High Performance Computing for Engineering Applications

Introduction to Parallel Performance Engineering

Performance Analysis of Parallel Scientific Applications In Eclipse

Fundamental CUDA Optimization. NVIDIA Corporation

IBM High Performance Computing Toolkit

Intel profiling tools and roofline model. Dr. Luigi Iapichino

Eliminate Threading Errors to Improve Program Stability

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

Fundamental CUDA Optimization. NVIDIA Corporation

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Eliminate Memory Errors to Improve Program Stability

High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA-MEMCHECK. DU _v5.0 October User Manual

Introduction to Multicore Programming

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

CS377P Programming for Performance GPU Programming - II

GPU Technology Conference Three Ways to Debug Parallel CUDA Applications: Interactive, Batch, and Corefile

Hands-on CUDA Optimization. CUDA Workshop

NVIDIA Fermi Architecture

Using JURECA's GPU Nodes

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011

Transcription:

NEW DEVELOPER TOOLS FEATURES IN CUDA 8.0 Sanjiv Satoor

CUDA TOOLS 2

NVIDIA NSIGHT Homogeneous application development for CPU+GPU compute platforms CUDA-Aware Editor CUDA Debugger CPU+GPU CUDA Profiler 3

CUDA TOOLS VISUAL PROFILER Trace CUDA activities Profile CUDA kernels Correlate performance instrumentation with source code Expert-guided performance analysis NVPROF Collect Performance events and metrics GPU LIBRARY ADVISOR Detect CUDA library optimization opportunities NVDISASM, CUOBJDUMP CUDA-MEMCHECK Detect out-of-bounds and misaligned memory accesses Detect race condition in memory accesses Detect uninitialized global memory accesses Detect incorrect GPU thread synchronization CUDA-GDB Debug CUDA kernels with CLI Debug CPU and GPU code CPU and GPU core dump support 4

NEW TOOLS FEATURES IN CUDA 8.0 5

SUPPORT FOR PASCAL ARCHITECTURE ALL TOOLS Support for GPUs with Compute Capability 6.0, 6.1 and 6.2 6

DEPENDENCY ANALYSIS Visual Profiler, nvprof Provide insight into application-level performance limiters Expose dependencies between activities according to the programming model Identify waiting time due to inter-stream dependencies Highlight activities on the critical application runtime path Supports CUDA (Linux/Mac/Windows) and POSIX threads (Linux/Mac) 7

DEPENDENCY ANALYSIS IN VISUAL PROFILER 8

UNIFIED MEMORY PROFILING Requirements: Pascal (GP100) and Linux-64, Windows TCC in the future Data transfers to/from GPU with accurate timestamps (event-based) Supported counters: CPU and GPU page faults, H2D data migration, D2H data migration New activity record with timestamps, transfer size, virtual page base address, devices involved 9

UNIFIED MEMORY PROFILING Visual profiler - 8.0 unified memory timeline 10

UNIFIED MEMORY PROFILING ANALYSIS 11

Unguided Analysis NVLINK VISUALIZATION Visual Profiler Static properties Runtime values Option to collect NVLink information Topology Achieved throughput 12

NVLINK VISUALIZATION Visual Profiler on DGX1 13

CPU PROFILING Visual Profiler Profiles the process of interest Collect CPU sampling data alongside the GPU data you are used to Sample at a specific frequency Function & library symbols listed Analyze the data per-thread from different perspectives (top-down, bottom-up or flat) 14

CPU PROFILING IN VISUAL PROFILER 15

OPENACC PROFILING Visual Profiler, nvprof nvprof: Summary of OpenACC activities with their inclusive and exclusive time Trace of activities along with CUDA API calls and GPU activities, including CUDA device/ctx/stream info, data transfer sizes, etc. Source file/line and kernel name correlation Visual Profiler: Visualization of OpenACC activities, coloring for different types of activities Correlation with their compiler-generated API calls and device activities Source code view for mapped OpenACC directives 16

OPENACC PROFILING OpenAcc->Driver API->Compute correlation OpenAcc timeline OpenAcc Properties OpenAcc->Source Code correlation 17

COMPUTE PREEMPTION SUPPORT Debugger New HW feature introduced with Pascal: Long running kernels can be preempted Processes using the GPU run without blocking each other (CUDA/Graphics) Resulting debugger improvements: Support for fast single GPU debugging Lower overhead for attaching to running process 18

DEBUG ON DISPLAY GPU WITH NO RESTRICTIONS 19

MEMORY CHECKER cuda-memcheck Support for non-migratable system-scoped atomics checking on SM 6.x in Memcheck Tool. Support for reporting fatal CPU-side faults when Unified Memory is enabled in Memcheck Tool. Support for correctly determining the expected set of threads at a barrier in the presence of exited threads in Synccheck Tool. 20

NEW TOOLS FEATURES IN CUDA 8.0 PASCAL Architecture support Profiler enhancements Dependency Analysis Unified Memory profiling CPU Profiling OpenACC profiling NVLINK analysis Debug on display GPU with no restrictions Memcheck enhancements 21

FOR MORE INFORMATION CUDA 8 Features Revealed Parallel Forall Blog Post : https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/ CUDA Documentation: http://docs.nvidia.com/cuda/ Download CUDA Toolkit 8: https://developer.nvidia.com/cuda-downloads 22

BACKUP SLIDES 24

AGENDA CUDA Tools summary New Tools Features in CUDA 8.0 25

CUDA TOOLS IDE: Debug: Memcheck: Nsight Eclipse Edition, Nsight Visual Studio Edition cuda-gdb, CUDA Debug API cuda-memcheck Profile/Trace: CUDA Visual Profiler, nvprof, CUPTI 26

DEPENDENCY ANALYSIS EXAMPLE Dependencies between events derived from programming model constraints Allows to compute wait states and the critical path cudalaunch cudastreamsynchronize Stalls CPU (waiting time) Kernel 27

DEPENDENCY ANALYSIS IN VISUAL PROFILER New option in the Visual Profiler s expert system and new modes to display the critical path data on the timeline. Exec dependencies for a selected interval: 28

UNIFIED VIRTUAL MEMORY Starting with Kepler and CUDA 6 Custom Data Management Developer View With Unified Memory System Memory GPU Memory Unified Memory 11/16/ 29

UNIFIED MEMORY PROFILING Visual profiler - 8.0 unified memory timeline (interval selection) 30

RUN GRAPHICS APPLICATIONS WHILE DEBUGGING CUDA APPLICATIONS 31

RUN MULTIPLE DEBUGGER SESSIONS 32

DEPENDENCY ANALYSIS IN NVPROF New option to run post-mortem dependency analysis New option to trace POSIX threads for multi-threaded applications ==22557== Dependency Analysis: Critical path(%) Critical path Waiting time Name 94.61% 3.942181s 0ns clock_block(long*, long) 5.20% 216.857718ms 0ns cudamalloc 0.16% 6.617667ms 0ns <Other> 0.01% 293.028000us 0ns cudevicegetattribute 0.01% 235.154000us 0ns cudagetdeviceproperties 0.01% 221.116000us 0ns cudafree 0.00% 158.703000us 0ns cudastreamcreate 0.00% 35.252000us 0ns cudaconfigurecall 0.00% 35.248000us 0ns cudevicegetname 0.00% 33.139000us 0ns cudevicetotalmem_v2 0.00% 20.298000us 0ns cudasetupargument 0.00% 19.433000us 0ns cudagetdevice 0.00% 0ns 3.942147s pthread_join 0.00% 0ns 3.942136s cudastreamsynchronize 0.00% 0ns 1.001459s pthread_mutex_lock 0.00% 0ns 980.464357ms pthread_cond_wait 33

NVLINK ANALYSIS Visual Profiler Memcpy P2P 34

NVLINK CUPTI SUPPORT New record for Nvlink topology Information about connected devices and connected ports Version, peak BW, properties (e.g. peer/sysmem accesses, atomic accesses, ) Performance given using metrics: data transmitted/throughput, overhead Metrics available split according to type of operation (atom/reduction/write/response) as well as accumulated 35

SOURCE-DISASSEMBLY VIEW ENHANCEMENTS Visual Profiler Single integrated view for the different source level analysis results The Source-Disassembly view contains: High level source Assembly instructions Hotspots at the source level Hotspots at the assembly instruction level Columns for profiling data aggregated to the source level Columns for profiling data collected at the assembly instruction level 36

SOURCE-DISASSEMBLY VIEW ENHANCEMENTS 37

MISC IMPROVEMENTS Profilers/CUPTI Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records New activity records for CUDA synchronization constructs (ctx sync, CUDA event, ) UVM-lite counter profiling support on Mac Support for FP16 data Added containers to store the information of events and metrics in the form of activity records 38