NVIDIA AFTERMATH: A NEW WAY OF DEBUGGING CRASHES ON THE GPU. Alex Dunn, 2nd March 2017

NVIDIA AFTERMATH
What is it?
- New tool to diagnose GPU crashes, available on GeForce!
- Coming to D3D for broad availability
- Ability to classify GPU crashes by location and type
- Can be shipped in game: catch crashes from the wild

GPU CRASH?
- a.k.a. TDR / Hang / Device Removed / Crash / ?
- Annoying
- What can we do?

GPU DEBUGGING 101
1st line of defense: MSFT Debug Layer
2nd line of defense: MSFT GPU-Based Validation
- Preventative
- Changes timing
- Development-use only
- Limited coverage
Final line of defense:
- Catches issues that fall through
- Minimal impact
- Shippable

OBSERVATION
- The current state of the art in GPU crash debugging isn't enough
- There's no simple way to debug crashes after the fact
- Some bugs take months to resolve (really!!)
[Diagram labels: End-User, Dev., NVIDIA]

DETECTING GPU CRASH #2
1. Crash detected based on error code from API (CPU)
2. Crash happened sometime in the last N frames of GPU commands
3. CPU call stack is likely a red herring
[Diagram: frame timeline with labels "CPU Location", "0", and "GPU Crash"]
Not useful for debugging!

POC IMPLEMENTATION
KO: Increase accuracy of GPU crash location
Plan:
- Game inserts user-defined markers in the command stream
- GPU signals each marker once reached
- Last marker reached indicates GPU crash location
[Diagram: frame timeline with labels "CPU Location", "F n", and "GPU Crash"]
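
The plan above can be modelled on the CPU to see why the last marker reached localizes the crash. This is a minimal sketch, assuming a fake GPU that executes commands strictly in order; Command, ExecuteOnFakeGpu, and BuildFrame are hypothetical names for illustration, and no D3D12 or Aftermath calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry in a GPU command stream.
struct Command {
    std::string marker;  // non-empty: a user-defined marker; empty: ordinary GPU work
    bool hangs;          // true: simulate the GPU hanging on this command
};

// Hypothetical stand-in for the GPU: executes commands in order, records each
// marker it reaches in lastMarker, and stops at the first command that hangs.
// Returns true if the whole stream completed.
bool ExecuteOnFakeGpu(const std::vector<Command>& stream, std::string& lastMarker) {
    for (const Command& cmd : stream) {
        if (cmd.hangs) return false;  // nothing after this point ever runs
        if (!cmd.marker.empty()) lastMarker = cmd.marker;
    }
    return true;
}

// Builds a small frame: each pass is a marker followed by one unit of work.
// hangingPass selects the pass whose work hangs (out of range: no hang).
std::vector<Command> BuildFrame(const std::vector<std::string>& passes,
                                std::size_t hangingPass) {
    std::vector<Command> stream;
    for (std::size_t i = 0; i < passes.size(); ++i) {
        assert(!passes[i].empty() && "marker names must be non-empty");
        stream.push_back(Command{passes[i], false});
        stream.push_back(Command{std::string(), i == hangingPass});
    }
    return stream;
}
```

With passes "gbuffer", "lighting", "post" and a hang in pass 1, ExecuteOnFakeGpu returns false and leaves lastMarker at "lighting": the crash is narrowed down to the work submitted after that marker, independent of where the CPU happened to be when the error surfaced.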

POC IMPLEMENTATION #2
- Implemented exclusively via DX12
- CopyBufferRegion inserts markers on GPU timeline
- Write to single memory location per queue
- Globally shared heap: post-crash accessible data

    void SetMarker(char* markerName)
    {
        // Advance to the next slot of the upload ring buffer.
        renamingOffset = (renamingOffset + kMarkerSize) % kRingBufferSize;
        const size_t markerBytes = min(kMarkerSize, strlen(markerName));
        const D3D12_RANGE readRange = { 0, 0 };  // the CPU reads nothing back
        const D3D12_RANGE writeRange = { renamingOffset, renamingOffset + markerBytes };

        void* mappedDataBegin = nullptr;
        uploadHeap->Map(0, &readRange, &mappedDataBegin);
        memcpy((char*)mappedDataBegin + writeRange.Begin, markerName, markerBytes);
        uploadHeap->Unmap(0, &writeRange);

        // GPU-timeline copy of the marker into the single shared location.
        commandList->CopyBufferRegion(sharedHeap, 0, uploadHeap, writeRange.Begin, markerBytes);
    }
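
The ring-buffer offset arithmetic in SetMarker can be tried without a D3D12 device. The sketch below is a CPU-only stand-in, not the POC itself: MarkerRing and MarkerRange are hypothetical names, the values of kMarkerSize and kRingBufferSize are assumptions (the slide names the constants but not their values), and a std::vector replaces the upload heap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed sizes. kRingBufferSize is a multiple of kMarkerSize so that a slot
// never straddles the end of the buffer.
constexpr std::size_t kMarkerSize = 32;
constexpr std::size_t kRingBufferSize = 4 * kMarkerSize;

// Byte range [begin, end) occupied by one marker.
struct MarkerRange {
    std::size_t begin;
    std::size_t end;
};

// CPU-side stand-in for the upload heap used to stage marker strings.
class MarkerRing {
public:
    MarkerRing() : storage_(kRingBufferSize, '\0'), offset_(0) {}

    // Advances to the next slot and copies the name into it, truncated to
    // kMarkerSize - 1 characters plus a terminator. Returns the range written.
    MarkerRange Write(const char* name) {
        assert(name != nullptr);
        offset_ = (offset_ + kMarkerSize) % kRingBufferSize;
        const std::size_t length = std::min(kMarkerSize - 1, std::strlen(name));
        std::memcpy(storage_.data() + offset_, name, length);
        storage_[offset_ + length] = '\0';
        return MarkerRange{offset_, offset_ + length + 1};
    }

    // Read access to the staged bytes at a given offset.
    const char* At(std::size_t offset) const { return storage_.data() + offset; }

private:
    std::vector<char> storage_;
    std::size_t offset_;
};
```

Giving each marker its own slot means a later CPU write does not overwrite bytes that an earlier, still-pending GPU copy has yet to read, at least until the ring wraps. The number of bytes copied is end - begin, not end.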

POC IMPLEMENTATION #3
How it looks in practice:
[Diagram: frame timeline with labels "CPU Location" and "F n"]
sharedHeap contents:
- beginframe
- begingbufferfill
- begindrawopaque
- enddrawopaque
- begindrawdecals
- enddrawdecals
- endgbufferfill
- begindrawtransparent
- enddrawtransparent
- beginlighting
- tiledlighting_cull
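
A marker trail like the one above can be turned into a list of render regions that were still open when the GPU stopped. This is a sketch under stated assumptions: the game keeps its own CPU-side list of markers in submission order, and markers prefixed "begin" and "end" pair up as the names on the slide suggest. OpenRegionsAt is a hypothetical helper, not part of the POC or of Aftermath.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the regions still open once every marker up to and including
// lastReached has executed. Returns an empty list if lastReached was never
// submitted.
std::vector<std::string> OpenRegionsAt(const std::vector<std::string>& submitted,
                                       const std::string& lastReached) {
    std::vector<std::string> open;
    for (const std::string& marker : submitted) {
        if (marker.compare(0, 5, "begin") == 0) {
            open.push_back(marker.substr(5));  // "beginlighting" opens "lighting"
        } else if (marker.compare(0, 3, "end") == 0) {
            // An "end" marker must close the most recently opened region.
            assert(!open.empty() && open.back() == marker.substr(3));
            open.pop_back();
        }
        if (marker == lastReached) return open;
    }
    return std::vector<std::string>();
}
```

For the markers listed on the slide, with tiledlighting_cull as the last marker reached, the open regions are "frame" and "lighting": the hang occurred inside the lighting pass, at or after the tiled light culling step.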

POC IMPLEMENTATION #4
Case Study: IO Interactive
- IO Interactive was facing a very stubborn GPU crash
- Issue was open for >2 months and was the main focus of weekly meetings with NVIDIA
- With the POC, the issue was identified and fixed in a single afternoon
- "This tool is excellent"
Conclusion: Discovering where a hang occurs in the GPU timeline is valuable & actionable

POC (MINI) POST-MORTEM
Pros:
- Simple API: simple to integrate
- Enabled classification of GPU crashes
- Insight into where GPU crashes occur
Cons:
- GPU copies are super slow for this purpose
- Timing-related behavior altered
- Separate process for marker read-back
- Serializes order of GPU work (wait-for-idle)
- Only supports DX12: DX11 driver too smart

MOVING FORWARD (AFTERMATH)
And so, Aftermath was born
- Take all the Pros, leave the Cons; polish and improve from there
- Make available in C++ library form
Key differences from the POC:
- Marker insertion uses low-level HW features inside driver
- GPU crash reason provided, { timeout, page-fault, ... }

GAME INTEGRATION #1
Before other library calls are made:

    GFSDK_Aftermath_DXxx_Initialize(...)

NB. Must return GFSDK_Aftermath_Result_Success

GAME INTEGRATION #2
To inject an event:

    GFSDK_Aftermath_DXxx_SetEventMarker(T*, void*, UINT)

GAME INTEGRATION #3
On a TDR/hang:

    GFSDK_Aftermath_DXxx_GetData(...)

- Fetches the last GPU-processed event marker
- Can also fetch the execution state for each GPU!

GAME INTEGRATION #4

    enum GFSDK_Aftermath_Status
    {
        GFSDK_Aftermath_Status_Active = 0,
        GFSDK_Aftermath_Status_Timeout,
        GFSDK_Aftermath_Status_OutOfMemory,
        GFSDK_Aftermath_Status_PageFault,
        GFSDK_Aftermath_Status_Unknown,
    };
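
One way a game might use this status is to bucket crash reports by reason. The sketch below is illustrative only: the enum is copied from the slide so the example compiles on its own (real code would include the header shipped with the Aftermath package), and IsFault and DescribeStatus are hypothetical helpers whose wording is an assumption based purely on the enumerator names.

```cpp
#include <string>

// Copied from the slide for illustration; not the shipped header.
enum GFSDK_Aftermath_Status {
    GFSDK_Aftermath_Status_Active = 0,
    GFSDK_Aftermath_Status_Timeout,
    GFSDK_Aftermath_Status_OutOfMemory,
    GFSDK_Aftermath_Status_PageFault,
    GFSDK_Aftermath_Status_Unknown,
};

// Hypothetical helper: true for every status other than Active.
bool IsFault(GFSDK_Aftermath_Status status) {
    return status != GFSDK_Aftermath_Status_Active;
}

// Hypothetical helper: short label suitable for a crash-report bucket name.
std::string DescribeStatus(GFSDK_Aftermath_Status status) {
    switch (status) {
        case GFSDK_Aftermath_Status_Active:      return "active";
        case GFSDK_Aftermath_Status_Timeout:     return "timeout";
        case GFSDK_Aftermath_Status_OutOfMemory: return "out-of-memory";
        case GFSDK_Aftermath_Status_PageFault:   return "page-fault";
        case GFSDK_Aftermath_Status_Unknown:     return "unknown";
    }
    return "unrecognized";  // value outside the enumerators listed above
}
```

Combining the label with the last marker reached, for example "page-fault" plus "tiledlighting_cull", gives the classification by type and location that the deck describes.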

HOW TO ENABLE YOUR GAME*?
1. Grab the Aftermath package (available on next driver posting) from: https://developer.nvidia.com/nvidia-aftermath
2. Integrate header + DLL into game compile
3. Rename executable to: NvAftermath-Enable.exe
*(to ship in game, contact us)

WORKFLOW - TIPS
Emit regime name as marker:

    extern ID3D12GraphicsCommandList* const m_commandList;
    extern char* m_marker;
    GFSDK_Dx12_SetEventMarker(m_commandList, (void*)m_marker, strlen(m_marker)+1);

Track currently bound PSO?:

    extern ID3D12GraphicsCommandList* const m_commandList;
    extern ID3D12PipelineState* const m_desiredPSO;
    m_commandList->SetPipelineState(m_desiredPSO);
    GFSDK_Dx12_SetEventMarker(m_commandList, (void*)m_desiredPSO, 0);

Emit CPU backtrace on every/any API call:

    extern ID3D12GraphicsCommandList* const m_commandList;
    PVOID stackPtrs[16] = { 0 };
    CaptureStackBackTrace(1, 16, stackPtrs, NULL);
    GFSDK_Dx12_SetEventMarker(m_commandList, &stackPtrs[0], sizeof(stackPtrs));

ROADMAP
What's next? (proposals)
- Expand API support
- Push/Pop marker style
- Page-fault? Supply resource identified!?
(feel free to make requests during questions)
NVIDIA working with Microsoft to develop an industry standard

QUESTIONS?
Thank you! \0

Ref.
1. https://msdn.microsoft.com/en-gb/windows/uwp/gaming/handling-device-lost-scenarios
2. https://msdn.microsoft.com/en-us/library/windows/desktop/bb509553(v=vs.85).aspx
3. http://nvidia.custhelp.com/app/answers/detail/a_id/3335/~/tdr-(timeout-detection-and-recovery)-and-collecting-dump-files
4. https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#devsandqueues-lost-device
5. https://developer.nvidia.com/nvidia-aftermath