XID ERRORS. vr384 October XID Errors

Similar documents
GRID SOFTWARE FOR MICROSOFT WINDOWS SERVER VERSION /370.12

GRID SOFTWARE FOR HUAWEI UVP VERSION /370.12

GRID SOFTWARE FOR RED HAT ENTERPRISE LINUX WITH KVM VERSION /370.28

GRID VIRTUAL GPU FOR HUAWEI UVP Version /

GRID VIRTUAL GPU FOR HUAWEI UVP Version ,

GRID SOFTWARE FOR HUAWEI UVP VERSION /370.28

RMA PROCESS. vr384 October RMA Process

NVIDIA nforce 790i SLI Chipsets

GPU LIBRARY ADVISOR. DA _v8.0 September Application Note

NVIDIA CAPTURE SDK 6.1 (WINDOWS)

HEALTHMON. vr418 March Best Practices and User Guide

QUADRO SYNC II FIRMWARE VERSION 2.02

DRIVER PERSISTENCE. vr384 October Driver Persistence

GPUMODESWITCH. DU June User Guide

NVIDIA CAPTURE SDK 7.1 (WINDOWS)

VIRTUAL GPU SOFTWARE R384 FOR MICROSOFT WINDOWS SERVER

HW FIELD DIAG. vr384 October HW Field Diag

GRID SOFTWARE MANAGEMENT SDK

NSIGHT ECLIPSE EDITION

GPUMODESWITCH. DU _v6.0 through 6.2 July User Guide

GPUMODESWITCH. DU April User Guide

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

VIRTUAL GPU SOFTWARE R384 FOR HUAWEI UVP

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

GRID SOFTWARE FOR VMWARE VSPHERE VERSION /370.12

NVIDIA CAPTURE SDK 6.0 (WINDOWS)

TESLA K20 GPU ACCELERATOR

GRID VIRTUAL GPU FOR CITRIX XENSERVER Version / ,

NSIGHT ECLIPSE PLUGINS INSTALLATION GUIDE

VIRTUAL GPU SOFTWARE R384 FOR MICROSOFT WINDOWS SERVER

VIRTUAL GPU SOFTWARE R384 FOR RED HAT ENTERPRISE LINUX WITH KVM

VIRTUAL GPU SOFTWARE MANAGEMENT SDK

GRID VIRTUAL GPU FOR CITRIX XENSERVER Version /

TEGRA LINUX DRIVER PACKAGE R23.2

GRID VIRTUAL GPU FOR CITRIX XENSERVER Version ,

GRID SOFTWARE FOR VMWARE VSPHERE VERSION /370.21

VIRTUAL GPU SOFTWARE R390 FOR RED HAT ENTERPRISE LINUX WITH KVM

NVIDIA CUDA INSTALLATION GUIDE FOR MAC OS X

GRID LICENSING. DU _v4.6 January User Guide

CUDA-MEMCHECK. DU _v03 February 17, User Manual

TESLA M2050 AND TESLA M2070/M2070Q DUAL-SLOT COMPUTING PROCESSOR MODULES

VIRTUAL GPU LICENSE SERVER VERSION

NVIDIA CUDA C GETTING STARTED GUIDE FOR MAC OS X

Android PerfHUD ES quick start guide

VIRTUAL GPU SOFTWARE R384 FOR HUAWEI UVP

GRID VGPU FOR VMWARE VSPHERE Version /356.60

Technical Brief. LinkBoost Technology Faster Clocks Out-of-the-Box. May 2006 TB _v01

SDK White Paper. Vertex Lighting Achieving fast lighting results

TESLA C2050 COMPUTING SYSTEM

PNY Technologies, Inc. 299 Webro Rd. Parsippany, NJ Tel: Fax:

VIRTUAL GPU SOFTWARE R390 FOR NUTANIX AHV

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

GRID VGPU FOR VMWARE VSPHERE Version /

NVWMI VERSION 2.24 STANDALONE PACKAGE

CUDA TOOLKIT 3.2 READINESS FOR CUDA APPLICATIONS

GRID VGPU FOR VMWARE VSPHERE Version /

VIRTUAL GPU SOFTWARE. QSG _v5.0 through 5.2 Revision 03 February Quick Start Guide

GRID VGPU FOR VMWARE VSPHERE Version /

Enthusiast System Architecture Certification Feature Requirements

NVWMI VERSION 2.18 STANDALONE PACKAGE

GRID VGPU FOR VMWARE VSPHERE Version /356.53

User Guide. GLExpert NVIDIA Performance Toolkit

GRID VGPU FOR VMWARE VSPHERE Version /

VIRTUAL GPU CLIENT LICENSING

NSIGHT ECLIPSE EDITION

GRID VIRTUAL GPU FOR CITRIX XENSERVER Version ,

Specification. Tesla S870 GPU Computing System. March 13, 2008 SP _v00b

NVIDIA Quadro K6000 SDI Reference Guide

Application Note. NVIDIA Business Platform System Builder Certification Guide. September 2005 DA _v01

GLExpert NVIDIA Performance Toolkit

Getting Started. NVIDIA CUDA C Installation and Verification on Mac OS X

NSIGHT ECLIPSE EDITION

TEGRA LINUX DRIVER PACKAGE R24.1

Getting Started. NVIDIA CUDA Development Tools 2.2 Installation and Verification on Mac OS X. May 2009 DU _v01

VIRTUAL GPU MANAGEMENT PACK FOR VMWARE VREALIZE OPERATIONS

PASCAL COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

TESLA K20X GPU ACCELERATOR

SDK White Paper. Matrix Palette Skinning An Example

MAXWELL COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

NSIGHT COMPUTE. v March Customization Guide

VIRTUAL GPU CLIENT LICENSING

KEPLER COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

TEGRA LINUX DRIVER PACKAGE R19.3

NVBLAS LIBRARY. DU _v6.0 February User Guide

Tuning CUDA Applications for Fermi. Version 1.2

VIRTUAL GPU CLIENT LICENSING

NVIDIA CUDA GETTING STARTED GUIDE FOR LINUX

Getting Started. NVIDIA CUDA Development Tools 2.3 Installation and Verification on Mac OS X

User Guide. NVIDIA Quadro FX 4700 X2 BY PNY Technologies Part No. VCQFX4700X2-PCIE-PB

NVIDIA CUDA GETTING STARTED GUIDE FOR LINUX

NVIDIA DEBUG MANAGER FOR ANDROID NDK - VERSION 8.0.1

User Guide. Vertex Texture Fetch Water

NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL)

Getting Started. NVIDIA CUDA Development Tools 2.2 Installation and Verification on Microsoft Windows XP and Windows Vista

VIRTUAL GPU SOFTWARE R390 FOR RED HAT ENTERPRISE LINUX WITH KVM

VIRTUAL GPU SOFTWARE R390 FOR LINUX WITH KVM

VIRTUAL GPU LICENSE SERVER VERSION AND 5.1.0

NVIDIA CUDA C INSTALLATION AND VERIFICATION ON

CUDA-GDB: The NVIDIA CUDA Debugger

Technical Brief. AGP 8X Evolving the Graphics Interface

Transcription:

ID ERRORS vr384 October 2017 ID Errors

Introduction... 1 1.1. What Is an id Message... 1 1.2. How to Use id Messages... 1 Working with id Errors... 2 2.1. Viewing id Error Messages... 2 2.2. Tools That Provide Additional Information About id Errors...2 2.3. Analyzing id Errors... 3 id Error Listing... 4 Common ID Errors... 8 4.1. ID 13: GR: SW Notify Error... 8 4.2. ID 31: Fifo: MMU Error... 8 4.3. ID 32: PBDMA Error... 9 4.4. ID 43: RESET CHANNEL VERIF ERROR... 9 4.5. ID 45: OS: Preemptive Channel Removal... 9 4.6. ID 48: DBE (Double Bit Error) ECC Error... 9

INTRODUCTION This document explains what id messages are, and is intended to assist system administrators, developers, and FAEs in understanding the meaning behind these messages as an aid in analyzing and resolving GPU-related problems. 1.1. What Is an id Message The id message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log. id messages indicate that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The messages can be indicative of a hardware problem, an NVIDIA software problem, or a user application problem. These messages provide diagnostic information that can be used by both users and NVIDIA to aid in debugging reported problems. The meaning of each message is consistent across driver versions. 1.2. How to Use id Messages id messages are intended to be used as debugging guides. Because many problems can have multiple possible root causes it s not always feasible to understand each issue from the id value alone. For example, an id error might indicate that a user program tried to access invalid memory. But, in theory, memory corruption due to PCIE or frame buffer ( FB ) problems could corrupt any command and thus cause almost any error. Generally, the id classifications listed below should be used as a starting point for further investigation of each problem.

WORKING WITH ID ERRORS 2.1. Viewing id Error Messages Under Linux, the id error messages are placed in the location /var/log/messages. Grep for NVRM: id to find all the id messages. The following is an example of a id string: [ ] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7 [ ] NVRM: id (0000:03:00): 14, Channel 00000001 The first id in the log file is preceded by a line that contains the GPU GUID and device IDs. In the above example, The GUID is a globally unique, immutable identifier for each GPU. Each subsequent id line contains the device ID, the id error, and information about the id. In the above example, 2.2. Tools That Provide Additional Information About id Errors NVIDIA provides two additional tools that may be helpful when dealing with id errors. nvidia-smi is a command-line program that installs with the NVIDIA driver. It reports basic monitoring and configuration data about each GPU in the system. nvidia-smi can list ECC error counts (id 48) and indicate if a power cable is unplugged (id 54), among other things. Please see the nvidia-smi man page for more info. Run nvidia-smi q for basic output. NVIDIA Validation Suite (NVVS) is a health checking tool that is provided as part of the GPU Deployment Kit, located at https://developer.nvidia.com/gpu-deployment-

kit. NVVS can check for basic GPU health, including the presence of ECC errors, PCIe problems, bandwidth issues, and general problems with running CUDA programs. NVVS documentation is included in the GPU Deployment Kit. 2.3. Analyzing id Errors The following table lists the recommended actions to take for various issues encountered. Issue Recommended Action Suspected User Programming Issues Run the debugger tools. See the cuda-memcheck and cuda-gdb docs at http:// docs.nvidia.com/cuda/index.html Suspected Hardware Problems Contact the hardware vendor. They can run through their hardware diagnostic process. Suspected Driver Problems File a bug with NVIDIA.

ID ERROR LISTING The following table lists the id errors along with the potential causes for each. ID Failure Causes HW Error User System Driver App Memory Bus Thermal FB Error Error Corruption Error Issue Corruption 1 2 3 4 GPU semaphore timeout 5 6 7 buffer address 8 GPU stopped processing 9 Driver error programming GPU 10 11 12 Driver error handling GPU exception 13 Graphics Engine Exception 14

ID Failure Causes HW Error User System Driver App Memory Bus Thermal FB Error Error Corruption Error Issue Corruption 15 16 Display engine hung 17 18 Bus mastering disabled in PCI Config Space 19 Display Engine error 20 Invalid or corrupted Mpeg push buffer 21 Invalid or corrupted Motion Estimation push buffer 22 Invalid or corrupted Video Processor push buffer 23 24 GPU semaphore timeout 25 Invalid or illegal push buffer stream 26 Framebuffer timeout 27 28 29 30 GPU semaphore access error 31 GPU memory page fault 32 33 Internal micro-controller error 34 35 36 37 Driver firmware error 38 Driver firmware error 39 40

ID Failure Causes HW Error User System Driver App Memory Bus Thermal FB Error Error Corruption Error Issue Corruption 41 42 43 GPU stopped processing 44 Graphics Engine fault during context switch 45 Preemptive cleanup, due to previous errors -- Most likely to see when running multiple cuda applications and hitting a DBE 46 GPU stopped processing 47 48 Double Bit ECC Error 49 50 51 52 53 54 Auxiliary power is not connected to the GPU board 55 56 Display Engine error 57 Error programming video memory interface 58 Unstable video memory interface detected EDC error clarified in printout 59 Internal micro-controller error (older drivers) 60 61 Internal micro-controller breakpoint/warning (newer drivers)

ID Failure 62 Internal micro-controller halt Causes HW Error User System Driver App Memory Bus Thermal FB Error Error Corruption Error Issue Corruption (newer drivers) 63 ECC page retirement recording event 64 ECC page retirement recording failure 65 66 Illegal access by driver 67 Illegal access by driver 68 69 Graphics Engine class error 70 CE3: Unknown Error 71 CE4: Unknown Error 72 CE5: Unknown Error 73 NVENC2 Error 74 NVLINK Error 75 Reserved 76 Reserved 77 Reserved 78 vgpu Start Error 79 GPU has fallen off the bus 80 Corrupted data sent to GPU 81 VGA Subsystem Error

COMMON ID ERRORS This section provides more information on some common id errors. 4.1. ID 13: GR: SW Notify Error This event is logged for general user application faults. Typically this is an out-ofbounds error where the user has walked past the end of an array, but could also be an illegal instruction, illegal register, or other case. In rare cases, it s possible for a hardware failure or system software bugs to materialize as ID 13. When this event is logged, NVIDIA recommends the following: Run the application in cuda-gdb or cuda-memcheck, or Run the application with CUDA_DEVICE_WAITS_ON_ECEPTION=1 and then attach later with cuda-gdb, or File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug. Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read. 4.2. ID 31: Fifo: MMU Error This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip Typically these are applicationlevel bugs, but can also be driver bugs or hardware bugs. When this event is logged, NVIDIA recommends the following: Run the application in cuda-gdb or cuda-memcheck, or Run the application with CUDA_DEVICE_WAITS_ON_ECEPTION=1 and then attach later with cuda-gdb, or File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.

Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read. 4.3. ID 32: PBDMA Error This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions. 4.4. ID 43: RESET CHANNEL VERIF ERROR This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state. In most cases, this is not indicative of a driver bug but rather a user application error. 4.5. ID 45: OS: Preemptive Channel Removal This event is logged when the user application aborts and the kernel driver tears down the GPU application running on the GPU. Control-C, GPU resets, sigkill are all examples where the application is aborted and this event is created. In many cases, this is not indicative of a bug but rather a user or system action. 4.6. ID 48: DBE (Double Bit Error) ECC Error This event is logged when the GPU detects that an uncorrectable error occurs on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error. The tool nvidia-smi can provide a summary of ECC errors. See Tools That Provide Additional Information About id Errors.

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation. NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. 2013-2017 NVIDIA Corporation. All rights reserved.