RMA PROCESS
vr384, October 2017

Introduction
Tools and Diagnostics
  2.1. nvidia-bug-report
  2.2. nvidia-healthmon
  2.3. NVIDIA Field Diagnostic
Common System Level Issues
RMA Checklist and Flowchart
RMA Process Flow

INTRODUCTION

NVIDIA is committed to providing the highest level of quality, reliability, and support for the enterprise datacenter-class NVIDIA Tesla graphics processing unit (GPU) products. To that end, NVIDIA is focused on two primary goals with the Tesla RMA submission process:

- Expeditious replacement of returned Tesla GPU products
- Comprehensive understanding of the customer-observed issue and failure to allow for:
  - NVIDIA replication and confirmation of the failure
  - Root-cause analysis of the failure aimed at continuous improvement of the product and future Tesla offerings

NVIDIA has provided this guide to ensure that the RMA requestor is able to provide the information necessary to meet these goals with each RMA request, best ensuring that such requests are quickly approved and processed.

TOOLS AND DIAGNOSTICS

NVIDIA provides a few tools to help diagnose issues and failures observed with Tesla GPU products. These tools are:

- nvidia-bug-report
- nvidia-healthmon
- NVIDIA Field Diagnostic

2.1. nvidia-bug-report

nvidia-bug-report.sh is a shell script included with the NVIDIA Linux driver that gathers system data that is highly valuable for understanding any reported field issue. This data includes lspci output, system message log files, and nvidia-smi information. The script is installed with the NVIDIA driver at /usr/bin/nvidia-bug-report.sh.

Running nvidia-bug-report.sh produces an output file, nvidia-bug-report.log.gz, in the current working directory. Ideally, nvidia-bug-report.sh should be run immediately after an issue is observed, so that it collects the most recent information about the failure. If the script hangs or does not create a complete report, power cycle the machine, save the file that was generated, and run nvidia-bug-report.sh one more time after the power cycle to complete the log. Both logs should be sent to NVIDIA as part of any RMA submission.

To run nvidia-bug-report on Linux systems:

1. Log in as root.
2. At the command line, type:

   # nvidia-bug-report.sh

nvidia-bug-report.sh then collects information about the system and creates the file nvidia-bug-report.log.gz in the current directory.

Note: This file should be included with any RMA request. Failure to include this log file may result in delays to the processing of the RMA request. For more information, see the section titled RMA Checklist and Flowchart.
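Because a second run of nvidia-bug-report.sh writes to the same file name in the same directory, the first log can be overwritten before both logs are sent. The following is a minimal sketch of one way to keep each run under a unique name. The function name save_bug_report, the BUG_REPORT_CMD override, and the destination directory are assumptions made for illustration and are not part of the NVIDIA tooling.

```shell
#!/bin/sh
# Sketch: keep every nvidia-bug-report log by giving each run a unique name.
# save_bug_report and BUG_REPORT_CMD are hypothetical names for illustration.

save_bug_report() {
    dest_dir=$1
    cmd=${BUG_REPORT_CMD:-nvidia-bug-report.sh}
    stamp=$(date +%Y%m%d-%H%M%S)

    mkdir -p "$dest_dir" || return 1

    # The script writes nvidia-bug-report.log.gz into the current directory.
    # A failed run can still leave a partial log, so keep going on error.
    "$cmd" || echo "warning: $cmd exited with a non-zero status" >&2

    if [ -f nvidia-bug-report.log.gz ]; then
        target="$dest_dir/nvidia-bug-report-$stamp.log.gz"
        mv nvidia-bug-report.log.gz "$target" || return 1
        echo "$target"
    else
        echo "error: no nvidia-bug-report.log.gz was produced" >&2
        return 1
    fi
}

# Usage idea (as root), once after the failure and once after a power cycle:
#   save_bug_report /var/tmp/rma-logs
```

The function prints the path of the saved log, so both the original and the post-power-cycle report remain available for the RMA submission. It cannot recover from a run that hangs; in that case the partial file still has to be saved by hand after the power cycle.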

2.2. nvidia-healthmon

nvidia-healthmon detects and troubleshoots common problems affecting Tesla GPUs in a high performance computing environment. nvidia-healthmon contains limited hardware diagnostic capabilities and instead focuses on software and system configuration issues.

nvidia-healthmon is designed to discover common problems that affect a GPU's ability to run a compute job, including:

- Software configuration issues
- System configuration issues
- System assembly issues, like loose cables
- A limited number of hardware issues

To run nvidia-healthmon from the command line with default behavior on all supported GPUs:

   user@hostname$ nvidia-healthmon

nvidia-healthmon terminates once it completes the execution of diagnostics on all specified devices. An exit code of zero is used when nvidia-healthmon runs successfully. A non-zero exit code indicates that there was a problem with the nvidia-healthmon run. The output of the application must be read to determine the exact problem.

nvidia-healthmon's output may include a troubleshooting report designed to address common problems, and will often suggest a number of possible solutions. These troubleshooting steps should be undertaken from the top down, as the most likely solution is listed at the top.

For more details, command-line arguments, configuration options, and instructions for interpreting the results of the tool, refer to the nvidia-healthmon User Guide.

2.3. NVIDIA Field Diagnostic

The NVIDIA Field Diagnostic is a comprehensive Linux-based hardware diagnostic tool that provides confirmation of the numerical processing engines in the GPU, integrity of data transfers to and from the GPU, and test coverage of the full onboard memory address space that is available to NVIDIA CUDA programs.

In the event that a software or system configuration issue cannot be identified (for example, by nvidia-healthmon) and resolved, the NVIDIA Field Diagnostic should be run to determine whether the Tesla GPU may be faulty.

The NVIDIA Field Diagnostic can be run with the command:

   ./fieldiag

Note: NVIDIA Tesla GPU products have ECC memory protection enabled by default. The NVIDIA Field Diagnostic runs only on boards that have ECC enabled. If the user has previously disabled ECC on a suspect board, ECC must be re-enabled prior to running the NVIDIA Field Diagnostic on that board. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.
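Since the diagnostic and RMA eligibility both depend on ECC being enabled, it can help to check the ECC state before running the diagnostic. The sketch below lists the GPUs whose current ECC mode is anything other than "Enabled", using the nvidia-smi query interface. The function name gpus_without_ecc and the NVIDIA_SMI override are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: list GPUs whose current ECC mode is not "Enabled".
# gpus_without_ecc and NVIDIA_SMI are hypothetical names for illustration.

gpus_without_ecc() {
    smi=${NVIDIA_SMI:-nvidia-smi}
    # Each CSV line is expected to look like: "0, Enabled"
    "$smi" --query-gpu=index,ecc.mode.current --format=csv,noheader |
        awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Usage idea:
#   bad=$(gpus_without_ecc)
#   [ -z "$bad" ] || echo "ECC is not enabled on GPU index(es): $bad"
```

If a board is reported, ECC can usually be turned back on with nvidia-smi -e 1 run as root; the new mode typically takes effect only after a reboot, so check the nvidia-smi documentation for the driver in use before relying on this.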

For more details or product-specific command-line arguments, refer to the NVIDIA Field Diagnostic Quick Start Guide (DU-05711-001) and the NVIDIA Field Diagnostic Software Guide (DU-05363-001) included in the NVIDIA Field Diagnostic software package.

Upon completion of the diagnostic, a fieldiag.log file is generated.

Note: This file should be included with any RMA request. Failure to include this log file may result in delays to the processing of the RMA request. For more information, see the section titled RMA Checklist and Flowchart.

A passing result from the NVIDIA Field Diagnostic indicates that the NVIDIA Tesla GPU hardware is in good condition and points to a potential software or application-level issue.

Note: In the event that the NVIDIA Field Diagnostic returns a passing result, NVIDIA requests that data be provided illustrating that the failure follows the particular NVIDIA Tesla GPU board, along with details of the observed failures. Having this data will better allow NVIDIA to reproduce the issue and resolve any potential test weakness in the existing diagnostics.
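The exit-code convention described for nvidia-healthmon in section 2.2 lends itself to a small wrapper that keeps the tool's output for later reading. The following is a minimal sketch; the function name run_healthmon, the HEALTHMON_CMD override, and the default log file name are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: run nvidia-healthmon, keep its output, and report the exit code.
# run_healthmon, HEALTHMON_CMD and the log name are hypothetical.

run_healthmon() {
    log_file=${1:-healthmon-output.txt}
    cmd=${HEALTHMON_CMD:-nvidia-healthmon}

    # Capture stdout and stderr so the troubleshooting report is preserved.
    status=0
    "$cmd" > "$log_file" 2>&1 || status=$?

    if [ "$status" -eq 0 ]; then
        echo "nvidia-healthmon exited with 0; output saved to $log_file"
    else
        echo "nvidia-healthmon exited with $status; read $log_file for details"
    fi
    return "$status"
}

# Usage idea:
#   run_healthmon /var/tmp/rma-logs/healthmon.txt
```

The wrapper deliberately does not interpret the report. As noted in section 2.2, the output itself has to be read to determine the exact problem, and the suggested solutions should be tried from the top down.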

COMMON SYSTEM LEVEL ISSUES

Depending on the type and severity of the observed issue, it may not be possible to run nvidia-bug-report, nvidia-healthmon, or the NVIDIA Field Diagnostic. To better ensure that the failure is attributable to the Tesla GPU rather than to a system-level issue, and to avoid any resulting delays to the processing of the RMA request, NVIDIA recommends that the following steps be taken to further isolate the cause of the failure.

1. In addition to the power provided by the PCIe slot connector, Tesla GPU boards also require additional power from the host system. Ensure that the appropriate PCIe 8-pin and/or 6-pin auxiliary power cables are properly connected to the board. Consult the product specifications for the specific Tesla GPU in use to determine the auxiliary power requirements for that particular product.
2. Physically remove the Tesla GPU board from the system and reinstall it to ensure that it is fully seated in the PCIe slot.
3. If available, replace the suspect Tesla GPU with a known good board to confirm that the observed issue or failure does not occur with the replacement.
4. If possible, install the suspect Tesla GPU in a different system to determine whether the observed issue or failure follows the board (or system).

Note: The RMA submission process will request information demonstrating that common system-level causes have been eliminated. Submitting the RMA with the information described in Step 1 through Step 4, indicating that system-level issues were eliminated, will help to accelerate the RMA approval process.
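After reseating or swapping a board, one quick software-side check is whether the device is visible on the PCI bus at all. The sketch below counts the PCI functions that report NVIDIA's vendor ID (10de) using lspci. The function name count_nvidia_devices and the LSPCI_CMD override are assumptions made for illustration, and this check is not part of the NVIDIA procedure above.

```shell
#!/bin/sh
# Sketch: count NVIDIA PCI functions visible after reseating a board.
# count_nvidia_devices and LSPCI_CMD are hypothetical names.

count_nvidia_devices() {
    lspci_cmd=${LSPCI_CMD:-lspci}
    # -d 10de: restricts the listing to PCI vendor ID 10de (NVIDIA).
    # This counts every NVIDIA PCI function, not only Tesla GPUs.
    "$lspci_cmd" -d 10de: | awk 'END { print NR }'
}

# Usage idea: compare against the number of boards installed.
#   [ "$(count_nvidia_devices)" -eq 4 ] || echo "a board may not be detected"
```

A lower count than expected suggests a seating, power, or slot problem worth recording in the RMA request; a matching count only shows that the board enumerates, not that it is healthy.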

RMA CHECKLIST AND FLOWCHART

Table 1. RMA Checklist

Check Off   Item
[ ]         nvidia-bug-report log file (nvidia-bug-report.log.gz)
[ ]         NVIDIA Field Diagnostic log file (fieldiag.log)

In the event the NVIDIA Field Diagnostic returns a passing result or that the observed failure is such that the NVIDIA tools and diagnostics cannot be run, the following information should be included with the RMA request:

Check Off   Item
[ ]         Steps taken to eliminate common system-level causes
            - Check PCIe auxiliary power connections
            - Verify board seating in PCIe slot
            - Determine whether the failure follows the board or the system
[ ]         Details of the observed failure:
            - The application running at the time of failure
            - Description of how the product failed
            - Step-by-step instructions to reproduce the issue
            - Frequency of the failure
[ ]         Is there any known or obvious physical damage to the board?

Submit the RMA request at http://portal.nvidia.com

Note: NVIDIA Tesla GPU products have ECC memory protection enabled by default. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.
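The two log-file items in the checklist can be verified mechanically before submitting. The following is a minimal sketch that checks a directory for non-empty copies of both files. The function name check_rma_logs and the idea of gathering the logs into one directory are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: confirm the two checklist log files are present and non-empty.
# check_rma_logs and the single-directory layout are hypothetical.

check_rma_logs() {
    dir=${1:-.}
    missing=0
    for f in nvidia-bug-report.log.gz fieldiag.log; do
        if [ -s "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # The return value is the number of missing files (0 means complete).
    return "$missing"
}

# Usage idea:
#   check_rma_logs /var/tmp/rma-logs || echo "add the written failure details"
```

A missing file does not block a request on its own terms: when the tools cannot be run, the checklist asks for the written isolation steps and failure details instead, which a script cannot check.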

RMA PROCESS FLOW

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

© 2013-2017 NVIDIA Corporation. All rights reserved.