DGX SOFTWARE WITH RED HAT ENTERPRISE LINUX 7

Installation Guide
RN-09301-001 _v02, January 2019

TABLE OF CONTENTS

Chapter 1. Introduction
   1.1. Related Documentation
   1.2. Prerequisites
      1.2.1. Red Hat Subscription
      1.2.2. Access to Repositories
         1.2.2.1. NVIDIA Repositories
         1.2.2.2. Red Hat Repositories
      1.2.3. Network File System
      1.2.4. BMC Password
Chapter 2. Installing Red Hat Enterprise Linux 7
   2.1. Obtaining Red Hat Enterprise Linux 7
   2.2. Booting Red Hat Enterprise Linux 7 ISO Locally
   2.3. Booting Red Hat Enterprise Linux 7 ISO Remotely
   2.4. Installing Red Hat Enterprise Linux
Chapter 3. Installing the DGX Software
   3.1. Configuring a System Proxy
   3.2. Enabling the DGX Software Repository
   3.3. Installing Required Components
   3.4. Installing Diagnostic Components
      3.4.1. NVIDIA System Management Tools
   3.5. Installing Optional Components
Chapter 4. Running Containers
Chapter 5. Configuring Storage - NFS Mount and Cache
Appendix A. Installing Software on Air-Gapped NVIDIA DGX-1 Systems
   A.1. Registering Your System
   A.2. Creating a Local Mirror of the NVIDIA Repository
   A.3. Installing Docker Containers
Appendix B. Changing the BMC Login
Appendix C. Installing Mellanox InfiniBand Drivers

Chapter 1. INTRODUCTION

The NVIDIA DGX-1 server is shipped with DGX OS, which incorporates the NVIDIA DGX software stack built upon the Ubuntu Linux distribution. Instead of running the Ubuntu distribution, you can run Red Hat Enterprise Linux or CentOS on the DGX-1 and take advantage of the advanced features provided by the DGX-1.

This document explains how to install and configure the NVIDIA DGX software stack on the DGX-1 using Red Hat Enterprise Linux 7.

While it may be possible to use other derived Linux distributions besides Red Hat Enterprise Linux 7, not all have been tested and qualified by NVIDIA. Refer to the DGX Software for Red Hat Enterprise Linux 7 Release Notes for the list of tested and qualified software and Linux distributions.

1.1. Related Documentation

   NVIDIA DGX Software for Red Hat Enterprise Linux - Release Notes
   NVIDIA DGX-1 User Guide

1.2. Prerequisites

The following are required (or recommended where indicated).

1.2.1. Red Hat Subscription

You need a Red Hat subscription if you plan to install and use Red Hat Enterprise Linux 7 on the DGX-1. A subscription also lets you obtain update packages and additional packages for Red Hat Enterprise Linux. You can either purchase a subscription or obtain a free evaluation subscription from the Red Hat Software & Download Center.

1.2.2. Access to Repositories

The repositories can be accessed from the internet. If your installation does not allow connection to the internet, see the section Installing Software on Air-Gapped NVIDIA DGX-1 Systems for information about updating software on air-gapped systems.

1.2.2.1. NVIDIA Repositories

   NVIDIA DGX Software Repository
      Instructions for enabling the NVIDIA DGX software repository on the DGX-1 system can be found in the latest announcement for supporting Red Hat Enterprise Linux on DGX on the NVIDIA Enterprise Support portal. Perform this step after installing Red Hat Enterprise Linux on the DGX-1.

1.2.2.2. Red Hat Repositories

Installation of the DGX Software over Red Hat Enterprise Linux 7 requires access to several additional repositories.

   Red Hat Enterprise Server Extras Repository: rhel-7-server-extras-rpms
      Required for container support.
   Red Hat Enterprise Server Optional Repository: rhel-7-server-optional-rpms
      Required by NVIDIA System Manager (NVSM).
   Red Hat Software Collections Repository: rhel-server-rhscl-7-rpms
      Required by the NVSM tool for Python 3. If you do not have access to the Red Hat Software Collections repository, refer to https://access.redhat.com/solutions/472793 for instructions on requesting access for free.

1.2.3. Network File System

A network file system (NFS) is recommended to take advantage of the cache file system provided by the DGX-1 software stack.

1.2.4. BMC Password

The DGX-1 BMC comes with default login credentials. NVIDIA recommends creating a unique user ID and password. Refer to Appendix B: Changing the BMC Login for instructions.
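As a quick check on the Red Hat repository prerequisites listed above, the following sketch reports which of the three repositories are not yet enabled. It assumes the input is the output of `subscription-manager repos --list-enabled` and that each enabled repository appears on a line of the form `Repo ID: <name>`; the helper name `missing_repos` is hypothetical and not part of the DGX software.

```shell
#!/usr/bin/env bash
# Sketch: report required Red Hat repositories that are not enabled.
# Assumption: stdin holds text in the style of
#   subscription-manager repos --list-enabled
# where each repository is listed on a line like "Repo ID:   <name>".

REQUIRED_REPOS="rhel-7-server-extras-rpms rhel-7-server-optional-rpms rhel-server-rhscl-7-rpms"

# missing_repos (hypothetical helper): prints one missing repo ID per line.
missing_repos() {
  local enabled repo
  # Collect the value that follows "Repo ID:" on each matching line.
  enabled=$(awk -F': *' '/^Repo ID:/ { sub(/[ \t]+$/, "", $2); print $2 }')
  for repo in $REQUIRED_REPOS; do
    # -x: whole-line match, -F: treat the repo ID as a fixed string.
    if ! printf '%s\n' "$enabled" | grep -qxF -- "$repo"; then
      printf '%s\n' "$repo"
    fi
  done
}
```

On a registered system this could be used as `sudo subscription-manager repos --list-enabled | missing_repos`; empty output would mean that all three repositories are enabled.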

Chapter 2. INSTALLING RED HAT ENTERPRISE LINUX 7

Red Hat provides several methods for installing Red Hat Enterprise Linux as described in the Red Hat Enterprise Linux Installation Guide. See the DGX Software for Red Hat Enterprise Linux Release Notes for the version of Red Hat Enterprise Linux 7 that is qualified and tested for use with the DGX Software.

For convenience, this section describes how to install Red Hat Enterprise Linux using the Quick Install method, and shows when to reclaim disk space in the process. It describes a minimal installation. If you have a preferred method for installing Red Hat Enterprise Linux, then you can skip this section, but be sure to reclaim the disk space occupied by the existing Ubuntu installation.

The interactive method described here installs Red Hat Enterprise Linux on the DGX-1 either locally, using a connected monitor and keyboard and a USB stick with the ISO image, or remotely, through the remote console of the BMC.

2.1. Obtaining Red Hat Enterprise Linux 7

Obtain the Red Hat Enterprise Linux 7 ISO image and store it on your local disk or create a boot USB drive. See Downloading Red Hat Enterprise Linux for instructions.

2.2. Booting Red Hat Enterprise Linux 7 ISO Locally

1. Plug the USB flash drive containing the Red Hat Enterprise Linux 7 ISO image into the DGX-1.
2. Connect a monitor and keyboard directly to the DGX-1.
3. Boot the system and press F11 when the NVIDIA logo appears to get to the boot menu.
4. Select the USB volume name that corresponds to the inserted USB flash drive, and boot the system from it.
5. Follow the instructions at Installing Red Hat Enterprise Linux.

2.3. Booting Red Hat Enterprise Linux 7 ISO Remotely

Skip this section if you are using a monitor and keyboard for installing locally.

1. Connect to the BMC and change user privileges.
   a) Open a Java-enabled web browser within your LAN and go to http://<ipmi-ip-address>/, then log in.
      Use Firefox or Internet Explorer. Google Chrome is not officially supported by the BMC.
   b) From the top menu, click Configuration and then select User Management.
   c) Select the user name that you created for the BMC, then click Modify User.
   d) In the Modify User dialog, select the VMedia checkbox to add it to the extended privileges for the user, then click Modify.
2. Set up the ISO image as virtual media and reboot the system.
   a) From the top menu, click Remote Control and select Console Redirection.
   b) Click Java Console to open the remote JViewer window.
      Make sure pop-up blockers are disabled for this site.

   c) From the JViewer top menu bar, click Media and then select Virtual Media Wizard.
   d) From the CD/DVD Media: I section of the Virtual Media dialog, click Browse and then locate the Red Hat Enterprise Linux ISO file on your system and click Open.
      You can ignore the device redirection warning at the bottom of the Virtual Media wizard as it does not affect the ability to re-image the system.
   e) Click Connect CD/DVD, then click OK at the Information dialog.
      The Virtual Media window shows that the ISO image is connected.
   f) Close the window.
      The CD ROM icon in the menu bar turns green to indicate that the ISO image is attached.
   g) From the top menu, click Power and then select Reset Server.

   h) Click Yes and then OK at the Power Control dialogs, then wait for the system to power down and then come back online.
3. Boot the CD ROM image.
   The default boot order typically does not boot the CD ROM image. This can be changed in the BIOS or as a one-time option in the boot menu. To bring up the boot menu, press F11 at the beginning of the boot process. Pressing F11 displays Show Boot Options at the top of the virtual display before entering the boot menu.
   If pressing the physical key has no effect, use the soft keyboard (Menu > Keyboard Layout > SoftKeyboard > <Language>) to bring up a virtual keyboard.
   a) In the boot menu, select UEFI: AMI Virtual CDROM 1.00 as the boot device and then press ENTER.

   b) Follow the instructions at Installing Red Hat Enterprise Linux.

2.4. Installing Red Hat Enterprise Linux

1. After booting the ISO image through either the BMC or from the USB drive, select Install Red Hat Enterprise Linux and then press Enter to start the installation.
2. Refer to the Red Hat Enterprise Linux Quick Installation Guide for guidance on using the installer.
   Configure the language, region, date, time, keyboard, and other configuration options you may need from the Installation Summary screen.
3. Set up the system drive.

   This step removes the Ubuntu installation in order to reclaim space for the Red Hat installation.
   a) From the Installation Summary screen, click INSTALLATION DESTINATION.
   b) Select the first drive (sda) as the installation drive and click Done.
      The Installation Options dialog box appears.
   c) At the Installation Options dialog, click Reclaim space.
   d) At the Device Selection screen, click Delete all to delete all existing data on the system drive.
   e) Click Reclaim space to permanently delete all data from the drive and to use it as the destination drive.
4. Configure Ethernet.
   Select and enable the Ethernet device. This defaults to DHCP and can be changed for static IP configurations under Configure.
5. From the INSTALLATION SUMMARY screen, click Begin installation to start the installation.

   a) While the installation process is running, set your password (ROOT PASSWORD) and create a new user (USER CREATION) from the Configuration screen.
   b) When the installation completes, click Reboot to reboot the system.
      If you have installed Red Hat Enterprise Linux 7.5 and are using the BMC remote console, then follow the instructions provided in the release notes under Black screen on BMC Remote Console with Red Hat Enterprise Linux 7.5.
6. Register the system with the Red Hat Enterprise Customer Portal to complete the initial setup.
   If you installed with the Server with GUI base environment, the Initial Setup starts automatically, where you can accept the license agreement and register the system. See the Red Hat instructions for details.
   If you installed with any other base environment, log on to the system as the root user and then register the system.
      subscription-manager register --auto-attach --username=user_name --password=password
   See How to register and subscribe a system to the Red Hat Customer Portal using Red Hat Subscription-Manager for further information.
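Because this guide calls out behavior that is specific to Red Hat Enterprise Linux 7.5, it can help to confirm which minor release was installed. The sketch below is one way to do that, assuming the release string has the usual form "... release 7.5 (...)" as found in /etc/redhat-release; the helper names `release_version` and `is_rhel_7_5` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the "major.minor" version from a release string.
# Assumption: the string looks like the contents of /etc/redhat-release,
# for example "Red Hat Enterprise Linux Server release 7.5 (Maipo)".

# release_version (hypothetical helper): reads the release string on stdin
# and prints the first "major.minor" number that follows the word "release".
release_version() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# is_rhel_7_5 (hypothetical helper): succeeds when stdin reports version 7.5.
is_rhel_7_5() {
  [ "$(release_version)" = "7.5" ]
}
```

On the installed system, `release_version < /etc/redhat-release` would print the version, and `is_rhel_7_5 < /etc/redhat-release` could gate any 7.5-specific workaround.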

Chapter 3. INSTALLING THE DGX SOFTWARE

This section requires that you have already installed Red Hat Enterprise Linux 7 or a derived operating system on the DGX-1.

3.1. Configuring a System Proxy

If your network requires use of a proxy, then edit the file /etc/yum.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:

   proxy=http://<proxy-server-ip-address>:<proxy-port>
   proxy_username=<proxy-username>
   proxy_password=<proxy-password>

3.2. Enabling the DGX Software Repository

Obtain instructions for enabling the DGX software repository from the NVIDIA Enterprise Support portal (available to DGX customers with an NVIDIA Enterprise Support account). Look for the announcement regarding Red Hat support on DGX-1.

3.3. Installing Required Components

1. On Red Hat Enterprise Linux, enable the following repository.
      sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
2. Install DGX tools and configuration files.
   a) Install DGX Configurations.
         sudo yum groupinstall -y 'DGX Configurations'
   b) The configuration changes take effect only after rebooting the system. To minimize extra reboots, the reboot is deferred until after the drivers have been installed.
3. Configure the /raid partition for use as a data cache for NFS mounted directories.

   The DGX-1 uses a 4-drive RAID0 array, mounted at /raid, for caching NFS reads.
   a) Configure the RAID array.
      This creates the RAID group, mounts it at /raid, and creates an appropriate entry in /etc/fstab.
         sudo configure_raid_array.py -c -f
      The RAID array must be configured before installing dgx-conf-cachefilesd, which places the proper SELinux label on the /raid directory. If you ever need to recreate the RAID array (which wipes out any labeling on /raid) after dgx-conf-cachefilesd has already been installed, be sure to restore the label manually before restarting cachefilesd.
         sudo restorecon /raid
         sudo systemctl restart cachefilesd
   b) Install dgx-conf-cachefilesd to update the cachefilesd configuration to use the /raid partition.
         sudo yum install -y dgx-conf-cachefilesd
4. Install the NVIDIA CUDA drivers.
   a) Install the kernel-devel package.
      The kernel-devel package provides kernel headers required for the NVIDIA CUDA driver. Use the following command to install the kernel headers for the kernel version that is currently running on the system.
         sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
   b) Install the cuda-drivers package.
      This builds and installs the driver kernel modules. The installation of the dkms-nvidia package can take approximately five minutes.
         sudo yum install -y cuda-drivers cuda-drivers-diagnostic dgx-persistence-mode
      Red Hat Enterprise Linux 7.5 ships with OpenGL libraries that conflict with versions included in the CUDA drivers. Depending on the Software Selection performed in Installing Red Hat Enterprise Linux, you might encounter an error with the following libraries: mesa-libGL, mesa-libEGL, or mesa-libGLES. Remove these libraries and re-issue the yum install command.
         sudo rpm -e mesa-libGL.x86_64 --nodeps
         sudo rpm -e mesa-libEGL.x86_64 --nodeps
         sudo rpm -e mesa-libGLES.x86_64 --nodeps
         sudo yum install -y cuda-drivers cuda-drivers-diagnostic dgx-persistence-mode
5. Reboot the system to load the drivers and to update system configurations.
   a) Issue reboot.
         sudo reboot

   b) After the server has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
         nvidia-smi
      The output should show all available GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
:                               :                      :                      :
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

6. Install the NVIDIA container device plugin.
   a) Install docker 1.13 from the rhel-7-server-extras-rpms repository.
         sudo yum install -y docker
   b) Install the NVIDIA Container Runtime group.
         sudo yum groupinstall -y 'NVIDIA Container Runtime'
   c) Run the following command to verify the installation.

         sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi
      See the section Running Containers for more information about this command. For a description of nvcr.io, see the NGC Registry Spaces documentation.
      To ensure that Docker can access the NGC container registry through a proxy, refer to the Red Hat customer portal knowledge base article Configure Docker to use a proxy with or without authentication.
      The output should show all available GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
:                               :                      :                      :
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3.4. Installing Diagnostic Components

3.4.1. NVIDIA System Management Tools

NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It includes active health monitoring, system alerts, and log generation.

The NVIDIA System Management tools require Python 3, which is available from the Red Hat Software Collections. The Fedora EPEL repository also contains a version of Python 3; however, this combination has not been tested.

Install NVSM as follows.

1. Enable the Red Hat Software Collections and Red Hat Enterprise Linux 7 Server Optional repositories.
      sudo subscription-manager repos --enable=rhel-server-rhscl-7-rpms
      sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
   If you do not have access to the Red Hat Software Collections repository, refer to https://access.redhat.com/solutions/472793 for instructions on requesting access for free.
2. Install Python 3.6.
      sudo yum install -y rh-python36
3. Install the DGX System Management tools, which include the NVSM tool.
      sudo yum groupinstall -y 'DGX System Management'

3.5. Installing Optional Components

The DGX-1 is fully functional after installing the components as described in Installing Required Components. If you intend to launch NGC containers (which incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX-1, which is the expected use case, then you can skip this section.

If you intend to use your DGX-1 as a development system for running deep learning applications on bare metal, then install the optional components as described in this section.

1. Install the CUDA toolkit.
      sudo yum install cuda
2. Install the NVIDIA Collectives Communication Library (NCCL) Runtime.
      sudo yum groupinstall 'NVIDIA Collectives Communication Library Runtime'
3. Install the CUDA Deep Neural Networks (cuDNN) Library Runtime.
      sudo yum groupinstall 'CUDA Deep Neural Networks Library Runtime'

4. Install NVIDIA TensorRT.
      sudo yum install tensorrt
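The four optional-component steps above can be collected into a small script. The following is a minimal sketch, not an NVIDIA-provided tool: the `DRY_RUN` switch and the `run` and `install_optional_components` helpers are hypothetical, and `-y` is added so that yum does not prompt, matching the usage elsewhere in this guide. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: run the optional-component installs from section 3.5 in order.
# Assumption: DRY_RUN is a hypothetical switch; when it is 1 (the default
# here) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print or execute the given command.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# install_optional_components (hypothetical helper): package and group names
# are taken from this section.
install_optional_components() {
  run sudo yum install -y cuda
  run sudo yum groupinstall -y 'NVIDIA Collectives Communication Library Runtime'
  run sudo yum groupinstall -y 'CUDA Deep Neural Networks Library Runtime'
  run sudo yum install -y tensorrt
}
```

Calling `install_optional_components` with `DRY_RUN=1` lists the commands for review; setting `DRY_RUN=0` would execute them.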

Chapter 4. RUNNING CONTAINERS

The following is an example of running the CUDA container from the NGC registry.

   sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi

To accommodate SELinux, the DGX software stack includes a package (nvidia-container-selinux) that defines a policy for allowing containers to access NVIDIA GPUs. The --security-opt option in the command sets the corresponding label type permitting the specified container to access NVIDIA GPUs. If SELinux is removed or disabled, then the --security-opt option is not needed.
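To confirm from a script that a container sees every GPU, the GPU list can be counted instead of read by eye. The sketch below assumes the input comes from `nvidia-smi -L`, which prints one line per GPU beginning with `GPU <index>:`, and it assumes eight GPUs because the sample output in this guide shows GPUs 0 through 7. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count GPUs reported by "nvidia-smi -L" and compare with the
# expected number.
# Assumption: 8 GPUs, based on the sample output in this guide (GPUs 0 to 7).
EXPECTED_GPUS=8

# count_gpus (hypothetical helper): reads "nvidia-smi -L" output on stdin and
# prints the number of lines that start with "GPU <index>:".
count_gpus() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c '^GPU [0-9][0-9]*:' || true
}

# check_gpu_count (hypothetical helper): succeeds only if the count matches.
check_gpu_count() {
  local found
  found=$(count_gpus)
  if [ "$found" -eq "$EXPECTED_GPUS" ]; then
    echo "OK: $found GPUs visible"
  else
    echo "WARNING: expected $EXPECTED_GPUS GPUs, found $found" >&2
    return 1
  fi
}
```

With the container command from this chapter, that could look like `sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi -L | check_gpu_count`.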

Chapter 5. CONFIGURING STORAGE - NFS MOUNT AND CACHE

By default, the DGX-1 System includes four SSDs in a RAID 0 configuration. These SSDs are intended for application caching, so NVIDIA recommends that you set up your own NFS storage for long term data storage. The following instructions describe how to mount the NFS onto the DGX-1 System, and how to cache the NFS using the DGX-1 SSDs for improved performance.

Make sure that you have an NFS server with one or more exports with data to be accessed by the DGX-1 System, and that there is network access between the DGX-1 System and the NFS server.

1. Configure an NFS mount for the DGX-1.
   a) Edit the filesystem tables configuration.
         sudo vi /etc/fstab
   b) Add a new line for the NFS mount, using the local mount point of /mnt.
         <nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
      /mnt is used here as an example mount point.
      Consult your Network Administrator for the correct values for <nfs_server> and <export_path>.
      The nfs arguments presented here are a list of recommended values based on typical use cases. However, "fsc" must always be included as that argument specifies use of FS-Cache.
   c) Save the changes.
2. Verify the NFS server is reachable.
      ping <nfs_server>
   Use the server IP address or the server name provided by your network administrator.
3. Mount the NFS export.

      sudo mount /mnt
   /mnt is the example mount point used in step 1.
4. Verify caching is enabled.
      cat /proc/fs/nfsfs/volumes
   Look for the text FSC=yes in the output.

The NFS will be mounted and cached on the DGX-1 System automatically upon subsequent reboot cycles.
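Since the "fsc" option is the one argument this chapter says must always be present, a small check on the fstab entry can catch it being dropped during editing. The sketch below builds the entry with the option list recommended above and flags NFS entries that lack "fsc". The helper names and the example server name `nfs01` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build and sanity-check NFS entries for /etc/fstab.

# nfs_fstab_line (hypothetical helper): prints an fstab entry that uses the
# mount options recommended in this chapter. The mount point defaults to /mnt.
nfs_fstab_line() {
  local server="$1" export_path="$2" mount_point="${3:-/mnt}"
  local opts="rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail"
  printf '%s:%s %s nfs %s 0 0\n' "$server" "$export_path" "$mount_point" "$opts"
}

# nfs_entries_without_fsc (hypothetical helper): reads fstab-formatted text on
# stdin and prints the mount point of every NFS entry whose options lack "fsc".
nfs_entries_without_fsc() {
  awk '
    /^[ \t]*#/ || NF < 4 { next }     # skip comments and incomplete lines
    $3 ~ /^nfs4?$/ {
      n = split($4, opt, ",")         # options are comma-separated
      has = 0
      for (i = 1; i <= n; i++) if (opt[i] == "fsc") has = 1
      if (!has) print $2
    }'
}
```

For example, `nfs_fstab_line nfs01 /export/data` prints a line suitable for pasting into /etc/fstab, and `nfs_entries_without_fsc < /etc/fstab` should print nothing once the entry is in place.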

Appendix A. INSTALLING SOFTWARE ON AIR-GAPPED NVIDIA DGX-1 SYSTEMS

For security purposes, some installations require that systems be isolated from the internet or outside networks. Since most DGX-1 software updates are accomplished through an over-the-network process with NVIDIA servers, this section explains how updates can be made when using an over-the-network method is not an option. It includes a process for installing Docker containers as well.

A.1. Registering Your System

See the Red Hat customer portal knowledge base article How to register and subscribe a system offline to the Red Hat Customer Portal.

A.2. Creating a Local Mirror of the NVIDIA Repository

Instructions for setting up a private repository or mirroring the NVIDIA and the Red Hat repositories are beyond the scope of this document. It is expected that users are knowledgeable about those processes.

The Red Hat customer portal provides a knowledge base article for creating a local mirror. Pay particular attention to the instructions under Create a local repository that allows clients to install groups and use the security plugin to ensure that you include information about package groups when downloading the repository.

The repo-id for the DGX Software repository is nvidia-repo-7.

The instructions assume that you have the repositories enabled on the local machine. See Enabling the DGX Software Repository for instructions on enabling the NVIDIA DGX EL7 repository.

A.3. Installing Docker Containers

This method applies to Docker containers hosted on the NGC Container Registry. Most container images are freely available, but some are locked and require that you have an NGC account to access. See the NGC Registry for DGX User Guide for instructions on accessing locked container images.

1. Enter the docker pull command, specifying the image registry, image repository, and tag.
      docker pull nvcr.io/nvidia/repository:tag
2. Verify the image is on your system using docker images.
      docker images
3. Save the Docker image as an archive.
      docker save nvcr.io/nvidia/repository:tag > framework.tar
4. Transfer the image to the air-gapped system using removable media such as a USB flash drive.
5. Load the NVIDIA Docker image.
      docker load -i framework.tar
6. Verify the image is on your system.
      docker images
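When several images are moved by removable media, it can be useful to give each archive a name derived from its image reference and to verify it after the transfer. The following is a minimal sketch of that idea, assuming the coreutils `sha256sum` tool is available on both systems; the checksum step is an addition for illustration and is not part of the procedure above. All helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: name an image archive after its reference and verify it after copy.
# Assumption: sha256sum (coreutils) is installed on both systems.

# archive_name (hypothetical helper): replaces "/" and ":" with "_", so
#   nvcr.io/nvidia/repository:tag -> nvcr.io_nvidia_repository_tag.tar
archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# write_checksum (hypothetical helper): stores <archive>.sha256 next to the
# archive. Runs in the archive directory so the file holds a relative name.
write_checksum() {
  local archive="$1"
  ( cd "$(dirname "$archive")" &&
    sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" )
}

# verify_checksum (hypothetical helper): succeeds if the archive still matches
# the checksum recorded by write_checksum.
verify_checksum() {
  local archive="$1"
  ( cd "$(dirname "$archive")" &&
    sha256sum -c "$(basename "$archive").sha256" > /dev/null )
}
```

On the connected system, the archive from step 3 would be written to the name printed by `archive_name` and passed to `write_checksum`; on the air-gapped system, `verify_checksum` would run on the copied file before `docker load -i`.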

Appendix B. CHANGING THE BMC LOGIN

The NVIDIA DGX-1 includes a baseboard management controller (BMC) for out-of-band management of the DGX-1 system. NVIDIA recommends creating a unique username and password as soon as possible.

1. Log into the BMC.
   a) Open a browser within your LAN and go to http://<ipmi-ip-address>/.
      Use Firefox or Internet Explorer. Google Chrome is not officially supported by the DGX-1 BMC.
   b) Log in, using qct.admin/qct.admin for the User ID/Password.
2. Select Configuration > Users.
3. Add a new user.
   a) Select an empty field and click Add User.

Appendix B: Changing the BMC Login b) Enter new user information and click Add. RN-09301-001 _v02 26

Appendix B: Changing the BMC Login Log out and then log back in as the new user. 5. Select Configuration Users. 6. Disable User Access for the user qct.admin. a) Select the user qct.admin user and select Modify User 4. RN-09301-001 _v02 27

Appendix B: Changing the BMC Login b) Deselect Enable in User Access and click Modify. RN-09301-001 _v02 28

Appendix B: Changing the BMC Login c) Ensure User Access is disabled for the user qct.admin. RN-09301-001 _v02 29

Appendix B: Changing the BMC Login 7. Log out. RN-09301-001 _v02 30
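The procedure above uses the BMC web interface. For reference only, a roughly equivalent command-line sketch with the standard ipmitool utility is shown below. It is an alternative technique and not the method documented in this guide. It assumes ipmitool is run as root on the DGX-1 host with in-band IPMI access, that LAN channel 1 is the relevant channel, and that the user IDs passed in (3 for the new user, 2 for qct.admin in the example) match what `ipmitool user list 1` reports on your system. All of these are assumptions to verify first.

```shell
#!/bin/sh
# Sketch: create a new BMC user and disable the default one with ipmitool.

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# add_bmc_user USER_ID NAME [CHANNEL]
# USER_ID must be an unused slot; CHANNEL defaults to 1 (assumption).
add_bmc_user() {
    user_id=$1
    name=$2
    channel=${3:-1}
    run ipmitool user set name "$user_id" "$name"
    # With no password argument, ipmitool prompts for one interactively.
    run ipmitool user set password "$user_id"
    # privilege=4 grants the administrator privilege level.
    run ipmitool channel setaccess "$channel" "$user_id" \
        link=on ipmi=on callin=on privilege=4
    run ipmitool user enable "$user_id"
    run ipmitool user list "$channel"
}

# disable_bmc_user USER_ID [CHANNEL]
# Call this only after confirming that the new user can log in.
disable_bmc_user() {
    user_id=$1
    channel=${2:-1}
    run ipmitool user disable "$user_id"
    run ipmitool user list "$channel"
}

# Example (hypothetical IDs): DRY_RUN=0 add_bmc_user 3 dgxadmin
#                             DRY_RUN=0 disable_bmc_user 2
```

Keeping the add and disable steps in separate functions mirrors steps 3 through 6: the default account is disabled only after the new account has been verified.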

APPENDIX C: INSTALLING MELLANOX INFINIBAND DRIVERS

Unlike the DGX OS shipped with the NVIDIA DGX-1, the DGX software stack for Red Hat-derived operating systems does not include the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) for Linux. This is due to the likelihood of the MLNX_OFED kernel being out of sync with the Red Hat distribution kernel, which can result in system instability.

To use InfiniBand on the DGX-1:

1. Either visit the Mellanox site and download and install the latest drivers, or use the in-box drivers. The in-box drivers provide a much lower level of performance than the official Mellanox drivers.

2. After installing the MLNX_OFED drivers, install the NVIDIA peer memory module.

   sudo yum install nvidia-peer-memory-dkms
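The ordering in step 2 (MLNX_OFED first, then the peer memory module) can be sketched as a guarded install. This is an illustrative sketch only. It assumes that the ofed_info utility shipped with MLNX_OFED is on the PATH once the drivers are installed and that the resulting kernel module is named nv_peer_mem. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Sketch: install the NVIDIA peer memory module only when MLNX_OFED is present.

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeeds when the ofed_info utility from MLNX_OFED can be found.
ofed_installed() {
    command -v ofed_info >/dev/null 2>&1
}

install_peer_memory() {
    if ! ofed_installed; then
        echo "MLNX_OFED not detected; install it before the peer memory module." >&2
        return 1
    fi
    run sudo yum install nvidia-peer-memory-dkms
    # Assumed module name; modinfo reports whether the module was built and installed.
    run modinfo nv_peer_mem
}

# Example: DRY_RUN=0 install_peer_memory
```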

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED AS IS. NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, and DGX Station are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2019 NVIDIA Corporation. All rights reserved.