
Bright Cluster Manager: Using the NVIDIA NGC Deep Learning Containers

Technical White Paper

Table of Contents

Pre-requisites
Setup
Run PyTorch in Kubernetes
Run PyTorch in Singularity
Run TensorFlow in Kubernetes
Run TensorFlow in Singularity
Run MXNet in Kubernetes
Run MXNet in Singularity
How could this be improved?
Conclusion

Bright makes it easy to use the excellent NVIDIA NGC deep learning containers in many ways: you can run them in containers, through a batch scheduler, on physical nodes, or in a cloud. Bright deploys and manages the entire stack, from bare metal all the way up to the application layer, and it provides an intuitive installer that makes it simple to deploy both Docker and Kubernetes. Why wade through endless documentation? This paper shows how to run three popular deep learning frameworks (PyTorch, TensorFlow, and MXNet) in Docker containers scheduled by Kubernetes, and in Singularity containers scheduled through an HPC workload manager. You will see how to run them as Kubernetes pods and jobs, how to pass an input script to a job using a ConfigMap, and how to build a custom version of an NGC container image.

Pre-requisites

- Bright Cluster Manager >= 8.2
- NVIDIA CUDA 9.2
- Docker and Kubernetes installed
- Docker registry or Harbor installed (optional)
- NVIDIA NGC account created (visit https://ngc.nvidia.com to create an account and get an API key)
- NVIDIA NGC API key

This document was created on nodes equipped with NVIDIA V100 GPUs. A quick way to confirm that the required pieces are in place is shown below.
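As a sanity check, and not part of the original walkthrough, the environment modules and the GPU driver can be verified from the head node. This is a minimal sketch that assumes the Bright-provided modules and nvidia-smi are already on the path:

$ module avail 2>&1 | grep -iE 'docker|kubernetes|singularity'
$ nvidia-smi -L

module avail writes its listing to standard error, hence the redirect; nvidia-smi -L simply lists the GPUs the driver can see.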

Setup

Log in to the NGC container registry

Use the docker login command to authenticate to the NGC registry.

$ module load docker
$ docker login nvcr.io
Username: $oauthtoken
Password: <api-key>
Login Succeeded

Now that you have authenticated to the NGC registry you can pull images from it. Pull the following three images. In the next steps you will tag them and push them into the local Docker registry, so that pods can run on any of the Kubernetes nodes without having to manually stage a copy of each image on every node.

$ docker pull nvcr.io/nvidia/pytorch:18.09-py3
$ docker pull nvcr.io/nvidia/tensorflow:18.08-py3
$ docker pull nvcr.io/nvidia/mxnet:18.08-py3

Tag the images.

$ docker image tag nvcr.io/nvidia/pytorch:18.09-py3 \
    kube-ngc:5000/site-pytorch:18.09-py3
$ docker image tag nvcr.io/nvidia/tensorflow:18.08-py3 \
    kube-ngc:5000/site-tensorflow:18.08-py3
$ docker image tag nvcr.io/nvidia/mxnet:18.08-py3 \
    kube-ngc:5000/site-mxnet:18.08-py3

Push the tagged images into the local Docker registry. In this case the local Docker registry is installed on the node kube-ngc and is running on port 5000.

$ docker push kube-ngc:5000/site-pytorch:18.09-py3
$ docker push kube-ngc:5000/site-tensorflow:18.08-py3
$ docker push kube-ngc:5000/site-mxnet:18.08-py3

Prepare your environment to run Kubernetes.

$ module load kubernetes
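Optionally, you can confirm that the three images landed in the local registry. This assumes the registry answers the standard Docker Registry v2 HTTP API on port 5000 over plain HTTP; the response should be a JSON document listing the site-pytorch, site-tensorflow, and site-mxnet repositories.

$ curl http://kube-ngc:5000/v2/_catalog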

Run PyTorch in Kubernetes

Create a Kubernetes pod definition file

You can modify this example pod definition file to run an NGC PyTorch container. As written, this pod mounts /home/rstober on /home inside the Docker container, and it pulls the site-pytorch:18.09-py3 image from the local Docker registry.

apiVersion: v1
kind: Pod
metadata:
  name: two-layer-pod
spec:
  restartPolicy: Never
  containers:
  - name: pytorch
    image: kube-ngc:5000/site-pytorch:18.09-py3
    command: ["python", "/home/script/two_layer_net_tensor.py"]
    volumeMounts:
    - mountPath: /home
      name: home-directory
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: home-directory
    hostPath:
      path: /home/rstober

Example Python script

Download the example script two_layer_net_tensor.py from the PyTorch tutorials website.

$ cd /home/rstober/script
$ wget https://pytorch.org/tutorials/_downloads/two_layer_net_tensor.py

Run the pod

Use the kubectl command to create the pod. It may take a minute for the pod to start running, because the execution node first needs to pull the container image from the local Docker registry.

$ kubectl create -f ./k8s/two_layer_pod.yaml
$ kubectl get pods
NAME            READY     STATUS    RESTARTS   AGE
two-layer-pod   1/1       Running   0          13s

Retrieve the job output when it has completed.

$ kubectl logs two-layer-pod | head -5
0 33391274.0
1 29912904.0
2 30585640.0
3 30197114.0
4 25912002.0
...
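A few optional commands are handy for keeping an eye on the pod; these are standard kubectl operations rather than anything specific to Bright or NGC:

$ kubectl describe pod two-layer-pod   # shows events; useful if the image pull is slow or fails
$ kubectl logs -f two-layer-pod        # streams the loss values as the script prints them
$ kubectl delete pod two-layer-pod     # cleans up once you have collected the output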

Run PyTorch in Singularity

Create a Slurm job script

Bright provides Singularity, which can convert a Docker image to a Singularity image on the fly. This allows you to run any of the NGC containers through your batch workload management system.

This example Slurm job script (two-layer-batch) authenticates to the NGC registry and pulls the image from it. Alternatively, we could have pulled the image from the local registry, in which case the SINGULARITY_DOCKER_USERNAME and SINGULARITY_DOCKER_PASSWORD environment variables would not be necessary.

#!/bin/bash
# two-layer-batch
module load singularity
export PATH=$PATH:/cm/local/apps/cuda-driver/libs/current/bin
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat /home/rstober/ngc-api-key)
singularity run --nv docker://nvcr.io/nvidia/pytorch:18.09-py3 \
    python /home/rstober/script/two_layer_net_tensor.py

Example Python script

If you ran the Kubernetes example above, you should have already downloaded the two_layer_net_tensor.py example script from the PyTorch tutorials website. If not, download it to the script directory as shown below.

$ cd /home/rstober/script
$ wget https://pytorch.org/tutorials/_downloads/two_layer_net_tensor.py

Submit the job

$ sbatch ./slurm/two-layer-batch
$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
      4      defq two-laye  rstober  R   0:07      1 cnode005

View the job output

$ tail -5 slurm-4.out
495 5.537470497074537e-05
496 5.4810028814245015e-05
497 5.395523476181552e-05
498 5.3191412007436156e-05
499 5.235371281742118e-05
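If the same container is used for many jobs, the Docker-to-Singularity conversion can be done once up front instead of in every job. This is only a sketch: the exact pull syntax depends on the installed Singularity version (the --name form below is the Singularity 2.x style), and the image file name is arbitrary.

$ module load singularity
$ export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
$ export SINGULARITY_DOCKER_PASSWORD=$(cat /home/rstober/ngc-api-key)
$ singularity pull --name pytorch-18.09-py3.simg docker://nvcr.io/nvidia/pytorch:18.09-py3

The job script can then run singularity run --nv pytorch-18.09-py3.simg ... and skip the conversion step entirely.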

Run TensorFlow in Kubernetes

Create a Kubernetes pod definition file

You can modify this example pod definition file to run an NGC TensorFlow container. As written, this pod mounts /home/rstober on /home inside the Docker container, and it pulls the site-tensorflow:18.08-py3 image from the local Docker registry.

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-pod
spec:
  restartPolicy: Never
  containers:
  - name: tensorflow
    image: kube-ngc:5000/site-tensorflow:18.08-py3
    command: ["python", "/home/script/tf.py"]
    volumeMounts:
    - mountPath: /home
      name: home-directory
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: home-directory
    hostPath:
      path: /home/rstober

Example TensorFlow job

This is the Python script /home/rstober/script/tf.py that gets run inside the container.

import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))

Run the TensorFlow pod

$ kubectl create -f ./k8s/tensorflow-pod.yaml
pod "tensorflow-pod" created
$ kubectl get pods
NAME             READY     STATUS              RESTARTS   AGE
tensorflow-pod   0/1       ContainerCreating   0          8s

Get the job output

$ kubectl logs tensorflow-pod
2018-09-27 15:54:51.106593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
b'Hello, TensorFlow!'
42
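To confirm that Kubernetes can schedule a GPU and that the driver is visible inside the NGC containers, a throwaway interactive pod can be useful. This is only a sketch: gpu-check is an arbitrary name, and the --limits flag to kubectl run exists in kubectl releases contemporary with this setup but may not be available in newer ones.

$ kubectl run gpu-check --rm -it --restart=Never \
    --image=kube-ngc:5000/site-tensorflow:18.08-py3 \
    --limits=nvidia.com/gpu=1 -- nvidia-smi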

Run TensorFlow in Singularity

Create a Slurm job script

This example Slurm job script (tf-batch) pulls the TensorFlow image from the local registry. Note that the SINGULARITY_DOCKER_USERNAME and SINGULARITY_DOCKER_PASSWORD environment variables are not necessary here and have been omitted.

#!/bin/bash
# tf-batch
module load singularity
export PATH=$PATH:/cm/local/apps/cuda-driver/libs/current/bin
singularity run --nv docker://kube-ngc:5000/site-tensorflow:18.08-py3 \
    python /home/rstober/script/tf.py

Example TensorFlow job

This is the same Python script /home/rstober/script/tf.py used in the Kubernetes example above; it gets run inside the container.

import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))

Submit the job

$ sbatch ./slurm/tf-batch
$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
      6      defq tf-batch  rstober  R   0:02      1 cnode005

View the job output

$ cat slurm-6.out
...
2018-10-18 17:40:29.538361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14880 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
b'Hello, TensorFlow!'
42
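For interactive experimentation, the same image can be opened in a shell instead of being submitted as a batch job; singularity shell is a standard Singularity subcommand, and --nv again exposes the host GPUs inside the container.

$ module load singularity
$ singularity shell --nv docker://kube-ngc:5000/site-tensorflow:18.08-py3

From the prompt inside the container, /home/rstober/script/tf.py can be run with python directly.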

Run MXNet in Kubernetes

Create a Kubernetes pod definition file

You can modify this example pod definition file to run an NGC MXNet container. As written, this pod mounts /home/rstober on /home inside the Docker container, and it pulls the site-mxnet:18.08-py3 image from the local Docker registry.

apiVersion: v1
kind: Pod
metadata:
  name: mxnet-pod
spec:
  restartPolicy: Never
  containers:
  - name: mxnet-container
    image: kube-ngc:5000/site-mxnet:18.08-py3
    command: ["python", "/home/script/example-mxnet.py"]
    volumeMounts:
    - mountPath: /home
      name: home-directory
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: home-directory
    hostPath:
      path: /home/rstober

Example MXNet job

This is the Python script example-mxnet.py that gets run inside the container.

$ cat script/example-mxnet.py
import mxnet as mx
a = mx.nd.ones((2,3), mx.gpu())
print((a*2).asnumpy())

Run the MXNet pod

$ kubectl create -f ./k8s/mxnet-pod.yaml
$ kubectl get pods
NAME        READY     STATUS    RESTARTS   AGE
mxnet-pod   1/1       Running   0          22s

Get the job output

$ kubectl logs mxnet-pod
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
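If a GPU pod stays in Pending, the usual reason is that no node is advertising a free nvidia.com/gpu resource. A quick check, assuming the NVIDIA device plugin (or equivalent) is registering the GPUs and using cnode005 as an example node name:

$ kubectl describe node cnode005 | grep -i nvidia.com/gpu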

Run MXNet in Singularity

Create a Slurm job script

This example Slurm job script (mxnet-batch) pulls the MXNet image from the local registry.

#!/bin/bash
# mxnet-batch
module load singularity
export PATH=$PATH:/cm/local/apps/cuda-driver/libs/current/bin
singularity run --nv docker://kube-ngc:5000/site-mxnet:18.08-py3 \
    python /home/rstober/script/example-mxnet.py

Example MXNet job

This is the Python script example-mxnet.py that gets run inside the container.

$ cat script/example-mxnet.py
import mxnet as mx
a = mx.nd.ones((2,3), mx.gpu())
print((a*2).asnumpy())

Submit the job

$ sbatch slurm/mxnet-batch
$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
      7      defq mxnet-ba  rstober  R   0:02      1 cnode005

View the job output

$ tail slurm-7.out
...
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
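The job scripts above rely on --nv to expose the GPUs but do not ask Slurm to reserve one, which is fine on dedicated nodes but can oversubscribe GPUs on shared ones. If the cluster's Slurm configuration defines GPU generic resources (gres), a variant like the following can be used; this is a sketch of a hypothetical mxnet-batch-gres script and assumes gres=gpu is configured on the compute nodes.

#!/bin/bash
# mxnet-batch-gres (hypothetical variant of mxnet-batch)
#SBATCH --partition=defq
#SBATCH --gres=gpu:1
module load singularity
export PATH=$PATH:/cm/local/apps/cuda-driver/libs/current/bin
singularity run --nv docker://kube-ngc:5000/site-mxnet:18.08-py3 \
    python /home/rstober/script/example-mxnet.py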

How could this be improved?

Although the pod ran successfully, there are a couple of problems:

1. This is a "naked" pod, because it is not bound to a Deployment or Job. Because it is naked, it will not be rescheduled if the execution node fails.
2. It is also not as portable as it could be, in that the configuration (e.g., the volumes and volumeMounts) is hard-coded in the pod definition.

To solve problem number one, we can run the pod as a Job using the batch API. The Job is then managed by the Job controller, which will restart it if the execution node fails.

Problem number two can be solved in two ways. We will first show how to inject the job script into the container using a ConfigMap, which makes the job much more portable. Then we will show how to add the job script to the image by rebuilding it.

Create a Kubernetes job definition file

Below we convert our pod definition to a job definition. This solves problem one, above.

apiVersion: batch/v1
kind: Job
metadata:
  name: two-layer-job
spec:
  template:
    metadata:
      labels:
        app: two-layer-job
    spec:
      containers:
      - name: pytorch
        image: kube-ngc:5000/site-pytorch:18.09-py3
        command: ["python", "/home/script/two_layer_net_tensor.py"]
        volumeMounts:
        - mountPath: /home
          name: home-directory
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: home-directory
        hostPath:
          path: /home/rstober
      restartPolicy: Never

Run the job

The job is run in the same way the pods were run.

$ kubectl create -f ./k8s/two-layer-job.yaml
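Once the Job has been created, its status and output can be followed with standard kubectl commands; the label selector below relies on the app: two-layer-job label set in the template above.

$ kubectl get jobs
$ kubectl get pods -l app=two-layer-job
$ kubectl logs -l app=two-layer-job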

Create a ConfigMap

ConfigMaps separate configuration from Pods and other components, which makes workloads more portable, makes configuration easier to manage, and avoids hard-coding configuration information into Pod specifications.

This command creates a ConfigMap named my-scripts that contains the job script two_layer_net_tensor.py.

$ kubectl create configmap my-scripts \
    --from-file=./script/two_layer_net_tensor.py
configmap "my-scripts" created

This job definition exposes the ConfigMap using a volume. The volumes.configMap.name field refers to the my-scripts ConfigMap created above. The job script is injected into a dynamically created volume, which is mounted on /scripts in the container.

apiVersion: batch/v1
kind: Job
metadata:
  name: python-configmap
spec:
  template:
    metadata:
      name: python-configmap
    spec:
      containers:
      - name: pytorch
        image: kube-ngc:5000/site-pytorch:18.09-py3
        command: ["python", "/scripts/two_layer_net_tensor.py"]
        volumeMounts:
        - name: script-volume
          mountPath: /scripts
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: script-volume
        configMap:
          name: my-scripts
      restartPolicy: Never

$ kubectl create -f ./k8s/python-configmap.yaml
job "python-configmap" created
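If the script changes later, the ConfigMap can be inspected and refreshed in place. This is a standard kubectl pattern rather than anything specific to this setup; older kubectl versions use the bare --dry-run flag shown here, while newer ones expect --dry-run=client.

$ kubectl describe configmap my-scripts
$ kubectl create configmap my-scripts \
    --from-file=./script/two_layer_net_tensor.py \
    --dry-run -o yaml | kubectl apply -f -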

Add the job script to the image

Another way to include the job script in the container is to rebuild the container image. Fortunately, NVIDIA makes it easy to rebuild their images. Once rebuilt, we push the image into the local Docker registry and modify the job definition to use it.

Prepare a Dockerfile

The FROM line specifies the PyTorch image we pulled from the NGC container registry. The COPY line copies the Python script two_layer_net_tensor.py into the /workspace directory that already exists in the image.

$ cat Dockerfile
FROM nvcr.io/nvidia/pytorch:18.09-py3
# Copy the python script into the image
COPY two_layer_net_tensor.py /workspace

Build the image

$ docker build -t site-pytorch:18.09-py3 .
Sending build context to Docker daemon 4.608 kB
Step 1/2 : FROM nvcr.io/nvidia/pytorch:18.09-py3
 ---> a1ed9d391e62
Step 2/2 : COPY two_layer_net_tensor.py /workspace
 ---> b1700195be0a
Removing intermediate container 5e10eb5e6efe
Successfully built b1700195be0a

Tag the image

Now that the image is built, we tag it in preparation for pushing it to the local Docker registry. The Docker registry is installed on the cluster head node kube-ngc and is running on port 5000.

$ docker image tag site-pytorch:18.09-py3 \
    kube-ngc:5000/site-pytorch:18.09-py3

Where:

- site-pytorch:18.09-py3 is the custom image that we just built
- kube-ngc:5000/site-pytorch:18.09-py3 is the tag. This is how we will refer to the new image.

Push the new image into the Docker registry

$ docker push kube-ngc:5000/site-pytorch:18.09-py3

Now let's run a job using the image we just created. It will be pulled from our local Docker registry. Because the image includes the job script, we no longer need to mount the home directory volume. Here's the job definition file.

apiVersion: batch/v1
kind: Job
metadata:
  name: two-layer-job
spec:
  template:
    metadata:
      name: two-layer-job
    spec:
      containers:
      - name: pytorch
        image: kube-ngc:5000/site-pytorch:18.09-py3
        command: ["python", "/workspace/two_layer_net_tensor.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
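Optionally, confirm that the new tag is visible both locally and in the registry; the second command assumes, as before, that the registry serves the Docker Registry v2 API over plain HTTP.

$ docker image ls kube-ngc:5000/site-pytorch
$ curl http://kube-ngc:5000/v2/site-pytorch/tags/list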

Run the Job

$ kubectl create -f yaml/two-layer-job.yaml
job "two-layer-job" created
$ kubectl get pods
NAME            READY     STATUS    RESTARTS   AGE
two-layer-job   1/1       Running   0          13s

View the job output

$ kubectl logs two-layer-job
0 23417680.0
1 16624169.0
2 13556437.0
3 12046015.0
4 11048217.0
...
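Job names must be unique, so kubectl create will refuse to create a second Job called two-layer-job while the first one still exists. Deleting the finished Job, which also removes the pods it created, avoids that when re-submitting:

$ kubectl delete job two-layer-job
$ kubectl create -f yaml/two-layer-job.yaml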

Conclusion

As of today there are 13 separate frameworks and tools available in the NVIDIA NGC registry, and there are many revisions of each; for example, there are 21 TensorFlow revisions available. NVIDIA is doing a fantastic job of managing these images and updates them regularly.

You've seen how to run three of them in Kubernetes on a Bright-managed Linux cluster. The good news is that you can run any of the containers not specifically detailed here in the same ways we ran these three.

Bright, together with Kubernetes and Singularity, makes it simple to take advantage of the tremendous value provided by the NVIDIA NGC containers. Bright provides intelligent cluster management from bare metal all the way to the application layer, along with enterprise-grade support.

www.brightcomputing.com
info@brightcomputing.com
NGCWPP.2018.10.26.v01