Managing and Deploying GPU Accelerators (ADAC17 - Resource Management). Stephane Thiell and Kilian Cavalotti, Stanford Research Computing Center
1 Managing and Deploying GPU Accelerators ADAC17 - Resource Management Stephane Thiell and Kilian Cavalotti Stanford Research Computing Center
2 OUTLINE GPU resources at the SRCC Slurm and GPUs Slurm and GPU P2P Running Amber with GPU P2P (intranode) Running TensorFlow with Singularity
3 GPU RESOURCES AT THE SRCC
4 STANFORD SHERLOCK Shared compute cluster, open to the Stanford community as a resource to support sponsored research. Condo cluster, nodes are ordered quarterly (currently nodes). Heterogeneous cluster with 73 GPU nodes / 500 GPU cards, Tesla and GeForce. Total of ~2500 users (~420 are faculty) and 64 owners. GPU node types include a Supermicro GPU SuperServer 4028GR-TRT (4U) with 8 x Nvidia GeForce consumer-grade cards and a Dell C4130 (1U) with 4 x Nvidia Tesla cards.
5 STANFORD XSTREAM Multi-GPU HPC cluster: 520 x Nvidia K80 (1,040 GPUs, 16 GPUs/node) plus 24 x Nvidia P100E (8 GPUs/node). Energy efficient: 1 PFlops (LINPACK peak) for only 190 kW! Rankings: June 2015 (ISC15) 87th / 6th, Nov 2015 (SC15) 102nd / 5th, June 2016 (ISC16) 122nd / 6th, Nov 2016 (SC16) 162nd / 8th, June 2017 (ISC17) 214th / 24th.
6 XSTREAM SYSTEM CHARACTERISTICS (NODE LAYOUT) XStream's Cray CS-Storm 26268N node specs: 20 CPU cores (12 GB RAM/core), 256 GB of DDR3 RAM, 16 Nvidia K80 GPUs, each with 12 GB of GDDR5 with ECC support, balanced PCIe bandwidth across GPUs (dual Root Complex), 1 x InfiniBand FDR.
7 SLURM AND GPUs
8 SLURM WITH GENERIC RESOURCES (GRES) SLURM is the resource manager on both Sherlock and XStream. Open source and full-featured (including GPU support). Generic Resource (GRES) scheduling: $ srun [...] --gres gpu:2 [...] or #SBATCH --gres gpu:2. This defines the number of GPUs per node. GPU Compute Mode selection: #SBATCH -C gpu_shared. Custom option; sets the GPU Compute Mode to DEFAULT (shared) instead of EXCLUSIVE_PROCESS in the Slurm Prolog. With -C gpu_shared, multiple processes are able to access a GPU. In general NOT recommended, but sometimes required for multi-GPU jobs, for instance when running Amber or LAMMPS.
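As an illustration, here is a minimal batch script sketch combining the GRES request and the gpu_shared constraint described above; the job name, partition name and application command are placeholders, not taken from the slides.
#!/bin/bash
#SBATCH --job-name=gpu_example        # hypothetical job name
#SBATCH --partition=gpu               # assumed GPU partition name
#SBATCH --gres gpu:2                  # request 2 GPUs on the node
#SBATCH -C gpu_shared                 # custom constraint: set GPU Compute Mode to DEFAULT (shared)
#SBATCH --time=0:30:00
# Placeholder application; replace with a real GPU code.
srun nvidia-smi -L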
9 SHERLOCK SLURM GPU QOS SETTINGS Simple enforcement of GPU usage on Sherlock's GPU QoS: MinTRES set to cpu=1,gres/gpu=1. Example of error if the above rule is not respected:
$ srun -p gpu --pty bash
srun: error: Unable to allocate resources: Job violates accounting/qos policy (job submit limit, user's size and/or time limits)
$ srun -p gpu --gres gpu:1 --pty bash
srun: job queued and waiting for resources
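A hedged sketch of how such a rule might be configured with sacctmgr; the QoS name (gpu) is an assumption and the exact setup on Sherlock is not shown in the slides.
# Assumed QoS name "gpu"; MinTRESPerJob requires at least 1 CPU and 1 GPU per job
sacctmgr modify qos gpu set MinTRESPerJob="cpu=1,gres/gpu=1"
# The QoS is then attached to the GPU partition in slurm.conf (sketch):
# PartitionName=gpu ... QOS=gpu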
10 SHERLOCK SLURM GPU FEATURES Sherlock has many different types of Nvidia GPUs. We use Slurm FEATURES (-C) for GPU type selection; GRES gpu:n is used for GPU allocation. We used to specify the GPU type in GRES as in gpu:tesla:2, but using features is more flexible!
# sinfo -o "%.10R %.8D %.10m %.5c %7z %8G %100f %N"
PARTITION  NODES  MEMORY  CPUS  S:C:T   GRES   AVAIL_FEATURES  NODELIST
test                      20    2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB
test                            2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P40,GPU_MEM:24GB  sh
ownerxx                         2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB  sh-113-[06-07]
ownerxy                         2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB  sh-112-[06-07]
ownerxy                         2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB  sh-112-[08-11]
ownerxz                         2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB  sh
owners                          2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TESLA_P100_PCIE,GPU_MEM:16GB  sh-112-[06-07]
owners                          2:10:1  gpu:4  CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,GPU_GEN:PSC,GPU_SKU:TITAN_XP,GPU_MEM:12GB  sh-112-[08-12]...
Example of GPU type constraint: -C GPU_SKU:TITAN_XP
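For instance, a user can combine a GPU count with a GPU type feature in one request; the partition name below is an assumption and nvidia-smi -L is only a placeholder command.
# Request 2 GPUs of a specific type (Titan Xp) on Sherlock's gpu partition (assumed name)
$ srun -p gpu --gres gpu:2 -C GPU_SKU:TITAN_XP nvidia-smi -L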
11 XSTREAM SLURM CPU/GPU RATIOS Job submission rules are enforced to maximize GPU efficiency:
Max CPU/GPU ratio: 5/4 (20/16)
Default memory per CPU: 12,000 MB
Max memory per CPU: 12,800 MB
Max (system) memory per GPU: 16,000 MB (*)
Min GPU count: 1
(*) Unlike memory/cpu, the number of GPUs is NOT automatically updated when you request more memory.
CPU/GPU ratio enforcement is implemented using a job_submit.lua plugin. Example of error if the above CPU/GPU ratio rule is not respected:
$ srun -c 5 --gres gpu:1 command
srun: error: CPUs requested per node (5) not allowed with only 1 GPU(s); increase the number of GPUs to 4 or reduce the number of CPUs
srun: error: Unable to allocate resources: More processors requested than permitted
12 XSTREAM MULTI-GPU RESOURCE ALLOCATION GPU devices cgroups: Slurm on XStream uses the Linux cgroup devices subsystem so that a job is only allowed to access its allocated GPU devices.
$ srun --gres gpu:3 nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-fdae33a e8-1aef-4fa8a745ad07)
GPU 1: Tesla K80 (UUID: GPU-f4735a45-ea34-55f1-35ba-50a84c4b462c)
GPU 2: Tesla K80 (UUID: GPU-b1d8438e-7f05-9bc3-8d3e f)
Consequence: the GPU IDs above and the $CUDA_VISIBLE_DEVICES IDs within a job always start at 0.
Display the full node GPU Direct communication matrix and CPU affinity:
$ srun -c 20 --gres gpu:16 nvidia-smi topo -m
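A quick way to see this renumbering from inside an allocation (a sketch; the exact value depends on the allocation, but with device cgroups it is relative to the job):
# Inside a 3-GPU allocation, CUDA_VISIBLE_DEVICES is renumbered starting at 0
$ srun --gres gpu:3 bash -c 'echo $CUDA_VISIBLE_DEVICES'
# expected to print something like: 0,1,2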
13 SLURM AND GPU PEER TO PEER COMMUNICATION
14 GPU PEER-TO-PEER WITH SLURM ON XSTREAM (1/5) #SBATCH --gres-flags=enforce-binding Standard Slurm option. Enforces GRES/CPU binding, i.e. all CPUs and GRES (here, GPUs) will be allocated within the same CPU socket(s). Required for GPU P2P. Sufficient when used with 1 CPU (-c 1), but USELESS when CPUs are allocated across different CPU sockets. Need to bind tasks (and their CPUs) to a specific CPU socket: #SBATCH --cpus-per-task=1 to 10, for example: -c 8, and #SBATCH --ntasks-per-socket=n, with n == ntasks. Standard Slurm option; masks will automatically be generated to bind the tasks to specific sockets.
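A minimal sketch putting these options together for a single-node, single-task, 8-GPU P2P job; the application line is a placeholder and values follow the examples on the next slides.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8             # keep <= 10 so the task fits on one socket
#SBATCH --ntasks-per-socket=1         # n == ntasks: bind the task to a single socket
#SBATCH --gres gpu:8
#SBATCH --gres-flags=enforce-binding  # allocate the GPUs on the same socket as the CPUs
#SBATCH --time=0:10:00
# Placeholder: verify the resulting topology instead of running a real P2P application
srun nvidia-smi topo -m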
15 GPU PEER-TO-PEER WITH SLURM ON XSTREAM (2/5) Example of bad resource allocation on XStream for GPU P2P
16 GPU PEER-TO-PEER WITH SLURM ON XSTREAM (3/5) Bad case of GPU P2P allocation: 1 CPU (core 0) and 6 GPUs are already allocated on the first CPU socket; by requesting 8 CPUs and 8 GPUs, we may get 8 CPUs not on the same CPU socket and 8 GPUs not on the same PCIe Root Complex (CPU socket):
$ srun -c8 --gres gpu:8 nvidia-smi topo -m
[nvidia-smi topo -m matrix, partially recovered: the 8 allocated GPUs span both PCIe root complexes, and only GPU pairs behind the same root complex show PIX/PXB links; CPU affinity 1-8]
17 GPU PEER-TO-PEER WITH SLURM ON XSTREAM (4/5) Fix the issue by using the correct Slurm options for GPU P2P: group CPUs on the same socket with --ntasks-per-socket=n, and group GPUs on the same socket with --gres-flags=enforce-binding.
$ srun -c8 --gres gpu:8 --gres-flags=enforce-binding --ntasks-per-socket=1 \
    nvidia-smi topo -m
[nvidia-smi topo -m matrix, partially recovered: all 8 allocated GPUs now sit behind the same PCIe root complex, every GPU pair showing a PIX or PXB link]
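To confirm that peer-to-peer access is actually usable with such an allocation, one could run the p2pBandwidthLatencyTest utility from the CUDA samples; the binary path below is an assumption.
# Check P2P access, bandwidth and latency between the allocated GPUs (path is hypothetical)
$ srun -c8 --gres gpu:8 --gres-flags=enforce-binding --ntasks-per-socket=1 \
    ~/cuda-samples/bin/p2pBandwidthLatencyTest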
18 GPU PEER-TO-PEER WITH SLURM ON XSTREAM (5/5) Example of proper resource allocation on XStream for GPU P2P
19 RUNNING AMBER WITH GPU P2P (INTRANODE)
20 XSTREAM RUNNING AMBER WITH GPU P2P (1/3) AMBER is a popular molecular dynamics simulation package used on both Sherlock and XStream. AMBER is interesting to study as it has the ability to use GPUs to massively accelerate PMEMD for both explicit solvent PME (Particle Mesh Ewald) and implicit solvent GB (Generalized Born). Key new features include (...) peer-to-peer support for multi-GPU runs, providing enhanced multi-GPU scaling.
21 XSTREAM RUNNING AMBER WITH GPU P2P (2/3) AMBER Benchmark: DHFR NPT HMR 4fs = 23,558 atoms. amber_bench_pme_p2p_4gpus.sbatch:
#!/bin/bash
#SBATCH -o slurm_amber_pme_p2p_4gpus.%j.out
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=4
#SBATCH --gres gpu:4
#SBATCH --gres-flags=enforce-binding
#SBATCH -C gpu_shared
#SBATCH -t 1:00:00
echo ""
echo "JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs"
echo " "
echo ""
module load intel/ Amber/14
cd PME/JAC_production_NPT_4fs
srun $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.gpu -o mdout.4gpu -inf mdinfo.4gpu -x mdcrd.4gpu -r restrt.4gpu
22 XSTREAM RUNNING AMBER WITH GPU P2P (3/3) Overview of CPU / GPU / multi-GPU / multi-GPU P2P performance. Other results (without ECC) are available online.
23 GPU RESOURCE MANAGEMENT WITH CONTAINERS: RUNNING TENSORFLOW WITH SINGULARITY
24 TENSORFLOW MODEL TRAINING (1/4) Software like TensorFlow evolves so quickly that it is too painful to install and upgrade on an HPC cluster on a regular basis, especially on an old OS (XStream is still running RHEL 6.9). Singularity is a container technology developed for HPC. It is also installed on the major XSEDE computing systems. Thanks to the new Nvidia GPU support in Singularity 2.3 (through the --nv option), we can now use Singularity with GPUs on both Sherlock and XStream!
25 TENSORFLOW MODEL TRAINING (2/4) Step 1: get the latest TensorFlow image. Create a new Singularity image and import the Docker image:
$ module load singularity
$ singularity pull docker://tensorflow/tensorflow:latest-gpu
Step 2: TensorFlow container quick test:
$ singularity shell --home $WORK:/home --nv tensorflow-latest-gpu.img
Singularity tensorflow.img:~> python
>>> import tensorflow as tf
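As a quick sanity check that the container actually sees a GPU, one could run something like the following sketch; it assumes a TensorFlow 1.x-era image where tf.test.gpu_device_name() is available.
# Inside a 1-GPU allocation, print the first GPU device seen by TensorFlow
$ srun --gres gpu:1 singularity exec --nv tensorflow-latest-gpu.img \
    python -c "import tensorflow as tf; print(tf.test.gpu_device_name())"
# expected to print something like: /device:GPU:0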
26 TENSORFLOW MODEL TRAINING (3/4) Step 3: run CIFAR10 training on a single GPU (extract of cifar10_1gpu.sbatch for XStream):
#!/bin/bash
#SBATCH --job-name=cifar10_1gpu
#SBATCH --output=slurm_cifar10_1gpu_%j.out
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:1
#SBATCH --time=1:00:00
TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10
mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/
module load singularity
srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_train.py --batch_size=128 \
    --log_device_placement=false \
    --max_steps=
27 TENSORFLOW MODEL TRAINING (4/4) Step 4: run CIFAR10 training on multiple GPUs (extract of cifar10_2gpu.sbatch for XStream):
#!/bin/bash
#SBATCH --job-name=cifar10_2gpu
#SBATCH --output=slurm_cifar10_2gpu_%j.out
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-socket=1
#SBATCH --gres gpu:2
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=1:00:00
TENSORFLOW_IMG=tensorflow-latest-gpu.img
CIFAR10_DIR=PEARC17_ECSS/tensorflow/cifar10
mkdir $LSTOR/cifar10_data
cp -v cifar-10-binary.tar.gz $LSTOR/cifar10_data/
module load singularity
srun singularity exec --home $WORK:/home --bind $LSTOR:/tmp --nv $TENSORFLOW_IMG \
    python $CIFAR10_DIR/cifar10_multi_gpu_train.py --num_gpus=2 \
    --batch_size=64 \
    --log_device_placement=false \
    --max_steps=
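Submitting and monitoring either of these scripts follows the usual Slurm workflow; the job ID below is a placeholder.
$ sbatch cifar10_2gpu.sbatch               # submit the multi-GPU training job
Submitted batch job 123456                 # placeholder job ID
$ squeue -u $USER                          # check the job state
$ tail -f slurm_cifar10_2gpu_123456.out    # follow the training output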
28 CONTACT
29