RWTH GPU-Cluster
Photos: Christian Iwainsky
Sandra Wienke, wienke@rz.rwth-aachen.de
March 2012
Rechen- und Kommunikationszentrum (RZ)
The GPU-Cluster
- 57 NVIDIA Quadro 6000 GPUs on 29 nodes - an innovative computer architecture
- Reasonable usage of resources:
  - Daytime: CAVE (VR): 25 nodes; interactive software development (HPC): 4 nodes
  - Nighttime: processing of GPGPU compute jobs (HPC)
- Photo: CAVE, VR, RWTH Aachen, since 2004
Slide 2
GPU-Cluster: Hardware stack
- Nodes:
  - 4 dialogue nodes: linuxgpud[1-4], 2 GPUs each
  - 24 rendering nodes: linuxgpus[01-24], 2 GPUs each
  - 1 head node: linuxgpum1, 1 GPU
- Per GPU: NVIDIA Quadro 6000 (Fermi), 448 cores @ 1.15 GHz, 6 GB RAM, ECC on, peak 1030.4 GFlop/s (SP) / 515.2 GFlop/s (DP)
- Host processors: 2 x Intel Xeon X5650 (Westmere-EP) @ 2.67 GHz, i.e. 12 cores per node
- Host RAM: 24 GB / 48 GB (depending on node type)
- Network: QDR InfiniBand
Slide 3
GPU-Cluster: Software stack
- Scientific Linux 6.1, modules (as on the new compute cluster)
- CUDA Toolkit 4.0 (3.2): CUDA, OpenCL (1.0)
  - module load cuda; installation directory: $CUDA_ROOT
- PGI compiler: PGI Accelerator Model, CUDA Fortran
  - module load pgi (or module switch intel pgi)
- Intel OpenCL SDK: OpenCL (1.1) for Intel CPUs
- CUDA debugging: TotalView (module load totalview), DDT (module load ddt)
- A typical compile workflow is sketched below
Slide 4
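A minimal sketch of setting up the environment and building a CUDA program on a dialogue node; the module names come from the slide above, while the source file vectorAdd.cu and the nvcc flags are only illustrative, not prescribed by the slides.

  # Load the CUDA 4.0 toolkit and inspect the environment
  module load cuda/40
  echo $CUDA_ROOT                      # installation directory set by the module

  # Compile a CUDA source file; -arch=sm_20 targets the Fermi-based Quadro 6000
  nvcc -O2 -arch=sm_20 -o vectorAdd vectorAdd.cu
  ./vectorAdd

  # Switch from the default Intel compiler to PGI for the PGI Accelerator Model
  # or CUDA Fortran
  module switch intel pgi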
How to use?
- Innovative computer architectures: no real production mode (e.g. group membership needed), but kept as stable and reliable as possible
- Mandatory membership in group "gpu": e-mail to servicedesk@rz.rwth-aachen.de with
  - a short description of your (GPU) application (or your purposes)
  - the programming paradigm (e.g. CUDA, OpenCL, ...)
  - single or multi GPU usage
- Membership grants access to the GPU-Cluster (+ single GPU machines) and to the GPGPU wiki (full documentation)
- Demo
Slide 5
How to use?
- Interactive mode: for short runs/tests and debugging only
  - 1 dialogue node (linuxgpud1): 24/7
  - 2 dialogue nodes (linuxgpud[2,3]): Mon-Fri, 8am-8pm
- Batch mode: no interaction, commands are queued + scheduled; for performance tests and long runs
  - 24+1 rendering nodes
  - 2 dialogue nodes (linuxgpud[2,3]): Mon-Fri, 8pm-8am; Sat + Sun the whole day
  - 1 dialogue node (linuxgpud4): Mon-Fri, 8am-8pm, for short test runs during daytime
- Note: nodes reboot at the switch from interactive to batch mode
- The configuration might change
Slide 6
How to use: Interactive mode
- Log in with your TIM account on the dialogue nodes linuxgpud[1-3]
- GPUs are set to exclusive mode (per process): only one person can access a GPU at a time
  - If all GPUs are occupied, you get a message such as "all CUDA-capable devices are busy or unavailable"
  - If no particular device is set in the (CUDA) code, the process is automatically scheduled to another GPU within the node (if available); see the sketch below for pinning a run to a specific GPU
- Debugging
  - TotalView and DDT support CUDA Toolkit 3.2 (the default)
  - TotalView 8.9.2 (!) supports CUDA Toolkit 4.0, but is currently not working (due to the NVIDIA driver); cuda-gdb should work, though
  - Be aware: debuggers usually run on the GPU with ID 0 (and fail if that GPU is occupied)
  - Nodes with a special X configuration for debugging: linuxgpud[1,2]
Slide 7
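A hedged sketch of steering an interactive run onto a free GPU; it assumes the installed CUDA version honors the CUDA_VISIBLE_DEVICES environment variable, and ./myapp stands in for your own CUDA binary.

  # Check which GPU is currently free (see nvidia-smi on the next slide)
  nvidia-smi

  # Make only GPU 1 visible to the application, e.g. because GPU 0 is occupied
  export CUDA_VISIBLE_DEVICES=1
  ./myapp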
How to use: Interactive mode
See what is running: nvidia-smi

  linuxgpud1$> nvidia-smi
  Mon Oct 17 12:41:01 2011
  NVIDIA-SMI 2.285.05   Driver Version: 285.05.09

  GPU  Name         Bus Id        Disp.  ECC SB/DB  Fan  Temp  Perf  Power Usage/Cap  Memory Usage      GPU Util.  Compute M.
  0.   Quadro 6000  0000:02:00.0  Off    0 / 0      30%  80 C  P0    Off / Off        4%  208MB/5375MB   99%       E. Process
  1.   Quadro 6000  0000:85:00.0  On     0 / 0      36%  84 C  P8    Off / Off        0%   22MB/5375MB    0%       E. Process

  Compute processes:
  GPU  PID    Process name  GPU Memory Usage
  0.   30234  nbody         196MB

- The first block shows GPU ID + type, display mode, ECC errors (SB: single bit, DB: double bit) and the compute mode (E. Process: 1 person, i.e. 1 process, per GPU)
- The "Compute processes" block shows the processes running on each GPU (here: nbody on GPU 0)
- nvidia-smi -q lists full GPU details; see the monitoring commands below
Slide 8
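Two follow-up commands for monitoring a node; nvidia-smi -q is taken from the slide above, while the watch wrapper is merely a common convenience and not RWTH-specific.

  # Full per-GPU details (ECC error counters, memory, utilization, clocks, ...)
  nvidia-smi -q

  # Refresh the overview every two seconds while a kernel is running
  watch -n 2 nvidia-smi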
How to use: Batch mode
- Create a batch compute job for LSF
- Select the appropriate queue to get scheduled on the GPU cluster: -q gpu
- Exclusive nodes: nodes are allocated exclusively, i.e. at least 2 GPUs per job - please use the resources reasonably!
- Submit your job: bsub < mygpuscript.sh
  - It starts running as soon as the batch mode starts and the job is scheduled
  - Display the pending reason with bjobs -p (during daytime: "Dispatch windows closed")
- Reminder: only one node is in batch mode during daytime (for testing); request it with -a gpu instead of -q gpu (-q is given priority over -a)
- More documentation:
  - RWTH Compute Cluster User's Guide: http://www.rz.rwth-aachen.de/hpc/primer
  - Unix cluster documentation: https://wiki2.rz.rwth-aachen.de/display/bedoku/usage+of+the+linux+rwth+compute+cluster
Slide 9
Batch script for single GPU (node) usage

  #!/usr/bin/env zsh
  ### Job name
  #BSUB -J GPUTest-Cuda
  ### File / path where STDOUT & STDERR will be written to
  #BSUB -o gputest-cuda.o%J
  ### Request GPU queue
  #BSUB -q gpu
  ### Request the time you need for execution in [hour:]minute
  #BSUB -W 15
  ### Request virtual memory (in MB)
  #BSUB -M 512

  module load cuda/40
  cd $HOME/NVIDIA_GPU_Computing_SDK_4.0.17/C/bin/linux/release
  ./deviceQuery -noprompt

Note: CUDA code needs the whole virtual address space of the node; currently, the memory limit is disabled for the gpu queue.
Slide 10
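Submitting and monitoring the job from a dialogue node, using only the commands named on the previous slide; the script file name is just an example.

  # Submit the batch script shown above
  bsub < mygpuscript.sh

  # Check the job status; -p additionally shows the pending reason
  # (e.g. "Dispatch windows closed" during daytime)
  bjobs
  bjobs -p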
How to use: GPU + MPI
Multi-GPU usage with MPI:
- 1 process per node (ppn): if you want to use only one GPU per node, or if your process uses both GPUs of a node, e.g. via cudaSetDevice
- 2 processes per node: if each process talks to one GPU of the node (see the wrapper sketch below)
- More processes per node: if you have additional processes that compute only on the CPU
- Note: the "exclusive process" mode still restricts each GPU to one process
Slide 11
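A minimal sketch of a wrapper script for the 2-processes-per-node case, so that each MPI process gets its own GPU. The script is hypothetical and not part of the cluster setup; it assumes Open MPI (which exports the node-local rank as OMPI_COMM_WORLD_LOCAL_RANK) and a CUDA version that honors CUDA_VISIBLE_DEVICES.

  #!/usr/bin/env zsh
  # gpu_wrapper.sh (hypothetical helper):
  # map the node-local MPI rank to one of the two GPUs of the node
  local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
  export CUDA_VISIBLE_DEVICES=$local_rank
  exec "$@"

It would be launched as, e.g., $MPIEXEC -n 4 ./gpu_wrapper.sh <prog>, so that on every node the first local rank sees GPU 0 and the second sees GPU 1.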
How to use: GPU + MPI
- Interactive: specify the GPU hosts (nodes) explicitly, otherwise the job will run on the compute cluster:
  $MPIEXEC -n 3 -H linuxgpud1:1,linuxgpud2:2 <prog>
  $MPIEXEC -n 2 -m 1 -H linuxgpud1,linuxgpud2,linuxgpud3 <prog>   (-m 1: one process per node, ppn)
- Batch:
  -n <#procs>              (set the number of processes)
  -a {open|intel}mpi       (choose the MPI flavor)
  -R "span[ptile=<ppn>]"   (set the number of processes per node; see the example below)
- To use the batch test node during daytime with MPI: -a "gpu {open|intel}mpi"
- Note: in batch mode all (working) GPUs are available, including the head node with only one GPU
  - To get only machines with TWO attached GPUs: -m bull-gpu-om
Slide 12
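A hedged sketch of the corresponding batch request lines for two processes per node (one per GPU) with Open MPI; the total slot count of 4 is only an example.

  ### Request 4 compute slots, 2 processes per node (one per GPU), Open MPI
  #BSUB -n 4
  #BSUB -R "span[ptile=2]"
  #BSUB -a openmpi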
Batch script for multi GPU usage (with MPI)
The first part contains the directives known so far (as in the single-GPU script):

  #!/usr/bin/env zsh
  ### Job name
  #BSUB -J GPUTestMPI-Cuda
  ### File / path where STDOUT & STDERR will be written to
  #BSUB -o gputestmpi-cuda.o%J
  ### Request GPU queue
  #BSUB -q gpu
  ### Request the time you need for execution in [hour:]minute
  #BSUB -W 15
  ### Request virtual memory (in MB)
  #BSUB -M 512
  [..]
Slide 13
Batch script for multi GPU usage (with MPI)

  [..]
  ### Request the number of compute slots
  #BSUB -n 4
  ### Set one process per node (ptile = ppn)
  #BSUB -R "span[ptile=1]"
  ### Use Open MPI
  #BSUB -a openmpi

  module load cuda/40
  cd $HOME/simpleMPI
  $MPIEXEC $FLAGS_MPI_BATCH simplempi
Slide 14
Additional notes: Windows + GPUs
- Access restriction: Windows GPU group - e-mail to servicedesk@rz.rwth-aachen.de
- GPU machines: cluster-win-gpu
  - ½ NVIDIA Tesla S1070 (2 GT200 GPUs); host: 8-core Intel X5570 (Nehalem) @ 2.93 GHz
  - In future: NVIDIA Tesla C2050 (1 Fermi GPU); host: 4-core Intel E5620 (Westmere) @ 2.40 GHz
- Interactive + batch mode
- Software: CUDA Toolkit, Matlab, Parallel Nsight debugger, ...
Slide 15
Batch mode - Windows
- Log in with your TIM account (WIN-HPC\xx) to the cluster frontend cluster-win
- Start the HPC Job Manager
- Create a new job
Slide 16
Batch mode - Windows
- Select "Job Details"
  - Job template: GPU (sets GPU resources and allows group gpu to access these resources)
  - Enter a job name
- Add your tasks
  - More details: HPC on Windows - Batch usage - Compute Cluster Scheduler
- Submit your job
  - It starts running as soon as the batch mode starts + the job is scheduled
- Command line: job /help
Slide 17