Advanced Research Computing. ARC3 and GPUs. Mark Dixon

1 Advanced Research Computing Mark Dixon

2 ARC3 (1st March 2017) included 2 GPU nodes, each with:
- 24 Intel CPU cores & 128G RAM (same as a standard compute node)
- 2 NVIDIA Tesla K80 24G RAM PCIe cards
ARC3 upgrade (21st September 2017) added 6 GPU nodes, each with:
- 24 Intel CPU cores & 256G RAM
- 4 NVIDIA Tesla P100 12G RAM PCIe cards

3 Why? Increasing interest in GPUs from users. Increasing support for GPUs in applications. High theoretical speed. Suited to strongly-scaling problems requiring relatively low memory sizes. GPUs typically use a memory-latency-hiding programming model, which helps make best use of memory bandwidth.

4 Why NOT? It can be difficult to extract the theoretical performance. Programming is more complicated: code runs on the CPU and the GPU(s), transferring memory between them as needed, and multiple tasks need to be overlapped at once (memory transfers, several different sets of computations). Algorithms need to be amenable to SIMD (Single Instruction Multiple Data, a.k.a. vectorisation). The unwary can be locked in to NVIDIA's hardware.

5 ARC3 Standard compute node

6 ARC3 P100 GPU node

7 Comparison:
                 2x Intel E5-2650 v4   NVIDIA K80 card      NVIDIA P100 card
Calc/s (DP)      845 Gflops            2.9 Tflops           4.7 Tflops
Cores            24 Intel cores        2x 2496 CUDA cores   3584 CUDA cores
Clock speed      2.2 GHz               875 MHz              1.3 GHz
Mem size         128 GB                2x 12 GB             12 GB
Mem bandwidth    ~100 GB/s             ~500 GB/s            ~500 GB/s
Different terminology between CPU and GPU worlds - CUDA cores are not comparable to CPU cores!

8 ARC3 NVIDIA P100 GPU (4.7 Tflops)

9 What is SIMD / Vectorisation? Where the compiler can automatically rewrite something like this:
for (i = 0; i < SIZE; i++) {
    c[i] = a[i] * b[i];
}
Into something like this (loop unrolling):
for (i = 0; i < SIZE; i += 4) {
    c[i]   = a[i]   * b[i];
    c[i+1] = a[i+1] * b[i+1];
    c[i+2] = a[i+2] * b[i+2];
    c[i+3] = a[i+3] * b[i+3];
}
And then the memory layout means it can issue a single instruction to do all four multiplies at once.
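To see whether this is happening in your own code, a minimal sketch (assuming the GNU compiler and a hypothetical source file vec.c containing the loop above) is to ask the compiler to report its vectorisation decisions:
$ gcc -O3 -fopt-info-vec -c vec.c
This prints a note for each loop the compiler managed to vectorise.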

10 Submitting jobs: To run a batch job:
qsub -l h_rt=<time>,coproc_<gpu_type>=<num_cards> job_script.sh
To get an interactive session (useful for compiling GPU applications):
qrsh -l h_rt=<time>,coproc_<gpu_type>=<num_cards> -pty y bash
Jobs are given dedicated access to their GPUs, with no sharing. NVIDIA K80: num_cards = 1 or 2. NVIDIA P100: num_cards = 1, 2, 3 or 4. Jobs are allocated fractions of a compute node: if a node has 4 cards and the job asks for 1 card, it gets ¼ of the cores and memory in the node.

11 Submitting jobs (examples): NVIDIA K80:
qsub -l h_rt=1:0:0,coproc_k80=1 job_script.sh (batch)
qrsh -l h_rt=1:0:0,coproc_k80=1 -pty y bash (interactive)
Gives you: 12 CPU cores & 64G RAM, plus 1 NVIDIA K80 card (the K80 is a bit odd - it appears as 2 GPUs).
NVIDIA P100:
qsub -l h_rt=1:0:0,coproc_p100=1 job_script.sh (batch)
qrsh -l h_rt=1:0:0,coproc_p100=1 -pty y bash (interactive)
Gives you: 6 CPU cores & 64G RAM, plus 1 NVIDIA P100 card (appears as 1 GPU).

12 Useful commands: See what hardware is free:
$ qstat -g c
The output lists each cluster queue (24core-128G-2K80.q, 24core-128G.q, 24core-256G-4P100.q, 24core-768G.q, the 256thread-112G queues, and project queues such as astro.q, the cryoem queues, maths.q, minphys.q, omics.q and skyblue.q) together with its CQLOAD, USED, RES, AVAIL and TOTAL slot counts.

13 Useful commands: See what you are running:
$ qstat
The output shows one row per job, with columns job-id, prior, name, user, state, submit/start at, queue, slots and ja-task-id - for example, running bash sessions (state r) started on 11/24/2017 in 24core-256G-4P100.q on nodes db12gpu5 and db12gpu6, and a queued sleep.sh job (state qw) requesting 12 slots.

14 OK, so you have your GPU. What next? Within a job, set up the NVIDIA development environment:
module load cuda
module switch intel gnu/native
Once done, the NVIDIA CUDA compiler (nvcc) is available, as are tools such as the status tool (nvidia-smi) and the profiling tools (nvprof and nvvp). It also makes the CUDA-optimised numerical libraries available, e.g. linear algebra (cublas, cusparse), FFTs (cufft), image processing (NPP), etc.
Note: some applications need a specific major version (execute module avail cuda to see the choices).
Note: if you are building an application and its build scripts ask where the CUDA files are, their location can be found by executing: echo $CUDA_HOME
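Putting that together, a minimal check inside an interactive GPU session (obtained with qrsh as above) might look like the following sketch - nvcc --version is a standard nvcc option, the rest are the commands quoted on this slide:
$ module load cuda
$ module switch intel gnu/native
$ nvcc --version     # confirm the CUDA compiler is on the PATH
$ echo $CUDA_HOME    # location to give to build scripts
$ nvidia-smi         # confirm the job can see its GPU(s)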

15 The NVIDIA CUDA compiler (nvcc): Different GPU cards have different capabilities and different architectures, so it can be important to optimise for them when building applications. nvcc can create a single fat binary optimised for multiple GPU cards. To run on a:
K80, you want: -gencode arch=compute_37,code=sm_37
P100, you want: -gencode arch=compute_60,code=sm_60
Use both to optimise for both cards. nvcc doesn't need a GPU to run (e.g. on a login node), but many application build scripts expect one, so build applications within a GPU job.
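As a small illustration (the source and output file names here are invented), a fat binary covering both card types could be built with:
$ nvcc -O3 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_60,code=sm_60 myprog.cu -o myprog
The resulting executable carries code for both architectures, and the driver picks the appropriate one at run time.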

16 Useful commands: Within a job, see what GPUs you have:
$ nvidia-smi
The output reports the driver version and then, for each GPU (here two Tesla K80s), its fan, temperature, persistence mode, bus id, performance state, power usage/cap, display mode, ECC setting, memory usage, GPU utilisation and compute mode, followed by the processes running on each GPU with their PID and memory usage (here /opt/conda/bin/python using roughly 1.9GiB on each).

17 Useful commands: Within a job, see how busy your GPUs are:
$ nvidia-smi dmon
This prints one line per GPU per interval, with columns for power (W), temperature (C), SM, memory, encoder and decoder utilisation (%), and memory and processor clocks (MHz). It updates every second; press Ctrl-C to quit.

18 Example job script (submit with qsub job_script.sh):
#$ -l h_rt=1:0:0
#$ -l coproc_p100=2
#$ -j y
# Asked for two GPUs.
# The first is device 0, the second is device 1.
# Log GPU details
nvidia-smi
# Collect GPU performance stats, one line every 10 seconds
nvidia-smi dmon -d 10 > "${JOB_NAME}.g${JOB_ID}" &
# Run program
./mygpuprogram

19 Example job script output: The file job_script.sh.g<job_id> contains the nvidia-smi dmon samples - one line per GPU per interval, with columns for GPU index, power (W), temperature (C), SM, memory, encoder and decoder utilisation (%), and memory and processor clocks (MHz).

20 Example job script output: The file job_script.sh.o<job_id> starts with the scheduler prologue, listing the hosts assigned to the job (here db12gpu8.arc3.leeds.ac.uk with 12 slots) and the resources granted (h_vmem and disk per slot, h_rt), followed by the nvidia-smi output logged by the script (here a single Tesla P100-PCIE at 29C, drawing 29W, with 12193MiB of memory).

21 Programming - other tools:
NVIDIA CUDA (C dialect, nvcc compiler): existing codes need to be rewritten, but it provides the most optimisation opportunities.
OpenCL (C API, non-proprietary): existing codes also need to be rewritten.
OpenMP 4 / OpenACC (code annotation of Fortran, C or C++): existing code can be reused, but is unlikely to extract as good performance as CUDA; we are experimenting with this on GCC 7.2.
NVIDIA CUDA provides a profiling and optimisation tool, and optimised libraries (linear algebra, FFT, image processing, deep learning, etc.).
Looking at integrating 3rd party tools like ddt, map and tau.
Multi-node MPI is not currently supported on ARC3 with GPUs.
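For the directive-based route, a sketch of a compile line (assuming the GCC 7.2 mentioned above has been built with nvptx offloading support, and with an invented file name myacc.c) would be:
$ gcc -fopenacc -foffload=nvptx-none -O3 myacc.c -o myacc
while the CUDA route uses nvcc with the -gencode flags from slide 15.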

22 Training: ARC to run CUDA courses in January.
Apply for an account:
Documentation:
ARC: (GPU sections still a work in progress)
NVIDIA:
