HPC Introductory Training on Balena by Team HPC @ Bath


1 HPC Introductory Training on Balena by Team HPC @ Bath

2 What is HPC and why is it different to using your desktop?
"High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business." - insidehpc
Aggregated computing power
Very large problem sizes
Multiple problems simultaneously

3 Balena - HPC at Bath
General Information
Scheduler
Managing workloads
Visualisation

4 Objectives of this training
Login to Balena
Understand what storage is available on Balena
Find the software you want to use
Create your own jobscript and be familiar with the options
Submit and manage workloads on Balena
Use interactive nodes for Test and Development
Use visualisation tools
Know where to find more information and how to ask for help

5 General Information
What makes up the cluster (hardware/technology - servers, interconnects, storage, etc.)
Accessing the cluster: logging in, copying files
Data storage areas: /home and /beegfs partitions, expected performance, quotas
Module environment: setting up your working environment, compilers, libraries, software/applications

6 Balena Technical Specification
CPU cores: 3,072 Intel Ivy Bridge 2.6 GHz cores (16 per node)
Memory: 18 TB of main memory (mostly 4 or 8 GB/core, but 2 nodes have 512 GB)
Storage: ~50TB NFS for home area; ~200TB BeeGFS non-archival (parallel filesystem)
Network: Intel TrueScale 2:1 blocking (worst case) QDR InfiniBand fabric
Size: 202 nodes (including management)
Performance: 57 TFlops
Accelerators: 22 Xeon Phi (5110p), 24 GPUs (K20x), 2 S10K
Add. Services: Visualisation and Test & Development

Service levels and limits:
Free: 6 hrs per job, max 16 nodes, max 256 cores
Premium: 5 days per job, max 32 nodes, max 512 cores

7

8 Accessing Balena # Exercise 1
Note: Balena can only be accessed from within the campus network
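A minimal sketch of logging in and copying files from a campus machine, assuming the login address is balena.bath.ac.uk and <username> is your university username (both placeholders to confirm against the Balena wiki):

$ ssh <username>@balena.bath.ac.uk                       # log in to a login node (balena-01 or balena-02)
$ scp input.tar.gz <username>@balena.bath.ac.uk:         # copy a file into your home area
$ scp <username>@balena.bath.ac.uk:results.tar.gz .      # copy results back to your own machine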

9 # Exercise 2 Where can I keep my files/data? # Exercise 3

                                       /home ($HOME)     /beegfs/scratch ($SCRATCH)
Total capacity                         ~50TB             ~200TB
User quota                             5 GB              Unlimited
Peak performance from login node       <500MB/sec        1-3GB/sec
Peak performance from a compute node   <100MB/sec        1-3GB/sec (aggregate BW for all users in excess of 10GB/sec)
Data policy                            Backed up         Non-archival
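A minimal sketch of staging data onto the scratch area before a run, using only the $HOME and $SCRATCH variables described above (myproject and its contents are hypothetical names):

$ echo $SCRATCH                                       # location of your personal scratch area
$ du -sh $HOME                                        # check how much of the 5 GB home quota is used
$ mkdir -p $SCRATCH/myproject                         # create a working directory on the parallel filesystem
$ cp -r $HOME/myproject/input $SCRATCH/myproject/     # stage input data onto BeeGFS before running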

10 # Exercise 4 Modules: where is the software I want to run?
$ module avail
Applications: Ansys, VASP, Matlab, Gaussian, etc.
Compilers-langs: Intel Compiler suite, GNU compilers, Python, etc.
Libraries: MKL, MPI, CUDA, FFTW3, etc.
Tools: Allinea DDT/MAP, Intel VTune, Valgrind, etc.

11 [balena-01 ~]$ module avail

/apps/modules/balena
cluster-manager dot module-info teaching use.own cluster-tools/7.0 http_proxy slurm/ untested version

/apps/modules/applications
ansys/v150 cp2k/ssmp/3.0 gromacs/cpu/dp/5.0.2 phylobayes/4.1b ansys/v161 cplex/ gromacs/cpu/sp/5.0.2 phylobayes/4.1c ansys/v162 crystal09/intel/1.0.1 gromacs/cuda/5.0.2 phylobayes/mpi1.6j ansys/v170 crystal14/intel/1.0.3 hwloc/1.8.1 R/3.1.2 bonnie++/ esm/2.0 idl/8.5 R/3.3.0 bowtie/1.1.2 esm/3.0 lammps/intel/dec-2013 relion/1.4 CASTEP/16.11 espresso/5.1 llvm/3.4.1 stata/14 CASTEP/8.0 espresso/5.2.0 matlab/2014b vasp/intel/5.3.5 comsol/5.1 gaussian09/a.02 matlab/2015b vmd/1.9.2 cp2k/popt/3.0(default) gaussian09/intel/d.01 nwchem/6.5 cp2k/psmp/3.0 gaussian09/intel/d.01-linda openfoam/intel/2.3.x-svn

/apps/modules/compilers-langs
gcc/4.8.2 intel/compiler/64/ (default) intel/compiler/mic/ python/3.4.2(default) intel/compiler/64/ intel/compiler/64/ java/jdk/1.8.0 intel/compiler/64/14.0/2013_sp intel/compiler/mic/ python/

/apps/modules/libraries
acml/gcc/64/5.3.1 cuda/blas/ intel/mkl/64/11.3(default) lapack/intel/64/3.5.0 acml/gcc/fma4/5.3.1 cuda/blas/ intel/mkl/mic/11.2 mvapich2/gcc/2.1-qib acml/gcc/mp/64/5.3.1 cuda/blas/ intel/mkl/mic/11.3 mvapich2/icc/2.1-qib acml/gcc/mp/fma4/5.3.1 cuda/fft/ intel/mpi/64/ openblas/dynamic/0.2.8 acml/gcc-int64/64/5.3.1 cuda/fft/ intel/mpi/64/ (default) opencv/3.1.0 acml/gcc-int64/fma4/5.3.1 cuda/fft/ intel/mpi/mic/ openmpi/gcc/1.8.4 acml/gcc-int64/mp/64/5.3.1 cudnn/3.0 intel/mpi/mic/ openmpi/intel/1.8.4 acml/gcc-int64/mp/fma4/5.3.1 fftw3/intel/avx/3.3.4 intel-mpi/32/4.1.3/049 pcre/8.38 blas/gcc/64/1 fftw3/intel/sse/3.3.4 intel-mpi/64/4.1.3/049 wannier/1.2 blas/intel/64/1 gsl/2.1 intel-mpi/mic/4.1.3/049 xz/5.2.2 boost/gcc/ intel/mkl/64/ intel-tbb-oss/ia32/42_ oss zlib/1.2.8 boost/intel/ intel/mkl/64/11.1/2013_sp intel-tbb-oss/intel64/42_ oss bzip2/1.0.6 intel/mkl/64/11.2 lapack/gcc/64/

/apps/modules/tools
allinea/ddt-map/4.2 cmake/3.5.2 git/2.5.1 intel-cluster-runtime/ia32/3.6 allinea/ddt-map/6.0.2 cuda/nsight/ hdf5/ intel-cluster-runtime/intel64/3.6 allinea/reports/5.0 cuda/nsight/ hdf5_18/ intel-cluster-runtime/mic/3.6 allinea/reports/6.0.2 cuda/nsight/ htop iozone/3_420 anaconda/2.3.0 cuda/profiler/ intel/adviser/ netcdf/gcc/64/ anaconda3/2.5.0 cuda/profiler/ intel/inspector/ netperf/2.6.0 autotools/latest cuda/profiler/ intel/itac/ paraview/4.3.1 bonnie++/ cuda/tdk/ intel/mpss/runtime/3.4.3 valgrind/ cmake/ cuda/toolkit/ intel/mpss/sdk/3.4.3 cmake/3.2.3 cuda/toolkit/ intel/vtune/ cmake/3.3.1 cuda/toolkit/ intel-cluster-checker/2.1.2
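A minimal sketch of finding and loading one of the libraries listed above (fftw3/intel/avx/3.3.4 is taken from the list; check the exact version strings with module avail on the day):

$ module avail fftw                    # search the list for FFTW modules
$ module load fftw3/intel/avx/3.3.4    # load a specific version
$ module list                          # show what is currently loaded
$ module purge                         # start again with a clean environment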

12 Scheduler
Brief introduction to SLURM
How to discover how the scheduler is configured, listing the queues/partitions, fairshare
Essential scheduler commands: sinfo, squeue, sbatch, etc.
Understanding the topology, the effects of intra- and inter-IB-switch communication, and how to assign workloads to single or multiple switches

13 Simple Linux Utility for Resource Management (SLURM)
Terminology:
CPU - for multicore machines this will be the core
Task - a task is synonymous with a process, usually the number of MPI processes that are required
Partition (queue) - grouping of nodes based on features
Account - grouping of users
Job - an application/program submitted to the scheduler; each job gets a unique job identifier (jobid)

14 Essential SLURM commands
User commands                                               SLURM
View information about SLURM nodes and partitions           sinfo
List status of jobs in the queue                            squeue
  Jobs by user                                              squeue --user [userid]
  Jobs by jobid                                             squeue --job [jobid]
List all jobs (completed and running/pending) by the user   sacct
Submit a job                                                sbatch [jobscript]
Cancel a job                                                scancel [jobid]
Hold a job in the queue                                     scontrol hold [jobid]
Release a job that is held                                  scontrol release [jobid]
Get detailed information of a job in the queue              scontrol show job [jobid]
Get detailed information of a node                          scontrol show node [nodename]
Get licenses available on SLURM                             scontrol show license
Show fairshare information                                  sshare
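A minimal sketch tying these commands together for a typical workflow (my_job.slurm and <jobid> are placeholders):

$ sbatch my_job.slurm            # submit; SLURM prints "Submitted batch job <jobid>"
$ squeue --user $USER            # watch your jobs in the queue
$ scontrol show job <jobid>      # inspect where and how the job will run
$ scancel <jobid>                # cancel it if something is wrong
$ sacct --job <jobid>            # check the accounting record after it finishes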

15 # Exercise 5 Essential SLURM commands: squeue
[balena-01 ~]$ squeue
JOBID NAME USER ACCOUNT PARTITION ST NODES CPUS MIN_MEMORY START_TIME TIME_LEFT PRIORITY NODELIST(REASON)
CoO2-7Lhs hc722 free batch R K T10:21:55 4: node-sw-[ ,041,045]
V2OBv6L hc722 free batch R K T10:29:02 11: node-sw-[021,048,060,068]
V2OV4L hc722 free batch R K T10:42:22 25: node-sw-[ ]
V4OV9B hc722 free batch R K T10:53:01 35: node-sw-[124,127, ]
V4OV9C hc722 free batch R K T10:54:08 36: node-sw-[ ,072]
V4OV9D hc722 free batch R K T10:55:06 37: node-sw-[ ,032,034]
Li1RuO3-r3 hc722 free batch R K T10:57:09 39: node-sw-[ ,100,104]
VASP jf298 free batch R T10:57:57 40: node-sw-[058,061,155,160]
ompam_50_2 ide20 free batch-all R K T11:58:26 1:41: node-as-ngpu
ReederMC3 cgk26 free batch-all R K T12:32:29 2:15: node-as-phi
b120mc1 cgk26 free batch-all R K T12:34:32 2:17: node-sw
ReederMC6 cgk26 free batch-all R K T12:35:03 2:17: node-sw-125

Useful filters:
--name [jobname]           Filter based on job name
--partition [partition]    List jobs in a specific partition
--user [userid]            List jobs of a specific user in the queue
--job [jobid]              Status of a single job in the queue

16 Essential SLURM commands: sinfo # Exercise 5
[balena-01 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 160 alloc node-sw-[ ]
batch-acc up infinite 21 alloc node-as-agpu-001,node-as-ngpu-[ ],node-as-phi-[ ]
batch-all up infinite 181 alloc node-as-agpu-001,node-as-ngpu-[ ],node-as-phi-[ ]
batch-512gb up infinite 2 idle node-sw-fat-[ ]
batch-64gb up infinite 80 alloc node-sw-[ ]
batch-128gb up infinite 80 alloc node-sw-[ ]
batch-devel up infinite 4 idle node-sw-[ ]
teaching up infinite 4 idle node-sw-[ ]
batch-micnative up infinite 4 idle node-as-phi-003-mic[0-3]

[balena-02 ~]$ sinfo -Nel --partition batch-acc --Format=nodelist,features,gres
Mon Aug 22 17:18:
NODELIST AVAIL_FEATURES GRES
node-as-agpu-001 s10k (null)
node-as-ngpu-001 k20x gpu:4
node-as-ngpu-002 k20x gpu:4
node-as-ngpu-003 k20x gpu:4
node-as-ngpu-004 k20x gpu:4
node-as-ngpu-005 k20x gpu:1
node-as-ngpu-006 k20x gpu:1
node-as-phi p,michost mic:4
node-as-phi p,michost mic:4
node-dw-ngpu-001 k20x gpu:1
node-dw-ngpu-002 k20x gpu:1
node-dw-ngpu-003 k20x gpu:1
node-dw-ngpu-004 k20x gpu:1
node-dw-phi p,michost mic:1
node-dw-phi p,michost mic:1
node-dw-phi p,michost mic:1
node-dw-phi p,michost mic:1

17 # Exercise 5 Essential SLURM commands: scontrol
[balena-01 ~]$ scontrol show job
JobId= Name=MAPbI3
UserId=jms70( ) GroupId=balena_ch(10307)
Priority= Nice=0 Account=free QOS=free
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:47:01 TimeLimit=06:00:00 TimeMin=N/A
SubmitTime= T11:17:05 EligibleTime= T11:17:05
StartTime= T12:19:34 EndTime= T18:19:35
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=balena-01:
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node-sw-111
BatchHost=node-sw-111
NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=0
MinCPUsNode=16 MinMemoryNode=62G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/beegfs/scratch/user/e/jms70/MAPbI3/Cubic-Phono3py/Phono3py-16x16x16_ job
WorkDir=/beegfs/scratch/user/e/jms70/MAPbI3/Cubic-Phono3py
StdErr=/beegfs/scratch/user/e/jms70/MAPbI3/Cubic-Phono3py/StdErr.e.%j
StdIn=/dev/null
StdOut=/beegfs/scratch/user/e/jms70/MAPbI3/Cubic-Phono3py/StdOut.o

18 # Exercise 5 Essential SLURM commands: scontrol
$ scontrol show node node-sw-100
NodeName=node-sw-100 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.03 Features=(null) Gres=(null)
NodeAddr=node-sw-100 NodeHostName=node-sw-100 Version=
OS=Linux RealMemory=64498 AllocMem=63488 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=2015 Weight=1
BootTime= T14:52:50 SlurmdStartTime= T15:03:07
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ scontrol show node node-dw-ngpu-001
NodeName=node-dw-ngpu-001 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=15.21 Features=k20x Gres=gpu:1
NodeAddr=node-dw-ngpu-001 NodeHostName=node-dw-ngpu-001 Version=
OS=Linux RealMemory= AllocMem=60000 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=2015 Weight=1
BootTime= T14:50:49 SlurmdStartTime= T14:54:05
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

19 # Exercise 5 Essential SLURM commands: sshare
Lists the accounts to which a user has access
[balena-02 ~]$ sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
free test-rtm

20 Topology
2:1 blocking fabric: 24 nodes per QDR switch and 12 uplinks to the core switches
All nodes on switch 1 talking to all nodes on switch 2 will effectively communicate at DDR speed, which is ~1.6GB/sec
24 nodes on a single switch communicate at full QDR speed
SLURM is configured to understand the InfiniBand topology
Request that your job only starts on a single switch with --switches=1 (see the sketch below)
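A minimal sketch of requesting single-switch placement in a job script; everything other than the --switches directive is reused from the example job scripts in this training, and testjob/my_mpi_app are placeholders:

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --switches=1             # keep all 4 nodes on one InfiniBand switch for full QDR bandwidth
#SBATCH --time=01:00:00

module purge
module load slurm
module load intel/mpi

mpirun ./my_mpi_app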

21 Managing workloads
Creating job-scripts for SLURM
Requesting specific features: GPU (K20x and S10K) or Xeon Phi nodes, specific nodes, partitions
Managing workloads: submitting, cancelling
Running interactive sessions

22 Example hybrid code in C - hello_world.c

#include <stdio.h>
#include "mpi.h"
#include "omp.h"

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each MPI process opens an OpenMP parallel region */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank+1, numprocs, processor_name);
    }

    MPI_Finalize();
}

23 Compiling the example code # Exercise 6
Load the necessary modules. For our example, let's load:
Intel Compiler suite: module load intel/compiler
Intel MPI library: module load intel/mpi

$ module load intel/compiler/64/
$ module load intel/mpi/64/

Compile hybrid (MPI + OpenMP) code with: mpiicc -qopenmp source.c -o output

$ mpiicc -qopenmp hello_world.c -o hello_world
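For comparison, a hedged sketch of the same build with the GNU toolchain, assuming the gcc and openmpi/gcc modules listed earlier provide an mpicc wrapper (check module avail for the exact version strings):

$ module purge
$ module load gcc/4.8.2
$ module load openmpi/gcc/1.8.4
$ mpicc -fopenmp hello_world.c -o hello_world    # GCC uses -fopenmp rather than -qopenmp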

24 Anatomy of a job-script # Exercise 6
hash-bang: tells Linux which interpreter to use
SLURM directives
Job environment
Instructions to run your application

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:05:00
#SBATCH --error=hello-%j.err
#SBATCH --output=hello-%j.out

module purge
module load slurm
module load intel/compiler/64/
module load intel/mpi/64/

mpirun ./hello_world

25 Job-script: Submitting the job # Exercise 6
Submit the job to the queue:
[balena-02 ~]$ cd $SCRATCH/training/hello_world
[balena-02 ~]$ sbatch hello_world.slurm
Submitted batch job

View the job in the queue:
[balena-02 ~]$ squeue --job

View the job in the SLURM accounting log:
[balena-02 ~]$ sacct --job

If you want to cancel a job that is pending/running:
[balena-02 ~]$ scancel
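A minimal sketch of pulling a more readable accounting summary once the job has finished; the --format field list, the date, and <jobid> are illustrative placeholders:

$ sacct --job <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
$ sacct --user $USER --starttime 2016-08-01      # all of your jobs since a given date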

26 Job-script: Running the example code in OpenMP mode # Exercise 6

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:05:00
#SBATCH --error=hello_1-%j.err
#SBATCH --output=hello_1-%j.out

module purge
module load slurm
module load intel/compiler/64/
module load intel/mpi/64/

mpirun ./hello_world

Submission with: 1 MPI task, 16 OpenMP threads per task

Hello from thread 12 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 13 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 0 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 14 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 6 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 9 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 2 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 10 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 1 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 4 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 15 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 8 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 5 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 11 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 3 out of 16 from process 1 out of 1 on node-sw-165
Hello from thread 7 out of 16 from process 1 out of 1 on node-sw-165

27 Job-script: Running the example code in MPI mode # Exercise 6

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:05:00
#SBATCH --error=hello_2-%j.err
#SBATCH --output=hello_2-%j.out

module purge
module load slurm
module load intel/compiler/64/
module load intel/mpi/64/

mpirun ./hello_world

Submission with: 32 MPI tasks, 1 OpenMP thread per task (default)

Hello from thread 1 out of 1 from process 1 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 2 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 3 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 4 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 7 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 17 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 5 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 18 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 6 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 19 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 8 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 20 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 9 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 21 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 10 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 22 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 11 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 12 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 13 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 14 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 15 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 16 out of 32 on node-sw-166
Hello from thread 1 out of 1 from process 23 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 24 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 25 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 26 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 27 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 28 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 29 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 32 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 30 out of 32 on node-sw-167
Hello from thread 1 out of 1 from process 31 out of 32 on node-sw-167

28 Job-script: Running the example code in Hybrid mode # Exercise 6

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch-devel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00
#SBATCH --error=hello_3-%j.err
#SBATCH --output=hello_3-%j.out

module purge
module load slurm
module load intel/compiler/64/
module load intel/mpi/64/

mpirun ./hello_world

Submission with: 4 MPI tasks, 8 OpenMP threads per task

Hello from thread 5 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 2 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 4 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 1 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 6 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 3 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 1 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 5 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 8 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 7 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 8 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 4 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 1 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 5 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 2 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 2 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 3 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 7 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 8 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 4 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 7 out of 8 from process 1 out of 4 on node-sw-165
Hello from thread 6 out of 8 from process 2 out of 4 on node-sw-165
Hello from thread 3 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 8 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 4 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 6 out of 8 from process 4 out of 4 on node-sw-166
Hello from thread 5 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 1 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 6 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 7 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 2 out of 8 from process 3 out of 4 on node-sw-166
Hello from thread 3 out of 8 from process 3 out of 4 on node-sw-166

29 Job-script: SLURM directives

SLURM directive prefix: #SBATCH

Job specification          SLURM directive
Partition/Queue            --partition=[partition]
Job name                   --job-name=[name]
Wall clock limit           --time=[min] or [days-hh:mm:ss]
Node count                 --nodes=[no of nodes]
CPU count                  --ntasks=[count]
Email address              --mail-user=[address]
Event notification         --mail-type=[events] e.g. BEGIN, END, FAIL, ALL
Node features              --constraint=[feature] e.g. k20x, 5110p
Generic resources          --gres=[resource] e.g. gpu:4 or mic:2
Working directory          --workdir=[full_path_of_dir]
Licenses                   --licenses=[license_name:count]
Job arrays                 --array=[array_spec]
Job restart                --requeue OR --no-requeue
Standard output file       --output=[file_name]
Standard error file        --error=[file_name]

Example:
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --error=slurm-%j.err
module load intel/mpi
mpirun -np $SLURM_NTASKS ./my_mpi_app

The filename pattern may contain one or more replacement symbols (a percent sign "%" followed by a letter, e.g. %j):
%A   Job array's master job allocation number
%a   Job array ID (index) number
%j   Job allocation number
%N   Node name (name of the first node in the job)
%u   User name
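A hedged sketch combining several of these directives in one script - a 10-element job array with email notification; the email address, array range, and input naming scheme are illustrative placeholders:

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --array=1-10
#SBATCH --mail-user=user@bath.ac.uk
#SBATCH --mail-type=END,FAIL
#SBATCH --output=array-%A_%a.out

module purge
module load slurm

# Each array element processes its own input file, selected by the array index
./my_app input_${SLURM_ARRAY_TASK_ID}.dat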

30 Requesting specific resources on Balena

Different partitions:
--partition=batch           64GB and 128GB nodes
--partition=batch-acc       Accelerator nodes (GPUs and Xeon Phis)
--partition=batch-64gb      64GB nodes
--partition=batch-128gb     128GB nodes
--partition=batch-512gb     512GB nodes
--partition=batch-all       All the nodes except the 512GB nodes

Specific accelerators:
--gres=gpu:2                Nodes with 2 GPUs
--gres=mic:1                Nodes with 1 MIC

Specific features:
--constraint=k20x           Nodes with K20x GPUs
--constraint=5110p          Nodes with Intel Xeon Phi 5110p

Example:
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch-acc
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --constraint=k20x
#SBATCH --error=slurm-%j.err
module load intel/mpi
mpirun -np $SLURM_NTASKS ./my_mpi_app
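By analogy, a hedged sketch of requesting a Xeon Phi (MIC) host node instead of a GPU node, reusing the directives above; my_offload_app is a placeholder for an application built with Xeon Phi support:

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch-acc
#SBATCH --nodes=1
#SBATCH --gres=mic:1            # one Xeon Phi card
#SBATCH --constraint=5110p      # land on a 5110p host node
#SBATCH --error=slurm-%j.err

module load intel/compiler
./my_offload_app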

31 Job Environment
The SLURM controller will set the following variables in the environment of the batch script.

OUTPUT environment variable    Description
$SLURM_JOB_ID                  The unique jobid for a job
$SLURM_JOB_NAME                Name of the job (--job-name)
$SLURM_JOB_NODELIST            Nodes allocated to the job
$SLURM_NTASKS                  Number of processes started for a job
$SLURM_ARRAY_JOB_ID            Job array's master job ID number
$SLURM_ARRAY_TASK_ID           Job array ID (index) number
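A minimal sketch of using these variables inside a job script, e.g. to keep each run's files in a per-job directory on $SCRATCH (the directory layout and my_mpi_app are only illustrative):

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:30:00

module purge
module load slurm
module load intel/mpi

# Record what was allocated and run from a per-job directory
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on: $SLURM_JOB_NODELIST"
WORKDIR=$SCRATCH/runs/$SLURM_JOB_ID
mkdir -p $WORKDIR
cd $WORKDIR
mpirun -np $SLURM_NTASKS ~/my_mpi_app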

32 Interactive Testing and Development: sinteractive # Exercise 7
By default submitted to the ITD partition using the free account
SHARED resources (CPU, MEM, GPU, MIC) among other users on the node
Each user is limited to one interactive job on the ITD partition
For an EXCLUSIVE interactive session, request specific resources:
$ sinteractive --time=00:20:00 --gres=gpu:1
$ sinteractive --time=00:20:00 --gres=gpu:4 --partition=batch-acc

[user123@balena-01 ~]$ sinfo --partition itd
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
itd up infinite 2 mix itd-ngpu-[01-02]
itd up infinite 2 idle itd-phi-[01-02]
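A hedged sketch of a typical interactive workflow, assuming sinteractive drops you into a shell on the allocated ITD node (the node name in the prompt, the module version, and the build/run commands are illustrative):

$ sinteractive --time=00:20:00 --gres=gpu:1      # request a shared interactive session with one GPU
[user123@itd-ngpu-01 ~]$ module load cuda/toolkit
[user123@itd-ngpu-01 ~]$ nvcc my_kernel.cu -o my_kernel
[user123@itd-ngpu-01 ~]$ ./my_kernel             # test interactively before submitting batch jobs
[user123@itd-ngpu-01 ~]$ exit                    # give the resources back when done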

33 Monitoring workloads
Monitoring workloads: CPU usage, memory usage
Profiling workloads: Allinea perf-report and MAP, Intel profilers
Debugging issues with submission scripts

34 Monitoring workloads: top # Exercise 8
$ top

35 Monitoring workloads: htop # Exercise 8
$ module load htop
$ htop
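These tools are most useful on the node where the job is actually running. A hedged sketch, assuming SSH to a node on which you have a running job is permitted on Balena (if not, use the interactive nodes instead); <jobid> and the node name are placeholders:

$ squeue --job <jobid>         # note the entry in the NODELIST column, e.g. node-sw-100
$ ssh node-sw-100              # hop onto the compute node running your job
[node-sw-100 ~]$ top           # watch CPU and memory usage of your processes
[node-sw-100 ~]$ exit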

36 Monitoring workloads: perfquery # Exercise 8
$ watch "perfquery -C qib0 -r"

Every 2.0s: perfquery -C qib0 -r
# Port counters: Lid 211 port 1 (CapMask: 0x200)
PortSelect:...1
CounterSelect:...0x0000
SymbolErrorCounter:...0
LinkErrorRecoveryCounter:...0
LinkDownedCounter:...0
PortRcvErrors:...0
PortRcvRemotePhysicalErrors:...0
PortRcvSwitchRelayErrors:...0
PortXmitDiscards:...0
PortXmitConstraintErrors:...0
PortRcvConstraintErrors:...0
CounterSelect2:...0x00
LocalLinkIntegrityErrors:...0
ExcessiveBufferOverrunErrors:...0
VL15Dropped:...0
PortXmitData:
PortRcvData:
PortXmitPkts:
PortRcvPkts:

37 Allinea Performance Reports example # Exercise 9

$ mpiicc mpi_pi_reduce.c -o mpi_pi

#!/bin/bash -l
#SBATCH --job-name=testjob
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00
#SBATCH --error=slurm-%j.err
#SBATCH --partition=batch

module purge
module load slurm
module load intel/compiler/64/
module load intel/mpi/64/
module load allinea/reports/6.0.2

perf-report mpirun ./mpi_pi

HTML output of the performance report:
Command:     mpirun ./mpi_pi
Resources:   2 nodes (16 physical, 16 logical cores per node)
Memory:      63 GB per node
Tasks:       4 processes
Machine:     node-sw-165
Started on:  Fri Jul 29 11:36:
Total time:  1 second (0 minutes)
Executable:  mpi_pi
Full path:   /beegfs/scratch/user/q/rtm25/training
Input file:
Notes:

Summary: mpi_pi is Compute-bound in this configuration
Compute: 92.9% ========
MPI:      7.1%
I/O:      0.0%
This application run was Compute-bound. A breakdown of this time and advice for investigating further is found in the CPU section below. As very little time is spent in MPI calls, this code may also benefit from running at larger scales.

CPU: A breakdown of the 92.9% total compute time:
Scalar numeric ops:   0.0%
Vector numeric ops:   0.0%
Memory accesses:    100.0% =========
The per-core performance is memory-bound. Use a profiler to identify time-consuming loops and check their cache performance. No time is spent in vectorized instructions. Check the compiler's

38 Scaling

39 Debugging and Profiling
Valgrind:           module load valgrind
Allinea:            module load allinea/ddt-map
                    module load allinea/reports
Intel:              module load intel/itac
Nvidia Profiler:    module load cuda/profiler
GDB:                available on all the nodes by default
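A hedged sketch of how these tools are typically invoked once the corresponding module is loaded (the executable names are placeholders; check each tool's documentation for the exact options):

$ module load valgrind
$ valgrind --leak-check=full ./my_app            # memory-error and leak checking for a serial run
$ module load allinea/ddt-map
$ map --profile mpirun -np 16 ./my_mpi_app       # collect an Allinea MAP profile of an MPI run
$ gdb ./my_app                                   # interactive debugging with GDB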

40 Visualisation
A brief on the technology behind the visualisation nodes (GPU virtualisation)
Multiple use-cases on how our researchers can exploit this technology
Setting the expectations/limits on the capability of our visualisation nodes
Interacting with the visualisation nodes
Example jobs
Sharing a visualisation session with another user

41 Technology overview and benefits
Components: VirtualGL, TurboVNC, web portal
Benefits:
Access to the BeeGFS parallel file system ($SCRATCH) at full speed
Usable from a low-end laptop/tablet

42 Access the Visualisation service
Balena portal: balena.bath.ac.uk
What do you need?
A low-latency connection with reasonable bandwidth to the cluster login nodes:
1MB/sec downstream for low-mid quality JPEG compression (perfectly usable from UK and European ADSL links and over wifi)
10MB/sec downstream (100Mbit) for a high-quality stream (within the university campus with wired network connections)
A web browser with either:
A working Java runtime
A TurboVNC client

43 # Exercise 10
(Balena portal job submission form: sharing options "no shared" / "viewonly", hold down CTRL to select multiple users, --account=prj-tst123, then Submit Job)

44 # Exercise 10
Let's try with a local VNC client (download a client from the HOME tab if you do not have one on the system already), then Connect.

45 # Exercise 10

46 # Exercise 10

47 Finding Help
Balena wiki
HPC Support

48 Objectives of this training
Login to Balena
Understand what storage is available on Balena
Find the software you want to use
Create your own jobscript and be familiar with the options
Submit and manage workloads on Balena
Use interactive nodes for Test and Development
Use visualisation tools
Know where to find more information and how to ask for help

49 Thank you
