Introduction to UAntwerpen-HPC (CalcUA)
Stefan Becuwe, Kurt Lust, Franky Backeljauw, Bert Tijskens


1 UAntwerpen-HPC. Stefan Becuwe, Kurt Lust, Franky Backeljauw, Bert Tijskens.
Agenda:
1. Introduction to UAntwerpen-HPC
2. Getting an HPC account
3. Preparing the environment
4. Running batch jobs
5. Running interactive jobs
6. Running jobs with input/output data
7. Multi-core jobs
8. Multi-job submission
9. UAntwerpen-HPC policies

2 Goal: make you comfortable to work with and use UAntwerpen-HPC.
Structure of the tutorial (question - chapter - title):
What is High Performance Computing exactly? Can it solve my computational needs? - Chapter 1, Introduction to HPC
How to get an account? - Chapter 2, Getting a VSC account
How do I connect to the HPC and transfer my files and programs? - Chapter 3, Preparing the environment
How to start background jobs? - Chapter 4, Running batch jobs
How to start jobs with user interaction? - Chapter 5, Running interactive jobs
Where do the input and output go? Where to collect my results? - Chapter 6, Running jobs with input/output data
Can I speed up my program by exploring parallel programming techniques? - Chapter 7, Multi-core jobs / Parallel programming
Can I start many jobs at once? - Chapter 8, Multi-job submission
What are the policies? - Chapter 9, HPC policies

3 Chapter 1: Introduction to the VSC, the Flemish Supercomputer Center
Vlaams Supercomputer Centrum (VSC):
Established in December 2007
Partnership between 5 university associations: Antwerp, Brussels, Ghent, Hasselt, Leuven
Funded by the Flemish Government through the FWO (Research Foundation - Flanders)
Goal: make HPC available to all researchers in Flanders (academic and industrial)
Local infrastructure (Tier-2) in Antwerp: the HPC core facility

4 VSC partners.
VSC: an integrated infrastructure according to the European model (VSC network, storage and job management), from Tier-0 (European, e.g. PRACE) over Tier-1 (national/regional) and Tier-2 (regional/university) down to Tier-3 (desktop). Related infrastructures in the diagram: PRACE, BEgrid, EGI.

5 UAntwerpen: Hopper
168 compute nodes:
- 144 nodes with 2 ten-core Ivy Bridge processors, 64 GB memory, 500 GB local disk, FDR10-IB network
- 24 nodes with 2 ten-core Ivy Bridge processors, 256 GB memory, 500 GB local disk, FDR10-IB network
3360 cores, theoretical peak performance 75 Tflops
100 TB online GPFS storage
4 login nodes for user access, 2 service nodes for admin and maintenance
UAntwerpen: Turing
168 compute nodes:
- 64 nodes with 2 quad-core Harpertown processors, 16 GB memory, 120 GB local disk, GbE network
- 32 nodes with 2 quad-core Harpertown processors, 16 GB memory, 120 GB local disk, DDR-IB network
- 64 nodes with 2 six-core Westmere processors, 24 GB memory, 120 GB local disk, QDR-IB network
- 8 nodes with 2 six-core Westmere processors, 24 GB memory, 120 GB local disk, GbE network
1632 cores, theoretical peak performance 15.5 Tflops
2 login nodes for user access

6 Hardware: the Tier-1 infrastructure Muk (UGent)
Inaugurated on October 25, 2012
528 compute nodes with 2 eight-core Sandy Bridge processors, 64 GB memory, FDR-IB network
400 TB net storage capacity, peak bandwidth 9.5 GB/s
8448 cores, theoretical peak performance 175 Tflops
#118 in the Top 500 of June 2012
The next Tier-1 system will be installed in Leuven; pilot use is expected to start in July. Muk will be phased out at the end of 2016.

7 Tier-1 infrastructure Muk (UGent). Some characteristics:
Shared infrastructure, used by multiple users simultaneously, so you have to request the amount of resources you need, and you may have to wait a little. Therefore it is mostly for batch applications and not well suited for interactive applications.
Built for parallel work: no parallelism = no supercomputing. Not meant for running a single single-core job.
Remote use model, but you're running (a different kind of) remote applications all the time.
Linux-based, so no Windows or OS X software, but most standard Linux applications will run.

8 Chapter 2: Getting a VSC account
Create SSH keys with ssh-keygen:
/home/<username>/.ssh/id_rsa - this is your private key
/home/<username>/.ssh/id_rsa.pub - this is your public key
The public key ends up on the cluster in /user/antwerpen/2xx/<vsc-username>/.ssh/authorized_keys
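As an illustration, generating such a key pair on a Linux or OS X client could look as follows (a minimal sketch using standard OpenSSH options; the -t and -b options are generic and not prescribed by this course):
$ ssh-keygen -t rsa -b 4096      # accept the default location, choose a passphrase
$ ls ~/.ssh
id_rsa  id_rsa.pub               # private key (keep it safe!) and public key
$ cat ~/.ssh/id_rsa.pub          # contents to upload during the VSC account registration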

9 Register your account: web-based procedure; your VSC username is vsc2xxxx.
Exercises: follow the instructions in Chapter 2, Getting an account.

10 Chapter 3: Connecting to Hopper
A typical workflow:
1. Connect to Hopper
2. Transfer your files to Hopper
3. Build your environment
4. Compile your source code and test it
5. Submit your job
6. Wait while your job gets into the queue, gets executed and finishes
7. Move your results

11 Connect to Hopper
Open a terminal:
Windows (7): All Programs > PuTTY
OS X: Applications > Utilities > Terminal
Linux (Ubuntu): Applications > Accessories > Terminal (or Ctrl+Alt+T)
Log in via secure shell: $ ssh vsc2xxxx@login.hpc.uantwerpen.be
Hopper login nodes: login.hpc.uantwerpen.be is a generic name for the 4 login nodes (DNS alias).
Limited access: only accessible from within the campus network; external access (e.g., from home) is possible using UA-VPN.
The login nodes are for editing and submitting jobs; it is best to compile on the target compute nodes. Do not use the login nodes for running your executables.

12 Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (your job gets into the queue, gets executed, finishes) - 7. Move your results
Transfer your files:
Graphical SSH file managers for Windows, OS X, Linux: PuTTY, WinSCP, CyberDuck, FileZilla, ...
Command line: scp (secure copy)
Copy from the local computer to Hopper: $ scp file.ext vsc2xxxx@login.hpc.uantwerpen.be:
Copy from Hopper to the local computer: $ scp vsc2xxxx@login.hpc.uantwerpen.be:file.ext .

13 Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (queue, execution, finish) - 7. Move your results
Available development software:
C/C++/Fortran compilers: Intel Parallel Studio XE Cluster Edition and GCC. The Intel compiler comes with several tools for performance analysis and assistance in vectorisation/parallelisation of code.
Fairly up-to-date OpenMP support in both compilers. Support for other shared memory parallelisation technologies such as Cilk Plus, other sets of directives, etc., to a varying level in both compilers. PGAS Co-Array Fortran support in the Intel compiler (and some in GNU Fortran).
Message passing libraries: Intel MPI, Open MPI
Mathematical libraries: Intel MKL, OpenBLAS, FFTW, MUMPS, GSL, ...
Many other libraries, including HDF5, NetCDF, Metis, ...
Python

14 Available application software:
ABINIT, CP2K, OpenMX, QuantumESPRESSO, VASP, CHARMM, GAMESS-US, LAMMPS, Gaussian, Gromacs, Molpro, Telemac, SAMtools, TopHat, Bowtie, BLAST, MATLAB, R, ...
Additional software can be installed on demand.
Available software, the fine print:
Restrictions on commercial software: 5 licenses for Intel Parallel Studio; no license for the Parallel Computing Toolbox in MATLAB. Other licenses are paid for and provided by the users; the owner of the license decides who can use it.
No user software is part of the system software. Additional software can be installed on demand: take into account the system requirements, and provide working building instructions and licenses. Compilation support is limited (best effort, no code fixing). Software is installed in /apps/antwerpen.

15 Operating system and system management
Operating system: Scientific Linux 6.x, a Red Hat Enterprise Linux (RHEL) 6 clone, with a kernel with cpuset support and BLCR modules.
Job management and accounting: Torque (resource manager, based on PBS), MOAB (job scheduler and management tools), MAM (allocation manager and accounting system).
Maintenance and monitoring: Icinga (monitoring software), CMU (node management).
Building your environment: the module command
Dynamic software management through modules:
Set environment variables: $PATH, $LD_LIBRARY_PATH, ...
Application-specific environment variables
Automatically pull in required dependencies
VSC-specific variables that you can use to point to the location of a package (they typically start with EB)
Software built on our system is organised in toolchains. Packages built with the same toolchain will interoperate. Build tools (compilers etc.) are refreshed twice yearly: 2015a, 2015b, 2016a, ...
intel toolchain: Intel compilers, MPI and math libraries.
foss toolchain: GNU compilers, Open MPI, OpenBLAS, FFTW, ...
Many packages have intel-2015a or foss-2015b in their name, which denotes that they are built with that toolchain.

16 Building your environment: the module command
A module name is typically of the form <name of package>/<version>, e.g.
intel/2016a : the Intel toolchain package, version 2016a
PLUMED/2.1.0-intel-2014a : the PLUMED library, version 2.1.0, built with the intel/2014a toolchain
Some module commands (see also the example session below):
module avail : list installed software packages
module avail intel : list all available versions of the intel package
module help foss : show help about a module (the module foss in this example)
module list : list all loaded modules in the current session
module load hopper/2015b : load a list of modules
module load MATLAB/R2015b : load a specific version
module unload MATLAB : unload the MATLAB settings
module purge : unload all modules (incl. dependencies)
Exercises. Getting ready (see also 3.1):
Log on to the cluster
Go to your data directory: cd $VSC_DATA
Copy the examples for the tutorial to that directory: cp -r /apps/antwerpen/tutorials/intro-hpc/examples .
Experiment with the module command as in 3.3, Preparing your environment: Modules
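To make this concrete, a short terminal session could look as follows (a sketch only: the module and version names are the ones used as examples above and may differ on the current system):
$ module avail MATLAB            # which MATLAB versions are installed?
$ module load hopper/2015b       # load the recommended list of modules for the cluster
$ module load MATLAB/R2015b      # load a specific MATLAB version
$ module list                    # verify what is loaded in this session
$ module purge                   # start again from a clean environment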

17 Chapter 4: Running batch jobs
Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (your job gets into the queue, gets executed, finishes) - 7. Move your results

18 Job workflow: what happens behind the scenes?
An example job script:
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo
You submit the job, and Moab and Torque query the cluster and manage it (figure: Job Management System Architecture). Don't worry too much about who does what.

19 The job script
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo
This is a bash script (but it could be, e.g., Perl). The #PBS lines specify the resources for the job (and some other instructions for the resource manager), but look like comments to bash. Next, build a suitable environment for your job; the script executes in your home directory! Then follow the command(s) that you want to execute in your job.
The job script specifies the resources needed for the job and other instructions for the resource manager: the number and type of CPUs, how much compute time, the output files (stdout and stderr, optional); it can also instruct the resource manager to send mail when a job starts or ends. This is followed by a sequence of commands to run on the compute node. The script runs in your home directory, so don't forget to go to the appropriate directory. Load the modules you need for your job, then specify all the commands you want to execute: these are really all the commands that you would also run interactively or in a regular script to perform the tasks you want to perform.

20 Intermezzo: why do Torque commands start with PBS?
In the 1990s there was a resource manager called Portable Batch System, developed by a contractor for NASA. It was open-sourced, but that company was acquired by another company that then sold the rights to Altair Engineering, which evolved the product into the closed-source product PBSpro. The open-source version was forked by what is now Adaptive Computing and became Torque (Terascale Open-source Resource and QUEue manager), which remained open source. The name was changed, but the commands remained.
Likewise, Moab evolved from MAUI, an open-source scheduler. Adaptive Computing, the company behind Moab, contributed a lot to MAUI but then decided to start over with a closed-source product. They still offer MAUI on their website, though. MAUI used to be widely used in large US supercomputer centres, but most now throw their weight behind SLURM, with or without another scheduler.
Environment variables in job scripts
The resource manager passes some environment variables:
$PBS_O_WORKDIR : directory from where the script was submitted
$PBS_JOBID : the job identifier
$PBS_NUM_PPN : number of cores/hardware threads per node available to the job, i.e., the value of the ppn parameter (see later)
$PBS_NP : total number of cores/hardware threads for the job
$PBS_NODES, $PBS_JOBNAME, $PBS_QUEUE, ...
Look for "Exported batch environment variables" in the Torque manual (currently version 4.2 on our clusters).
Some VSC-specific environment variables:
$VSC_HOME = /user/antwerpen/2xx/vsc2xxxx, your home directory
$VSC_DATA = /data/antwerpen/2xx/vsc2xxxx, your data directory
$VSC_SCRATCH = /scratch/antwerpen/2xx/vsc2xxxx, shared scratch space
$VSC_SCRATCH_NODE = /tmp, scratch directory on the node
Variables that start with EB point to the root directories of software packages loaded with the module command.
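As an illustration, a small job script could simply print some of these variables to see what the resource manager passes along (a sketch; the echo lines are illustrative and not part of the course material):
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
echo "Job id:             $PBS_JOBID"
echo "Submitted from:     $PBS_O_WORKDIR"
echo "Cores on this node: $PBS_NUM_PPN"
echo "Total cores:        $PBS_NP"
echo "Data directory:     $VSC_DATA"
echo "Shared scratch:     $VSC_SCRATCH"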

21 Specifying resources in job scripts: /apps/antwerpen/examples/example.sh
#!/bin/bash                     <- the shell (language) in which the script is executed
#PBS -N name_of_job             <- name of the job (optional)
#PBS -l walltime=01:00:00       <- requested running time (required)
#PBS -l nodes=1:ppn=1           <- -l resource specifications (e.g., nodes, processors per node)
#PBS -l pmem=1gb                <- amount of memory per process
#PBS -m e                       <- mail notification at begin, end or abort (optional)
#PBS -M <your e-mail address>   <- your e-mail address (should always be specified)
#PBS -o stdout.file             <- redirect standard output (-o) and standard error (-e) (optional);
#PBS -e stderr.file                if omitted, a generic filename will be used
cd $PBS_O_WORKDIR               <- change to the current working directory (optional, if it is the
                                   directory where you submitted the job)
echo Job started : `/bin/date`
./hello_world                   <- the program itself
echo Job finished : `/bin/date`
Submit it with: $ qsub example.sh
Specifying resources in job scripts: requesting cores
#PBS -l nodes=2:ppn=20 : request 2 nodes, using 20 cores (in fact, hardware threads) per node. Different node types have a different number of cores per node; on Hopper, all nodes have 20 cores (until we install new nodes).
#PBS -l nodes=2:ppn=20:ivybridge : request 2 nodes, using 20 cores per node, and those nodes should have the property ivybridge (which is currently all nodes on Hopper).
See the UAntwerpen clusters page in the Available hardware section of the User Portal on the VSC website for an overview of properties for various node types.
What happens if you don't request all cores on a node? Other jobs of yours may start on the same node, but in our current setup other users cannot use these cores (so that they cannot accidentally crash your job). You're likely wasting expensive resources!

22 Specifying resources in job scripts: requesting memory
#PBS -l mem=55gb : the job needs 55 GB of memory, but only use mem for single-node jobs!
#PBS -l pmem=2.5gb : this job needs 2.5 GB of memory per core.
vmem and pvmem do the same for virtual memory, but we strongly discourage jobs that need swapping as they tend to bring the node to a halt.
What happens if you don't use all memory and cores? Other jobs, but only yours, can use them.
If you submit a lot of 1-core jobs that use more than 2.75 GB each on Hopper: please specify the fat property for the nodes, so that they get scheduled on nodes with more memory and also fill up as many cores as possible.
See the UAntwerpen clusters page in the Available hardware section of the User Portal on the VSC website for the usable amounts of memory for each node type. This is always less than the physical amount of memory in the node.
Specifying resources in job scripts: requesting compute time
#PBS -l walltime=30:25:55 : this job is expected to run for 30h25m55s.
What happens if I underestimate the time? Bad luck, your job will be killed when the walltime has expired.
What happens if I overestimate the time? Not much, but your job may have a lower priority for the scheduler, or may not run because we limit the number of very long jobs that can run simultaneously. Also, the estimates for the start time of other jobs that the scheduler produces will be even worse than they already are.
The maximum wall time is 7 days on Hopper and 21 days on Turing. But remember what we said in last week's lecture about reliability: longer jobs will need a checkpoint-and-restart strategy.
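Putting these resource requests together, the top of a job script on Hopper might look like this (a sketch combining the directives discussed above; the job name, values and toolchain module are illustrative):
#!/bin/bash
#PBS -N my_simulation
#PBS -l nodes=1:ppn=20        # one full Hopper node (20 cores)
#PBS -l pmem=2gb              # 2 GB of memory per core, 40 GB in total
#PBS -l walltime=12:00:00     # killed after 12 hours if not finished
cd $PBS_O_WORKDIR
module load foss/2016a        # illustrative toolchain module
./my_program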

23 Specifying resources in job scripts
All resource specifications start with #PBS -l. These #PBS lines look suspiciously like command line arguments? That's because they can also be specified on the command line of qsub. The full list of options is in the Torque manual (search for "Requesting resources"), but it is very unlikely that you'll need one that is not yet mentioned here.
Remember that available resources change over time as new hardware is installed and old hardware is decommissioned. Share your scripts, but don't forget to check the resources from time to time!
Non-resource-related #PBS parameters
#PBS -N job_name : sets the name of the job, which is used, e.g., in constructing default file names for stdout and stderr.
#PBS -o stdout.file : send the stdout output of the job to the file stdout.file instead of the default one.
#PBS -e stderr.file : send the stderr output of the job to the file stderr.file instead of the default one.
#PBS -m abe -M my.mail@uantwerpen.be : send mail when the job is aborted (a), begins (b) or ends (e) to the given mail address. Don't forget -M if you use -m, or the mail will never arrive!
Any other qsub command line argument could be specified in the job file as well, but these are the most useful ones. See the qsub manual page in the Torque manual. Example job scripts can be found in /apps/antwerpen/examples.

24 Submitting jobs
If all resources are specified in the job script, submitting a job is as simple as: $ qsub jobscript.sh
But all parameters specified in the job script in #PBS lines can also be specified on the command line, e.g.
$ qsub -l nodes=x[:ppn=y][,pmem=zmb] example.sh
x : the number of nodes
y : the number of cores per node (max ppn : 20 for the Ivy Bridge nodes on Hopper)
pmem : the amount of memory needed per core (other memory specifications: mem, vmem, pvmem)
Submitting jobs (Turing)
On Turing, additional properties are often needed:
$ qsub -l nodes=x[:ppn=y][:harpertown|westmere][:ib|gbe][,pmem=zmb]
x : the number of nodes
y : the number of cores per node (max ppn : 8 for Harpertown nodes, 24 for Westmere nodes, where each hardware thread is shown as a different virtual core)
harpertown or westmere : explicitly choose the architecture
ib or gbe : use InfiniBand (IB) or Gigabit Ethernet (GbE)
pmem : the amount of memory needed per core (other memory specifications: mem, vmem, pvmem)
A job can only run within a single node set (nodes with the same processor architecture and network interface).

25 Enforcing the node specification
The scheduler will try to fill up the compute nodes completely, so specifying, e.g., -l nodes=2:ppn=4 might still result in the job being run on only one physical compute node. To enforce the node specification: -W x=nmatchpolicy:exactnode
The scheduler will also try to fill up a node with several jobs (though only your own), so specifying, e.g., -l nodes=2:ppn=4 might still result in several of such jobs being run on the same compute nodes. To enforce a single job on the compute nodes: -l naccesspolicy=singlejob
These are very asocial options, so use them only if you have a very good reason to do so!
Queue manager (figure: Job Management System Architecture, showing the example job script being submitted to Torque, which queries the cluster).

26 Queue manager
Jobs are submitted to the queue manager, which provides functionality to start, delete, hold, cancel and monitor jobs on the queue. Some commands (see the example session below):
qsub : submit a job
qdel : delete a submitted (queued and/or running) job
qstat : get the status of your jobs on the system
Under the hood: there are multiple queues, but qsub will place the job in the most suitable queue. On our system, queues are based on the requested walltime. There are limits on the number of jobs a user can submit to a queue or have running from one queue, to avoid that a user monopolises the system for a prolonged time.
Job scheduling (figure: Job Management System Architecture, now highlighting MOAB, which queries the cluster when a job is submitted).
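A typical submit-and-monitor cycle with these commands could look like this (a sketch; the job id 12345 is made up for the example):
$ qsub jobscript.sh       # prints the job id, e.g. 12345.<servername>
$ qstat                   # list the status of your jobs (Q = queued, R = running)
$ qstat -f 12345          # detailed information about one job
$ qdel 12345              # remove the job from the queue (or kill it if it is already running)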

27 Job scheduling: MOAB
The job scheduler decides which job will run next based on (continuously evolving) priorities. It works with the queue manager, and with the resource manager to check available resources, and tries to optimise (long-term) resource use, keeping fairness in mind:
priorities : used to regulate the promotion of queued jobs, based on the requested resources and the recent history of the user
fairshare : spread the use of resources fairly over all users by lowering the priority of jobs of users who have recently used a lot of resources
backfill : allows some jobs to be run 'out of order' provided they do not delay the highest-priority jobs in the queue
Lots of other features: QoS, reservations, ...
Getting information from the scheduler (see the commands below):
showq : displays information about the queue as the scheduler sees it (i.e., taking into account the priorities assigned by the scheduler), in 3 categories: starting/running jobs, eligible jobs and blocked jobs. You can only see your own jobs!
checkjob : detailed status report for the specified job
showstart : shows an estimate of when a job can/will start. This estimate is notoriously inaccurate! A job can start earlier than expected because other jobs take less time than requested, but it can also start later if a higher-priority job enters the queue.
showbf : shows the resources that are immediately available to you should you submit a job that matches those resources
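For example (a sketch; the job id is again made up and the output is not shown here):
$ showq             # the queue as the scheduler sees it: running, eligible and blocked jobs
$ checkjob 12345    # detailed status report for job 12345 (why is it not starting?)
$ showstart 12345   # rough estimate of when job 12345 can start
$ showbf            # resources that are immediately available to you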

28 Information returned by showq
active jobs : running or starting, and have resources assigned to them
eligible jobs : queued and considered eligible for scheduling and backfilling
blocked jobs : jobs that are ineligible to be run or queued
Job states in the STATE column:
idle : the job violates a fairness policy
userhold or systemhold : user or administrative hold
deferred : a temporary hold when unable to start after a specified number of attempts
batchhold : the requested resources are not available, or the job repeatedly failed to start
notqueued : the scheduling daemon is unavailable
Resource manager (figure: Job Management System Architecture, now highlighting Torque, which handles the submitted job on the cluster).

29 Resource manager: Torque
The resource manager manages the cluster resources (e.g., CPUs, memory, ...). This includes the actual job handling at the compute nodes: starting, holding, cancelling and monitoring jobs on the compute nodes, enforcing resource limits (e.g., the available memory for your job), and logging and accounting of used resources. It consists of a server component and a small daemon process that runs on each compute node.
Further notes
Currently a node is never shared by users. This seems like a waste of resources, but: if you can't fill up a node, you shouldn't be using a supercomputer but buy a workstation; and parallel jobs can influence each other in unpredictable ways and may ruin each other's performance. We want users to have control over this. Multiple jobs of a single user can run on a single node; it is up to the user to ensure that they can run efficiently. This also implies that you cannot log in to a node (using ssh) unless you have a job running on that node.
Two very technical but sometimes useful resource manager commands:
pbsnodes : returns information on the configuration of all nodes
tracejob : looks back into the log files for information about a job and can be used to get more information about job failures

30 Exercises: do the exercises in Chapter 4, Batch jobs. The scripts for 4.1 are in examples/running-batch-jobs. Now extend that script to request 1 core on 1 node with a walltime of 1 minute.
Chapter 5: Running interactive jobs

31 Interactive jobs
Starting an interactive job:
qsub -I : open a shell on a compute node
qsub -I -X : open a shell with X support; you must log in to the cluster with X forwarding enabled (ssh -X), and this works only when you are requesting a single node
When to start an interactive session: to test and debug your code, to use (graphical) input/output at runtime, only for short-duration jobs, or to compile on a specific architecture. A short example session follows below.
Exercises: do the exercises in Chapter 5, Interactive jobs. Files are in examples/running-interactive-jobs: a text example in 5.2, a GUI program in 5.3 (assuming you have an X server installed).
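Such an interactive session could look like this (a sketch; the requested resources and the MATLAB test are illustrative):
$ qsub -I -l nodes=1:ppn=20 -l walltime=01:00:00
# ... wait until the job starts; you then get a shell on a compute node ...
$ module load MATLAB
$ cd $VSC_SCRATCH
$ matlab -nodisplay       # run and test interactively
$ exit                    # ends the interactive job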

32 Chapter 6: Input / Output
Files, stdout, stderr
The output to stdout and stderr will be redirected to files by the resource manager. These two files end up by default in the directory from which you submitted the job (qsub test.pbs).
Default naming convention: stdout goes to <job-name>.o<job-id>, stderr to <job-name>.e<job-id>. The default job name is the name of the job script. Override the default name with #PBS -N my_job, and you get my_job.o<job-id> and my_job.e<job-id>.
Other files are put by default in the current working directory. This is your home directory when your job starts!

33 Directories and quota
$VSC_SCRATCH = /scratch/antwerpen/2xx/vsc2xxxx : fast scratch for temporary files; default block quota 25 GB, default file quota 100,000 files
$VSC_DATA = /data/antwerpen/2xx/vsc2xxxx : to store long-term data; default block quota 25 GB, default file quota 100,000 files
$VSC_HOME = /user/antwerpen/2xx/vsc2xxxx : only for small input and configuration files; default block quota 5 GB, default file quota 20,000 files
Directories and quota: GPFS (IBM General Parallel File System)
Your quota is shown when you log in; there are quota for both size and number of files. Use show_quota.py to check:
$ module load scripts
$ show_quota.py
VSC_DATA: used 13640MB (56%) quota 25600MB
VSC_HOME: used 2156MB (74%) quota 5120MB
VSC_SCRATCH: used 19907MB (82%) quota 25600MB

34 Best practices for file storage
The cluster is not for long-term file storage. You have to move your results back to your laptop or a server in your department. There is currently no backup; we are considering testing a backup solution for the home and data directories. Old data on scratch can be deleted if scratch fills up.
Remember that the cluster is optimised for parallel access to big files, not for using tons of small files (e.g., one per MPI process). If you run out of block quota, you'll get more without too much motivation; if you run out of file quota, you'll have a lot of explaining to do before getting more.
Text files are good for summary output while your job is running, or for data that you'll read into a spreadsheet, but not for storing 1000x1000 matrices. Use binary files for that!
Exercises: do the exercises in Chapter 6, Jobs with input/output. Files are in examples/running-jobs-with-input-output-data. 6.3 Writing output files: optional. 6.4 Reading an input file: homework, not in this session (and most useful if you understand Python scripts).

35 Chapter 7: Multi-core jobs
Why parallel computing?
1. Faster time to solution: distributing the code over N cores, hoping for a speedup by a factor of N.
2. Larger problem size: distributing your code over N nodes increases the available memory by a factor of N, hoping to tackle problems which are N times bigger.
3. In practice the gain is limited due to communication and memory overhead and sequential fractions in the code; the optimal number of nodes is problem-dependent. But there is no escape possible, as computers don't really become faster for serial code. Parallel computing is here to stay.

36 Shared memory - threading
OpenMP: an API that implements a multi-threaded, shared memory form of parallelism. It is a set of compiler directives, so the original serial version stays intact; an automated, higher-level API.
POSIX threads (pthreads): doing multi-threaded programming "by hand".
Distributed memory - MPI
Message Passing Interface (MPI): a single executable running concurrently on multiple processors, with communication between the processes. The executable is started from one core using mpirun. (Figure: processes 0, 1, 2, ... executing a parallel task.) A sketch of an OpenMP job script follows below; the next slide shows an MPI job script.
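The course only shows an MPI job script (next slide); as a complementary sketch, a shared-memory OpenMP program could be launched from a job script roughly as follows (the program name omp_hello and the chosen toolchain are illustrative, not taken from the course text):
#!/bin/bash
#PBS -N omphello
#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=20                 # OpenMP is shared memory: a single node
module purge
module load foss/2015b                 # or whichever toolchain was used to build the program
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=$PBS_NUM_PPN    # one thread per requested core
./omp_hello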

37 Starting an MPI job (Intel MPI)
Example script (generic-mpi.sh):
#!/bin/bash
#PBS -N mpihello
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=20
module purge
module load intel/2015b
cd $PBS_O_WORKDIR
mpirun ./mpihello
The script will use two nodes with 20 cores each. Make sure to purge all modules (dependencies), load the Intel toolchain (for the MPI libraries and mpirun), change to the current working directory, and run the MPI program (mpihello). mpirun communicates with the resource manager to acquire the number of nodes, cores per node, node list, etc.
Exercises: do the exercises in Chapter 7, Multi-core jobs. Scripts are in examples/multi-core-jobs-parallel-computing. 7.2 Example pthreads program (shared memory). 7.3 Example OpenMP program (shared memory): only look at the first example (omp1.c). 7.4 Example MPI program (distributed memory).

38 Chapter 9: Policies on our clusters
Some site policies. Several site policies have been implemented:
Single user per node policy
Priority-based scheduling (so not first come, first get)
Fairshare mechanism: make sure one user cannot monopolise the cluster
Crediting system (dormant, not enforced)
Implicit user agreement: the cluster is valuable research equipment. Do not use it for other purposes than your research for the university: no SETI-at-home and similar initiatives, and not for private use. You have to acknowledge the VSC in your publications. Don't share your account.

39 Single user per node
Policy motivation:
Parallel jobs can suffer badly if they don't have exclusive access to the full node.
Sometimes the wrong job gets killed if a user uses more memory than requested and a node runs out of memory.
A user may accidentally spawn more processes/threads and use more cores than expected (depends on the OS).
Remember: the scheduler will (try to) fill up a node with several jobs from the same user. If you don't have enough work for a single node, you need a good PC/workstation and not a supercomputer.
Priorities
The scheduler associates a priority number with each job; the highest-priority job will (usually) be the next one to run, and jobs with a negative priority will (temporarily) be blocked. Currently, the priority calculation is based on:
the time a job is queued: the priority of a job is increased in accordance with the time (in minutes) it has been waiting in the queue to get started
a fairshare window of 7 days with a target of 10%: the priority of jobs of a user who has used too many resources over the current fairshare window is lowered
This formula might be adapted in the future (this will be communicated).

40 Fairshare
Mechanism to incorporate historical resource utilization into job feasibility and priority decisions. It only affects the job's priority relative to other jobs (fairshare allocation of computing time). Fairshare usage is accumulated per interval of 24 hours over a fairshare depth of 7 days, with a decay of 80% per interval, giving the effective fairshare. (Figure: fairshare usage per interval from past to present.)
Runtime example (priorities, fairshare, eligible jobs), taken on Mon Oct 8 09:25:47 CEST 2012:
showq : lists the eligible jobs (here jobs of vsc20067 and vsc20034, 96 processors each, with walltime limits of 1 and 7 days) and the blocked jobs (here a job of vsc20118, 8 processors, walltime limit of almost 7 days, queued since Wed Oct 3).
mdiag -f : shows the fairshare configuration (depth: 7 intervals, interval length 1:00:00:00, decay rate 0.80) and, per user, the fairshare target and total usage.
mdiag -p : shows how each job's priority is built up from credential, fairshare and service (queue time) components, with a large weight (500) on the fairshare component.
checkjob : shows, for one job, the used walltime, the submit time, the total and eligible queued time, and the resulting start priority.

41 Credits
At UAntwerp we monitor cluster use and send periodic reports to group leaders.
On the Tier-1 system at UGent, you get a compute time allocation (a number of node days); these are weakly enforced. There is a free test ride (motivation required), and further compute time is requested through a proposal.
At KU Leuven, you need compute credits, which are enforced by the scheduler and a component called MAM (Moab Accounting Manager). Credits can be bought via KU Leuven. The charge policy (subject to change) is based upon: a fixed start-up cost, the used resources (number and type of nodes), and the actual duration (used walltime).
Part B: Advanced Topics

42 Agenda part B
10. Multi-job submission
11. Compiling
12. Monitoring HPC utilization
13. Checkpointing
14. Program examples
15. Good practices
Chapter 10: Multi-job submission

43 Worker framework
Many users run many serial jobs, e.g., for a parameter exploration. Problem: submitting many short jobs to the queueing system puts quite a burden on the scheduler, so try to avoid it. The Worker framework supports:
parameter sweep : many small jobs and a large number of parameter variations; parameters are passed to the program through command line arguments
job arrays : many jobs with different input/output files
Worker framework example case: parameter exploration
weather.sh:
#!/bin/bash -l
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
weather -t $temperature -p $pressure -v $volume
data.csv (data in CSV format):
temperature,pressure,volume
293.0, 1.0e05, 87
..., ..., ...
..., 1.0e05, 75
Submit with wsub:
$ module load worker
$ wsub -data data.csv -batch weather.sh
weather will be run for all data, until all computations are done.

44 Worker framework: an alternative for job arrays
jobarray.sh (for every run, there is a separate input file and an associated output file):
#!/bin/bash -l
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
INPUT_FILE="input_${PBS_ARRAYID}.dat"
OUTPUT_FILE="output_${PBS_ARRAYID}.dat"
program -input ${INPUT_FILE} -output ${OUTPUT_FILE}
Submit with wsub:
$ module load worker
$ wsub -t 1-100 -batch jobarray.sh
program will be run for all (100) input files.
Exercises: do the exercises in Chapter 10, Multi-job submission. Files are in the subdirectories of examples/multi-job-submission. Have a look at the parameter sweep example weather from 10.1 and at the job array example from 10.2. (Time left: have a look at the map-reduce example from 10.3.)

45 Chapter 11: Compiling
Process
Check the pre-installed software and toolchains; other software is installed on demand. Porting your code to Linux / cluster environments is your own responsibility. A typical process looks like:
Copy your software to UAntwerpen-HPC
Start an interactive session
Compile
Test (locally) on a compute node
Generate your job scripts
Test it on the UAntwerpen-HPC
Run it (in parallel, we hope)

46 Toolchains
We currently support the GNU compiler collection and the Intel compilers on all VSC systems. If you really need LLVM or any other compiler, we can discuss. They are made available via toolchains, e.g.:
intel/2015b : all Intel components (C/C++/Fortran, MPI, MKL)
foss/2015b : gcc/gfortran, Open MPI, OpenBLAS, LAPACK, FFTW, binutils with a recent assembler
Compiler commands (without MPI / with MPI):
C : GNU gcc / mpicc, Intel icc / mpiicc
C++ : GNU g++ / mpicxx, Intel icpc / mpiicpc
Fortran : GNU gfortran / mpif90, Intel ifort / mpiifort
Exercises: do the exercises in Chapter 11, Compiling and testing your software on the HPC. See the files in examples/compiling-and-testing-your-software-on-the-hpc for an example serial program and an example parallel program (or for an example with the Intel compilers).
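For instance, compiling a serial and an MPI program with each toolchain could look roughly like this (a sketch; the source file names and the -O2 flag are illustrative):
$ module load foss/2015b
$ gcc -O2 -o hello hello.c             # serial program, GNU toolchain
$ mpicc -O2 -o mpihello mpihello.c     # MPI program, GNU toolchain (Open MPI)
$ module purge
$ module load intel/2015b
$ icc -O2 -o hello hello.c             # serial program, Intel toolchain
$ mpiicc -O2 -o mpihello mpihello.c    # MPI program, Intel toolchain (Intel MPI)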

47 Checkpointing (Chapter XX)
Checkpointing makes it possible to run jobs for weeks or months or ... A snapshot is taken each time a sub-job is running out of requested walltime; subsequently, a new sub-job will be scheduled. Extra options: job arrays, using shared scratch, ...
It uses Berkeley Lab Checkpoint/Restart (BLCR), but we're experimenting with another tool as well.
Note however that the highest-performance and most robust checkpointing is the checkpointing provided by the applications themselves. Most good applications store a state from time to time from which they can restart.

48 Checkpointing: the csub command
Checkpoint submit command: csub. Specify at least the job script, and preferably also the queue. Specify the time for checkpointing with chkpt_time and the walltime for the job with job_time. #PBS directives are ignored (apart from nodes=...).
Example:
$ module load scripts
$ csub -s jobscript.sh -q qshort --job_time=00:04:30 --chkpt_time=00:01:00
$ watch qstat
$ vi csub-output.txt
For a more complete introduction, see the slides of the "HPC Tips & Tricks 3: Using checkpointing" session of December 2015.
Chapter 13: Good practices

49 Debug & test
Before starting you should always check: are there any errors in the script? Are the required modules loaded? Is the correct executable used?
Check your jobs at runtime: log in on the node and use, e.g., top, vmstat, ...; alternatively, run an interactive job (qsub -I).
If you will be doing a lot of runs with a package, try to benchmark the software for scaling issues when using MPI and for I/O issues.
Hints & tips
A job will log on to the compute node(s) and start executing the commands in the job script. It starts in your home directory ($VSC_HOME), so go to the current directory with cd $PBS_O_WORKDIR, and don't forget to load the software with module load.
Why is my job not running? Use checkjob; commands might time out with an overloaded scheduler. Submit your job and wait (be patient).
Using $VSC_SCRATCH_NODE (/tmp on the compute node) might be beneficial for jobs with many small files. But that scratch space is node-specific and hence will disappear when your job ends, so copy what you need to your data directory or to the shared scratch; a sketch of such a staging pattern follows below.
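A job script that stages its files through the node-local scratch could be structured roughly as follows (a sketch; the directory and program names and the toolchain module are illustrative):
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
# stage the input to the node-local scratch (fast for many small files)
cp -r input/ $VSC_SCRATCH_NODE/
cd $VSC_SCRATCH_NODE
module load foss/2015b                       # illustrative toolchain
$PBS_O_WORKDIR/my_program input/ output/
# node-local scratch disappears when the job ends: copy the results back
mkdir -p $VSC_DATA/results_$PBS_JOBID
cp -r output $VSC_DATA/results_$PBS_JOBID/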

50 Warnings
Avoid submitting many small jobs; group them using a job array or the Worker framework.
Runtime is limited by the maximum walltime of the queues; for longer walltimes, use checkpointing.
Requesting many processors could imply long waiting times (though we're still trying to favour parallel jobs).
What a cluster is not: a computer that will automatically run the code of your (PC) application much faster or for much bigger problems.
User Support

51 User support
Try to get self-organized: there are limited resources for user support (best effort, no code fixing). Contact us and be as precise as possible.
VSC documentation: vscentrum.be/en/user-portal
External documentation: docs.adaptivecomputing.com (manuals of Torque and MOAB; we're currently at version 7 of the Enterprise Edition)
Mailing lists and the local pages at uantwerpen.be/hpc
Mailing lists:
announcements : calcua-announce@sympa.ua.ac.be (for official announcements and communications)
users : calcua-users@sympa.ua.ac.be (for discussions between users)
Contacts:
e-mail : hpc@uantwerpen.be
phone : 3860 (Stefan), 3855 (Franky), 3852 (Kurt), 3879 (Bert)
office : G.310, G.309 (CMI)
Don't be afraid to ask, but being as precise as possible when specifying your problem definitely helps in getting a quick answer.


More information

HPC Tutorial Last updated: November For Windows Users

HPC Tutorial Last updated: November For Windows Users VLAAMS SUPERCOMPUTER CENTRUM Innovative Computing for A Smarter Flanders HPC Tutorial Last updated: November 7 2018 For Windows Users Authors: Franky Backeljauw 5, Stefan Becuwe 5, Geert Jan Bex 3, Geert

More information

Computing with the Moore Cluster

Computing with the Moore Cluster Computing with the Moore Cluster Edward Walter An overview of data management and job processing in the Moore compute cluster. Overview Getting access to the cluster Data management Submitting jobs (MPI

More information

Quick Start Guide. by Burak Himmetoglu. Supercomputing Consultant. Enterprise Technology Services & Center for Scientific Computing

Quick Start Guide. by Burak Himmetoglu. Supercomputing Consultant. Enterprise Technology Services & Center for Scientific Computing Quick Start Guide by Burak Himmetoglu Supercomputing Consultant Enterprise Technology Services & Center for Scientific Computing E-mail: bhimmetoglu@ucsb.edu Contents User access, logging in Linux/Unix

More information

A Brief Introduction to The Center for Advanced Computing

A Brief Introduction to The Center for Advanced Computing A Brief Introduction to The Center for Advanced Computing November 10, 2009 Outline 1 Resources Hardware Software 2 Mechanics: Access Transferring files and data to and from the clusters Logging into the

More information

XSEDE New User Tutorial

XSEDE New User Tutorial May 13, 2016 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on-line survey about this module at http://bit.ly/hamptonxsede.

More information

The JANUS Computing Environment

The JANUS Computing Environment Research Computing UNIVERSITY OF COLORADO The JANUS Computing Environment Monte Lunacek monte.lunacek@colorado.edu rc-help@colorado.edu What is JANUS? November, 2011 1,368 Compute nodes 16,416 processors

More information

Introduction to High-Performance Computing (HPC)

Introduction to High-Performance Computing (HPC) Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit cores : individual processing units within a CPU Storage : Disk drives HDD : Hard Disk Drive SSD : Solid

More information

Batch Systems. Running calculations on HPC resources

Batch Systems. Running calculations on HPC resources Batch Systems Running calculations on HPC resources Outline What is a batch system? How do I interact with the batch system Job submission scripts Interactive jobs Common batch systems Converting between

More information

Introduction to High Performance Computing (HPC) Resources at GACRC

Introduction to High Performance Computing (HPC) Resources at GACRC Introduction to High Performance Computing (HPC) Resources at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? Concept

More information

Introduction to HPC Numerical libraries on FERMI and PLX

Introduction to HPC Numerical libraries on FERMI and PLX Introduction to HPC Numerical libraries on FERMI and PLX HPC Numerical Libraries 11-12-13 March 2013 a.marani@cineca.it WELCOME!! The goal of this course is to show you how to get advantage of some of

More information

HPC Tutorial Last updated: November For Mac Users

HPC Tutorial Last updated: November For Mac Users VLAAMS SUPERCOMPUTER CENTRUM Innovative Computing for A Smarter Flanders HPC Tutorial Last updated: November 22 2018 For Mac Users Authors: Franky Backeljauw 5, Stefan Becuwe 5, Geert Jan Bex 3, Geert

More information

Introduction to High Performance Computing Using Sapelo2 at GACRC

Introduction to High Performance Computing Using Sapelo2 at GACRC Introduction to High Performance Computing Using Sapelo2 at GACRC Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu 1 Outline High Performance Computing (HPC)

More information

Knights Landing production environment on MARCONI

Knights Landing production environment on MARCONI Knights Landing production environment on MARCONI Alessandro Marani - a.marani@cineca.it March 20th, 2017 Agenda In this presentation, we will discuss - How we interact with KNL environment on MARCONI

More information

Introduction to HPC Resources and Linux

Introduction to HPC Resources and Linux Introduction to HPC Resources and Linux Burak Himmetoglu Enterprise Technology Services & Center for Scientific Computing e-mail: bhimmetoglu@ucsb.edu Paul Weakliem California Nanosystems Institute & Center

More information

XSEDE New User Tutorial

XSEDE New User Tutorial October 20, 2017 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on line survey about this module at http://bit.ly/xsedesurvey.

More information

Slurm basics. Summer Kickstart June slide 1 of 49

Slurm basics. Summer Kickstart June slide 1 of 49 Slurm basics Summer Kickstart 2017 June 2017 slide 1 of 49 Triton layers Triton is a powerful but complex machine. You have to consider: Connecting (ssh) Data storage (filesystems and Lustre) Resource

More information

New User Seminar: Part 2 (best practices)

New User Seminar: Part 2 (best practices) New User Seminar: Part 2 (best practices) General Interest Seminar January 2015 Hugh Merz merz@sharcnet.ca Session Outline Submitting Jobs Minimizing queue waits Investigating jobs Checkpointing Efficiency

More information

Cluster Clonetroop: HowTo 2014

Cluster Clonetroop: HowTo 2014 2014/02/25 16:53 1/13 Cluster Clonetroop: HowTo 2014 Cluster Clonetroop: HowTo 2014 This section contains information about how to access, compile and execute jobs on Clonetroop, Laboratori de Càlcul Numeric's

More information

UMass High Performance Computing Center

UMass High Performance Computing Center .. UMass High Performance Computing Center University of Massachusetts Medical School October, 2015 2 / 39. Challenges of Genomic Data It is getting easier and cheaper to produce bigger genomic data every

More information

CENTER FOR HIGH PERFORMANCE COMPUTING. Overview of CHPC. Martin Čuma, PhD. Center for High Performance Computing

CENTER FOR HIGH PERFORMANCE COMPUTING. Overview of CHPC. Martin Čuma, PhD. Center for High Performance Computing Overview of CHPC Martin Čuma, PhD Center for High Performance Computing m.cuma@utah.edu Spring 2014 Overview CHPC Services HPC Clusters Specialized computing resources Access and Security Batch (PBS and

More information

Our Workshop Environment

Our Workshop Environment Our Workshop Environment John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2015 Our Environment Today Your laptops or workstations: only used for portal access Blue Waters

More information

NBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen

NBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen NBIC TechTrack PBS Tutorial by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen 1 NBIC PBS Tutorial This part is an introduction to clusters and the PBS

More information

Genius - introduction

Genius - introduction Genius - introduction HPC team ICTS, Leuven 13th June 2018 VSC HPC environment GENIUS 2 IB QDR IB QDR IB EDR IB EDR Ethernet IB IB QDR IB FDR Numalink6 /IB FDR IB IB EDR ThinKing Cerebro Accelerators Genius

More information

Introduction to High-Performance Computing (HPC)

Introduction to High-Performance Computing (HPC) Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit CPU cores : individual processing units within a Storage : Disk drives HDD : Hard Disk Drive SSD : Solid

More information

Introduction to Cheyenne. 12 January, 2017 Consulting Services Group Brian Vanderwende

Introduction to Cheyenne. 12 January, 2017 Consulting Services Group Brian Vanderwende Introduction to Cheyenne 12 January, 2017 Consulting Services Group Brian Vanderwende Topics we will cover Technical specs of the Cheyenne supercomputer and expanded GLADE file systems The Cheyenne computing

More information

Using the Yale HPC Clusters

Using the Yale HPC Clusters Using the Yale HPC Clusters Stephen Weston Robert Bjornson Yale Center for Research Computing Yale University Oct 2015 To get help Send an email to: hpc@yale.edu Read documentation at: http://research.computing.yale.edu/hpc-support

More information

Crash Course in High Performance Computing

Crash Course in High Performance Computing Crash Course in High Performance Computing Cyber-Infrastructure Days October 24, 2013 Dirk Colbry colbrydi@msu.edu Research Specialist Institute for Cyber-Enabled Research https://wiki.hpcc.msu.edu/x/qamraq

More information

Introduction to the SHARCNET Environment May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology

Introduction to the SHARCNET Environment May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology Introduction to the SHARCNET Environment 2010-May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology available hardware and software resources our web portal

More information

Outline. March 5, 2012 CIRMMT - McGill University 2

Outline. March 5, 2012 CIRMMT - McGill University 2 Outline CLUMEQ, Calcul Quebec and Compute Canada Research Support Objectives and Focal Points CLUMEQ Site at McGill ETS Key Specifications and Status CLUMEQ HPC Support Staff at McGill Getting Started

More information

Tech Computer Center Documentation

Tech Computer Center Documentation Tech Computer Center Documentation Release 0 TCC Doc February 17, 2014 Contents 1 TCC s User Documentation 1 1.1 TCC SGI Altix ICE Cluster User s Guide................................ 1 i ii CHAPTER 1

More information

Slurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012

Slurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012 Slurm and Abel job scripts Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012 Abel in numbers Nodes - 600+ Cores - 10000+ (1 node->2 processors->16 cores) Total memory

More information

New User Tutorial. OSU High Performance Computing Center

New User Tutorial. OSU High Performance Computing Center New User Tutorial OSU High Performance Computing Center TABLE OF CONTENTS Logging In... 3-5 Windows... 3-4 Linux... 4 Mac... 4-5 Changing Password... 5 Using Linux Commands... 6 File Systems... 7 File

More information

Introduction to HPCC at MSU

Introduction to HPCC at MSU Introduction to HPCC at MSU Chun-Min Chang Research Consultant Institute for Cyber-Enabled Research Download this presentation: https://wiki.hpcc.msu.edu/display/teac/2016-03-17+introduction+to+hpcc How

More information

How to run applications on Aziz supercomputer. Mohammad Rafi System Administrator Fujitsu Technology Solutions

How to run applications on Aziz supercomputer. Mohammad Rafi System Administrator Fujitsu Technology Solutions How to run applications on Aziz supercomputer Mohammad Rafi System Administrator Fujitsu Technology Solutions Agenda Overview Compute Nodes Storage Infrastructure Servers Cluster Stack Environment Modules

More information

Introduction to HPC Using the New Cluster at GACRC

Introduction to HPC Using the New Cluster at GACRC Introduction to HPC Using the New Cluster at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What is the new cluster

More information

Answers to Federal Reserve Questions. Training for University of Richmond

Answers to Federal Reserve Questions. Training for University of Richmond Answers to Federal Reserve Questions Training for University of Richmond 2 Agenda Cluster Overview Software Modules PBS/Torque Ganglia ACT Utils 3 Cluster overview Systems switch ipmi switch 1x head node

More information

Moab Workload Manager on Cray XT3

Moab Workload Manager on Cray XT3 Moab Workload Manager on Cray XT3 presented by Don Maxwell (ORNL) Michael Jackson (Cluster Resources, Inc.) MOAB Workload Manager on Cray XT3 Why MOAB? Requirements Features Support/Futures 2 Why Moab?

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0) Contributing sites and the corresponding computer systems for this call are: BSC, Spain IBM System x idataplex CINECA, Italy Lenovo System

More information

UoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011)

UoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011) UoW HPC Quick Start Information Technology Services University of Wollongong ( Last updated on October 10, 2011) 1 Contents 1 Logging into the HPC Cluster 3 1.1 From within the UoW campus.......................

More information