Introduction to UAntwerpen-HPC (CalcUA)
Stefan Becuwe, Kurt Lust, Franky Backeljauw, Bert Tijskens


1 UAntwerpen-HPC. Stefan Becuwe, Kurt Lust, Franky Backeljauw, Bert Tijskens.
Agenda:
1. Introduction to UAntwerpen-HPC
2. Getting an HPC account
3. Preparing the environment
4. Running batch jobs
5. Running interactive jobs
6. Running jobs with input/output data
7. Multi-core jobs
8. Multi-job submission
9. UAntwerpen-HPC policies

2 Goal: make you comfortable to work with and use UAntwerpen-HPC.
Structure of the tutorial (question - chapter - title):
What is High Performance Computing exactly? Can it solve my computational needs? - Chapter 1, Introduction to HPC
How to get an account? - Chapter 2, Getting a VSC account
How do I connect to the HPC and transfer my files and programs? - Chapter 3, Preparing the environment
How to start background jobs? - Chapter 4, Running batch jobs
How to start jobs with user interaction? - Chapter 5, Running interactive jobs
Where do the input and output go? Where to collect my results? - Chapter 6, Running jobs with input/output data
Can I speed up my program by exploring parallel programming techniques? - Chapter 7, Multi-core jobs / Parallel programming
Can I start many jobs at once? - Chapter 8, Multi-job submission
What are the policies? - Chapter 9, HPC policies

3 Chapter 1: Introduction to the VSC, the Flemish Supercomputer Center
Vlaams Supercomputer Centrum (VSC):
Established in December 2007
Partnership between 5 university associations: Antwerp, Brussels, Ghent, Hasselt, Leuven
Funded by the Flemish Government through the FWO (Research Foundation - Flanders)
Goal: make HPC available to all researchers in Flanders (academic and industrial)
Local infrastructure (Tier-2) in Antwerp: the HPC core facility

4 VSC partners.
VSC: an integrated infrastructure according to the European model (VSC network, storage and job management), from Tier-0 (European, e.g. PRACE) over Tier-1 (national/regional) and Tier-2 (regional/university) down to Tier-3 (desktop). Related infrastructures in the diagram: PRACE, BEgrid, EGI.

5 UAntwerpen: Hopper
168 compute nodes:
- 144 nodes with 2 ten-core Ivy Bridge processors, 64 GB memory, 500 GB local disk, FDR10-IB network
- 24 nodes with 2 ten-core Ivy Bridge processors, 256 GB memory, 500 GB local disk, FDR10-IB network
3360 cores, theoretical peak performance 75 Tflops
100 TB online GPFS storage
4 login nodes for user access, 2 service nodes for admin and maintenance
UAntwerpen: Turing
168 compute nodes:
- 64 nodes with 2 quad-core Harpertown processors, 16 GB memory, 120 GB local disk, GbE network
- 32 nodes with 2 quad-core Harpertown processors, 16 GB memory, 120 GB local disk, DDR-IB network
- 64 nodes with 2 six-core Westmere processors, 24 GB memory, 120 GB local disk, QDR-IB network
- 8 nodes with 2 six-core Westmere processors, 24 GB memory, 120 GB local disk, GbE network
1632 cores, theoretical peak performance 15.5 Tflops
2 login nodes for user access

6 Hardware: the Tier-1 infrastructure Muk (UGent)
Inaugurated on October 25, 2012
528 compute nodes with 2 eight-core Sandy Bridge processors, 64 GB memory, FDR-IB network
400 TB net storage capacity, peak bandwidth 9.5 GB/s
8448 cores, theoretical peak performance 175 Tflops
#118 in the Top 500 of June 2012
The next Tier-1 system will be installed in Leuven; pilot use is expected to start in July. Muk will be phased out at the end of 2016.

7 Tier-1 infrastructure Muk (UGent). Some characteristics:
Shared infrastructure, used by multiple users simultaneously, so you have to request the amount of resources you need, and you may have to wait a little. Therefore it is mostly for batch applications and not well suited for interactive applications.
Built for parallel work: no parallelism = no supercomputing. Not meant for running a single single-core job.
Remote use model, but you're running (a different kind of) remote applications all the time.
Linux-based, so no Windows or OS X software, but most standard Linux applications will run.

8 Chapter 2: Getting a VSC account
Create SSH keys with ssh-keygen:
/home/<username>/.ssh/id_rsa - this is your private key
/home/<username>/.ssh/id_rsa.pub - this is your public key
The public key ends up on the cluster in /user/antwerpen/2xx/<vsc-username>/.ssh/authorized_keys
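As an illustration, generating such a key pair on a Linux or OS X client could look as follows (a minimal sketch using standard OpenSSH options; the -t and -b options are generic and not prescribed by this course):
$ ssh-keygen -t rsa -b 4096      # accept the default location, choose a passphrase
$ ls ~/.ssh
id_rsa  id_rsa.pub               # private key (keep it safe!) and public key
$ cat ~/.ssh/id_rsa.pub          # contents to upload during the VSC account registration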

9 Register your account: web-based procedure; your VSC username is vsc2xxxx.
Exercises: follow the instructions in Chapter 2, Getting an account.

10 Chapter 3: Connecting to Hopper
A typical workflow:
1. Connect to Hopper
2. Transfer your files to Hopper
3. Build your environment
4. Compile your source code and test it
5. Submit your job
6. Wait while your job gets into the queue, gets executed and finishes
7. Move your results

11 Connect to Hopper
Open a terminal:
Windows (7): All Programs > PuTTY
OS X: Applications > Utilities > Terminal
Linux (Ubuntu): Applications > Accessories > Terminal (or Ctrl+Alt+T)
Log in via secure shell: $ ssh vsc2xxxx@login.hpc.uantwerpen.be
Hopper login nodes: login.hpc.uantwerpen.be is a generic name for the 4 login nodes (DNS alias).
Limited access: only accessible from within the campus network; external access (e.g., from home) is possible using UA-VPN.
The login nodes are for editing and submitting jobs; it is best to compile on the target compute nodes. Do not use the login nodes for running your executables.

12 Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (your job gets into the queue, gets executed, finishes) - 7. Move your results
Transfer your files:
Graphical SSH file managers for Windows, OS X, Linux: PuTTY, WinSCP, CyberDuck, FileZilla, ...
Command line: scp (secure copy)
Copy from the local computer to Hopper: $ scp file.ext vsc2xxxx@login.hpc.uantwerpen.be:
Copy from Hopper to the local computer: $ scp vsc2xxxx@login.hpc.uantwerpen.be:file.ext .

13 Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (queue, execution, finish) - 7. Move your results
Available development software:
C/C++/Fortran compilers: Intel Parallel Studio XE Cluster Edition and GCC. The Intel compiler comes with several tools for performance analysis and assistance in vectorisation/parallelisation of code.
Fairly up-to-date OpenMP support in both compilers. Support for other shared memory parallelisation technologies such as Cilk Plus, other sets of directives, etc., to a varying level in both compilers. PGAS Co-Array Fortran support in the Intel compiler (and some in GNU Fortran).
Message passing libraries: Intel MPI, Open MPI
Mathematical libraries: Intel MKL, OpenBLAS, FFTW, MUMPS, GSL, ...
Many other libraries, including HDF5, NetCDF, Metis, ...
Python

14 Available application software:
ABINIT, CP2K, OpenMX, QuantumESPRESSO, VASP, CHARMM, GAMESS-US, LAMMPS, Gaussian, Gromacs, Molpro, Telemac, SAMtools, TopHat, Bowtie, BLAST, MATLAB, R, ...
Additional software can be installed on demand.
Available software, the fine print:
Restrictions on commercial software: 5 licenses for Intel Parallel Studio; no license for the Parallel Computing Toolbox in MATLAB. Other licenses are paid for and provided by the users; the owner of the license decides who can use it.
No user software is part of the system software. Additional software can be installed on demand: take into account the system requirements, and provide working building instructions and licenses. Compilation support is limited (best effort, no code fixing). Software is installed in /apps/antwerpen.

15 Operating system and system management
Operating system: Scientific Linux 6.x, a Red Hat Enterprise Linux (RHEL) 6 clone, with a kernel with cpuset support and BLCR modules.
Job management and accounting: Torque (resource manager, based on PBS), MOAB (job scheduler and management tools), MAM (allocation manager and accounting system).
Maintenance and monitoring: Icinga (monitoring software), CMU (node management).
Building your environment: the module command
Dynamic software management through modules:
Set environment variables: $PATH, $LD_LIBRARY_PATH, ...
Application-specific environment variables
Automatically pull in required dependencies
VSC-specific variables that you can use to point to the location of a package (they typically start with EB)
Software built on our system is organised in toolchains. Packages built with the same toolchain will interoperate. Build tools (compilers etc.) are refreshed twice yearly: 2015a, 2015b, 2016a, ...
intel toolchain: Intel compilers, MPI and math libraries.
foss toolchain: GNU compilers, Open MPI, OpenBLAS, FFTW, ...
Many packages have intel-2015a or foss-2015b in their name, which denotes that they are built with that toolchain.

16 Building your environment: the module command
A module name is typically of the form <name of package>/<version>, e.g.
intel/2016a : the Intel toolchain package, version 2016a
PLUMED/2.1.0-intel-2014a : the PLUMED library, version 2.1.0, built with the intel/2014a toolchain
Some module commands (see also the example session below):
module avail : list installed software packages
module avail intel : list all available versions of the intel package
module help foss : show help about a module (the module foss in this example)
module list : list all loaded modules in the current session
module load hopper/2015b : load a list of modules
module load MATLAB/R2015b : load a specific version
module unload MATLAB : unload the MATLAB settings
module purge : unload all modules (incl. dependencies)
Exercises. Getting ready (see also 3.1):
Log on to the cluster
Go to your data directory: cd $VSC_DATA
Copy the examples for the tutorial to that directory: cp -r /apps/antwerpen/tutorials/intro-hpc/examples .
Experiment with the module command as in 3.3, Preparing your environment: Modules
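To make this concrete, a short terminal session could look as follows (a sketch only: the module and version names are the ones used as examples above and may differ on the current system):
$ module avail MATLAB            # which MATLAB versions are installed?
$ module load hopper/2015b       # load the recommended list of modules for the cluster
$ module load MATLAB/R2015b      # load a specific MATLAB version
$ module list                    # verify what is loaded in this session
$ module purge                   # start again from a clean environment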

17 Chapter 4: Running batch jobs
Computation workflow: 1. Connect to Hopper - 2. Transfer your files to Hopper - 3. Build your environment - 4. Compile your source code and test it - 5. Submit your job - 6. Wait (your job gets into the queue, gets executed, finishes) - 7. Move your results

18 Job workflow: what happens behind the scenes?
An example job script:
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo
You submit the job, and Moab and Torque query the cluster and manage it (figure: Job Management System Architecture). Don't worry too much about who does what.

19 The job script
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo
This is a bash script (but it could be, e.g., Perl). The #PBS lines specify the resources for the job (and some other instructions for the resource manager), but look like comments to bash. Next, build a suitable environment for your job; the script executes in your home directory! Then follow the command(s) that you want to execute in your job.
The job script specifies the resources needed for the job and other instructions for the resource manager: the number and type of CPUs, how much compute time, the output files (stdout and stderr, optional); it can also instruct the resource manager to send mail when a job starts or ends. This is followed by a sequence of commands to run on the compute node. The script runs in your home directory, so don't forget to go to the appropriate directory. Load the modules you need for your job, then specify all the commands you want to execute: these are really all the commands that you would also run interactively or in a regular script to perform the tasks you want to perform.

20 Intermezzo: why do Torque commands start with PBS?
In the 1990s there was a resource manager called Portable Batch System, developed by a contractor for NASA. It was open-sourced, but that company was acquired by another company that then sold the rights to Altair Engineering, which evolved the product into the closed-source product PBSpro. The open-source version was forked by what is now Adaptive Computing and became Torque (Terascale Open-source Resource and QUEue manager), which remained open source. The name was changed, but the commands remained.
Likewise, Moab evolved from MAUI, an open-source scheduler. Adaptive Computing, the company behind Moab, contributed a lot to MAUI but then decided to start over with a closed-source product. They still offer MAUI on their website, though. MAUI used to be widely used in large US supercomputer centres, but most now throw their weight behind SLURM, with or without another scheduler.
Environment variables in job scripts
The resource manager passes some environment variables:
$PBS_O_WORKDIR : directory from where the script was submitted
$PBS_JOBID : the job identifier
$PBS_NUM_PPN : number of cores/hardware threads per node available to the job, i.e., the value of the ppn parameter (see later)
$PBS_NP : total number of cores/hardware threads for the job
$PBS_NODES, $PBS_JOBNAME, $PBS_QUEUE, ...
Look for "Exported batch environment variables" in the Torque manual (currently version 4.2 on our clusters).
Some VSC-specific environment variables:
$VSC_HOME = /user/antwerpen/2xx/vsc2xxxx, your home directory
$VSC_DATA = /data/antwerpen/2xx/vsc2xxxx, your data directory
$VSC_SCRATCH = /scratch/antwerpen/2xx/vsc2xxxx, shared scratch space
$VSC_SCRATCH_NODE = /tmp, scratch directory on the node
Variables that start with EB point to the root directories of software packages loaded with the module command.
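As an illustration, a small job script could simply print some of these variables to see what the resource manager passes along (a sketch; the echo lines are illustrative and not part of the course material):
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
echo "Job id:             $PBS_JOBID"
echo "Submitted from:     $PBS_O_WORKDIR"
echo "Cores on this node: $PBS_NUM_PPN"
echo "Total cores:        $PBS_NP"
echo "Data directory:     $VSC_DATA"
echo "Shared scratch:     $VSC_SCRATCH"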

21 Specifying resources in job scripts: /apps/antwerpen/examples/example.sh
#!/bin/bash                     <- the shell (language) in which the script is executed
#PBS -N name_of_job             <- name of the job (optional)
#PBS -l walltime=01:00:00       <- requested running time (required)
#PBS -l nodes=1:ppn=1           <- -l resource specifications (e.g., nodes, processors per node)
#PBS -l pmem=1gb                <- amount of memory per process
#PBS -m e                       <- mail notification at begin, end or abort (optional)
#PBS -M <your e-mail address>   <- your e-mail address (should always be specified)
#PBS -o stdout.file             <- redirect standard output (-o) and standard error (-e) (optional);
#PBS -e stderr.file                if omitted, a generic filename will be used
cd $PBS_O_WORKDIR               <- change to the current working directory (optional, if it is the
                                   directory where you submitted the job)
echo Job started : `/bin/date`
./hello_world                   <- the program itself
echo Job finished : `/bin/date`
Submit it with: $ qsub example.sh
Specifying resources in job scripts: requesting cores
#PBS -l nodes=2:ppn=20 : request 2 nodes, using 20 cores (in fact, hardware threads) per node. Different node types have a different number of cores per node; on Hopper, all nodes have 20 cores (until we install new nodes).
#PBS -l nodes=2:ppn=20:ivybridge : request 2 nodes, using 20 cores per node, and those nodes should have the property ivybridge (which is currently all nodes on Hopper).
See the UAntwerpen clusters page in the Available hardware section of the User Portal on the VSC website for an overview of properties for various node types.
What happens if you don't request all cores on a node? Other jobs of yours may start on the same node, but in our current setup other users cannot use these cores (so that they cannot accidentally crash your job). You're likely wasting expensive resources!

22 Specifying resources in job scripts: requesting memory
#PBS -l mem=55gb : the job needs 55 GB of memory, but only use mem for single-node jobs!
#PBS -l pmem=2.5gb : this job needs 2.5 GB of memory per core.
vmem and pvmem do the same for virtual memory, but we strongly discourage jobs that need swapping as they tend to bring the node to a halt.
What happens if you don't use all memory and cores? Other jobs, but only yours, can use them.
If you submit a lot of 1-core jobs that use more than 2.75 GB each on Hopper: please specify the fat property for the nodes, so that they get scheduled on nodes with more memory and also fill up as many cores as possible.
See the UAntwerpen clusters page in the Available hardware section of the User Portal on the VSC website for the usable amounts of memory for each node type. This is always less than the physical amount of memory in the node.
Specifying resources in job scripts: requesting compute time
#PBS -l walltime=30:25:55 : this job is expected to run for 30h25m55s.
What happens if I underestimate the time? Bad luck, your job will be killed when the walltime has expired.
What happens if I overestimate the time? Not much, but your job may have a lower priority for the scheduler, or may not run because we limit the number of very long jobs that can run simultaneously. Also, the estimates for the start time of other jobs that the scheduler produces will be even worse than they already are.
The maximum wall time is 7 days on Hopper and 21 days on Turing. But remember what we said in last week's lecture about reliability: longer jobs will need a checkpoint-and-restart strategy.
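Putting these resource requests together, the top of a job script on Hopper might look like this (a sketch combining the directives discussed above; the job name, values and toolchain module are illustrative):
#!/bin/bash
#PBS -N my_simulation
#PBS -l nodes=1:ppn=20        # one full Hopper node (20 cores)
#PBS -l pmem=2gb              # 2 GB of memory per core, 40 GB in total
#PBS -l walltime=12:00:00     # killed after 12 hours if not finished
cd $PBS_O_WORKDIR
module load foss/2016a        # illustrative toolchain module
./my_program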

23 Specifying resources in job scripts
All resource specifications start with #PBS -l. These #PBS lines look suspiciously like command line arguments? That's because they can also be specified on the command line of qsub. The full list of options is in the Torque manual (search for "Requesting resources"), but it is very unlikely that you'll need one that is not yet mentioned here.
Remember that available resources change over time as new hardware is installed and old hardware is decommissioned. Share your scripts, but don't forget to check the resources from time to time!
Non-resource-related #PBS parameters
#PBS -N job_name : sets the name of the job, which is used, e.g., in constructing default file names for stdout and stderr.
#PBS -o stdout.file : send the stdout output of the job to the file stdout.file instead of the default one.
#PBS -e stderr.file : send the stderr output of the job to the file stderr.file instead of the default one.
#PBS -m abe -M my.mail@uantwerpen.be : send mail when the job is aborted (a), begins (b) or ends (e) to the given mail address. Don't forget -M if you use -m, or the mail will never arrive!
Any other qsub command line argument could be specified in the job file as well, but these are the most useful ones. See the qsub manual page in the Torque manual. Example job scripts can be found in /apps/antwerpen/examples.

24 Submitting jobs
If all resources are specified in the job script, submitting a job is as simple as: $ qsub jobscript.sh
But all parameters specified in the job script in #PBS lines can also be specified on the command line, e.g.
$ qsub -l nodes=x[:ppn=y][,pmem=zmb] example.sh
x : the number of nodes
y : the number of cores per node (max ppn : 20 for the Ivy Bridge nodes on Hopper)
pmem : the amount of memory needed per core (other memory specifications: mem, vmem, pvmem)
Submitting jobs (Turing)
On Turing, additional properties are often needed:
$ qsub -l nodes=x[:ppn=y][:harpertown|westmere][:ib|gbe][,pmem=zmb]
x : the number of nodes
y : the number of cores per node (max ppn : 8 for Harpertown nodes, 24 for Westmere nodes, where each hardware thread is shown as a different virtual core)
harpertown or westmere : explicitly choose the architecture
ib or gbe : use InfiniBand (IB) or Gigabit Ethernet (GbE)
pmem : the amount of memory needed per core (other memory specifications: mem, vmem, pvmem)
A job can only run within a single node set (nodes with the same processor architecture and network interface).

25 Enforcing the node specification
The scheduler will try to fill up the compute nodes completely, so specifying, e.g., -l nodes=2:ppn=4 might still result in the job being run on only one physical compute node. To enforce the node specification: -W x=nmatchpolicy:exactnode
The scheduler will also try to fill up a node with several jobs (though only your own), so specifying, e.g., -l nodes=2:ppn=4 might still result in several of such jobs being run on the same compute nodes. To enforce a single job on the compute nodes: -l naccesspolicy=singlejob
These are very asocial options, so use them only if you have a very good reason to do so!
Queue manager (figure: Job Management System Architecture, showing the example job script being submitted to Torque, which queries the cluster).

26 Queue manager
Jobs are submitted to the queue manager, which provides functionality to start, delete, hold, cancel and monitor jobs on the queue. Some commands (see the example session below):
qsub : submit a job
qdel : delete a submitted (queued and/or running) job
qstat : get the status of your jobs on the system
Under the hood: there are multiple queues, but qsub will place the job in the most suitable queue. On our system, queues are based on the requested walltime. There are limits on the number of jobs a user can submit to a queue or have running from one queue, to avoid that a user monopolises the system for a prolonged time.
Job scheduling (figure: Job Management System Architecture, now highlighting MOAB, which queries the cluster when a job is submitted).
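A typical submit-and-monitor cycle with these commands could look like this (a sketch; the job id 12345 is made up for the example):
$ qsub jobscript.sh       # prints the job id, e.g. 12345.<servername>
$ qstat                   # list the status of your jobs (Q = queued, R = running)
$ qstat -f 12345          # detailed information about one job
$ qdel 12345              # remove the job from the queue (or kill it if it is already running)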

27 Job scheduling: MOAB
The job scheduler decides which job will run next based on (continuously evolving) priorities. It works with the queue manager, and with the resource manager to check available resources, and tries to optimise (long-term) resource use, keeping fairness in mind:
priorities : used to regulate the promotion of queued jobs, based on the requested resources and the recent history of the user
fairshare : spread the use of resources fairly over all users by lowering the priority of jobs of users who have recently used a lot of resources
backfill : allows some jobs to be run 'out of order' provided they do not delay the highest-priority jobs in the queue
Lots of other features: QoS, reservations, ...
Getting information from the scheduler (see the commands below):
showq : displays information about the queue as the scheduler sees it (i.e., taking into account the priorities assigned by the scheduler), in 3 categories: starting/running jobs, eligible jobs and blocked jobs. You can only see your own jobs!
checkjob : detailed status report for the specified job
showstart : shows an estimate of when a job can/will start. This estimate is notoriously inaccurate! A job can start earlier than expected because other jobs take less time than requested, but it can also start later if a higher-priority job enters the queue.
showbf : shows the resources that are immediately available to you should you submit a job that matches those resources
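For example (a sketch; the job id is again made up and the output is not shown here):
$ showq             # the queue as the scheduler sees it: running, eligible and blocked jobs
$ checkjob 12345    # detailed status report for job 12345 (why is it not starting?)
$ showstart 12345   # rough estimate of when job 12345 can start
$ showbf            # resources that are immediately available to you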

28 Information returned by showq
active jobs : running or starting, and have resources assigned to them
eligible jobs : queued and considered eligible for scheduling and backfilling
blocked jobs : jobs that are ineligible to be run or queued
Job states in the STATE column:
idle : the job violates a fairness policy
userhold or systemhold : user or administrative hold
deferred : a temporary hold when unable to start after a specified number of attempts
batchhold : the requested resources are not available, or the job repeatedly failed to start
notqueued : the scheduling daemon is unavailable
Resource manager (figure: Job Management System Architecture, now highlighting Torque, which handles the submitted job on the cluster).

29 Resource manager: Torque
The resource manager manages the cluster resources (e.g., CPUs, memory, ...). This includes the actual job handling at the compute nodes: starting, holding, cancelling and monitoring jobs on the compute nodes, enforcing resource limits (e.g., the available memory for your job), and logging and accounting of used resources. It consists of a server component and a small daemon process that runs on each compute node.
Further notes
Currently a node is never shared by users. This seems like a waste of resources, but: if you can't fill up a node, you shouldn't be using a supercomputer but buy a workstation; and parallel jobs can influence each other in unpredictable ways and may ruin each other's performance. We want users to have control over this. Multiple jobs of a single user can run on a single node; it is up to the user to ensure that they can run efficiently. This also implies that you cannot log in to a node (using ssh) unless you have a job running on that node.
Two very technical but sometimes useful resource manager commands:
pbsnodes : returns information on the configuration of all nodes
tracejob : looks back into the log files for information about a job and can be used to get more information about job failures

30 Exercises: do the exercises in Chapter 4, Batch jobs. The scripts for 4.1 are in examples/running-batch-jobs. Now extend that script to request 1 core on 1 node with a walltime of 1 minute.
Chapter 5: Running interactive jobs

31 Interactive jobs
Starting an interactive job:
qsub -I : open a shell on a compute node
qsub -I -X : open a shell with X support; you must log in to the cluster with X forwarding enabled (ssh -X), and this works only when you are requesting a single node
When to start an interactive session: to test and debug your code, to use (graphical) input/output at runtime, only for short-duration jobs, or to compile on a specific architecture. A short example session follows below.
Exercises: do the exercises in Chapter 5, Interactive jobs. Files are in examples/running-interactive-jobs: a text example in 5.2, a GUI program in 5.3 (assuming you have an X server installed).
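Such an interactive session could look like this (a sketch; the requested resources and the MATLAB test are illustrative):
$ qsub -I -l nodes=1:ppn=20 -l walltime=01:00:00
# ... wait until the job starts; you then get a shell on a compute node ...
$ module load MATLAB
$ cd $VSC_SCRATCH
$ matlab -nodisplay       # run and test interactively
$ exit                    # ends the interactive job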

32 Chapter 6: Input / Output
Files, stdout, stderr
The output to stdout and stderr will be redirected to files by the resource manager. These two files end up by default in the directory from which you submitted the job (qsub test.pbs).
Default naming convention: stdout goes to <job-name>.o<job-id>, stderr to <job-name>.e<job-id>. The default job name is the name of the job script. Override the default name with #PBS -N my_job, and you get my_job.o<job-id> and my_job.e<job-id>.
Other files are put by default in the current working directory. This is your home directory when your job starts!

33 Directories and quota
$VSC_SCRATCH = /scratch/antwerpen/2xx/vsc2xxxx : fast scratch for temporary files; default block quota 25 GB, default file quota 100,000 files
$VSC_DATA = /data/antwerpen/2xx/vsc2xxxx : to store long-term data; default block quota 25 GB, default file quota 100,000 files
$VSC_HOME = /user/antwerpen/2xx/vsc2xxxx : only for small input and configuration files; default block quota 5 GB, default file quota 20,000 files
Directories and quota: GPFS (IBM General Parallel File System)
Your quota is shown when you log in; there are quota for both size and number of files. Use show_quota.py to check:
$ module load scripts
$ show_quota.py
VSC_DATA: used 13640MB (56%) quota 25600MB
VSC_HOME: used 2156MB (74%) quota 5120MB
VSC_SCRATCH: used 19907MB (82%) quota 25600MB

34 Best practices for file storage
The cluster is not for long-term file storage. You have to move your results back to your laptop or a server in your department. There is currently no backup; we are considering testing a backup solution for the home and data directories. Old data on scratch can be deleted if scratch fills up.
Remember that the cluster is optimised for parallel access to big files, not for using tons of small files (e.g., one per MPI process). If you run out of block quota, you'll get more without too much motivation; if you run out of file quota, you'll have a lot of explaining to do before getting more.
Text files are good for summary output while your job is running, or for data that you'll read into a spreadsheet, but not for storing 1000x1000 matrices. Use binary files for that!
Exercises: do the exercises in Chapter 6, Jobs with input/output. Files are in examples/running-jobs-with-input-output-data. 6.3 Writing output files: optional. 6.4 Reading an input file: homework, not in this session (and most useful if you understand Python scripts).

35 Chapter 7: Multi-core jobs
Why parallel computing?
1. Faster time to solution: distributing the code over N cores, hoping for a speedup by a factor of N.
2. Larger problem size: distributing your code over N nodes increases the available memory by a factor of N, hoping to tackle problems which are N times bigger.
3. In practice the gain is limited due to communication and memory overhead and sequential fractions in the code; the optimal number of nodes is problem-dependent. But there is no escape possible, as computers don't really become faster for serial code. Parallel computing is here to stay.

36 Shared memory - threading
OpenMP: an API that implements a multi-threaded, shared memory form of parallelism. It is a set of compiler directives, so the original serial version stays intact; an automated, higher-level API.
POSIX threads (pthreads): doing multi-threaded programming "by hand".
Distributed memory - MPI
Message Passing Interface (MPI): a single executable running concurrently on multiple processors, with communication between the processes. The executable is started from one core using mpirun. (Figure: processes 0, 1, 2, ... executing a parallel task.) A sketch of an OpenMP job script follows below; the next slide shows an MPI job script.
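The course only shows an MPI job script (next slide); as a complementary sketch, a shared-memory OpenMP program could be launched from a job script roughly as follows (the program name omp_hello and the chosen toolchain are illustrative, not taken from the course text):
#!/bin/bash
#PBS -N omphello
#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=20                 # OpenMP is shared memory: a single node
module purge
module load foss/2015b                 # or whichever toolchain was used to build the program
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=$PBS_NUM_PPN    # one thread per requested core
./omp_hello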

37 Starting an MPI job (Intel MPI)
Example script (generic-mpi.sh):
#!/bin/bash
#PBS -N mpihello
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=20
module purge
module load intel/2015b
cd $PBS_O_WORKDIR
mpirun ./mpihello
The script will use two nodes with 20 cores each. Make sure to purge all modules (dependencies), load the Intel toolchain (for the MPI libraries and mpirun), change to the current working directory, and run the MPI program (mpihello). mpirun communicates with the resource manager to acquire the number of nodes, cores per node, node list, etc.
Exercises: do the exercises in Chapter 7, Multi-core jobs. Scripts are in examples/multi-core-jobs-parallel-computing. 7.2 Example pthreads program (shared memory). 7.3 Example OpenMP program (shared memory): only look at the first example (omp1.c). 7.4 Example MPI program (distributed memory).

38 Chapter 9: Policies on our clusters
Some site policies. Several site policies have been implemented:
Single user per node policy
Priority-based scheduling (so not first come, first get)
Fairshare mechanism: make sure one user cannot monopolise the cluster
Crediting system (dormant, not enforced)
Implicit user agreement: the cluster is valuable research equipment. Do not use it for other purposes than your research for the university: no SETI-at-home and similar initiatives, and not for private use. You have to acknowledge the VSC in your publications. Don't share your account.

39 Single user per node
Policy motivation:
Parallel jobs can suffer badly if they don't have exclusive access to the full node.
Sometimes the wrong job gets killed if a user uses more memory than requested and a node runs out of memory.
A user may accidentally spawn more processes/threads and use more cores than expected (depends on the OS).
Remember: the scheduler will (try to) fill up a node with several jobs from the same user. If you don't have enough work for a single node, you need a good PC/workstation and not a supercomputer.
Priorities
The scheduler associates a priority number with each job; the highest-priority job will (usually) be the next one to run, and jobs with a negative priority will (temporarily) be blocked. Currently, the priority calculation is based on:
the time a job is queued: the priority of a job is increased in accordance with the time (in minutes) it has been waiting in the queue to get started
a fairshare window of 7 days with a target of 10%: the priority of jobs of a user who has used too many resources over the current fairshare window is lowered
This formula might be adapted in the future (this will be communicated).

40 Fairshare
Mechanism to incorporate historical resource utilization into job feasibility and priority decisions. It only affects the job's priority relative to other jobs (fairshare allocation of computing time). Fairshare usage is accumulated per interval of 24 hours over a fairshare depth of 7 days, with a decay of 80% per interval, giving the effective fairshare. (Figure: fairshare usage per interval from past to present.)
Runtime example (priorities, fairshare, eligible jobs), taken on Mon Oct 8 09:25:47 CEST 2012:
showq : lists the eligible jobs (here jobs of vsc20067 and vsc20034, 96 processors each, with walltime limits of 1 and 7 days) and the blocked jobs (here a job of vsc20118, 8 processors, walltime limit of almost 7 days, queued since Wed Oct 3).
mdiag -f : shows the fairshare configuration (depth: 7 intervals, interval length 1:00:00:00, decay rate 0.80) and, per user, the fairshare target and total usage.
mdiag -p : shows how each job's priority is built up from credential, fairshare and service (queue time) components, with a large weight (500) on the fairshare component.
checkjob : shows, for one job, the used walltime, the submit time, the total and eligible queued time, and the resulting start priority.

41 Credits
At UAntwerp we monitor cluster use and send periodic reports to group leaders.
On the Tier-1 system at UGent, you get a compute time allocation (a number of node days); these are weakly enforced. There is a free test ride (motivation required), and further compute time is requested through a proposal.
At KU Leuven, you need compute credits, which are enforced by the scheduler and a component called MAM (Moab Accounting Manager). Credits can be bought via KU Leuven. The charge policy (subject to change) is based upon: a fixed start-up cost, the used resources (number and type of nodes), and the actual duration (used walltime).
Part B: Advanced Topics

42 Agenda part B
10. Multi-job submission
11. Compiling
12. Monitoring HPC utilization
13. Checkpointing
14. Program examples
15. Good practices
Chapter 10: Multi-job submission

43 Worker framework
Many users run many serial jobs, e.g., for a parameter exploration. Problem: submitting many short jobs to the queueing system puts quite a burden on the scheduler, so try to avoid it. The Worker framework supports:
parameter sweep : many small jobs and a large number of parameter variations; parameters are passed to the program through command line arguments
job arrays : many jobs with different input/output files
Worker framework example case: parameter exploration
weather.sh:
#!/bin/bash -l
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
weather -t $temperature -p $pressure -v $volume
data.csv (data in CSV format):
temperature,pressure,volume
293.0, 1.0e05, 87
..., ..., ...
..., 1.0e05, 75
Submit with wsub:
$ module load worker
$ wsub -data data.csv -batch weather.sh
weather will be run for all data, until all computations are done.

44 Worker framework: an alternative for job arrays
jobarray.sh (for every run, there is a separate input file and an associated output file):
#!/bin/bash -l
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
INPUT_FILE="input_${PBS_ARRAYID}.dat"
OUTPUT_FILE="output_${PBS_ARRAYID}.dat"
program -input ${INPUT_FILE} -output ${OUTPUT_FILE}
Submit with wsub:
$ module load worker
$ wsub -t 1-100 -batch jobarray.sh
program will be run for all (100) input files.
Exercises: do the exercises in Chapter 10, Multi-job submission. Files are in the subdirectories of examples/multi-job-submission. Have a look at the parameter sweep example weather from 10.1 and at the job array example from 10.2. (Time left: have a look at the map-reduce example from 10.3.)

45 Chapter 11: Compiling
Process
Check the pre-installed software and toolchains; other software is installed on demand. Porting your code to Linux / cluster environments is your own responsibility. A typical process looks like:
Copy your software to UAntwerpen-HPC
Start an interactive session
Compile
Test (locally) on a compute node
Generate your job scripts
Test it on the UAntwerpen-HPC
Run it (in parallel, we hope)

46 Toolchains
We currently support the GNU compiler collection and the Intel compilers on all VSC systems. If you really need LLVM or any other compiler, we can discuss. They are made available via toolchains, e.g.:
intel/2015b : all Intel components (C/C++/Fortran, MPI, MKL)
foss/2015b : gcc/gfortran, Open MPI, OpenBLAS, LAPACK, FFTW, binutils with a recent assembler
Compiler commands (without MPI / with MPI):
C : GNU gcc / mpicc, Intel icc / mpiicc
C++ : GNU g++ / mpicxx, Intel icpc / mpiicpc
Fortran : GNU gfortran / mpif90, Intel ifort / mpiifort
Exercises: do the exercises in Chapter 11, Compiling and testing your software on the HPC. See the files in examples/compiling-and-testing-your-software-on-the-hpc for an example serial program and an example parallel program (or for an example with the Intel compilers).
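For instance, compiling a serial and an MPI program with each toolchain could look roughly like this (a sketch; the source file names and the -O2 flag are illustrative):
$ module load foss/2015b
$ gcc -O2 -o hello hello.c             # serial program, GNU toolchain
$ mpicc -O2 -o mpihello mpihello.c     # MPI program, GNU toolchain (Open MPI)
$ module purge
$ module load intel/2015b
$ icc -O2 -o hello hello.c             # serial program, Intel toolchain
$ mpiicc -O2 -o mpihello mpihello.c    # MPI program, Intel toolchain (Intel MPI)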

47 Checkpointing (Chapter XX)
Checkpointing makes it possible to run jobs for weeks or months or ... A snapshot is taken each time a sub-job is running out of requested walltime; subsequently, a new sub-job will be scheduled. Extra options: job arrays, using shared scratch, ...
It uses Berkeley Lab Checkpoint/Restart (BLCR), but we're experimenting with another tool as well.
Note however that the highest-performance and most robust checkpointing is the checkpointing provided by the applications themselves. Most good applications store a state from time to time from which they can restart.

48 Checkpointing: the csub command
Checkpoint submit command: csub. Specify at least the job script, and preferably also the queue. Specify the time for checkpointing with chkpt_time and the walltime for the job with job_time. #PBS directives are ignored (apart from nodes=...).
Example:
$ module load scripts
$ csub -s jobscript.sh -q qshort --job_time=00:04:30 --chkpt_time=00:01:00
$ watch qstat
$ vi csub-output.txt
For a more complete introduction, see the slides of the "HPC Tips & Tricks 3: Using checkpointing" session of December 2015.
Chapter 13: Good practices

49 Debug & test
Before starting you should always check: are there any errors in the script? Are the required modules loaded? Is the correct executable used?
Check your jobs at runtime: log in on the node and use, e.g., top, vmstat, ...; alternatively, run an interactive job (qsub -I).
If you will be doing a lot of runs with a package, try to benchmark the software for scaling issues when using MPI and for I/O issues.
Hints & tips
A job will log on to the compute node(s) and start executing the commands in the job script. It starts in your home directory ($VSC_HOME), so go to the current directory with cd $PBS_O_WORKDIR, and don't forget to load the software with module load.
Why is my job not running? Use checkjob; commands might time out with an overloaded scheduler. Submit your job and wait (be patient).
Using $VSC_SCRATCH_NODE (/tmp on the compute node) might be beneficial for jobs with many small files. But that scratch space is node-specific and hence will disappear when your job ends, so copy what you need to your data directory or to the shared scratch; a sketch of such a staging pattern follows below.
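A job script that stages its files through the node-local scratch could be structured roughly as follows (a sketch; the directory and program names and the toolchain module are illustrative):
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
# stage the input to the node-local scratch (fast for many small files)
cp -r input/ $VSC_SCRATCH_NODE/
cd $VSC_SCRATCH_NODE
module load foss/2015b                       # illustrative toolchain
$PBS_O_WORKDIR/my_program input/ output/
# node-local scratch disappears when the job ends: copy the results back
mkdir -p $VSC_DATA/results_$PBS_JOBID
cp -r output $VSC_DATA/results_$PBS_JOBID/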

50 Warnings
Avoid submitting many small jobs; group them using a job array or the Worker framework.
Runtime is limited by the maximum walltime of the queues; for longer walltimes, use checkpointing.
Requesting many processors could imply long waiting times (though we're still trying to favour parallel jobs).
What a cluster is not: a computer that will automatically run the code of your (PC) application much faster or for much bigger problems.
User Support

51 User support
Try to get self-organized: there are limited resources for user support (best effort, no code fixing). Contact us and be as precise as possible.
VSC documentation: vscentrum.be/en/user-portal
External documentation: docs.adaptivecomputing.com (manuals of Torque and MOAB; we're currently at version 7 of the Enterprise Edition)
Mailing lists and the local pages at uantwerpen.be/hpc
Mailing lists:
announcements : calcua-announce@sympa.ua.ac.be (for official announcements and communications)
users : calcua-users@sympa.ua.ac.be (for discussions between users)
Contacts:
e-mail : hpc@uantwerpen.be
phone : 3860 (Stefan), 3855 (Franky), 3852 (Kurt), 3879 (Bert)
office : G.310, G.309 (CMI)
Don't be afraid to ask, but being as precise as possible when specifying your problem definitely helps in getting a quick answer.


More information

HPC Tutorial Last updated: November For Windows Users

HPC Tutorial Last updated: November For Windows Users VLAAMS SUPERCOMPUTER CENTRUM Innovative Computing for A Smarter Flanders HPC Tutorial Last updated: November 7 2018 For Windows Users Authors: Franky Backeljauw 5, Stefan Becuwe 5, Geert Jan Bex 3, Geert

More information

Computing with the Moore Cluster

Computing with the Moore Cluster Computing with the Moore Cluster Edward Walter An overview of data management and job processing in the Moore compute cluster. Overview Getting access to the cluster Data management Submitting jobs (MPI

More information

Quick Start Guide. by Burak Himmetoglu. Supercomputing Consultant. Enterprise Technology Services & Center for Scientific Computing

Quick Start Guide. by Burak Himmetoglu. Supercomputing Consultant. Enterprise Technology Services & Center for Scientific Computing Quick Start Guide by Burak Himmetoglu Supercomputing Consultant Enterprise Technology Services & Center for Scientific Computing E-mail: bhimmetoglu@ucsb.edu Contents User access, logging in Linux/Unix

More information

A Brief Introduction to The Center for Advanced Computing

A Brief Introduction to The Center for Advanced Computing A Brief Introduction to The Center for Advanced Computing November 10, 2009 Outline 1 Resources Hardware Software 2 Mechanics: Access Transferring files and data to and from the clusters Logging into the

More information

XSEDE New User Tutorial

XSEDE New User Tutorial May 13, 2016 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on-line survey about this module at http://bit.ly/hamptonxsede.

More information

The JANUS Computing Environment

The JANUS Computing Environment Research Computing UNIVERSITY OF COLORADO The JANUS Computing Environment Monte Lunacek monte.lunacek@colorado.edu rc-help@colorado.edu What is JANUS? November, 2011 1,368 Compute nodes 16,416 processors

More information

Introduction to High-Performance Computing (HPC)

Introduction to High-Performance Computing (HPC) Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit cores : individual processing units within a CPU Storage : Disk drives HDD : Hard Disk Drive SSD : Solid

More information

Batch Systems. Running calculations on HPC resources

Batch Systems. Running calculations on HPC resources Batch Systems Running calculations on HPC resources Outline What is a batch system? How do I interact with the batch system Job submission scripts Interactive jobs Common batch systems Converting between

More information

Introduction to High Performance Computing (HPC) Resources at GACRC

Introduction to High Performance Computing (HPC) Resources at GACRC Introduction to High Performance Computing (HPC) Resources at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? Concept

More information

Introduction to HPC Numerical libraries on FERMI and PLX

Introduction to HPC Numerical libraries on FERMI and PLX Introduction to HPC Numerical libraries on FERMI and PLX HPC Numerical Libraries 11-12-13 March 2013 a.marani@cineca.it WELCOME!! The goal of this course is to show you how to get advantage of some of

More information

HPC Tutorial Last updated: November For Mac Users

HPC Tutorial Last updated: November For Mac Users VLAAMS SUPERCOMPUTER CENTRUM Innovative Computing for A Smarter Flanders HPC Tutorial Last updated: November 22 2018 For Mac Users Authors: Franky Backeljauw 5, Stefan Becuwe 5, Geert Jan Bex 3, Geert

More information

Introduction to High Performance Computing Using Sapelo2 at GACRC

Introduction to High Performance Computing Using Sapelo2 at GACRC Introduction to High Performance Computing Using Sapelo2 at GACRC Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu 1 Outline High Performance Computing (HPC)

More information

Knights Landing production environment on MARCONI

Knights Landing production environment on MARCONI Knights Landing production environment on MARCONI Alessandro Marani - a.marani@cineca.it March 20th, 2017 Agenda In this presentation, we will discuss - How we interact with KNL environment on MARCONI

More information

Introduction to HPC Resources and Linux

Introduction to HPC Resources and Linux Introduction to HPC Resources and Linux Burak Himmetoglu Enterprise Technology Services & Center for Scientific Computing e-mail: bhimmetoglu@ucsb.edu Paul Weakliem California Nanosystems Institute & Center

More information

XSEDE New User Tutorial

XSEDE New User Tutorial October 20, 2017 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on line survey about this module at http://bit.ly/xsedesurvey.

More information

Slurm basics. Summer Kickstart June slide 1 of 49

Slurm basics. Summer Kickstart June slide 1 of 49 Slurm basics Summer Kickstart 2017 June 2017 slide 1 of 49 Triton layers Triton is a powerful but complex machine. You have to consider: Connecting (ssh) Data storage (filesystems and Lustre) Resource

More information

New User Seminar: Part 2 (best practices)

New User Seminar: Part 2 (best practices) New User Seminar: Part 2 (best practices) General Interest Seminar January 2015 Hugh Merz merz@sharcnet.ca Session Outline Submitting Jobs Minimizing queue waits Investigating jobs Checkpointing Efficiency

More information

Cluster Clonetroop: HowTo 2014

Cluster Clonetroop: HowTo 2014 2014/02/25 16:53 1/13 Cluster Clonetroop: HowTo 2014 Cluster Clonetroop: HowTo 2014 This section contains information about how to access, compile and execute jobs on Clonetroop, Laboratori de Càlcul Numeric's

More information

UMass High Performance Computing Center

UMass High Performance Computing Center .. UMass High Performance Computing Center University of Massachusetts Medical School October, 2015 2 / 39. Challenges of Genomic Data It is getting easier and cheaper to produce bigger genomic data every

More information

CENTER FOR HIGH PERFORMANCE COMPUTING. Overview of CHPC. Martin Čuma, PhD. Center for High Performance Computing

CENTER FOR HIGH PERFORMANCE COMPUTING. Overview of CHPC. Martin Čuma, PhD. Center for High Performance Computing Overview of CHPC Martin Čuma, PhD Center for High Performance Computing m.cuma@utah.edu Spring 2014 Overview CHPC Services HPC Clusters Specialized computing resources Access and Security Batch (PBS and

More information

Our Workshop Environment

Our Workshop Environment Our Workshop Environment John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2015 Our Environment Today Your laptops or workstations: only used for portal access Blue Waters

More information

NBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen

NBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen NBIC TechTrack PBS Tutorial by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen 1 NBIC PBS Tutorial This part is an introduction to clusters and the PBS

More information

Genius - introduction

Genius - introduction Genius - introduction HPC team ICTS, Leuven 13th June 2018 VSC HPC environment GENIUS 2 IB QDR IB QDR IB EDR IB EDR Ethernet IB IB QDR IB FDR Numalink6 /IB FDR IB IB EDR ThinKing Cerebro Accelerators Genius

More information

Introduction to High-Performance Computing (HPC)

Introduction to High-Performance Computing (HPC) Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit CPU cores : individual processing units within a Storage : Disk drives HDD : Hard Disk Drive SSD : Solid

More information

Introduction to Cheyenne. 12 January, 2017 Consulting Services Group Brian Vanderwende

Introduction to Cheyenne. 12 January, 2017 Consulting Services Group Brian Vanderwende Introduction to Cheyenne 12 January, 2017 Consulting Services Group Brian Vanderwende Topics we will cover Technical specs of the Cheyenne supercomputer and expanded GLADE file systems The Cheyenne computing

More information

Using the Yale HPC Clusters

Using the Yale HPC Clusters Using the Yale HPC Clusters Stephen Weston Robert Bjornson Yale Center for Research Computing Yale University Oct 2015 To get help Send an email to: hpc@yale.edu Read documentation at: http://research.computing.yale.edu/hpc-support

More information

Crash Course in High Performance Computing

Crash Course in High Performance Computing Crash Course in High Performance Computing Cyber-Infrastructure Days October 24, 2013 Dirk Colbry colbrydi@msu.edu Research Specialist Institute for Cyber-Enabled Research https://wiki.hpcc.msu.edu/x/qamraq

More information

Introduction to the SHARCNET Environment May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology

Introduction to the SHARCNET Environment May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology Introduction to the SHARCNET Environment 2010-May-25 Pre-(summer)school webinar Speaker: Alex Razoumov University of Ontario Institute of Technology available hardware and software resources our web portal

More information

Outline. March 5, 2012 CIRMMT - McGill University 2

Outline. March 5, 2012 CIRMMT - McGill University 2 Outline CLUMEQ, Calcul Quebec and Compute Canada Research Support Objectives and Focal Points CLUMEQ Site at McGill ETS Key Specifications and Status CLUMEQ HPC Support Staff at McGill Getting Started

More information

Tech Computer Center Documentation

Tech Computer Center Documentation Tech Computer Center Documentation Release 0 TCC Doc February 17, 2014 Contents 1 TCC s User Documentation 1 1.1 TCC SGI Altix ICE Cluster User s Guide................................ 1 i ii CHAPTER 1

More information

Slurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012

Slurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012 Slurm and Abel job scripts Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012 Abel in numbers Nodes - 600+ Cores - 10000+ (1 node->2 processors->16 cores) Total memory

More information

New User Tutorial. OSU High Performance Computing Center

New User Tutorial. OSU High Performance Computing Center New User Tutorial OSU High Performance Computing Center TABLE OF CONTENTS Logging In... 3-5 Windows... 3-4 Linux... 4 Mac... 4-5 Changing Password... 5 Using Linux Commands... 6 File Systems... 7 File

More information

Introduction to HPCC at MSU

Introduction to HPCC at MSU Introduction to HPCC at MSU Chun-Min Chang Research Consultant Institute for Cyber-Enabled Research Download this presentation: https://wiki.hpcc.msu.edu/display/teac/2016-03-17+introduction+to+hpcc How

More information

How to run applications on Aziz supercomputer. Mohammad Rafi System Administrator Fujitsu Technology Solutions

How to run applications on Aziz supercomputer. Mohammad Rafi System Administrator Fujitsu Technology Solutions How to run applications on Aziz supercomputer Mohammad Rafi System Administrator Fujitsu Technology Solutions Agenda Overview Compute Nodes Storage Infrastructure Servers Cluster Stack Environment Modules

More information

Introduction to HPC Using the New Cluster at GACRC

Introduction to HPC Using the New Cluster at GACRC Introduction to HPC Using the New Cluster at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What is the new cluster

More information

Answers to Federal Reserve Questions. Training for University of Richmond

Answers to Federal Reserve Questions. Training for University of Richmond Answers to Federal Reserve Questions Training for University of Richmond 2 Agenda Cluster Overview Software Modules PBS/Torque Ganglia ACT Utils 3 Cluster overview Systems switch ipmi switch 1x head node

More information

Moab Workload Manager on Cray XT3

Moab Workload Manager on Cray XT3 Moab Workload Manager on Cray XT3 presented by Don Maxwell (ORNL) Michael Jackson (Cluster Resources, Inc.) MOAB Workload Manager on Cray XT3 Why MOAB? Requirements Features Support/Futures 2 Why Moab?

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 13 th CALL (T ier-0) Contributing sites and the corresponding computer systems for this call are: BSC, Spain IBM System x idataplex CINECA, Italy Lenovo System

More information

UoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011)

UoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011) UoW HPC Quick Start Information Technology Services University of Wollongong ( Last updated on October 10, 2011) 1 Contents 1 Logging into the HPC Cluster 3 1.1 From within the UoW campus.......................

More information