Introduction to SLURM on the High Performance Cluster at the Center for Computational Research

Cynthia Cornelius
Center for Computational Research, University at Buffalo, SUNY
701 Ellicott St, Buffalo, NY 14203
Phone: 716-881-8959  Office: B1-109
cdc at ccr.buffalo.edu
June 2013

What is SLURM? SLURM is an acronym for Simple Linux Utility for Resource Management. SLURM is a workload manager that provides a framework for job queues, the allocation of compute nodes, and the starting and execution of jobs. SLURM replaces the PBS resource manager and MAUI scheduler on the CCR clusters.

Why SLURM? SLURM provides superior scalability and performance. It can manage and allocate the compute nodes for large clusters. SLURM can accept up to 1,000 jobs per second. SLURM is a comprehensive resource manager. Individual node resources, such as GPUs or number of threads on a core, can be scheduled with SLURM.

What has changed? Outline of topics: New compute nodes. Front-end (login) machine. Home directory path. Queue names. Scheduler commands. Syntax for specifying resource requests. Task launching. Node sharing policy. Interactive job submission. Job monitoring.

CCR Cluster Compute Nodes
New! 32 16-core nodes with 128GB and Mellanox IB
New! 12-core visualization node with 256GB, 2 Nvidia Fermi GPUs and remote visualization clients
372 12-core nodes with 48GB and Qlogic IB
256 8-core nodes with 24GB and Mellanox IB
16 32-core nodes with 256GB and Qlogic IB
2 32-core nodes with 512GB and Qlogic IB
32 12-core nodes with 2 Nvidia Fermi GPUs, 48GB and Mellanox IB

Front-end and Home directory The new front-end machine is rush.ccr.buffalo.edu, a 32-core node with 256GB of memory. Home directories are /user/username on the front-end and compute nodes. The /user/username/u2 directory still exists under /user/username. No data have been lost. Users who were using the /user/username/u2 directory should copy the .bashrc, .bash_profile, and the .ssh directory to /user/username.
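For reference, a minimal sketch of that copy (the paths follow the slide above; adjust to your own username):
# copy the shell start-up files from the old u2 home into the new home
cp /user/$USER/u2/.bashrc /user/$USER/u2/.bash_profile /user/$USER/
# copy the ssh configuration directory as well
cp -r /user/$USER/u2/.ssh /user/$USER/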

SLURM Partitions All general and faculty clusters are partitions in SLURM. The general-compute partition corresponds to the u2 cluster default queue. The debug partition corresponds to the u2 debug queue. GPU nodes are requested as resources rather than residing in a separate partition. Faculty clusters have separate, access-controlled partitions.
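For example, a batch job can target a partition and request GPUs with directives like the following (gpu:2 matches the GRES shown in the sinfo output later in this document; this is a sketch, not a CCR-specific recipe):
#SBATCH --partition=general-compute
#SBATCH --gres=gpu:2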

SLURM Commands
squeue - shows the status of jobs.
sbatch - submits a script job.
salloc - submits an interactive job.
srun - runs a command across nodes.
scancel - cancels a running or pending job.
sinfo - provides information on partitions and nodes.
sview - graphical interface to view job, node, and partition information.

squeue example
squeue -u cdc
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
 4832 general-c hello_te  cdc  R  0:20     2 f16n[10-11]
Job status: R - the job is running. PD - the job is waiting for resources; the reason is usually (Resources) or (Priority). Other common states are CA (cancelled) and CD (completed).

sinfo example
sinfo -p general-compute
PARTITION       AVAIL TIMELIMIT  NODES STATE NODELIST
general-comput* up    3-00:00:00   264 idle  d07n07s[01-02],d07n08s[01-02],
Node states: alloc - all cores are in use. mix - some cores are in use and some are available. idle - the node is free; all cores are available. down - the node is down. drained - the node is offline.

sinfo example
More detailed sinfo query:
sinfo --exact --partition=general-compute --format="%15p %5a %10A %.4c %6m %6G %16f %t %N" | more
PARTITION       AVAIL NODES(A/I) CPUS MEMORY GRES   FEATURES      STATE NODELIST
general-comput* up    0/0          12 48000  gpu:2  IB,CPU-X5650  down* d05n11s[01-02],d05n12s[01-02],d05n20s[01-02],d05n21s[01-02],d05n24s[01-02],d05n25s[01-02],d05n30s[01-02],d05n31s[01-02],d05n39s[01-02],d05n40s[01-02]
general-comput* up    0/151        12 48000  (null) IB,CPU-E5645  idle  k13n17s[01-02],k13n18s[01-02],k13n19s[01-02],k13n23s[01-02],k13n24s[01-02],

sview example

sbatch example
sbatch slurmhelloworld
#!/bin/sh
#SBATCH --partition=general-compute
#SBATCH --time=00:15:00
#SBATCH --nodes=2                  [in PBS: -l nodes=2:ppn=16]
#SBATCH --ntasks-per-node=16
#SBATCH --job-name="hello_test"
#SBATCH --output test.out
#SBATCH --mail-user=username@buffalo.edu
#SBATCH --mail-type=end
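The slide shows only the #SBATCH header. A job script would normally end with the commands to run; a minimal sketch of such a body (the module name and executable are placeholders, not taken from the slides) might be:
# load an MPI environment if one is needed (module name depends on the cluster setup)
# module load intel-mpi
echo "Running on nodes: $SLURM_JOB_NODELIST"
# launch one task per allocated core across both nodes
srun ./hello_world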

Commonly used SLURM variables
$SLURM_JOB_ID - the job ID.
$SLURM_JOB_NODELIST - node list in SLURM format; for example f16n[04,06].
$SLURM_NNODES - number of nodes.
$SLURMTMPDIR - /scratch/jobid ($PBSTMPDIR in PBS).
$SLURM_SUBMIT_DIR - directory from which the job was submitted ($PBS_O_WORKDIR in PBS).
NOTE! Jobs start in $SLURM_SUBMIT_DIR.
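A short sketch of how these variables might be used inside a job script (the file names are placeholders; the copy-to-scratch pattern is illustrative, not prescribed by the slides):
# jobs already start in $SLURM_SUBMIT_DIR, so this cd is only for clarity
cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOB_ID is using $SLURM_NNODES node(s): $SLURM_JOB_NODELIST"
# stage data to the node-local scratch directory for I/O-heavy work
cp input.dat $SLURMTMPDIR/
# ... run the computation here ...
cp $SLURMTMPDIR/output.dat $SLURM_SUBMIT_DIR/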

Task Launching The number of cores/processes must be computed. The $PBS_NODEFILE does not exist; however, it can be created if necessary. The Intel MPI mpirun and mpiexec are SLURM-aware, so the $PBS_NODEFILE is not necessary. Note: mpirun does not launch properly if the nodes are undersubscribed. srun will execute a command across nodes. It can be used to generate the number of processors and a PBS-like nodelist, as well as to launch MPI computations.
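Inside a job script, this can look like the following sketch (the srun hostname trick is a common way to build a PBS-style nodefile; the file and executable names are placeholders):
# build a PBS-like nodefile, one line per task, and count the processes
srun hostname -s | sort > nodefile.$SLURM_JOB_ID
NPROCS=$(wc -l < nodefile.$SLURM_JOB_ID)
# launch the MPI program across the allocation
srun ./my_mpi_program
# or, with a SLURM-aware Intel MPI:
# mpirun -np $NPROCS ./my_mpi_program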

Node Sharing Compute nodes are shared among different jobs and users. In most cases, tasks will be limited to the number of cores and the memory specified. The integration of CPUSETs and SLURM makes this possible. A CPUSET is a Linux kernel-level mechanism that can be used to control processor and memory utilization. The --exclusive flag will request the nodes as dedicated. The nodes will not be shared.
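For example, the flag can be added to the batch script header or given on the command line:
#SBATCH --exclusive
sbatch --exclusive slurmhelloworld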

Interactive Job The salloc command requests the nodes. Once the nodes have been allocated to the job, the user can log in to the compute nodes. The user is not logged into the compute node when the job starts. Only ssh can be used to log in to the nodes assigned to a job. The job allocation can persist through logout.

Example of an Interactive Job
[cdc@rush:~]$ salloc --partition=general-compute --nodes=1 --time=01:00:00 --exclusive
salloc: Granted job allocation 54124
[cdc@rush:~]$ export | grep SLURM
declare -x SLURM_JOBID="54124"
declare -x SLURM_JOB_CPUS_PER_NODE="8"
declare -x SLURM_JOB_ID="54124"
declare -x SLURM_JOB_NODELIST="d07n35s01"
declare -x SLURM_JOB_NUM_NODES="1"
declare -x SLURM_NNODES="1"
declare -x SLURM_NODELIST="d07n35s01"
[cdc@rush:~]$ exit
exit
salloc: Relinquishing job allocation 54124
salloc: Job allocation 54124 has been revoked.
[cdc@rush:~]$

Example of an Interactive Job
[cdc@rush ~]$ salloc --partition=general-compute --nodes=1 --time=01:00:00 --exclusive &
[1] 14269
[cdc@rush ~]$ salloc: Granted job allocation 4716
[cdc@rush ~]$
Note! Placing the salloc in the background allows the allocation to persist. The user is not logged into the compute node when the job starts.
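Once the allocation is granted, the assigned nodes can be found with squeue and reached with ssh (the node name below is taken from the earlier example output; the session itself is a sketch):
[cdc@rush ~]$ squeue -u cdc
[cdc@rush ~]$ ssh d07n35s01
[cdc@d07n35s01 ~]$ hostname
d07n35s01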

Job monitoring The NEW slurmjobvis is a graphical display of the activity on the nodes. CPU, memory, and network utilization, as well as GPU utilization, are displayed. This is an improved version of ccrjobvis. Users can log in to the compute nodes in the job using ssh.

More Information and Help
CCR SLURM web page
Compute Cluster web page
Remote Visualization web page
More sample SLURM scripts can be found in the /util/slurm-scripts directory on rush.
Users can get assistance with the transition to SLURM by sending a request to ccrhelp@ccr.buffalo.edu.
Drop-in hours for assistance:
North Campus: Monday and Tuesday, 9am-11am in Bell 107
COE: Friday, 9am-11am in the Visualization Room