Using the SLURM Job Scheduler

1 Using the SLURM Job Scheduler [web] portal.biohpc.swmed.edu

2 Overview Today we're going to cover: Part I: What is SLURM, and how to use a basic set of SLURM commands. Part II: How to write an sbatch script for job submission (with demos). Part III: Things you need to know before running multi-threading, MPI, and GPU jobs. This is only an introduction, but it should give you a good start.

3 Part I : What is SLURM? Simple Linux Utility for Resource Management - Started as a simple resource manager for Linux clusters, about 500,000 lines of C code - Easy to use (e.g. run a.out on a PC; run sbatch a.out on a cluster) - Fair-share resource allocation. SLURM is the glue that lets a parallel computer execute parallel jobs - It makes a parallel computer almost as easy to use as a PC - Such jobs typically use MPI to manage communication within the parallel program

4 Part I : Role of SLURM (resource management & job scheduling)

5 Part I : BioHPC Log in via SSH to nucleus.biohpc.swmed.edu and submit jobs via SLURM. Nucleus005 is the login node; it connects the storage systems (/home2, /project, /work) to the compute nodes: 68 CPU nodes and 8 GPU nodes. You may also submit your job from a workstation or thin-client (separate training session: July 14th).

6 Part I : More about the Login Node (nucleus.biohpc.swmed.edu) Nucleus005, the login node, is the gateway of the BioHPC cluster and a shared resource. At the login node you CAN: view/move/copy/edit files, compile jobs, submit jobs via SLURM, and check job status. You CANNOT: run long-running applications/jobs, or download large data directly.

7 Part I : BioHPC Partitions (or Queues) A partition is a collection of compute nodes. BioHPC has 5 partitions -- 128GB, 256GB, 384GB, super, and GPU -- for a total of 76 compute nodes.

8 Part I : Life cycle of a job after submission Submit a job -> Pending -> Configuring (node booting) -> Running (possibly Resizing or Suspended) -> Completing -> Completed (zero exit code), Failed (non-zero exit code), Cancelled (scancel), or Timeout (time limit reached). BioHPC policy: at most 16 running CPU nodes and 2 running GPU nodes per user.

9 Part I : Set a time limit -- small jobs are easier to fit. The scheduler packs requested jobs into the available nodes over time, so a job with a modest time limit can slot into gaps between larger pending jobs. Rule of thumb: estimated compute time < user-specified time limit < 2 * estimated compute time.

10 Part I : SLURM commands - Before job submission: sinfo, squeue, sview, smap - Submit a job: sbatch, srun, salloc, sattach - While a job is running: squeue, sview, scontrol - After a job has completed: sacct Man pages are available for all commands - The --help option prints brief descriptions of all options - The --usage option prints a list of the options - Almost all options have two formats: a single-letter option (e.g. -p super) and a verbose option (e.g. --partition=super)
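
For example, the two submissions below are equivalent (myjob.sh is just a placeholder script name used for illustration):

> sbatch -p super -N 1 -t 0-01:00:00 myjob.sh
> sbatch --partition=super --nodes=1 --time=0-01:00:00 myjob.sh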

11 Part I : sinfo (reports the status of nodes and partitions) > sinfo (report status in node-oriented form) > sinfo -p 256GB (report status of nodes in the 256GB partition)

12 Part I : Use squeue and scontrol to check job status > squeue > scontrol show job {jobid}

13 Part I : Cancel a job with scancel, and use sacct to check previous jobs > scancel {jobid} > sacct -j {jobid} scontrol gives more detailed information about a job, but only for recent jobs; sacct keeps a complete history of jobs, but with only basic information
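
If the default sacct columns are not enough, a --format field list can be added; a minimal sketch (the fields shown are only an example selection):

> sacct -j {jobid} --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS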

14 Part I : List of SLURM commands sbatch : Submit a script for later execution (batch mode) salloc : Create a job allocation and start a shell to use it (interactive mode) srun : Create a job allocation (if needed) and launch a job step (typically an MPI job) sattach : Connect stdin/stdout/stderr to an existing job or job step squeue : Report job and job step status smap : Report system, job or step status with topology; less functionality than sview sview : Report/update system, job, step, partition or reservation status (GTK-based GUI) scontrol : Administrator tool to view/update system, job, step, partition or reservation status sacct : Report accounting information by individual job and job step
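
As an illustration of interactive mode, a short session might look roughly like this (partition, node count and time limit are placeholders):

> salloc -p super -N 1 -t 0-00:30:00
> srun hostname
> exit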

15 Part I : List of valid job states PENDING (PD) : Job is awaiting resource allocation RUNNING (R) : Job currently has an allocation SUSPENDED (S) : Job has an allocation, but execution has been suspended COMPLETING (CG) : Job is in the process of completing COMPLETED (CD) : Job has terminated all processes on all nodes CONFIGURING (CF) : Job has been allocated resources and is waiting for them to become ready for use CANCELLED (CA) : Job was explicitly cancelled by the user or a system administrator FAILED (F) : Job terminated with a non-zero exit code or other failure condition TIMEOUT (TO) : Job terminated upon reaching its time limit NODE_FAIL (NF) : Job terminated due to failure of one or more allocated nodes
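
These state names can also be used to filter squeue output; for example (the -u and -t options are standard squeue flags):

> squeue -u {username} -t PENDING,RUNNING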

16 Part II A little background on the filament-analysis example Testing your code before submitting to the BioHPC cluster Job submission demos: Demo 01: basic structure of an sbatch script Demo 02: submit sequential jobs Demo 03 & 04: submit parallel jobs Demo 05 & 06: submit parallel jobs with srun Demo 07: submit a job with a dependency

17 Part II : Vimentin Filament Analysis Image capture (384-well plate; DAPI, FITC, TRITC channels) -> Steerable filter -> Filament segmentation -> Network analysis (straightness, intensity, length) -> Plot straightness

18 Part II : Testing your job before submission Test your job: on your own machine, at your local workstation/thin-client, or on a reserved BioHPC compute node (CPU job: remotegui or webgui; GPU job: remotegpu or webgpu)

19 Part II : Demo 01 -- Basic structure of a SLURM script

#!/bin/bash
#SBATCH --job-name=singlematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=00-00:01:00
#SBATCH --output=single.%j.out
#SBATCH --error=single.%j.err
module add matlab
matlab -nodisplay -nodesktop -r "forbiohpctestplot(1), exit"

The first line runs the script under the bash shell; the #SBATCH lines set up the SLURM environment (the --time format is D-H:M:S); module add loads the software (exports the path and libraries); the remaining line(s) are the command(s) to be executed.
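
Assuming the script above is saved as, say, demo01.sh (the filename is arbitrary), it would be submitted and checked roughly like this:

> sbatch demo01.sh
> squeue -u {username}
> cat single.{jobid}.out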

20 Part II : More SBATCH options

#SBATCH --begin=now+1hour
Defer the allocation of the job until the specified time
#SBATCH --mail-type=ALL
Notify the user by e-mail when certain event types occur (BEGIN, END, FAIL, REQUEUE, etc.)
#SBATCH --mail-user=yi.du@utsouthwestern.edu
The e-mail address used to receive notifications of state changes as defined by --mail-type
#SBATCH --mem={megabytes}
Specify the real memory required per node, in megabytes
#SBATCH --nodelist=nucleus0[10-20]
Request a specific list of node names. The order of the node names in the list is not important; the node names will be sorted by SLURM
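
Putting a few of these together, a job header might look like the sketch below (the memory value and e-mail address are placeholders, not recommendations):

#SBATCH --job-name=myjob
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-02:00:00
#SBATCH --mem=32768
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=first.last@utsouthwestern.edu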

21 Part II : Demo 02 & Demo 03 -- submit multiple tasks to a single node: sequential tasks vs. parallel tasks (the analysis steps are Filter, Segmentation, Analysis, Plot)

#!/bin/bash
#SBATCH --job-name=matlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=00-00:01:00
#SBATCH --output=single.%j.out
#SBATCH --error=single.%j.err
module add matlab

For both sequential and parallel tasks, the SLURM environment and the software we need are the same; the difference lies in how you write your commands.

22 Part II : Demo 02 & Demo 03 -- submit multiple tasks to a single node

Demo 02: sequential tasks
# Step 1: Steerable filter
matlab -nodisplay -nodesktop -r "MDFillter(1), exit"
# Step 2: Filament segmentation
matlab -nodisplay -nodesktop -r "vimfilament(1), exit"
# Step 3: Network analysis
matlab -nodisplay -nodesktop -r "MDAnalysis(1), exit"
# Step 4: Plot straightness
matlab -nodisplay -nodesktop -r "forbiohpctestplot(1), exit"

Demo 03: parallel tasks
# send each task to the background
matlab -nodisplay -nodesktop -r "forbiohpctestplot(1), exit" &
matlab -nodisplay -nodesktop -r "forbiohpctestplot(2), exit" &
matlab -nodisplay -nodesktop -r "forbiohpctestplot(3), exit" &
# wait for the background jobs to terminate, then return
wait

23 Part II : Demo 04 -- version 2 of Demo 03

#!/bin/bash
#SBATCH --job-name=multimatlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=00-00:01:00
#SBATCH --output=multi.%j.out
#SBATCH --error=multi.%j.err
module add matlab
for i in `seq 1 16`
do
    matlab -nodisplay -nodesktop -r "forbiohpctestplot($i), exit" &
done
wait

24 Part II : How many tasks should I submit to each node? Answer: on each node, sockets : cores per socket : threads per core = 2 : 8 : 2 (each physical core provides 2 logical cores/threads), so the total number of parallel tasks available within a single node = 2*8*2 = 32 tasks/node. Both sbatch and srun create a resource allocation to run the job; in addition, srun lets the user specify on which node/core the job is to be executed.
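
You can check this layout yourself; one possible sketch uses sinfo's output-format fields for sockets, cores per socket, and threads per core:

> sinfo -p super -o "%n %X %Y %Z"    (node name, sockets, cores per socket, threads per core)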

25 Part II : Demo 05 -- submit multiple tasks to a single node with srun

#!/bin/bash
#SBATCH --job-name=srunsinglenodematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=00-00:01:00
#SBATCH --output=srunsinglenode.%j.out
#SBATCH --error=srunsinglenode.%j.err
module add matlab
srun sh script.m

--ntasks gives the total number of tasks in the current job; srun launches one copy of script.m per task.

script.m:
#!/bin/bash
matlab -nodisplay -nodesktop -r "forbiohpctestplot($SLURM_LOCALID+1), exit"

SLURM_LOCALID is an environment variable holding the node-local task ID of the process within the job (zero-based).

26 Part II : Demo 06 -- submit multiple tasks to multiple nodes with srun

#!/bin/bash
#SBATCH --job-name=srun2nodematlab
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --time=00-00:01:00
#SBATCH --output=srun2node.%j.out
#SBATCH --error=srun2node.%j.err
module add matlab
srun sh script.m

script.m:
#!/bin/bash
let ID=$SLURM_NODEID*$SLURM_NTASKS/$SLURM_NNODES+$SLURM_LOCALID+1
echo "process data $ID on `hostname`" >> namelist.txt
matlab -nodisplay -nodesktop -r "forbiohpctestplot($ID), exit"

Useful environment variables: SLURM_NODEID is the relative node ID of the current node (zero-based); SLURM_NNODES is the total number of nodes in the job's resource allocation; SLURM_NTASKS is the total number of tasks in the current job; SLURM_LOCALID is the node-local task ID of the process within the job (zero-based).
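
As a quick check of the ID arithmetic: with --nodes=2 and --ntasks=16, each node runs 16/2 = 8 tasks, so node 0 produces IDs 0*8 + (0..7) + 1 = 1..8 and node 1 produces IDs 1*8 + (0..7) + 1 = 9..16, covering data sets 1 through 16 exactly once.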

27 Part II : Demo 07 -- submit a job with a dependency > sbatch test_srun2nodes.sh > sbatch --dependency=afterok:{jobid} gen_gif.sh (gen_gif.sh stays pending until the first job finishes successfully)
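
In a script, the job ID can be captured instead of copied by hand; a small sketch using sbatch's --parsable flag (which prints only the job ID):

jid=$(sbatch --parsable test_srun2nodes.sh)
sbatch --dependency=afterok:$jid gen_gif.sh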

28 Part III Multi-threading jobs on a single node (shared memory) MPI jobs on multiple nodes (distributed memory) GPU jobs on a single node

29 Part III : Demo -- submit a multi-threading job to BioHPC

#!/bin/bash
#SBATCH --job-name=phenix
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --time=0-20:00:00
#SBATCH --output=phenix.%j.out
#SBATCH --error=phenix.%j.err
module add phenix/1.9
phenix.den_refine model.pdb data.mtz nproc=30

Q: How big is your data? Choose the proper partition so your data fits in memory.
Q: What is your software's limit on the number of threads? Our cluster's limit is 32 threads/node; use whichever is smaller.
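
A common alternative for a single multi-threaded program is to request one task with many CPUs via --cpus-per-task; a sketch of that layout (not the layout used on the slide above), with the application's thread count matching the CPUs requested:

#!/bin/bash
#SBATCH --job-name=phenix
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=30
#SBATCH --time=0-20:00:00
module add phenix/1.9
phenix.den_refine model.pdb data.mtz nproc=30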

30 Part III : Demo -- submit an MPI job

#!/bin/bash
#SBATCH --job-name=mpi_relion
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-80:00:00
#SBATCH --output=mpi_relion.%j.out
module add relion/gcc/1.3
module add mvapich2/gcc/1.9
mpirun relion_refine_mpi --o Class3D_2classes/run5 --i all_particles.star --particle_diameter ... --angpix ... --ref run1_c1_sort_it025_class001.mrc --firstiter_cc --ini_high 30 --iter ... --tau2_fudge 4 --flatten_solvent --zero_mask --ctf --ctf_corrected_ref --ctf_phase_flipped --sym C1 --K 2 --oversampling 1 --healpix_order 3 --offset_range 5 --offset_step 2 --norm --scale --j 4 --dont_combine_weights_via_disc --dont_centralize_image_reading --limit_tilt 30 --helix ... (further helical-refinement options)

Q: How big is your data? Choose the partition and the number of nodes so your data fits: total tasks / number of nodes <= 32, and memory needed per task * tasks on each node <= 128GB/256GB/384GB.
Q: What is the maximum speed-up you could achieve?
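
To make the per-node arithmetic concrete, a generic MPI submission might look like the sketch below (the program name is a placeholder); --ntasks-per-node is a standard sbatch option that caps how many of the tasks land on each node, keeping it at or below the 32-task limit:

#!/bin/bash
#SBATCH --job-name=my_mpi_job
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=32
#SBATCH --time=0-08:00:00
module add mvapich2/gcc/1.9
mpirun ./my_mpi_program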

31 Part III : Demo -- submit a GPU job

#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=0-00:10:00
#SBATCH --output=cuda.%j.out
#SBATCH --error=cuda.%j.err
module add cuda65
./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320

Jobs will not be allocated any generic resources unless they are specifically requested at job submission time, using the --gres option supported by sbatch and srun. Format: --gres=gpu:[n], where n is the number of GPUs. Use the GPU partition. The matrixMul example multiplies A (320 x 10240) by B (10240 x 320).
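
If a job needs two GPUs on a node, only the request changes (--gres=gpu:2). Inside the job it can be useful to confirm what was allocated; a small sketch, assuming the cluster's GRES configuration exports CUDA_VISIBLE_DEVICES to the job as is typical:

#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --time=0-00:10:00
module add cuda65
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"
nvidia-smi
./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320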

32 Getting Effective Help -- use the ticket system. What is the problem? Provide any error messages and diagnostic output you have. When did it happen? What time, on the cluster or a client, and with what job ID? How did you run it? What did you run, with what parameters, and what do they mean? Any unusual circumstances? Have you compiled your own software? Do you customize startup scripts? Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.
