June Workshop Series June 27th: All About SLURM University of Nebraska Lincoln Holland Computing Center. Carrie Brown, Adam Caprez
2 Setup Instructions
Please complete these steps before the lessons start at 1:00 PM.
Setup instructions:
If you need to use a demo account, please speak with one of the helpers.
If you need help with the setup, please put a red sticky note at the top of your laptop.
When you are done with the setup, please put a green sticky note at the top of your laptop.
3 June Workshop Series Schedule
June 6th: Introductory Bash
June 13th: Advanced Bash and Git
June 20th: Introductory HCC
June 27th: All about SLURM - Learn all about the Simple Linux Utility for Resource Management (SLURM), HCC's workload manager (scheduler), and how to select the best options to streamline your jobs.
Upcoming Software Carpentry Workshops:
UNL: HCC Kickstart (Bash, Git and HCC Basics) - September 5th and 6th
UNO: Software Carpentry (Bash, Git and R) - October 16th and 17th
4 Logistics
Name tags, sign-in sheet
Sticky notes: Red = need help, Green = all good
Link to Workshop Materials:
Etherpad:
Terminal commands are in this font
Any entries surrounded by <brackets> need to be filled in with information
Example: <username>@crane.unl.edu becomes demo01@crane.unl.edu if your username is demo01.
Today we will be using the reservation hccjune for all jobs.
Make sure your submit scripts include the line: #SBATCH --reservation=hccjune
5 What is a Cluster?
6 Exercises
1. If you aren't already, connect to the Crane cluster.
2. Navigate to your $WORK directory.
3. If you were not here last week, or do not have the tutorial directory, clone the files to your $WORK directory with the command: git clone
4. Make a new directory inside the tutorial directory (./HCCWorkshops/) named slurm. This is where we will put all of our tutorial files for today.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
7 SLURM
Simple Linux Utility for Resource Management
Open source, scalable cluster management and job scheduling system
Used on ~60% of the TOP500 supercomputers
3 key functions:
- Allocates exclusive or non-exclusive access to resources
- Provides a framework for starting, executing, and monitoring work
- Manages a queue of pending jobs
Uses a best-fit algorithm to assign tasks
Fair Tree fairshare algorithm
8 Slurm vs PBS
Task | PBS/SGE Command | Slurm Equivalent
Submit a job | qsub <script_file> | sbatch <script_file>
Cancel a job | qdel <job_id> | scancel <job_id>
Check the status of a job | qstat <job_id> | squeue -j <job_id>
Check the status of all jobs by user | qstat -u <user_name> | squeue -u <user_name>
Hold a job | qhold <job_id> | scontrol hold <job_id>
Release a job | qrls <job_id> | scontrol release <job_id>
More commands and schedulers:
9 sinfo
Shows a listing of all partitions on a cluster
Use #SBATCH --partition=<partition_name>
All partitions have a 7 day run-time limitation
Publicly available partitions:
Partition | Description | Limitations | Clusters
batch | Default partition | 2000 max CPUs per user | Crane, Tusker
guest | Uses free time on owned or leased Infiniband (IB) or Omni-Path Architecture (OPA) nodes | Pre-emptable; max 158 IB CPUs and 2000 OPA CPUs per user | Crane
highmem | High memory nodes (512 and 1024 GB) | 192 max CPUs per user | Tusker
gpu_k20 | GPU nodes with 3x Tesla K20m per node, with IB | 48 max CPUs per user | Crane
gpu_m2070 | GPU nodes with 2x Tesla M2070 per node, non-IB | 48 max CPUs per user | Crane
gpu_p100 | GPU nodes with 2x Tesla P100 per node, with OPA | 40 max CPUs per user | Crane
10 Fair Tree Fairshare Algorithm
Fair Tree prioritizes users such that if accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.
Benefits:
- All users in a higher priority account receive a higher fairshare factor than all users from a lower priority account
- Users in a more active group have lower priority than users in a less active group
- Users are sorted and ranked to prevent precision loss
- Priority is calculated based on rank, not directly off of the Level FS value
- New jobs are immediately assigned a priority
User ranking is recalculated at 5 minute intervals.
11 Calculation of Level FS (LF)

LF = S / U

Where:
S = Shares Norm: assigned shares, normalized to the shares assigned to itself and its siblings:
  S = S_raw(self) / S_raw(self + siblings)
U = Effective Usage: usage, normalized to the usage of the account and its siblings:
  U = U_raw(self) / U_raw(self + siblings)
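The ratio can be sketched numerically; the raw share and usage values below are made up purely for illustration:

```shell
# Made-up raw values: the account holds 100 of the 300 shares at its level (S),
# and 50 of the 100 units of usage among itself and its siblings (U).
awk 'BEGIN {
  S = 100 / 300                       # Shares Norm
  U = 50  / 100                       # Effective Usage
  printf "S=%.4f U=%.4f LF=%.4f\n", S, U, S / U
}'
# → S=0.3333 U=0.5000 LF=0.6667
```

Here LF < 1 because the account has used more than its normalized share, so its priority drops relative to its siblings.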
12 Fairshare Algorithm
Example tree: root → groups (gprof1, gprof2) → users (uprof1, ustudent3, uprof2, ucollab78, uphd17)
Uses a rooted plane tree (aka rooted ordered tree), sorted by Level FS descending from left to right.
The tree is traversed depth-first; users are assigned a rank and given a fairshare factor.
Process:
1. Calculate Level FS for the subtree's children
2. Sort the children of the subtree
3. Visit children in descending order and assign each a fairshare factor
fairshare factor = rank / total # of users
13 Exercises
1. You can check on the share division and usage on Holland clusters with the sshare command. The output of this command can be quite long; combine it with head or grep to see individual portions of it.
- Can you write a command so you only see the first 10 lines of output?
- Modify the previous command to use grep to find your user and group information.
- Compare the amount of your EffectvUsage to your NormShares. Have you used more than your NormShares? How about your group overall? How does the group's EffectvUsage compare to the NormShares?
2. The sshare argument -l shows extended output, including the current calculated LevelFS values. Repeat the steps in #1, but with the -l argument this time.
- How does your LevelFS value compare to your group's LevelFS value?
- Does the calculated LevelFS value correspond to the differences you observed in EffectvUsage?
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
14 sbatch
Used to asynchronously submit a batch job to execute on allocated resources.
Sequence of events:
1. User submits a script via sbatch
2. When resources become available, they are allocated to the job
3. The script is executed on one node (the master node)
- The script must launch other tasks on allocated nodes
- STDOUT and STDERR are captured and redirected to the output file(s)
4. When the script terminates, the allocation is released
- Any non-zero exit code will be interpreted as a failure
15 Submit Scripts
Shebang: The shebang tells Slurm what interpreter to use for this file. This one is for the shell (Bash).
Name of the submit file: This can be anything. Here we are using invert_single.slurm; the .slurm extension makes it easy to recognize that this is a submit file.
Commands: Any commands after the SBATCH lines will be executed by the interpreter specified in the shebang, similar to what would happen if you were to type the commands interactively.
SBATCH options: These must be immediately after the shebang and before any commands. The only required SBATCH options are time, nodes, and mem, but there are many that you can use to fully customize your allocation.
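The slide walks through an example file; a sketch of what invert_single.slurm might look like (the resource values, module name, and MATLAB command are assumptions, not the workshop's actual file):

```shell
# Write a hypothetical submit script showing the structure described above:
# shebang first, SBATCH options next, commands last.
cat > invert_single.slurm <<'EOF'
#!/bin/bash
#SBATCH --time=00:30:00            # required: walltime
#SBATCH --nodes=1                  # required: node count
#SBATCH --mem=4gb                  # required: memory per node
#SBATCH --job-name=invert_single
#SBATCH --output=invert_single.%j.out

# Module loads go right after the SBATCH lines (hypothetical module/version)
module load matlab/r2017a

# Commands are run by the interpreter named in the shebang
matlab -nodisplay -r "invertrand; exit"
EOF
```

On the cluster you would then submit it with `sbatch invert_single.slurm`.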
16 Submit Files Best Practices
- Put all module loads immediately after the SBATCH lines: quickly locate what modules and versions were used.
- Specify versions on module loads: allows you to see what versions were used during the analysis.
- Use a separate submit file for each analysis: instead of editing and resubmitting a submit file, copy a previous one and make changes to it. This keeps a running record of your analyses.
- Redirect output and error to separate files: allows you to see quickly whether a job completed with errors or not.
- Separate individual workflow steps into individual jobs: avoid putting too many steps into a single job.
17 Shebang! - Interpreters
- Must be included in the first line of the submit script
- Must be an absolute path
- Specifies which program is used to execute the contents of the script
The shebang in the submit file can be one of the following:
#!/bin/bash - the most common shell and also the default shell at HCC
#!/bin/csh - symlink to tcsh
#!/usr/bin/perl
#!/usr/bin/python
Using Perl or Python interpreters can make loading modules difficult.
Scripts that return anything but 0 will be interpreted as a failed job by Slurm.
18 Common SBATCH Options
Option | What it does
--nodes | Number of nodes requested
--time | Maximum walltime for the job in DD-HH:MM:SS format; maximum of 7 days on the batch partition
--mem | Real memory (RAM) required per node; can use KB, MB, and GB units (default is MB). Request less memory than the total available on the node: the maximum available on a 512 GB RAM node is 500 GB, and on a 256 GB RAM node is 250 GB
--ntasks-per-node | Number of tasks per node; used to request a specific number of cores
--mem-per-cpu | Minimum memory required per allocated CPU (default is 1 GB)
--output | Filename where all STDOUT will be directed (default is slurm-<jobid>.out)
--error | Filename where all STDERR will be directed (default is slurm-<jobid>.out)
--job-name | How the job will show up in the queue
For more information: sbatch --help
SLURM Documentation:
19 scancel
Used to cancel jobs prior to completion.
Usage: scancel <job_id>
Use other arguments to cancel multiple jobs at once, or combine them with a job id to prevent accidentally canceling the wrong job.
Other arguments:
--name=<job_name> - cancel jobs with this name
--partition=<partition> - cancel jobs in this partition
--user=<user_name> - cancel jobs of this user
--state=<job_state> - cancel jobs in this state (valid states: PENDING, RUNNING, and SUSPENDED)
20 Short qos
Increases a job's priority, allowing it to run as soon as possible.
Useful for testing and developmental work.
Limitations:
- 6 hour runtime
- 1 job of 16 CPUs or fewer
- Max of 2 jobs per user
- Max of 256 CPUs in use for all short jobs from all users
To use, include this line in your submit script: #SBATCH --qos=short
For more information:
21 Exercise
1. Write a submit script from scratch (no copying previous ones!). The script should use the following parameters:
- Uses 1 node
- Uses 10 GB RAM
- 10 minutes runtime
- Executes the command: echo "I can write submit scripts!"
Submit your script and watch for output. If you run into errors, copy the error to the Etherpad. If you were able to fix the error, add a brief note explaining how you did.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
22 Exercise Solution
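One possible solution, as a sketch (the job name and output filename are assumptions; only the node, memory, time, and echo requirements come from the exercise):

```shell
# Write a submit script meeting the exercise's requirements:
# 1 node, 10 GB RAM, 10 minutes, one echo command.
cat > solution.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=10gb
#SBATCH --time=00:10:00
#SBATCH --job-name=first-script
#SBATCH --output=first-script.%j.out

echo "I can write submit scripts!"
EOF

# On the cluster, submit with: sbatch solution.slurm
# (Since #SBATCH lines are just comments to bash, running the script
# directly also prints the message.)
bash solution.slurm
# → I can write submit scripts!
```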
23 squeue
- Job ID: The ID number assigned to your job by Slurm
- Name: The name you gave the job, as specified in the submit script
- Time: The length of time the job has been running
- Nodes: The number of nodes the job is running on
- Partition: The partition the job is running on or assigned to
- User: The user that owns the job
- State: The current status of the job. Common states include:
  CD - Completed
  CA - Canceled
  F - Failed
  PD - Pending
  R - Running
- Nodelist: If the job is running, the names of the nodes the job is running on; if the job is pending, the reason the job is pending
For more information:
24 Common Job Reason Codes
Reason | Description
Dependency | This job is waiting for a dependent job to complete.
NodeDown | A node required by the job is down.
PartitionDown | The partition (queue) required by this job is in a DOWN state and is temporarily accepting no jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up.
Priority | One or more higher priority jobs exist for this partition or advanced reservation. Other jobs in the queue have higher priority than yours.
ReqNodeNotAvail | No nodes can be found satisfying your limits, for instance because maintenance is scheduled and the job cannot finish before it.
Reservation | The job is waiting for its advanced reservation to become available.
More information: squeue --help
25 Common squeue Options
Option | Displays information about
-j <job_list> | specified job(s) *
-u <user_name> / --user=<user_name> | jobs owned by the specified user_name(s) *
-p <part_list> | jobs in the specified partition(s) *
-t <state_list> | jobs in the specified state(s) {PD, R, S, CG, CD, CF, CA, F, TO, PR, NF} *
-i <interval> / --iterate=<interval> | jobs repeatedly reported at intervals (in seconds)
-S <sort_list> / --sort=<sort_list> | jobs sorted by the specified field(s) *
--start | pending jobs and their scheduled start times
* Indicates arguments that can take a comma-separated list
For more options:
26 Exercise
1. Use the squeue command to determine the following. Hint: don't forget about wc -l
- How many jobs are currently running?
- How many jobs are currently pending?
- The grid partition is composed of resources that are made available to the Open Science Grid. How many jobs are currently in the queue for this partition?
- How many jobs are currently in the queue for the user root?
2. Edit the submit script you made previously. Add the following command to execute after the echo command: sleep 120
Submit the updated script file and monitor its progress with squeue. If it is pending for a while, use --start to see how much longer until it is expected to start. How accurate was the estimate?
Can you guess what sleep does just by how your job changes? If not, take a look at the documentation (sleep --help).
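On the cluster these counts come straight from squeue; the counting pipeline can be sketched against a small sample of made-up squeue output (on Crane you would pipe real `squeue` output instead):

```shell
# Made-up squeue output so the pipeline can run without a live cluster.
squeue_output='  JOBID PARTITION     NAME     USER ST       TIME  NODES
    101     batch     job1   demo01  R       5:02      1
    102     batch     job2   demo01 PD       0:00      1
    103      grid     job3     root PD       0:00      2
    104     batch     job4   demo02  R      12:44      1'

# Filter on the state column (field 5), then count matching lines with wc -l.
running=$(echo "$squeue_output" | awk '$5 == "R"' | wc -l)
pending=$(echo "$squeue_output" | awk '$5 == "PD"' | wc -l)
echo "Running: $running, Pending: $pending"
```

With real output, `squeue -t R | wc -l` and `squeue -t PD | wc -l` do the same job (minus one for the header line).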
27 Customizing squeue output
Use the --Format argument (must be capitalized).
Fields you want displayed are specified in a comma-separated list, without spaces, after the argument.
Fields of note: priority, reason, dependency, eligibletime, endtime, state / statecompact, submittime
Even more customization options are available for --Format and the --format flag; check out man squeue for more information.
28 Environmental Variables and Replacement Symbols
Environmental Variables:
- Can be used in the command section of a submit file (passed to scripts or programs via arguments)
- Cannot be used within an #SBATCH directive; use replacement symbols instead
Environment Variable | Description
SLURM_JOB_ID | batch job id assigned by Slurm upon submission
SLURM_JOB_NAME | user-assigned job name
SLURM_NNODES | number of nodes
SLURM_NODELIST | list of nodes
SLURM_NTASKS | total number of tasks
SLURM_QUEUE | queue (partition)
SLURM_SUBMIT_DIR | directory of submission
SLURM_TASKS_PER_NODE | number of tasks per node
Replacement Symbols:
Symbol | Value
%A | Job array's master job allocation number
%a | Job array ID (index) number
%j | Job allocation number (job id)
%N | Node name; will be replaced by the name of the first node in the job (the one that runs the script)
%u | User name
%% | The character %
A number can be placed between % and the following character to zero-pad the result. For example, job%9j.out zero-pads the job id to 9 digits.
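Slurm performs the padding itself when it expands %9j, but the effect can be previewed with plain printf (nothing Slurm-specific here; the job id 123 is made up):

```shell
# printf's %09d pads to 9 digits with leading zeros,
# mirroring what Slurm does for an output pattern like job%9j.out
printf 'job%09d.out\n' 123
# → job000000123.out
```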
29 Additional sbatch Options
Argument | Details
--begin=<time> | The controller will wait to allocate the job until the specified time. Specific time: HH:MM:SS; specific date: MMDDYY, MM/DD/YY, or YYYY-MM-DD; specific date and time: YYYY-MM-DD[THH:MM:SS]. The keywords now, today, and tomorrow can be used. Can also be relative, in the format now+<time>
--deadline=<time> | Remove the job if it cannot finish before the deadline. Valid time formats: HH:MM[:SS] [AM|PM]; MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]; MM/DD[/YY]-HH:MM[:SS]; YYYY-MM-DD[THH:MM[:SS]]
--hold | Will hold the job in a held state until released manually using the command scontrol release <job_id>
--immediate | Will only release the job if the resources are immediately available
--mail-type=<type> | Notify user by email when certain event types occur. Valid types include: BEGIN, END, FAIL, ALL, TIME_LIMIT, TIME_LIMIT_X (when X% of the time is up, where X is 90, 80, or 50)
--mail-user=<user_email> | Specify an email address to send event notifications to
--open-mode=<append|truncate> | Specify how to open output files (default is truncate)
--test-only | Validates the script and returns a starting estimate based on the current queue and job requirements. Does not submit the job
--tmp=<MB> | Minimum amount of temporary disk space on the allocated node
30 3. \ Exercises 1. Edit the submit script you created previous to: Include at least two of the additional options we discussed. Submit the script to see how they work. Try changing some of the parameters (number of nodes, memory, or time) and use the #SBATCH --testonly argument to see how the estimated start time changes. Which parameter seems to affect it the most? 2. Using the cd command, navigate to the matlab directory inside of HCCWorkshops. Use less to view the contents of the invertrand.submit file. Can you find all of the environmental variables and replacement symbols used? What role do each of them play in this script? 4. Navigate back into the directory which contains the submit script you made today. Edit the script to include one environmental variable and one replacement symbol. Submit the script and check to see if your changes worked the way you expected. Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
31 Array Job Submissions
Submits a specified number of identical jobs.
Use environmental variables and replacement symbols to separate output.
Usage: #SBATCH --array=<array numbers or ranges>
The array list can be any combination of the following:
- A comma-separated list of values. #SBATCH --array=1,5,10 submits 3 array jobs with array ids 1, 5, 10
- A range of values with a - separator. #SBATCH --array=0-5 submits 6 array jobs with array ids 0, 1, 2, 3, 4, 5
- A range of values with a : to indicate a step value. #SBATCH --array=1-9:2 submits 5 array jobs with array ids 1, 3, 5, 7, 9
- A % to specify the maximum number of simultaneous tasks (default is 1000). #SBATCH --array=1-10%4 submits 10 array jobs with at most 4 running simultaneously
To cancel array jobs:
- Usage: scancel <job_id>_<array numbers>
- Cancel all array jobs: scancel <job_id>
- Cancel single array ids: scancel <job_id>_<array id>
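A minimal array-job sketch tying the pieces together (the filenames, resource values, and 1-9:2 range are illustrative; SLURM_ARRAY_TASK_ID is set by Slurm for each task, and %A/%a keep each task's output separate):

```shell
# Hypothetical array submit script: 5 tasks with ids 1, 3, 5, 7, 9.
cat > array_demo.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --array=1-9:2
#SBATCH --output=array_demo.%A_%a.out   # %A = master job id, %a = array id

# Every task runs this same script; SLURM_ARRAY_TASK_ID tells it which one it is
echo "I am array task $SLURM_ARRAY_TASK_ID"
EOF

# On the cluster: sbatch array_demo.slurm
# Locally we can preview one task by setting the variable ourselves:
SLURM_ARRAY_TASK_ID=3 bash array_demo.slurm
# → I am array task 3
```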
32 Exercises
1. Specify how many jobs these commands will create. What are their array ids? How many will run simultaneously?
- #SBATCH --array=5-10
- #SBATCH --array=0-4,15-20
- #SBATCH --array=1,3-10:2
- #SBATCH --array=0-20:2%10
2. When we looked at the output of the example array job, the output was not in numeric order. Can you think of a reason why that happens?
3. Edit the example array job to do the following:
- Run 15 array tasks, each one with an odd array id
- Run 5 array tasks, each one with a unique 3 digit id
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
33 Job Dependencies
Allows you to queue multiple jobs that depend on the completion of one or more previous jobs.
When submitting the job, use the -d argument followed by a specification of which jobs and when to execute: <when_to_execute>:<job_id>
- After successful completion: afterok:<job_id>
- After non-successful completion: afternotok:<job_id>
- Multiple job ids can be specified, separated with colons: afterok:<job_id1>:<job_id2>
Dependent jobs can use output and files created by previous jobs.
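Chaining jobs requires the job id from each sbatch call. One common pattern is to parse it out of sbatch's "Submitted batch job <id>" message (sketch: the message is simulated here so the parsing step can run without a cluster, and the JobA/JobB script names come from the exercise):

```shell
# sbatch prints "Submitted batch job <id>" on success; simulate that message.
submit_msg="Submitted batch job 12345"

# The job id is the 4th whitespace-separated field.
jid=$(echo "$submit_msg" | awk '{print $4}')
echo "captured job id: $jid"

# On the cluster the full pattern would be (not run here):
#   jidA=$(sbatch JobA.submit | awk '{print $4}')
#   sbatch -d afterok:$jidA JobB.submit
```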
34 Exercises
1. Copy the JobB.submit script, calling the new one JobC.submit, and edit the contents accordingly (replace all instances of B with C). Using sbatch, queue JobA. Then queue JobB and JobC, setting them both to begin after the successful completion of JobA.
2. Using the previous three submit scripts, create a new submit script which will do the following:
- Combine the output from both JobB and JobC into a text file called JobD.txt
- Add the line "Sample job D output" to this new text file
3. Using these four submit scripts, run them so the jobs trigger in the order shown in the diagram to the right.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
35 Exercise Solution
36 srun
Used to synchronously submit a single command.
Commonly used to start interactive sessions.
Sequence of events:
1. User submits a command for execution. It may include command line arguments and will be executed exactly as specified
2. If an allocation exists, the job executes immediately; otherwise, the job will block until a new allocation is established
3. n identical copies of the command are run simultaneously on the allocated resources as individual tasks. --pty induces pseudo-terminal mode: input and output are directed to the user's shell
4. Once all tasks terminate, the srun session will terminate. If the allocation was created with srun, it will be released
37 Using srun to monitor batch jobs
1. Connect to the node running the job:
srun --jobid=<job_id> --pty bash (or top)
srun --nodelist=<node_id> --pty bash (or top)
2. Monitor:
top (if not already running)
- Use to monitor core use; ideal for multi-core processes
- Press u to search for your username
cat /cgroup/memory/slurm/uid_<uid>/job_<job_id>/memory.max_usage_in_bytes
- Use to monitor memory use
- To determine your uid, use: id -u <user_name>
- Combine with watch -n <interval> to specify a refresh interval (default is 2 seconds)
- CTRL + C to exit
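The cgroup path above is built from your numeric uid and the job id; assembling it can be sketched as follows (the job id 12345 is made up, and the path itself only exists on a compute node running your job):

```shell
# Find your numeric uid (works anywhere)
uid=$(id -u)

# Assemble the memory-usage path for a hypothetical job id
job_id=12345
mem_file="/cgroup/memory/slurm/uid_${uid}/job_${job_id}/memory.max_usage_in_bytes"
echo "$mem_file"

# On the compute node you would then watch it refresh, e.g.:
#   watch -n 5 cat "$mem_file"
```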
User Guide of High Performance Computing Cluster in School of Physics Prepared by Sue Yang (xue.yang@sydney.edu.au) This document aims at helping users to quickly log into the cluster, set up the software
More informationNBIC TechTrack PBS Tutorial
NBIC TechTrack PBS Tutorial by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen Visit our webpage at: http://www.nbic.nl/support/brs 1 NBIC PBS Tutorial
More informationPBS Pro Documentation
Introduction Most jobs will require greater resources than are available on individual nodes. All jobs must be scheduled via the batch job system. The batch job system in use is PBS Pro. Jobs are submitted
More informationECE 574 Cluster Computing Lecture 4
ECE 574 Cluster Computing Lecture 4 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 31 January 2017 Announcements Don t forget about homework #3 I ran HPCG benchmark on Haswell-EP
More informationBash for SLURM. Author: Wesley Schaal Pharmaceutical Bioinformatics, Uppsala University
Bash for SLURM Author: Wesley Schaal Pharmaceutical Bioinformatics, Uppsala University wesley.schaal@farmbio.uu.se Lab session: Pavlin Mitev (pavlin.mitev@kemi.uu.se) it i slides at http://uppmax.uu.se/support/courses
More informationSlurm at UPPMAX. How to submit jobs with our queueing system. Jessica Nettelblad sysadmin at UPPMAX
Slurm at UPPMAX How to submit jobs with our queueing system Jessica Nettelblad sysadmin at UPPMAX Free! Watch! Futurama S2 Ep.4 Fry and the Slurm factory Simple Linux Utility for Resource Management Open
More informationA Hands-On Tutorial: RNA Sequencing Using High-Performance Computing
A Hands-On Tutorial: RNA Sequencing Using Computing February 11th and 12th, 2016 1st session (Thursday) Preliminaries: Linux, HPC, command line interface Using HPC: modules, queuing system Presented by:
More informationTroubleshooting Jobs on Odyssey
Troubleshooting Jobs on Odyssey Paul Edmon, PhD ITC Research CompuGng Associate Bob Freeman, PhD Research & EducaGon Facilitator XSEDE Campus Champion Goals Tackle PEND, FAIL, and slow performance issues
More informationChoosing Resources Wisely Plamen Krastev Office: 38 Oxford, Room 117 FAS Research Computing
Choosing Resources Wisely Plamen Krastev Office: 38 Oxford, Room 117 Email:plamenkrastev@fas.harvard.edu Objectives Inform you of available computational resources Help you choose appropriate computational
More informationUsing Compute Canada. Masao Fujinaga Information Services and Technology University of Alberta
Using Compute Canada Masao Fujinaga Information Services and Technology University of Alberta Introduction to cedar batch system jobs are queued priority depends on allocation and past usage Cedar Nodes
More informationIntroduction to the Cluster
Follow us on Twitter for important news and updates: @ACCREVandy Introduction to the Cluster Advanced Computing Center for Research and Education http://www.accre.vanderbilt.edu The Cluster We will be
More informationAnswers to Federal Reserve Questions. Training for University of Richmond
Answers to Federal Reserve Questions Training for University of Richmond 2 Agenda Cluster Overview Software Modules PBS/Torque Ganglia ACT Utils 3 Cluster overview Systems switch ipmi switch 1x head node
More informationIntroduction to High-Performance Computing (HPC)
Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit cores : individual processing units within a CPU Storage : Disk drives HDD : Hard Disk Drive SSD : Solid
More informationIntroduction to UBELIX
Science IT Support (ScITS) Michael Rolli, Nico Färber Informatikdienste Universität Bern 06.06.2017, Introduction to UBELIX Agenda > Introduction to UBELIX (Overview only) Other topics spread in > Introducing
More informationIntroduction to GALILEO
Introduction to GALILEO Parallel & production environment Mirko Cestari m.cestari@cineca.it Alessandro Marani a.marani@cineca.it Domenico Guida d.guida@cineca.it Maurizio Cremonesi m.cremonesi@cineca.it
More informationUsing Cartesius and Lisa. Zheng Meyer-Zhao - Consultant Clustercomputing
Zheng Meyer-Zhao - zheng.meyer-zhao@surfsara.nl Consultant Clustercomputing Outline SURFsara About us What we do Cartesius and Lisa Architectures and Specifications File systems Funding Hands-on Logging
More information1 Bull, 2011 Bull Extreme Computing
1 Bull, 2011 Bull Extreme Computing Table of Contents Overview. Principal concepts. Architecture. Scheduler Policies. 2 Bull, 2011 Bull Extreme Computing SLURM Overview Ares, Gerardo, HPC Team Introduction
More informationLinux Tutorial. Ken-ichi Nomura. 3 rd Magics Materials Software Workshop. Gaithersburg Marriott Washingtonian Center November 11-13, 2018
Linux Tutorial Ken-ichi Nomura 3 rd Magics Materials Software Workshop Gaithersburg Marriott Washingtonian Center November 11-13, 2018 Wireless Network Configuration Network Name: Marriott_CONFERENCE (only
More informationGrid Engine Users Guide. 5.5 Edition
Grid Engine Users Guide 5.5 Edition Grid Engine Users Guide : 5.5 Edition Published May 08 2012 Copyright 2012 University of California and Scalable Systems This document is subject to the Rocks License
More informationAn Introduction to Gauss. Paul D. Baines University of California, Davis November 20 th 2012
An Introduction to Gauss Paul D. Baines University of California, Davis November 20 th 2012 What is Gauss? * http://wiki.cse.ucdavis.edu/support:systems:gauss * 12 node compute cluster (2 x 16 cores per
More informationImage Sharpening. Practical Introduction to HPC Exercise. Instructions for Cirrus Tier-2 System
Image Sharpening Practical Introduction to HPC Exercise Instructions for Cirrus Tier-2 System 2 1. Aims The aim of this exercise is to get you used to logging into an HPC resource, using the command line
More informationIntroduction to PICO Parallel & Production Enviroment
Introduction to PICO Parallel & Production Enviroment Mirko Cestari m.cestari@cineca.it Alessandro Marani a.marani@cineca.it Domenico Guida d.guida@cineca.it Nicola Spallanzani n.spallanzani@cineca.it
More informationGPU Cluster Usage Tutorial
GPU Cluster Usage Tutorial How to make caffe and enjoy tensorflow on Torque 2016 11 12 Yunfeng Wang 1 PBS and Torque PBS: Portable Batch System, computer software that performs job scheduling versions
More informationTraining day SLURM cluster. Context Infrastructure Environment Software usage Help section SLURM TP For further with SLURM Best practices Support TP
Training day SLURM cluster Context Infrastructure Environment Software usage Help section SLURM TP For further with SLURM Best practices Support TP Context PRE-REQUISITE : LINUX connect to «genologin»
More informationFor Dr Landau s PHYS8602 course
For Dr Landau s PHYS8602 course Shan-Ho Tsai (shtsai@uga.edu) Georgia Advanced Computing Resource Center - GACRC January 7, 2019 You will be given a student account on the GACRC s Teaching cluster. Your
More informationKimmo Mattila Ari-Matti Sarén. CSC Bioweek Computing intensive bioinformatics analysis on Taito
Kimmo Mattila Ari-Matti Sarén CSC Bioweek 2018 Computing intensive bioinformatics analysis on Taito 7. 2. 2018 CSC Environment Sisu Cray XC40 Massively Parallel Processor (MPP) supercomputer 3376 12-core
More informationRunning Jobs on Blue Waters. Greg Bauer
Running Jobs on Blue Waters Greg Bauer Policies and Practices Placement Checkpointing Monitoring a job Getting a nodelist Viewing the torus 2 Resource and Job Scheduling Policies Runtime limits expected
More informationCompiling applications for the Cray XC
Compiling applications for the Cray XC Compiler Driver Wrappers (1) All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers
More informationGraham vs legacy systems
New User Seminar Graham vs legacy systems This webinar only covers topics pertaining to graham. For the introduction to our legacy systems (Orca etc.), please check the following recorded webinar: SHARCNet
More informationOpenPBS Users Manual
How to Write a PBS Batch Script OpenPBS Users Manual PBS scripts are rather simple. An MPI example for user your-user-name: Example: MPI Code PBS -N a_name_for_my_parallel_job PBS -l nodes=7,walltime=1:00:00
More informationCOSC 6374 Parallel Computation. Debugging MPI applications. Edgar Gabriel. Spring 2008
COSC 6374 Parallel Computation Debugging MPI applications Spring 2008 How to use a cluster A cluster usually consists of a front-end node and compute nodes Name of the front-end node: shark.cs.uh.edu You
More informationLAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012 1 Discovery
More informationUsing the SLURM Job Scheduler
Using the SLURM Job Scheduler [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2015-05-13 Overview Today we re going to cover: Part I: What is SLURM? How to use a basic
More informationContents. Note: pay attention to where you are. Note: Plaintext version. Note: pay attention to where you are... 1 Note: Plaintext version...
Contents Note: pay attention to where you are........................................... 1 Note: Plaintext version................................................... 1 Hello World of the Bash shell 2 Accessing
More informationCSC209H Lecture 1. Dan Zingaro. January 7, 2015
CSC209H Lecture 1 Dan Zingaro January 7, 2015 Welcome! Welcome to CSC209 Comments or questions during class? Let me know! Topics: shell and Unix, pipes and filters, C programming, processes, system calls,
More informationBatch Systems. Running calculations on HPC resources
Batch Systems Running calculations on HPC resources Outline What is a batch system? How do I interact with the batch system Job submission scripts Interactive jobs Common batch systems Converting between
More informationMartinos Center Compute Cluster
Why-N-How: Intro to Launchpad 8 September 2016 Lee Tirrell Laboratory for Computational Neuroimaging Adapted from slides by Jon Kaiser 1. Intro 2. Using launchpad 3. Summary 4. Appendix: Miscellaneous
More informationUoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011)
UoW HPC Quick Start Information Technology Services University of Wollongong ( Last updated on October 10, 2011) 1 Contents 1 Logging into the HPC Cluster 3 1.1 From within the UoW campus.......................
More informationNBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen
NBIC TechTrack PBS Tutorial by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen 1 NBIC PBS Tutorial This part is an introduction to clusters and the PBS
More informationUsing the Yale HPC Clusters
Using the Yale HPC Clusters Robert Bjornson Yale Center for Research Computing Yale University Feb 2017 What is the Yale Center for Research Computing? Independent center under the Provost s office Created
More informationSGE Roll: Users Guide. Version Edition
SGE Roll: Users Guide Version 4.2.1 Edition SGE Roll: Users Guide : Version 4.2.1 Edition Published Sep 2006 Copyright 2006 University of California and Scalable Systems This document is subject to the
More informationSlurm Overview. Brian Christiansen, Marshall Garey, Isaac Hartung SchedMD SC17. Copyright 2017 SchedMD LLC
Slurm Overview Brian Christiansen, Marshall Garey, Isaac Hartung SchedMD SC17 Outline Roles of a resource manager and job scheduler Slurm description and design goals Slurm architecture and plugins Slurm
More informationBefore We Start. Sign in hpcxx account slips Windows Users: Download PuTTY. Google PuTTY First result Save putty.exe to Desktop
Before We Start Sign in hpcxx account slips Windows Users: Download PuTTY Google PuTTY First result Save putty.exe to Desktop Research Computing at Virginia Tech Advanced Research Computing Compute Resources
More informationIntroduction to HPC Using zcluster at GACRC
Introduction to HPC Using zcluster at GACRC On-class STAT8330 Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu Slides courtesy: Zhoufei Hou 1 Outline What
More informationThe cluster system. Introduction 22th February Jan Saalbach Scientific Computing Group
The cluster system Introduction 22th February 2018 Jan Saalbach Scientific Computing Group cluster-help@luis.uni-hannover.de Contents 1 General information about the compute cluster 2 Available computing
More informationIntroduction to Slurm
Introduction to Slurm Tim Wickberg SchedMD Slurm User Group Meeting 2017 Outline Roles of resource manager and job scheduler Slurm description and design goals Slurm architecture and plugins Slurm configuration
More informationBrigham Young University
Brigham Young University Fulton Supercomputing Lab Ryan Cox Slurm User Group September 16, 2015 Washington, D.C. Open Source Code I'll reference several codes we have open sourced http://github.com/byuhpc
More informationIntroduction to Linux Part 2b: basic scripting. Brett Milash and Wim Cardoen CHPC User Services 18 January, 2018
Introduction to Linux Part 2b: basic scripting Brett Milash and Wim Cardoen CHPC User Services 18 January, 2018 Overview Scripting in Linux What is a script? Why scripting? Scripting languages + syntax
More informationSCALABLE HYBRID PROTOTYPE
SCALABLE HYBRID PROTOTYPE Scalable Hybrid Prototype Part of the PRACE Technology Evaluation Objectives Enabling key applications on new architectures Familiarizing users and providing a research platform
More informationOBTAINING AN ACCOUNT:
HPC Usage Policies The IIA High Performance Computing (HPC) System is managed by the Computer Management Committee. The User Policies here were developed by the Committee. The user policies below aim to
More informationHigh Performance Computing (HPC) Using zcluster at GACRC
High Performance Computing (HPC) Using zcluster at GACRC On-class STAT8060 Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC?
More informationMinnesota Supercomputing Institute Regents of the University of Minnesota. All rights reserved.
Minnesota Supercomputing Institute Introduction to Job Submission and Scheduling Andrew Gustafson Interacting with MSI Systems Connecting to MSI SSH is the most reliable connection method Linux and Mac
More informationIntel Manycore Testing Lab (MTL) - Linux Getting Started Guide
Intel Manycore Testing Lab (MTL) - Linux Getting Started Guide Introduction What are the intended uses of the MTL? The MTL is prioritized for supporting the Intel Academic Community for the testing, validation
More informationTITANI CLUSTER USER MANUAL V.1.3
2016 TITANI CLUSTER USER MANUAL V.1.3 This document is intended to give some basic notes in order to work with the TITANI High Performance Green Computing Cluster of the Civil Engineering School (ETSECCPB)
More informationKamiak Cheat Sheet. Display text file, one page at a time. Matches all files beginning with myfile See disk space on volume
Kamiak Cheat Sheet Logging in to Kamiak ssh your.name@kamiak.wsu.edu ssh -X your.name@kamiak.wsu.edu X11 forwarding Transferring Files to and from Kamiak scp -r myfile your.name@kamiak.wsu.edu:~ Copy to
More information