Slurm. Ryan Cox, Fulton Supercomputing Lab, Brigham Young University (BYU)
Slide 2: Slurm Workload Manager: Outline
- What is Slurm?
- Installation
- Slurm configuration: daemons, configuration files
- Client commands
- User and account management
- Policies, tuning, and advanced configuration: priorities, fairshare, backfill, QOS
Slide 3: What is Slurm?
- Simple Linux Utility for Resource Management (anything but simple)
- Resource manager and scheduler
- Originally developed at LLNL (Lawrence Livermore)
- GPL v2; commercial support/development available
- Core development by SchedMD; other major contributors exist
- Built for scale and fault tolerance
- Plugin-based: lots of plugins to modify Slurm behavior for your needs
Slide 4: BYU's Scheduler History
- BYU has run several scheduling systems through its HPC history
- Moab/Torque was the primary scheduling system for many years
- Slurm replaced Moab/Torque as BYU's sole scheduler in January 2013
- BYU has contributed Slurm patches, some small and some large. Examples:
  - New fair share algorithm: LEVEL_BASED
  - cgroup out-of-memory notification in job output
  - Script to generate a file equivalent to PBS_NODEFILE
  - Optionally charge for CPU equivalents instead of just CPUs (work in progress)
Slide 5: Terminology
- Partition: a set of nodes (usually a cluster, using the traditional definition of "cluster")
- Cluster: multiple Slurm clusters can be managed by one slurmdbd; one slurmctld per cluster
- Job step: a suballocation within a job. E.g. job 1234 has been allocated 12 nodes; it launches 4 job steps that each run on 3 of the nodes. Similar to subletting an apartment: sublet the whole place or just a room or two
- User: a person ("Bob" has a Slurm user "bob")
- Account: a group of users and subaccounts
- Association: the combination of user, account, partition, and cluster. A user can be a member of multiple accounts, with different limits for different partitions and clusters, etc.
Slide 6: Installation
- Version numbers are Ubuntu-style year.month (e.g. 14.03.4: 14.03 == major version, released in March 2014; .4 == minor/maintenance version)
- Download official releases from schedmd.com
- git repo available; active development occurs at github.com, and releases are tagged (git tag)
- Two main methods of installation:
  - ./configure && make && make install  # and install missing -dev{,el} packages
  - Build RPMs, etc.
- Some distros have a package, usually slurm-llnl
  - Version may be behind by a major release or three
  - If you want to patch something, this is the hardest approach
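For reference, a from-source build might look like the following sketch (the version number and install prefix are illustrative assumptions, not recommendations):

    # Source build sketch; version and prefix are illustrative
    tar xjf slurm-14.03.4.tar.bz2
    cd slurm-14.03.4
    ./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
    make -j16
    make install    # as root, or into a staging directory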
Slide 7: Installation: Build RPMs
- Set up ~/.rpmmacros with something like this (see the top of slurm.spec for more options):
    ##slurm macros
    %_with_blcr 1
    %_with_lua 1
    %_with_mysql 1
    %_with_openssl 1
    %_smp_mflags -j16
    ##%_prefix /usr/local/slurm
- Copy missing version info from the META file to slurm.spec (grep for META in slurm.spec)
- Let's assume we add the following lines to slurm.spec (using 14.03.4 as an example version):
    Name: slurm
    Version: 14.03.4
    Release: 0%{?dist}-custom1
- Assuming RHEL 6, the RPM version will become: slurm-14.03.4-0.el6-custom1
- If the slurm code is in ./slurm/, do:
    ln -s slurm slurm-14.03.4-0.el6-custom1
    tar hzcvf slurm-14.03.4-0.el6-custom1.tgz slurm-14.03.4-0.el6-custom1
    rpmbuild -tb slurm-14.03.4-0.el6-custom1.tgz
- The *.rpm files will be in ~/rpmbuild/RPMS
Slide 8: Configuration: Daemons
- slurmctld: controller that handles scheduling, communication with nodes, etc.
- slurmdbd (optional): communicates with the MySQL database
- slurmd: runs on a compute node and launches jobs
- slurmstepd: run by slurmd to launch a job step
- munged: authenticates RPC calls; install munged everywhere with the same key
- slurmd uses hierarchical communication between slurmd instances (for scalability)
- slurmctld and slurmdbd can have primary and backup instances for HA
  - State is synchronized through a shared file system (StateSaveLocation)
Slide 9: Configuration: Config Files
- Config files are read directly from the node by commands and daemons
- Config files should be kept in sync everywhere
  - Exception: slurmdbd.conf is only used by slurmdbd and contains database passwords
- DebugFlags=NO_CONF_HASH tells Slurm to tolerate some differences; everything should be consistent except maybe backfill parameters, etc. that slurmd doesn't need
- Can use Include /path/to/file.conf to separate out portions, e.g. partitions, nodes, licenses
- Can configure generic resources with GresTypes=gpu
- man slurm.conf
- Web-based configurators are linked from the Slurm documentation: an easy version and an almost-as-easy full version
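As a rough sketch of the Include and GresTypes points above (host names, paths, and values are assumptions, not a recommended configuration):

    # Skeletal slurm.conf fragment; names and values are illustrative
    ClusterName=mycluster
    ControlMachine=head1
    StateSaveLocation=/var/spool/slurmctld    # on a shared FS if you run a backup slurmctld
    GresTypes=gpu
    Include /etc/slurm/nodes.conf         # NodeName=... definitions
    Include /etc/slurm/partitions.conf    # PartitionName=... definitions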
Slide 10: Configuration: Gotchas
- SlurmdTimeout: the interval that slurmctld waits for slurmd to respond before assuming a node is dead and killing its jobs
  - Set it appropriately so file system disruptions and Slurm updates don't kill everything. Ours is 1800 (30 minutes).
- Slurm queries the hardware and configures nodes accordingly... which may not be what you want if you want Mem=64GB instead of the slightly smaller value the hardware actually reports
  - Can set FastSchedule=2
- You probably want this: AccountingStorageEnforce=associations,limits,qos
- The ulimit values at the time of sbatch get propagated to the job: set PropagateResourceLimits if you don't like that
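Collected into one slurm.conf fragment, the settings above might look like this (values are examples; PropagateResourceLimits=NONE is one possible choice, not the only one):

    # Gotcha-related slurm.conf settings; values are examples
    SlurmdTimeout=1800                                 # 30 minutes
    FastSchedule=2                                     # trust configured node specs
    AccountingStorageEnforce=associations,limits,qos
    PropagateResourceLimits=NONE                       # don't copy submit-host ulimits into jobs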
Slide 11: Commands
- squeue: view the queue
- sbatch: submit a batch job
- salloc: launch an interactive job
- srun: two uses. Outside of a job: run a command through the scheduler on compute node(s) and print the output to stdout. Inside of a job: launch a job step (i.e. a suballocation) and print to the job's stdout
- sacct: view job accounting information
- sacctmgr: manage users and accounts, including limits
- sstat: view job step information (I rarely use it)
- sreport: view reports about usage (I rarely use it)
- sinfo: information on partitions and nodes
- scancel: cancel jobs or steps, send arbitrary signals (INT, USR1, etc.)
- scontrol: list and update jobs, nodes, partitions, reservations, etc.
Slide 12: Commands: Read the Manpages
- Slurm is too configurable to cover everything here; I will share some examples in the next few slides
- New features are added frequently
- squeue now has more output options than A-z (printf style): a new output formatting method was added in a recent release
Slide 13: Host Range Syntax
- Host range syntax is more compact, allows smaller RPC calls, makes config files easier to read, etc.
- Node lists have a range syntax using [] with commas and dashes
- Usable with commands and config files
- n[1-10,40-50] and n[5-20] are valid
- Up to two ranges are allowed: n[1-100]-[1-16]
  - I haven't tried this out recently, so the limit may have increased; the manpage still says two
- Comma-separated lists are allowed: a-[1-5]-[1-2],b-3-[1-16],b-[4-5]-[1-2,7,9]
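A handy way to expand or collapse these lists on the command line (a quick sketch; the output shown is what I'd expect for this input):

    $ scontrol show hostnames n[1-3,5]
    n1
    n2
    n3
    n5
    $ scontrol show hostlist n1,n2,n3,n5
    n[1-3,5]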
Slide 14: Commands: squeue
- Want to see all running jobs on nodes n[4-31], submitted by all users in account accte, using QOS special, with a certain set of job names, in reservation res8, but only show the job ID and the list of nodes the jobs are assigned to, then sort by time remaining and then descending by job ID? There's a command for that!
    squeue -t running -w n[4-31] -A accte -q special -n name1,name2 -R res8 -o "%.10i %N" -S +L,-i
- Way too many options to list here. Read the manpage.
Slide 15: Commands: sbatch (and salloc, srun)
- sbatch parses #SBATCH directives in a job script and accepts parameters on the CLI
  - Also parses most #PBS syntax
- salloc and srun accept most of the same options
- LOTS of options: read the manpage
- Easy way to learn/teach the syntax: BYU's Job Script Generator
  - LGPL v3, Javascript, available on GitHub
  - Generates Slurm and PBS syntax
  - May need modification for your site
Slide 16: Script Generator (1/2) [screenshot of the job script generator]
Slide 17: Script Generator (2/2) [screenshot]. Demo and code are linked from github.com/byuhpc.
Slide 18: Commands: sbatch (and salloc, srun)
- Short and long versions exist for most options:
  - -N 2 # node count
  - -n 8 # task count; the default behavior is to load up as few nodes as possible rather than spreading tasks out
  - -t 2-04:30:00 # time limit in d-h:m:s, d-h, h:m:s, h:m, or m
  - -p p1 # partition name(s): can list multiple partitions
  - --qos=standby # QOS to use
  - --mem=24g # memory per node
  - --mem-per-cpu=2g # memory per CPU
  - -a # job array (takes an index range, e.g. -a 1-100)
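Putting a few of these options together, a minimal batch script might look like this sketch (the program name is a placeholder; the partition and values come from the list above):

    #!/bin/bash
    #SBATCH -N 2                 # two nodes
    #SBATCH -n 8                 # eight tasks total
    #SBATCH -t 04:30:00          # 4.5-hour time limit
    #SBATCH -p p1                # partition
    #SBATCH --mem-per-cpu=2g     # memory per CPU

    srun ./my_program            # launch a job step across the allocation

Submit it with: sbatch myscript.sh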
Slide 19: Job Arrays
- Used to submit homogeneous scripts that differ only by an index number
- $SLURM_ARRAY_TASK_ID stores the job's index number (from -a)
- An individual job looks like 1234_7, i.e. ${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
- scancel 1234 for the whole array, or scancel 1234_7 for just one job in the array
- In older versions, job arrays are purely for convenience:
  - One sbatch call, scancel can work on the entire array, etc.
  - Internally, one job entry is created for each job array entry at submit time
  - The overhead of a job array with 1000 tasks is about equivalent to 1000 individual jobs
- Starting in a later release, a meta job is used internally:
  - Scheduling code is aware of the homogeneity of the array
  - Individual job entries are created once a job is started
  - Big performance advantage
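A quick array-job sketch (the input file naming scheme is an assumption for illustration):

    #!/bin/bash
    #SBATCH -a 1-100             # 100 array tasks, indices 1..100
    #SBATCH -t 1:00:00

    # each task works on the file matching its own index
    ./process input.${SLURM_ARRAY_TASK_ID}.dat

One sbatch call submits all 100; scancel <jobid> kills them all, scancel <jobid>_7 kills just one.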
Slide 20: Commands: scontrol
- scontrol can list, set, and update a lot of different things:
    scontrol show job $jobid
    scontrol show node $node
    scontrol show reservation    # checkjob equivalent
    scontrol <hold|release> $jobid    # hold/release (uhold allows the user to release)
- Update syntax:
    scontrol update JobID=1234 TimeLimit=2-0    # set job 1234 to a 2-day time limit
    scontrol update NodeName=n-4-5 State=DOWN Reason="cosmic rays"
- Create a reservation:
    scontrol create reservation reservationname=testres nodes=n-[4,7-10] flags=maint,ignore_jobs,overlap starttime=now duration=2-0 users=root
- scontrol reconfigure    # reread slurm.conf
- LOTS of other options: read the manpage
Slide 21: Resource Enforcement
- Slurm can enforce resource requests through the OS
- CPU:
  - task/cgroup uses the cpuset cgroup (best)
  - task/affinity pins a task using sched_setaffinity (good, but a user can escape it)
- Memory:
  - memory cgroup (best)
  - polling (polling-based: huge race conditions exist, but much better than nothing; users can escape it)
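A sketch of cgroup-based enforcement (see the cgroup.conf manpage before copying; these are commonly used knobs, shown with assumed values):

    # slurm.conf: select the cgroup task plugin
    TaskPlugin=task/cgroup

    # cgroup.conf: what to constrain
    ConstrainCores=yes        # cpuset cgroup for CPU confinement
    ConstrainRAMSpace=yes     # memory cgroup for RAM limits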
Slide 22: QOS
- A QOS can be used to:
  - Modify job priorities based on QOS priority
  - Configure preemption
  - Allow access to dedicated resources
  - Override or impose limits
  - Change the charge rate (a.k.a. UsageFactor)
- A QOS can have limits: per QOS, and per user per QOS
- List existing QOS: sacctmgr list qos
- Modify (example values): sacctmgr modify qos long set MaxWall=14-0 UsageFactor=0.5
Slide 23: QOS: Preemption
- Preemption is easy to configure:
    sacctmgr modify qos normal set preempt=standby
- You can set up a chain:
    sacctmgr modify qos high set preempt=normal,low
    sacctmgr modify qos normal set preempt=low
- GraceTime (optional) guarantees a minimum runtime for preempted jobs
- Use AllowQOS to specify which QOS are allowed to run in each partition
- If userbob owns all the nodes in partition bobpartition:
  - In slurm.conf, set AllowQOS=bobqos,standby on partition bobpartition
  - sacctmgr modify user userbob set qos+=bobqos
  - sacctmgr modify qos bobqos set preempt=standby
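Note that QOS-based preemption also requires the preemption plugin to be enabled in slurm.conf; a minimal sketch (REQUEUE is just one of several possible modes):

    PreemptType=preempt/qos
    PreemptMode=REQUEUE       # or CANCEL, SUSPEND,GANG, etc.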
Slide 24: User/Account Management
- sacctmgr load/dump can be used as a poor way to implement user management
- Proper use of sacctmgr in a transactional manner is better and allows more flexibility, though you'll need to integrate it with your account creation process, etc.
- A user can be a member of multiple accounts
- The default account can be specified with sacctmgr (DefaultAccount)
- Fairshare Shares can be set to favor/penalize certain users
- Can grant/revoke access to multiple QOS's
- Examples:
    sacctmgr list assoc user=userbob
    sacctmgr list assoc user=userbob account=prof7    # filter by user and account
    sacctmgr create user userbob Accounts=prof2 DefaultAccount=prof2 Fairshare=100    # shares value is an example
Slide 25: User Limits
- Limits on CPUs, memory, nodes, time limits, allocated cpus*time, etc. can be set on an association:
    sacctmgr modify user userbob set GrpCPUs=1024
    sacctmgr modify account prof7 set GrpCPUs=2048
- On an account, Grp* limits apply to the entire account (sum of children)
- Max* limits are usually per user or per job
- Set a limit to -1 to remove it
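To verify what actually got set, something like this works (the format fields are documented in the sacctmgr manpage):

    sacctmgr list assoc user=userbob format=cluster,account,user,grpcpus,maxjobs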
Slide 26: User Limits: GrpCPURunMins
- GrpCPURunMins is a limit on the sum over an association's running jobs of (allocated CPUs * time remaining)
- Similar to MAXPS in Moab/Maui
- Staggers the start times of jobs
- Allows more jobs to start as other jobs near completion
- A simulator is available for download for your own site (LGPL v3), along with more info about why we use this, from BYU's GitHub (github.com/byuhpc)
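A worked example with a made-up limit: suppose an account has GrpCPURunMins=60480. A 1-core job with a 7-day time limit contributes 1 * 7*24*60 = 10,080 CPU-minutes of remaining time the moment it starts, so at most 6 such jobs (60480 / 10080) can be started back to back. As the running jobs' remaining time drains, headroom frees up continuously, so a 7th job can start well before any of the first six actually finishes.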
Slide 27: [charts] GrpCPURunMins examples: 1-core jobs with a 7-day limit and with a 3-day limit under the same GrpCPURunMins cap, showing job starts staggering as the cap is approached.
Slide 28: Account Coordinator
- An account coordinator can do the following for users and subaccounts under the account:
  - Set limits (CPUs, nodes, walltime, etc.)
  - Modify fairshare Shares to favor/penalize certain users
  - Grant/revoke access to a QOS*
  - Hold and cancel jobs
- We set faculty to be account coordinators for their accounts
- End-user documentation is available on BYU's site
- *Note: a coordinator can grant any QOS, not just a restricted set
Slide 29: Allocation Management
- BYU does not use it, therefore I don't know much about it
- GrpCPUMins (different than GrpCPURunMins): the total number of CPU minutes that can possibly be used by past, present, and future jobs running from this association and its children
- Can be reset manually or periodically; see PriorityUsageResetPeriod
- A QOS can have a UsageFactor so that you get billed more or less depending on the QOS: e.g. 5.0 for immediate, 1.0 for normal, 0.1 for standby
Slide 30: Job Priorities
- The priority/multifactor plugin uses weights * values:
    priority = sum(configured_weight_int * actual_value_float)
- Weights are integers; the values themselves are floats (0.0 to 1.0)
- Available components:
  - Age (queue wait time)
  - Fairshare
  - JobSize
  - Partition
  - QOS
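The weights in the next slide's example correspond to slurm.conf settings along these lines:

    PriorityType=priority/multifactor
    PriorityWeightAge=0
    PriorityWeightFairshare=10000
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=10000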
Slide 31: Job Priorities: Example
- Let's say the weights are:
  - PriorityWeightAge=0
  - PriorityWeightFairshare=10000 (ten thousand)
  - PriorityWeightJobSize=0
  - PriorityWeightPartition=0
  - PriorityWeightQOS=10000 (ten thousand)
- QOS priorities are: high=5, normal=2, low=0
- userbob (fairshare=0.23) submits a job in QOS normal (qos_priority=2):
    priority = (PriorityWeightFairshare * 0.23) + (PriorityWeightQOS * (2 / MAX(qos_priority)))
    priority = (10000 * 0.23) + (10000 * (2/5)) = 2300 + 4000 = 6300
Slide 32: Backfill
- Can be tuned with SchedulerParameters in slurm.conf. Example:
    SchedulerParameters=bf_max_job_user=20,bf_interval=60,default_queue_depth=15,max_job_bf=8000,bf_window=14400,bf_continue,max_sched_time=6,bf_resolution=1800,defer
- Goal: only backfill a job if it will not delay the start time of any higher-priority job
- So many nice tuning parameters pop up all the time that I can't keep up. See the slurm.conf manpage for SchedulerParameters options.
Slide 33: Fairshare Algorithms
- Warning: sites have widely varying use cases, so I don't necessarily understand the reason for some of the algorithms
- The priority/multifactor plugin can use different fair share algorithms
- Default (no algorithm override specified with PriorityFlags):
  - Fairshare factor affected by you vs. your siblings, your parent vs. its siblings, your grandparent vs. its siblings, etc.
  - FSFactor = 2**(-Usage/Shares)
  - Seems to be the most common, but it doesn't work for us
- PriorityFlags=DEPTH_OBLIVIOUS: improves handling of deep and/or unbalanced trees
- PriorityFlags=TICKET_BASED: we used it for a while and it mostly worked, but the algorithm itself is flawed; LEVEL_BASED is recommended as a replacement
- PriorityFlags=LEVEL_BASED: users in an under-served account will always have a higher fair share factor than users in an over-served account. E.g. account hogs has higher usage than account idle: all users in idle will have a higher FS factor than all users in hogs
  - Available as patches through github.com/byuhpc/slurm; used in production at BYU
  - Available upstream in 14.11 (as of the pre3 pre-release)
Slide 34: Job Submit Plugin
- Slurm can run a job submit plugin written in Lua
- Lua looks like pseudo-code and doesn't take long to learn
- The plugin can modify a job's submission based on whatever business logic you want
- Example uses (see the sketch below):
  - Allow access to a partition based on the requested CPU count being a multiple of 3
  - Change the QOS to something different based on different factors
  - Output a custom error message, such as "Error! You requested x, y, and z but..."
- True business logic is possible with this script. It is worth your time to take a look.
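A minimal job_submit.lua sketch along the lines of the examples above (field names can vary between Slurm versions, and the policies here are made up for illustration):

    -- /etc/slurm/job_submit.lua: a sketch, not production code
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- made-up policy: partition "special" requires a CPU count that is a multiple of 3
        if job_desc.partition == "special" and job_desc.min_cpus % 3 ~= 0 then
            slurm.log_user("Error! Jobs in 'special' must request a multiple of 3 CPUs")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end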
Slide 35: Other Stuff
- Check out SPANK plugins: they run on a node and can do lots of stuff around job start/end events
- Prolog and epilog hooks are available in lots of different ways (job, step, task)
Slide 36: User Education
- Slurm (mostly) speaks #PBS and has many wrapper scripts. Maybe this is sufficient?
- BYU switched from Moab/Torque to Slurm before notifying users of the change. (Yes, we are that crazy. Yes, it worked great for >95% of use cases, which was our target. The other options/commands were esoteric and silently ignored by Moab/Torque anyway.)
- Slurm/PBS Script Generator available: github.com/byuhpc (LGPL v3; a demo is linked to from GitHub)
- An "Introduction to Slurm Tools" video is linked from there as well
Slide 37: Diagnostics
- Backtraces from core dumps are typically best for crashes
  - Be sure you don't have any ulimit-type restrictions on core files
- For slurmctld:
    gdb `which slurmctld` /var/log/slurm/core
    (gdb) thread apply all bt
- SchedMD is usually able to diagnose problems from backtraces and maybe a few extra print statements they'll ask for
- Each component has its own logging level you can specify in its .conf
- There are extra debug flags for slurmctld:
    scontrol setdebugflags +backfill    # and others, like Priority
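Before reproducing a crash, make sure core files can actually be written; a sketch (the path and pattern are illustrative, and this requires root):

    ulimit -c unlimited                                        # in the daemon's environment
    echo '/var/log/slurm/core.%e.%p' > /proc/sys/kernel/core_pattern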
Slide 39: Support
- SchedMD: excellent support from the original developers; bugfixes are typically committed to github within a day
- Other support vendors are listed on Slurm's Wikipedia page; usually tied to a specific hardware vendor or part of a larger software installation
- slurm-dev mailing list: you should subscribe
  - Hand-holding is extremely rare
  - Don't expect to use slurm-dev for support
Slide 40: Recommendations
- Requirements documents: don't have your primary scheduler admin write one unless the admin can step back and write what you actually need, rather than "must have features A, B, and C exactly" (even though Slurm may have a better way of accomplishing the same thing)
- Think: "I want Prof Bob and his designated favorite students to have access to his privately owned hardware, but I also want preemptable jobs to run on there when they aren't using it. He shouldn't get charged cputime for using his own resources. How should I do that in Slurm?"
  - Set AllowQOS=profbob,standby on his partition in slurm.conf
  - sacctmgr create qos profbob UsageFactor=0
  - Then add each user who should have access to the QOS: sacctmgr modify user $user set qos+=profbob
Slide 41: Questions?
More information