Slurm. Ryan Cox, Fulton Supercomputing Lab, Brigham Young University (BYU)
Slide 2: Slurm Workload Manager: Outline
- What is Slurm?
- Installation
- Slurm configuration: daemons, configuration files
- Client commands
- User and account management
- Policies, tuning, and advanced configuration: priorities, fairshare, backfill, QOS
Slide 3: What is Slurm?
- Simple Linux Utility for Resource Management (anything but simple)
- Resource manager and scheduler
- Originally developed at LLNL (Lawrence Livermore)
- GPL v2; commercial support/development available
- Core development by SchedMD; other major contributors exist
- Built for scale and fault tolerance
- Plugin-based: lots of plugins to modify Slurm behavior for your needs
Slide 4: BYU's Scheduler History
- BYU has run several scheduling systems through its HPC history
- Moab/Torque was the primary scheduling system for many years
- Slurm replaced Moab/Torque as BYU's sole scheduler in January 2013
- BYU has contributed Slurm patches, some small and some large. Examples:
  - New fair share algorithm: LEVEL_BASED
  - cgroup out-of-memory notification in job output
  - Script to generate a file equivalent to PBS_NODEFILE
  - Optionally charge for CPU equivalents instead of just CPUs (work in progress)
Slide 5: Terminology
- Partition: a set of nodes (usually a cluster, using the traditional definition of "cluster")
- Cluster: multiple Slurm clusters can be managed by one slurmdbd; one slurmctld per cluster
- Job step: a suballocation within a job. E.g. job 1234 has been allocated 12 nodes; it launches 4 job steps that each run on 3 of the nodes. Similar to subletting an apartment: sublet the whole place or just a room or two
- User: a person ("Bob" has a Slurm user "bob")
- Account: a group of users and subaccounts
- Association: the combination of user, account, partition, and cluster. A user can be a member of multiple accounts, with different limits for different partitions and clusters, etc.
Slide 6: Installation
- Version numbers are Ubuntu-style year.month (e.g. 14.03.4: 14.03 == major version, released in March 2014; .4 == minor/maintenance version)
- Download official releases from schedmd.com
- git repo available; active development occurs at github.com, and releases are tagged (git tag)
- Two main methods of installation:
  - ./configure && make && make install  # and install missing -dev{,el} packages
  - Build RPMs, etc.
- Some distros have a package, usually slurm-llnl
  - Version may be behind by a major release or three
  - If you want to patch something, this is the hardest approach
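For reference, a from-source build might look like the following sketch (the version number and install prefix are illustrative assumptions, not recommendations):

    # Source build sketch; version and prefix are illustrative
    tar xjf slurm-14.03.4.tar.bz2
    cd slurm-14.03.4
    ./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
    make -j16
    make install    # as root, or into a staging directory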
Slide 7: Installation: Build RPMs
- Set up ~/.rpmmacros with something like this (see the top of slurm.spec for more options):
    ##slurm macros
    %_with_blcr 1
    %_with_lua 1
    %_with_mysql 1
    %_with_openssl 1
    %_smp_mflags -j16
    ##%_prefix /usr/local/slurm
- Copy missing version info from the META file to slurm.spec (grep for META in slurm.spec)
- Let's assume we add the following lines to slurm.spec (using 14.03.4 as an example version):
    Name: slurm
    Version: 14.03.4
    Release: 0%{?dist}-custom1
- Assuming RHEL 6, the RPM version will become: slurm-14.03.4-0.el6-custom1
- If the slurm code is in ./slurm/, do:
    ln -s slurm slurm-14.03.4-0.el6-custom1
    tar hzcvf slurm-14.03.4-0.el6-custom1.tgz slurm-14.03.4-0.el6-custom1
    rpmbuild -tb slurm-14.03.4-0.el6-custom1.tgz
- The *.rpm files will be in ~/rpmbuild/RPMS
Slide 8: Configuration: Daemons
- slurmctld: controller that handles scheduling, communication with nodes, etc.
- slurmdbd (optional): communicates with the MySQL database
- slurmd: runs on a compute node and launches jobs
- slurmstepd: run by slurmd to launch a job step
- munged: authenticates RPC calls; install munged everywhere with the same key
- slurmd uses hierarchical communication between slurmd instances (for scalability)
- slurmctld and slurmdbd can have primary and backup instances for HA
  - State is synchronized through a shared file system (StateSaveLocation)
Slide 9: Configuration: Config Files
- Config files are read directly from the node by commands and daemons
- Config files should be kept in sync everywhere
  - Exception: slurmdbd.conf is only used by slurmdbd and contains database passwords
- DebugFlags=NO_CONF_HASH tells Slurm to tolerate some differences; everything should be consistent except maybe backfill parameters, etc. that slurmd doesn't need
- Can use Include /path/to/file.conf to separate out portions, e.g. partitions, nodes, licenses
- Can configure generic resources with GresTypes=gpu
- man slurm.conf
- Web-based configurators are linked from the Slurm documentation: an easy version and an almost-as-easy full version
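As a rough sketch of the Include and GresTypes points above (host names, paths, and values are assumptions, not a recommended configuration):

    # Skeletal slurm.conf fragment; names and values are illustrative
    ClusterName=mycluster
    ControlMachine=head1
    StateSaveLocation=/var/spool/slurmctld    # on a shared FS if you run a backup slurmctld
    GresTypes=gpu
    Include /etc/slurm/nodes.conf         # NodeName=... definitions
    Include /etc/slurm/partitions.conf    # PartitionName=... definitions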
Slide 10: Configuration: Gotchas
- SlurmdTimeout: the interval that slurmctld waits for slurmd to respond before assuming a node is dead and killing its jobs
  - Set it appropriately so file system disruptions and Slurm updates don't kill everything. Ours is 1800 (30 minutes).
- Slurm queries the hardware and configures nodes accordingly... which may not be what you want if you want Mem=64GB instead of the slightly smaller value the hardware actually reports
  - Can set FastSchedule=2
- You probably want this: AccountingStorageEnforce=associations,limits,qos
- The ulimit values at the time of sbatch get propagated to the job: set PropagateResourceLimits if you don't like that
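Collected into one slurm.conf fragment, the settings above might look like this (values are examples; PropagateResourceLimits=NONE is one possible choice, not the only one):

    # Gotcha-related slurm.conf settings; values are examples
    SlurmdTimeout=1800                                 # 30 minutes
    FastSchedule=2                                     # trust configured node specs
    AccountingStorageEnforce=associations,limits,qos
    PropagateResourceLimits=NONE                       # don't copy submit-host ulimits into jobs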
Slide 11: Commands
- squeue: view the queue
- sbatch: submit a batch job
- salloc: launch an interactive job
- srun: two uses. Outside of a job: run a command through the scheduler on compute node(s) and print the output to stdout. Inside of a job: launch a job step (i.e. a suballocation) and print to the job's stdout
- sacct: view job accounting information
- sacctmgr: manage users and accounts, including limits
- sstat: view job step information (I rarely use it)
- sreport: view reports about usage (I rarely use it)
- sinfo: information on partitions and nodes
- scancel: cancel jobs or steps, send arbitrary signals (INT, USR1, etc.)
- scontrol: list and update jobs, nodes, partitions, reservations, etc.
Slide 12: Commands: Read the Manpages
- Slurm is too configurable to cover everything here; I will share some examples in the next few slides
- New features are added frequently
- squeue now has more output options than A-z (printf style): a new output formatting method was added in a recent release
Slide 13: Host Range Syntax
- Host range syntax is more compact, allows smaller RPC calls, makes config files easier to read, etc.
- Node lists have a range syntax using [] with commas and dashes
- Usable with commands and config files
- n[1-10,40-50] and n[5-20] are valid
- Up to two ranges are allowed: n[1-100]-[1-16]
  - I haven't tried this out recently, so the limit may have increased; the manpage still says two
- Comma-separated lists are allowed: a-[1-5]-[1-2],b-3-[1-16],b-[4-5]-[1-2,7,9]
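A handy way to expand or collapse these lists on the command line (a quick sketch; the output shown is what I'd expect for this input):

    $ scontrol show hostnames n[1-3,5]
    n1
    n2
    n3
    n5
    $ scontrol show hostlist n1,n2,n3,n5
    n[1-3,5]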
Slide 14: Commands: squeue
- Want to see all running jobs on nodes n[4-31], submitted by all users in account accte, using QOS special, with a certain set of job names, in reservation res8, but only show the job ID and the list of nodes the jobs are assigned to, then sort by time remaining and then descending by job ID? There's a command for that!
    squeue -t running -w n[4-31] -A accte -q special -n name1,name2 -R res8 -o "%.10i %N" -S +L,-i
- Way too many options to list here. Read the manpage.
Slide 15: Commands: sbatch (and salloc, srun)
- sbatch parses #SBATCH directives in a job script and accepts parameters on the CLI
  - Also parses most #PBS syntax
- salloc and srun accept most of the same options
- LOTS of options: read the manpage
- Easy way to learn/teach the syntax: BYU's Job Script Generator
  - LGPL v3, Javascript, available on GitHub
  - Generates Slurm and PBS syntax
  - May need modification for your site
Slide 16: Script Generator (1/2) [screenshot of the job script generator]
Slide 17: Script Generator (2/2) [screenshot]. Demo and code are linked from github.com/byuhpc.
Slide 18: Commands: sbatch (and salloc, srun)
- Short and long versions exist for most options:
  - -N 2 # node count
  - -n 8 # task count; the default behavior is to load up as few nodes as possible rather than spreading tasks out
  - -t 2-04:30:00 # time limit in d-h:m:s, d-h, h:m:s, h:m, or m
  - -p p1 # partition name(s): can list multiple partitions
  - --qos=standby # QOS to use
  - --mem=24g # memory per node
  - --mem-per-cpu=2g # memory per CPU
  - -a # job array (takes an index range, e.g. -a 1-100)
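Putting a few of these options together, a minimal batch script might look like this sketch (the program name is a placeholder; the partition and values come from the list above):

    #!/bin/bash
    #SBATCH -N 2                 # two nodes
    #SBATCH -n 8                 # eight tasks total
    #SBATCH -t 04:30:00          # 4.5-hour time limit
    #SBATCH -p p1                # partition
    #SBATCH --mem-per-cpu=2g     # memory per CPU

    srun ./my_program            # launch a job step across the allocation

Submit it with: sbatch myscript.sh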
Slide 19: Job Arrays
- Used to submit homogeneous scripts that differ only by an index number
- $SLURM_ARRAY_TASK_ID stores the job's index number (from -a)
- An individual job looks like 1234_7, i.e. ${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
- scancel 1234 for the whole array, or scancel 1234_7 for just one job in the array
- In older versions, job arrays are purely for convenience:
  - One sbatch call, scancel can work on the entire array, etc.
  - Internally, one job entry is created for each job array entry at submit time
  - The overhead of a job array with 1000 tasks is about equivalent to 1000 individual jobs
- Starting in a later release, a meta job is used internally:
  - Scheduling code is aware of the homogeneity of the array
  - Individual job entries are created once a job is started
  - Big performance advantage
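A quick array-job sketch (the input file naming scheme is an assumption for illustration):

    #!/bin/bash
    #SBATCH -a 1-100             # 100 array tasks, indices 1..100
    #SBATCH -t 1:00:00

    # each task works on the file matching its own index
    ./process input.${SLURM_ARRAY_TASK_ID}.dat

One sbatch call submits all 100; scancel <jobid> kills them all, scancel <jobid>_7 kills just one.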
Slide 20: Commands: scontrol
- scontrol can list, set, and update a lot of different things:
    scontrol show job $jobid
    scontrol show node $node
    scontrol show reservation    # checkjob equivalent
    scontrol <hold|release> $jobid    # hold/release (uhold allows the user to release)
- Update syntax:
    scontrol update JobID=1234 TimeLimit=2-0    # set job 1234 to a 2-day time limit
    scontrol update NodeName=n-4-5 State=DOWN Reason="cosmic rays"
- Create a reservation:
    scontrol create reservation reservationname=testres nodes=n-[4,7-10] flags=maint,ignore_jobs,overlap starttime=now duration=2-0 users=root
- scontrol reconfigure    # reread slurm.conf
- LOTS of other options: read the manpage
Slide 21: Resource Enforcement
- Slurm can enforce resource requests through the OS
- CPU:
  - task/cgroup uses the cpuset cgroup (best)
  - task/affinity pins a task using sched_setaffinity (good, but a user can escape it)
- Memory:
  - memory cgroup (best)
  - polling (polling-based: huge race conditions exist, but much better than nothing; users can escape it)
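A sketch of cgroup-based enforcement (see the cgroup.conf manpage before copying; these are commonly used knobs, shown with assumed values):

    # slurm.conf: select the cgroup task plugin
    TaskPlugin=task/cgroup

    # cgroup.conf: what to constrain
    ConstrainCores=yes        # cpuset cgroup for CPU confinement
    ConstrainRAMSpace=yes     # memory cgroup for RAM limits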
Slide 22: QOS
- A QOS can be used to:
  - Modify job priorities based on QOS priority
  - Configure preemption
  - Allow access to dedicated resources
  - Override or impose limits
  - Change the charge rate (a.k.a. UsageFactor)
- A QOS can have limits: per QOS, and per user per QOS
- List existing QOS: sacctmgr list qos
- Modify (example values): sacctmgr modify qos long set MaxWall=14-0 UsageFactor=0.5
Slide 23: QOS: Preemption
- Preemption is easy to configure:
    sacctmgr modify qos normal set preempt=standby
- You can set up a chain:
    sacctmgr modify qos high set preempt=normal,low
    sacctmgr modify qos normal set preempt=low
- GraceTime (optional) guarantees a minimum runtime for preempted jobs
- Use AllowQOS to specify which QOS are allowed to run in each partition
- If userbob owns all the nodes in partition bobpartition:
  - In slurm.conf, set AllowQOS=bobqos,standby on partition bobpartition
  - sacctmgr modify user userbob set qos+=bobqos
  - sacctmgr modify qos bobqos set preempt=standby
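Note that QOS-based preemption also requires the preemption plugin to be enabled in slurm.conf; a minimal sketch (REQUEUE is just one of several possible modes):

    PreemptType=preempt/qos
    PreemptMode=REQUEUE       # or CANCEL, SUSPEND,GANG, etc.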
Slide 24: User/Account Management
- sacctmgr load/dump can be used as a poor way to implement user management
- Proper use of sacctmgr in a transactional manner is better and allows more flexibility, though you'll need to integrate it with your account creation process, etc.
- A user can be a member of multiple accounts
- The default account can be specified with sacctmgr (DefaultAccount)
- Fairshare Shares can be set to favor/penalize certain users
- Can grant/revoke access to multiple QOS's
- Examples:
    sacctmgr list assoc user=userbob
    sacctmgr list assoc user=userbob account=prof7    # filter by user and account
    sacctmgr create user userbob Accounts=prof2 DefaultAccount=prof2 Fairshare=100    # shares value is an example
Slide 25: User Limits
- Limits on CPUs, memory, nodes, time limits, allocated cpus*time, etc. can be set on an association:
    sacctmgr modify user userbob set GrpCPUs=1024
    sacctmgr modify account prof7 set GrpCPUs=2048
- On an account, Grp* limits apply to the entire account (sum of children)
- Max* limits are usually per user or per job
- Set a limit to -1 to remove it
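To verify what actually got set, something like this works (the format fields are documented in the sacctmgr manpage):

    sacctmgr list assoc user=userbob format=cluster,account,user,grpcpus,maxjobs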
Slide 26: User Limits: GrpCPURunMins
- GrpCPURunMins is a limit on the sum over an association's running jobs of (allocated CPUs * time remaining)
- Similar to MAXPS in Moab/Maui
- Staggers the start times of jobs
- Allows more jobs to start as other jobs near completion
- A simulator is available for download for your own site (LGPL v3), along with more info about why we use this, from BYU's GitHub (github.com/byuhpc)
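A worked example with a made-up limit: suppose an account has GrpCPURunMins=60480. A 1-core job with a 7-day time limit contributes 1 * 7*24*60 = 10,080 CPU-minutes of remaining time the moment it starts, so at most 6 such jobs (60480 / 10080) can be started back to back. As the running jobs' remaining time drains, headroom frees up continuously, so a 7th job can start well before any of the first six actually finishes.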
Slide 27: [charts] GrpCPURunMins examples: 1-core jobs with a 7-day limit and with a 3-day limit under the same GrpCPURunMins cap, showing job starts staggering as the cap is approached.
Slide 28: Account Coordinator
- An account coordinator can do the following for users and subaccounts under the account:
  - Set limits (CPUs, nodes, walltime, etc.)
  - Modify fairshare Shares to favor/penalize certain users
  - Grant/revoke access to a QOS*
  - Hold and cancel jobs
- We set faculty to be account coordinators for their accounts
- End-user documentation is available on BYU's site
- *Note: a coordinator can grant any QOS, not just a restricted set
Slide 29: Allocation Management
- BYU does not use it, therefore I don't know much about it
- GrpCPUMins (different than GrpCPURunMins): the total number of CPU minutes that can possibly be used by past, present, and future jobs running from this association and its children
- Can be reset manually or periodically; see PriorityUsageResetPeriod
- A QOS can have a UsageFactor so that you get billed more or less depending on the QOS: e.g. 5.0 for immediate, 1.0 for normal, 0.1 for standby
Slide 30: Job Priorities
- The priority/multifactor plugin uses weights * values:
    priority = sum(configured_weight_int * actual_value_float)
- Weights are integers; the values themselves are floats (0.0 to 1.0)
- Available components:
  - Age (queue wait time)
  - Fairshare
  - JobSize
  - Partition
  - QOS
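The weights in the next slide's example correspond to slurm.conf settings along these lines:

    PriorityType=priority/multifactor
    PriorityWeightAge=0
    PriorityWeightFairshare=10000
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=10000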
Slide 31: Job Priorities: Example
- Let's say the weights are:
  - PriorityWeightAge=0
  - PriorityWeightFairshare=10000 (ten thousand)
  - PriorityWeightJobSize=0
  - PriorityWeightPartition=0
  - PriorityWeightQOS=10000 (ten thousand)
- QOS priorities are: high=5, normal=2, low=0
- userbob (fairshare=0.23) submits a job in QOS normal (qos_priority=2):
    priority = (PriorityWeightFairshare * 0.23) + (PriorityWeightQOS * (2 / MAX(qos_priority)))
    priority = (10000 * 0.23) + (10000 * (2/5)) = 2300 + 4000 = 6300
Slide 32: Backfill
- Can be tuned with SchedulerParameters in slurm.conf. Example:
    SchedulerParameters=bf_max_job_user=20,bf_interval=60,default_queue_depth=15,max_job_bf=8000,bf_window=14400,bf_continue,max_sched_time=6,bf_resolution=1800,defer
- Goal: only backfill a job if it will not delay the start time of any higher-priority job
- So many nice tuning parameters pop up all the time that I can't keep up. See the slurm.conf manpage for SchedulerParameters options.
Slide 33: Fairshare Algorithms
- Warning: sites have widely varying use cases, so I don't necessarily understand the reason for some of the algorithms
- The priority/multifactor plugin can use different fair share algorithms
- Default (no algorithm override specified with PriorityFlags):
  - Fairshare factor affected by you vs. your siblings, your parent vs. its siblings, your grandparent vs. its siblings, etc.
  - FSFactor = 2**(-Usage/Shares)
  - Seems to be the most common, but it doesn't work for us
- PriorityFlags=DEPTH_OBLIVIOUS: improves handling of deep and/or unbalanced trees
- PriorityFlags=TICKET_BASED: we used it for a while and it mostly worked, but the algorithm itself is flawed; LEVEL_BASED is recommended as a replacement
- PriorityFlags=LEVEL_BASED: users in an under-served account will always have a higher fair share factor than users in an over-served account. E.g. account hogs has higher usage than account idle: all users in idle will have a higher FS factor than all users in hogs
  - Available as patches through github.com/byuhpc/slurm; used in production at BYU
  - Available upstream in 14.11 (as of the pre3 pre-release)
Slide 34: Job Submit Plugin
- Slurm can run a job submit plugin written in Lua
- Lua looks like pseudo-code and doesn't take long to learn
- The plugin can modify a job's submission based on whatever business logic you want
- Example uses (see the sketch below):
  - Allow access to a partition based on the requested CPU count being a multiple of 3
  - Change the QOS to something different based on different factors
  - Output a custom error message, such as "Error! You requested x, y, and z but..."
- True business logic is possible with this script. It is worth your time to take a look.
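A minimal job_submit.lua sketch along the lines of the examples above (field names can vary between Slurm versions, and the policies here are made up for illustration):

    -- /etc/slurm/job_submit.lua: a sketch, not production code
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- made-up policy: partition "special" requires a CPU count that is a multiple of 3
        if job_desc.partition == "special" and job_desc.min_cpus % 3 ~= 0 then
            slurm.log_user("Error! Jobs in 'special' must request a multiple of 3 CPUs")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end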
Slide 35: Other Stuff
- Check out SPANK plugins: they run on a node and can do lots of stuff around job start/end events
- Prolog and epilog hooks are available in lots of different ways (job, step, task)
Slide 36: User Education
- Slurm (mostly) speaks #PBS and has many wrapper scripts. Maybe this is sufficient?
- BYU switched from Moab/Torque to Slurm before notifying users of the change. (Yes, we are that crazy. Yes, it worked great for >95% of use cases, which was our target. The other options/commands were esoteric and silently ignored by Moab/Torque anyway.)
- Slurm/PBS Script Generator available: github.com/byuhpc (LGPL v3; a demo is linked to from GitHub)
- An "Introduction to Slurm Tools" video is linked from there as well
Slide 37: Diagnostics
- Backtraces from core dumps are typically best for crashes
  - Be sure you don't have any ulimit-type restrictions on core files
- For slurmctld:
    gdb `which slurmctld` /var/log/slurm/core
    (gdb) thread apply all bt
- SchedMD is usually able to diagnose problems from backtraces and maybe a few extra print statements they'll ask for
- Each component has its own logging level you can specify in its .conf
- There are extra debug flags for slurmctld:
    scontrol setdebugflags +backfill    # and others, like Priority
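Before reproducing a crash, make sure core files can actually be written; a sketch (the path and pattern are illustrative, and this requires root):

    ulimit -c unlimited                                        # in the daemon's environment
    echo '/var/log/slurm/core.%e.%p' > /proc/sys/kernel/core_pattern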
Slide 39: Support
- SchedMD: excellent support from the original developers; bugfixes are typically committed to github within a day
- Other support vendors are listed on Slurm's Wikipedia page; usually tied to a specific hardware vendor or part of a larger software installation
- slurm-dev mailing list: you should subscribe
  - Hand-holding is extremely rare
  - Don't expect to use slurm-dev for support
Slide 40: Recommendations
- Requirements documents: don't have your primary scheduler admin write one unless the admin can step back and write what you actually need, rather than "must have features A, B, and C exactly" (even though Slurm may have a better way of accomplishing the same thing)
- Think: "I want Prof Bob and his designated favorite students to have access to his privately owned hardware, but I also want preemptable jobs to run on there when they aren't using it. He shouldn't get charged cputime for using his own resources. How should I do that in Slurm?"
  - Set AllowQOS=profbob,standby on his partition in slurm.conf
  - sacctmgr create qos profbob UsageFactor=0
  - Then add each user who should have access to the QOS: sacctmgr modify user $user set qos+=profbob
Slide 41: Questions?
More information