Slurm Birds of a Feather

Slurm Birds of a Feather Tim Wickberg SchedMD SC17

Outline Welcome Roadmap Review of 17.02 release (February 2017) Overview of upcoming 17.11 (November 2017) release Roadmap for 18.08 and beyond Time Remaining - Open Q+A

Slurm 17.02

Version 17.02 Released February 2017. Some background restructuring for Federation; relatively few changes visible to users or admins. 1,913 commits ahead of the 16.05 release. 1,948 files changed, 48,407 insertions(+), 62,333 deletions(-)

sbcast improvements sbcast - introduced in 2006, uses hierarchical communication to propagate files to compute nodes. Added srun --bcast to fan out the command binary as part of launch. Added lz4 and gzip compression options, set through the --compress option, the SBCAST_COMPRESS environment variable, and/or the SbcastParameters option in slurm.conf. lz4 is highly recommended; gzip is not recommended in most environments.
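A minimal usage sketch, with illustrative file names and destination paths (not from the slides):

#!/bin/bash
#SBATCH -N16
# Stage an input file onto local storage on every allocated node, compressing with lz4
sbcast --compress=lz4 input.dat /tmp/input.dat
# Fan the application binary out to the allocated nodes as part of launch, then run it from there
srun --bcast=/tmp/my_app ./my_app

Exporting SBCAST_COMPRESS=lz4 (or setting SbcastParameters in slurm.conf, per the slide) makes lz4 the default without the explicit --compress option.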

slurmdbd daemon statistics sdiag, but for slurmdbd sacctmgr show stats - Reports current daemon statistics sacctmgr clear stats - Clear daemon statistics sacctmgr shutdown - Shutdown the daemon

slurmdbd daemon statistics
$ sacctmgr show stats
Rollup statistics
  Hour  count:8 ave_time:150348 max_time:342905 total_time:1202785
  Day   count:1 ave_time:285012 max_time:285012 total_time:285012
  Month count:0 ave_time:0 max_time:0 total_time:0
Remote Procedure Call statistics by message type
  DBD_NODE_STATE ( 1432) count:40 ave_time:979 total_time:39162
  DBD_GET_QOS    ( 1448) count:12 ave_time:949 total_time:11389
  ...
Remote Procedure Call statistics by user
  alex ( 1001) count:18 ave_time:1342 total_time:24156
  ...

Other 17.02 Changes Cgroup containers automatically cleaned up after steps complete Added MailDomain configuration parameter to qualify email addresses Added PrologFlags=Serial configuration parameter to prevent Epilog from starting before Prolog completes (even if job cancelled while Prolog is active)
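For reference, a minimal slurm.conf sketch of the two new parameters (the domain name is illustrative):

# slurm.conf
MailDomain=cluster.example.com   # --mail-user=alice is qualified to alice@cluster.example.com
PrologFlags=Serial               # Epilog waits for the Prolog to finish, even if the job is cancelled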

Other 17.02 Changes Added burst buffer support for job arrays Memory values changed from 32-bit to 64-bit, increasing the maximum supported memory for limit enforcement and scheduling on nodes above 2TB Removed AIX, BlueGene/L, BlueGene/P, and Cygwin support Removed sched/wiki and sched/wiki2 plugins

Slurm 17.11

Version 17.11 To be released in November 2017 Two big ticket items: Federated Clusters support Heterogeneous Job support 2,359+ commits ahead of 17.02. 1,227 files changed, 82,452 insertions(+), 53,719 deletions(-) Release candidates out now, 17.11.0 release at the end of November.

Version 17.11 Federation Scale out by scheduling multiple clusters as one Submit and schedule jobs on multiple clusters Unified job IDs Unified views Established through a central slurmdbd, managed with the sacctmgr command. For more details please see the Federation presentation.
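A sketch of the sacctmgr side of the setup (federation and cluster names are illustrative):

# Group two existing clusters into a federation named "fed"
$ sacctmgr add federation fed clusters=alpha,beta
# Confirm membership and federation state
$ sacctmgr show federation
# Jobs can then be routed to a specific member from any federated cluster
$ sbatch --clusters=beta job.sh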

Heterogeneous Jobs Join resource allocation requests into a single job. As an example, this makes it easy to allocate a job with 10 Haswell nodes and 1000 KNL nodes. Currently, this is difficult to accomplish and requires careful manipulation of --constraint and CPU count calculations.

Submitting Heterogeneous Jobs Multiple independent job specifications are identified on the command line using the ":" separator The job specifications are sent to the slurmctld daemon as a list in a single RPC The entire request is validated and accepted or rejected Response is also a list of data (e.g. job IDs) $ salloc -n1 -C haswell : -n256 -C knl bash

Heterogeneous Batch Jobs Job components specified using the ":" command line separator, OR use #SBATCH options in the script, separating components with "#SBATCH packjob" The script runs on the first component specified
$ cat my.bash
#!/bin/bash
#SBATCH -n1 -C haswell
#SBATCH packjob
#SBATCH -n256 -C knl
$ sbatch my.bash

Heterogeneous Job Data Structure Each component of a heterogeneous job has its own job structure entry The head job has pointers to all components (like job arrays) New fields:
JobID - unique for each component of the heterogeneous job
PackJobID - common value for all components
PackJobOffset - unique for each component, zero origin
PackJobIdSet - list of all job IDs in the heterogeneous job

Heterogeneous Job Management Standard format ID for managing heterogeneous jobs is <PackJobID>+<PackJobOffset>
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
 93+0     debug test adam  R 4:56     1 nid00001
 93+1     debug test adam  R 4:56     2 nid000[10-11]
 93+2     debug test adam  R 4:56     4 nid000[20-23]
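Other commands accept the same <PackJobID>+<PackJobOffset> form; for example, with the job shown above:

# Cancel only the second component of heterogeneous job 93
$ scancel 93+1
# Inspect the first component in detail
$ scontrol show job 93+0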

Version 17.11 Changed to dynamic linking by default Single libslurmfull.so used by all binaries. Install footprint drops from 180MB (17.02) for {bin, lib, sbin}, to 83MB. Use --without-shared-libslurm configure option to revert to old behavior. Native Cray is now the default for --with-cray
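Reverting to the old static layout is a configure-time choice, roughly as follows (the install prefix is illustrative):

$ ./configure --prefix=/opt/slurm --without-shared-libslurm
$ make -j && make install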

Version 17.11 Overhaul of slurm.spec Rearrangement of components into obvious packages:
slurm - libraries and all commands
slurmd - binary + service file for compute node
slurmctld - binary + service file for controller
slurmdbd - binary, mysql plugin, + service file for database
Correct support for systemd service file installation. No SystemV init support; assumes a RHEL7+ / SuSE12+ environment. Older version preserved as contribs/slurm.spec-legacy - deprecated, and will receive minimal maintenance
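Building packages from the reworked spec follows the usual rpmbuild flow; a sketch, with the tarball name depending on the exact release:

# Build binary RPMs (slurm, slurmd, slurmctld, slurmdbd, ...) straight from the release tarball
$ rpmbuild -ta slurm-17.11.0.tar.bz2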

Version 17.11 Removal of Solaris support Removal of obsolete MPI plugins Only PMI2, PMIx, and OpenMPI/None remain. The PMI2 and PMIx APIs are supported by all modern MPI stacks, and almost all can launch without those APIs by interpreting SLURM_* environment variables directly. Note that the OpenMPI plugin is identical to the None plugin; MpiDefault=openmpi doesn't do anything. Please update your configs; it will be removed in a future release.
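Updating a legacy configuration is a one-line change in slurm.conf; a sketch, choosing the plugin that matches the local MPI stack:

# slurm.conf - replace a legacy MpiDefault=openmpi entry (equivalent to none, and slated for removal)
MpiDefault=pmi2    # or pmix, or none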

Version 17.11 AccountingStorage=MySQL no longer supported in slurm.conf Mode that used to allow you to get minimal accounting support without slurmdbd. Has not worked correctly for some time. Use AccountingStorage=SlurmDBD in slurm.conf. Along with AccountingStorage=MySQL in slurmdbd.conf
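Spelled out with the full parameter names, the supported configuration looks roughly like this (hosts and credentials are illustrative):

# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost

# slurmdbd.conf
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=change_me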

Version 17.11 Log rotation: SIGUSR2 sent to {slurmctld, slurmdbd, slurmd} will cause it to drop and reopen the log files. Use this instead of SIGHUP, which accomplishes the same thing by reconfiguring the daemon (which triggers re-registration, and can cause performance issues in slurmctld). MaxQueryTimeRange in slurmdbd.conf: limits the time range of a single sacct query, which helps prevent slurmdbd hanging/crashing by trying to return too large a data set.
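In practice, log rotation and the new query limit look something like this (the 60-day window is illustrative):

# e.g. from a logrotate postrotate script: ask each daemon to reopen its log file
pkill -USR2 slurmctld
pkill -USR2 slurmdbd
pkill -USR2 slurmd

# slurmdbd.conf - cap the time range of any single sacct query
MaxQueryTimeRange=60-00:00:00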

Version 17.11 Additional resiliency in slurmctld high-availability code paths. Avoids split-brain issues if the path to StateSaveLocation is not the same as the network path to the nodes. Code hardening using Coverity, a static analysis tool free for open-source projects. Outstanding issue count reduced from 725 to 0. This complements the long-standing use of -Wall on all development builds, and testing using several compilers and platforms.

Coverity

Built-in X11 Forwarding Similar functionality to CEA's SPANK X11 plugin. Use the --without-x11 configure option to continue using the SPANK plugin instead. Implementation uses libssh2 to set up and coordinate tunnels directly. Adds --x11 option to salloc/sbatch/srun. Optional arguments control which nodes will establish tunnels: --x11={all,batch,first,last}

Built-in X11 Forwarding Enable with PrologFlags=X11 PrologFlags=Contain is implied. Uses the extern step on the allocated compute node(s) to launch one tunnel per node, regardless of how many steps are running simultaneously. Users must have either SSH hostkey authentication or password-less SSH keys installed inside the cluster. Can work alongside pam_slurm_adopt to set the correct DISPLAY on SSH processes when forwarding is in place.
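Putting those pieces together, a minimal setup and usage sketch (the application is illustrative):

# slurm.conf
PrologFlags=X11        # implies PrologFlags=Contain

# From an X11-capable login session, tunnel the display to the first allocated node
$ srun --x11=first --pty xterm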

Centralized Extended GID lookup Look up the extended GIDs for the user before launching the job, and send them as part of the job credential to all allocated slurmd daemons. Enable with LaunchParameters=send_gids

Centralized Extended GID lookup Avoids compute nodes all making simultaneous calls to getgrouplist(), which has been a scalability issue for >O(1000) nodes. If sssd/ldap fails to respond promptly, getgrouplist() may return with no extended groups, leaving a job with a mix of nodes with and without the correct GIDs.

Billing TRES New billing TRES On by default -- AccountingStorageTRES Enforce limits on usage calculated from the partition's TRESBillingWeights Use existing limits (GrpTRESMins, GrpTRESRunMins, GrpTRES, MaxTRESMins, MaxTRES, etc.) Usage seen with scontrol show jobs, sacct, sreport.
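A hedged configuration sketch (partition, node names, weights, and limit values are illustrative):

# slurm.conf - charge 1 billing unit per CPU and 0.25 per GB of memory in this partition
PartitionName=batch Nodes=nid[00001-00100] TRESBillingWeights="CPU=1.0,Mem=0.25G"

# Cap an account at 100,000 billing-TRES minutes using the existing limit mechanism
$ sacctmgr modify account physics set GrpTRESMins=billing=100000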

Other 17.11 Changes More flexible advanced reservations (FLEX flag on reservations): jobs able to use resources inside and outside of the reservation, and able to start before and end after the reservation. sprio command reports information for every partition associated with a job rather than just one partition. Support for stacking different interconnect plugins (JobAcctGather). Added scancel --hurry option to cancel a job without staging out burst buffer files. sdiag reports the DBD agent queue size.

Slurm 18.08

Version 18.08 Release August 2018 Google Cloud support (integration scripts provided) Support for MPI jobs that span heterogeneous job allocations Support for multiple backup slurmctlds Improvements to KNL scheduling and CPU binding Cray: manage DataWarp allocations without allocating compute nodes (--nodes=0); scontrol show dwstat - report output from the dwstat command
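Once 18.08 is available, a data-staging-only script might look roughly like this, reusing today's DataWarp directive syntax (capacity and paths are illustrative):

#!/bin/bash
#SBATCH --nodes=0      # burst buffer work only, no compute nodes (new in 18.08)
#DW jobdw type=scratch capacity=1TiB access_mode=striped
#DW stage_in source=/lustre/dataset destination=$DW_JOB_STRIPED/dataset type=directory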

Version 18.08 A few planned anti-features: Remove support for Cray/ALPS mode - must use native Slurm mode (recommended for some time) Remove support for BlueGene/Q Remove or repair support for macOS - has been broken for years due to linking issues; patch submissions welcome

and Beyond!