A declarative programming style job submission filter.

A declarative programming style job submission filter. Douglas Jacobsen, Computational Systems Group Lead, NERSC. Slurm User Group 2018

NERSC Vital Statistics: 860 projects, 7,750 users, 700+ applications.

Edison (NERSC-7, Cray XC30):
- 5,603 Ivy Bridge nodes; 24 cores per node, 134,472 cores total
- 64 GB per node (2.6 GB/core), 350 TB total
- Primarily used for large capability jobs, small midrange as well
- ~7 PB of local Lustre scratch

Cori (NERSC-8, Cray XC40):
- 12,070 compute nodes: 9,685 KNL (658,580 cores) and 2,385 Haswell (76,320 cores)
- 1.16 PiB DRAM, 151 TiB MCDRAM (KNL flat mode)
- DataWarp aka Burst Buffer (1.6 PiB)
- Realtime jobs for experimental facilities, massive quantities of serial jobs, regular HPC workload
- Shifter for Linux containers
- ~30 PB of Lustre scratch, also shared with Edison

Motivation

The NERSC job_submit.lua was getting difficult to maintain and tended to have buggy edge cases. In redesigning the system we wanted a generic abstraction of the process that met these goals:

Aim 1: Provide a simple user interface to the batch system
- Focus user effort on resource requests, not site policies
- User scripts should rarely change in response to a policy change
- Enable backwards compatibility for policy syntax

Aim 2: Allow site policies to change in a minimally disruptive way
- Implementation of policy changes should be error free
- Job submission logic and changes to policy should be traceable
- Job submission logic should be testable and provably correct before deployment

Aim 3: Separate job submission logic, policies, and user authorization

Motivation: User Interface

Every submission or modification passes through job_submit/lua, which translates the job submission parameters into job execution parameters:

- sbatch -N 500 -C knl script.sh → qos: debug_knl, partition: debug, gres: craynetwork:1
- sbatch -q regular -N 8192 -C knl script.sh → qos: regular_knl_0, partition: regular, gres: craynetwork:1
- scontrol update job=1234 qos=premium → qos: premium, partition: regular, gres: craynetwork:1
- sbatch -q shared -c 4 script.sh → qos: shared, partition: shared, features: shared, gres: craynetwork:0
- sbatch -q shared -c 36 script.sh → rejected (too many cpus for shared)

/etc/slurm/job_submit.lua

slurm.log_info("loading job_submit.lua")
package.path = package.path .. ';/usr/lib/nersc-slurm-plugins/?.lua'
local jsl = require "lib_job_submit"
jsl.setupfromyaml("/etc/slurm/policy.yaml")
return slurm.SUCCESS

lib_job_submit.lua (The Code)

Library code implementing a generic abstraction of site job policies:
- Enforces the same logic for all job submissions and job modifications
- Code design rewards small, simple functions with limited scope
- Internally makes extensive use of Lua metatables to map and remap functions, so policies can be parsed and matched against jobs without complex procedural code
- Full unit test suite, which can run independently of Slurm
- Some limited capabilities to test policies ahead of deployment
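
The metatable-driven mapping can be illustrated with a minimal sketch. The class and helper names used here (LogicalQueue.new, dispatch, job.request.qos) are assumptions for illustration only, not the actual NERSC library code:

-- Minimal sketch (assumed names, not the NERSC implementation): a metatable-based
-- class whose policy keys are resolved to match_* methods by name at run time.
local LogicalQueue = {}
LogicalQueue.__index = LogicalQueue

function LogicalQueue.new(label)
    return setmetatable({ label = label }, LogicalQueue)
end

-- Resolve a criteria key such as "RequestQos" to the method match_RequestQos,
-- so adding a new criterion only requires writing one small function.
function LogicalQueue:dispatch(key, value, job)
    local fn = self["match_" .. key]
    if not fn then
        return false
    end
    return fn(self, value, job)
end

function LogicalQueue:match_RequestQos(value, job)
    return job.request.qos == value
end

-- Usage example:
-- local q = LogicalQueue.new("regular")
-- q:dispatch("RequestQos", "regular", { request = { qos = "regular" } })  --> true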

/etc/slurm/policy.yaml (The Data)

A single YAML file that describes:
- Job submission policies (Logical Queues, system defaults, etc.)
- QOS names and limits (not covered today)
- License descriptions and mapping to external resources (not covered today)
- Spank plugin actions (not covered today)
- Actions to take in job submission logic (e.g., record user requests in AdminComment)

The YAML document is easily managed and simple to track in git for better auditing of policy changes.

General Data Model and Flow

- job_submit/lua: slurm_job_submit() and slurm_job_modify() call into lib_job_submit.lua's job_submit() and job_modify(), passing job_desc{}, job_record{}, and part_list{}.
- init(): setup_from_yaml() builds the SiteConfig{}, including queues[] of LogicalQueue{} objects, each with match_*(), validate_*(), and apply_*() methods used by evaluate(job).
- process_job(): gathers job data into DerivedJobData{} (Request{}, Record{}, ToWrite{}); arbitrary _init_*() functions perform job preprocessing and arbitrary _apply_*() functions perform job postprocessing; each queue is evaluated in reverse priority order and the matching queue is applied with apply(queue).

Logical Queues

- Each Logical Queue is defined in policy.yaml and instantiated as a LogicalQueue class instance
- Methods are functions for evaluating arbitrary expressions
- LogicalQueue can be inherited from to implement new, site-specific functionality

Upon slurm_job_submit() or slurm_job_modify():
- Job data is gathered into a DerivedJob instance and initial job processing occurs
- LogicalQueues are evaluated in reverse priority order (highest to lowest)
- Evaluation stops upon the first full match
- Final job processing occurs (static, regardless of the matched LogicalQueue)
- The matched LogicalQueue's Execution parameters are applied to the job
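
As a rough illustration of that sequence, here is a minimal sketch; the names (evaluation_priority, evaluate(), apply()) are assumptions based on the data-model slide, not the actual lib_job_submit.lua code:

-- Minimal sketch (assumed names): evaluate queues with the highest
-- EvaluationPriority first and stop at the first full match.
local function select_logical_queue(queues, job)
    table.sort(queues, function(a, b)
        return a.evaluation_priority > b.evaluation_priority  -- reverse priority order
    end)
    for _, queue in ipairs(queues) do
        if queue:evaluate(job) then   -- match_*() and validate_*() checks
            return queue              -- first full match wins
        end
    end
    return nil                        -- no Logical Queue matched this job
end

-- After final job processing, the matched queue's Execution* parameters are
-- applied to the job, e.g.:
-- local queue = select_logical_queue(config.queues, job)
-- if queue then job:apply(queue) end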

Logical Queues - Matching Criteria

Criteria are (key, value) pairs that, if they evaluate to True, may match a given queue. There can be multiple sets of criteria, subject to an OR (any single set can match); each criteria set may contain several matching expressions, which are ANDed.

regular_large:
  MatchingCritera:
  - RecordQos: regular_0
    RecordPartition: regularx
  EvaluationPriority: 100
regular:
  - RecordQos: regular_1
  EvaluationPriority: 10

Request* fields are matched against the job_desc data structure, which is set in either job_submit() or job_modify(). Record* fields are matched against the existing job_record data structure, which is set only in job_modify().
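
A short sketch of that OR-of-ANDs rule; the field and lookup names are assumed for illustration, and the real library presumably performs this inside LogicalQueue's evaluate(job):

-- Minimal sketch (assumed names, not the actual library): a queue matches when
-- ANY one criteria set matches, and a criteria set matches only when ALL of its
-- expressions match.
local function queue_matches(queue, job)
    for _, criteria_set in ipairs(queue.matching_criteria) do
        local all_match = true
        for key, value in pairs(criteria_set) do
            local fn = queue["match_" .. key]         -- e.g. match_RecordQos
            if not (fn and fn(queue, value, job)) then
                all_match = false
                break
            end
        end
        if all_match then
            return true    -- OR across criteria sets
        end
    end
    return false
end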

Setting RequestQos to None in a criteria set ensures that the job_desc qos field is empty. This prevents an scontrol update that is trying to change a QOS from seeing lower-priority LogicalQueues.

Which matches? sbatch --qos regular -N 5 script.sh

Which matches? sbatch --qos regular -N 5000 script.sh

Which matches?
$ sbatch --qos regular -N 50 script.sh
Submitted batch job 1234
$ scontrol update job=1234 NumNodes=80

Logical Queue: Matching Functions in lib_job_submit

function LogicalQueue:match_RecordPartition(value, job)
    value = value or {}
    return self._findstring(job.record.partition, value)
end

function LogicalQueue:match_MinNodes(value, job)
    local min_nodes, max_nodes = job:getnodes()
    if not min_nodes or min_nodes < tonumber(value) then
        return false
    end
    return true
end

Logical Queues - Requirements

Similar syntax and functions as Matching expressions, but used to reject a job if the specified requirements are not true. This allows a policy to match a job but still reject it (useful for providing good error messages).

policy.yaml (in a LogicalQueue):

Requirements:
  RequireArchSpecified: true

lib_job_submit.lua:

function LogicalQueue:validate_RequireArchSpecified(value, job)
    if value and not job.nodeclass then
        local msg = string.format("no hardware architecture specified (-C)!")
        error(slurmerror:new(slurm.error, msg, msg))
    end
    return true
end

Logical Queues - Execution Parameters

Similar syntax and functions as Matching expressions, but used to rewrite job parameters. Ideally each Execution* function should make only one change. It is good to log changes here to expose them in the slurmctld log.

policy.yaml (in a LogicalQueue):

Apply:
  ExecutionQos: regular_0
  ExecutionPartition: regular

lib_job_submit.lua:

function LogicalQueue:apply_ExecutionQos(value, job)
    -- this will get set elsewhere if append account is enabled
    if self.apply.executionqosappendaccount then
        return
    end
    slurm.log_info("apply_executionqos (LogicalQueue:%s): setting qos to %s",
                   self.label, value)
    job.request.qos = value
end

Examples: Default Logical Queue with Legacy Support

Goal: Send jobs to the debug logical queue by default, ensuring the previous user interface is still supported. The final qos and partition should be debug.

User Interface: sbatch script.sh; sbatch -q debug ...; sbatch -p debug ...

queues:
  debug:
    # in the case of a no option job
    # submission, need all blank to prevent
    # accidental matching during an update
    - RecordQos: None
      RecordPartition: None
    # expected job submission (-q debug)
    - RequestQos: debug
    # old interface (-p debug)
    # allow qos if user is verbose
    - RequestPartition: debug
      RequestQos: [None, debug]
    # match for job update
    - RecordQos: debug
    Apply:
      ExecutionQos: debug
      ExecutionPartition: debug
    EvaluationPriority: 1

Examples: Reservations

Goal: Jobs run in advance reservations should not be limited by normal job policies; instead, limits are delegated to the reservation itself.

User Interface: sbatch --reservation=myres --exclusive

Implementation: the resv partition and resv qos are only matched and used when a reservation is specified.

reservation:
  - RequestAdvancedReservation: true
    Exclusive: true
  - RecordAdvancedReservation: true
    Exclusive: true
  Requirements:
    RequireArchSpecified: true
  Apply:
    ExecutionQos: resv
    ExecutionPartition: resv
  EvaluationPriority: 2501

reservation_shared:
  - RequestAdvancedReservation: true
    Exclusive: false
  - RecordAdvancedReservation: true
    Exclusive: false
  Requirements:
    RequireMaxCpuPerNodeFraction: 0.5
  Apply:
    ExecutionQos: resv_shared
    ExecutionPartition: resv_shared
    ExecutionArch: haswell
    ExecutionGres: craynetwork:0
  EvaluationPriority: 2500

Examples: Special Users with Extraordinary Needs

Goal: Allow the Slurm account ResGroup1 to get a static priority boost of 100000 for a subset of their work (the rest at normal priority), not to exceed 300 nodes simultaneously.

User Interface: sbatch -q special -A ResGroup1

The special Logical Queue provides a generic method for implementing account-focused QOS policies. It maintains user interface simplicity and keeps policy management at the qos level (not at the account level).

policy.yaml changes:

queues:
  special:
    - RequestQos: special
    - RecordQosRegexMatch: special_%s+
    Apply:
      ExecutionQos: special
      ExecutionQosAppendAccount: true
      ExecutionPartition: regular
    EvaluationPriority: 999
...
qos:
  special_resgroup1:
    GrpTRES: nodes=300
    priority: 100000
    MaxSubmitJobsPerUser: 10
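
The ExecutionQosAppendAccount behaviour referenced here (and guarded against in apply_ExecutionQos earlier) can be sketched as follows; this is a hypothetical illustration of how a per-account QOS name such as special_resgroup1 could be derived, not the NERSC implementation:

-- Sketch only (hypothetical helper and field names): when append-account is
-- enabled, derive the per-account QOS by appending the lowercased account name
-- to the queue's ExecutionQos, e.g. special + resgroup1 -> special_resgroup1.
local function append_account_qos(queue, job)
    local account = job.request.account or (job.record and job.record.account)
    if not (queue.apply.executionqosappendaccount and account) then
        return
    end
    job.request.qos = string.format("%s_%s", queue.apply.executionqos,
                                    string.lower(account))
end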

Conclusions

- lib_job_submit.lua provides generic library code for manipulating jobs and reading Slurm state when enforcing policy
- /etc/slurm/policy.yaml describes all the system policies other than those in slurm.conf
- Each system gets its own policy.yaml: this enables focus on the desired user interface and flexibly supports new and old interfaces
- Cori has 22 logical queues (policy sets), 50 matching criteria, two distinct accounting hierarchies (funding sources), and 44 Execution QOS in operation
- Separating job submission parameters from execution parameters simplifies policy change management
- Managing policy.yaml is much easier than managing procedural code for making these decisions

Future Work

- Abstract NERSC-specific logic to allow a fully generic implementation: site-specific Lua code can be implemented in classes inheriting SiteConfig, DerivedJob, or LogicalQueue
- Public release of this and all other NERSC spank plugins at http://github.com/nersc/nersc-slurm-ext (still awaiting open source approval from the Lab and DOE)
- Support introspection of Slurm account hierarchies: allow policies to make decisions based on accounting hierarchy metadata instead of account name (e.g., RecordAccountAncestor: fundinga); needs modifications to job_submit/lua and lib_job_submit.lua
- Extend policy.yaml for Slurm Federation and pseudo-federation: enable job submission and verification for Slurm federated clusters supporting non-homogeneous policies

Thank You.