Resource Management at LLNL SLURM Version 1.2


UCRL-PRES-230170 Resource Management at LLNL: SLURM Version 1.2 April 2007 Morris Jette (jette1@llnl.gov), Danny Auble (auble1@llnl.gov), Chris Morrone (morrone2@llnl.gov) Lawrence Livermore National Laboratory

Disclaimer This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

What is SLURM's Role? Performs resource management within a single cluster. Arbitrates requests by managing a queue of pending work. Allocates access to resources (processors, memory, sockets, interconnect resources, etc.). Boots nodes and reconfigures the network on BlueGene. Launches and manages tasks (job steps) on most clusters. Forwards stdin/stdout/stderr. Enforces resource limits. Can bind tasks to specific sockets or cores. Can perform MPI initialization (communicate host, socket and task details). Supports job accounting. Supports a file transfer mechanism (via hierarchical communications).
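As a minimal illustration of the launch role described above (the node and task counts here are placeholders, not taken from the slides), a single srun invocation allocates resources, launches the tasks, and forwards their standard I/O back to the terminal:

srun -N2 -n4 hostname

With two nodes and four tasks, the output is simply four hostnames, two per node.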

Job Scheduling at LLNL: LCRM or Moab (http://www.clusterresources.com) provides highly flexible enterprise-wide job scheduling and reporting with a rich set of user tools. SLURM (http://www.llnl.gov/linux/slurm) provides highly scalable and flexible resource management within each cluster. (Diagram: LCRM or Moab dispatches work to the SLURM instance on each cluster; on BlueGene, SLURM interacts with MMCS for job control.) Both Moab and SLURM are production-quality systems in widespread use on many of the world's most powerful computers.

SLURM Plugins (a building-block approach to design): Dynamically linked objects loaded at run time according to the configuration file; 34 different plugins of 10 different varieties. Interconnect: Quadrics Elan3/4, IBM Federation, BlueGene, or none (for InfiniBand, Myrinet and Ethernet). Scheduler: Maui, Moab, FIFO or backfill. Also authentication, accounting, logging, MPI type, etc. (Diagram: the SLURM daemons and commands loading the authentication, interconnect, scheduler and other plugins.)
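Plugin selection is driven by slurm.conf. The fragment below is only a hedged sketch of what such a configuration might look like; the specific plugin choices shown (munge authentication, backfill scheduling, no switch plugin) are illustrative assumptions rather than values taken from the slides:

# illustrative slurm.conf fragment: one plugin per variety
AuthType=auth/munge            # authentication plugin
SchedulerType=sched/backfill   # scheduler plugin (sched/builtin gives plain FIFO)
SwitchType=switch/none         # interconnect plugin (none for Ethernet/InfiniBand)
MpiDefault=none                # default MPI plugin used for job steps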

SLURM's Scope (it's not so simple any more): Over 200,000 lines of code, over 20,000 lines of documentation, and over 40,000 lines of code in the automated test suite. Roughly 35% of the code was developed outside of LLNL. HP: added support for Myrinet, job accounting, consumable resources, and multi-core resource management; working on gang scheduling support. Bull: added support for CPUsets; working on expanded MPI support for MPICH2/MVAPICH2. Dozens of additional contributors at 15 sites worldwide.

SLURM is Widely Deployed: SLURM is production quality and widely deployed today (best guess is ~1000 clusters). ~7 downloads per day from LLNL and SourceForge, to over 500 distinct sites in 41 countries. Directly distributed by HP, Bull and other vendors to many other sites. No other resource manager comes close to SLURM's scalability and performance.

SLURM 1.0 Architecture on a Typical Linux Cluster (point-to-point communications): Compute nodes each run slurmd; administration nodes run slurmctld (primary plus optional backup), with state saved to a flat file. User commands: srun (submit job or spawn tasks), sinfo (system status), squeue (status jobs), sacct (accounting), scancel (signal jobs), scontrol (admin tool), smap (topology info).
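As a hedged illustration of how the user commands listed above are typically used (the job ID is a placeholder):

sinfo                    # partition and node status
squeue -u $USER          # status of my pending and running jobs
scancel 1234             # signal/cancel job 1234
scontrol show job 1234   # detailed job information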

SLURM 1.2 Architecture on a Typical Linux Cluster (hierarchical communications): Compute nodes each run slurmd; administration nodes run slurmctld (primary plus optional backup), with state saved to a flat file. User commands: salloc (get an allocation), sbatch (submit a batch job), sattach (attach to an existing job), srun (submit job or spawn tasks), sbcast (file broadcast), sinfo (system status), squeue (status jobs), sacct (accounting), scancel (signal jobs), scontrol (admin tool), smap (topology info), sview (GUI for system information), strigger (event manager).
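A hedged sketch of how the new 1.2 commands work together (the job ID, partition, and program names are placeholders):

salloc -N2 -p pdebug bash   # create a two-node allocation and start a local shell
srun -n4 ./a.out &          # within that shell, launch a four-task job step
sattach 1234.0              # from another terminal, attach to step 0 of job 1234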

SLURM on Linux launch (diagram): a login node, a management node running slurmctld, and three compute nodes each running slurmd.

SLURM on Linux launch (diagram): the user runs srun -N3 -n6 -p pdebug mpiprog on the login node; srun sends the request to slurmctld, and the launch request reaches the slurmd daemons on the allocated compute nodes.

SLURM on Linux launch (diagram, continued): each slurmd spawns a slurmstepd process to manage the job step.

SLURM on Linux launch (diagram, continued): TCP streams carry task standard I/O between srun and each slurmstepd (one stream per node).

SLURM on Linux launch (diagram, continued): each slurmstepd launches the MPI tasks (six tasks across the three nodes for srun -N3 -n6 -p pdebug mpiprog).

SLURM on Linux with InfiniBand launch (diagram): during MPI_Init(), the mvapich plugin in srun exchanges the information the MPI tasks need to establish their InfiniBand connections.
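A hedged sketch of how the MPI plugin can be selected explicitly on the srun command line (the program name is a placeholder; the default normally comes from MpiDefault in slurm.conf):

srun --mpi=mvapich -N3 -n6 -p pdebug ./mpiprog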

SLURM on AIX: IBM's poe command must be used to launch MPI applications (SLURM replaces LoadLeveler). Complications: a limited number of switch windows per node, plus unreliable step completion.

SLURM LoadLeveler API Library (diagram sequence): poe loads slurm_ll_api.so, a library that implements the LoadLeveler API on top of SLURM and communicates with slurmctld. The slurmd on each allocated compute node spawns a slurmstepd, which launches IBM's pmd (partition manager daemon) on that node. The pmd processes then launch the MPI application tasks.

AIX Limited Switch Windows: 16 switch windows per adapter (32 per node with two adapters). Each task consumes one window per adapter, so at 8 tasks per node a single application uses 16 of the 32 windows, and only two such applications can be launched before the windows are exhausted.

Old SLURM Step Completion (diagram): the MPI tasks exit.

Old SLURM Step Completion (diagram, continued): the pmd processes exit.

Old SLURM Step Completion (diagram, continued): each slurmstepd reports PMD completion to poe via slurm_ll_api.so, which sends the step complete message to slurmctld. Possible problems arise if poe is killed before the step complete message is delivered.

New SLURM Step Completion (diagram): the slurmstepds send the step complete message to slurmctld through a hierarchical tree; the tree avoids relying on an unreliable user process (poe).

BlueGene Differences: Uses only one slurmd to represent many nodes. Treats one midplane with many compute nodes (c-nodes) as one SLURM node. Creates blocks of BlueGene midplanes (groups of 512 c-nodes) or partial midplanes (32 or 128 c-nodes) to run jobs on; SLURM wires the midplanes together so they can talk to each other. IBM's mpirun command is used to launch MPI applications (SLURM replaces LoadLeveler). The system is monitored through an API into a DB2 database to determine the status of its various parts.

BlueGene Job Request Flow (diagram): on the front end node, a psub script submits a batch script containing mpirun; slurmctld (primary) with its bluegene plugin runs on the service node and talks to DB2; the mpirun back end (mpirun be) on the service node starts the application on the compute nodes.

New user commands: salloc creates a resource allocation and runs one command locally; sbatch submits a batch script to the queue; sattach attaches to a running job step; slaunch (shown crossed out on the slide) would have launched a parallel application within an existing resource allocation; srun launches a parallel application, with or without an existing resource allocation.

More new user commands: sbcast, file broadcast using hierarchical slurmd communication; strigger, event trigger management; sview, a GTK GUI for users and admins.
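A hedged sketch of how two of these commands might be used (the file paths and trigger script name are placeholders, not taken from the slides):

sbcast ./a.out /tmp/a.out    # within an allocation, copy a program to every allocated node
strigger --set --node --down --program=/usr/local/sbin/node_down_alert    # run a script when any node goes down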

New Command Examples:
salloc -N4 -p pdebug xterm
sbatch -N1000 -n8000 mybatchscript
sbatch -N4 <<EOF
#!/bin/sh
hostname
EOF
sattach 6234.15

Multi-Core Support: Resources are allocated by node, socket, core or thread. Complete control is provided over how tasks are laid out on sockets, cores, and threads, including binding tasks to specific resources to optimize performance. Either explicitly set masks with --cpu_bind and --mem_bind, or automatically generate the binding with simple directives. HPLinpack speedup of 8.5%, LS-DYNA speedup of 10.5%. Details at http://www.llnl.gov/linux/slurm/mc_support.html. Example: srun -N4 -n32 -B4:2:1 --distribution=block:cyclic a.out, where --distribution gives the task distribution across nodes : within nodes, and -B gives sockets per node : cores per socket : threads per core.
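A hedged sketch of the explicit-mask binding style mentioned above (the CPU IDs and program name are placeholders):

srun -N1 -n4 --cpu_bind=map_cpu:0,1,2,3 --mem_bind=local ./a.out    # bind task i to CPU i and its memory to the local NUMA node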

Future Enhancements: Gang scheduling and automatic job preemption by partition/queue (work by HP). Full PMI support for MPICH2 (work by Bull). Improved integration with the Moab scheduler (especially for BlueGene). Support for BlueGene/P. Improved integration with Open MPI. Job accounting in a proper database (rather than a data file). Automatic job migration when node failures are predicted. Support for jobs to change size.