High Scalability Resource Management with SLURM, Supercomputing 2008, November 2008


High Scalability Resource Management with SLURM
Supercomputing 2008, November 2008
Morris Jette (jette1@llnl.gov)
LLNL-PRES-408498
Lawrence Livermore National Laboratory

What is SLURM
- Simple Linux Utility for Resource Management
- Performs resource management within a single cluster
- Typically used with an external scheduler (e.g. Moab or LSF)
- Arbitrates requests by managing queues of pending work
- Allocates access to compute nodes and their interconnect
- Launches parallel jobs and manages them (I/O, signals, time limits, etc.)
- Developed by LLNL with help from HP, Bull, Linux NetworX, and others

Design objectives
- High scalability: thousands of nodes
- Reliable: avoid single points of failure
- Simple to administer
- Open source (GPL)
- Extensible: very flexible plugin mechanism (see the sketch below)
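As an illustration of the plugin idea only (not SLURM's actual plugin API), the following minimal sketch loads optional functionality from a shared object at run time; the file name select_linear.so and the symbol select_nodes are hypothetical.

/* Minimal run-time plugin loading sketch (hypothetical plugin and symbol). */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*select_nodes_fn)(int nodes_wanted);

int main(void)
{
    void *handle = dlopen("./select_linear.so", RTLD_NOW);   /* hypothetical */
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    select_nodes_fn select_nodes =
        (select_nodes_fn)dlsym(handle, "select_nodes");      /* hypothetical */
    if (select_nodes)
        printf("plugin allocated %d nodes\n", select_nodes(16));

    dlclose(handle);
    return 0;
}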

SLURM architecture overview
[Architecture diagram: a primary slurmctld with a backup slurmctld provides the control functions; slurmdbd, backed by MySQL, holds accounting and configuration information; one slurmd daemon runs on each compute node.]

How we achieve high scalability
- High parallelism: dozens of active threads are common in the daemons
- Independent read and write locks on various data structures
- Offload as much work as possible from slurmctld (the SLURM control daemon)

Example of parallel slurmctld operations
- Thread 1: get system configuration information
- Thread 2: get information about job 1234
- Thread 3: get information about all of user bob's jobs
- Thread 4: get queue configurations
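The following is a minimal sketch of the independent read and write locks mentioned above, using POSIX read-write locks; job_lock, job_count, query_jobs, and submit_job are hypothetical names and do not reflect slurmctld's internal data structures. Several query threads can read concurrently, while an update holds the lock exclusively.

/* Concurrent readers, exclusive writer (hypothetical stand-in for a job table). */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t job_lock = PTHREAD_RWLOCK_INITIALIZER;
static int job_count = 0;                 /* stands in for the job table */

static void *query_jobs(void *arg)        /* e.g. answers a status RPC */
{
    pthread_rwlock_rdlock(&job_lock);     /* shared: readers run in parallel */
    printf("thread %ld sees %d jobs\n", (long)arg, job_count);
    pthread_rwlock_unlock(&job_lock);
    return NULL;
}

static void *submit_job(void *arg)        /* e.g. handles a job submission */
{
    (void)arg;
    pthread_rwlock_wrlock(&job_lock);     /* exclusive: blocks all readers */
    job_count++;
    pthread_rwlock_unlock(&job_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    pthread_create(&t[0], NULL, submit_job, NULL);
    for (long i = 1; i < 4; i++)          /* three concurrent query threads */
        pthread_create(&t[i], NULL, query_jobs, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}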

Hierarchical communication
- slurmd daemons (one per compute node) communicate through a hierarchy with fault tolerance and configurable fanout
- Most of the communication overhead is pushed down from slurmctld to the slurmd daemons
- Communication is synchronized so that system noise has minimal impact upon applications
- slurmd provides fault tolerance and combines results into one response message, dramatically reducing slurmctld's overhead
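The sketch below illustrates a fixed-fanout communication tree of the kind described above; the node numbering, the forward function, and the fanout of 8 are assumptions, and SLURM's real routing additionally re-routes around failed nodes. With fanout k, a broadcast reaches n daemons in O(log_k n) hops, and replies are combined on the way back up.

/* Fixed-fanout broadcast tree: node i forwards to children i*k+1 .. i*k+k. */
#include <stdio.h>

static void forward(int node, int fanout, int nnodes, int depth)
{
    printf("%*sslurmd %d (depth %d)\n", depth * 2, "", node, depth);
    for (int c = node * fanout + 1; c <= node * fanout + fanout; c++)
        if (c < nnodes)
            forward(c, fanout, nnodes, depth + 1);
}

int main(void)
{
    forward(0, 8, 64, 0);   /* 64 nodes, fanout 8: every node within 2 hops */
    return 0;
}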

Non-killable processes
- Non-killable processes (typically hung on I/O to a global file system) are not uncommon, so we try to minimize their impact
- SLURM supports a configurable timeout and a program to execute when non-killable processes are found, so that system administrators can respond quickly to problems
- SLURM releases a node for reuse by another job once all processes associated with the previous job on that node complete; there is no need to wait for all processes on all nodes to complete

Highly efficient algorithms
- Bitmap operations are used for much of the scheduling work
- RPCs use simple binary information rather than XML (which is more flexible, but slower)
- Example of node selection by bitmap:

  Nodes in selected partition          11111111111111110000
  AND nodes with selected features     11111111000000000000
  AND nodes with available resources   00111000000011100001
  Select from these nodes -->          00111000000000000000
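A minimal sketch of the bitmap selection shown above, assuming one bit per node packed into 64-bit words (SLURM's actual bitstring implementation differs): the eligible-node set is simply the bitwise AND of the per-criterion bitmaps, one machine instruction per 64 nodes.

/* Bitmap-based node selection sketch: AND the per-criterion node bitmaps. */
#include <stdint.h>
#include <stdio.h>

#define NODE_WORDS 16                    /* 16 words * 64 bits = 1024 nodes */

int main(void)
{
    uint64_t in_partition[NODE_WORDS]  = { UINT64_MAX };            /* nodes 0-63 */
    uint64_t has_features[NODE_WORDS]  = { 0x00000000000000FFULL }; /* nodes 0-7  */
    uint64_t has_resources[NODE_WORDS] = { 0x00000000000000F0ULL }; /* nodes 4-7  */
    uint64_t eligible[NODE_WORDS];

    for (int w = 0; w < NODE_WORDS; w++)
        eligible[w] = in_partition[w] & has_features[w] & has_resources[w];

    /* Bit i of word w represents node w*64 + i. */
    for (int w = 0; w < NODE_WORDS; w++)
        for (int i = 0; i < 64; i++)
            if (eligible[w] & (1ULL << i))
                printf("node %d is eligible\n", w * 64 + i);
    return 0;
}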

Hostlist expressions used in configuration file
Configuration file size is relatively independent of cluster size:

# slurm.conf
# plugins, timers, log files, etc.
#
NodeName=tux[0-1023] SocketsPerNode=4 CoresPerSocket=4
#
PartitionName=debug Nodes=tux[0-15] MaxTime=30:00
PartitionName=batch Nodes=tux[16-1023] MaxTime=1-00:00:00
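For illustration, the sketch below expands a simple single-range hostlist expression such as tux[0-15] into individual host names; it is not SLURM's hostlist library and handles only the name[lo-hi] form.

/* Expand "name[lo-hi]" into name<lo> .. name<hi> (single-range form only). */
#include <stdio.h>

static void expand_hostlist(const char *expr)
{
    char prefix[64];
    int lo, hi;

    if (sscanf(expr, "%63[^[][%d-%d]", prefix, &lo, &hi) == 3) {
        for (int i = lo; i <= hi; i++)
            printf("%s%d\n", prefix, i);
    } else {
        printf("%s\n", expr);            /* plain host name, no range */
    }
}

int main(void)
{
    expand_hostlist("tux[0-15]");        /* prints tux0 through tux15 */
    return 0;
}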

Hostlist expressions used in most commands

$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug        up      30:00    16   idle tux[0-15]
batch        up 1-00:00:00    32  alloc tux[16-47]
batch        up 1-00:00:00   976   idle tux[48-1023]

$ squeue
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST
 1234     batch a.out   bob  R 10:14    16 tux[16-31]
 1238     batch my.sh alice  R  8:57    16 tux[32-47]

Results
- SLURM is running on about 35% of the Top500 systems
- Total execution time (resource allocation, launch, I/O processing, resource deallocation):
   32 nodes   0.1 seconds
  256 nodes   1.0 seconds
   1k nodes   3.7 seconds
   2k nodes  19.5 seconds
   4k nodes  56.6 seconds
  (measurements taken on the same system and on two different systems with different configurations)
- A virtual machine with 64k nodes has been emulated

Special note on task launch
- Some vendors supply proprietary task launch mechanisms (e.g. IBM BlueGene mpirun)
- For compatibility with existing vendor tools and/or infrastructure (rather than for performance reasons), the vendor-supplied task launch mechanism can be used while SLURM performs the resource management
- This is the current model on IBM BlueGene systems

For more information about SLURM
- Information: https://computing.llnl.gov/linux/slurm/
- Downloads: http://sourceforge.net/projects/slurm/
- Email: jette1@llnl.gov
- SLURM BOF in the Hilton (directly across the street), Thursday 20 November, 3PM to 5PM, Room: Salon D

Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.