SLURM Operation on Cray XT and XE

Morris Jette, jette@schedmd.com

Contributors and Collaborators
This work was supported by the Oak Ridge National Laboratory Extreme Scale Systems Center. Swiss National Supercomputing Centre performed some of the development and testing. Cray helped with integration and testing.

Outline
- Cray hardware and software architecture
- SLURM architecture for Cray
- SLURM configuration and use
- Status

Cray Architecture
- Many of the most powerful computers are built by Cray
- Nodes are diskless
- 2- or 3-dimensional torus interconnect
- Multiple nodes at each coordinate on some systems
- Full Linux on front-end nodes
- Lightweight Linux kernel on compute nodes
- Whole nodes must be allocated to jobs

ALPS and BASIL
- ALPS: Application Level Placement Scheduler
  - Cray's resource manager
  - Six daemons plus a variety of tools
  - One daemon runs on each compute node to launch user tasks
  - Other daemons run on service nodes
  - Rudimentary scheduling software; dependent upon an external scheduler (e.g. SLURM) for workload management
- BASIL: Batch Application Scheduler Interface Layer
  - XML interface to ALPS
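In day-to-day use, ALPS is mostly visible to users through the aprun launcher. A minimal direct launch looks roughly like this (the task count and executable name are placeholders):

  # Launch 24 copies of a.out on compute nodes through ALPS
  aprun -n 24 ./a.out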

SLURM Architecture for Cray
- Many Cray tools are dependent upon ALPS
- Use SLURM as a scheduler layer above ALPS and BASIL, not as a replacement
- Layering: SLURM -> BASIL -> ALPS -> compute nodes

SLURM and ALPS Functionality
SLURM:
- Prioritizes queue(s) of work and enforces limits
- Allocates and releases resources for jobs
- Decides when and where to start jobs
- Terminates job when appropriate
- Accounts for jobs
ALPS:
- Launches tasks
- Monitors node health
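This division of labor is also visible from the command line. A small sketch, assuming Cray's standard apstat tool is available (the exact flags can vary by CLE release):

  # SLURM side: queue, limits, and allocation state
  sinfo                      # node and partition state as seen by SLURM
  squeue -u $USER            # pending and running jobs in SLURM's queue

  # ALPS side: placement and node health as seen by Cray's resource manager
  apstat -n                  # per-node status reported by ALPS
  apstat -a                  # applications currently placed by ALPS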

SLURM Architecture for Cray (Detailed)
- slurmctld (SLURM controller daemon, primary or backup): coordinates all activities
- slurmd (SLURM job daemons, active on one or more service nodes): runs the batch script
- slurmd reaches the compute nodes through BASIL and ALPS

Job Launch Process
1. User submits the script to slurmctld (the SLURM controller daemon, primary or backup), which coordinates all activities
2. Slurmctld creates an ALPS reservation
3. Slurmctld sends the script to slurmd (a SLURM job daemon), which runs the batch script
4. Slurmd claims the reservation for a specific session ID and launches the interpreter for the script (e.g. #!/bin/bash ... srun a.out)
5. aprun (optionally invoked through the srun wrapper) launches the tasks (aprun a.out) on the compute nodes
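To make steps 4 and 5 concrete, a minimal batch script of the kind sketched on the slide might look as follows (an illustrative sketch only: the job name, partition, node count, and the a.out executable are placeholders, and the #SBATCH options shown are generic SLURM options rather than anything Cray-specific):

  #!/bin/bash
  #SBATCH --job-name=cray_test     # placeholder job name
  #SBATCH --nodes=4                # whole nodes are allocated on Cray systems
  #SBATCH --time=00:30:00
  #SBATCH --partition=batch        # placeholder partition name

  # srun acts here as the wrapper described later: it translates its options
  # and invokes aprun, which performs the actual task launch through ALPS
  srun ./a.out

Submitting the script (e.g. sbatch job.sh) corresponds to step 1 above.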

SLURM Architecture for Cray
- Almost all Cray-specific logic is in a resource selection plugin (the same approach SLURM uses for IBM BlueGene systems)
- The select/cray plugin in turn calls the select/linear plugin to provide full-node resource allocation support, including job preemption, memory allocation, optimized topology layout, etc.
- Plugin layering: the SLURM kernel loads a resource selection plugin (BlueGene, ConsRes, Cray, or Linear, among other plugins); select/cray (new for Cray) layers on top of select/linear

SLURM's select/cray Plugin
- select/cray calls select/linear for node selection
- select/cray talks to ALPS either through libalps (a SLURM module, which speaks BASIL to ALPS and its MySQL database) or through libemulate (a SLURM module)
- libemulate emulates BASIL and ALPS for development and testing, allowing a Cray of any size to be emulated on any test system
- To use the emulation, build SLURM with the configure option --with-alps-emulation
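For reference, a build enabling the emulation might look roughly like the following (a sketch only: the installation prefix is arbitrary and the option name is taken from the slide):

  ./configure --prefix=/opt/slurm --with-alps-emulation
  make -j8
  make install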

SLURM Configuration

  #
  # Sample slurm.conf file for Cray system
  # (selected portions)
  #
  SelectType=select/cray          # Communicates with ALPS
  #
  FrontEndName=front[00-03]       # Where slurmd daemons run
  NodeName=nid[00000-01023]
  PartitionName=debug Nodes=nid[00000-00015] MaxTime=30
  PartitionName=batch Nodes=nid[00016-01023] MaxTime=24:00:00
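A few standard SLURM commands can confirm that the configuration above took effect (illustrative only):

  scontrol show config | grep -i selecttype   # should report select/cray
  sinfo                                       # lists the debug and batch partitions
  scontrol show partition batch               # shows Nodes=nid[00016-01023], MaxTime, etc.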

srun Command
- SLURM provides an srun command for Cray systems that is a wrapper for both salloc (to allocate resources as needed) and aprun (to launch tasks)
- Depending on whether an allocation already exists, the wrapper invokes salloc plus aprun, or aprun alone
- Options are translated to the extent possible
- Build SLURM with the configure option --with-srun2aprun to build the wrapper
- Otherwise the srun command advises use of aprun and exits
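A rough usage sketch, assuming the wrapper was built (node, task, and time values are illustrative, and the exact translation the wrapper performs may differ in detail):

  # With the srun wrapper: allocate and launch in one step
  srun -N 2 -n 48 -t 30 ./a.out

  # Roughly equivalent explicit form: allocate with salloc, launch with aprun
  salloc -N 2 -t 30
  aprun -n 48 ./a.out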

srun Options
- If no allocation exists, the srun options are translated directly to salloc options to create a job allocation
- Many srun options only apply when a job allocation is created
- After an allocation is created, the most common options are translated from srun to aprun (task count, node count, time limit, file names, support for multiple executables, etc.)
- Some srun options lack an aprun equivalent, and vice versa
- srun's --alps= option can pass any other options through to aprun
- SLURM environment variables are not currently set
- There are fundamental differences in I/O; for example, ALPS does not support per-rank I/O streams
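For example, an aprun-only option such as its -cc CPU-binding flag could be passed through the wrapper roughly as follows (a sketch; how the wrapper quotes and forwards the arguments may differ):

  # Pass aprun-specific arguments through the srun wrapper
  srun -n 48 --alps="-cc cpu" ./a.out
  # ...resulting in an aprun invocation along the lines of:
  #   aprun -n 48 -cc cpu ./a.out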

sview of Emulated System (screenshots)

smap of Emulated System (screenshot)

Caveats
- Some SLURM functionality has no ALPS equivalent
  - Independent I/O by task
  - Output labeled by task ID
- Some ALPS options have no SLURM equivalent
  - The srun wrapper has an --alps option to pass arbitrary arguments to aprun
- Some options are similar, but impossible to translate directly
  - Task binding syntax
  - Per-task vs. per-CPU limits

Caveats (continued)
- SLURM environment variables are not set
  - Many need to be set on a per-node or per-task basis, so ALPS must do this
  - Under development by Cray

Caveats (continued)
- SLURM GUIs (actually the curses and GTK libraries they use) have limited scalability; they scale to a few thousand nodes
- Currently each position displayed represents a unique X, Y, Z coordinate
  - If multiple nodes share an X, Y, Z coordinate, the information for only one node is displayed
  - We found this better than displaying each node independently and providing confusing topology information, but could change it if desired

Status
- SLURM has been running reliably over ALPS at the Swiss National Supercomputing Centre since April 2011
- Validated by Cray in July/August 2011