Porting SLURM to the Cray XT and XE. Neil Stringfellow and Gerrit Renker


Background

Cray XT/XE basics
- Cray XT systems are among the largest in the world: 9 of the top 30 machines on the June 2010 Top500 list are XTs, and the number 1 system on that list is a Cray XT5 installation
- The new Cray XE line is expected to maintain this position for Cray
- Machines consist of a number of diskless compute nodes and a set of service nodes for external connectivity
- All nodes run Linux; Cray runs a lightweight version of Linux on the compute nodes
- Nodes are connected with a proprietary interconnect in a 3D torus configuration (small systems may be connected as a 2D torus or mesh)
- All nodes are part of the torus, both compute nodes and service nodes

The Cray XT series in the beginning
The first Cray XT3 systems were delivered in 2005 with a strictly defined set of hardware and software:
- DDN disk controllers for I/O
- Lustre file system
- PGI compilers
- Totalview debugger (optional)
- PBSPro batch system
Over time the hardware and software ecosystem has diversified:
- DDN/LSI/Panasas file servers/systems
- PGI/Pathscale/GNU/Intel/Cray compilers
- Totalview/DDT debuggers
- PBSpro/Torque-Moab/LSF batch systems
SLURM is missing from this list.

Cray history at CSCS
- CSCS runs a total of 6 different Cray systems: 3 production machines (one XT5 and two XT4) and 3 test and development machines (XE6, XT5, XT3)
- Cray XT systems have been at CSCS for five years: an 1,100-core Cray XT3 machine was delivered in June 2005
- The Cray XT3 machines ran the Catamount microkernel on the compute nodes, with yod as the job launcher
- Early systems were delivered with PBSpro and a primitive scheduler; the base scheduler provided with systems in 2005 was unsuitable for batch work on large machines

Building on the base PBSpro distribution
- Cray (Jason Coverston) wrote a scheduler in Tcl to plug in to PBSpro:
  - Had a simple backfilling algorithm
  - Provided advanced reservations
  - Had a simple formula for priority calculation
  - Was the base on which all subsequent scheduling algorithms were added
  - Tcl was chosen based on success at the Pittsburgh Supercomputing Center
- CSCS added to the scheduler over time to produce a very efficient scheduling policy:
  - Large jobs run with high priority and little delay
  - System utilisation is over 90%
  - Has special features that are required for operational weather forecasting
- Ported to XT4/ALPS in 2008: Tcl works well for custom backfilling, and scripting proved a great asset of PBS, but ALPS problems created downtime
- CSCS has a source-code licence for PBSpro, which allows for fixes if the distribution is broken, but patches supplied back to Altair might not be included in the next release

Motivation for change
- In mid-2010 Altair announced that they will no longer support the Tcl plugin
- We could convert to the Python plugin, but this is an opportune moment to consider alternatives
- Many other Cray sites use Torque/Moab, but brief testing at CSCS was not particularly successful, and Torque has the same background as PBSpro
- CSCS will make its next major procurement in 2012, so we need to look at the best options in this timeframe for whatever architecture we choose

Selection of SLURM
We considered SLURM to be the best candidate to look at for the future:
- Designed for large-scale systems from the outset (and CSCS's current XT5 system is reasonably large)
- In use on a variety of large systems around the world
- Investment in SLURM will reap benefits whichever large machine we choose in the future
- Can be deployed on all current systems at CSCS; we would like to have a common batch environment for all our machines
- Open source, with responsive and active developers
- Cray itself and several Cray customers have expressed interest
So we have been experimenting with SLURM since spring 2010; the first working port was implemented in early June 2010.

Cray systems at CSCS
We have a set of 3 production systems:
- rosa: XT5, 20 cabinets, 12 cores per node, PBSpro; 22,128-core machine, national service for academic research
- buin: XT4, 3 cabinets, 4 cores per node, PBSpro; 1,040-core machine, production system for operational weather forecasting
- dole: XT4, 2 cabinets, 4 cores per node, PBSpro; 688-core machine, failover system for the weather forecasters

Cray test systems at CSCS
We have 3 test and development systems:
- palu: XE6, 2 cabinets, 24 cores per node, slurm; the newest CSCS machine, with 4,224 compute cores and 5.6 terabytes of memory; the first XE6 system shipped out of Cray; now considered a production-development machine
- gele: XT5, 1 chassis, 12 cores per node, slurm; a single-chassis version of the large XT5 rosa
- fred: XT3, 1 chassis, dual-core nodes, slurm; test system for the Meteo machines

ALPS and BASIL

Overview of Cray job placement architecture
- There is no direct access to the hardware
- Cray provides the ALPS interface: it allocates resources, does house-keeping, and performs node health checking
- ALPS (Application Level Placement Scheduler) spans the whole system: 6 daemons plus supporting programs, with a single daemon running on each compute node
- BASIL (Batch Application Scheduler Interface Layer) is the XML interface used by ALPS

ALPS/Basil layer 1/2
- XML-RPC interface with 4 methods: the XML query goes via stdin to /usr/bin/apbasil, and the XML reply comes from the stdout of apbasil (see the sketch below)
- QUERY - INVENTORY: lists the compute nodes (service nodes are not listed) with their node attributes (cpus, mem, state, allocation), and lists the current ALPS resource allocations
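As an illustration of that stdin/stdout convention, the shell sketch below pipes an inventory query into apbasil. Only the method names and the apbasil path come from the slide; the exact BASIL XML element and attribute names shown are an assumption on our part (the real schema is defined by Cray).

  # Query the ALPS inventory: XML request on stdin, XML reply on stdout.
  # The request markup below is illustrative, not the authoritative schema.
  cat <<'EOF' | /usr/bin/apbasil
  <?xml version="1.0"?>
  <BasilRequest protocol="1.1" method="QUERY" type="INVENTORY"/>
  EOF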

ALPS/Basil layer 2/2
- RESERVE - allocate resources (see the life-cycle sketch below): allocate by mpp{width,depth,nppn} (PBS) or use an explicit node list (slurm); only Basil 3.1 lists the nodes selected by ALPS
- CONFIRM - commit the allocation: must happen within ca. 120 seconds of the RESERVE; this timeout clears the state if anything hangs or crashes
- RELEASE - orderly return of resources
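A rough sketch of that reservation life cycle follows. Only the method sequence and the roughly 120-second confirmation window come from the slides; the XML payload, attribute names, and the user/batch identifiers are assumptions made purely for illustration.

  # 1. RESERVE: request a placement (width/depth/nppn shown here; slurm would
  #    pass an explicit node list instead). Markup and attributes are assumed.
  cat <<'EOF' | /usr/bin/apbasil
  <?xml version="1.0"?>
  <BasilRequest protocol="1.1" method="RESERVE">
    <ReserveParamArray user_name="jdoe" batch_id="1234">
      <ReserveParam architecture="XT" width="48" depth="1" nppn="12"/>
    </ReserveParamArray>
  </BasilRequest>
  EOF
  # 2. CONFIRM: must follow within ~120 s, otherwise ALPS clears the reservation
  # 3. RELEASE: orderly return of the resources once the job has finished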

ALPS Job launch
- The batch system (SLURM) makes a request for resources from ALPS, which records the reservation in the ALPS database covering the compute nodes
- The confirmed resource reservation allows the job to run
- SLURM scripts actually run on Cray service nodes; batch systems only manage allocations through the ALPS layer
- Within the job, aprun launches and manages the execution of the application (a fuller example follows below):

  #!/bin/sh
  #SBATCH
  aprun -n 48 ./a.out
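For a slightly fuller usage sketch of the same pattern, the script below adds #SBATCH resource options; those options and their values are our additions for illustration, since the slide shows only a bare #SBATCH line and the aprun call.

  #!/bin/sh
  #SBATCH --ntasks=48          # illustrative resource request
  #SBATCH --time=00:30:00      # illustrative wall-clock limit
  # The script itself executes on a service node; only aprun places the
  # application onto the compute nodes reserved through ALPS.
  aprun -n 48 ./a.out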

Current Status

Cray restrictions/complications
- Cray nodes cannot be shared between jobs, due to an ALPS aprun restriction: hence select/linear (by-node) allocation, and no select/cons_res in the foreseeable future (configuration sketch below)
- The hardware range keeps expanding: SeaStar / Gemini ASICs (and their cabling), with 4 SeaStar cabling variants for 2D/3D tori
- The torus is unlike BlueGene: it may have odd dimensions (e.g. 1 x 12 x 16), and there are always holes in the torus due to service nodes
- The torus itself is complete with communication ASICs, but some ASICs are not connected to compute nodes; e.g. the CSCS Cray XT5 rosa is a 16x10x12 torus with holes
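The whole-node constraint translates into a very small piece of SLURM configuration. The fragment below is a sketch under that assumption: only SelectType=select/linear follows from the slide, while the partition definition and node-name range are hypothetical.

  # slurm.conf fragment (sketch): whole-node allocation only,
  # because ALPS/aprun cannot share a node between jobs
  SelectType=select/linear
  # hypothetical partition; node range shown for illustration only
  PartitionName=batch Nodes=nid[00008-00087] Default=YES Shared=NO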

E.g. 2D torus with holes
- Our new Cray XE6 palu has a 2D torus topology
- Nodes are arranged in a 16x12 grid; two nodes, with two NICs, share a Gemini ASIC connection to the torus, so the Gemini communication topology is 16x6
- The Hilbert curve ordering algorithm used in Slurm does not work well with this type of topology:

  36 39 43 42 31 30 32 33 54 55 57 56 45 44 48 51
  37 38 40 41 28 29 35 34 53 52 58 59 46 47 49 50
  12 11  9  8 27 26 14 15 72 73 61 60 79 78 76 75
  13 10  6  7 24 25 17 16 71 70 62 63 80 81 77 74
  XX  0  5  4 23 20 19 XX XX 68 67 64 83 82 87 XX
  XX  1  2  3 22 21 18 XX XX 69 66 65 84 85 86 XX

  XX - holes in the torus due to service nodes

- Note the large jump through the torus in the Hilbert ordering (for example, the consecutive ranks 43 and 44 sit far apart in the top row)

So we could not simply treat it as a Blue Gene.

CSCS libasil library
- Abstraction layer for Basil 1.0/1.1/1.2/3.1 XML, with 3 parsers to handle version differences
- Switches to the highest-supported Basil version and adapts parameter passing to the current Basil version
- Talks to the SDB database to resolve (X, Y, Z) coordinates and to distinguish Gemini/SeaStar
- Currently still external (--with-alpslib=...): could it be shipped under 'contribs/'? Otherwise it will be merged into select/cray

Working status
- We have slurm working on 2 XT and 1 XE system; over 15,000 jobs have been run so far
- All our test systems have a 2D torus; we are looking for collaborators to test on 3D (expect a different node ranking)
- Runs in frontend mode with a single slurmd, hence srun is disabled: use sbatch/salloc + aprun/mpirun (see the sketch below)
- OpenMPI is supported on XT and now supports BASIL_RESERVATION; XE support is still waiting to be done
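A minimal sketch of that sbatch/salloc + aprun workflow, assuming an interactive allocation; the node and task counts are arbitrary illustrative values.

  # srun is disabled in frontend mode, so launch with aprun inside the
  # allocation obtained from salloc (or from an sbatch script).
  salloc -N 4
  aprun -n 48 ./a.out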

Working status
- select/cray plugin (thanks to Danny): defers to select/linear for the actual work and is being populated with ALPS-specific code
- Most salloc/sbatch options are supported and mapped into the corresponding ALPS parameters (see the sketch below): --mem, -n, -N, --ntasks-per-node, --cpus-per-task
- Unusable options are blocked (--share, --overcommit)
- An ALPS inventory is taken every HealthCheckInterval
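A hedged sketch of how those options line up with ALPS placement parameters; the width/depth/nppn correspondence is our reading of the mpp terminology from the BASIL slides rather than something spelled out here, and the job script name and values are made up for illustration.

  # Assumed mapping (sketch):
  #   -n / --ntasks        -> ALPS width  (aprun -n)
  #   --ntasks-per-node    -> ALPS nppn   (aprun -N)
  #   --cpus-per-task      -> ALPS depth  (aprun -d)
  sbatch --ntasks=96 --ntasks-per-node=12 --cpus-per-task=1 job.sh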

Node ordering
- ALPS uses its own ALPS_NIDORDER ranking (a database ordering), which can be reconfigured
- slurm picks up this ordering for select/linear, but reconfiguration requires an ALPS/slurm restart
- Node ranking generalizes the Hilbert ordering: different orderings can be plugged in dynamically, with the ranking done by a topology or node selection plugin

Outlook

Integration work ongoing
- Patches for revision during Oct/Nov
- Discussions ongoing
- Need decisions from the slurm main developers: libasil in contrib or put into select/cray? interface details for node ranking...

Would be good to have...
- Frontend mode with multiple slurmds for (ALPS) load-balancing: ALPS actually appears to be the main bottleneck, not slurm, and it would be good to avoid a single point of failure
- A scripted scheduler interface (lua?)
- We would like to port our backfilling algorithm: it achieves very high utilisation (> 90%) on rosa and requires a large array to build the schedule plan
- We are also interested in GRES for our GPU cluster

Bon appétit