Native SLURM on Cray XC30. SLURM Birds of a Feather SC 13

Similar documents
Workload Managers. A Flexible Approach

Slurm Overview. Brian Christiansen, Marshall Garey, Isaac Hartung SchedMD SC17. Copyright 2017 SchedMD LLC

Introduction to Slurm

SLURM Operation on Cray XT and XE

Slurm Birds of a Feather

Slurm Roadmap. Morris Jette, Danny Auble (SchedMD) Yiannis Georgiou (Bull)

Slurm Workload Manager Overview SC15

Heterogeneous Job Support

Slurm Roadmap. Danny Auble, Morris Jette, Tim Wickberg SchedMD. Slurm User Group Meeting Copyright 2017 SchedMD LLC

Exascale Process Management Interface

Using Docker in High Performance Computing in OpenPOWER Environment

Running Jobs on Blue Waters. Greg Bauer

MDHIM: A Parallel Key/Value Store Framework for HPC

Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot

Versions and 14.11

Moab Workload Manager on Cray XT3

Cray Operating System and I/O Road Map Charlie Carroll

Running applications on the Cray XC30

Compiling applications for the Cray XC

IRIX Resource Management Plans & Status

Scheduler Optimization for Current Generation Cray Systems

High Scalability Resource Management with SLURM Supercomputing 2008 November 2008

Resource Management at LLNL SLURM Version 1.2

HTCondor on Titan. Wisconsin IceCube Particle Astrophysics Center. Vladimir Brik. HTCondor Week May 2018

Batch environment PBS (Running applications on the Cray XC30) 1/18/2016

Porting SLURM to the Cray XT and XE. Neil Stringfellow and Gerrit Renker

Grid Computing in SAS 9.4

Understanding Aprun Use Patterns

Brigham Young University Fulton Supercomputing Lab. Ryan Cox

Toward Runtime Power Management of Exascale Networks by On/Off Control of Links

Reducing Cluster Compatibility Mode (CCM) Complexity

BeeGFS. Parallel Cluster File System. Container Workshop ISC July Marco Merkel VP ww Sales, Consulting

Grid Architectural Models

Microsoft SharePoint 2010 The business collaboration platform for the Enterprise and the Web. We have a new pie!

Cluster Computing. Resource and Job Management for HPC 16/08/2010 SC-CAMP. ( SC-CAMP) Cluster Computing 16/08/ / 50

Directions in Workload Management

Slurm Burst Buffer Support

PBS PROFESSIONAL VS. MICROSOFT HPC PACK

From Moab to Slurm: 12 HPC Systems in 2 Months. Peltz, Fullop, Jennings, Senator, Grunau

Intel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla

Optimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa

Unifying Heterogeneous Resources Moab Con Scott Jackson Engineering

The Eclipse Parallel Tools Platform Project

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Cisco CloudCenter Solution with Cisco ACI: Common Use Cases

Improved Infrastructure Accessibility and Control with LSF for LS-DYNA

Memory Footprint of Locality Information On Many-Core Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25

Bright Cluster Manager

Managing complex cluster architectures with Bright Cluster Manager

Hedvig as backup target for Veeam

Introduction to Eclipse

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Harmony Quick Upgrade Guide

Exadata Database Machine Administration Workshop NEW

Your World is Hybrid:

Enterasys 2B Enterasys Certified Internetworking Engineer(ECIE)

IBM CICS TS V5.5. Your essential guide to this release

Building NFV Solutions with OpenStack and Cisco ACI

Using Quality of Service for Scheduling on Cray XT Systems

Slurm BOF SC13 Bull s Slurm roadmap

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

Oracle Enterprise Manager 12c IBM DB2 Database Plug-in

The Path to GPU as a Service in Kubernetes Renaud Gaubert Lead Kubernetes Engineer

NERSC Site Report One year of Slurm Douglas Jacobsen NERSC. SLURM User Group 2016

Enabling the MERLOT Repository to be searched in Moodle

Planning and Administering SharePoint 2016

vsan Mixed Workloads First Published On: Last Updated On:

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

Enabling Cloud Adoption. Addressing the challenges of multi-cloud

Your Microservice Layout

Performance Monitoring and Management of Microservices on Docker Ecosystem

Software Defined Storage for the Evolving Data Center

Table of Contents DevOps Administrators

WEBSPHERE APPLICATION SERVER

DEPLOY MODERN APPS WITH KUBERNETES AS A SERVICE

Real-time Monitoring, Inventory and Change Tracking for. Track. Report. RESOLVE!

Planning and Administering SharePoint 2016

DEPLOY MODERN APPS WITH KUBERNETES AS A SERVICE

The four forces of Cloud Native

z/os Management Facility demonstration

Red Hat HyperConverged Infrastructure. RHUG Q Marc Skinner Principal Solutions Architect 8/23/2017

WhatÕs New in the Message-Passing Toolkit

Advanced Job Launching. mapping applications to hardware

Veritas Dynamic Multi-Pathing for VMware 6.0 Chad Bersche, Principal Technical Product Manager Storage and Availability Management Group

Networking & Security for Mesos

Scale-Out Architectures for Secondary Storage

The EU DataGrid Fabric Management

pbsacct: A Workload Analysis System for PBS-Based HPC Systems

Important DevOps Technologies (3+2+3days) for Deployment

VMware Cloud on AWS Technical Deck VMware, Inc.

Twitter Heron: Stream Processing at Scale

Workload Management Automation Drives Digital Business and Multicloud Expansion

Grid Computing in SAS 9.2. Second Edition

IBM WebSphere MQ for HP NonStop Update

Cray Operating System Plans and Status. Charlie Carroll May 2012

Scheduling Jobs onto Intel Xeon Phi using PBS Professional

CSinParallel Workshop. OnRamp: An Interactive Learning Portal for Parallel Computing Environments

RED HAT GLUSTER TECHSESSION CONTAINER NATIVE STORAGE OPENSHIFT + RHGS. MARCEL HERGAARDEN SR. SOLUTION ARCHITECT, RED HAT BENELUX April 2017

Secure, scalable storage made simple. OEM Storage Portfolio

Building scalable service-based applications Wicked Fast

Transcription:

Native on Cray XC30 Birds of a Feather SC 13

What s being offered? / ALPS The current open source version available on the SchedMD/ web page 2.6 validated for Cray systems Basic WLM functions This version supports most Cray features, but is a subset Cray doesn t support all of the capabilities, conversely doesn t support all of the ALPS capabilities Cray Cluster Compatibility Mode <NEW> Cray has contract(s) to add enhancements to for Cray systems These and existing enhancements will be pushed upstream to SchedMD to be included in open source repository COMPUTE STORE ANALYZE 2

Hybrid Architecture for Cray Prioritizes queue(s) of work and enforces limits Decides when and where to start jobs Terminates job when appropriate No daemons on compute nodes BASIL ALPS ALPS Daemon ALPS Allocates and releases resources for jobs (as directed by ) Launches tasks Monitors node health Manages node state Has daemons on compute nodes Manages Cray network resources Compute Nodes is a scheduler layer above ALPS and BASIL, not currently a replacement 3

ALPS Refactoring: Motivation Evolving system requirements driving changes Workload manager role not just launching jobs The role of a resource manager is managing job s resource requirements throughout job The resource manager s work only starts when a job begins processing The information a resource manager needs is constantly changing Resources a job needs are constantly changing Resiliency is an application s responsibility with system s assistance From User Group Meeting Keynote Address, 2011 4

Motivation Make Cray Systems More Accessible with Additional WLMs Increase potential customer base some RFPs require WLMs that we do not support LSF GridEngine Customers have expressed desire to run the same WLM across entire enterprise and/or data center 5

Native

Native Ø Create library of low-level ALPS functions and h/w APIs Ø Develop a native implementation Ø Native means no interaction with ALPS/BASIL Ø Cray developed plugins to provide following services (correspond to ALPS common APIs): Dynamic node state change information System topology information Congestion management information for HSS Protection key and protection domain management Node Health Check support Network performance counter management PMI port assignment management (when more than one application per compute node) Ø Working with SchedMD on implementation 7

Native Architecture for Cray Prioritizes queue(s) of work and enforces limits Allocates and releases resources for jobs Decides when and where to start jobs Terminates job when appropriate AlpsComm Library Launches tasks Monitors node health Manages node state Has daemon on compute nodes Plugin changes to: Daemon Compute Nodes Select Switch Task ProcTrack Cgroup AlpsComm Low level interfaces for network management 8

BASIL ALPS Q1 Q2 2013 Q3 Q4 Q1 Native Moab PBS Pro WLM Roadmap: Common Libs Q2 2014 Q3 Q4 Native GA Native GA release available 1Q2014 All plug-ins completed now including: Multiple Program Multiple Data (MPMD) launch Core Specialization Network performance counters WLM support (common) libraries documented ALPS Users: No change in functionality or interfaces Users: Native available for all sites WLM Vendors: WLM that are currently integrated with ALPS continue to work WLM (common) libraries documented and available for vendors to access/use directly 9

Questions? 10