Using Quality of Service for Scheduling on Cray XT Systems

Using Quality of Service for Scheduling on Cray XT Systems
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee

Outline
  - Introduction
  - Scheduling Cray XT systems with TORQUE and Moab
    - Scheduling algorithm
    - Queue structure
    - Job prioritization
    - Quality of service levels
  - Case studies
    - Normal operation on Kraken
    - Nightly weather forecasting on Kraken
    - User-managed scheduling on Athena
  - Conclusions

Introduction NICS operates two Cray XT systems for the U.S. National Science Foundation's TeraGrid: Kraken, an 88-cabinet XT5, and Athena, a 48-cabinet XT4. The two systems are allocated differently: Kraken is allocated through the TeraGrid Resource Allocation Committee, while Athena is dedicated to one or two projects at a time on a roughly quarterly basis. On both systems, there is an occasional need to give special scheduling consideration to individual projects or users.

System Details

                               Kraken                           Athena
  Cabinets                     88                               48
  Compute Nodes                8,256                            4,512
  Processor                    AMD Opteron, 2.6 GHz hex-core    AMD Opteron, 2.3 GHz quad-core
  Total Cores                  99,072                           18,048
  Peak Performance (TFLOP/s)   1,030.3                          165.9
  Memory (TB)                  129                              17.6
  Disk (TB)                    2,400                            85
  Disk Bandwidth (GB/s)        30                               12

Scheduling Cray XT Systems with TORQUE and Moab Like many large XT systems, Kraken and Athena use TORQUE and Moab for resource management. TORQUE is an open-source batch system from Adaptive Computing (née Cluster Resources); it is derived from PBS and runs on everything from a laptop to JaguarPF. Moab is a closed-source scheduler from Adaptive Computing that works with many batch systems, including TORQUE. It provides a highly flexible priority system, advance reservations, quality of service (QOS) mechanisms, and a modular design that allows integration with external services (e.g. ALPS on Cray XTs).

Scheduling Algorithm The basic Moab scheduling algorithm has seven steps:
  1. Update status information from resource manager(s). *
  2. Refresh reservations.
  3. Start jobs with reservations (if possible). *
  4. Start jobs with highest priority (if possible). *
  5. Backfill jobs. *
  6. Update statistics.
  7. Handle client requests.
  (* Requires interaction with ALPS on Cray XTs.)
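
The resource-allocation heart of this loop (steps 3 through 5) can be illustrated with a short sketch. The Python below is a simplified model written for this summary, not Moab's implementation; the Job fields and the fixed horizon used as the backfill shadow time are assumptions made purely for illustration.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Job:                                    # hypothetical job record for illustration
        name: str
        nodes: int
        walltime: float                           # requested walltime, hours
        priority: float
        reservation_start: Optional[float] = None # absolute hour, if the job holds a reservation
        running: bool = False

    def schedule_pass(jobs: List[Job], free_nodes: int, now: float,
                      horizon: float = 24.0) -> int:
        """One simplified scheduling iteration over a set of idle jobs."""
        def start(job: Job) -> None:
            nonlocal free_nodes
            job.running = True
            free_nodes -= job.nodes

        idle = [j for j in jobs if not j.running]

        # Step 3: start jobs whose reservation window has opened.
        for job in idle:
            if (job.reservation_start is not None
                    and job.reservation_start <= now
                    and job.nodes <= free_nodes):
                start(job)

        # Step 4: start remaining jobs in priority order while nodes last.
        blocked: List[Job] = []
        for job in sorted((j for j in idle if not j.running),
                          key=lambda j: j.priority, reverse=True):
            if job.nodes <= free_nodes:
                start(job)
            else:
                blocked.append(job)

        # Step 5: backfill lower-priority jobs that fit in the leftover nodes and
        # finish before the top blocked job's estimated start ("shadow") time.
        if blocked:
            shadow = now + horizon   # crude placeholder for Moab's earliest-start estimate
            for job in blocked[1:]:
                if job.nodes <= free_nodes and now + job.walltime <= shadow:
                    start(job)

        return free_nodes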

Queue Structure

  Queue        Max. Walltime   Max. Cores (Athena)   Max. Cores (Kraken)
  small        24 hours        512                   512
  longsmall    60 hours        512                   256
  medium       24 hours        2,048                 8,192
  large        24 hours        8,192                 49,536
  capability   24 hours        18,048                99,072
  dmover       24 hours        0                     0
  hpss         24 hours        0                     0
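
The per-queue limits above lend themselves to a simple routing rule. The following Python sketch shows how a submit filter might pick a queue from the Kraken limits; it is a hypothetical illustration only, since the real routing on Kraken and Athena is done in the TORQUE/Moab configuration, and the size-zero dmover and hpss queues are omitted.

    # Kraken queue limits from the table above: (queue, max walltime in hours, max cores).
    KRAKEN_QUEUES = [
        ("small",      24,    512),
        ("longsmall",  60,    256),
        ("medium",     24,   8192),
        ("large",      24,  49536),
        ("capability", 24,  99072),
    ]

    def pick_queue(cores: int, walltime_hours: float) -> str:
        """Return the first (smallest) queue whose limits accommodate the request."""
        for name, max_wall, max_cores in KRAKEN_QUEUES:
            if cores <= max_cores and walltime_hours <= max_wall:
                return name
        raise ValueError("request exceeds every queue's limits")

    # Example: pick_queue(4096, 12) -> "medium"; pick_queue(400, 48) raises, since
    # only longsmall allows more than 24 hours and it is capped at 256 cores.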

Job Prioritization In addition to having similar queue structures, Kraken and Athena use the same set of priority weights, which prioritize jobs based on size, queue time, and expansion factor (though the expansion-factor component was recently removed).
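
As a rough model of how these components combine, the sketch below computes a weighted-sum priority in the spirit of Moab's priority system; the weights, units, and exact combination are placeholders for illustration rather than the values configured on Kraken or Athena.

    # Illustrative only: made-up weights, not the site's configured values.
    SIZE_WEIGHT      = 10     # rewards larger jobs (cores requested)
    QUEUETIME_WEIGHT = 1      # rewards jobs that have waited longer (minutes queued)
    XFACTOR_WEIGHT   = 0      # expansion-factor component, zeroed since it was removed

    def expansion_factor(queued_minutes: float, walltime_minutes: float) -> float:
        """Standard expansion factor: (queue time + requested walltime) / walltime."""
        return (queued_minutes + walltime_minutes) / walltime_minutes

    def job_priority(cores: int, queued_minutes: float, walltime_minutes: float) -> float:
        return (SIZE_WEIGHT * cores
                + QUEUETIME_WEIGHT * queued_minutes
                + XFACTOR_WEIGHT * expansion_factor(queued_minutes, walltime_minutes))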

Quality of Service Levels A quality of service (QOS) level is an object in Moab that enables a job to request special scheduling consideration. A QOS can be applied to a job automatically by Moab or requested explicitly (e.g. -l qos=foo in TORQUE), and access to a QOS can be assigned to any Moab credential (user, group, account, or queue/class). A QOS can have a number of effects: modifying priority, modifying throttling policies, changing reservation behavior, changing backfill behavior, and enabling preemption.
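
To make the list of effects concrete, here is a small illustrative model of a QOS object and how one of its effects (a priority modifier) might be applied to a job; the field names and example values are assumptions for illustration and do not reflect Moab's internal representation.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class QOS:
        name: str
        priority_offset: int = 0                 # modify priority
        max_running_jobs: Optional[int] = None   # throttling policy
        next_to_run: bool = False                # reservation behavior (e.g. an NTR-style flag)
        backfill_allowed: bool = True            # backfill behavior
        can_preempt: bool = False                # preemption

    def effective_priority(base_priority: float, qos: QOS) -> float:
        """Apply the QOS priority modifier to a job's base priority."""
        return base_priority + qos.priority_offset

    # Example: a job that explicitly requested '-l qos=foo' at submission would be
    # matched against a QOS object such as (values are made up):
    foo = QOS("foo", priority_offset=10000, next_to_run=True)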

Quality of Service Levels (cont'd.) Kraken and Athena originally had two non-default QOS levels available:
  - sizezero: applied automatically by Moab to any job requesting size=0 (typically data transfer jobs); uses the RUNNOW flag so that size=0 jobs run in a timely manner despite their low priority.
  - negbal: applied by the TORQUE submit filter to jobs charging against projects whose balances have gone negative; carries a large negative priority modifier.
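
The assignment logic for these two QOSes can be sketched as follows; this is an illustration only, assuming a hypothetical get_project_balance() accounting lookup, since the actual assignment is made by a site-specific TORQUE submit filter (negbal) and by Moab itself (sizezero).

    from typing import Callable, Optional

    def qos_for_job(project: str, size: int,
                    get_project_balance: Callable[[str], float]) -> Optional[str]:
        if size == 0:
            return "sizezero"          # in practice applied automatically by Moab
        if get_project_balance(project) < 0:
            return "negbal"            # in practice applied by the submit filter
        return None                    # job keeps the default QOS

    # Example: qos_for_job("projectA", 4096, lambda project: -12.5) -> "negbal"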

Case Studies We examine three cases: normal operation on Kraken; normal operation on Kraken with reservations in support of nightly weather forecasts; and dedicated operation of Athena for a single group. In all cases, we look at how the different QOSes affect job queue times, since users tend to treat shorter queue times as evidence of better service. The Moab settings used to accomplish all of the work described here are given in the paper's appendix.

Normal Operation on Kraken Baseline. Time frame: 5 Oct 2009 to 31 Mar 2010. System utilization: 65.62% (not compensated for downtime).

Normal Operation on Kraken (cont'd.)

  QOS        Jobs      CPU Hours   Min Queue Time   Max Queue Time   Mean Queue Time
  default    220,197   241.7M      00:00:03         858:59:59        03:59:26
  negbal     13,137    39.5M       00:00:04         398:04:00        08:51:31
  sizezero   2,231     0.0M        00:00:05         262:51:39        04:11:42

Normal Operation on Kraken (cont'd.) negbal jobs generally waited longer than jobs with other QOSes, while sizezero jobs waited slightly longer than jobs in the default QOS. The latter is because the RUNNOW flag was added to the sizezero QOS some time after the start of the period in question, in response to long queue times for those jobs.

Nightly Weather Forecasting on Kraken The Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma received a TeraGrid allocation on Kraken for their 2009 Spring Experiment: WRF-based weather forecasts and associated postprocessing five nights a week, 10:30pm to 6:30am. Standing reservations held resources (~10k cores) available, and access to the reservations was controlled by the capsforecast and capspostproc QOSes, which also gave priority bumps. Time frame: 16 Apr 2009 to 12 Jun 2009. System utilization: 66.02% (not compensated for downtime).
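
As an illustration of the access policy (not the Moab standing-reservation configuration itself, which is in the paper's appendix), the sketch below checks whether a job's QOS grants use of the reserved cores at a given time; the choice of weeknights is an assumption beyond the stated "five nights a week, 10:30pm to 6:30am".

    from datetime import datetime, time

    FORECAST_QOSES = {"capsforecast", "capspostproc"}
    FORECAST_NIGHTS = {0, 1, 2, 3, 4}   # assumed: nights beginning Monday through Friday

    def in_forecast_window(when: datetime) -> bool:
        if when.time() >= time(22, 30):              # evening portion of the window
            return when.weekday() in FORECAST_NIGHTS
        if when.time() < time(6, 30):                # early-morning portion, night began the day before
            return (when.weekday() - 1) % 7 in FORECAST_NIGHTS
        return False

    def may_use_reserved_cores(qos: str, when: datetime) -> bool:
        """Only jobs in the caps* QOSes could use the ~10k reserved cores, and only
        while the nightly forecast window was open."""
        return qos in FORECAST_QOSES and in_forecast_window(when)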

Nightly Weather Forecasting on Kraken (cont'd.)

  QOS            Jobs     CPU Hours   Min Queue Time   Max Queue Time   Mean Queue Time
  default        35,257   55.6M       00:00:02         466:24:39        03:51:50
  negbal         2,692    3.0M        00:00:04         146:53:14        02:15:37
  sizezero       611      0.0M        00:00:03         132:26:36        01:04:35
  capsforecast   68       3.0M        00:00:05         42:01:04         01:48:31
  capspostproc   2,155    0.1M        00:00:03         26:42:12         00:02:49

Nightly Weather Forecasting on Kraken (cont'd.) CAPS jobs generally received excellent service. The largest queue-time component for forecast jobs was that they were generally submitted around 9 PM with a flag preventing them from becoming eligible to run before 10:30 PM. However, this had a significant impact on other users: longsmall jobs' mean queue time went from ~52 hours to ~96.5 hours, and capability jobs could not run longer than 16 hours during the week. The caps* QOSes also caused capability jobs to lose reservations, resulting in near-miss behavior and several user complaints. The capability queue was subsequently assigned a bigjob QOS with its own reservation pool.

User-Managed Scheduling on Athena The first group given dedicated access to Athena was a climate modeling project from the Center for Ocean-Land-Atmosphere Studies (COLA) at the Institute of Global Environment and Society (IGES). They ran two codes: IFS from ECMWF, and NICAM from the University of Tokyo (previously run only on the Earth Simulator). Two QOSes were added to allow COLA users to manage their own scheduling: bypass (which has the NTR, next-to-run, flag) and bottomfeeder (no backfill or reservations). Time frame: 1 Oct 2009 to 31 Mar 2010. System utilization: 90.51% (not compensated for downtime).

User-Managed Scheduling on Athena (cont'd.)

  QOS            Jobs     CPU Hours   Min Queue Time   Max Queue Time   Mean Queue Time
  default        11,851   37.4M       00:00:02         138:33:50        01:12:50
  negbal         57       0.5M        00:00:02         06:19:08         00:37:19
  sizezero       4,822    0.0M        00:00:01         148:22:35        01:31:55
  bypass         1,540    32.2M       00:00:02         43:32:33         01:01:22
  bottomfeeder   85       1.3M        00:00:03         137:09:04        22:27:44

User-Managed Scheduling on Athena (cont'd.) Jobs with the bypass QOS waited slightly less time than default and sizezero jobs, while jobs with the bottomfeeder QOS waited much, much longer than the others. Users initially complained that Moab scheduled bottomfeeder jobs too aggressively. It turned out that the complaining users had a simulation job that submitted a size=0 data transfer job, which in turn submitted another simulation job, resulting in many situations where only size=0 and bottomfeeder jobs were eligible to run. After the users restructured their workflow so that each simulation job submitted both the data transfer job and the subsequent simulation job, things behaved as expected.

Conclusions The use of QOS levels has become integral to scheduling on Kraken and Athena: pushing through data transfer jobs, deprioritizing jobs from projects with negative balances, applying scheduling policy modifications and exceptions for individual users, groups, and projects, and, in the extreme, allowing the users of a dedicated system to manage their own scheduling.