Cluster Computing: Resource and Job Management for HPC. SC-CAMP, 16/08/2010.



Summary
1. Introduction: Cluster Computing
2. About Resource and Job Management Systems
3. OAR, a highly configurable RJMS
4. OAR Concepts and Architecture
5. OAR Scheduling
6. OAR Interfaces
7. Research Issues and Conclusions

Introduction: Cluster Computing
Top500, the list of the largest computing infrastructures: http://www.top500.org/

Introduction: Cluster Computing
Definition: Cluster. A computer cluster for High Performance Computing is a group of linked computers working closely together, connected through fast local area networks. A cluster is also called a parallel supercomputer.
(Diagram: Client, Central Server, Computing Nodes)

Cluster Computing Concepts: Processes
Processes run on CPUs. A process is a program that is running in memory. On UNIX, several processes may run on one or several processors (multitasking). They are hierarchical and belong to a particular user. Each UNIX process has a unique number on a given system, called the PID.
A thread (also known as a lightweight process) is a program that shares a certain amount of memory with other threads belonging to the same parent process (under Linux, threads can be viewed with ps -eLf).

Cluster Computing Concepts: Jobs
Processes can be grouped into jobs. A job may be a single process, a group of processes, or even a batch. In our context (HPC clusters), a job is a set of processes that have been automatically launched by a scheduler via a user-submitted script. A job may result in N instances of the same program on N nodes or processors of a cluster.

Cluster Computing Concepts: Nodes
Jobs run on nodes. A node is a computer with a set of p CPUs, an amount of memory, one or several network interfaces, and possibly a storage unit (local disk).

Cluster Computing Concepts: Computing network
Several nodes are connected to a computing network, generally a low-latency network (Myrinet, InfiniBand, NUMAlink, ...).

Cluster Computing Concepts: Types of jobs
Jobs may be parallel or sequential. A parallel job runs on several nodes, using the computing network to communicate.
Sequential job: a single process that runs on a single processor of a single host.
Parallel job: several processes or threads that communicate via a specific library (MPI, OpenMP, threads, ...). Some are shared-memory parallel jobs (they run on a single multiprocessor host) and some are distributed-memory parallel jobs (they may run on several hosts that communicate via a low-latency, high-speed network).
Warning: a fake parallel job may hide several sequential jobs (several independent processes).
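As a minimal illustration of the shared-memory case, the sketch below uses Python's standard multiprocessing module to stand in for a real MPI or OpenMP job; all names here are illustrative:

```python
from multiprocessing import Pool

def worker(rank):
    # Each process plays the role of one rank of the parallel job.
    return rank * rank

def run_parallel_job(nprocs=4):
    # A shared-memory "parallel job": nprocs processes on one host.
    # A distributed-memory job would instead span hosts, e.g. via MPI.
    with Pool(processes=nprocs) as pool:
        return pool.map(worker, range(nprocs))

if __name__ == "__main__":
    print(run_parallel_job())  # [0, 1, 4, 9]
```

By contrast, the "fake parallel job" warned about above would be four completely independent worker processes that never exchange results.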

Cluster Computing Concepts: Types of jobs (continued)
Jobs come in batch or interactive form.
A batch job is a script. A shell script is a list of shell commands to be executed in a given order. Today's shells are sophisticated enough that scripts can be real programs with variables, control structures, loops, etc. We sometimes also call "script" a program written in an interpreted language (like Perl, PHP, or Ruby).
An interactive job is usually an allocation of one or more nodes where the user gets a shell on one of the cluster nodes.

Cluster Computing Concepts: Resource and Job Management System
The Resource and Job Management System (RJMS), or batch scheduler, is the software component of the cluster responsible for distributing computing power to user jobs within a parallel computing infrastructure.

About Resource and Job Management Systems: Concepts
The goal of a Resource and Job Management System is to satisfy users' demands for computation and to achieve good overall system utilization by efficiently assigning jobs to resources.

This assignment involves three principal abstraction layers: the declaration of a job, where the demand for resources and the job characteristics are stated; the scheduling of jobs upon the resources; and the launching and placement of job instances upon the computation resources, along with control of the job's execution. In this sense, the work of an RJMS can be decomposed into three main subsystems: Job Management, Scheduling, and Resource Management.

RJMS subsystems
Resource Management
  Principal concepts: resource treatment (hierarchy, partitions, ...); job launching, propagation, execution control; task placement (topology, binding, ...)
  Advanced features: high availability; energy efficiency; topology-aware placement
Job Management
  Principal concepts: job declaration (types, characteristics, ...); job control (signaling, reprioritizing, ...); monitoring (reporting, visualization, ...)
  Advanced features: authentication (limitations, security, ...); QoS (checkpoint, suspend, accounting, ...); interfacing (MPI libraries, debuggers, APIs, ...)
Scheduling
  Principal concepts: scheduling algorithms (built-in, external, ...); queue management (priorities, multiple queues, ...)
  Advanced features: advance reservation; application licenses
Tab.: General principles of a Resource and Job Management System

Common scheduling policies
FIFO: jobs are treated in the order they arrive.
Backfill: fill up empty holes in the scheduling tables without modifying the order or the execution of previously submitted jobs.
Gang Scheduling: multiple jobs may be allocated to the same resources and are alternately suspended/resumed, letting only one of them at a time have dedicated use of those resources, for a predefined duration.
TimeSharing: multiple jobs may be allocated to the same resources, allowing the sharing of computational resources; the sharing is managed by the scheduler of the operating system.
Fairshare: take into account the past executed jobs of each user and give priority to users that have used the cluster less.
Preemption: suspend one or more low-priority jobs to let a high-priority job run uninterrupted until it completes.
Tab.: Common scheduling policies for RJMS

Open-source and commercial RJMS (context: High Performance Computing)
There are many of them: Condor, Sun Grid Engine (SGE), MAUI/Torque, Slurm, OAR, Catalina, LSF, Lava, PBS Pro, Moab, LoadLeveler, CCS, ...
http://en.wikipedia.org/wiki/job_scheduler

RJMS general organization
A central host; client programs (command line) to interact with users; a lot of configuration parameters.
(Diagram: Users -> RJMS. Job Management: submission; job declaration, control, monitoring. Scheduling: job priorities, resource matching. Resource Management: job propagation, binding, execution control. Log and accounting. Client / Server / Computing Nodes.)

RJMS features (1/2), non-exhaustive list:
- Interactive (shell) / batch submission
- Sequential tasks
- Walltime limit (very important for scheduling!)
- Exclusive / non-exclusive access to resources
- Resource matching
- Epilogue/prologue scripts (run before and after tasks)
- Monitoring of the tasks (resource consumption)
- Job dependencies (workflows)
- Logging and accounting
- Suspend/resume

RJMS features (2/2), non-exhaustive list:
- Array jobs
- First-Fit (conservative backfilling)
- Fairsharing
- ...

OAR, a highly configurable RJMS: Objectives
OAR has been designed with versatility and customization in mind, following technological evolutions (machines and infrastructures becoming more and more complicated).
Initial context: CIMENT... and Grid'5000...
Adaptation to different contexts: cluster, cluster-on-demand, virtual cluster, multi-cluster, lightweight grid, experimental platforms such as Grid'5000, big clusters, special needs.

CIMENT, a local HPC center
Université Joseph Fourier (Grenoble). A dozen heterogeneous supercomputers, more than 2000 cores today. Special feature: a lightweight grid (CiGri) exploiting unused CPU cycles through best-effort jobs (an OAR feature). Great collaborative production/research work.

Grid'5000
A grid for computing experiments: 9 sites in France plus 2 sites abroad, gigabit interconnection via RENATER, more than 5000 cores today.

OAR special features (non-exhaustive list)
Classical features, plus:
- Advance reservation
- Hierarchical expressions in requests
- Different types of resources (e.g. licenses, storage capacity, network capacity, ...)
- Best-effort tasks (zero-priority tasks, heavily used by CiGri)
- Multiple task types (besteffort, deploy, timesharing, idempotent, power, cosystem, ...), customizable
- Moldable tasks
- Energy saving

OAR design concepts
High-level components:
- A relational database (MySQL/PostgreSQL) as the kernel, used to store and exchange data: resource and task data, internal state of the system.
- Scripting languages (Perl, Ruby) for the execution engine and modules: well suited for the system parts of the code, high-level structures (lists, hash tables, sorting, ...), short development cycles.
Other components: SSH; cpusets (isolation, cleaning); Taktuk, an adaptive parallel launcher.

OAR general organisation
(Diagram: Users -> OAR RJMS. Job Management: submission; job declaration, control, monitoring. Scheduling: job priorities, resource matching. Resource Management: job propagation, binding, execution control. Log and accounting.
Client commands: oarsub, oarstat, oarnodes, oarcancel, oarnodesetting, oaraccounting, oarproperty, oarmonitor, oarnotify, oaradmin.
Server: Almighty, taktuk/oarsh, database. Computing nodes: taktuk/oarsh, Ping_checker, oarexec, ...)

OAR Concepts and architecture: Job life cycle (figure)

OAR Concepts and architecture: Task status diagram
Execution steps: Scheduling, Waiting, toLaunch, Launching, Running, Terminated; error paths via toError and toAckReservation to Error; Hold; advance reservation negotiation.

Submission examples (OAR)
Interactive task:
  oarsub -l nodes=4 -I
Batch submission (with walltime and choice of the queue):
  oarsub -q default -l walltime=2:00,nodes=10 /home/toto/script
Advance reservation:
  oarsub -r "2008-04-27 11:00" -l nodes=12
Connection to a reservation (using the task's id):
  oarsub -C 154
Note: each submission command returns an id number.
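Such command lines can also be assembled programmatically, e.g. from a portal or a test harness. The sketch below is a hypothetical Python helper covering only the oarsub flags shown on this slide; it builds the argument vector without executing anything:

```python
def build_oarsub(resources, walltime=None, queue=None,
                 interactive=False, script=None):
    # Hypothetical helper: assembles an oarsub argument vector
    # mirroring the submission examples above.
    args = ["oarsub"]
    if queue:
        args += ["-q", queue]
    spec = resources if walltime is None else "walltime=%s,%s" % (walltime, resources)
    args += ["-l", spec]
    if interactive:
        args.append("-I")
    if script:
        args.append(script)
    return args

# Interactive task on 4 nodes:
print(" ".join(build_oarsub("nodes=4", interactive=True)))
# oarsub -l nodes=4 -I
```

The returned list can be passed to subprocess.run on a machine where OAR is installed.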

OAR Concepts and architecture: Hierarchical resources (figure)

OAR Scheduling
Scheduling is the process responsible for assigning jobs, according to users' needs and predefined rules and policies, to available computational resources that match the demands. A typical scheduler works with queues, which are elements defined to organize jobs with similar characteristics (for example, priorities). Numerous parameters are used to guide and scope the allocations and priorities.
Note: scheduling is recomputed at each major change in the state of a task.

Scheduling organization in OAR
Tasks are put into queues; each queue has a priority level and obeys a scheduling policy.
(Diagram: admission, a meta-scheduler, and per-queue schedulers with priorities 10, 7, 2, 1 and a best-effort queue.)
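The queue organization above can be sketched as follows. This is a simplified Python model with illustrative queue names and priorities; a real meta-scheduler also interleaves queue processing with resource availability:

```python
from collections import defaultdict

def meta_schedule(jobs):
    # Group jobs into queues by priority and serve queues strictly in
    # priority order (higher first), FIFO within each queue.
    queues = defaultdict(list)
    for job in jobs:
        queues[job["priority"]].append(job["name"])
    order = []
    for prio in sorted(queues, reverse=True):
        order.extend(queues[prio])
    return order

jobs = [
    {"name": "besteffort1", "priority": 0},   # zero-priority best-effort
    {"name": "normal1", "priority": 2},
    {"name": "admin1", "priority": 10},
    {"name": "normal2", "priority": 2},
]
print(meta_schedule(jobs))
# ['admin1', 'normal1', 'normal2', 'besteffort1']
```

Note how the best-effort job is considered last, matching its role as a zero-priority filler.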

Resource matching
A preliminary step to scheduling: resource filtering and resource sorting. It allows the user to express special needs: memory, architecture, special hosts, OS, workload, ...

OAR Scheduling: FIFO (First-In First-Out) (figure)
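A minimal FIFO scheduler over a single pool of cores can be sketched like this (illustrative Python; a real scheduler tracks per-node resources and many more constraints):

```python
def fifo_schedule(jobs, total_cores):
    # Assign each job the earliest start time at which enough cores are
    # free, strictly in arrival order. jobs: list of (name, cores, duration).
    running = []   # (end_time, cores) of jobs currently occupying cores
    schedule = {}
    now = 0
    free = total_cores
    for name, cores, duration in jobs:
        # Advance time, releasing finished jobs, until the job fits.
        while free < cores:
            running.sort()
            end, c = running.pop(0)
            now = max(now, end)
            free += c
        schedule[name] = now
        running.append((now + duration, cores))
        free -= cores
    return schedule

print(fifo_schedule([("A", 4, 10), ("B", 2, 5), ("C", 2, 5)], total_cores=4))
# {'A': 0, 'B': 10, 'C': 10}
```

Note the drawback the next slides address: a wide job at the head of the queue blocks everything behind it, even when idle cores remain.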

OAR Scheduling: First-Fit (backfilling)
Holes are filled as long as the order of previously scheduled tasks is not changed. (figure)
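A sketch of conservative backfilling (illustrative Python, assuming exact walltimes): each job is placed at the earliest slot that does not move any previously placed job, so small jobs can slip into holes left by big ones:

```python
def backfill_schedule(jobs, total_cores):
    # jobs: list of (name, cores, walltime). Each placed job keeps its
    # reservation, so later jobs never delay earlier ones.
    placed = []   # (start, end, cores)
    schedule = {}

    def cores_used(t):
        return sum(c for s, e, c in placed if s <= t < e)

    for name, cores, walltime in jobs:
        # Candidate start times: now (0) and every moment a job ends.
        candidates = sorted({0} | {e for _, e, _ in placed})
        for t in candidates:
            # Usage only rises at job starts, so checking t and every
            # start inside [t, t + walltime) is sufficient.
            events = [t] + [s for s, e, c in placed if t < s < t + walltime]
            if all(cores_used(x) + cores <= total_cores for x in events):
                placed.append((t, t + walltime, cores))
                schedule[name] = t
                break
    return schedule

print(backfill_schedule([("A", 3, 10), ("B", 4, 5), ("C", 1, 10)], total_cores=4))
# {'A': 0, 'B': 10, 'C': 0}
```

Under plain FIFO, job C would wait until B finished; here it backfills the one idle core next to A without delaying B's reservation.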

OAR Scheduling: Fairsharing
The order of tasks is computed depending on what has already been consumed (users who have consumed less time are prioritized). A time window and weighting parameters are defined. (figure)
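A toy version of this policy (illustrative Python; real fairshare implementations use a decaying time window and combine several weighted factors):

```python
def fairshare_order(waiting, past_usage, weight=1.0):
    # Order waiting jobs so that users with less accumulated CPU time
    # in the accounting window come first. Stable sort keeps FIFO order
    # among jobs of users with equal usage.
    return sorted(waiting, key=lambda job: weight * past_usage.get(job["user"], 0))

past = {"alice": 5000, "bob": 200}   # core-seconds consumed in the window
waiting = [{"name": "j1", "user": "alice"},
           {"name": "j2", "user": "bob"},
           {"name": "j3", "user": "carol"}]  # carol has no recorded usage
print([j["name"] for j in fairshare_order(waiting, past)])
# ['j3', 'j2', 'j1']
```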

OAR Scheduling: TimeSharing (figure)

OAR Scheduling: Advance reservation
Very useful for demos, planning, multi-site or grid tasks... But: restrictive for the scheduler, and resources are rarely used over the whole duration of the reservation (waste!).

OAR Interfaces: Monika (figure)

OAR Interfaces: Gantt diagram (figure)

Research issues
- Topological constraints to improve application performance
- Energy consumption mechanisms in HPC clusters

Topological constraints
Hardware evolution: switch/node/cpu/core hierarchical architectures; NUMA hosts; BlueGene hosts (2D or 3D grids, hybrids, ...).

Hierarchical topological constraints
Problem for parallel applications that are sensitive to network bandwidth.

Hierarchical topological constraints
Hierarchy in the requests:
  oarsub -l switch=1/nodes=2/cpu=2/core=2 ./my-app
which requests 1 x 2 x 2 x 2 = 8 cores.
Linux cpuset management (be aware of the CPU affinity inside a cpuset).
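The core-count arithmetic for such hierarchical requests is just the product of the counts at each level. A tiny illustrative parser (Python; the helper name is hypothetical):

```python
from math import prod

def cores_requested(hierarchy):
    # Total cores implied by a request like "switch=1/nodes=2/cpu=2/core=2":
    # the product of the counts at each hierarchy level.
    counts = [int(level.split("=")[1]) for level in hierarchy.split("/")]
    return prod(counts)

print(cores_requested("switch=1/nodes=2/cpu=2/core=2"))  # 8
```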

Energy saving
Nodes can be shut down when not used (on-demand wake-up), with timetable priorities. Work done during Google Summer of Code 2008 (GSoC'08). Parametric job type: powersaving plus options (cpufreq, disk shutdown, video, ..., special policies). E.g. a best-effort job runs at the lowest CPU frequency.

Conclusion
Management of resources and jobs for high performance computing on modern architectures has become a complex procedure, with interesting research issues. Open-source and commercial RJMS have been evolving to provide efficient exploitation of the infrastructures along with quality of service to the users. OAR is a versatile and customizable distributed RJMS.

Questions? http://oar.imag.fr/

Links
Condor: http://www.cs.wisc.edu/condor/
Sun Grid Engine (SGE): http://gridengine.sunsource.net
TORQUE/MAUI: http://www.clusterresources.com/
SLURM: http://www.llnl.gov/linux/slurm/
LSF: http://www.platform.com
OAR: http://oar.imag.fr

