Grid Compute Resources and Job Management


How do we access the grid?
- Command line, with the tools that you'll use here
- Specialised applications. Example: a program to process images that sends data to run on the grid as a built-in feature
- Web portals, e.g. I2U2 and SIDGrid

Grid middleware glues the grid together. A short, intuitive definition: the software that glues together different clusters into a grid, taking into consideration the sociopolitical side of things (such as common policies on who can use what, how much, and what for).

Grid middleware offers services that couple users with remote resources through resource brokers:
- Remote process management
- Co-allocation of resources
- Storage access
- Information
- Security
- QoS

Globus Toolkit: the de facto standard for grid middleware.
- Developed at ANL & UChicago (the Globus Alliance)
- Open source
- Adopted by different scientific communities and industries
- Conceived as an open set of architectures, services and software libraries that support grids and grid applications
- Provides services in major areas of distributed systems: core services, data management, security

Globus core services are the basic infrastructure needed to create grid services:
- Authorization
- Message-level security
- System-level services (e.g., monitoring)
The associated data management services provide file services:
- GridFTP
- RFT (Reliable File Transfer)
- RLS (Replica Location Service)
The current release, GT4, promotes open high-performance computing (HPC).
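
As a quick illustration (the hostname and paths here are hypothetical), GridFTP transfers are typically driven with the globus-url-copy client, moving a file from a gsiftp:// URL to a local file:// URL:

  $ globus-url-copy gsiftp://gridftp.example.org/data/input.dat \
        file:///home/user/input.dat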

Local Resource Managers (LRMs). Compute resources have a local resource manager that controls:
- Who is allowed to run jobs
- How jobs run on a specific resource
Example policy: each cluster node can run one job; if there are more jobs, they must wait in a queue. LRMs also allow nodes in a cluster to be reserved for a specific person. Examples: PBS, LSF, Condor. A minimal PBS example follows below.
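
For instance, a minimal PBS batch script might look like this (a sketch; the resource requests and script name are illustrative), submitted to the local queue with qsub:

  #!/bin/sh
  #PBS -l nodes=1            # request one node from the LRM
  #PBS -l walltime=00:10:00  # ten-minute limit, enforced by local policy
  /bin/hostname              # the actual work

  $ qsub myjob.pbs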

GRAM: Globus Resource Allocation Manager. GRAM provides a standardised interface for submitting jobs to LRMs:
- Clients submit a job request to GRAM
- GRAM translates it into something any LRM can understand
- The same job request can be used for many different kinds of LRM

[Diagram: Job management on a grid. A user submits through GRAM to sites across the Grid, each with its own LRM: Condor at Site A, PBS at Site B, LSF at Site C, and plain fork at Site D.]

There are two versions of GRAM:
- GT2: the older version; uses its own protocols; more widely used; no longer actively developed
- GT4: the newer, web-services version; new features go into GRAM4
In this module, we will be using GT2.

GRAM's abilities. Given a job specification, GRAM:
- Creates an environment for the job
- Stages files to and from the environment
- Submits the job to a local resource manager
- Monitors the job
- Sends notifications of job state changes
- Streams the job's stdout/stderr during execution

GRAM components:
- Clients, e.g. globus-job-submit, globusrun
- Gatekeeper: the server that accepts job submissions and handles security
- Jobmanager: knows how to send a job into the local resource manager; there are different job managers for different LRMs

[Diagram: GRAM components. A submitting machine (e.g. the user's workstation) runs globus-job-run and contacts the gatekeeper over the Internet; the gatekeeper starts a jobmanager, which hands the job to the LRM (e.g. Condor, PBS, LSF), which in turn runs it on the worker nodes/CPUs.]

[Diagram: Remote resource access with Globus. At Organization A, a user runs "globusrun myjob"; the request travels via the Globus GRAM protocol to a Globus jobmanager at Organization B, which forks the job.]

Submitting a job with GRAM: the globus-job-run command.

  $ globus-job-run rookery.uchicago.edu /bin/hostname

This runs /bin/hostname on the resource rookery.uchicago.edu. We don't care what LRM is used on rookery; this command works with any LRM.
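
If you do want to steer the job to a particular LRM, a GRAM2 resource contact can name a jobmanager explicitly (a sketch; the jobmanager name depends on what the site has deployed):

  $ globus-job-run rookery.uchicago.edu/jobmanager-pbs /bin/hostname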

The client can describe the job with GRAM's Resource Specification Language (RSL). Example:

  &(executable = a.out)
   (directory = /home/nobody)
   (arguments = arg1 "arg 2")

Submit with:

  globusrun -f spec.rsl -r rookery.uchicago.edu
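
A slightly fuller RSL description might also redirect the standard streams (a sketch; the file names are illustrative):

  &(executable = a.out)
   (directory = /home/nobody)
   (arguments = arg1 "arg 2")
   (stdout = a.out.log)
   (stderr = a.err.log)
   (count = 1)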

Use other programs to generate RSL. RSL job descriptions can become very complicated, so we can use other programs to generate RSL for us. Example: Condor-G, covered in the next section.

Condor is a software system that creates an HTC environment. Created at UW-Madison, Condor is a specialized workload management system for compute-intensive jobs. It:
- Detects machine availability
- Harnesses available resources
- Uses remote system calls to send R/W operations over the network
- Provides powerful resource management by matching resource owners with consumers (acting as a broker)

How Condor works. Condor provides:
- a job queueing mechanism
- a scheduling policy
- a priority scheme
- resource monitoring, and
- resource management.
Users submit their serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

Condor features:
- Checkpointing & migration
- Remote system calls
- Able to transfer data files and executables across machines
- Job ordering
- Job requirements and preferences can be specified via powerful expressions

Condor lets you manage a large number of jobs:
- Specify the jobs in a file and submit them to Condor; Condor runs them and keeps you notified of their progress
- Mechanisms to help you manage huge numbers of jobs (1000's), all the data, etc. (DAGMan)
- Handles inter-job dependencies
- Users can set Condor's job priorities; Condor administrators can set user priorities
Condor can do this as a local resource manager (LRM) on a compute resource, or as a grid client submitting to GRAM (as Condor-G).

Condor-G is the job management part of Condor. Hint: install Condor-G to submit to resources accessible through a Globus interface. Condor-G does not create a grid service; it only deals with using remote grid services.
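
As a sketch of what this looks like in practice (the hostname and jobmanager are illustrative; "gt2" selects the pre-web-service GRAM protocol), a Condor-G submit description uses the grid universe:

  # condor-g.submit -- sketch of a Condor-G submit file
  Universe      = grid
  Grid_Resource = gt2 rookery.uchicago.edu/jobmanager-pbs
  Executable    = analysis
  Log           = my_job.log
  Queue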

Condor-G does whatever it takes to run your jobs, even if:
- The gatekeeper is temporarily unavailable
- The job manager crashes
- Your local machine crashes
- The network goes down

[Diagram: Remote resource access with Condor-G + Globus + Condor. At Organization A, Condor-G queues myjob1 through myjob5 and speaks the Globus GRAM protocol to the GRAM service at Organization B, which submits the jobs to the local LRM.]

Condor-G: access non-Condor grid resources.
Globus:
- middleware deployed across the entire grid
- remote access to computational resources
- dependable, robust data transfer
Condor:
- job scheduling across multiple resources
- strong fault tolerance with checkpointing and migration
- layered over Globus as a personal batch system for the grid

Four steps to run a job with Condor:
1. Choose a universe for your job
2. Make your job batch-ready
3. Create a submit description file
4. Run condor_submit
These choices tell Condor how, when, and where to run the job, and describe exactly what you want to run.

1. Choose a universe. There are many choices:
- Vanilla: any old job
- Standard: checkpointing & remote I/O
- Java: better for Java jobs
- MPI: run parallel MPI jobs
- Virtual Machine: run a virtual machine as a job
For now, we'll just consider vanilla.

2. Make your job batch-ready. It must be able to run in the background: no interactive input, windows, GUI, etc. Condor is designed to run jobs as a batch system, with pre-defined inputs. Jobs can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices. Also organize your data files.

3. Create a submit description file: a plain ASCII text file (Condor does not care about file extensions) that tells Condor about your job:
- Which executable to run and where to find it
- Which universe
- Location of input, output and error files
- Command-line arguments, if any
- Environment variables
- Any special requirements or preferences

Simple submit description file:

  # myjob.submit file
  # Simple condor_submit input file
  # (Lines beginning with # are comments)
  # NOTE: the words on the left side are not
  # case-sensitive, but filenames are!
  Universe   = vanilla
  Executable = analysis
  Log        = my_job.log
  Queue

4. Run condor_submit. You give condor_submit the name of the submit file you have created:

  condor_submit my_job.submit

condor_submit then parses the submit file.

Another submit description file:

  # Example condor_submit input file
  # (Lines beginning with # are comments)
  # NOTE: the words on the left side are not
  # case-sensitive, but filenames are!
  Universe   = vanilla
  Executable = /home/wright/condor/my_job.condor
  Input      = my_job.stdin
  Output     = my_job.stdout
  Error      = my_job.stderr
  Arguments  = -arg1 -arg2
  InitialDir = /home/wright/condor/run_1
  Queue
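
To queue many similar jobs at once, a submit file can use the $(Process) macro, which expands to 0, 1, 2, ... for each queued job. A sketch building on the example above (the directory layout is illustrative, and each run_N directory must already exist):

  # Queue 10 jobs, each in its own run directory
  Universe   = vanilla
  Executable = /home/wright/condor/my_job.condor
  Input      = my_job.stdin
  Output     = my_job.stdout
  Error      = my_job.stderr
  InitialDir = /home/wright/condor/run_$(Process)
  Queue 10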

Details: lots of options are available in the submit file, plus commands to watch the queue, the state of your pool, and lots more. You'll see much of this in the hands-on exercises.

Other Condor commands:
- condor_q        show status of the job queue
- condor_status   show status of compute nodes
- condor_rm       remove a job
- condor_hold     hold a job temporarily
- condor_release  release a job from hold
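
A typical session might look like this (a sketch; the cluster ID 42 is illustrative):

  $ condor_submit my_job.submit   # prints the new cluster ID, e.g. 42
  $ condor_q                      # watch the job move from idle to running
  $ condor_hold 42                # pause it
  $ condor_release 42             # let it run again
  $ condor_rm 42                  # or give up and remove it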

Submitting more complex jobs: we want to express dependencies between jobs, i.e. workflows, and we would also like the workflow to be managed even in the face of failures.

Want other scheduling possibilities? Use the Scheduler universe. In addition to vanilla, another job universe is the Scheduler universe. Scheduler universe jobs run on the submitting machine and serve as a meta-scheduler. Condor's Scheduler universe lets you set up and manage job workflows: the DAGMan meta-scheduler is included, and it manages these jobs.

DAGMan: Directed Acyclic Graph Manager. DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., don't run job B until job A has completed successfully).

What is a DAG? A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a node in the DAG, and each node can have any number of parent or child nodes, as long as there are no loops!
[Diagram: a diamond-shaped DAG with Job A at the top, Jobs B and C as its children, and Job D as the child of both B and C.]

Defining a DAG. A DAG is defined by a .dag file, listing each of its nodes and their dependencies; each node will run the Condor job specified by its accompanying Condor submit file:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

This describes the diamond-shaped DAG above: A runs first, then B and C, then D.

Submitting a DAG. To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon which begins running your jobs:

  % condor_submit_dag diamond.dag

condor_submit_dag submits a Scheduler universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it.
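
While the DAG runs, you can watch it like any other Condor job (a sketch; DAGMan writes its progress to a .dagman.out file next to the .dag file):

  % condor_submit_dag diamond.dag
  % condor_q                          # DAGMan appears as a job in the queue
  % tail -f diamond.dag.dagman.out    # follow DAGMan's own log of the workflow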

Running a DAG. DAGMan acts as a meta-scheduler, managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and holds the DAG (A, B, C, D); node A has been submitted to the Condor-G job queue.]

Running a DAG (cont'd). DAGMan holds & submits jobs to the Condor-G queue at the appropriate times.
[Diagram: A has completed; B and C are now in the Condor-G job queue, while D waits in the DAG.]

Running a DAG (cont'd). In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a rescue file with the current state of the DAG.
[Diagram: B has failed (marked X); DAGMan writes a rescue file recording the current state of the DAG.]

Recovering a DAG (fault tolerance). Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan restarts from the rescue file; B is re-submitted to the Condor-G job queue while C's completed status is remembered.]
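
DAGMan can also re-run a failed node automatically before declaring it failed and writing a rescue file; a sketch using the Retry keyword in the .dag file:

  # diamond.dag with automatic retries
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
  Retry B 3    # re-run node B up to 3 times before declaring it failed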

Recovering a DAG (cont'd). Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: B has completed; D is submitted to the Condor-G job queue.]

Finishing a DAG. Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all nodes complete; the Condor-G job queue is empty.]

We have seen how Condor:
- monitors submitted jobs and reports progress
- implements your policy on the execution order of the jobs
- keeps a log of your job activities

Long jobs: if my jobs run for weeks, what happens to them when a machine is shut down, there is a network outage, or another job with higher priority preempts them? Do I lose all of those hours or days of computation time? What happens when they get pre-empted? How can I add fault tolerance to my jobs?

Condor's Standard universe to the rescue! Condor can support various combinations of features/environments in different universes, and different universes provide different functionality to your job:
- Vanilla: run any serial job
- Scheduler: plug in a scheduler
- Standard: support for transparent process checkpointing and restart
The Standard universe provides two important services to your job: process checkpointing and remote system calls.

Process checkpointing. Condor's process checkpointing mechanism saves the entire state of a process into a checkpoint file: memory, CPU, I/O, etc. The process can then be restarted from the point it left off. Typically no changes to your job's source code are needed; however, your job must be relinked with Condor's Standard universe support library.
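
Relinking is done with the condor_compile wrapper around your usual compile/link command, after which the job is submitted under the Standard universe (a sketch; the source and program names are illustrative):

  $ condor_compile gcc -o my_analysis my_analysis.c   # relink against Condor's library

  # In the submit file:
  Universe   = standard
  Executable = my_analysis
  Queue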

OSG & job submissions. OSG sites present interfaces allowing remotely submitted jobs to be accepted, queued and executed locally. OSG supports the Condor-G job submission client, which interfaces to either the pre-web-service or web-services GRAM Globus interface at the executing site. Job managers at the backend of the GRAM gatekeeper support job execution by local Condor, LSF, PBS, or SGE batch systems.

Now go to the Lab part.

Acknowledgments: this presentation is based on "Grid Resources and Job Management" by Jaime Frey and Becky Gietzel, Condor Project, U. Wisconsin-Madison.