Advanced Job Submission on the Grid

Similar documents
Architecture of the WMS

How to use computing resources at Grid

Gergely Sipos MTA SZTAKI

glite Middleware Usage

Problemi di schedulazione distribuita su Grid

International Collaboration to Extend and Advance Grid Education. glite WMS Workload Management System

DataGrid. Document identifier: Date: 24/11/2003. Work package: Partner: Document status. Deliverable identifier:

Parallel Computing in EGI

glite Grid Services Overview

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

GRID COMPANION GUIDE

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

LCG-2 and glite Architecture and components

The glite middleware. Ariel Garcia KIT

DataGrid. Document identifier: Date: 16/06/2003. Work package: Partner: Document status. Deliverable identifier:

DataGrid D EFINITION OF ARCHITECTURE, TECHNICAL PLAN AND EVALUATION CRITERIA FOR SCHEDULING, RESOURCE MANAGEMENT, SECURITY AND JOB DESCRIPTION

glite/egee in Practice

Grid Computing. Olivier Dadoun LAL, Orsay. Introduction & Parachute method. Socle 2006 Clermont-Ferrand Orsay)

J O B D E S C R I P T I O N L A N G U A G E A T T R I B U T E S S P E C I F I C A T I O N

DataGrid. Document identifier: Date: 28/10/2003. Work package: Partner: Document status. Deliverable identifier:

Parallel Job Support in the Spanish NGI! Enol Fernández del Cas/llo Ins/tuto de Física de Cantabria (IFCA) Spain

FREE SCIENTIFIC COMPUTING

WMS overview and Proposal for Job Status

glite Advanced Job Management

AGATA Analysis on the GRID

WMS Application Program Interface: How to integrate them in your code

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

Grid Computing. Olivier Dadoun LAL, Orsay Introduction & Parachute method. APC-Grid February 2007

MPI SUPPORT ON THE GRID. Kiril Dichev, Sven Stork, Rainer Keller. Enol Fernández

On the employment of LCG GRID middleware

E UFORIA G RID I NFRASTRUCTURE S TATUS R EPORT

CREAM-WMS Integration

Programming the Grid with glite

A Practical Approach for a Workflow Management System

EUROPEAN MIDDLEWARE INITIATIVE

Overview of WMS/LB API

DIRAC Documentation. Release integration. DIRAC Project. 09:29 20/05/2016 UTC

Grid Documentation Documentation

Monitoring the Usage of the ZEUS Analysis Grid

Grid Infrastructure For Collaborative High Performance Scientific Computing

Future of Grid parallel exploitation

ISTITUTO NAZIONALE DI FISICA NUCLEARE Sezione di Padova

Scalable Computing: Practice and Experience Volume 10, Number 4, pp

DataGrid TECHNICAL PLAN AND EVALUATION CRITERIA FOR THE RESOURCE CO- ALLOCATION FRAMEWORK AND MECHANISMS FOR PARALLEL JOB PARTITIONING

DataGrid EDG-BROKERINFO USER GUIDE. Document identifier: Date: 06/08/2003. Work package: Document status: Deliverable identifier:

Introduction to Grid Infrastructures

Troubleshooting Grid authentication from the client side

EGEE. Grid Middleware. Date: June 20, 2006

Multi-thread and Mpi usage in GRID Roberto Alfieri - Parma University & INFN, Gr.Coll. di Parma

Programming the Grid with glite *

Interconnect EGEE and CNGRID e-infrastructures

Client tools know everything

CERN LCG. LCG Short Demo. Markus Schulz. FZK 30 September 2003

Grid Compute Resources and Job Management

Bookkeeping and submission tools prototype. L. Tomassetti on behalf of distributed computing group

Job submission and management through web services: the experience with the CREAM service

Introduction Data Management Jan Just Keijser Nikhef Grid Tutorial, November 2008

MyProxy Server Installation

Monitoring System for the GRID Monte Carlo Mass Production in the H1 Experiment at DESY

Deploying virtualisation in a production grid

GRID COMPUTING APPLIED TO OFF-LINE AGATA DATA PROCESSING. 2nd EGAN School, December 2012, GSI Darmstadt, Germany

Grid Compute Resources and Grid Job Management

Architecture Proposal

Dr. Giuliano Taffoni INAF - OATS

Tutorial for CMS Users: Data Analysis on the Grid with CRAB

A Login Shell interface for INFN-GRID

Introduction to Programming and Computing for Scientists

DIRAC pilot framework and the DIRAC Workload Management System

Tutorial 4: Condor. John Watt, National e-science Centre

Access the power of Grid with Eclipse

Ivane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU

Troubleshooting Grid authentication from the client side

Implementing GRID interoperability

HTCondor Essentials. Index

XRAY Grid TO BE OR NOT TO BE?

A unified user experience for MPI jobs in EMI

PoS(ACAT)020. Status and evolution of CRAB. Fabio Farina University and INFN Milano-Bicocca S. Lacaprara INFN Legnaro

ALICE Grid/Analysis Tutorial Exercise-Solutions

NorduGrid Tutorial. Client Installation and Job Examples

How to use the Grid for my e-science

Globus Toolkit 4 Execution Management. Alexandra Jimborean International School of Informatics Hagenberg, 2009

QosCosGrid Middleware

Heterogeneous Grid Computing: Issues and Early Benchmarks

EGEE and Interoperation

Grid Experiment and Job Management

HEP replica management

The glite middleware. Presented by John White EGEE-II JRA1 Dep. Manager On behalf of JRA1 Enabling Grids for E-sciencE

Monitoring tools in EGEE

Introduction. creating job-definition files into structured directories etc.

Improving Grid User's Privacy with glite Pseudonymity Service

Parallel computing on the Grid

Compiling applications for the Cray XC

Setup Desktop Grids and Bridges. Tutorial. Robert Lovas, MTA SZTAKI

Workload Management. Stefano Lacaprara. CMS Physics Week, FNAL, 12/16 April Department of Physics INFN and University of Padova

ISTITUTO NAZIONALE DI FISICA NUCLEARE

Pan-European Grid einfrastructure for LHC Experiments at CERN - SCL's Activities in EGEE

CRAB tutorial 08/04/2009

AMGA metadata catalogue system

Failover procedure for Grid core services

The PanDA System in the ATLAS Experiment

Transcription:

Advanced Job Submission on the Grid Antun Balaz Scientific Computing Laboratory Institute of Physics Belgrade http://www.scl.rs/ 30 Nov 11 Dec 2009 www.eu-egee.org

Scope User Interface Submit job Workload Management System Information System Retrieve status & output create credential query query File and Replica Catalog Job status Retrieve output Logging Submit job publish state Site X Authorization Service (VOMS) Job status Computing process Storage Logging and bookkeeping

Contents Again a little bit on WMS How are your jobs handled JDL attributes Which types of jobs exist Examples of complex JDLs

Basics Why does the Workload Management System exist? Grids have Many users Many jobs a job = an executable you want to run Where many compute nodes are available Workload Management System is a software service that makes running jobs easier for the user It builds on the basic grid services E.g. Authorisation, Authentication, Security, Information Systems, Job submission Terminology: Compute element : defined as a batch queue - One cluster can have many queues

What WMS does? Compute s WMS manages jobs on users behalf User doesn t decide where jobs are run User defines the job and its requiremements, WMS matches this with available CEs Effect: Easier submission Users insulated from change in Compute elements WMS can optimise your jobs e.g. which CE?

Managing jobs with glite command line tools Local Workstation UI (user interface) has preinstalled client software UI ssh User describes job in text file using Job Description Language Submits job to WMS using (usually) the command-line interface Workload Management System WMS CEs

Using WMS Jobs run in batch mode on grids. Steps in running a job on a glite grid with WMS: 1. Create a text file in Job Description Language 2. Optional check: list the compute elements that match your requirements ( list match command) 3. Submit the job ~ glite-wms-job-submit a myfile.jdl Non-blocking - Each job is given an id. 4. Occasionally check the status of your job 5. When Done retrieve output

Simple example of a JDL file Type = Job ; Executable = /bin/hostname ; Arguments = ; StdError = stderr.txt ; StdOutput = stdout.txt ; InputSandbox = ; OutputSandbox = { stderr.txt, stdout.txt }; $ glite-wms-job-submit a my.jdl Returns a job-id used to monitor job, retrieve output...

Standard JDL-file attributes Enabling Grids for E-sciencE Type Job for sequential jobs; later more details Executable sets the name of the executable file; Arguments command line arguments of the program; StdOutput, StdError - files for storing the standard output and error messages output; InputSandbox set of input files needed by the program, including the executable; OutputSandbox set of output files which will be written during the execution, including standard output and standard error output; these are sent from the CE to the WMS for you to retrieve ShallowRetryCount in case of grid error, retry job this many times ( Shallow : before job is running)

Scenario Submit stderr.txt stdout.txt Input sandbox Get output Job Submit Event Status / log query Output sandbox stderr.txt stdout.txt publish state Logging and bookkeeping Job status update STD input stream is read from file STD out and err. streams are redirected into files stderr.txt stdout.txt /bin/hostname A worker node is allocated by the local jobmanager

The Executable Script: No compilation is necessary Can invoke binary that is statically installed on the CE Binary: Must be compiled on the User Interface binary compatibility with CEs is guaranteed Statically linked to avoid errors caused by library versions Coming from client side Part of InputSandbox Installed on the CE Standard software in Linux VO specific software: advertised in information system Use JDL to navigate job to such a site

Recommended Job control commands WMS version Delegate proxy Submit Status Logging Output Cancel Compatible resources LCG-2 WMS edg-job-submit [-o joblist]jdlfile edg-job-status [-v verbosity] [-i joblist] jobids edg-job-get-logginginfo [-v verbosity] [-i joblist] jobids edg-job-get-output [-dir outdir] [-i joblist] jobids edg-job-cancel [-i joblist] jobid edg-job-list-match glite WMS via NS glite 3.0 D E P glite-job-submit [-o joblist] jdlfile R E glite-job-status [-v verbosity] [-i joblist] jobids C glite-job-logging-info [-v verbosity] [-i joblist] jobids A T glite-job-output [-dir outdir] [-i joblist] jobids E D glite-job-cancel [-i joblist] jobid glite WMS via WMProxy glite 3.1+ glite-wms-jobdelegate-proxy -d delegid glite-wms-job-submit [-d delegid] [-a] [-o joblist] jdlfile glite-wms-job-status [-v verbosity] [-i joblist] jobids glite-wms-job-logginginfo [-v verbosity] [-i joblist] jobids glite-wms-job-output [-dir outdir] [-i joblist] jobids glite-wms-job-cancel [-i joblist] jobid glite-wms-job-list- glite-job-list-match jdlfile jdlfile match [-d delegid] [-a] jdlfile

Specifying CE and data Type = Job ; Executable = /bin/hostname ; Arguments = ; StdError = stderr.txt ; WMS uses StdOutput = stdout.txt ; Information System InputSandbox = ; to find CE OutputSandbox = { stderr.txt, stdout.txt }; Requirements = other.architecture== INTEL && other.glueceinfototalcpus > 480; Rank = other.gluecestatetotaljobs; InputData = lfn:/grid/gridbox/antun/file.txt ; WMS uses File Catalog to find file location

Basic services of glite User Interface Information System Submit job Resource Broker Retrieve status & output create credential query query File and Replica Catalog Job status Retrieve output Logging Submit job publish state Site X Authorization Service (VOMS) Job status Computing process Storage Logging and bookkeeping

Location of files UI Network Daemon Characteristics of resources LFC Workload Manager Inform. Service Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

Daemon responsible for accepting incoming requests submitted UI JDL Network Daemon LFC waiting Input Sandbox files RB storage Workload Manager Inform. Service glite-wms-job-submit myjob.jdl Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

Job Status Enabling Grids for E-sciencE submitted UI Network Daemon Job LFC waiting RB storage Workload Manager Inform. Service WM: responsible to take the appropriate actions to satisfy the request Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

submitted UI Network Daemon Match- Maker/ Broker LFC waiting RB storage Workload Manager Where this job can be executed? Inform. Service Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

submitted UI Network Matchmaker: responsible Daemon to find the best CE where to submit a job Match- Maker/ Broker LFC waiting RB storage Workload Manager Inform. Service Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

Where is the needed InputData? Enabling Grids for E-sciencE submitted UI Network Daemon Match- Maker/ Broker LFC waiting RB storage Workload Manager Inform. Service Job Contr. - CondorG What is the status of the Grid? WMS CE characts & status SE characts & status Computing Storage

submitted UI Network Daemon Match- Maker/ Broker LFC waiting RB storage Workload Manager CE choice Inform. Service Job Contr. - CondorG WMS CE characts & status SE characts & status Computing Storage

submitted UI Network Daemon LFC waiting RB storage Workload Manager Job Contr. - CondorG Job Adapter Inform. Service CE characts & status JA: responsible WMS for the final touches to the job before performing submission (e.g. creation of wrapper script, etc.) SE characts & status Computing Storage

submitted UI Network Daemon LFC waiting RB storage Workload Manager Job Inform. Service ready Job Contr. - CondorG JC: responsible for the actual job management operations (done via CondorG) WMS CE characts & status SE characts & status Computing Storage

UI Enabling Grids for E-sciencE Network Daemon LFC submitted waiting RB storage Workload Manager Inform. Service ready scheduled Job Contr. - CondorG Input Sandbox files WMS Job CE characts & status SE characts & status Computing Storage

UI Enabling Grids for E-sciencE Network Daemon LFC submitted waiting RB storage Workload Manager Inform. Service ready scheduled Job Contr. - CondorG running Input Sandbox WMS Computing Grid enabled data transfers/ accesses Storage Job

submitted UI Network Daemon LFC waiting RB storage Workload Manager Inform. Service ready scheduled Job Contr. - CondorG running Output Sandbox files WMS done Computing Storage

submitted UI Network Daemon LFC waiting RB storage Workload Manager Inform. Service ready scheduled glite-wms-get-output <jobid> Job Contr. - CondorG running Output Sandbox WMS done Computing Storage

submitted UI Network Daemon LFC waiting Output Sandbox files RB storage Workload Manager Inform. Service ready scheduled Job Contr. - CondorG running WMS done cleared Computing Storage

glite-wms-job-status <jobid> Enabling Grids for E-sciencE glite-wms-job-logging-info <jobid> Job monitoring UI LB: receives and stores job events; processes corresponding job status Network Daemon Job status Logging & Bookkeeping LB proxy Workload Manager Job Contr. - CondorG WMS Log of job events Computing

Brokering policy 1. Meet CE requirements (defined by Requirements part of JDL) 2. Select CE which is close to InputData Close relationship is defined between CEs and SEs by site administrators Close is not necessarily physical distance rather bandwidth Close typically means same site SE: se-3.grid.box CE: ce-1.grid.box:2119/jobmanager-lcgpbs-gridbox 3. Select CE with highest rank (rank formula is defined by Rank part of JDL)

GlueCEUniqueID Identifyer of a CE Eliminating an erroneous CE: other.glueceuniqueid!=! Some relevant CE attributes ce-1.grid.box:2119/jobmanager-lcgpbs-gridbox! GlueCEInfoTotalCPUs max number of CPUs at a CE!Rank = other. GlueCEInfoTotalCPUs;! GlueCEStateWaitingJobs number of waiting jobs! GlueCEPolicyMaxCPUTime job will be killed after this number of minutes!other.gluecepolicymaxcputime > 300; GlueHostMainMemoryRAMSize memory size other.gluehostmainmemoryramsize > 1024; http://glite.web.cern.ch/glite/documentation/ JDL specification

Examples 1 Rank = ( other.gluecestatewaitingjobs == 0? other.gluecestatefreecpus : -other.gluecestatewaitingjobs); if there are no waiting jobs, then the selected CE will be the one with the most free CPUs else the one with the least waiting jobs. Requirements = ( Member( IDL2.1, other.gluehostapplicationsoftwareruntimeenvironment) ) && (other.gluecepolicymaxwallclocktime > 10000); CE where, IDL2.1 software is available At least 10000s can be spent on the site (waiting + running)

Examples 2 other.gluehostmainmemoryramsize >= 512 * ( other.gluehostarchitecturesmpsize > 0? other.gluehostarchitecturesmpsize : 1 ) At least 512 MB of RAM memory per CPU core should be available other.gluehostarchitectureplatformtype == "x86_64 x86_64 arch requested

Job states Flag SUBMITTED WAIT READY Meaning submission logged in the Logging & Bookkeeping service job match making for resources job being sent to executing CE SCHEDULED RUNNING DONE CLEARED ABORT job scheduled in the CE queue manager job executing on a Worker Node of the selected CE queue job terminated without grid errors job output retrieved job aborted by middleware, check reason

WMProxy (1) Local Workstation UI (user interface) has preinstalled client software Workload Management System UI WMProxy WMS ssh Client on the UI communicates with the WM Proxy WMPProxy can manage complex jobs WMProxy acts on your behalf to access WMS. It needs a delegated proxy : Delegation once per session OR Delegation at every submission CEs

WMProxy (2) WMProxy is a SOAP Web service providing access to the Workload Management System (WMS) Job characteristics specified via JDL jobregister create id map to local user and create job dir register to L&B return id to user input files transfer jobstart register sub-jobs to L&B map to local user and create sub-job dir s unpack sub-job files deliver jobs to WM

Relevant JDL attributes 1 Type Job, DAG, Collection JobType (only when the Type is set to Job!) Normal (sequential batch job), Parametric, Interactive, MPICH, Checkpointable, Partitionable Executable The name of the executable (absolute path) Arguments Job command line arguments StdInput, StdOutput, StdError Standard input/output/error of the job (stdin absolute path; stdout & stderr relative path) Environment List of environment variables to be set for the binary InputSandbox List of files on the UI local disk needed by the job for running The listed files will be staged to the remote resource OutputSandbox List of files, generated by the job, which have to be retrieved Files will be transfered back

Relevant JDL attributes 2 Input Data For the broker, WMS does not transfer these files Output Data For the broker, WMS does not transfer these files Requirements Required CE caracteristics Rank Goodness value for compatible CEs ShallowRetryCount in case of grid error, retry job this many times Shallow : before job is running RetryCount resubmit if the job failed in Running mode If job fails after it has already done something (e.g. creating a Grid file) then resubmission can generate inconsistencies MyProxyServer where to download proxy from in case of the existing proxy expires Done by WMS

A set of independent jobs Complex jobs 1: Job collection For some reason must be managed as a single unit Possible reasons: Belong to the same experiment Share common input files Optimize network traffic Sharing of sandboxes

[!!Type = Collection";! JDL of a job collection Transfer from UI only once!inputsandbox = {!!!! sharedfile1 ;...; sharedfilem };!!nodes = {!!![ JobType = "Normal";!!! InputSandbox = {root.inputsandbox,...}!!!! ; ],!!!!...!!![ JobType = "Normal";!!!! ; ],!!!...!!};! ]! JDL of 1 st job JDL of N th job

Complex jobs 2: DAG workflow Enabling Grids for E-sciencE Direct Acyclic Graph (DAG) is a set of jobs where the input, output, or execution of one or more jobs depends on one or more other jobs B A C D Sharing and inheritance of sandboxes Include OutputSandbox in the next InputSandbox E Dependencies defined between pairs of jobs

JDL of a DAG [ Type = DAG";!!InputSandbox = {!!!! sharedfile1,..., sharedfilen };!!nodes = [!!!job1 = [! Transfer from UI only once JDL of 1 st job Job 1!!!descpription = [!!!!!JobType = "Normal";! Job 2!!!!...; ];! Job 4!!];!...!!!...!!];!!dependencies = {! {job1,{job2,job3}}, {job2,job4}, {job3,job4} };!!};! ]! Graph structure Job 3

[ Type = DAG";!!!!!job4 = [!!!!descpription = [!!!!!JobType = "Normal";!!!!!InputSandbox = {! root.nodes.job1.description.outputsandbox[0],! root.nodes.job2.description.outputsandbox,!!!!!...;];!!!];!!!...!!]! DAG: inheritance of sandboxes...};!

Complex jobs 3: parametric jobs Enabling Grids for E-sciencE A set of jobs generated from one JDL Useful where many similar (but not identical) jobs must be executed Parameter study, parametric sweep applications Majority of grid applications are parametric! One or more parametric attributes in the JDL: Use the _PARAM_ keyword E.g. InputSandbox = input_param_";

JDL of a parametric job 1 [!!Type = Parametric";!!...!!ParameterStart = 0;!!ParameterStep = 2;!!Parameters = 6; _PARAM_: 0, 2, 4, 6, 8, 10!!Arguments = inputfigure_param_.jpg";!!stdoutput = transformed_param_.jpg";!!outputsandbox = {" transformed_param_.jpg ", };!!...! ]!

JDL of a parametric job 2 [!!Type = Parametric";!!...!!!Parameters = {alpha, beta, gama};!!arguments = inputfigure_param_.jpg";!!stdoutput = transformed_param_.jpg";!!outputsandbox = {" transformed_param_.jpg ", };!!...! ]!

Status of WMProxy For simple jobs: glite-wms- is the recommended way to use the WMS History: Before the glite-wms- commands we had glite- commands used the WMS without WMProxy Before the glite- commands we had edg- commands (edg-job-submit.) European Data Grid project before EGEE Used the resource broker Still very widely used You might see these commands still in use. Status Complex jobs with WMProxy: first stable version just released. Not yet in routine production use Watch for news!

Practicals on advanced job submission Create and submit a JDL file with different requirements and rankings Create and submit a JDL file for a collection of jobs Create and submit a JDL file for a parametric job Create and submit a JDL file for a DAG job