International Collaboration to Extend and Advance Grid Education
gLite WMS: Workload Management System
Marco Pappalardo, Consorzio COMETA & INFN Catania, Italy
ITIS Ferraris, Acireale, Tutorial GRID per gli Insegnanti (Grid Tutorial for Teachers), 11.05.2007
INFSO-SSA-26637
Contents
- Workload Management System (WMS) components and services
  - User Interface (UI)
  - Resource Broker (RB)
  - Logging and Bookkeeping (LB)
  - Computing Element (CE)
- Job Description Language (JDL)
  - JDL document
  - Sandboxes attachment
  - Type of jobs
  - Type of requests
Overview of the Architecture
Architecture overview
[Figure: Workload Management System components. Components shown: User Interface, Network Server (Resource Broker), Information Index, LHC File Catalogue, Logging & Bookkeeping, Computing Element, Storage Element. Labelled flows: authentication & authorization, job submit event, Input Sandbox, Output Sandbox, job status, replicas info, publish.]
Components and Services
- User Interface (UI): the terminal used to access all grid facilities, including the WMS
  - Command Line Interface (CLI)
  - Grid portals (such as GENIUS)
- Network Server (NS) / Resource Broker (RB)
  - WMS access point
  - Dispatches jobs across computing resources
  - Implements a scheduling algorithm
- Logging and Bookkeeping (LB)
  - Keeps track of every WMS status and action
- Computing Element (CE)
  - It is the actual computing resource
  - It is an interface: we do not know what lies behind it
More on Computing Elements
- Logically, it is a queue of pending jobs
- Physically, it is a farm of Worker Nodes (WNs)
- It sits on top of a Local Resource Management System (LRMS)
  - PBS, Condor, LSF
  - CCS in the future?
- It exposes a common interface, independent of the underlying LRMS
Job Life Cycle (1/4)
- The grid user describes a job in a Job Description Language (JDL) document. Input files (the Input Sandbox) can be attached to the JDL document.
- The grid user submits the JDL job using the CLI and waits for a reply.
- The Resource Broker receives and stores the JDL document together with the attached input files.
- The newly generated job ID is sent back to the user, who uses it to refer to that job unambiguously from then on.
Job Life Cycle (2/4)
- The Resource Broker runs a dedicated algorithm (MatchMaking) and selects a Computing Element according to best-fit rules.
- The job is handed to the chosen CE together with the Input Sandbox.
- The Computing Element accepts the job and queues it.
- The job starts executing on the Local Resource Management System (LRMS).
Job Life Cycle (3/4)
- When the job terminates, the output it produced is sent back to the Resource Broker.
- The Resource Broker receives the results and the Output Sandbox and stores them in its local repository.
- At the same time, the Computing Element notifies the Logging & Bookkeeping service.
- The job output is now available on the Resource Broker.
Job Life Cycle (4/4)
- The user queries the L&B to check on his or her jobs and learns that the job has terminated.
- The user retrieves the Output Sandbox from the Resource Broker.
- The Resource Broker clears all information it no longer needs from its repository.
- The job life cycle has ended (successfully or not).
Job State Machine
- Submitted: the job has been created on the UI but not yet sent to the Resource Broker
- Waiting: the job is being processed by the Resource Broker
- Ready: the job has been processed but not yet sent to the chosen CE
- Scheduled: the job is queued on the CE and is waiting to be executed
- Running: the job is running on the Computing Element
- Done: the job has terminated its execution
- Aborted: the job has been aborted by the WMS
- Cancelled: the job has been cancelled by the user
- Cleared: the job has terminated and the output has been retrieved
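The states above can be sketched as a small transition table. The slide lists only the states, not the edges between them, so the allowed transitions below are an assumption made for illustration (for instance, that a job may be aborted or cancelled from any non-final state); the names TRANSITIONS, is_valid_path and is_final are hypothetical.

```python
# Assumed transitions between the job states listed on the slide.
# The slide names the states only; the edges are an illustrative guess.
TRANSITIONS = {
    "Submitted": {"Waiting", "Aborted", "Cancelled"},
    "Waiting": {"Ready", "Aborted", "Cancelled"},
    "Ready": {"Scheduled", "Aborted", "Cancelled"},
    "Scheduled": {"Running", "Aborted", "Cancelled"},
    "Running": {"Done", "Aborted", "Cancelled"},
    "Done": {"Cleared"},
    "Aborted": set(),
    "Cancelled": set(),
    "Cleared": set(),
}


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(states, states[1:]))


def is_final(state):
    """A state is final when no transition leaves it."""
    return not TRANSITIONS[state]


happy_path = ["Submitted", "Waiting", "Ready", "Scheduled",
              "Running", "Done", "Cleared"]
print(is_valid_path(happy_path))                 # True
print(is_valid_path(["Submitted", "Running"]))   # False: intermediate states skipped
```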
Matchmaking Algorithm
The Matchmaking Algorithm (within the RB):
- Decides how to dispatch jobs across resources (where each job runs).
- Uses the Information System as its resource discovery system.
First phase: Selection
- The algorithm chooses which Computing Elements are suitable for executing a given job.
- The Requirements JDL attribute is evaluated for every candidate CE.
Second phase: Ranking
- A fitness function (the Rank JDL attribute) is evaluated over the suitable CEs.
- The CE that maximizes this function is chosen.
- The job is submitted to the selected CE.
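The two phases can be sketched in a few lines of Python. This is a minimal illustration of "filter, then take the maximum", not the Resource Broker's actual implementation; the CE records, their field names, and the matchmake function are hypothetical.

```python
def matchmake(candidates, requirements, rank):
    """Two-phase matchmaking sketch.

    Phase 1 (selection): keep only the CEs for which `requirements` is true.
    Phase 2 (ranking): return the suitable CE with the highest `rank` value,
    or None when no CE is suitable.
    """
    suitable = [ce for ce in candidates if requirements(ce)]
    if not suitable:
        return None
    return max(suitable, key=rank)


# Hypothetical CE descriptions, standing in for Information System data.
ces = [
    {"id": "ce-a", "lrms": "PBS", "total_cpus": 16, "max_running": 10, "running": 9},
    {"id": "ce-b", "lrms": "PBS", "total_cpus": 2, "max_running": 50, "running": 0},
    {"id": "ce-c", "lrms": "LSF", "total_cpus": 64, "max_running": 40, "running": 5},
    {"id": "ce-d", "lrms": "PBS", "total_cpus": 8, "max_running": 20, "running": 4},
]


def requirements(ce):
    # Job requirement: a PBS batch system with more than 2 CPUs.
    return ce["lrms"] == "PBS" and ce["total_cpus"] > 2


def rank(ce):
    # Fitness: number of free running slots.
    return ce["max_running"] - ce["running"]


best = matchmake(ces, requirements, rank)
print(best["id"])  # ce-d (16 free slots; ce-a has 1, ce-b and ce-c are filtered out)
```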
Job Description Language (JDL)
Job Description Language
The Job Description Language (JDL):
- is a language for describing jobs
- consists mainly of a collection of attribute-value pairs
- allows files to be attached
Type of jobs
- Normal: a batch executable
- Interactive: requires interaction with the user
- MPICH: needs the Message Passing Interface (MPI) installed on the computing resource
- Partitionable: can be partitioned into several sub-jobs. Deprecated.
- Checkpointable: execution can be marked at specific positions in the code (checkpoints) and resumed later. Deprecated.
- Parametric: a job whose JDL contains parametric attributes (e.g. Arguments, StdInput, etc.)
Types of requests
The JDL allows the description of the following types of requests:
- Job: a simple job (default)
- DAG: a Directed Acyclic Graph of dependent jobs
- Collection: a set of independent jobs
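To make "a directed acyclic graph of dependent jobs" concrete, the sketch below orders a set of jobs so that each one comes after the jobs it depends on, and rejects a graph that contains a cycle. It only illustrates the concept; it does not reproduce how the WMS schedules DAG requests, and the job names and the execution_order function are hypothetical.

```python
from collections import deque


def execution_order(dependencies):
    """Return a job order in which every job follows the jobs it depends on.

    `dependencies` maps a job name to the set of jobs it must wait for.
    Raises ValueError if the dependencies contain a cycle (not a DAG).
    """
    jobs = set(dependencies)
    for parents in dependencies.values():
        jobs.update(parents)
    # Copy the parent sets so the caller's data is not modified.
    remaining = {job: set(dependencies.get(job, ())) for job in jobs}
    ready = deque(sorted(job for job, parents in remaining.items() if not parents))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Release every job that was waiting only for this one.
        for other in sorted(remaining):
            if job in remaining[other]:
                remaining[other].remove(job)
                if not remaining[other]:
                    ready.append(other)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle: not a DAG")
    return order


deps = {"merge": {"sim1", "sim2"}, "sim1": {"prepare"}, "sim2": {"prepare"}}
print(execution_order(deps))  # ['prepare', 'sim1', 'sim2', 'merge']
```

A Collection, by contrast, has no dependencies at all, so any order (or full parallelism) is acceptable.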
JDL format
A JDL file consists of statements of the form
    Attribute = expression;
each terminated by a semicolon.
- An expression can span several lines; only its last line is terminated by a semicolon.
- Single-line comments start with a sharp character (#) or a double slash (//) at the beginning of the line.
- Comments spanning multiple lines are enclosed between /* and */.
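As an illustration of the "Attribute = expression;" format, the following sketch renders a Python dictionary as JDL text. It is a simplification under stated assumptions: strings become quoted values, lists become brace-enclosed lists, and numbers are written as-is; Boolean values and free expressions such as Requirements or Rank are not handled. The function names jdl_value and to_jdl are hypothetical and not part of any gLite tool.

```python
def jdl_value(value):
    """Format one Python value as a JDL expression (strings, lists, numbers only)."""
    if isinstance(value, str):
        return '"' + value + '"'
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(item) for item in value) + "}"
    return str(value)


def to_jdl(attributes):
    """Render a dict as JDL text, one 'Attribute = expression;' statement per line."""
    return "\n".join(
        f"{name} = {jdl_value(value)};" for name, value in attributes.items()
    )


job = {
    "Executable": "ls.sh",
    "StdOutput": "stdout.log",
    "OutputSandbox": ["stderr.log", "stdout.log"],
    "RetryCount": 3,
}
print(to_jdl(job))
# Executable = "ls.sh";
# StdOutput = "stdout.log";
# OutputSandbox = {"stderr.log", "stdout.log"};
# RetryCount = 3;
```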
Type Example
The Type attribute is a string representing the type of request described by the JDL, e.g.
    Type = "Job";
Possible values are: Job, DAG, Collection.
The value of this attribute is case insensitive. If the attribute is not specified in the JDL description, the WMS sets it to Job.
Default: Job
JobType Example
The JobType attribute is a string representing the type of job described by the JDL, e.g.
    JobType = "Interactive";
Possible values are: Normal, Interactive, MPICH, Checkpointable, Partitionable, Parametric.
This attribute only makes sense when the Type attribute equals Job. Its value is case insensitive.
Default: Normal
Example of JDL file

scriptls.jdl:
    VirtualOrganisation = "gilda";
    Executable = "ls.sh";          // this will run on the endpoint
    StdError = "stderr.log";       // redirect stderr to this file
    StdOutput = "stdout.log";      // redirect stdout to this file
    InputSandbox = "ls.sh";        // attach this file to the JDL
    OutputSandbox =                // these files will be the output
        {"stderr.log", "stdout.log"};  // for this job

ls.sh:
    #!/bin/sh
    # simply executes ls on the final computing resource
    /bin/ls
JDL Attributes
This list of JDL attributes is far from exhaustive.

    Attribute               Description
    Type                    Job or DAG
    JobType                 Interactive, MPICH...
    Executable              The command line to execute
    Arguments               String used as argument for the executable
    StdInput                A file attached as standard input
    StdOutput and StdError  Files to which standard output and standard error are redirected
    InputSandbox            List of files attached to the JDL
    OutputSandbox           List of files to be retrieved as output
    Requirements            Boolean expression to select suitable CEs
    Rank                    Fitness function to evaluate over candidate CEs
    RetryCount              Retry matchmaking
Requirements
    Requirements = <logic expression>
An expression that uses C-like operators. It represents the job's requirements on resources.
The Requirements expression can contain attributes that describe the CE in the Information System (IS); these are prefixed with "other.", e.g.:
    Requirements = (other.GlueCEInfoLRMSType == "PBS" && other.GlueCEInfoTotalCPUs > 2);

Rank
    Rank = <floating point expression>
A fitness function that uses C-like operators to select the best CE.
The Rank expression can likewise contain attributes that describe the CE in the IS, prefixed with "other.", e.g.:
    Rank = other.GlueCEPolicyMaxRunningJobs - other.GlueCEStateRunningJobs;
Command Line Interface
[glite | edg]-job-* commands
- edg-job-list-match: gets the list of CEs that satisfy the requirements to execute the job
- edg-job-submit: submits the job to the Resource Broker and returns the newly generated job_id
- edg-job-status: retrieves the status of the given job
- edg-job-cancel: cancels a submitted job
- edg-job-get-output: retrieves the output, but only if the job has terminated
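When these commands are driven from a script, it helps to build the argument lists in one place. The sketch below only constructs the command lines and does not execute anything; it uses just the option spellings that appear in the worked examples of these slides (-o, -i, -v, --dir), and the Python function names are hypothetical.

```python
def submit_command(jdl_file, jobid_file):
    """Command line that submits `jdl_file` and saves the job id in `jobid_file`."""
    return ["edg-job-submit", "-o", jobid_file, jdl_file]


def status_command(jobid_file, verbosity=None):
    """Command line that queries the status of the jobs listed in `jobid_file`."""
    cmd = ["edg-job-status"]
    if verbosity is not None:
        cmd += ["-v", str(verbosity)]
    return cmd + ["-i", jobid_file]


def get_output_command(jobid_file, directory=None):
    """Command line that retrieves the Output Sandbox, optionally into `directory`."""
    cmd = ["edg-job-get-output", "-i", jobid_file]
    if directory is not None:
        cmd += ["--dir", directory]
    return cmd


print(" ".join(submit_command("hostname.jdl", "jobid")))
# edg-job-submit -o jobid hostname.jdl
print(" ".join(status_command("jobid", verbosity=3)))
# edg-job-status -v 3 -i jobid
```

On a real User Interface machine, such a list could be handed to subprocess.run; that step is left out here because it requires a configured grid environment and a valid proxy.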
Example: hostname.jdl (i)

$> cat hostname.jdl
Type = "Job";
JobType = "Normal";                  // it is a standard job
Executable = "/bin/sh";              // the executable to run
Arguments = "start_hostname.sh";
StdError = "stderr.log";             // redirect standard output and
StdOutput = "stdout.log";            // standard error to these files
InputSandbox = "start_hostname.sh";  // attach this file to the JDL document
OutputSandbox = {"stderr.log", "stdout.log"};  // these files have to be retrieved when the job terminates
RetryCount = 7;                      // if the job fails, retry at most 7 times

$> cat start_hostname.sh
#!/bin/sh
sleep 5
hostname -f
Example: hostname.jdl

$> edg-job-submit -o jobid hostname.jdl
Selected Virtual Organisation name (from proxy certificate extension): gilda
Connecting to host glite-rb.ct.infn.it, port 7772
Logging to host glite-rb.ct.infn.it, port 9002
================== glite-job-submit Success ==============================
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier is:
- https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
The job identifier has been saved in the following file:
/home/fscibi/glite/other/jobid
=====================================================================

The URL printed after "Your job identifier is:" is the newly generated job ID. It is saved to the file jobid because of the option -o jobid:

$> cat jobid
###Submitted Job Ids###
https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
Example: hostname.jdl

$> edg-job-status -i jobid
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
Current Status:  Done (Success)
Exit code:       0
Status Reason:   Job terminated successfully
Destination:     grid004.iucc.ac.il:2119/jobmanager-lcgpbs-short
Submitted:       Mon Apr 3 12:27:28 2006 CEST
*************************************************************

"Done (Success)" means the job has terminated. Destination is the Computing Element where the job executed.
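A script that polls job status usually needs the individual fields rather than the whole report. The sketch below splits each "Key: value" line of a status report into a dictionary. The SAMPLE text is copied from the example above; the exact whitespace of real command output is an assumption, and parse_status is a hypothetical helper, not part of the gLite tools.

```python
SAMPLE = """\
BOOKKEEPING INFORMATION:
Status info for the Job : https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
Current Status:  Done (Success)
Exit code:       0
Status Reason:   Job terminated successfully
Destination:     grid004.iucc.ac.il:2119/jobmanager-lcgpbs-short
Submitted:       Mon Apr 3 12:27:28 2006 CEST
"""


def parse_status(text):
    """Split each 'Key: value' line at its first colon; skip lines without a value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


status = parse_status(SAMPLE)
print(status["Current Status"])  # Done (Success)
print(status["Destination"])     # grid004.iucc.ac.il:2119/jobmanager-lcgpbs-short
```

Splitting at the first colon only is what keeps values that themselves contain colons (the job URL, the CE port, the timestamp) intact.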
Example: hostname.jdl (v)

$> edg-job-status -v 3 -i jobid
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
Current Status:  Cleared
Status Reason:   user retrieved output sandbox
Destination:     grid004.iucc.ac.il:2119/jobmanager-lcgpbs-short
Submitted:       Mon Apr 3 12:27:28 2006 CEST
---
- stateentertimes =
    Submitted : Mon Apr 3 12:27:28 2006 CEST
    Waiting   : Mon Apr 3 12:27:37 2006 CEST
    Ready     : Mon Apr 3 12:27:42 2006 CEST
    Scheduled : Mon Apr 3 12:28:01 2006 CEST
    Running   : Mon Apr 3 12:28:55 2006 CEST
    Done      : Mon Apr 3 12:30:37 2006 CEST
    Cleared   : Mon Apr 3 15:36:39 2006 CEST
    Aborted   : ---
    Cancelled : ---
    Unknown   : ---

The stateentertimes block shows how the job status changed over time.
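The state entry times make it easy to work out how long the job spent in each state. The sketch below does the arithmetic for the timestamps shown above; the time zone suffix (CEST) is dropped because every entry uses the same zone, and the names ENTER_TIMES and time_in_states are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the stateentertimes block, without the CEST suffix.
ENTER_TIMES = {
    "Submitted": "Mon Apr 3 12:27:28 2006",
    "Waiting": "Mon Apr 3 12:27:37 2006",
    "Ready": "Mon Apr 3 12:27:42 2006",
    "Scheduled": "Mon Apr 3 12:28:01 2006",
    "Running": "Mon Apr 3 12:28:55 2006",
    "Done": "Mon Apr 3 12:30:37 2006",
    "Cleared": "Mon Apr 3 15:36:39 2006",
}
FORMAT = "%a %b %d %H:%M:%S %Y"


def time_in_states(enter_times):
    """Seconds spent in each state before the job entered the next one."""
    stamps = [(state, datetime.strptime(text, FORMAT))
              for state, text in enter_times.items()]
    return {
        state: int((next_time - time).total_seconds())
        for (state, time), (_, next_time) in zip(stamps, stamps[1:])
    }


durations = time_in_states(ENTER_TIMES)
print(durations["Scheduled"])  # 54   seconds queued on the CE
print(durations["Running"])    # 102  seconds of execution
print(durations["Done"])       # 11162 seconds until the user retrieved the output
```

In this run the job waited in the CE queue for 54 seconds and executed for 102 seconds; most of the elapsed time was the three hours during which the finished output sat on the Resource Broker before the user fetched it.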
Example: hostname.jdl (vi)

$> edg-job-get-output -i jobid
Retrieving files from host: glite-rb.ct.infn.it ( for https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a )
*********************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://glite-rb.ct.infn.it:9000/lb6lihd93s7vyz1rvbcp8a
have been successfully retrieved and stored in the directory:
/tmp/glite/glite-ui/fscibi_lb6lihd93s7vyz1rvbcp8a
*********************************************************************************

The directory above is the default location where retrieved output is stored. To specify a different directory:

$> edg-job-get-output -i jobid --dir <dirname>
Example: sphere.jdl (i)

$> cat sphere.jdl
#author: giuseppe.larocca@ct.infn.it
Type = "Job";
JobType = "Normal";
Executable = "/bin/sh";
MyProxyServer = "lxshare0207.cern.ch";
StdOutput = "sphere.out";
StdError = "sphere.err";
InputSandbox = {"start_sphere.sh","sphere1.pov","sphere1.ini"};
OutputSandbox = {"sphere.out","sphere.err","final_sphere.gif"};
RetryCount = 7;
Arguments = "start_sphere.sh";
Requirements = Member("POVRAY-3.5", other.GlueHostApplicationSoftwareRunTimeEnvironment);

The Requirements expression selects only the CEs where POVRAY is installed.
Example: sphere.jdl (ii)

$> edg-job-list-match sphere.jdl
Selected Virtual Organisation name (from proxy certificate extension): gilda
Connecting to host glite-rb.ct.infn.it, port 7772
***************************************************************************
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
*CEId*
dgt01.ui.savba.sk:2119/jobmanager-lcgpbs-infinite
dgt01.ui.savba.sk:2119/jobmanager-lcgpbs-long
dgt01.ui.savba.sk:2119/jobmanager-lcgpbs-short
egee008.cnaf.infn.it:2119/blah-pbs-infinite
egee008.cnaf.infn.it:2119/blah-pbs-long
egee008.cnaf.infn.it:2119/blah-pbs-short
fenrir.uniandes.edu.co:2119/blah-pbs-infinite
fenrir.uniandes.edu.co:2119/blah-pbs-long
***************************************************************************

These are the CEs where POVRAY is installed.
Example: sphere.jdl (iii)

$> edg-job-submit -o jobid sphere.jdl
Connecting to host glite-rb.ct.infn.it, port 7772
Logging to host glite-rb.ct.infn.it, port 9002
====================== glite-job-submit Success =========================
The job has been successfully submitted to the Network Server.
Use glite-job-status command to check job current status.
Your job identifier is:
- https://glite-rb.ct.infn.it:9000/jsowaze9kfzs4tdg1tairq
The job identifier has been saved in the following file:
/home/fscibi/glite/other/jobid
====================================================================
Example: pds2jpg-asar-demo.jdl (i)

$> cat pds2jpg-asar-demo.jdl
[ ]
VirtualOrganisation = "gilda";
Executable = "/bin/bash";
Arguments = "pds2jpg_asar_install.sh ASA_APG_1PXPDE20020819_093043_000000152008_00394_02452_0000";
StdOutput = "pds2jpg_asar.out";
StdError = "pds2jpg_asar.err";
OutputSandbox = {
    "ASA_APG_1PXPDE20020819_093043_000000152008_00394_02452_0000-b1.jpg",
    "ENVISAT_Product_courtesy_of_European_Space_Agency",
    "pds2jpg_asar.out",
    "pds2jpg_asar.err"
};
RetryCount = 3;
JobType = "normal";
Type = "Job";
InputSandbox = {"./pds2jpg_asar_install.sh","./beam20.tar.gz"};
rank = (-other.gluecestateestimatedresponsetime);
requirements = (other.gluecestatestatus=="production")

The installer script is attached to the JDL through InputSandbox. The rank expression is the fitness function used to select the best CE.
Example: pds2jpg-asar-demo.jdl (ii)

$> cat pds2jpg_asar_install.sh
echo Staging Input Data \(Courtesy of European Space Agency\);
#edg-rm --vo=gilda copyfile lfn:$1.N1 file://$PWD/$1.N1;
lcg-cp --vo=gilda lfn:$1.N1 file://$PWD/$1.N1;
echo Staging Application;
gunzip beam20.tar.gz;
tar xvf beam20.tar;
cd beam-2.0/bin
echo Starting Application;
./pds2jpg-ASAR-run.sh $1;
mv $1-b*.jpg ../..
cd ../..
rm -fr beam-2.0;
rm -fr $PWD/$1.N1;
rm -fr $PWD/beam20.tar;
echo Input ENVISAT Product courtesy of European Space Agency
touch ENVISAT_Product_courtesy_of_European_Space_Agency
echo No Output Packaging;
echo Done!;
Questions