DataGrid TECHNICAL PLAN AND EVALUATION CRITERIA FOR THE RESOURCE CO-ALLOCATION FRAMEWORK AND MECHANISMS FOR PARALLEL JOB PARTITIONING


DataGrid

DEFINITION OF THE ARCHITECTURE, TECHNICAL PLAN AND EVALUATION CRITERIA FOR THE RESOURCE CO-ALLOCATION FRAMEWORK AND MECHANISMS FOR PARALLEL JOB PARTITIONING

WP1: Workload Management

Document identifier:
Work package:            WP1: Workload Management
Partner(s):              INFN
Lead Partner:            INFN
Document status:         APPROVED
Deliverable identifier:  DataGrid-D1.4

Abstract: This document updates the architecture specification of the Workload Management System, in particular to describe how resource co-allocation and job partitioning are addressed in this framework. The subjects of advance reservation (whose mechanisms are used for resource co-allocation), job checkpointing (the job partitioning problem is addressed in this framework) and inter-job dependencies (necessary to manage partitionable jobs) are also discussed.

IST PUBLIC

Delivery Slip

              Name                Partner  Date        Signature
From          Massimo Sgaravatto  INFN
Verified by   Francesco Prelz     INFN
Approved by   PTB                          10/09/2002

Document Log

Issue  Date        Comment             Author
0_0    03/06/2002  First draft         F. Giacomini, A. Gianelle, R. Peluso, M. Sgaravatto
0_1    07/06/2002                      F. Giacomini, M. Sgaravatto
0_2    13/06/2002                      F. Giacomini, M. Sgaravatto
0_3    19/06/2002                      M. Sgaravatto
0_4    28/06/2002                      F. Giacomini, M. Sgaravatto
0_5    02/07/2002                      F. Giacomini, M. Sgaravatto
0_6    12/07/2002                      M. Sgaravatto
1_0    10/09/2002  Final PTB approval

Document Change Record

Issue  Item  Reason for Change
0_1          Addressed Mirek Ruda's comments
0_2          Modified figure on new architecture; modified state machines; modified presentation page
0_3          Addressed Francesco Prelz's comments
0_4          Addressed reviewers' comments
0_5          Addressed D. Bosio, B. Jones and J. Montagnat's comments
0_6          Addressed M. Parson's comments
1_0          Final PTB approval

Files

Software Products    User files
Microsoft Word 2000
Adobe Acrobat 5.0

CONTENT

1. INTRODUCTION
    1.1. OBJECTIVES OF THIS DOCUMENT
    1.2. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS
    1.3. DOCUMENT AMENDMENT PROCEDURE
    1.4. TERMINOLOGY
    1.5. ACKNOWLEDGEMENTS
2. EXECUTIVE SUMMARY
3. REVIEW OF THE WORKLOAD MANAGEMENT SYSTEM ARCHITECTURE
    3.1. THE NEW WMS ARCHITECTURE
    3.2. SUPPORTING NEW FUNCTIONALITIES
        3.2.1. Supporting Job Partitioning
        3.2.2. Supporting Job Dependencies
4. RESOURCE RESERVATION
    4.1. EXTERNAL INTERFACE
        4.1.1. Language
        4.1.2. Application Programming Interface
    4.2. INTERNAL DESIGN
    4.3. DEPENDENCIES
5. CO-ALLOCATION
    5.1. EXTERNAL INTERFACE
        5.1.1. Language
        5.1.2. Application Programming Interface
    5.2. INTERNAL DESIGN
    5.3. DEPENDENCIES
    5.4. EVALUATION CRITERIA FOR RESOURCE CO-ALLOCATION
6. JOB PARTITIONING AND JOB CHECKPOINTING
    6.1. JOB PARTITIONING: INTRODUCTION TO THE PROBLEM
    6.2. JOB CHECKPOINTING
        6.2.1. Job state
        6.2.2. Job checkpointing scenario
        6.2.3. Job checkpointing Application Programming Interface
        6.2.4. Internal design for job checkpointing
    6.3. JOB PARTITIONING
        6.3.1. Job partitioning scenario
        6.3.2. Internal design for job partitioning
    6.4. EVALUATION CRITERIA FOR JOB PARTITIONING
7. INTER-JOB DEPENDENCIES
    7.1. INTRODUCTION
    7.2. EXTERNAL INTERFACE
        7.2.1. Language
        7.2.2. Application Programming Interface
        7.2.3. DAG State Machine
    7.3. RECOVERY
    7.4. DETAILED DESIGN
8. CONCLUSIONS

1. INTRODUCTION

In [R1] the architecture of the WP1 Workload Management System (WMS) was presented. This architecture is now reviewed and complemented, in particular:
- to increase the reliability, the efficiency and the flexibility of the system;
- to allow WP1 modules to be used outside the WP1 Workload Management System as well, thereby ensuring interoperability with other Grid frameworks;
- to support new functionalities.

Resource co-allocation is one of the new functionalities that will be supported in the WMS: co-allocation means the concurrent allocation of multiple resources, which can be homogeneous or heterogeneous. The Workload Management System will provide a generic framework to support co-allocation, using techniques based on immediate or advance reservation of resources.

Job partitioning is another new functionality introduced in the revised Workload Management System framework. Job partitioning takes place when a job has to process a large set of independent elements, as often happens in many applications, including those directly involved in the DataGrid project. In these cases it may be worthwhile to decompose the job into smaller sub-jobs (each one responsible for processing just a subset of the original large set of elements), in order to reduce the overall time needed to process all these elements through trivial parallelisation, and to optimise the usage of all available Grid resources. Job partitioning will be supported in the context of logical job checkpointing, and will also make use of the techniques used to support inter-job dependencies.

Section 3 of this document presents how the existing Workload Management System is reviewed and complemented. Section 4 discusses resource reservation, whose mechanisms are then exploited in the context of resource co-allocation, presented in section 5. Job partitioning and job checkpointing are the subjects of section 6, while section 7 discusses inter-job dependencies. Section 8 concludes the document.

1.1. OBJECTIVES OF THIS DOCUMENT

The goal of this document is to review and complement the first version of the WP1 architecture document ([R1]), in particular to describe how two new functionalities can be supported in the framework of the Workload Management System: resource co-allocation (addressed using techniques based on immediate or advance reservation of resources) and job partitioning (addressed in the context of logical job checkpointing, and which also requires solutions for the problem of inter-job dependencies).

1.2. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS

Reference documents

[R1] DataGrid - Definition of Architecture, Technical Plan and Evaluation Criteria for Scheduling, Resource Management, Security and Job Description.
[R2] Condor-G Home Page.
[R3] Classified Advertisements Home Page.
[R4] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, A. Roy, A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation, Intl Workshop on Quality of Service.
[R5] A. Roy, V. Sander, Advance Reservation API, GGF Scheduling Working Group, Scheduling Working Document.
[R6] DataGrid - Job Description Language HowTo.
[R7] DataGrid - User Requirements and Specifications for the DataGrid Project.
[R8] DataGrid - Long Term Specifications of LHC Experiment. Part B.
[R9] DataGrid - Requirements for grid-aware biology applications.
[R10] DataGrid - Job partitioning and checkpointing.
[R11] The Condor project.
[R12] DataGrid - Logging and Bookkeeping Service for the DataGrid.
[R13] DataGrid - JDL Attributes.
[R14] DAGman.
[R15] Information and Monitoring (WP3) Architecture Report.

1.3. DOCUMENT AMENDMENT PROCEDURE

Since DataGrid has very fast-paced timescales, with very frequent software releases, it is very difficult to follow a traditional design cycle. Therefore the architecture described in this document represents work in progress and has not been fully implemented yet. It could become necessary to revise this architecture according to new or more focused user requirements, to the evaluation of the framework that will implement the specified architecture, to feedback received from users, and to new evolutions in the available technology.

1.4. TERMINOLOGY

Glossary

API     Application Programming Interface
CA      Co-allocation Agent
CE      Computing Element
DAG     Directed Acyclic Graph
DAGMAN  DAG MANager
GARA    Globus Architecture for Reservation and Allocation
GGF     Global Grid Forum
JDL     Job Description Language
LB      Logging & Bookkeeping
MPI     Message Passing Interface
MRI     Magnetic Resonance Image
QoS     Quality of Service
RA      Reservation Agent
RB      Resource Broker
RC      Replica Catalog
RSL     Resource Specification Language
SE      Storage Element
UI      User Interface
WMS     Workload Management System

1.5. ACKNOWLEDGEMENTS

We would like to thank everyone in WP1 for their contributions. We also wish to thank Miron Livny for the very fruitful discussions. Special thanks to Diana Bosio, Bob Jones, Julian Linford, Maciej Malawski and Johan Montagnat for their useful comments.

2. EXECUTIVE SUMMARY

The initial WP1 Workload Management System software architecture, described in [R1] and implemented in the first phase of the project, is being reviewed and complemented. The objectives for the revised architecture, discussed in this document, are:
- to increase the reliability and the flexibility of the system;
- to address some of the shortcomings that emerged in the first DataGrid testbed;
- to simplify the whole system;
- to favour interoperability with other Grid frameworks, allowing modules (e.g. the Resource Broker) to be used outside the WP1 Workload Management System as well;
- to make it easy to plug in new components implementing new functionalities.

Immediate or advance reservation of resources, which can be heterogeneous in type and implementation and independently controlled and administered, is one of the new functionalities that must be supported, to enable end-to-end quality of service (QoS) in emerging network-based applications. The Workload Management System will provide a generic framework to support reservation of resources, based on concepts that have emerged and been widely discussed in the Global Grid Forum. Its implementation is expected to address at least computing, network and storage resources, provided that adequate support exists from the local management systems.

The Reservation Agent is the core component of the resource reservation framework. Its main functionalities are to accept a generic reservation request from a user (specified via the usual Job Description Language, which is also used for job specification), to map it into a reservation on a specific resource that matches the requirements and preferences specified by the user, to perform the allocation on that resource, and then to allow the user to use the granted reservation for a job.

The mechanisms designed and implemented for resource reservation will also be exploited for resource co-allocation, that is the concurrent allocation of multiple, homogeneous or heterogeneous, resources. The process for performing a co-allocation, given a co-allocation request (specified using the Job Description Language), consists of three major steps:
1. discover resources compatible with the requirements and preferences included in all the resource descriptions;
2. find compatible combinations of resources that would satisfy the co-allocation request;
3. try each combination, taking into account optimisation criteria, until one succeeds or all fail.

The Co-allocation Agent is the component responsible for applying this three-step procedure.
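The three-step co-allocation procedure can be illustrated with a minimal Python sketch. It is not the Co-allocation Agent's implementation: the function name `co_allocate` and the callbacks `discover`, `compatible`, `try_allocate` and `rank` are hypothetical stand-ins for the discovery, compatibility check, allocation attempt and optimisation criteria that the document describes only in general terms.

```python
from itertools import product


def co_allocate(requests, discover, compatible, try_allocate, rank=None):
    """Sketch of the three-step co-allocation procedure.

    All callbacks are assumptions made for illustration:
      discover(request)   -> list of candidate resources for one description
      compatible(combo)   -> True if the combination satisfies the request
      try_allocate(combo) -> True if every resource in combo was allocated
      rank(combo)         -> number, higher is better (optional)
    """
    # Step 1: discover candidate resources for each resource description.
    candidates = [discover(request) for request in requests]

    # Step 2: keep only the combinations (one resource per description)
    # that would satisfy the co-allocation request as a whole.
    combos = [c for c in product(*candidates) if compatible(c)]

    # Optimisation criteria: try the best-ranked combinations first.
    if rank is not None:
        combos.sort(key=rank, reverse=True)

    # Step 3: try each combination until one succeeds or all fail.
    for combo in combos:
        if try_allocate(combo):
            return combo
    return None
```

Returning `None` when every combination fails leaves the retry policy to the caller, which seems a reasonable choice given that allocation in a Grid context is expected to fail fairly often.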

Another new functionality that must be supported is job partitioning. Job partitioning takes place when a job has to process a large set of independent elements. In these cases it may be worthwhile to decompose the job into smaller sub-jobs (each one responsible for processing just a subset of the original large set of elements), in order to reduce the overall time needed to process all these elements, and to optimise the usage of all available Grid resources. The proposed approach is to address the job partitioning problem in the context of job checkpointing, as described in section 6.

Checkpointing a job during its execution means saving its state, so that the job execution can be suspended and later resumed from the point where it was stopped. Users are provided with a logical checkpointing service: through a proper API, a user can save the state of a job at any moment during its execution, and the job must also be instrumented so that it can be restarted from an intermediate (i.e. previously saved) state.

The checkpointing API can also be exploited in the context of job partitioning. If the processing of a job can be described as a set of independent steps or iterations, it can be split into different, simultaneous, independent sub-jobs, each one taking care of a step or of a subset of steps, which can be executed in parallel. The partial results (that is, the results of the various sub-jobs) can be represented by job states (the final job states of the various sub-jobs), which can then be merged by a job aggregator. The aggregator must start its execution only when all the sub-jobs have terminated. Hence job partitioning also requires mechanisms to address the problem of inter-job dependencies.

Generalising the problem, it is possible to define a whole web of dependencies on a set of program executions, building a Directed Acyclic Graph (DAG), whose nodes are program executions (jobs) and whose arcs represent dependencies between them. Within the Workload Management System, a DAG will be managed by a meta-scheduler called DAGMan (DAG Manager), whose main purpose is to navigate the graph, determine which nodes are free of dependencies, and follow the execution of the corresponding jobs.
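As an illustration of how a partitionable job maps onto a DAG, the following Python sketch splits a list of independent elements into sub-jobs and adds an aggregator node that depends on all of them. The function names, the node names (`subjob_0`, `aggregator`) and the equal-size chunking policy are assumptions made for this example; the document does not prescribe how elements are distributed among sub-jobs.

```python
def partition(elements, n_subjobs):
    """Split elements into at most n_subjobs contiguous, near-equal chunks."""
    size, extra = divmod(len(elements), n_subjobs)
    chunks, start = [], 0
    for i in range(n_subjobs):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few elements
            chunks.append(elements[start:end])
        start = end
    return chunks


def build_dag(elements, n_subjobs):
    """Return (nodes, arcs) for the sub-jobs plus one aggregator node.

    nodes maps a node name to the elements it processes; arcs is a list of
    (parent, child) pairs, so the aggregator runs after every sub-job.
    """
    chunks = partition(elements, n_subjobs)
    nodes = {f"subjob_{i}": chunk for i, chunk in enumerate(chunks)}
    arcs = [(name, "aggregator") for name in nodes]
    nodes["aggregator"] = []  # merges the final states of the sub-jobs
    return nodes, arcs
```

For example, `build_dag(list(range(4)), 2)` yields two sub-job nodes of two elements each and two arcs pointing at the aggregator.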

3. REVIEW OF THE WORKLOAD MANAGEMENT SYSTEM ARCHITECTURE

The objectives that we want to achieve by reviewing the existing WP1 software architecture are:
- the simplification of the flow of control within the Workload Management System, minimising the dependencies between the different components;
- the reduction of stateful invocations of the internal components. In particular the aim is to minimise the duplication of persistent information related to the same events, which is difficult to keep coherent;
- increased reliability and flexibility of the system;
- the support for new functionalities, which must be accommodated in the Workload Management System framework. New functionalities include job partitioning, the possibility to reserve resources and to co-allocate them, and the management of dependencies between jobs. Other new functionalities have also been introduced, but their discussion lies outside the scope of this document: here we discuss only job partitioning and co-allocation (and the other frameworks needed to implement these two services, that is resource reservation, job checkpointing and job dependencies);
- the possibility to use WP1 modules (e.g. the Resource Broker) outside the WP1 Workload Management System as well. This is particularly important to guarantee interoperability with other Grid frameworks (e.g. the ones developed within the various US Grid projects).

3.1. THE NEW WMS ARCHITECTURE

The new Workload Management System architecture is represented in Figure 1.

Figure 1: UML diagram describing the new WMS architecture

The User Interface (UI) is the component that allows users to access the functionalities offered by the Workload Management System. Although there are several changes compared to the architecture described in [R1], the commands available at the user interface level are the same. This means that the modified architecture does not imply significant changes from the user's point of view.

The Network Server is a generic network daemon, responsible for accepting incoming requests from the UI (e.g. job submission, job removal), which, if valid, are then passed to the Workload Manager. For this purpose the Network Server uses Protocol to check whether the incoming requests conform to the agreed protocol.

The Workload Manager is the core component of the Workload Management System. Given a valid request, it has to take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different request types.

All the components that offer support to the Workload Manager provide a class whose interface is inherited from a Helper class, which consists of a single method (resolve()). Essentially the Helper, given a JDL expression, returns a modified one, which represents the output of the required action. For example, if the request was to find a suitable resource for a job, the input JDL expression will be the one specified by the user, and the output will be the JDL expression augmented with the CE choice.

The Resource Broker is one of these classes offering support to the Workload Manager. It provides a matchmaking service: given a JDL expression (e.g. for a job submission, or for a resource reservation request), it finds the resources that best match the request. The Resource Broker can be decomposed into three sub-modules:
- a sub-module responsible for performing the matchmaking, therefore returning all the resources suitable for that JDL expression;
- a sub-module responsible for ranking the matched resources, therefore returning just the best resource suitable for that JDL expression;
- a sub-module implementing the chosen scheduling strategy; this must be easily pluggable and replaceable with other ones implementing different scheduling strategies.

Within this architecture, the Resource Broker is therefore re-cast as a module, implementing the Helper interface, which can also be plugged into and used in frameworks other than the WP1 Workload Management System.

The Job Adapter is responsible for making the final touches to the JDL expression for a job, before it is passed to CondorG for the actual submission. So, besides preparing the CondorG submission file, this module is also responsible for creating the wrapper script: as described in [R1], the user job is wrapped within this script, which is responsible for creating the appropriate execution environment on the CE worker node (this includes the transfer of the input and output sandboxes).

CondorG ([R2]) is the module responsible for performing the actual job management operations (job submission, job removal, etc.), issued on request of the Workload Manager. The CondorG framework is exploited for various reasons:
- the reliable two-phase commit protocol used by CondorG for job management operations;
- the persistency: CondorG keeps a persistent (crash-proof) queue of jobs. This queue will be used as a persistent database storing information, represented via Condor classads ([R3]), concerning active jobs;
- the logging system: CondorG logs all the significant events (e.g. job started its execution, job execution completed, etc.) concerning the managed jobs, which is useful to increase the reliability of the whole system;
- the increased openness of the CondorG framework;
- the need for interoperability with the US Grid projects, of which CondorG is an important component.

The Log Monitor is responsible for watching the CondorG log file, intercepting interesting events concerning active jobs, that is events affecting the job state machine described in [R1] (e.g. job done, job cancelled, etc.), and triggering the appropriate actions.

The Reservation Agent and the Co-Allocation Agent are represented in the same block to simplify the figure. The Reservation Agent is the core component of the reservation framework. As explained in section 4, its main functionalities are to accept a reservation request from a user, map it into a reservation on a specific resource, perform the allocation on this resource, and allow the user to use the granted reservation. The Co-Allocation Agent, in the framework of resource co-allocation, is responsible for discovering resources compatible with the requirements and preferences included in a co-allocation request, finding compatible combinations of resources that would satisfy the co-allocation request, and trying each combination until one succeeds or all fail (see section 5).

The Logging and Bookkeeping (LB) service stores logging and bookkeeping information concerning events generated by the various components of the WMS. Using this information, the LB service keeps a state machine view of each job. This service is essentially unchanged compared to the previous architecture discussed in [R1]. The only significant expected change is the planned use of the R-GMA [R15] framework to improve the efficiency of the service. The dependencies between this component and the other modules of the Workload Management System (the UI accessing the LB service to get status and logging information on jobs, and the various modules pushing events concerning jobs to the LB) are not represented in the figure, again for simplicity.
The other modules (Partitioner and DAGMan) are explained in the following section. Therefore, besides the introduction of new components to support new functionalities (e.g. DAGMan to support job dependencies, the Reservation Agent to support resource reservation, etc.), the core functionalities have been split between the Workload Manager (responsible for taking the appropriate actions given a request) and the Resource Broker (which is now a pluggable component, responsible only for finding the best CEs given a specific JDL expression). Moreover, the duplication of persistent information has been avoided by relying more widely on the CondorG system.
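The Helper interface described in this section, with its single resolve() method, can be sketched in Python as follows. This is only an illustration under several assumptions: a JDL expression is modelled as a plain dictionary, `Requirements` and `Rank` are modelled as Python callables rather than classad expressions, and the output attribute name `ChosenCE` is hypothetical. Only the names Helper, resolve() and Resource Broker come from the document.

```python
from abc import ABC, abstractmethod


class Helper(ABC):
    """Interface shared by the components supporting the Workload Manager."""

    @abstractmethod
    def resolve(self, jdl: dict) -> dict:
        """Return a modified copy of the JDL expression."""


class ResourceBroker(Helper):
    """Toy broker: matchmaking followed by ranking of the matched resources."""

    def __init__(self, resources):
        self.resources = resources  # list of dicts describing CEs (assumed)

    def resolve(self, jdl: dict) -> dict:
        # Matchmaking: keep every resource satisfying the requirements.
        matches = [r for r in self.resources if jdl["Requirements"](r)]
        if not matches:
            raise LookupError("no resource matches the request")
        # Ranking: pick the resource with the highest rank.
        best = max(matches, key=jdl["Rank"])
        # Output: the original expression augmented with the CE choice.
        return {**jdl, "ChosenCE": best["name"]}
```

Because every helper has the same signature, the Workload Manager (or DAGMan) can chain helpers without knowing what each one does, which is what makes the Resource Broker reusable outside the WMS.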

3.2. SUPPORTING NEW FUNCTIONALITIES

The new proposed architecture, besides improving the efficiency of the whole system, also makes it easy to plug in new components implementing new functionalities. This is the case, for example, for job partitioning (discussed in section 6) and for job dependencies (the subject of section 7).

3.2.1. Supporting Job Partitioning

As will be discussed in section 6.3.2, in the context of job partitioning, a JDL expression representing a partitionable job must be transformed into a JDL expression representing a Directed Acyclic Graph (DAG) of jobs. The component providing this functionality, the Partitioner, is an example of a Helper class, which can be called by the Workload Manager when it receives a request of type "partitionable job".

3.2.2. Supporting Job Dependencies

As will be discussed in section 7, in order to manage dependencies between jobs, a component called DAGMan (DAG Manager) is introduced. DAGMan can be seen as an iterator on a graph of jobs: when a job is free of dependencies it can be submitted to CondorG for execution.

Before passing a job belonging to the DAG to CondorG, it is first necessary to find an appropriate resource for its execution. This can be accomplished by calling the Resource Broker, in the same way as it is usually called by the Workload Manager for single jobs: the result of the call will be the original JDL expression augmented with the resource choice. DAGMan then also has to call the Job Adapter, to make the final adjustments to the job description.

Figure 1 shows DAGMan in the overall design. The JDL expression representing the DAG, received from the UI, is first modified by the Job Adapter, to create the Condor submit file, and the DAG is then submitted to CondorG. DAGMan, in turn, calls the Resource Broker for each job (node) composing the DAG, to bind the job to a resource.

Note that in the picture there is a DAGMan dependency on the Partitioner, since a DAG job could in turn be partitionable: in this case DAGMan has to call the Partitioner, which returns a DAG (which, like all other DAGs, is then submitted to CondorG, which then spawns another DAGMan). In other words, every time CondorG receives a job that is a DAG, it spawns a DAGMan process to manage that DAG. In turn DAGMan passes each job contained in the DAG to CondorG (after calling one or more helpers).
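The view of DAGMan as an iterator over a graph of jobs can be sketched as below. This is a simplified illustration, not DAGMan's actual algorithm: helpers are modelled as plain functions from a JDL dictionary to a JDL dictionary, `submit` is a hypothetical stand-in for handing the job to CondorG, and a job is treated as completed as soon as it is submitted, whereas the real system follows job execution asynchronously.

```python
def run_dag(jobs, deps, helpers, submit):
    """Submit every job of a DAG once its dependencies are satisfied.

    jobs:    dict mapping a job name to its JDL expression (a dict here)
    deps:    dict mapping a job name to the set of its parent job names
    helpers: functions applied in order to each JDL before submission
             (e.g. a resource broker, then a job adapter)
    submit:  callable(name, jdl), assumed to run the job to completion
    Returns the order in which the jobs were submitted.
    """
    done, order = set(), []
    pending = set(jobs)
    while pending:
        # Nodes free of dependencies: all parents have completed.
        ready = sorted(j for j in pending if deps.get(j, set()) <= done)
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in the graph")
        for name in ready:
            jdl = jobs[name]
            for helper in helpers:  # e.g. bind to a resource, then adapt
                jdl = helper(jdl)
            submit(name, jdl)
            done.add(name)  # simplification: submission implies completion
            order.append(name)
        pending -= set(ready)
    return order
```

With two sub-jobs `a` and `b` and an aggregator `agg` depending on both, the sub-jobs are submitted first and the aggregator last.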

4. RESOURCE RESERVATION

The realisation of end-to-end quality of service guarantees in emerging network-based applications may require mechanisms that support advance or immediate reservation of resources, which can be heterogeneous in type and implementation, and independently controlled and administered. Employing reservation mechanisms can also help to reduce competition for resources and to implement higher-level abstractions, such as the concurrent allocation of multiple resources. By reserving resources (immediately or in advance), it is possible to mediate among competing requests and to avoid oversubscription, and therefore to prevent degraded service and/or increased costs due to excess over-provisioning ([R4]).

The Workload Management System will provide a generic framework to support reservation of resources. Its implementation is expected to address at least computing, network and storage resources, provided that adequate support exists from the local management systems.

The approach described here for reserving a resource and then using that reservation is based on concepts described in the Global Grid Forum (GGF) draft Advance Reservation API ([R5]).

A resource reservation request is specified by the following attributes:
- Start time: the earliest time that the reservation may begin;
- Duration: how long the reservation lasts;
- Resource type: the type of the underlying resource, such as network, computation, storage;
- Reservation type: used in case a resource supports different types of reservation;
- Resource-specific parameters: parameters that are specific to the type of resource, such as bandwidth or maximum jitter for a network reservation, number of nodes for a reservation of a computing resource, etc.;
- End time: the latest time at which the reservation can expire. If the difference between end time and start time exceeds the requested duration, any time interval of the requested duration starting at or after the start time and not ending past the end time is acceptable.

Not all the attributes are mandatory: if not specified, start time defaults to now and end time defaults to infinity.

A reservation request can also contain additional information that could help the system to find a better resource match. For example, if a request for a computing resource also specifies that the job will have to access a certain data set, the system (i.e. the Resource Broker) will try to find and reserve a resource able to provide the best access to the required data.
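The time-window semantics of a reservation request (start time, end time, duration and their defaults) can be made concrete with a small Python sketch. The class name `ReservationRequest` and the methods `feasible` and `accepts` are illustrative names, not part of the reservation framework; times are taken to be seconds since the epoch, as in the JDL attributes described in section 4.1.1.

```python
import math
import time
from dataclasses import dataclass, field


@dataclass
class ReservationRequest:
    duration: float                                    # seconds
    start: float = field(default_factory=time.time)   # defaults to now
    end: float = math.inf                              # defaults to infinity

    def feasible(self) -> bool:
        """The window must be at least as long as the requested duration."""
        return self.end - self.start >= self.duration

    def accepts(self, slot_start: float) -> bool:
        """True if a slot beginning at slot_start satisfies the request.

        The slot must start at or after the start time and must not end
        past the end time.
        """
        return (slot_start >= self.start
                and slot_start + self.duration <= self.end)
```

For a request of 300 seconds within the window [1000, 2000], any slot starting between 1000 and 1700 inclusive is acceptable; a slot starting at 1701 would end past the end time.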

The process for acquiring a resource, given a reservation request, consists of two major steps:
1. Discover resources compatible with the requirements and preferences specified in the request; this phase implies querying the resources, either directly or indirectly, to know their current status and availability;
2. Perform the allocation on the best-matched resource.

In a Grid context, searching for resources, querying their status and allocating one of them cannot reasonably be performed in an atomic manner, hence this two-step allocation procedure is expected to fail (i.e. the process of finding and reserving a resource compatible with the requirements does not succeed) at a non-negligible rate. The system must therefore be designed and implemented to be resilient to such failures.

Once the reservation has been granted, it can be used by a user to perform the corresponding job: a computing reservation can be used for running applications, a storage reservation can be used to store files, a network reservation can be used to initiate data transfers, etc. The declaration of which reservations a job is entitled to use is expressed by the attribute UseReservation in the JDL expression of a job. UseReservation is a list of reservation identifiers:

    UseReservation = { <reservation_id_1>,..., <reservation_id_n> };

The information associated with a reservation will be used by the Workload Management System to act appropriately.

For some resources and/or resource types, using a previously granted reservation may require passing one or more parameters that were not known at reservation time and which are needed to use the reservation. This is the case, for example, for a network reservation manager that implements QoS by marking packets based on the port used by the application sending data; such information is usually available only when the application has started. Passing run-time information to the reservation manager is called binding. A binding can then be cancelled with an unbinding operation, after which the run-time information passed during the binding is no longer valid. In the network example above, this means that the router should no longer mark the packets originating from the source address (port) specified during the bind.

Depending on the resource and/or resource type a reservation can:
- be bound and unbound several times;
- be bound multiple times without being unbound.

Other useful operations on a reservation are:
- cancellation: a user can explicitly release a reservation when it is no longer needed. Otherwise the reservation is released by the resource itself according to its own policy; for example the reservation is released at expiration time or if it is not used within a certain timeout;
- monitoring: a reservation goes through a number of different states during its lifetime (resource discovery, resource allocation, reservation utilisation). A user may want to be informed about the changes of status, for example to know when the reservation can finally be used;
- modification: sometimes it is desirable to modify the parameters associated with an active reservation, as specified in the original request. If the new parameters are less demanding in terms of resource consumption, the modification request can usually be accepted.

Therefore the user, before submitting a job, can request to reserve a specific resource. The result of this reservation request will be an identifier (see section 4.1.2). The user can then check the status of the reservation and, once it has been granted and is therefore ready to be used, specify it along with the other attributes of the JDL expression needed to submit the job. It is then up to the Workload Management System to use this reservation transparently. For example, if the reservation request was for a computing resource, the job will have to be submitted to the CE chosen at reservation time, and it will be necessary to interact with the corresponding Resource Manager to be able to use the granted reservation.

4.1. EXTERNAL INTERFACE

The Reservation Agent (RA) is the core component of the resource reservation framework. The main functionalities of the RA are:
- to accept a generic reservation request, map it into a reservation on a specific resource that matches the requirements and preferences specified by the user, and perform the allocation on that resource;
- to allow the user to use a granted reservation for a job.

In a broad sense, the interface that the Reservation Agent presents to its clients is characterised by:
- the language used to express reservation requests;
- the Application Programming Interface (API) provided to interact with the Reservation Agent.

Language

The language used to specify a resource reservation request is the Job Description Language ([R6]) used for job submission. As mentioned above, a reservation request is characterized by several attributes, which are mapped to the following JDL attributes:
- ReservationResource: specifies the resource type. The type is expressed as a string chosen from a predefined set, currently including "computing", "network" and "storage";
- ReservationType: specifies the reservation type. This is expressed as a string and is resource-dependent;
- ReservationStart: specifies the start time. The time is an integer value expressing the number of seconds since the epoch (midnight of 1 January 1970 UTC);
- ReservationEnd: specifies the end time, expressed as the number of seconds since the epoch;
- ReservationDuration: specifies the duration, expressed as the number of seconds the reservation should last;
- ReservationParameters: specifies resource-dependent parameters.

Besides the above attributes, the JDL expression should also contain a Rank and especially a Requirements expression, as is the case for job submissions. In particular, the Requirements expression should require that the selected resource support reservation (boolean attribute SupportReservation): this attribute will be automatically added by the Reservation Agent to the JDL expressions specifying resource reservation requests. A JDL expression defines a resource reservation request if its Type attribute equals "Reservation".

As an example, the following JDL expression represents a reservation request for three nodes for 300 seconds on a Computing Element running Linux, whose architecture is i386:

    [
      Type = "Reservation";
      ReservationType = "computing";
      ReservationStart = ;
      ReservationEnd = ;
      ReservationDuration = 300;
      ReservationParameters = [ nodes = 3; ];
      ...
      Requirements = other.arch == "i386" && other.opsys == "Linux" && other.supportreservation;
    ]

Application Programming Interface

The Reservation Agent provides its clients with an Application Programming Interface designed along the guidelines specified in [R5]:
- create_reservation(): creates a reservation for the specified request;
- bind_reservation(): binds a reservation to run-time parameters;
- unbind_reservation(): unbinds a reservation;
- cancel_reservation(): cancels a reservation;
- modify_reservation(): modifies the parameters associated with a reservation;
- status_reservation(): returns the status of the resource reservation.

As previously introduced, a reservation is referenced via an identifier. This is a user-controlled handle that can be used to manipulate the reservation. The identifier is assigned and used like a job identifier (see [R1]): it is assigned on the client side at the moment the reservation request is created.

The above resource-independent API is implemented on top of a number of resource-dependent Reservation Agents. These support the same functionality as the generic Reservation Agent, with the following differences:
- no resource discovery is done (this is a responsibility of the generic RA);
- for reasons of reliability and flexibility, the creation of a reservation is implemented as a two-phase commit. To support this, the resource-dependent API includes a commit_reservation() function.

The generic Reservation Agent therefore performs all the operations that are resource-independent and delegates the rest to the resource-dependent Reservation Agents. So there will be a Network Reservation Agent (possibly many, if different QoS techniques are possible for the network), a Computing Reservation Agent, and a Storage Reservation Agent.

INTERNAL DESIGN

The creation of a reservation is shown in Figure 2. The submission of the request (1) returns immediately, indicating whether the RA is willing to accept the request. If that is the case, the RA starts the discovery phase by contacting the RB, which returns a list of suitable resources ordered by rank (2). The RB performs this task by querying the Information System, where the characteristics and the status of the various resources, including the schedule of reservations (provided by the Resource Managers), are published, and by performing the matchmaking with the JDL expression specifying the reservation request. The RA then tries to reserve a resource: it iterates through the list of suitable resources and contacts the corresponding Resource Managers until it succeeds (3).

Figure 2: Creation of a reservation
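The iteration over the ranked list, combined with the two-phase creation offered by the resource-dependent agents, could be sketched roughly as follows. This is only an illustration under stated assumptions: the class ResourceAgent, its accepts flag and the function reserve_first_available() are hypothetical names, and the exact semantics of create_reservation() and commit_reservation() are assumed rather than taken from the API definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceAgent:
    """Hypothetical stand-in for a resource-dependent Reservation Agent."""
    name: str
    accepts: bool                      # whether this resource would grant the request
    committed: List[str] = field(default_factory=list)

    def create_reservation(self, reservation_id: str) -> bool:
        # Phase 1 (assumed semantics): tentatively hold the resource.
        return self.accepts

    def commit_reservation(self, reservation_id: str) -> None:
        # Phase 2: make the tentative reservation definitive.
        self.committed.append(reservation_id)


def reserve_first_available(reservation_id: str,
                            ranked_agents: List[ResourceAgent]) -> Optional[ResourceAgent]:
    """Walk the rank-ordered list and stop at the first resource that accepts."""
    for agent in ranked_agents:
        if agent.create_reservation(reservation_id):
            agent.commit_reservation(reservation_id)
            return agent
    return None  # no suitable resource granted the reservation
```

In this sketch the list passed to reserve_first_available() plays the role of the rank-ordered list returned by the RB in step (2), and the loop corresponds to step (3).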

Note also that, when available, Grid accounting mechanisms will be exploited to avoid abuses, for example a malicious user reserving a large amount of resources in advance and thereby degrading the overall performance of the Grid.

Where possible, the Reservation Agent will be implemented on top of the Globus Architecture for Reservation and Allocation, GARA ([R4]), in particular to implement the interaction with the local resource managers, both for resource reservation requests and for the use of granted reservations. GARA in fact provides mechanisms for QoS reservations for different types of resources, including computers, networks, and disks.

DEPENDENCIES

The reservation framework depends on services provided by other components:
- the Resource Broker finds resources that match the requirements and preferences of a resource reservation request;
- local Resource Managers must support reservation requests; they should also publish their capability in the Information System (attribute SupportReservation) and possibly also the current schedule of reservations;
- the Logging and Bookkeeping Service receives the various events concerning resource reservation and keeps a reservation state machine;
- the overall Workload Management System should provide support for allowing jobs to use granted reservations.

5. CO-ALLOCATION

By co-allocation we mean the concurrent allocation of multiple resources. These resources can be homogeneous, as in the concurrent allocation of multiple nodes on two different computing elements, needed to run a distributed parallel job, or they can be heterogeneous. An example of co-allocation of heterogeneous resources is the concurrent allocation of space on a storage element, a node on a computing element and some bandwidth on the network link between the computing and storage elements, to run a job that writes a large amount of data at a high rate.

Co-allocating resources, especially in a distributed, non-centrally-managed environment, is usually a difficult problem, for various reasons: resources can be of widely varying types, can be located in different administrative domains and can be subject to different control policies and mechanisms, etc. ([R4]).

The Workload Management System will provide a generic framework to support co-allocation of resources, using techniques based on immediate or advance reservation of resources.

When a co-allocation is requested, it is necessary to specify the time frame and the list of the required resources. As in the case of a resource reservation (see section 3), the attributes concerning the time frame are:
- Start time: the earliest time that the co-allocation may begin;
- Duration: how long the co-allocation lasts;
- End time: the latest time that the co-allocation can expire.

Each resource involved in the co-allocation is described by the following attributes:
- Resource type: the type of the underlying resource, such as network, computation or storage;
- Reservation type: used in case a resource supports multiple types of reservation (e.g. for the network a guaranteed bandwidth, or a guaranteed latency, or a limit on the jitter, etc.);
- Resource-specific parameters: parameters that are specific to the type of resource;
- A "critical" flag: if set to true, the whole co-allocation fails if the reservation of this particular resource fails.

The process for performing a co-allocation, given a co-allocation request, consists of three major steps:
1. discover resources compatible with the requirements and preferences included in all the resource descriptions; this phase implies querying, either directly or indirectly, the resources to know their current status and availability;

2. find compatible combinations of resources that would satisfy the co-allocation request. A combination is a sequence of specific resources that, if reserved successfully, would make the whole co-allocation successful;
3. try each combination, taking into account optimisation criteria, until one succeeds or all fail. For each combination the resources are tried sequentially.

Once the co-allocation is granted, it can be used to run the job in question; in this respect the co-allocation becomes simply a set of reservations, and it is the responsibility of the user to manage those reservations correctly. Nevertheless, the co-allocation entity can still be used as a handle for the whole set (for example for operations such as cancellation, monitoring and modification, the latter however limited to the time frame).

EXTERNAL INTERFACE

The Co-allocation Agent (CA) is responsible for accepting a co-allocation request from a user and applying the 3-step procedure described above: discover resources, identify combinations of resources compatible with the request, and select an acceptable one. The Co-allocation Agent is not responsible for the subsequent use of the reservations representing the result of the co-allocation.

In a broad sense, the interface presented by the Co-allocation Agent to its clients is described by:
- the language used to express co-allocation requests;
- the API to interact with the Co-allocation Agent.

Language

The language used to specify a co-allocation request is, as for reservation, the Job Description Language ([R6]). As mentioned above, a co-allocation request contains information concerning the time frame for the allocation and the description of the resources involved.

The time frame is described by the following attributes, analogous to the ones used in the context of resource reservation (see section 4.1.1):
- ReservationStart: for the start time;
- ReservationEnd: for the end time;
- ReservationDuration: for the duration.
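Since the start time is the earliest time the co-allocation may begin and the end time the latest time it can expire, the three time-frame attributes constrain each other. A minimal sketch of that relation, assuming all values are integer seconds since the epoch and that the duration must fit inside the window (the TimeFrame class and its method names are hypothetical, not part of JDL):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeFrame:
    start: int     # ReservationStart: earliest possible begin, seconds since the epoch
    end: int       # ReservationEnd: latest possible expiry, seconds since the epoch
    duration: int  # ReservationDuration: length of the allocation in seconds

    def is_consistent(self) -> bool:
        # Assumption: the requested duration must fit between start and end.
        return self.duration > 0 and self.start + self.duration <= self.end

    def latest_start(self) -> int:
        # Latest moment the allocation can begin and still expire by `end`.
        return self.end - self.duration
```

For instance, a one-hour allocation (duration 3600) in a window running from 1000 to 5000 is consistent and can start no later than 1400.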

The description of a resource contains the following attributes (for a more detailed description see section 4.1.1):
- ReservationResource: for the resource type;
- ReservationType: for the reservation type;
- ReservationParameters: for the resource-dependent parameters.

Both at co-allocation and reservation level, the JDL expression should contain a Rank and especially a Requirements expression. In particular, the Requirements expression for a reservation must require that the considered resource support reservation. A JDL expression specifies a co-allocation when its Type attribute equals "coallocation".

As an example, the following JDL expression specifies a co-allocation request for a computing node, 100 GB of storage in a storage element speaking a certain protocol (gridftp), and a connection of 10 MB/s between the considered computing element and storage element. Note that the data set used as input by the job that will use this co-allocation is also specified (attribute InputData), along with the identifier of the Replica Catalog (RC) this data set refers to, so that the system will try to find the best CE-SE match with respect to the I/O access to these data:

    [
      Type = "coallocation";
      ReservationStart = ;
      ReservationEnd = ;
      ReservationDuration = 3600;
      Res1 = [
        Type = "Reservation";
        ResourceType = "Computing";
        ReservationParameters = [ nodes = 3; ];
        Requirements = other.arch == "i386" && other.opsys == "Linux" && other.supportreservation;
        InputData = "LF:testbed ";
        ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=infn Test RC,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
      ];
      Res2 = [
        Type = "Reservation";
        ResourceType = "Storage";
        ReservationParameters = [ space = ; ];
        Requirements = other.supportreservation && other.freespace > ReservationParameters.space && other.protocol == "gridftp";
      ];
      Res3 = [
        Type = "Reservation";
        ResourceType = "Network";
        ReservationParameters = [
          Bandwidth = 10000;
          EndPoints = {Res1.CeId, Res2.SEId};
        ];
        Requirements = other.supportreservation;
      ]
    ]

Application Programming Interface

The Co-allocation Agent provides its clients with an Application Programming Interface (API) that allows management of co-allocations:
- create_allocation(): creates a co-allocation, applying the 3-step procedure previously described. The result is a co-allocation represented by a set of reservations;
- cancel_allocation(): cancels a co-allocation by cancelling all the reservations belonging to the specified co-allocation;
- modify_allocation(): modifies the time frame parameters of the allocation;
- status_allocation(): returns the status of the co-allocation, in terms of the status of the associated reservations.
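The 3-step procedure that create_allocation() applies could look roughly like the sketch below. It is a simplified illustration, not the agent's actual algorithm: the function co_allocate() and its callback parameters are hypothetical, step 1 (discovery) is assumed to have already produced a rank-ordered candidate list for every requested resource, each list is assumed to be non-empty, and combinations are tried in plain enumeration order rather than according to any optimisation criterion.

```python
from itertools import product
from typing import Callable, Dict, List, Optional


def co_allocate(candidates: Dict[str, List[str]],
                critical: Dict[str, bool],
                try_reserve: Callable[[str, str], bool],
                cancel: Callable[[str, str], None]) -> Optional[Dict[str, str]]:
    """Return the granted reservations of the first successful combination."""
    names = list(candidates)
    # Step 2: a combination picks one candidate for each requested resource.
    for combo in product(*(candidates[name] for name in names)):
        granted: Dict[str, str] = {}
        critical_failure = False
        # Step 3: within a combination, resources are tried sequentially.
        for name, resource in zip(names, combo):
            if try_reserve(name, resource):
                granted[name] = resource
            elif critical[name]:
                critical_failure = True  # a critical resource failed
                break
        if not critical_failure:
            return granted
        # Release what was already reserved before trying the next combination.
        for name, resource in granted.items():
            cancel(name, resource)
    return None  # all combinations failed
```

Note how the "critical" flag is interpreted here: a failed non-critical reservation is simply left out of the result, while a failed critical one discards the whole combination.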

The individual reservations can also be directly manipulated through the reservation system API described earlier for the Reservation Agent.

A co-allocation is addressable with an identifier. This handle is used whenever the owner wants to operate on the co-allocation. The identifier, assigned on the client side at the moment the co-allocation request is generated, is used like a job identifier (see [R1]). The individual reservations that are part of the co-allocation are identified in the same way as simple reservations.

INTERNAL DESIGN

It is felt that resource co-allocation is best implemented on top of the resource reservation mechanisms. As described in section 4, a Reservation Agent performs two actions: resource discovery and resource reservation, the latter through resource-specific Reservation Agents.

In the context of co-allocation, resource discovery must be implemented by the Co-allocation Agent: it cannot be delegated to the Reservation Agent, because a more global view of the available resources is needed. The resource reservation phase can then be based directly on the resource-specific Reservation Agents, whose interface is required to be public.

In order to optimise the co-allocation process described in section 5 (in particular step 3 of the procedure), especially if the failure rate (i.e. failure to find suitable co-allocations) happens to be excessively high, the agent may try several combinations of resources in parallel. If the reservation process is implemented as a two-phase commit, the co-allocator can try several possibilities in parallel without committing. It can then choose the best co-allocation, commit the corresponding reservations and forget about the others. If such a feature is not available (i.e. create_reservation() really allocates the resource), the co-allocator should then cancel the unneeded reservations.

DEPENDENCIES

The co-allocation framework depends on services provided by other components:
- the Resource Broker finds resources that match the requirements and preferences of a co-allocation request;
- resource-specific Reservation Agents provide the primitives for managing resource reservations;
- the Logging and Bookkeeping Service records the transitions of the state of the co-allocation.
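The parallel optimisation mentioned in the internal design (try several combinations without committing, keep the best, drop the rest) could be sketched as follows. Everything here is illustrative: choose_best() and its prepare, commit, release and score callbacks are hypothetical, threads merely stand in for concurrent requests to the resource-specific agents, and the sketch assumes that two combinations can tentatively hold the same resource at the same time, which a real agent would have to verify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence, Tuple

Combination = Tuple[str, ...]


def choose_best(combinations: Sequence[Combination],
                prepare: Callable[[Combination], bool],
                commit: Callable[[Combination], None],
                release: Callable[[Combination], None],
                score: Callable[[Combination], float]) -> Optional[Combination]:
    """Prepare all combinations in parallel, commit the best, release the others."""
    with ThreadPoolExecutor() as pool:
        # Phase 1: tentative reservations, attempted concurrently.
        outcomes = list(pool.map(prepare, combinations))
    held = [combo for combo, ok in zip(combinations, outcomes) if ok]
    if not held:
        return None
    best = max(held, key=score)
    commit(best)  # Phase 2: only the selected combination is committed.
    for combo in held:
        if combo is not best:
            # With two-phase commit this means "never commit"; without it,
            # release() would have to cancel a real reservation.
            release(combo)
    return best
```

The score callback is where a retention policy (which successful co-allocation to keep) would be plugged in.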

5.4. EVALUATION CRITERIA FOR RESOURCE CO-ALLOCATION

The process to acquire a co-allocation described above requires various decisions to be taken, such as:
- Which preferences and requirements, besides those expressed by the user, have to be taken into account during the resource discovery phase (e.g. the availability of data, the overall status of the Grid, etc.)?
- How to decide if it is worth trying a particular set of compatible resources? For example, given a computing-storage co-allocation and a pair (CE, SE) satisfying the requirements for this co-allocation, the agent may decide that the available bandwidth between the CE and the SE is not suitable for the application that will use the co-allocation, and can therefore decide to ignore that combination.
- Given a compatible combination of resources, in which order to try each reservation, and how to bind some parameters given the result of previous reservations in the same combination? For example, if the co-allocation requires some QoS on two Computing Elements and on the network link between them, the agent can decide to reserve first the CEs and then the network between them. In this case, when trying the reservation on the network link, the two endpoints are already bound. On the other hand, if the agent decides to reserve first a suitable network link, then the parameter for the two subsequent computing reservations is fixed.
- How to decide, in a set of successful co-allocations, which one to retain? Multiple co-allocations may be successful for the same request. In this case the agent has to decide which one to retain, releasing (or not committing) the others.

The agent design shall allow different strategies to be adopted, possibly at the same time.
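One possible ordering strategy, among the several the agent design is meant to allow, is to reserve first the resources whose parameters do not refer to other reservations and to bind the remaining parameters from the results. The sketch below illustrates only that policy; the dependency map, reservation_order(), reserve_in_order() and the reserve callback are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def reservation_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order reservations so that each one follows those it refers to."""
    return list(TopologicalSorter(depends_on).static_order())


def reserve_in_order(depends_on: Dict[str, Set[str]],
                     reserve: Callable[[str, Dict[str, object]], object]) -> Dict[str, object]:
    """Reserve in dependency order, passing earlier results as bound parameters."""
    granted: Dict[str, object] = {}
    for name in reservation_order(depends_on):
        # Results of earlier reservations that this one refers to,
        # e.g. the two endpoints of a network link.
        bound = {dep: granted[dep] for dep in depends_on.get(name, ())}
        granted[name] = reserve(name, bound)
    return granted
```

With a map such as {"Res1": set(), "Res2": set(), "Res3": {"Res1", "Res2"}}, the network reservation Res3 is attempted last, once both of its endpoints are known. The opposite policy (network link first) would simply invert the dependencies.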
The behaviour of the co-allocation agent will be evaluated considering various parameters, including the following:
- the rate of successful co-allocation requests versus the total number of requests;
- the average time needed to process a successful co-allocation request;
- the average time needed to process an unsuccessful co-allocation request;
- the number of combinations attempted before finding a successful co-allocation.

Much work still needs to be done in this area and a detailed plan is therefore still lacking.
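As an illustration of how the evaluation parameters listed above could be computed from a log of processed requests, consider the following sketch. The RequestRecord fields and the summarize() function are hypothetical; the document does not define a log format, and averaging the number of combinations over successful requests is just one way to report that parameter.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class RequestRecord:
    succeeded: bool            # whether a co-allocation was granted
    processing_seconds: float  # time spent processing the request
    combinations_tried: int    # combinations attempted for this request


def summarize(records: List[RequestRecord]) -> Dict[str, Optional[float]]:
    """Compute the evaluation parameters; None when there is no data."""
    ok = [r for r in records if r.succeeded]
    failed = [r for r in records if not r.succeeded]
    return {
        "success_rate": len(ok) / len(records) if records else None,
        "avg_time_success": mean([r.processing_seconds for r in ok]) if ok else None,
        "avg_time_failure": mean([r.processing_seconds for r in failed]) if failed else None,
        "avg_combinations_before_success": mean([r.combinations_tried for r in ok]) if ok else None,
    }
```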
