Designing a Java-based Grid Scheduler using Commodity Services

Patrick Wendel (InforSense, London) patrick@inforsense.com
Arnold Fung (InforSense, London) arnold@inforsense.com
Moustafa Ghanem (Computing Department, Imperial College London) mmg@doc.ic.ac.uk
Yike Guo (Computing Department, Imperial College London) yg@doc.ic.ac.uk

Abstract

Common approaches to implementing Grid schedulers usually rely directly on relatively low-level protocols and services, in order to achieve better performance by keeping full control of file and network usage patterns. Following this approach, schedulers become bound to particular network protocols, communication patterns and persistence layers. With the availability of standardised, high-level application hosting environments that provide a level of abstraction between the application and the resources and protocols it uses, we present the design and implementation of a Java-based, protocol-agnostic scheduler for Grid applications built on commodity services for messaging and persistence. We also show how it can be deployed following two different strategies, either as a scheduler for a campus grid or as a scheduler for a wide-area network grid.

1 Motivation

This project was started as part of the development of the Discovery Net platform [1], a workflow-based platform for the analysis of large-scale scientific data. The platform's architecture consists of one or more workflow execution servers and a workflow submission server tightly coupled with an interactive client tool for building, executing and monitoring workflows. The client tool therefore benefits from the ability to exchange complex objects, as well as code, with the other components of the system, allowing rich interaction between these components. Interoperability of the workflow server with other services within a more loosely-coupled Grid architecture is enabled by providing a set of stateless services accessible through Web Services protocols, although only for a subset of the functionality.

In addition, the platform relies on the services of a Java application server to provide a hosting environment for the workflow activities to be executed. This environment can, for instance, provide an activity with support for authentication and authorisation management, or logging. However, such an approach complicates integration with schedulers based on native process submission, which is the usual case, because each workflow execution has to run within a hosting Java-based environment that provides a set of services above the operating system. Equally, it is difficult to reuse the clustering features usually provided by Java application servers, as they are designed for short executions, typically transaction-based web applications, over a closed cluster of machines.

2 Approach

Instead of trying to integrate schedulers built around command-line tools and native processes, our approach is to build a generic scheduler for long-running tasks using the services provided by the hosting environment. In particular, the availability of a messaging service providing both point-to-point and publish/subscribe models, container-managed persistence for long-lived objects, and the ability to bind object types to specific point-to-point message destinations are of interest for building the scheduler.
It follows from this approach that the implementation: requires only a few classes; does not access I/O and resources directly; is network-protocol agnostic; and is agnostic to the Java application server it runs atop. The scheduling policy resides in the messaging service's handling of the subscribers to its point-to-point model. The service provider used in our experiments allows this handling to be configured and extended, making it possible to plug in various scheduling algorithms or services. As an example, the Sun Grid Engine was used to decide which resource the scheduler should choose.
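
As a minimal illustration of the two JMS models the scheduler relies on, the sketch below sends a job request to a point-to-point Queue and publishes a status update to a publish/subscribe Topic. The connection factory and destination names (ConnectionFactory, queue/JobQueue, topic/StatusTopic) are hypothetical placeholders, not the names used in the actual deployment.

    import javax.jms.*;
    import javax.naming.InitialContext;

    // Minimal sketch of the two JMS models, assuming hypothetical JNDI names.
    public class MessagingSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();

            // Point-to-point: enqueue a job request; exactly one subscriber receives it.
            QueueConnectionFactory qcf = (QueueConnectionFactory) ctx.lookup("ConnectionFactory");
            Queue jobQueue = (Queue) ctx.lookup("queue/JobQueue");
            QueueConnection qc = qcf.createQueueConnection();
            QueueSession qs = qc.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            TextMessage job = qs.createTextMessage("job-42");   // e.g. the ID of a persisted job
            qs.createSender(jobQueue).send(job);
            qc.close();

            // Publish/subscribe: broadcast a status update; every subscriber receives it.
            TopicConnectionFactory tcf = (TopicConnectionFactory) ctx.lookup("ConnectionFactory");
            Topic statusTopic = (Topic) ctx.lookup("topic/StatusTopic");
            TopicConnection tc = tcf.createTopicConnection();
            TopicSession ts = tc.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            TextMessage status = ts.createTextMessage("job-42:RUNNING");
            ts.createPublisher(statusTopic).publish(status);
            tc.close();
        }
    }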

3 Container services

The design is based around a set of four standard mechanisms available in Java-based application servers, part of the Enterprise JavaBeans [3] (EJB) and Java Message Service [4] (JMS) specifications, as shown in Figure 1.

Figure 1: Container Services

3.1 Stateless Objects

Stateless remote objects, also known as Stateless Session Beans, have a very simple lifecycle as they cannot keep any state. The container provides support for accessing these objects over several protocols:

RMI/JRMP: for Java-based systems, allowing any serialisable Java object to be exchanged.
RMI/IIOP: to support CORBA-IIOP interoperability.
SOAP/WSDL: to support Web Services interoperability.

3.2 Persistence Management for Stateful Objects

This service allows the container to manage the lifecycle and persistence of stateful objects. The objects are mapped into a relational database using a predefined mapping, and the container is responsible for keeping the database instance and the in-memory object consistent. This service is provided as Container-Managed Persistence for Entity Beans (a sketch is given at the end of this section).

3.3 Messaging

JMS is a messaging service that supports both a point-to-point model, using Queue objects, and a publish/subscribe model, using Topic objects. JMS providers are responsible for the network protocol they use to communicate and deliver messages. In particular, the provider we used, JBossMQ, supports the following communication protocols:

RMI/JRMP: allows faster communication by pushing notifications to the subscriber, but requires the subscriber to be able to export RMI objects. This adds constraints on the network architecture, as the machine exporting the object must know the IP address or name by which the caller can reach it. This is the main reason why this protocol cannot easily be used for WAN deployments where the subscribers to the messaging service are behind a firewall or belong to a network using NAT.
HTTP: the subscriber side regularly pulls information from the messaging provider. This approach solves the issue discussed above, but is not as efficient as pushing each notification as it happens.

3.4 Message-driven Objects

Associated with the messaging service, special object types can be registered to be instantiated to handle messages arriving on a Queue object (point-to-point model), removing the need for subscribers to act as factories for the instances that actually process the objects taken from the queue.

3.5 Security Modules

The authentication and authorisation mechanism for Java containers [8] (JAAS) supports the definition of authentication policies as part of the configuration of the container and the application descriptors, instead of being coupled to the application code itself. It also allows authorisation information to be defined, such as the roles associated with a user. Modules can be defined to authenticate access to components using most standard mechanisms, such as LDAP-based authentication infrastructures, NT authentication and UNIX authentication, and support for Shibboleth [10] is also available. In our application, this facility is used to enable secure propagation of authentication information from the submission server to the execution server.
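
As an illustration of the container-managed persistence mechanism of Section 3.2, the sketch below shows the shape of an EJB 2.x CMP entity bean: the container generates the concrete subclass and maps the abstract accessors onto database columns according to the deployment descriptor, so the bean class never touches JDBC. The field names anticipate the JobEntity of Section 4.1 but are illustrative only.

    import javax.ejb.EntityBean;
    import javax.ejb.EntityContext;

    // Sketch of a container-managed persistence entity bean; field names are illustrative.
    public abstract class JobEntityBean implements EntityBean {
        // Container-managed fields (mapped to columns by the container)
        public abstract String getJobId();
        public abstract void setJobId(String id);
        public abstract String getExecutionStatus();
        public abstract void setExecutionStatus(String status);
        public abstract long getLastStatusUpdateTime();
        public abstract void setLastStatusUpdateTime(long time);

        public String ejbCreate(String jobId) throws javax.ejb.CreateException {
            setJobId(jobId);
            setExecutionStatus("SUBMITTED");
            setLastStatusUpdateTime(System.currentTimeMillis());
            return null; // for CMP beans the key is derived from the managed fields
        }
        public void ejbPostCreate(String jobId) { }

        // Lifecycle callbacks required by the EntityBean interface
        public void setEntityContext(EntityContext ctx) { }
        public void unsetEntityContext() { }
        public void ejbActivate() { }
        public void ejbPassivate() { }
        public void ejbLoad() { }
        public void ejbStore() { }
        public void ejbRemove() { }
    }
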
4 Design

4.1 Architecture

The scheduler was designed to be applied to workflow executions. One of the differences between the submission of workflows and the submission of executables and scripts invoked through command lines, as is often the case for job submission, is the size of the workflow description. The workflow can represent a complex process, and its entire description needs to be submitted for execution. The system must therefore ensure that the workflow is reliably stored in a database before performing the execution, as the risk of failure at that stage is greater.

The overall architecture is shown in Figure 2. Both Web and thick clients talk to the TaskManagement service, a stateless service implemented by a stateless session bean for job submission, control and some basic level of monitoring. The client also connects to the messaging service to receive monitoring information about the execution.
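
The client-side interaction just described might look like the sketch below: the client looks up the remote TaskManagement service through JNDI and subscribes to the Status topic, filtering on a message property so that it only receives updates for its own jobs. The JNDI names, the commented-out interface names and the "user" property are assumptions made for illustration, not the actual names used by the platform.

    import javax.jms.*;
    import javax.naming.InitialContext;

    // Client-side sketch: submit through the stateless TaskManagement service and
    // listen for status updates on the Status topic. All names are hypothetical.
    public class ClientSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();

            // The remote stateless service lookup and submission would happen here, e.g.:
            // TaskManagementHome home = (TaskManagementHome) ctx.lookup("TaskManagement");
            // TaskManagement tm = home.create();
            // String jobId = tm.submit(workflowDefinition);

            // Subscribe to the Status topic, restricted to this user's jobs.
            TopicConnectionFactory tcf = (TopicConnectionFactory) ctx.lookup("ConnectionFactory");
            Topic statusTopic = (Topic) ctx.lookup("topic/StatusTopic");
            TopicConnection tc = tcf.createTopicConnection();
            TopicSession ts = tc.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            TopicSubscriber sub = ts.createSubscriber(statusTopic, "user = 'alice'", false);
            sub.setMessageListener(new MessageListener() {
                public void onMessage(Message m) {
                    // update the monitoring view with the new execution status
                }
            });
            tc.start(); // start delivery of status notifications
        }
    }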

The TaskManagement service is hosted by a container that also provides hosting for the JobEntity bean, a container-managed persistence entity bean stored in a common persistent storage, and access to the JMS service providers for the Job queue, the Job topic and the Status topic. The server hosting this service is also called the submission server. A JobEntity instance has the following main variables: unique ID, execution status, workflow definition, workflow status information, last status update time, start date, end date and user information.

Execution servers host a pool of message-driven ExecutionHandler beans which, on receiving a message from the Job queue, subscribe to messages added to the Job topic. The pool size represents the maximum number of tasks that the execution server allows to be processed concurrently. The container hosting the execution server also has access to the same persistent storage and messaging service providers as the submission server, and so can host instances of Job entities.

Figure 2: Architecture

4.2 Submission

The following sequence of events happens when a workflow is submitted for execution (see Figure 3):

1. The client submits the workflow to the TaskManagement service.
2. The service creates a new JobEntity object, which is transparently persisted by the container in the database.
3. It then publishes a request for execution to the Job queue and returns the ID of the JobEntity to the caller.
4. The execution request is picked up by one of the ExecutionHandlers of any execution server, following the allocation policy of the JMS provider.
5. The ExecutionHandler subscribes to the Job topic, selecting only messages related to the ID of the JobEntity it must handle.
6. The ExecutionHandler then instantiates the JobEntity object and starts its execution.

Figure 3: Job Submission

4.3 Control

The following sequence of events happens when the user sends a control command, such as pause, resume, stop or kill, to the execution handler (see Figure 4):

1. The TaskManagement service receives the request for a control command on a given JobEntity ID.
2. If the request to execute that JobEntity is no longer in the Job queue and its state is running, it posts the control request to the Job topic.
3. The listening ExecutionHandler receives the notification and performs the control action on the JobEntity, which modifies its execution status accordingly.

Sketches of the submission-side and execution-handler-side handling of these sequences are given below.

Figure 4: Job Control
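
A sketch of how the TaskManagement session bean could implement the submission (Section 4.2) and control (Section 4.3) operations is shown below, assuming hypothetical JobEntity local interfaces, destination names and a "jobId" message property; the actual interfaces of the implementation are not reproduced here.

    import javax.ejb.SessionBean;
    import javax.ejb.SessionContext;
    import javax.jms.*;
    import javax.naming.InitialContext;

    // Sketch of the submission and control operations of the stateless session bean.
    // Home/remote interfaces and error handling are omitted for brevity.
    public class TaskManagementBean implements SessionBean {

        // Submission (Section 4.2, steps 2-3): persist the workflow, then enqueue its ID.
        public String submit(String workflowDefinition) throws Exception {
            InitialContext ic = new InitialContext();
            JobEntityLocalHome jobs =
                (JobEntityLocalHome) ic.lookup("java:comp/env/ejb/JobEntity");
            JobEntityLocal job = jobs.create(workflowDefinition); // persisted by the container
            sendText((Queue) ic.lookup("queue/JobQueue"), job.getJobId(), null);
            return job.getJobId();
        }

        // Control (Section 4.3): publish a control command for a running job to the Job topic.
        public void control(String jobId, String command) throws Exception {
            InitialContext ic = new InitialContext();
            sendText((Topic) ic.lookup("topic/JobTopic"), command, jobId);
        }

        private void sendText(Destination dest, String text, String jobId) throws Exception {
            InitialContext ic = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ic.lookup("ConnectionFactory");
            Connection c = cf.createConnection();
            try {
                Session s = c.createSession(false, Session.AUTO_ACKNOWLEDGE);
                TextMessage m = s.createTextMessage(text);
                if (jobId != null) m.setStringProperty("jobId", jobId); // matched by topic selectors
                s.createProducer(dest).send(m);
            } finally {
                c.close();
            }
        }

        public void setSessionContext(SessionContext ctx) { }
        public void ejbCreate() { }
        public void ejbRemove() { }
        public void ejbActivate() { }
        public void ejbPassivate() { }
    }

    // Hypothetical local interfaces for the container-managed JobEntity (simplified).
    interface JobEntityLocal extends javax.ejb.EJBLocalObject { String getJobId(); }
    interface JobEntityLocalHome extends javax.ejb.EJBLocalHome {
        JobEntityLocal create(String workflowDefinition) throws javax.ejb.CreateException;
    }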

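On the execution-server side, the ExecutionHandler of Section 4.2 (steps 4-6) could take the following shape as a message-driven bean; the jobId selector on the Job topic subscription is also what delivers the control commands of Section 4.3. Class, destination and property names are again hypothetical.

    import javax.ejb.MessageDrivenBean;
    import javax.ejb.MessageDrivenContext;
    import javax.jms.*;
    import javax.naming.InitialContext;

    // Sketch of an ExecutionHandler message-driven bean: it receives a job ID from
    // the Job queue, subscribes to the Job topic for control commands addressed to
    // that ID, and starts the execution. Names and properties are illustrative.
    public class ExecutionHandlerBean implements MessageDrivenBean, MessageListener {
        private MessageDrivenContext ctx;

        public void onMessage(Message message) {
            try {
                String jobId = ((TextMessage) message).getText();

                // Step 5: listen on the Job topic, but only for this job's control messages.
                InitialContext ic = new InitialContext();
                TopicConnectionFactory tcf = (TopicConnectionFactory) ic.lookup("ConnectionFactory");
                Topic jobTopic = (Topic) ic.lookup("topic/JobTopic");
                TopicConnection tc = tcf.createTopicConnection();
                TopicSession ts = tc.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
                TopicSubscriber control = ts.createSubscriber(jobTopic, "jobId = '" + jobId + "'", false);
                control.setMessageListener(new MessageListener() {
                    public void onMessage(Message m) {
                        // pause/resume/stop/kill the running workflow (Section 4.3, step 3)
                    }
                });
                tc.start();

                // Step 6: load the persisted JobEntity and start executing the workflow, e.g.
                // JobEntityLocal job = home.findByPrimaryKey(jobId); ...
            } catch (Exception e) {
                ctx.setRollbackOnly(); // let the container redeliver the job request on failure
            }
        }

        public void setMessageDrivenContext(MessageDrivenContext c) { this.ctx = c; }
        public void ejbCreate() { }
        public void ejbRemove() { }
    }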

4.4 Monitoring

As the scheduler is used to execute workflows that must be monitored from a client tool, the monitoring mechanism needs to support relatively large workflow status information, which may include complex activity-specific objects describing the status of each running activity in the workflow. The sequence of events for monitoring an execution is as follows (see Figure 5):

1. The ExecutionHandler for a running JobEntity regularly requests the latest workflow status information from the running workflow.
2. If that status has changed since the last status update, the state of the JobEntity is updated and the new execution status and workflow status information are published to the Status topic.
3. While the client tool is running, it subscribes to publications on the Status topic for the tasks submitted by the current user, and therefore receives the notification and the associated status information.

The policy followed by the ExecutionHandler to schedule requests for the latest workflow status information is based on a base period and a maximum period. The status is requested according to the base period, except when the status has not changed, in which case the update period is doubled, up to the maximum period.

Figure 5: Job Monitoring

4.5 Failure detection

One problem with the de-coupled architecture presented is that there is no immediate notification that an execution server has failed, as it only communicates through the messaging service, which by default does not provide the application with information about its subscribers. The approach used to detect failures of an execution server is to check regularly, on the server hosting the TaskManagement service, the last status update time of all running JobEntity instances. If that update time is significantly above the maximum update period, the job is stopped, killed if necessary, and restarted.

4.6 Security

In order to make sure that workflows run in the correct context and have the correct associated roles and authorisation, the ExecutionHandler needs to impersonate the user who submitted the workflow. This is implemented by a specific JAAS module that enables this impersonation for a particular security policy used by the ExecutionHandler to log in.

4.7 Scheduling Policy

The scheduling policy is defined by the way the JMS service provider decides which ExecutionHandler should receive requests added to the Job queue. Several policies are provided by default with the JMS provider that was used; in particular, we used a simple round-robin policy at first. In order to integrate with a resource management tool such as the Grid Engine [11], the scheduling policy was extended: assuming that the set of execution servers is the same as the set of resources managed by the grid engine, a request to execute the command hostname is submitted to the grid engine and the information returned is used to choose the relevant subscriber. Although these policies can be sufficient in general, for our workflow execution application another policy was created to make sure that we use the resource holding any intermediate results that have already been computed in the workflow, in order to optimise its execution. In this case, the policy checks the workflow description associated with the request to find any intermediate results associated with any activity, and then decides to use the corresponding server if possible. A sketch of this selection logic is given below.
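
The provider-specific interface through which JBossMQ exposes receiver selection is not reproduced here; the sketch below instead uses a hypothetical Candidate type purely to illustrate the selection logic described above: prefer the execution server that already holds intermediate results for the workflow, otherwise fall back to round-robin.

    import java.util.List;

    // Illustration of the extended scheduling policy (Section 4.7), written against
    // hypothetical types: Candidate stands for a subscriber/execution server known
    // to the JMS provider. The real policy plugs into the provider's own
    // receiver-selection mechanism, which is not shown here.
    public class IntermediateResultPolicy {
        private int next; // round-robin cursor

        public static class Candidate {
            public final String host;
            public Candidate(String host) { this.host = host; }
        }

        // Picks the execution server for a job: if the workflow description names a
        // host that already holds intermediate results and that host is among the
        // candidates, choose it; otherwise use a simple round-robin policy.
        public Candidate select(List<Candidate> candidates, String intermediateResultHost) {
            if (intermediateResultHost != null) {
                for (Candidate c : candidates) {
                    if (c.host.equals(intermediateResultHost)) {
                        return c; // data locality: reuse the server holding the results
                    }
                }
            }
            Candidate chosen = candidates.get(next % candidates.size());
            next++;
            return chosen;
        }
    }
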
5 Deployment

The persistence service provider used is HSQL [5]. It provides support for the persistence of basic types as well as Java objects. The messaging service is JBossMQ [7]. Messages are stored, if needed, in the same HSQL database instance used for the persistence of container-managed entity objects.

5.1 Campus Grid scheduler

The first deployment of the scheduler uses protocols that are only suitable over an open network without communication restrictions or network address translation (NAT) in any part of it. This is usually the case for deployments inside a single organisation. The RMI protocol can be used for simplicity and efficiency, and direct connection to, and notification of, the client tool can be performed without the risk of network configuration issues. This setup is described in Figure 6.

Figure 6: Deployment as Campus Grid Scheduler

5.2 Scheduling over WAN

The second deployment of the scheduler uses HTTP tunnelling for method calls and an HTTP-based polling mechanism for queue and topic subscribers, as shown in Figure 7. The main advantage is that the client does not need an IP address reachable by the messaging service, as there is no direct call-back. This means that although the client tool, in our application, performs rich interactions with the workflow, it does not have to be on the same network as the task management server. The execution servers also do not need a public IP address, which makes it theoretically possible, given the right software delivery mechanism for the execution server, to use the scheduler in configurations such as the one supported by the SETI@home [2] scheduler. Other configuration values need to be modified to support such a deployment. In particular, the time-out and retry values for connections to the persistence manager and the messaging service need to be increased, as network delays or even network failures are more likely.

Figure 7: WAN Deployment
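
As an illustration of how the two deployments can differ only in client-side configuration, the sketch below builds an InitialContext for each case. The host names are placeholders, and the factory classes and URLs are typical JBoss 3.x/4.x-era defaults that would need to be checked against the actual provider configuration.

    import java.util.Properties;
    import javax.naming.Context;
    import javax.naming.InitialContext;

    // Sketch of client-side JNDI settings for the two deployment strategies.
    // Host names are placeholders; factory classes and URLs are typical JBoss
    // defaults of the period and may differ in a real installation.
    public class DeploymentContexts {

        // Campus grid (Section 5.1): direct JNP/RMI access on the standard naming port.
        public static InitialContext campusGrid() throws Exception {
            Properties p = new Properties();
            p.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
            p.put(Context.PROVIDER_URL, "jnp://submission.example.org:1099");
            return new InitialContext(p);
        }

        // WAN (Section 5.2): everything tunnelled over HTTP, so only an outbound
        // HTTP connection to the submission server is required.
        public static InitialContext wan() throws Exception {
            Properties p = new Properties();
            p.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.HttpNamingContextFactory");
            p.put(Context.PROVIDER_URL, "http://submission.example.org:8080/invoker/JNDIFactory");
            return new InitialContext(p);
        }
    }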


6 Evaluation

6.1 Functional Evaluation

We evaluate each element of the architecture to assess its scalability and robustness properties:

Task Management Service: This service is stateless and can therefore easily be made highly available through standard load-balancing and clustering techniques. If it fails, the system carries on processing the jobs already submitted. Only the clients currently connected to it will be unable to submit and control workflows, and the check for failed execution servers will not be performed.

Execution Server: The number of concurrent execution servers is only limited by the maximum number of subscribers that the messaging service supports and the maximum number of connections that the persistence provider can handle. In case of failure, the job is not lost: once the failure is detected, the task is resubmitted to the queue.

Messaging Service Provider: This is a service provided to our implementation. Its robustness and scalability characteristics depend on its implementation and on the database it uses for persistence.

Persistence Service Provider: As for the previous provider, the robustness and scalability characteristics of the database vary between providers. While we used HSQL, which does not provide specific failover, scalability or high-availability features, many database vendors provide such capabilities.

6.2 Experimental Evaluation

We have implemented and tested the scheduler using a variety of Discovery Net bioinformatics and cheminformatics application workflows. A complete empirical evaluation of the scheduling is beyond the scope of this paper, since it would have to take into account the characteristics of the application workflows themselves. However, in this section we provide a brief overview of the experimental setting used to test and evaluate the scheduler implementation.

The scheduler was deployed for testing over an ad-hoc, heterogeneous set of machines. The submission server was hosted on a Linux server that also held the persistence and messaging services. The execution servers ran on a set of 15 single-processor Windows desktop machines on the same organisation's network, without restrictions. The scheduler sustained constant overnight submission and execution of workflows, with each execution server handling a maximum of 3 concurrent executions, without apparent bottlenecks. It has also been deployed over a cluster of 12 IBM blade servers running Linux, where the execution servers run on a private network not accessible from the client machine. Finally, it was also deployed over a WAN and networks with NAT: the client was hosted on an organisation's internal network in the UK, with only a private IP address; the submission server was hosted in the US on a machine with a public IP address; and the execution servers were hosted on a private network in the US, although directly accessible by the submission server. Even though the use of tunnelling and pull mechanisms does affect the overall feel of the client application, in terms of latency when submitting and monitoring jobs and particularly when visualising results, because of increased communication overheads, it does not affect the performance of the workflow execution itself, which is the main concern. The main potential bottlenecks needing further investigation are the behaviour of the messaging service with an increasing number of subscribers, in particular execution servers, and with a growing size of workflow descriptions.
7 Comparison

The Java CoG Kit [12] is a Java wrapper around the Globus toolkit and provides a range of functionality for Grid applications, including job submission. It is therefore based on native process execution, whereas our approach is to distribute executions of entities that run in a Java hosting environment. The Java CoG Kit could, however, be used within the proposed scheduler as a way to implement the scheduling policy for the Job queue, but not to submit the jobs directly.

The Grid Application Toolkit [9] has a Java wrapper called JavaGAT. This interface wraps the main native GAT engine, which itself aims to provide a consistent interface layer above several Grid infrastructures such as Globus, Condor and Unicore. Again, the difference in approach lies in the submission of native process executions over the grid, instead of handling the process at a higher level and leaving the scheduling, potentially, to a natively implemented Grid or resource management service.

ProActive [6] takes a Java-oriented approach by providing a library for parallel, distributed and concurrent computing that interoperates with several Grid standards. While it follows the same general approach as ours, we based our engineering on Java commodity messaging and persistence services, so that the robustness of the system and the network protocols it uses depend mainly on these services rather than on our implementation.

8 Conclusion

Being able to build a robust Java-based scheduler on top of commodity services could enable a wider range of Grid applications to benefit from the rich framework provided by Java application servers, and help to simplify their implementation. This paper presents a possible way to implement such a scheduler within this framework, as well as its deployment and some of its robustness characteristics.

9 Acknowledgements

The authors would like to thank the European Commission for funding this research through the SIMDAT project.

References

[1] S. AlSairafi, F.-S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopoulos, M. Osmond, A. Rowe, J. Syed, and P. Wendel. The design of Discovery Net: Towards open grid services for knowledge discovery. International Journal of High Performance Computing Applications, 17, Aug. 2003.
[2] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer. SETI@home: an experiment in public-resource computing. Communications of the ACM, 45(11):56-61, 2002.
[3] Enterprise JavaBeans Technology. http://java.sun.com/products/ejb.
[4] M. Hapner, R. Burridge, R. Sharma, J. Fialli, and K. Stout. Java Message Service. Sun Microsystems, Inc., Palo Alto, CA, 2002.
[5] HSQL. http://www.hsqldb.org.
[6] F. Huet, D. Caromel, and H. E. Bal. A high performance Java middleware with a real application. In Proceedings of SC 2004, Pittsburgh, PA, Nov. 2004. IEEE/ACM.
[7] JBossMQ. http://www.jboss.com/products/messaging.
[8] C. Lai, L. Gong, L. Koved, A. Nadalin, and R. Schemers. User authentication and authorization in the Java platform. In Proceedings of the 15th Annual Computer Security Applications Conference, pages 285-290, Scottsdale, Arizona, Dec. 1999. IEEE Computer Society Press.
[9] E. Seidel, G. Allen, A. Merzky, and J. Nabrzyski. GridLab: a grid application toolkit and testbed. Future Generation Computer Systems, 18(8):1143-1153, Oct. 2002.
[10] Shibboleth-aware Portals and Information Environments (SPIE) project. http://spie.oucs.ox.ac.uk/.
[11] Sun Grid Engine. http://gridengine.sunsource.net.
[12] G. von Laszewski, I. T. Foster, J. Gawor, and P. Lane. A Java commodity grid kit. Concurrency and Computation: Practice and Experience, 13(8-9):645-662, 2001.