Batch Queuing and Resource Management for PVM Applications in a Network of Workstations

Ursula Maier, Georg Stellner, Ivan Zoraja
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM)
Institut für Informatik, Technische Universität München
{maier,stellner,zoraja}@informatik.tu-muenchen.de

Abstract. A resource management system can effectively shorten the runtime of batch jobs in a network of workstations (NOW). This is achieved with load balancing mechanisms that distribute the load equally among the hosts. To avoid conflicts between interactive users and batch jobs, a resource management system must be able to migrate batch jobs from an interactive host to an idle host. Common resource management systems offer process migration only for sequential jobs but not for parallel jobs. Within the SEMPA project a resource management system with batch queuing functionalities, including checkpointing and migration, is designed and implemented. We focus on PVM applications because PVM offers dynamic task management and an interface to resource management systems.¹

1 Introduction

Parallel scientific computing applications, e.g. in computational fluid dynamics, require a large amount of CPU time and memory. Therefore, they are often run on massively parallel systems. However, networks of workstations (NOWs) often have computing capacity available that is sufficient for the computation of resource-intensive applications. Especially smaller companies or research institutes use their NOWs for parallel applications as a low-cost alternative to massively parallel systems. A resource management system makes the use of a NOW transparent to the user and guarantees that the computational power of a NOW is utilized in the best possible way. To take advantage of a resource management system, resource-intensive applications are executed as batch jobs. In the remainder of this paper, a parallel application is a PVM application submitted as a batch job to a resource management system.
Checkpointing and migration of applications are important functionalities of a resource management system, for reasons of fault tolerance and dynamic load balancing. Periodic checkpoints of long-running applications are written to avoid the loss of the results computed so far if the application aborts unexpectedly, e.g. because of a hardware error. Process migration is a way to equalize the load in a NOW if the load situation is unbalanced, or to relocate processes during runtime.

¹ This work has been funded by the German Federal Department of Education, Science, Research and Technology, BMBF (Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie) within the research project SEMPA (Software Engineering Methods for Parallel Applications in Scientific Computing).

Primarily, a NOW is used for interactive work; batch jobs only utilize idle resources, and hence the interactive users have precedence over batch jobs. If an interactive user wants to work on a host running a process of a parallel application, the process must be migrated, because it probably claims such an enormous amount of resources that the interactive user would experience unacceptable response times on that host. Existing resource management systems, e.g. Condor [LTBL97] and LSF [Pla96], offer checkpointing and migration only for sequential applications. Merely initial process placement is supported for parallel applications, i.e. the processes of a parallel application are mapped to appropriate hosts. The processes are bound to their hosts and cannot be migrated to other hosts at runtime, because checkpointing mechanisms for parallel applications with communicating processes are rarely available [ZB96]. The reason why existing resource management systems hardly support parallel applications is the lack of control over the processes of a parallel application. Without control over the processes, a resource management system is unable to kill, checkpoint or migrate a running process of a parallel application, or to observe resource limitations. A major goal of the SEMPA project [LMRW96] is to design and implement a batch queuing and resource management system for sequential and parallel applications in a NOW. Available resources should always be utilized for the execution of batch jobs. A mechanism for checkpointing and migration of parallel applications must be provided to equalize the load in the NOW and to release hosts running processes of a parallel application if these hosts are needed by an interactive user. Our basic idea was to use existing batch queuing and resource management facilities and to add new features supporting the efficient computation of parallel applications in a NOW.
The SEMPA Resource Manager is based on the batch queuing and resource management system CODINE [GEN96] and on the checkpointing and migration capability for parallel applications of CoCheck [Ste95]. A resource manager is implemented to control the parallel applications and to join the components and functions of CODINE and CoCheck. The remainder of the paper is organized as follows. Section 2 describes the design concept of the SEMPA Resource Manager. The structure and functionalities of the basic components are explained in section 3. Section 4 shows some implementation details of the SEMPA Resource Manager. First performance measurements are presented in section 5. The paper closes with a brief summary and an outlook on further research.

2 The Design Concept of the SEMPA Resource Manager

An architectural design of a distributed resource management system for parallel applications in a NOW is introduced in [MS97]. The concept of the distributed resource management system comprises modular components for the main functionalities batch queuing, scheduling and load management, and includes defined interfaces between these components. The scheduling component is organized hierarchically, i.e. a global resource manager places a parallel application initially and then passes it to a local resource manager that is responsible for the parallel application until it has finished. The functions of the local resource manager are the management of hosts and processes and the remapping of the parallel application. The SEMPA Resource Manager is an implementation of the design concept presented in [MS97], based on CODINE, CoCheck and the PVM resource manager interface.

The architectural design of the SEMPA Resource Manager strongly depends on the structure and the components of CODINE and CoCheck, which should be retained as far as possible. An important issue in the design of the SEMPA Resource Manager is to define a communication model for the information exchange between the different components. One of the major functions of the SEMPA Resource Manager is to control the parallel applications, which means to control each of their processes. This is the basic assumption for further functions of the SEMPA Resource Manager that operate on single processes of a parallel application. Control over a parallel application is required to:

- suspend a running parallel application
- stop a running parallel application, e.g. on request of the job owner
- write periodic checkpoints of a parallel application
- migrate one or more processes of a parallel application
- observe resource limitations of a parallel application
- collect accounting information about a parallel application

3 Components of the SEMPA Resource Manager

Structure and functionalities of the main components of the SEMPA Resource Manager (CODINE, CoCheck and the resource manager interface) are explained in the following sections.

3.1 CODINE

CODINE is a batch queuing and resource management system for NOWs [GEN96]. Users submit their jobs to CODINE, which queues the jobs until the required resources are available. A batch job is composed of an application and resource requirements specified by the user, e.g. machine architecture or size of memory. CODINE maps sequential and parallel applications to idle or lightly loaded hosts. CODINE is built up of various components to queue and schedule jobs and to measure the load on the hosts in the NOW:

qmaster: The qmaster is the central component in CODINE and has control over all other components. It corresponds to a database server containing the information about hosts and jobs.

schedd: The schedd is the component that performs the scheduling algorithm.
It gets information about hosts and jobs from the qmaster and computes the job order list.

commd: A communication daemon is running on every host that is controlled by CODINE. The commd implements the communication between the CODINE components over TCP sockets. Some connections are permanent, e.g. between qmaster and schedd; other connections are set up on demand and closed when the transmission is over.

execd: An execution daemon is running on every host that executes batch jobs. The execd starts and controls jobs and measures the load on its host. When a job has finished, the execd returns the accounting information about the job to the qmaster.

shepherd: The shepherd process is started by the execd and builds up the execution environment for a job. The execd does not start a job immediately but starts a shepherd, and the shepherd starts the job by forking a process. When the job has finished, the shepherd collects the accounting information about the job.

Figure 1 shows the components of CODINE and their relationship. qmaster and schedd usually run on the same host to minimize the communication overhead. Jobs run on execution hosts, and for every job a shepherd exists that controls the job.

[Figure 1: The structure of CODINE]

When a parallel job is submitted, additional resource requirements must be specified compared to a sequential batch job, e.g. the parallel programming environment or the minimum and maximum number of hosts. Parallel CODINE jobs can use PVM, MPI or EXPRESS as parallel programming environment. A job in CODINE is not directly started by an execd but by a shepherd process that is started by the execd. The shepherd is the parent of the started job and has control over the job, e.g. to suspend or kill the job during runtime or to collect accounting information about the job. A shepherd can only start one job, whereas an execd can start several shepherd processes. In the current version of CODINE there is only a single shepherd for each parallel job, i.e. CODINE only has control over the process forked by the shepherd but not over processes that are created by parallel programming environments, e.g. spawned by PVM.
Thus, operations such as resource limitation and the collection of accounting information can only be performed for the master process forked by the shepherd, but not for the spawned processes. One of the aims of the SEMPA Resource Manager is to overcome this deficiency.
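The shepherd mechanism described above (the execd forks a shepherd, the shepherd forks the job and gathers accounting data when it finishes) can be sketched in a few lines. This is an illustrative Python model, not CODINE code; the field names of the accounting record are assumptions:

```python
import os
import sys
import subprocess

def shepherd(cmd):
    """Start a job as a child process, wait for it to finish, and
    collect accounting information -- the role CODINE's shepherd
    plays for each job (simplified sketch)."""
    job = subprocess.Popen(cmd)            # the shepherd forks the job
    pid, status, rusage = os.wait4(job.pid, 0)
    return {                               # accounting record (illustrative fields)
        "pid": pid,
        "exit_status": os.waitstatus_to_exitcode(status),
        "cpu_seconds": rusage.ru_utime + rusage.ru_stime,
    }

# The "job" here is a trivial Python process that exits immediately.
acct = shepherd([sys.executable, "-c", "pass"])
```

Because the shepherd is the job's parent, the accounting data comes directly from the kernel's resource usage for the child, which is exactly why a single shepherd cannot account for tasks spawned elsewhere by PVM.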

3.2 CoCheck

CoCheck (Consistent Checkpoints) is an extension to message-passing libraries that allows the creation of checkpoints of parallel applications and the migration of processes. Implementations of CoCheck for PVM [Ste95] and MPI [Ste96] exist. For the remainder of the paper we will refer to the PVM version of CoCheck. Before the application can actually be started, the user must relink the application with the CoCheck libraries to incorporate the code which implements checkpointing and migration. A resource manager is provided [GBD+94] that receives and handles requests to checkpoint or restart an application or to migrate processes. An API has been defined to send such requests to the resource manager. After the resource manager of CoCheck has received a request to checkpoint or migrate, it initiates the CoCheck checkpointing protocol. All processes of the currently executing application are informed about a pending checkpoint. In turn, all the processes start to exchange so-called "ready messages". These ready messages flush all communication channels between all the processes. Messages that were in transit at checkpoint time are thus forwarded to their destination and stored there. After restart, these messages are automatically retrieved from the buffers. When the processes are restarted, they get a new identifier. These identifiers are then sent to the CoCheck resource manager, which in turn sets up a mapping table from old to current identifiers. Within the wrappers for the communication calls, these current values are used to send and receive messages instead of the values that the application actually uses. Hence, checkpointing and migration are transparent to the application [Ste95].

3.3 The Resource Manager Interface

PVM offers a resource manager interface to define one's own host and task management and new scheduling strategies [GBD+94].
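The channel-flushing effect of the ready messages can be illustrated with a small model. This is a didactic Python sketch of the idea only, not CoCheck's protocol implementation; the process names and the READY marker are made up:

```python
from collections import deque

def flush_channels(channels, procs):
    """Model of the ready-message exchange: every process sends a
    READY marker on each outgoing channel; each receiver then drains
    every incoming channel up to the marker, buffering any messages
    that were still in transit so they can be replayed after restart."""
    buffered = {p: [] for p in procs}
    for sender in procs:                   # phase 1: send READY everywhere
        for receiver in procs:
            if sender != receiver:
                channels[(sender, receiver)].append("READY")
    for receiver in procs:                 # phase 2: drain up to READY
        for sender in procs:
            if sender != receiver:
                chan = channels[(sender, receiver)]
                while True:
                    msg = chan.popleft()
                    if msg == "READY":
                        break
                    buffered[receiver].append(msg)
    return buffered

chans = {(p, q): deque() for p in "AB" for q in "AB" if p != q}
chans[("A", "B")].append("m1")             # a message in transit at checkpoint time
saved = flush_channels(chans, "AB")        # "m1" ends up buffered at B
```

After the flush, every channel is provably empty, which is what makes the set of per-process checkpoints consistent.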
Usually, PVM calls are handled by the pvmds, but if there is a resource manager registered in the virtual machine, calls concerning hosts and tasks, e.g. pvm_addhosts or pvm_spawn, are redirected to the resource manager. The resource manager provides handler functions to execute the redirected calls. The handler functions in the resource manager are not part of PVM; they must be written explicitly by the user according to a given message framework. CoCheck uses the resource manager interface for the implementation of additional handler functions for checkpointing and migration. For the SEMPA Resource Manager, a complete resource manager has been implemented with handler functions for all affected calls; it joins the components of CODINE, CoCheck and PVM and realizes a local resource manager for every PVM application.

4 Implementation Aspects of the SEMPA Resource Manager

In the previous sections the architectural design and the components of the SEMPA Resource Manager have been introduced. This section explains some functionalities of the SEMPA Resource Manager and shows some implementation details.
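The pattern of redirected calls being executed by user-supplied handler functions can be sketched as a dispatch table. The message names and request fields below are illustrative inventions, not PVM's actual message framework:

```python
# Registry of handler functions, one per redirected call
# (a toy model of the resource manager interface).
handlers = {}

def handler(msg_name):
    """Register a handler function for a redirected call."""
    def register(fn):
        handlers[msg_name] = fn
        return fn
    return register

@handler("spawn")
def handle_spawn(request):
    # A real handler would pick a host and ask the tasker on that
    # host to start the task; here we take the first candidate.
    return {"task": request["task"], "host": request["candidates"][0]}

def dispatch(msg_name, request):
    """Route a redirected call to its registered handler."""
    return handlers[msg_name](request)

placement = dispatch("spawn",
                     {"task": "partfc", "candidates": ["host2", "host3"]})
```

The point of the design is the same as in the paper: the routing machinery is generic, while the policy lives entirely in the handler functions the user supplies.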

The main component of the SEMPA Resource Manager is the resource manager with its handler functions for host and task management, which initiate certain operations of CODINE, CoCheck or PVM. The data exchange between the CODINE and PVM components is realized by PVM calls and a signal interface.

4.1 Starting a Job by the SEMPA Resource Manager

Before a job can be started, hosts for the execution of the job must be selected and the parallel environment must be configured. In the SEMPA Resource Manager, the CODINE scheduler selects the hosts for the application, and the master host where the application is started, according to the load on the hosts. Then the execd on the master host starts a shepherd, called the master shepherd. The master shepherd starts the master PVM daemon (pvmd) and the resource manager. The resource manager sets up the virtual machine with the hosts selected by the schedd, i.e. it starts a slave pvmd and a tasker on each host belonging to the virtual machine. Due to implementation constraints, the resource manager must be started before hosts are added to the virtual machine. Now the virtual machine is built up completely, with the master pvmd and the resource manager running on the master host and a slave pvmd and a tasker on every other host in the virtual machine, as shown in Figure 2. As the next step, the application is started by the master shepherd, i.e. the first task is started, which usually spawns further tasks.

4.2 Spawning a Task

As mentioned above, CODINE is intended to have control over all tasks spawned by PVM. The tasker concept is used to implement the creation of a new task with its own strategy. The resource manager selects a host within the virtual machine where the new task is started. If no appropriate host is available in the virtual machine, the resource manager requests a new host, possibly with specific hardware requirements, from the CODINE qmaster.
The pvm_spawn call is sent to the resource manager, which selects a host and sends a message to the tasker on that host. The resource manager uses the round-robin strategy to map tasks to hosts. A strategy considering load information about the hosts will be implemented in the next phase of the project [SKS92]. It is not reasonable to specify a particular host in the pvm_spawn call, because the resource manager selects a host for the task. If the pvm_spawn call fails, a corresponding error is generated and the responsibility to handle the error message is passed back to the calling task. The tasker implements a procedure that prevents the tasker from forking the new task itself, but instead causes the execd to start a shepherd that finally creates the task (see Figure 3). The task is spawned on a host belonging to the virtual machine, i.e. a slave pvmd and a tasker are already running on that host. The spawned task is now under the control of CODINE and PVM.
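The current round-robin mapping of tasks to hosts can be sketched as follows (illustrative Python; the host names are made up):

```python
from itertools import cycle

def make_round_robin(hosts):
    """Return a selector that hands out hosts in round-robin order,
    mirroring the mapping strategy the resource manager currently
    uses for spawned tasks."""
    order = cycle(hosts)
    return lambda: next(order)

pick_host = make_round_robin(["host1", "host2", "host3"])
picks = [pick_host() for _ in range(4)]    # wraps around after host3
```

Round robin ignores host capacities and current load, which is precisely the limitation the planned load-aware strategy [SKS92] is meant to remove.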

[Figure 2: Starting a job by the SEMPA Resource Manager]

4.3 Exiting a Task

When a task exits, CODINE and the resource manager must be notified. An exiting task sends a SIGCHLD signal to its parent process, which is a shepherd process. After receiving the SIGCHLD, the shepherd writes the accounting information about the task to a temporary file and sends a SIGCHLD to the tasker to inform it that the task has exited. The shepherd then exits and sends a SIGCHLD to its parent process, the execd. When the resource manager recognizes that all tasks have terminated, it stops the virtual machine and exits.

5 Performance Measurements

Functionalities and performance of the SEMPA Resource Manager have been evaluated with ParTfC as a real-world test case. ParTfC is a computational fluid dynamics package to compute laminar and turbulent viscous flows in three-dimensional geometries. It has been parallelized within the SEMPA project according to the SPMD (single program, multiple data) paradigm [LMR+96]. The underlying grid is partitioned into smaller parts and every partition is computed by its own process.
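The SIGCHLD notification at the bottom of this chain can be demonstrated with a minimal parent/child example (a POSIX-only Python sketch; the real shepherd, tasker and execd are separate programs):

```python
import os
import signal
import time

reaped = []

def on_sigchld(signum, frame):
    # Like the shepherd, the parent learns of the exiting "task"
    # via SIGCHLD and can then record its exit status.
    pid, status = os.waitpid(-1, os.WNOHANG)
    if pid != 0:
        reaped.append((pid, os.waitstatus_to_exitcode(status)))

signal.signal(signal.SIGCHLD, on_sigchld)

pid = os.fork()
if pid == 0:
    os._exit(0)        # child: the task exits immediately

deadline = time.time() + 5.0
while not reaped and time.time() < deadline:
    time.sleep(0.01)   # parent: wait for the asynchronous notification
```

The asynchronous signal is what lets each layer (shepherd, tasker, execd) react to the exit without polling for it.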

[Figure 3: Spawning a task by the SEMPA Resource Manager]

The presented time measurements show the influence of a resource management system on the runtime of ParTfC. The following three measurement models have been examined:

(M1) ParTfC in interactive mode
(M2) ParTfC started as a CODINE batch job without a resource manager
(M3) ParTfC submitted as a batch job to the SEMPA Resource Manager

The time measurements were performed with two different grids:

(T1) A grid with 3150 grid nodes divided into 4 partitions.
(T2) A grid with grid nodes divided into 4 partitions.

The four processes of ParTfC were computed on two SGI Indigo 4400, so that two processes were running on one host. The two grids are relatively small, but they are sufficient to show that the overhead produced by CODINE or the SEMPA Resource Manager is negligible. Table 1 shows that the runtime of ParTfC hardly increases if ParTfC is started as a batch job in CODINE or the SEMPA Resource Manager, compared to the runtime of ParTfC in interactive mode. The times for the start and stop scripts in CODINE and the SEMPA Resource Manager, which are executed before starting and after finishing ParTfC, are shown in Table 2. Compared to the runtime of ParTfC these times can be neglected. The start script in CODINE starts PVM and sets up the virtual machine. The execution of the

start script of the SEMPA Resource Manager takes more time compared to the start script of CODINE, because the resource manager and the taskers must be started in addition. The stop script of CODINE performs a pvm halt to stop the virtual machine. The stop script of the SEMPA Resource Manager sends a signal to the resource manager to stop the virtual machine once all processes of the parallel application have finished.

        (M1)     (M2)     (M3)
(T1)    190 s    194 s    197 s
(T2)    389 s    395 s    396 s

Table 1: Runtime of ParTfC for the three measurement models

              (M2)      (M3)
start script  100 ms    4.2 s
stop script   100 ms    60 ms

Table 2: Time for start and stop scripts in CODINE and the SEMPA Resource Manager

6 Conclusion

The SEMPA Resource Manager provides batch queuing and resource management facilities for PVM applications in a NOW. Parallel applications are started as batch jobs, and each process of a parallel application is under the control of the SEMPA Resource Manager, so that e.g. resource limitation and migration of each process can be performed. The presented approach is restricted to PVM applications, because PVM offers dynamic task management and features to define one's own resource management services. The flexibility of this concept avoids changes in the PVM code. Modifications in CODINE and CoCheck are necessary but reduced to a minimum. The implementation of the SEMPA Resource Manager has almost been completed, except for the integration of the CoCheck handler functions into the resource manager. The next step after the integration of the migration facilities will be to improve the scheduling strategy of the resource manager, to decide about the mapping and remapping of processes more efficiently. Currently the round-robin method is used, which considers neither the different CPU and memory capacities of the hosts nor the actual load situation in the virtual machine and the NOW.
The interface between the resource manager and the CODINE qmaster must be extended to make scheduling information of CODINE available to the resource manager.

References

[GBD+94] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation. The MIT Press, Cambridge, MA, 1994.

[GEN96] GENIAS Software GmbH, Erzgebirgstr. 2B, Neutraubling, Germany. CODINE Reference Manual, Version 4.0, 1996.

[LMR+96] Peter Luksch, Ursula Maier, Sabine Rathmayer, Friedemann Unger, and Matthias Weidmann. Parallelization of a State-of-the-Art Industrial CFD Package for Execution on Networks of Workstations and Massively Parallel Processors. In Third European PVM Users' Group Meeting, EuroPVM 96, München, October 1996.

[LMRW96] Peter Luksch, Ursula Maier, Sabine Rathmayer, and Matthias Weidmann. SEMPA: Software Engineering Methods for Parallel Scientific Applications. In International Software Engineering Week, First International Workshop on Software Engineering for Parallel and Distributed Systems, Berlin, March 1996.

[LTBL97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Environment. Technical Report 1346, University of Wisconsin-Madison, April 1997.

[MS97] Ursula Maier and Georg Stellner. Distributed Resource Management for Parallel Applications in Networks of Workstations. In HPCN Europe 1997, volume 1225 of Lecture Notes in Computer Science, pages 462-471. Springer-Verlag, 1997.

[Pla96] Platform Computing Corporation, North York, Ontario, Canada. LSF Documentation, December 1996.

[SKS92] Niranjan G. Shivaratri, Phillip Krueger, and Mukesh Singhal. Load Distributing for Locally Distributed Systems. Computer, 25(12):33-44, December 1992.

[Ste95] Georg Stellner. Checkpointing and Process Migration for PVM. In Arndt Bode, Thomas Ludwig, Vaidy Sunderam, and Roland Wismüller, editors, Workshop on PVM, MPI, Tools and Applications, number 342/18/95 A in SFB-Bericht, pages 44-48. Technische Universität München, Institut für Informatik, November 1995.

[Ste96] Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526-531, Honolulu, HI, April 1996. IEEE Computer Society Press, Los Alamitos, CA.

[ZB96] Avi Ziv and Jehoshua Bruck. Checkpointing in Parallel and Distributed Systems. In Albert Zomaya, editor, Parallel and Distributed Computing Handbook, Series on Computer Engineering, chapter 10, pages 274-302. McGraw-Hill, 1996.


More information

A Distributed Load Sharing Batch System. Jingwen Wang, Songnian Zhou, Khalid Ahmed, and Weihong Long. Technical Report CSRI-286.

A Distributed Load Sharing Batch System. Jingwen Wang, Songnian Zhou, Khalid Ahmed, and Weihong Long. Technical Report CSRI-286. LSBATCH: A Distributed Load Sharing Batch System Jingwen Wang, Songnian Zhou, Khalid Ahmed, and Weihong Long Technical Report CSRI-286 April 1993 Computer Systems Research Institute University of Toronto

More information

A Hierarchical Approach to Workload. M. Calzarossa 1, G. Haring 2, G. Kotsis 2,A.Merlo 1,D.Tessera 1

A Hierarchical Approach to Workload. M. Calzarossa 1, G. Haring 2, G. Kotsis 2,A.Merlo 1,D.Tessera 1 A Hierarchical Approach to Workload Characterization for Parallel Systems? M. Calzarossa 1, G. Haring 2, G. Kotsis 2,A.Merlo 1,D.Tessera 1 1 Dipartimento di Informatica e Sistemistica, Universita dipavia,

More information

Adaptive load migration systems for PVM

Adaptive load migration systems for PVM Oregon Health & Science University OHSU Digital Commons CSETech March 1994 Adaptive load migration systems for PVM Jeremy Casas Ravi Konuru Steve W. Otto Robert Prouty Jonathan Walpole Follow this and

More information

should invest the time and eort to rewrite their existing PVM applications in MPI. In this paper we address these questions by comparing the features

should invest the time and eort to rewrite their existing PVM applications in MPI. In this paper we address these questions by comparing the features PVM and MPI: a Comparison of Features G. A. Geist J. A. Kohl P. M. Papadopoulos May 30, 1996 Abstract This paper compares PVM and MPI features, pointing out the situations where one may befavored over

More information

Enhancing Integrated Layer Processing using Common Case. Anticipation and Data Dependence Analysis. Extended Abstract

Enhancing Integrated Layer Processing using Common Case. Anticipation and Data Dependence Analysis. Extended Abstract Enhancing Integrated Layer Processing using Common Case Anticipation and Data Dependence Analysis Extended Abstract Philippe Oechslin Computer Networking Lab Swiss Federal Institute of Technology DI-LTI

More information

HARNESS. provides multi-level hot pluggability. virtual machines. split off mobile agents. merge multiple collaborating sites.

HARNESS. provides multi-level hot pluggability. virtual machines. split off mobile agents. merge multiple collaborating sites. HARNESS: Heterogeneous Adaptable Recongurable NEtworked SystemS Jack Dongarra { Oak Ridge National Laboratory and University of Tennessee, Knoxville Al Geist { Oak Ridge National Laboratory James Arthur

More information

Application Programm 1

Application Programm 1 A Concept of Datamigration in a Distributed, Object-Oriented Knowledge Base Oliver Schmid Research Institute for Robotic and Real-Time Systems, Department of Computer Science, Technical University of Munich,

More information

N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart. Liang PENG Lip Kian NG

N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart. Liang PENG Lip Kian NG N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart Liang PENG Lip Kian NG N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart Liang PENG Lip Kian NG APSTC-TB-2004-005 Abstract: N1GE6, formerly

More information

UTOPIA: A Load Sharing Facility for Large, Heterogeneous Distributed Computer Systems. Technical Report CSRI-257. April 1992

UTOPIA: A Load Sharing Facility for Large, Heterogeneous Distributed Computer Systems. Technical Report CSRI-257. April 1992 UTOPIA: A Load Sharing Facility for Large, Heterogeneous Distributed Computer Systems Songnian Zhou, Jingwen Wang, Xiaohu Zheng, and Pierre Delisle Technical Report CSRI-257 April 1992. (To appear in Software

More information

Jeremy Casas, Dan Clark, Ravi Konuru, Steve W. Otto, Robert Prouty, and Jonathan Walpole.

Jeremy Casas, Dan Clark, Ravi Konuru, Steve W. Otto, Robert Prouty, and Jonathan Walpole. MPVM: A Migration Transparent Version of PVM Jeremy Casas, Dan Clark, Ravi Konuru, Steve W. Otto, Robert Prouty, and Jonathan Walpole fcasas,dclark,konuru,otto,prouty,walpoleg@cse.ogi.edu Department of

More information

processes based on Message Passing Interface

processes based on Message Passing Interface Checkpointing and Migration of parallel processes based on Message Passing Interface Zhang Youhui, Wang Dongsheng, Zheng Weimin Department of Computer Science, Tsinghua University, China. Abstract This

More information

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow

Making Workstations a Friendly Environment for Batch Jobs. Miron Livny Mike Litzkow Making Workstations a Friendly Environment for Batch Jobs Miron Livny Mike Litzkow Computer Sciences Department University of Wisconsin - Madison {miron,mike}@cs.wisc.edu 1. Introduction As time-sharing

More information

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster A Freely Congurable Audio-Mixing Engine with Automatic Loadbalancing M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster Electronics Laboratory, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland

More information

TIME WARP PARALLEL LOGIC SIMULATION ON A DISTRIBUTED MEMORY MULTIPROCESSOR. Peter Luksch, Holger Weitlich

TIME WARP PARALLEL LOGIC SIMULATION ON A DISTRIBUTED MEMORY MULTIPROCESSOR. Peter Luksch, Holger Weitlich TIME WARP PARALLEL LOGIC SIMULATION ON A DISTRIBUTED MEMORY MULTIPROCESSOR ABSTRACT Peter Luksch, Holger Weitlich Department of Computer Science, Munich University of Technology P.O. Box, D-W-8-Munchen,

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety Data Parallel Programming with the Khoros Data Services Library Steve Kubica, Thomas Robey, Chris Moorman Khoral Research, Inc. 6200 Indian School Rd. NE Suite 200 Albuquerque, NM 87110 USA E-mail: info@khoral.com

More information

Monitoring the Usage of the ZEUS Analysis Grid

Monitoring the Usage of the ZEUS Analysis Grid Monitoring the Usage of the ZEUS Analysis Grid Stefanos Leontsinis September 9, 2006 Summer Student Programme 2006 DESY Hamburg Supervisor Dr. Hartmut Stadie National Technical

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs

Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs Arash Baratloo Ayal Itzkovitz Zvi M. Kedem Yuanyuan Zhao fbaratloo,ayali,kedem,yuanyuang@cs.nyu.edu Department of Computer

More information

Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs

Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs Mechanisms for Just-in-Time Allocation of Resources to Adaptive Parallel Programs Arash Baratloo Ayal Itzkovitz Zvi M. Kedem Yuanyuan Zhao baratloo,ayali,kedem,yuanyuan @cs.nyu.edu Department of Computer

More information

Covering the Aztec Diamond with One-sided Tetrasticks Extended Version

Covering the Aztec Diamond with One-sided Tetrasticks Extended Version Covering the Aztec Diamond with One-sided Tetrasticks Extended Version Alfred Wassermann, University of Bayreuth, D-95440 Bayreuth, Germany Abstract There are 107 non-isomorphic coverings of the Aztec

More information

Network Computing Environment. Adam Beguelin, Jack Dongarra. Al Geist, Robert Manchek. Keith Moore. August, Rice University

Network Computing Environment. Adam Beguelin, Jack Dongarra. Al Geist, Robert Manchek. Keith Moore. August, Rice University HeNCE: A Heterogeneous Network Computing Environment Adam Beguelin, Jack Dongarra Al Geist, Robert Manchek Keith Moore CRPC-TR93425 August, 1993 Center for Research on Parallel Computation Rice University

More information

Storage System. Distributor. Network. Drive. Drive. Storage System. Controller. Controller. Disk. Disk

Storage System. Distributor. Network. Drive. Drive. Storage System. Controller. Controller. Disk. Disk HRaid: a Flexible Storage-system Simulator Toni Cortes Jesus Labarta Universitat Politecnica de Catalunya - Barcelona ftoni, jesusg@ac.upc.es - http://www.ac.upc.es/hpc Abstract Clusters of workstations

More information

Grid Compute Resources and Grid Job Management

Grid Compute Resources and Grid Job Management Grid Compute Resources and Job Management March 24-25, 2007 Grid Job Management 1 Job and compute resource management! This module is about running jobs on remote compute resources March 24-25, 2007 Grid

More information

Normal mode acoustic propagation models. E.A. Vavalis. the computer code to a network of heterogeneous workstations using the Parallel

Normal mode acoustic propagation models. E.A. Vavalis. the computer code to a network of heterogeneous workstations using the Parallel Normal mode acoustic propagation models on heterogeneous networks of workstations E.A. Vavalis University of Crete, Mathematics Department, 714 09 Heraklion, GREECE and IACM, FORTH, 711 10 Heraklion, GREECE.

More information

Job Management System Extension To Support SLAAC-1V Reconfigurable Hardware

Job Management System Extension To Support SLAAC-1V Reconfigurable Hardware Job Management System Extension To Support SLAAC-1V Reconfigurable Hardware Mohamed Taher 1, Kris Gaj 2, Tarek El-Ghazawi 1, and Nikitas Alexandridis 1 1 The George Washington University 2 George Mason

More information

Tutorial 4: Condor. John Watt, National e-science Centre

Tutorial 4: Condor. John Watt, National e-science Centre Tutorial 4: Condor John Watt, National e-science Centre Tutorials Timetable Week Day/Time Topic Staff 3 Fri 11am Introduction to Globus J.W. 4 Fri 11am Globus Development J.W. 5 Fri 11am Globus Development

More information

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing

More information

NOW Based Parallel Reconstruction of Functional Images

NOW Based Parallel Reconstruction of Functional Images NOW Based Parallel Reconstruction of Functional Images F. Munz 1, T. Stephan 2, U. Maier 2, T. Ludwig 2,A.Bode 2, S. Ziegler 1,S.Nekolla 1, P. Bartenstein 1 and M. Schwaiger 1 1 Nuklearmedizinische Klinik

More information

UNICORE Globus: Interoperability of Grid Infrastructures

UNICORE Globus: Interoperability of Grid Infrastructures UNICORE : Interoperability of Grid Infrastructures Michael Rambadt Philipp Wieder Central Institute for Applied Mathematics (ZAM) Research Centre Juelich D 52425 Juelich, Germany Phone: +49 2461 612057

More information

Improving the Performance of Coordinated Checkpointers. on Networks of Workstations using RAID Techniques. University of Tennessee

Improving the Performance of Coordinated Checkpointers. on Networks of Workstations using RAID Techniques. University of Tennessee Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques James S. Plank Department of Computer Science University of Tennessee Knoxville, TN 37996 plank@cs.utk.edu

More information

Supporting Heterogeneous Network Computing: PVM. Jack J. Dongarra. Oak Ridge National Laboratory and University of Tennessee. G. A.

Supporting Heterogeneous Network Computing: PVM. Jack J. Dongarra. Oak Ridge National Laboratory and University of Tennessee. G. A. Supporting Heterogeneous Network Computing: PVM Jack J. Dongarra Oak Ridge National Laboratory and University of Tennessee G. A. Geist Oak Ridge National Laboratory Robert Manchek University of Tennessee

More information

/98 $10.00 (c) 1998 IEEE

/98 $10.00 (c) 1998 IEEE CUMULVS: Extending a Generic Steering and Visualization Middleware for lication Fault-Tolerance Philip M. Papadopoulos, phil@msr.epm.ornl.gov James Arthur Kohl, kohl@msr.epm.ornl.gov B. David Semeraro,

More information

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH PARALLEL IN-MEMORY DATABASE. Dept. Mathematics and Computing Science div. ECP

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH PARALLEL IN-MEMORY DATABASE. Dept. Mathematics and Computing Science div. ECP EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN/ECP 95-29 11 December 1995 ON-LINE EVENT RECONSTRUCTION USING A PARALLEL IN-MEMORY DATABASE E. Argante y;z,p. v.d. Stok y, I. Willers z y Eindhoven University

More information

The MPBench Report. Philip J. Mucci. Kevin London. March 1998

The MPBench Report. Philip J. Mucci. Kevin London.  March 1998 The MPBench Report Philip J. Mucci Kevin London mucci@cs.utk.edu london@cs.utk.edu March 1998 1 Introduction MPBench is a benchmark to evaluate the performance of MPI and PVM on MPP's and clusters of workstations.

More information

UNIVERSITY OF MINNESOTA. This is to certify that I have examined this copy of master s thesis by. Vishwas Raman

UNIVERSITY OF MINNESOTA. This is to certify that I have examined this copy of master s thesis by. Vishwas Raman UNIVERSITY OF MINNESOTA This is to certify that I have examined this copy of master s thesis by Vishwas Raman and have have found that it is complete and satisfactory in all respects, and that any and

More information

director executor user program user program signal, breakpoint function call communication channel client library directing server

director executor user program user program signal, breakpoint function call communication channel client library directing server (appeared in Computing Systems, Vol. 8, 2, pp.107-134, MIT Press, Spring 1995.) The Dynascope Directing Server: Design and Implementation 1 Rok Sosic School of Computing and Information Technology Grith

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

An Introduction to Parallel Processing with the Fork Transformation in SAS Data Integration Studio

An Introduction to Parallel Processing with the Fork Transformation in SAS Data Integration Studio Paper 2733-2018 An Introduction to Parallel Processing with the Fork Transformation in SAS Data Integration Studio Jeff Dyson, The Financial Risk Group ABSTRACT The SAS Data Integration Studio job is historically

More information

Efficiently building on-line tools for distributed heterogeneous environments

Efficiently building on-line tools for distributed heterogeneous environments Scientific Programming 10 (2002) 67 74 67 IOS Press Efficiently building on-line tools for distributed heterogeneous environments Günther Rackl, Thomas Ludwig, Markus Lindermeier and Alexandros Stamatakis

More information

(HT)Condor - Past and Future

(HT)Condor - Past and Future (HT)Condor - Past and Future Miron Livny John P. Morgridge Professor of Computer Science Wisconsin Institutes for Discovery University of Wisconsin-Madison חי has the value of 18 חי means alive Europe

More information

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics N. Melab, T-V. Luong, K. Boufaras and E-G. Talbi Dolphin Project INRIA Lille Nord Europe - LIFL/CNRS UMR 8022 - Université

More information

Processes, PCB, Context Switch

Processes, PCB, Context Switch THE HONG KONG POLYTECHNIC UNIVERSITY Department of Electronic and Information Engineering EIE 272 CAOS Operating Systems Part II Processes, PCB, Context Switch Instructor Dr. M. Sakalli enmsaka@eie.polyu.edu.hk

More information

A MATLAB Toolbox for Distributed and Parallel Processing

A MATLAB Toolbox for Distributed and Parallel Processing A MATLAB Toolbox for Distributed and Parallel Processing S. Pawletta a, W. Drewelow a, P. Duenow a, T. Pawletta b and M. Suesse a a Institute of Automatic Control, Department of Electrical Engineering,

More information

Distributed Batch Controller. Department of Computer Science, University of Maryland, College Park, MD USA. Waterloo, ON N2L 3G1 Canada

Distributed Batch Controller. Department of Computer Science, University of Maryland, College Park, MD USA. Waterloo, ON N2L 3G1 Canada Processing TOVS Polar Pathnder Data Using the Distributed Batch Controller James Du a, Kenneth Salem b, Axel Schweiger c, and Miron Livny d a Department of Computer Science, University of Maryland, College

More information

CL/TB. An Allegro Common Lisp. J. Kempe, T. Lenz, B. Freitag, H. Schutz, G. Specht

CL/TB. An Allegro Common Lisp. J. Kempe, T. Lenz, B. Freitag, H. Schutz, G. Specht CL/TB An Allegro Common Lisp Programming Interface for TransBase J. Kempe, T. Lenz, B. Freitag, H. Schutz, G. Specht TECHNISCHE UNIVERSIT AT M UNCHEN Institut fur Informatik Orleansstrasse 34 D-8000 Munchen

More information

Grid Compute Resources and Job Management

Grid Compute Resources and Job Management Grid Compute Resources and Job Management How do we access the grid? Command line with tools that you'll use Specialised applications Ex: Write a program to process images that sends data to run on the

More information

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint?

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint? What is Checkpoint libraries Bosilca George bosilca@cs.utk.edu Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine. interruption

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

Applications PVM (Parallel Virtual Machine) Socket Interface. Unix Domain LLC/SNAP HIPPI-LE/FP/PH. HIPPI Networks

Applications PVM (Parallel Virtual Machine) Socket Interface. Unix Domain LLC/SNAP HIPPI-LE/FP/PH. HIPPI Networks Enhanced PVM Communications over a HIPPI Local Area Network Jenwei Hsieh, David H.C. Du, Norman J. Troullier 1 Distributed Multimedia Research Center 2 and Computer Science Department, University of Minnesota

More information

Application Programmer. Vienna Fortran Out-of-Core Program

Application Programmer. Vienna Fortran Out-of-Core Program Mass Storage Support for a Parallelizing Compilation System b a Peter Brezany a, Thomas A. Mueck b, Erich Schikuta c Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse

More information

MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM. Daniel Grosu, Honorius G^almeanu

MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM. Daniel Grosu, Honorius G^almeanu MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM Daniel Grosu, Honorius G^almeanu Multimedia Group - Department of Electronics and Computers Transilvania University

More information

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Paulo S. Souza * Luciano J. Senger ** Marcos J. Santana ** Regina C. Santana ** e-mails: {pssouza, ljsenger, mjs,

More information

Comparing Centralized and Decentralized Distributed Execution Systems

Comparing Centralized and Decentralized Distributed Execution Systems Comparing Centralized and Decentralized Distributed Execution Systems Mustafa Paksoy mpaksoy@swarthmore.edu Javier Prado jprado@swarthmore.edu May 2, 2006 Abstract We implement two distributed execution

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

2 Fredrik Manne, Svein Olav Andersen where an error occurs. In order to automate the process most debuggers can set conditional breakpoints (watch-poi

2 Fredrik Manne, Svein Olav Andersen where an error occurs. In order to automate the process most debuggers can set conditional breakpoints (watch-poi This is page 1 Printer: Opaque this Automating the Debugging of Large Numerical Codes Fredrik Manne Svein Olav Andersen 1 ABSTRACT The development of large numerical codes is usually carried out in an

More information

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp Scientia Iranica, Vol. 11, No. 3, pp 159{164 c Sharif University of Technology, July 2004 On Routing Architecture for Hybrid FPGA M. Nadjarbashi, S.M. Fakhraie 1 and A. Kaviani 2 In this paper, the routing

More information

Experiences in Managing Resources on a Large Origin3000 cluster

Experiences in Managing Resources on a Large Origin3000 cluster Experiences in Managing Resources on a Large Origin3000 cluster UG Summit 2002, Manchester, May 20 2002, Mark van de Sanden & Huub Stoffers http://www.sara.nl A oarse Outline of this Presentation Overview

More information

Process a program in execution; process execution must progress in sequential fashion. Operating Systems

Process a program in execution; process execution must progress in sequential fashion. Operating Systems Process Concept An operating system executes a variety of programs: Batch system jobs Time-shared systems user programs or tasks 1 Textbook uses the terms job and process almost interchangeably Process

More information

Design of it : an Aldor library to express parallel programs Extended Abstract Niklaus Mannhart Institute for Scientic Computing ETH-Zentrum CH-8092 Z

Design of it : an Aldor library to express parallel programs Extended Abstract Niklaus Mannhart Institute for Scientic Computing ETH-Zentrum CH-8092 Z Design of it : an Aldor library to express parallel programs Extended Abstract Niklaus Mannhart Institute for Scientic Computing ETH-Zentrum CH-8092 Zurich, Switzerland e-mail: mannhart@inf.ethz.ch url:

More information

AN ABSTRACT OF THE THESIS OF. December 6, Title: Optimization of Machine Allocation in Ring Leader.

AN ABSTRACT OF THE THESIS OF. December 6, Title: Optimization of Machine Allocation in Ring Leader. AN ABSTRACT OF THE THESIS OF Jonathan B. King for the degree of Master of Science in Computer Science presented on December 6, 1996. Title: Optimization of Machine Allocation in Ring Leader. Abstract approved

More information

Towards Energy Efficient Change Management in a Cloud Computing Environment

Towards Energy Efficient Change Management in a Cloud Computing Environment Towards Energy Efficient Change Management in a Cloud Computing Environment Hady AbdelSalam 1,KurtMaly 1,RaviMukkamala 1, Mohammad Zubair 1, and David Kaminsky 2 1 Computer Science Department, Old Dominion

More information

Univa Grid Engine Troubleshooting Quick Reference

Univa Grid Engine Troubleshooting Quick Reference Univa Corporation Grid Engine Documentation Univa Grid Engine Troubleshooting Quick Reference Author: Univa Engineering Version: 8.4.4 October 31, 2016 Copyright 2012 2016 Univa Corporation. All rights

More information

Mobile Computing An Browser. Grace Hai Yan Lo and Thomas Kunz fhylo, October, Abstract

Mobile Computing An  Browser. Grace Hai Yan Lo and Thomas Kunz fhylo, October, Abstract A Case Study of Dynamic Application Partitioning in Mobile Computing An E-mail Browser Grace Hai Yan Lo and Thomas Kunz fhylo, tkunzg@uwaterloo.ca University ofwaterloo, ON, Canada October, 1996 Abstract

More information

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Chapter 3. Design of Grid Scheduler. 3.1 Introduction Chapter 3 Design of Grid Scheduler The scheduler component of the grid is responsible to prepare the job ques for grid resources. The research in design of grid schedulers has given various topologies

More information

Policy-Based Context-Management for Mobile Solutions

Policy-Based Context-Management for Mobile Solutions Policy-Based Context-Management for Mobile Solutions Caroline Funk 1,Björn Schiemann 2 1 Ludwig-Maximilians-Universität München Oettingenstraße 67, 80538 München caroline.funk@nm.ifi.lmu.de 2 Siemens AG,

More information

n m-dimensional data points K Clusters KP Data Points (Cluster centers) K Clusters

n m-dimensional data points K Clusters KP Data Points (Cluster centers) K Clusters Clustering using a coarse-grained parallel Genetic Algorithm: A Preliminary Study Nalini K. Ratha Anil K. Jain Moon J. Chung Department of Computer Science Department of Computer Science Department of

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

The Architecture of a System for the Indexing of Images by. Content

The Architecture of a System for the Indexing of Images by. Content The Architecture of a System for the Indexing of s by Content S. Kostomanolakis, M. Lourakis, C. Chronaki, Y. Kavaklis, and S. C. Orphanoudakis Computer Vision and Robotics Laboratory Institute of Computer

More information

Providing Interoperability for Java-Oriented Monitoring Tools with JINEXT

Providing Interoperability for Java-Oriented Monitoring Tools with JINEXT Providing Interoperability for Java-Oriented Monitoring Tools with JINEXT W lodzimierz Funika and Arkadiusz Janik Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Kraków, Poland funika@uci.agh.edu.pl

More information

Ecient Redo Processing in. Jun-Lin Lin. Xi Li. Southern Methodist University

Ecient Redo Processing in. Jun-Lin Lin. Xi Li. Southern Methodist University Technical Report 96-CSE-13 Ecient Redo Processing in Main Memory Databases by Jun-Lin Lin Margaret H. Dunham Xi Li Department of Computer Science and Engineering Southern Methodist University Dallas, Texas

More information

Space-Efficient Page-Level Incremental Checkpointing *

Space-Efficient Page-Level Incremental Checkpointing * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 22, 237-246 (2006) Space-Efficient Page-Level Incremental Checkpointing * JUNYOUNG HEO, SANGHO YI, YOOKUN CHO AND JIMAN HONG + School of Computer Science

More information

Cross Cluster Migration using Dynamite

Cross Cluster Migration using Dynamite Cross Cluster Migration using Dynamite Remote File Access Support A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science at the University of Amsterdam by Adianto

More information

Improving the Dynamic Creation of Processes in MPI-2

Improving the Dynamic Creation of Processes in MPI-2 Improving the Dynamic Creation of Processes in MPI-2 Márcia C. Cera, Guilherme P. Pezzi, Elton N. Mathias, Nicolas Maillard, and Philippe O. A. Navaux Universidade Federal do Rio Grande do Sul, Instituto

More information

PC cluster as a platform for parallel applications

PC cluster as a platform for parallel applications PC cluster as a platform for parallel applications AMANY ABD ELSAMEA, HESHAM ELDEEB, SALWA NASSAR Computer & System Department Electronic Research Institute National Research Center, Dokki, Giza Cairo,

More information

Using semantic causality graphs to validate MAS models

Using semantic causality graphs to validate MAS models Using semantic causality graphs to validate MAS models Guillermo Vigueras 1, Jorge J. Gómez 2, Juan A. Botía 1 and Juan Pavón 2 1 Facultad de Informática Universidad de Murcia Spain 2 Facultad de Informática

More information