The Hector Distributed Run Time Environment


The Hector Distributed Run Time Environment*
A Manuscript Submitted to the IEEE Transactions on Parallel and Distributed Systems

Dr. Samuel H. Russ, Jonathan Robinson, Dr. Brian K. Flachs, and Bjorn Heckel
russ@erc.msstate.edu, jon@aue.com, flachs@umunhum.stanford.edu, heckel@durango.cipic.ucdavis.edu

Abstract
Harnessing the computational capabilities of a network of workstations promises to offload work from overloaded supercomputers onto largely idle resources overnight. Several capabilities are needed to do this, including support for an architecture-independent parallel programming environment, task migration, automatic resource allocation, and fault tolerance. The Hector distributed run time environment is designed to present these capabilities transparently to programmers. MPI programs can be run under this environment with no modifications to their source code. The design of Hector, its internal structure, and several benchmarks and tests are presented.

Index Terms: Parallel Computing, Load Balancing, Fault Tolerance, Resource Allocation, Task Migration

I. INTRODUCTION AND PREVIOUS WORK

A. Networks of Workstations
Networked workstations have been available for many years. Since a modern network of workstations represents a total computational capability on the order of a supercomputer, there is strong motivation to use a network of workstations (NOW) as a type of low-cost supercomputer. Note that a typical institution has access to a variety of computer resources that are network interconnected. These range from workstations to shared memory multiprocessors to high-end parallel supercomputers. In the context of this paper, a network of workstations or NOW is considered to be a network of heterogeneous computational resources, some of which may actually be workstations.

A run time system for parallel programs on a NOW must have several important properties. First, scientific programmers need a way to run supercomputer-class programs on such a system with little or no source code modification. The most common way to do this is to use an architecture-independent parallel programming standard to code the applications. This permits the same program source code to run on a NOW, a shared memory multiprocessor, and the latest generations of parallel supercomputers, for example. Two major message passing standards have emerged that can do this, PVM [1],[2] and MPI [3], as well as numerous distributed shared memory (DSM) systems.

Second, the ability to run large jobs on networked resources only proves attractive to workstation users if their individual workstations can still be used for more mundane tasks, such as word processing.

* This work was funded in part by NSF Grant No. EEC, Amendment 021, and by ONR Grant No. N.

Task migration is needed to offload work from a user's workstation and return the station to its owner. This ability also permits dynamic load balancing and fault tolerance. (Note that, in this paper, a task is a piece of a parallel program, and a program or job is a complete parallel program. In the message passing model used here, a program is always decomposed into multiple communicating tasks. A platform or host is a computer that can run jobs if idle, and can range from a workstation to an SMP.)

Third, a run time environment for NOWs needs the ability to track the availability and relative performance of resources as programs run, because it needs this information to conduct ongoing performance optimizations. This is true for several reasons. The relative speed of nodes in a network of workstations, even of homogeneous workstations, can vary widely. Workstation availability is a function of individual users, and users can create an external load that must be worked around. Programs themselves may run more efficiently if redistributed in certain ways.

Fourth, fault tolerance is extremely important in systems that involve dozens to hundreds of workstations. For example, if there is a 1% chance that a single workstation will go down overnight, then there is only a (0.99)^75 ≈ 47% chance that a network of 75 stations will stay up overnight.

A complete run time system for NOW computing must therefore include an architecture-independent coding interface, task migration, automatic resource allocation, and fault tolerance. It is also desirable for this system to support all of these features with no source code modifications. These individual components already exist in various forms, and a review of them is in order. It is the goal of this work to combine these individual components into a single system.

B. Parallel Programming Standards
A wide variety of parallel and distributed architectures exist today to run parallel programs. Varying in their degree of interconnection, interconnection topology, bandwidth, latency, and scale of geographic distribution, they offer a wide range of performance and cost trade-offs and of applicability to different classes of problems. They can generally be characterized on the basis of their support for physical shared memory. Two major classes of parallel programming paradigms have emerged from the two major classes of parallel architectures. The shared memory model has its origins in programming tightly coupled processors and offers relative convenience. The message passing model is well suited to loosely coupled architectures and coarse-grained applications. Programming models that implement these paradigms can further be classified by their manifestation either as extensions to existing programming languages or as custom languages.

Because our system is intended for a network of workstations, it was felt that a message passing parallel programming paradigm more closely reflected the underlying physical structure. It was also felt that this paradigm should be expressed in existing scientific programming languages in order to draw on an existing base of scientific programmers. Both the PVM and MPI standards support these goals, and the MPI parallel programming environment was selected. Callable from C or FORTRAN programs, it provides a robust set of extensions that can send and receive messages between tasks working in parallel [3].
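To make concrete what "no source code modifications" means in practice, the following is an ordinary MPI 1.1 program of the kind Hector is designed to run unchanged. Nothing in it is Hector-specific; it is offered only as an illustration of the programming interface, not as code from the Hector distribution.

```c
/* An ordinary MPI 1.1 program; nothing here is Hector-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, src;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Task 0 collects one integer from every other task. */
        for (src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("task 0 received %d from task %d\n", value, src);
        }
    } else {
        int value = rank * rank;   /* stand-in for a real computation */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```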
While a discussion of the relative merits of PVM and MPI is outside the scope of this paper, this decision was partially driven by ongoing work on MPI implementations that had already been conducted at Mississippi State and Argonne National Laboratory [4]. A more detailed discussion of the taxonomy of parallel paradigms and systems, and of the rationale for our decision, can be found in [5].

One desirable property of a run time system is for its services to be offered transparently to applications programmers. Programs written to a common programming standard should not have to be modified to gain access to more advanced run time system features. This level of transparency permits programs to maintain conformity to the programming standard and simplifies code development.

C. Cluster Computing Systems
Systems that can harness the aggregate performance of networked resources have been under development for quite some time. For example, one good review of cluster computing systems, conducted in 1995 and prepared at Syracuse University, listed 7 commercial and 13 research systems [6]. The results of the survey, along with a comparison to the Hector environment discussed in this paper, are summarized in [7]. It was found that the Hector environment had as many features as some of the other full-featured systems (such as Platform Computing's LSF, Florida State's DQS, IBM's LoadLeveler, Genias' Codine, and the University of Wisconsin-Madison's Condor), and that its support for the simultaneous combination of programmer-transparent task migration and load balancing, fully automatic and dynamic load balancing, support for MPI, and job suspension was unique. Additionally, the commitment to programmer transparency has led to the development of extensive run time information gathering about programs as they run, and so the breadth and depth of the information that is gathered is unique. It should also be added that Hector's support for typical commercial features, such as GUI and configuration management tools, is noticeably lacking, as Hector is a research project.

It should also be mentioned at this point that there are (at least) two other research projects using the name Hector doing work in distributed computing and multiprocessing. The first is the well-known Hector multiprocessor project at the University of Toronto [8],[9]. The second is a system for supporting distributed objects in Python at the CRC for Distributed Systems Technology at the University of Queensland [10]. The Hector environment described in this paper is unrelated to either.

Three other research systems can allocate tasks across NOWs and have some degree of support for task migration. Figure 1 summarizes these systems.

[Figure 1: Comparison of Existing Task Allocators. The feature matrix compares Condor/CARMI, Prospero (PRM), MIST, and DQS on the following criteria: fully dynamic processor allocation and reallocation; stopping only the task under migration; user-transparent load balancing; user-transparent fault tolerance; support for MPI; no source code modifications to existing MPI/PVM programs; and use of an existing operating system.]

One such system is a special version of Condor [11], named CARMI [12]. CARMI can allocate PVM tasks across idle workstations and can migrate PVM tasks as workstations fail or become non-idle. It has two limitations. First, it cannot claim newly available resources. For example, it does not attempt to move work onto workstations that have become idle. Second, it checkpoints all tasks when one task needs to migrate [13]. Stopping all tasks when only one task needs to migrate slows program execution unnecessarily, since only the migrating task actually needs to be stopped.

Another automated allocation environment is the Prospero Resource Manager, or PRM [14]. Each parallel program run under PRM has its own job manager. Custom written for each program, the job manager acts like a purchasing agent and negotiates the acquisition of system resources as additional resources are needed. PRM is scheduled to use elements of Condor to support task migration and checkpointing, and uses information gathered from the job and node managers to reallocate resources.

Notice that use of PRM requires a custom allocation program for each parallel application, and future versions may require modified operating systems and kernels.

The MIST system is intended to integrate several development efforts and develop an automated task allocator [15]. Because it uses PRM to allocate tasks, the user must custom build the allocation scheme for each program. MIST is built on top of PVM, and PVM's support of indirect communications can potentially lead to administrative overhead, such as message forwarding, when a task has been migrated [16]. The implementation of MPI that Hector uses, with a globally available list of task hosts, does not incur this overhead. (Note that MPI does not have indirect communications, and so Hector does not have any overhead to support it.) Every task sends its messages directly to the receiving task, and the only overhead required after a task has migrated is to notify every other task of the new location. As will be shown below, this process has very little overhead, even for large parallel applications.

The Distributed Queuing System, or DQS, is designed to manage jobs across multiple computers simultaneously [17]. It can support one or more queue masters, which process user requests for resources on a first-come, first-served basis. Users prepare small batch files to describe the type of machine needed for particular applications. (For example, the application may require a certain amount of memory.) Thus resource allocation is performed as jobs are started. DQS currently has no built-in support for task migration or fault tolerance, but it can issue commands to applications that can migrate and/or checkpoint themselves. It supports both PVM and MPI applications.

The differences between these systems and Hector highlight two of the key differences among cluster computing systems. First, there is a trade-off between task migration mechanisms that are programmer written and those that are supported automatically by the environment. Second, there is a trade-off between centralized and decentralized decision making and information gathering.

D. Task Migration in the Context of Networked Resources
Two strategies have emerged for creating a program's checkpoint or task migration image. First, checkpointing routines can be written by the application programmer. Presumably he or she is sufficiently familiar with the program to know when checkpoints can be created efficiently (for example, places where the running state is relatively small and free of local variables) and to know which variables are needed to create a complete checkpoint. Second, checkpointing routines can transfer the entire program state automatically onto another machine. The entire address space and registers are copied and carefully restored.

By way of comparison, user-written checkpointing routines have some inherent space advantages, because the state's size is inherently minimized, and they may have cross-platform compatibility advantages if the state is written in some architecture-independent format. User-written routines have two disadvantages. First, they add a coding burden to the programmer, who must not only write but also maintain the checkpointing routines. Second, checkpoints are only available at certain, predetermined places in the program. Thus the program cannot be checkpointed immediately on demand.
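As an illustration of the user-written style (not taken from any of the systems above), a checkpoint routine might look like the following sketch; the file name, the grid array, and the iteration counter are invented for the example.

```c
/* Illustrative sketch only: a user-written checkpoint routine.  The
 * programmer picks a quiet point in the computation and writes just the
 * variables needed to restart. */
#include <stdio.h>

#define NX 512
#define NY 512

int write_user_checkpoint(const char *path,
                          const double grid[NX][NY], int iteration)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* Portability (endianness, word sizes) is the programmer's problem in
     * this style; an architecture-independent format would be written here. */
    if (fwrite(&iteration, sizeof(iteration), 1, fp) != 1 ||
        fwrite(grid, sizeof(double), (size_t)NX * NY, fp) != (size_t)NX * NY) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```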
It would appear from the Syracuse survey of systems [6] that most commercial systems support only user-written checkpointing for checkpointing and migration. One guesses that this is so because user-written checkpointing is much easier for resource management system developers: responsibility for correct checkpointing is transferred to the applications programmer. As noted in the earlier discussion, research systems such as PRM and DQS also have user-written checkpointing, and at least one system (not discussed in the Syracuse report) has used this capability to perform cross-architecture task migration [18].

Condor and Hector use the complete state transfer method. This form of state transfer inherently works only across homogeneous platforms, because it involves the actual replacement of a program's state.

It is also completely transparent to the programmer, requiring no modifications or additions to the source code. An alternate approach is to modify the compiler in order to retain the necessary type information and pointer information. These two pieces of information (which data structures contain pointers, and what the pointers point to) are needed if migration is to be accomplished across heterogeneous platforms. At least one such system (the MSRM Library at Louisiana State University) has been implemented [19]. The MSRM approach may make automatic, cross-platform migration possible, at the expense of requiring custom compilers and of increased migration time.

Once a system has a correct and consistent task migration capability, it is simple to add checkpointing. By having tasks transfer their state to disk (instead of to another task) it becomes possible to create a checkpoint of the program. This can be used for fault recovery and for job suspension. Thus both Hector and Condor provide checkpoint and rollback capability.

However task migration is accomplished, there are two technical issues to deal with. First, the program's state must be transferred completely and correctly. Second, in the case of parallel programs, any communications in progress must continue consistently. That is, the tasks must agree on the status of communications among themselves. Hector's solutions to these issues are discussed in Section II.B.

E. Automatic Resource Allocation
There is a trade-off between a centralized allocation mechanism, in which all tasks and programs are scheduled centrally and the policy is centrally designed and administered, and a competitive, distributed model, in which programs compete and/or bid for needed resources and the bidding agents are written as part of the individual programs. Besides some of the classic trade-offs between centralized and distributed processing (such as overhead and scalability), there is an implied trade-off between the degree of support required of the applications programmer and the intelligence with which programs can acquire the resources they need. The custom-written allocation approach places a larger burden on the applications programmer, but permits more well-informed acquisition of needed resources. Since the overall goal of Hector is to minimize programmer burden, it does not use any a priori information or any custom-written allocation policies. This is discussed further in Section II.C.

The degree to which a priori applications information can boost run time performance has been explored for some time [20]. For example, Nguyen et al. have shown that extracting run time information can be minimally intrusive and can substantially improve the performance of a parallel job scheduler [21]. Their approach used a combination of software and special-purpose hardware on a KSR-1 parallel computer to measure a program's speedup and efficiency, and then used that information to improve program performance. However, Nguyen's work is only relevant for applications that can vary their own number of tasks in response to some optimization. Many parallel applications are launched with a specific number of tasks that does not vary as the program runs. Additionally, it requires the use of the KSR's special-purpose timing hardware. Gibbons proposed a simpler system to correlate run times to different job queues [22].
Even this relatively coarse measurement was shown to improve scheduling, as it permits a scheduling model that more closely approaches the well-known Shortest Job First (SJF) algorithm. Systems can develop reasonably accurate estimates of a job's completion time based on historical traces of other programs submitted to the same queue. Since this information is coarse and gathered historically, it cannot be used to improve the performance of a single application at run time. (It can, however, improve the efficiency of the scheduler that manages several jobs at once.)

Some recent results by Feitelson and Weil have shown the surprising result that user estimates of run time can make the performance of a job scheduler worse than the performance with no estimates at all [23]. While the authors concede that additional work is needed in the area, it does highlight that user-supplied information can be unreliable, which is an additional reason why Hector does not use it.

These approaches have shown the ability of detailed performance information to improve job scheduling. However, to summarize, they have several shortcomings. First, some of them require special-purpose hardware. Second, some systems require user modifications to the applications program in order to keep track of relevant run time performance information. Third, the information that is gathered is relatively coarse. Fourth, some systems require applications that can dynamically alter the number of tasks in use. Fifth, user-supplied information can be not only inaccurate but also misleading. The goal of Hector's resource allocation infrastructure is to overcome these shortcomings.

There is another trade-off in the degree of support for dynamically changing workloads and computational resource availability. The ideal NOW distributed run time system can automatically allocate jobs to available resources and move them around during the program run, both in order to maximize performance and in order to release workstations back to their users. Current clustering systems support this goal to varying degrees. For example, some systems launch programs when enough resources (such as enough processors) become available. This is the approach taken by IBM's LoadLeveler, for example [6]. Other systems can migrate jobs as workstations become busy, such as Condor [11]. It appears that, as of the time of the Syracuse survey, only Hector attempts to move jobs onto newly idle resources as well.

F. Goals and Objectives of Hector
Because of the desire to design a system that supports fully transparent task migration, fully automatic and dynamic resource allocation, and transparent fault tolerance, the Hector distributed run time environment is now being developed and tested at Mississippi State University. These requirements necessitated the development of a task migration method and a modified MPI implementation that would continue correct communications during task migration. A run time infrastructure that could gather and process run time performance information was simultaneously created. The primary aim of this paper is to discuss these steps in more detail, as well as the steps needed to add support for fault tolerance.

Hector is designed to use a central decision maker, called the master allocator or MA, to perform the optimizations and make allocation decisions. It uses a small task running on each candidate processor to detect idle resources and monitor the performance of programs during execution. These tasks are called slave allocators or SAs. The amount of overhead associated with an SA is an important design consideration. An individual SA currently updates its statistics every 5 seconds. (This time interval is a compromise between timeliness and overhead.) This process takes about 5 ms on a Sun SPARCstation 5, and so corresponds to an extra CPU load of 0.1% [24]. Reading a task's CPU usage adds 581 μs per task every time the SA updates its statistics (every 5 s). Reading this detailed usage information therefore adds about 0.01% CPU load per task. For example, an SA supervising 5 MPI tasks will add a CPU load of about 0.15%.
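For concreteness, the overhead figures quoted above combine as follows (using the paper's rounded per-task value of roughly 0.01%):

\[
\frac{5\ \text{ms}}{5\ \text{s}} = 0.1\%, \qquad
\frac{581\ \mu\text{s}}{5\ \text{s}} \approx 0.012\% \approx 0.01\% \text{ per task}, \qquad
0.1\% + 5 \times 0.01\% \approx 0.15\%.
\]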

When an MPI program is launched, individual tasks are allocated to available machines and migrated as needed. The SAs and the MA communicate to maintain awareness of the state of all running programs. The structure is diagrammed below in Figure 2.

[Figure 2: Structure of Hector Running MPI Programs. The master allocator exchanges commands and system information with the slave allocators, and each slave allocator exchanges commands and performance information with its local MPI tasks.]

Key design features and the design process are described below in Section II, and benchmarks and tests that measure Hector's performance are described in Section III. The paper concludes with a discussion of future plans.

II. GOALS, OBSTACLES, AND ACCOMPLISHMENTS

A. Ease of Use
A system must be easy to use if it is to gain widespread acceptance. In this context, ease of use can be supported in two different ways. First, adherence to existing, widely accepted standards allows programmers to use the environment with a minimal amount of extra training. Second, the complexities of task allocation and migration and of fault tolerance should be hidden from unsophisticated scientific programmers. That is, scientific programmers should be able to write their programs and submit them to the resource management system without having to provide additional information about their programs.

1. Using Existing Standards
Hector runs on existing workstations and SMPs using existing operating systems, currently Sun systems running SunOS or Solaris and SGI systems running Irix. Several parts of the system, such as task migration and correctness of socket communications, would be simpler to support if modifications were made to the operating system. However, this would dramatically limit the usefulness of the system in using existing resources, and so the decision was made not to modify the operating system.

The MPI and PVM standards provide architecture-independent parallel coding capability in both C and FORTRAN. MPI and PVM are supported on a wide and growing body of parallel architectures ranging from networks of workstations to high-end SMPs and parallel mainframes. Since these represent systems that have gained and are gaining widespread acceptance, there already exists a sizable body of programmers that can use them. Hector supports MPI as its coding standard.

2. Total Transparency of Task Allocation and Fault Tolerance
Experience at Mississippi State indicates that most scientific programmers are unwilling (or unable) to provide such information as program size, estimated run time, or communications topology.

This situation exists for two reasons. First, such programmers are solving a physical problem, and so programming is a means to an end. Second, they may not have enough detailed knowledge about the internal workings of computers to provide information useful to computer engineers and scientists. Hector is therefore designed to operate with no a priori knowledge of the program to be executed. This considerably complicates the task allocation process, but it is an almost necessary step to promote transparency of task allocation to the programmer and, as a result, ease of use for the scientific programmer. Although this is not currently supported, future versions of Hector may be able to benefit from user-supplied a priori information.

A new implementation of MPI, named MPI-TM, has been created to support task migration [25] and fault tolerance. MPI-TM is based on the MPICH implementation of MPI [4]. In order to run with these features, a programmer merely has to re-link the application with the Hector-modified version of MPI. The modified MPI implementation and the Hector central decision maker handle allocation and migration automatically. The programmer simply writes a normal MPI program and submits it to Hector for execution.

Hector exists as the MPI-TM library and a collection of executables. The library is linked with applications and provides a self-migration facility, a complete MPI implementation, and an interface to the run time system. The executables include the SA, the MA, a text-based command line interface to the MA, and a rudimentary Motif-based GUI. Its installation is roughly as complicated as installing a new MPI implementation and a complete applications package.

3. Support for Multiple Platforms
Hector is supported on Sun computers running SunOS and Solaris and on SGI computers running Irix. The greatest obstacle under Solaris is its dynamic linker which, due to its ability to link at run time, can create incompatible versions of the same executable file. This creates the undesirable situation that migration is impossible between nearly, but not completely, identical machines, and has the consequence of dividing the Sun computers into many smaller clusters. This situation exists because of the combination of two factors. First, Hector performs automatic, programmer-transparent task migration without compiler modifications. Thus it cannot move pointers and must treat the program's state as an unalterable binary image. Second, dynamically linked programs may map system libraries and their associated data segments to different virtual addresses in runs of the same program on different machines. The solution adopted by Condor is to rewrite the linker (more accurately, to replace the Solaris dynamic linker with a custom-written one) to make migration of system library data segments possible [26]. This option is under consideration for Hector, but is not currently supported.

B. Task Migration
1. Correct State Transfer
The state of a running program, in a Unix environment, can be considered in six parts. First, the actual program text may be dynamically linked, and has references to data that may be statically or dynamically located. Second, the program's static data is divided into initialized and uninitialized sections. Third, any dynamically allocated data is stored in the heap. Fourth, the program's stack grows as subroutines and functions are called, and is used for locally visible data and dynamic data storage.
Fifth, the CPU contains internal registers, usually used to hold the results of intermediate calculations. Sixth, the Unix kernel maintains some system-level information about the program, such as file descriptors. This is summarized in Figure 3.

[Figure 3: State of a Program During Execution. The user-visible state comprises the CPU registers and user memory (text, static data, heap, and stack); kernel memory structures are not visible to the user, so wrapper functions keep track of kernel information in a place that is visible to the user.]
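A minimal sketch of a data structure capturing these six components is shown below; the field names are hypothetical and are not drawn from the Hector sources.

```c
/* Illustrative sketch only: a descriptor for the six components of task
 * state enumerated above. */
#include <sys/types.h>

struct task_state {
    /* 1. program text: identified rather than copied for statically
     *    linked code */
    char   executable_path[1024];

    /* 2. static data: initialized (.data) and uninitialized (.bss) bounds */
    void  *data_start, *data_end;
    void  *bss_start,  *bss_end;

    /* 3. heap: current break, so dynamically allocated data can be copied */
    void  *heap_start, *heap_end;

    /* 4. stack: only the live portion needs to be transferred */
    void  *stack_top,  *stack_bottom;

    /* 5. CPU registers, saved on entry to the migration signal handler */
    unsigned long regs[32];

    /* 6. kernel-held state, reconstructed from the wrapper functions */
    struct {
        int    fd;
        char   path[1024];
        off_t  offset;
        int    flags;
    } open_files[64];
    int n_open_files;
};
```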

The first five parts of the state can, in principle, be transferred between two communicating user-level programs. One exception occurs when programs are dynamically linked, as parts of the program text and data may not reside at the same virtual address in two different instantiations of the same program. As discussed above, this matter is under investigation.

The sixth part of a program's state, kernel-related information, is more difficult to transfer because it is invisible to a user-level program. This information may include file descriptors and pointers, signal handlers, and memory-mapped files. Without kernel source code, it is almost impossible to read these structures directly. If the operating system is unmodified, the solution is to create wrapper functions that let the program keep track of its own kernel-related structures. All user code that modifies kernel structures must pass through a trap interface. (Traps are the only way user-level code can execute supervisor-level functions.) The Unix SYSCALL.H file documents all of the system calls that use traps, and all other system calls are built on top of them. One can create a function with the same name and arguments as a system call, such as open(). The arguments to the function are passed into an assembly language routine that calls the system trap properly. The remainder of the function keeps track of the file descriptor, path name, permissions, and other such information. The lseek() function can keep track of the location of the file pointer. Calls that change the file pointer (such as read() and write()) also call the instrumented lseek(), so that file pointer information is updated automatically. This permits migrated tasks to resume reading and writing files at the proper place.

It was discovered that the MPI environment to which task migration was being added [5] also uses signals and memory mapping. (The latter is due to the fact that gethostbyname() makes a call to mmap().) All system calls that affect signal handling and memory mapping are therefore replaced with wrapper functions as well.
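The following sketch illustrates the wrapper-function idea for open(). It is not the MPI-TM code: the table name is invented, and the portable syscall() call stands in for the assembly-language trap routine described above.

```c
/* Illustrative sketch only: a wrapper around open() that records, in user
 * space, the information needed to reopen the file after migration. */
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_FDS 256

struct fd_record {                 /* user-visible copy of kernel file state */
    int    in_use;
    char   path[1024];
    int    flags;
    mode_t mode;
    off_t  offset;                 /* updated by wrapped lseek()/read()/write() */
};

static struct fd_record hector_fd_table[MAX_FDS];   /* hypothetical table */

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {         /* the third argument exists only here */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    /* Invoke the kernel trap directly, bypassing the C library stub. */
    int fd = (int)syscall(SYS_open, path, flags, mode);

    if (fd >= 0 && fd < MAX_FDS) { /* remember what only the kernel knows */
        hector_fd_table[fd].in_use = 1;
        strncpy(hector_fd_table[fd].path, path,
                sizeof(hector_fd_table[fd].path) - 1);
        hector_fd_table[fd].flags  = flags;
        hector_fd_table[fd].mode   = mode;
        hector_fd_table[fd].offset = 0;
    }
    return fd;
}
```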

The task migration system requires knowledge of a running program's image in a particular operating system, the development of a small amount of assembly language, and reliance on certain properties of signal handling, and these all affect the portability of the task migration system. The assembly language is needed because it is the only way to save and restore registers and call traps. Since the task migration routine runs inside a signal handler, it is also necessary for the restarted program to be able to exit the signal handler coherently. Other systems that perform a similar style of system-supported task migration, such as MIST and Condor, have also been ported to Linux, Alpha, and HP environments [27],[15]. This seems to indicate that this style of task migration is reasonably portable among Unix-based operating systems, probably because these operating systems have strong structural similarities. It is interesting to note that no system-level migration support for Windows NT based systems has been reported.

The exact sequence of steps involved in the actual state transfer is described in more detail in [25]. Two tests confirm this method's speed and stability, and are described below in Section III.

2. Keeping MPI Intact: A Task Migration Protocol
The state restoration process described above is not guaranteed to preserve communications on sockets. This is because, at any point in the execution of a program, fragments of messages may reside in the kernel's buffers on either the sending or the receiving side. The solution is to notify all tasks when a single task is about to migrate. Each task that is communicating with the task under migration sends an end-of-channel message to the migrating task and then closes the socket that connects them. Each task then marks the migrating task as under migration in its table of tasks, and attempts to initiate communications with it will block until migration is complete. Once the task under migration has received all of its end-of-channel messages, it can be assured that no messages are trapped in the kernel's buffers. That is, it knows that all messages reside in its data segment, and so it can be migrated safely. Once the state has been transferred, another global update is needed so that the other tasks know the new location and know that communications can be resumed. Tasks that are not migrating remain able to initiate connections and communicate with one another.

The MPI 1.1 standard (the original MPI standard) only permits static task tables. That is, the number of tasks used by a parallel program is fixed when the program is launched. (It is important to note that the static number of tasks in a program is an MPI 1.1 limit, not a limitation of Hector. This is also one of the important differences between PVM and MPI 1.1.) Thus updates to this table do not require synchronization with MPI and do not confuse an MPI program. The MPI-2 standard (a newer standard currently in development) permits dynamically changing task tables, but, with proper use of critical sections, task migration will not interfere with programs written under the MPI-2 standard.

A series of steps is needed to update the task table globally and atomically. Hector's MA and SAs are used to provide synchronization and machine-to-machine communications during migration and task termination. The exact sequence of steps required to synchronize tasks and update the communication status consistently is described in detail in [25]. It should be noted that if the MA crashes in the middle of a migration, the program will deadlock, because the MA is used for global synchronization and to guarantee inter-task consistency.
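A minimal sketch of the end-of-channel drain described above, from the migrating task's side, is given below. The message types, task-table layout, and helper functions are hypothetical stand-ins for MPI-TM internals.

```c
/* Illustrative sketch only: the end-of-channel drain performed by the task
 * about to migrate. */
#include <unistd.h>

enum task_status { TASK_UP, TASK_UNDER_MIGRATION, TASK_TERMINATED };
enum msg_type    { MSG_END_OF_CHANNEL, MSG_USER_DATA };

struct task_entry {
    int socket_fd;                        /* -1 when no connection is open */
    enum task_status status;
};

struct control_msg { enum msg_type type; int sender; /* payload omitted */ };

extern struct task_entry task_table[];    /* one entry per MPI rank */
extern int N_TASKS, MY_RANK;

struct control_msg recv_control_message(void);     /* blocking receive */
void enqueue_user_message(struct control_msg m);    /* buffer MPI traffic */

/* Called in the migrating task once the MA has announced the migration. */
void drain_channels_before_migration(void)
{
    int rank, eoc_pending = 0;

    /* Every peer with an open connection owes this task one EOC message. */
    for (rank = 0; rank < N_TASKS; rank++)
        if (rank != MY_RANK && task_table[rank].socket_fd != -1)
            eoc_pending++;

    /* Keep receiving until every in-flight byte has been pulled out of    */
    /* the kernel's socket buffers and into this task's own data segment.  */
    while (eoc_pending > 0) {
        struct control_msg m = recv_control_message();
        if (m.type == MSG_END_OF_CHANNEL) {
            close(task_table[m.sender].socket_fd);
            task_table[m.sender].socket_fd = -1;
            eoc_pending--;
        } else {
            enqueue_user_message(m);      /* ordinary message, keep it */
        }
    }
    /* All messages now reside in user memory; the state can be shipped. */
}
```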
3. Task Termination Protocol
Task termination presents another complication. If a task is migrating while or after another task terminates, the task under migration never receives an end-of-channel message from the terminated task. Two measures are taken to provide correct program behavior. First, the MA limits each MPI program to only one migration or one termination at a time. It can do this because of the handshaking needed both to migrate and to terminate. Second, a protocol involving the SAs and the MA was developed to govern task termination; it is described below, and a sketch of the MA-side serialization follows the list.

1. A task preparing to terminate notifies its SA. The task can still receive and process table updates and requests for end-of-channel (EOC) messages, but it will block requests to migrate. It cannot be allowed to migrate, so that the MA can send it a termination signal.
2. The SA notifies the MA that the task is ready to terminate.
3. Once all pending migrations and terminations have finished, the MA notifies the SA that the task has permission to terminate. It then blocks (and enqueues) further termination and migration requests until this termination has ended.
4. The SA notifies the task.
5. The task sends the SA a final message before exiting.
6. The SA notifies the MA that the task is exiting, so that the MA can permit other migrations and terminations.
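The following sketch suggests how the MA might enforce the one-transition-at-a-time rule; the types and function names are invented, and the real MA's bookkeeping is certainly more involved.

```c
/* Illustrative sketch only: serializing migrations and terminations in the
 * MA so that at most one transition per MPI program is in progress. */
#include <stdbool.h>

enum req_type { REQ_MIGRATE, REQ_TERMINATE };

struct request { enum req_type type; int job_id; int task_rank; };

struct job_state {
    bool transition_in_progress;   /* one migration OR termination at a time */
    struct request pending[128];   /* requests queued while busy */
    int  n_pending;
};

void sa_grant_termination(int job_id, int task_rank);  /* step 3 message */
void ma_start_migration(int job_id, int task_rank);    /* begins a migration */

/* Called when an SA reports that a task is ready to terminate (step 2). */
void ma_handle_termination_request(struct job_state *job, struct request r)
{
    if (job->transition_in_progress) {
        job->pending[job->n_pending++] = r;        /* enqueue until free */
        return;
    }
    job->transition_in_progress = true;
    sa_grant_termination(r.job_id, r.task_rank);   /* step 3: permission */
}

/* Called when the SA reports that the task has exited (step 6), or when a
 * migration completes.  Order of service is simplified here. */
void ma_handle_transition_done(struct job_state *job)
{
    job->transition_in_progress = false;
    if (job->n_pending == 0)
        return;
    struct request next = job->pending[--job->n_pending];
    job->transition_in_progress = true;
    if (next.type == REQ_TERMINATE)
        sa_grant_termination(next.job_id, next.task_rank);
    else
        ma_start_migration(next.job_id, next.task_rank);
}
```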

Notice that an improperly written program may attempt to communicate with a task after that task has ended. In the world of message passing parallel programming, this is a programmer's mistake. The behavior of the program is undefined at this point, and the program itself will deadlock under Hector. (The program deadlocks, not Hector.)

4. Minimizing Migration Time
The operating system already has one mechanism for storing a program's state. A core dump creates a file that contains a program's registers, data segment, and stack. The first version of state transfer used this capability to move programs around. There are two advantages to this approach. First, it is built into the operating system. Second, there are symbolic debuggers and other tools that can extract useful information from core files. There are also disadvantages. First, multiple network transfers are needed if the disk space is shared over a network. This means that the state is actually copied multiple times. Second, the speed of transfer is limited further by the speed of the disk and by other, unrelated programs sharing that disk.

One way around these shortcomings is to transfer the state directly over the network. Originally implemented by the MIST team [15], network state transfer overcomes these disadvantages. The information is written over the network in slightly modified core file format. (The only modification is that unused stack space is not transmitted. There is no other penalty for using the SunOS core file format.) The information is written over a network socket connection by the application itself, instead of being written to a file by the operating system. Notice that this retains the advantage of core file tool compatibility. Experiments show that it is over three times faster [25], as will be shown below.

C. Automatic Resource Allocation
1. Sources of Information
Hector's overall goal is to minimize programmer overhead. In the context of awareness of program behavior, this incurs the expense of not having access to potentially beneficial program-specific information. This approach was chosen based on experience with scientific programmers within the authors' research center, who are unwilling to invest time and effort in new systems because of the perceived burden of source code modifications. This approach dictates that Hector be able to operate with no a priori applications knowledge, which, in turn, increases the requirement for the depth and breadth of information that is gathered at run time. This lack of a priori information makes Hector's allocation decision making more difficult. However, experiments confirm that the information it is able to extract at run time can improve performance, and its ability to exploit newly idle resources is especially helpful.

2. Structure
There is a range of ways that resource allocation can be structured, from completely centralized to completely distributed. Hector's resource allocation uses features of both. The decision-making portion (and global synchronization) resides in the MA and is therefore completely centralized. The advantage of a single, central allocation decision maker is that it is easier to modify and test different allocation strategies. Since the Unix operating system will not permit signals to be sent between hosts, it is necessary to have an executive process running on each candidate host. Since such processes are necessary anyway, they can also be used to gather performance information.
Thus its information gathering and execution portions are fully distributed, being carried out by the SAs.

3. Collecting Information
There are two types of information that the master allocator needs in order to make decisions. First, it needs to know about the resources that are potentially available, such as which hosts to consider and how powerful they are.

Second, it needs to know how efficiently and to what extent these resources are being used, such as how much external (non-Hector) load there is and how much load the various MPI programs under its control are imposing.

The relative performance of each candidate host is determined by the slave allocator when it is started. (It does so by running the Livermore Loops [28], which measure floating point performance.) It is also possible for the slave allocator to measure disk availability and physical memory size, for example. This information is transmitted to the master allocator, which maintains its own centralized database of this information. Current resource usage is monitored by analyzing information from the kernel of each candidate processor. Allocation algorithms draw on idle time information, CPU time information, and the percentage of CPU time devoted to non-Hector-related tasks. The percentage of CPU time is used to detect external workload, such as an interactive user logging in, which is a criterion for automatic migration. The Hector MPI library contains additional, detailed self-instrumentation that logs the amount of computation and communication time each task expends. This data is gathered by the SAs through shared memory and is forwarded to the MA. A more detailed discussion of this agent-based approach to information gathering, as well as testing and results, may be found in [29].

4. Making Decisions
One of the primary advantages of this performance-monitoring approach is its ability to claim idle resources rapidly. As will be shown below, tests on busy workstations during the day show that migrating to newly available resources can reduce run time and promote more effective use of workstations. Further implementation and testing of more sophisticated allocation policies are also under way.

D. Fault Tolerance
The ability to migrate tasks in mid-execution can be used to suspend tasks. In fact, fault tolerance has historically been one major motivation for task migration. In effect, each task transfers its state into a file to checkpoint the program. When a node failure has been detected, the files can be used to roll the program back to the state of the last checkpoint. While all calculations between the checkpoint and the node failure are lost, the calculations up to the checkpoint are not, which may represent a substantial time savings. Also, known unrelated failures and/or routine maintenance may occur or be needed in the middle of a large program run, and so the ability to suspend tasks is helpful.

It can be shown that, in order to guarantee program correctness, all tasks must be checkpointed consistently [13]. That is, the tasks must be at a consistent point in their execution and in their message exchange status. For example, all messages in transit must be fully received, and transmission of any new messages must be suspended. As was the case with migration and termination, a series of steps is needed to checkpoint and to roll back parallel programs.

1. Checkpointing Protocol
The following steps are taken to checkpoint a program.
1. The MA decides to checkpoint a program for whatever reason. (This is currently supported as a manual user command, and may eventually be done on a periodic basis.) It waits until all pending migrations and terminations have finished, and then it notifies all tasks in the program, via the SAs, to prepare for checkpointing.
2. The tasks send end-of-channel (EOC) messages to all connected tasks, and then receive EOCs from all connected tasks. Again, this guarantees that there are no messages in transit.
3. Once all EOCs have been exchanged, each task notifies its SA that it is ready for checkpointing and informs the SA of the size of its state information. This information is passed on to the MA.
4. Once the MA has received confirmation from every task, it is ready to begin the actual checkpointing process. It notifies each task when that task is to begin checkpointing.

5. After each task finishes transmitting its state (or writing a file), it notifies the MA. Note that it is possible for more than one task to checkpoint at a time, and experiments with the ordering of checkpointing are described below.
6. After all tasks have checkpointed, the MA writes out a small bookkeeping file which contains state information pertinent to the MA and SAs. (For example, it contains the execution time to the point of checkpointing, so that the total execution time will be accurate if the job is rolled back to that checkpoint.)
7. The MA broadcasts either a Resume or a Suspend command to all tasks. The tasks either resume execution or stop, respectively. The former is used to create a backup copy of a task in the event of future node failure. The latter is used if it is necessary to remove a job temporarily from the system.

2. Rollback Protocol
The following steps are taken to roll back a checkpointed program.
1. The MA is given the name of a checkpoint file that provides all necessary information to restart the program.
2. It allocates tasks on available workstations, just as if the program were being launched.
3. Based on its allocation, it notifies the SA on the first machine.
4. The SA restarts the task from the state file, the name of which is found in the checkpoint file and sent to the SA.
5. The task notifies its SA that it restarted properly and waits for a table update.
6. Once the MA receives confirmation of one task's successful restart, it notifies the SA of the next task. It continues to do this until all tasks have restarted. Task rollback is sequential primarily for performance reasons: the file server that is reading the checkpoints and sending them over sockets to the newly launched tasks performs more efficiently if only one checkpoint is sent at a time.
7. As confirmation arrives at the MA, it builds a table similar to that used by the MPI tasks themselves. It lists the hostnames and Unix PIDs of all the tasks in the parallel program. Once all tasks have restarted, this table is broadcast to all tasks. Note that this broadcast occurs via the Hector run time infrastructure and is invisible to the MPI program. It does not use an MPI broadcast, as MPI is inactive during rollback.
8. Each task resumes normal execution once it receives its table update, and so the entire program is restarted.

3. The Checkpoint Server
As is the case with task migration, there are two ways to save a program's state. One way is for each program to write directly to a checkpoint file. The other way is to launch a checkpoint server on a machine with a large amount of physically mounted disk space. (The latter concept was first implemented by the Condor group [30].) Each task transmits its state via the network directly to the server, and the server writes the state directly to its local disk. The reason each task cannot write its state to its own local disk is obvious: if the machine crashes, the backup copy of the state would be lost as well. The checkpoint server method is expected to be faster, because it uses direct socket connections and local disk writes, which are more efficient than writing files over a network. Note that many of the local disk caching strategies used by systems like NFS do not work well for checkpoints, because checkpoint files are typically written once and not read back [30]. Different, novel scheduling strategies for checkpoint service are described and tested below.
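A sketch of the task-side transfer to such a checkpoint server is shown below; the function is illustrative only, and the server address, port, and state buffer are supplied by the caller rather than taken from the Hector sources.

```c
/* Illustrative sketch only: a task shipping its checkpoint image to a
 * checkpoint server over a plain TCP socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_checkpoint(const char *server_ip, int port,
                    const void *state, size_t state_len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons((unsigned short)port);
    if (inet_pton(AF_INET, server_ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* Stream the (core-file-format) image; the server writes it straight
     * to its local disk, avoiding network file system traffic. */
    const char *p = state;
    size_t left = state_len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) { close(fd); return -1; }
        p += n;
        left -= (size_t)n;
    }
    close(fd);
    return 0;
}
```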
4. Other Issues
The MA is not fault tolerant. That is, the MA represents a single point of failure. The SAs have been modified to terminate themselves, and the tasks running under them, if they lose contact with the MA. (This feature was intentionally added because total termination of programs distributed across dozens of workstations can be quite tedious unless it is automated.) If this feature is disabled, then the SAs and their tasks can continue working without the MA, although all task migrations and job launches will cease and job termination will deadlock. The checkpoints collected by the system will enable a job to be restarted after the MA and SAs have been restarted, and so the checkpoint-and-rollback based fault tolerance can tolerate a fault in the run time infrastructure itself. Another approach to this problem, and a way to support more rapid fault tolerance, would be to incorporate an existing group communication library (such as Isis or Horus [31]) and use its message duplication facility. One possible design is described in [32].

Means of rapid fault detection can also be added to future versions of Hector [32]. Each SA sends performance information to the MA periodically. (The current period is 5 seconds, which may grow as larger tests are performed.) If an SA does not send a performance update after some suitable timeout, it can be assumed that the node is not running properly, and all jobs on that node can be rolled back. This strategy will detect heavily overloaded nodes as well as catastrophically failed nodes.
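A timeout-based detector of the kind described here could be sketched as follows; the reporting period, grace period, and function names are assumptions, not Hector's actual values.

```c
/* Illustrative sketch only: timeout-based failure detection in the MA.
 * Each SA is expected to report every UPDATE_PERIOD seconds; if a report
 * is overdue by more than SA_TIMEOUT seconds, the node is treated as
 * failed and the jobs on it are rolled back. */
#include <stdbool.h>
#include <time.h>

#define UPDATE_PERIOD 5          /* seconds between SA reports (assumed) */
#define SA_TIMEOUT    30         /* grace period before declaring failure */
#define MAX_NODES     256

struct node_record {
    bool   active;
    time_t last_report;          /* wall-clock time of the last SA update */
};

static struct node_record nodes[MAX_NODES];

void ma_rollback_jobs_on_node(int node);    /* hypothetical recovery hook */

void ma_record_sa_report(int node)          /* called on every SA update */
{
    nodes[node].active = true;
    nodes[node].last_report = time(NULL);
}

void ma_check_for_failed_nodes(void)        /* called periodically */
{
    time_t now = time(NULL);
    for (int n = 0; n < MAX_NODES; n++) {
        if (!nodes[n].active)
            continue;
        if (now - nodes[n].last_report > SA_TIMEOUT) {
            /* Heavily overloaded or dead: either way, reclaim and roll back. */
            nodes[n].active = false;
            ma_rollback_jobs_on_node(n);
        }
    }
}
```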

III. BENCHMARKS AND TESTS

A. Ease of Use
As an example, an existing computational fluid dynamics simulation was obtained because it was large and complex, having a total data size of around 1 GByte and about 13,000 lines of Fortran source code. The simulation had already been parallelized, coded in MPI, and tested on a parallel computer for correctness. With no modifications to the source code, the program was relinked and run completely and correctly under Hector. This highlights Hector's ability to run real, existing MPI programs.

B. Task Migration
Two different task migration mechanisms were proposed, implemented, and tested. The first used a core dump to write out a program's state for transfer to a different machine. The second transferred the information directly over a socket connection. In order to compare their relative speeds, tasks of different sizes were migrated 60 times each between two SPARCstation 10s connected by ordinary 10 Mbit/sec Ethernet. The tests were run during normal daily operations at the Engineering Research Center [25]. The results are shown below in Figure 4.

[Figure 4: Time to migrate tasks of different sizes. The plot compares time to migrate (seconds) against program size (kbytes) for core file transfer and direct network transfer.]


More information

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1 Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. File System Implementation FILES. DIRECTORIES (FOLDERS). FILE SYSTEM PROTECTION. B I B L I O G R A P H Y 1. S I L B E R S C H AT Z, G A L V I N, A N

More information

Chapter 12 File-System Implementation

Chapter 12 File-System Implementation Chapter 12 File-System Implementation 1 Outline File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery Log-Structured

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

Introduction CHAPTER. Exercises

Introduction CHAPTER. Exercises 1 CHAPTER Introduction Chapter 1 introduces the general topic of operating systems and a handful of important concepts (multiprogramming, time sharing, distributed system, and so on). The purpose is to

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Computer System Overview

Computer System Overview Computer System Overview Introduction A computer system consists of hardware system programs application programs 2 Operating System Provides a set of services to system users (collection of service programs)

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (3 rd Week) (Advanced) Operating Systems 3. Process Description and Control 3. Outline What Is a Process? Process

More information

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk Expressing Fault Tolerant Algorithms with MPI-2 William D. Gropp Ewing Lusk www.mcs.anl.gov/~gropp Overview Myths about MPI and Fault Tolerance Error handling and reporting Goal of Fault Tolerance Run

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Chapter 10: File System Implementation

Chapter 10: File System Implementation Chapter 10: File System Implementation Chapter 10: File System Implementation File-System Structure" File-System Implementation " Directory Implementation" Allocation Methods" Free-Space Management " Efficiency

More information

CSE380 - Operating Systems. Communicating with Devices

CSE380 - Operating Systems. Communicating with Devices CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

CPS221 Lecture: Operating System Protection

CPS221 Lecture: Operating System Protection Objectives CPS221 Lecture: Operating System Protection last revised 9/5/12 1. To explain the use of two CPU modes as the basis for protecting privileged instructions and memory 2. To introduce basic protection

More information

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Operating Systems Fundamentals. What is an Operating System? Focus. Computer System Components. Chapter 1: Introduction

Operating Systems Fundamentals. What is an Operating System? Focus. Computer System Components. Chapter 1: Introduction Operating Systems Fundamentals Overview of Operating Systems Ahmed Tawfik Modern Operating Systems are increasingly complex Operating System Millions of Lines of Code DOS 0.015 Windows 95 11 Windows 98

More information

Chapter 1: Distributed Information Systems

Chapter 1: Distributed Information Systems Chapter 1: Distributed Information Systems Contents - Chapter 1 Design of an information system Layers and tiers Bottom up design Top down design Architecture of an information system One tier Two tier

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Distributed System Chapter 16 Issues in ch 17, ch 18

Distributed System Chapter 16 Issues in ch 17, ch 18 Distributed System Chapter 16 Issues in ch 17, ch 18 1 Chapter 16: Distributed System Structures! Motivation! Types of Network-Based Operating Systems! Network Structure! Network Topology! Communication

More information

Multiprocessor and Real-Time Scheduling. Chapter 10

Multiprocessor and Real-Time Scheduling. Chapter 10 Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

Chapter 18 Distributed Systems and Web Services

Chapter 18 Distributed Systems and Web Services Chapter 18 Distributed Systems and Web Services Outline 18.1 Introduction 18.2 Distributed File Systems 18.2.1 Distributed File System Concepts 18.2.2 Network File System (NFS) 18.2.3 Andrew File System

More information

File-System Structure

File-System Structure Chapter 12: File System Implementation File System Structure File System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery Log-Structured

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1 ITS 225: Operating Systems Operating Systems Lecture 1 Introduction & Overview Jan 15, 2004 Dr. Matthew Dailey Information Technology Program Sirindhorn International Institute of Technology Thammasat

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2 Chapter 1: Distributed Information Systems Gustavo Alonso Computer Science Department Swiss Federal Institute of Technology (ETHZ) alonso@inf.ethz.ch http://www.iks.inf.ethz.ch/ Contents - Chapter 1 Design

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

RealTime. RealTime. Real risks. Data recovery now possible in minutes, not hours or days. A Vyant Technologies Product. Situation Analysis

RealTime. RealTime. Real risks. Data recovery now possible in minutes, not hours or days. A Vyant Technologies Product. Situation Analysis RealTime A Vyant Technologies Product Real risks Data recovery now possible in minutes, not hours or days RealTime Vyant Technologies: data recovery in minutes Situation Analysis It is no longer acceptable

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Chapter-1: Exercise Solution

Chapter-1: Exercise Solution Chapter-1: Exercise Solution 1.1 In a multiprogramming and time-sharing environment, several users share the system simultaneously. This situation can result in various security problems. a. What are two

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Module 16: Distributed System Structures

Module 16: Distributed System Structures Chapter 16: Distributed System Structures Module 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology Communication Structure Communication

More information

OPERATING- SYSTEM CONCEPTS

OPERATING- SYSTEM CONCEPTS INSTRUCTOR S MANUAL TO ACCOMPANY OPERATING- SYSTEM CONCEPTS SEVENTH EDITION ABRAHAM SILBERSCHATZ Yale University PETER BAER GALVIN Corporate Technologies GREG GAGNE Westminster College Preface This volume

More information

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that:

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that: Distributed Systems Overview Distributed Systems September 2002 1 Distributed System: Definition A distributed system is a piece of software that ensures that: A collection of independent computers that

More information

The modularity requirement

The modularity requirement The modularity requirement The obvious complexity of an OS and the inherent difficulty of its design lead to quite a few problems: an OS is often not completed on time; It often comes with quite a few

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

Multiprocessor Scheduling

Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

Rule partitioning versus task sharing in parallel processing of universal production systems

Rule partitioning versus task sharing in parallel processing of universal production systems Rule partitioning versus task sharing in parallel processing of universal production systems byhee WON SUNY at Buffalo Amherst, New York ABSTRACT Most research efforts in parallel processing of production

More information

Admin Plus Pack Option. ExecView Web Console. Backup Exec Admin Console

Admin Plus Pack Option. ExecView Web Console. Backup Exec Admin Console WHITE PAPER Managing Distributed Backup Servers VERITAS Backup Exec TM 9.0 for Windows Servers Admin Plus Pack Option ExecView Web Console Backup Exec Admin Console VERSION INCLUDES TABLE OF CONTENTS STYLES

More information

Module 16: Distributed System Structures. Operating System Concepts 8 th Edition,

Module 16: Distributed System Structures. Operating System Concepts 8 th Edition, Module 16: Distributed System Structures, Silberschatz, Galvin and Gagne 2009 Chapter 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

TANDBERG Management Suite - Redundancy Configuration and Overview

TANDBERG Management Suite - Redundancy Configuration and Overview Management Suite - Redundancy Configuration and Overview TMS Software version 11.7 TANDBERG D50396 Rev 2.1.1 This document is not to be reproduced in whole or in part without the permission in writing

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

Disks and I/O Hakan Uraz - File Organization 1

Disks and I/O Hakan Uraz - File Organization 1 Disks and I/O 2006 Hakan Uraz - File Organization 1 Disk Drive 2006 Hakan Uraz - File Organization 2 Tracks and Sectors on Disk Surface 2006 Hakan Uraz - File Organization 3 A Set of Cylinders on Disk

More information

Part I Overview Chapter 1: Introduction

Part I Overview Chapter 1: Introduction Part I Overview Chapter 1: Introduction Fall 2010 1 What is an Operating System? A computer system can be roughly divided into the hardware, the operating system, the application i programs, and dthe users.

More information

Application generators: a case study

Application generators: a case study Application generators: a case study by JAMES H. WALDROP Hamilton Brothers Oil Company Denver, Colorado ABSTRACT Hamilton Brothers Oil Company recently implemented a complex accounting and finance system.

More information

Distributed Systems Operation System Support

Distributed Systems Operation System Support Hajussüsteemid MTAT.08.009 Distributed Systems Operation System Support slides are adopted from: lecture: Operating System(OS) support (years 2016, 2017) book: Distributed Systems: Concepts and Design,

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information