The Hector Distributed Run Time Environment


The Hector Distributed Run Time Environment*
A Manuscript Submitted to the IEEE Transactions on Parallel and Distributed Systems

Dr. Samuel H. Russ, Jonathan Robinson, Dr. Brian K. Flachs, and Bjorn Heckel
russ@erc.msstate.edu, jon@aue.com, flachs@umunhum.stanford.edu, heckel@durango.cipic.ucdavis.edu

Abstract
Harnessing the computational capabilities of a network of workstations promises to offload work from overloaded supercomputers onto largely idle resources overnight. Several capabilities are needed to do this, including support for an architecture-independent parallel programming environment, task migration, automatic resource allocation, and fault tolerance. The Hector distributed run time environment is designed to present these capabilities transparently to programmers. MPI programs can be run under this environment with no modifications to their source code. The design of Hector, its internal structure, and several benchmarks and tests are presented.

Index Terms: Parallel Computing, Load Balancing, Fault Tolerance, Resource Allocation, Task Migration

I. INTRODUCTION AND PREVIOUS WORK

A. Networks of Workstations
Networked workstations have been available for many years. Since a modern network of workstations represents a total computational capability on the order of a supercomputer, there is strong motivation to use a network of workstations (NOW) as a type of low-cost supercomputer. Note that a typical institution has access to a variety of computer resources that are network interconnected. These range from workstations to shared memory multiprocessors to high-end parallel supercomputers. In the context of this paper, a network of workstations or NOW is considered to be a network of heterogeneous computational resources, some of which may actually be workstations.

A run time system for parallel programs on a NOW must have several important properties. First, scientific programmers need a way to run supercomputer-class programs on such a system with little or no source code modification. The most common way to do this is to use an architecture-independent parallel programming standard to code the applications. This permits the same program source code to run on a NOW, a shared memory multiprocessor, and the latest generations of parallel supercomputers, for example. Two major message passing standards have emerged that can do this, PVM [1],[2] and MPI [3], as well as numerous distributed shared memory (DSM) systems.

Second, the ability to run large jobs on networked resources only proves attractive to workstation users if their individual workstations can still be used for more mundane tasks, such as word processing.

* This work was funded in part by NSF Grant No. EEC, Amendment 021, and by ONR Grant No. N.

Task migration is needed to offload work from a user's workstation and return the station to its owner. This ability also permits dynamic load balancing and fault tolerance. (Note that, in this paper, a task is a piece of a parallel program, and a program or job is a complete parallel program. In the message passing model used here, a program is always decomposed into multiple communicating tasks. A platform or host is a computer that can run jobs if idle, and can range from a workstation to an SMP.)

Third, a run time environment for NOWs needs the ability to track the availability and relative performance of resources as programs run, because it needs this information to conduct ongoing performance optimizations. This is true for several reasons. The relative speed of nodes in a network of workstations, even of homogeneous workstations, can vary widely. Workstation availability is a function of individual users, and users can create an external load that must be worked around. Programs themselves may run more efficiently if redistributed in certain ways.

Fourth, fault tolerance is extremely important in systems that involve dozens to hundreds of workstations. For example, if there is a 1% chance that a single workstation will go down overnight, then there is only a (0.99)^75 ≈ 47% chance that a network of 75 stations will stay up overnight.

A complete run time system for NOW computing must therefore include an architecture-independent coding interface, task migration, automatic resource allocation, and fault tolerance. It is also desirable for this system to support all of these features with no source code modifications. These individual components already exist in various forms, and a review of them is in order. It is the goal of this work to combine these individual components into a single system.

B. Parallel Programming Standards
A wide variety of parallel and distributed architectures exist today to run parallel programs. Varying in their degree of interconnection, interconnection topology, bandwidth, latency, and scale of geographic distribution, they offer a wide range of performance and cost trade-offs and of applicability to different classes of problems. They can generally be characterized on the basis of their support for physical shared memory. Two major classes of parallel programming paradigms have emerged from the two major classes of parallel architectures. The shared memory model has its origins in programming tightly coupled processors and offers relative convenience. The message passing model is well suited to loosely coupled architectures and coarse-grained applications. Programming models that implement these paradigms can further be classified by their manifestation either as extensions to existing programming languages or as custom languages.

Because our system is intended for a network of workstations, it was felt that a message passing parallel programming paradigm more closely reflected the underlying physical structure. It was also felt that this paradigm should be expressed in existing scientific programming languages in order to draw on an existing base of scientific programmers. Both the PVM and MPI standards support these goals, and the MPI parallel programming environment was selected. Callable from C or FORTRAN programs, it provides a robust set of extensions that can send and receive messages between tasks working in parallel [3].
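To make concrete what "no source code modifications" means in practice, the following is an ordinary MPI 1.1 program of the kind Hector is designed to run unchanged. Nothing in it is Hector-specific; it is offered only as an illustration of the programming interface, not as code from the Hector distribution.

```c
/* An ordinary MPI 1.1 program; nothing here is Hector-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, src;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Task 0 collects one integer from every other task. */
        for (src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("task 0 received %d from task %d\n", value, src);
        }
    } else {
        int value = rank * rank;   /* stand-in for a real computation */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```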
While a discussion of the relative merits of PVM and MPI is outside the scope of this paper, this decision was partially driven by ongoing work on MPI implementations that had already been conducted at Mississippi State and Argonne National Laboratory [4]. A more detailed discussion of the taxonomy of parallel paradigms and systems, and of the rationale for our decision, can be found in [5].

One desirable property of a run time system is for its services to be offered transparently to applications programmers. Programs written to a common programming standard should not have to be modified to gain access to more advanced run time system features. This level of transparency permits programs to maintain conformity to the programming standard and simplifies code development.

C. Cluster Computing Systems
Systems that can harness the aggregate performance of networked resources have been under development for quite some time. For example, one good review of cluster computing systems, conducted in 1995 and prepared at Syracuse University, listed 7 commercial and 13 research systems [6]. The results of the survey, along with a comparison to the Hector environment discussed in this paper, are summarized in [7]. It was found that the Hector environment had as many features as some of the other full-featured systems (such as Platform Computing's LSF, Florida State's DQS, IBM's LoadLeveler, Genias' Codine, and the University of Wisconsin-Madison's Condor), and that its support for the simultaneous combination of programmer-transparent task migration and load balancing, fully automatic and dynamic load balancing, support for MPI, and job suspension was unique. Additionally, the commitment to programmer transparency has led to the development of extensive run time information gathering about programs as they run, and so the breadth and depth of the information that is gathered is unique. It should also be added that Hector's support for typical commercial features, such as GUI and configuration management tools, is noticeably lacking, as Hector is a research project.

It should also be mentioned at this point that there are (at least) two other research projects using the name Hector doing work in distributed computing and multiprocessing. The first is the well-known Hector multiprocessor project at the University of Toronto [8],[9]. The second is a system for supporting distributed objects in Python at the CRC for Distributed Systems Technology at the University of Queensland [10]. The Hector environment described in this paper is unrelated to either.

Three other research systems can allocate tasks across NOWs and have some degree of support for task migration. Figure 1 summarizes these systems.

[Figure 1: Comparison of Existing Task Allocators. The feature matrix compares Condor/CARMI, Prospero (PRM), MIST, and DQS on the following criteria: fully dynamic processor allocation and reallocation; stopping only the task under migration; user-transparent load balancing; user-transparent fault tolerance; support for MPI; no source code modifications to existing MPI/PVM programs; and use of an existing operating system.]

One such system is a special version of Condor [11], named CARMI [12]. CARMI can allocate PVM tasks across idle workstations and can migrate PVM tasks as workstations fail or become non-idle. It has two limitations. First, it cannot claim newly available resources. For example, it does not attempt to move work onto workstations that have become idle. Second, it checkpoints all tasks when one task needs to migrate [13]. Stopping all tasks when only one task needs to migrate slows program execution unnecessarily, since only the migrating task actually needs to be stopped.

Another automated allocation environment is the Prospero Resource Manager, or PRM [14]. Each parallel program run under PRM has its own job manager. Custom written for each program, the job manager acts like a purchasing agent and negotiates the acquisition of system resources as additional resources are needed. PRM is scheduled to use elements of Condor to support task migration and checkpointing, and uses information gathered from the job and node managers to reallocate resources.

Notice that use of PRM requires a custom allocation program for each parallel application, and future versions may require modified operating systems and kernels.

The MIST system is intended to integrate several development efforts and develop an automated task allocator [15]. Because it uses PRM to allocate tasks, the user must custom build the allocation scheme for each program. MIST is built on top of PVM, and PVM's support of indirect communications can potentially lead to administrative overhead, such as message forwarding, when a task has been migrated [16]. The implementation of MPI that Hector uses, with a globally available list of task hosts, does not incur this overhead. (Note that MPI does not have indirect communications, and so Hector does not have any overhead to support it.) Every task sends its messages directly to the receiving task, and the only overhead required after a task has migrated is to notify every other task of the new location. As will be shown below, this process has very little overhead, even for large parallel applications.

The Distributed Queuing System, or DQS, is designed to manage jobs across multiple computers simultaneously [17]. It can support one or more queue masters, which process user requests for resources on a first-come, first-served basis. Users prepare small batch files to describe the type of machine needed for particular applications. (For example, the application may require a certain amount of memory.) Thus resource allocation is performed as jobs are started. DQS currently has no built-in support for task migration or fault tolerance, but it can issue commands to applications that can migrate and/or checkpoint themselves. It supports both PVM and MPI applications.

The differences between these systems and Hector highlight two of the key differences among cluster computing systems. First, there is a trade-off between task migration mechanisms that are programmer written and those that are supported automatically by the environment. Second, there is a trade-off between centralized and decentralized decision making and information gathering.

D. Task Migration in the Context of Networked Resources
Two strategies have emerged for creating a program's checkpoint or task migration image. First, checkpointing routines can be written by the application programmer. Presumably he or she is sufficiently familiar with the program to know when checkpoints can be created efficiently (for example, places where the running state is relatively small and free of local variables) and to know which variables are needed to create a complete checkpoint. Second, checkpointing routines can transfer the entire program state automatically onto another machine. The entire address space and registers are copied and carefully restored.

By way of comparison, user-written checkpointing routines have some inherent space advantages, because the state's size is inherently minimized, and they may have cross-platform compatibility advantages if the state is written in some architecture-independent format. User-written routines have two disadvantages. First, they add a coding burden to the programmer, who must not only write but also maintain the checkpointing routines. Second, checkpoints are only available at certain, predetermined places in the program. Thus the program cannot be checkpointed immediately on demand.
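As an illustration of the user-written style (not taken from any of the systems above), a checkpoint routine might look like the following sketch; the file name, the grid array, and the iteration counter are invented for the example.

```c
/* Illustrative sketch only: a user-written checkpoint routine.  The
 * programmer picks a quiet point in the computation and writes just the
 * variables needed to restart. */
#include <stdio.h>

#define NX 512
#define NY 512

int write_user_checkpoint(const char *path,
                          const double grid[NX][NY], int iteration)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* Portability (endianness, word sizes) is the programmer's problem in
     * this style; an architecture-independent format would be written here. */
    if (fwrite(&iteration, sizeof(iteration), 1, fp) != 1 ||
        fwrite(grid, sizeof(double), (size_t)NX * NY, fp) != (size_t)NX * NY) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```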
It would appear from the Syracuse survey of systems [6] that most commercial systems support only user-written checkpointing for checkpointing and migration. One guesses that this is so because user-written checkpointing is much easier for resource management system developers: responsibility for correct checkpointing is transferred to the applications programmer. As noted in the earlier discussion, research systems such as PRM and DQS also have user-written checkpointing, and at least one system (not discussed in the Syracuse report) has used this capability to perform cross-architecture task migration [18].

Condor and Hector use the complete state transfer method. This form of state transfer inherently works only across homogeneous platforms, because it involves the actual replacement of a program's state.

It is also completely transparent to the programmer, requiring no modifications or additions to the source code. An alternate approach is to modify the compiler in order to retain the necessary type information and pointer information. These two pieces of information (which data structures contain pointers, and what the pointers point to) are needed if migration is to be accomplished across heterogeneous platforms. At least one such system (the MSRM Library at Louisiana State University) has been implemented [19]. The MSRM approach may make automatic, cross-platform migration possible, at the expense of requiring custom compilers and of increased migration time.

Once a system has a correct and consistent task migration capability, it is simple to add checkpointing. By having tasks transfer their state to disk (instead of to another task) it becomes possible to create a checkpoint of the program. This can be used for fault recovery and for job suspension. Thus both Hector and Condor provide checkpoint and rollback capability.

However task migration is accomplished, there are two technical issues to deal with. First, the program's state must be transferred completely and correctly. Second, in the case of parallel programs, any communications in progress must continue consistently. That is, the tasks must agree on the status of communications among themselves. Hector's solutions to these issues are discussed in Section II.B.

E. Automatic Resource Allocation
There is a trade-off between a centralized allocation mechanism, in which all tasks and programs are scheduled centrally and the policy is centrally designed and administered, and a competitive, distributed model, in which programs compete and/or bid for needed resources and the bidding agents are written as part of the individual programs. Besides some of the classic trade-offs between centralized and distributed processing (such as overhead and scalability), there is an implied trade-off between the degree of support required of the applications programmer and the intelligence with which programs can acquire the resources they need. The custom-written allocation approach places a larger burden on the applications programmer, but permits more well-informed acquisition of needed resources. Since the overall goal of Hector is to minimize programmer burden, it does not use any a priori information or any custom-written allocation policies. This is discussed further in Section II.C.

The degree to which a priori applications information can boost run time performance has been explored for some time [20]. For example, Nguyen et al. have shown that extracting run time information can be minimally intrusive and can substantially improve the performance of a parallel job scheduler [21]. Their approach used a combination of software and special-purpose hardware on a KSR-1 parallel computer to measure a program's speedup and efficiency, and then used that information to improve program performance. However, Nguyen's work is only relevant for applications that can vary their own number of tasks in response to some optimization. Many parallel applications are launched with a specific number of tasks that does not vary as the program runs. Additionally, it requires the use of the KSR's special-purpose timing hardware. Gibbons proposed a simpler system to correlate run times to different job queues [22].
Even this relatively coarse measurement was shown to improve scheduling, as it permits a scheduling model that more closely approaches the well-known Shortest Job First (SJF) algorithm. Systems can develop reasonably accurate estimates of a job's completion time based on historical traces of other programs submitted to the same queue. Since this information is coarse and gathered historically, it cannot be used to improve the performance of a single application at run time. (It can, however, improve the efficiency of the scheduler that manages several jobs at once.)

Some recent results by Feitelson and Weil have shown the surprising result that user estimates of run time can make the performance of a job scheduler worse than the performance with no estimates at all [23]. While the authors concede that additional work is needed in the area, it does highlight that user-supplied information can be unreliable, which is an additional reason why Hector does not use it.

These approaches have shown the ability of detailed performance information to improve job scheduling. However, to summarize, they have several shortcomings. First, some of them require special-purpose hardware. Second, some systems require user modifications to the applications program in order to keep track of relevant run time performance information. Third, the information that is gathered is relatively coarse. Fourth, some systems require applications that can dynamically alter the number of tasks in use. Fifth, user-supplied information can be not only inaccurate but also misleading. The goal of Hector's resource allocation infrastructure is to overcome these shortcomings.

There is another trade-off in the degree of support for dynamically changing workloads and computational resource availability. The ideal NOW distributed run time system can automatically allocate jobs to available resources and move them around during the program run, both in order to maximize performance and in order to release workstations back to their users. Current clustering systems support this goal to varying degrees. For example, some systems launch programs when enough resources (such as enough processors) become available. This is the approach taken by IBM's LoadLeveler, for example [6]. Other systems can migrate jobs as workstations become busy, such as Condor [11]. It appears that, as of the time of the Syracuse survey, only Hector attempts to move jobs onto newly idle resources as well.

F. Goals and Objectives of Hector
Because of the desire to design a system that supports fully transparent task migration, fully automatic and dynamic resource allocation, and transparent fault tolerance, the Hector distributed run time environment is now being developed and tested at Mississippi State University. These requirements necessitated the development of a task migration method and a modified MPI implementation that would continue correct communications during task migration. A run time infrastructure that could gather and process run time performance information was simultaneously created. The primary aim of this paper is to discuss these steps in more detail, as well as the steps needed to add support for fault tolerance.

Hector is designed to use a central decision maker, called the master allocator or MA, to perform the optimizations and make allocation decisions. It uses a small task running on each candidate processor to detect idle resources and monitor the performance of programs during execution. These tasks are called slave allocators or SAs. The amount of overhead associated with an SA is an important design consideration. An individual SA currently updates its statistics every 5 seconds. (This time interval is a compromise between timeliness and overhead.) This process takes about 5 ms on a Sun SPARCstation 5, and so corresponds to an extra CPU load of 0.1% [24]. Reading a task's CPU usage adds 581 μs per task every time the SA updates its statistics (every 5 s). Reading this detailed usage information therefore adds about 0.01% CPU load per task. For example, an SA supervising 5 MPI tasks will add a CPU load of about 0.15%.
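For concreteness, the overhead figures quoted above combine as follows (using the paper's rounded per-task value of roughly 0.01%):

\[
\frac{5\ \text{ms}}{5\ \text{s}} = 0.1\%, \qquad
\frac{581\ \mu\text{s}}{5\ \text{s}} \approx 0.012\% \approx 0.01\% \text{ per task}, \qquad
0.1\% + 5 \times 0.01\% \approx 0.15\%.
\]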

When an MPI program is launched, individual tasks are allocated to available machines and migrated as needed. The SAs and the MA communicate to maintain awareness of the state of all running programs. The structure is diagrammed below in Figure 2.

[Figure 2: Structure of Hector Running MPI Programs. The master allocator exchanges commands and system information with the slave allocators, and each slave allocator exchanges commands and performance information with its local MPI tasks.]

Key design features and the design process are described below in Section II, and benchmarks and tests that measure Hector's performance are described in Section III. The paper concludes with a discussion of future plans.

II. GOALS, OBSTACLES, AND ACCOMPLISHMENTS

A. Ease of Use
A system must be easy to use if it is to gain widespread acceptance. In this context, ease of use can be supported in two different ways. First, adherence to existing, widely accepted standards allows programmers to use the environment with a minimal amount of extra training. Second, the complexities of task allocation and migration and of fault tolerance should be hidden from unsophisticated scientific programmers. That is, scientific programmers should be able to write their programs and submit them to the resource management system without having to provide additional information about their programs.

1. Using Existing Standards
Hector runs on existing workstations and SMPs using existing operating systems, currently Sun systems running SunOS or Solaris and SGI systems running Irix. Several parts of the system, such as task migration and correctness of socket communications, would be simpler to support if modifications were made to the operating system. However, this would dramatically limit the usefulness of the system in using existing resources, and so the decision was made not to modify the operating system.

The MPI and PVM standards provide architecture-independent parallel coding capability in both C and FORTRAN. MPI and PVM are supported on a wide and growing body of parallel architectures ranging from networks of workstations to high-end SMPs and parallel mainframes. Since these represent systems that have gained and are gaining widespread acceptance, there already exists a sizable body of programmers that can use them. Hector supports MPI as its coding standard.

2. Total Transparency of Task Allocation and Fault Tolerance
Experience at Mississippi State indicates that most scientific programmers are unwilling (or unable) to provide such information as program size, estimated run time, or communications topology.

This situation exists for two reasons. First, such programmers are solving a physical problem, and so programming is a means to an end. Second, they may not have enough detailed knowledge about the internal workings of computers to provide information useful to computer engineers and scientists. Hector is therefore designed to operate with no a priori knowledge of the program to be executed. This considerably complicates the task allocation process, but it is an almost necessary step to promote transparency of task allocation to the programmer and, as a result, ease of use for the scientific programmer. Although this is not currently supported, future versions of Hector may be able to benefit from user-supplied a priori information.

A new implementation of MPI, named MPI-TM, has been created to support task migration [25] and fault tolerance. MPI-TM is based on the MPICH implementation of MPI [4]. In order to run with these features, a programmer merely has to re-link the application with the Hector-modified version of MPI. The modified MPI implementation and the Hector central decision maker handle allocation and migration automatically. The programmer simply writes a normal MPI program and submits it to Hector for execution.

Hector exists as the MPI-TM library and a collection of executables. The library is linked with applications and provides a self-migration facility, a complete MPI implementation, and an interface to the run time system. The executables include the SA, the MA, a text-based command line interface to the MA, and a rudimentary Motif-based GUI. Its installation is roughly as complicated as installing a new MPI implementation and a complete applications package.

3. Support for Multiple Platforms
Hector is supported on Sun computers running SunOS and Solaris and on SGI computers running Irix. The greatest obstacle under Solaris is its dynamic linker which, due to its ability to link at run time, can create incompatible versions of the same executable file. This creates the undesirable situation that migration is impossible between nearly, but not completely, identical machines, and has the consequence of dividing the Sun computers into many smaller clusters. This situation exists because of the combination of two factors. First, Hector performs automatic, programmer-transparent task migration without compiler modifications. Thus it cannot move pointers and must treat the program's state as an unalterable binary image. Second, dynamically linked programs may map system libraries and their associated data segments to different virtual addresses in runs of the same program on different machines. The solution adopted by Condor is to rewrite the linker (more accurately, to replace the Solaris dynamic linker with a custom-written one) to make migration of system library data segments possible [26]. This option is under consideration for Hector, but is not currently supported.

B. Task Migration
1. Correct State Transfer
The state of a running program, in a Unix environment, can be considered in six parts. First, the actual program text may be dynamically linked, and has references to data that may be statically or dynamically located. Second, the program's static data is divided into initialized and uninitialized sections. Third, any dynamically allocated data is stored in the heap. Fourth, the program's stack grows as subroutines and functions are called, and is used for locally visible data and dynamic data storage.
Fifth, the CPU contains internal registers, usually used to hold the results of intermediate calculations. Sixth, the Unix kernel maintains some system-level information about the program, such as file descriptors. This is summarized in Figure 3.

[Figure 3: State of a Program During Execution. The user-visible state comprises the CPU registers and user memory (text, static data, heap, and stack); kernel memory structures are not visible to the user, so wrapper functions keep track of kernel information in a place that is visible to the user.]
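A minimal sketch of a data structure capturing these six components is shown below; the field names are hypothetical and are not drawn from the Hector sources.

```c
/* Illustrative sketch only: a descriptor for the six components of task
 * state enumerated above. */
#include <sys/types.h>

struct task_state {
    /* 1. program text: identified rather than copied for statically
     *    linked code */
    char   executable_path[1024];

    /* 2. static data: initialized (.data) and uninitialized (.bss) bounds */
    void  *data_start, *data_end;
    void  *bss_start,  *bss_end;

    /* 3. heap: current break, so dynamically allocated data can be copied */
    void  *heap_start, *heap_end;

    /* 4. stack: only the live portion needs to be transferred */
    void  *stack_top,  *stack_bottom;

    /* 5. CPU registers, saved on entry to the migration signal handler */
    unsigned long regs[32];

    /* 6. kernel-held state, reconstructed from the wrapper functions */
    struct {
        int    fd;
        char   path[1024];
        off_t  offset;
        int    flags;
    } open_files[64];
    int n_open_files;
};
```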

The first five parts of the state can, in principle, be transferred between two communicating user-level programs. One exception occurs when programs are dynamically linked, as parts of the program text and data may not reside at the same virtual address in two different instantiations of the same program. As discussed above, this matter is under investigation.

The sixth part of a program's state, kernel-related information, is more difficult to transfer because it is invisible to a user-level program. This information may include file descriptors and pointers, signal handlers, and memory-mapped files. Without kernel source code, it is almost impossible to read these structures directly. If the operating system is unmodified, the solution is to create wrapper functions that let the program keep track of its own kernel-related structures. All user code that modifies kernel structures must pass through a trap interface. (Traps are the only way user-level code can execute supervisor-level functions.) The Unix SYSCALL.H file documents all of the system calls that use traps, and all other system calls are built on top of them. One can create a function with the same name and arguments as a system call, such as open(). The arguments to the function are passed into an assembly language routine that calls the system trap properly. The remainder of the function keeps track of the file descriptor, path name, permissions, and other such information. The lseek() function can keep track of the location of the file pointer. Calls that change the file pointer (such as read() and write()) also call the instrumented lseek(), so that file pointer information is updated automatically. This permits migrated tasks to resume reading and writing files at the proper place.

It was discovered that the MPI environment to which task migration was being added [5] also uses signals and memory mapping. (The latter is due to the fact that gethostbyname() makes a call to mmap().) All system calls that affect signal handling and memory mapping are therefore replaced with wrapper functions as well.
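The following sketch illustrates the wrapper-function idea for open(). It is not the MPI-TM code: the table name is invented, and the portable syscall() call stands in for the assembly-language trap routine described above.

```c
/* Illustrative sketch only: a wrapper around open() that records, in user
 * space, the information needed to reopen the file after migration. */
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_FDS 256

struct fd_record {                 /* user-visible copy of kernel file state */
    int    in_use;
    char   path[1024];
    int    flags;
    mode_t mode;
    off_t  offset;                 /* updated by wrapped lseek()/read()/write() */
};

static struct fd_record hector_fd_table[MAX_FDS];   /* hypothetical table */

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {         /* the third argument exists only here */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    /* Invoke the kernel trap directly, bypassing the C library stub. */
    int fd = (int)syscall(SYS_open, path, flags, mode);

    if (fd >= 0 && fd < MAX_FDS) { /* remember what only the kernel knows */
        hector_fd_table[fd].in_use = 1;
        strncpy(hector_fd_table[fd].path, path,
                sizeof(hector_fd_table[fd].path) - 1);
        hector_fd_table[fd].flags  = flags;
        hector_fd_table[fd].mode   = mode;
        hector_fd_table[fd].offset = 0;
    }
    return fd;
}
```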

The task migration system requires knowledge of a running program's image in a particular operating system, the development of a small amount of assembly language, and reliance on certain properties of signal handling, and these all affect the portability of the task migration system. The assembly language is needed because it is the only way to save and restore registers and call traps. Since the task migration routine runs inside a signal handler, it is also necessary for the restarted program to be able to exit the signal handler coherently. Other systems that perform a similar style of system-supported task migration, such as MIST and Condor, have also been ported to Linux, Alpha, and HP environments [27],[15]. This seems to indicate that this style of task migration is reasonably portable among Unix-based operating systems, probably because these operating systems have strong structural similarities. It is interesting to note that no system-level migration support for Windows NT based systems has been reported.

The exact sequence of steps involved in the actual state transfer is described in more detail in [25]. Two tests confirm this method's speed and stability, and are described below in Section III.

2. Keeping MPI Intact: A Task Migration Protocol
The state restoration process described above is not guaranteed to preserve communications on sockets. This is because, at any point in the execution of a program, fragments of messages may reside in the kernel's buffers on either the sending or the receiving side. The solution is to notify all tasks when a single task is about to migrate. Each task that is communicating with the task under migration sends an end-of-channel message to the migrating task and then closes the socket that connects them. Each task then marks the migrating task as under migration in its table of tasks, and attempts to initiate communications with it will block until migration is complete. Once the task under migration has received all of its end-of-channel messages, it can be assured that no messages are trapped in the kernel's buffers. That is, it knows that all messages reside in its data segment, and so it can be migrated safely. Once the state has been transferred, another global update is needed so that the other tasks know the new location and know that communications can be resumed. Tasks that are not migrating remain able to initiate connections and communicate with one another.

The MPI 1.1 standard (the original MPI standard) only permits static task tables. That is, the number of tasks used by a parallel program is fixed when the program is launched. (It is important to note that the static number of tasks in a program is an MPI 1.1 limit, not a limitation of Hector. This is also one of the important differences between PVM and MPI 1.1.) Thus updates to this table do not require synchronization with MPI and do not confuse an MPI program. The MPI-2 standard (a newer standard currently in development) permits dynamically changing task tables, but, with proper use of critical sections, task migration will not interfere with programs written under the MPI-2 standard.

A series of steps is needed to update the task table globally and atomically. Hector's MA and SAs are used to provide synchronization and machine-to-machine communications during migration and task termination. The exact sequence of steps required to synchronize tasks and update the communication status consistently is described in detail in [25]. It should be noted that if the MA crashes in the middle of a migration, the program will deadlock, because the MA is used for global synchronization and to guarantee inter-task consistency.
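A minimal sketch of the end-of-channel drain described above, from the migrating task's side, is given below. The message types, task-table layout, and helper functions are hypothetical stand-ins for MPI-TM internals.

```c
/* Illustrative sketch only: the end-of-channel drain performed by the task
 * about to migrate. */
#include <unistd.h>

enum task_status { TASK_UP, TASK_UNDER_MIGRATION, TASK_TERMINATED };
enum msg_type    { MSG_END_OF_CHANNEL, MSG_USER_DATA };

struct task_entry {
    int socket_fd;                        /* -1 when no connection is open */
    enum task_status status;
};

struct control_msg { enum msg_type type; int sender; /* payload omitted */ };

extern struct task_entry task_table[];    /* one entry per MPI rank */
extern int N_TASKS, MY_RANK;

struct control_msg recv_control_message(void);     /* blocking receive */
void enqueue_user_message(struct control_msg m);    /* buffer MPI traffic */

/* Called in the migrating task once the MA has announced the migration. */
void drain_channels_before_migration(void)
{
    int rank, eoc_pending = 0;

    /* Every peer with an open connection owes this task one EOC message. */
    for (rank = 0; rank < N_TASKS; rank++)
        if (rank != MY_RANK && task_table[rank].socket_fd != -1)
            eoc_pending++;

    /* Keep receiving until every in-flight byte has been pulled out of    */
    /* the kernel's socket buffers and into this task's own data segment.  */
    while (eoc_pending > 0) {
        struct control_msg m = recv_control_message();
        if (m.type == MSG_END_OF_CHANNEL) {
            close(task_table[m.sender].socket_fd);
            task_table[m.sender].socket_fd = -1;
            eoc_pending--;
        } else {
            enqueue_user_message(m);      /* ordinary message, keep it */
        }
    }
    /* All messages now reside in user memory; the state can be shipped. */
}
```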
3. Task Termination Protocol
Task termination presents another complication. If a task is migrating while or after another task terminates, the task under migration never receives an end-of-channel message from the terminated task. Two measures are taken to provide correct program behavior. First, the MA limits each MPI program to only one migration or one termination at a time. It can do this because of the handshaking needed both to migrate and to terminate. Second, a protocol involving the SAs and the MA was developed to govern task termination; it is described below, and a sketch of the MA-side serialization follows the list.

1. A task preparing to terminate notifies its SA. The task can still receive and process table updates and requests for end-of-channel (EOC) messages, but it will block requests to migrate. It cannot be allowed to migrate, so that the MA can send it a termination signal.
2. The SA notifies the MA that the task is ready to terminate.
3. Once all pending migrations and terminations have finished, the MA notifies the SA that the task has permission to terminate. It then blocks (and enqueues) further termination and migration requests until this termination has ended.
4. The SA notifies the task.
5. The task sends the SA a final message before exiting.
6. The SA notifies the MA that the task is exiting, so that the MA can permit other migrations and terminations.
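The following sketch suggests how the MA might enforce the one-transition-at-a-time rule; the types and function names are invented, and the real MA's bookkeeping is certainly more involved.

```c
/* Illustrative sketch only: serializing migrations and terminations in the
 * MA so that at most one transition per MPI program is in progress. */
#include <stdbool.h>

enum req_type { REQ_MIGRATE, REQ_TERMINATE };

struct request { enum req_type type; int job_id; int task_rank; };

struct job_state {
    bool transition_in_progress;   /* one migration OR termination at a time */
    struct request pending[128];   /* requests queued while busy */
    int  n_pending;
};

void sa_grant_termination(int job_id, int task_rank);  /* step 3 message */
void ma_start_migration(int job_id, int task_rank);    /* begins a migration */

/* Called when an SA reports that a task is ready to terminate (step 2). */
void ma_handle_termination_request(struct job_state *job, struct request r)
{
    if (job->transition_in_progress) {
        job->pending[job->n_pending++] = r;        /* enqueue until free */
        return;
    }
    job->transition_in_progress = true;
    sa_grant_termination(r.job_id, r.task_rank);   /* step 3: permission */
}

/* Called when the SA reports that the task has exited (step 6), or when a
 * migration completes.  Order of service is simplified here. */
void ma_handle_transition_done(struct job_state *job)
{
    job->transition_in_progress = false;
    if (job->n_pending == 0)
        return;
    struct request next = job->pending[--job->n_pending];
    job->transition_in_progress = true;
    if (next.type == REQ_TERMINATE)
        sa_grant_termination(next.job_id, next.task_rank);
    else
        ma_start_migration(next.job_id, next.task_rank);
}
```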

Notice that an improperly written program may attempt to communicate with a task after that task has ended. In the world of message passing parallel programming, this is a programmer's mistake. The behavior of the program is undefined at this point, and the program itself will deadlock under Hector. (The program deadlocks, not Hector.)

4. Minimizing Migration Time
The operating system already has one mechanism for storing a program's state. A core dump creates a file that contains a program's registers, data segment, and stack. The first version of state transfer used this capability to move programs around. There are two advantages to this approach. First, it is built into the operating system. Second, there are symbolic debuggers and other tools that can extract useful information from core files. There are also disadvantages. First, multiple network transfers are needed if the disk space is shared over a network. This means that the state is actually copied multiple times. Second, the speed of transfer is limited further by the speed of the disk and by other, unrelated programs sharing that disk.

One way around these shortcomings is to transfer the state directly over the network. Originally implemented by the MIST team [15], network state transfer overcomes these disadvantages. The information is written over the network in slightly modified core file format. (The only modification is that unused stack space is not transmitted. There is no other penalty for using the SunOS core file format.) The information is written over a network socket connection by the application itself, instead of being written to a file by the operating system. Notice that this retains the advantage of core file tool compatibility. Experiments show that it is over three times faster [25], as will be shown below.

C. Automatic Resource Allocation
1. Sources of Information
Hector's overall goal is to minimize programmer overhead. In the context of awareness of program behavior, this incurs the expense of not having access to potentially beneficial program-specific information. This approach was chosen based on experience with scientific programmers within the authors' research center, who are unwilling to invest time and effort in new systems because of the perceived burden of source code modifications. This approach dictates that Hector be able to operate with no a priori applications knowledge, which, in turn, increases the requirement for the depth and breadth of information that is gathered at run time. This lack of a priori information makes Hector's allocation decision making more difficult. However, experiments confirm that the information it is able to extract at run time can improve performance, and its ability to exploit newly idle resources is especially helpful.

2. Structure
There is a range of ways that resource allocation can be structured, from completely centralized to completely distributed. Hector's resource allocation uses features of both. The decision-making portion (and global synchronization) resides in the MA and is therefore completely centralized. The advantage of a single, central allocation decision maker is that it is easier to modify and test different allocation strategies. Since the Unix operating system will not permit signals to be sent between hosts, it is necessary to have an executive process running on each candidate host. Since such processes are necessary anyway, they can also be used to gather performance information.
Thus its information gathering and execution portions are fully distributed, being carried out by the SAs.

3. Collecting Information
There are two types of information that the master allocator needs in order to make decisions. First, it needs to know about the resources that are potentially available, such as which hosts to consider and how powerful they are.

Second, it needs to know how efficiently and to what extent these resources are being used, such as how much external (non-Hector) load there is and how much load the various MPI programs under its control are imposing.

The relative performance of each candidate host is determined by the slave allocator when it is started. (It does so by running the Livermore Loops [28], which measure floating point performance.) It is also possible for the slave allocator to measure disk availability and physical memory size, for example. This information is transmitted to the master allocator, which maintains its own centralized database of this information. Current resource usage is monitored by analyzing information from the kernel of each candidate processor. Allocation algorithms draw on idle time information, CPU time information, and the percentage of CPU time devoted to non-Hector-related tasks. The percentage of CPU time is used to detect external workload, such as an interactive user logging in, which is a criterion for automatic migration. The Hector MPI library contains additional, detailed self-instrumentation that logs the amount of computation and communication time each task expends. This data is gathered by the SAs through shared memory and is forwarded to the MA. A more detailed discussion of this agent-based approach to information gathering, as well as testing and results, may be found in [29].

4. Making Decisions
One of the primary advantages of this performance-monitoring approach is its ability to claim idle resources rapidly. As will be shown below, tests on busy workstations during the day show that migrating to newly available resources can reduce run time and promote more effective use of workstations. Further implementation and testing of more sophisticated allocation policies are also under way.

D. Fault Tolerance
The ability to migrate tasks in mid-execution can be used to suspend tasks. In fact, fault tolerance has historically been one major motivation for task migration. In effect, each task transfers its state into a file to checkpoint the program. When a node failure has been detected, the files can be used to roll the program back to the state of the last checkpoint. While all calculations between the checkpoint and the node failure are lost, the calculations up to the checkpoint are not, which may represent a substantial time savings. Also, known unrelated failures and/or routine maintenance may occur or be needed in the middle of a large program run, and so the ability to suspend tasks is helpful.

It can be shown that, in order to guarantee program correctness, all tasks must be checkpointed consistently [13]. That is, the tasks must be at a consistent point in their execution and in their message exchange status. For example, all messages in transit must be fully received, and transmission of any new messages must be suspended. As was the case with migration and termination, a series of steps is needed to checkpoint and to roll back parallel programs.

1. Checkpointing Protocol
The following steps are taken to checkpoint a program.
1. The MA decides to checkpoint a program for whatever reason. (This is currently supported as a manual user command, and may eventually be done on a periodic basis.) It waits until all pending migrations and terminations have finished, and then it notifies all tasks in the program, via the SAs, to prepare for checkpointing.
2. The tasks send end-of-channel (EOC) messages to all connected tasks, and then receive EOCs from all connected tasks. Again, this guarantees that there are no messages in transit.
3. Once all EOCs have been exchanged, each task notifies its SA that it is ready for checkpointing and informs the SA of the size of its state information. This information is passed on to the MA.
4. Once the MA has received confirmation from every task, it is ready to begin the actual checkpointing process. It notifies each task when that task is to begin checkpointing.

5. After each task finishes transmitting its state (or writing a file), it notifies the MA. Note that it is possible for more than one task to checkpoint at a time, and experiments with the ordering of checkpointing are described below.
6. After all tasks have checkpointed, the MA writes out a small bookkeeping file which contains state information pertinent to the MA and SAs. (For example, it contains the execution time to the point of checkpointing, so that the total execution time will be accurate if the job is rolled back to that checkpoint.)
7. The MA broadcasts either a Resume or a Suspend command to all tasks. The tasks either resume execution or stop, respectively. The former is used to create a backup copy of a task in the event of future node failure. The latter is used if it is necessary to remove a job temporarily from the system.

2. Rollback Protocol
The following steps are taken to roll back a checkpointed program.
1. The MA is given the name of a checkpoint file that provides all necessary information to restart the program.
2. It allocates tasks on available workstations, just as if the program were being launched.
3. Based on its allocation, it notifies the SA on the first machine.
4. The SA restarts the task from the state file, the name of which is found in the checkpoint file and sent to the SA.
5. The task notifies its SA that it restarted properly and waits for a table update.
6. Once the MA receives confirmation of one task's successful restart, it notifies the SA of the next task. It continues to do this until all tasks have restarted. Task rollback is sequential primarily for performance reasons: the file server that is reading the checkpoints and sending them over sockets to the newly launched tasks performs more efficiently if only one checkpoint is sent at a time.
7. As confirmation arrives at the MA, it builds a table similar to that used by the MPI tasks themselves. It lists the hostnames and Unix PIDs of all the tasks in the parallel program. Once all tasks have restarted, this table is broadcast to all tasks. Note that this broadcast occurs via the Hector run time infrastructure and is invisible to the MPI program. It does not use an MPI broadcast, as MPI is inactive during rollback.
8. Each task resumes normal execution once it receives its table update, and so the entire program is restarted.

3. The Checkpoint Server
As is the case with task migration, there are two ways to save a program's state. One way is for each program to write directly to a checkpoint file. The other way is to launch a checkpoint server on a machine with a large amount of physically mounted disk space. (The latter concept was first implemented by the Condor group [30].) Each task transmits its state via the network directly to the server, and the server writes the state directly to its local disk. The reason each task cannot write its state to its own local disk is obvious: if the machine crashes, the backup copy of the state would be lost as well. The checkpoint server method is expected to be faster, because it uses direct socket connections and local disk writes, which are more efficient than writing files over a network. Note that many of the local disk caching strategies used by systems like NFS do not work well for checkpoints, because checkpoint files are typically written once and not read back [30]. Different, novel scheduling strategies for checkpoint service are described and tested below.
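A sketch of the task-side transfer to such a checkpoint server is shown below; the function is illustrative only, and the server address, port, and state buffer are supplied by the caller rather than taken from the Hector sources.

```c
/* Illustrative sketch only: a task shipping its checkpoint image to a
 * checkpoint server over a plain TCP socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_checkpoint(const char *server_ip, int port,
                    const void *state, size_t state_len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons((unsigned short)port);
    if (inet_pton(AF_INET, server_ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* Stream the (core-file-format) image; the server writes it straight
     * to its local disk, avoiding network file system traffic. */
    const char *p = state;
    size_t left = state_len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) { close(fd); return -1; }
        p += n;
        left -= (size_t)n;
    }
    close(fd);
    return 0;
}
```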
4. Other Issues
The MA is not fault tolerant. That is, the MA represents a single point of failure. The SAs have been modified to terminate themselves, and the tasks running under them, if they lose contact with the MA. (This feature was intentionally added because total termination of programs distributed across dozens of workstations can be quite tedious unless it is automated.) If this feature is disabled, then the SAs and their tasks can continue working without the MA, although all task migrations and job launches will cease and job termination will deadlock. The checkpoints collected by the system will enable a job to be restarted after the MA and SAs have been restarted, and so the checkpoint-and-rollback based fault tolerance can tolerate a fault in the run time infrastructure itself. Another approach to this problem, and a way to support more rapid fault tolerance, would be to incorporate an existing group communication library (such as Isis or Horus [31]) and use its message duplication facility. One possible design is described in [32].

Means of rapid fault detection can also be added to future versions of Hector [32]. Each SA sends performance information to the MA periodically. (The current period is 5 seconds, which may grow as larger tests are performed.) If an SA does not send a performance update after some suitable timeout, it can be assumed that the node is not running properly, and all jobs on that node can be rolled back. This strategy will detect heavily overloaded nodes as well as catastrophically failed nodes.
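A timeout-based detector of the kind described here could be sketched as follows; the reporting period, grace period, and function names are assumptions, not Hector's actual values.

```c
/* Illustrative sketch only: timeout-based failure detection in the MA.
 * Each SA is expected to report every UPDATE_PERIOD seconds; if a report
 * is overdue by more than SA_TIMEOUT seconds, the node is treated as
 * failed and the jobs on it are rolled back. */
#include <stdbool.h>
#include <time.h>

#define UPDATE_PERIOD 5          /* seconds between SA reports (assumed) */
#define SA_TIMEOUT    30         /* grace period before declaring failure */
#define MAX_NODES     256

struct node_record {
    bool   active;
    time_t last_report;          /* wall-clock time of the last SA update */
};

static struct node_record nodes[MAX_NODES];

void ma_rollback_jobs_on_node(int node);    /* hypothetical recovery hook */

void ma_record_sa_report(int node)          /* called on every SA update */
{
    nodes[node].active = true;
    nodes[node].last_report = time(NULL);
}

void ma_check_for_failed_nodes(void)        /* called periodically */
{
    time_t now = time(NULL);
    for (int n = 0; n < MAX_NODES; n++) {
        if (!nodes[n].active)
            continue;
        if (now - nodes[n].last_report > SA_TIMEOUT) {
            /* Heavily overloaded or dead: either way, reclaim and roll back. */
            nodes[n].active = false;
            ma_rollback_jobs_on_node(n);
        }
    }
}
```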

III. BENCHMARKS AND TESTS

A. Ease of Use
As an example, an existing computational fluid dynamics simulation was obtained because it was large and complex, having a total data size of around 1 GByte and about 13,000 lines of Fortran source code. The simulation had already been parallelized, coded in MPI, and tested on a parallel computer for correctness. With no modifications to the source code, the program was relinked and run completely and correctly under Hector. This highlights Hector's ability to run real, existing MPI programs.

B. Task Migration
Two different task migration mechanisms were proposed, implemented, and tested. The first used a core dump to write out a program's state for transfer to a different machine. The second transferred the information directly over a socket connection. In order to compare their relative speeds, tasks of different sizes were migrated 60 times each between two SPARCstation 10s connected by ordinary 10 Mbit/sec Ethernet. The tests were run during normal daily operations at the Engineering Research Center [25]. The results are shown below in Figure 4.

[Figure 4: Time to migrate tasks of different sizes. The plot compares time to migrate (seconds) against program size (kbytes) for core file transfer and direct network transfer.]


More information

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1 Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. File System Implementation FILES. DIRECTORIES (FOLDERS). FILE SYSTEM PROTECTION. B I B L I O G R A P H Y 1. S I L B E R S C H AT Z, G A L V I N, A N

More information

Chapter 12 File-System Implementation

Chapter 12 File-System Implementation Chapter 12 File-System Implementation 1 Outline File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery Log-Structured

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

Introduction CHAPTER. Exercises

Introduction CHAPTER. Exercises 1 CHAPTER Introduction Chapter 1 introduces the general topic of operating systems and a handful of important concepts (multiprogramming, time sharing, distributed system, and so on). The purpose is to

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Computer System Overview

Computer System Overview Computer System Overview Introduction A computer system consists of hardware system programs application programs 2 Operating System Provides a set of services to system users (collection of service programs)

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (3 rd Week) (Advanced) Operating Systems 3. Process Description and Control 3. Outline What Is a Process? Process

More information

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk Expressing Fault Tolerant Algorithms with MPI-2 William D. Gropp Ewing Lusk www.mcs.anl.gov/~gropp Overview Myths about MPI and Fault Tolerance Error handling and reporting Goal of Fault Tolerance Run

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Chapter 10: File System Implementation

Chapter 10: File System Implementation Chapter 10: File System Implementation Chapter 10: File System Implementation File-System Structure" File-System Implementation " Directory Implementation" Allocation Methods" Free-Space Management " Efficiency

More information

CSE380 - Operating Systems. Communicating with Devices

CSE380 - Operating Systems. Communicating with Devices CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

WhatÕs New in the Message-Passing Toolkit

WhatÕs New in the Message-Passing Toolkit WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2

More information

CPS221 Lecture: Operating System Protection

CPS221 Lecture: Operating System Protection Objectives CPS221 Lecture: Operating System Protection last revised 9/5/12 1. To explain the use of two CPU modes as the basis for protecting privileged instructions and memory 2. To introduce basic protection

More information

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Operating Systems Fundamentals. What is an Operating System? Focus. Computer System Components. Chapter 1: Introduction

Operating Systems Fundamentals. What is an Operating System? Focus. Computer System Components. Chapter 1: Introduction Operating Systems Fundamentals Overview of Operating Systems Ahmed Tawfik Modern Operating Systems are increasingly complex Operating System Millions of Lines of Code DOS 0.015 Windows 95 11 Windows 98

More information

Chapter 1: Distributed Information Systems

Chapter 1: Distributed Information Systems Chapter 1: Distributed Information Systems Contents - Chapter 1 Design of an information system Layers and tiers Bottom up design Top down design Architecture of an information system One tier Two tier

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Distributed System Chapter 16 Issues in ch 17, ch 18

Distributed System Chapter 16 Issues in ch 17, ch 18 Distributed System Chapter 16 Issues in ch 17, ch 18 1 Chapter 16: Distributed System Structures! Motivation! Types of Network-Based Operating Systems! Network Structure! Network Topology! Communication

More information

Multiprocessor and Real-Time Scheduling. Chapter 10

Multiprocessor and Real-Time Scheduling. Chapter 10 Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

Chapter 18 Distributed Systems and Web Services

Chapter 18 Distributed Systems and Web Services Chapter 18 Distributed Systems and Web Services Outline 18.1 Introduction 18.2 Distributed File Systems 18.2.1 Distributed File System Concepts 18.2.2 Network File System (NFS) 18.2.3 Andrew File System

More information

File-System Structure

File-System Structure Chapter 12: File System Implementation File System Structure File System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery Log-Structured

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1

Operating Systems. Introduction & Overview. Outline for today s lecture. Administrivia. ITS 225: Operating Systems. Lecture 1 ITS 225: Operating Systems Operating Systems Lecture 1 Introduction & Overview Jan 15, 2004 Dr. Matthew Dailey Information Technology Program Sirindhorn International Institute of Technology Thammasat

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2 Chapter 1: Distributed Information Systems Gustavo Alonso Computer Science Department Swiss Federal Institute of Technology (ETHZ) alonso@inf.ethz.ch http://www.iks.inf.ethz.ch/ Contents - Chapter 1 Design

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

RealTime. RealTime. Real risks. Data recovery now possible in minutes, not hours or days. A Vyant Technologies Product. Situation Analysis

RealTime. RealTime. Real risks. Data recovery now possible in minutes, not hours or days. A Vyant Technologies Product. Situation Analysis RealTime A Vyant Technologies Product Real risks Data recovery now possible in minutes, not hours or days RealTime Vyant Technologies: data recovery in minutes Situation Analysis It is no longer acceptable

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Chapter-1: Exercise Solution

Chapter-1: Exercise Solution Chapter-1: Exercise Solution 1.1 In a multiprogramming and time-sharing environment, several users share the system simultaneously. This situation can result in various security problems. a. What are two

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Module 16: Distributed System Structures

Module 16: Distributed System Structures Chapter 16: Distributed System Structures Module 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology Communication Structure Communication

More information

OPERATING- SYSTEM CONCEPTS

OPERATING- SYSTEM CONCEPTS INSTRUCTOR S MANUAL TO ACCOMPANY OPERATING- SYSTEM CONCEPTS SEVENTH EDITION ABRAHAM SILBERSCHATZ Yale University PETER BAER GALVIN Corporate Technologies GREG GAGNE Westminster College Preface This volume

More information

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that:

Distributed Systems. Overview. Distributed Systems September A distributed system is a piece of software that ensures that: Distributed Systems Overview Distributed Systems September 2002 1 Distributed System: Definition A distributed system is a piece of software that ensures that: A collection of independent computers that

More information

The modularity requirement

The modularity requirement The modularity requirement The obvious complexity of an OS and the inherent difficulty of its design lead to quite a few problems: an OS is often not completed on time; It often comes with quite a few

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

Multiprocessor Scheduling

Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

Rule partitioning versus task sharing in parallel processing of universal production systems

Rule partitioning versus task sharing in parallel processing of universal production systems Rule partitioning versus task sharing in parallel processing of universal production systems byhee WON SUNY at Buffalo Amherst, New York ABSTRACT Most research efforts in parallel processing of production

More information

Admin Plus Pack Option. ExecView Web Console. Backup Exec Admin Console

Admin Plus Pack Option. ExecView Web Console. Backup Exec Admin Console WHITE PAPER Managing Distributed Backup Servers VERITAS Backup Exec TM 9.0 for Windows Servers Admin Plus Pack Option ExecView Web Console Backup Exec Admin Console VERSION INCLUDES TABLE OF CONTENTS STYLES

More information

Module 16: Distributed System Structures. Operating System Concepts 8 th Edition,

Module 16: Distributed System Structures. Operating System Concepts 8 th Edition, Module 16: Distributed System Structures, Silberschatz, Galvin and Gagne 2009 Chapter 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

TANDBERG Management Suite - Redundancy Configuration and Overview

TANDBERG Management Suite - Redundancy Configuration and Overview Management Suite - Redundancy Configuration and Overview TMS Software version 11.7 TANDBERG D50396 Rev 2.1.1 This document is not to be reproduced in whole or in part without the permission in writing

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

Disks and I/O Hakan Uraz - File Organization 1

Disks and I/O Hakan Uraz - File Organization 1 Disks and I/O 2006 Hakan Uraz - File Organization 1 Disk Drive 2006 Hakan Uraz - File Organization 2 Tracks and Sectors on Disk Surface 2006 Hakan Uraz - File Organization 3 A Set of Cylinders on Disk

More information

Part I Overview Chapter 1: Introduction

Part I Overview Chapter 1: Introduction Part I Overview Chapter 1: Introduction Fall 2010 1 What is an Operating System? A computer system can be roughly divided into the hardware, the operating system, the application i programs, and dthe users.

More information

Application generators: a case study

Application generators: a case study Application generators: a case study by JAMES H. WALDROP Hamilton Brothers Oil Company Denver, Colorado ABSTRACT Hamilton Brothers Oil Company recently implemented a complex accounting and finance system.

More information

Distributed Systems Operation System Support

Distributed Systems Operation System Support Hajussüsteemid MTAT.08.009 Distributed Systems Operation System Support slides are adopted from: lecture: Operating System(OS) support (years 2016, 2017) book: Distributed Systems: Concepts and Design,

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information