
2005 IEEE International Symposium on Cluster Computing and the Grid

An Organizational Grid of Federated MOSIX Clusters

Amnon Barak, Amnon Shiloh, and Lior Amar
Department of Computer Science
The Hebrew University of Jerusalem
Jerusalem, Israel

Abstract

MOSIX is a cluster management system that uses process migration to allow a Linux cluster to perform like a parallel computer. Recently it has been extended with new features that could make a grid of Linux clusters run as a cooperative system of federated clusters. On one hand, it supports automatic workload distribution among connected clusters that belong to different owners, while still preserving the autonomy of each owner to disconnect its cluster from the grid at any time, without losing migrated processes from other clusters. Other new features of MOSIX include grid-wide automatic resource discovery; a precedence scheme for local processes and among guest processes (from other clusters); flood control; a secure run-time environment (sandbox), which prevents guest processes from accessing local resources in a hosting system; and support of cluster partitions. The resulting grid management system is suitable for creating an intra-organizational high-performance computational grid, e.g., in an enterprise or in a campus. The paper presents enhanced and new features of MOSIX and their performance.

Keywords: cluster computing, grid computing, process migration, organizational grid, sandbox

1 Introduction

Grid computing is an emerging technology that uses the Internet to allow sharing of computational and data resources among geographically dispersed users within and across institutional boundaries [6]. For example, scientific grid computing refers to applications with large resource requirements that cannot be satisfied by traditional clusters or supercomputers. Existing grid packages, such as Cactus [1, 11], Condor [17], Globus [7] and Nimrod/G [14], already provide essential grid services, such as batch scheduling, assignment of processes to nodes, checkpointing, process migration, inter-process communication and remote data access. Due to the diversity of grid resources, the disruptive environment and the unpredictable requirements of processes, some areas that need further improvement are automatic (on-line) management, including adaptive resource discovery and workload distribution, as well as support of a secure run-time environment for non-local processes.

MOSIX [3, 4, 13] is a cluster management system that allows a set of x86 nodes to perform like a single parallel computer. Users can run parallel (and sequential) applications by letting MOSIX automatically seek resources and migrate processes among nodes to improve the overall performance, without changing the run-time environment of the migrated processes.

This paper presents a grid management system that extends the cluster version of MOSIX with new features that could make a grid run as a cooperative system of federated clusters. Our system model consists of independent clusters, e.g., of different groups, whose owners wish to share their computational resources, while still maintaining control over their private resources. The main features of the resulting grid system are:

1. Automatic resource discovery: users need not know the details of the configuration or the state of any specific resource.
2. Preemptive (transparent) process migration within and across different clusters, and automatic load-balancing.
3. Adaptive management that responds to changes of available and required resources.
4. A secure run-time environment for guest processes.
5. Support of a flexible configuration: clusters can be partitioned or combined.
6. A run-time precedence for local over guest processes and among guest processes.
7. Flood prevention.
8. Support of a dynamic environment: clusters can be connected or disconnected at any time.

We note that the first four features were obtained by enhancing existing cluster features of MOSIX to the grid environment, while the last four are new features that were developed for that environment. The new system is particularly suitable to run compute-intensive and other applications with moderate amounts of I/O, over fast trusted networks, which are common in an enterprise or in a campus grid. For example, in our campus there are 8 MOSIX clusters, ranging from 14 to 50 (mostly dual-CPU) nodes. Each cluster is owned by a different group (owner) in various departments. These clusters are connected by a 1Gb/s campus-wide backbone, which also connects several other clusters (with almost 200 nodes) in student labs that are not used during nights and weekends. All these clusters could form a campus grid with as many as 500 processors, see Fig. 1. Note that the leaves in the figure are cluster partitions.

Figure 1: A 4-level campus grid.

The paper is organized as follows: Section 2 presents cluster management features that were enhanced to support grid computing. Section 3 presents new grid management features. Section 4 presents the performance of the new features. Section 5 describes related works and Section 6 summarizes our conclusions.

2 From a Cluster to a Grid

This section presents cluster features of MOSIX that were enhanced for a grid environment.

2.1 Automatic Resource Discovery

Resource discovery is performed by an on-line, hierarchical information dissemination scheme that provides each node with the latest information about the availability and the state of grid-wide resources. The scheme is based on a randomized gossip algorithm, in which each node regularly (e.g., every second) monitors the state of its resources, including the CPU speed and current load, free and used memory, and current rates of I/O and network throughput. Each node then sends its most recent information, including indirect information that it has about other nodes, to a randomly chosen node in its cluster. Also, selected information, e.g., on the least loaded nodes, is exchanged among different clusters at a rate that is proportional to the relative distance between the corresponding clusters. In this scheme, information about newly available resources, e.g., clusters that have just become available, is quickly disseminated across the grid, while information about nodes in disconnected clusters is quickly phased out and thus can no longer be used. In [15] we presented bounds for the age properties and the rates of propagation of the above information dissemination scheme.

2.2 Preemptive Process Migration

MOSIX supports automatic, grid-wide preemptive process migration that can migrate almost any process [4]. Migrations are supervised by adaptive on-line algorithms that continuously attempt to improve the performance, e.g., by load-balancing, or by migrating processes from slower to faster nodes. These algorithms are particularly useful for applications with unpredictable or changing resource requirements. Within a cluster, process migration amounts to copying the memory image of the process and setting its execution environment. To reduce network occupancy in cross-cluster migrations, the process memory image is compressed using the LZOP [12] algorithm. Migration decisions are based on (run-time) process profiling and the latest information on the availability of grid resources, as provided by the information dissemination scheme.
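
The dissemination step of the scheme of Section 2.1 can be sketched as follows. This is a simplified illustration rather than the MOSIX implementation: the record fields, the 30-second cut-off and the class name are our own assumptions. Each node keeps an age-stamped record per known node, pushes its whole vector to one randomly chosen peer per round, and a receiver keeps the younger copy of every record.

    import random

    class BulletinBoard:
        # A sketch (not the MOSIX code) of the gossip scheme of Section 2.1:
        # every node keeps an age-stamped record per known node and gossips
        # its whole vector to one randomly chosen peer in its cluster per round.
        def __init__(self, node_id, cluster_nodes):
            self.node_id = node_id
            self.cluster_nodes = cluster_nodes   # peers in the local cluster
            self.records = {}                    # node_id -> (age_sec, resources)

        def refresh_local(self, load, free_mem_mb):
            # A node always holds age-0 (freshest) information about itself.
            self.records[self.node_id] = (0.0, {"load": load,
                                                "free_mem_mb": free_mem_mb})

        def age_all(self, dt):
            # Records get older as time passes; very old ones are dropped, so
            # nodes of disconnected clusters phase out (the 30 s cut-off is an
            # arbitrary choice for this sketch).
            self.records = {n: (age + dt, res)
                            for n, (age, res) in self.records.items()
                            if age + dt < 30.0}

        def gossip_once(self, peers_by_id):
            # Push the whole (direct plus indirect) vector to one random peer.
            target = random.choice([n for n in self.cluster_nodes
                                    if n != self.node_id])
            peers_by_id[target].receive(self.records)

        def receive(self, remote_records):
            # For every node, keep whichever copy of its record is younger.
            for n, (age, res) in remote_records.items():
                if n not in self.records or age < self.records[n][0]:
                    self.records[n] = (age, res)

Cross-cluster exchange would use the same merge step, only for selected records and at a lower rate for more distant clusters, as described above.
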
Process profiling is performed by continuously collecting information about the characteristics of each process, e.g., its size, rates of system-calls, volume of IPC and I/O, etc. This information is then used by competitive on-line algorithms [9] to determine the best location for the process. These algorithms take into account the respective speed and current load of the nodes, the size of the migrated process vs. the free memory available in different nodes, and the characteristics of the processes. In this scheme, when the profile of a process changes or new resources become available, the system automatically responds by considering reassignment of processes to better locations.
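
A much simplified placement heuristic in the spirit of the above is sketched below. It is not the opportunity-cost algorithm of [9], and the node and process fields are illustrative assumptions: it first rules out nodes whose free memory cannot hold the process image and then prefers the node offering the largest CPU share.

    def pick_target_node(process, nodes):
        # Skip nodes that would have to page; among the rest, prefer the node
        # on which the process would get the largest CPU slice.
        best, best_share = None, 0.0
        for node in nodes:
            if node["free_mem_mb"] < process["size_mb"]:
                continue
            share = node["speed_mhz"] / (node["load"] + 1)
            if share > best_share:
                best, best_share = node, share
        return best                              # None means "stay where you are"

    nodes = [{"name": "a1", "speed_mhz": 3060, "load": 2, "free_mem_mb": 900},
             {"name": "b7", "speed_mhz": 3060, "load": 0, "free_mem_mb": 4000}]
    print(pick_target_node({"size_mb": 64}, nodes))   # prints the record of "b7"
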

2.3 A Virtual Run-Time Environment

The MOSIX Virtual run-time Environment (MVE) is a software layer that allows a migrated process to run in a remote node, away from the (home) node in which it was created [3, 4]. This is accomplished by intercepting the system-calls of such processes. If the process was not migrated, then its system-calls are performed locally. After a process is migrated, its few system-calls that are site-independent are performed in the remote node, while the rest are forwarded to the MVE layer in its home-node, which then performs the system-calls on behalf of the process as if it were running in the home-node. The main outcome of the MVE scheme is that each process, including a migrated process, seems to be running in its home-node, and all the processes of a user's session share the run-time environment of the home-node. As a result, the user gets the illusion of running on a single-node system. The drawback of this approach is increased communication overhead, which makes our system suitable to run compute-bound and other non-intensive I/O applications in an organizational grid.

The MVE layer guarantees that a migrated (guest) process cannot modify or even access local resources, other than CPU and memory, in a remote (hosting) node. As explained above, this is accomplished by intercepting the system-calls of guest processes. Special care is taken to ensure that the few system-calls that are performed locally cannot access resources in the hosting node. The rest are forwarded to the home-node of the process. The net result is a secure run-time environment (sandbox) for all guest processes.
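
Conceptually, the routing of system-calls by the MVE can be sketched as follows. This is an illustration only: the real layer works inside the Linux kernel, and the call names and the site-independent subset shown here are our own assumptions.

    SITE_INDEPENDENT = {"getpid", "brk", "sched_yield"}   # illustrative subset

    class HomeNodeLink:
        # Stands in for the channel back to the home-node of a migrated process.
        def forward(self, name, args):
            return ("performed-at-home", name, args)

    def execute_locally(name, args):
        return ("performed-here", name, args)

    def handle_syscall(process, name, args, home_link):
        # A non-migrated process behaves exactly as under plain Linux.
        if not process["migrated"]:
            return execute_locally(name, args)
        # A migrated (guest) process may run only site-independent calls locally;
        # anything that could touch host resources is forwarded to its home-node.
        if name in SITE_INDEPENDENT:
            return execute_locally(name, args)
        return home_link.forward(name, args)

    print(handle_syscall({"migrated": True}, "open", ("/etc/passwd",),
                         HomeNodeLink()))
    # -> ('performed-at-home', 'open', ('/etc/passwd',)): the guest never
    #    opens files of the hosting node.
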
3 New Grid Features

This section presents the new grid features, including a scheme for sharing cluster partitions, a method for allocating a run-time precedence among processes, flood control (to prevent overloading of nodes), and handling of long-running processes in a disruptive environment.

3.1 Sharing of Cluster Partitions

Each MOSIX cluster can be divided into several partitions, where a partition is a set of nodes that is usually allocated to one user (partition owner) at a time. The allocation of nodes to partitions could change from time to time, to reflect changing demands. Nodes, if any, that are not allocated form a special partition. Each user is expected to login only to nodes in its allocated (home) partition in its local (group's) cluster, where all of his/her processes are created. Besides the home partition, a user may (temporarily) get on loan additional partitions in other (remote) clusters. MOSIX supports process migration among nodes in all the user's partitions as if they were a single partition. Process migration can be done either automatically, using the load-balancing algorithms, or manually, by explicit requests.

To enable grid-wide resource sharing, each owner may designate some nodes in its home partition to host guest processes of other owners. The remaining (reserved) nodes are allocated exclusively for the owner's processes. Guest processes may migrate to designated nodes only if these nodes are not used for a prolonged time (a parameter). They are automatically moved out whenever an owner reclaims its nodes, or when processes with higher precedence are moved in (see below).

Since each partition can host processes from different owners, a partition-level automatic run-time precedence scheme was developed to distinguish between such guest processes.

3.2 The Precedence Scheme

The precedence scheme provides a run-time precedence among processes of different owners. Processes with a higher precedence preempt and push out all processes with a lower precedence. Note that a node may still share its resources among two or more owners with the same precedence. The precedence scheme consists of a precedence allocation method and an enforcement algorithm. Each partition owner is responsible for defining a precedence table. The table contains information about partitions whose processes are allowed to move in, and if so, also their precedence. By proper setting of the precedence table, it is possible to combine two partitions in different clusters (symmetrically or asymmetrically), to block migration from specific partitions, or even to attach a partition to several clusters. For example, the owner of partition P1 can get on loan a partition P2 in another cluster, if the owner of P2 sets the precedence of P1 processes to be equal to that of local processes. Note that processes of P2 are not considered as local processes in P1 and could even be blocked altogether.

The precedence algorithm has two responsibilities: to migrate out all processes with a lower precedence upon arrival of any process with a higher precedence, and to guarantee that processes with a higher precedence can always move in, even if processes with a lower precedence are already running there.
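
One possible encoding of a precedence table and of the admission/eviction decision is sketched below. This is illustrative only: it is not the MOSIX configuration syntax, and partition P3 is hypothetical. Smaller numbers denote higher precedence, and partitions absent from the table are blocked.

    # Precedence table of the owner of partition P2: smaller number means
    # higher precedence, and partitions that are absent are blocked.
    precedence_table_p2 = {
        "P2": 0,   # the owner's own processes
        "P1": 0,   # P1 got this partition on loan: same precedence as local
        "P3": 5,   # P3 processes are admitted only as lower-precedence guests
    }

    def admit(candidate_partition, running, table):
        # Decide whether a process from candidate_partition may move in and
        # which running processes must be pushed out to make room for it.
        if candidate_partition not in table:
            return False, []                     # blocked altogether
        prec = table[candidate_partition]
        evict = [p for p in running
                 if table.get(p["partition"], float("inf")) > prec]
        return True, evict

    running = [{"pid": 101, "partition": "P3"}, {"pid": 102, "partition": "P2"}]
    print(admit("P1", running, precedence_table_p2))
    # -> (True, [{'pid': 101, 'partition': 'P3'}]): P3's guest is pushed out,
    #    while the local P2 process stays.
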

3.3 Flood Control

Flooding could occur when a user creates a large number of processes to take advantage of grid resources (nodes), or when a large number of processes migrate back to their owner's partition, e.g., when clusters are disconnected. The first case is prevented by placing a (tunable) limit on the number of guest processes that are allowed to migrate to each node. Processes of a user that attempt to overload the grid beyond the allowable limit are not permitted to migrate. To prevent flooding by a large number of processes, including returning processes, each node has a limit on the number of local running processes. When this limit is reached, additional processes are automatically frozen and their memory images are stored in regular files. This method ensures that a large number of processes can be handled without exhausting the CPU and memory. Frozen processes are reactivated in a circular fashion in order to allow some work to be done without overloading the owner's nodes. When resources become available again, the load-balancing algorithm migrates running processes away, thus allowing reactivation of more frozen processes.
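
A toy sketch of this freezing policy is shown below; it is our own simplification, with an assumed limit value and class name. It keeps at most a fixed number of processes running and reactivates frozen ones in arrival order, a simplification of the circular reactivation described above (in MOSIX the memory image of a frozen process is written to a regular file).

    from collections import deque

    class FloodController:
        # At most `limit` processes run at once; the rest are frozen and later
        # reactivated when a running slot frees up.
        def __init__(self, limit):
            self.limit = limit
            self.running = []
            self.frozen = deque()

        def arrive(self, proc):
            if len(self.running) < self.limit:
                self.running.append(proc)
            else:
                self.frozen.append(proc)         # image would be written to disk

        def slot_freed(self, finished_proc):
            self.running.remove(finished_proc)
            if self.frozen:
                self.running.append(self.frozen.popleft())

    ctl = FloodController(limit=2)
    for pid in range(5):
        ctl.arrive(pid)
    print(ctl.running, list(ctl.frozen))         # [0, 1] [2, 3, 4]
    ctl.slot_freed(0)
    print(ctl.running, list(ctl.frozen))         # [1, 2] [3, 4]
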
3.4 Disruptive Configurations

In our grid, the owner of each cluster can connect (disconnect) it to (from) the grid at any time, and can also block or allow migration of guest processes from each remote cluster to each node. After a request is issued to disconnect a cluster, if guest processes are present then they are moved out, and if local processes were migrated to other clusters then they are brought back. Note that guest processes are not necessarily migrated back to their respective home-nodes, although they can always do so. For this reason, users are not expected to login and/or initiate processes from remote clusters, since if that is allowed and the clusters are disconnected, those processes may have nowhere to return. In order to manage a large number of returning processes, the scheme relies on the freezing mechanism described in the previous section.

Support of Long Processes

The process migration, freezing and gradual reactivation mechanisms provide support to grid applications that need to run for a long time, e.g., for days, weeks or even months. As explained above, before a remote cluster is disconnected, all guest processes are moved out. These processes are frozen and are gradually reactivated when grid resources become available again. For example, long processes that had been migrated to student farms during the night were returned to their home cluster in the morning, only to migrate back to the student farms the next night.

4 Performance

This section presents the performance of MOSIX and its new features. We ran tests in a grid with two clusters (C14 and C20) and a workstation that were located in the same building, as well as between nodes in different buildings and different campuses of our university.

The C14 cluster consisted of 14 identical (dual-Xeon 3.06GHz, 4GB RAM) diskless nodes, and C20 consisted of 20 identical (dual-Xeon 3.06GHz, 4GB RAM) diskless nodes. The workstation had a single Pentium IV 3GHz CPU, 1GB RAM, and a 40GB low-cost SATA-IDE disk. The nodes in each cluster were connected by an internal 1Gb/s Ethernet switch. The two clusters and the workstation were connected by a 1Gb/s switch.

4.1 Overhead of Migrated Processes

This test presents the overhead of running migrated processes in different remote clusters. This overhead includes the migration, the communication and that of the MVE. We used four real-life applications with increasing amounts of I/O. The first application, RC, is a CPU-intensive program that generates random sets of clauses over propositional variables and analyses the satisfiability, size and distribution of variables of each set and its subsets [5]. The second application, SW, produces all possible alignments between pairs of protein sequences using the Smith-Waterman algorithm [20]. SW uses a relatively small amount of I/O. JELlium solves the fundamental Schroedinger/Newton equations of motion of electrons and nuclei of a molecule by computing the combined electron-nuclear dynamics [2]. JELlium uses a moderate amount of I/O.

Lastly, BLAT is a bioinformatics tool for rapid mRNA/DNA and cross-species protein alignments [8]; it uses a moderate amount of I/O.

We used identical Xeon 3.06GHz nodes and ran each program using four different settings: first, as a local (native Linux) process; then as a migrated process to a remote node in a cluster that was located in the same building and was connected by a 1Gb/s Ethernet; then as a migrated process to a cluster across campus (located about 1 Km away) that was connected once by a 1Gb/s and again by a 100Mb/s Ethernet (for reference); and lastly as a migrated process to a cluster in a campus across town (located about 10 Km away) that was connected by a 100Mb/s Ethernet. We note that in the last four tests, each process was migrated to the remote node immediately after its creation and it performed all its I/O and site-dependent system-calls in its home-node, across the network.

The results of these tests (averaged over 5 runs) are shown in Table 1. In the table, the first four lines show the local run-time (Sec.) of each program, followed by the respective total amounts of I/O (MB), the block size (KB) used and the number of remote system-calls performed. The next two pairs of rows list the run-times and the corresponding slowdown (vs. the local time) between clusters in the same building and across a campus grid using a 1Gb/s Ethernet. The last two pairs of rows list the run-times and the corresponding slowdown in the above campus grid and a grid across town using a 100Mb/s Ethernet.

Table 1: Local vs. Remote Run-times

                               RC        SW        JEL       BLAT
    Local time (Sec.)
    Total I/O (MB)
    Block size                 -         32KB      32KB      64KB
    R-Syscalls                 3,050     16,700    7,200     7,800
    1Gb/s Ethernet
      Building time (Sec.)
      Slowdown                 0.32%     1.47%     1.16%     1.39%
      Campus time (Sec.)
      Slowdown                 0.50%     1.85%     1.18%     1.67%
    100Mb/s Ethernet
      Campus time (Sec.)
      Slowdown                 0.72%     3.61%     9.10%     7.82%
      Town time (Sec.)
      Slowdown                 1.15%     11.80%    12.16%    10.95%

From Table 1 it can be seen that with a 1Gb/s Ethernet the average slowdown of all the tested programs in the same campus was 1.2%, with less than 2% maximal slowdown for all the tested programs. The corresponding average slowdowns with a 100Mb/s Ethernet were 5.31% in the same campus and 9.02% across town. These results were expected due to the increased latencies and lower bandwidth of the 100Mb/s vs. 1Gb/s Ethernet, which affected mostly the programs that performed I/O. In comparison, the slowdown of the CPU-bound program (RC) increased only by 0.22% across campus and by 0.65% across town. These results confirm the claim that our system is suitable for applications with a moderate amount of I/O over fast networks in an enterprise or in a campus grid.

4.2 Grid-wide Load-balancing

In the next two tests we measured the time needed to balance the load between the C14 and C20 clusters. Initially, all the nodes of C20 were idle; this cluster had one reserved node and no limit on the number of guest processes in the remaining nodes. Cluster C14 was disconnected from the grid. We created a set of 66 identical CPU-bound processes with a high precedence in C14 (the processes were dispersed evenly among all the nodes). Each process was allocated an array of size 64MB, which was continuously accessed, so as to keep the whole array in memory. The number of processes was chosen such that when C14 is connected to the grid, each node in the two clusters (except the reserved node in C20) will have two processes (equilibrium).
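
As a quick check of this setup, the 66 processes follow directly from the target of two processes per participating node, using only the cluster sizes given above:

    # Two processes on every node of C14 and C20, except the reserved node.
    nodes_c14, nodes_c20, reserved_c20 = 14, 20, 1
    print(2 * (nodes_c14 + nodes_c20 - reserved_c20))   # -> 66
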
The test started by a cluster-connect command in C14, allowing processes to migrate to C20, and ended when equilibrium was reached. The measured time (averaged over 5 runs) to balance the load between the two clusters is shown in Table 2, Line 1. From the obtained result it follows that the average migration rate was one process every 1.4 Sec., slightly higher than the 1 Sec. residence time imposed on processes by the load-balancing algorithm to prevent process swinging.

Table 2: Load-balancing Times

    Initial No. of        Initial No. of        Equilibrium (migration)
    Processes in C14      Processes in C20      Time (Sec.)

4.2.1 The Precedence Scheme

This test measures the time needed by higher-precedence guest processes to replace lower-precedence guest processes, using the load-balancing and the precedence algorithms. As in the previous test, initially C14 was disconnected from the grid and it ran a set of 66 identical higher-precedence processes, while C20 had one reserved node and no limit on the number of guest processes in its remaining nodes. First we migrated 38 lower-precedence processes (same type and size as above) from the workstation to the nodes of C20. We then issued a cluster-connect command to C14, allowing processes to migrate to C20. We measured the time until all the lower-precedence processes were migrated back to the workstation, higher-precedence processes were migrated from C14 to C20, and equilibrium was reached. The result of this test (averaged over 5 runs) is shown in Table 2, Line 2. From the result it follows that the average migration rate was one process every 1.6 Sec. As can be expected, the measured time is more than double the migration time in the previous test (Table 2, Line 1), since twice the number of processes were migrated and also because all the lower-precedence processes were migrated to one workstation.

4.3 Disruptive Configurations

This test provides an estimate of the time to move out guest processes from a hosting cluster that is about to be disconnected from the grid. Note that the measured time is similar to the time needed to bring back (to an owner's cluster) an identical set of local processes that were migrated to other clusters. We created a set of identical CPU-bound processes (same as before) in all the nodes of the C14 cluster. We then forced all the processes to migrate to the C20 cluster, to simulate a scenario in which the nodes of C20 were faster than the nodes of C14. The test started by initiating a cluster-disconnect command in C20, which forced all the guest processes to move out, and ended when all the guest processes were running in C14.

The results of this test for different numbers of processes and different sizes are presented in Table 3. In the table, Column 1 lists the total number of guest processes; Column 2 lists the size of each process; Column 3 shows the average (over 4 runs) of the measured migration times, and Column 4 shows the migration rates (MB/s), obtained by dividing the total amount of migrated data (Column 1 times Column 2) by the total migration time (Column 3).

Table 3: Time to Evacuate a Cluster

    No. of        Process    Migration
    Processes     Size       Time         Rate
    40            64 MB      26 Sec       98 MB/Sec
                             101 Sec      101 MB/Sec
                             198 Sec      103 MB/Sec
                             397 Sec      103 MB/Sec
                             50 Sec       102 MB/Sec
                             192 Sec      106 MB/Sec
                             388 Sec      105 MB/Sec

The obtained results show that MOSIX can move out guest processes in a reasonable time, and they also provide an estimate of the delays that owners who participate in grid activities can expect when reclaiming a cluster. Note that process migration was performed at an average rate (weighted over all cases) that is over 93% of the 110 MB/s measured rate of TCP/IP between a pair of nodes in the different clusters. Finally, observe that the measured time to migrate out 40 processes from C20 (Table 3, Line 1) was half the time required to balance the load between C14 and C20 (Table 2, Line 1). As explained above, this is due to the restraining nature of the load-balancing algorithm, which prevents process swinging.
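
As a quick check, the rate in Table 3, Line 1 can be re-derived exactly as described above, by dividing the total migrated data by the migration time:

    # Table 3, Line 1: 40 processes of 64 MB each, evacuated in 26 seconds.
    processes, size_mb, seconds = 40, 64, 26
    print(round(processes * size_mb / seconds, 1))   # -> 98.5, i.e. the ~98 MB/Sec of the table
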
Handling a Large Number of Processes

This test demonstrates how long it takes to migrate back processes when the workstation in which they were created is about to be disconnected from the grid. We created a set of identical processes (same kind as before) in the workstation and migrated all the processes to the C20 and C14 clusters, so that there was an equal number of processes in each node. The test started by a cluster-disconnect command in the workstation, which forced all the remote processes to be suspended and moved back to the workstation according to the flood-control scheme described in Section 3.3. We measured the time until all the processes were frozen in a local disk of the workstation.

The results of this test for different numbers of processes and different sizes are presented in Table 4. In the table, Column 1 lists the total number of processes; Column 2 lists the size of each process; Column 3 shows the average (over 5 runs) of the total migration times until all the processes were stored in the workstation, and Column 4 shows the migration rates (in MB/s), which reflect the throughput of the (low-cost) disk at the workstation.

Table 4: Time to Disconnect a Workstation

    No. of        Process    Migration
    Processes     Size       Time         Rate
                             69 Sec       32.6 MB/Sec
                             266 Sec      33.1 MB/Sec
                             520 Sec      33.7 MB/Sec
                             1127 Sec     31.1 MB/Sec
                             143 Sec      30.9 MB/Sec
                             552 Sec      31.9 MB/Sec
    68            512 MB                  30.0 MB/Sec
                             299 Sec      29.9 MB/Sec
                             1197 Sec     29.4 MB/Sec

The results of this test show that MOSIX is capable of migrating back and storing a relatively large number of processes in a reasonable time. This means that even a user with one workstation could use grid resources when they are available and store processes locally until such resources become available again. The specific results provide an estimate of the time delays to disconnect a workstation from the grid. Obviously, the presented times are expected to be shorter if the migrations are back to a cluster instead of to a workstation.

5 Related Works

Several grid management projects address issues that are presented in this paper. The usefulness of automatic process migration for grid-wide performance improvements of communicating processes was shown in [18]. The presented method assumes knowledge of the application run-times. It relies on a user-level checkpointing library which was implemented on top of MPI. Cactus [11] uses the "worm unit", an independent entity which is aware of grid information services and resource brokers, to migrate tasks in a dynamic grid environment. Migration is performed by application-level checkpointing, which requires some modification of applications. Condor [17] supports "preemptive resume" scheduling by linking compatible, independent applications with a library which allows checkpointing and process migration. Its flocking mechanism allows aggregation of several pools into a single entity while preserving the owners' rights. Flocking is usually transparent to the users. This is accomplished by gateway machines that isolate each pool from the flock. The pool owner can connect or disconnect its pool from the grid at any time and share resources with any set of remote pools. Each such remote pool can be given a different priority.

A framework for creating and managing a secure run-time environment in a grid is presented in [10]. It is shown how such a remote environment could be interfaced with Globus [7]. We believe that the MVE layer could be incorporated with this framework. Astrolabe [19] and the Network Weather Service (NWS) [21] are two systems that provide updated information about grid resources. Astrolabe uses a hierarchical gossip protocol in which the rate of exchanged information depends on the proximity of nodes in a hierarchy tree. The algorithm that is used by our system is similar to the one used by Astrolabe. NWS monitors and predicts the performance of computational grid resources, including network-related resources. These predictions are used for adaptive application scheduling and migration decisions.

6 Conclusions and Future Work

The paper presented enhanced cluster and new grid features of MOSIX for automatic management of resources in an organizational grid. These features include resource discovery and workload distribution; a precedence scheme for local processes and among guest processes; support of a flexible configuration; preservation of running processes when clusters are disconnected; flood control; and a secure run-time environment for guest processes.
In addition to the convenience of use, the performance of the system over a campus grid was nearly that of a local cluster. The obtained results confirm the claim that our system is suitable for compute-bound and other applications with a moderate amount of I/O over fast networks in an enterprise or in a campus grid. A prototype grid with 8 clusters (almost 500 processors) is now under construction in our campus. Since not every cluster is used at all times, our system will allow better utilization of already existing clusters, e.g., in student labs, by users who need to run demanding applications but cannot afford to own large clusters.

The work described in this paper could be extended in several ways. First, we plan to incorporate an intermediate standard grid service layer, such as Globus [7], for secure communication, process and data migration.

Another possibility is to combine our system with a grid package that already uses Globus. An interesting extension of our work would be to migrate groups of communicating processes (a job) from one cluster to another, e.g., by generalization of the session migration of Zap [16], so that all inter-process communication, including remote system-calls within a group, is confined to the same cluster, thus further reducing the communication overhead between a migrated process and its home-node. Another subject that we plan to investigate is methods for a fair-share distribution of the grid resources among users and cluster owners, e.g., based on a credit system, to encourage owners to connect their clusters to the grid. A much more difficult challenge is to develop a secure run-time environment in which a hosting system could not interfere with guest processes.

Acknowledgments

We wish to thank R. Baer, A. Harel, J. Kent and E. Lozinskii for contributing applications for our tests and D. Braniss for his help. This research was supported in part by grants from the Ministry of Defense, from Mrs. B. Liber and from Dr. and Mrs. Silverston, Cambridge, UK.

References

[1] G. Allen, D. Angulo, I. Foster, G. Lanfermann, C. Liu, T. Radke, E. Seidel, and J. Shalf, "The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment," Int. J. of High Performance Applications and Supercomputing, 15(4), 2001.
[2] R. Baer, and R. Gould, "A Method for ab Initio Nonlinear Electron Density Evolution," J. Chem. Phys., 114(8), 2001.
[3] L. Amar, A. Barak, and A. Shiloh, "The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems," Cluster Computing, 7(2), 2004.
[4] A. Barak, O. La'adan, and A. Shiloh, "Scalable Cluster Computing with MOSIX for Linux," Proc. 5th Annual Linux Expo, Raleigh, NC, May 1999.
[5] E. Birnbaum, and E. L. Lozinskii, "Consistent Subsets of Inconsistent Systems: Structure and Behavior," J. of Experimental and Theoretical Artificial Intelligence, 15(1), 2003.
[6] I. Foster, and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan-Kaufmann Publishers, Inc., San Francisco, CA.
[7] Globus, http://www.globus.org.
[8] W.J. Kent, "BLAT - The BLAST-Like Alignment Tool," Genome Res., 12(4), 2002.
[9] A. Keren, and A. Barak, "Opportunity Cost Algorithms for Reduction of I/O and Interprocess Communication Overhead in a Computing Cluster," IEEE Trans. Parallel and Dist. Systems, 14(1), 2003.
[10] K. Keahey, K. Doering, and I. Foster, "From Sandbox to Playground: Dynamic Virtual Environments in the Grid," Fifth IEEE/ACM Int. Workshop on Grid Computing (GRID'04), Pittsburgh, PA, Nov. 2004.
[11] G. Lanfermann, G. Allen, T. Radke, and E. Seidel, "Nomadic Migration: A New Tool for Dynamic Grid Computing," Proc. 10th IEEE Int. Symp. on High Performance Distributed Computing (HPDC-10'01), San Francisco, CA, Aug. 2001.
[12] LZOP.
[13] MOSIX, http://www.mosix.org.
[14] Nimrod: Tools for Distributed Parametric Modeling, nimrod/nimrodg.htm.
[15] I. Peer, A. Barak, and L. Amar, "A Gossip-Based Distributed Bulletin Board with Guaranteed Age Properties," accepted to the Int. J. on Parallel Programming.
[16] S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The Design and Implementation of Zap: A System for Migrating Computing Environments," Proc. 5th Symp. on Operating Systems Design and Implementation, Boston, MA, Dec. 2002.
Livny, "Condor and Preemptive Resume Scheduling," Grid Resource Management: State of the Art and Future Trends, Kluwer Academic Publishers, 2003, pp [18] S. Vadhiyar, and J. Dongarra, "A performance Oriented Migration Framework for the Grid," Proc. 3rd IEEEI/ACM Int. Symp. on Cluster Computing and the Grid (CCGrid 2003), May 2003, pp [19] R. van Renesse, K. P. Birman, and W. Vogels, "Astrolabe: A Robust and Scalable Technology For Distributed Systems Monitoring, Management, and Data Mining," ACM Tran. on Computer Systems, 21(3), 2003, pp [20] T. F. Smith, and M. S. Waterman, 'Identification of Common Molecular Subsequences," J. of Mol. Biol., 147(1), 1981, pp [21] R. Wolski, "Experiences with Predicting Resource Performance On-line in Computational Grid Settings," ACM SIGMETRICS Performance Evaluation Review," 30(4), March 2003, pp


More information

CHAPTER 7 CONCLUSION AND FUTURE SCOPE

CHAPTER 7 CONCLUSION AND FUTURE SCOPE 121 CHAPTER 7 CONCLUSION AND FUTURE SCOPE This research has addressed the issues of grid scheduling, load balancing and fault tolerance for large scale computational grids. To investigate the solution

More information

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Warm Standby...2 The Business Problem...2 Section II:

More information

UNICORE Globus: Interoperability of Grid Infrastructures

UNICORE Globus: Interoperability of Grid Infrastructures UNICORE : Interoperability of Grid Infrastructures Michael Rambadt Philipp Wieder Central Institute for Applied Mathematics (ZAM) Research Centre Juelich D 52425 Juelich, Germany Phone: +49 2461 612057

More information

Estimate performance and capacity requirements for Access Services

Estimate performance and capacity requirements for Access Services Estimate performance and capacity requirements for Access Services This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references,

More information

Best Practices for Deploying a Mixed 1Gb/10Gb Ethernet SAN using Dell EqualLogic Storage Arrays

Best Practices for Deploying a Mixed 1Gb/10Gb Ethernet SAN using Dell EqualLogic Storage Arrays Dell EqualLogic Best Practices Series Best Practices for Deploying a Mixed 1Gb/10Gb Ethernet SAN using Dell EqualLogic Storage Arrays A Dell Technical Whitepaper Jerry Daugherty Storage Infrastructure

More information

EMC VPLEX with Quantum Stornext

EMC VPLEX with Quantum Stornext White Paper Application Enabled Collaboration Abstract The EMC VPLEX storage federation solution together with Quantum StorNext file system enables a stretched cluster solution where hosts has simultaneous

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

GRB. Grid-JQA : Grid Java based Quality of service management by Active database. L. Mohammad Khanli M. Analoui. Abstract.

GRB. Grid-JQA : Grid Java based Quality of service management by Active database. L. Mohammad Khanli M. Analoui. Abstract. Grid-JQA : Grid Java based Quality of service management by Active database L. Mohammad Khanli M. Analoui Ph.D. student C.E. Dept. IUST Tehran, Iran Khanli@iust.ac.ir Assistant professor C.E. Dept. IUST

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

An Architecture For Computational Grids Based On Proxy Servers

An Architecture For Computational Grids Based On Proxy Servers An Architecture For Computational Grids Based On Proxy Servers P. V. C. Costa, S. D. Zorzo, H. C. Guardia {paulocosta,zorzo,helio}@dc.ufscar.br UFSCar Federal University of São Carlos, Brazil Abstract

More information

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme Yue Zhang, Yunxia Pei To cite this version: Yue Zhang, Yunxia Pei. A Resource Discovery Algorithm in Mobile Grid Computing

More information

Solving the N-Body Problem with the ALiCE Grid System

Solving the N-Body Problem with the ALiCE Grid System Solving the N-Body Problem with the ALiCE Grid System Dac Phuong Ho 1, Yong Meng Teo 2 and Johan Prawira Gozali 2 1 Department of Computer Network, Vietnam National University of Hanoi 144 Xuan Thuy Street,

More information

Problems for Resource Brokering in Large and Dynamic Grid Environments

Problems for Resource Brokering in Large and Dynamic Grid Environments Problems for Resource Brokering in Large and Dynamic Grid Environments Cătălin L. Dumitrescu Computer Science Department The University of Chicago cldumitr@cs.uchicago.edu (currently at TU Delft) Kindly

More information

OPERATING SYSTEM. Functions of Operating System:

OPERATING SYSTEM. Functions of Operating System: OPERATING SYSTEM Introduction: An operating system (commonly abbreviated to either OS or O/S) is an interface between hardware and user. OS is responsible for the management and coordination of activities

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1 What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Data Center Interconnect Solution Overview

Data Center Interconnect Solution Overview CHAPTER 2 The term DCI (Data Center Interconnect) is relevant in all scenarios where different levels of connectivity are required between two or more data center locations in order to provide flexibility

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Cluster Abstraction: towards Uniform Resource Description and Access in Multicluster Grid

Cluster Abstraction: towards Uniform Resource Description and Access in Multicluster Grid Cluster Abstraction: towards Uniform Resource Description and Access in Multicluster Grid Maoyuan Xie, Zhifeng Yun, Zhou Lei, Gabrielle Allen Center for Computation & Technology, Louisiana State University,

More information

Special Topics: CSci 8980 Edge History

Special Topics: CSci 8980 Edge History Special Topics: CSci 8980 Edge History Jon B. Weissman (jon@cs.umn.edu) Department of Computer Science University of Minnesota P2P: What is it? No always-on server Nodes are at the network edge; come and

More information

Managing Performance Variance of Applications Using Storage I/O Control

Managing Performance Variance of Applications Using Storage I/O Control Performance Study Managing Performance Variance of Applications Using Storage I/O Control VMware vsphere 4.1 Application performance can be impacted when servers contend for I/O resources in a shared storage

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Using Transparent Compression to Improve SSD-based I/O Caches

Using Transparent Compression to Improve SSD-based I/O Caches Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

Introduction. Application Performance in the QLinux Multimedia Operating System. Solution: QLinux. Introduction. Outline. QLinux Design Principles

Introduction. Application Performance in the QLinux Multimedia Operating System. Solution: QLinux. Introduction. Outline. QLinux Design Principles Application Performance in the QLinux Multimedia Operating System Sundaram, A. Chandra, P. Goyal, P. Shenoy, J. Sahni and H. Vin Umass Amherst, U of Texas Austin ACM Multimedia, 2000 Introduction General

More information

Scalable Hybrid Search on Distributed Databases

Scalable Hybrid Search on Distributed Databases Scalable Hybrid Search on Distributed Databases Jungkee Kim 1,2 and Geoffrey Fox 2 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu, 2 Community

More information

NEC Express5800 A2040b 22TB Data Warehouse Fast Track. Reference Architecture with SW mirrored HGST FlashMAX III

NEC Express5800 A2040b 22TB Data Warehouse Fast Track. Reference Architecture with SW mirrored HGST FlashMAX III NEC Express5800 A2040b 22TB Data Warehouse Fast Track Reference Architecture with SW mirrored HGST FlashMAX III Based on Microsoft SQL Server 2014 Data Warehouse Fast Track (DWFT) Reference Architecture

More information

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM Szabolcs Pota 1, Gergely Sipos 2, Zoltan Juhasz 1,3 and Peter Kacsuk 2 1 Department of Information Systems, University of Veszprem, Hungary 2 Laboratory

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information