MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes


Namyoon Woo, Heon Y. Yeom
School of Computer Science and Engineering, Seoul National University, Seoul, KOREA

Taesoon Park
Department of Computer Engineering, Sejong University, Seoul, KOREA

Abstract

Fault tolerance is an essential element of distributed systems that require a reliable computation environment. In spite of extensive research over two decades, practical fault-tolerance systems have rarely materialized, owing to the high overhead and the unwieldiness of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to close the gap between the theory and the practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and alters the MPICH communication layer as little as possible. The distinguishing features of MPICH-GF are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, at the virtual device level. We have evaluated MPICH-GF with NPB applications on Globus middleware.

1 Introduction

A computational grid (or simply a grid) is a specialized instance of a distributed system that comprises a heterogeneous collection of computers in different domains connected by networks [17, 18]. The grid has attracted considerable attention for its ability to present ubiquitous computational resources behind a single system image, and it is expected to benefit many computation-intensive parallel applications. Recently, Argonne National Laboratory proposed the Globus Toolkit as a framework for the grid, and it has become the de facto standard for grid services [16]. The main features of the Globus Toolkit are global resource allocation and management, directory service, remote task execution, user authentication, and so on. Although the Globus Toolkit can monitor and manage global resources, it lacks dynamic process management such as fault tolerance or dynamic load balancing, which is essential for distributed systems.

Distributed systems are not reliable enough to guarantee the completion of parallel processes within a bounded time because of their inherent failure factors. The system consists of a number of nodes, disks and network links, which are all exposed to failures. Even a single local failure can be fatal to a set of parallel processes, since it nullifies all of the computation results that have been produced in cooperation with one another. Assuming there is a one percent chance that a single machine crashes during the execution of a parallel application, there is only a 0.99^100 ≈ 36 percent chance that a system of one hundred machines stays alive throughout the application run.
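Stated explicitly, and assuming that the machines fail independently of one another with per-run failure probability p, the survival probability of an n-machine run is

    P(\text{all machines survive}) = (1 - p)^{n}, \qquad (1 - 0.01)^{100} = 0.99^{100} \approx 0.366,

which is the roughly 36 percent figure quoted above.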

In order to increase the reliability of distributed systems, it is important to provide them with fault tolerance. Checkpointing and rollback-recovery is a well-known technique for fault tolerance. Checkpointing is the operation of storing the states of processes in stable storage for the purpose of recovery or migration [12]. A process can resume its previous state at any time from its latest checkpoint file, and periodic checkpointing minimizes the computation lost to a failure. Although several stand-alone checkpoint toolkits [23, 31, 38] have been implemented, they are not sufficient for parallel computing for the following reasons. First, they cannot restore communication states such as sockets or shared memory. Second, they do not consider the causal relationships among the states of processes. Process states may depend on one another in a message-passing environment [9, 27]; hence, independent recovery that ignores these dependencies may leave the global process state inconsistent. The inter-process relations must be reconstructed for consistent recovery.

Consistent recovery for message-passing processes has been studied extensively for over two decades, and several distributed algorithms for consistent recovery have been proposed. However, implementing these algorithms is another matter, because they typically assume the following:

- A process detects a failure and revives by itself.
- Revived and surviving processes can communicate after recovery without any channel reconstruction procedure.
- Checkpointed binary files are always available.

There have also been many efforts to make the theory practical [19, 28, 10, 6, 13, 24, 34, 22, 36, 4, 33, 30]. These approaches differ in several respects: system-level versus application-level checkpointing, kernel-level versus user-level checkpointing, user transparency, support for non-blocking message transfer, and direct versus indirect communication. Their implementation levels differ according to these strategies, yet the resulting frameworks remain proofs of concept or evaluation tools for the recovery algorithms. To the best of our knowledge, only a few systems are actually used in practice [19, 11], and they are valid only for a specific parallel programming model or a specific machine.

Our goal in this paper is to construct a practical fault-tolerance system for message-passing applications on grids. We have integrated a rollback-recovery algorithm with the Message Passing Interface (MPI) [14], the de facto standard for message-passing programming. As a result, we present MPICH-GF, which is based on MPICH-G2 [15], the grid-enabled MPI implementation. MPICH-GF is completely transparent to application developers and requires no modification of application source code. No in-transit message is lost during checkpointing, whether it belongs to a blocking or a non-blocking operation. We have extended the Globus job manager module to support rollback-recovery and to control distributed processes across domains. Above all, our main implementation issue is to provide the dynamic process management that is not defined in the original MPI standard (version 1). Since the original MPI standard specifies only static process group management, a revived process, regarded as a new instance, cannot rejoin its process group. In order to enable a new MPI instance to communicate with the running processes, we have implemented an MPI_Rejoin() function.
Currently, the coordinated checkpointing protocol has been implemented and tested with MPICH-GF, and other consistent recovery algorithms (e.g., message logging) are under development. MPICH-GF operates on Linux kernel 2.4 with Globus toolkit 2.2.

The rest of this paper is organized as follows. In Section 2, we present the concept of consistent recovery and work related to fault-tolerance systems. Section 3 describes the operation of the original MPICH-G2. We propose the MPICH-GF architecture in Section 4 and address implementation issues in Section 5. The experimental results of MPICH-GF with the NAS Parallel Benchmarks are shown in Section 6, and finally the conclusions are presented in Section 7.

2 Background

2.1 Consistent Recovery

In a message-passing environment, the states of processes may become dependent on one another through message-receipt events. A consistent system state is one in which, for every message-receipt event reflected in the state, the corresponding send event is also reflected [9]. If a process rolls back to a past state, but another process whose current state depends on the lost state does not roll back, inconsistency occurs.

Figure 1. Inconsistent global checkpoint

Figure 1 shows a simple example of two processes whose local checkpoints do not form a consistent global checkpoint. Suppose that the two processes roll back to their latest local checkpoints. Then process P1 has not yet sent the message m1, but P2 has marked m1 as received. In this case, m1 becomes an orphan message and causes inconsistency. Message m2 is a lost message, in the sense that P2 waits for the arrival of m2, which P1 has marked as already sent. If an in-transit message is not recorded during checkpointing, it becomes a lost message. Both of these message types cause abnormal execution of the processes during recovery.
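The consistency condition above can be stated compactly. Writing send(m) and recv(m) for the send and receipt events of a message m, a recorded global state S is consistent iff

    \forall m:\; \mathrm{recv}(m) \in S \;\Longrightarrow\; \mathrm{send}(m) \in S .

An orphan message is one with recv(m) in S but send(m) not in S; a lost (in-transit) message is one with send(m) in S but recv(m) not in S, which is tolerable only if the message itself has been recorded so that it can be redelivered.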

Extensive research on consistent recovery has been conducted [12]. Approaches to consistent recovery can be categorized into coordinated checkpointing, communication-induced checkpointing and message logging. In coordinated checkpointing, processes synchronize their local checkpointing so that a consistent set of global checkpoints is guaranteed [9, 21]; on a failure, all the processes roll back to the latest global checkpoint for consistent recovery. A common objection, however, is that the coordination does not scale. Communication-induced checkpointing (CIC) lets processes checkpoint independently while preventing the domino effect using information piggy-backed on application messages [2, 12, 37]. In [2], Alvisi et al. disputed the belief that CIC is scalable: according to their experimental report, CIC generates an enormous number of forced checkpoints, and the autonomy of processes in placing local checkpoints does not seem to pay off in practice. Message logging records messages along with checkpoint files in order to replay them during recovery. It can be further classified into pessimistic, optimistic and causal message logging, according to the policy for storing the message logs [3]. Log-based rollback-recovery minimizes the amount of lost computation, at the cost of high storage overhead.

2.2 Related Works

Fault-tolerant systems for message-passing processes can be categorized by whether they support direct or indirect message transfer. With direct message transfer, application processes communicate with one another directly, that is, client-to-client communication is possible; the MPICH implementation is the representative case of the direct transfer mode. With indirect message transfer, messages are sent through a medium such as a daemon; PVM and LAM-MPI [7], for example, take this approach. In these systems a process does not have to know the physical address of another process; instead it maintains a connection with the medium, and hence the recovery of the communication context is relatively easy. However, an increase in message delay is inevitable.

CoCheck [36] is a coordinated checkpointing system for PVM and tuMPI. While CoCheck for tuMPI supports the direct transfer mode, the PVM version exploits the PVM daemon to transfer messages. CoCheck exists as a thin library layered over PVM (or MPI) that wraps the original API. The process control messages are implemented at the same level as application messages, so unless an application process explicitly calls the recv() function, it cannot receive any control messages. MPICH-V [6] is a fault-tolerant MPICH version that supports pessimistic message logging. Every message is transferred to a remote Channel Memory server (CM) that logs and replays messages. CMs are assumed to be stable, so a revived process can recover simply by reconnecting to the CMs. According to the authors, the main reasons for using CMs are to cope with volatile nodes and to keep log data safe, even though the system pays twice the cost for message delivery. FT-MPI [13], proposed by Fagg and Dongarra, supports MPI-2's dynamic task management. FT-MPI is built on the PVM or HARNESS core library, which exploits daemon processes. Li and Tsay also proposed a LAM-MPI based implementation in which messages are transferred via a multicast server on each node [22]. MPI-FT, proposed by Batchu et al. [4], adopts task redundancy to provide fault tolerance; the system has a central coordinator that relays messages to all the redundant processes. The MPI-FT system from the University of Cyprus [24] adopts message logging: an observer process copies all the messages and reproduces them for recovery. MPI-FT pre-spawns processes at the beginning so that one of the spare processes can take over from a failed process. This system has a high overhead for the storage of all the messages.

There are also some research results that support the direct message transfer mode. Starfish [1] is a heterogeneous checkpointing toolkit built on a virtual machine environment, which makes it possible for processes to migrate among heterogeneous platforms. Its limits are that applications have to be written in OCaml and that byte code runs more slowly than native code. Egida [33] is an object-oriented toolkit that provides both communication-induced checkpointing and message logging for MPICH with the ch_p4 device. An event handler intercepts MPI operation events in order to perform the corresponding pre-defined actions for rollback-recovery. To guarantee the atomicity of message transfer, it replaces non-blocking operations with blocking ones. The master process with rank 0 is responsible for updating the communication channel information on recovery; therefore, the master process is not supposed to fail. The current version of Egida is able to detect only process failures, and a failed process is recovered on the same node on which it previously ran. As a result, Egida cannot handle hardware failures. Hector [34] is similar to our MPICH-GF. Hector exists as a relocatable MPI library, MPI-TM, and several executables. Hierarchical process managers create and migrate the application processes, and coordinated checkpointing has been implemented. Before checkpointing, every process closes its channel connections to ensure that no in-transit message is left in the network, and the processes have to reconnect their channels after checkpointing.
One technique to prevent in-transit messages from being lost is to defer checkpointing until all in-transit messages have been delivered. In CoCheck, processes exchange ready messages (RMs) with one another to coordinate and to guarantee the absence of in-transit messages [36]. RMs are sent through the same channels as application messages; since the channels are based on TCP sockets, which preserve FIFO order, an RM's arrival guarantees that all messages sent before the RM have arrived at the receiver. In Legion MPI-FT [30], each process reports the number of messages it has sent and received to the coordinator upon a checkpoint request. The coordinator calculates the number of in-transit messages and then waits until the processes report that all the in-transit messages have arrived. The other way to prevent the loss of in-transit messages is to build a user-level reliable communication protocol in which in-transit messages are recorded in the sender's checkpoint file as undelivered. Meth et al. named this technique "stop and discard" in [25]: messages received during checkpointing are discarded, and after checkpointing the discarded messages are re-sent by the user-level reliable communication protocol. This mechanism requires one more memory copy at the sender side. RENEW [28] is a recoverable runtime system proposed by Neves et al. It also places user-level reliable communication layers on top of the UDP transport protocol in order to

log messages and to prevent the loss of in-transit messages.

Some studies use application-level checkpointing for migration among heterogeneous nodes or to minimize the checkpoint file size [5, 29, 30, 32, 35]. This type of checkpointing burdens application developers with deciding when to checkpoint, what to store in the checkpoint file, and how to recover with the stored information. The CLIP toolkit [10] for the Intel Paragon provides a semi-transparent checkpointing environment: the user must make minor code modifications to define the checkpointing locations. Although a system-level checkpoint file is not itself portable across heterogeneous machines, we believe that user transparency is the more important virtue, since a process rarely needs to recover on a heterogeneous node as long as plenty of homogeneous nodes exist; moreover, application developers are not willing to accept such a programming effort. To sum up, most existing systems either rely on the indirect communication mode or are valid only for specialized platforms.

3 MPICH-G2

In this section, we describe the execution of MPI processes on Globus middleware and the communication mechanism of MPICH-G2. The Message Passing Interface (MPI) is the de facto standard specification for message-passing programming, abstracting low-level message-passing primitives away from the developer [14]. Among the several MPI implementations, MPICH [20] is the most popular for its good performance and portability. The portability of MPICH can be attributed to its abstraction of low-level operations, the Abstract Device Interface (ADI), as shown in Figure 2. An implementation of the ADI is called a virtual device; the current MPICH version includes about 15 virtual devices. In particular, MPICH with the grid device globus2 is called MPICH-G2 [15]. For reference, MPICH-GF consists of the unmodified upper MPICH layers plus our own virtual device, ft-globus, derived from globus2.

Figure 2. Layers of MPICH

3.1 Globus Run-time Module

The Globus toolkit [16], proposed by ANL, is the de facto standard grid middleware. It provides directory service, resource monitoring and allocation, data sharing, authentication and authorization. However, it lacks dynamic run-time process control, such as dynamic load balancing or fault tolerance. Figure 3 describes how MPI processes are launched on the grid middleware. Three Globus modules are concerned with process execution: DUROC (Dynamically-Updated Request Online Co-allocator), the GRAM (Globus Resource Allocation Management) job managers, and a gatekeeper. DUROC distributes a user request to the local GRAM modules, and a gatekeeper on each node then checks whether the user is authenticated. If the user is authenticated, the gatekeeper launches a GRAM job manager. Finally, the GRAM job manager forks and executes the requested processes.

Figure 3. The procedure of process launching in Globus

Neither DUROC nor the GRAM job manager controls processes dynamically; they merely monitor them. Nevertheless, Globus provides the basic framework for hierarchical process management. For dynamic process management, we have extended the capabilities of DUROC and the GRAM job manager by modifying their source code. We present these extensions in Section 4; from here on, we use the terms central manager and local manager in place of DUROC and GRAM job manager, respectively.

3.2 globus2 Virtual Device

Collective communication in MPICH-G2 is implemented as a combination of point-to-point (P2P) communications based on non-blocking TCP sockets and active polling. MPICH provides P2P communication primitives with the following semantics:

- Blocking operations: MPI_Send(), MPI_Recv()
- Non-blocking operations: MPI_Isend(), MPI_Irecv()
- Polling: MPI_Wait(), MPI_Waitall()

The blocking send operation submits a send request to the kernel and waits until the kernel has copied the message into kernel-level memory. The blocking receive operation likewise waits until the kernel has delivered the requested message to user-level memory, whereas a non-blocking operation only registers its request and returns. The actual message delivery for a non-blocking operation is not performed until a polling function is called. Indeed, the blocking operation is the combination of the non-blocking operation and the polling function in MPICH-GF.
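This equivalence can be expressed directly with standard MPI calls. The sketch below (the buffer, destination and tag arguments are illustrative) shows only the semantics; it is not the internal device code.

    #include <mpi.h>

    /* A blocking send behaves like a non-blocking request followed by
     * polling until the request completes. */
    void send_blocking_equivalent(double *buf, int count, int dest, int tag)
    {
        MPI_Request req;
        MPI_Status  status;

        /* MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD)
         * behaves like: */
        MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &status);   /* message progress is driven by the polling call */
    }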

The communication mechanism of the globus2 device is similar to that of the ch_p4 device [8]. Each MPI process opens a listener port to accept channel-opening requests. On receipt of a request, the receiver opens another socket and constructs a channel between the two processes. All the listener information is distributed to every process at MPI initialization: the master process with rank 0 collects the listener information of the others and broadcasts it. Unlike ch_p4, however, globus2 does not fork a separate listener process; a channel-opening request is not accepted until the receiver explicitly receives it by calling the polling function.

Figure 4 shows the structure of the process group table commworldchannel of the globus2 device. The i-th entry of commworldchannel contains the information for the channel to the process with rank i.

Figure 4. commworldchannel structure

In Figure 4, we abstract this structure to show only the values of interest. The pair of hostname and port is the address of a listener. handlep holds the actual channel information; if handlep is null, the channel has not been opened yet. A send operation pushes a request into the send queue and registers the request with the globus_io module; the polling function is then called to wait until the kernel handles the request.

There are two receive queues in MPICH: the unexpected queue and the posted queue (Figure 5). The former contains arrived messages whose receive requests have not yet been issued. When a receive operation is called, it first examines the unexpected queue to see whether the message has already arrived; if the corresponding message exists there, it is delivered, and otherwise the receive request is enqueued in the posted queue. On message arrival, the handler checks whether there is a corresponding request in the posted queue; if so, the message is copied into the requested buffer, and otherwise the message is pushed into the unexpected queue.

Figure 5. Receive queues of MPICH
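A minimal sketch of this two-queue matching logic is given below. The types, queue helpers and function names are illustrative; they mirror the fields shown in Figure 5 rather than the real MPICH data structures.

    #include <stddef.h>

    typedef struct rhandle {
        int             source_rank;
        int             tag;
        int             context_id;
        void           *buffer;
        struct rhandle *next;
    } rhandle_t;

    /* Queue helpers (implementations omitted in this sketch). */
    extern rhandle_t *find_and_remove(rhandle_t **queue, int src, int tag, int ctx);
    extern void       enqueue(rhandle_t **queue, rhandle_t *elem);
    extern void       copy_payload(rhandle_t *msg, void *user_buffer);

    static rhandle_t *unexpected_q;  /* arrived messages with no posted request */
    static rhandle_t *posted_q;      /* posted requests with no arrived message */

    /* Receive side: deliver from the unexpected queue if the message is
     * already there, otherwise post the request. */
    void post_receive(rhandle_t *req)
    {
        rhandle_t *msg = find_and_remove(&unexpected_q, req->source_rank,
                                         req->tag, req->context_id);
        if (msg != NULL)
            copy_payload(msg, req->buffer);
        else
            enqueue(&posted_q, req);
    }

    /* Arrival side: satisfy a matching posted request, otherwise keep the
     * message in the unexpected queue. */
    void on_arrival(rhandle_t *msg)
    {
        rhandle_t *req = find_and_remove(&posted_q, msg->source_rank,
                                         msg->tag, msg->context_id);
        if (req != NULL)
            copy_payload(msg, req->buffer);
        else
            enqueue(&unexpected_q, msg);
    }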

4 MPICH-GF

In the MPICH-GF system, MPI processes run under the control of hierarchical managers: a central manager and local managers. The central manager is responsible for hardware or network failures and the local managers for process failures; together they are also responsible for checkpointing and automatic recovery. In this section, we present the structure of MPICH-GF and the implementation of its checkpointing and recovery protocols.

Figure 6. The structure of MPICH-GF

4.1 Structure

Figure 6 shows the structure of the MPICH-GF library. The fault-tolerance module is implemented at the virtual device level, in ft-globus, and MPICH-GF requires no modification of the upper MPI implementation layers. The ft-globus device contains dynamic process group management, a checkpoint toolkit, and message queue management. Most previous work implements the fault-tolerance module at an upper layer, abstracting over the communication primitives, whereas our implementation resides at the lower layer. Upper-layer approaches cannot respect the characteristics of certain communication operations, for example the non-blocking operations; in addition, the low-level approach is unavoidable when the physical communication channels must be reconstructed.

4.2 Coordinated Checkpointing

For consistent recovery, a coordinated checkpointing protocol is employed. The central manager initiates global checkpointing periodically, as shown in Figure 7. The local managers then signal the processes with SIGUSR1 so that they can prepare for checkpointing; the signal handler for checkpointing is registered in each MPI process during MPI initialization. On receipt of SIGUSR1, the signal handler executes a barrier-like function before checkpointing. Performing the barrier guarantees two things: there is no orphan message between any two processes, and there is no in-transit message, because the barrier messages push any previously issued message through to its receiver. The channel implementation of globus2 is based on TCP sockets, so the FIFO property holds. The pushed messages are stored in the receiver's queues in user-level memory, so the checkpoint file includes them; when processes are later recovered from this global checkpoint, every in-transit message is restored as well. This technique is similar to the Ready Message of CoCheck. After the quasi-barrier, each process generates its checkpoint file and informs its local manager that checkpointing has completed successfully. The central manager checks that all the checkpoint files have been generated and commits the collection as a new version of the global checkpoint.

Figure 7. Coordinated checkpointing protocol
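Seen from a single process, the control flow of this protocol can be sketched as follows. The helper names quasi_barrier(), write_checkpoint() and notify_local_manager() are illustrative placeholders for MPICH-GF internals, and the sketch omits the mutual-exclusion check discussed in Section 5.2.

    #include <signal.h>

    extern void quasi_barrier(void);          /* flushes in-transit messages        */
    extern void write_checkpoint(void);       /* dumps the process state to storage */
    extern void notify_local_manager(void);   /* reports "checkpoint completed"     */

    static void checkpoint_signal_handler(int signo)
    {
        (void)signo;
        quasi_barrier();          /* afterwards, no orphan or in-transit messages remain */
        write_checkpoint();       /* pushed messages sit in the user-level receive queues,
                                     so they are captured in the checkpoint image        */
        notify_local_manager();   /* the central manager commits the global checkpoint
                                     once every process has reported                     */
    }

    /* Registered once, during MPI initialization. */
    void install_checkpoint_handler(void)
    {
        struct sigaction sa;
        sa.sa_handler = checkpoint_signal_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);
    }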

4.3 Consistent Recovery

In the MPICH-GF system, the hierarchical managers are responsible for failure detection and automatic recovery. We have implemented the hierarchical managers by modifying the original Globus run-time modules described in Section 3. Since application processes are forked by the local manager, the local manager receives SIGCHLD when one of its forked application processes terminates. Upon receiving the signal, the manager checks whether the termination was a normal exit() by calling the waitpid() system call. If so, the execution finished successfully and the local manager does not have to do anything; otherwise, the local manager regards it as a process failure and notifies the central manager. The local manager itself may fail as well, so the central manager monitors all the local managers by pinging them periodically. If a local manager does not answer, the central manager assumes that the local manager, the hardware or the network has failed, and it re-submits the request to the GRAM module in order to restore the failed processes from their checkpoint files.

In coordinated checkpointing, a single failure causes all the processes to roll back to the consistent global checkpoint. The central manager broadcasts both the failure event and a rollback request. Our first approach to recovery was to kill all the processes upon a single failure and to recreate them by submitting sub-requests to the gatekeeper on each node. This approach spent too much time on the gatekeeper's authentication and on reconstructing all the channels. To improve the efficiency of recovery, we instead let the surviving processes live on: they load the checkpoint file into their memory by calling exec(). This mechanism affects only the user-level memory and leaves the channel state on the kernel side untouched, so after reloading their memory the surviving processes can communicate without any channel reconstruction; only the channel to the failed process needs to be reconstructed.
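The local manager's detection logic described above amounts to a SIGCHLD handler around waitpid(). The sketch below is illustrative; report_process_failure() is a placeholder for the notification sent to the central manager, not an actual MPICH-GF function.

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    extern void report_process_failure(pid_t pid);   /* placeholder notification */

    static void sigchld_handler(int signo)
    {
        int   status;
        pid_t pid;
        (void)signo;

        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            if (WIFEXITED(status))
                continue;                    /* normal exit(): nothing to do           */
            report_process_failure(pid);     /* abnormal termination: trigger rollback */
        }
    }

    void install_sigchld_handler(void)
    {
        struct sigaction sa;
        sa.sa_handler = sigchld_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        sigaction(SIGCHLD, &sa, NULL);
    }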

5 Implementation Issues

5.1 Communication Channel Reconstruction

The original MPI specification supports only static process group management. Once a process group and its channels are built, this information cannot be altered at run time; in other words, no new process can join the group. A recovered process instance is regarded as a new instance from the group's point of view: although it can restore the process state, it cannot communicate with the surviving processes. Our solution is to invalidate the communication channels of the failed process and to reconstruct them. We have implemented a new MPI_Rejoin() function for this purpose. Before a restored process resumes computation, it calls MPI_Rejoin() to update its listener information in the other processes' commworldchannel tables and to re-initialize its own channel information. To update the listener port, the restored process reports the following values:

- global rank: the logical process ID of the previous run.
- hostname: the address of the node on which the process has been restored.
- port number: the new listener port number.

This information is broadcast via the hierarchical managers. The surviving processes either invalidate the channel to the failed process or renew the listener information of the restored process, according to the event information received from their local manager. Figure 8 describes the interaction among the managers and the processes during an MPI_Rejoin() call: (1) the recovered process informs its local manager of its new listener information, (2) the local manager forwards it to the central manager, (3) the central manager broadcasts it to the local managers, and (4) the local managers deliver the new commworldchannel information to the surviving processes. The restored process frees its channel handles so that it behaves as if it had never created a channel to any other process; the procedure of channel reconstruction is then identical to that of channel creation (as described in Section 3.2).

Figure 8. Communication channel reconstruction protocol

5.2 Atomicity of Message Transfer

Messages in MPICH-G2 are sent in two parts, a header and a payload. If checkpointing is performed after a header has been received but before its payload arrives, part of the message may be lost. We therefore enforce the atomicity of message transfer in order to store and restore the communication context safely; in other words, checkpointing is never performed while a message transfer is in progress. We have made the MPI communication operations mutually exclusive with the checkpoint signal handler, as shown in Figure 9.

    MPI_function() {
        requested_for_coordination = FALSE;
        ft_globus_mutex = 1;
        ...
        if (requested_for_coordination == TRUE) then
            do_checkpoint();
        endif
        ft_globus_mutex = 0;
    }

    signal_handler() {
        if (ft_globus_mutex == 1) then
            requested_for_coordination = TRUE;
        else
            do_checkpointing();
        ...
        return;
    }

Figure 9. Atomicity of message transfer

The mutually exclusive regions for the send and receive operations are implemented at different levels. We set the whole send-operation area (MPI_Send and MPI_Isend) as a critical section: the process state recorded in a checkpoint file must reflect either that no send operation has been called or that the message has been sent completely.

We do not want a checkpoint file to contain send requests in the send queue, because each send request is tied to a physical channel; if a process restored its state with send-queue entries, some of them might not be sent correctly after the channels have been updated. The kernel does not process a non-blocking send request until the polling function MPI_Wait() is called explicitly in the user code. MPICH-GF therefore replaces the non-blocking send operation with the blocking one to ensure that no send request remains in a checkpoint file. This replacement affects neither the performance nor the correctness of the process, because the blocking operation merely submits its request to the kernel and does not wait for the message to be delivered at the receiver side.

The critical section of the receive operation is implemented at a lower level than that of the send operation. If the whole receive operation were made a critical section, a deadlock could occur during coordinated checkpointing (Figure 10): a sender waits for all the processes to enter the coordination procedure, while the receiver waits for the arrival of the requested message, and at the receiver side the coordination signal is delayed until the receive call finishes. The blocking receive is a combination of the non-blocking receive and a loop over the polling function, and the actual message delivery from the kernel to user memory is done by the polling. To prevent the deadlock, we make only the polling function inside the loop a critical section. A checkpoint file may therefore contain receive requests in the receive queues, which does not matter because they are not tied to physical channel information. To match an arrived message with a receive request, MPI's receive module checks only the sender's rank and the message tag, so a recovered process can still receive the messages corresponding to the restored receive requests. For that reason, the non-blocking receive does not need to be replaced with the blocking one.

Figure 10. Deadlock of blocking operations in coordinated checkpointing
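MPICH-GF performs the send-side substitution described above inside the ft-globus device. Purely as an illustration of the idea, a similar effect could be approximated at the library boundary through the MPI profiling interface, completing the send before returning and handing back a null request; this is a sketch of the technique, not MPICH-GF's code, and it assumes the application tolerates the stricter blocking semantics of MPI_Send.

    #include <mpi.h>

    /* Illustration only: complete the send before returning so that no pending
     * send request can ever appear in a checkpoint image. */
    int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *request)
    {
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        *request = MPI_REQUEST_NULL;   /* a later MPI_Wait() returns immediately */
        return rc;
    }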

6 Experimental Results

In this section, we present performance results for five NAS Parallel Benchmark applications [26] (LU, BT, CG, IS and MG) executed on a cluster-based grid system of four Intel Pentium III 800 MHz PCs with 256 MB of memory, connected by ordinary 100 Mbps Ethernet. Linux kernel version 2.4 and Globus toolkit version 2.2 were installed. We measured the overhead of checkpointing and the cost of recovery.

Table 1 shows the characteristics of the applications. IS uses all-to-all collective operations only.¹ MG and LU use anonymous receive operations with the MPI_ANY_SOURCE option; however, the corresponding sender of each receive call is determined statically by message tagging. LU is the most communication-intensive application, but its messages are the smallest. Each process of CG has two or three neighbors and communicates with only one of them per iteration; the message size of CG is relatively large. BT is also a communication-intensive application: each process has six neighbors in all directions of a 3D cube, but all of them use non-blocking operations in every iteration.

Application (Class)   Description                 Communication pattern   Avg. message size (KB)   Messages sent per process   Executable size (KB)   Avg. checkpoint size (MB)
BT (A)                Navier-Stokes equation      3D mesh
CG (B)                Conjugate gradient method   Chain
IS (B)                Integer sort                All-to-all
LU (A)                LU decomposition            Mesh
MG (A)                Multiple grid               Cube

Table 1. Characteristics of the NPB applications used in the experiments

We measured the total execution time of the NPB applications using MPICH-G2 and MPICH-GF, respectively. To evaluate the checkpointing overhead, we varied the checkpoint period from 10% to 50% of the total execution time: with a 10% period an application checkpoints nine times, while with a 50% period only one checkpoint is taken. Figure 11(a) shows the total execution time of each case without any failure. To evaluate the monitoring overhead, we also measured the total execution time of MPICH-G2 and of MPICH-GF without checkpointing. In this experiment, the central manager queries the local managers for their status every five seconds; if a local manager does not reply, it is assumed to have failed. The difference between the first two cases in Figure 11 indicates the monitoring overhead, and the other five cases show the execution time of MPICH-GF with 1, 2, 3, 4 and 9 checkpoints, respectively.

Figure 11(b) shows the average checkpointing overhead measured at each process. The checkpointing overhead consists of the communication overhead, the barrier overhead and the disk overhead. We take the communication overhead to be the network delay among the hierarchical managers during a single coordinated checkpoint. We used the O_SYNC option at file open in order to guarantee that the blocking disk writes are flushed physically, which is considerably time-consuming. As shown in the figure, the barrier overhead is quite small and the communication overhead is about the same for all the applications. The disk overhead is the dominant part of the checkpointing cost and is almost proportional to the checkpoint file size, as expected. The barrier overhead of the IS (class B) application is exceptionally large. This is due to the in-transit messages of IS, which are the largest among all the applications (Table 1); during the barrier operation they are pushed into the receiver's user-level memory, so it takes IS more time to complete the barrier.

Incremental checkpointing and forked checkpointing could be applied to reduce the disk overhead. However, we are not sure that incremental checkpointing would be as effective as expected, since all the applications used here have relatively small executables (about 1 MB) while using large data segments. For example, we measured the checkpoint file size of the IS application for two problem sizes, class A and class B: the class A checkpoint file is 41.7 MBytes and the class B file is considerably larger, while the executable remains the same. The difference in checkpoint file size comes from the heap and the stack, and these regions tend to change frequently.
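For reference, the synchronous write behaviour mentioned above corresponds to opening the checkpoint file roughly as follows; the path and permission bits are illustrative, not the actual names used by MPICH-GF.

    #include <fcntl.h>
    #include <sys/stat.h>

    /* O_SYNC forces each write() to reach the disk before it returns,
     * so the checkpoint is physically flushed when the file is closed. */
    int open_checkpoint_file(const char *path)   /* e.g. "ckpt.rank0.img" */
    {
        return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC,
                    S_IRUSR | S_IWUSR);
    }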
To measure the recovery cost, we recorded the time at the central manager from failure detection to the completion of the channel-update broadcast; the recovery cost per single failure is presented in Figure 12. Recovery proceeds in the following steps: failure detection, failure broadcast, authentication of the request, and process re-launching. Among these, authentication is the dominant factor, and it is tied closely to the workings of the Globus middleware. The second largest factor is process re-launching, most of which is spent reading the checkpoint file from disk. We note that neither of these significant overhead factors is related to scalability.

¹ Collective operations are implemented by calling non-blocking P2P operations.

Figure 11. Failure-free overhead: (a) total execution time, (b) composition of the checkpointing overhead

7 Conclusions

In this paper, we have presented the feasibility, architecture and evaluation of a fault-tolerant system for grid-enabled MPICH. Our system minimizes the lost computation of processes through periodic checkpointing and guarantees consistent recovery. It requires no modification of application source code or of the upper MPICH layers. While previous work has modified the higher layers at the expense of performance, MPICH-GF respects the communication characteristics of MPICH through its lower-level approach, and handling the communication context at the lower level makes fine-grained checkpoint timing possible. All of our work has been accomplished at the user level. MPICH-GF inter-operates with the Globus toolkit and can restore a failed process on any node across domains, provided the GASS service is available. We have also discussed the implementation issues and the evaluation of coordinated checkpointing; at the time of writing, an implementation of independent checkpointing with message logging is under way.

The central manager is a single point of failure, and we plan to use redundancy on the central manager for high availability. Our ultimate target grid system is a collection of clusters, and local failures occurring inside a cluster should preferably be handled within that cluster. As shown in our experimental results, the recovery cost includes the time-consuming authentication overhead; since the original task request has already been authenticated, a recovery request may skip this procedure. To manage such local failures efficiently, a more deeply hierarchical management architecture is required. We are currently developing a multi-level hierarchical manager system with redundancy.

References

[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing.
[2] L. Alvisi, E. N. Elnozahy, S. Rao, S. A. Husain, and A. D. Mel. An analysis of communication induced checkpointing. In Symposium on Fault-Tolerant Computing.

Figure 12. Recovery cost

[3] L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal and optimal. IEEE Transactions on Software Engineering, 24(2).
[4] R. Batchu, A. Skjellum, Z. Cui, M. Beddhu, J. P. Neelamegam, Y. Dandass, and M. Apte. MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In 1st International Symposium on Cluster Computing and the Grid.
[5] A. Beguelin, E. Seligman, and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2).
[6] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. F. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In SuperComputing 2002.
[7] G. Burns, R. Daoud, and J. Vaigl. LAM: An open cluster environment for MPI. In Proceedings of the Supercomputing Symposium, Toronto, Canada.
[8] R. Butler and E. L. Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing, 20(4).
[9] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computing Systems, 3(1):63-75.
[10] Y. Chen, K. Li, and J. S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of SC97: High Performance Networking & Computing.
[11] IBM Corporation. IBM LoadLeveler: User's guide.
[12] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3).
[13] G. E. Fagg and J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000.
[14] Message Passing Interface Forum. MPI: A message passing interface standard.
[15] I. Foster and N. T. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed computing systems. In Proceedings of SC 98. ACM Press.
[16] I. Foster and C. Kesselman. The Globus project: A status report. In Proceedings of the Heterogeneous Computing Workshop, pages 4-18.
[17] I. Foster and C. Kesselman. The Grid: Blueprint for a Future Computing Infrastructure. Morgan Kaufmann Publishers.
[18] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications, 15(3).

[19] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10).
[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. MPICH: A high-performance, portable implementation of the MPI Message Passing Interface standard. Parallel Computing, 22(6).
[21] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23-31.
[22] W.-J. Li and J.-J. Tsay. Checkpointing message-passing interface (MPI) parallel programs. In Pacific Rim International Symposium on Fault-Tolerant Systems (PRFTS).
[23] M. J. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the UNIX kernel. In USENIX Conference Proceedings, San Francisco, CA.
[24] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou. Portable fault tolerance scheme for MPI. Parallel Processing Letters, 10(4).
[25] K. Z. Meth and W. G. Tuel. Parallel checkpoint/restart without message logging. In Proceedings of the 2000 International Workshops on Parallel Processing.
[26] NASA Ames Research Center. NAS Parallel Benchmarks. Technical report.
[27] R. Netzer and J. Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2).
[28] N. Neves and W. K. Fuchs. RENEW: A tool for fast and efficient implementation of checkpoint protocols. In Symposium on Fault-Tolerant Computing, pages 58-67.
[29] G. T. Nguyen, V. D. Tran, and M. Kotocová. Application recovery in parallel programming environment. In European PVM/MPI.
[30] A. Nguyen-Tuong. Integrating Fault-Tolerance Techniques in Grid Applications. PhD thesis, University of Virginia, USA. Partial fulfillment of the requirements for the degree Doctor of Computer Science.
[31] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. In USENIX Winter 1995 Technical Conference.
[32] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10).
[33] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48-55.
[34] S. H. Russ, J. Robinson, B. K. Flachs, and B. Heckel. The Hector distributed run-time environment. IEEE Transactions on Parallel and Distributed Systems, 9(11).
[35] L. Silva and J. Silva. System-level versus user-defined checkpointing. In Symposium on Reliable Distributed Systems 1998, pages 68-74.
[36] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium.
[37] J. Tsai, S.-Y. Kuo, and Y.-M. Wang. Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability. IEEE Transactions on Parallel and Distributed Systems, 9(10).
[38] V. Zandy, B. Miller, and M. Livny. Process hijacking. In Eighth International Symposium on High Performance Distributed Computing.


More information

An introduction to checkpointing. for scientific applications

An introduction to checkpointing. for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI An introduction to checkpointing for scientific applications November 2013 CISM/CÉCI training session What is checkpointing? Without checkpointing: $./count

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

The Hector Distributed Run Time

The Hector Distributed Run Time The Hector Distributed Run Time Environment 1 A Manuscript Submitted to the IEEE Transactions on Parallel and Distributed Systems Dr. Samuel H. Russ, Jonathan Robinson, Dr. Brian K. Flachs, and Bjorn Heckel

More information

Multicast can be implemented here

Multicast can be implemented here MPI Collective Operations over IP Multicast? Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon Computer Science and Computer Engineering University of Arkansas Fayetteville, Arkansas, U.S.A fhachen,yochoa,aapong@comp.uark.edu

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme Yue Zhang 1 and Yunxia Pei 2 1 Department of Math and Computer Science Center of Network, Henan Police College, Zhengzhou,

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Communication Characteristics in the NAS Parallel Benchmarks

Communication Characteristics in the NAS Parallel Benchmarks Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this

More information

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS Ruchi Tuli 1 & Parveen Kumar 2 1 Research Scholar, Singhania University, Pacheri Bari (Rajasthan) India 2 Professor, Meerut Institute

More information

PM2: High Performance Communication Middleware for Heterogeneous Network Environments

PM2: High Performance Communication Middleware for Heterogeneous Network Environments PM2: High Performance Communication Middleware for Heterogeneous Network Environments Toshiyuki Takahashi, Shinji Sumimoto, Atsushi Hori, Hiroshi Harada, and Yutaka Ishikawa Real World Computing Partnership,

More information

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme On Checkpoint Latency Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 E-mail: vaidya@cs.tamu.edu Web: http://www.cs.tamu.edu/faculty/vaidya/ Abstract

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

MPI/FT TM : Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing *

MPI/FT TM : Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing * MPI/FT TM : Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing * Rajanikanth Batchu, Jothi P. Neelamegam, Zhenqian Cui, Murali Beddhu,

More information

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,

More information

MPI-Mitten: Enabling Migration Technology in MPI

MPI-Mitten: Enabling Migration Technology in MPI MPI-Mitten: Enabling Migration Technology in MPI Cong Du and Xian-He Sun Department of Computer Science Illinois Institute of Technology Chicago, IL 60616, USA {ducong, sun}@iit.edu Abstract Group communications

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering

More information

Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs

Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Livermore,

More information

Proactive Fault Tolerance in MPI Applications via Task Migration

Proactive Fault Tolerance in MPI Applications via Task Migration Proactive Fault Tolerance in MPI Applications via Task Migration Sayantan Chakravorty Celso L. Mendes Laxmikant V. Kalé Department of Computer Science, University of Illinois at Urbana-Champaign {schkrvrt,cmendes,kale}@uiuc.edu

More information

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi DEPT. OF Comp Sc. and Engg., IIT Delhi Three Models 1. CSV888 - Distributed Systems 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1 Index - Models to study [2] 1. LAN based systems

More information

An introduction to checkpointing. for scientifc applications

An introduction to checkpointing. for scientifc applications damien.francois@uclouvain.be UCL/CISM An introduction to checkpointing for scientifc applications November 2016 CISM/CÉCI training session What is checkpointing? Without checkpointing: $./count 1 2 3^C

More information

AN EFFICIENT ALGORITHM IN FAULT TOLERANCE FOR ELECTING COORDINATOR IN DISTRIBUTED SYSTEMS

AN EFFICIENT ALGORITHM IN FAULT TOLERANCE FOR ELECTING COORDINATOR IN DISTRIBUTED SYSTEMS International Journal of Computer Engineering & Technology (IJCET) Volume 6, Issue 11, Nov 2015, pp. 46-53, Article ID: IJCET_06_11_005 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=11

More information

A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments

A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments 1 A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments E. M. Karanikolaou and M. P. Bekakos Laboratory of Digital Systems, Department of Electrical and Computer Engineering,

More information

An MPI failure detector over PMPI 1

An MPI failure detector over PMPI 1 An MPI failure detector over PMPI 1 Donghoon Kim Department of Computer Science, North Carolina State University Raleigh, NC, USA Email : {dkim2}@ncsu.edu Abstract Fault Detectors are valuable services

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Adding semi-coordinated checkpoints to RADIC in Multicore clusters

Adding semi-coordinated checkpoints to RADIC in Multicore clusters Adding semi-coordinated checkpoints to RADIC in Multicore clusters Marcela Castro 1, Dolores Rexachs 1, and Emilio Luque 1 1 Computer Architecture and Operating Systems Department, Universitat Autònoma

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project Introduction to GT3 The Globus Project Argonne National Laboratory USC Information Sciences Institute Copyright (C) 2003 University of Chicago and The University of Southern California. All Rights Reserved.

More information

A Behavior Based File Checkpointing Strategy

A Behavior Based File Checkpointing Strategy Behavior Based File Checkpointing Strategy Yifan Zhou Instructor: Yong Wu Wuxi Big Bridge cademy Wuxi, China 1 Behavior Based File Checkpointing Strategy Yifan Zhou Wuxi Big Bridge cademy Wuxi, China bstract

More information

Rollback-Recovery p Σ Σ

Rollback-Recovery p Σ Σ Uncoordinated Checkpointing Rollback-Recovery p Σ Σ Easy to understand No synchronization overhead Flexible can choose when to checkpoint To recover from a crash: go back to last checkpoint restart m 8

More information

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk

Expressing Fault Tolerant Algorithms with MPI-2. William D. Gropp Ewing Lusk Expressing Fault Tolerant Algorithms with MPI-2 William D. Gropp Ewing Lusk www.mcs.anl.gov/~gropp Overview Myths about MPI and Fault Tolerance Error handling and reporting Goal of Fault Tolerance Run

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

The Use of the MPI Communication Library in the NAS Parallel Benchmarks

The Use of the MPI Communication Library in the NAS Parallel Benchmarks The Use of the MPI Communication Library in the NAS Parallel Benchmarks Theodore B. Tabe, Member, IEEE Computer Society, and Quentin F. Stout, Senior Member, IEEE Computer Society 1 Abstract The statistical

More information

The Design of a State Machine of Controlling the ALICE s Online-Offline Compute Platform. Sirapop Na Ranong

The Design of a State Machine of Controlling the ALICE s Online-Offline Compute Platform. Sirapop Na Ranong The Design of a State Machine of Controlling the ALICE s Online-Offline Compute Platform Sirapop Na Ranong Control, Configuration and Monitoring The Functional Requirement of Control System Responsible

More information

Estimation of MPI Application Performance on Volunteer Environments

Estimation of MPI Application Performance on Volunteer Environments Estimation of MPI Application Performance on Volunteer Environments Girish Nandagudi 1, Jaspal Subhlok 1, Edgar Gabriel 1, and Judit Gimenez 2 1 Department of Computer Science, University of Houston, {jaspal,

More information

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone:

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone: Some Thoughts on Distributed Recovery (preliminary version) Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: 409-845-0512 Fax: 409-847-8578 E-mail:

More information

ProActive SPMD and Fault Tolerance Protocol and Benchmarks

ProActive SPMD and Fault Tolerance Protocol and Benchmarks 1 ProActive SPMD and Fault Tolerance Protocol and Benchmarks Brian Amedro et al. INRIA - CNRS 1st workshop INRIA-Illinois June 10-12, 2009 Paris 2 Outline ASP Model Overview ProActive SPMD Fault Tolerance

More information

SimpleChubby: a simple distributed lock service

SimpleChubby: a simple distributed lock service SimpleChubby: a simple distributed lock service Jing Pu, Mingyu Gao, Hang Qu 1 Introduction We implement a distributed lock service called SimpleChubby similar to the original Google Chubby lock service[1].

More information

Semantic and State: Fault Tolerant Application Design for a Fault Tolerant MPI

Semantic and State: Fault Tolerant Application Design for a Fault Tolerant MPI Semantic and State: Fault Tolerant Application Design for a Fault Tolerant MPI and Graham E. Fagg George Bosilca, Thara Angskun, Chen Zinzhong, Jelena Pjesivac-Grbovic, and Jack J. Dongarra

More information

Loaded: Server Load Balancing for IPv6

Loaded: Server Load Balancing for IPv6 Loaded: Server Load Balancing for IPv6 Sven Friedrich, Sebastian Krahmer, Lars Schneidenbach, Bettina Schnor Institute of Computer Science University Potsdam Potsdam, Germany fsfried, krahmer, lschneid,

More information

CS514: Intermediate Course in Computer Systems

CS514: Intermediate Course in Computer Systems CS514: Intermediate Course in Computer Systems Lecture 23: Nov 12, 2003 Chandy-Lamport Snapshots About these slides Consists largely of a talk by Keshav Pengali Point of the talk is to describe his distributed

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

A Distributed Media Service System Based on Globus Data-Management Technologies1

A Distributed Media Service System Based on Globus Data-Management Technologies1 A Distributed Media Service System Based on Globus Data-Management Technologies1 Xiang Yu, Shoubao Yang, and Yu Hong Dept. of Computer Science, University of Science and Technology of China, Hefei 230026,

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Prabhaker Mateti Wright State University Dayton, Ohio, USA Overview High performance computing High throughput computing NOW, HPC, and HTC Parallel algorithms Software

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

A Time-To-Live Based Reservation Algorithm on Fully Decentralized Resource Discovery in Grid Computing

A Time-To-Live Based Reservation Algorithm on Fully Decentralized Resource Discovery in Grid Computing A Time-To-Live Based Reservation Algorithm on Fully Decentralized Resource Discovery in Grid Computing Sanya Tangpongprasit, Takahiro Katagiri, Hiroki Honda, Toshitsugu Yuba Graduate School of Information

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,

More information

Enhanced N+1 Parity Scheme combined with Message Logging

Enhanced N+1 Parity Scheme combined with Message Logging IMECS 008, 19-1 March, 008, Hong Kong Enhanced N+1 Parity Scheme combined with Message Logging Ch.D.V. Subba Rao and M.M. Naidu Abstract Checkpointing schemes facilitate fault recovery in distributed systems.

More information

The Cost of Recovery in Message Logging Protocols

The Cost of Recovery in Message Logging Protocols 160 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 12, NO. 2, MARCH/APRIL 2000 The Cost of Recovery in Message Logging Protocols Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin AbstractÐPast

More information

MICE: A Prototype MPI Implementation in Converse Environment

MICE: A Prototype MPI Implementation in Converse Environment : A Prototype MPI Implementation in Converse Environment Milind A. Bhandarkar and Laxmikant V. Kalé Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign

More information

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017 ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)

More information

A Comprehensive User-level Checkpointing Strategy for MPI Applications

A Comprehensive User-level Checkpointing Strategy for MPI Applications A Comprehensive User-level Checkpointing Strategy for MPI Applications Technical Report # 2007-1, Department of Computer Science and Engineering, University at Buffalo, SUNY John Paul Walters Department

More information

Proactive Process-Level Live Migration in HPC Environments

Proactive Process-Level Live Migration in HPC Environments Proactive Process-Level Live Migration in HPC Environments Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen L. Scott Oak Ridge National Laboratory SC 08 Nov. 20 Austin,

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

A Resource Look up Strategy for Distributed Computing

A Resource Look up Strategy for Distributed Computing A Resource Look up Strategy for Distributed Computing F. AGOSTARO, A. GENCO, S. SORCE DINFO - Dipartimento di Ingegneria Informatica Università degli Studi di Palermo Viale delle Scienze, edificio 6 90128

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information