Adding semi-coordinated checkpoints to RADIC in Multicore clusters


Marcela Castro, Dolores Rexachs, and Emilio Luque
Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Barcelona, Spain

Abstract

Fault tolerance strategies must be adapted to current High Performance Computing systems, which have a growing number of processors. RADIC is a fault tolerance architecture built on a receiver-based pessimistic protocol that behaves in a distributed way during both protection and recovery. This protocol is effective in recovery; however, it introduces more overhead than others in protection. In multicore clusters, the latency added to protect messages between processes executing on the same node is increased further because of the difference between intra-node and inter-node network bandwidth. When coordinated checkpointing is used to save the state of the processes in a node, the overhead is reduced because those internal communications are not logged. This paper proposes a semi-coordinated checkpoint protocol that combines the receiver-based pessimistic protocol with coordinated checkpointing. An overhead description is given to identify which message-passing parallel applications benefit from this alternative protocol. Experimental results using SPMD and MW applications compare the behavior of both protocols.

Keywords: Fault-tolerance; High-Availability; RADIC; message passing; socket

1. Introduction

High Performance Computing systems are evolving by adding multicore processors to their nodes. As a consequence, the probability of node failures increases, and fault tolerance solutions are used to ensure that a parallel application ends successfully in spite of such failures. However, the demand for more performance and availability makes it necessary to adapt fault tolerance strategies to multicore and manycore architectures. Fault tolerance rollback-recovery protocols were explained and classified by Elnozahy [1].
One of the most widely used approaches is coordinated checkpointing, although it does not scale well to a large number of processes because it usually has a high coordination cost and all processes have to roll back in case of failure. RADIC [2] [3], a fault tolerance architecture for parallel applications, was designed to be distributed so that it does not interfere with the scalability of the application being protected. As a general rule, a centralized component may add more than proportional overhead as the number of processors increases, which reduces the speedup and the scalability. RADIC adopted the receiver-based pessimistic rollback-recovery protocol combined with uncoordinated checkpointing because it fulfills the distributed requirement during both the protection and the recovery phases. This design was made for cases that use one process per node [2].

The guarantee of successfully ending an execution in spite of failures has a performance cost, or overhead. This cost has two parts: the overhead added during protection, also known as failure-free operation, and the overhead of the tasks of the recovery phase. The analysis of the cost of recovery in [4] concludes that the receiver-based pessimistic protocol presents the lowest overhead in recovery time; however, it is expensive in failure-free operation. The overhead added by receiver-based pessimistic protocols in protection is caused by the time needed to log each received message in stable storage. Consequently, the latency of a send is theoretically doubled, since at least two hops are needed: one to reach the receiver and a second to reach the stable storage located on a different node. Moreover, when the sender and the receiver are hosted on the same multicore node, the latency added to protect the message is dramatically increased due to the difference between intra-node and inter-node network bandwidth.
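The two-hop argument above can be made concrete with a small back-of-the-envelope sketch. It assumes a deliberately simple model in which a protected send costs one hop to the receiver plus one hop to stable storage; the function names and the latency values are hypothetical and chosen only for illustration, not taken from any measurement.

```python
def logged_send_latency(t_deliver: float, t_log: float) -> float:
    """Latency of a send whose message must also reach stable storage."""
    return t_deliver + t_log


def slowdown(t_deliver: float, t_log: float) -> float:
    """Ratio between the protected and the unprotected send latency."""
    return logged_send_latency(t_deliver, t_log) / t_deliver


# Illustrative values only (microseconds); they are not measurements.
T_INTER = 50.0  # assumed cost of one hop between two nodes
T_INTRA = 1.0   # assumed cost of one hop inside a node

# Sender and receiver on different nodes: two inter-node hops.
inter_node = slowdown(T_INTER, T_INTER)

# Sender and receiver on the same node: a cheap intra-node hop plus
# an inter-node hop to the stable storage on another node.
intra_node = slowdown(T_INTRA, T_INTER)

print(inter_node, intra_node)
```

Under these assumed values the protected inter-node send takes twice as long as the unprotected one, while the protected intra-node send takes 51 times as long, which is the asymmetry the paragraph describes.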
As a consequence, the performance drawback of receiver-based pessimistic message logging is even more noticeable in multicore systems. Processes executing on the same node, which we call a group, are related by failure probability [5]. Using coordinated checkpointing among the members of the group would save the cost of logging the messages exchanged among them. Nevertheless, coordination is required to avoid in-transit messages and to obtain a consistent recovery line free of orphan messages.

This paper presents a semi-coordinated protocol that minimizes the overhead added by the receiver-based pessimistic protocol during protection while keeping the distributed behavior. It consists of using coordinated checkpoints among the members of each group, combined with a receiver-based pessimistic message log for communications between processes hosted on different nodes.

The content of this paper is organized as follows. Section 2 covers related work. Section 3 describes the fully uncoordinated rollback-recovery protocol currently used in RADIC, designed at socket level [3] [6]. Section 4 explains how the semi-coordinated protocol is added to RADIC, obtaining a new model for protection and for recovery. The experimental evaluation is presented in Section 5, and lastly, we state the conclusions and the future work in Section 6.

2. Related Works

The combination of coordinated checkpointing with message logging has already been used in previous research. A correlated set coordination among processes executing on the same multicore node, combined with pessimistic message logging, is presented in [5]. In that work the processes in a node are coordinated, and the coordination is also combined with a pessimistic message log, but a sender-based one. It uses a different coordination protocol, and its validation and experiments were done with Open-MPI, while we use RADIC at socket level [3] [6].

The work in [7] proposes a hybrid protocol that combines coordinated with uncoordinated checkpointing. As it targets grid environments, the criteria used to group processes are based on the network and the communication pattern, which determine the kind of checkpoint to be taken. To obtain a globally consistent state for the group, Communication-Induced Checkpointing (CIC) is combined with pessimistic message logging. Using CIC for coordination might not be scalable for highly coupled processes, since the number of forced checkpoints grows uncontrollably.

Group-based coordinated checkpointing is presented in [8]. In this case no message log is used, so a complete coordination is needed for recovery. It is applied to MVAPICH2. A combination of coordinated checkpointing with message logging is proposed in [9] as a way to scale the most widespread strategy, the coordination of all the processes.
However, the criteria for grouping the processes are based on communication behavior: a trace is taken to support the creation of groups. Our approach uses the location of the processes to coordinate them as a single set, but the user can also configure a different checkpoint frequency for each process. In that case, the groups are formed with processes on the same node and with the same checkpoint frequency. With this configuration, the user can give a more accurate checkpoint interval for each group according to the communication pattern of the parallel application.

3. Fully Uncoordinated RADIC Model

This section explains the receiver-based pessimistic message log followed by RADIC. We begin with a brief description of the architecture and how it works at socket level. Then, the protocol is described by separating the procedure followed in protection from the one followed in recovery. In both cases we focus on describing the overheads of each step.

3.1 RADIC-based Message Passing Fault Tolerance System

RADIC has a distributed behavior in the protection, detection and recovery phases. It uses uncoordinated checkpointing and a receiver-based pessimistic message log. Critical data, such as the checkpoints and received messages of each parallel process, are stored on a different node from the one on which the process is running. This choice ensures that the execution completes if a minimum of three nodes are left operational after n non-simultaneous faults. Applying RADIC at the socket layer makes it possible to provide fault tolerance to parallel applications that use different kinds of message-passing libraries, which usually rely on the standard Socket API [10] to interconnect the processes. There are two components, both depicted in Figure 1:

Observer (Oi): this entity is responsible for monitoring the application's communications and masks possible errors generated by communication failures. In RADIC at socket level, the observer intercepts the send and recv functions to follow the message log protocol.
The state is saved periodically by checkpointing. The critical data for recovery, formed by received messages and checkpoints, are sent to the protector Ti-1. There is an observer Oi attached to each parallel process Pi.

Protector (Ti): there is one on each node, protecting the processes running on node Ni+1. It stores the critical data sent by the observers. In case of failure, the protector restarts the failed process using the last checkpoint. The protector detects node failures by sending heartbeats to its neighbors and by detecting socket errors.

Fig. 1: RADIC diagram. Each observer Oi sends the critical data to its protector Ti-1. Each protector Ti sends a heartbeat signal to Ti-1.

The observers use five types of sockets to keep their communications controlled and reliable, all depicted in Figure 1. First, the virtual socket is the id known by the process to communicate with a remote peer; it is drawn as the solid black arrows that connect P6 and P7 with their observers. Second, the real socket, represented by a solid yellow line, is the one actually connected with the peer, since the original connection could be broken after a checkpoint or a failure. Third, the control-ft socket, depicted with a blue dotted line, is an internal socket opened by the two observers involved in a communication to exchange control information during reconnections and message logging. Fourth, the dashed lines are RADIC sockets used between Oi and Ti-1. Lastly, the dotted black lines are sockets used by each protector Ti to report to Oi the state of Ti-1 in case of failure.

3.2 Receiver-based Pessimistic Protocol in Protection Phase

A receiver-based pessimistic rollback-recovery protocol makes it possible to recover the state of each process up to the point of failure. During protection it adds more overhead than other approaches, such as optimistic or causal ones, but it simplifies the recovery procedure because the effects of a failure are confined to the restarting processes [4] [1]. It is usually combined with uncoordinated checkpointing to decrease the rollback time in case of failure.

The receiver-based pessimistic rollback-recovery protocol assumes that all nondeterministic events are identified and that their corresponding determinants are logged to stable storage. Receiving a packet is considered a nondeterministic event that must be logged. This is handled by interposing the recv socket function and sending the received message to the protector afterwards. Pessimistic logging protocols are designed under the assumption that a failure can occur after any nondeterministic event in the computation. This assumption is pessimistic, since in reality failures are rare [1]. It implies that if an event has not been logged on stable storage, then no process can depend on it. For this reason, the sender of a message waits until the complete message has been saved in stable storage before continuing its operation. Once a received message is completely saved on stable storage, an acknowledgment is sent to the sender. Figure 2 shows how a message is handled from the moment the sender process generates it.
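The pessimistic rule described above, in which a send completes only after the receiver side has logged the message and acknowledged it, can be illustrated with a minimal in-memory sketch. This is not RADIC's implementation: the class and method names (Protector, ReceiverObserver, SenderObserver, on_message) are hypothetical, sockets are replaced by direct method calls, and the acknowledgment is modeled as a return value.

```python
from dataclasses import dataclass, field


@dataclass
class Protector:
    """Stands in for stable storage on another node (hypothetical API)."""
    log: list = field(default_factory=list)

    def save(self, seq: int, payload: bytes) -> None:
        self.log.append((seq, payload))


@dataclass
class ReceiverObserver:
    """Intercepts the receive side and logs before acknowledging."""
    protector: Protector
    delivered: list = field(default_factory=list)

    def on_message(self, seq: int, payload: bytes) -> int:
        # Log first: the event must be on stable storage before any
        # process is allowed to depend on it.
        self.protector.save(seq, payload)
        self.delivered.append(payload)
        return seq  # the acknowledgment carries the request number


@dataclass
class SenderObserver:
    """Intercepts the send side and waits for the acknowledgment."""
    peer: ReceiverObserver
    next_seq: int = 0

    def send(self, payload: bytes) -> int:
        seq = self.next_seq
        self.next_seq += 1
        # In this sketch the call returns only once the message is logged.
        ack = self.peer.on_message(seq, payload)
        if ack != seq:
            raise RuntimeError("acknowledgment does not match the request")
        return len(payload)


protector = Protector()
receiver = ReceiverObserver(protector)
sender = SenderObserver(receiver)
sent_bytes = sender.send(b"halo row 0")
```

The point of the sketch is the ordering inside on_message: the save happens before the acknowledgment is returned, so by the time send finishes the message is already recoverable from the protector.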
Each step adds an overhead whose name is prefixed with Ts- or Tr-, depending on whether it is related to the send or to the recv, respectively.

1) The send(X) operation is interposed by the sender observer Os, which sends a numbered acknowledgment request to Or using the control-ft socket. X is the length of the message. The overhead is named Ts-ack-req. The time used for sending the message, Ts-msg, is not considered overhead because it corresponds to the operation time performed by the process.
2) A recv(X1) operation is interposed by the receiver observer Or. X1 is the length of the expected message. According to the standard behavior of the recv socket function, when X1 is greater than the X actually available, a maximum of X bytes is delivered. Therefore, we consider that the length X1 is less than or equal to the X sent. Or receives the acknowledgment request in a time Tr-ack-req.
3) Or receives the X bytes from the real socket. The time used to receive the message, Tr-msg, is not an overhead because it corresponds to the read operation performed by the process. The message is sent to the protector to be saved, which takes a time Tr-save-msg(X).
4) Or sends the acknowledgment to Os using the control-ft socket, which adds Tr-send-ack. On the other side, Os receives the acknowledgment and the send(X) finishes. Ts-wait-ack is the overhead of this wait.
5) Lastly, only if X1 is less than X, a set of recv(Xi) operations is performed until X is completely read. In such cases, Or copies the next bytes from the X bytes received previously. This time is accounted for in Tr-msg(Xi).

Fig. 2: Receiver-based Pessimistic Protocol in Protection Phase. Virtual/real sockets: solid lines. Control-ft sockets: dotted lines. RADIC sockets: dashed lines.

3.3 Receiver-based Pessimistic Protocol in Recovery

When one of the nodes fails, the failure is detected by the protector, which restarts the processes that were running on the failed node using the last checkpoint.
The BLCR [11] library is used for uncoordinated checkpointing and for restarting each parallel process. The recovery procedure is carried out by the observer, which rolls the previous execution forward from the checkpoint to the point of failure. The saved messages are used in each re-execution of a recv function, since those messages are not going to be sent again. On the other hand, the send operations are skipped because they were done before the failure. However, as each send has an associated numbered acknowledgment request, the observers are able to detect and skip a duplicated message if it is re-sent. This functionality is useful because the protocol considers that the recovery procedure finishes when the last message in stable storage is processed by a recv; if the failure happened after a later send, that message would be re-sent, as the send is not considered part of the recovery.

Figure 3 depicts the recovery procedure and the overheads, prefixed with Trcv-, related to one of the virtual sockets, named i. The same procedure is repeated for every socket used by the parallel process.

1) Immediately after restarting from the checkpoint, the recovering state is detected by obtaining it from the BLCR library. Then, a connection with the local protector is established to query how many messages are pending to be re-processed for the virtual socket i. The value Qmsg(sv[i]) is returned. The overhead of this step is measured with Trcv-qmsg.
2) Every send(X)(j) operation is interposed and skipped to avoid re-sending messages. j is an integer greater than or equal to 0 that counts the send calls rolled forward on this virtual socket i, and X is the length of the message. The time is measured with Trcv-send(X)(j).
3) Every recv(Y)(k) operation is interposed and the message is requested from the local protector. The messages are delivered in FIFO order for each virtual socket i. The recovery procedure for the virtual socket i re-executes the recv operations for k from 0 to Qmsg(sv[i]). The overhead of each of them is Trcv-recv(Y)(k), where Y is the length of the message.
4) After re-executing Qmsg(sv[i]) recv functions, the virtual socket i is reconnected by establishing a real socket with the remote peer. At this point, the recovery procedure for this virtual socket is finished. The overhead is Trcv-re-conn.

Fig. 3: Receiver-based Pessimistic Protocol in Recovery Phase

4. Semi-Coordinated RADIC Model

Semi-coordinated checkpointing allows RADIC to provide an alternative rollback-recovery protocol that reduces overheads in multicore clusters. The performance drawback of the receiver-based pessimistic rollback-recovery protocol becomes even worse during failure-free operation for communication between members of a group. The latency added to log the message is dramatically increased due to the difference between intra-node and inter-node network bandwidth. However, coordination among the members of the group introduces an overhead of its own. Several algorithms have been proposed to coordinate checkpoints, such as the Chandy-Lamport algorithm [12] or blocking coordinated checkpointing [13]. The communications have to be silenced before checkpointing to avoid in-transit messages.

When multicore clusters are used to execute a parallel application, the processes running on the same node are related in case of a failure, since they must be restarted and re-executed up to the same point in time. As they are located on the same node, they are likely to have or need intensive or fast communication among them. Therefore, a coordinated checkpoint protocol is useful because it avoids logging messages for communications between members of the group. RADIC protectors know which processes are executing on their node, and the observers can also identify whether the peer process is located on the same node or not.

This section explains the whole strategy used for coordinating checkpoints among the members of groups, combined with receiver-based pessimistic protocols for communications between processes hosted on different nodes. First, the changes in the current RADIC model are stated. Second, the coordination protocol used among the processes running on the same node before checkpointing is presented. Lastly, the semi-coordinated checkpoint protocol is described both in protection and in recovery.

4.1 RADIC Model changes

The proper component to carry out the coordination among the members of a group, without adding tasks to the observers, is the RADIC local protector. A connection between each observer and its local protector was already established, but until now it was used only in case of a failure. Now it is also used to perform the coordination. Figure 4 represents a parallel application running on N nodes of a multicore cluster. Each node i has a group of M_Ni members.

Fig. 4: Coordinated Groups in Multicore Cluster

Each group is coordinated by its local protector Tx to silence internal communication. The receiver-based pessimistic protocol is kept for communications between members of different groups. By default, RADIC considers that the processes running on a node have the same checkpoint interval but, since this is a configuration value for each process, when different intervals are configured the groups are formed with the processes on the same node and with the same interval. In such cases, the receiver-based pessimistic protocol is used for communication between processes on the same node but with different checkpoint intervals. This configuration is useful when processes with different communication patterns run on the same node and their optimal checkpoint intervals differ widely. Moreover, as an extreme case, this functionality makes it possible to return to fully uncoordinated checkpointing by configuring a different checkpoint interval for each parallel process. To simplify the explanation, we consider that each group in a node has the same checkpoint interval.

In addition, when a node fails and no spare node is available, the RADIC recovery model establishes that the failed processes are recovered on the previous node, where the critical data was saved. In such cases, the protector Tx-1 has to be able to recognize automatically at least two groups: the first one, which is still running on node x-1, and the second one, which is being recovered. Although after a node failure the groups are on the same node, RADIC keeps these groups separate (not coordinated with each other) until the end of the execution, so that they can be moved if a spare node becomes available later.

4.2 Coordinated Checkpointing Protocol

The protocol for coordinating the processes is shown in Figure 5. Two entities are involved in this procedure. The first is the protector Tn running on one of the N nodes depicted in Figure 4. On this node n, there are 1 to M_n parallel processes to coordinate. The second entity is the observer Onm attached to each member of the group. When Tn detects that it is time for checkpointing, it sends a message to each observer. After receiving that message, the observer stops the communication activity at the beginning of the next send or recv function.
In this state, the coordination requirement of having no in-transit messages between processes of the group is met, because all send operations are completely finished and acknowledged. Each observer replies to its local protector Tn that it is ready for the checkpoint. Once all the members are ready, Tn calls BLCR to checkpoint each process. BLCR executes the callback function provided by each Onm and the checkpoint is performed. After finishing, Tn sends the checkpoint files to Tn-1.

Fig. 5: Coordinated Checkpointing Protocol

4.3 Semi-Coordinated Protocol in Protection

RADIC at socket level keeps an identification of the remote process for each virtual socket. This information is exchanged when the control-ft socket is established. The identification is formed by node-id, process-id and virtual-socket. A group-id is now incorporated to support this new model. Using this data, the observer is able to know whether the remote peer belongs to its group or not. The group-id is assigned at the beginning by the local protector Tx, when the first communication is established between them. By default, the value is 0.

Figure 6 represents the procedure used by the observers when sender and receiver belong to the same group. It differs from the procedure shown in Figure 2 in that the saving of the received message in stable storage is skipped. Consequently, both sender and receiver overheads are reduced by the time needed to log messages, because Or returns the acknowledgment immediately after reading the message.

Fig. 6: Protection protocol for Intra-Group Communications

In contrast, when the two peers belong to different groups, the observers still follow the protocol displayed in Figure 2. Although the overheads are reduced by not logging the messages exchanged within a group, the execution time would not be reduced when:

- The amount of data exchanged within each group is not considerable.
- The application is computation bound and the processes spend most of the time computing rather than waiting for communication results. As the communication is overlapped with the computation, less communication overhead does not mean less execution time.
- The overhead added by the coordination protocol is larger than the time saved by eliminating the logging of group messages.
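The decision described in Section 4.3 can be sketched as a small predicate over the peer identification. This is only an illustration under stated assumptions: the PeerId type and the function names are hypothetical, and treating two peers as members of the same group when both node-id and group-id match is an assumption of this sketch, since the text does not say whether a group-id is unique across nodes. The overhead terms reuse the receiver-side names from Section 3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerId:
    """Identification kept per virtual socket (hypothetical type)."""
    node_id: int
    process_id: int
    virtual_socket: int
    group_id: int = 0  # default value stated in the text


def same_group(local: PeerId, remote: PeerId) -> bool:
    # Assumption: a group is identified by the pair (node-id, group-id).
    return (local.node_id == remote.node_id
            and local.group_id == remote.group_id)


def must_log(local: PeerId, remote: PeerId) -> bool:
    """Messages are logged only when the peers are in different groups."""
    return not same_group(local, remote)


def receiver_overhead(local: PeerId, remote: PeerId, tr_ack_req: float,
                      tr_save_msg: float, tr_send_ack: float) -> float:
    """Receiver-side protection overhead for one message."""
    overhead = tr_ack_req + tr_send_ack
    if must_log(local, remote):
        overhead += tr_save_msg  # the term saved inside a group
    return overhead


p5 = PeerId(node_id=3, process_id=5, virtual_socket=0)
p6 = PeerId(node_id=3, process_id=6, virtual_socket=0)
p1 = PeerId(node_id=1, process_id=1, virtual_socket=0)
p7 = PeerId(node_id=3, process_id=7, virtual_socket=0, group_id=1)
```

With these hypothetical peers, p5 and p6 share a group and skip logging, while p5 and p1 (different nodes) and p5 and p7 (same node, different checkpoint interval and therefore different group) still follow the full logging protocol.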

4.4 Semi-Coordinated Protocol in Recovery

The semi-coordinated checkpoint protocol changes the recovery explained in Section 3.3 because, in case of a node failure, a whole group rolls forward simultaneously. The recovery protocol depicted in Figure 3 still applies to virtual sockets between processes that do not belong to the same group. In contrast, virtual sockets with members of the same group are reconnected at the beginning of the recovery process. The sends and receives between members of the group are re-executed, because the remote peer is also recovering and no messages were logged. There are no considerable overhead differences in recovery: the send and receive operations for group communications are now performed, instead of being skipped and read from storage, respectively.

5. Experimental Results

We test the fault tolerance system to compare the fully uncoordinated RADIC model with the semi-coordinated one. The experiments were executed on a cluster formed by 4 nodes, each with an Intel Core i5-650 processor and 6 GB of RAM, connected by Gigabit Ethernet. The OS used is Ubuntu Kernel server. We use a heat-transfer SPMD application and a Master/Worker sum of matrices, both based on TCP sockets, which follow different communication patterns. This allows us to observe how the different approaches behave in both cases.

There are three types of execution. The first, without fault tolerance, is labeled No FT. The second, failure-free to test the protection phase, is labeled Failure-Free. In the last one, named Recovery, a failure is injected in node N3 seconds after the first checkpoint. As there is no spare node available, the failed processes are recovered on node N2. The Failure-Free and Recovery executions were done with both the fully uncoordinated protocol and the semi-coordinated one. Checkpoints were taken only on processes executing on N3.
As each checkpoint closes and then reconnects the communications, taking checkpoints in other processes would disturb the experiments by adding overheads not related to the message logging protocols and the coordination.

The metric used to compare both protocols is the throughput per second. This metric shows how the overheads introduced by fault tolerance affect the work effectively done by each of the processes. Figure 7 compares the No FT, Failure-Free and Recovery executions of the SPMD process P5 located on node N3. With the fully uncoordinated protocol the throughput falls more than with the semi-coordinated protocol. Thanks to this advantage, the process needs only 2.37% more time than the execution without FT, whereas the fully uncoordinated protocol adds 9.48% to the execution time. Recovery using the semi-coordinated protocol also shows better performance, adding 27.49% against 52.13% in the uncoordinated case.

Fig. 7: SPMD Process P5 executions

In the MW application, a sum of a 1000x1000 float matrix is done. The master sends 11k to each worker to sum, and 11 bytes are returned to the master. Executions of a worker hosted by the failed node N3 are graphed in Figure 8, which shows that the uncoordinated protocol is slightly better than the semi-coordinated one, adding 4.49% and 33.23% in failure-free and recovery, respectively, to executions without FT. It can be seen that the overhead added by coordination increases the execution time in the semi-coordinated protocol while nothing is saved in message logging, because only node N1, which executes the master and two workers, has group communication.

Fig. 8: MW Worker Executions

To evaluate how these comparative results are related to the packet size in use, executions using different workloads were done. Figure 9 shows the SPMD execution time in failure-free and recovery. The heat-transfer application in this experiment is configured to communicate more intensively. It is observed that the uncoordinated protocol is better for small packet sizes. Usually, in those cases, the communication is overlapped with computation and the overhead of logging does not increase the execution time. Moreover, the coordination of the checkpointing of each group has an impact on it. The MW executions follow the same behavior; the results are displayed in Figure 10. As the packet size increases, the semi-coordinated protocol becomes a better option for both communication patterns tested.

Fig. 9: SPMD using different packet sizes

Fig. 10: MW using different packet sizes

6. Conclusions and Future Work

A semi-coordinated checkpoint protocol is added to the RADIC model as an alternative fault tolerance algorithm for parallel applications running on multicore clusters. The experiments show that this protocol makes it possible to decrease the overhead of fault tolerance. Applications with intensive or larger group communications are the target, since they are likely to obtain a better execution time by avoiding the logging of their intra-node messages. This implementation is at an early stage and contains several instrumentation points for taking times. We plan to make several optimizations and to extend this work to standard MPI. We are working on a set of experiments for a deeper comparative analysis between the semi-coordinated and uncoordinated checkpointing protocols, using varied packet sizes and communication patterns.

Acknowledgments

This research has been supported by the MICINN Spain under contract TIN , the MINECO (MICINN) Spain contract TIN , the European ITEA2 project H4H, No and the Avanza Competitividad I+D+I contract TSI

References

[1] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surv., vol. 34, no. 3, pp , September
[2] G. Santos, A. Duarte, D. Rexachs, and E. Luque, Providing non-stop service for message-passing based parallel applications with RADIC, ser. Lecture Notes in Computer Science, vol LNCS, 2008, pp
[3] M. Castro, D. Rexachs, and E. Luque, Transparent fault tolerance middleware at user level, in HPCS 12, 2012, pp
[4] S. Rao, L. Alvisi, and H. M. Vin, The cost of recovery in message logging protocols, IEEE Trans. on Knowl. and Data Eng., vol. 12, no. 2, pp , mar [Online]. Available:
[5] A. Bouteiller, T. Herault, G. Bosilca, and J. J. Dongarra, Correlated set coordination in fault tolerant message logging protocols, in Proceedings of the 17th international conference on Parallel processing - Volume Part II, ser. Euro-Par 11. Berlin, Heidelberg: Springer-Verlag, 2011, pp [Online]. Available:
[6] M. Castro, D. Rexachs, and E. Luque, RADIC-based message passing fault tolerance system, in ADVCOMP 2012, The Sixth International Conference on Advanced Engineering Computing and Applications in Sciences, 2012, pp
[7] Y. Luo and D. Manivannan, HOPE: A hybrid optimistic checkpointing and selective pessimistic message logging protocol for large scale distributed systems, Future Generation Computer Systems, vol. 28, no. 8, pp ,
[8] Q. Gao, W. Huang, M. J. Koop, and D. K. Panda, Group-based coordinated checkpointing for MPI: A case study on InfiniBand, in Parallel Processing, ICPP International Conference on, 2007, pp
[9] J. C. Y. Ho, C.-L. Wang, and F. C. M. Lau, Scalable group-based checkpoint/restart for large-scale message-passing systems, in Parallel and Distributed Processing, IPDPS IEEE International Symposium on, 2008, pp
[10] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman, The design and implementation of the 4.4BSD operating system. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc.,
[11] P. H. Hargrove and J. C. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol. 46, no. 1, p. 494,
[12] K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol. 3, pp ,
[13] C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello, Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI, in Proceedings of the 2006 ACM/IEEE conference on Supercomputing, ser. SC 06. New York, NY, USA: ACM, [Online]. Available:


More information

A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations Sébastien Monnet IRISA Sebastien.Monnet@irisa.fr Christine Morin IRISA/INRIA Christine.Morin@irisa.fr Ramamurthy Badrinath

More information

Efficiency Evaluation of the Input/Output System on Computer Clusters

Efficiency Evaluation of the Input/Output System on Computer Clusters Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de

More information

Novel Log Management for Sender-based Message Logging

Novel Log Management for Sender-based Message Logging Novel Log Management for Sender-based Message Logging JINHO AHN College of Natural Sciences, Kyonggi University Department of Computer Science San 94-6 Yiuidong, Yeongtonggu, Suwonsi Gyeonggido 443-760

More information

Correlated set coordination in fault tolerant message logging protocols for many-core clusters

Correlated set coordination in fault tolerant message logging protocols for many-core clusters CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 203; 25:572 585 Published online 2 July 202 in Wiley Online Library (wileyonlinelibrary.com)..2859 SPECIAL ISSUE

More information

Migration of tools and methodologies for performance prediction and efficient HPC on cloud environments: Results and conclusion *

Migration of tools and methodologies for performance prediction and efficient HPC on cloud environments: Results and conclusion * Migration of tools and methodologies for performance prediction and efficient HPC on cloud environments: Results and conclusion * Ronal Muresano, Alvaro Wong, Dolores Rexachs and Emilio Luque Computer

More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

Design and Implementation of a Novel Message Logging Protocol for OpenFOAM

Design and Implementation of a Novel Message Logging Protocol for OpenFOAM Design and Implementation of a Novel Message Logging Protocol for OpenFOAM Xunyun Liu, Xiaoguang Ren, Yuhua Tang and Xinhai Xu State Key Laboratory of High Performance Computing National University of

More information

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

More information

Technical Comparison between several representative checkpoint/rollback solutions for MPI programs

Technical Comparison between several representative checkpoint/rollback solutions for MPI programs Technical Comparison between several representative checkpoint/rollback solutions for MPI programs Yuan Tang Innovative Computing Laboratory Department of Computer Science University of Tennessee Knoxville,

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone:

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone: Some Thoughts on Distributed Recovery (preliminary version) Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: 409-845-0512 Fax: 409-847-8578 E-mail:

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and

More information

A Consensus-based Fault-Tolerant Event Logger for High Performance Applications

A Consensus-based Fault-Tolerant Event Logger for High Performance Applications A Consensus-based Fault-Tolerant Event Logger for High Performance Applications Edson Tavares de Camargo and Elias P. Duarte Jr. and Fernando Pedone Federal University of Paraná (UFPR), Department of Informatics,

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS Ruchi Tuli 1 & Parveen Kumar 2 1 Research Scholar, Singhania University, Pacheri Bari (Rajasthan) India 2 Professor, Meerut Institute

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI

MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI 1 MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI Aurélien Bouteiller, Thomas Herault, Géraud Krawezik, Pierre Lemarinier, Franck Cappello INRIA/LRI, Université Paris-Sud, Orsay, France {

More information

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, Franck Cappello INRIA Saclay-Île de France, F-91893

More information

Rollback-Recovery p Σ Σ

Rollback-Recovery p Σ Σ Uncoordinated Checkpointing Rollback-Recovery p Σ Σ Easy to understand No synchronization overhead Flexible can choose when to checkpoint To recover from a crash: go back to last checkpoint restart m 8

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint?

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint? What is Checkpoint libraries Bosilca George bosilca@cs.utk.edu Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine. interruption

More information

Efficient Shared Memory Message Passing for Inter-VM Communications

Efficient Shared Memory Message Passing for Inter-VM Communications Efficient Shared Memory Message Passing for Inter-VM Communications François Diakhaté 1, Marc Perache 1,RaymondNamyst 2, and Herve Jourdren 1 1 CEA DAM Ile de France 2 University of Bordeaux Abstract.

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

Hunting for Bindings in Distributed Object-Oriented Systems

Hunting for Bindings in Distributed Object-Oriented Systems Hunting for Bindings in Distributed Object-Oriented Systems Magdalena S lawiñska Faculty of Electronics, Telecommunications and Informatics Gdańsk University of Technology Narutowicza 11/12, 80-952 Gdańsk,

More information

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

A Survey of Rollback-Recovery Protocols in Message-Passing Systems A Survey of Rollback-Recovery Protocols in Message-Passing Systems Mootaz Elnozahy * Lorenzo Alvisi Yi-Min Wang David B. Johnson June 1999 CMU-CS-99-148 (A revision of CMU-CS-96-181) School of Computer

More information

Similarities and Differences Between Parallel Systems and Distributed Systems

Similarities and Differences Between Parallel Systems and Distributed Systems Similarities and Differences Between Parallel Systems and Distributed Systems Pulasthi Wickramasinghe, Geoffrey Fox School of Informatics and Computing,Indiana University, Bloomington, IN 47408, USA In

More information

AN EFFICIENT ALGORITHM IN FAULT TOLERANCE FOR ELECTING COORDINATOR IN DISTRIBUTED SYSTEMS

AN EFFICIENT ALGORITHM IN FAULT TOLERANCE FOR ELECTING COORDINATOR IN DISTRIBUTED SYSTEMS International Journal of Computer Engineering & Technology (IJCET) Volume 6, Issue 11, Nov 2015, pp. 46-53, Article ID: IJCET_06_11_005 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=11

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications Amina Guermouche, Thomas Ropars, Marc Snir, Franck Cappello To cite this version: Amina Guermouche,

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

An Improvement of TCP Downstream Between Heterogeneous Terminals in an Infrastructure Network

An Improvement of TCP Downstream Between Heterogeneous Terminals in an Infrastructure Network An Improvement of TCP Downstream Between Heterogeneous Terminals in an Infrastructure Network Yong-Hyun Kim, Ji-Hong Kim, Youn-Sik Hong, and Ki-Young Lee University of Incheon, 177 Dowha-dong Nam-gu, 402-749,

More information

Proactive Process-Level Live Migration in HPC Environments

Proactive Process-Level Live Migration in HPC Environments Proactive Process-Level Live Migration in HPC Environments Chao Wang, Frank Mueller North Carolina State University Christian Engelmann, Stephen L. Scott Oak Ridge National Laboratory SC 08 Nov. 20 Austin,

More information

Increasing Reliability through Dynamic Virtual Clustering

Increasing Reliability through Dynamic Virtual Clustering Increasing Reliability through Dynamic Virtual Clustering Wesley Emeneker, Dan Stanzione High Performance Computing Initiative Ira A. Fulton School of Engineering Arizona State University Wesley.Emeneker@asu.edu,

More information

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems fastos.org/molar Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems Jyothish Varma 1, Chao Wang 1, Frank Mueller 1, Christian Engelmann, Stephen L. Scott 1 North Carolina State University,

More information

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, Franck Cappello To cite this version: Amina Guermouche,

More information

Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI

Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI Lemarinier Pierre, Bouteiller Aurelien, Herault Thomas, Krawezik Geraud, Cappello Franck To cite this version: Lemarinier

More information

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Sivakumar Harinath 1, Robert L. Grossman 1, K. Bernhard Schiefer 2, Xun Xue 2, and Sadique Syed 2 1 Laboratory of

More information

Exploiting Redundant Computation in Communication-Avoiding Algorithms for Algorithm-Based Fault Tolerance

Exploiting Redundant Computation in Communication-Avoiding Algorithms for Algorithm-Based Fault Tolerance 1 Exploiting edundant Computation in Communication-Avoiding Algorithms for Algorithm-Based Fault Tolerance arxiv:1511.00212v1 [cs.dc] 1 Nov 2015 Abstract Communication-avoiding algorithms allow redundant

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System

Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System Taesoon Park 1,IlsooByun 1, and Heon Y. Yeom 2 1 Department of Computer Engineering, Sejong University, Seoul

More information

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI Joshua Hursey 1, Jeffrey M. Squyres 2, Timothy I. Mattox 1, Andrew Lumsdaine 1 1 Indiana University 2 Cisco Systems,

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Live Virtual Machine Migration with Efficient Working Set Prediction

Live Virtual Machine Migration with Efficient Working Set Prediction 2011 International Conference on Network and Electronics Engineering IPCSIT vol.11 (2011) (2011) IACSIT Press, Singapore Live Virtual Machine Migration with Efficient Working Set Prediction Ei Phyu Zaw

More information

PREDICTING COMMUNICATION PERFORMANCE

PREDICTING COMMUNICATION PERFORMANCE PREDICTING COMMUNICATION PERFORMANCE Nikhil Jain CASC Seminar, LLNL This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract

More information

Scalable and Fault Tolerant Failure Detection and Consensus

Scalable and Fault Tolerant Failure Detection and Consensus EuroMPI'15, Bordeaux, France, September 21-23, 2015 Scalable and Fault Tolerant Failure Detection and Consensus Amogh Katti, Giuseppe Di Fatta, University of Reading, UK Thomas Naughton, Christian Engelmann

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University MVAPICH Users Group 2016 Kapil Arya Checkpointing with DMTCP and MVAPICH2 for Supercomputing Kapil Arya Mesosphere, Inc. & Northeastern University DMTCP Developer Apache Mesos Committer kapil@mesosphere.io

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

From eventual to strong consistency. Primary-Backup Replication. Primary-Backup Replication. Replication State Machines via Primary-Backup

From eventual to strong consistency. Primary-Backup Replication. Primary-Backup Replication. Replication State Machines via Primary-Backup From eventual to strong consistency Replication s via - Eventual consistency Multi-master: Any node can accept operation Asynchronously, nodes synchronize state COS 418: Distributed Systems Lecture 10

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 17 - Checkpointing II Chapter 6 - Checkpointing Part.17.1 Coordinated Checkpointing Uncoordinated checkpointing may lead

More information

A Behavior Based File Checkpointing Strategy

A Behavior Based File Checkpointing Strategy Behavior Based File Checkpointing Strategy Yifan Zhou Instructor: Yong Wu Wuxi Big Bridge cademy Wuxi, China 1 Behavior Based File Checkpointing Strategy Yifan Zhou Wuxi Big Bridge cademy Wuxi, China bstract

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload

Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload Summary As today s corporations process more and more data, the business ramifications of faster and more resilient database

More information

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra Today CSCI 5105 Recovery CAP Theorem Instructor: Abhishek Chandra 2 Recovery Operations to be performed to move from an erroneous state to an error-free state Backward recovery: Go back to a previous correct

More information

Fault tolerance techniques for high-performance computing

Fault tolerance techniques for high-performance computing Fault tolerance techniques for high-performance computing Jack Dongarra 1,2,3, Thomas Herault 1 & Yves Robert 1,4 1. University of Tennessee Knoxville, USA 2. Oak Ride National Laboratory, USA 3. University

More information

TCP CONGESTION WINDOW CONTROL ON AN ISCSI READ ACCESS IN A LONG-LATENCY ENVIRONMENT

TCP CONGESTION WINDOW CONTROL ON AN ISCSI READ ACCESS IN A LONG-LATENCY ENVIRONMENT TCP CONGESTION WINDOW CONTROL ON AN ISCSI READ ACCESS IN A LONG-LATENCY ENVIRONMENT Machiko Toyoda Saneyasu Yamaguchi Masato Oguchi Ochanomizu University Otsuka 2-1-1, Bunkyo-ku, Tokyo, Japan Institute

More information

Checkpointing and Rollback Recovery in Distributed Systems: Existing Solutions, Open Issues and Proposed Solutions

Checkpointing and Rollback Recovery in Distributed Systems: Existing Solutions, Open Issues and Proposed Solutions Checkpointing and Rollback Recovery in Distributed Systems: Existing Solutions, Open Issues and Proposed Solutions D. Manivannan Department of Computer Science University of Kentucky Lexington, KY 40506

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols George Bosilca, Aurelien Bouteiller, Thomas Herault 2, Pierre Lemarinier, and Jack Dongarra 3 University of Tennessee 2 University

More information

MATRIX-VECTOR MULTIPLICATION ALGORITHM BEHAVIOR IN THE CLOUD

MATRIX-VECTOR MULTIPLICATION ALGORITHM BEHAVIOR IN THE CLOUD ICIT 2013 The 6 th International Conference on Information Technology MATRIX-VECTOR MULTIPLICATIO ALGORITHM BEHAVIOR I THE CLOUD Sasko Ristov, Marjan Gusev and Goran Velkoski Ss. Cyril and Methodius University,

More information

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues By Narjess Ayari, Denis Barbaron, Laurent Lefevre and Pascale primet Presented by Mingyu Liu Outlines 1.Introduction

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego Engineering Fault-Tolerant TCP/IP servers using FT-TCP Dmitrii Zagorodnov University of California San Diego Motivation Reliable network services are desirable but costly! Extra and/or specialized hardware

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Avida Checkpoint/Restart Implementation

Avida Checkpoint/Restart Implementation Avida Checkpoint/Restart Implementation Nilab Mohammad Mousa: McNair Scholar Dirk Colbry, Ph.D.: Mentor Computer Science Abstract As high performance computing centers (HPCC) continue to grow in popularity,

More information

DMTCP: Fixing the Single Point of Failure of the ROS Master

DMTCP: Fixing the Single Point of Failure of the ROS Master DMTCP: Fixing the Single Point of Failure of the ROS Master Tw i n k l e J a i n j a i n. t @ h u s k y. n e u. e d u G e n e C o o p e r m a n g e n e @ c c s. n e u. e d u C o l l e g e o f C o m p u

More information

Alleviating Scalability Issues of Checkpointing

Alleviating Scalability Issues of Checkpointing Rolf Riesen, Kurt Ferreira, Dilma Da Silva, Pierre Lemarinier, Dorian Arnold, Patrick G. Bridges 13 November 2012 Alleviating Scalability Issues of Checkpointing Protocols Overview 2 3 Motivation: scaling

More information

Measuring TCP bandwidth on top of a Gigabit and Myrinet network

Measuring TCP bandwidth on top of a Gigabit and Myrinet network Measuring TCP bandwidth on top of a Gigabit and Myrinet network Juan J. Costa, Javier Bueno Hedo, Xavier Martorell and Toni Cortes {jcosta,jbueno,xavim,toni}@ac.upc.edu December 7, 9 Abstract In this article

More information
