Adding semi-coordinated checkpoints to RADIC in Multicore clusters

Marcela Castro, Dolores Rexachs, and Emilio Luque
Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Barcelona, Spain

Abstract

Fault tolerance strategies must be adapted to current High Performance Computing systems with their growing number of processors. RADIC is a fault tolerance architecture based on a receiver-based pessimistic protocol that follows a distributed approach for both protection and recovery. This protocol is effective during recovery, but it introduces more overhead than other approaches during protection. In multicore clusters, the latency added to protect messages exchanged between processes running on the same node is magnified by the gap between intra-node and inter-node network bandwidth. When coordinated checkpointing is used to save the state of the processes on a node, that overhead is reduced because those internal communications no longer need to be logged. This paper proposes a semi-coordinated checkpoint protocol that combines the receiver-based pessimistic protocol with coordinated checkpointing. An overhead analysis is presented to determine which message-passing parallel applications benefit from this alternative protocol. Experimental results with SPMD and Master/Worker applications compare the behavior of both protocols.

Keywords: Fault tolerance; High availability; RADIC; message passing; sockets

1. Introduction

High Performance Computing systems are evolving by adding more cores to their nodes. As a consequence, the probability of node failures increases, and fault tolerance solutions are used to ensure that a parallel application finishes successfully in spite of such failures. At the same time, the demand for more performance and availability drives the adaptation of fault tolerance strategies to multicore and manycore architectures. Rollback-recovery fault tolerance protocols were surveyed and classified by Elnozahy et al. [1]. One of the most widely used approaches is coordinated checkpointing, although it does not scale well to a large number of processes: coordination is costly, and in case of failure all processes must roll back.

RADIC [2] [3], a fault tolerance architecture for parallel applications, was designed to be distributed so as not to interfere with the scalability of the application being protected. As a general rule, a centralized component may add a more-than-proportional overhead as the number of processors grows, reducing speedup and scalability. RADIC adopted the receiver-based pessimistic rollback-recovery protocol combined with uncoordinated checkpointing because it satisfies this distributed requirement in both the protection and the recovery phases. This design targeted configurations with one process per node [2].

The guarantee of completing an execution in spite of failures has a performance cost, or overhead, with two parts: the overhead added during protection, also known as failure-free operation, and the overhead of the tasks of the recovery phase. The analysis of the cost of recovery in [4] concludes that the receiver-based pessimistic protocol has the lowest recovery overhead but is expensive during failure-free operation. The protection overhead of receiver-based pessimistic protocols is caused by the time spent logging each received message to stable storage.
Consequently, the latency of a send is, in theory, doubled, since at least two hops are needed: one to reach the receiver and a second to reach the stable storage located on a different node. Moreover, when sender and receiver are hosted on the same multicore node, the latency added to protect the message grows dramatically because of the difference between intra-node and inter-node network bandwidth. The performance drawback of receiver-based pessimistic message logging is therefore even more noticeable in multicore systems.

Processes executing on the same node, which we call a group, are related by failure probability [5]. Using coordinated checkpointing among the members of a group saves the cost of logging the messages they exchange with each other. Nevertheless, coordination is required to avoid in-transit messages and to obtain a consistent recovery line free of orphan messages.

This paper presents a semi-coordinated protocol that minimizes the overhead added by the receiver-based pessimistic protocol during protection while keeping its distributed behavior. It uses coordinated checkpoints among the members of each group, combined with receiver-based pessimistic message logging for communications between processes hosted on different nodes.
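The intuition behind this overhead can be summarized with a simplified latency model; the notation below is ours and does not appear in the RADIC papers. Let T_deliver(X) be the time to deliver an X-byte message to the receiver and T_log(X) the time to forward it to the protector on another node:

\[
T_{\mathrm{send}}^{\mathrm{logged}}(X) \;\approx\; T_{\mathrm{deliver}}(X) + T_{\mathrm{log}}(X),
\qquad
\frac{T_{\mathrm{log}}(X)}{T_{\mathrm{deliver}}(X)} \;\approx\;
\begin{cases}
1 & \text{sender and receiver on different nodes,}\\
T_{\mathrm{inter}}(X)/T_{\mathrm{intra}}(X) \gg 1 & \text{sender and receiver on the same node.}
\end{cases}
\]

In the inter-node case the logged send roughly doubles the latency; in the intra-node case the relative overhead grows with the gap between intra-node and inter-node bandwidth, which is precisely the cost the semi-coordinated protocol removes.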

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the fully uncoordinated rollback-recovery protocol currently used in RADIC, designed at socket level [3] [6]. Section 4 explains how the semi-coordinated protocol is added to RADIC, yielding a new model for protection and recovery. The experimental evaluation is presented in Section 5, and Section 6 states the conclusions and future work.

2. Related Works

The combination of coordinated checkpointing with message logging has been explored in previous research. A correlated-set coordination among processes executing on the same multicore node, combined with pessimistic message logging, is presented in [5]. In that work the processes of a node are coordinated and combined with a pessimistic message log, but a sender-based one; the coordination protocol is different, and the validation and experiments were done with Open MPI, whereas we use RADIC at socket level [3] [6].

The work in [7] proposes a hybrid protocol combining coordinated with uncoordinated checkpointing. As it targets grid environments, the criterion used to group processes is based on the network and the communication pattern, which determine the kind of checkpoint to be taken. To obtain a globally consistent state for each group, Communication-Induced Checkpointing (CIC) is combined with pessimistic message logging. Using CIC for coordination may not scale for tightly coupled processes, since the number of forced checkpoints can grow uncontrollably.

Group-based coordinated checkpointing is presented in [8]. In that case no message log is used, so full coordination is needed for recovery; the approach is applied to MVAPICH2. A combination of coordinated checkpointing with message logging is proposed in [9] as a way to scale the most widespread strategy of coordinating all processes. However, the criterion for grouping processes is based on communication behavior, and a trace is used to support the creation of the groups.

Our approach uses the location of the processes to coordinate them as a single set, but the user can also configure a different checkpoint frequency for each process. In that case, groups are formed by the processes that run on the same node and share the same checkpoint frequency. With this configuration, the user can set a more accurate checkpoint interval for each group according to the communication pattern of the parallel application.

3. Fully Uncoordinated RADIC Model

This section explains the receiver-based pessimistic message logging used by RADIC. We begin with a brief description of the architecture and how it works at socket level. Then the protocol is described, separating the procedure followed during protection from the one followed during recovery. In both cases we focus on the overhead of each step.

3.1 RADIC-based Message Passing Fault Tolerance System

RADIC has a distributed behavior in the protection, detection and recovery phases. It uses uncoordinated checkpointing and receiver-based pessimistic message logging. Critical data, namely the checkpoints and received messages of each parallel process, are stored on a node different from the one on which the process is running. This choice guarantees that the execution completes as long as a minimum of three nodes remain operational after n non-simultaneous faults.
Applying RADIC at the socket layer allows fault tolerance to be provided to parallel applications built on different message-passing libraries, which usually rely on the standard Socket API [10] to interconnect their processes. Two components, also depicted in Figure 1, are involved:

- Observer (Oi): this entity monitors the application's communications and masks errors generated by communication failures. In RADIC at socket level, the observer intercepts the send and recv functions to apply the message logging protocol. The process state is saved periodically by checkpointing, and the critical data needed for recovery, formed by received messages and checkpoints, are sent to the protector Ti-1. There is an observer Oi attached to each parallel process Pi.

- Protector (Ti): there is one on each node, protecting the processes running on node Ni+1. It stores the critical data sent by the observers. In case of failure, the protector restarts the failed processes from their last checkpoints. Protectors detect node failures by exchanging heartbeats with their neighbors and by detecting socket errors.

Fig. 1: RADIC diagram. Each observer Oi sends its critical data to its protector Ti-1, and each protector Ti sends a heartbeat signal to Ti-1.

The observers use five types of sockets, also depicted in Figure 1, to control their communications and keep them reliable. First, the virtual socket is the identifier the process knows to communicate with a remote peer; these are the solid black arrows connecting P6 and P7 with their observers. Second, the real socket, represented by a solid yellow line, is the one actually connected to the peer, since the original connection may be broken after a checkpoint or a failure. Third, the control-ft socket, depicted with a blue dotted line, is an internal socket opened by the two observers involved in a communication to exchange control information during re-connections and message logging. Then, dashed lines are the RADIC sockets used between Oi and Ti-1, and lastly, dotted black lines are used by each protector Ti to report to Oi the state of Ti-1 in case of failure.
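The cited works [3] [6] describe the observer as a user-level layer that intercepts send and recv, without detailing the interception mechanism. The following sketch only illustrates one common way this could be done on Linux, via LD_PRELOAD and dlsym(RTLD_NEXT); the observer_* hooks and file names are hypothetical placeholders for the protocol of the next subsection, not RADIC's actual code.

/* Illustrative sketch: one possible user-level interposition of the standard
 * socket calls; not the actual RADIC implementation.
 * Build (Linux/glibc): gcc -shared -fPIC -o libobserver.so observer.c -ldl
 * Run:                 LD_PRELOAD=./libobserver.so ./parallel_app            */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>
#include <sys/types.h>

typedef ssize_t (*send_fn)(int, const void *, size_t, int);
typedef ssize_t (*recv_fn)(int, void *, size_t, int);

/* Hypothetical observer hooks; the work of the logging protocol (Section 3.2)
 * would happen inside them.                                                  */
static void observer_before_send(int fd, size_t len) { (void)fd; (void)len; }
static void observer_after_send(int fd)              { (void)fd; }
static void observer_after_recv(int fd, const void *buf, ssize_t n)
{ (void)fd; (void)buf; (void)n; }

ssize_t send(int sockfd, const void *buf, size_t len, int flags)
{
    send_fn real_send = (send_fn)dlsym(RTLD_NEXT, "send");
    observer_before_send(sockfd, len);       /* e.g. emit the ack requirement */
    ssize_t n = real_send(sockfd, buf, len, flags);
    observer_after_send(sockfd);             /* e.g. wait for the receiver ack */
    return n;
}

ssize_t recv(int sockfd, void *buf, size_t len, int flags)
{
    recv_fn real_recv = (recv_fn)dlsym(RTLD_NEXT, "recv");
    ssize_t n = real_recv(sockfd, buf, len, flags);
    if (n > 0)
        observer_after_recv(sockfd, buf, n); /* e.g. log to the protector, ack */
    return n;
}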

3.2 Receiver-based Pessimistic Protocol in Protection Phase

A receiver-based pessimistic rollback-recovery protocol makes it possible to recover the state of each process up to the point of failure. It adds more overhead than optimistic or causal approaches during protection, but it simplifies the recovery procedure because the effects of a failure are confined to the restarting processes [4] [1]. It is usually combined with uncoordinated checkpointing to reduce the rollback time in case of failure.

A receiver-based pessimistic protocol assumes that all nondeterministic events are identified and that their determinants are logged to stable storage. Receiving a packet is considered a nondeterministic event to log; this is handled by interposing the recv socket function and forwarding the received message to the protector afterwards. Pessimistic logging protocols are designed under the assumption that a failure can occur after any nondeterministic event in the computation. This assumption is pessimistic because, in reality, failures are rare [1]; it stipulates that if an event has not been logged to stable storage, no process may depend on it. For that reason, the sender of a message waits until the complete message has been saved to stable storage before continuing. Once a received message is completely saved, an acknowledgment is sent back to the sender.

Figure 2 shows how a message is handled from the moment it is generated by the sender process. Each step adds an overhead, named with the prefix Ts- or Tr- depending on whether it belongs to the send or to the recv side:

1) The send(X) operation is interposed by the sender observer Os, which sends a numbered acknowledgment requirement to Or over the control-ft socket; X is the length of the message. This overhead is named Ts-ack-req. The time used to send the message itself, Ts-msg, is not considered overhead because it corresponds to the operation performed by the process.

2) The recv(X1) operation is interposed by the receiver observer Or; X1 is the length of the expected message. According to the standard recv socket function, when X1 is greater than the X actually available, at most X bytes are delivered; we therefore assume that X1 is less than or equal to the X sent. Or receives the acknowledgment requirement in time Tr-ack-req.

3) Or receives the X bytes from the real socket. The time to receive the message, Tr-msg, is not overhead, since it corresponds to the read operation performed by the process. The message is then sent to the protector to be saved, taking Tr-save-msg(X).

4) Or sends the acknowledgment to Os over the control-ft socket, adding Tr-send-ack. On the other peer, Os receives the acknowledgment and the send(X) finishes; Ts-wait-ack is the overhead of this wait.

5) Lastly, only if X1 is less than X, a series of recv(Xi) calls is performed until X has been completely read. In such cases, Or copies the next bytes from the X bytes previously received; this time is accounted for in Tr-msg(Xi).
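To make these steps concrete, the following sketch expresses the exchange with plain POSIX sockets. The ack_req layout, the descriptor names (ctrl_fd for the control-ft socket, data_fd for the real socket, prot_fd for the RADIC socket to the protector) and the function names are illustrative assumptions of this sketch, not the actual RADIC implementation.

#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

struct ack_req { uint32_t seq; uint32_t len; };    /* numbered ack requirement */

/* Sender-side observer Os wrapping send(X). */
ssize_t os_send(int ctrl_fd, int data_fd, uint32_t seq, const void *buf, uint32_t x)
{
    struct ack_req req = { seq, x };
    send(ctrl_fd, &req, sizeof req, 0);             /* 1) Ts-ack-req            */
    ssize_t n = send(data_fd, buf, x, 0);           /*    Ts-msg (not overhead) */
    uint32_t ack;
    recv(ctrl_fd, &ack, sizeof ack, MSG_WAITALL);   /* 4) Ts-wait-ack           */
    return n;
}

/* Receiver-side observer Or wrapping recv(X1). This sketch assumes the caller's
 * buffer can hold the whole message; the real observer buffers the tail and
 * serves it to later recv(Xi) calls (step 5).                                  */
ssize_t or_recv(int ctrl_fd, int data_fd, int prot_fd, void *buf, uint32_t x1)
{
    struct ack_req req;
    recv(ctrl_fd, &req, sizeof req, MSG_WAITALL);   /* 2) Tr-ack-req            */
    uint32_t x = (req.len <= x1) ? req.len : x1;
    ssize_t n = recv(data_fd, buf, x, MSG_WAITALL); /* 3) Tr-msg (not overhead) */
    if (n > 0)
        send(prot_fd, buf, (size_t)n, 0);           /*    Tr-save-msg(X)        */
    send(ctrl_fd, &req.seq, sizeof req.seq, 0);     /* 4) Tr-send-ack           */
    return n;
}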
Fig. 2: Receiver-based pessimistic protocol in the protection phase. Virtual/real sockets: solid lines; control-ft sockets: dotted lines; RADIC sockets: dashed lines.

3.3 Receiver-based Pessimistic Protocol in Recovery

When a node fails, the failure is detected by its protector, which restarts the processes that were running on the failed node from their last checkpoints. The BLCR library [11] is used to take the uncoordinated checkpoints and to restart each parallel process. The recovery procedure is carried out by the observer, which rolls the previous execution forward from the checkpoint to the point of failure. The saved messages are consumed by each re-executed recv function, since the original senders are not going to send them again. On the other hand, send operations are skipped, because they were already performed before the failure. Nevertheless, since each send carries a numbered acknowledgment requirement, the observers are able to detect and skip a duplicated message if it is re-sent. This is needed because the protocol considers recovery finished when the last message in stable storage has been processed by a recv; if the failure happened after a send, that send is not considered part of the recovery and would be re-executed.

Figure 3 depicts the recovery procedure and its overheads, prefixed with Trcv-, for one virtual socket, named i; the same procedure is repeated for every socket used by the parallel process.

1) Immediately after restarting from the checkpoint, the recovering state is detected by querying the BLCR library. A connection with the local protector is then established to ask how many messages are pending to be re-processed for the virtual socket i; the value Qmsg(sv[i]) is returned. The overhead of this step is measured as Trcv-qmsg.

2) Every send(X)(j) operation is interposed and skipped to avoid re-sending messages, where j is an integer greater than or equal to 0 counting the send operations rolled forward on this virtual socket i, and X is the length of the message. The time is measured as Trcv-send(X)(j).

3) Every recv(Y)(k) operation is interposed and the message is requested from the local protector. Messages are delivered in FIFO order for each virtual socket i. The recovery procedure for virtual socket i re-executes the k-th recv operation, with k ranging from 0 to Qmsg(sv[i]); Y is the length of the message, and the overhead of each replay is Trcv-recv(Y)(k).

4) After re-executing Qmsg(sv[i]) recv functions, the virtual socket i is reconnected by establishing a real socket with the remote peer. At this point, the recovery procedure for this virtual socket is finished. The overhead is Trcv-re-conn.

Fig. 3: Receiver-based pessimistic protocol in the recovery phase.
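The per-socket replay can be sketched as follows. The structure, the wire format used with the protector (a count and a length prefix) and the helper that re-establishes the real socket are assumptions made for illustration, not RADIC's actual code.

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

struct vsock {
    int      prot_fd;     /* RADIC socket to the local protector              */
    int      real_fd;     /* real socket, re-established when replay finishes */
    uint32_t pending;     /* Qmsg(sv[i]): logged receives left to replay      */
    int      recovering;
};

static void reconnect_real_socket(struct vsock *vs)
{
    /* Trcv-re-conn: re-establish the real socket with the remote peer
     * (connect() plus the control-ft handshake, omitted in this sketch).     */
    (void)vs;
}

/* Step 1: right after restarting, ask the protector how many logged messages
 * this virtual socket must replay (Trcv-qmsg). The reply format is assumed.  */
void recovery_init(struct vsock *vs)
{
    recv(vs->prot_fd, &vs->pending, sizeof vs->pending, MSG_WAITALL);
    vs->recovering = (vs->pending > 0);
    if (!vs->recovering)
        reconnect_real_socket(vs);
}

/* Step 2: sends are skipped while rolling forward (Trcv-send); the original
 * messages already reached their destinations before the failure.            */
ssize_t recovery_send(struct vsock *vs, const void *buf, size_t len)
{
    (void)buf;
    return vs->recovering ? (ssize_t)len : -1;   /* -1: take the normal path  */
}

/* Steps 3 and 4: receives are served in FIFO order from the protector's log
 * (Trcv-recv); after the last one the real socket is reconnected.            */
ssize_t recovery_recv(struct vsock *vs, void *buf, size_t len)
{
    if (!vs->recovering)
        return -1;                               /* take the normal path      */
    uint32_t mlen;                               /* assumed length prefix     */
    recv(vs->prot_fd, &mlen, sizeof mlen, MSG_WAITALL);
    if (mlen > len)
        mlen = (uint32_t)len;                    /* sketch: no tail buffering */
    ssize_t n = recv(vs->prot_fd, buf, mlen, MSG_WAITALL);
    if (--vs->pending == 0) {
        reconnect_real_socket(vs);               /* Trcv-re-conn              */
        vs->recovering = 0;
    }
    return n;
}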
4. Semi-Coordinated RADIC Model

Semi-coordinated checkpointing gives RADIC an alternative rollback-recovery protocol that reduces overhead in multicore clusters. The performance drawback of the receiver-based pessimistic protocol becomes even worse during failure-free operation for communications between members of a group: the latency added to log each message grows dramatically because of the difference between intra-node and inter-node network bandwidth. On the other hand, coordinating the members of a group introduces its own overhead. Several algorithms have been proposed to coordinate checkpoints, such as the Chandy-Lamport algorithm [12] or blocking coordinated checkpointing [13]; in all of them, communications have to be silenced before checkpointing to avoid in-transit messages.

When parallel applications run on multicore clusters, the processes on the same node are related in case of failure, since they must be restarted and re-executed up to the same point in time. Being co-located, they are also likely to have intensive or fast communication among them. A coordinated checkpoint protocol is therefore useful because it avoids logging the messages exchanged between members of the group. RADIC protectors know which processes are executing on their node, and the observers can identify whether a peer process is located on the same node or not.

This section explains the whole strategy used to coordinate checkpoints among the members of each group, combined with the receiver-based pessimistic protocol for communications between processes hosted on different nodes. First, the changes to the current RADIC model are stated. Second, the coordination protocol executed among the processes of a node before checkpointing is described. Lastly, the semi-coordinated checkpoint protocol is described in protection and in recovery.

4.1 RADIC Model Changes

The proper component to carry out the coordination among the members of a group, without adding tasks to the observers, is the RADIC local protector. A connection between each observer and its local protector is already established, but until now it was used only in case of failure; now it is also used to perform the coordination. Figure 4 represents a parallel application running on N nodes of a multicore cluster, where each node i hosts a group of M_Ni members.

Fig. 4: Coordinated groups in a multicore cluster.

Each group is coordinated by its local protector Tx to silence internal communications, while the receiver-based pessimistic protocol is kept for communications between members of different groups.

By default, RADIC considers that all processes running on a node share the same checkpoint interval. Since the interval is a per-process configuration value, when different intervals are configured the groups are formed by the processes that are on the same node and have the same interval; the receiver-based pessimistic protocol is then used even between processes on the same node that have different checkpoint intervals. This configuration is useful when processes with very different communication patterns, and therefore very different optimal checkpoint intervals, run on the same node. As an extreme case, it also allows falling back to fully uncoordinated checkpointing by configuring a different checkpoint interval for every parallel process. To simplify the explanation, in the following we assume that each node hosts a single group with one checkpoint interval.

In addition, when a node fails and no spare node is available, the RADIC recovery model establishes that the failed processes are recovered on the node where their critical data was saved. In such cases, the protector Tx-1 must automatically recognize at least two groups: the one still running on node x-1 and the one being recovered. Although after the failure both groups reside on the same node, RADIC keeps them uncoordinated with each other until the end of the execution, so that the recovered group can be moved if a spare node becomes available later.

4.2 Coordinated Checkpointing Protocol

The protocol for coordinating the processes of a group is shown in Figure 5. Two entities are involved. The first is the protector Tn running on one of the N nodes depicted in Figure 4; on node n there are 1 to Mn parallel processes to coordinate. The second is the observer Onm attached to each member of the group. When Tn decides it is time to checkpoint, it sends a message to each observer. After receiving that message, each observer stops its communication activity at the beginning of its next send or recv function. In this state, the coordination requirement of having no in-transit messages between processes of the group is met, because all send operations have completely finished and been acknowledged. Each observer then replies to its local protector Tn that it is ready for the checkpoint. Once all members are ready, Tn calls BLCR to checkpoint each process; BLCR executes the callback function provided by each Onm and the checkpoint is taken. Afterwards, the checkpoint files are sent by Tn to Tn-1.

Fig. 5: Coordinated checkpointing protocol.
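From the protector's side, the coordination can be sketched as below. The message tags, the per-member structure and the way BLCR is invoked (here through its cr_checkpoint command-line utility, whose exact flags depend on the installation) are assumptions for illustration only; the actual protector/observer exchange is internal to RADIC.

#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

enum { CKPT_REQUEST = 1, CKPT_READY = 2 };

struct member { int obs_fd; pid_t pid; };    /* one observer O_nm of the group */

void coordinate_and_checkpoint(struct member *grp, int m)
{
    unsigned char msg;

    /* 1. Ask every observer in the group to quiesce its communications.       */
    for (int i = 0; i < m; i++) {
        msg = CKPT_REQUEST;
        send(grp[i].obs_fd, &msg, 1, 0);
    }

    /* 2. Each observer stops at the start of its next send()/recv(); since
     *    every send is acknowledged, no intra-group message is in transit.
     *    Wait until all of them report they are ready.                        */
    for (int i = 0; i < m; i++)
        recv(grp[i].obs_fd, &msg, 1, MSG_WAITALL);   /* expects CKPT_READY     */

    /* 3. Checkpoint every member with BLCR; the callback registered by each
     *    observer runs inside the corresponding process.                      */
    for (int i = 0; i < m; i++) {
        char cmd[64];
        snprintf(cmd, sizeof cmd, "cr_checkpoint %d", (int)grp[i].pid);
        system(cmd);
    }

    /* 4. Ship the resulting checkpoint files to the protector T(n-1)
     *    (file transfer over the RADIC socket, omitted here).                 */
}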
4.3 Semi-Coordinated Protocol in Protection

RADIC at socket level keeps an identification of the remote process for each virtual socket; this information is exchanged when the control-ft socket is established. The identification is formed by the node-id, the process-id and the virtual socket, and a group-id is now incorporated to support the new model. Using this data, the observer can tell whether the remote peer belongs to its group or not. The group-id is assigned by the local protector Tx when the first communication between them is established; its default value is 0.

Figure 6 shows the procedure used by the observers when sender and receiver belong to the same group. It differs from the procedure of Figure 2 in that saving the received message to stable storage is skipped. Consequently, both the sender and the receiver overheads are reduced by the time needed to log messages, since Or returns the acknowledgment immediately after reading the message.

Fig. 6: Protection protocol for intra-group communications.

When the two peers belong to different groups, the observers still follow the protocol of Figure 2. Although overhead is saved by not logging intra-group messages, the execution time will not necessarily decrease when:

- The amount of data exchanged within each group is not significant.
- The application is computation-bound and the processes spend most of their time computing rather than waiting for communication results; since communication is overlapped with computation, less communication overhead does not translate into less execution time.
- The overhead added by the coordination protocol exceeds the time saved by eliminating intra-group message logging.
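The group membership test that decides whether a received message must be logged can be sketched as follows. The peer_id layout mirrors the identification described above, but the field names, and the simplification of the acknowledgment requirement to a bare sequence number, are assumptions of this sketch rather than RADIC's code.

#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

struct peer_id {
    uint32_t node_id;
    uint32_t process_id;
    uint32_t virtual_socket;
    uint32_t group_id;      /* assigned by the local protector; 0 by default */
};

/* Receiver-side observer: same steps as the or_recv() sketch of Section 3.2,
 * except that Tr-save-msg is skipped when both peers share a group.          */
ssize_t or_recv_semi(const struct peer_id *self, const struct peer_id *peer,
                     int ctrl_fd, int data_fd, int prot_fd,
                     void *buf, uint32_t x)
{
    uint32_t seq;
    recv(ctrl_fd, &seq, sizeof seq, MSG_WAITALL);       /* Tr-ack-req          */
    ssize_t n = recv(data_fd, buf, x, MSG_WAITALL);     /* Tr-msg              */

    if (n > 0 && peer->group_id != self->group_id)      /* inter-group only:   */
        send(prot_fd, buf, (size_t)n, 0);               /* Tr-save-msg         */
    /* Intra-group messages are covered by the coordinated checkpoint of the
     * group, so the log to the protector is skipped entirely.                 */

    send(ctrl_fd, &seq, sizeof seq, 0);                 /* Tr-send-ack         */
    return n;
}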

4.4 Semi-Coordinated Protocol in Recovery

The semi-coordinated checkpoint protocol changes the recovery described in Section 3.3, because after a node failure a whole group rolls forward simultaneously. The recovery protocol depicted in Figure 3 still applies to virtual sockets between processes that do not belong to the same group. In contrast, virtual sockets between members of the same group are reconnected at the beginning of the recovery process, and the sends and receives between members of the group are re-executed, because the remote peer is also recovering and no messages were logged for them. There are no significant overhead differences in recovery: the send and receive operations for group communications are now performed instead of being skipped or served from storage, respectively.

5. Experimental Results

We tested the fault tolerance system to compare the fully uncoordinated RADIC model with the semi-coordinated one. The experiments were executed on a cluster of 4 nodes with Intel Core i5-650 processors and 6 GB of RAM, interconnected with Gigabit Ethernet, running Ubuntu 10.04 with kernel 2.6.32-33-server. We use a heat-transfer SPMD application and a matrix-sum Master/Worker application, both based on TCP sockets, which follow different communication patterns; this allows us to observe how the two approaches behave in both cases.

There are three types of execution: first, without fault tolerance, labeled No FT; second, failure-free executions to test the protection phase, labeled Failure-Free; and lastly, executions in which a failure is injected in node N3 seconds after the first checkpoint, labeled Recovery. As there is no spare node available, the failed processes are recovered on node N2. The Failure-Free and Recovery executions were run both with the fully uncoordinated protocol and with the semi-coordinated one. Checkpoints were taken only for the processes executing on N3: since each checkpoint closes and then reconnects the communications, checkpointing other processes would disturb the experiments by adding overheads unrelated to the message logging and coordination protocols. The metric used to compare both protocols is throughput per second, which shows how the overheads introduced by fault tolerance impact the work effectively done by each process.

Figure 7 compares the No FT, Failure-Free and Recovery executions of the SPMD process P5, located on node N3. With the fully uncoordinated protocol the throughput drops more than with the semi-coordinated one. Thanks to this, the process needs only 2.37% more time than the execution without fault tolerance, whereas the fully uncoordinated protocol adds 9.48% to the execution time. Recovery with the semi-coordinated protocol also performs better, adding 27.49% against 52.13% in the uncoordinated case.

Fig. 7: SPMD process P5 executions.

In the MW application, a sum of 1000x1000 float matrices is performed; the master sends 11 KB to each worker to sum and 11 bytes are returned to the master. Executions of a worker hosted by the failed node N3 are plotted in Figure 8, which shows that the uncoordinated protocol is slightly better than the semi-coordinated one, adding 4.49% and 33.23% in failure-free and recovery executions, respectively, with respect to the execution without fault tolerance. The overhead added by coordination increases the execution time of the semi-coordinated protocol, while little logging overhead is saved, because only node N1, which runs the master and two workers, has group communication.

Fig. 8: MW worker executions.

To evaluate how these results relate to the packet size in use, executions with different workloads were performed. Figure 9 shows the SPMD execution times in failure-free and recovery executions; the heat-transfer application in this experiment is configured to communicate more intensively.

It is observed that the uncoordinated protocol is better for small packet sizes: in those cases the communication is usually overlapped with computation, so the logging overhead does not increase the execution time, while the coordination of each group's checkpoints does. The MW executions, displayed in Figure 10, follow the same behavior. As the packet size increases, the semi-coordinated protocol becomes the better option for both communication patterns tested.

Fig. 9: SPMD executions using different packet sizes.

Fig. 10: MW executions using different packet sizes.

6. Conclusions and Future Work

A semi-coordinated checkpoint protocol has been added to the RADIC model as an alternative fault tolerance algorithm for parallel applications running on multicore clusters. The experiments show that this protocol reduces the overhead of fault tolerance. Applications with intensive or large intra-group communication are the main target, since they are likely to obtain better execution times by avoiding the logging of their intra-node messages. The implementation is at an early stage and still contains instrumentation for taking measurements. We plan several optimizations and the extension of this work to standard MPI, and we are preparing a set of experiments for a deeper comparative analysis of the semi-coordinated and uncoordinated checkpointing protocols over a wider range of packet sizes and communication patterns.

Acknowledgments

This research has been supported by the MICINN Spain under contract TIN2007-64974, the MINECO (MICINN) Spain contract TIN2011-24384, the European ITEA2 project H4H No. 09011, and the Avanza Competitividad I+D+I contract TSI-020400-2010-120.

References

[1] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, September 2002.
[2] G. Santos, A. Duarte, D. Rexachs, and E. Luque, "Providing non-stop service for message-passing based parallel applications with RADIC," in Lecture Notes in Computer Science, vol. 5168, 2008, pp. 58-67.
[3] M. Castro, D. Rexachs, and E. Luque, "Transparent fault tolerance middleware at user level," in HPCS 2012, 2012, pp. 566-572.
[4] S. Rao, L. Alvisi, and H. M. Vin, "The cost of recovery in message logging protocols," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 160-173, March 2000. [Online]. Available: http://dx.doi.org/10.1109/69.842260
[5] A. Bouteiller, T. Herault, G. Bosilca, and J. J. Dongarra, "Correlated set coordination in fault tolerant message logging protocols," in Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, ser. Euro-Par '11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 51-64. [Online]. Available: http://dl.acm.org/citation.cfm?id=2033408.2033415
[6] M. Castro, D. Rexachs, and E. Luque, "RADIC-based message passing fault tolerance system," in ADVCOMP 2012, The Sixth International Conference on Advanced Engineering Computing and Applications in Sciences, 2012, pp. 59-64.
[7] Y. Luo and D. Manivannan, "HOPE: A hybrid optimistic checkpointing and selective pessimistic message logging protocol for large scale distributed systems," Future Generation Computer Systems, vol. 28, no. 8, pp. 1217-1235, October 2012.
[8] Q. Gao, W. Huang, M. J. Koop, and D. K. Panda, "Group-based coordinated checkpointing for MPI: A case study on InfiniBand," in International Conference on Parallel Processing (ICPP 2007), 2007.
[9] J. C. Y. Ho, C.-L. Wang, and F. C. M. Lau, "Scalable group-based checkpoint/restart for large-scale message-passing systems," in IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1-12.
[10] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman, The Design and Implementation of the 4.4BSD Operating System. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc., 1996.
[11] P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters," Journal of Physics: Conference Series, vol. 46, no. 1, p. 494, 2006.
[12] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Transactions on Computer Systems, vol. 3, pp. 63-75, 1985.
[13] C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello, "Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ser. SC '06. New York, NY, USA: ACM, 2006. [Online]. Available: http://doi.acm.org/10.1145/1188455.1188587