RADIC: A Fault-Tolerant Middleware with Automatic Management of Spare Nodes*


Hugo Meyer 1, Dolores Rexachs 2, Emilio Luque 2
Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
1 hugo.meyer@caos.uab.es, 2 {dolores.rexachs, emilio.luque}@uab.es

Abstract

As the mean time between failures has decreased, applications should be able to handle failures, avoiding performance degradation if possible. This work focuses on a decentralized, scalable, transparent and flexible fault tolerant architecture named RADIC, an acronym for Redundant Array of Distributed and Independent Controllers. As MPI is a de facto standard used for communication in parallel computers, RADIC has been integrated into it. RADIC's default behavior is to restart failed processes on already used nodes, overloading them. However, it can also be configured to keep the initial process-per-node ratio, maintaining the original performance in case of failures. In this paper we propose a transparent and automatic management of spare nodes in order to avoid performance degradation and to minimize the mean time to recovery (MTTR) when using them. Our work provides transparent fault tolerance for applications written using the MPI standard. Initial evaluations show how the application performance is restored to its pre-failure level, minimizing the MTTR by managing faults automatically.

Keywords: RADIC, MPI, Fault Tolerance, Decentralized, Spare Nodes, Uncoordinated Checkpoint.

1 Introduction

* This research has been supported by the MICINN Spain under contract TIN2007-64974, the MINECO (MICINN) Spain under contract TIN2011-24384, the European ITEA2 project H4H, No 09011 and the Avanza Competitividad I+D+I program under contract TSI-020400-2010-120.
This paper is addressed to the PDPTA 12 Conference.

Considering the many long-running parallel applications that are executed on High Performance Computing (HPC) clusters and the increase in the failure rate [1] on these computers, it becomes imperative to make these applications resilient to faults. Hardware failures may cause unscheduled stops to applications. If there is no fault tolerance mechanism to prevent it, these applications will have to be re-executed from the beginning. If a fault tolerance mechanism is used, failures can be treated. In such an environment, an automatic and application-transparent fault tolerance mechanism is desirable. It can also reduce the complexity of application development. Failure treatment and management are crucial to maintain the performance of HPC applications that run over several days. One of the most commonly used approaches to deal with failures in HPC parallel applications is rollback-recovery based on checkpoint and restart protocols. Rollback-recovery protocols periodically save process states in order to roll back in case of faults. Checkpoints can be taken using a coordinated or an uncoordinated checkpointing protocol. Coordinated checkpointing protocols create a consistent set of checkpoints by stopping all the processes of the parallel application in a consistent state and then taking a snapshot of the entire application. This approach minimizes the overhead of fault-free execution, but in case of faults, all processes (even those that have not failed) must roll back to the previous consistent saved state. All the computation performed between the last snapshot and the fault is lost. In uncoordinated checkpointing protocols, each process is checkpointed individually, possibly at different moments of the execution. Thus, there is no globally consistent state.
The advantage of this method is that, in case of faults, only the affected processes must roll back. In order to avoid the domino effect [2], this approach should be combined with an event logging protocol. When a parallel application is executed, we usually seek executions with an optimal amount of resources, to maximize speedup or efficiency. When a failure occurs and the application loses some resources, all the initial tuning effort is lost. In this paper we present new RADIC [3] enhancements to avoid performance degradation when failures occur. The objective is achieved using automatic spare node management to maintain the initial amount of resources when node failures occur. We also try to minimize the MTTR after a failure is detected by managing faults without human intervention. For that reason, all fault tolerance tasks and decisions are carried out automatically. The RADIC architecture has been integrated into the Open MPI library to allow execution of real scientific parallel applications and to be application-transparent. Our approach considers the consequences that node failures bring to parallel applications. A physical failure affects computing components; if these components are not replaced properly there is a loss of computational capacity.

Figure 1. RADIC components

Running a parallel application with fewer resources than the optimum causes this degradation.

The rest of this work is organized as follows: section 2 describes the RADIC architecture, its components and how it operates to protect an application against failures. Section 3 introduces related work on fault tolerant systems. Section 4 presents the integration of RADIC into the Open MPI library to provide user-transparent fault tolerance. Next, section 5 illustrates the initial results obtained with the described implementation. Finally, section 6 presents the conclusions and future lines of work.

2 RADIC Architecture

RADIC [3] is a fault tolerant architecture for message passing systems based on rollback-recovery techniques. These techniques rely on an uncoordinated checkpoint protocol combined with a receiver-based pessimistic event log [4]. The chosen approach does not need any coordinated or centralized action or element to carry out its fault tolerance tasks and mechanisms, so scalability depends only on the application itself. The RADIC architecture acts as a fault tolerant layer between the MPI standard and the (fault-probable) parallel machine. This layer provides a fault-resilient environment for parallel applications even when they run over a fault-probable parallel machine. Our work is focused on providing an application-transparent fault tolerant middleware within a message passing library, specifically Open MPI [5]. Critical data such as checkpoints and event logs are stored in a different node than the one in which the process is running. Processes that were residing in a failed node will be restarted in another node from their latest checkpoint, and will consume the event log in order to reach the pre-fault state. RADIC policies provide a transparent, decentralized, scalable and flexible fault tolerance solution.
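The restart-from-checkpoint plus log-replay cycle described above can be illustrated with a minimal toy model. The class and its methods below are hypothetical names for illustration only, not RADIC's actual interfaces; the point is that a single rolled-back process reaches its pre-fault state by replaying its own message log, without touching any other process.

```python
# Toy sketch of uncoordinated checkpointing with receiver-based message
# logging (illustrative names, not RADIC's API).

class LoggedProcess:
    def __init__(self):
        self.state = 0          # application state (a simple accumulator)
        self.checkpoint = None  # last saved state
        self.msg_log = []       # messages received since last checkpoint

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.msg_log.clear()    # older log entries are no longer needed

    def receive(self, msg):
        self.msg_log.append(msg)  # pessimistic: log before processing
        self.state += msg

    def recover(self):
        # Roll back only this process: restore checkpoint, replay the log.
        self.state = self.checkpoint
        for msg in self.msg_log:
            self.state += msg

p = LoggedProcess()
p.receive(5); p.take_checkpoint()
p.receive(3); p.receive(2)
before_fault = p.state
p.recover()
assert p.state == before_fault  # the replay reaches the pre-fault state
```

In RADIC the checkpoint and the log live on the protector's node, not locally, but the replay logic during recovery follows this shape.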
2.1 RADIC components

RADIC provides fault tolerance based on two main components: protectors and observers. Figure 1 illustrates computing nodes (Ny), application processes (Px), protectors (Ty) and observers (Ox), where the sub-index x represents the process number and y the node number. Protectors and observers work together with the aim of building a distributed fault tolerant controller. Both components are described below:

Observers: responsible for process monitoring and fault masking. Each application process has one observer attached to it. The observer performs event logging of received messages in a pessimistic manner, and also takes periodic checkpoints of the process to which it is attached. Checkpoints and logging data are sent to and stored by its protector, located in another node (Figure 1). During recovery, the observer processes the event log, replaying the messages in order to reach the same state as before the fault.

Protectors: on each node there is a protector running; its main function is to detect node failures via a heartbeat/watchdog protocol. Protectors also store checkpoints and event logs sent by observers. When a failure occurs, a protector has to restart the failed processes that it protects; it also has to reestablish the heartbeat/watchdog protocol, since it gets broken by node failures.

2.2 RADIC Operation

Fast failure detection is one of RADIC's priorities, since it is one of the variables that affect the MTTR. RADIC's first detection mechanism is a heartbeat/watchdog protocol that allows protectors to learn about failures of neighboring protectors. As every communication goes through the observers, they have absolute control over the message exchange between peers. Observers can also detect and mask faults.
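The heartbeat/watchdog detection between protectors can be sketched as follows. This is a minimal illustration with a made-up timeout value, not RADIC's implementation: a protector resets its watchdog whenever a heartbeat arrives from its neighbor, and declares the neighbor failed when the timer expires.

```python
# Minimal heartbeat/watchdog failure detector sketch (illustrative).
import time

class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def heartbeat(self):
        """Called whenever a heartbeat arrives from the monitored neighbor."""
        self.last_beat = time.monotonic()

    def neighbor_failed(self):
        """Checked periodically by the protector owning this watchdog."""
        return time.monotonic() - self.last_beat > self.timeout

wd = Watchdog(timeout=0.05)
wd.heartbeat()
assert not wd.neighbor_failed()
time.sleep(0.06)               # neighbor stops sending heartbeats
assert wd.neighbor_failed()    # watchdog expires, failure is reported
```

In RADIC the protectors form a logical ring (each sends heartbeats to the next node), so a detected failure also triggers ring reconstruction.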
Each protector maintains a data structure called the radictable, with one entry per application process; each entry is composed of a process id, the URI of the process, the URI of its protector, and a unique clock of received and sent messages. When a process fails and gets restarted, the observers consult the radictable in order to find the node where the process has been recovered, by asking the process's protector. The protectors update the radictable on demand when they identify process failures. Figure 2a shows a fault-free execution using RADIC without spare node support. When a failure occurs (Figure 2b), the parallel application execution continues with one node less. The node failure is detected by the heartbeat/watchdog mechanism. After the failure, the heartbeat/watchdog scheme is reconstructed, and T4 takes T2 as the new protector of P4 (Figure 2c). O4 needs to take a checkpoint of P4, because its latest checkpoint was lost when T3 failed. T2 restarts and re-executes P3 (Figure 2d), and also indicates that the new protector of P3 is T1. Then O3 takes a checkpoint of P3 and sends the data to T1. Finally, O3 erases the old message logs. The protectors have two operating modes: active and passive. A protector is active when it forms part of the detection scheme and there are application processes running on its node (all nodes of Figure 2). Protectors may be in a passive state when they are running in a spare node; this is a low-consumption state (to avoid node and network overload).

2.3 Spare Nodes in RADIC

When a failure occurs, the failed process is restarted in the node where its protector is running; if this node already has application processes running on it, it becomes overloaded. This can slow down the execution of both

Figure 2. a) Fault-free execution. b) Failure in node 3. c) Heartbeat/watchdog restoration and assignment of a new protector to P4. d) Restart of process P3 in node N2.

processes. As a consequence, the performance of the entire application can be affected, increasing its execution time. One method to maintain the initial performance in such a scenario is to use spare nodes to restart failed processes [6], instead of overloading the non-failed nodes. Spare nodes are nodes that are initially not used by the parallel application. In Figure 3a we can observe the execution of a parallel application using 4 nodes and 1 spare node (NS). When a failure occurs in node N3 (Figure 3b), the protector T2 detects the failure of protector T3 and consults a table holding the location and state of the spare nodes (the sparetable). The sparetable (Table 1) is replicated among all protectors. Spares are assigned as failures occur, and the replicated information is updated on demand, so all operations are carried out in a decentralized and transparent manner. Eventually, these tables may become outdated; however, this does not affect the RADIC operation, since the information is checked before using any spare. After consulting its sparetable, the protector T2 confirms the availability of the spare NS (Figure 3c) and, if it is available, T2 transfers the latest checkpoint and event logging data of process P3 to NS (Figure 3d). Finally, the protector TS restarts P3 and becomes an active protector by joining the heartbeat/watchdog protection scheme (Figure 3e).

3 Related Work

Many proposals have been made to provide fault tolerance for message passing applications. Most strategies are based on a coordinated checkpointing approach, or on an uncoordinated checkpointing strategy combined with a logging mechanism. Currently, several checkpoint-restart tools are available. We can highlight BLCR (Berkeley Lab's Checkpoint/Restart) [7] and DMTCP (Distributed MultiThreaded CheckPointing) [8]. DMTCP works in user space and BLCR works at kernel level. BLCR is one of the most used libraries to provide fault tolerance in parallel systems.
To use BLCR with parallel applications, MPI libraries should at least reopen communication channels after restart [9]. Table 2 highlights the features of three of the most popular fault tolerant frameworks integrated into MPI libraries, along with our approach. Most solutions use centralized storage. However, for scalability reasons, it is desirable to avoid any centralized element. Our approach differs from MPICH-V2 [10] in that we do not use any centralized storage: with RADIC, every computing node can store critical data from processes residing in another node. We also use a pessimistic receiver-based logging protocol. MPICH-V2 is now a deprecated implementation. MPICH-VCL is designed to reduce overhead during fault-free execution by avoiding message logs. It is based on the Chandy-Lamport algorithm [11]. MPICH2-PCL [12] uses a blocking coordinated checkpointing protocol. LAM-MPI [13] predates Open MPI. It modularizes a checkpoint/restart approach to allow the usage of multiple checkpoint/restart techniques. The implementation supports communications over TCP and Myrinet in combination with BLCR and SELF checkpointing operations. LAM-MPI uses a coordinated checkpoint approach and needs a communication thread between the checkpoint/restart system and the mpirun process to schedule checkpoints.

Figure 3. RADIC with spare nodes. a) Execution before failure with one spare node. b) Failure in node 3. c) The protector T2 checks availability of spare node NS. d) Protector T2 transfers the checkpoint of process P3 to spare node NS. e) Protector TS restarts process P3 and also the communications.

The current checkpoint/restart implementation of the Open MPI library [9] aims to combine the best features of the methods described above.
The implementation uses a distributed checkpoint/restart mechanism where each checkpoint is taken independently, but coordination is needed to build a consistent global state, which requires interrupting all processes at the same time. Another work that has become important is the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) [14]. It is a framework that enables system software components to share fault information with other components and take action in order to adapt to faults. The main difference with our proposal is that we deal with faults automatically and transparently to applications, which allows us to reduce the MTTR.

4 RADIC in MPI

The first prototype of RADIC was called RADICMPI [15]; it was developed as a small subset of the MPI standard and, as a message passing library, is very limited. As this implementation does not provide all the MPI primitives, it cannot execute many of the scientific applications available.

Table 1. Sparetable

Spare Id | Address | Observers
0        | Node5   | 1
1        | Node6   | 0

RADICMPI does not consider collective operations and other complex functions that many applications use. For that reason, instead of extending the prototype to comply with the MPI standard, we decided to integrate the RADIC architecture into a well-established MPI implementation. This allows the correct execution of any MPI application using the fault tolerance policies and mechanisms of RADIC (Section 2). In the following paragraphs we explain some important features of the integration of RADIC into Open MPI.

4.1 Open MPI Architecture

An in-depth study of the inclusion of RADIC in Open MPI has been made in [16]. The implementation is named

RADIC-OMPI; it integrates the basic protection level of RADIC but does not include spare node management. The Open MPI architecture has already been described in [5]. For that reason, in this paper we focus only on the components relevant to the RADIC integration. The Open MPI frameworks are divided into three groups: Open MPI (OMPI), which provides the API to write parallel applications; the Open Run-Time Environment (ORTE), which provides the execution environment for parallel applications; and the Open Portable Access Layer (OPAL), which provides an abstraction of some operating system functions. To launch a given parallel application, an ORTE daemon is launched in every node that takes part in the parallel application. These daemons communicate among themselves to create the parallel runtime environment. Once this environment is created, the application processes are launched by these daemons. Every process exchanges information about communication channels during the Module Exchange (MODEX) operation, which is an all-to-all communication. The protector functionalities have been integrated into the ORTE daemon because in Open MPI there is always one daemon running in each node, which fits the protector requirements.

Table 2. Fault tolerant MPI libraries.

Name      | FT Strategy                                                              | Detection and Recovery
MPICH-V2  | Uncoordinated ckpt. Sender-based pessimistic log. Centralized storage.  | Automatic.
MPICH-VCL | Coordinated ckpt. (Chandy-Lamport algorithm). Centralized storage.      | Automatic.
Open MPI  | Coordinated ckpt. Centralized storage.                                  | Fault detection and safe stop. Manual recovery.
RADIC     | Uncoordinated ckpt. Pessimistic receiver-based log. Distributed storage. | Automatic and application transparent.

OMPI provides a three-layer framework stack for MPI communication: the Point-to-point Management Layer (PML), which allows wrapper stacking.
Because of its behavior, the observer has been implemented as a PML component; this ensures the existence of one observer per application process. The Byte Transfer Layer (BTL) implements all the communication drivers, and the BTL Management Layer (BML) acts as a container for the drivers implemented by the BTL framework. The Open MPI implementation also provides a framework to schedule checkpoint/restart requests, called the Snapshot Coordinator (SnapC). The generated checkpoints are transferred through the File Manager (FileM) framework. All the communications to schedule and manage the transfer of the checkpoint files are made using the Out of Band (OOB) framework.

4.2 RADIC Implementation

To define the initial heartbeat/watchdog fault detection scheme and the protection mapping, a simple algorithm is used: each observer sets as its protector the next logical node, and the last node takes the first one as its protector. All protectors fill the radictable before the parallel application is launched and update it with new information when failures occur. The update of the radictable does not require any collective operation; thus many protectors may hold an outdated version of the radictable. However, the radictable is updated later, on demand, when observers try to contact restarted processes. Regarding the fault tolerance mechanisms and their integration into Open MPI, the following observations can be made:

Uncoordinated checkpoints: each process performs its checkpoints through a checkpoint thread. Checkpoints are triggered by a timer (the checkpoint interval) or by other events. Before a checkpoint is taken, to ensure that there are no in-transit messages, all communication channels are flushed and remain unused until the checkpointing operation finishes. After a checkpoint is taken, each process transfers its checkpoint files using the FileM framework, and then communication between processes is allowed again.
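The flush, checkpoint, transfer, resume sequence just described can be sketched as follows. The function names are illustrative placeholders, not Open MPI calls; the real implementation lives inside the PML and FileM frameworks.

```python
# Sketch of the uncoordinated checkpoint sequence (illustrative names):
# flush channels, checkpoint, transfer files, then re-enable communication.
import threading

comm_lock = threading.Lock()   # held while checkpointing: no in-transit msgs
events = []

def flush_channels():        events.append("flush")
def save_checkpoint():       events.append("checkpoint")
def transfer_to_protector(): events.append("transfer")  # FileM in Open MPI

def checkpoint_once():
    with comm_lock:          # communication blocked until the ckpt finishes
        flush_channels()
        save_checkpoint()
        transfer_to_protector()

# A checkpoint thread would run this on a timer (the checkpoint interval):
timer = threading.Timer(0.01, checkpoint_once)
timer.start()
timer.join()
assert events == ["flush", "checkpoint", "transfer"]
```

Senders that block on `comm_lock` during this window correspond to the observers that retry communication once the checkpoint operation finishes.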
Message log: since the observer is located in the PML framework, it ensures that all communications through it are logged and then transferred to the corresponding protector. The protector only confirms a message reception after the message has been saved. Messages are marked as received by the remote process only after both the receiver and its protector confirm the reception (pessimistic receiver-based log).

Failure detection mechanism: failures are detected when communications fail; this mechanism requires the modification of lower layers to raise errors to the PML framework, where the faults are managed. This avoids application stops. A heartbeat/watchdog mechanism is also used: the protectors send heartbeats to the next logical node, and the receiving protector resets its watchdog timer after each reception.

Failure management: the default behavior of the library is to finalize when a failure occurs (fail-stop). Hence RADIC needs to mask failures in order to continue execution and avoid fault propagation to the application level. When a protector finds out about a failure, the restart operation is initiated.

Recovery: the recovery is composed of three phases. First, a protector restores the failed process, with its attached observer, from its checkpoint. Then the restored observer sets its new protector, re-executes the process while consuming the event logging data, and takes a checkpoint. Finally, the process execution is resumed after its checkpoint has been sent to the new protector, to ensure its protection. Protectors involved in the fault also reestablish the protection mechanism. We consider the recovery an atomic procedure.

Reconfiguration: when the recovery ends, the communications have to be restored. To achieve this

goal, the lower layers of Open MPI must be modified to redirect all communications to the new address of the process. To avoid collective operations, this information is updated on demand or by a token mechanism.

4.3 Proposal: Spare Nodes Management in Open MPI

An important aspect that has to be considered when running parallel applications is performance. The previous implementation of the RADIC architecture [16] allows the successful completion of parallel applications even in the presence of failures. However, it does not consider the management of extra resources to replace failed nodes. By including spare node management in RADIC, applications will not only finish correctly but will also avoid performance degradation due to the loss of computational resources. Our proposal is not restricted to avoiding performance loss; we also propose a mechanism to automatically select spare nodes and include them in the parallel environment domain without user intervention. By performing the spare node management transparently and automatically, we minimize the MTTR. When including spare nodes in the RADIC architecture, restart and reconfiguration are the most affected mechanisms. To reconfigure the system, a deterministic algorithm to find restarted processes is needed. When using RADIC without spare nodes (Figure 2), failed processes are restarted in their protectors. If an observer tries to reach a relocated failed process, it looks at its radictable to find the old protector of the failed process (this information may be outdated). Then, the observer asks that protector about the process. The old protector answers that it is no longer protecting such a process, and points to the new protector (Figure 2). If a failure occurs and there are spare nodes available, a spare is included in the parallel environment domain and the failed processes are restarted in it.
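The deterministic, on-demand lookup described above, where an observer locates a restarted process by asking its old protector and then updates its own stale entry, might look roughly like this. The data layout and names are assumptions for illustration; in RADIC-OMPI this information lives inside Open MPI's job structures.

```python
# Sketch of the on-demand radictable update (illustrative structures).

radictable = {
    # process id -> (process URI, protector URI); may be stale after a fault
    3: ("node3:p3", "node3:t3"),
}

# Ground truth held by protectors after P3 restarted on node2 under T1:
protector_knowledge = {"node3:t3": {3: ("node2:p3", "node1:t1")}}

def locate(pid):
    """Find a (possibly restarted) process, fixing the radictable lazily."""
    uri, protector = radictable[pid]
    redirect = protector_knowledge.get(protector, {}).get(pid)
    if redirect is not None:        # old protector points to the new location
        radictable[pid] = redirect  # update on demand, no collective needed
        uri, protector = redirect
    return uri

assert locate(3) == "node2:p3"
assert radictable[3] == ("node2:p3", "node1:t1")  # stale entry repaired
```

Because each observer repairs only the entries it actually uses, the costly MODEX-style all-to-all exchange is avoided, matching the decentralized update policy described in the text.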
The heartbeat/watchdog mechanism is then reestablished and the involved protectors update their radictable and sparetable (Table 1). Considering Figure 3e, if process P1 wants to reach P3, O1 asks T2 about P3. T2 indicates that P3 is residing in the spare NS. Then O1 tells T1 to update its radictable and its sparetable, and P1 finally tries to contact P3. The process described above is distributed and decentralized, and each process performs it only when strictly necessary, avoiding the costly Module Exchange (MODEX) collective of Open MPI. The main problem when restarting a process in another node is that we need an ORTE daemon running in that node to adopt the new process as a child. Moreover, all future communication with the restarted process needs to be redirected to its new location. For that reason, ORTE daemons are launched even in spare nodes, but no application process is launched on them until they are required as spares. An additional problem that must be addressed is that a sender observer must not treat as a failure the lack of communication with another process when the receiver process is taking a checkpoint or restarting. The sender observer fails to communicate and consults the receiver's protector to find out the state of the receiver. The protector indicates that the process is checkpointing or restarting, and the communication is retried later. The radictable and sparetable were included inside the job information structure (orte_jmap_t). When the parallel application starts, each protector (ORTE daemon) populates its radictable and its sparetable. The radictable and sparetable are updated (on demand) when a protector notices that a process has restarted in another place. If the application runs out of spares, the default mechanism of RADIC is used (Figure 2).

5 Experimental Results

A fault tolerant architecture generally introduces some kind of overhead into the system it is protecting.
These overheads are generally caused by replication in some of its forms. The overheads introduced by RADIC are mostly caused by the uncoordinated checkpoints and the pessimistic log mechanism, as shown in [16]. Failures may also cause degradation because of the loss of computational capacity if no spare nodes are available. The experimental evaluation aims to show how fast the failure detection and recovery mechanisms of our proposal are, and how quickly they can automatically include spare nodes in the parallel environment in order to avoid the performance impact on applications when resources are lost.

We present experimental results using three different benchmarks: a static matrix multiplication benchmark, the LU benchmark from the NAS Parallel Benchmarks (NPB) [17], and the SMG2000 application [18]. The matrix multiplication application is modeled as a master/worker application: the master sends the data to the workers only at the start and collects the results when the application finishes. Each application process is assigned to one core during normal execution. The matrix multiplication implemented has few communications (only at the beginning and at the end).

Experiments were run on a Dell PowerEdge M600 with 8 nodes, each with 2 quad-core Intel Xeon E5430 processors running at 2.66 GHz. Each node has 16 GB of main memory and a dual embedded Broadcom NetXtreme II 5708 Gigabit Ethernet interface. RADIC has been integrated into version 1.7 of Open MPI.

Figure 4. Throughput of the Matrix Multiplication application with and without spare nodes (32 processes, checkpoint interval = 30 sec).

Figure 5. a) Performance of the LU Benchmark with and without Spare Nodes. b) Performance of the SMG2000 application with and without Spare Nodes.

Our main objective is to depict how application performance degradation is avoided when failures occur in parallel computers. By using spare nodes automatically and transparently to restart failed processes on them, we can decrease the MTTR to a minimum while maintaining application performance as it was before the failure. As mentioned before, it is crucial to deal with failures as fast as possible. If the application loses a node and the default approach of RADIC is used (Figure 2), one of the nodes becomes overloaded. As a consequence, the throughput of the whole application may decrease. Replacing failed nodes with spares is not trivial, because it is necessary to include the spare node in the parallel environment world and then restart the failed process or processes on it transparently and automatically. With our approach, application performance is affected only for a short period.

The experiments show how performance (in terms of throughput) is affected after a failure without spare nodes, and the benefits of using them. To obtain the operations per second of the Matrix Multiplication application, we divided the sub-matrix size computed by each process by the time spent in an internal iteration. The checkpoint intervals used in the experiments are for test purposes only; valid checkpoint intervals can be derived using the model proposed in [19].

Figure 4 shows three executions of the matrix multiplication benchmark. The green line shows the fault-free execution. The blue line shows the execution of the application using RADIC with 2 spare nodes (and 2 failures). The red line shows the execution of RADIC without spare nodes and with one failure. When a fault occurs and the application is not using spare nodes, the failed processes are restarted on their protectors and these nodes become overloaded.
This happens because processes compete for the available resources, and the application loses about 40% of its initial throughput. If RADIC is used with spare nodes, the application loses throughput only for a short period, until a spare is selected and the process is restarted on it; after that, the initial throughput is restored.

The static matrix multiplication is only a synthetic application. To evaluate the throughput changes with real applications using RADIC, we designed a new set of experiments using the LU benchmark and the SMG2000 benchmark (Figure 5). Figure 5a depicts the behavior of the LU benchmark using RADIC with and without spare nodes (and without faults). The application has an irregular behavior because it computes sparse matrices. If a failure occurs and the application does not have any spare node, it loses about 35% of its throughput. However, if the application uses spare nodes, the throughput is reduced by 25% during the recovery, and after that the initial throughput is recovered.

Figure 5b depicts the throughput of the SMG2000 benchmark. The checkpoint and restart operations are quite expensive for this application because the memory footprint of each process is about 2 GB. If a fault occurs and the process is restarted on its protector, the application loses about 15% of its initial throughput. However, if the process is restarted on a spare node, the initial throughput is maintained.

As is known, the execution time of an application affected by faults depends on the moment at which the failure occurs. For that reason, when treating faults we focus on showing the degradation in terms of throughput, not in terms of execution time. Considering the results, we can conclude that transparent and automatic management of spare nodes avoids increments in the MTTR and maintains the application throughput, avoiding system overload.
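The operations-per-second metric used in the experiments above can be sketched as follows. This is a hedged illustration only: the sub-matrix size and iteration time are assumptions, not values from the paper.

```python
# Sketch of the throughput metric used for Figure 4: operations per second
# is the work done in one internal iteration divided by the iteration time.

def iteration_throughput(ops_per_iteration, iteration_seconds):
    """Operations per second measured over one internal iteration."""
    return ops_per_iteration / iteration_seconds

# A worker multiplying a 512x512 sub-matrix performs roughly 2 * 512**3
# floating point operations per iteration (one multiply and one add per
# element pair). The 0.5 s iteration time is an illustrative assumption.
ops = 2 * 512 ** 3
print(iteration_throughput(ops, 0.5))  # flops per second for this iteration
```

Measuring per-iteration throughput rather than end-to-end runtime is what makes the degradation and recovery phases visible in Figure 4, independently of when the fault is injected.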
6 Conclusions and Future Work

The proposal presented in this paper has demonstrated that the RADIC policies are effective for automatically and transparently managing spare nodes, avoiding long recovery times while maintaining the initial application performance. In this paper we have presented the design and evaluation of an alternative method to restart failed processes automatically while maintaining the original computational capacity. This is an important issue because applications are usually configured to execute with an optimal number of nodes. Losing computational resources due to hardware failures decreases application performance.

Having scalability as an objective, it is imperative to use a decentralized fault tolerance approach. Furthermore, when failures occur, transparent and automatic fault treatment is desirable, because the parallel application then experiences performance degradation only for a short period of time. The use of a fault tolerance architecture with RADIC characteristics is desirable because it does not require any user intervention and it is also configurable to use the available resources. When running parallel applications in computer clusters, there are frequently free nodes that are not being used by any application, and these nodes could be used as spare nodes.

The implementation of RADIC in the Open MPI library has several advantages. The first is that Open MPI is a library widely used in the scientific world, which allows RADIC to be used with real scientific applications. The second is that our implementation eases process migration without stopping the parallel application execution. Initial analyses also show that RADIC will complement the MPI-3 standard well. MPI-3 will ease failure management because more information about failures will be available, increasing the possibilities for taking corrective actions. The integration of RADIC into a stable Open MPI release, as well as providing an interface for live migration, are pending tasks. RADIC also needs to start taking into account applications with I/O events (transactional applications).

7 References

[1] Bianca Schroeder and Garth A. Gibson, "Understanding failures in petascale computers," Journal of Physics: Conference Series, vol. 78, 2007.
[2] B. Randell, "System structure for software fault tolerance," SIGPLAN Not., vol. 10, no. 6, pp. 437-449, April 1975.
[3] Amancio Duarte, Dolores Rexachs, and Emilio Luque, "Increasing the cluster availability using RADIC," IEEE International Conference on Cluster Computing, pp. 1-8, 2006.
[4] E. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, September 2002.
[5] E. Gabriel et al., "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," 2004, pp. 97-104.
[6] Guna Santos and Angelo Duarte, "Increasing the Performability of Computer Clusters Using RADIC II," International Conference on Availability, Reliability and Security, pp. 653-658, 2008.
[7] Jason Duell, "The design and implementation of Berkeley Lab's Linux Checkpoint/Restart," 2003.
[8] Jason Ansel, Kapil Arya, and Gene Cooperman, "DMTCP: Transparent checkpointing for cluster computations and the desktop," 2009, pp. 1-12.
[9] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), in conjunction with IPDPS, pp. 1-8, 2007.
[10] Aurélien Bouteiller and Thomas Hérault, "MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging," in Supercomputing, 2003 ACM/IEEE Conference, 2003.
[11] K. M. Chandy and Leslie Lamport, "Distributed snapshots: determining global states of distributed systems," ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63-75, February 1985.
[12] Darius Buntinas et al., "Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols," Future Generation Comp. Syst., vol. 24, pp. 73-84, 2006.
[13] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in Proceedings, LACSI Symposium, Santa Fe, pp. 479-493, 2003.
[14] Rinku Gupta et al., "CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems," International Conference on Parallel Processing, pp. 237-245, 2009.
[15] Angelo Duarte, Dolores Rexachs, and Emilio Luque, "An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI," pp. 150-157, 2006.
[16] Leonardo Fialho, Guna Santos, Angelo Duarte, Dolores Rexachs, and Emilio Luque, "Challenges and Issues of the Integration of RADIC into Open MPI," 2009, pp. 73-83.
[17] D. Bailey et al., "The NAS Parallel Benchmarks," International Journal of High Performance Computing Applications, 1994.
[18] Kiattisak Ngiamsoongnirn, Ekachai Juntasaro, Varangrat Juntasaro, and Putchong Uthayopas, "A Parallel Semi-Coarsening Multigrid Algorithm for Solving the Reynolds-Averaged Navier-Stokes Equations," 2004, pp. 258-266.
[19] Leonardo Fialho, Dolores Rexachs, and Emilio Luque, "What is Missing in Current Checkpoint Interval Models?," International Conference on Distributed Computing Systems, pp. 322-332, 2011.