RADIC: A Fault-Tolerant Middleware with Automatic Management of Spare Nodes*

Hugo Meyer, Dolores Rexachs, Emilio Luque
Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain

Abstract
As the mean time between failures has decreased, applications should be able to handle failures while avoiding performance degradation whenever possible. This work focuses on a decentralized, scalable, transparent and flexible fault tolerant architecture named RADIC, an acronym for Redundant Array of Distributed and Independent Controllers. As MPI is a de facto standard for communication in parallel computers, RADIC has been integrated into it. RADIC's default behavior is to restart failed processes on nodes that are already in use, overloading them. It can also be configured to keep the initial process-per-node ratio, maintaining the original performance in case of failures. In this paper we propose a transparent and automatic management of spare nodes in order to avoid performance degradation and to minimize the mean time to recovery (MTTR) when spares are used. Our work provides transparent fault tolerance for applications written using the MPI standard. Initial evaluations show how application performance is restored to its pre-failure level, minimizing the MTTR by managing faults automatically.

Keywords: RADIC, MPI, Fault Tolerance, Decentralized, Spare Nodes, Uncoordinated Checkpoint.
1 Introduction

* This research has been supported by the MICINN Spain under contract TIN , the MINECO (MICINN) Spain under contract TIN , the European ITEA2 project H4H, No , and the Avanza Competitividad I+D+I program under contract TSI . This paper is addressed to the PDPTA'12 Conference.

Considering the many long-running parallel applications executed on High Performance Computing (HPC) clusters and the increasing failure rate [1] of these computers, it becomes imperative to make such applications resilient to faults. Hardware failures may cause unscheduled stops of applications. Without a fault tolerance mechanism to prevent it, these applications have to be re-executed from the beginning; with one, failures can be treated. In such an environment an automatic and application-transparent fault tolerance mechanism is desirable, and it can also reduce the complexity of application development. Failure treatment and management are crucial to maintain the performance of HPC applications that run for several days. One of the most commonly used approaches to deal with failures in HPC parallel applications is rollback-recovery based on checkpoint and restart protocols. Rollback-recovery protocols periodically save process states in order to roll back in case of faults. Checkpoints can be taken using either a coordinated or an uncoordinated checkpointing protocol. Coordinated checkpointing protocols create a consistent set of checkpoints by stopping all the processes of the parallel application in a consistent state and then taking a snapshot of the entire application. This approach minimizes the overhead of fault-free execution, but in case of faults all processes (even those that have not failed) must roll back to the previously saved consistent state. All the computation performed between the last snapshot and the fault is lost.
In uncoordinated checkpointing protocols, each process is checkpointed individually, possibly at different moments of the execution. Thus, there is no global consistent state. The advantage of this method is that in case of faults only the affected processes must roll back. In order to avoid the domino effect [2], this approach should be combined with an event logging protocol. When a parallel application is executed, we usually seek an execution with an optimal amount of resources that maximizes speedup or efficiency. When a failure occurs and the application loses some resources, all the initial tuning effort is lost. In this paper we present new RADIC [3] enhancements to avoid performance degradation when failures occur. The objective is achieved through automatic spare node management, which maintains the initial amount of resources when node failures occur. We also try to minimize the MTTR after a failure is detected by managing faults without human intervention. For that reason, all fault tolerance tasks and decisions are carried out automatically. The RADIC architecture has been integrated into the Open MPI library to allow the execution of real scientific parallel applications and to be application-transparent. Our approach considers the consequences that node failures have on parallel applications: a physical failure affects computing components, and if these components are not replaced properly there is a loss of computational capacity.

Figure 1. RADIC components

Running a parallel application with fewer resources than the optimum causes this degradation. The rest of this work is organized as follows: section 2 describes the RADIC architecture, its components and how it operates to protect an application against failures. Section 3 introduces related work on fault tolerant systems. Section 4 presents the integration of RADIC into the Open MPI library to provide user-transparent fault tolerance. Next, section 5 illustrates the initial results obtained with the described implementation. Finally, section 6 presents the conclusions and future lines of work.

2 RADIC Architecture

RADIC [3] is a fault tolerant architecture for message passing systems based on rollback-recovery techniques. These techniques rely on an uncoordinated checkpointing protocol combined with a receiver-based pessimistic event log [4]. The chosen approach does not need any coordinated or centralized action or element to carry out its fault tolerance tasks and mechanisms, so application scalability depends only on the application itself. The RADIC architecture acts as a fault tolerant layer between the MPI standard and the (fault-probable) parallel machine. This layer provides a fault-resilient environment for parallel applications even when they run over a fault-probable parallel machine. Our work is focused on providing an application-transparent fault tolerant middleware within a message passing library, specifically Open MPI [5]. Critical data such as checkpoints and event logs are stored in a node different from the one in which the process is running. Processes that were residing in a failed node are restarted in another node from their latest checkpoint, and consume the event log in order to reach their pre-failure state. RADIC policies provide a transparent, decentralized, scalable and flexible fault tolerance solution.
2.1 RADIC components

RADIC provides fault tolerance based on two main components: protectors and observers. Figure 1 illustrates computing nodes (Ny), application processes (Px), protectors (Ty), and observers (Ox), where the sub-index x represents the process number and y represents the node number. Protectors and observers work together with the aim of building a distributed fault tolerant controller. Both components are described below:

Observers: are responsible for process monitoring and fault masking. Each application process has one observer attached to it. The observer performs pessimistic event logging of received messages and also takes periodic checkpoints of the process to which it is attached. Checkpoint and logging data are sent to and stored in the corresponding protector, located in another node (Figure 1). During recovery, the observer is in charge of replaying the event log in order to reach the same state as before the fault.

Protectors: on each node there is one protector running; its main function is to detect node failures via a heartbeat/watchdog protocol. Protectors also store the checkpoints and event logs sent by observers. When a failure occurs, a protector has to restart the failed processes that it protects; it also has to reestablish the heartbeat/watchdog protocol, since it gets broken by node failures.

2.2 RADIC Operation

Fast failure detection is one of RADIC's priorities, since it is one of the variables that affect the MTTR. RADIC's first detection mechanism is a heartbeat/watchdog protocol that allows protectors to learn about failures of neighboring protectors. As every communication goes through the observers, they have absolute control of the message exchange between peers, so observers can also detect and mask faults.
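The heartbeat/watchdog exchange between protectors can be sketched as follows. This is a minimal, hypothetical model for illustration only: the class name, the timer handling and the interval values are not taken from the RADIC implementation.

```python
class Protector:
    """Sketch of a RADIC-style protector's failure detector: each
    protector receives heartbeats from a watched neighbor and arms a
    watchdog timer; if the watchdog expires before a heartbeat
    arrives, the neighbor's node is declared failed.  Times are passed
    in explicitly to keep the sketch deterministic and testable."""

    def __init__(self, node_id, watchdog_timeout):
        self.node_id = node_id
        self.watchdog_timeout = watchdog_timeout
        self.last_heartbeat = 0.0
        self.failed_neighbors = []

    def receive_heartbeat(self, now):
        # A heartbeat from the watched neighbor resets the watchdog.
        self.last_heartbeat = now

    def check_watchdog(self, now, neighbor_id):
        # Declare the neighbor failed if the watchdog has expired.
        if now - self.last_heartbeat > self.watchdog_timeout:
            self.failed_neighbors.append(neighbor_id)
            return False
        return True

# Usage: T2 watches T3 with a (hypothetical) 1-second watchdog.
t2 = Protector(node_id=2, watchdog_timeout=1.0)
t2.receive_heartbeat(now=0.0)
assert t2.check_watchdog(now=0.5, neighbor_id=3)      # T3 still alive
assert not t2.check_watchdog(now=2.0, neighbor_id=3)  # T3 declared failed
```

In the real protocol the heartbeat sender and the watchdog owner run on different nodes; the sketch collapses both ends into one object to show only the timing decision.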
Each protector maintains a data structure called the radictable, with one entry per application process, composed of the process id, the URI of the process, the URI of its protector, and a unique clock of received and sent messages. When a process fails and gets restarted, the observers consult the radictable to find out in which node the process has been recovered, by asking the process's protector. The protectors update the radictable on demand when they identify process failures. Figure 2a shows a fault-free execution using RADIC without spare node support. When a failure occurs (Figure 2b), the parallel application execution continues with one node less. The node failure is detected by the heartbeat/watchdog mechanism. After the failure, the heartbeat/watchdog scheme is reconstructed, and T4 designates T2 as the new protector of P4 (Figure 2c). O4 needs to take a checkpoint of P4, because its latest checkpoint was lost when T3 failed. T2 restarts and re-executes P3 (Figure 2d), and also indicates that the new protector of P3 is T1. Then O3 takes a checkpoint of P3 and sends the data to T1. Finally, O3 erases the old message logs. The protectors have two operating modes: active and passive. A protector is active when it forms part of the detection scheme and there are application processes running on its node (all nodes of Figure 2). Protectors running in a spare node may be in a passive state, a low consumption state (to avoid node and network overload).

2.3 Spare Nodes in RADIC

When a failure occurs, the failed process is restarted in the node where its protector is running; if this node already has application processes running on it, it becomes overloaded. This can slow down the execution of both

Figure 2. a) Fault-free execution. b) Failure in node 3. c) Heartbeat/watchdog restoration and assignment of a new protector to P4. d) Restart of process P3 in node N2.

processes. As a consequence, the performance of the entire application can be affected, increasing its execution time. One method to maintain the initial performance in such a scenario is to use spare nodes to restart failed processes [6] instead of overloading the non-failed nodes. Spare nodes are nodes that are not initially used by the parallel application. In Figure 3a, we can observe the execution of a parallel application using 4 nodes and 1 spare node (NS). When a failure occurs in node N3 (Figure 3b), the protector T2 detects the failure of protector T3 and consults a table containing the location and state of the spare nodes (the sparetable). The sparetable (Table 1) is replicated among all protectors. Spares are assigned as failures occur, and the replicated information is updated on demand, so all operations are made in a decentralized and transparent manner. Eventually, these tables may become outdated; however, this does not affect the RADIC operation, since the information is checked before any spare is used. After consulting its sparetable, the protector T2 confirms the availability of the spare NS (Figure 3c) and, if it is available, T2 transfers the latest checkpoint and event logging data of process P3 to NS (Figure 3d). Finally, the protector TS restarts P3 and becomes an active protector by joining the heartbeat/watchdog protection scheme (Figure 3e).

3 Related Work

Many proposals have been made to provide fault tolerance for message passing applications. Most strategies are based on a coordinated checkpointing approach or on an uncoordinated checkpointing strategy combined with a logging mechanism. Currently, several checkpoint-restart tools are available; we can highlight BLCR (Berkeley Lab's Checkpoint/Restart) [7] and DMTCP (Distributed MultiThreaded Checkpoint) [8]. DMTCP works at user space while BLCR works at kernel level.
BLCR is one of the most widely used libraries to provide fault tolerance in parallel systems. To use BLCR in parallel applications, MPI libraries should at least reopen communication channels after restart [9]. Table 2 summarizes the features of three of the most popular fault tolerant frameworks integrated into MPI libraries, together with our approach. Most solutions use centralized storage; however, for scalability reasons it is desirable to avoid any centralized element. Our approach differs from MPICH-V2 [10] because we do not use any centralized storage: with RADIC, every computing node can store critical data from processes residing in another node. We also use a pessimistic receiver-based logging protocol, whereas MPICH-V2, now a deprecated implementation, uses sender-based logging. MPICH-VCL is designed to reduce overhead during fault-free execution by avoiding message logs. It is based on the Chandy-Lamport algorithm [11]. MPICH2-PCL [12] uses a blocking coordinated checkpointing protocol. LAM-MPI [13] predates Open MPI. It modularizes a checkpoint/restart approach to allow the usage of multiple checkpoint/restart techniques. The implementation supports communications over TCP and Myrinet in combination with BLCR and SELF checkpointing operations. LAM-MPI uses a coordinated checkpoint approach and needs a communication thread between the checkpoint/restart system and the mpirun process to schedule checkpoints.

Figure 3. RADIC with spare nodes. a) Execution before failure with one spare node. b) Failure in node 3. c) The protector T2 checks the availability of spare node NS. d) Protector T2 transfers the checkpoint of process P3 to spare node NS. e) Protector TS restarts process P3 and also its communications.

The current checkpoint/restart implementation of the Open MPI library [9] aims to combine the best features of the methods described above.
The implementation uses a distributed checkpoint/restart mechanism where each checkpoint is taken independently, but coordination is needed to build a consistent global state, which requires the interruption of all processes at the same time. Another work that has become important is the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) [14]. It is a framework that enables system software components to share fault information with other components so that they can adapt to faults. The main difference with our proposal is that we deal with faults automatically and transparently to applications, which allows us to reduce the MTTR.

4 RADIC in MPI

The first prototype of RADIC was called RADICMPI [15] and was developed as a small subset of the MPI standard. As a message passing library it is very limited: since it does not implement all the MPI primitives, it cannot execute many of the available scientific applications.

Table 1. Sparetable
Spare Id  Address  Observers
0         Node5    1
1         Node6    0

RADICMPI does not support collective operations and other complex functions that many applications use. For that reason, instead of extending the prototype to comply with the MPI standard, we decided to integrate the RADIC architecture into a well-established MPI implementation. This allows the correct execution of any MPI application using the fault tolerance policies and mechanisms of RADIC (section 2). In the next paragraphs we explain some important features of the integration of RADIC into Open MPI.

4.1 Open MPI Architecture

An in-depth study of the inclusion of RADIC in Open MPI has been made in [16]. The implementation is named RADIC-OMPI and integrates the basic protection level of RADIC; it does not include spare node management. The Open MPI architecture has already been described in [5]. For that reason, in this paper we focus only on the components relevant to the RADIC integration. The Open MPI frameworks are divided into three groups: Open MPI (OMPI), which provides the API to write parallel applications; the Open Run-Time Environment (ORTE), which provides the execution environment for parallel applications; and the Open Portable Access Layer (OPAL), which provides an abstraction over some operating system functions. To launch a given parallel application, an ORTE daemon is launched in every node that takes part in the parallel application. These daemons communicate among themselves to create the parallel runtime environment. Once this environment is created, the application processes are launched by these daemons. Every process exchanges information about communication channels during the Module Exchange (MODEX) operation, which is an all-to-all communication. The protector functionalities have been integrated into the ORTE daemon because in Open MPI there is always one daemon running in each node, which fits the protector requirements.

Table 2. Fault tolerant MPI libraries.
Name       FT Strategy                                         Detection and Recovery
MPICH-V2   Uncoordinated Ckpt.; sender-based pessimistic       Automatic.
           log; centralized storage.
MPICH-VCL  Coordinated Ckpt. (Chandy-Lamport algorithm);       Automatic.
           centralized storage.
Open MPI   Coordinated Ckpt.; centralized storage.             Fault detection and safe stop; manual recovery.
RADIC      Uncoordinated Ckpt.; pessimistic receiver-based     Automatic and application transparent.
           log; distributed storage.

OMPI provides a three-layer framework stack for MPI communication: the Point-to-point Management Layer (PML), which allows wrapper stacking.
The observer, because of its behavior, has been implemented as a PML component; this ensures the existence of one observer per application process. The Byte Transfer Layer (BTL) implements all the communication drivers, and the BTL Management Layer (BML) acts as a container for the drivers implemented by the BTL framework. The Open MPI implementation provides a framework to schedule checkpoint/restart requests, called the Snapshot Coordinator (SnapC). The generated checkpoints are transferred through the File Manager (FileM) framework. All the communications to schedule and manage the transfer of checkpoint files are made using the Out of Band (OOB) framework.

4.2 RADIC Implementation

To define the initial heartbeat/watchdog fault detection scheme and the protection mapping, a simple algorithm is used: each observer sets as its protector the next logical node, and the last node sets the first one as its protector. All protectors fill the radictable before the parallel application is launched and update it with new information when failures occur. The update of the radictable does not require any collective operation, so many protectors may hold an outdated version of it. However, the radictable is updated later on demand, when observers try to contact restarted processes. Regarding the fault tolerance mechanisms and their integration into Open MPI, the following observations can be made:

Uncoordinated checkpoints: each process performs its checkpoints through a checkpoint thread. Checkpoints are triggered by a timer (the checkpoint interval) or by other events. Before a checkpoint is taken, all communication channels are flushed to ensure that there are no in-transit messages, and they remain unused until the checkpointing operation finishes. After a checkpoint is taken, each process transfers its checkpoint files using the FileM framework, and then communication between processes is allowed again.
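The per-process checkpoint sequence just described (flush channels, block communication, checkpoint, transfer to the protector, resume) can be sketched as an ordered event trace. This is a hypothetical model: the event names stand in for RADIC/Open MPI internals (channel flushing, the BLCR-style checkpoint, the FileM transfer) and are not actual library calls.

```python
def take_checkpoint(events, process_id, protector):
    """Record the uncoordinated checkpoint sequence of one process
    as an ordered list of (event, ...) tuples."""
    events.append(("flush_channels", process_id))  # drain in-transit messages
    events.append(("block_comm", process_id))      # channels stay unused
    events.append(("ckpt", process_id))            # local checkpoint image
    events.append(("transfer", process_id, protector))  # FileM-like transfer
    events.append(("resume_comm", process_id))     # communication allowed again
    return events

# Usage: process P3 checkpoints and ships the image to its protector T1.
log = take_checkpoint([], process_id=3, protector="T1")
# Channels are flushed before the image is taken, and communication
# resumes only after the transfer to the protector has been issued.
assert log.index(("flush_channels", 3)) < log.index(("ckpt", 3))
assert log.index(("transfer", 3, "T1")) < log.index(("resume_comm", 3))
```

The ordering constraints asserted at the end are the essential correctness property: no message may be in flight when the image is taken, and the process is unprotected until the image reaches its protector.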
Message log: since the observer is located in the PML framework, it ensures that all communications through it are logged and then transferred to the corresponding protector. The protector only confirms a message reception after the message has been saved. Messages are marked as received by the remote process only after both the receiver and its protector confirm the reception (pessimistic receiver-based log).

Failure detection mechanism: failures are detected when communications fail; this mechanism requires the modification of lower layers to raise errors to the PML framework, where the faults are managed. This avoids application stops. A heartbeat/watchdog mechanism is also used: the protectors send heartbeats to the next logical node, and the receiving protector resets its watchdog timer after each reception.

Failure management: the default behavior of the library is to finalize when a failure occurs (fail-stop). Hence RADIC needs to mask failures to continue execution and avoid fault propagation to the application level. When a protector finds out about a failure, the restart operation is initiated.

Recovery: the recovery is composed of three phases. First, a protector restores the failed process from its checkpoint together with its attached observer. Then the restored observer sets its new protector and re-executes the process while consuming the event log, after which it takes a checkpoint. Finally, the process execution is resumed once its checkpoint has been sent to its new protector, to ensure its protection. The protectors involved in the fault also reestablish the protection mechanism. We consider the recovery an atomic procedure.

Reconfiguration: when the recovery ends, the communications have to be restored. To achieve this goal, the lower layers of Open MPI must be modified to redirect all communications to the new address of the process. To avoid collective operations, this information is updated on demand or by a token mechanism.

4.3 Proposal: Spare Nodes Management in Open MPI

An important aspect that has to be considered when running parallel applications is performance. The previous implementation of the RADIC architecture [16] allows the successful completion of parallel applications even in the presence of failures. However, it does not consider the management of extra resources to replace failed nodes. By including spare node management in RADIC, applications not only finish correctly but also avoid the performance degradation caused by the loss of computational resources. Our proposal is not restricted to avoiding performance loss: we also propose a mechanism to automatically select spare nodes and include them in the parallel environment domain without user intervention. By managing spare nodes transparently and automatically, we minimize the MTTR. When spare nodes are included in the RADIC architecture, the restart and reconfiguration mechanisms are the most affected. To reconfigure the system, a deterministic algorithm to find restarted processes is needed. When using RADIC without spare nodes (Figure 2), failed processes are restarted on their protectors. If an observer tries to reach a relocated failed process, it looks at its radictable to find the old protector of the failed process (this information may be outdated). Then, the observer asks that protector about the process. The old protector answers that it is no longer protecting the process and points to the new protector (Figure 2). If a failure occurs and there are spare nodes available, the spare is included in the parallel environment domain and the failed processes are restarted on it.
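The two mechanisms just described, the on-demand radictable lookup and the sparetable-driven restart, can be sketched as follows. This is a minimal, hypothetical model: the dict-based tables and the function names are illustrative and are not taken from the RADIC implementation.

```python
def locate_process(radictable, protector_maps, pid):
    """On-demand lookup: ask the (possibly stale) protector recorded
    for `pid`; if it redirects us, refresh the local radictable entry."""
    old = radictable[pid]["protector"]
    new = protector_maps[old].get(pid)   # old protector answers the query
    if new is not None:                  # process was relocated
        radictable[pid]["protector"] = new
        return new
    return old                           # entry was already up to date

def restart_on_spare(sparetable, pid, checkpoints):
    """Pick the first alive spare with no observers; return None when
    the application has run out of spares (RADIC's default path)."""
    for sid in sorted(sparetable):
        spare = sparetable[sid]
        # The replicated sparetable may be outdated, so availability
        # is always re-checked before the spare is actually used.
        if spare["alive"] and spare["observers"] == 0:
            spare["observers"] += 1   # the spare now hosts a process
            return {"pid": pid, "node": spare["address"],
                    "image": checkpoints[pid]}
    return None

# P3's stale entry still names T2; T2 redirects the query to T1.
radictable = {3: {"uri": "node2", "protector": "T2"}}
protector_maps = {"T2": {3: "T1"}, "T1": {}}
assert locate_process(radictable, protector_maps, 3) == "T1"

# Sparetable mirroring Table 1: Node5 already hosts an observer,
# so the failed process P3 is placed on Node6.
sparetable = {0: {"address": "Node5", "observers": 1, "alive": True},
              1: {"address": "Node6", "observers": 0, "alive": True}}
placement = restart_on_spare(sparetable, 3, {3: "ckpt-P3"})
assert placement["node"] == "Node6"
```

Because both lookups repair stale entries as a side effect, no collective update of the tables is ever needed, which matches the decentralized, on-demand update policy described above.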
The heartbeat/watchdog mechanism is then reestablished and the involved protectors update their radictable and sparetable (Table 1). Considering Figure 3e, if process P1 wants to reach P3, O1 asks T2 about P3. T2 indicates that P3 now resides in the spare NS. Then O1 tells T1 to update its radictable and its sparetable, and P1 finally contacts P3. The process described above is distributed and decentralized, and each process performs it only when strictly necessary, avoiding the costly Module Exchange (MODEX) collective of Open MPI. The main problem when restarting a process in another node is that an ORTE daemon must be running in that node to adopt the new process as a child. Moreover, all future communication with the restarted process needs to be redirected to its new location. For that reason, ORTE daemons are launched even in spare nodes, but no application process is launched on them until they are required. An additional problem that must be addressed is that a sender observer must not treat as a failure the lack of communication with a receiver process that is checkpointing or restarting. The sender observer will fail to communicate and will consult the receiver's protector to find out the state of the receiver. The protector will indicate that the process is checkpointing or restarting, and the communication will be retried later. The radictable and the sparetable have been included inside the job information structure (orte_jmap_t). When the parallel application starts, each protector (ORTE daemon) populates its radictable and its sparetable. Both tables are updated on demand when a protector notices that a process has been restarted somewhere else. If the application runs out of spares, the default mechanism of RADIC is used (Figure 2).

5 Experimental Results

A fault tolerant architecture generally introduces some kind of overhead in the system it is protecting.
These overheads are generally caused by replication in some of its forms. The overheads introduced by RADIC are mostly caused by the uncoordinated checkpoints and the pessimistic log mechanism, as shown in [16]. Failures may cause degradation because of the loss of computational capacity if no spare nodes are available. The experimental evaluation shows how fast the failure detection and recovery mechanisms of our proposal are, and how fast spare nodes can be automatically included in the parallel environment in order to avoid an impact on application performance when resources are lost. We present experimental results using three different benchmarks: a static matrix multiplication benchmark, the LU benchmark from the NAS Parallel Benchmarks (NPB) [17], and the SMG2000 application [18]. The matrix multiplication is modeled as a master/worker application: the master sends the data to the workers only at the start and collects the results when the application finishes. Each application process is assigned to one core during normal execution. The matrix multiplication implemented has few communications (only at the beginning and at the end). Experiments have been made using a Dell PowerEdge M600 with 8 nodes, each node with 2 quad-core Intel Xeon E5430 processors running at 2.66 GHz. Each node has 16 GBytes of main memory and a dual embedded Broadcom NetXtreme II 5708 Gigabit Ethernet. RADIC has been integrated into version 1.7 of Open MPI.

Figure 4. Throughput of the matrix multiplication application with and without spare nodes (32 processes, checkpoint interval = 30 sec).

Figure 5. a) Performance of the LU benchmark with and without spare nodes. b) Performance of the SMG2000 application with and without spare nodes.

Our main objective is to depict the avoidance of application performance degradation when failures occur in parallel computers. By using spare nodes automatically and transparently to restart failed processes, we can decrease the MTTR to a minimum while maintaining the application performance it had before the failure. As mentioned before, it is crucial to deal with failures as fast as possible. If the application loses a node and we use the default approach of RADIC (Figure 2), one of the nodes becomes overloaded; as a consequence, the throughput of the whole application may decrease. Replacing failed nodes with spares is not trivial, because it is necessary to include the spare node in the parallel environment world and then restart the failed process or processes on it transparently and automatically. When this is done, application performance is affected only for a short period. The experiments depict how performance (in terms of throughput) is affected after a failure when no spare nodes are used, and the benefits of using them. To obtain the operations per second of the matrix multiplication application, we divide the sub-matrix size that each process computes by the time spent in an internal iteration. The checkpoint intervals used in the experiments are for test purposes only; valid checkpoint intervals can be derived with the model proposed in [19]. Figure 4 shows three executions of the matrix multiplication benchmark. The green line shows the fault-free execution. The blue line shows the execution using RADIC with 2 spare nodes (and 2 failures). The red line shows the execution using RADIC without spare nodes, and with one failure. When a fault occurs and the application is not using spare nodes, the failed processes are restarted on their protectors and these nodes become overloaded.
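The throughput metric used for the matrix multiplication (work per process divided by the time of one internal iteration) can be reproduced with a small calculation. The numbers below are made up for illustration and are not the paper's measured values; only the resulting ~40% loss matches the degradation reported for the overloaded case.

```python
def throughput(submatrix_elems, iteration_seconds):
    """Operations per second, as defined for the matrix benchmark:
    the sub-matrix size each process computes divided by the time
    spent in one internal iteration."""
    return submatrix_elems / iteration_seconds

# Illustrative numbers only (not measured values from the paper):
fault_free = throughput(1_000_000, 0.50)
# After a failure without spares, two processes share one node and
# each internal iteration takes longer.
overloaded = throughput(1_000_000, 0.83)
loss = 1 - overloaded / fault_free
assert round(loss, 2) == 0.40   # roughly the 40% loss reported below
```

The same calculation applied per iteration over time yields the throughput curves of Figure 4: a flat line for the fault-free run, a permanent drop without spares, and a short dip followed by full recovery with spares.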
This happens because processes compete for the available resources, and the application loses about 40% of its initial throughput. If we use RADIC with spare nodes, the application loses throughput only for a short period, until a spare is selected and the process is restarted on it; after that, the initial throughput is restored. The static matrix multiplication is only a synthetic application. To evaluate the throughput changes with real applications using RADIC, we designed a new set of experiments using the LU benchmark and the SMG2000 benchmark (Figure 5). Figure 5a depicts the behavior of the LU benchmark using RADIC with and without spare nodes (and without faults). The application has an irregular behavior because it computes sparse matrices. If a failure occurs and the application does not have any spare node, it loses about 35% of its throughput. However, if the application uses spare nodes, the throughput is reduced by 25% only during the recovery, after which the initial throughput is recovered. Figure 5b depicts the throughput of the SMG2000 benchmark. The checkpoint and restart operations are quite expensive for this application because the memory footprint of each process is about 2 GB. If a fault occurs and the process is restarted on its protector, the application loses about 15% of its initial throughput. However, if the process is restarted on a spare node, the initial throughput is maintained. As is known, the execution time of an application affected by faults depends on the moment at which the failure occurs. For that reason, when treating faults we focus on showing the degradation in terms of throughput, not in terms of execution time. Considering the results, we can conclude that transparent and automatic management of spare nodes avoids increases in the MTTR and maintains the application throughput, avoiding system overload.
6 Conclusions and Future Work

The proposal presented in this paper demonstrates that the RADIC policies are effective to automatically and transparently manage spare nodes, avoiding long recovery times while maintaining the initial application performance. We have presented the design and evaluation of an alternative method to restart failed processes automatically while maintaining the original computational capacity. This is an important issue because applications are usually configured to execute with an optimal number of nodes, and losing computational resources due to hardware failures decreases application performance.

Having scalability as an objective, it is imperative to use a decentralized fault tolerance approach. Furthermore, when failures occur, a transparent and automatic fault treatment is desirable, because then the parallel application experiences performance degradation only for a short period of time. A fault tolerance architecture with the characteristics of RADIC is desirable because it does not require any user intervention and it is also configurable to use available resources. When running parallel applications in computer clusters there are frequently free nodes that are not being used by any application, so these nodes can be used as spare nodes.

The implementation of RADIC in the Open MPI library has several advantages. The first is that Open MPI is widely used in the scientific community, which allows RADIC to be used with real scientific applications. The second is that our implementation simplifies process migration without stopping the parallel application execution. Initial analyses also show that RADIC will integrate well with the MPI-3 standard, which will simplify failure management because more information about failures will be available, increasing the possibilities of taking corrective actions.

The integration of RADIC into a stable Open MPI implementation, as well as providing an interface for live migration, are pending tasks. RADIC also needs to take into account applications with I/O events (transactional applications).

7 References

[1] Bianca Schroeder and Garth A. Gibson, "Understanding failures in petascale computers," Journal of Physics: Conference Series, vol. 78.
[2] B. Randell, "System structure for software fault tolerance," SIGPLAN Not., vol. 10, no. 6, April.
[3] Amancio Duarte, Dolores Rexachs, and Emilio Luque, "Increasing the cluster availability using RADIC," IEEE International Conference on Cluster Computing.
[4] E. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, September.
[5] E. Gabriel et al., "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation," 2004.
[6] Guna Santos and Angelo Duarte, "Increasing the Performability of Computer Clusters Using RADIC II," International Conference on Availability, Reliability and Security.
[7] Jason Duell, "The design and implementation of Berkeley Lab's Linux Checkpoint/Restart."
[8] Jason Ansel, Kapil Arya, and Gene Cooperman, "DMTCP: Transparent checkpointing for cluster computations and the desktop," 2009.
[9] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), in conjunction with IPDPS, pp. 1-8.
[10] Aurélien Bouteiller and Thomas Hérault, "MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging," in Supercomputing, 2003 ACM/IEEE Conference, 2003.
[11] K. M. Chandy and Leslie Lamport, "Distributed snapshots: determining global states of distributed systems," ACM Trans. Comput. Syst., vol. 3, no. 1, February.
[12] Darius Buntinas et al., "Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols," Future Generation Comp. Syst., vol. 24.
[13] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in Proceedings, LACSI Symposium, Santa Fe.
[14] Rinku Gupta et al., "CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems," International Conference on Parallel Processing.
[15] Angelo Duarte, Dolores Rexachs, and Emilio Luque, "An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI."
[16] Leonardo Fialho, Guna Santos, Angelo Duarte, Dolores Rexachs, and Emilio Luque, "Challenges and Issues of the Integration of RADIC into Open MPI," 2009.
[17] D. Bailey et al., "The NAS Parallel Benchmarks," International Journal of High Performance Computing Applications.
[18] Kiattisak Ngiamsoongnirn, Ekachai Juntasaro, Varangrat Juntasaro, and Putchong Uthayopas, "A Parallel Semi-Coarsening Multigrid Algorithm for Solving the Reynolds-Averaged Navier-Stokes Equations," 2004.
[19] Leonardo Fialho, Dolores Rexachs, and Emilio Luque, "What is Missing in Current Checkpoint Interval Models?," International Conference on Distributed Computing Systems, 2011.


More information

Checkpoint/Restart System Services Interface (SSI) Modules for LAM/MPI API Version / SSI Version 1.0.0

Checkpoint/Restart System Services Interface (SSI) Modules for LAM/MPI API Version / SSI Version 1.0.0 Checkpoint/Restart System Services Interface (SSI) Modules for LAM/MPI API Version 1.0.0 / SSI Version 1.0.0 Sriram Sankaran Jeffrey M. Squyres Brian Barrett Andrew Lumsdaine http://www.lam-mpi.org/ Open

More information

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

Consolidating OLTP Workloads on Dell PowerEdge R th generation Servers

Consolidating OLTP Workloads on Dell PowerEdge R th generation Servers Consolidating OLTP Workloads on Dell PowerEdge R720 12 th generation Servers B Balamurugan Phani MV Dell Database Solutions Engineering March 2012 This document is for informational purposes only and may

More information

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone:

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone: Some Thoughts on Distributed Recovery (preliminary version) Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: 409-845-0512 Fax: 409-847-8578 E-mail:

More information

Recovering Device Drivers

Recovering Device Drivers 1 Recovering Device Drivers Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy University of Washington Presenter: Hayun Lee Embedded Software Lab. Symposium on Operating Systems

More information

Automated Timer Generation for Empirical Tuning

Automated Timer Generation for Empirical Tuning Automated Timer Generation for Empirical Tuning Josh Magee Qing Yi R. Clint Whaley University of Texas at San Antonio SMART'10 1 Propositions How do we measure success for tuning? The performance of the

More information

Applying RADIC in Open MPI

Applying RADIC in Open MPI Departament d'arquitectura de Computadors i Sistemes Operatius Màster en Computació d'altes Prestacions Applying RADIC in Open MPI The methodology used to implement RADIC over a Message Passing Library

More information

Open MPI und ADCL. Kommunikationsbibliotheken für parallele, wissenschaftliche Anwendungen. Edgar Gabriel

Open MPI und ADCL. Kommunikationsbibliotheken für parallele, wissenschaftliche Anwendungen. Edgar Gabriel Open MPI und ADCL Kommunikationsbibliotheken für parallele, wissenschaftliche Anwendungen Department of Computer Science University of Houston gabriel@cs.uh.edu Is MPI dead? New MPI libraries released

More information

Scalable Fault Tolerance Schemes using Adaptive Runtime Support

Scalable Fault Tolerance Schemes using Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

New Features in LS-DYNA HYBRID Version

New Features in LS-DYNA HYBRID Version 11 th International LS-DYNA Users Conference Computing Technology New Features in LS-DYNA HYBRID Version Nick Meng 1, Jason Wang 2, Satish Pathy 2 1 Intel Corporation, Software and Services Group 2 Livermore

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 17 - Checkpointing II Chapter 6 - Checkpointing Part.17.1 Coordinated Checkpointing Uncoordinated checkpointing may lead

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 28 th October 2013 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information