On the Survivability of Standard MPI Applications

On the Survivability of Standard MPI Applications

Anand Tikotekar 1, Chokchai Leangsuksun 1, Stephen L. Scott 2
Louisiana Tech University 1, Oak Ridge National Laboratory 2
box@latech.edu 1, a007@latech.edu 1, scottsl@ornl.gov 2

Abstract. Job loss due to failure represents a common vulnerability in High Performance Computing (HPC), especially in the Message Passing Interface (MPI) environment. Rollback-recovery has been used to mitigate faults for long-running applications. However, to date, rollback-recovery mechanisms such as checkpointing alone may not be sufficient to ensure fault tolerance for MPI applications, owing to the static view MPI takes of its cooperating machines and its lack of resilience to outages. In fact, MPI applications are prone to cascading failures, where one participating node causes total failure. In this paper, we address fault issues in the MPI environment by improving runtime availability with self-healing and self-cloning that tolerate outages of cluster computing systems. We develop a framework that augments a standard HPC cluster with a job-level fault tolerance capability that preserves the job queue and enables a parallel MPI job submitted through a resource manager to continue non-stop execution even after encountering a failure.

1. Introduction

Large clusters are now ubiquitous in high performance computing due to their ability to solve computationally intensive problems cost effectively. Jobs submitted to a large-scale cluster are typically parallel, MPI-type applications, which can be very long-running. Imagine a scenario where the application completion time is an order of magnitude longer than the system mean time to failure. If the HPC job crashes due to a failure in any cluster node, the application must be resubmitted. To our knowledge, this outage phenomenon is MPI's Achilles heel, because MPI runtimes are based on a static implementation [4]. An application can neither take advantage of added nodes nor resize itself to fewer nodes after initialization.

1 Research supported by the Office of Science, U.S. Department of Energy, the FASTOS program, Grant # DE-FG02-04ER.
2 Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Furthermore, MPI applications also face cascading failures, whereby a single node can stall the entire MPI job. By and large, parallel jobs are submitted through a resource manager. A typical resource manager has no knowledge of whether an MPI job has stalled or is still actively running, and this type of detection is a very difficult task: the job may remain in the running state, since the runtime daemon or the main MPI process is still active, even while a failure occurs on one of the participating nodes. Some resource managers today provide intrinsic checkpoint support, such as Condor [1] and LSF [2]. However, their checkpoint mechanisms were not designed to provide transparent fault tolerance for running MPI applications: any node failure during execution results in the failure of the whole job. Moreover, the whole job queue is lost when the head node is lost. There are existing works that provide a coordinated checkpoint mechanism for MPI. Unfortunately, checkpoint recovery will not succeed after a permanent failure where the restart environment is not identical to the one before the failure. Moreover, because of the static nature of MPI, a solution must keep track of which jobs are queued, launched, or running on which nodes in order to address the aforementioned recovery issues. This information enables transparent fault tolerance through job queue replication and automatic resubmission of jobs in the event of node outages. Our framework introduces a job registration/retrieval (JRR) mechanism that keeps track of detailed job information until completion. By improving runtime availability with multi-head HA-OSCAR self-healing and self-cloning, and by incorporating JRR and existing coordinated-checkpoint mechanisms, our proof of concept demonstrates that any parallel MPI job submitted through a normal resource manager can continue its execution non-stop even when encountering failures.

We schematically abstract our framework by illustrating basic failure modes in a cluster in Section 3. Section 4 sketches the proposed framework, and Section 4.1 describes the job registration/retrieval (JRR) mechanism. Section 5 presents the performance study and experimental overhead of our framework, followed by the conclusion in Section 6.

2. Related Work

There are different approaches available to tackle reliability issues in HPC, including high availability (HA), resource managers with fault resilience, checkpoint-recovery mechanisms, etc. LinuxHA [12] is a tool for building HA clusters. However, LinuxHA only provides a heartbeat and failover mechanism for a flat-structure cluster, which may not be suitable for HPC clusters. OSCAR [14] and ROCKS are common software stacks for deploying and managing Beowulf clusters [14], [7]. As far as availability is concerned, Beowulf clusters suffer from the head node being a single point of failure (SPF): the cluster can go down completely with the failure of the head node. Thus, there is a need to focus on the high availability aspect of cluster design. HA-OSCAR deals with availability and fault issues at the gateway node with a multi-head failover architecture and service-level fault tolerance mechanisms. Replication, proactive monitoring, and recovery are essential techniques in HA-OSCAR. Moreover, HA-OSCAR provides a flexible and extensible

interface for customizable fault management, failover operation, and alert management. Typically, HPC job management systems consist of two parts: resource managers and dedicated job schedulers. PBS [11], Torque [19], SLURM [17], and SGE [8] are resource managers commonly deployed in HPC clusters. Torque is based on the OpenPBS project; it is open source software enhanced with better scalability, node fault tolerance, and improved logging facilities compared to OpenPBS. On the downside, the Torque server is a single point of failure at the head node, and Torque provides no checkpoint support. SGE [8] supports a checkpoint mechanism but has known limitations in controlling parallel MPI applications and in retrieving resource usage data. Lawrence Livermore National Laboratory's SLURM [17] was developed with scalability in mind, coupled with fault tolerance and simple management. SLURM has an active/standby server configuration for fault tolerance: in the event of an outage of the primary SLURM control daemon, the backup controller assumes control. The SLURM controller daemon writes its current state to disk, so that the job queue is preserved when the backup controller takes over. Currently, SLURM has no checkpoint support and does not address already-running jobs. Condor is used for high throughput computing. Condor supports both serial and parallel jobs but provides checkpointing and process migration only for serial jobs, and the Condor central manager is also a single point of failure. In [3], the authors developed a system called Déjà vu aiming to achieve transparent fault tolerance for HPC environments. Déjà vu is not open source software and, to date, has not been released. Moreover, it requires a compiler-based code instrumentation approach, which may introduce software bugs and overhead. Zhang [23] designed a system based on checkpoint-based rollback recovery and migration. This system relies on migrating processes in the case of node failures and suffers from a single point of failure. One of the important schemes to achieve fault tolerance in HPC clusters is the checkpoint/restart technique. There are numerous Linux-based checkpoint/restart packages, such as BLCR [10], CoCheck [7], Epckpt [20], and Libckpt [18]. In [13], Morin et al. propose a transparent checkpoint mechanism for message-passing parallel applications in Kerrighed. LAM/MPI [16] supports coordinated checkpoint/restart based on BLCR. However, when a failure occurs, most existing checkpoint/restart tools require a manual restart. To support transparent recovery, one must satisfy availability requirements and provide self-awareness to tolerate outages and perform automatic recovery at reasonable cost; the checkpoint scheme should be used in conjunction with the runtime, such as the job scheduling mechanism, to achieve self-recovery. At the user level, there have been attempts to provide MPI fault tolerance, such as FT-MPI [6]. FT-MPI provides a try-catch approach to detect a process (MPI communicator) outage and can recover from the failure, but FT-MPI alone cannot provide fault tolerance, especially at the head node. Open MPI [5] is a recent open source, production-quality MPI implementation that aims to consolidate the strong points of other MPI implementations. It also aims to provide fault tolerance but leaves some questions open, such as how to detect faults and propagate event notifications to processes. To summarize, the above solutions, although tackling the same problem, are lacking in at least one of the following areas:

1. Some solutions do not provide fault tolerance for parallel jobs in the case of a head node failure.
2. Solutions that provide head node outage support, such as SLURM, do not account for compute node failures.
3. Some solutions do not provide checkpointing as external support, and one cannot use native methods with the particular solution.
4. Products such as Déjà vu are not open source.

Our approach addresses all of the above issues in a single solution.

3. Basic MPI job failure cases and proposed resolution

Fig 1: Head node outage, basic failure mode 1

Case 1: Fig. 1 depicts a parallel job launched on the head node with its processes running on compute nodes. We assume that HPC applications are MPI-type and that the system is a multi-head HA-OSCAR cluster, in order to address the single point of failure in a typical cluster. When the head node fails, the standby head node takes control (i.e., fails over) and must recover all the jobs. The job queue must therefore be replicated or stored on reliable storage accessible by the standby head. Running jobs must be periodically saved; this checkpoint mechanism ensures fault tolerance for the running applications. All running jobs have to be restarted and queued jobs must be restored, since the resource management and scheduling service must first be restarted after the head node failure.

Fig 2: Compute node outage, basic failure mode 2

Case 2: Fig. 2 shows a parallel job launched on a compute node (e.g., C1) with its processes running on other nodes, after which the node (C1) crashes. In this case, jobs running on the failed node must be restarted from the last checkpointed state. We must also remove the original stalled jobs before recovering them from their checkpoints. There are other failure cases that depend upon where the jobs are launched (either on the head node or on a compute node) and which corresponding MPI processes crash. Our framework aims to handle all of these cases. Moreover, the framework is also designed to cope with back-to-back failures of a compute node and the head node, as well as multiple compute node failures at the same time.

4. Algorithm

In this section, we detail the algorithm for the proposed framework. The algorithm was implemented and validated with HA-OSCAR 1.1, LAM/MPI 7.0, Torque, and BLCR; it should be straightforward to adapt it to other implementations. We also enhance HA-OSCAR to deal with compute node outages using a lazy redundancy cloning technique: an on-demand compute node cloning, similar to failover, that ensures high availability of compute nodes in a permanent failure situation. Details of the lazy redundancy technique can be found in [22].

S0: Start the job queue replicator, the compute node failure detection mechanism (CN), and the restarting mechanism (RM) daemons on the head node.
S1: Submit an MPI parallel job through a scheduling interface.
S2: Replicate the job on the backup with its status set to held.
S3: Do until no node outage: // the following activities are performed on node X

    Preprocessing
        Retrieve the job id and user.
        Call the checkpoint controller with a tunable interval.
    Job registration
        Determine the allocated nodes using the LAM/MPI policy.
        Determine the MPI ID of the job.
        Insert the job details into the database for further analysis.
    Checkpoint
        Checkpoint the job at the tunable interval.
        Copy the checkpoint files to the backup.
    Creation of a job file
        Create a job specification file for restarting purposes.
        Copy the job file to the standby node.
    Post processing
        Delete all the spec files.
        Delete the checkpoint files if the root job finishes.
S4: End Loop
S5: If there is a head node outage: // on the standby head node
    a. Start the head node failover mechanism
        i. Clone the head node
        ii. Start the resource manager and scheduler
        iii. Call the RM
    b. Determine the list of jobs to be restarted.
        i. Restart all jobs that were running at the time of failure.
    c. Create the respective LAM environment to submit jobs through the RM from each directory
        i. Determine the needed session suffix for LAM before creating the LAM environment.
    d. Release the remaining jobs that are in the queued state.
    e. Go to step S3
S6: End If
S7: If there is a compute node crash: // on the primary head node
    a. Start the compute node failover mechanism
        i. Clone the failed node with the spare one
        ii. Restart all the necessary services
        iii. Call the RM
    b. Determine the list of jobs to be restarted.
        i. Scan the respective directories
    c. Retrieve the job information and, based on that information, submit the job through the RM after creating the respective LAM environment.
        i. Retrieve the information regarding the allocated nodes using the job id
    d. Go to step S3
S8: End If

The above algorithm has three main phases. Job registration takes place as each job enters the system. Two daemons, the head node failure detector (HN) and the compute node failure detector (CN), execute their respective algorithms after an outage is detected.
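The paper describes the per-job cycle of step S3 only in outline. As a rough illustration, the following Python sketch mimics that cycle under several assumptions: the registration schema, file locations, and stubbed status query are hypothetical stand-ins rather than the authors' implementation, and the actual framework drives LAM/MPI, BLCR, and Torque rather than the placeholder shell commands used here.

```python
import sqlite3
import subprocess
import time

REGISTRY_DB = "/shared/jrr/registry.db"   # assumed reliable shared storage location
BACKUP_CKPT_DIR = "standby:/var/ckpt/"    # assumed standby-node destination

def register_job(job_id, user, mpi_pid, node_flags):
    """Record a JRR entry: job id, user, MPI process id, and a boolean per
    compute node indicating whether the job has processes on that node."""
    con = sqlite3.connect(REGISTRY_DB)
    con.execute("CREATE TABLE IF NOT EXISTS jobs "
                "(job_id TEXT PRIMARY KEY, user TEXT, mpi_pid INTEGER, nodes TEXT)")
    nodes = ",".join(f"{name}={int(flag)}" for name, flag in node_flags.items())
    con.execute("INSERT OR REPLACE INTO jobs VALUES (?,?,?,?)",
                (job_id, user, mpi_pid, nodes))
    con.commit()
    con.close()

def checkpoint_cycle(job_id, mpi_pid, interval_s):
    """Checkpoint the job at a tunable interval and copy the image to the backup."""
    while job_is_running(job_id):
        # BLCR-style checkpoint of the mpirun process (illustrative invocation).
        subprocess.run(["cr_checkpoint", str(mpi_pid)], check=False)
        subprocess.run(["scp", f"context.{mpi_pid}", BACKUP_CKPT_DIR], check=False)
        time.sleep(interval_s)

def job_is_running(job_id):
    # The real framework would query the resource manager (e.g., Torque); stubbed here.
    return False
```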

Fig. 1 and Fig. 2 illustrate the two main failure and recovery scenarios: Fig. 1 corresponds to step S5 of the algorithm, while Fig. 2 corresponds to step S7. We rely on the cloning approach to replace the failed node, as well as on maintaining the application runtime configuration and on quick detection and recovery mechanisms.

4.1. Job Registration/Retrieval mechanism

This section describes how job registration and retrieval (JRR) takes place. The job registration mechanism stores important information about applications and their runtime on reliable storage, such as the job id, a boolean value for each node indicating whether the node hosts processes of the given job, and the MPI process id. Registration takes place from the node where the job is launched by the scheduler. The boolean values are derived from the node allocation policies found in a typical resource management system; in our test bed, the LAM/MPI implementation controls the allocation policy. This information is crucial for the recovery mechanism. If the number of processes is greater than the number of virtual processors specified, then all node columns have the value 1, indicating that the job runs on all nodes; this in turn means that the job must be restarted whenever any node fails. During recovery, the registration provides the information needed to determine where to restart a job, given its job id. It is imperative that when a job finishes, its registration is cleaned up; this task is handled by our enhancement to the post-processing module of the resource management system.

5. Results and analysis

Fig. 3(a) shows a comparison of various recovery approaches. In the head node recovery scenario, the standby node assumes control over the failed primary head and automatically restarts running jobs from their last checkpoint, followed by the rest of the jobs in the queue, preserving the queue sequence. This is conveyed in Fig. 3(a), where the total completion time for a given job after the head node crashes is T_R + R_T, where T_R is the time required for transparent recovery and R_T is the time to complete the job after its last checkpoint. In our experiments, the transparent recovery time is about 60 seconds in the case of a head node failure. Fig. 3 also provides a comparison with other popular approaches. Fig. 3(b) describes a compute node failure scenario, where the total completion time for a job after the node crash is T_R + T_S + R_T, where T_S is the time the job spends waiting in the queue. The transparent recovery time in the case of client failover is about 30 seconds. The job completion times in the two scenarios (head node crash and compute node crash) therefore differ by the time the job spends waiting in the queue before it runs. For brevity, we omit a detailed study of checkpoint overhead; the performance overhead of LAM/MPI and BLCR to checkpoint and restart an MPI job is studied in [16].
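For concreteness, a small sketch of this completion-time breakdown, using the recovery times reported above (about 60 s for head node failover and about 30 s for compute node failover); the T_S and R_T values below are arbitrary placeholders, not measurements from the paper:

```python
def completion_after_failure(t_recover_s, t_queue_s, t_remaining_s):
    """Time from failure to job completion: T_R + T_S + R_T.
    For a head node failure T_S is 0 in the paper's model, since the
    restarted job does not wait behind other jobs in the queue."""
    return t_recover_s + t_queue_s + t_remaining_s

# Head node failure: T_R ~ 60 s, no queue wait, 10 minutes of work left after the last checkpoint.
print(completion_after_failure(60, 0, 600))    # -> 660 seconds

# Compute node failure: T_R ~ 30 s, plus an assumed 120 s queue wait for the resubmitted job.
print(completion_after_failure(30, 120, 600))  # -> 750 seconds
```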

Fig 3(a), 3(b): Recovery analysis in the case of head node and client (compute) node failure

The experiments were conducted on a Linux cluster based on OSCAR 4.1 [14] with Torque as the resource management system. The native node OS is Red Hat 9.0. The cluster consists of one head node, one standby head node, one compute node, and one spare node. We chose LAM/MPI as the MPI implementation, BLCR for checkpoint/restart, and the FAM [15] interface for replication of the jobs. We employ pbs_sched as our scheduler. We then enhanced the setup with HA-OSCAR 1.1, which enables transparent recovery in the case of node failure. In our test environment we simulated two types of failures: shutting down the head node and compute nodes, and unplugging the network cables of the head node and compute nodes. We conducted our experiments using a communication-intensive job called Ring. We chose this job because its duration increases with the number of processes; we could therefore extend or shorten the job duration simply by increasing or decreasing the number of processes (a minimal sketch of such a ring-style job follows below).

Case 1: Two jobs were submitted and launched on the compute node (CN) and the head node (HN), respectively. Job A was launched on the compute node with 6 processes; Job B was launched on the head node with 4 processes. We simulated the failure in this case by unplugging the network cable of the compute node.
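The paper does not give the source of the Ring benchmark. The sketch below is only an illustration of a ring-style communication pattern, written with mpi4py (an assumption, not the authors' code), in which each added process lengthens every lap of the ring and thus the job's runtime.

```python
# Minimal ring-style MPI job (illustrative only; not the authors' Ring benchmark).
# Requires mpi4py; run e.g. with: mpirun -np 6 python ring.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

token = 0
for _ in range(100000):                       # many laps around the ring
    if rank == 0:
        comm.send(token, dest=(rank + 1) % size)
        token = comm.recv(source=(rank - 1) % size)
    else:
        token = comm.recv(source=(rank - 1) % size)
        comm.send(token + 1, dest=(rank + 1) % size)
```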

Table 1: Job statuses and breakdown for Case 1

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    18:05:28   8                     CN           18:14:32   18:15:40   18:22:52
B    18:06:14   5                     HN           18:14:32   18:15:12   18:20:04

Case 2: Job submission is similar to Case 1. Job A was launched on the compute node with 5 processes; Job B was launched on the head node with 3 processes. In this case, we simulated the failure by powering down the compute node.

Table 2: Job statuses and breakdown for Case 2

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    18:39:23   8                     CN           18:49:10   18:50:02   18:54:12
B    18:40:28   5                     HN           18:49:10   NA         18:47:20

Case 3: Job A was launched on the compute node with 5 processes; Job B was launched on the head node with 3 processes. The failure in this case was of the second type, in that we unplugged the network cable of the compute node.

Table 3: Job running breakdown for Case 3

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    19:06:35   4                     CN           19:15:32   19:16:32   19:24:50
B    19:07:43   7                     HN           19:15:32   19:16:18   19:19:10

Case 4: Two jobs were launched on the compute node and the head node. Job A was launched on the compute node with 6 processes; Job B was launched on the head node with 4 processes. The head node failure was simulated by unplugging the network cable of that node.

Table 4: Job running breakdown for Case 4

Job  Sub    Ckpt interval (mins)  Launched on  Failure  Replication time (sec)  Recovered  Finished
A    08:13  8                     CN           17:                              :56        27:20
B    09:15  6                     HN           17:                              :46        23:30

Case 5: Job A was launched on the compute node with 4 processes; Job B was launched on the head node with 4 processes. We simulated back-to-back failures in this case.

We first simulated the compute node outage by pulling the network cable, and then, after some time, simulated the head node outage by shutting down the head node.

Table 5: Job running breakdown for Case 5

Job     Sub (02)  Ckpt interval (mins)  Compute node failed  Replication time (sec)  Recovered  Head node failed  Recovered at  Finished
A (CN)  00:                             :                                            :24        10:50             12:48         18:50
B (HN)  01:                             :                                            :10        10:50             12:36         18:40

From the various experiments listed in Tables 1-5, the recovery times differ for jobs launched on the head node and on compute nodes. This difference is due to the sequential nature of the job resubmission and recovery scripts. The average delay for each job during restart is about 12 seconds, a significant portion of which is caused by LAM/MPI daemon synchronization. Fig. 4 shows the completion time of a job against the mean time to failure (MTTF) of the nodes. We assume that the MTTF is the same for all nodes in our cluster; the MTTF values are meant for demonstration purposes only. The graph in Fig. 5 illustrates the completion times of the same job for various checkpoint frequencies, with failures injected at specific intervals. It suggests that the checkpoint frequency is an important factor in job completion: a checkpoint taken just before a failure incurs less overhead and minimizes the loss of computation time. The failures were injected at 92 minutes on the compute node and at 110 minutes on the head node. We emulated the failures either by pulling the network cable of the node or by shutting the node down. To emphasize the framework's recovery time during a node failure, we compare our MTTR (mean time to recovery) against a hypothetical MTTR without our framework. Although we use MTTR values for demonstration purposes, a number of studies show that MTTR values (in the case of hard drive and system failures) can range from 30 minutes to 30 days [21]. In addition, Fig. 4 illustrates how our framework handles multiple failures by showing job completion time against the MTTF of the nodes.

Fig 4: Completion time of Job A under compute node failures, plotted against the mean time to failure (MTTF) of the nodes, for checkpoint intervals of 29 and 60 minutes with our framework (MTTR = 20 minutes). When the MTTF is less than 60 minutes, jobs whose checkpoint interval exceeds the MTTF do not complete (shown as -1).

Clearly, the completion time is largely proportional to the checkpoint frequency and the MTTR. The graph shows that our framework adds only very small increments to the job completion time in the event of multiple failures, as opposed to other solutions. This is due to our approach having very little overhead (i.e., a small time to repair) during transparent recovery. The graph in Fig. 4 also shows one instance where the MTTF is less than the checkpoint interval: a job with a checkpoint interval of 60 minutes cannot finish if the MTTF of the nodes on which it executes is 30 minutes.

Table 6: Completion times of Job A (in minutes) under multiple failures, for various checkpoint intervals, with a compute node failure at 92 minutes (MTTR = 1 minute) and a head node failure at 110 minutes (MTTR = 2 minutes).
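The behaviour sketched in Fig. 4 and Table 6 can be approximated with a simple model. The sketch below is our own illustrative approximation, not the authors' analysis: it assumes a failure exactly every MTTF minutes, charges one MTTR plus the work lost since the last checkpoint for each failure, and ignores the cost of taking checkpoints.

```python
def completion_time(work_min, ckpt_interval_min, mttf_min, mttr_min, max_failures=1000):
    """Illustrative estimate of wall-clock completion time (minutes) when a
    failure occurs every `mttf_min` minutes and the job resumes from its most
    recent checkpoint after `mttr_min` minutes of repair."""
    if ckpt_interval_min > mttf_min:
        return -1                      # as in Fig. 4: the job can never pass a failure point
    done = 0.0                         # useful work completed so far
    elapsed = 0.0                      # wall-clock time
    for _ in range(max_failures):
        if work_min - done <= mttf_min:
            return elapsed + (work_min - done)       # finishes before the next failure
        # progress until the failure, then roll back to the last checkpoint
        done += mttf_min - (mttf_min % ckpt_interval_min)
        elapsed += mttf_min + mttr_min
    return elapsed

# e.g. a 240-minute job, 29-minute checkpoints, failures every 120 minutes, 20-minute repairs
print(completion_time(240, 29, 120, 20))   # -> 288.0 minutes
```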

Fig 5: Completion time of Job A for different checkpoint intervals, with a compute node failure at 92 minutes (MTTR = 30 minutes) and a head node failure at 110 minutes (MTTR = 40 minutes), compared against the completion time without failure.

We also measured the overhead of the proposed framework, including network and CPU usage. Our benchmark suggests that the network overhead of our experimental heartbeat (HB) mechanism grows linearly with the number of nodes; we have also projected the network overhead for larger node counts. We keep the default HB interval of 1 second, so there are 2 HB packets per node per second, each of 84 bytes. The additional network traffic for 100 nodes is therefore about 17 KB per second.
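A quick check of that projection (the packet size and rate are taken from the text above; the larger node counts are arbitrary examples of the linear scaling):

```python
HB_PACKETS_PER_NODE_PER_SEC = 2
HB_PACKET_BYTES = 84

def heartbeat_traffic_bytes_per_sec(nodes):
    """Aggregate heartbeat traffic generated per second for a given cluster size."""
    return nodes * HB_PACKETS_PER_NODE_PER_SEC * HB_PACKET_BYTES

for n in (100, 500, 1000):
    print(n, round(heartbeat_traffic_bytes_per_sec(n) / 1024, 1), "KB/s")
# 100 nodes -> about 16.4 KB/s (~17 KB/s as reported), growing linearly with node count
```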

Fig 6: Network usage (packets generated) for bulk job submission as the number of jobs added in bulk increases, comparing periodic and event-based replication approaches.

Fig. 6 illustrates the network usage of our job queue replication. We adopt an event-based monitoring and replication technique: event-based monitoring keeps the standby head in sync with every job that is submitted to, completed on, or deleted from the primary head.

Fig 7: Average CPU overhead (%) of the restarting mechanism against the number of jobs to be restarted.

The average CPU utilization of our job recovery mechanism is captured in Fig. 7. The average CPU overhead ranges from less than 1% to about 3% as the number of jobs in the queue increases.

6. Conclusion

In this paper we proposed a framework that augments a standard HPC cluster with a transparent, job-level fault tolerance capability. With our framework, a parallel MPI job submitted through a typical resource manager is resilient to the most common failures in the cluster when encountering node outages. Preliminary results suggest that MPI jobs can continue their execution and that the job queue is preserved regardless of failures at the head node or compute nodes. We also detailed the corresponding algorithm, analysis, and details of our techniques. Furthermore, the framework does not require any modification to the HPC environment, such as the standard MPI implementation (LAM/MPI), the resource manager (PBS/Torque), or existing MPI programs; our solution could therefore be easily adapted to other job queues and MPI implementations. The core complexity of the algorithm is O(n*k), where n is the number of nodes and k is the number of jobs that need to be resubmitted. The complexity could be reduced further, and that investigation is currently under way. This paper shows that any node outage, including head node and compute node outages, can be handled by our framework. We outlined a number of different cases that portray real-world scenarios and detailed how to deal with them. The framework has a reliable job registration/retrieval mechanism that enables us to systematically and transparently handle

MPI jobs in any node outage event. The results show that the framework has distinct advantages when compared to existing recovery mechanisms. In addition, the framework overhead is minimal and is easily outweighed by the benefits of our mechanisms.

7. References

[1] Adding high availability to Condor Central manager.
[2] F. Costen, J. Brooke, M. Pettipher, Investigation to make best use of LSF with high efficiency, in Proceedings of the 1st IEEE Computer Society International Workshop on Cluster Computing, 2-3 Dec.
[3] Déjà vu software.
[4] Donald Baker, Beyond MPI, Linux Magazine, 15th November 2005.
[5] Edgar Gabriel, Graham E. Fagg, George Bosilca, Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, in Proceedings of the 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September.
[6] G. E. Fagg, A. Bukovsky, and J. J. Dongarra, HARNESS and fault tolerant MPI, Parallel Computing, vol. 27.
[7] G. Stellner, CoCheck: Checkpointing and Process Migration for MPI, in Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii.
[8] W. Gentzsch, Sun Grid Engine: Towards Creating a Compute Power Grid, in Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, May 2001, pp. 35-36.
[9] Jiannong Cao, Yinghao Li, Minyi Guo, Process migration for MPI applications based on coordinated checkpoint, in Proceedings of the 11th International Conference on Parallel and Distributed Systems, Volume 1, July 2005.
[10] K. M. Chandy, A Survey of Analytic Models for Rollback and Recovery Strategies, Computer, vol. 8, no. 5, 1975.
[11] Albeaus Bayucan, Robert L. Henderson, et al., Portable Batch System External Reference Specification, MRJ Technology Solutions, May.
[12] LinuxHA Clustering Project.
[13] Matthieu Fertre and Christine Morin, Transparent message passing parallel applications checkpoint in Kerrighed, HAPCW, in conjunction with LACSI 2005, Santa Fe, New Mexico.
[14] OSCAR software, download available at:
[15] Python FAM interface, available at:
[16] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, The 2003 Los Alamos Computer Science Institute Symposium, Santa Fe, NM, October.
[17] SLURM: Simple Linux Utility for Resource Management.
[18] S. Plank, M. Beck, G. Kingsley, and K. Li, Libckpt: Transparent Checkpointing under Unix, in Usenix Winter 1995 Technical Conference.
[19] The Torque Resource Manager.
[20] Truly transparent checkpointing of parallel applications, available at:

[21] Values for MTTR, datasheet.pdf.
[22] S. Sudhakar and C. Leangsuksun, A hybrid monitoring and broadcast heartbeat technique for large-scale cluster systems, Technical report, Computer Science, Louisiana Tech University, Mar.
[23] Y. Zhang, Checkpoint and migration of parallel processes based on MPI, in Proceedings of the 3rd Linux Cluster Institute Conference, Florida, October.


More information

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues By Narjess Ayari, Denis Barbaron, Laurent Lefevre and Pascale primet Presented by Mingyu Liu Outlines 1.Introduction

More information

Business Continuity and Disaster Recovery. Ed Crowley Ch 12

Business Continuity and Disaster Recovery. Ed Crowley Ch 12 Business Continuity and Disaster Recovery Ed Crowley Ch 12 Topics Disaster Recovery Business Impact Analysis MTBF and MTTR RTO and RPO Redundancy Failover Backup Sites Load Balancing Mirror Sites Disaster

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

An Extensible Message-Oriented Offload Model for High-Performance Applications

An Extensible Message-Oriented Offload Model for High-Performance Applications An Extensible Message-Oriented Offload Model for High-Performance Applications Patricia Gilfeather and Arthur B. Maccabe Scalable Systems Lab Department of Computer Science University of New Mexico pfeather@cs.unm.edu,

More information

Implementing an efficient method of check-pointing on CPU-GPU

Implementing an efficient method of check-pointing on CPU-GPU Implementing an efficient method of check-pointing on CPU-GPU Harsha Sutaone, Sharath Prasad and Sumanth Suraneni Abstract In this paper, we describe the design, implementation, verification and analysis

More information

Distributed File Systems Part II. Distributed File System Implementation

Distributed File Systems Part II. Distributed File System Implementation s Part II Daniel A. Menascé Implementation File Usage Patterns File System Structure Caching Replication Example: NFS 1 Implementation: File Usage Patterns Static Measurements: - distribution of file size,

More information

Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows. Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R.

Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows. Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R. Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R. Vallee Motivation & Challenges Bigger machines (e.g., TITAN, upcoming Exascale

More information

A Component Architecture for LAM/MPI

A Component Architecture for LAM/MPI A Component Architecture for LAM/MPI Jeffrey M. Squyres and Andrew Lumsdaine Open Systems Lab, Indiana University Abstract. To better manage the ever increasing complexity of

More information

High Availability and Disaster Recovery Solutions for Perforce

High Availability and Disaster Recovery Solutions for Perforce High Availability and Disaster Recovery Solutions for Perforce This paper provides strategies for achieving high Perforce server availability and minimizing data loss in the event of a disaster. Perforce

More information

Application Fault Tolerance Using Continuous Checkpoint/Restart

Application Fault Tolerance Using Continuous Checkpoint/Restart Application Fault Tolerance Using Continuous Checkpoint/Restart Tomoki Sekiyama Linux Technology Center Yokohama Research Laboratory Hitachi Ltd. Outline 1. Overview of Application Fault Tolerance and

More information

FairCom White Paper Caching and Data Integrity Recommendations

FairCom White Paper Caching and Data Integrity Recommendations FairCom White Paper Caching and Data Integrity Recommendations Contents 1. Best Practices - Caching vs. Data Integrity... 1 1.1 The effects of caching on data recovery... 1 2. Disk Caching... 2 2.1 Data

More information

The Walking Dead Michael Nitschinger

The Walking Dead Michael Nitschinger The Walking Dead A Survival Guide to Resilient Reactive Applications Michael Nitschinger @daschl the right Mindset 2 The more you sweat in peace, the less you bleed in war. U.S. Marine Corps 3 4 5 Not

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Problems for Resource Brokering in Large and Dynamic Grid Environments

Problems for Resource Brokering in Large and Dynamic Grid Environments Problems for Resource Brokering in Large and Dynamic Grid Environments Cătălin L. Dumitrescu Computer Science Department The University of Chicago cldumitr@cs.uchicago.edu (currently at TU Delft) Kindly

More information

NUSGRID a computational grid at NUS

NUSGRID a computational grid at NUS NUSGRID a computational grid at NUS Grace Foo (SVU/Academic Computing, Computer Centre) SVU is leading an initiative to set up a campus wide computational grid prototype at NUS. The initiative arose out

More information