On the Survivability of Standard MPI Applications

On the Survivability of Standard MPI Applications

Anand Tikotekar 1, Chokchai Leangsuksun 1, Stephen L. Scott 2
Louisiana Tech University 1, Oak Ridge National Laboratory 2
box@latech.edu 1, a007@latech.edu 1, scottsl@ornl.gov 2

Abstract. Job loss due to failure represents a common vulnerability in High Performance Computing (HPC), especially in the Message Passing Interface (MPI) environment. Rollback-recovery has been used to mitigate faults for long-running applications. However, to date, rollback-recovery mechanisms such as checkpointing alone may not be sufficient to ensure fault tolerance for MPI applications, owing to the static view MPI takes of its cooperating machines and its lack of resilience to outages. In fact, MPI applications are prone to cascading failures, where one participating node causes total failure. In this paper, we address fault issues in the MPI environment by improving runtime availability with self-healing and self-cloning that tolerate outages of cluster computing systems. We develop a framework that augments a standard HPC cluster with a job-level fault tolerance capability that preserves the job queue and enables a parallel MPI job submitted through a resource manager to continue non-stop execution even after encountering a failure.

1. Introduction

Large clusters are now ubiquitous in high performance computing due to their ability to solve computationally intensive problems cost effectively. Jobs submitted to a large-scale cluster are typically parallel, MPI-type applications, which can be very long-running. Imagine a scenario where the application completion time is an order of magnitude longer than the system mean time to failure. If the HPC job crashes due to a failure in any cluster node, the application must be resubmitted. To our knowledge, this outage phenomenon is MPI's Achilles heel, because MPI runtimes are based on a static implementation [4]. An application can neither take advantage of added nodes nor resize itself to fewer nodes after initialization.

1 Research supported by the Office of Science, U.S. Department of Energy, the FASTOS program, Grant # DE-FG02-04ER.
2 Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Furthermore, MPI applications also face cascading failures, whereby a single node can stall the entire MPI job. By and large, parallel jobs are submitted through a resource manager. A typical resource manager has no knowledge of whether an MPI job has stalled or is still actively running, and this type of detection is a very difficult task: the job may remain in the running state, since the runtime daemon or the main MPI process is still active, even while a failure occurs on one of the participating nodes. Some resource managers today provide intrinsic checkpoint support, such as Condor [1] and LSF [2]. However, their checkpoint mechanisms were not designed to provide transparent fault tolerance for running MPI applications: any node failure during execution results in the failure of the whole job. Moreover, the whole job queue is lost when the head node is lost. There are existing works that provide a coordinated checkpoint mechanism for MPI. Unfortunately, checkpoint recovery will not succeed after a permanent failure where the restart environment is not identical to the one before the failure. Moreover, because of the static nature of MPI, a solution must keep track of which jobs are queued, launched, or running on which nodes in order to address the aforementioned recovery issues. This information enables transparent fault tolerance through job queue replication and automatic resubmission of jobs in the event of node outages. Our framework introduces a job registration/retrieval (JRR) mechanism that keeps track of detailed job information until completion. By improving runtime availability with multi-head HA-OSCAR self-healing and self-cloning, and by incorporating JRR and existing coordinated-checkpoint mechanisms, our proof of concept demonstrates that any parallel MPI job submitted through a normal resource manager can continue its execution non-stop even when encountering failures.

We schematically abstract our framework by illustrating basic failure modes in a cluster in Section 3. Section 4 sketches the proposed framework, and Section 4.1 describes the job registration/retrieval (JRR) mechanism. Section 5 presents the performance study and experimental overhead of our framework, followed by the conclusion in Section 6.

2. Related Work

There are different approaches available to tackle reliability issues in HPC, including high availability (HA), resource managers with fault resilience, checkpoint-recovery mechanisms, etc. LinuxHA [12] is a tool for building HA clusters. However, LinuxHA only provides a heartbeat and failover mechanism for a flat-structure cluster, which may not be suitable for HPC clusters. OSCAR [14] and ROCKS are common software stacks for deploying and managing Beowulf clusters [14], [7]. As far as availability is concerned, Beowulf clusters suffer from the head node being a single point of failure (SPF): the cluster can go down completely with the failure of the head node. Thus, there is a need to focus on the high availability aspect of cluster design. HA-OSCAR deals with availability and fault issues at the gateway node with a multi-head failover architecture and service-level fault tolerance mechanisms. Replication, proactive monitoring, and recovery are essential techniques in HA-OSCAR. Moreover, HA-OSCAR provides a flexible and extensible

interface for customizable fault management, failover operation, and alert management. Typically, HPC job management systems consist of two parts: resource managers and dedicated job schedulers. PBS [11], Torque [19], SLURM [17], and SGE [8] are resource managers commonly deployed in HPC clusters. Torque is based on the OpenPBS project; it is open source software enhanced with better scalability, node fault tolerance, and improved logging facilities compared to OpenPBS. On the downside, the Torque server is a single point of failure at the head node, and Torque provides no checkpoint support. SGE [8] supports a checkpoint mechanism but has known limitations in controlling parallel MPI applications and in retrieving resource usage data. Lawrence Livermore National Laboratory's SLURM [17] was developed with scalability in mind, coupled with fault tolerance and simple management. SLURM has an active/standby server configuration for fault tolerance: in the event of an outage of the primary SLURM control daemon, the backup controller assumes control. The SLURM controller daemon writes its current state to disk, so that the job queue is preserved when the backup controller takes over. Currently, SLURM has no checkpoint support and does not address already-running jobs. Condor is used for high throughput computing. Condor supports both serial and parallel jobs but provides checkpointing and process migration only for serial jobs, and the Condor central manager is also a single point of failure. In [3], the authors developed a system called Déjà vu aiming to achieve transparent fault tolerance for HPC environments. Déjà vu is not open source software and, to date, has not been released. Moreover, it requires a compiler-based code instrumentation approach, which may introduce software bugs and overhead. Zhang [23] designed a system based on checkpoint-based rollback recovery and migration. This system relies on migrating processes in the case of node failures and suffers from a single point of failure. One of the important schemes to achieve fault tolerance in HPC clusters is the checkpoint/restart technique. There are numerous Linux-based checkpoint/restart packages, such as BLCR [10], CoCheck [7], Epckpt [20], and Libckpt [18]. In [13], Morin et al. propose a transparent checkpoint mechanism for message-passing parallel applications in Kerrighed. LAM/MPI [16] supports coordinated checkpoint/restart based on BLCR. However, when a failure occurs, most existing checkpoint/restart tools require a manual restart. To support transparent recovery, one must satisfy availability requirements and provide self-awareness to tolerate outages and perform automatic recovery at reasonable cost; the checkpoint scheme should be used in conjunction with the runtime, such as the job scheduling mechanism, to achieve self-recovery. At the user level, there have been attempts to provide MPI fault tolerance, such as FT-MPI [6]. FT-MPI provides a try-catch approach to detect a process (MPI communicator) outage and can recover from the failure, but FT-MPI alone cannot provide fault tolerance, especially at the head node. Open MPI [5] is a recent open source, production-quality MPI implementation that aims to consolidate the strong points of other MPI implementations. It also aims to provide fault tolerance but leaves some questions open, such as how to detect faults and propagate event notifications to processes. To summarize, the above solutions, although tackling the same problem, are lacking in at least one of the following areas:

1. Some solutions do not provide fault tolerance for parallel jobs in the case of a head node failure.
2. Solutions that provide head node outage support, such as SLURM, do not account for compute node failures.
3. Some solutions do not provide checkpointing as external support, and one cannot use native methods with the particular solution.
4. Products such as Déjà vu are not open source.

Our approach addresses all of the above issues in a single solution.

3. Basic MPI job failure cases and proposed resolution

Fig 1: Head node outage, basic failure mode 1

Case 1: Fig. 1 depicts a parallel job launched on the head node with its processes running on compute nodes. We assume that HPC applications are MPI-type and that the system is a multi-head HA-OSCAR cluster, in order to address the single point of failure in a typical cluster. When the head node fails, the standby head node takes control (i.e., fails over) and must recover all the jobs. The job queue must therefore be replicated or stored on reliable storage accessible by the standby head. Running jobs must be periodically saved; this checkpoint mechanism ensures fault tolerance for the running applications. All running jobs have to be restarted and queued jobs must be restored, since the resource management and scheduling service must first be restarted after the head node failure.

Fig 2: Compute node outage, basic failure mode 2

Case 2: Fig. 2 shows a parallel job launched on a compute node (e.g., C1) with its processes running on other nodes, after which the node (C1) crashes. In this case, jobs running on the failed node must be restarted from the last checkpointed state. We must also remove the original stalled jobs before recovering them from their checkpoints. There are other failure cases that depend upon where the jobs are launched (either on the head node or on a compute node) and which corresponding MPI processes crash. Our framework aims to handle all of these cases. Moreover, the framework is also designed to cope with back-to-back failures of a compute node and the head node, as well as multiple compute node failures at the same time.

4. Algorithm

In this section, we detail the algorithm for the proposed framework. The algorithm was implemented and validated with HA-OSCAR 1.1, LAM/MPI 7.0, Torque, and BLCR; it should be straightforward to adapt it to other implementations. We also enhance HA-OSCAR to deal with compute node outages using a lazy redundancy cloning technique: an on-demand compute node cloning, similar to failover, that ensures high availability of compute nodes in a permanent failure situation. Details of the lazy redundancy technique can be found in [22].

S0: Start the job queue replicator, the compute node failure detection mechanism (CN), and the restarting mechanism (RM) daemons on the head node.
S1: Submit an MPI parallel job through a scheduling interface.
S2: Replicate the job on the backup with its status set to held.
S3: Do until no node outage: // the following activities are performed on node X

    Preprocessing
        Retrieve the job id and user.
        Call the checkpoint controller with a tunable interval.
    Job registration
        Determine the allocated nodes using the LAM/MPI policy.
        Determine the MPI ID of the job.
        Insert the job details into the database for further analysis.
    Checkpoint
        Checkpoint the job at the tunable interval.
        Copy the checkpoint files to the backup.
    Creation of a job file
        Create a job specification file for restarting purposes.
        Copy the job file to the standby node.
    Post processing
        Delete all the spec files.
        Delete the checkpoint files if the root job finishes.
S4: End Loop
S5: If there is a head node outage: // on the standby head node
    a. Start the head node failover mechanism
        i. Clone the head node
        ii. Start the resource manager and scheduler
        iii. Call the RM
    b. Determine the list of jobs to be restarted.
        i. Restart all jobs that were running at the time of failure.
    c. Create the respective LAM environment to submit jobs through the RM from each directory
        i. Determine the needed session suffix for LAM before creating the LAM environment.
    d. Release the remaining jobs that are in the queued state.
    e. Go to step S3
S6: End If
S7: If there is a compute node crash: // on the primary head node
    a. Start the compute node failover mechanism
        i. Clone the failed node with the spare one
        ii. Restart all the necessary services
        iii. Call the RM
    b. Determine the list of jobs to be restarted.
        i. Scan the respective directories
    c. Retrieve the job information and, based on that information, submit the job through the RM after creating the respective LAM environment.
        i. Retrieve the information regarding the allocated nodes using the job id
    d. Go to step S3
S8: End If

The above algorithm has three main phases. Job registration takes place as each job enters the system. Two daemons, the head node failure detector (HN) and the compute node failure detector (CN), execute their respective algorithms after an outage is detected.
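The paper describes the per-job cycle of step S3 only in outline. As a rough illustration, the following Python sketch mimics that cycle under several assumptions: the registration schema, file locations, and stubbed status query are hypothetical stand-ins rather than the authors' implementation, and the actual framework drives LAM/MPI, BLCR, and Torque rather than the placeholder shell commands used here.

```python
import sqlite3
import subprocess
import time

REGISTRY_DB = "/shared/jrr/registry.db"   # assumed reliable shared storage location
BACKUP_CKPT_DIR = "standby:/var/ckpt/"    # assumed standby-node destination

def register_job(job_id, user, mpi_pid, node_flags):
    """Record a JRR entry: job id, user, MPI process id, and a boolean per
    compute node indicating whether the job has processes on that node."""
    con = sqlite3.connect(REGISTRY_DB)
    con.execute("CREATE TABLE IF NOT EXISTS jobs "
                "(job_id TEXT PRIMARY KEY, user TEXT, mpi_pid INTEGER, nodes TEXT)")
    nodes = ",".join(f"{name}={int(flag)}" for name, flag in node_flags.items())
    con.execute("INSERT OR REPLACE INTO jobs VALUES (?,?,?,?)",
                (job_id, user, mpi_pid, nodes))
    con.commit()
    con.close()

def checkpoint_cycle(job_id, mpi_pid, interval_s):
    """Checkpoint the job at a tunable interval and copy the image to the backup."""
    while job_is_running(job_id):
        # BLCR-style checkpoint of the mpirun process (illustrative invocation).
        subprocess.run(["cr_checkpoint", str(mpi_pid)], check=False)
        subprocess.run(["scp", f"context.{mpi_pid}", BACKUP_CKPT_DIR], check=False)
        time.sleep(interval_s)

def job_is_running(job_id):
    # The real framework would query the resource manager (e.g., Torque); stubbed here.
    return False
```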

Fig. 1 and Fig. 2 illustrate the two main failure and recovery scenarios: Fig. 1 corresponds to step S5 of the algorithm, while Fig. 2 corresponds to step S7. We rely on the cloning approach to replace the failed node, as well as on maintaining the application runtime configuration and on quick detection and recovery mechanisms.

4.1. Job Registration/Retrieval mechanism

This section describes how job registration and retrieval (JRR) takes place. The job registration mechanism stores important information about applications and their runtime on reliable storage, such as the job id, a boolean value for each node indicating whether the node hosts processes of the given job, and the MPI process id. Registration takes place from the node where the job is launched by the scheduler. The boolean values are derived from the node allocation policies found in a typical resource management system; in our test bed, the LAM/MPI implementation controls the allocation policy. This information is crucial for the recovery mechanism. If the number of processes is greater than the number of virtual processors specified, then all node columns have the value 1, indicating that the job runs on all nodes; this in turn means that the job must be restarted whenever any node fails. During recovery, the registration provides the information needed to determine where to restart a job, given its job id. It is imperative that when a job finishes, its registration is cleaned up; this task is handled by our enhancement to the post-processing module of the resource management system.

5. Results and analysis

Fig. 3(a) shows a comparison of various recovery approaches. In the head node recovery scenario, the standby node assumes control over the failed primary head and automatically restarts running jobs from their last checkpoint, followed by the rest of the jobs in the queue, preserving the queue sequence. This is conveyed in Fig. 3(a), where the total completion time for a given job after the head node crashes is T_R + R_T, where T_R is the time required for transparent recovery and R_T is the time to complete the job after its last checkpoint. In our experiments, the transparent recovery time is about 60 seconds in the case of a head node failure. Fig. 3 also provides a comparison with other popular approaches. Fig. 3(b) describes a compute node failure scenario, where the total completion time for a job after the node crash is T_R + T_S + R_T, where T_S is the time the job spends waiting in the queue. The transparent recovery time in the case of client failover is about 30 seconds. The job completion times in the two scenarios (head node crash and compute node crash) therefore differ by the time the job spends waiting in the queue before it runs. For brevity, we omit a detailed study of checkpoint overhead; the performance overhead of LAM/MPI and BLCR to checkpoint and restart an MPI job is studied in [16].
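For concreteness, a small sketch of this completion-time breakdown, using the recovery times reported above (about 60 s for head node failover and about 30 s for compute node failover); the T_S and R_T values below are arbitrary placeholders, not measurements from the paper:

```python
def completion_after_failure(t_recover_s, t_queue_s, t_remaining_s):
    """Time from failure to job completion: T_R + T_S + R_T.
    For a head node failure T_S is 0 in the paper's model, since the
    restarted job does not wait behind other jobs in the queue."""
    return t_recover_s + t_queue_s + t_remaining_s

# Head node failure: T_R ~ 60 s, no queue wait, 10 minutes of work left after the last checkpoint.
print(completion_after_failure(60, 0, 600))    # -> 660 seconds

# Compute node failure: T_R ~ 30 s, plus an assumed 120 s queue wait for the resubmitted job.
print(completion_after_failure(30, 120, 600))  # -> 750 seconds
```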

Fig 3(a), 3(b): Recovery analysis in the case of head node and client (compute) node failure

The experiments were conducted on a Linux cluster based on OSCAR 4.1 [14] with Torque as the resource management system. The native node OS is Red Hat 9.0. The cluster consists of one head node, one standby head node, one compute node, and one spare node. We chose LAM/MPI as the MPI implementation, BLCR for checkpoint/restart, and the FAM [15] interface for replication of the jobs. We employ pbs_sched as our scheduler. We then enhanced the setup with HA-OSCAR 1.1, which enables transparent recovery in the case of node failure. In our test environment we simulated two types of failures: shutting down the head node and compute nodes, and unplugging the network cables of the head node and compute nodes. We conducted our experiments using a communication-intensive job called Ring. We chose this job because its duration increases with the number of processes; we could therefore extend or shorten the job duration simply by increasing or decreasing the number of processes (a minimal sketch of such a ring-style job follows below).

Case 1: Two jobs were submitted and launched on the compute node (CN) and the head node (HN), respectively. Job A was launched on the compute node with 6 processes; Job B was launched on the head node with 4 processes. We simulated the failure in this case by unplugging the network cable of the compute node.
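The paper does not give the source of the Ring benchmark. The sketch below is only an illustration of a ring-style communication pattern, written with mpi4py (an assumption, not the authors' code), in which each added process lengthens every lap of the ring and thus the job's runtime.

```python
# Minimal ring-style MPI job (illustrative only; not the authors' Ring benchmark).
# Requires mpi4py; run e.g. with: mpirun -np 6 python ring.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

token = 0
for _ in range(100000):                       # many laps around the ring
    if rank == 0:
        comm.send(token, dest=(rank + 1) % size)
        token = comm.recv(source=(rank - 1) % size)
    else:
        token = comm.recv(source=(rank - 1) % size)
        comm.send(token + 1, dest=(rank + 1) % size)
```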

Table 1: Job statuses and breakdown for Case 1

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    18:05:28   8                     CN           18:14:32   18:15:40   18:22:52
B    18:06:14   5                     HN           18:14:32   18:15:12   18:20:04

Case 2: Job submission is similar to Case 1. Job A was launched on the compute node with 5 processes; Job B was launched on the head node with 3 processes. In this case, we simulated the failure by powering down the compute node.

Table 2: Job statuses and breakdown for Case 2

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    18:39:23   8                     CN           18:49:10   18:50:02   18:54:12
B    18:40:28   5                     HN           18:49:10   NA         18:47:20

Case 3: Job A was launched on the compute node with 5 processes; Job B was launched on the head node with 3 processes. The failure in this case was of the second type, in that we unplugged the network cable of the compute node.

Table 3: Job running breakdown for Case 3

Job  Submitted  Ckpt interval (mins)  Launched on  Failure    Recovered  Finished
A    19:06:35   4                     CN           19:15:32   19:16:32   19:24:50
B    19:07:43   7                     HN           19:15:32   19:16:18   19:19:10

Case 4: Two jobs were launched on the compute node and the head node. Job A was launched on the compute node with 6 processes; Job B was launched on the head node with 4 processes. The head node failure was simulated by unplugging the network cable of that node.

Table 4: Job running breakdown for Case 4

Job  Sub    Ckpt interval (mins)  Launched on  Failure  Replication time (sec)  Recovered  Finished
A    08:13  8                     CN           17:                              :56        27:20
B    09:15  6                     HN           17:                              :46        23:30

Case 5: Job A was launched on the compute node with 4 processes; Job B was launched on the head node with 4 processes. We simulated back-to-back failures in this case.

We first simulated the compute node outage by pulling the network cable, and then, after some time, simulated the head node outage by shutting down the head node.

Table 5: Job running breakdown for Case 5

Job     Sub (02)  Ckpt interval (mins)  Compute node failed  Replication time (sec)  Recovered  Head node failed  Recovered at  Finished
A (CN)  00:                             :                                            :24        10:50             12:48         18:50
B (HN)  01:                             :                                            :10        10:50             12:36         18:40

From the various experiments listed in Tables 1-5, the recovery times differ for jobs launched on the head node and on compute nodes. This difference is due to the sequential nature of the job resubmission and recovery scripts. The average delay for each job during restart is about 12 seconds, a significant portion of which is caused by LAM/MPI daemon synchronization. Fig. 4 shows the completion time of a job against the mean time to failure (MTTF) of the nodes. We assume that the MTTF is the same for all nodes in our cluster; the MTTF values are meant for demonstration purposes only. The graph in Fig. 5 illustrates the completion times of the same job for various checkpoint frequencies, with failures injected at specific intervals. It suggests that the checkpoint frequency is an important factor in job completion: a checkpoint taken just before a failure incurs less overhead and minimizes the loss of computation time. The failures were injected at 92 minutes on the compute node and at 110 minutes on the head node. We emulated the failures either by pulling the network cable of the node or by shutting the node down. To emphasize the framework's recovery time during a node failure, we compare our MTTR (mean time to recovery) against a hypothetical MTTR without our framework. Although we use MTTR values for demonstration purposes, a number of studies show that MTTR values (in the case of hard drive and system failures) can range from 30 minutes to 30 days [21]. In addition, Fig. 4 illustrates how our framework handles multiple failures by showing job completion time against the MTTF of the nodes.

Fig 4: Completion time of Job A under compute node failures, plotted against the mean time to failure (MTTF) of the nodes, for checkpoint intervals of 29 and 60 minutes with our framework (MTTR = 20 minutes). When the MTTF is less than 60 minutes, jobs whose checkpoint interval exceeds the MTTF do not complete (shown as -1).

Clearly, the completion time is largely proportional to the checkpoint frequency and the MTTR. The graph shows that our framework adds only very small increments to the job completion time in the event of multiple failures, as opposed to other solutions. This is due to our approach having very little overhead (i.e., a small time to repair) during transparent recovery. The graph in Fig. 4 also shows one instance where the MTTF is less than the checkpoint interval: a job with a checkpoint interval of 60 minutes cannot finish if the MTTF of the nodes on which it executes is 30 minutes.

Table 6: Completion times of Job A (in minutes) under multiple failures, for various checkpoint intervals, with a compute node failure at 92 minutes (MTTR = 1 minute) and a head node failure at 110 minutes (MTTR = 2 minutes).
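The behaviour sketched in Fig. 4 and Table 6 can be approximated with a simple model. The sketch below is our own illustrative approximation, not the authors' analysis: it assumes a failure exactly every MTTF minutes, charges one MTTR plus the work lost since the last checkpoint for each failure, and ignores the cost of taking checkpoints.

```python
def completion_time(work_min, ckpt_interval_min, mttf_min, mttr_min, max_failures=1000):
    """Illustrative estimate of wall-clock completion time (minutes) when a
    failure occurs every `mttf_min` minutes and the job resumes from its most
    recent checkpoint after `mttr_min` minutes of repair."""
    if ckpt_interval_min > mttf_min:
        return -1                      # as in Fig. 4: the job can never pass a failure point
    done = 0.0                         # useful work completed so far
    elapsed = 0.0                      # wall-clock time
    for _ in range(max_failures):
        if work_min - done <= mttf_min:
            return elapsed + (work_min - done)       # finishes before the next failure
        # progress until the failure, then roll back to the last checkpoint
        done += mttf_min - (mttf_min % ckpt_interval_min)
        elapsed += mttf_min + mttr_min
    return elapsed

# e.g. a 240-minute job, 29-minute checkpoints, failures every 120 minutes, 20-minute repairs
print(completion_time(240, 29, 120, 20))   # -> 288.0 minutes
```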

Fig 5: Completion time of Job A for different checkpoint intervals, with a compute node failure at 92 minutes (MTTR = 30 minutes) and a head node failure at 110 minutes (MTTR = 40 minutes), compared against the completion time without failure.

We also measured the overhead of the proposed framework, including network and CPU usage. Our benchmark suggests that the network overhead of our experimental heartbeat (HB) mechanism grows linearly with the number of nodes; we have also projected the network overhead for larger node counts. We keep the default HB interval of 1 second, so there are 2 HB packets per node per second, each of 84 bytes. The additional network traffic for 100 nodes is therefore about 17 KB per second.
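A quick check of that projection (the packet size and rate are taken from the text above; the larger node counts are arbitrary examples of the linear scaling):

```python
HB_PACKETS_PER_NODE_PER_SEC = 2
HB_PACKET_BYTES = 84

def heartbeat_traffic_bytes_per_sec(nodes):
    """Aggregate heartbeat traffic generated per second for a given cluster size."""
    return nodes * HB_PACKETS_PER_NODE_PER_SEC * HB_PACKET_BYTES

for n in (100, 500, 1000):
    print(n, round(heartbeat_traffic_bytes_per_sec(n) / 1024, 1), "KB/s")
# 100 nodes -> about 16.4 KB/s (~17 KB/s as reported), growing linearly with node count
```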

Fig 6: Network usage (packets generated) for bulk job submission as the number of jobs added in bulk increases, comparing periodic and event-based replication approaches.

Fig. 6 illustrates the network usage of our job queue replication. We adopt an event-based monitoring and replication technique: event-based monitoring keeps the standby head in sync with every job that is submitted to, completed on, or deleted from the primary head.

Fig 7: Average CPU overhead (%) of the restarting mechanism against the number of jobs to be restarted.

The average CPU utilization of our job recovery mechanism is captured in Fig. 7. The average CPU overhead ranges from less than 1% to about 3% as the number of jobs in the queue increases.

6. Conclusion

In this paper we proposed a framework that augments a standard HPC cluster with a transparent, job-level fault tolerance capability. With our framework, a parallel MPI job submitted through a typical resource manager is resilient to the most common failures in the cluster when encountering node outages. Preliminary results suggest that MPI jobs can continue their execution and that the job queue is preserved regardless of failures at the head node or compute nodes. We also detailed the corresponding algorithm, analysis, and details of our techniques. Furthermore, the framework does not require any modification to the HPC environment, such as the standard MPI implementation (LAM/MPI), the resource manager (PBS/Torque), or existing MPI programs; our solution could therefore be easily adapted to other job queues and MPI implementations. The core complexity of the algorithm is O(n*k), where n is the number of nodes and k is the number of jobs that need to be resubmitted. The complexity could be reduced further, and that investigation is currently under way. This paper shows that any node outage, including head node and compute node outages, can be handled by our framework. We outlined a number of different cases that portray real-world scenarios and detailed how to deal with them. The framework has a reliable job registration/retrieval mechanism that enables us to systematically and transparently handle

MPI jobs in any node outage event. The results show that the framework has distinct advantages when compared to existing recovery mechanisms. In addition, the framework overhead is minimal and is easily outweighed by the benefits of our mechanisms.

7. References

[1] Adding high availability to Condor Central manager.
[2] F. Costen, J. Brooke, M. Pettipher, Investigation to make best use of LSF with high efficiency, in Proceedings of the 1st IEEE Computer Society International Workshop on Cluster Computing, 2-3 Dec.
[3] Déjà vu software.
[4] Donald Baker, Beyond MPI, Linux Magazine, 15th November 2005.
[5] Edgar Gabriel, Graham E. Fagg, George Bosilca, Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, in Proceedings of the 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September.
[6] G. E. Fagg, A. Bukovsky, and J. J. Dongarra, HARNESS and fault tolerant MPI, Parallel Computing, vol. 27.
[7] G. Stellner, CoCheck: Checkpointing and Process Migration for MPI, in Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii.
[8] W. Gentzsch, Sun Grid Engine: Towards Creating a Compute Power Grid, in Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, May 2001, pp. 35-36.
[9] Jiannong Cao, Yinghao Li, Minyi Guo, Process migration for MPI applications based on coordinated checkpoint, in Proceedings of the 11th International Conference on Parallel and Distributed Systems, Volume 1, July 2005.
[10] K. M. Chandy, A Survey of Analytic Models for Rollback and Recovery Strategies, Computer, vol. 8, no. 5, 1975.
[11] Albeaus Bayucan, Robert L. Henderson, et al., Portable Batch System External Reference Specification, MRJ Technology Solutions, May.
[12] LinuxHA Clustering Project.
[13] Matthieu Fertre and Christine Morin, Transparent message passing parallel applications checkpoint in Kerrighed, HAPCW, in conjunction with LACSI 2005, Santa Fe, New Mexico.
[14] OSCAR software, download available at:
[15] Python FAM interface, available at:
[16] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, The 2003 Los Alamos Computer Science Institute Symposium, Santa Fe, NM, October.
[17] SLURM: Simple Linux Utility for Resource Management.
[18] S. Plank, M. Beck, G. Kingsley, and K. Li, Libckpt: Transparent Checkpointing under Unix, in Usenix Winter 1995 Technical Conference.
[19] The Torque Resource Manager.
[20] Truly transparent checkpointing of parallel applications, available at:

[21] Values for MTTR, datasheet.pdf.
[22] S. Sudhakar and C. Leangsuksun, A hybrid monitoring and broadcast heartbeat technique for large-scale cluster systems, Technical report, Computer Science, Louisiana Tech University, Mar.
[23] Y. Zhang, Checkpoint and migration of parallel processes based on MPI, in Proceedings of the 3rd Linux Cluster Institute Conference, Florida, October.


More information

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues By Narjess Ayari, Denis Barbaron, Laurent Lefevre and Pascale primet Presented by Mingyu Liu Outlines 1.Introduction

More information

Business Continuity and Disaster Recovery. Ed Crowley Ch 12

Business Continuity and Disaster Recovery. Ed Crowley Ch 12 Business Continuity and Disaster Recovery Ed Crowley Ch 12 Topics Disaster Recovery Business Impact Analysis MTBF and MTTR RTO and RPO Redundancy Failover Backup Sites Load Balancing Mirror Sites Disaster

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

An Extensible Message-Oriented Offload Model for High-Performance Applications

An Extensible Message-Oriented Offload Model for High-Performance Applications An Extensible Message-Oriented Offload Model for High-Performance Applications Patricia Gilfeather and Arthur B. Maccabe Scalable Systems Lab Department of Computer Science University of New Mexico pfeather@cs.unm.edu,

More information

Implementing an efficient method of check-pointing on CPU-GPU

Implementing an efficient method of check-pointing on CPU-GPU Implementing an efficient method of check-pointing on CPU-GPU Harsha Sutaone, Sharath Prasad and Sumanth Suraneni Abstract In this paper, we describe the design, implementation, verification and analysis

More information

Distributed File Systems Part II. Distributed File System Implementation

Distributed File Systems Part II. Distributed File System Implementation s Part II Daniel A. Menascé Implementation File Usage Patterns File System Structure Caching Replication Example: NFS 1 Implementation: File Usage Patterns Static Measurements: - distribution of file size,

More information

Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows. Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R.

Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows. Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R. Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows Swen Boehm Wael Elwasif Thomas Naughton, Geoffroy R. Vallee Motivation & Challenges Bigger machines (e.g., TITAN, upcoming Exascale

More information

A Component Architecture for LAM/MPI

A Component Architecture for LAM/MPI A Component Architecture for LAM/MPI Jeffrey M. Squyres and Andrew Lumsdaine Open Systems Lab, Indiana University Abstract. To better manage the ever increasing complexity of

More information

High Availability and Disaster Recovery Solutions for Perforce

High Availability and Disaster Recovery Solutions for Perforce High Availability and Disaster Recovery Solutions for Perforce This paper provides strategies for achieving high Perforce server availability and minimizing data loss in the event of a disaster. Perforce

More information

Application Fault Tolerance Using Continuous Checkpoint/Restart

Application Fault Tolerance Using Continuous Checkpoint/Restart Application Fault Tolerance Using Continuous Checkpoint/Restart Tomoki Sekiyama Linux Technology Center Yokohama Research Laboratory Hitachi Ltd. Outline 1. Overview of Application Fault Tolerance and

More information

FairCom White Paper Caching and Data Integrity Recommendations

FairCom White Paper Caching and Data Integrity Recommendations FairCom White Paper Caching and Data Integrity Recommendations Contents 1. Best Practices - Caching vs. Data Integrity... 1 1.1 The effects of caching on data recovery... 1 2. Disk Caching... 2 2.1 Data

More information

The Walking Dead Michael Nitschinger

The Walking Dead Michael Nitschinger The Walking Dead A Survival Guide to Resilient Reactive Applications Michael Nitschinger @daschl the right Mindset 2 The more you sweat in peace, the less you bleed in war. U.S. Marine Corps 3 4 5 Not

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Problems for Resource Brokering in Large and Dynamic Grid Environments

Problems for Resource Brokering in Large and Dynamic Grid Environments Problems for Resource Brokering in Large and Dynamic Grid Environments Cătălin L. Dumitrescu Computer Science Department The University of Chicago cldumitr@cs.uchicago.edu (currently at TU Delft) Kindly

More information

NUSGRID a computational grid at NUS

NUSGRID a computational grid at NUS NUSGRID a computational grid at NUS Grace Foo (SVU/Academic Computing, Computer Centre) SVU is leading an initiative to set up a campus wide computational grid prototype at NUS. The initiative arose out

More information