New User-Guided and ckpt-based Checkpointing Libraries for Parallel MPI Applications

Paweł Czarnul and Marcin Frączak

Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
pczarnul

Task WP.13 of 6 T C/06098 CLUSTERIX - The National Linux Cluster. Calculations carried out at the Academic Computer Center in Gdansk, Poland.
Partially covered by the Polish National Grant KBN No. 4 T11C

B. Di Martino et al. (Eds.): EuroPVM/MPI 2005, LNCS 3666, pp , © Springer-Verlag Berlin Heidelberg 2005

Abstract. We present design and implementation details as well as performance results for two new checkpointing libraries that we developed for parallel MPI applications. The first, a user-guided library, requires the programmer to supply packing and unpacking code through an easy-to-use API based on MPI constants. It uses MPI-2 collective I/O calls or a dedicated master process for checkpointing. The second is a technically advanced parallel implementation of checkpointing based on the user-level ckpt library. It uses wrappers for MPI calls in the user program, which makes it possible to run a shadow MPI application solely for communication purposes. The original processes and the shadow MPI code communicate via shared memory segments to which communication buffers are mapped. We present checkpoint/restart times for the two approaches and their subversions, compared to an available LAM/MPI with BLCR checkpointing solution for MPI applications. The performance of all versions and the I/O optimizations is discussed for a 4-node, 16-processor cluster with NFS and specifically for single SMP nodes with a local file system.

1 Introduction and Goals

Checkpointing serves two purposes. First, an application can be checkpointed periodically so that it can be restarted after a system crash. Second, it enables a process to migrate to another node to balance load or to make some nodes available to the user. For checkpointing of parallel MPI programs, the solution must either handle all pending communication at the time a checkpoint signal is issued or assume a simplified model in which checkpoints are generated at designated points where there is no application data in the buffers and the network.

The following checkpointing methods can be distinguished:

1. user guided ([1], [2], [3]): the programmer specifies what data needs to be included in or excluded from the checkpoint,
2. user-level libraries like ckpt ([4]), Condor ([5]) and Libckpt ([6]): these usually require linking a library to a program with slight or no modifications to the code. They do not require root privileges but are often limited in handling system calls, threads etc. [7] announces Hector (alpha version) checkpointing for MPI with Dynamite 2.0,

3. hybrid approaches: [8] presents interesting work in which the programmer inserts calls to PotentialCheckpoint at the points in the code where a checkpoint can be invoked. The whole memory space is saved automatically, though.
4. modifications or extensions of existing implementations, e.g. LAM/MPI with BLCR (LAM [9] coupled with kernel-level BLCR [10]) or MPICH-V ([11], MPICH coupled with Condor [5]). This solution is especially attractive for parallel MPI programs as it can offer a truly transparent solution, but it is tightly coupled with the internals of a particular implementation. It may require root privileges, as LAM/BLCR does.

The contribution of our work is two advanced implementations of checkpointing libraries, using coordinated ([8]) algorithms (available from the authors, to be released at pczarnul):

PARUG: a flexible user-guided version with fast MPI-2 file system calls that save the checkpoint collectively or through a dedicated master process. Only the necessary data, selected in the code by the knowledgeable programmer, needs to be packed. Thus it has the potential of saving only the data necessary after restart.

PARCKPT: an extensible, fast and transparent checkpointing version for any MPI implementation that uses a sequential checkpointing library. A ckpt-based ([4]) version was developed, with checkpoints saved locally or through a checkpoint server (a ckpt feature). Contrary to other transparent solutions like LAM/MPI with BLCR or MPICH-V, PARCKPT can be used with any MPI implementation and still gives transparent checkpointing. It can also be adapted for use with other sequential checkpointing libraries/tools, examples of which are Condor ([5]) and Libckpt ([6]).

These benefits come at the cost of a slightly limited application model, the synchronous parallel application: it is assumed that all processes successively reach the same points in the program code where a checkpoint can occur, and pending messages are not considered at those points. This shortens the time to checkpoint, and the model is suitable for a wide range of applications, e.g. SPMD ([12]) or synchronous Master-Slave.

2 Proposed User-Guided Checkpointing

In PARUG, the collective routine CX_CheckCheckpoint() (sample code shown in Figure 1) needs to be inserted by the programmer at the potential checkpoint points, which denote iterations. If a checkpoint was ordered earlier by a signal, the call saves the states of the processes using MPI-2. Alternatively, the state of the application can be saved by one designated process if the programmer does not provide synchronized operations. The sequence of actions for the checkpointing procedure is as follows (Figure 1):

1. A SIGUSR1 signal is sent to any MPI process, which sets a global flag.
2. As soon as the process calls function CX_CheckCheckpoint() (potentially in a loop of SPMD computations), the flag is read and asynchronous messages are sent to the other processes. An iteration number (Paragraph 1, hidden from the programmer) at which checkpointing is to occur is also propagated (received by MPI_Irecv calls).

3. Depending on the operating mode of PARUG, the corresponding actions occur:
   CX_SYNCHRONIZED: when the program reaches the required iteration number, all processes save the data pointed to by the programmer to one checkpoint file. All processes can synchronize on CX_CheckCheckpoint() for checkpointing;
   CX_LOOSE: only one selected master process saves data to the checkpoint file at the first call to CX_CheckCheckpoint().

Fig. 1. Proposed User-guided Approach: Inter Process Communication Schema

Independently of the above, the library can operate in two modes:

1. Parallel data write (default): checkpoint data is written/read by the MPI_File_write_at_all/MPI_File_read_at_all MPI-2 functions, which can speed up data write/access times by grouping, collective buffering etc. ([13]).
2. Data write through a master process: all checkpoint data from all processes is sent to the process with MPI rank 0, which then writes the data to a file using MPI-2 calls.

3 Proposed ckpt-based Parallel MPI Checkpointing Library

In PARCKPT, no code changes are required, but all MPI functions are replaced by wrappers (preceded by RES_, sample code shown in Figure 2). The wrappers for MPI communication routines denote the aforementioned potential checkpoint points and count iterations internally; the count is used to calculate a global iteration number for checkpointing. Thus, currently, the library can be used with synchronous applications with a uniform number of communication actions per process per iteration.

In this solution, a static library is linked with the original user application instead of an MPI library. The new library includes functions substituting MPI functions (preceded by RES_). The original MPI functions are called by another process, a wrapper. This makes it possible to checkpoint the original processes using ckpt, since those processes do not call true MPI functions. The wrapper also prepares the MPI world before the start and after the restart of the user application. For each process of the application a separate wrapper process is created (Figure 2).
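The coordination step that both libraries rely on (a signal sets a flag, and at the next potential checkpoint point a future iteration number is propagated to all processes) can be illustrated without MPI. The following is a minimal single-process simulation, not the libraries' code: the class `Rank`, its methods, the in-memory `inbox` standing in for asynchronous messages, and the distance `lead` to the future iteration are all hypothetical names and assumptions introduced for illustration.

```python
class Rank:
    """One simulated process; all names here are hypothetical, not the library API."""

    def __init__(self, rank, lead=2):
        self.rank = rank
        self.iteration = 0      # potential checkpoint points passed so far
        self.flag = False       # set by the (simulated) SIGUSR1 handler
        self.target = None      # agreed iteration at which to checkpoint
        self.inbox = []         # stand-in for asynchronously received messages
        self.lead = lead        # assumed distance to the future iteration

    def on_signal(self):
        # A real signal handler would do no more than this: set a flag.
        self.flag = True

    def check_checkpoint(self, peers):
        """Call once per iteration; True means 'checkpoint now'."""
        while self.inbox:                       # pick up a propagated target
            self.target = self.inbox.pop()
        if self.flag and self.target is None:   # this rank received the signal
            self.target = self.iteration + self.lead
            for peer in peers:
                if peer is not self:
                    peer.inbox.append(self.target)
        self.flag = False
        due = self.target == self.iteration
        self.iteration += 1
        return due


def run(n_ranks=3, iterations=8, signalled_rank=1, signal_at=3):
    ranks = [Rank(r) for r in range(n_ranks)]
    taken = {}                                  # rank -> iteration of its checkpoint
    for it in range(iterations):
        if it == signal_at:
            ranks[signalled_rank].on_signal()
        for r in ranks:                         # lockstep, as in a synchronous SPMD loop
            if r.check_checkpoint(ranks):
                taken[r.rank] = it
    return taken


if __name__ == "__main__":
    print(run())
```

With the default arguments every simulated rank checkpoints at iteration 5, two iterations after the signal. Choosing an iteration in the future matters because some ranks may already have passed the current checkpoint point when the signalled rank sends its messages.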

Fig. 2. Proposed ckpt-based Checkpointing: Inter Process Communication Schema

Application-wrapper communication uses signals and shared memory, i.e. the user application only passes data through shared memory to the wrapper, which calls the MPI functions. However, copying user data into/from shared memory regions when passing/fetching it to/from the wrapper for sending/receiving decreases performance. Instead, we attach the shared memory window to the memory region that already contains the user data, i.e. the buffers. This is done using the shmat function with the SHM_REMAP flag set, which, however, removes all other memory mappings from that memory region. Thus the buffer data is saved in a temporary buffer and restored after the attachment. This causes a serious slowdown once, but speeds things up if the same buffer is used repeatedly.

A typical scenario for start/checkpoint/restart looks as follows:

1. The application source code is preprocessed: any calls to MPI functions are substituted with their RES_ substitutes and data buffers are page aligned.
2. The wrapper is started, which in turn starts the user application process.
3. A SIGUSR1 signal sent to an application process starts checkpointing.
4. During the next MPI action after receiving the signal, the process that received it sends a specific MPI message to every other process of the application. The message defines the checkpoint at some iteration in the future.
5. At the defined iteration, the processes order their wrappers to leave the MPI world gracefully (call MPI_Finalize()) and exit.
6. The application processes checkpoint (using ckpt, [4]) and exit.
7. Upon restart, the wrappers restart the application processes, which continue without noticing any checkpoint/restart.

Similarly to PARUG, we distinguish two subversions of the implementation:

1. standard: all processes simply write data to the local file system,
2. ckpt server: a ckpt server (implemented in ckpt) is used, to which checkpoints are sent over TCP and then saved locally.
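The ordering of the save, attach and restore steps for a user buffer, together with the page alignment required in step 1 of the scenario, can be sketched as follows. This is only an analogy under stated assumptions: Python cannot attach a segment at a chosen address as shmat with SHM_REMAP does, so a freshly created anonymous shared mapping stands in for the remapped region, and the function names `page_aligned_size` and `move_into_shared_region` are hypothetical.

```python
import mmap


def page_aligned_size(nbytes, page=mmap.PAGESIZE):
    """Round nbytes up to a whole number of pages (at least one page)."""
    pages = max(1, (nbytes + page - 1) // page)
    return pages * page


def move_into_shared_region(user_buffer):
    """Save, 'attach', restore: returns a shared mapping holding the buffer data."""
    saved = bytes(user_buffer)                  # 1. copy data to a temporary buffer
    # 2. Stand-in for the attachment: a new anonymous mapping whose previous
    #    contents are gone (it starts zero-filled), as after a remap.
    region = mmap.mmap(-1, page_aligned_size(len(saved)))
    region[:len(saved)] = saved                 # 3. restore the data afterwards
    return region


if __name__ == "__main__":
    buf = bytearray(b"boundary cells")
    region = move_into_shared_region(buf)
    print(len(region) % mmap.PAGESIZE == 0, region[:len(buf)] == bytes(buf))
    region.close()
```

The one-time copy in steps 1 and 3 corresponds to the single slowdown mentioned above; later sends and receives from the same buffer would need no further copying.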

4 Experimental Results

4.1 Testbed Environment and Parallel MPI Application

All simulations used a 16-processor cluster (four 4-processor nodes, 512MB RAM each) with Pentium III Xeons and Ethernet switches. On one node (g55) checkpoints were saved locally, while the other nodes (g52-g53) saved to g55 via NFS. We used an SPMD MPI application (LAM/MPI 7.0.6, BLCR for LAM/BLCR) which runs 1000 time steps in which cells of a 2D domain are updated. The domain is divided equally among the processors. Between iterations, processes exchange boundary cell data. We varied the size of the domain from 32MB to 128MB. The implementation corresponds to parallel applications like electromagnetic modeling or medical simulations ([12]). For PARUG we pack the whole domain data. PARCKPT and LAM/BLCR additionally pack communication buffers etc., so in practice they save more data than PARUG. We aimed at assessing the checkpoint/restart costs of all the methods.

4.2 Proposed User-Guided Approach vs Checkpointing with ckpt Library

Figure 3 presents PARUG's execution times with one checkpoint/restart executed after 500 out of 1000 iterations. Within one node (2 and 4 processors on g55) the parallel data write method was faster, i.e. MPI-2 calls were more efficient than routing data through one process on this node. However, for larger configurations, writes through a master residing on node g55 were faster than even MPI-2 collective calls. The internode NFS throughput appeared to be lower than that of native MPI send/recvs combined with fast disk access from node g55, and lower than the (measured) rcp internode throughput.

Figure 4 shows execution times for the standard and ckptserv versions of PARCKPT, with one checkpoint/restart executed after 500 out of 1000 iterations. The ckpt server version is faster for configurations larger than one node. On one node, standard writes to separate files are faster than routing through one local process.

Fig. 3. PARUG: Execution Times of the Testbed Application with One Checkpoint/Restart

Fig. 4. PARCKPT: Execution Times of the Testbed Application with One Checkpoint/Restart
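The two PARUG write modes compared in this section differ only in who writes which bytes. The sketch below illustrates that difference with ordinary file calls instead of MPI: `os.pwrite` at a per-rank offset stands in for what MPI_File_write_at_all lets each process do, and a sequential loop stands in for rank 0 writing data it has received. The function names and the layout (blocks stored in rank order) are assumptions made for illustration, not the library's file format.

```python
import os
import tempfile
from itertools import accumulate


def block_offsets(sizes):
    """Exclusive prefix sum: the file offset at which each rank's block starts."""
    return [0] + list(accumulate(sizes))[:-1]


def parallel_write(path, blocks):
    """Every rank writes its own block at its own offset in one shared file."""
    offsets = block_offsets([len(b) for b in blocks])
    fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o600)
    try:
        # With MPI each process would issue its own write; here we simply loop.
        for offset, block in zip(offsets, blocks):
            os.pwrite(fd, block, offset)
    finally:
        os.close(fd)


def master_write(path, blocks):
    """Rank 0 has received all blocks and writes them one after another."""
    with open(path, "wb") as f:
        for block in blocks:
            f.write(block)


if __name__ == "__main__":
    blocks = [bytes([r]) * (r + 1) for r in range(4)]   # uneven per-rank data
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "par.ckpt")
        b = os.path.join(d, "master.ckpt")
        parallel_write(a, blocks)
        master_write(b, blocks)
        with open(a, "rb") as fa, open(b, "rb") as fb:
            print(fa.read() == fb.read())
```

Both paths produce the same file, so the choice between them is purely a performance question: the results above suggest that concurrent writes win on a local file system, while funnelling data through one process on the file server node wins when the alternative is writing over NFS.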

4.3 Comparison of Parallel MPI Checkpointing Methods

Finally, we compared PARUG and PARCKPT (combinations of their best subversions) against each other and against LAM/BLCR (Table 1, Figures 5 and 6). Both PARCKPT and LAM/BLCR use sequential checkpointing libraries in a parallel MPI environment.

Table 1. Comparison of Tested Parallel Checkpointing Methods

Method            | Subversion        | Ckpt+Restart Time
PARUG             | MPI-2 version     | slow on NFS, fast on SMP node
PARUG             | designated master | fast on NFS, slow on SMP node
PARCKPT           | std local writes  | slow on NFS, fast on SMP node
PARCKPT           | ckptserv          | fast on NFS, slow on SMP node
LAM/MPI with BLCR |                   | slower than PARUG, faster than PARCKPT for smaller sizes, slower for larger

Features:
PARUG: flexible, not transparent to the programmer, fast, packs only necessary data, can restart on a different number of processes than checkpointed, limited programming model (extendable with some programming effort).
PARCKPT: theoretically (almost) fully transparent although it requires synchronous operations, limited set of MPI functions supported now, fast, checkpoints larger than PARUG and LAM/BLCR, uses Linux-specific memory mappings for high performance.
LAM/MPI with BLCR: fully transparent to the programmer, easy to use, checkpoints smaller than PARCKPT and only slightly larger than PARUG (*).

* For 1-node configurations, checkpoints of the application processes appeared several seconds earlier than the mpirun checkpoint. The application processes were working until that time. The former yields times very close to (but longer than) PARUG, although the mpirun checkpoint is required to restart.

PARUG is the fastest (see LAM/BLCR note * in Table 1) since it packs/unpacks the least data and, apparently, because of fast collective MPI-2 calls within one SMP node. On larger configurations it uses one designated master on node g55. It is followed by:

On 2 and 4 processors (one node): LAM/BLCR generates smaller checkpoints than PARCKPT. This can account for LAM/BLCR being faster for smaller sizes. For larger checkpoints LAM/BLCR was slower. [10] and [14] list performance limitations of BLCR for larger checkpoints, namely in the VMADump module used by BLCR. These are ([10]) writing memory pages using separate write() calls and making copies of pages while checkpointing, which can cause memory overuse and swapping. Unlike PARCKPT, LAM/BLCR empties the network of pending messages while keeping the application working. Also see note * in Table 1.

On 8-16 processors: PARCKPT is faster than LAM/BLCR because it sends checkpoints from the processes via TCP to ckptserv on node g55 rather than saving locally via the slow NFS as LAM/BLCR does. LAM/BLCR failed to run any MPI application on more than two nodes (cr_init() failed).

We also assessed the overhead of the following components for the testbed application without checkpointing, compared to a standard MPI application without checkpointing:

LAM/MPI with BLCR: no measurable overhead,
ckptserv: no measurable overhead compared to the standard PARCKPT version,
PARCKPT: the overhead due to the additional wrappers, shared memory communication and signal synchronization ranges from 2% on 4 processors to 6% on 16 processors for the domain size of 128MB.

Fig. 5. Comparison of Checkpointing Approaches: Execution Times of the Testbed Application with One Checkpoint/Restart

Fig. 6. Comparison of Checkpointing Approaches: Checkpoint/Restart Times

5 Summary and Future Work

We have presented the design of two new checkpointing libraries and showed that they offer better performance than LAM/BLCR for large checkpoints in a specific (NFS) environment, at the cost of a constrained application model. For the two solutions, we have investigated two subversions each: fast MPI-2 calls vs. a designated master process for PARUG, and local writes vs. ckptserv for PARCKPT. We showed that the latter options are faster on a shared NFS on two or more nodes, while the former are faster on single SMP nodes. NFS optimizations will be investigated as well. We plan to incorporate other checkpointing libraries into the PARCKPT scheme. Currently PARCKPT supports a limited set of MPI functions, which will be extended. We have also developed a parser for user applications which replaces MPI calls with PARCKPT-specific wrappers.
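What such a source-to-source substitution might look like can be sketched with a regular expression. This is not the authors' parser, whose design is not described here: the function name `substitute_wrappers` is hypothetical, and the assumption that a wrapper is named by prefixing the full MPI name (e.g. `RES_MPI_Send`) is ours.

```python
import re

# Matches an MPI function name only when it is followed by "(", so constants
# such as MPI_COMM_WORLD and types such as MPI_Status are left alone.
_MPI_CALL = re.compile(r"\bMPI_(\w+)(?=\s*\()")


def substitute_wrappers(source, prefix="RES_"):
    """Prefix every MPI call in a C source string; the naming scheme is assumed."""
    return _MPI_CALL.sub(lambda m: prefix + m.group(0), source)


if __name__ == "__main__":
    line = "MPI_Send(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);"
    print(substitute_wrappers(line))
```

The pattern is idempotent, since an already prefixed name no longer starts at a word boundary. A purely textual rewrite like this would also touch MPI calls that appear inside comments or string literals, which is one reason a real tool would work on parsed source rather than raw text.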

References

1. Silva, L., Silva, J.: System-level versus user-defined checkpointing. In: Proceedings. Seventeenth IEEE Symposium on Reliable Distributed Systems. (1998)
2. Czarnul, P.: Programming, Tuning and Automatic Parallelization of Irregular Divide-and-Conquer Applications in DAMPVM/DAC. International Journal of High Performance Computing Applications 17 (2003)
3. CUMULVS: (Collaborative User Migration, User Library for Visualization and Steering) Distributed Computing Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory,
4. Zandy, V.C.: (ckpt library) edu/ zandy/ckpt/.
5. Condor Team, Attention: Professor Miron Livny, Dept of Computer Sciences, 1210 W. Dayton St., Madison, WI , (608) or Condor Team, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI: (The Condor Project, Condor's Checkpoint Mechanism)
6. Plank, J.S., Beck, M., Kingsley, G., Li, K.: libckpt: Transparent Checkpointing Under UNIX. Conference Proceedings USENIX Winter 1995 Technical Conference (1995)
7. Romanov, S., Malashonok, D.Y., Iskra, K., Gubala, T.: The Dynamite checkpointer 2.0. Faculty of Science, Informatics Institute. (2003)
8. Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. In: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, San Diego, California, USA (2003)
9. Sankaran, S., Squyres, J., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. Los Alamos Computer Science Institute (LACSI) Symposium (2003)
10. Duell, J., Hargrove, P., Roman, E.: The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. In: Future Technologies Group white paper. (2003)
11. Franck Cappello, Project Leader et al.: (MPICH-V: MPI implementation for volatile resources) bouteill/mpich-v.
12. Czarnul, P., Grzeda, K.: Parallel Simulations of Electrophysiological Phenomena in Myocardium on Large 32 and 64-bit Linux Clusters. In: 11th European PVM/MPI Users Group Meeting Budapest, Hungary, September 19-22, Proceedings. (Volume 3241/2004.)
13. Message Passing Interface Forum: MPI-2: Extensions to the Message-Passing Interface. (1997) University of Tennessee, Knoxville, Tennessee.
14. Sankaran, S., Squyres, J., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, Los Alamos Computer Science Institute (LACSI) Symposium (2003)


More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

A Framework for Testing AIS Implementations

A Framework for Testing AIS Implementations A Framework for Testing AIS Implementations Tamás Horváth and Tibor Sulyán Dept. of Control Engineering and Information Technology, Budapest University of Technology and Economics, Budapest, Hungary {tom,

More information

Remote Task Submission and Publishing in BeesyCluster : Security and Efficiency of Web Service Interface

Remote Task Submission and Publishing in BeesyCluster : Security and Efficiency of Web Service Interface Remote Task Submission and Publishing in BeesyCluster : Security and Efficiency of Web Service Interface Paweł Czarnul, Michał Bajor, Marcin Fraczak, Anna Banaszczyk, Marcin Fiszer and Katarzyna Ramczykowska

More information

Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems. Robert Grimm University of Washington

Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems. Robert Grimm University of Washington Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems Robert Grimm University of Washington Extensions Added to running system Interact through low-latency interfaces Form

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System

Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System Lazy Agent Replication and Asynchronous Consensus for the Fault-Tolerant Mobile Agent System Taesoon Park 1,IlsooByun 1, and Heon Y. Yeom 2 1 Department of Computer Engineering, Sejong University, Seoul

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet Condor and BOINC Distributed and Volunteer Computing Presented by Adam Bazinet Condor Developed at the University of Wisconsin-Madison Condor is aimed at High Throughput Computing (HTC) on collections

More information

Rollback-Recovery Protocols for Send-Deterministic Applications. Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello

Rollback-Recovery Protocols for Send-Deterministic Applications. Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello Rollback-Recovery Protocols for Send-Deterministic Applications Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello Fault Tolerance in HPC Systems is Mandatory Resiliency is

More information

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING

AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING AN EMPIRICAL STUDY OF EFFICIENCY IN DISTRIBUTED PARALLEL PROCESSING DR. ROGER EGGEN Department of Computer and Information Sciences University of North Florida Jacksonville, Florida 32224 USA ree@unf.edu

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

Influence of the Progress Engine on the Performance of Asynchronous Communication Libraries

Influence of the Progress Engine on the Performance of Asynchronous Communication Libraries Influence of the Progress Engine on the Performance of Asynchronous Communication Libraries Edgar Gabriel Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Adaptive Runtime Support

Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Similarities and Differences Between Parallel Systems and Distributed Systems

Similarities and Differences Between Parallel Systems and Distributed Systems Similarities and Differences Between Parallel Systems and Distributed Systems Pulasthi Wickramasinghe, Geoffrey Fox School of Informatics and Computing,Indiana University, Bloomington, IN 47408, USA In

More information

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems fastos.org/molar Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems Jyothish Varma 1, Chao Wang 1, Frank Mueller 1, Christian Engelmann, Stephen L. Scott 1 North Carolina State University,

More information

I/O in the Gardens Non-Dedicated Cluster Computing Environment

I/O in the Gardens Non-Dedicated Cluster Computing Environment I/O in the Gardens Non-Dedicated Cluster Computing Environment Paul Roe and Siu Yuen Chan School of Computing Science Queensland University of Technology Australia fp.roe, s.chang@qut.edu.au Abstract Gardens

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments

More information

Developing a Thin and High Performance Implementation of Message Passing Interface 1

Developing a Thin and High Performance Implementation of Message Passing Interface 1 Developing a Thin and High Performance Implementation of Message Passing Interface 1 Theewara Vorakosit and Putchong Uthayopas Parallel Research Group Computer and Network System Research Laboratory Department

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs

Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens School of Electrical

More information

SELF-HEALING NETWORK FOR SCALABLE FAULT TOLERANT RUNTIME ENVIRONMENTS

SELF-HEALING NETWORK FOR SCALABLE FAULT TOLERANT RUNTIME ENVIRONMENTS SELF-HEALING NETWORK FOR SCALABLE FAULT TOLERANT RUNTIME ENVIRONMENTS Thara Angskun, Graham Fagg, George Bosilca, Jelena Pješivac Grbović, and Jack Dongarra,2,3 University of Tennessee, 2 Oak Ridge National

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

ScalaIOTrace: Scalable I/O Tracing and Analysis

ScalaIOTrace: Scalable I/O Tracing and Analysis ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,

More information

Transparent Checkpoint and Restart Technology for CUDA applications. Taichiro Suzuki, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology

Transparent Checkpoint and Restart Technology for CUDA applications. Taichiro Suzuki, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology Transparent Checkpoint and Restart Technology for CUDA applications Taichiro Suzuki, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology Taichiro, SUZUKI 2010.4 ~ 2014.3 Bachelor course at Tokyo

More information

MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption

MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption Marc Pérache, Patrick Carribault, and Hervé Jourdren CEA, DAM, DIF F-91297 Arpajon, France {marc.perache,patrick.carribault,herve.jourdren}@cea.fr

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

Analysis of the Component Architecture Overhead in Open MPI

Analysis of the Component Architecture Overhead in Open MPI Analysis of the Component Architecture Overhead in Open MPI B. Barrett 1, J.M. Squyres 1, A. Lumsdaine 1, R.L. Graham 2, G. Bosilca 3 Open Systems Laboratory, Indiana University {brbarret, jsquyres, lums}@osl.iu.edu

More information

IOS: A Middleware for Decentralized Distributed Computing

IOS: A Middleware for Decentralized Distributed Computing IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Programming Model Support for Dependable, Elastic Cloud Applications

Programming Model Support for Dependable, Elastic Cloud Applications Programming Model Support for Dependable, Elastic Cloud Applications Wei-Chiu Chuang, Bo Sang, Charles Killian, Milind Kulkarni Department of Computer Science, School of Electrical and Computer Engineering

More information

PAC094 Performance Tips for New Features in Workstation 5. Anne Holler Irfan Ahmad Aravind Pavuluri

PAC094 Performance Tips for New Features in Workstation 5. Anne Holler Irfan Ahmad Aravind Pavuluri PAC094 Performance Tips for New Features in Workstation 5 Anne Holler Irfan Ahmad Aravind Pavuluri Overview of Talk Virtual machine teams 64-bit guests SMP guests e1000 NIC support Fast snapshots Virtual

More information

Scalable Fault Tolerance Schemes using Adaptive Runtime Support

Scalable Fault Tolerance Schemes using Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

Approaches to Parallel Computing

Approaches to Parallel Computing Approaches to Parallel Computing K. Cooper 1 1 Department of Mathematics Washington State University 2019 Paradigms Concept Many hands make light work... Set several processors to work on separate aspects

More information

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003 New efficient large-scale fully asynchronous parallel algorithm for calculation of canonical MP2 energies. Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University,

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

A Case for Standard Non-Blocking Collective Operations

A Case for Standard Non-Blocking Collective Operations A Case for Standard Non-Blocking Collective Operations T. Hoefler,2, P. Kambadur, R. L. Graham 3, G. Shipman 4 and A. Lumsdaine Open Systems Lab 2 Computer Architecture Group Indiana University Technical

More information

Chapter 11: Implementing File

Chapter 11: Implementing File Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Operating System: an Overview. Lucia Dwi Krisnawati, MA

Operating System: an Overview. Lucia Dwi Krisnawati, MA Operating System: an Overview Lucia Dwi Krisnawati, MA What is an Operating System? A program that acts as an intermediary between a user of a computer and the computer hardware. Operating system goals:

More information

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,

More information

MarkLogic Server. Database Replication Guide. MarkLogic 9 May, Copyright 2017 MarkLogic Corporation. All rights reserved.

MarkLogic Server. Database Replication Guide. MarkLogic 9 May, Copyright 2017 MarkLogic Corporation. All rights reserved. Database Replication Guide 1 MarkLogic 9 May, 2017 Last Revised: 9.0-3, September, 2017 Copyright 2017 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents Database Replication

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint?

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint? What is Checkpoint libraries Bosilca George bosilca@cs.utk.edu Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine. interruption

More information

A trace-driven analysis of disk working set sizes

A trace-driven analysis of disk working set sizes A trace-driven analysis of disk working set sizes Chris Ruemmler and John Wilkes Operating Systems Research Department Hewlett-Packard Laboratories, Palo Alto, CA HPL OSR 93 23, 5 April 993 Keywords: UNIX,

More information

On the Survivability of Standard MPI Applications

On the Survivability of Standard MPI Applications On the Survivability of Standard MPI Applicions Anand Tikotekar 1, Chokchai Leangsuksun 1 Stephen L. Scott 2 Louisiana Tech University 1 Oak Ridge Nional Laborory 2 box@lech.edu 1 a007@lech.edu 1 scottsl@ornl.gov

More information

MPI - Today and Tomorrow

MPI - Today and Tomorrow MPI - Today and Tomorrow ScicomP 9 - Bologna, Italy Dick Treumann - MPI Development The material presented represents a mix of experimentation, prototyping and development. While topics discussed may appear

More information

Computer Architecture and OS. EECS678 Lecture 2

Computer Architecture and OS. EECS678 Lecture 2 Computer Architecture and OS EECS678 Lecture 2 1 Recap What is an OS? An intermediary between users and hardware A program that is always running A resource manager Manage resources efficiently and fairly

More information

Towards Breast Anatomy Simulation Using GPUs

Towards Breast Anatomy Simulation Using GPUs Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA

More information

Design and Implementation of a Monitoring and Scheduling System for Multiple Linux PC Clusters*

Design and Implementation of a Monitoring and Scheduling System for Multiple Linux PC Clusters* Design and Implementation of a Monitoring and Scheduling System for Multiple Linux PC Clusters* Chao-Tung Yang, Chun-Sheng Liao, and Ping-I Chen High-Performance Computing Laboratory Department of Computer

More information

Process Migration for Resilient Applications

Process Migration for Resilient Applications Process Migration for Resilient Applications TR11-004 Kathleen McGill* Thayer School of Engineering Dartmouth College Hanover, NH 03755 Kathleen.N.McGill@Dartmouth.edu Stephen Taylor Thayer School of Engineering

More information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A Parallel Evolutionary Algorithm for Discovery of Decision Rules A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl

More information

An Anomaly in Unsynchronized Pointer Jumping in Distributed Memory Parallel Machine Model

An Anomaly in Unsynchronized Pointer Jumping in Distributed Memory Parallel Machine Model An Anomaly in Unsynchronized Pointer Jumping in Distributed Memory Parallel Machine Model Sun B. Chung Department of Quantitative Methods and Computer Science University of St. Thomas sbchung@stthomas.edu

More information