Checkpoint libraries
George Bosilca
bosilca@cs.utk.edu

What is it?
Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine.
(interruption: a place where something is halted for inspection)

Why do we need it?
For safety or for load balancing:
- Fault tolerance: when the MTBF is shorter than the execution time, or the environment is unstable
- Load balancing: dynamic load balancing, resource constraints

Where to checkpoint?
Depends on the usage, but always to a safe place:
- Reliable storage (e.g. disk)
- A reliable server
Losing a checkpoint normally requires restarting the application.

When to checkpoint?
One of the most difficult questions:
- If we checkpoint often, we decrease the performance
- If we checkpoint rarely, we can lose an important amount of computation when we restart
- It depends on the size of the checkpoint, on the application's algorithm, and on the reliability of the host system

Who needs it?
Those interested in increasing reliability through time redundancy.
How to?
It seems easy, but it finally requires answers to a lot of questions.

The initial state of the program
There are different executable formats (a.out, ELF, COFF, PEF...), but they contain mostly the same information; the difference is how they store it and how they reference it.
An executable contains (non-exhaustive list):
- The instruction block (text segment)
- The initialized data
- The size of the uninitialized data segment (BSS)

The state of a process
- The OS loads the instructions and data into different sections of the virtual address space of the process
- The OS initializes the stack and the heap
- It loads the shared libraries (when required)
- It gives the hand to the entry point of the application

The state of a process includes the data sections, the stack, the heap, all allocated memory, and the register values, plus other information available only at the kernel level, such as:
- The state of open devices (e.g. files, sockets)
- Signals, timers
- Shared memory segments (shmem)
What should we keep?
Quick answer: as much information as possible.
- Data segment: initial + allocated by the application
- Stack and heap
- Register values
- Others, if available

How to? The data area
A known range of addresses, or some OS support.
Linux delivers this information in /proc/$pid/maps:
08048000-0808b000 r-xp 00000000 03:02 1515230 /bin/tcsh
0808b000-08090000 rw-p 00042000 03:02 1515230 /bin/tcsh
08090000-081a9000 rwxp 00000000 00:00 0
40000000-40013000 r-xp 00000000 03:02 1776145 /lib/ld-2.2.5.so

Stack and heap
We only need to save the active regions; the unused parts of the stack and heap can be skipped.
- The top of the heap can be obtained using sbrk(0)
- The bottom of the stack is in /proc/$pid/maps
- The top of the stack: the address of the last local variable

Registers
setjmp() and longjmp() provide this functionality (similar to a non-local goto):

#include <setjmp.h>

jmp_buf envjmp;

void foo(void) {
    longjmp(envjmp, 1);
}

int main(int argc, char* argv[]) {
    if (!setjmp(envjmp)) {
        /* initial execution: the registers are saved in envjmp */
        foo();
    } else {
        /* back from longjmp: the registers have been restored */
    }
    return 0;
}

Others
Other resources, like file descriptors, sockets and shared memory, are not completely available directly in user space:
- OS help is required: non-trivial, requires kernel modifications
- Or let the user application restore them, as the user has additional knowledge about them
Who is responsible for doing it?
Several approaches are possible:
- The operating system: a modification of the kernel, or a kernel module
- At the user level: an additional library linked to the program, or a shared library inserted with LD_PRELOAD
- Directly the user application

OS level
Idea: ask the entity who has all the information (the kernel) to checkpoint.
Advantages:
- Transparent: no changes to the application
- All information can be saved (but maybe not restored): file descriptors, sockets, shared memory segments, thread information
Drawbacks:
- All data will be saved (including unused data)
- Even if we have the information, it can be impossible to restore everything

User level
Idea: let a specialized library do the checkpoint.
Advantages:
- Can decrease the size of the checkpoint
- Still transparent in some situations
- The user can describe the data to be saved via the library API
Drawbacks:
- Requires an additional library
- Sometimes requires recompilation or relinking

Application level
Idea: let the entity having ALL the knowledge about the behavior of the application do the checkpoint.
Advantages:
- Dramatically decreases the size of the checkpoint (?)
- Everything can be restored (including files...)
Drawbacks:
- Not transparent
- Additional complexity in the application
- All the complexity of the checkpoint falls on the programmer

Multi-level and optimizations
Multi-level libraries (user level + OS level): improved data handling AND file descriptors; easy to integrate in complex applications.
Optimizations: we want to checkpoint everything faster, without overhead.
Incremental checkpoint
Idea: decrease the size of the checkpoint.
- Detect the memory pages changed since the last checkpoint: requires a memory page protection mechanism (mprotect) and a way to trap the SIGSEGV signal
- Or compute a local difference between the last checkpoint and the current one: requires additional local disk space

Benefits:
- Less data to checkpoint => the time spent checkpointing decreases
- The time spent saving the checkpoint decreases
Drawbacks:
- Protecting memory pages is a costly operation (one trap for every protected page written)
- Computing the local difference involves I/O operations

Forked checkpoint
Idea: at checkpoint time, fork the process, then checkpoint the parent while the child continues the execution.
- Saving the checkpoint is overlapped with the execution of the application
- Requires nearly 2 times more memory, except when copy-on-write can be used

Checkpoint compression
Idea: on-the-fly compression of the checkpoint.
- Decreases the disk space required for the checkpoint, but increases the computations
- An improvement only if the disk is slower than the compression

Checkpoint in parallel applications
Transparency: application, MP API + fault management, automatic.
- Application-level ckpt: the application stores intermediate results and restarts from them
- MP API + FM: the message-passing API returns errors to be handled by the programmer
- Automatic: the runtime detects faults and handles recovery
Checkpoint coordination: none, coordinated, uncoordinated.
- Coordinated: all processes are synchronized and the network is flushed before the ckpt; all processes roll back from the same snapshot
- Uncoordinated: each process checkpoints independently of the others; each process is restarted independently of the others
Message logging: none, pessimistic, optimistic, causal.
- Pessimistic: all messages are logged on reliable media and used for replay
- Optimistic: all messages are logged on non-reliable media; if one node fails, the replay is done according to the other nodes' logs
- If more than one node fails, roll back to the last coherent checkpoint
- Causal: optimistic log + antecedence graph; reduces the recovery time

Checkpoint libraries

Sprite [Douglis, Ousterhout, 1991]
- Task migration
- Transparent
- Remote procedure calls
- Kernel level
- No fault tolerance

Condor [Litzkow, Livny, Tannenbaum, 1991]
- Task migration
- Transparent
- User level
- Includes checkpoint servers
- Compression
- No parallel applications

Clip [Chen, Li, Plank, 1997]
- Not cross-platform
- Parallel applications
- Global synchronization (Chandy-Lamport algorithm)
Checkpoint libraries: classification

Libckpt [Plank, Beck, Kingsley, Li, 1994]
- Transparent (user configurable)
- User level
- Non-blocking
- Incremental
- No compression
- No parallel applications
- No checkpoint server

Cocheck [Stellner, 1996] / Netsolve [Plank, Casanova, Beck, Dongarra, 1999]
- Based on Condor mechanisms
- Dedicated to parallel applications
- Global synchronization (Chandy-Lamport algorithm)

MPI-FT [Louca, Neophytou, Lachanas, Evripidou, 2000]
- Transparent
- Optimistic log: decentralized, only one fault
- Pessimistic log: centralized, arbitrary number of faults

Classification of fault-tolerant message-passing systems (reconstructed from the figure):
- Automatic
  - Coordinated checkpoint (Chandy/Lamport): Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; Clip, semi-transparent API [CLP97]
  - Log based
    - Optimistic log (sender based): optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; sender-based message logging, 1 fault [JZ87]; Pruitt 98, 2 faults, sender based [PRU98]
    - Causal log: Manetho, n faults [EZ92]; Egida [RAV99]
    - Pessimistic log: MPI-FT, n faults, centralized server [LNLE00]; MPICH-V, N faults, distributed logging
- Non-automatic (user fault treatment): MPI/FT, redundance of tasks [BNC01]; FT-MPI, modification of the MPI routines [FD00]
(The figure places these systems at the framework, API, and communication-library levels.)

Coordinated checkpoint (Chandy/Lamport)
- The objective is to checkpoint the application when there are no in-transit messages between any two nodes
- Global synchronization: network flush
- Not scalable

Uncoordinated checkpoint
- No global synchronization (scalable)
- Nodes may checkpoint at any time (independently of the others)
- Need to log the nondeterministic events: in-transit messages

[Figure: sketch of an execution with a crash — coordinated checkpoint (failure detection, global stop, sync, restart) vs. uncoordinated checkpoint; worst condition: an in-transit message at checkpoint time; after the crash, process 2 rolls back to its latest checkpoint image]