What is a checkpoint? Why do we need it? When and where should we checkpoint? Who needs it? A survey of checkpoint libraries.


Checkpoint libraries
George Bosilca (bosilca@cs.utk.edu)

What is a checkpoint?
Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine. (Literally, a checkpoint is a place where something is halted for inspection.) After an interruption, execution resumes from the saved state instead of from the beginning.

Why do we need it?
For safety or for load balancing:
- Fault tolerance, when MTBF < execution time or the environment is unstable
- Dynamic load balancing
- Resource constraints

Where to checkpoint to?
It depends on the usage, but always to a safe place:
- Reliable storage (e.g. disk)
- A reliable server
Losing a checkpoint normally requires restarting the application.

When to checkpoint?
One of the most difficult questions. Checkpointing too often decreases performance; checkpointing too rarely means we can lose an important amount of computation when we restart. The right interval depends on the size of the checkpoint, and/or on the application's algorithm, and on the reliability of the host system.

Who needs checkpointing?
Those interested in increasing reliability through time redundancy.

How to checkpoint?
It seems easy, but it finally requires answering a lot of questions.

The initial state of the program
There are different executable formats (a.out, ELF, COFF, PEF, ...), but they contain mostly the same information; the difference is how they store it and how they reference it. An executable contains (non-exhaustive list):
- the instruction block (text segment)
- the initialized data
- the size of the uninitialized data segment

The state of a process
The OS loads the instructions and data into different sections of the virtual address space of the process, initializes the stack and the heap, loads the shared libraries (when required), and hands control to the entry point of the application. The state of a process therefore includes the data sections, the stack, the heap, all allocated memory, and the register values, plus other information available only at the kernel level, such as:
- the state of open devices (e.g. files, sockets)
- signals and timers
- shared memory segments (shmem)

What should we keep?
Quick answer: as much information as possible.
- Data segment: the initial data plus the data allocated by the application
- Stack and heap
- Register values
- Other state, if available

Data areas
We need a known range of addresses or some OS support. Linux delivers this information in /proc/$pid/maps:

08048000-0808b000 r-xp 00000000 03:02 1515230 /bin/tcsh
0808b000-08090000 rw-p 00042000 03:02 1515230 /bin/tcsh
08090000-081a9000 rwxp 00000000 00:00 0
40000000-40013000 r-xp 00000000 03:02 1776145 /lib/ld-2.2.5.so

Stack and heap
We only need to save the active regions; the unused parts of the stack and heap can be skipped. The top of the heap can be obtained using sbrk(0), the bottom of the stack is in /proc/$pid/maps, and the top of the stack is the address of the last local variable.

Registers
setjmp() and longjmp() provide this functionality (similar to a non-local goto):

#include <setjmp.h>

jmp_buf envjmp;

void foo(void)
{
    longjmp(envjmp, 1);
}

int main(int argc, char* argv[])
{
    if (!setjmp(envjmp)) {
        /* initial execution */
        foo();
    } else {
        /* back from longjmp */
    }
    return 0;
}

Other state
File descriptors, sockets, and shared memory are not completely available directly in user space. OS help is required, which is non-trivial and requires kernel modifications; alternatively, the user application can restore them itself, since the user has additional knowledge about them.

Who is responsible for doing it?
Several approaches are possible:
- The operating system: a modification of the kernel, or a kernel module
- The user level: an additional library linked to the program, or a shared library inserted with LD_PRELOAD
- The user application itself

OS level
Idea: ask the entity that has all the information (the kernel) to do the checkpoint.
Advantages:
- Transparent: no changes to the application
- All information can be saved (but maybe not restored): file descriptors, sockets, shared memory segments, thread information
Drawbacks:
- All data will be saved, including unused data
- Even when we have the information, it can be impossible to restore everything

User level
Idea: let a specialized library do the checkpoint.
Advantages:
- Can decrease the size of the checkpoint
- Still transparent in some situations
- The user can describe the data to be saved via the library API
Drawbacks:
- Requires an additional library
- Sometimes requires recompilation or relinking

Application level
Idea: let the entity having ALL the knowledge about the behavior of the application do the checkpoint.
Advantages:
- Dramatically decreases the size of the checkpoint (?)
- Everything can be restored (including files)
Drawbacks:
- Not transparent; additional complexity in the application
- All the complexity of the checkpoint falls on the programmer

Multi-level and optimizations
Multi-level libraries combine the user level and the OS level: improved data handling AND file descriptors, and they are easy to integrate into complex applications.
Optimizations: we want to have everything faster, without overhead.

Incremental checkpoint
Idea: decrease the size of the checkpoint.
- Detect the memory pages changed since the last checkpoint. This requires a memory page protection mechanism (mprotect) and a way to trap the SIGSEGV signal.
- Or compute a local difference between the last checkpoint and the current one, which requires additional local disk space.
Benefits: less data to checkpoint, so the time spent checkpointing and the time spent saving the checkpoint both decrease.
Costs: protecting memory pages is a costly operation (one trap for every protected page), and computing the local difference involves I/O operations.

Forked checkpoint
Idea: at checkpoint time, fork the process, then checkpoint the parent while the child continues the execution.
- Saving the checkpoint is overlapped with the execution of the application.
- Requires nearly 2 times more memory, except when copy-on-write can be used.

Checkpoint compression
Idea: compress the checkpoint on the fly. This decreases the disk space required for the checkpoint but increases the computation; it is an improvement only if the disk is slower than the compression.

Checkpointing parallel applications
Transparency: application level, MP API + fault management, or automatic.
- application ckpt: the application stores intermediate results and restarts from them
- MP API + FM: the message-passing API returns errors to be handled by the programmer
- automatic: the runtime detects faults and handles recovery

Checkpoint coordination: none, coordinated, or uncoordinated.
- coordinated: all processes are synchronized and the network is flushed before the checkpoint; all processes roll back from the same snapshot
- uncoordinated: each process checkpoints independently of the others, and each process is restarted independently of the others

Message logging: none, pessimistic, optimistic, or causal.
- pessimistic: all messages are logged on reliable media and used for replay
- optimistic: all messages are logged on non-reliable media; if one node fails, replay is done according to the other nodes' logs, and if more than one node fails, all roll back to the last coherent checkpoint
- causal: optimistic logging plus an antecedence graph, which reduces the recovery time

Checkpoint libraries

Sprite [Douglis, Ousterhout, 1991]
- Task migration, transparent
- Remote procedure calls
- Kernel level
- No fault tolerance

Condor [Litzkow, Livny, Tannenbaum, 1991]
- Task migration, transparent
- User level
- Includes checkpoint servers
- Compression
- No parallel applications

Clip [Chen, Li, Plank, 1997]
- Not cross-platform
- Parallel applications
- Global synchronization (Chandy-Lamport algorithm)

Checkpoint libraries (continued)

Libckpt [Plank, Beck, Kingsley, Li, 1994]
- Transparent (user-configurable)
- User level
- Non-blocking, incremental
- No compression, no parallel applications, no server

Cocheck [Stellner, 1996] / NetSolve [Plank, Casanova, Beck, Dongarra, 1999]
- Based on Condor mechanisms
- Dedicated to parallel applications
- Global synchronization (Chandy-Lamport algorithm)

MPI-FT [Louca, Neophytou, Lachanas, Evripidou, 2000]
- Transparent
- Optimistic log: decentralized, tolerates only one fault
- Pessimistic log: centralized, tolerates an arbitrary number of faults

Classification of fault-tolerant message-passing systems (automatic vs. non-automatic, checkpoint-based vs. log-based):
- Automatic, optimistic log (sender-based): Optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; Pruitt 98, 2 faults, sender-based [PRU98]
- Automatic, causal log: Manetho, n faults [EZ92]; Egida [RAV99]
- Automatic, pessimistic log: Sender-based Message Logging, 1 fault, at the communication-library level [JZ87]; MPI-FT, n faults, centralized server [LNLE00]; MPICH-V, n faults, distributed logging
- Automatic, coordinated checkpoint: Cocheck, a framework independent of MPI [Ste96]; Starfish, an enrichment of MPI [AF99]; Clip, a semi-transparent API [CLP97]
- Non-automatic: MPI/FT, redundancy of tasks [BNC01]; FT-MPI, modification of the MPI routines with user fault treatment [FD00]

Coordinated checkpoint (Chandy/Lamport)
The objective is to checkpoint the application when there are no in-transit messages between any two nodes: on failure detection, a global stop, a synchronized checkpoint with a network flush, then restart. The global synchronization and network flush make this approach not scalable.

Uncoordinated checkpoint
No global synchronization (scalable). Nodes may checkpoint at any time, independently of the others, but nondeterministic events, in particular in-transit messages, need to be logged.

[Figure: sketch of an execution with a crash, on a pseudo time scale. The worst condition is an in-transit message combined with a checkpoint; after process 2 crashes, it rolls back to its latest checkpoint image.]