An introduction to checkpointing for scientific applications


damien.francois@uclouvain.be UCL/CISM An introduction to checkpointing for scientific applications November 2016 CISM/CÉCI training session

What is checkpointing?

Without checkpointing:
    $ ./count
    1
    2
    3^C
    $ ./count
    1
    2
    3

With checkpointing:
    $ ./count
    1
    2
    3^C
    $ ./count
    4
    5
    6

Checkpointing: 'saving' a computation so that it can be resumed later (rather than started again)

Why do we need checkpointing?

Goals of checkpointing in HPC:
1. Fit in time constraints
2. Debugging, monitoring
3. Cope with NODE_FAILs
4. Gang scheduling and preemption

The idea: save the program state (values in variables, open files, position in the code, ...) every time a checkpoint is encountered (a signal or an event...), and restart from there upon an (un)planned stop rather than bootstrap again from scratch (starting loops at iteration 0, creating tmp files...).

The key questions...
Transparency for developer: Do I need to write a lot of additional code?
Portability to other systems: Can I stop on one system and restart on another?
Size of state to save: How many GB of disk does it require?
Checkpointing overhead: How many FLOPs lost to ensure checkpointing?

Who's in charge of all that?

                         Transparency    Portability to   Size of state   Checkpointing
                         for developer   other systems    to save         overhead
the application itself   --              +++              --              -
a library                -               ++               --              -
the compiler             +               ++               -               +
a run-time               +               +                ++              +
the OS                   ++              -                ++              ++
the hardware             +++             --               +++             +++

Today's agenda: How to make your program checkpoint-able
-> concepts and examples
-> recipes (design patterns) and signals
-> Slurm
-> parallel checkpointing

So you can play On hmem: ~dfr/checkpointing.tgz

1 Making a program checkpoint-able by saving its state every iteration and looking for a state file on startup.

Without checkpointing:
    $ ./count
    1
    2
    3^C
    $ ./count
    1
    2
    3

With checkpointing:
    $ ./count
    1
    2
    3^C
    $ ./count
    4
    5
    6

The general recipe
1. Look for a state file (the name can be hardcoded or, better, passed as a parameter)
2. If found, restore the state (initialize all variables with the content of the state file); else, bootstrap (create the initial state)
3. Periodically save the state
In the previous example:
    The state is just an integer
    "Periodically" means at each iteration
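The three steps of the recipe can be sketched in Python as follows. This is only an illustration, not the code from the original slides: the function names (`load_state`, `save_state`, `count`), the file name, and the choice of printing a fixed number of `steps` per run are all assumptions made for the example.

```python
import os
import time


def load_state(path):
    """Steps 1 and 2: restore the counter if a state file exists, else bootstrap at 0."""
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return 0


def save_state(path, i):
    """Step 3: write to a temporary file first, then rename it over the old state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(i))
    os.replace(tmp, path)  # the rename avoids leaving a half-written state file


def count(path, steps, delay=0.0):
    """Print `steps` more numbers, resuming after the last checkpointed one."""
    last = load_state(path)
    for i in range(last + 1, last + steps + 1):
        print(i)
        save_state(path, i)  # "periodically" here means at each iteration
        time.sleep(delay)
        last = i
    return last
```

Calling `count("count.state", 3)` twice in a row prints 1 2 3 and then 4 5 6; deleting the (hypothetical) `count.state` file makes the next run start from 1 again.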

So you can play
1. Translate 'count' into your favorite language
2. Adapt it to enable checkpointing

Python recipe

R recipe

Octave recipe

Fortran recipe

C recipe

Java recipe

2 Using UNIX signals to reduce overhead: do not save the state at each iteration, wait for the signal.

UNIX processes can receive 'signals' from the user, the OS, or another process
    ^C    ^D    kill -9    kill    ^Z    fg, bg

UNIX processes can receive 'signals' with an associated default action

UNIX processes can receive 'signals' and handle ('trap') them

The general recipe
1. Register a signal handler (a function that will modify a global variable when receiving a signal)
2. Test the value of the global variable periodically (at a moment when the state is consistent and easy to recreate)
3. If the value indicates so, save the state to disk (and optionally stop gracefully)
In the previous example:
    The state is just an integer
    "Periodically" means at each iteration
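A minimal Python sketch of this signal-based recipe is given below. It is not the code from the original slides; the choice of SIGUSR1, the helper names (`on_signal`, `install_handler`, `run`) and the decision to stop after writing the checkpoint are assumptions made for illustration. It relies on POSIX signals, so it is not expected to work on Windows.

```python
import os
import signal

checkpoint_requested = False  # global flag, only ever set by the handler


def on_signal(signum, frame):
    """Step 1: the handler just flips the flag; no I/O happens here."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler(signum=signal.SIGUSR1):
    """Register the handler (SIGUSR1 is an assumption; POSIX only)."""
    signal.signal(signum, on_signal)


def load_state(path):
    """Return the saved counter, or 0 when there is no state file."""
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return 0


def save_state(path, i):
    """Write to a temporary file first, then rename it over the old state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(i))
    os.replace(tmp, path)


def run(path, steps):
    """Count `steps` numbers; on a signal, save the state and stop early.

    Returns (last number printed, whether a checkpoint was written).
    """
    global checkpoint_requested
    i = load_state(path)
    for _ in range(steps):
        i += 1
        print(i)
        if checkpoint_requested:       # step 2: the state is consistent here
            save_state(path, i)        # step 3: dump the state...
            checkpoint_requested = False
            return i, True             # ...and stop gracefully
    return i, False
```

With the handler installed, sending `kill -USR1 <pid>` to the running process makes the loop write its state at the next iteration boundary instead of at every iteration, which is where the overhead reduction comes from.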

So you can play Adapt your program to handle signals

Useful links
C: http://www.gnu.org/software/libc/manual/html_node/basic-signalhandling.html#basic-signal-handling
Fortran: https://gcc.gnu.org/onlinedocs/gcc-4.4.1/gfortran/signal.html
Python: https://docs.python.org/2/library/signal.html
Octave: http://octave.sourceforge.net/octave/function/octave_core_fle_name.html
R: https://stat.ethz.ch/r-manual/r-devel/library/base/html/signals.html
Java: http://docs.oracle.com/javase/7/docs/api/java/lang/interruptedexception.html

Previous C recipe

C signal recipe


Fortran signal recipe

Java signal recipe

Python signal recipe

Octave signal recipe

R signal recipe

3 Use Slurm signaling abilities to manage checkpoint-able software in Slurm scripts on the clusters.

scancel is used to send signals to jobs

Example (Python signal recipe): use scancel --signal USR1 $SLURM_JOB_ID to force a state dump for reviewing/debugging

Use the --signal option to have Slurm send signals automatically before the end of the allocation

Set a non-zero return code when stopping because of a received signal (Fortran signal recipe)

Then you can have your job re-queued automatically

Set a non-zero exit code
C: exit(1)
Fortran: stop 1
Octave: exit( 1 )
R: quit( status=1 )
Python: sys.exit( 1 )
Java: System.exit( 1 )
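As an illustration of how the exit code can tell "finished" apart from "stopped by a signal", here is a hedged Python sketch. The names `work` and `main`, the use of SIGUSR1, and the step counter standing in for real computation are assumptions made for this example; saving the state itself is left out to keep the focus on the return code.

```python
import signal
import sys

stop_requested = False  # set by the handler, read by the main loop


def on_signal(signum, frame):
    """Remember that a signal arrived; the loop decides when to stop."""
    global stop_requested
    stop_requested = True


def work(total_steps, start=0):
    """Return (last_step, exit_code): 0 when finished, 1 when interrupted."""
    step = start
    while step < total_steps:
        step += 1                  # one unit of (placeholder) work
        if stop_requested:
            # the state would be written to disk here before leaving
            return step, 1
    return step, 0


def main():
    """Entry point of the script: propagate the exit code to the shell."""
    signal.signal(signal.SIGUSR1, on_signal)   # SIGUSR1 is an assumption
    _, code = work(10)
    sys.exit(code)
```

A script would call `main()` as its last statement; the surrounding job script can then inspect the exit status to decide whether the computation is complete or should be run again.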

Or chain the jobs...

Using a signal-based watchdog to re-queue the job just before it is killed

4 Parallel programs are better checkpointed after a global synchronization.

In the fork-join model, checkpoint after a join and before a fork:
    It easily ensures state consistency
    It allows restarting with a different number of threads
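The fork-join idea can be sketched in Python with a thread pool. This is an illustrative example under assumptions, not code from the slides: the workload (summing ranges of integers), the JSON state file and the names `partial_sum` and `run` are all made up for the purpose. The checkpoint is written only once every worker of the round has finished, so the saved state never depends on how the work was split.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def load_state(path):
    """Restore the global state, or bootstrap it when no file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"round": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file first, then rename it over the old state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def partial_sum(chunk):
    """Work done by one thread between a fork and a join."""
    return sum(chunk)


def run(path, rounds, workers):
    """Run until `rounds` rounds are done, checkpointing after every join."""
    state = load_state(path)
    while state["round"] < rounds:
        r = state["round"]
        data = range(r * 100, (r + 1) * 100)                 # work of this round
        chunks = [data[k::workers] for k in range(workers)]
        with ThreadPoolExecutor(max_workers=workers) as pool:  # fork
            parts = list(pool.map(partial_sum, chunks))
        # leaving the "with" block waits for all threads: this is the join
        state = {"round": r + 1, "total": state["total"] + sum(parts)}
        save_state(path, state)   # checkpoint here: no thread is running
    return state["total"]
```

Because only the joined result is saved, a run started with `workers=2` can be resumed later with `workers=3` and still reach the same total.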

5 Use programs and libraries that give other programs checkpoint/restart capabilities.

Such a program needs to:
1. Access the process' memory (the c/r program forks itself as the process, or uses a kernel module)
2. Access the processor state at any moment (it uses signals to interrupt the process and provoke storage of the registers on the stack)
3. Track the state-changing actions (fork, exec, system, etc.) (it wraps standard library functions with LD_PRELOAD'ed custom functions)
4. Inject checkpointing code in the program (it LD_PRELOADs a library with signal handlers)

LD_PRELOAD magic


damien.francois@uclouvain.be UCL/CISM Summary, Wrap-up and Conclusions. October 2014 CISM/CÉCI training session

Never again force your users to click 'Discard'...

Make initializations conditional
Save minimal reconstructible state periodically
Save full workspace upon signal
Checkpoint after a synchronization

So you can play
Adapt your own program for checkpoint/restart:
    What is the minimal reconstructible state?
    What file format for the checkpoint?
    What frequency / what signal?
    What start strategy: look for a checkpoint file vs. a command-line parameter, etc.?
    What initialization should I modify?
    What files should I re-open?