Autosave for Research Where to Start with Checkpoint/Restart

Similar documents
An introduction to checkpointing. for scientific applications

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand

Applications of DMTCP for Checkpointing

Checkpointing using DMTCP, Condor, Matlab and FReD

An introduction to checkpointing. for scientifc applications

Virtualization. ...or how adding another layer of abstraction is changing the world. CIS 399: Unix Skills University of Pennsylvania.

DMTCP: Fixing the Single Point of Failure of the ROS Master

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Application Fault Tolerance Using Continuous Checkpoint/Restart

Linux-CR: Transparent Application Checkpoint-Restart in Linux

Operating System Kernels

First Results in a Thread-Parallel Geant4

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Lecture 09: VMs and VCS head in the clouds

Snapify: Capturing Snapshots of Offload Applications on Xeon Phi Manycore Processors

CS 326: Operating Systems. Process Execution. Lecture 5

Practical High Performance Computing

October 23, CERN, Switzerland. BOINC Virtual Machine Controller Infrastructure. David García Quintas. Introduction. Development (ie, How?

Processes, Context Switching, and Scheduling. Kevin Webb Swarthmore College January 30, 2018

Distributed Systems 27. Process Migration & Allocation

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

Processes and Threads

Making Applications Mobile

Linux-CR: Transparent Application Checkpoint-Restart in Linux

Advanced Memory Management

Compute Node Linux (CNL) The Evolution of a Compute OS

Progress Report on Transparent Checkpointing for Supercomputing

Travis Cardwell Technical Meeting

Background: Operating Systems

MASSIVE SCALE USB DEVICE DRIVER FUZZ WITHOUT DEVICE. HC Tencent s XuanwuLab

Project #1: Tracing, System Calls, and Processes

HTCondor: Virtualization (without Virtual Machines)

IPC and Unix Special Files

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

Docker for HPC? Yes, Singularity! Josef Hrabal

Introduction to Parallel Programming

A process. the stack

Introduction to Visualization on Stampede

Process Migration via Remote Fork: a Viable Programming Model? Branden J. Moor! cse 598z: Distributed Systems December 02, 2004

Distributed Systems CSCI-B 534/ENGR E-510. Spring 2019 Instructor: Prateek Sharma

24-vm.txt Mon Nov 21 22:13: Notes on Virtual Machines , Fall 2011 Carnegie Mellon University Randal E. Bryant.

Deterministic Replay and Reverse Debugging for QEMU

CS5460: Operating Systems

Implementing a GDB Stub in Lightweight Kitten OS

Chapter 5 C. Virtual machines

Here to take you beyond. ECEP Course syllabus. Emertxe Information Technologies ECEP course syllabus

User-Space Debugging Simplifies Driver Development

Concurrent Processing in Client-Server Software

Microservice Deployment. Software Engineering II Sharif University of Technology MohammadAmin Fazli

CSCE Operating Systems Interrupts, Exceptions, and Signals. Qiang Zeng, Ph.D. Fall 2018

Processes. Johan Montelius KTH

Investigating Resilient HPRC with Minimally-Invasive System Monitoring

Linux Kernel Validation Tools. Nicholas Mc Guire Distributed & Embedded Systems Lab Lanzhou University, China

Administrative Details. CS 140 Final Review Session. Pre-Midterm. Plan For Today. Disks + I/O. Pre-Midterm, cont.

CSE 153 Design of Operating Systems Fall 18

CS 0449 Project 4: /dev/rps Due: Friday, December 8, 2017, at 11:59pm

Killing Zombies, Working, Sleeping, and Spawning Children

VEOS high level design. Revision 2.1 NEC

CONTAINERIZING JOBS ON THE ACCRE CLUSTER WITH SINGULARITY

What we saw. Desarrollo de Aplicaciones en Red. 1. OS Design. 2. Service description. 1.1 Operating System Service (1)

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores

Deterministic Process Groups in

The Use of Cloud Computing Resources in an HPC Environment

HPCC - Hrothgar. Getting Started User Guide TotalView. High Performance Computing Center Texas Tech University

Virtualization Device Emulator Testing Technology. Speaker: Qinghao Tang Title 360 Marvel Team Leader

Reverse Engineering Malware Dynamic Analysis of Binary Malware II

ECE 598 Advanced Operating Systems Lecture 23

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

4. Jump to *RA 4. StackGuard 5. Execute code 5. Instruction Set Randomization 6. Make system call 6. System call Randomization

From Handcraft to Unikraft:

NLUUG, Bunnik CloudABI: safe, testable and maintainable software for UNIX Speaker: Ed Schouten,

CSE543 - Computer and Network Security Module: Virtualization

Virtualization (II) SPD Course 17/03/2010 Massimo Coppola

Hypervisor security. Evgeny Yakovlev, DEFCON NN, 2017

CSE543 - Computer and Network Security Module: Virtualization

The failure of Operating Systems,

6.828: OS/Language Co-design. Adam Belay

Operating Systems Structure and Processes Lars Ailo Bongo Spring 2017 (using slides by Otto J. Anshus University of Tromsø/Oslo)

Yabusame: Postcopy Live Migration for Qemu/KVM

Sandboxing. (1) Motivation. (2) Sandboxing Approaches. (3) Chroot

Virtualisation: The KVM Way. Amit Shah

Appendix A GLOSSARY. SYS-ED/ Computer Education Techniques, Inc.

Performance Tools for Technical Computing

TOOLSMITHING AN IDA BRIDGE: A TOOL BUILDING CASE STUDY. Adam Pridgen Matt Wollenweber

OS DESIGN PATTERNS II. CS124 Operating Systems Fall , Lecture 4

containerization: more than the new virtualization

The benefits and costs of writing a POSIX kernel in a high-level language

Distributed Systems COMP 212. Lecture 18 Othon Michail

CodeDynamics 2018 Release Notes

Allinea DDT Debugger. Dan Mazur, McGill HPC March 5,

Debugging with TotalView

the murgiahack experiment Gianluca Guida

Chapter 2: Operating-System Structures

Red Hat enterprise virtualization 3.1 feature comparison

W11 Hyper-V security. Jesper Krogh.

CSEN 602-Operating Systems, Spring 2018 Practice Assignment 2 Solutions Discussion:

Avoiding the 16 Biggest DA & DRS Configuration Mistakes

Virtual Machines. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Virtualization. Michael Tsai 2018/4/16

Virtual Asymmetric Multiprocessor for Interactive Performance of Consolidated Desktops

Transcription:

Autosave for Research Where to Start with Checkpoint/Restart Brandon Barker Computational Scientist Cornell University Center for Advanced Computing (CAC) brandon.barker@cornell.edu Workshop: High Performance Computing on Stampede January 15, 2015 www.cac.cornell.edu

The problem 1. You test out your newly developed software on a small dataset. 2. All is well, you submit a big job and go read some papers or watch some TV. 3. Several days later, one of the following happens: Someone uses up all the memory on the system. Power failure Unplanned maintenance??? 1/13/2015 www.cac.cornell.edu 2

Ad-hoc solutions i.e. dodgy, incomplete, and error-prone solutions Save data every N iterations Takes time to code. May miss some data. Have to write custom resume code. For all of these, different sections of the program may need different save and restore procedures. For some tasks: run on discrete chunks of data. Works best when There are many data items. Each item can be processed quickly. There are no dependencies between items In short, embarrassingly parallel programs can use simple book-keeping for C/R. Still, it involves some work on the part of the researcher for restoration. 1/13/2015 www.cac.cornell.edu 3

What is C/R? Think of virtual machines: if you ve ever saved and restarted a virtual machine or emulator, you have used a type of C/R! Checkpoint: save the program state Program memory, open file descriptors, open sockets, process ids (PIDs), UNIX pipes, shared memory segments, etc. For distributed processes, need to coordinate checkpointing across processes. Restart: restart the process with saved state Some of the above require special permissions to restore (e.g. PIDs); not all C/R models can accommodate this. Others (like VMs) get it for free. Some of the above may be impossible to restore in certain contexts (e.g. sockets that have closed and cannot be re-established). 1/13/2015 www.cac.cornell.edu 4

Use cases of C/R Recovery/fault tolerance (restart after a crash). Save scientific interactive session: R, MATLAB, IPython, etc. Skip long initialization times. Interact with and analyze results of in-progress CPU-intensive process. Debugging Checkpoint image for ultimate in reproducibility. Make an existing debugger reversible Migrate process (or even multiple VMs). Robust exception handling in languages without exceptions or garbage collectors. 1/13/2015 www.cac.cornell.edu 5

Virtual Machine (VM) C/R VM-level C/R is relatively easy to implement, once you have a VM: the system is already isolated. Implementations Most any hypervisor platform: KVM, Virtualbox, VMWare, etc. Pros KVM is relatively lightweight. Very simple to use. Few surprises. Many applications supported; few limitations. Cons Operating in a VM context requires predefined partitioning of RAM and CPU resources. More overhead in most categories (storage of VM image, RAM snapshot, etc.). Still a challenge for multi-vm C/R. 1/13/2015 www.cac.cornell.edu 6

Containers with C/R Containers are a form of virtualization that uses a single OS kernel to run multiple, seemingly isolated, OS environments. Implementations OpenVZ CRIU C/R In Userspace (Not all containers support C/R) Pros Like VMs, enjoy the benefit of existing virtualization technology. Fewer surprises. Cons May incur additional overhead, due to C/R of unnecessary processes and storage. Still a challenge for multi-vm C/R. 1/13/2015 www.cac.cornell.edu 7

Kernel-modifying C/R Requires kernel modules or kernel patches to run. Implementations OpenVZ BLCR Berkeley Lab C/R CRIU But now in mainline! Pros Varied. Cons Required modification of the kernel. May not work for all kernels (BLCR does not past 3.7.1). 1/13/2015 www.cac.cornell.edu 8

(Multi) application C/R Checkpoint one or several interacting processes. Does not use the full container model. Implementations BLCR CRIU DMTCP Distributed MultiThreaded CheckPointing Pros Usually simple to use. Lower overhead. Cons May have surprises; applications use different advanced feature sets (e.g. IPC), and each package will have a different feature set. Test first! BLCR requires modification of application for static linking. DMTCP static linking support is experimental CRIU is a bit new. 1/13/2015 www.cac.cornell.edu 9

Custom C/R This is like ad-hoc, but when you do it even though you know other C/R solutions exist. Libraries that help (p)hdf5 NetCDF Pros Very low over-head. Few surprises if done properly. Cons Needs thorough testing for each application. Lots of development time. Less standardization. Always a chance something is missed. 1/13/2015 www.cac.cornell.edu 10

What is a good C/R solution for HPC? Requirements Must be non-invasive No kernel modifications Preferably no libraries needed on nodes. Should have low overhead. Must support distributed applications. Bonuses Easy to use. Stable for the user. It looks like DMTCP is the best candidate, for now. 1/13/2015 www.cac.cornell.edu 11

An overview of DMTCP Distributed MultiThreaded CheckPointing. Threads (OpenMP, POSIX threads), MPI. Easy to build and install library. Not necessary to link with existing dynamically linked applications. DMTCP libs replace (wrap) standard libs and syscalls. DMTCP lib directory should be in LD_LIBRARY_PATH and LD_PRELOAD (handled by DMTCP scripts). 1/13/2015 www.cac.cornell.edu 12

Counting in C #include <stdio.h> #include <unistd.h> dmtcp_checkpoint -i 5./count dmtcp_restart ckpt_count xxx.dmtcp int main(void) { unsigned long i = 0; while (1) { printf("%lu ", i); i = i + 1; sleep(1); fflush(stdout); } } 1/13/2015 www.cac.cornell.edu 13

Counting in Perl #/usr/bin/perl -w $ = 1; # autoflush STDOUT dmtcp_checkpoint -i 5 perl count.pl dmtcp_restart ckpt_perl_xxx.dmtcp $i = 0; while (true) { print "$i "; $i = $i + 1; sleep(1); } 1/13/2015 www.cac.cornell.edu 14

X11 (graphics) support All current non-vm C/R relies on VNC for X11 support. DMTCP has a known bug with checkpointing xterm. Due to dependence on VNC and general complications, only try to use if you have to. Not supported on Stampede or most other large HPC systems. 1/13/2015 www.cac.cornell.edu 15

Reversible Debugging with FReD Supports GDB and several interpreters Allows you to inspect one part of the program, then go back to a previous state without restarting the debugger. github.com/fred-dbg/fred youtu.be/1l_wgzz0jee 1/13/2015 www.cac.cornell.edu 16

DMTCP plugins Used to modify the behavior of DMTCP Modify behavior at the time of checkpoint or restart. Add wrapper functions around library functions (including syscalls). Much of DMTCP itself is now written as plugins. Custom plugins could add support for callbacks in user s program. 1/13/2015 www.cac.cornell.edu 17

Conclusions DMTCP appears to be good for most jobs for now. Also easy to install. CRIU will likely be a strong contender in the future, but it is not yet ready for HPC. BLCR and OpenVZ may be more robust than DMTCP for PID restoration (provided your kernel has support for BLCR or OpenVZ). For the foreseeable future, it is unlikely that any one C/R framework will meet everyone s needs. 1/13/2015 www.cac.cornell.edu 18

Additional Resources Source and other docs: github.com/cornell-comp-internal/cr-demos FReD: github.com/fred-dbg/fred Python and DMTCP: youtu.be/1l_wgzz0jee Comparison Chart: criu.org/comparison_to_other_cr_projects 1/13/2015 www.cac.cornell.edu 19